[00:00:23] (PS3) Alex Monk: dns-floating-ip-updater: use python's ipaddress class to determine PTR FQDNs for IPs [puppet] - https://gerrit.wikimedia.org/r/309708 [00:03:01] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [00:19:00] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:43:30] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [01:10:23] Operations, Traffic, Browser-Support-Internet-Explorer, HTTPS: Xbox 360 Internet Explorer unable to view Wikipedia - https://phabricator.wikimedia.org/T105455#2642543 (Dzahn) 15:54 < mutante> anyone own an xbox 360 here? we'd still like a confirmation if Wikipedia can be viewed from that browser... [01:10:38] Operations, Traffic, Browser-Support-Internet-Explorer, HTTPS: Xbox 360 Internet Explorer unable to view Wikipedia - https://phabricator.wikimedia.org/T105455#2642544 (Dzahn) Open>Resolved a:Dzahn [01:21:41] Operations, Jupyter-Hub: notebook1001 shown as DOWN in icinga, due to firewall rules - https://phabricator.wikimedia.org/T138685#2642557 (Dzahn) still marked as DOWN in Icinga for a couple months now. 
notebook1002 does not have this issue [01:28:12] !log mw1294 - down and frozen, powercycled [01:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:30:10] RECOVERY - Host mw1294 is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [01:43:59] PROBLEM - MegaRAID on db2017 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [01:48:04] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1814.918584 Seconds [01:48:04] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 1814.958812 Seconds [01:50:32] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 72.915862 Seconds [01:50:32] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 72.948597 Seconds [01:58:16] (PS1) Dzahn: salt: add Icinga plugin to check for unaccepted keys [puppet] - https://gerrit.wikimedia.org/r/311079 (https://phabricator.wikimedia.org/T144801) [02:04:03] (CR) Dzahn: [C: -1] "needs to detect if it fails due to lack of permissions. (Failed to create directory path...). needs to be run with sudo." [puppet] - https://gerrit.wikimedia.org/r/311079 (https://phabricator.wikimedia.org/T144801) (owner: Dzahn) [02:41:33] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 18m 25s) [02:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:11:32] PROBLEM - puppet last run on elastic2011 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:16:36] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.19) (duration: 18m 44s) [03:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:21:54] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Sep 16 03:21:54 UTC 2016 (duration 5m 18s) [03:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:38:42] RECOVERY - puppet last run on elastic2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:57:44] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.066 second response time [04:20:43] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/07-wikimania.conf] [04:20:51] Operations, Reading-Infrastructure-Team, Services, Services-next, Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2642711 (Smalyshev) If we allow to change password without knowing old password (... 
[04:45:32] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [04:49:53] (PS1) Urbanecm: Throttling rule for RCL [mediawiki-config] - https://gerrit.wikimedia.org/r/311086 (https://phabricator.wikimedia.org/T145838) [04:52:55] (PS1) Urbanecm: [throttle] Allow the same number of accounts [mediawiki-config] - https://gerrit.wikimedia.org/r/311087 [04:59:54] (PS1) Urbanecm: Throttle for RCL [mediawiki-config] - https://gerrit.wikimedia.org/r/311088 (https://phabricator.wikimedia.org/T145838) [05:00:31] (PS2) Urbanecm: Throttle for RCL [mediawiki-config] - https://gerrit.wikimedia.org/r/311088 (https://phabricator.wikimedia.org/T145838) [05:01:10] (PS2) Urbanecm: Throttling rule for RCL [mediawiki-config] - https://gerrit.wikimedia.org/r/311086 (https://phabricator.wikimedia.org/T145838) [05:21:23] PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:48:34] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:14:12] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:22:33] PROBLEM - puppet last run on mw2092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:25:26] <_joe_> uhm happening a bit often, I'd say [06:38:23] Operations, DNS, Domains, Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2642829 (Naveenpf) I think we should discuss about country portal [[ https://meta.wikimedia.org/wiki/Talk:Country_portals#Country_Portal_Policy... 
[06:38:43] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:47:12] RECOVERY - puppet last run on mw2092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:17] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0; ge-0/0/7: down - Transit: CyrusOne OOB (IP-000008-01) {#1099} [1Gbps Cu] [06:49:22] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100% [06:51:08] (PS1) Marostegui: Install timelimit package in the database servers. It is useful to limit the execution time of a script or external scripts, ie: tcpdump to capture traffic for a given time and not for a given amount of packages or mb [puppet/mariadb] - https://gerrit.wikimedia.org/r/311093 [06:56:43] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [06:58:43] (PS4) Muehlenhoff: beta: import scap_masters list from wikitech [puppet] - https://gerrit.wikimedia.org/r/310827 (owner: Hashar) [07:01:20] (CR) Muehlenhoff: [C: 2] beta: import scap_masters list from wikitech [puppet] - https://gerrit.wikimedia.org/r/310827 (owner: Hashar) [07:03:15] (CR) Muehlenhoff: [C: 1] "labs uses unattended-upgrades anyway" [puppet] - https://gerrit.wikimedia.org/r/310706 (https://phabricator.wikimedia.org/T115348) (owner: Paladox) [07:04:43] (CR) Muehlenhoff: "tzdata would still be useful to keep on latest" [puppet] - https://gerrit.wikimedia.org/r/310897 (https://phabricator.wikimedia.org/T115348) (owner: Dzahn) [07:05:42] Operations, ops-codfw, DBA: db2017 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T145844#2642835 (Marostegui) [07:06:54] (CR) Alexandros Kosiaris: "should we merge this ?" 
[puppet] - https://gerrit.wikimedia.org/r/310497 (owner: Giuseppe Lavagetto) [07:07:07] (PS2) Elukey: Add a directive to mod_proxy_html's yarn configuration [puppet] - https://gerrit.wikimedia.org/r/310863 (https://phabricator.wikimedia.org/T116192) [07:09:13] (CR) Elukey: [C: 2] Add a directive to mod_proxy_html's yarn configuration [puppet] - https://gerrit.wikimedia.org/r/310863 (https://phabricator.wikimedia.org/T116192) (owner: Elukey) [07:15:30] (PS1) Muehlenhoff: Remove access credentials for ironholds [puppet] - https://gerrit.wikimedia.org/r/311095 [07:16:31] (PS3) Giuseppe Lavagetto: exim: move templates to the role module [puppet] - https://gerrit.wikimedia.org/r/310838 [07:16:46] (CR) Giuseppe Lavagetto: [C: 2] "PCC confirms it's a noop" [puppet] - https://gerrit.wikimedia.org/r/310838 (owner: Giuseppe Lavagetto) [07:16:48] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0; ge-0/0/7: down - Transit: CyrusOne OOB (IP-000008-01) {#1099} [1Gbps Cu] [07:20:58] ACKNOWLEDGEMENT - MegaRAID on db2017 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Marostegui https://phabricator.wikimedia.org/T145844 - The acknowledgement expires at: 2016-09-22 07:20:45. [07:22:05] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:24:17] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [07:24:36] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:25:43] (PS1) Alexandros Kosiaris: puppet-merge: Run conftool-merge in only 1 frontend [puppet] - https://gerrit.wikimedia.org/r/311096 [07:26:40] (PS2) Alexandros Kosiaris: puppet-merge: Run conftool-merge in only 1 frontend [puppet] - https://gerrit.wikimedia.org/r/311096 [07:26:46] (CR) Alexandros Kosiaris: [C: 2 V: 2] puppet-merge: Run conftool-merge in only 1 frontend [puppet] - https://gerrit.wikimedia.org/r/311096 (owner: Alexandros Kosiaris) [07:29:35] PROBLEM - puppet last run on ms-be2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:36:27] !log forced logrotation with debug of /etc/logrotate.d/graphite-web on graphite1001 to find cronspam source [07:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:38:17] PROBLEM - puppet last run on prometheus2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:39:26] Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2642915 (elukey) [07:42:01] (PS2) Muehlenhoff: Remove access credentials for ironholds [puppet] - https://gerrit.wikimedia.org/r/311095 [07:44:20] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0; ge-0/0/7: down - Transit: CyrusOne OOB (IP-000008-01) {#1099} [1Gbps Cu] [07:45:29] (CR) Muehlenhoff: [C: 2] Remove access credentials for ironholds [puppet] - https://gerrit.wikimedia.org/r/311095 (owner: Muehlenhoff) [07:47:47] Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2627494 (elukey) Added both Tyler and Alex to get their thoughts about this issue [07:49:10] Operations, Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2642950 (elukey) 
[07:49:14] Operations, Deployment-Systems, MediaWiki-extensions-WikimediaMaintenance, Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2642949 (elukey) [07:49:36] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:40] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [07:53:02] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [07:54:26] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdh] [07:56:48] RECOVERY - puppet last run on ms-be2025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:04:27] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 38.15 ms [08:07:58] Operations, Labs: cronspam from labscontrol1001, labstore1001, labnet1002.eqiad.wmnet, labsdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T132422#2642953 (elukey) [08:12:38] PROBLEM - puppet last run on prometheus1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:14:40] !sal [08:14:40] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [08:15:31] (CR) Alexandros Kosiaris: [C: 1] "Plan looks ok, assumptions look ok. I would amend the commit message before merging though to move them from assumptions to actual facts." 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/310831 (https://phabricator.wikimedia.org/T144497) (owner: Elukey) [08:19:32] (PS2) Giuseppe Lavagetto: templates: move apache templates to the role module [puppet] - https://gerrit.wikimedia.org/r/310847 [08:20:47] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:24:58] (PS4) Elukey: Add the new aqs nodes to conftool-data [puppet] - https://gerrit.wikimedia.org/r/310831 (https://phabricator.wikimedia.org/T144497) [08:34:13] (PS2) Filippo Giunchedi: monitoring: validate check_prometheus args [puppet] - https://gerrit.wikimedia.org/r/310835 [08:35:37] RECOVERY - puppet last run on prometheus1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [08:37:40] (CR) Filippo Giunchedi: [C: 2] monitoring: validate check_prometheus args [puppet] - https://gerrit.wikimedia.org/r/310835 (owner: Filippo Giunchedi) [08:38:15] (CR) Volans: "No problem to install timelimit too for me and the change LGTM, although I'm curious about what it adds compared with "timeout" (from core" [puppet/mariadb] - https://gerrit.wikimedia.org/r/311093 (owner: Marostegui) [08:43:15] !log installing libidn security updates in eqiad [08:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:45:59] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:46:46] Operations, Wikimedia-Apache-configuration: Apache mod_status metrics only available in ganglia - https://phabricator.wikimedia.org/T141424#2642992 (elukey) [08:46:48] Operations, Patch-For-Review, Prometheus-metrics-monitoring: port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2642995 (elukey) [08:47:22] Operations, Traffic: Push gdnsd metrics to graphite and create a grafana dashboard - https://phabricator.wikimedia.org/T141258#2642997 (elukey) [08:47:24] Operations, Patch-For-Review, Prometheus-metrics-monitoring: port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2637292 (elukey) [08:48:28] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:51:18] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[parted-/dev/sdl] [08:52:37] !log installing tomcat8 security updates [08:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:53:22] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:54:52] (CR) Giuseppe Lavagetto: [C: 2] templates: move apache templates to the role module [puppet] - https://gerrit.wikimedia.org/r/310847 (owner: Giuseppe Lavagetto) [08:55:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1254 [08:55:14] (PS3) Giuseppe Lavagetto: templates: move apache templates to the role module [puppet] - https://gerrit.wikimedia.org/r/310847 [08:57:00] Operations: long-running root console sessions - https://phabricator.wikimedia.org/T105869#2643003 (fgiunchedi) @Dzahn yeah I still think we want to know if there are long-running root sessions left open on console, not sure would like the additional checks now though. [08:59:08] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:59:51] !log installing tomcat7 security updates [08:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:00:06] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 155304 Threads: 1 Questions: 36144653 Slow queries: 471 Opens: 1561 Flush tables: 2 Open tables: 508 Queries per second avg: 232.734 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:02:03] !log reimaging mw1252-mw1254 to jessie [09:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:03:47] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:07:15] (PS1) Filippo Giunchedi: prometheus: optionally print labs targets according to format() [puppet] - https://gerrit.wikimedia.org/r/311099 [09:08:49] PROBLEM - puppet last run on prometheus2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:09:39] <_joe_> all the prometheus failures are due to my running with puppetdb [09:09:56] (CR) Elukey: [C: 2] Add the new aqs nodes to conftool-data [puppet] - https://gerrit.wikimedia.org/r/310831 (https://phabricator.wikimedia.org/T144497) (owner: Elukey) [09:10:02] (PS5) Elukey: Add the new aqs nodes to conftool-data [puppet] - https://gerrit.wikimedia.org/r/310831 (https://phabricator.wikimedia.org/T144497) [09:14:18] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:14:43] (PS1) Giuseppe Lavagetto: prometheus: fix calls to functions from template [puppet] - https://gerrit.wikimedia.org/r/311100 [09:17:04] Operations, LDAP, Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#2547185 (akosiaris) I did manage to get a memberOf relayed from seaborgium to serpens. 
The process was: * Check that memberOf is present on seaborgium b... [09:17:09] (PS6) Elukey: Add the new aqs nodes to conftool-data [puppet] - https://gerrit.wikimedia.org/r/310831 (https://phabricator.wikimedia.org/T144497) [09:18:16] (CR) Giuseppe Lavagetto: [C: 2] prometheus: fix calls to functions from template [puppet] - https://gerrit.wikimedia.org/r/311100 (owner: Giuseppe Lavagetto) [09:18:35] (PS7) Elukey: Add the new aqs nodes to conftool-data [puppet] - https://gerrit.wikimedia.org/r/310831 (https://phabricator.wikimedia.org/T144497) [09:23:42] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:24:00] * elukey sees health checks going to aqs100[456] \o/ [09:27:48] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [09:30:06] PROBLEM - puppet last run on prometheus1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:33:36] PROBLEM - puppet last run on prometheus1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:34:18] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:40] !log reimage mw1189-90 to Jessie (trying Riccardo's script!) [09:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:35:07] RECOVERY - puppet last run on prometheus1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:35:09] elukey: lol, ping you need help [09:35:12] *if [09:36:08] RECOVERY - puppet last run on prometheus1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:39:09] ah snap issues with pwstore, the last upgrade messed up my gpg setting [09:39:26] volans: is it ok to abort at IPMI Password right? 
[09:39:38] afaics nothing has been done [09:39:56] elukey: yes sure [09:40:03] super thanks [09:40:05] grrr [09:40:08] lol [09:49:22] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:52:16] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdl] [09:53:19] volans: it seems working super fine [09:53:21] \o/ [09:53:50] will let you know when I finish, but it looks like magic [09:55:49] Operations, LDAP, Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#2643111 (MoritzMuehlenhoff) Interesting! But it that limited to rewriting existing groups? Or does it also affect new memberOf attributes being added as... [10:03:44] (CR) Muehlenhoff: [C: 1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/306399 (https://phabricator.wikimedia.org/T143671) (owner: Filippo Giunchedi) [10:04:06] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:06:36] PROBLEM - puppet last run on prometheus2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:08:15] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.039 second response time [10:15:30] (PS1) Muehlenhoff: deployment_server: Daemonise redis when running on systemd [puppet] - https://gerrit.wikimedia.org/r/311108 (https://phabricator.wikimedia.org/T144578) [10:21:48] !log renaming tables before dropping them in codfw S1,S3,S4 - T54924 [10:21:49] T54924: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924 [10:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:24:05] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:25:04] (CR) jenkins-bot: [V: -1] deployment_server: Daemonise redis when running on systemd [puppet] - https://gerrit.wikimedia.org/r/311108 (https://phabricator.wikimedia.org/T144578) (owner: Muehlenhoff) [10:27:48] (CR) Hashar: "Redis has been made to not daemonize via a7f02e4016191dde30bf649c3ee2fc250bc1527c" [puppet] - https://gerrit.wikimedia.org/r/311108 (https://phabricator.wikimedia.org/T144578) (owner: Muehlenhoff) [10:27:56] Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2643220 (Krenair) labtestweb2001 should be treated like silver here. terbium needs access to those so it can run non-essential maintenance jobs for the wikis that usually run there. 
[10:33:21] PROBLEM - Apache HTTP on mw1252 is CRITICAL: Connection refused [10:33:31] PROBLEM - Apache HTTP on mw1190 is CRITICAL: Connection timed out [10:35:34] Operations, DBA: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924#2643234 (Marostegui) Tables have been renamed in codfw S1 ``` root@neodymium:/home/marostegui/git/software/dbtools#for i in `cat s1.hosts | grep cod | cut -f 1 -d " "`; do ec... [10:35:51] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.010 second response time [10:36:26] ^ reimages, silenced [10:37:04] thanks :) [10:37:38] Operations, Labs: cronspam from labscontrol1001, labstore1001, labnet1002.eqiad.wmnet, labsdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T132422#2643240 (AlexMonk-WMF) >>! In T132422#2642951, @elukey wrote: > @AlexMonk-WMF and now we have this one :D > > ``` > keystoneclient.exceptions.Unauthor... [10:38:30] RECOVERY - Apache HTTP on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.360 second response time [10:38:50] PROBLEM - mediawiki-installation DSH group on mw1189 is CRITICAL: Host mw1189 is not in mediawiki-installation dsh group [10:39:19] this one has also just been reimaged --^ [10:40:59] elukey: mw1189 is still a trusty host, though [10:41:12] with 79 days of uptime, so something went wrong there [10:41:37] yes I was checking.. :( [10:44:22] from the logs on puppetmaster1001 it seems that it stopped after cleaning up salt keys [10:47:22] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [10:47:55] Failed to query MASTER_POS_WAIT() [10:49:29] (PS2) Muehlenhoff: deployment_server: Daemonise redis when running on systemd [puppet] - https://gerrit.wikimedia.org/r/311108 (https://phabricator.wikimedia.org/T144578) [10:51:02] Jobrunners? 
[10:52:13] PROBLEM - puppet last run on mw2092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:57:09] elukey, volans: in the case of mw1189 it could be triggered by a failing ipmitool invocation in wmf_reimage? same hardware error as in codfw? [10:57:50] moritzm: I didn't find a trace of it in the logs though [10:58:34] "host": "db1034", "lag": 32.607033967972 [11:00:45] mafk: is it from logstash? (I am trying to understand mw errors) [11:00:58] nope, API [11:01:32] the Failed to query one is from Special:PageTranslation at meta [11:04:15] marostegui: you there? :) [11:04:23] there might be something interesting to check [11:05:01] lag on db1034 has raised up to 129.17012882233 [11:05:05] moritzm: also another problem of the script could be that it blocks for all the hosts listed if one fails? [11:05:56] there is indeed an icinga warning for db1034 [11:06:10] elukey: it resumes the working ones, see https://phabricator.wikimedia.org/P4063 [11:06:28] mw1253 seems to have run into a timeout, but mw125[24] went fine [11:07:24] https://tendril.wikimedia.org/host/view/db1034.eqiad.wmnet/3306 <-- somebody with access can check in there? [11:07:47] I hope it is not my bot. It's working with categories on Meta. [11:08:08] but it's "sleeping" for many seconds [11:09:15] mafk: any chance that you could stop it now to see if we recover? [11:09:35] elukey: sure, will have to restart later though [11:09:46] bot's stopped now [11:10:11] sure sure [11:17:02] maybe it's due to being "jobs": 915 in the jobqueue? [11:17:42] RECOVERY - puppet last run on mw2092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:18:44] mafk: is there any graph that alerted you (that you can show) ? 
[11:18:52] I checked grafana and nothing came up :( [11:19:03] elukey: nope, just my bot lagging many seconds [11:19:09] but from tendril I can see that on db1034 there is an alter table [11:19:11] mmmm [11:19:21] "Sleeping for 98.0 seconds, 2016-09-16 12:55:53" [11:19:23] but not sure if this has been executed or it is inflight [11:20:00] does tendril work with LDAP credentials or do I need special authorization clearance? [11:20:51] <_joe_> the latter. What do you want to know? [11:22:20] !log stop puppetmaster on all puppetmasters, resizing /var/lib/puppet [11:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:22:31] !log restarted puppetmaster on all puppetmasters [11:22:31] elukey: was having lunch, looking now [11:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:23:32] _joe_: see if it was my bot causing such lag to run it a bit slower next time or if it was something unrelated [11:24:26] _joe_: puppetmasters are fine now with /var/lib/puppet on an LVM [11:24:31] <_joe_> akosiaris: cool [11:24:39] !log silence icinga-wm for a while [11:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:24:55] mafk: that server is also finishing an alter table, so maybe the combination of both, your bot and the alter load caused the lag [11:25:16] alter table? [11:26:16] the bot is disabled now so I don't think it influences the issue, the alter table might be more relevant no? [11:26:36] yep [11:27:00] The lag is now decreasing, it is just 20 seconds [11:27:33] This is the related task: https://phabricator.wikimedia.org/T141951 [11:28:34] It should be finished in like 20-30 minutes [11:31:12] mafk: I can ping you once it is done so we can enable the bot again and see how it goes? [11:31:24] Thanks marostegui! 
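The exchange above turns on a bot that pauses when a replica such as db1034 reports lag. A minimal Python sketch of that idea follows, assuming the lag report has already been fetched and parsed into a list of `{"host", "lag"}` entries like the one pasted in the channel. The function names, the 5-second threshold, the 120-second cap, and the second host `db-example` are all hypothetical and do not describe how the bot in the log is actually implemented.

```python
import json


def worst_replica(replicas):
    """Return (host, lag) for the replica reporting the highest lag."""
    worst = max(replicas, key=lambda entry: entry["lag"])
    return worst["host"], worst["lag"]


def backoff_seconds(lag, threshold=5.0, cap=120.0):
    """Hypothetical policy: do not wait below the threshold,
    otherwise wait for as long as the reported lag, up to a cap."""
    if lag <= threshold:
        return 0.0
    return min(lag, cap)


# Sample report: the db1034 entry mirrors the value quoted in the log,
# the second entry is invented for illustration.
report = json.loads(
    '[{"host": "db1034", "lag": 32.607033967972},'
    ' {"host": "db-example", "lag": 0.0}]'
)

host, lag = worst_replica(report)
wait = backoff_seconds(lag)
print(f"most lagged replica: {host}, lag {lag:.1f}s, sleeping {wait:.1f}s")
```

Capping the wait keeps a spike like the later 194-second reading from stalling the bot for minutes at a time, while the threshold avoids sleeping on the small lag values that replicas report during normal operation.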
[11:32:19] marostegui: sure, not sure if I'll be here in 20/30 minutes though but you can try [11:34:03] host": "db1034", "lag": 194.09153199196 <-- that's not right... [11:35:14] Seconds_Behind_Master: 60 [11:35:20] That the server itself [11:36:22] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:36:30] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [11:36:35] mafk: It is spiking, but that is expected, now it is back to 0 for instance :) [11:36:45] I have a watch there monitoring it [11:36:49] * mafk checks [11:49:11] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [11:49:12] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [11:49:22] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:49:22] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [11:49:31] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [11:49:42] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [11:49:50] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [11:49:50] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:50:04] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [11:50:04] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:50:04] RECOVERY - puppet last run on mw1296 is 
OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [11:50:11] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [11:50:12] RECOVERY - puppet last run on bohrium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:50:20] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [11:50:32] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:50:40] RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:50:40] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:50:40] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:50:40] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [11:50:50] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [11:50:51] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [11:50:51] RECOVERY - puppet last run on wtp2009 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [11:50:51] RECOVERY - puppet last run on conf2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:50:51] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [11:50:52] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [11:51:00] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 9 seconds 
ago with 0 failures [11:51:02] RECOVERY - puppet last run on mw2237 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:51:03] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:12] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:12] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:13] RECOVERY - puppet last run on mc1011 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [11:51:13] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:13] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [11:51:13] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [11:51:20] RECOVERY - puppet last run on db1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:20] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [11:51:20] RECOVERY - puppet last run on ms-be1024 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [11:51:20] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [11:51:20] RECOVERY - puppet last run on hassium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:21] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:51:21] RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [11:51:22] RECOVERY - puppet last run on db1093 is OK: OK: 
Puppet is currently enabled, last run 7 seconds ago with 0 failures [11:51:22] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:23] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [11:51:23] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:51:24] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [11:51:32] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:51:32] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:32] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:33] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:41] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:41] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:41] RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:41] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [11:51:41] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [11:51:41] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:51:41] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 
failures [11:51:42] RECOVERY - puppet last run on mw2242 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:51:42] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [11:51:46] (03PS1) 10Giuseppe Lavagetto: prometheus: use scope.function in template [puppet] - 10https://gerrit.wikimedia.org/r/311116 [11:51:52] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [11:52:01] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [11:52:01] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [11:52:02] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:52:02] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:52:11] RECOVERY - puppet last run on wdqs1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:52:13] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [11:52:14] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [11:52:20] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:52:20] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:52:20] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:52:21] RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:52:21] RECOVERY - puppet last run on db1062 is OK: 
OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:52:21] RECOVERY - puppet last run on aluminium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:52:30] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:52:31] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:52:31] RECOVERY - puppet last run on dbproxy1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:52:31] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:52:32] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:52:32] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [11:52:40] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [11:52:40] RECOVERY - puppet last run on db1081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:52:40] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:52:51] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:52:52] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:52:52] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:52:53] RECOVERY - puppet last run on conf2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:53:01] RECOVERY - puppet last run on elastic2005 is OK: OK: Puppet is currently enabled, last run 7 seconds ago 
with 0 failures [11:53:03] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [11:53:04] RECOVERY - puppet last run on es2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:53:10] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [11:53:11] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [11:53:11] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:53:11] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:53:21] RECOVERY - puppet last run on wtp1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:53:21] RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:53:22] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:53:22] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [11:53:22] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:53:23] RECOVERY - puppet last run on zosma is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:53:30] RECOVERY - puppet last run on db2050 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [11:53:30] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:53:31] RECOVERY - puppet last run on elastic2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:53:42] RECOVERY - puppet last run on mw1240 is OK: OK: 
Puppet is currently enabled, last run 1 minute ago with 0 failures [11:53:44] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:53:50] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:53:50] RECOVERY - puppet last run on mw1270 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:53:51] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:54:01] RECOVERY - puppet last run on maps-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:54:03] RECOVERY - puppet last run on elastic2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:54:11] RECOVERY - puppet last run on dbproxy1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:54:12] RECOVERY - puppet last run on pc2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:54:21] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:54:22] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [11:54:39] (03CR) 10Giuseppe Lavagetto: [C: 031] puppet_compiler: clean older output dirs [puppet] - 10https://gerrit.wikimedia.org/r/306399 (https://phabricator.wikimedia.org/T143671) (owner: 10Filippo Giunchedi) [11:58:08] (03CR) 10Giuseppe Lavagetto: [C: 031] deployment_server: Daemonise redis when running on systemd [puppet] - 10https://gerrit.wikimedia.org/r/311108 (https://phabricator.wikimedia.org/T144578) (owner: 10Muehlenhoff) [11:58:32] (03CR) 10Giuseppe Lavagetto: [C: 032] prometheus: use scope.function in template [puppet] - 10https://gerrit.wikimedia.org/r/311116 (owner: 10Giuseppe Lavagetto) [12:06:41] PROBLEM - puppet 
last run on prometheus2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:44] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2643422 (10MoritzMuehlenhoff) This can't be easily tested with libvpx from backports; versions post 1.3 have changed the soname (i.e. the library package is now called libvpx4, so ffmpeg wou... [12:15:27] (03PS1) 10Giuseppe Lavagetto: prometheus: ruby has Array.push, not append [puppet] - 10https://gerrit.wikimedia.org/r/311117 [12:15:36] !log installing python-imaging security updates on precise [12:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:16:24] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] prometheus: ruby has Array.push, not append [puppet] - 10https://gerrit.wikimedia.org/r/311117 (owner: 10Giuseppe Lavagetto) [12:17:14] (03PS1) 10Hashar: Inline doc for $wgMaxShell* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311118 (https://phabricator.wikimedia.org/T145819) [12:19:44] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [12:20:41] !log installing security updates for mysql 5.5 (one off systems running mysql as packaged by Ubuntu/Debian and not running wmf-mariadb10) [12:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:24:47] !log rolling restart of codfw elasticsearch cluster completed - T145404 [12:24:48] T145404: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404 [12:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:25:04] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404#2643460 (10Gehel) [12:25:28] Note: I'll wait until 
Tuesday for the rolling restart of eqiad [12:27:24] elukey, hey [12:27:35] o/ [12:27:36] re https://phabricator.wikimedia.org/T132422 labtestcontrol2001.wikimedia.org [12:27:56] thanks a lot [12:28:03] I'm not 100% sure what that mysqladmin error is or if I should really be touching those files right now [12:28:46] As it appears mysql on that box is not properly puppetised: https://phabricator.wikimedia.org/T145679 [12:29:33] nono I added you to get some info, probably the labs team needs to check that? [12:31:23] I have the permissions necessary to look into it, just want to minimise any local changes right now [12:32:54] sure sure [12:33:19] bear in mind that the cronspam ticket is not super urgent, I am just trying to ping people once in a while :) [12:37:45] !log mw1190 back serving traffic after the reimage [12:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:02] !log Going to rollback all Wikis back to 1.28.0-wmf.18 . Despite much investigation, a bunch of jobs are broken due to T145819 which includes Special:CreateAccount :( [12:40:03] T145819: Wikidata at 1.28.0-wmf.19 no more replicate to wikis (replag raise / dispatch stop) - https://phabricator.wikimedia.org/T145819 [12:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:18] yeah, I know [12:41:43] (03PS1) 10Ema: varnish: add varnish-be restart script [puppet] - 10https://gerrit.wikimedia.org/r/311119 [12:42:17] PROBLEM - salt-minion processes on mw1189 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:44:26] (03PS1) 10Hashar: All wikis back to 1.28.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311120 (https://phabricator.wikimedia.org/T145819) [12:44:50] elukey: the old salt key on mw1189 wasn't removed, fixing that [12:46:00] (03PS2) 10Ema: varnish: add varnish-be restart script [puppet] - 10https://gerrit.wikimedia.org/r/311119 [12:46:32] (03CR) 
10Addshore: [C: 031] All wikis back to 1.28.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311120 (https://phabricator.wikimedia.org/T145819) (owner: 10Hashar) [12:47:27] RECOVERY - salt-minion processes on mw1189 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:47:53] (03CR) 10Hashar: [C: 032] All wikis back to 1.28.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311120 (https://phabricator.wikimedia.org/T145819) (owner: 10Hashar) [12:48:22] (03Merged) 10jenkins-bot: All wikis back to 1.28.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311120 (https://phabricator.wikimedia.org/T145819) (owner: 10Hashar) [12:48:32] moritzm: thanks, just reimaging it! [12:48:54] maybe it would be better to silence it [12:49:19] done [12:50:09] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: All wikis back to 1.28.0-wmf.18 :( T145819 [12:50:10] T145819: Wikidata at 1.28.0-wmf.19 no more replicate to wikis (replag raise / dispatch stop) - https://phabricator.wikimedia.org/T145819 [12:50:10] running scap to rollback mw version to .18 [12:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:53:55] ok [13:08:15] PROBLEM - MariaDB Slave Lag: s7 on db1034 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 448.14 seconds [13:08:52] ah, it finally alerted [13:09:01] yep :( [13:09:28] akosiaris: these are my thoughts [13:10:00] db1034 is listed in the config file the same as db1062, so maybe we can depool it, let it recover and finish the alter and then pool it again? 
[13:10:57] sounds ok, lemme doublecheck [13:11:02] Thanks [13:12:12] ACKNOWLEDGEMENT - MariaDB Slave Lag: s7 on db1034 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 478.67 seconds Manuel Arostegui there is an alter going on which is not being throttled too well [13:13:52] akosiaris: if that looks good to you, if you could do the config change, that would be helpful, as I have only done one so far and I am sure it will take me a while to figure the whole workflow again. I can do the +1 though :) [13:14:03] marostegui: ok [13:14:10] thanks [13:15:44] marostegui: btw, generally mediawiki is capable of depooling servers on its own, for example if they lag too much [13:15:59] (03PS1) 10Alexandros Kosiaris: Depool db1034 temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311121 [13:16:04] marostegui: ^ [13:16:16] (03CR) 10Marostegui: [C: 031] Depool db1034 temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311121 (owner: 10Alexandros Kosiaris) [13:16:28] akosiaris: but yet it generate errors? :( [13:16:43] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Depool db1034 temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311121 (owner: 10Alexandros Kosiaris) [13:17:41] akosiaris: The lag just went to 0 \o/ [13:18:13] And the alter is advencing a lot faster now [13:18:16] advancing [13:18:35] RECOVERY - MariaDB Slave Lag: s7 on db1034 is OK: OK slave_sql_lag Replication lag: 7.00 seconds [13:19:23] !log akosiaris@tin Synchronized wmf-config/db-eqiad.php: (no message) (duration: 00m 48s) [13:19:28] marostegui: ^ [13:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:19:32] so wikis are back to wmf.18 [13:19:34] so... let's wait it out now [13:19:41] yah [13:19:51] I think there was a potential hack to workaround the ongoing issue. 
But I did not feel adventurous on a Friday [13:19:53] hashar: yes [13:19:56] so playing it safe and rolled back [13:20:03] agreed [13:20:08] the root cause being runJobs invoking mwscript with a filelimit of 512MBytes [13:20:26] which dies when HHVM attempts to write to its hhvm cache file which is more than 512MBytes [13:20:37] the hack would be to disable the file limit for that specific code path [13:20:51] the sleepless me solution is rollback [13:20:56] gonna write the incident report [13:27:30] 06Operations, 07LDAP, 13Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#2643554 (10akosiaris) >>! In T142817#2643111, @MoritzMuehlenhoff wrote: > Interesting! But it that limited to rewriting existing groups? Or does it also af... [13:27:43] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [13:27:54] !sal [13:27:54] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [13:29:36] (03PS1) 10Alexandros Kosiaris: Revert "Depool db1034 temporarily" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311123 [13:30:03] marostegui: I've uploaded it already ^ for when we want to repool db1034 [13:30:23] akosiaris: thank you [13:31:20] akosiaris: the alter finished [13:31:26] akosiaris: so we can do it now [13:31:37] (03CR) 10Marostegui: [C: 031] Revert "Depool db1034 temporarily" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311123 (owner: 10Alexandros Kosiaris) [13:33:15] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 7 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2643564 (10Anomie) 05Open>03Resolved Investigation complete, and a fix applied. 
[13:34:27] marostegui, akosiaris: just FYI and clarification, db1034 was with weight 1 in the normal traffic but was NOT in any special role, see the groupLoadsBySection below. [13:34:35] 06Operations, 07LDAP, 13Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#2643581 (10MoritzMuehlenhoff) Ok, I'll tweak my earlier script and we can convert the core groups next week. And let's take this feature with a grain of sa... [13:35:06] !log gallium: removing MySQL which is no more defined in puppet and running puppet. Did: apt-get remove mysql-common mysql-server mysql-server-core-5.5 [13:35:14] moritzm: mysql is gone from gallium [13:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:36:15] (03PS1) 10Alexandros Kosiaris: Revert "puppetmaster: throw away reports" [puppet] - 10https://gerrit.wikimedia.org/r/311124 [13:36:57] hashar: ok, thanks [13:37:13] volans: thanks for the clarification [13:37:56] so looks like it was getting almost no traffic... is that true? 
[13:38:08] didn't have time to check dashboards yet [13:39:22] volans: Not much, it was getting around 100 connections stuck waiting for replication [13:39:52] ok [13:40:50] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Depool db1034 temporarily" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311123 (owner: 10Alexandros Kosiaris) [13:41:50] !log akosiaris@tin Synchronized wmf-config/db-eqiad.php: (no message) (duration: 00m 46s) [13:41:54] marostegui: ^ [13:41:55] done [13:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:42:10] akosiaris: thanks :-) [13:43:09] for coherence please also fix the comment ;) [13:49:15] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "puppetmaster: throw away reports" [puppet] - 10https://gerrit.wikimedia.org/r/311124 (owner: 10Alexandros Kosiaris) [13:49:24] (03PS3) 10Ema: varnish: add varnish-be restart script [puppet] - 10https://gerrit.wikimedia.org/r/311119 [13:49:29] (03CR) 10Ema: [C: 032 V: 032] varnish: add varnish-be restart script [puppet] - 10https://gerrit.wikimedia.org/r/311119 (owner: 10Ema) [13:54:09] PROBLEM - puppet last run on prometheus2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:55:04] (03PS2) 10Filippo Giunchedi: puppet_compiler: clean older output dirs [puppet] - 10https://gerrit.wikimedia.org/r/306399 (https://phabricator.wikimedia.org/T143671) [13:56:45] !log mw1189 back serving traffic after reimage [13:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:17] (03CR) 10Filippo Giunchedi: [C: 032] puppet_compiler: clean older output dirs [puppet] - 10https://gerrit.wikimedia.org/r/306399 (https://phabricator.wikimedia.org/T143671) (owner: 10Filippo Giunchedi) [13:59:03] (03PS5) 10Filippo Giunchedi: prometheus: add varnish_exporter [puppet] - 10https://gerrit.wikimedia.org/r/310557 (https://phabricator.wikimedia.org/T145659) [14:03:10] (03PS6) 10Filippo Giunchedi: prometheus: add varnish_exporter [puppet] - 10https://gerrit.wikimedia.org/r/310557 (https://phabricator.wikimedia.org/T145659) [14:05:01] (03PS1) 10Giuseppe Lavagetto: puppetmaster: re-enable puppetmaster2002 [puppet] - 10https://gerrit.wikimedia.org/r/311129 [14:06:28] (03PS3) 10Ottomata: Finish adding --until param to check_graphite script [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) [14:06:49] (03CR) 10Ottomata: "Thanks Filippo. I'd like to merge this on Monday." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [14:08:27] (03PS1) 10Muehlenhoff: Update to 4.4.21 [debs/linux44] - 10https://gerrit.wikimedia.org/r/311130 [14:15:51] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2643675 (10Marostegui) @jcrespo mentioned he wasn't trusting the server so much so I have been running different stress tests, cpu (sys, user), mem, iowait etc for the whole day to introduce some overload situations while... 
[14:19:20] RECOVERY - puppet last run on prometheus2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:24:00] (03PS1) 10Alexandros Kosiaris: puppet.conf: Remove reports = statsd from agent [puppet] - 10https://gerrit.wikimedia.org/r/311133 [14:28:22] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.21 [debs/linux44] - 10https://gerrit.wikimedia.org/r/311130 (owner: 10Muehlenhoff) [14:28:36] (03CR) 10Muehlenhoff: "Cherrypicked on the deployment-prep puppetmaster, fixes the redis start on mira02." [puppet] - 10https://gerrit.wikimedia.org/r/311108 (https://phabricator.wikimedia.org/T144578) (owner: 10Muehlenhoff) [14:30:15] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 7 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2643748 (10mobrovac) Thank you, @jcrespo and @Anomie for the detailed investigation and fix. @Anomie, would you be ok with me SWATting the fix onto wmf.19 on mo... [14:31:37] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2643749 (10MoritzMuehlenhoff) A few jessie-related changes have been sorted out, mira02.deployment-prep.eqiad.wmflabs should be ready for testing. [14:33:20] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:34:21] (03CR) 10Marostegui: "Thanks Ricciardo. 
I didn't know about timeout, but it looks pretty similar to timelimit, so probably that is enough for the majority of th" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/311093 (owner: 10Marostegui) [14:44:13] (03PS1) 10Alexandros Kosiaris: puppetmaster: Send reports to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/311136 [14:44:49] RECOVERY - mediawiki-installation DSH group on mw1189 is OK: OK [14:46:55] !log disabling shard allocation check on relforge to test shard allocation issues [14:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:06] (03PS2) 10Alexandros Kosiaris: puppet.conf: Remove reports = statsd from agent [puppet] - 10https://gerrit.wikimedia.org/r/311133 [14:47:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppet.conf: Remove reports = statsd from agent [puppet] - 10https://gerrit.wikimedia.org/r/311133 (owner: 10Alexandros Kosiaris) [14:47:23] (03PS2) 10Alexandros Kosiaris: puppetmaster: Send reports to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/311136 [14:47:27] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Send reports to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/311136 (owner: 10Alexandros Kosiaris) [14:48:57] (03PS2) 10Giuseppe Lavagetto: puppetmaster: re-enable puppetmaster2002 [puppet] - 10https://gerrit.wikimedia.org/r/311129 [14:50:06] <_joe_> akosiaris: I'll stop the hammering of puppetmaster2001 so that we can apply both changes [14:50:09] <_joe_> ok? 
[14:50:20] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: re-enable puppetmaster2002 [puppet] - 10https://gerrit.wikimedia.org/r/311129 (owner: 10Giuseppe Lavagetto) [14:50:35] ok [14:52:43] (03PS1) 10Muehlenhoff: zookeeper: Retrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/311138 [14:53:00] <_joe_> running puppet there [14:53:08] <_joe_> then I'll re-disable it [14:53:25] <_joe_> oh, I just realized your change won't work [14:53:43] <_joe_> we send the report requests to the puppetmaster1001 anyways [14:55:22] (03CR) 10Alexandros Kosiaris: [C: 031] zookeeper: Retrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/311138 (owner: 10Muehlenhoff) [14:55:43] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2643780 (10GWicke) @Smalyshev, my understanding (which is also documented in [the d... [14:56:29] hmmm [14:56:33] yeah... [14:56:44] (03CR) 10Filippo Giunchedi: "LGTM, modulo unrelated change to mariadb submodule" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [14:57:05] https://docs.puppet.com/puppet/latest/reference/configuration.html#reportserver [14:57:14] either I change it on the puppetmaster or on the agent [14:57:31] I think I am gonna go with the puppetmaster for now... how bad can it be ? [14:57:36] * akosiaris famous last words [14:58:43] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:00:02] akosiaris: jinxing it on a friday?! 
[15:00:48] !log gallium: dpkg --purge php5-mysql (mysql got removed) [15:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:40] (03PS1) 10Elukey: Add the deployment_key class variable to service::node [puppet] - 10https://gerrit.wikimedia.org/r/311139 [15:01:41] hehe [15:01:51] <_joe_> ugh [15:02:04] <_joe_> another parameter to that define? [15:03:28] I think I am just gonna hack around it for like 30 mins [15:03:32] and then call it a day [15:03:38] !log disabling puppet on puppetmaster1001 [15:03:44] (03PS2) 10Elukey: Add the deployment_key class variable to service::node [puppet] - 10https://gerrit.wikimedia.org/r/311139 [15:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:59] <_joe_> akosiaris: puppetmaster1001 doesn't talk with puppetdb [15:04:09] ProxyPassMatch ^/([^/]+/report/.*)$ https://puppetmaster1001.eqiad.wmnet:8141 [15:04:13] <_joe_> yes [15:04:17] guess what that's gonna be for like 30 mins :P [15:04:25] <_joe_> ahah ok [15:04:30] <_joe_> puppetmaster2001? [15:04:32] <_joe_> jeez [15:07:57] ACKNOWLEDGEMENT - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdl] Filippo Giunchedi disks renumbered, sdk failed https://phabricator.wikimedia.org/T83811 [15:09:04] 2016-09-16 15:08:54,674 INFO [c.p.p.command] [dfb56178-668e-4536-8bab-b8b8e29b276e] [store report] puppet v3.7.2 - wtp1018.eqiad.wmnet [15:09:06] :-) [15:09:13] so... let's see what we get out of that [15:11:19] (03CR) 10Ottomata: "+1" [puppet] - 10https://gerrit.wikimedia.org/r/311139 (owner: 10Elukey) [15:15:39] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:15:43] (03PS1) 10Ema: WIP: varnish: restart upload backends once a day [puppet] - 10https://gerrit.wikimedia.org/r/311142 [15:16:48] <_joe_> akosiaris: I'm going to restart my mass compilation thing [15:16:59] ok [15:17:24] I am gonna revert my local hack... got enough data already [15:17:57] (03CR) 10Elukey: [C: 032] Add the deployment_key class variable to service::node [puppet] - 10https://gerrit.wikimedia.org/r/311139 (owner: 10Elukey) [15:18:25] (03CR) 10Elukey: [C: 04-1] "Wrong button, waiting for Marko." [puppet] - 10https://gerrit.wikimedia.org/r/311139 (owner: 10Elukey) [15:18:35] !log enable puppet on puppetmaster1001 again [15:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:11] (03CR) 10Giuseppe Lavagetto: [C: 031] Add the deployment_key class variable to service::node [puppet] - 10https://gerrit.wikimedia.org/r/311139 (owner: 10Elukey) [15:19:36] thanks! [15:22:01] (03CR) 10Hashar: "Thank you. 
That is very helpful :]" [puppet] - 10https://gerrit.wikimedia.org/r/310717 (https://phabricator.wikimedia.org/T127797) (owner: 10Dzahn) [15:22:04] (03PS2) 10Ema: WIP: varnish: restart upload backends once a day [puppet] - 10https://gerrit.wikimedia.org/r/311142 [15:23:18] 06Operations, 10Continuous-Integration-Infrastructure, 10puppet-compiler, 13Patch-For-Review: OSError: [Errno 28] No space left on device on compiler02.puppet3-diffs.eqiad.wmflabs - https://phabricator.wikimedia.org/T143671#2643813 (10hashar) a:03fgiunchedi Looks like @fgiunchedi solved it :) [15:26:00] PROBLEM - Disk space on thumbor1002 is CRITICAL: DISK CRITICAL - free space: / 1711 MB (3% inode=97%) [15:27:01] heh, spotted the warning on icinga, I'll take a look [15:28:31] RECOVERY - Disk space on thumbor1002 is OK: DISK OK [15:34:46] 06Operations, 06Performance-Team, 10Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2643838 (10fgiunchedi) [15:36:17] (03PS3) 10Ema: varnish: restart upload backends once a day [puppet] - 10https://gerrit.wikimedia.org/r/311142 [15:38:19] (03PS2) 10Jhobs: Initiate Hovercards A/B test on ruwiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310483 (https://phabricator.wikimedia.org/T136746) [15:40:50] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:41:45] (03CR) 10Ottomata: "Doesn't this restrict cross-dc traffic? eqiad Kafka mirror maker needs to talk to codfw zookeeper, and vice versa." 
[puppet] - 10https://gerrit.wikimedia.org/r/311138 (owner: 10Muehlenhoff) [15:49:27] (03CR) 10Jhobs: Initiate Hovercards A/B test on ruwiki and itwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310483 (https://phabricator.wikimedia.org/T136746) (owner: 10Jhobs) [15:49:28] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2643875 (10Anomie) >>! In T140813#2643780, @GWicke wrote: > @Smalyshev, my understa... [15:51:11] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:43] (03CR) 10Alexandros Kosiaris: "no, a DOMAIN in this case is one of "production", "labs", "frack", "sandbox". The domain is obviously determined in the context of the pup" [puppet] - 10https://gerrit.wikimedia.org/r/311138 (owner: 10Muehlenhoff) [16:02:36] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:05:08] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:05:19] that looks like a race... [16:13:59] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:15:10] (03PS4) 10Dzahn: partman: delete some more unused recipes [puppet] - 10https://gerrit.wikimedia.org/r/306501 [16:15:17] (03CR) 10Dzahn: [C: 032] partman: delete some more unused recipes [puppet] - 10https://gerrit.wikimedia.org/r/306501 (owner: 10Dzahn) [16:16:53] !log puppet developers are you reading this? just checking... 
[16:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:19] (03PS4) 10Ottomata: Finish adding --until param to check_graphite script [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) [16:24:48] (03CR) 10Ottomata: [C: 031] "Cool, +1 then." [puppet] - 10https://gerrit.wikimedia.org/r/311138 (owner: 10Muehlenhoff) [16:24:56] @seen codezee [16:24:56] mutante: Last time I saw codezee they were quitting the network with reason: Quit: Leaving N/A at 9/15/2016 6:52:38 PM (21h32m17s ago) [16:25:51] @seen paladox [16:25:51] paladox: are you really looking for yourself? [16:26:02] LOL [16:27:56] @seen paladox [16:27:56] Platonides: I have never seen paladox [16:28:04] LOL [16:28:08] LOL² [16:28:29] LOL£ [16:28:31] LOL3 [16:28:57] http://bots.wmflabs.org/dump/%23wikimedia-operations.htm [16:28:57] @info [16:33:50] wm-bot: tell paladox about thx [16:34:11] Oh LOL [16:36:32] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:38:00] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:38:30] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:39:01] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:44:04] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Josve05a: Specific revisions of multiple files triggers 404 (not found) - https://phabricator.wikimedia.org/T124101#1945898 (10Platonides) Image https://commons.wikimedia.org/wiki/File:Cactaceae_(1082183341).jpg missing (there's only one revi... 
[16:46:10] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#2643956 (10Platonides) [16:46:30] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:46:35] (03PS1) 10EBernhardson: Update ebernhardson ssh key [puppet] - 10https://gerrit.wikimedia.org/r/311149 [16:47:51] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:53:30] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdl] [16:53:40] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/swift/object.ring.gz] [16:54:21] (03PS3) 10Dzahn: ldap: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310706 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [16:56:30] (03CR) 10Dzahn: [C: 032] ldap: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310706 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [16:58:51] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:58:59] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:59:06] (03PS1) 10Dzahn: ldap::client::utils: don't ensure => latest packages [puppet] - 10https://gerrit.wikimedia.org/r/311151 (https://phabricator.wikimedia.org/T115348) [16:59:24] (03PS3) 10Dzahn: mediawiki_singlenode: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310703 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [16:59:29] Thanks ^^ [17:01:35] 06Operations, 10Monitoring, 10RESTBase-Cassandra: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#2643992 (10GWicke) @dzahn, if we could configure that warning to only be sent to the services contact list, then that would be great. [17:03:06] !log deploy changeprop to apply gerrit 311153 config change [17:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:58] (03CR) 10Dzahn: [C: 032] "labs uses unattended upgrades, prod uses debdeploy" [puppet] - 10https://gerrit.wikimedia.org/r/311151 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [17:05:12] (03PS2) 10Dzahn: ldap::client::utils: don't ensure => latest packages [puppet] - 10https://gerrit.wikimedia.org/r/311151 (https://phabricator.wikimedia.org/T115348) [17:08:24] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:09:10] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:12:17] 06Operations, 10Monitoring, 10RESTBase-Cassandra: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#2644031 (10Dzahn) @Gwicke That should already be the case. 
I see this: 50 contact_group => 'team-services', (monitoring::graphite_threshold { 'restbase_h... [17:12:39] PROBLEM - puppet last run on potassium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:13:23] 06Operations, 10Monitoring, 10RESTBase-Cassandra: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#2644032 (10Dzahn) @Gwicke Ah, but you probably didn't get a notification because the contact is configured to only get CRITs and recoveries but not mere WARNings. [17:13:28] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:15:11] RECOVERY - puppet last run on potassium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:16:32] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:18:13] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:18:54] 06Operations: Decomission mw2061-mw2074 - https://phabricator.wikimedia.org/T144745#2644056 (10Papaul) 05Open>03Resolved DNS mgmt entries removed Decommission complete closing this task. [17:19:10] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:19:29] (03PS1) 10Andrew Bogott: Puppet Panel: Add captions to instance tabs [puppet] - 10https://gerrit.wikimedia.org/r/311160 (https://phabricator.wikimedia.org/T91990) [17:21:31] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:21:45] (03CR) 10Andrew Bogott: [C: 032] Puppet Panel: Add captions to instance tabs [puppet] - 10https://gerrit.wikimedia.org/r/311160 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [17:27:20] 06Operations, 10Monitoring, 10RESTBase-Cassandra: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#2644083 (10Dzahn) @Gwicke I changed the settings of the "team-services" contact itself, in the private repo. from: service_notification_options c,r,f to: se... [17:28:57] (03PS4) 10Dzahn: mediawiki_singlenode: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310703 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [17:29:14] (03CR) 10Dzahn: [C: 032] "just like other packages in this same manifest" [puppet] - 10https://gerrit.wikimedia.org/r/310703 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [17:29:24] ^^ thanks :) [17:29:36] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:31:43] 06Operations, 10Monitoring, 10RESTBase-Cassandra: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#2644113 (10GWicke) @Dzahn: Awesome, thanks! This should help to keep us in the loop on these, and ought to also make sure that we fix broken alerts. I vaguely reme... 
[17:32:45] (03PS2) 10Dzahn: Update list of mailman site languages [puppet] - 10https://gerrit.wikimedia.org/r/310746 (https://phabricator.wikimedia.org/T144933) (owner: 10Muehlenhoff) [17:35:00] 06Operations, 10Monitoring, 10RESTBase-Cassandra: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#2644119 (10Dzahn) Yep, the IRC bot doesn't show warnings on the channel, but the notification options are per contact, so we can change this for just the team-servi... [17:36:03] (03PS1) 10Hashar: contint: drop now unused sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/311161 (https://phabricator.wikimedia.org/T51846) [17:40:45] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644146 (10Smalyshev) > a password recovery token sent to the configured email addr... [17:41:42] (03PS1) 10Yuvipanda: labs: Add a per-project puppetmaster role [puppet] - 10https://gerrit.wikimedia.org/r/311163 [17:42:46] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644148 (10aaron) I like the idea of using the 300 second system for doing "elevate... 
[17:42:51] (03PS1) 10Andrew Bogott: Puppet Panel: Remove use of breadcrumb_nav [puppet] - 10https://gerrit.wikimedia.org/r/311164 (https://phabricator.wikimedia.org/T91990) [17:44:21] (03PS2) 10Dzahn: Update ebernhardson ssh key [puppet] - 10https://gerrit.wikimedia.org/r/311149 (owner: 10EBernhardson) [17:44:40] (03CR) 10Andrew Bogott: [C: 032] Puppet Panel: Remove use of breadcrumb_nav [puppet] - 10https://gerrit.wikimedia.org/r/311164 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [17:45:46] (03CR) 10Dzahn: [C: 032] "verified identity via hangout" [puppet] - 10https://gerrit.wikimedia.org/r/311149 (owner: 10EBernhardson) [17:45:54] arr [17:45:55] (03PS3) 10Dzahn: Update ebernhardson ssh key [puppet] - 10https://gerrit.wikimedia.org/r/311149 (owner: 10EBernhardson) [17:46:08] (03CR) 10Dzahn: [V: 032] Update ebernhardson ssh key [puppet] - 10https://gerrit.wikimedia.org/r/311149 (owner: 10EBernhardson) [17:46:38] (03PS2) 10Yuvipanda: labs: Add a per-project puppetmaster role [puppet] - 10https://gerrit.wikimedia.org/r/311163 [17:47:52] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644154 (10Smalyshev) > Password change now requires a successful authentication in... 
[17:49:55] (03PS2) 10Dzahn: contint: drop now unused sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/311161 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [17:50:36] (03PS3) 10Yuvipanda: labs: Add a per-project puppetmaster role [puppet] - 10https://gerrit.wikimedia.org/r/311163 [17:51:33] mutante: I dont know how to manually clean the sudo rule of https://gerrit.wikimedia.org/r/311161 :( [17:52:55] (03CR) 10Dzahn: [C: 032] contint: drop now unused sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/311161 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [17:52:56] <_joe_> yuvipanda: your patch looks great! [17:53:05] (03PS4) 10Yuvipanda: labs: Add a per-project puppetmaster role [puppet] - 10https://gerrit.wikimedia.org/r/311163 [17:53:11] hasharAway: let me take a look [17:53:25] <_joe_> on the middle-to-long run, though, we should really focus on a way to get environment to work for individual projects [17:53:38] <_joe_> from the labs puppetmaster [17:53:56] <_joe_> that would mostly eliminate the need for self-hosted puppetmasters [17:54:16] joe: :D would it allow people to test changes without merging them into ops/puppet? 
[17:54:19] * yuvipanda isn't sure how environments work [17:54:39] <_joe_> yuvipanda: basically the idea would be something like (bear with me for a sec) [17:54:56] my name has 'panda' in it, I can bear for a long time [17:55:21] !log gallium rm /etc/sudoers.d/jenkins-slave (to go with gerrit 311161) [17:55:27] <_joe_> - every project gets an environment, that comes from the a specific project / a branch on gerrit [17:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:55:33] mutante: awesome :) [17:55:38] hasharAway: done :) [17:55:42] <_joe_> where people in the project get +2 [17:56:03] <_joe_> that can override specific classes that are in the main repo [17:56:47] <_joe_> people can then just push to their environment the modified class they want to test [17:56:58] <_joe_> do the needful, then revert the patch there [17:57:23] <_joe_> the big advantage is you'll have one central puppetmaster cluster under labops control [17:57:39] <_joe_> no risk of having years-old self-hosted puppetmasters [17:58:01] interesting. so the 'environment' is kind of overlaid? the base is the production puppet branch, and people can then 'override'? [17:58:06] <_joe_> yes [17:58:27] that sounds much nicer than 'cherry pick and hope' [17:58:31] <_joe_> ofc you can't just post the same patch, as you need to commit the whole modified class [17:58:47] <_joe_> it's a bit more of work but it would make for a much cleaner arch [17:58:56] oh I see [17:59:00] <_joe_> the missing bit is the project => gerrit repo link [17:59:09] so I'll need to copy the files somehow that I'm touching into the env [17:59:10] and modify? 
[17:59:30] <_joe_> just copy over the module you modified, actually [17:59:40] (03CR) 10Dzahn: [C: 032] Update list of mailman site languages [puppet] - 10https://gerrit.wikimedia.org/r/310746 (https://phabricator.wikimedia.org/T144933) (owner: 10Muehlenhoff) [17:59:43] (03PS3) 10Dzahn: Update list of mailman site languages [puppet] - 10https://gerrit.wikimedia.org/r/310746 (https://phabricator.wikimedia.org/T144933) (owner: 10Muehlenhoff) [18:00:05] <_joe_> so you prepare a patch in ops/puppet, then copy over the modules you modified to your environment [18:00:25] <_joe_> hell, we could also make the environment dir just writable on NFS ;) [18:00:51] <_joe_> but I can see that going south very fast [18:01:43] <_joe_> but at least it would be easy for you guys to control what's happening [18:02:15] <_joe_> and it would be "harder" to maintain a large number of patches on top of production (*cough* beta *cough*) [18:03:06] yeah probably [18:07:24] well most beta cluster patches are integration work for prod nowadays [18:07:41] the only long-term hack we maintain is the secrets/passwords in the private repo [18:08:03] one of the issues with envs is having to fork [18:08:14] and then eventually deal with merging the changes back :( [18:11:51] !log fermium - re-enabled puppet (after merging gerrit 310746) [18:11:58] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644235 (10Anomie) >>! In T140813#2644154, @Smalyshev wrote: > Wait, but this doesn... 
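[note] The per-project environment idea discussed between 17:53 and 18:08 relies on an ordered module lookup: Puppet searches the directories of its modulepath in order and uses the first copy of a module it finds, so a project environment that holds only the modified modules would shadow the production copies and fall through to production for everything else. A minimal Python sketch of that lookup order follows. The directory names (`project_env/modules`, `production/modules`) and the helper names are hypothetical, chosen for illustration only, and nothing here reflects the actual labs puppetmaster layout.

```python
import os
import tempfile


def resolve_module(name, modulepath):
    """Return the first directory in modulepath that contains module `name`.

    Illustrates "first match wins" lookup over an ordered modulepath.
    Returns None when no directory provides the module.
    """
    for base in modulepath:
        candidate = os.path.join(base, name)
        if os.path.isdir(candidate):
            return candidate
    return None


def demo():
    """Build a throwaway tree with a hypothetical layout and resolve modules."""
    root = tempfile.mkdtemp()
    project = os.path.join(root, "project_env", "modules")
    production = os.path.join(root, "production", "modules")

    # Production carries every module; the project environment carries
    # only the one module somebody copied over to test a change.
    os.makedirs(os.path.join(production, "zookeeper"))
    os.makedirs(os.path.join(production, "mailman"))
    os.makedirs(os.path.join(project, "mailman"))

    # The project environment is listed first so that it overrides production.
    modulepath = [project, production]
    names = ("zookeeper", "mailman", "missing")
    resolved = {name: resolve_module(name, modulepath) for name in names}
    return resolved, project, production


if __name__ == "__main__":
    resolved, _, _ = demo()
    for name, path in resolved.items():
        print(name, "->", path)
```

In this sketch the override is per module, which matches the remark that the whole modified module has to be copied into the environment rather than a single patch.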
[18:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:08] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644236 (10Smalyshev) > If you lost your password, you do the "email me a temporary... [18:16:45] 06Operations, 13Patch-For-Review: puppet run stopping qrunner on fermium - https://phabricator.wikimedia.org/T144933#2644238 (10Dzahn) I merged that and re-enabled puppet on fermium. Notice: /Stage[main]/Mailman::Listserve/Debconf::Set[mailman/default_server_language]/Exec[debconf-communicate set mailman/def... [18:17:46] 06Operations: puppet run stopping qrunner on fermium - https://phabricator.wikimedia.org/T144933#2644244 (10Dzahn) [18:21:47] 06Operations: puppet run stopping qrunner on fermium - https://phabricator.wikimedia.org/T144933#2644257 (10Dzahn) 05Open>03Resolved [18:21:56] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644259 (10GWicke) > If it's the frontend code, which we assume might be vulnerabl... [18:25:31] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644269 (10Smalyshev) Also, thinking further about it, if I had RCE vulnerability,... [18:39:54] (03PS1) 10Dereckson: Update Alphos' blog URL for fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/311171 [18:42:55] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/sbin/keyholder] [18:48:31] (03CR) 10Dzahn: [C: 032] Update Alphos' blog URL for fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/311171 (owner: 10Dereckson) [18:49:47] All 100 conversations on this page are selected. Select all 289,831 conversations in "Monitor-Cron" [18:53:54] ouch [18:53:59] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Make elasticsearch actually uses shard allocation awareness - https://phabricator.wikimedia.org/T143571#2644325 (10debt) 05Open>03Resolved Thanks, @Gehel ! [18:54:51] mutante: what I thought of is to send the cron mails to root+@wikimedia.org [18:55:19] so you could easily sort / delete / identify based on the equivalent puppet definition: cron { 'name here': } [18:55:27] which would send to cron+name_here@wikimedia.org [18:55:27] hashar: I can't select them. Oops… the system encountered a problem (#502) - Retrying in 1s… [18:55:29] something like that [18:55:32] :( [18:56:16] hashar: that's a nice idea, including the resource name [19:07:56] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [19:09:04] anomie: https://gerrit.wikimedia.org/r/#/c/311172/ [19:09:55] AaronSchulz: There's no need to worry that the submitting wiki has a different value for that configuration setting than the target wiki? [19:10:20] that global is always assumed to be the same on the farm [19:10:31] Ok, I see it's documented that way already. [19:14:38] anomie: https://gerrit.wikimedia.org/r/#/c/311168/ for master [19:14:56] * AaronSchulz will deploy the backport [19:22:47] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644354 (10GWicke) In this first iteration the email address would indeed still be... 
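[note] The cron mail suggestion from 18:54 to 18:56 amounts to plus-addressing: derive a tag from the Puppet cron resource title and append it to a fixed local part, so mail can be sorted or deleted per job. A minimal Python sketch under stated assumptions: the function name `cron_mail_address` is hypothetical, and the sanitizing rule (every run of characters outside letters, digits, dot, underscore and hyphen becomes one underscore) is an assumption, since the log only shows the space in 'name here' turning into an underscore.

```python
import re


def cron_mail_address(resource_title, local="cron", domain="wikimedia.org"):
    """Build a plus-addressed recipient from a cron resource title.

    Assumption: any run of characters other than letters, digits,
    '.', '_' or '-' is collapsed into a single underscore.
    """
    tag = re.sub(r"[^A-Za-z0-9._-]+", "_", resource_title.strip())
    return f"{local}+{tag}@{domain}"


if __name__ == "__main__":
    # The example title quoted in the channel.
    print(cron_mail_address("name here"))
```

How the address would reach the job (for example through a MAILTO line in the crontab environment) is left open here, as the channel did not settle on a mechanism.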
[19:26:15] !log aaron@tin Synchronized php-1.28.0-wmf.19/includes/jobqueue/JobQueueGroup.php: 01254b0a72a8619117ecad103427e2431e89cc52 (duration: 00m 47s) [19:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:47] 06Operations: puppet run stopping qrunner on fermium - https://phabricator.wikimedia.org/T144933#2644366 (10Dzahn) 05Resolved>03Open ehhh... uhmm.. a little while later Icinga tells me: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args '/mailman/bin/qrunner' PROCS CRITICAL: 0 processes with UID... [19:30:59] !log fermium starting mailman qrunner (T144933) [19:31:00] T144933: puppet run stopping qrunner on fermium - https://phabricator.wikimedia.org/T144933 [19:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:31] !log fermium disabled puppet again [19:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:34:10] 06Operations: puppet run stopping qrunner on fermium - https://phabricator.wikimedia.org/T144933#2644372 (10Dzahn) >>! In T144933#2644368, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.org/sal/log/AVc0ewKWaH8PnNb4D94N} [2016-09-16T19:30:59Z] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM, 05MW-1.28-release-notes, 13Patch-For-Review: HHVM: segfault when serializing/unserializing large preprocessor cache items - https://phabricator.wikimedia.org/T73486#761345 (10hashar) I used a reproduction case from T135483 and it is apparently all fix... [19:46:57] PROBLEM - puppet last run on db2070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:48:38] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:00:34] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644416 (10Tgr) In general if we want to be safe against SQL injection the minimum... [20:03:15] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644422 (10GWicke) @Tgr, the first iteration aims at protecting password hashes and... [20:05:24] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 637 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4627260 keys - replication_delay is 637 [20:13:26] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:08] RECOVERY - puppet last run on db2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:22:52] (03PS2) 10Reedy: 2 more to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308577 (https://phabricator.wikimedia.org/T139800) [20:24:50] (03PS3) 10Reedy: Load CentralNotice via wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304126 (https://phabricator.wikimedia.org/T140852) [20:25:52] Reedy: thx! Sorry for letting that one slide ^ ... 
[20:26:18] (03CR) 10Reedy: [C: 032] 2 more to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308577 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [20:26:28] AndyRussG: Haha, it's not a problem [20:26:37] I've been away a month, so just seeing what I can get pushed though [20:26:51] (03Merged) 10jenkins-bot: 2 more to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308577 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [20:28:12] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644463 (10Tgr) For protecting passwords, we would have to consider the following s... [20:28:23] !log reedy@tin Synchronized wmf-config/extension-list: Couple more to extension.json (duration: 00m 47s) [20:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:30:17] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4595093 keys - replication_delay is 0 [20:30:33] (03PS2) 10Reedy: Only require_once JsonConfig.php in one place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304617 [20:31:00] Reedy: ah K thx... 
:) [20:31:23] (03CR) 10Reedy: [C: 032] Only require_once JsonConfig.php in one place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304617 (owner: 10Reedy) [20:31:49] (03Merged) 10jenkins-bot: Only require_once JsonConfig.php in one place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304617 (owner: 10Reedy) [20:32:12] 06Operations, 10Beta-Cluster-Infrastructure, 05Prometheus-metrics-monitoring: deploy prometheus node_exporter and server to deployment-prep - https://phabricator.wikimedia.org/T144502#2601885 (10hashar) That works [[ https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=839239&oldid=839... [20:32:16] (03CR) 10Reedy: [C: 032] Load CentralNotice via wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304126 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [20:32:38] (03PS4) 10Reedy: Load CentralNotice via wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304126 (https://phabricator.wikimedia.org/T140852) [20:32:44] (03CR) 10Reedy: Load CentralNotice via wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304126 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [20:32:48] (03CR) 10Reedy: [C: 032] Load CentralNotice via wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304126 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [20:33:15] (03Merged) 10jenkins-bot: Load CentralNotice via wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304126 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [20:34:40] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644504 (10Tgr) >>! In T140813#2644463, @Tgr wrote: > I would strongly prefer going... 
[20:34:43] !log reedy@tin Synchronized wmf-config/: Load CN via extension registration. Only load jsonconfig once (duration: 00m 56s) [20:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:09] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644509 (10Tgr) As for checkuser data, how would that work? If we want to prevent a... [20:46:06] (03PS1) 10Andrew Bogott: Puppet Panel: Cache the project panel tab [puppet] - 10https://gerrit.wikimedia.org/r/311183 [20:48:01] (03CR) 10Andrew Bogott: [C: 032] Puppet Panel: Cache the project panel tab [puppet] - 10https://gerrit.wikimedia.org/r/311183 (owner: 10Andrew Bogott) [20:56:45] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:09] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2602033 (10CRoslof) I'm not sure I understand this task's specific request. Is it that: # wikipedia.in be changed to redirect to an IP address r... [21:05:14] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644582 (10GWicke) Several comments on this task seem to imply that an arbitrary RC... [21:06:47] PROBLEM - puppet last run on db2066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:48] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf] [21:21:35] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [21:21:46] (03PS5) 10Aaron Schulz: Set some database logging groups to log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310159 [21:21:50] (03CR) 10Aaron Schulz: [C: 032] Set some database logging groups to log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310159 (owner: 10Aaron Schulz) [21:22:18] (03Merged) 10jenkins-bot: Set some database logging groups to log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310159 (owner: 10Aaron Schulz) [21:23:40] !log aaron@tin Synchronized wmf-config/InitialiseSettings.php: Set some database logging groups to log (duration: 00m 47s) [21:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:25:27] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644606 (10Smalyshev) I assume RCE means by definition running any code, that does... [21:27:09] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644607 (10Tgr) >>! In T140813#2644582, @GWicke wrote: > Several comments on this t... 
[21:31:39] RECOVERY - puppet last run on db2066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:40:24] 06Operations, 10Monitoring, 06Release-Engineering-Team, 13Patch-For-Review: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2644626 (10Tgr) [21:43:08] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:46:38] (03PS1) 10Paladox: archiva: Fix it not being a autoload module [puppet] - 10https://gerrit.wikimedia.org/r/311194 [21:57:33] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644657 (10GWicke) The discussion here has moved on beyond protecting against mass... [21:58:28] (03PS2) 10Paladox: archiva: Fix it not being a autoload module [puppet] - 10https://gerrit.wikimedia.org/r/311194 [22:09:37] 06Operations, 10Domains, 10Traffic, 06WMF-Legal: register .wiki gTLD domains - https://phabricator.wikimedia.org/T88873#2644690 (10Dzahn) [22:13:01] (03CR) 10Paladox: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/191109 (https://phabricator.wikimedia.org/T88873) (owner: 10Dzahn) [22:19:21] (03PS1) 10Bmansurov: Blacklist minerva from showing Related Articles in the footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311197 (https://phabricator.wikimedia.org/T144912) [22:21:58] 06Operations: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2644765 (10Paladox) [22:23:19] 06Operations: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2644754 (10Paladox) [22:24:33] (03CR) 10Dzahn: "this would work around the issue that it can't just be init.pp, yep, it's just a bit "ugly"" [puppet] - 
10https://gerrit.wikimedia.org/r/311194 (owner: 10Paladox) [22:25:03] (03CR) 10Paladox: "yep" [puppet] - 10https://gerrit.wikimedia.org/r/311194 (owner: 10Paladox) [22:25:25] (03CR) 10Paladox: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/191104 (https://phabricator.wikimedia.org/T88873) (owner: 10Dzahn) [22:25:34] (03CR) 10Dzahn: "fwiw, the context of this whole problem is https://phabricator.wikimedia.org/T119042" [puppet] - 10https://gerrit.wikimedia.org/r/311194 (owner: 10Paladox) [22:26:19] (03PS3) 10Paladox: archiva: Fix it not being a autoload module [puppet] - 10https://gerrit.wikimedia.org/r/311194 (https://phabricator.wikimedia.org/T119042) [22:30:11] 06Operations, 10Domains, 10Traffic, 06WMF-Legal: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2644808 (10Paladox) [22:36:04] (03CR) 10Dereckson: [C: 04-1] "Should use a generic group name of technical administrator (but not editinterface) per https://phabricator.wikimedia.org/T139246. We can c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308448 (https://phabricator.wikimedia.org/T144599) (owner: 10MarcoAurelio) [22:37:50] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644825 (10Tgr) The point I am trying to make is that we are about to implement som... [22:39:25] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/00-dummy.conf] [22:50:42] 06Operations, 10Domains, 10Traffic, 06WMF-Legal: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2644754 (10Platonides) The .org TLD also suits us perfecty. 
I don't think either of those is a strong enough reason for changing the dom... [22:52:56] 06Operations, 10Domains, 10Traffic, 06WMF-Legal: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2644886 (10Dzahn) also see T88873#1691739 and the comments on those abandoned patches https://gerrit.wikimedia.org/r/#/c/191104/ htt... [22:53:20] 06Operations, 10Domains, 10Traffic, 06WMF-Legal: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2644887 (10Paladox) @Platonides but Wikipedia and wikitionary are wiki's since they use mediawiki, which is a wiki softwhere. But It sui... [22:53:58] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:54:56] paladox: ^ there, i ran puppet, false positive [22:55:09] mutante thanks :) [22:56:25] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2644898 (10GWicke) In terms of its impact, leaking all password hashes or sessions... [22:57:38] 06Operations, 10Domains, 10Traffic, 06WMF-Legal: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2644754 (10greg) @Paladox you're just repeating yourself :) (~"use them because we're a wiki") [23:09:06] PROBLEM - puppet last run on mw2124 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:09:09] !log titanium - shutdown -h now [23:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:25] ^that was another precise :) [23:09:36] what was it doing/hosting? 
[23:09:38] one more week before we wipe it [23:09:39] archiva [23:09:42] * greg-g nods [23:12:43] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2014555 (10Dzahn) titanium shut down, archiva.wm.org runs on meitnerium, count: 11 [23:33:37] RECOVERY - puppet last run on mw2124 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:35:06] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:41:03] 06Operations, 10Domains, 10Traffic, 06WMF-Legal: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2645042 (10Dzahn) Note that WMF already owns all the .wiki domains and there is a long history that comes with that.