[00:03:49] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.176 second response time [00:29:49] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.220 second response time [00:36:19] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:47:19] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [00:47:39] PROBLEM - Misc HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [00:53:43] Operations, Phabricator, Release-Engineering-Team, Technical-Debt: Replace deprecated phabricator conduit api calls in phab_epipe.py file - https://phabricator.wikimedia.org/T159043#3056743 (Paladox) [00:56:19] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [00:57:39] RECOVERY - Misc HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:58:19] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:04:19] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [01:06:19] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:11:24] !ops dongs lol poop shit fuck [01:35:19] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [02:10:09] RECOVERY - check_recurring_gc_failures_missed on db1025 is OK: OK [02:18:29] PROBLEM - Host barium is DOWN: PING CRITICAL - Packet loss = 100% [02:19:49] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.13) (duration: 07m 23s) [02:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:09] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=827 [critical =325] [02:21:01] Jeff_Green: you about? i cannot connect to serial on barium [02:22:12] ACKNOWLEDGEMENT - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=827 [critical =325] Jeff_Green known, will investigate [02:22:39] robh: yeah [02:22:49] so we just rebooted it for a kernel update, it's not coming back [02:22:59] ha, ignore my text i just sent you like 2 minutes ago [02:23:01] stuck at "Initializing firmware interfaces...." [02:23:03] hah ok [02:23:08] the second i couldnt login i texted. [02:23:17] frack in eqiad i assumed important ;] [02:23:26] yeah it's the civicrm webserver [02:23:42] if someone else is already on its serial, that explains me not being able to connect to its serial =] [02:23:54] right [02:23:56] i shall resume making dinner. =] [02:24:06] cool. 
thanks for checking [02:24:36] quite welcome [02:25:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Feb 27 02:25:10 UTC 2017 (duration 5m 21s) [02:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:09] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1805.470078 Seconds [02:49:09] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 28.669926 Seconds [02:50:31] (PS1) Tim Landscheidt: Tools: Unpuppetize sql [puppet] - https://gerrit.wikimedia.org/r/340058 [02:52:19] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [02:52:39] PROBLEM - Misc HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [03:02:19] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [03:03:27] (PS1) Tim Landscheidt: Tools: Fix test for enabled PHP module mcrypt [puppet] - https://gerrit.wikimedia.org/r/340059 (https://phabricator.wikimedia.org/T159022) [03:07:39] RECOVERY - Misc HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:08:19] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:11:02] (PS2) Tim Landscheidt: Tools: Fix test for enabled PHP module mcrypt [puppet] - https://gerrit.wikimedia.org/r/340059 (https://phabricator.wikimedia.org/T159022) [03:16:59] RECOVERY - Host barium is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [03:17:09] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:09] PROBLEM - Check size of conntrack table on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:17:09] PROBLEM - Check systemd state on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:59] RECOVERY - Check size of conntrack table on bast3001 is OK: OK: nf_conntrack is 0 % full [03:17:59] RECOVERY - Check systemd state on bast3001 is OK: OK - running: The system is fully operational [03:19:39] PROBLEM - SSH on bast3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:20:29] RECOVERY - SSH on bast3001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [03:22:59] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 36 minutes ago with 0 failures [03:27:55] (CR) Tim Landscheidt: "I tested this temporarily as a cherry-pick on tools-puppetmaster-02 and it worked fine." [puppet] - https://gerrit.wikimedia.org/r/340059 (https://phabricator.wikimedia.org/T159022) (owner: Tim Landscheidt) [03:29:29] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [03:29:29] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [03:29:29] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [03:29:39] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [03:29:39] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [03:29:39] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [03:29:39] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [03:29:39] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [03:29:49] (PS5) Tim Landscheidt: labstore: Use explicit groups for file resources [puppet] - https://gerrit.wikimedia.org/r/324729 (https://phabricator.wikimedia.org/T152095) [03:29:50] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 44 ESP OK [03:29:50] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 44 ESP OK [03:29:50] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 44 ESP OK [03:29:59] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 44 ESP OK [03:29:59] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 44 ESP OK 
[03:29:59] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 44 ESP OK [03:29:59] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 44 ESP OK [03:29:59] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 44 ESP OK [03:29:59] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 44 ESP OK [03:30:09] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 44 ESP OK [03:30:09] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 44 ESP OK [03:30:09] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 44 ESP OK [03:30:09] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 44 ESP OK [03:30:09] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 44 ESP OK [03:32:52] Operations, ops-eqiad, Traffic: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891#3056828 (Cmjohnson) @ema the sfp has been replaced I see a link light now. LMK if that fixes the problem. [03:32:54] (PS3) Tim Landscheidt: Tools: Undo obsolete /var/mail customization [puppet] - https://gerrit.wikimedia.org/r/326306 [03:38:49] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.148 second response time [03:46:59] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:08:49] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.141 second response time [04:14:09] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [04:17:19] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [04:18:39] PROBLEM - Misc HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [04:30:39] RECOVERY - Misc HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:31:19] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:52:19] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0] [04:54:19] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [04:55:39] PROBLEM - Misc HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [05:00:39] PROBLEM - Misc HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [05:10:09] (PS1) Tim Landscheidt: puppet: Remove requirement on python3-ldap3 [puppet] - https://gerrit.wikimedia.org/r/340064 [05:12:29] PROBLEM - puppet last run on es1013 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:13:39] RECOVERY - Misc HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:14:19] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:40:29] RECOVERY - puppet last run on es1013 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:58:19] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:09:11] (PS1) Marostegui: db-codfw.php: Repool db2070 [mediawiki-config] - https://gerrit.wikimedia.org/r/340078 (https://phabricator.wikimedia.org/T132416) [07:13:04] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2070 [mediawiki-config] - https://gerrit.wikimedia.org/r/340078 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui) [07:14:52] (Merged) jenkins-bot: db-codfw.php: Repool db2070 [mediawiki-config] - https://gerrit.wikimedia.org/r/340078 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui) [07:15:53] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2070 - T132416 (duration: 00m 40s) [07:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:59] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [07:16:54] (CR) jenkins-bot: db-codfw.php: Repool db2070 [mediawiki-config] - https://gerrit.wikimedia.org/r/340078 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui) [07:17:24] (PS1) Marostegui: db-codfw.php: Depool db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/340080 (https://phabricator.wikimedia.org/T132416) [07:19:27] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/340080 (https://phabricator.wikimedia.org/T132416) (owner: 
Marostegui) [07:20:27] (Merged) jenkins-bot: db-codfw.php: Depool db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/340080 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui) [07:20:36] (CR) jenkins-bot: db-codfw.php: Depool db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/340080 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui) [07:21:23] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2034 - T132416 (duration: 00m 40s) [07:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:28] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [07:22:29] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:26:29] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [07:27:57] (PS2) Giuseppe Lavagetto: conftool-data: convert to new format [puppet] - https://gerrit.wikimedia.org/r/339672 [07:28:33] (CR) Giuseppe Lavagetto: [V: 2 C: 2] conftool-data: convert to new format [puppet] - https://gerrit.wikimedia.org/r/339672 (owner: Giuseppe Lavagetto) [07:29:11] !log Deploy alter table enwiki.revision - db2034 - T132416 [07:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:17] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [07:31:39] (PS3) Muehlenhoff: Remove absense check from data_test.yaml [puppet] - https://gerrit.wikimedia.org/r/339606 [07:50:01] (CR) Muehlenhoff: [C: 2] Remove absense check from data_test.yaml [puppet] - https://gerrit.wikimedia.org/r/339606 (owner: Muehlenhoff) [07:50:29] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 1 second ago 
with 0 failures [07:55:16] (PS1) Marostegui: check_private_data_report: Ignore wmf_checksums [puppet] - https://gerrit.wikimedia.org/r/340082 [07:56:19] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 11307 bytes in 0.001 second response time [07:56:20] (CR) Marostegui: [C: 2] check_private_data_report: Ignore wmf_checksums [puppet] - https://gerrit.wikimedia.org/r/340082 (owner: Marostegui) [07:56:22] (PS1) Giuseppe Lavagetto: role::configcluster: replicate all of eqiad's etcd cluster to codfw [puppet] - https://gerrit.wikimedia.org/r/340083 [07:56:49] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 11307 bytes in 0.001 second response time [07:59:41] !log Run pt-table-checksum on s2 (nlwiki) on revision table - T154485 [07:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:46] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485 [08:09:30] (PS2) Giuseppe Lavagetto: role::configcluster: replicate all of eqiad's etcd cluster to codfw [puppet] - https://gerrit.wikimedia.org/r/340083 [08:09:32] (CR) Giuseppe Lavagetto: [C: 2] role::configcluster: replicate all of eqiad's etcd cluster to codfw [puppet] - https://gerrit.wikimedia.org/r/340083 (owner: Giuseppe Lavagetto) [08:09:56] (CR) Giuseppe Lavagetto: [V: 2 C: 2] role::configcluster: replicate all of eqiad's etcd cluster to codfw [puppet] - https://gerrit.wikimedia.org/r/340083 (owner: Giuseppe Lavagetto) [08:16:19] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time [08:19:19] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [08:19:48] (PS1) Muehlenhoff: Remove access credentials for hjiang [puppet] - https://gerrit.wikimedia.org/r/340086 [08:22:19] RECOVERY - Etcd 
replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time [08:24:01] (CR) Giuseppe Lavagetto: [C: 2] Version bump [software/conftool] - https://gerrit.wikimedia.org/r/339195 (owner: Giuseppe Lavagetto) [08:26:34] (CR) Muehlenhoff: [C: 1] shell access for Shreyas Lakhtakia [puppet] - https://gerrit.wikimedia.org/r/339686 (owner: RobH) [08:26:40] (CR) Muehlenhoff: [C: 2] Remove access credentials for hjiang [puppet] - https://gerrit.wikimedia.org/r/340086 (owner: Muehlenhoff) [08:41:05] Operations, Discovery, Traffic, Wikidata, and 2 others: compile number of http uses for http://www.wikidata.org/entity - https://phabricator.wikimedia.org/T154017#3057021 (Esc3300) @Lydia_Pintscher can you arrange for this to be done? [08:42:01] <_joe_> !log promote conftool 0.4.0 to jessie-wikimedia main [08:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:47] <_joe_> !log upload conftool 0.4.0 to trusty-wikimedia [08:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:15] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 149 bytes in 0.073 second response time [08:51:15] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time [08:56:48] (PS3) Muehlenhoff: Fix absent check for users which formerly only had LDAP access [puppet] - https://gerrit.wikimedia.org/r/339461 [08:58:05] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:00:12] (PS1) Jcrespo: mariadb: Pool db1026 with full weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340087 (https://phabricator.wikimedia.org/T147747) [09:04:45] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [09:05:15] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [09:14:35] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:16:12] (CR) Jcrespo: [C: 2] mariadb: Pool db1026 with full weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340087 (https://phabricator.wikimedia.org/T147747) (owner: Jcrespo) [09:17:29] (Merged) jenkins-bot: mariadb: Pool db1026 with full weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340087 (https://phabricator.wikimedia.org/T147747) (owner: Jcrespo) [09:17:37] (CR) jenkins-bot: mariadb: Pool db1026 with full weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340087 (https://phabricator.wikimedia.org/T147747) (owner: Jcrespo) [09:19:44] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1026 after maintenance with full weight (duration: 00m 39s) [09:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:35] PROBLEM - puppet last run on ms-fe1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:27:05] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:34:09] Operations, MediaWiki-extensions-InterwikiSorting, Wikidata, Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3057222 (Lydia_Pintscher) I believe it is a trivial change that doesn't but if @Lea_Lacroix_WMDE wants... [09:34:35] (PS2) Muehlenhoff: Update email address for Ellery Wulczyn [puppet] - https://gerrit.wikimedia.org/r/339431 [09:35:12] (PS1) Jcrespo: mariadb: Depool db1045 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340091 (https://phabricator.wikimedia.org/T147747) [09:38:34] (CR) Muehlenhoff: [C: 2] Update email address for Ellery Wulczyn [puppet] - https://gerrit.wikimedia.org/r/339431 (owner: Muehlenhoff) [09:42:35] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:43:42] Operations, Discovery, Traffic, Wikidata, and 2 others: compile number of http uses for http://www.wikidata.org/entity - https://phabricator.wikimedia.org/T154017#3057250 (Lydia_Pintscher) I currently don't have the time, sorry. [09:48:08] (PS1) Muehlenhoff: Update to 4.4.51 [debs/linux44] - https://gerrit.wikimedia.org/r/340094 [09:49:35] PROBLEM - puppet last run on es1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:50:35] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:51:35] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [09:53:02] (CR) Jcrespo: [C: 2] mariadb: Depool db1045 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340091 (https://phabricator.wikimedia.org/T147747) (owner: Jcrespo) [09:54:37] (Merged) jenkins-bot: mariadb: Depool db1045 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340091 (https://phabricator.wikimedia.org/T147747) (owner: Jcrespo) [09:54:56] (CR) jenkins-bot: mariadb: Depool db1045 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340091 (https://phabricator.wikimedia.org/T147747) (owner: Jcrespo) [09:59:13] (PS1) Phuedx: wme: Set ReadingDepth sampling rate to 0.1% [mediawiki-config] - https://gerrit.wikimedia.org/r/340095 (https://phabricator.wikimedia.org/T155639) [10:01:52] Operations: Integrate jessie 8.6 point release - https://phabricator.wikimedia.org/T146011#3057289 (MoritzMuehlenhoff) Open>Resolved That's fully deployed. [10:02:35] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:03:46] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1045 for maintenance (duration: 00m 43s) [10:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:35] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:05:54] hashar: what's the procedure for out of hours swat deploys? [10:06:23] (CR) Muehlenhoff: [C: 2] Update to 4.4.51 [debs/linux44] - https://gerrit.wikimedia.org/r/340094 (owner: Muehlenhoff) [10:07:06] phuedx: no idea ? 
:} [10:07:12] hrrm [10:07:25] i can probably wait for this afternoon's swat [10:07:36] well I would [10:07:45] check with operations that nothing is going on right now [10:07:51] double check the patch is fine [10:08:05] ideally try it out on beta cluster [10:08:15] and then just do it with mwdebug1001 first to triple check that it is working [10:08:17] then sync all :} [10:08:43] phuedx: so in short: just do it :} [10:10:07] (PS6) Elukey: Set maximum JVM heap size for Zookeeper [puppet] - https://gerrit.wikimedia.org/r/337797 (https://phabricator.wikimedia.org/T157968) [10:10:23] i don't think the event logging pipeline is falling over, i'm just concerned about the rate of events for a particular schema [10:10:53] Operations, Discovery, Wikimedia-Portals, Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3057319 (Gehel) As I understand the situation (helped by enabling rewrite logging on deployment-prep9: `https://www... [10:11:46] <_joe_> !log upgrading conftool to 0.4.0 across the cluster T149617 [10:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:51] T149617: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617 [10:12:02] phuedx: well I guess push it. 
I can do some basic review if you want :} [10:12:27] Operations, Performance-Team, Thumbor: Make Thumbor IM engine based on a subprocess - https://phabricator.wikimedia.org/T149903#3057322 (Gilles) [10:12:36] hashar: <3 https://gerrit.wikimedia.org/r/#/c/340095/ [10:17:05] (CR) Elukey: "Change looks good: https://puppet-compiler.wmflabs.org/5587/conf1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/337797 (https://phabricator.wikimedia.org/T157968) (owner: Elukey) [10:18:35] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [10:18:35] RECOVERY - puppet last run on es1016 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:19:15] phuedx: so disable that for all wikis but 0.001 for wikipedia wikis right? [10:20:01] (PS2) Hashar: wme: Set ReadingDepth sampling rate to 0.1% [mediawiki-config] - https://gerrit.wikimedia.org/r/340095 (https://phabricator.wikimedia.org/T155639) (owner: Phuedx) [10:20:21] hashar: yup, by default it's disabled for non-wikipedia wikis anyway [10:20:27] but it pays to be explicit in the config [10:20:39] make sense [10:22:30] (CR) Hashar: [C: 2] wme: Set ReadingDepth sampling rate to 0.1% [mediawiki-config] - https://gerrit.wikimedia.org/r/340095 (https://phabricator.wikimedia.org/T155639) (owner: Phuedx) [10:24:06] (Merged) jenkins-bot: wme: Set ReadingDepth sampling rate to 0.1% [mediawiki-config] - https://gerrit.wikimedia.org/r/340095 (https://phabricator.wikimedia.org/T155639) (owner: Phuedx) [10:24:25] (CR) jenkins-bot: wme: Set ReadingDepth sampling rate to 0.1% [mediawiki-config] - https://gerrit.wikimedia.org/r/340095 (https://phabricator.wikimedia.org/T155639) (owner: Phuedx) [10:24:33] phuedx: it is on mwdebug1001 if that is at all testable [10:26:11] hashar: how do you do this so fast? 
[10:26:57] got ton of XP and a "Swat +3 sword" [10:27:38] hashar: confirm that i'm seeing the lowered sampling rate on mwdebug1001 and 0.005 on other machines [10:27:48] basically: CR+2, on tin: git remote update && git log HEAD..HEAD@{u} . If happy rebase [10:27:54] then ssh mwdebug1001.eqiad.wmnet scap pull [10:27:56] done [10:30:23] (PS1) Volans: SSH keys: add new key for myself [puppet] - https://gerrit.wikimedia.org/r/340098 [10:31:47] !log limiting the Zookeeper Maximum heap size to 1G (https://gerrit.wikimedia.org/r/#/c/337797/) - setting applied gradually to Zookeeper on Druid and Conf* hosts [10:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:55] (CR) Elukey: [C: 2] Set maximum JVM heap size for Zookeeper [puppet] - https://gerrit.wikimedia.org/r/337797 (https://phabricator.wikimedia.org/T157968) (owner: Elukey) [10:32:27] phuedx: let me know when I can sync it on the whole cluster [10:32:49] hashar: confirmed [10:32:52] hashar: phuedx> hashar: confirm that i'm seeing the lowered sampling rate on mwdebug1001 and 0.005 on other machines [10:34:00] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: wme: Set ReadingDepth sampling rate to 0.1% - T155639 (duration: 00m 40s) [10:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:05] T155639: Create reading depth schema - https://phabricator.wikimedia.org/T155639 [10:34:21] phuedx: done !!! [10:34:37] moritzm: ok to merge your last puppet change? [10:34:42] hashar: <3 <3 <3 [10:35:29] well it is an email change, I guess I can proceed safely :) [10:40:22] elukey: yes, sorry, I forgot to press "y" [10:41:48] (PS1) Giuseppe Lavagetto: conftool-data: purge the old directories [puppet] - https://gerrit.wikimedia.org/r/340099 [10:42:35] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:45:54] hashar: just seen the latest data point come in on the graph of readingdepth events and i'm seeing a rate of < 1000 per minute [10:45:55] thanks! [10:47:27] (CR) Giuseppe Lavagetto: [C: 2] conftool-data: purge the old directories [puppet] - https://gerrit.wikimedia.org/r/340099 (owner: Giuseppe Lavagetto) [10:49:53] !log uploaded apache2 2.4.10-10+deb8u8+wmf1 to apt.wikimedia.org (rebase of local patches on top on latest DSA) [10:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:35] PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:58:41] (PS4) Muehlenhoff: Fix absent check for users which formerly only had LDAP access [puppet] - https://gerrit.wikimedia.org/r/339461 [11:01:17] (CR) Volans: "I'm wondering if we should have the initial data already filled in, not leaving it to manual insertion afterwards, that is error prone." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/339674 (https://phabricator.wikimedia.org/T149617) (owner: Giuseppe Lavagetto) [11:01:35] (CR) Muehlenhoff: [C: 2] Fix absent check for users which formerly only had LDAP access [puppet] - https://gerrit.wikimedia.org/r/339461 (owner: Muehlenhoff) [11:03:05] PROBLEM - puppet last run on mw2122 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:03:45] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:03:55] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:03:56] PROBLEM - puppet last run on mw2210 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:03:57] PROBLEM - puppet last run on mw2205 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:03:57] PROBLEM - puppet last run on mw2175 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:05] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:05] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:05] PROBLEM - puppet last run on mw2169 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:05] PROBLEM - puppet last run on mw2124 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:35] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:35] PROBLEM - puppet last run on mw1186 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:45] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:45] !log rebooting cp1052 into kernel 4.4.2-3+wmf8 T148891 [11:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:50] T148891: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891 [11:04:55] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:55] PROBLEM - puppet last run on mw2215 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:56] PROBLEM - puppet last run on mw2239 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:04:56] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:56] PROBLEM - puppet last run on mw2180 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:56] PROBLEM - puppet last run on mw2213 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:56] PROBLEM - puppet last run on mw2115 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:05] PROBLEM - puppet last run on mw2108 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:06] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:35] PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:35] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:35] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:35] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:35] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:35] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:35] PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:05:45] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:06:05] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:06:05] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:06:05] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:06:15] PROBLEM - puppet last run on mw2230 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:06:15] PROBLEM - puppet last run on mw2128 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:06:36] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:06:36] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:06:45] moritzm: _joe_ ^^^ [11:06:45] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:06:52] hmm... someone working on this? [11:06:52] site_nodes is not a hash or array when accessing it with appserver at /etc/puppet/modules/role/manifests/mediawiki/webserver.pp:30 [11:06:55] PROBLEM - puppet last run on mw2245 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:07:01] (03PS1) 10Muehlenhoff: Remove Aaron from deployment group [puppet] - 10https://gerrit.wikimedia.org/r/340101 [11:07:05] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:07:05] PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:07:15] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:07:15] PROBLEM - puppet last run on mw2097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:07:35] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:07:35] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:07:38] * volans looking what broke it [11:07:45] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:07:55] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:08:45] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:08:54] * volans silencing icinga-wm [11:08:55] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:08:55] PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:08:55] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:08:56] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:08:56] PROBLEM - puppet last run on mw2224 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:09:05] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:09:05] PROBLEM - puppet last run on mw2178 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:09:16] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:09:35] <_joe_> thanks volans [11:09:38] <_joe_> that's my fault [11:09:48] !log temporarily stopped ircecho (icinga-wm) [11:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:00] ok, in a meeting, ping if you need help [11:10:22] jynus: thanks, but no need [11:11:46] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::webserver: adapt to new conftool-data structure [puppet] - 10https://gerrit.wikimedia.org/r/340102 [11:11:59] <_joe_> volans: ^^ [11:12:05] checking [11:12:47] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::mediawiki::webserver: adapt to new conftool-data structure [puppet] - 10https://gerrit.wikimedia.org/r/340102 (owner: 10Giuseppe Lavagetto) [11:12:56] looks good [11:13:05] (03CR) 10Volans: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/340102 (owner: 10Giuseppe Lavagetto) [11:13:18] I was checking with the new data structure :) [11:13:48] let' check if icinga-wm agrees [11:13:51] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:13:56] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:13:56] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:13:56] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:14:07] too early :D [11:14:21] <_joe_> yeah, wait [11:14:43] I wanted to see some RECOVERY and then stop it again waiting for them to recover [11:14:49] (03PS3) 10Muehlenhoff: Add consistency check for nda and wmf LDAP groups based on data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/339656 (https://phabricator.wikimedia.org/T142836) [11:16:37] _joe_: FYI etcdmirror-conftool-eqiad-wmnet.service failed on conf2002 [11:16:48] <_joe_> volans: ack I know [11:18:26] <_joe_> it should be "solved" soon [11:18:31] (03CR) 10Volans: [C: 031] "LGTM. As a first version seems to have all the data we need" [puppet] - 10https://gerrit.wikimedia.org/r/339673 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto) [11:19:15] !log zookeeper status report - new changes rolled out to druid nodes and conf2001 - conf1* and conf200[23] still pending, waiting for more metrics before proceeding [11:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:35] _joe_: are you running puppet to resolve it or just waiting? 
in 3 minutes puppet will re-enable icinga-wm on einsteinium ;) [11:23:14] <_joe_> volans: I was waiting tbh [11:23:49] ok, I'll re-stop irecho after puppet hen [11:23:54] s/hen/then/ [11:24:09] <_joe_> volans: or I disable puppet there for a few minutes [11:24:10] <_joe_> :P [11:24:27] all yours [11:37:54] !log cp1052 repooled T148891 [11:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:59] T148891: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891 [11:39:38] (03PS1) 10Volans: Cumin: fine tuning ssh_options [puppet] - 10https://gerrit.wikimedia.org/r/340104 (https://phabricator.wikimedia.org/T159127) [11:39:56] (03PS2) 10Volans: SSH keys: add new key for myself [puppet] - 10https://gerrit.wikimedia.org/r/340098 [11:40:15] 06Operations, 10ops-eqiad, 10Traffic: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891#3057526 (10ema) 05Open>03Resolved @Cmjohnson looks good. Thanks! [11:41:21] (03CR) 10Volans: [C: 032] SSH keys: add new key for myself [puppet] - 10https://gerrit.wikimedia.org/r/340098 (owner: 10Volans) [11:42:08] (03PS1) 10Giuseppe Lavagetto: new_wmf_service: fix references to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/340105 [11:42:47] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] new_wmf_service: fix references to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/340105 (owner: 10Giuseppe Lavagetto) [11:53:31] <_joe_> hello again icinga-wm [11:53:44] <_joe_> sorry for having to put you out of our misery earlier [11:58:50] !log re-enabled icinga-wm [11:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:09] !log rebooting mw2092 due to puppet errors for mw-cgroup - T151427 [12:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:14] T151427: mw2092 - disk issue - https://phabricator.wikimedia.org/T151427 [12:01:35] PROBLEM - puppet last run on 
elastic1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:02:53] going to lunch now but I'll prep mw2092 for re-pool after that [12:03:15] (puppet now runs correctly) [12:09:51] (03PS1) 10Giuseppe Lavagetto: utils: add create_ecdsa_cert [puppet] - 10https://gerrit.wikimedia.org/r/340107 [12:14:45] <_joe_> !log reissuing the certificate for etcd.codfw.wmnet due to a previous error [12:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:45] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:17:54] (03PS1) 10Giuseppe Lavagetto: cacert::certificates: new cert for etcd [puppet] - 10https://gerrit.wikimedia.org/r/340108 [12:18:35] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [1800.0] [12:25:35] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [1800.0] [12:26:50] (03Abandoned) 10Giuseppe Lavagetto: cacert::certificates: new cert for etcd [puppet] - 10https://gerrit.wikimedia.org/r/340108 (owner: 10Giuseppe Lavagetto) [12:27:30] (03PS2) 10Giuseppe Lavagetto: utils: add create_ecdsa_cert [puppet] - 10https://gerrit.wikimedia.org/r/340107 [12:27:32] (03PS1) 10Giuseppe Lavagetto: cacert::certificates: new cert for etcd [puppet] - 10https://gerrit.wikimedia.org/r/340110 [12:29:45] (03PS2) 10Giuseppe Lavagetto: cacert::certificates: new cert for etcd [puppet] - 10https://gerrit.wikimedia.org/r/340110 [12:30:01] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] cacert::certificates: new cert for etcd [puppet] - 10https://gerrit.wikimedia.org/r/340110 (owner: 10Giuseppe Lavagetto) [12:30:35] RECOVERY - puppet last run on elastic1048 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:36:35] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] [12:39:10] 
!log restart zookeeper on conf2002 [12:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:50] (03CR) 10Muehlenhoff: [C: 032] Add consistency check for nda and wmf LDAP groups based on data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/339656 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [12:40:58] (03PS4) 10Muehlenhoff: Add consistency check for nda and wmf LDAP groups based on data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/339656 (https://phabricator.wikimedia.org/T142836) [12:46:32] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me. All the options are supported on trusty as well (our precise have a trusty openssh backport)" [puppet] - 10https://gerrit.wikimedia.org/r/340104 (https://phabricator.wikimedia.org/T159127) (owner: 10Volans) [12:47:51] (03PS2) 10Giuseppe Lavagetto: profile::conftool::client: add default schema [puppet] - 10https://gerrit.wikimedia.org/r/339673 (https://phabricator.wikimedia.org/T149617) [12:50:10] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::conftool::client: add default schema [puppet] - 10https://gerrit.wikimedia.org/r/339673 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto) [12:51:29] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3057684 (10aude) @Lydia_Pintscher suggest the pywikibot people be informed, since I think they might stil... [12:53:16] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3057685 (10Lydia_Pintscher) Makes sense. 
[13:00:03] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3057708 (10ema) p:05Triage>03Normal [13:00:46] 06Operations, 10Ops-Access-Requests, 10Icinga, 10Monitoring, 06Release-Engineering-Team: Rename Icinga contact 'amusso' to 'hashar' - https://phabricator.wikimedia.org/T158167#3057709 (10ema) p:05Triage>03Normal [13:02:11] 06Operations, 13Patch-For-Review: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955#3057726 (10ema) p:05Triage>03Normal [13:02:21] Loaded: error (Reason: Invalid argument) [13:02:22] Active: active (running) s [13:02:27] lovely systemd :D [13:06:56] !log restart zookeeper on conf2003 [13:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:55] PROBLEM - DPKG on restbase-test2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:08:59] 06Operations, 10DNS, 10Traffic, 07Mobile, 13Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#3057738 (10ema) p:05Triage>03Normal [13:09:06] (03PS8) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [13:10:55] RECOVERY - DPKG on restbase-test2001 is OK: All packages OK [13:12:27] (03PS2) 10Giuseppe Lavagetto: conftool-data: add first discovery objects [puppet] - 10https://gerrit.wikimedia.org/r/339674 (https://phabricator.wikimedia.org/T149617) [13:12:34] (03PS9) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [13:13:21] (03CR) 10Giuseppe Lavagetto: "Setting the URL here while not setting the other property is not supported by conftool at present, unluckily. 
We might be able to do that " (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/339674 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto) [13:20:38] (03CR) 10Faidon Liambotis: [C: 031] Cumin: fine tuning ssh_options [puppet] - 10https://gerrit.wikimedia.org/r/340104 (https://phabricator.wikimedia.org/T159127) (owner: 10Volans) [13:21:07] (03PS2) 10Faidon Liambotis: Kill ubuntu.wikimedia.org legacy hostname [puppet] - 10https://gerrit.wikimedia.org/r/339652 [13:21:09] (03PS2) 10Faidon Liambotis: Kill ubuntu.wikimedia.org legacy hostname [dns] - 10https://gerrit.wikimedia.org/r/339653 [13:22:17] (03CR) 10Faidon Liambotis: [C: 032] Kill ubuntu.wikimedia.org legacy hostname [puppet] - 10https://gerrit.wikimedia.org/r/339652 (owner: 10Faidon Liambotis) [13:22:39] (03CR) 10Faidon Liambotis: [C: 032] Kill ubuntu.wikimedia.org legacy hostname [dns] - 10https://gerrit.wikimedia.org/r/339653 (owner: 10Faidon Liambotis) [13:24:19] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Degraded raid on barium - https://phabricator.wikimedia.org/T154039#3057793 (10Jgreen) 05Open>03Resolved 3TB disks have been removed, system boots normally [13:27:57] (03PS2) 10Faidon Liambotis: Reorder check for timesyncd or ntpd [puppet] - 10https://gerrit.wikimedia.org/r/338364 (owner: 10Muehlenhoff) [13:28:45] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:29:07] (03CR) 10Faidon Liambotis: [C: 032] Reorder check for timesyncd or ntpd [puppet] - 10https://gerrit.wikimedia.org/r/338364 (owner: 10Muehlenhoff) [13:29:22] 06Operations, 15User-Elukey: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3057797 (10ema) [13:30:33] 06Operations, 10Traffic, 15User-Elukey: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3057816 (10ema) [13:37:16] 06Operations, 10Traffic: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3057819 (10elukey) [13:37:29] 06Operations, 10Traffic: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3057797 (10faidon) Yeah, WoSign's CT server seems to be occasionally flaky, I saw someone else complaining about this somewhere. Not sure what we can do about that :/ [13:41:13] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#3057841 (10ema) p:05Triage>03Normal [13:50:58] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit: Schedule downtime to migrate db to utf8mb4 - https://phabricator.wikimedia.org/T155764#3057861 (10ema) p:05Triage>03Normal [13:53:07] (03PS3) 10Marostegui: mariadb: Add gtid_domain_id to s2 [puppet] - 10https://gerrit.wikimedia.org/r/338734 (https://phabricator.wikimedia.org/T149418) [13:54:30] (03CR) 10Marostegui: [C: 032] mariadb: Add gtid_domain_id to s2 [puppet] - 10https://gerrit.wikimedia.org/r/338734 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [13:54:37] 06Operations, 15User-Elukey: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#2661877 (10ema) This is happening on tin too: ``` -rw-r----- 1 root adm 254 Nov 27 21:39 /var/log/hhvm/error.log-20161127 ``` [13:55:29] this is really weird ema--^ [13:55:41] not sure where but I 
think there is a race condition [13:56:45] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:58:15] jouncebot: refresh [13:58:16] I refreshed my knowledge about deployments. [13:58:18] jouncebot: next [13:58:18] No deployments scheduled for the forseeable future! [13:58:58] !log Manually deploy gtid_domain_id on s2 - T149418 [13:59:00] Urbanecm: Dereckson: addshore: zeljkof: https://wikitech.wikimedia.org/wiki/Deployments did not get the boiler plate added so nothing is scheduled for SWAT. But I am willing to take request on the channel [13:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:03] T149418: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418 [13:59:48] elukey: only logs generated in October 2016 seem to be affected [13:59:49] hashar: while waiting, could you take a minute at this? ;) https://gerrit.wikimedia.org/r/#/c/340112/ [13:59:57] is there anything for swat? [14:00:10] hashar: just check if there is too much release notes [14:00:24] nowhere on the deployments page to put stuff :/ [14:00:27] aude_: the wiki page is not updated [14:00:32] yeah [14:00:39] hashar and I were not sure how it is done [14:00:40] * aude_ would like https://gerrit.wikimedia.org/r/#/c/339446/ [14:00:48] could do myself [14:01:08] hashar, could you deploy 339348 for me? [14:01:13] aude_: well, the eu swat is scheduled for now :) [14:01:18] Or anybody else [14:01:20] yeah, that's why i ask [14:01:46] aude_: can you then deploy your and Urbanecm's changes? [14:02:07] * aude_ looks [14:03:59] ok [14:04:03] zeljkof: reviewed pending dan I guess [14:04:21] hashar: I can release, if the release notes look good [14:04:51] zeljkof: yup looks good to me. 
It is up to you to check whether you want Dan input as well [14:05:01] we can always fix stuff, if needed, I was just not sure if there is too much in the notes [14:05:08] I'll release now [14:05:09] aude_: please do https://gerrit.wikimedia.org/r/#/c/339446/ [14:05:19] I am reviewing Urbanecm change meanwhile ( https://gerrit.wikimedia.org/r/#/c/339348/ ) [14:06:03] Urbanecm: did we check the short aliases? [14:06:12] thanks [14:06:28] (03CR) 10Aude: [C: 032] Disallow geo-shape data type on wikidata for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339446 (https://phabricator.wikimedia.org/T158849) (owner: 10Aude) [14:07:10] ps. Setting extra.merge-plugin.include does not exist or is not supported by this command [14:07:30] i'm not going to be around most of the day, but see that's still something we need to solve [14:07:43] (03Merged) 10jenkins-bot: Disallow geo-shape data type on wikidata for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339446 (https://phabricator.wikimedia.org/T158849) (owner: 10Aude) [14:07:47] (03CR) 10jenkins-bot: Disallow geo-shape data type on wikidata for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339446 (https://phabricator.wikimedia.org/T158849) (owner: 10Aude) [14:08:48] 06Operations, 15User-Elukey: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#3057880 (10ema) I've looked for other instances of the problem across the cluster and could only found the following: ``` tin.eqiad.wmnet: -rw-r----- 1 root adm 254 Nov... 
[14:09:45] aude_: arg yeah that get raised for mediawiki/extensions/Wikidata [14:10:04] was an attempt to merge autoload-dev but that needs a more recent version of composer than the one we have on CI :/ [14:10:23] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Disable geo-shape datatype on wikidata for now (duration: 00m 41s) [14:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:20] 06Operations, 15User-Elukey: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#3057884 (10elukey) I worked today on mw2092, it was put back in service around Dec 2 (https://phabricator.wikimedia.org/T151427) but it wasn't set up correctly. I think it was rei... [14:11:32] yeah, i saw [14:11:44] * aude_ checks wikidata [14:12:18] noticed unrelated error in the exception logs [14:13:52] !log installed apache2 security updates on mwdebug* [14:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:04] wikidata looks ok (can still create properties of the other data types) [14:22:19] (03PS5) 10Hashar: New namespace aliases for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) (owner: 10Urbanecm) [14:22:50] (03CR) 10Hashar: [C: 032] "There are no conflict with currently defined interwiki ( from https://it.wikiversity.org/wiki/Speciale:Interwiki )." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) (owner: 10Urbanecm) [14:24:26] (03Merged) 10jenkins-bot: New namespace aliases for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) (owner: 10Urbanecm) [14:24:39] (03PS6) 10Rush: labstore: Use explicit groups for file resources [puppet] - 10https://gerrit.wikimedia.org/r/324729 (https://phabricator.wikimedia.org/T152095) (owner: 10Tim Landscheidt) [14:24:44] (03CR) 10jenkins-bot: New namespace aliases for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) (owner: 10Urbanecm) [14:24:47] deploying Urbanecm changes to itwikiversity [14:25:29] Urbanecm: deployed on mwdebug1001.eqiad.wmnet [14:25:59] thanks [14:26:49] pagelinks from=16476 ns=0 dbk=T:Immagine_sinottico -> Template:Immagine_sinottico DRY RUN [14:26:49] pagelinks from=16478 ns=0 dbk=T:Immagine_sinottico -> Template:Immagine_sinottico DRY RUN [14:27:02] Urbanecm: only two links on the wiki needed to be updated :} [14:28:15] (03CR) 10Rush: [C: 032] labstore: Use explicit groups for file resources [puppet] - 10https://gerrit.wikimedia.org/r/324729 (https://phabricator.wikimedia.org/T152095) (owner: 10Tim Landscheidt) [14:29:18] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: New namespace aliases for itwikiversity - T158775 (duration: 00m 43s) [14:29:18] (03PS1) 10Muehlenhoff: Allow LDAP access to corp mirrors from terbium [puppet] - 10https://gerrit.wikimedia.org/r/340119 [14:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:23] T158775: New namespace aliases for Italian Wikiversity - https://phabricator.wikimedia.org/T158775 [14:29:49] jouncebot: now [14:29:49] No deployments scheduled for the forseeable future! [14:29:56] jouncebot: refresh [14:29:57] I refreshed my knowledge about deployments. 
[14:30:04] Urbanecm: should be good now [14:30:28] what deployment window is this i hate spamming jouncebot [14:30:43] !log European SWAT done. Pushed https://gerrit.wikimedia.org/r/#/c/339446/ and https://gerrit.wikimedia.org/r/#/c/339348/ [14:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:33] (03CR) 10Volans: [C: 031] "LGTM. Nitpick inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/339674 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto) [14:37:01] (03PS4) 10Rush: nova: monitor for fullstack test daemon [puppet] - 10https://gerrit.wikimedia.org/r/339651 [14:40:27] hashar, thank you [14:41:21] (03CR) 10Giuseppe Lavagetto: conftool-data: add first discovery objects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/339674 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto) [14:42:36] !log Fix namespace dupes pages on ext.wikipedia (T158914) [14:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:41] T158914: namespaceDupes.php for ext.wikipedia - https://phabricator.wikimedia.org/T158914 [14:44:04] (03PS3) 10Giuseppe Lavagetto: conftool-data: add first discovery objects [puppet] - 10https://gerrit.wikimedia.org/r/339674 (https://phabricator.wikimedia.org/T149617) [14:45:52] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool-data: add first discovery objects [puppet] - 10https://gerrit.wikimedia.org/r/339674 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto) [14:49:18] (03CR) 10Andrew Bogott: [C: 031] nova: monitor for fullstack test daemon [puppet] - 10https://gerrit.wikimedia.org/r/339651 (owner: 10Rush) [14:49:23] 06Operations, 06Discovery, 06Services (watching), 15User-mobrovac: Set up Logstash behind LVS - https://phabricator.wikimedia.org/T159004#3054130 (10fgiunchedi) FYI I did have a patch to add logstash to LVS (with source address hashing, to address @EBernhardson concern) at 
https://gerrit.wikimedia.org/r/#/... [14:57:17] (03PS5) 10Rush: nova: monitor for fullstack test daemon [puppet] - 10https://gerrit.wikimedia.org/r/339651 [14:58:17] (03PS6) 10Rush: nova: monitor for fullstack test daemon [puppet] - 10https://gerrit.wikimedia.org/r/339651 [14:58:31] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad [14:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:08] nice! [15:14:37] (03PS1) 10Marostegui: production.my.cnf: Enable gtid_domaid_id [puppet] - 10https://gerrit.wikimedia.org/r/340130 (https://phabricator.wikimedia.org/T149418) [15:19:54] (03CR) 10Marostegui: "This compiles finely and changes only the hosts that still don't have gtid_domain_id (those not in s2 and s6): https://puppet-compiler.wmf" [puppet] - 10https://gerrit.wikimedia.org/r/340130 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [15:21:16] (03CR) 10Jcrespo: [C: 031] production.my.cnf: Enable gtid_domaid_id [puppet] - 10https://gerrit.wikimedia.org/r/340130 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [15:23:49] (03Draft1) 10Gehel: portals: cleanup Apache configuration template [puppet] - 10https://gerrit.wikimedia.org/r/340132 [15:32:45] PROBLEM - puppet last run on rdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:39:04] (03CR) 10Gehel: "puppet compiler looks good: https://puppet-compiler.wmflabs.org/5593/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/340132 (owner: 10Gehel) [15:47:04] (03PS1) 10Jcrespo: mariadb: Depool db1051 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340134 (https://phabricator.wikimedia.org/T147747) [15:55:31] !log starting schema change on db2038 T147747 [15:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:37] 06Operations, 10ops-codfw, 10hardware-requests: decommission ms2001 & ms2002 - https://phabricator.wikimedia.org/T157991#3058199 (10Papaul) [15:56:51] (03PS10) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [15:57:58] (03CR) 10jerkins-bot: [V: 04-1] jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [15:58:18] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1051 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340134 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [15:59:17] 06Operations, 10Gerrit, 06Release-Engineering-Team: Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3058208 (10Paladox) This requires us to set this config https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#receiveemail [16:00:17] (03Merged) 10jenkins-bot: mariadb: Depool db1051 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340134 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [16:00:26] (03CR) 10jenkins-bot: mariadb: Depool db1051 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340134 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [16:01:28] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 for maintenance (duration: 00m 40s) [16:01:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:45] RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:01:56] (03PS11) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [16:02:29] (03CR) 10Paladox: "We should upstream this too :)" [puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [16:03:41] Is there an Ops meeting today? (I can't find the etherpad, and last week was extra short) [16:03:45] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:06:31] andrewbogott: I think so, I have it on my cal [16:08:12] !log starting schema change on db1051 T147747 [16:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:32] (03CR) 10Faidon Liambotis: "terbium feels a little odd, as it also runs a lot of application stack cronjobs that feel entirely unrelated (mediawiki etc.). Prhaps the " [puppet] - 10https://gerrit.wikimedia.org/r/340119 (owner: 10Muehlenhoff) [16:14:26] (03CR) 10Muehlenhoff: "It's currently running on terbium since it's the host that owns the openldap::management role. But I can also move the cron job to dubnium" [puppet] - 10https://gerrit.wikimedia.org/r/340119 (owner: 10Muehlenhoff) [16:14:45] PROBLEM - puppet last run on ms-fe1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:16:25] PROBLEM - Host mc2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:36] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:17:01] (03PS9) 10Nuria: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [16:17:44] (03PS4) 10Nuria: navtiming: Make tests easier to extend [puppet] - 10https://gerrit.wikimedia.org/r/338044 (owner: 10Krinkle) [16:18:25] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 32 minutes ago with 0 failures [16:18:27] mc2001 has been decommed [16:18:42] now we are using mc2019->mc2036 [16:19:02] so that one might be due to somebody working on it? [16:22:02] (03PS1) 10Filippo Giunchedi: lvs: fix icinga hostname for ms-fe in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/340138 [16:24:49] (03CR) 10Filippo Giunchedi: [C: 032] lvs: fix icinga hostname for ms-fe in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/340138 (owner: 10Filippo Giunchedi) [16:25:25] PROBLEM - Host mc2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:26:50] (03PS12) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [16:28:22] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3058399 (10madhuvishy) a:05madhuvishy>03None [16:31:42] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:36:21] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3058414 (10Cmjohnson) I can make a cable to run anywhere in the data center so proximity is not an issue. I need to find a space t... 
[16:37:12] PROBLEM - Host mc2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:41:24] (03PS2) 10BBlack: varnish: move "apps" data back into manifests [WIP, 1/4] [puppet] - 10https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) [16:41:26] (03PS3) 10BBlack: varnish: switch all clusters to req_handling [WIP, 2/4] [puppet] - 10https://gerrit.wikimedia.org/r/339668 (https://phabricator.wikimedia.org/T134404) [16:41:28] (03PS4) 10BBlack: varnish: per-app routing [WIP, 3/4] [puppet] - 10https://gerrit.wikimedia.org/r/339669 (https://phabricator.wikimedia.org/T134404) [16:41:30] (03PS2) 10BBlack: varnish: move applayer info back to hiera [WIP, 4/4] [puppet] - 10https://gerrit.wikimedia.org/r/339671 (https://phabricator.wikimedia.org/T134404) [16:42:40] (03PS1) 10Filippo Giunchedi: swift: ignore spammy 507s from container-server [puppet] - 10https://gerrit.wikimedia.org/r/340142 (https://phabricator.wikimedia.org/T157237) [16:42:42] RECOVERY - puppet last run on ms-fe1008 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:42:58] (03PS2) 10Gehel: Remove non-existent setting from apifeatureusage logstash template [puppet] - 10https://gerrit.wikimedia.org/r/338469 (owner: 10EBernhardson) [16:44:07] (03PS3) 10BBlack: varnish: move applayer info back to hiera [WIP, 4/4] [puppet] - 10https://gerrit.wikimedia.org/r/339671 (https://phabricator.wikimedia.org/T134404) [16:44:39] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3058444 (10RobH) Please note that I emailed Lisa Gruwell as the person to approve ac... 
[16:45:28] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3058463 (10RobH) [16:45:32] PROBLEM - Host mc2005 is DOWN: PING CRITICAL - Packet loss = 100% [16:45:32] PROBLEM - Host mc2004 is DOWN: PING CRITICAL - Packet loss = 100% [16:45:42] PROBLEM - Host mc2006 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:38] elukey: ^^^ [16:46:59] (03CR) 10Gehel: [C: 032] Remove non-existent setting from apifeatureusage logstash template [puppet] - 10https://gerrit.wikimedia.org/r/338469 (owner: 10EBernhardson) [16:47:39] volans: I think that papaul is probably decom them, old hosts :) [16:48:06] if we start seeing mc2019 onwards that would be a problem [16:48:16] elukey: yeah, downtime would be nice [16:48:42] 06Operations: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3058475 (10Cmjohnson) I have updated Dell Tech support and continue with another possible solution. [16:50:42] 06Operations: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3050736 (10jcrespo) Thanks for this. Please give us a heads up if you need to turn down the server- it is back to being in service as the main backup server, to test it behaves well, last thing we want is to put it down... 
[16:52:52] (03PS13) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [16:55:05] (03PS2) 10Hashar: systemd: add spec [puppet] - 10https://gerrit.wikimedia.org/r/339176 [16:55:20] (03PS6) 10Hashar: systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 [16:55:29] (03CR) 10jerkins-bot: [V: 04-1] jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [16:55:49] (03PS14) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [16:57:39] 06Operations: PuppetDB is auto-deactivating hosts - https://phabricator.wikimedia.org/T159163#3058497 (10Volans) [16:58:24] (03PS7) 10Rush: nova: monitor for fullstack test daemon [puppet] - 10https://gerrit.wikimedia.org/r/339651 [16:59:20] (03PS8) 10Rush: nova: monitor for fullstack test daemon [puppet] - 10https://gerrit.wikimedia.org/r/339651 [17:00:12] (03CR) 10Rush: [V: 032 C: 032] nova: monitor for fullstack test daemon [puppet] - 10https://gerrit.wikimedia.org/r/339651 (owner: 10Rush) [17:01:24] 06Operations, 10ops-codfw, 10hardware-requests: decommission ms2001 & ms2002 - https://phabricator.wikimedia.org/T157991#3058512 (10Papaul) [17:03:56] 06Operations, 10ops-codfw, 10hardware-requests: decommission ms2001 & ms2002 - https://phabricator.wikimedia.org/T157991#3058515 (10Papaul) a:05Papaul>03RobH Disk wipe complete , systems unracked complete, decommission sheet update. port information ms2001 row B rack B1 ge-1/0/3 ms2002 row B rack B8 g... [17:04:18] papaul: thanks! [17:04:25] that'll free up a lot of rackspace! [17:04:36] 12u or so... [17:04:48] or 16... 
[17:05:18] (03PS1) 10Rush: nova: nova-fullstack.upstart.erb proper variable reference [puppet] - 10https://gerrit.wikimedia.org/r/340143 [17:05:24] robh: no problem [17:06:01] (03CR) 10Hashar: "I wrote them as a base for the child change https://gerrit.wikimedia.org/r/#/c/337411/ which let us change how the programname is matched" [puppet] - 10https://gerrit.wikimedia.org/r/339176 (owner: 10Hashar) [17:06:55] (03PS2) 10Rush: nova: nova-fullstack.upstart.erb proper variable reference [puppet] - 10https://gerrit.wikimedia.org/r/340143 [17:08:06] (03CR) 10Hashar: "I have actually applied this on instance jenkinstest.integration.eqiad.wmflabs to polish up hence the spam of changes. It seems good now " [puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [17:08:43] (03CR) 10Rush: [C: 032] nova: nova-fullstack.upstart.erb proper variable reference [puppet] - 10https://gerrit.wikimedia.org/r/340143 (owner: 10Rush) [17:26:59] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:27:40] (03CR) 10RobH: [C: 031] "This looks good. Perhaps introduce newlines after RewriteCond since they don't work in conjunction with the line immediately following ea" [puppet] - 10https://gerrit.wikimedia.org/r/339657 (https://phabricator.wikimedia.org/T158782) (owner: 10Gehel) [17:29:28] 07Puppet, 13Patch-For-Review: On standalone puppetmasters labstore files in /usr/local/sbin get group 998 (gitpuppet) - https://phabricator.wikimedia.org/T152095#3058597 (10scfc) 05Open>03Resolved [17:29:56] (03CR) 10RobH: [C: 031] "I forgot to note that I would get many, many eyes on this. Some of our most painful outages were due to apache config issues. 
I'd always" [puppet] - 10https://gerrit.wikimedia.org/r/339657 (https://phabricator.wikimedia.org/T158782) (owner: 10Gehel) [17:37:16] (03PS2) 10Gehel: portals: do not rewrite 404 errors [puppet] - 10https://gerrit.wikimedia.org/r/339657 (https://phabricator.wikimedia.org/T158782) [17:37:46] (03CR) 10Gehel: "added white space as suggested by robh" [puppet] - 10https://gerrit.wikimedia.org/r/339657 (https://phabricator.wikimedia.org/T158782) (owner: 10Gehel) [17:39:05] 06Operations, 10Annual-Report, 10Security-Reviews: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3058623 (10Dzahn) [17:39:40] 06Operations, 10Annual-Report, 10Security-Reviews: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#2828442 (10Dzahn) 05Open>03Resolved Alright, no more pending changes here. Index redirects, content is ok. I'm going to close this as resolved now. [17:42:04] 06Operations, 13Patch-For-Review: Setup install server in codfw - tftp done, but not apt and other install services (now: DHCP, TFTP, webproxy done, just not APT) - https://phabricator.wikimedia.org/T84380#3058636 (10Dzahn) install2002 is up and running. APT data is synced over from install1002 by rsync/cron.... [17:42:08] 06Operations: Setup basic infrastructure services in codfw - https://phabricator.wikimedia.org/T84350#3058639 (10Dzahn) [17:42:11] 06Operations, 13Patch-For-Review: Setup install server in codfw - tftp done, but not apt and other install services (now: DHCP, TFTP, webproxy done, just not APT) - https://phabricator.wikimedia.org/T84380#3058638 (10Dzahn) 05Open>03Resolved [17:42:28] 06Operations: Setup install server in codfw - tftp done, but not apt and other install services (now: DHCP, TFTP, webproxy done, just not APT) - https://phabricator.wikimedia.org/T84380#926563 (10Dzahn) [17:42:49] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:43:47] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3058650 (10fgiunchedi) From the latest audit of fluorine logs missing from mwlog1001 I can't spot any files present on the former but not the latter. [17:44:02] (03CR) 10Dzahn: [C: 031] Gerrit: Make GerritBot report author of patch and the uploader of patch [puppet] - 10https://gerrit.wikimedia.org/r/339980 (https://phabricator.wikimedia.org/T76291) (owner: 10Paladox) [17:47:02] (03Abandoned) 10Tim Landscheidt: puppetmaster: Enable expand_path for Hiera in Labs as well [puppet] - 10https://gerrit.wikimedia.org/r/329226 (owner: 10Tim Landscheidt) [17:49:00] (03CR) 10Dzahn: [C: 032] Gerrit: Make GerritBot report author of patch and the uploader of patch [puppet] - 10https://gerrit.wikimedia.org/r/339980 (https://phabricator.wikimedia.org/T76291) (owner: 10Paladox) [17:49:09] (03PS5) 10Dzahn: Gerrit: Make GerritBot report author of patch and the uploader of patch [puppet] - 10https://gerrit.wikimedia.org/r/339980 (https://phabricator.wikimedia.org/T76291) (owner: 10Paladox) [17:49:36] (03CR) 10Dzahn: [V: 032 C: 032] Gerrit: Make GerritBot report author of patch and the uploader of patch [puppet] - 10https://gerrit.wikimedia.org/r/339980 (https://phabricator.wikimedia.org/T76291) (owner: 10Paladox) [17:54:04] 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#3058692 (10Papaul) @RobH I update the license and try to update the firmware having the same error. The image I am using is the same image I used on ms-be2002. {F5896057} [17:54:21] (03CR) 10EBernhardson: "It may have happened later, but just merging the patch and letting puppet run on logstash hosts doesn't seem to have updated the template." 
[puppet] - 10https://gerrit.wikimedia.org/r/338469 (owner: 10EBernhardson) [17:55:06] (03PS5) 10Dzahn: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [17:55:09] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:56:07] 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#3058697 (10RobH) >>! In T155690#3058692, @Papaul wrote: > @RobH I update the license and try to update the firmware having the same error. The image I am using is the same image I used on ms-be200... [17:58:39] PROBLEM - Host mc2007 is DOWN: PING CRITICAL - Packet loss = 100% [18:00:44] (03CR) 10Tim Landscheidt: [C: 04-1] "I encountered unexpected look-ups with my (identical) patch, so I abandoned it and -1 here. Cf. Iaf73d6f52ec402cb7c1b7eebd0bc462b55343825" [puppet] - 10https://gerrit.wikimedia.org/r/274566 (owner: 10Dduvall) [18:01:39] (03PS6) 10Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) [18:01:44] (03PS1) 10Tim Landscheidt: Move puppetdb::password variables to hieradata/labs.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/340148 [18:01:51] (03PS7) 10Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) [18:04:14] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1001.eqiad.wmnet [18:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:19] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:10:49] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:12:17] !log gehel@tin Started deploy [wdqs/wdqs@62354ed]: log [18:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:25] !log temporarily bumping timeout_idle to 120s on cache_misc T154558 [18:12:30] !log gehel@tin Finished deploy [wdqs/wdqs@62354ed]: log (duration: 00m 12s) [18:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:30] T154558: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558 [18:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:10] !log gehel@tin Started deploy [wdqs/wdqs@62354ed]: (no justification provided) [18:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:02] !log gehel@tin Finished deploy [wdqs/wdqs@62354ed]: (no justification provided) (duration: 00m 52s) [18:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:18] 06Operations: Switch to predictable network interface names? - https://phabricator.wikimedia.org/T158429#3058753 (10MoritzMuehlenhoff) I don't have a strong opinion here, but I'd expect that the Debian-specific sticky rules will be removed at some point in favour of the upstream mechanism and we'll have to migra... [18:14:30] SMalyshev: wdqs deployment completed, tests looking good... [18:14:47] !log restarting wdqs-updater on all wdqs servers [18:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:11] 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#3058766 (10RobH) That is indeed odd, since they are both R720xd. 
Unfortunately, the system is no longer covered under warranty with Dell, so we cannot contact their support for assistance on it di... [18:22:30] 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#3058782 (10RobH) a:05Papaul>03fgiunchedi I'm going to assign this to @fgiunchedi for his feedback regarding the potential decommission of ms-be2010. I'm uncertain as to the roadmap for replace... [18:24:16] !log gehel@tin Started deploy [wdqs/wdqs@daca9b3]: (no justification provided) [18:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:39] !log redeploying wdqs (previous deploy was not latest version) [18:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:55] !log gehel@tin Finished deploy [wdqs/wdqs@daca9b3]: (no justification provided) (duration: 01m 39s) [18:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:01] SMalyshev: deploy done for real this time [18:26:05] !log restarting wdqs-updater on all wdqs servers [18:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:16] gehel: cool, thanks! [18:26:19] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:29:29] (03PS1) 10BBlack: geo config structure changes for svc discovery [dns] - 10https://gerrit.wikimedia.org/r/340154 (https://phabricator.wikimedia.org/T156100) [18:29:35] (03CR) 10jerkins-bot: [V: 04-1] geo config structure changes for svc discovery [dns] - 10https://gerrit.wikimedia.org/r/340154 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [18:30:44] (03CR) 10Tim Landscheidt: "Works for my use case." [labs/private] - 10https://gerrit.wikimedia.org/r/340148 (owner: 10Tim Landscheidt) [18:31:03] (03CR) 10Tim Landscheidt: "* Tested to work for my use case." 
[labs/private] - 10https://gerrit.wikimedia.org/r/340148 (owner: 10Tim Landscheidt) [18:32:56] 06Operations, 06Labs, 10wikitech.wikimedia.org: Can't create account "Trizek (WMF)" - https://phabricator.wikimedia.org/T158408#3058811 (10Trizek-WMF) 05Open>03Invalid The "paid coding" vs "paid editing" reason is convincing me. Thanks @bd808! [18:33:51] (03PS3) 10BBlack: varnish: move "apps" data back into manifests [WIP, 1/4] [puppet] - 10https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) [18:33:53] (03PS4) 10BBlack: varnish: switch all clusters to req_handling [WIP, 2/4] [puppet] - 10https://gerrit.wikimedia.org/r/339668 (https://phabricator.wikimedia.org/T134404) [18:33:55] (03PS5) 10BBlack: varnish: per-app routing [WIP, 3/4] [puppet] - 10https://gerrit.wikimedia.org/r/339669 (https://phabricator.wikimedia.org/T134404) [18:33:57] (03PS4) 10BBlack: varnish: move applayer info back to hiera [WIP, 4/4] [puppet] - 10https://gerrit.wikimedia.org/r/339671 (https://phabricator.wikimedia.org/T134404) [18:33:59] (03CR) 10Paladox: [C: 031] "Tested and works still :)" [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [18:34:19] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:35:39] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/5597/" [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [18:36:37] 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#3058815 (10RobH) The only other thing I could think to do is have @papaul offline the host and try to flash the bios (not the idrac) and see if that resolves things. It will require downtime. 
[18:39:43] (03PS9) 10BBlack: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) [18:39:45] (03PS1) 10BBlack: authdns: re-structure prep for discovery [puppet] - 10https://gerrit.wikimedia.org/r/340156 (https://phabricator.wikimedia.org/T156100) [18:40:32] (03PS2) 10BBlack: geo config structure changes for discovery [dns] - 10https://gerrit.wikimedia.org/r/340154 (https://phabricator.wikimedia.org/T156100) [18:40:45] (03CR) 10jerkins-bot: [V: 04-1] geo config structure changes for discovery [dns] - 10https://gerrit.wikimedia.org/r/340154 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [18:42:25] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3058840 (10RobH) a:05RobH>03None So there wasn't an ops meeting today, but I've emailed our team about processing this request. When there isn't a meeting,... [18:45:17] 06Operations, 06Discovery, 06Discovery-Search (Current work): Add elasticsearch 5 .deb to reprepro experimental repository - https://phabricator.wikimedia.org/T159168#3058868 (10Paladox) [18:54:20] (03PS3) 10BBlack: geo config structure changes for discovery [dns] - 10https://gerrit.wikimedia.org/r/340154 (https://phabricator.wikimedia.org/T156100) [18:54:27] (03CR) 10jerkins-bot: [V: 04-1] geo config structure changes for discovery [dns] - 10https://gerrit.wikimedia.org/r/340154 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [18:54:42] (03CR) 10Subramanya Sastry: "ping" [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [18:55:19] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:55:56] greg-g: Need to run a maintenance script to purge some resourceloader caches post-wmf.13 deployment. Forgot to do so last week. 
Not super urgent, but will resolve various minor outstanding RL bugs (https://phabricator.wikimedia.org/T158105). Would prefer today or tomorrow? [18:56:49] will need to run on all wikis so may take a while to complete (30min-1h) [18:57:14] jouncebot: next [18:57:14] In 0 hour(s) and 2 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170227T1900) [18:58:35] (03CR) 10Dzahn: "thanks @Juniorsys! :) @AndrewBogott one more compiler run now ?:)" [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [18:59:00] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#2915651 (10ema) >>! In T154558#3058745, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://t... [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170227T1900). Please do the needful. 
[19:00:24] Empty SWAT [19:01:44] Krinkle: kk, do the needful as appropriate time-wise :) [19:06:57] RoanKattouw: Congrats on a successful swat window :D [19:07:19] (03PS1) 10Thcipriani: Scap: update version to 3.5.3-1 [puppet] - 10https://gerrit.wikimedia.org/r/340159 (https://phabricator.wikimedia.org/T127762) [19:07:48] (03CR) 10Krinkle: portals: do not rewrite 404 errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/339657 (https://phabricator.wikimedia.org/T158782) (owner: 10Gehel) [19:08:31] (03Draft1) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) [19:08:37] (03PS2) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) [19:11:29] (03PS3) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) [19:12:07] (03CR) 10Krinkle: portals: do not rewrite 404 errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/339657 (https://phabricator.wikimedia.org/T158782) (owner: 10Gehel) [19:13:56] Krinkle: thanks! I'll patch this tomorrow! 
[19:15:18] (03PS4) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) [19:15:39] jouncebot: next [19:15:39] In 1 hour(s) and 44 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170227T2100) [19:18:15] (03CR) 10Dzahn: [C: 032] Get rid of old beta_sites class now just containing a load of ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/322604 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [19:18:36] (03CR) 10Dzahn: [C: 032] "would have merged but has dependencies (parents)" [puppet] - 10https://gerrit.wikimedia.org/r/322604 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [19:19:33] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3058926 (10Krinkle) >>! In T158782#3057319, @Gehel wrote: > > `https://www.wikipedia.org/portal/x -> https://en.wikip... 
[19:20:39] (03CR) 10Chad: [C: 031] "Yeah this can't merge yet, too many dependencies we need to get through" [puppet] - 10https://gerrit.wikimedia.org/r/322604 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [19:23:05] (03PS5) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) [19:23:24] (03CR) 10Dzahn: [C: 032] Redirect wiki.toolserver.org to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [19:23:32] (03PS5) 10Dzahn: Redirect wiki.toolserver.org to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [19:24:31] (03CR) 10Dzahn: [C: 032] "too hard to get review sfor toollabs" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [19:28:09] (03CR) 10Dzahn: [C: 032] "@Dduvall could you remove the dependencies between multiple patches? 
(unless they really have to be merged in that order for technical rea" [puppet] - 10https://gerrit.wikimedia.org/r/274572 (owner: 10Dduvall) [19:29:57] (03PS2) 10BBlack: authdns: re-structure prep for discovery [puppet] - 10https://gerrit.wikimedia.org/r/340156 (https://phabricator.wikimedia.org/T156100) [19:29:59] (03PS10) 10BBlack: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) [19:37:52] (03PS2) 10Dzahn: phabricator: include project creations with policies other than public+all-users [puppet] - 10https://gerrit.wikimedia.org/r/317990 (owner: 10Alex Monk) [19:37:58] (03CR) 10Dzahn: "amended to add escaped quotes per Andre's comment" [puppet] - 10https://gerrit.wikimedia.org/r/317990 (owner: 10Alex Monk) [19:39:44] (03CR) 10Dzahn: [C: 031] "this https://gerrit.wikimedia.org/r/#/c/338540/ looks like the same thing plus it's removing the path to the template dir from .. templat" [puppet] - 10https://gerrit.wikimedia.org/r/337837 (owner: 10Jcrespo) [19:40:09] (03CR) 10Dzahn: "also see https://gerrit.wikimedia.org/r/#/c/337837/2" [puppet] - 10https://gerrit.wikimedia.org/r/338540 (https://phabricator.wikimedia.org/T95158) (owner: 10Tim Landscheidt) [19:44:32] (03PS6) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) [19:45:07] (03Abandoned) 10Dduvall: Fix programdashboard hieradata [puppet] - 10https://gerrit.wikimedia.org/r/274572 (owner: 10Dduvall) [19:45:30] (03Abandoned) 10Dduvall: labs: Deployer access for programdashboard [puppet] - 10https://gerrit.wikimedia.org/r/274579 (https://phabricator.wikimedia.org/T105967) (owner: 10Dduvall) [19:45:45] (03Abandoned) 10Dduvall: labs: Database server to support Program Dashboard [puppet] - 10https://gerrit.wikimedia.org/r/275138 (https://phabricator.wikimedia.org/T127105) (owner: 10Dduvall) [19:46:50] (03CR) 10Jcrespo: "Please amend this 
one: https://gerrit.wikimedia.org/r/337837 instead, it is older." [puppet] - 10https://gerrit.wikimedia.org/r/338540 (https://phabricator.wikimedia.org/T95158) (owner: 10Tim Landscheidt) [19:48:44] (03PS7) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) [19:49:29] (03CR) 10Thcipriani: [C: 031] "lgtm, seems to work" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) (owner: 10Chad) [19:50:12] (03PS8) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) [19:53:38] (03PS1) 10Dzahn: ganglia: move esams aggregator from bast3001 to bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340163 (https://phabricator.wikimedia.org/T156506) [19:53:46] (03PS1) 10Dduvall: Remove programdashboard module and related hieradata [puppet] - 10https://gerrit.wikimedia.org/r/340164 [19:56:26] (03PS2) 10Dzahn: ganglia: move esams aggregator from bast3001 to bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340163 (https://phabricator.wikimedia.org/T156506) [19:57:22] (03CR) 10Chad: [C: 032] Scap clean: Rework --l10n-only into --keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) (owner: 10Chad) [19:58:25] (03PS1) 10Dzahn: install/bast: move tftp server from bast3001 to bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340165 (https://phabricator.wikimedia.org/T156506) [19:59:19] (03Merged) 10jenkins-bot: Scap clean: Rework --l10n-only into --keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) (owner: 10Chad) [19:59:28] (03CR) 10jenkins-bot: Scap clean: Rework --l10n-only into --keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) 
(owner: 10Chad) [20:00:56] (03PS1) 10Dzahn: install/prometheus: move prometheus::ops from bast3001 to 3002 [puppet] - 10https://gerrit.wikimedia.org/r/340166 (https://phabricator.wikimedia.org/T156506) [20:01:48] (03CR) 10jerkins-bot: [V: 04-1] install/bast: move tftp server from bast3001 to bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340165 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [20:03:25] (03CR) 10jerkins-bot: [V: 04-1] install/prometheus: move prometheus::ops from bast3001 to 3002 [puppet] - 10https://gerrit.wikimedia.org/r/340166 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [20:05:51] (03PS1) 10Dzahn: install: remove bast3001 from puppet and smokeping [puppet] - 10https://gerrit.wikimedia.org/r/340169 (https://phabricator.wikimedia.org/T156506) [20:07:52] (03PS3) 10Dzahn: add bast3002 to network constants [puppet] - 10https://gerrit.wikimedia.org/r/339684 (https://phabricator.wikimedia.org/T156506) [20:09:03] (03CR) 10jerkins-bot: [V: 04-1] install: remove bast3001 from puppet and smokeping [puppet] - 10https://gerrit.wikimedia.org/r/340169 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [20:13:35] (03CR) 10Dzahn: "We are getting alerts from the-service-formerly-known-as-Watchmouse about this. (ALERT! 
Ubuntu mirror: Could not resolve host: ubuntu.wik" [dns] - 10https://gerrit.wikimedia.org/r/339653 (owner: 10Faidon Liambotis) [20:33:33] (03CR) 10Dzahn: [C: 032] add bast3002 to network constants [puppet] - 10https://gerrit.wikimedia.org/r/339684 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [20:34:37] (03CR) 10Thcipriani: [C: 04-1] "Seems like it should work fine, except syntax error: missing comma" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 (owner: 10Chad) [20:41:08] (03PS1) 10Dzahn: prometheus: add bast3002 as second esams host [puppet] - 10https://gerrit.wikimedia.org/r/340173 (https://phabricator.wikimedia.org/T156506) [20:44:20] (03PS17) 10Hashar: Rake helper to run rspec in all modules having specs [puppet] - 10https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: 10Nicko) [20:50:09] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [20:50:10] PROBLEM - check_puppetrun on frdb2001 is CRITICAL: CRITICAL: puppet fail [20:50:19] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: puppet fail [20:51:24] (03CR) 10Dzahn: [C: 032] prometheus: add bast3002 as second esams host [puppet] - 10https://gerrit.wikimedia.org/r/340173 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [20:51:38] !log disabled puppet on einstienium for icinga update of config [20:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:55] robh: renaming hashar ?:) [20:51:59] yep [20:52:02] cool! [20:52:04] o/ [20:52:10] which matches all the opsen, they have login as contact names [20:52:13] and alias as their actual names [20:52:15] yep [20:52:16] so this will now match [20:52:43] for years I thought that only ops could ack :D [20:52:44] (03PS1) 10RobH: change hashar's contact entry for command acks [puppet] - 10https://gerrit.wikimedia.org/r/340174 [20:53:14] +1,.. 
the LDAP user has to match the internal Icinga user [20:53:23] and its a string in a config file iirc [20:53:28] and then we can use privileges in cgi.cfg [20:53:30] and there is some change needed in the private contact list ? [20:53:34] to allow sending ACKs [20:53:37] just the contact name change [20:53:43] swapping contactname and alias around [20:54:04] (03CR) 10RobH: [C: 032] change hashar's contact entry for command acks [puppet] - 10https://gerrit.wikimedia.org/r/340174 (owner: 10RobH) [20:54:13] ahhh, i just missed my rebase window [20:54:14] damn it. [20:54:16] (03PS12) 10Hashar: Use rake tasks to run modules spec [puppet] - 10https://gerrit.wikimedia.org/r/307223 [20:54:21] (03PS2) 10RobH: change hashar's contact entry for command acks [puppet] - 10https://gerrit.wikimedia.org/r/340174 [20:54:23] (03CR) 10Chad: clean.py: Fix up l10nupdate-owned files on masters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 (owner: 10Chad) [20:54:26] (03CR) 10Dzahn: [C: 031] change hashar's contact entry for command acks [puppet] - 10https://gerrit.wikimedia.org/r/340174 (owner: 10RobH) [20:54:52] mutante: thx for +1! 
[20:55:09] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [20:55:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [20:55:09] RECOVERY - check_puppetrun on frdb2001 is OK: OK: Puppet is currently enabled, last run 280 seconds ago with 0 failures [20:55:10] RECOVERY - check_puppetrun on thulium is OK: OK: Puppet is currently enabled, last run 244 seconds ago with 0 failures [20:55:10] (03PS3) 10Chad: clean.py: Fix up l10nupdate-owned files on masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 [20:55:19] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: puppet fail [20:55:28] (03CR) 10jerkins-bot: [V: 04-1] clean.py: Fix up l10nupdate-owned files on masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 (owner: 10Chad) [20:58:31] ok, changes merged testing on system [20:59:02] and puppet refreshed and restarted icinga successfully.... [20:59:05] hashar: try to login? [20:59:15] and maybe put a host you have rights on into a 1 minute maint mode? [20:59:26] (if you have time right now, no rush.) [21:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170227T2100). Please do the needful. [21:00:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [21:00:09] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 218 seconds ago with 0 failures [21:00:10] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [21:00:19] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: puppet fail [21:00:30] no ores deploy today [21:00:34] oh those puppet fails scared the shit out of me then i realized they arent in the repo i touched... 
[21:00:36] Jeff_Green: ^ [21:00:38] (03PS6) 10Hashar: Jenkins integration of rspec [puppet] - 10https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342) [21:00:46] you have quite a few frack puppet run failures suddenly. [21:00:48] so with the current config it should work for all services that hashar is a contact for [21:00:52] but not globally for any service [21:00:56] indeed [21:01:24] (03PS4) 10Chad: clean.py: Fix up l10nupdate-owned files on masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339035 [21:01:26] robh: sorry. i restarted the puppetmaster service on one of the frack puppetmasters and a few of the clients freaked out [21:01:49] no need to apologize i just wanted to make sure you were aware =] [21:02:55] robh: mutante : icinga yields => "Your command requests were successfully submitted to Icinga for processing." [21:03:04] cool [21:03:24] I have managed to leave a comment on contint2001 check https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=contint2001&service=jenkins_zmq_publisher :} [21:03:25] awesome [21:03:35] :) [21:03:47] revel in the power [21:03:56] ^^ [21:03:58] (03PS3) 10Dzahn: ganglia: move esams aggregator from bast3001 to bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340163 (https://phabricator.wikimedia.org/T156506) [21:04:18] 06Operations, 10Ops-Access-Requests, 10Icinga, 10Monitoring, and 2 others: Rename Icinga contact 'amusso' to 'hashar' - https://phabricator.wikimedia.org/T158167#3059206 (10RobH) 05Open>03Resolved done and tested successfully, resolving. [21:04:32] 06Operations, 10Ops-Access-Requests, 10Icinga, 10Monitoring, and 2 others: Rename Icinga contact 'amusso' to 'hashar' - https://phabricator.wikimedia.org/T158167#3059208 (10hashar) icinga yields => "Your command requests were successfully submitted to Icinga for processing." I have managed to leave a comm... 
[21:05:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [21:05:09] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [21:05:19] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: puppet fail [21:05:24] i hate puppet. [21:05:49] stupid puppetmaster was down for literally 10 seconds, and for the next half hour everyone is confused [21:06:21] (03PS3) 10Tim Landscheidt: puppet: Remove templatedir setting [puppet] - 10https://gerrit.wikimedia.org/r/337837 (https://phabricator.wikimedia.org/T95158) (owner: 10Jcrespo) [21:06:24] (03Abandoned) 10Tim Landscheidt: puppet: Remove templatedir setting [puppet] - 10https://gerrit.wikimedia.org/r/338540 (https://phabricator.wikimedia.org/T95158) (owner: 10Tim Landscheidt) [21:08:35] (03PS7) 10Hashar: Jenkins integration of rspec [puppet] - 10https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342) [21:08:37] (03PS1) 10Hashar: Enable rspec testing in Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/340186 (https://phabricator.wikimedia.org/T78342) [21:10:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [21:10:09] RECOVERY - check_puppetrun on payments1002 is OK: OK: Puppet is currently enabled, last run 208 seconds ago with 0 failures [21:10:10] RECOVERY - check_puppetrun on payments2001 is OK: OK: Puppet is currently enabled, last run 79 seconds ago with 0 failures [21:10:49] (03CR) 10jerkins-bot: [V: 04-1] Enable rspec testing in Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/340186 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [21:12:31] (03PS4) 10Dzahn: ganglia: move esams aggregator from bast3001 to bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340163 (https://phabricator.wikimedia.org/T156506) [21:15:08] (03CR) 10Filippo Giunchedi: "LGTM, see comments inline and name bikeshedding" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/339465 (owner: 10Ema) [21:15:09] 
RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 226 seconds ago with 0 failures [21:15:35] (03CR) 10Dzahn: [C: 032] ganglia: move esams aggregator from bast3001 to bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340163 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [21:16:56] !log ganglia - switching esams aggregator to bast3002 - except short gaps in esams graphs [21:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:21] except i meant expect [21:17:22] (03PS1) 10Papaul: DNS/Decom Remove production DNS for mc2001-mc2016 [dns] - 10https://gerrit.wikimedia.org/r/340195 [21:21:02] Jeff_Green: are all agents running at the same time? there is randomization in the cron in prod [21:23:26] (03CR) 10Dzahn: DNS/Decom Remove production DNS for mc2001-mc2016 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/340195 (owner: 10Papaul) [21:24:22] (03CR) 10Dzahn: "also removes mw2215 and mw2216" [dns] - 10https://gerrit.wikimedia.org/r/340195 (owner: 10Papaul) [21:25:45] (03CR) 10Dzahn: "confirmed with cp3003/cp3004 (maps cache), graphs coming back after puppet runs everywhere involved" [puppet] - 10https://gerrit.wikimedia.org/r/340163 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [21:28:36] (03PS2) 10Papaul: DNS/Decom Remove production DNS for mc2001-mc2016 [dns] - 10https://gerrit.wikimedia.org/r/340195 [21:36:09] PROBLEM - tilerator on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6534: Connection refused [21:36:29] PROBLEM - kartotherian endpoints health on maps-test2004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.35, port=6533): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fc981a31910: Failed to establish a new connection: [Errno 111] Connection refused,)) [21:36:49] PROBLEM - tileratorui on maps-test2004 is CRITICAL: connect 
to address 10.192.16.35 and port 6535: Connection refused [21:37:19] oops, downtime expired, lemme fix this [21:37:48] (03PS1) 10Hashar: base/profile: fix spec for base::certificate [puppet] - 10https://gerrit.wikimedia.org/r/340222 [21:38:08] gehel: since you are around. ^^^^ that one will be for you tomorrow :) [21:38:25] * gehel is reading back... [21:38:35] hashar: which one? [21:39:10] (03CR) 10Hashar: "base::certificate has been moved to the profile module but the spec was left behind. So I have added some boiler plate to profile so we c" [puppet] - 10https://gerrit.wikimedia.org/r/340222 (owner: 10Hashar) [21:39:10] !log bsitzmann@tin Started deploy [mobileapps/deploy@872a615]: Update mobileapps to c924126 [21:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:24] gehel: https://gerrit.wikimedia.org/r/#/c/340222/ it is some spec that got broken [21:39:34] gehel: but it is getting late, there is no urgency for that one [21:40:25] I will hopefully send some announcement regarding rspec/puppet tomorrow :-} Gotta polish up some doc first [21:41:10] hashar: ok, will do... [21:41:39] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3059333 (10Papaul) [21:41:43] gehel, stfu maps-test2004? :P [21:42:24] !log bsitzmann@tin Finished deploy [mobileapps/deploy@872a615]: Update mobileapps to c924126 (duration: 03m 14s) [21:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:45] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3012720 (10Papaul) Disks wipe in progress [21:43:11] MaxSem: still down, waiting for the new nodejs 6 version of kartotherian / tilerator... 
[21:43:23] MaxSem: just the icinga downtime expiring [21:46:44] that's what I mean by stfuing it:) [21:47:07] my worry at that time is only that it might wake you up:) [21:50:51] mutante: they're splayed like production, i don't really understand why they get perturbed. [21:54:09] Jeff_Green: gotcha.. that's a bit odd, yea [21:54:28] there's a >5 year old ticket about having puppet retry if it fails :-) [21:55:00] (PS2) Dzahn: install/bast: move tftp server from bast3001 to bast3002 [puppet] - https://gerrit.wikimedia.org/r/340165 (https://phabricator.wikimedia.org/T156506) [21:55:16] https://tickets.puppetlabs.com/browse/PUP-3319 [21:55:52] (CR) Dzahn: [C: 2] install/bast: move tftp server from bast3001 to bast3002 [puppet] - https://gerrit.wikimedia.org/r/340165 (https://phabricator.wikimedia.org/T156506) (owner: Dzahn) [21:55:54] (CR) jerkins-bot: [V: -1] install/bast: move tftp server from bast3001 to bast3002 [puppet] - https://gerrit.wikimedia.org/r/340165 (https://phabricator.wikimedia.org/T156506) (owner: Dzahn) [21:57:40] ah, i thought in our tickets [21:57:54] seemed like the "make it run NICEd" ticket [21:59:07] (CR) Thcipriani: [C: 1] clean.py: Fix up l10nupdate-owned files on masters [mediawiki-config] - https://gerrit.wikimedia.org/r/339035 (owner: Chad) [21:59:16] (PS3) Dzahn: install/bast: move tftp server from bast3001 to bast3002 [puppet] - https://gerrit.wikimedia.org/r/340165 (https://phabricator.wikimedia.org/T156506) [22:00:05] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170227T2200). [22:01:33] (CR) Thcipriani: [C: 1] "lgtm. Should be good to go after a rebase of this patch series."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/339094 (owner: 10Chad) [22:09:30] (03PS4) 10Tim Landscheidt: puppetmaster: Clone repositories in Labs as root [puppet] - 10https://gerrit.wikimedia.org/r/324727 (https://phabricator.wikimedia.org/T152059) [22:10:17] !log otto@tin Started deploy [eventstreams/deploy@2f73b52]: Deploying /?doc swagger-ui endpoint only to scb2001 [22:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:35] !log otto@tin Finished deploy [eventstreams/deploy@2f73b52]: Deploying /?doc swagger-ui endpoint only to scb2001 (duration: 00m 18s) [22:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:25] (03CR) 10Tim Landscheidt: "@Andrew: "Better" – probably, but that does not seem to be easy to accomplish (cf. for example @yuvipanda's b70710bcf3432a5310ae196c9ba026" [puppet] - 10https://gerrit.wikimedia.org/r/324727 (https://phabricator.wikimedia.org/T152059) (owner: 10Tim Landscheidt) [22:16:26] (03CR) 10Dzahn: [C: 032] install/bast: move tftp server from bast3001 to bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340165 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [22:18:37] (03Draft1) 10Paladox: Phabricator: Fix phd not starting up after reboot if it was previously stopped [puppet] - 10https://gerrit.wikimedia.org/r/340242 [22:19:00] (03PS2) 10Paladox: Phabricator: Fix phd not starting up after reboot if it was previously stopped [puppet] - 10https://gerrit.wikimedia.org/r/340242 (https://phabricator.wikimedia.org/T158434) [22:20:18] (03CR) 10Dzahn: "installer files in /srv/tftpboot/ are being deployed on bast3002 by puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/340165 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [22:20:39] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:25:59] (03CR) 10Dzahn: [C: 031] "you could add the explanation why the first line has "=-" and the second does not and that link you had" [puppet] - 10https://gerrit.wikimedia.org/r/340242 (https://phabricator.wikimedia.org/T158434) (owner: 10Paladox) [22:27:39] (03PS3) 10Paladox: Phabricator: Fix phd not starting up after reboot if it was previously stopped [puppet] - 10https://gerrit.wikimedia.org/r/340242 (https://phabricator.wikimedia.org/T158434) [22:28:13] (03CR) 10Paladox: "> you could add the explanation why the first line has "=-" and the" [puppet] - 10https://gerrit.wikimedia.org/r/340242 (https://phabricator.wikimedia.org/T158434) (owner: 10Paladox) [22:34:37] !log otto@tin Started deploy [eventstreams/deploy@76c763e]: Deploying /?doc swagger-ui endpoint only to scb2001 [22:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:54] !log otto@tin Finished deploy [eventstreams/deploy@76c763e]: Deploying /?doc swagger-ui endpoint only to scb2001 (duration: 00m 17s) [22:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:10] !log otto@tin Started deploy [eventstreams/deploy@76c763e]: Deploying swagger-ui /?doc endpoint [22:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:56] !log otto@tin Finished deploy [eventstreams/deploy@76c763e]: Deploying swagger-ui /?doc endpoint (duration: 01m 45s) [22:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:08] (03CR) 10Krinkle: [C: 04-1] contint: drop npm settings for precise [puppet] - 10https://gerrit.wikimedia.org/r/337203 (https://phabricator.wikimedia.org/T158652) (owner: 10Dzahn) [22:42:50] (03PS1) 10Ottomata: Only pipe /v2/stream requests to EventStreams service, everything else can be cached by varnish [puppet] - 10https://gerrit.wikimedia.org/r/340246 (https://phabricator.wikimedia.org/T158066) [22:43:58] 
(CR) Paladox: [C: 1] "Tested more than 4 times and works." [puppet] - https://gerrit.wikimedia.org/r/340242 (https://phabricator.wikimedia.org/T158434) (owner: Paladox) [22:44:18] (CR) Thcipriani: "Feedback inline." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/336901 (owner: Chad) [22:48:39] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:56:55] (PS4) Dzahn: contint: drop npm settings for precise [puppet] - https://gerrit.wikimedia.org/r/337203 (https://phabricator.wikimedia.org/T158652) [23:00:54] (CR) Chad: [C: 2] clean.py: Fix up l10nupdate-owned files on masters [mediawiki-config] - https://gerrit.wikimedia.org/r/339035 (owner: Chad) [23:02:26] (Merged) jenkins-bot: clean.py: Fix up l10nupdate-owned files on masters [mediawiki-config] - https://gerrit.wikimedia.org/r/339035 (owner: Chad) [23:02:29] (CR) Krinkle: [C: 1] contint: drop npm settings for precise [puppet] - https://gerrit.wikimedia.org/r/337203 (https://phabricator.wikimedia.org/T158652) (owner: Dzahn) [23:02:33] (CR) jenkins-bot: clean.py: Fix up l10nupdate-owned files on masters [mediawiki-config] - https://gerrit.wikimedia.org/r/339035 (owner: Chad) [23:07:12] (CR) Chad: "Ah, turns out that since Git 1.7.0 we can do `git push origin --delete foo` But really, the end result is the same :)" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/336901 (owner: Chad) [23:07:51] (PS2) Chad: clean.py: Minor abstraction of param handling [mediawiki-config] - https://gerrit.wikimedia.org/r/339094 [23:13:35] (CR) Chad: [C: 2] clean.py: Minor abstraction of param handling [mediawiki-config] - https://gerrit.wikimedia.org/r/339094 (owner: Chad) [23:14:44] (Merged) jenkins-bot: clean.py: Minor abstraction of param handling [mediawiki-config] - https://gerrit.wikimedia.org/r/339094 (owner: Chad) [23:14:58] (CR)
10jenkins-bot: clean.py: Minor abstraction of param handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339094 (owner: 10Chad) [23:15:37] (03PS3) 10Chad: Enable Dashiki extension on meta.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336444 (https://phabricator.wikimedia.org/T156971) (owner: 10Milimetric) [23:17:12] !log demon@tin Synchronized scap/plugins/clean.py: no-op (duration: 00m 48s) [23:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:12] thx very much RainbowSprinkles, are you doing this now? I can stick around no prob [23:23:08] (03PS1) 10Chad: Scap clean: abort if a branch is still in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340250 [23:23:34] milimetric: Yeah let's do it now [23:23:36] before swat [23:23:38] jouncebot: now [23:23:38] For the next 0 hour(s) and 36 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170227T2200) [23:24:27] (03CR) 10Chad: [C: 032] Enable Dashiki extension on meta.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336444 (https://phabricator.wikimedia.org/T156971) (owner: 10Milimetric) [23:25:48] (03Merged) 10jenkins-bot: Enable Dashiki extension on meta.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336444 (https://phabricator.wikimedia.org/T156971) (owner: 10Milimetric) [23:27:22] (03CR) 10jenkins-bot: Enable Dashiki extension on meta.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336444 (https://phabricator.wikimedia.org/T156971) (owner: 10Milimetric) [23:28:38] (03PS3) 10Chad: Scap clean: Automate purging of old deployment branches from gerrit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336901 [23:29:56] !log demon@tin Started scap: Enabling Dashiki on meta [23:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:10] milimetric: Doing full scap since we need l10n rebuild 
[23:30:19] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:30:30] I'm sorry RainbowSprinkles this looks like a lot of work for a little extension [23:30:40] should I have just gotten deploy privileges and done this myself? [23:30:57] I mean you're more than welcome to apply for deploy privileges :) [23:31:08] It's not hard, I've just been too busy to finish what I promised to help with [23:32:06] At this point, it's sitting, waiting, and making sure things go fine :) [23:32:38] It shouldn't take terribly long, things should be reasonably sync'd right now :) [23:32:51] (03PS7) 10Ppchelko: Enable local logging for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/339501 (https://phabricator.wikimedia.org/T112648) [23:35:22] now I have some context on why https://meta.wikimedia.org/wiki/Special:Version gets a lot of hits :) [23:47:09] PROBLEM - Nginx local proxy to apache on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:47:19] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:47:30] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:14] (03Abandoned) 10Chad: Stop using MWMinimalScriptInit [puppet] - 10https://gerrit.wikimedia.org/r/332673 (owner: 10Chad) [23:48:20] (03Abandoned) 10Chad: Remove MWMinimalScriptInit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332674 (owner: 10Chad) [23:48:45] milimetric: Home stretch [23:49:07] * milimetric roots for a slide-in finish [23:49:10] (03CR) 10Thcipriani: [C: 04-1] Scap clean: abort if a branch is still in use (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340250 (owner: 10Chad) [23:50:42] !log demon@tin Finished scap: Enabling Dashiki on meta (duration: 20m 46s) [23:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:39] yay, thanks RainbowSprinkles, it works: 
https://meta.wikimedia.org/wiki/Config:Dashiki:Sample/tabs [23:53:08] (CR) Chad: Scap clean: abort if a branch is still in use (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/340250 (owner: Chad) [23:53:43] milimetric: Awesome [23:54:12] thanks very much, you're great. Next deployment I'll just apply for permissions [23:55:44] You're very welcome :) [23:56:58] (PS1) Kaldari: Enable editmyoptions right for all users on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/340258 (https://phabricator.wikimedia.org/T158871)