[00:01:53] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [00:35:43] PROBLEM - MegaRAID on heze is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [00:35:54] ACKNOWLEDGEMENT - MegaRAID on heze is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T163087 [00:35:57] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3185569 (10ops-monitoring-bot) [00:45:04] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:13:04] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [01:17:38] (03CR) 10TTO: "I'll schedule this for SWAT at some point" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347545 (https://phabricator.wikimedia.org/T159416) (owner: 10TTO) [01:52:04] PROBLEM - puppet last run on wtp1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:20:04] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [03:02:24] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: /srv/deployment/ocg/output 7635 MB (3% inode=96%) [03:58:24] RECOVERY - Disk space on ocg1003 is OK: DISK OK [04:15:54] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=297.00 Read Requests/Sec=2045.40 Write Requests/Sec=0.30 KBytes Read/Sec=35473.60 KBytes_Written/Sec=6.00 [04:21:54] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=15.60 Read Requests/Sec=0.10 Write Requests/Sec=1.00 KBytes Read/Sec=0.40 KBytes_Written/Sec=17.60 [04:38:24] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [04:39:04] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused [04:39:14] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:39:24] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [04:41:54] PROBLEM - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.138 and port 9042: Connection refused [04:42:14] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[04:42:24] PROBLEM - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [04:42:25] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [05:01:24] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [05:02:14] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [05:03:14] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [05:03:24] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active [05:03:25] RECOVERY - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-b valid until 2017-09-12 15:35:25 +0000 (expires in 148 days) [05:03:54] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on 10.192.32.138 port 9042 [05:04:34] RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2017-09-12 15:36:12 +0000 (expires in 148 days) [05:05:04] RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.56 port 9042 [05:57:11] (03PS2) 10Marostegui: Revert "mysql-predump.erb: Reduce the number of jobs" [puppet] - 10https://gerrit.wikimedia.org/r/347996 (owner: 10Jcrespo) [06:00:15] (03CR) 10Marostegui: [C: 032] "Going to deploy this, to speed up the backups. Let's see if we see the alerts coming back then" [puppet] - 10https://gerrit.wikimedia.org/r/347996 (owner: 10Jcrespo) [06:28:04] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:56:04] (03CR) 10Marostegui: [C: 031] Create view for "linter" table on Labs [puppet] - 10https://gerrit.wikimedia.org/r/348201 (https://phabricator.wikimedia.org/T160611) (owner: 10Legoktm) [06:58:04] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:17:08] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3185829 (10Marostegui) I am moving this to "Done" on the DBA workboard as it is done from our side along with: T162290 so it is easier for... [07:19:58] (03PS1) 10Ladsgroup: Don't let Wikibase instances read/write terms_full_entity_id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348413 (https://phabricator.wikimedia.org/T159851) [07:36:54] PROBLEM - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.138 and port 9042: Connection refused [07:37:14] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:37:24] PROBLEM - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [07:37:25] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [07:38:24] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [07:39:04] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused [07:39:14] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[07:39:24] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [07:56:25] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3185844 (10Marostegui) 05stalled>03Resolved Looks like he hasn't come back again: https://grafana.wikimedia.org/dashboard/db/... [08:01:24] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [08:02:14] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [08:02:54] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.138 port 9042 [08:03:14] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [08:03:24] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active [08:03:25] RECOVERY - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-b valid until 2017-09-12 15:35:25 +0000 (expires in 148 days) [08:04:04] RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.56 port 9042 [08:04:25] RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2017-09-12 15:36:12 +0000 (expires in 148 days) [08:07:04] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:07:54] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [08:08:54] PROBLEM - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.138 and port 9042: Connection refused [08:09:04] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused [08:09:14] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:09:14] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:09:24] PROBLEM - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:09:24] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:09:25] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [08:09:25] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:19:24] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active [08:20:14] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [08:20:34] RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2017-09-12 15:36:12 +0000 (expires in 148 days) [08:22:03] (ran puppet on restbase2009 and now on 2004) [08:22:14] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [08:22:24] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [08:22:54] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 0.036 second
response time on 10.192.32.138 port 9042 [08:23:01] Cc: urandom --^ [08:23:14] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:23:24] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:23:24] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:23:34] RECOVERY - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-b valid until 2017-09-12 15:35:25 +0000 (expires in 148 days) [08:25:54] PROBLEM - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.138 and port 9042: Connection refused [08:26:14] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:26:24] PROBLEM - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:26:24] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [08:27:45] (03PS1) 10Marostegui: templates/wmnet: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) [08:28:11] (03CR) 10Marostegui: [C: 04-2] "Do not submit until the DC switchover is done" [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui) [08:31:14] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [08:31:24] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [08:33:24] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active [08:34:14] RECOVERY - Check systemd state on restbase2009 is OK: OK - 
running: The system is fully operational [08:35:14] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:35:24] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [08:37:14] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:37:24] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:50:24] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [08:51:14] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [08:51:14] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [08:51:24] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active [08:52:04] RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 1.034 second response time on 10.192.48.56 port 9042 [08:52:30] ran puppet again on both at the same time [08:52:54] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on 10.192.32.138 port 9042 [08:52:54] PROBLEM - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [08:54:14] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:54:22] nope [08:54:24] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:55:04] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused [08:55:14] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:55:25] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [08:55:54] PROBLEM - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.138 and port 9042: Connection refused [09:01:14] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [09:01:24] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [09:03:14] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [09:03:24] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active [09:03:44] PROBLEM - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [09:04:04] RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.56 port 9042 [09:04:24] RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2017-09-12 15:36:12 +0000 (expires in 148 days) [09:06:14] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
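The cassandra-b and cassandra-c alerts above repeat the same PROBLEM/RECOVERY cycle many times. A minimal sketch of how such flapping could be counted from a log like this one is shown here. The regular expression is inferred from the alert lines in this log only, and `count_problems` is a hypothetical helper, not part of Icinga or any bot used in the channel.

```python
import re
from collections import Counter

# Pattern inferred from the alert lines in this log (an assumption):
# "[HH:MM:SS] PROBLEM - <check> on <host> is <STATE>:"
ALERT_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] (PROBLEM|RECOVERY) - (.+?) on (\S+) is (\w+):"
)


def count_problems(log_text):
    """Count PROBLEM notifications per (check, host) pair (hypothetical helper)."""
    counts = Counter()
    for _time, kind, check, host, _state in ALERT_RE.findall(log_text):
        if kind == "PROBLEM":
            counts[(check, host)] += 1
    return counts


# Three entries copied from the log above, joined on one line as in the source.
sample = (
    "[08:35:24] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: "
    "CRITICAL - Expecting active but unit cassandra-b is failed "
    "[08:50:24] RECOVERY - cassandra-b service on restbase2004 is OK: "
    "OK - cassandra-b is active "
    "[08:55:25] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: "
    "CRITICAL - Expecting active but unit cassandra-b is failed "
)
flaps = count_problems(sample)
```

A pair whose count keeps growing over a short window is a candidate for silencing or deeper investigation rather than another restart.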
[09:06:24] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [09:11:04] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused [09:11:14] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:11:24] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [09:11:24] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:31:14] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [09:31:24] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [09:31:54] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 1.032 second response time on 10.192.32.138 port 9042 [09:32:44] RECOVERY - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-b valid until 2017-09-12 15:35:25 +0000 (expires in 148 days) [09:33:02] !log Silence alerts for restbase2004 and restbase2009 T160759 [09:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:11] T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759 [09:33:14] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [09:33:24] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active [09:34:04] RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.56 port 9042 [09:34:24] RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2017-09-12 15:36:12 +0000 (expires in 148 days) [09:54:24] 
PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: /srv/deployment/ocg/output 7643 MB (3% inode=96%) [10:02:24] RECOVERY - Disk space on ocg1003 is OK: DISK OK [10:37:37] 06Operations, 10DBA: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3186029 (10Marostegui) Any news from Dell? Thanks! [13:32:04] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:30] 06Operations, 10Traffic: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3186231 (10BBlack) [13:51:11] 06Operations, 10DBA, 10Traffic: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3182213 (10BBlack) tendril.wikimedia.org is independent of varnish, only dbtree.wikimedia.org (that we're talking about here) goes through the standard varnish stuff (although arguably tendril should b... [13:52:02] 06Operations, 10DBA, 10Traffic: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3186237 (10BBlack) (also, generally speaking errors aren't cached, but in this case the error would be cached, because it's returned with a 200 status code...) 
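The point in the last comment above (an error page gets cached when it is served with a 200 status) can be illustrated with a small sketch. The policy below is a deliberately simplified assumption, not Varnish's actual cacheability logic, and `is_cacheable` is a hypothetical name.

```python
def is_cacheable(status_code):
    """Hypothetical, simplified policy: cache 2xx responses, never errors."""
    return 200 <= status_code < 300


# Two responses carrying the same error page body.
honest_error = {"status": 503, "body": "database error"}
masked_error = {"status": 200, "body": "database error"}

# Only the status code is consulted, so the masked error page is cached
# and may be served to later users until it expires.
cached = [r for r in (honest_error, masked_error) if is_cacheable(r["status"])]
```

Under this assumption the cache has no way to tell a real page from an error page once both carry status 200, which is why returning a proper error status from the backend is the usual fix.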
[13:58:19] (03PS1) 10BBlack: dbtree: split backend from noc.wm.o, make eqiad-only [puppet] - 10https://gerrit.wikimedia.org/r/348456 (https://phabricator.wikimedia.org/T162976) [14:00:04] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [14:03:30] (03CR) 10BBlack: [C: 032] dbtree: split backend from noc.wm.o, make eqiad-only [puppet] - 10https://gerrit.wikimedia.org/r/348456 (https://phabricator.wikimedia.org/T162976) (owner: 10BBlack) [14:27:44] PROBLEM - Check Varnish expiry mailbox lag on cp1073 is CRITICAL: CRITICAL: expiry mailbox lag is 627011 [14:55:22] 06Operations, 10Traffic: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3186304 (10BBlack) [14:57:54] (03PS1) 10Cmjohnson: adding mgmt dns entries for db1096-1106 T162233 [dns] - 10https://gerrit.wikimedia.org/r/348467 [14:58:45] (03CR) 10Cmjohnson: [C: 032] adding mgmt dns entries for db1096-1106 T162233 [dns] - 10https://gerrit.wikimedia.org/r/348467 (owner: 10Cmjohnson) [15:13:54] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 613388 [15:25:22] (03PS1) 10Jdlrobson: Correctly enforce config for related pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348472 (https://phabricator.wikimedia.org/T163114) [15:31:04] !log mobrovac@tin Started deploy [restbase/deploy@6595298]: (no justification provided) [15:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:46] !log mobrovac@tin Finished deploy [restbase/deploy@6595298]: (no justification provided) (duration: 01m 42s) [15:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:16] !log mobrovac@tin Started deploy [restbase/deploy@6595298]: (no justification provided) [15:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:54] RECOVERY - Check Varnish expiry 
mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 28959 [15:34:45] !log mobrovac@tin Finished deploy [restbase/deploy@6595298]: (no justification provided) (duration: 01m 29s) [15:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:18] !log mobrovac@tin Started deploy [restbase/deploy@6595298]: Update client caching headers for T161284 [15:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:25] T161284: Minimise incidental HTTP requests caused by Page Previews - https://phabricator.wikimedia.org/T161284 [15:43:11] (03PS1) 10RobH: setting naos (new codfw deploy host) ipv6 dns [dns] - 10https://gerrit.wikimedia.org/r/348474 [15:46:03] (03CR) 10Dzahn: [C: 031] setting naos (new codfw deploy host) ipv6 dns [dns] - 10https://gerrit.wikimedia.org/r/348474 (owner: 10RobH) [15:46:13] (03CR) 10RobH: [C: 032] setting naos (new codfw deploy host) ipv6 dns [dns] - 10https://gerrit.wikimedia.org/r/348474 (owner: 10RobH) [15:47:14] (03PS1) 10Thcipriani: Scap: canaries should include INFO-level messages [puppet] - 10https://gerrit.wikimedia.org/r/348475 (https://phabricator.wikimedia.org/T162974) [15:48:33] !log mobrovac@tin Finished deploy [restbase/deploy@6595298]: Update client caching headers for T161284 (duration: 08m 15s) [15:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:40] T161284: Minimise incidental HTTP requests caused by Page Previews - https://phabricator.wikimedia.org/T161284 [15:48:46] (03CR) 10Dzahn: "looks good to me, now if you could just amend the commit message to reflect what it's doing." 
[puppet] - 10https://gerrit.wikimedia.org/r/348165 (owner: 10Paladox) [15:49:59] (03PS1) 10BBlack: remove long-unused frontend-hooks file [puppet] - 10https://gerrit.wikimedia.org/r/348476 [15:50:02] (03PS1) 10BBlack: use synthetic warning for 1% of 3DES pageviews [puppet] - 10https://gerrit.wikimedia.org/r/348477 (https://phabricator.wikimedia.org/T147199) [15:50:11] (03PS4) 10Paladox: Phabricator: Add additial check just for checking phd [puppet] - 10https://gerrit.wikimedia.org/r/348165 [15:50:37] (03PS5) 10Paladox: Phabricator: Add additial check just for checking phd [puppet] - 10https://gerrit.wikimedia.org/r/348165 [15:52:14] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:42] (03CR) 10Dzahn: [C: 032] "thanks, this will now also warn if PHD is running but not as the right user, or if other processes are running as PHD but not the deamon i" [puppet] - 10https://gerrit.wikimedia.org/r/348165 (owner: 10Paladox) [15:53:13] (03PS1) 10RobH: adds naos to everywhere mira is listed [puppet] - 10https://gerrit.wikimedia.org/r/348478 [15:53:24] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: /srv/deployment/ocg/output 7643 MB (3% inode=96%) [15:53:36] hi [15:53:56] does anyone know why the epcoordinator user group can't be granted? [15:54:16] (03CR) 10BBlack: [C: 032] remove long-unused frontend-hooks file [puppet] - 10https://gerrit.wikimedia.org/r/348476 (owner: 10BBlack) [15:54:21] (03PS2) 10BBlack: remove long-unused frontend-hooks file [puppet] - 10https://gerrit.wikimedia.org/r/348476 [15:54:23] local admin can't grant it anymore [15:54:24] (03CR) 10BBlack: [V: 032 C: 032] remove long-unused frontend-hooks file [puppet] - 10https://gerrit.wikimedia.org/r/348476 (owner: 10BBlack) [15:54:59] mutante: you can merge mine, it's a safe no-op [15:57:36] bblack: ok! 
done, thanks [15:58:03] good reminder that typing "yes" isn't enough anymore, heh [15:58:19] it made me type the word "multiple" yea [15:58:39] 06Operations, 10Traffic, 13Patch-For-Review: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3186491 (10BBlack) [15:59:14] 06Operations, 10ops-codfw, 13Patch-For-Review: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3186494 (10RobH) [16:01:20] 06Operations, 10ops-codfw, 13Patch-For-Review: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3186495 (10RobH) So my latest patchset https://phabricator.wikimedia.org/T162900 has appending in naos for everything that has mira. I'm not 100% on some of the files,... [16:02:53] (03PS2) 10RobH: adds naos to everywhere mira is listed [puppet] - 10https://gerrit.wikimedia.org/r/348478 [16:04:59] (03CR) 10Dzahn: "+1 for network/constants and tcpircbot. those are definitely needed and could just be merged." [puppet] - 10https://gerrit.wikimedia.org/r/348478 (owner: 10RobH) [16:06:24] RECOVERY - Disk space on ocg1003 is OK: DISK OK [16:07:55] (03CR) 10Marostegui: [C: 031] "Looks good, let me know when you want the new grants to be added to silver's mysql" [puppet] - 10https://gerrit.wikimedia.org/r/348478 (owner: 10RobH) [16:08:21] marostegui: would you want to do that via task or what? [16:08:43] cuz i didn't really know it had to happen manually =] [16:09:06] robh: it doesn't matter, when you merge I can do it :) [16:09:22] I assume you will not merge and deploy it today, no? [16:09:22] cool, i figured i'd merge tomorrow after more folks can review it [16:09:26] sure :) [16:10:06] my impression is they wanted this box rather than the broken mira for the dc switchover, heh [16:10:20] so, it's barely there on time! [16:11:28] Like tempdb2001! You are mastering the last-minute hw requests! impressive!
[16:12:00] just under the line is still under the line! [16:16:10] (03CR) 10RobH: [C: 031] "I'm pretty sure its all good, except the dsh and ircbot additions. Those may require the bot be refreshed, and I'm not sure if adding thi" [puppet] - 10https://gerrit.wikimedia.org/r/348478 (owner: 10RobH) [16:16:31] (03CR) 10Dzahn: [C: 031] "the salt change is harmless too. this should all be good to go, except _maybe_ the scap change" [puppet] - 10https://gerrit.wikimedia.org/r/348478 (owner: 10RobH) [16:21:14] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:21:23] 06Operations, 06Labs, 13Patch-For-Review: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3186744 (10Andrew) 05Open>03Resolved Five days without leaks or failures, so I think this is resolved. [17:27:47] RECOVERY - Check Varnish expiry mailbox lag on cp1073 is OK: OK: expiry mailbox lag is 683 [17:43:59] 06Operations, 10ops-eqiad: rack and setup boron replacement frpm1001 - https://phabricator.wikimedia.org/T162298#3186860 (10Jgreen) [17:47:37] 06Operations, 10ops-eqiad: rack and cable frlog1001 - https://phabricator.wikimedia.org/T163127#3186886 (10Jgreen) [18:00:26] akosiaris, paladox, this patch is causing me issues on a labs instance: https://gerrit.wikimedia.org/r/#/c/347518/ [18:00:34] The failure is "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Service unit ircecho has a systemd script but nothing useful for upstart at /etc/puppet/modules/base/manifests/service_unit.pp:90 on node shinken-01.shinken.eqiad.wmflabs" [18:00:45] Is that because that patch only works on jessie and breaks trusty? 
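The catalog error quoted above says the service ships a systemd unit but nothing for upstart. A rough sketch of that kind of check follows. It is not the real base::service_unit implementation; `pick_init_script`, `NoInitScriptError` and `DEFAULT_INIT` are hypothetical names, and the mapping only records the default init systems of the two releases mentioned (systemd on Debian jessie, upstart on Ubuntu trusty).

```python
class NoInitScriptError(Exception):
    """Raised when a service has no script for the host's init system."""


# Default init system per release, limited to the two releases discussed here.
DEFAULT_INIT = {"jessie": "systemd", "trusty": "upstart"}


def pick_init_script(service, provided, host_init):
    """Return the script name for host_init, or raise if none is provided.

    `provided` is the set of init systems the service ships scripts for.
    """
    if host_init in provided:
        return f"{service}.{host_init}"
    raise NoInitScriptError(
        f"Service unit {service} has a {', '.join(sorted(provided))} script "
        f"but nothing useful for {host_init}"
    )


# A service that only ships a systemd unit works on jessie ...
jessie_script = pick_init_script("ircecho", {"systemd"}, DEFAULT_INIT["jessie"])

# ... and fails on trusty, mirroring the error reported above.
try:
    pick_init_script("ircecho", {"systemd"}, DEFAULT_INIT["trusty"])
    trusty_failed = False
except NoInitScriptError:
    trusty_failed = True
```

Shipping a second script for the older init system, as the follow-up discussion proposes, makes the second call succeed as well.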
[18:02:53] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3186978 (10Halfak) FYI: Most of the discussion has taken place in T159838 [18:06:58] andrewbogott [18:06:59] ah [18:07:01] Yeah [18:07:08] systemd does not work on trusty [18:07:15] we can re-add the trusty version [18:07:27] shall I revert that whole patch, or do you want to add a special case to the existing code? [18:07:44] andrewbogott it should be easy to fix [18:07:55] Though I won't be able to test on a trusty instance [18:12:44] (03Draft1) 10Paladox: ircecho: Fix support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/348502 [18:12:47] (03PS2) 10BBlack: use synthetic warning for 1% of 3DES pageviews [puppet] - 10https://gerrit.wikimedia.org/r/348477 (https://phabricator.wikimedia.org/T147199) [18:13:03] (03PS2) 10Paladox: ircecho: Fix support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/348502 [18:13:15] andrewbogott ^^ [18:14:11] I think that will fix it :) [18:16:07] (03PS3) 10Paladox: ircecho: Fix support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/348502 (https://phabricator.wikimedia.org/T163129) [18:17:19] andrewbogott is this https://gerrit.wikimedia.org/r/#/c/348502/3/modules/ircecho/templates/initscripts/ircecho.sysvinit.erb an upstart script or sysvinit? [18:18:53] that's sysvinit, it seems.
Comparing it to https://github.com/wikimedia/puppet/blob/production/modules/phabricator/templates/initscripts/phd.upstart.erb [18:19:42] (03CR) 10BBlack: [C: 032] use synthetic warning for 1% of 3DES pageviews [puppet] - 10https://gerrit.wikimedia.org/r/348477 (https://phabricator.wikimedia.org/T147199) (owner: 10BBlack) [18:31:18] (03PS4) 10Paladox: ircecho: Fix support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/348502 (https://phabricator.wikimedia.org/T163129) [18:35:13] paladox: is that just the same ircecho.sysvinit.erb that was deleted in the earlier patch? [18:35:20] Yep [18:35:57] hey, can i help with that ircecho systemd? [18:36:00] just got back [18:36:06] i think i once made that [18:36:11] (03CR) 10Andrew Bogott: [C: 032] ircecho: Fix support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/348502 (https://phabricator.wikimedia.org/T163129) (owner: 10Paladox) [18:36:15] ah:) [18:36:39] mutante yep Didn't realise that we still had trusty instances using the class. I've re added trusty support :) [18:37:08] andrewbogott thanks :) [18:37:35] looks at patches... gotcha! [18:37:43] :) [18:38:22] paladox: which instance was it [18:38:33] mutante the labs shinken instance [18:38:56] ah, ok. thx [18:39:03] mutante, paladox, it's still broken, full story is on https://phabricator.wikimedia.org/T163129 [18:39:28] ah [18:39:48] wait i think i know why it's failing now. 
[18:39:49] andrewbogott: oh, ok. looking [18:40:12] file { '/etc/init.d/ircecho': [18:40:13] ensure => absent, [18:40:14] owner => 'root', [18:40:15] group => 'root', [18:40:16] mode => '0544', [18:40:18] } [18:40:18] It's because i need to remove that [18:40:21] patch incoming :) [18:40:40] that sounds right, yea [18:41:54] (03Draft1) 10Paladox: ircecho: Remove file { '/etc/init.d/ircecho': code [puppet] - 10https://gerrit.wikimedia.org/r/348511 (https://phabricator.wikimedia.org/T163129) [18:41:57] (03PS2) 10Paladox: ircecho: Remove file { '/etc/init.d/ircecho': code [puppet] - 10https://gerrit.wikimedia.org/r/348511 (https://phabricator.wikimedia.org/T163129) [18:42:00] that's because you were too nice about trying to have puppet remove that file :p [18:42:11] andrewbogott mutante ^^ [18:42:12] :) [18:42:29] yep [18:42:38] ok [18:42:41] (03CR) 10Dzahn: [C: 031] ircecho: Remove file { '/etc/init.d/ircecho': code [puppet] - 10https://gerrit.wikimedia.org/r/348511 (https://phabricator.wikimedia.org/T163129) (owner: 10Paladox) [18:42:56] that would cause the duplicate resource, yep [18:43:09] yep [18:44:00] note how that is an ensure "absent" anyways [18:44:15] (03CR) 10Andrew Bogott: [C: 032] ircecho: Remove file { '/etc/init.d/ircecho': code [puppet] - 10https://gerrit.wikimedia.org/r/348511 (https://phabricator.wikimedia.org/T163129) (owner: 10Paladox) [18:44:42] yep [18:44:43] :) [18:44:45] thanks [18:45:06] !log maxsem@tin Started scap: https://gerrit.wikimedia.org/r/#/c/348224/ to test hosts only [18:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:25] !log maxsem@tin scap aborted: https://gerrit.wikimedia.org/r/#/c/348224/ to test hosts only (duration: 00m 19s) [18:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:56] that was supposed to be tilerator only.
facepalm [18:46:10] !log maxsem@tin Started deploy [tilerator/deploy@001811e]: https://gerrit.wikimedia.org/r/#/c/348224/ to test hosts only [18:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:29] !log maxsem@tin Finished deploy [tilerator/deploy@001811e]: https://gerrit.wikimedia.org/r/#/c/348224/ to test hosts only (duration: 00m 19s) [18:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:11] Operations, Traffic, Patch-For-Review: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3187214 (BBlack) a:BBlack I've deployed the change above, which gets all of the basics on track for how we want to operate the real campaign.... [18:47:12] paladox, mutante, ok, that made my instance happy. Hopefully things are still working for the jessie cases :) [18:47:21] Thanks :) [18:47:45] yep systemd will always override any other unit on debian jessie i think. [18:48:11] that's if there's a systemd script there. [18:48:20] paladox: which instance has it with jessie? [18:48:35] mutante i presume the prod ircecho [18:48:42] the one icinga-wm is running on [18:48:51] paladox: well, you had one for your tests when you converted it [18:49:08] Oh yep. I tested it on gerrit-mysql and jenkins-slave-01 [18:49:16] :) that's what i meant [18:49:17] they are both jessie [18:49:19] ok [18:49:20] :) [18:49:24] but i am also checking prod [18:50:11] ok [18:58:08] !log demon@tin Pruned MediaWiki: 1.29.0-wmf.15 (duration: 00m 14s) [18:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:21] andrewbogott: i double checked icinga prod, which has jessie ircecho. looks all good. the systemd unit files are just directly in /lib/systemd/system/ now because base::service_unit is used. before that i would put them in /etc/systemd/system/ and that's where i was looking for it at first.
the nice way would be symlinks from /etc to /lib [18:58:55] mutante: thanks for checking. [18:58:58] the trend is we are supposed to use base::service_unit for all [18:59:02] yw [19:02:44] !log demon@tin Synchronized wmf-config/: Pruning some old extension message files, co-master sync (duration: 01m 52s) [19:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:44] (CR) Chad: [C: -1] "Needs addition of a naos.yaml to hieradata. Copy+pasting from mira.yaml will work." [puppet] - https://gerrit.wikimedia.org/r/348478 (owner: RobH) [19:11:57] RainbowSprinkles: thx for review, will fix post lunch =] [19:12:40] yw. Thank you from thcipriani mostly. He's the one who mentioned it to me :p [19:12:51] Mostly going over it from the perspective of https://wikitech.wikimedia.org/wiki/Incident_documentation/20160202-deployment-server-loss [19:12:58] (THERE BE DRAGONS!) [19:16:37] !log tegmen test ircecho stop/start service to confirm it's fine on jessie/prod icinga role (that's the passive server) [19:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:53] Operations, DBA, Traffic, Patch-For-Review: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3187340 (Dzahn) seems like we have 2 follow-ups: - make dbtree not use status code 200 for an error page - make wasat a working dbtree backend, then add it back to varnish dire... [19:22:26] Operations, DBA, Traffic, Patch-For-Review: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3187342 (Dzahn) @kaldari @bd808 Does dbtree work for you again? I am wondering if this ticket can be called resolved (if we follow-up with the things above i suppose). [19:23:02] (CR) Legoktm: "I think we should be going with the first approach on T156924#3138751 for now and do proper integration with wgConfigRegistry etc.
after t" [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling) [19:31:39] (PS3) RobH: adds naos to everywhere mira is listed [puppet] - https://gerrit.wikimedia.org/r/348478 [19:31:56] (PS4) RobH: adds naos to everywhere mira is listed [puppet] - https://gerrit.wikimedia.org/r/348478 [19:45:14] Operations, DBA, Traffic, Patch-For-Review: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3187405 (bd808) >>! In T162976#3187342, @Dzahn wrote: > @kaldari @bd808 Does dbtree work for you again? I am wondering if this ticket can be called resolved (if we follow-up wit... [19:45:14] (CR) Krinkle: [C: 1] Scap: canaries should include INFO-level messages [puppet] - https://gerrit.wikimedia.org/r/348475 (https://phabricator.wikimedia.org/T162974) (owner: Thcipriani) [19:48:30] Operations, DBA, Traffic, Patch-For-Review: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3187423 (BBlack) Yeah the patch I deployed above should have fixed the issue in this ticket. Both of the suggested followups would be ideal, but probably aren't pressing at thi... [19:48:41] (CR) Krinkle: Move contribution tracking config to CommonSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: Chad) [19:50:39] Operations, DBA, Traffic, Patch-For-Review: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3187431 (Dzahn) Open>Resolved a:Dzahn [19:50:55] Operations, DBA, Traffic, Patch-For-Review: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3182213 (Dzahn) a:Dzahn>None [19:52:15] Operations, DBA, Traffic, Patch-For-Review: dbtree broken (for some users?)
- https://phabricator.wikimedia.org/T162976#3182213 (Dzahn) a:BBlack [19:57:07] Operations, DBA, Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3187493 (Dzahn) [20:01:19] Operations, DBA: dbtree: don't return 200 on error pages - https://phabricator.wikimedia.org/T163143#3187529 (Dzahn) [20:29:05] (PS2) Andrew Bogott: wmfkeystonehooks: Work around a keystone bug with role removal [puppet] - https://gerrit.wikimedia.org/r/348135 (https://phabricator.wikimedia.org/T162615) [20:30:35] (CR) Andrew Bogott: [C: 2] wmfkeystonehooks: Work around a keystone bug with role removal [puppet] - https://gerrit.wikimedia.org/r/348135 (https://phabricator.wikimedia.org/T162615) (owner: Andrew Bogott) [20:30:50] (CR) Dzahn: "@Aklapper you guessed right. i tested and "ERROR 1044 (42000): Access denied for user 'phstats'@'10.64.32.150' to database 'phabricator_di" [puppet] - https://gerrit.wikimedia.org/r/348238 (owner: Aklapper) [20:35:40] (PS1) Dzahn: mariadb: grant user 'phstats' additional select on differential db [puppet] - https://gerrit.wikimedia.org/r/348565 [20:36:57] (PS2) Dzahn: mariadb: grant user 'phstats' additional select on differential db [puppet] - https://gerrit.wikimedia.org/r/348565 [20:37:47] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:38:36] (CR) Dzahn: mariadb: grant user 'phstats' additional select on differential db (1 comment) [puppet] - https://gerrit.wikimedia.org/r/348565 (owner: Dzahn) [20:39:57] (CR) Dzahn: "https://gerrit.wikimedia.org/r/#/c/348565/ should unblock this" [puppet] - https://gerrit.wikimedia.org/r/348238 (owner: Aklapper) [20:45:08] Operations: reinstall rcs100[12] with RAID - https://phabricator.wikimedia.org/T140441#3187779 (Dzahn) @ottomata is rcstream going to be deprecated?
Should it move to virtual machines? [20:55:20] (PS1) BBlack: debian patch: main source to nginx-1.11.13 [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348585 [20:55:22] (PS1) BBlack: debian patches: forward-port WMF patches and quilt refresh [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348586 [20:55:24] (PS1) BBlack: control: depend on libssl11-dev [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348587 [20:55:26] (PS1) BBlack: Lua module: OpenSSL-1.1 compat fixup [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348588 [20:55:28] (PS1) BBlack: Create nginx-{full,light,extras}-dbg by hand. [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348589 [20:55:30] (PS1) BBlack: build: remove --with-ipv6 (removed upstream) [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348590 [20:55:32] (PS1) BBlack: nginx (1.11.13-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348591 [21:01:27] PROBLEM - puppet last run on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:02:27] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 55 minutes ago with 0 failures [21:09:53] Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#2464932 (RobH) I'd suggest the following: Right now netboot has the following: rdb100[1-6]) echo partman/mw.cfg ;; \ rdb100[7-8]) echo partman/raid1.cfg ;; \ So it seems some of these hosts have the H... [21:10:17] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [21:16:48] looking at the tin puppet fail on tin [21:29:01] (PS1) BryanDavis: labs: make labs_lvm compatible with role::toollabs::docker::builder [puppet] - https://gerrit.wikimedia.org/r/348624 [21:29:42] (CR) BryanDavis: [C: -1] "Will test via cherry-pick on tools-puppetmaster-02" [puppet] - https://gerrit.wikimedia.org/r/348624 (owner: BryanDavis) [21:29:57] (PS1) Milimetric: Remove redundant Dashiki config [mediawiki-config] - https://gerrit.wikimedia.org/r/348625 [21:30:10] (PS2) Milimetric: Remove redundant Dashiki config [mediawiki-config] - https://gerrit.wikimedia.org/r/348625 [21:31:21] (CR) Milimetric: [C: 2] "Tested this setup locally and in Vagrant, will be around to check after it deploys." [mediawiki-config] - https://gerrit.wikimedia.org/r/348625 (owner: Milimetric) [21:32:48] (CR) jenkins-bot: Remove redundant Dashiki config [mediawiki-config] - https://gerrit.wikimedia.org/r/348625 (owner: Milimetric) [21:45:55] (CR) Andrew Bogott: [C: -1] "This breaks tools puppet runs, like so:" [puppet] - https://gerrit.wikimedia.org/r/348624 (owner: BryanDavis) [21:46:29] Operations: acpi_pad consuming 100% CPU on tin - https://phabricator.wikimedia.org/T163158#3187979 (Catrope) [21:47:44] Operations, Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3177068 (Catrope) tin is also affected, its kernel version appears to be 4.4.0-3-amd64 . [21:49:17] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: /srv/deployment/ocg/output 7606 MB (3% inode=96%) [21:55:45] Operations: acpi_pad consuming 100% CPU on tin - https://phabricator.wikimedia.org/T163158#3188017 (Catrope) Apparently this previously happened on tin's sister host mira as well: T137647#2791091 [22:03:27] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:06:17] PROBLEM - Check the NTP synchronisation status of timesyncd on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:06:47] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:08:30] !log catrope@tin Synchronized php-1.29.0-wmf.20/extensions/WikimediaEvents/modules/ext.wikimediaEvents.recentChangesClicks.js: T158458 T163152 (duration: 16m 23s) [22:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:39] T158458: ERI Metrics: Measure click-through actions from RC page and create 'Productivity" baseline - https://phabricator.wikimedia.org/T158458 [22:08:39] T163152: ERI Metrics fromrc=1 URL-extension breaks heavily used admin script - https://phabricator.wikimedia.org/T163152 [22:09:17] RECOVERY - Disk space on ocg1003 is OK: DISK OK [22:09:27] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. [22:10:47] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [22:36:07] RECOVERY - Check the NTP synchronisation status of timesyncd on tin is OK: OK: synced at Mon 2017-04-17 22:36:00 UTC. 
[22:40:33] !log tin - rmmod acpi_pad (T163158) [22:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:40] T163158: acpi_pad consuming 100% CPU on tin - https://phabricator.wikimedia.org/T163158 [22:41:39] Operations: acpi_pad consuming 100% CPU on tin - https://phabricator.wikimedia.org/T163158#3188128 (Dzahn) a:Dzahn [22:42:19] !log tin - load average going down, acpi_pad processes gone, cpu usage low again (T163158) [22:42:25] RoanKattouw: ^ [22:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:35] Yay thanks [22:43:09] so this is an issue we have had before with other servers, and just before the weekend we were debugging which kind of hardware it happens on and they are all R320 [22:43:15] and tin is one of them [22:43:36] and i even have this patch here waiting: https://gerrit.wikimedia.org/r/#/c/348197/ for this specific kernel module on just R320 [22:43:55] It's still slower than normal, but no longer unusably slow [22:43:56] this incident is another subtask of https://phabricator.wikimedia.org/T162850 if you wish [22:44:11] I did already file it as such, yes [22:44:15] ok, cool [22:44:51] Back to normal now that puppet has finished [22:45:48] Operations: acpi_pad consuming 100% CPU on tin - https://phabricator.wikimedia.org/T163158#3188134 (Dzahn) this change that is currently in code review should prevent this from happening again: https://gerrit.wikimedia.org/r/#/c/348197/ [22:46:01] !log catrope@tin Synchronized php-1.29.0-wmf.20/extensions/WikimediaEvents/modules/ext.wikimediaEvents.recentChangesClicks.js: T158458 T163152 (duration: 03m 01s) [22:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:09] :) [22:46:09] T158458: ERI Metrics: Measure click-through actions from RC page and create 'Productivity" baseline - https://phabricator.wikimedia.org/T158458 [22:46:09] T163152: ERI Metrics fromrc=1 URL-extension breaks heavily used
admin script - https://phabricator.wikimedia.org/T163152 [22:47:49] thanks mutante [22:48:25] (CR) Dzahn: [C: 1] "and today it happened again, on tin, and that is also R320, matching the pattern again. https://phabricator.wikimedia.org/T163158" [puppet] - https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) (owner: Dzahn) [22:49:17] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [22:52:06] Operations, Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3188160 (Dzahn) [22:52:07] Operations: acpi_pad consuming 100% CPU on tin - https://phabricator.wikimedia.org/T163158#3188158 (Dzahn) Open>Resolved closing this one as tin is back to normal with the short term fix as follow-up the change above is already in review and linked to the parent task (formerly known as "tracking tas... [22:58:31] Operations, Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3188182 (Dzahn) I ran `rmmod acpi_pad` on tin and it fixed the issue right away. tin is a `Dell PowerEdge R320` which is the pattern all affected servers had in common so far, further confirming the theory in the g...
[23:13:48] (CR) BBlack: [C: 1] base::kernel: mod blacklist for Dell R320, blacklist acpi_pad [puppet] - https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) (owner: Dzahn) [23:14:05] (PS1) Mattflaschen: Beta: Enable GuidedTour everywhere (except loginwiki and votewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/348647 (https://phabricator.wikimedia.org/T152827) [23:15:57] (CR) Catrope: [C: 2] Beta: Enable GuidedTour everywhere (except loginwiki and votewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/348647 (https://phabricator.wikimedia.org/T152827) (owner: Mattflaschen) [23:15:59] Operations, Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3188247 (Dzahn) [23:16:12] Operations, Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3177068 (Dzahn) [23:19:17] (PS2) Catrope: Beta: Enable GuidedTour everywhere (except loginwiki and votewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/348647 (https://phabricator.wikimedia.org/T152827) (owner: Mattflaschen) [23:19:22] (CR) Catrope: Beta: Enable GuidedTour everywhere (except loginwiki and votewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/348647 (https://phabricator.wikimedia.org/T152827) (owner: Mattflaschen) [23:19:26] (CR) Catrope: [C: 2] Beta: Enable GuidedTour everywhere (except loginwiki and votewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/348647 (https://phabricator.wikimedia.org/T152827) (owner: Mattflaschen) [23:20:38] (Merged) jenkins-bot: Beta: Enable GuidedTour everywhere (except loginwiki and votewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/348647 (https://phabricator.wikimedia.org/T152827) (owner: Mattflaschen) [23:20:49] (CR) jenkins-bot: Beta: Enable GuidedTour everywhere (except loginwiki and votewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/348647
(https://phabricator.wikimedia.org/T152827) (owner: Mattflaschen) [23:21:05] (PS5) Dzahn: base::kernel: mod blacklist for Dell R320, blacklist acpi_pad [puppet] - https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) [23:24:14] (PS1) Catrope: Revert "Remove redundant Dashiki config" [mediawiki-config] - https://gerrit.wikimedia.org/r/348649 [23:24:18] (CR) Catrope: [C: 2] Revert "Remove redundant Dashiki config" [mediawiki-config] - https://gerrit.wikimedia.org/r/348649 (owner: Catrope) [23:25:18] (Merged) jenkins-bot: Revert "Remove redundant Dashiki config" [mediawiki-config] - https://gerrit.wikimedia.org/r/348649 (owner: Catrope) [23:25:33] (CR) Dzahn: [C: 2] base::kernel: mod blacklist for Dell R320, blacklist acpi_pad [puppet] - https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) (owner: Dzahn) [23:27:01] (CR) jenkins-bot: Revert "Remove redundant Dashiki config" [mediawiki-config] - https://gerrit.wikimedia.org/r/348649 (owner: Catrope) [23:29:51] (Abandoned) BryanDavis: labs: make labs_lvm compatible with role::toollabs::docker::builder [puppet] - https://gerrit.wikimedia.org/r/348624 (owner: BryanDavis) [23:30:48] (CR) Catrope: "Reverted because this was not deployed after merging, and deploying this change would in fact not be allowed this week due to the deployme" [mediawiki-config] - https://gerrit.wikimedia.org/r/348625 (owner: Milimetric) [23:33:23] !log running puppet via cumin on all 16 Dell PowerEdge R320, adding blacklist file for acpi_pad kernel module. 15/16 success, all but tin (T162850) [23:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:30] T162850: acpi_pad issues - https://phabricator.wikimedia.org/T162850 [23:37:25] !log running rmmod acpi_pad on the 16 R320 via cumin, since blacklisting in puppet does not actively remove, confirmed unloaded.
(16/16) success ratio (>= 100.0% threshold) for command: 'lsmod|grep -c acpi_pad ||:' (T162850) [23:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:09] Operations, Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3188373 (Dzahn) p:High>Normal Now this should either never happen again... or it would affect more than just `R320`'s. But we have not seen a case of that so far. So.. a bit unsure about ticket status, so... [23:48:32] Operations, Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3188393 (Dzahn) [23:48:34] Operations, ops-codfw: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647#3188390 (Dzahn) Open>Resolved a:Dzahn closing this subtask since we know from the other similar tasks and the parent ticket that the fix is always `rmmod acpi_pad` and the affected hardware is alw... [23:50:42] Operations, ops-codfw, Traffic: baham (ns1) CPU-related issues - https://phabricator.wikimedia.org/T159870#3081551 (Dzahn) @bblack do you think this should stay open as a separate task given the recent changes regarding acpi_pad and blacklisting? I realize the HT / BIOS thing is unrelated but might a... [23:56:02] Operations, DC-Ops, fundraising-tech-ops, Patch-For-Review: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3188440 (Dzahn) p:Triage>Normal [23:56:39] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops, Patch-For-Review: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3181286 (Dzahn) [23:57:09] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3181286 (Dzahn)