[00:04:16] <wikibugs>	 (PS1) BBlack: cp4021: use macaddr from second 10Gbps interface [puppet] - https://gerrit.wikimedia.org/r/351742 (https://phabricator.wikimedia.org/T164327)
[00:04:26] <wikibugs___>	 (CR) BBlack: [V: 2 C: 2] cp4021: use macaddr from second 10Gbps interface [puppet] - https://gerrit.wikimedia.org/r/351742 (https://phabricator.wikimedia.org/T164327) (owner: BBlack)
[00:11:57] <wikibugs___>	 Operations, ops-ulsfo, Traffic, Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3233839 (BBlack) ok I'm installing jessie onto cp4021 now (just to test configuration issues and patch up puppet for the real installs later!).  Things I found while trying to...
[00:39:49] <wikibugs___>	 (PS1) BBlack: cp4021-32 ipv6 revdns [dns] - https://gerrit.wikimedia.org/r/351744
[00:40:01] <wikibugs>	 Operations: Installer assumes eth0 is the used interface - https://phabricator.wikimedia.org/T164444#3233877 (Volans)
[00:40:34] <wikibugs___>	 (CR) BBlack: [C: 2] cp4021-32 ipv6 revdns [dns] - https://gerrit.wikimedia.org/r/351744 (owner: BBlack)
[00:40:36] <wikibugs>	 (Restored) Jdlrobson: Disable RelatedSites on English, French and Italian Wikivoyages [mediawiki-config] - https://gerrit.wikimedia.org/r/335830 (https://phabricator.wikimedia.org/T128326) (owner: Jdlrobson)
[00:46:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:40] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[00:46:40] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.202 and port 9042: Connection refused
[00:47:09] <wikibugs>	 Operations, ops-ulsfo, Traffic, Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3233902 (BBlack) With the cable in the second port + T164444 we're blocked on getting a successful test install.  @RobH is asking smart hands to swap the cable, and I'll proce...
[00:47:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200)
[00:47:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200)
[00:47:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200)
[00:47:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200)
[00:47:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200)
[00:47:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200)
[00:47:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[00:47:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:47:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[00:47:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:48:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
[00:48:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[00:48:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[00:48:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[00:48:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[00:48:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[00:48:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[00:48:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[00:48:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[00:48:40] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[00:48:40] <icinga-wm>	 PROBLEM - Check systemd state on restbase1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:49:20] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.138 and port 9042: Connection refused
[00:49:30] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.0.231:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[00:49:40] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.138:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[00:49:41] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.231 and port 9042: Connection refused
[00:49:50] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[00:50:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:50:50] <icinga-wm>	 PROBLEM - Check systemd state on restbase1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:50:50] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[00:53:40] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1012 is OK: OK - cassandra-a is active
[00:53:40] <icinga-wm>	 RECOVERY - Check systemd state on restbase1012 is OK: OK - running: The system is fully operational
[00:55:40] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is OK: SSL OK - Certificate restbase1012-a valid until 2017-09-12 15:34:10 +0000 (expires in 131 days)
[00:55:40] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is OK: TCP OK - 0.038 second response time on 10.64.32.202 port 9042
[00:56:11] <icinga-wm>	 RECOVERY - Check systemd state on restbase1007 is OK: OK - running: The system is fully operational
[00:56:50] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1007 is OK: OK - cassandra-b is active
[00:57:30] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.0.231:7001 on restbase1007 is OK: SSL OK - Certificate restbase1007-b valid until 2017-09-12 15:33:32 +0000 (expires in 131 days)
[00:57:40] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is OK: TCP OK - 0.141 second response time on 10.64.0.231 port 9042
[01:02:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:40] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[01:02:41] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.202 and port 9042: Connection refused
[01:03:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[01:03:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[01:03:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[01:03:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[01:03:40] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[01:03:40] <icinga-wm>	 PROBLEM - Check systemd state on restbase1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:06:50] <icinga-wm>	 RECOVERY - Check systemd state on restbase1015 is OK: OK - running: The system is fully operational
[01:06:50] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[01:08:20] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is OK: TCP OK - 0.036 second response time on 10.64.48.138 port 9042
[01:08:40] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.138:7001 on restbase1015 is OK: SSL OK - Certificate restbase1015-a valid until 2017-09-12 15:34:34 +0000 (expires in 131 days)
[01:09:50] <urandom_>	 !log T160759: starting restbase1012-a
[01:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:00] <stashbot>	 T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759
[01:10:40] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1012 is OK: OK - cassandra-a is active
[01:10:41] <icinga-wm>	 RECOVERY - Check systemd state on restbase1012 is OK: OK - running: The system is fully operational
[01:11:40] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is OK: SSL OK - Certificate restbase1012-a valid until 2017-09-12 15:34:10 +0000 (expires in 131 days)
[01:11:41] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is OK: TCP OK - 0.037 second response time on 10.64.32.202 port 9042
[01:12:06] <wikibugs>	 Operations: Allow Qualtrics to send @wikimedia.org emails using an SPF record or an SMTP relay - https://phabricator.wikimedia.org/T164424#3233129 (faidon) We don't generally add third-parties to our SPF or DKIM records (there is one exception, with a long and complicated history). Among other concerns, this...
[01:18:40] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[01:18:40] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.32.206:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[01:19:01] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.206 and port 9042: Connection refused
[01:19:01] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused
[01:19:40] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[01:20:11] <icinga-wm>	 PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:20:40] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is OK: SSL OK - Certificate restbase1014-b valid until 2017-09-12 15:34:28 +0000 (expires in 131 days)
[01:21:00] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is OK: TCP OK - 0.036 second response time on 10.64.48.136 port 9042
[01:22:35] <urandom_>	 !log T160759: lowering tombstone_threshold on restbase1013 & restbase1014
[01:22:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:42] <stashbot>	 T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759
[01:23:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational
[01:23:40] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1013 is OK: OK - cassandra-b is active
[01:24:40] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.32.206:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-b valid until 2017-09-12 15:34:20 +0000 (expires in 131 days)
[01:25:00] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is OK: TCP OK - 0.037 second response time on 10.64.32.206 port 9042
[01:25:40] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[01:26:00] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused
[01:26:40] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is OK: SSL OK - Certificate restbase1014-b valid until 2017-09-12 15:34:28 +0000 (expires in 131 days)
[01:27:00] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is OK: TCP OK - 0.036 second response time on 10.64.48.136 port 9042
[01:42:59] <logmsgbot>	 !log mobrovac@naos Started deploy [restbase/deploy@4d04dfd]: Blacklist a page on dewiki
[01:43:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:11] <logmsgbot>	 !log mobrovac@naos Finished deploy [restbase/deploy@4d04dfd]: Blacklist a page on dewiki (duration: 04m 12s)
[01:47:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:41] <logmsgbot>	 !log mobrovac@naos Started deploy [restbase/deploy@4d04dfd]: Blacklist a page on dewiki
[01:47:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:30] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.32.195:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.195 and port 9042: Connection refused
[01:49:40] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.32.195:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[01:49:50] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[01:50:11] <icinga-wm>	 PROBLEM - Check systemd state on restbase1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:50:30] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.140 and port 9042: Connection refused
[01:50:50] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[01:51:09] <logmsgbot>	 !log mobrovac@naos Finished deploy [restbase/deploy@4d04dfd]: Blacklist a page on dewiki (duration: 03m 28s)
[01:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:50] <icinga-wm>	 PROBLEM - Check systemd state on restbase1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:52:00] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[01:54:38] <logmsgbot>	 !log mobrovac@naos Started deploy [restbase/deploy@4d04dfd]: blacklist dewiki page, take 3
[01:54:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:30] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.117 and port 9042: Connection refused
[01:57:40] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.0.117:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[01:58:07] <logmsgbot>	 !log mobrovac@naos Finished deploy [restbase/deploy@4d04dfd]: blacklist dewiki page, take 3 (duration: 03m 29s)
[01:58:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:58:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:58:50] <icinga-wm>	 PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:59:00] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[01:59:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:59:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:59:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:59:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:59:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:59:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:59:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:59:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:59:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:59:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:59:13] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200)
[01:59:30] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.187 and port 9042: Connection refused
[01:59:35] <logmsgbot>	 !log mobrovac@naos Started deploy [restbase/deploy@4d04dfd]: blacklist dewiki page, take 3a
[01:59:40] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.32.187:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[01:59:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:59:50] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1008 is OK: OK - cassandra-b is active
[02:00:11] <icinga-wm>	 RECOVERY - Check systemd state on restbase1008 is OK: OK - running: The system is fully operational
[02:00:14] <urandom>	 !log T160759: lowering tombstone threshold to 1000 on all eqiad nodes
[02:00:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:22] <stashbot>	 T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759
[02:00:30] <mobrovac>	 failure party
[02:00:50] <icinga-wm>	 RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational
[02:01:00] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1011 is OK: OK - cassandra-a is active
[02:01:30] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is OK: TCP OK - 0.037 second response time on 10.64.32.187 port 9042
[02:01:50] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.32.187:7001 on restbase1008 is OK: SSL OK - Certificate restbase1008-a valid until 2017-09-12 15:33:39 +0000 (expires in 131 days)
[02:02:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[02:02:30] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is OK: TCP OK - 0.036 second response time on 10.64.0.117 port 9042
[02:02:30] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[02:02:40] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.0.117:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-a valid until 2017-09-12 15:34:03 +0000 (expires in 131 days)
[02:03:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[02:03:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[02:03:20] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:03:30] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.32.195:9042 on restbase1008 is OK: TCP OK - 0.036 second response time on 10.64.32.195 port 9042
[02:03:40] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.32.195:7001 on restbase1008 is OK: SSL OK - Certificate restbase1008-b valid until 2017-09-12 15:33:41 +0000 (expires in 131 days)
[02:03:40] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[02:03:50] <icinga-wm>	 RECOVERY - Check systemd state on restbase1015 is OK: OK - running: The system is fully operational
[02:04:00] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1015 is OK: OK - cassandra-c is active
[02:04:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[02:04:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[02:04:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[02:04:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
[02:04:10] <icinga-wm>	 PROBLEM - Restbase root url on restbase1009 is CRITICAL: connect to address 10.64.48.110 and port 7231: Connection refused
[02:04:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[02:04:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[02:04:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[02:04:20] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[02:04:20] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.34 and port 9042: Connection refused
[02:04:40] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-c valid until 2017-12-13 00:15:51 +0000 (expires in 222 days)
[02:04:50] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is OK: SSL OK - Certificate restbase1015-c valid until 2017-09-12 15:34:41 +0000 (expires in 131 days)
[02:05:10] <icinga-wm>	 RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 15540 bytes in 0.078 second response time
[02:05:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[02:05:20] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is OK: TCP OK - 0.037 second response time on 10.64.0.34 port 9042
[02:05:30] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is OK: TCP OK - 0.036 second response time on 10.64.48.140 port 9042
[02:06:47] <wikibugs>	 Operations, ops-ulsfo, Traffic, Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3233964 (RobH) >>! In T164327#3233902, @BBlack wrote: > With the cable in the second port + T164444 we're blocked on getting a successful test install.  @RobH is asking smart...
[02:07:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:07:20] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:12] <logmsgbot>	 !log mobrovac@naos Finished deploy [restbase/deploy@4d04dfd]: blacklist dewiki page, take 3a (duration: 08m 37s)
[02:08:20] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[02:08:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:09:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[02:09:19] <mobrovac>	 we know, we know, icinga, things are not looking good
[02:09:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[02:10:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[02:10:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[02:10:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:10:20] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200)
[02:10:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:11:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[02:12:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:13:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[02:13:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:13:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:13:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:14:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[02:14:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:15:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[02:16:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[02:16:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:18:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:19:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[02:19:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[02:19:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[02:19:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[02:19:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:20:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:21:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
[02:21:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[02:21:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[02:21:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[02:21:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[02:21:11] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[02:21:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[02:21:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[02:26:01] <logmsgbot>	 !log l10nupdate@naos scap sync-l10n completed (1.29.0-wmf.21) (duration: 08m 02s)
[02:26:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:22] <logmsgbot>	 !log l10nupdate@naos ResourceLoader cache refresh completed at Thu May  4 02:31:22 UTC 2017 (duration 5m 21s)
[02:31:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:17:53] <wikibugs___>	 (03PS1) 10BBlack: Revert "cp4021: use macaddr from second 10Gbps interface" [puppet] - 10https://gerrit.wikimedia.org/r/351750
[03:18:03] <wikibugs>	 (03CR) 10BBlack: [V: 032 C: 032] Revert "cp4021: use macaddr from second 10Gbps interface" [puppet] - 10https://gerrit.wikimedia.org/r/351750 (owner: 10BBlack)
[03:32:51] <wikibugs>	 06Operations: Allow Qualtrics to send @wikimedia.org emails using an SPF record or an SMTP relay - https://phabricator.wikimedia.org/T164424#3234002 (10Neil_P._Quinn_WMF) Thanks for the details, @faidon. It makes sense that we would not want to give full mail-sending authority to a third party, no matter how tru...
[03:51:33] <wikibugs___>	 (03PS1) 10BBlack: cp4021: add ipsec and node list entry [puppet] - 10https://gerrit.wikimedia.org/r/351752
[03:51:57] <wikibugs___>	 (03CR) 10BBlack: [V: 032 C: 032] cp4021: add ipsec and node list entry [puppet] - 10https://gerrit.wikimedia.org/r/351752 (owner: 10BBlack)
[03:55:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[03:56:10] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[03:56:11] <icinga-wm>	 RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[04:17:12] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp4021 is CRITICAL: NRPE: Command check_varnishkafka-webrequest not defined
[04:17:32] <icinga-wm>	 PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:18:12] <icinga-wm>	 PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/cache/varnishkafka]
[04:18:17] <wikibugs>	 (03PS1) 10BBlack: group for varnishlog should be varnish these days [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/351755
[04:18:31] <wikibugs___>	 (03CR) 10BBlack: [V: 032 C: 032] group for varnishlog should be varnish these days [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/351755 (owner: 10BBlack)
[04:18:42] <icinga-wm>	 PROBLEM - traffic-pool service on cp4021 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive
[04:20:03] <wikibugs___>	 (03PS1) 10BBlack: bump varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/351756
[04:20:15] <wikibugs>	 (03CR) 10BBlack: [V: 032 C: 032] bump varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/351756 (owner: 10BBlack)
[04:20:22] <icinga-wm>	 PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2017_v4, cp2017_v6, kafka1018_v4, kafka1018_v6
[04:22:22] <icinga-wm>	 RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 54 ESP OK
[04:24:13] <icinga-wm>	 RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[04:24:13] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp4021 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf
[04:29:32] <icinga-wm>	 RECOVERY - Check systemd state on cp4021 is OK: OK - running: The system is fully operational
[04:29:42] <icinga-wm>	 RECOVERY - traffic-pool service on cp4021 is OK: OK - traffic-pool is active
[04:31:08] <wikibugs>	 (03PS1) 10BBlack: raise storage size for new cache SSDs [puppet] - 10https://gerrit.wikimedia.org/r/351757
[04:31:16] <wikibugs___>	 (03CR) 10BBlack: [V: 032 C: 032] raise storage size for new cache SSDs [puppet] - 10https://gerrit.wikimedia.org/r/351757 (owner: 10BBlack)
[06:03:48] <marostegui>	 !log Deploy alter table on wikidatawiki.wb_terms - dbstore2002 - T162539 T163548
[06:03:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:58] <stashbot>	 T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539
[06:03:59] <stashbot>	 T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548
[06:05:42] <wikibugs___>	 (03PS1) 10Marostegui: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351766 (https://phabricator.wikimedia.org/T162539)
[06:07:11] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351766 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui)
[06:07:52] <wikibugs___>	 (03PS1) 10Tim Starling: Remove EtcdConfig from beta cluster for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351767
[06:08:16] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351766 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui)
[06:08:25] <wikibugs___>	 (03CR) 10jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351766 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui)
[06:09:30] <Dereckson>	 !log CentralAuth: Removed MediaWiki 2FA for Alexsh (T164265)
[06:09:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:38] <stashbot>	 T164265: Lost 2FA details, request recovery. - https://phabricator.wikimedia.org/T164265
[06:10:09] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-codfw.php: Depool db2066 - T162539 T163548 (duration: 01m 25s)
[06:10:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:17] <stashbot>	 T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539
[06:10:17] <stashbot>	 T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548
[06:10:26] <marostegui>	 !log Deploy alter table on wikidatawiki.wb_terms - db2066 - T162539 T163548
[06:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:29] <marostegui>	 !log Stop MySQL on tempdb2001 to take a backup and prepare to decommission - T161712
[06:17:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:37] <stashbot>	 T161712: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712
[06:22:25] <wikibugs___>	 (03PS1) 10Marostegui: db-codfw.php: Depool tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351769 (https://phabricator.wikimedia.org/T161712)
[06:23:45] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351769 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui)
[06:24:48] <wikibugs___>	 (03Merged) 10jenkins-bot: db-codfw.php: Depool tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351769 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui)
[06:25:56] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw.php: Depool tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351769 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui)
[06:26:25] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-codfw.php: Depool tempdb2001, no longer needed - T161712 (duration: 01m 08s)
[06:26:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:33] <stashbot>	 T161712: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712
[06:29:38] <wikibugs>	 (03PS4) 10Phuedx: Reading Web Page Previews alerts [puppet] - 10https://gerrit.wikimedia.org/r/350377
[06:37:41] <wikibugs___>	 (03Abandoned) 10Giuseppe Lavagetto: tendril::maintenance: enable temporarily in codfw [puppet] - 10https://gerrit.wikimedia.org/r/350375 (owner: 10Giuseppe Lavagetto)
[06:41:40] <wikibugs___>	 (03PS1) 10Marostegui: mariadb: Get ready to decomission tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/351772 (https://phabricator.wikimedia.org/T161712)
[06:45:52] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[06:47:03] <wikibugs___>	 (03CR) 10Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/6292/" [puppet] - 10https://gerrit.wikimedia.org/r/351772 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui)
[06:47:52] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[06:58:46] <moritzm>	 !log installing freetype security updates
[06:58:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:48] <Nikerabbit>	 lots of errors
[07:03:29] <Nikerabbit>	 Error: 503, Backend fetch failed at Thu, 04 May 2017 07:01:30 GMT
[07:07:32] <_joe_>	 Nikerabbit: on it
[07:17:31] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Depool esams due to outage [dns] - 10https://gerrit.wikimedia.org/r/351774
[07:17:35] <_joe_>	 ema ^^
[07:18:04] <ema>	 _joe_: let's hold on just a few secs
[07:18:14] <wikibugs___>	 (03CR) 10Muehlenhoff: [C: 031] Depool esams due to outage [dns] - 10https://gerrit.wikimedia.org/r/351774 (owner: 10Giuseppe Lavagetto)
[07:18:20] <_joe_>	 ema: yes
[07:18:29] <_joe_>	 also, I might want to depool just text
[07:18:53] <ema>	 _joe_: indeed, other clusters are unaffected
[07:19:25] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Depool text esams due to outage [dns] - 10https://gerrit.wikimedia.org/r/351774
[07:19:37] <_joe_>	 errors are going down, not to zero though
[07:19:58] <_joe_>	 ema: let's go?
[07:21:36] <ema>	 _joe_: the backend 503s are almost 0 now
[07:22:00] <_joe_>	 yes, I see that
[07:22:19] <ema>	 _joe_: let me just grab a few varnishlogs and then +1 to depool
[07:22:26] <_joe_>	 seems that coincided with me restarting cp3043's backend
[07:22:37] <_joe_>	 ema: I think it's ok now
[07:22:39] <_joe_>	 more or less
[07:22:50] <_joe_>	 I'm tailing the 5xx log and it basically stopped
[07:23:13] <_joe_>	 it's only the usual noise on upload
[07:24:48] <ema>	 _joe_: yeah, it stopped indeed. I'd say let's not depool for now but keep the patch ready if the issue comes back
[07:25:24] <_joe_>	 yes
[07:25:53] <_joe_>	 !log restarted cp3043 backend varnish at 7:13 UTC while trying to debug issues
[07:26:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:52] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:33:25] <wikibugs>	 (03CR) 10Marostegui: [C: 032] mariadb: Get ready to decomission tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/351772 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui)
[07:33:52] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:37:20] <wikibugs>	 (03PS1) 10Marostegui: db-codfw.php: Remove tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351777 (https://phabricator.wikimedia.org/T161712)
[07:38:09] <wikibugs___>	 (03PS2) 10Marostegui: db-codfw,db-eqiad.php: Remove tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351777 (https://phabricator.wikimedia.org/T161712)
[07:40:44] <wikibugs___>	 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#3234168 (10hashar)
[07:42:52] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[07:43:52] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[07:44:52] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:47:52] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[07:51:57] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:52:47] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:54:39] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Remove tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351777 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui)
[07:56:46] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Remove tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351777 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui)
[07:56:55] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Remove tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351777 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui)
[07:58:42] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-codfw.php: Remove tempdb2001 from config files as it will be decommissioned - T161712 (duration: 01m 25s)
[07:58:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:50] <stashbot>	 T161712: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712
[07:59:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good. Note that the current queue discipline doesn't get changed by the sysctl, this either needs a reboot or manual one time applic" [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) (owner: 10Ema)
[07:59:57] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Remove tempdb2001 from config files as it will be decommissioned - T161712 (duration: 01m 07s)
[08:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:57] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234220 (10Marostegui) Position where tempdb2001 slave was stopped at: https://phabricator.wikimedia.org/P5374 Backu...
[08:01:11] <wikibugs>	 06Operations, 06Labs, 10Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#3234221 (10hashar)
[08:03:55] <wikibugs>	 06Operations, 10LDAP-Access-Requests, 06Labs, 10Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#3234234 (10hashar) There are only two accounts in LDAP with shells not being `/bin/bash`: ``` $ ldapsearch -LLL -x -b 'ou=people,dc=wikimedia,dc=o...
[08:05:22] <_joe_>	 moritzm: can you change the topic back?
[08:08:12] <moritzm>	 yep
[08:10:24] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3234243 (10Marostegui) 05Open>03Resolved Thanks @Papaul all the hosts are looking good! I will mark this ticket as resolved then.
[08:13:21] <wikibugs>	 (03PS1) 10Marostegui: x1.host: Remove tempdb2001 [software] - 10https://gerrit.wikimedia.org/r/351780
[08:15:56] <wikibugs>	 (03CR) 10Marostegui: [C: 032] x1.host: Remove tempdb2001 [software] - 10https://gerrit.wikimedia.org/r/351780 (owner: 10Marostegui)
[08:16:49] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234251 (10jcrespo) "Decommissioned" in the software sense, I think they will want it to return to spares. Just clar...
[08:17:58] <wikibugs>	 (03Merged) 10jenkins-bot: x1.host: Remove tempdb2001 [software] - 10https://gerrit.wikimedia.org/r/351780 (owner: 10Marostegui)
[08:18:02] <wikibugs>	 (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Decommission db1040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351782 (https://phabricator.wikimedia.org/T164057)
[08:20:21] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Get ready to decommission db1040 [puppet] - 10https://gerrit.wikimedia.org/r/351783 (https://phabricator.wikimedia.org/T164057)
[08:20:49] <wikibugs>	 (03PS1) 10Marostegui: s4.hosts: Remove db1040 [software] - 10https://gerrit.wikimedia.org/r/351784 (https://phabricator.wikimedia.org/T164057)
[08:22:53] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Get ready to decommission db1040 [puppet] - 10https://gerrit.wikimedia.org/r/351783 (https://phabricator.wikimedia.org/T164057)
[08:23:39] <gehel>	 !log restart elasticsearch on relforge for JDK update
[08:23:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:34] <wikibugs>	 (03PS4) 10Gehel: elasticsearch - fix awareness attribute [puppet] - 10https://gerrit.wikimedia.org/r/351650
[08:28:06] <wikibugs>	 (03PS1) 10Gehel: elasticsearch - fix logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/351786
[08:28:20] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch - fix awareness attribute [puppet] - 10https://gerrit.wikimedia.org/r/351650 (owner: 10Gehel)
[08:29:57] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch - fix logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/351786 (owner: 10Gehel)
[08:30:02] <wikibugs>	 (03PS2) 10Gehel: elasticsearch - fix logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/351786
[08:31:49] <wikibugs>	 (03CR) 10Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/6294/" [puppet] - 10https://gerrit.wikimedia.org/r/351783 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui)
[08:32:58] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234269 (10Marostegui) Thanks for the clarification :-)
[08:34:20] <wikibugs>	 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3234285 (10hashar) On deployment-tin I created: ``` lang=ini,name=/etc/mysql/conf.d/prompt.cnf [mysql] prompt = "\u@\h[...
[08:34:28] <wikibugs>	 (03CR) 10DCausse: [C: 031] elasticsearch - update reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/351629 (owner: 10Gehel)
[08:35:05] <wikibugs>	 (03CR) 10DCausse: [C: 031] logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel)
[08:41:24] <wikibugs>	 (03PS3) 10Marostegui: mariadb: Get ready to decommission db1040 [puppet] - 10https://gerrit.wikimedia.org/r/351783 (https://phabricator.wikimedia.org/T164057)
[08:42:18] <wikibugs>	 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3234312 (10ArielGlenn) 05Open>03Resolved
[08:42:40] <wikibugs>	 06Operations, 10ops-eqiad, 15User-fgiunchedi: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T163690#3234314 (10fgiunchedi) @Cmjohnson thanks! disk has been rebuilt
[08:43:09] <wikibugs>	 (03CR) 10Marostegui: [C: 032] mariadb: Get ready to decommission db1040 [puppet] - 10https://gerrit.wikimedia.org/r/351783 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui)
[08:43:50] <wikibugs>	 06Operations, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3234320 (10fgiunchedi) >>! In T152562#3232264, @Jgreen wrote: >>>! In T152562#3227173, @fgiunchedi wrote: >> @Jgreen any news/updates on having FR fully on jessie?  >  > We're still waiting for...
[08:45:06] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Decommission db1040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351782 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui)
[08:47:02] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Decommission db1040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351782 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui)
[08:47:12] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Decommission db1040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351782 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui)
[08:49:22] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Remove db1040 from config files as it will be decommissioned - T164057 (duration: 00m 55s)
[08:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:31] <stashbot>	 T164057: Decommission db1040 - https://phabricator.wikimedia.org/T164057
[08:50:18] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-codfw.php: Remove db1040 from config files as it will be decommissioned - T164057 (duration: 00m 48s)
[08:50:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:03] <wikibugs>	 (03CR) 10Marostegui: [C: 032] s4.hosts: Remove db1040 [software] - 10https://gerrit.wikimedia.org/r/351784 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui)
[08:52:47] <wikibugs>	 (03Merged) 10jenkins-bot: s4.hosts: Remove db1040 [software] - 10https://gerrit.wikimedia.org/r/351784 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui)
[08:53:31] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Decommission db1040 - https://phabricator.wikimedia.org/T164057#3234342 (10Marostegui) a:03Cmjohnson The host is ready to be decommissioned. What I have done:  - Removed it from prometheus, added it as a spare on site.pp and removed it from dhcp list:...
[08:54:25] <wikibugs>	 06Operations, 10ops-eqiad, 15User-fgiunchedi: HP RAID icinga alert on ms-be1021 - https://phabricator.wikimedia.org/T163777#3234351 (10fgiunchedi) The same recharging status and icinga warning is showing on ms-be1020 now too  ``` => show status  Smart Array P840 in Slot 3    Controller Status: OK    Cache St...
[09:00:37] <wikibugs>	 06Operations: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3234359 (10MoritzMuehlenhoff)
[09:04:42] <wikibugs>	 06Operations: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3234372 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff
[09:07:01] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "It seems a bit overcomplicated to me, suggestions to simplify it are inline." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel)
[09:12:03] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thumbstats: fix thumb detection regex [software] - 10https://gerrit.wikimedia.org/r/351791
[09:12:05] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thumbstats: add env variables to help text [software] - 10https://gerrit.wikimedia.org/r/351792
[09:12:07] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796)
[09:12:58] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796) (owner: 10Filippo Giunchedi)
[09:13:28] <logmsgbot>	 !log joal@naos Started deploy [analytics/refinery@9d35029]: (no justification provided)
[09:13:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:26] <logmsgbot>	 !log joal@naos Finished deploy [analytics/refinery@9d35029]: (no justification provided) (duration: 02m 58s)
[09:16:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:20] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "amended comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel)
[09:25:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] Add a Prometheus Apache exporter to Bohrium [puppet] - 10https://gerrit.wikimedia.org/r/351636 (https://phabricator.wikimedia.org/T163204) (owner: 10Elukey)
[09:26:39] <wikibugs>	 (03PS1) 10Jcrespo: query-killer: Do not kill queries containing gtid_wait or DMLs [software] - 10https://gerrit.wikimedia.org/r/351796
[09:26:56] <wikibugs>	 (03PS4) 10Elukey: Add a Prometheus Apache exporter to Bohrium [puppet] - 10https://gerrit.wikimedia.org/r/351636 (https://phabricator.wikimedia.org/T163204)
[09:28:11] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add a Prometheus Apache exporter to Bohrium [puppet] - 10https://gerrit.wikimedia.org/r/351636 (https://phabricator.wikimedia.org/T163204) (owner: 10Elukey)
[09:31:56] <wikibugs>	 06Operations: Use DNS discovery record for deployment CNAME - https://phabricator.wikimedia.org/T164460#3234453 (10fgiunchedi)
[09:32:24] <wikibugs>	 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3234480 (10ema) Today, 2017-05-04, this issue affected 4 out of 8 of the text-esams hosts roughly at the same time, resulting in a peak of [[ https://grafana.wik...
[09:32:35] <wikibugs>	 06Operations, 15User-fgiunchedi: Use DNS discovery record for deployment CNAME - https://phabricator.wikimedia.org/T164460#3234482 (10fgiunchedi)
[09:36:00] <godog>	 bd808: when you get a chance can you let me know what you think of https://gerrit.wikimedia.org/r/#/c/350817/ ? especially the logstash part and adding "type" to distinguish webrequest from json_lines
[09:36:05] <wikibugs>	 (03PS1) 10Jcrespo: x1: Remove all read traffic from x1-master-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407)
[09:37:11] <wikibugs>	 (03CR) 10Marostegui: [C: 031] x1: Remove all read traffic from x1-master-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407) (owner: 10Jcrespo)
[09:39:30] <wikibugs>	 (03PS2) 10Jcrespo: db: Remove all read traffic from x1, es2 & es3-master-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407)
[09:40:51] <elukey>	 !log stop kafka on kafka1012 and reboot the host for kernel upgrade
[09:40:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:27] <wikibugs>	 (03PS2) 10Filippo Giunchedi: thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796)
[09:53:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] thumbstats: fix thumb detection regex [software] - 10https://gerrit.wikimedia.org/r/351791 (owner: 10Filippo Giunchedi)
[09:53:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] thumbstats: add env variables to help text [software] - 10https://gerrit.wikimedia.org/r/351792 (owner: 10Filippo Giunchedi)
[09:54:25] <wikibugs>	 (03PS3) 10Filippo Giunchedi: thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796)
[10:00:10] <wikibugs>	 (03PS1) 10Ema: cache: stop expiry thread RT experiment on cp2024 [puppet] - 10https://gerrit.wikimedia.org/r/351802 (https://phabricator.wikimedia.org/T145661)
[10:00:19] <wikibugs>	 (03PS4) 10Filippo Giunchedi: thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796)
[10:04:05] <wikibugs>	 (03CR) 10Ema: [V: 032 C: 032] cache: stop expiry thread RT experiment on cp2024 [puppet] - 10https://gerrit.wikimedia.org/r/351802 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema)
[10:04:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796) (owner: 10Filippo Giunchedi)
[10:05:28] <ema>	 !log restart varnish-be on cp2024 without RT experiment
[10:05:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:40] <moritzm>	 !log restarting hhvm on mediawiki canaries to pick up freetype security update
[10:11:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:07] <elukey>	 !log executed DEL ocg_job_status on rdb1007:6379 (new ocg_job_status hash is stored on the ocg* hosts) - T159850
[10:22:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:15] <stashbot>	 T159850: JobQueue Redis codfw replicas periodically lagging  - https://phabricator.wikimedia.org/T159850
[10:27:17] <wikibugs>	 06Operations, 06Performance-Team, 13Patch-For-Review: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3234673 (10Peter) I've added the new metrics in the graph, will keep the issue open until I can change the alert so we use both pages (need to collect metric...
[10:28:02] <jynus>	 I think nagios lost its downtimes again
[10:41:54] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Add scap.cfg [software/librenms] - 10https://gerrit.wikimedia.org/r/351811 (https://phabricator.wikimedia.org/T129136)
[10:44:07] <wikibugs>	 07Puppet, 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure, 07Beta-Cluster-reproducible: New instance have broken puppet configuration when using puppetmaster standalone - https://phabricator.wikimedia.org/T148929#3234740 (10hashar)
[10:44:58] <wikibugs>	 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3234753 (10fgiunchedi) I've extracted some data from the list of thumbnails we are storing in swift and processed it with hive, distribution of si...
[10:48:01] <moritzm>	 !log installing tomcat security updates
[10:48:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:01] <icinga-wm>	 PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:14:54] <wikibugs>	 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: (don't) decom promethium - https://phabricator.wikimedia.org/T164395#3232060 (10MoritzMuehlenhoff) @Andrew : Can you clarify: is this host running in the labs or production realm?
[11:15:28] <wikibugs>	 (03CR) 10Jcrespo: "I will deploy this after a break" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407) (owner: 10Jcrespo)
[11:18:48] <wikibugs>	 (03CR) 10Ema: [C: 031] icinga checks: varnish mailbox tweaks [puppet] - 10https://gerrit.wikimedia.org/r/351684 (owner: 10BBlack)
[11:19:28] <wikibugs>	 (03PS1) 10Elukey: Lower the Apache workers on bohrium (Piwik) [puppet] - 10https://gerrit.wikimedia.org/r/351812
[11:20:50] <wikibugs>	 (03CR) 10Elukey: [C: 032] Lower the Apache workers on bohrium (Piwik) [puppet] - 10https://gerrit.wikimedia.org/r/351812 (owner: 10Elukey)
[11:21:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86459.17 seconds
[11:21:13] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Get ready to decomission db1022 [puppet] - 10https://gerrit.wikimedia.org/r/351813 (https://phabricator.wikimedia.org/T163778)
[11:21:54] <wikibugs>	 (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Decommission db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351814 (https://phabricator.wikimedia.org/T163778)
[11:22:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86557.44 seconds
[11:23:24] <wikibugs>	 (03PS1) 10Marostegui: s6.hosts: Decommission db1022 [software] - 10https://gerrit.wikimedia.org/r/351815 (https://phabricator.wikimedia.org/T163778)
[11:26:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86597.36 seconds
[11:28:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86394.13 seconds
[11:29:09] <wikibugs>	 (03CR) 10Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/6297/" [puppet] - 10https://gerrit.wikimedia.org/r/351813 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui)
[11:29:14] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Get ready to decomission db1022 [puppet] - 10https://gerrit.wikimedia.org/r/351813 (https://phabricator.wikimedia.org/T163778)
[11:29:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86482.40 seconds
[11:30:55] <wikibugs>	 (03CR) 10Marostegui: [C: 032] mariadb: Get ready to decomission db1022 [puppet] - 10https://gerrit.wikimedia.org/r/351813 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui)
[11:33:59] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Decommission db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351814 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui)
[11:34:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86644.84 seconds
[11:35:06] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Decommission db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351814 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui)
[11:35:46] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Decommission db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351814 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui)
[11:36:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86560.94 seconds
[11:36:20] <hashar>	 jouncebot: refresh
[11:36:23] <jouncebot>	 I refreshed my knowledge about deployments.
[11:36:24] <hashar>	 jouncebot: next
[11:36:24] <jouncebot>	 In 97 hour(s) and 23 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1300)
[11:36:56] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-codfw.php: Remove db1022 from config files as it will be decommissioned - T163778 (duration: 01m 25s)
[11:37:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:04] <stashbot>	 T163778: Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission) - https://phabricator.wikimedia.org/T163778
[11:38:01] <icinga-wm>	 RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[11:38:36] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Remove db1022 from config files as it will be decommissioned - T163778 (duration: 01m 06s)
[11:38:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:31] <wikibugs>	 (03PS2) 10BBlack: icinga checks: varnish mailbox tweaks [puppet] - 10https://gerrit.wikimedia.org/r/351684
[11:43:00] <wikibugs>	 (03CR) 10BBlack: [V: 032 C: 032] icinga checks: varnish mailbox tweaks [puppet] - 10https://gerrit.wikimedia.org/r/351684 (owner: 10BBlack)
[11:43:17] <wikibugs>	 (03PS2) 10Marostegui: s6.hosts: Decommission db1022 [software] - 10https://gerrit.wikimedia.org/r/351815 (https://phabricator.wikimedia.org/T163778)
[11:43:27] <wikibugs>	 (03PS4) 10BBlack: varnish: better frontend mem sizing [puppet] - 10https://gerrit.wikimedia.org/r/324230
[11:44:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] varnish: better frontend mem sizing [puppet] - 10https://gerrit.wikimedia.org/r/324230 (owner: 10BBlack)
[11:44:56] <wikibugs>	 (03CR) 10Marostegui: [C: 032] s6.hosts: Decommission db1022 [software] - 10https://gerrit.wikimedia.org/r/351815 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui)
[11:45:10] <ema>	 !log starting cache_text upgrades to varnish 4.1.6-1wm1
[11:45:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:53] <wikibugs>	 (03Merged) 10jenkins-bot: s6.hosts: Decommission db1022 [software] - 10https://gerrit.wikimedia.org/r/351815 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui)
[11:46:31] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA: Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission) - https://phabricator.wikimedia.org/T163778#3234912 (10Marostegui) a:03Cmjohnson The host is ready to be decommissioned. What I have done:  Removed it from pr...
[11:47:37] <wikibugs>	 (03PS5) 10BBlack: varnish: better frontend mem sizing [puppet] - 10https://gerrit.wikimedia.org/r/324230
[11:47:40] <bblack>	 I hate you jenkins
[11:48:47] <wikibugs>	 (03CR) 10BBlack: [C: 032] varnish: better frontend mem sizing [puppet] - 10https://gerrit.wikimedia.org/r/324230 (owner: 10BBlack)
[11:51:00] <icinga-wm>	 PROBLEM - Varnish HTTP text-backend - port 3128 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3128: Connection refused
[11:51:07] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3234935 (10Marostegui) @Ottomata remember to: `stop all slaves;` before shutting down MySQL (not a hard requirement, but just in case there is a transaction hanging,...
[11:51:10] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3125: Connection refused
[11:51:10] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3120: Connection refused
[11:51:10] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3121: Connection refused
[11:51:10] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3122: Connection refused
[11:51:20] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3127 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3127: Connection refused
[11:51:20] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3124: Connection refused
[11:51:20] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3126: Connection refused
[11:51:20] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3123: Connection refused
[11:51:28] <ema>	 looking ^
[11:51:40] <icinga-wm>	 PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:51:50] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 80: Connection refused
[11:52:17] <ema>	 bblack: perhaps puppetfail on mem sizing?
[11:53:08] <ema>	 re-running puppet agent
[11:53:10] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 449 bytes in 0.002 second response time
[11:53:10] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.000 second response time
[11:53:10] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 449 bytes in 0.000 second response time
[11:53:11] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.000 second response time
[11:53:20] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3127 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time
[11:53:20] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.000 second response time
[11:53:20] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time
[11:53:20] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time
[11:53:40] <icinga-wm>	 RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[11:53:50] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.005 second response time
[11:53:51] <bblack>	 uh
[11:54:00] <icinga-wm>	 RECOVERY - Varnish HTTP text-backend - port 3128 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.001 second response time
[11:54:12] <bblack>	 it's only the one you're restarting, right?
[11:54:54] <ema>	 I've restarted cp4010 but that was before the mem sizing patch got merged
[11:54:59] <bblack>	 right
[11:55:05] <ema>	 let me stop the upgrades and try one manually
[11:55:16] <bblack>	 we probably do need agent runs pre-restart now
[11:55:59] <ema>	 ok
[11:56:13] <moritzm>	 !log installing mysql-connector-java security updates
[11:56:17] <bblack>	 the agent update for the mem size seems to work fine in isolation though
[11:56:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:30] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=287.60 Read Requests/Sec=2093.30 Write Requests/Sec=7.00 KBytes Read/Sec=33404.80 KBytes_Written/Sec=2376.00
[11:57:21] <bblack>	 I guess spam the agent across the caches
[11:58:16] <ema>	 yeah
[12:01:10] <wikibugs>	 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3234951 (10Gilles) What "size" are we talking about when you have a range like 0.0000 - 500.0000?
[12:01:23] <wikibugs>	 (03PS1) 10DCausse: Upgrade plugins for elastic 5.3.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/351819 (https://phabricator.wikimedia.org/T160948)
[12:07:40] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=26.20 Read Requests/Sec=0.70 Write Requests/Sec=1.90 KBytes Read/Sec=13.60 KBytes_Written/Sec=39.20
[12:11:10] <wikibugs>	 (03CR) 10DCausse: "I really hope it's the last time we do this..." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/351819 (https://phabricator.wikimedia.org/T160948) (owner: 10DCausse)
[12:13:00] <icinga-wm>	 PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:20:28] <wikibugs>	 06Operations, 06Labs, 13Patch-For-Review: (don't) decom promethium - https://phabricator.wikimedia.org/T164395#3235014 (10Cmjohnson)
[12:21:10] <icinga-wm>	 RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[12:28:43] <marostegui>	 !log Deploy alter table enwiki.revision on dbstore1001 - T132416
[12:28:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:52] <stashbot>	 T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[12:35:21] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351822 (https://phabricator.wikimedia.org/T160392)
[12:36:57] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351822 (https://phabricator.wikimedia.org/T160392) (owner: 10Marostegui)
[12:38:01] <wikibugs>	 06Operations, 10DNS, 10Traffic, 15User-fgiunchedi: Use DNS discovery record for deployment CNAME - https://phabricator.wikimedia.org/T164460#3235053 (10Zppix)
[12:38:09] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351822 (https://phabricator.wikimedia.org/T160392) (owner: 10Marostegui)
[12:40:22] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Depool db1070 for maintenance - T160392 (duration: 01m 35s)
[12:40:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:31] <stashbot>	 T160392: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392
[12:42:42] <marostegui>	 !log Stop MySQL db1070 for maintenance - T160392
[12:42:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:49] <Zppix>	 what time zone & format is throttle.php in?
[12:51:45] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351822 (https://phabricator.wikimedia.org/T160392) (owner: 10Marostegui)
[12:52:23] <wikibugs>	 (03Draft2) 10Zppix: Lift Account registration limit for cywiki for an event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351825 (https://phabricator.wikimedia.org/T164482)
[12:55:50] <Zppix>	 Can I have something deployed before Saturday? It's for T164482 and the event is Saturday
[12:55:50] <stashbot>	 T164482: Lift account registration limit from IP address - https://phabricator.wikimedia.org/T164482
[13:02:51] <Dereckson>	 Zppix: nope
[13:03:15] <Dereckson>	 Zppix: er "before Saturday", yes, sure
[13:03:19] <Zppix>	 Dereckson:  how else am I supposed to have the account registration limit lifted?
[13:03:30] <wikibugs>	 (03CR) 10Tjones: [C: 031] "Yay!" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/351819 (https://phabricator.wikimedia.org/T160948) (owner: 10DCausse)
[13:03:43] <Zppix>	 Dereckson:  I can do it whenever EU swat is "usually"
[13:04:13] <Dereckson>	 I was going to tell you we need to deploy that during the week, as it's known beforehand, and not the Saturday itself
[13:04:24] <Dereckson>	 Gerrit URL?
[13:04:35] <Zppix>	 https://gerrit.wikimedia.org/r/#/c/351825/
[13:05:18] <Dereckson>	 Thanks
[13:05:22] <Zppix>	 no problem
[13:07:04] <jynus>	 these are the kind of things that should not be on config, but validated + on a store somewhere
[13:07:18] <Dereckson>	 jynus: we've a bug open for that
[13:07:35] <Dereckson>	 jynus: and someone told us they want to work on that during the hackathon
[13:07:37] <jynus>	 yes, I was just reminding a conversation
[13:07:42] <tto>	 most of InitialiseSettings.php should be in a store somewhere :)
[13:07:53] <jynus>	 we had a few days ago
[13:07:53] <tto>	 That's ancient tech debt
[13:08:06] <jynus>	 and basically everyone agreed with that
[13:08:31] <jynus>	 s/remind/remember/
[13:08:33] <Zppix>	 to be fair, if it's not able to be accessed on gerrit then staff would theoretically be the only ones able to change them.
[13:08:59] <Dereckson>	 Zppix: let me check something first for the timezone, the space is fishy, and I'll deploy it as soon as jynus, who owns the config repo this week for db purposes, gives me a green light
[13:09:20] <jynus>	 Zppix, that is the point, not having to use gerrit at all :-)
[13:09:49] <jynus>	 I am cool with that, no more maintenance commits anymore
[13:09:53] <Zppix>	 jynus:  so then staff would be the only ones doing config changes, adding to an already long workload
[13:09:54] <Dereckson>	 jynus: ok
[13:09:59] <Zppix>	 Dereckson:  ack
[13:10:50] <jynus>	 Zppix, nope- I would say autoconfirmed users to propose changes and a special role to ok them
[13:11:15] <Zppix>	 jynus: ah I see? is there a phab task on this?
[13:11:27] <jynus>	 but as I'm not participating in development, I cannot say or promise anything
[13:11:38] <jynus>	 that would be only my idea
[13:12:16] <jynus>	 probably https://phabricator.wikimedia.org/T44785
[13:12:46] <jynus>	 opened by you 4 years and a half ago :-)
[13:13:29] <wikibugs>	 (03CR) 10Dereckson: [C: 031] "Timezone checked." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351825 (https://phabricator.wikimedia.org/T164482) (owner: 10Zppix)
[13:14:01] <jynus>	 would it help if I work on a generic configuration store proposal?
[13:14:20] <Zppix>	 I wasn't here 4 and a half years ago jynus, so I'm assuming you're not talking about me lol
[13:14:28] <jynus>	 he he no
[13:14:47] <Dereckson>	 jynus: if you provide a solution for all the config, IS/CS files like db variables, yes, that would help
[13:15:08] <gehel>	 !log restart services on maps-test
[13:15:13] <Dereckson>	 that would avoid having one solution for one config file, another one for per-wiki settings, etc.
[13:15:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:00] <wikibugs>	 (03CR) 10Dereckson: [C: 032] "Emergency deployment, throttle rule for an event before the next SWAT available." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351825 (https://phabricator.wikimedia.org/T164482) (owner: 10Zppix)
[13:16:22] <jynus>	 probably the same model as etcd: cached data, but every X seconds it is reloaded asynchronously
[13:16:57] <Zppix>	 jynus:  would it be able to load balance that properly considering it will be used by every project 
[13:17:06] <wikibugs>	 (03Merged) 10jenkins-bot: Lift Account registration limit for cywiki for an event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351825 (https://phabricator.wikimedia.org/T164482) (owner: 10Zppix)
[13:17:06] <Dereckson>	 That's actually the etcd point
[13:17:15] <wikibugs>	 (03CR) 10jenkins-bot: Lift Account registration limit for cywiki for an event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351825 (https://phabricator.wikimedia.org/T164482) (owner: 10Zppix)
[13:17:21] <jynus>	 no need, stale data should be kept
[13:17:31] <jynus>	 if the service goes down
[13:17:51] <Dereckson>	 Zppix: here, read this: http://thesecretlivesofdata.com/raft/ 
[13:17:57] <Dereckson>	 That's the algo behind etcd
[13:18:00] <Zppix>	 thanks Dereckson
[13:18:14] <wikibugs>	 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3235210 (10fgiunchedi) @Gilles the image width in pixels, i.e. the user-provided size in the url
[13:18:45] <gehel>	 !log restart services on maps codfw
[13:18:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:55] <Dereckson>	 Zppix: your change works on mwdebug1002, syncing
[13:20:20] <Zppix>	 ack Dereckson 
[13:21:39] <logmsgbot>	 !log dereckson@naos Synchronized wmf-config/throttle.php: Lift Account registration limit for cywiki for an event / T164482 (duration: 01m 08s)
[13:21:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:48] <stashbot>	 T164482: Lift account registration limit from IP address - https://phabricator.wikimedia.org/T164482
[13:22:55] <Dereckson>	 I'm done on Naos.
[13:23:10] <wikibugs>	 06Operations, 05Goal, 07kubernetes: Eliminate SPOFs in the existing eqiad kubernetes infrastructure - https://phabricator.wikimedia.org/T162040#3235220 (10akosiaris)
[13:23:21] <Zppix>	 I thought tin was eqiad
[13:24:33] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Create kubemaster.svc.$site.wmnet [dns] - 10https://gerrit.wikimedia.org/r/351836 (https://phabricator.wikimedia.org/T162040)
[13:24:45] <akosiaris>	 Zppix: deployment server failover hasn't happened yet. It's scheduled for today
[13:25:20] <Zppix>	 Failover lol... ack
[13:27:24] <Dereckson>	 Zppix: it's only a pure convention
[13:27:31] <Dereckson>	 you can deploy from every server, it works
[13:27:32] <Dereckson>	 BUT
[13:28:06] <wikibugs>	 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3235232 (10fgiunchedi)
[13:28:15] <Dereckson>	 There is a need to have a coherent staging directory, so that someone doesn't deploy commits A B C while someone else deploys A B D, overwriting C
[13:28:42] <Dereckson>	 so the deployment server is mainly a social convention to have a coherent source of files to deploy
[13:29:22] <Zppix>	 So I guess I'll be the last deploy from codfw for a while (hopefully).
[13:29:46] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3235233 (10Gilles) Migration steps on VM once the change is applied:  ``` mwscript refreshImageMetadata.php...
[13:31:29] <gehel>	 !log restart services on maps eqiad
[13:31:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:55] <wikibugs>	 (03PS5) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327
[13:35:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel)
[13:37:07] <wikibugs>	 (03CR) 10Gehel: logstash - delete all indices older than 31 days (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel)
[13:38:02] <wikibugs>	 (03PS6) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327
[13:38:35] <wikibugs>	 (03PS7) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327
[13:46:11] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3235268 (10jcrespo)
[13:46:26] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#2990470 (10jcrespo) p:05High>03Normal
[13:47:09] <wikibugs>	 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3235271 (10BBlack) Interesting data on the topic of BBR under datacenter conditions (low latency 100GbE), possibly supporting the idea that it's...
[13:48:03] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Much nicer! Thanks a lot for improving it. Looks good to me, but I think there is a typo." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel)
[13:48:27] <wikibugs>	 06Operations, 06Release-Engineering-Team: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3235275 (10jcrespo) Removing tag due to change in scope of the ticket.
[13:49:29] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Repool db1070 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839
[13:49:45] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for lag to be gone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839 (owner: 10Marostegui)
[13:50:33] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Repool db1070 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839
[13:50:58] <wikibugs>	 06Operations, 10DBA: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488#3235290 (10jcrespo)
[13:51:35] <wikibugs>	 (03PS8) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327
[13:51:58] <wikibugs>	 (03PS9) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327
[13:52:22] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3235306 (10jcrespo)
[13:52:24] <wikibugs>	 06Operations, 10DBA: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#3235308 (10jcrespo)
[13:52:32] <wikibugs>	 (03CR) 10BBlack: [C: 031] "Same thought as moritz.  We can do this at runtime with e.g. "tc qdisc replace dev eth0 root fq"" [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) (owner: 10Ema)
[13:52:38] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel)
[13:53:17] <wikibugs>	 06Operations, 10DBA: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488#3235313 (10jcrespo)
[13:53:52] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3235314 (10Ottomata) @cmjohnson, I'll be on vacation next week, and then at the analytics offsite the following.  Coordinate with @elukey if you want to do it next we...
[13:53:54] <wikibugs>	 (03CR) 10Gehel: [C: 032] logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel)
[13:56:03] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on labstore2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Madhuvishy Known issue, looking into it.
[13:59:07] <wikibugs>	 (03PS1) 10Gehel: logstash - add missing shebang line in python script [puppet] - 10https://gerrit.wikimedia.org/r/351841
[13:59:50] <wikibugs>	 (03CR) 10Volans: [C: 031] logstash - add missing shebang line in python script [puppet] - 10https://gerrit.wikimedia.org/r/351841 (owner: 10Gehel)
[14:00:20] <wikibugs>	 (03CR) 10Gehel: [C: 032] logstash - add missing shebang line in python script [puppet] - 10https://gerrit.wikimedia.org/r/351841 (owner: 10Gehel)
[14:03:35] <chasemp>	 !log maintain-meta_p --all-databases --purge --debug labsdb1009/1010/1011 for T164103
[14:03:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:47] <stashbot>	 T164103: Generate labsdb views for dtywiki, pawikisource, ptwikimedia, wbwikimedia - https://phabricator.wikimedia.org/T164103
[14:04:32] <wikibugs>	 (03PS1) 10Gehel: logstash - fix argument parsing in logstash_delete_index [puppet] - 10https://gerrit.wikimedia.org/r/351842
[14:04:40] <wikibugs>	 06Operations, 06Labs: maintain-meta_p hands on connecting to wikimedia.org.uk - https://phabricator.wikimedia.org/T164490#3235339 (10chasemp)
[14:04:57] <wikibugs>	 06Operations, 06Labs: maintain-meta_p hands on connecting to wikimedia.org.uk - https://phabricator.wikimedia.org/T164490#3235351 (10chasemp) p:05Triage>03Normal
[14:05:34] <wikibugs>	 06Operations, 06Labs: maintain-meta_p hangs on connecting to wikimedia.org.uk - https://phabricator.wikimedia.org/T164490#3235339 (10chasemp)
[14:05:36] <halfak>	 o/
[14:05:44] <halfak>	  I have a thing on my calendar that says ORES traffic is being redirected at EQIAD now. :)
[14:05:50] <halfak>	 I'm here to monitor
[14:06:04] <halfak>	 akosiaris, ^
[14:07:03] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1070 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839 (owner: 10Marostegui)
[14:07:32] <halfak>	 let me know if that's happening :) 
[14:07:34] <wikibugs>	 (03CR) 10Volans: [C: 031] logstash - fix argument parsing in logstash_delete_index [puppet] - 10https://gerrit.wikimedia.org/r/351842 (owner: 10Gehel)
[14:07:42] <wikibugs>	 (03CR) 10Gehel: [C: 032] logstash - fix argument parsing in logstash_delete_index [puppet] - 10https://gerrit.wikimedia.org/r/351842 (owner: 10Gehel)
[14:08:06] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1070 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839 (owner: 10Marostegui)
[14:08:15] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Repool db1070 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839 (owner: 10Marostegui)
[14:09:44] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Repool db1070 with less weight - T160392 (duration: 01m 16s)
[14:09:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:52] <stashbot>	 T160392: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392
[14:10:11] <akosiaris>	 halfak: ok will do. we are running 30 minutes late btw at the request of the services team
[14:10:19] <halfak>	 no prob
[14:10:22] <halfak>	 I'll be around :) 
[14:10:24] <akosiaris>	 ok
[14:11:49] <wikibugs>	 06Operations, 10Monitoring, 10Traffic: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3235386 (10fgiunchedi) Issue has been fixed upstream, pending next node_exporter release or internal package build
[14:13:14] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Increase db1070's weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351844
[14:16:31] <chasemp>	 !log maintain-meta_p --all-databases --purge --debug labsdb1001 for T164103
[14:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:38] <stashbot>	 T164103: Generate labsdb views for dtywiki, pawikisource, ptwikimedia, wbwikimedia - https://phabricator.wikimedia.org/T164103
[14:18:16] <wikibugs>	 Operations, Labs: maintain-meta_p hangs on connecting to wikimedia.org.uk - https://phabricator.wikimedia.org/T164490#3235339 (jcrespo) I do not think we host that:  ``` $ dig wikimedia.org.uk wikimedia.org.uk.       1847    IN      A       37.188.117.184  $ whois  37.188.117.184 descr:          Rackspac...
[14:20:51] <wikibugs>	 Puppet, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, Beta-Cluster-reproducible: New instance have broken puppet configuration when using puppetmaster standalone - https://phabricator.wikimedia.org/T148929#2736876 (madhuvishy) @hashar This seems like a known and documented...
[14:21:18] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Increase db1070's weight [mediawiki-config] - https://gerrit.wikimedia.org/r/351844 (owner: Marostegui)
[14:22:36] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Increase db1070's weight [mediawiki-config] - https://gerrit.wikimedia.org/r/351844 (owner: Marostegui)
[14:22:43] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Increase db1070's weight [mediawiki-config] - https://gerrit.wikimedia.org/r/351844 (owner: Marostegui)
[14:23:50] <chasemp>	 !log maintain-meta_p --databases dtywiki,pawikisource,ptwikimedia,wbwikimedia --debug labsdb1003 for T164103
[14:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:59] <stashbot>	 T164103: Generate labsdb views for dtywiki, pawikisource, ptwikimedia, wbwikimedia - https://phabricator.wikimedia.org/T164103
[14:24:05] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Increase db1070 weight - T160392 (duration: 01m 10s)
[14:24:08] <wikibugs>	 (PS1) Gehel: logstash - indices need to be striped of whitespace in delete script [puppet] - https://gerrit.wikimedia.org/r/351846
[14:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:12] <stashbot>	 T160392: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392
[14:25:12] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Restore db1070 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/351848
[14:27:01] <wikibugs>	 (PS1) Giuseppe Lavagetto: cache: services switchback to eqiad 1/2 [puppet] - https://gerrit.wikimedia.org/r/351849
[14:27:03] <wikibugs>	 (PS1) Giuseppe Lavagetto: cache: services switchback to eqiad 2/2 [puppet] - https://gerrit.wikimedia.org/r/351850
[14:27:47] <_joe_>	 bblack, mobrovac ^^
[14:28:34] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 1] cache: services switchback to eqiad 1/2 [puppet] - https://gerrit.wikimedia.org/r/351849 (owner: Giuseppe Lavagetto)
[14:28:42] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 1] cache: services switchback to eqiad 2/2 [puppet] - https://gerrit.wikimedia.org/r/351850 (owner: Giuseppe Lavagetto)
[14:29:11] <jynus>	 !log dropping and recreating user for maintain-views on labsdb1001 T164103
[14:29:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:19] <stashbot>	 T164103: Generate labsdb views for dtywiki, pawikisource, ptwikimedia, wbwikimedia - https://phabricator.wikimedia.org/T164103
[14:29:29] <wikibugs>	 (CR) Mobrovac: [C: 1] cache: services switchback to eqiad 1/2 [puppet] - https://gerrit.wikimedia.org/r/351849 (owner: Giuseppe Lavagetto)
[14:29:37] <jynus>	 chasemp ^ heads up
[14:29:43] <wikibugs>	 (CR) BBlack: [C: 1] cache: services switchback to eqiad 1/2 [puppet] - https://gerrit.wikimedia.org/r/351849 (owner: Giuseppe Lavagetto)
[14:29:55] <wikibugs>	 (CR) Mobrovac: [C: 1] cache: services switchback to eqiad 2/2 [puppet] - https://gerrit.wikimedia.org/r/351850 (owner: Giuseppe Lavagetto)
[14:30:07] <wikibugs>	 (CR) BBlack: [C: 1] cache: services switchback to eqiad 2/2 [puppet] - https://gerrit.wikimedia.org/r/351850 (owner: Giuseppe Lavagetto)
[14:30:15] <godog>	 I'm staging the switch patches for later switchback
[14:30:25] <wikibugs>	 (PS2) Filippo Giunchedi: Revert "Switch deployment server to naos.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/351655
[14:30:27] <wikibugs>	 (PS1) Filippo Giunchedi: traffic: swift temporary a/a [puppet] - https://gerrit.wikimedia.org/r/351852
[14:30:29] <wikibugs>	 (PS1) Filippo Giunchedi: traffic: swift a/p in eqiad only [puppet] - https://gerrit.wikimedia.org/r/351853
[14:30:34] <akosiaris>	 godog: s/switch/swift/ but ok
[14:30:51] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] cache: services switchback to eqiad 1/2 [puppet] - https://gerrit.wikimedia.org/r/351849 (owner: Giuseppe Lavagetto)
[14:31:07] <_joe_>	 ok, sending all to a/a now
[14:31:12] <akosiaris>	 ok
[14:31:39] <godog>	 akosiaris: lol, yeah switch swift too close
[14:32:06] <wikibugs>	 Operations, User-Elukey, Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3235448 (hashar) I am all for enabling persistent connections again. Ideally we would deploy it on a singl...
[14:35:19] <_joe_>	 !log running puppet on varnishes in eqiad (text,misc,maps) to pick up the a/a traffic to services
[14:35:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:43] <moritzm>	 !log installing mysql-connector-java security updates on hadoop cluster
[14:36:44] <chasemp>	 jynus: got it tx
[14:36:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:07] <wikibugs>	 (CR) Gehel: [C: 2] logstash - indices need to be striped of whitespace in delete script [puppet] - https://gerrit.wikimedia.org/r/351846 (owner: Gehel)
[14:38:14] <wikibugs>	 Operations, netops, Prometheus-metrics-monitoring, User-fgiunchedi: Replace old Prometheus VMs addresses with baremetal in firewall configuration - https://phabricator.wikimedia.org/T164495#3235452 (fgiunchedi)
[14:38:16] <wikibugs>	 (PS2) Gehel: logstash - indices need to be striped of whitespace in delete script [puppet] - https://gerrit.wikimedia.org/r/351846
[14:39:33] <wikibugs>	 (CR) Gehel: [V: 2 C: 2] logstash - indices need to be striped of whitespace in delete script [puppet] - https://gerrit.wikimedia.org/r/351846 (owner: Gehel)
[14:39:35] <_joe_>	 ok, switching the internal traffic now
[14:39:36] <wikibugs>	 (PS1) Elukey: Re-enable persistent connection to Redis for jobrunners [mediawiki-config] - https://gerrit.wikimedia.org/r/351854 (https://phabricator.wikimedia.org/T125735)
[14:39:52] <logmsgbot>	 !log oblivian: Setting restbase-async in codfw UP
[14:40:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:11] <logmsgbot>	 !log oblivian: Setting restbase in eqiad UP
[14:40:12] <apergos>	 ccccccelgjhdnfjckieituirflcckfccidljjrekjgvv
[14:40:17] <apergos>	 sorry
[14:40:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:32] <akosiaris>	 apergos: and now use it once more to invalidate that 
[14:40:35] <akosiaris>	 :-)
[14:40:44] <mobrovac>	 lol
[14:41:02] <_joe_>	 elukey: hold on with that
[14:41:15] <_joe_>	 so, merging the second traffic patch
[14:41:20] <akosiaris>	 ok
[14:41:50] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] cache: services switchback to eqiad 2/2 [puppet] - https://gerrit.wikimedia.org/r/351850 (owner: Giuseppe Lavagetto)
[14:41:58] <wikibugs>	 (PS2) Giuseppe Lavagetto: cache: services switchback to eqiad 2/2 [puppet] - https://gerrit.wikimedia.org/r/351850
[14:42:04] <elukey>	 _joe_ sure sure it is not meant to be merged soon
[14:42:07] <_joe_>	 gehel: merging during a switchover?
[14:42:11] <_joe_>	 ahem
[14:42:15] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: 2 C: 2] cache: services switchback to eqiad 2/2 [puppet] - https://gerrit.wikimedia.org/r/351850 (owner: Giuseppe Lavagetto)
[14:42:23] <gehel>	 :/ ... yeah, did not realize, sry
[14:43:19] <_joe_>	 !log forcing a puppet run on cache (text,maps, misc) in eqiad/codfw to complete the switchback
[14:43:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:39] <logmsgbot>	 !log oblivian: Setting restbase in codfw DOWN
[14:43:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:16] <logmsgbot>	 !log oblivian: Setting restbase-async in eqiad DOWN
[14:44:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:47] <_joe_>	 gehel: any reason why wdqs is marked down in codfw according to discovery?
[14:46:15] <gehel>	 _joe_: not that I know of... 
[14:46:36] <_joe_>	 ok
[14:46:59] <logmsgbot>	 !log oblivian: Setting wdqs in codfw UP
[14:46:59] <gehel>	 it is receiving traffic ...
[14:47:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:04] <_joe_>	 gehel: this is the internal discovery
[14:48:04] <_joe_>	 not the varnish routing
[14:48:43] <gehel>	 no idea then. And there is probably nothing using internal discovery for wdqs
[14:50:53] <wikibugs>	 Operations, Patch-For-Review, Prometheus-metrics-monitoring, User-fgiunchedi: Put prometheus baremetal servers in service - https://phabricator.wikimedia.org/T148408#3235550 (akosiaris)
[14:50:55] <wikibugs>	 Operations, netops, Prometheus-metrics-monitoring, User-fgiunchedi: Replace old Prometheus VMs addresses with baremetal in firewall configuration - https://phabricator.wikimedia.org/T164495#3235548 (akosiaris) Open>Resolved Done. cr1-eqiad, cr2-eqiad updated and now the term is  ``` from...
[14:50:57] <akosiaris>	 godog: ^
[14:51:22] * akosiaris likes ansible for these things 
[14:51:46] <godog>	 akosiaris: !! that was fast, thanks a lot! I saw the config has hostnames in comments, is that automatic from juniper or ..?
[14:51:56] <akosiaris>	 me
[14:52:10] <_joe_>	 mobrovac: we should be done
[14:52:12] <akosiaris>	 thanks to my ocd 
[14:52:30] <_joe_>	 services are switched back
[14:52:32] <godog>	 akosiaris: haha ok, that explains why I wasn't finding anything about that in juniper docs
[14:52:36] <akosiaris>	 \o/
[14:52:38] <_joe_>	 green light for the other switches
[14:52:53] <akosiaris>	 halfak: we are done btw. everything seems to have gone fine
[14:52:59] <wikibugs>	 Operations, Graphite, Patch-For-Review: something (reqstats?) puts many different metrics into graphite, allocating a lot of disk space - https://phabricator.wikimedia.org/T1075#3235558 (hashar)
[14:53:01] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Upstream, Zuul: Let us customize Zuul metrics reported to statsd - https://phabricator.wikimedia.org/T1369#3235556 (hashar) Open>declined Seems statsd is strong enough to handle the metrics. Notably nowadays we have ~ 300 jobs instead of...
[14:53:53] <halfak>	 akosiaris, confirmed.  Looks good on my end. 
[14:54:16] <mobrovac>	 _joe_: \o/
[14:54:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw2205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:54:58] <mobrovac>	 _joe_: graphs looking good
[14:55:11] <icinga-wm>	 RECOVERY - Apache HTTP on mw2205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.024 second response time
[14:55:44] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Restore db1070 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/351848 (owner: Marostegui)
[14:55:45] <elukey>	 really nice indeed - https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=now-6h&to=now
[14:55:48] <elukey>	 AQS is happier :)
[14:56:03] <elukey>	 (latency going down)
[14:57:22] <wikibugs>	 (PS1) Elukey: Lower down Piwik Apache workers again [puppet] - https://gerrit.wikimedia.org/r/351856
[15:01:12] <mobrovac>	 elukey: \o/
[15:01:36] <wikibugs>	 (PS1) Milimetric: Sqoop using the pre-generated orm jar [puppet] - https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119)
[15:02:12] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Restore db1070 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/351848 (owner: Marostegui)
[15:02:24] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Restore db1070 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/351848 (owner: Marostegui)
[15:03:23] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Restore db1070 original weight - T160392 (duration: 00m 57s)
[15:03:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:32] <stashbot>	 T160392: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392
[15:06:02] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Repool db2066, depool db2059 [mediawiki-config] - https://gerrit.wikimedia.org/r/351858 (https://phabricator.wikimedia.org/T162539)
[15:06:39] <wikibugs>	 (PS1) Ema: prometheus: add tcpstat collector to default set [puppet] - https://gerrit.wikimedia.org/r/351859
[15:07:26] <wikibugs>	 (PS2) Ema: prometheus: add tcpstat collector to default set [puppet] - https://gerrit.wikimedia.org/r/351859
[15:08:15] <wikibugs>	 (CR) Elukey: [C: 2] Lower down Piwik Apache workers again [puppet] - https://gerrit.wikimedia.org/r/351856 (owner: Elukey)
[15:08:56] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Repool db2066, depool db2059 [mediawiki-config] - https://gerrit.wikimedia.org/r/351858 (https://phabricator.wikimedia.org/T162539) (owner: Marostegui)
[15:12:53] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Repool db2066, depool db2059 [mediawiki-config] - https://gerrit.wikimedia.org/r/351858 (https://phabricator.wikimedia.org/T162539) (owner: Marostegui)
[15:13:04] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Repool db2066, depool db2059 [mediawiki-config] - https://gerrit.wikimedia.org/r/351858 (https://phabricator.wikimedia.org/T162539) (owner: Marostegui)
[15:14:28] <logmsgbot>	 !log marostegui@naos Synchronized wmf-config/db-codfw.php: Repool db2066, depool db2059 - T162539 T163548 (duration: 01m 06s)
[15:14:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:36] <stashbot>	 T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539
[15:14:36] <stashbot>	 T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548
[15:15:20] <chasemp>	 !log labsdb1003 maintain-views --databases ptwikimedia,pawikisourcewbwikimedia,dtywiki --replace-all --debug T164103
[15:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:28] <stashbot>	 T164103: Generate labsdb views for dtywiki, pawikisource, ptwikimedia, wbwikimedia - https://phabricator.wikimedia.org/T164103
[15:16:17] <marostegui>	 !log Deploy alter table on wikidatawiki.wb_terms - db2059- https://phabricator.wikimedia.org/T162539 https://phabricator.wikimedia.org/T163548
[15:16:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:29] <wikibugs>	 (PS1) Rush: labsdb: add new wiki dbs to dnsrecursor.pp [puppet] - https://gerrit.wikimedia.org/r/351866
[15:29:31] <wikibugs>	 (PS2) Filippo Giunchedi: traffic: swift temporary a/a [puppet] - https://gerrit.wikimedia.org/r/351852
[15:30:30] <godog>	 bblack ema ^ going with https://gerrit.wikimedia.org/r/#/c/351852 and https://gerrit.wikimedia.org/r/#/c/351853/ for swift switchback
[15:30:32] <_joe_>	 godog: I'll switch the internal url in the meantime, ok?
[15:30:42] <godog>	 _joe_: yep, thanks!
[15:30:42] <bblack>	 godog: ack
[15:30:57] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] traffic: swift temporary a/a [puppet] - https://gerrit.wikimedia.org/r/351852 (owner: Filippo Giunchedi)
[15:30:59] <wikibugs>	 (CR) BBlack: [C: 1] traffic: swift temporary a/a [puppet] - https://gerrit.wikimedia.org/r/351852 (owner: Filippo Giunchedi)
[15:31:12] <wikibugs>	 (CR) BBlack: [C: 1] traffic: swift a/p in eqiad only [puppet] - https://gerrit.wikimedia.org/r/351853 (owner: Filippo Giunchedi)
[15:31:18] <logmsgbot>	 !log oblivian: Setting switft-rw in codfw DOWN
[15:31:21] <logmsgbot>	 !log oblivian: Setting swift-rw in eqiad UP
[15:31:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:15] <godog>	 !log run-puppet-agent on cache_upload in codfw/swift for swift a/a
[15:32:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:04] <wikibugs>	 (PS1) Addshore: Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407)
[15:33:19] <wikibugs>	 (PS2) Filippo Giunchedi: traffic: swift a/p in eqiad only [puppet] - https://gerrit.wikimedia.org/r/351853
[15:33:21] <wikibugs>	 (CR) Addshore: [C: -2] "Needs feature to be merged and deployed" [mediawiki-config] - https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: Addshore)
[15:33:21] <apergos>	 addshore: please hold on that for right now
[15:33:24] <apergos>	 ah thanks
[15:33:26] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] Reading Web Page Previews alerts [puppet] - https://gerrit.wikimedia.org/r/350377 (owner: Phuedx)
[15:33:31] <wikibugs>	 (PS5) Alexandros Kosiaris: Reading Web Page Previews alerts [puppet] - https://gerrit.wikimedia.org/r/350377 (owner: Phuedx)
[15:33:51] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on cp1008 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-swift-rw.state is broken
[15:33:52] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on radon is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-swift-rw.state is broken
[15:34:10] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] Reading Web Page Previews alerts [puppet] - https://gerrit.wikimedia.org/r/350377 (owner: Phuedx)
[15:34:11] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on baham is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-swift-rw.state is broken
[15:34:17] <phuedx>	 \o/
[15:34:21] <wikibugs>	 (PS20) Gehel: maps - move to role / profile [puppet] - https://gerrit.wikimedia.org/r/347006
[15:34:21] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on eeden is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-swift-rw.state is broken
[15:34:42] <godog>	 nooo rebase wars /o\
[15:34:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: Addshore)
[15:34:49] <chasemp>	 !log add cwd to acl*procurement-review for phab S4
[15:34:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:08] <gehel>	 godog: I'm not planning to merge, just trying to keep that change up to date...
[15:35:13] <wikibugs>	 (CR) Filippo Giunchedi: [V: 2 C: 2] traffic: swift a/p in eqiad only [puppet] - https://gerrit.wikimedia.org/r/351853 (owner: Filippo Giunchedi)
[15:35:20] <wikibugs>	 (PS3) Filippo Giunchedi: traffic: swift a/p in eqiad only [puppet] - https://gerrit.wikimedia.org/r/351853
[15:35:21] <akosiaris>	 godog: sorry my bad
[15:35:36] <akosiaris>	 I should not have merged
[15:35:43] <godog>	 akosiaris: haha no worries
[15:35:53] <gehel>	 oh, that wasn't me...
[15:35:53] <apergos>	 I should have spammed people with an announcement
[15:35:53] <wikibugs>	 (CR) Filippo Giunchedi: [V: 2 C: 2] traffic: swift a/p in eqiad only [puppet] - https://gerrit.wikimedia.org/r/351853 (owner: Filippo Giunchedi)
[15:35:55] <wikibugs>	 Puppet, Labs: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941#3235671 (bd808) p:Triage>Lowest Patches are of course always welcome, but this seems like a pretty delicate operation to perform via Puppet for what is in reality a seldom used edge...
[15:36:38] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: codfw:  ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3235673 (Papaul) p:Triage>Normal a:akosiaris>Papaul
[15:36:44] <godog>	 !log run-puppet-agent on cache_upload in codfw/swift for swift a/p in codfw
[15:36:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:41] <godog>	 gehel: yeah I think it counts only in the merge/submit case
[15:38:01] <wikibugs>	 (PS2) Rush: labsdb: add new wiki dbs to dnsrecursor.pp [puppet] - https://gerrit.wikimedia.org/r/351866
[15:38:36] <godog>	 ok all done, I'll keep an eye on swift dashboards
[15:38:56] <akosiaris>	 ok
[15:38:57] <apergos>	 you're on for 20 mins from now with naos -> tin, right?
[15:39:08] <godog>	 apergos: that's correct
[15:39:14] <apergos>	 great
[15:40:53] <addshore>	 apergos: ahh yes, don't worry, not doing it now, just preparing the patch!
[15:41:08] <apergos>	 yep saw that, thanks!
[15:41:13] <godog>	 _joe_: the confd template errors above are expected? Compilation of file '/var/lib/gdnsd/discovery-swift-rw.state' is broken
[15:42:04] <bblack>	 !log nginx upgraded to 1.11.10-1+wmf1 on cp1045 (cache_misc)
[15:42:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:54] <chasemp>	 godog: are we holding on merges atm or can I run w/ https://gerrit.wikimedia.org/r/#/c/351866/ ?
[15:43:54] <_joe_>	 godog: still broken?
[15:44:25] <_joe_>	 godog: as in: someone went to radon and looked at /var/log/confd.log?
[15:44:48] <godog>	 chasemp: yeah you can go, thanks for checking
[15:45:00] <godog>	 _joe_: no I haven't looked yet, I was checking icinga tho
[15:45:03] <wikibugs>	 (CR) Rush: [C: 2] labsdb: add new wiki dbs to dnsrecursor.pp [puppet] - https://gerrit.wikimedia.org/r/351866 (owner: Rush)
[15:45:03] <bblack>	 !log nginx upgraded to 1.11.10-1+wmf1 on cp1051 (cache_misc)
[15:45:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:41] <chasemp>	 godog: kk thanks (and np)
[15:45:47] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=swift-rw,name=codfw
[15:45:52] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: codfw:  ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3235683 (Papaul)
[15:45:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:57] <wikibugs>	 (PS1) Gehel: maps - align configuration for all maps clusters [puppet] - https://gerrit.wikimedia.org/r/351872 (https://phabricator.wikimedia.org/T160215)
[15:46:04] <_joe_>	 godog: should recover nw
[15:46:12] <_joe_>	 *now
[15:47:33] <wikibugs>	 (PS3) BBlack: cache_upload: add maps functionality [puppet] - https://gerrit.wikimedia.org/r/351663
[15:47:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] cache_upload: add maps functionality [puppet] - https://gerrit.wikimedia.org/r/351663 (owner: BBlack)
[15:48:23] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: codfw:  ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3235694 (Papaul) @akosiaris what HW RAID type I have to use to ganeti200[7-8]?
[15:48:42] <akosiaris>	 papaul: raid 5
[15:48:58] <papaul>	 akosiaris: thanks
[15:49:15] <godog>	 _joe_: recovered indeed
[15:49:32] <wikibugs>	 (PS4) BBlack: cache_upload: add maps functionality [puppet] - https://gerrit.wikimedia.org/r/351663
[15:51:37] <wikibugs>	 (PS5) Ema: BBR Congestion Control [puppet] - https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569)
[15:51:45] <wikibugs>	 (CR) Ema: [V: 2 C: 2] BBR Congestion Control [puppet] - https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) (owner: Ema)
[15:52:01] <wikibugs>	 (CR) Jcrespo: [C: 2] db: Remove all read traffic from x1, es2 & es3-master-eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407) (owner: Jcrespo)
[15:52:06] <wikibugs>	 (PS3) Jcrespo: db: Remove all read traffic from x1, es2 & es3-master-eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407)
[15:56:04] <wikibugs>	 (CR) jenkins-bot: db: Remove all read traffic from x1, es2 & es3-master-eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407) (owner: Jcrespo)
[15:57:00] <logmsgbot>	 !log jynus@naos Synchronized wmf-config/db-eqiad.php: Remove all read traffic from x1, es2 & es3-master-eqiad (duration: 01m 08s)
[15:57:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:50] <wikibugs>	 (CR) Filippo Giunchedi: Revert "Switch deployment CNAMEs to naos.codfw.wmnet" [dns] - https://gerrit.wikimedia.org/r/351654 (owner: Filippo Giunchedi)
[15:58:56] <wikibugs>	 (PS2) Filippo Giunchedi: Revert "Switch deployment CNAMEs to naos.codfw.wmnet" [dns] - https://gerrit.wikimedia.org/r/351654
[16:00:37] <godog>	 !log switch deployment server back to tin.eqiad.wmnet
[16:00:43] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] Revert "Switch deployment CNAMEs to naos.codfw.wmnet" [dns] - https://gerrit.wikimedia.org/r/351654 (owner: Filippo Giunchedi)
[16:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:42] <wikibugs>	 (CR) Filippo Giunchedi: Revert "Switch deployment server to naos.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/351655 (owner: Filippo Giunchedi)
[16:02:47] <wikibugs>	 (PS3) Filippo Giunchedi: Revert "Switch deployment server to naos.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/351655
[16:03:40] <urandom>	 !log T160759: restoring default Cassandra tombstone_threshold in eqiad
[16:03:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:49] <stashbot>	 T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759
[16:04:00] <wikibugs>	 (CR) Filippo Giunchedi: [V: 2 C: 2] Revert "Switch deployment server to naos.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/351655 (owner: Filippo Giunchedi)
[16:09:30] <logmsgbot>	 !log filippo@tin Started scap: README
[16:09:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:56] <wikibugs>	 (PS1) Papaul: DNS: Add mgmt dns entries for ganeti200[7-8] [dns] - https://gerrit.wikimedia.org/r/351877
[16:09:59] <logmsgbot>	 !log filippo@tin scap aborted: README (duration: 00m 28s)
[16:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:54] <godog>	 not really what I wanted to do, anyways deployment server is back to tin
[16:11:17] <godog>	 thcipriani: ^ if you'd like to test
[16:11:35] <thcipriani>	 godog: sure
[16:14:41] <logmsgbot>	 !log thcipriani@tin Synchronized README: test tin is back (duration: 01m 06s)
[16:14:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:58] <apergos>	 all righty then
[16:15:21] <thcipriani>	 godog: lgtm
[16:15:49] <godog>	 \o/ thanks
[16:15:59] <godog>	 sending an announcement email now
[16:18:35] <bblack>	 !log nginx upgraded to 1.11.10-1+wmf1 on all cache_misc
[16:18:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:00] <wikibugs>	 (CR) BryanDavis: "Do we have the ability to add this kafkatee setup in deployment-prep to test things out?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/350817 (https://phabricator.wikimedia.org/T149451) (owner: Filippo Giunchedi)
[16:23:48] <urandom>	 _joe_: re: 1018, you formatted /srv, yes?
[16:24:06] <_joe_>	 urandom: it's a raid 0 that lost one disk
[16:24:15] <thcipriani>	 addshore: I +2'd the cognate change for wmf.21 waiting for zuul now
[16:24:15] <jynus>	 gilles, I have created https://phabricator.wikimedia.org/T164504 not as a request to you, but as something to report to several teams that you should be aware of
[16:24:19] <urandom>	 _joe_: right
[16:24:22] <_joe_>	 it's not like I had a lot of options :P
[16:24:36] <wikibugs>	 Operations, ops-eqiad, Patch-For-Review: rack and setup boron replacement frpm1001 - https://phabricator.wikimedia.org/T162298#3235814 (Jgreen) Open>Resolved done!
[16:24:45] <urandom>	 _joe_: no no, that's fine, i was half expecting to have to do that, but it looked like you did, so i was just making sure :)
[16:24:58] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: codfw:  ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3235820 (Papaul)
[16:25:22] <_joe_>	 heh ok
[16:25:25] <addshore>	 thcipriani: ack!
[16:25:28] <_joe_>	 check the size of the FS though
[16:25:32] <_joe_>	 I did some magic there
[16:26:06] <urandom>	 what magic? disabled reserve?
[16:26:11] <addshore>	 thcipriani: looks like it is merged
[16:26:32] <thcipriani>	 addshore: yup, getting stuff on tin squared away
[16:26:35] <_joe_>	 I turned off the VG, deleted the raid device, re-created a raid device with the same name and the same size
[16:26:43] <_joe_>	 turned on the VG
[16:26:45] <_joe_>	 :P
[16:26:51] <thcipriani>	 addshore: anything you want to check on mwdebug1002 before I sync?
[16:26:52] <_joe_>	 it "worked"
[16:27:26] <_joe_>	 I tested I could write there
[16:27:47] <_joe_>	 and that the partition was 5+T
[16:27:52] <addshore>	 thcipriani: I can check a couple of things yes!
[16:28:07] <thcipriani>	 addshore: ok, it's pulled down on there
[16:28:12] <addshore>	 checking
[16:28:20] <thcipriani>	 addshore: I haven't pulled in the config change yet though
[16:28:31] <addshore>	 ahhhh yeh, nothing to check without the config change.
[16:28:36] <addshore>	 sync this and then I'll check the config change
[16:28:38] <thcipriani>	 didn't figure :)
[16:29:41] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1018 is OK: OK - cassandra-a is active
[16:29:50] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.29.0-wmf.21/extensions/Cognate: SWAT: [[gerrit:351867|Add read only mode]] T164407 (duration: 00m 56s)
[16:29:51] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-a valid until 2017-12-13 00:15:55 +0000 (expires in 222 days)
[16:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:58] <stashbot>	 T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[16:30:03] <thcipriani>	 addshore: ok, merging config change
[16:30:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
[16:30:41] <icinga-wm>	 RECOVERY - Restbase root url on restbase1018 is OK: HTTP OK: HTTP/1.1 200 - 15540 bytes in 0.082 second response time
[16:30:52] <wikibugs>	 (PS2) Thcipriani: Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: Addshore)
[16:31:55] <wikibugs>	 (CR) Thcipriani: [C: 2] Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: Addshore)
[16:34:26] <thcipriani>	 ^ addshore I think you have to remove the -2 for zuul to pick it up :)
[16:34:40] <wikibugs>	 (CR) Addshore: [C: 2] Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: Addshore)
[16:34:42] <jynus>	 done it for him
[16:35:10] <jynus>	 (assuming it is an administrative thing)
[16:35:31] <thcipriani>	 cool thanks :)
[16:35:55] <bblack>	 heh irccloud died?
[16:36:11] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1018 is OK: OK - cassandra-b is active
[16:36:31] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore)
[16:36:40] <wikibugs>	 (03CR) 10jenkins-bot: Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore)
[16:36:41] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1018 is OK: OK - cassandra-c is active
[16:36:41] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-b valid until 2017-12-13 00:15:56 +0000 (expires in 222 days)
[16:36:51] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-c valid until 2017-12-13 00:15:58 +0000 (expires in 222 days)
[16:37:09] <greg-g>	 bblack: yup :)
[16:37:27] <thcipriani>	 addshore: config change live on mwdebug1002, check please
[16:37:34] <addshore>	 ack, checking
[16:39:27] <addshore>	 thcipriani: checked everything I can, ready to push it out!
[16:39:35] <thcipriani>	 addshore: ok, going
[16:42:31] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config: [[gerrit:351868|Enable Cognate for Wiktionary in Read Only mode]] T164407 (duration: 00m 40s)
[16:42:39] <thcipriani>	 ^ addshore live now
[16:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:40] <stashbot>	 T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[16:42:52] <addshore>	 thcipriani: ack, im watching all sorts of graphs and logs
[16:43:37] <addshore>	 hmm wait, something is off
[16:43:46] <addshore>	 https://www.irccloud.com/pastebin/x5VFWM3H/
[16:44:42] <addshore>	 bah, a null got lost in a rebase
[16:45:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] prometheus: add tcpstat collector to default set [puppet] - 10https://gerrit.wikimedia.org/r/351859 (owner: 10Ema)
[16:45:27] <addshore>	 thcipriani: https://gerrit.wikimedia.org/r/#/c/351881/ is also needed
[16:45:59] <thcipriani>	 addshore: I see that
[16:48:02] <wikibugs>	 (03CR) 10BBlack: [C: 031] varnish: swap around backend ttl cap and keep values [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/343845 (https://phabricator.wikimedia.org/T124954) (owner: 10Ema)
[16:48:23] <thcipriani>	 addshore: I'm going to revert while we wait for this to merge
[16:48:28] <addshore>	 thcipriani: ack!
[16:48:32] <wikibugs>	 (03PS4) 10Ema: varnish: swap around backend ttl cap and keep values [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/343845 (https://phabricator.wikimedia.org/T124954)
[16:48:38] <wikibugs>	 (03CR) 10Ema: [V: 032 C: 032] varnish: swap around backend ttl cap and keep values [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/343845 (https://phabricator.wikimedia.org/T124954) (owner: 10Ema)
[16:49:41] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config: Revert [[gerrit:351868|Enable Cognate for Wiktionary in Read Only mode]] T164407 (duration: 00m 40s)
[16:49:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:49] <stashbot>	 T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[16:50:54] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Parallelize url fetching [software/service-checker] - 10https://gerrit.wikimedia.org/r/351110
[16:50:56] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add statsd support [software/service-checker] - 10https://gerrit.wikimedia.org/r/351882
[16:53:19] <elukey>	 urandom: is that you for restbase1018?
[16:53:55] <elukey>	 probably yes
[16:54:44] <urandom>	 elukey: yes
[16:55:05] <elukey>	 ah nice.. can you log it just for the records?
[16:55:10] <urandom>	 just doing that
[16:55:13] <elukey>	 super
[16:55:29] <urandom>	 i was trying to get the bootstrap to actually start, so that the log matched that... but then, gah
[16:55:40] <urandom>	 !log T163292: Starting bootstrap of restbase1018-a
[16:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:49] <stashbot>	 T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet  - https://phabricator.wikimedia.org/T163292
[16:56:09] <urandom>	 elukey: (but it is started, now)
[16:56:50] <addshore>	 thcipriani: merged :)
[16:57:15] <wikibugs>	 (03PS3) 10Ema: prometheus: add tcpstat collector to default set [puppet] - 10https://gerrit.wikimedia.org/r/351859
[16:57:29] <wikibugs>	 (03CR) 10Ema: [V: 032 C: 032] prometheus: add tcpstat collector to default set [puppet] - 10https://gerrit.wikimedia.org/r/351859 (owner: 10Ema)
[16:57:48] * thcipriani pushes live
[16:58:21] <icinga-wm>	 PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[16:59:19] <ema>	 looking ^
[16:59:35] <elukey>	 urandom: sure sure I just wanted to make sure that it was in the sal, that's it :)
[16:59:49] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.29.0-wmf.21/extensions/Cognate/src/CognateStore.php: [[gerrit:351881|Construct DBReadOnlyError with null db]] (duration: 00m 39s)
[16:59:55] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3236011 (10RobH)
[16:59:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:05] <thcipriani>	 ^ addshore live, going to revert my revert now
[17:00:10] <addshore>	 ack!
[17:00:21] <icinga-wm>	 RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[17:00:28] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140543 (10RobH) 05stalled>03Resolved Yes, this is a new system, so it's reclaimed to spares, not decommissioned...
[17:01:22] <wikibugs>	 (03PS1) 10Chad: Gerrit: Turn replication back on [puppet] - 10https://gerrit.wikimedia.org/r/351884
[17:01:51] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config: Revert revert [[gerrit:351868|Enable Cognate for Wiktionary in Read Only mode]] T164407 (duration: 00m 40s)
[17:01:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:59] <stashbot>	 T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[17:02:02] <thcipriani>	 ^ addshore live again
[17:02:07] <addshore>	 ack!
[17:02:12] <wikibugs>	 (03CR) 10Paladox: [C: 031] Gerrit: Turn replication back on [puppet] - 10https://gerrit.wikimedia.org/r/351884 (owner: 10Chad)
[17:02:13] <addshore>	 again, checking all of the places :)
[17:02:18] <addshore>	 and I'll continue to watch it for a while
[17:02:56] <wikibugs>	 (03PS2) 10Dzahn: Gerrit: Turn replication back on [puppet] - 10https://gerrit.wikimedia.org/r/351884 (owner: 10Chad)
[17:03:07] <wikibugs>	 06Operations: Reduce rpcbind use - https://phabricator.wikimedia.org/T106477#3236038 (10MoritzMuehlenhoff) nfs-common and rpcbind get installed during the initial d-i base installation. At this point our apt config to not install recommended packages is not yet in place (and I've also found no preseed option to...
[17:05:13] <thcipriani>	 cool.  declaring deployment complete.
[17:06:24] <addshore>	 thcipriani: how about also https://gerrit.wikimedia.org/r/#/c/351845 which I mentioned in #wikimedia-releng ? It would let us watch these queries much closer
[17:07:34] <thcipriani>	 addshore: ah, right. wanna cherry-pick that one and we'll get it out?
[17:07:50] <addshore>	 thcipriani: https://gerrit.wikimedia.org/r/#/c/351886/
[17:08:15] <wikibugs>	 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Reinstall  Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3236053 (10Nuria)
[17:08:17] <wikibugs>	 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 15User-Elukey: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333#3095234 (10Nuria) 05Open>03Resolved
[17:10:46] <wikibugs>	 (03CR) 10Dzahn: [C: 032] Gerrit: Turn replication back on [puppet] - 10https://gerrit.wikimedia.org/r/351884 (owner: 10Chad)
[17:15:36] <coreyfloyd>	 Anyone know if something is going on with the WMFgerrit on github? It was just used to delete some branches in the ios repository
[17:17:03] <wikibugs>	 (03PS2) 10Dzahn: DNS: Add mgmt dns entries for ganeti200[7-8] [dns] - 10https://gerrit.wikimedia.org/r/351877 (owner: 10Papaul)
[17:17:18] <wikibugs>	 (03CR) 10Dzahn: [C: 032] DNS: Add mgmt dns entries for ganeti200[7-8] [dns] - 10https://gerrit.wikimedia.org/r/351877 (owner: 10Papaul)
[17:17:22] <addshore>	 ahhh jenkins
[17:18:09] <paladox>	 hmm, not sure maybe related to https://gerrit.wikimedia.org/r/351884 ?
[17:18:17] <coreyfloyd>	 https://usercontent.irccloud-cdn.com/file/cUZobjOL/IMG_3112.PNG
[17:18:19] <paladox>	 coreyfloyd ^^
[17:19:02] <coreyfloyd>	 paladox: so is this starting to replicate gerrit changes back to the iOS repo?
[17:19:28] <addshore>	 thcipriani: looks like jenkins finally merged that one!
[17:19:51] <coreyfloyd>	 I think that was turned off before... instead we only wanted to replicate github back to gerrit
[17:19:58] <paladox>	 nope, thats replicating to the new codfw server for gerrit. But the timing looks like it could be caused by that.
[17:19:59] <paladox>	 mutante ^^
[17:20:01] <thcipriani>	 addshore: live on mwdebug1002
[17:20:05] <addshore>	 checking
[17:20:19] <wikibugs>	 06Operations, 06Labs, 13Patch-For-Review: (don't) decom promethium - https://phabricator.wikimedia.org/T164395#3236110 (10Dzahn) Afaict the host is exactly half-way between the realms in neverland, being "metal labs" and such, heh
[17:20:40] <mutante>	 paladox: what is the question please, i have no context
[17:21:00] <paladox>	 mutante it seems wmfgerrit deleted some projects on the ios project.
[17:21:17] <paladox>	 It may be related to https://gerrit.wikimedia.org/r/351884 (not really sure though)
[17:21:32] <paladox>	 question originally from coreyfloyd
[17:21:41] <RainbowSprinkles>	 On github? It doesn't have replicateProjectDeletions set
[17:21:48] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3236115 (10jcrespo) I can confirm the only use it had in production (x1 slave on codfw) has been retired: https://tendril.wikimedia.org/tree https://noc.wikimedia.org/conf/highlig...
[17:21:50] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "This needs to be synchronized with reimaging maps-test servers." [puppet] - 10https://gerrit.wikimedia.org/r/351872 (https://phabricator.wikimedia.org/T160215) (owner: 10Gehel)
[17:22:01] <RainbowSprinkles>	 Wait, deleted?
[17:22:02] <RainbowSprinkles>	 What?
[17:22:17] <paladox>	 RainbowSprinkles https://usercontent.irccloud-cdn.com/file/cUZobjOL/IMG_3112.PNG
[17:22:29] <RainbowSprinkles>	 That's not projects.
[17:22:31] <RainbowSprinkles>	 That's branches.
[17:22:47] <paladox>	 oh woops
[17:22:58] <paladox>	 I mixed my words up, i meant branches (sorry)
[17:23:07] <RainbowSprinkles>	 coreyfloyd: So, I forced a replication of everything. Tbh, if we *don't* want things to *ever* replicate to github, we need to configure that explicitly.
[17:23:18] <RainbowSprinkles>	 (right now, you've just been relying on "nobody's using it on gerrit, so it won't replicate")
[17:23:28] <addshore>	 looks good (nothing broken) still waiting for things to show up in graphite << thcipriani 
[17:25:14] <addshore>	 thcipriani: data is appearing, all looks good to rollout!
[17:25:22] <thcipriani>	 addshore: cool, going live.
[17:25:38] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 2 others: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#2247220 (10Deskana) p:05High>03Low This remains a valid issue, but has not been touched in a while. Changing priority accordingly.
[17:25:56] <wikibugs>	 (03PS1) 10Andrew Bogott: bootstrap-vz:  Include kpartx package [puppet] - 10https://gerrit.wikimedia.org/r/351889
[17:25:58] <wikibugs>	 (03PS1) 10Andrew Bogott: bootstrap-vz:  Add a manifest for a Stretch labs image [puppet] - 10https://gerrit.wikimedia.org/r/351890
[17:26:31] <RainbowSprinkles>	 coreyfloyd: I blocked replication of wikipedia-ios explicitly. Shouldn't happen again
[17:27:53] <wikibugs>	 (03CR) 10Paladox: bootstrap-vz:  Add a manifest for a Stretch labs image (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351890 (owner: 10Andrew Bogott)
[17:28:13] <wikibugs>	 (03CR) 10Paladox: bootstrap-vz:  Add a manifest for a Stretch labs image (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351890 (owner: 10Andrew Bogott)
[17:29:02] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] bootstrap-vz:  Include kpartx package [puppet] - 10https://gerrit.wikimedia.org/r/351889 (owner: 10Andrew Bogott)
[17:29:47] <wikibugs>	 (03PS1) 10Catrope: Enable Flow beta feature on cawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351891 (https://phabricator.wikimedia.org/T164498)
[17:30:06] <wikibugs>	 06Operations, 06Labs, 10hardware-requests: Codfw: (2) hardware access request for labtest [region 2] - https://phabricator.wikimedia.org/T161766#3236167 (10chasemp)
[17:30:17] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.29.0-wmf.21/extensions/Cognate: [[gerrit:351886|Add stats tracking for CognateRepo method usage]] (duration: 00m 39s)
[17:30:22] <thcipriani>	 ^ addshore live
[17:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:29] <addshore>	 thanks!
[17:30:36] <wikibugs>	 06Operations, 06Labs, 10hardware-requests: Codfw: (2) hardware access request for labtest [region 2] - https://phabricator.wikimedia.org/T161766#3142263 (10chasemp) Updated description from:  > Number of systems: 1  to the correct  > Number of systems: 2
[17:31:01] <thcipriani>	 *now* I'll declare deployment complete
[17:32:08] <bblack>	 !log nginx upgrading to 1.11.10-1+wmf1 on cache_maps
[17:32:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:38] <wikibugs>	 06Operations, 10hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#3236179 (10RobH) So one of these needs to adjust to 32GB of RAM.
[17:48:50] <wikibugs>	 06Operations, 10hardware-requests: codfw: (1) labs puppetmaster - https://phabricator.wikimedia.org/T164515#3236250 (10RobH)
[17:49:03] <wikibugs>	 06Operations, 10hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919499 (10RobH)
[17:49:19] <wikibugs>	 06Operations, 10hardware-requests: codfw: (2) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919499 (10RobH)
[17:51:19] <coreyfloyd>	 RainbowSprinkles: thanks. Sorry had to go to a meeting
[17:54:39] <wikibugs>	 (03PS2) 10Andrew Bogott: bootstrap-vz:  Add a manifest for a Stretch labs image [puppet] - 10https://gerrit.wikimedia.org/r/351890
[18:00:54] <wikibugs>	 (03PS3) 10Andrew Bogott: bootstrap-vz:  Add a manifest for a Stretch labs image [puppet] - 10https://gerrit.wikimedia.org/r/351890
[18:02:30] <addshore>	 thcipriani: are jynus and I okay to make a couple of small wiktionaries write to the cognate dbs again so that we can monitor the queries? (just making sure I don't get in the way of anything)
[18:02:31] <icinga-wm>	 PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 106, down: 1, dormant: 0, excluded: 3, unused: 0; ge-11/0/6: down - db1025
[18:04:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] bootstrap-vz:  Add a manifest for a Stretch labs image [puppet] - 10https://gerrit.wikimedia.org/r/351890 (owner: 10Andrew Bogott)
[18:04:10] <thcipriani>	 addshore: I'm not doing anything on the servers currently, I think the datacenter switchover stuff is done for now, but jynus would know more than I would :)
[18:04:19] <addshore>	 ack!
[18:05:44] <wikibugs>	 06Operations, 10Traffic: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3236376 (10BBlack) Yeah, @Faidon has brought up a similar argument before on a slightly different level: that we shouldn't be using nginx-full on most hosts anyways, since we use virtually none of the...
[18:07:10] <wikibugs>	 (03PS1) 10Andrew Bogott: bootstrap-vz:  fix c/p error [puppet] - 10https://gerrit.wikimedia.org/r/351895
[18:10:43] <wikibugs>	 (03CR) 10Paladox: [C: 031] bootstrap-vz:  fix c/p error [puppet] - 10https://gerrit.wikimedia.org/r/351895 (owner: 10Andrew Bogott)
[18:12:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] bootstrap-vz:  fix c/p error [puppet] - 10https://gerrit.wikimedia.org/r/351895 (owner: 10Andrew Bogott)
[18:12:17] <wikibugs>	 (03PS1) 10Addshore: wgCognateReadOnly false for 'small' wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351911 (https://phabricator.wikimedia.org/T164407)
[18:13:37] <wikibugs>	 (03CR) 10Addshore: [C: 032] wgCognateReadOnly false for 'small' wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351911 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore)
[18:13:40] <addshore>	 jynus: ^^
[18:14:16] <wikibugs>	 06Operations, 10Cassandra, 13Patch-For-Review, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3236403 (10Cmjohnson)
[18:14:48] <wikibugs>	 06Operations, 10ops-eqiad, 15User-fgiunchedi: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T163690#3236404 (10Cmjohnson) 05Open>03Resolved Resolved...the old disk has been dropped off for return
[18:15:27] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] wgCognateReadOnly false for 'small' wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351911 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore)
[18:16:00] <wikibugs>	 (03Merged) 10jenkins-bot: wgCognateReadOnly false for 'small' wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351911 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore)
[18:16:07] <wikibugs>	 (03CR) 10jenkins-bot: wgCognateReadOnly false for 'small' wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351911 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore)
[18:17:27] <addshore>	 syncing
[18:18:04] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T164407 [[gerrit:351911|wgCognateReadOnly false for small wikis]] (duration: 00m 40s)
[18:18:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:12] <stashbot>	 T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[18:18:31] <addshore>	 jynus: right, we should start to see a few writes now
[18:20:30] <wikibugs>	 (03PS1) 10Andrew Bogott: Labs/salt:  Update master finger for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/351914
[18:20:52] <wikibugs>	 06Operations, 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3236468 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson The warranty expired 2 years ago,  disks have been removed and destroyed. Removed from rack and updated racktab...
[18:22:28] <Reedy>	 jouncebot: now
[18:22:28] <jouncebot>	 No deployments scheduled for the next 90 hour(s) and 37 minute(s)
[18:22:31] <Reedy>	 jouncebot: next
[18:22:33] <jouncebot>	 In 90 hour(s) and 37 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1300)
[18:23:25] <jynus>	 I see some deadlocks
[18:24:03] <addshore>	 There are also some deadlocks between 17:38 and 17:58
[18:25:56] <addshore>	 db1031 just had a large spike in QPS but i guess that is unrelated?
[18:27:26] <addshore>	 jynus: do you actually see writes coming in?
[18:28:59] <wikibugs>	 06Operations: Clean up wikimedia's apt repo - https://phabricator.wikimedia.org/T164521#3236503 (10Paladox)
[18:30:40] <wikibugs>	 (03PS1) 10Cmjohnson: Removing all dns entries for decom servers boron,barium,db1025,lutetium and silicon [dns] - 10https://gerrit.wikimedia.org/r/351917
[18:30:51] <cmjohnson1>	 jeff_green please review that ^^
[18:31:42] <Jeff_Green>	 looking
[18:33:03] <wikibugs>	 (03CR) 10Jgreen: [C: 031] Removing all dns entries for decom servers boron,barium,db1025,lutetium and silicon [dns] - 10https://gerrit.wikimedia.org/r/351917 (owner: 10Cmjohnson)
[18:33:32] <wikibugs>	 (03CR) 10Cmjohnson: [C: 032] Removing all dns entries for decom servers boron,barium,db1025,lutetium and silicon [dns] - 10https://gerrit.wikimedia.org/r/351917 (owner: 10Cmjohnson)
[18:42:33] <wikibugs>	 06Operations, 10ops-eqiad: Decommission wmf3096 - https://phabricator.wikimedia.org/T147860#3236566 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson removed ssds to be separately. removed switch port information, removed from rack and updated racktables
[18:51:22] <wikibugs>	 (03PS1) 10Jdlrobson: Clean up inappropriate usages of wmg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351922 (https://phabricator.wikimedia.org/T151891)
[18:54:59] <wikibugs>	 (03PS1) 10Addshore: wgCognateReadOnly false for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351923 (https://phabricator.wikimedia.org/T164407)
[18:55:12] <addshore>	 jynus: ^^
[18:57:40] <wikibugs>	 (03CR) 10Addshore: [C: 032] wgCognateReadOnly false for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351923 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore)
[18:59:20] <wikibugs>	 (03Merged) 10jenkins-bot: wgCognateReadOnly false for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351923 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore)
[18:59:29] <wikibugs>	 (03CR) 10jenkins-bot: wgCognateReadOnly false for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351923 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore)
[18:59:55] <addshore>	 syncing
[19:01:50] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T164407 [[gerrit:351923|wgCognateReadOnly false for medium wikis]] (duration: 00m 39s)
[19:01:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:58] <stashbot>	 T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[19:11:05] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 10hardware-requests, 13Patch-For-Review: Requesting 1 spare misc box for Gerrit in codfw - https://phabricator.wikimedia.org/T148187#3236674 (10demon)
[19:11:07] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#3236671 (10demon) 05Open>03Resolved Gerrit running on `gerrit2001.wikimedia.org` in codfw. Git data is being replicated just fine.
[19:12:52] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Build warm slave for Gerrit in Dallas - https://phabricator.wikimedia.org/T148186#3236675 (10demon) 05Open>03Resolved a:03demon Spare is running in Dallas, data is being replicated in real time so I think we're warm.  Only improv...
[19:14:24] <mutante>	 ^ :)) gerrit2001 is now in actual use and we have replication 
[19:14:29] <mutante>	 sweet, eh
[19:14:33] <paladox>	 :)
[19:18:27] <wikibugs>	 06Operations, 06Labs, 10wikitech.wikimedia.org, 07HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#3236727 (10Jdforrester-WMF)
[19:26:10] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#3236823 (10Dzahn)
[19:43:11] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[19:43:32] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:44:51] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:49:31] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:50:51] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:51:12] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:01:18] <bblack>	 there was an esams 5xx spike, very narrow around 19:39 UTC
[20:01:31] <bblack>	 (the msgs above are the delayed report -> recovery for it)
[20:01:45] <bblack>	 it hit all clusters proportionally, likely an actual network blip
[20:04:48] <wikibugs>	 06Operations: Clean up wikimedia's apt repo - https://phabricator.wikimedia.org/T164521#3236942 (10Dzahn) precise and lucid are removed from APT config for all purposes:  https://gerrit.wikimedia.org/r/#/c/345838/  the only thing that isn't happening is that files are actively purged from reprepros database and...
[20:06:27] <logmsgbot>	 !log maxsem@tin Synchronized php-1.29.0-wmf.21/extensions/JsonConfig: https://gerrit.wikimedia.org/r/#/c/351749/ (duration: 00m 40s)
[20:06:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:18] <wikibugs>	 (03PS7) 10Dzahn: icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572
[20:23:11] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on baham is OK: No errors detected
[20:23:22] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on eeden is OK: No errors detected
[20:24:01] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on cp1008 is OK: No errors detected
[20:24:01] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on radon is OK: No errors detected
[20:36:50] <wikibugs>	 06Operations, 10LDAP-Access-Requests, 06Labs, 10Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#973669 (10bd808) >>! In T86668#3234234, @hashar wrote: > There are only two account in LDAP with shells not being `/bin/bash`: > ``` > $ ldapsear...
[21:12:15] <wikibugs>	 (03CR) 10Ottomata: [C: 031] "One nit, +1 otherwise." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119) (owner: 10Milimetric)
[21:15:31] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0]
[21:34:04] <wikibugs>	 (03PS2) 10Milimetric: Sqoop using the pre-generated orm jar [puppet] - 10https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119)
[21:34:08] <wikibugs>	 (03CR) 10Milimetric: Sqoop using the pre-generated orm jar (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119) (owner: 10Milimetric)
[21:34:24] <wikibugs>	 (03CR) 10Milimetric: "sure, I can deploy with Luca Monday" [puppet] - 10https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119) (owner: 10Milimetric)
[21:38:25] <wikibugs>	 (03PS1) 10Dzahn: add passwords::icinga with notsecret-fake password [labs/private] - 10https://gerrit.wikimedia.org/r/352043
[21:39:35] <wikibugs>	 (03CR) 10Dzahn: [V: 032 C: 032] add passwords::icinga with notsecret-fake password [labs/private] - 10https://gerrit.wikimedia.org/r/352043 (owner: 10Dzahn)
[21:48:57] <wikibugs>	 (03PS1) 10RobH: return tempdb2001 to spares [puppet] - 10https://gerrit.wikimedia.org/r/352044
[21:50:19] <wikibugs>	 (03PS1) 10RobH: return tempdb2001 to spares [dns] - 10https://gerrit.wikimedia.org/r/352045
[21:52:36] <wikibugs>	 (03CR) 10RobH: [C: 032] return tempdb2001 to spares [puppet] - 10https://gerrit.wikimedia.org/r/352044 (owner: 10RobH)
[21:53:02] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3237193 (10RobH)
[21:59:36] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3237222 (10RobH)
[22:00:35] <wikibugs>	 06Operations: Allow Qualtrics to send @wikimedia.org emails using an SPF record or an SMTP relay - https://phabricator.wikimedia.org/T164424#3237223 (10bbogaert) Hi @faidon ,  I found a workaround to avert our security concerns, and use gmail smtp, because Neil did not need the "reply-to" address to be qualtrics...
[22:08:47] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/6300/" [puppet] - 10https://gerrit.wikimedia.org/r/351572 (owner: 10Dzahn)
[22:13:47] <wikibugs>	 (03CR) 10Dzahn: [C: 032] icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572 (owner: 10Dzahn)
[22:13:57] <wikibugs>	 (03PS8) 10Dzahn: icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572
[22:26:47] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3237246 (10RobH)
[22:26:55] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3236011 (10RobH) a:05RobH>03Papaul
[22:27:06] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3236011 (10RobH)
[22:31:04] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
[22:37:36] <wikibugs>	 (03PS1) 10Dzahn: icinga: adjust nsca template from "nagios" to "icinga" [puppet] - 10https://gerrit.wikimedia.org/r/352051
[22:43:02] <wikibugs>	 (03PS2) 10Dzahn: icinga: adjust nsca template from "nagios" to "icinga" [puppet] - 10https://gerrit.wikimedia.org/r/352051
[22:43:13] <wikibugs>	 (03CR) 10Dzahn: [C: 032] icinga: adjust nsca template from "nagios" to "icinga" [puppet] - 10https://gerrit.wikimedia.org/r/352051 (owner: 10Dzahn)
[22:43:57] <wikibugs>	 (03CR) 10Dzahn: "follow-up  https://gerrit.wikimedia.org/r/#/c/352051/" [puppet] - 10https://gerrit.wikimedia.org/r/351572 (owner: 10Dzahn)