[00:04:16] (PS1) BBlack: cp4021: use macaddr from second 10Gbps interface [puppet] - https://gerrit.wikimedia.org/r/351742 (https://phabricator.wikimedia.org/T164327) [00:04:26] (CR) BBlack: [V: 2 C: 2] cp4021: use macaddr from second 10Gbps interface [puppet] - https://gerrit.wikimedia.org/r/351742 (https://phabricator.wikimedia.org/T164327) (owner: BBlack) [00:11:57] Operations, ops-ulsfo, Traffic, Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3233839 (BBlack) ok I'm installing jessie onto cp4021 now (just to test configuration issues and patch up puppet for the real installs later!). Things I found while trying to... [00:39:49] (PS1) BBlack: cp4021-32 ipv6 revdns [dns] - https://gerrit.wikimedia.org/r/351744 [00:40:01] Operations: Installer assumes eth0 is the used interface - https://phabricator.wikimedia.org/T164444#3233877 (Volans) [00:40:34] (CR) BBlack: [C: 2] cp4021-32 ipv6 revdns [dns] - https://gerrit.wikimedia.org/r/351744 (owner: BBlack) [00:40:36] (Restored) Jdlrobson: Disable RelatedSites on English, French and Italian Wikivoyages [mediawiki-config] - https://gerrit.wikimedia.org/r/335830 (https://phabricator.wikimedia.org/T128326) (owner: Jdlrobson) [00:46:10] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:46:10] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:46:10] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:40] PROBLEM - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [00:46:40] PROBLEM - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.202 and port 9042: Connection refused [00:47:09] Operations, ops-ulsfo, Traffic, Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3233902 (BBlack) With the cable in the second port + T164444 we're blocked on getting a successful test install. @RobH is asking smart hands to swap the cable, and I'll proce... [00:47:10] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [00:47:10] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [00:47:10] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [00:47:10] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [00:47:10] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected
status 500 (expecting: 200) [00:47:11] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [00:47:11] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [00:47:12] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:47:12] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [00:47:13] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:48:00] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [00:48:01] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [00:48:01] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [00:48:01] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [00:48:01] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [00:48:01] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [00:48:01] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [00:48:02] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [00:48:10] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [00:48:40] PROBLEM - cassandra-a service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [00:48:40] PROBLEM - Check systemd state on restbase1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[00:49:20] PROBLEM - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.138 and port 9042: Connection refused [00:49:30] PROBLEM - cassandra-b SSL 10.64.0.231:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [00:49:40] PROBLEM - cassandra-a SSL 10.64.48.138:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [00:49:41] PROBLEM - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.231 and port 9042: Connection refused [00:49:50] PROBLEM - cassandra-b service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [00:50:10] PROBLEM - Check systemd state on restbase1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:50:50] PROBLEM - Check systemd state on restbase1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[00:50:50] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [00:53:40] RECOVERY - cassandra-a service on restbase1012 is OK: OK - cassandra-a is active [00:53:40] RECOVERY - Check systemd state on restbase1012 is OK: OK - running: The system is fully operational [00:55:40] RECOVERY - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is OK: SSL OK - Certificate restbase1012-a valid until 2017-09-12 15:34:10 +0000 (expires in 131 days) [00:55:40] RECOVERY - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is OK: TCP OK - 0.038 second response time on 10.64.32.202 port 9042 [00:56:11] RECOVERY - Check systemd state on restbase1007 is OK: OK - running: The system is fully operational [00:56:50] RECOVERY - cassandra-b service on restbase1007 is OK: OK - cassandra-b is active [00:57:30] RECOVERY - cassandra-b SSL 10.64.0.231:7001 on restbase1007 is OK: SSL OK - Certificate restbase1007-b valid until 2017-09-12 15:33:32 +0000 (expires in 131 days) [00:57:40] RECOVERY - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is OK: TCP OK - 0.141 second response time on 10.64.0.231 port 9042 [01:02:10] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:10] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:10] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:10] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:02:40] PROBLEM - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:02:41] PROBLEM - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.202 and port 9042: Connection refused [01:03:00] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [01:03:00] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [01:03:00] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [01:03:10] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [01:03:40] PROBLEM - cassandra-a service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [01:03:40] PROBLEM - Check systemd state on restbase1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:06:50] RECOVERY - Check systemd state on restbase1015 is OK: OK - running: The system is fully operational [01:06:50] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active [01:08:20] RECOVERY - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is OK: TCP OK - 0.036 second response time on 10.64.48.138 port 9042 [01:08:40] RECOVERY - cassandra-a SSL 10.64.48.138:7001 on restbase1015 is OK: SSL OK - Certificate restbase1015-a valid until 2017-09-12 15:34:34 +0000 (expires in 131 days) [01:09:50] !log T160759: starting restbase1012-a [01:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:00] T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759 [01:10:40] RECOVERY - cassandra-a service on restbase1012 is OK: OK - cassandra-a is active [01:10:41] RECOVERY - Check systemd state on restbase1012 is OK: OK - running: The system is fully operational [01:11:40] RECOVERY - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is OK: SSL OK - 
Certificate restbase1012-a valid until 2017-09-12 15:34:10 +0000 (expires in 131 days) [01:11:41] RECOVERY - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is OK: TCP OK - 0.037 second response time on 10.64.32.202 port 9042 [01:12:06] Operations: Allow Qualtrics to send @wikimedia.org emails using an SPF record or an SMTP relay - https://phabricator.wikimedia.org/T164424#3233129 (faidon) We don't generally add third-parties to our SPF or DKIM records (there is one exception, with a long and complicated history). Among other concerns, this... [01:18:40] PROBLEM - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:18:40] PROBLEM - cassandra-b SSL 10.64.32.206:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:19:01] PROBLEM - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.206 and port 9042: Connection refused [01:19:01] PROBLEM - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused [01:19:40] PROBLEM - cassandra-b service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [01:20:11] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:20:40] RECOVERY - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is OK: SSL OK - Certificate restbase1014-b valid until 2017-09-12 15:34:28 +0000 (expires in 131 days) [01:21:00] RECOVERY - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is OK: TCP OK - 0.036 second response time on 10.64.48.136 port 9042 [01:22:35] !log T160759: lowering tombstone_threshold on restbase1013 & restbase1014 [01:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:42] T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759 [01:23:10] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [01:23:40] RECOVERY - cassandra-b service on restbase1013 is OK: OK - cassandra-b is active [01:24:40] RECOVERY - cassandra-b SSL 10.64.32.206:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-b valid until 2017-09-12 15:34:20 +0000 (expires in 131 days) [01:25:00] RECOVERY - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is OK: TCP OK - 0.037 second response time on 10.64.32.206 port 9042 [01:25:40] PROBLEM - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:26:00] PROBLEM - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused [01:26:40] RECOVERY - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is OK: SSL OK - Certificate restbase1014-b valid until 2017-09-12 15:34:28 +0000 (expires in 131 days) [01:27:00] RECOVERY - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is OK: TCP OK - 0.036 second response time on 10.64.48.136 port 9042 [01:42:59] !log mobrovac@naos Started deploy [restbase/deploy@4d04dfd]: Blacklist a page on dewiki [01:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:11] !log mobrovac@naos Finished deploy [restbase/deploy@4d04dfd]: Blacklist a page on dewiki (duration: 
04m 12s) [01:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:41] !log mobrovac@naos Started deploy [restbase/deploy@4d04dfd]: Blacklist a page on dewiki [01:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:30] PROBLEM - cassandra-b CQL 10.64.32.195:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.195 and port 9042: Connection refused [01:49:40] PROBLEM - cassandra-b SSL 10.64.32.195:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:49:50] PROBLEM - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:50:11] PROBLEM - Check systemd state on restbase1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:50:30] PROBLEM - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.140 and port 9042: Connection refused [01:50:50] PROBLEM - cassandra-b service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [01:51:09] !log mobrovac@naos Finished deploy [restbase/deploy@4d04dfd]: Blacklist a page on dewiki (duration: 03m 28s) [01:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:50] PROBLEM - Check systemd state on restbase1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[01:52:00] PROBLEM - cassandra-c service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [01:54:38] !log mobrovac@naos Started deploy [restbase/deploy@4d04dfd]: blacklist dewiki page, take 3 [01:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:30] PROBLEM - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.117 and port 9042: Connection refused [01:57:40] PROBLEM - cassandra-a SSL 10.64.0.117:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:58:07] !log mobrovac@naos Finished deploy [restbase/deploy@4d04dfd]: blacklist dewiki page, take 3 (duration: 03m 29s) [01:58:10] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:50] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[01:59:00] PROBLEM - cassandra-a service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [01:59:10] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:59:10] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:59:10] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:59:10] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:59:10] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:59:11] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:59:11] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:59:12] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:59:12] PROBLEM - 
restbase endpoints health on restbase1013 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:59:13] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:59:13] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:59:30] PROBLEM - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.187 and port 9042: Connection refused [01:59:35] !log mobrovac@naos Started deploy [restbase/deploy@4d04dfd]: blacklist dewiki page, take 3a [01:59:40] PROBLEM - cassandra-a SSL 10.64.32.187:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:59:50] RECOVERY - cassandra-b service on restbase1008 is OK: OK - cassandra-b is active [02:00:11] RECOVERY - Check systemd state on restbase1008 is OK: OK - running: The system is fully operational [02:00:14] !log T160759: lowering tombstone threshold to 1000 on all eqiad nodes [02:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:22] T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759 [02:00:30] failure party [02:00:50] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [02:01:00] RECOVERY - cassandra-a service on restbase1011 is OK: OK - cassandra-a is active [02:01:30] RECOVERY - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is OK: TCP OK - 0.037 second response time on 10.64.32.187 port 9042 [02:01:50] 
RECOVERY - cassandra-a SSL 10.64.32.187:7001 on restbase1008 is OK: SSL OK - Certificate restbase1008-a valid until 2017-09-12 15:33:39 +0000 (expires in 131 days) [02:02:10] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [02:02:30] RECOVERY - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is OK: TCP OK - 0.036 second response time on 10.64.0.117 port 9042 [02:02:30] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [02:02:40] RECOVERY - cassandra-a SSL 10.64.0.117:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-a valid until 2017-09-12 15:34:03 +0000 (expires in 131 days) [02:03:20] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [02:03:20] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [02:03:20] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:03:30] RECOVERY - cassandra-b CQL 10.64.32.195:9042 on restbase1008 is OK: TCP OK - 0.036 second response time on 10.64.32.195 port 9042 [02:03:40] RECOVERY - cassandra-b SSL 10.64.32.195:7001 on restbase1008 is OK: SSL OK - Certificate restbase1008-b valid until 2017-09-12 15:33:41 +0000 (expires in 131 days) [02:03:40] PROBLEM - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [02:03:50] RECOVERY - Check systemd state on restbase1015 is OK: OK - running: The system is fully operational [02:04:00] RECOVERY - cassandra-c service on restbase1015 is OK: OK - cassandra-c is active [02:04:10] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [02:04:10] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [02:04:10] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [02:04:10] RECOVERY - restbase endpoints health on 
restbase1017 is OK: All endpoints are healthy [02:04:10] PROBLEM - Restbase root url on restbase1009 is CRITICAL: connect to address 10.64.48.110 and port 7231: Connection refused [02:04:11] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [02:04:11] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [02:04:13] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [02:04:20] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [02:04:20] PROBLEM - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.34 and port 9042: Connection refused [02:04:40] RECOVERY - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-c valid until 2017-12-13 00:15:51 +0000 (expires in 222 days) [02:04:50] RECOVERY - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is OK: SSL OK - Certificate restbase1015-c valid until 2017-09-12 15:34:41 +0000 (expires in 131 days) [02:05:10] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 15540 bytes in 0.078 second response time [02:05:11] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [02:05:20] RECOVERY - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is OK: TCP OK - 0.037 second response time on 10.64.0.34 port 9042 [02:05:30] RECOVERY - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is OK: TCP OK - 0.036 second response time on 10.64.48.140 port 9042 [02:06:47] Operations, ops-ulsfo, Traffic, Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3233964 (RobH) >>! In T164327#3233902, @BBlack wrote: > With the cable in the second port + T164444 we're blocked on getting a successful test install. @RobH is asking smart...
[02:07:10] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:07:20] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:12] !log mobrovac@naos Finished deploy [restbase/deploy@4d04dfd]: blacklist dewiki page, take 3a (duration: 08m 37s) [02:08:20] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [02:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:20] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:20] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:20] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:20] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:21] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:21] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:22] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:22] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:09:10] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [02:09:19] we know, we know, icinga, things are not looking good [02:09:20] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [02:10:10] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [02:10:20] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [02:10:20] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:10:20] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [02:10:20] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:11:10] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [02:12:20] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:13:20] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [02:13:20] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:13:20] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:13:20] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:14:10] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [02:14:11] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:15:11] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [02:16:10] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [02:16:20] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:20] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:19:10] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [02:19:10] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [02:19:10] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [02:19:22] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [02:19:22] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:20:20] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:21:10] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [02:21:10] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [02:21:10] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [02:21:10] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [02:21:10] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [02:21:11] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [02:21:11] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [02:21:20] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [02:26:01] !log l10nupdate@naos scap sync-l10n completed (1.29.0-wmf.21) (duration: 08m 02s) [02:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:22] !log l10nupdate@naos ResourceLoader cache refresh completed at Thu May 4 02:31:22 UTC 2017 (duration 5m 21s) [02:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:17:53] (PS1) BBlack: Revert "cp4021: use macaddr from second 10Gbps interface" [puppet] - https://gerrit.wikimedia.org/r/351750 [03:18:03] (CR) BBlack: [V: 2 C: 2] Revert "cp4021: use macaddr from second 10Gbps interface" [puppet] - https://gerrit.wikimedia.org/r/351750 (owner: BBlack) [03:32:51] Operations: Allow Qualtrics to send @wikimedia.org emails using an SPF record or an SMTP relay - https://phabricator.wikimedia.org/T164424#3234002 (Neil_P._Quinn_WMF) Thanks for the details, @faidon. It makes sense that we would not want to give full mail-sending authority to a third party, no matter how tru...
[03:51:33] (PS1) BBlack: cp4021: add ipsec and node list entry [puppet] - https://gerrit.wikimedia.org/r/351752 [03:51:57] (CR) BBlack: [V: 2 C: 2] cp4021: add ipsec and node list entry [puppet] - https://gerrit.wikimedia.org/r/351752 (owner: BBlack) [03:55:40] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [03:56:10] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [03:56:11] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [04:17:12] PROBLEM - Webrequests Varnishkafka log producer on cp4021 is CRITICAL: NRPE: Command check_varnishkafka-webrequest not defined [04:17:32] PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:18:12] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/var/cache/varnishkafka] [04:18:17] (PS1) BBlack: group for varnishlog should be varnish these days [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/351755 [04:18:31] (CR) BBlack: [V: 2 C: 2] group for varnishlog should be varnish these days [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/351755 (owner: BBlack) [04:18:42] PROBLEM - traffic-pool service on cp4021 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [04:20:03] (PS1) BBlack: bump varnishkafka submodule [puppet] - https://gerrit.wikimedia.org/r/351756 [04:20:15] (CR) BBlack: [V: 2 C: 2] bump varnishkafka submodule [puppet] - https://gerrit.wikimedia.org/r/351756 (owner: BBlack) [04:20:22] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2017_v4, cp2017_v6,kafka1018_v4,kafka1018_v6 [04:22:22] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 54 ESP OK [04:24:13] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [04:24:13] RECOVERY - Webrequests Varnishkafka log producer on cp4021 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [04:29:32] RECOVERY - Check systemd state on cp4021 is OK: OK - running: The system is fully operational [04:29:42] RECOVERY - traffic-pool service on cp4021 is OK: OK - traffic-pool is active [04:31:08] (PS1) BBlack: raise storage size for new cache SSDs [puppet] - https://gerrit.wikimedia.org/r/351757 [04:31:16] (CR) BBlack: [V: 2 C: 2] raise storage size for new cache SSDs [puppet] - https://gerrit.wikimedia.org/r/351757 (owner: BBlack) [06:03:48] !log Deploy alter table on wikidatawiki.wb_terms - dbstore2002 - T162539 T163548 [06:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:58] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table -
https://phabricator.wikimedia.org/T162539 [06:03:59] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548 [06:05:42] (03PS1) 10Marostegui: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351766 (https://phabricator.wikimedia.org/T162539) [06:07:11] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351766 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [06:07:52] (03PS1) 10Tim Starling: Remove EtcdConfig from beta cluster for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351767 [06:08:16] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351766 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [06:08:25] (03CR) 10jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351766 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [06:09:30] !log CentralAuth: Removed MediaWiki 2FA for Alexsh (T164265) [06:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:38] T164265: Lost 2FA details, request recovery. 
- https://phabricator.wikimedia.org/T164265 [06:10:09] !log marostegui@naos Synchronized wmf-config/db-codfw.php: Depool db2066 - T162539 T163548 (duration: 01m 25s) [06:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:17] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539 [06:10:17] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548 [06:10:26] !log Deploy alter table on wikidatawiki.wb_terms - db2066 - T162539 T163548 [06:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:29] !log Stop MySQL on tempdb2001 to take a backup and prepare to decomission - T161712 [06:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:37] T161712: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712 [06:22:25] (03PS1) 10Marostegui: db-codfw.php: Depool tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351769 (https://phabricator.wikimedia.org/T161712) [06:23:45] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351769 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui) [06:24:48] (03Merged) 10jenkins-bot: db-codfw.php: Depool tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351769 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui) [06:25:56] (03CR) 10jenkins-bot: db-codfw.php: Depool tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351769 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui) [06:26:25] !log marostegui@naos Synchronized wmf-config/db-codfw.php: Depool tempdb2001, no longer needed - T161712 (duration: 01m 08s) [06:26:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:33] T161712: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712 [06:29:38] (03PS4) 10Phuedx: Reading Web Page Previews alerts [puppet] - 10https://gerrit.wikimedia.org/r/350377 [06:37:41] (03Abandoned) 10Giuseppe Lavagetto: tendril::maintenance: enable temporarily in codfw [puppet] - 10https://gerrit.wikimedia.org/r/350375 (owner: 10Giuseppe Lavagetto) [06:41:40] (03PS1) 10Marostegui: mariadb: Get ready to decomission tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/351772 (https://phabricator.wikimedia.org/T161712) [06:45:52] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [06:47:03] (03CR) 10Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/6292/" [puppet] - 10https://gerrit.wikimedia.org/r/351772 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui) [06:47:52] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [06:58:46] !log installing freetype security updates [06:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:48] lots of errors [07:03:29] Error: 503, Backend fetch failed at Thu, 04 May 2017 07:01:30 GMT [07:07:32] <_joe_> Nikerabbit: on it [07:17:31] (03PS1) 10Giuseppe Lavagetto: Depool esams due to outage [dns] - 10https://gerrit.wikimedia.org/r/351774 [07:17:35] <_joe_> ema ^^ [07:18:04] _joe_: let's hold on just a few secs [07:18:14] (03CR) 10Muehlenhoff: [C: 031] Depool esams due to outage [dns] - 10https://gerrit.wikimedia.org/r/351774 (owner: 10Giuseppe Lavagetto) [07:18:20] <_joe_> ema: yes [07:18:29] <_joe_> also, I might want to depool just text [07:18:53] _joe_: indeed, other clusters are unaffected [07:19:25] (03PS2) 10Giuseppe Lavagetto: Depool text esams due to outage [dns] - 
10https://gerrit.wikimedia.org/r/351774 [07:19:37] <_joe_> errors are going down, not to zero though [07:19:58] <_joe_> ema: let's go? [07:21:36] _joe_: the backend 503s are almost 0 now [07:22:00] <_joe_> yes, I see that [07:22:19] _joe_: let me just grab a few varnishlogs and then +1 to depool [07:22:26] <_joe_> seems that coincided with me restarting cp3043's backend [07:22:37] <_joe_> ema: I think it's ok now [07:22:39] <_joe_> more or less [07:22:50] <_joe_> I'm tailing the 5xx log and it basically stopped [07:23:13] <_joe_> it's only the usual noise on upload [07:24:48] _joe_: yeah, it stopped indeed. I'd say let's not depool for now but keep the patch ready if the issue comes back [07:25:24] <_joe_> yes [07:25:53] <_joe_> !log restarted cp3043 backend varnish at 7:13 UTC while trying to debug issues [07:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:52] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:33:25] (03CR) 10Marostegui: [C: 032] mariadb: Get ready to decomission tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/351772 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui) [07:33:52] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:37:20] (03PS1) 10Marostegui: db-codfw.php: Remove tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351777 (https://phabricator.wikimedia.org/T161712) [07:38:09] (03PS2) 10Marostegui: db-codfw,db-eqiad.php: Remove tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351777 (https://phabricator.wikimedia.org/T161712) [07:40:44] 07Puppet, 10Beta-Cluster-Infrastructure, 05Goal, 13Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#3234168 (10hashar) [07:42:52] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold 
[1000.0] [07:43:52] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [07:44:52] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:47:52] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [07:51:57] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:52:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:54:39] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Remove tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351777 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui) [07:56:46] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Remove tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351777 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui) [07:56:55] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Remove tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351777 (https://phabricator.wikimedia.org/T161712) (owner: 10Marostegui) [07:58:42] !log marostegui@naos Synchronized wmf-config/db-codfw.php: Remove tempdb2001 from config files as it will be decommissioned - T161712 (duration: 01m 25s) [07:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:50] T161712: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712 [07:59:41] (03CR) 10Muehlenhoff: [C: 031] "Looks good. 
Note that the current queue discipline doesn't get changed by the sysctl, this either needs a reboot or manual one time applic" [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) (owner: 10Ema) [07:59:57] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Remove tempdb2001 from config files as it will be decommissioned - T161712 (duration: 01m 07s) [08:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:57] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234220 (10Marostegui) Position where tempdb2001 slave was stopped at: https://phabricator.wikimedia.org/P5374 Backu... [08:01:11] 06Operations, 06Labs, 10Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#3234221 (10hashar) [08:03:55] 06Operations, 10LDAP-Access-Requests, 06Labs, 10Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#3234234 (10hashar) There are only two account in LDAP with shells not being `/bin/bash`: ``` $ ldapsearch -LLL -x -b 'ou=people,dc=wikimedia,dc=o... [08:05:22] <_joe_> moritzm: can you change the topic back? [08:08:12] yep [08:10:24] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3234243 (10Marostegui) 05Open>03Resolved Thanks @Papaul all the hosts are looking good! I will mark this ticket as resolved then. 
[08:13:21] (03PS1) 10Marostegui: x1.host: Remove tempdb2001 [software] - 10https://gerrit.wikimedia.org/r/351780 [08:15:56] (03CR) 10Marostegui: [C: 032] x1.host: Remove tempdb2001 [software] - 10https://gerrit.wikimedia.org/r/351780 (owner: 10Marostegui) [08:16:49] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234251 (10jcrespo) "Decommissioned" in the software sense, I think they will want it to return to spares. Just clar... [08:17:58] (03Merged) 10jenkins-bot: x1.host: Remove tempdb2001 [software] - 10https://gerrit.wikimedia.org/r/351780 (owner: 10Marostegui) [08:18:02] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Decommission db1040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351782 (https://phabricator.wikimedia.org/T164057) [08:20:21] (03PS1) 10Marostegui: mariadb: Get ready to decommission db1040 [puppet] - 10https://gerrit.wikimedia.org/r/351783 (https://phabricator.wikimedia.org/T164057) [08:20:49] (03PS1) 10Marostegui: s4.hosts: Remove db1040 [software] - 10https://gerrit.wikimedia.org/r/351784 (https://phabricator.wikimedia.org/T164057) [08:22:53] (03PS2) 10Marostegui: mariadb: Get ready to decommission db1040 [puppet] - 10https://gerrit.wikimedia.org/r/351783 (https://phabricator.wikimedia.org/T164057) [08:23:39] !log restart elasticsearch on relforge for JDK update [08:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:34] (03PS4) 10Gehel: elasticsearch - fix awareness attribute [puppet] - 10https://gerrit.wikimedia.org/r/351650 [08:28:06] (03PS1) 10Gehel: elasticsearch - fix logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/351786 [08:28:20] (03CR) 10Gehel: [C: 032] elasticsearch - fix awareness attribute [puppet] - 10https://gerrit.wikimedia.org/r/351650 (owner: 10Gehel) [08:29:57] (03CR) 10Gehel: [C: 032] elasticsearch - fix logging 
configuration [puppet] - 10https://gerrit.wikimedia.org/r/351786 (owner: 10Gehel) [08:30:02] (03PS2) 10Gehel: elasticsearch - fix logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/351786 [08:31:49] (03CR) 10Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/6294/" [puppet] - 10https://gerrit.wikimedia.org/r/351783 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui) [08:32:58] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234269 (10Marostegui) Thanks for the clarification :-) [08:34:20] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3234285 (10hashar) On deployment-tin I created: ``` lang=ini,name=/etc/mysql/conf.d/prompt.cnf [mysql] prompt = "\u@\h[... [08:34:28] (03CR) 10DCausse: [C: 031] elasticsearch - update reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/351629 (owner: 10Gehel) [08:35:05] (03CR) 10DCausse: [C: 031] logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [08:41:24] (03PS3) 10Marostegui: mariadb: Get ready to decommission db1040 [puppet] - 10https://gerrit.wikimedia.org/r/351783 (https://phabricator.wikimedia.org/T164057) [08:42:18] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3234312 (10ArielGlenn) 05Open>03Resolved [08:42:40] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T163690#3234314 (10fgiunchedi) @Cmjohnson thanks! 
disk has been rebuilt [08:43:09] (03CR) 10Marostegui: [C: 032] mariadb: Get ready to decommission db1040 [puppet] - 10https://gerrit.wikimedia.org/r/351783 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui) [08:43:50] 06Operations, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3234320 (10fgiunchedi) >>! In T152562#3232264, @Jgreen wrote: >>>! In T152562#3227173, @fgiunchedi wrote: >> @Jgreen any news/updates on having FR fully on jessie? > > We're still waiting for... [08:45:06] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Decommission db1040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351782 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui) [08:47:02] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Decommission db1040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351782 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui) [08:47:12] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Decommission db1040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351782 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui) [08:49:22] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Remove db1040 from config files as it will be decommissioned - T164057 (duration: 00m 55s) [08:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:31] T164057: Decommission db1040 - https://phabricator.wikimedia.org/T164057 [08:50:18] !log marostegui@naos Synchronized wmf-config/db-codfw.php: Remove db1040 from config files as it will be decommissioned - T164057 (duration: 00m 48s) [08:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:03] (03CR) 10Marostegui: [C: 032] s4.hosts: Remove db1040 [software] - 10https://gerrit.wikimedia.org/r/351784 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui) [08:52:47] (03Merged) 10jenkins-bot: s4.hosts: Remove db1040 [software] 
- 10https://gerrit.wikimedia.org/r/351784 (https://phabricator.wikimedia.org/T164057) (owner: 10Marostegui) [08:53:31] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Decommission db1040 - https://phabricator.wikimedia.org/T164057#3234342 (10Marostegui) a:03Cmjohnson The host is ready to be decommissioned. What I have done: - Removed it from prometheus, added it as a spare on site.pp and removed it from dhcp list:... [08:54:25] 06Operations, 10ops-eqiad, 15User-fgiunchedi: HP RAID icinga alert on ms-be1021 - https://phabricator.wikimedia.org/T163777#3234351 (10fgiunchedi) The same recharging status and icinga warning is showing on ms-be1020 now too ``` => show status Smart Array P840 in Slot 3 Controller Status: OK Cache St... [09:00:37] 06Operations: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3234359 (10MoritzMuehlenhoff) [09:04:42] 06Operations: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3234372 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [09:07:01] (03CR) 10Volans: [C: 04-1] "It seems a bit overcomplicated to me, suggestions to simplify it are inline." 
(0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [09:12:03] (03PS1) 10Filippo Giunchedi: thumbstats: fix thumb detection regex [software] - 10https://gerrit.wikimedia.org/r/351791 [09:12:05] (03PS1) 10Filippo Giunchedi: thumbstats: add env variables to help text [software] - 10https://gerrit.wikimedia.org/r/351792 [09:12:07] (03PS1) 10Filippo Giunchedi: thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796) [09:12:58] (03CR) 10jerkins-bot: [V: 04-1] thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796) (owner: 10Filippo Giunchedi) [09:13:28] !log joal@naos Started deploy [analytics/refinery@9d35029]: (no justification provided) [09:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:26] !log joal@naos Finished deploy [analytics/refinery@9d35029]: (no justification provided) (duration: 02m 58s) [09:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:20] (03CR) 10Volans: [C: 04-1] "amended comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [09:25:41] (03CR) 10Filippo Giunchedi: [C: 031] Add a Prometheus Apache exporter to Bohrium [puppet] - 10https://gerrit.wikimedia.org/r/351636 (https://phabricator.wikimedia.org/T163204) (owner: 10Elukey) [09:26:39] (03PS1) 10Jcrespo: query-killer: Do not kill queries containing gtid_wait or DMLs [software] - 10https://gerrit.wikimedia.org/r/351796 [09:26:56] (03PS4) 10Elukey: Add a Prometheus Apache exporter to Bohrium [puppet] - 10https://gerrit.wikimedia.org/r/351636 (https://phabricator.wikimedia.org/T163204) [09:28:11] (03CR) 10Elukey: [C: 032] Add a Prometheus Apache exporter to Bohrium [puppet] - 10https://gerrit.wikimedia.org/r/351636 (https://phabricator.wikimedia.org/T163204) (owner: 10Elukey) [09:31:56] 06Operations: Use DNS discovery 
record for deployment CNAME - https://phabricator.wikimedia.org/T164460#3234453 (10fgiunchedi) [09:32:24] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3234480 (10ema) Today, 2017-05-04, this issue affected 4 out of 8 of the text-esams hosts roughly at the same time, resulting in a peak of [[ https://grafana.wik... [09:32:35] 06Operations, 15User-fgiunchedi: Use DNS discovery record for deployment CNAME - https://phabricator.wikimedia.org/T164460#3234482 (10fgiunchedi) [09:36:00] bd808: when you get a chance can you let me know what you think of https://gerrit.wikimedia.org/r/#/c/350817/ ? especially the logstash part and adding "type" to distinguish webrequest from json_lines [09:36:05] (03PS1) 10Jcrespo: x1: Remove all read traffic from x1-master-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407) [09:37:11] (03CR) 10Marostegui: [C: 031] x1: Remove all read traffic from x1-master-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407) (owner: 10Jcrespo) [09:39:30] (03PS2) 10Jcrespo: db: Remove all read traffic from x1, es2 & es3-master-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407) [09:40:51] !log stop kafka on kafka1012 and reboot the host for kernel upgrade [09:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:27] (03PS2) 10Filippo Giunchedi: thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796) [09:53:01] (03CR) 10Filippo Giunchedi: [C: 032] thumbstats: fix thumb detection regex [software] - 10https://gerrit.wikimedia.org/r/351791 (owner: 10Filippo Giunchedi) [09:53:12] (03CR) 10Filippo Giunchedi: [C: 032] thumbstats: add env variables to help text [software] - 
10https://gerrit.wikimedia.org/r/351792 (owner: 10Filippo Giunchedi) [09:54:25] (03PS3) 10Filippo Giunchedi: thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796) [10:00:10] (03PS1) 10Ema: cache: stop expiry thread RT experiment on cp2024 [puppet] - 10https://gerrit.wikimedia.org/r/351802 (https://phabricator.wikimedia.org/T145661) [10:00:19] (03PS4) 10Filippo Giunchedi: thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796) [10:04:05] (03CR) 10Ema: [V: 032 C: 032] cache: stop expiry thread RT experiment on cp2024 [puppet] - 10https://gerrit.wikimedia.org/r/351802 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [10:04:18] (03CR) 10Filippo Giunchedi: [C: 032] thumbstats: add Hive export [software] - 10https://gerrit.wikimedia.org/r/351793 (https://phabricator.wikimedia.org/T162796) (owner: 10Filippo Giunchedi) [10:05:28] !log restart varnish-be on cp2024 without RT experiment [10:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:40] !log restarting hhvm on mediawiki canaries to pick up freetype security update [10:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:07] !log executed DEL ocg_job_status on rdb1007:6379 (new ocg_job_status hash is stored on the ocg* hosts) - T159850 [10:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:15] T159850: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850 [10:27:17] 06Operations, 06Performance-Team, 13Patch-For-Review: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3234673 (10Peter) I've added the new metrics in the graph, will keep the issue open until I can change the alert so we use both pages (need to collect metric... 
[10:28:02] I think nagios lost its downtimes again [10:41:54] (03PS1) 10Alexandros Kosiaris: Add scap.cfg [software/librenms] - 10https://gerrit.wikimedia.org/r/351811 (https://phabricator.wikimedia.org/T129136) [10:44:07] 07Puppet, 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure, 07Beta-Cluster-reproducible: New instance have broken puppet configuration when using puppetmaster standalone - https://phabricator.wikimedia.org/T148929#3234740 (10hashar) [10:44:58] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3234753 (10fgiunchedi) I've extracted some data from the list of thumbnails we are storing in swift and processed it with hive, distribution of si... [10:48:01] !log installing tomcat security updates [10:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:01] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:14:54] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: (don't) decom promethium - https://phabricator.wikimedia.org/T164395#3232060 (10MoritzMuehlenhoff) @Andrew : Can you clarify: is this host running in the labs or production realm? 
[11:15:28] (03CR) 10Jcrespo: "I will deploy this after a break" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407) (owner: 10Jcrespo) [11:18:48] (03CR) 10Ema: [C: 031] icinga checks: varnish mailbox tweaks [puppet] - 10https://gerrit.wikimedia.org/r/351684 (owner: 10BBlack) [11:19:28] (03PS1) 10Elukey: Lower the Apache workers on bohrium (Piwik) [puppet] - 10https://gerrit.wikimedia.org/r/351812 [11:20:50] (03CR) 10Elukey: [C: 032] Lower the Apache workers on bohrium (Piwik) [puppet] - 10https://gerrit.wikimedia.org/r/351812 (owner: 10Elukey) [11:21:11] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86459.17 seconds [11:21:13] (03PS1) 10Marostegui: mariadb: Get ready to decomission db1022 [puppet] - 10https://gerrit.wikimedia.org/r/351813 (https://phabricator.wikimedia.org/T163778) [11:21:54] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Decommission db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351814 (https://phabricator.wikimedia.org/T163778) [11:22:31] RECOVERY - MariaDB Slave Lag: s1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86557.44 seconds [11:23:24] (03PS1) 10Marostegui: s6.hosts: Decommission db1022 [software] - 10https://gerrit.wikimedia.org/r/351815 (https://phabricator.wikimedia.org/T163778) [11:26:11] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86597.36 seconds [11:28:31] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86394.13 seconds [11:29:09] (03CR) 10Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/6297/" [puppet] - 10https://gerrit.wikimedia.org/r/351813 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui) [11:29:14] (03PS2) 10Marostegui: mariadb: Get ready to decomission db1022 [puppet] - 10https://gerrit.wikimedia.org/r/351813 (https://phabricator.wikimedia.org/T163778) [11:29:31] RECOVERY - MariaDB Slave 
Lag: s3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86482.40 seconds [11:30:55] (03CR) 10Marostegui: [C: 032] mariadb: Get ready to decomission db1022 [puppet] - 10https://gerrit.wikimedia.org/r/351813 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui) [11:33:59] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Decommission db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351814 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui) [11:34:11] RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86644.84 seconds [11:35:06] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Decommission db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351814 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui) [11:35:46] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Decommission db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351814 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui) [11:36:11] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86560.94 seconds [11:36:20] jouncebot: refresh [11:36:23] I refreshed my knowledge about deployments. 
[11:36:24] jouncebot: next [11:36:24] In 97 hour(s) and 23 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1300) [11:36:56] !log marostegui@naos Synchronized wmf-config/db-codfw.php: Remove db1022 from config files as it will be decommissioned - T163778 (duration: 01m 25s) [11:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:04] T163778: Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission) - https://phabricator.wikimedia.org/T163778 [11:38:01] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [11:38:36] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Remove db1022 from config files as it will be decommissioned - T163778 (duration: 01m 06s) [11:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:31] (03PS2) 10BBlack: icinga checks: varnish mailbox tweaks [puppet] - 10https://gerrit.wikimedia.org/r/351684 [11:43:00] (03CR) 10BBlack: [V: 032 C: 032] icinga checks: varnish mailbox tweaks [puppet] - 10https://gerrit.wikimedia.org/r/351684 (owner: 10BBlack) [11:43:17] (03PS2) 10Marostegui: s6.hosts: Decommission db1022 [software] - 10https://gerrit.wikimedia.org/r/351815 (https://phabricator.wikimedia.org/T163778) [11:43:27] (03PS4) 10BBlack: varnish: better frontend mem sizing [puppet] - 10https://gerrit.wikimedia.org/r/324230 [11:44:31] (03CR) 10jerkins-bot: [V: 04-1] varnish: better frontend mem sizing [puppet] - 10https://gerrit.wikimedia.org/r/324230 (owner: 10BBlack) [11:44:56] (03CR) 10Marostegui: [C: 032] s6.hosts: Decommission db1022 [software] - 10https://gerrit.wikimedia.org/r/351815 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui) [11:45:10] !log starting cache_text upgrades to varnish 4.1.6-1wm1 [11:45:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:53] (03Merged) 10jenkins-bot: s6.hosts: Decommission db1022 [software] - 10https://gerrit.wikimedia.org/r/351815 (https://phabricator.wikimedia.org/T163778) (owner: 10Marostegui) [11:46:31] 06Operations, 10ops-eqiad, 10DBA: Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission) - https://phabricator.wikimedia.org/T163778#3234912 (10Marostegui) a:03Cmjohnson The host is ready to be decommissioned. What I have done: Removed it from pr... [11:47:37] (03PS5) 10BBlack: varnish: better frontend mem sizing [puppet] - 10https://gerrit.wikimedia.org/r/324230 [11:47:40] I hate you jenkins [11:48:47] (03CR) 10BBlack: [C: 032] varnish: better frontend mem sizing [puppet] - 10https://gerrit.wikimedia.org/r/324230 (owner: 10BBlack) [11:51:00] PROBLEM - Varnish HTTP text-backend - port 3128 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3128: Connection refused [11:51:07] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3234935 (10Marostegui) @Ottomata remember to: `stop all slaves;` before shutting down MySQL (not a hard requirement, but just in case there is a transaction hanging,... 
[11:51:10] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3125: Connection refused [11:51:10] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3120: Connection refused [11:51:10] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3121: Connection refused [11:51:10] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3122: Connection refused [11:51:20] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3127: Connection refused [11:51:20] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3124: Connection refused [11:51:20] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3126: Connection refused [11:51:20] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3123: Connection refused [11:51:28] looking ^ [11:51:40] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:51:50] PROBLEM - Varnish HTTP text-frontend - port 80 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 80: Connection refused [11:52:17] bblack: perhaps puppetfail on mem sizing? 
[11:53:08] re-running puppet agent [11:53:10] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 449 bytes in 0.002 second response time [11:53:10] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.000 second response time [11:53:10] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 449 bytes in 0.000 second response time [11:53:11] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.000 second response time [11:53:20] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time [11:53:20] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.000 second response time [11:53:20] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time [11:53:20] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time [11:53:40] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:53:50] RECOVERY - Varnish HTTP text-frontend - port 80 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.005 second response time [11:53:51] uh [11:54:00] RECOVERY - Varnish HTTP text-backend - port 3128 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.001 second response time [11:54:12] it's only the one you're restarting, right? 
[11:54:54] I've restarted cp4010 but that was before the mem sizing patch got merged [11:54:59] right [11:55:05] let me stop the upgrades and try one manually [11:55:16] we probably do need agent runs pre-restart now [11:55:59] ok [11:56:13] !log installing mysql-connector-java security updates [11:56:17] the agent update for the mem size seems to work fine in isolation though [11:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:30] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=287.60 Read Requests/Sec=2093.30 Write Requests/Sec=7.00 KBytes Read/Sec=33404.80 KBytes_Written/Sec=2376.00 [11:57:21] I guess spam the agent across the caches [11:58:16] yeah [12:01:10] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3234951 (10Gilles) What "size" are we talking about when you have a range like 0.0000 - 500.0000? [12:01:23] (03PS1) 10DCausse: Upgrade plugins for elastic 5.3.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/351819 (https://phabricator.wikimedia.org/T160948) [12:07:40] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=26.20 Read Requests/Sec=0.70 Write Requests/Sec=1.90 KBytes Read/Sec=13.60 KBytes_Written/Sec=39.20 [12:11:10] (03CR) 10DCausse: "I really hope it's the last time we do this..." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/351819 (https://phabricator.wikimedia.org/T160948) (owner: 10DCausse) [12:13:00] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:20:28] 06Operations, 06Labs, 13Patch-For-Review: (don't) decom promethium - https://phabricator.wikimedia.org/T164395#3235014 (10Cmjohnson) [12:21:10] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:28:43] !log Deploy alter table enwiki.revision on dbstore1001 - T132416 [12:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:52] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [12:35:21] (03PS1) 10Marostegui: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351822 (https://phabricator.wikimedia.org/T160392) [12:36:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351822 (https://phabricator.wikimedia.org/T160392) (owner: 10Marostegui) [12:38:01] 06Operations, 10DNS, 10Traffic, 15User-fgiunchedi: Use DNS discovery record for deployment CNAME - https://phabricator.wikimedia.org/T164460#3235053 (10Zppix) [12:38:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351822 (https://phabricator.wikimedia.org/T160392) (owner: 10Marostegui) [12:40:22] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Depool db1070 for maintenance - T160392 (duration: 01m 35s) [12:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:31] T160392: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392 [12:42:42] !log Stop MySQL db1070 for maintenance - T160392 [12:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:49] what time zone & format is throttle.php in? 
[12:51:45] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351822 (https://phabricator.wikimedia.org/T160392) (owner: 10Marostegui) [12:52:23] (03Draft2) 10Zppix: Lift Account registration limit for cywiki for an event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351825 (https://phabricator.wikimedia.org/T164482) [12:55:50] Can I have something deployed before Saturday? It's for T164482 and the event is Saturday [12:55:50] T164482: Lift account registration limit from IP address - https://phabricator.wikimedia.org/T164482 [13:02:51] Zppix: nope [13:03:15] Zppix: er "before Saturday", yes, sure [13:03:19] Dereckson: how else am I supposed to have the account registration limit lifted? [13:03:30] (03CR) 10Tjones: [C: 031] "Yay!" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/351819 (https://phabricator.wikimedia.org/T160948) (owner: 10DCausse) [13:03:43] Dereckson: I can do it whenever EU swat is "usually" [13:04:13] I was going to tell you we need to deploy that during the week, as it's known beforehand, and not the Saturday itself [13:04:24] Gerrit URL? [13:04:35] https://gerrit.wikimedia.org/r/#/c/351825/ [13:05:18] Thanks [13:05:22] no problem [13:07:04] these are the kind of things that should not be on config, but validated + on a store somewhere [13:07:18] jynus: we've a bug open for that [13:07:35] jynus: and someone told us they want to work on that during the hackathon [13:07:37] yes, I was just reminding a conversation [13:07:42] most of InitialiseSettings.php should be in a store somewhere :) [13:07:53] we had a few days ago [13:07:53] That's ancient tech debt [13:08:06] and basically everyone agreed with that [13:08:31] s/remind/remember/ [13:08:33] to be fair, if it's not able to be accessed on gerrit then staff would theoretically be the only ones able to change them.
[13:08:59] Zppix: let me check something first for the timezone, the space is fishy, and I'll deploy it, as soon as jynus gives me a green light, as he owns the config repo this week for db purposes [13:09:20] Zppix, that is the point, not having to use gerrit at all :-) [13:09:49] I am cool with that, no more maintenance commits anymore [13:09:53] jynus: so then staff would be the only ones doing config changes, adding to the already long list of a workload [13:09:54] jynus: ok [13:09:59] Dereckson: ack [13:10:50] Zppix, nope- I would say autoconfirmed users to propose changes and a special role to ok them [13:11:15] jynus: ah I see? is there a phab task on this? [13:11:27] but not participating in development, I cannot say or promise anything [13:11:38] that would be only my idea [13:12:16] probably https://phabricator.wikimedia.org/T44785 [13:12:46] opened by you 4 years and a half ago :-) [13:13:29] (03CR) 10Dereckson: [C: 031] "Timezone checked." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351825 (https://phabricator.wikimedia.org/T164482) (owner: 10Zppix) [13:14:01] would it help if I work on a generic configuration store proposal? [13:14:20] I wasn't here 4 and a half years ago jynus so I'm assuming you're not talking about me lol [13:14:28] he he no [13:14:47] jynus: if you provide a solution for all the config, IS/CS files like db variables, yes, that would help [13:15:08] !log restart services on maps-test [13:15:13] that would avoid having one solution for one config file, another one for per-wiki settings, etc. [13:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:00] (03CR) 10Dereckson: [C: 032] "Emergency deployment, throttle rule for an event before the next SWAT available."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/351825 (https://phabricator.wikimedia.org/T164482) (owner: 10Zppix) [13:16:22] probably the same model as etcd: cached data, but every X seconds it is reloaded asynchronously [13:16:57] jynus: would it be able to load balance that properly considering it will be used by every project [13:17:06] (03Merged) 10jenkins-bot: Lift Account registration limit for cywiki for an event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351825 (https://phabricator.wikimedia.org/T164482) (owner: 10Zppix) [13:17:06] That's actually the etcd point [13:17:15] (03CR) 10jenkins-bot: Lift Account registration limit for cywiki for an event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351825 (https://phabricator.wikimedia.org/T164482) (owner: 10Zppix) [13:17:21] no need, stale data should be kept [13:17:31] if the service goes down [13:17:51] Zppix: here, read this: http://thesecretlivesofdata.com/raft/ [13:17:57] That's the algo behind etcd [13:18:00] thanks Dereckson [13:18:14] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3235210 (10fgiunchedi) @Gilles the image width in pixels, i.e. the user-provided size in the url [13:18:45] !log restart services on maps codfw [13:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:55] Zppix: your change works on mwdebug1002, syncing [13:20:20] ack Dereckson [13:21:39] !log dereckson@naos Synchronized wmf-config/throttle.php: Lift Account registration limit for cywiki for an event / T164482 (duration: 01m 08s) [13:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:48] T164482: Lift account registration limit from IP address - https://phabricator.wikimedia.org/T164482 [13:22:55] I'm done on Naos.
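[editor's note] The config-store idea floated around 13:16 (cached data, reloaded every X seconds in the background, stale data kept if the store goes down) could look roughly like the sketch below. This is only an illustration under assumptions: `CachedConfig`, `fetch` and `interval_seconds` are hypothetical names, and nothing here is taken from etcd's client API or from any Wikimedia code.

```python
import threading
from typing import Callable, Dict, Optional


class CachedConfig:
    """Hypothetical config cache: serves the last good copy, refreshes in the background."""

    def __init__(self, fetch: Callable[[], Dict[str, object]],
                 interval_seconds: float = 10.0) -> None:
        self._fetch = fetch                # assumed callable that reads the remote store
        self._interval = interval_seconds  # the "every X seconds" from the discussion
        self._lock = threading.Lock()
        self._data: Dict[str, object] = {}
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def refresh(self) -> bool:
        """Try one reload; on any failure keep the stale data and report False."""
        try:
            fresh = self._fetch()
        except Exception:
            return False
        with self._lock:
            self._data = dict(fresh)
        return True

    def get(self, key: str, default: object = None) -> object:
        """Read from the local copy only, so readers never wait on the store."""
        with self._lock:
            return self._data.get(key, default)

    def start(self) -> None:
        """Run refresh() every interval in a daemon thread until stop() is called."""
        def loop() -> None:
            while not self._stop.wait(self._interval):
                self.refresh()
        self._thread = threading.Thread(target=loop, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Because reads only touch the local copy, an outage of the store degrades to "config is a bit old" rather than "config is unavailable", which seems to be the point made at 13:17:21.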
[13:23:10] 06Operations, 05Goal, 07kubernetes: Eliminate SPOFs in the existing eqiad kubernetes infrastructure - https://phabricator.wikimedia.org/T162040#3235220 (10akosiaris) [13:23:21] I thought tin was eqiad [13:24:33] (03PS1) 10Alexandros Kosiaris: Create kubemaster.svc.$site.wmnet [dns] - 10https://gerrit.wikimedia.org/r/351836 (https://phabricator.wikimedia.org/T162040) [13:24:45] Zppix: deployment server failover hasn't happened yet. It's scheduled for today [13:25:20] Failover lol... ack [13:27:24] Zppix: it's only a pure convention [13:27:31] you can deploy from every server, it works [13:27:32] BUT [13:28:06] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3235232 (10fgiunchedi) [13:28:15] There is a need to have a coherent staging directory, so someone don't deploy commits A B C while someone deploy A B D overwriting C [13:28:42] so the deployment server is mainly a social convention to have a coherent source of files to deploy [13:29:22] So i guess ill be the last deploy from codfw for a while (hopefully). [13:29:46] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3235233 (10Gilles) Migration steps on VM once the change is applied: ``` mwscript refreshImageMetadata.php... 
[13:31:29] !log restart services on maps eqiad [13:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:55] (03PS5) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 [13:35:51] (03CR) 10jerkins-bot: [V: 04-1] logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [13:37:07] (03CR) 10Gehel: logstash - delete all indices older than 31 days (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [13:38:02] (03PS6) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 [13:38:35] (03PS7) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 [13:46:11] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3235268 (10jcrespo) [13:46:26] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#2990470 (10jcrespo) p:05High>03Normal [13:47:09] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3235271 (10BBlack) Interesting data on the topic of BBR under datacenter conditions (low latency 100GbE), possibly supporting the idea that it's... [13:48:03] (03CR) 10Volans: [C: 04-1] "Much nicer! Thanks a lot for improving it. Looks good to me, but I think there is a typo." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [13:48:27] 06Operations, 06Release-Engineering-Team: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3235275 (10jcrespo) Removing tag due to change in scope of the ticket. [13:49:29] (03PS1) 10Marostegui: db-eqiad.php: Repool db1070 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839 [13:49:45] (03CR) 10Marostegui: [C: 04-2] "Wait for lag to be gone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839 (owner: 10Marostegui) [13:50:33] (03PS2) 10Marostegui: db-eqiad.php: Repool db1070 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839 [13:50:58] 06Operations, 10DBA: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488#3235290 (10jcrespo) [13:51:35] (03PS8) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 [13:51:58] (03PS9) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 [13:52:22] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3235306 (10jcrespo) [13:52:24] 06Operations, 10DBA: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#3235308 (10jcrespo) [13:52:32] (03CR) 10BBlack: [C: 031] "Same thought as moritz. We can do this at runtime with e.g. "tc qdisc replace dev eth0 root fq"" [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) (owner: 10Ema) [13:52:38] (03CR) 10Volans: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [13:53:17] 06Operations, 10DBA: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488#3235313 (10jcrespo) [13:53:52] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3235314 (10Ottomata) @cmjohnson, I'll be on vacation next week, and then at the analytics offsite the following. Coordinate with @elukey if you want to do it next we... [13:53:54] (03CR) 10Gehel: [C: 032] logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [13:56:03] ACKNOWLEDGEMENT - Check systemd state on labstore2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Madhuvishy Known issue, looking into it. [13:59:07] (03PS1) 10Gehel: logstash - add missing shebang line in python script [puppet] - 10https://gerrit.wikimedia.org/r/351841 [13:59:50] (03CR) 10Volans: [C: 031] logstash - add missing shebang line in python script [puppet] - 10https://gerrit.wikimedia.org/r/351841 (owner: 10Gehel) [14:00:20] (03CR) 10Gehel: [C: 032] logstash - add missing shebang line in python script [puppet] - 10https://gerrit.wikimedia.org/r/351841 (owner: 10Gehel) [14:03:35] !log maintain-meta_p --all-databases --purge --debug labsdb1009/1010/1011 for T164103 [14:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:47] T164103: Generate labsdb views for dtywiki, pawikisource, ptwikimedia, wbwikimedia - https://phabricator.wikimedia.org/T164103 [14:04:32] (03PS1) 10Gehel: logstash - fix argument parsing in logstash_delete_index [puppet] - 10https://gerrit.wikimedia.org/r/351842 [14:04:40] 06Operations, 06Labs: maintain-meta_p hands on connecting to wikimedia.org.uk - https://phabricator.wikimedia.org/T164490#3235339 (10chasemp) [14:04:57] 06Operations, 06Labs: maintain-meta_p hands on connecting to 
wikimedia.org.uk - https://phabricator.wikimedia.org/T164490#3235351 (10chasemp) p:05Triage>03Normal [14:05:34] 06Operations, 06Labs: maintain-meta_p hangs on connecting to wikimedia.org.uk - https://phabricator.wikimedia.org/T164490#3235339 (10chasemp) [14:05:36] o/ [14:05:44] I have a thing on my calendar that says ORES traffic is being redirected at EQIAD now. :) [14:05:50] I'm here to monitor [14:06:04] akosiaris, ^ [14:07:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1070 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839 (owner: 10Marostegui) [14:07:32] let me know if that's happening :) [14:07:34] (03CR) 10Volans: [C: 031] logstash - fix argument parsing in logstash_delete_index [puppet] - 10https://gerrit.wikimedia.org/r/351842 (owner: 10Gehel) [14:07:42] (03CR) 10Gehel: [C: 032] logstash - fix argument parsing in logstash_delete_index [puppet] - 10https://gerrit.wikimedia.org/r/351842 (owner: 10Gehel) [14:08:06] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1070 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839 (owner: 10Marostegui) [14:08:15] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1070 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351839 (owner: 10Marostegui) [14:09:44] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Repool db1070 with less weight - T160392 (duration: 01m 16s) [14:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:52] T160392: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392 [14:10:11] halfak: ok will do. 
we are running 30 minutes late btw at the request of the services team [14:10:19] no prob [14:10:22] I'll be around :) [14:10:24] ok [14:11:49] 06Operations, 10Monitoring, 10Traffic: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3235386 (10fgiunchedi) Issue has been fixed upstream, pending next node_exporter release or internal package build [14:13:14] (03PS1) 10Marostegui: db-eqiad.php: Increase db1070's weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351844 [14:16:31] !log maintain-meta_p --all-databases --purge --debug labsdb1001 for T164103 [14:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:38] T164103: Generate labsdb views for dtywiki, pawikisource, ptwikimedia, wbwikimedia - https://phabricator.wikimedia.org/T164103 [14:18:16] 06Operations, 06Labs: maintain-meta_p hangs on connecting to wikimedia.org.uk - https://phabricator.wikimedia.org/T164490#3235339 (10jcrespo) I do not think we host that: ``` $ dig wikimedia.org.uk wikimedia.org.uk. 1847 IN A 37.188.117.184 $ whois 37.188.117.184 descr: Rackspac... [14:20:51] 07Puppet, 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure, 07Beta-Cluster-reproducible: New instance have broken puppet configuration when using puppetmaster standalone - https://phabricator.wikimedia.org/T148929#2736876 (10madhuvishy) @hashar This seems like a known and documented... 
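[editor's note] The db1070 patches in this stretch (repool with less weight, then increase, then restore) follow a gradual warm-up pattern. A minimal sketch of why that works, assuming replica weights are relative and reads are spread in proportion to them; the host name `replica-a` and all numbers are made up for illustration and are not the values in db-eqiad.php:

```python
from typing import Dict


def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Return each host's expected fraction of reads, proportional to its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one host needs a positive weight")
    return {host: weight / total for host, weight in weights.items()}


# Hypothetical ramp: the freshly restarted host starts low and is raised in steps.
for step in (100, 200, 300):
    print(step, traffic_share({"db1070": step, "replica-a": 300}))
```

A host with cold caches gets a small share first, and each later patch raises its weight once lag and latency look healthy.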
[14:21:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1070's weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351844 (owner: 10Marostegui) [14:22:36] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1070's weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351844 (owner: 10Marostegui) [14:22:43] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1070's weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351844 (owner: 10Marostegui) [14:23:50] !log maintain-meta_p --databases dtywiki,pawikisource,ptwikimedia,wbwikimedia --debug labsdb1003 for T164103 [14:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:59] T164103: Generate labsdb views for dtywiki, pawikisource, ptwikimedia, wbwikimedia - https://phabricator.wikimedia.org/T164103 [14:24:05] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Increase db1070 weight - T160392 (duration: 01m 10s) [14:24:08] (03PS1) 10Gehel: logstash - indices need to be striped of whitespace in delete script [puppet] - 10https://gerrit.wikimedia.org/r/351846 [14:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:12] T160392: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392 [14:25:12] (03PS1) 10Marostegui: db-eqiad.php: Restore db1070 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351848 [14:27:01] (03PS1) 10Giuseppe Lavagetto: cache: services switchback to eqiad 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/351849 [14:27:03] (03PS1) 10Giuseppe Lavagetto: cache: services switchback to eqiad 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/351850 [14:27:47] <_joe_> bblack, mobrovac ^^ [14:28:34] (03CR) 10Alexandros Kosiaris: [C: 031] cache: services switchback to eqiad 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/351849 (owner: 10Giuseppe Lavagetto) [14:28:42] (03CR) 10Alexandros Kosiaris: [C: 031] cache: services switchback to eqiad 2/2 [puppet] - 
10https://gerrit.wikimedia.org/r/351850 (owner: 10Giuseppe Lavagetto) [14:29:11] !log dropping and recreating user for maintain-views on labsdb1001 T164103 [14:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:19] T164103: Generate labsdb views for dtywiki, pawikisource, ptwikimedia, wbwikimedia - https://phabricator.wikimedia.org/T164103 [14:29:29] (03CR) 10Mobrovac: [C: 031] cache: services switchback to eqiad 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/351849 (owner: 10Giuseppe Lavagetto) [14:29:37] chasemp ^ heads up [14:29:43] (03CR) 10BBlack: [C: 031] cache: services switchback to eqiad 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/351849 (owner: 10Giuseppe Lavagetto) [14:29:55] (03CR) 10Mobrovac: [C: 031] cache: services switchback to eqiad 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/351850 (owner: 10Giuseppe Lavagetto) [14:30:07] (03CR) 10BBlack: [C: 031] cache: services switchback to eqiad 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/351850 (owner: 10Giuseppe Lavagetto) [14:30:15] I'm staging the switch patches for later switchback [14:30:25] (03PS2) 10Filippo Giunchedi: Revert "Switch deployment server to naos.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/351655 [14:30:27] (03PS1) 10Filippo Giunchedi: traffic: swift temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/351852 [14:30:29] (03PS1) 10Filippo Giunchedi: traffic: swift a/p in eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/351853 [14:30:34] godog: s/switch/swift/ but ok [14:30:51] (03CR) 10Giuseppe Lavagetto: [C: 032] cache: services switchback to eqiad 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/351849 (owner: 10Giuseppe Lavagetto) [14:31:07] <_joe_> ok, sending all to a/a now [14:31:12] ok [14:31:39] akosiaris: lol, yeah switch swift too close [14:32:06] 06Operations, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - 
https://phabricator.wikimedia.org/T125735#3235448 (10hashar) I am all for enabling persistent connections again. Ideally we would deploy it on a singl... [14:35:19] <_joe_> !log running puppet on varnishes in eqiad (text,misc,maps) to pick up the a/a traffic to services [14:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:43] !log installing mysql-connector-java security updates on hadoop cluster [14:36:44] jynus: got it tx [14:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:07] (03CR) 10Gehel: [C: 032] logstash - indices need to be striped of whitespace in delete script [puppet] - 10https://gerrit.wikimedia.org/r/351846 (owner: 10Gehel) [14:38:14] 06Operations, 10netops, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Replace old Prometheus VMs addresses with baremetal in firewall configuration - https://phabricator.wikimedia.org/T164495#3235452 (10fgiunchedi) [14:38:16] (03PS2) 10Gehel: logstash - indices need to be striped of whitespace in delete script [puppet] - 10https://gerrit.wikimedia.org/r/351846 [14:39:33] (03CR) 10Gehel: [V: 032 C: 032] logstash - indices need to be striped of whitespace in delete script [puppet] - 10https://gerrit.wikimedia.org/r/351846 (owner: 10Gehel) [14:39:35] <_joe_> ok, switching the internal traffic now [14:39:36] (03PS1) 10Elukey: Re-enable persistent connection to Redis for jobrunners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351854 (https://phabricator.wikimedia.org/T125735) [14:39:52] !log oblivian: Setting restbase-async in codfw UP [14:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:11] !log oblivian: Setting restbase in eqiad UP [14:40:12] ccccccelgjhdnfjckieituirflcckfccidljjrekjgvv [14:40:17] sorry [14:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:32] apergos: and now use it once more to invalidate that [14:40:35] :-) [14:40:44] lol 
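[editor's note] The series of logstash patches above (delete indices older than 31 days, fix argument parsing, strip whitespace from index names) is not shown in the log, so the following is only a guess at the selection step, not the script from the puppet repo. It assumes the default daily index naming `logstash-YYYY.MM.DD`; `indices_to_delete` and its parameters are hypothetical.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def indices_to_delete(names: Iterable[str], today: date,
                      max_age_days: int = 31,
                      prefix: str = "logstash-") -> List[str]:
    """Pick index names whose date suffix is more than max_age_days before today."""
    cutoff = today - timedelta(days=max_age_days)
    old: List[str] = []
    for raw in names:
        # Names read from a line-oriented listing may carry spaces or newlines,
        # so strip them before matching.
        name = raw.strip()
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a dated index, leave it alone
        if day < cutoff:
            old.append(name)
    return old
```

Skipping anything that does not parse as a dated index keeps the sketch from ever selecting unrelated indices for deletion.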
[14:41:02] <_joe_> elukey: hold on with that [14:41:15] <_joe_> so, merging the second traffic patch [14:41:20] ok [14:41:50] (03CR) 10Giuseppe Lavagetto: [C: 032] cache: services switchback to eqiad 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/351850 (owner: 10Giuseppe Lavagetto) [14:41:58] (03PS2) 10Giuseppe Lavagetto: cache: services switchback to eqiad 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/351850 [14:42:04] _joe_ sure sure it is not meant to be merged soon [14:42:07] <_joe_> gehel: merging during a switchover? [14:42:11] <_joe_> ahem [14:42:15] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] cache: services switchback to eqiad 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/351850 (owner: 10Giuseppe Lavagetto) [14:42:23] :/ ... yeah, did not realize, sry [14:43:19] <_joe_> !log forcing a puppet run on cache (text,maps, misc) in eqiad/codfw to complete the switchback [14:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:39] !log oblivian: Setting restbase in codfw DOWN [14:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:16] !log oblivian: Setting restbase-async in eqiad DOWN [14:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:47] <_joe_> gehel: any reason why wdqs is marked down in codfw according to discovery? [14:46:15] _joe_: not that I know of... [14:46:36] <_joe_> ok [14:46:59] !log oblivian: Setting wdqs in codfw UP [14:46:59] it is receiveing traffic ... [14:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:04] <_joe_> gehel: this is the internal discovery [14:48:04] <_joe_> not the varnish routing [14:48:43] no idea then. 
And there are probably nothing using internal discovery for wdqs [14:50:53] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Put prometheus baremetal servers in service - https://phabricator.wikimedia.org/T148408#3235550 (10akosiaris) [14:50:55] 06Operations, 10netops, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Replace old Prometheus VMs addresses with baremetal in firewall configuration - https://phabricator.wikimedia.org/T164495#3235548 (10akosiaris) 05Open>03Resolved Done. cr1-eqiad, cr2-eqiad updated and now the term is ``` from... [14:50:57] godog: ^ [14:51:22] * akosiaris likes ansible for these things [14:51:46] akosiaris: !! that was fast, thanks a lot! I was the config has hostnames in comments, is that automatic from juniper or ..? [14:51:56] me [14:52:10] <_joe_> mobrovac: we should be done [14:52:12] thanks to my ocd [14:52:30] <_joe_> services are switched back [14:52:32] akosiaris: haha ok, that explains why I wasn't finding anything about that in juniper docs [14:52:36] \o/ [14:52:38] <_joe_> green light for the other switches [14:52:53] halfak: we are done btw. everything seems to have gone fine [14:52:59] 06Operations, 10Graphite, 13Patch-For-Review: something (reqstats?) puts many different metrics into graphite, allocating a lot of disk space - https://phabricator.wikimedia.org/T1075#3235558 (10hashar) [14:53:01] 06Operations, 10Continuous-Integration-Infrastructure, 07Upstream, 07Zuul: Let us customize Zuul metrics reported to statsd - https://phabricator.wikimedia.org/T1369#3235556 (10hashar) 05Open>03declined Seems statsd is strong enough to handle the metrics. Notably nowadays we have ~ 300 jobs instead of... [14:53:53] akosiaris, confirmed. Looks good on my end. 
[14:54:16] _joe_: \o/ [14:54:21] PROBLEM - Apache HTTP on mw2205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:54:58] _joe_: graphs looking good [14:55:11] RECOVERY - Apache HTTP on mw2205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.024 second response time [14:55:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1070 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351848 (owner: 10Marostegui) [14:55:45] really nice indeed - https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=now-6h&to=now [14:55:48] AQS is happier :) [14:56:03] (latency going down) [14:57:22] (03PS1) 10Elukey: Lower down Piwik Apache workers again [puppet] - 10https://gerrit.wikimedia.org/r/351856 [15:01:12] elukey: \o/ [15:01:36] (03PS1) 10Milimetric: Sqoop using the pre-generated orm jar [puppet] - 10https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119) [15:02:12] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1070 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351848 (owner: 10Marostegui) [15:02:24] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1070 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351848 (owner: 10Marostegui) [15:03:23] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Restore db1070 original weight - T160392 (duration: 00m 57s) [15:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:32] T160392: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392 [15:06:02] (03PS1) 10Marostegui: db-codfw.php: Repool db2066, depool db2059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351858 (https://phabricator.wikimedia.org/T162539) [15:06:39] (03PS1) 10Ema: prometheus: add tcpstat collector to default set [puppet] - 10https://gerrit.wikimedia.org/r/351859 [15:07:26] (03PS2) 10Ema: prometheus: add tcpstat collector to default set [puppet] - 
10https://gerrit.wikimedia.org/r/351859 [15:08:15] (03CR) 10Elukey: [C: 032] Lower down Piwik Apache workers again [puppet] - 10https://gerrit.wikimedia.org/r/351856 (owner: 10Elukey) [15:08:56] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2066, depool db2059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351858 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [15:12:53] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2066, depool db2059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351858 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [15:13:04] (03CR) 10jenkins-bot: db-codfw.php: Repool db2066, depool db2059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351858 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [15:14:28] !log marostegui@naos Synchronized wmf-config/db-codfw.php: Repool db2066, depool db2059 - T162539 T163548 (duration: 01m 06s) [15:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:36] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539 [15:14:36] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548 [15:15:20] !log labsdb1003 maintain-views --databases ptwikimedia,pawikisourcewbwikimedia,dtywiki --replace-all --debug T164103 [15:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:28] T164103: Generate labsdb views for dtywiki, pawikisource, ptwikimedia, wbwikimedia - https://phabricator.wikimedia.org/T164103 [15:16:17] !log Deploy alter table on wikidatawiki.wb_terms - db2059- https://phabricator.wikimedia.org/T162539 https://phabricator.wikimedia.org/T163548 [15:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:29] (03PS1) 10Rush: labsdb: add new wiki dbs to dnsrecursor.pp [puppet] - 
10https://gerrit.wikimedia.org/r/351866 [15:29:31] (03PS2) 10Filippo Giunchedi: traffic: swift temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/351852 [15:30:30] bblack ema ^ going with https://gerrit.wikimedia.org/r/#/c/351852 and https://gerrit.wikimedia.org/r/#/c/351853/ for swift switchback [15:30:32] <_joe_> godog: I'll switch the internal url in the meantime, ok? [15:30:42] _joe_: yep, thanks! [15:30:42] godog: ack [15:30:57] (03CR) 10Filippo Giunchedi: [C: 032] traffic: swift temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/351852 (owner: 10Filippo Giunchedi) [15:30:59] (03CR) 10BBlack: [C: 031] traffic: swift temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/351852 (owner: 10Filippo Giunchedi) [15:31:12] (03CR) 10BBlack: [C: 031] traffic: swift a/p in eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/351853 (owner: 10Filippo Giunchedi) [15:31:18] !log oblivian: Setting switft-rw in codfw DOWN [15:31:21] !log oblivian: Setting swift-rw in eqiad UP [15:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:15] !log run-puppet-agent on cache_upload in codfw/swift for swift a/a [15:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:04] (03PS1) 10Addshore: Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) [15:33:19] (03PS2) 10Filippo Giunchedi: traffic: swift a/p in eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/351853 [15:33:21] (03CR) 10Addshore: [C: 04-2] "Needs feature to be merged and deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [15:33:21] addshore: please hold on that for righ tnow [15:33:24] ah thanks [15:33:26] (03CR) 10Alexandros Kosiaris: [C: 032] Reading Web Page 
Previews alerts [puppet] - 10https://gerrit.wikimedia.org/r/350377 (owner: 10Phuedx) [15:33:31] (03PS5) 10Alexandros Kosiaris: Reading Web Page Previews alerts [puppet] - 10https://gerrit.wikimedia.org/r/350377 (owner: 10Phuedx) [15:33:51] PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on cp1008 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-swift-rw.state is broken [15:33:52] PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on radon is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-swift-rw.state is broken [15:34:10] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Reading Web Page Previews alerts [puppet] - 10https://gerrit.wikimedia.org/r/350377 (owner: 10Phuedx) [15:34:11] PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on baham is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-swift-rw.state is broken [15:34:17] \o/ [15:34:21] (03PS20) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [15:34:21] PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on eeden is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-swift-rw.state is broken [15:34:42] nooo rebase wars /o\ [15:34:47] (03CR) 10jerkins-bot: [V: 04-1] Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [15:34:49] !log add cwd to acl*procurement-review for phab S4 [15:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:08] godog: I'm not planning to merge, just trying to keep that change up to date... 
[15:35:13] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] traffic: swift a/p in eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/351853 (owner: 10Filippo Giunchedi) [15:35:20] (03PS3) 10Filippo Giunchedi: traffic: swift a/p in eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/351853 [15:35:21] godog: sorry my bad [15:35:36] I should not have merged [15:35:43] akosiaris: haha no worries [15:35:53] oh, that wasn't me... [15:35:53] I should have spammed people with an announcement [15:35:53] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] traffic: swift a/p in eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/351853 (owner: 10Filippo Giunchedi) [15:35:55] 07Puppet, 06Labs: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941#3235671 (10bd808) p:05Triage>03Lowest Patches are of course always welcome, but this seems like a pretty delicate operation to perform via Puppet for what is in reality a seldom used edge... [15:36:38] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3235673 (10Papaul) p:05Triage>03Normal a:05akosiaris>03Papaul [15:36:44] !log run-puppet-agent on cache_upload in codfw/swift for swift a/p in codfw [15:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:41] gehel: yeah I think it counts only in the merge/submit case [15:38:01] (03PS2) 10Rush: labsdb: add new wiki dbs to dnsrecursor.pp [puppet] - 10https://gerrit.wikimedia.org/r/351866 [15:38:36] ok all done, I'll keep an eye on swift dashboards [15:38:56] ok [15:38:57] you're on for 20 mins from now with naos -> tin, right? [15:39:08] apergos: that's correct [15:39:14] great [15:40:53] apergos: ahh yes, don't worry, not doing it now, just preparing the patch! [15:41:08] yep saw that, thanks! [15:41:13] _joe_: the confd template errors above are expected? 
Compilation of file '/var/lib/gdnsd/discovery-swift-rw.state' is broken [15:42:04] !log nginx upgraded to 1.11.10-1+wmf1 on cp1045 (cache_misc) [15:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:54] godog: are we holding on merges atm or can I run w/ https://gerrit.wikimedia.org/r/#/c/351866/ ? [15:43:54] <_joe_> godog: still broken? [15:44:25] <_joe_> godog: as in: someone went to radon and looked at /var/log/confd.log? [15:44:48] chasemp: yeah you can go, thanks for checking [15:45:00] _joe_: no I haven't looked yet, I was checking icinga tho [15:45:03] (03CR) 10Rush: [C: 032] labsdb: add new wiki dbs to dnsrecursor.pp [puppet] - 10https://gerrit.wikimedia.org/r/351866 (owner: 10Rush) [15:45:03] !log nginx upgraded to 1.11.10-1+wmf1 on cp1051 (cache_misc) [15:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:41] godog: kk thanks (and np) [15:45:47] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=swift-rw,name=codfw [15:45:52] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3235683 (10Papaul) [15:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:57] (03PS1) 10Gehel: maps - align configuration for all maps clusters [puppet] - 10https://gerrit.wikimedia.org/r/351872 (https://phabricator.wikimedia.org/T160215) [15:46:04] <_joe_> godog: should recover nw [15:46:12] <_joe_> *now [15:47:33] (03PS3) 10BBlack: cache_upload: add maps functionality [puppet] - 10https://gerrit.wikimedia.org/r/351663 [15:47:45] (03CR) 10jerkins-bot: [V: 04-1] cache_upload: add maps functionality [puppet] - 10https://gerrit.wikimedia.org/r/351663 (owner: 10BBlack) [15:48:23] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3235694 
(10Papaul) @akosiaris what HW RAID type I have to use to ganeti200[7-8]? [15:48:42] papaul: raid 5 [15:48:58] akosiaris: thanks [15:49:15] _joe_: recovered indeed [15:49:32] (03PS4) 10BBlack: cache_upload: add maps functionality [puppet] - 10https://gerrit.wikimedia.org/r/351663 [15:51:37] (03PS5) 10Ema: BBR Congestion Control [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) [15:51:45] (03CR) 10Ema: [V: 032 C: 032] BBR Congestion Control [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) (owner: 10Ema) [15:52:01] (03CR) 10Jcrespo: [C: 032] db: Remove all read traffic from x1, es2 & es3-master-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407) (owner: 10Jcrespo) [15:52:06] (03PS3) 10Jcrespo: db: Remove all read traffic from x1, es2 & es3-master-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407) [15:56:04] (03CR) 10jenkins-bot: db: Remove all read traffic from x1, es2 & es3-master-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351799 (https://phabricator.wikimedia.org/T164407) (owner: 10Jcrespo) [15:57:00] !log jynus@naos Synchronized wmf-config/db-eqiad.php: Remove all read traffic from x1, es2 & es3-master-eqiad (duration: 01m 08s) [15:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:50] (03CR) 10Filippo Giunchedi: Revert "Switch deployment CNAMEs to naos.codfw.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/351654 (owner: 10Filippo Giunchedi) [15:58:56] (03PS2) 10Filippo Giunchedi: Revert "Switch deployment CNAMEs to naos.codfw.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/351654 [16:00:37] !log switch deployment server back to tin.eqiad.wmnet [16:00:43] (03CR) 10Filippo Giunchedi: [C: 032] Revert "Switch deployment CNAMEs to naos.codfw.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/351654 (owner: 
10Filippo Giunchedi) [16:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:42] (03CR) 10Filippo Giunchedi: Revert "Switch deployment server to naos.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/351655 (owner: 10Filippo Giunchedi) [16:02:47] (03PS3) 10Filippo Giunchedi: Revert "Switch deployment server to naos.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/351655 [16:03:40] !log T160759: restoring default Cassandra tombstone_threshold in eqiad [16:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:49] T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759 [16:04:00] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Revert "Switch deployment server to naos.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/351655 (owner: 10Filippo Giunchedi) [16:09:30] !log filippo@tin Started scap: README [16:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:56] (03PS1) 10Papaul: DNS: Add mgmt dns entries for ganeti200[7-8] [dns] - 10https://gerrit.wikimedia.org/r/351877 [16:09:59] !log filippo@tin scap aborted: README (duration: 00m 28s) [16:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:54] not really what I wanted to do, anyways deployment server is back to tin [16:11:17] thcipriani: ^ if you'd like to test [16:11:35] godog: sure [16:14:41] !log thcipriani@tin Synchronized README: test tin is back (duration: 01m 06s) [16:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:58] all righty then [16:15:21] godog: lgtm [16:15:49] \o/ thanks [16:15:59] sending an announcement email now [16:18:35] !log nginx upgraded to 1.11.10-1+wmf1 on all cache_misc [16:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:00] (03CR) 10BryanDavis: "Do we have the ability to add this kafkatee setup in deployment-prep to test things out?" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350817 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi) [16:23:48] _joe_: re: 1018, you formatted /srv, yes? [16:24:06] <_joe_> urandom: it's a raid 0 that lost one disk [16:24:15] addshore: I +2'd the cognate change for wmf.21 waiting for zuul now [16:24:15] gilles, I have created https://phabricator.wikimedia.org/T164504 not as a request to you, but as something to report to several teams that you should be aware of [16:24:19] _joe_: right [16:24:22] <_joe_> it's not like I had a lot of options :P [16:24:36] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and setup boron replacement frpm1001 - https://phabricator.wikimedia.org/T162298#3235814 (10Jgreen) 05Open>03Resolved done! [16:24:45] _joe_: no no, that's fine, i was half expecting to have to do that, but it looked like you did, so i was just making sure :) [16:24:58] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3235820 (10Papaul) [16:25:22] <_joe_> heh ok [16:25:25] thcipriani: ack! [16:25:28] <_joe_> check the size of the FS though [16:25:32] <_joe_> I did some magic there [16:26:06] what magic? disabled reserve? [16:26:11] thcipriani: looks like it is merged [16:26:32] addshore: yup, getting stuff on tin squared away [16:26:35] <_joe_> I turned off the VG, deleted the raid device, re-created a raid device with the same name and the same size [16:26:43] <_joe_> turned on the VG [16:26:45] <_joe_> :P [16:26:51] addshore: anything you want to check on mwdebug1002 before I sync? [16:26:52] <_joe_> it "worked" [16:27:26] <_joe_> I tested I could write there [16:27:47] <_joe_> and that the partition was 5+T [16:27:52] thcipriani: I can check a couple of things yes! 
[16:28:07] addshore: ok, it's pulled down on there [16:28:12] checking [16:28:20] addshore: I haven't pulled in the config change yet though [16:28:31] ahhhh yeh, nothing to check without the config change. [16:28:36] sync this and then I'll check the config change [16:28:38] didn't figure :) [16:29:41] RECOVERY - cassandra-a service on restbase1018 is OK: OK - cassandra-a is active [16:29:50] !log thcipriani@tin Synchronized php-1.29.0-wmf.21/extensions/Cognate: SWAT: [[gerrit:351867|Add read only mode]] T164407 (duration: 00m 56s) [16:29:51] RECOVERY - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-a valid until 2017-12-13 00:15:55 +0000 (expires in 222 days) [16:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:58] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407 [16:30:03] addshore: ok, merging config change [16:30:21] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [16:30:41] RECOVERY - Restbase root url on restbase1018 is OK: HTTP OK: HTTP/1.1 200 - 15540 bytes in 0.082 second response time [16:30:52] (03PS2) 10Thcipriani: Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [16:31:55] (03CR) 10Thcipriani: [C: 032] Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [16:34:26] ^ addshore I think you have to remove the -2 for zuul to pick it up :) [16:34:40] (03CR) 10Addshore: [C: 032] Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [16:34:42] done it for him [16:35:10] 
(assuming it is an administrative thing) [16:35:31] cool thanks :) [16:35:55] heh irccloud died? [16:36:11] RECOVERY - cassandra-b service on restbase1018 is OK: OK - cassandra-b is active [16:36:31] (03Merged) 10jenkins-bot: Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [16:36:40] (03CR) 10jenkins-bot: Enable Cognate for Wiktionary in Read Only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351868 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [16:36:41] RECOVERY - cassandra-c service on restbase1018 is OK: OK - cassandra-c is active [16:36:41] RECOVERY - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-b valid until 2017-12-13 00:15:56 +0000 (expires in 222 days) [16:36:51] RECOVERY - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-c valid until 2017-12-13 00:15:58 +0000 (expires in 222 days) [16:37:09] bblack: yup :) [16:37:27] addshore: config change live on mwdebug1002, check please [16:37:34] ack, checking [16:39:27] thcipriani: checked everything I can, ready to push it out! 
[16:39:35] addshore: ok, going [16:42:31] !log thcipriani@tin Synchronized wmf-config: [[gerrit:351868|Enable Cognate for Wiktionary in Read Only mode]] T164407 (duration: 00m 40s) [16:42:39] ^ addshore live now [16:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:40] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407 [16:42:52] thcipriani: ack, im watching all sorts of graphs and logs [16:43:37] hmm wait, something is off [16:43:46] https://www.irccloud.com/pastebin/x5VFWM3H/ [16:44:42] bah, a null got lost in a rebase [16:45:04] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: add tcpstat collector to default set [puppet] - 10https://gerrit.wikimedia.org/r/351859 (owner: 10Ema) [16:45:27] thcipriani: https://gerrit.wikimedia.org/r/#/c/351881/ is also needed [16:45:59] addshore: I see that [16:48:02] (03CR) 10BBlack: [C: 031] varnish: swap around backend ttl cap and keep values [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/343845 (https://phabricator.wikimedia.org/T124954) (owner: 10Ema) [16:48:23] addshore: I'm going to revert while we wait for this to merge [16:48:28] thcipriani: ack! 
[16:48:32] (03PS4) 10Ema: varnish: swap around backend ttl cap and keep values [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/343845 (https://phabricator.wikimedia.org/T124954) [16:48:38] (03CR) 10Ema: [V: 032 C: 032] varnish: swap around backend ttl cap and keep values [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/343845 (https://phabricator.wikimedia.org/T124954) (owner: 10Ema) [16:49:41] !log thcipriani@tin Synchronized wmf-config: Revert [[gerrit:351868|Enable Cognate for Wiktionary in Read Only mode]] T164407 (duration: 00m 40s) [16:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:49] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407 [16:50:54] (03PS2) 10Giuseppe Lavagetto: Parallelize url fetching [software/service-checker] - 10https://gerrit.wikimedia.org/r/351110 [16:50:56] (03PS1) 10Giuseppe Lavagetto: Add statsd support [software/service-checker] - 10https://gerrit.wikimedia.org/r/351882 [16:53:19] urandom: is that you for restbase1018? [16:53:55] probably yes [16:54:44] elukey: yes [16:55:05] ah nice.. can you log it just for the records? [16:55:10] just doing that [16:55:13] super [16:55:29] i was trying to get the bootstrap to actually start, so that the log matched that... 
but then, gah [16:55:40] !log T163292: Starting bootstrap of restbase1018-a [16:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:49] T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292 [16:56:09] elukey: (but it is started, now) [16:56:50] thcipriani: merged :) [16:57:15] (03PS3) 10Ema: prometheus: add tcpstat collector to default set [puppet] - 10https://gerrit.wikimedia.org/r/351859 [16:57:29] (03CR) 10Ema: [V: 032 C: 032] prometheus: add tcpstat collector to default set [puppet] - 10https://gerrit.wikimedia.org/r/351859 (owner: 10Ema) [16:57:48] * thcipriani pushes live [16:58:21] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [16:59:19] looking ^ [16:59:35] urandom: sure sure I just wanted to make sure that it was in the sal, that's it :) [16:59:49] !log thcipriani@tin Synchronized php-1.29.0-wmf.21/extensions/Cognate/src/CognateStore.php: [[gerrit:351881|Construct DBReadOnlyError with null db]] (duration: 00m 39s) [16:59:55] 06Operations, 10ops-codfw, 10hardware-requests: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3236011 (10RobH) [16:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] ^ addshore live, going to revert my revert now [17:00:10] ack! [17:00:21] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:00:28] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140543 (10RobH) 05stalled>03Resolved Yes, this is a new system, so it's reclaimed to spares, not decommissioned... 
[17:01:22] (03PS1) 10Chad: Gerrit: Turn replication back on [puppet] - 10https://gerrit.wikimedia.org/r/351884 [17:01:51] !log thcipriani@tin Synchronized wmf-config: Revert revert [[gerrit:351868|Enable Cognate for Wiktionary in Read Only mode]] T164407 (duration: 00m 40s) [17:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:59] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407 [17:02:02] ^ addshore live again [17:02:07] ack! [17:02:12] (03CR) 10Paladox: [C: 031] Gerrit: Turn replication back on [puppet] - 10https://gerrit.wikimedia.org/r/351884 (owner: 10Chad) [17:02:13] again, checking all of the places :) [17:02:18] and I'll continue to watch it for a while [17:02:56] (03PS2) 10Dzahn: Gerrit: Turn replication back on [puppet] - 10https://gerrit.wikimedia.org/r/351884 (owner: 10Chad) [17:03:07] 06Operations: Reduce rpcbind use - https://phabricator.wikimedia.org/T106477#3236038 (10MoritzMuehlenhoff) nfs-common and rpcbind get installed during the initial d-i base installation. At this point our apt config to not install recommended packages is not yet in place (and I've also found no preseed option to... [17:05:13] cool. declaring deployment complete. [17:06:24] thcipriani: how about also https://gerrit.wikimedia.org/r/#/c/351845 which I mentioned in #wikimedia-releng ? It would let us watch these queries much closer [17:07:34] addshore: ah, right. wanna cherry-pick that one and we'll get it out? 
[17:07:50] thcipriani: https://gerrit.wikimedia.org/r/#/c/351886/ [17:08:15] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3236053 (10Nuria) [17:08:17] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 15User-Elukey: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333#3095234 (10Nuria) 05Open>03Resolved [17:10:46] (03CR) 10Dzahn: [C: 032] Gerrit: Turn replication back on [puppet] - 10https://gerrit.wikimedia.org/r/351884 (owner: 10Chad) [17:15:36] Anyone know if something is going on with the WMFgerrit on github? It was just used to delete some branches in he ios repository [17:17:03] (03PS2) 10Dzahn: DNS: Add mgmt dns entries for ganeti200[7-8] [dns] - 10https://gerrit.wikimedia.org/r/351877 (owner: 10Papaul) [17:17:18] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt dns entries for ganeti200[7-8] [dns] - 10https://gerrit.wikimedia.org/r/351877 (owner: 10Papaul) [17:17:22] ahhh jenkins [17:18:09] hmm, not sure maybe related to https://gerrit.wikimedia.org/r/351884 ? [17:18:17] https://usercontent.irccloud-cdn.com/file/cUZobjOL/IMG_3112.PNG [17:18:19] coreyfloyd ^^ [17:19:02] paladox: so is this starting to replicate gerrit changes back to the iOS repo? [17:19:28] thcipriani: looks like jenkins finally merged that one! [17:19:51] I think that was turned off before... instead we only wanted to reply I replicate github back to gerrrit [17:19:58] nope, thats replicating to the new codfw server for gerrit. But the timing looks like it could be caused by that. 
[17:19:59] mutante ^^ [17:20:01] addshore: live on mwdebug1002 [17:20:05] checking [17:20:19] 06Operations, 06Labs, 13Patch-For-Review: (don't) decom promethium - https://phabricator.wikimedia.org/T164395#3236110 (10Dzahn) Afaict the host is exactly half-way between the realms in neverland, being "metal labs" and such, heh [17:20:40] paladox: what is the question please, i have no context [17:21:00] mutante it seems wmfgerrit deleted some projects on the ios project. [17:21:17] It may be related to https://gerrit.wikimedia.org/r/351884 (not really sure though) [17:21:32] question originally from coreyfloyd [17:21:41] On github? It doesn't have replicateProjectDeletions set [17:21:48] 06Operations, 10ops-codfw, 10hardware-requests: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3236115 (10jcrespo) I can confirm the only use it had in production (x1 slave on codfw) has been retired: https://tendril.wikimedia.org/tree https://noc.wikimedia.org/conf/highlig... [17:21:50] (03CR) 10Gehel: [C: 04-1] "This needs to be synchronized with reimaging maps-test servers." [puppet] - 10https://gerrit.wikimedia.org/r/351872 (https://phabricator.wikimedia.org/T160215) (owner: 10Gehel) [17:22:01] Wait, deleted? [17:22:02] What? [17:22:17] RainbowSprinkles https://usercontent.irccloud-cdn.com/file/cUZobjOL/IMG_3112.PNG [17:22:29] That's not projects. [17:22:31] That's branches. [17:22:47] oh woops [17:22:58] I mixed my words up, i meant branches (sorry) [17:23:07] coreyfloyd: So, I forced a replication of everything. Tbh, if we *dont* want things to *ever* replicate to github, we need to configure that explicitly. [17:23:18] (right now, you'd just been relying on "nobody's using it on gerrit, so it won't replicate" [17:23:28] looks good (nothing broken) still waiting for things to show up in graphtie << thcipriani [17:25:14] thcipriani: data is appearing, all looks good to rollout! [17:25:22] addshore: cool, going live. 
[17:25:38] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 2 others: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#2247220 (10Deskana) p:05High>03Low This remains a valid issue, but has not been touched in a while. Changing priority accordingly. [17:25:56] (03PS1) 10Andrew Bogott: bootstrap-vz: Include kpartx package [puppet] - 10https://gerrit.wikimedia.org/r/351889 [17:25:58] (03PS1) 10Andrew Bogott: bootstrap-vz: Add a manifest for a Stretch labs image [puppet] - 10https://gerrit.wikimedia.org/r/351890 [17:26:31] coreyfloyd: I blocked replication of wikipedia-ios explicitly. Shouldn't happen again [17:27:53] (03CR) 10Paladox: bootstrap-vz: Add a manifest for a Stretch labs image (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351890 (owner: 10Andrew Bogott) [17:28:13] (03CR) 10Paladox: bootstrap-vz: Add a manifest for a Stretch labs image (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351890 (owner: 10Andrew Bogott) [17:29:02] (03CR) 10Andrew Bogott: [C: 032] bootstrap-vz: Include kpartx package [puppet] - 10https://gerrit.wikimedia.org/r/351889 (owner: 10Andrew Bogott) [17:29:47] (03PS1) 10Catrope: Enable Flow beta feature on cawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351891 (https://phabricator.wikimedia.org/T164498) [17:30:06] 06Operations, 06Labs, 10hardware-requests: Codfw: (2) hardware access request for labtest [region 2] - https://phabricator.wikimedia.org/T161766#3236167 (10chasemp) [17:30:17] !log thcipriani@tin Synchronized php-1.29.0-wmf.21/extensions/Cognate: [[gerrit:351886|Add stats tracking for CognateRepo method usage]] (duration: 00m 39s) [17:30:22] ^ addshore live [17:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:29] thanks! 
[17:30:36] 06Operations, 06Labs, 10hardware-requests: Codfw: (2) hardware access request for labtest [region 2] - https://phabricator.wikimedia.org/T161766#3142263 (10chasemp) Updated description from: > Number of systems: 1 to the correct > Number of systems: 2 [17:31:01] *now* I'll declare deployment complete [17:32:08] !log nginx upgrading to 1.11.10-1+wmf1 on cache_maps [17:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:38] 06Operations, 10hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#3236179 (10RobH) So one of these needs to adjust to 32GB of RAM. [17:48:50] 06Operations, 10hardware-requests: codfw: (1) labs puppetmaster - https://phabricator.wikimedia.org/T164515#3236250 (10RobH) [17:49:03] 06Operations, 10hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919499 (10RobH) [17:49:19] 06Operations, 10hardware-requests: codfw: (2) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919499 (10RobH) [17:51:19] RainbowSprinkles: thanks. Sorry had to go to a meeting [17:54:39] (03PS2) 10Andrew Bogott: bootstrap-vz: Add a manifest for a Stretch labs image [puppet] - 10https://gerrit.wikimedia.org/r/351890 [18:00:54] (03PS3) 10Andrew Bogott: bootstrap-vz: Add a manifest for a Stretch labs image [puppet] - 10https://gerrit.wikimedia.org/r/351890 [18:02:30] thcipriani: are jynus and I okay to make a couple of small wiktionaries write to the cognate dbs again so that we can monitor the queries? 
(just making sure I don't get in the way of anything) [18:02:31] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 106, down: 1, dormant: 0, excluded: 3, unused: 0BRge-11/0/6: down - db1025BR [18:04:05] (03CR) 10Andrew Bogott: [C: 032] bootstrap-vz: Add a manifest for a Stretch labs image [puppet] - 10https://gerrit.wikimedia.org/r/351890 (owner: 10Andrew Bogott) [18:04:10] addshore: I'm doing anything on the servers currently, I think the datacenter switchover stuff is done for now, but jynus would know more than I would :) [18:04:19] ack! [18:05:44] 06Operations, 10Traffic: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3236376 (10BBlack) Yeah, @Faidon has brought up a similar argument before on a slightly different level: that we shouldn't be using nginx-full on most hosts anyways, since we use virtually none of the... [18:07:10] (03PS1) 10Andrew Bogott: bootstrap-vz: fix c/p error [puppet] - 10https://gerrit.wikimedia.org/r/351895 [18:10:43] (03CR) 10Paladox: [C: 031] bootstrap-vz: fix c/p error [puppet] - 10https://gerrit.wikimedia.org/r/351895 (owner: 10Andrew Bogott) [18:12:08] (03CR) 10Andrew Bogott: [C: 032] bootstrap-vz: fix c/p error [puppet] - 10https://gerrit.wikimedia.org/r/351895 (owner: 10Andrew Bogott) [18:12:17] (03PS1) 10Addshore: wgCognateReadOnly false for 'small' wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351911 (https://phabricator.wikimedia.org/T164407) [18:13:37] (03CR) 10Addshore: [C: 032] wgCognateReadOnly false for 'small' wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351911 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [18:13:40] jynus: ^^ [18:14:16] 06Operations, 10Cassandra, 13Patch-For-Review, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3236403 (10Cmjohnson) [18:14:48] 06Operations, 10ops-eqiad, 15User-fgiunchedi: 
Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T163690#3236404 (10Cmjohnson) 05Open>03Resolved Resolved...the old disk has been dropped off for return [18:15:27] (03CR) 10Jcrespo: [C: 031] wgCognateReadOnly false for 'small' wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351911 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [18:16:00] (03Merged) 10jenkins-bot: wgCognateReadOnly false for 'small' wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351911 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [18:16:07] (03CR) 10jenkins-bot: wgCognateReadOnly false for 'small' wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351911 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [18:17:27] syncing [18:18:04] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T164407 [[gerrit:351911|wgCognateReadOnly false for small wikis]] (duration: 00m 40s) [18:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:12] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407 [18:18:31] jynus: right, we should start to see a few writes now [18:20:30] (03PS1) 10Andrew Bogott: Labs/salt: Update master finger for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/351914 [18:20:52] 06Operations, 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3236468 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson The warranty expired 2 years ago, disks have been removed and destroyed. Removed from rack and updated racktab... 
[18:22:28] jouncebot: now [18:22:28] No deployments scheduled for the next 90 hour(s) and 37 minute(s) [18:22:31] jouncebot: next [18:22:33] In 90 hour(s) and 37 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1300) [18:23:25] I see some deadlocks [18:24:03] There are also some deadlocks between 17:38 and 17:58 [18:25:56] db1031 just had a large spike in QPS but i guess that is unrelated? [18:27:26] jynus: do you actually see writes coming in? [18:28:59] 06Operations: Clean up wikimedia's apt repo - https://phabricator.wikimedia.org/T164521#3236503 (10Paladox) [18:30:40] (03PS1) 10Cmjohnson: Removing all dns entries for decom servers boron,barium,db1025,lutetium and silicon [dns] - 10https://gerrit.wikimedia.org/r/351917 [18:30:51] jeff_green please review that ^^ [18:31:42] looking [18:33:03] (03CR) 10Jgreen: [C: 031] Removing all dns entries for decom servers boron,barium,db1025,lutetium and silicon [dns] - 10https://gerrit.wikimedia.org/r/351917 (owner: 10Cmjohnson) [18:33:32] (03CR) 10Cmjohnson: [C: 032] Removing all dns entries for decom servers boron,barium,db1025,lutetium and silicon [dns] - 10https://gerrit.wikimedia.org/r/351917 (owner: 10Cmjohnson) [18:42:33] 06Operations, 10ops-eqiad: Decommission wmf3096 - https://phabricator.wikimedia.org/T147860#3236566 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson removed ssds to be separately. 
removed switch port information, removed from rack and updated racktables [18:51:22] (03PS1) 10Jdlrobson: Clean up inappropriate usages of wmg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351922 (https://phabricator.wikimedia.org/T151891) [18:54:59] (03PS1) 10Addshore: wgCognateReadOnly false for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351923 (https://phabricator.wikimedia.org/T164407) [18:55:12] jynus: ^^ [18:57:40] (03CR) 10Addshore: [C: 032] wgCognateReadOnly false for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351923 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [18:59:20] (03Merged) 10jenkins-bot: wgCognateReadOnly false for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351923 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [18:59:29] (03CR) 10jenkins-bot: wgCognateReadOnly false for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351923 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [18:59:55] syncing [19:01:50] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T164407 [[gerrit:351923|wgCognateReadOnly false for medium wikis]] (duration: 00m 39s) [19:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:58] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407 [19:11:05] 06Operations, 10Gerrit, 06Release-Engineering-Team, 10hardware-requests, 13Patch-For-Review: Requesting 1 spare misc box for Gerrit in codfw - https://phabricator.wikimedia.org/T148187#3236674 (10demon) [19:11:07] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#3236671 (10demon) 05Open>03Resolved Gerrit running on `gerrit2001.wikimedia.org` in codfw. 
Git data is being replicated just fine. [19:12:52] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Build warm slave for Gerrit in Dallas - https://phabricator.wikimedia.org/T148186#3236675 (10demon) 05Open>03Resolved a:03demon Spare is running in Dallas, data is being replicated in real time so I think we're warm. Only improv... [19:14:24] ^ :)) gerrit2001 is now in actual use and we have replication [19:14:29] sweet, eh [19:14:33] :) [19:18:27] 06Operations, 06Labs, 10wikitech.wikimedia.org, 07HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#3236727 (10Jdforrester-WMF) [19:26:10] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#3236823 (10Dzahn) [19:43:11] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:43:32] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:44:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:49:31] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:50:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:51:12] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:01:18] there was an esams 5xx spike, very narrow around 19:39 UTC [20:01:31] (the msgs above are the delay report -> recover for it) [20:01:45] it hit all clusters proportionally, likely an actual network blip [20:04:48] 06Operations: Clean up wikimedia's apt repo - https://phabricator.wikimedia.org/T164521#3236942 (10Dzahn) precise and lucid are removed from APT config for all purposes: 
https://gerrit.wikimedia.org/r/#/c/345838/ the only thing that isn't happening is that files are actively purged from reprepros database and... [20:06:27] !log maxsem@tin Synchronized php-1.29.0-wmf.21/extensions/JsonConfig: https://gerrit.wikimedia.org/r/#/c/351749/ (duration: 00m 40s) [20:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:18] (03PS7) 10Dzahn: icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572 [20:23:11] RECOVERY - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on baham is OK: No errors detected [20:23:22] RECOVERY - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on eeden is OK: No errors detected [20:24:01] RECOVERY - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on cp1008 is OK: No errors detected [20:24:01] RECOVERY - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on radon is OK: No errors detected [20:36:50] 06Operations, 10LDAP-Access-Requests, 06Labs, 10Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#973669 (10bd808) >>! In T86668#3234234, @hashar wrote: > There are only two account in LDAP with shells not being `/bin/bash`: > ``` > $ ldapsear... [21:12:15] (03CR) 10Ottomata: [C: 031] "One nit, +1 otherwise." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119) (owner: 10Milimetric) [21:15:31] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] [21:34:04] (03PS2) 10Milimetric: Sqoop using the pre-generated orm jar [puppet] - 10https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119) [21:34:08] (03CR) 10Milimetric: Sqoop using the pre-generated orm jar (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119) (owner: 10Milimetric) [21:34:24] (03CR) 10Milimetric: "sure, I can deploy with Luca Monday" [puppet] - 10https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119) (owner: 10Milimetric) [21:38:25] (03PS1) 10Dzahn: add passwords::icinga with notsecret-fake password [labs/private] - 10https://gerrit.wikimedia.org/r/352043 [21:39:35] (03CR) 10Dzahn: [V: 032 C: 032] add passwords::icinga with notsecret-fake password [labs/private] - 10https://gerrit.wikimedia.org/r/352043 (owner: 10Dzahn) [21:48:57] (03PS1) 10RobH: return tempdb2001 to spares [puppet] - 10https://gerrit.wikimedia.org/r/352044 [21:50:19] (03PS1) 10RobH: return tempdb2001 to spares [dns] - 10https://gerrit.wikimedia.org/r/352045 [21:52:36] (03CR) 10RobH: [C: 032] return tempdb2001 to spares [puppet] - 10https://gerrit.wikimedia.org/r/352044 (owner: 10RobH) [21:53:02] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3237193 (10RobH) [21:59:36] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3237222 (10RobH) [22:00:35] 06Operations: Allow Qualtrics to send @wikimedia.org emails using an SPF record or an SMTP 
relay - https://phabricator.wikimedia.org/T164424#3237223 (10bbogaert) Hi @faidon , I found a workaround to avert our security concerns, and use gmail smtp, because Neil did not need the "reply-to" address to be qualtrics... [22:08:47] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/6300/" [puppet] - 10https://gerrit.wikimedia.org/r/351572 (owner: 10Dzahn) [22:13:47] (03CR) 10Dzahn: [C: 032] icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572 (owner: 10Dzahn) [22:13:57] (03PS8) 10Dzahn: icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572 [22:26:47] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3237246 (10RobH) [22:26:55] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3236011 (10RobH) a:05RobH>03Papaul [22:27:06] 06Operations, 10ops-codfw, 10hardware-requests: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3236011 (10RobH) [22:31:04] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [22:37:36] (03PS1) 10Dzahn: icinga: adjust nsca template from "nagios" to "icinga" [puppet] - 10https://gerrit.wikimedia.org/r/352051 [22:43:02] (03PS2) 10Dzahn: icinga: adjust nsca template from "nagios" to "icinga" [puppet] - 10https://gerrit.wikimedia.org/r/352051 [22:43:13] (03CR) 10Dzahn: [C: 032] icinga: adjust nsca template from "nagios" to "icinga" [puppet] - 10https://gerrit.wikimedia.org/r/352051 (owner: 10Dzahn) [22:43:57] (03CR) 10Dzahn: "follow-up https://gerrit.wikimedia.org/r/#/c/352051/" [puppet] - 10https://gerrit.wikimedia.org/r/351572 (owner: 10Dzahn)