[00:48:21] RECOVERY - LibreNMS HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8511 bytes in 0.063 second response time [00:48:33] mutante ^^ [00:48:54] 10Operations, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3396574 (10Dzahn) blocker for librenms: https://github.com/librenms/librenms/issues/6818 "php-net-ipv4 not available any more on current debian and ubuntu" [00:48:54] :) [00:48:54] (03PS10) 10Krinkle: Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) [00:49:01] paladox: :) yea [00:49:02] (03CR) 10Krinkle: [C: 032] Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [00:49:04] (03CR) 10Krinkle: [C: 032] "Beta-only testing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [00:49:23] paladox: just need to make it official [00:49:28] not manual :) [00:49:30] Yep :) [00:49:37] upload to stretch :) [00:50:48] (03Merged) 10jenkins-bot: Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [00:50:50] (03CR) 10jenkins-bot: Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [00:53:20] PROBLEM - LibreNMS HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 285 bytes in 0.011 second response time [00:54:04] !log APT - importing php-net-ipv4 to stretch (for librenms) T159756 [00:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:15] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [01:00:10] (03CR) 10Dzahn: [C: 032] librenms: add php-net-ipv4 package also on stretch [puppet] - 10https://gerrit.wikimedia.org/r/362605 (owner: 10Dzahn) [01:02:10] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1498870923 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8915136 keys, up 2 minutes 2 seconds - replication_delay is 1498870923 [01:02:20] RECOVERY - LibreNMS HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8511 bytes in 0.108 second response time [01:02:20] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481 [01:02:30] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [01:02:30] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1498870949 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8912015 keys, up 2 minutes 27 seconds - replication_delay is 1498870949 [01:02:40] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:02:40] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1498870956 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8817481 keys, up 2 minutes 34 seconds - replication_delay is 1498870956 [01:03:10] RECOVERY - Check health of redis 
instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8910887 keys, up 3 minutes 1 seconds - replication_delay is 0 [01:03:20] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4202929 keys, up 3 minutes 13 seconds - replication_delay is 0 [01:03:20] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4203899 keys, up 3 minutes 15 seconds - replication_delay is 0 [01:03:30] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8906769 keys, up 3 minutes 27 seconds - replication_delay is 0 [01:03:40] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8904818 keys, up 3 minutes 35 seconds - replication_delay is 0 [01:03:40] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8813358 keys, up 3 minutes 35 seconds - replication_delay is 0 [01:16:44] (03PS1) 10Smalyshev: Add more units for conversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362606 (https://phabricator.wikimedia.org/T168582) [03:58:19] 10Operations, 10ops-codfw, 10ops-eqiad: Unresponsive iDRACs - https://phabricator.wikimedia.org/T169360#3396732 (10Peachey88) [04:15:10] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=363.60 Read Requests/Sec=3815.60 Write Requests/Sec=8.10 KBytes Read/Sec=29641.40 KBytes_Written/Sec=3590.40 [04:24:20] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.80 Read Requests/Sec=8.10 Write Requests/Sec=1.00 KBytes Read/Sec=37.60 KBytes_Written/Sec=25.60 [04:29:24] 10Operations, 10Mobile-Content-Service, 10Parsoid, 10Services, and 2 others: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3396733 (10bearND) I'm going to ask @Mholloway to see if he could merge and deploy https:/... [07:08:47] (03CR) 10Volans: "@_joe_ I'll start my reply to https://gerrit.wikimedia.org/r/#/c/361274/3 here to split it in 2." 
(031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [07:29:04] (03CR) 10Volans: "Features of the test collection/run that required some code integration are listed inline, but in this case it seemed to me that the gain " (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [07:39:19] (03CR) 10Volans: "@_joe_: to reply to your comment on pytest, please see first my replies in the previous CR of the series, where it is introduced as test r" [software/cumin] - 10https://gerrit.wikimedia.org/r/361274 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [08:02:37] (03CR) 10Volans: "@joe: If I'm not mistaken this CR adds 37 "pylint: " comments of which 18 in the tests/ directory, in over 3200 lines of code (excluded em" [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [08:27:07] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3396823 (10Framawiki) [08:46:20] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] [08:46:41] ^ that's expected [08:47:13] I think. [09:05:12] (I hope that doesn't alert anyone) [09:05:52] It should probably catch up with the backlog in an hour or two [09:47:20] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:48:20] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [09:50:20] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [09:51:20] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:52:40] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:52:50] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [09:55:10] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
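Aside on the cassandra-a / cassandra-b alerts that start above: the restbase hosts run two Cassandra instances per box, each with its own systemd unit and its own CQL (9042) and internode SSL (7001) listener, which is what the paired "service", "CQL" and "SSL" checks are probing. A minimal sketch of how those same facts could be gathered by hand on one of the affected hosts is below; the unit name and address come from the alerts, everything else (and the script itself) is illustrative, not the actual Icinga/NRPE plugin.

```python
#!/usr/bin/env python3
"""Rough local health probe for one multi-instance Cassandra unit (illustrative only)."""
import socket
import subprocess

UNIT = "cassandra-b"                 # unit name as it appears in the alerts above
CQL_ADDR = ("10.192.16.162", 9042)   # instance address/port taken from the restbase2001 alerts

def unit_state(unit):
    # `systemctl is-active` prints e.g. "active" or "failed" (non-zero exit when not active)
    out = subprocess.run(["systemctl", "is-active", unit], capture_output=True, text=True)
    return out.stdout.strip()

def cql_reachable(addr, timeout=10):
    # Mirrors the TCP part of the "cassandra-b CQL" check: can the port be opened at all?
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    state = unit_state(UNIT)
    print(f"{UNIT}: systemd state={state}, CQL port open={cql_reachable(CQL_ADDR)}")
    if state != "active":
        # Show the most recent journal lines for the failed unit to see why it died
        subprocess.run(["journalctl", "-u", UNIT, "-n", "20", "--no-pager"])
```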
[09:55:41] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [10:00:10] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [10:00:50] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [10:01:30] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [10:02:21] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [10:02:30] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [10:03:20] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [10:04:20] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [10:05:10] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 73 days) [10:05:10] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [10:08:30] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [10:17:30] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:17:40] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [10:18:20] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:18:30] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [10:24:20] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [10:24:30] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [10:25:30] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [10:25:40] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [10:35:55] is it normal that our servers do not have synced time, so different servers show different time? [10:38:40] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [10:39:30] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:39:30] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [10:40:20] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:42:50] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [10:43:20] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
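On the 10:35:55 question about servers showing different times: a quick way to eyeball clock skew is to ask each host for its epoch time over SSH and compare it against the local clock. A minimal sketch of that idea follows; the hostnames are placeholders and this is not how production NTP monitoring is actually done.

```python
#!/usr/bin/env python3
"""Rough clock-skew comparison across hosts (illustrative; hostnames are placeholders)."""
import subprocess
import time

HOSTS = ["host1.example.org", "host2.example.org"]   # hypothetical hosts to compare

def remote_epoch(host):
    # `date +%s.%N` prints seconds.nanoseconds since the epoch on the remote side
    out = subprocess.run(["ssh", host, "date", "+%s.%N"],
                         capture_output=True, text=True, timeout=15, check=True)
    return float(out.stdout.strip())

if __name__ == "__main__":
    for host in HOSTS:
        before = time.time()
        remote = remote_epoch(host)
        after = time.time()
        # Use the midpoint of the request to roughly cancel out network latency
        skew = remote - (before + after) / 2
        print(f"{host}: approx skew {skew:+.3f}s")
```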
[10:43:30] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:43:30] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.144 and port 9042: Connection refused [10:53:34] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [10:54:30] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [10:54:40] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [10:54:50] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [11:01:00] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [11:01:20] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [11:02:30] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [11:02:30] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [11:03:30] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:03:30] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:03:31] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:03:40] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [11:10:30] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.144 and port 9042: Connection refused [11:10:30] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:11:30] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
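The "cassandra-b SSL ... Certificate ... valid until ... (expires in N days)" recoveries above report certificate expiry for the instance's port-7001 listener. A rough sketch of fetching a peer certificate and computing days to expiry is below, assuming the third-party `cryptography` package; the address and port are taken from the restbase2012 alerts, verification is deliberately skipped just to read the certificate, and the handshake may still fail if the server insists on a client certificate. This is not the production check_ssl plugin.

```python
#!/usr/bin/env python3
"""Report days until a TLS certificate expires (illustrative; needs `cryptography`)."""
import socket
import ssl
from datetime import datetime, timezone

from cryptography import x509   # third-party; pip install cryptography

HOST, PORT = "10.192.48.69", 7001   # address/port taken from the restbase2012 SSL alerts

def days_until_expiry(host, port, timeout=10):
    # Handshake without verifying the chain, only to fetch the peer certificate
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock) as tls:
            der = tls.getpeercert(binary_form=True)
    cert = x509.load_der_x509_certificate(der)
    remaining = cert.not_valid_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)
    return remaining.days

if __name__ == "__main__":
    print(f"{HOST}:{PORT} certificate expires in {days_until_expiry(HOST, PORT)} days")
```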
[11:12:00] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:18:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:18:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:19:20] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:24:30] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [11:24:40] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [11:24:40] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [11:24:50] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [11:25:39] 10Operations, 10Puppet, 10Labs: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885#3279509 (10Kelson) Yes, please. My multiple labs instances run out of space in /var and this basically blocks "dpkg". As non-puppet expert, this took me a bit of time to figure ou... [11:26:21] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:26:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:27:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:28:00] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:30:00] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [11:30:30] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [11:31:30] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [11:31:31] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [11:40:30] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
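On the T165885 comment above ("Create a cron to clean clientbucket every day or hour" because /var fills up on labs instances): the request is essentially a retention job over the Puppet agent's clientbucket directory. A minimal sketch of that cleanup is below; the clientbucket path and the 14-day retention are assumptions, and the real fix would be a puppetized cron rather than this standalone script.

```python
#!/usr/bin/env python3
"""Delete old Puppet clientbucket files (illustrative sketch of the requested cron job)."""
import os
import time

CLIENTBUCKET = "/var/lib/puppet/clientbucket"   # assumed default path; verify on the instance
MAX_AGE_DAYS = 14                                # assumed retention period

def clean(root, max_age_days):
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
                    removed += 1
            except OSError:
                pass   # file vanished or permission issue; skip it
    return removed

if __name__ == "__main__":
    print(f"removed {clean(CLIENTBUCKET, MAX_AGE_DAYS)} old clientbucket files")
```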
[11:40:40] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:40:40] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:40:41] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [11:50:50] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [11:52:40] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.144 and port 9042: Connection refused [11:53:10] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:53:31] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:54:30] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [11:54:40] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [11:54:40] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [11:54:52] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [11:58:50] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [11:59:30] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:59:40] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:59:41] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [12:00:30] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [12:01:10] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [12:02:01] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [12:02:40] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [12:18:30] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [12:18:41] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [12:18:50] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [12:18:50] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [12:21:40] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [12:21:41] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.144 and port 9042: Connection refused [12:22:40] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:22:50] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [12:23:31] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:23:40] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [12:24:10] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [12:24:30] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:24:40] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [12:24:40] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [12:25:40] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [12:25:50] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.041 second response time on 10.192.48.69 port 9042 [12:28:40] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:28:50] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [12:29:40] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
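The "Check systemd state" messages flipping between "running: The system is fully operational" and "degraded: ... one or more units failed" correspond to systemd's overall manager state. A rough stand-in for that kind of check is sketched below; it is not the actual NRPE plugin, just the same two systemctl queries wrapped in Nagios-style output.

```python
#!/usr/bin/env python3
"""Rough stand-in for a "Check systemd state" style probe (not the real plugin)."""
import subprocess
import sys

def systemd_state():
    # Prints the overall manager state, e.g. "running" or "degraded";
    # exits non-zero for anything other than "running".
    out = subprocess.run(["systemctl", "is-system-running"], capture_output=True, text=True)
    return out.stdout.strip()

def failed_units():
    out = subprocess.run(["systemctl", "--failed", "--no-legend", "--plain"],
                         capture_output=True, text=True)
    return [line.split()[0] for line in out.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    state = systemd_state()
    if state == "running":
        print("OK - running: The system is fully operational")
        sys.exit(0)
    print(f"CRITICAL - {state}: failed units: {', '.join(failed_units()) or 'none listed'}")
    sys.exit(2)
```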
[12:29:50] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [12:30:40] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [12:31:10] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [12:32:10] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [12:32:40] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [12:44:41] Hmm, those have been going off all morning ^^ [12:45:20] yep this is a known problem with restbase in codfw, we are working on it :) [12:49:30] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [12:54:43] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [12:54:50] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [12:54:50] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [12:54:50] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [12:54:50] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:00] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [12:57:20] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [12:57:40] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:58:50] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:58:50] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [12:59:50] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [13:00:20] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [13:00:30] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [13:00:39] !log ppchelko@tin Started deploy [restbase/deploy@8ea07d6]: Manual blacklist for russian wiki [13:00:40] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
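The labnet1001 alert above ("PROCS CRITICAL: 0 processes with command name python, args nova-fullstack") is a plain process-count check: it wants at least one python process whose arguments mention nova-fullstack. A small sketch of the same idea using the third-party psutil package follows; the minimum-process threshold is an assumption and this is not the actual check_procs plugin.

```python
#!/usr/bin/env python3
"""Count processes matching a command name and argument substring (illustrative check)."""
import sys

import psutil   # third-party; pip install psutil

COMMAND = "python"                  # command name the alert looks for
ARG_SUBSTRING = "nova-fullstack"    # argument the alert looks for
MIN_PROCS = 1                       # assumed threshold

def matching_processes():
    matches = []
    for proc in psutil.process_iter(["name", "cmdline"]):
        try:
            name = proc.info["name"] or ""
            cmdline = " ".join(proc.info["cmdline"] or [])
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        if name.startswith(COMMAND) and ARG_SUBSTRING in cmdline:
            matches.append(proc.pid)
    return matches

if __name__ == "__main__":
    pids = matching_processes()
    if len(pids) >= MIN_PROCS:
        print(f"PROCS OK: {len(pids)} process(es) with args {ARG_SUBSTRING}: {pids}")
        sys.exit(0)
    print(f"PROCS CRITICAL: {len(pids)} processes with command name {COMMAND}, args {ARG_SUBSTRING}")
    sys.exit(2)
```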
[13:00:40] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [13:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:47] elukey thanks :) [13:02:31] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [13:02:40] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [13:05:11] PROBLEM - Restbase root url on restbase2007 is CRITICAL: connect to address 10.192.16.175 and port 7231: Connection refused [13:05:51] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:06:10] RECOVERY - Restbase root url on restbase2007 is OK: HTTP OK: HTTP/1.1 200 - 15580 bytes in 0.092 second response time [13:06:40] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:07:30] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [13:08:20] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [13:08:38] !log ppchelko@tin Finished deploy [restbase/deploy@8ea07d6]: Manual blacklist for russian wiki (duration: 07m 59s) [13:08:40] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:10] running puppet on the restbase nodes down [13:10:20] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [13:10:41] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [13:11:40] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [13:11:40] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [13:11:50] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [13:11:50] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [13:11:50] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [13:12:00] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [13:13:28] all right we should be good now, let's see if cassandra likes it [13:20:32] * elukey off! 
[13:30:30] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [13:41:30] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [13:56:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:58:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:59:30] 10Operations, 10Mobile-Content-Service, 10Parsoid, 10Services, and 2 others: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3398184 (10Mholloway) I am technically off Mon-Tues but happy to pop in and merge/deploy/v... [14:05:01] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:05:30] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:05:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:05:52] 10Operations, 10Mobile-Content-Service, 10Parsoid, 10Services, and 2 others: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3398193 (10Mholloway) Er, scratch that, I'm on the road through Tuesday and only have my p... [14:20:29] (03PS3) 10Paladox: Replace TEMPLATE_CONTEXT_PROCESSORS with TEMPLATES [software/servermon] - 10https://gerrit.wikimedia.org/r/362600 [14:30:30] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [14:41:50] 10Operations, 10Mobile-Content-Service, 10Parsoid, 10Services, and 2 others: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3393909 (10ssastry) We'll look at T169293 this week. [15:58:50] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2050043 [16:02:19] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398300 (10alanajjar) [16:12:50] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:17:50] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2029845 [16:19:01] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398337 (10Framawiki) [16:19:04] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398340 (10Framawiki) [16:19:45] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3396823 (10Framawiki) From the original task: The request is accepted so we would appreciate if you could reserve for us a window so this rename can be performed... [16:19:53] marostegui: you around for a large global rename? 
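The "HTTP 5xx reqs/min on graphite1001" alerts fire when more than a given fraction of recent datapoints for a metric sit above a threshold (e.g. "11.11% of data above the critical threshold [1000.0]"). A sketch of that logic against Graphite's JSON render API is below; the Graphite base URL and metric name are placeholders, not the production ones, and only the threshold value is taken from the alert text.

```python
#!/usr/bin/env python3
"""Fraction-over-threshold check against Graphite's render API (URL/metric are placeholders)."""
import json
import urllib.request

GRAPHITE = "https://graphite.example.org"   # placeholder Graphite base URL
TARGET = "some.site.5xx.reqs_per_min"       # placeholder metric name
THRESHOLD = 1000.0                          # critical threshold from the alert text
CRITICAL_FRACTION = 0.10                    # assumed: alert if >10% of points exceed it

def fraction_above(base, target, threshold, minutes=10):
    url = f"{base}/render?target={target}&from=-{minutes}min&format=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        series = json.load(resp)
    # Graphite returns [{"target": ..., "datapoints": [[value, timestamp], ...]}]
    points = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not points:
        return 0.0
    return sum(1 for v in points if v > threshold) / len(points)

if __name__ == "__main__":
    frac = fraction_above(GRAPHITE, TARGET, THRESHOLD)
    status = "CRITICAL" if frac > CRITICAL_FRACTION else "OK"
    print(f"{status}: {frac:.2%} of data above the threshold [{THRESHOLD}]")
```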
[16:35:39] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398344 (10Aklapper) a:05Marostegui>03None @alanajjar: For future reference, please do not set an assignee for a task without previous agreement of the... [16:38:50] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 15160 [16:40:50] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:03:10] PROBLEM - BGP status on cr1-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect, AS6939/IPv4: Connect [17:05:10] RECOVERY - BGP status on cr1-eqord is OK: BGP OK - up: 54, down: 0, shutdown: 2 [17:11:50] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 30 probes of 298 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:16:50] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 16 probes of 298 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:27:06] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398373 (10alanajjar) @Framawiki Thanks, and sorry for open a new task, can you please add me as a subscriber in every task you open like this one (i.e. GR tas... [17:28:16] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398374 (10alanajjar) >>! In T169437#3398344, @Aklapper wrote: > @alanajjar: For future reference, please do not set an assignee for a task without previous... [17:46:24] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398377 (10MarcoAurelio) @alanajjar It'd be easier for you to create a Herald rule that emails you on every task that matches a set of defined criteria :) [17:47:17] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3396823 (10Urbanecm) @alanajjar I strongly advise you to create a Herald rule. You can do it at https://phabricator.wikimedia.org/herald/create/. If you have a qu... [17:47:46] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398381 (10Urbanecm) You were faster @MarcoAurelio :). [17:51:20] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398300 (10Urbanecm) The Assigned To field is to indicate who is currently working at this task. Just wait till somebody would claim the task theyself. He i... 
[17:55:31] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398397 (10MarcoAurelio) [17:55:34] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398398 (10MarcoAurelio) [17:55:37] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Cardamom → GreenC: supervision needed - https://phabricator.wikimedia.org/T168776#3398399 (10MarcoAurelio) [17:55:40] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Smuconlaw → Sgconlaw: supervision needed - https://phabricator.wikimedia.org/T168109#3398400 (10MarcoAurelio) [17:55:43] 10Operations, 10DBA, 10Wikimedia-Site-requests: Renaming Neoalpha: supervision needed - https://phabricator.wikimedia.org/T167597#3398401 (10MarcoAurelio) [17:55:46] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3398403 (10MarcoAurelio) [17:55:49] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3398402 (10MarcoAurelio) [17:55:57] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Rename user TextworkerBot to VladiBot on ru.wiki - https://phabricator.wikimedia.org/T153602#3398407 (10MarcoAurelio) [17:56:00] 10Operations, 10Wikimedia-General-or-Unknown: Global-rename which needs sysadmin's attention - https://phabricator.wikimedia.org/T73384#3398408 (10MarcoAurelio) [17:57:36] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398414 (10MarcoAurelio) I know it's old-style and old-fashioned but I've created T169440 so people can subscribe and check how many global renames in need of sup... [18:07:50] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [19:54:20] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:56:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:56:30] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 22 probes of 436 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [20:01:30] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 436 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [20:02:20] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:04:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:04:38] (03PS3) 10BryanDavis: tools: Fix maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/360779 (https://phabricator.wikimedia.org/T165875) [20:05:54] (03CR) 10BryanDavis: [C: 031] "Tested manually on tools-k8s-master-01. The current script is broken there." 
[puppet] - 10https://gerrit.wikimedia.org/r/360779 (https://phabricator.wikimedia.org/T165875) (owner: 10BryanDavis) [21:12:50] PROBLEM - HHVM rendering on mw1298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:40] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 75883 bytes in 0.156 second response time [21:52:20] PROBLEM - MegaRAID on db1066 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [21:52:21] ACKNOWLEDGEMENT - MegaRAID on db1066 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T169448 [21:52:24] 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T169448#3398609 (10ops-monitoring-bot) [22:16:50] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89973.56 seconds [22:57:59] 10Operations, 10DBA, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3398635 (10MF-Warburg) All pages have been deleted.
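The MegaRAID alert on db1066 above ("1 failed LD(s) (Degraded)") means one logical drive on the controller has dropped out of the Optimal state. A rough way to reproduce that finding from the host itself is to parse MegaCli's logical-drive listing, as sketched below; the binary path and the exact output wording vary between controller generations and packaging, so treat this as an assumption-laden sketch rather than the RAID handler that filed T169448.

```python
#!/usr/bin/env python3
"""Summarise MegaRAID logical-drive states via MegaCli (binary path/output format may vary)."""
import re
import subprocess
from collections import Counter

MEGACLI = "/usr/sbin/megacli"   # assumed path; sometimes installed as MegaCli64

def logical_drive_states():
    # "-LDInfo -Lall -aAll" lists every logical drive on every adapter
    out = subprocess.run([MEGACLI, "-LDInfo", "-Lall", "-aAll", "-NoLog"],
                         capture_output=True, text=True, check=True)
    # Lines look like "State               : Optimal" (or "Degraded", "Offline", ...)
    return [m.group(1) for m in re.finditer(r"^State\s*:\s*(\w+)", out.stdout, re.M)]

if __name__ == "__main__":
    counts = Counter(logical_drive_states())
    bad = sum(n for state, n in counts.items() if state != "Optimal")
    status = "CRITICAL" if bad else "OK"
    print(f"{status}: {bad} failed LD(s) {dict(counts)}")
```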