[00:48:21] RECOVERY - LibreNMS HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8511 bytes in 0.063 second response time [00:48:33] mutante ^^ [00:48:54] 10Operations, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3396574 (10Dzahn) blocker for librenms: https://github.com/librenms/librenms/issues/6818 "php-net-ipv4 not available any more on current debian and ubuntu" [00:48:54] :) [00:48:54] (03PS10) 10Krinkle: Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) [00:49:01] paladox: :) yea [00:49:02] (03CR) 10Krinkle: [C: 032] Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [00:49:04] (03CR) 10Krinkle: [C: 032] "Beta-only testing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [00:49:23] paladox: just need to make it official [00:49:28] not manual :) [00:49:30] Yep :) [00:49:37] upload to stretch :) [00:50:48] (03Merged) 10jenkins-bot: Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [00:50:50] (03CR) 10jenkins-bot: Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [00:53:20] PROBLEM - LibreNMS HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 285 bytes in 0.011 second response time [00:54:04] !log APT - importing php-net-ipv4 to stretch (for librenms) T159756 [00:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:15] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [01:00:10] (03CR) 10Dzahn: [C: 032] librenms: add php-net-ipv4 package also on stretch [puppet] - 10https://gerrit.wikimedia.org/r/362605 (owner: 10Dzahn) [01:02:10] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1498870923 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8915136 keys, up 2 minutes 2 seconds - replication_delay is 1498870923 [01:02:20] RECOVERY - LibreNMS HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8511 bytes in 0.108 second response time [01:02:20] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481 [01:02:30] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [01:02:30] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1498870949 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8912015 keys, up 2 minutes 27 seconds - replication_delay is 1498870949 [01:02:40] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:02:40] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1498870956 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8817481 keys, up 2 minutes 34 seconds - replication_delay is 1498870956 [01:03:10] RECOVERY - Check health of redis 
instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8910887 keys, up 3 minutes 1 seconds - replication_delay is 0 [01:03:20] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4202929 keys, up 3 minutes 13 seconds - replication_delay is 0 [01:03:20] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4203899 keys, up 3 minutes 15 seconds - replication_delay is 0 [01:03:30] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8906769 keys, up 3 minutes 27 seconds - replication_delay is 0 [01:03:40] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8904818 keys, up 3 minutes 35 seconds - replication_delay is 0 [01:03:40] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8813358 keys, up 3 minutes 35 seconds - replication_delay is 0 [01:16:44] (03PS1) 10Smalyshev: Add more units for conversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362606 (https://phabricator.wikimedia.org/T168582) [03:58:19] 10Operations, 10ops-codfw, 10ops-eqiad: Unresponsive iDRACs - https://phabricator.wikimedia.org/T169360#3396732 (10Peachey88) [04:15:10] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=363.60 Read Requests/Sec=3815.60 Write Requests/Sec=8.10 KBytes Read/Sec=29641.40 KBytes_Written/Sec=3590.40 [04:24:20] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.80 Read Requests/Sec=8.10 Write Requests/Sec=1.00 KBytes Read/Sec=37.60 KBytes_Written/Sec=25.60 [04:29:24] 10Operations, 10Mobile-Content-Service, 10Parsoid, 10Services, and 2 others: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3396733 (10bearND) I'm going to ask @Mholloway to see if he could merge and deploy https:/... [07:08:47] (03CR) 10Volans: "@_joe_ I'll start my reply to https://gerrit.wikimedia.org/r/#/c/361274/3 here to split it in 2." 
(031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [07:29:04] (03CR) 10Volans: "Features of the test collection/run that required some code integration are listed inline, but in this case it seemed to me that the gain " (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [07:39:19] (03CR) 10Volans: "@_joe_: to reply to your comment on pytest, please see first my replies in the previous CR of the series, where it is introduced as test r" [software/cumin] - 10https://gerrit.wikimedia.org/r/361274 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [08:02:37] (03CR) 10Volans: "@joe: If I'm not mistaken this CR adds 37 "pylint: " comments of which 18 in the tests/ directory, in over 3200 lines of code (excluded em" [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [08:27:07] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3396823 (10Framawiki) [08:46:20] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] [08:46:41] ^ that's expected [08:47:13] I think. [09:05:12] (I hope that doesn't alert anyone) [09:05:52] It should probably catch up with the backlog in an hour or two [09:47:20] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:48:20] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [09:50:20] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [09:51:20] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:52:40] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:52:50] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [09:55:10] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
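Aside on the cassandra-a / cassandra-b alerts that start above: the restbase hosts run two Cassandra instances per box, each with its own systemd unit and its own CQL (9042) and internode SSL (7001) listener, which is what the paired "service", "CQL" and "SSL" checks are probing. A minimal sketch of how those same facts could be gathered by hand on one of the affected hosts is below; the unit name and address come from the alerts, everything else (and the script itself) is illustrative, not the actual Icinga/NRPE plugin.

```python
#!/usr/bin/env python3
"""Rough local health probe for one multi-instance Cassandra unit (illustrative only)."""
import socket
import subprocess

UNIT = "cassandra-b"                 # unit name as it appears in the alerts above
CQL_ADDR = ("10.192.16.162", 9042)   # instance address/port taken from the restbase2001 alerts

def unit_state(unit):
    # `systemctl is-active` prints e.g. "active" or "failed" (non-zero exit when not active)
    out = subprocess.run(["systemctl", "is-active", unit], capture_output=True, text=True)
    return out.stdout.strip()

def cql_reachable(addr, timeout=10):
    # Mirrors the TCP part of the "cassandra-b CQL" check: can the port be opened at all?
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    state = unit_state(UNIT)
    print(f"{UNIT}: systemd state={state}, CQL port open={cql_reachable(CQL_ADDR)}")
    if state != "active":
        # Show the most recent journal lines for the failed unit to see why it died
        subprocess.run(["journalctl", "-u", UNIT, "-n", "20", "--no-pager"])
```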
[09:55:41] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [10:00:10] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [10:00:50] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [10:01:30] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [10:02:21] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [10:02:30] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [10:03:20] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [10:04:20] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [10:05:10] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 73 days) [10:05:10] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [10:08:30] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [10:17:30] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:17:40] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [10:18:20] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:18:30] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [10:24:20] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [10:24:30] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [10:25:30] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [10:25:40] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [10:35:55] is it normal that our servers do not have synced time, so different servers show different time? [10:38:40] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [10:39:30] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:39:30] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [10:40:20] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:42:50] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [10:43:20] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
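On the 10:35:55 question about servers showing different times: a quick way to eyeball clock skew is to ask each host for its epoch time over SSH and compare it against the local clock. A minimal sketch of that idea follows; the hostnames are placeholders and this is not how production NTP monitoring is actually done.

```python
#!/usr/bin/env python3
"""Rough clock-skew comparison across hosts (illustrative; hostnames are placeholders)."""
import subprocess
import time

HOSTS = ["host1.example.org", "host2.example.org"]   # hypothetical hosts to compare

def remote_epoch(host):
    # `date +%s.%N` prints seconds.nanoseconds since the epoch on the remote side
    out = subprocess.run(["ssh", host, "date", "+%s.%N"],
                         capture_output=True, text=True, timeout=15, check=True)
    return float(out.stdout.strip())

if __name__ == "__main__":
    for host in HOSTS:
        before = time.time()
        remote = remote_epoch(host)
        after = time.time()
        # Use the midpoint of the request to roughly cancel out network latency
        skew = remote - (before + after) / 2
        print(f"{host}: approx skew {skew:+.3f}s")
```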
[10:43:30] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:43:30] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.144 and port 9042: Connection refused [10:53:34] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [10:54:30] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [10:54:40] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [10:54:50] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [11:01:00] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [11:01:20] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [11:02:30] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [11:02:30] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [11:03:30] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:03:30] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:03:31] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:03:40] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [11:10:30] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.144 and port 9042: Connection refused [11:10:30] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:11:30] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
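The "cassandra-b SSL ... Certificate ... valid until ... (expires in N days)" recoveries above report certificate expiry for the instance's port-7001 listener. A rough sketch of fetching a peer certificate and computing days to expiry is below, assuming the third-party `cryptography` package; the address and port are taken from the restbase2012 alerts, verification is deliberately skipped just to read the certificate, and the handshake may still fail if the server insists on a client certificate. This is not the production check_ssl plugin.

```python
#!/usr/bin/env python3
"""Report days until a TLS certificate expires (illustrative; needs `cryptography`)."""
import socket
import ssl
from datetime import datetime, timezone

from cryptography import x509   # third-party; pip install cryptography

HOST, PORT = "10.192.48.69", 7001   # address/port taken from the restbase2012 SSL alerts

def days_until_expiry(host, port, timeout=10):
    # Handshake without verifying the chain, only to fetch the peer certificate
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock) as tls:
            der = tls.getpeercert(binary_form=True)
    cert = x509.load_der_x509_certificate(der)
    remaining = cert.not_valid_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)
    return remaining.days

if __name__ == "__main__":
    print(f"{HOST}:{PORT} certificate expires in {days_until_expiry(HOST, PORT)} days")
```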
[11:12:00] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:18:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:18:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:19:20] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:24:30] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [11:24:40] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [11:24:40] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [11:24:50] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [11:25:39] 10Operations, 10Puppet, 10Labs: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885#3279509 (10Kelson) Yes, please. My multiple labs instances run out of space in /var and this basically blocks "dpkg". As non-puppet expert, this took me a bit of time to figure ou... [11:26:21] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:26:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:27:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:28:00] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:30:00] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [11:30:30] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [11:31:30] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [11:31:31] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [11:40:30] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
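On the T165885 comment above ("Create a cron to clean clientbucket every day or hour" because /var fills up on labs instances): the request is essentially a retention job over the Puppet agent's clientbucket directory. A minimal sketch of that cleanup is below; the clientbucket path and the 14-day retention are assumptions, and the real fix would be a puppetized cron rather than this standalone script.

```python
#!/usr/bin/env python3
"""Delete old Puppet clientbucket files (illustrative sketch of the requested cron job)."""
import os
import time

CLIENTBUCKET = "/var/lib/puppet/clientbucket"   # assumed default path; verify on the instance
MAX_AGE_DAYS = 14                                # assumed retention period

def clean(root, max_age_days):
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
                    removed += 1
            except OSError:
                pass   # file vanished or permission issue; skip it
    return removed

if __name__ == "__main__":
    print(f"removed {clean(CLIENTBUCKET, MAX_AGE_DAYS)} old clientbucket files")
```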
[11:40:40] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:40:40] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:40:41] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [11:50:50] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [11:52:40] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.144 and port 9042: Connection refused [11:53:10] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:53:31] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:54:30] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [11:54:40] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [11:54:40] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [11:54:52] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [11:58:50] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [11:59:30] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:59:40] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:59:41] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [12:00:30] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [12:01:10] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [12:02:01] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [12:02:40] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [12:18:30] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [12:18:41] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [12:18:50] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [12:18:50] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [12:21:40] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [12:21:41] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.144 and port 9042: Connection refused [12:22:40] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:22:50] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [12:23:31] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:23:40] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [12:24:10] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [12:24:30] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:24:40] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [12:24:40] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [12:25:40] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [12:25:50] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.041 second response time on 10.192.48.69 port 9042 [12:28:40] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:28:50] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [12:29:40] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
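The "Check systemd state" messages flipping between "running: The system is fully operational" and "degraded: ... one or more units failed" correspond to systemd's overall manager state. A rough stand-in for that kind of check is sketched below; it is not the actual NRPE plugin, just the same two systemctl queries wrapped in Nagios-style output.

```python
#!/usr/bin/env python3
"""Rough stand-in for a "Check systemd state" style probe (not the real plugin)."""
import subprocess
import sys

def systemd_state():
    # Prints the overall manager state, e.g. "running" or "degraded";
    # exits non-zero for anything other than "running".
    out = subprocess.run(["systemctl", "is-system-running"], capture_output=True, text=True)
    return out.stdout.strip()

def failed_units():
    out = subprocess.run(["systemctl", "--failed", "--no-legend", "--plain"],
                         capture_output=True, text=True)
    return [line.split()[0] for line in out.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    state = systemd_state()
    if state == "running":
        print("OK - running: The system is fully operational")
        sys.exit(0)
    print(f"CRITICAL - {state}: failed units: {', '.join(failed_units()) or 'none listed'}")
    sys.exit(2)
```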
[12:29:50] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [12:30:40] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [12:31:10] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [12:32:10] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [12:32:40] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [12:44:41] Hmm, those have been going off all morning ^^ [12:45:20] yep this is a known problem with restbase in codfw, we are working on it :) [12:49:30] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [12:54:43] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [12:54:50] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [12:54:50] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [12:54:50] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [12:54:50] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:00] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [12:57:20] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [12:57:40] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:58:50] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:58:50] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [12:59:50] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [13:00:20] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [13:00:30] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [13:00:39] !log ppchelko@tin Started deploy [restbase/deploy@8ea07d6]: Manual blacklist for russian wiki [13:00:40] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
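The labnet1001 alert above ("PROCS CRITICAL: 0 processes with command name python, args nova-fullstack") is a plain process-count check: it wants at least one python process whose arguments mention nova-fullstack. A small sketch of the same idea using the third-party psutil package follows; the minimum-process threshold is an assumption and this is not the actual check_procs plugin.

```python
#!/usr/bin/env python3
"""Count processes matching a command name and argument substring (illustrative check)."""
import sys

import psutil   # third-party; pip install psutil

COMMAND = "python"                  # command name the alert looks for
ARG_SUBSTRING = "nova-fullstack"    # argument the alert looks for
MIN_PROCS = 1                       # assumed threshold

def matching_processes():
    matches = []
    for proc in psutil.process_iter(["name", "cmdline"]):
        try:
            name = proc.info["name"] or ""
            cmdline = " ".join(proc.info["cmdline"] or [])
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        if name.startswith(COMMAND) and ARG_SUBSTRING in cmdline:
            matches.append(proc.pid)
    return matches

if __name__ == "__main__":
    pids = matching_processes()
    if len(pids) >= MIN_PROCS:
        print(f"PROCS OK: {len(pids)} process(es) with args {ARG_SUBSTRING}: {pids}")
        sys.exit(0)
    print(f"PROCS CRITICAL: {len(pids)} processes with command name {COMMAND}, args {ARG_SUBSTRING}")
    sys.exit(2)
```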
[13:00:40] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [13:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:47] elukey thanks :) [13:02:31] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [13:02:40] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [13:05:11] PROBLEM - Restbase root url on restbase2007 is CRITICAL: connect to address 10.192.16.175 and port 7231: Connection refused [13:05:51] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:06:10] RECOVERY - Restbase root url on restbase2007 is OK: HTTP OK: HTTP/1.1 200 - 15580 bytes in 0.092 second response time [13:06:40] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:07:30] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [13:08:20] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [13:08:38] !log ppchelko@tin Finished deploy [restbase/deploy@8ea07d6]: Manual blacklist for russian wiki (duration: 07m 59s) [13:08:40] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:10] running puppet on the restbase nodes down [13:10:20] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [13:10:41] RECOVERY - Check systemd state on restbase2008 is OK: OK - running: The system is fully operational [13:11:40] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2017-09-12 15:36:00 +0000 (expires in 73 days) [13:11:40] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [13:11:50] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [13:11:50] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [13:11:50] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [13:12:00] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 138 days) [13:13:28] all right we should be good now, let's see if cassandra likes it [13:20:32] * elukey off! 
[13:30:30] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [13:41:30] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [13:56:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:58:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:59:30] 10Operations, 10Mobile-Content-Service, 10Parsoid, 10Services, and 2 others: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3398184 (10Mholloway) I am technically off Mon-Tues but happy to pop in and merge/deploy/v... [14:05:01] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:05:30] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:05:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:05:52] 10Operations, 10Mobile-Content-Service, 10Parsoid, 10Services, and 2 others: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3398193 (10Mholloway) Er, scratch that, I'm on the road through Tuesday and only have my p... [14:20:29] (03PS3) 10Paladox: Replace TEMPLATE_CONTEXT_PROCESSORS with TEMPLATES [software/servermon] - 10https://gerrit.wikimedia.org/r/362600 [14:30:30] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [14:41:50] 10Operations, 10Mobile-Content-Service, 10Parsoid, 10Services, and 2 others: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3393909 (10ssastry) We'll look at T169293 this week. [15:58:50] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2050043 [16:02:19] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398300 (10alanajjar) [16:12:50] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:17:50] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2029845 [16:19:01] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398337 (10Framawiki) [16:19:04] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398340 (10Framawiki) [16:19:45] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3396823 (10Framawiki) From the original task: The request is accepted so we would appreciate if you could reserve for us a window so this rename can be performed... [16:19:53] marostegui: you around for a large global rename? 
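The "HTTP 5xx reqs/min on graphite1001" alerts fire when more than a given fraction of recent datapoints for a metric sit above a threshold (e.g. "11.11% of data above the critical threshold [1000.0]"). A sketch of that logic against Graphite's JSON render API is below; the Graphite base URL and metric name are placeholders, not the production ones, and only the threshold value is taken from the alert text.

```python
#!/usr/bin/env python3
"""Fraction-over-threshold check against Graphite's render API (URL/metric are placeholders)."""
import json
import urllib.request

GRAPHITE = "https://graphite.example.org"   # placeholder Graphite base URL
TARGET = "some.site.5xx.reqs_per_min"       # placeholder metric name
THRESHOLD = 1000.0                          # critical threshold from the alert text
CRITICAL_FRACTION = 0.10                    # assumed: alert if >10% of points exceed it

def fraction_above(base, target, threshold, minutes=10):
    url = f"{base}/render?target={target}&from=-{minutes}min&format=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        series = json.load(resp)
    # Graphite returns [{"target": ..., "datapoints": [[value, timestamp], ...]}]
    points = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not points:
        return 0.0
    return sum(1 for v in points if v > threshold) / len(points)

if __name__ == "__main__":
    frac = fraction_above(GRAPHITE, TARGET, THRESHOLD)
    status = "CRITICAL" if frac > CRITICAL_FRACTION else "OK"
    print(f"{status}: {frac:.2%} of data above the threshold [{THRESHOLD}]")
```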
[16:35:39] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398344 (10Aklapper) a:05Marostegui>03None @alanajjar: For future reference, please do not set an assignee for a task without previous agreement of the... [16:38:50] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 15160 [16:40:50] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:03:10] PROBLEM - BGP status on cr1-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect, AS6939/IPv4: Connect [17:05:10] RECOVERY - BGP status on cr1-eqord is OK: BGP OK - up: 54, down: 0, shutdown: 2 [17:11:50] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 30 probes of 298 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:16:50] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 16 probes of 298 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:27:06] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398373 (10alanajjar) @Framawiki Thanks, and sorry for open a new task, can you please add me as a subscriber in every task you open like this one (i.e. GR tas... [17:28:16] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398374 (10alanajjar) >>! In T169437#3398344, @Aklapper wrote: > @alanajjar: For future reference, please do not set an assignee for a task without previous... [17:46:24] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398377 (10MarcoAurelio) @alanajjar It'd be easier for you to create a Herald rule that emails you on every task that matches a set of defined criteria :) [17:47:17] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3396823 (10Urbanecm) @alanajjar I strongly advise you to create a Herald rule. You can do it at https://phabricator.wikimedia.org/herald/create/. If you have a qu... [17:47:46] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398381 (10Urbanecm) You were faster @MarcoAurelio :). [17:51:20] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398300 (10Urbanecm) The Assigned To field is to indicate who is currently working at this task. Just wait till somebody would claim the task theyself. He i... 
[17:55:31] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169437#3398397 (10MarcoAurelio) [17:55:34] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398398 (10MarcoAurelio) [17:55:37] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Cardamom → GreenC: supervision needed - https://phabricator.wikimedia.org/T168776#3398399 (10MarcoAurelio) [17:55:40] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Smuconlaw → Sgconlaw: supervision needed - https://phabricator.wikimedia.org/T168109#3398400 (10MarcoAurelio) [17:55:43] 10Operations, 10DBA, 10Wikimedia-Site-requests: Renaming Neoalpha: supervision needed - https://phabricator.wikimedia.org/T167597#3398401 (10MarcoAurelio) [17:55:46] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3398403 (10MarcoAurelio) [17:55:49] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3398402 (10MarcoAurelio) [17:55:57] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Rename user TextworkerBot to VladiBot on ru.wiki - https://phabricator.wikimedia.org/T153602#3398407 (10MarcoAurelio) [17:56:00] 10Operations, 10Wikimedia-General-or-Unknown: Global-rename which needs sysadmin's attention - https://phabricator.wikimedia.org/T73384#3398408 (10MarcoAurelio) [17:57:36] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398414 (10MarcoAurelio) I know it's old-style and old-fashioned but I've created T169440 so people can subscribe and check how many global renames in need of sup... [18:07:50] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [19:54:20] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:56:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:56:30] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 22 probes of 436 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [20:01:30] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 436 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [20:02:20] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:04:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:04:38] (03PS3) 10BryanDavis: tools: Fix maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/360779 (https://phabricator.wikimedia.org/T165875) [20:05:54] (03CR) 10BryanDavis: [C: 031] "Tested manually on tools-k8s-master-01. The current script is broken there." 
[puppet] - 10https://gerrit.wikimedia.org/r/360779 (https://phabricator.wikimedia.org/T165875) (owner: 10BryanDavis) [21:12:50] PROBLEM - HHVM rendering on mw1298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:40] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 75883 bytes in 0.156 second response time [21:52:20] PROBLEM - MegaRAID on db1066 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [21:52:21] ACKNOWLEDGEMENT - MegaRAID on db1066 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T169448 [21:52:24] 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T169448#3398609 (10ops-monitoring-bot) [22:16:50] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89973.56 seconds [22:57:59] 10Operations, 10DBA, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3398635 (10MF-Warburg) All pages have been deleted.
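The MegaRAID alert on db1066 above ("1 failed LD(s) (Degraded)") means one logical drive on the controller has dropped out of the Optimal state. A rough way to reproduce that finding from the host itself is to parse MegaCli's logical-drive listing, as sketched below; the binary path and the exact output wording vary between controller generations and packaging, so treat this as an assumption-laden sketch rather than the RAID handler that filed T169448.

```python
#!/usr/bin/env python3
"""Summarise MegaRAID logical-drive states via MegaCli (binary path/output format may vary)."""
import re
import subprocess
from collections import Counter

MEGACLI = "/usr/sbin/megacli"   # assumed path; sometimes installed as MegaCli64

def logical_drive_states():
    # "-LDInfo -Lall -aAll" lists every logical drive on every adapter
    out = subprocess.run([MEGACLI, "-LDInfo", "-Lall", "-aAll", "-NoLog"],
                         capture_output=True, text=True, check=True)
    # Lines look like "State               : Optimal" (or "Degraded", "Offline", ...)
    return [m.group(1) for m in re.finditer(r"^State\s*:\s*(\w+)", out.stdout, re.M)]

if __name__ == "__main__":
    counts = Counter(logical_drive_states())
    bad = sum(n for state, n in counts.items() if state != "Optimal")
    status = "CRITICAL" if bad else "OK"
    print(f"{status}: {bad} failed LD(s) {dict(counts)}")
```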