[00:19:40] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.020 second response time
[00:19:40] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.021 second response time
[00:19:43] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.020 second response time
[00:19:43] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.019 second response time
[00:19:46] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.019 second response time
[00:19:46] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.020 second response time
[00:20:00] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.004 second response time
[00:20:00] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.002 second response time
[00:20:10] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.003 second response time
[00:20:14] sigh
[00:20:20] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.002 second response time
[00:20:30] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.002 second response time
[00:20:30] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.002 second response time
[00:20:32] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.001 second response time
[00:20:55] silenced for 2h
[00:21:29] thanks godog
[00:21:30] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 31.330 second response time
[00:21:30] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 31.465 second response time
[00:21:30] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 22.688 second response time
[00:21:49] godog: looks like the grid master died because of a user's runaway script
[00:21:50] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 26.508 second response time
[00:21:51] am investigating
[00:21:53] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 26.865 second response time
[00:21:53] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 28.082 second response time
[00:21:56] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.108 second response time
[00:21:56] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.112 second response time
[00:21:56] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.117 second response time
[00:21:59] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.134 second response time
[00:21:59] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.157 second response time
[00:22:13] yuvipanda: ok! thanks for the quick response, let me know if I can help
[00:22:42] godog: will do! since the recoveries came in, will they still be silenced if they fail again for 2h?
[00:23:45] yuvipanda: yeah they will
[00:23:55] godog: ok!
[00:32:30] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:32:40] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.019 second response time
[00:40:40] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 13 failures. Last run 2 minutes ago with 13 failures. Failed resources (up to 3 shown): Service[ferm],Service[prometheus-node-exporter],Service[ntp],Service[salt-minion]
[00:42:34] up to 3 shown, huh?
[00:52:20] PROBLEM - puppet last run on wdqs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:01:30] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[01:07:40] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[01:13:20] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:14:20] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.493 second response time
[01:20:20] RECOVERY - puppet last run on wdqs1001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[01:28:50] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:42:20] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[01:56:50] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[02:38:24] (PS3) Tim Landscheidt: Labs: Remove obsolete code [puppet] - https://gerrit.wikimedia.org/r/326312
[02:42:32] (CR) Tim Landscheidt: "@Dzahn: A commit message that refers to "os_version('ubuntu <= precise')" would be misleading: The code should not be removed because it w" [puppet] - https://gerrit.wikimedia.org/r/326312 (owner: Tim Landscheidt)
[02:44:20] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:56:00] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:57:50] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[03:08:50] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:08:50] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:12:20] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[03:13:20] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:23:40] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 789.99 seconds
[03:33:40] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 279.81 seconds
[03:36:50] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[03:37:50] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[03:41:20] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[03:45:50] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:48:50] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:02:32] Operations, IDS-extension, Wikimedia-Extension-setup, I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2901215 (Shoichi) a:Shoichi han3_ji7_tsoo1_kian3_WM Network code translation for safety code review. I organized a 5-person translation team on 12/22....
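
The toolschecker alerts at the top of this log all follow one pattern: an HTTP check fetches an endpoint on checker.tools.wmflabs.org and requires the literal string "OK" in the response body, so a 502 from the front-end proxy surfaces as "string OK not found". Below is a minimal sketch of that probe logic in Python, built only from the endpoint paths visible in the alert text; it is illustrative and not the production Icinga/check_http configuration.

```python
#!/usr/bin/env python3
"""Illustrative re-implementation of the toolschecker probes (a sketch,
not the production check): fetch each endpoint and require the literal
string "OK" in the body, which is why a 502 reads as "string OK not found"."""
import urllib.request

BASE = "http://checker.tools.wmflabs.org:80"
# Endpoint paths taken from the alert text above.
PATHS = ["/redis", "/toolscron", "/self", "/dumps", "/nfs/home",
         "/nfs/showmount", "/labs-dns/private", "/k8s/nodes/ready",
         "/etcd/k8s", "/grid/start/trusty", "/grid/start/precise",
         "/etcd/flannel", "/ldap"]

def probe(path, timeout=60):
    """Return an Icinga-style status line for one toolschecker endpoint."""
    try:
        with urllib.request.urlopen(BASE + path, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            status = "OK" if "OK" in body else "CRITICAL: string OK not found"
    except Exception as exc:  # a 502 Bad Gateway surfaces here as an HTTPError
        status = f"CRITICAL: {exc}"
    return f"{path}: {status}"

if __name__ == "__main__":
    for p in PATHS:
        print(probe(p))
```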
[04:09:10] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1251.80 Read Requests/Sec=1008.80 Write Requests/Sec=81.50 KBytes Read/Sec=31424.40 KBytes_Written/Sec=2224.00
[04:14:50] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[04:17:10] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.20 Read Requests/Sec=0.20 Write Requests/Sec=0.70 KBytes Read/Sec=0.80 KBytes_Written/Sec=7.20
[04:17:50] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[04:53:40] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:04:20] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:04:30] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[05:22:40] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[05:32:30] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[05:32:40] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[05:37:40] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:40] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:30] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:40] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
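
The CHECK_NRPE socket timeouts above mean the dbstore1002 checks could not even complete; what they normally evaluate is per-section replication state. dbstore1002 replicates many sections (s1-s7, x1, m2, m3) via MariaDB multi-source replication, which reports one row per connection. Here is a hedged sketch of reading that state with pymysql; the host and credentials are placeholders, and this is not the production NRPE plugin.

```python
#!/usr/bin/env python3
"""Sketch of what the per-section "MariaDB Slave IO/SQL/Lag" checks report.
MariaDB multi-source replication exposes one row per connection via
SHOW ALL SLAVES STATUS. Connection details are placeholders."""
import pymysql

conn = pymysql.connect(host="dbstore1002.example", user="monitor",
                       password="secret",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("SHOW ALL SLAVES STATUS")
    for row in cur.fetchall():
        name = row.get("Connection_name") or "(default)"
        io_ok = row["Slave_IO_Running"] == "Yes"
        sql_ok = row["Slave_SQL_Running"] == "Yes"
        lag = row["Seconds_Behind_Master"]  # None while the SQL thread is stopped
        print(f"{name}: IO={'OK' if io_ok else 'CRITICAL'} "
              f"SQL={'OK' if sql_ok else 'CRITICAL'} lag={lag}")
conn.close()
```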
[05:38:40] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:40] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:40] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:40] PROBLEM - puppet last run on oresrdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:40] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:40] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:45:00] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:45:00] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:52:38] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2901258 (zhuyifei1999)
[06:09:40] RECOVERY - puppet last run on oresrdb1002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:13:30] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:31:50] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:32:30] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:32:30] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:30] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:30] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:30] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:40] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.32 seconds
[06:32:40] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:32:40] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:40] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.44 seconds
[06:32:50] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:32:50] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:50] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave
[06:32:50] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.38 seconds
[06:32:50] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:32:50] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:50] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.39 seconds
[06:32:51] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:32:51] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.40 seconds
[06:32:52] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:52] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave
[06:32:53] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:38:50] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 92.76 seconds
[06:41:30] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:49:50] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 119.83 seconds
[06:52:40] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:52:56] Puppet, Labs, Tool-Labs: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2901265 (scfc)
[06:53:28] Puppet, Labs, Tool-Labs: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2884609 (scfc)
[07:00:00] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[07:20:40] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[08:07:40] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
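
The recurring "puppet last run" checks report lines such as "Puppet is currently enabled, last run N seconds ago with 0 failures", which is the kind of information the agent records in its run-summary YAML after every catalog run. Below is a sketch of deriving that status from the summary file, assuming the open-source agent's default path and key layout of this era; the real check may gather the data differently.

```python
#!/usr/bin/env python3
"""Sketch of what a "puppet last run" status could be derived from: the
agent's last_run_summary.yaml, which records when the last catalog run
finished and how many resources failed. Path and key names are assumed
defaults and may differ on other setups."""
import time
import yaml

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"  # assumed default path

with open(SUMMARY) as fh:
    summary = yaml.safe_load(fh)

age = time.time() - summary["time"]["last_run"]
failed = summary["resources"]["failed"]
state = "OK" if failed == 0 else "CRITICAL"
print(f"{state}: last run {age:.0f} seconds ago with {failed} failures")
```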
[08:07:40] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:42] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:42] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:42] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:50] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:50] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:02] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:02] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:02] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:02] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:03] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:14:50] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:14:50] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:01] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:23:40] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:30:16] !log restarting dbstore1002 mariadb, alive but unresponsive to queries due to long-running queries saturating the service
[08:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:40] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:55:50] PROBLEM - puppet last run on mc1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:23:50] RECOVERY - puppet last run on mc1018 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[09:45:37] Puppet, Labs, Tool-Labs: role::puppetmaster::puppetdb depends on Ganglia and cannot be used in Labs - https://phabricator.wikimedia.org/T154104#2901305 (scfc)
[09:46:56] Puppet, Labs, Tool-Labs: role::puppetmaster::puppetdb uses nginx as reverse proxy and cannot be used together with Apache applications - https://phabricator.wikimedia.org/T154105#2901318 (scfc)
[09:48:50] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave
[09:48:50] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave
[09:48:50] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave
[09:50:30] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:31] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:31] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:31] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:31] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:40] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:40] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:50] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:50] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:50] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:50] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:50] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:51] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:51] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:51] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:51] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:52] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:52] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:54] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[09:51:50] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[10:03:50] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[10:08:50] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 205.59 seconds
[10:09:50] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:09:50] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 191.75 seconds
[10:11:40] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[10:13:40] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:22:50] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 190.69 seconds
[10:24:50] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 242.59 seconds
[10:37:50] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[10:42:40] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[10:48:50] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:35:30] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:03:30] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[12:31:30] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:43:30] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:00:30] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[13:11:30] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[13:49:40] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:16:40] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:17:40] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[14:36:20] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2901491 (zhuyifei1999)
[14:44:40] RECOVERY - puppet last run on ganeti1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[15:14:40] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:42:40] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[17:17:27] (PS1) Marostegui: .profile: Add .profile file [puppet] - https://gerrit.wikimedia.org/r/329133
[17:20:15] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/4992/ compiles fine" [puppet] - https://gerrit.wikimedia.org/r/329133 (owner: Marostegui)
[17:24:50] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 19 failures. Last run 2 minutes ago with 19 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[17:33:00] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:52:50] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[18:01:00] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[19:04:40] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[19:21:50] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:32:40] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[19:33:15] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2901590 (Yann) As of now there are * 30 running transcodes * 10,571 queued transcodes * 361,961 failed trans...
[19:50:50] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[20:28:50] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:32:00] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:56:50] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[21:00:00] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[23:01:00] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:29:00] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[23:45:10] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
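
Per the 08:30 !log entry, dbstore1002 had to be restarted because long-running queries had saturated it, which also explains the earlier CHECK_NRPE socket timeouts. Below is a sketch of how one might spot such queries before resorting to a restart, by reading information_schema.PROCESSLIST; the connection details and threshold are placeholders, and the KILL statement is left commented out.

```python
#!/usr/bin/env python3
"""List queries that have been running for a long time on a MariaDB host
such as dbstore1002. Connection details are placeholders; the threshold
is arbitrary."""
import pymysql

THRESHOLD_SECONDS = 3600  # flag anything running longer than an hour

conn = pymysql.connect(host="dbstore1002.example", user="monitor",
                       password="secret",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute(
        "SELECT id AS id, user AS user, db AS db, time AS time, "
        "LEFT(info, 120) AS query "
        "FROM information_schema.processlist "
        "WHERE command <> 'Sleep' AND time > %s "
        "ORDER BY time DESC",
        (THRESHOLD_SECONDS,))
    for row in cur.fetchall():
        print(f"{row['time']:>8}s  id={row['id']}  {row['user']}@{row['db']}  "
              f"{row['query']}")
        # To terminate a runaway query (use with care):
        # cur.execute("KILL QUERY %s", (row["id"],))
conn.close()
```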