[00:19:40] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.020 second response time
[00:19:40] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.021 second response time
[00:19:43] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.020 second response time
[00:19:43] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.019 second response time
[00:19:46] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.019 second response time
[00:19:46] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.020 second response time
[00:20:00] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.004 second response time
[00:20:00] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.002 second response time
[00:20:10] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.003 second response time
[00:20:14] sigh
[00:20:20] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.002 second response time
[00:20:30] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.002 second response time
[00:20:30] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.002 second response time
[00:20:32] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.001 second response time
[00:20:55] silenced for 2h
[00:21:29] thanks godog
[00:21:30] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 31.330 second response time
[00:21:30] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 31.465 second response time
[00:21:30] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 22.688 second response time
[00:21:49] godog: looks like the grid master died because of a user's runaway script
[00:21:50] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 26.508 second response time
[00:21:51] am investigating
[00:21:53] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 26.865 second response time
[00:21:53] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 28.082 second response time
[00:21:56] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.108 second response time
[00:21:56] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.112 second response time
[00:21:56] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.117 second response time
[00:21:59] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.134 second response time
[00:21:59] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.157 second response time
[00:22:13] yuvipanda: ok! thanks for the quick response, let me know if I can help
[00:22:42] godog: will do! since the recoveries came in, will they still be silenced if they fail again for 2h?
[00:23:45] yuvipanda: yeah they will
[00:23:55] godog: ok!
[00:32:30] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:32:40] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.019 second response time
[00:40:40] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 13 failures. Last run 2 minutes ago with 13 failures. Failed resources (up to 3 shown): Service[ferm],Service[prometheus-node-exporter],Service[ntp],Service[salt-minion]
[00:42:34] up to 3 shown, huh?
[00:52:20] PROBLEM - puppet last run on wdqs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:01:30] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[01:07:40] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[01:13:20] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:14:20] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.493 second response time
[01:20:20] RECOVERY - puppet last run on wdqs1001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[01:28:50] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:42:20] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[01:56:50] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[02:38:24] (PS3) Tim Landscheidt: Labs: Remove obsolete code [puppet] - https://gerrit.wikimedia.org/r/326312
[02:42:32] (CR) Tim Landscheidt: "@Dzahn: A commit message that refers to "os_version('ubuntu <= precise')" would be misleading: The code should not be removed because it w" [puppet] - https://gerrit.wikimedia.org/r/326312 (owner: Tim Landscheidt)
[02:44:20] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:56:00] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:57:50] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[03:08:50] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:08:50] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:12:20] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[03:13:20] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:23:40] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 789.99 seconds
[03:33:40] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 279.81 seconds
[03:36:50] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[03:37:50] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[03:41:20] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[03:45:50] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:48:50] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:02:32] Operations, IDS-extension, Wikimedia-Extension-setup, I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2901215 (Shoichi) a:Shoichi han3_ji7_tsoo1_kian3_WM Network code translation for safety code review. I organized a 5-person translation team on 12/22....
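
The toolschecker alerts at the top of this log all follow one pattern: an HTTP check fetches an endpoint on checker.tools.wmflabs.org and requires the literal string "OK" in the response body, so a 502 from the front-end proxy surfaces as "string OK not found". Below is a minimal sketch of that probe logic in Python, built only from the endpoint paths visible in the alert text; it is illustrative and not the production Icinga/check_http configuration.

```python
#!/usr/bin/env python3
"""Illustrative re-implementation of the toolschecker probes (a sketch,
not the production check): fetch each endpoint and require the literal
string "OK" in the body, which is why a 502 reads as "string OK not found"."""
import urllib.request

BASE = "http://checker.tools.wmflabs.org:80"
# Endpoint paths taken from the alert text above.
PATHS = ["/redis", "/toolscron", "/self", "/dumps", "/nfs/home",
         "/nfs/showmount", "/labs-dns/private", "/k8s/nodes/ready",
         "/etcd/k8s", "/grid/start/trusty", "/grid/start/precise",
         "/etcd/flannel", "/ldap"]

def probe(path, timeout=60):
    """Return an Icinga-style status line for one toolschecker endpoint."""
    try:
        with urllib.request.urlopen(BASE + path, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            status = "OK" if "OK" in body else "CRITICAL: string OK not found"
    except Exception as exc:  # a 502 Bad Gateway surfaces here as an HTTPError
        status = f"CRITICAL: {exc}"
    return f"{path}: {status}"

if __name__ == "__main__":
    for p in PATHS:
        print(probe(p))
```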
[04:09:10] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1251.80 Read Requests/Sec=1008.80 Write Requests/Sec=81.50 KBytes Read/Sec=31424.40 KBytes_Written/Sec=2224.00
[04:14:50] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[04:17:10] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.20 Read Requests/Sec=0.20 Write Requests/Sec=0.70 KBytes Read/Sec=0.80 KBytes_Written/Sec=7.20
[04:17:50] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[04:53:40] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:04:20] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:04:30] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[05:22:40] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[05:32:30] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[05:32:40] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[05:37:40] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:40] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:50] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:00] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:30] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:40] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
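
The CHECK_NRPE socket timeouts above mean the dbstore1002 checks could not even complete; what they normally evaluate is per-section replication state. dbstore1002 replicates many sections (s1-s7, x1, m2, m3) via MariaDB multi-source replication, which reports one row per connection. Here is a hedged sketch of reading that state with pymysql; the host and credentials are placeholders, and this is not the production NRPE plugin.

```python
#!/usr/bin/env python3
"""Sketch of what the per-section "MariaDB Slave IO/SQL/Lag" checks report.
MariaDB multi-source replication exposes one row per connection via
SHOW ALL SLAVES STATUS. Connection details are placeholders."""
import pymysql

conn = pymysql.connect(host="dbstore1002.example", user="monitor",
                       password="secret",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("SHOW ALL SLAVES STATUS")
    for row in cur.fetchall():
        name = row.get("Connection_name") or "(default)"
        io_ok = row["Slave_IO_Running"] == "Yes"
        sql_ok = row["Slave_SQL_Running"] == "Yes"
        lag = row["Seconds_Behind_Master"]  # None while the SQL thread is stopped
        print(f"{name}: IO={'OK' if io_ok else 'CRITICAL'} "
              f"SQL={'OK' if sql_ok else 'CRITICAL'} lag={lag}")
conn.close()
```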
[05:38:40] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:40] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:40] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:40] PROBLEM - puppet last run on oresrdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:44:40] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:40] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:50] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:45:00] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:45:00] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:52:38] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2901258 (zhuyifei1999)
[06:09:40] RECOVERY - puppet last run on oresrdb1002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:13:30] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:31:50] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:32:30] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:32:30] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:30] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:30] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:30] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:40] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.32 seconds
[06:32:40] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:32:40] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:40] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.44 seconds
[06:32:50] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:32:50] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:50] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave
[06:32:50] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.38 seconds
[06:32:50] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:32:50] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:50] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.39 seconds
[06:32:51] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:32:51] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.40 seconds
[06:32:52] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:32:52] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave
[06:32:53] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:38:50] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 92.76 seconds
[06:41:30] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:49:50] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 119.83 seconds
[06:52:40] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:52:56] Puppet, Labs, Tool-Labs: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2901265 (scfc)
[06:53:28] Puppet, Labs, Tool-Labs: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2884609 (scfc)
[07:00:00] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[07:20:40] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[08:07:40] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
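
The recurring "puppet last run" checks report lines such as "Puppet is currently enabled, last run N seconds ago with 0 failures", which is the kind of information the agent records in its run-summary YAML after every catalog run. Below is a sketch of deriving that status from the summary file, assuming the open-source agent's default path and key layout of this era; the real check may gather the data differently.

```python
#!/usr/bin/env python3
"""Sketch of what a "puppet last run" status could be derived from: the
agent's last_run_summary.yaml, which records when the last catalog run
finished and how many resources failed. Path and key names are assumed
defaults and may differ on other setups."""
import time
import yaml

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"  # assumed default path

with open(SUMMARY) as fh:
    summary = yaml.safe_load(fh)

age = time.time() - summary["time"]["last_run"]
failed = summary["resources"]["failed"]
state = "OK" if failed == 0 else "CRITICAL"
print(f"{state}: last run {age:.0f} seconds ago with {failed} failures")
```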
[08:07:40] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:42] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:42] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:42] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:50] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:50] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:02] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:02] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:02] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:02] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:03] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:14:50] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:14:50] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:00] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:15:01] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:23:40] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:30:16] !log restarting dbstore1002 mariadb, alive but unresponsive to queries due to long-running queries saturating the service
[08:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:40] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:55:50] PROBLEM - puppet last run on mc1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:23:50] RECOVERY - puppet last run on mc1018 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[09:45:37] Puppet, Labs, Tool-Labs: role::puppetmaster::puppetdb depends on Ganglia and cannot be used in Labs - https://phabricator.wikimedia.org/T154104#2901305 (scfc)
[09:46:56] Puppet, Labs, Tool-Labs: role::puppetmaster::puppetdb uses nginx as reverse proxy and cannot be used together with Apache applications - https://phabricator.wikimedia.org/T154105#2901318 (scfc)
[09:48:50] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave
[09:48:50] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave
[09:48:50] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave
[09:50:30] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:31] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:31] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:31] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:31] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:40] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:40] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:50] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:50] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:50] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:50] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:50] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:51] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:51] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:51] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:50:51] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:52] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:52] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:50:54] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[09:51:50] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[10:03:50] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[10:08:50] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 205.59 seconds
[10:09:50] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:09:50] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 191.75 seconds
[10:11:40] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[10:13:40] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:22:50] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 190.69 seconds
[10:24:50] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 242.59 seconds
[10:37:50] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[10:42:40] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[10:48:50] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:35:30] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:03:30] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[12:31:30] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:43:30] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:00:30] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[13:11:30] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[13:49:40] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:16:40] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:17:40] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[14:36:20] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2901491 (zhuyifei1999)
[14:44:40] RECOVERY - puppet last run on ganeti1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[15:14:40] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:42:40] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[17:17:27] (PS1) Marostegui: .profile: Add .profile file [puppet] - https://gerrit.wikimedia.org/r/329133
[17:20:15] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/4992/ compiles fine" [puppet] - https://gerrit.wikimedia.org/r/329133 (owner: Marostegui)
[17:24:50] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 19 failures. Last run 2 minutes ago with 19 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[17:33:00] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:52:50] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[18:01:00] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[19:04:40] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[19:21:50] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:32:40] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[19:33:15] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2901590 (Yann) As of now there are * 30 running transcodes * 10,571 queued transcodes * 361,961 failed trans...
[19:50:50] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[20:28:50] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:32:00] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:56:50] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[21:00:00] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[23:01:00] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:29:00] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[23:45:10] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
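
Per the 08:30 !log entry, dbstore1002 had to be restarted because long-running queries had saturated it, which also explains the earlier CHECK_NRPE socket timeouts. Below is a sketch of how one might spot such queries before resorting to a restart, by reading information_schema.PROCESSLIST; the connection details and threshold are placeholders, and the KILL statement is left commented out.

```python
#!/usr/bin/env python3
"""List queries that have been running for a long time on a MariaDB host
such as dbstore1002. Connection details are placeholders; the threshold
is arbitrary."""
import pymysql

THRESHOLD_SECONDS = 3600  # flag anything running longer than an hour

conn = pymysql.connect(host="dbstore1002.example", user="monitor",
                       password="secret",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute(
        "SELECT id AS id, user AS user, db AS db, time AS time, "
        "LEFT(info, 120) AS query "
        "FROM information_schema.processlist "
        "WHERE command <> 'Sleep' AND time > %s "
        "ORDER BY time DESC",
        (THRESHOLD_SECONDS,))
    for row in cur.fetchall():
        print(f"{row['time']:>8}s  id={row['id']}  {row['user']}@{row['db']}  "
              f"{row['query']}")
        # To terminate a runaway query (use with care):
        # cur.execute("KILL QUERY %s", (row["id"],))
conn.close()
```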