[00:19:48] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:29:28] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:29:48] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:47:48] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[00:57:28] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[00:57:48] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[01:05:18] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:29:25] (PS1) Legoktm: contint: Add dependencies needed for PoolCounter tests [puppet] - https://gerrit.wikimedia.org/r/325145 (https://phabricator.wikimedia.org/T152338)
[01:33:18] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[01:48:07] (PS1) Tim Landscheidt: Quote "owner" and "group" attributes for file and git::clone resources [puppet] - https://gerrit.wikimedia.org/r/325146
[02:14:38] PROBLEM - MD RAID on bast3001 is CRITICAL: CRITICAL: Active: 4, Working: 4, Failed: 2, Spare: 0
[02:14:49] ACKNOWLEDGEMENT - MD RAID on bast3001 is CRITICAL: CRITICAL: Active: 4, Working: 4, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T152339
[02:14:52] Operations, ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T152339#2845080 (ops-monitoring-bot)
[02:16:59] PROBLEM - Host labservices1001 is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:17:02] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:17:39] PROBLEM - Host 208.80.155.118 is DOWN: CRITICAL - Host Unreachable (208.80.155.118)
[02:19:51] uhhhh
[02:19:52] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.005 second response time
[02:20:17] I think that broke CI
[02:20:49] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.002 second response time
[02:20:49] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.001 second response time
[02:20:53] woah
[02:21:02] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.005 second response time
[02:21:03] labservices1001 is down?
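[Editor's note: the MD RAID alert above reports the array as device counts (Active/Working/Failed/Spare); a degraded array shows Failed > 0 and the RAID handler auto-files a Phabricator task. As a rough illustration of where such counts come from, and not Wikimedia's actual NRPE plugin, the Python sketch below pulls them out of `mdadm --detail` and exits with the Nagios CRITICAL code when any device has failed; mdadm being installed and sufficient privileges are assumptions.]

```python
#!/usr/bin/env python3
"""Summarise md RAID health the way the Icinga check output above reads.

Illustrative sketch only (not the production NRPE plugin); assumes mdadm is
installed and the script has enough privilege to call it."""
import re
import subprocess
import sys

def md_counts(device):
    """Return the Active/Working/Failed/Spare device counts for one array."""
    out = subprocess.run(["mdadm", "--detail", device],
                         capture_output=True, text=True, check=True).stdout
    counts = {}
    for field in ("Active", "Working", "Failed", "Spare"):
        m = re.search(r"%s Devices\s*:\s*(\d+)" % field, out)
        counts[field] = int(m.group(1)) if m else 0
    return counts

def main():
    status, exit_code = "OK", 0
    summaries = []
    for device in sys.argv[1:] or ["/dev/md0"]:
        c = md_counts(device)
        summaries.append("Active: %(Active)d, Working: %(Working)d, "
                         "Failed: %(Failed)d, Spare: %(Spare)d" % c)
        if c["Failed"] > 0:
            status, exit_code = "CRITICAL", 2   # Nagios CRITICAL exit code
    print("%s: %s" % (status, "; ".join(summaries)))
    sys.exit(exit_code)

if __name__ == "__main__":
    main()
```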
[02:21:09] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.003 second response time
[02:21:11] sounds like more than just CI legoktm
[02:21:14] what did you do?
[02:21:19] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.002 second response time
[02:21:25] I didn't do anything...
[02:21:29] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.004 second response time
[02:21:29] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.002 second response time
[02:21:37] Just noticed that CI started failing with cannot find host gerrit.wm.o errors
[02:21:49] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.002 second response time
[02:21:51] CI will be using the labs DNS servers
[02:21:52] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.006 second response time
[02:21:52] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.006 second response time
[02:21:52] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.005 second response time
[02:21:52] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.002 second response time
[02:22:47] we think it is related to a recent change? legoktm Krenair ?
[02:22:58] nothing should have changed?
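[Editor's note: every toolschecker alert in this storm follows one pattern: an HTTP GET against an endpoint on checker.tools.wmflabs.org whose body is expected to contain the string "OK"; here the proxy answers 502, the string is missing, and the check goes CRITICAL. The production checks are Icinga check_http invocations, so the Python sketch below is only an illustration of the same expect-a-string check, using the standard library; the default URL is copied from the alerts.]

```python
#!/usr/bin/env python3
"""Minimal "expect a string in the body" HTTP check, in the spirit of the
toolschecker alerts above. Illustrative only; the production checks use
Icinga's check_http plugin, not this script."""
import sys
import time
import urllib.request

def check(url, expected="OK", timeout=10):
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            status_line = "HTTP/1.1 %d %s" % (resp.status, resp.reason)
    except Exception as exc:  # a 502 raises HTTPError; DNS/socket errors land here too
        print("CRITICAL: %s - %s" % (url, exc))
        return 2
    elapsed = time.time() - start
    if expected in body:
        print("OK: %s - %d bytes in %.3f second response time"
              % (status_line, len(body), elapsed))
        return 0
    print("CRITICAL: %s - string %s not found on %s - %d bytes in %.3f second "
          "response time" % (status_line, expected, url, len(body), elapsed))
    return 2

if __name__ == "__main__":
    sys.exit(check(sys.argv[1] if len(sys.argv) > 1
                   else "http://checker.tools.wmflabs.org:80/self"))
```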
[02:23:19] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.002 second response time
[02:23:19] oh, sorry
[02:23:23] ok thanks, just making sure
[02:23:25] I think labs DNS just died
[02:23:29] I misread 'I think that broke CI'
[02:23:37] my head added an extra 'I' in there
[02:23:39] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.002 second response time
[02:23:39] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 1.003 second response time
[02:23:44] ok checking the checker
[02:24:01] the real underlying problem will be in this godog
[02:24:02] PROBLEM - Host labservices1001 is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:24:02] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:24:02] PROBLEM - Host 208.80.155.118 is DOWN: CRITICAL - Host Unreachable (208.80.155.118)
[02:24:07] not tools stuff
[02:24:36] ah, thanks I'll check that now
[02:25:49] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.002 second response time
[02:25:49] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 1.005 second response time
[02:25:49] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 1.005 second response time
[02:25:59] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.002 second response time
[02:25:59] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.002 second response time
[02:25:59] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.001 second response time
[02:26:02] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.002 second response time
[02:26:02] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.004 second response time
[02:26:02] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.004 second response time
[02:27:01] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.003 second response time
[02:27:59] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.002 second response time
[02:27:59] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.004 second response time
[02:27:59] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.005 second response time
[02:27:59] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.006 second response time
[02:30:09] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.002 second response time
[02:30:09] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.002 second response time
[02:30:12] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.002 second response time
[02:30:12] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.002 second response time
[02:30:12] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.002 second response time
[02:30:12] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.002 second response time
[02:30:15] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.002 second response time
[02:30:15] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.002 second response time
[02:30:15] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.002 second response time
[02:30:59] !log silence checker.tools.wmflabs.org for 2h
[02:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:39] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 18.431 second response time
[02:32:59] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 3.207 second response time
[02:33:03] I logged onto tools-bastion-03 and it still appears functional
[02:34:22] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 14.955 second response time
[02:34:42] yeah looks like it is recovering, still investigating
[02:35:09] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 3 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[02:42:29] RECOVERY - Host labservices1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[02:42:32] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 18.131 second response time
[02:42:32] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 18.133 second response time
[02:42:39] RECOVERY - Host 208.80.155.118 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[02:42:49] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[02:42:54] !log powercycle labservices1001
[02:42:59] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.006 second response time
[02:42:59] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.024 second response time
[02:42:59] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.031 second response time
[02:43:02] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.031 second response time
[02:43:02] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.160 second response time
[02:43:02] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.828 second response time
[02:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:49] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.716 second response time
[02:45:29] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.619 second response time
[02:45:59] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:57:39] Operations, Labs: labservices1001 down - https://phabricator.wikimedia.org/T152340#2845112 (fgiunchedi) a:Andrew
[02:57:45] Operations, Labs, Labs-Infrastructure: labservices1001 down - https://phabricator.wikimedia.org/T152340#2845114 (Krenair)
[03:00:31] (CR) Alex Monk: "racktables still has the old name" [dns] - https://gerrit.wikimedia.org/r/255047 (https://phabricator.wikimedia.org/T106303) (owner: Andrew Bogott)
[03:02:19] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[03:14:59] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[03:18:59] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:25:19] Operations, Labs, Labs-Infrastructure: labservices1001 down - https://phabricator.wikimedia.org/T152340#2845101 (Legoktm) > Note that during this time dns from labs instances seemed fine, why toolschecker failed needs investigation Well, it wasn't all fine. CI tests were failing with stuff like: ```...
[03:32:29] PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:47:59] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[04:00:39] RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[04:15:43] (PS1) Catrope: Avoid using CONTENT_MODEL_FLOW_BOARD [mediawiki-config] - https://gerrit.wikimedia.org/r/325152
[04:15:49] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:17:02] (CR) Catrope: "Whoops, thanks for fixing this. I've created https://gerrit.wikimedia.org/r/325152 for production and scheduled it for SWAT on Monday." [mediawiki-config] - https://gerrit.wikimedia.org/r/325119 (owner: Alex Monk)
[04:44:49] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[05:39:29] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:07:29] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:11:39] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:12:29] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[06:51:36] Puppet, Continuous-Integration-Config, Jenkins: There is no sane way to get arcanist's conduit tokens onto nodepool CI slaves - https://phabricator.wikimedia.org/T140417#2845161 (mmodell) Open>Invalid
[06:52:22] Puppet, Labs, Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2845162 (mmodell) @paladox: is this now working on phab-01?
[06:53:04] Puppet, Labs, Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2845163 (mmodell) a:mmodell>None rather, the main role now works on labs, correct?
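[Editor's note: the underlying failure was labservices1001 / labs-ns0, the labs DNS recursor, becoming unreachable, which is why CI slaves suddenly could not resolve gerrit.wikimedia.org. A quick way to separate "my resolver is dead" from "the record is gone" is to compare the instance's stub resolver with a direct query to each recursor. The Python sketch below does that with the standard library plus dig; the recursor IPs are taken from the host alerts above, and the script itself is an illustration, not an existing tool.]

```python
#!/usr/bin/env python3
"""Quick triage for "cannot find host ..." errors like the CI failures above:
check the instance's own resolver, then ask the labs recursors directly.
Illustrative sketch; the IPs below are the ones named in the Icinga alerts."""
import socket
import subprocess

HOSTNAME = "gerrit.wikimedia.org"
# labservices1001 / labs-ns0 addresses from the host alerts above
NAMESERVERS = ["208.80.155.117", "208.80.155.118"]

def local_resolver_ok(name):
    """Does the resolver configured in /etc/resolv.conf answer for name?"""
    try:
        return bool(socket.gethostbyname(name))
    except socket.gaierror:
        return False

def server_answers(server, name):
    """Ask one recursor directly, bypassing the local stub resolver."""
    result = subprocess.run(
        ["dig", "+short", "+time=3", "+tries=1", "@" + server, name, "A"],
        capture_output=True, text=True)
    return result.returncode == 0 and result.stdout.strip() != ""

if __name__ == "__main__":
    print("local resolver:", "ok" if local_resolver_ok(HOSTNAME) else "FAILING")
    for ns in NAMESERVERS:
        print("%s:" % ns, "answers" if server_answers(ns, HOSTNAME) else "NO ANSWER")
```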
[07:01:49] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[07:07:29] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.199 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[07:07:49] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199
[07:08:29] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0
[07:08:49] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[07:22:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 620 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4838537 keys, up 33 days 23 hours - replication_delay is 620
[07:27:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4821176 keys, up 33 days 23 hours - replication_delay is 32
[07:29:49] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:24:39] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[08:25:29] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:52:39] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[08:53:29] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:06:09] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4936.70 Read Requests/Sec=4553.20 Write Requests/Sec=286.80 KBytes Read/Sec=21649.60 KBytes_Written/Sec=7822.80
[09:13:09] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=10.60 Read Requests/Sec=0.80 Write Requests/Sec=18.50 KBytes Read/Sec=18.40 KBytes_Written/Sec=83.60
[09:43:39] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:45:39] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[11:22:49] Puppet, Labs, Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2845248 (Paladox) @mmodell I haven't applied this role to phab-01 yet but I have applied it to the phabricator role so looks like it is working on labs (no failures)
[11:59:22] (CR) Mobrovac: [C: -1] (WIP) services: create global service restart script (3 comments) [puppet] - https://gerrit.wikimedia.org/r/325039 (owner: Dzahn)
[13:15:29] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:17:29] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[13:24:29] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
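[Editor's note: the rdb2006 alert fires when the replica's replication delay crosses a threshold; the "620 600" in the message reads as measured delay versus the 600-second limit. A rough equivalent of that measurement, assuming the redis-py client is available and using master_last_io_seconds_ago as the staleness signal, is sketched below; the production check is a separate Icinga plugin, and the host, port and threshold are simply copied from the alert for illustration.]

```python
#!/usr/bin/env python3
"""Rough equivalent of the "replication_delay is 620 600" alert above: report
how stale a Redis replica is and compare against a threshold. Illustrative
sketch assuming the redis-py client; not the production check script."""
import sys
import redis  # pip install redis

def replication_delay(host, port):
    info = redis.StrictRedis(host=host, port=port, socket_timeout=5).info("replication")
    if info.get("role") != "slave":
        return 0  # a master has nothing to lag behind
    if info.get("master_link_status") != "up":
        return float("inf")
    # seconds since the replica last heard from its master
    return info.get("master_last_io_seconds_ago", 0)

if __name__ == "__main__":
    host, port, threshold = "10.192.48.44", 6479, 600  # values from the alert above
    delay = replication_delay(host, port)
    if delay > threshold:
        print("CRITICAL: replication_delay is %s %s" % (delay, threshold))
        sys.exit(2)
    print("OK: replication_delay is %s" % delay)
```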
[13:25:29] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[13:32:29] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:34:19] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[13:36:12] Operations, ChangeProp, Parsing-Team, Parsoid, and 5 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2845282 (mobrovac) On the services side, we could probably move the update flow (RB, Parsoid, MobileApps et a...
[13:41:59] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:00:29] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:49] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[14:08:59] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[14:28:29] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[14:29:49] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[14:49:35] (CR) Mobrovac: [C: 1] node service - allow empty entry point [puppet] - https://gerrit.wikimedia.org/r/324190 (https://phabricator.wikimedia.org/T150021) (owner: Gehel)
[16:37:09] PROBLEM - puppet last run on prometheus1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:02:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[17:03:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4807632 keys, up 34 days 8 hours - replication_delay is 38
[17:04:50] (CR) Reedy: "Needs doing in CommonSettings.php too" [mediawiki-config] - https://gerrit.wikimedia.org/r/325119 (owner: Alex Monk)
[17:05:09] RECOVERY - puppet last run on prometheus1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[17:09:29] PROBLEM - Disk space on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:29] PROBLEM - Check size of conntrack table on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:29] PROBLEM - salt-minion processes on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:39] PROBLEM - Check whether ferm is active by checking the default input chain on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:39] PROBLEM - dhclient process on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:49] PROBLEM - configured eth on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:49] (CR) Alex Monk: "Yep, that's what Roan's commit is for" [mediawiki-config] - https://gerrit.wikimedia.org/r/325119 (owner: Alex Monk)
[17:09:49] PROBLEM - Check systemd state on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:49] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:49] PROBLEM - DPKG on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:10:19] RECOVERY - Disk space on bast3001 is OK: DISK OK
[17:10:19] RECOVERY - salt-minion processes on bast3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:10:19] RECOVERY - Check size of conntrack table on bast3001 is OK: OK: nf_conntrack is 0 % full
[17:10:29] RECOVERY - Check whether ferm is active by checking the default input chain on bast3001 is OK: OK ferm input default policy is set
[17:10:29] RECOVERY - dhclient process on bast3001 is OK: PROCS OK: 0 processes with command name dhclient
[17:10:39] RECOVERY - configured eth on bast3001 is OK: OK - interfaces up
[17:10:39] RECOVERY - Check systemd state on bast3001 is OK: OK - running: The system is fully operational
[17:10:39] RECOVERY - DPKG on bast3001 is OK: All packages OK
[17:10:49] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures
[17:11:11] (Draft1) Paladox: deployment-prep: Follow-up Iaff51065: CONTENT_MODEL_FLOW_BOARD is no longer set by Flow.php [mediawiki-config] - https://gerrit.wikimedia.org/r/325173 (https://phabricator.wikimedia.org/T152348)
[17:11:13] (Draft2) Paladox: deployment-prep: Follow-up Iaff51065: CONTENT_MODEL_FLOW_BOARD is no longer set by Flow.php [mediawiki-config] - https://gerrit.wikimedia.org/r/325173 (https://phabricator.wikimedia.org/T152348)
[17:11:16] Reedy ^^
[17:12:13] (Abandoned) Paladox: deployment-prep: Follow-up Iaff51065: CONTENT_MODEL_FLOW_BOARD is no longer set by Flow.php [mediawiki-config] - https://gerrit.wikimedia.org/r/325173 (https://phabricator.wikimedia.org/T152348) (owner: Paladox)
[17:12:38] (CR) Paladox: "Fixes https://phabricator.wikimedia.org/T152348" [mediawiki-config] - https://gerrit.wikimedia.org/r/325152 (owner: Catrope)
[17:13:19] (CR) Paladox: [C: 1] Avoid using CONTENT_MODEL_FLOW_BOARD [mediawiki-config] - https://gerrit.wikimedia.org/r/325152 (owner: Catrope)
[17:15:18] Operations, Discovery, Labs, Maps, and 2 others: PostgreSQL query planner bug on labsdb1006 - https://phabricator.wikimedia.org/T145599#2845427 (scfc)
[17:50:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 627 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4809070 keys, up 34 days 9 hours - replication_delay is 627
[18:00:56] (PS1) Jcrespo: mariadb: Update check private data script to handle BINARY fields [puppet] - https://gerrit.wikimedia.org/r/325176 (https://phabricator.wikimedia.org/T152194)
[18:10:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4805542 keys, up 34 days 9 hours - replication_delay is 32
[18:54:21] Operations, Ops-Access-Requests, Gerrit: Root for Mukunda for Gerrit machine(s) - https://phabricator.wikimedia.org/T152236#2845508 (Paladox)
[19:56:59] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:04:19] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:23:59] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[20:33:19] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[20:53:35] Operations, EventBus, hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2845622 (Nuria)
[20:53:37] Operations, ops-codfw, Analytics-Kanban, EventBus, Patch-For-Review: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2845621 (Nuria) Open>Resolved
[21:55:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1351
[22:00:09] RECOVERY - check_mysql on lutetium is OK: Uptime: 543484 Threads: 2 Questions: 120873263 Slow queries: 5953 Opens: 6654741 Flush tables: 2 Open tables: 64 Queries per second avg: 222.404 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[22:46:39] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[23:31:49] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 55 failures. Last run 1 minute ago with 55 failures. Failed resources (up to 3 shown): Package[htop],Package[tcpdump],Package[screen],Package[gdb]
[23:59:49] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
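[Editor's note: the check_mysql SLOW_SLAVE state on lutetium above reflects what SHOW SLAVE STATUS reports: the IO and SQL threads are running but Seconds_Behind_Master has climbed to 1351, and it drops back to 0 at the recovery. The sketch below reads the same fields with PyMySQL; the library, host, credentials and the 600-second threshold are placeholders and assumptions, and the production check is the standard check_mysql Nagios plugin, not this script.]

```python
#!/usr/bin/env python3
"""Read the replica-lag fields behind the SLOW_SLAVE alert above.
Illustrative sketch assuming PyMySQL; credentials below are placeholders."""
import sys
import pymysql  # pip install pymysql

def slave_status(host, user, password):
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
    finally:
        conn.close()
    if row is None:
        return None  # not configured as a replica
    return (row["Slave_IO_Running"], row["Slave_SQL_Running"],
            row["Seconds_Behind_Master"])

if __name__ == "__main__":
    status = slave_status("db.example.org", "nagios", "secret")  # placeholders
    if status is None:
        print("OK: not configured as a replica")
        sys.exit(0)
    io, sql, lag = status
    threshold = 600  # illustrative; the real plugin takes -w/-c thresholds
    if io != "Yes" or sql != "Yes" or lag is None or lag > threshold:
        print("SLOW_SLAVE CRITICAL: Slave IO: %s Slave SQL: %s "
              "Seconds Behind Master: %s" % (io, sql, lag))
        sys.exit(2)
    print("OK: Slave IO: %s Slave SQL: %s Seconds Behind Master: %s" % (io, sql, lag))
```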