[00:19:48] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:29:28] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:29:48] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:47:48] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[00:57:28] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[00:57:48] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[01:05:18] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:29:25] (PS1) Legoktm: contint: Add dependencies needed for PoolCounter tests [puppet] - https://gerrit.wikimedia.org/r/325145 (https://phabricator.wikimedia.org/T152338)
[01:33:18] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[01:48:07] (PS1) Tim Landscheidt: Quote "owner" and "group" attributes for file and git::clone resources [puppet] - https://gerrit.wikimedia.org/r/325146
[02:14:38] PROBLEM - MD RAID on bast3001 is CRITICAL: CRITICAL: Active: 4, Working: 4, Failed: 2, Spare: 0
[02:14:49] ACKNOWLEDGEMENT - MD RAID on bast3001 is CRITICAL: CRITICAL: Active: 4, Working: 4, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T152339
[02:14:52] Operations, ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T152339#2845080 (ops-monitoring-bot)
[02:16:59] PROBLEM - Host labservices1001 is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:17:02] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:17:39] PROBLEM - Host 208.80.155.118 is DOWN: CRITICAL - Host Unreachable (208.80.155.118)
[02:19:51] uhhhh
[02:19:52] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.005 second response time
[02:20:17] I think that broke CI
[02:20:49] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.002 second response time
[02:20:49] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.001 second response time
[02:20:53] woah
[02:21:02] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.005 second response time
[02:21:03] labservices1001 is down?
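[Editor's note: the MD RAID alert above reports the array as device counts (Active/Working/Failed/Spare); a degraded array shows Failed > 0 and the RAID handler auto-files a Phabricator task. As a rough illustration of where such counts come from, and not Wikimedia's actual NRPE plugin, the Python sketch below pulls them out of `mdadm --detail` and exits with the Nagios CRITICAL code when any device has failed; mdadm being installed and sufficient privileges are assumptions.]

```python
#!/usr/bin/env python3
"""Summarise md RAID health the way the Icinga check output above reads.

Illustrative sketch only (not the production NRPE plugin); assumes mdadm is
installed and the script has enough privilege to call it."""
import re
import subprocess
import sys

def md_counts(device):
    """Return the Active/Working/Failed/Spare device counts for one array."""
    out = subprocess.run(["mdadm", "--detail", device],
                         capture_output=True, text=True, check=True).stdout
    counts = {}
    for field in ("Active", "Working", "Failed", "Spare"):
        m = re.search(r"%s Devices\s*:\s*(\d+)" % field, out)
        counts[field] = int(m.group(1)) if m else 0
    return counts

def main():
    status, exit_code = "OK", 0
    summaries = []
    for device in sys.argv[1:] or ["/dev/md0"]:
        c = md_counts(device)
        summaries.append("Active: %(Active)d, Working: %(Working)d, "
                         "Failed: %(Failed)d, Spare: %(Spare)d" % c)
        if c["Failed"] > 0:
            status, exit_code = "CRITICAL", 2   # Nagios CRITICAL exit code
    print("%s: %s" % (status, "; ".join(summaries)))
    sys.exit(exit_code)

if __name__ == "__main__":
    main()
```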
[02:21:09] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.003 second response time
[02:21:11] sounds like more than just CI legoktm
[02:21:14] what did you do?
[02:21:19] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.002 second response time
[02:21:25] I didn't do anything...
[02:21:29] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.004 second response time
[02:21:29] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.002 second response time
[02:21:37] Just noticed that CI started failing with cannot find host gerrit.wm.o errors
[02:21:49] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.002 second response time
[02:21:51] CI will be using the labs DNS servers
[02:21:52] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.006 second response time
[02:21:52] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.006 second response time
[02:21:52] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.005 second response time
[02:21:52] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.002 second response time
[02:22:47] we think it is related to a recent change? legoktm Krenair ?
[02:22:58] nothing should have changed?
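[Editor's note: every toolschecker alert in this storm follows one pattern: an HTTP GET against an endpoint on checker.tools.wmflabs.org whose body is expected to contain the string "OK"; here the proxy answers 502, the string is missing, and the check goes CRITICAL. The production checks are Icinga check_http invocations, so the Python sketch below is only an illustration of the same expect-a-string check, using the standard library; the default URL is copied from the alerts.]

```python
#!/usr/bin/env python3
"""Minimal "expect a string in the body" HTTP check, in the spirit of the
toolschecker alerts above. Illustrative only; the production checks use
Icinga's check_http plugin, not this script."""
import sys
import time
import urllib.request

def check(url, expected="OK", timeout=10):
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            status_line = "HTTP/1.1 %d %s" % (resp.status, resp.reason)
    except Exception as exc:  # a 502 raises HTTPError; DNS/socket errors land here too
        print("CRITICAL: %s - %s" % (url, exc))
        return 2
    elapsed = time.time() - start
    if expected in body:
        print("OK: %s - %d bytes in %.3f second response time"
              % (status_line, len(body), elapsed))
        return 0
    print("CRITICAL: %s - string %s not found on %s - %d bytes in %.3f second "
          "response time" % (status_line, expected, url, len(body), elapsed))
    return 2

if __name__ == "__main__":
    sys.exit(check(sys.argv[1] if len(sys.argv) > 1
                   else "http://checker.tools.wmflabs.org:80/self"))
```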
[02:23:19] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.002 second response time
[02:23:19] oh, sorry
[02:23:23] ok thanks, just making sure
[02:23:25] I think labs DNS just died
[02:23:29] I misread 'I think that broke CI'
[02:23:37] my head added an extra 'I' in there
[02:23:39] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.002 second response time
[02:23:39] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 1.003 second response time
[02:23:44] ok checking the checker
[02:24:01] the real underlying problem will be in this godog
[02:24:02] PROBLEM - Host labservices1001 is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:24:02] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:24:02] PROBLEM - Host 208.80.155.118 is DOWN: CRITICAL - Host Unreachable (208.80.155.118)
[02:24:07] not tools stuff
[02:24:36] ah, thanks I'll check that now
[02:25:49] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.002 second response time
[02:25:49] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 1.005 second response time
[02:25:49] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 1.005 second response time
[02:25:59] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.002 second response time
[02:25:59] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.002 second response time
[02:25:59] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.001 second response time
[02:26:02] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.002 second response time
[02:26:02] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.004 second response time
[02:26:02] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.004 second response time
[02:27:01] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.003 second response time
[02:27:59] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.002 second response time
[02:27:59] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.004 second response time
[02:27:59] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.005 second response time
[02:27:59] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.006 second response time
[02:30:09] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.002 second response time
[02:30:09] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.002 second response time
[02:30:12] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.002 second response time
[02:30:12] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.002 second response time
[02:30:12] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.002 second response time
[02:30:12] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.002 second response time
[02:30:15] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.002 second response time
[02:30:15] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.002 second response time
[02:30:15] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.002 second response time
[02:30:59] !log silence checker.tools.wmflabs.org for 2h
[02:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:39] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 18.431 second response time
[02:32:59] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 3.207 second response time
[02:33:03] I logged onto tools-bastion-03 and it still appears functional
[02:34:22] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 14.955 second response time
[02:34:42] yeah looks like it is recovering, still investigating
[02:35:09] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 3 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[02:42:29] RECOVERY - Host labservices1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[02:42:32] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 18.131 second response time
[02:42:32] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 18.133 second response time
[02:42:39] RECOVERY - Host 208.80.155.118 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[02:42:49] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[02:42:54] !log powercycle labservices1001
[02:42:59] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.006 second response time
[02:42:59] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.024 second response time
[02:42:59] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.031 second response time
[02:43:02] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.031 second response time
[02:43:02] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.160 second response time
[02:43:02] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.828 second response time
[02:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:49] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.716 second response time
[02:45:29] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.619 second response time
[02:45:59] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:57:39] Operations, Labs: labservices1001 down - https://phabricator.wikimedia.org/T152340#2845112 (fgiunchedi) a:Andrew
[02:57:45] Operations, Labs, Labs-Infrastructure: labservices1001 down - https://phabricator.wikimedia.org/T152340#2845114 (Krenair)
[03:00:31] (CR) Alex Monk: "racktables still has the old name" [dns] - https://gerrit.wikimedia.org/r/255047 (https://phabricator.wikimedia.org/T106303) (owner: Andrew Bogott)
[03:02:19] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[03:14:59] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[03:18:59] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:25:19] Operations, Labs, Labs-Infrastructure: labservices1001 down - https://phabricator.wikimedia.org/T152340#2845101 (Legoktm) > Note that during this time dns from labs instances seemed fine, why toolschecker failed needs investigation Well, it wasn't all fine. CI tests were failing with stuff like: ```...
[03:32:29] PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:47:59] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[04:00:39] RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[04:15:43] (PS1) Catrope: Avoid using CONTENT_MODEL_FLOW_BOARD [mediawiki-config] - https://gerrit.wikimedia.org/r/325152
[04:15:49] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:17:02] (CR) Catrope: "Whoops, thanks for fixing this. I've created https://gerrit.wikimedia.org/r/325152 for production and scheduled it for SWAT on Monday." [mediawiki-config] - https://gerrit.wikimedia.org/r/325119 (owner: Alex Monk)
[04:44:49] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[05:39:29] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:07:29] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:11:39] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:12:29] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[06:51:36] Puppet, Continuous-Integration-Config, Jenkins: There is no sane way to get arcanist's conduit tokens onto nodepool CI slaves - https://phabricator.wikimedia.org/T140417#2845161 (mmodell) Open>Invalid
[06:52:22] Puppet, Labs, Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2845162 (mmodell) @paladox: is this now working on phab-01?
[06:53:04] Puppet, Labs, Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2845163 (mmodell) a:mmodell>None rather, the main role now works on labs, correct?
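[Editor's note: the underlying failure was labservices1001 / labs-ns0, the labs DNS recursor, becoming unreachable, which is why CI slaves suddenly could not resolve gerrit.wikimedia.org. A quick way to separate "my resolver is dead" from "the record is gone" is to compare the instance's stub resolver with a direct query to each recursor. The Python sketch below does that with the standard library plus dig; the recursor IPs are taken from the host alerts above, and the script itself is an illustration, not an existing tool.]

```python
#!/usr/bin/env python3
"""Quick triage for "cannot find host ..." errors like the CI failures above:
check the instance's own resolver, then ask the labs recursors directly.
Illustrative sketch; the IPs below are the ones named in the Icinga alerts."""
import socket
import subprocess

HOSTNAME = "gerrit.wikimedia.org"
# labservices1001 / labs-ns0 addresses from the host alerts above
NAMESERVERS = ["208.80.155.117", "208.80.155.118"]

def local_resolver_ok(name):
    """Does the resolver configured in /etc/resolv.conf answer for name?"""
    try:
        return bool(socket.gethostbyname(name))
    except socket.gaierror:
        return False

def server_answers(server, name):
    """Ask one recursor directly, bypassing the local stub resolver."""
    result = subprocess.run(
        ["dig", "+short", "+time=3", "+tries=1", "@" + server, name, "A"],
        capture_output=True, text=True)
    return result.returncode == 0 and result.stdout.strip() != ""

if __name__ == "__main__":
    print("local resolver:", "ok" if local_resolver_ok(HOSTNAME) else "FAILING")
    for ns in NAMESERVERS:
        print("%s:" % ns, "answers" if server_answers(ns, HOSTNAME) else "NO ANSWER")
```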
[07:01:49] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[07:07:29] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.199 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[07:07:49] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199
[07:08:29] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0
[07:08:49] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[07:22:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 620 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4838537 keys, up 33 days 23 hours - replication_delay is 620
[07:27:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4821176 keys, up 33 days 23 hours - replication_delay is 32
[07:29:49] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:24:39] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[08:25:29] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:52:39] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[08:53:29] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:06:09] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4936.70 Read Requests/Sec=4553.20 Write Requests/Sec=286.80 KBytes Read/Sec=21649.60 KBytes_Written/Sec=7822.80
[09:13:09] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=10.60 Read Requests/Sec=0.80 Write Requests/Sec=18.50 KBytes Read/Sec=18.40 KBytes_Written/Sec=83.60
[09:43:39] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:45:39] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[11:22:49] Puppet, Labs, Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2845248 (Paladox) @mmodell I haven't applied this role to phab-01 yet but I have applied it to the phabricator role so looks like it is working on labs (no failures)
[11:59:22] (CR) Mobrovac: [C: -1] (WIP) services: create global service restart script (3 comments) [puppet] - https://gerrit.wikimedia.org/r/325039 (owner: Dzahn)
[13:15:29] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:17:29] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[13:24:29] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
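[Editor's note: the rdb2006 alert fires when the replica's replication delay crosses a threshold; the "620 600" in the message reads as measured delay versus the 600-second limit. A rough equivalent of that measurement, assuming the redis-py client is available and using master_last_io_seconds_ago as the staleness signal, is sketched below; the production check is a separate Icinga plugin, and the host, port and threshold are simply copied from the alert for illustration.]

```python
#!/usr/bin/env python3
"""Rough equivalent of the "replication_delay is 620 600" alert above: report
how stale a Redis replica is and compare against a threshold. Illustrative
sketch assuming the redis-py client; not the production check script."""
import sys
import redis  # pip install redis

def replication_delay(host, port):
    info = redis.StrictRedis(host=host, port=port, socket_timeout=5).info("replication")
    if info.get("role") != "slave":
        return 0  # a master has nothing to lag behind
    if info.get("master_link_status") != "up":
        return float("inf")
    # seconds since the replica last heard from its master
    return info.get("master_last_io_seconds_ago", 0)

if __name__ == "__main__":
    host, port, threshold = "10.192.48.44", 6479, 600  # values from the alert above
    delay = replication_delay(host, port)
    if delay > threshold:
        print("CRITICAL: replication_delay is %s %s" % (delay, threshold))
        sys.exit(2)
    print("OK: replication_delay is %s" % delay)
```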
[13:25:29] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[13:32:29] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:34:19] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[13:36:12] Operations, ChangeProp, Parsing-Team, Parsoid, and 5 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2845282 (mobrovac) On the services side, we could probably move the update flow (RB, Parsoid, MobileApps et a...
[13:41:59] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:00:29] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:49] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[14:08:59] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[14:28:29] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[14:29:49] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[14:49:35] (CR) Mobrovac: [C: 1] node service - allow empty entry point [puppet] - https://gerrit.wikimedia.org/r/324190 (https://phabricator.wikimedia.org/T150021) (owner: Gehel)
[16:37:09] PROBLEM - puppet last run on prometheus1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:02:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[17:03:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4807632 keys, up 34 days 8 hours - replication_delay is 38
[17:04:50] (CR) Reedy: "Needs doing in CommonSettings.php too" [mediawiki-config] - https://gerrit.wikimedia.org/r/325119 (owner: Alex Monk)
[17:05:09] RECOVERY - puppet last run on prometheus1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[17:09:29] PROBLEM - Disk space on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:29] PROBLEM - Check size of conntrack table on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:29] PROBLEM - salt-minion processes on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:39] PROBLEM - Check whether ferm is active by checking the default input chain on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:39] PROBLEM - dhclient process on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:49] PROBLEM - configured eth on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:49] (CR) Alex Monk: "Yep, that's what Roan's commit is for" [mediawiki-config] - https://gerrit.wikimedia.org/r/325119 (owner: Alex Monk)
[17:09:49] PROBLEM - Check systemd state on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:49] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:49] PROBLEM - DPKG on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:10:19] RECOVERY - Disk space on bast3001 is OK: DISK OK
[17:10:19] RECOVERY - salt-minion processes on bast3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:10:19] RECOVERY - Check size of conntrack table on bast3001 is OK: OK: nf_conntrack is 0 % full
[17:10:29] RECOVERY - Check whether ferm is active by checking the default input chain on bast3001 is OK: OK ferm input default policy is set
[17:10:29] RECOVERY - dhclient process on bast3001 is OK: PROCS OK: 0 processes with command name dhclient
[17:10:39] RECOVERY - configured eth on bast3001 is OK: OK - interfaces up
[17:10:39] RECOVERY - Check systemd state on bast3001 is OK: OK - running: The system is fully operational
[17:10:39] RECOVERY - DPKG on bast3001 is OK: All packages OK
[17:10:49] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures
[17:11:11] (Draft1) Paladox: deployment-prep: Follow-up Iaff51065: CONTENT_MODEL_FLOW_BOARD is no longer set by Flow.php [mediawiki-config] - https://gerrit.wikimedia.org/r/325173 (https://phabricator.wikimedia.org/T152348)
[17:11:13] (Draft2) Paladox: deployment-prep: Follow-up Iaff51065: CONTENT_MODEL_FLOW_BOARD is no longer set by Flow.php [mediawiki-config] - https://gerrit.wikimedia.org/r/325173 (https://phabricator.wikimedia.org/T152348)
[17:11:16] Reedy ^^
[17:12:13] (Abandoned) Paladox: deployment-prep: Follow-up Iaff51065: CONTENT_MODEL_FLOW_BOARD is no longer set by Flow.php [mediawiki-config] - https://gerrit.wikimedia.org/r/325173 (https://phabricator.wikimedia.org/T152348) (owner: Paladox)
[17:12:38] (CR) Paladox: "Fixes https://phabricator.wikimedia.org/T152348" [mediawiki-config] - https://gerrit.wikimedia.org/r/325152 (owner: Catrope)
[17:13:19] (CR) Paladox: [C: 1] Avoid using CONTENT_MODEL_FLOW_BOARD [mediawiki-config] - https://gerrit.wikimedia.org/r/325152 (owner: Catrope)
[17:15:18] Operations, Discovery, Labs, Maps, and 2 others: PostgreSQL query planner bug on labsdb1006 - https://phabricator.wikimedia.org/T145599#2845427 (scfc)
[17:50:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 627 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4809070 keys, up 34 days 9 hours - replication_delay is 627
[18:00:56] (PS1) Jcrespo: mariadb: Update check private data script to handle BINARY fields [puppet] - https://gerrit.wikimedia.org/r/325176 (https://phabricator.wikimedia.org/T152194)
[18:10:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4805542 keys, up 34 days 9 hours - replication_delay is 32
[18:54:21] Operations, Ops-Access-Requests, Gerrit: Root for Mukunda for Gerrit machine(s) - https://phabricator.wikimedia.org/T152236#2845508 (Paladox)
[19:56:59] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:04:19] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:23:59] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[20:33:19] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[20:53:35] Operations, EventBus, hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2845622 (Nuria)
[20:53:37] Operations, ops-codfw, Analytics-Kanban, EventBus, Patch-For-Review: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2845621 (Nuria) Open>Resolved
[21:55:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1351
[22:00:09] RECOVERY - check_mysql on lutetium is OK: Uptime: 543484 Threads: 2 Questions: 120873263 Slow queries: 5953 Opens: 6654741 Flush tables: 2 Open tables: 64 Queries per second avg: 222.404 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[22:46:39] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[23:31:49] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 55 failures. Last run 1 minute ago with 55 failures. Failed resources (up to 3 shown): Package[htop],Package[tcpdump],Package[screen],Package[gdb]
[23:59:49] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
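[Editor's note: the check_mysql SLOW_SLAVE state on lutetium above reflects what SHOW SLAVE STATUS reports: the IO and SQL threads are running but Seconds_Behind_Master has climbed to 1351, and it drops back to 0 at the recovery. The sketch below reads the same fields with PyMySQL; the library, host, credentials and the 600-second threshold are placeholders and assumptions, and the production check is the standard check_mysql Nagios plugin, not this script.]

```python
#!/usr/bin/env python3
"""Read the replica-lag fields behind the SLOW_SLAVE alert above.
Illustrative sketch assuming PyMySQL; credentials below are placeholders."""
import sys
import pymysql  # pip install pymysql

def slave_status(host, user, password):
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
    finally:
        conn.close()
    if row is None:
        return None  # not configured as a replica
    return (row["Slave_IO_Running"], row["Slave_SQL_Running"],
            row["Seconds_Behind_Master"])

if __name__ == "__main__":
    status = slave_status("db.example.org", "nagios", "secret")  # placeholders
    if status is None:
        print("OK: not configured as a replica")
        sys.exit(0)
    io, sql, lag = status
    threshold = 600  # illustrative; the real plugin takes -w/-c thresholds
    if io != "Yes" or sql != "Yes" or lag is None or lag > threshold:
        print("SLOW_SLAVE CRITICAL: Slave IO: %s Slave SQL: %s "
              "Seconds Behind Master: %s" % (io, sql, lag))
        sys.exit(2)
    print("OK: Slave IO: %s Slave SQL: %s Seconds Behind Master: %s" % (io, sql, lag))
```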