[00:05:55] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:55] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:15] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=170.00 Read Requests/Sec=205.20 Write Requests/Sec=15.30 KBytes Read/Sec=2879.60 KBytes_Written/Sec=490.80 [00:08:45] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [00:08:46] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:58:05] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:05:35] PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:35] PROBLEM - MD RAID on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:35] PROBLEM - dhclient process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:36] PROBLEM - gerrit process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:36] PROBLEM - puppet last run on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:45] PROBLEM - Check systemd state on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:55] PROBLEM - Disk space on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:55] PROBLEM - DPKG on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:15] PROBLEM - Check the NTP synchronisation status of timesyncd on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:25] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [01:06:45] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:06:45] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [01:07:35] PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:35] PROBLEM - Check size of conntrack table on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:35] PROBLEM - salt-minion processes on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
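The conf2002 alerts above pair a "degraded" systemd state with a failed etcdmirror-conftool-eqiad-wmnet unit. A minimal triage sketch using standard systemd tooling follows; the unit name is taken from the alert, and whether a plain restart is the right remediation is an assumption, not something this log confirms.

    # List the failed units behind the "degraded" system state
    systemctl --failed
    # Inspect the unit named in the alert and its recent log output
    systemctl status etcdmirror-conftool-eqiad-wmnet
    journalctl -u etcdmirror-conftool-eqiad-wmnet --since "1 hour ago" --no-pager
    # Only once the failure cause is understood: clear the failed state and restart
    # sudo systemctl restart etcdmirror-conftool-eqiad-wmnet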
[01:07:35] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [01:08:25] RECOVERY - Check whether ferm is active by checking the default input chain on cobalt is OK: OK ferm input default policy is set [01:08:26] RECOVERY - Check size of conntrack table on cobalt is OK: OK: nf_conntrack is 0 % full [01:08:26] RECOVERY - salt-minion processes on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:08:26] RECOVERY - configured eth on cobalt is OK: OK - interfaces up [01:08:26] RECOVERY - dhclient process on cobalt is OK: PROCS OK: 0 processes with command name dhclient [01:08:26] RECOVERY - MD RAID on cobalt is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [01:08:26] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [01:08:27] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures [01:09:22] RainbowSprinkles ^^ [01:09:35] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set [01:09:45] RECOVERY - Disk space on cobalt is OK: DISK OK [01:09:45] RECOVERY - DPKG on cobalt is OK: All packages OK [01:10:22] gerrit is not loading for me [01:11:05] just slow [01:11:05] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:05] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:05] PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:05] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:05] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:06] PROBLEM - Unmerged changes on repository puppet on labtestcontrol2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:06] PROBLEM - Unmerged changes on repository puppet on puppetmaster1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:33] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3148691 (10Paladox) Happended again on the 02/04/17 bst time. PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:05... [01:11:35] PROBLEM - puppet last run on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:39] Reedy it's not loading the status:open for me [01:12:55] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [01:12:56] RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge. [01:12:56] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. [01:12:56] RECOVERY - Unmerged changes on repository puppet on labcontrol1001 is OK: No changes to merge. [01:12:56] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge. [01:12:56] RECOVERY - Unmerged changes on repository puppet on labtestcontrol2001 is OK: No changes to merge. 
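Several puppetmasters briefly alert on "Unmerged changes on repository puppet". A check of that kind essentially asks whether the locally checked-out branch is behind its upstream; a minimal sketch of that comparison is below. The repository path and branch name are illustrative assumptions, and the actual NRPE plugin on these hosts may be implemented differently.

    # Hypothetical checkout path and branch; adjust to the host's actual layout
    cd /var/lib/git/operations/puppet
    git fetch origin
    # Commits present upstream but not yet merged into the local checkout
    git log --oneline HEAD..origin/production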
[01:12:56] RECOVERY - Unmerged changes on repository puppet on puppetmaster1002 is OK: No changes to merge. [01:14:35] PROBLEM - Check size of conntrack table on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:35] PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:35] PROBLEM - salt-minion processes on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:35] PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:35] PROBLEM - MD RAID on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:36] PROBLEM - dhclient process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:41] ^^ [01:14:45] PROBLEM - gerrit process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:15:05] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 1 failures [01:15:18] Looks like there's been a load spike on it [01:15:40] Is this the same problem we had with lead? [01:15:49] Where the storage system became very slow [01:15:55] PROBLEM - DPKG on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:15:56] PROBLEM - Disk space on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:16:04] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3148692 (10Paladox) a minute later after reporting the recovery errors, the problem thing came again PROBLEM - Check size of conntrack table on cobalt is CRITI... [01:16:07] No idea. AFAIK I don't have access to the system [01:16:35] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [01:16:36] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [01:16:36] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [01:16:44] Should one of the ops be paged? [01:17:22] Reedy ^^ [01:17:45] RECOVERY - Disk space on cobalt is OK: DISK OK [01:17:46] RECOVERY - DPKG on cobalt is OK: All packages OK [01:18:25] RECOVERY - Check whether ferm is active by checking the default input chain on cobalt is OK: OK ferm input default policy is set [01:18:25] RECOVERY - Check size of conntrack table on cobalt is OK: OK: nf_conntrack is 0 % full [01:18:25] RECOVERY - salt-minion processes on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:18:27] RECOVERY - configured eth on cobalt is OK: OK - interfaces up [01:18:27] RECOVERY - dhclient process on cobalt is OK: PROCS OK: 0 processes with command name dhclient [01:18:27] RECOVERY - MD RAID on cobalt is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [01:19:14] Nah, it's probably fine [01:19:24] ok [01:20:05] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 94 seconds ago with 0 failures [01:21:55] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:22:18] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3148694 (10demon) None of the last two comments are related to the issue here. That sounds like icinga flapping, not the machine or service itself. 
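demon's ticket comment above points at icinga/NRPE flapping rather than a real cobalt problem, since many unrelated checks time out at once. One way to tell the two apart is to run the same NRPE check by hand from the monitoring host with a longer timeout; a sketch follows, assuming the plugin lives in the usual Debian path and that the remote command name matches the alert (both assumptions).

    # From the icinga host: does NRPE on cobalt answer at all, and how slowly?
    /usr/lib/nagios/plugins/check_nrpe -H cobalt.wikimedia.org -t 30
    # Re-run one of the timing-out checks with a generous timeout
    # (the command name here is hypothetical; use the one defined in the host's nrpe config)
    /usr/lib/nagios/plugins/check_nrpe -H cobalt.wikimedia.org -c check_disk_space -t 60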
[01:22:45] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [01:22:50] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3148695 (10Paladox) >>! In T148478#3148694, @demon wrote: > None of the last two comments are related to the issue here. That sounds like icinga flapping, not the machine or... [01:23:16] Considering there are several other services complaining about CHECK_NRPE, I'm more inclined to blame icinga. [01:23:37] It did get pretty slow [01:23:48] https://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=cobalt.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [01:24:04] Big load of grey on the load/procs graph [01:24:06] I found it slow too; now it's working. [01:24:48] Reedy: Lots of incoming connections in the apache stats. *shrug* [01:25:35] I got a few of the popup error 0 things [01:25:59] There's not some 1am UTC cronjob, is there? [01:26:36] I thought we got rid of the cronjob? [01:26:47] oh, we didn't [01:26:49] Reedy https://github.com/wikimedia/puppet/blob/production/modules/gerrit/manifests/crons.pp [01:27:05] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [01:27:30] Doubt that caused it [01:29:45] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 58377.022587 Seconds [01:29:55] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 58382.013225 Seconds [01:29:55] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 58383.437874 Seconds [01:30:13] Reedy: maybe puppet needs to be run manually. I've been testing puppet privately with a smaller but similar setup and got the same error, and triggering a puppet run manually tends to fix it, or at least temporarily bandage it [01:30:24] What [01:30:25] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 58433.575453 Seconds [01:30:26] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 58435.788166 Seconds [01:30:40] I'm referring to popup error 0 [01:31:04] That's nothing to do with puppet [01:32:35] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 58560.26345 Seconds [01:32:55] Oh, I see now, never mind, my bad [01:33:55] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [01:35:35] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [01:36:05] RECOVERY - Check the NTP synchronisation status of timesyncd on cobalt is OK: OK: synced at Sun 2017-04-02 01:35:57 UTC. [01:36:55] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 58803.558914 Seconds [01:37:45] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:37:45] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
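On the "1am UTC cronjob" question and the link to modules/gerrit/manifests/crons.pp: a quick way to see what is actually scheduled on the host around that time is sketched below. The gerrit2 user is inferred from the gerrit.war process check earlier in the log, and the crontab locations are the standard Debian ones, not confirmed by anything here.

    # Per-user crontab for the account gerrit runs as (user name inferred from the process check)
    sudo crontab -l -u gerrit2
    # System cron fragments, typically puppet-managed
    grep -rn . /etc/cron.d/ /etc/crontab 2>/dev/null
    # What cron actually launched around 01:0x UTC
    grep 'CRON' /var/log/syslog | grep ' 01:0'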
[01:38:35] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 58920.283176 Seconds [01:38:45] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [01:38:45] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:39:25] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:39:59] What's going on with postgres? [01:40:20] It does it every night [01:41:35] PROBLEM - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.138 and port 9042: Connection refused [01:41:45] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:41:55] PROBLEM - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:41:56] PROBLEM - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:41:56] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [01:42:26] PROBLEM - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.48 and port 9042: Connection refused [01:42:26] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 59153.534169 Seconds [01:43:05] RECOVERY - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is OK: SSL OK - Certificate restbase2005-c valid until 2017-09-12 15:35:38 +0000 (expires in 163 days) [01:43:25] RECOVERY - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is OK: TCP OK - 0.036 second response time on 10.192.48.48 port 9042 [01:43:42] Reedy: I take it it has something to do with backing up? [01:44:02] Don't think so [01:44:38] Is it normal? [01:45:36] Nope [01:45:55] I think Ops have looked at it superficially, and didn't have any ideas, but I don't know if anyone has really dug into it [01:46:21] I would look into it, but 1) I'm too lazy to right now, and 2) I probably don't have that kind of access [01:48:55] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:48:55] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:49:05] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
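The maps* hosts flap on "Postgres Replication Lag" with Rep Delay of roughly the same number of seconds on every replica, which fits Reedy's remark that it happens every night. The check is essentially the replay lag on a standby; a sketch of the underlying query is below, assuming local psql access as the postgres user. The actual monitoring plugin may compute the figure differently.

    # Approximate replay lag in seconds, run on a standby
    # (pg_last_xact_replay_timestamp is standard PostgreSQL)
    sudo -u postgres psql -t -c \
      "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS replay_lag_seconds;"
    # Note: this reads high when the primary is idle, since the replay timestamp stops advancing.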
[01:50:55] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [01:51:45] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [01:51:55] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:53:55] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 59822.032782 Seconds [01:55:55] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [01:57:25] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:57:45] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 23.947743 Seconds [01:57:55] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 28.175942 Seconds [01:58:25] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [01:59:35] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 23.613795 Seconds [02:01:45] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [02:01:55] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [02:02:35] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.138 port 9042 [02:02:55] RECOVERY - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-b valid until 2017-09-12 15:35:25 +0000 (expires in 163 days) [02:12:25] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [02:13:15] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set [02:17:55] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [02:18:25] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [02:18:46] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [02:18:46] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active [02:19:25] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set [02:21:45] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:21:46] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [02:22:55] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
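The kubernetes100x hosts keep flipping between "ferm input drop default policy not set" and OK, which looks like the check landing while ferm is being reloaded. The condition the check tests, that the INPUT chain's default policy is DROP, can be verified by hand as sketched below; the actual plugin and the reason for the flapping are not shown in this log, so treat it as illustrative only.

    # Default policy of the INPUT chain; ferm should leave it as DROP
    sudo iptables -S INPUT | head -n1     # expect: -P INPUT DROP
    # ferm's own status and last reload
    sudo systemctl status ferm --no-pager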
[02:23:45] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [02:24:26] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 09m 44s) [02:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:35] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [02:25:45] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [02:26:45] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set [02:31:55] PROBLEM - Disk space on labtestcontrol2001 is CRITICAL: DISK CRITICAL - free space: / 342 MB (3% inode=69%) [02:51:55] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [02:53:55] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set [03:28:15] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [03:28:35] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [03:29:25] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:30:05] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [03:30:45] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:32:45] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz] [03:36:55] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [03:37:05] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [03:37:25] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=824.60 Read Requests/Sec=353.50 Write Requests/Sec=0.90 KBytes Read/Sec=37978.40 KBytes_Written/Sec=34.00 [03:37:55] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
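The restbase2001/restbase2009 alerts pair a failed cassandra-a unit with refused connections on the CQL (9042) and inter-node SSL (7001) ports, i.e. the instance is down rather than merely misbehaving. Probing those two ports directly is sketched below, with the address taken from the restbase2001 alert; nc and openssl are standard tools, and this says nothing about why the instance keeps dying.

    # Is anything listening on the CQL port? (address from the restbase2001 alert)
    nc -zv 10.192.16.162 9042
    # Does the inter-node TLS port complete a handshake and present a certificate?
    openssl s_client -connect 10.192.16.162:7001 </dev/null 2>/dev/null | openssl x509 -noout -enddate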
[03:37:55] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [03:38:05] PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [03:38:25] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 163 days) [03:38:25] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused [03:38:35] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [03:41:55] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [03:42:55] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set [03:45:35] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [03:46:15] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [03:46:55] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:47:05] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [03:47:35] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:50:25] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=205.70 Read Requests/Sec=298.80 Write Requests/Sec=7.60 KBytes Read/Sec=2556.40 KBytes_Written/Sec=222.40 [03:57:25] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [04:00:45] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [04:03:05] RECOVERY - cassandra-a service on restbase2009 is OK: OK - cassandra-a is active [04:03:55] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [04:04:25] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.54 port 9042 [04:05:55] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [04:06:05] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [04:07:15] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 163 days) [04:07:35] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [04:07:55] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[04:08:05] PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:08:25] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused [04:12:15] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:12:35] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [04:13:05] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [04:13:25] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [04:14:55] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:15:05] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:15:35] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [04:32:55] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [04:33:05] RECOVERY - cassandra-a service on restbase2009 is OK: OK - cassandra-a is active [04:33:55] RECOVERY - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-a valid until 2017-09-12 15:36:07 +0000 (expires in 163 days) [04:34:25] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.54 port 9042 [04:35:55] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [04:36:05] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [04:36:55] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [04:37:05] PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:37:25] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused [04:37:35] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [04:37:55] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
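"puppet last run ... Catalog fetch fail" alerts recur on individual hosts throughout the night and clear on the next run, which usually indicates a transient compile or puppetmaster hiccup rather than a broken manifest. When one does not clear on its own, a manual run on the affected host shows the actual error; a minimal sketch using standard puppet agent options:

    # Re-run the agent interactively to see the compile/fetch error in full
    sudo puppet agent --test
    # Or compile and report without applying any changes
    sudo puppet agent --test --noop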
[04:37:55] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 163 days) [04:39:25] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200) [04:40:15] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [04:41:25] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [04:41:35] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [04:41:55] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:42:05] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:53:45] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:56:35] PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:02:55] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [05:03:05] RECOVERY - cassandra-a service on restbase2009 is OK: OK - cassandra-a is active [05:03:25] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.54 port 9042 [05:03:56] RECOVERY - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-a valid until 2017-09-12 15:36:07 +0000 (expires in 163 days) [05:05:55] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [05:06:05] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [05:06:25] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused [05:06:55] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:06:55] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [05:07:05] PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [05:08:35] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [05:09:25] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:10:15] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [05:10:55] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:11:05] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [05:16:35] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:22:45] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [05:25:35] RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [05:33:55] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [05:34:05] RECOVERY - cassandra-a service on restbase2009 is OK: OK - cassandra-a is active [05:34:15] RECOVERY - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-a valid until 2017-09-12 15:36:07 +0000 (expires in 163 days) [05:34:25] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.54 port 9042 [05:36:55] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [05:37:05] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [05:37:35] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [05:37:45] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 163 days) [05:44:35] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [05:47:55] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:14:55] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:28:55] RECOVERY - Disk space on labtestcontrol2001 is OK: DISK OK [06:30:49] (03PS2) 10ArielGlenn: retry failed page content pieces immediately after page content step completes [dumps] - 10https://gerrit.wikimedia.org/r/345985 (https://phabricator.wikimedia.org/T160507) [06:32:45] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:36:17] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148786 (10Ankry) Also available already: [[https://commons.wikimedia.org/wiki/File:Vladimir_Frolochkin.JPG]] [[https://c... [06:45:00] (03PS1) 10ArielGlenn: add new config options for dumps of big wikis [puppet] - 10https://gerrit.wikimedia.org/r/346037 [06:51:45] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:53:55] (03CR) 10ArielGlenn: [C: 032] add new config options for dumps of big wikis [puppet] - 10https://gerrit.wikimedia.org/r/346037 (owner: 10ArielGlenn) [06:59:54] (03PS1) 10ArielGlenn: add new config settings for en wikipedia dumps [puppet] - 10https://gerrit.wikimedia.org/r/346038 [07:01:45] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [07:03:39] (03CR) 10ArielGlenn: [C: 032] add new config settings for en wikipedia dumps [puppet] - 10https://gerrit.wikimedia.org/r/346038 (owner: 10ArielGlenn) [07:20:40] (03CR) 10ArielGlenn: [C: 032] retry failed page content pieces immediately after page content step completes [dumps] - 10https://gerrit.wikimedia.org/r/345985 (https://phabricator.wikimedia.org/T160507) (owner: 10ArielGlenn) [07:21:45] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:25:39] !log ariel@tin Started deploy [dumps/dumps@1ac3fb3]: var/method name cleanups, refactor, pregenerate page ranges for page content jobs, auto retry of failed page ranges [07:25:42] !log ariel@tin Finished deploy [dumps/dumps@1ac3fb3]: var/method name cleanups, refactor, pregenerate page ranges for page content jobs, auto retry of failed page ranges (duration: 00m 03s) [07:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:14] (03PS1) 10ArielGlenn: Revert "disable full dumps cron job for a bit" [puppet] - 10https://gerrit.wikimedia.org/r/346039 [07:27:29] (03PS2) 10ArielGlenn: Revert "disable full dumps cron job for a bit" [puppet] - 10https://gerrit.wikimedia.org/r/346039 [07:30:27] (03CR) 10ArielGlenn: [C: 032] Revert "disable full dumps cron job for a bit" [puppet] - 10https://gerrit.wikimedia.org/r/346039 (owner: 10ArielGlenn) [08:02:45] PROBLEM - puppet last run on mc1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:16:12] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3148796 (10hashar) Machine had a load spike at 1:00am. It shows high disk IOPS since 1:00 and the disk utilisation largely exploded. There is 35-45% CPU usage for `md1_raid... [08:18:28] !log powercycle ms-be1016 (stuck in console, answers pings but not ssh) [08:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:55] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:21:05] PROBLEM - Host ms-be1016 is DOWN: PING CRITICAL - Packet loss = 100% [08:21:45] RECOVERY - SSH on ms-be1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [08:21:55] RECOVERY - Host ms-be1016 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [08:22:05] RECOVERY - dhclient process on ms-be1016 is OK: PROCS OK: 0 processes with command name dhclient [08:22:05] RECOVERY - Check size of conntrack table on ms-be1016 is OK: OK: nf_conntrack is 9 % full [08:22:06] RECOVERY - swift-container-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:22:06] RECOVERY - swift-container-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:22:06] RECOVERY - swift-container-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:22:06] RECOVERY - swift-object-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:22:06] RECOVERY - swift-object-server on ms-be1016 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:22:07] RECOVERY - DPKG on ms-be1016 is OK: All packages OK [08:22:07] RECOVERY - MD RAID on ms-be1016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [08:22:08] RECOVERY - swift-account-auditor on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:22:08] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1016 is OK: OK ferm input default policy is set [08:22:09] RECOVERY - swift-account-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [08:23:33] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3106352 (10elukey) Just powercycled ms-be1016 that was stuck in console (pingable but no ssh available): `[11674384.225319] BUG: soft lockup - CPU#12 stuck for 22s... [08:29:45] RECOVERY - NTP on ms-be1016 is OK: NTP OK: Offset -5.477666855e-05 secs [08:30:46] RECOVERY - puppet last run on mc1029 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [08:47:55] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [09:33:45] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:49:45] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:02:45] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [10:17:45] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:57:24] (03PS1) 10DatGuy: Convert reference lists to 'responsive' on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346043 (https://phabricator.wikimedia.org/T161804) [11:09:07] (03PS2) 10DatGuy: Convert reference lists to 'responsive' on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346043 (https://phabricator.wikimedia.org/T161804) [11:29:32] (03PS1) 10DatGuy: Configure Babel for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346044 (https://phabricator.wikimedia.org/T161593) [11:37:24] (03CR) 10Luke081515: [C: 031] Configure Babel for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346044 (https://phabricator.wikimedia.org/T161593) (owner: 10DatGuy) [11:39:45] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:07:45] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:24:15] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:30:30] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3148902 (10Cmjohnson) @elukey yes it's a known problem. I have the new part but @fgiunchedi is out this week. We'll take care of it next week. https://phabricator... [12:53:15] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:55:35] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1970.80 Read Requests/Sec=3035.40 Write Requests/Sec=0.20 KBytes Read/Sec=27955.20 KBytes_Written/Sec=39.60 [13:03:45] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:06:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:08:35] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=181.90 Read Requests/Sec=191.00 Write Requests/Sec=73.60 KBytes Read/Sec=3328.40 KBytes_Written/Sec=380.40 [13:11:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:19:45] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:29:01] (03PS1) 10Urbanecm: Switch from $stdlogo to static resources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) [13:30:09] (03PS2) 10Urbanecm: Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) [13:35:45] PROBLEM - puppet last run on d-i-test is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:38:52] (03PS3) 10Urbanecm: Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) [13:41:52] (03PS4) 10Urbanecm: Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) [13:42:59] (03PS5) 10Urbanecm: Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) [14:03:46] RECOVERY - puppet last run on d-i-test is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:44:55] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:05] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:11:55] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:19:05] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active [15:22:05] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [15:26:12] (03PS1) 10Andrew Bogott: instance-info-dumper: Use mwopenstackclient rather than the nova client directly. [puppet] - 10https://gerrit.wikimedia.org/r/346048 (https://phabricator.wikimedia.org/T158650) [15:27:57] (03CR) 10Andrew Bogott: [C: 032] instance-info-dumper: Use mwopenstackclient rather than the nova client directly. [puppet] - 10https://gerrit.wikimedia.org/r/346048 (https://phabricator.wikimedia.org/T158650) (owner: 10Andrew Bogott) [15:37:05] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:42:26] (03PS1) 10Andrew Bogott: instance-info-dumper: fix config path debug change [puppet] - 10https://gerrit.wikimedia.org/r/346049 [15:49:15] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:50:59] (03CR) 10Andrew Bogott: [C: 032] instance-info-dumper: fix config path debug change [puppet] - 10https://gerrit.wikimedia.org/r/346049 (owner: 10Andrew Bogott) [16:17:15] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:23:51] (03CR) 10Dereckson: [C: 031] "as long as it's optipng too, it's fine" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [18:13:11] (03PS1) 10Urbanecm: Convert $stdlogo to static/images/project-logos resources at Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346054 (https://phabricator.wikimedia.org/T161980) [18:16:19] (03PS2) 10Urbanecm: Convert $stdlogo to static/images/project-logos resources at Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346054 (https://phabricator.wikimedia.org/T161980) [18:25:32] (03CR) 10Dereckson: [C: 031] "Checked all logos files exist. Fine for me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/346054 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [18:46:55] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:09:05] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:14:55] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [19:31:16] (03PS1) 10Urbanecm: Optimalize all not-optimalized logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) [19:32:55] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:36:05] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:43:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 24 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:48:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:00:55] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:19:05] (03CR) 10Dereckson: [C: 031] "yeah but -1B, that qualifies to the best of optimization :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm) [20:20:42] (03CR) 10Urbanecm: "Don't understand. Should I use optipng -1B? Or did I do something wrong?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm) [20:22:43] (03CR) 10Dereckson: [C: 031] "Gerrit shows the size delta. It seems for nostalagiawiki and shwiktionary, you optimized removing one unique byte." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm) [20:25:13] (03CR) 10Urbanecm: "Oh, now I see. Thanks-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm) [21:00:15] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:08:15] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:16:15] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:20:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:23:55] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3198.50 Read Requests/Sec=3216.40 Write Requests/Sec=20.00 KBytes Read/Sec=25533.60 KBytes_Written/Sec=7070.00 [21:29:15] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:33:45] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.70 Read Requests/Sec=3.60 Write Requests/Sec=9.70 KBytes Read/Sec=15.60 KBytes_Written/Sec=53.60 [21:37:15] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:44:15] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [22:07:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:07:45] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 22 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:12:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:12:45] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 16 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:13:05] PROBLEM - puppet last run on ms-be1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:18:15] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:29:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:36:55] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:39:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:41:05] RECOVERY - puppet last run on ms-be1037 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [22:46:15] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:47:15] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [22:56:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:05:55] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [23:06:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:14:15] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [23:37:15] PROBLEM - puppet last run on radium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:43:05] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:49:25] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues