[00:06:34] RECOVERY - configured eth on terbium is OK: OK - interfaces up
[00:06:44] RECOVERY - Check whether ferm is active by checking the default input chain on terbium is OK: OK ferm input default policy is set
[00:06:44] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 2 hours ago with 0 failures
[00:06:44] RECOVERY - SSH on terbium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[00:06:44] RECOVERY - salt-minion processes on terbium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:06:45] RECOVERY - DPKG on terbium is OK: All packages OK
[00:06:54] RECOVERY - nutcracker port on terbium is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[00:06:54] RECOVERY - Disk space on terbium is OK: DISK OK
[00:07:04] RECOVERY - dhclient process on terbium is OK: PROCS OK: 0 processes with command name dhclient
[00:07:04] RECOVERY - Check systemd state on terbium is OK: OK - running: The system is fully operational
[00:07:14] RECOVERY - nutcracker process on terbium is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[00:07:14] RECOVERY - Check size of conntrack table on terbium is OK: OK: nf_conntrack is 0 % full
[00:08:04] RECOVERY - IPMI Temperature on terbium is OK: Sensor Type(s) Temperature Status: OK
[00:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[00:16:24] RECOVERY - MegaRAID on terbium is OK: OK: optimal, 1 logical, 2 physical
[00:17:24] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1929 bytes in 0.146 second response time
[00:23:04] RECOVERY - Check the NTP synchronisation status of timesyncd on terbium is OK: OK: synced at Sun 2017-06-04 00:22:57 UTC.
[01:00:04] PROBLEM - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0
[01:00:06] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166964
[01:00:09] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166964#3313134 (ops-monitoring-bot)
[01:11:34] PROBLEM - MegaRAID on lvs3001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[01:11:35] ACKNOWLEDGEMENT - MegaRAID on lvs3001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166965
[01:11:38] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3313139 (ops-monitoring-bot)
[01:32:34] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:33:34] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set
[01:35:14] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:36:14] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2001 is OK: OK ferm input default policy is set
[01:41:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[01:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[01:56:24] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:57:24] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1003 is OK: OK ferm input default policy is set
[02:06:14] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[02:08:14] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set
[02:22:14] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[02:23:20] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 09m 12s)
[02:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:31] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[02:29:44] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[02:29:54] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:36:14] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[02:37:14] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set
[02:47:38] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.003 second response time
[02:47:49] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[02:47:59] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[03:10:49] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[03:12:49] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[03:33:09] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:41:49] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[04:01:19] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[04:42:49] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[05:01:29] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=859.70 Read Requests/Sec=2158.30 Write Requests/Sec=384.20 KBytes Read/Sec=8960.00 KBytes_Written/Sec=7987.20
[05:10:30] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=14.40 Read Requests/Sec=0.40 Write Requests/Sec=47.60 KBytes Read/Sec=10.00 KBytes_Written/Sec=310.80
[05:31:39] PROBLEM - IPMI Temperature on snapshot1006 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[06:10:49] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[06:12:49] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[06:27:59] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.078 second response time
[06:28:59] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.082 second response time
[06:31:09] RECOVERY - IPMI Temperature on snapshot1006 is OK: Sensor Type(s) Temperature Status: OK
[07:11:49] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[07:42:59] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[09:08:59] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad URL) timed out before a response was received
[09:08:59] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad URL) timed out before a response was received
[09:08:59] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad URL) timed out before a response was received
[09:08:59] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (bad URL) timed out before a response was received
[09:08:59] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (bad URL) timed out before a response was received
[09:09:00] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (bad URL) timed out before a response was received
[09:09:00] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received
[09:09:49] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[09:09:49] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[09:09:49] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[09:09:51] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[09:09:51] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[09:09:51] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[09:09:51] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[09:12:59] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[09:52:39] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:29] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient
[10:10:49] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[10:20:24] (PS2) Ema: check_ipmi_temp: set check timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/357010 (https://phabricator.wikimedia.org/T125205)
[10:20:44] (CR) Ema: [V: 2 C: 2] check_ipmi_temp: set check timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/357010 (https://phabricator.wikimedia.org/T125205) (owner: Ema)
[10:31:00] !log mw2256 down, console stuck on 'Starti'. power cycled.
[10:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:39] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[10:42:49] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[10:54:45] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[11:30:46] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3313394 (elukey) Report from today (UTC timings): `10:30 !log mw2256 down, console stuck on 'Starti'. power cycled.`
[12:58:50] (Draft1) Paladox: Phabricator: Fix colour for Unbreak Now tasks [puppet] - https://gerrit.wikimedia.org/r/357121
[12:58:53] (PS2) Paladox: Phabricator: Fix colour for Unbreak Now tasks [puppet] - https://gerrit.wikimedia.org/r/357121
[14:22:57] Puppet, Labs, Phabricator, Patch-For-Review: Update phabricator puppet role to support use on labs - https://phabricator.wikimedia.org/T144112#3313584 (Paladox) Open>Resolved a:Paladox
[17:35:44] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2049042
[17:49:24] PROBLEM - HHVM rendering on mw2101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:50:14] RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 75107 bytes in 0.153 second response time
[18:05:39] Operations, ops-codfw, DBA: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3313752 (Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque...
[18:14:53] (PS1) DatGuy: Lift IP throttle for Wikimedia Chile editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/357128 (https://phabricator.wikimedia.org/T166788)
[21:55:45] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 14433
[22:02:45] (PS1) Ladsgroup: Change colors of LanguageStats to comply with WikimediaUI color palette [mediawiki-config] - https://gerrit.wikimedia.org/r/357169 (https://phabricator.wikimedia.org/T162058)
[23:30:54] PROBLEM - Disk space on labtestcontrol2001 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=73%)
[23:48:44] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues