[00:06:34] RECOVERY - configured eth on terbium is OK: OK - interfaces up
[00:06:44] RECOVERY - Check whether ferm is active by checking the default input chain on terbium is OK: OK ferm input default policy is set
[00:06:44] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 2 hours ago with 0 failures
[00:06:44] RECOVERY - SSH on terbium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[00:06:44] RECOVERY - salt-minion processes on terbium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:06:45] RECOVERY - DPKG on terbium is OK: All packages OK
[00:06:54] RECOVERY - nutcracker port on terbium is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[00:06:54] RECOVERY - Disk space on terbium is OK: DISK OK
[00:07:04] RECOVERY - dhclient process on terbium is OK: PROCS OK: 0 processes with command name dhclient
[00:07:04] RECOVERY - Check systemd state on terbium is OK: OK - running: The system is fully operational
[00:07:14] RECOVERY - nutcracker process on terbium is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[00:07:14] RECOVERY - Check size of conntrack table on terbium is OK: OK: nf_conntrack is 0 % full
[00:08:04] RECOVERY - IPMI Temperature on terbium is OK: Sensor Type(s) Temperature Status: OK
[00:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[00:16:24] RECOVERY - MegaRAID on terbium is OK: OK: optimal, 1 logical, 2 physical
[00:17:24] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1929 bytes in 0.146 second response time
[00:23:04] RECOVERY - Check the NTP synchronisation status of timesyncd on terbium is OK: OK: synced at Sun 2017-06-04 00:22:57 UTC.
[01:00:04] PROBLEM - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0
[01:00:06] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166964
[01:00:09] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166964#3313134 (ops-monitoring-bot)
[01:11:34] PROBLEM - MegaRAID on lvs3001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[01:11:35] ACKNOWLEDGEMENT - MegaRAID on lvs3001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166965
[01:11:38] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3313139 (ops-monitoring-bot)
[01:32:34] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:33:34] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set
[01:35:14] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:36:14] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2001 is OK: OK ferm input default policy is set
[01:41:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[01:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[01:56:24] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:57:24] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1003 is OK: OK ferm input default policy is set
[02:06:14] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[02:08:14] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set
[02:22:14] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[02:23:20] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 09m 12s)
[02:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:31] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[02:29:44] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[02:29:54] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:36:14] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[02:37:14] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set
[02:47:38] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.003 second response time
[02:47:49] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[02:47:59] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[03:10:49] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[03:12:49] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[03:33:09] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:41:49] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[04:01:19] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[04:42:49] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[05:01:29] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=859.70 Read Requests/Sec=2158.30 Write Requests/Sec=384.20 KBytes Read/Sec=8960.00 KBytes_Written/Sec=7987.20
[05:10:30] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=14.40 Read Requests/Sec=0.40 Write Requests/Sec=47.60 KBytes Read/Sec=10.00 KBytes_Written/Sec=310.80
[05:31:39] PROBLEM - IPMI Temperature on snapshot1006 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[06:10:49] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[06:12:49] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[06:27:59] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.078 second response time
[06:28:59] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.082 second response time
[06:31:09] RECOVERY - IPMI Temperature on snapshot1006 is OK: Sensor Type(s) Temperature Status: OK
[07:11:49] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[07:42:59] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[09:08:59] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad URL) timed out before a response was received
[09:08:59] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad URL) timed out before a response was received
[09:08:59] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad URL) timed out before a response was received
[09:08:59] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (bad URL) timed out before a response was received
[09:08:59] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (bad URL) timed out before a response was received
[09:09:00] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (bad URL) timed out before a response was received
[09:09:00] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received
[09:09:49] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[09:09:49] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[09:09:49] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[09:09:51] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[09:09:51] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[09:09:51] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[09:09:51] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[09:12:59] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[09:52:39] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:29] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient
[10:10:49] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[10:20:24] (PS2) Ema: check_ipmi_temp: set check timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/357010 (https://phabricator.wikimedia.org/T125205)
[10:20:44] (CR) Ema: [V: 2 C: 2] check_ipmi_temp: set check timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/357010 (https://phabricator.wikimedia.org/T125205) (owner: Ema)
[10:31:00] !log mw2256 down, console stuck on 'Starti'. power cycled.
[10:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:39] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[10:42:49] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[10:54:45] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[11:30:46] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3313394 (elukey) Report from today (UTC timings): `10:30 !log mw2256 down, console stuck on 'Starti'. power cycled.`
[12:58:50] (Draft1) Paladox: Phabricator: Fix colour for Unbreak Now tasks [puppet] - https://gerrit.wikimedia.org/r/357121
[12:58:53] (PS2) Paladox: Phabricator: Fix colour for Unbreak Now tasks [puppet] - https://gerrit.wikimedia.org/r/357121
[14:22:57] Puppet, Labs, Phabricator, Patch-For-Review: Update phabricator puppet role to support use on labs - https://phabricator.wikimedia.org/T144112#3313584 (Paladox) Open>Resolved a:Paladox
[17:35:44] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2049042
[17:49:24] PROBLEM - HHVM rendering on mw2101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:50:14] RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 75107 bytes in 0.153 second response time
[18:05:39] Operations, ops-codfw, DBA: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3313752 (Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque...
[18:14:53] (PS1) DatGuy: Lift IP throttle for Wikimedia Chile editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/357128 (https://phabricator.wikimedia.org/T166788)
[21:55:45] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 14433
[22:02:45] (PS1) Ladsgroup: Change colors of LanguageStats to comply with WikimediaUI color palette [mediawiki-config] - https://gerrit.wikimedia.org/r/357169 (https://phabricator.wikimedia.org/T162058)
[23:30:54] PROBLEM - Disk space on labtestcontrol2001 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=73%)
[23:48:44] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues