[00:02:08] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3124: Connection refused
[00:02:38] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3126: Connection refused
[00:02:38] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3121: Connection refused
[00:02:38] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3127: Connection refused
[00:02:38] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 80: Connection refused
[00:02:38] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3122: Connection refused
[00:02:48] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3125: Connection refused
[00:02:48] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3123: Connection refused
[00:02:58] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3120: Connection refused
[00:15:48] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:27:02] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 3 others: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388#2909168 (10jcrespo)
[00:27:08] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 320 bytes in 0.072 second response time
[00:27:38] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time
[00:27:38] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.073 second response time
[00:27:38] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.076 second response time
[00:27:47] !log dump core file and restart varnish-frontend on cp2026
[00:27:48] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time
[00:27:48] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time
[00:27:48] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.073 second response time
[00:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:58] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.072 second response time
[00:27:58] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.073 second response time
[00:35:47] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 4 others: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388#2909183 (10jcrespo) I have added more tags that this eventually should have, just as a heads up- as it is not clear where the issue is (clie...
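(The burst of "Connection refused" PROBLEM lines above, and the 200 OK RECOVERY lines once varnish-frontend was restarted, are the output of per-port probes against cp2026. A minimal illustrative sketch of such a probe follows; the production checks are Icinga plugins rather than this script, and the host/ports are simply the values quoted in the alerts.)

```python
# Illustrative only: the production alerts come from Icinga's TCP/HTTP check
# plugins. This sketch shows what a per-port connect probe like the ones above
# amounts to, using the host and ports quoted in the alert lines.
import socket
import time

HOST = "10.192.48.30"                    # cp2026, per the alerts above
PORTS = [80] + list(range(3120, 3128))   # the upload-frontend ports being checked

def probe(host: str, port: int, timeout: float = 10.0) -> str:
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass                         # connection accepted; varnishd is listening
    except OSError as exc:               # e.g. ECONNREFUSED while varnish-frontend is down
        return f"CRITICAL: connect to address {host} and port {port}: {exc.strerror}"
    return f"OK: port {port} accepted the connection in {time.time() - start:.3f} seconds"

if __name__ == "__main__":
    for port in PORTS:
        print(probe(HOST, port))
```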
[00:43:48] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[00:46:01] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1286.eqiad.wmnet
[00:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:34] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 4 others: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388#2909168 (10bd808) ``` logstash1001:~ bd808$ sudo journalctl -l --no-pager -f -u logstash -- Logs begin at Thu 2016-12-22 07:12:58 UTC. -- De...
[00:55:19] !log Restarted logstash on logstash1001 (T154388)
[00:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:22] T154388: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388
[00:56:48] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1286.eqiad.wmnet,service=apache2
[00:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:41] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 4 others: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388#2909190 (10jcrespo) 05Open>03Resolved a:03jcrespo
[01:01:38] PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:01:48] PROBLEM - Check size of conntrack table on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:11] 06Operations, 10Elasticsearch, 10Wikimedia-Logstash, 15User-bd808, 07Wikimedia-log-errors: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388#2909194 (10jcrespo) a:05jcrespo>03bd808
[01:02:28] RECOVERY - Check whether ferm is active by checking the default input chain on cobalt is OK: OK ferm input default policy is set
[01:02:38] RECOVERY - Check size of conntrack table on cobalt is OK: OK: nf_conntrack is 0 % full
[01:02:48] PROBLEM - Check systemd state on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:48] PROBLEM - MD RAID on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:48] PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:58] PROBLEM - puppet last run on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:58] PROBLEM - DPKG on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:04:38] RECOVERY - configured eth on cobalt is OK: OK - interfaces up
[01:04:38] RECOVERY - MD RAID on cobalt is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:05:38] PROBLEM - Disk space on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
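(The two conftool SAL entries above bracket the work on mw1286: depool, do the maintenance, repool. A minimal sketch of that pattern follows; the confctl-style invocation is an assumption, since the SAL lines only record the resulting conftool action, not the exact command typed.)

```python
# Sketch of the depool -> maintain -> repool pattern visible in the SAL entries
# above. The confctl command line here is an assumption (only the conftool
# action is logged); treat this as an illustration of the pattern, not the
# exact tooling.
import subprocess
from contextlib import contextmanager

def confctl(selector: str, action: str) -> None:
    cmd = ["confctl", "select", selector, action]
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

@contextmanager
def depooled(selector: str):
    """Depool the selected backend for the duration of the block, then repool it."""
    confctl(selector, "set/pooled=no")
    try:
        yield
    finally:
        confctl(selector, "set/pooled=yes")

if __name__ == "__main__":
    with depooled("name=mw1286.eqiad.wmnet"):
        pass  # perform the maintenance on the depooled host here
```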
[01:05:39] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational
[01:05:48] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures
[01:06:28] RECOVERY - Disk space on cobalt is OK: DISK OK
[01:06:48] RECOVERY - DPKG on cobalt is OK: All packages OK
[01:11:13] Why did those warnings happen ^^
[01:17:46] high load on cobalt, could be related to logstash1001 bounce
[01:34:58] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1804.23828 Seconds
[01:35:58] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 32.27195 Seconds
[01:39:58] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:47:26] oh
[01:47:38] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:07:58] PROBLEM - Host 208.80.155.118 is DOWN: CRITICAL - Host Unreachable (208.80.155.118)
[02:07:58] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[02:08:08] PROBLEM - Host labservices1001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:09:08] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:11:28] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.003 second response time
[02:11:28] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.002 second response time
[02:11:28] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.002 second response time
[02:11:58] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.006 second response time
[02:11:58] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.013 second response time
[02:12:13] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.013 second response time
[02:12:13] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.013 second response time
[02:12:23] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.003 second response time
[02:12:38] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.002 second response time
[02:12:38] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.002 second response time
[02:13:07] pagerstorm
[02:13:11] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.002 second response time
[02:13:11] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.002 second response time
[02:13:31] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.002 second response time
[02:16:21] crap, yeah godog seems iffy
[02:16:21] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[02:16:39] hi, CI is down
[02:16:40] https://integration.wikimedia.org/ci/job/integration-config-tox-jessie/2548/console
[02:20:09] thanks paladox, likely related to the pages above
[02:20:24] oh, you're welcome.
[02:22:51] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:23:32] !log reboot labservices1001, unresponsive on console and MCE/temperature alerts found on lithium
[02:23:32] !log labservices1001 'racadm serveraction hardreset'
[02:23:40] olz
[02:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:48] good timing chasemp
[02:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:51] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[02:25:31] RECOVERY - Host labservices1001 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[02:25:51] RECOVERY - Host 208.80.155.118 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[02:26:01] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[02:26:01] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 35.071 second response time
[02:26:01] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 33.393 second response time
[02:26:04] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 25.629 second response time
[02:26:04] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 25.629 second response time
[02:26:04] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 9.524 second response time
[02:26:04] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 9.528 second response time
[02:26:04] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 9.544 second response time
[02:26:07] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 8.126 second response time
[02:26:09] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 5.009 second response time
[02:26:31] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.460 second response time
[02:26:31] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 5.146 second response time
[02:26:31] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.655 second response time
[02:28:21] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.426 second response time
[02:30:00] hello
[02:30:22] chasemp: how's it going?
[02:30:46] madhuvishy: ok I guess, seems like labservices1001 fell over dead, I'm guessing some thus far intermittent hw error that could hit again :)
[02:32:54] chasemp: ah, I can keep watch
[02:41:16] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Replace fans (or paste) on labservices1001 - https://phabricator.wikimedia.org/T154391#2909243 (10Andrew)
[02:42:51] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Replace fans (or paste) on labservices1001 - https://phabricator.wikimedia.org/T154391#2909256 (10chasemp) p:05Triage>03High
[02:43:13] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Replace fans (or paste) on labservices1001 - https://phabricator.wikimedia.org/T154391#2909257 (10Andrew) a:03Cmjohnson Chris -- I'm not sure what the procedure is here. If you need to power down the machine for this we'll...
[02:59:51] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:00:51] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[03:08:52] 06Operations, 10media-storage, 07Community-Wishlist-Survey-2016: Two recently uploaded files have disappeared (404) - https://phabricator.wikimedia.org/T147040#2909426 (10Liuxinyu970226)
[03:09:05] 06Operations, 06Commons, 10media-storage, 07Community-Wishlist-Survey-2016, 05MW-1.27-release-notes: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#2909429 (10Liuxinyu970226)
[03:23:39] 06Operations, 10Gerrit, 07Beta-Cluster-reproducible, 13Patch-For-Review: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2909506 (10Paladox) Actually this looks like it is fixed in jgit 4.6, which gerrit 2.14 should be using when upstream releases gerrit 2.14....
[04:33:51] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:34:51] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[05:01:01] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:29:01] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[05:54:31] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:06:01] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=54%)
[06:23:31] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:27:51] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:27:51] PROBLEM - carbon-cache@e service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is failed
[06:28:01] RECOVERY - Disk space on iridium is OK: DISK OK
[06:28:21] PROBLEM - carbon-cache@h service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is failed
[06:28:21] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed
[06:46:51] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:47:51] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[06:54:51] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational
[06:54:51] RECOVERY - carbon-cache@e service on graphite1003 is OK: OK - carbon-cache@e is active
[06:55:21] RECOVERY - carbon-cache@h service on graphite1003 is OK: OK - carbon-cache@h is active
[06:55:21] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active
[07:49:01] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:49:51] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[08:13:31] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:30:01] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:30:51] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[08:41:31] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:28:58] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1033 - https://phabricator.wikimedia.org/T152214#2909583 (10Marostegui)
[09:37:01] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:39:01] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[09:46:01] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:47:01] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[10:02:01] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:03:01] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[10:20:31] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:48:31] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[11:00:41] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:28:41] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[13:47:31] PROBLEM - Disk space on ms-be1001 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdc1 is not accessible: Input/output error
[13:50:01] PROBLEM - puppet last run on prometheus2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:31] RECOVERY - Disk space on ms-be1001 is OK: DISK OK
[14:02:51] PROBLEM - MegaRAID on ms-be1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[14:03:03] ACKNOWLEDGEMENT - MegaRAID on ms-be1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T154396
[14:03:07] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1001 - https://phabricator.wikimedia.org/T154396#2909758 (10ops-monitoring-bot)
[14:09:23] PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1]
[14:18:13] RECOVERY - puppet last run on prometheus2002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:03:33] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:32:33] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[15:39:23] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:47:14] (03PS2) 10BryanDavis: vagrant: Update LXC packages and apparmor conf for systemd [puppet] - 10https://gerrit.wikimedia.org/r/329702 (https://phabricator.wikimedia.org/T154294)
[15:47:16] (03PS2) 10BryanDavis: vagrant: remove setup.sh call [puppet] - 10https://gerrit.wikimedia.org/r/329723
[15:47:18] (03PS2) 10BryanDavis: vagrant: add sudo rules for Vagrant 1.9.1 [puppet] - 10https://gerrit.wikimedia.org/r/329724 (https://phabricator.wikimedia.org/T122735)
[15:47:20] (03PS6) 10BryanDavis: Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 (https://phabricator.wikimedia.org/T154340)
[15:54:43] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[16:07:23] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[16:17:39] (03PS7) 10BryanDavis: Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 (https://phabricator.wikimedia.org/T154340)
[16:23:33] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[16:42:13] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:01:13] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[17:11:13] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[17:22:49] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2909926 (10Seb35) In T149617#2832764, @Krinkle wrote: > We currently cache the expansion/extrac...
[17:30:13] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[17:52:13] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:06:53] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:09:23] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:11:07] (03CR) 10BryanDavis: l10nupdate: acquire scap lock before changing files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) (owner: 10BryanDavis)
[18:20:13] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[18:33:53] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[18:37:23] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[18:50:17] 06Operations, 10media-storage: Two recently uploaded files have disappeared (404) - https://phabricator.wikimedia.org/T147040#2909997 (10Aklapper) @Liuxinyu970226: This was mentioned as an example why the Community Wishlist item "Commons backup" was proposed. This task is not an item on the Community Wishlist.
[18:50:25] 06Operations, 06Commons, 10media-storage, 05MW-1.27-release-notes: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#2910003 (10Aklapper) @Liuxinyu970226: This was mentioned as an example why the Community Wishlist item "Commons backup" was proposed. This ta...
[20:37:25] (03CR) 10Hashar: [C: 031] gerrit: Indent @ssl_settings in Apache configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/329738 (owner: 10Tim Landscheidt)
[20:47:33] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:16:33] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[21:26:03] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:53:03] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[23:03:23] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:11:33] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:31:23] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[23:40:33] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[23:40:43] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:54:13] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=54%)