[00:02:08] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3124: Connection refused
[00:02:38] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3126: Connection refused
[00:02:38] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3121: Connection refused
[00:02:38] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3127: Connection refused
[00:02:38] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 80: Connection refused
[00:02:38] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3122: Connection refused
[00:02:48] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3125: Connection refused
[00:02:48] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3123: Connection refused
[00:02:58] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp2026 is CRITICAL: connect to address 10.192.48.30 and port 3120: Connection refused
[00:15:48] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:27:02] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 3 others: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388#2909168 (10jcrespo)
[00:27:08] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 320 bytes in 0.072 second response time
[00:27:38] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time
[00:27:38] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.073 second response time
[00:27:38] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.076 second response time
[00:27:47] !log dump core file and restart varnish-frontend on cp2026
[00:27:48] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time
[00:27:48] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time
[00:27:48] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.073 second response time
[00:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:58] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.072 second response time
[00:27:58] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.073 second response time
[00:35:47] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 4 others: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388#2909183 (10jcrespo) I have added more tags that this eventually should have, just as a heads up- as it is not clear where the issue is (clie...
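(The burst of "Connection refused" PROBLEM lines above, and the 200 OK RECOVERY lines once varnish-frontend was restarted, are the output of per-port probes against cp2026. A minimal illustrative sketch of such a probe follows; the production checks are Icinga plugins rather than this script, and the host/ports are simply the values quoted in the alerts.)

```python
# Illustrative only: the production alerts come from Icinga's TCP/HTTP check
# plugins. This sketch shows what a per-port connect probe like the ones above
# amounts to, using the host and ports quoted in the alert lines.
import socket
import time

HOST = "10.192.48.30"                    # cp2026, per the alerts above
PORTS = [80] + list(range(3120, 3128))   # the upload-frontend ports being checked

def probe(host: str, port: int, timeout: float = 10.0) -> str:
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass                         # connection accepted; varnishd is listening
    except OSError as exc:               # e.g. ECONNREFUSED while varnish-frontend is down
        return f"CRITICAL: connect to address {host} and port {port}: {exc.strerror}"
    return f"OK: port {port} accepted the connection in {time.time() - start:.3f} seconds"

if __name__ == "__main__":
    for port in PORTS:
        print(probe(HOST, port))
```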
[00:43:48] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[00:46:01] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1286.eqiad.wmnet
[00:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:34] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 4 others: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388#2909168 (10bd808) ``` logstash1001:~ bd808$ sudo journalctl -l --no-pager -f -u logstash -- Logs begin at Thu 2016-12-22 07:12:58 UTC. -- De...
[00:55:19] !log Restarted logstash on logstash1001 (T154388)
[00:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:22] T154388: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388
[00:56:48] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1286.eqiad.wmnet,service=apache2
[00:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:41] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 4 others: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388#2909190 (10jcrespo) 05Open>03Resolved a:03jcrespo
[01:01:38] PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:01:48] PROBLEM - Check size of conntrack table on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:11] 06Operations, 10Elasticsearch, 10Wikimedia-Logstash, 15User-bd808, 07Wikimedia-log-errors: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388#2909194 (10jcrespo) a:05jcrespo>03bd808
[01:02:28] RECOVERY - Check whether ferm is active by checking the default input chain on cobalt is OK: OK ferm input default policy is set
[01:02:38] RECOVERY - Check size of conntrack table on cobalt is OK: OK: nf_conntrack is 0 % full
[01:02:48] PROBLEM - Check systemd state on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:48] PROBLEM - MD RAID on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:48] PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:58] PROBLEM - puppet last run on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:58] PROBLEM - DPKG on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:04:38] RECOVERY - configured eth on cobalt is OK: OK - interfaces up
[01:04:38] RECOVERY - MD RAID on cobalt is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:05:38] PROBLEM - Disk space on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
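(The two conftool SAL entries above bracket the work on mw1286: depool, do the maintenance, repool. A minimal sketch of that pattern follows; the confctl-style invocation is an assumption, since the SAL lines only record the resulting conftool action, not the exact command typed.)

```python
# Sketch of the depool -> maintain -> repool pattern visible in the SAL entries
# above. The confctl command line here is an assumption (only the conftool
# action is logged); treat this as an illustration of the pattern, not the
# exact tooling.
import subprocess
from contextlib import contextmanager

def confctl(selector: str, action: str) -> None:
    cmd = ["confctl", "select", selector, action]
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

@contextmanager
def depooled(selector: str):
    """Depool the selected backend for the duration of the block, then repool it."""
    confctl(selector, "set/pooled=no")
    try:
        yield
    finally:
        confctl(selector, "set/pooled=yes")

if __name__ == "__main__":
    with depooled("name=mw1286.eqiad.wmnet"):
        pass  # perform the maintenance on the depooled host here
```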
[01:05:39] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational
[01:05:48] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures
[01:06:28] RECOVERY - Disk space on cobalt is OK: DISK OK
[01:06:48] RECOVERY - DPKG on cobalt is OK: All packages OK
[01:11:13] Why did those warnings happen ^^
[01:17:46] high load on cobalt, could be related to logstash1001 bounce
[01:34:58] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1804.23828 Seconds
[01:35:58] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 32.27195 Seconds
[01:39:58] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:47:26] oh
[01:47:38] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:07:58] PROBLEM - Host 208.80.155.118 is DOWN: CRITICAL - Host Unreachable (208.80.155.118)
[02:07:58] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[02:08:08] PROBLEM - Host labservices1001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:09:08] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:11:28] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.003 second response time
[02:11:28] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.002 second response time
[02:11:28] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.002 second response time
[02:11:58] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.006 second response time
[02:11:58] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.013 second response time
[02:12:13] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.013 second response time
[02:12:13] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.013 second response time
[02:12:23] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.003 second response time
[02:12:38] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.002 second response time
[02:12:38] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.002 second response time
[02:13:07] pagerstorm
[02:13:11] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.002 second response time
[02:13:11] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.002 second response time
[02:13:31] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.002 second response time
[02:16:21] crap, yeah godog seems iffy
[02:16:21] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[02:16:39] hi, CI is down
[02:16:40] https://integration.wikimedia.org/ci/job/integration-config-tox-jessie/2548/console
[02:20:09] thanks paladox, likely related to the pages above
[02:20:24] oh, you're welcome.
[02:22:51] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:23:32] !log reboot labservices1001, unresponsive on console and MCE/temperature alerts found on lithium
[02:23:32] !log labservices1001 'racadm serveraction hardreset'
[02:23:40] olz
[02:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:48] good timing chasemp
[02:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:51] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[02:25:31] RECOVERY - Host labservices1001 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[02:25:51] RECOVERY - Host 208.80.155.118 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[02:26:01] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[02:26:01] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 35.071 second response time
[02:26:01] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 33.393 second response time
[02:26:04] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 25.629 second response time
[02:26:04] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 25.629 second response time
[02:26:04] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 9.524 second response time
[02:26:04] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 9.528 second response time
[02:26:04] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 9.544 second response time
[02:26:07] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 8.126 second response time
[02:26:09] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 5.009 second response time
[02:26:31] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.460 second response time
[02:26:31] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 5.146 second response time
[02:26:31] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.655 second response time
[02:28:21] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.426 second response time
[02:30:00] hello
[02:30:22] chasemp: how's it going?
[02:30:46] madhuvishy: ok I guess, seems like labservices1001 fell over dead, I'm guessing some thus far intermittent hw error that could hit again :)
[02:32:54] chasemp: ah, I can keep watch
[02:41:16] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Replace fans (or paste) on labservices1001 - https://phabricator.wikimedia.org/T154391#2909243 (10Andrew)
[02:42:51] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Replace fans (or paste) on labservices1001 - https://phabricator.wikimedia.org/T154391#2909256 (10chasemp) p:05Triage>03High
[02:43:13] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Replace fans (or paste) on labservices1001 - https://phabricator.wikimedia.org/T154391#2909257 (10Andrew) a:03Cmjohnson Chris -- I'm not sure what the procedure is here. If you need to power down the machine for this we'll...
[02:59:51] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:00:51] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[03:08:52] 06Operations, 10media-storage, 07Community-Wishlist-Survey-2016: Two recently uploaded files have disappeared (404) - https://phabricator.wikimedia.org/T147040#2909426 (10Liuxinyu970226)
[03:09:05] 06Operations, 06Commons, 10media-storage, 07Community-Wishlist-Survey-2016, 05MW-1.27-release-notes: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#2909429 (10Liuxinyu970226)
[03:23:39] 06Operations, 10Gerrit, 07Beta-Cluster-reproducible, 13Patch-For-Review: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2909506 (10Paladox) Actually this looks like it is fixed in jgit 4.6, which gerrit 2.14 should be using when upstream releases gerrit 2.14....
[04:33:51] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:34:51] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[05:01:01] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:29:01] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[05:54:31] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:06:01] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=54%)
[06:23:31] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:27:51] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:27:51] PROBLEM - carbon-cache@e service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is failed
[06:28:01] RECOVERY - Disk space on iridium is OK: DISK OK
[06:28:21] PROBLEM - carbon-cache@h service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is failed
[06:28:21] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed
[06:46:51] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:47:51] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[06:54:51] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational
[06:54:51] RECOVERY - carbon-cache@e service on graphite1003 is OK: OK - carbon-cache@e is active
[06:55:21] RECOVERY - carbon-cache@h service on graphite1003 is OK: OK - carbon-cache@h is active
[06:55:21] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active
[07:49:01] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:49:51] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[08:13:31] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:30:01] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:30:51] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[08:41:31] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:28:58] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1033 - https://phabricator.wikimedia.org/T152214#2909583 (10Marostegui)
[09:37:01] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:39:01] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[09:46:01] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:47:01] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[10:02:01] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:03:01] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[10:20:31] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:48:31] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[11:00:41] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:28:41] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[13:47:31] PROBLEM - Disk space on ms-be1001 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdc1 is not accessible: Input/output error
[13:50:01] PROBLEM - puppet last run on prometheus2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:31] RECOVERY - Disk space on ms-be1001 is OK: DISK OK
[14:02:51] PROBLEM - MegaRAID on ms-be1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[14:03:03] ACKNOWLEDGEMENT - MegaRAID on ms-be1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T154396
[14:03:07] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1001 - https://phabricator.wikimedia.org/T154396#2909758 (10ops-monitoring-bot)
[14:09:23] PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1]
[14:18:13] RECOVERY - puppet last run on prometheus2002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:03:33] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:32:33] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[15:39:23] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:47:14] (03PS2) 10BryanDavis: vagrant: Update LXC packages and apparmor conf for systemd [puppet] - 10https://gerrit.wikimedia.org/r/329702 (https://phabricator.wikimedia.org/T154294)
[15:47:16] (03PS2) 10BryanDavis: vagrant: remove setup.sh call [puppet] - 10https://gerrit.wikimedia.org/r/329723
[15:47:18] (03PS2) 10BryanDavis: vagrant: add sudo rules for Vagrant 1.9.1 [puppet] - 10https://gerrit.wikimedia.org/r/329724 (https://phabricator.wikimedia.org/T122735)
[15:47:20] (03PS6) 10BryanDavis: Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 (https://phabricator.wikimedia.org/T154340)
[15:54:43] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[16:07:23] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[16:17:39] (03PS7) 10BryanDavis: Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 (https://phabricator.wikimedia.org/T154340)
[16:23:33] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[16:42:13] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:01:13] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[17:11:13] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[17:22:49] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2909926 (10Seb35) In T149617#2832764, @Krinkle wrote: > We currently cache the expansion/extrac...
[17:30:13] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[17:52:13] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:06:53] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:09:23] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:11:07] (03CR) 10BryanDavis: l10nupdate: acquire scap lock before changing files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) (owner: 10BryanDavis)
[18:20:13] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[18:33:53] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[18:37:23] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[18:50:17] 06Operations, 10media-storage: Two recently uploaded files have disappeared (404) - https://phabricator.wikimedia.org/T147040#2909997 (10Aklapper) @Liuxinyu970226: This was mentioned as an example why the Community Wishlist item "Commons backup" was proposed. This task is not an item on the Community Wishlist.
[18:50:25] 06Operations, 06Commons, 10media-storage, 05MW-1.27-release-notes: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#2910003 (10Aklapper) @Liuxinyu970226: This was mentioned as an example why the Community Wishlist item "Commons backup" was proposed. This ta...
[20:37:25] (03CR) 10Hashar: [C: 031] gerrit: Indent @ssl_settings in Apache configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/329738 (owner: 10Tim Landscheidt)
[20:47:33] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:16:33] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[21:26:03] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:53:03] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[23:03:23] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:11:33] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:31:23] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[23:40:33] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[23:40:43] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:54:13] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=54%)