[00:05:55] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:55] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:15] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=170.00 Read Requests/Sec=205.20 Write Requests/Sec=15.30 KBytes Read/Sec=2879.60 KBytes_Written/Sec=490.80 [00:08:45] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [00:08:46] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:58:05] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:05:35] PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:35] PROBLEM - MD RAID on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:35] PROBLEM - dhclient process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:36] PROBLEM - gerrit process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:36] PROBLEM - puppet last run on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:45] PROBLEM - Check systemd state on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:55] PROBLEM - Disk space on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:55] PROBLEM - DPKG on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:15] PROBLEM - Check the NTP synchronisation status of timesyncd on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:25] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [01:06:45] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:06:45] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [01:07:35] PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:35] PROBLEM - Check size of conntrack table on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:35] PROBLEM - salt-minion processes on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
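The conf2002 alerts above pair a "degraded" systemd state with a failed etcdmirror-conftool-eqiad-wmnet unit. A minimal triage sketch using standard systemd tooling follows; the unit name is taken from the alert, and whether a plain restart is the right remediation is an assumption, not something this log confirms.

    # List the failed units behind the "degraded" system state
    systemctl --failed
    # Inspect the unit named in the alert and its recent log output
    systemctl status etcdmirror-conftool-eqiad-wmnet
    journalctl -u etcdmirror-conftool-eqiad-wmnet --since "1 hour ago" --no-pager
    # Only once the failure cause is understood: clear the failed state and restart
    # sudo systemctl restart etcdmirror-conftool-eqiad-wmnet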
[01:07:35] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [01:08:25] RECOVERY - Check whether ferm is active by checking the default input chain on cobalt is OK: OK ferm input default policy is set [01:08:26] RECOVERY - Check size of conntrack table on cobalt is OK: OK: nf_conntrack is 0 % full [01:08:26] RECOVERY - salt-minion processes on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:08:26] RECOVERY - configured eth on cobalt is OK: OK - interfaces up [01:08:26] RECOVERY - dhclient process on cobalt is OK: PROCS OK: 0 processes with command name dhclient [01:08:26] RECOVERY - MD RAID on cobalt is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [01:08:26] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [01:08:27] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures [01:09:22] RainbowSprinkles ^^ [01:09:35] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set [01:09:45] RECOVERY - Disk space on cobalt is OK: DISK OK [01:09:45] RECOVERY - DPKG on cobalt is OK: All packages OK [01:10:22] gerrit is not loading for me [01:11:05] just slow [01:11:05] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:05] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:05] PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:05] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:05] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:06] PROBLEM - Unmerged changes on repository puppet on labtestcontrol2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:06] PROBLEM - Unmerged changes on repository puppet on puppetmaster1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:33] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3148691 (10Paladox) Happended again on the 02/04/17 bst time. PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:05... [01:11:35] PROBLEM - puppet last run on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:39] Reedy it's not loading the status:open for me [01:12:55] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [01:12:56] RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge. [01:12:56] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. [01:12:56] RECOVERY - Unmerged changes on repository puppet on labcontrol1001 is OK: No changes to merge. [01:12:56] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge. [01:12:56] RECOVERY - Unmerged changes on repository puppet on labtestcontrol2001 is OK: No changes to merge. 
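Several puppetmasters briefly alert on "Unmerged changes on repository puppet". A check of that kind essentially asks whether the locally checked-out branch is behind its upstream; a minimal sketch of that comparison is below. The repository path and branch name are illustrative assumptions, and the actual NRPE plugin on these hosts may be implemented differently.

    # Hypothetical checkout path and branch; adjust to the host's actual layout
    cd /var/lib/git/operations/puppet
    git fetch origin
    # Commits present upstream but not yet merged into the local checkout
    git log --oneline HEAD..origin/production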
[01:12:56] RECOVERY - Unmerged changes on repository puppet on puppetmaster1002 is OK: No changes to merge. [01:14:35] PROBLEM - Check size of conntrack table on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:35] PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:35] PROBLEM - salt-minion processes on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:35] PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:35] PROBLEM - MD RAID on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:36] PROBLEM - dhclient process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:41] ^^ [01:14:45] PROBLEM - gerrit process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:15:05] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 1 failures [01:15:18] Looks like there's been a load spike on it [01:15:40] Is this the same problem we had with lead? [01:15:49] Where the storage system became very slow [01:15:55] PROBLEM - DPKG on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:15:56] PROBLEM - Disk space on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:16:04] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3148692 (10Paladox) a minute later after reporting the recovery errors, the problem thing came again PROBLEM - Check size of conntrack table on cobalt is CRITI... [01:16:07] No idea. AFAIK I don't have access to the system [01:16:35] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [01:16:36] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [01:16:36] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [01:16:44] Should one of the ops be paged? [01:17:22] Reedy ^^ [01:17:45] RECOVERY - Disk space on cobalt is OK: DISK OK [01:17:46] RECOVERY - DPKG on cobalt is OK: All packages OK [01:18:25] RECOVERY - Check whether ferm is active by checking the default input chain on cobalt is OK: OK ferm input default policy is set [01:18:25] RECOVERY - Check size of conntrack table on cobalt is OK: OK: nf_conntrack is 0 % full [01:18:25] RECOVERY - salt-minion processes on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:18:27] RECOVERY - configured eth on cobalt is OK: OK - interfaces up [01:18:27] RECOVERY - dhclient process on cobalt is OK: PROCS OK: 0 processes with command name dhclient [01:18:27] RECOVERY - MD RAID on cobalt is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [01:19:14] Nah, it's probably fine [01:19:24] ok [01:20:05] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 94 seconds ago with 0 failures [01:21:55] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:22:18] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3148694 (10demon) None of the last two comments are related to the issue here. That sounds like icinga flapping, not the machine or service itself. 
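demon's ticket comment above points at icinga/NRPE flapping rather than a real cobalt problem, since many unrelated checks time out at once. One way to tell the two apart is to run the same NRPE check by hand from the monitoring host with a longer timeout; a sketch follows, assuming the plugin lives in the usual Debian path and that the remote command name matches the alert (both assumptions).

    # From the icinga host: does NRPE on cobalt answer at all, and how slowly?
    /usr/lib/nagios/plugins/check_nrpe -H cobalt.wikimedia.org -t 30
    # Re-run one of the timing-out checks with a generous timeout
    # (the command name here is hypothetical; use the one defined in the host's nrpe config)
    /usr/lib/nagios/plugins/check_nrpe -H cobalt.wikimedia.org -c check_disk_space -t 60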
[01:22:45] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [01:22:50] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3148695 (10Paladox) >>! In T148478#3148694, @demon wrote: > None of the last two comments are related to the issue here. That sounds like icinga flapping, not the machine or... [01:23:16] Considering there are several other services complaining about CHECK_NRPE, I'm more inclined to blame icinga. [01:23:37] It did get pretty slow [01:23:48] https://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=cobalt.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [01:24:04] Big load of grey on the load/procs graph [01:24:06] I found it slow too; now it's working. [01:24:48] Reedy: Lots of incoming connections in the apache stats. *shrug* [01:25:35] I got a few of the popup error 0 things [01:25:59] There's not some 1am UTC cronjob, is there? [01:26:36] I thought we got rid of the cronjob? [01:26:47] oh, we didn't [01:26:49] Reedy https://github.com/wikimedia/puppet/blob/production/modules/gerrit/manifests/crons.pp [01:27:05] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [01:27:30] Doubt that caused it [01:29:45] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 58377.022587 Seconds [01:29:55] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 58382.013225 Seconds [01:29:55] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 58383.437874 Seconds [01:30:13] Reedy: maybe puppet needs to be run manually. I've been testing puppet privately with a smaller but similar setup and got the same error, and triggering a puppet run manually tends to fix it, or at least temporarily bandage it [01:30:24] What [01:30:25] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 58433.575453 Seconds [01:30:26] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 58435.788166 Seconds [01:30:40] I'm referring to popup error 0 [01:31:04] That's nothing to do with puppet [01:32:35] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 58560.26345 Seconds [01:32:55] Oh, I see now, never mind, my bad [01:33:55] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [01:35:35] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [01:36:05] RECOVERY - Check the NTP synchronisation status of timesyncd on cobalt is OK: OK: synced at Sun 2017-04-02 01:35:57 UTC. [01:36:55] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 58803.558914 Seconds [01:37:45] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:37:45] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
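On the "1am UTC cronjob" question and the link to modules/gerrit/manifests/crons.pp: a quick way to see what is actually scheduled on the host around that time is sketched below. The gerrit2 user is inferred from the gerrit.war process check earlier in the log, and the crontab locations are the standard Debian ones, not confirmed by anything here.

    # Per-user crontab for the account gerrit runs as (user name inferred from the process check)
    sudo crontab -l -u gerrit2
    # System cron fragments, typically puppet-managed
    grep -rn . /etc/cron.d/ /etc/crontab 2>/dev/null
    # What cron actually launched around 01:0x UTC
    grep 'CRON' /var/log/syslog | grep ' 01:0'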
[01:38:35] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 58920.283176 Seconds [01:38:45] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [01:38:45] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:39:25] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:39:59] What's going on with postgres? [01:40:20] It does it every night [01:41:35] PROBLEM - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.138 and port 9042: Connection refused [01:41:45] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:41:55] PROBLEM - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:41:56] PROBLEM - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:41:56] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [01:42:26] PROBLEM - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.48 and port 9042: Connection refused [01:42:26] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 59153.534169 Seconds [01:43:05] RECOVERY - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is OK: SSL OK - Certificate restbase2005-c valid until 2017-09-12 15:35:38 +0000 (expires in 163 days) [01:43:25] RECOVERY - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is OK: TCP OK - 0.036 second response time on 10.192.48.48 port 9042 [01:43:42] Reedy: I take it it has something to do with backing up? [01:44:02] Don't think so [01:44:38] Is it normal? [01:45:36] Nope [01:45:55] I think Ops have looked at it superficially, and didn't have any ideas, but I don't know if anyone has really dug into it [01:46:21] I would look into it, but 1) I'm too lazy to right now, and 2) I probably don't have that kind of access [01:48:55] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:48:55] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:49:05] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
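The maps* hosts flap on "Postgres Replication Lag" with Rep Delay of roughly the same number of seconds on every replica, which fits Reedy's remark that it happens every night. The check is essentially the replay lag on a standby; a sketch of the underlying query is below, assuming local psql access as the postgres user. The actual monitoring plugin may compute the figure differently.

    # Approximate replay lag in seconds, run on a standby
    # (pg_last_xact_replay_timestamp is standard PostgreSQL)
    sudo -u postgres psql -t -c \
      "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS replay_lag_seconds;"
    # Note: this reads high when the primary is idle, since the replay timestamp stops advancing.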
[01:50:55] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [01:51:45] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [01:51:55] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:53:55] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 59822.032782 Seconds [01:55:55] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [01:57:25] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:57:45] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 23.947743 Seconds [01:57:55] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 28.175942 Seconds [01:58:25] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [01:59:35] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 23.613795 Seconds [02:01:45] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [02:01:55] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [02:02:35] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.138 port 9042 [02:02:55] RECOVERY - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-b valid until 2017-09-12 15:35:25 +0000 (expires in 163 days) [02:12:25] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [02:13:15] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set [02:17:55] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [02:18:25] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [02:18:46] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [02:18:46] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active [02:19:25] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set [02:21:45] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:21:46] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [02:22:55] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
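The kubernetes100x hosts keep flipping between "ferm input drop default policy not set" and OK, which looks like the check landing while ferm is being reloaded. The condition the check tests, that the INPUT chain's default policy is DROP, can be verified by hand as sketched below; the actual plugin and the reason for the flapping are not shown in this log, so treat it as illustrative only.

    # Default policy of the INPUT chain; ferm should leave it as DROP
    sudo iptables -S INPUT | head -n1     # expect: -P INPUT DROP
    # ferm's own status and last reload
    sudo systemctl status ferm --no-pager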
[02:23:45] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [02:24:26] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 09m 44s) [02:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:35] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [02:25:45] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [02:26:45] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set [02:31:55] PROBLEM - Disk space on labtestcontrol2001 is CRITICAL: DISK CRITICAL - free space: / 342 MB (3% inode=69%) [02:51:55] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [02:53:55] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set [03:28:15] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [03:28:35] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [03:29:25] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:30:05] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [03:30:45] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:32:45] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz] [03:36:55] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [03:37:05] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [03:37:25] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=824.60 Read Requests/Sec=353.50 Write Requests/Sec=0.90 KBytes Read/Sec=37978.40 KBytes_Written/Sec=34.00 [03:37:55] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
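The restbase2001/restbase2009 alerts pair a failed cassandra-a unit with refused connections on the CQL (9042) and inter-node SSL (7001) ports, i.e. the instance is down rather than merely misbehaving. Probing those two ports directly is sketched below, with the address taken from the restbase2001 alert; nc and openssl are standard tools, and this says nothing about why the instance keeps dying.

    # Is anything listening on the CQL port? (address from the restbase2001 alert)
    nc -zv 10.192.16.162 9042
    # Does the inter-node TLS port complete a handshake and present a certificate?
    openssl s_client -connect 10.192.16.162:7001 </dev/null 2>/dev/null | openssl x509 -noout -enddate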
[03:37:55] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [03:38:05] PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [03:38:25] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 163 days) [03:38:25] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused [03:38:35] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [03:41:55] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [03:42:55] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set [03:45:35] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [03:46:15] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [03:46:55] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:47:05] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [03:47:35] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:50:25] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=205.70 Read Requests/Sec=298.80 Write Requests/Sec=7.60 KBytes Read/Sec=2556.40 KBytes_Written/Sec=222.40 [03:57:25] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [04:00:45] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [04:03:05] RECOVERY - cassandra-a service on restbase2009 is OK: OK - cassandra-a is active [04:03:55] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [04:04:25] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.54 port 9042 [04:05:55] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [04:06:05] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [04:07:15] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 163 days) [04:07:35] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [04:07:55] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[04:08:05] PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:08:25] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused [04:12:15] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:12:35] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [04:13:05] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [04:13:25] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [04:14:55] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:15:05] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:15:35] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [04:32:55] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [04:33:05] RECOVERY - cassandra-a service on restbase2009 is OK: OK - cassandra-a is active [04:33:55] RECOVERY - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-a valid until 2017-09-12 15:36:07 +0000 (expires in 163 days) [04:34:25] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.54 port 9042 [04:35:55] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [04:36:05] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [04:36:55] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [04:37:05] PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:37:25] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused [04:37:35] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [04:37:55] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
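"puppet last run ... Catalog fetch fail" alerts recur on individual hosts throughout the night and clear on the next run, which usually indicates a transient compile or puppetmaster hiccup rather than a broken manifest. When one does not clear on its own, a manual run on the affected host shows the actual error; a minimal sketch using standard puppet agent options:

    # Re-run the agent interactively to see the compile/fetch error in full
    sudo puppet agent --test
    # Or compile and report without applying any changes
    sudo puppet agent --test --noop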
[04:37:55] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 163 days) [04:39:25] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200) [04:40:15] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [04:41:25] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [04:41:35] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [04:41:55] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:42:05] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:53:45] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:56:35] PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:02:55] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [05:03:05] RECOVERY - cassandra-a service on restbase2009 is OK: OK - cassandra-a is active [05:03:25] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.54 port 9042 [05:03:56] RECOVERY - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-a valid until 2017-09-12 15:36:07 +0000 (expires in 163 days) [05:05:55] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [05:06:05] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [05:06:25] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused [05:06:55] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:06:55] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [05:07:05] PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [05:08:35] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [05:09:25] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:10:15] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [05:10:55] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:11:05] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [05:16:35] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:22:45] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [05:25:35] RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [05:33:55] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [05:34:05] RECOVERY - cassandra-a service on restbase2009 is OK: OK - cassandra-a is active [05:34:15] RECOVERY - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-a valid until 2017-09-12 15:36:07 +0000 (expires in 163 days) [05:34:25] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.54 port 9042 [05:36:55] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [05:37:05] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [05:37:35] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [05:37:45] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 163 days) [05:44:35] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [05:47:55] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:14:55] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:28:55] RECOVERY - Disk space on labtestcontrol2001 is OK: DISK OK [06:30:49] (03PS2) 10ArielGlenn: retry failed page content pieces immediately after page content step completes [dumps] - 10https://gerrit.wikimedia.org/r/345985 (https://phabricator.wikimedia.org/T160507) [06:32:45] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:36:17] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148786 (10Ankry) Also available already: [[https://commons.wikimedia.org/wiki/File:Vladimir_Frolochkin.JPG]] [[https://c... [06:45:00] (03PS1) 10ArielGlenn: add new config options for dumps of big wikis [puppet] - 10https://gerrit.wikimedia.org/r/346037 [06:51:45] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:53:55] (03CR) 10ArielGlenn: [C: 032] add new config options for dumps of big wikis [puppet] - 10https://gerrit.wikimedia.org/r/346037 (owner: 10ArielGlenn) [06:59:54] (03PS1) 10ArielGlenn: add new config settings for en wikipedia dumps [puppet] - 10https://gerrit.wikimedia.org/r/346038 [07:01:45] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [07:03:39] (03CR) 10ArielGlenn: [C: 032] add new config settings for en wikipedia dumps [puppet] - 10https://gerrit.wikimedia.org/r/346038 (owner: 10ArielGlenn) [07:20:40] (03CR) 10ArielGlenn: [C: 032] retry failed page content pieces immediately after page content step completes [dumps] - 10https://gerrit.wikimedia.org/r/345985 (https://phabricator.wikimedia.org/T160507) (owner: 10ArielGlenn) [07:21:45] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:25:39] !log ariel@tin Started deploy [dumps/dumps@1ac3fb3]: var/method name cleanups, refactor, pregenerate page ranges for page content jobs, auto retry of failed page ranges [07:25:42] !log ariel@tin Finished deploy [dumps/dumps@1ac3fb3]: var/method name cleanups, refactor, pregenerate page ranges for page content jobs, auto retry of failed page ranges (duration: 00m 03s) [07:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:14] (03PS1) 10ArielGlenn: Revert "disable full dumps cron job for a bit" [puppet] - 10https://gerrit.wikimedia.org/r/346039 [07:27:29] (03PS2) 10ArielGlenn: Revert "disable full dumps cron job for a bit" [puppet] - 10https://gerrit.wikimedia.org/r/346039 [07:30:27] (03CR) 10ArielGlenn: [C: 032] Revert "disable full dumps cron job for a bit" [puppet] - 10https://gerrit.wikimedia.org/r/346039 (owner: 10ArielGlenn) [08:02:45] PROBLEM - puppet last run on mc1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:16:12] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3148796 (10hashar) Machine had a load spike at 1:00am. It shows high disk IOPS since 1:00 and the disk utilisation largely exploded. There is 35-45% CPU usage for `md1_raid... [08:18:28] !log powercycle ms-be1016 (stuck in console, answers pings but not ssh) [08:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:55] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:21:05] PROBLEM - Host ms-be1016 is DOWN: PING CRITICAL - Packet loss = 100% [08:21:45] RECOVERY - SSH on ms-be1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [08:21:55] RECOVERY - Host ms-be1016 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [08:22:05] RECOVERY - dhclient process on ms-be1016 is OK: PROCS OK: 0 processes with command name dhclient [08:22:05] RECOVERY - Check size of conntrack table on ms-be1016 is OK: OK: nf_conntrack is 9 % full [08:22:06] RECOVERY - swift-container-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:22:06] RECOVERY - swift-container-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:22:06] RECOVERY - swift-container-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:22:06] RECOVERY - swift-object-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:22:06] RECOVERY - swift-object-server on ms-be1016 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:22:07] RECOVERY - DPKG on ms-be1016 is OK: All packages OK [08:22:07] RECOVERY - MD RAID on ms-be1016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [08:22:08] RECOVERY - swift-account-auditor on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:22:08] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1016 is OK: OK ferm input default policy is set [08:22:09] RECOVERY - swift-account-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [08:23:33] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3106352 (10elukey) Just powercycled ms-be1016 that was stuck in console (pingable but no ssh available): `[11674384.225319] BUG: soft lockup - CPU#12 stuck for 22s... [08:29:45] RECOVERY - NTP on ms-be1016 is OK: NTP OK: Offset -5.477666855e-05 secs [08:30:46] RECOVERY - puppet last run on mc1029 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [08:47:55] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [09:33:45] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:49:45] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:02:45] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [10:17:45] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:57:24] (03PS1) 10DatGuy: Convert reference lists to 'responsive' on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346043 (https://phabricator.wikimedia.org/T161804) [11:09:07] (03PS2) 10DatGuy: Convert reference lists to 'responsive' on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346043 (https://phabricator.wikimedia.org/T161804) [11:29:32] (03PS1) 10DatGuy: Configure Babel for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346044 (https://phabricator.wikimedia.org/T161593) [11:37:24] (03CR) 10Luke081515: [C: 031] Configure Babel for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346044 (https://phabricator.wikimedia.org/T161593) (owner: 10DatGuy) [11:39:45] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:07:45] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:24:15] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:30:30] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3148902 (10Cmjohnson) @elukey yes it's a known problem. I have the new part but @fgiunchedi is out this week. We'll take care of it next week. https://phabricator... [12:53:15] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:55:35] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1970.80 Read Requests/Sec=3035.40 Write Requests/Sec=0.20 KBytes Read/Sec=27955.20 KBytes_Written/Sec=39.60 [13:03:45] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:06:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:08:35] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=181.90 Read Requests/Sec=191.00 Write Requests/Sec=73.60 KBytes Read/Sec=3328.40 KBytes_Written/Sec=380.40 [13:11:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:19:45] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:29:01] (03PS1) 10Urbanecm: Switch from $stdlogo to static resources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) [13:30:09] (03PS2) 10Urbanecm: Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) [13:35:45] PROBLEM - puppet last run on d-i-test is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:38:52] (03PS3) 10Urbanecm: Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) [13:41:52] (03PS4) 10Urbanecm: Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) [13:42:59] (03PS5) 10Urbanecm: Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) [14:03:46] RECOVERY - puppet last run on d-i-test is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:44:55] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:05] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:11:55] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:19:05] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active [15:22:05] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [15:26:12] (03PS1) 10Andrew Bogott: instance-info-dumper: Use mwopenstackclient rather than the nova client directly. [puppet] - 10https://gerrit.wikimedia.org/r/346048 (https://phabricator.wikimedia.org/T158650) [15:27:57] (03CR) 10Andrew Bogott: [C: 032] instance-info-dumper: Use mwopenstackclient rather than the nova client directly. [puppet] - 10https://gerrit.wikimedia.org/r/346048 (https://phabricator.wikimedia.org/T158650) (owner: 10Andrew Bogott) [15:37:05] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:42:26] (03PS1) 10Andrew Bogott: instance-info-dumper: fix config path debug change [puppet] - 10https://gerrit.wikimedia.org/r/346049 [15:49:15] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:50:59] (03CR) 10Andrew Bogott: [C: 032] instance-info-dumper: fix config path debug change [puppet] - 10https://gerrit.wikimedia.org/r/346049 (owner: 10Andrew Bogott) [16:17:15] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:23:51] (03CR) 10Dereckson: [C: 031] "as long as it's optipng too, it's fine" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [18:13:11] (03PS1) 10Urbanecm: Convert $stdlogo to static/images/project-logos resources at Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346054 (https://phabricator.wikimedia.org/T161980) [18:16:19] (03PS2) 10Urbanecm: Convert $stdlogo to static/images/project-logos resources at Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346054 (https://phabricator.wikimedia.org/T161980) [18:25:32] (03CR) 10Dereckson: [C: 031] "Checked all logos files exist. Fine for me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/346054 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [18:46:55] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:09:05] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:14:55] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [19:31:16] (03PS1) 10Urbanecm: Optimalize all not-optimalized logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) [19:32:55] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:36:05] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:43:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 24 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:48:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:00:55] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:19:05] (03CR) 10Dereckson: [C: 031] "yeah but -1B, that qualifies to the best of optimization :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm) [20:20:42] (03CR) 10Urbanecm: "Don't understand. Should I use optipng -1B? Or did I do something wrong?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm) [20:22:43] (03CR) 10Dereckson: [C: 031] "Gerrit shows the size delta. It seems for nostalagiawiki and shwiktionary, you optimized removing one unique byte." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm) [20:25:13] (03CR) 10Urbanecm: "Oh, now I see. Thanks-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm) [21:00:15] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:08:15] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:16:15] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:20:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:23:55] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3198.50 Read Requests/Sec=3216.40 Write Requests/Sec=20.00 KBytes Read/Sec=25533.60 KBytes_Written/Sec=7070.00 [21:29:15] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:33:45] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.70 Read Requests/Sec=3.60 Write Requests/Sec=9.70 KBytes Read/Sec=15.60 KBytes_Written/Sec=53.60 [21:37:15] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:44:15] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [22:07:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:07:45] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 22 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:12:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:12:45] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 16 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:13:05] PROBLEM - puppet last run on ms-be1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:18:15] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:29:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:36:55] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:39:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:41:05] RECOVERY - puppet last run on ms-be1037 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [22:46:15] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:47:15] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [22:56:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:05:55] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [23:06:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:14:15] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [23:37:15] PROBLEM - puppet last run on radium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:43:05] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:49:25] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues