[00:11:03] (CR) Dzahn: "no-op http://puppet-compiler.wmflabs.org/6549/" [puppet] - https://gerrit.wikimedia.org/r/355871 (owner: Dzahn)
[02:25:07] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 08m 41s)
[02:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:24] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 05s)
[02:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:51:13] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat May 27 02:51:13 UTC 2017 (duration 6m 49s)
[02:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:51] PROBLEM - CPU frequency on tin is CRITICAL: CRITICAL: CPU frequency is < 600 MHz (162 MHz)
[03:15:01] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:09:51] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=459.50 Read Requests/Sec=294.20 Write Requests/Sec=10.80 KBytes Read/Sec=36436.80 KBytes_Written/Sec=410.00
[04:18:51] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.60 Read Requests/Sec=331.60 Write Requests/Sec=5.70 KBytes Read/Sec=3646.00 KBytes_Written/Sec=434.80
[04:24:41] RECOVERY - MariaDB Slave Lag: s4 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 12.07 seconds
[04:27:11] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:01] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set
[04:50:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[04:51:41] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[04:53:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[04:54:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[04:58:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:01:42] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:01:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:01:43] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:02:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:04:41] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:16:11] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[05:16:13] ACKNOWLEDGEMENT - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166422
[05:16:16] Operations, ops-eqiad: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3295950 (ops-monitoring-bot)
[09:03:52] ouch this is the EL master
[09:04:55] Operations, ops-eqiad, Analytics-Kanban: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3296013 (elukey) p:Triage>High
[09:12:12] should be one out of 12 in raid10, all good
[09:52:02] Operations, Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3296023 (Volans) Resolved>Open `tin` hit this today. I've tried to `rmmod mei_me` and `rmmod mei` as suggested above, but didn't fix the problem live, it probably needs a reboot, but I'm not rebooting it rig...
[09:53:53] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3296025 (Volans)
[09:57:21] PROBLEM - MegaRAID on db1048 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[10:03:53] <_joe_> marostegui: ^^ FYI
[10:17:29] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3296028 (Volans) Resolved>Open Re-opening as it alarmed again today for the write policy... the battery is reported to be from 2010, was not swapped...
[10:24:00] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3296030 (Volans) So far the lag is limited to 3~4 seconds according to tendril, while from Grafana is flat zero, looks like the dashboard is not graphing th...
[12:51:55] (PS1) Giuseppe Lavagetto: Fix tox.ini in order to work on newer systems [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355876
[12:51:57] (PS1) Giuseppe Lavagetto: Enable flake8 enforcement on part of the code [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355877
[12:51:59] (PS1) Giuseppe Lavagetto: Enable flake8 on pybal/monitors [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355878
[12:53:35] (CR) jerkins-bot: [V: -1] Enable flake8 enforcement on part of the code [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355877 (owner: Giuseppe Lavagetto)
[12:54:40] (CR) jerkins-bot: [V: -1] Enable flake8 on pybal/monitors [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355878 (owner: Giuseppe Lavagetto)
[13:00:15] <_joe_> uhm what did I miss
[13:12:15] (PS2) Giuseppe Lavagetto: Enable flake8 enforcement on part of the code [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355877
[13:12:17] (PS2) Giuseppe Lavagetto: Enable flake8 on pybal/monitors [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355878
[15:17:21] RECOVERY - CPU frequency on tin is OK: OK: CPU frequency is >= 600 MHz (1199 MHz)
[15:24:51] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is OK: Files ownership is ok.
[15:34:01] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100%
[15:34:51] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 32.92 ms
[15:54:11] PROBLEM - Host mw1294 is DOWN: PING CRITICAL - Packet loss = 100%
[16:35:21] PROBLEM - swift-account-reaper on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:35:21] PROBLEM - swift-account-auditor on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:35:21] PROBLEM - swift-object-replicator on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:36:11] RECOVERY - swift-account-reaper on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[16:36:11] RECOVERY - swift-account-auditor on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[16:36:11] RECOVERY - swift-object-replicator on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[17:07:21] RECOVERY - MegaRAID on db1048 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[17:22:07] (PS1) Multichill: Adding the domain for the Bayerische Staatsgemäldesammlungen [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437)
[17:26:11] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[17:27:11] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[17:32:18] (PS1) Dzahn: admins: revoke ladsgroups key temporarily [puppet] - https://gerrit.wikimedia.org/r/355882
[17:33:29] (PS2) Dzahn: admins: revoke ladsgroups key temporarily [puppet] - https://gerrit.wikimedia.org/r/355882
[17:35:58] (CR) Dzahn: [C: 2] admins: revoke ladsgroups key temporarily [puppet] - https://gerrit.wikimedia.org/r/355882 (owner: Dzahn)
[17:43:26] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3296291 (Marostegui) It was swapped a few weeks ago, but I guess the new one is also pretty old as it comes from hosts previously decommissioned - right @Cm...
[17:45:04] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3296292 (Volans) And db1048 returned to WriteBack policy less than 1h ago 😛
[17:45:50] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3296293 (Marostegui) Same behaviour as we have seen before with faulty BBUs :-(
[19:20:07] (PS2) Framawiki: Adding media.static.onlinesammlung.thenetexperts.info to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437) (owner: Multichill)
[19:22:58] (CR) Framawiki: [C: 1] Adding media.static.onlinesammlung.thenetexperts.info to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437) (owner: Multichill)
[19:26:52] (CR) Multichill: "Framawiki, are you kidding me? You're changing the subject line for a one line commit?" [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437) (owner: Multichill)
[19:29:36] (CR) Dereckson: "Well, I imagine there are cases oneliner could have a bit of context to explain why it goes there." [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437) (owner: Multichill)
[19:30:41] (CR) Dereckson: "And as the domains are already in the config, multichill title was more informative than raw domains." [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437) (owner: Multichill)
[20:47:43] Just got a server error when trying to save an article: Request from 2602:306:3712:5170:19:847b:a867:f704 via cp4018 cp4018, Varnish XID 119521278
[20:47:44] Error: 503, Backend fetch failed at Sat, 27 May 2017 20:46:32 GMT
[21:00:29] kaldari hi, did it happen to you twice?
[21:00:56] no, but it still says to report it, so I did :)
[21:01:05] no need to panic
[21:01:39] ok
[21:52:21] PROBLEM - puppet last run on es1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:02:31] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:04:21] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[22:21:21] RECOVERY - puppet last run on es1015 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[23:06:31] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[23:08:31] RECOVERY - Check whether ferm is active by checking the default input chain on restbase-dev1003 is OK: OK ferm input default policy is set