[00:11:03] (CR) Dzahn: "no-op http://puppet-compiler.wmflabs.org/6549/" [puppet] - https://gerrit.wikimedia.org/r/355871 (owner: Dzahn)
[02:25:07] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 08m 41s)
[02:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:24] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 05s)
[02:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:51:13] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat May 27 02:51:13 UTC 2017 (duration 6m 49s)
[02:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:51] PROBLEM - CPU frequency on tin is CRITICAL: CRITICAL: CPU frequency is < 600 MHz (162 MHz)
[03:15:01] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:09:51] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=459.50 Read Requests/Sec=294.20 Write Requests/Sec=10.80 KBytes Read/Sec=36436.80 KBytes_Written/Sec=410.00
[04:18:51] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.60 Read Requests/Sec=331.60 Write Requests/Sec=5.70 KBytes Read/Sec=3646.00 KBytes_Written/Sec=434.80
[04:24:41] RECOVERY - MariaDB Slave Lag: s4 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 12.07 seconds
[04:27:11] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:01] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set
[04:50:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[04:51:41] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[04:53:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[04:54:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[04:58:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:01:42] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:01:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:01:43] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:02:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:04:41] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:16:11] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[05:16:13] ACKNOWLEDGEMENT - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166422
[05:16:16] Operations, ops-eqiad: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3295950 (ops-monitoring-bot)
[09:03:52] ouch this is the EL master
[09:04:55] Operations, ops-eqiad, Analytics-Kanban: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3296013 (elukey) p:Triage>High
[09:12:12] should be one out of 12 in raid10, all good
[09:52:02] Operations, Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3296023 (Volans) Resolved>Open `tin` hit this today. I've tried to `rmmod mei_me` and `rmmod mei` as suggested above, but didn't fix the problem live, it probably needs a reboot, but I'm not rebooting it rig...
[09:53:53] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3296025 (Volans)
[09:57:21] PROBLEM - MegaRAID on db1048 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[10:03:53] <_joe_> marostegui: ^^ FYI
[10:17:29] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3296028 (Volans) Resolved>Open Re-opening as it alarmed again today for the write policy... the battery is reported to be from 2010, was not swapped...
[10:24:00] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3296030 (Volans) So far the lag is limited to 3~4 seconds according to tendril, while from Grafana is flat zero, looks like the dashboard is not graphing th...
[12:51:55] (PS1) Giuseppe Lavagetto: Fix tox.ini in order to work on newer systems [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355876
[12:51:57] (PS1) Giuseppe Lavagetto: Enable flake8 enforcement on part of the code [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355877
[12:51:59] (PS1) Giuseppe Lavagetto: Enable flake8 on pybal/monitors [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355878
[12:53:35] (CR) jerkins-bot: [V: -1] Enable flake8 enforcement on part of the code [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355877 (owner: Giuseppe Lavagetto)
[12:54:40] (CR) jerkins-bot: [V: -1] Enable flake8 on pybal/monitors [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355878 (owner: Giuseppe Lavagetto)
[13:00:15] <_joe_> uhm what did I miss
[13:12:15] (PS2) Giuseppe Lavagetto: Enable flake8 enforcement on part of the code [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355877
[13:12:17] (PS2) Giuseppe Lavagetto: Enable flake8 on pybal/monitors [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355878
[15:17:21] RECOVERY - CPU frequency on tin is OK: OK: CPU frequency is >= 600 MHz (1199 MHz)
[15:24:51] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is OK: Files ownership is ok.
[15:34:01] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100%
[15:34:51] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 32.92 ms
[15:54:11] PROBLEM - Host mw1294 is DOWN: PING CRITICAL - Packet loss = 100%
[16:35:21] PROBLEM - swift-account-reaper on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:35:21] PROBLEM - swift-account-auditor on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:35:21] PROBLEM - swift-object-replicator on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:36:11] RECOVERY - swift-account-reaper on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[16:36:11] RECOVERY - swift-account-auditor on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[16:36:11] RECOVERY - swift-object-replicator on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[17:07:21] RECOVERY - MegaRAID on db1048 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[17:22:07] (PS1) Multichill: Adding the domain for the Bayerische Staatsgemäldesammlungen [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437)
[17:26:11] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[17:27:11] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[17:32:18] (PS1) Dzahn: admins: revoke ladsgroups key temporarily [puppet] - https://gerrit.wikimedia.org/r/355882
[17:33:29] (PS2) Dzahn: admins: revoke ladsgroups key temporarily [puppet] - https://gerrit.wikimedia.org/r/355882
[17:35:58] (CR) Dzahn: [C: 2] admins: revoke ladsgroups key temporarily [puppet] - https://gerrit.wikimedia.org/r/355882 (owner: Dzahn)
[17:43:26] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3296291 (Marostegui) It was swapped a few weeks ago, but I guess the new one is also pretty old as it comes from hosts previously decommissioned - right @Cm...
[17:45:04] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3296292 (Volans) And db1048 returned to WriteBack policy less than 1h ago 😛
[17:45:50] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3296293 (Marostegui) Same behaviour as we have seen before with faulty BBUs :-(
[19:20:07] (PS2) Framawiki: Adding media.static.onlinesammlung.thenetexperts.info to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437) (owner: Multichill)
[19:22:58] (CR) Framawiki: [C: 1] Adding media.static.onlinesammlung.thenetexperts.info to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437) (owner: Multichill)
[19:26:52] (CR) Multichill: "Framawiki, are you kidding me? You're changing the subject line for a one line commit?" [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437) (owner: Multichill)
[19:29:36] (CR) Dereckson: "Well, I imagine there are cases oneliner could have a bit of context to explain why it goes there." [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437) (owner: Multichill)
[19:30:41] (CR) Dereckson: "And as the domains are already in the config, multichill title was more informative than raw domains." [mediawiki-config] - https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437) (owner: Multichill)
[20:47:43] Just got a server error when trying to save an article: Request from 2602:306:3712:5170:19:847b:a867:f704 via cp4018 cp4018, Varnish XID 119521278
[20:47:44] Error: 503, Backend fetch failed at Sat, 27 May 2017 20:46:32 GMT
[21:00:29] kaldari hi, did it happen to you twice?
[21:00:56] no, but it still says to report it, so I did :)
[21:01:05] no need to panic
[21:01:39] ok
[21:52:21] PROBLEM - puppet last run on es1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:02:31] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:04:21] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[22:21:21] RECOVERY - puppet last run on es1015 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[23:06:31] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[23:08:31] RECOVERY - Check whether ferm is active by checking the default input chain on restbase-dev1003 is OK: OK ferm input default policy is set