[00:00:11] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/6409/" [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:01:02] (03CR) 10Dzahn: [C: 04-1] "looks ok on iridium but not on phab2001 (?)" [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:13:55] (03CR) 10Paladox: [C: 04-1] "Moving ganglia from site.pp fails on labs." [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:20:31] (03PS5) 10Dzahn: phabricator: convert to profile/role-structure [puppet] - 10https://gerrit.wikimedia.org/r/353600
[00:22:27] (03CR) 10Dzahn: [C: 04-1] "amended to move ::ganglia back to site, but i think we could probably also remove it altogether. maybe in follow-up" [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:23:44] (03CR) 10Paladox: [C: 031] "Thanks. Works now :)" [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:24:44] (03CR) 10Paladox: [C: 031] phabricator: convert to profile/role-structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:31:15] !log T165139: Truncating RESTBase summary tables (corruption)
[00:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:25] T165139: Extension output is wrapped in breaking editing in VE and rendering elsewhere - https://phabricator.wikimedia.org/T165139
[00:32:14] (03PS6) 10Dzahn: phabricator: convert to profile/role-structure [puppet] - 10https://gerrit.wikimedia.org/r/353600
[00:32:39] (03CR) 10Dzahn: phabricator: convert to profile/role-structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:32:49] * urandom whistles "burning down the house", from the talking heads
[00:34:23] (03CR) 10Paladox: [C: 031] phabricator: convert to profile/role-structure [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:37:48] (03CR) 10Dzahn: [C: 04-1] "not quite there yet http://puppet-compiler.wmflabs.org/6410/iridium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:38:49] (03CR) 10Dzahn: [C: 04-1] "now it's the other way around, iridium changes but phab2001 does not. the change in the role name is normal, but i dont want to see the rs" [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:39:29] (03CR) 10Paladox: [C: 031] phabricator: convert to profile/role-structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:44:56] (03PS7) 10Paladox: phabricator: convert to profile/role-structure [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:46:57] (03CR) 10Dzahn: [C: 031] "ok, thanks! now as i wanted it: http://puppet-compiler.wmflabs.org/6411/" [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:47:26] (03CR) 10Paladox: [C: 031] phabricator: convert to profile/role-structure [puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn)
[00:54:44] !log T165139: Truncating RESTBase feed_aggregated tables (corruption)
[00:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:53] T165139: Extension output is wrapped in breaking editing in VE and rendering elsewhere - https://phabricator.wikimedia.org/T165139
[01:01:36] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[01:01:37] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[01:01:37] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[01:01:37] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy
[01:01:37] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[01:01:37] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
[01:01:37] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
[01:01:38] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[01:01:38] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[01:01:39] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy
[01:01:39] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[01:01:40] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
[01:01:40] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy
[01:01:41] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[01:18:24] !log zotero restart as memis above 50%
[01:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:16] PROBLEM - swift-object-updater on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:35:06] RECOVERY - swift-object-updater on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[02:00:36] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:00:36] PROBLEM - MD RAID on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:01:26] RECOVERY - MD RAID on ms-be1021 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[02:01:26] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures
[02:06:06] PROBLEM - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 3.302 second response time
[02:06:57] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.096 second response time
[02:17:26] PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time
[02:17:26] PROBLEM - HHVM rendering on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[02:17:46] PROBLEM - Apache HTTP on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[02:18:26] RECOVERY - Nginx local proxy to apache on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.188 second response time
[02:18:27] RECOVERY - HHVM rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 73546 bytes in 0.237 second response time
[02:18:46] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.109 second response time
[02:27:25] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 07s)
[02:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:08] hmmmm are notifications broken?
[02:33:34] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat May 13 02:33:34 UTC 2017 (duration 6m 9s)
[02:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:29] 06Operations, 06Labs: Disable keystone admin_token usage - https://phabricator.wikimedia.org/T165211#3260105 (10chasemp)
[02:37:36] 06Operations, 06Labs: Initial OpenStack Neutron PoC deployment in Labtest - https://phabricator.wikimedia.org/T153099#2869590 (10chasemp) a:03chasemp
[04:10:06] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2373.10 Read Requests/Sec=2855.10 Write Requests/Sec=0.50 KBytes Read/Sec=29679.20 KBytes_Written/Sec=14.40
[04:20:06] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=13.30 Read Requests/Sec=0.40 Write Requests/Sec=0.40 KBytes Read/Sec=25.60 KBytes_Written/Sec=14.00
[04:34:46] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR
[04:34:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR
[04:38:56] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 40 probes of 437 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[04:43:56] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 437 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[06:03:57] 06Operations, 10Mail, 10Wikimedia-Mailing-lists, 05Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3260174 (10jayvdb) [[https://lists.wikimedia.org/pipermail/wikimania-l/2017-May/008007.html|Another one]] occurred just now on wikimania-l, again targeting Katie ;-( The rules I a...
[06:56:36] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1800.0]
[07:18:56] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0]
[07:36:06] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0]
[08:12:36] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:41:36] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[08:48:46] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:48:46] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[10:36:26] PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[10:36:26] PROBLEM - HHVM rendering on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[10:36:46] PROBLEM - MegaRAID on labstore1003 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded)
[10:36:50] ACKNOWLEDGEMENT - MegaRAID on labstore1003 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T165220
[10:36:55] 06Operations, 10ops-eqiad: Degraded RAID on labstore1003 - https://phabricator.wikimedia.org/T165220#3260322 (10ops-monitoring-bot)
[10:37:26] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.115 second response time
[10:37:26] RECOVERY - HHVM rendering on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 73466 bytes in 0.242 second response time
[11:49:06] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0]
[12:27:30] !log restarting wdqs updater on wdqs cluster
[12:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:06] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0]
[13:13:06] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0]
[13:43:26] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:44:26] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:58:36] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0]
[14:10:48] 06Operations, 10ops-eqiad, 06Labs: Degraded RAID on labstore1003 - https://phabricator.wikimedia.org/T165220#3260426 (10Paladox)
[15:50:36] PROBLEM - swift-container-replicator on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:51:36] RECOVERY - swift-container-replicator on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[16:26:36] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2012139
[18:06:36] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0
[18:33:36] PROBLEM - swift-object-updater on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:36] PROBLEM - swift-object-replicator on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:36] PROBLEM - swift-container-auditor on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:37] PROBLEM - swift-container-updater on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:37] PROBLEM - salt-minion processes on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:37] PROBLEM - dhclient process on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:37] PROBLEM - swift-container-server on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:37] PROBLEM - swift-object-server on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:38] PROBLEM - swift-container-replicator on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:38] PROBLEM - swift-object-auditor on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:39] PROBLEM - swift-account-server on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:39] PROBLEM - swift-account-auditor on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:40] PROBLEM - swift-account-reaper on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:40] PROBLEM - swift-account-replicator on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:34:26] RECOVERY - swift-object-updater on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[18:34:26] RECOVERY - swift-container-auditor on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:34:27] RECOVERY - swift-container-updater on ms-be1020 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[18:34:27] RECOVERY - swift-object-replicator on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[18:34:27] RECOVERY - salt-minion processes on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:34:27] RECOVERY - swift-container-server on ms-be1020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:34:27] RECOVERY - dhclient process on ms-be1020 is OK: PROCS OK: 0 processes with command name dhclient
[18:34:27] RECOVERY - swift-object-server on ms-be1020 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[18:34:28] RECOVERY - swift-account-server on ms-be1020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[18:34:28] RECOVERY - swift-account-auditor on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[18:34:29] RECOVERY - swift-container-replicator on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[18:34:29] RECOVERY - swift-account-reaper on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[18:34:30] RECOVERY - swift-object-auditor on ms-be1020 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[18:34:30] RECOVERY - swift-account-replicator on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[19:46:26] PROBLEM - Apache HTTP on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time
[19:46:26] PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time
[19:46:26] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[19:47:26] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.120 second response time
[19:47:26] RECOVERY - Nginx local proxy to apache on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.194 second response time
[19:47:26] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 73450 bytes in 0.292 second response time
[21:02:46] PROBLEM - MD RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:02:47] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:02:47] PROBLEM - configured eth on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:03:36] RECOVERY - MD RAID on ms-be1019 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[21:03:36] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[21:03:46] RECOVERY - configured eth on ms-be1019 is OK: OK - interfaces up
[21:34:34] (03CR) 10Krinkle: [C: 031] Jenkins: install jdk, not just jre [puppet] - 10https://gerrit.wikimedia.org/r/348961 (owner: 10Chad)
[23:09:16] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:11:36] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:15:16] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:16:36] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]