[01:04:31] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1507424665 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4224630 keys, up 4 minutes 22 seconds - replication_delay is 1507424665
[01:04:32] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[01:05:01] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1507424693 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4222763 keys, up 4 minutes 50 seconds - replication_delay is 1507424693
[01:05:41] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4220368 keys, up 5 minutes 29 seconds - replication_delay is 0
[01:05:43] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4219389 keys, up 5 minutes 30 seconds - replication_delay is 0
[01:06:01] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4219251 keys, up 5 minutes 51 seconds - replication_delay is 0
[02:10:50] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: frpm1001 is dead, looks like hardware failure - https://phabricator.wikimedia.org/T177710#3667572 (10Jgreen)
[02:11:00] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: frpm1001 is dead, looks like hardware failure - https://phabricator.wikimedia.org/T177710#3667585 (10Jgreen) p:05Triage>03Unbreak!
[03:18:51] (03CR) 10Jayprakash12345: [C: 04-1] "This is for wuu.wiki or zh.wiki? See https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wmf-config/InitialiseSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383007 (https://phabricator.wikimedia.org/T165593) (owner: 10Zoranzoki21)
[03:21:11] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[03:22:51] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[03:26:21] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 786.65 seconds
[04:09:33] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 269.90 seconds
[05:12:31] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:12:42] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:22:51] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:26:23] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:27:52] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:29:32] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:30:11] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[05:36:21] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[07:56:41] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
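The "Upload HTTP 5xx reqs/min" and "Ulsfo HTTP 5xx reqs/min" alerts above come from a Graphite-backed check that reports what fraction of recent datapoints sit above a fixed threshold (hence "22.22% of data above the critical threshold [1000.0]"). The sketch below illustrates that idea against the standard Graphite render API; the endpoint, metric name, and thresholds are placeholders, not the production check_graphite configuration.

```python
# Minimal, hypothetical sketch of a "percentage of datapoints above a
# threshold" check, modelled on the alert text above. Not the WMF plugin.
import requests

GRAPHITE = "https://graphite.example.org"  # hypothetical endpoint
METRIC = "reqstats.5xx"                    # hypothetical metric name
CRITICAL_VALUE = 1000.0                    # per-datapoint threshold
CRITICAL_PCT = 20.0                        # alert if this % of points exceed it

def pct_above_threshold(metric, minutes=10):
    """Fetch recent datapoints from the Graphite render API and return the
    percentage of non-null values above CRITICAL_VALUE."""
    resp = requests.get(
        f"{GRAPHITE}/render",
        params={"target": metric, "from": f"-{minutes}min", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    series = resp.json()
    points = [v for s in series for v, _ts in s["datapoints"] if v is not None]
    if not points:
        return 0.0
    return 100.0 * sum(1 for v in points if v > CRITICAL_VALUE) / len(points)

if __name__ == "__main__":
    pct = pct_above_threshold(METRIC)
    state = "CRITICAL" if pct >= CRITICAL_PCT else "OK"
    print(f"{state}: {pct:.2f}% of data above the critical threshold [{CRITICAL_VALUE}]")
```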
[07:58:02] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:19:01] !log restart varnish backend on cp4026 to stop 503s
[08:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:03] a ton of 503s during the past hours - https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=upload&from=now-12h&to=now
[08:28:01] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0
[08:34:38] <_joe_> elukey: sigh
[08:37:39] ciao _joe_ :)
[08:42:21] <_joe_> elukey: not a ton of 503s but still worrying no one had to notice
[08:42:48] <_joe_> we are not supposed to be around here on sunday, checking IRC for 503s
[08:42:55] <_joe_> there is the pager for such things
[08:44:49] yep
[08:50:17] (afk again :)
[08:51:42] (03CR) 10Zoranzoki21: "Jay, when I in search in program Synwrite pasted 模块, I changed where search found it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383007 (https://phabricator.wikimedia.org/T165593) (owner: 10Zoranzoki21)
[08:57:26] (03Abandoned) 10Zoranzoki21: Modification of the default alias for namespace 828 "模块:" of zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383007 (https://phabricator.wikimedia.org/T165593) (owner: 10Zoranzoki21)
[09:14:09] Hi
[09:14:32] Are you tweaking anything on labs right now?
[09:15:03] the PDF thumbnailer on English Wikisource gums up when trying to talk to what it claims is one of the labs servers and doesn't load images properly
[09:18:29] Fif your code so it doesn't break please
[09:18:32] *Fix
[09:28:53] this isn't the labs channel
[09:29:15] and demanding people fix their code isn't usually the best idea
[09:31:32] closedmouth: Okay sorry
[09:31:40] but it frustrating
[09:31:59] when something that should be working isn't
[09:32:23] I will note that poor performance of the PDF thumbnailer was something I mentioned previously
[09:32:44] It hasn't been resolved apparently
[09:35:22] closedmouth: I mentioned the issue here in case it was a server problem
[10:05:35] closedmouth: Apologies about earlier - but the labs channel seems not to have anyone in it.
[10:06:03] that is able to respond currently
[10:06:19] Looks like I'll have to find the former phab ticket
[11:24:14] 10Operations, 10Datasets-General-or-Unknown, 10Patch-For-Review: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3667686 (10Volans) We had a re-occurrence of the same, with a very similar stack trace and the same consequences: ``` [Oct 7 23:27]...
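The "puppet last run" CRITICALs earlier in the log (and the acknowledgements that follow) are freshness checks: once NFS on dataset1001 hung, the agents on the mounting hosts stopped completing runs, so the recorded last-run time kept ageing. Below is a rough sketch of such a check, assuming the conventional last_run_summary.yaml location; the actual WMF plugin, path, and thresholds may differ.

```python
# Rough sketch of a "puppet last run" freshness check, along the lines of
# "CRITICAL: Puppet last ran 6 hours ago". Path and thresholds are assumptions.
import sys
import time
import yaml

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"  # assumed location
WARN_AGE = 6 * 3600    # seconds
CRIT_AGE = 12 * 3600

def check_last_run(path=SUMMARY):
    with open(path) as fh:
        summary = yaml.safe_load(fh)
    last_run = summary["time"]["last_run"]  # epoch seconds of the last agent run
    age = time.time() - last_run
    hours = age / 3600
    if age >= CRIT_AGE:
        return 2, f"CRITICAL: Puppet last ran {hours:.0f} hours ago"
    if age >= WARN_AGE:
        return 1, f"WARNING: Puppet last ran {hours:.0f} hours ago"
    return 0, f"OK: Puppet last ran {age / 60:.0f} minutes ago"

if __name__ == "__main__":
    code, message = check_last_run()
    print(message)
    sys.exit(code)  # Nagios/Icinga exit codes: 0 OK, 1 WARNING, 2 CRITICAL
```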
[11:27:19] ACKNOWLEDGEMENT - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet last ran 11 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:27:19] ACKNOWLEDGEMENT - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:27:19] ACKNOWLEDGEMENT - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:27:19] ACKNOWLEDGEMENT - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:27:19] ACKNOWLEDGEMENT - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet last ran 11 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:27:19] ACKNOWLEDGEMENT - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:28:21] !log ack-ed puppet not running on stat100[5-6],snapshot100[1,5-7] due to NFSD stuck on dataset1001 - T169680
[11:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:28] T169680: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680
[14:29:12] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:37:17] 10Operations, 10ops-eqiad, 10DBA: check db1054 power supply redundancy - https://phabricator.wikimedia.org/T177628#3667792 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson The PSU was indeed dead, swapped it with a spare from a decom'd server. Both power supplies are working.
[14:44:29] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: frpm1001 is dead, looks like hardware failure - https://phabricator.wikimedia.org/T177710#3667572 (10Cmjohnson) The system board for frpm1001 will need to be replaced, the server will not power on, tried reseating power supplies, DIMM, CPU1 with no luck. Th...
[14:45:13] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3667812 (10Cmjohnson) A support ticket has been placed with HP. Sent them the AHS log.
[14:47:11] RECOVERY - IPMI Sensor Status on db1052 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[14:47:17] 10Operations, 10ops-eqiad, 10DBA: check db1052 power supply redundancy - https://phabricator.wikimedia.org/T177627#3667816 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson db1052 has a bad PSU, swapped with a spare from a decom'd server
[14:48:29] 10Operations, 10ops-eqiad, 10DBA: check db1080 power supply redundancy - https://phabricator.wikimedia.org/T177630#3667819 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Loose cable, fixed
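The "IPMI Sensor Status" recoveries around these power-supply swaps report the Power_Supply sensor group returning to OK. As a hedged illustration only (not the production plugin, which differs in detail), power-supply sensor state can be read with ipmitool as sketched below; ipmitool must be installed and typically needs root.

```python
# Illustrative sketch: read power-supply sensors via ipmitool and summarise
# them in an Icinga-style one-liner. Not a WMF tool.
import subprocess

def power_supply_status():
    """Return a list of (sensor_name, status) tuples for power-supply sensors."""
    out = subprocess.run(
        ["ipmitool", "sdr", "type", "Power Supply"],
        capture_output=True, text=True, check=True,
    ).stdout
    sensors = []
    for line in out.splitlines():
        # Typical line: "PS Redundancy | 77h | ok | 7.1 | Fully Redundant"
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 3 and fields[0]:
            sensors.append((fields[0], fields[2]))
    return sensors

if __name__ == "__main__":
    readings = power_supply_status()
    bad = [(name, status) for name, status in readings if status not in ("ok", "ns")]
    if bad:
        print("CRITICAL: " + ", ".join(f"{n}={s}" for n, s in bad))
    else:
        print("OK: Power_Supply Status: OK")
```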
[14:51:38] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, 10Elasticsearch: check elastic1022 power supply redundancy - https://phabricator.wikimedia.org/T177631#3665166 (10Cmjohnson) Both PSU are working, cleared the log. still showing as critical in Icinga
[14:58:14] 10Operations, 10ops-eqiad, 10DBA: check db1052 power supply redundancy - https://phabricator.wikimedia.org/T177627#3667827 (10Marostegui) Thanks a lot @Cmjohnson - confirmed they are looking good now ``` /admin1/system1/logs1/log1-> show record1 properties CreationTimestamp = 20171008144224.000000-300...
[14:59:12] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:59:27] 10Operations, 10ops-eqiad: check kafka1022 power supply status - https://phabricator.wikimedia.org/T177633#3665200 (10Cmjohnson) The PSU was lit as though it was working, the racadm log showed lost redundancy so I replaced the 2nd PSU for kafka1022
[15:03:16] 10Operations, 10ops-eqiad: check mw1200 power supply redundancy - https://phabricator.wikimedia.org/T177635#3665237 (10Cmjohnson) Currently the PSU appears to be working but it has lost redundancy several times according to the h/w log. I want to check the settings on this and several other servers that show no...
[15:11:51] 10Operations, 10ops-eqiad, 10DBA: check db1054 power supply redundancy - https://phabricator.wikimedia.org/T177628#3667832 (10Marostegui) The new PSU looks broken too, it didn't last long until it complained again: ``` reset /admin1/system1/logs1/log1-> show record1 properties CreationTimestamp = 20171...
[15:16:16] 10Operations, 10ops-eqiad, 10DBA: check db1080 power supply redundancy - https://phabricator.wikimedia.org/T177630#3667833 (10Marostegui) Looks good now - thanks!: ``` hpiLO-> show powersupply1 status=0 status_tag=COMMAND COMPLETED Sun Oct 8 14:58:22 2017 /system1/powersupply1 Targets Prop...
[15:17:11] RECOVERY - IPMI Sensor Status on db1080 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[15:23:14] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Mobile, 10Readers-Web-Backlog (Tracking): On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#3667835 (10Zppix) There was another report of t...
[15:25:16] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui)
[17:58:41] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:28:42] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[18:33:32] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:03:31] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[20:56:12] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:26:12] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[21:40:23] !log Killed "dumpwikidatardf.sh truthy nt" (wikidata truthy dump) on snapshot1007, got stuck after T169680.
[21:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:30] T169680: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680
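For context on the final !log entry: a dump job wedged on a hung NFS mount typically ends up in uninterruptible sleep ("D" state), which is why it has to be found and killed by hand once the mount recovers. A small, hypothetical helper for spotting such processes (reads /proc directly; not a WMF tool):

```python
# Hypothetical sketch: list processes currently in uninterruptible sleep,
# the usual symptom of a hung NFS mount like the dataset1001 incident above.
import os

def blocked_processes():
    """Yield (pid, cmdline) for processes whose state flag is 'D'."""
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as fh:
                stat = fh.read()
            # The state flag is the first field after the "(comm)" entry;
            # parse from the last ')' so command names with spaces don't break it.
            state = stat[stat.rfind(")") + 2]
            if state != "D":
                continue
            with open(f"/proc/{pid}/cmdline", "rb") as fh:
                cmdline = fh.read().replace(b"\0", b" ").decode().strip()
            yield int(pid), cmdline
        except (FileNotFoundError, PermissionError):
            continue  # process exited or is not readable; skip it

if __name__ == "__main__":
    for pid, cmdline in blocked_processes():
        print(f"{pid}\t{cmdline or '(kernel thread)'}")
```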