[01:04:31] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1507424665 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4224630 keys, up 4 minutes 22 seconds - replication_delay is 1507424665
[01:04:32] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[01:05:01] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1507424693 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4222763 keys, up 4 minutes 50 seconds - replication_delay is 1507424693
[01:05:41] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4220368 keys, up 5 minutes 29 seconds - replication_delay is 0
[01:05:43] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4219389 keys, up 5 minutes 30 seconds - replication_delay is 0
[01:06:01] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4219251 keys, up 5 minutes 51 seconds - replication_delay is 0
[02:10:50] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: frpm1001 is dead, looks like hardware failure - https://phabricator.wikimedia.org/T177710#3667572 (10Jgreen)
[02:11:00] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: frpm1001 is dead, looks like hardware failure - https://phabricator.wikimedia.org/T177710#3667585 (10Jgreen) p:05Triage>03Unbreak!
[03:18:51] (03CR) 10Jayprakash12345: [C: 04-1] "This is for wuu.wiki or zh.wiki? See https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wmf-config/InitialiseSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383007 (https://phabricator.wikimedia.org/T165593) (owner: 10Zoranzoki21)
[03:21:11] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[03:22:51] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[03:26:21] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 786.65 seconds
[04:09:33] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 269.90 seconds
[05:12:31] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:12:42] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:22:51] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:26:23] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:27:52] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:29:32] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:30:11] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[05:36:21] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[07:56:41] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
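The "Upload HTTP 5xx reqs/min" and "Ulsfo HTTP 5xx reqs/min" alerts above come from a Graphite-backed check that reports what fraction of recent datapoints sit above a fixed threshold (hence "22.22% of data above the critical threshold [1000.0]"). The sketch below illustrates that idea against the standard Graphite render API; the endpoint, metric name, and thresholds are placeholders, not the production check_graphite configuration.

```python
# Minimal, hypothetical sketch of a "percentage of datapoints above a
# threshold" check, modelled on the alert text above. Not the WMF plugin.
import requests

GRAPHITE = "https://graphite.example.org"  # hypothetical endpoint
METRIC = "reqstats.5xx"                    # hypothetical metric name
CRITICAL_VALUE = 1000.0                    # per-datapoint threshold
CRITICAL_PCT = 20.0                        # alert if this % of points exceed it

def pct_above_threshold(metric, minutes=10):
    """Fetch recent datapoints from the Graphite render API and return the
    percentage of non-null values above CRITICAL_VALUE."""
    resp = requests.get(
        f"{GRAPHITE}/render",
        params={"target": metric, "from": f"-{minutes}min", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    series = resp.json()
    points = [v for s in series for v, _ts in s["datapoints"] if v is not None]
    if not points:
        return 0.0
    return 100.0 * sum(1 for v in points if v > CRITICAL_VALUE) / len(points)

if __name__ == "__main__":
    pct = pct_above_threshold(METRIC)
    state = "CRITICAL" if pct >= CRITICAL_PCT else "OK"
    print(f"{state}: {pct:.2f}% of data above the critical threshold [{CRITICAL_VALUE}]")
```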
[07:58:02] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:19:01] !log restart varnish backend on cp4026 to stop 503s
[08:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:03] a ton of 503s during the past hours - https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=upload&from=now-12h&to=now
[08:28:01] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0
[08:34:38] <_joe_> elukey: sigh
[08:37:39] ciao _joe_ :)
[08:42:21] <_joe_> elukey: not a ton of 503s but still worrying no one had to notice
[08:42:48] <_joe_> we are not supposed to be around here on sunday, checking IRC for 503s
[08:42:55] <_joe_> there is the pager for such things
[08:44:49] yep
[08:50:17] (afk again :)
[08:51:42] (03CR) 10Zoranzoki21: "Jay, when I in search in program Synwrite pasted 模块, I changed where search found it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383007 (https://phabricator.wikimedia.org/T165593) (owner: 10Zoranzoki21)
[08:57:26] (03Abandoned) 10Zoranzoki21: Modification of the default alias for namespace 828 "模块:" of zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383007 (https://phabricator.wikimedia.org/T165593) (owner: 10Zoranzoki21)
[09:14:09] Hi
[09:14:32] Are you tweaking anything on labs right now?
[09:15:03] the PDF thumbnailer on English Wikisource gums up when trying to talk to what it claims is one of the labs servers and doesn't load images properly
[09:18:29] Fif your code so it doesn't break please
[09:18:32] *Fix
[09:28:53] this isn't the labs channel
[09:29:15] and demanding people fix their code isn't usually the best idea
[09:31:32] closedmouth: Okay sorry
[09:31:40] but it frustrating
[09:31:59] when something that should be working isn't
[09:32:23] I will note that poor performance of the PDF thumbnailer was something I mentioned previously
[09:32:44] It hasn't been resolved apparently
[09:35:22] closedmouth: I mentioned the issue here in case it was a server problem
[10:05:35] closedmouth: Apologies about earlier - but the labs channel seems not to have anyone in it.
[10:06:03] that is able to respond currently
[10:06:19] Looks like I'll have to find the former phab ticket
[11:24:14] 10Operations, 10Datasets-General-or-Unknown, 10Patch-For-Review: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3667686 (10Volans) We had a re-occurrence of the same, with a very similar stack trace and the same consequences: ``` [Oct 7 23:27]...
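The "puppet last run" CRITICALs earlier in the log (and the acknowledgements that follow) are freshness checks: once NFS on dataset1001 hung, the agents on the mounting hosts stopped completing runs, so the recorded last-run time kept ageing. Below is a rough sketch of such a check, assuming the conventional last_run_summary.yaml location; the actual WMF plugin, path, and thresholds may differ.

```python
# Rough sketch of a "puppet last run" freshness check, along the lines of
# "CRITICAL: Puppet last ran 6 hours ago". Path and thresholds are assumptions.
import sys
import time
import yaml

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"  # assumed location
WARN_AGE = 6 * 3600    # seconds
CRIT_AGE = 12 * 3600

def check_last_run(path=SUMMARY):
    with open(path) as fh:
        summary = yaml.safe_load(fh)
    last_run = summary["time"]["last_run"]  # epoch seconds of the last agent run
    age = time.time() - last_run
    hours = age / 3600
    if age >= CRIT_AGE:
        return 2, f"CRITICAL: Puppet last ran {hours:.0f} hours ago"
    if age >= WARN_AGE:
        return 1, f"WARNING: Puppet last ran {hours:.0f} hours ago"
    return 0, f"OK: Puppet last ran {age / 60:.0f} minutes ago"

if __name__ == "__main__":
    code, message = check_last_run()
    print(message)
    sys.exit(code)  # Nagios/Icinga exit codes: 0 OK, 1 WARNING, 2 CRITICAL
```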
[11:27:19] ACKNOWLEDGEMENT - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet last ran 11 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:27:19] ACKNOWLEDGEMENT - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:27:19] ACKNOWLEDGEMENT - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:27:19] ACKNOWLEDGEMENT - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:27:19] ACKNOWLEDGEMENT - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet last ran 11 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:27:19] ACKNOWLEDGEMENT - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago Volans NFSD stuck on dataset1001 https://phabricator.wikimedia.org/T169680
[11:28:21] !log ack-ed puppet not running on stat100[5-6],snapshot100[1,5-7] due to NFSD stuck on dataset1001 - T169680
[11:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:28] T169680: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680
[14:29:12] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:37:17] 10Operations, 10ops-eqiad, 10DBA: check db1054 power supply redundancy - https://phabricator.wikimedia.org/T177628#3667792 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson The PSU was indeed dead, swapped it with a spare from a decom'd server. Both power supplies are working.
[14:44:29] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: frpm1001 is dead, looks like hardware failure - https://phabricator.wikimedia.org/T177710#3667572 (10Cmjohnson) The system board for frpm1001 will need to be replaced, the server will not power on, tried reseating power supplies, DIMM, CPU1 with no luck. Th...
[14:45:13] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3667812 (10Cmjohnson) A support ticket has been placed with HP. Sent them the AHS log.
[14:47:11] RECOVERY - IPMI Sensor Status on db1052 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[14:47:17] 10Operations, 10ops-eqiad, 10DBA: check db1052 power supply redundancy - https://phabricator.wikimedia.org/T177627#3667816 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson db1052 has a bad PSU, swapped with a spare from a decom'd server
[14:48:29] 10Operations, 10ops-eqiad, 10DBA: check db1080 power supply redundancy - https://phabricator.wikimedia.org/T177630#3667819 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Loose cable, fixed
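The "IPMI Sensor Status" recoveries around these power-supply swaps report the Power_Supply sensor group returning to OK. As a hedged illustration only (not the production plugin, which differs in detail), power-supply sensor state can be read with ipmitool as sketched below; ipmitool must be installed and typically needs root.

```python
# Illustrative sketch: read power-supply sensors via ipmitool and summarise
# them in an Icinga-style one-liner. Not a WMF tool.
import subprocess

def power_supply_status():
    """Return a list of (sensor_name, status) tuples for power-supply sensors."""
    out = subprocess.run(
        ["ipmitool", "sdr", "type", "Power Supply"],
        capture_output=True, text=True, check=True,
    ).stdout
    sensors = []
    for line in out.splitlines():
        # Typical line: "PS Redundancy | 77h | ok | 7.1 | Fully Redundant"
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 3 and fields[0]:
            sensors.append((fields[0], fields[2]))
    return sensors

if __name__ == "__main__":
    readings = power_supply_status()
    bad = [(name, status) for name, status in readings if status not in ("ok", "ns")]
    if bad:
        print("CRITICAL: " + ", ".join(f"{n}={s}" for n, s in bad))
    else:
        print("OK: Power_Supply Status: OK")
```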
[14:51:38] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, 10Elasticsearch: check elastic1022 power supply redundancy - https://phabricator.wikimedia.org/T177631#3665166 (10Cmjohnson) Both PSU are working, cleared the log. still showing as critical in Icinga
[14:58:14] 10Operations, 10ops-eqiad, 10DBA: check db1052 power supply redundancy - https://phabricator.wikimedia.org/T177627#3667827 (10Marostegui) Thanks a lot @Cmjohnson - confirmed they are looking good now ``` /admin1/system1/logs1/log1-> show record1 properties CreationTimestamp = 20171008144224.000000-300...
[14:59:12] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:59:27] 10Operations, 10ops-eqiad: check kafka1022 power supply status - https://phabricator.wikimedia.org/T177633#3665200 (10Cmjohnson) The PSU was lit as though it was working, the racadm log showed lost redundancy so I replaced the 2nd PSU for kafka1022
[15:03:16] 10Operations, 10ops-eqiad: check mw1200 power supply redundancy - https://phabricator.wikimedia.org/T177635#3665237 (10Cmjohnson) Currently the PSU appears to be working but it has lost redundancy several times according to the h/w log. I want to check the settings on this and several other servers that show no...
[15:11:51] 10Operations, 10ops-eqiad, 10DBA: check db1054 power supply redundancy - https://phabricator.wikimedia.org/T177628#3667832 (10Marostegui) The new PSU looks broken too, it didn't last long until it complained again: ``` reset /admin1/system1/logs1/log1-> show record1 properties CreationTimestamp = 20171...
[15:16:16] 10Operations, 10ops-eqiad, 10DBA: check db1080 power supply redundancy - https://phabricator.wikimedia.org/T177630#3667833 (10Marostegui) Looks good now - thanks!: ``` hpiLO-> show powersupply1 status=0 status_tag=COMMAND COMPLETED Sun Oct 8 14:58:22 2017 /system1/powersupply1 Targets Prop...
[15:17:11] RECOVERY - IPMI Sensor Status on db1080 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[15:23:14] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Mobile, 10Readers-Web-Backlog (Tracking): On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#3667835 (10Zppix) There was another report of t...
[15:25:16] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui)
[17:58:41] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:28:42] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[18:33:32] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:03:31] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[20:56:12] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:26:12] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[21:40:23] !log Killed "dumpwikidatardf.sh truthy nt" (wikidata truthy dump) on snapshot1007, got stuck after T169680.
[21:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:30] T169680: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680
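For context on the final !log entry: a dump job wedged on a hung NFS mount typically ends up in uninterruptible sleep ("D" state), which is why it has to be found and killed by hand once the mount recovers. A small, hypothetical helper for spotting such processes (reads /proc directly; not a WMF tool):

```python
# Hypothetical sketch: list processes currently in uninterruptible sleep,
# the usual symptom of a hung NFS mount like the dataset1001 incident above.
import os

def blocked_processes():
    """Yield (pid, cmdline) for processes whose state flag is 'D'."""
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as fh:
                stat = fh.read()
            # The state flag is the first field after the "(comm)" entry;
            # parse from the last ')' so command names with spaces don't break it.
            state = stat[stat.rfind(")") + 2]
            if state != "D":
                continue
            with open(f"/proc/{pid}/cmdline", "rb") as fh:
                cmdline = fh.read().replace(b"\0", b" ").decode().strip()
            yield int(pid), cmdline
        except (FileNotFoundError, PermissionError):
            continue  # process exited or is not readable; skip it

if __name__ == "__main__":
    for pid, cmdline in blocked_processes():
        print(f"{pid}\t{cmdline or '(kernel thread)'}")
```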