[00:45:54] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2142546
[01:04:43] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[01:05:04] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509239101 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3942794 keys, up 4 minutes 57 seconds - replication_delay is 1509239101
[01:05:43] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3941836 keys, up 5 minutes 35 seconds - replication_delay is 0
[01:06:04] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3941084 keys, up 5 minutes 56 seconds - replication_delay is 0
[03:27:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 865.99 seconds
[03:36:54] PROBLEM - puppet last run on mw1326 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:57:31] Operations, Traffic, HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#3717931 (BBlack) Yes, but the work for that is more on the CA end than ours, from a technical perspective. Because of Google's deadlines, in practice virtually all CA vend...
[04:01:54] RECOVERY - puppet last run on mw1326 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:11:54] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 211.59 seconds
[04:12:40] (PS1) BryanDavis: Add Timeless skin to wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/387069 (https://phabricator.wikimedia.org/T154371)
[04:47:49] Operations, CirrusSearch, Discovery, MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3717940 (Jack_who_built_the_house) On ruwiki, many editors are complaining about slow updating of pages with their templates. We have a huge job queu...
[07:47:04] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0
[07:54:09] ^ that is me fixing T179244
[07:54:09] T179244: labsdb1009 crashed - OOM - https://phabricator.wikimedia.org/T179244
[09:45:04] Operations, ORES, Scoring-platform-team, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3718040 (hoo)
[10:15:24] (PS1) Zoranzoki21: Enable the ArticlePlaceholder for Northern Sami (sewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241)
[10:17:21] (PS2) Zoranzoki21: Enable the ArticlePlaceholder for Northern Sami (sewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241)
[10:26:59] (CR) Zoranzoki21: [C: 1] Add Timeless skin to wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/387069 (https://phabricator.wikimedia.org/T154371) (owner: BryanDavis)
[10:27:42] (CR) Zoranzoki21: [C: 1] Enable ShortUrl and WikiLove Extension on pa.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) (owner: Jayprakash12345)
[11:14:47] (CR) Paladox: [C: 1] Add Timeless skin to wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/387069 (https://phabricator.wikimedia.org/T154371) (owner: BryanDavis)
[12:06:38] (CR) Alex Monk: [C: 1] Add Timeless skin to wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/387069 (https://phabricator.wikimedia.org/T154371) (owner: BryanDavis)
[12:33:19] Operations, ORES, Scoring-platform-team, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715032 (ema) >>! In T179156#3717847, @BBlack wrote: > For future reference by another opsen who might be looking at this: on...
[12:35:24] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[12:35:43] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0
[12:43:07] (CR) Smalyshev: [C: 1] wdqs: remove PrintPLAB from GC logging [puppet] - https://gerrit.wikimedia.org/r/386791 (https://phabricator.wikimedia.org/T175919) (owner: Gehel)
[12:54:53] !log cp4026: restart varnish-be for mbox lag
[12:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:54] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0
[13:45:00] Operations, ORES, Scoring-platform-team, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3718221 (BBlack) Does Echo have any kind of push notification going on, even in light testing yet?
[14:02:18] Operations, ORES, Scoring-platform-team, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3718225 (BBlack) Now that I'm digging deeper, it seems there are one or more projects in progress built around Push-like thin...
[14:05:33] PROBLEM - puppet last run on rdb2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:35:33] RECOVERY - puppet last run on rdb2006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[14:51:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[14:51:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[14:54:57] bblack hoo ^^
[14:57:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:58:23] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:06:22] it's a sub-1-minute 503 spike specific to esams-only, so not the same problem. just some transient crap of lesser severity.
[16:51:59] (CR) Paladox: [C: 1] Enable Timeless skin on 5 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/377864 (https://phabricator.wikimedia.org/T154371) (owner: Framawiki)
[17:01:05] Operations, ops-eqiad, Analytics, User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3718257 (elukey) @Cmjohnson let's order new PSUs if possible, we are not planning to replace this hardware soon :(
[17:23:48] !log ariel@tin Started deploy [dumps/dumps@d426cf7]: batch 7z jobs, multistream job fixup
[17:23:51] !log ariel@tin Finished deploy [dumps/dumps@d426cf7]: batch 7z jobs, multistream job fixup (duration: 00m 02s)
[17:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:57] Operations, ORES, Scoring-platform-team, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3718297 (Legoktm) >>! In T179156#3718221, @BBlack wrote: > Does Echo have any kind of push notification going on, even in lig...
[17:27:41] (PS1) Zoranzoki21: Enable the Autopatrolled User Rights on hiwikiversity Enable Extension:SandboxLink on hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/387106 (https://phabricator.wikimedia.org/T179251)
[17:28:49] (CR) jerkins-bot: [V: -1] Enable the Autopatrolled User Rights on hiwikiversity Enable Extension:SandboxLink on hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/387106 (https://phabricator.wikimedia.org/T179251) (owner: Zoranzoki21)
[17:31:10] (PS2) Zoranzoki21: Enable the Autopatrolled User Rights on hiwikiversity Enable Extension:SandboxLink on hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/387106 (https://phabricator.wikimedia.org/T179251)
[17:32:41] (PS3) Zoranzoki21: Enable the Autopatrolled User Rights on hiwikiversity Enable Extension:SandboxLink on hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/387106 (https://phabricator.wikimedia.org/T179251)
[17:34:19] (CR) jerkins-bot: [V: -1] Enable the Autopatrolled User Rights on hiwikiversity Enable Extension:SandboxLink on hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/387106 (https://phabricator.wikimedia.org/T179251) (owner: Zoranzoki21)
[17:39:21] (PS4) Zoranzoki21: Enable the Autopatrolled User Rights on hiwikiversity Enable Extension:SandboxLink on hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/387106 (https://phabricator.wikimedia.org/T179251)
[17:49:05] (PS5) Zoranzoki21: Enable the Autopatrolled User Rights on hiwikiversity Enable Extension:SandboxLink on hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/387106 (https://phabricator.wikimedia.org/T179251)
[17:54:18] (CR) ArielGlenn: [C: 2] Permit overrides section in dump config files and more per proj settings [dumps] - https://gerrit.wikimedia.org/r/387022 (https://phabricator.wikimedia.org/T178893) (owner: ArielGlenn)
[17:55:39] !log ariel@tin Started deploy [dumps/dumps@d8978ce]: add overrides section processing to config file
[17:55:43] !log ariel@tin Finished deploy [dumps/dumps@d8978ce]: add overrides section processing to config file (duration: 00m 04s)
[17:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:31] (PS6) ArielGlenn: generate one config file for xml/sql dumps for wikis [puppet] - https://gerrit.wikimedia.org/r/386388 (https://phabricator.wikimedia.org/T178893)
[18:52:08] (PS2) Framawiki: Create Appendix NS on Burmese Wiktionary (mywikt) [mediawiki-config] - https://gerrit.wikimedia.org/r/385190 (https://phabricator.wikimedia.org/T178545)
[18:56:06] (PS6) Framawiki: Enable the Autopatrolled User Rights & Ext:SandboxLink on hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/387106 (https://phabricator.wikimedia.org/T179251) (owner: Zoranzoki21)
[18:57:11] (CR) Framawiki: [C: 1] Enable the Autopatrolled User Rights & Ext:SandboxLink on hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/387106 (https://phabricator.wikimedia.org/T179251) (owner: Zoranzoki21)
[19:04:43] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[19:05:24] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[19:14:46] (PS7) ArielGlenn: generate one config file for xml/sql dumps for wikis [puppet] - https://gerrit.wikimedia.org/r/386388 (https://phabricator.wikimedia.org/T178893)
[21:25:12] (PS1) Herron: puppet: change elasticsearch_5 template to check undef(nil) variable [puppet] - https://gerrit.wikimedia.org/r/387113 (https://phabricator.wikimedia.org/T179174)
[21:27:20] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler02/8524/" [puppet] - https://gerrit.wikimedia.org/r/387113 (https://phabricator.wikimedia.org/T179174) (owner: Herron)
[21:43:28] Operations, Puppet, User-Joe: Puppet4: Error while evaluating a Resource Statement, Unknown resource type: 'exim_alias_file' at /etc/puppet/private/modules/privateexim/manifests/init.pp:55:2 - https://phabricator.wikimedia.org/T179170#3718432 (herron) Should `define exim_alias_file` be moved from `p...
[21:53:38] (PS1) Herron: puppet: change ganglia aggregator site_instances call to full name [puppet] - https://gerrit.wikimedia.org/r/387139 (https://phabricator.wikimedia.org/T179165)
[21:55:54] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler02/8525/" [puppet] - https://gerrit.wikimedia.org/r/387139 (https://phabricator.wikimedia.org/T179165) (owner: Herron)
[22:00:12] (PS1) Herron: puppet: change mediawiki refreshlinks cronjob call to use full name [puppet] - https://gerrit.wikimedia.org/r/387142 (https://phabricator.wikimedia.org/T179177)
[22:02:19] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler02/8526/" [puppet] - https://gerrit.wikimedia.org/r/387139 (https://phabricator.wikimedia.org/T179165) (owner: Herron)
[22:02:45] (PS1) ArielGlenn: fix 'keep' config setting to work with overrides section [dumps] - https://gerrit.wikimedia.org/r/387145
[22:03:58] (CR) Herron: "> https://puppet-compiler.wmflabs.org/compiler02/8526/" [puppet] - https://gerrit.wikimedia.org/r/387139 (https://phabricator.wikimedia.org/T179165) (owner: Herron)
[22:06:46] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler02/8526/" [puppet] - https://gerrit.wikimedia.org/r/387142 (https://phabricator.wikimedia.org/T179177) (owner: Herron)
[22:18:56] (PS1) Herron: puppet: change dbstore_multiinstance mariadb groups call to full name [puppet] - https://gerrit.wikimedia.org/r/387151 (https://phabricator.wikimedia.org/T179161)
[22:21:06] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler02/8527/" [puppet] - https://gerrit.wikimedia.org/r/387151 (https://phabricator.wikimedia.org/T179161) (owner: Herron)
[22:22:39] (PS8) ArielGlenn: generate one config file for xml/sql dumps for wikis [puppet] - https://gerrit.wikimedia.org/r/386388 (https://phabricator.wikimedia.org/T178893)
[22:30:09] (CR) ArielGlenn: [C: 2] fix 'keep' config setting to work with overrides section [dumps] - https://gerrit.wikimedia.org/r/387145 (owner: ArielGlenn)
[22:31:08] !log ariel@tin Started deploy [dumps/dumps@2aa2275]: fix keep setting to work with overrides
[22:31:10] !log ariel@tin Finished deploy [dumps/dumps@2aa2275]: fix keep setting to work with overrides (duration: 00m 02s)
[22:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:14] PROBLEM - Check Varnish expiry mailbox lag on cp4022 is CRITICAL: CRITICAL: expiry mailbox lag is 2085185
[22:58:20] (PS1) BryanDavis: wikitech: Install php5-readline for cli scripts [puppet] - https://gerrit.wikimedia.org/r/387152 (https://phabricator.wikimedia.org/T126262)
[23:04:01] (CR) Andrew Bogott: [C: 2] wikitech: Install php5-readline for cli scripts [puppet] - https://gerrit.wikimedia.org/r/387152 (https://phabricator.wikimedia.org/T126262) (owner: BryanDavis)
[23:06:42] (PS1) BryanDavis: wmcs: add wmcs-roots to role::wmcs::openstack::wikitech [puppet] - https://gerrit.wikimedia.org/r/387153
[23:07:26] (PS9) ArielGlenn: generate one config file for xml/sql dumps for wikis [puppet] - https://gerrit.wikimedia.org/r/386388 (https://phabricator.wikimedia.org/T178893)
[23:42:52] PROBLEM - Host cp4024 is DOWN: PING CRITICAL - Packet loss = 100%
[23:49:19] !log powercycle cp4024
[23:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:47] oh, nevermind, cp4024's was down already (and is depooled). I'll set downtime on icinga
[23:54:33] ACKNOWLEDGEMENT - Host cp4024 is DOWN: PING CRITICAL - Packet loss = 100% Ema T174891