[00:04:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[00:33:39] Operations, Beta-Cluster-Infrastructure, Patch-For-Review: /mnt/upload7 does not exist anywhere, yet it is referenced in multiple places in wmf-config - https://phabricator.wikimedia.org/T129586#2193231 (Krenair)
[00:37:03] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [5000000.0]
[00:44:48] (PS2) Alex Monk: Enable Ex:OATHAuth in beta, disabled for all users [mediawiki-config] - https://gerrit.wikimedia.org/r/282198 (https://phabricator.wikimedia.org/T131420) (owner: CSteipp)
[00:47:18] Operations, Beta-Cluster-Infrastructure, Services, Tracking: Move Node.JS services to Jessie and Node 4 (tracking) - https://phabricator.wikimedia.org/T124989#2193246 (Krenair)
[00:49:34] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[00:52:21] Operations, Beta-Cluster-Infrastructure, MediaWiki-extensions-CentralAuth, Wikimedia-Apache-configuration: Special:CentralAutoLogin/checkLoggedIn redirects to wikimediafoundation.org on Beta Cluster - https://phabricator.wikimedia.org/T126697#2193267 (Krenair) Open>Invalid Likely a cached is...
[01:09:43] RECOVERY - cassandra-a CQL 10.64.48.135:9042 on restbase1014 is OK: TCP OK - 0.000 second response time on port 9042
[01:20:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[01:31:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[01:39:42] PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:57:30] Operations, Beta-Cluster-Infrastructure, Labs, Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2193289 (Krenair)
[02:04:53] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[02:22:40] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.20) (duration: 10m 00s)
[02:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:44] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [5000000.0]
[02:26:24] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0]
[02:26:57] !log mwscript deleteEqualMessages.php --wiki zh_yuewiki
[02:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:14] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Apr 10 02:31:13 UTC 2016 (duration 8m 33s)
[02:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:12:54] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[03:13:34] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[03:23:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[03:24:22] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[03:25:41] Operations, Beta-Cluster-Infrastructure, Labs, Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2193331 (Krenair)
[03:32:33] Operations, Beta-Cluster-Infrastructure: Mails through deployment-mx SPF & DKIM fails - https://phabricator.wikimedia.org/T87338#2193335 (Krenair)
[03:35:13] PROBLEM - puppet last run on wtp1022 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:44] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:53] PROBLEM - puppet last run on mw1097 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:03] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:19] Operations, Labs, netops, User-bd808: Setting up bulk proxies pointing to a multiwiki mediawiki-vagrant setup running on a labs vm - https://phabricator.wikimedia.org/T132216#2193353 (Krenair)
[03:36:22] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:38:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:39:02] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:45:20] (CR) Alex Monk: "ping" [puppet] - https://gerrit.wikimedia.org/r/268921 (owner: Alex Monk)
[03:51:06] Operations, Labs, netops, User-bd808: Setting up bulk proxies pointing to a multiwiki mediawiki-vagrant setup running on a labs vm - https://phabricator.wikimedia.org/T132216#2193376 (Krenair) > "nl-mw-base.wmflabs.org" is using split horizon DNS. It resolves to 208.80.155.156 from outside the labs...
[03:52:21] Operations, Labs, netops, User-bd808: Setting up bulk proxies pointing to a multiwiki mediawiki-vagrant setup running on a labs vm - https://phabricator.wikimedia.org/T132216#2193377 (Krenair) >>! In T132216#2193290, @bd808 wrote: > Using `export http_proxy=http://webproxy.eqiad.wmnet:8080` as @Kre...
[04:00:33] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[04:03:02] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:03:03] RECOVERY - puppet last run on mw1097 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:03:22] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:03:33] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:16:31] (PS1) Yuvipanda: labs: Alias floating IPs in wikitextexp project as well [puppet] - https://gerrit.wikimedia.org/r/282511 (https://phabricator.wikimedia.org/T132216)
[05:16:50] Operations, Labs, netops, Patch-For-Review, User-bd808: Setting up bulk proxies pointing to a multiwiki mediawiki-vagrant setup running on a labs vm - https://phabricator.wikimedia.org/T132216#2193396 (yuvipanda) ^ is what you need. I'll merge on Monday :)
[05:23:23] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [5000000.0]
[05:24:15] (CR) Alex Monk: [C: -1] "Icbb352a8 should be done instead of expanding the list. At the moment this will probably just make that commit require an extra manual reb" [puppet] - https://gerrit.wikimedia.org/r/282511 (https://phabricator.wikimedia.org/T132216) (owner: Yuvipanda)
[05:28:02] Operations, Labs, netops, Patch-For-Review, User-bd808: Setting up bulk proxies pointing to a multiwiki mediawiki-vagrant setup running on a labs vm - https://phabricator.wikimedia.org/T132216#2193397 (Krenair) While that may be a workaround, it doesn't fix the fact that promethium cannot conne...
[05:41:47] Puppet, Beta-Cluster-Infrastructure: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2193402 (Krenair)
[06:10:33] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[06:22:32] PROBLEM - Apache HTTP on mw1213 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time
[06:23:33] PROBLEM - HHVM rendering on mw1213 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time
[06:30:23] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:34] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:44] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:13] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: puppet fail
[06:32:43] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:42] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:57:44] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:54] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:12] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:03:42] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail
[08:28:52] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[08:45:21] (CR) Gehel: "We still have a lot of tests in error that we need to fix before merging this. @Nicko: will you continue working on this? Or should we tak" [puppet] - https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: Nicko)
[13:16:33] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0]
[13:17:54] (PS4) Dereckson: Fix wgCopyUploadsDomains on Commons Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/282495 (https://phabricator.wikimedia.org/T132285)
[14:01:14] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: puppet fail
[14:11:42] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[14:18:52] (PS4) Nicko: Modification of Rakefile [puppet] - https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342)
[14:28:33] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:49:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[15:50:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[15:57:33] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[15:58:03] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:03:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:44:43] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[17:19:23] PROBLEM - puppet last run on mw2080 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:28:42] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail
[17:46:33] RECOVERY - puppet last run on mw2080 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:50:42] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: puppet fail
[17:55:53] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:17:54] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:30:43] PROBLEM - HHVM rendering on mw1239 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50393 bytes in 0.018 second response time
[18:32:34] RECOVERY - HHVM rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 67212 bytes in 0.088 second response time
[19:20:45] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2193858 (mmodell)
[19:20:47] Puppet, Beta-Cluster-Infrastructure: deployment-puppetmaster puppet failures due to apache trying to start on same port as nginx - https://phabricator.wikimedia.org/T132269#2193856 (mmodell) Open>Resolved @krenair: Thanks for getting to the bottom of this. I just did the following: * removed the a...
[19:37:13] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[19:56:01] Puppet, Beta-Cluster-Infrastructure: deployment-prep puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2183851 (Krenair)
[20:18:18] Operations, Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2193908 (Krenair)
[20:22:20] Operations, Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2193924 (Krenair) icinga shows this host as having downtime since 12th January for pretty much the rest of the year. Not sure why.
[20:24:12] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[20:50:12] (PS2) ArielGlenn: Maintain a current symlink for cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/282415 (owner: EBernhardson)
[20:52:05] (CR) ArielGlenn: [C: 2] Maintain a current symlink for cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/282415 (owner: EBernhardson)
[21:55:24] !log mwscript deleteEqualMessages.php --wiki trwikimedia (T45917)
[21:55:25] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[21:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:16:44] !log mwscript deleteEqualMessages.php --wiki yowiki (T45917)
[22:16:45] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[22:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:32:25] PROBLEM - puppet last run on mw2054 is CRITICAL: CRITICAL: puppet fail
[22:56:38] !log mwscript deleteEqualMessages.php --wiki srwiki (T45917)
[22:56:39] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[22:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:03:04] RECOVERY - puppet last run on mw2054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:42:28] PROBLEM - MariaDB Slave Lag: s1 on db1053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 380.28 seconds
[23:44:18] RECOVERY - MariaDB Slave Lag: s1 on db1053 is OK: OK slave_sql_lag Replication lag: 7.90 seconds