[00:06:44] RECOVERY - puppet last run on mw1071 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[01:40:21] (PS1) Alex Monk: Remove integration-puppetmaster from the labs monitoring hosts [puppet] - https://gerrit.wikimedia.org/r/244948
[02:25:32] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 06m 17s)
[02:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:34] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-11 02:28:34+00:00
[02:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:16] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:05:23] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[03:20:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[03:25:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[03:34:43] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:44] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:15] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:54] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:01:34] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[04:01:35] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[04:01:35] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[04:02:05] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[04:12:37] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[04:21:41] (CR) Alex Monk: "Also, should this be marked against T108063?" [puppet] - https://gerrit.wikimedia.org/r/243357 (owner: Alex Monk)
[04:40:33] PROBLEM - puppet last run on mw2029 is CRITICAL: CRITICAL: puppet fail
[04:57:41] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Oct 11 04:57:40 UTC 2015 (duration 57m 39s)
[04:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:09:14] RECOVERY - puppet last run on mw2029 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[05:17:28] (CR) Tim Landscheidt: "a) Yes, that would fix T108063." [puppet] - https://gerrit.wikimedia.org/r/243357 (owner: Alex Monk)
[05:33:35] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[05:52:13] PROBLEM - Disk space on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:53:44] RECOVERY - Disk space on cp1059 is OK: DISK OK
[06:05:35] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[06:29:14] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[06:30:43] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:23] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:33] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:54] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:14] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:15] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:35] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:47] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:24] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:33] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:13] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:15] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:34] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:55] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:15] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:14] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:50:44] (CR) MarcoAurelio: [C: -1] Add three groups to itwikiversity, and allow sysops to add or remove users to them (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (owner: Gerrit Patch Uploader)
[06:51:25] PROBLEM - Varnish traffic logger - multicast_relay on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:51:34] PROBLEM - Varnish traffic logger - erbium on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:53:03] RECOVERY - Varnish traffic logger - multicast_relay on cp1059 is OK: PROCS OK: 1 process with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog)
[06:53:04] RECOVERY - Varnish traffic logger - erbium on cp1059 is OK: PROCS OK: 1 process with args varnishncsa-erbium.pid, UID = 111 (varnishlog)
[06:56:13] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:24] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:56:25] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:54] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:57:03] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:57:04] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:57:05] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:57:14] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:44] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:57:45] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:53] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:58:03] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:58:14] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:14] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:16] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:58:24] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:44] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:04] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:15] PROBLEM - Varnish traffic logger - erbium on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:25] RECOVERY - Varnish traffic logger - erbium on cp1059 is OK: PROCS OK: 1 process with args varnishncsa-erbium.pid, UID = 111 (varnishlog)
[07:12:09] (PS1) Ori.livneh: cirrus tests: skip if full MediaWiki install is not availale [mediawiki-config] - https://gerrit.wikimedia.org/r/244949
[07:12:31] (CR) Ori.livneh: [C: 2] cirrus tests: skip if full MediaWiki install is not availale [mediawiki-config] - https://gerrit.wikimedia.org/r/244949 (owner: Ori.livneh)
[07:12:37] (Merged) jenkins-bot: cirrus tests: skip if full MediaWiki install is not availale [mediawiki-config] - https://gerrit.wikimedia.org/r/244949 (owner: Ori.livneh)
[07:13:02] (PS2) Ori.livneh: Add MEDIAWIKI_DBLIST_DIR define, set to MEDIAWIKI_STAGING_DIR by default [mediawiki-config] - https://gerrit.wikimedia.org/r/244740
[07:17:53] (PS3) Ori.livneh: Add MEDIAWIKI_DBLIST_DIR define, set to MEDIAWIKI_STAGING_DIR by default [mediawiki-config] - https://gerrit.wikimedia.org/r/244740
[07:18:50] (CR) Ori.livneh: [C: 2] Add MEDIAWIKI_DBLIST_DIR define, set to MEDIAWIKI_STAGING_DIR by default [mediawiki-config] - https://gerrit.wikimedia.org/r/244740 (owner: Ori.livneh)
[07:18:56] (Merged) jenkins-bot: Add MEDIAWIKI_DBLIST_DIR define, set to MEDIAWIKI_STAGING_DIR by default [mediawiki-config] - https://gerrit.wikimedia.org/r/244740 (owner: Ori.livneh)
[07:23:27] (PS1) Ori.livneh: Set MEDIAWIKI_DBLIST_DIR to /srv/mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/244950
[07:23:35] (CR) Ori.livneh: [C: 2] Set MEDIAWIKI_DBLIST_DIR to /srv/mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/244950 (owner: Ori.livneh)
[07:23:41] (Merged) jenkins-bot: Set MEDIAWIKI_DBLIST_DIR to /srv/mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/244950 (owner: Ori.livneh)
[07:50:34] PROBLEM - RAID on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:52:04] RECOVERY - RAID on cp1059 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[07:53:24] PROBLEM - salt-minion processes on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:54:54] RECOVERY - salt-minion processes on cp1059 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:56:45] PROBLEM - Freshness of OCSP Stapling files on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:23] RECOVERY - Freshness of OCSP Stapling files on cp1059 is OK: OK
[08:03:44] PROBLEM - Confd vcl based reload on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:05:54] PROBLEM - Varnish traffic logger - multicast_relay on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:06:55] RECOVERY - Confd vcl based reload on cp1059 is OK: reload-vcl successfully ran 93h, 52 minutes ago.
[08:07:25] RECOVERY - Varnish traffic logger - multicast_relay on cp1059 is OK: PROCS OK: 1 process with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog)
[08:12:14] PROBLEM - Confd vcl based reload on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:17:44] PROBLEM - DPKG on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:20:33] RECOVERY - Confd vcl based reload on cp1059 is OK: reload-vcl successfully ran 94h, 6 minutes ago.
[08:20:43] PROBLEM - Freshness of OCSP Stapling files on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:22:43] RECOVERY - DPKG on cp1059 is OK: All packages OK
[08:23:54] RECOVERY - Freshness of OCSP Stapling files on cp1059 is OK: OK
[08:24:44] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:25:44] PROBLEM - Confd vcl based reload on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:15] PROBLEM - Disk space on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:23] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp1059 is OK: No errors detected
[08:27:24] PROBLEM - Varnish HTCP daemon on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:27:58] PROBLEM - RAID on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:28:55] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[08:29:04] RECOVERY - Confd vcl based reload on cp1059 is OK: reload-vcl successfully ran 94h, 14 minutes ago.
[08:29:13] PROBLEM - IPsec on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:29:14] PROBLEM - configured eth on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:29:34] RECOVERY - RAID on cp1059 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[08:31:03] RECOVERY - configured eth on cp1059 is OK: OK - interfaces up
[08:31:23] RECOVERY - Disk space on cp1059 is OK: DISK OK
[08:31:33] PROBLEM - Varnish traffic logger - multicast_relay on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:32:43] RECOVERY - Varnish HTCP daemon on cp1059 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd
[08:36:05] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK
[08:36:13] PROBLEM - configured eth on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:36:13] PROBLEM - Confd vcl based reload on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:36:36] RECOVERY - Varnish traffic logger - multicast_relay on cp1059 is OK: PROCS OK: 1 process with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog)
[08:37:44] RECOVERY - Confd vcl based reload on cp1059 is OK: reload-vcl successfully ran 94h, 23 minutes ago.
[08:37:45] RECOVERY - configured eth on cp1059 is OK: OK - interfaces up
[08:37:54] PROBLEM - Varnish HTCP daemon on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:39:33] RECOVERY - Varnish HTCP daemon on cp1059 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd
[08:39:57] PROBLEM - RAID on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:41:33] RECOVERY - RAID on cp1059 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[08:41:43] PROBLEM - Varnish traffic logger - multicast_relay on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:44:34] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v4
[08:44:34] PROBLEM - service on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:04] RECOVERY - service on cp1059 is OK: OK - confd is active
[08:46:15] PROBLEM - Confd vcl based reload on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:37] PROBLEM - DPKG on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:47:35] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 8 ESP OK
[08:47:55] PROBLEM - Freshness of OCSP Stapling files on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:48:14] RECOVERY - Varnish traffic logger - multicast_relay on cp1059 is OK: PROCS OK: 1 process with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog)
[08:48:23] RECOVERY - DPKG on cp1059 is OK: All packages OK
[08:49:34] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 8 ESP OK
[08:49:34] RECOVERY - Freshness of OCSP Stapling files on cp1059 is OK: OK
[08:51:13] RECOVERY - Confd vcl based reload on cp1059 is OK: reload-vcl successfully ran 94h, 37 minutes ago.
[08:51:44] PROBLEM - Disk space on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:14] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:54] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v4
[08:56:24] PROBLEM - service on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:23] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 8 ESP OK
[09:00:23] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp1059 is OK: No errors detected
[09:01:25] RECOVERY - service on cp1059 is OK: OK - confd is active
[09:03:23] PROBLEM - Freshness of OCSP Stapling files on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:03:43] RECOVERY - Disk space on cp1059 is OK: DISK OK
[09:03:44] PROBLEM - Varnish traffic logger - erbium on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:04:54] RECOVERY - Freshness of OCSP Stapling files on cp1059 is OK: OK
[09:05:15] RECOVERY - Varnish traffic logger - erbium on cp1059 is OK: PROCS OK: 1 process with args varnishncsa-erbium.pid, UID = 111 (varnishlog)
[09:09:04] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[09:12:24] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 8 ESP OK
[09:21:54] PROBLEM - service on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:22:59] (PS1) MarcoAurelio: Enable Extension:ShortURL on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/244953 (https://phabricator.wikimedia.org/T62956)
[09:23:24] RECOVERY - service on cp1059 is OK: OK - confd is active
[09:26:53] PROBLEM - Varnishkafka log producer on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:28:25] RECOVERY - Varnishkafka log producer on cp1059 is OK: PROCS OK: 3 processes with command name varnishkafka
[09:49:14] PROBLEM - DPKG on cp1059 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:52:23] RECOVERY - DPKG on cp1059 is OK: All packages OK
[10:36:54] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:05:25] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:08:24] operations, ops-eqiad: cp1059 has network issues - https://phabricator.wikimedia.org/T114870#1717913 (BBlack) I set cp1059 into downtime for a week in icinga, as it's been spamming IRC with random check failures, probably from the network port instability.
[11:11:12] (PS6) Glaisher: Add patrol, autopatrol, flood group to itwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[11:11:40] (CR) Glaisher: "Please follow https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines" [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[11:18:23] (CR) Steinsplitter: "Line 8808: whitespace missing" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[11:45:23] (CR) KartikMistry: [C: 1] Fix nbwiki to nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/244736 (owner: Amire80)
[11:49:53] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: puppet fail
[12:17:42] operations, Labs, Database, Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1718004 (Glaisher)
[12:18:34] RECOVERY - puppet last run on wtp2010 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[12:24:26] (PS7) Gerrit Patch Uploader: Add patrol, autopatrol, flood group to itwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930)
[12:24:28] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[12:25:47] (CR) Luke081515: "standardization of names flooder => flood like at T115200" [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[12:45:53] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[12:47:33] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 8 ESP OK
[12:59:54] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: puppet fail
[13:10:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[13:11:51] (PS8) Gerrit Patch Uploader: Add patrol, autopatrol, flood group to itwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930)
[13:11:53] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[13:19:05] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:26:53] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[13:39:45] (CR) Alex Monk: [C: 1] Enable Extension:ShortURL on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/244953 (https://phabricator.wikimedia.org/T62956) (owner: MarcoAurelio)
[14:05:24] PROBLEM - puppet last run on mw2210 is CRITICAL: CRITICAL: puppet fail
[14:18:14] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[14:21:24] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 8 ESP OK
[14:27:29] (PS1) Alex Monk: Format Tmax in slow queries page as a number [software/tendril] - https://gerrit.wikimedia.org/r/244964
[14:32:23] RECOVERY - puppet last run on mw2210 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[14:56:40] (CR) JanZerebecki: "This is not in https-everywhere and it has no relation to a domain that needs to be https only. So we could leave the http variant working" [dns] - https://gerrit.wikimedia.org/r/244103 (owner: Dzahn)
[15:38:17] (CR) Alex Monk: [C: 1] Add patrol, autopatrol, flood group to itwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[16:12:23] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: puppet fail
[16:14:24] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[16:14:40] (PS1) 01tonythomas: Make mx1001/mx2001 to HTTP POST to meta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/245128 (https://phabricator.wikimedia.org/T114984)
[16:16:04] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 8 ESP OK
[16:26:55] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[16:32:13] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 8 ESP OK
[16:39:14] RECOVERY - puppet last run on cp2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:58:09] (PS1) Alex Monk: ssh-key-ldap-lookup: Don't print whole list of keys as one line before printing each individual one [puppet] - https://gerrit.wikimedia.org/r/245137
[17:35:10] operations, RESTBase: uneven load on restbase workers - https://phabricator.wikimedia.org/T113579#1718205 (mobrovac) Open>Invalid a:mobrovac This needs revisiting in case the status quo remains with incresed load. Resolving for now as invalid.
[17:45:13] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[17:46:53] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 8 ESP OK
[18:42:25] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[18:43:14] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v4
[18:44:05] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 8 ESP OK
[18:44:53] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 8 ESP OK
[18:50:45] (CR) Florianschmidtwelzow: [C: 1] Add patrol, autopatrol, flood group to itwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[19:19:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds
[19:26:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 0 below the confidence bounds
[19:57:17] operations, Wikimedia-DNS, Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1718412 (brion) Ok I suspect what we're going to end up doing is moving the user interface to MediaWiki-integrated code and just use a back...
[20:06:33] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v4
[20:09:54] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 8 ESP OK
[20:11:55] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[20:16:54] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 8 ESP OK
[20:36:34] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[20:41:34] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[20:55:04] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[20:58:24] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 8 ESP OK
[21:27:43] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[21:28:03] (PS2) Yuvipanda: ldap: Remove leftover debugging in ldap lookup script [puppet] - https://gerrit.wikimedia.org/r/245137 (owner: Alex Monk)
[21:28:09] (PS3) Yuvipanda: ldap: Remove leftover debugging in ldap lookup script [puppet] - https://gerrit.wikimedia.org/r/245137 (owner: Alex Monk)
[21:28:25] (CR) Yuvipanda: [C: 2 V: 2] ldap: Remove leftover debugging in ldap lookup script [puppet] - https://gerrit.wikimedia.org/r/245137 (owner: Alex Monk)
[21:29:28] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 8 ESP OK
[21:33:09] (PS2) Yuvipanda: Remove integration-puppetmaster from the labs monitoring hosts [puppet] - https://gerrit.wikimedia.org/r/244948 (owner: Alex Monk)
[21:33:19] (CR) Yuvipanda: [C: 2 V: 2] Remove integration-puppetmaster from the labs monitoring hosts [puppet] - https://gerrit.wikimedia.org/r/244948 (owner: Alex Monk)
[21:35:45] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[21:36:29] (CR) Alex Monk: [C: 1] "listing for deployment during the coming week" [mediawiki-config] - https://gerrit.wikimedia.org/r/243517 (https://phabricator.wikimedia.org/T114613) (owner: Alex Monk)
[21:36:50] (CR) Alex Monk: [C: 1] "listing for deployment during the coming week" [mediawiki-config] - https://gerrit.wikimedia.org/r/244378 (https://phabricator.wikimedia.org/T113593) (owner: Alex Monk)
[21:40:01] (CR) Alex Monk: "I may have been mistaken in how these settings work" [mediawiki-config] - https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: TTO)
[21:40:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[21:44:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [5000000.0]
[21:44:24] (PS1) Alex Monk: Don't require QuickSurveys to use HTTPS links in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/245188 (https://phabricator.wikimedia.org/T114485)
[21:45:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[21:49:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [5000000.0]
[21:52:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[21:57:24] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[21:57:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[22:02:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[22:16:50] (PS1) Yuvipanda: Call utcnow on datetime class, not module [software/tools-manifest] - https://gerrit.wikimedia.org/r/245192 (https://phabricator.wikimedia.org/T115225)
[22:17:09] (CR) jenkins-bot: [V: -1] Call utcnow on datetime class, not module [software/tools-manifest] - https://gerrit.wikimedia.org/r/245192 (https://phabricator.wikimedia.org/T115225) (owner: Yuvipanda)
[22:20:50] (PS1) Ori.livneh: MWWikiversions::readDbListFile(): normalize all paths [mediawiki-config] - https://gerrit.wikimedia.org/r/245193
[22:21:23] (CR) Ori.livneh: [C: 2] MWWikiversions::readDbListFile(): normalize all paths [mediawiki-config] - https://gerrit.wikimedia.org/r/245193 (owner: Ori.livneh)
[22:21:30] (Merged) jenkins-bot: MWWikiversions::readDbListFile(): normalize all paths [mediawiki-config] - https://gerrit.wikimedia.org/r/245193 (owner: Ori.livneh)
[22:22:44] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 8 connecting: (unnamed)
[22:24:24] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 8 ESP OK
[22:26:46] (CR) MarcoAurelio: Add patrol, autopatrol, flood group to itwikiversity (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[22:27:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[22:30:28] (PS2) Yuvipanda: Call utcnow on datetime class, not module [software/tools-manifest] - https://gerrit.wikimedia.org/r/245192 (https://phabricator.wikimedia.org/T115225)
[22:30:54] Coren: ^
[22:35:06] (CR) Yuvipanda: [C: 2] Call utcnow on datetime class, not module [software/tools-manifest] - https://gerrit.wikimedia.org/r/245192 (https://phabricator.wikimedia.org/T115225) (owner: Yuvipanda)
[22:35:26] (Merged) jenkins-bot: Call utcnow on datetime class, not module [software/tools-manifest] - https://gerrit.wikimedia.org/r/245192 (https://phabricator.wikimedia.org/T115225) (owner: Yuvipanda)
[22:36:43] (PS1) MarcoAurelio: Naming standardization from 'flooder' to 'flood' [mediawiki-config] - https://gerrit.wikimedia.org/r/245194 (https://phabricator.wikimedia.org/T115200)
[22:41:23] (PS1) Ori.livneh: MWWikiversions::readDbListFile() update callers for Ie6c2fd3129dd [mediawiki-config] - https://gerrit.wikimedia.org/r/245195
[22:41:28] (CR) jenkins-bot: [V: -1] MWWikiversions::readDbListFile() update callers for Ie6c2fd3129dd [mediawiki-config] - https://gerrit.wikimedia.org/r/245195 (owner: Ori.livneh)
[22:42:03] (PS2) Ori.livneh: MWWikiversions::readDbListFile() update callers for Ie6c2fd3129dd [mediawiki-config] - https://gerrit.wikimedia.org/r/245195
[22:42:20] (CR) Ori.livneh: [C: 2] MWWikiversions::readDbListFile() update callers for Ie6c2fd3129dd [mediawiki-config] - https://gerrit.wikimedia.org/r/245195 (owner: Ori.livneh)
[22:42:28] (Merged) jenkins-bot: MWWikiversions::readDbListFile() update callers for Ie6c2fd3129dd [mediawiki-config] - https://gerrit.wikimedia.org/r/245195 (owner: Ori.livneh)
[22:43:39] (CR) Luke081515: [C: 1] Naming standardization from 'flooder' to 'flood' [mediawiki-config] - https://gerrit.wikimedia.org/r/245194 (https://phabricator.wikimedia.org/T115200) (owner: MarcoAurelio)
[22:44:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[23:10:09] (PS9) Ori.livneh: Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007
[23:11:25] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1059_v6
[23:12:03] (PS10) Ori.livneh: Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007
[23:12:48] (CR) Ori.livneh: [C: 2] Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007 (owner: Ori.livneh)
[23:12:54] (Merged) jenkins-bot: Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007 (owner: Ori.livneh)
[23:13:05] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 8 ESP OK
[23:38:35] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail