[00:02:06] <icinga-wm>	 PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 35357: Connection refused
[00:02:46] <icinga-wm>	 PROBLEM - keystone public endoint port 5000 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 5000: Connection refused
[00:09:36] <icinga-wm>	 PROBLEM - SSH cp1066.mgmt on cp1066.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:09:26] <icinga-wm>	 RECOVERY - SSH cp1066.mgmt on cp1066.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0)
[01:36:51] <wikibugs>	 Operations, Cloud-Services, Toolforge, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3863889 (Legoktm) If I have my tool set a `strict-transport-security: max-age=86400` header, will that impact other tools as well since the...
[02:54:20] <wikibugs>	 Operations, Cloud-Services, Toolforge, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#1363321 (Krenair) >>! In T102367#3863889, @Legoktm wrote: > If I have my tool set a `strict-transport-security: max-age=86400` header, will...
[03:25:46] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 758.33 seconds
[03:49:24] <wikibugs>	 Operations, Cloud-Services, Toolforge, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3863974 (bd808) >>! In T102367#3863889, @Legoktm wrote: > If I have my tool set a `strict-transport-security: max-age=86400` header, will t...
[03:51:36] <wikibugs>	 Operations, Cloud-Services, Toolforge, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#1363321 (Reedy) https://hstspreload.org/?domain=tools.wmflabs.org  {F12139559 size=full}
[03:54:47] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 287.81 seconds
[07:09:36] <icinga-wm>	 PROBLEM - SSH cp1066.mgmt on cp1066.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:09:26] <icinga-wm>	 RECOVERY - SSH cp1066.mgmt on cp1066.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0)
[10:07:43] <wikibugs>	 (PS3) ArielGlenn: toy offline reader: pylint and pep8 [dumps] (ariel) - https://gerrit.wikimedia.org/r/280109
[10:08:13] <wikibugs>	 (CR) ArielGlenn: [C: 2] toy offline reader: pylint and pep8 [dumps] (ariel) - https://gerrit.wikimedia.org/r/280109 (owner: ArielGlenn)
[10:25:10] <wikibugs>	 (PS1) ArielGlenn: pep8 for rsyncmedia script [dumps] (ariel) - https://gerrit.wikimedia.org/r/401051
[10:33:27] <wikibugs>	 (CR) ArielGlenn: [C: 2] pep8 for rsyncmedia script [dumps] (ariel) - https://gerrit.wikimedia.org/r/401051 (owner: ArielGlenn)
[11:03:35] <wikibugs>	 (PS3) MarcoAurelio: [do not merge yet] wikimania2017: closing the wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396581 (https://phabricator.wikimedia.org/T182493)
[11:03:52] <wikibugs>	 (PS4) MarcoAurelio: wikimania2017: closing the wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396581 (https://phabricator.wikimedia.org/T182493)
[11:43:59] <wikibugs>	 (CR) Urbanecm: [C: 1] wikimania2017: closing the wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396581 (https://phabricator.wikimedia.org/T182493) (owner: MarcoAurelio)
[11:44:50] <wikibugs>	 (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/400682 (https://phabricator.wikimedia.org/T183764) (owner: Jayprakash12345)
[11:52:26] <wikibugs>	 (PS5) MarcoAurelio: Close wikimania2017.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/396581 (https://phabricator.wikimedia.org/T182493)
[11:55:26] <wikibugs>	 (Draft1) MarcoAurelio: Set category collation to uca-es-u-kn for eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/401081 (https://phabricator.wikimedia.org/T183802)
[11:55:29] <wikibugs>	 (PS2) MarcoAurelio: Set category collation to uca-es-u-kn for eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/401081 (https://phabricator.wikimedia.org/T183802)
[12:30:07] <icinga-wm>	 PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdk1]
[12:55:07] <icinga-wm>	 RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:34:54] <sjoerddebruin>	 Request from 84.81.160.164 via cp3043 cp3043, Varnish XID 189073764
[15:34:54] <sjoerddebruin>	 Error: 503, Backend fetch failed at Sat, 30 Dec 2017 15:34:51 GMT
[15:35:44] <sjoerddebruin>	 Seems to be up again.
[15:35:48] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on bast3002 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T183814
[15:35:53] <wikibugs>	 Operations, ops-esams: Degraded RAID on bast3002 - https://phabricator.wikimedia.org/T183814#3864387 (ops-monitoring-bot)
[15:36:25] <sjoerddebruin>	 I was saying that too loud.
[15:37:57] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T183815
[15:38:01] <wikibugs>	 Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T183815#3864391 (ops-monitoring-bot)
[15:39:27] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[15:40:26] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[15:40:27] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[15:47:27] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[15:50:26] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[15:50:27] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[16:27:37] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[16:28:36] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[16:30:18] <paladox>	 Hmm is that normal ^^?
[16:38:36] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[16:38:37] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[17:13:38] <wikibugs>	 Operations, Cloud-Services, Toolforge, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#1363321 (Bawolff) HSTS should help with (a)
[17:17:46] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:18:37] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[17:18:46] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[17:25:46] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:30:46] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[17:30:47] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[17:33:08] <elukey>	 bblack,ema - any chance that you are online? ~2h ago lvs3001 seems to have registered a failed drive, could it be the cause of the above 503s ?
[17:37:55] <elukey>	 weird though, the issue appeared in two separate spikes during the past two hours, so might be something else
[17:47:20] <elukey>	 I'd say that there is no need to page in people now, but let's keep an eye during the next hours
[17:56:14] <wikibugs>	 (Restored) ArielGlenn: hackdeploy pylint cleanup: invalid names, indentation, docstrings mostly [dumps] (ariel) - https://gerrit.wikimedia.org/r/280100 (owner: ArielGlenn)
[17:56:27] <wikibugs>	 (PS2) ArielGlenn: hackdeploy pylint cleanup: invalid names, indentation, docstrings mostly [dumps] (ariel) - https://gerrit.wikimedia.org/r/280100
[17:58:59] <wikibugs>	 (CR) ArielGlenn: [C: 2] hackdeploy pylint cleanup: invalid names, indentation, docstrings mostly [dumps] (ariel) - https://gerrit.wikimedia.org/r/280100 (owner: ArielGlenn)
[18:08:25] <wikibugs>	 (Restored) ArielGlenn: prep-dumps-deploy full pylint and pep8 cleanup [dumps] (ariel) - https://gerrit.wikimedia.org/r/280101 (owner: ArielGlenn)
[18:08:33] <wikibugs>	 (PS2) ArielGlenn: prep-dumps-deploy full pylint and pep8 cleanup [dumps] (ariel) - https://gerrit.wikimedia.org/r/280101
[18:09:36] <icinga-wm>	 PROBLEM - SSH cp1066.mgmt on cp1066.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:10:21] <wikibugs>	 (CR) ArielGlenn: [C: 2] prep-dumps-deploy full pylint and pep8 cleanup [dumps] (ariel) - https://gerrit.wikimedia.org/r/280101 (owner: ArielGlenn)
[18:48:33] <wikibugs>	 (CR) Volans: [C: -1] "Thanks for taking care of this. I've a couple of main comments and more smaller ones inline." (13 comments) [puppet] - https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) (owner: Dzahn)
[21:26:36] <icinga-wm>	 PROBLEM - HHVM rendering on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:27:26] <icinga-wm>	 RECOVERY - HHVM rendering on mw2124 is OK: HTTP OK: HTTP/1.1 200 OK - 73710 bytes in 0.276 second response time