[00:02:06] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 35357: Connection refused
[00:02:46] PROBLEM - keystone public endoint port 5000 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 5000: Connection refused
[00:09:36] PROBLEM - SSH cp1066.mgmt on cp1066.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:09:26] RECOVERY - SSH cp1066.mgmt on cp1066.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0)
[01:36:51] Operations, Cloud-Services, Toolforge, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3863889 (Legoktm) If I have my tool set a `strict-transport-security: max-age=86400` header, will that impact other tools as well since the...
[02:54:20] Operations, Cloud-Services, Toolforge, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#1363321 (Krenair) >>! In T102367#3863889, @Legoktm wrote: > If I have my tool set a `strict-transport-security: max-age=86400` header, will...
[03:25:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 758.33 seconds
[03:49:24] Operations, Cloud-Services, Toolforge, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3863974 (bd808) >>! In T102367#3863889, @Legoktm wrote: > If I have my tool set a `strict-transport-security: max-age=86400` header, will t...
[03:51:36] Operations, Cloud-Services, Toolforge, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#1363321 (Reedy) https://hstspreload.org/?domain=tools.wmflabs.org {F12139559 size=full}
[03:54:47] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 287.81 seconds
[07:09:36] PROBLEM - SSH cp1066.mgmt on cp1066.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:09:26] RECOVERY - SSH cp1066.mgmt on cp1066.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0)
[10:07:43] (PS3) ArielGlenn: toy offline reader: pylint and pep8 [dumps] (ariel) - https://gerrit.wikimedia.org/r/280109
[10:08:13] (CR) ArielGlenn: [C: 2] toy offline reader: pylint and pep8 [dumps] (ariel) - https://gerrit.wikimedia.org/r/280109 (owner: ArielGlenn)
[10:25:10] (PS1) ArielGlenn: pep8 for rsyncmedia script [dumps] (ariel) - https://gerrit.wikimedia.org/r/401051
[10:33:27] (CR) ArielGlenn: [C: 2] pep8 for rsyncmedia script [dumps] (ariel) - https://gerrit.wikimedia.org/r/401051 (owner: ArielGlenn)
[11:03:35] (PS3) MarcoAurelio: [do not merge yet] wikimania2017: closing the wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396581 (https://phabricator.wikimedia.org/T182493)
[11:03:52] (PS4) MarcoAurelio: wikimania2017: closing the wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396581 (https://phabricator.wikimedia.org/T182493)
[11:43:59] (CR) Urbanecm: [C: 1] wikimania2017: closing the wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396581 (https://phabricator.wikimedia.org/T182493) (owner: MarcoAurelio)
[11:44:50] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/400682 (https://phabricator.wikimedia.org/T183764) (owner: Jayprakash12345)
[11:52:26] (PS5) MarcoAurelio: Close wikimania2017.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/396581 (https://phabricator.wikimedia.org/T182493)
[11:55:26] (Draft1) MarcoAurelio: Set category collation to uca-es-u-kn for eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/401081 (https://phabricator.wikimedia.org/T183802)
[11:55:29] (PS2) MarcoAurelio: Set category collation to uca-es-u-kn for eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/401081 (https://phabricator.wikimedia.org/T183802)
[12:30:07] PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdk1]
[12:55:07] RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:34:54] Request from 84.81.160.164 via cp3043 cp3043, Varnish XID 189073764
[15:34:54] Error: 503, Backend fetch failed at Sat, 30 Dec 2017 15:34:51 GMT
[15:35:44] Seems to be up again.
[15:35:48] ACKNOWLEDGEMENT - MD RAID on bast3002 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T183814
[15:35:53] Operations, ops-esams: Degraded RAID on bast3002 - https://phabricator.wikimedia.org/T183814#3864387 (ops-monitoring-bot)
[15:36:25] I was saying that too loud.
[15:37:57] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T183815
[15:38:01] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T183815#3864391 (ops-monitoring-bot)
[15:39:27] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5
[15:40:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=esamsvar-cache_type=Allvar-status_type=5
[15:40:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5
[15:47:27] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5
[15:50:26] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=esamsvar-cache_type=Allvar-status_type=5
[15:50:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5
[16:27:37] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5
[16:28:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=esamsvar-cache_type=Allvar-status_type=5
[16:30:18] Hmm is that normal ^^?
[16:38:36] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=esamsvar-cache_type=Allvar-status_type=5
[16:38:37] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5
[17:13:38] Operations, Cloud-Services, Toolforge, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#1363321 (Bawolff) HSTS should help with (a)
[17:17:46] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5
[17:18:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=esamsvar-cache_type=Allvar-status_type=5
[17:18:46] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5
[17:25:46] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5
[17:30:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=esamsvar-cache_type=Allvar-status_type=5
[17:30:47] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5
[17:33:08] bblack,ema - any chance that you are online? ~2h ago lvs3001 seems to have registered a failed drive, could it be the cause of the above 503s ?
[17:37:55] weird though, the issue appeared in two separate spikes during the past two hours, so might be something else
[17:47:20] I'd say that there is no need to page in people now, but let's keep an eye during the next hours
[17:56:14] (Restored) ArielGlenn: hackdeploy pylint cleanup: invalid names, indentation, docstrings mostly [dumps] (ariel) - https://gerrit.wikimedia.org/r/280100 (owner: ArielGlenn)
[17:56:27] (PS2) ArielGlenn: hackdeploy pylint cleanup: invalid names, indentation, docstrings mostly [dumps] (ariel) - https://gerrit.wikimedia.org/r/280100
[17:58:59] (CR) ArielGlenn: [C: 2] hackdeploy pylint cleanup: invalid names, indentation, docstrings mostly [dumps] (ariel) - https://gerrit.wikimedia.org/r/280100 (owner: ArielGlenn)
[18:08:25] (Restored) ArielGlenn: prep-dumps-deploy full pylint and pep8 cleanup [dumps] (ariel) - https://gerrit.wikimedia.org/r/280101 (owner: ArielGlenn)
[18:08:33] (PS2) ArielGlenn: prep-dumps-deploy full pylint and pep8 cleanup [dumps] (ariel) - https://gerrit.wikimedia.org/r/280101
[18:09:36] PROBLEM - SSH cp1066.mgmt on cp1066.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:10:21] (CR) ArielGlenn: [C: 2] prep-dumps-deploy full pylint and pep8 cleanup [dumps] (ariel) - https://gerrit.wikimedia.org/r/280101 (owner: ArielGlenn)
[18:48:33] (CR) Volans: [C: -1] "Thanks for taking care of this. I've a couple of main comments and more smaller ones inline." (13 comments) [puppet] - https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) (owner: Dzahn)
[21:26:36] PROBLEM - HHVM rendering on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:27:26] RECOVERY - HHVM rendering on mw2124 is OK: HTTP OK: HTTP/1.1 200 OK - 73710 bytes in 0.276 second response time