[01:20:17] PROBLEM - Incoming network saturation on labstore2001 is CRITICAL 10.71% of data above the critical threshold [100000000.0]
[01:21:28] operations, Browser-Support-Internet-Explorer, HTTPS, HTTPS-by-default: Xbox 360 Internet Explorer unable to view Wikipedia - https://phabricator.wikimedia.org/T105455#1447816 (faidon) p:Triage>Normal
[02:01:28] RECOVERY - Incoming network saturation on labstore2001 is OK Less than 10.00% above the threshold [75000000.0]
[02:06:18] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 10.34% of data above the critical threshold [100000000.0]
[02:09:52] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 00m 34s)
[02:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:10:00] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-12 02:10:00+00:00
[02:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:52] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 06m 12s)
[02:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Jul 12 02:25:33 UTC 2015 (duration 25m 32s)
[02:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:52] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-12 02:26:52+00:00
[02:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:38] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[02:35:35] (PS1) Ori.livneh: varnishrls: small optimizations [puppet] - https://gerrit.wikimedia.org/r/224296
[02:35:54] (CR) Ori.livneh: [C: 2 V: 2] varnishrls: small optimizations [puppet] - https://gerrit.wikimedia.org/r/224296 (owner: Ori.livneh)
[03:35:19] PROBLEM - puppet last run on mw1050 is CRITICAL Puppet has 1 failures
[03:35:20] PROBLEM - puppet last run on mw1163 is CRITICAL Puppet has 1 failures
[03:36:17] PROBLEM - puppet last run on mw1212 is CRITICAL Puppet has 1 failures
[03:38:17] (Restored) Alex Monk: Disable Extension:Oversight [mediawiki-config] - https://gerrit.wikimedia.org/r/169612 (https://bugzilla.wikimedia.org/60373) (owner: Reedy)
[03:39:12] (PS2) Alex Monk: Disable Extension:Oversight [mediawiki-config] - https://gerrit.wikimedia.org/r/169612 (https://phabricator.wikimedia.org/T62373) (owner: Reedy)
[03:58:28] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 11.11% of data above the critical threshold [100000000.0]
[04:01:18] RECOVERY - puppet last run on mw1163 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[04:02:18] RECOVERY - puppet last run on mw1212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:02:43] (CR) Alex Monk: [C: -2] "Not ready just yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/169612 (https://phabricator.wikimedia.org/T62373) (owner: Reedy)
[04:03:08] RECOVERY - puppet last run on mw1050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:20:48] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[04:47:58] PROBLEM - RAID on db1058 is CRITICAL 1 failed LD(s) (Degraded)
[04:49:08] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Jul 12 04:49:08 UTC 2015 (duration 49m 7s)
[04:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:57:15] operations, MediaWiki-ResourceLoader, HHVM, MW-1.26-release, and 3 others: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1447899 (Joe) As can be seen from the graph {F191478} it seems my prediction was confirmed, so the bug is clearly resolved. I would be...
[04:57:31] operations, MediaWiki-ResourceLoader, HHVM, MW-1.26-release, and 3 others: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1447900 (Joe) p:Unbreak!>Normal
[05:36:19] operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, and 3 others: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1447905 (awight) Here's something interesting from `fluorine:/a/mw-log/archive/exc...
[06:31:18] PROBLEM - puppet last run on subra is CRITICAL Puppet has 2 failures
[06:31:18] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on db2055 is CRITICAL Puppet has 1 failures
[06:31:58] PROBLEM - puppet last run on mw2081 is CRITICAL Puppet has 1 failures
[06:32:08] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 2 failures
[06:32:08] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:32:47] PROBLEM - puppet last run on mw1158 is CRITICAL Puppet has 1 failures
[06:33:08] PROBLEM - puppet last run on mw2207 is CRITICAL Puppet has 1 failures
[06:33:18] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures
[06:33:19] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures
[06:33:20] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures
[06:33:58] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 3 failures
[06:56:49] RECOVERY - puppet last run on mw1158 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:28] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:57:28] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:28] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:28] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:29] RECOVERY - puppet last run on db2055 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:08] RECOVERY - puppet last run on mw2081 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:08] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:08] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:08] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:46:48] (PS3) Chmarkine: Rank all ECDHE > all DHE for "mid" level suites [puppet] - https://gerrit.wikimedia.org/r/224232 (https://phabricator.wikimedia.org/T105455)
[07:50:29] (PS4) Chmarkine: Rank all ECDHE > all DHE for "mid" level suites [puppet] - https://gerrit.wikimedia.org/r/224232 (https://phabricator.wikimedia.org/T105455)
[08:08:21] (CR) Chmarkine: "I just tested the SSL handshake of Java 6u43 with WireShark. I used HttpURLConnection class. It doesn't seem Java 6 supports ECDHE. The ci" [puppet] - https://gerrit.wikimedia.org/r/224232 (https://phabricator.wikimedia.org/T105455) (owner: Chmarkine)
[08:21:28] PROBLEM - puppet last run on mw2092 is CRITICAL Puppet has 1 failures
[08:49:18] RECOVERY - puppet last run on mw2092 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:06:07] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 14.29% of data above the critical threshold [100000000.0]
[10:28:08] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[11:22:27] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 11.11% of data above the critical threshold [100000000.0]
[11:45:07] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[11:49:14] operations, Patch-For-Review, Varnish: /static generates (and caches!) redirect loops on cache-miss - https://phabricator.wikimedia.org/T104532#1448029 (Krinkle) Open>Resolved a:Krinkle
[11:49:43] operations, Patch-For-Review, Varnish: /static generates (and caches!) redirect loops on cache-miss - https://phabricator.wikimedia.org/T104532#1419619 (Krinkle) p:Triage>High
[12:12:35] (PS2) Krinkle: grafana: Set a default dashboard [puppet] - https://gerrit.wikimedia.org/r/224129
[12:17:10] Krenair: i gave you 2 years prolong
[12:18:58] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 10.34% of data above the critical threshold [100000000.0]
[12:30:34] operations, ops-eqiad: db1058 (s5 master) degraded RAID - https://phabricator.wikimedia.org/T105627#1448062 (jcrespo) NEW
[12:31:39] ACKNOWLEDGEMENT - RAID on db1058 is CRITICAL 1 failed LD(s) (Degraded) Jcrespo T105627
[12:41:58] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[13:14:08] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 17.86% of data above the critical threshold [100000000.0]
[13:40:47] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[14:43:37] (PS1) Chmarkine: Remove old double-subdomain aliases [dns] - https://gerrit.wikimedia.org/r/224309 (https://phabricator.wikimedia.org/T102814)
[14:48:46] !log upgraded apache2 to 2.2.22-1ubuntu1.9 on: antimony argon caesium fluorine helium iodine logstash1001 logstash1003 magnesium neon netmon1001 rhodium stat1001 ytterbium
[14:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:59:53] !log upgraded most packages on sodium
[14:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:06:38] PROBLEM - puppet last run on neon is CRITICAL Puppet has 1 failures
[15:08:29] (CR) Reedy: [C: -1] "There's still traffic/hits going to these... So we do need to do some notifications and wait a little bit of time" [dns] - https://gerrit.wikimedia.org/r/224309 (https://phabricator.wikimedia.org/T102814) (owner: Chmarkine)
[15:11:58] Is morebots actually running 1.7.11 (https://github.com/wikimedia/operations-debs-adminbot/commit/2929fe24733cdb88c70e69cfa124479f1d0a6c1a)?
[15:12:26] Despite the fix included in 1.7.11 morebots is still adding a new header on each !log
[15:15:03] operations, Traffic, HTTPS, Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1448225 (Reedy) Might be worth looping in @Philippe-WMF and/or @jalexander as they should have contacts within those communities And then we need a patch to...
[15:19:42] operations, Wikimedia-General-or-Unknown, Regression: Labslogbot (adminbot) wrongly creates new sections for each entry - https://phabricator.wikimedia.org/T105636#1448227 (Krinkle) NEW
[15:20:15] operations, Wikimedia-General-or-Unknown, Regression: Labslogbot (adminbot) wrongly creates new sections for each entry - https://phabricator.wikimedia.org/T105636#1448234 (Krinkle) Open>Resolved p:Triage>High a:bd808
[15:20:53] operations, Wikimedia-General-or-Unknown, Regression: Labslogbot (adminbot) wrongly creates new sections for each entry - https://phabricator.wikimedia.org/T105636#1448227 (Krinkle) > Fix new section creation for each edit > https://gerrit.wikimedia.org/r/224212
[15:31:09] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures
[15:32:08] operations, Adminbot: Restore morebots' microblogging relay @wikimediatech - https://phabricator.wikimedia.org/T59969#1448255 (Krinkle)
[15:34:26] operations, Adminbot, Labs: Make morebots run on a production host - https://phabricator.wikimedia.org/T94638#1448264 (Krinkle)
[15:38:47] (PS1) BBlack: toolserver.org: turn on standard ssl_settings [puppet] - https://gerrit.wikimedia.org/r/224311
[15:38:49] (PS1) BBlack: toolserver.org: send proper chain [puppet] - https://gerrit.wikimedia.org/r/224312
[15:40:20] (CR) BBlack: [C: -1] "Putting this here as a reminder, but for some reason I can't fathom at the moment, sslcert::certificate doesn't seem to be doing anything " [puppet] - https://gerrit.wikimedia.org/r/224312 (owner: BBlack)
[15:40:50] (CR) BBlack: [C: 2] toolserver.org: turn on standard ssl_settings [puppet] - https://gerrit.wikimedia.org/r/224311 (owner: BBlack)
[15:44:49] (CR) BBlack: "also, the previous change to relic's apache config for SSL: https://gerrit.wikimedia.org/r/#/c/224311 did nothing on the host. I'm gettin" [puppet] - https://gerrit.wikimedia.org/r/224312 (owner: BBlack)
[15:51:38] PROBLEM - puppet last run on ms-be3002 is CRITICAL puppet fail
[16:00:46] (CR) Yuvipanda: "Heh, see https://phabricator.wikimedia.org/T104537" [puppet] - https://gerrit.wikimedia.org/r/224312 (owner: BBlack)
[16:13:38] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 10.34% of data above the critical threshold [100000000.0]
[16:17:59] RECOVERY - puppet last run on ms-be3002 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[16:35:16] (CR) BBlack: [C: 2] "I applied both changes to the host manually, and then I guess they'll still be here when/if this puppetization ever applies." [puppet] - https://gerrit.wikimedia.org/r/224312 (owner: BBlack)
[16:36:08] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[17:10:02] (PS1) BryanDavis: [WIP] Sync /srv/mediawiki-staging to co-masters [tools/scap] - https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826)
[17:27:40] (CR) Reedy: [WIP] Sync /srv/mediawiki-staging to co-masters (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826) (owner: BryanDavis)
[17:27:44] (CR) BryanDavis: [WIP] Sync /srv/mediawiki-staging to co-masters (3 comments) [tools/scap] - https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826) (owner: BryanDavis)
[17:31:59] (CR) BryanDavis: [WIP] Sync /srv/mediawiki-staging to co-masters (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826) (owner: BryanDavis)
[17:41:42] SPF|Cloud: I don't think morebots is running the new version yet. mut.ante smartly didn't want to deploy it late on a Friday
[17:42:07] So hopefully we will update tomorrow and see the fix
[17:43:00] Okay
[18:41:19] PROBLEM - puppet last run on labcontrol2001 is CRITICAL puppet fail
[18:41:39] PROBLEM - puppet last run on restbase1008 is CRITICAL Puppet has 1 failures
[18:41:39] PROBLEM - Cassanda CQL query interface on restbase1008 is CRITICAL: Connection refused
[18:42:09] PROBLEM - Cassandra database on restbase1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon
[18:46:29] PROBLEM - Apache HTTP on mw1154 is CRITICAL - Socket timeout after 10 seconds
[18:48:17] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.112 second response time
[19:07:29] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures
[19:57:48] PROBLEM - Disk space on uranium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=78%)
[20:02:52] some opsen should look at this ^
[20:03:14] * matanya pokes akosiaris , _joe_ and apergos
[20:07:18] RECOVERY - Disk space on uranium is OK: DISK OK
[20:08:11] I truncated a couple log files
[20:09:59] that were no longer being written to
[20:34:20] thanks apergos
[21:09:08] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1421 bytes in 0.183 second response time
[21:40:08] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1412 bytes in 0.137 second response time
[22:13:29] (PS1) Hoo man: Increase the dispatching resources for Wikidata [puppet] - https://gerrit.wikimedia.org/r/224365
[22:14:37] Any op around?
[22:17:47] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 10.34% of data above the critical threshold [100000000.0]
[22:40:17] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[23:06:15] (CR) Ori.livneh: [C: 2] Increase the dispatching resources for Wikidata [puppet] - https://gerrit.wikimedia.org/r/224365 (owner: Hoo man)
[23:24:54] ummmmmm
[23:25:00] SAL has a bunch of duplicate headers?
[23:35:24] legoktm: yeah. we have a fix just not deployed yet
[23:35:25] legoktm: it's already been fixed but not deployed on labs
[23:42:14] ah
[23:59:40] (PS1) Thcipriani: Add service deploy via scap [tools/scap] - https://gerrit.wikimedia.org/r/224374