[00:04:49] !log maxsem@tin Synchronized php-1.27.0-wmf.5/extensions/WikimediaMaintenance/getPageCounts.php: (no message) (duration: 00m 34s)
[00:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:07:07] RECOVERY - salt-minion processes on pollux is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:12:46] PROBLEM - salt-minion processes on pollux is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:23:24] !log maxsem@tin Synchronized php-1.27.0-wmf.5/extensions/WikimediaMaintenance/getPageCounts.php: (no message) (duration: 00m 34s)
[01:11:17] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[01:16:56] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[01:20:17] (PS1) Ladsgroup: Enable VisualEditor for draft namespace in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/251674
[01:24:57] (CR) Alex Monk: "Please keep the indentation consistent. Also, have you spoken to James about this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/251674 (owner: Ladsgroup)
[01:28:36] (PS1) Ori.livneh: Add an Icinga check for Graphite metric freshness [puppet] - https://gerrit.wikimedia.org/r/251675
[01:28:40] ^ Krinkle
[01:29:43] (CR) jenkins-bot: [V: -1] Add an Icinga check for Graphite metric freshness [puppet] - https://gerrit.wikimedia.org/r/251675 (owner: Ori.livneh)
[01:29:45] (PS2) Ori.livneh: Add an Icinga check for Graphite metric freshness [puppet] - https://gerrit.wikimedia.org/r/251675
[01:29:46] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[01:30:04] (PS2) Ladsgroup: Enable VisualEditor for draft namespace in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/251674
[01:31:02] ori: Interesting. And we'd invoke it for metrics from different origins, e.g. one for hafnium/eventlogging using navtiming for example, one for statsv using maybe one of the media metrics, and one for statsd in general using reqstats?
[01:31:20] yeah
[01:32:10] (CR) Ladsgroup: "Thank you. I fixed indentation. I hope James answers soon too." [mediawiki-config] - https://gerrit.wikimedia.org/r/251674 (owner: Ladsgroup)
[01:35:39] ori: max( ts for value, ts in data['datapoints'] if value is not None)
[01:35:52] that's very dense python, fascinating
[01:36:29] gotta love commons
[01:36:30] http://i.imgur.com/eAq9B6H.png
[01:36:57] ori: Ah, you viewed a page with ?uselang=ownwork in the past, and then the next view is default
[01:36:59] so you changed :P
[01:37:21] (PS3) Ladsgroup: Enable VisualEditor for draft namespace in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/251674 (https://phabricator.wikimedia.org/T118060)
[01:37:26] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds
[01:37:28] :-)
[01:37:53] it's easy: go upload file -> it is entirely my own work -> "new! try uploadwizard"
[01:38:18] Oh, right
[01:38:19] ori: Oh dear.
[01:38:21] it happens by default
[01:38:38] (CR) Jforrester: [C: 1] "Let's do this on Monday." [mediawiki-config] - https://gerrit.wikimedia.org/r/251674 (https://phabricator.wikimedia.org/T118060) (owner: Ladsgroup)
[01:38:45] because there is an effective interstitial page that gives you uselang=ownwork
[01:39:15] yeah
[01:39:35] commons is held together by twigs and stray bits of yarn
[01:45:31] (PS2) Jforrester: Enable Flow user opt-in Beta Feature on two more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/251523 (https://phabricator.wikimedia.org/T117991)
[01:45:33] (PS1) Jforrester: Enable Flow user opt-in Beta Feature on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/251677 (https://phabricator.wikimedia.org/T116611)
[01:46:06] (CR) Jforrester: [C: -1] "Let's not do this until a week's time." [mediawiki-config] - https://gerrit.wikimedia.org/r/251523 (https://phabricator.wikimedia.org/T117991) (owner: Jforrester)
[01:46:47] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[01:52:24] (PS1) CSteipp: [WIP]Set password policy for enwiki sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/251678
[01:54:50] (CR) Alex Monk: "Heh, that link says "Adding rules to passwords [...] requires community consensus to implement.". Nope." [mediawiki-config] - https://gerrit.wikimedia.org/r/251678 (owner: CSteipp)
[01:59:24] (CR) Legoktm: "How does this work with CentralAuth...? Can't the attacker just login on another wiki and use autologin to get access to enwiki?" [mediawiki-config] - https://gerrit.wikimedia.org/r/251678 (owner: CSteipp)
[02:00:36] (CR) Ori.livneh: "Is 6-8 really adequate? The length requirement should really be based on what it takes to secure the site, rather than what users find con" [mediawiki-config] - https://gerrit.wikimedia.org/r/251678 (owner: CSteipp)
[02:10:05] (CR) Alex Monk: [C: -1] "Per Legoktm. A wiki-specific approach is not going to work here (for SUL wikis anyway), it must be global policy." [mediawiki-config] - https://gerrit.wikimedia.org/r/251678 (owner: CSteipp)
[02:27:34] !log l10nupdate@tin Synchronized php-1.27.0-wmf.5/cache/l10n: l10nupdate for 1.27.0-wmf.5 (duration: 06m 12s)
[02:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:02] (PS1) Ori.livneh: Add cli-shutdown.groovy from Jenkins advisory [puppet] - https://gerrit.wikimedia.org/r/251681
[02:29:24] (CR) Ori.livneh: [C: 2 V: 2] Add cli-shutdown.groovy from Jenkins advisory [puppet] - https://gerrit.wikimedia.org/r/251681 (owner: Ori.livneh)
[02:32:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[02:34:26] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[02:51:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 7 below the confidence bounds
[03:31:56] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: puppet fail
[04:00:26] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:02:00] (PS1) Ori.livneh: Remove proofreadpage.php [mediawiki-config] - https://gerrit.wikimedia.org/r/251685
[04:02:02] (PS1) Ori.livneh: Remove leftover configuration data from the mobile browse experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/251686
[04:02:23] (CR) Ori.livneh: [C: 2] Remove proofreadpage.php [mediawiki-config] - https://gerrit.wikimedia.org/r/251685 (owner: Ori.livneh)
[04:03:04] (Merged) jenkins-bot: Remove proofreadpage.php [mediawiki-config] - https://gerrit.wikimedia.org/r/251685 (owner: Ori.livneh)
[04:03:15] (PS2) Ori.livneh: Remove leftover configuration data from the mobile browse experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/251686
[04:04:07] (CR) Ori.livneh: [C: 2] Remove leftover configuration data from the mobile browse experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/251686 (owner: Ori.livneh)
[04:04:31] (Merged) jenkins-bot: Remove leftover configuration data from the mobile browse experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/251686 (owner: Ori.livneh)
[04:07:18] Krenair: While you're here, re https://gerrit.wikimedia.org/r/#/c/157338/, does that need a SWAT window to go out?
[04:07:27] If so I'll have to arrange to be available for one
[04:07:45] probably not
[04:08:04] although I suspect different people would have different answers for you
[04:08:10] I'm certainly not doing it right now though
[04:08:14] fair enough.
[04:08:49] maybe on monday
[04:09:10] IIRC in my time zone, SWATs are at something like 11am (busy) and 2am (asleep) - not the best times!
[04:10:55] you're on UTC+11?
[04:12:59] Think it's 3AM and 11AM for you.
[04:14:03] That sounds about right. 2am, 3am, same diff though... still asleep :)
[04:20:36] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 3 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[04:23:13] ori, are you synchronising those
[04:23:14] ?
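The one-liner quoted at 01:35:39 picks the newest timestamp that carries a non-null value out of a Graphite series. A minimal sketch of how a freshness test could be built around it is below. It is not the check from change 251675: the function names, the sample series, and the age thresholds are all hypothetical, and the only thing taken from the log is the generator expression over `data['datapoints']` (Graphite's JSON output lists each datapoint as a `[value, timestamp]` pair).

```python
import time


def newest_timestamp(series):
    """Return the latest timestamp that has a non-null value, or None.

    Mirrors the expression from the log; default=None (an addition here)
    avoids a ValueError when every datapoint is null.
    """
    return max(
        (ts for value, ts in series["datapoints"] if value is not None),
        default=None,
    )


def is_fresh(series, max_age_seconds, now=None):
    """True if the newest non-null datapoint is at most max_age_seconds old."""
    now = time.time() if now is None else now
    newest = newest_timestamp(series)
    return newest is not None and (now - newest) <= max_age_seconds


# Hypothetical series: the last two intervals have not been filled in yet.
sample = {
    "target": "example.metric",
    "datapoints": [[1.0, 1000], [2.5, 1060], [None, 1120], [None, 1180]],
}

print(newest_timestamp(sample))          # newest non-null timestamp
print(is_fresh(sample, 300, now=1200))   # within a 5-minute budget?
print(is_fresh(sample, 60, now=1200))    # within a 1-minute budget?
```

Passing `now` explicitly keeps the function easy to exercise with fixed data; a real check would presumably map the boolean onto the monitoring system's exit codes.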
[04:28:28] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[04:36:16] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0]
[04:43:47] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[04:59:24] !log ori@tin Synchronized wmf-config/mobile.php: (no message) (duration: 01m 09s)
[04:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:00:16] !log ori@tin Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 34s)
[05:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:00:29] (CR) Ori.livneh: Allow import from any Labs/Beta Cluster project to any other (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: TTO)
[05:00:36] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[05:01:34] ori: Own file, perhaps?
[05:01:50] that'd be better for sure
[05:01:58] I'll do it
[05:02:05] thanks
[05:12:51] (PS8) TTO: Allow import from any Labs/Beta Cluster project to any other [mediawiki-config] - https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583)
[05:13:05] ori: Because I'm on Windows, I have no idea how to set up a symlink in the noc/conf directory...
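The SWAT-window arithmetic from 04:09 to 04:14 can be reproduced in a few lines. The log never states the windows in UTC; the 16:00 and 00:00 UTC values below are an assumption, chosen because they are what "3AM and 11AM" works out to for someone on UTC+11. The function name is likewise only illustrative.

```python
def utc_hour_to_local(utc_hour, utc_offset_hours):
    """Convert an hour of the day in UTC to local time for a whole-hour offset."""
    return (utc_hour + utc_offset_hours) % 24


# Assumed SWAT window start hours in UTC (inferred, not stated in the log).
ASSUMED_SWAT_UTC_HOURS = [16, 0]

local_hours = [utc_hour_to_local(h, 11) for h in ASSUMED_SWAT_UTC_HOURS]
print(local_hours)  # start hours as seen from UTC+11
```

The modulo keeps the result inside a 24-hour day, so a window late in the UTC day wraps into the early morning of the next local day.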
[06:29:47] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail
[06:30:37] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:57] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:57] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:38] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:16] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:17] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:17] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:26] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:47] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:32:57] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:57] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:08] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:16] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:47] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:42:58] o/
[06:43:27] Just as a note, Commons is intermittently giving the ‘high database server lag’ warning.
[06:48:47] PROBLEM - puppet last run on wtp2005 is CRITICAL: CRITICAL: puppet fail
[06:55:48] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:56:37] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:56:46] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:56] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:57] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:57:17] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:26] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:57:26] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:57:27] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:57:46] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:58:07] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:58:17] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:58:17] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:37] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:15:17] RECOVERY - puppet last run on wtp2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:34:42] Blocked-on-Operations, operations, Discovery, Maps, and 2 others: Deploy TileratorUI service - https://phabricator.wikimedia.org/T116062#1790771 (Yurik)
[07:34:44] operations, Discovery, Maps, Services, Discovery-Maps-Sprint: Kartotherian does not start in producton - https://phabricator.wikimedia.org/T115074#1790775 (Yurik)
[07:34:47] Blocked-on-Operations, Ops-Access-Requests, operations, Discovery, and 4 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1790777 (Yurik)
[07:34:58] operations, Discovery, Maps, Salt, Discovery-Maps-Sprint: Kartotherian git deploy service restart failed with perm error - https://phabricator.wikimedia.org/T112707#1790799 (Yurik)
[07:42:40] operations, Discovery, Maps, Discovery-Maps-Sprint, Patch-For-Review: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1790958 (Yurik) Open>Resolved
[08:08:29] operations, Icinga: icinga-wm not outputing messages for alerts that also paged - https://phabricator.wikimedia.org/T118072#1790978 (yuvipanda) NEW
[08:10:57] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[08:18:47] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[08:54:27] RECOVERY - salt-minion processes on pollux is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:21:38] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0]
[09:27:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:08:20] !log performing schema change on db1054 (zhwiki)
[10:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:10:56] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:19:52] (CR) Alexandros Kosiaris: [C: -1] servermon: add ferm rules for http/https (1 comment) [puppet] - https://gerrit.wikimedia.org/r/251552 (https://phabricator.wikimedia.org/T105410) (owner: Dzahn)
[10:27:40] !log reverting last schema change
[10:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:30:42] operations, Icinga: icinga-wm not outputing messages for alerts that also paged - https://phabricator.wikimedia.org/T118072#1791004 (akosiaris) a:akosiaris Taking a quick look at the service in question that started this, namely https://icinga.wikimedia.org/cgi-bin/icinga/notifications.cgi?host=db1060&s...
[10:30:51] operations, Icinga: icinga-wm not outputing messages for alerts that also paged - https://phabricator.wikimedia.org/T118072#1791006 (akosiaris) p:Triage>High
[10:33:44] (PS1) Jcrespo: Depool db1054 (s2 API server) [mediawiki-config] - https://gerrit.wikimedia.org/r/251694
[10:37:07] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:37:33] (PS2) Jcrespo: Depool db1054 (s2 API server) [mediawiki-config] - https://gerrit.wikimedia.org/r/251694
[10:39:13] (PS3) Jcrespo: Depool db1054 (s2 API server) [mediawiki-config] - https://gerrit.wikimedia.org/r/251694
[10:39:52] (CR) Jcrespo: [C: 2] Depool db1054 (s2 API server) [mediawiki-config] - https://gerrit.wikimedia.org/r/251694 (owner: Jcrespo)
[10:41:40] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1054 (duration: 00m 35s)
[10:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:42:21] this change may cause extra load on db1060, that is expected
[10:45:31] !log restarting db1054 (I may need to do it several times)
[10:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:47:54] operations, Wikidata, Database, Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#1791014 (jcrespo) A similar thing is happening on zhwiki for a different query- the optimizer seems to h...
[10:48:11] operations, Wikidata, Database, Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#1791015 (jcrespo) p:Normal>High
[11:02:08] !log setting SET GLOBAL use_stat_tables = 0; on db1060
[11:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:04:19] it will take some minutes to get results
[11:22:10] (PS1) Jcrespo: Repool db1054 [mediawiki-config] - https://gerrit.wikimedia.org/r/251696
[11:22:20] (PS2) Jcrespo: Repool db1054 [mediawiki-config] - https://gerrit.wikimedia.org/r/251696
[11:23:39] (CR) Jcrespo: [C: 2] Repool db1054 [mediawiki-config] - https://gerrit.wikimedia.org/r/251696 (owner: Jcrespo)
[11:24:38] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1054 (duration: 00m 34s)
[11:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:24:58] operations, Wikidata, Database, Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#1791062 (jcrespo) The initial issue still happens, although now the query is consistently slow every tim...
[12:31:10] operations, Wikidata, Database, Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#1791063 (hoo) I can no longer see this issue on either db1060 nor db1054, but it's still reproducible on...
[13:02:06] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail
[13:27:07] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:46:31] (PS1) BBlack: remove erroneous comments [puppet] - https://gerrit.wikimedia.org/r/251703
[13:46:56] (CR) BBlack: [C: 2 V: 2] remove erroneous comments [puppet] - https://gerrit.wikimedia.org/r/251703 (owner: BBlack)
[13:47:58] (PS1) BBlack: set cache_misc to "mid" ciphersuite [puppet] - https://gerrit.wikimedia.org/r/251704
[13:57:03] Puppet, operations, Traffic: Clean up nginx / nginx::ssl classes and usage - https://phabricator.wikimedia.org/T118078#1791115 (BBlack) NEW
[14:11:41] operations, Database: Bug on MariaDB use_stat_tables - https://phabricator.wikimedia.org/T118079#1791126 (jcrespo) NEW a:jcrespo
[14:20:36] operations, Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1791137 (BBlack) stalled>Open
[14:42:37] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[14:48:16] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[14:53:28] (PS1) Jcrespo: Return to status quo after load is back to normal [mediawiki-config] - https://gerrit.wikimedia.org/r/251706
[14:58:40] Is the beta cluster not updating these days?
[14:58:52] Hmm lemme ask on releng
[14:59:39] operations, Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1791144 (BBlack) p:Low>Normal
[15:01:57] K I take that back in any case, my bad...
[15:07:03] (CR) Faidon Liambotis: "I'm not sure if I agree with the rationale here. misc-web is used for endpoints that address non-technical audiences as well (from a quick" [puppet] - https://gerrit.wikimedia.org/r/251704 (owner: BBlack)
[15:12:57] (CR) BBlack: "Fair point, but the set of excluded clients in this particular case (because our cache clusters do support DHE properly) is fairly minimal" [puppet] - https://gerrit.wikimedia.org/r/251704 (owner: BBlack)
[15:15:46] (CR) Jcrespo: [C: 2] Return to status quo after load is back to normal [mediawiki-config] - https://gerrit.wikimedia.org/r/251706 (owner: Jcrespo)
[15:17:42] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Return db weights on s2 api back to normal (duration: 00m 34s)
[15:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:22:38] (PS1) BBlack: bring ciphersuite commentary up to date [puppet] - https://gerrit.wikimedia.org/r/251709
[15:23:10] (CR) BBlack: [C: 2 V: 2] bring ciphersuite commentary up to date [puppet] - https://gerrit.wikimedia.org/r/251709 (owner: BBlack)
[15:48:08] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[16:04:35] (PS1) JanZerebecki: sudo journalctl: make missing restrions obvious [puppet] - https://gerrit.wikimedia.org/r/251714 (https://phabricator.wikimedia.org/T115067)
[16:11:36] operations, Graphite, Monitoring, Privacy: grafana.wikimedia.org calls out to AWS - https://phabricator.wikimedia.org/T110484#1791195 (Nemo_bis)
[16:39:17] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[16:44:48] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[17:01:06] !log restarting es2010 mysql to test mariadb upgrade
[17:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:03:27] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[17:09:08] !log restarting db2070 mysql to test mariadb upgrade
[17:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:25:17] operations, Graphite, Monitoring, Privacy: grafana.wikimedia.org calls out to AWS - https://phabricator.wikimedia.org/T110484#1791257 (ori) Open>Resolved a:ori Fixed by upgrading.
[17:28:41] !log restarting labsdb1004 mysql to test mariadb upgrade
[17:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:28:55] ^this is the last one I send, I promise :-)
[17:49:58] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [500.0]
[17:55:43] the spike was me, and it wasn't from real traffic
[17:59:28] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:03:45] operations, Wikidata, Database, Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#1791280 (jcrespo) Can confirm last seen on db1018: ``` SELECT DISTINCT eu_entity_id FROM `wbc_entity_u...
[18:06:48] (CR) BryanDavis: [C: -1] "Oops, minor oversight on my part. The extension-list changes need to roll out with the initial 1.27-wmf.6 deploy on Tuesday so that l10n c" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: Jdlrobson)
[18:28:40] (CR) Lydia Pintscher: [C: 1] "Good from my side :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/251677 (https://phabricator.wikimedia.org/T116611) (owner: Jforrester)
[19:27:11] fyi I'm about to deploy a high priority VE fix for a performance bug affecting page views
[19:27:26] where 'about to' is determined by jenkins
[19:47:23] operations, Deployment-Systems: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1791376 (Krenair) ```krenair@mira:/srv/mediawiki-staging/php-1.27.0-wmf.5 (wmf/1.27.0-wmf.5)$ git fetch origin error: cannot open .git/FETCH_HEAD: Permission denied krenair@mira:/srv/...
[19:50:31] !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js: https://gerrit.wikimedia.org/r/#/c/251739/ (duration: 00m 35s)
[19:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:26:18] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: puppet fail
[20:33:21] (CR) Reedy: "Ok, so only the EP one is still there..." [mediawiki-config] - https://gerrit.wikimedia.org/r/250850 (owner: Reedy)
[20:36:24] (CR) Reedy: "Pinged the task for that.. T48577" [mediawiki-config] - https://gerrit.wikimedia.org/r/250850 (owner: Reedy)
[20:52:38] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[21:19:56] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5000000.0]
[21:24:28] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [5000000.0]
[21:33:06] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[21:33:57] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[22:01:59] operations, Labs, Database, Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1791485 (MZMcBride)
[22:27:17] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000000.0]
[22:30:47] (PS1) Alex Monk: Use a more useful error message when DB connection fails [software/dbtree] - https://gerrit.wikimedia.org/r/251791
[22:40:26] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[23:11:46] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail
[23:37:48] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures