[00:53:18] serviceops, Operations, Performance-Team, Core Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (Krinkle)
[07:00:46] serviceops, Operations: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (Joe) a:Joe→None @RLazarus and @Dzahn can you pick up the task of putting these into rotation? They will eventually replace mw1221-mw1258, and I think we can go 1:1 in...
[07:01:19] <_joe_> rlazarus, mutante ^^ (https://phabricator.wikimedia.org/T236437) - can you work on finalizing it?
[07:01:32] <_joe_> I'm happy to help with any doubts you might have
[09:12:03] we did not plan yesterday for a subteam meeting today; the challenge will be getting rlazarus and mutante timely notification if we want to do it >_<
[09:54:01] <_joe_> apergos: effie is off too IIRC
[09:54:11] ah well
[09:54:26] <_joe_> or at least I think she said something along those lines yesterday
[10:51:32] I am off
[11:19:16] serviceops, WMF-JobQueue: Enable RunSingleJobHandler endpoint on Job Runner Cluster - https://phabricator.wikimedia.org/T244770 (Joe) >>! In T244770#5871111, @Pchelolo wrote: > I've poked around mediawiki-config, and I don't really see a none-hacky way of setting a variable depending on MW cluster. [[ ht...
[11:25:48] serviceops, Release-Engineering-Team: Enable phpdbg on mwdebug* servers - https://phabricator.wikimedia.org/T244549 (hnowlan) a:hnowlan
[16:15:16] serviceops, Release-Engineering-Team-TODO, Scap: Deploy scap 3.13.0-1 - https://phabricator.wikimedia.org/T245530 (LarsWirzenius)
[16:23:18] re T236437 - I haven't seen that workflow yet, happy to work on it if mutante doesn't mind showing me how
[16:23:31] serviceops, Operations, Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (elukey) Very nice summary, thanks! A couple of questions: > FailoverWithExptimeRoute, where we define how long the values keys updated (eg set/add) dur...
[16:32:56] serviceops, Operations, Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Joe) A few notes: - We cannot really worry too much about stale keys over failovers - we had a system before mcrouter where this was happening regularly...
[16:58:06] hm, hiya _joe_
[16:58:07] reading https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service
[16:58:16] i think there are incorrect bits, even though it was recently updated
[16:58:19] <_joe_> oh ahah
[16:58:27] <_joe_> yeah that needs updating
[16:58:35] <_joe_> I wanted to finish with the changes
[16:58:38] ah k
[16:58:41] <_joe_> before doing so
[16:58:43] i'll be adding a new service soon
[16:58:47] <_joe_> ok
[16:58:54] also, i need to change some ports on a couple of existing ones
[16:58:57] <_joe_> you can for now read the docs in the puppet tree
[16:59:03] does that mean I need to add a new service for the new ports?
[16:59:11] <_joe_> you want to change ports?
[16:59:12] Oh cool ok
[16:59:31] yes
[16:59:31] https://phabricator.wikimedia.org/T245203
[16:59:44] <_joe_> that yes would only work by adding the new one, switching clients, then removing the old one
[17:00:29] serviceops, Analytics, Analytics-Kanban: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (Ottomata)
[17:00:30] yeah
[17:00:32] thought so
[17:01:21] it's too bad (is it too bad?) the port isn't somehow abstracted/proxied in by discovery url. i guess that would require some universal proxy service
[17:01:29] but it'd be nice if the name was all that was needed to address the service
[17:02:08] <_joe_> ottomata: I happen to have been preparing that service right now
[17:02:16] :o
[17:02:18] very coooool
[17:02:24] <_joe_> it's using envoy, it will first go live on the appservers
[17:02:26] _joe_: where to look in puppet for docs?
[17:02:37] Oh right this is the local envoy proxy stuff
[17:02:40] riiighhhht
[17:02:41] cool
[17:02:53] this is what will allow us to do https from MW
[17:03:35] <_joe_> https://github.com/wikimedia/puppet/blob/production/hieradata/common/lvs/configuration.yaml points you in the right direction
[17:03:42] <_joe_> ottomata: hopefully, yes!
[17:08:20] serviceops, Release-Engineering-Team-TODO, Scap: Deploy scap 3.13.0-1 - https://phabricator.wikimedia.org/T245530 (ori)
[18:15:12] serviceops, Operations, Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (elukey) About the TTL, I'd involve Timo and Aaron. For some keys, that are expensive to generate or that might cause a ton of traffic if regenerated and...
[19:05:16] <_joe_> mutante, rlazarus did you see my ping this morning?
[19:05:37] re T236437 - I haven't seen that workflow yet, happy to work on it if mutante doesn't mind showing me how
[19:05:58] _joe_: i have not seen it, looking at that ticket
[19:06:33] <_joe_> mutante: ^^
[19:06:54] _joe_: i see, yes, definitely can get mw servers into rotation
[19:06:59] <_joe_> what rlazarus said, I can also assist tomorrow
[19:07:47] <_joe_> so first thing is to decide which servers go into what cluster
[19:08:11] 1:1 for appserver vs. API, ack
[19:08:20] <_joe_> basically look at where the rack/rows are for the servers we're replacing and that, yes
[19:08:46] <_joe_> the goal is to be able to tolerate losing one row and still serving with ~ 66% of our capacity
[19:09:00] <_joe_> which should be enough for our normal traffic
[19:09:15] <_joe_> and to continue to do so when the old servers are removed
[19:09:17] <_joe_> https://phabricator.wikimedia.org/T236437#5891923
[19:10:00] <_joe_> netbox should have all the info you need
[19:10:34] yes, we want to look at the row balance
[19:10:41] <_joe_> one can even write a script that does the calculation for you, given netbox has an api, I'm told :P
[19:11:00] <_joe_> uh wait does it?
[19:11:51] <_joe_> yes, it does, and it's a simple rest api aiui, but better check with chaomodus
[19:12:31] <_joe_> anyways, thanks :) This is also part of the remediation work - we badly need more capacity and the new servers are definitely more beefy (and 5 years younger) than the old ones
[19:13:17] * chaomodus perks ears
[19:13:51] _joe_: ACK. we'll work on it
[19:13:56] 👍
[19:14:54] serviceops, Android-app-Bugs, Operations, Traffic, and 4 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (LGoto)
[19:24:34] serviceops, Android-app-Bugs, Operations, Traffic, and 4 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (JoeWalsh) @Joe could MediaWiki use a default value other than 0 if a client doesn't pass `maxage`...
[20:05:26] serviceops, MediaWiki-General, Operations, Core Platform Team Workboards (Clinic Duty Team): siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (Krinkle) The Api classes in MediaWiki also have a way to enable caching by default, via...
[20:05:58] serviceops, Analytics, Analytics-Kanban: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (Ottomata)
[20:10:16] serviceops, Analytics, Analytics-Kanban: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (Ottomata)
[20:10:24] serviceops, Release-Engineering-Team-TODO, Scap: Deploy scap 3.13.0-1 - https://phabricator.wikimedia.org/T245530 (ori) Cool beans. What's the timeline for getting this release out?
[20:11:57] serviceops, Analytics, Analytics-Kanban, Patch-For-Review: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (Ottomata) p:Triage→High
[20:12:09] serviceops, Android-app-Bugs, Operations, Traffic, and 4 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (Krinkle) @JoeWalsh There's a separate task about the MW default where I just asked a similar ques...
[20:18:07] serviceops, Analytics, Analytics-Kanban, Patch-For-Review: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (Ottomata)
[20:29:41] serviceops, MediaWiki-General, Operations, Core Platform Team Workboards (Clinic Duty Team): siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (Anomie) >>! In T244204#5894396, @Krinkle wrote: > Should we set `setCacheMaxAge()` by de...
[20:39:27] site.pp says "mw1221-mw1235 are in rack D5" but netbox says they are in D4
[20:40:24] serviceops, Operations, Performance-Team, Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Krinkle)
[21:38:19] serviceops, Operations, Performance-Team, Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Krinkle) a:aaron
[21:44:47] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (Krinkle)
[21:45:02] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (Krinkle) p:Triage→Medium
[22:19:58] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (dpifke) Following up on our discussion at today's team meeting, I looked at the linked PHP commit. `hrtime()` is j...
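For the siteinfo caching tasks above (T245033, T244204), a client-side request that sets `maxage` and `smaxage` might look like the sketch below. The endpoint URL and the 300-second value are assumptions picked for illustration; the tasks do not state which values clients should use.

```python
from urllib.parse import urlencode


def siteinfo_url(api_base, max_age_seconds):
    """Build a siteinfo query URL that asks for client and shared-cache caching."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "format": "json",
        "maxage": max_age_seconds,   # sets the max-age HTTP cache-control value
        "smaxage": max_age_seconds,  # sets the s-maxage value used by shared caches
    }
    return f"{api_base}?{urlencode(params)}"


# Example endpoint and cache lifetime; both are illustrative choices.
print(siteinfo_url("https://en.wikipedia.org/w/api.php", 300))
```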
[22:21:25] so the /etc/cergen/mcrouter.manifests.d/mediawiki-hosts.certs.yaml file does not exist anymore
[22:21:35] but our docs for mcrouter certs still say to make a backup just in case
[22:22:04] was mediawiki-hosts.certs.yaml just not needed anymore in /etc/cergen ?
[22:35:52] serviceops, Operations: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (RobH)
[22:41:35] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (dpifke) Thinking about this a bit further, there might be cases where we want access to `CLOCK_MONOTONIC_RAW` inste...
[22:59:50] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (Krinkle) On `CLOCK_MONOTONIC_RAW` vs `CLOCK_MONOTONIC` - I don't know enough there to know, I trust your judgement...
[23:15:21] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (dpifke) "Better" depends on what's being measured. `CLOCK_MONOTONIC` will always move forward, at a rate that's de...
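To illustrate the point of T245464 (monotonic clock instead of wall-clock time for perf measures), here is a small Python analogue. It is not the MediaWiki PHP code: `time.monotonic()` stands in for PHP's `hrtime()`, and the `stepping_clock` helper is a hypothetical stand-in for a wall clock that gets adjusted backwards mid-measurement.

```python
import time


def timed(fn, *args, clock=time.monotonic, **kwargs):
    """Run fn and return (result, elapsed seconds) measured with the given clock."""
    start = clock()
    result = fn(*args, **kwargs)
    return result, clock() - start


def stepping_clock(readings):
    """Hypothetical clock that returns the given readings one call at a time."""
    iterator = iter(readings)
    return lambda: next(iterator)


# With a monotonic clock the elapsed time can never be negative.
_, elapsed = timed(sum, [1, 2, 3])
print(elapsed >= 0)

# A wall clock stepped back by 0.5s between the two reads yields a negative duration.
_, skewed_elapsed = timed(sum, [1, 2, 3], clock=stepping_clock([100.0, 99.5]))
print(skewed_elapsed)
```

The choice between `CLOCK_MONOTONIC` and `CLOCK_MONOTONIC_RAW` raised in the task is not covered here; the sketch only shows why a clock that can step backwards is a poor fit for measuring durations.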