[07:45:54] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (Urbanecm)
[09:57:39] serviceops: Kafka mirror maker codfw -> eqiad in warning state for low throughput - https://phabricator.wikimedia.org/T268121 (elukey)
[09:58:10] serviceops: Kafka mirror maker codfw -> eqiad in warning state for low consumer throughput - https://phabricator.wikimedia.org/T268121 (elukey)
[10:08:12] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (akosiaris) p:Low→High More and more duplicates are being merged into this one and stats from te...
[10:14:13] akosiaris, hnowlan hi :)
[10:14:16] are you around?
[10:14:26] I'd need some help in investigating an issue
[10:14:37] I filed earlier on https://phabricator.wikimedia.org/T268121
[10:14:59] * akosiaris around
[10:15:01] then I noticed https://grafana.wikimedia.org/d/000000068/restbase?viewPanel=14&orgId=1&from=now-7d&to=now
[10:15:19] and the last 24h are very weird
[10:15:19] https://grafana.wikimedia.org/d/000000068/restbase?viewPanel=14&orgId=1&from=now-24h&to=now
[10:15:32] maybe these are two different problems
[10:18:54] on the 14th? that was a Saturday ...
[10:19:24] the last graph (on the 17th) looks pretty worrying but not related (I think)
[10:19:31] quite possibly a different problem
[10:19:41] looking
[10:20:29] looks like something happened to changeprop around the same time
[10:21:13] that last one (the 17th), lines up with a deployment.
https://sal.toolforge.org/log/7KY22HUBpU87LSFJQ04O
[10:21:35] the msg seems pretty innocuous (2.8.0 service-runner), but who knows
[10:22:35] yeah https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=30s&from=now-7d&to=now points pretty clearly to the issue being at changeprop too
[10:24:58] looks like we are no longer processing transclusions? cause the kafka main msg panel in the events row doesn't show any change
[10:25:34] looks like it, looking at changeprop logs atm
[10:48:36] sigh, nothing useful in the logs for the time that things dipped
[10:52:31] how are transclusions working atm? I was a bit lost trying to remember the diff between codfw and eqiad
[10:53:43] restbase codfw is being used to process those cause of the spare CPU capacity there
[10:53:54] called restbase-async in various puppet refs
[10:56:37] I've redeployed both changeprop clusters so it'll hopefully become unstuck but this isn't the first time this kind of thing has happened :/
[11:00:27] * akosiaris fingers crossed
[11:14:53] on first glance I don't see any correlation between changeprop and the dropoff on the 17th, but the dropoff on the 14th is definitely interlinked
[11:16:48] there's a good few changes in that restbase deploy last night though
[11:19:48] looks like more or less all restbase metrics stopped https://grafana.wikimedia.org/d/000000068/restbase?orgId=1
[11:24:32] akosiaris: one last qs about restbase-async - I checked on dns-disc and it is pooled in both eqiad and codfw, so I expected that only codfw clients would use it, but it seems also used from eqiad ones? (or I didn't understand anything, probably)
[11:41:25] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (BBlack) I haven't been able to repro this on a public endpoint from my own home connection, even using...
[11:42:54] (stepping afk)
[11:59:16] elukey: it's wrong. https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster1001&service=DNS+Discovery+operations+diffs was meant to catch things like that, but for some reason it's acked since 2020-7-27
[11:59:20] * akosiaris fixing
[12:16:54] fwiw I no longer think that dropoff in restbase revision requests is a bad thing (it's all 4xx anyway), I suspect it's a backlog clearing out - similar happened after the DC switchover previously
[12:17:39] the one that coincides with the drop in changeprop that is, the empty graphs for restbase is another issue
[12:40:22] never mind - seeing the transcludes and kafka messages in codfw pick back up now.
[12:57:24] hnowlan: so, all it took was a full changeprop redeploy?
[12:57:33] so... what broke then?
[12:58:00] also, interesting nobody noticed for so long, even more interesting nobody complained
[13:01:03] My extremely vague guess is that the changeprop subscriber to that topic died for some reason. Why I can't say, not seeing any hints in the logs so far
[13:01:20] Better monitoring and alerting of the workers is needed at least
[13:09:13] serviceops: Kafka mirror maker codfw -> eqiad in warning state for low consumer throughput - https://phabricator.wikimedia.org/T268121 (hnowlan) A restart of changeprop in codfw fixed this issue - we need to look into why the changeprop subscriber died or stopped processing.
[13:13:54] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (BBlack) I'm not exactly sure as to why the pattern above emerged, but now I don't think it's relevant a...
[13:26:34] hnowlan,akosiaris Thanks!
[13:29:20] hnowlan: if you have time / patience - say that nobody is around and I need to restart changeprop, is there documentation about it ?
Super ignorant about the k8s stuff :(
[13:29:55] (I guess something helm something :D)
[13:30:16] elukey: good point - there isn't. The changeprop docs aren't bad but that's conspicuously absent, I'll add it
[13:31:23] hnowlan: thanks! I also need to read more about how services are handled now :(
[13:38:34] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 5 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (CDanis) Adding to what Brandon says, we do have evidence that it happens on edge DCs other than just eq...
[13:48:28] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 5 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (BBlack) The proposed changes are live now. It may take a few hours to confirm that via NEL at our cu...
[14:18:13] just ran into this, I haven't tried it but thought it might be interesting to this audience https://github.com/canonical/operator
[14:28:44] serviceops: Kafka mirror maker codfw -> eqiad in warning state for low consumer throughput - https://phabricator.wikimedia.org/T268121 (Ottomata) Ok, so this wasn't a MirrorMaker issue then? changeprop was actually producing fewer messages?
[14:33:37] serviceops: Kafka mirror maker codfw -> eqiad in warning state for low consumer throughput - https://phabricator.wikimedia.org/T268121 (Pchelolo) >>! In T268121#6630533, @Ottomata wrote: > Ok, so this wasn't a MirrorMaker issue then? changeprop was actually producing fewer messages? Seems it has gotten stu...
[15:29:34] serviceops, Operations, CommRel-Specialists-Support (Oct-Dec-2020), User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (RLazarus) s2, s6, and s7 have also finished. The s3 worker has completed wikis up through ruwikibooks (in alphabetical order).
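On the "better monitoring and alerting of the workers" point from 13:01:20 and the stuck-subscriber diagnosis on T268121: one common way to catch a consumer that has silently stopped is to compare its committed offset with the topic's log-end offset over a few successive samples. What follows is a minimal sketch of that idea, not how changeprop is actually monitored. The names `OffsetSample` and `is_stalled` and the `min_samples` default are hypothetical, and how the offsets get collected (Kafka admin tooling, a metrics exporter) is left out entirely.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class OffsetSample:
    """One observation of a consumer group on a single topic partition."""
    committed: int  # last offset the consumer group committed
    log_end: int    # newest offset available on the broker


def is_stalled(samples: Sequence[OffsetSample], min_samples: int = 3) -> bool:
    """Return True when the consumer made no progress while messages kept arriving.

    A consumer is treated as stalled only if, across the last `min_samples`
    observations, its committed offset never moved, the log-end offset grew,
    and there is still a backlog to read. An idle topic (nothing produced)
    is therefore not reported as a stall.
    """
    if len(samples) < min_samples:
        return False  # not enough history to decide
    recent = samples[-min_samples:]
    committed_flat = len({s.committed for s in recent}) == 1
    producers_active = recent[-1].log_end > recent[0].log_end
    has_backlog = recent[-1].log_end > recent[-1].committed
    return committed_flat and producers_active and has_backlog


if __name__ == "__main__":
    stuck = [OffsetSample(100, 100), OffsetSample(100, 150), OffsetSample(100, 220)]
    healthy = [OffsetSample(100, 100), OffsetSample(140, 150), OffsetSample(220, 220)]
    print(is_stalled(stuck), is_stalled(healthy))
```

Requiring several flat samples rather than a single low reading is a deliberate trade: it delays the alert a little but avoids paging on a quiet topic or a brief rebalance.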
[15:40:39] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Build calico 3.16 - https://phabricator.wikimedia.org/T266893 (JMeybohm) After discussing with @akosiaris we decided to keep building the calico-images package but only use it as kind of artifact and a way to get the images out of the pbu...
[15:51:38] serviceops, Operations, CommRel-Specialists-Support (Oct-Dec-2020), User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (Trizek-WMF) So far so good! :)
[16:33:23] serviceops, Discovery-Search, Maps, Product-Infrastructure-Team-Backlog: [OSM] Backport imposm3 to the debian channel - https://phabricator.wikimedia.org/T238753 (MSantos) @hnowlan this can be a good resource for this task https://github.com/omniscale/imposm3#binary
[17:20:44] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (JMeybohm) @akosiaris and me discussed this further and we initially decided to give the kubernetes addon-manager a try for rolling out calico com...
[17:48:46] serviceops, Operations, Patch-For-Review, User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (jijiki) Open→Resolved
[17:48:51] serviceops, Operations, Performance-Team, Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (jijiki)
[17:53:18] serviceops, Growth-Team, Operations, Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (jijiki) @elukey beat me to writing the celebratory post :D Since we are happy with the current settings, I think we can continue by insta...
[17:55:13] serviceops, Patch-For-Review: create mwdebug1003 - ganeti VM with buster and appserver role - https://phabricator.wikimedia.org/T267248 (Dzahn)
[18:24:07] serviceops, Operations, Performance-Team, Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (jijiki) After merging 78588929801f and running onhost memcached on an api (m...
[19:11:16] serviceops, Operations: upgrade mwmaint1002 to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) I could take this one (later). Have done mwmaint upgrade in the past. I would start by creating mwmaint1003 and flipping over.
[19:16:22] serviceops, Operations: upgrade mwmaint1002 to buster - https://phabricator.wikimedia.org/T267607 (RLazarus) You're probably already thinking about this, but just to make sure it's said out loud: mwmaint1002 is still running updateCollation for the ICU upgrade, and will be chewing through enwiki for some...
[19:23:37] serviceops, Operations: upgrade mwmaint1002 to buster - https://phabricator.wikimedia.org/T267607 (MoritzMuehlenhoff) We should have two mwmaint servers per DC anyway (with some mechanism to flip the active one), some failover capability is needed outside of OS updates as well (reboots e.g. are a total p...
[19:27:35] serviceops, Operations: upgrade mwmaint1002 to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) >>! In T267607#6631426, @RLazarus wrote: > You're probably already thinking about this, but just to make sure it's said out loud: mwmaint1002 is still running updateCollation for the ICU upgrade, and...
[19:28:26] serviceops, Operations: upgrade mwmaint1002 to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) >>!
In T267607#6631452, @MoritzMuehlenhoff wrote: > We should have two mwmaint servers per DC anyway (with some mechanism to flip the active one), some failover capability is needed outside of OS upd...
[19:48:47] serviceops, Operations, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services): Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (jijiki)
[19:50:00] serviceops, Performance-Team, WikimediaDebug: Fetch mwdebug backend server list from noc.wikimedia.org - https://phabricator.wikimedia.org/T268167 (Krinkle)
[19:50:23] serviceops, Operations: upgrade mwmaint1002 to buster - https://phabricator.wikimedia.org/T267607 (jijiki)
[19:55:09] serviceops, Operations: upgrade mwmaint1002 to buster - https://phabricator.wikimedia.org/T267607 (jijiki) I unintentionally created some confusion I think, and I am very sorry. I have updated the description to reflect that our target for this quarter is to have done as much preliminary work as we can...
[20:53:16] serviceops, DC-Ops, Operations, ops-eqiad: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (Dzahn)
[20:53:28] serviceops, DC-Ops, Operations, ops-eqiad: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (Dzahn)
[21:18:58] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (BBlack) No reports of the PDF truncations in NEL for ~8 hours now, which is a significant break from re...
[21:25:27] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (RhinosF1) Works for me
[21:28:22] serviceops, Patch-For-Review: create mwdebug1003 - ganeti VM with buster and appserver role - https://phabricator.wikimedia.org/T267248 (jijiki)
[21:43:34] serviceops, Patch-For-Review: create mwdebug1003 - ganeti VM with buster and appserver role - https://phabricator.wikimedia.org/T267248 (Dzahn)
[21:45:48] serviceops, Patch-For-Review: create mwdebug1003 - ganeti VM with buster and appserver role - https://phabricator.wikimedia.org/T267248 (Dzahn)
[21:49:16] serviceops, Patch-For-Review: create mwdebug1003 - ganeti VM with buster and appserver role - https://phabricator.wikimedia.org/T267248 (Dzahn) ` [cumin1001:~] $ sudo -i confctl select name=mwdebug1003.eqiad.wmnet get {"mwdebug1003.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,clus...
[22:22:49] serviceops, MW-on-K8s, Operations, TechCom-RFC, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Krinkle) Put on Last Call until 2 December.
[22:30:25] serviceops, MW-on-K8s, Operations, TechCom-RFC, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling)
[22:30:55] serviceops, MW-on-K8s, Operations, TechCom-RFC, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) Task description edit: added "backwards compatibility" section explaining how this will work for default installations.
[22:35:38] serviceops, Performance-Team, WikimediaDebug, Patch-For-Review: create mwdebug1003 - ganeti VM with buster and appserver role - https://phabricator.wikimedia.org/T267248 (Krinkle)
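The confctl output quoted at 21:49:16 is a single JSON object per host: one key holding the host name with its `weight` and `pooled` state, plus a `tags` string of comma-separated `key=value` pairs. A small sketch of reading such a line follows, assuming that shape holds in general. The function name `parse_confctl_line` is hypothetical, and because the tags are truncated in the log, the sample keeps only `dc=eqiad` from the original and adds a made-up `example=value` pair as a placeholder.

```python
import json
from typing import Dict, Tuple


def parse_confctl_line(line: str) -> Tuple[str, Dict[str, object], Dict[str, str]]:
    """Split one confctl-style JSON line into (host name, state, tags).

    Assumes exactly one host key next to an optional "tags" string of
    comma-separated key=value pairs; anything else raises ValueError.
    """
    record = json.loads(line)
    tags_raw = record.pop("tags", "")
    if len(record) != 1:
        raise ValueError("expected exactly one host entry, got %d" % len(record))
    (name, state), = record.items()
    tags = dict(pair.split("=", 1) for pair in tags_raw.split(",") if pair)
    return name, state, tags


if __name__ == "__main__":
    # Hypothetical sample: "example=value" stands in for the tags elided in the log.
    sample = (
        '{"mwdebug1003.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, '
        '"tags": "dc=eqiad,example=value"}'
    )
    name, state, tags = parse_confctl_line(sample)
    print(name, state["pooled"], tags["dc"])
```

A check like this could be used to confirm a new host is still `inactive` before it is pooled, though the real workflow around confctl may differ.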