[00:21:43] 10serviceops, 10Operations, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10aaron) Given the libketama-style consistent hashing in twemproxy and that, AFAIK, CentralAuth sessions can regenerate (notwithstanding one-off CSRF token failu...
[04:00:58] 10serviceops, 10Graphoid, 10Operations, 10Chinese-Sites, and 3 others: Undeploy graphoid for phase 2 wiki's - https://phabricator.wikimedia.org/T258463 (10Shizhao)
[04:31:39] 10serviceops, 10Operations, 10Performance-Team, 10Patch-For-Review, 10Sustainability (Incident Followup): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10aaron) >>! In T244340#6211682, @elukey wrote: > Side note: if not...
[07:46:39] <_joe_> jayme, akosiaris I think it's time we move production-images to use buster btw
[07:55:56] sounds like the right thing to do :)
[07:58:31] +1
[07:59:04] although fwiw, we probably want to make a distinction of {buster,stretch}-
[07:59:20] should make the transition a bit less bumpy for services
[08:08:29] <_joe_> akosiaris: let's try see what sticks LOL
[08:08:32] <_joe_> yolo
[08:09:10] <_joe_> do any of you want to open a task? :)
[08:11:26] we have a few already due to the stretch-backports stuff, I can create an umbrella one
[08:15:34] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10jcrespo) Thanks for your work on this. One clarification, for those of us that are not that familiar with LV...
[08:38:17] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10akosiaris) >>! In T258614#6328624, @jcrespo wrote: > Thanks for your work on this. One clarification, for th...
[08:46:01] <_joe_> akosiaris: but more to my point: we need to switch seed_image
[08:47:29] yeah, I am not sure seed_image is the best approach in all cases. That was what I was commenting about
[08:47:43] e.g. it seems that if we bump the seed_image for the golang image, kask will break
[08:48:05] as the packages that are in buster have breaking api changes from the packages that are in stretch
[08:51:49] 10serviceops, 10Operations: Recurrent TX bw saturation for mediawiki memcached shards - https://phabricator.wikimedia.org/T258679 (10elukey) p:05Triage→03High
[08:58:12] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10jcrespo) I see, thanks.
[10:03:23] 10serviceops, 10Operations: Recurrent TX bw saturation for mediawiki memcached shards - https://phabricator.wikimedia.org/T258679 (10elukey) The mc1020 spikes are interesting: https://grafana.wikimedia.org/d/000000317/memcache-slabs?panelId=60&fullscreen&orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prome...
[10:03:58] I tried to find a key responsible for the problem but I wasn't able --^
[10:04:49] it seems to be something to check before the weekend, it is causing some TKO events
[10:09:45] <_joe_> effie: ^^
[10:09:46] FTI debmonitor deployed, at 14:57 UTC the cron should run and delete ~200+ old images over the 601 existing
[10:09:50] *FYI
[10:09:58] <_joe_> volans: ack, thanks
[10:10:07] I was just adding annotations to the memcached graph
[10:10:34] to help us understand whether this is related to a deployment or not
[10:10:40] <_joe_> yeah I am saying that should be checked :)
[10:11:28] elukey: I am trying to finish up the push-notification service with alex, I am not sure I will have time to look deeper into it
[10:12:00] :(
[10:12:15] <_joe_> ok, then I'll find the time, someone needs to
[10:12:20] unless this turns into a userfacing issue
[10:12:35] let me finish up with the annotations
[10:13:07] <_joe_> I don't think this is related at all to deployments, but who knows
[10:14:37] I just added it as a nice to have, it probably is not related yes
[10:18:26] <_joe_> mc1020 is the most hit machine
[10:18:40] atm it is yes
[10:19:10] <_joe_> has regular saturation intervals every 5/7 minutes
[10:20:19] <_joe_> I don't see an alternative to running memkeys when I see the saturation going up
[10:20:42] one thing that I usually do is capture pcaps and then check with wireshark
[10:20:48] but it is tedious
[10:20:59] <_joe_> that doesn't tell you the bytes transferred for each key
[10:21:09] why not?
[10:21:14] <_joe_> sometimes there are large keys that are requested less often
[10:21:31] <_joe_> I mean it means you have to extract the info from the packets, and run the sum yourself
[10:21:39] <_joe_> which is what memkeys does btw :P
[10:22:17] yes but the spikes last a few seconds, it is difficult to catch the anomaly and capture what's wrong
[10:23:46] (also memkeys uses a lot more CPU than tcpdump, so when the spikes are not so regular tcpdump is a little better in my opinion to leave running in the background)
[10:23:49] <_joe_> WANCache:nlwiki:preprocess-hash:31096c1487dbb2f9c0450294bd64eb38:1|#|v is by far the worst offender, used as much as 30 mbit alone
[10:24:26] <_joe_> followed by an old friend WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A979157361
[10:24:38] <_joe_> I am almost sure I recognize that id
[10:25:02] <_joe_> WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A979157361 4214 81480 79.51 51827.43
[10:25:13] <_joe_> the last number is kilobits per second
[10:25:18] the latter is responsible for a constant background of usage from days ago for slab 136
[10:25:52] <_joe_> the nlwiki stuff is 202430 bytes
[10:25:59] on top of it, there are 3 slabs that show increased bw matching the saturation spikes
[10:26:25] <_joe_> yeah there are another couple of keys there that have high bw usage
[10:26:51] so WANCache:nlwiki:preprocess-hash:31096c1487dbb2f9c0450294bd64eb38 is in slab-155, one of the 3 that I mentioned (just checked)
[10:27:03] <_joe_> yeah
[10:27:17] <_joe_> the other two keys are related, also from nlwiki:preprocess-hash
[10:27:27] <_joe_> this is the stuff that a local memcached can help with btw
[10:27:33] yep
[10:28:35] <_joe_> elukey: I remember you wrote a procedure down to get from the key to the object in blobstore
[10:28:50] <_joe_> but I can't find it on wikitech
[10:30:38] <_joe_> do you remember how we compress data when sending it to memcached btw?
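[Editor's note: a minimal sketch of the "extract the info from the packets and run the sum yourself" approach discussed above, assuming a plain-text memcached protocol capture taken beforehand with tcpdump on port 11211. The capture file name `mc.pcap` and the top-20 cut-off are illustrative, not anything actually run in production.]

```python
#!/usr/bin/env python3
"""Rough per-key bandwidth accounting from a memcached pcap.

Sums the payload sizes announced in memcached text-protocol
"VALUE <key> <flags> <bytes>" response headers, grouped by key,
which is roughly the aggregation memkeys does live.
"""
import re
from collections import Counter

from scapy.all import TCP, Raw, rdpcap  # requires scapy

# Header line of each GET response in the memcached text protocol.
VALUE_RE = re.compile(rb"VALUE (\S+) \d+ (\d+)")

bytes_per_key = Counter()

# mc.pcap is a hypothetical capture, e.g. from:
#   tcpdump -i eth0 -w mc.pcap 'src port 11211'
for pkt in rdpcap("mc.pcap"):
    if not (pkt.haslayer(TCP) and pkt.haslayer(Raw)):
        continue
    if pkt[TCP].sport != 11211:  # only look at responses from memcached
        continue
    for key, size in VALUE_RE.findall(bytes(pkt[Raw].load)):
        bytes_per_key[key.decode()] += int(size)

for key, total in bytes_per_key.most_common(20):
    print(f"{total:>12}  {key}")
```

[Counting the declared value size rather than per-packet payloads keeps the totals right even when a response spans several packets; the trade-off versus memkeys is the one mentioned above: tcpdump is cheap enough to leave running, the analysis happens offline.]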
[10:34:20] https://wikitech.wikimedia.org/w/index.php?title=Memcached&diff=1863619&oldid=1819400 is this what you have in mind?
[10:34:29] for getting to the blobstore I mean
[10:35:18] back sorry, yes addshore wrote it!
[10:36:02] I don't recall what gets compressed no :(
[10:36:37] that's not in the current version of the article, maybe it should be added back in
[10:37:29] Timo removed it, not sure why
[10:38:07] maybe it is now under a different page, but it should be somewhere
[10:38:11] it is very helpful
[10:39:46] <_joe_> GROAN
[10:39:54] <_joe_> thanks apergos
[10:40:52] <_joe_> and ofc I don't have access to labstore
[10:41:01] <_joe_> err dbstore
[10:41:50] I can run it if you want
[10:42:58] <_joe_> thanks :)
[10:46:17] from the timings reported by Adam the whole thing will take ~20 mins
[10:47:02] <_joe_> yes
[10:47:44] https://www.mediawiki.org/wiki/Manual:Caching#Revision_text it's here, but it would be nice to have a link... I'll add it
[10:49:04] thanks!
[10:51:53] <_joe_> well there are no details on how to get the revision from there
[10:52:14] <_joe_> oh it's down the pae
[10:52:18] <_joe_> *g
[10:52:55] * addshore reads up
[10:53:31] <_joe_> addshore: no need :)
[10:53:38] * addshore stops reading
[10:53:56] <_joe_> sorry for the unintended ping :P
[10:54:53] you were getting a thanks for some documentation, that's all :-)
[10:56:36] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki)
[10:58:19] ah! Sorry indeed, it was just some wikilove for you addshore
[10:59:38] <_joe_> hey elukey he should be spared from wikilove
[11:02:59] a little bit is ok :D
[11:12:39] _joe_ https://en.wikipedia.org/w/index.php?oldid=967170285
[11:13:22] <_joe_> yeah our old friend :P
[11:13:43] <_joe_> this specifically would benefit a lot from being locally cached
[11:27:03] 10serviceops, 10Operations: Recurrent TX bw saturation for mediawiki memcached shards - https://phabricator.wikimedia.org/T258679 (10elukey) >>! In T258679#6328948, @elukey wrote: > There is a baseline for slab 136 of constant GET traffic, that should be related to `ITEM WANCache:v:global:SqlBlobStore-blob:en...
[11:27:33] added to the task
[11:32:54] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 (10akosiaris)
[11:34:40] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 (10akosiaris) p:05High→03Low Traffic has been switched fully for the last 18hours in co...
[12:38:37] elukey: the example is now under the https://www.mediawiki.org/wiki/Manual:Caching#Revision_text section
[12:38:40] i found it :)
[12:39:33] addshore: I'd already added a link to that in the wikitech page :-)
[13:52:39] apergos: it's one of the three links atop that wikitech page
[13:52:55] "mw:Manual:Caching, documentation of what kind of data is stored, what the keys relate to etc."
[13:53:19] Krinkle: I know, but someone looking for the specific piece of info wouldn't easily find it, as demonstrated today
[13:53:25] so I added a pointer explicitly to that
[13:53:37] it's something we need now and then
[13:54:06] Right i guess RevStore is more common than everything else?
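[Editor's note: a hedged sketch of the key-to-revision lookup described at mw:Manual:Caching#Revision_text, not the exact documented procedure. It assumes the modern MediaWiki multi-content-revision schema: the URL-encoded tail of the cache key (tt%3A979157361 -> tt:979157361) is the content_address in the content table, joined through slots to revision. The helper name and the suggestion of where to run the SQL are illustrative; content_address is not indexed, which is why the run above was expected to take ~20 minutes on dbstore.]

```python
#!/usr/bin/env python3
"""Turn a WANCache SqlBlobStore cache key into the SQL that finds its revision."""
from urllib.parse import unquote


def blobstore_key_to_query(cache_key: str) -> tuple[str, str]:
    """Return (wiki dbname, SQL to run against that wiki's replica)."""
    # e.g. WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A979157361
    parts = cache_key.split(":")
    wiki, address = parts[-2], unquote(parts[-1])  # ('enwiki', 'tt:979157361')
    sql = f"""
        SELECT r.rev_id, r.rev_page, r.rev_timestamp
        FROM content c
        JOIN slots s    ON s.slot_content_id = c.content_id
        JOIN revision r ON r.rev_id = s.slot_revision_id
        WHERE c.content_address = '{address}';
    """
    return wiki, sql


if __name__ == "__main__":
    wiki, sql = blobstore_key_to_query(
        "WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A979157361"
    )
    print(f"-- run against a {wiki} replica (no index on content_address: slow scan)")
    print(sql)
```

[The resulting rev_id can then be opened as https://en.wikipedia.org/w/index.php?oldid=<rev_id>, presumably the revision _joe_ linked above.]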
[13:54:23] It seems odd to document all of those on wikitech
[13:55:14] we had that one there because we use it in troubleshooting stuff that got slow
[13:56:59] yeah I think the rest doesn't really come up in response to an alert or whatever
[13:58:06] okay. My main worry is that we don't invite lots of MW docs to end up there just because this one is there (and suggest/imply that there isn't a longer version elsewhere, with this one being stale and unreviewed)
[13:58:21] But as it is now it seems fine
[13:58:50] yeah I'm not excited about the same content having to be maintained in two places, for sure
[13:59:11] because we're so good at keeping one set up to date already...
[14:07:17] Krinkle: from the SRE point of view, it is difficult to debug things like https://phabricator.wikimedia.org/T258679 without a quick wikitech page with tips/tricks etc. I am happy to have it anywhere as long as it is a known place :)
[14:08:37] I'm going to take this opportunity to again mention https://phabricator.wikimedia.org/T235773 which I'd really like perf and/or CPT to work on ;)
[14:28:47] Can anyone tell me what the external IP addresses of our Citoid/Zotero service are? YouTube needs to whitelist it, but they can only whitelist by IP, not user agent. I know that Citoid runs on codfw and eqiad if that helps.
[14:29:30] kaldari: can they accept entire IP blocks?
[14:30:38] I'm not sure.
[14:30:41] hopefully
[14:33:56] my inclination is to give them the whole list at https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#Public_IPs
[14:42:26] <_joe_> cdanis: url-downloader has a fixed IP in both dcs though
[14:42:48] but, it's one thing for that to be technically true, and it's another thing to have external organizations relying on that
[15:17:24] <_joe_> fair enough
[15:17:45] <_joe_> akosiaris: how did we manage that in the past? I think we always provided netblocks indeed
[15:18:12] _joe_: yes, entire netblocks
[15:18:19] sensible
[15:27:49] we now have 340 images in debmonitor (although the counter shown in the logs has an error, I'll fix it)
[15:30:23] <_joe_> volans: good
[15:30:29] <_joe_> still way more than we actually use
[15:31:25] if we cut at 1 month we delete another 137
[15:34:33] <_joe_> we can revisit easily, right? It's just a config parameter
[15:34:44] <_joe_> it doesn't need a code deployment after all
[15:44:52] ahahah
[15:45:05] no, it's hardcoded because it's a temporary WMF-specific hack
[15:45:11] that debmonitor should not have in the first place ;)
[15:53:52] _joe_: do you want to change it? I'm sending the patch to fix the counter
[15:54:07] <_joe_> volans: nope, not now
[15:54:34] ok
[16:34:37] cdanis: So I should give them all 8 IP blocks listed at https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#Public_IPs? Or can it be narrowed down any further?
[16:35:18] kaldari: I would start there -- we can make it more specific, but, that limits our flexibility in the future
[16:52:43] 10serviceops, 10Operations, 10Security-Team, 10Traffic, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10sbassett) 05Stalled→03Resolved a:03CDanis Thanks, @cdanis. Looks to be fixed. Resolving and making public.
[16:52:50] 10serviceops, 10Operations, 10Security-Team, 10Traffic, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10sbassett)
[17:11:39] 10serviceops, 10Operations, 10Security-Team, 10Traffic, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10ArielGlenn) 05Resolved→03Open Almost resolved, heh.
[17:16:43] 10serviceops, 10Operations, 10Security-Team, 10Traffic, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10CDanis) 05Open→03Resolved Re-validation forced for ATS-BE, and also a Varnish cache ban has been put in place, so we should no longer be serving any...
[18:56:43] cdanis: Just heard back from YouTube. They say they "would like to whitelist a narrower IP list", although they realize this may not work for the long-term. Do you know which of those IP blocks are used by services on codfw and eqiad?
[18:59:05] kaldari: 208.80.152.0/22 and 2620:0:860::/46
[18:59:19] Thanks! I'll let them know.
[18:59:58] yeah, no guarantees that's 100% correct, but I'm pretty confident
[19:58:13] 10serviceops, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10RobH)
[19:58:30] 10serviceops, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10RobH)
[20:01:45] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) Merging the change above was a noop on scandium. I did not manually touch it so far, so the git repo at /srv/testreduce is unchang...
[22:57:07] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10wkandek) p:05Triage→03Medium
[22:59:37] just putting this question out here ... anyone know when we might start upgrading to php 7.3?
[23:02:22] cscott, can we resolve the 1.35 lts parsoid phab tasks?
[23:02:28] oops sorry, wrong channel.