[03:09:38] 10serviceops, 10Operations, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10MMiller_WMF) @kostajh -- maybe we should do that, but I would like to hear from @nettrom_WMF about what that would mean for o...
[05:33:06] 10serviceops, 10Operations, 10Platform Engineering, 10Wikidata, 10Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (10Joe) To test the hypothesis that this is related to firejail use, we're sending 1 req/s to on...
[08:31:36] 10serviceops, 10LDAP-Access-Requests, 10Operations, 10observability, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10fgiunchedi) @AMooney @jcrespo any updates on this? thank you!
[08:34:37] 10serviceops, 10Operations, 10Platform Engineering: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10fgiunchedi) p:05Triage→03Medium
[12:24:31] 10serviceops, 10GrowthExperiments-NewcomerTasks, 10Operations, 10Product-Infrastructure-Team-Backlog: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) >>! In T258978#6340838, @Joe wrote: > This service should /not/ do any caching, which should instead...
[12:32:25] 10serviceops, 10GrowthExperiments-NewcomerTasks, 10Operations, 10Product-Infrastructure-Team-Backlog: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh)
[13:23:45] 10serviceops: Package an up to date version of fluent-bit / td-agent-bit for buster - https://phabricator.wikimedia.org/T260536 (10JMeybohm)
[13:57:35] 10serviceops: Package an up to date version of fluent-bit / td-agent-bit for buster - https://phabricator.wikimedia.org/T260536 (10JMeybohm) 05Open→03Resolved imported td-agent-bit_1.5.3-0 to buster-wikimedia. Build steps can be found at: https://wikitech.wikimedia.org/wiki/Td-agent-bit
[13:58:48] 10serviceops: Package an up to date version of fluent-bit / td-agent-bit for buster - https://phabricator.wikimedia.org/T260536 (10Pchelolo) Thank you so much! I'll update the image.
[14:20:57] 10serviceops, 10LDAP-Access-Requests, 10Operations, 10observability, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10AMooney) 05Stalled→03Invalid @fgiunchedi, this ticket will be closed
[14:54:42] Hi. I'd like to make regular, time-based Scap releases. What can I do to make it as easy as possible for serviceops?
[14:55:09] aiming for at least one release per quarter, preferably one per month
[15:13:55] volans, any chance you would have a workaround for https://phabricator.wikimedia.org/T254786 ?
[15:14:21] liw: I replied to you on _security on friday
[15:14:53] volans, oh, I missed that, sorry
[15:15:03] being off I was on mobile and didn't have it logged in on phab
[15:15:07] I can reply there for posterity though :D
[15:15:45] volans, I see a link to T222480
[15:16:36] volans, which I've read, but, alas, I don't understand enough to make progress
[15:17:47] clustershell, the library used by cumin, can't cope with different zero padding, so the names will be messed up as long as we have hosts with different padding
[15:18:29] volans, right, but some of the hosts I got errors for don't seem to have a corresponding host without a leading zero
[15:18:58] example?
[15:19:32] ping: deployment-perfapt1.deployment-prep.eqiad.wmflabs: Name or service not known
[15:19:52] deployment-docker-mobileapps01.deployment-prep.eqiad.wmflabs
[15:20:03] sorry, ping: deployment-docker-mobileapps1.deployment-prep.eqiad.wmflabs: Name or service not known
[15:20:27] deployment-logstash2.deployment-prep.eqiad.wmflabs exists, though
[15:22:06] I see a deployment-logstash2 and deployment-logstash03
[15:23:53] I see the same
[15:24:05] I don't see anything related to deployment-perfapt1 on horizon
[15:24:11] does that instance exist?
[15:25:16] I don't know; cumin told me it can't resolve that hostname
[15:25:32] I have no idea where it gets it from
[15:27:14] openstack API
[15:28:33] liw: oh wait, are you running cumin from within the deployment-prep local cumin master or from the WMCS cumin master?
[15:30:10] volans, I don't know, but I run it from liw@deployment-cumin02
[15:30:21] and which query do you run?
[15:30:33] sudo cumin 'O{project:deployment-prep}' hostname
[15:31:16] '*' is the same btw
[15:32:17] liw: if I run that I get:
[15:32:29] deployment-imagescaler01.deployment-prep.eqiad.wmflabs: permission denied (publickey)
[15:32:41] that means that either puppet is broken there or it is not properly configured
[15:32:50] deployment-logstash02.deployment-prep.eqiad.wmflabs: could not resolve
[15:33:00] and this is the known issue of zero padding
[15:33:13] and then
[15:33:13] deployment-docker-mobileapps01.deployment-prep.eqiad.wmflabs
[15:33:16] could not resolve
[15:33:18] and this is weird
[15:33:48] as I can totally see deployment-docker-mobileapps01 on horizon, and dig fails to resolve it, so that's something to ask WMCS about I guess
[15:36:08] ok, I will ask them
[15:36:26] that said, what's your blocker?
[15:37:16] when I just run that cumin command, it tells me 3/79 nodes failed, but I can only see two actual errors (and those are hard enough to spot so I may have missed things)
[15:37:50] because you run 'hostname', which returns a different output for each host
[15:38:03] volans, my blocker is that as part of making a Scap release, I need to run a few cumin commands to test it before asking serviceops to build the .deb
[15:38:09] I usually run 'true' or 'id'
[15:38:45] ah, good point. I hadn't realised that
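A minimal sketch of the zero-padding issue volans describes above (T222480): ClusterShell, the library cumin uses to fold hostnames into node sets, can mangle names that differ only in leading zeros, such as deployment-logstash2 vs deployment-logstash03. The round-trip check below is an illustration only, not cumin code, and assumes the ClusterShell Python package is installed.

```python
# Illustrative only: fold the two hostnames from the conversation above into a
# ClusterShell NodeSet and expand it back, to see whether the names survive.
from ClusterShell.NodeSet import NodeSet

names = {
    "deployment-logstash2.deployment-prep.eqiad.wmflabs",
    "deployment-logstash03.deployment-prep.eqiad.wmflabs",
}

ns = NodeSet(",".join(sorted(names)))  # fold the names into a node set
expanded = set(ns)                     # expand the folded set back into names

if expanded == names:
    print("round-trip OK")
else:
    # On affected ClusterShell versions mixed zero padding comes back altered,
    # and cumin then tries to resolve hosts that do not exist ("could not resolve").
    print("padding lost:", sorted(expanded ^ names))
```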
[15:38:59] I'm not familiar with the current procedure, but I'm not sure a few broken instances in deployment-prep can be called a blocker tbh
[15:39:58] volans, the blocker is that cumin refuses to run my commands so I can't actually test the release candidate
[15:40:14] no, it will just fail on those hosts; on the others it will run happily
[15:41:14] you can also exclude those 3 if you want
[15:41:15] 'O{project:deployment-prep} and not D{deployment-logstash02.deployment-prep.eqiad.wmflabs,deployment-docker-mobileapps01.deployment-prep.eqiad.wmflabs,deployment-imagescaler01.deployment-prep.eqiad.wmflabs}'
[15:41:23] up to you
[15:42:02] volans, it tells me it aborts and doesn't run them... maybe I'm misunderstanding the tool
[15:42:32] I tried to exclude the problematic hosts but couldn't work out a working expression from the cumin docs - thanks, I'll try that
[15:44:44] ok, the exclusion works, I shall try the actual release testing later.
[15:44:52] volans, thank you very much!
[15:46:46] np
[15:47:38] by default it will try to run the command on all hosts, and if some fail it will just fail there
[15:49:48] _joe_, effie, elukey: re T260224 I'm inclined to say let's just use the spare host, because it's the simplest plan and most likely to be in good shape by Sep 1 -- thoughts?
[15:51:23] or, correction, I'm inclined to ask dcops if they can get it swapped in time for that plan
[15:52:00] can we avoid reusing the name please? :)
[15:52:18] <_joe_> why?
[15:52:34] <_joe_> it's very convenient for netbox I heard
[15:52:39] <_joe_> :D :D
[15:52:51] yeah let's just do so
[15:54:00] volans: we'll just name it "cumin2001", I don't think that's taken
[15:54:38] rzl: db1083 is a fancier name though
[15:55:25] mc2037 is next, right? I'm just taking the biggest number and adding one; if we already had a mc2037 and retired it, I don't know where to look
[16:08:22] 10serviceops, 10Operations, 10Platform Engineering, 10Wikidata, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cdanis on cumin1001.eqiad.wmnet for hosts: ` mw1359.eqiad.wmne...
[16:10:16] rzl: we can have a look at netbox
[16:10:20] for past servers iirc
[16:10:23] and phab
[16:10:55] I think that would cover all angles
[16:11:29] or we can create a bot that does that for us and just replies with the next available server name
[16:11:35] :D
[16:19:42] we have that bot effie
[16:19:45] its name is volans
[16:23:03] I replied the TrueWayOfChecking™ to rzl in prvt
[16:23:05] *private
[16:28:36] lol
[16:28:48] * effie bbl errand
[16:36:31] <_joe_> hi everyone, I have good news and bad news. The good news is that the cluster reboot script works. The bad news is that the cluster reboot script works.
[16:38:02] what... what did you do, joe
[16:38:35] 😰
[16:39:08] <_joe_> it's bad news because now we need to use it :D
[16:39:20] <_joe_> well once I've merged the next two patches
[16:43:56] 10serviceops, 10Operations, 10Platform Engineering, 10Wikidata, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1359.eqiad.wmnet'] ` and were **ALL** successful.
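A toy sketch of the "biggest number plus one" naming rule described above ([15:55:25]); the helper name and the example list are made up for illustration. As noted in the conversation, a host that once existed and was later retired won't show up in a list of live hosts, so Netbox and Phabricator remain the authoritative places to check.

```python
# Hypothetical helper, not an existing tool: pick the next name in a series by
# taking the highest existing number and adding one, keeping the zero padding.
import re

def next_hostname(existing, prefix="mc", width=4):
    """Return prefix + (highest existing number + 1), zero-padded to width."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    numbers = [int(m.group(1)) for name in existing if (m := pattern.match(name))]
    return f"{prefix}{max(numbers) + 1:0{width}d}"

print(next_hostname(["mc2035", "mc2036"]))  # -> mc2037
```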
[17:38:30] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) So to summarize: The vd_client/vd_server that are on testreduce1001 should NOT be on it and instead the rt_cl...
[17:43:30] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) >>! In T257906#6390190, @Dzahn wrote: > So to summarize: The vd_client/vd_server that are on testreduce1001...
[19:30:39] in grafana when i select codfw appservers to get "memory per host" i get only wtp1025, which is neither codfw nor an appserver
[19:31:12] switching back to eqiad i see mw* servers as expected though
[19:32:31] eh... but i also can't really reproduce it now and eventually got what i wanted. i have a screenshot from earlier though.. hmmm
[20:03:54] i am doing some codfw appserver reboots but with the "single" cookbook and picking ones with high uptime
[20:28:15] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) @ssastry @cscott rt_client and rt_server have been added to `testreduce1001.eqiad.wmnet`. ` [testreduce100...
[20:32:04] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) Thanks! > It fails after a little while though because it does not have access to the database yet. I sup...
[20:32:35] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) I was about to create a subtask for that. I got it.
[20:37:28] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) I just saw the DB appears to be running on localhost, not on a cluster, fwiw.
[20:43:49] Pchelolo: is there an easy way to detect if a surge of requests
[20:43:56] is related to a templated being updated?
[20:44:15] effie: templated? templated what?
[20:44:19] oh.
[20:44:22] wiki template
[20:44:25] we suspect this is what caused the alerts which randomly happened while you deployed
[20:44:34] were deploying*
[20:44:50] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) >>! In T257906#6390874, @Dzahn wrote: > I just saw the DB appears to be running on localhost, not on a clus...
[20:45:15] you mean?
[20:48:40] effie: not really, I don't think there's an easy way..
[20:51:53] is there any way we could make it easier?
[20:52:08] not now, in general
[20:53:15] Pchelolo: does the request ID not get propagated from the jobrunner request to the API calls?
[20:54:36] cdanis: lemme step back.
[20:54:51] so jobrunner itself doesn't call any apis right
[20:55:03] jobrunner = where the job is executed
[20:55:34] change-prop is the instance that executes jobs by calling jobrunners... lemme check if it propagates req_id
[20:55:41] okay, I wasn't sure if there were any self-calls involved
[20:55:54] because the load spike effie saw was on the apiservers, but maybe that's due to them having to re-parse things
[20:56:34] ah no chris, it was the app servers
[20:56:35] ok, change-prop doesn't propagate req_id.. that would be a good thing to add
[20:57:13] as for self-calls - I can't guarantee that some job somewhere deep inside doesn't end up calling the api... but it's not what generally happens
[20:58:04] but I thought that if a template was updated while pages that include it were being requested
[20:58:57] it could trigger a reparse before the reparse we triggered was completed
[20:59:50] wasn't Tim working on that recently effie? "fast stale mode"?
[21:00:01] or a template change that slowed down pages that include it anyway
[21:00:41] mm I don't know, I am reading about it now
[21:01:50] [operations/mediawiki-config@master] Enable fastStale mode on all wikis
[21:02:10] aha!
[21:13:53] last reboot cookbook ended again with "not repooling" even though it should be
[21:37:03] <_joe_> rzl: I just noticed that the weights in confctl in codfw are all over the place, we need to fix them before the switchover
[21:37:05] <_joe_> :/
[21:37:20] ah good catch, thanks
[21:37:25] added to my list
[21:39:27] for next time you ask me why it failed for me, here's another one: Not all services are recovered: mw1279:Check no envoy runtime configuration is left persistent
[21:40:13] <_joe_> mutante: that's icinga lagging, that cookbook needs an option to ignore unknowns
[21:41:14] the weights in codfw are also different because there are old and new servers.
[21:41:21] _joe_: ack
[21:47:38] since I have to look at them anyway, I think I prefer just not using the cookbook for a single host. then I don't ignore Icinga and it goes faster, without waiting for the polling
[22:45:08] 10serviceops, 10Operations: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki)
[22:45:46] 10serviceops, 10Operations: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki)
[22:54:31] 10serviceops, 10Operations: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki)
[22:55:53] 10serviceops, 10DBA, 10Operations, 10Parsoid, 10Parsoid-Tests: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (10Dzahn)
[22:58:20] 10serviceops, 10DBA, 10Operations, 10Parsoid, 10Parsoid-Tests: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (10Dzahn) ` [scandium:~] $ mysql -h m5-master.eqiad.wmnet -u testreduce -p testreduce Enter password: Reading table information for completion of table and column...
[23:00:18] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) 05Open→03Stalled mariadb-client has been installed (added buster support by using that instead of outdate...
[23:02:30] 10serviceops, 10Operations: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki)
[23:08:48] 10serviceops, 10DBA, 10Operations, 10Parsoid, 10Parsoid-Tests: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (10Dzahn) So this is everything in `modules/role/templates/mariadb/grants/production-m5.sql.erb` that refers to testreduce (line 5 to 48). Please make that work t...
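Back on the req_id gap mentioned above ([20:56:35]): change-prop is a Node.js service, so the snippet below is only a language-agnostic sketch, written in Python with the requests library, of the general pattern that's missing there: forwarding the incoming X-Request-Id on outbound calls so a load spike on the app/api servers can be traced back to the job that triggered it. The function name and arguments are made up for the example; only the X-Request-Id header name reflects what's commonly used for request IDs in this stack.

```python
# Hypothetical sketch, not change-prop code: reuse the caller's request ID (or
# mint a new one) and forward it on the outbound HTTP call, so downstream logs
# can be correlated with the originating job.
import uuid
import requests

def call_jobrunner(incoming_headers, url, payload):
    req_id = incoming_headers.get("X-Request-Id") or str(uuid.uuid4())
    return requests.post(url, json=payload, headers={"X-Request-Id": req_id})
```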
[23:28:07] 10serviceops, 10Performance-Team, 10Sustainability (Incident Followup): Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (10jijiki) >>! In T253673#6386605, @ori wrote: > The test harness would generate the code, copy the generated PHP code to the server's document r...