[03:09:38] <wikibugs>	 10serviceops, 10Operations, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10MMiller_WMF) @kostajh -- maybe we should do that, but I would like to hear from @nettrom_WMF about what that would mean for o...
[05:33:06] <wikibugs>	 10serviceops, 10Operations, 10Platform Engineering, 10Wikidata, 10Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (10Joe) To test the hypothesis that this is related to firejail use, we're sending 1 req/s to on...
[08:31:36] <wikibugs>	 10serviceops, 10LDAP-Access-Requests, 10Operations, 10observability, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10fgiunchedi) @AMooney @jcrespo any updates on this ? thank you!
[08:34:37] <wikibugs>	 10serviceops, 10Operations, 10Platform Engineering: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10fgiunchedi) p:05Triage→03Medium
[12:24:31] <wikibugs>	 10serviceops, 10GrowthExperiments-NewcomerTasks, 10Operations, 10Product-Infrastructure-Team-Backlog: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) >>! In T258978#6340838, @Joe wrote: > This service should /not/ do any caching, which should instead...
[12:32:25] <wikibugs>	 10serviceops, 10GrowthExperiments-NewcomerTasks, 10Operations, 10Product-Infrastructure-Team-Backlog: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh)
[13:23:45] <wikibugs>	 10serviceops: Package an up to date version of fluent-bit / td-agent-bit for buster - https://phabricator.wikimedia.org/T260536 (10JMeybohm)
[13:57:35] <wikibugs>	 10serviceops: Package an up to date version of fluent-bit / td-agent-bit for buster - https://phabricator.wikimedia.org/T260536 (10JMeybohm) 05Open→03Resolved imported td-agent-bit_1.5.3-0 to buster-wikimedia  Build steps for can be found at: https://wikitech.wikimedia.org/wiki/Td-agent-bit
[13:58:48] <wikibugs>	 10serviceops: Package an up to date version of fluent-bit / td-agent-bit for buster - https://phabricator.wikimedia.org/T260536 (10Pchelolo) Thank you so much! I'll update my the image.
[14:20:57] <wikibugs>	 10serviceops, 10LDAP-Access-Requests, 10Operations, 10observability, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10AMooney) 05Stalled→03Invalid @fgiunchedi, this ticket will be closed
[14:54:42] <liw>	 Hi. I'd like to make regular, time-based Scap releases. What can I do to make it as easy as possible for serviceops?
[14:55:09] <liw>	 aiming at at least one release per quarter, preferably one per month
[15:13:55] <liw>	 volans, any chance you would have a workaround for https://phabricator.wikimedia.org/T254786 ?
[15:14:21] <volans>	 liw: I replied to you on _security on friday
[15:14:53] <liw>	 volans, oh, I have missed that, sorry
[15:15:03] <volans>	 being off I was o mobile and didn't had it logged in on phab
[15:15:07] <volans>	 I can reply there for posterity though :D
[15:15:45] <liw>	 volans, I see a link to T222480
[15:16:36] <liw>	 volans, which I've read, but, alas, I don't understand enough to make progress
[15:17:47] <volans>	 clustershell, that is the library used by cumin can't cope with different padding of zeros, so the names will be messed up until we have hosts with different padding
[15:18:29] <liw>	 volans, right, but some of the hosts I got errors for don't seem to have a corresponding host without a leading zero
[15:18:58] <volans>	 example?
[15:19:32] <liw>	 ping: deployment-perfapt1.deployment-prep.eqiad.wmflabs: Name or service not known
[15:19:52] <liw>	 deployment-docker-mobileapps01.deployment-prep.eqiad.wmflabs
[15:20:03] <liw>	 sorry, ping: deployment-docker-mobileapps1.deployment-prep.eqiad.wmflabs: Name or service not known
[15:20:27] <liw>	 deployment-logstash2.deployment-prep.eqiad.wmflabs exists, thoug
[15:22:06] <volans>	 I see a  deployment-logstash2 and deployment-logstash03
[15:23:53] <liw>	 I see the same
[15:24:05] <volans>	 I don't see anything related to deployment-perfapt1 on horizon
[15:24:11] <volans>	 does that instance exists?
[15:25:16] <liw>	 I don't know; cumin told me it can't resolve that hostname
[15:25:32] <liw>	 I have no idea where it gets it from
[15:27:14] <volans>	 openstack API
[15:28:33] <volans>	 liw: oh wait, are you running cumin from withhin the deployment prep local cumin master or from the WMCS cumin master?
[15:30:10] <liw>	 volans, I don't know, but I run it from liw@deployment-cumin02
[15:30:21] <volans>	 and which query do you do?
[15:30:33] <liw>	 sudo cumin 'O{project:deployment-prep}' hostname
[15:31:16] <volans>	 '*' is the same btw
[15:32:17] <volans>	 liw: if I run that I get:
[15:32:29] <volans>	 deployment-imagescaler01.deployment-prep.eqiad.wmflabs: permission denied (publickey)
[15:32:41] <volans>	 that means that either puppet is broken there or is not properly configured
[15:32:50] <volans>	 deployment-logstash02.deployment-prep.eqiad.wmflabs: could not resolve
[15:33:00] <volans>	 and this is the known issue of zero padding
[15:33:13] <volans>	 and then
[15:33:13] <volans>	 deployment-docker-mobileapps01.deployment-prep.eqiad.wmflabs
[15:33:16] <volans>	 could not resolve
[15:33:18] <volans>	 and this is weird
[15:33:48] <volans>	 as I can totally see deployment-docker-mobileapps01 on horizon, and dig fails to resolve it, so thtat's something to ask WMCS about I guess
[15:36:08] <liw>	 ok, I will ask them
[15:36:26] <volans>	 that said, what's your blocker?
[15:37:16] <liw>	 when I just run that cumin command, it tells me 3/79 nodes failed, but I can only see two actual errors (and those are hard enough to spot so I may have missed things)
[15:37:50] <volans>	 because you run 'hostname' that returns a different output for each
[15:38:03] <liw>	 volans, my blocker is that as part of making a Scap release, I need to run a few cumin commands to test it before asking serviceops to built the .deb
[15:38:09] <volans>	 I ususally run 'true' or 'id'
[15:38:45] <liw>	 ah, good point. I hadn't realised that
[15:38:59] <volans>	 I'm not familiar with the current procedure, but not sure if a few broken instances in deployment-prep can be called a blocker tbh
[15:39:58] <liw>	 volans, the blocker is that cumin refuses to run my commands so I can't actually test the release candidate
[15:40:14] <volans>	 no, it will just fail on those hosts, on the others will run happily
[15:41:14] <volans>	 you can also exclude those 3 if you want
[15:41:15] <volans>	 'O{project:deployment-prep} and not D{deployment-logstash02.deployment-prep.eqiad.wmflabs,deployment-docker-mobileapps01.deployment-prep.eqiad.wmflabs,deployment-imagescaler01.deployment-prep.eqiad.wmflabs}'
[15:41:23] <volans>	 up to you
[15:42:02] <liw>	 volans, it tells me it aborts and doesn't run them... maybe I'm misunderstanding the tool
[15:42:32] <liw>	 I tried to exclude the problematic hosts but could not understand the cumin docs to get a working expression - thanks, I'll try that
[15:44:44] <liw>	 ok, the exlusion works, I shall try the actual release testing later.
[15:44:52] <liw>	 volans, thank you very much!
[15:46:46] <volans>	 np
[15:47:38] <volans>	 by default it will try to run the command on all hosts, and if some fails will just fail there
[15:49:48] <rzl>	 _joe_, effie, elukey: re T260224 I'm inclined to say let's just use the spare host, because it's the simplest plan and most likely to be in good shape by Sep 1 -- thoughts?
[15:51:23] <rzl>	 or, correction, I'm inclined to ask dcops if they can get it swapped in time for that plan
[15:52:00] <volans>	 can we avoid to reuse the name please? :)
[15:52:18] <_joe_>	 why?
[15:52:34] <_joe_>	 it's very convenient for netbox I heard
[15:52:39] <_joe_>	 :D :D
[15:52:51] <effie>	 yeah let's just do so 
[15:54:00] <rzl>	 volans: we'll just name it "cumin2001", I don't think that's taken
[15:54:38] <volans>	 rzl: db1083 is more fancy as a name though
[15:55:25] <rzl>	 mc2037 is next, right? I'm just taking the biggest number and adding one, if we already had a mc2037 and retired it, I don't know where to look
[16:08:22] <wikibugs>	 10serviceops, 10Operations, 10Platform Engineering, 10Wikidata, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cdanis on cumin1001.eqiad.wmnet for hosts: ` mw1359.eqiad.wmne...
[16:10:16] <effie>	 rzl: we can have a look at netbox 
[16:10:20] <effie>	 for past servers iirc 
[16:10:23] <effie>	 and phab 
[16:10:55] <effie>	 I think that would cover all angles 
[16:11:29] <effie>	 or we can create a bot that does that for us and just replies the next server name avail 
[16:11:35] <effie>	 :D
[16:19:42] <cdanis>	 we have that bot effie 
[16:19:45] <cdanis>	 its name is volans
[16:23:03] <volans>	 I replied the TrueWayOfChecking™ to rzl in prvt
[16:23:05] <volans>	 *private
[16:28:36] <effie>	 lol
[16:28:48] * effie bbl errand
[16:36:31] <_joe_>	 hi everyone, I have good news and bad news. The good news is that the cluster reboot script works. The bad news is that the cluster reboot script works.
[16:38:02] <cdanis>	 what... what did you do, joe
[16:38:35] <rzl>	 😰
[16:39:08] <_joe_>	 it's bad news because now we need to use it :D
[16:39:20] <_joe_>	 well once I've merged the next two patches
[16:43:56] <wikibugs>	 10serviceops, 10Operations, 10Platform Engineering, 10Wikidata, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1359.eqiad.wmnet'] `  and were **ALL** successful.
[17:38:30] <wikibugs>	 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) So to summarize: The vd_client/vd_server that are on testreduce1001 should NOT be on it and instead the rt_cl...
[17:43:30] <wikibugs>	 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) >>! In T257906#6390190, @Dzahn wrote: > So to summarize: The vd_client/vd_server that are on testreduce1001...
[19:30:39] <mutante>	 in grafana when i select codfw appservers to get "memory per host" i get only wtp1025. which is neither codfw nor an appserver
[19:31:12] <mutante>	 switching back to eqiad i see mw* servers as expected though
[19:32:31] <mutante>	 eh... but i also can't really reproduce it now and eventually got what i wanted. i have a screenshot though from earlier.. hmmm
[20:03:54] <mutante>	 i am doing some codfw appserver reboots but with the "single" cookbook and picking ones with high uptime
[20:28:15] <wikibugs>	 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) @ssastry @cscott rt_client and rt_server have been added to `testreduce1001.eqiad.wmnet`'.   ` [testreduce100...
[20:32:04] <wikibugs>	 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) Thanks!  > It fails after a little while though because it does not have access to the database yet.  I sup...
[20:32:35] <wikibugs>	 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) I was about to create a subtask for that. I got it.
[20:37:28] <wikibugs>	 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) I just saw the DB appears to be running on localhost, not on a cluster, fwiw.
[20:43:49] <effie>	 Pchelolo: is there an easy way to detect if a surge of requests 
[20:43:56] <effie>	 is related to a templated being updated?
[20:44:15] <Pchelolo>	 effie: templated? templated what?
[20:44:19] <Pchelolo>	 oh.
[20:44:22] <Pchelolo>	 wiki template
[20:44:25] <effie>	 we suspect this is what caused the alerts which radomly happened while you deployed
[20:44:34] <effie>	 were deploying*
[20:44:50] <wikibugs>	 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) >>! In T257906#6390874, @Dzahn wrote: > I just saw the DB appears to be running on localhost, not on a clus...
[20:45:15] <effie>	 you mean ?
[20:48:40] <Pchelolo>	 effie: not really, I don't think there's an easy way..
[20:51:53] <effie>	 is there any way we could make it easier ?
[20:52:08] <effie>	 not now, in general 
[20:53:15] <cdanis>	 Pchelolo: does a request ID not get propagated from the jobrunner request ID to the API calls?
[20:54:36] <Pchelolo>	 cdanis: lemme step back. 
[20:54:51] <Pchelolo>	 so jobrunner itself doesn't call any apis right
[20:55:03] <Pchelolo>	 jobrunner = where the job is executed
[20:55:34] <Pchelolo>	 change-prop instance that exectes jobs by calling jobrunners... lemme check if it propagates req_id
[20:55:41] <cdanis>	 okay, I wasn't sure if there were any self-calls involved
[20:55:54] <cdanis>	 because the load spike effie saw was on the apiservers, but, maybe that's due to them having to re-parse things
[20:56:34] <effie>	 ah no  chris, it was the app servers 
[20:56:35] <Pchelolo>	 ok, change-prop doesn't propagate req_id.. that wold be a good thing to add
[20:57:13] <Pchelolo>	 as for self-calls - I can not guarantee some job somewhere deep inside itself doesn't end up in calling api... but it's not what generally happening
[20:58:04] <effie>	 but I thought that if a templated was updated but pages were requested that included it 
[20:58:57] <effie>	 it could trigger a reparse before the reparse we triggered was completed 
[20:59:50] <cdanis>	 wasn't Tim working on that recently effie?  "fast stale mode"?
[21:00:01] <effie>	 or a template change tha slowed down pages that include it anyway 
[21:00:41] <effie>	 mm I don't know, I am reading about it now 
[21:01:50] <effie>	 [operations/mediawiki-config@master] Enable fastStale mode on all wikis
[21:02:10] <effie>	 aha!
[21:13:53] <mutante>	 last reboot cookbook ended again with "not repooling" even though it should be
[21:37:03] <_joe_>	 rzl: I just noticed that the weights in confctl in codfw are all over the place, we need to fix them before the switchover
[21:37:05] <_joe_>	 :/
[21:37:20] <rzl>	 ah good catch, thanks
[21:37:25] <rzl>	 added to my list
[21:39:27] <mutante>	 when you ask me next time why it failed for me: here's another one: Not all services are recovered: mw1279:Check no envoy runtime configuration is left persistent
[21:40:13] <_joe_>	 mutante: that's icinga lagging, that cookbook needs an option to ignore unknowns
[21:41:14] <mutante>	 the weights in codfw are also different because there are old and new servers. 
[21:41:21] <mutante>	 _joe_: ack
[21:47:38] <mutante>	 since I have to look at them anyways I think prefer just not using the cookbook for a single host. then I don't ignore Icinga and it goes faster, not waiting for the polling
[22:45:08] <wikibugs>	 10serviceops, 10Operations: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki)
[22:45:46] <wikibugs>	 10serviceops, 10Operations: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki)
[22:54:31] <wikibugs>	 10serviceops, 10Operations: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki)
[22:55:53] <wikibugs>	 10serviceops, 10DBA, 10Operations, 10Parsoid, 10Parsoid-Tests: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (10Dzahn)
[22:58:20] <wikibugs>	 10serviceops, 10DBA, 10Operations, 10Parsoid, 10Parsoid-Tests: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (10Dzahn) ` [scandium:~] $ mysql -h m5-master.eqiad.wmnet -u testreduce -p testreduce Enter password:  Reading table information for completion of table and column...
[23:00:18] <wikibugs>	 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) 05Open→03Stalled mariadb-client has been installed (added buster support by using that instead of outdate...
[23:02:30] <wikibugs>	 10serviceops, 10Operations: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki)
[23:08:48] <wikibugs>	 10serviceops, 10DBA, 10Operations, 10Parsoid, 10Parsoid-Tests: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (10Dzahn) So this is everything in `modules/role/templates/mariadb/grants/production-m5.sql.erb` that refers to testreduce (line 5 to 48).  Please make that work t...
[23:28:07] <wikibugs>	 10serviceops, 10Performance-Team, 10Sustainability (Incident Followup): Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (10jijiki) >>! In T253673#6386605, @ori wrote:  > The test harness would generate the code, copy the generated PHP code to the server's document r...