[01:21:58] <Cyberpower678>	 bstorm_: interesting.  Yea.  CPU is perpetually maxed out with this new VM
[01:22:58] <Cyberpower678>	 bstorm_: is it possible that cluster has slower CPUs?
[01:23:57] <bstorm_>	 I don't think so. It isn't one of the old machines 
[01:24:08] <Cyberpower678>	 It was previously at around 40% average.  So a significant difference.
[08:57:07] <Naypta>	 Hi folks, I wonder if anyone might be able to help me. I've just signed up for a Wikimedia developer account through Toolforge, but when going to sign in I'm immediately prompted to "enter a correct username and password". I'm using a password manager, so I know my password is right... any ideas?
[08:57:23] <Naypta>	 Is there meant to be a verification email perhaps? I've not received one.
[09:14:58] <Naypta>	 !help
[09:14:58] <wm-bot>	 If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-kanban
[09:18:44] <mutante>	 Naypta: the user name is case-sensitive. could it be the first letter of the user name not being capitalized?
[09:19:09] <mutante>	 are you trying to logon at wikitech.wikimedia.org ?
[09:19:35] <Naypta>	 Hi mutante, I've tried both capitals and non-capitals, neither seem to work :/ I didn't even change my username, it's just the defaults set by toolforge after the wikimedia oauth (so my wiki username, Naypta, and then the lowercase equivalent for ssh)
[09:19:53] <Naypta>	 I'm logging on at the location it sent me to after I finished the signup wizard, which is toolsadmin.wikimedia.org
[09:28:12] <mutante>	 Naypta: i am not sure this is the same thing as making a developer account at wikitech.wikimedia.org wiki
[09:29:05] <Naypta>	 mutante, I've just been following the instructions at https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedia_developer_account#Toolforge_users
[09:29:50] <Naypta>	 I can try creating an account again on wikitech.wikimedia.org, but I tried creating another one on the toolforge page and was told there was already an account associated with my wikimedia oauth account
[09:32:48] <mutante>	 Naypta: sorry, i don't know about the toolforge part of things. for me creating a dev account meant https://wikitech.wikimedia.org/w/index.php?title=Special:CreateAccount  but this lets you use an existing Wikimedia (wiki user on wikipedia) account
[09:33:09] <Naypta>	 Cheers, I'll give that a go, give me a moment
[09:33:23] <mutante>	 Naypta: best if you create a ticket or wait for others who know better about the specifics of the toolforge account linking
[09:33:24] <Naypta>	 If that doesn't work, I'll try and see if resetting the password through wikitech works, as there's no reset password link on toolforge either it seems
[09:33:31] <mutante>	 that sounds like a plan
[09:33:42] <Naypta>	 ty for your help :)
[10:06:23] <kalle>	 bd808: The cert is there without any weirdness. The weirdness is when apt attempts to install ca-certificates-java. It then tries to copy the cert from that location to a new (jks) keystore. Apt then fails to access the existing file.
[10:07:34] <kalle>	 bd808: My suggestion is to install a new instance with Buster and immediately execute apt install ca-certificates-java and look closely at the log.
[10:08:29] <kalle>	 Apt will actually finish successfully, but if you scroll up you'll see it failed to copy a bunch of certs.
[10:09:46] <kalle>	 On the old 9.5 image I previously used, it failed to copy about 15 certs with funky diacritics and other non-ASCII chars.
[10:14:03] <kalle>	 I do not have this problem on an out of the box Buster netinstall on a Virtualbox, nor on my libvirt hypervisor.
[14:29:43] <mutante>	 !log wildcat - instance "dannyb" - replacing role::simplelamp with role::simplelamp2 
[14:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wildcat/SAL
[14:32:00] <mutante>	 !log wildcat - instance "dannyb" - the above fixed broken puppet run due to missing mysql packages and installed mariadb instead (there was no existing DB in use and apache works just like before, it also showed the default page before my change)
[14:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wildcat/SAL
[14:32:36] <andrewbogott>	 mutante: thanks again for your VPS puppet tune-ups :)
[14:32:58] <mutante>	 andrewbogott: :) you're welcome. i am sooo close to deleting old modules now :)
[14:33:07] <andrewbogott>	 \o/
[14:33:08] <mutante>	 one more project
[14:34:35] <mutante>	 sorry for the breakage on standalone puppetmaster, i don't really get how it worked before .. but that was also part of this
[14:34:44] <mutante>	 since it was the last prod thing using "apache"
[14:34:52] <mutante>	 and now simplelamp and then that's it
[14:36:27] <andrewbogott>	 the puppetmaster thing was an easy fix.  And I'm pretty sure I created the simplelamp module so I'm happy to see it phased out :)
[14:56:40] <bd808>	 kalle: is there something that is actually not working for you? Or are you just upset about seeing warnings from the java package about it not understanding something about some collection of certs? This all sounds like upstream bugs to me.
[15:12:03] <CurbSafeCharmer>	 bstorm_ Refill seems to be stuck again
[15:12:22] <CurbSafeCharmer>	 "Waiting for an available worker."
[15:17:55] <bstorm_>	 Oh boy. I wonder why?
[15:19:01] <bstorm_>	 It's had 6 restarts
[15:19:13] <CurbSafeCharmer>	 automatic ones?
[15:19:41] <bstorm_>	 Yeah, that happens for lots of reasons though
[15:19:50] <bstorm_>	 If it was like 100 restarts, then it's crashing :)
[15:19:52] <bstorm_>	 Could not load cache: EOFError('Ran out of input',)
[15:20:19] <bstorm_>	 It looks like it is often doing ok.
[15:20:41] <bstorm_>	 But there's been a few EOFErrors. That can mean it lost redis afaict
[15:20:56] <bstorm_>	 The last logs suggest it is working, though
[15:21:20] <bstorm_>	 It's not being killed
[15:21:48] <bstorm_>	 Ahhh, wait, it might be dead
[15:22:13] <bstorm_>	 yeah, I'll restart the celery. The logs are older than I thought at a glance
[15:22:28] <bstorm_>	 It might have been killed for using too much memory
[15:23:29] <bstorm_>	 !log tools.refill-api killing the pods to restart the worker
[15:23:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.refill-api/SAL
[15:23:52] <bstorm_>	 I think the celery worker needs a good liveness probe defined
[15:24:03] <bstorm_>	 So that if it dies, the pod restarts itself
[15:24:20] <bstorm_>	 CurbSafeCharmer: try now
[15:24:46] <bstorm_>	 I don't know the app well enough to be sure what a good "health check" is for the celery worker
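
The liveness probe idea above could be sketched as a heartbeat-file check: the worker touches a file at regular intervals, and the probe command fails once the file goes stale. This is only a minimal sketch under stated assumptions. The file path, the 60-second threshold, and both function names are hypothetical, and nothing here is taken from the refill-api code.

```python
import os
import time

# Hypothetical values for illustration; neither comes from the refill-api tool.
HEARTBEAT_FILE = "/tmp/celery-worker-heartbeat"
MAX_AGE_SECONDS = 60.0


def touch_heartbeat(path=HEARTBEAT_FILE):
    """Record that the worker is alive by updating the file's modification time."""
    with open(path, "a"):
        os.utime(path, None)


def heartbeat_is_fresh(path=HEARTBEAT_FILE, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the heartbeat file exists and was updated within max_age seconds."""
    if now is None:
        now = time.time()
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file counts as unhealthy.
        return False
    return age <= max_age


def probe_exit_code(path=HEARTBEAT_FILE):
    """Exit status for an exec-style probe: 0 when healthy, 1 otherwise."""
    return 0 if heartbeat_is_fresh(path) else 1
```

An exec-style Kubernetes liveness probe treats exit status 0 as healthy, so a small wrapper that exits with `probe_exit_code()` would be enough for the pod to be restarted when the worker stops updating the file. Whether a heartbeat file is a good signal for this particular worker is exactly the open question raised above.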
[15:25:31] <CurbSafeCharmer>	 still getting the same message
[15:26:02] <bstorm_>	 https://www.irccloud.com/pastebin/fRZp3ekg/
[15:26:22] <bstorm_>	 https://www.irccloud.com/pastebin/FtssO71q/
[15:26:44] <bstorm_>	 It cannot connect to www.news.sky.com?
[15:27:28] <AntiComposite>	 sounds like a similar problem to last time...
[15:27:30] <bstorm_>	 There's a lot of "could not load cache"
[15:27:34] <bd808>	 "EOFError('Ran out of input',)" is a pickle error I think
[15:27:36] <bstorm_>	 That was related to redis
[15:27:47] <bstorm_>	 The pickle errors are "normal" for this
[15:28:05] <bstorm_>	 I wonder if tools-redis is having an issue?
[15:28:58] <bstorm_>	 It has a different ssh key since I rebuilt it?
[15:29:22] <bstorm_>	 That can mean it was migrated
[15:29:49] <bstorm_>	 We have a problem somewhere
[15:29:52] <bstorm_>	 https://www.irccloud.com/pastebin/KEPkxwSv/
[15:30:11] <bstorm_>	 canary1027-01 != tools-redis1003
[15:30:25] <bstorm_>	 That's the issue
[15:30:30] <bstorm_>	 ...
[15:30:44] <bstorm_>	 andrewbogott: did we change anything about cloud DNS today?
[15:31:25] <bstorm_>	 That's a very concerning error
[15:32:46] <bstorm_>	 https://www.irccloud.com/pastebin/MdUZ3YhN/
[15:34:07] <bstorm_>	 Actually, the change would have been yesterday sometime
[15:34:12] <bstorm_>	 based on when this started happening
[15:35:09] <bstorm_>	 172.16.1.166 is that canary host, not tools-redis
[15:35:28] <bstorm_>	 https://www.irccloud.com/pastebin/C2K4kM7g/
[15:36:45] <bd808>	 bstorm_: I think we can manually fix using horizon, but yeah this could be a sign that one of the dns cleanup things is busted post-Buster
[15:36:58] <bstorm_>	 Definitely
[15:37:15] <bstorm_>	 Where would we clean that up in horizon?
[15:37:20] * bstorm_ looks
[15:37:30] <bd808>	 wanna make a task? I'll get into Horizon and see if I can nuke the bad record
[15:38:48] <bstorm_>	 Sure
[15:41:47] <bstorm_>	 T252889
[15:41:48] <stashbot>	 T252889: tools-redis-1003 ended up with a duplicate record - https://phabricator.wikimedia.org/T252889
[15:42:21] <bd808>	 grrr... found the record but horizon is not letting me edit it
[15:44:31] * bd808 vaguely remembers that there is something "special" about auth for this zone
[15:47:10] <bd808>	 !log admin Manually running wmcs-novastats-dnsleaks from cloudcontrol1003 (T252889)
[15:47:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:47:15] <stashbot>	 T252889: tools-redis-1003 ended up with a duplicate record - https://phabricator.wikimedia.org/T252889
[15:48:05] <bd808>	 heh. it hasn't happened yet, but I think it is going to tell me "has multiple IPs: [...] This needs cleanup but that isn't implemented and almost never happens."
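
The "has multiple IPs" condition mentioned above can be illustrated with a short check over (name, IP) pairs. This is a minimal sketch, not the logic of wmcs-novastats-dnsleaks: the function name is hypothetical, and the caller is assumed to have already fetched the records from the DNS backend.

```python
from collections import defaultdict


def names_with_multiple_ips(records):
    """Return {hostname: sorted IPs} for every name that maps to more than one distinct IP.

    `records` is an iterable of (name, ip) pairs. Names are compared
    case-insensitively and without a trailing dot.
    """
    ips_by_name = defaultdict(set)
    for name, ip in records:
        ips_by_name[name.rstrip(".").lower()].add(ip)
    return {name: sorted(ips) for name, ips in ips_by_name.items() if len(ips) > 1}


# Made-up example data using documentation-only names and addresses.
example = [
    ("redis.example.", "192.0.2.10"),
    ("redis.example.", "192.0.2.166"),
    ("canary.example.", "192.0.2.166"),
]
duplicates = names_with_multiple_ips(example)
```

With this example data only `redis.example` is reported, since it is the one name with two distinct addresses.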
[15:48:18] <bstorm_>	 heh
[15:49:42] * bstorm_ starts reading that script
[15:53:49] <bstorm_>	 It seems to have the ability to delete duplicate records
[15:53:52] <bstorm_>	 but...
[15:54:16] <bd808>	 it didn't find the problem in dry-run mode :/
[15:55:38] <bstorm_>	 oh boy
[15:55:42] <bstorm_>	 huh
[15:56:24] <bd808>	 on the right track now ... `OS_PROJECT_ID=cloudinfra openstack recordset show eqiad1.wikimedia.cloud. tools-redis-1003.tools.eqiad1.wikimedia.cloud.`
[15:58:19] <bd808>	 "Managed records may not be updated" boo
[15:59:08] <bstorm_>	 🤨
[15:59:53] <bstorm_>	 Sounds like we may need to do database surgery or some such nonsense
[16:00:07] <mutante>	 !log glampipe - instance glampipe - replacing role::simplelamp with simplelamp2 - i note that before i touched anything this was already 502 Bad Gateway on http://glampipe.wmflabs.org/ and puppet was broken and the last time anyone besides andrew,root and myself logged in was in 2018 
[16:00:08] <bstorm_>	 I haven't had an argument with designate before
[16:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Glampipe/SAL
[16:03:28] <bd808>	 bstorm_: apparently I need to know how to make the `openstack` cli client send a "X-Designate-Edit-Managed-Records" header with the api call :/
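
If the `openstack` CLI cannot add the header, one alternative is to call the HTTP API directly. The sketch below only builds the request and does not send it. The header name comes from the message above. Everything else is an assumption: the `/v2/zones/{zone_id}/recordsets/{recordset_id}` path, the `"true"` header value, the `X-Auth-Token` authentication, and the function and parameter names are illustrative and should be checked against the Designate API documentation before use.

```python
import json
import urllib.request


def build_recordset_update(endpoint, zone_id, recordset_id, token, records):
    """Build (but do not send) a PUT request that replaces a recordset's records."""
    # Assumed Designate v2 URL layout; verify against the deployed API.
    url = f"{endpoint.rstrip('/')}/v2/zones/{zone_id}/recordsets/{recordset_id}"
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed Keystone token header
        "X-Designate-Edit-Managed-Records": "true",  # header named in the log above
    }
    body = json.dumps({"records": records}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes it easy to inspect the URL, headers, and body before touching a managed record.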
[16:03:50] <bstorm_>	 hah!
[16:03:53] <bstorm_>	 Ok
[16:05:01] <bstorm_>	 I know some commands have header args...
[16:05:08] <bstorm_>	 trying to find docs that might be useful for this
[16:11:24] <bstorm_>	 CurbSafeCharmer: Thanks! I think you found a pretty interesting error that needs fixing :)
[17:09:18] <wm-bot>	 !log tools.versions <bd808> Update to 0d8ed77 (D1183)
[17:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.versions/SAL
[17:12:56] <bd808>	 greg-g: your change to https://versions.toolforge.org/ is live. :)
[17:14:52] <hauskatze>	 bd808: I +1ed it, where's my cookie? :-)
[17:15:19] <hauskatze>	 Differential is weird
[17:15:25] <bd808>	 hauskatze: take your pick -- https://emojipedia.org/cookie/
[17:16:40] <bd808>	 hauskatze: you actually functionally +2'd to the extent that differential has a +2. And yes it is an 'interesting' system. More a patch mailing list manager than anything else. :/
[17:17:42] <hauskatze>	 bd808: but there's no "land" a.k.a. cr+2 UI button, you need to `arc land` via shell
[17:17:45] <hauskatze>	 iirc
[17:17:56] <hauskatze>	 so it's kinda cumbersome IMHO
[17:18:08] <hauskatze>	 even GitHub has a "Merge Pull Request" button
[17:18:11] <bd808>	 yes, and also the whole thing assumes that the person running `arc land` is the patch author
[17:18:39] <hauskatze>	 which caused attribution issues to some of my prior patches in rPHEX for example
[17:19:11] <hauskatze>	 I guess given we ain't using Differential for general CR it is not a priority to have those fixed
[17:19:30] <bd808>	 Like I said, more of a patch mailing list manager tool than anything else. And one with hard-coded assumptions
[17:19:57] <hauskatze>	 I used Arcanist like 5/6 times and I don't quite get it
[17:20:42] <hauskatze>	 not a 'pro' coder though, so I'm sure it's also my fault
[17:27:26] <bstorm_>	 CurbSafeCharmer: I think we fixed it
[17:29:50] <Majavah>	 bd808 wait that patch to versions adds link to my tool :O
[17:30:07] <bd808>	 Majavah: yeah!
[17:41:03] <greg-g>	 bd808: hah! nice :)
[17:42:53] <greg-g>	 Majavah: yes! thank you!
[17:43:35] * Majavah is happy
[17:53:55] <kalle>	 bd808: Yes, this makes Gradle (java build tool) unable to connect to jcenter (a maven dependency repository) via https as it does not trust the root cert. 
[17:55:35] <kalle>	 We're thus unable to build our Blubber Docker images on the instance. We're left with building them locally and uploading our images. This will probably become a problem later on as releng and CI pipeline will do the same.
[17:56:00] <kalle>	 So this is indeed a real problem for us.
[18:06:36] <kalle>	 bd808: https://phabricator.wikimedia.org/T249220 See the Gradle build error for Mary TTS in the end of that task.
[18:15:03] <hauskatze>	 bd808: https://pathoschild-contrib.toolforge.org/ gives 403 but tools.wmflabs.org/meta works
[18:15:41] <hauskatze>	 https://meta.toolforge.org/ is also 403
[18:20:58] <bd808>	 Hauskatze: I’m not sure I understand what you are asking of me.
[18:22:17] <hauskatze>	 bd808: are wiki links of the type tools.wmflabs.org/$toolname being rewritten?
[18:22:26] <hauskatze>	 while you do a click on 'em I mean
[18:23:53] <bd808>	 Potentially. But tools running on the Kubernetes cluster would need an ingress object for the new domain. That requires that they be restarted by their maintainers
[18:24:29] <bd808>	 It’s tool by tool opt-in at the moment 
[18:24:42] <hauskatze>	 bd808: I intend to ask Pathoschild but I am not sure what to say to him. "Your tool does not work" ain't very accurate :)
[18:26:36] <bd808>	 https://wikitech.wikimedia.org/wiki/News/Toolforge.org
[18:33:43] <hauskatze>	 https://tools.wmflabs.org/pathoschild-contrib/stalktoy/ <-- it's our links being not correct afaics
[18:34:02] <hauskatze>	 I'll see if I can find where that link lives and update it
[18:44:20] <bstorm_>	 !log admin rebooting cloudvirt-wdqs1003 T252831
[18:44:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[18:44:24] <stashbot>	 T252831: cloudvirt ceph nodes can't launch new VMs - https://phabricator.wikimedia.org/T252831
[20:35:28] <bstorm_>	 !log toolsbeta updating the maintain-kubeusers image to be able to control admin accounts
[20:35:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[20:48:33] <bstorm_>	 !log toolsbeta found an error in the new version of maintain-kubeusers, removing the deployment for now T246059
[20:48:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[20:48:36] <stashbot>	 T246059: Add admin account creation to maintain-kubeusers - https://phabricator.wikimedia.org/T246059
[21:13:09] <Danny_B>	 mutante: thx for fix
[22:05:39] <bd808>	 !log admin Added reedy as projectadmin in admin project (T249774)
[22:05:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[22:05:43] <stashbot>	 T249774: Grant "Cloud admin" rights to Reedy - https://phabricator.wikimedia.org/T249774
[22:10:40] <bd808>	 !log admin Added reedy as projectadmin in cloudinfra project (T249774)
[22:10:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL