[08:26:27] !log tools.zppixbot wiki to read only, updating all submodules: 'tools.zppixbot@tools-sgebastion-07:~/ZppixBot/public_html/wiki$ git submodule foreach git pull origin REL1_34'
[08:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[08:29:38] !log tools.zppixbot wiki out of read only
[08:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[08:30:21] * RhinosF1 leave
[16:23:22] !log admin restarting apache2 in cloudcontrol1003/1004 to pick up latest wmfkeystonehooks changes T249494
[16:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:23:24] T249494: CloudVPS: keystone bugs in Queens (wmfkeystonehooks missing role_api and LDAP encoding issues) - https://phabricator.wikimedia.org/T249494
[16:47:42] Can someone take a look at Refill? It needs a reset
[16:49:46] CurbSafeCharmer_: have you tried to contact the tool's maintainers?
[16:51:12] hmmm... the last comment on https://en.wikipedia.org/wiki/User_talk:Zhaofeng_Li/reFill is not a good thing. "To those posing here please be aware that Zhaofeng is no longer maintaining Refill."
[16:51:43] and https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical) seems to have a lot of folks posting about downtime.
[16:52:28] Sounds like the community needs to find some folks to step up and take over the tool via https://wikitech.wikimedia.org/wiki/Help:Toolforge/Abandoned_tool_policy
[16:55:19] It worked a few days ago but is intermittent. A few weeks ago I reported the issue here and @bstorm_ got it going again
[16:56:03] * bstorm_ peeks in wondering what's up
[16:56:07] CurbSafeCharmer_: sure, but that's not very sustainable.
We can keep hitting it with the restart hammer, but that won't actually fix the tool
[16:56:17] Agreed
[16:56:28] We should *not* restart refill
[16:56:34] It requires special handling
[16:57:04] It's one of those ones that requires kubectl apply...also needs a maintainer :)
[16:57:21] If it is having trouble I wonder if it needs more resources than it has
[16:57:25] I'll take a look
[16:57:25] is it a java mess?
[16:57:35] Nope, it's the celery worker :)
[16:57:40] ah
[16:57:42] The one that was crashing nodes back in the day
[16:58:21] yeah, so long time bad actor code that is broken for users because it is not breaking the Kubernetes cluster for everyone else
[16:58:24] I did apply what I thought were suitable constraints on it. It even required a quota boost to move it
[16:58:38] It was broken because the celery container had no limits
[16:58:51] I added limits on the old cluster, and those moved over to the new (where everything has limits)
[16:59:18] So exactly what's up now is an interesting question
[16:59:22] I expected it to be stable
[16:59:58] * bd808 stops himself from commenting on the global state of Python code quality in Toolforge Tools
[17:00:01] It's been running for 6d. That doesn't seem too broken
[17:00:22] CurbSafeCharmer_: what seems to be the trouble at the moment?
[17:00:56] Stuck saying "waiting for an available worker"
[17:01:07] ah, so it is the same type of thing
[17:01:15] Yup
[17:01:22] Lemme dig a little
[17:01:48] The pod is alive, but that is likely because of poor liveness checking
[17:04:03] This is actually somewhat new methinks
[17:04:18] Unless the worker is dead
[17:04:49] Yeah, the pywikibot errors were there before. I remember now
[17:05:38] !log tools.refill-api running 'kubectl delete pods refill-api-6c78d8cdd-49krm'
[17:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.refill-api/SAL
[17:05:54] Restart should be that simple.
I cannot get a good read on why it would not be live, though
[17:06:18] I can check for prometheus metrics. It would be killed if it consumed more ram than it should
[17:07:08] CurbSafeCharmer_: is it working now?
[17:07:15] (problem with refill is that it's not abandoned. cyberpower already took it over from Zhaofeng. but he hasn't had the time to do more than minor maintenance)
[17:08:27] Still saying waiting for an available worker
[17:08:27] frankly I don't blame him, it's not exactly an easy codebase to understand
[17:09:45] well, we've got this:
[17:09:48] https://www.irccloud.com/pastebin/9Mqj32aC/
[17:09:55] AntiComposite: it's also a bit of an anti-pattern that we have so many abandoned tools taken over by folks who are already working pretty much full time on other tools :/
[17:10:02] yup
[17:10:08] https://www.irccloud.com/pastebin/VFSxoqZQ/
[17:10:18] This seems like code issues...
[17:11:02] It got this too:
[17:11:04] `requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine('HTTP/1.1 401.1 Unauthorized\r\n',))`
[17:11:06] Sheesh
[17:12:07] Looking around for the whole `Could not load cache: EOFError('Ran out of input',)` thing I keep seeing
[17:12:24] i don't remember that previously. I *do* remember pywikibot issues, likely because this is running a jessie container
[17:12:31] Thus all old software
[17:13:31] Ah, that's a pickle thing as well
[17:14:27] CurbSafeCharmer_: Are you checking something that is a "known good" query? This may be insufficient error handling you are seeing.
[17:14:35] In other words, a code issue
[17:16:10] On the other hand, the usage pattern suddenly dropped recently...
[17:16:11] https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&var-namespace=tool-refill-api&refresh=5m
[17:17:27] It looks like it could have died at some time in the past few days https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&var-namespace=tool-refill-api&refresh=5m&fullscreen&panelId=1
[17:17:31] And then started again
[17:17:39] when I kicked it
[17:18:22] Hard to tell with the way it is set up
[17:21:19] It had updates 6 days ago in the code...
[17:21:56] (confirmed not working on a testcase from one of my tools that I've previously used with refill)
[17:22:08] Thanks
[17:22:41] requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
[17:22:51] That's the worker
[17:23:22] raised unexpected: ConnectionError(ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')),)
[17:23:31] I am concerned that it cannot talk to the wikis...
[17:23:52] That would explain pickle unable to get "sufficient input"
[17:24:30] As I mentioned, error handling is not the strong point in this code, so it will likely show any error in the worker like this
[17:25:00] I got "55efbd7f-2a8b-47d8-9bb9-5a0649a6408b" for the celery task id and then a status response of [{"info": {}, "state": "PENDING"}, {:"info...]
[17:25:13] yeah, I'm reading the celery logs
[17:25:22] It's unable to make connections
[17:27:48] I can't fix this by just restarting things
[17:27:51] I don't know what's up
[17:29:00] Nobody has logged in as this tool since I last did from what i can tell
[17:29:10] Did that backend change?
[17:36:45] only action the repo's seen in the last 15 months is translatewiki.net
[17:36:59] oh, and dependabot and user complaints
[17:39:07] !log admin [codfw1dev] `openstack zone create --email root@wmflabs.org --type PRIMARY --ttl 3600 --description "floating IPs subnet" 57.15.185.in-addr.arpa.` (T247972)
[17:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:39:10] T247972: Cloud DNS: fix inconsistent ownership of reverse domains for openstack floating ip networks - https://phabricator.wikimedia.org/T247972
[17:42:18] !log admin [codfw1dev] transferred DNS zone 57.15.185.in-addr.arpa. to the cloudinfra-codfw1dev project (T247972)
[17:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:44:34] AntiComposite: It's acting like it has lost access to something or cannot contact wikis.
[17:46:13] Unless there's some other component that it is failing to connect to...
[17:47:19] bstorm_: does it hit refill-api from the celery side?
[17:48:25] trying to parse that out
[17:49:20] I don't think so. refill-api is basically just the celery service with something uwsgi in the frontend, I think?
[17:49:30] I think it connects to citoid...possibly other stuff
[17:49:38] should be
[17:51:39] Does it hit citoid directly? I wonder if that needs TLS now that it did not before? I'm not sure which places envoy is enforcing TLS this week that it was not several weeks ago.
[17:52:01] citoid is a whole different mess :/
[17:53:23] looks like it's not citoid
[17:53:47] actually, it's working now
[17:53:54] kinda
[17:54:59] yeah, it's working
[17:55:17] CurbSafeCharmer_, can you check again please?
[17:57:14] It is working...I don't know why
[17:57:30] It stopped getting "requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))"
[17:59:21] Here's where it was breaking:
[17:59:24] https://www.irccloud.com/pastebin/NsXHkAWc/
[18:02:08] Nice, this code has type hinting :)
[18:02:14] that's literally just loading the URL from the citation
[18:02:51] it breaks every two months, but at least it has type hinting
[18:02:59] 😁
[18:03:22] I suspect something was throttling or preventing it from hitting those URLs. Those would be external URLs?
[18:05:46] yup
[18:05:48] yes, the URL in the citation would be a 3rd party site
[18:06:02] It seemed to be failing on every one...
[18:06:05] And now it isn't
[18:06:18] and that would make things break seemingly at random with upstream throttling
[18:06:57] * bd808 sees BeautifulSoup in that code and shudders
[18:07:22] you know it's a fun one when you need pywikibot, mwparserfromhell, and beautifulsoup
[18:07:52] It wasn't failing every single time perhaps because I see interspersed "[2020-04-06 17:21:14,721: WARNING/ForkPoolWorker-90] took 2.8050272464752197"
[18:08:00] But lots of failures then suddenly none
[18:08:51] this bash quote of mine is about BeautifulSoup -- https://bash.toolforge.org/quip/AVWoDg8ZgCrwkbTdmcjL
[18:09:50] That's really an excellent evaluation of that library
[18:09:58] I have nothing further to add :)
[18:23:29] Thanks @bstorm_
[18:32:37] since I don't see that anyone's done this already, I'll throw a ping in for Cyberpower678 in case they can find an actual fix
[18:35:37] A fix or perhaps a nice bit of error handling that signals the problem to the user.
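Editor's note: the "error handling that signals the problem to the user" suggested at 18:35:37 could look roughly like the sketch below. It is only an illustration, not refill's code. The log shows refill using `requests` inside a celery worker; this sketch swaps in the standard library's `urllib` so it runs on its own, and the function name `fetch_citation`, the `opener` parameter, and the result layout are all hypothetical (the `state`/`info` keys merely echo the status response quoted at 17:25:00).

```python
import urllib.request


def fetch_citation(url, opener=urllib.request.urlopen, timeout=10):
    """Fetch a cited URL and always return an explicit state.

    Hypothetical helper: a network failure becomes a FAILURE result that
    names the URL and the error, instead of an unhandled exception.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return {"state": "SUCCESS", "info": {"body": response.read()}}
    except OSError as exc:
        # OSError covers urllib.error.URLError, ConnectionResetError
        # ("Connection reset by peer") and socket timeouts.
        return {"state": "FAILURE", "info": {"url": url, "error": repr(exc)}}
```

With a result like this, a frontend could tell the user which third-party site refused the connection rather than showing "waiting for an available worker".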
[19:16:32] !log tools deleted tools-redis-1001/2 T248929
[19:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:16:35] T248929: Rebuild tools-redis servers as stretch - https://phabricator.wikimedia.org/T248929
[20:08:08] Hi, https://tools-static.wmflabs.org/cdnjs/ajax/libs/twitter-bootstrap/4.0.0-beta.2/css/bootstrap.min.css seems to return 403. Could a cloud admin fix that, please? Per https://toolsadmin.wikimedia.org/tools/id/cdnjs, only they are maintainers.
[20:08:38] bstorm_: ^
[20:20:50] Hey
[20:20:54] Let me take a look
[20:21:44] thanks
[20:22:29] Urbanecm: that's because the file isn't available upstream now
[20:22:30] https://cdnjs.com/libraries/twitter-bootstrap/4.0.0
[20:22:48] So we are getting a 403 from the upstream provider
[20:23:09] Wait, no...it's there
[20:23:15] *pokes further*
[20:25:13] It should be a straight proxy to https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/4.0.0-beta.2/css/bootstrap.min.css and that works
[20:25:24] I wonder if it's getting ratelimited
[20:25:49] might be
[20:25:59] though CURLing that from tools-sgebastion-08 works
[20:26:34] I will try curling from the proxy :)
[20:26:48] There also might be an issue with the URL somehow
[20:35:43] Yes, they are giving all of our URLs 403s
[20:36:11] Wondering if that's because of a change on the proxy or...
[20:36:12] oh rip. I guess they think our use is abusive?
[20:36:56] I don't know yet. It just started today
[20:37:31] zhuyifei1999@zhuyifei1999-ThinkPad-T480 ~ $ ssh stretch-dev.tools.wmflabs.org
[20:37:31] Connection closed by 185.15.56.50 port 22
[20:37:32] T480: Agree and Document the Process to Include Security Fixes in MediaWiki Releases - https://phabricator.wikimedia.org/T480
[20:37:41] o.O
[20:38:06] heh
[20:38:52] also tried bastion.toolforge.org, can't ssh in
[20:38:56] I think I found the first one...but it wasn't all of the URLs then
[20:39:00] Really?
[20:39:07] zhuyifei1999_, looking
[20:39:18] * Urbanecm neither
[20:39:26] https://www.irccloud.com/pastebin/71aNqEav/
[20:39:31] Same here
[20:41:16] hmmm
[20:41:21] not sure why it's not letting people in
[20:41:29] I had to go in as root too
[20:41:42] I can get in by ssh-ing directly to the server without root
[20:41:49] "directly"
[20:42:21] pam_access(sshd:account): access denied for user `krenair'
[20:42:23] root key works for me. bastion.wmflabs.org works for me
[20:43:01] login.toolforge.org and dev.toolforge.org both should work and do not
[20:43:18] confirmed
[20:43:24] login.tools.wmflabs.org is not letting me in either
[20:43:47] I can go through using login.tools.wmflabs.or
[20:43:49] *login.tools.wmflabs.org
[20:44:05] also login.toolforge.org
[20:44:13] dev.toolforge.org doesn't work though
[20:44:29] I would say it's an ldap failure except that I can use my key on other hosts...
[20:44:48] I can log in with login.tools.wmflabs.org yeah
[20:44:57] Wait no, I cannot
[20:45:01] it didn't complete
[20:45:15] can't on login.tools.wmflabs.org here
[20:45:17] I can ssh directly to a bastion
[20:45:21] doesn't look like any recent puppet changes on tools-sgebastion-08
[20:45:56] andrewbogott: I don't know if it is at all related, but https://tools-static.wmflabs.org/cdnjs/ajax/libs/popper.js/1.12.5/umd/popper.min.js
[20:45:59] root@tools-sgebastion-08:~# getent group project-tools
[20:45:59] root@tools-sgebastion-08:~#
[20:46:09] auth.log makes it look like either LDAP or PAM
[20:46:12] whether jumping through bastion.wmflabs.org doesn't seem to affect it
[20:46:13] that seems... weird?
[20:46:28] o.O
[20:46:35] it clearly exists in LDAP
[20:46:35] pam_access(sshd:account): access denied for user `urbanecm' from `bastion-eqiad1-01.bastion.eqiad1.wikimedia.cloud'
[20:46:37] /usr/sbin/ssh-key-ldap-lookup andrew
[20:46:39] lots of that
[20:46:45] * Krenair eyes sssd
[20:47:20] yeah.
the ssh-key-ldap-lookup seems to be working
[20:47:37] so it thinks we're not in project-tools for some reason
[20:47:50] the rules in /etc/security/access.conf.d/ look okay
[20:48:17] That makes sense
[20:48:40] Has anyone tried one of the other bastions?
[20:49:00] I've been digging into sgebastion-08 as root, we've tried logging in to login.tools.wmflabs.org with similar symptoms
[20:49:12] after I sudoed into my account tools-bastion-09 works
[20:49:19] andrewbogott I've tried both of the TF ones, and I don't use that bastion for proxy
[20:49:25] I use the restricted bastion
[20:49:27] -08 is dev.toolforge.org, -07 is login.toolforge.org
[20:49:28] *sgebastion
[20:49:51] huh
[20:49:51] 07 is working again
[20:49:52] for me
[20:50:01] Urbanecm, zhuyifei1999_ can you try now?
[20:50:19] Working for me on 08
[20:50:29] Did anyone do anything?
[20:50:31] Krenair: seems to work
[20:50:33] yeah both working for me now
[20:50:36] -09 still doesn't work for me
[20:50:38] I didn't change anything
[20:50:46] I changed nothing bstorm_
[20:50:51] Ok, so more cursed stuff
[20:50:53] I did try to sudo from root to my user
[20:50:59] I sudoed into my account on 09... then curled cdnjs <= all I did
[20:51:01] but that wouldn't affect anything surely
[20:51:04] I'm going to try that on 09
[20:51:26] I still can't get into dev.toolforge.org
[20:51:30] huh 08 doesn't really work for me
[20:51:31] I can
[20:51:40] Urbanecm: that's what I tested.
[20:51:46] bstorm@tools-sgebastion-08:~$
[20:51:56] Trying again in case it dropped
[20:51:59] Krenair: you did 'sudo su - krenair'?
[20:52:01] it did!
[20:52:16] so sgebastion-08 just flapped
[20:52:17] tried reperately
[20:52:18] I used `sudo -iu zuyifei1999`
[20:52:23] andrewbogott, sudo -iu krenair
[20:52:31] also yeah -08 has broken for me again
[20:52:37] (uh, misspelled my username)
[20:52:38] -09 is working for me now
[20:53:06] I'll have a look at ldap logs
[20:53:08] Urbanecm: can you try -08 again?
[20:53:16] and now it works... wtf.....
[20:53:27] seems to work now
[20:53:39] I just logged into dev.toolforge.org as root and did `sudo -iu bstorm`
[20:53:48] Then I sat at terminal and waited until you tried
[20:53:54] Gonna log out and then try
[20:54:00] that's...weird
[20:54:12] It's still working
[20:54:18] so maybe that's a red herring :)
[20:54:40] That's really odd
[20:55:52] it all appears to still be working for me now...
[20:56:24] maybe it's totally unrelated, but why https://tools.wmflabs.org/ldap/user/urbanecm has both b'project-bastion' and project-bastion?
[20:56:54] I'd understand if it used b'...' for everything, then it would show a mistake in that tool
[20:57:01] but displaying _both_ is weird
[20:57:19] uh oh
[20:57:21] ldap had some kind of syncing upset in the last few minutes
[20:57:26] andrewbogott: ^ That sounds like a side effect of the bug that a.rturo fixed earlier today
[20:57:33] ^
[20:57:41] the 'b...' byte string stuff
[20:58:02] It does, although it's not like Urbanecm was recently added or removed from project-bastion
[20:58:18] uh yeah. If I run `id` I see the same weirdness with b'...'
[20:58:27] yeah but I think it re-synced membership for all members of those groups probably?
[20:58:47] 50062(project-bastion),50380(project-tools) ... 54340(b'project-bastion'),54343(b'project-tools')
[20:58:51] maybe
[20:58:54] different group IDs even
[20:59:01] yeah....
[20:59:02] I don't know why that would've healed though
[20:59:05] That doesn't seem good
[20:59:11] anyone know which one is the real one?
[20:59:20] no idea
[20:59:23] andrewbogott: the nfs-mounts.yaml will have it
[20:59:30] the lower numbered ones
[20:59:40] the one without b'...' should be real right? considering they have lower IDs
[20:59:44] finding that
[20:59:50] group ids are created by finding the largest and adding 1
[21:00:17] fair
[21:00:17] yeah, low numbered without b'...' will be legit
[21:00:18] yeah. anything with b'...'
as a name is bugged and a duplicate
[21:00:40] 50380 is project-tools
[21:00:42] now if I can just get the novaadmin ldap password to work...
[21:00:51] 54343(b'project-tools') is the wrong one
[21:00:59] heh
[21:01:36] andrewbogott: mind if I ignore this and drill down into the cdnjs problem while you and bd808 look at this?
[21:01:43] bstorm_: go ahead
[21:01:50] 👍🏻
[21:02:48] bd808: I stand poised to delete some of those b'projects'
[21:02:52] any reason for me to hesitate?
[21:03:00] I mean, the group entries in ldap for them
[21:03:36] andrewbogott: probably not. I was just looking to see when they were created. 20200406200220Z for cn=b'project-tools'
[21:03:52] which is hours after the fix was applied?
[21:04:06] yep, that implies that the 'fix' is causing this rather than fixing it
[21:04:09] ~1 hour ago
[21:04:22] *nod*
[21:04:22] * andrewbogott stops editing ldap
[21:04:48] hm...
[21:04:57] can I just scream about python-ldap for a bit?
[21:05:29] sure!
[21:05:47] * bd808 opens window and just yells about that library
[21:06:53] the timestamp for b'project-bastion' is 20200406162343Z, so much earlier
[21:07:24] so in python2, we were passing in regular old python2 strings
[21:07:37] and pyldap was (maybe) decoding them before writing unicode to ldap
[21:08:06] in python3, we are now passing in utf8 strings which pyldap is writing literally to ldap rather than decoding
[21:08:31] so maybe the right version of art.uro's patch was to make everything python3 string rather than making everything utf8
[21:08:32] ?
[21:09:05] The merge on a.rturo's first fix today was 10:05 UTC.
Both weird groups in my group list were created after that
[21:09:20] uh
[21:09:27] I've stopped being able to log in to login.toolforge.org again
[21:09:42] Krenair, the patch we're discussing is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/586341/
[21:09:47] (in case you're interested)
[21:10:05] I know I approved an earlier version of it :)
[21:11:01] `groupname = ("project-%s" % project_id).encode('utf-8')` -- that's going to make a b'...' string
[21:11:33] yeah
[21:11:36] I'm going to reverse it
[21:11:59] hmmmm
[21:12:27] old syntax though:
[21:12:28] >>> project_id = "tools"; "project-%s" % project_id.encode('utf-8')
[21:12:28] "project-b'tools'"
[21:12:34] which seems broken
[21:12:46] the python ldap lib returns bytes rather than strings in python3
[21:12:59] but feeding it bytes is not right I don't think
[21:13:02] hm, I also found a case that that patch misses
[21:13:16] okay so what we want is simply `"project-%s" % project_id` ?
[21:13:18] oh, so it /speaks/ bytes but /hears/ strings? What a jerk
[21:14:01] it's all a bit of a mess honestly. their py2->py3 changes got really messed up
[21:15:31] there is a ton of code in the library after they added py3 support for the bytes<->strings conversions. And it took quite a few releases for them to iron it out upstream. I haven't looked at library versions we have installed to try and guess where in the cycle of fixes Debian took a snapshot
[21:16:37] I'm still proofreading, but… https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/586446/
[21:16:42] https://pypi.org/project/ldap3/ is the less crappy, pure python client library
[21:17:53] I wonder about that str() call on line 178 but it seems harmless
[21:18:03] oh, no, it's right, it's converting an int
[21:18:14] andrewbogott: that one should be ok. it's casting an int to a str
[21:18:37] and that's needed I think
[21:20:04] so before a.rturo touched things we had mixed byte and strings it looks like and then he made it all bytes.
Now andrewbogott is proposing all str? (just trying to make sure I'm following here)
[21:20:30] I was sleeping during all that, but I think they were responding to exceptions caused by appending strings and bytes
[21:20:34] Krenair: is that right?
[21:20:41] So he fixed those exceptions by standardizing on bytes
[21:20:45] andrewbogott: do you remember why the prior set of encode('utf8') add ons were done?
[21:20:55] whereas I/we now think that he should've standardized on strings instead
[21:21:09] andrewbogott, yes
[21:21:11] because Keystone was passing in strs and ldap was producing bytes
[21:21:30] um… hm, that doesn't make much sense in python2
[21:21:33] * andrewbogott looks at file history
[21:22:00] python2 behaviour, old code:
[21:22:01] >>> project_id = "tools"; "project-%s" % project_id.encode('utf-8')
[21:22:02] 'project-tools'
[21:22:06] It feels like something that would have been related to the cn unicode bugs we had
[21:22:27] yeah
[21:22:35] I picked a fine time to clean up old config...
[21:22:39] * andrewbogott digs deeper
[21:23:04] any ideas for a "connection closed" when I try to ssh in? (my labs username is aperson)
[21:23:11] Krenair: ah. good research. so this may have been a hack that helped in py2 but then py3 did different things when combining the strings
[21:23:41] bd808, yep that was my assumption
[21:23:47] enterprisey: we are battling that right now. it appears to be related to some LDAP changes. It also appears to work and then not work.
[21:23:57] ah, nice, thought I'd been blocked
[21:24:02] good to know, thanks
[21:24:10] nope I can't log in to my account either :)
[21:24:42] enterprisey: not yet, but I'll add you to the naughty list if you'd like ;)
[21:24:46] :p
[21:25:35] andrewbogott: I think your patch is worth a shot. Do we have any confidence in codfw as a testing zone for this stuff?
[21:25:43] if I want a labs user rename + project rename, that's just a phab ticket with the Cloud-Services project?
[21:26:12] enterprisey: both are close to impossible, but yes phab would be the place to document an ask
[21:26:26] yeah, figured - thanks again, and good luck w/ the outage
[21:26:29] bd808: it's worth a try, yes
[21:26:39] I'll apply there once I get my head out of git history
[21:29:14] * andrewbogott still hasn't found the actual patch
[21:30:04] disabling puppet on cloudcontrol1003/1004, merging in codfw
[21:32:16] ah crap I patched queens but not rocky and codfw1dev is rocky
[21:34:35] I am wondering if this is really the same bug as the inability to log in thing we've been seeing
[21:35:30] I wonder that too
[21:35:33] seems kind of unlikely
[21:36:07] my only theory is that when the two ldap servers sync, they try to sync between the b and str versions of the same directory, get confused, lock up for a bit
[21:36:17] (since I see a lot of upset in the logs)
[21:36:39] yeah. that's sort of what I was guessing too
[21:36:53] I'd be surprised if an LDAP server would accept a group with a weird name and then not be able to sync it to other LDAP servers
[21:37:03] at least overloading as the mirrors try to sync up if not causing something weirder to happen
[21:37:40] ok, actually applying the patch to codfw1dev now
[21:37:53] I'm going to remove the b'' groups there and see if I can get them to reappear
[21:38:05] (by removing and adding bd808 from some projects)
[21:39:12] Krenair: my guess is more number of items in the sync.
the b'project-bastion' has ~5000 members, b'project-tools' ~2000
[21:39:29] ew
[21:39:36] neither of which are as big as the real source groups
[21:39:55] so things are getting jammed up somewhere in the replication pipeline
[21:40:20] some interesting things in here
[21:40:21] root@tools-sgebastion-08:~# ldapsearch -LLLx "cn=b'project-tools'" cn
[21:40:21] dn: cn=b'project-tools',ou=groups,dc=wikimedia,dc=org
[21:40:21] cn: project-tools
[21:40:21] cn: b'project-tools'
[21:40:21] root@tools-sgebastion-08:~#
[21:40:36] project-bastion has ~10k members, project-tools has ~4k
[21:40:45] two CNs
[21:40:58] isn't that two separate groups?
[21:41:18] oh actually
[21:41:22] this could be the source of our problems
[21:41:43] Krenair: yikes. I'm seeing that too
[21:41:49] yeah look
[21:41:52] root@tools-sgebastion-08:~# ldapsearch -LLLx "cn=project-tools" | grep dn:
[21:41:53] dn: cn=project-tools,ou=groups,dc=wikimedia,dc=org
[21:41:53] dn: cn=b'project-tools',ou=groups,dc=wikimedia,dc=org
[21:41:53] root@tools-sgebastion-08:~#
[21:42:02] and it would cause flapping as the two groups are not in sync for members
[21:42:04] let's say it tries to check that someone is in project-tools
[21:42:13] now they're in project-tools but not b'project-tools'
[21:42:21] if it picks the wrong one it'll fail
[21:42:51] I'm not sure that we ever lookup by only cn for a group, but maybe we do
[21:43:06] not sure since when, but I'm in both groups and yet i was able to reproduce our login issue
[21:43:55] mmmm
[21:43:59] think you're right actually :/
[21:44:09] we're both in the b'' version of tools
[21:45:23]
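Editor's note: the three ways of building the group name discussed between 21:11 and 21:22 behave as follows under Python 3. This is a standalone sketch that does not touch LDAP. The variable names are illustrative, and the `str()` call on the bytes value is only one plausible way a literal `b'project-tools'` could end up as a group name; the log does not show where that conversion actually happened.

```python
project_id = "tools"

# Old code: only the id is encoded, then interpolated into a str.
# Python 3 formats the bytes object with its repr.
mixed = "project-%s" % project_id.encode("utf-8")

# Form quoted at 21:11:01: the whole name is encoded, giving bytes.
as_bytes = ("project-%s" % project_id).encode("utf-8")

# Form suggested at 21:13:16: plain str throughout.
as_str = "project-%s" % project_id

print(repr(mixed))          # "project-b'tools'"
print(repr(as_bytes))       # b'project-tools'
print(repr(str(as_bytes)))  # "b'project-tools'"
print(repr(as_str))         # 'project-tools'
```

Only the last form yields the plain name `project-tools`, which matches the direction of the patch proposed in the log.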