[00:06:33] RoanKattouw: Everything in the db looks right but I am seeing occasional errors in the logs. I need to go now but will try to track that down tomorrow.
[00:06:44] OK cool
[00:06:47] It's less urgent
[00:06:48] I'm generally concerned that this instance is in a broken state though -- we'll see if it can be saved.
[00:07:14] With it being semi-alive now, could we potentially rescue the data on it (or image it)?
[00:07:22] It'd be interesting to know if that same security group works on other VMs or if that same problem happens to you. I definitely tested adding/removing security groups last week but maybe this is an interesting case somehow...
[00:07:33] Imaging it isn't a great option but you could definitely do a data rescue.
[00:07:42] Yeah I haven't tried that
[00:08:01] I can create deployment-maps03 (which is a name that has never been used), transfer the data, and then add the puppet role and see if that works
[00:08:17] Last time I tried to do a guerilla import by just plopping all the right files in /srv it didn't really work
[03:50:00] !log tools clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
[03:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[04:14:41] could someone please remove the two bullet points from https://wikitech.wikimedia.org/w/index.php?title=MediaWiki:Openstackmanager-shellaccountnamehelp&action=edit and replace with something like "If you get a weird error message, you probably gave a shell account name that's already in use" ?
[04:15:42] https://wikitech.wikimedia.org/wiki/MediaWiki:Signupend can too be deleted, it says the same thing
[04:15:54] thanks
[14:24:30] tgr: freakishly we had that SVN account special case show up in the last week. First one I've seen in 3+ years
[15:20:38] Where can I recover my two factor auth on wikitech with the help of my recovery codes? ^^'
[15:21:05] CFisch_WMDE: usually just input the code where a token is expected
[15:21:10] ahhhh
[15:21:13] and disable the 2fa
[15:21:28] and re-enable it with a new device
[15:21:38] nice, cool thanks :-)
[15:21:44] the codes are single use
[15:21:47] We very obviously need better prompts in the OATHAuth screens for this.
[15:22:11] CFisch_WMDE: so do take care to not use all of them up before setting up a new device
[15:22:20] sure
[15:22:29] I will set it up right now
[15:23:04] if you need further help, please ping me
[15:23:27] !bug 1 | bd808
[15:23:59] Reedy: yeah. its just that I've heard this question like 3 times in the last 2 weeks
[15:24:07] and I haven't fixed it yet
[15:24:19] I may have filed a bug? /me looks
[15:24:49] T189924 -- Adam did!
[15:24:50] T189924: Lost two-factor auth on wikitech should explain alternative reset methods - https://phabricator.wikimedia.org/T189924
[15:24:58] I can put that on my TODO list, bd808
[15:25:55] the OATHAuth workboard is sad. That extension has gotten very little love over its lifetime.
[15:26:06] Indeed :/
[15:26:13] * chicocvenancio sees the comment and is unsure it is as simple as it first seems
[15:26:44] just insert (or a scratch token)
[15:27:11] yeah, that would be the low hanging fruit answer
[15:27:17] yeah, changing the message was my first instinct
[15:27:33] and maybe add a link to some help page on mediawiki.org too...
[15:27:58] That'd be nice
[15:28:25] RoanKattouw: I'm trying to find a comprehensive solution for the security group thing but it's turning into quite the rabbit hole. Did you wind up rebuilding that one VM? If not I can go back to trying to fix it in particular.
[15:28:26] https://www.mediawiki.org/wiki/Help:Two-factor_authentication
[15:28:45] that page could use some love too
[15:28:47] chicocvenancio: thanks, all fine, I am in and reset it and tested it logging out and in again with the new device
[15:29:44] andrewbogott: is the question if they rebuilt it since yesterday or ever?
[15:29:51] since yesterday
[15:30:00] I can see for myself I guess
[15:30:36] cool, just making sure because it was mentioned here it was rebuilt a couple of times before
[16:11:13] Hmm next problem: I can't login to horizon although I just was added to a project there :-/
[16:12:02] CFisch_WMDE: are you getting an error message?
[16:12:17] Just Invalid credentials.
[16:12:35] =o
[16:13:04] account wmde-fisch
[16:13:15] CFisch_WMDE: and you are using your Wikitech username, Wikitech password, and 2FA token?
[16:13:22] yepp
[16:13:28] all that works on wikitech
[16:14:03] CFisch_WMDE: are you entering the username as "WMDE-Fisch"?
[16:14:08] CFisch_WMDE: can you try to clear the cookies in horizon?
[16:14:33] I was a bit unsure about that ... I used wmde-fisch and WMDE-Fisch
[16:14:39] chicocvenancio: will try
[16:17:35] is it case-sensitive with the username?
[16:17:48] I think that it is, yes
[16:18:03] e.g in my 2 auth app it says "Wmde-fisch"
[16:18:12] I know that Horizon is not as forgiving as Wikitech
[16:18:35] in LDAP your cn is WMDE-Fisch
[16:21:13] hmm ... still not working
[16:21:26] Cookies cleared, different browser
[16:22:26] your LDAP account is not showing password failures, so something else is going wrong here
[16:24:34] there are plenty of `Login failed for user "WMDE-Fisch"` events in the horizon error log but no further detail
[16:25:24] since the LDAP record is not showing password failures I'm thinking 2FA?
[16:25:43] maybe I can see something in the wikitech logs about that
[16:26:19] So for two times I deleted the cookies on the login and got a CSRF verification failed. Request aborted. error .... but I think that comes before checking anything else anyway
[16:27:36] thats expected, if you delete the CSRF cookie and don't reload before submitting
[16:27:44] yepp
[16:27:48] I guessed so
[16:28:21] since I just reset 2FA, does it take some time?
[16:28:34] but the cookie thing was just making sure it wasn't an issue we did have in the past with the session cache (was really unlikely, but I figured worth the try)
[16:28:58] CFisch_WMDE: it should not, no. Horizon talks to wikitech for 2FA validation
[16:29:07] hmm really strange then
[16:30:19] I wonder...
[16:30:42] there is a bug in wikitech where we can end up with multiple accounts for the same user that only differ in case
[16:30:55] we can do that in mediawiki
[16:31:11] and I see "WMDE-Fisch" in the LDAP record, but "Wmde-fisch" in some wikitech logs
[16:31:24] * bd808 looks in the backing database
[16:31:31] yeah, my 2FA app also has my username as Wmde-fisch
[16:32:46] and thats like also how it is, when I am logged in to wikitech
[16:32:53] bam! you have at least 2 Wikitech accounts
[16:32:59] o.O
[16:33:18] which one is in the project?
[16:33:39] well that's the fun part
[16:33:58] on the OpenStack side it uses the shell name (uid)
[16:34:09] which is always all-lower case
[16:34:25] do both ldap accounts have the same shell?
[16:34:38] or is one without a shell name?
[16:34:44] there is only one LDAP account, just 2 attached users on wikitech
[16:34:52] huh
[16:35:08] and I'm guessing that the API call from Horizon to Wikitech uses the canonical username from LDAP
[16:35:32] which is not the attached account that CFisch_WMDE has 2FA enabled on
[16:35:58] jepp
[16:36:21] CFisch_WMDE: can you open a phab task? We can fix this, but it's not super pretty to do
[16:36:52] * CFisch_WMDE just logged in as WMDE-Fisch on wikitech
[16:37:06] I guess I can just add 2FA here and it should work
[16:37:25] https://wikitech.wikimedia.org/wiki/User:WMDE-Fisch
[16:37:26] https://wikitech.wikimedia.org/wiki/User:Wmde-fisch
[16:37:30] jepp
[16:37:38] perhaps random question, is there much available compute in cloud services? I'm potentially mentoring a GSOC student training neural nets against our search click logs. Would 2 or 4 instances with 8 cores each training models separately be reasonable to offload into cloud (i would probably have to apply for a new project with quota for that)
[16:37:51] I guess one could simply remove the https://wikitech.wikimedia.org/wiki/User:Wmde-fisch account?
[16:38:01] CFisch_WMDE: that might be the best cleanup. The Wmde-fisch has no edits
[16:38:35] but unfortunately it may come back due to an ugly LDAPAuthentication bug
[16:39:02] do we have a ticket for that bug? :-)
[16:39:33] ebernhardson: "it depends". We are pretty much constantly on the edge of being over subscribed, but we also have 2 servers that are not pooled due to initial setup issues
[16:39:44] CFisch_WMDE: somewhere... I'll find it
[16:40:03] bd808: cool, then I can link it in my ticket
[16:40:25] T165795
[16:40:26] T165795: Ldap auth extension vs. ldap vs. username Case - https://phabricator.wikimedia.org/T165795
[16:43:43] Woho horizon works with the "real" account now :-)
[16:43:54] * CFisch_WMDE creates the ticket
[16:44:06] that was debugging by random luck :)
[16:44:23] bd808: ok, that at least gives me an idea of where we are at. Thanks!
[16:45:31] ebernhardson: file a project request and we will go from there. We try really hard not to turn folks down but may ask that the project be deleted or downsized once GSoC is done
[16:46:42] bd808: well, we haven't even got to the selection phase of GSOC so there is no guarantee it will even be accepted yet. Students are turning in proposals and it looks like this project has 2.
[16:47:14] bd808: but once we decide i'll put something in. Even a single machine ought to be sufficient, others just make exploration easier
[16:48:08] 8 vCPUS is the standard quota
[16:48:53] so a single 8 core machine, or 2 with 4 cores each would fit
[16:48:58] bd808: chicocvenancio https://phabricator.wikimedia.org/T190427 and thanks for the help! :-)
[18:10:08] 08:28:26 RoanKattouw: I'm trying to find a comprehensive solution for the security group thing but it's turning into quite the rabbit hole. Did you wind up rebuilding that one VM? If not I can go back to trying to fix it in particular.
[18:10:27] andrewbogott: I did not end up trying to rebuild it yet, but if I have time today (and otherwise tomorrow) I can try setting up a clone of it
[18:10:34] RoanKattouw: I updated the bug, I think I've unblocked you for now
[18:10:36] and transfer the data
[18:11:05] Basically -- everything was already fixed as of last night, just there's a local firewall on your VM that I'll leave for you to hunt and kill
[18:11:12] Aha I see
[18:11:17] Pfft wtf OK
[18:11:20] I'll look at that
[18:11:21] well, 'everything was fixed' is an exaggeration
[18:11:28] Horizon is still totally broken
[18:11:32] but you should be able to move ahead anyway
[18:11:44] Yeah at least it's now stuck in a state that I like
[18:11:47] RoanKattouw: the local firewall is probably from some role you have used having ferm enabled
[18:11:55] Yeah I'll have to track that down
[18:12:03] It's possible the maps test role has a ferm rule
[19:40:35] !log mobile shutting down instance page-summary-performance (no longer needed)
[19:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mobile/SAL
[19:45:36] !log mobile creating new parsoid-minerva instance for testing mobile styles on Parsoid HTML
[19:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mobile/SAL
[20:15:16] !help seems sometimes when going to https://gerrit-new.wmflabs.org/r/ or https://gerrit-test.wmflabs.org/gerrit/ i get this in the console "Failed to load resource: The network connection was lost."
[20:15:16] paladox: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[20:15:36] i am pretty sure this is not my network
[20:16:05] this seems to happen regularly.
[20:16:11] (only happened a few days ago)
[20:16:25] hmm reloading works now
[20:16:32] though before, it took a couple of hours
[20:16:38] paladox: are those different VMs or the same one?
[20:16:43] same one
[20:17:02] gerrit-test.git.eqiad.wmflabs
[20:17:21] I cannot log into it, so I dunno
[20:17:35] However, I'm not getting that message at least
[20:17:42] so far ;-)
[20:17:53] bstorm_ seems to happen sometimes, though seems to have started working
[20:18:51] paladox: If it happens more then you should probably make a ticket, and also see if you can reproduce it (maybe it happens cloud-wide at certain times?) with a different VM.
[20:19:03] ok
[20:19:05] yeh
[20:19:17] I am starting to think this is haveged
[20:19:20] paladox: so you are reporting that at unknown times there was a network interruption between your client browser and a Cloud VPS instance being reverse proxied through the project-proxy?
[20:19:21] Next time, see if you can make it happen with the network bit of the dev tools on it as well.
[20:19:30] I have a theory ;-)
[20:19:53] nginx reload for log rotation?
[20:20:16] bd808 hmm, im not really sure, though i think so, always seems to happen to me at night time (not during the morning)
[20:20:55] Yup
[20:20:57] It's the log
[20:21:03] I just checked. Disk was full
[20:21:10] fuuuuuuu
[20:21:13] I cleared it, and I have a patch that likely will fix it
[20:21:21] it's in review already
[20:21:35] something changed there in the last few days. maps traffic must be way up
[20:22:49] Yes. Its something called "hikeandbike" or something like that
[20:22:57] Huge amounts of traffic
[20:23:47] oh for fucks sake -- http://hikebikemap.org/
[20:24:20] Yup. I think that's using the tiles
[20:24:27] With some mistakes here and there
[20:24:49] yeah, the info box says "Hosting courtesy WikiMedia Foundation Labs."
[20:24:52] * bd808 digs
[20:24:57] heh
[20:26:21] the domain registration is old, but maybe somebody found/promoted them recently
[20:28:20] apparently this has always been the thing the maps project does -- https://wiki.openstreetmap.org/wiki/Hike_%26_Bike_Map
[20:29:24] I've used it before
[20:30:20] We should probably shift their traffic off of the shared proxy to somewhere else
[20:30:48] we should also have better stats on this and active project contacts :/
[20:31:19] Well, this might be a fine bandaid until then: https://gerrit.wikimedia.org/r/#/c/421321/
[20:31:35] To at least stop things from dying
[20:31:52] But I agree, this is a bit too much
[20:46:04] T190451
[20:46:05] T190451: Find out who maintains http://hikebikemap.org/ - https://phabricator.wikimedia.org/T190451
[20:46:35] If anybody knows the people who may be actively involved in any of this still, updates to that task are welcome
[20:46:43] Can I have a slight quota increase in the phabricator project? That or a separate project - I need to create a swift cluster for testing T182085
[20:46:44] T182085: Connect Phabricator to swift for storage of git-lfs and file uploads. - https://phabricator.wikimedia.org/T182085
[20:47:23] I can get rid of the phabricator-stretch instance if you want twentyafterfour
[20:47:31] i've tested the puppet change
[20:47:32] ?
[20:47:37] paladox: that'd help but I'll still be short by 1
[20:47:40] I need 2 more instances plus ram and cpu to go with them.
[20:47:44] oh
[20:47:50] this is a medium one
[20:47:56] twentyafterfour: https://phabricator.wikimedia.org/project/view/2880/
[20:48:22] or leech off the swift cluster in deployment-prep
[20:49:26] !log phabricator deleting phabricator-stretch4
[20:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Phabricator/SAL
[20:49:35] twentyafterfour done ^^
[20:49:57] bd808: hmm, I didn't know there was one I could leech off of
[20:50:34] I'm pretty sure that deployment-prep has swift for image storage in the modern day
[20:50:42] Krenair built it out I think
[20:50:56] uh partially
[20:51:06] I didn't set up the swift machines themselves
[20:51:15] I did the migration of MW away from NFS to Swift
[20:51:47] I don't know what the node names would be, none of these jump out at me
[20:53:37] ah, deployment-ms-fe02 deployment-ms-be03
[20:54:12] so obvious!
[20:54:16] very
[21:03:40] twentyafterfour, it mirrors the prod machines
[21:03:47] ms, media storage
[21:03:49] fe, front end
[21:03:51] be, back end
[21:04:21] per https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions
[21:39:00] i somehow can't find where to add a security group to a cloud vps instance anymore.
[21:39:39] has it changed recently?
[21:40:03] mdholloway: yes. we upgraded to a new version of horizon last week
[21:40:18] go to the "instances" page
[21:40:34] find the instance you want
[21:40:53] ok
[21:40:59] then all the way on the right click the dropdown arrow next to "associate floating ip"
[21:41:07] then pick "Edit security groups"
[21:41:19] * bd808 is not a huge fan of this UI
[21:41:25] ahhh, that's where it went
[21:41:33] yeah that got significantly less discoverable
[21:41:40] thanks, bd808!
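Editor's note: the host naming scheme explained between [21:03:40] and [21:04:21] (ms = media storage, fe = front end, be = back end) can be sketched as a small decoder. This is only an illustration; the function name `describe_host`, the regular expression, and the assumption that names always look like `<prefix>-<service>-<role><number>` are hypothetical and not taken from the Infrastructure naming conventions page.

```python
import re

# Abbreviations exactly as spelled out in the log; anything else is unknown here.
ABBREVIATIONS = {"ms": "media storage", "fe": "front end", "be": "back end"}

# Hypothetical pattern: <prefix>-<service>-<role><number>, e.g. deployment-ms-fe02.
HOST_PATTERN = re.compile(r"([a-z]+)-([a-z]+)-([a-z]+)(\d+)")


def describe_host(hostname):
    """Split a host name into its parts and expand the known abbreviations."""
    match = HOST_PATTERN.fullmatch(hostname)
    if match is None:
        raise ValueError("unrecognised host name: " + hostname)
    prefix, service, role, number = match.groups()
    return {
        "prefix": prefix,
        # Fall back to the raw abbreviation when it is not in the table.
        "service": ABBREVIATIONS.get(service, service),
        "role": ABBREVIATIONS.get(role, role),
        "number": int(number),
    }


if __name__ == "__main__":
    for name in ("deployment-ms-fe02", "deployment-ms-be03"):
        print(name, describe_host(name))
```

The two node names passed in at the bottom are the ones quoted at [20:53:37]; any other name shape raises `ValueError` rather than guessing.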
[21:41:47] yw mdholloway
[21:42:01] aha
[21:42:19] bd808 that's why im missing T189706
[21:42:19] T189706: Floating Ip panel missing from new horizon update - https://phabricator.wikimedia.org/T189706
[21:42:19] :)
[21:42:27] (was in that tab)
[21:43:55] the drop down should say something like "choose action" instead of having a default value and acting like a button
[21:44:10] bad coder, no cookie
[21:45:20] is it possible to move my bot to a different k8s server? I'm trying to debug something and im running out of ideas to debug
[21:45:47] Zppix: restart the pod
[21:45:52] ok
[21:45:56] it will probably be scheduled somewhere else
[21:48:02] !log tools Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
[21:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:49:13] Zppix: you can attach a shell to the pod by kubectl exec -it
[21:49:27] though it will really lack debugging tools inside
[21:49:39] zhuyifei1999_: my luck i'd make it worse :P
[21:50:09] (^ append /bin/bash to the command)
[21:52:18] !log tools Forced puppet run on tools-proxy-01 for T130748
[21:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:52:20] T130748: Add Content-Security-Policy header enforcing 3rd party web interaction restrictions to proxy responses - https://phabricator.wikimedia.org/T130748
[21:55:18] hmmm... looks like I'd better add an allow for eval() or crappy js code will flood the logs
[22:04:18] !log tools Forced puppet run on tools-proxy-02 for T130748
[22:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:04:21] T130748: Add Content-Security-Policy header enforcing 3rd party web interaction restrictions to proxy responses - https://phabricator.wikimedia.org/T130748
[22:04:24] this is the big one
[22:09:09] zhuyifei1999_: 885 CSP violation reports in the first ... 4 minutes. We will have plenty of data for a wall of shame
[22:09:30] wow
[22:10:44] geohack is the big leader so far
[22:11:23] wp-world just shot to the top...
[22:11:35] this is going to be useful
[22:12:17] where do you see them?
[22:13:00] hmmm... lots of false positives.
[22:13:09] GeoHack loads https://en.wikipedia.org/wiki/MediaWiki:GeoHack.js and other map related stuff
[22:13:24] the wildcards aren't working as expected in the policy
[22:14:01] zhuyifei1999_: no ui yet, but you can query from inside toolforge with curl
[22:14:02] curl -XGET 'http://tools-elastic-03.tools.eqiad.wmflabs/csp-*/_search?pretty' -d '{"size":0,"aggregations": {"top_tools":{"terms":{"field": "tool", "size": 100}}}}'
[22:14:20] curl -XGET 'http://tools-elastic-03.tools.eqiad.wmflabs/csp-*/_search?pretty' -d '{"size":0,"aggregations": {"top_tools":{"terms":{"field": "blocked-site", "size": 100}}}}'
[22:14:30] k
[22:14:36] but as that second one is showing my policy is messed up
[22:14:56] Zppix: do you need help with the debugging?
[22:15:44] chicocvenancio: Well im having an issue where my bot is doing something it should do only once at a certain time, but it is doing it more than it should.
[22:15:50] * zhuyifei1999_ gtg dinner
[22:16:01] and logs werent any help
[22:16:19] what is the relevant code?
[22:16:20] it's gonna be painful if wildcards aren't supported
[22:17:42] they are according to the spec. I'm wondering if this is bad user-agent stuff
[22:18:00] I'm not recording the UA right now :/
[22:18:16] I can filter out the false positives in the data collector
[22:19:18] chicocvenancio: https://github.com/Pix1234/ZppixBot-Source/blob/master/modules/mh_phab.py
[22:20:32] chicocvenancio: more specifically https://github.com/Pix1234/ZppixBot-Source/blob/master/modules/mh_phab.py#L149 and beyond
[22:21:25] Zppix: is it sending the notifications more than once a week?
[22:21:29] yes
[22:21:35] how often?
[22:21:50] oh! this noise may be caused by mixed protocol actually
[22:22:16] chicocvenancio: recently its been multiple times in a single day
[22:22:53] nope. spec says that's not it. bad clients
[22:23:39] bd808: re https://phabricator.wikimedia.org/T190451 - point them at our prod tileserver and block by referer?
[22:24:34] MaxSem: I think they will flip out because the tile set they want isn't there, but ... yeah
[22:24:46] oh well
[22:25:04] doesn't give them the right to screw us in any case
[22:27:03] the tiny bit of research I did this afternoon makes me think this was a thing that was built way back on toolserver. I'm really wondering if anyone pays attention to it anymore
[22:27:29] I only cc'd you in because you are one of the project members still
[22:27:51] I know this is not your problem to fix :)
[22:28:07] hmm
[22:28:32] we're serving dedicated tiles for them
[22:28:44] and I have access to that VM
[22:28:57] * MaxSem grabs his wrecking ball
[22:29:07] part one: --dry-run
[22:31:21] MaxSem: the thing I'd like to do first if just move them out from behind the shared http proxy. that's the choke point right now.
[22:31:27] *is just
[22:31:44] it also chokes our NFS and other stuff
[22:31:54] * MaxSem is in a bad cop mode
[22:32:05] * bd808 hands MaxSem a donut
[22:35:02] they seem to use a dedicated subdomain
[22:35:22] so it seems like it's just a matter of pointing it to a different http proxy
[22:36:50] and then breaking it off, and then dancing on its grave!
[22:39:59] bd808: I wonder if the CSP report will overflow the elasticsearch instances at this rate
[22:41:57] zhuyifei1999_: only 9M saved so far and I'm working on adding the filtering. We'll be fine
[22:42:14] there are many gigs of free space there
[22:42:21] k
[22:43:49] so how do you read the reports?
[22:44:05] * Krenair reads further up
[22:44:05] ah
[22:44:07] Zppix: is that running under sopelbot?
[22:44:10] with the tool that I haven't finished writing :)
[22:45:07] oh, I would be interested in that, too
[22:46:02] what hosts will tools-elastic-03 talk to?
[22:46:22] anything inside toolforge I think
[22:46:51] I'm running the curls from tools-dev
[22:46:59] ok, it was failing from bastion
[22:47:03] will try a tools machine
[22:47:23] yeah, pretty sure you have to be inside the project
[22:47:51] is blocked-site supposed to be showing up stuff like no.wikipedia.org ?
[22:48:04] and en is much higher
[22:48:13] chicocvenancio: yes
[22:48:24] Krenair: no. this is a client bug of some kind
[22:48:36] ah
[22:48:58] Zppix: I see a lot of errors in the log
[22:50:16] chicocvenancio: that scandir error can be ignored
[22:51:05] I see a compilation error
[22:51:24] for scandir
[22:52:01] I've got to run to a meeting, but I'll check back this evening to see if my filter is working. :)
[22:53:05] chicocvenancio: thats because I dont use scandir anymore i just havent gotten around to removing it from requirements
[22:54:00] * chicocvenancio nods
[23:28:43] !log maps blocked a couple of major commercial TOS abusers
[23:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Maps/SAL
[23:40:56] bd808: https://phabricator.wikimedia.org/T190451#4074619
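Editor's note: the two curl commands quoted at [22:14:02] and [22:14:20] differ only in the field being aggregated. A minimal sketch of how the same request body could be built programmatically follows; the URL, index pattern, aggregation name, and field names are copied from the log, while the helper names `terms_aggregation_body` and `curl_command` are made up for illustration. The script only prints the commands and sends nothing, since the log suggests the host is reachable only from inside the Toolforge project.

```python
import json
import shlex

# Endpoint exactly as quoted in the log.
ES_URL = "http://tools-elastic-03.tools.eqiad.wmflabs/csp-*/_search?pretty"


def terms_aggregation_body(field, size=100, name="top_tools"):
    """Build a search body that returns no hits, only the top terms for `field`."""
    return {
        "size": 0,  # suppress individual documents, keep the aggregation only
        "aggregations": {name: {"terms": {"field": field, "size": size}}},
    }


def curl_command(field):
    """Render the equivalent curl invocation as a shell-safe string."""
    body = json.dumps(terms_aggregation_body(field))
    return "curl -XGET {} -d {}".format(shlex.quote(ES_URL), shlex.quote(body))


if __name__ == "__main__":
    # "tool" and "blocked-site" are the two fields queried in the log.
    for field in ("tool", "blocked-site"):
        print(curl_command(field))
```

Building the body as a dictionary and serialising it with `json.dumps` avoids hand-balancing the nested braces that the one-line curl form requires.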