[00:34:16] !log tools begin building misctools 1.37 using debuild T217406
[00:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:34:19] T217406: Stretch grid problem: cannot migrate tomcat webservice - https://phabricator.wikimedia.org/T217406
[00:38:29] !log tools published misctools 1.37 T217406
[00:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:40:11] (I just don't really have much time to mess with pdebuilder right now if I run into issues)
[00:49:49] !log tools clushed misctools 1.37 upgrade on @bastion,@cron,@bastion-stretch T217406
[00:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:49:52] T217406: Stretch grid problem: cannot migrate tomcat webservice - https://phabricator.wikimedia.org/T217406
[01:02:46] valhallasw`cloud: thank you :)))
[03:36:56] !log tools.meetbot Stopped instance someone started on the Trusty grid
[03:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.meetbot/SAL
[04:01:19] !log tools Cleared error state on a large number of Stretch grid queues which had been disabled by LDAP and/or NFS hiccups (T217280)
[04:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[04:01:23] T217280: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280
[04:12:40] !log tools.meetbot Killed extra meetbot job that grid engine had lost track of after `qdel -f`
[04:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.meetbot/SAL
[04:15:19] !log tools Killed 3 orphan processes on Trusty grid
[04:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[05:06:00] !log tools.cluebotng Killed cbng_bot job stuck in deletion state with 4000+ zombie child processes (T217817)
[05:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[05:06:04] T217817: cluebotng creating thousands of zombie php processes on Stretch job grid - https://phabricator.wikimedia.org/T217817
[05:43:41] bd808: So it appears as if it keeps forking processes and never cleaning them up until the parent is killed by SGE? Or is it just not reaping them fast enough?
[05:45:08] Cobi: I'm not certain if some are getting reaped or not. It's up to 3027 now after running for 40 minutes
[05:45:47] I don't want to screw up the bot, but something here seems very not right to me
[05:46:41] Normally, I'd login to the host in question and watch the pids myself, but I don't think I have access to execute a shell on a specific SGE exec host. Admittedly, I am not as familiar with SGE as I'd like to be, though.
[05:46:58] The bot is supposed to fork per edit, but those forks are supposed to die quickly after they finish processing.
[05:47:08] The bot is then supposed to reap those processes and nothing bad happens
[05:47:32] That last part doesn't appear to be happening from what I can tell from your pastes in the ticket
[05:48:26] Cobi: you can ssh over there easily actually. Become the tool and then just `ssh tools-sgeexec-0927` will get you there
[05:49:09] tools.cluebotng@tools-bastion-03:~$ ssh tools-sgeexec-0927
[05:49:09] Permission denied (publickey,hostbased).
[05:49:18] hmmm...
[05:49:33] ah! wrong bastion
[05:49:48] -03 is the Trusty grid
[05:50:19] You need to start from login-stretch.tools.wmflabs.org
[05:50:29] Ahh, derp. Thanks
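
A minimal sketch of the exec-host inspection described in the exchange above, run from the Stretch bastion (login-stretch.tools.wmflabs.org). The `become` helper and the exec host name come from the conversation; the `ps`/`awk` filter is only an assumption about how one might spot zombie children, not a record of what Cobi actually ran.

    # From the Stretch bastion, become the tool, then hop to the exec host
    # that is running the job (host name as mentioned above):
    become cluebotng
    ssh tools-sgeexec-0927

    # On the exec host, list the tool's processes and keep only zombies
    # (process state 'Z'), plus the header line:
    ps -u "$(whoami)" -o pid,ppid,stat,etime,cmd | awk 'NR == 1 || $3 ~ /Z/'
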
[06:02:15] Yeah, PHP 7.1 changed how signals are handled
[06:02:43] I've adjusted the code.
[06:02:51] Seems to be reaping zombies properly now
[06:04:34] Cobi: nice. :) It looks much better on tools-sgeexec-0940 so far
[06:07:05] Yep :)
[06:08:01] Still usually a zombie or two when running `ps` but that's to be expected with the edit rate on enwiki. It's always a different PID though :)
[06:08:42] yeah. I'm seeing them clean up now. A few is not scary, 4000+ is :P
[06:09:14] Thanks for flagging it up in a ticket.
[06:09:53] thank you for actually responding to the phab spam!
[06:11:30] Well, luckily Phabricator e-mails are low enough volume that I haven't seen it necessary to filter them off out of my inbox, so it popped up in my inbox.
[06:12:06] Had it been a github issue or something similar, I might've missed it.
[08:41:59] Hi cloud-team - I'm experiencing issues with eswiki_p.revision and hewiki_p.revision on labsdb - It seems their schema has changed, namely dropping their rev_comment field - Do we have any more ifo on that ?
[08:42:03] Many thanks :)
[08:45:58] I am not an expert, I don't think rev_comment was existing in general for a long time
[08:46:14] there is a comment table
[08:47:37] joel: https://phabricator.wikimedia.org/T166733
[08:48:15] started 2 years ago, and that is the deploy, the design phase went before that
[08:49:52] indeed the feature request goes back to 2006 :-)
[08:53:54] joal: are you subscrived to cloud-l ?
[08:54:06] I am not
[08:54:11] *subscribed
[08:54:36] https://lists.wikimedia.org/pipermail/cloud/2018-September/000383.html
[08:54:50] Oh actually I am subscribed
[08:55:52] jynus: we use the new rev_comment_id field, but so far (or, as much as we knew), migration was not fully done and some comments were still in rev_comment field
[08:56:38] And we understood that in order not to break schema too much, the rev_comment field was going to be nullified (or emptied), but not remove
[08:58:03] sorry, I don't have the details of cloud views, mention issues on the ticket T212972
[08:58:04] T212972: Remove reference to text fields replaced by the comment table from WMCS views - https://phabricator.wikimedia.org/T212972
[08:59:17] jynus: can you confirm the plan is at the end to drop comment fields (rev_comment, ar_comment etc) from the various needed tables?
[09:08:21] I can say that is indeed the goal on production, but has not been done yet, as I said, cloud is a different beast
[09:08:38] for that mediawiki change, the production contact is platform
[09:08:52] for cloud maintenance, the contact is cloud
[09:09:11] we dbas just maintain the servers being up :-D
[09:10:06] I think there was mentions of compatibility views
[09:10:20] something like revision_compat or something
[09:10:55] but at the expose of potential lower performance if large scans are done
[09:12:08] *expense
[09:15:10] sorry I don't know the details, I am being informed of this, but the above people has the details
[15:09:37] jynus: theres some replag on the replicas know anything about that
[15:11:13] Zppix: it seems to be only one server, try a different one, at least one has 0
[15:11:31] jynus: i cannot im using https://tools.wmflabs.org/guc
[17:22:49] bd808, bstorm_ o/ - if you have time would you mind to review https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/494874/ ? It is for labsdb1012 (the analytics dedicated instance)
[17:26:48] 👋🏻
[17:26:56] I can check it out in a bit, sure
[17:29:06] thanksss!
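
A sketch of the kind of query the rev_comment discussion above (08:41-09:15) points at: reading revision comments through the comment table instead of the dropped rev_comment field, or through the revision_compat view mentioned at 09:10. The join mirrors the ipb_reason_id example quoted later in this log; `sql` is the replica client available on the Toolforge bastions, and the page id and limit are made up for illustration, so treat this as an assumption about the new schema rather than a confirmed recipe.

    # From a Toolforge bastion, query the eswiki replica through the `sql` wrapper.
    # rev_comment no longer carries the text; join the comment table via
    # rev_comment_id instead (field names as discussed above):
    sql eswiki_p <<'EOF'
    SELECT rev_id, rev_timestamp, comment_text
    FROM revision
    LEFT JOIN comment ON rev_comment_id = comment_id
    WHERE rev_page = 12345      -- example page id, made up for illustration
    ORDER BY rev_timestamp DESC
    LIMIT 10;
    EOF
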
[17:35:04] jenkins doesn't like the change due to pre-existing wmf-stlyle issues, will override it before merging
[17:50:35] !log snuggle rebooting snuggle-enwiki-01 (was unreachable)
[17:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Snuggle/SAL
[18:06:33] Hi! I updated my Toolforge tools for the comment changes in T166732 (just added `LEFT JOIN comment ON ipb_reason_id = comment_id` to get the log text), but the tools seem to run much slower now and I've been getting "User 's52256' has exceeded the 'max_user_connections' resource" today.
[18:06:34] T166732: Refactor comment storage in the database and abstract access in MediaWiki - https://phabricator.wikimedia.org/T166732
[18:06:40] Querying the comment table for random IDs is instant, so it seems to be indexed fine. Is there anything special I should take into account when querying the comment table?
[18:11:14] Hello all. Starting from February 12 (after migration on stretch), the bot receives a message about a wrong password at almost every login. I nothing change in the source code fir many years. Is this problem only just me? Maybe in recent times in Mediawiki set limit on max sucessfull login in per X time or something like that? (sorry for my English)
[18:15:06] (Affected SQL queries for my issue: https://hastebin.com/eqavilokuk.sql .)
[18:20:48] Iluvatar_: I have not heard other reports of API authentication problems on the new job grid. Does logging in eventually work for the tool, or does it fail every time? Are you logging the error response from the API and does it provide any clue about rate limiting or other soft errors?
[18:24:35] No, not every time. After several sucess login, bot recieved message about a wrong password and then "too many failed login attempts" (need to login manually and write the captcha).
[18:26:21] Very strange. But thanks for the answer. Only i have that problem. Ok.
[18:26:32] hmm... I can't think of a reason that running on the Stretch grid as an origin would cause requests to vary over time because of the IP address the requests come from. It is possible that some change in the language/library runtime that your tool uses is causing issues though
[18:37:13] Does mediawiki ban (forced captcha) an account or IP on multiple failed login attempts? If my tool does not login, Labs’ ip will not banned? I do not create problems for other tools?
[18:38:05] I will try on old system (not stretch).
[18:40:09] Iluvatar_: I am fairly certain that there are some abuse prevention things in the login flow, yes. Off the top of my head I'm not sure if they are only by ip, only by account, or a mix of both
[18:48:43] I think IPs trying to log in several times and failing to do so are autoblocked by mediawiki, yes
[18:48:56] In fact I think we had a case some weeks ago
[18:49:01] a bot, precisely
[18:49:15] not logged-in on phab right now so cannot fetch
[18:59:37] I assumed that the problem related to someone was trying to login with wrong pass, and others bots on the Labs cant login because same IP. But if only my bot have a problem...
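
Related to the max_user_connections error reported above (18:06): the limit is on concurrent connections per replica user, so slow queries that hold their connection longer make pile-ups more likely. A small sketch of how one might check what is currently open; the `sql` wrapper and the behaviour of a plain SHOW PROCESSLIST (showing only your own user's threads) are assumptions here, not a diagnosis of that particular tool.

    # See how many connections the tool's replica user has open right now:
    sql enwiki_p <<'EOF'
    SHOW PROCESSLIST;
    EOF

    # If several workers or cron jobs each open their own connection, they can
    # collectively exceed max_user_connections; running the per-wiki queries
    # one after another over a single connection keeps the count down.
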
[20:06:01] is it normal that when a page change a category in Wikipedia it delays to be replicated in the cloud replicas?
[20:09:57] MariaDB [ptwiki_p]> SELECT cl_to FROM categorylinks WHERE cl_from = 5259513 AND cl_to LIKE '!Artigos_de_qualidade%'; <--- that returns !Artigos_de_qualidade_2_sobre_física
[20:10:35] but the page is in category:!Artigos de qualidade 3 sobre física, https://pt.wikipedia.org/w/index.php?curid=5259513
[20:13:20] Yes
[20:13:28] Because MW will delay the DB updates too
[20:13:45] Not everything is updated immediately in the database when you click save
[20:14:43] ok, thank you for the explanation
[20:15:42] https://www.mediawiki.org/wiki/Manual:Job_queue if you want to see more
[20:41:20] Reedy: just for curiosity, is there a page that monitor the job queue and shows how many jobs is in the queue?
[20:41:54] https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1
[20:41:59] It's not an exact science though
[20:42:22] thank you
[20:42:55] Hi, just wanted to report that instances are intermittently rejecting my ssh key (i had to do "ssh gerrit-test4" 3-5 times).
[20:43:05] i presume related to the ldap issues?
[20:43:53] this happened yesturday too
[20:44:47] paladox: it is like that the ldap problems are the cause for that. You can look in /var/log/auth.log to try and confirm
[20:44:54] ok
[20:44:54] *likely
[20:55:27] !log gerrit disable puppet on gerrit-test4 (preparing for multi master)
[20:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Gerrit/SAL
[21:13:47] !log git disabling puppet on gerrit-test3 (preparing for multi master)
[21:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL
[21:51:52] andrewbogott: zhuyifei1999_: I am unable to SSH into stretch
[21:52:06] Cyberpower678: could you be more specific please?
[21:52:06] which host?
[21:52:12] $ ssh login-stretch.tools.wmflabs.org
[21:52:12] Connection closed by 185.15.56.48 port 22
[21:53:00] Cyberpower678: it's working for me. Can you try again?
[21:53:34] the log looks like it worked
[21:53:37] andrewbogott: yes. I'm in now. I tried like five times 2 minutes ago. Wierd.
[21:53:52] we're having some intermittent ldap issues, it may be related.
[21:54:00] Ah
[21:54:18] andrewbogott: that won't throw me out of an SSH session will it?
[21:54:27] nope
[21:54:36] Good. I like keeping my connections open
[21:54:56] https://www.irccloud.com/pastebin/Wz7etPlG/
[21:55:06] andrewbogott: I see you recently changed the wikitech:Production shell access with new ssh keys guidelines. Is ecdsa still good, if I may ask? cc bd808 as well.
[21:55:19] probably LDAP instability again?
[21:56:45] !log git setting up multi-site up on gerrit-test3 T217174
[21:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL
[21:56:48] T217174: Deploy multi-site plugin to cobalt and gerrit2001 - https://phabricator.wikimedia.org/T217174
[21:57:03] zhuyifei1999_: oh right I need to upgrade my SSH key right?
[21:57:32] Cyberpower678: your key is rsa so should be good in theory afaik
[21:57:37] I use rsa as well
[21:57:46] * hauskatze uses ecdsa-sha2-nistp521
[21:57:51] hauskatze: I don't know offhand
[21:57:59] zhuyifei1999_: yea but I thought 256 was out. We should be using 1024 bits
[21:58:18] What I like to call a megabit rsa key
[21:58:21] oh that could be relevant
[21:58:24] *kilo
[21:58:24] hauskatze: yeah, ecdsa is ok. Ancient DSS keys and RSA keys smaller than 1024 bit are the problems today
[21:58:53] bd808: so my ecdsa-sha2-nistp521 can stay? is it safe enough?
[21:59:16] (/me thinks ecdsa != dsa, so should be good. there's nothing wrong with dsa in principle, just some weird bureaucratic document forces all DSA keys to be 1024 bit...)
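
A sketch for the key-type question above: `ssh-keygen -l` reports the size and type of an existing public key, which is enough to tell whether it falls under the "ancient DSS or RSA smaller than 1024 bit" cases bd808 mentions; generating a fresh 4096-bit RSA key is shown as one safe option, not the only one. File names are the OpenSSH defaults and may need adjusting.

    # Show the bit size, fingerprint and type of an existing public key:
    ssh-keygen -lf ~/.ssh/id_rsa.pub

    # If it is an old DSA key or a short RSA key, generate a replacement,
    # e.g. a 4096-bit RSA key (and remember to update the key on Wikitech):
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_wmcs
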
[21:59:26] hauskatze: no. It will self destruct taking all of Wikimedia servers with it. Then we will all stare at you for ruining everything.
[21:59:45] Cyberpower678: good thing I have immunity from prosecution :P
[21:59:49] lol
[22:03:23] hauskatze: I'm honestly not qualified to make value judgements about cypher selection. The restrictions I announced in https://lists.wikimedia.org/pipermail/cloud-announce/2019-March/000139.html are choices made by the upstream packagers/maintainers
[22:04:07] bd808: np, thanks for your help. For now I'm not in the 'red flag' cases :)
[22:05:51] * Cyberpower678 is surprised he didn't get flogged by the sysadmins. Wow
[22:08:59] Cyberpower678: huh?
[22:09:18] zhuyifei1999_: if you must know. Check the NFS usage history. That's all I will say
[22:09:34] I dunno how to check :P
[22:09:55] I haven't ran my naughty detector for a long time :P
[22:09:59] zhuyifei1999_: you don't have disk usage historical graphs?
[22:10:19] hey
[22:10:52] we might, but idk where. certainly not ganglia... (if this were 2013 I'd check that)
[22:10:52] zhuyifei1999_: I'll give it to you straight. IABot 2 beta errored out and generated 1.6 TB of error messages.
[22:10:55] * Cyberpower678 runs away
[22:10:55] i have a question to lighttpd on Trusty grid
[22:11:00] oh
[22:11:06] I know that ticket
[22:11:34] didn't someone delete them already?
[22:11:40] f2k1de: yes?
[22:11:48] how can i run my webservice on a newer platform than 14.04?
[22:11:54] zhuyifei1999_: this is different. Unlike the webservice and well everything throwing deprecated error messages, this is something different. IABot is using rewritten code, and well, being beta, it's not very stable yet.
[22:12:02] the wikitech page seems to be outdated
[22:12:21] zhuyifei1999_: this was one file consuming 1.6 TB
[22:12:27] I obviously deleted it
[22:12:32] Just now
[22:12:40] f2k1de: you want to debian 8 jessie or debian 9 stretch?
[22:12:48] * Cyberpower678 goes bowling.
[22:13:17] well, good that you noticed and not someone else deleting them, possibly breaking iabot
[22:13:32] though, did you restart the bot after rm?
[22:13:43] or did you use `truncate`?
[22:13:49] zhuyifei1999_: i just want a lighttpd on debian and not on ubuntu 14.04 AND not Kubernetes
[22:14:25] f2k1de: then just use the same instructions, but ssh into login-stretch.tools.wmflabs.org instead
[22:15:27] okay thanks, i'll try
[22:15:52] `df` isn't exploding so I'm assumng you used either way... which is good
[22:16:13] ^ was for Cyber.power678
[22:26:02] ganglia existed well past 2013
[22:26:20] pretty sure that was still a thing in like 2014-15
[22:46:25] zhuyifei1999_: I used rm and rebooted.
[22:46:27] IABot's cron restarts crashed workers every 5 minutes
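
On the rm/truncate question above (22:13): unlinking a log that a running process still has open does not return the space until the writer exits, which is why the follow-up question about restarting the bot mattered; truncating in place frees the space immediately. A small illustration with a placeholder path, assuming a Linux host where `lsof` is available.

    # Space from an rm'd-but-still-open file only comes back once the writer
    # closes it (e.g. when the bot restarts); until then `df` stays high.
    # Truncating keeps the file handle valid and releases the space at once:
    truncate -s 0 ~/error.log      # placeholder path for the runaway log

    # Deleted-but-still-open files can be spotted with lsof, where available:
    lsof +L1 2>/dev/null | grep -i deleted
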
[23:31:34] !log tools Updated DNS to point login.tools.wmflabs.org at 185.15.56.48 (Stretch bastion)
[23:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:37:12] !log wmflabsdotorg Updated DNS to make tools-login.wmflabs.org a CNAME of login.tools.wmflabs.org
[23:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wmflabsdotorg/SAL
[23:45:12] valhallasw`cloud: I just noticed your condensed topic. +1 although I might put the code of conduct link back at some point
[23:49:25] bd808: the server's fingerprint just changed. Confirming that this is expected?
[23:50:02] Cyberpower678: yes. announced on cloud-announce mailing list and !log'ed here as well
[23:50:20] bd808: cool. Just making sure.
[23:51:00] Cyberpower678: the fingerprints on wikitech were updated as well, so you can verify the change
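
For the fingerprint change just confirmed: one way to clear the stale known_hosts entry and check the new host key against the fingerprints published on wikitech. The log does not say which key types the bastion offers, so the scan below simply takes whatever the server presents; the temp file path is arbitrary.

    # Drop the old host key for the bastion from known_hosts:
    ssh-keygen -R login.tools.wmflabs.org

    # Fetch the current host keys and print their fingerprints, then compare
    # them with the values listed on wikitech before connecting again:
    ssh-keyscan login.tools.wmflabs.org > /tmp/tools-hostkeys
    ssh-keygen -lf /tmp/tools-hostkeys
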