[03:50:18] would the "Toolforge webgrid proxy issues" be why my applications are having intermittent connectivity issues with the replica dbs (`SQLSTATE[HY000] [2002] Connection refused`)? by intermittent, I mean only a handful of times today
[03:52:07] musikanimal: no, that's a really really out of date channel banner :/ That notice was about when we broke the ingress http proxies connecting to kubernetes webservices several weeks ago.
[03:54:29] oh okay. I'll create a phab task!
[03:54:53] I'm not sure if there was any pooling/depooling of backend servers for the wiki replicas today, but that could be one cause of failures that we would consider "normal"
[03:55:39] Concurrent load could be another
[03:56:30] it happened about 8 hours ago and again 15 minutes ago
[03:56:43] Do file a bug, but don't get too mad if the response is that we don't have enough data to tell the cause and that apps should be prepared to deal with short term outages
[03:57:04] sure thing
[04:31:31] bd808: https://phabricator.wikimedia.org/T215993 should you be interested
[04:31:45] it wasn't the replicas, but the user databases
[04:32:19] I wonder if it's related to that lockout issue I've been having with XTools (T215993#4950091)
[04:32:19] T215993: Connectivity issues with tools.db.svc.eqiad.wmflabs - https://phabricator.wikimedia.org/T215993
[04:33:01] toolsdb doesn't have pooling, so that's not it
[04:33:48] oh ok
[04:35:05] lock wait timeout is usually a concurrency problem. mysql server has gone away is a super generic error message for "I didn't get an answer I expected"
[04:35:32] yeah, I got the latter one today, first and only time
[04:36:08] https://dev.mysql.com/doc/refman/8.0/en/gone-away.html
[04:36:11] the lock wait timeout has been happening on and off for a while
[04:38:46] lock wait timeout is 100% concurrency. it means that you tried a transactional write that required a lock (row or table level) and that lock was not available in the session- or server-specified max wait time
[04:39:03] we see that in production too from time to time
[04:39:34] yeah, you mentioned that before. I will say that according to the data I have, there's been no significant increase in traffic (and hence writes)
[04:40:00] but it's totally possible there is a "burst" of activity in a short time. The queries are super duper fast though, in the 0.00 sec range
[04:40:09] your usage of the database is not isolated from the usage of everyone else
[04:40:15] right
[04:40:22] so that's my guess
[04:40:35] so it can and will be affected by growing load from other tools
[04:40:43] right
[04:40:50] could that also cause connectivity issues?
[04:40:54] yes
[04:41:04] ah, well that makes sense then
[04:41:15] the databases are set to cap the number of global concurrent connections
[04:41:36] otherwise they would just run out of ram at some point
[04:42:31] basically if the db starts feeling overworked it will try to selectively stop doing some things it is asked to do out of self preservation :)
[04:42:48] yeah
[04:44:27] toolsdb in particular is underpowered right now. We have some new hardware to replace it with, but that hardware has been flaky since it was delivered and we haven't been able to move the work over to it yet as a result
[04:44:49] * bd808 makes a note to check on the hardware tickets tomorrow
[04:46:48] got it. That's a sign of success though, that there's that many tools using toolsdb now!
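The lock wait timeout mechanics bd808 describes can be demonstrated without a MariaDB server. This is a minimal sketch using Python's bundled SQLite as a stand-in: writer A holds the write lock in an open transaction, writer B waits up to its configured timeout and then gives up, analogous to InnoDB raising "Lock wait timeout exceeded" when a row or table lock is not freed within `innodb_lock_wait_timeout`.

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file. isolation_level=None
# gives us manual transaction control; timeout=0.1 is the max time a
# writer will wait for a lock before erroring out (the SQLite
# analogue of innodb_lock_wait_timeout).
path = os.path.join(tempfile.mkdtemp(), "demo.db")
a = sqlite3.connect(path, timeout=0.1, isolation_level=None)
b = sqlite3.connect(path, timeout=0.1, isolation_level=None)

a.execute("CREATE TABLE t (id INTEGER)")
a.execute("BEGIN IMMEDIATE")            # A acquires the write lock
a.execute("INSERT INTO t VALUES (1)")   # uncommitted write

err = None
try:
    b.execute("BEGIN IMMEDIATE")        # B waits 0.1 s, then gives up
except sqlite3.OperationalError as e:
    err = str(e)                        # "database is locked"

a.execute("COMMIT")                     # releasing A's lock unblocks B
```

The failure is purely about concurrency, not query speed: A's individual statements are instant, yet B still times out because A's transaction is holding the lock, which matches the observation above that sub-millisecond queries can still hit lock wait timeouts.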
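bd808's advice that apps should be prepared to deal with short-term outages usually amounts to retrying transient driver errors ("Connection refused", "MySQL server has gone away") with backoff. A minimal sketch of that pattern, in Python for illustration (the tools discussed are PHP); `TransientDBError` and `flaky_query` are illustrative stand-ins, not a real driver API:

```python
import time

class TransientDBError(Exception):
    """Illustrative stand-in for driver errors such as
    (2002) Connection refused or (2006) server has gone away."""

def with_retries(op, attempts=4, base_delay=0.05, sleep=time.sleep):
    """Run op(); on a transient error, back off exponentially and
    retry. Re-raise after the last attempt so real outages surface."""
    for i in range(attempts):
        try:
            return op()
        except TransientDBError:
            if i == attempts - 1:
                raise
            sleep(base_delay * (2 ** i))  # 0.05s, 0.1s, 0.2s, ...

# Example: an operation that fails twice, then succeeds -- like the
# "handful of times today" failures described above.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientDBError("Connection refused")
    return "ok"

result = with_retries(flaky_query, sleep=lambda s: None)
```

Capping the attempts and re-raising matters: if toolsdb is shedding load out of self-preservation, retrying forever just adds to the pile-up.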
[04:47:37] it is :) a "more money, more problems" success, but success nonetheless
[04:48:00] hehe
[04:54:20] "victims of our own success" is wikipedia in a nutshell
[04:56:43] when your objectivist utopia encyclopedia accidentally becomes a public good that disrupts the economy and launches a media revolution 🤷‍♀️💅
[04:58:43] (I do not remember if Wikipedia "launched Web 2.0" or if it was becoming manifest anyway and Wikipedia just happened to be very early.)
[08:11:45] chicocvenancio: you around?
[08:18:59] bd808: FYI toolsdb is now unreachable https://phabricator.wikimedia.org/T215993
[08:24:26] :(
[08:32:42] !log paws switch paws-proxy-02 puppetmaster to labs-puppetmaster.wikimedia.org
[08:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[09:41:58] musikanimal: toolsdb was restarted just now, it should be back now
[13:03:06] Hi, I'd need some quick help with my bot, it's getting "Incorrect username or password entered" or "Please wait 5 minutes before trying again" after launching from shell
[13:03:26] !log tools T216030 switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07
[13:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:03:29] T216030: cloudvps: evaluate draining cloudvirt1018 - https://phabricator.wikimedia.org/T216030
[14:34:57] arturo: FYI, trying to run qstat is giving me an error "denied: host "tools-sgebastion-07.tools.eqiad.wmflabs" is neither submit nor admin host"
[14:35:31] anomie: thanks, could you please open a phab task?
[14:39:00] arturo: T216042
[14:39:01] T216042: qstat from login-stretch.tools.wmflabs.org fails - https://phabricator.wikimedia.org/T216042
[15:01:12] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @milimetric & @amir1 - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[15:06:36] !log `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042
[15:06:37] zhuyifei1999_: Unknown project "`sudo"
[15:06:37] T216042: qstat from login-stretch.tools.wmflabs.org fails - https://phabricator.wikimedia.org/T216042
[15:06:44] !log tools `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042
[15:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:13:07] bstorm_: could you look at this? I can't figure out why it's not recognizing -07
[15:16:22] !log tools `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml|awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042
[15:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:16:25] T216042: qstat from login-stretch.tools.wmflabs.org fails - https://phabricator.wikimedia.org/T216042
[15:17:32] never mind, that works
[15:43:02] Glad something is working :)
[15:50:54] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @milimetric & @amir1 - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[16:24:02] thanks zhuyifei1999_
[16:24:28] (np)
[16:29:50] jem: did you get your bot working?
[18:44:16] Did the ECDSA key for login-stretch change?
[18:56:54] hiyaa
[18:57:08] how can I discover what nodes in horizon project include what puppet roles/classes?
[18:57:53] i'm specifically looking for role::beta::docker_services in deployment prep
[18:59:50] SQL: it did, but I don't think we have updated the wiki yet. We failed over to a different vm instance because of the hardware failures
[19:00:14] bd808: tyvm, I got the warning when I tried to log in, and figured I'd ask before updating ssh
[19:01:49] ottomata: https://tools.wmflabs.org/openstack-browser/puppetclass/ may be what you want, but I actually don't see that class applied there. If it can be applied indirectly that tool does not track that (like role::A includes role::B would only show role::A)
[19:02:15] aye
[19:02:16] hm
[19:02:25] thanks bd808
[19:02:32] yeah I don't think we have any docker stuff in deployment-prep right now
[19:02:42] think I needed to look into that for zotero or something but never got around to it
[19:02:58] i'm going to try it... :p
[19:14:12] !log general-k8s deleting k8s-master-01, k8s-node-03, k8s-node-05, k8s-node-06 as part of the cloudvirt1018 cleanup.
[19:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:General-k8s/SAL
[19:14:14] addshore: ^
[19:15:56] !log striker deleting striker-puppet01 and striker-deploy04
[19:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Striker/SAL
[19:16:01] addshore: I made the executive call that those instances could be deleted rather than trying to move them to a non-failing physical host.
[19:16:48] !log tools deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
[19:16:48] Yup, sounds fine to me
[19:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:16:52] bd808: ^^
[19:17:39] andrewbogott: I'll make a ticket to rebuild those sge nodes, but I think we can wait until things settle down to actually do it
[19:17:45] thanks
[19:17:49] is toolforge being affected by the outage (more specifically tools.wmopbot)
[19:17:57] !log design deleting design-lsg3 — it was shutdown and also was located on failing hardware
[19:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Design/SAL
[19:18:13] Zppix: I wouldn't expect you to notice much
[19:18:28] People running tools on the new grid might see a bit of a slowdown
[19:18:59] Zppix: possibly? we lost a few Stretch grid engine nodes which could have killed or orphaned jobs that were running there
[19:19:24] but if you restart things should be working as far as I know right now
[19:19:45] bd808: im not a maintainer i was asking because i was curious, ill let the maintainers know however
[19:28:38] zhuyifei1999_: Question about quarry-beta-01, is that instance easy for you to recreate from scratch? It is on a physical server we need to take offline and we are trying to find things that are easier/faster to build from scratch rather than rsyncing the disks to a new physical host
[19:34:13] Port ssds
[19:34:25] Poor.......
[19:55:36] bd808: let me check
[19:56:28] it was created by framawiki https://phabricator.wikimedia.org/T209119#4740698
[19:56:54] I don't know if anything valuable is in it (probably unlikely)
[20:26:45] bd808: quarry-beta-01 can be recreated quickly
[20:49:04] framawiki: thanks!
[20:52:51] Hello all!
[20:53:14] Is there some problem with mysqli on tools-sgebastion-07 and the new cloud?
[20:53:41] PHP Fatal error: Uncaught Error: Class 'mysqli' not found
[20:53:57] That worked fine yesterday
[21:01:54] Wurgl: there could be. we had to fail over to a new bastion this morning and although it is in theory the same as the prior one there may be something not quite right about it. Can you file a bug with the details you know? I will try to check the installed php packages after I eat a quick lunch.
[21:02:47] okay
[21:04:22] Assign it to you?
[21:14:17] bd808: https://phabricator.wikimedia.org/T216076
[21:46:31] Wurgl: thanks. I will poke at that and see if I can find out where our Puppet config failed us
[21:52:45] bd808: No stress! I started the jobs on the old cloud, that works fine for me
[21:53:39] andrewbogott: well it took a couple of hard reboots, but IABot's exec node finally came online.
[21:54:02] Cyberpower678: it came up because i fixed it, which was tricky since it kept rebooting while I was working on it :(
[21:54:19] Anyway I'll leave you to it. You'll need to fix up the apt repo a bit
[21:54:35] andrewbogott: apt repo?
[21:54:49] Isn't that part of the OS?
[21:54:56] try 'apt-get update' and you'll see what I mean
[21:55:35] How do I fix that?
[21:55:44] I've never had those issues before.
[21:56:16] I don't know immediately. But puppet is broken there, because apt is broken. And apt is probably broken because of some sort of file corruption
[22:02:46] Do you know if there's anything else on there that's broken?
[22:02:47] Cyberpower678: hey I've got a second to look, what instance is that again?
[22:02:58] cyberbot-exec-iabot-01
[22:03:48] ok let me see if I can make sense of things, otherwise it may be wise to build a new instance so you're not finding oddness for days to come, which is a bummer I know
[22:04:28] Well IABot itself appears to be functioning normally.
[22:07:42] chasemp: anything
[22:08:01] Cyberpower678: gimme 2m
[22:08:23] chasemp: mm
[22:08:25] :p
[22:09:48] Cyberpower678: OK, I have apt and puppet working, it seems like apt got messed up due to underlying disk issues at the time and it's transient. I would strongly suggest having a backup of anything here you love and care about but seems ok atm.
[22:10:13] chasemp: it's just an exec node.
[22:10:28] cyberbot-db-01 is where the memory of all of Cyberbot/IABot lives
[22:17:20] Cyberpower678: makes sense, just wanted to give you a heads up :) I think it's OK tho
[22:19:17] chasemp: I'll start setting up a backup script to upload dumps to my private SSH server.