[00:04:29] Could somebody kill all my connections to tools.labsdb? User: u2815
[00:13:26] hey all! I was looking to update my trusty webservices today, but I'm in a bit of a jam.
[00:13:31] $ webservice --backend=gridengine stop
[00:13:32] Your webservice is not running
[00:13:46] but then:
[00:13:47] $ webservice --backend=kubernetes start
[00:13:48] Looks like you already have another webservice running, with a gridengine backend
[00:14:07] and no combination of start/stop from various environments or backends seems to help
[00:14:17] i'm working on tool "montage-dev"
[00:27:51] Dispenser: for the most part, connections are really hard to kill right now until we get the new database replacing it.
[00:28:01] I can try, but they will probably just hang there :(
[00:35:17] It's worth a try. Got `while true; do query-killer.py; done` running to jump in as soon as it can
[00:43:19] Dispenser: they are all stuff like:
[00:43:27] https://www.irccloud.com/pastebin/aTjecFZb/
[00:43:30] Right?
[00:44:01] Yes, kill categorder queries
[00:45:58] Ok. I've killed one... it's hanging in the killed state
[00:46:07] The database is very broken right now
[00:46:33] killed another and it's the same
[00:46:56] * Dispenser goes to disable the script
[00:47:58] Soon enough we will have a new server up and running that we can move everyone to. Made a lot of progress this weekend on that.
[02:23:23] mhashemirc: https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#'webservice_stop'_says_service_is_not_running,_but_'webservice_start'_says_service_is_running
[03:59:17] Krinkle: thanks, I'll take a look
[11:45:18] !log deployment-prep manually start deployment-db03 per Krenair request
[11:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[13:03:52] Hi, does login-stretch.tools.wmflabs.org have issues? I'm unable to get a shell (the server looks overwhelmed).
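The "stop says not running, start says already running" deadlock above is what the linked wikitech page addresses: `webservice` tracks its state in a `service.manifest` file in the tool's home directory, and a stale manifest claiming a gridengine backend can block a Kubernetes start. A minimal sketch of the workaround, with the manifest path and format treated as assumptions; it operates on a stand-in file in a temp directory rather than a real tool home:

```shell
# Stand-in for a tool home directory containing a stale manifest.
workdir=$(mktemp -d)
printf 'backend: gridengine\nweb: lighttpd\n' > "$workdir/service.manifest"

# If the manifest claims gridengine but no grid job is actually running,
# move it aside (keeping a backup) so a fresh start can succeed.
if grep -q 'backend: gridengine' "$workdir/service.manifest"; then
    mv "$workdir/service.manifest" "$workdir/service.manifest.bak"
fi

# On a real tool account one would now run (not run here):
#   webservice --backend=kubernetes start
ls "$workdir"
```

Keeping the backup means the old state can be restored if the manifest turns out to be accurate after all.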
[13:16:13] Urbanecm: checking
[13:16:19] thanks arturo
[13:29:04] I don't see anything special
[13:29:30] the most intensive proc right now is an scp by magnus
[13:33:58] but it's true that the server is a bit slow
[13:38:37] godog
[13:38:41] you around?
[13:39:03] (nevermind)
[13:42:04] SSH handshake happens fast but opening the shell is slow
[13:43:21] this is the hypervisor https://grafana.wikimedia.org/d/aJgffPPmz/wmcs-openstack-eqiad1-hypervisor?orgId=1&var-hypervisor=cloudvirt1029
[13:43:59] opening the shell involves NFS
[13:44:35] I want to reboot the bastion, just in case
[13:44:46] any comments gtirloni ?
[13:45:00] no worries
[13:47:17] !log tools rebooting tools-sgebastion-07 to try to fix general slowness
[13:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:22:16] even closing my ssh session on sgebastion-07 takes a long time
[14:23:11] we only have one sgebastion in toolforge right now
[14:23:21] we should create another VM, just in case
[14:48:39] Hi, does anyone know when the sgebastion slowness will be fixed? Each command takes like 10-25 seconds to complete.
[14:48:42] :/
[14:53:37] Hi! I'm moving my tool to the Kubernetes cluster, but I need node>=4. Can anyone recommend a way to proceed? Thanks!
[15:51:19] !log clouddb-services removing dns and public IP from clouddb1001
[15:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[15:56:38] Hello everyone, I am working on https://phabricator.wikimedia.org/T215523 and would like to know the procedure for upgrading the node version of a beta cluster machine, how to do it, and whether I am able to do it.
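The "handshake fast, shell slow" observation above suggests the delay is in login-time work (profile scripts, NFS home directories) rather than network or auth. A small helper like this can put numbers on it; the `sleep` demonstration is a stand-in, and against the bastion one would time something like `elapsed ssh login-stretch.tools.wmflabs.org true` (handshake only) versus a full interactive login:

```shell
# Time a command in whole seconds, printing only the duration.
elapsed() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Stand-in demonstration; substitute the ssh invocations when diagnosing
# the bastion for real.
t=$(elapsed sleep 1)
echo "took ${t}s"
```

If `elapsed ssh <host> true` is fast but an interactive `ssh <host>` stalls, the cost is shell startup on the far end, which points at the NFS-backed dotfiles mentioned in the channel.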
[16:11:55] !log clouddb-services deleting clouddb-utils-01 VM and associated puppet prefix; we aren't going to run maintain_dbusers here after all
[16:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[16:14:14] I will let you know when I see epantaleo and I will deliver that message to them
[16:14:14] @notify epantaleo The Toolforge kubernetes cluster currently has node v6.11.0.
[16:18:29] mateusbs17: the hard part will be finding/backporting/building the deb packages for nodejs v8. Otherwise we're hoping to jump up to v10 (T210704)
[16:18:30] T210704: Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704
[16:43:20] bd808: no idea how to address that, should I file a phab ticket somewhere?
[16:45:07] mateusbs17: that would be a good place to start. You can check out T203239 as a template
[16:45:08] T203239: Create Debian packages for Node.js 10 upgrade - https://phabricator.wikimedia.org/T203239
[16:46:29] bd808: Thanks!
[17:55:13] !log clouddb-services (jaime T193264) set clouddb1001 in read_only=1
[17:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[17:55:15] T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T193264
[17:55:34] chicocvenancio: are you around?
[17:55:58] On mobile
[17:56:26] arturo: what's up?
[17:57:05] chicocvenancio: we may need to try the new DB setup, and PAWS may be a good testing candidate?
[17:58:11] It does simple, frequent queries, sounds good
[17:58:45] chicocvenancio: ok, stay tuned. Not sure if the testing will happen today, in a bit, or tomorrow, but I wanted to keep you in the loop
[18:12:05] !log clouddb-services T193264 pointing tools.db.svc.eqiad.wmflabs to clouddb1001
[18:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[18:12:11] T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T193264
[18:26:22] !log clouddb-services (jaime T193264) setting clouddb1001 in read_write mode
[18:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[18:26:28] T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T193264
[18:34:41] chicocvenancio: when you want, try it
[18:37:10] Do we have any confirmation here of toolsdb working now?
[18:37:25] Ah yes... PAWS, please test that!
[18:38:05] I am trying to repopulate the tool-db-usage cache. So far it hasn't crashed
[18:39:57] https://tools.wmflabs.org/tool-db-usage/ has data again!
[18:40:49] !log tools.tool-db-usage Re-enabled cache refresh cron jobs
[18:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.tool-db-usage/SAL
[18:50:21] !log tools moving paws back to toolsdb T216208
[18:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:50:25] T216208: ToolsDB overload and cleanup - https://phabricator.wikimedia.org/T216208
[18:50:34] thanks chicocvenancio !
[18:50:52] it's looking fine, btw
[18:52:44] yeah, also htop running on the new DB shows the white wolf is alive again
[19:01:38] !log clouddb-services labsdb1004 is now a replica of clouddb1001, which is toolsdb now.
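The SAL entries above flip clouddb1001 from `read_only=1` to read-write around the DNS cutover of `tools.db.svc.eqiad.wmflabs`. A minimal sketch of that sequence, assuming a standard MariaDB client and admin credentials on the host (the hostname is from the log; none of this is run here):

```shell
# Freeze writes before repointing DNS, so nothing lands mid-cutover.
mysql -h clouddb1001 -e "SET GLOBAL read_only = 1;"

# ... repoint tools.db.svc.eqiad.wmflabs and wait for clients to follow ...

# Re-enable writes once traffic is arriving at the new address.
mysql -h clouddb1001 -e "SET GLOBAL read_only = 0;"

# Sanity check: 0 means the server accepts writes again.
mysql -h clouddb1001 -e "SELECT @@GLOBAL.read_only;"
```

`read_only` only blocks writes from non-privileged accounts, which is why replication from the new primary (and admin work) can continue while it is set.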
[19:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[20:03:32] !log tools pointed tools-db.eqiad.wmflabs to 172.16.7.153
[20:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:12:29] !log admin ran `labs-ip-alias-dump.py` on cloudservices/labservices servers
[20:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[20:21:02] !log admin downtimed cloudvirt1020
[20:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[20:22:06] !log tools enabled toolsdb monitoring in Icinga
[20:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:45:43] !log tools upgraded and rebooted tools-sgebastion-07 (login-stretch)
[20:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:31:31] any idea why $ ssh tools-login.wmflabs.org gives
[21:31:31] Permission denied (publickey,hostbased).
[21:31:40] when it just worked immediately after?
[21:34:24] Platonides: checking logs
[21:34:48] do you have a timestamp of that happening?
[21:37:41] ok I see
[21:38:36] https://www.irccloud.com/pastebin/WxzxVb4J/
[21:38:51] Platonides: ^
[21:39:05] no idea why ssh-key-ldap-lookup failed
[21:39:49] weird
[21:40:12] you may want to keep an eye on these failures
[21:48:00] !log quarry Deployed 6bda39e on -web-01 T215831
[21:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[21:48:05] T215831: Show query run date above outputs section - https://phabricator.wikimedia.org/T215831
[22:13:58] Platonides, zhuyifei1999_: I have seen a few reports in the last ~24 hours that make me think the LDAP server has been having some hiccups.
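The failure mode above fits the LDAP-hiccup theory: the bastions resolve users' public keys at login time via an AuthorizedKeysCommand (`ssh-key-ldap-lookup`), so a transient LDAP timeout makes sshd see no keys and reply `Permission denied (publickey,hostbased)`. A hedged diagnostic sketch to check whether LDAP itself is answering; the server name, base DN, and attribute name are assumptions, and it is not run here:

```shell
# Query the user's public key straight from LDAP; an empty or timed-out
# result would explain sshd rejecting an otherwise-valid key.
ldapsearch -x -H ldap://ldap-labs.eqiad.wikimedia.org \
    -b 'ou=people,dc=wikimedia,dc=org' \
    '(uid=platonides)' sshPublicKey
```

If this intermittently hangs or returns nothing while the entry is known to exist, the problem is on the LDAP side rather than in the user's keys or the bastion's sshd configuration.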
[22:14:21] ok
[22:15:02] that can break ssh, sudo, and other things that rely on LDAP for authn/authz
[22:15:36] * bd808 wonders if he actually has access to the LDAP servers themselves
[22:15:40] probably not
[22:18:53] bd808: I guess that if you're in the wmf ldap group you might
[22:19:04] ops ldap most certainly I'd say
[22:19:47] shell access is managed outside of LDAP in the production realm
[22:20:10] I haven't looked up those servers yet, but I would guess that they are root-only at the moment
[22:20:28] that's pretty common in our core network
[22:20:34] it makes snese
[22:20:38] *sense
[22:28:31] I was looking at https://wikitech.wikimedia.org/wiki/LDAP/Groups