[02:16:02] !log tools.extreg-wos app was serving 500s, deleted pod and it's back
[02:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.extreg-wos/SAL
[11:07:48] !log cyberbot migrating cyberbot-db-01 to cloudvirt1009 in response to T241313
[11:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cyberbot/SAL
[11:07:51] T241313: cloudvirt1013: server down for no reason (power issue?) - https://phabricator.wikimedia.org/T241313
[11:12:59] !log osmit migrating osmit-test to cloudvirt1009 in response to T241313
[11:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Osmit/SAL
[11:13:01] T241313: cloudvirt1013: server down for no reason (power issue?) - https://phabricator.wikimedia.org/T241313
[11:13:24] !log deployment-prep migrating deployment-aqs03 to cloudvirt1009 in response to T241313
[11:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[12:11:28] !log migrating tools-flannel-etcd-02 , tools-worker-1028, tools-worker-1005, many tools-sgewebgrid-lighttpd nodes and several tools-sgeexec nodes to cloudvirt1024 in response to T241313
[12:11:29] andrewbogott: Unknown project "migrating"
[12:11:29] T241313: cloudvirt1013: server down for no reason (power issue?) - https://phabricator.wikimedia.org/T241313
[12:18:28] andrewbogott: Hi. I replaced all mentions of labmon1001 yesterday, but deployment-cpjobqueue continues fataling trying to reach that server. Not sure why it keeps trying to reach that host. Any ideas?
[12:18:47] not sure but I can take a look
[12:19:31] what's an example of a thing trying to talk to labmon1001?
[12:20:03] (also, did you try rebooting the instance by chance? If we don't know what service has old config, that's a good way to force everything to update)
[12:20:26] andrewbogott: I can reboot deployment-cpjobqueue via Horizon if that's okay
[12:20:35] sure, let's see if that helps
[12:20:37] we did sudo service cpjobqueue restart
[12:20:43] but it didn't fix it
[12:20:50] is that the service that's hitting labmon1001?
[12:20:56] yup
[12:21:17] hm
[12:21:21] well, let's try a reboot
[12:21:28] I don't otherwise know much about how the jobqueue works
[12:21:32] https://logstash-beta.wmflabs.org/goto/5f21c9d24cd7ed32e83699e34cba2965
[12:22:20] andrewbogott: there are two reboot options
[12:23:50] !log Deployment-prep Rebooting deployment-cpjobqueue
[12:23:51] hauskatze: Unknown project "Deployment-prep"
[12:23:58] !log deployment-prep Rebooting deployment-cpjobqueue
[12:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[12:24:18] hauskatze: I think for this purpose the two reboots are equivalent
[12:26:02] monitoring on logstash for errors now
[12:29:38] andrewbogott: after the reboot I no longer see the EMERG errors trying to connect to labmon
[12:35:04] great! That was an easy one :)
[12:36:22] andrewbogott: now (just once for now): Error: Local: All broker connections are down
[12:36:51] that might have just been a transitional error while the services were coming up
[12:37:04] if it's anything more serious than that you're going to need help from someone who understands how the services work (which I don't)
[12:37:45] perfect; it looks like the reboot might have solved it
[12:37:50] fingers crossed
[12:38:03] stuff breaking on holidays is the worst ;)
[14:52:16] Please can anybody check NFS on tools-sgebastion-07? Thanks
[14:56:32] * Krenair looking
[14:57:53] yeah something's up with that
[14:58:04] need to sort myself out a root key for this project
[14:58:08] or maybe just globally
[14:58:11] You mean: something's down :-(
[14:58:22] alex@alex-laptop:~/Development/Wikimedia/instance-puppet (master)$ ssh tools-sgebastion-07
[14:58:22] Timeout, server tools-sgebastion-07.tools.eqiad.wmflabs not responding.
[14:58:42] oh I'm logging in now
[14:58:45] slowly
[14:59:30] Waiting now 6 Minutes for ":e!" in vi – and it responded!
[15:01:13] Wurgl, try now
[15:01:31] I did already and finished editing.
[15:01:42] okay but does the server respond normally for other things to you
[15:01:43] Thanks
[15:02:07] seems to be fine
[15:02:08] !log tools.superyetkin Killed a "php test5.php" process that was hogging IO on tools-sgebastion-07
[15:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.superyetkin/SAL
[15:04:18] Are you still logged in?
[15:04:22] Krenair?
[15:04:28] Yes Wurgl
[15:04:42] seems to hang again
[15:06:11] hmm
[15:06:31] !log tools Killed a "python parse_page.py outreachy" process by aikochou that was hogging IO on tools-sgebastion-07
[15:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:07:10] (these were things sitting at the top of `iotop` at 99.99% btw)
[15:07:49] Wurgl, try now?
[15:24:46] (stuff seems to have returned to normal, will email those users)
[15:27:22] Krenair: I'm here, briefly! Seems like you've already done most of what there is to do though, thanks
[15:28:20] andrewbogott, do we normally notify people who run problematic processes like this?
[15:28:26] if so do we have a template for it?
[15:28:38] I do if it's easy but don't have a formal template.
[15:28:43] ok
[15:29:02] Just, "I killed your process because it was going crazy, please check in on IRC before restarting for advice about how to limit the problem," something like that.
[15:29:15] I can also send the emails if you don't feel like it :)
[15:33:57] Krenair, that email looks good, thank you
[15:34:03] I sent them and CC'd you
[15:34:04] thanks
[15:34:37] andrewbogott, logging in as root@ will avoid any NFS usage right?
[15:35:01] yep, at least until you cd into a nfs dir
[15:35:01] it took me a while to get in to have a look at the problem - first attempt timed out even
[15:35:05] sure
[15:35:20] wondering if I should get a key in root_authorized_keys
[15:35:57] they need to be new keys generated for the purpose of cloud root right?
[15:38:38] It's not a hard-and-fast rule but it's a good idea
[17:01:34] andrewbogott: bd808: My DB VM is running out of space and needs a bigger environment. How big can I go?
[17:02:19] Appears cloudvirt1014 just went down
[17:04:23] arturo andrewbogott ^
[17:06:09] I'm here
[17:06:18] Cyberpower678: best to open a ticket, I'm distracted right now
[17:06:23] (sharding is probably the real answer)
[17:06:54] andrewbogott: you mean mount another disk and have MySQL split its files across the two volumes?
[17:07:23] I'm open to that, but was told it wasn't possible last time.
[17:07:39] I'm guessing he means sharding across vms
[17:07:41] Cyberpower678: across multiple VMs
[17:07:55] * Cyberpower678 is not familiar with how to do that.
[17:08:43] https://severalnines.com/database-blog/database-sharding-how-does-it-work
[17:08:59] plenty of stuff online
[17:09:04] How big is your DB btw?
[17:09:17] You might be able to do some compression, remove unused indexes etc
[17:14:07] Reedy: it's at 250 GB
[20:38:06] Krenair: Looks like renames are slowly happening again on beta-cluster
[20:38:16] but are taking loooooooong
[20:38:36] I mean, one I've just actioned is slowly being done
[20:39:24] hello, does anybody know if the citation bot issue (it being down) is being worked on?
[20:54:37] Hi everyone! I haven't been using toollabs in a very long time, and wanted to connect to one of the projects I am a member of. Following the guide on https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances but can't quite figure out what to replace with? Could someone give me a pointer? :)
[21:07:37] frimelle_: If you're connecting to a tool you just need to ssh to login.tools.wmflabs.org and then `become toolname`
[21:08:25] ssh your_shell_username@login.tools.wmflabs.org
[21:10:27] ahhh great, thank you so much!
[21:10:42] frimelle_: I assume it's a tool you're trying to connect to
[21:11:21] Exactly! Worked perfectly :)
[21:11:55] I'm glad to hear that
[21:12:15] Do you know if this is documented somewhere?
[21:15:37] frimelle_: I'm searching on Wikitech, where it was documented; but I cannot find it
[21:16:05] No worries at all :) I will check tomorrow if I can add it somewhere if I can't find it either
[21:17:39] cool; good evening then
[21:19:01] "become" is documented here: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Tool_Accounts#Switch_to_/_become_a_Tool_Account
[22:17:04] !log tools.bookreader Deploying T241489 T241491
[22:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bookreader/SAL
[22:17:07] T241489: bookreader: Allow users to write file name without File Prefix - https://phabricator.wikimedia.org/T241489
[22:17:08] T241491: bookreader: Autocompletion of File Name - https://phabricator.wikimedia.org/T241491