[05:21:57] <[1997kB]`> balloons, yes I followed https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connecting_from_a_Servlet_in_Tomcat
[09:30:59] <[1997kB]> frequent 502 Bad gateway..
[09:32:10] !log admin rebooting cloudvirt1012 to investigate linuxbridge agent issues
[09:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:24:42] !log tools.sal tool webservice detected to be misbehaving, several uncaught exceptions in the source code
[11:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[11:26:30] !log tools.sal trying a simple `webservice restart`
[11:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[11:27:04] !log puppet-diffs syncing facts from puppetmaster1001
[11:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs/SAL
[11:33:12] !log admin disabling puppet and downtiming every virt/net server in the fleet in preparation for merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167 (T262979)
[11:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:33:15] T262979: cloudvirts: the rocky/buster combo has iptables/ebtables issues, producing errors when launching VMs (and probably other stuff) - https://phabricator.wikimedia.org/T262979
[11:36:28] !log admin [codfw1dev] rebooting cloudnet2003-dev to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167
[11:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:38:23] !log admin [codfw1dev] rebooting cloudnet2002-dev to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167
[11:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:40:36] !log admin rebooting cloudnet1004 (standby) to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167 (T262979)
[11:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:40:39] T262979: cloudvirts: the rocky/buster combo has iptables/ebtables issues, producing errors when launching VMs (and probably other stuff) - https://phabricator.wikimedia.org/T262979
[11:46:57] !log admin rebooting cloudvirt1012
[11:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:49:01] !log admin rebooting cloudvirt1039
[11:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:24:12] !log cloudvirt-canary created canary1012-01 VM in cloudvirt1012
[12:30:28] !log admin move cloudvirt1012 and cloudvirt1039 to the ceph aggregate
[12:31:13] stashbot_: ?
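For reference, a host-aggregate move like the 12:30 entry above boils down to a couple of `openstack` client calls on a cloudcontrol node. This is only a rough sketch: the "ceph" aggregate and cloudvirt1012 come from the log entry above, while the source aggregate name "standard" is an assumption that would need checking first.

```
# Sketch only: "standard" is an assumed source aggregate; confirm the real
# one with `openstack aggregate list` before moving anything.
openstack aggregate list                                  # see which aggregates exist
openstack aggregate remove host standard cloudvirt1012    # take the hypervisor out of its old aggregate (assumed name)
openstack aggregate add host ceph cloudvirt1012           # move it into the ceph aggregate, as logged above
```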
[12:32:55] !log tools.stashbot restarting with `webservice restart`
[12:36:22] !log admin move cloudvirt1012 and cloudvirt1039 to the ceph aggregate
[12:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:36:38] !log admin rebooted cloudnet1003 (active) a couple of minutes ago
[12:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:36:57] !log tools.stashbot restarted with `bin/stashbot.sh restart`
[12:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[12:39:21] !log admin root@cloudcontrol1005:~# openstack aggregate add host maintenance cloudvirt1031
[12:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:40:44] !lo admin root@cloudcontrol1005:~# wmcs-drain-hypervisor cloudvirt1031
[12:51:12] !log admin rebooting cloudvirt1013 and moving it to the ceph host aggregate
[12:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:55:15] !log admin rebooting cloudvirt1014 and moving it to the ceph host aggregate
[12:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:56:48] !log cloudvirt-canary creating VM canary1013-01
[12:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[12:58:55] !log cloudvirt-canary creating VM canary1014-01
[12:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[13:02:13] !log admin rebooting cloudvirt1016 and moving it to the ceph host aggregate
[13:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:06:46] !log cloudvirt-canary creating VM canary1016-01
[13:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[13:15:15] !log admin `aborrero@cloudcontrol1003:~$ sudo nova-manage placement sync_aggregates` after reading a hint in nova-api.log
[13:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:25:07] !log cloudvirt-canary deleted a bunch of failed VMs
[13:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[13:27:30] !log admin extend icinga downtimes for another 120 mins
[13:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:28:49] !log admin enable puppet, reboot and pool back cloudvirt1031
[13:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:48:43] bd808: All AnomieBOT's processes seem to be blocked somehow since about 16:25 UTC yesterday. Do you (generic) want to look at it at all, or should I go ahead and try killing the jobs and such?
[15:49:43] !log tools.wikiloves Rebuilding missing datasets
[15:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikiloves/SAL
[15:49:47] anomie: could you get us a dump of your qstat job status to look at?
[15:50:44] Just `qstat`, or should some args be used?
[15:50:52] * bd808 was briefly confused about why `sudo become anomiebot` didn't work... on his local laptop :/
[15:51:23] anomie: trying to remember now. I think we need the xml dump to see interesting details
[15:52:13] anomie: I got a dump. Start fixing your things as needed
[15:53:32] * anomie starts by trying to jstop all the jobs
[15:59:07] anomie: looks like they are stuck in 'd' state. Should I sudo kill them for you?
[15:59:32] bd808: Go ahead
[15:59:59] !log tools.anomiebot sudo qdel -f 2097160
[16:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.anomiebot/SAL
[16:00:48] !log tools.anomiebot sudo qdel -f 2097161 2097166 2097167 336318 417408
[16:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.anomiebot/SAL
[16:01:02] anomie: ready to start them back up I think
[16:03:10] Thanks!
[16:03:19] I'm running startbot.sh now
[16:03:21] !log tools.anomiebot ./startbot.sh
[16:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.anomiebot/SAL
[16:03:57] I think I beat you to it (:
[16:04:18] But the script should be safe, the second attempt will just report "is already running"
[16:08:27] anomie: "scheduling info: job dropped because of user limitations" -- does that sound normal? 5 of the jobs are in that state ("qw")
[16:08:34] ... And, looks like I spoke too soon. We did manage to have double jobs running.
[16:09:13] * anomie qdels everything again
[16:09:44] crud. Yeah 15 are running and the other 5 are queued. I think 15 is the default concurrent job limit these days
[16:11:39] bd808: I have Freenode staff asking about wm-bb and some sort of flood/reconnect issue..
[16:12:04] Waggie: ugh. Did they say what channel?
[16:12:17] Well, flood/reconnect would be any channel that it's in.
[16:12:39] let me see if the bot's logs tell me anything
[16:13:06] Ok, qdeled everything and restarted just once this time ;)
[16:14:13] !log tools.bridgebot Restarting. IRC join failures
[16:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[16:15:13] Waggie: trying a full restart to see if that fixes the bot. I think it gets into strange states when it netsplits
[16:15:28] looks like it at least joined properly here
[16:15:54] Waggie: bot's logs look good now. Thank you for the poke
[16:17:22] Ok, thanks
[16:47:32] !log admin rebooting cloudvirt1032, 1033, 1034 for T262979
[16:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:47:36] T262979: cloudvirts: the rocky/buster combo has iptables/ebtables issues, producing errors when launching VMs (and probably other stuff) - https://phabricator.wikimedia.org/T262979
[16:53:54] !log tools.bridgebot Upgrade to 1.18.3; Add RejoinDelay=5 for IRC channels (T264212)
[16:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[18:25:36] !log tools.listeria restarting the webservice because it's not receiving requests from what I can tell T264219
[18:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.listeria/SAL
[18:25:39] T264219: Toolforge nginx/openresty timeout is too short - https://phabricator.wikimedia.org/T264219
[18:29:16] !log tools depooling tools-sgeexec-0918 so I can reboot cloudvirt1036
[18:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:34:19] !log tools repooling tools-sgeexec-0918
[18:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:11:29] !log tools.refill-api Restarting for T264233
[20:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.refill-api/SAL
[21:12:34] !log tools.bridgebot Another restart for irc flood crash (T264212)
[21:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[22:14:35] !log tools.notwikilambda installed cldr because [[User:Jforrester|Jforrester]] said so :)
[22:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.notwikilambda/SAL
[23:32:32] not sure if anyone’s online at this time but I thought I’d try a quick kubernetes question
[23:32:53] I wrote a deployment.yaml for a continuous job, based on the stashbot example on wikitech
[23:33:03] * bd808 listens
[23:33:07] and after a few syntax errors, `kubectl create -f deployment.yaml` seemed to succeed
[23:33:12] but it’s just not been created for half an hour now
[23:33:28] only one pod in `kubectl get pods`, presumably the webservice
[23:33:40] `kubectl get deployments` shows it as ready 0/1, up-to-date 0, available 0
[23:33:52] is that just the consequence of asking for a lot of memory? (2Gi)
[23:34:10] (notwikilambda tool, fwiw)
[23:35:52] lucaswerkmeister: run `kubectl get deployment.apps/notwikilambda.update -o yaml` and check out the errors it is showing
[23:36:11] ah
[23:36:16] “Invalid value: "2Gi": must be less than or equal to memory limit”
[23:36:19] I thought the limit was 4Gi?
[23:36:41] or do I need to specify a non-default “limits” as well, if I’m specifying a “requests”?
[23:36:42] I think that's the limit for the entire namespace...
[23:38:28] adding limits: memory: "2048Mi" seems to have fixed it
[23:38:31] thanks a lot!
[23:38:45] heh. sounds like a parse issue :)
[23:39:29] huh, I didn’t even notice that the error message said 2Gi while my yaml file says 2048Mi ^^
[23:39:34] anyways, seems to be working
[23:39:38] and will hopefully be more stable than a grid job
[23:40:25] !log tools.notwikilambda migrated `cron` from jstart to k8s, see deployment.yaml
[23:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.notwikilambda/SAL
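For anyone hitting the same error later: the fix discussed above amounts to setting resources.limits.memory alongside resources.requests.memory in the Deployment spec; with only a request, the namespace's default memory limit (presumably from a LimitRange) is lower than the request and the API rejects the spec. The sketch below shows only that shape; the deployment name `notwikilambda.update` is taken from the chat, while the container image, command, and label values are placeholders, not the tool's actual configuration.

```
# Minimal sketch, not the tool's real deployment.yaml: image and command are
# placeholders. The point is that requests.memory and limits.memory are both
# set, which avoids "must be less than or equal to memory limit".
cat > deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notwikilambda.update
spec:
  replicas: 1
  selector:
    matchLabels:
      name: notwikilambda.update
  template:
    metadata:
      labels:
        name: notwikilambda.update
    spec:
      containers:
      - name: update
        image: docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest  # placeholder image
        command: ["/data/project/notwikilambda/update.sh"]                             # placeholder command
        resources:
          requests:
            memory: "2048Mi"
          limits:
            memory: "2048Mi"   # must be >= the request
EOF
kubectl apply -f deployment.yaml
kubectl get deployment notwikilambda.update -o yaml   # inspect conditions if it stays at 0/1 ready
```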