[00:24:54] Earwig: yes, filing a ticket would be good
[03:21:58] Earwig: afaik, we don't currently have the code to increase the memory. are you sure it is running out of memory?
[07:41:17] !log tools.wikibugs Updated channels.yaml to: 62469f2db86d26c599400a55b9a7642ef95ce8d9 Update for Acme-chief project rename
[07:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[07:49:19] thanks wikibugs
[09:21:56] My Python-based ordia Toolforge service does not start. I only get "SIGINT/SIGQUIT received...killing workers...", right after "spawned uWSGI". I do not use a database. I recently upgraded my tools to stretch and I believe I have done it correctly. I have redone the steps.
[09:29:00] Ok. Now found an error.
[09:38:13] I fixed it. It must have been some trusty/stretch issue
[17:52:30] hey
[17:53:05] i can't access tools.wmflabs.org
[17:59:02] ^ works for me, but the trusty-tools tool seems to be down if someone wants to fix that... I think I moved most everything now.
[17:59:47] nothing works for me on *.wmflabs.org but wikipedia.org works well
[18:00:46] f2k1de: what happens when you try to access?
[18:03:35] Hmm. "We're having trouble finding that site. We can't connect to the server at tools.wmflabs.org"
[18:04:36] What address are you using?
[18:05:47] tools.wmflabs.org
[18:06:15] strange… twitter is also not working
[18:06:32] I would guess it's a network issue on your end
[18:10:19] Hi, I have a question about the kubernetes webservice - how can I set environment variables? (LD_LIBRARY_PATH?)
[18:15:53] Hi, is there a list of all IPs used by tools somewhere? My bot is getting 'Closing Link: 185.15.56.1 (Too many user connections (global))' when hitting freenode, so I guess we'll need another I-line request, but it would be nice to fix this for all tools at once.
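[Editor's note: the environment-variable question above went unanswered in-channel. One common workaround, sketched here under the assumption that the tool's entry point can be swapped for a script (the `env-wrapper.sh` name and the `$HOME/lib` path are hypothetical), is a wrapper that exports the variables and then execs the real command:]

```shell
#!/bin/sh
# env-wrapper.sh -- hypothetical wrapper script: export extra variables
# such as LD_LIBRARY_PATH, then replace this process with the real command.
export LD_LIBRARY_PATH="$HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
exec "$@"
```

[Whether the `webservice` tooling of that era accepted an arbitrary entry point is not confirmed here; the wrapper pattern itself is generic.]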
(works fine from the trusty boxes, not so much from stretch)
[18:16:45] (I believe every exec used to have its own IP; we seem to have a NAT box now, at least from random playing without too much thought)
[18:17:34] I suggest a phab task for that. Though a more knowledgeable person may drop the answer here
[18:20:13] cool, I'll do that. Need to come back to a bunch of stuff with this due to the db issues currently anyway. thanks.
[18:24:14] chicocvenancio: i have just an IPv6 address
[18:27:12] Damianz: I'm looking into trusty-tools
[18:29:02] pod status is 'ContainerCreating'?!
[18:29:53] I'll just restart the webservice. the host seems fine
[18:34:13] !log tools.trusty-tools restarted webservice. it still has a phantom pod trusty-tools-909545302-jwrz7 at tools-worker-1010.tools.eqiad.wmflabs which refuses to terminate
[18:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.trusty-tools/SAL
[18:35:00] bd808: in case you want to look into another k8s weirdness ^
[18:35:04] zhuyifei1999_: doesn't it use toolsdb?
[18:35:18] trusty-tools?
[18:35:36] Yeah
[18:35:58] iirc, it uses redis for cache. let me check
[18:37:53] yeah, it uses ldap, redis, and fetches from https://tools.wmflabs.org/grid-jobs/json & https://tools.wmflabs.org/sge-jobs/json
[18:38:05] https://phabricator.wikimedia.org/source/tool-precise-tools/browse/master/precise_tools/__init__.py
[18:38:41] (and related files)
[18:38:47] no toolsdb
[18:40:52] f2k1de, you don't have an IPv4 address?
[18:43:57] nope, Krenair
[18:44:19] f2k1de, well tools.wmflabs.org does not have an IPv6 address
[18:44:29] see what happens when you try pinging it?
[18:45:20] zhuyifei1999_: thanks for checking. I'm just expecting anything that doesn't work to be related to toolsdb right now
[18:45:42] :)
[18:47:06] btw, do you know any reason a pod could be stuck in ContainerCreating?
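[Editor's note: Krenair's point that tools.wmflabs.org had no IPv6 address can be checked from the shell. A sketch; `localhost` is substituted so it runs offline, and against the real name you would pass `tools.wmflabs.org` instead:]

```shell
# Check whether a hostname resolves to any IPv6 address.
# getent also consults /etc/hosts, so this works without external DNS.
host=localhost   # substitute tools.wmflabs.org to reproduce the exchange above
if getent ahostsv6 "$host" >/dev/null 2>&1; then
  echo "$host resolves over IPv6"
else
  echo "$host has no IPv6 address"
fi
```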
[18:47:46] after stopping the webservice it's now stuck in Terminating
[18:48:18] there's one process on the host belonging to tools.trusty-tools
[18:48:47] but it's the /pause that every pod has
[18:51:17] That's the usual pattern if the command is erroring out
[18:51:38] wouldn't that be CrashLoopBackOff?
[18:51:52] Only after a few tries
[18:52:15] this pod has an age of 17h
[18:52:35] Deployment creates pod => pod runs command => deployment sees error, kills pod
[18:52:36] and restarting the webservice works fine
[18:52:42] Rinse, repeat
[18:52:49] hmm
[18:53:44] CrashLoopBackOff is configurable: how many tries within how many minutes is considered a crash loop, and how long it waits
[18:54:15] !log clouddb-services T193264 create VM clouddb-services-01 for PoC of running maintain-dbusers from here
[18:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[18:54:18] T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T193264
[18:54:30] I don't know what the default was in k8s 1.4, or if there were any relevant bugs in that version
[18:54:37] ok
[18:55:13] You should be able to get more details either from the log or the pod details
[18:55:22] Or maybe the deployment details
[18:55:40] which command is that?
[18:55:50] kubectl describe?
[18:55:57] Yes
[18:56:02] ok
[18:56:07] will check
[18:56:17] And kubectl logs
[18:56:59] I usually go `kubectl logs deployment NAME`
[18:57:10] So I don't have to get the pod name first
[18:57:43] ok
[19:04:10] https://www.irccloud.com/pastebin/ngLLxLUf/
[19:04:15] chicocvenancio: umm, ^
[19:05:08] the old broken deployment is gone when the webservice is restarted
[19:08:25] It's not a deployment, it's a ReplicaSet.
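[Editor's note: for reference on the restart backoff discussed above — current Kubernetes documentation gives the kubelet's default as a 10-second wait after the first crash, doubling on each subsequent crash and capped at five minutes; whether k8s 1.4 used the same constants is, as noted in the channel, unverified. A runnable sketch of that schedule:]

```shell
# Default crash-restart backoff schedule (per current Kubernetes docs:
# start at 10s, double each time, cap at 300s; the 1.4-era defaults
# are an assumption).
delay=10
for attempt in 1 2 3 4 5 6 7; do
  echo "restart $attempt: wait ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
# prints waits of 10, 20, 40, 80, 160, 300, 300 seconds
```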
The relevant info would be on `events`
[19:09:11] Try `kubectl describe ReplicaSet trusty-tools`
[19:10:18] https://www.irccloud.com/pastebin/RSfycJ24/
[19:10:20] If there are no events there either, the best bet is to continuously try to tail the logs (`kubectl logs RESOURCE -f`)
[19:10:26] this one seems to be good
[19:11:17] can't seem to access events? (is this the correct method?)
[19:11:19] tools.trusty-tools@tools-sgebastion-07:~$ kubectl get events
[19:11:20] Error from server (Forbidden): Forbidden: "/api/v1/namespaces/trusty-tools/events?limit=500" (get events)
[19:11:57] I usually get events from describe, as I'll need the other info there
[19:12:14] But there might be some permissions blocking that user
[19:12:22] Might try it as root
[19:12:55] Add `-n trusty-tools` to the commands for the namespace
[19:16:17] yeah
[19:16:19] !log clouddb-services T193264 delete VM clouddb-services-01
[19:16:20] https://www.irccloud.com/pastebin/3VERYf9l/
[19:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[19:16:24] T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T193264
[19:17:06] no mention of jwrz7
[19:17:17] (the broken pod)
[19:59:55] Commands like qstat and qdel are timing out for me on the trusty bastion
[20:00:19] ok, well it came through now after a few attempts
[20:08:36] legoktm: fyi, migrated the coverme cron to the stretch bastion
[20:08:50] legoktm: looks like it's hitting a 404 on doc.wm.o, but that's pre-existing.
[21:21:18] !log clouddb-services The slave of labsdb1005.eqiad.wmnet is now clouddb1001.clouddb-services.eqiad.wmflabs
[21:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[21:37:34] zhuyifei1999_: I haven't dug in yet, but I have seen orphaned replicasets before which make weird things happen.
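[Editor's note: the `-n trusty-tools` tip above rounds out a small triage sequence. A sketch — the first two commands need a live cluster and are shown as comments; the filter itself runs on sample `kubectl get pods` output, in which the healthy pod name is invented while jwrz7 is the stuck pod from the log:]

```shell
# Namespace-scoped triage of a misbehaving tool, per the advice above:
#   kubectl -n trusty-tools describe replicaset trusty-tools
#   kubectl -n trusty-tools get events
# Then filter the pod listing down to pods not in Running state:
sample='NAME                           READY  STATUS       RESTARTS  AGE
trusty-tools-909545302-abcd1   1/1    Running      0         5m
trusty-tools-909545302-jwrz7   0/1    Terminating  0         17h'
printf '%s\n' "$sample" | awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
# prints: trusty-tools-909545302-jwrz7 Terminating
```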
Under some rare error condition `webservice stop` can delete the deployment but leave the replicaset behind. It's been so rare that I've never made a bug report for it
[21:39:28] there is only one replicaset remaining... the pod still refuses to terminate
[21:43:54] !log tools.trusty-tools Force deleted pod stuck in Terminating state with `kubectl delete po/trusty-tools-909545302-jwrz7 --now`
[21:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.trusty-tools/SAL
[21:44:16] smacked it with a larger hammer
[21:49:53] bd808: shall I kill PID 8874 on tools-worker-1010? it's the /pause
[21:53:13] 'sudo docker ps' is taking forever. huh
[21:54:49] zhuyifei1999_: yeah... I was trying to check the same thing
[21:57:44] zhuyifei1999_: I'll stop poking there. If you can confirm that container is orphaned by the force kill I did on the related pod, then sure, kill it too
[21:58:17] I'm trying to figure out why docker ps is hanging
[22:02:20] killed 8874
[22:05:36] I just compared -worker-1010 and -worker-1014: the main dockerd thread at -1010 is waiting on a mutex `futex(0x2a92648, FUTEX_WAIT, 0, NULL` while -1014 is in `read(140, `
[22:06:35] don't know why it's reading on fd 140, considering 140 is `dockerd 2720 root 140r FIFO 0,19 0t0 15123497 /run/docker/libcontainerd/7a89cf8cfa838b3b2f88f6871f0f6743610ade4d4d945d3d322330074b570330/init-stderr`
[22:08:45] I'm thinking of restarting dockerd on this host
[22:10:43] !log tools draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why.
also other weirdness like ContainerCreating forever
[22:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:11:27] oh great: `error: replicasets "snapshots-722241724" not found: snapshots-722241724-ymy0n, snapshots-722241724-ymy0n`
[22:11:56] !log tools rebooting tools-worker-1010.tools.eqiad.wmflabs
[22:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:13:13] !log tools.my-first-flask-tool Migrated from Trusty -> Stretch -> Kubernetes
[22:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.my-first-flask-tool/SAL
[22:23:50] !log tools uncordon tools-worker-1010.tools.eqiad.wmflabs
[22:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:33:21] !log tools.mysql-php-session-test Migrated from Trusty -> Stretch -> Kubernetes
[22:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.mysql-php-session-test/SAL
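[Editor's note: the `replicasets "snapshots-722241724" not found` error above is the orphaned-replicaset symptom bd808 described earlier. A sketch of spotting such orphans by name — with a real cluster the two lists would come from `kubectl get deployments` and `kubectl get rs`; here they are sample data drawn from names in this log, so the sketch runs as-is:]

```shell
# Hypothetical orphan check: a replicaset whose name (minus the trailing
# pod-template hash) matches no deployment has likely been left behind.
deployments='trusty-tools'
replicasets='trusty-tools-909545302
snapshots-722241724'
for rs in $replicasets; do
  base=${rs%-*}                      # strip the trailing hash suffix
  case " $deployments " in
    *" $base "*) ;;                  # has a parent deployment, fine
    *) echo "orphan: $rs" ;;
  esac
done
# prints: orphan: snapshots-722241724
```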