[00:37:32] Hello, Amitie 10g here
[00:38:26] Has anyone noticed a slowdown in the Bastion server?
[00:48:54] Apparently restored
[00:51:33] hi Davod
[00:51:53] if this is the tools bastion server, sometimes other users run things they perhaps shouldn't (in terms of resource usage) there
[00:54:16] you might try dev.tools.wmflabs.org
[00:54:37] (aka stretch-dev.tools.wmflabs.org)
[00:54:45] Ahm, I'm somewhat worried I made a mistake
[00:55:41] what's up?
[00:56:04] I'm actively testing my bot, but I left everything to the Grid.
[00:59:53] Sorry, I'm not sure I understand
[01:05:03] By the looks of things none of your tools are running anything on the Grid right now?
[01:16:07] I have to go now, sorry
[14:00:54] !log admin starting scheduled maintenance: upgrading eqiad1 from openstack mitaka to newton
[14:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:07:09] !log admin horizon is disabled for maintenance (T212302)
[14:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:07:12] T212302: CloudVPS: upgrade: jessie -> stretch & mitaka -> newton - https://phabricator.wikimedia.org/T212302
[16:03:30] ssh to login.tools.wmflabs.org does not work. Is there some message I have overlooked?
[16:03:59] fnielsen: could be https://lists.wikimedia.org/pipermail/cloud-announce/2019-October/000215.html
[16:04:05] (not sure)
[16:04:25] Ok. I have just discovered it.
[16:09:06] fnielsen: looks like we are having some issues with DNS lookups right now. We think it is related to the OpenStack maintenance. Still looking into it.
[16:24:20] fnielsen: working for you now?
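Editorial note: a quick way to tell whether a DNS lookup problem like the one mentioned above is still ongoing is to try a lookup by hand. The sketch below is only an illustration using Python's standard `socket` module; the helper name `can_resolve` and the default port are assumptions made for this example, not anything taken from the log.

```python
import socket


def can_resolve(hostname, port=443):
    """Return (True, [addresses]) if the name resolves, else (False, reason)."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # gaierror is what the resolver raises for lookup failures.
        return False, str(exc)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual IP address.
    addresses = sorted({info[4][0] for info in infos})
    return True, addresses


# A literal IP address needs no DNS server, so it is a useful baseline:
# if this works but real hostnames fail, the resolver is the likely culprit.
ok, result = can_resolve("127.0.0.1")
print(ok, result)
```

Calling `can_resolve("login.tools.wmflabs.org")` from an affected host would then show either the resolved addresses or the resolver's error text.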
[16:27:58] ty anyway, i'll check later
[16:28:08] lookup of ssh public keys from the LDAP directory is failing
[16:39:55] !log tools `sudo service nslcd restart` on tools-sgebastion-08
[16:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:41:51] !log tools reboot tools-sgebastion-07
[16:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:46:31] !log tools `sudo shutdown -r now` for tools-sgebastion-08
[16:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:49:00] !log tools rebooting tools-sgeexec-0901 and tools-sgeexec-0909/10/11
[16:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:50:39] !log tools rebooting tools-sgeexec-0915/18/19/23/26
[16:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:53:34] !log tools rebooting tools-sgewebgrid-generic-0902/4
[16:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:58:11] !log tools rebooting tools-sgewebgrid-lighttpd-0902/4/6/7/8/19
[16:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:01:44] !log tools rebooting tools-sgegrid-master and tools-sgegrid-shadow 😭
[17:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:16:09] !log tools.sal kubectl delete po/sal-1981382867-71k6j
[17:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[17:17:21] !log tools.versions kubectl delete po/versions-1108183015-r5bht
[17:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.versions/SAL
[17:28:48] !log tools.zppixbot kubectl delete pods --all
[17:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[17:30:57] !log tools reboot tools-sgewebgrid-lighttpd-0923/24/08
[17:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:32:02] !log tools reboot tools-sgewebgrid-lighttpd-0912
[17:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:35:13] !log tools drained and uncordoned tools-worker-100[1-5]
[17:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:43:34] !log tools reboot tools-worker-1002.tools.eqiad.wmflabs due to nfs stale issue
[17:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:47:19] !log tools reboot tools-worker-1004 due to nfs stale issue
[17:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:47:24] !log tools reboot tools-worker-1005 due to nfs stale issue
[17:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:51:01] I'm assuming the scheduled maint. is why there are also issues with cron?
[17:55:51] hare: That's one grid node I didn't check. Looking
[17:56:42] Essentially the job grid is complaining that the scripts it's trying to run can't be found. Meaning that either NFS isn't being mounted somewhere or someone went and hosed my tool account :)
[17:57:00] NFS was having issues across the grid. I've rebooted most of the nodes.
[17:57:03] Missed that one
[17:57:10] It was the maintenance that caused it
[17:57:49] Aha. Happy to help.
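Editorial note: the "scripts can't be found" symptom above can come either from a path that is genuinely missing or from a stale NFS handle. A rough sketch of how a job could tell the two apart is shown below. It assumes a Linux-like system where the kernel reports stale handles as `ESTALE`; the function names and the returned labels are made up for this example.

```python
import errno
import os
import tempfile


def classify_os_error(exc):
    """Map an OSError to a short label (labels are illustrative only)."""
    if exc.errno == errno.ESTALE:
        return "stale"      # stale NFS file handle: a remount or reboot is likely needed
    if exc.errno == errno.ENOENT:
        return "missing"    # the path really does not exist
    return "other"


def check_path(path):
    """Return "ok", "stale", "missing" or "other" for the given path."""
    try:
        os.stat(path)
    except OSError as exc:
        return classify_os_error(exc)
    return "ok"


# Demonstration on local paths only; a real stale handle cannot be produced here.
demo_dir = tempfile.mkdtemp()
print(check_path(demo_dir))
print(check_path(os.path.join(demo_dir, "does-not-exist")))
print(classify_os_error(OSError(errno.ESTALE, "Stale file handle")))
```

A cron wrapper could run such a check first and exit quietly on "stale" instead of sending one failure mail per run.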
[17:57:57] !log tools reboot tools-worker-1006 due to nfs stale issue
[17:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:58:02] !log tools reboot tools-worker-1007 due to nfs stale issue
[17:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:58:08] hi hare :) Thanks for the poke
[18:07:54] !log tools reboot tools-worker-1008 due to nfs stale issue
[18:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:08:01] !log tools reboot tools-worker-1009 due to nfs stale issue
[18:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:12:22] !log tools reboot tools-worker-1010 due to nfs stale issue
[18:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:12:25] !log tools reboot tools-worker-1011 due to nfs stale issue
[18:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:21:44] !log tools reboot tools-worker-1012 due to nfs stale issue
[18:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:23:13] !log tools reboot tools-worker-1013 due to nfs stale issue
[18:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:32:12] !log tools reboot tools-worker-1014 due to nfs stale issue
[18:32:22] !log tools reboot tools-worker-1015 due to nfs stale issue
[18:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:34:40] !log tools reboot tools-worker-1016 due to nfs stale issue
[18:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:34:43] !log tools reboot tools-worker-1017 due to nfs stale issue
[18:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:38:28] !log openstack cleanup fullstack VMs created during Newton upgrade
[18:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Openstack/SAL
[18:46:30] !log tools reboot tools-worker-1018 due to nfs stale issue
[18:46:33] !log tools reboot tools-worker-1019 due to nfs stale issue
[18:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:46:47] !log tools reboot tools-worker-1020 due to nfs stale issue
[18:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:46:52] !log tools reboot tools-worker-1021 due to nfs stale issue
[18:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:47:25] !log tools reboot tools-worker-1022 due to nfs stale issue
[18:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:47:27] !log tools reboot tools-worker-1023 due to nfs stale issue
[18:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:55:27] !log tools reboot tools-worker-1025 due to nfs stale issue
[18:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:55:31] !log tools reboot tools-worker-1026 due to nfs stale issue
[18:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:55:36] !log tools reboot tools-worker-1027 due to nfs stale issue
[18:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:55:40] !log tools reboot tools-worker-1028 due to nfs stale issue
[18:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:00:42] !log tools reboot tools-static-12 tools-docker-registry-04 and tools-clushmaster-02 due to NFS stale issue
[19:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:04:53] !log tools reboot tools-worker-1029 due to nfs stale issue
[19:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:06:03] !log tools reboot tools-docker-registry-03 due to nfs stale issue
[19:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:06:37] !log tools reboot tools-mail-02 due to nfs stale issue
[19:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:07:57] !log tools reboot tools-paws-worker-1002 due to nfs stale issue
[19:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:08:07] !log tools reboot tools-sge-services-04 due to nfs stale issue
[19:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:08:51] !log tools reboot tools-sgebastion-09 due to nfs stale issue
[19:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:09:32] !log tools reboot tools-sgebastion-0test due to nfs stale issue
[19:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:10:22] !log tools reboot tools-puppetmaster-02 due to nfs stale issue
[19:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:12:35] !log toolsbeta reboot toolsbeta-sgecron-01 toolsbeta-sgewebgrid-generic-0901 toolsbeta-sgewebgrid-lighttpd-0901 due to nfs stale issue
[19:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[19:14:04] !log huggle reboot huggle-wl due to nfs stale issue
[19:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Huggle/SAL
[19:14:59] !log testlabs reboot canary1008-01 due to nfs stale issue
[19:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL
[19:15:46] !log tools reboot tools-worker-1030 due to nfs stale issue
[19:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:15:49] !log tools reboot tools-worker-1031 due to nfs stale issue
[19:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:15:52] !log tools reboot tools-worker-1032 due to nfs stale issue
[19:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:15:54] !log tools reboot tools-worker-1033 due to nfs stale issue
[19:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:15:58] !log tools reboot tools-worker-1034 due to nfs stale issue
[19:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:16:01] !log tools reboot tools-worker-1035 due to nfs stale issue
[19:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:16:04] !log tools reboot tools-worker-1036 due to nfs stale issue
[19:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:16:06] !log tools reboot tools-worker-1037 due to nfs stale issue
[19:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:16:09] !log tools reboot tools-worker-1038 due to nfs stale issue
[19:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:16:13] !log tools reboot tools-worker-1039 due to nfs stale issue
[19:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:16:16] !log tools reboot tools-worker-1040 due to nfs stale issue
[19:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:18:25] !log tools reboot tools-paws-worker-1006 due to nfs stale issue
[19:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:20:04] !log tools reboot tools-k8s-master-01 due to nfs stale issue
[19:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:23:36] !log catgraph Rebooting fridolin.catgraph.eqiad.wmflabs for stale NFS mounts
[19:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph/SAL
[19:25:23] !log tools deleted tools-puppetmaster-02
[19:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:26:20] !log catgraph Rebooting sylvester.catgraph.eqiad.wmflabs for stale NFS mounts
[19:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph/SAL
[19:27:45] !log fastcci Rebooting fastcci-worker1.fastcci.eqiad.wmflabs for stale NFS mounts
[19:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Fastcci/SAL
[19:28:33] I'm getting inundated with Cron Daemon emails. What's going on?
[19:28:37] !log fastcci Rebooting fastcci-worker2.fastcci.eqiad.wmflabs for stale NFS mounts
[19:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Fastcci/SAL
[19:29:09] And it's the same one. It can't cd into a directory.
[19:30:10] Cyberpower678: me too, saying it can't find files.
[19:30:29] Cyberpower678: probably similar to the NFS issues other instances are having. Are you getting these from Toolforge crons or somewhere else?
[19:30:40] Toolforge.
[19:30:51] is there a hostname in the email message?
[19:30:56] Toolforge here too
[19:31:21] Getting cron failures on xtools too though. Different error. It can't resolve hostnames.
[19:31:27] That's a VPS
[19:31:40] https://img.hosted.gq/Seke3/reFUTEDoQA646.png/raw <---- all I get on toolforge
[19:31:53] we have been doing reboots to fix NFS across a lot of projects but probably haven't found them all yet
[19:32:03] Cyberpower678: are you getting recent dns failures on xtools, or just leftovers from an hour or two ago?
[19:32:21] The latest error was 6 minutes ago.
[19:32:43] it's now saying it can't reach GitHub (this is the auto-deploy script)
[19:33:04] everything appears to be operational there
[19:33:21] musikanimal: on an xtools instance?
[19:33:34] bd808: yep
[19:33:36] yeah, xtools-prod06 and -prod07
[19:33:49] DNS was very much broken for a while today, but we think it is all working now
[19:34:14] I am similar errors from eventmetrics-prod02
[19:34:18] *I am getting
[19:34:25] can you reproduce any of the dns issues on the cmdline?
[19:34:36] I shall try
[19:35:13] A simple `git fetch origin master` worked without error on xtools-prod06
[19:35:27] bd808: can we get tools.iabot@tools-sgecron-01 working again? I don't have the means to delete the crontab atm, but my phone keeps going off.
[19:36:16] tools.eranbot seems to be having issues with NFS too, maybe... `[Errno 116] Stale file handle: '/data/project/eranbot/outs'`
[19:36:26] Cyberpower678: looking. I have a hope that you are just getting mail bombed from things that failed in the past but are working now, but trying to confirm
[19:36:30] last error < 1 minute ago
[19:36:46] okay, maybe same for me then
[19:36:51] The mail bombing started about 10 minutes ago?
[19:36:51] musikanimal: if you can stand it, a reboot should resolve any nfs issues
[19:38:52] so I'm getting root@ mail timestamped e.g. 15:40:40
[19:39:07] which suggests the mail queue is very behind
[19:39:21] Cyberpower678: I ssh'ed to tools-sgecron-01, became iabot, and submitted a grid job. Worked as expected, so I think emails are likely just backed up from past failures
[19:39:33] Krenair: ouch
[19:39:34] ....
[19:39:47] if people are still receiving error mail about stale NFS handles, I suggest opening the original mail with headers and looking for the Date: line to find out the actual timestamp
[19:40:00] bd808: can you dump the mail queue then?
[19:40:14] current UTC time: 19:40
[19:40:44] Mine are dated at time of inbox arrival though.
[19:41:16] that explains it. I'm getting all these emails in very quick bursts; quicker than the cron is actually set to run
[19:42:07] I am getting a lot of cronspam right now that was generated ~2 hours ago
[19:42:56] we did reboot the tools mail server not that long ago
[19:42:59] I wonder...
[19:43:21] probably related
[19:43:29] :/
[19:43:29] it would be looking at NFS for .forward files
[19:43:30] for me the flood started around about 20:07
[19:43:48] Yeah here it is, from SAL:
[19:43:49] 19:06 Krenair: reboot tools-mail-02 due to nfs stale issue
[19:44:03] so probably me fixing that host unleashed the mail that had been queueing up for hours
[19:44:04] Is there some way to purge the queue of these emails?
[19:44:33] Cyberpower678: not without possibly dropping other mail, but I'm looking around...
[19:44:38] (my 20:07 is BST, which would make it 19:07 i.e. a minute after the SAL entry)
[19:44:44] ((Sorry.))
[19:47:04] Something is wrong with NFS host ontools-sgecron-01 I got two mails (from tools.wikihistory & tools.persondata) about a stale NFS handle. Both jobs run with -jlocal
[19:47:28] … NFS on host …
[19:47:39] Wurgl: join the club.
[19:47:59] Wurgl: known. Issue seems to be fixed, but there are a bunch of emails going out that were stuck while NFS was broken
[19:48:10] NFS is fine now. The mail server was backed up due to a DNS failure, and now it's spewing out every email it queued up in the last few hours.
[19:48:34] 937 more in the queue
[19:48:53] bd808: what's the send rate?
[19:49:01] Well, at least there is someone sending me a mail ;^) Think positive!
[19:50:38] Cyberpower678: "fast" 658 now
[19:51:23] So about 5 more minutes of spamming to go
[19:53:10] About 2 minutes left
[19:55:27] bd808: It should be empty now. I stopped getting spammed
[19:55:44] I got only 4 Mails
[19:55:58] How many did you get, Cyberpower678?
[19:56:13] Wurgl: 142
[19:56:23] Oh! Fine
[19:56:34] Coming in one at a time so I got 142 pings on my phone.
[19:57:20] You did not sleep? Or did it wake you up?
[19:57:20] Cyberpower678: in part, that might be a sign that you have a lot of high frequency cron jobs running ;)
[19:58:00] bd808: I wouldn't call every 5 minutes high frequency.
[19:58:46] Hmm … sleep 300 is an option?
[19:59:48] Wurgl: it's actually designed to respawn a crashed script.
[20:00:04] 437 :|
[20:00:15] Krenair wins
[20:00:26] yes but I cheated
[20:00:36] that's root@ :P
[20:00:37] :O How?
[20:00:44] Oh lol
[20:01:49] Wurgl: my scripts use sleep 60.
[20:02:14] But the job commands are phrased so that duplicate jobs aren't spawned
[20:13:21] !log tools Dropped backlog of frozen messages for delivery (240 dropped)
[20:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:16:15] !log tools Dropped backlog of messages for delivery to tools.mix-n-match
[20:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:17:21] !log tools Dropped backlog of messages for delivery to tools.usrd-tools
[20:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:15:14] bd808: out of curiosity when all the issues occurred due to the update, did any k8s hosts go down, or lose connection to the network?
[21:20:31] Zppix: "it depends" is the best answer I can give. The networking issues we noticed and fixed were not complete loss of networking. They were more firewall issues talking to various services hosted outside of the Cloud VPS IP space.
[21:21:47] None of the Toolforge instances crashed, but nearly all ended up being restarted as the quickest and easiest way to clear up side effects from the DNS, LDAP, and NFS interruptions.
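Editorial note: the respawn pattern described above (a frequent cron entry that restarts a crashed script without spawning duplicates) can be sketched with an advisory file lock. The log does not show how the jobs are actually phrased, so this is a generic illustration only: the lock file name and the function name `try_acquire_lock` are hypothetical, and it assumes a Unix system where `fcntl.flock` is available.

```python
import fcntl
import os
import tempfile


def try_acquire_lock(lock_path):
    """Return an open file descriptor holding the lock, or None if it is taken."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


# Hypothetical lock file inside a fresh temporary directory.
lock_path = os.path.join(tempfile.mkdtemp(), "example-bot.lock")

first = try_acquire_lock(lock_path)    # the running instance
second = try_acquire_lock(lock_path)   # what a later cron run would see
print(first is not None, second is None)

os.close(first)                        # a crash or exit releases the lock
third = try_acquire_lock(lock_path)    # the next cron run can now respawn
print(third is not None)
os.close(third)
```

Because the kernel drops the lock when the holding process exits, a crashed script never leaves a stale lock behind, which is what makes a short cron interval safe.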
[22:08:36] !log tools.iabot Restarted webservice, it was 502ing
[22:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.iabot/SAL
[22:49:47] !log tools.lexeme-forms deployed ce8ba2b234 (add plural grammatical feature to Ukrainian plurale tantum forms)
[22:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
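Editorial note: the suggestion made during the mail flood (around 19:39) to check the `Date:` header of a delayed message can be automated with Python's standard `email` package. The sketch below is illustrative only. The sample message, its addresses and the calendar date are invented; only the two clock times (15:40:40 sent, 19:38:52 noticed) come from the log.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def delivery_delay(raw_message: str, received_at: datetime) -> timedelta:
    """Return how long a message waited between its Date: header and arrival."""
    message = message_from_string(raw_message)
    sent_at = parsedate_to_datetime(message["Date"])
    return received_at - sent_at


# Invented sample mail; the date itself is a placeholder.
SAMPLE = (
    "From: root@example.org\n"
    "To: tool-maintainer@example.org\n"
    "Subject: Cron job failed\n"
    "Date: Tue, 01 Oct 2019 15:40:40 +0000\n"
    "\n"
    "Stale file handle\n"
)

received = datetime(2019, 10, 1, 19, 38, 52, tzinfo=timezone.utc)
delay = delivery_delay(SAMPLE, received)
print(delay)  # 3:58:12
```

A delay of several hours, as here, points at a backed-up queue rather than a failure that is still happening.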