[00:53:43] Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization (T217280) [00:53:44] T217280: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 [00:53:53] !log tools Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization (T217280) [00:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [01:16:10] thedj: I was going to try and fix T217992 for you, but there is nothing listening on port 80 of maps-tiles1.maps.eqiad.wmflabs [01:16:11] T217992: Unable to delete proxy entry via horizon - https://phabricator.wikimedia.org/T217992 [08:54:31] https://usercontent.irccloud-cdn.com/file/zpjPnied/image.png [08:54:37] interesting, 2 IPs for an instance apparently :P [08:58:29] * AlexZ is getting emails about puppet failing on instances [08:58:40] I've filed a phab to keep track https://phabricator.wikimedia.org/T218009 [09:01:17] i get them from testlabs, as i wrote earlier [09:01:28] 2 just now [09:01:52] maybe i should file another ticket? [09:01:56] !help [09:01:56] annika: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team [09:02:22] annika: Feel free to add-on to my ticket [09:03:15] i don't know if the admins will complain if i do that [09:04:38] It's better to have one ticket with all the similar issues instead of having it spread out over multiple ones. [09:05:26] That's something they'll complain about, since it's more work to track each ticket down :P [09:06:56] eh, i have the experience that issues that i think are related are in fact not [09:43:22] annika: what's the issue? 
[09:44:14] arturo: i get puppet emails that i think i should get [09:44:27] *shouldn't [09:44:48] last time i checked i wasn't projectadmin of testlabs [09:45:48] I got two emails too, and the instance name without the tool name wasn't helpful [09:46:01] what's your username annika ? [09:46:11] gifti [09:46:18] is shell [09:46:32] annika: according to https://tools.wmflabs.org/openstack-browser/user/gifti you are in the testlabs project [09:47:04] huh, ok [09:47:04] jem: what's your username (and the involved project)? [09:59:32] arturo: it's -jem- and the projects could be: ,51374(tools.spellcheck),51559(tools.quentinv57-tools),51605(tools.jembot),51702(tools.intelibot),52137(tools.intelirc),53121(tools.joaquinito01),53512(tools.patrubot),53526(tools.t1943bot) [10:00:08] the project could be any of* [10:00:35] jem: and the email is about the tools project? [10:00:55] Let me check [10:01:00] you are part of the tools project, that's why you get the email [10:01:18] tools-exec-1430.tools.eqiad.wmflabs, tools-sgeexec-0905.tools.eqiad.wmflabs [10:01:21] Ah [10:01:36] So everyone in Toolforge has received the email, then [10:03:35] ... so probably an email to everyone wasn't the intended or proper solution [10:04:52] I’m on Toolforge from two accounts (work/private) and didn’t get an email on either address, fwiw [10:05:35] i don't think any mere member of any project should get puppet mails [10:07:59] Then I'm confused about my case [10:26:56] Hello! Why do I get mails like "Puppet is failing to run on the "tools-sgeexec-0905.tools.eqiad.wmflabs" instance in Wikimedia Cloud VPS." 
[10:27:16] Seems I cannot do anything to fix that [10:28:36] Wurgl: it seems we have an issue with the script that sends those emails, we're investigating [10:28:46] thanks [11:57:33] !help [11:57:33] gabrieloli01001: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team [11:59:53] gabrieloli01001: hi! how can we help you? [12:00:13] hello arturo [12:01:26] I received the email about Puppet and as it been a while I don't work in Wikimedia stuff I'm not familiar with it [12:02:02] I'm sorry if this is not the right place to ask but,what is exactly the puppet? [12:05:28] gabrieloli01001: you can ignore the email, sorry for the noise. If you want details of what's going on, you can check this phabricator bug: T218009 [12:05:28] T218009: Puppet failure on tools-exec-1430.tools.eqiad.wmflabs and tools-sgeexec-0905.tools.eqiad.wmflabs - https://phabricator.wikimedia.org/T218009 [12:06:19] okay thanks [13:21:40] Hi, today I received some emails with subject "[Cloud VPS alert] Puppet failure on tools-exec-1430.tools.eqiad.wmflabs". I remember to have some tools running on Toolforge, but the email lacked description of which job was failing. The email included a line that says "Please take steps to repair this instance or contact a Cloud VPS admin for assistance." but I have no idea what is failing nor who to contact. I'd [13:21:40] appreciate if there anyone here could give me a clue on what to do and who to contact. [13:21:52] Thanks! [13:22:59] kenrick95: TL;DR don’t worry about it https://lists.wikimedia.org/pipermail/cloud/2019-March/000581.html [13:23:31] Ah I see, thank you very much! [14:48:28] Anybody online? 
[14:49:09] !log tools deleted tools-webgrid-lighttpd-1419 [14:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:50:19] hi Adithyak1997 [14:50:36] I am facing the following error [14:50:41] 2019-03-11 20:14:49 Disconnected: No supported authentication methods available (server sent: publickey,hostbased) [14:50:53] Its related to logging into putty [14:50:56] what host are you trying to SSH to? [14:51:28] Bastion [14:51:33] bastion.wmflabs.org ? [14:52:25] Sorry [14:52:28] wmflabs [14:52:46] ? [14:52:52] can you just send the whole hostname please [14:53:22] https://tools.wmflabs.org/admin/tool/fireflytools [14:53:26] This one? [14:53:42] that's a tool, it has no host of its own [14:53:51] I'm looking for something ending .wmflabs.org or .eqiad.wmflabs [14:53:58] you will be using it as the hostname in PuTTy [14:54:03] PuTTY* [14:54:15] Yes Putty [14:54:22] maybe you're connecting to login.tools.wmflabs.org ? [14:55:23] Adithyak1997@tools-login.wmflabs.org [14:55:38] ok [14:55:44] Got an email about a failed cron job of mine, but it doesn't appear to be due to my code. [14:55:47] error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error). [14:55:47] Traceback (most recent call last): [14:55:47] File "/usr/bin/job", line 48, in [14:55:47] root = xml.etree.ElementTree.fromstring(proc.stdout.read()) [14:55:47] File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 1345, in XML [14:55:48] return parser.close() [14:55:48] xml.etree.ElementTree.ParseError: no element found: line 1, column 0 [14:55:54] well you're a member of the tools project Adithyak1997 [14:56:02] (Sorry) [14:56:27] @Krenair Yes I am [14:56:40] Krinkle, I heard there was a disruption related to NFS just now [14:56:49] and cronjobs in SGE [14:57:08] Adithyak1997, and you have an SSH key... have you configured an SSH key in PuTTY? 
[14:57:31] Actually I was having an SSH key earlier [14:57:47] It was stored in the WPCleaner folder [14:58:01] Yesterday, I uninstalled that folder and the SSH key was lost [14:58:20] Today, by using Puttygen, I created new public as well as private keys [14:58:36] I don't know how to proceed after that [14:58:45] I have saved my keys in my laptop [14:58:47] did you go to wikitech and update your keys there? [14:58:54] No [14:59:00] alright well you'll need to do that before you can log in [14:59:25] Ok [14:59:50] it'll need to go in https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-openstack [14:59:52] * Krenair will be back later [15:01:04] Ok. I have done that [15:01:39] Thanks a lot. It worked [15:14:02] jstop does not stop my job (on stretch). It failes with "error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error)." [15:14:47] same with me [15:14:55] annoying, since I submit a LOT of jobs [15:16:47] A feeling like: Bonnie & Clyde without bullets *g* [15:17:28] Wurgl, Urbanecm: does `qdel` work to remove these jobs? [15:17:43] let me try [15:17:46] cron submits them auto [15:17:49] *matically [15:18:23] qstat doesn't work [15:18:29] so I don't know what to qdel [15:20:27] qdel seems to time out … waiting [15:20:40] Adithyak1997, hi, I'm back [15:20:47] Adithyak1997, did it work after you added the key? [15:21:09] Yes. It worked [15:21:14] Thanks a lot [15:21:15] great [15:21:24] bd808: qdel says failed receiving gdi request response for mid=1 (got syncron message receive timeout error). [15:21:40] tools.urbanecmbot@tools-sgebastion-07 ~ [15:21:40] $ qstat [15:21:40] error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error). [15:22:55] Urbanecm, Wurgl: Let me poke around a bit on the Stretch grid to see if I can figure out what's going on. 
I can recreate a long pause and then a failure just running `qstat` [15:23:24] If one of you has the time to start a phabricator task about the error that would be helpful [15:24:05] A small think [15:24:09] *thing [15:24:19] Please check https://arc.liv.ac.uk/pipermail/gridengine-users/2010-April/030288.html [15:24:30] I don't know whether it will help [15:25:38] <-- phabricating a task [15:25:59] Please do check the above link sent by me [15:27:55] I found an old irc log with discussion of the same error... trying to find the correct log files to check now [15:28:32] bd808: T218038 [15:28:32] T218038: jstop on stretch times out - https://phabricator.wikimedia.org/T218038 [15:42:38] !log tools Rebooting tools-sgegrid-master (T218038) [15:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:42:45] T218038: jstop on stretch times out - https://phabricator.wikimedia.org/T218038 [15:47:14] !log tools Hard reboot of tools-sgegrid-master via Horizon UI (T218038) [15:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:52:38] bd808: jstop seems to work now [15:53:26] !log tools Manually started `service gridengine-master` on tools-sgegrid-master after reboot (T218038) [15:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:53:29] T218038: jstop on stretch times out - https://phabricator.wikimedia.org/T218038 [15:54:11] Wurgl: nice! The grid master seemed to have completely lost track of what it was supposed to be doing. Possibly related to NFS errors, but not sure yet [15:56:10] Computers are just people. I am starting to forget too :-( [16:20:51] !log tools.grid-jobs Updated to dbb5f60 Guard against malformed accounting lines [16:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.grid-jobs/SAL [16:54:39] Anybody here? [16:56:10] Maybe? [16:56:22] !ask [16:56:22] Hi, how can we help you? Just ask your question. 
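Editorial note: the traceback pasted at 14:55:47 shows /usr/bin/job handing the raw output of a grid command straight to xml.etree.ElementTree.fromstring(). When the grid master does not answer, that output is empty and the parser raises "no element found: line 1, column 0". A minimal sketch of a guard follows, assuming the caller can treat "no usable XML" as "grid unavailable". The helper name parse_qstat_xml is hypothetical and is not taken from the real script, of which only the one line in the traceback is known.

```python
import xml.etree.ElementTree as ET


def parse_qstat_xml(raw):
    """Return the parsed root element, or None if raw holds no usable XML.

    raw may be bytes or str, e.g. the result of proc.stdout.read().
    """
    # Empty or whitespace-only output is what a timed-out grid command leaves on stdout.
    if not raw or not raw.strip():
        return None
    try:
        return ET.fromstring(raw)
    except ET.ParseError:
        # Output was present but was not well-formed XML (e.g. a plain error message).
        return None
```

A caller could then print a short "grid master not responding" message instead of a traceback when the function returns None.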
[17:00:41] Sorry for late reply [17:01:00] I have currently updated a file in https://github.com/adithyak04/fireflytools/blob/master/linter_counts.py#L19 [17:01:16] I have also used the command git pull in the software putty [17:01:33] But the updates have not happened in the page https://tools.wmflabs.org/fireflytools/linter/enwiki [17:02:10] I have also got an email in the morning stating the error [Cloud VPS alert] Puppet failure on tools-sgeexec-0905.tools.eqiad.wmflabs [17:07:44] For the mail: See topic and/or https://lists.wikimedia.org/pipermail/cloud/2019-March/000581.html [17:08:29] OK [17:08:47] But I guess it has something to do with my problem [17:09:15] Why I say so because I have been recently getting emails asking me regarding the trusty deprecation [17:10:10] Till yesterday, I was able to see the jobs that are running on Trusty job grid [17:11:32] But today, I am not able to see that page which shows the error "The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application." while running the link https://tools.wmflabs.org/trusty-tools/u/adithyak1997 [17:13:15] I unfortunately uninstalled WPCleaner yesterday due to which my private as well as public keys were lost related to SSH. Today, I created the keys using PuttyGen and also updated those in my Toolforge account [17:13:53] I haven't made any changes in the grid as I don't know whether anything needs to be done. I have also updated in WinCSP. [17:30:40] Does anybody have any solution to my problem? [17:31:15] Adithyak1997: I'm trying to fix https://tools.wmflabs.org/trusty-tools/ right now [17:31:34] Thank you [17:32:31] you can probably find the same data by using `qstat` as your tool from the login-trusty.tools.wmflabs.org server [17:42:59] But what command needs to be typed if I am running it in sgebastion? 
[17:43:11] I tried qstat but it didn't work [17:44:10] You can login at login-trusty.tools.wmflabs.org to access the trusty-jobs [17:44:44] You mean through direct browsing? [17:44:52] Or through Putty? [17:47:10] Actually, in the page https://tools.wmflabs.org/admin/oge/status I am able to see the running of my task [17:47:16] !log tools.sge-status Deployed 6b6d5cc Explicitly cast numeric data to int [17:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sge-status/SAL [17:48:05] Adithyak1997: through putty. When you say "I tried qstat but it didn't work" what actually happened? [17:48:20] Adithyak1997: https://tools.wmflabs.org/trusty-tools/u/adithyak1997 is working again :) [17:49:43] When I entered qstat, it showed the next command line [17:49:59] It didn't show any output [17:51:40] Adithyak1997: that means that there were no jobs to show. Depending on when you ran the command there may not have been anything running. fireflytools is running jobs from cron on the Trusty grid, so they are not running at all times [17:52:05] See https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Move_a_cron_job for moving your cron jobs from the Trusty grid to the Stretch grid [17:55:48] But I am getting stuck at the statement ssh @login-stretch.tools.wmflabs.org [17:56:11] what is the problem? [17:56:39] When I give that command replacing with Adithyak1997, it shows (publickey, hostbased) [17:57:49] are you using the new ssh key pair that you made today? 
[17:58:10] Yes [17:58:25] But before that I need a help [17:58:29] ssh @login-trusty.tools.wmflabs.org [17:59:17] Adithyak1997: if I remember correctly you are using windows and PuTTY, so you actually need to replace those commands with PuTTY connections [17:59:17] Can this statement be typed directly in the console starting with tools.fireflytools@tools-sgebastion-07 [17:59:39] Ouch [18:00:17] Your guess is right [18:01:28] Adithyak1997: you will need to make a new PuTTY configuration like the ones documented at https://wikitech.wikimedia.org/wiki/Help:Access_to_Toolforge_instances_with_PuTTY_and_WinSCP#How_to_set_up_PuTTY_for_direct_access_to_your_Toolforge_account -- use "login-trusty.tools.wmflabs.org" as the Host Name [18:02:40] The configuration you have for "tools-login.wmflabs.org" can be used where the migration instructions say "login-stretch.tools.wmflabs.org". [18:03:10] Ok [18:03:14] Let me try that [18:03:38] * bd808 kind of wishes that all Windows users were on Windows 10 with native ssh [18:07:02] New issue [18:07:04] The last Puppet run was at Mon Mar 11 17:52:33 UTC 2019 (11 minutes ago). Last login: Mon Mar 11 17:42:13 2019 from 117.194.170.34 (env)adithyak1997@tools-sgebastion-07:~$ ssh tools-login.wmflabs.org @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ WARNING: POSSIBLE DNS SPOOFING DETECTED! @ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ The ECDSA host key for tools-login.wmflabs [18:07:48] Sorry [18:07:58] The ECDSA host key for tools-login.wmflabs.org has changed, and the key for the corresponding IP address 172.16.7.167 is unchanged [18:08:20] Offending key for IP in /home/adithyak1997/.ssh/known_hosts:4 remove with: ssh-keygen -f "/home/adithyak1997/.ssh/known_hosts" -R 172.16.7.167 [18:08:21] Did cron jobs stop working for a period of time? 
I just got a ton of emails [18:09:05] !log shinken restarted shinken to clear invalid alerts [18:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Shinken/SAL [18:09:35] Zppix: NFS issues apparently T218038 [18:09:35] T218038: NFS issue affecting Toolforge SGE master - https://phabricator.wikimedia.org/T218038 [18:09:59] gtirloni: okay just wanted to make sure my cron job didnt just suddenly break or if it was a issue on cloud's side [18:11:20] Adithyak1997: yes, that key change is known and was announced on the cloud-announce@lists.wikimedia.org mailing list. You can get the new fingerprint from https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/login.tools.wmflabs.org [18:12:04] So what needs to be done with that key? [18:14:01] Adithyak1997: see what you posted 5min ago how to remove a key [18:14:18] "remove with: ssh-keygen -f "/home/adithyak1997/.ssh/known_hosts" -R 172.16.7.167" [18:14:24] ok [18:14:33] that... confusingly is not a PuTTY warning [18:15:32] bd808: looks like a git bash/cmd warning [18:16:39] bd808: seems like Adithyak1997 is trying to jump from tools-sgebastion-07 [18:17:01] re-reading clues me in. that warning was from the shell on tools-sgebastion-07 (which actually is the host that tools-login.wmflabs.org currently points at) [18:17:15] Host 172.16.7.167 not found in /home/adithyak1997/.ssh/known_hosts [18:18:09] Adithyak1997: ok. this error is a bit of a side track anyway. You will not be able to hop from bastion to bastion. You need to connect to each from your local computer [18:19:08] Sorry to ask you. Means? [18:22:34] Adithyak1997: "(env)adithyak1997@tools-sgebastion-07:~$ ssh tools-login.wmflabs.org" -- that is you logging in to a bastion and then trying to use ssh from that bastion to get to another bastion. That is what will not work. 
[18:22:58] also tools-login.wmflabs.org and sgebastion-07 are the same server :) [18:23:17] Ok [19:32:21] I am wondering about "webservice --backend=kubernetes start" vs. "webservice start". Currently I started with --backend=kubernetes but now I get status 502 Bad Gateway? https://tools.wmflabs.org/persondata [19:53:09] Wurgl: it can take 1-2 minutes for a new kubernetes webservice to start and be recognized by the gateway. If it's still down after a few minutes you can look at the output of `kubectl get pods` to see if the pod is in a persistent error state. You should also check for error.log output. [19:57:30] I started it in the morning (local time) … ~8-9 hours ago [19:58:41] persondata-676273626-15020 1/1 Running 0 1d <-- output of "kubectl get pods" [20:08:25] bd808: shall i flip the webservices again, so you can check if the proxy can be deleted? [20:21:00] After webservice start – tail error.log says 2019-03-11 20:19:50: (configfile.c.1154) source: /var/run/lighttpd/persondata line: 631 pos: 1 parser failed somehow near here: (EOL) [20:21:49] kubectl get pods says persondata-676273626-15020 1/1 Terminating 0 1d <-- for a few minutes now, it does not want to die :-( [20:28:03] 6 minutes later: Still same line: persondata-676273626-15020 1/1 Terminating 0 1d [21:10:15] Could someone take a look? T218060 [21:10:16] T218060: Webservice does not terminate / does not start - https://phabricator.wikimedia.org/T218060 [21:14:08] Wurgl: seems stuck in terminating? [21:14:14] Wurgl: tried this? https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes#Monitoring_your_jobs [21:16:08] Error from server: deployments.extensions "persondata-676273626-15020" not found [21:16:19] persondata-676273626-15020 1/1 Terminating 0 1d <-- still this line [21:17:57] is that the result of kubectl get deployment? 
[21:18:36] no output, just the shell prompt [21:19:27] That line with terminating is the the output of kubectl get pods [21:19:44] well that's about the limit of my knowledge ;) hope you find an answer soon. [22:49:12] thedj: Hey, I'm around now if you want me to try and help get the maps proxies pointed at the new server. Or you can leave some instructions on the ticket about what to start/stop on the hosts and I can try to move it at some other time. [22:51:40] i reset the server to a safe config and the httpd is started again. [22:51:45] so they are good to go. [22:53:30] thedj: cool. I'll try to get the proxy configuration fixed for you then. [22:54:16] cool [22:59:24] thanks bd808 [23:22:41] Is it just me getting 502 on a lot of tools on labs today? Can't load https://tools.wmflabs.org/citations/ or https://tools.wmflabs.org/magnustools/ ... [23:23:32] interesting [23:23:38] for me those pages are taking a long time to load [23:24:08] isn't it ordinarily instant to return a 502 if a tool is not running? [23:24:19] I haven' gotten them to load at all today...sometime they time out, and sometime they throw a 501/502 (can't remember) [23:24:35] yeah [23:25:21] might be worth contacting the tool authors [23:25:23] though [23:25:32] it's a plain 502 from nginx [23:25:38] this doesn't feel normal at all [23:25:49] it is more tahn one tool author though [23:25:52] than* [23:26:15] so, I'm guessing it is not 'their' faults [23:26:31] 502s come from the backend being unresponsive [23:27:32] we had some NFS and then grid engine hiccups earlier today that could have put come tools in a bad state [23:27:37] *some tools [23:28:39] do/should we have any monitoring for tools backend web servers going unresponsive? [23:28:59] should, yes. do, no. [23:29:19] do we have a task for that? 
[23:29:21] :-) [23:29:24] * Krenair looks [23:29:43] there is a whole set of tasks around better monitoring [23:30:48] there is always more work than time … [23:31:10] 'Implement a system to monitor tools on tool-labs' [23:32:52] … July 2013 [23:33:12] https://phabricator.wikimedia.org/T53434 yeah it's a bit broad [23:33:43] tool-level monitoring is one of those eternal wishlist items [23:35:29] someday™ I would like to take a look at https://www.cncf.io/blog/2018/12/18/cortex-a-multi-tenant-horizontally-scalable-prometheus-as-a-service/ as a possible service for monitoring [23:37:22] if the problem may have been caused by an NFS/SGE hiccup, maybe a tools admin could try restarting the webservices of broken tools? [23:37:57] the hard part is finding the broken ones [23:38:47] I could blindly restart everything on the grid that is re-schedulable, but that might break more things than it fixes [23:39:22] I see 2200 directories in /data/project/; each one is a tool [23:39:44] Checking all by hand … you end up in a funny farm … [23:50:12] Josve05a, sounds like you'll need to contact the individual tool authors [23:52:48] Krenair: drats. [23:52:52] Oh well, will do [23:52:55] thanks
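Editorial note: the closing discussion says "the hard part is finding the broken ones". A minimal sketch of one way to narrow the list follows, assuming each tool is served under https://tools.wmflabs.org/<toolname>/ as in the URLs quoted above. The function names and the choice of 502/503/504 as "unresponsive" are assumptions for illustration, not an existing Toolforge utility. Tools that never had a webservice would be reported too, so the result is a list of candidates to inspect, not a list to restart blindly.

```python
import urllib.error
import urllib.request

# Status codes treated here as "backend not answering" (an assumption).
BAD_GATEWAY_CODES = {502, 503, 504}


def http_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None on timeout or connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error responses (4xx/5xx) arrive as exceptions that carry the code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def find_unresponsive(tool_names, fetch=http_status, base="https://tools.wmflabs.org"):
    """Return (name, status) pairs for tools whose front page looks unresponsive."""
    broken = []
    for name in tool_names:
        status = fetch(f"{base}/{name}/")
        if status is None or status in BAD_GATEWAY_CODES:
            broken.append((name, status))
    return broken
```

For example, find_unresponsive(["citations", "magnustools"]) would probe the two tools mentioned at 23:22:41. The fetch parameter exists so the probing logic can be exercised without network access.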