[01:08:15] (CR) Paladox: change apt related checks to a higher check interval to prevent large amnt of spam when there are updates to install (2 comments) [labs/icinga2] - https://gerrit.wikimedia.org/r/359732 (owner: Zppix)
[01:09:31] (PS3) Paladox: change apt related checks to a higher check interval to prevent large amnt of spam when there are updates to install [labs/icinga2] - https://gerrit.wikimedia.org/r/359732 (owner: Zppix)
[01:09:36] (CR) Paladox: [V: 2 C: 2] change apt related checks to a higher check interval to prevent large amnt of spam when there are updates to install [labs/icinga2] - https://gerrit.wikimedia.org/r/359732 (owner: Zppix)
[01:24:25] PROBLEM - Puppet errors on tools-exec-1433 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[02:22:37] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1417 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[02:34:27] RECOVERY - Puppet errors on tools-exec-1433 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:57:36] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1417 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:20:16] Labs, Tool-Labs: Cannot access replica databases - access denied - https://phabricator.wikimedia.org/T151296#3358196 (MnemonicFlow)
[10:52:23] PROBLEM - Puppet errors on tools-exec-1430 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[11:27:24] RECOVERY - Puppet errors on tools-exec-1430 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:21:36] bd808: Hi, are you online? Do you know if blocking a user at wikitech locks them out of phab?
See https://phabricator.wikimedia.org/p/Waxmiguel/
[12:36:52] Sagan: unlikely
[12:37:24] Phabricator and Wikitech both use LDAP to authenticate, but blocking is a layer they implement separately
[12:37:47] I reported it at devtools too, but there is nobody currently online, who can block at phabricator
[12:45:35] if a user is blocked through ldap won't they be prevented from logging into phab?
[13:01:04] users are not blocked through ldap
[13:01:49] there is no provision for that (other than maybe changing the password)
[14:24:07] (Draft1) Paladox: Only run apt once not run it twice [labs/icinga2] - https://gerrit.wikimedia.org/r/359795
[14:24:09] (PS2) Paladox: Only run apt once not run it twice [labs/icinga2] - https://gerrit.wikimedia.org/r/359795
[14:24:12] (CR) Paladox: [V: 2 C: 2] Only run apt once not run it twice [labs/icinga2] - https://gerrit.wikimedia.org/r/359795 (owner: Paladox)
[14:38:08] bd808 you here?
[14:40:39] or valhallasw`cloud ? a job it starting itself out of the blue every few seconds... :/ sigh
[14:40:45] it -- is
[14:42:44] Steinsplitter: some more info?
[14:43:26] hello valhallasw`cloud :) 6274599 0.30000 bot.rotate tools.sbot r 06/18/2017 14:42:36 task@tools-exec-1409.eqiad.wmf 1 <-- this one.
[14:45:14] Steinsplitter: how is the job normally started?
[14:45:46] because from what I see the job (now #6274603) is just doing its job
[14:45:48] via cron every two hours, i removed the cron now. but it still restarts again and again after qdel, a few seconds
[14:46:43] https://www.irccloud.com/pastebin/RF9gFCwM/
[14:50:26] so /data/project/sbot/bot.rotatebot.err is full of errors from jsub invocations
[14:50:31] [Sun Jun 18 14:50:03 2017] there is a job named 'bot.rotatebot' already active
[14:51:20] ok, it's not the webservice that's doing it
[14:57:17] I'm not sure how to read from eventlogging, unfortunately
[15:00:05] valhallasw`cloud: thx, and sorry for bothering you at sunday.
well, now i see stuff like "[Sun Jun 18 14:55:18 2017] there is a job named 'bot.rotatebot' already active" in the logs but no job is active. Maybe there is a problem at jobqueue. I just saw the wiki got flooded, and i was unable to stop the bot. I am paranoid, so it is unlikely that it is on my part (checked all logs, the logs of the webinterface, etc.)
[15:00:34] no, there is a job active.
[15:00:58] \o/
[15:01:08] I'm not sure why it is restarting; the easiest way to stop it from doing anything is to replace the script by just a sleep(1e9) or something like that
[15:04:26] Labs, Tool-Labs: Automatically restarting job for tools.sbot - https://phabricator.wikimedia.org/T168206#3358412 (valhallasw)
[15:09:06] valhallasw`cloud , thanks.
[15:14:55] (Draft1) Paladox: Update some things [labs/icinga2] - https://gerrit.wikimedia.org/r/359799
[15:14:57] (PS2) Paladox: Update some things [labs/icinga2] - https://gerrit.wikimedia.org/r/359799
[15:15:01] (CR) Paladox: [V: 2 C: 2] Update some things [labs/icinga2] - https://gerrit.wikimedia.org/r/359799 (owner: Paladox)
[15:21:16] (Draft1) Paladox: Update ssh check to check for Linux OS too [labs/icinga2] - https://gerrit.wikimedia.org/r/359800
[15:21:18] (PS2) Paladox: Update ssh check to check for Linux OS too [labs/icinga2] - https://gerrit.wikimedia.org/r/359800
[15:21:22] (CR) Paladox: [V: 2 C: 2] Update ssh check to check for Linux OS too [labs/icinga2] - https://gerrit.wikimedia.org/r/359800 (owner: Paladox)
[15:23:02] (Draft1) Paladox: Update ssh check to not be used when checking websites [labs/icinga2] - https://gerrit.wikimedia.org/r/359801
[15:23:04] (PS2) Paladox: Update ssh check to not be used when checking websites [labs/icinga2] - https://gerrit.wikimedia.org/r/359801
[15:24:35] (PS3) Paladox: Update ssh check to not be used when checking websites [labs/icinga2] - https://gerrit.wikimedia.org/r/359801
[15:25:06] Hmm ^^ that's now
throwing 500 for me
[15:25:08] again
[15:25:17] ( the gerrit change)
[15:25:25] https://gerrit.wikimedia.org/r/#/c/359801/
[15:25:26] woops
[15:25:31] 500 Internal server error
[15:26:26] (Draft1) Paladox: Update ssh check to not be used when checking websites [labs/icinga2] - https://gerrit.wikimedia.org/r/359803
[15:26:28] (PS2) Paladox: Update ssh check to not be used when checking websites [labs/icinga2] - https://gerrit.wikimedia.org/r/359803
[15:26:30] (CR) Paladox: [V: 2 C: 2] Update ssh check to not be used when checking websites [labs/icinga2] - https://gerrit.wikimedia.org/r/359803 (owner: Paladox)
[15:26:41] recreated a new change to workaround that.
[15:27:54] PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[16:02:53] RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:21:10] Labs, Horizon, User-bd808, cloud-services-team (Kanban): Horizon bug: hidden web proxy after deleting instance - https://phabricator.wikimedia.org/T167985#3358483 (Andrew) I'm in the process of making a more comprehensive fix for this issue, but in the meantime I've tried a one-off fix... @mpopov...
[17:44:53] PROBLEM - Puppet errors on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:17:58] Labs-project-icinga2, User-Zppix: Make Icinga2-wm bot use IRC auth - https://phabricator.wikimedia.org/T167807#3358539 (Paladox)
[18:24:52] RECOVERY - Puppet errors on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:42:05] Labs: systemctl status nfs-common shows it fail due to missing nobody group and missing file /etc/idmapd.conf - https://phabricator.wikimedia.org/T168208#3358547 (Paladox)
[19:16:30] Labs-project-Wikistats, Patch-For-Review: Wikistats 2.2 [beta] gives internal server error 500 for all csv, ssv and xml formats - https://phabricator.wikimedia.org/T165879#3358598 (Xqt)
[19:16:34] Labs-project-Wikistats, Pywikibot-core, Patch-For-Review, Pywikibot-tests, Upstream: wikistats.py seems broken - https://phabricator.wikimedia.org/T165830#3358597 (Xqt) Open>Resolved
[21:00:13] Labs, Tool-Labs: Automatically restarting job for tools.sbot - https://phabricator.wikimedia.org/T168206#3358680 (zhuyifei1999) https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Accessing_data
[21:13:49] Labs, Tool-Labs: Automatically restarting job for tools.sbot - https://phabricator.wikimedia.org/T168206#3358685 (zhuyifei1999) I'm highly suspecting this detached-from-sshd `become sbot`-sudo-invocation on tools-login (tools-bastion-03). @valhallasw Could you strace it? PID: 6755 {F8476083}
[21:22:52] Labs, Tool-Labs: Automatically restarting job for tools.sbot - https://phabricator.wikimedia.org/T168206#3358698 (zhuyifei1999) I suspended that process in question (`kill -SIGSTOP 6755`), seems to work. FWIW: ``` tools.sbot@tools-bastion-03:~$ lsof -p 6755 lsof: WARNING: can't stat() ext4 file system /v...
[21:26:15] Labs, Tool-Labs: Automatically restarting job for tools.sbot - https://phabricator.wikimedia.org/T168206#3358700 (zhuyifei1999) a:zhuyifei1999
[23:20:21] Labs: systemctl status nfs-common shows it fail due to missing nobody group and missing file /etc/idmapd.conf - https://phabricator.wikimedia.org/T168208#3358736 (Paladox)