[03:16:19] Tool-Labs: Gerrit Patch Uploader delivers empty HTML page when submitting patch - https://phabricator.wikimedia.org/T85319#944668 (DLindsley) @Fomafix: What browser are you using? That would be helpful too.
[06:56:14] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[07:26:16] RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:46:08] Tool-Labs: Gerrit Patch Uploader delivers empty HTML page when submitting patch - https://phabricator.wikimedia.org/T85319#944723 (Fomafix) @DLindsley: Mozilla Firefox
[08:59:16] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<11.11%)
[09:14:15] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK
[09:59:35] What is this? Sorry, I don't get it..
[09:59:56] Sarahcate: What are you referring to?
[10:00:31] This chat feature, what is it for?
[10:00:56] Sarahcate: It's called IRC.
[10:02:59] Hmm.. This is my first time ever seeing this on Wikipedia, thank you for clarifying!
[10:04:28] Sarahcate: No problem. If you're interested in general wikipedia chat, you can switch to #wikipedia-en channel.
[10:20:31] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[10:22:11] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[10:22:13] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[10:42:04] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[10:45:29] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:47:16] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:50:39] Tool-Labs: static references broken with flask via uwsgi - https://phabricator.wikimedia.org/T85360#944807 (valhallasw) NEW
[11:53:38] Tool-Labs: Gerrit Patch Uploader: relative links broken with uwsgi - https://phabricator.wikimedia.org/T85361#944819 (valhallasw) NEW
[11:53:58] YuviPanda|zzz: uwsgi is broooooken
[11:54:09] YuviPanda|zzz: or rather, all relative urls, I think
[11:56:55] Tool-Labs: Uwsgi breaks flask project-relative URLs - https://phabricator.wikimedia.org/T85362#944830 (valhallasw) NEW a:yuvipanda
[11:59:57] Tool-Labs: Gerrit Patch Uploader delivers empty HTML page when submitting patch - https://phabricator.wikimedia.org/T85319#944846 (valhallasw) I think the issue was indeed the switch to uwsgi, which changed how urls were reported to projects. Flask thus thought it was in the site root, and had `/submit` as fo...
[19:07:54] Tool-Labs: Put toolserver.org redirect configuration in git - https://phabricator.wikimedia.org/T85165#945344 (jeremyb) in the meantime where do they live now? I was thinking it might be in a tool labs tool dedicated to just doing redirection. in which case we could add some more maintainers to help with thi...
[19:15:33] hullo.
[19:16:31] if qstat for my job lists a bunch of queue instances, all of them "dropped because it is temporarily not available" -- did I do something wrong, or are the grid machines having a hiccup?
[19:18:05] abartov: that suggests all of them are full
[19:18:30] so my job would requeue by itself in a little while, when load decreases?
[19:18:41] s/requeue/resume
[19:18:41] abartov: yes. Are you sure it's not just queueing?
[19:18:59] valhallasw`cloud: well, it started running, I saw some output, but then suspended.
[19:19:07] abartov: scheduling_info is often filled with a few queues without there being some issues
[19:19:42] abartov: what's the state? s?
[19:19:49] state is 'r'
[19:19:54] so it's running
[19:19:58] but no further output for more than 20m
[19:20:28] and no increase in vmem, which is still well below the limit.
[19:20:35] if it's 'r', it's not killed
[19:20:56] if there is no output, it suggests that your job, although it's running, is not producing output somehow
[19:21:01] but does it mean it's actually getting cpu cycles, or could it be suspended and waiting for a machine to execute on?
[19:21:10] if it's suspended, it's in state 's'.
[19:21:15] sorry, 'S'
[19:21:19] I suppose it's possible my actual code is hanging. Is it possible to see CPU usage?
[19:21:26] 's' is suspended via qmod, 'S' is because of scheduling
[19:21:52] abartov: yes. You can see the host where it's scheduled; ssh there and check top
[19:22:16] thing is, in qstat, I'm also not seeing an increase in "cpu=", under 'usage'
[19:22:26] abartov: that suggests it's waiting for IO instead
[19:22:45] hmm
[19:23:38] valhallasw`cloud: thanks, that did not occur to me! It does write a couple of wiki pages, and I supposed those might have hung for some reason. I'll make sure I have a timeout.
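The client-side timeout mentioned in the last message can be sketched roughly as follows. This is a minimal Python illustration only, not the Ruby client used by the job in this log; the function name `fetch_with_timeout` and the 10-second default are assumptions made for the example. The idea is that a request which never receives a reply gives up after a bounded wait instead of leaving the job idle in state 'r'.

```python
import urllib.request


def fetch_with_timeout(url, timeout_seconds=10.0):
    """Fetch url and return the body as bytes, or None on failure.

    Hypothetical helper: the name and the default timeout are
    illustrative. The timeout applies to connecting and to each
    blocking read, so a server that accepts the connection but never
    answers cannot block the caller indefinitely.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read()
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

A caller would then treat `None` as "log the failure and retry or exit", rather than waiting on the socket forever.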
[19:23:52] let me paste the bottom of the qstat output:
[19:23:57] usage 1: cpu=00:00:37, mem=8.99473 GBs, io=0.02518, vmem=291.758M, maxvmem=291.758M
[19:23:58] scheduling info: queue instance "mailq@tools-exec-11.eqiad.wmflabs" dropped because it is temporarily not available
[19:24:00] queue instance "mailq@tools-exec-10.eqiad.wmflabs" dropped because it is temporarily not available
[19:24:01] queue instance "task@tools-exec-15.eqiad.wmflabs" dropped because it is temporarily not available
[19:24:03] queue instance "task@tools-exec-11.eqiad.wmflabs" dropped because it is temporarily not available
[19:24:04] queue instance "task@tools-exec-10.eqiad.wmflabs" dropped because it is temporarily not available
[19:24:06] queue instance "continuous@tools-exec-15.eqiad.wmflabs" dropped because it is temporarily not available
[19:24:07] queue instance "continuous@tools-exec-11.eqiad.wmflabs" dropped because it is temporarily not available
[19:24:09] queue instance "continuous@tools-exec-06.eqiad.wmflabs" dropped because it is temporarily not available
[19:24:10] queue instance "continuous@tools-exec-10.eqiad.wmflabs" dropped because it is temporarily not available
[19:24:19] yes
[19:24:37] there's about 10 other hosts where it can run (and we had already concluded it's running somewhere!)
[19:26:30] valhallasw`cloud: ok, it seems to be running on "tools-exec-03.eqiad.wmflabs.org". How do I ssh there?
[19:26:38] ssh tools-exec-03
[19:26:59] valhallasw`cloud: ah! I tried the full hostname.
[19:27:59] valhallasw`cloud: looks like qstat didn't lie: it's running a bunch of python/php stuff, but my job is idle (and running):
[19:28:02] valhallasw`cloud: 14144 tools.gl 20 0 280m 120m 6324 S 0 1.5 0:37.33 ruby
[19:28:14] valhallasw`cloud: so I/O it must be!
[19:28:36] abartov: I'm trying to think of a way to test that. Maybe strace works, but if it's in a blocking I/O call, I don't think it would show that.
[19:29:10] abartov: also, it's running "queue:process" -- are you sure your process exits once the queue is empty?
[19:29:24] valhallasw`cloud: hmm, pstack(1) would have been useful, but is apparently not installed.
[19:30:20] valhallasw`cloud: yes, definitely. The 'queue' it reads is a wiki page, and it loops on the input just once. I've tested it manually.
[19:31:30] I guess I'll cancel it, add timeouts, and run it manually from tools-login with the same input, to see if it reproduces.
[19:31:37] abartov: hm, not sure then. maybe user input?
[19:31:51] although I would expect it to crash, as stdin is supposed to be closed
[19:33:29] valhallasw`cloud: nah, definitely no stdin.
[19:33:49] valhallasw`cloud: wait, but that I/O is MW API, which is supposed to have server-side timeouts...
[19:34:08] valhallasw`cloud: alright, I'll run from the command line and see.
[19:34:14] abartov: well, no, not really
[19:34:34] valhallasw`cloud: no timeouts server-side for the API?
[19:34:44] abartov: there are some PHP timeouts, but that doesn't tell you anything
[19:34:49] abartov: what happens if you don't get an RST back?
[19:35:04] or if the connection drops in a different way?
[19:37:10] valhallasw`cloud: hmm, I'm not sure. It's my first time using Wikimedia's official Ruby API client. I've been using (and hacking on) the old mediawiki_gateway client until now.
[19:37:43] but since I didn't provide for it especially, I'd expect my code to give up with a runtime error.
[19:41:00] abartov: I have to go now, sorry. You could try strace, but I'm not sure how much it'll tell you. Other than that, adding logging and using more verbose logging when running might help indeed.
[19:41:07] valhallasw`cloud: re-running now. Thanks again for your expert help!
[20:20:14] what is that python process that's taking up 100% cpu on tools-login?
[20:24:41] tools.era running searcher3.py
[20:26:56] eranbot?
[20:28:15] running for 11-12 hours now
[20:33:52] hmm
[20:36:56] Thanks, Krenair
[20:37:03] Krenair: I've just alerted Eran to this.
[20:38:35] Krenair: he's killed it.
[20:38:47] ok
[23:16:01] Is there anyone available to restart wm-bot2?
[23:16:12] It's completely crashed now
[23:37:57] Wikimedia-Labs-Infrastructure: Fix HTTPS redirect for icinga.wmflabs.org - https://phabricator.wikimedia.org/T56710#945547 (Se4598) Open>declined a:Se4598 closing: >>! In T85318#944034, yuvipanda wrote: > Neither icinga nor ganglia will be used in labs anymore. > [...]
[23:51:20] Tool-Labs: Gerrit Patch Uploader delivers empty HTML page when submitting patch - https://phabricator.wikimedia.org/T85319#945569 (Fomafix) Open>Resolved a:Fomafix It works again.