[00:20:31] Coren: sge master is not reachable
[00:34:49] Hm.. php in tool labs is consistently failing to fetch a url from lists.wikimedia.org that works fine locally and via curl on tools-login
[00:34:54] 2014-10-11 00:29:37: (mod_fastcgi.c.2701) FastCGI-stderr: PHP Warning: file_get_contents(https://lists.wikimedia.org/pipermail/cvn/): failed to open stream: HTTP request failed! in /data/project/list/src/wmf-tool-list/public_html/index.php on line 58
[00:34:59] 2014-10-10 16:22:24: (mod_fastcgi.c.2701) FastCGI-stderr: PHP Warning: file_get_contents(http://lists.wikimedia.org/pipermail/commons-l/): failed to open stream: HTTP request failed! in /data/project/list/src/wmf-tool-list/public_html/index.php on line 58
[00:35:30] <^d> Lack of user agent?
[00:35:41] Works fine in php from tools-login and locally
[00:36:13] echo file_get_contents('https://lists.wikimedia.org/pipermail/cvn/');
[00:36:16] from plain php -a
[00:36:23] but fails on the tools webserver
[00:38:22] added user-agent but no difference
[00:38:59] https://github.com/Krinkle/wmf-tool-list#readme
[00:39:00] https://tools.wmflabs.org/list/?list=wikitech-l&action=thismonth
[04:53:50] eh
[04:54:02] legoktm@tools-login:~$ become tfaprotbot
[04:54:03] tools.tfaprotbot@tools-login:~$ qstat
[04:54:03] error: commlib error: got select error (Connection refused)
[04:54:03] error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": got send error
[04:54:03] tools.tfaprotbot@tools-login:~$ webservice start
[04:54:04] Starting webservice... failed.
[05:58:12] legoktm: still failing?
[05:59:01] * legoktm checks
[05:59:16] YuviPanda: nope, looks good now, thanks
[05:59:24] legoktm: yeah, just restarted the master
[05:59:35] I think someone emailed labs-l as well
[06:01:02] legoktm: yeah, we need to keep track of actual outages...
[06:01:20] !log tools restarted gridengine-master on tools-master
[06:01:23] Logged the message, Master
[09:35:01] I still get the errors described by legoktm above (from qstat and webservice)
[11:05:58] Coren / YuviPanda / yuvipanda35 , I'm still getting a cron error mails.
[11:06:01] error: commlib error: got select error (Connection refused)
[11:06:01] Unable to run job: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": got send error.
[11:06:01] Exiting.
[11:18:12] job queue is still down
[11:42:24] Coren: Im also not seeing any Cron related email for the last 8+ hours
[11:46:11] last email I got was 0100UTC which means the queue failed between 0100 and 0115
[11:54:37] So I guess that answers the question "Am I the only one having issues with Labs"
[11:59:21] cron is not down, SGE is.
[12:28:28] sigh
[13:09:22] Coren: ? YuviPanda : ?
[13:09:48] Stuff has been down for 14 hours now
[13:17:33] Coren?
[13:17:36] YuviPanda?
[13:17:50] petan?
[13:18:00] Hi russblau
[13:18:09] hi multichill
[13:18:13] Everyone seems to be gone
[13:18:22] yeah, so I see
[13:18:33] maybe the grid engine will be back up on Tuesday...
[13:22:34] russblau: I haven't seen you around a lot related to Pywikibot. Still active?
[13:22:35] barely
[13:23:05] Now is a good time to become a bit more active. Quite a few new people got involved and some of the old timers also became more active
[13:23:13] mostly lurking. ever since the switch to gerrit i've found it extremely inconvenient to contribute
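(Back to the file_get_contents failures from 00:34 for a moment: ^d's user-agent suggestion is normally tried via a stream context, roughly as sketched below. The UA string is just an example, and since the log says adding one made no difference, the main value of a check like this is the error_get_last() dump, which shows why the stream actually failed, e.g. a refused or blocked outbound connection from the webgrid host.)

    php -r '
      $ctx = stream_context_create(array("http" => array(
          "user_agent" => "wmf-tool-list/1.0 (https://tools.wmflabs.org/list/)",
          "timeout"    => 10,
      )));
      var_dump(file_get_contents("https://lists.wikimedia.org/pipermail/cvn/", false, $ctx));
      var_dump(error_get_last());   // reports the underlying reason if the fetch failed
    '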
[13:23:21] Good stuff happening with code quality improvement
[13:23:56] Yeah, Gerrit can be a bitch, you have to first tame that beast :P
[13:29:43] maybe phab will be better :-)
[13:35:04] valhallasw`cloud: I sure hope so
[13:59:37] I have problems at tool labs: qstat doesn't work ("error: commlib error: got select error (Connection refused); error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": got send error")
[14:07:11] apper: yes, the grid engine is down and jobs aren't running.
[14:07:15] (see topic)
[14:10:45] guillom: is there worked on or is labs still not considered important enough for the communities by WMF?
[14:12:04] apper: That's a bit of a loaded question :) AFAIK no one is working on it yet, but keep in mind it's the week-end, and a 3-day week-end in the US at that, which means people may not be checking IRC/email. It's also a bit early for San Francisco time.
[14:13:54] apper: it's considered important, but not as important as production
[14:14:21] In addition, I'm not sure that the grid engine is monitored by the usual monitoring tools, so it's possible that the people who can fix it don't actually know there is a problem.
[14:14:21] In addition, I'm not sure that the grid engine is monitored by the usual monitoring tools, so it's possible that the people who can fix it don't actually know there is a problem.
[14:14:33] guillom: I don't want to blame any of the admins here. But it would be possible to have a 24/7 standby service for emergencies, which would of course cost money, but the foundation has enough money, so it's about consideration of importance
[14:14:40] apper: I think the numbers were something like 99% for labs vs 99.9(9?)% for prod
[14:14:57] (sorry for double posting, network issues here)
[14:15:25] apper: There is a service for emergencies, but what I'm saying is that I'm not sure people know there's an emergency.
[14:15:38] guillom: so: how do we make sure they get that information?
[14:16:09] valhallasw`cloud: Well, people in San Francisco should be getting up soon, so hopefully someone will see IRC or email.
[14:17:10] Given it's weekend, I don't expect people in SF to be up in the next two hours, to be honest. Getting Coren (iirc he's on the east coast, so ~10.15 there) might be easier.
[14:17:38] also, it's really a tools issue, so I'm not even sure if there are people in SF who can fix this.
[14:18:02] * guillom has no idea.
[14:18:11] (I'm just another guy whose tool isn't working.)
[14:19:57] valhallasw`cloud: and that's the thing I don't understand... why is tool labs considered less important than production... for many users tools from tool labs are really important for working. So I don't want to blame any of the admins, which really do excellent work, but I don't really like the fact, that there is so much money and hundreds of employees and when I find some rare time at the weekend to work on tools (for free, after a
[14:19:58] full-time week) I can't do anything, because the money isn't spent on technology and engineers...
[14:20:36] apper: because it *is* less important. No, you can't do anything, but the millions of people who want to read wikipedia can still do so.
[14:21:56] valhallasw`cloud: okay, yes, you're right. And maybe I could live with 99%, but we do not have 99% fully operational tool labs.
[14:22:17] apper: remember that 99% is 3.6 days of downtime every year ;-)
[14:22:26] anyway
[14:22:36] Coren has stated he should just be called when things break down badly
[14:22:44] so the question is: who has his phone number :-p
[14:22:56] guillom: Labs doesn't have proper monitoring AFAIK
[14:23:22] multichill: I think the current issue makes that pretty clear :)
[14:23:47] As in, labs didn't have any form of monitoring untill recenlty
[14:28:25] Coren: on tool labs I am getting:
[14:28:27] error: commlib error: got select error (Connection refused)
[14:28:29] Unable to run job: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": got se$
[14:28:30] Exiting.
[14:28:32] since 15 hours ago
[14:28:39] is that part of the outage?
[14:28:44] yes
[14:29:07] so there is nothing I need to do to fix that, just wait?
[14:29:35] Wait for Coren, basically :(
[14:31:42] ok
[14:32:27] or call/text/etc him, but I'm not sure who has his number
[14:32:51] * guillom checks.
[14:33:56] got it; sending text
[14:34:23] \o/
[14:38:30] guillom: andrewbogott seems to be the on call engineer
[14:38:56] So... I should text him as well?
[14:39:17] yes please
[14:39:42] multichill: ok. Doing now. Out of curiosity, where do you see that? I don't even know where that information is.
[14:40:42] Topic of #wikimedia-operations contained his name guillom . (and he updated the topic)
[14:40:56] And I noticed he fixed something else this week
[15:24:32] Coren: yt?
[15:27:38] andrewbogott: Ah. Finally! :P
[15:27:52] Grid engine is down
[15:27:52] just 'cause I'm awake doesn't mean I know how to fix anything...
[15:28:28] andrewbogott: I've been on call for the last couple of years. I know how that feels :P
[15:29:03] andrewbogott: Unable to run job: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": got send error.
[15:29:07] That's the message we're getting
[15:29:21] So you might have a look at tools-master.eqiad.wmflabs
[15:29:27] yep. I can restarat it but it crashes immediately.
[15:29:47] Hmm, no Windows solution, that sucks
[15:30:12] Whole OS or just the application?
[15:31:01] maybe a full reboot of the system instead of just the application helps andrewbogott? It's broken anyway....
[15:31:05] like, right now if you look, the sge is working.
[15:31:09] But in a minute it'll fail again
[15:31:20] ah, yep, there it goes
[15:31:42] !log tools rebooting tools-master, stab in the dark
[15:31:47] Logged the message, dummy
[15:31:54] whoah, labs-morebots, you're working still?
[15:31:59] labs-morebots, really?
[15:31:59] I am a logbot running on tools-exec-14.
[15:31:59] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[15:32:00] To log a message, type !log .
[15:32:07] multichill: this suggests that existing jobs are still running.
[15:32:31] labs-morebots, yt
[15:32:31] I am a logbot running on tools-exec-14.
[15:32:32] This is just the dispatcher, right?
[15:32:32] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[15:32:32] To log a message, type !log .
[15:32:36] yep.
[15:32:46] So that's good, not as big of a collapse as I'd feared
[15:35:51] andrewbogott: Reboot still going? Maybe the logs contain something useful?
[15:36:52] https://dpaste.de/YTZt
[15:38:22] Null overload...
[15:38:50] here is the exact issue: http://arc.liv.ac.uk/pipermail/gridengine-users/2010-January/028817.html
[15:38:53] with no responses :(
[15:39:24] https://arc.liv.ac.uk/pipermail/gridengine-users/2008-April/018650.html <- andrewbogott
[15:40:16] andrewbogott: Maybe remove job 4632075.1 from the queue?
[15:41:14] Should be in $SGE_ROOT/$SGE_CELL/spool/jobs AFAIK
[15:42:32] Hm, don't know what $SGE_ROOT is...
[15:43:25] the next time it crashes it's because of a different job.
[15:43:57] when i could run qstat there were jobs that shouldn't be there
[15:44:23] You need one of the grid guru's for this I guess
[15:44:50] Judging from the different things I found while googling, something got corrupted
[15:48:23] I'm trying to figure out how to purge all pending jobs...
[15:53:03] What happen?
[15:53:51] Coren: grid scheduler died
[15:53:56] Can someone give me the two-line summary so I can dive in?
[15:54:04] Coren: https://dpaste.de/YTZt
[15:54:54] It's the first time I see this error; ever.
[15:55:04] Coren: Googling suggests some sort of job pool corruption at tools-master.eqiad.wmflabs
[15:55:17] Smells like it.
[16:00:26] Which is why the shadow master couldn't pick up.
[16:00:37] ah! you're here :)
[16:00:56] Bad morning to take to stay in bed.
[16:01:58] Coren: When Murphy comes along, everything will go wrong :P
[16:02:20] Coren, I've been poking about but haven't learned much. If you restart the master it responds to commands for a few seconds, then errors out with the above failure
[16:02:47] existing jobs on the exec nodes are still running, but complaining about being unable to report back to the master
[16:03:23] My next step (if I knew how to do it) would be to purge all the pending jobs, and maybe delete/recreate queues.
[16:03:31] But you may have a more surgical approach
[16:04:17] the master keeps everything in a bdb, I'm trying to check its consistency now.
[16:11:44] Something is really rotten in denmark.
[16:12:14] yeah :(
[16:12:26] just re-provision the entire server? [/nuclear option]
[16:13:22] Coren: do you see this email thread that suggests installing a mod that produces core dumps?
[16:14:05] bah, many mail threads with this issue but none with a followup explaining what went wrong.
[16:14:09] stupid internet
[16:14:41] I'm still trying to figure out exactly what is wrong now.
[16:20:00] Is it possible that this problem is deriving from a corrupt exec node? If, for instance, the master has decided that it's the least busy node so always tries to hit it on startup?
[16:20:22] andrewbogott: No, I'm pretty sure that its spool is corrupt in some way.
[16:20:54] And the spool contains running jobs /and/ scheduled jobs, or just scheduled jobs?
[16:21:02] Both
[16:21:26] Ah, so, pretty catastrophic to just throw it away and build a fresh one :(
[16:23:30] It /looks/ like it was able to delete the really really broken entry. I'm trying to get rid of another one but that one doesn't cause de daemon to abort() so that's that.
[16:24:12] Huh, I saw it crashing on different entries each time so I didn't think it was a particular job that was cursed.
[16:24:24] but… looks fixed to me
[16:24:28] andrewbogott: Well, right now it's up and happy about all but one entry.
[16:24:43] How did you determine which entry to delete?
[16:25:16] 10/11/2014 16:21:58|worker|tools-master|W|job 4632075.1 failed on host before writing exit_status because: shepherd exited with exit status 19: before writing exit_status
[16:25:44] Having a job runing on was a bad sign. :-)
[16:26:05] true...
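(A rough sketch of the surgery described above, assuming the stock Debian/Ubuntu layout of SGE_ROOT=/var/lib/gridengine with the "default" cell; the job id is the one from the qmaster message quoted at 16:25. If the qmaster dies again before qdel can reach it, the on-disk spool entry has to be removed by hand, which is roughly what ended up being necessary here.)

    # scan the qmaster log for the entry it chokes on, then try to force-drop it
    grep -E 'failed on host|shepherd exited' /var/lib/gridengine/default/spool/qmaster/messages | tail -n 20
    qdel -f 4632075        # -f forces deletion even when the job's state is inconsistent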
[16:26:18] ok, so… safe for me to go eat breakfast now, y'think?
[16:26:23] Also, it was always the last job the master touched before it selfdestructed.
[16:26:33] andrewbogott: Yes, do. I'm around if things break again.
[16:26:48] ok. Thank you for fixing.
[16:27:04] hurrah!
[16:27:06] And, as always, this suggests we need yet more monitoring, since this was broken all night :(
[16:27:26] andrewbogott: We *know* we need monitoring.
[16:27:32] a human alarm bell would also work
[16:27:51] Coren: yeah
[16:28:08] weird... when grid came back up, 'qstat' showed me a job submitted on 08/25/2014 that had never been visible before (I killed it)
[16:28:59] because in some sense, we already *have* monitoring (the first email on this was sent to labs-l 12 hours ago). We just need to get the alarm to reach whoever is available to fix it.
[16:29:51] That'll teach me to rest starting friday night. :-)
[16:30:12] There's still something wrong with one of the jobs that prevents further scheduling.
[16:30:30] So it's not _fixed_ yet, but it doesn't crash - which is that.
[16:39:38] *Finally* managed to get rid of that job.
[16:40:16] And the queues unclog.
[16:40:35] I really /really/ wish I knew how those two corrupt entries ended up in the job db.
[16:43:55] russblau: It's possible that something had been wrong with the DB for a while, just that the effects were nowhere near as visible.
[16:44:16] russblau: It got *really* noticable tonight because the master died.
[16:44:54] After this last master restart, it no longer complains and its output is all perfectly normal.
[16:46:05] Coren, while I'm here, do you have any idea why tool jobs sometimes get "ERROR: CALL dab05_cleanup_dab_pl(): Deadlock found when trying to get lock; try restarting transaction" - only intermittently
[16:46:59] the script runs without error about 80% of the time, and the other 20% it gets the deadlock error
[16:47:33] russblau: I've heard of that before. The new DB engine has slightly less generous timeouts by defaults when waiting on locks; springle is the one to ask though, he's the DB expert. I know it's possible to alter the query to avoid the deadlocks; or to increase the timeout per-session.
[16:47:57] He should be able to look at your actual queries and help.
[16:48:07] OK thanks
[17:13:01] hmmmm
[17:13:21] http://ogvjs-testing.wmflabs.org/w/index.php?title=Demo&action=edit <- says ‘edit’ but shows page view instead of edit. wtf?
[17:14:40] brion: page shows me view source tab, but yea thats not source :S
[17:14:49] looks like https://gerrit.wikimedia.org/r/#/c/147058/ broke
[17:14:53] not labs, just mediawiki :D
[17:15:08] woot
[17:42:30] "woot we broke mediawiki"?
[17:43:07] more like, found the problem easy to fix :)
[18:37:35] I've three task showed by qstat as runnning but not existing on the exec node, must I try to qdel them?
[18:53:33] labs-morebots: feeling ok?
[18:53:34] I am a logbot running on tools-exec-14.
[18:53:34] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[18:53:34] To log a message, type !log .
[19:03:34] andrewbogott: Did you have any success with the new image?
[19:03:47] test instances are spinning up now.
[19:03:55] in the 'testlabs' project.
[19:04:17] I think that the initial reboot obscured the console log though so I can't quite tell what happened… giving them another few minutes to stabilize.
[19:05:02] Well, if it got to rebooting at all then it means it believes it succesfully created the filesystems - that's a good sign.
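(On the intermittent "Deadlock found when trying to get lock" error discussed at 16:46: MySQL's error 1213 explicitly asks the client to restart the transaction, so a blunt retry loop around the call is a common workaround while the queries themselves get reworked. A sketch, with $DBHOST and $DBNAME as placeholders for the tool's actual database server and schema:)

    for try in 1 2 3; do
        mysql --defaults-file="$HOME/replica.my.cnf" -h "$DBHOST" "$DBNAME" \
            -e 'CALL dab05_cleanup_dab_pl();' && break   # stop once the call succeeds
        sleep $((try * 10))                              # back off a little, then retry the transaction
    done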
[19:05:33] IMO, the only real "risk" is the detection being broken so that it reboots in a loop, but I don't think that's likely.
[19:05:37] What's the instance name?
[19:05:59] there are three. testlabs-lvm-trusty testlabs-lvm-trusty-medium testlabs-lvm-precise
[19:06:06] I really did just create them a minute ago though
[19:06:30] the -precise image is displaying a bug that I thought was fixed already, which makes me think I got confused when copying up the image...
[19:10:40] That firstboot.sh thing is an ugly hack, but if it works right then all the issues with space are fixed for good.
[19:11:10] Well, / remains fixed-sized, but if you're filling /that/ up then you're definitely doing something wrong.
[19:11:30] that precise instance is failing in a novel way.
[19:11:54] The trusty images… if I had to guess I'd say that it's running firstboot on the first attempt, rebooting, but failing to run it the second time hence not running puppet.
[19:12:03] Just a guess from the terse console log.
[19:12:16] Can you dpaste it so I can take a peek?
[19:12:37] The good news is, if there's a bug in the firstboot.sh we can just change it in the image, no need to rebuild one anew.
[19:13:02] on, no my mistake, I think it's not rebooting in the first place.
[19:13:13] You can see the output, I'm just looking at 'get consoleoutput' on the manage instances page.
[19:13:58] https://dpaste.de/XzNB
[19:14:25] I'm going to explicitly reboot -medium to see what it does on the second try.
[19:14:52] Ah: /root/firstboot.sh: line 47: syntax error near unexpected token `then'
[19:14:55] Bleh.
[19:15:11] oh, of course I see that now :)
[19:15:32] * Coren slaps self.
[19:15:40] It's nothing but a dumb typo.
[19:17:21] It's easy to try again then; once that is commited just mount the image and change the file.
[19:18:01] Since nothing else needs to be changed.
[19:24:15] copying the image over takes ages, but doing that now...
[19:24:29] Better than building one from scratch.
[19:24:40] true!
[19:26:14] weird, speed test tells me I have 20mbps both ways but scping that file only gets me 200kb
[19:26:30] and we know the connection between labs and the outside world is faster than that...
[19:26:57] by 'weird' I mean 'totally expected but annoying'
[19:57:50] Coren: the latest: https://dpaste.de/gzDP
[19:58:40] andrewbogott: Hmmm. Is there a default key on that image?
[19:58:54] I don't think so
[19:59:24] Shall I reboot and see if it puppetizes enough for a login?
[20:00:32] It won't. For some reason, parted didn't accept the values it, itself, gave to create the partition - the next boot will have it try again.
[20:01:13] Error: You requested a partition from 32.3kB to 21.5GB.
[20:01:13] The closest location we can manage is 31.7kB to 31.7kB.
[20:01:43] * Coren checks that parted -m doesn't give different order
[20:02:04] root is only 10gb, shouldn't the partition start at 10.something?
[20:02:19] It should; which is why it failed.
[20:03:39] I can add some extra verbosity to firstboot.sh to figure it out. When I try the same things on wikitech-test-horizon the numbers are correct.
[20:04:41] lemme try with a bigger flavor, just in case that teaches us something...
[20:06:31] well, this one is stupider.
[20:06:35] Error: You requested a partition from 32.3kB to 42.9GB.
[20:06:36] The closest location we can manage is 31.7kB to 31.7kB.
[20:06:47] No, it's the same error clearly.
[20:07:12] yes, but the first try it's bigger-to-smaller, the second try smaller-to-bigger.
[20:07:16] Why in hell is the same snippet of code that properly returns the available space on one instance fails there?
[20:07:56] * Coren needs to see the output.
[20:07:57] ok, wait, nevermind, I see what's happening...
[20:08:02] Oh?
[20:08:12] well, I mean, I see a little bit...
[20:08:25] seems like in 'from x to y' y is correct.
[20:08:28] And x is always 32.3
[20:08:58] Yeah, but those numbers come from parted itself; I'm not sure how it could return anything but the correct ones.
[20:09:23] Is Volume group "vd" not found expected?
[20:09:35] For instance, on wikitech-test-horizon that snippet returns "12.3GB 21.5GB"
[20:09:50] here's the whole enchilada from that second test: https://dpaste.de/2GAX
[20:09:53] The first time? Yes -- that's why it tries to create it.
[20:10:20] Oh. OH!
[20:10:25] * Coren is an idiot.
[20:10:42] Oh, no I'm not.
[20:10:46] :)
[20:11:30] Coren: is there any reason a job would get sent SIGUSR1?
[20:11:50] (03PS1) 10Legoktm: Sent Wikidata-related things to #wikidata [labs/tools/pywikibugs] - 10https://gerrit.wikimedia.org/r/166217
[20:12:25] Coren: by SGE? (or, alternatively, is there a way to find out what the origin of a Signal is?)
[20:12:48] valhallasw`cloud: There's a number of reasons, the main one being "I'm about to kill you because you are hitting the limit, you have (5s iirc) to wrap up)"
[20:13:17] (03CR) 10Legoktm: [C: 032] Sent Wikidata-related things to #wikidata [labs/tools/pywikibugs] - 10https://gerrit.wikimedia.org/r/166217 (owner: 10Legoktm)
[20:13:23] (03CR) 10Legoktm: [V: 032] Sent Wikidata-related things to #wikidata [labs/tools/pywikibugs] - 10https://gerrit.wikimedia.org/r/166217 (owner: 10Legoktm)
[20:14:11] andrewbogott: There's something funky with the output of parted print free on that image, but I can't get what. Lemme add a bit of verbosity.
[20:14:55] !log tools.wikibugs deploying https://gerrit.wikimedia.org/r/166217
[20:14:59] Logged the message, Master
[20:15:58] Coren: Great, you fixed it.
[20:16:18] Nagios contains a standard check to monitor a certain tcp port so you get notified when a service goes down
[20:16:28] You probably want to enable that for the grid master
[20:16:54] multichill: Yeah, Yuvi is hard at work on the labs monitoring; we'll get better monitoring soon.
[20:18:03] Improving monitoring is a continues process, not one big step
[20:18:22] You shouldn't stop doing small steps and wait for the big one
[20:19:43] Well, getting actual monitoring infrastructure in place is very much a necessary big step. :-)
[20:20:02] Adding checks to it after that is the easy part. :-P
[20:20:37] It's still not up and running? Who is holding YuviPanda back? ;-)
[20:21:00] andrewbogott: Try again with that new firstboot.sh? Its not going to work better but at least now we'll see why.
[20:21:09] Coren: yep, I'm copying it over
[20:21:09] * Coren ponders.
[20:21:55] You know what? What would be *really* nice is if we had the firstboot.sh curl the "real" one from the infrastructure.
[20:23:20] So that adjusting what happens on firstboot doesn't need a new image all the time.
[20:23:55] Coren: is time travel still broken?
[20:24:11] (It could actually curl it from puppet; there's no rule against that)
[20:24:32] valhallasw`cloud: It probably will never come back, sadly, but there are weekly backups in /data/backup
[20:25:04] I mean /public/backup even
[20:25:33] Let me check. I ran "rm a *" instead of "rm a.*" :-(
[20:25:39] Ouch.
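(multichill's monitoring suggestion at 20:16 maps onto the stock Nagios/Icinga check_tcp plugin. Run by hand it looks roughly like this, assuming the usual Debian plugin path; wiring it into the monitoring host is then just a service definition around the same command.)

    /usr/lib/nagios/plugins/check_tcp -H tools-master.eqiad.wmflabs -p 6444 -w 5 -c 10
    # exits non-zero (WARNING/CRITICAL) when the qmaster port stops answering or gets slow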
[20:26:29] wasn't too much, though, just a few wrapper shell scripts
[20:26:57] They should be in /public/backup/{username}/tools/
[20:27:15] Yeah, unfortunately, it's all from after 5-10
[20:27:19] well, was worth the try
[20:31:06] we should add that to the tools help
[20:31:54] gifti: It's still in active development and the details aren't well-defined yet. I told valhallasw`cloud mostly to give him a chance to save himself. :-)
[20:32:08] ok, well
[20:48:51] andrewbogott: News?
[20:49:00] Coren: just booting now
[20:49:34] Coren: https://dpaste.de/fcbp
[20:49:53] you're not going to like it though
[20:50:40] What the hell?
[20:51:09] Well, now we know why it doesn't work -- there is no free space.
[20:52:03] But why in Baal's name is vmbuilder explicitly going against the setting we gave it?
[20:53:57] When there were more partitions, the extra space was never allocated to anything.
[20:54:27] There may be more to it… https://dpaste.de/zgic
[20:54:57] Yep. For some reason, vmbuilder have the entire space to the one partition.
[20:55:23] oh, part of the issue is that that 32.3 you were reading before is k and not g
[20:55:27] maybe you saw that already
[20:55:36] No, that's okay and will work.
[20:56:17] The issue really is just that there is no free space. vmbuilder.partition is being entirely ignored. Special case because there is just the one entry?
[20:56:26] That would be completely insane. Also probable.
[20:56:51] Hm, I don't like that we aren't specifying units in vmbuilder.partition
[20:57:15] andrewbogott: You never do, it's explicitly megs.
[20:58:29] so… I guess we move swap into vmbuilder.partition and see if that makes it behave
[20:59:14] or add --- to the end of vmbuilder.partition? I'm not sure what that would do
[20:59:25] That should tell it to create two devices.
[21:02:01] Of course, this means we're not stuck having to actually rebuild the image. :-(
[21:02:20] now*
[21:03:44] I'll leave the verbosity in firstboot.sh; it's harmless and may help debugging
[21:08:06] * andrewbogott rebuilds
[21:44:57] Coren: what decides whether a child process is killed or not if the parent is kill -9'd?
[21:45:26] (I have a process which cannot be killed with qdel... that just deletes the wrapping bash script)
[21:51:21] Children and not, by default, killed if their parents are. They usually will die on closed pipes, but if they are entirely independent they will likely continue to run. That said, I should probably have gridengine kill the whole process group if it sets one up (which it may well do - I'll need to check)
[21:52:35] Coren: I'm 100% sure it doesn't ;-)
[21:52:52] Coren: also, it might be good to first send SIGINT and then SIGKILL a few secs later
[21:52:56] instead of SIGKILL immediately
[21:54:48] That's configurable per-queue for all but running out of memory; but you're right that it's a good idea.
[21:55:01] Hi. How to edit /etc/hosts and add our new established wiki on it?
[21:55:05] (both would solve my issues)
[21:55:06] :q
[21:57:53] valhallasw`cloud: How are you sure it doesn't set up a process group? You actually checked that? :-)
[21:57:55] https://tools.wmflabs.org/sigma/created.py?name=Ebraminio&server=fawikivoyage&ns=,,&redirects=none
[21:58:34] Coren: eeeerm. That's a good point, actually. The pgid's are indeed different. Meh.
[22:00:16] ebraminio, "sql fawikivoyage" gives "This is unknown db to me, if you don't like that, blame petan on freenode"....
[22:00:17] hmm
[22:00:37] * Coren sets the terimnation to sigint the entire process group.
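(What "sigint the entire process group" amounts to, sketched as a shell fragment; $pgid is a placeholder for the group id recorded for the job. Signalling the negative pgid hits the wrapper script and everything it spawned, and the SIGKILL only matters for whatever ignored the polite request.)

    kill -INT -- -"$pgid"            # gentle: interrupt every process in the group
    sleep 5                          # give them a few seconds to wrap up
    kill -KILL -- -"$pgid" 2>/dev/null || true   # then reap anything still hanging around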
[22:00:48] valhallasw`cloud: That's probably more gentle now.
[22:00:57] Krenair: that command using /etc/hosts in order to resolve IP of replication
[22:01:03] okay, let's see if I can get that to work then
[22:01:20] MariaDB [enwikivoyage_p]> use fawikivoyage_p;
[22:01:21] ERROR 1049 (42000): Unknown database 'fawikivoyage_p'
[22:01:38] hmm
[22:01:50] database 'fawikivoyage' exists
[22:02:21] Krenair: Yes is not on enwikivoyage ip
[22:02:24] perhaps
[22:02:32] but no fawikivoyage_p
[22:02:35] I tested sql fawiki
[22:02:35] * valhallasw`cloud hugs Coren
[22:02:37] Coren will know more about this
[22:02:52] hmm
[22:02:56] Thank you
[22:03:12] New databases are only added at regular interval, usually monthly when the maintain-replicas script is run.
[22:03:37] Coren: I'm still a bit confused why a process would be able to escape SGE's supervision, but I'm happy to have the phab irc bot working under SGE now :-)
[22:03:45] Coren: Thank you
[22:03:48] “Sorry! This site is experiencing technical difficulties.
[22:03:49] Try waiting a few minutes and reloading.
[22:03:49] (Cannot contact the database server: Too many connections ())”
[22:03:50] hmmm
[22:03:59] And thank you Krenair
[22:04:33] valhallasw`cloud: It doesn't so much escape as, by default, gridengine was a bit overly trusting. :-)
[22:04:57] Yeah... fawikivoyage was only created on the 2nd of this month
[22:07:11] Coren: working much better now, although there still seem to be some issues… I think that there's something wrong with new /var overwriting old /var https://dpaste.de/E8yF
[22:07:50] andrewbogott: New var never overwites old one; it's mounted in /tmp until the reboot.
[22:08:01] * Coren wonders
[22:08:14] What's the instance name?
[22:08:34] testlabs-trusty-swap
[22:08:48] you'll have to use root, though. I can't log in as myself
[22:09:28] I see, at least, that it created the volume group and volumes. That's a good sign.
[22:09:52] Hm. Mind if I reboot it?
[22:10:12] not at all
[22:10:14] all yours
[22:11:52] there we go, got mysql restarted. woo
[22:12:27] At first glance, it looks like the copy didn't work. Hm.
[22:18:50] andrewbogott: That's really odd. It looks like it didn't actually reboot to finish the installation.
[22:19:08] it says it's rebooting, in the log...
[22:19:14] and it must've if /var is writable now?
[22:19:29] * andrewbogott looks again
[22:21:19] The logs have long scrolled past recovery. :-(
[22:21:34] yeah
[22:21:46] I'm sure I saw it say it was rebooting in the logs. There were a few errors about /var before that though
[22:21:51] Let me start anew...
[22:22:23] Tell me when so I can watch too
[22:25:13] Wait, did I miss the initial run? This looks like a second run.
[22:26:20] I started with the wrong image the first time!
[22:26:21] Try testlabs-trusty-swap3
[22:26:26] swap2 is a mistake
[22:26:30] Ah!
[22:26:47] Oh duh!
[22:26:50] I see the error
[22:27:19] /tmp/var does not exist ?
[22:27:47] Yes, because /sbin/mkdir. I have *no* idea whose ass I pulled /sbin/ of.
[22:28:02] So it never mounts the new FS for the copies.
[22:28:07] for posterity: https://dpaste.de/H4QE
[22:28:12] oh of course
[22:28:30] it's not the location that's missing, it's mkdir itself
[22:29:42] At least, that should be a 'just change firstboot.sh' fix
[22:30:18] I note that I was correct in my guess; vmbuilder stupidly and quietly special cases "only one partition" into "use the whole disk, damn the setting"
[22:32:59] Coren: I'm going to vanish for a bit but I've started new images abuilding. I'll ping you or email when there are new test results.
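(For reference, the firstboot.sh logic being debugged above boils down to something like the sketch below, which is not the actual script: read the free gap parted reports after the root partition, turn it into a second partition, and hand that to LVM as the "vd" volume group mentioned earlier. The device name, partition numbering and the awk parsing of `parted -m ... print free` are assumptions.)

    DISK=/dev/vda
    # machine-readable 'print free' marks unallocated gaps with a trailing "free;"
    GAP=$(parted -m -s "$DISK" unit MiB print free | awk -F: '/free;/ {print $2, $3}' | tail -n 1)
    START=${GAP%% *}
    END=${GAP##* }
    parted -s "$DISK" mkpart primary "$START" "$END"   # fails as seen above if START/END are bogus
    pvcreate "${DISK}2"                                # assumes root is partition 1
    vgcreate vd "${DISK}2"
    lvcreate -n swap -L 2G vd && mkswap /dev/vd/swap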
[22:59:34] Is there some tool on wmflabs for text searching wiki pages?
[23:15:38] i don't need them, but i noticed there are dumps missing: dewiki-20140918-pages-meta-history{3,4}.bz2
[23:16:46] 3Wikimedia Labs / 3deployment-prep (beta): "There was an unexpected error logging in" when creating accounts on Beta - 10https://bugzilla.wikimedia.org/71862#c12 (10Sam Reed (reedy)) p:5Normal>3Highes Seems to be affecting production now...
[23:17:26] gifti: file a bug :)
[23:17:38] hm, ok
[23:18:41] Reedy: I wonder if we missed changing something in the patches that rearranged CentralAuth
[23:20:25] that would be my first guess too
[23:20:33] CA seems to be the most likely candidate
[23:20:54] * legoktm uninstalls locally
[23:20:58] i guess that is wikimedia labs/infrastructure?
[23:21:35] Depends... Where exactly are they missing from?
[23:21:48] ffs logstash is borked
[23:22:03] nope, installed CA and it's still broken
[23:22:37] er, uninstalled*
[23:26:56] // Validate the login token
[23:26:56] if ( $this->mToken !== self::getLoginToken() ) {
[23:27:01] that check is failing
[23:27:09] but, that's a login check not create account...
[23:27:10] Is it possible for you to upgrade PHP on Tool Labs? Or I need to painstakingly build my own?
[23:30:20] it's not thinkign account creations are account creations.
[23:31:16] found it.
[23:31:43] Zhaofeng_Li: I don't think so until tool labs upgrades to trusty (guessing it's still using precise), you'll probably have to build your own.
[23:32:00] legoktm: Okay, thanks.
[23:32:15] Reedy, bd808: https://gerrit.wikimedia.org/r/#/c/163775/
[23:33:43] now, how to fix it...
[23:34:26] Is there some tool on wmflabs for text searching wiki pages? I want to find usage of an extension on wikipedia which doesn't have any tracking category or such, any help?
[23:34:52] * bd808 looks at that patch and wonders what it broke
[23:36:30] bd808: it's no longer passing wpCreateaccount=1 to the form, so it thinks you're logging in instead of creating an account, and those things use two different tokens
[23:36:41] ah.
[23:36:42] and they obviously don't match, hence session failure errors
[23:37:24] and we saw this in beta but blamed beta setup instead of bad code :(
[23:37:50] legoktm: The easy fix is revert
[23:38:13] I'm about half done with a proper fix, it'll take me like 10 more minutes to test it properly
[23:38:19] ok
[23:40:30] legoktm: looks like we did miss something in CA -- readfile(/srv/mediawiki/php-1.25wmf3/extensions/CentralAuth/includes/specials/../1x1.png): failed to open stream
[23:40:47] :/
[23:40:55] that's the autologin for non-JS users I think
[23:41:13] in includes/specials/SpecialCentralAutoLogin.php
[23:41:17] I'll make a patch