[00:36:37] (03PS1) 10Andrew Bogott: Add some dummy passwords for nova in heira [labs/private] - 10https://gerrit.wikimedia.org/r/264220 [00:37:35] YuviPanda: around? [00:37:37] (03CR) 10Andrew Bogott: [C: 032 V: 032] Add some dummy passwords for nova in heira [labs/private] - 10https://gerrit.wikimedia.org/r/264220 (owner: 10Andrew Bogott) [00:47:03] !log wikimetrics Fixed some things in wikimetrics and wikimetrics-deploy for db creation and migration to work. Prod server initialize stage complete [00:49:17] !log wikimetrics Deploying to prod wikimetrics, which also restarts all services. So good so far [00:53:23] !log wikimetrics Setup temporary proxy to metrics-prod.wmflabs.org. All good, web and queue are up, scheduler seems to be failing though [00:53:53] madhuvishy: yeah [00:54:12] (03PS1) 10Andrew Bogott: More dummy hiera passwords for keystone. [labs/private] - 10https://gerrit.wikimedia.org/r/264225 [00:54:32] YuviPanda: was running into some issues with running alembic upgrade - found more hardcoded paths in code [00:54:45] but i fixed those [00:54:53] (03CR) 10Andrew Bogott: [C: 032 V: 032] More dummy hiera passwords for keystone. [labs/private] - 10https://gerrit.wikimedia.org/r/264225 (owner: 10Andrew Bogott) [00:55:02] madhuvishy: ok :D [00:59:06] YuviPanda: celery beat is looking for a folder to run from - any best practices? it was set to /var/run/wikimetrics/celerybeat_scheduled_tasks and /var/run/wikimetrics/celerybeat.pid [01:02:53] (03PS1) 10Andrew Bogott: Remove a couple of redundant passwords. [labs/private] - 10https://gerrit.wikimedia.org/r/264227 [01:03:34] (03CR) 10Andrew Bogott: [C: 032 V: 032] Remove a couple of redundant passwords. [labs/private] - 10https://gerrit.wikimedia.org/r/264227 (owner: 10Andrew Bogott) [01:05:42] (03PS1) 10Andrew Bogott: Remove yet more redundant passwords. [labs/private] - 10https://gerrit.wikimedia.org/r/264228 [01:06:01] (03CR) 10Andrew Bogott: [C: 032 V: 032] Remove yet more redundant passwords. [labs/private] - 10https://gerrit.wikimedia.org/r/264228 (owner: 10Andrew Bogott) [01:09:01] !log wikimetrics Found more config path issues for the scheduler. Fixed. All services are running on prod wikimetrics [01:11:29] YuviPanda: i don't seem to have merge rights on secrets/wikimetrics [01:11:45] merge https://gerrit.wikimedia.org/r/#/c/263669/ when you can [01:21:55] 10Tool-Labs-tools-Erwin's-tools: Kill huge query to avoid killing all erwin85 tools - https://phabricator.wikimedia.org/T123613#1935979 (10MZMcBride) Thank you for the quick fix! <3 [01:48:50] madhuvishy: I added rights for you [01:49:17] YuviPanda: thanks. Can you also add mforns, nuria and milimetric [01:49:27] madhuvishy: sure [01:49:34] thanks again [01:49:56] madhuvishy: done [01:50:05] cool [01:51:38] madhuvishy: np. 
thanks for fixing it all up :) [01:51:43] madhuvishy: next step: limn1 :) [01:51:55] :P [01:52:14] YuviPanda: :) db has not been moved over yet - when it's all up, i'll get cake for us :P [01:52:44] some of limn1 will go away when the dashiki fabric + puppet is adopted [01:53:23] and Dan is making a new layout for the browser reports, which should deprecate most of the old limn reports [01:53:27] we'll get there [01:53:28] :) [01:56:32] madhuvishy: :) just remember that if something crashes hard on the limn instance we can't really do much [01:56:49] ya okay [02:20:43] !log wikimetrics Add rest of analytics team as wikimetrics project admins [02:44:59] 6Labs, 10wikitech.wikimedia.org: Exclude nova resource pages from *default* wikitech search - https://phabricator.wikimedia.org/T122993#1936095 (10Tgr) Labs-project pages could actually be very useful if people used them to document those projects (which quite often does happen). The machine pages are indeed n... [03:22:32] YuviPanda: You wouldn't happen to still be on, would you? [03:25:28] So guys, I have a Python Flask tool on Labs that I just barely got working a few days ago. I messed with the environment to try to up-patch it from Python 2 to Python 3, but I couldn't figure it out. [03:25:42] I tried to roll the change back, but now I'm getting a 404 error. [03:26:07] What baffles me is that the webservice reports itself as running, but the page is a 404. [03:26:49] If I replace the app.py file that is running everything with the app.py quickstarter (http://flask.pocoo.org/docs/0.10/quickstart/), there is no issue, and output is as expected: "Hello World!" [03:28:11] The complex tool's app.py works in a local environment on my home system without issue. [03:49:09] 6Labs: Flask app on uwsgi-python variably fails - https://phabricator.wikimedia.org/T123704#1936128 (10ResMar) 3NEW [03:56:44] ResMar: Have you seen https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Python_.28uwsgi-plain.29 ? [03:57:34] I think the "mount = /myproject=foo:bar" bit is the important part for py3 [03:57:50] No I haven't, thanks for the reference. [03:58:29] yw. There is a bit more detail at https://phabricator.wikimedia.org/T104374#1911373 [03:58:30] I was having a hard time parsing the information in the file that was posted to the Phab ticket, since I don't see a `.sock` file in my application (is one created? is it just a placeholder?) [03:58:42] That I did see [03:58:56] This happens in Python 2: https://phabricator.wikimedia.org/T123704 [03:59:58] valhallasw was looking at it earlier, but we didn't figure out what was going on, I don't think [04:00:24] I got the application running by turning debug on and off again, somehow. [04:00:55] heisenbugs! [04:01:09] Yeah :-( My experiences on Labs have been...interesting. [04:01:37] it is certainly more tested with php. [04:01:49] I haven't tried making any python web apps yet [04:01:55] (on tools) [04:02:20] Oh man, we need less things in PHP, not more. :) [04:02:37] heh. opinions vary [04:03:24] Yeah. [04:04:36] I guess I'll just have to sit on it for now [04:05:07] I think the sock file will be created by uwsgi once it starts [04:05:44] What do you think? [04:05:44] looking at your app.py module I think the example "mount = /myproject=app:app" should match your code [04:06:09] Blow everything up and try to move to Python 3? [04:06:38] oh. so it's python2 now and not working? [04:06:54] Right now? Yeah. [04:06:59] hmm...
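For reference, the uwsgi-plain setup described on that Help:Tool_Labs/Web page boils down to a small uwsgi.ini next to the code. A minimal sketch follows; the tool name "mytool", the directory layout and the app:app callable are placeholders, not the actual configuration of either tool being debugged here (for Python 3 the plugin line would be python3 instead):

    [uwsgi]
    plugins = python
    # uwsgi creates this socket itself when it starts; it is not a file you make by hand
    socket = /var/run/mytool/uwsgi.sock
    chdir = /data/project/mytool/www/python/src
    venv = /data/project/mytool/www/python/venv
    # expose the Flask object "app" defined in app.py under the tool's URL prefix
    mount = /mytool=app:app
    manage-script-name = true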
[04:07:39] Maybe if I move to Better Faster Stronger Python it'll start working :-) Just maybe Yuvi should take a look at it first and try to troubleshoot [04:16:06] ResMar: https://tools.wmflabs.org/bd808-test2/ [04:16:35] It looks like it is working? [04:16:43] It does [04:16:47] Did you port it to 2? [04:16:50] Er, 3. [04:16:54] Or is this in Python 2? [04:17:01] just 2 [04:17:06] Uh... [04:17:20] I followed these steps -- https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Python_2_.28uwsgi.29 [04:17:25] Did I get a lemon Lab tools instance? :) [04:18:08] maybe? [04:19:10] ResMar: I added you to that tool (bd808-test2) so you can poke around [04:20:03] I can't become it just yet, hmm [04:20:23] you have to log out and log back in to get new permissions [04:20:31] Yeah I figured [04:20:49] I usually switch bastions just out of paranoia too [04:22:26] Ok looking around [04:24:02] 6Labs: Flask app on uwsgi-python variably fails - https://phabricator.wikimedia.org/T123704#1936154 (10bd808) I set this tool up as bd808-test2 with these commands: ``` $ become bd808-test2 $ mkdir -p www/python $ cd www/python $ git clone https://github.com/ResidentMario/signpostlab.git src $ cd src $ virtualen... [04:24:56] Ok, so I've been poking through it and the only difference that I found was that I had a wsgi.ini file and you did not [04:25:13] I deleted that and, unsurprisingly, that didn't fix the issue [04:25:19] :/ [04:25:25] That is a little unnerving [04:25:35] One instance that works and one that doesn't, on the same server [04:25:50] well on the same job grid anyhow [04:26:51] How do I get your job grid? :0 [04:27:18] and you are not getting any error output in error.log or wsgi.log? [04:27:28] *uwsgi.log [04:27:52] uwsgi.log indicates that it's working, as does error.log [04:29:05] that 404 page makes me think that portgrabber (the bit that tells the proxy how to find your container) messed up [04:29:25] Maybe, let me look up what v had to say [04:30:17] Shoot, I don't think he filed a bug. [04:30:35] When he was going through it I think he said that there was a 500 crash that was being returned by the software as a 404 [04:32:46] 6Labs: Flask app on uwsgi-python variably fails - https://phabricator.wikimedia.org/T123704#1936158 (10ResMar) Yep---so that's one instance that works and one that doesn't, on the same service. Someone gave me a lemon! [04:34:35] I'm out of ideas for helping :/ [04:36:26] ¯\_(ツ)_/¯ [04:36:51] Thanks though. [04:37:09] I guess at some point you are going to want to try for python3 to be able to use mwapi [04:37:18] Yeah [04:37:38] While we're on the topic, how difficult is to learn how to send scheduled jobs via the grid? [04:37:52] There's a couple of scripts I'd like to run on a weekly basis off of Labs [04:38:01] But I don't have any prior experience with chron jobs [04:38:13] And Labs is not usually the most...transparent...place. 
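A minimal example of the crontab-plus-jsub pattern described in the replies just below, assuming a hypothetical weekly script; the schedule and the paths are made up:

    # m h dom mon dow   command          (runs Sundays at 03:17, submitted to the grid)
    17 3 * * 0   /usr/bin/jsub -once /data/project/mytool/weekly_report.sh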
[04:39:53] basically you just add a cron entry to `jsub ...` the script you want to run [04:40:33] I would personally make a shell script that does all the hard work, test it manually, and then add the cron entry [04:41:10] uh oh, shell scripts [04:41:44] `man -s5 crontab` explains a cron entry pretty well [04:42:32] the shell script might be as simple as "jsub my_script.py" [04:42:50] Maybe, with my luck, doubt it though :-) [04:43:46] I'll give it a go once this patch is taped over, though [04:49:17] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms, 5Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1936169 (10Yaron_Koren) I can't see the problem - was it fixed? I should note tha... [04:49:25] Thanks for your help! Got to go now. [06:31:14] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms, 5Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1936218 (10Florian) > Wikitech will be rolled back to 1.27.0-wmf.9 so stuff isn't... [09:21:06] chasemp: on your earlier question wrt "where is that 1h caching set out of curiousity": that's configured with the "positive-time-to-live" values in /etc/nscd.conf [13:57:55] (03CR) 108ohit.dua: "I'm abandoning this patch set and copying the contents from github to here." [labs/tools/bub] - 10https://gerrit.wikimedia.org/r/129709 (owner: 108ohit.dua) [13:59:06] (03Abandoned) 108ohit.dua: [WIP] coming_soon [labs/tools/bub] - 10https://gerrit.wikimedia.org/r/129709 (owner: 108ohit.dua) [14:09:14] Hi all, I wanted to ask about a possible bug in the enwiki events log [14:09:48] this is an example item from such log [14:09:48] --- [14:09:48] [14:09:48] 68818859 [14:09:49] 2015-09-04T16:21:40Z [14:09:49] [14:09:49] Colipon [14:09:49] 14772 [14:09:49] [14:09:50] current title is a bit too informal sounding [14:09:50] move [14:09:51] move [14:10:28] prior to February 2015 the params tag contained just the new name of the page [14:10:52] is it expected that now the now format is that wierd-looking string? [14:43:02] 6Labs: Flask app on uwsgi-python variably fails - https://phabricator.wikimedia.org/T123704#1936736 (10ResMar) 5Open>3Resolved [14:45:18] 6Labs: Flask app on uwsgi-python variably fails - https://phabricator.wikimedia.org/T123704#1936738 (10ResMar) After re-doing setup it's up again. Hopefully I just missed a step somewhere and didn't realize it. [14:47:55] bd808: I redid setup and it's working again :-) [14:48:01] Hopefully I jsut missed a step somewhere [14:48:31] I closed the Phab issue. Will try to roll it over to Python 3 so I can use mwapi some other time, though, kind of OCD about it at this point. [15:48:00] (03PS16) 10ArthurPSmith: Added a Wikidata-based "chart of the nuclides" under /nuclides [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 [15:49:20] (03CR) 10ArthurPSmith: [C: 031] "Ok, this works with python3!" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 (owner: 10ArthurPSmith) [16:11:24] moritzm: thanks! [17:39:44] chasemp, YuviPanda: yeah, TORQUE (https://en.wikipedia.org/wiki/TORQUE) [17:52:37] chasemp, YuviPanda: I've been trying to figure out if there is any documentation on the SGE config on the toolserver, but I can't find much. 
What I could find: the sge qmaster ran on the 'HA cluster' ( https://www.mediawiki.org/wiki/Toolserver:Admin:HA_cluster ), which means that it did not use NFS and did not have a shadow master [17:53:03] oh interesting [17:53:27] basically, it's a comparable setup to our NFS server, but then the server also ran a whole host of other services [17:59:27] I've heard the downtiem there was sometimes significant as well [18:01:58] mostly not SGE related though. mostly NFS. and LDAP. Sounds familiar :D [18:02:51] yes :) [18:03:30] ah, and replication broke a lot [18:04:40] mostly it looks like you could just grab all toolserver mails, replace 'toolserver' with 'tool labs' and you wouldn't notice a thing ;-) [18:05:56] eheh, well I'd love to change that :) [18:06:17] valhallasw`cloud: maybe late next week or into teh one after I can try to show you some speculative thinking [18:06:29] on what I'm leaning towards, this is jsut a heads up that I'd like to pick your brain [18:06:37] as you probably have some of the best use case semantics [18:06:46] sure [18:08:11] also, just as a note: there are some people around who know more about SGE. Merlissimo/Merl is still around, nosy and DaB. might know some stuff, and river (=user:kate, who basically set up the toolserver back in the day) seems to have recently resurfaced [18:09:27] sweet thanks for the insight, I can try to cast a wider net [18:09:40] especially for when we are in trouble :) [18:27:24] PROBLEM - SSH on tools-mail-01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:13] !log tools tools-mail-01 is locked up I am rebooting [18:34:30] I don't know if tools-mail-01 is signifiacnt but there is a tools-mail [18:34:36] so it's not exactly clear [18:37:15] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0) [18:48:31] chasemp: YuviPanda sge master dead again? [18:48:39] Got cron error mail [18:48:46] But not at computer [18:49:02] I'm not at my computer* [18:49:39] seems up [18:49:49] there was some issue with tools-mail-01 [18:52:48] I woldn't be surprised if old or weird emails were stalled [18:52:54] but master seems up and I can submit jobs etc afaict [18:52:56] so far [18:53:49] maybe :) [18:59:48] general health seems ok [19:01:57] chasemp: yeah, another bdb crash + a bit of downtime [19:02:04] 01/15/2016 18:27:10| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery [19:02:13] so it recovered on it's own? [19:02:17] and it was juts back up when you checked [19:02:18] yeah [19:02:34] what log is that in? [19:02:46] in /data/project/.system/gridengine/spool/qmaster/messages [19:02:50] k [19:03:28] qping -info tools-grid-master 6444 qmaster 1 [19:03:33] ^ this shows uptime, among others [19:03:49] grrrreat [19:03:54] via http://thread.gmane.org/gmane.org.wikimedia.toolserver/6077/focus=6079 [19:06:02] so that's happening a lot [19:08:54] * YuviPanda waves [19:09:03] I think [19:09:08] we should maybe do a bdb dump and recover? [19:09:24] I was just reading teh same thing, we should at least do a dump somewhere [19:09:33] I mean...if we lose this db totally do we have any clue what is in it? [19:09:36] I know settings and not just queue [19:09:49] no [19:09:55] kk [19:10:05] I think just information on running jobs? [19:10:10] the configuration is in... errr [19:10:17] aren't the settings in a bdb as well? 
might not be the same one [19:10:18] *checks toolserver docs* [19:10:24] I'm almost postive when you choose bdb it stores settings there [19:10:28] for queue's etc [19:10:40] and it make believe's it's a file with /tmp when you do like qconf -mq [19:10:42] then I think that means the toolserver didn't use bdb queuing [19:10:43] and then saves to db [19:11:15] "The qmaster configuration and state is in /global/misc/sge62, but generally SGE is configured by commands which can be run on any host" https://www.mediawiki.org/wiki/Toolserver:Admin:Sun_Grid_Engine [19:11:23] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1937379 (10Tgr) What would such a namespace accomplish? You can always name a page `Portal:Something` (or, if you prefer plain English, `Something... [19:11:47] if I stop the master do exec nodes give out or does just new job submission stop working? [19:11:52] I mean, do ongoing things keep chugging? [19:12:06] just new job submissions [19:12:07] obv new crons take a big hit [19:12:09] and qstats [19:12:44] (03PS2) 10Yuvipanda: passwords: add root key for valhallasw [labs/private] - 10https://gerrit.wikimedia.org/r/264110 (https://phabricator.wikimedia.org/T123655) (owner: 10Merlijn van Deen) [19:13:03] (03CR) 10Yuvipanda: [C: 032 V: 032] "andrew +1'd in task. I'll also add him to cloudadmin. \o/ <3" [labs/private] - 10https://gerrit.wikimedia.org/r/264110 (https://phabricator.wikimedia.org/T123655) (owner: 10Merlijn van Deen) [19:17:06] yeah, it might be the './spool/spooldb/sge' database (as compared to ./spool/spooldb/sge_job) [19:17:45] that's what I was thinking [19:17:47] sge: Berkeley DB (Btree, version 9, native byte-order) [19:17:47] sge_job: Berkeley DB (Btree, version 9, native byte-order) [19:17:55] var/spool/gridengine/spooldb [19:19:08] otoh, it's 3.4MB (!) [19:19:13] that's a lot of config [19:20:48] but it is config. the keys are stuff like 'ADMINHOST:tools-bastion-01.eqiad.wmflabs\x00','COMPLEX_ENTRY:arch\x00', MANAGER:valhallasw\x00, etc. [19:21:25] how did you see that? db_dump? [19:21:28] I'm not sure if the value actually means anything (''\x00\x00\x00\x00\x10\x02\x00\x00\x00\x00\x00\x02\x00\x00\x00\x01\x00valhallasw\x00' for my manager entry) [19:21:38] https://docs.python.org/2/library/bsddb.html on one of the backup databases [19:21:47] the one in /data/project/.system/gridengine/spool.bak/spooldb [19:25:55] so assuming the sge db is settings [19:26:00] and the sge_jobs one is the queue [19:26:07] I imagine teh queue db is correct [19:26:09] corrupt I mean [19:26:17] yeah, I think so [19:26:19] and the only reason it recovers is that teh sge one is semi sane [19:26:24] and so it keeps coming back up and retrying [19:26:36] if the sge db was corrupt (i.e. a config change that hit the nfs outage) [19:26:44] we would be in deeper poop than now [19:27:10] at least parts of it are puppetized, but it would be good to do a full dump of the configuration somewhere [19:27:24] yes desperately we need to do that [19:27:29] I can't figure out exactly how yet [19:27:36] even a plan text crappy version we can work from [19:27:55] so interesting is [19:28:01] I have a test master somewhere [19:28:06] and if I totally remove the sge_job db [19:28:07] https://arc.liv.ac.uk/SGE/howto/backup.html ? ;-) [19:28:09] and restart [19:28:12] it creates a new one [19:28:23] can it still see existing jobs? 
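As an aside, the key-by-key inspection described a few lines up (the ADMINHOST:..., MANAGER:... entries in the settings database) can also be done with the Berkeley DB command-line tools, as long as they are pointed at the spool.bak copy and not the live environment. A sketch:

    # printable dump of the settings database vs. the job database, from the backup copy
    db_dump -p /data/project/.system/gridengine/spool.bak/spooldb/sge | less
    db_dump -p /data/project/.system/gridengine/spool.bak/spooldb/sge_job | less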
[19:28:37] I would imagine no but there were no jobs there atm [19:28:40] I can submit one and test [19:28:59] YuviPanda: we can mass restart all web services right [19:29:02] and cron will sort itself out [19:29:06] but task and continuous? [19:29:09] would be lost? [19:29:39] at least partially [19:29:48] bigbrother might reschedule some continuous jobs [19:29:52] or rather [19:30:12] the jobs would probably happily purr on, and we could get issues with double irc bots etc [19:30:36] chasemp: the last time we did this we rebooted all the machines [19:30:38] to stop this problem [19:30:41] of double bots [19:32:35] chasemp: https://github.com/valhallasw/son-of-gridengine/blob/d1673d47d84fa526548657ed8e0771fd1f5cac26/source/dist/util/upgrade_modules/save_sge_config.sh [19:32:43] and https://github.com/valhallasw/son-of-gridengine/blob/d1673d47d84fa526548657ed8e0771fd1f5cac26/source/dist/util/upgrade_modules/load_sge_config.sh [19:33:34] ok I'll pull those downa nd try them [19:33:46] fwiw on my test setup which is admittedly trivial [19:33:49] but same versions etc [19:33:58] I stopped sge, removed the entire queue db, started it [19:34:03] and then succesffully submitted a new job [19:34:15] so I'm thinking about that [19:36:48] 6Labs, 10Tool-Labs: tools.taxonbot and tools.giftbot cronjobs not firing - https://phabricator.wikimedia.org/T123186#1937492 (10Giftpflanze) 2016-01-01 00:00 UTC and 2016-01-15 00:00 UTC: 0 0 1,15 * * jlocal... [19:37:36] ^ given the timing, I'm pretty sure it's a load issue [19:37:37] ^ could we do something about that or is there a workaround? [19:38:18] gifti: workaround: fire a few minutes earlier or later [19:38:43] yeah, i thought so [19:38:51] will do [19:42:29] valhallasw`cloud: did you see the etherpad I sent? :) [19:42:41] do you plan on doing something about the load? [19:42:45] YuviPanda: yeah. looks good. [19:42:54] gifti: I'm not sure what's causing the load. [19:43:11] ah [19:43:12] migrating to a more powerful vm is possible, but will take time [19:48:34] ok [19:48:39] now I'm going to look at our stats again [19:49:15] so once I sorted out the path issues that actually does make a plan text dump of things [19:49:23] it test anyway [19:49:28] in test [19:51:48] parentcommandline in the eventlogging schema is fascinating [19:53:04] https://phabricator.wikimedia.org/P2477 [19:54:14] * * * * * /usr/bin/jsub [19:54:21] times 5 [19:54:24] let me look [19:54:46] wtf [19:56:17] valhallasw`cloud: can you get on the master? [19:56:20] root/emergency_sge_dump [19:56:51] that looks good [19:57:26] yeah, looks good to me (although I wouldn't be able to say if anything is missing) [19:58:00] me neither [19:58:04] but the things I know to check for are there? [19:58:09] not a great indicator tho :) [19:58:17] but this is teh best backup we have [19:58:19] and also [19:58:23] no issues dumping the //settings// db [19:58:27] 24217 total invocations for cluebot [19:58:33] 12123 without -once [19:58:34] this clearly doesn't fool w/ the active queue [19:58:39] that's almost 50% without -once [19:58:44] * YuviPanda investigates the ones without once [19:58:48] do we allow jobs to submit other jobs? 
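On the "plain text version we can work from" point above: roughly the sort of thing save_sge_config.sh automates can also be pulled piecemeal with qconf, which only talks to the running qmaster and never opens the spool files. A sketch, with an arbitrary backup directory; the object types shown are the common ones, not necessarily everything the spool db holds:

    BACKUP=/root/sge-config-$(date +%F); mkdir -p "$BACKUP"
    qconf -sconf  > "$BACKUP/global.conf"     # global cluster configuration
    qconf -ssconf > "$BACKUP/scheduler.conf"  # scheduler configuration
    qconf -sel    > "$BACKUP/exec_hosts"      # execution host list
    qconf -sul    > "$BACKUP/usersets"        # userset / ACL list
    qconf -sql    > "$BACKUP/queues"          # cluster queue list
    for q in $(qconf -sql); do qconf -sq "$q" > "$BACKUP/queue_$q"; done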
[19:58:54] aaah [19:58:56] job [19:58:58] that's double [19:59:02] let me stop recording 'job' [19:59:04] that's useless [19:59:38] nope [19:59:40] without 'job' [19:59:42] there are 0 things [19:59:44] without -once [19:59:46] from cluebot [20:00:31] oh [20:00:34] nvm [20:00:36] my sql was wrong [20:00:38] hmm [20:01:03] my sql is still wrong [20:01:48] bah [20:01:50] my code is wrong [20:02:24] I was emitting 'job' as 'jsub' [20:02:29] oh well [20:02:31] fixable [20:02:54] yeah [20:02:58] 0 things without -once [20:03:00] from cluebot [20:03:25] valhallasw`cloud: I wonder if we can / should change cluebot's crontab [20:03:39] YuviPanda: chasemp : more from the archives: https://meta.wikimedia.org/wiki/Toolserver/Stable_server/Candidates [20:03:52] YuviPanda: I would leave a message at their talk page / create a task in phab [20:03:56] and only act if there's no response [20:04:19] valhallasw`cloud: yeah [20:04:24] valhallasw`cloud: so I'm going to look at their crontab [20:04:27] and then do those things [20:04:47] hmm [20:04:51] YuviPanda: I think our best bet is to wipe out the queue db [20:04:53] actually [20:04:55] they're all -once [20:05:06] I made a dump of the settings db in /root [20:05:10] and am backing it up locally [20:05:20] chasemp: during the migration to your new master or? [20:05:23] and it seems like from testing if you remove the spool db it recreates [20:05:36] well, I think we shoudl do it now honestly [20:05:38] if we can [20:05:43] friday afternoon? :) [20:05:48] yeah shitty timing [20:05:53] but it's crashing hundreds of tiems a day [20:06:10] hmm [20:06:12] I guess, what's teh fallout? risk of dupe jobs? [20:06:13] chasemp: I have a better idea [20:06:23] have we tried the dump + load thing yet? [20:06:24] new master [20:06:25] please :) [20:06:36] import configuration, but with the other spool system, and not on NFS [20:06:39] YuviPanda: I have dumped yes but it's //settings// not queue [20:06:56] and I think considering the state of the queue db it's not sane to try to recover it [20:07:06] db_recover seesm to basically want a journal file [20:07:09] and it wants to roll back [20:07:12] no [20:07:14] not recover [20:07:17] but a dump and reload [20:07:21] which should ignore the journal [20:07:31] then I think we can switch over by changing the file that indicates the master [20:07:32] db_dump do you mean? [20:07:37] db_dump [20:07:37] althugh I'm not sure what that does to execds :/ [20:07:57] chasemp: db_dump writes out TXT, then db_load [20:08:06] valhallasw`cloud: yeah I agree but not today I think or not unless we have to today [20:08:21] there is some crazy path issues on the current master I'm not entirly sure of everthing [20:08:27] YuviPanda: does that work for you? [20:09:04] if the queue db is corrupt and it's crashing on handling teh qeueue the shortest path to no crashes seems to be to start a fresh queue [20:09:27] I don't think dumping the current one and trying to recover it makes sense as it's corrupt [20:09:33] chasemp: so we have to announce 'we're going to kill the queue database' anyway [20:09:35] and db_recover doesn't seem feasible [20:09:41] so in that sense it doesn't necessarily matter [20:09:57] I think db_dump/db_load should be OK because it's an issue in bdb consistency, not in the contents [20:10:02] yeah [20:10:06] am going to try that now [20:10:16] not on the live one!! [20:10:44] apparently that can already cause corruption [20:12:08] valhallasw`cloud: db_dump can cause corruption? 
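For context on why db_recover was set aside: standard Berkeley DB recovery runs against the environment directory and replays the log.* journal files, so it only helps when those journals are intact. A sketch of what it would look like, master stopped first; -c asks for "catastrophic" recovery using all available log files:

    service gridengine-master stop
    db_recover -v -h /var/spool/gridengine/spooldb
    # or, if plain recovery is not enough and the full journal is present:
    db_recover -c -v -h /var/spool/gridengine/spooldb
    service gridengine-master start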
[20:12:14] not db_load [20:12:18] db_load probably can [20:12:20] I don't understand how dumping a corrupt db and restoring it is going to work out? [20:12:21] db_dump shouldn't [20:12:24] YuviPanda: yes [20:12:32] that's what https://arc.liv.ac.uk/SGE/howto/backup.html says [20:12:33] it may lock it to dump [20:12:39] there is a db_hostbackup or something [20:12:54] too late because I did it anyway [20:13:10] YuviPanda: "Caution: This method is not safe if Berkeley DB spooling is in use, the qmaster is running, and it opens the spool database in ‘private’ mode to allow spooling to a network filesystem. " [20:13:11] >_< [20:13:35] well [20:13:39] I'm going to go for lunch [20:13:41] YuviPanda: are you dumping teh spool db or the sge one? [20:13:46] chasemp: I dumped sge [20:13:49] into /tmp/bdb [20:13:51] and /tmp/wat [20:13:53] for the text file [20:13:54] that's not the corrupt db I don't think [20:13:57] ok [20:13:59] well [20:14:03] or wasn't [20:14:03] I've to go for lunch now [20:14:22] I should probably stop doing things [20:14:32] since at this point both chasemp and valhallasw`cloud definitely know way more about this than I do :| [20:14:36] I hope I didn't corrupt sge [20:14:44] YuviPanda: I think it's OK [20:14:50] well I did jsut take a dump of it too [20:14:52] there's a "01/15/2016 20:11:53| timer|tools-grid-master|E|error checkpointing berkeley db: (13) Permission denied" in the log, but that sounds OK [20:14:54] I'll brb soon [20:15:03] I would highly reccomend not doing anything drastic today [20:15:05] brb [20:15:21] is a fresh queue drastic? serious question [20:15:29] if it loses all running jobs: yes [20:15:32] valhallasw`cloud: so I basically agree with all you said before [20:15:46] but the mass of insanity here makes me very afraid of the move right now [20:15:57] coming back to why db_dump / db_load could work: it forces the database structure to be fresh [20:16:08] hm [20:16:24] I guess I figure [20:16:49] well I'm not sure [20:16:54] it's also the least drastic option we have [20:17:05] other than 'don't touch it while it's still half-alive' [20:17:15] yes [20:17:17] true [20:17:26] let me try it on a test setup to see what happens [20:17:48] chasemp: ./inst_sge -bup is supposed to be the official way [20:18:00] to do dump and restore? [20:18:13] to backup, at least [20:18:15] is that command on the master now? (I don't think it is) [20:18:16] ah [20:18:18] I can't just do [20:18:20] db_dump [20:18:24] but that might just be config [20:18:27] I think you can [20:18:28] ok [20:18:31] db_dump is the low level one [20:18:32] it looks like [20:18:37] but you have to shut down master I think [20:18:38] if you dump an empty queue [20:18:39] file [20:18:40] all you see is [20:18:47] VERSION=3 [20:18:48] format=bytevalue [20:18:48] type=btree [20:18:49] db_pagesize=4096 [20:18:51] HEADER=END [20:18:53] DATA=END [20:18:55] hah. That makes sense [20:18:55] so that's interesting [20:20:31] chasemp: it basically makes a dump where every octet is represented by a hexadecimal number [20:20:33] is db_dump invasive? [20:20:53] yes we think it is [20:21:07] *nod* [20:21:14] so if I stop the master, job submission dies [20:21:19] obv insight into things as well [20:21:23] yes. But the dump should be pretty fast [20:22:06] chasemp: dumping one of the backup BDB files took only 3 secs [20:23:08] ok I'm going to try that [20:23:10] hold on to your hat [20:23:38] oddly enough the hex dump is 4 MB... 
while the bdb file is 40 [20:23:57] but I can't figure out db_load yet [20:24:09] uh well it did not work out [20:24:23] ? [20:24:26] service gridengine-master stop && db_dump sge_job > /root/emergenc_sge_job_dump/sge_job_dump && service gridengine-master start [20:24:30] and all I got was [20:24:45] also I saw a dupe master process [20:24:50] which could also account for the corruption [20:24:57] huuuh. [20:25:10] all I got was /root/emergenc_sge_job_dump/sge_job_dump [20:25:22] that seems like the deal [20:25:47] yeah that job file is empty I think maybe [20:25:51] 8K [20:26:19] chasemp: you need /data/project/.system/gridengine/spool/spooldb [20:26:19] ok so [20:26:24] var/spool/gridengine/spooldb [20:26:47] I think these are symlinked [20:26:58] yeah [20:27:39] service gridengine-master stop [20:27:42] does not kill the master [20:27:43] :) [20:27:48] that's why my start at the end [20:27:50] started a second one [20:27:56] oooooooh k [20:28:46] chasemp: in the meanwhile, I'll try to trace your second master observation [20:28:50] etc/init.d/gridengine-master stop similarly does not kill the amster [20:29:04] I killed the second master proc manually when I saw it [20:29:49] chasemp: sudo lastcomm | grep qmaster [20:30:35] there's a few starts every 30 mins [20:30:47] ok so [20:30:55] I had to kill that runaway master proc [20:31:00] it responded to nothing nicely [20:31:10] and then I made sure I could start it sanely with no other changes [20:31:11] seems so [20:31:19] so I stopped it nicely and that was ok [20:31:22] and then I dumped the job db [20:31:24] and restarted [20:31:29] and it seemed to work out [20:31:37] so we have a running thing w/ old same bdb file [20:31:49] and a dump of that file which if i restored it would lose anything from now till then [20:31:54] but mainly I wanted to see all that happen [20:32:03] root/emergenc_sge_job_dump/sge_job.dump [20:33:07] ok, so then the next question is how to reload it into a bdb file [20:33:34] seemed to work in test: db_load -f my_dump sge_job [20:33:44] but I want to do it with a running job and see it come back here [20:35:19] valhallasw`cloud: what's your wikitech name? [20:35:26] (going to add you to a project) [20:35:29] chasemp: Merlijn van Deen [20:36:15] so this is my teset master totally unrelated [20:36:15] tool-master-05.tool-renewal.eqiad.wmflabs [20:36:19] it has been installed with no nfs [20:36:38] the setup there is speculative so it's hard to explain but I can execute jobs and such [20:36:47] so you can see what I'm seeing [20:37:08] * valhallasw`cloud tries logging in [20:38:14] chasemp: ok, I'm in [20:38:30] I jsut started a job that basically echos and sleeps [20:38:32] qstat shows it [20:38:45] yep [20:39:00] and I dumped unsafely into root@tool-master-05:/var/spool/gridengine/spooldb# file foo [20:39:17] *nod* [20:39:21] I'm thinking about whether the queue settings here are the same enough to get a sense of job resurrection [20:39:30] afa stopping master && dump && restore && starting [20:40:26] chasemp: so there are three things I think we should try: 1) stop/restart (should be OK), 2) stop/delete/restart (should lose the job), 3) stop/dump/move spool dir to .bak/restore in new dir/start [20:40:48] or do you want to restore the toollabs jobs list here? 
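One more low-risk check that fits in here: Berkeley DB also ships db_verify, which reports whether a database file is structurally consistent, and it can be run against a copy so the live spool is never opened. A sketch, with paths assumed from the earlier discussion:

    cp /var/spool/gridengine/spooldb/sge_job /tmp/sge_job.copy
    db_verify /tmp/sge_job.copy && echo "sge_job structure looks sane"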
[20:41:06] so I was kind of thinking about that but I dn' thave all teh queues etc [20:41:08] in that case I think you need to restore the queues as well which might be nontrivial [20:41:09] so it wouldn't really work out [20:41:13] :) [20:41:15] buuut [20:41:23] you can just copy the sge bdb file as well? [20:41:31] hmmm [20:41:43] worst case scenario is that it tries to write to wrong directories and gives up, I think [20:41:49] there has to be a reason taht is a bad idea but I can't think of it yet [20:41:51] :) [20:41:58] yeah [20:42:12] it doesn't have tcp access to tool labs, so I think it should be fine? [20:42:15] the current tools master has all kinds of sym link shenanigans and such [20:42:30] well, project isolation isn't as hard and fast as it seems [20:42:48] but it shouldn't cause issues as sge seems to rely on exec's etc [20:42:52] knowing where to get their stuff at [20:43:00] but then again yeah it's a thing [20:43:15] I could also use the dump script, change the master name before importing [20:43:15] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Tom29739 was created, changed by Tom29739 link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Tom29739 edit summary: Created page with "{{Tools Access Request |Justification=To power a bot to get rid of attack pages on the English Wikipedia. |Completed=false |User Name=Tom29739 }}" [20:43:16] mmm [20:43:27] let me check something [20:44:19] I'm goign to stop the current test master tho? [20:44:22] ready? [20:44:47] valhallasw@tool-master-05:/var/spool/gridengine/spooldb$ telnet tools-exec-1205 6445 [20:44:48] Trying 10.68.17.91... [20:44:51] ^ so that should be OK [20:44:57] sure [20:45:11] kk [20:46:00] so service gridengine-master stop && db_dump sge_job > foo && db_load -f foo sge_job && service gridengine-master start [20:46:07] dumped and came back and the job was still running [20:46:19] so it says [20:46:29] chasemp: did that restore to a different file? [20:46:37] because I'm not sure what restoring to the same file does [20:47:11] so your thought is [20:47:36] service gridengine-master stop && db_dump sge_job > foo && rm -f sge_job && db_load -f foo sge_job && service gridengine-master start [20:48:23] service gridengine-master stop && db_dump sge_job > foo && cd .. && mv spooldb spooldb.old && mkdir spooldb && cd spooldb && db_load -f ../spooldb.old/foo sge_job && cp ../spooldb.old/sge . && service gridengine-master start [20:50:49] ok I understand what you are thinking now [20:51:07] idk what [20:51:07] __db.001 __db.002 __db.003 log.0000002927 [20:51:08] are [20:51:20] or if it will be bad if I don't return them (on the prod install) [20:51:47] chasemp: log are bdb log files (which I interpret as journaling) [20:52:01] any ideas on __db.001? [20:52:44] well at least I have them test as well [20:52:46] so we shall see [20:52:46] https://stackoverflow.com/questions/8957999/what-does-db-001-mean-in-berkeley-database [20:52:47] :) [20:53:04] so I suppose they might disappear when you shut down the master? [20:53:08] interesting [20:53:18] assuming we close out db ops nicely I guess yeah [20:55:15] so I did [20:55:24] service gridengine-master stop && ps -ef | grep grid && db_dump sge_job > foo && cd .. && mv spooldb spooldb.old && mkdir spooldb && cd spooldb && db_load -f ../spooldb.old/foo sge_job && cp ../spooldb.old/sge . 
&& service gridengine-master start && ps -ef | grep grid [20:55:27] things did not come back [20:55:46] 01/15/2016 20:55:02| main|tool-master-05|E|couldn't open database environment for server "local spooling", directory "/var/spool/gridengine/spooldb": (13) Permission denied [20:55:58] ah [20:56:06] I suppose a chown sgeadmin:sgeadmin is needed :-) [20:56:48] https://lists.wikimedia.org/pipermail/toolserver-l/2009-April/001995.html *grin* [20:56:53] did it and still not yet [20:57:26] 01/15/2016 20:56:47| main|tool-master-05|E|couldn't open berkeley database "sge": (22) Invalid argument [20:57:26] mmm [20:57:43] but it did create the __db.X files [20:57:54] where did you see that error? [20:58:09] in /var/spool/gridengine/qmaster/messages [21:00:15] it wanted the .log file [21:00:27] huh. [21:00:47] oh! the .log file belonged to sge, not sge_job [21:00:57] there you go yes [21:01:03] so we should probably dump and reloda both at the same time [21:02:20] that's old tho now I think [21:02:23] oops [21:02:35] I had an irc weirdness nvmd on that response [21:03:22] so it still let me stop that job [21:03:26] even after all that [21:03:31] and the job was for sure still running [21:06:09] that sounds good [21:07:26] so we could do this either today or on monday. I think it should be a fairly safe operation, but if it breaks there's not a lot of time to fix it, and the weekend is prime time for volunteer work [21:07:57] another intersting thing is [21:07:58] 01/15/2016 18:46:53| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery [21:08:03] which was happening every 4 minutes or so [21:08:07] seems to have trailed off for a bit [21:08:16] hrm. [21:08:27] and I haven't seen one in an hour and hafl? [21:08:30] roughly since I restarted [21:08:40] well maybe [21:09:47] first one I can find is 12/30/2015 02:47:55| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery. [21:10:58] I see other gaps in the failure logs tho [21:27:00] !log wikimetrics stopping old prod queue and scheduler [21:28:13] !log wikimetrics set up symlinks for static files on new prod server [21:28:27] !log wikimetrics importing db data to new prod server [21:32:39] I am back [21:33:18] yo [21:34:18] wow lots of backscroll [21:34:22] this is lovely! 
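Putting the pieces from the tool-master-05 test together (the sge and sge_job databases need to move as a pair, the log.* journal belongs to the sge side, and the new spool directory has to end up owned by sgeadmin), the dump-and-reload cycle that finally worked looks roughly like this. A reconstruction from the log, not the literal command history:

    cd /var/spool/gridengine
    service gridengine-master stop
    ps -ef | grep sge_qmaster            # make sure no stray master is still running
    mv spooldb spooldb.old && mkdir spooldb
    db_dump spooldb.old/sge      > /root/sge.dump
    db_dump spooldb.old/sge_job  > /root/sge_job.dump
    db_load -f /root/sge.dump     spooldb/sge
    db_load -f /root/sge_job.dump spooldb/sge_job
    chown -R sgeadmin:sgeadmin spooldb   # the master runs as sgeadmin and must own the environment
    service gridengine-master start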
[21:34:40] I wonder if the corruption is because of two masters running [21:34:54] so [21:35:07] at some point on the master now it seems to not respond to the service script [21:35:19] and when I go to stop it it ignores and when I go to start it [21:35:22] it starts a second [21:35:27] so that's interesting [21:35:29] but yeah [21:35:33] could def do the deal [21:36:54] so assuming it's starting two masters on the same host, that suggests it might correlate with the new tools-grid-master [21:37:01] ok [21:37:02] I jsut confirmed [21:37:06] puppet starts a dupe master pro [21:37:08] proc [21:37:21] and that proc wants to do all the same things including manage the queue file [21:37:26] woooo [21:37:26] I disabled puppet [21:37:30] I might know what's happening [21:37:36] there's no 'status' [21:37:39] for the gridengine msater process [21:37:42] it was alrady running [21:37:42] so puppet just starts them [21:37:44] Notice: /Stage[main]/Gridengine::Master/Service[gridengine-master]/ensure: ensure changed 'stopped' to 'running' [21:37:44] Info: /Stage[main]/Gridengine::Master/Service[gridengine-master]: Unscheduling refresh on Service[gridengine-master] [21:37:46] and that happend [21:37:47] there's a way around it [21:37:48] and then [21:37:54] root@tools-grid-master:/var/spool/gridengine# ps -ef | grep grid [21:37:55] sgeadmin 15161 1 19 21:31 ? 00:00:58 /usr/lib/gridengine/sge_qmaster [21:37:55] root 16313 1 0 21:36 ? 00:00:00 /usr/lib/gridengine/sge_qmaster [21:37:56] root 16573 3492 0 21:36 pts/2 00:00:00 grep --color=auto grid [21:38:01] and see the second one started as ROOT [21:38:20] so that will no doubt cause issues I would think [21:38:28] how freaking long has this been going on [21:38:54] chasemp: https://github.com/wikimedia/operations-puppet/commit/3a29a719fa409aa58339b056a4d68fc448f45532 [21:39:01] 17 nov. [21:39:17] > If a service’s init script does not support any kind of status command, you should set hasstatus to false and either provide a specific command using the status attribute or expect that Puppet will look for the service name in the process table [21:39:33] which doesn't explain why it worked without obvious issues for a month [21:39:42] maybe it's a race condition for corruptino [21:39:45] and it just took awhile [21:39:49] 6Labs: Rename labcontrol1001 to labtestweb2001 - https://phabricator.wikimedia.org/T123790#1937988 (10Andrew) 3NEW [21:39:53] but once it gets hosed up [21:39:54] then it's on [21:40:21] so [21:40:25] could some of those errors [21:40:29] be one of the two master procs [21:40:33] bumping into the other [21:41:07] Yeah, I think so. Unfortunately, the log file doesn't log the process id :/ [21:41:12] I mean, two masters trying to manage all the same things that's just [21:41:16] well that's bound to be insane [21:41:20] hi all - I will be physically at the event tomorrow, I hope to say hello to -labs people [21:41:47] 6Labs: Rename labcontrol2001 to labtestweb2001 - https://phabricator.wikimedia.org/T123790#1938009 (10Andrew) [21:42:26] chasemp: is this also happening on your test host? 
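For reference, the Puppet documentation quoted above maps onto the service resource like this. A sketch of the shape of the fix being discussed, not the actual operations-puppet change, and the qping status command is only an illustration borrowed from earlier in the log:

    service { 'gridengine-master':
      ensure    => running,
      # the init script has no 'status' action, so tell Puppet exactly how to check
      # instead of letting it guess from the process table and start a second master
      hasstatus => false,
      status    => 'qping -info tools-grid-master 6444 qmaster 1',
    }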
[21:42:33] (03PS1) 10Yuvipanda: Stop recording eventlogging messages for `job` [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264374 [21:42:35] oh, I'm root, I can check this myself ;-) [21:42:41] valhallasw`cloud: I don't have puppet trying to start anything [21:42:51] I think it's puppet not understanding the lack of status as described [21:42:58] and there is no custom status command [21:43:01] and it's just a house of cards [21:43:05] but why doesn't it start it on your host then? [21:43:21] I'm sure it would if it was coming from puppet :) [21:43:32] that's all me walking through it manually to write my own puppetization [21:43:34] ah! [21:43:59] so the shadow corruption idea was pretty close I guess [21:44:04] it's just it was all on the same VM [21:44:49] should I write a patch for puppet to make it not start it? [21:44:58] err [21:45:03] at least fix the status and hasstatus stuff? [21:45:15] I would want to play with a custom status command for a bit and I don't want to rush into it (pun) [21:45:20] I would be cool w/ no auto start for the weekend? [21:45:33] I'm tailing things now [21:45:33] let's revert it and see if ew get any more database warnings [21:45:35] yeah sure [21:45:39] looking for corruption indicators [21:45:40] or just disable puppet for the weekend [21:45:41] let's do that [21:45:49] to see if even w/ one proc it throws crazy errors [21:46:02] valhallasw`cloud: ha same thought yeah [21:46:29] (03CR) 10jenkins-bot: [V: 04-1] Stop recording eventlogging messages for `job` [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264374 (owner: 10Yuvipanda) [21:46:32] what a damn mess, this is what caused a lot of that drama I bet [21:46:46] yeah [21:47:22] fwiw my next move was [21:47:23] echo "stopping master" && service gridengine-master stop && echo "------" && ps -ef | grep grid && echo "remove foo" && rm -f foo && ls && db_dump sge_job > foo && cp -p sge_job /root/ && ls /root/ && md5sum sge_job && rm -f sge_job && db_load -f foo sge_job && md5sum sge_job && chown sgeadmin:sgeadmin sge_job && service gridengine-master start; echo $? [21:47:32] and I can verify the rebuilt binary is a different hash [21:47:48] i.e. I think the rebuilding even of the same file is enough of a change to overcome some versions of corruption [21:48:05] but I'm holding off now [21:48:46] YuviPanda: can you revert that? I gotta brb for a minute [21:48:51] yeah [21:48:53] am on it [21:49:35] So I'm still reading the toolserver-l archives, and we had the same issue with tools not having multiple maintainers, but worse because the accounts needed renewal. I think the comments I made then are still true for tool labs tools with a single maintainer *grin* [21:49:36] https://lists.wikimedia.org/pipermail/toolserver-l/2012-May/004967.html [21:50:26] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1938073 (10bd808) >>! In T123427#1937379, @Tgr wrote: > What would such a namespace accomplish? You can always name a page `Portal:Something` (or,... 
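On the "just disable puppet for the weekend" option mentioned above, it is worth leaving a note for whoever looks next; the agent accepts a message, though the wording here is made up:

    puppet agent --disable "gridengine-master: avoid duplicate sge_qmaster procs until the service status handling is fixed"
    # and once the fix is in:
    puppet agent --enable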
[21:52:04] 6Labs: Rename labcontrol2001 to labtestweb2001 - https://phabricator.wikimedia.org/T123790#1938082 (10Andrew) [21:52:05] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Investigate decommissioning labcontrol2001 - https://phabricator.wikimedia.org/T118591#1938081 (10Andrew) [21:52:11] valhallasw`cloud: <3 [21:52:17] bd808: ^^ ( valhallasw`cloud's link) [21:53:05] ok I'm back and I realized I literally haven't left the house since sunday evening [21:53:28] chasemp: I haven't left since Saturday afternoon! [21:54:58] nice [21:55:03] YuviPanda, valhallasw`cloud: I recorded that on https://www.mediawiki.org/wiki/User_talk:BDavis_%28WMF%29/Projects/Tool_Labs_support for posterity [21:55:27] chasemp: I think my record was 13 days [21:55:50] I need to go out today though to get a bday card for my significant other [21:59:53] YuviPanda: I do see a lot of [21:59:53] 01/15/2016 21:58:51|worker|tools-grid-master|E|The job -j of user(s) tools.toolschecker does not exist [22:00:03] is that from teh canary check somehow? [22:00:16] does it maybe uncleanly handle a job or soemthing [22:06:51] chasemp: so I just found the old Toolserver puppet repository! I'm not sure yet if there's anything useful, but there might be some gridengine stuff in there [22:07:44] 6Labs, 10Tool-Labs, 5Patch-For-Review: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1938150 (10chasemp) After much shenanigans it was determined that Puppet in it's ignorance was starting a second master processes which we believe was conflicting (it may have started more tha... [22:07:58] neat I didn't know it was puppetized :) [22:10:14] no, me neither :-) I just found a single e-mail mentioning it on the mailing list [22:11:50] YuviPanda, chasemp: https://github.com/valhallasw/ts-puppet -- not much in there, though. [22:17:05] :D [22:17:17] chasemp: that's probably a canary gone wrong [22:19:52] it happens a lot but I'm not sure what to make of it [22:20:09] there's a qstat -j [22:20:23] and if the joblist following it is empty [22:20:26] for some reason [22:20:28] I think that causes that error [22:20:38] anyway [22:20:46] gotcha [22:20:46] chasemp: valhallasw`cloud I merged a patch reverting the ensure => runner [22:20:58] YuviPanda: no, that should give scheduling info: queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1201.eqiad.wmflabs" dropped because it is temporarily not available etc [22:21:09] hm [22:21:53] maybe qdel -j? [22:21:56] hmm [22:21:58] maybe [22:22:04] no [22:22:06] there's only one -js [22:22:08] -j [22:22:10] and that's for a qstat [22:26:07] anyway, time for bed [22:26:09] g'night! [22:26:39] valhallasw`cloud: thank you much and good night [22:30:17] valhallasw`cloud: <3 good night [22:31:27] YuviPanda: can you look at https://gerrit.wikimedia.org/r/#/c/264395/ [22:31:28] it's small [22:31:59] basically I got some brain dump from ma.rk this morning and part of that outcome was to look at the raid monitoring there and it seems like it shoudl be using mdadm etc [22:33:40] chasemp: left a comment [22:34:26] chasemp: should I reenable puppet on master and check? [22:34:30] so the script in general takes no args [22:34:34] ah [22:34:35] and does a lot of self determination [22:34:36] ok [22:34:38] then fine [22:34:44] like figuring out os and utililty etc [22:34:51] so I'm basically tryign to say true to form for it [22:35:03] and since this is run under the guise of nagios and nrpe [22:35:05] can you add a comment there saying this? 
[22:35:09] because I'm sure this comes from upstream [22:35:12] and is copied in [22:35:19] my commit msg I think says basically that [22:35:21] gtg otherwise [22:35:22] or maybe it's confusing? [22:35:33] right, but I think these things should be left as comments inline :) [22:35:40] ah sure [22:35:41] I get it [22:35:45] I thought you meant a gerrit comment [22:35:46] sure [22:35:51] ah [22:35:53] right [22:35:55] no, inline in the source [22:36:24] chasemp: anyway, objections to starting puppet again on master [22:36:26] ? [22:36:48] darkblue_b: I might be, apparently there was a registration and I did not know this [22:38:16] YuviPanda: nope but do it in console to make sure it doesn't do more crappy things :) [22:38:33] YuviPanda: eh https://gerrit.wikimedia.org/r/#/c/264395/ [22:38:51] chasemp: yah [22:39:16] chasemp: self termination? :) [22:39:28] it takes no args and tries to figure out what to run and when to run it [22:39:34] maybe that only makes sense to me [22:39:39] chasemp: shouldn't that be self determination? [22:39:47] oh cripes [22:40:17] heh nice [22:40:21] thanks updating [22:40:59] +1 [22:41:01] d [22:49:32] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Tom29739 was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=258392 edit summary: [22:58:43] chasemp: btw didn't touch gridengine master [22:58:50] (when I re-enabled puppet, that is) [23:00:22] k [23:00:42] seems good [23:00:59] I honestly think puppet would just keep thrashing master procs against each other until the master dies [23:01:06] but I don't want to go backwards to test that [23:01:17] heh [23:55:44] (03CR) 10Yuvipanda: [C: 032] "Debian Glue is broken." [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264374 (owner: 10Yuvipanda) [23:56:46] (03CR) 10Yuvipanda: [V: 032] Stop recording eventlogging messages for `job` [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264374 (owner: 10Yuvipanda)