[00:36:37] (03PS1) 10Andrew Bogott: Add some dummy passwords for nova in heira [labs/private] - 10https://gerrit.wikimedia.org/r/264220 [00:37:35] YuviPanda: around? [00:37:37] (03CR) 10Andrew Bogott: [C: 032 V: 032] Add some dummy passwords for nova in heira [labs/private] - 10https://gerrit.wikimedia.org/r/264220 (owner: 10Andrew Bogott) [00:47:03] !log wikimetrics Fixed some things in wikimetrics and wikimetrics-deploy for db creation and migration to work. Prod server initialize stage complete [00:49:17] !log wikimetrics Deploying to prod wikimetrics, which also restarts all services. So good so far [00:53:23] !log wikimetrics Setup temporary proxy to metrics-prod.wmflabs.org. All good, web and queue are up, scheduler seems to be failing though [00:53:53] madhuvishy: yeah [00:54:12] (03PS1) 10Andrew Bogott: More dummy hiera passwords for keystone. [labs/private] - 10https://gerrit.wikimedia.org/r/264225 [00:54:32] YuviPanda: was running into some issues with running alembic upgrade - found more hardcoded paths in code [00:54:45] but i fixed those [00:54:53] (03CR) 10Andrew Bogott: [C: 032 V: 032] More dummy hiera passwords for keystone. [labs/private] - 10https://gerrit.wikimedia.org/r/264225 (owner: 10Andrew Bogott) [00:55:02] madhuvishy: ok :D [00:59:06] YuviPanda: celery beat is looking for a folder to run from - any best practices? it was set to /var/run/wikimetrics/celerybeat_scheduled_tasks and /var/run/wikimetrics/celerybeat.pid [01:02:53] (03PS1) 10Andrew Bogott: Remove a couple of redundant passwords. [labs/private] - 10https://gerrit.wikimedia.org/r/264227 [01:03:34] (03CR) 10Andrew Bogott: [C: 032 V: 032] Remove a couple of redundant passwords. [labs/private] - 10https://gerrit.wikimedia.org/r/264227 (owner: 10Andrew Bogott) [01:05:42] (03PS1) 10Andrew Bogott: Remove yet more redundant passwords. [labs/private] - 10https://gerrit.wikimedia.org/r/264228 [01:06:01] (03CR) 10Andrew Bogott: [C: 032 V: 032] Remove yet more redundant passwords. [labs/private] - 10https://gerrit.wikimedia.org/r/264228 (owner: 10Andrew Bogott) [01:09:01] !log wikimetrics Found more config path issues for the scheduler. Fixed. All services are running on prod wikimetrics [01:11:29] YuviPanda: i don't seem to have merge rights on secrets/wikimetrics [01:11:45] merge https://gerrit.wikimedia.org/r/#/c/263669/ when you can [01:21:55] 10Tool-Labs-tools-Erwin's-tools: Kill huge query to avoid killing all erwin85 tools - https://phabricator.wikimedia.org/T123613#1935979 (10MZMcBride) Thank you for the quick fix! <3 [01:48:50] madhuvishy: I added rights for you [01:49:17] YuviPanda: thanks. Can you also add mforns, nuria and milimetric [01:49:27] madhuvishy: sure [01:49:34] thanks again [01:49:56] madhuvishy: done [01:50:05] cool [01:51:38] madhuvishy: np. 
thanks for fixing it all up :) [01:51:43] madhuvishy: next step: limn1 :) [01:51:55] :P [01:52:14] YuviPanda: :) db has not been moved over yet - when it's all up, i'll get cake for us :P [01:52:44] some of limn1 will go away when the dashiki fabric + puppet is adopted [01:53:23] and Dan is making a new layout for the browser reports, which should deprecate most of the old limn reports [01:53:27] we'll get there [01:53:28] :) [01:56:32] madhuvishy: :) just remember that if something crashes hard on the limn instance we can't really do much [01:56:49] ya okay [02:20:43] !log wikimetrics Add rest of analytics team as wikimetrics project admins [02:44:59] 6Labs, 10wikitech.wikimedia.org: Exclude nova resource pages from *default* wikitech search - https://phabricator.wikimedia.org/T122993#1936095 (10Tgr) Labs-project pages could actually be very useful if people used them to document those projects (which quite often does happen). The machine pages are indeed n... [03:22:32] YuviPanda: You wouldn't happen to still be on, would you? [03:25:28] So guys, I have a Python Flask tool on Labs that I just barely got working a few days ago. I messed with the environment to try to up-patch it from Python 2 to Python 3, but I couldn't figure it out. [03:25:42] I tried to roll the change back, but now I'm getting a 404 error. [03:26:07] What baffles me is that the webservice reports itself as running, but the page is a 404. [03:26:49] If I replace the app.py file that is running everything with the app.py quickstarter (http://flask.pocoo.org/docs/0.10/quickstart/), there is no issue, and output is as expected: "Hello World!" [03:28:11] The complex tool's app.py works in a local environment on my home system without issue. [03:49:09] 6Labs: Flask app on uwsgi-python variably fails - https://phabricator.wikimedia.org/T123704#1936128 (10ResMar) 3NEW [03:56:44] ResMar: Have you seen https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Python_.28uwsgi-plain.29 ? [03:57:34] I think the "mount = /myproject=foo:bar" bit is the important part for py3 [03:57:50] No I haven't, thanks for the reference. [03:58:29] yw. There is a bit more detail at https://phabricator.wikimedia.org/T104374#1911373 [03:58:30] I was having a hard time parsing the information in the file that was posted to the Phab ticket, since I don't see a `.sock` file in my application (is one created? is it just a placeholder?) [03:58:42] That I did see [03:58:56] This happens in Python 2: https://phabricator.wikimedia.org/T123704 [03:59:58] valhallasw was looking at it earlier, but we didn't figure out what was going on, I don't think [04:00:24] I got the application running by turning debug on and off again, somehow. [04:00:55] heisenbugs! [04:01:09] Yeah :-( My experiences on Labs have been...interesting. [04:01:37] it is certainly more tested with php. [04:01:49] I haven't tried making any python web apps yet [04:01:55] (on tools) [04:02:20] Oh man, we need less things in PHP, not more. :) [04:02:37] heh. opinions vary [04:03:24] Yeah. [04:04:36] I guess I'll just have to sit on it for now [04:05:07] I think the sock file will be created by uwsgi once it starts [04:05:44] What do you think? [04:05:44] looking at your app.py module I think the example "mount = /myproject=app:app" should match your code [04:06:09] Blow everything up and try to move to Python 3? [04:06:38] oh. so it's python2 now and not working? [04:06:54] Right now? Yeah. [04:06:59] hmm...
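For reference, the uwsgi-plain setup described on that Help:Tool_Labs/Web page boils down to a small uwsgi.ini next to the code. A minimal sketch follows; the tool name "mytool", the directory layout and the app:app callable are placeholders, not the actual configuration of either tool being debugged here (for Python 3 the plugin line would be python3 instead):

    [uwsgi]
    plugins = python
    # uwsgi creates this socket itself when it starts; it is not a file you make by hand
    socket = /var/run/mytool/uwsgi.sock
    chdir = /data/project/mytool/www/python/src
    venv = /data/project/mytool/www/python/venv
    # expose the Flask object "app" defined in app.py under the tool's URL prefix
    mount = /mytool=app:app
    manage-script-name = true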
[04:07:39] Maybe if I move to Better Faster Stronger Python it'll start working :-) Just maybe Yuvi should take a look at it first and try to troubleshoot [04:16:06] ResMar: https://tools.wmflabs.org/bd808-test2/ [04:16:35] It looks like it is working? [04:16:43] It does [04:16:47] Did you port it to 2? [04:16:50] Er, 3. [04:16:54] Or is this in Python 2? [04:17:01] just 2 [04:17:06] Uh... [04:17:20] I followed these steps -- https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Python_2_.28uwsgi.29 [04:17:25] Did I get a lemon Lab tools instance? :) [04:18:08] maybe? [04:19:10] ResMar: I added you to that tool (bd808-test2) so you can poke around [04:20:03] I can't become it just yet, hmm [04:20:23] you have to log out and log back in to get new permissions [04:20:31] Yeah I figured [04:20:49] I usually switch bastions just out of paranoia too [04:22:26] Ok looking around [04:24:02] 6Labs: Flask app on uwsgi-python variably fails - https://phabricator.wikimedia.org/T123704#1936154 (10bd808) I set this tool up as bd808-test2 with these commands: ``` $ become bd808-test2 $ mkdir -p www/python $ cd www/python $ git clone https://github.com/ResidentMario/signpostlab.git src $ cd src $ virtualen... [04:24:56] Ok, so I've been poking through it and the only difference that I found was that I had a wsgi.ini file and you did not [04:25:13] I deleted that and, unsurprisingly, that didn't fix the issue [04:25:19] :/ [04:25:25] That is a little unnerving [04:25:35] One instance that works and one that doesn't, on the same server [04:25:50] well on the same job grid anyhow [04:26:51] How do I get your job grid? :0 [04:27:18] and you are not getting any error output in error.log or wsgi.log? [04:27:28] *uwsgi.log [04:27:52] uwsgi.log indicates that it's working, as does error.log [04:29:05] that 404 page makes me think that portgrabber (the bit that tells the proxy how to find your container) messed up [04:29:25] Maybe, let me look up what v had to say [04:30:17] Shoot, I don't think he filed a bug. [04:30:35] When he was going through it I think he said that there was a 500 crash that was being returned by the software as a 404 [04:32:46] 6Labs: Flask app on uwsgi-python variably fails - https://phabricator.wikimedia.org/T123704#1936158 (10ResMar) Yep---so that's one instance that works and one that doesn't, on the same service. Someone gave me a lemon! [04:34:35] I'm out of ideas for helping :/ [04:36:26] ¯\_(ツ)_/¯ [04:36:51] Thanks though. [04:37:09] I guess at some point you are going to want to try for python3 to be able to use mwapi [04:37:18] Yeah [04:37:38] While we're on the topic, how difficult is to learn how to send scheduled jobs via the grid? [04:37:52] There's a couple of scripts I'd like to run on a weekly basis off of Labs [04:38:01] But I don't have any prior experience with chron jobs [04:38:13] And Labs is not usually the most...transparent...place. 
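A minimal example of the crontab-plus-jsub pattern described in the replies just below, assuming a hypothetical weekly script; the schedule and the paths are made up:

    # m h dom mon dow   command          (runs Sundays at 03:17, submitted to the grid)
    17 3 * * 0   /usr/bin/jsub -once /data/project/mytool/weekly_report.sh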
[04:39:53] basically you just add a cron entry to `jsub ...` the script you want to run [04:40:33] I would personally make a shell script that does all the hard work, test it manually, and then add the cron entry [04:41:10] uh oh, shell scripts [04:41:44] `man -s5 crontab` explains a cron entry pretty well [04:42:32] the shell script might be as simple as "jsub my_script.py" [04:42:50] Maybe, with my luck, doubt it though :-) [04:43:46] I'll give it a go once this patch is taped over, though [04:49:17] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms, 5Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1936169 (10Yaron_Koren) I can't see the problem - was it fixed? I should note tha... [04:49:25] Thanks for your help! Got to go now. [06:31:14] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms, 5Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1936218 (10Florian) > Wikitech will be rolled back to 1.27.0-wmf.9 so stuff isn't... [09:21:06] chasemp: on your earlier question wrt "where is that 1h caching set out of curiousity": that's configured with the "positive-time-to-live" values in /etc/nscd.conf [13:57:55] (03CR) 108ohit.dua: "I'm abandoning this patch set and copying the contents from github to here." [labs/tools/bub] - 10https://gerrit.wikimedia.org/r/129709 (owner: 108ohit.dua) [13:59:06] (03Abandoned) 108ohit.dua: [WIP] coming_soon [labs/tools/bub] - 10https://gerrit.wikimedia.org/r/129709 (owner: 108ohit.dua) [14:09:14] Hi all, I wanted to ask about a possible bug in the enwiki events log [14:09:48] this is an example item from such log [14:09:48] --- [14:09:48] [14:09:48] 68818859 [14:09:49] 2015-09-04T16:21:40Z [14:09:49] [14:09:49] Colipon [14:09:49] 14772 [14:09:49] [14:09:50] current title is a bit too informal sounding [14:09:50] move [14:09:51] move [14:10:28] prior to February 2015 the params tag contained just the new name of the page [14:10:52] is it expected that now the now format is that wierd-looking string? [14:43:02] 6Labs: Flask app on uwsgi-python variably fails - https://phabricator.wikimedia.org/T123704#1936736 (10ResMar) 5Open>3Resolved [14:45:18] 6Labs: Flask app on uwsgi-python variably fails - https://phabricator.wikimedia.org/T123704#1936738 (10ResMar) After re-doing setup it's up again. Hopefully I just missed a step somewhere and didn't realize it. [14:47:55] bd808: I redid setup and it's working again :-) [14:48:01] Hopefully I jsut missed a step somewhere [14:48:31] I closed the Phab issue. Will try to roll it over to Python 3 so I can use mwapi some other time, though, kind of OCD about it at this point. [15:48:00] (03PS16) 10ArthurPSmith: Added a Wikidata-based "chart of the nuclides" under /nuclides [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 [15:49:20] (03CR) 10ArthurPSmith: [C: 031] "Ok, this works with python3!" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 (owner: 10ArthurPSmith) [16:11:24] moritzm: thanks! [17:39:44] chasemp, YuviPanda: yeah, TORQUE (https://en.wikipedia.org/wiki/TORQUE) [17:52:37] chasemp, YuviPanda: I've been trying to figure out if there is any documentation on the SGE config on the toolserver, but I can't find much. 
What I could find: the sge qmaster ran on the 'HA cluster' ( https://www.mediawiki.org/wiki/Toolserver:Admin:HA_cluster ), which means that it did not use NFS and did not have a shadow master [17:53:03] oh interesting [17:53:27] basically, it's a comparable setup to our NFS server, but then the server also ran a whole host of other services [17:59:27] I've heard the downtiem there was sometimes significant as well [18:01:58] mostly not SGE related though. mostly NFS. and LDAP. Sounds familiar :D [18:02:51] yes :) [18:03:30] ah, and replication broke a lot [18:04:40] mostly it looks like you could just grab all toolserver mails, replace 'toolserver' with 'tool labs' and you wouldn't notice a thing ;-) [18:05:56] eheh, well I'd love to change that :) [18:06:17] valhallasw`cloud: maybe late next week or into teh one after I can try to show you some speculative thinking [18:06:29] on what I'm leaning towards, this is jsut a heads up that I'd like to pick your brain [18:06:37] as you probably have some of the best use case semantics [18:06:46] sure [18:08:11] also, just as a note: there are some people around who know more about SGE. Merlissimo/Merl is still around, nosy and DaB. might know some stuff, and river (=user:kate, who basically set up the toolserver back in the day) seems to have recently resurfaced [18:09:27] sweet thanks for the insight, I can try to cast a wider net [18:09:40] especially for when we are in trouble :) [18:27:24] PROBLEM - SSH on tools-mail-01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:13] !log tools tools-mail-01 is locked up I am rebooting [18:34:30] I don't know if tools-mail-01 is signifiacnt but there is a tools-mail [18:34:36] so it's not exactly clear [18:37:15] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0) [18:48:31] chasemp: YuviPanda sge master dead again? [18:48:39] Got cron error mail [18:48:46] But not at computer [18:49:02] I'm not at my computer* [18:49:39] seems up [18:49:49] there was some issue with tools-mail-01 [18:52:48] I woldn't be surprised if old or weird emails were stalled [18:52:54] but master seems up and I can submit jobs etc afaict [18:52:56] so far [18:53:49] maybe :) [18:59:48] general health seems ok [19:01:57] chasemp: yeah, another bdb crash + a bit of downtime [19:02:04] 01/15/2016 18:27:10| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery [19:02:13] so it recovered on it's own? [19:02:17] and it was juts back up when you checked [19:02:18] yeah [19:02:34] what log is that in? [19:02:46] in /data/project/.system/gridengine/spool/qmaster/messages [19:02:50] k [19:03:28] qping -info tools-grid-master 6444 qmaster 1 [19:03:33] ^ this shows uptime, among others [19:03:49] grrrreat [19:03:54] via http://thread.gmane.org/gmane.org.wikimedia.toolserver/6077/focus=6079 [19:06:02] so that's happening a lot [19:08:54] * YuviPanda waves [19:09:03] I think [19:09:08] we should maybe do a bdb dump and recover? [19:09:24] I was just reading teh same thing, we should at least do a dump somewhere [19:09:33] I mean...if we lose this db totally do we have any clue what is in it? [19:09:36] I know settings and not just queue [19:09:49] no [19:09:55] kk [19:10:05] I think just information on running jobs? [19:10:10] the configuration is in... errr [19:10:17] aren't the settings in a bdb as well? 
might not be the same one [19:10:18] *checks toolserver docs* [19:10:24] I'm almost postive when you choose bdb it stores settings there [19:10:28] for queue's etc [19:10:40] and it make believe's it's a file with /tmp when you do like qconf -mq [19:10:42] then I think that means the toolserver didn't use bdb queuing [19:10:43] and then saves to db [19:11:15] "The qmaster configuration and state is in /global/misc/sge62, but generally SGE is configured by commands which can be run on any host" https://www.mediawiki.org/wiki/Toolserver:Admin:Sun_Grid_Engine [19:11:23] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1937379 (10Tgr) What would such a namespace accomplish? You can always name a page `Portal:Something` (or, if you prefer plain English, `Something... [19:11:47] if I stop the master do exec nodes give out or does just new job submission stop working? [19:11:52] I mean, do ongoing things keep chugging? [19:12:06] just new job submissions [19:12:07] obv new crons take a big hit [19:12:09] and qstats [19:12:44] (03PS2) 10Yuvipanda: passwords: add root key for valhallasw [labs/private] - 10https://gerrit.wikimedia.org/r/264110 (https://phabricator.wikimedia.org/T123655) (owner: 10Merlijn van Deen) [19:13:03] (03CR) 10Yuvipanda: [C: 032 V: 032] "andrew +1'd in task. I'll also add him to cloudadmin. \o/ <3" [labs/private] - 10https://gerrit.wikimedia.org/r/264110 (https://phabricator.wikimedia.org/T123655) (owner: 10Merlijn van Deen) [19:17:06] yeah, it might be the './spool/spooldb/sge' database (as compared to ./spool/spooldb/sge_job) [19:17:45] that's what I was thinking [19:17:47] sge: Berkeley DB (Btree, version 9, native byte-order) [19:17:47] sge_job: Berkeley DB (Btree, version 9, native byte-order) [19:17:55] var/spool/gridengine/spooldb [19:19:08] otoh, it's 3.4MB (!) [19:19:13] that's a lot of config [19:20:48] but it is config. the keys are stuff like 'ADMINHOST:tools-bastion-01.eqiad.wmflabs\x00','COMPLEX_ENTRY:arch\x00', MANAGER:valhallasw\x00, etc. [19:21:25] how did you see that? db_dump? [19:21:28] I'm not sure if the value actually means anything (''\x00\x00\x00\x00\x10\x02\x00\x00\x00\x00\x00\x02\x00\x00\x00\x01\x00valhallasw\x00' for my manager entry) [19:21:38] https://docs.python.org/2/library/bsddb.html on one of the backup databases [19:21:47] the one in /data/project/.system/gridengine/spool.bak/spooldb [19:25:55] so assuming the sge db is settings [19:26:00] and the sge_jobs one is the queue [19:26:07] I imagine teh queue db is correct [19:26:09] corrupt I mean [19:26:17] yeah, I think so [19:26:19] and the only reason it recovers is that teh sge one is semi sane [19:26:24] and so it keeps coming back up and retrying [19:26:36] if the sge db was corrupt (i.e. a config change that hit the nfs outage) [19:26:44] we would be in deeper poop than now [19:27:10] at least parts of it are puppetized, but it would be good to do a full dump of the configuration somewhere [19:27:24] yes desperately we need to do that [19:27:29] I can't figure out exactly how yet [19:27:36] even a plan text crappy version we can work from [19:27:55] so interesting is [19:28:01] I have a test master somewhere [19:28:06] and if I totally remove the sge_job db [19:28:07] https://arc.liv.ac.uk/SGE/howto/backup.html ? ;-) [19:28:09] and restart [19:28:12] it creates a new one [19:28:23] can it still see existing jobs? 
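As an aside, the key-by-key inspection described a few lines up (the ADMINHOST:..., MANAGER:... entries in the settings database) can also be done with the Berkeley DB command-line tools, as long as they are pointed at the spool.bak copy and not the live environment. A sketch:

    # printable dump of the settings database vs. the job database, from the backup copy
    db_dump -p /data/project/.system/gridengine/spool.bak/spooldb/sge | less
    db_dump -p /data/project/.system/gridengine/spool.bak/spooldb/sge_job | less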
[19:28:37] I would imagine no but there were no jobs there atm [19:28:40] I can submit one and test [19:28:59] YuviPanda: we can mass restart all web services right [19:29:02] and cron will sort itself out [19:29:06] but task and continuous? [19:29:09] would be lost? [19:29:39] at least partially [19:29:48] bigbrother might reschedule some continuous jobs [19:29:52] or rather [19:30:12] the jobs would probably happily purr on, and we could get issues with double irc bots etc [19:30:36] chasemp: the last time we did this we rebooted all the machines [19:30:38] to stop this problem [19:30:41] of double bots [19:32:35] chasemp: https://github.com/valhallasw/son-of-gridengine/blob/d1673d47d84fa526548657ed8e0771fd1f5cac26/source/dist/util/upgrade_modules/save_sge_config.sh [19:32:43] and https://github.com/valhallasw/son-of-gridengine/blob/d1673d47d84fa526548657ed8e0771fd1f5cac26/source/dist/util/upgrade_modules/load_sge_config.sh [19:33:34] ok I'll pull those downa nd try them [19:33:46] fwiw on my test setup which is admittedly trivial [19:33:49] but same versions etc [19:33:58] I stopped sge, removed the entire queue db, started it [19:34:03] and then succesffully submitted a new job [19:34:15] so I'm thinking about that [19:36:48] 6Labs, 10Tool-Labs: tools.taxonbot and tools.giftbot cronjobs not firing - https://phabricator.wikimedia.org/T123186#1937492 (10Giftpflanze) 2016-01-01 00:00 UTC and 2016-01-15 00:00 UTC: 0 0 1,15 * * jlocal... [19:37:36] ^ given the timing, I'm pretty sure it's a load issue [19:37:37] ^ could we do something about that or is there a workaround? [19:38:18] gifti: workaround: fire a few minutes earlier or later [19:38:43] yeah, i thought so [19:38:51] will do [19:42:29] valhallasw`cloud: did you see the etherpad I sent? :) [19:42:41] do you plan on doing something about the load? [19:42:45] YuviPanda: yeah. looks good. [19:42:54] gifti: I'm not sure what's causing the load. [19:43:11] ah [19:43:12] migrating to a more powerful vm is possible, but will take time [19:48:34] ok [19:48:39] now I'm going to look at our stats again [19:49:15] so once I sorted out the path issues that actually does make a plan text dump of things [19:49:23] it test anyway [19:49:28] in test [19:51:48] parentcommandline in the eventlogging schema is fascinating [19:53:04] https://phabricator.wikimedia.org/P2477 [19:54:14] * * * * * /usr/bin/jsub [19:54:21] times 5 [19:54:24] let me look [19:54:46] wtf [19:56:17] valhallasw`cloud: can you get on the master? [19:56:20] root/emergency_sge_dump [19:56:51] that looks good [19:57:26] yeah, looks good to me (although I wouldn't be able to say if anything is missing) [19:58:00] me neither [19:58:04] but the things I know to check for are there? [19:58:09] not a great indicator tho :) [19:58:17] but this is teh best backup we have [19:58:19] and also [19:58:23] no issues dumping the //settings// db [19:58:27] 24217 total invocations for cluebot [19:58:33] 12123 without -once [19:58:34] this clearly doesn't fool w/ the active queue [19:58:39] that's almost 50% without -once [19:58:44] * YuviPanda investigates the ones without once [19:58:48] do we allow jobs to submit other jobs? 
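On the "plain text version we can work from" point above: roughly the sort of thing save_sge_config.sh automates can also be pulled piecemeal with qconf, which only talks to the running qmaster and never opens the spool files. A sketch, with an arbitrary backup directory; the object types shown are the common ones, not necessarily everything the spool db holds:

    BACKUP=/root/sge-config-$(date +%F); mkdir -p "$BACKUP"
    qconf -sconf  > "$BACKUP/global.conf"     # global cluster configuration
    qconf -ssconf > "$BACKUP/scheduler.conf"  # scheduler configuration
    qconf -sel    > "$BACKUP/exec_hosts"      # execution host list
    qconf -sul    > "$BACKUP/usersets"        # userset / ACL list
    qconf -sql    > "$BACKUP/queues"          # cluster queue list
    for q in $(qconf -sql); do qconf -sq "$q" > "$BACKUP/queue_$q"; done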
[19:58:54] aaah [19:58:56] job [19:58:58] that's double [19:59:02] let me stop recording 'job' [19:59:04] that's useless [19:59:38] nope [19:59:40] without 'job' [19:59:42] there are 0 things [19:59:44] without -once [19:59:46] from cluebot [20:00:31] oh [20:00:34] nvm [20:00:36] my sql was wrong [20:00:38] hmm [20:01:03] my sql is still wrong [20:01:48] bah [20:01:50] my code is wrong [20:02:24] I was emitting 'job' as 'jsub' [20:02:29] oh well [20:02:31] fixable [20:02:54] yeah [20:02:58] 0 things without -once [20:03:00] from cluebot [20:03:25] valhallasw`cloud: I wonder if we can / should change cluebot's crontab [20:03:39] YuviPanda: chasemp : more from the archives: https://meta.wikimedia.org/wiki/Toolserver/Stable_server/Candidates [20:03:52] YuviPanda: I would leave a message at their talk page / create a task in phab [20:03:56] and only act if there's no response [20:04:19] valhallasw`cloud: yeah [20:04:24] valhallasw`cloud: so I'm going to look at their crontab [20:04:27] and then do those things [20:04:47] hmm [20:04:51] YuviPanda: I think our best bet is to wipe out the queue db [20:04:53] actually [20:04:55] they're all -once [20:05:06] I made a dump of the settings db in /root [20:05:10] and am backing it up locally [20:05:20] chasemp: during the migration to your new master or? [20:05:23] and it seems like from testing if you remove the spool db it recreates [20:05:36] well, I think we shoudl do it now honestly [20:05:38] if we can [20:05:43] friday afternoon? :) [20:05:48] yeah shitty timing [20:05:53] but it's crashing hundreds of tiems a day [20:06:10] hmm [20:06:12] I guess, what's teh fallout? risk of dupe jobs? [20:06:13] chasemp: I have a better idea [20:06:23] have we tried the dump + load thing yet? [20:06:24] new master [20:06:25] please :) [20:06:36] import configuration, but with the other spool system, and not on NFS [20:06:39] YuviPanda: I have dumped yes but it's //settings// not queue [20:06:56] and I think considering the state of the queue db it's not sane to try to recover it [20:07:06] db_recover seesm to basically want a journal file [20:07:09] and it wants to roll back [20:07:12] no [20:07:14] not recover [20:07:17] but a dump and reload [20:07:21] which should ignore the journal [20:07:31] then I think we can switch over by changing the file that indicates the master [20:07:32] db_dump do you mean? [20:07:37] db_dump [20:07:37] althugh I'm not sure what that does to execds :/ [20:07:57] chasemp: db_dump writes out TXT, then db_load [20:08:06] valhallasw`cloud: yeah I agree but not today I think or not unless we have to today [20:08:21] there is some crazy path issues on the current master I'm not entirly sure of everthing [20:08:27] YuviPanda: does that work for you? [20:09:04] if the queue db is corrupt and it's crashing on handling teh qeueue the shortest path to no crashes seems to be to start a fresh queue [20:09:27] I don't think dumping the current one and trying to recover it makes sense as it's corrupt [20:09:33] chasemp: so we have to announce 'we're going to kill the queue database' anyway [20:09:35] and db_recover doesn't seem feasible [20:09:41] so in that sense it doesn't necessarily matter [20:09:57] I think db_dump/db_load should be OK because it's an issue in bdb consistency, not in the contents [20:10:02] yeah [20:10:06] am going to try that now [20:10:16] not on the live one!! [20:10:44] apparently that can already cause corruption [20:12:08] valhallasw`cloud: db_dump can cause corruption? 
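For context on why db_recover was set aside: standard Berkeley DB recovery runs against the environment directory and replays the log.* journal files, so it only helps when those journals are intact. A sketch of what it would look like, master stopped first; -c asks for "catastrophic" recovery using all available log files:

    service gridengine-master stop
    db_recover -v -h /var/spool/gridengine/spooldb
    # or, if plain recovery is not enough and the full journal is present:
    db_recover -c -v -h /var/spool/gridengine/spooldb
    service gridengine-master start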
[20:12:14] not db_load [20:12:18] db_load probably can [20:12:20] I don't understand how dumping a corrupt db and restoring it is going to work out? [20:12:21] db_dump shouldn't [20:12:24] YuviPanda: yes [20:12:32] that's what https://arc.liv.ac.uk/SGE/howto/backup.html says [20:12:33] it may lock it to dump [20:12:39] there is a db_hostbackup or something [20:12:54] too late because I did it anyway [20:13:10] YuviPanda: "Caution: This method is not safe if Berkeley DB spooling is in use, the qmaster is running, and it opens the spool database in ‘private’ mode to allow spooling to a network filesystem. " [20:13:11] >_< [20:13:35] well [20:13:39] I'm going to go for lunch [20:13:41] YuviPanda: are you dumping teh spool db or the sge one? [20:13:46] chasemp: I dumped sge [20:13:49] into /tmp/bdb [20:13:51] and /tmp/wat [20:13:53] for the text file [20:13:54] that's not the corrupt db I don't think [20:13:57] ok [20:13:59] well [20:14:03] or wasn't [20:14:03] I've to go for lunch now [20:14:22] I should probably stop doing things [20:14:32] since at this point both chasemp and valhallasw`cloud definitely know way more about this than I do :| [20:14:36] I hope I didn't corrupt sge [20:14:44] YuviPanda: I think it's OK [20:14:50] well I did jsut take a dump of it too [20:14:52] there's a "01/15/2016 20:11:53| timer|tools-grid-master|E|error checkpointing berkeley db: (13) Permission denied" in the log, but that sounds OK [20:14:54] I'll brb soon [20:15:03] I would highly reccomend not doing anything drastic today [20:15:05] brb [20:15:21] is a fresh queue drastic? serious question [20:15:29] if it loses all running jobs: yes [20:15:32] valhallasw`cloud: so I basically agree with all you said before [20:15:46] but the mass of insanity here makes me very afraid of the move right now [20:15:57] coming back to why db_dump / db_load could work: it forces the database structure to be fresh [20:16:08] hm [20:16:24] I guess I figure [20:16:49] well I'm not sure [20:16:54] it's also the least drastic option we have [20:17:05] other than 'don't touch it while it's still half-alive' [20:17:15] yes [20:17:17] true [20:17:26] let me try it on a test setup to see what happens [20:17:48] chasemp: ./inst_sge -bup is supposed to be the official way [20:18:00] to do dump and restore? [20:18:13] to backup, at least [20:18:15] is that command on the master now? (I don't think it is) [20:18:16] ah [20:18:18] I can't just do [20:18:20] db_dump [20:18:24] but that might just be config [20:18:27] I think you can [20:18:28] ok [20:18:31] db_dump is the low level one [20:18:32] it looks like [20:18:37] but you have to shut down master I think [20:18:38] if you dump an empty queue [20:18:39] file [20:18:40] all you see is [20:18:47] VERSION=3 [20:18:48] format=bytevalue [20:18:48] type=btree [20:18:49] db_pagesize=4096 [20:18:51] HEADER=END [20:18:53] DATA=END [20:18:55] hah. That makes sense [20:18:55] so that's interesting [20:20:31] chasemp: it basically makes a dump where every octet is represented by a hexadecimal number [20:20:33] is db_dump invasive? [20:20:53] yes we think it is [20:21:07] *nod* [20:21:14] so if I stop the master, job submission dies [20:21:19] obv insight into things as well [20:21:23] yes. But the dump should be pretty fast [20:22:06] chasemp: dumping one of the backup BDB files took only 3 secs [20:23:08] ok I'm going to try that [20:23:10] hold on to your hat [20:23:38] oddly enough the hex dump is 4 MB... 
while the bdb file is 40 [20:23:57] but I can't figure out db_load yet [20:24:09] uh well it did not work out [20:24:23] ? [20:24:26] service gridengine-master stop && db_dump sge_job > /root/emergenc_sge_job_dump/sge_job_dump && service gridengine-master start [20:24:30] and all I got was [20:24:45] also I saw a dupe master process [20:24:50] which could also account for the corruption [20:24:57] huuuh. [20:25:10] all I got was /root/emergenc_sge_job_dump/sge_job_dump [20:25:22] that seems like the deal [20:25:47] yeah that job file is empty I think maybe [20:25:51] 8K [20:26:19] chasemp: you need /data/project/.system/gridengine/spool/spooldb [20:26:19] ok so [20:26:24] var/spool/gridengine/spooldb [20:26:47] I think these are symlinked [20:26:58] yeah [20:27:39] service gridengine-master stop [20:27:42] does not kill the master [20:27:43] :) [20:27:48] that's why my start at the end [20:27:50] started a second one [20:27:56] oooooooh k [20:28:46] chasemp: in the meanwhile, I'll try to trace your second master observation [20:28:50] etc/init.d/gridengine-master stop similarly does not kill the amster [20:29:04] I killed the second master proc manually when I saw it [20:29:49] chasemp: sudo lastcomm | grep qmaster [20:30:35] there's a few starts every 30 mins [20:30:47] ok so [20:30:55] I had to kill that runaway master proc [20:31:00] it responded to nothing nicely [20:31:10] and then I made sure I could start it sanely with no other changes [20:31:11] seems so [20:31:19] so I stopped it nicely and that was ok [20:31:22] and then I dumped the job db [20:31:24] and restarted [20:31:29] and it seemed to work out [20:31:37] so we have a running thing w/ old same bdb file [20:31:49] and a dump of that file which if i restored it would lose anything from now till then [20:31:54] but mainly I wanted to see all that happen [20:32:03] root/emergenc_sge_job_dump/sge_job.dump [20:33:07] ok, so then the next question is how to reload it into a bdb file [20:33:34] seemed to work in test: db_load -f my_dump sge_job [20:33:44] but I want to do it with a running job and see it come back here [20:35:19] valhallasw`cloud: what's your wikitech name? [20:35:26] (going to add you to a project) [20:35:29] chasemp: Merlijn van Deen [20:36:15] so this is my teset master totally unrelated [20:36:15] tool-master-05.tool-renewal.eqiad.wmflabs [20:36:19] it has been installed with no nfs [20:36:38] the setup there is speculative so it's hard to explain but I can execute jobs and such [20:36:47] so you can see what I'm seeing [20:37:08] * valhallasw`cloud tries logging in [20:38:14] chasemp: ok, I'm in [20:38:30] I jsut started a job that basically echos and sleeps [20:38:32] qstat shows it [20:38:45] yep [20:39:00] and I dumped unsafely into root@tool-master-05:/var/spool/gridengine/spooldb# file foo [20:39:17] *nod* [20:39:21] I'm thinking about whether the queue settings here are the same enough to get a sense of job resurrection [20:39:30] afa stopping master && dump && restore && starting [20:40:26] chasemp: so there are three things I think we should try: 1) stop/restart (should be OK), 2) stop/delete/restart (should lose the job), 3) stop/dump/move spool dir to .bak/restore in new dir/start [20:40:48] or do you want to restore the toollabs jobs list here? 
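One more low-risk check that fits in here: Berkeley DB also ships db_verify, which reports whether a database file is structurally consistent, and it can be run against a copy so the live spool is never opened. A sketch, with paths assumed from the earlier discussion:

    cp /var/spool/gridengine/spooldb/sge_job /tmp/sge_job.copy
    db_verify /tmp/sge_job.copy && echo "sge_job structure looks sane"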
[20:41:06] so I was kind of thinking about that but I dn' thave all teh queues etc [20:41:08] in that case I think you need to restore the queues as well which might be nontrivial [20:41:09] so it wouldn't really work out [20:41:13] :) [20:41:15] buuut [20:41:23] you can just copy the sge bdb file as well? [20:41:31] hmmm [20:41:43] worst case scenario is that it tries to write to wrong directories and gives up, I think [20:41:49] there has to be a reason taht is a bad idea but I can't think of it yet [20:41:51] :) [20:41:58] yeah [20:42:12] it doesn't have tcp access to tool labs, so I think it should be fine? [20:42:15] the current tools master has all kinds of sym link shenanigans and such [20:42:30] well, project isolation isn't as hard and fast as it seems [20:42:48] but it shouldn't cause issues as sge seems to rely on exec's etc [20:42:52] knowing where to get their stuff at [20:43:00] but then again yeah it's a thing [20:43:15] I could also use the dump script, change the master name before importing [20:43:15] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Tom29739 was created, changed by Tom29739 link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Tom29739 edit summary: Created page with "{{Tools Access Request |Justification=To power a bot to get rid of attack pages on the English Wikipedia. |Completed=false |User Name=Tom29739 }}" [20:43:16] mmm [20:43:27] let me check something [20:44:19] I'm goign to stop the current test master tho? [20:44:22] ready? [20:44:47] valhallasw@tool-master-05:/var/spool/gridengine/spooldb$ telnet tools-exec-1205 6445 [20:44:48] Trying 10.68.17.91... [20:44:51] ^ so that should be OK [20:44:57] sure [20:45:11] kk [20:46:00] so service gridengine-master stop && db_dump sge_job > foo && db_load -f foo sge_job && service gridengine-master start [20:46:07] dumped and came back and the job was still running [20:46:19] so it says [20:46:29] chasemp: did that restore to a different file? [20:46:37] because I'm not sure what restoring to the same file does [20:47:11] so your thought is [20:47:36] service gridengine-master stop && db_dump sge_job > foo && rm -f sge_job && db_load -f foo sge_job && service gridengine-master start [20:48:23] service gridengine-master stop && db_dump sge_job > foo && cd .. && mv spooldb spooldb.old && mkdir spooldb && cd spooldb && db_load -f ../spooldb.old/foo sge_job && cp ../spooldb.old/sge . && service gridengine-master start [20:50:49] ok I understand what you are thinking now [20:51:07] idk what [20:51:07] __db.001 __db.002 __db.003 log.0000002927 [20:51:08] are [20:51:20] or if it will be bad if I don't return them (on the prod install) [20:51:47] chasemp: log are bdb log files (which I interpret as journaling) [20:52:01] any ideas on __db.001? [20:52:44] well at least I have them test as well [20:52:46] so we shall see [20:52:46] https://stackoverflow.com/questions/8957999/what-does-db-001-mean-in-berkeley-database [20:52:47] :) [20:53:04] so I suppose they might disappear when you shut down the master? [20:53:08] interesting [20:53:18] assuming we close out db ops nicely I guess yeah [20:55:15] so I did [20:55:24] service gridengine-master stop && ps -ef | grep grid && db_dump sge_job > foo && cd .. && mv spooldb spooldb.old && mkdir spooldb && cd spooldb && db_load -f ../spooldb.old/foo sge_job && cp ../spooldb.old/sge . 
&& service gridengine-master start && ps -ef | grep grid [20:55:27] things did not come back [20:55:46] 01/15/2016 20:55:02| main|tool-master-05|E|couldn't open database environment for server "local spooling", directory "/var/spool/gridengine/spooldb": (13) Permission denied [20:55:58] ah [20:56:06] I suppose a chown sgeadmin:sgeadmin is needed :-) [20:56:48] https://lists.wikimedia.org/pipermail/toolserver-l/2009-April/001995.html *grin* [20:56:53] did it and still not yet [20:57:26] 01/15/2016 20:56:47| main|tool-master-05|E|couldn't open berkeley database "sge": (22) Invalid argument [20:57:26] mmm [20:57:43] but it did create the __db.X files [20:57:54] where did you see that error? [20:58:09] in /var/spool/gridengine/qmaster/messages [21:00:15] it wanted the .log file [21:00:27] huh. [21:00:47] oh! the .log file belonged to sge, not sge_job [21:00:57] there you go yes [21:01:03] so we should probably dump and reloda both at the same time [21:02:20] that's old tho now I think [21:02:23] oops [21:02:35] I had an irc weirdness nvmd on that response [21:03:22] so it still let me stop that job [21:03:26] even after all that [21:03:31] and the job was for sure still running [21:06:09] that sounds good [21:07:26] so we could do this either today or on monday. I think it should be a fairly safe operation, but if it breaks there's not a lot of time to fix it, and the weekend is prime time for volunteer work [21:07:57] another intersting thing is [21:07:58] 01/15/2016 18:46:53| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery [21:08:03] which was happening every 4 minutes or so [21:08:07] seems to have trailed off for a bit [21:08:16] hrm. [21:08:27] and I haven't seen one in an hour and hafl? [21:08:30] roughly since I restarted [21:08:40] well maybe [21:09:47] first one I can find is 12/30/2015 02:47:55| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery. [21:10:58] I see other gaps in the failure logs tho [21:27:00] !log wikimetrics stopping old prod queue and scheduler [21:28:13] !log wikimetrics set up symlinks for static files on new prod server [21:28:27] !log wikimetrics importing db data to new prod server [21:32:39] I am back [21:33:18] yo [21:34:18] wow lots of backscroll [21:34:22] this is lovely! 
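Putting the pieces from the tool-master-05 test together (the sge and sge_job databases need to move as a pair, the log.* journal belongs to the sge side, and the new spool directory has to end up owned by sgeadmin), the dump-and-reload cycle that finally worked looks roughly like this. A reconstruction from the log, not the literal command history:

    cd /var/spool/gridengine
    service gridengine-master stop
    ps -ef | grep sge_qmaster            # make sure no stray master is still running
    mv spooldb spooldb.old && mkdir spooldb
    db_dump spooldb.old/sge      > /root/sge.dump
    db_dump spooldb.old/sge_job  > /root/sge_job.dump
    db_load -f /root/sge.dump     spooldb/sge
    db_load -f /root/sge_job.dump spooldb/sge_job
    chown -R sgeadmin:sgeadmin spooldb   # the master runs as sgeadmin and must own the environment
    service gridengine-master start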
[21:34:40] I wonder if the corruption is because of two masters running [21:34:54] so [21:35:07] at some point on the master now it seems to not respond to the service script [21:35:19] and when I go to stop it it ignores and when I go to start it [21:35:22] it starts a second [21:35:27] so that's interesting [21:35:29] but yeah [21:35:33] could def do the deal [21:36:54] so assuming it's starting two masters on the same host, that suggests it might correlate with the new tools-grid-master [21:37:01] ok [21:37:02] I jsut confirmed [21:37:06] puppet starts a dupe master pro [21:37:08] proc [21:37:21] and that proc wants to do all the same things including manage the queue file [21:37:26] woooo [21:37:26] I disabled puppet [21:37:30] I might know what's happening [21:37:36] there's no 'status' [21:37:39] for the gridengine msater process [21:37:42] it was alrady running [21:37:42] so puppet just starts them [21:37:44] Notice: /Stage[main]/Gridengine::Master/Service[gridengine-master]/ensure: ensure changed 'stopped' to 'running' [21:37:44] Info: /Stage[main]/Gridengine::Master/Service[gridengine-master]: Unscheduling refresh on Service[gridengine-master] [21:37:46] and that happend [21:37:47] there's a way around it [21:37:48] and then [21:37:54] root@tools-grid-master:/var/spool/gridengine# ps -ef | grep grid [21:37:55] sgeadmin 15161 1 19 21:31 ? 00:00:58 /usr/lib/gridengine/sge_qmaster [21:37:55] root 16313 1 0 21:36 ? 00:00:00 /usr/lib/gridengine/sge_qmaster [21:37:56] root 16573 3492 0 21:36 pts/2 00:00:00 grep --color=auto grid [21:38:01] and see the second one started as ROOT [21:38:20] so that will no doubt cause issues I would think [21:38:28] how freaking long has this been going on [21:38:54] chasemp: https://github.com/wikimedia/operations-puppet/commit/3a29a719fa409aa58339b056a4d68fc448f45532 [21:39:01] 17 nov. [21:39:17] > If a service’s init script does not support any kind of status command, you should set hasstatus to false and either provide a specific command using the status attribute or expect that Puppet will look for the service name in the process table [21:39:33] which doesn't explain why it worked without obvious issues for a month [21:39:42] maybe it's a race condition for corruptino [21:39:45] and it just took awhile [21:39:49] 6Labs: Rename labcontrol1001 to labtestweb2001 - https://phabricator.wikimedia.org/T123790#1937988 (10Andrew) 3NEW [21:39:53] but once it gets hosed up [21:39:54] then it's on [21:40:21] so [21:40:25] could some of those errors [21:40:29] be one of the two master procs [21:40:33] bumping into the other [21:41:07] Yeah, I think so. Unfortunately, the log file doesn't log the process id :/ [21:41:12] I mean, two masters trying to manage all the same things that's just [21:41:16] well that's bound to be insane [21:41:20] hi all - I will be physically at the event tomorrow, I hope to say hello to -labs people [21:41:47] 6Labs: Rename labcontrol2001 to labtestweb2001 - https://phabricator.wikimedia.org/T123790#1938009 (10Andrew) [21:42:26] chasemp: is this also happening on your test host? 
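For reference, the Puppet documentation quoted above maps onto the service resource like this. A sketch of the shape of the fix being discussed, not the actual operations-puppet change, and the qping status command is only an illustration borrowed from earlier in the log:

    service { 'gridengine-master':
      ensure    => running,
      # the init script has no 'status' action, so tell Puppet exactly how to check
      # instead of letting it guess from the process table and start a second master
      hasstatus => false,
      status    => 'qping -info tools-grid-master 6444 qmaster 1',
    }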
[21:42:33] (03PS1) 10Yuvipanda: Stop recording eventlogging messages for `job` [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264374 [21:42:35] oh, I'm root, I can check this myself ;-) [21:42:41] valhallasw`cloud: I don't have puppet trying to start anything [21:42:51] I think it's puppet not understanding the lack of status as described [21:42:58] and there is no custom status command [21:43:01] and it's just a house of cards [21:43:05] but why doesn't it start it on your host then? [21:43:21] I'm sure it would if it was coming from puppet :) [21:43:32] that's all me walking through it manually to write my own puppetization [21:43:34] ah! [21:43:59] so the shadow corruption idea was pretty close I guess [21:44:04] it's just it was all on the same VM [21:44:49] should I write a patch for puppet to make it not start it? [21:44:58] err [21:45:03] at least fix the status and hasstatus stuff? [21:45:15] I would want to play with a custom status command for a bit and I don't want to rush into it (pun) [21:45:20] I would be cool w/ no auto start for the weekend? [21:45:33] I'm tailing things now [21:45:33] let's revert it and see if ew get any more database warnings [21:45:35] yeah sure [21:45:39] looking for corruption indicators [21:45:40] or just disable puppet for the weekend [21:45:41] let's do that [21:45:49] to see if even w/ one proc it throws crazy errors [21:46:02] valhallasw`cloud: ha same thought yeah [21:46:29] (03CR) 10jenkins-bot: [V: 04-1] Stop recording eventlogging messages for `job` [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264374 (owner: 10Yuvipanda) [21:46:32] what a damn mess, this is what caused a lot of that drama I bet [21:46:46] yeah [21:47:22] fwiw my next move was [21:47:23] echo "stopping master" && service gridengine-master stop && echo "------" && ps -ef | grep grid && echo "remove foo" && rm -f foo && ls && db_dump sge_job > foo && cp -p sge_job /root/ && ls /root/ && md5sum sge_job && rm -f sge_job && db_load -f foo sge_job && md5sum sge_job && chown sgeadmin:sgeadmin sge_job && service gridengine-master start; echo $? [21:47:32] and I can verify the rebuilt binary is a different hash [21:47:48] i.e. I think the rebuilding even of the same file is enough of a change to overcome some versions of corruption [21:48:05] but I'm holding off now [21:48:46] YuviPanda: can you revert that? I gotta brb for a minute [21:48:51] yeah [21:48:53] am on it [21:49:35] So I'm still reading the toolserver-l archives, and we had the same issue with tools not having multiple maintainers, but worse because the accounts needed renewal. I think the comments I made then are still true for tool labs tools with a single maintainer *grin* [21:49:36] https://lists.wikimedia.org/pipermail/toolserver-l/2012-May/004967.html [21:50:26] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1938073 (10bd808) >>! In T123427#1937379, @Tgr wrote: > What would such a namespace accomplish? You can always name a page `Portal:Something` (or,... 
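On the "just disable puppet for the weekend" option mentioned above, it is worth leaving a note for whoever looks next; the agent accepts a message, though the wording here is made up:

    puppet agent --disable "gridengine-master: avoid duplicate sge_qmaster procs until the service status handling is fixed"
    # and once the fix is in:
    puppet agent --enable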
[21:52:04] 6Labs: Rename labcontrol2001 to labtestweb2001 - https://phabricator.wikimedia.org/T123790#1938082 (10Andrew) [21:52:05] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Investigate decommissioning labcontrol2001 - https://phabricator.wikimedia.org/T118591#1938081 (10Andrew) [21:52:11] valhallasw`cloud: <3 [21:52:17] bd808: ^^ ( valhallasw`cloud's link) [21:53:05] ok I'm back and I realized I literally haven't left the house since sunday evening [21:53:28] chasemp: I haven't left since Saturday afternoon! [21:54:58] nice [21:55:03] YuviPanda, valhallasw`cloud: I recorded that on https://www.mediawiki.org/wiki/User_talk:BDavis_%28WMF%29/Projects/Tool_Labs_support for posterity [21:55:27] chasemp: I think my record was 13 days [21:55:50] I need to go out today though to get a bday card for my significant other [21:59:53] YuviPanda: I do see a lot of [21:59:53] 01/15/2016 21:58:51|worker|tools-grid-master|E|The job -j of user(s) tools.toolschecker does not exist [22:00:03] is that from teh canary check somehow? [22:00:16] does it maybe uncleanly handle a job or soemthing [22:06:51] chasemp: so I just found the old Toolserver puppet repository! I'm not sure yet if there's anything useful, but there might be some gridengine stuff in there [22:07:44] 6Labs, 10Tool-Labs, 5Patch-For-Review: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1938150 (10chasemp) After much shenanigans it was determined that Puppet in it's ignorance was starting a second master processes which we believe was conflicting (it may have started more tha... [22:07:58] neat I didn't know it was puppetized :) [22:10:14] no, me neither :-) I just found a single e-mail mentioning it on the mailing list [22:11:50] YuviPanda, chasemp: https://github.com/valhallasw/ts-puppet -- not much in there, though. [22:17:05] :D [22:17:17] chasemp: that's probably a canary gone wrong [22:19:52] it happens a lot but I'm not sure what to make of it [22:20:09] there's a qstat -j [22:20:23] and if the joblist following it is empty [22:20:26] for some reason [22:20:28] I think that causes that error [22:20:38] anyway [22:20:46] gotcha [22:20:46] chasemp: valhallasw`cloud I merged a patch reverting the ensure => runner [22:20:58] YuviPanda: no, that should give scheduling info: queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1201.eqiad.wmflabs" dropped because it is temporarily not available etc [22:21:09] hm [22:21:53] maybe qdel -j? [22:21:56] hmm [22:21:58] maybe [22:22:04] no [22:22:06] there's only one -js [22:22:08] -j [22:22:10] and that's for a qstat [22:26:07] anyway, time for bed [22:26:09] g'night! [22:26:39] valhallasw`cloud: thank you much and good night [22:30:17] valhallasw`cloud: <3 good night [22:31:27] YuviPanda: can you look at https://gerrit.wikimedia.org/r/#/c/264395/ [22:31:28] it's small [22:31:59] basically I got some brain dump from ma.rk this morning and part of that outcome was to look at the raid monitoring there and it seems like it shoudl be using mdadm etc [22:33:40] chasemp: left a comment [22:34:26] chasemp: should I reenable puppet on master and check? [22:34:30] so the script in general takes no args [22:34:34] ah [22:34:35] and does a lot of self determination [22:34:36] ok [22:34:38] then fine [22:34:44] like figuring out os and utililty etc [22:34:51] so I'm basically tryign to say true to form for it [22:35:03] and since this is run under the guise of nagios and nrpe [22:35:05] can you add a comment there saying this? 
[22:35:09] because I'm sure this comes from upstream [22:35:12] and is copied in [22:35:19] my commit msg I think says basically that [22:35:21] gtg otherwise [22:35:22] or maybe it's confusing? [22:35:33] right, but I think these things should be left as comments inline :) [22:35:40] ah sure [22:35:41] I get it [22:35:45] I thought you meant a gerrit comment [22:35:46] sure [22:35:51] ah [22:35:53] right [22:35:55] no, inline in the source [22:36:24] chasemp: anyway, objections to starting puppet again on master [22:36:26] ? [22:36:48] darkblue_b: I might be, apparently there was a registration and I did not know this [22:38:16] YuviPanda: nope but do it in console to make sure it doesn't do more crappy things :) [22:38:33] YuviPanda: eh https://gerrit.wikimedia.org/r/#/c/264395/ [22:38:51] chasemp: yah [22:39:16] chasemp: self termination? :) [22:39:28] it takes no args and tries to figure out what to run and when to run it [22:39:34] maybe that only makes sense to me [22:39:39] chasemp: shouldn't that be self determination? [22:39:47] oh cripes [22:40:17] heh nice [22:40:21] thanks updating [22:40:59] +1 [22:41:01] d [22:49:32] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Tom29739 was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=258392 edit summary: [22:58:43] chasemp: btw didn't touch gridengine master [22:58:50] (when I re-enabled puppet, that is) [23:00:22] k [23:00:42] seems good [23:00:59] I honestly think puppet would just keep thrashing master procs against each other until the master dies [23:01:06] but I don't want to go backwards to test that [23:01:17] heh [23:55:44] (03CR) 10Yuvipanda: [C: 032] "Debian Glue is broken." [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264374 (owner: 10Yuvipanda) [23:56:46] (03CR) 10Yuvipanda: [V: 032] Stop recording eventlogging messages for `job` [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264374 (owner: 10Yuvipanda)