[00:22:05] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:24:55] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [01:33:36] bblack, around? [01:48:15] (03PS1) 10Ori.livneh: Be multithreaded. [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 [02:08:18] (03PS1) 10Faidon Liambotis: auto-install: move private1-ulsfo to module [operations/puppet] - 10https://gerrit.wikimedia.org/r/101799 [02:08:19] (03PS1) 10Faidon Liambotis: auto-install: disable swap on appservers (mw.cfg) [operations/puppet] - 10https://gerrit.wikimedia.org/r/101800 [02:11:13] !log LocalisationUpdate completed (1.23wmf6) at Mon Dec 16 02:11:13 UTC 2013 [02:11:31] Logged the message, Master [02:13:11] !log salt swapoff -a; sed -i "/swap/d" /etc/fstab on all srv*, mw* [02:13:26] Logged the message, Master [02:13:33] (03CR) 10Faidon Liambotis: [C: 032] auto-install: move private1-ulsfo to module [operations/puppet] - 10https://gerrit.wikimedia.org/r/101799 (owner: 10Faidon Liambotis) [02:14:07] (03CR) 10Faidon Liambotis: [C: 032] auto-install: disable swap on appservers (mw.cfg) [operations/puppet] - 10https://gerrit.wikimedia.org/r/101800 (owner: 10Faidon Liambotis) [02:19:59] !log LocalisationUpdate completed (1.23wmf7) at Mon Dec 16 02:19:59 UTC 2013 [02:20:16] Logged the message, Master [02:34:36] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Dec 16 02:34:36 UTC 2013 [02:34:51] Logged the message, Master [02:42:04] hey uhm [02:42:07] made a machine on labs [02:42:18] connected once to it [02:42:23] then couldn't connect to it anymore [02:42:30] tried to install a package on it from a deb [02:42:34] dpkg just stalled [02:43:17] and the link was quite slow, not sure why, maybe cause I'm on the other side of the ocean ? [02:43:51] well, yeah, anyway. 
I'll probably circle around tomorrow again about this, probably not a good time right now [02:45:12] it's not the right time nor the right channel :) [02:45:47] paravoid: true [02:53:17] (03CR) 10Faidon Liambotis: [C: 04-1] "Good stuff! (very cursory look)" (033 comments) [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 (owner: 10Ori.livneh) [03:37:43] (03CR) 10MZMcBride: "Hashar: jenkins-bot seems to be complaining about RewriteEngine, but I'm not sure why." [operations/apache-config] - 10https://gerrit.wikimedia.org/r/101787 (owner: 10John F. Lewis) [04:09:31] has gerrit.wikimedia.org key changed? [04:09:36] RSA key fingerprint is 83:fe:34:4b:16:2c:9e:95:1d:f6:d7:7d:ee:28:03:02. [04:11:57] hmm, actually i can't upload anything to gerrit, :( [04:15:22] yurik-road: you exceeded your patch quota [04:15:32] you have to relax until january [04:15:40] ori-l, funny :) [04:15:55] although, ori-l, who should be talking! :-P [04:16:03] how's your tooth doing? [04:16:18] git pull fails :( [04:17:35] it's back! [04:17:51] my tooth is awful :/ [04:17:59] :( [04:24:23] ori-l: https://github.com/trebuchet-deploy/trigger#extending-trigger [04:24:50] specifically: https://github.com/trebuchet-deploy/trigger#extending-trigger [04:24:55] PROBLEM - MySQL Slave Running on db1026 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: Error Deadlock found when trying to get lock: try restarting transac [04:25:34] ugh. stupid markdown [04:25:54] Ryan_Lane: that is abusing decorators a little, I think -- composing classes via inheritance is a better model when you need this much configurability [04:26:05] like Django views, or python's threading library for that matter [04:26:18] you subclass thread and override run [04:26:30] I'm following an openstack model here [04:26:45] I think the decorators way of handling this is rather nice. [04:26:45] do you fabric?
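The two extension styles being contrasted here, openstack/novaclient-style per-argument decorators versus ori-l's subclass-and-override suggestion, can be sketched roughly like this. All names (`arg`, `Action`, `do_deploy`) are hypothetical illustrations for the sake of comparison, not trigger's or novaclient's actual API:

```python
import argparse

# Hypothetical sketch of both extension styles; neither is trigger's or
# novaclient's real API.

# Style 1: decorators that attach argparse specs to the handler function.
def arg(*args, **kwargs):
    def decorator(func):
        # Decorators apply bottom-up, so insert at the front to preserve
        # the top-to-bottom declaration order.
        func.__dict__.setdefault('arguments', []).insert(0, (args, kwargs))
        return func
    return decorator

@arg('--tag', help='deployment tag')
@arg('repo', help='repository to deploy')
def do_deploy(args):
    return 'deploying %s at %s' % (args.repo, args.tag)

def build_parser_from_function(func):
    # The framework later walks the attached specs to build a parser.
    parser = argparse.ArgumentParser()
    for a, kw in getattr(func, 'arguments', []):
        parser.add_argument(*a, **kw)
    return parser

# Style 2: subclass-and-override, as with threading.Thread or Django
# class-based views.
class Action:
    def add_arguments(self, parser):
        pass
    def run(self, args):
        raise NotImplementedError

class DeployAction(Action):
    def add_arguments(self, parser):
        parser.add_argument('repo')
        parser.add_argument('--tag')
    def run(self, args):
        return 'deploying %s at %s' % (args.repo, args.tag)
```

The decorator style keeps the argparse spec physically next to the handler; the subclass style hands the extension a real parser object, which is what ori-l's later point about mutually exclusive groups hinges on.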
[04:26:55] RECOVERY - MySQL Slave Running on db1026 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [04:27:12] do you know fabric, even [04:27:41] I know of its existence [04:28:02] I decided against it pretty early on [04:28:10] due to its reliance on ssh [04:29:25] oh, yeah, i wasn't suggesting using it [04:29:40] anyway, abusing decorators like this lets you configure each function without the overhead of a class [04:29:42] it has some nice patterns for building up a library of snippets of code for remote execution [04:29:56] ah. right. this is not for remote execution [04:30:00] i was just going to suggest robbing it for ideas [04:30:11] this is just for extending argparse [04:30:44] meh [04:30:50] all of the remote stuff occurs via salt [04:31:02] (03PS1) 10Springle: depool db1026 during wikidata.wb_terms schema changes (slave sql thread deadlocks if attempted while online) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101807 [04:31:08] a while back i wrote this thing that used the inspect module to get the function signature and generate an argparser based on that [04:31:24] (03CR) 10Springle: [C: 032] depool db1026 during wikidata.wb_terms schema changes (slave sql thread deadlocks if attempted while online) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101807 (owner: 10Springle) [04:31:45] it's cute but ultimately annoying and inflexible [04:31:55] what is? the decorators? [04:32:19] a little, yeah [04:32:27] in which ways is it inflexible?
[04:32:42] !log springle synchronized wmf-config/db-eqiad.php 'depool db1026 during schema changes' [04:33:00] Logged the message, Master [04:33:08] well, it's not always easy to anticipate, but here's one off the top of my head [04:33:15] implementing argument mutual exclusion [04:33:21] or argument groups [04:33:26] both supported by argparse [04:33:33] if your decorator was just [04:33:47] @util.args(argument_parser_instance) you could do it [04:37:41] https://github.com/openstack/python-novaclient/blob/master/novaclient/v3/shell.py#L207 [04:38:10] ugh [04:38:24] that is pretty horrible, come on [04:38:33] it's two screenfuls of decorators [04:38:56] you'd have two screenfuls of argparse extension there no matter what [04:39:34] I'm not opposed to another method of extension, but this one is relatively straightforward [04:40:03] and the code is easily adapted, since it's the same license [04:40:11] re: two screenfuls no matter what [04:40:37] yes, but they nevertheless deviated from the standard argparse pattern, presumably because they think this syntax is clearer or more convenient [04:41:02] and i just don't think it's true, since the meaning a decorator conveys most eloquently is: "this function you're about to see, it's a <...>" [04:41:34] examples: flask's @route (it's the handler for /index.html) django's @signal_handler, etc [04:41:55] when you see the '@' you think: i'm about to see a function [04:42:04] but then you have to put that in a buffer while you're reading unrelated things [04:42:42] this thing that i'll show you in a moment, once i show it to you, which will be shortly, like no more than another line or two, then you'll see, that is is, a thing that, ... [04:44:14] anyways, code aesthetics are subjective, and if you find that it's a good API, then don't let me and my toothache get you down [04:44:19] heh [04:44:41] no worries. 
I understand your dislike of the code [04:45:08] one of the reasons I liked this model was that it kept the argparse code in the same place as the action [04:45:18] and the extension model is specific to actions [04:45:31] an alternative would be to limit each action to an extension [04:46:29] and handle the subparser via a function in the extension [04:47:20] anyway, it's not a major change for either myself or extension authors down the line if I switch up the model [04:47:50] I was mostly point this out to let you know it's now possible to add a 'traps' action, or something like that [04:48:01] I was thinking that git notes could be a good way of handling that [04:48:11] *pointing [04:55:31] i've never used git notes [04:55:37] been meaning to try them [04:58:15] I believe if you add notes to a repo they stick all the way through [04:59:11] yep [05:01:23] (03PS1) 10Legoktm: Add MassMessage to $wgDebugLogGroups [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101809 [05:02:05] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:02:36] ori-l: ^ too. I'm not sure if anything else needs to be done to make that work... [05:02:49] and thanks :D [05:03:05] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [05:05:08] legoktm: you can create a remote branch using gerrit's UI, called 1.23wmf6, and specify 8077269c2120bc39aa43bfb62b4ee267847f34f3 as its starting point, because that's the commit wmf6 is currently on [05:05:14] and you can cherry-pick the debug logging patch into that branch [05:05:27] if you do that, I could sync it [05:05:44] sure [05:08:52] ori-l: done [05:10:32] legoktm: you should also bump the submodule commit in the wmf6 branch [05:10:55] ok [05:13:38] (03CR) 10Ori.livneh: Be multithreaded.
(031 comment) [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 (owner: 10Ori.livneh) [05:16:12] ori-l: is there an interface in gerrit to do that, or do I need to do it manually? [05:17:25] manually. so, assuming you have a 1.22wmf6 branch tracking origin/1.22wmf6, and assuming it's checked out: [05:17:45] 1.23, i mean [05:18:01] I found http://stackoverflow.com/questions/8191299/update-a-submodule-to-the-latest-commit [05:18:23] git submodule update --init extensions/MassMessage ; cd extensions/MassMessage ; git fetch ; git checkout origin/1.23wmf6 ; cd ../.. ; git add extensions/MassMessage ; git commit -m 'Updating MassMessage to tip of 1.22wmf6 branch' ; git review [05:19:30] ok [05:29:56] (03CR) 10Ori.livneh: [C: 032] Add MassMessage to $wgDebugLogGroups [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101809 (owner: 10Legoktm) [05:30:27] (03Merged) 10jenkins-bot: Add MassMessage to $wgDebugLogGroups [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101809 (owner: 10Legoktm) [05:34:29] !log ori synchronized php-1.23wmf6/extensions/MassMessage/MassMessageJob.php 'Iec240623a: Add debug logging for bug 57464' [05:34:40] !log ori updated /a/common to {{Gerrit|If79a9443a}}: Add MassMessage to $wgDebugLogGroups [05:34:45] Logged the message, Master [05:35:00] Logged the message, Master [05:35:52] !log ori synchronized wmf-config/InitialiseSettings.php 'If79a9443a: Add MassMessage to ' [05:36:02] * ori-l headdesks. [05:36:08] :/ [05:36:09] Logged the message, Master [05:36:19] I always forget to escape $ in sync messages [05:36:31] do you know how long it will take the code to propagate to the job queue runners? [05:36:43] negative one minute? 
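Circling back to the argparse thread above: ori-l's suggestion that a decorator accepting a whole preconfigured parser, i.e. `@util.args(argument_parser_instance)`, would keep features like mutually exclusive groups and argument groups available can be sketched minimally as follows. The names `args` and `do_deploy` are hypothetical, not trigger's real API:

```python
import argparse

def args(parser):
    """Hypothetical @util.args-style decorator: attach a fully configured
    ArgumentParser to the handler function, so any argparse feature
    (argument groups, mutual exclusion) remains usable."""
    def decorator(func):
        func.parser = parser
        return func
    return decorator

# Building the parser directly makes mutual exclusion trivial, which is
# awkward to express through per-argument decorators.
deploy_parser = argparse.ArgumentParser(prog='deploy')
target = deploy_parser.add_mutually_exclusive_group()
target.add_argument('--all', action='store_true')
target.add_argument('--host')

@args(deploy_parser)
def do_deploy(ns):
    # Report which deployment target was selected.
    return 'all hosts' if ns.all else ns.host
```

Passing `--all` and `--host` together now fails at parse time for free, because the real `ArgumentParser` enforces the exclusion.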
[05:36:50] :D [05:36:57] let me send a test message [05:39:16] [[Special:Log/massmessage]] skipbadns * MediaWiki message delivery * Delivery of "Testing [[bugzilla:57464|bug 57464]]" to [[Legoktm]] was skipped because target was in a namespace that cannot be posted in [05:39:28] ori-l: did anything show up in MassMessage.log? [05:39:47] yes [05:39:58] i'll tell you on monday [05:40:23] what do you think this is? you just push a patch and get logs? psh. [05:40:26] er, alright.. [05:40:31] :| [05:40:31] i'm just trolling [05:40:46] https://dpaste.de/MAbS/raw [05:40:55] I clicked. [05:41:35] thanks [05:41:49] I think this falls under the "something else is terribly wrong" [05:41:58] yay, i was hoping for that [05:42:10] the interwiki prefix just vanished [05:45:03] forgive me, but it sounded so ominous, i got excited [05:45:03] "what do you mean, the interwiki prefix just vanished?!" [05:45:04] "i'm telling you, chief, it's just gone!" [05:45:04] "well go out there and find it!" [05:45:04] haha [05:45:04] I was assuming that the title object that goes in would be deserialized exactly the same, which doesn't seem to be the case. [05:51:09] ohhhhhhh [05:51:53] I blame core. [05:51:58] +1 [05:52:09] JobQueueRedis::getNewJobFields [05:52:22] and JobQueueRedis::getJobFromFields [05:52:28] $title = Title::makeTitleSafe( $fields['namespace'], $fields['title'] ); [05:53:12] we should make logmsgbot echo to -dev [05:53:32] this conversation isn't opsy but we keep gravitating here because of the sync notices [06:07:21] (03PS13) 10Yurik: Handle proxies for Wikipedia Zero [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [06:17:04] (03CR) 10Tim Starling: [C: 04-1] "Also -1 due to the relicensing." 
(032 comments) [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 (owner: 10Ori.livneh) [06:23:45] (03CR) 10Tim Starling: "Yes, delivering a 404 is the responsibility of the target domain, but some URLs under secure.wikimedia.org are not redirected." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99024 (owner: 10Tim Starling) [06:24:41] (03PS2) 10Tim Starling: Re-add the docroot/secure directory [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99024 [06:25:00] (03CR) 10Tim Starling: [C: 032] Re-add the docroot/secure directory [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99024 (owner: 10Tim Starling) [06:27:13] !log tstarling synchronized docroot/secure/404.html [06:27:14] (03PS2) 10Tim Starling: secure.wikimedia.org ErrorDocument [operations/apache-config] - 10https://gerrit.wikimedia.org/r/99026 [06:27:17] (03CR) 10jenkins-bot: [V: 04-1] secure.wikimedia.org ErrorDocument [operations/apache-config] - 10https://gerrit.wikimedia.org/r/99026 (owner: 10Tim Starling) [06:27:31] Logged the message, Master [06:29:26] (03CR) 10Tim Starling: [C: 032] secure.wikimedia.org ErrorDocument [operations/apache-config] - 10https://gerrit.wikimedia.org/r/99026 (owner: 10Tim Starling) [06:29:28] (03CR) 10jenkins-bot: [V: 04-1] secure.wikimedia.org ErrorDocument [operations/apache-config] - 10https://gerrit.wikimedia.org/r/99026 (owner: 10Tim Starling) [06:29:44] (03CR) 10Tim Starling: [V: 032] secure.wikimedia.org ErrorDocument [operations/apache-config] - 10https://gerrit.wikimedia.org/r/99026 (owner: 10Tim Starling) [06:36:07] (03PS1) 10Ori.livneh: Make it possible for logmsgbot to report to more than one channel [operations/puppet] - 10https://gerrit.wikimedia.org/r/101816 [06:39:32] (03PS1) 10Springle: repool db1026 after schema changes, LB lowered for warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101817 [06:40:03] (03PS2) 10Springle: repool db1026 after schema changes, LB 
lowered for warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101817 [06:40:55] (03CR) 10Springle: [C: 032] repool db1026 after schema changes, LB lowered for warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101817 (owner: 10Springle) [06:42:04] !log springle synchronized wmf-config/db-eqiad.php 'repool db1026 after schema changes, LB lowered during warm up' [06:42:18] Logged the message, Master [06:55:01] (03PS14) 10Yurik: Handle proxies for Wikipedia Zero [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [06:59:08] (03PS2) 10Ori.livneh: Be multithreaded. [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 [07:09:25] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [07:37:10] (03Abandoned) 10Arav93: Renamed $wmf* to $wmg* [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93) [07:50:41] (03PS3) 10Ori.livneh: Be multithreaded. [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 [08:12:25] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [08:32:25] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [08:57:57] (03PS1) 10Arav93: Renamed $wmf* to $wmg* Bug:43956 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101820 [09:01:55] RECOVERY - DPKG on mw1017 is OK: All packages OK [09:03:06] (03PS2) 10Peachey88: Renamed $wmf* to $wmg* Bug:43956 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101820 (owner: 10Arav93) [09:06:28] paravoid, around? [09:06:40] (03CR) 10Peachey88: "When doing commit messages, the "Bug:" line should have a blank line in between the message and the bug line, Also a space after the colon " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101820 (owner: 10Arav93) [09:08:38] (03CR) 10Arav93: "Sorry, Did you change it, or should I?"
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101820 (owner: 10Arav93) [09:10:52] (03PS4) 10Ori.livneh: Be multithreaded. [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 [09:11:11] (03CR) 10Peachey88: "I have already done it (It's patchset two on this commit)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101820 (owner: 10Arav93) [09:22:18] (03PS1) 10Stefan.petrea: Json schema, output and test [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101821 [09:22:57] damnit, my use of gerrit is suboptimal [09:23:43] ori-l: so I made use of json-glib and it gave no warnings, now in the meantime you pushed PS3 and PS4 and I did some merges with your code. I probably messed something up [09:23:58] but there is some JSON support now [09:24:25] it's all right, we'll figure it out [09:24:42] i'll probably need to go through another patchset :/ [09:24:42] I also made a test that starts mwprof, throws stuff at it on UDP, then connects through TCP to it, gets the stats, kills it. And now the unfinished part, testing the output stats against a JSON schema (which is also unfinished). [09:24:56] oh cool! [09:25:12] ori-l: will you have time to review my patch too? :) [09:25:33] yes, but not tonight, it's late [09:25:35] ok [09:25:39] thank you [09:25:51] thank you, good night [09:26:06] good night [09:26:40] * average goes for cigarettes and for another attempt to acquire more powerful hardware [09:35:42] (03PS2) 10Stefan.petrea: Json schema, output and test [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101821 [09:48:40] (03CR) 10Ori.livneh: "Tim: I incorporated the original permission statement from collector.c verbatim. I'm not sure what I got wrong.
My sole motivation was to " [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 (owner: 10Ori.livneh) [09:52:44] (03PS5) 10Ori.livneh: Rewrite for multithreading [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 [10:03:25] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [10:35:00] (03CR) 10Tim Starling: [C: 04-1] "You added GPL licensing and a "copyright WMF" statement to files which were previously public domain and mostly contributed by a volunteer" [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 (owner: 10Ori.livneh) [10:38:43] paravoid, I added proxy handling to VCL - https://gerrit.wikimedia.org/r/#/c/88261/ , please let me know if you see any issues with this approach [10:39:32] thx, and off to bed i go :) [11:18:27] (03PS1) 10Springle: decom all pmtpa s[1-7] db nodes except the temporary masters [operations/puppet] - 10https://gerrit.wikimedia.org/r/101825 [11:19:47] (03CR) 10Springle: [C: 032] decom all pmtpa s[1-7] db nodes except the temporary masters [operations/puppet] - 10https://gerrit.wikimedia.org/r/101825 (owner: 10Springle) [11:27:21] (03PS1) 10Springle: final s[1-7] pmtpa dbs state before decom [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101826 [11:27:40] (03CR) 10Springle: [C: 032] final s[1-7] pmtpa dbs state before decom [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101826 (owner: 10Springle) [11:28:34] !log springle synchronized wmf-config/db-pmtpa.php [11:28:50] Logged the message, Master [11:45:40] yay! [12:18:22] mutante: did you eventually get any off-ticket response to RT 6264? (db29 pgehres) [12:40:43] (03PS15) 10Dr0ptp4kt: WIP: Show W0 (set X-CS) for Opera Mini where applicable. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 [12:56:04] (03Abandoned) 10Hashar: parsoid: startup script now has cleared out FDs [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [12:56:21] hey hashar [13:15:30] (03CR) 10Mark Bergsma: [C: 04-1] "I think PS 14, though elaborate, is more clear than PS 15." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [13:16:59] (03Restored) 10Hashar: parsoid: startup script now has cleared out FDs [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [13:17:12] (03PS2) 10Hashar: beta: manage parsoid using upstart [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 [13:17:43] (03CR) 10Dr0ptp4kt: "Agreed. Let's use PS14 instead." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [13:18:57] (mark, don't abandon the change, just use PS14 instead. i couldn't get stuff to gerrit on friday night for some reason, but yurik cleaned up what i had emailed him for manual review.)
[13:21:52] (03PS1) 10Springle: pull pmtpa db boxes from m1, m2, x1, es1, es2, es3 for decom and/or shipping [operations/puppet] - 10https://gerrit.wikimedia.org/r/101835 [13:22:58] (03CR) 10Springle: [C: 032] pull pmtpa db boxes from m1, m2, x1, es1, es2, es3 for decom and/or shipping [operations/puppet] - 10https://gerrit.wikimedia.org/r/101835 (owner: 10Springle) [13:25:38] (03PS1) 10Springle: depool pmtpa es[234] for decom and/or shipping [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101836 [13:26:05] (03CR) 10Springle: [C: 032] depool pmtpa es[234] for decom and/or shipping [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101836 (owner: 10Springle) [13:26:30] (03PS6) 10Stefan.petrea: Rewrite for multithreading [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 (owner: 10Ori.livneh) [13:26:51] !log springle synchronized wmf-config/db-pmtpa.php [13:27:08] Logged the message, Master [13:27:59] (03PS3) 10Hashar: beta: manage parsoid using upstart [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 [13:32:06] (03CR) 10Faidon Liambotis: [C: 04-1] "See discussion on Bugzilla." [operations/apache-config] - 10https://gerrit.wikimedia.org/r/101787 (owner: 10John F. Lewis) [13:33:11] (03Abandoned) 10Stefan.petrea: Json schema, output and test [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101821 (owner: 10Stefan.petrea) [13:33:18] (03PS7) 10Dan-nl: Production configuration for GWToolset [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 [13:33:54] (03PS1) 10ArielGlenn: add missing analytics row a network info [operations/puppet] - 10https://gerrit.wikimedia.org/r/101837 [13:35:07] (03CR) 10Dan-nl: "adding add and remove group privileges on group ‘gwtoolset’ for sysops." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 (owner: 10Dan-nl) [13:36:14] except that I can't pick PS14 [13:37:52] (03CR) 10ArielGlenn: [C: 032] add missing analytics row a network info [operations/puppet] - 10https://gerrit.wikimedia.org/r/101837 (owner: 10ArielGlenn) [13:39:40] (03PS1) 10Springle: depool pmtpa db boxes from es2, es3, x1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101838 [13:41:25] (03CR) 10Springle: [C: 032] depool pmtpa db boxes from es2, es3, x1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101838 (owner: 10Springle) [13:42:17] !log springle synchronized wmf-config/db-pmtpa.php [13:42:33] Logged the message, Master [13:44:48] (03PS4) 10Hashar: beta: manage parsoid using upstart [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 [14:01:06] (03PS1) 10Springle: keep one pmtpa es[123] host each on 12th floor [operations/puppet] - 10https://gerrit.wikimedia.org/r/101843 [14:02:16] (03CR) 10Springle: [C: 032] keep one pmtpa es[123] host each on 12th floor [operations/puppet] - 10https://gerrit.wikimedia.org/r/101843 (owner: 10Springle) [14:05:23] (03PS1) 10Springle: keep one pmtpa es[123] host each on 12th floor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101844 [14:05:48] (03CR) 10Springle: [C: 032] keep one pmtpa es[123] host each on 12th floor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101844 (owner: 10Springle) [14:06:26] !log ganglia-monitor restart on srv*/mw*; gmond bug with swapoff [14:06:34] !log springle synchronized wmf-config/db-pmtpa.php [14:06:40] !g 101844,1 [14:06:41] https://gerrit.wikimedia.org/r/#q,101844,1,n,z [14:06:42] Logged the message, Master [14:06:58] Logged the message, Master [14:07:58] (03PS2) 10Hashar: beta: properly connect to parsoid instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/99659 [14:36:44] (03PS1) 10Mark Bergsma: Remove all node definitions for Squids [operations/puppet] - 
10https://gerrit.wikimedia.org/r/101856 [14:36:45] (03PS1) 10Mark Bergsma: Move all existing Squids to the decommission lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101857 [14:36:45] (03PS1) 10Mark Bergsma: Remove pmtpa Squid LVS monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/101858 [14:36:47] (03PS1) 10Mark Bergsma: Update Icinga cache groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/101859 [14:36:48] (03PS1) 10Mark Bergsma: Remove role::cache::squid [operations/puppet] - 10https://gerrit.wikimedia.org/r/101860 [14:40:13] wow [14:40:20] that's excellent :-) [14:40:26] bye bye squid [14:43:52] apergos: if you feel brave, I got an upstart script for Parsoid on https://gerrit.wikimedia.org/r/#/c/99656/ [14:44:10] apergos: made it to only apply on beta/labs, production remaining unchanged with the old shell wrapper + init.d [14:44:18] let's have a look [14:44:57] (03CR) 10Hashar: "Forgot to say I have tested it out on deployment-parsoid2.pmtpa.wmflabs and managed to restart the server via ssh without it hanging on op" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [14:45:27] gotta look at your icinga / ferm rules as well :D [14:46:00] pretty sure they will need a bunch of fixups (that's one reason it's a draft) [14:46:40] right now it purges and rewrites the rules each time, with a large diff in the puppet logs, which I hate [14:48:37] !log zuul made gate-and-submit pipeline a dependent pipeline. Changes would thus be triggered in parallel whenever a repo has several +2 attempting to land in. That should speed up gating process. See also {{bug|48419}} and {{gerrit|101839}} [14:48:54] Logged the message, Master [14:51:03] !log jenkins enabled linting jobs to be runnable in parallel. Whenever several changes are made on the same repo, Jenkins will trigger a linting job per change.
That will dramatically speed up the processing of changes since some jobs are now parallelized instead of serialized. [14:51:12] hashar: I see you don't keep logs over there, maybe you want to? + logrotate [14:51:19] Logged the message, Master [14:51:21] or at least the script looks that way [14:51:28] apergos: ah yeah forgot about the log damn [14:51:47] I believe you are writing to /dev/null which was my easy solution :-D [14:52:04] maybe I should just >> /var/log/parsoid/parsoid.log [14:52:14] I am not sure how logrotate would work [14:52:37] aka might need to restart parsoid to let upstart point to the new file [14:56:32] (03PS1) 10Mark Bergsma: Remove -squid host lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101863 [14:56:33] (03PS1) 10Mark Bergsma: Remove Squid manifests and files [operations/puppet] - 10https://gerrit.wikimedia.org/r/101864 [14:58:24] https://github.com/wikimedia/mediawiki-vagrant/blob/master/puppet/modules/mediawiki/manifests/parsoid.pp [14:58:25] hmm [15:02:02] hm [15:04:02] https://git.wikimedia.org/blob/mediawiki%2Fvagrant/c79a03b12cd6835f05a2296fe769ffd56da5c220/puppet%2Fmodules%2Fmediawiki%2Ftemplates%2Fparsoid.conf.erb he runs it without redirection, I wonder what that does [15:07:38] I guess by default it is sent to the console [15:07:41] or maybe /dev/null :( [15:08:51] !log Depooled all pmtpa Squids in PyBal [15:09:07] Logged the message, Master [15:10:32] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:10:32] PROBLEM - LVS HTTP IPv6 on wikisource-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.072 second response time [15:10:32] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.072 second response time [15:10:35] PROBLEM - LVS HTTP IPv6 on wikimedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325
bytes in 0.073 second response time [15:10:36] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.193 second response time [15:10:36] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.196 second response time [15:10:36] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.186 second response time [15:10:36] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.190 second response time [15:10:36] PROBLEM - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 2.198 second response time [15:10:39] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 2.201 second response time [15:10:43] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 4.193 second response time [15:10:43] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.205 second response time [15:10:53] PROBLEM - LVS HTTP IPv4 on mediawiki-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:10:53] PROBLEM - LVS HTTP IPv4 on wikipedia-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:10:56] uhh [15:10:57] PROBLEM - LVS HTTP IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.074 second response time [15:10:57] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.197 second response time [15:11:01] PROBLEM - LVS HTTPS IPv4 on 
mediawiki-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.195 second response time [15:11:01] PROBLEM - LVS HTTPS IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.196 second response time [15:11:02] oh yeah :) [15:11:05] PROBLEM - LVS HTTPS IPv6 on foundation-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 2.204 second response time [15:11:10] :-D [15:11:15] I suppose paging is broken for me [15:11:15] PROBLEM - LVS HTTP IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.082 second response time [15:11:19] PROBLEM - LVS HTTP IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:11:22] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:11:23] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:11:23] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.073 second response time [15:11:23] PROBLEM - LVS HTTP IPv6 on wikivoyage-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.073 second response time [15:11:23] well it's working great for me :-D [15:11:27] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.258 second response time [15:11:31] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:11:31] PROBLEM - LVS HTTP IPv4 on wikimedia-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:11:34] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 1.086 second response time [15:11:34] PROBLEM - LVS HTTP 
IPv4 on wikiquote-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:11:34] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.190 second response time [15:11:38] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.189 second response time [15:11:38] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 1.185 second response time [15:11:38] PROBLEM - LVS HTTPS IPv6 on wikibooks-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 3.208 second response time [15:11:38] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.218 second response time [15:11:39] PROBLEM - LVS HTTP IPv6 on foundation-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.095 second response time [15:11:39] PROBLEM - LVS HTTP IPv6 on wikibooks-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.094 second response time [15:11:39] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.192 second response time [15:11:42] \o/ [15:11:44] I didn't get paged either but I suppose that's because my paging timezone is still PST [15:11:48] good thing [15:11:48] PROBLEM - LVS HTTP IPv4 on wikinews-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:11:48] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.077 second response time [15:11:48] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.074 
second response time [15:11:48] PROBLEM - LVS HTTPS IPv6 on wikivoyage-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.191 second response time [15:11:50] yeah [15:11:52] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.191 second response time [15:11:52] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.195 second response time [15:11:52] PROBLEM - LVS HTTP IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:11:55] PROBLEM - LVS HTTP IPv4 on wikisource-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:11:56] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 2.205 second response time [15:11:56] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 4.201 second response time [15:11:56] PROBLEM - LVS HTTPS IPv6 on upload-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.193 second response time [15:12:20] paging works great for me [15:12:22] PROBLEM - LVS HTTPS IPv4 on upload-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.185 second response time [15:12:25] :-D [15:12:27] Speaking of, could someone change that to CET for me? I think I might be the only person who is on the paging list but doesn't have the ability to change their paging timezone [15:12:27] PROBLEM - LVS HTTP IPv4 on upload-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [15:12:28] my phone is having a seizure [15:12:29] same here ... 
[15:12:40] sorry ;D [15:12:43] oh there it goes [15:12:44] PROBLEM - LVS HTTP IPv6 on upload-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.072 second response time [15:13:05] RoanKattouw: sure, I 'll do it [15:13:22] what's the puppet master db now? [15:13:28] db1001 [15:13:31] tnx [15:13:50] hey, I got pages! [15:13:53] cool [15:13:55] it works again [15:13:56] :) [15:14:02] how very reliable [15:14:38] RoanKattouw: you know it's in puppet, right? :) [15:14:46] for some reason i don't have pages when I am in USA. So yeah.. reliable :-) [15:15:36] that was a lot of notifications [15:15:48] nothing to see here, move along [15:17:38] (03PS2) 10Mark Bergsma: Remove Squid manifests and files [operations/puppet] - 10https://gerrit.wikimedia.org/r/101864 [15:17:39] (03PS2) 10Mark Bergsma: Remove role::cache::squid [operations/puppet] - 10https://gerrit.wikimedia.org/r/101860 [15:17:40] (03PS2) 10Mark Bergsma: Remove -squid host lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101863 [15:17:41] (03PS2) 10Mark Bergsma: Remove all node definitions for Squids [operations/puppet] - 10https://gerrit.wikimedia.org/r/101856 [15:17:42] (03PS2) 10Mark Bergsma: Move all existing Squids to the decommission lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101857 [15:17:43] (03PS2) 10Mark Bergsma: Remove pmtpa Squid LVS monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/101858 [15:17:44] (03PS2) 10Mark Bergsma: Update Icinga cache groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/101859 [15:18:34] test [15:18:35] paravoid: AFAIK the list of which timezones are available is in puppet, but the manifest of which individual is in which timezone is in some private git repo somewhere [15:18:42] (03CR) 10Hashar: "The $wgAddGroups and $wgRemoveGroups will be shipped by the extension as of https://gerrit.wikimedia.org/r/#/c/101861/" [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/101488 (owner: 10Hashar) [15:18:48] RoanKattouw: looks like it, fixing [15:19:13] roan is in CET now? [15:19:40] mark: Temporarily for two weeks (holidays) [15:19:44] ah [15:19:47] RoanKattouw: paravoid: done [15:19:50] Thanks man [15:19:52] test [15:20:11] ahh Jenkins jobs doing 'puppet validate' are now running in parallel :D [15:20:15] should speed up things a bit for ops [15:20:23] So, nothing permanent but long enough to not want my phone to explode in the middle of the night [15:21:02] hashar: :-) [15:21:14] and it took me ~2 years to figure it out :-D [15:21:49] hashar: hey btw... I want to upgrade the PHP version on the beta cluster. Where should i start looking on how to do that ? [15:22:31] akosiaris: I think last time I have uploaded the files in the /data/project shared directory of deployment-prep labs project [15:22:52] akosiaris: then manually upgraded the packages on the two apaches boxes: deployment-apache32.pmtpa.wmflabs and apache33.pmtpa.wmflabs [15:22:58] then crossed fingers and restarted apache [15:23:47] other boxes are using php as well such as deployment-bastion (that is the main working machine for humans and jenkins) and we have two boxes for async jobs: deployment-jobrunner08 and deployment-video06 [15:24:13] hashar: ok thanx. I 'll start with that and then work my way to ruining production again by uploading a bad PHP version :P [15:24:33] akosiaris: feel free to ruin the jenkins slaves first: gallium.wikimedia.org and lanthanum.eqiad.wmnet [15:24:47] they run so much different php code paths that there is a good chance they can raise weird bugs for you [15:24:56] hmmm that is a neat idea [15:25:03] let's start from those then :-) [15:25:18] and we will get tons of devs complaining [15:25:33] akosiaris: you know greg announced it on the deployment highlights, right? 
:) [15:25:36] so I would say: upgrade both beta and jenkins slaves [15:25:43] hashar: copytruncate option to logrotate might be ok, since we don't care if absolutely every log entry gets there intact [15:25:48] it would need testing [15:25:51] hashar: yes I noticed [15:26:00] apergos: I am not sure we want parsoid restarted though :/ [15:26:09] that is without restarting [15:26:25] apergos: ah it copies the file then 'echo -n > ' ? [15:26:31] more or less [15:26:43] truncates the existing file in place [15:26:46] so that should more or less prevent weird restarts of parsoid :D [15:26:54] exactly [15:27:21] hashar, does that mean the current version isn't installed via apt? [15:27:21] and/or puppet? [15:27:21] hashar, akosiaris, I ask because… it is possible to have a special project-local apt repository. That plus a project-specific puppet master would go a long way towards allowing beta to actually test things :) [15:27:32] * hashar out of file descriptors [15:27:37] :-D [15:27:42] andrewbogott: when is labs moving, exactly? [15:27:49] andrewbogott: it is using puppet / apt [15:28:06] andrewbogott: but sometimes we want to test out a new PHP version beforehand, so we end up installing the packages manually [15:29:12] coffee break [15:29:26] (03PS4) 10Manybubbles: Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 [15:30:09] (03CR) 10Manybubbles: "I'm glad I did another review this morning - the last patch would have removed the replica count again...." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 (owner: 10Manybubbles) [15:31:03] hashar: before it's available as a deb? [15:31:43] aude: We haven't picked a date yet -- your best bet is to watch the labs list for announcements. [15:31:45] * aude fears our labs stuff gets shut / deleted before i clean up [15:31:50] ok [15:32:30] aude: that worries me… have I said anything on the mailing list that implied I would delete anything?
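The copy-then-truncate scheme apergos and hashar sketch above is exactly what logrotate's copytruncate option does, so the eventual config could look roughly like this (path and retention are hypothetical, not the real parsoid setup):

```
# /etc/logrotate.d/parsoid -- illustrative sketch only
/var/log/parsoid/parsoid.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    # Copy the log aside, then truncate the original in place, so the
    # daemon keeps writing to its existing file descriptor and never
    # needs a restart. This is slightly racy: entries written between
    # the copy and the truncate can be lost.
    copytruncate
}
```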
[15:32:36] Other than deleting things that don't actually exist? [15:32:58] anything not in puppet or data/projects, i'm not sure about [15:33:07] * aude has to read mails again [15:33:37] man, gallium's IO is abysmal... [15:34:02] aude: Anything not puppetized or in shared storage is in /perpetual/ danger -- not so much endangered by the migration as just generally a bad idea. [15:34:08] of course :) [15:34:48] So, the migration represents a slightly higher danger to those things… e.g. any instance which won't survive a reboot won't survive, on account of everything being rebooted. [15:34:48] * aude wonders about stuff like databases [15:34:55] But I'm not actually going to murder anything intentionally :) [15:35:06] they can have backups in data/project [15:35:14] ok [15:35:30] Yeah, I try to run cron backups of my databases to data/project [15:35:39] k [15:35:52] nothing is important for wikidata.... we reset stuff anyway [15:36:04] just to know generally [15:36:33] aude: The puppet thing is a special case -- instances will need to cope with the fact that they're in a different place, possibly with a different IP. Puppet can take care of that, but only if puppet can actually run. [15:36:46] not a bit issue [15:36:48] big8 [15:36:51] big* :) [15:36:56] So, that doesn't mean the instance has to be puppetized, just that puppet has to not be stopped or super broken. [15:37:09] ok [15:37:37] * aude goes back to coding then and shall spend time on our labs stuff soon [15:37:43] andrewbogott: I believe akosiaris is going to build the new PHP deb packages and put them on beta then dpkg -i install them [15:38:04] andrewbogott: then after a few days we can upload the packages on apt and have fun debugging weird php issues [15:38:36] packages are already built and have been tested on test.wikipedia.org [15:38:45] OK. I guess that's not a whole lot worse than having a beta apt repo and using apt to install them.
[15:39:08] Maybe an identical amount of work, now that I think of it :) Since puppet won't upgrade automatically anyway. [15:39:24] kind of... it will for php packages [15:39:30] it has ensure => latest :-) [15:40:45] (03PS1) 10ArielGlenn: add pt domains [operations/dns] - 10https://gerrit.wikimedia.org/r/101873 [15:42:03] apergos: mediawiki.gr amongst them? [15:42:14] didn't look [15:42:17] akosiaris: It does? In production puppet? [15:42:33] I was asked about the pt domains specifically [15:42:33] yes [15:42:35] I thought ensure => latest was banned… I change it back to present anytime I see it [15:42:45] andrewbogott: hi! [15:42:49] andrewbogott: any comments regarding https://gerrit.wikimedia.org/r/#/c/98307/ ? [15:43:15] andrewbogott: I'd like someone from labs to merge it so they can watch for the fallout :) [15:43:16] andrewbogott: well if it is banned it is not obeying [15:43:19] git grep latest |wc -l [15:43:19] 212 [15:43:38] huh [15:43:43] which is a crappy grep btw [15:43:50] it is not that much [15:44:07] git grep 'ensure => latest' |wc -l [15:44:07] 131 [15:44:15] paravoid: I'm in the land of slow internet, will comment once gerrit actually loads :/ [15:44:19] and this is still not correct but you get the idea [15:45:21] akosiaris: Hm. Seems dicey [15:45:39] (03CR) 10Faidon Liambotis: openstack: convert iptables to ferm (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 (owner: 10Faidon Liambotis) [15:45:49] I don't think there's just an official policy, just that automatic upgrades always lead to outages and mysterious sudden changes [15:47:05] (03PS2) 10Faidon Liambotis: openstack: convert iptables to ferm [operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 [15:47:31] paravoid: Oh, /that/ patch :) I don't understand it well enough to sign off, but I'm happy to babysit the merge once you're feeling confident. 
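For reference, the pattern being counted and debated above looks like this in a Puppet manifest (package name chosen purely for illustration; the two resources are alternatives, not meant to coexist):

```puppet
# ensure => latest: the agent upgrades the package on any run where apt
# has a newer version, so upgrades land unannounced across the fleet.
package { 'php5-cli':
    ensure => latest,
}

# ensure => present: install once and leave the version alone; upgrading
# becomes an explicit, logged operator action rather than a side effect.
package { 'php5-cli':
    ensure => present,
}
```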
[15:49:27] (03PS3) 10Mark Bergsma: Remove Squid manifests and files [operations/puppet] - 10https://gerrit.wikimedia.org/r/101864 [15:49:28] (03PS3) 10Mark Bergsma: Remove role::cache::squid [operations/puppet] - 10https://gerrit.wikimedia.org/r/101860 [15:49:29] (03PS3) 10Mark Bergsma: Remove -squid host lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101863 [15:49:35] (03PS3) 10Mark Bergsma: Remove all node definitions for Squids [operations/puppet] - 10https://gerrit.wikimedia.org/r/101856 [15:49:35] (03PS3) 10Mark Bergsma: Move all existing Squids to the decommission lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101857 [15:49:35] (03PS3) 10Mark Bergsma: Update Icinga cache groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/101859 [15:55:50] !log upgraded PHP5 to 5.3.10-1ubuntu3.9+wmf1 on gallium, lanthanum for testing purposes [15:58:02] (03PS1) 10Mark Bergsma: Remove LVS monitoring for now unused esams project LB LVS services [operations/puppet] - 10https://gerrit.wikimedia.org/r/101880 [15:59:15] (03PS1) 10BryanDavis: Revert "beta: let sysops add/remove gwtoolset group" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101881 [15:59:45] (03PS2) 10BryanDavis: Revert "beta: let sysops add/remove gwtoolset group" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101881 [15:59:46] (03CR) 10Mark Bergsma: [C: 032] Remove LVS monitoring for now unused esams project LB LVS services [operations/puppet] - 10https://gerrit.wikimedia.org/r/101880 (owner: 10Mark Bergsma) [16:01:55] (03CR) 10BryanDavis: [C: 04-1] Production configuration for GWToolset (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 (owner: 10Dan-nl) [16:04:15] (03CR) 10Dan-nl: [C: 031] Revert "beta: let sysops add/remove gwtoolset group" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101881 (owner: 10BryanDavis) [16:06:25] (03PS8) 10Dan-nl: Production configuration for GWToolset 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 [16:07:29] (03PS1) 10Mark Bergsma: Resolve esams IPv6 IP conflict [operations/puppet] - 10https://gerrit.wikimedia.org/r/101883 [16:08:12] (03CR) 10Dan-nl: "permissions moved into extension in I6bfc539." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 (owner: 10Dan-nl) [16:16:55] (03PS5) 10Hashar: beta: manage parsoid using upstart [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 [16:17:43] (03PS1) 10Mark Bergsma: Temporarily use the old service IP name [operations/puppet] - 10https://gerrit.wikimedia.org/r/101884 [16:18:42] (03CR) 10Hashar: "Seems to be working on labs, might have to amend later on though :(" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [16:18:55] (03CR) 10Mark Bergsma: [C: 032] Temporarily use the old service IP name [operations/puppet] - 10https://gerrit.wikimedia.org/r/101884 (owner: 10Mark Bergsma) [16:21:11] (03CR) 10Hashar: "… which is not going to work well due to how the permissions are handled on labs. So keep the log on the local instance for now." [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [16:21:41] apergos: I have added to parsoid the logrotate stanza and a way to specify the log file being used [16:21:47] apergos: can't follow up this evening though, will have to get out in a few minutes [16:21:49] nice! [16:21:52] sure [16:22:01] guess where I'll be tomorrow... [16:22:03] right here :-P [16:22:07] yeah :-D [16:22:15] we can then apply in and try out what it does on beta [16:22:20] nice! [16:22:25] should be safe for production since classes are separted [16:22:28] separated [16:22:33] yep [16:32:00] and I am off :-D [16:59:56] (03PS1) 10Odder: Add Malayalam aliases for NS_MODULE, NS_MODULE_TALK [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101889 [17:00:55] (03CR) 10Yurik: "Thank you mark, there is a reason for my maddness." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [17:02:54] (03PS16) 10Yurik: Handle proxies for Wikipedia Zero [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [17:04:40] mark, reverted to PS14. Not exactly sure why adam checked in his version - he privately emailed it to me, and i reworked it [17:05:20] greg-g, i will need a lightning depl today for zero, any earlier time? [17:10:15] (03PS1) 10Odder: Create a Portal namespace on the Odia Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101890 [17:23:58] ^d and greg-g: starting to deploy cirrus updates to wmf7 [17:24:41] (03CR) 10Ori.livneh: "@Stefan: Please submit your work as a separate, dependent commit :(" [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 (owner: 10Ori.livneh) [17:25:55] well, whenever jenkins merges my change. it just verified it and I got excited [17:26:20] <^d> Probably from the first push. [17:26:28] <^d> I +2'd before it responded the first time. [17:27:12] (03PS1) 10Ottomata: Make sure mobile_hosts_regex is the same every time. [operations/puppet] - 10https://gerrit.wikimedia.org/r/101894 [17:27:38] (03CR) 10Ottomata: [C: 032 V: 032] Make sure mobile_hosts_regex is the same every time. [operations/puppet] - 10https://gerrit.wikimedia.org/r/101894 (owner: 10Ottomata) [17:28:45] !log manybubbles synchronized php-1.23wmf7/extensions/CirrusSearch/ 'update cirrus to master' [17:29:00] Logged the message, Master [17:30:08] test2wiki looks fine so I'll do wmf6 now [17:30:49] (03PS2) 10Anomie: l10nupdate-1: Log start and end times of rsync [operations/puppet] - 10https://gerrit.wikimedia.org/r/100913 [17:31:38] !log manybubbles synchronized php-1.23wmf6/extensions/CirrusSearch/ 'update cirrus to master' [17:31:55] Logged the message, Master [17:32:04] that caused some freaking out for a second [17:33:20] ^d: it freaked out for a second because we moved the location of the callbacks.
[17:33:47] <^d> Non-atomic deploys are fun, eh? [17:33:49] it stabilized, but some updates probably blew up [17:33:57] I suppose [17:34:05] anyway, we're back to working again [17:34:13] so I'm going to call that successful [17:34:20] <^d> :) [17:35:47] (03PS7) 10Ori.livneh: Rewrite for multithreading [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 [17:37:04] ^d: did you get a chance to look at the config change? [17:37:12] (03PS2) 10Ottomata: Using custom ganglia module instead of Logster. [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 [17:37:56] (03PS3) 10Ottomata: Using custom ganglia module instead of Logster. [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 [17:38:04] (03CR) 10Chad: [C: 032] Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 (owner: 10Manybubbles) [17:38:05] (03CR) 10Ottomata: Using custom ganglia module instead of Logster. (031 comment) [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 (owner: 10Ottomata) [17:38:11] thanks [17:38:18] <^d> manybubbles: Merged. I had the review open in another tab but hadn't pressed save. [17:38:19] (03Merged) 10jenkins-bot: Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 (owner: 10Manybubbles) [17:38:30] <^d> And since we updated code to master, my one comment I was going to leave wasn't right anymore :) [17:39:23] (03CR) 10Ori.livneh: "@Tim: OK. I restored the original authorship / permission statement and added myself to the list of authors." [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 (owner: 10Ori.livneh) [17:39:57] (03CR) 10Ori.livneh: [C: 032] l10nupdate-1: Log start and end times of rsync [operations/puppet] - 10https://gerrit.wikimedia.org/r/100913 (owner: 10Anomie) [17:39:58] cool [17:40:14] ^d: question - can I just sync-dir wmf/config to get all my files synced?
or does sync-file accept multiple files somehow [17:40:45] <^d> Yep [17:40:49] <^d> The former. [17:40:51] cool [17:40:52] <^d> Just sync-dir all of wmf-config [17:40:54] will do [17:41:27] !log manybubbles synchronized wmf-config/ 'update cirrus configuration' [17:41:43] Logged the message, Master [17:42:51] ^d and greg-g: done syncing files. starting the index build process. [17:44:51] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 2669: active_shards: 7999: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 2 [17:44:51] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 2669: active_shards: 7999: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 2 [17:45:14] please ignore that [17:45:31] I wonder if I can just turn those off during a deployment [17:45:51] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 2676: active_shards: 8020: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:45:51] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 2676: active_shards: 8020: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:45:51] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running.
status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 2676: active_shards: 8020: relocating_shards: 0: initializing_shards: 2: unassigned_shards: 4 [17:46:51] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 2684: active_shards: 8044: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [17:50:30] (03CR) 10GWicke: "We are also moving to upstart in general, so lets figure out one upstart config that works well and put it into debian/parsoid.upstart in " (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [17:57:46] (03PS17) 10Yurik: Handle proxies for Wikipedia Zero [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [17:59:24] mark, paravoid, seems like we are done with this patch, go for it :) I will deploy backend changes shortly today ( greg-g permitting ), to make proxy name "Opera". You don't have to wait for it to deploy VCL [18:00:31] gwicke: the plan was indeed to eventually use it in production (but not instantly, this allows safe testing) [18:01:40] greg-g, so is anyone deploying anything ? I would like to push some minor zero changes pls [18:05:43] !log all search indexes built for newly cirrused wikis (wikisources other than enwikisource and frwikisource and commons). populating them now. commons will take the longest. it'll take 13 hours at the current rate. past experience tells me we won't be able to keep that rate up but that this is the slow time of day to be doing it anyway. 
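An upstart job of the kind GWicke's review comment refers to might look like the sketch below (paths, user, and start conditions are guesses for illustration, not the contents of the actual change):

```
# /etc/init/parsoid.conf -- hypothetical sketch
description "Parsoid HTTP service"

start on (local-filesystems and net-device-up IFACE!=lo)
stop on runlevel [!2345]

setuid parsoid
respawn

# "console log" captures stdout/stderr in /var/log/upstart/parsoid.log,
# which the system's logrotate job rotates; the daemon itself never has
# to reopen a log file.
console log

exec /usr/bin/nodejs /usr/lib/parsoid/api/server.js
```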
[18:06:00] Logged the message, Master [18:06:08] apergos: that's good [18:06:45] apergos: https://gerrit.wikimedia.org/r/#/c/101900/ [18:07:14] ah, I thought there was another changeset around but I was not able to find it [18:07:23] (03PS1) 10Ottomata: Installing mysql client on stat* machines. [operations/puppet] - 10https://gerrit.wikimedia.org/r/101904 [18:07:24] that's basically what I have been using for Rashomon, which is pretty much the same service [18:07:26] I found ori's after some searching only [18:07:47] (03CR) 10Ottomata: [C: 032 V: 032] Installing mysql client on stat* machines. [operations/puppet] - 10https://gerrit.wikimedia.org/r/101904 (owner: 10Ottomata) [18:07:47] you wanna add a link to that in the comments? [18:07:59] will do [18:08:04] sweet! [18:08:54] (03CR) 10GWicke: "Alternative upstart config modeled on what I have been using for Rashomon:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [18:09:03] gwicke, are you guys deploying? [18:09:26] yurik: right now? [18:09:29] so that's going to wind up being replaced, because we need log rotation happening [18:09:33] yes [18:09:36] yurik, no [18:10:03] hash ar started out from about that place in the upstart work I think [18:10:04] apergos: with the default upstart behavior the logs end up in /var/log/upstart and are logrotated [18:10:15] they are? [18:10:24] upstart also avoids the need to restart the daemon to reopen the logfile [18:10:30] wait in upstart? uuuhhh [18:10:39] can't we have them go to /var/log/parsoid? [18:10:56] that should be possible too, didn't try yet [18:11:14] in the absense of greg-g, anyone else minds if i push some changes to prod? 
[18:11:25] would be nice if we can preserve the simplicity of the default behavior [18:12:06] we're going to wind up with a logrotate conf [18:12:16] so that we can specify things like a cutoff size [18:12:22] but that doesn't mean it has to be complicated [18:12:46] the complication is mostly reopening the logfile [18:13:10] ah well I was thinking that copytruncate would get us past that hurdle [18:13:28] we are reworking our logging currently, see https://bugzilla.wikimedia.org/show_bug.cgi?id=49762 [18:13:36] aiming for GELF [18:13:57] ori-l, you are not deploying anything yet, right? [18:13:59] and graylog2 [18:14:26] haven't looked into any of that [18:15:06] Reedy, any deployment blockers right now? [18:15:08] yeah, copytruncate should work too [18:15:12] guess I should look at the rfc [18:15:20] although it might be racy [18:15:34] it won't be perfect, i.e. we might lose an entry or two [18:15:40] ok, noones' here, noone would notice, deploying zero... [18:15:47] but it's not mission-critical to have every entry [18:15:56] * apergos is not noone [18:15:57] *nod*, no big deal [18:16:14] anyways for a short to mid term solution it ought to do well [18:17:25] apergos, we were thinking about a good way to have our debianization in our deploy repo, while ops has 'blessed' copies in puppet or some other repo [18:17:55] we are currently switching to https://git.wikimedia.org/summary/?r=mediawiki/services/parsoid/deploy [18:18:06] background at https://www.mediawiki.org/wiki/Parsoid/Packaging [18:18:12] is each individual log message gzipped? 
* apergos is skeptical about how much that would save, vs gzipping upon rotation [18:18:28] (gelf) [18:18:38] apergos, noone to reply :) [18:18:39] in gelf it is just transfer-encoding afaik [18:18:47] ah [18:18:53] normally one message = one UDP packet [18:19:12] there is support for larger messages too [18:20:03] parsoid seems like a perfect ue case for ryan-git-sartoris-trebucheet-deploy [18:20:14] as opposed to building debs [18:20:17] *use case [18:20:47] apergos, we are trying to support both [18:21:04] ideally without duplicating effort everywhere [18:21:13] uhhh [18:21:20] hats off to you folks if you can pull that one off [18:21:27] we don't need fast/atomic deploys, so debs should work just as well [18:22:34] as long as every deployment can be rolled back by installing the previous debs, I guess [18:22:39] yup [18:22:53] would have to switch from reprepro to some other solution [18:22:54] just thinking about how many times mw fixes are 'I live hacked X, pushing it out now' [18:23:03] mini-dinstall for example [18:23:09] that model sure doesn't work with packages [18:23:37] sketched something about debs for deploy at https://www.mediawiki.org/wiki/Parsoid/Packaging#Building_debs_for_deployment [18:23:52] I would like to see our repos able to hold previous versions of packages, which as you say we can't do now [18:24:20] !log yurik synchronized php-1.23wmf6/extensions/ZeroRatedMobileAccess/ [18:24:36] Logged the message, Master [18:24:54] I don't think that this would work well for the PHP codebase, as it might be simply too large [18:25:27] and relies more on breaking stuff only for a short time with 'atomic' upgrades [18:25:37] rather than versioning interfaces [18:26:17] but for small services that need to be restarted in a rolling fashion debs could work well [18:26:33] well with rgst-deploy (typing out all the names takes too long :-P) versioning and rollback is solved for you [18:26:55] and you can batch the deploy iirc [18:27:44] was it the
rolling restarts that was the big draw for debs? [18:27:52] !log yurik synchronized php-1.23wmf7/extensions/ZeroRatedMobileAccess/ [18:28:01] ok, done [18:28:08] Logged the message, Master [18:28:10] if it all crashes, i wasn't here [18:28:27] yeah but the problem is, neither am I (it's well into my evening) [18:31:04] !queue [18:31:05] http://burrow.openstack.org/ [18:31:09] heh [18:31:20] It works! [18:31:27] Ryan_Lane: [18:33:45] apergos: my main motivation for debs is that we need packaging for third parties anyway, and using the same code for our own deploys would avoid duplication [18:33:58] ahh [18:34:10] also, we'd get handling of system dependencies with debs too [18:35:23] without packaging MW will become harder and harder to set up while we are moving things into services [18:35:58] what other things do you have in mind? [18:36:47] that's about it I think [18:37:26] paravoid, mark - backend stuff deployed, so the VCL patch should work from the start [18:38:05] apergos, I hope that we'll get to a point where the MW setup instructions are basically apt-get install mediawiki [18:38:48] well I mean 'moving things into services', what other things? [18:38:58] parsoid is one [18:38:59] storage for example [18:39:07] external storage? [18:39:10] media storage? [18:39:18] both potentially [18:39:39] https://www.mediawiki.org/wiki/Requests_for_comment/Services_and_narrow_interfaces and https://www.mediawiki.org/wiki/Requests_for_comment/Storage_service_and_content_API [18:39:43] both WIP [18:40:29] ok that has just put me over the limit of my off-th clock work related reading but I'll leave the tabs open :-D [18:40:49] ;) [18:43:07] (03PS1) 10Jforrester: Make officewiki's Report: namespace VE-enabled [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101908 [18:44:00] mutante: https://launchpad.net/burrow this [18:44:50] obsoleted apparently [18:44:58] apergos: thanks! 
[18:45:02] !queue del [18:45:03] Successfully removed queue [18:45:55] !queue is when people talk to you about RT and queues to follow and you don't know what they mean: http://requesttracker.wikia.com/wiki/Queue , https://wikitech.wikimedia.org/wiki/RT#Which_queues_do_we_have_and_what_are_they_used_for.3F [18:45:56] Key was added [18:49:03] !rt [18:49:03] http://rt.wikimedia.org/Ticket/Display.html?id=$1 [19:05:10] !log Reloading Zuul to deploy I01d349bf21b20ce94 [19:05:27] Logged the message, Master [19:27:50] !log stopping puppet on cp1048 to test varnishkafka ganglia module [19:28:07] Logged the message, Master [19:34:41] (03CR) 10Domas: "I have a question, um, is it faster if multithreaded?" [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 (owner: 10Ori.livneh) [19:34:41] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:45] why elastic1007 go down [19:38:11] RECOVERY - Host elastic1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:38:40] Nap time [19:42:40] (03PS3) 10MaxSem: A short script for viewing exceptions [operations/puppet] - 10https://gerrit.wikimedia.org/r/38252 [19:43:16] (03PS18) 10Dr0ptp4kt: Handle proxies for Wikipedia Zero [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 [19:47:23] (03PS4) 10MaxSem: A short script for viewing exceptions [operations/puppet] - 10https://gerrit.wikimedia.org/r/38252 [19:51:57] yooo o ori-l [19:52:02] got time for a brainbounce w me today? [20:12:17] hello [20:13:36] hashar :-] [20:13:51] paravoid, you there? [20:14:01] I am [20:14:23] this ganglia module thing is becoming a pain too…i don't know much about our statsd setup, do you? [20:15:25] ottomata: hey [20:15:32] ori-l: good morning :-D [20:15:40] what do you want to know about it? [20:15:42] ahh the man I realllly want to talk to [20:15:43] well [20:15:44] and what's the current plan?
:) [20:15:51] ori-l: do you have any clue what is the ip / machine to send statsd UDP packets to please ? Zuul graphs are broken now :-D [20:15:56] the problem i'm solving is getting the proper rates in ganglia, txbytes per second, or whatever [20:15:58] so [20:16:11] hashar, ottomata: just came out of a dentist appt, so i'm not entirely present [20:16:13] 1. can statsd do that for me if I just send it the counter values [20:16:13] 2. does our statsd send to ganglia? [20:16:16] oh [20:16:19] ok ok :-) [20:16:19] woozy ori-l! [20:16:20] !ident is 5 geek karma points for (still) knowing what identd and RFC 1413 are https://en.wikipedia.org/wiki/Ident_protocol and how they relate to IRC https://www.freenode.net/faq.shtml#userequals 20 points for giving a crap and actually fixing it, hint: if there is a ~ in your /whois it's actually broken, but nobody cares :) [20:16:21] Key was added [20:16:24] ottomata: 1) yes, that's what it's for [20:16:30] 2) yes, it can [20:16:42] 2) does it already? [20:16:49] that's 3! [20:16:52] * hashar reclaims geek points from mutante [20:16:55] yes [20:16:56] haha [20:16:59] backends => [ 'graphite' ], [20:17:00] ? [20:17:06] in role::statsd [20:17:09] there's another instance iirc [20:17:11] oh [20:17:13] grep for backends [20:17:32] hashar: they're going to the right place, but it's the new graphite instance, which is not mapped to graphite.wikimedia.org just yet [20:18:03] i needed to fix the profiler, but that's done now, so i should be ready to flip the switch (that is, make the DNS change) sometime in the next 24h [20:18:09] can you live with broken graphs until then? [20:18:10] ori-l: so whenever graphite.wikimedia.org is made to point to the eqiad graphite instance the graphs will resurrect. That works for me :-D [20:18:18] I can survive with no graphs for a few days / weeks [20:18:25] yes, exactly.
thanks, and sorry for the disappearing graphs [20:18:29] yeah yeah, focus on completing the migration [20:18:29] hi all [20:18:32] it is not important [20:18:49] ori-l, not finding it [20:19:01] ah wait [20:19:04] it is in graphite.pp? [20:19:05] ori-l: as long as gdash works, I guess you are fine. https://gdash.wikimedia.org/ . Thank you! [20:19:19] misc::graphite::navtiming [20:19:29] ottomata: how are you going to push to statsd? [20:19:34] ottomata: yes [20:19:58] paravoid, was going to try to go back to logster, i think i should be able to push to statsd with that by only adding --output statsd flag(s) [20:20:17] that was the advantage of logster in the first place [20:20:49] it's just that counter values don't work with ganglia if you can't collect at least every 15 seconds [20:20:55] I think that the whole 'librdkafka emits json, varnishkafka takes the json and writes it to a file, then we run a program to tail the file, parse the json and emit it to $metrics' is kinda pointless, tbh [20:21:04] i know you do [20:21:08] and I am opinionless [20:21:10] on it [20:21:14] too many abstractions and moving parts for something that could be very easy [20:21:15] hashar: yea, 5 granted. 20 refused :) [20:21:15] whatever works is fine with me at this point [20:21:46] ottomata: what other ways are there to get stats out of librdkafka? [20:21:47] also, it looks to me like you're spending more time for this than you're spending actually deploying varnishkafka :P [20:21:49] i think Snaps has a good point, he wants to keep varnishkafka decoupled, especially since he'd have to parse the big json object from rdkafka in order to transform it into statsd format [20:21:50] does it have some api? [20:21:58] no [20:21:59] it doesn't, but we can write it :) [20:22:01] just a json file [20:22:04] haha [20:22:07] paravoid, i think you are right!
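The statsd push being discussed above is just a fire-and-forget UDP datagram in the `metric:value|type` line format, which statsd aggregates into a per-flush-interval rate before handing it to the graphite backend. A minimal sketch; the host, port, and metric names are placeholders, not the production instance:

```python
import socket

def statsd_packet(name, value, kind="c"):
    # statsd line protocol: "<metric>:<value>|<type>";
    # "c" is a counter, "ms" a timer, "g" a gauge
    return ("%s:%s|%s" % (name, value, kind)).encode()

def send_metric(addr, name, value):
    # fire-and-forget UDP; no response is expected, so a dropped
    # packet just means one lost sample
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_packet(name, value), addr)
    finally:
        sock.close()

send_metric(("127.0.0.1", 8125), "varnishkafka.txbytes", 512)
```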
[20:22:14] either statsd support in itself, or some callbacks for stats [20:22:16] i just want to deploy, but i don't want to deploy til I have more graphs! [20:22:50] ottomata: if you want a somewhat kludgy but very reliable solution, see asset-check.py in modules/webperf/files [20:22:59] Snaps suggested to make varnishkafka write to a piped process, or maybe even just a udp socket, and then have a separate listener that parsed the json and sent to whatever [20:23:07] it runs a program which emits json and invokes gmetric to send it to ganglia [20:23:10] why serialize in the first place... [20:23:19] it's not what i would call an elegant architecture but it's very reliable [20:23:33] ori-l, how do you run that? [20:23:34] cron? [20:23:47] ori-l^? [20:23:48] it's a daemon [20:23:54] haha [20:23:55] i could do that [20:23:58] paravoid didn't want me to [20:23:59] i asked :p [20:24:05] do what? [20:24:10] hashar: can I bug you for a moment? [20:24:15] well, paravoid is right to crinkle his nose at that, it's pretty hacky, but it works [20:24:17] i could easily just write an upstart script that ran logster in a while loop [20:24:19] that would work just fine [20:24:27] gwicke: sure, wanna do it in parsoid channel ? [20:24:30] to do what? I'm confused. [20:24:34] hashar, yup [20:24:48] paravoid: see modules/webperf/files/asset-check.py [20:25:05] paravoid, the problem we are solving is positively sloped metrics don't work with ganglia unless you send them to ganglia more often than gmetad collects them [20:25:05] so [20:25:08] every 15 seconds [20:25:14] logster is meant to be run by cron [20:25:18] i can't run a cron job more than once a minute [20:25:31] logster uses gmetric to send to ganglia [20:25:39] ori's asset-check does too [20:25:44] the difference is that his runs as a daemon [20:25:51] are you planning to consume more CPU time than varnishkafka itself?
:P [20:25:55] haha [20:26:00] no, mostly spent sleeping [20:26:19] paravoid: having librdkafka output json has two reasons: 1) quick integration with whatever stats framework people use. 2) the stats can be extended without ABI breakage. I don't want to conceive some generic TLV format for outputting the stats when there is JSON which is fantastic for generic stuff like this. [20:26:46] thus, the simplest way is to do popen("my-json-to-ganglia-pusher.tcl", "w") from inside varnishkafka [20:26:50] the json keys are also all unknown ahead of time [20:27:06] which is the problem i'm running into with the python ganglia module i just wrote now [20:27:27] because apparently (I DID NOT KNOW THIS BECAUSE THE GANGLIA DOCUMENTATION IS HORRIBLE), i need to have the .pyconf file list all of the metrics [20:27:29] hashar: Is it an intended behaviour change in zuul/jenkins that it now doesn't attempt to do anything about a merge when it knows it can't (because a parent dependency is not yet merged)? [20:27:38] I honestly think we're spending waaay too much engineering resources for this [20:27:47] (not to mention that I haven't actually gotten anything into ganglia from this yet, yarrgh) [20:27:50] INDEED [20:27:51] so [20:27:51] James_F: do you have an example ? [20:27:58] ottomata: see how we're doing it in varnish; we're generating .pyconf from the plugin itself [20:27:59] paravoid, can I just write an upstart while loop and run logster? [20:27:59] James_F: got to write an email about it tonight [20:28:01] pretty please? [20:28:07] hashar: https://gerrit.wikimedia.org/r/#/c/101845/ etc. [20:28:15] oh cool, paravoid [20:28:17] i could do that [20:28:18] a while loop sounds pretty horrible to me [20:28:20] hashar: Oh, so yes? Cool. Was worried gerrit was broken. [20:28:28] but i still am having trouble with other ganglia things [20:28:32] even if I put the metrics in pyconf [20:28:40] so. many. things. not. working. [20:28:42] this should be so easy!
[20:29:16] so you want to busyloop a CPU running logster? [20:29:17] James_F: seems those changes are blocked because the parent https://gerrit.wikimedia.org/r/#/c/101126/6 is not +2 ed yet. [20:29:20] come on :) [20:29:32] busyloop? [20:29:41] what's the problem with running logster every minute or so? [20:29:45] hashar: Yes, but previously Zuul would run the gate-and-submit pipelines immediately, then ping us every 24 hours with COULD NOT BE MERGED until it could. [20:29:47] while true; do logster --blabla; sleep 15; done [20:29:56] hashar: Whereas now it's just sitting there with a +2 and no activity. [20:30:02] paravoid, because gmetad collects metrics every 15 seconds [20:30:11] and if the metric has not been updated in gmond at least that often [20:30:14] it will write 0s [20:30:20] for positive slope counter values [20:30:21] it does [20:30:25] James_F: hoo that is interesting. I guess the change was submitted by Zuul and then Gerrit would whine about it [20:30:26] curr val - prev val [20:30:28] if curr == prev [20:30:29] that's a 0 [20:30:38] hashar: Yeah. [20:30:38] it does the same for python ganglia modules too [20:30:42] James_F: I made a change today to let Zuul be a bit smarter and thus look at parent changes. [20:30:59] hashar: We have long stacks in VE all the time, so we're quite tuned to existing behaviour. :-) [20:31:03] paravoid: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Upload%20caches%20eqiad&h=cp1048.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1387225847&event=show&ts=0&v=2054088278&m=vhtcpd_inpkts_enqueued&vl=pkts&ti=Packets%20enqueued&z=large [20:31:12] hashar: https://gerrit.wikimedia.org/r/#/c/101839/ ? [20:31:15] note how the hourly graph is jumpy [20:31:21] which causes the averaged graphs to have wrong values [20:31:26] why aren't you calculating the rate yourself? [20:31:27] James_F: yeah that is an interesting use case. I haven't tested it during the week-end unfortunately (i.e.
long dependency chain blocked by a grandparent) [20:31:39] note also that this is a vhtcpd graph [20:31:40] James_F: yeah that is the change [20:31:41] i didn't write that one [20:31:54] paravoid, that's what the ganglia module I just wrote does [20:31:54] hashar: Kk. Will not worry any more. :-) [20:31:56] James_F: basic explanation at http://ci.openstack.org/zuul/gating.html [20:32:07] I just read that page, looks like a really cool feature [20:32:19] Yeah. [20:32:24] and? [20:32:25] If the example looks weird, remember that A through E are meant to be independent changes in the same repo [20:32:33] haha, and it isn't working! [20:32:34] i mean [20:32:36] it works great on the cli [20:32:39] just not in gmond yet, not sure why [20:32:42] i'm trying to figure out why [20:32:50] but now I just learned that I need to generate .pyconf from the module :p [20:32:50] okay...? :) [20:32:54] Zuul makes them dependent internally because it intends to merge them in a certain order [20:33:22] ottomata: because the counter type is now in the RRD [20:33:27] yes [20:33:29] ganglia won't change the counter type, even if you report a different type [20:33:35] because it can't [20:33:35] so you need to purge the old ones [20:33:35] (03PS1) 10Odder: Add Grants namespace to $wgNamespacesToBeSearchedDefault [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101995 [20:33:39] ori-l, i did [20:33:40] hmm [20:33:43] you have to either remove the old one or rrdtool dump/restore [20:33:44] but i only restarted gmetad [20:33:45] hm [20:33:48] dump/modify the XML/restore [20:33:50] i removed all the old ones [20:33:52] the rrds [20:33:54] and restarted gmetad [20:33:56] let me try some more [20:33:58] just rm won't do it [20:34:02] ? [20:34:07] RoanKattouw: exactly.
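Calculating the rate yourself, as paravoid suggests above, sidesteps ganglia's habit of writing 0s when a positive-slope counter isn't refreshed between gmetad collections. A minimal sketch of turning counter samples into a rate; counter resets/wraps are simply skipped rather than reported as negative:

```python
import time

class CounterRate:
    """Turn monotonically increasing counter samples into a rate."""
    def __init__(self):
        self.prev_value = None
        self.prev_time = None

    def update(self, value, now=None):
        now = time.time() if now is None else now
        rate = 0.0
        if self.prev_value is not None and now > self.prev_time:
            delta = value - self.prev_value
            if delta >= 0:  # skip counter resets instead of going negative
                rate = delta / (now - self.prev_time)
        self.prev_value, self.prev_time = value, now
        return rate

r = CounterRate()
r.update(1000, now=0)
print(r.update(1600, now=15))  # 600 bytes over 15s → 40.0
```

Reporting this pre-computed rate lets the gmond metric declare `slope: both` instead of `positive`, so collection timing no longer matters.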
[20:34:18] James_F: RoanKattouw: I should really have sent the announce earlier :D [20:34:38] hashar: Yeah, well… ;-) [20:36:01] ottomata: 1) stop your gmond script, 2) run: gmetric --spoof="myhostname:myhostname" -n your.metric.name -v expiring -t string -d 10, 3) wait 30 secs [20:36:27] it's hilariously awful but it's how you get metrics out of ganglia [20:36:45] but, if I restarted gmetad and removed the rrd files…where is ganglia keeping state? [20:36:49] basically you report it as a metric with a constant string value that expires after 10 secs of no activity [20:37:17] ottomata: i'm not exactly sure, it's difficult to reason about, but it does [20:37:22] haha [20:37:53] oof, and now I have to go and generate .pyconf [20:37:54] sigh [20:37:55] :) [20:38:03] but i think this is working! [20:38:19] you might also just want to use graphite [20:38:27] and create a dashboard for gdash [20:39:46] (03PS2) 10Jforrester: Make officewiki's Report: namespace VE-enabled [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101908 [20:39:55] (03PS3) 10Catrope: Make officewiki's Report: namespace VE-enabled [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101908 (owner: 10Jforrester) [20:42:35] ori-l, graphite is still pw protected, ja? [20:42:40] ottomata: sadly, yes [20:47:11] hashar: Job Stats are broken? [20:47:24] https://integration.wikimedia.org/zuul/ [20:47:44] Krinkle: yeah I have pinged ori about Zuul graphs using graphite. [20:47:57] Krinkle: basically graphite.wikimedia.org is broken / pointing to a wrong data store. [20:48:06] will be fixed up later on, that is not that much of a high priority [20:48:24] well that is what I told to ori, i.e. we can live with no graphs for a few days / couple weeks [20:51:10] hashar: it's all of graphite or just these? [20:51:27] afaik people do use them to keep track of regressions etc. 
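The metric-removal trick ori-l described above (re-report the metric as a short-lived string so gmond expires and forgets it) can be wrapped in a small helper. A sketch assuming `gmetric` is on the PATH; the function names are illustrative:

```python
import subprocess

def expire_args(name, host, lifetime=10):
    # Re-declare the metric as a string with a short lifetime ("-d");
    # --spoof keeps it attached to the original reporting host, and
    # after `lifetime` seconds of silence gmond drops the metric.
    return ["gmetric", "--spoof", "%s:%s" % (host, host),
            "-n", name, "-v", "expiring", "-t", "string",
            "-d", str(lifetime)]

def expire_ganglia_metric(name, host, lifetime=10):
    subprocess.check_call(expire_args(name, host, lifetime))

# e.g. expire_ganglia_metric("vhtcpd_inpkts_enqueued", "cp1048.eqiad.wmnet")
```

As noted in the log, the reporting gmond script must be stopped first, or it will immediately re-create the metric with the old type.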
[20:51:28] graphite apparently [20:51:37] not sure [20:51:39] gdash.wikimedia.org works though [20:51:50] yeah, but last I checked that unusable. [20:52:06] no, gdash is the pretty one [20:52:10] but doesn't it use graphite backend? [20:52:16] (03PS1) 10Andrew Bogott: Include a timestamp for last puppet run [operations/puppet] - 10https://gerrit.wikimedia.org/r/101997 [20:52:20] it does [20:52:25] makes requests directly to graphite from js [20:52:33] https://graphite.wikimedia.org/render/?title=Wiki%20Pageviews/min%20Holt%20Winters%20Forecast%20-1hours&from=-1hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=alias(color(holtWintersForecast(reqstats.pageviews),%22green%22),%22pageview/min%20Forecast%22)&target=alias(dashed(color(holtWintersConfidenceBands(reqstats.pageviews),%22white%22 [20:52:33] )),%22pageview/min%20Confidence%22)&target=alias(color(reqstats.pageviews,%22blue%22),%22pageview/min%22) [20:52:37] and that one works? [20:52:55] So... it works, but... it doesn't for zuul? Should we update the url or something? [20:53:53] Krinkle: just be patient for a couple of more days, please [20:54:06] things are in a complicated transitory state and hard to debug for that reason [20:54:11] but i'm really very nearly done [20:54:21] Sure, just trying to understand, no rush. [20:54:43] i'm also just out of a tooth extraction appt so not entirely sane / cogent [21:02:28] (03CR) 10Ori.livneh: "@Domas: Hi! I'm not sure if it's faster. Before I started working on it, it was maxing out a CPU core and dropping data. Switching from BD" [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 (owner: 10Ori.livneh) [21:03:30] ori-l, they pulled it? :( [21:03:54] yurik: yeah, nothing else they cld do [21:04:09] sorry [21:07:21] (03CR) 10Ori.livneh: "@Domas: also, any preferences regarding licensing? 
The reason I specified the Wikimedia Foundation as the licensor in earlier patch-sets w" [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101793 (owner: 10Ori.livneh) [21:07:28] i wonder how we get hits for host == "wikipedia"... some fatal errors there for a while [21:10:14] yurik, some(one|thing) requests http://wikipedia/ [21:10:15] someone probably donated a root domain to us... and we haven't been told... [21:10:32] MaxSem, yes, i saw them in the logs [21:10:38] but no idea who that might be [21:10:56] i wouldn't have thought it would make it through the varnish and apache filters... [21:16:35] (03PS1) 10Ryan Lane: Add firstboot script and ubuntu-standard package [operations/puppet] - 10https://gerrit.wikimedia.org/r/102000 [21:19:12] !log deployed parsoid 7684df12 [21:19:28] Logged the message, Master [21:19:34] Ryan_Lane: https://gist.github.com/subbuss/9bf840d538d0539ffe83 [21:20:00] /cc paravoid: is wtp1002 still depooled? [21:20:03] seems 1002 is missing from the list? [21:20:25] it's possible it just didn't answer the salt call [21:20:38] ok, let me check 1012 and 1001 [21:20:43] are the no status replies a reason for worry? [21:21:14] 1002 returned a ping [21:21:27] no status replies are an issue with salt [21:21:28] according to https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=mem_free&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 all hosts but 1002 restarted [21:21:57] there's like a bug with the way the modules and pillars are sync'd to it [21:22:21] I just restarted 1002 [21:22:30] k, thanks [21:22:33] let me check what's up with the others [21:22:40] could 1002 be missing in redis? [21:22:54] it isn't getting status back from redis [21:23:00] the conf repo also still seems to have tampa hosts in its list [21:23:19] yeah, that's because of what's in redis [21:23:28] how can we update that [21:23:29] ? [21:23:30] I need to add a command to purge data from redis [21:23:44] hm. 
they all respond properly with the direct commands [21:23:54] I wonder if the reply is just timing out [21:24:15] maybe 30 seconds isn't long enough of a timeout with the batch [21:24:31] I should change that to 60 seconds [21:24:37] ideally this would report back through redis [21:24:52] but salt has a bug with batch mode and using returners (which is what writes to redis) [21:26:17] ok. I've restarted the salt minion on the minions that didn't respond properly [21:26:30] gwicke: would it be a major inconvenience for me to do a restart now to test this? [21:26:52] Ryan_Lane, as long as it is rolling it should be fine [21:26:57] yep [21:26:59] ok [21:28:15] Ryan_Lane, we also have a config change coming up [21:28:22] not returning at all is generally going to be an issue, but not returning a status probably just means the timeout was reached [21:28:24] which will require another restart [21:29:20] bleh. getting the status back from salt without a returner is a pain. I hope they fix that next release [21:31:15] oh. wait. [21:31:28] are you using the upstart yet? [21:31:30] gwicke: ? [21:31:37] Ryan_Lane: no, not yet [21:31:48] this command is going to be relatively unreliable, then [21:32:01] we are prepping that, might be ready for Wednesday [21:32:09] along with the new repos [21:33:27] seems I should increase the timeout for the command. [21:35:19] RoanKattouw: any objections against removing all the wikipedias from localsettings? [21:36:28] gwicke: If Parsoid has a default config with identical URLs then it's fine [21:36:38] yup [21:37:00] OK, sweet [21:38:10] RoanKattouw: will http access to the backend work for office etc? [21:38:20] No [21:38:24] Well [21:38:25] Depends [21:38:28] (03PS1) 10Ryan Lane: Make restart runner and info util more dependable [operations/puppet] - 10https://gerrit.wikimedia.org/r/102003 [21:38:34] can I merge this change in before you guys restart? 
[21:38:47] Ryan_Lane, go ahead [21:38:57] When you say "http access to the backend", you mean connecting to 10.2.n.n and sending Host: office.wikimedia.org ? [21:39:14] yes [21:39:29] or, well, sending proxy requests [21:39:34] stupid slow jenkins [21:39:47] GET http://office.wikimedia.org/... [21:39:52] That will probably work but only if you send X-Forwarded-Proto: https [21:40:02] and Host: office.wikimedia.org too [21:40:07] If you don't then the response will be 302 Location: https://office.wikimedia.org/.... [21:40:15] even in the backend? [21:40:24] hmm [21:40:29] I think so [21:40:30] (03CR) 10Ryan Lane: [C: 032] Make restart runner and info util more dependable [operations/puppet] - 10https://gerrit.wikimedia.org/r/102003 (owner: 10Ryan Lane) [21:40:36] AFAIK the HTTPS redirects are done in MW [21:40:45] ok, I'll disable direct proxying for those wikis then [21:40:51] they are low volume anyway [21:41:14] I thought the backend only sees HTTP [21:41:26] it might take the proto redirect into account for redirects [21:41:30] I don't know about the wiki-wide HTTPS redirect, but since you'll be forwarding a cookie for a logged-in user whose preference is probably to be fixed to HTTPS, MW should send a redirect [21:41:55] stupid slow puppet :D [21:42:16] office etc will go through the front-end proxies for now [21:42:26] can experiment with those later [21:42:48] Yeah [21:43:11] what is the internal api cluster name again? [21:43:20] api.svc.eqiad.wmnet [21:43:30] thx [21:43:36] I recommend using the IP [21:43:41] ok, so restart will take longer now, but should return more accurate info on the restarts. 
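Talking to the API cluster directly, as discussed above, means supplying the `Host` and `X-Forwarded-Proto` headers yourself, since MediaWiki otherwise answers with a 302 to the https:// URL. A sketch; the backend address is a placeholder for whatever `host api.svc.eqiad.wmnet` resolves to:

```python
import http.client

def backend_headers(wiki_host):
    # Host selects the wiki on the shared backend; X-Forwarded-Proto: https
    # tells MediaWiki the original request was TLS, suppressing the
    # protocol-upgrade redirect
    return {"Host": wiki_host, "X-Forwarded-Proto": "https"}

def backend_get(backend_ip, wiki_host, path):
    # plain HTTP straight to the backend LVS IP, bypassing the
    # front-end proxies
    conn = http.client.HTTPConnection(backend_ip, 80, timeout=5)
    conn.request("GET", path, headers=backend_headers(wiki_host))
    return conn.getresponse()

# e.g. backend_get("<api.svc IP>", "office.wikimedia.org", "/w/api.php")
```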
[21:44:24] Which will be something like 10.2.n.n, you can find it by running "host api.svc.eqiad.wmnet" on a cluster machine [21:44:49] of course without using the upstart it will probably continue to be unreliable [21:49:09] pushing out the config change [21:49:34] gwicke: when you want to do the restart, let me do it, so I can see if it's now working properly [21:49:47] Ryan_Lane, go for it [21:49:51] the code is pushed out [21:50:08] restarting [21:50:25] it'll take one minute to report status back now [21:50:31] !log updated Parsoid config to use the API cluster directly for most wikis [21:50:45] Logged the message, Master [21:50:53] bleh [21:50:57] it still isn't working right [21:51:19] very likely due to the init script [21:51:26] hey paravoid [21:51:39] i think the upload cache's ganglia aggregators are not set properly [21:51:43] node /^cp10(4[89]|5[01]|6[1-4])\.eqiad\.wmnet$/ { [21:51:43] if $::hostname =~ /^cp104[89]$/ { [21:51:43] $ganglia_aggregator = true [21:51:43] } [21:51:46] cp1048, cp1049 [21:51:46] but [21:51:48] ganglia.pp [21:51:54] Ryan_Lane: Will you be online in an hour or so? I was going to try using Trebuchet for the first time this afternoon. This would be the first deploy of Scholarships using it. [21:52:01] "Upload caches eqiad" => "cp1048.eqiad.wmnet cp1061.eqiad.wmnet", [21:52:02] gwicke: they are all restarted [21:52:33] for some reason every run one single minion doesn't return [21:52:38] Ryan_Lane: thx [21:52:50] and two return with no status, even though they restarted [21:52:52] i think cp1061 is correct, as it is in a different row than cp1048 [21:52:53] maybe we should use dsh instead ;P [21:52:56] shoudl I fix? [21:53:08] * gwicke hides [21:53:19] gwicke: as I mentioned, I'm pretty sure this is related to the init script [21:53:33] yeah, likely [21:54:12] Ryan_Lane: is there documentation about how to update the list of hosts to update in redis? [21:54:30] nope. 
redis is just reporting, though [21:54:39] (03PS1) 10Ottomata: Ganglia aggregators for upload caches eqiad are cp1048, cp1061. [operations/puppet] - 10https://gerrit.wikimedia.org/r/102005 [21:55:02] a batched ping of the parsoid hosts always returns the same number of hosts [21:55:08] RobH, does that look correct to you? ^^? [21:55:21] so I'd say salt is likely working right [21:55:26] just want a second opsie confirm before I merge it [21:55:26] Ryan_Lane: I was mainly wondering about the tampa hosts that show up in config deploys [21:55:28] ganglia.pp has "Upload caches eqiad" => "cp1048.eqiad.wmnet cp1061.eqiad.wmnet", [21:55:41] yeah. that's because they were added into the minion lists and there's no simple way to pull them out [21:55:47] let me delete them from the keys [21:55:49] oh, RobH is probably off data-centering [21:56:07] Ryan_Lane: since you are clearly not busy at all :p... [21:56:11] does that look right? [21:56:16] https://gerrit.wikimedia.org/r/#/c/102005/1/manifests/site.pp [21:56:19] Ryan_Lane: what happens when we add more boxes to the parsoid cluster? [21:56:48] they'll automatically show up in redis when a deployment is triggers on them [21:57:11] when puppet runs on the box, it'll add a 'deployment_target:parsoid' grain:value [21:57:22] aha [21:57:31] so that is then stored in redis? [21:57:33] which is what salt uses for targeting deployments [21:57:36] nope. [21:57:45] k, only reporting is [21:57:50] when a command is run, the minion checks into redis [21:58:03] and adds itself to the minion list [21:58:57] I hope you stay around for a while until somebody knows all the moving parts well enough ;) [21:59:11] (03CR) 10Ottomata: "Merging this, feel free to revert if it isn't correct." [operations/puppet] - 10https://gerrit.wikimedia.org/r/102005 (owner: 10Ottomata) [21:59:21] (03CR) 10Ottomata: [C: 032 V: 032] Ganglia aggregators for upload caches eqiad are cp1048, cp1061. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/102005 (owner: 10Ottomata) [22:00:31] PROBLEM - Puppet freshness on cp1048 is CRITICAL: Last successful Puppet run was Mon 16 Dec 2013 06:59:39 PM UTC [22:01:48] gwicke: I removed the pmtpa ones [22:02:02] Ryan_Lane: thanks! [22:02:07] gwicke: I plan on being an active upstream for trebuchet [22:02:30] btw, this is what I ran: srem "deploy:parsoid/config:minions" mexia.pmtpa.wmnet [22:02:35] via redis-cli [22:03:25] I'll add a command to purge minions to trigger (https://github.com/trebuchet-deploy/trigger/) [22:04:31] PROBLEM - Puppet freshness on elastic1007 is CRITICAL: Last successful Puppet run was Mon 16 Dec 2013 07:03:22 PM UTC [22:05:14] Ryan_Lane: added this config in https://wikitech.wikimedia.org/wiki/Trebuchet#Removing_minions_from_redis [22:05:19] might need tweaking [22:05:39] s/config/info/ [22:13:34] Can anyone please guide me on how do I tell which MW version a patch will be included in? [22:13:48] I think there was a discussion about this somewhere on wikitech-l, but I can't remember when it was [22:13:53] https://gerrit.wikimedia.org/r/#/c/100825/ [22:14:02] my wild guess is 1.23wmf7? [22:14:23] (03PS1) 10Dzahn: create shell account for msyed [operations/puppet] - 10https://gerrit.wikimedia.org/r/102010 [22:14:55] (03CR) 10jenkins-bot: [V: 04-1] create shell account for msyed [operations/puppet] - 10https://gerrit.wikimedia.org/r/102010 (owner: 10Dzahn) [22:17:11] (03CR) 10Dzahn: "AndrewBogott, 2 minute review? can you run your script and determine the correct UID, because i would have usually used another one, but w" [operations/puppet] - 10https://gerrit.wikimedia.org/r/102010 (owner: 10Dzahn) [22:20:31] mutante: does msyed have a labs account? If so what is their username there? [22:21:12] gwicke: cool, thanks [22:22:48] andrewbogott: i dont have those answers yet, i would then find out by saying they should first have a wikitech made themselves before asking for shell. 
that is, if the plan is that i should now always use those high UIDs to match with labs [22:23:26] Yes, the best procedure is for them to create a wikitech account first. [22:23:32] before i just counted up from our regular prod ones [22:23:37] Otherwise there's no way to control what suid they'll get after the fact. [22:23:39] that were unrelated , and still are [22:24:02] andrewbogott: ok, fair, then we shall say having wikitech is a requirement for shell [22:24:15] Yeah, I think that's best. [22:24:16] and i will ask them do go there first and come back [22:24:22] ok, fine with me [22:24:31] I updated the docs to be clearer [22:24:32] eh, last question, where is the script [22:24:44] or do you just look manually [22:24:51] in formey [22:25:05] it sounded like you had something scripted [22:25:14] Ryan_Lane: cool,thx [22:25:25] mutante: so far the script lives only here: https://gerrit.wikimedia.org/r/#/c/98700/ [22:25:28] and it doesn't work that great [22:25:56] mutante: oh, sorry, that comment was meant to gwicke :) [22:26:01] aha, gotcha. well, something to keep in mind for next time then [22:26:05] mutante: It's only useful if you start from scratch… if you have an existing patch then you just need to look up the user in ldap with, um… ldaplist -password on virt0 [22:26:19] Ryan_Lane: heh, that doesn't matter, that comment works on almost anything, haha [22:26:38] yeah, I think having a wikitech account should be a requirement for shell [22:26:46] it makes a lot of sense [22:27:03] andrewbogott: k, thank you, i shall continue with the request accordingly [22:27:21] nods [22:30:29] (03CR) 10Ryan Lane: [C: 032] Add firstboot script and ubuntu-standard package [operations/puppet] - 10https://gerrit.wikimedia.org/r/102000 (owner: 10Ryan Lane) [22:33:39] Ryan_Lane: thanks re docs! [22:33:45] yw [22:34:49] !log dist-upgrading virt1000 [22:35:04] Logged the message, Master [22:35:11] (03PS4) 10Ottomata: Using custom ganglia module instead of Logster. 
[operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 [22:35:25] (03PS2) 10Dzahn: create shell account for msyed [operations/puppet] - 10https://gerrit.wikimedia.org/r/102010 [22:35:47] (03PS2) 10Andrew Bogott: Include a timestamp for last puppet run [operations/puppet] - 10https://gerrit.wikimedia.org/r/101997 [22:36:05] (03CR) 10jenkins-bot: [V: 04-1] create shell account for msyed [operations/puppet] - 10https://gerrit.wikimedia.org/r/102010 (owner: 10Dzahn) [22:37:21] (03CR) 10Dzahn: [C: 04-1] create shell account for msyed [operations/puppet] - 10https://gerrit.wikimedia.org/r/102010 (owner: 10Dzahn) [22:38:38] (03PS3) 10Andrew Bogott: Include a timestamp for last puppet run [operations/puppet] - 10https://gerrit.wikimedia.org/r/101997 [22:38:55] paravoid, would appreciate a review on the latest varnishkafka ganglia commit [22:39:07] it does like varnish does, and generates the .pyconf file on the fly [22:40:21] (03CR) 10Andrew Bogott: [C: 032] "tested" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101997 (owner: 10Andrew Bogott) [22:40:44] (03PS1) 10Ryan Lane: Up vmbuilder version to 3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102016 [22:43:09] (03CR) 10Ryan Lane: [C: 032] Up vmbuilder version to 3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102016 (owner: 10Ryan Lane) [22:44:41] !log rebooting virt1000 [22:44:56] Logged the message, Master [22:45:44] ugh.
pdns is such a piece of shit [22:47:11] PROBLEM - Host virt1000 is DOWN: CRITICAL - Host Unreachable (208.80.154.18) [22:47:11] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.154.19) [22:49:51] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [22:49:51] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [22:50:25] (03CR) 10MSyed: "My Wikitech username is MSyed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/102010 (owner: 10Dzahn) [22:50:34] ottomata: oh. missed your message earlier. yeah. I'm around [22:51:03] !log dist-upgrading virt0 [22:51:21] Logged the message, Master [22:52:41] (03CR) 10Andrew Bogott: "UID 4206" [operations/puppet] - 10https://gerrit.wikimedia.org/r/102010 (owner: 10Dzahn) [22:55:13] !log rebooting virt0 [22:57:41] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [23:00:42] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:02:01] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 35.46 ms [23:03:31] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.044 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.153.219 [23:06:31] (03PS1) 10Ryan Lane: Make virt1000 a secondary salt master for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102026 [23:08:09] (03CR) 10jenkins-bot: [V: 04-1] Make virt1000 a secondary salt master for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102026 (owner: 10Ryan Lane) [23:10:02] (03PS1) 10Ryan Lane: Add secondary salt master into labs minion config [operations/puppet] - 10https://gerrit.wikimedia.org/r/102029 [23:10:35] (03CR) 10jenkins-bot: [V: 04-1] Add secondary salt master into labs minion config [operations/puppet] - 10https://gerrit.wikimedia.org/r/102029 (owner: 10Ryan Lane) [23:10:42] (03PS2) 10Ryan Lane: Make virt1000 a secondary salt master for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102026 [23:11:44] (03CR) 10jenkins-bot: [V: 04-1] Make virt1000 a secondary salt master for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102026 (owner: 10Ryan Lane) [23:12:25] (03PS3) 10Ryan Lane: Make virt1000 a secondary salt master for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102026 [23:16:02] (03PS4) 10Ryan Lane: Make virt1000 a secondary salt master for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102026 [23:16:09] (03PS2) 10Ryan Lane: Add secondary salt master into labs minion config [operations/puppet] - 10https://gerrit.wikimedia.org/r/102029 [23:16:49] (03CR) 10jenkins-bot: [V: 04-1] Add secondary salt master into labs minion config [operations/puppet] - 10https://gerrit.wikimedia.org/r/102029 (owner: 10Ryan Lane) [23:17:54] (03PS1) 10Ryan Lane: Remove run once logic from firstboot.sh [operations/puppet] - 10https://gerrit.wikimedia.org/r/102034 [23:18:06] (03CR) 10Ryan Lane: [C: 032] Make virt1000 a secondary salt master for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102026 (owner: 10Ryan Lane) [23:18:19] (03CR) 10Ryan Lane: [C: 032] Add secondary salt master into labs minion config 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/102029 (owner: 10Ryan Lane) [23:19:47] (03PS3) 10Ryan Lane: Add secondary salt master into labs minion config [operations/puppet] - 10https://gerrit.wikimedia.org/r/102029 [23:25:57] (03CR) 10Ryan Lane: [C: 032] Add secondary salt master into labs minion config [operations/puppet] - 10https://gerrit.wikimedia.org/r/102029 (owner: 10Ryan Lane) [23:30:38] (03PS1) 10Ryan Lane: Add -y condition to salt-key for puppetsigner script [operations/puppet] - 10https://gerrit.wikimedia.org/r/102038 [23:31:46] (03CR) 10Ryan Lane: [C: 032] Remove run once logic from firstboot.sh [operations/puppet] - 10https://gerrit.wikimedia.org/r/102034 (owner: 10Ryan Lane) [23:32:07] (03CR) 10Ryan Lane: [C: 032] Add -y condition to salt-key for puppetsigner script [operations/puppet] - 10https://gerrit.wikimedia.org/r/102038 (owner: 10Ryan Lane) [23:45:51] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [23:47:31] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1319: active_shards: 3938: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 1 [23:47:41] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1319: active_shards: 3939: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 1 [23:47:41] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1319: active_shards: 3939: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 1 [23:47:41] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1319: active_shards: 3939: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 1 [23:47:51] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1319: active_shards: 3939: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 1 [23:47:51] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1319: active_shards: 3939: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 1 [23:48:01] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1319: active_shards: 3939: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 1 [23:48:01] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1319: active_shards: 3939: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 1 [23:48:01] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1319: active_shards: 3939: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 1 [23:48:01] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1319: active_shards: 3939: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 1 [23:48:02] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1319: active_shards: 3939: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 1 [23:48:11] RECOVERY - Host elastic1007 is UP: PING OK - Packet loss = 0%, RTA = 1.49 ms [23:51:01] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [23:57:43] should I be concerned?
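[Editor's note] The ElasticSearch alerts above all report the same cluster state: status red with one unassigned shard, coinciding with elastic1007 dropping off the network; a red status means at least one primary shard is unallocated. As a minimal sketch of reading these check messages programmatically (a hypothetical helper, not part of any tool mentioned in this log), the colon-separated tail of the Icinga message can be split into key/value fields:

```python
def parse_es_health(message):
    """Parse the colon-separated key/value tail of an Icinga
    ElasticSearch health-check message into a dict."""
    # Everything after "is running." looks like:
    #   "status: red: timed_out: false: number_of_nodes: 11: ..."
    tail = message.split("is running.", 1)[1]
    tokens = [t.strip() for t in tail.split(":")]
    fields = dict(zip(tokens[0::2], tokens[1::2]))
    # Coerce numeric values so callers can compare shard counts directly
    for key, value in fields.items():
        if value.isdigit():
            fields[key] = int(value)
    return fields

# One of the alert messages from the log above
alert = ("CRITICAL - elasticsearch (production-search-eqiad) is running. "
         "status: red: timed_out: false: number_of_nodes: 11: "
         "number_of_data_nodes: 11: active_primary_shards: 1319: "
         "active_shards: 3939: relocating_shards: 0: "
         "initializing_shards: 0: unassigned_shards: 1")

health = parse_es_health(alert)
print(health["status"], health["unassigned_shards"])  # red 1
```

With the fields structured, "should I be concerned?" becomes checkable: red here stems from a single unassigned shard on a cluster with 3939 active ones, so the likely answer is that one primary lost its home when elastic1007 went down and will reallocate once the node returns.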