[00:00:03] I have a performing client now, much better to use multipart/form-data [00:01:30] grr, the dumps I downloaded expand to > 412G [00:01:59] (03CR) 10Faidon Liambotis: [C: 032] cassandra: add sysctl.d tunable [operations/puppet] - 10https://gerrit.wikimedia.org/r/91326 (owner: 10Faidon Liambotis) [00:02:35] paravoid: i still think managing sysctl.d recursively is right; it's better for this to be in puppet [00:02:54] I don't disagree [00:03:19] it's a package bug + a sysctl::conffile limitation that it always prepends priority [00:03:25] paravoid: how are thumbnails coming along? [00:03:43] paravoid: hm -- what's the prepending priority issue? [00:03:43] AaronSchulz: copying at 10-15MB/s [00:04:04] ori-l: the package's postinst does [00:04:04] if ! sysctl -p /etc/sysctl.d/cassandra.conf; then [00:04:07] (warnings) [00:04:17] rm -v /etc/sysctl.d/cassandra.conf [00:04:22] fi [00:04:38] if that file doesn't exist, configure (i.e. install, upgrade etc.) fails [00:04:39] ugh [00:04:51] that's a bug in itself, I'm going to file it and suggest that they do rm -vf [00:05:06] but sysctl::conffile can't be tweaked to install at that location [00:05:23] externally I mean, we could always fix it :) [00:05:28] i can fix that [00:05:29] but it wasn't an unreasonable decision [00:06:59] RECOVERY - Disk space on xenon is OK: DISK OK [00:08:01] god, I hate jira [00:11:40] ori-l: this seems like something random you would know off the top of your head -- the pybal data files -- do you know where the canonical location for them is? [00:11:56] it doesn't seem to be in the public puppet stuff [00:12:04] no idea at all, i'm sure paravoid knows [00:12:25] oh, um, actually [00:12:32] i think they're on fenari somewhere, no? [00:12:34] yes [00:12:43] fenari:/h/w/conf/pybal [00:12:48] ah; awesome!
[00:12:49] what he said [00:12:51] thanks :) [00:12:56] * mwalker writes that down [00:13:20] they're being served as http://noc.wikimedia.org/pybal/ [00:13:37] same thing really [00:13:50] only roots can write to them, though. [00:13:55] anything in particular I can help with? [00:14:12] I just need an updated copy -- I have a script that queries all the text caches for content [00:14:26] I'm looking for stray URLs in CentralNotice that keep getting redirected to foundation wiki [00:14:43] which group are you using? [00:14:49] we're migrating to varnish for text [00:14:55] there's a separate pybal group for this [00:15:04] ("text-varnish" vs. "text") [00:15:08] currently 'text' [00:15:10] and ah [00:15:14] so I'll need to load both [00:15:33] nod [00:15:39] expect them to have overlap though [00:15:57] we have squids in text-varnish as well, with enabled = False [00:16:16] it's easier to just vi text-varnish and convert a few Trues to False and vice-versa than switching groups [00:16:45] hehe -- the magic of python means I just add the two lists together [00:17:18] though; actually; that probably does a true addition...
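The naive list addition mwalker mentions above has exactly the trap paravoid warns about next: the same host can appear in both pybal groups with conflicting 'enabled' flags. A minimal sketch of the safer merge -- taking the union of hosts enabled in at least one group. The entry shape used here ({'host': ..., 'enabled': ...}) is an assumption for illustration, not the exact pybal file format.

```python
# Merge two pybal-style server groups without keeping conflicting duplicates.
# Entry shape is hypothetical; real pybal files may carry more keys (weight etc.).

def enabled_hosts(*groups):
    """Return the sorted union of hosts with enabled=True in any group."""
    hosts = set()
    for group in groups:
        for entry in group:
            if entry.get('enabled'):
                hosts.add(entry['host'])
    return sorted(hosts)

# The example from the conversation: cp1001 is enabled in "text" but
# disabled in "text-varnish", where cp1052 is enabled instead.
text = [{'host': 'cp1001', 'enabled': True}]
text_varnish = [{'host': 'cp1001', 'enabled': False},
                {'host': 'cp1052', 'enabled': True}]

print(enabled_hosts(text, text_varnish))  # each server appears once
```

With a true list addition, cp1001 would show up twice with contradictory flags; the union keeps one entry per server actually in rotation somewhere.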
/me looks further [00:17:24] not sure what your script does, but careful :) [00:17:34] *nods* [00:17:59] text would have cp1001 as enabled=True, text-varnish would have cp1001 as enabled=False & cp1052 as enabled=True [00:18:46] it takes a prototype CentralNotice URL; and queries every cache server for its current content under that prototype -- it's not exactly rate limited -- but we're only talking 30 queries per server -- and it's not threaded [00:19:32] (03PS1) 10Ryan Lane: Namespace mediawiki repos for deployment [operations/puppet] - 10https://gerrit.wikimedia.org/r/91327 [00:27:08] (03PS1) 10Springle: warmup db1034 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91332 [00:29:25] (03CR) 10Springle: [C: 032] warmup db1034 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91332 (owner: 10Springle) [00:30:37] !log springle synchronized wmf-config/db-eqiad.php 'warmup db1034' [00:30:49] Logged the message, Master [00:34:53] paravoid: as a first test, I'll import three history dumps in parallel from the three machines [00:35:14] each only hitting localhost [00:35:16] cool [00:36:50] that's around 900G of input text [00:39:46] ori-l: I highly suggest http://chimera.labs.oreilly.com/books/1230000000545/ [00:40:10] it was out of stock in amazon UK but it appeared in stock for 3 whole days so I managed to order it [00:40:16] but I've started reading it out of the html [00:40:34] nothing new/shocking so far, but it's a nice overview of everything [00:40:48] I haven't read it all [00:41:19] it's a bit strange that it talks about the different variants of mobile networks in a book called "high performance *browser* networking" [00:41:35] so it does get a bit too unnecessarily deep into some subjects imho [00:42:12] (the author is quite well known too) [00:42:45] Loads of copies in stock ;) [00:43:06] "Only 11 left in stock (more on the way)."
is what amazon.co.uk says now [00:43:16] it was out of stock last week [00:43:38] released a month ago [00:43:59] Ahh [00:44:58] I pinged ori-l specifically because I thought he might care the most -- but it's definitely something that could be of broader interest [00:53:41] paravoid: yeah, ilya grigorik is awesome [00:54:16] paravoid: the import is running, but the client is still the bottleneck [00:54:25] since you're already reading it, if you come away from reading it with any ideas, let me know? [00:54:45] sure. nothing groundbreaking so far [00:54:46] I'm heading out, will work on speeding up the client tomorrow [01:51:00] (03PS1) 10Reedy: Move "RewriteEngine On" earlier in www.wikimedia.org vhost [operations/apache-config] - 10https://gerrit.wikimedia.org/r/91339 [01:52:53] (03PS2) 10Reedy: Move "RewriteEngine On" earlier in www.wikimedia.org vhost [operations/apache-config] - 10https://gerrit.wikimedia.org/r/91339 [02:03:23] Reedy: you still around? [02:03:28] I'm guessing probably not [02:04:16] Bugzilla seems wonky. [02:04:20] Reedy is always around. [02:04:38] I did just try to leave... [02:04:49] gj [02:04:59] same here, but wfm [02:05:39] I'm not able to look at scary Apache vhosts now [02:05:46] so some improvement [02:06:02] mwalker: wassup? [02:06:12] Elsie: wonky? looks ok to me [02:06:36] I'm seeing about 1/4 of requests to Special:BannerRandom still being redirected to wmfwiki [02:06:41] It gave me an Apache error a few minutes ago and it hung a bit. [02:06:55] seems to be all hosts with the name sq[0-9]{2}.wikimedia.org that are doing it [02:06:59] Could just be me, though. [02:07:47] I can provide some example URLs if needed -- but what were you doing before that was clearing them? [02:08:28] Elsie: seems to be working fine for me [02:08:51] I've been able to use it, it just seems a bit wonky. :-) [02:11:16] *nods* not that I'll be able to fix it -- but what's wonky about it? queries being slow; pages rendering odd?
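mwalker's observation above is that the stale redirects all come from hosts matching sq[0-9]{2}.wikimedia.org, i.e. the squids. Given per-host probe results from a script like the one he describes, filtering the still-redirecting squids out of the full cache list could look like this sketch. The {hostname: http_status} result shape is an assumption; his actual script is not shown in the log.

```python
import re

# Pick out the cache hosts that still serve the stale 301/302 redirect,
# using the sq[0-9]{2}.wikimedia.org pattern quoted in the conversation.
# The results mapping {hostname: status_code} is a hypothetical shape.

SQUID = re.compile(r'^sq\d{2}\.wikimedia\.org$')

def stale_squids(results):
    """Return squid hostnames whose probe came back as a redirect."""
    return sorted(host for host, status in results.items()
                  if status in (301, 302) and SQUID.match(host))

results = {'sq37.wikimedia.org': 302,   # squid, still redirecting
           'cp1052.eqiad.wmnet': 200,   # varnish, serving fresh content
           'sq41.wikimedia.org': 200}   # squid, already purged

print(stale_squids(results))
```

Only the hosts this returns would then need their cached URLs purged, rather than blasting the whole fleet.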
[02:12:08] It gave me an Apache error a few minutes ago and it hung a bit. [02:12:14] Could just be me, though. [02:12:38] mwalker: Daniel had been running a script to clear the Squid caches. [02:12:41] purgeList.php [02:12:44] It's on the relevant bug. [02:12:58] He did the small *.wikimedia.org wikis. [02:13:00] but not so useful for wildcards etc [02:13:06] But probably didn't do sq0? [02:13:09] What is that? [02:13:16] squid [02:13:33] squid has an exposed public hostname? [02:13:38] https://bugzilla.wikimedia.org/show_bug.cgi?id=56006#c12 is the bug, mwalker. [02:13:45] it'll only purge a list of URLs [02:13:49] well; I have a live stream of shit that's getting 302'd -- for centralnotice [02:13:56] varnish does it a lot better [02:14:02] so... I can work with the restriction of a purge list [02:14:21] File a bug. :-) [02:14:23] relatively easily fixed then [02:14:35] you can run it yourself even! [02:14:40] I marked 56006 as fixed and recommended others file a bug. [02:14:49] For new issues such as this. [02:14:55] Reedy: where do we run this type of script from? [02:14:58] tin [02:15:05] or terbium [02:15:17] doesn't matter a great deal..quick to run [02:15:23] kk [02:15:32] quick to run over 28k urls? :p [02:15:49] ..and then my phone shut down and i lost my mobile internetz [02:16:07] !log LocalisationUpdate completed (1.22wmf22) at Wed Oct 23 02:16:06 UTC 2013 [02:16:31] Logged the message, Master [02:17:10] it is [02:17:40] without writing to console, no sleep 12k URLs was very quick [02:17:45] ok; preparing the list for blasting [02:17:54] including database queries too ;) [02:18:35] omg dos!!!!!! [02:18:44] That's the response I usually get. [02:19:03] purging all of outreachwiki took seconds :D [02:19:23] Elsie: glancing at bugzilla server i dont see anything obvious.. [02:19:23] how is it doing that? [02:19:23] Should've null edited and you would've purged Squid and refreshed the *links. ;-) [02:19:37] mutante: Okay, no worries. 
Nobody else is complaining. [02:19:42] well; how does it know about URL parameters? [02:19:49] yes, --all worked ok on small wikis [02:19:50] mwalker: purgeList.php has a --wiki parameter. [02:19:53] quality.wm too [02:19:57] Elsie: ok [02:20:08] put the full URL [02:20:18] http://en.wikipedia.org [02:20:23] arffghj [02:20:37] /WIKI/FOOBAR [02:20:47] stupid phone [02:21:07] yep; they all look something like this: http://meta.wikimedia.org/wiki/Special:BannerRandom?uselang=es&sitename=Wikipedia&project=wikipedia&anonymous=true&bucket=1&country=CO&device=desktop&slot=4 [02:21:11] I'm impressed you were able to start a line with a slash from your phone whilst raging. [02:21:26] press / twice [02:21:28] does it care about SSL variants? [02:21:48] meta isn't forced SSL, so use just http [02:21:53] gotcha [02:22:11] My IRC client went a bit goofy and joined every channel twice. [02:22:25] at worst, do it once, run it, http -> https and run it again ;) [02:24:39] !log LocalisationUpdate completed (1.22wmf21) at Wed Oct 23 02:24:39 UTC 2013 [02:24:57] Logged the message, Master [02:26:55] Reedy: Were there any reports of Meta-Wiki's API returning a 404? [02:27:08] I seem to have a lot of e-mails from cron. [02:29:04] 404s? no... [02:30:07] Reedy: damn; that was fast [02:30:16] It'll be trying to hit /w/api.php... [02:30:23] * Elsie shrugs. [02:30:29] also; /me grumbles about scp having stupid numeric options that I can never remember [02:32:31] add --verbose and it will be a lot slower! [02:33:59] heh [02:34:00] ya [02:36:40] Reedy: presumably the timeout for these 301s is the standard 30 days? [02:36:56] probably [02:37:25] *nods* I'll have to script this then; slowly whittle my horrific cache explosion of doom away [02:37:32] thanks for your help though :) [02:40:20] I think 301s are cached client-side for much longer than 30 days.
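Reedy's "do it once, run it, http -> https and run it again" above can be collapsed into one pass by expanding the purge list to both schemes up front, then handing the combined list to purgeList.php. A small sketch; the helper name and input handling are assumptions for illustration.

```python
# Expand a list of URLs so each appears with both http:// and https://
# schemes, preserving input order and skipping duplicates. The resulting
# list can be fed to purgeList.php in a single run.

def with_both_schemes(urls):
    """Yield each URL as http:// and https://, keeping input order."""
    seen = set()
    for url in urls:
        path = url.split('://', 1)[1]  # strip whatever scheme came in
        for variant in ('http://' + path, 'https://' + path):
            if variant not in seen:
                seen.add(variant)
                yield variant

urls = ['http://meta.wikimedia.org/wiki/Special:BannerRandom?uselang=es']
print(list(with_both_schemes(urls)))
```

At 28k input URLs this doubles the list, but as noted above the purge itself runs quickly.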
[02:47:16] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Oct 23 02:47:16 UTC 2013 [02:47:28] Logged the message, Master [03:09:11] (03PS1) 10Kaldari: Changing default wmgMinimumVideoPlayerSize from 200 to 800. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91342 [03:48:52] (03PS1) 10Legoktm: Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 [05:00:31] Unless anyone objects, in about half an hour I'm going to sync a small config change (https://gerrit.wikimedia.org/r/#/c/90265/) that adds purge/thumbnail rate limits. It's not urgent but I said I'd do it today and I'd prefer to stand by that commitment. [05:08:30] ori-l: do you know if that limit will be proactively revisited? or not? [05:09:42] ori-l: but, I don't object, just curious if a follow up bug/something needs to be filed. [05:25:43] RECOVERY - Host srv291 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [05:27:53] PROBLEM - Apache HTTP on srv291 is CRITICAL: Connection refused [05:28:53] RECOVERY - Apache HTTP on srv291 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.400 second response time [05:30:23] !log powercycled srv291, unresponsive to ping, no login at mgmt console [05:30:39] Logged the message, Master [05:38:33] PROBLEM - Puppet freshness on mw125 is CRITICAL: No successful Puppet run in the last 10 hours [05:40:07] (03CR) 10Ori.livneh: [C: 032] Added some purge/thumbnail rate limits [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90265 (owner: 10Aaron Schulz) [05:40:17] (03Merged) 10jenkins-bot: Added some purge/thumbnail rate limits [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90265 (owner: 10Aaron Schulz) [05:53:49] !log ori synchronized wmf-config/InitialiseSettings.php 'I11c90ed5a: Added some purge/thumbnail rate limits' [05:54:04] Logged the message, Master [05:54:25] !log sync-file: 'Could not resolve hostname mw125: Name or service not known'
[05:54:39] Logged the message, Master [06:07:15] (03CR) 10Ori.livneh: [C: 032] Timestamp log messages [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/71315 (owner: 10Ori.livneh) [06:22:03] PROBLEM - Disk space on cp1050 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 12345 MB (3% inode=99%): /srv/sdb3 13582 MB (4% inode=99%): [06:27:15] (03PS1) 10ArielGlenn: put mw125 back in (range off by one error) [operations/dns] - 10https://gerrit.wikimedia.org/r/91347 [06:28:12] (03CR) 10ArielGlenn: [C: 032] put mw125 back in (range off by one error) [operations/dns] - 10https://gerrit.wikimedia.org/r/91347 (owner: 10ArielGlenn) [06:29:34] ori-l: wanna resync that file now? [06:29:38] ^^ [06:30:03] apergos: sure [06:30:09] thanks [06:30:28] well I saw it ding my daily checks, but happened also to notice your comment in the logs [06:31:18] wonder what else has been synced in that time, I'd better check [06:31:39] right, might have to scap [06:31:50] in which case it might be better to keep it out of circulation until morning PDT [06:33:05] I can do the sync on that host only [06:34:59] * apergos tries something clever over there [06:35:05] yeah, that's right [06:35:14] sure, go for it [06:35:47] hm puppet not exactly running, that's a bit of a problem. well apache is off for the moment, and I'll poke at it [06:37:02] PROBLEM - Apache HTTP on mw125 is CRITICAL: Connection refused [06:37:43] yeah yeah we know, and it will stay that way too for a bit [06:43:10] ugh, I see...
gotta wait a while [06:45:55] have to wait for it to fall out of the tampa recursors [06:46:52] this is a great time to go to the bank, brb [06:57:58] back [07:37:44] RECOVERY - Puppet freshness on mw125 is OK: puppet ran at Wed Oct 23 07:37:39 UTC 2013 [07:39:04] RECOVERY - Apache HTTP on mw125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.274 second response time [07:39:32] good morning [07:39:52] rats, I just started another puppet run without seeing in here that it already went [07:40:05] well I'll check the log as soon as this one completes [07:40:51] mw-sync went, rsync went [07:41:22] so that's sync-common [07:41:31] that should get everything under the sun, problem solved [07:43:27] !log mw125: stopped apache shortly after adding back to dns, (needed to wait for an hour for the update to reach the pmtpa recursors so puppet could run), at first successful puppet run mw-sync completed so this host should be good to go now [07:43:41] Logged the message, Master [07:50:01] apergos: I think you can tail the puppet log in /var/log/puppet.log or something [07:50:16] apergos: I do that whenever the cron task kicks off [07:50:25] might even have colors \O/ [07:50:31] well I was not in screen, I was already running puppet [07:50:44] and too lazy to log in via another terminal, that's why I wasn't tailing the log [07:51:14] I had been watching the clock for the dns ttl to expire, see [08:26:39] (03PS4) 10Hashar: ganglia wrapper for py plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/85669 [08:26:40] (03PS1) 10Hashar: ganglia: diskstat.py plugin [operations/puppet] - 10https://gerrit.wikimedia.org/r/91351 [08:26:44] (03PS1) 10Hashar: contint: monitor CI server diskstats in Ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/91352 [08:28:02] (03CR) 10Hashar: "I have renamed the define to ganglia::python::plugin per Andrew."
[operations/puppet] - 10https://gerrit.wikimedia.org/r/85669 (owner: 10Hashar) [08:33:57] (03PS5) 10Ori.livneh: ganglia wrapper for py plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/85669 (owner: 10Hashar) [08:34:25] (03PS6) 10Ori.livneh: ganglia wrapper for py plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/85669 (owner: 10Hashar) [08:35:28] (03CR) 10Ori.livneh: [C: 032] "PS5/6: tiny whitespace / spelling changes; remove reference to bug 36994 from commit message. (Not strictly related any longer.)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/85669 (owner: 10Hashar) [08:36:22] (03PS2) 10Ori.livneh: ganglia: diskstat.py plugin [operations/puppet] - 10https://gerrit.wikimedia.org/r/91351 (owner: 10Hashar) [08:36:44] (03CR) 10Ori.livneh: [C: 032] "Gmetric module code LGTM." [operations/puppet] - 10https://gerrit.wikimedia.org/r/91351 (owner: 10Hashar) [08:38:20] I am wondering how many weeks it will take for me to exhaust Ori :-] [08:40:11] a few, at least :) is there a role where you could put the diskstat resource declaration?
[08:40:13] as opposed to site.pp [08:41:07] ori-l: someone commented on my first patch how it was adding unnecessary glue [08:41:08] https://gerrit.wikimedia.org/r/#/c/85669/1/manifests/ganglia.pp,unified [08:41:23] I had a ganglia::plugin::diskstat wrapper [08:41:25] comment is: [08:41:31] This is already nice enough: [08:41:32] ganglia::pyplugin { 'diskstat': [08:41:33] opts => { [08:41:34] devices => [ 'sda', 'sdb' ], [08:41:35] }, [08:41:36] } [08:41:56] yes, it's a question of whether it goes in site.pp or the contint role i guess [08:42:04] it should go in 'standard' eventually, but it's good to make sure it works well first on a select group of machines, so i see the logic of applying it to gallium and lanthanum first, but having it in site.pp seems [08:42:14] yeah that was my idea [08:42:16] *gross [08:42:17] try it out for contint [08:42:25] also the role class might not end up always be applied [08:42:35] i.e. role::contint::master is only on gallium [08:42:41] role::contint::slave is on both [08:43:02] but anyway, both classes rely on a SSD drive being mounted to /srv/ssd which is done at the node level [08:43:08] so I though the monitoring was making more sense there [08:43:27] could it go in ::slave, with a comment saying it is being evaluated for the standard role? 
amending :-] [08:45:04] oh, i see what you're saying [08:45:08] (03PS2) 10Hashar: contint: monitor CI server diskstats in Ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/91352 [08:45:20] IMO the mounts should move to contint as well, but that's for you to decide :P [08:45:48] yeah I tried to mount it under contint [08:45:56] does not play well cause role::ci::master requires the mount [08:46:04] and role::ci::slave requires it as well [08:46:13] but gallium includes both classes so you end up with a duplicate definition [08:46:15] and [08:46:42] it makes more sense to me to handle the mount stuff at the node level, makes it more obvious what the hardware conf is on then ode [08:46:43] node [08:46:59] want me to move the ganglia::plugin::python{'diskstat': } to role::ci::slave ? [08:48:33] hmm. well, i see the logic of having the mounts at the node level [08:48:50] at any rate it looks like faidon did it for swift and i'm sure he knows what he's doing [08:49:05] hopefully [08:49:13] maybe ask him? [08:49:15] or we will end up losing TB of commons pictures :] [08:49:32] apergos made sure it's all on archive.org anyway :P [08:49:48] no, that was Nemo_bis [08:50:04] oh [08:50:21] I would just go for node [08:50:30] and later on add to standard if that works for them [08:50:31] or [08:50:41] I can move the calls to contint::slave if you prefer, patch is ready to be sent :-] [08:50:57] I don't really care one way or another [08:52:02] well, I don't want you to amend the patch just to satisfy my preferences, and I'm suppose to be taking it nice & slow with the Puppet stuff anyway, so maybe we should wait to ask someone else? (apergos, do you have an opinion, if you've been following along?)
[08:52:10] *supposed [08:52:16] if we want to add diskstat for mysql server, we will most probably add it to the role db class as well :-] [08:52:21] I haven't, sorry, working on something else right now [08:52:28] I am only looking if someone pings me [08:52:39] got it, sorry [08:53:15] hashar: maybe wait to ask p-void? that way we both learn something [08:53:37] (03PS3) 10Hashar: contint: monitor CI server diskstats in Ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/91352 [08:53:38] na na [08:53:40] role is fine :-] [08:53:50] that is only one line added now [08:54:02] and will make sure all jenkins CI slaves have the diskstat :] [08:54:08] https://gerrit.wikimedia.org/r/#/c/91352/3/manifests/role/ci.pp,unified [08:54:16] grr missing a space [08:55:59] (03PS4) 10Hashar: contint: monitor CI server diskstats in Ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/91352 [08:56:50] also between python + {diskstat [08:58:12] i'll amend it [08:59:55] (03PS5) 10Ori.livneh: contint: monitor CI server diskstats in Ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/91352 (owner: 10Hashar) [09:00:53] (03CR) 10Ori.livneh: [C: 032] contint: monitor CI server diskstats in Ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/91352 (owner: 10Hashar) [09:02:03] hashar: merged [09:02:19] \O/ [09:02:40] might have a patch for you too in a bit :P [09:04:44] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type ganglia::plugin::python at /etc/puppet/manifests/role/ci.pp:138 on node gallium.wikimedia.org [09:04:45] :( [09:04:46] seriously [09:05:19] puppet [09:05:20] please [09:05:21] :-D [09:05:52] do you have another class named ganglia in that scope?
plugin::python versus python::plugin [09:06:21] or not [09:07:28] should be plugin::python [09:08:06] (03PS1) 10Hashar: ganglia::python::plugin --> ganglia::plugin::python [operations/puppet] - 10https://gerrit.wikimedia.org/r/91358 [09:08:09] ^^^^ [09:08:12] sorry :( [09:08:42] (03PS2) 10Ori.livneh: ganglia::python::plugin --> ganglia::plugin::python [operations/puppet] - 10https://gerrit.wikimedia.org/r/91358 (owner: 10Hashar) [09:09:16] (03CR) 10Ori.livneh: [C: 032] ganglia::python::plugin --> ganglia::plugin::python [operations/puppet] - 10https://gerrit.wikimedia.org/r/91358 (owner: 10Hashar) [09:10:02] ok, merged [09:15:27] puppet running [09:15:41] info: /Stage[main]/Role::Ci::Slave/Ganglia::Plugin::Python[diskstat]/File[/etc/ganglia/conf.d/diskstat.pyconf]: Scheduling refresh of Service[gmond] [09:15:44] \O/ [09:19:56] /usr/sbin/gmond[27120]: [PYTHON] Can't call the metric_init function in the python module [diskstat]. [09:19:57] /usr/sbin/gmond[27120]: Unable to find any metric information for 'diskstat_(.+)'.
Possible that a module has not been loaded. [09:20:01] you should get to bed ori :-] [09:28:21] hashar: you should set a 'devices' param [09:28:41] http://paste.debian.net/60759/ [09:28:53] from my manual testing earlier it was not needed [09:28:54] :( [09:33:56] you need either 'devices' or 'device_mapper' in params [09:33:59] the sequence is: [09:34:33] line 383-384: else: DEVICES = params.get('devices') [09:34:50] >>> assert {}.get('nonexistent') is None [09:34:51] ori-l: SyntaxError: Unexpected token { [09:35:07] yeah and I am wondering why gmond passes device to it [09:35:10] maybe it's a default [09:36:18] ori-l: so that sounds like a bug in there :D [09:36:38] the get should use '' as a default I guess [09:37:13] wouldn't fix it, too late at that point [09:37:44] the 'fix' would be to change 377 to: [09:37:52] (currently: if params.get('device-mapper') == 'true': ) [09:38:39] to: if params.get('device-mapper') == 'true' or 'device' not in params: [09:39:02] even that is a bit stupid [09:39:04] it should just be [09:39:05] if 'device' not in params: [09:39:37] well, if the intention was to make it require an explicit param one way or the other, it should throw an intelligible error in metric_init [09:40:06] if not 'device' in params and not 'device-mapper' in params: log.exception('one of "device" or "device-mapper" must be set!') [09:40:35] i suggest we "fix" this in puppet [09:41:07] well there are a bunch of DEVICES != '' [09:41:33] it's not too late to use tim's module, you know :P [09:41:40] falling back to empty string when fetching devices did the job: DEVICES = params.get('devices', '') [09:41:44] hehe [09:43:42] meh, i don't love the tortuous argument-handling and the gratuitous use of globals [09:43:45] but up to you [09:44:16] I could make DEVICE to default to None hehe [09:44:23] will report upstream anyway [09:44:33] (03PS1) 10Hashar: ganglia: diskstats plugin using wrong
default [operations/puppet] - 10https://gerrit.wikimedia.org/r/91360 [09:44:57] i still think you should fix it in puppet for now rather than fork [09:45:08] by setting device-mapper in the pyconf file [09:45:14] the puppet change is https://gerrit.wikimedia.org/r/91360 [09:45:20] will submit a pull request upstream [09:45:28] making DEVICES to be null and updating all the logic [09:45:33] no no, i meant in the pyconf erb template [09:45:37] until upstream merges [09:45:55] add param device-mapper { value = 'true ' } [09:45:58] like https://github.com/ganglia/gmond_python_modules/blob/master/diskstat/conf.d/diskstat.pyconf [09:46:05] except not commented out of course [09:46:18] well device-mapper looks for block devices under /dev/mapper [09:46:24] and that does not exist on the CI servers :/ [09:48:41] ugh, yeah. in that case, '' is right [09:48:51] (03PS2) 10Ori.livneh: ganglia: diskstats plugin using wrong default [operations/puppet] - 10https://gerrit.wikimedia.org/r/91360 (owner: 10Hashar) [09:49:26] (03PS1) 10ArielGlenn: remove asher, py from icinga access; use wikitech names for authz [operations/puppet] - 10https://gerrit.wikimedia.org/r/91361 [09:49:54] (03CR) 10Ori.livneh: [C: 032] ganglia: diskstats plugin using wrong default [operations/puppet] - 10https://gerrit.wikimedia.org/r/91360 (owner: 10Hashar) [09:51:27] (03CR) 10ArielGlenn: [C: 032] remove asher, py from icinga access; use wikitech names for authz [operations/puppet] - 10https://gerrit.wikimedia.org/r/91361 (owner: 10ArielGlenn) [09:58:57] (03CR) 10Hashar: "Proper fix sent upstream: https://github.com/ganglia/gmond_python_modules/pull/120" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91360 (owner: 10Hashar) [09:59:05] ori-l: patch proposed upstream https://github.com/ganglia/gmond_python_modules/pull/120 [09:59:12] might take a few months for them to process it :-] [10:01:28] ori-l: you are such a hacker [10:15:12] thank you very much [10:45:10] (03PS1) 10ArielGlenn: remove htpasswd
from icinga config, not used (see r51315) [operations/puppet] - 10https://gerrit.wikimedia.org/r/91367 [10:46:39] (03CR) 10ArielGlenn: [C: 032] remove htpasswd from icinga config, not used (see r51315) [operations/puppet] - 10https://gerrit.wikimedia.org/r/91367 (owner: 10ArielGlenn) [11:37:48] hey apergos [11:37:51] saw the patch? [12:00:31] (03CR) 10Akosiaris: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91043 (owner: 10Dzahn) [12:01:57] YuviPanda: no, not yet [12:02:04] apergos: okay! [12:03:03] apergos: 'tis https://gerrit.wikimedia.org/r/#/c/91293/, for easy clicking/reference :) [12:03:04] you added me as a reviewer so I won't lose it, in the next few days I'll test [12:04:28] apergos: okay, thanks! [12:04:56] thanks for the patch [12:05:36] apergos: :D [12:05:44] (03CR) 10Akosiaris: [C: 04-1] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91043 (owner: 10Dzahn) [12:49:31] (03PS1) 10Mark Bergsma: Setup ulsfo boxes as mobile caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/91373 [12:49:49] !log restarting apaches, "stray config": mw1117 mw1126 mw1145 mw1201 mw1206 [12:50:03] Logged the message, Master [12:51:24] (03PS2) 10Mark Bergsma: Setup ulsfo boxes as mobile caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/91373 [12:52:44] (03PS1) 10Faidon Liambotis: apache-fast-test: add API appservers to the list [operations/puppet] - 10https://gerrit.wikimedia.org/r/91374 [12:52:50] Reedy: ^ [12:53:49] (03CR) 10Mark Bergsma: [C: 032] Setup ulsfo boxes as mobile caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/91373 (owner: 10Mark Bergsma) [12:54:44] paravoid: aha [12:54:46] Very easy then :D [12:54:50] for my $pyfile (@_) { [12:55:02] qw() is just a shorthand for writing arrays [12:55:42] so $a = qw(foo bar baz) is the same as $a = ('foo', 'bar', 'baz') [12:56:32] get_server_list_from_pybal_config() just takes N filenames as arguments and enumerates through all of them [12:56:36] very 
future proof :) [12:56:48] (03CR) 10Faidon Liambotis: [C: 032] apache-fast-test: add API appservers to the list [operations/puppet] - 10https://gerrit.wikimedia.org/r/91374 (owner: 10Faidon Liambotis) [12:57:36] sweet [12:58:38] Thanks! :) [12:58:53] np [13:07:04] PROBLEM - HTTPS on cp4011 is CRITICAL: Connection refused [13:21:33] Reedy: should I add you to the "perl reviewers" group ? [13:23:01] (03PS1) 10ArielGlenn: adding asher back to icinga, we keep him as a volunteer, yay! [operations/puppet] - 10https://gerrit.wikimedia.org/r/91378 [13:23:04] paravoid: ori merged my Ganglia diskstat plugin earlier today. Might be of interest for swift/ceph/whatever [13:23:59] I saw [13:24:10] paravoid: example usage on gallium http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=gallium.wikimedia.org [13:25:15] the metrics are not sorted though, that makes the view a bit hard to analyse [13:31:20] (03CR) 10ArielGlenn: [C: 032] adding asher back to icinga, we keep him as a volunteer, yay! [operations/puppet] - 10https://gerrit.wikimedia.org/r/91378 (owner: 10ArielGlenn) [13:32:38] (03CR) 10Hashar: "That is for Apple dictionary. Tim migrated it from puppet to apache-config with https://gerrit.wikimedia.org/r/#/c/43787/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91132 (owner: 10Dzahn) [13:50:44] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [13:51:34] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 73.76 ms [13:52:04] RECOVERY - HTTPS on cp4011 is OK: OK - Certificate will expire on 01/20/2016 12:00.
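The diskstat plugin failure debugged earlier in the log ([09:19]-[09:44]) came down to dict.get() returning None for a missing key: line 383-384 did DEVICES = params.get('devices'), and the fix that did the job was falling back to an empty string with params.get('devices', ''). A stripped-down illustration of that pattern, not the real plugin code:

```python
# Illustration of the params.get defaulting bug discussed above. With no
# default, a missing 'devices' param yields None, and string handling on
# it would raise; an explicit '' default restores "no devices configured".

def init_devices(params):
    # buggy version was: devices = params.get('devices')  -> None if absent
    devices = params.get('devices', '')  # fixed: default to empty string
    return devices.split() if devices else []

print(init_devices({}))                      # no param: empty device list
print(init_devices({'devices': 'sda sdb'}))  # explicit space-separated list
```

The space-separated 'devices' value here is an assumption for the sketch; the underlying point is just ori-l's one-liner above: {}.get('nonexistent') is None.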
[13:55:14] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [13:56:14] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 75.15 ms [13:57:14] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [13:58:24] RECOVERY - Host cp4019 is UP: PING OK - Packet loss = 0%, RTA = 73.88 ms [14:01:44] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [14:02:34] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 75.05 ms [14:03:03] (03PS1) 10Mark Bergsma: Add ulsfo mobile LVS service monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/91381 [14:04:16] (03CR) 10Mark Bergsma: [C: 032] Add ulsfo mobile LVS service monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/91381 (owner: 10Mark Bergsma) [14:07:45] PROBLEM - Varnish HTTP mobile-backend on cp4019 is CRITICAL: Connection refused [14:07:55] PROBLEM - Varnish HTTP mobile-backend on cp4012 is CRITICAL: Connection refused [14:08:28] (03PS1) 10Mark Bergsma: Add mobile HTTPS LVS service in ulsfo [operations/puppet] - 10https://gerrit.wikimedia.org/r/91382 [14:08:46] (03CR) 10Mark Bergsma: [C: 032] Add mobile HTTPS LVS service in ulsfo [operations/puppet] - 10https://gerrit.wikimedia.org/r/91382 (owner: 10Mark Bergsma) [14:11:05] PROBLEM - NTP on cp4012 is CRITICAL: NTP CRITICAL: Offset unknown [14:15:05] RECOVERY - NTP on cp4012 is OK: NTP OK: Offset -0.002661824226 secs [14:15:45] RECOVERY - Varnish HTTP mobile-backend on cp4019 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.150 second response time [14:18:34] (03PS1) 10Reedy: Add pmtpa apaches for completeness [operations/puppet] - 10https://gerrit.wikimedia.org/r/91383 [14:18:37] ! 
[remote rejected] HEAD -> refs/for/master (branch master not found) [14:18:39] * Reedy stabs Ryan [14:21:33] <^d> !log elastic: rebuilding all indexes from screen in terbium [14:21:47] Logged the message, Master [14:27:56] Reedy or Ryan_Lane: do you know why http://www.wikipedia.beta.wmflabs.org/ nor http://wikipedia.beta.wmflabs.org/ work? [14:28:20] i mean - don't work :) [14:28:38] if i could figure how they work, i might fix what paravoid has been asking for :) [14:28:50] because i really have to have a testing env before doing this [14:29:01] and if it doesn't even work on beta... [14:34:47] ryan has left the building [14:35:31] hashar: would be the person to ask for that [14:36:12] yurik: they don't have a language set [14:36:19] yurik: you want en.wikipedia.beta.wmflabs.org [14:36:46] hashar: i need the front page - same as www.wikimedia.org [14:37:09] no clue what horrible hack is in use to show that :/ [14:37:12] maybe a docroot [14:37:30] exactly - i tried looking at extract2 and got scared and ran for the bushes [14:37:46] i have no idea how to reproduce it in dev env [14:37:50] * hashar vanishes [14:37:51] :D [14:37:58] so my only guess is to get beta cluster up and break it [14:38:07] IIRC it is fetched from meta directly [14:38:11] correct [14:38:29] and that been done so long ago there is no way I am going to put my fingers in there :D [14:38:29] from some page - where the HTML is stored in raw text [14:39:08] exactly - and i'm scared of that beast too - i wonder who could get it to run on beta cluster, and why it isn't working to begin with [14:39:46] sigh, TBD. Need to figure out ESI issue first - too many people are on my case about that one [14:40:09] * yurik_ is about to break mobile beta cluster again... be warned [14:46:51] extract2 stuff are in their own docroots (for now! 
WATCH THIS SPACE) [14:48:05] RewriteRule ^/$ /w/extract2.php?title=Www.wikipedia.org_portal&template=Www.wikipedia.org_template [L] [14:49:32] Reedy: do you know why it doesn't work in beta cluster? Do we need to adjust rewrite rules? [14:50:11] No [14:50:15] it would be very good to have beta mimic all those rules before i stick my dirty fingers into the code :) [14:50:26] It's not going to work on beta cluster because it is very likely that it hasn't been set up [14:51:06] Reedy: but beta uses the same puppets - it should work in similar fashion [14:51:14] No [14:51:18] Apache config isn't in puppet [14:51:18] except that HOST is a bit different [14:51:25] oh [14:51:30] lovely [14:51:41] I'm not sure how/where the beta VirtualHosts are defined [14:55:36] !log jenkins: updating job mediawiki-core-phpunit-api to no more rely on ant [14:55:51] Logged the message, Master [14:56:41] yurik_: Reedy : I guess the apache conf on beta does not have the extract2 rewrite hack [14:56:51] "hack" [14:57:04] hashar: It's only there for specific vhosts where it is needed ;) [14:57:24] there/re-written [14:57:29] file will be in /w [14:57:48] hashar@deployment-bastion:/data/project/apache/conf(master)$ grep extr * [14:57:49] wmflabs.conf: RewriteRule ^/$ /w/extract2.php?title=Www.wikimedia.org_portal&template=Www.wikimedia.org_template [L] [14:57:49] ohh [14:58:22] www.wikimedia.org has docroot /usr/local/apache/common/docroot/www.wikimedia.org [14:58:26] so yeah hmm [14:58:38] yurik_: the apache conf is incorrect [14:59:42] ideally we would use the same as in production with .org rewritten to beta.wmflabs.org [15:00:26] Let's rewrite our rewrites! [15:01:07] yei! who wants to do the honorable task of rewriting rewrite rules? :) [15:01:13] That might not work so badly actually [15:01:18] If we just did them at the end [15:01:28] Warning: DocumentRoot [/usr/local/apache/common/docroot/www.wikimedia.org] does not exist [15:01:29] :( [15:01:39] Did I kill that one? 
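The ".org rewritten to beta.wmflabs.org" idea discussed here amounts to a textual filter over the production vhost file. A minimal sketch, assuming the hostname scheme visible in this log (en.wikipedia.beta.wmflabs.org); the project list and file names are illustrative, not the real conf layout:

```shell
# Map production wiki hostnames onto their beta.wmflabs.org equivalents.
# The project list here is a made-up subset; extend as needed.
to_beta() {
  sed -E 's/\.(wikipedia|wikimedia|wiktionary|wikivoyage)\.org/.\1.beta.wmflabs.org/g'
}

# e.g. to_beta < main.conf > wmflabs.conf; a single rule would come out as:
printf 'RewriteCond %%{HTTP_HOST} =www.wikipedia.org\n' | to_beta
# -> RewriteCond %{HTTP_HOST} =www.wikipedia.beta.wmflabs.org
```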
[15:01:48] That one should be restored [15:01:52] Or well, symlinked for now [15:01:56] Reedy: you killed www ? :) [15:01:56] let me fix it [15:01:59] well the apache conf are almost 2 years old :] [15:02:17] hashar: Killed as part of the lead up to yesterdays outage ;) [15:02:21] I am not sure how it is in prod [15:02:29] seriously though, http://www.wikipedia.beta.wmflabs.org/ and http://www.m.wikipedia.beta.wmflabs.org/ [15:02:32] should work somehow [15:02:37] Reedy: can't we point to wikimedia.org docroot ? [15:02:41] currently the M. redirects to en.m. [15:02:47] which is not ideal either [15:04:15] hashar: No [15:04:22] Give me 30 seconds [15:04:26] I was going to do it on tin [15:04:28] 30 [15:04:29] 29 [15:04:31] 28 .. [15:04:32] :D [15:04:41] reedy@tin:/a/common/docroot$ ls -al www.wikimedia.org/ [15:04:41] total 12 [15:04:41] drwxrwxr-x 3 brion wikidev 4096 Oct 22 16:18 . [15:04:41] take your time, don't risk breaking prod! [15:04:42] drwxrwxr-x 64 root wikidev 4096 Oct 22 16:18 .. [15:04:44] drwxr-xr-x 2 root wikidev 4096 Jul 29 2011 w [15:05:26] Actually, bah [15:05:38] https://gerrit.wikimedia.org/r/#/c/91209/ [15:05:51] We just need to convince mutante to deploy that again :p [15:05:55] !log jenkins updating mediawiki-core-phpunit-databaseless to no more rely on ant [15:06:09] Logged the message, Master [15:06:44] Can someone in ops change /a/common/docroot (recursively) to make sure it's wikidev group and wikidev group has write please? [15:06:45] AFK while jenkins is busy [15:06:54] Per the root:wikidev above [15:08:12] Reedy: could you explain (for my general education mostly) what you are trying to do and why is it in prod? Or are you working on something non-beta docroot related? 
i thought it's in puppet/config somewhere [15:08:35] beta uses the mediawiki-config repo [15:08:42] So it reuses the same docroot folders [15:09:03] I'm in the middle of condensing most of them, which was halted with yesterday's outage [15:09:30] So the docroot that hashar mentioned doesn't exist on beta (or on tin) [15:09:37] But still exists on the production apaches [15:09:44] but I didn't propagate the deletes [15:09:57] moar symlinks [15:11:01] bock [15:11:40] (03PS1) 10Reedy: Add some symlinks to fix portal docroots for now... [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91389 [15:12:11] (03CR) 10Reedy: [C: 032] Add some symlinks to fix portal docroots for now... [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91389 (owner: 10Reedy) [15:12:24] (03CR) 10Faidon Liambotis: "I'm not sure if timeboxing it like that makes sense (what if mobile decides to add graphs in, say, two months? I won't say "too late" then" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91079 (owner: 10JGonera) [15:12:46] i see, thx reedy, so once you are done with the rework, i will jump back in and see how it works, and try to mess with it on beta cluster :) [15:12:56] (03CR) 10JanZerebecki: [C: 031] Move "RewriteEngine On" earlier in www.wikimedia.org vhost [operations/apache-config] - 10https://gerrit.wikimedia.org/r/91339 (owner: 10Reedy) [15:13:04] in the mean time - i'm breaking beta mobile cluster [15:13:12] mobile varnish that is [15:13:21] yurik is working on my bug [15:13:24] I guess SoS works after all [15:13:25] ;) [15:13:37] paravoid: i never stopped working on your bug :) [15:13:43] and who is SoS ? [15:13:59] scrum of scrums [15:14:45] Dear Jenkins, I'm bored. --~~~~ [15:15:27] so yes, hashar, i will jump into it full time once i get ESI out [15:15:39] which is similar in a sense - i'm also doing it on mobile beta [15:17:52] (03Merged) 10jenkins-bot: Add some symlinks to fix portal docroots for now... 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91389 (owner: 10Reedy) [15:19:23] Reedy: Jenkins bottleneck is Zuul which is a bit slow / spending too much time waiting [15:19:37] Reedy: will be solved whenever I manage to get Zuul upgraded :] [15:20:28] Reedy: Also, telling Jenkins you are bored is a recipe for disaster. You do not want Jenkins to be "interesting". [15:23:31] !jenkins mediawiki-core-phpunit-databaseless [15:23:31] https://integration.wikimedia.org/ci/job/mediawiki-core-phpunit-databaseless [15:28:15] (03PS1) 10Mark Bergsma: Add mobile caches ulsfo aggregators [operations/puppet] - 10https://gerrit.wikimedia.org/r/91390 [15:30:43] (03CR) 10Mark Bergsma: [C: 032] Add mobile caches ulsfo aggregators [operations/puppet] - 10https://gerrit.wikimedia.org/r/91390 (owner: 10Mark Bergsma) [15:34:54] RECOVERY - Varnish HTTP mobile-backend on cp4012 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.150 second response time [15:37:05] (03PS1) 10Mark Bergsma: Send OC mobile traffic to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/91391 [15:38:05] (03CR) 10BryanDavis: "At $DAYJOB-1 we had a "homedirs" repository laid out like /home. 
I don't know exactly how they did the puppet side of it, but as a user I " [operations/puppet] - 10https://gerrit.wikimedia.org/r/76678 (owner: 10Tim Starling) [15:38:19] (03CR) 10Mark Bergsma: [C: 032] Send OC mobile traffic to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/91391 (owner: 10Mark Bergsma) [15:39:04] andrewbogott: i completely screwed up my branch [15:40:09] !log Sending OC mobile traffic to ulsfo [15:40:24] Logged the message, Master [15:45:25] (03PS1) 10Mark Bergsma: Add ulsfo mobile caches to the XFF list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91394 [15:46:00] (03CR) 10Mark Bergsma: [C: 032] Add ulsfo mobile caches to the XFF list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91394 (owner: 10Mark Bergsma) [15:46:32] matanya, do you have work on the branch that you need to preserve? [15:46:56] no andrewbogott [15:46:59] (03PS1) 10Mark Bergsma: Revert "Send OC mobile traffic to ulsfo" [operations/dns] - 10https://gerrit.wikimedia.org/r/91395 [15:47:06] (03CR) 10Mark Bergsma: [C: 032 V: 032] Revert "Send OC mobile traffic to ulsfo" [operations/dns] - 10https://gerrit.wikimedia.org/r/91395 (owner: 10Mark Bergsma) [15:47:17] matanya, perhaps… https://wikitech.wikimedia.org/wiki/Help:Git_rebase#Don.27t_panic [15:47:43] Get a fresh branch, then cherry-pick from gerrit a less-messed-up version of the patch [15:47:45] yeah, andrewbogott that made it worse :) [15:47:51] reall? [15:47:53] y [15:47:53] ? [15:47:56] What's happening now? [15:48:30] yes, i pulled the first patch and then applied the last one [15:48:40] that totally overwrote my work [15:48:43] first and last version of the same patch? [15:48:47] yes [15:48:53] stupid, i know [15:48:59] should use rebase [15:49:05] *shouldn't [15:49:15] No, you should, but... [15:49:34] but you should cherry-pick into a fresh branch rather than into one that already has work on it. 
[15:49:50] all i want is the first patch, and fix a small thing there [15:50:13] is it better just to abandon that branch and do it cleanly? [15:50:44] Maybe -- it's certainly easy. And if you have the patch in gerrit then you won't lose anything to start a new branch. [15:51:59] !log mark synchronized wmf-config/squid.php 'Update squid list for ulsfo' [15:52:13] Logged the message, Master [15:57:14] thanks andrewbogott, i'll re-push it later today [15:57:34] matanya, do you understand about the pep8 failures you're getting from jenkins? [15:57:55] (03PS1) 10Hashar: ganglia: diskstat update from upstream [operations/puppet] - 10https://gerrit.wikimedia.org/r/91396 [16:16:59] PROBLEM - SSH on lvs4001 is CRITICAL: Server answer: [16:17:59] RECOVERY - SSH on lvs4001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:35:26] ACKNOWLEDGEMENT - Memcached on virt0 is CRITICAL: Connection refused ryan lane firewalled [16:38:01] this needs to be converted into an nrpe check sometime... [16:38:07] we had the same issue with the blog host [16:51:44] paravoid: yep [16:51:54] I was really just testing the ack system [16:52:00] well, auth [16:52:10] since apergos was having issues [16:52:34] and wanted me to test a fix [16:52:59] well I tested the fix for me (= it works), but I needed a test with user name with a space [16:53:08] and you were handy... [16:54:02] so it's capitalization? 
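The recovery described here — abandon the messed-up branch and cherry-pick the wanted patch onto a fresh one — can be sketched on a toy repository. Every name below is illustrative; in the real case the wanted commit would be fetched from Gerrit rather than made locally:

```shell
# Demonstrate "fresh branch + cherry-pick" recovery on a throwaway repo.
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email you@example.org && git config user.name you
echo base > file && git add file && git commit -qm 'base'
MAIN=$(git symbolic-ref --short HEAD)     # master or main, depending on git

# The patch we want, on a branch that later gets messed up:
git checkout -qb messy
echo wanted > file && git commit -qam 'the patch we want'
GOOD=$(git rev-parse HEAD)
echo junk >> file && git commit -qam 'accidental overwrite'

# Recovery: start a fresh branch from a clean base, pick only the good commit.
git checkout -q "$MAIN"
git checkout -qb fix-retry
git cherry-pick "$GOOD" >/dev/null
cat file                                  # -> wanted
```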
duh [16:59:42] !log ms-be1006 replacing disk at slot 2 [16:59:59] Logged the message, Master [17:00:12] no it wasn't [17:00:22] capitalization I mean [17:03:59] RECOVERY - RAID on ms-be1006 is OK: OK: State is Optimal, checked 13 logical drive(s), 13 physical drive(s) [17:23:31] (03Abandoned) 10Cmjohnson: Removing cp1021 from role/cache.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/91041 (owner: 10Cmjohnson) [17:26:48] (03PS2) 10Cmjohnson: Removing ms2 from dsh groups as it's decom'd [operations/puppet] - 10https://gerrit.wikimedia.org/r/91044 [17:27:23] greg-g: zero will go ahead and deploy shortly [17:28:00] (03CR) 10Cmjohnson: [C: 032] Removing ms2 from dsh groups as it's decom'd [operations/puppet] - 10https://gerrit.wikimedia.org/r/91044 (owner: 10Cmjohnson) [17:28:25] yurik_: any zero in philippines yet? [17:30:39] jeremyb: not that i know of [17:31:08] jeremyb: its possible that our biz dev are talking to them [17:31:14] but i don't know about it [17:32:24] (03PS2) 10Awjrichards: Ensure that m.mediawiki.org will work as an origin for CORS [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91058 [17:32:42] yurik_: ok. had some problems recently with smart.com.ph; we mailed them but i was wondering if anyone else was already talking to them [17:33:12] (03CR) 10Awjrichards: "To simplify this, I added an explicit rule for m.mediawiki.org and removed the wildcard. 
m.mediawiki.org is currently semi-broken as a res" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91058 (owner: 10Awjrichards) [17:33:18] jeremyb: sory, no idea, deployment time :) [17:33:28] yurik_: yeah, happy deployment :) [17:45:19] (03PS1) 10Cmjohnson: Remoing ipv6 entries for arsenic and niobium [operations/dns] - 10https://gerrit.wikimedia.org/r/91419 [17:47:05] (03CR) 10Cmjohnson: [C: 032] Remoing ipv6 entries for arsenic and niobium [operations/dns] - 10https://gerrit.wikimedia.org/r/91419 (owner: 10Cmjohnson) [17:47:38] !log dns update [17:47:52] Logged the message, Master [17:50:17] paravoid: I'm importing with 900 concurrent writers now after improving the client [17:55:56] (03PS1) 10Cmjohnson: Changing eqiad upload cache ganglia data sources to cp1048 and cp1061 [operations/puppet] - 10https://gerrit.wikimedia.org/r/91420 [17:58:06] (03CR) 10Ori.livneh: [C: 032] ganglia: diskstat update from upstream [operations/puppet] - 10https://gerrit.wikimedia.org/r/91396 (owner: 10Hashar) [18:01:57] !log yurik synchronized php-1.22wmf21/extensions/ZeroRatedMobileAccess/ [18:02:12] Logged the message, Master [18:03:43] gwicke: wow :) [18:03:57] what's the client you're referring to? [18:04:07] is it rashomon? or a client for that? [18:04:25] (03CR) 10CSteipp: [C: 031] Ensure that m.mediawiki.org will work as an origin for CORS [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91058 (owner: 10Awjrichards) [18:04:27] paravoid: it is a small nodejs client in test/dump [18:04:39] https://github.com/gwicke/rashomon/tree/master/test/dump [18:04:47] decreased the concurrency to 100 per client now [18:04:59] got a connection error with 300, likely a timeout [18:05:31] it reads an XML dump and sends off HTTP requests to the rashomon service [18:05:48] so did you improve that or rashomon itself? 
[18:05:59] I improved the client [18:06:12] gerrit is painfully slow today :( [18:06:22] rashomon itself is hardly breaking a sweat [18:06:39] cassandra uses most CPU for deflate compression [18:06:42] followed by the client [18:06:51] greg-g: i have updated ver21, but gerrit takes forever, and our window is over, so will update 22 later [18:07:16] can i update it in 2+ hours? [18:09:07] paravoid: the client is still the bottleneck though [18:09:36] yurik_: yessir [18:09:51] yurik_: plz add to [[wikitech:Deployments]] [18:09:59] * greg-g is on crappy conf wifi [18:19:28] (03PS3) 10Nemo bis: Simplify misc::maintenance::update_special_pages a bit [operations/puppet] - 10https://gerrit.wikimedia.org/r/90117 [18:31:20] (03PS1) 10Chad: Wikivoyages get Cirrus as alternative [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91424 [18:32:21] (03CR) 10Chad: [C: 032] Wikivoyages get Cirrus as alternative [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91424 (owner: 10Chad) [18:32:30] (03Merged) 10jenkins-bot: Wikivoyages get Cirrus as alternative [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91424 (owner: 10Chad) [18:35:39] !log demon synchronized wmf-config/InitialiseSettings.php [18:35:52] Logged the message, Master [18:54:05] !log db1033 powering down to reseat DIMM removed from pool according to RT5830 [18:54:17] Logged the message, Master [18:59:13] gwicke: heh, you're really pushing that box [18:59:53] yup [19:00:00] i/o wise it's idle [19:00:15] cpu-wise, it's full [19:00:26] yes, the sequential IO in the merge tree helps IO-wise [19:00:51] and the commitlog on the ssd avoids that becoming a bottleneck [19:01:07] for random reads IO performance will matter though [19:01:14] nod [19:02:44] does your client have a progress indicator? [19:02:50] i.e. do we know how far is it? [19:03:20] I had to make it more robust, so it is now rewriting mostly [19:03:30] rewriting? 
[19:03:39] also started two clients with 100 concurrent requests each per machine [19:03:45] re-uploading the same revisions [19:03:48] oh [19:04:00] no progress indicator [19:04:37] afaik the XML dumps don't have an indication of how many revisions there are in a header [19:04:41] the boxes don't seem terribly fast [19:05:02] so you'd have to parse it all to get the total number of revisions for a progress indicator [19:05:07] right [19:05:20] could print a running count every 1000 revisions though [19:05:22] iirc mako had some code to process every revision for enwiki in a few hours [19:05:33] so what's your plan? [19:05:38] idk what you're doing with them though [19:05:45] what are you planning to do after the dump finishes? [19:06:19] paravoid: this will establish a compression baseline on a representative page sample [19:06:38] and will establish whether cassandra falls over under high write pressure without concurrent reads [19:07:34] I think random reads according to a zipfian distribution or the like would be something good to test [19:07:58] prefer current revision, but some random accesses to old revisions thrown in [19:08:10] then compare rotating disk vs. ssd for those [19:08:17] I wonder if we could get a sample from production [19:08:33] maybe we can find a way [19:09:20] jeremyb: we have a dump grepper that greps a current dump on my laptop in ~20 minutes [19:10:09] the same underlying dumpReader module is used for rashomon testing now [19:10:13] gwicke: how come you picked deflate? [19:10:22] at random or? 
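The running-count idea mentioned above (print progress every 1000 revisions, since the dump carries no total in a header) is a one-liner over the stream; the dump filename is made up:

```shell
# Emit a progress line every 1000 revisions while streaming an XML dump.
# Counting lines that contain a <revision> open tag is good enough for a
# progress display; it is not a real XML parse.
progress() {
  awk '/<revision>/ { if (++n % 1000 == 0) print n " revisions so far" }'
}

# e.g.: zcat enwiki-pages-meta-history.xml.gz | progress
```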
I did a bunch of benchmarks with lz4, snappy and deflate and also varied the block size [19:11:07] heh, shame on me to say "random" then :) [19:11:40] since lz4 and snappy operate on fixed 64k blocks and don't use a sliding window they don't compress consecutive revisions very well [19:12:10] https://www.mediawiki.org/wiki/User:GWicke/Notes/Storage#Cassandra_compression [19:13:10] lzma might actually be worth it, as the output is extremely small which means that decompression turns out to be about as fast or even faster than deflate [19:13:53] writes would be more expensive CPU-wise, but we don't do that many of those currently [19:13:55] is there a lzma compressor already or would we have to write it ourselves? [19:14:15] in production we have less than 100 re-parses (both from edits and template updates) per second [19:14:25] edit rate is typically around 10-20 per second [19:14:35] we'd have to write one ourselves [19:14:53] I saw the patch that added lz4, it was fairly straightforward [19:14:57] I remember edit rate to be higher [19:15:21] asher did some stats for us a while ago, he got a peak of around 50 [19:15:33] we have data in graphite [19:15:34] * paravoid checks [19:15:45] that was in a period of heavy bot activity during the language link migration to wikidata [19:15:57] the 50/second number was a five minute average [19:16:09] https://gdash.wikimedia.org/dashboards/editswiki/ [19:16:59] gwicke: have you read this? http://aphyr.com/posts/294-call-me-maybe-cassandra [19:17:02] strange [19:17:15] we see much lower numbers in production [19:17:53] Ryan_Lane: no, did not read that yet [19:18:01] that's a peak of 50/s [19:18:09] the graphite graphs, that is [19:18:20] gwicke: you saw it's /min, right? 
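The sliding-window point above — consecutive, near-identical revisions compress far better in one deflate stream than one at a time — is easy to see with plain gzip on toy data (gzip's deflate window is 32KB). The data here is synthetic, not a benchmark of the real revision corpus:

```shell
# Compare one gzip stream over 50 consecutive "revisions" of a page with
# compressing each revision on its own. Later revisions can back-reference
# earlier ones through the sliding window, so the joint stream is far smaller.
set -e
cd "$(mktemp -d)"
base=$(seq 1 400)                                  # ~1.5KB of fake article text
for i in $(seq 50); do
  printf '%s\nedit number %s\n' "$base" "$i" > "rev$i.txt"
done
cat rev*.txt | gzip -9 | wc -c                     # joint stream: small
for f in rev*.txt; do gzip -9c "$f"; done | wc -c  # per revision: much larger
```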
[19:18:27] yes [19:18:28] and /hour for the weekly one [19:18:47] of course, that's just enwiki [19:19:11] in production the Parsoid service gets something like 50/second from edits *and* template updates combined [19:19:23] with template updates heavily outnumbering the edits [19:19:35] it's another 50/s or so for the next top 10 [19:20:19] gwicke: most people don't use VE [19:20:29] that doesn't matter to Parsoid [19:20:37] VE usage does not show up in our graphs [19:20:38] greg-g: hi, is sam or you doing any deployments now? [19:20:41] I want to push 22 out [19:20:47] greg-g: hi, is sam or you doing any deployments now? [19:20:55] we track all edits and template / file updates from all wikipedias [19:20:56] oh, right, it needs to purge caches and such [19:21:24] what's the difference between edits and submits? [19:21:32] maybe graphite actually measures re-parses? [19:21:59] that would roughly fit with the numbers we see [19:22:24] asher got his numbers from a db query [19:22:33] so it only included saved edits [19:23:14] the subtitle says Edit and Edit+Submission Requests [19:23:24] so it definitely measures more than just new revisions [19:23:50] Ryan_Lane: hitting 'preview' is a submit [19:24:11] ah, ok [19:24:33] not sure if simple requests to action=edit are counted too [19:24:45] (just opening the edit form) [19:25:17] I was wondering the same [19:25:51] (03CR) 10Ryan Lane: [C: 032] Maintain repositories on deployment server [operations/puppet] - 10https://gerrit.wikimedia.org/r/91319 (owner: 10Ryan Lane) [19:26:10] !log deploying change 91319 to git deployment system [19:26:12] oh oh, cassandra on xenon seems to be dead [19:26:19] Logged the message, Master [19:27:50] ha [19:28:08] java.lang.OutOfMemoryError: Java heap space [19:28:29] heh [19:28:30] did we install jna? 
[19:28:51] well, you may need to tweak the jvm's options [19:29:00] to allow a larger amount of heap space [19:29:20] *nod* [19:29:23] jna is installed [19:30:27] seemed to run out of heap while doing a large compaction [19:30:57] Ryan_Lane: that cassandra link is interesting [19:31:11] Aaron|home: that entire series is oh so awesome [19:31:13] and depressing [19:31:44] hey bblack, mark & paravoid, could you confirm that adding another cookie to mobile in https://gerrit.wikimedia.org/r/91401 will not break anything, please? [19:32:54] !log upgrading Zuul on gallium to add a debug statement (tag: wmf-deploy-20131023 commit: 1e3adfd) See https://www.mediawiki.org/wiki/Zuul#upgrading for upgrade procedure used. [19:33:07] Logged the message, Master [19:33:49] Aaron|home, Ryan_Lane: interesting that there are issues with the paxos implementation [19:34:15] I think our use case should be fine here [19:34:22] we're doing a very basic key/value [19:34:23] I feel like crap, going to sleep early [19:34:24] sorry :) [19:34:26] we don't plan to use that yet, but if we wanted we'd have to do some more testing [19:34:26] where the keys are unique [19:34:35] paravoid: press the quit button right now! :-] [19:34:40] have sweet dreams [19:34:41] and we're not doing updates [19:34:45] ciao [19:34:49] paravoid: see ya! [19:34:54] paravoid: good night! [19:35:03] feel better paravoid [19:35:11] so we should be able to handle partitions without data loss [19:35:30] !log Zuul: restarting to apply upgrade. [19:35:37] at least based on our discussions, that's the impression I got :) [19:35:46] Logged the message, Master [19:37:05] Ryan_Lane: yes [19:37:13] we don't have the lost writes issue [19:38:07] good :) [19:38:45] the mongo article is just full of terrifying info [19:38:57] the redis cluster one too [19:39:31] redis cluster is still beta-ish isn't it? 
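For reference, the JVM heap tweak suggested above for the xenon OOM goes in cassandra-env.sh (shipped as /etc/cassandra/cassandra-env.sh by the Debian packages of that era), which otherwise auto-sizes the heap from system RAM. The values below are illustrative, not a recommendation:

```shell
# Fragment of cassandra-env.sh: pin the heap instead of using the
# auto-computed default. Set both together; the stock file suggests a
# HEAP_NEWSIZE of roughly 100MB per physical CPU core.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"
```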
[19:39:34] riak actually looks good, assuming you're using CRDTs [19:39:55] Aaron|home: yes [19:40:07] the series was based on the design of them, though [19:40:24] and as designed redis cluster is unusable for many things [19:43:44] just starting the node on xenon after running out of heap was not successful [19:44:02] I did not try very hard to fix the issue and instead re-joined with empty data [19:44:17] it is now pulling the data from the other replicas [19:44:38] (03CR) 10Ryan Lane: [C: 032] Namespace mediawiki repos for deployment [operations/puppet] - 10https://gerrit.wikimedia.org/r/91327 (owner: 10Ryan Lane) [19:45:17] (03CR) 10Catrope: [C: 031] Enable VisualEditor for NS_FILE, NS_HELP, NS_CATEGORY [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90923 (owner: 10Jforrester) [19:45:26] Ryan_Lane: riak is missing column storage and compression [19:45:36] (03PS2) 10Catrope: Enable VisualEditor for NS_FILE, NS_HELP, NS_CATEGORY [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90923 (owner: 10Jforrester) [19:45:46] (03CR) 10Catrope: [C: 031] Enable VisualEditor for NS_FILE, NS_HELP, NS_CATEGORY [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90923 (owner: 10Jforrester) [19:45:47] gwicke: oh, I didn't mean as a use for our use-case [19:45:52] (03PS3) 10Catrope: cawiki: Enable VisualEditor for Portal: and Viquiprojecte: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91197 (owner: 10Jforrester) [19:45:59] I just mean from the not losing data during partitions perspective [19:46:04] and being a reliable system [19:46:05] ah, k [19:46:39] (03PS3) 10Catrope: enwiki: Enable VisualEditor for Portal: and Book: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91198 (owner: 10Jforrester) [19:47:15] Ryan_Lane: the test in the article seems to be done with 2.0.0- 2.0.1 had a lot of fixes [19:47:28] * Ryan_Lane nods [19:47:37] for which parts, though? 
[19:48:00] some of it is based on cassandra being lww [19:48:18] so it's based on the design and not on bugs [19:48:28] that's only if you don't use CAS [19:48:37] and then the problem is in the way you use it really [19:48:55] expectation mismatch [19:49:18] but writes being lost with CAS should not happen [19:50:12] greg-g: ? [19:51:21] (03PS4) 10Catrope: cawiki: Enable VisualEditor for Portal: and Viquiprojecte: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91197 (owner: 10Jforrester) [19:51:22] (03PS4) 10Catrope: enwiki: Enable VisualEditor for Portal: and Book: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91198 (owner: 10Jforrester) [19:51:23] (03PS3) 10Catrope: Enable VisualEditor for NS_FILE, NS_HELP, NS_CATEGORY [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90923 (owner: 10Jforrester) [19:51:59] Ryan_Lane: the two linked paxos-related bugs are both fixed in 2.0.1 [19:52:15] Can anyone point me to a good example of a repo that has a nice layout for a php app that would be deployed by Sartoris? [19:53:14] I'm starting a sort-of-from-scratch project that will be lamp stack (but not based on mediawiki) that will probably eventually be deployed to the prod cluster in some way [19:53:40] !log all mw Jenkins jobs now have $wgShowExceptionDetails enabled \O/ {{bug|55595}} [19:53:58] Logged the message, Master [19:54:05] bd808: well, for that it just needs to deploy the code and apache will point at it [19:54:52] Ryan_Lane: Any preferred directory layout to separate web root from libs and config? [19:55:47] well, config should surely be a separate repo [19:55:53] libs can be submodules [19:56:04] web root could probably live in config, if it's not part of the app [19:56:46] I'll have an index.php that acts as router, some static media and then "real app code [19:57:07] why would that live in the web root? 
[19:57:29] it's easier to alias that stuff in [19:57:49] I feel that basically nothing should live in /var/www [19:57:57] Sure [19:58:00] unless it's shared between multiple applications [19:58:21] Maybe I'm using terms differently [19:58:41] if those things are going to be custom to your install, then they should likely go in config [19:58:48] into a separate directory that can be aliased in [19:59:03] otherwise they should be part of the app repo (and should be aliased in) [19:59:14] <3's aliases [20:00:24] yurik_: not me [20:00:35] I'm imagining that a repo mirrors on server layout. Config provides an apache.conf file. I'm wondering if there is a preferred layout for the "app" root directory with common naming for the staic files dir vs the php code dir etc [20:00:40] uuuggghhh. file.symlink doesn't exist in our version of salt :( [20:00:47] ok, i will go ahead in 10ish min unless anyone stops me :) [20:01:05] s/staic/static/ [20:01:07] bd808: apache config should be in puppet [20:01:14] I guess it doesn't have to be... [20:01:18] Ryan_Lane: Agreed [20:01:33] I guess we actually deploy the apache config for MW [20:01:41] (03CR) 10Mark Bergsma: [C: 031] Changing eqiad upload cache ganglia data sources to cp1048 and cp1061 [operations/puppet] - 10https://gerrit.wikimedia.org/r/91420 (owner: 10Cmjohnson) [20:01:50] and puppet tells apache to point at that config [20:01:58] so maybe it makes sense to keep the apache config with your app [20:02:27] (03PS1) 10Mark Bergsma: Send OC mobile traffic to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/91490 [20:02:41] app/etc, app/htdocs, app/lib ? 
[20:02:56] (03CR) 10Mark Bergsma: [C: 032] Send OC mobile traffic to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/91490 (owner: 10Mark Bergsma) [20:03:22] well, apache config should likely go in the config repo [20:03:43] if the static stuff is always the same, it could just be app/static [20:03:59] if it's deployment specific it should likely go in the config [20:04:39] Ryan_Lane: makes sense so far. [20:06:02] hm. I guess I'll spend the day upgrading salt [20:08:59] (03PS1) 10Ori.livneh: Remove Kraken-specific varnishncsa instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/91492 [20:09:13] Ryan_Lane: hi! I am looking at puppet to add a repo for Sartoris, thanks for the config has! much easier to understand what needs to be added [20:09:17] ^ mark, that one's for you. it's the sort of change you like, i think :P [20:09:36] Ryan_Lane: can we get a grain that define servers in production AND in labs? I get Jenkins slaves in a labs project :] [20:10:01] ori-l: yeah but... that doesn't remove it from the servers does it [20:10:04] syncing zero on 22 [20:10:33] mark: oh. hrm, I'll amend. [20:13:44] mark: i gotta run, but I think what i'll do is fix the varnish::logging define to be ensure => absent -able. [20:13:54] yep [20:14:53] !log yurik synchronized php-1.22wmf22/extensions/ZeroRatedMobileAccess/ [20:15:06] Logged the message, Master [20:16:45] i did a lightning deploy too [20:16:51] of mobile caches in ulsfo [20:17:34] heh [20:18:51] (03PS1) 10Hashar: sartorize contint scripts for jenkins slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/91493 [20:19:29] mobile can't complain about caching capacity now ;) [20:19:38] (not that they did) [20:20:45] congrat mark :] [20:21:27] I am wondering how hard it would be to disseminate a bunch of caching DC like that around the world. 
seems it is not "too" much of a hassle [20:23:11] yeah [20:23:16] it's getting easier [20:34:05] Ryan_Lane, re https://gdash.wikimedia.org/dashboards/editswiki/: the number of submits (red line) fits fairly well with the edit numbers we are seeing [20:34:14] and with asher's stats [20:34:24] overlooked that line at first [20:34:53] hashar: did you look at the updated docs? [20:35:09] hashar: now you only need to add the hash. the deployment system will even clone your repo automatically [20:35:17] and set up the proper permissions and ownership [20:35:25] and add the prefix to git deploy's config [20:35:38] Ryan_Lane: was quickly looking at https://wikitech.wikimedia.org/wiki/Sartoris [20:35:49] and basically copy pasted from the example :] [20:35:50] though I'm currently fixing an issue [20:35:55] salt needs to be upgraded [20:36:04] I'm using a feature that's in a newer version of salt [20:36:04] i submitted a change for you at https://gerrit.wikimedia.org/r/91493 [20:36:14] the commit summary has a bunch of questions :] [20:36:31] you could reply in the commit summary diff, I will be happy to update [[Sartoris]] article [20:36:57] if I manage to get that simple repo added, I will have a look at migrating Zuul to that [20:37:22] (Zuul being an unpackaged python daemon + bunch of python dependencies :D ) [20:37:51] hashar: the same salt master isn't used for labs and production [20:38:04] so though they have the same grain, it won't target both labs and production [20:38:18] gotta duplicate the config on the labs salt master so ? [20:38:27] orrrr [20:38:46] salt::master::prod salt::master::labsandprod salt::master::labs [20:38:48] evil [20:39:00] well, I'm saying that doing a deploy from production wouldn't also deploy to labs [20:39:17] we are using labs to play with the browser tests triggered from Gerrit/Zuul [20:39:20] maybe we could do salt master chaining, but that's rough [20:39:33] but can still be automated => true on both salt masters ? 
[20:40:10] what are you trying to accomplish with 'automated' ? [20:40:33] I merely want to get integration/jenkins.git checked out and maintained up to date automatically on all Jenkins slaves (prod + labs) [20:40:36] whenever a change is merged [20:40:54] I am currently achieving that with contint::slave-scripts which is a git::clone { integration/jenkins: ensure => latest } [20:40:58] * Ryan_Lane nods [20:41:02] which works [20:41:14] but there is a X hours delay before puppet kicks off on all slaves [20:41:16] and [20:41:20] git::clone is evil :] [20:41:55] but when I needed it, I could not invest time figuring out Sartoris, so that is merely a temporary hack [20:42:33] why does this need to be in /srv/slave-scripts, rather than /srv/deployment/slave-scripts? [20:42:46] why not ? :-D [20:43:13] that was a lame attempt to ask whether we had a setting to use a different path on master and minions [20:43:25] we do not [20:43:26] I can get them in /srv/deployment/slave-scripts [20:43:31] by design [20:43:46] just have to update all jenkins jobs / shell scripts. That is not an issue. [20:43:59] so that it's easy to know where something is being deployed [20:44:09] maybe I'll change my mind about that at some point [20:44:21] Ryan_Lane: that is person number 3 asking [20:44:26] akosiaris: :) [20:44:39] I wonder when you'll give up :-) [20:44:43] maybe I will hack it up :-] [20:44:49] well, I have to actually implement that feature [20:44:52] akosiaris: when mark asks :-] [20:45:01] lol [20:45:03] and it complicates things [20:45:14] more seriously, I don't care, but I think that would be an interesting feature to have [20:45:31] I must say i get Ryan_Lane's point though [20:45:36] we can just file it as a feature enhancement in bugzilla and revisit later on [20:45:44] I was shocked at first ... after that...
it seems ok [20:45:46] for my specific need, I am fine switching to /srv/deployment [20:46:13] most software allows either configuring the position of stuff and worst scenario you symlink [20:46:22] if an application isn't written in such a way that its code can live anywhere, it's probably poorly written [20:46:22] akosiaris: have you worked on my lame servermon pull request to make it shiny for openstack folks ? :] [20:46:39] (03PS1) 10Andrew Bogott: Add install and upstart for proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/91499 [20:46:41] hmmm that last sentence is syntactically wrong [20:46:50] hashar: actually yes [20:47:15] (03CR) 10jenkins-bot: [V: 04-1] Add install and upstart for proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/91499 (owner: 10Andrew Bogott) [20:47:16] I spent a whole bunch of time today chasing down distutils, distribute, setuptools and distutils2 [20:47:19] and then pbr [20:47:35] and some other frameworks and when i did [20:47:38] ah I feel sorry :/ [20:47:43] python setup.py install and got [20:47:55] /usr/bin/python: No module named pip [20:47:55] error: /usr/bin/python -m pip.__init__ install 'pbr' 'Django' 'south' 'whoosh' 'ipy' returned 1 [20:48:03] I decided i needed a break [20:48:16] dohh [20:48:16] cause distribution of modules/packages in python sucks [20:48:24] so... [20:48:26] (03PS2) 10Andrew Bogott: Add install and upstart for proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/91499 [20:48:48] akosiaris: luckily, I have a pbr package pending :] [20:48:50] for Precise [20:48:56] that thing breaks while setup.py downloads pip and pbr in my working dir [20:49:36] any idea why it decided to download those two eggs in my working dir ? [20:51:15] noooo clue :-( [20:51:19] btw... pbr was written by the openstack folks right ?
[20:51:29] at least that is what pypi says [20:51:54] Ryan_Lane: if you could reply on https://gerrit.wikimedia.org/r/#/c/91493 that would be nice, will follow up tomorrow :-] [20:52:01] i am now reading their documentation to at least understand what that does so maybe I can wrap it up at some point [20:52:02] akosiaris: yeah [20:52:09] by the folks in #openstack-infra [20:52:13] hashar: well, this isn't an easy request [20:52:22] I actually don't know what you're trying to do [20:52:35] I think they wanted something having a bunch of default conventions and make it very easy to create the setup stuff [20:53:48] Ryan_Lane: well I described the need in the commit summary. Roughly: get integration/jenkins.git repo fetched somewhere (i.e. /srv/deployment/slave-scripts) on some prod machine and labs instances. [20:54:13] Ryan_Lane: and have sartoris/salt/whatever automatically update the minions whenever a change is merged :-] [20:55:48] are the labs instances all in the same project? [20:55:55] yup 'integration' [20:56:16] don't you have something like deployment-labs::target ? [20:56:17] the problem here is that production salt can't talk to labs [20:56:24] on purpose [20:57:14] akosiaris: maybe python -v -m pip ? [20:57:16] you could pull directly from gerrit, rather than from a deployment system [20:57:22] akosiaris: that should trace the path lookups [20:57:29] hm. am I allowing a url param in the config? [20:57:31] * Ryan_Lane checks [20:57:57] ah.
I'm not allowing a url param in the config [20:58:13] so, if I allow a url param in the config, you can set the url for the repo to be gerrit [20:58:31] then you could use yuvi's bot to watch gerrit [20:58:43] yuvi's bot could trigger a deployment [20:58:47] hashar: I know why it does not work [20:58:47] no way [20:58:56] I can do that with Jenkins directly :-] [20:58:59] it is because the egg itself was downloaded into my workdir [20:59:03] oh [20:59:04] right [20:59:06] pip-1.4.1-py2.7.egg [20:59:20] the reason I am using the labs instance is to make sure we do not depend on anyone else :-] [20:59:29] with the EGG-INFO and pip directory inside [20:59:41] depend on anyone else? [20:59:45] Ryan_Lane: can't the salt master in labs handle the deployment stuff ? [21:01:12] yes, but how are you going to talk to it? [21:01:24] I guess I could write a runner for this [21:01:26] to chain the masters [21:01:42] ahh [21:01:54] that is because the Sartoris stuff is only in production is it ? [21:02:01] (03Abandoned) 10JGonera: Update MobileWebEditing schema revision [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90451 (owner: 10JGonera) [21:02:02] so it only talks to the salt master in prod [21:02:18] no [21:02:26] it's because there's a salt master for production [21:02:29] and a salt master for labs [21:02:42] but only one Sartoris install / daemon ? [21:02:43] and you can't directly talk to salt minions in labs from production [21:03:05] but you can talk to the labs salt master from the production salt master [21:03:12] and have it make an additional call [21:03:39] nice [21:03:40] the labs minions also can't report back to redis in production [21:03:46] so that's a problem too [21:04:07] and what does automated => true do ? [21:04:31] is that something to poll the gerrit repo and trigger a 'git deploy' whenever something is merged ?
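An editor's aside on the packaging thread above: the `setup.py install` failure ("No module named pip" while a stray `pip-1.4.1-py2.7.egg` sits in the working directory) is most likely the setuptools `setup_requires` behavior, which fetches dependency eggs into the build directory instead of installing them. A quick diagnostic (modern Python 3 idiom, not what was available in 2013; the module names are just examples) for checking where the interpreter actually resolves a module from:

```python
# Show the file a module would be loaded from, to spot a stray egg or
# checkout in the working directory shadowing (or failing to provide)
# the system copy of a package.
import importlib.util


def where(mod_name):
    """Return the path a module resolves to, or None if it can't be found."""
    spec = importlib.util.find_spec(mod_name)
    return spec.origin if spec else None


print(where("json"))  # stdlib module: prints its __init__.py path
print(where("no_such_module_xyz"))  # prints None
```

Running this from inside and outside the directory containing the egg shows whether `sys.path` (which starts with the current directory for scripts) is picking up the wrong copy.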
[21:04:33] that says something else is going to handle the deployment steps [21:04:45] ohhhh [21:05:15] I was assuming some daemon was polling the repo and would do the deploy for us [21:05:20] no [21:05:23] so such a system could have been installed in labs as well [21:05:26] done :] [21:05:30] wrong assumption [21:05:41] yeah, it's really there for dependency repositories [21:05:49] where one repository is triggering the deployment of another one [21:06:08] so I will still have to manually run git-deploy on tin to push the slave-scripts to the production minions [21:06:09] so, you could host a redis instance in integration [21:06:19] have the minions there return to that redis instance [21:06:19] and manually run git-deploy in labs for the labs minion [21:06:29] point the url of the repo to gerrit [21:06:50] and have jenkins do a chained deployment where it'll have production salt also run a command through labs salt [21:07:14] then you can read the status back through redis [21:07:32] okkk that is a lot more complicated than I was expecting :] [21:07:52] this is going to require some changes to git-deploy [21:07:57] why on earth do I always end up in a tricky use case [21:08:07] I wasn't really at a stage where automated deployment was possible [21:08:15] I've been working on our current use case [21:08:50] i am not blaming you :] [21:09:02] not saying you are ;) [21:09:09] just explaining the situation :) [21:09:13] making sure you are not thinking I am :D [21:09:57] hm. pointing at gerrit isn't amazingly straightforward either [21:10:10] (03CR) 10Hashar: [C: 04-1] "On hold, I had a quick chat with Ryan that clarified my question.
Will have to think about it a bit :]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91493 (owner: 10Hashar) [21:10:11] because the minions need to read a config file to determine which tag it's going to check out [21:10:28] and it assumes it's on the deployment server [21:10:40] what I was expecting from automated => true is that something listens for Gerrit, detects a merge, git pull, git deploy [21:10:58] I could extend fetch and checkout to also accept a tag parameter [21:10:59] and I was also expecting the salt master in prod to be able to talk to instances :D [21:11:22] (03PS1) 10MaxSem: Remove b/c IPv6 forms [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91505 [21:11:26] well, getting prod to chain its calls to labs is doable [21:11:43] and listening for Gerrit merges is as well :-D [21:11:47] \O/ [21:12:01] yeah, all of this stuff is on my roadmap [21:12:17] 'mediawiki' => automated: true. Then whoever +2s a change in mw/core would deploy automatically on the cluster [21:12:22] but the priority is simple use of the deployment system, then mediawiki [21:12:25] then automated deploy [21:12:42] yeah that makes sense [21:12:44] automated deploy is.... difficult [21:12:48] because it's two-phase [21:12:58] if the fetch stage fails on some minions you need to handle that somehow [21:13:03] and make decisions on what to do [21:13:17] do you continue if it's < x% of minions? [21:13:22] do you accept no failures?
[21:13:31] yeah we have no clue yet [21:13:42] but that is definitely the evil plan / goal for later [21:13:57] in my world we'd accept < x% and depool the ones that failed [21:14:08] and send an alert [21:14:23] unless you have less than X servers remaining [21:14:30] yeah that would be nice [21:14:33] well, x% would be a total threshold [21:14:36] not per deployment [21:14:48] much better than the current dsh drama and the sync-common before apache restart [21:14:58] yeah [21:15:05] it just fails in that situation [21:15:21] when you're doing a manual deploy you have the choice of continuing or stopping [21:16:25] thanks for the clarification :] [21:16:29] will dig a bit around the code [21:16:31] yw [21:16:33] and think about something [21:16:44] daughter crying so it is dad hat now :] Have a nice afternoon! [21:16:55] !log upgrading salt to 0.17.1 [21:17:13] Logged the message, Master [21:18:38] Reedy: I'm thinking of fixing https://bugzilla.wikimedia.org/show_bug.cgi?id=31068 - any suggestions? [21:22:14] (03PS1) 10Hashar: resync with upstream v0.7.0 [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/91506 [21:22:22] (03PS3) 10Andrew Bogott: Add install and upstart for proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/91499 [21:27:12] (03PS1) 10Kaldari: Moving eventLogging schema settings from config to extension. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91509 [21:29:13] heh. whoops. kind of overwhelmed brewster. forgot to batch the salt upgrade in production [21:29:48] bleh.
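An editor's aside: the two-phase deploy policy sketched in the exchange above ("accept < x% failures, depool the ones that failed, send an alert, unless you have less than X servers remaining") can be written down as a few lines of decision logic. This is a hypothetical sketch, not git-deploy's actual code; the function name, defaults, and result shape are all invented for illustration:

```python
# Hypothetical sketch of the fetch-phase failure policy discussed above:
# tolerate up to max_fail_pct of minions failing the fetch, depool the
# failures and alert, but abort the whole deploy if too few would remain.
def decide_deploy(results, max_fail_pct=5.0, min_remaining=10):
    """results: dict of minion name -> True (fetch ok) / False (fetch failed).

    Returns ("continue", failed_minions) or ("abort", failed_minions).
    Assumes results is non-empty.
    """
    failed = [minion for minion, ok in results.items() if not ok]
    remaining = len(results) - len(failed)
    fail_pct = 100.0 * len(failed) / len(results)
    if fail_pct >= max_fail_pct or remaining < min_remaining:
        return ("abort", failed)
    # caller depools `failed` and sends an alert, then runs the checkout phase
    return ("continue", failed)
```

Note the "total threshold, not per deployment" remark in the log: in that design `max_fail_pct` would be computed against all pooled servers, not just the minions targeted by one deploy.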
0.17.1 isn't reporting pkg upgrades properly in its output [21:32:25] (03PS1) 10Jdlrobson: Move all MobileFrontend EventLogging rules into MobileFrontend [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91511 [21:49:04] (03PS1) 10Ryan Lane: Ensure a specific version of salt [operations/puppet] - 10https://gerrit.wikimedia.org/r/91517 [21:49:17] (03Abandoned) 10Kaldari: Moving eventLogging schema settings from config to extension. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91509 (owner: 10Kaldari) [21:50:07] (03PS2) 10Hashar: resync with upstream v0.7.0 [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/91506 [21:51:02] (03PS4) 10Andrew Bogott: Add install and upstart for proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/91499 [21:54:31] can somebody `echo $wgLocalInterwiki` on en.wp real quick for me? [21:54:36] is it 'en' or 'w'? [21:57:22] (03PS2) 10Ryan Lane: Ensure a specific version of salt [operations/puppet] - 10https://gerrit.wikimedia.org/r/91517 [22:02:09] (03CR) 10Ryan Lane: [C: 032] Ensure a specific version of salt [operations/puppet] - 10https://gerrit.wikimedia.org/r/91517 (owner: 10Ryan Lane) [22:07:23] MatmaRex: looking [22:07:47] [22:07][hashar@fenari(mw-inst):~]$ mwscript eval.php --wiki=enwiki [22:07:47] > return $wgLocalInterwiki [22:07:48] en [22:07:50] MatmaRex: en :D [22:07:58] hashar: i'm trying to figure out why https://en.wikipedia.org/wiki/w:a is a bad title [22:08:06] while e.g. 
https://en.wikipedia.org/wiki/en:a is okay [22:08:12] they are both local (aka forwarded) [22:09:02] must be the interwiki table I guess [22:09:25] i think we have a maintenance script to dump them [22:09:32] according to https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&meta=siteinfo&format=json&siprop=general%7Cnamespaces%7Cnamespacealiases%7Cinterwikimap and https://en.wikipedia.org/wiki/Special:Interwiki these are exactly the same [22:09:53] does something detect the redirect loop that would happen otherwise, or what? [22:10:05] (this is bug https://bugzilla.wikimedia.org/show_bug.cgi?id=12330 btw) [22:11:05] (03PS5) 10Andrew Bogott: Add install and upstart for proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/91499 [22:15:42] MatmaRex: sorry can't really look [22:15:45] midnight there :( [22:15:51] i am just having fun with jenkins :D [22:15:56] in bed [22:15:59] (but wearing pants) [22:16:24] midnight here, too [22:16:29] best time for coding [22:16:31] * MatmaRex digs [22:16:38] has pybal really not been updated in over a year? [22:16:44] https://git.wikimedia.org/tree/operations%2Fdebs%2Fpybal.git [22:18:51] hashar: ah, i think i see it. [22:18:59] # We already know that some pages won't be in the database!
[22:19:01] if ( $this->mInterwiki != '' || NS_SPECIAL == $this->mNamespace ) { [22:19:09] turns out that's not a true assumption :D [22:23:09] (03PS1) 10Odder: (bug 31068) Configure namespaces for Azerbaijani Wikibooks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91523 [22:29:09] (03PS2) 10Awjrichards: Move all MobileFrontend EventLogging rules into MobileFrontend [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91511 (owner: 10Jdlrobson) [22:31:49] (03CR) 10Hashar: "build on jenkins, then installed it and rebuild it with itself \O/" [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/91506 (owner: 10Hashar) [22:32:29] * hashar waves [22:45:44] Anyone know what the status of using Varnish for text content is? [22:45:46] PROBLEM - MySQL Idle Transactions on db1059 is CRITICAL: CRIT longest blocking idle transaction sleeps for 635 seconds [22:47:25] csteipp: it's used on wikidata for our text [22:47:32] no idea where else [22:47:39] aude: Thanks! [22:47:56] mark would know [22:48:20] Yeah, sadly I never think of these questions at 8am :/ [22:48:46] RECOVERY - MySQL Idle Transactions on db1059 is OK: OK longest blocking idle transaction sleeps for 11 seconds [23:01:12] csteipp: wikivoyages too [23:10:41] gah, missed him [23:10:49] there's more text-varnish [23:22:28] current write throughput is around 900 revisions / second [23:24:24] csteipp: there's more text-varnish than that. [23:24:27] you need a list? 
[23:26:08] i guess it's everything but wikipedia [23:26:46] (03CR) 10Cmjohnson: [C: 032] Changing eqiad upload cache ganglia data sources to cp1048 and cp1061 [operations/puppet] - 10https://gerrit.wikimedia.org/r/91420 (owner: 10Cmjohnson) [23:26:48] idk what textsvc is [23:28:35] PROBLEM - Host amssq48 is DOWN: PING CRITICAL - Packet loss = 100% [23:28:51] 502 Bad Gateway [23:29:03] @ skwiki [23:29:15] RECOVERY - Host amssq48 is UP: PING OK - Packet loss = 0%, RTA = 89.52 ms [23:32:45] Danny_B: still? [23:33:07] jeremyb: I was just curious if we were running any full wikis on it currently. Meta was a particular issue in the past. [23:33:27] csteipp: best guess without being certain is everything but wikipedia [23:33:48] csteipp: /text-varnish/ @ manifests/lvs.pp [23:33:52] Cool. I'll check with mark in the morning too, but that confirms what I was thinking [23:40:44] jeremyb: nope
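An editor's aside on the bug MatmaRex tracked down earlier (why `w:a` is treated as a bad title on enwiki): the Title.php condition he quoted assumes any interwiki-prefixed title cannot be a local database page, which fails for prefixes like `w` and `en` that point back to the same wiki. A hypothetical Python sketch of that (wrong) assumption, not MediaWiki's actual code:

```python
# Sketch of the assumption in the quoted check:
#   if ( $this->mInterwiki != '' || NS_SPECIAL == $this->mNamespace )
# i.e. "pages with an interwiki prefix won't be in the database".
LOCAL_PREFIXES = {"en", "w"}  # per the log, both resolve to enwiki itself


def assumed_nonlocal(interwiki, is_special=False):
    """Mirror the PHP condition: nonempty interwiki or Special namespace."""
    return interwiki != "" or is_special


# Titles like 'w:a' get treated as nonexistent even though the prefix
# is local, which is the "not a true assumption" MatmaRex noted:
for prefix in LOCAL_PREFIXES:
    print(prefix, assumed_nonlocal(prefix))  # both print True
```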