[00:00:24] Lightning deploy time, going to deploy VE [00:00:44] I think gwicke wanted to deploy node_modules first? [00:01:29] Parsoid is currently fine without deploying anything [00:01:38] okay, nevermind me [00:01:42] other node services depend on binary npm libs though [00:02:05] so during a node upgrade you need to re-run npm install before restarting the service [00:02:32] best to also rm -r node_modules in my experience [00:03:02] mwalker, oops - that was my commit [00:03:15] MaxSem: ya it was; but I don't understand why it broke things [00:03:31] what's broken? [00:03:53] all writers are rendering to PDF [00:04:03] so bookcmd=render&writer=epub will still render a PDF [00:04:18] who cares about other formats, anyway?:P [00:04:20] but... bookcmd=render&writer=rdf2latex (which is our renderer) works just fine [00:04:25] !log catrope synchronized php-1.23wmf6/extensions/VisualEditor/ 'Update VisualEditor with cherry-picks' [00:04:28] apparently lots of people! [00:04:41] Logged the message, Master [00:05:09] * MaxSem tries to recall how this stuff works [00:05:47] OK I'm done [00:05:53] superm401: You had stuff for the LD window? [00:06:06] RoanKattouw, yeah, about to do it. [00:06:09] Cool [00:07:11] hmm, it seems that parsoid for officewiki is now producing 503s [00:07:43] hmm? [00:07:47] Error: tunneling socket could not be established, cause=Parse Error [00:08:03] is the URL for officewiki an https one? [00:08:05] * paravoid looks [00:08:29] yes it is [00:08:38] s/https/http/ for these 5 URLs [00:08:46] if it's https, the node requests code tries to do a CONNECT [00:08:52] when using a proxy [00:09:02] k, will prepare a patch [00:09:21] it's board, collab, office, wikimaniateam & wikitech [00:09:31] oh [00:09:33] on that note [00:09:38] wikitech won't work as it is :( [00:09:40] shit [00:09:49] I can match against the apiURI [00:09:54] (03CR) 10Qgil: "For what is worth, the logo is based on https://commons.wikimedia.org/wiki/File:Wikimedia_logo_text_RGB.svg + the "inter" string. This fav" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100326 (owner: 10Gerrit Patch Uploader) [00:09:56] wikitech doesn't run on the main cluster [00:10:11] you can't connect to api.svc and expect wikitech to work [00:10:25] hrm, minor complication [00:10:44] lookup structure for special cases? [00:10:57] I guess so [00:11:20] apiProxyURIs with an '*' entry and others by prefix [00:12:53] it's an accident that it worked now [00:13:14] wikitech used to be on linode, for example :) [00:14:06] !log mflaschen synchronized php-1.23wmf6/extensions/Thanks/ 'Deploy Thanks bugfix to 1.23wmf6' [00:14:15] You're up, RoanKattouw. [00:14:22] Logged the message, Master [00:17:07] MaxSem: careful; you might lose brain cells [00:18:12] superm401: I went before you [00:18:15] So I'm already done [00:18:59] Oh, whoops, didn't see that. Thanks. [00:21:07] paravoid: is there another service for wikitech and the other special cases that accepts proxy requests? [00:22:15] I don't think so [00:22:31] MaxSem: do you have shell on pdf[1-3]?
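A minimal sketch of the rebuild step gwicke describes above for node services that depend on binary npm modules; the service name and checkout path are assumptions, not the actual production layout:

    # after a node upgrade, binary npm modules must be rebuilt against the new ABI
    sudo service parsoid stop             # service name is an assumption
    cd /srv/deployment/parsoid/deploy     # hypothetical checkout path
    rm -rf node_modules                   # per the "best to also rm -r node_modules" advice
    npm install --production
    sudo service parsoid start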
[00:22:50] hrm [00:23:48] I'll disable the proxy for now [00:25:01] mwalker, no [00:25:08] :'( [00:25:30] my only thought at this moment is that we're missing something in the metabook format [00:25:34] somehow [00:25:36] !log reverted Parsoid proxy change as officewiki and some other https-only wikis were broken [00:25:52] Logged the message, Master [00:27:08] gwicke: everything but wikitech is easily solvable, as far as I'm aware [00:27:12] by s/https/http/ [00:27:17] paravoid: maybe we should have waited a bit with removing the randomization from the parsoid api uri [00:27:58] yeah, just better to un-break things first [00:29:40] I'll add support for completely disabling the proxy for specific wikis [00:31:45] sorry that my change was broken for wikitech :) [00:31:50] or that I did not notice https in the config [00:32:36] there seems to be a problem with securepoll [00:32:40] someone can help me? [00:34:24] Reedy ^ [00:34:48] mwalker: I just poked him in -staff :P [00:34:54] heh [00:34:55] * Vito waits [00:35:39] (03PS1) 10Ryan Lane: Add a deploy.restart runner and module call [operations/puppet] - 10https://gerrit.wikimedia.org/r/100509 [00:35:57] gwicke: ^^ [00:36:24] so, with that change you can add "parsoid" to the service_name config for the parsoid/Parsoid repo [00:36:25] Vito: who setup your securepoll? the decryption key should be held in escrow by them I think [00:37:08] mwalker: I don't know, I'm scrutineer for en.wiki's ACE2013 [00:37:12] mwalker: it was Reedy I think [00:37:14] and from tin you can call: sudo salt-call -l quiet publish.runner deploy.restart parsoid/Parsoid [00:37:24] I'll make a script to wrap that command [00:37:45] it'll batch the restarts to 5 minions at a time [00:37:57] of course, I still need to test this change in labs before I merge it in [00:38:30] Jamesofur|away: you might also be able to help with Vito's problem [00:42:46] hmm; Vito it looks like everyone who might know anything is currently away -- I would drop a message on [en:User_talk:Philippe_(WMF)] [00:43:51] maybe also to [en:User_talk:Jalexander] [00:45:41] mwalker: I'm not sure if Philippe can help us these days [00:45:49] anyway I think I'll fall asleep [00:46:22] hmm; I thought he was the keeper of all the keys [00:46:24] *shrugs* [00:46:26] Ryan_Lane, awesome, let me add that to wikitech [00:46:42] gwicke: wait a bit :) [00:46:47] I need to test and merge it in [00:46:59] Ryan_Lane, ok ;) [00:47:03] I'd like to make a wrapper for the service restart too, since that command is ugly as sin [00:49:10] (03PS1) 10Aaron Schulz: Include redis on logstash servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/100511 [00:49:22] greg-g: I know what commit is breaking collection now; but I'm not sure the best way of pinning the cluster to the commit before it (i'd rather not have to revert that commit and everything past it) [00:49:48] ideally I'd fix it; but I can't seem to find whatever the offending line is [00:49:57] ... at this particular point in time [00:53:28] Ryan_Lane, btw: https://www.mediawiki.org/wiki/Parsoid/Packaging#Option_2:_deploy_repo_with_code_as_submodule [00:54:05] ori-l: yeah it looks like the buffering is really just on the client, so I can lower that down [00:55:14] (03PS2) 10Aaron Schulz: Include redis on logstash servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/100511 [01:03:22] gwicke: hm. 
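A sketch of the wrapper script Ryan says he'll write around the salt-call quoted above; the script name and argument handling are illustrative, only the inner command comes from the log:

    #!/bin/bash
    # restart a trebuchet-deployed service from tin
    repo=${1:?usage: restart-service <repo>, e.g. parsoid/Parsoid}
    sudo salt-call -l quiet publish.runner deploy.restart "$repo"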
I'm not totally sure I understand option 2's proposal [01:03:36] I also wonder what you mean by puppet can manage the config in option 1 [01:03:49] do you mean parsoid's configuration itself, or the upstart file? [01:03:59] upstart [01:04:15] * Ryan_Lane nods [01:04:15] although potentially also localsettings.js [01:04:38] it's better to deploy localsettings.js via git-deploy [01:04:54] ops doesn't want to touch application deployment or configuration [01:05:36] ops does want to maintain upstarts and such, since they run with root privs [01:06:38] updated option 2 to include localsettings.js [01:06:39] Ryan_Lane: I actually received opposite guidance -- that I needed to template / puppetize my configuration [01:06:53] mwalker: for which application? [01:06:59] also added another option that moves the debianization to a submodule too [01:07:00] the new PDF renderer [01:07:12] bleh. who said that? [01:07:21] ori-l and paravoid [01:07:26] I wonder why [01:07:36] it's an application like anything else [01:07:41] wait, what did I say? [01:07:44] I think it makes sense to manage the config via deployment [01:08:03] it makes sense for the init/upstart script to be managed by puppet [01:08:26] paravoid: I interpreted what you said a while ago; that my configuration file for ocg-collection should be a template in a module [01:08:39] I don't remember saying that [01:08:47] and I don't think that's right? [01:08:50] this was a couple of weeks ago when I was trying to get ori to create me a /operations/config/ocg repo or some such [01:08:58] I wasn't around for that [01:09:00] heh [01:09:06] I'd definitely manage the config via deployment [01:09:08] * mwalker scrounges in log files [01:09:12] deploying localsettings like parsoid is is best imho [01:09:20] *is doing it [01:09:26] otherwise ops needs to be around to make config changes [01:09:27] i.e. git-deploy [01:09:33] which may need to occur during deployments [01:10:04] I'd like to get to the point where apps can generally be maintained by devs first, and ops second [01:10:33] ahhhh [01:11:24] it was jeremyb, ori-l, and MZ [01:11:30] an unlucky combination [01:11:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [01:12:01] apologies for the slander paravoid [01:12:01] :D [01:12:07] hehe, no worries [01:12:34] mwalker: so, yeah, making a config repo that you deploy will make things way easier in the long run [01:12:44] *nods* that's what I was thinking :) [01:12:49] paravoid: btw: https://gerrit.wikimedia.org/r/#/c/100509/ [01:12:49] my opinion for the record is that you shouldn't involve puppet in the deployment hot path [01:12:57] I had a request in for /operations/ocg-config [01:13:01] both because puppet takes a while to converge, might be broken etc. [01:13:12] how does https://www.mediawiki.org/wiki/Parsoid/Packaging#Option_2:_deploy_repo_with_code_as_submodule look to you? [01:13:14] I still need to test my change, of course [01:13:16] and because access rights for puppet are restricted to a smaller set than deployers [01:13:49] gwicke: ah.
you'd deploy all of that, and parsoid itself would be a submodule [01:13:58] so you'd do config and code at the same time [01:14:11] that's definitely an option [01:14:21] not a bad one either [01:16:41] Ryan_Lane: yup [01:18:06] I've avoided that elsewhere (like mediawiki) due to a large amount of recursive submodules [01:18:15] because it makes deployment somewhat complicated [01:18:31] I think in this case it's probably simpler [01:19:05] doing it the other way around would force us to use even more submodules [01:19:15] * Ryan_Lane nods [01:19:19] that looks like a good approach to me [01:21:14] now if a script copied in a debian dir specced by ops and built a deb from that, there would not be a need for ops involvement [01:21:38] well, for third party use you can use a launchpad ppa [01:21:53] no need for ops involvement with that [01:22:50] sure [01:22:58] I was more thinking about using it for deployment [01:23:14] so that we can get dependencies etc [01:23:32] how would you install the deb? [01:24:06] and how would it be added to the apt repo? [01:24:12] it sounds not impossible to script 'apt-get install parsoid' with a version parameter [01:24:45] when you needed to deal with dependencies, we'd need to be involved [01:24:59] ops would spec the deps in debian/ [01:25:25] so far no one is agreeing with you that this is a good idea ;) [01:25:45] (03PS1) 10Faidon Liambotis: Varnish: set backend_random for POSTs [operations/puppet] - 10https://gerrit.wikimedia.org/r/100516 [01:25:48] upgrade: 'apt-get install parsoid=0.1.33' [01:25:56] gwicke: ^^ [01:25:57] downgrade: 'apt-get install parsoid=0.1.32' [01:26:05] gwicke: what would call that? [01:26:29] and again, how would the package get into the repository? [01:26:42] Ryan_Lane, a dsh or salt script for example [01:27:05] (03CR) 10Faidon Liambotis: [C: 032] Varnish: set backend_random for POSTs [operations/puppet] - 10https://gerrit.wikimedia.org/r/100516 (owner: 10Faidon Liambotis) [01:27:14] what you're describing is a new deployment system. using debs [01:27:35] copying a file built by an ops-controlled script using ops-controlled debian stuff to the repo is relatively simple [01:27:36] and as I mentioned that's a shitload of work for not very much gain [01:27:40] no. it's not [01:27:55] cp a b [01:28:17] 1. it's a reprepro call, and it needs to happen on a specific system [01:28:23] 2. the package needs to be built somewhere [01:28:30] where does 2 occur? [01:28:32] jenkins? [01:28:36] we don't really trust jenkins [01:28:55] I'd think the deployer triggers the build and increments the version [01:28:57] then, a dsh or salt script needs to be created [01:29:05] and it needs to be callable by deployers [01:29:59] this is way more complex than you're thinking it is [01:30:11] and it gains us very little [01:30:29] why don't we look at how we can handle dependencies in the current deployment system, rather than rewriting it from scratch? 
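A rough sketch of option 2 as agreed above, with the code pinned as a submodule of the deploy repo; $PARSOID_GIT and the file names are assumptions for illustration:

    git init parsoid-deploy && cd parsoid-deploy
    git submodule add "$PARSOID_GIT" src     # Parsoid code repo, pinned to one commit
    cp ~/localsettings.js .                  # config travels with the deploy repo
    git add localsettings.js
    git commit -m 'Pin Parsoid revision and config together'
    # a deploy then ships config and the pinned code in one step,
    # and a revert rolls both back together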
[01:30:53] well, we would not have to write another deploy system for others [01:31:04] could use proper dependencies [01:31:17] and generally help third parties [01:31:49] you're taking something that's as simple as deploying source and turning it into something that builds a binary, injects it into an apt repo, then requires apt-get update && apt-get install [01:31:57] the amount of scripting does not sound prohibitive [01:32:08] reprepro doesn't support multiple versions of a package in a repo [01:32:13] so we'd need to replace reprepro too [01:32:50] and you'd still need to deploy the codebase to a system via git first, too :D [01:33:01] most deb repos seem to have multiple versions of the same package, I wonder what they are using [01:33:02] to build the deb [01:33:10] fwiw, as much as I love .debs obviously, I think they're unsuitable for an agile deployment workflow [01:33:38] I just think it's an overcomplicated solution to deployment [01:33:54] salt has the ability to manage packages [01:34:02] we could specify dependencies in the repo config [01:34:24] paravoid, which issue do you see that would make agile hard? [01:34:35] the need to increment the version number? [01:34:39] (and multiple versions of the same package in the same suite for the same architecture is not something that Debian or its tools do in general) [01:34:58] paravoid: ubuntu PPAs do it [01:35:08] the cassandra repo for example has all versions since adam and eve [01:35:18] and they add a new one every two weeks or so [01:35:55] anyway, let's look at what you need and solve the problem using the system we have [01:35:56] building their deb from git is a single line [01:35:59] http://www.apache.org/dist/cassandra/debian/dists/20x/main/binary-amd64/Packages [01:36:02] rather than rearchitecting it [01:36:03] has just one version [01:36:28] a PPA for third parties, and git-deploy for us is a simpler solution [01:36:40] they have different suites for 1.2.x etc. [01:37:05] it might have been the datastax repo that had all of them [01:37:16] anyway, the whole "modify source, rebuild deb, put in apt, upgrade packages" workflow isn't great [01:37:24] it's not just the version [01:37:26] it's too messy [01:37:44] from a security perspective especially [01:37:46] it's also going to be slow, which is going to fuck us if we need to revert quickly [01:38:03] and it's a lot of places for the system to break [01:38:05] but in general, it feels overcomplicated for something that can just be a simple "rsync code" process [01:38:15] s/rsync/git push/ or whatever [01:38:22] "copy files" if you want [01:38:27] * Ryan_Lane nods [01:38:27] http://debian.datastax.com/community/pool/ [01:39:03] gwicke: so, as I asked earlier. what's the problem you're trying to solve?
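For concreteness, the deb-based flow gwicke is proposing above (and which the rest of the channel argues against), sketched under the assumption of a 'parsoid' package with the versions quoted in the discussion; the build and publish steps are hypothetical:

    dpkg-buildpackage -us -uc      # deployer builds the deb from the repo
    # ...inject the package into the apt repo on the repo host (e.g. reprepro includedeb)...
    # then upgrade (or downgrade) the cluster in small batches:
    salt -b 5 -G 'deployment_target:parsoid' cmd.run \
        'apt-get update -qq && apt-get install -y parsoid=0.1.33'

As paravoid and Ryan point out, this adds a build, a repo injection, and an apt round-trip to what git-deploy does with a single code push, which is why the revert path is slower.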
I'm more than happy to add features or make changes to trebuchet [01:39:19] if it's dependencies, that's likely a solvable problem [01:39:38] services won't usually be a single repository [01:40:22] I'm running fever and it's 3:40am, so if you'll excuse me :) [01:40:28] paravoid: yeah, go to bed :D [01:40:33] I'd be happy to discuss this via mail, if you need my opinion, fwiw [01:40:49] ouch, that sounds like bed time [01:41:11] gwicke: right, so make multiple repos and deploy them separately, or have a combined one using submodules [01:41:15] paravoid: k, thx [01:41:50] I still do think .debs are great to publish for use by third parties [01:41:52] Ryan_Lane, I'm a bit wary about duplication of effort [01:42:01] restart the services via the deploy.restart command I just pushed [01:42:04] we'll have to do some proper packaging anyway [01:42:21] so to me it looks attractive to save the effort to do this twice [01:42:23] maybe tag releases every now and then (2 weeks sounds fine, less often even better I think) [01:42:49] gwicke: lots of upstreams make packages for third parties and deploy from git [01:42:54] if I was a third party admin that wanted to just install a wiki and apt showed updates every tuesday and thursday I'd very annoyed [01:42:59] openstack is a great example of this [01:43:00] I'd be* [01:43:10] rackspace and a number of other public clouds use openstack from git [01:43:16] third parties use debs [01:43:17] Ryan_Lane, I guess that is what unstable vs. stable is for [01:43:34] eh? what do you mean? [01:43:55] stable vs. unstable repo [01:43:57] when I say they use git, I mean they don't use debs at all [01:44:13] re third parties only wanting major releases [01:44:18] and there's only one repo. they use branches and tags for releases [01:44:26] err [01:44:29] one repo per project [01:44:45] each project does releases via branches and tags [01:44:51] stable releases are branches [01:45:02] so that security and major bugs can be backported [01:45:11] debs are generated from it [01:46:11] setting up a new service-based MW system using git only won't be very convenient [01:46:17] so IMO we should do nice packaging for that [01:46:28] for third parties it won't be very convenient [01:46:34] for us it is very convenient [01:46:58] you know we moved away from using debs for this kind of stuff, right? [01:47:04] I mentioned this earlier [01:47:10] whether that is also useful for our own service deploys remains to be seen IMO [01:47:35] we've been systematically killing off any "configuration" packaging [01:48:07] I don't think that the distribution mechanism is that important [01:48:40] it matters more for near-atomic deploys [01:49:03] atomicity isn't the key. speed of deploy is [01:49:11] and ability to quickly revert [01:49:18] especially the ability to quickly revert [01:49:46] also, far less people understand how to create debs [01:49:52] so realistically it'll be on ops to do so [01:50:13] yeah, but I don't think that we can get away without packaging anyway [01:50:35] for third party use, yeah [01:51:23] though we've avoided it with MW for ages now [01:52:34] anyway, let me know what you need changed in the deployment system to make this work like you need [01:53:11] unless we decide to ignore third party users [01:53:11] for PHP code I agree that unpacking a huge deb would be too slow [01:53:12] installing something like parsoid from a deb otoh is a different animal [01:53:12] a second maybe?
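A sketch of the branch-and-tag release flow Ryan describes above, as OpenStack-style projects do it; the branch and tag names are illustrative:

    git checkout -b stable/0.1 master    # stable branch, so security/major fixes can be backported
    git tag v0.1.33                      # tagged release; third-party debs/PPA builds come from tags
    git push origin stable/0.1 v0.1.33
    # production deploys track git directly; packages are a by-product for third parties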
[01:53:12] most of the time will be waiting for the restart [01:53:14] seeing as that it took a shitload of political capital to make a new deployment system I think it's unlikely you'll convince anyone to make a debian based one ;) [01:53:39] (03CR) 10GWicke: "Nice!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/100516 (owner: 10Faidon Liambotis) [01:54:26] but really, let me know what's missing from the current system [01:54:45] and I'll see about adding it [01:55:15] if the major issue is dependencies, I'll look at how to handle it [01:55:59] you should give me an example of a dependency that needs to be handled and the way in which it needs to be handled [01:56:11] it looks like we'll handle dependencies between our own code with subrepos, and to me that is fine [01:56:27] right, system level dependencies are another beast, though [01:56:39] coordinating something like node upgrade and node_modules would be nice to include in a deployment system [01:57:18] Selective deployment for testing? [01:57:19] especially when going for automated canary stuff and staggered service restarts [01:57:30] RoanKattouw: yeah, I talked about adding a canary option [01:57:56] and I just pushed in a change for staggered service restarts ;) [01:58:51] ideally the deployment system would also manage pooling/depooling as well [01:59:23] I could manage the LVS files for services. that's not amazingly hard [01:59:59] we could actually have multiple LVS pools and move canaries to another testing pool [02:00:17] Right [02:00:19] which would be a matter of depooling them from one and pooling them in the other [02:00:32] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200976) [02:00:56] Ryan_Lane, should I document that now btw? [02:01:15] the restart code? [02:01:23] I'll document it on the trebuchet page when I merge it in [02:01:39] I'll probably do so tomorrow [02:02:05] k, added a note about it soon being available at https://wikitech.wikimedia.org/wiki/Parsoid#Misc_stuff [02:02:53] cool [02:03:07] when I merge it in, I can also push in a change to your repos and see if that's what you wanted [02:04:42] we'll do our next deploy on Wed otherwise [02:05:10] * Ryan_Lane nods [02:13:51] mini-dinstall looks like an interesting alternative to reprepro [02:15:52] !log LocalisationUpdate completed (1.23wmf5) at Tue Dec 10 02:15:52 UTC 2013 [02:16:08] Logged the message, Master [02:16:10] paravoid: ^ I think the LocalisationUpdate run just fixed "Spezial:Zentrale_automatische_Anmeldung/createSession".
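A sketch of the canary-plus-staggered-restart idea discussed above, combining the salt calls that appear later in this log; the host name, port, and health URL are assumptions:

    salt 'wtp1001*' parsoid.restart_parsoid parsoid     # restart the canary host first
    curl -sf http://wtp1001.eqiad.wmnet:8000/ >/dev/null &&
        salt -b 1 -G 'deployment_target:parsoid' parsoid.restart_parsoid parsoid
    # a testing LVS pool, as floated above, would additionally move the canary
    # out of the live pool before the check and back in afterwards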
[02:29:36] !log LocalisationUpdate completed (1.23wmf6) at Tue Dec 10 02:29:36 UTC 2013 [02:29:51] Logged the message, Master [02:30:34] (03PS2) 10Ryan Lane: Add a deploy.restart runner and module call [operations/puppet] - 10https://gerrit.wikimedia.org/r/100509 [02:30:35] (03PS1) 10Ryan Lane: Manual restart for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/100526 [02:32:34] gwicke: for my changes to really matter at all, the upstart needs to be available for parsoid [02:32:58] so if you're depending on this change for any reason, you guys may want to look at getting the upstart in :) [02:54:03] (03PS1) 10Dzahn: fix links in uncyclomedia tables [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/100528 [02:54:04] (03CR) 10Dzahn: [C: 032] fix links in uncyclomedia tables [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/100528 (owner: 10Dzahn) [03:15:08] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Dec 10 03:15:08 UTC 2013 [03:15:25] Logged the message, Master [03:24:59] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [03:30:48] (03PS1) 10Springle: depool db1049 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100529 [03:31:23] (03CR) 10Springle: [C: 032] depool db1049 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100529 (owner: 10Springle) [03:32:22] !log springle synchronized wmf-config/db-eqiad.php 'depool db1049 for upgrade' [03:32:38] Logged the message, Master [03:42:20] (03CR) 10MZMcBride: "Blergh." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98002 (owner: 10John F. Lewis) [04:15:31] (03CR) 10Mattflaschen: "Please change to 118, per discussion starting at https://bugzilla.wikimedia.org/show_bug.cgi?id=57315#c40 , and also mention it under "Rec" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [04:16:19] (03PS1) 10Springle: repool db1049 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100530 [04:16:42] (03CR) 10Springle: [C: 032] repool db1049 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100530 (owner: 10Springle) [04:17:46] !log springle synchronized wmf-config/db-eqiad.php 'repool db1049 after upgrade' [04:18:02] Logged the message, Master [04:35:49] Why am I getting 51+ minute database rep lag messages? [04:47:49] T13|sleeps: Where? [04:57:07] It's resolved itself. [04:57:57] Was on enwiki and I'm assuming it had to do with db1049 repool [05:24:55] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [05:25:21] (03PS1) 10Dzahn: push up to version 2.5 [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/100535 [05:28:55] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (206870) [05:29:04] ori-l: mwalker|away: I can't remember exactly what I said about making the repo. 
but I do remember protesting when bd808|BUFFER wanted to put scholarships app conf in apache conf (via env vars) [05:30:49] (03PS2) 10Dzahn: push up to version 2.5 [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/100535 [05:32:13] (03CR) 10Dzahn: [C: 032] push up to version 2.5 [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/100535 (owner: 10Dzahn) [05:33:46] (03PS1) 10Springle: dedicate db1049 to specific query types as per groupLoadsByDB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100536 [05:34:16] (03CR) 10Springle: [C: 032] dedicate db1049 to specific query types as per groupLoadsByDB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100536 (owner: 10Springle) [05:35:34] !log springle synchronized wmf-config/db-eqiad.php 'db1049 to LB=0 except groupLoadsByDB' [05:35:50] Logged the message, Master [05:40:54] (03CR) 10Dzahn: [C: 032] icinga: raise timeout of check_job_queue nrpe command [operations/puppet] - 10https://gerrit.wikimedia.org/r/99411 (owner: 10Hashar) [05:43:16] (03CR) 10Dzahn: [C: 032] remove outdated tesla subnet from dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/96489 (owner: 10Dzahn) [05:44:24] (03CR) 10MZMcBride: "Ready now?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97331 (owner: 10Dereckson) [05:46:52] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [05:57:39] (03CR) 10Dzahn: "from neon:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99410 (owner: 10Hashar) [06:12:35] (03CR) 10Dzahn: "worked. it's 30 now. puppet_services.cfg.. nrpe_check!check_check_job_queue!30" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99411 (owner: 10Hashar) [06:24:01] (03CR) 10Dzahn: "thanks, it's fixed now. keep 'em coming." [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/96018 (owner: 10Jack Phoenix) [07:31:24] (03PS1) 10Springle: explicitly direct each updateSpecialPages class by name. QueryPage::recache vslow seems useless? [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100544 [07:32:34] (03CR) 10Springle: [C: 032] explicitly direct each updateSpecialPages class by name. QueryPage::recache vslow seems useless? [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100544 (owner: 10Springle) [07:33:42] !log springle synchronized wmf-config/db-eqiad.php 'explicit LB for each updateSpecialPages job' [07:33:59] Logged the message, Master [07:56:55] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201563) [07:59:00] springle, there was an enwiki labsdb replication breakage report earlier, haven't double-checked it yet: http://lists.wikimedia.org/pipermail/labs-l/2013-December/001942.html [08:01:07] Eloquence: hmm not sure. enwiki -> labs running and no lag atm [08:01:28] we had dewiki problems recently (but it's ok now, too) [08:02:12] *nod* will ask the user to provide details if it's still an issue. [08:10:03] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:12:53] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204287) [08:37:15] (03CR) 10Expi1: "Yeah, that's what I tried to do, since I could couldn't find a original high resolution image. 
I've been struggling to get the current ico" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100326 (owner: 10Gerrit Patch Uploader) [08:59:55] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:03:27] (03CR) 10Odder: "I think that the general idea that lay behind the original favicon is that grey text would not be visible on a tab, which is the most prom" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100326 (owner: 10Gerrit Patch Uploader) [09:03:58] springle: i apologize for the bug i caused yesterday. can you please explain why it happened? [09:05:41] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (207043) [09:19:34] matanya: generic::systemuser appears to require an explicit name => 'blah' [09:19:41] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:20:01] springle: this is weird, and i don't understand why [09:21:31] nor i really :) puppet run failed on the boxes with a message about missing array index. so i went looking at other calls to generic::systemuser, and they all listed name => 'blah'. then it worked... [09:25:41] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201478) [09:27:22] matanya: based on a very cursory scan of puppet docs, declaring the instance of generic::systemuser { 'blah' makes $title=blah, not $name. i guess generic::systemuser needs to default $name to $title somehow [09:27:52] thanks springle i'll try to debug this later [09:28:41] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:29:45] (03PS1) 10Dan-nl: beta: gwtoolset-whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100547 [09:35:21] apergos: morning :-) Do you have any idea how much disk space we have in Swift ? [09:35:29] morning [09:35:39] no. let's see what ganglia says [09:35:44] orrr [09:35:45] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201302) [09:35:49] there is a GLAM related extension that is going to be deployed this week that would let folks bulk import materials from various museums [09:36:05] I thought it might be worth a mail to ops to warn you guys [09:37:45] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:40:44] 4*700GB * 12 allowing for three copies of everything to be kept [09:41:12] how many T do you guess (ballpark)? [09:41:51] that's how much is free, I mean [09:41:55] hashar: [09:43:10] apergos: I have no idea :] [09:43:13] matanya: when code like that gets merged it's a good idea to puppetd --test on one of the affected hosts (or get someone to do it for you if you don't have access) [09:43:21] will ask folks and report back on ops list [09:43:43] hashar: ok, well I think it's worth reporting to ops just so we know, I think we're not going to run out of space in a month [09:43:47] agreed apergos, i'll poke you next time :) [09:44:18] or whoever did the merge may be able to test it too [09:45:37] apergos: we write on both pmtpa and eqiad, is that free space figure for eqiad or both DC ? [09:45:45] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (205048) [09:45:45] eqiad [09:45:55] I'll have a look at tampa [09:46:01] thanks!
[09:46:02] though I expect them to be equivalent [09:46:43] if there is several TB that is fine, if there is only 1TB we are in trouble already :D [09:47:00] oh no we are well above 1T :-D [09:47:29] if we were in 1T land without any spares we would be [09:47:39] up that creek without a paddle, and you know which one :-D [09:48:18] 4*600*12 for pmtpa [09:48:24] and 4*700GB * 12 is the total disk space but we have to divide by 3 because we keep 3 copies ? [09:48:28] no [09:48:34] so [09:48:52] 12 partitions on each host, 600 (or 700) gb free on each partition [09:48:55] 12 hosts total [09:49:02] I already divided by 3 [09:49:14] and that's the free space I'm telling you, not the total space [09:49:19] (03CR) 10Ebrahim: "@Hashar: I followed https://bugzilla.wikimedia.org/show_bug.cgi?id=54826#c8" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99739 (owner: 10Ebrahim) [09:49:43] awesome thank you! [09:49:47] yw [09:51:41] I will one day have to setup swift for the beta cluster [09:52:41] (03CR) 10Ebrahim: "http://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%DB%8C%D8%A7:%D9%86%D8%B8%D8%B1%D8%AE%D9%88%D8%A7%D9%87%DB%8C_%D" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99739 (owner: 10Ebrahim) [09:53:10] orilly [10:01:15] ahh mutante merged in my patch to raise check_job_queue timeout [10:01:22] was p**** me off for the last few days [10:01:23] \O/ [10:01:45] (03CR) 10Hashar: "*cheers* Thank you for the verification!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99411 (owner: 10Hashar) [10:10:05] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [10:13:05] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (205114) [10:16:16] (03CR) 10Hashar: [C: 032] beta: gwtoolset-whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100547 (owner: 10Dan-nl) [10:16:25] (03Merged) 10jenkins-bot: beta: gwtoolset-whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100547 (owner: 10Dan-nl) [10:21:05] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [10:24:05] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200749) [10:26:05] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [10:50:05] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204410) [11:12:25] (03CR) 10Akosiaris: "I should have caught that while reviewing. Really sorry Sean :(" [operations/puppet] - 10https://gerrit.wikimedia.org/r/100357 (owner: 10Matanya) [11:13:18] akosiaris: i apologized to sean already, but you deserve one too. sorry [11:14:47] matanya: that is what code reviews are for. I should have caught that... [11:23:56] that's what testing is for; we'll overlook things, it happens [11:24:18] even testing won't be perfect but it will help [11:27:32] yeah a catalog compilation here would have helped. It would have caught that [11:28:01] I usually do them in big changes, but this seemed innocent enough.
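Working the Swift capacity figures from the exchange above: 12 hosts, 12 partitions each, roughly 700 GB (eqiad) or 600 GB (pmtpa) free per partition, divided by 3 for replication:

    echo $((12 * 12 * 700 / 3))    # eqiad: 33600 GB usable free, ~33.6 TB
    echo $((12 * 12 * 600 / 3))    # pmtpa: 28800 GB usable free, ~28.8 TB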
[11:28:19] :-) [11:31:38] (03PS5) 10Spage: Enable Flow discussions on a few wikis' test pages [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 [11:36:03] (03CR) 10Spage: "PS5 was a rebase; PS6 splits the extension-list change into a separate patch as Benny suggested, and also avoids changing labs to have Flo" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 (owner: 10Spage) [11:48:02] apergos: I think that we should change our access req process [11:48:39] we're kinda sticking to it religiously, and I think we should incorporate common sense into the process [11:50:14] common sense is fine, how would you like the process to be changed? [12:30:40] I feel like crap [12:34:03] maybe you should sleep [12:39:41] !log restarting parsoid across the cluster, 100% CPU on all appservers [12:39:57] Logged the message, Master [12:43:36] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:44:26] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:59:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [14:01:31] !log hashar synchronized php-1.23wmf6/extensions/ProofreadPage [14:01:48] Logged the message, Master [14:38:52] (03PS1) 10coren: Tool Labs: Add a couple requested packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/100570 [14:40:22] (03CR) 10coren: [C: 032] "Trivial package additions." [operations/puppet] - 10https://gerrit.wikimedia.org/r/100570 (owner: 10coren) [14:44:34] (03PS1) 10ArielGlenn: remove ryan from pager list [operations/puppet] - 10https://gerrit.wikimedia.org/r/100571 [14:44:49] Ryan_Lane: ^^ [14:44:59] apergos: awesome. thanks [14:45:44] (03CR) 10ArielGlenn: [C: 032] remove ryan from pager list [operations/puppet] - 10https://gerrit.wikimedia.org/r/100571 (owner: 10ArielGlenn) [14:45:51] (03PS3) 10Ryan Lane: Add a deploy.restart runner and module call [operations/puppet] - 10https://gerrit.wikimedia.org/r/100509 [14:46:06] (03PS2) 10Ryan Lane: Manual restart for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/100526 [14:50:12] (03PS1) 10Hashar: fix routing of non-wikipedia on beta cluster [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100573 [14:51:18] (03CR) 10Hashar: "That broke the beta cluster site that are not wikipedia/wikimedia causing bug 58271." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85165 (owner: 10Reedy) [15:39:13] Coren: do you know much about Varnish? and if so, can I pick your brain for a moment? I have a really unusual situation with Varnish on beta labs. [15:39:52] (03CR) 10Qgil: "Ok, so the plan here is to recreate the current logo by using the Wikimedia svg logo, Gill Sans condensing and stretching the font as in t" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100326 (owner: 10Gerrit Patch Uploader) [15:41:25] (03CR) 10MZMcBride: "The comma alignment in tests/multiversion/MWMultiVersionTest.php seems funky." (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100573 (owner: 10Hashar) [15:47:19] chrismcmahon: I have some Varnish skill, but not at the WMF scale. Maybe I can still be of use. Hit me. [15:49:20] Coren: thanks. Since about late November on beta labs we've been seeing 503 errors coming from Varnish. This mostly affects the Mobile team because they're the only ones using Varnish.
The really weird part is that the 503 errors seem to be coming mostly if not completely from Chrome and essentially zero from Firefox. I have no idea why a 503 error from Varnish would be browser-specific. [15:50:09] chrismcmahon: Hm. That also seems odd to me, unless Chrome presents a header your varnish setup has a Vary on? [15:50:32] Have you compared the headers presented by both browsers? [15:51:16] Coren: yeah, I'm just fishing here, looking for a place to start looking. Headers it is! [15:51:32] hey coren would you mind adding djvulibre-bin package to the contint boxes please ? change is straightforward https://gerrit.wikimedia.org/r/#/c/99196/ [15:52:50] Also, "since late November" sounds a lot like "since the Labs DNS has been a bit overloaded and randomly flaky". It's gotten better since, but not perfect; but keep an eye in the logs for possible DNS resolution errors. [15:53:45] (03PS2) 10coren: contint: djvulibre-bin for mw djvu unit tests [operations/puppet] - 10https://gerrit.wikimedia.org/r/99196 (owner: 10Hashar) [15:53:51] thx [15:54:09] chrismcmahon, now everything uses varnish [15:54:29] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [15:54:36] Coren: yes, but afaict the situation with the labs DNS affected all the browsers, and I still see this huge Chrome vs. FF disparity on the 503s today. [15:54:47] (03CR) 10Ragesoss: [C: 031] "Thanks for submitting this patch Ebrahim. Reedy or someone else from ops will probably merge this soon. Everything looks to be in order in" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99739 (owner: 10Ebrahim) [15:55:09] MaxSem: actually, I think I knew that, but it's still troubling Mobile more than anything else. [15:55:24] Coren, I did a bit of investigation of that issue - one theory was that it was timing out because it was on obama article, but I couldn't find any indications of this in logs [15:55:49] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.36 ms [15:55:56] also, I discovered that apaches on beta are receiving a lot of SIGTERMs [15:55:58] MaxSem: yes, and we've had sporadic reports of the 503s on pages other than Obama also [15:56:33] making a request while apache is being restarted can result in a 503 [16:01:52] (03CR) 10coren: [C: 032] "Yeay array of exactly one member! (Simple package addition)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99196 (owner: 10Hashar) [16:02:11] :-) [16:02:49] MaxSem: So would a connection closing while it's being handled. [16:03:24] But the chrome/mozilla disparity seems unlikely enough if there is no causal link; looking at the headers might be instructive about the underlying cause. [16:08:27] gwicke: paravoid -- parsoid talk [16:08:50] is paravoid's post backend randomization patch deployed yet? [16:08:55] oh [16:09:01] I didn't expect you to be around [16:09:09] gwicke: http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=cpu_report&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [16:09:16] (03PS8) 10Addshore: Start wikidata puppet module for builder [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 [16:09:27] oh wow [16:09:32] (that drop is a restart I did before I crashed to bed again [16:09:52] our monitoring sucks.. [16:10:12] of course it does [16:10:15] I guess the simplest fix is to revert yesterday's deployment [16:10:35] which you told me needs a rebuild of node_modules?
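A quick way to chase Coren's Vary theory above: check what the backend varies on, then replay each browser's request headers and see whether only one variant 503s; the beta URL is a placeholder:

    curl -sI 'http://en.m.wikipedia.beta.wmflabs.org/wiki/Main_Page' | grep -i '^vary'
    # then repeat the request with curl -H 'User-Agent: ...' -H 'Accept: ...'
    # once with Chrome's headers and once with Firefox's, and compare status codes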
[16:10:38] I wasn't sure of the details [16:10:59] maybe it does after all [16:11:06] it'd be nice to find out where the leak is too, I googled extensively for 100% cpu leaks but didn't find any relevant bugs [16:11:21] we have been running this code in rt without issues, but I did update node_modules after verifying that it did not crash [16:11:49] the nodes are using a ton of memory, that's not typical [16:12:02] so it might very well be jsdom leaking memory [16:12:25] being close to the 1.7G memory limit makes node use all cpu on GC [16:12:58] paravoid, can you install node 0.10 on tin? [16:13:06] or on bast1001 [16:13:25] so far I have prepared our node_modules on bast1001 and then rsynced it to tin [16:13:26] done on tin [16:13:30] but that is getting a bit old [16:13:37] thanks! [16:13:39] (I'll puppetize it later) [16:14:30] oh, need npm too [16:14:36] sorry [16:14:48] done [16:14:57] I should have thought that myself :) [16:15:24] I installed -dbg packages and perf on wtp1002 [16:15:49] but I'm not familiar with libv8's internals and it's a bit late to dig in a VM engine [16:16:14] it is pretty likely that an old jsdom is leaking that memory, as it is a binary module and integrates rather deeply with v8 [16:16:32] ok, that makes me feel better to have woken you up [16:16:35] we don't really exercise it, but it might still be initialized whenever we create a new html dom [16:17:26] the ramp up is about 1 hour, so we'll know soon if you attempt a fix [16:18:48] with the right proxy settings npm is now making progress [16:21:21] or so it seemed [16:24:27] maybe the npm version is too old [16:25:53] it's 1.1.39 [16:25:58] not very old [16:26:00] I'm at 1.3.10 locally, the installed version is 1.1.39 [16:26:03] not very new either [16:26:50] the trick of 'npm install npm' does not work either [16:27:08] it looks non-trivial to get fresh packages [16:27:17] I'll have a look later, can you do it now like you did yesterday? [16:27:19] mutante: can you create an RT account for mhoover, and give him access to the procurement queue? (Or tell me how? I can't even find the 'create account' link) [16:27:21] via bast1001 or what was it? [16:27:33] we can install the ppa somewhere [16:27:37] it has a new npm too [16:27:53] I didn't update node_modules in prod yesterday [16:28:07] I can't install software from a random ppa on our deployment box which has access everywhere just like that :-) [16:28:17] andrewbogott: make a ticket for the account to be created and i'll do it and put some docs on it how . k? [16:28:29] yeah, let me scp node_modules from labs over [16:29:40] mutante: https://rt.wikimedia.org/Ticket/Display.html?id=6476 [16:29:45] why is everyone in CA up so early? [16:29:54] …not that I'm complaining [16:29:58] gwicke is because we woke him up :) [16:30:03] argh [16:30:03] andrewbogott: taken [16:30:08] no npm on labs any more [16:30:33] gwicke: ? [16:31:04] moving to the ppa on the labs vm to get npm.. [16:32:13] gwicke: I don't understand… was there a deb and it vanished? [16:32:44] andrewbogott, we are in the backporting business now [16:32:59] for node 0.10 [16:33:08] we forgot about npm [16:33:12] gwicke: ok… just trying to verify that I didn't break anything :) [16:33:37] no, not your fault ;) [16:35:27] !log Nuking Rezabot@fawiki's watchlist, requested by operator [16:35:32] ok, the Debian npm package is impossible to backport [16:35:42] Logged the message, Master [16:36:11] it has 30 dependencies, half of which need backporting too and these depend on more packages etc. 
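The staging flow gwicke describes above (build node_modules on bast1001, then ship it to tin), sketched end to end; the proxy URL and directory layout are assumptions:

    # on bast1001: point npm at the internal proxy, then build against the target node version
    npm config set proxy http://webproxy.eqiad.wmnet:8080
    npm config set https-proxy http://webproxy.eqiad.wmnet:8080
    cd ~/parsoid && npm install --production
    # ship the result to the deploy host
    rsync -a node_modules/ tin.eqiad.wmnet:/srv/deployment/parsoid/deploy/node_modules/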
[16:36:30] meh [16:36:39] gwicke: shall I just restart node everywhere to buy us some time? [16:36:51] the cluster is probably out of commission now [16:36:51] paravoid, that would be good [16:37:07] glusterfs is not very fast.. [16:38:04] !log restarting all of parsoid again [16:38:21] Logged the message, Master [16:40:41] (03PS4) 10Ryan Lane: Add a deploy.restart runner and module call [operations/puppet] - 10https://gerrit.wikimedia.org/r/100509 [16:40:49] heh [16:41:21] gwicke: you could switch your project to nfs ;) [16:41:44] Ryan_Lane, we'd love to do that [16:41:55] one sec. I think we have docs somewhere [16:42:41] we have about an hour by my count, jfyi :) [16:42:49] then we can restart again, of course [16:43:51] cp finally finished, now waiting for chown.. [16:46:25] and deploying [16:48:29] !log updated Parsoid node_modules for 0.10 to fix what looks like a memory leak [16:48:43] Logged the message, Master [16:50:21] Ryan_Lane, is the config repo still triggering a restart? [16:51:19] nm, looks like it [16:52:24] gwicke: yes [16:52:31] I'm still working on the restart chaneg [16:52:34] *change [16:52:47] I think I just got it to a point where it can be merged [16:53:07] k [16:53:08] (btw, ryan is doing all that in a volunteer capacity ;)) [16:53:22] indeed [16:53:26] not getting paid for this :D [16:53:37] back in an hour or so [16:53:58] memory is going up again [16:56:36] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Parsoid%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1386694545&g=mem_report&z=large [16:56:55] so that did not solve it [16:57:13] :( [16:57:18] and I have to run into a meeting in 3' [16:57:23] andrewbogott: ok, resolved. reload 6476 and find instructions and screenshot [16:57:26] k [16:58:00] mutante: thank you! [16:58:16] yw [16:58:18] (03CR) 10Matthias Mullie: [C: 031] Add Flow to extension list for message cache. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100551 (owner: 10Spage) [16:58:37] I'll revert the config repo as we know the previous version to work with both node 0.8 and 0.10 [16:59:18] next step will be to roll back the code, and if that doesn't help either to roll back node 0.10 to 0.8 [16:59:33] nod [16:59:59] valgrind? [17:01:15] maybe [17:02:51] !log downgraded Parsoid to 0ac82a28 and rolled back config update to rule out code changes for memory leak [17:03:07] Logged the message, Master [17:08:48] at first sight it still seems to be leaking [17:15:03] !log running gerrit reviewer-counts cron command manually for bug 52329 [17:15:20] Logged the message, Master [17:15:30] paravoid, we can either try the ppa on a machine or go straight back to 0.8 [17:20:58] (03PS1) 10RobH: RT: 6477 labnet1001 mgmt dns entries [operations/dns] - 10https://gerrit.wikimedia.org/r/100588 [17:22:41] (03CR) 10RobH: [C: 032] RT: 6477 labnet1001 mgmt dns entries [operations/dns] - 10https://gerrit.wikimedia.org/r/100588 (owner: 10RobH) [17:30:26] the ppa is in /home/gwicke/nodejs_0.10.22-1chl1~precise1_amd64.deb on bast1001 in case you'd like to give that a shot [17:32:47] I'm going to re-deploy yesterday's code as that was not the issue [17:33:24] also gives me one restart [17:34:02] (03CR) 10Matthias Mullie: [C: 031] "Looks good to me." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 (owner: 10Spage) [17:38:48] !log updated Parsoid back to 31910075 after verifying that this was not the source of the memory leak [17:39:04] Logged the message, Master [17:45:01] anyone around to help with parsoid?
[17:45:12] I have two meetings ahead [17:45:29] I can guide people through but someone needs to babysit this [17:45:58] any *ops* around, to be clear -- gwicke is already here and helping out [17:47:53] paravoid: i'm around for the next hour [17:49:15] paravoid: do you prefer to go straight back to 0.8 or would you like to test the ppa first? [17:49:54] at this point, I think just going back to 0.8 [17:50:19] *nod* [17:50:29] then we can do tests in labs or on one server [17:51:02] Jeff_Green: we basically need to downgrade nodejs npm and related packages on the parsoid machines [17:51:06] one by one [17:51:15] oh fun, 0.8 was also in our repo [17:51:19] fortunately it's still there [17:52:00] ok [17:52:17] we should maybe move from reprepro to a repo manager that can keep old versions [17:52:26] cp: writing `./nodejs-dbg_0.8.2-1chl1~precise1_amd64.deb': No space left on device [17:52:29] grrr [17:52:36] gwicke ++++ [17:53:09] https://wiki.debian.org/HowToSetupADebianRepository#mini-dinstall looks interesting at first sight [17:53:10] paravoid: it's not in the caches on the individual machines? [17:53:46] no, that's wrong [17:53:46] but anyway, it's salvaged now, by accident [17:55:33] (03CR) 10Ori.livneh: "Where will this run, initially?" (036 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 (owner: 10Addshore) [17:56:19] ok, apt has only 0.8 now [17:57:27] ok, all wtp* boxes have nodejs/nodejs-dev at 0.8 now [17:57:40] I'm about to run in my meeting [17:57:51] Jeff_Green: can you do a rolling restart of parsoid on all wtp10{01..24} ? [17:58:09] do one, confirm with parsoid that it works, etc. [17:58:22] paravoid: I'm happy to, but I've never touched parsoid [17:58:57] confirm with gwicke, that is :) [17:58:58] paravoid: thanks! [17:59:06] Jeff_Green: /etc/init.d/parsoid restart [17:59:15] ah cool [17:59:29] I was assuming it would be a salt thing, which meant I had to finally learn how to use it :-) [17:59:45] you can do it via salt too :P [18:00:00] https://wikitech.wikimedia.org/wiki/Parsoid#Misc_stuff [18:00:14] but in this case manual would be better [18:00:19] ori-l: nod. in this case I'd prefer not to introduce the addl layer since I'm unfamiliar [18:00:20] cool [18:00:20] so that we can check the first [18:00:28] one sec [18:00:43] wtp1001? [18:01:03] gwicke: yes. logging in [18:01:26] k, got the http test ready [18:01:55] nodejs == 0.8.2-1chl1~precise1 [18:02:01] is that as expected? [18:02:14] yes [18:02:20] ok, here goes the restart [18:02:31] done [18:02:40] looks good [18:02:50] ok. next box [18:03:05] I guess we can do the remaining ones with dsh [18:03:22] yeah? [18:03:25] just not in parallel [18:04:26] should we do salt per the wikitech doc? [18:04:32] oh nm [18:04:34] that works too [18:04:47] salt -b 1 -G 'deployment_target:parsoid' parsoid.restart_parsoid parsoid [18:05:04] assuming "-b 1" gets us batches of 1 [18:05:10] sane? [18:05:16] I guess so, yes [18:05:50] really stupid question which highlights how heads-down I've been on fundraising....where do I run salt these days? [18:05:50] parsoid takes ~2-3 seconds to restart, so staggering it is better to avoid all machines going down at once [18:05:52] this would be all at the same time, so don't, but if parsoid.restart_parsoid doesn't work, you can also just run the same command ..
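A sketch of the per-host check used between restarts ("got the http test ready" above); the port and request path are assumptions about the Parsoid HTTP API of the time:

    curl -s -o /dev/null -w '%{http_code}\n' http://wtp1001:8000/enwiki/Main_Page
    # expect 200 before moving on to the next box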
salt 'wtp10*' cmd.run '/etc/init.d/parsoid restart' [18:06:05] cmd.run 'what you did manually' [18:06:10] oh hell, I'm just going to do these by hand on each box [18:06:22] dsh -g parsoid should work too [18:06:59] from tin? [18:07:05] or bast1001 [18:08:04] "dsh -g parsoid /etc/init.d/parsoid restart" as root [18:08:14] ok. here goes [18:08:38] hmm [18:08:44] root@bast1001:~# dsh -g parsoid /etc/init.d/parsoid restart [18:08:44] * Restarting parsoid [18:08:44] ...done. [18:08:59] and the prompt hasn't returned yet [18:09:21] hmm [18:09:32] that might be the sucky init script that we are still using [18:09:53] maybe service parsoid restart fares better? [18:09:54] ok, let's just do them by hand [18:10:36] 1002 done [18:10:53] 1003 done [18:11:50] 1004 [18:11:59] I'd give service parsoid restart a try [18:12:05] there are 24 machines [18:12:28] it worked and returned the prompt on 1006 [18:12:35] this actually fits perfectly with my day [18:12:36] Jeff_Green: so... [18:12:41] Jeff_Green: use salt [18:13:04] my phone decided to go into a boot loop, and until the battery dies I hear the android pyonnnng sound every minute or so [18:13:22] salt -G 'deployment_target:parsoid' parsoid.restart_parsoid 'parsoid' [18:13:31] Ryan_Lane, is that all at once? [18:13:32] Ryan_Lane: you think that will do better than dsh did with a crappy init script? [18:13:34] gwicke: yes [18:13:36] Jeff_Green: oh, ^d had that yesterday when he upgraded to the CM nightly ;) [18:13:44] Ryan_Lane, is there a way to stagger them? [18:13:48] Jeff_Green: it will because I wrote this specifically because of this problem [18:13:51] gwicke: yes [18:13:54] for wtp in $(seq 10 24); do ssh wtp10${wtp} ...; sleep .. ; done [18:13:58] greg-g: interesting. verizon is sending me new phone, apparently they don't know about it [18:14:01] salt -b -G 'deployment_target:parsoid' parsoid.restart_parsoid 'parsoid' [18:14:11] Jeff_Green: ah, so you didn't bork it yourself, good work ;) [18:14:21] gwicke: can you please, please deal with the init script issue? [18:14:36] Ryan_Lane, yes- going to delete it this week [18:14:39] I've been asking for months [18:14:48] replaced with the upstart? [18:14:52] yup [18:14:55] cool [18:14:56] thanks [18:14:58] so that we also get log rotation etc [18:15:03] yeah [18:15:10] we'll be able to get rid of that parsoid module function then too [18:15:16] <^d> Jeff_Green, greg-g: Heh, yeah boot loops aren't fun. [18:15:27] and we'll be able to use: salt -G 'deployment_target:parsoid' service.restart 'parsoid' [18:15:43] can't use that right now [18:16:13] maybe right now isn't the best time to merge in this deployment system change? :) [18:16:41] I guess I need to write a wrapper for service restart first anyway [18:16:47] any feeling on the name of this script? :) [18:16:59] is salt waiting for the restart too? [18:17:01] repo-restart? [18:17:16] gwicke: salt won't, if you use the parsoid.restart_parsoid function call I use [18:17:22] err. I mentioned [18:17:29] then it is basically parallel anyway [18:18:04] is it possible to only start restarting the next node when the previous one is done? [18:18:19] kind of [18:18:26] Ryan_Lane: how about salt -b 1 to do only one at a time?
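A corrected sketch of the ssh loop suggested above: as written, $(seq 10 24) only covers wtp1010-wtp1024, so zero-padded numbering is needed to hit all of wtp1001-wtp1024:

    for n in $(seq -w 1 24); do
        ssh "wtp10${n}" 'service parsoid restart'
        sleep 5    # stagger so the pool is never entirely down
    done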
[18:18:34] that should work [18:18:43] starting [18:18:50] I'm not sure if that actually waits till the service is fully restarted, though [18:18:57] but it'll still stagger it at least some [18:19:16] because it does wait until the minion returns before it starts the next [18:19:28] it's done [18:19:40] I wonder how that would work with service.restart [18:19:44] I know that waits [18:19:51] but upstart might return immediately [18:20:15] I did implement restart via the deploy module, so we could put in some logic there if we wanted [18:20:21] Jeff_Green: looks good, thanks! [18:20:26] https://groups.google.com/d/msg/salt-users/EBjhCb6CuIg/bsaiRzdMEDkJ [18:20:27] deploy.restart calls service.restart [18:20:39] and service.restart on all of the dependency repos [18:21:04] ori-l: ? [18:21:13] some dbus - salt bridge would be awesome [18:21:16] yes [18:21:29] I thought it might have been something relevant to this [18:21:34] but yeah, that would indeed be awesome [18:21:50] * gwicke would already be happy with fixing up dsh [18:22:12] meh. dsh is a piece of crap [18:22:16] :) [18:22:53] i for one would like to see dsh go and all focus go into salt [18:22:54] for rolling restarts it pretty much does the right thing [18:23:07] gwicke: not necessarily [18:23:19] in this case it doesn't because the init script doesn't return [18:23:19] but if salt can do that at some point too, then that would be great of course [18:23:24] salt can do this [18:23:25] having multiple tools that do overlapping things ends up being really confusing for people like me who don't normally focus on production [18:23:26] via batch [18:23:33] the issue is that upstart might return immediately [18:23:37] which would be the same problem with dsh [18:23:44] there's no difference there [18:23:53] Jeff_Green: yeah. dsh is almost always out of date [18:24:11] Ryan_Lane: that too, dsh with it's crufty local config files [18:24:13] salt is up to date via puppet runs [18:24:24] which is glorious [18:24:24] hmm, I have been using upstart for rashomon, and it seemed to wait for the restart [18:24:34] gwicke: if that's the case, then salt will work [18:24:54] when it does a batch run it waits for a return from x number of minions before starting on more [18:25:19] I see, and by default it actually waits for the command to return? [18:25:21] if it isn't the case then neither salt nor dsh will work [18:25:22] yes [18:25:42] I wonder why it does not run into the issue with the hanging init script then [18:25:49] dsh? it does [18:25:53] it just did for jeff [18:25:56] salt [18:26:02] I wrote a function for this [18:26:15] the service.restart function does hang [18:26:16] gwicke: do you have a quick way to confirm that parsoid came up everywhere as expected? [18:26:22] aha [18:26:25] the module that I wrote forks [18:26:35] Jeff_Green, https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Parsoid%2520eqiad&tab=m&vn= [18:26:45] and I did some testing [18:26:51] ah great. 
thanks [18:26:55] so, technically salt doesn't hang on service.restart [18:27:13] it will indeed restart all of them, but it doesn't work with batch because the minions never return [18:27:33] and if you do a restart of salt, all of the parsoid processes with crash, because their pid is tied to salt [18:27:39] Ryan_Lane: that's great- defult salt will then do what I'm looking for [18:27:43] s/with/will/ [18:27:45] *default [18:27:47] yes [18:27:55] the stuff I just pushed in will work for this [18:28:03] I'm merging it soonish [18:28:12] I just need to write a wrapper for the restart command [18:28:35] so, what to name the command? repo-restart? service-restart? [18:29:00] I don't like service-restart because you need to pass in the repo name, not the service name [18:29:11] it has the same security model as deployment [18:29:11] salt-restart? [18:29:33] hm [18:29:55] service-restart may be ok [18:30:08] if it's run from within the repo it could read the repo name from the git repo [18:30:19] if it's run from outside of the repo it would require the repo name [18:31:11] in the future it could just be: git deploy restart [18:31:20] I can't do that right now, though [18:31:35] or: git deploy restart-service [18:32:14] sounds good [18:32:26] ok, writing the wrapper [18:32:30] then I can merge this [18:33:36] the thing I don't like about putting everything into git-deploy is that breakage there would disable the default restart method [18:33:59] fine as long as there are still fallbacks [18:42:42] gwicke: it doesn't [18:43:01] you can always manually call the salt function, if necessary [18:43:23] (03CR) 10Umherirrender: "No, the core merge is part of 1.23wmf6, where deployment Phase 3 is at Thursday, 12 December 2013." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97331 (owner: 10Dereckson) [18:43:51] the only major way trebuchet can break is if salt is down, or if someone broke the system with a code update [18:44:09] and the system is modular, so the restart should work even if deployment itself doesn't [18:44:40] I can make the restart stuff a class, so that it can be called via a script or via git deploy, too [18:49:34] hey Aaron|home, today, while testing gwtoolset on beta, we ran into an issue where a 3,000 record metadata file reached the job threshold of 1,000 media files that we placed on the extension. would it be okay to raise that threshold to 10,000? [18:50:08] atm i'm also investigating using the jobReleaseTimestamp [18:51:00] I suppose [18:51:49] thanks, also do you know if it's possible to implement the jobReleaseTimestamp with mysql? [18:52:47] bd808 mentioned that we should be able to use it on beta and production because redis enables the delay, but i was wondering if it's possible to implement locally with mysql [18:53:14] Jeff_Green / gwicke: how are things going? [18:54:11] wtp* are restarted on the 8.2 nodejs [18:54:16] cool, thanks [18:54:24] and we've confirmed they're up [18:54:26] np [19:08:35] !log reedy synchronized php-1.23wmf6/extensions 'I53db62d469a6944cdf24dc209fa55c972f69b73f' [19:08:52] Logged the message, Master [19:09:48] (03PS3) 10Ryan Lane: Manual restart for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/100526 [19:11:17] dan-nl: not with mysql (there is no code for that) [19:12:55] k, it looks like i have to check $wgJobTypeConf and see which class it has set for default and then add the jobReleaseTimestamp if it's using JobQueueRedis [19:13:34] Aaron|home: is that correct?
or is there a better way to check that? [19:15:30] you can do JobQueueGroup::singleton()->get( 'foo' ) and call supportsDelayedJobs() on that [19:15:46] you probably have to do the former anyway [19:16:29] well, I guess not if you did JobQueueGroup::singleton()->push() [19:16:44] anyway, that's how it could be best checked [19:17:22] k, seeing if i can use the latter method without pushing it onto the array … as soon as i do it throws an error locally [19:24:38] (03PS3) 10Dzahn: change bugzilla role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/99788 [19:27:19] (03PS10) 10Ottomata: [not ready for review] Productionizing Wikimetrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/96042 (owner: 10Milimetric) [19:31:09] (03PS11) 10Ottomata: [not ready for review] Productionizing Wikimetrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/96042 (owner: 10Milimetric) [19:36:21] ugh. bug in salt with batch runs and returners. when doing batches, the returner is ignored [19:36:33] which will make reporting for service restarts a pain [19:36:53] I'll report directly from the master for now [19:39:57] !log reedy updated /a/common to {{Gerrit|Ibceb9638f}}: beta: gwtoolset-whitelist [19:40:02] (03PS1) 10Reedy: Non wikipedias to 1.23wmf6 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100616 [19:40:15] Logged the message, Master [19:40:40] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.23wmf6 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100616 (owner: 10Reedy) [19:41:38] (03Merged) 10jenkins-bot: Non wikipedias to 1.23wmf6 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100616 (owner: 10Reedy) [19:42:39] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.23wmf6 [19:42:54] Logged the message, Master [19:43:09] hey Aaron|home, committed https://gerrit.wikimedia.org/r/#/c/100617/. there's not too much to it. are you able to take a look now and +1 if you're okay with it? [19:44:21] Reedy: what's broken? [19:44:24] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Revert that, CreditSource broken [19:44:29] CreditSource [19:44:35] I think it's only on Wikivoyage [19:44:36] ah [19:44:40] Logged the message, Master [19:44:50] it is [19:46:05] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias and wikivoyages to 1.23wmf6 [19:46:22] Logged the message, Master [19:52:21] (03PS4) 10Dzahn: change bugzilla role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/99788 [19:57:40] Hello guys, we (Brazilian Education Team) are creating a lot of pages to restructure the Education Program portal [19:57:52] so, we need to change the redirect http://educacao.wikimedia.org/ to https://pt.wikipedia.org/wiki/Wikipédia:Programa_de_Educação [20:01:29] rodrigopadula: hello, shouldn't be a problem, did that redirect last time. could you just create a bug or ticket for it? bugzilla or RT are both ok [20:01:49] (or if you wanted to, you could create a patch and upload to gerrit) [20:02:02] can you send me the links? [20:03:22] rodrigopadula: bugzilla to create ticket: https://bugzilla.wikimedia.org/ repository that has Apache config: https://gerrit.wikimedia.org/r/#/q/status:merged+project:operations/apache-config,n,z [20:03:45] how to clone it: https://wikitech.wikimedia.org/wiki/Git#Git.2FGerrit_and_the_repositories [20:09:26] !log applying filter on sandbox subnet in eqiad [20:09:43] Logged the message, Mistress of the network gear.
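Once the apache-config patch for the educacao.wikimedia.org redirect is merged and deployed, it can be spot-checked from any shell; this is a sketch, and the exact Location value (including any percent-encoding) may differ.

    # Verify the redirect target after deployment (sketch).
    curl -sI http://educacao.wikimedia.org/ | grep -i '^Location:'
    # Expect something like:
    # Location: https://pt.wikipedia.org/wiki/Wikipédia:Programa_de_Educação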
[20:11:09] !log reedy synchronized php-1.23wmf6/includes/SkinTemplate.php 'I8edbdfe615f848963e3bea47dac99d1abd64c7f7' [20:11:26] Logged the message, Master [20:12:10] Tue Dec 10 7:10:39 UTC 2013 mw1102 enwiki Connection lost and reconnected after 60.744s, query: SELECT /* SpecialHistory::doQuery Jdlrobson */ * FROM `revision` FORCE INDEX (page_timestamp) ORDER BY rev_timestamp DESC LIMIT 51 [20:14:30] looks like that stopped [20:19:01] (03CR) 10Dzahn: [C: 032] change bugzilla role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/99788 (owner: 10Dzahn) [20:21:20] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [20:22:20] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:23:12] (03PS1) 10Dzahn: remove the system role from the bugzilla module it should be (and is) in the role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/100627 [20:24:05] crap, wrong topic branch and dependency [20:24:19] !log applied a filter to inet6 sandbox subnet in eqiad [20:24:33] Logged the message, Mistress of the network gear. [20:25:46] (03PS5) 10Ryan Lane: Add a deploy.restart runner and module call [operations/puppet] - 10https://gerrit.wikimedia.org/r/100509 [20:27:11] (03PS2) 10Dzahn: remove the system role from the bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/100627 [20:27:32] well, that change ended up being larger than I imagined [20:27:40] (03PS4) 10Ryan Lane: Manual restart for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/100526 [20:29:15] (03CR) 10Dzahn: [C: 032] remove the system role from the bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/100627 (owner: 10Dzahn) [20:31:59] Ryan_Lane, that is indeed not such a small diff [20:34:00] quite a bit of it is refactoring [20:34:53] *nod* [20:35:03] ok, doing the last set of testing [20:35:15] there's a salt bug I'm currently having to work around for this [20:35:24] which means the data isn't getting put into redis [20:35:57] so if the command takes longer than 30 seconds on any minion you won't be able to see if it was successful or not [20:36:06] that's likely fine for now [20:37:33] gwicke: default batch size is set at 5. it's adjustable from the service-restart command [20:37:50] Ryan_Lane: sounds good to me [20:38:17] if you have information about the total number of nodes, then a default of 1/10 at a time or the like might also be nice [20:39:13] oh. I can do a percentage, too [20:39:19] is that preferable? [20:39:23] 10%? [20:40:04] 2.4 servers [20:40:10] that captures the impact on the total service quite well [20:40:24] don't take down more than 10% at any time [20:40:39] mutante: heh [20:40:43] easy enough [20:42:56] !log reedy synchronized php-1.23wmf6/extensions/Wikibase [20:43:11] Logged the message, Master [20:44:11] (03CR) 10Reedy: "https://gerrit.wikimedia.org/r/#/c/100418/ either needs backporting, or we wait for the wmf7 cycle..." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99739 (owner: 10Ebrahim) [20:44:21] (03PS6) 10Ryan Lane: Add a deploy.restart runner and module call [operations/puppet] - 10https://gerrit.wikimedia.org/r/100509 [20:45:21] (03PS5) 10Ryan Lane: Manual restart for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/100526 [20:52:18] (03PS6) 10Reedy: Enable AbuseFilter block option on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98002 (owner: 10John F.
Lewis) [20:52:24] (03CR) 10Reedy: [C: 032] Enable AbuseFilter block option on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98002 (owner: 10John F. Lewis) [20:57:01] (03PS7) 10Ryan Lane: Add a deploy.restart runner and module call [operations/puppet] - 10https://gerrit.wikimedia.org/r/100509 [20:58:00] (03PS6) 10Ryan Lane: Manual restart for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/100526 [21:00:19] Jenkins is much more responsive than usual... [21:01:00] Reedy: you all good to let bsitu deploy flow (ie: you done for now?) [21:01:23] I was waiting for https://gerrit.wikimedia.org/r/#/c/98002/ [21:01:27] But jenkins can't be bothered [21:03:11] (03PS1) 10Jgreen: disable bayes_auto_learn on iodine (for otrs) [operations/puppet] - 10https://gerrit.wikimedia.org/r/100673 [21:03:35] Reedy: let me know when you are done [21:04:02] (03Merged) 10jenkins-bot: Enable AbuseFilter block option on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98002 (owner: 10John F. Lewis) [21:05:20] Over 13 minutes? [21:06:10] "really, jenkins?! over 13 minutes?" [21:08:14] That's over 9000 somethings [21:08:41] gwicke: https://wikitech.wikimedia.org/wiki/Trebuchet#Restarting_a_service [21:08:44] documented [21:08:56] also documented the new config option here: https://wikitech.wikimedia.org/wiki/Trebuchet#Add_the_new_repo.27s_configuration_to_puppet [21:10:02] Reedy: I accidentally +2 this patch: https://gerrit.wikimedia.org/r/#/c/100671/, I guess this doesn't affect your config change. [21:10:33] I'm going to merge this in soon [21:11:07] (03CR) 10Jgreen: [C: 032 V: 032] disable bayes_auto_learn on iodine (for otrs) [operations/puppet] - 10https://gerrit.wikimedia.org/r/100673 (owner: 10Jgreen) [21:12:52] Ryan_Lane: awesome, thanks! [21:12:56] is that live already? [21:13:04] not yet. will be soon [21:13:08] doing some last minute testing [21:13:08] and what are the rights required for this? [21:13:15] no rights [21:13:21] if you can deploy you can do this [21:13:31] awesome [21:13:43] Reedy: did you see vito's mail regarding ACE2013? [21:14:34] Ryan_Lane: if it is easy, a 'service-stop' command as a big red button might be nice too ;) [21:14:42] heh [21:14:47] in case the running service is corrupting stuff [21:14:47] that sounds... dangerous :) [21:14:50] ah. right [21:15:08] well, I've done all the hard work for making that possible [21:15:20] I can add start/stop runners too [21:15:41] and maybe rename service-restart to service-manage, and require a start/stop/restart argument for it [21:17:07] !log reedy synchronized wmf-config/abusefilter.php [21:17:21] bsitu: Should be ok to go now... [21:17:24] Logged the message, Master [21:17:35] Reedy: all right, thx [21:18:31] (03PS1) 10Dzahn: use include in role instead of class declaration [operations/puppet] - 10https://gerrit.wikimedia.org/r/100679 [21:19:00] Ryan_Lane: being able to start the service again would be a nice bonus of course ;) [21:19:08] :D [21:19:09] indeed [21:21:35] (03CR) 10Dzahn: [C: 032] "13:25 < andrewbogott> mutante: confirmed on labs, making that change fixes the error." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/100679 (owner: 10Dzahn) [21:25:25] (03CR) 10Ryan Lane: [C: 032] Add a deploy.restart runner and module call [operations/puppet] - 10https://gerrit.wikimedia.org/r/100509 (owner: 10Ryan Lane) [21:25:36] gwicke: merging them in now [21:25:47] (03CR) 10Ryan Lane: [C: 032] Manual restart for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/100526 (owner: 10Ryan Lane) [21:35:17] (03CR) 10Bsitu: [C: 032] Add Flow to extension list for message cache. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100551 (owner: 10Spage) [21:35:46] (03Merged) 10jenkins-bot: Add Flow to extension list for message cache. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100551 (owner: 10Spage) [21:38:10] (03PS1) 10Ryan Lane: Adjust sudo call used in service-restart [operations/puppet] - 10https://gerrit.wikimedia.org/r/100685 [21:38:32] Ryan_Lane: cool, thanks- will try it tomorrow [21:43:33] (03CR) 10Ryan Lane: [C: 032] Adjust sudo call used in service-restart [operations/puppet] - 10https://gerrit.wikimedia.org/r/100685 (owner: 10Ryan Lane) [21:45:31] !log bsitu updated /a/common to {{Gerrit|Ib34f619e7}}: Enable AbuseFilter block option on Wikidata [21:45:45] Logged the message, Master [21:51:41] (03PS1) 10Ryan Lane: Test service restarts in test/testrepo [operations/puppet] - 10https://gerrit.wikimedia.org/r/100689 [21:54:02] (03CR) 10Ryan Lane: [C: 032] Test service restarts in test/testrepo [operations/puppet] - 10https://gerrit.wikimedia.org/r/100689 (owner: 10Ryan Lane) [22:01:14] Reedy: we're running mergeMessageFileList.php to check sanity of new extension-list. The output is in a different order than ExtensionMessages-1.23wmf5.php but has the same lines except for Flow.i18n.php (good) and 'SpecialCentralAuthAliasesNoTranslate' => "$IP/extensions/CentralAuth/CentralAuth.notranslate-alias.php" (??!) [22:02:05] should we worry? [22:05:22] (03PS1) 10Dzahn: include ::bugzilla instead of bugzilla [operations/puppet] - 10https://gerrit.wikimedia.org/r/100690 [22:07:52] (03CR) 10Dzahn: [C: 032] include ::bugzilla instead of bugzilla [operations/puppet] - 10https://gerrit.wikimedia.org/r/100690 (owner: 10Dzahn) [22:08:12] so SpecialCentralAuthAliasesNoTranslate showed up in wmf6, but running mergeMessageFileList.php on --wiki=enwiki (wmf5) added it. I don't know if this is significant [22:10:41] (03CR) 10Dzahn: "works now on zirconium so far." [operations/puppet] - 10https://gerrit.wikimedia.org/r/100690 (owner: 10Dzahn) [22:12:51] greg-g, RoanKattouw, Reedy: we're following some crufty guidance to check sanity of extension-list (https://wikitech.wikimedia.org/wiki/Configuration_files#extension-list_and_ExtensionMessages-XXX.php), and are confused by the result. Should we scap anyway? [22:13:40] spagewmf: See -tech [22:14:26] saw, thanks [22:15:34] https://bugzilla.wikimedia.org/show_bug.cgi?id=58292 [22:15:43] "Enable HTTPS for download.wikimedia.org (and dumps.wikimedia.org)" [22:15:53] If someone wants to cross-reference RT with that, it'd be helpful. [22:17:37] running scap in a second [22:22:55] !log bsitu started scap: Add Flow extension but not enabled yet [22:23:11] Logged the message, Master [22:27:16] mw1060: rsync: send_files failed to open "/php-1.23wmf6/.git/modules/extensions/Collection/BISECT_ANCESTORS_OK" (in common): Permission denied (13) [22:27:21] bunch of such errors [22:27:39] that's mwalker's bisect [22:28:00] oh crap [22:28:04] did I forget to end the session [22:28:26] or...
did it just leave a bunch of garbage [22:29:29] no; apparently it just leaves a bunch of garbage [22:30:38] (03CR) 10Addshore: "This will run on a single instance on the Wikidata-build project on labs" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 (owner: 10Addshore) [22:31:48] bsitu: I did apparently forget to clean up after myself -- it should stop complaining now [22:32:33] !log issued `git reset bisect a4f97e4` in extensions collection in 1.23wmf6 to clean up after yesterday's bisection mess [22:32:49] Logged the message, Master [22:33:00] mwalker: thx, it's still complaining [22:33:07] huuuurm [22:34:01] well; that file no longer exists on tin [22:34:09] I wonder if it exists on the app servers [22:36:26] mwalker, yup the files are there on some servers, e.g. mw1060 -rw-r--r-- 1 mwdeploy mwdeploy 1068 Dec 9 23:53 BISECT_LOG [22:36:42] ls -l /usr/local/apache/common/php-1.23wmf6/.git/modules/extensions/Collection/ [22:37:24] hurm; that should be ok actually [22:37:35] and going back and looking at the error it's a 'Failed to open file' error [22:42:04] mwalker: yeah that'll probably clear up (as my teenage acne Dr. said :) ). In my experience it's stuff in .git/objects/blah that sync/scap scripts can't clean up [22:46:17] greg-g: does scap update terbium (where we're going to run mwscript)? scap has been going for 30 minutes. Would be nice if it updated terbium sooner [22:46:32] !log bsitu finished scap: Add Flow extension but not enabled yet [22:46:48] Logged the message, Master [22:47:37] spagewmf: yeah (pretty sure) [22:48:03] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [22:48:15] greg-g scap did, terbium must be one of the last servers [22:48:25] * greg-g nods [22:50:40] mwscript sql.php --wiki=flowdb --cluster=extension1 extensions/Flow/flow.sql [22:50:40] /usr/local/apache/common-local/wikiversions.cdb has no version entry for `flowdb`. [22:50:40] Fatal error: /usr/local/apache/common-local/wikiversions.cdb has no version entry for `flowdb`. [22:50:40] in /usr/local/apache/common-local/multiversion/MWMultiVersion.php on line 376 [22:51:00] ^ that's us pretending 'flowdb' is a wiki to get the SQL to run in the right place [22:53:09] sql.php should have a --db param that can be different than --wiki I guess [22:53:47] Aaron|home: thx, let me try it [22:55:06] bsitu: I'll just commit that [22:55:25] Aaron|home: cool, thx [22:56:15] gwicke: oh. right. you can't do a restart using this script until the upstart is in place and the current init script is gone [22:56:20] because the batch will hang [22:56:57] should I revert the change for parsoid back to an automated restart? [22:57:41] Ryan_Lane: I guess for tomorrow's deploy this is fine [22:57:59] well, it'll mean a restart is impossible [22:58:11] using salt [22:58:23] how do you plan on doing a restart? [22:58:29] ask somebody in ops [22:58:46] we got free restarts during deploys so far, but could not restart otherwise [22:59:02] they need to do: salt -G 'deployment_target:parsoid' parsoid.restart_parsoid 'parsoid' [22:59:38] k, that matches our docs in https://wikitech.wikimedia.org/wiki/Parsoid#Misc_stuff [22:59:40] or, batched: salt -b '10%' -G 'deployment_target:parsoid' parsoid.restart_parsoid 'parsoid' [22:59:59] oh spiffy, that now works generically [23:00:02] Aaron|home: thx. Or could we pretend 'flowdb' is a wiki and add it to 'wikiversions.dat' and rebuild wikiversions.cdb ?
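For reference, the two restart invocations quoted above, as they would be run from the salt master; both commands are verbatim from the channel, and the batched form never takes down more than a tenth of the pool at once.

    # Restart Parsoid on every deployment target in parallel:
    salt -G 'deployment_target:parsoid' parsoid.restart_parsoid 'parsoid'

    # Or staggered, at most 10% of the minions at a time:
    salt -b '10%' -G 'deployment_target:parsoid' parsoid.restart_parsoid 'parsoid'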
[23:00:11] batching is a standard salt feature [23:00:18] the 10% [23:00:24] yeah, that's also standard [23:00:30] you can list a specific number, or a percentage [23:00:34] oh nice [23:00:37] yeah [23:00:51] oh, btw, you can also say "do this on a random set of minions" [23:00:57] heh [23:00:59] tweaked the docs [23:01:26] could make an interesting chaos-monkey like thing with the random feature [23:01:31] :) [23:04:05] greg-g: so we're over our window and still figuring out how to run SQL on an all-wiki centralized flowdb on an external cluster. [23:05:50] spagewmf: how close do you think you are to figuring it out? [23:06:36] Reedy: do you know of any other extension that has a cross-wiki DB? We just need to make mwscript sql.php put it in the right place [23:06:58] Aaron|home: ? [23:07:22] CentralAuth does its own [23:07:44] But it doesn't use sql.php in any shape or form [23:08:01] (03PS1) 10Chad: Fix Commons config for Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100710 [23:08:01] I wouldn't use sql.php [23:08:49] Reedy: what do you suggest? :) [23:08:57] Do it manually [23:09:02] Where is the database located? [23:09:16] spagewmf: did you see that commit? [23:09:37] spagewmf: globalusage [23:09:42] Reedy: it's located in extension1 cluster [23:09:46] (03CR) 10Chad: [C: 032] Fix Commons config for Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100710 (owner: 10Chad) [23:09:54] (03Merged) 10jenkins-bot: Fix Commons config for Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100710 (owner: 10Chad) [23:09:58] Aaron|home: yes, thanks, seems like it should work but we don't know enough about /maintenance to +2 it [23:10:09] aude: Which doesn't use sql.php and has its own globals like CA does... [23:10:15] right [23:10:39] !log demon synchronized wmf-config/CirrusSearch-labs.php 'No-op, fixing labs config' [23:10:56] Logged the message, Master [23:11:14] spagewmf: ping ^d? [23:11:29] <^d> hrm? [23:11:48] Reedy doing it manually means modifying the sql.php to have the right defaults for charset, user, etc. as I recall. prone to error [23:11:58] No [23:11:59] Don't do that [23:12:27] What exactly are you trying to do? [23:12:35] And why wasn't a way discussed before the deployment window? [23:12:48] Reedy then instead modify Flow/flow.sql? [23:13:01] Reedy: I sent out an email to the engineering list 8 days ago [23:13:09] Right [23:13:10] Reedy we are trying to create database 'flowdb' on the extension1 cluster [23:13:12] But you haven't got an answer [23:13:56] echo "CREATE DATABASE flowdb;" | sql enwiki -h 10.64.16.18 [23:14:13] Reedy: well, the answer from aaron was 'that should work' [23:14:17] the answer was --wiki=flowdb , but turns out that doesn't work without hacking wikiversions [23:14:27] So you didn't test it beforehand? [23:14:55] correct, although i'm completely unfamiliar with deployment and have no clue how it would be tested [23:15:05] (also i'm not doing the deploy, bsitu is for that reason :P ) [23:15:22] Reedy not sure how to test maintenance scripts in the production environment. [23:15:29] reedy@tin:~$ echo "CREATE DATABASE flowdb;" | sql enwiki -h 10.64.16.18 [23:15:29] ERROR 1044 (42000) at line 1: Access denied for user 'wikiadmin'@'10.%' to database 'flowdb' [23:15:36] You'll need a root/dba [23:16:28] Or make a wiki called "flowdb".
;-) [23:16:39] "make a wiki" [23:17:04] why not, we should try to break 1000 sometime soon [23:17:06] greg-g OK, so we have the extension deployed but not enabled, and we'll talk to springle about how best to do this and try again. [23:17:30] spagewmf: alright, so we're ok to sit tight, where we are, the code isn't being called until you flip the switch, right? [23:17:50] where "this" is creating database 'flowdb' on the extension1 cluster. Maybe Aaron|home's patch to add --wikidb will do what we want. [23:17:51] I note that using "sql centralauth" is a specific hack [23:18:04] so the DB is not there yet? [23:18:13] the sql.php change will not resolve that [23:18:26] Aaron|home: correct. greg-g I think so. [23:18:29] that would help with making/changing tables though [23:18:42] so yeah, let's sit tight, maybe enable tomorrow after we get someone to create the db, there's other stuff lined up [23:18:47] sorry spagewmf [23:19:52] https://gerrit.wikimedia.org/r/#/c/100701/ when there is opportunity :) [23:20:02] no worries, thanks y'all we appreciate the help. We knew the DB setup was the tricky bit [23:20:57] spagewmf: my bad, I didn't put 2 and 2 together, should've had you add that as a "schema change" or somesuch for sean to get to before the deploy [23:21:41] Flow, flow, flow your boat. [23:25:14] So... greg-g, is flow done for the day? I'd still like to get a quick centralauth deploy in [23:26:19] greg-g: no worries, we were so focused on how to tell machinery where to CREATE TABLE we skipped over the bit about creating the "where". I'll set up some dependent bugs [23:26:20] csteipp: yeah, sorry, they are [23:26:31] csteipp: We are done! [23:26:37] spagewmf: thanks! [23:28:58] gwicke: regarding yesterday's discussion about parsoid deployment; I missed the end of it; are you guys going for a single deployment repo; with your code being a submodule? [23:29:16] or did y'all end up somewhere else? [23:32:51] Error: 1193 Unknown system variable 'table_type' (10.64.16.18) [23:33:04] * Aaron|home then wonders how addWiki.php works [23:33:09] mwalker, we'll make a final decision tomorrow, but that is where we are headed [23:33:35] I'm still not so sure that debs are so impractical, but for now we want to be compatible with both debs and git-deploy [23:34:04] basically https://www.mediawiki.org/wiki/Parsoid/Packaging#Option_2:_deploy_repo_with_code_as_submodule [23:34:10] yep yep [23:34:28] renamed our node_modules repo to /mediawiki/services/parsoid/deploy [23:35:22] mwalker: we should keep in touch on node packaging [23:35:29] mathoid also needs that treatment [23:35:43] and rashomon, and.. [23:38:32] I'll actually probably just get rid of mine [23:38:38] and use operations/ocg-config [23:38:46] !log csteipp synchronized php-1.23wmf6/extensions/CentralAuth 'Update to master' [23:39:01] though... I may get into trouble there when we integrate mathoid [23:39:02] Logged the message, Master [23:43:08] !log csteipp synchronized php-1.23wmf5/extensions/CentralAuth 'Update to master' [23:43:22] Logged the message, Master [23:43:44] greg-g: I'm done [23:43:50] csteipp: awesome, thanks [23:45:31] werdna: didn't happen. 'we were so focused on how to tell machinery where to CREATE TABLE we skipped over the bit about creating the "where"' [23:45:47] wrong channel [23:49:00] (03PS3) 10Dzahn: role and module structure for ishmael [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 [23:50:15] is there a gage here yet?
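For reference, the step that was missing in the flowdb saga above would look roughly like this when run by a root/DBA account against the extension1 master. This is a sketch only: the grant scope and privilege level are guesses; the host 10.64.16.18 and the 'wikiadmin'@'10.%' user come from the failed attempt quoted earlier.

    # Hypothetical DBA step to create flowdb on extension1 (sketch).
    mysql -h 10.64.16.18 <<'SQL'
    CREATE DATABASE flowdb;
    -- grant scope is a guess; wikiadmin is the user the failed CREATE ran as
    GRANT ALL PRIVILEGES ON flowdb.* TO 'wikiadmin'@'10.%';
    SQL

After that, the Flow schema could be applied through the usual maintenance tooling once sql.php grows a --db-style parameter, as discussed above.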
[23:50:28] (03CR) 10Dzahn: role and module structure for ishmael (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn) [23:51:30] ori-l: ^ there, way more parameters for all the stuff that changes between labs and prod [23:52:15] hell, i even made AuthName "WMF Labs (use wiki login name not shell)" one in https://gerrit.wikimedia.org/r/#/c/96403/3/modules/ishmael/manifests/init.pp [23:54:38] (03PS4) 10Dzahn: role and module structure for ishmael [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 [23:56:31] (03CR) 10Ori.livneh: "Looks good! Could you move Wikimedia-cluster-specific parameter values to the role class and leave the default values in the module unspec" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn)