[00:03:14] Reedy: http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin&type=signup&returnto=Pride+of+Baltimore+Chorus [00:03:33] or well, http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin&type=signup [00:03:59] something specific to production that isn't defined for beta? [00:05:11] Yup, looks like what AaronSchulz changed for ConfirmEdit [00:05:32] beta doesn't use swift at all, does it? [00:05:39] nope [00:06:45] Do we just need to change wgCaptchaFileBackend? [00:06:50] (for beta) [00:07:08] yeah, possible just unset it ;) [00:07:16] well, make it false [00:07:32] false works [00:07:46] notpeter: Im in watchmouse adjusting some schedules [00:07:53] you are presently set to always page, regardless of time of day [00:08:00] would you like me to set you to 8am to 11pm pst? [00:08:03] default is '' [00:08:21] let's go with false [00:09:03] robh: no thanks. I like the est alert hours, actually [00:09:10] TimStarling, woosters: My tasks for the eqiad migration are mostly done but I'm waiting for review, see https://gerrit.wikimedia.org/r/#/c/44183/ and https://gerrit.wikimedia.org/r/#/c/44160/ [00:09:11] New patchset: Ryan Lane; "Get hash items, not just keys" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44184 [00:09:16] notpeter: you arent set to any hours [00:09:19] you are set to always page. [00:09:25] oh [00:09:26] really? [00:09:32] I guess that's true... [00:09:40] nah, I'm cool with watchmouse paging me [00:09:42] but thank you [00:09:42] yep, set from 0:00 to 23:59 ;] [00:09:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44184 [00:09:46] k [00:12:06] Ryan_Lane: How do I add servers to the salt group for Parsoid deployment? [00:12:28] RoanKattouw: change the regex in role/deployment.pp [00:12:39] I'll need to initialize them [00:12:40] so many channels [00:14:02] New patchset: Reedy; "Don't set $wgCaptchaFileBackend to 'global-multiwrite' for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44185 [00:16:24] !log testing deployment destination initialization on hume [00:16:34] Logged the message, Master [00:17:12] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44185 [00:17:48] New patchset: Catrope; "Add the new eqiad Parsoid servers to the salt regex for deployment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44187 [00:17:56] Ryan_Lane: ---^^ [00:18:33] * MJ94 watches [00:18:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44187 [00:18:46] New patchset: Reedy; "Don't set $wgCaptchaFileBackend to 'global-multiwrite' for labs" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/44188 [00:18:56] Ryan_Lane: What do you do to initialize those salt minions then? 
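Backing up to the ConfirmEdit fix merged above: change 44185 ("Don't set $wgCaptchaFileBackend to 'global-multiwrite' for labs") amounts to making the captcha storage backend realm-dependent. A minimal sketch of that guard, assuming nothing beyond the names quoted in the conversation ($wmfRealm, $wgCaptchaFileBackend, 'global-multiwrite'):

```php
// Sketch of the beta captcha fix discussed above. Production stores captcha
// images in the Swift-backed 'global-multiwrite' file backend; beta has no
// Swift, so labs falls back to false (local filesystem storage), as agreed
// in the conversation ("false works").
if ( $wmfRealm === 'labs' ) {
	$wgCaptchaFileBackend = false;
} else {
	$wgCaptchaFileBackend = 'global-multiwrite';
}
```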
It would be good to have that documented so people can add new boxes if they're not you ;) [00:18:58] Ryan_Lane: what you all are doing is way over my head, but I'm interested [00:19:00] hah [00:19:07] Change merged: Reedy; [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/44188 [00:19:12] Ryan_Lane: ^ give that a try [00:19:14] RoanKattouw: so, a couple steps [00:19:20] RoanKattouw: first merge the change [00:19:25] then run puppet on sockpuppet [00:19:28] that's 1 step [00:19:58] hey, the login button works, Ryan_Lane :) [00:20:10] create I'm sure still doesn't :) [00:20:13] we didn't fix it yet [00:20:15] nope [00:20:29] I was going to create an account and explore :) [00:20:38] yeah [00:20:40] about to fix [00:20:41] one sec [00:20:47] Ryan_Lane: no rush [00:20:53] sorry to bother you [00:21:15] well, this is actually somewhat related [00:21:19] we need people testing beta right now [00:21:26] why? [00:21:39] because we're testing our new deployment system [00:21:52] beta is upcoming changes to wikimedia software, right? [00:21:54] MJ94: ok, try now [00:22:02] like, you tested PC on beta before en? [00:22:08] nope [00:22:12] force refresh [00:22:15] it's working for me [00:22:17] Ran out of captcha images [00:22:17] Backtrace: [00:22:21] I did [00:22:22] ugh [00:22:47] MJ94: link to the page you are hitting? [00:22:57] http://deployment.wikimedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin&type=signup&returnto=User:Riley+Huntley&returntoquery=type%3Dsignup [00:23:07] works for me [00:23:18] now it does [00:23:20] heh [00:23:31] kinda Ryan_Lane http://puu.sh/1Oeku [00:23:35] I still get 'ran out of capcha images' [00:23:43] RoanKattouw: after running puppet... [00:24:36] andrewbogott: better than a stack trace ;) [00:24:52] oh, and now it works for me too [00:24:58] >>magic<< [00:25:04] So probably I am not being helpful [00:25:09] RoanKattouw: salt -E '^(wtp1|mexia|tola|lardner|kuo|celsus|constable|wtp1001|caesium|xenon|cerium|praseodymium)\..*' saltutil.sync_all [00:25:19] RoanKattouw: salt -E '^(wtp1|mexia|tola|lardner|kuo|celsus|constable|wtp1001|caesium|xenon|cerium|praseodymium)\..*' saltutil.refresh_pillar [00:25:37] RoanKattouw: salt -E '^(wtp1|mexia|tola|lardner|kuo|celsus|constable|wtp1001|caesium|xenon|cerium|praseodymium)\..*' deploy.sync_all [00:26:06] oh [00:26:07] wait [00:26:08] sorry [00:26:15] scratch all three of those commands [00:26:37] RoanKattouw: salt -E '^(wtp1|mexia|tola|lardner|kuo|celsus|constable|wtp1001|caesium|xenon|cerium|praseodymium)\..*' state.highstate [00:27:28] RoanKattouw: on tin you can check their status using: deploy-info --detailed --repo="parsoid/Parsoid" [00:27:57] checkout seemed to have failed for some reason [00:28:35] checkout error of 10.... [00:28:52] Ryan_Lane: right clicking the broken captcha image and opening it in a new tab allows me to see it [00:29:48] seems that only cerium.wikimedia.org had issues [00:30:11] This is after running puppet on *sockpuppet* , not on the nodes themselves, right? [00:30:23] all of this is on sockpuppet [00:30:41] OK [00:31:05] So sockpuppet is the salt server as well as the puppet server then? Is that how it works? [00:31:18] sockpuppet isn't actually a puppet server, it seems [00:31:25] it's just the certificate signer [00:31:29] hah [00:32:58] hm. what's up with cerium? 
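The documentation RoanKattouw asks for above boils down to three steps. A rough recap as a shell sketch: the salt target regex and the deploy-info invocation are the ones quoted in the conversation, while the exact way puppet is kicked on sockpuppet is an assumption:

```bash
# 1. Widen the host regex in role/deployment.pp, merge the change, then run
#    puppet on sockpuppet (invocation assumed; sockpuppet signs the certs).
puppet agent --test

# 2. From the salt master, converge the newly matched minions:
salt -E '^(wtp1|mexia|tola|lardner|kuo|celsus|constable|wtp1001|caesium|xenon|cerium|praseodymium)\..*' state.highstate

# 3. On tin, confirm every minion checked out the expected tag:
deploy-info --detailed --repo="parsoid/Parsoid"
```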
[00:33:49] clone failed for it for some reason [00:34:48] The requested URL returned error: 403 while accessing [00:35:41] I must have a bad cidr match [00:36:44] New patchset: Ryan Lane; "Widen the allowed deployment targets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44190 [00:37:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44190 [00:38:25] RoanKattouw: deploy-info --repo="parsoid/Parsoid" --full-report [00:38:28] that's also useful [00:38:35] though way too full of information [00:39:55] * Riley got pinged for some reason [00:40:12] RoanKattouw: ok. all [00:40:16] Repo: parsoid/Parsoid; checking tag: parsoid/Parsoid-20121213-004439 [00:40:16] 0 minions pending (11 reporting) [00:40:20] RoanKattouw: ^^ [00:40:55] Yay [00:40:57] Thanks [00:41:03] maybe I should change reporting to "total minions" [00:41:27] Ryan_Lane: Hmm so that deploys but it doesn't run the hooks [00:41:32] Both repos are in place but the symlink is not [00:41:34] it didn't run the gooks? [00:41:40] err [00:41:41] hooks [00:41:44] that's wrong, then [00:43:24] /usr/bin/parsoid: line 11: cd: /var/lib/parsoid/Parsoid/js/lib: No such file or directory [00:44:22] Ahm, ? [00:44:38] Oh right crap [00:44:39] Ahm [00:44:41] Yeah [00:44:47] I need a symlink there too, I forgot about that [00:44:51] Ryan_Lane: A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was: [00:44:52] (SQL query hidden) [00:44:52] from within function "GeoData::getAllCoordinates". Database returned error "1146: Table 'labswiki.geo_tags' doesn't exist (10.4.0.53)". [00:45:18] Reedy: known problem? ^^ [00:45:28] or another beta specific config problem? [00:45:32] "known problem" [00:45:41] I beat MaxSem about that one before [00:45:43] heh [00:45:52] MJ94: thanks for the reports :) [00:45:53] MJ94: Which wiki? [00:46:07] http://deployment.wikimedia.beta.wmflabs.org/w/index.php?title=User:MJ94&action=submit [00:46:10] It's enabled on a quite random set of wikis, so it's likely some of the tables weren't added to beta [00:46:30] ok. time to initialize all of tampa [00:46:41] wheee [00:46:44] !log initializing all deployment destinations in pmtpa [00:46:46] Ryan_Lane: I'm not a programmer, I'm just a wikimedia editor. I don't know what any of that means so I hope it helps :) [00:46:53] Logged the message, Master [00:47:03] I'm willing to help you all I can, Ryan_Lane, Reedy [00:47:05] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/WikimediaMaintenance.git;a=commitdiff;h=bab909642b8a4f0f9a8e1fe56589e21d8f2d9484 [00:47:10] MJ94: your reporting is helpful [00:47:48] I'm going to do this in batches of 10. hopefully that actually fans things out [00:47:50] extensions/GeoData/sql/externally-backed.sql [00:48:24] -_- [00:48:34] 'wmgEnableGeoData' => array( [00:48:34] 'default' => false, [00:48:34] 'wiki' => true, [00:48:34] 'wikivoyage' => true, [00:48:34] ), [00:48:34] Useful, ehn? [00:48:41] I guess batched run is broken> [00:48:42] * Ryan_Lane sigh [00:49:53] Whoa, WTF is going on with praseodymium [00:50:00] RoanKattouw: ? 
[00:50:08] Its SSH key is changing like every minute [00:50:32] stop making up words ;) [00:50:39] Try ssh root@praseodymium [00:50:39] ssh keeps restarting [00:50:43] The connection will drop after a bit [00:50:52] When you try to SSH back in, the host key will have changed [00:51:46] I can't access it with salt, either [00:52:38] ok, I guess I'm initializing everywhere with no fanout [00:52:40] that should be fun [00:53:50] wow. that's incredibly spammy [00:55:16] RobH: Hey so there's something really fucking weird with praseodymium [00:55:19] Apart from the name [00:55:29] According to Ryan its sshd restarts a lot [00:55:45] thanks Ryan_Lane :) [00:55:48] I have observed that if I ssh into it, the connection drops after a minute, then sshing back in it yells at me because the host key has changed [00:55:51] it's probably much worse than that [00:55:54] I'm just looking around [00:55:59] Yes, this is much worse [00:56:02] MJ94: Try again [00:56:03] Beta looks like an abandoned Wikipedia [00:56:04] MJ94: cool. thank you :) [00:56:04] The machine's host key is changing every minute [00:56:11] Why is sql.php sooooo picky? [00:56:20] and tin's network is maxed out :) [00:56:22] Reedy: the edit went through, the error just appeared [00:56:32] MJ94: Make another edit then :p [00:56:49] root@i-00000390:/srv/deployment/mediawiki/slot0/extensions# foreachwiki sql.php GeoData/sql/externally-backed.sql [00:56:50] /usr/local/bin/foreachwikiindblist: line 4: /srv/deployment/mediawiki/common/all.dblist: No such file or directory [00:56:53] rarrrfgghhh [00:56:53] gfh [00:57:15] Reedy: http://puu.sh/1Of3W [00:57:30] http://deployment.wikimedia.beta.wmflabs.org/w/index.php?title=Special:Captcha/image&wpCaptchaId=2076100364 [00:57:46] That's an inconvenient way to see the captcha [00:58:10] speaking of captchas, is en.wp's fixed? ACC was going through hell for a while [00:58:36] Reedy: A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was: [00:58:38] (SQL query hidden) [00:58:38] from within function "GeoData::getAllCoordinates". Database returned error "1146: Table 'labswiki.geo_tags' doesn't exist (10.4.0.53)". [00:58:50] Reedy: http://deployment.wikimedia.beta.wmflabs.org/w/index.php?title=User:MJ94&diff=prev&oldid=2434 [00:58:51] Wrong database then [00:59:05] I'm going to do it everywhere when I pick script errors apart [01:00:08] Labs feels slightly out of date.. [01:00:39] why does this exist o_O http://en.wikipedia.beta.wmflabs.org/wiki/Boy_band [01:01:54] MJ94: for the lovely AFTv5 feedback form at the bottom of the page (and every other page on beta) [01:02:48] should this page be deleted? http://en.wikipedia.beta.wmflabs.org/wiki/Bug44010 [01:03:26] I'm not totally sure how content is handled inside of beta [01:03:39] MJ94: Feel free to ask an admin that [01:04:03] Riley: should this page be deleted? http://en.wikipedia.beta.wmflabs.org/wiki/Bug44010 [01:04:11] No [01:04:17] Riley: I don't know how this works, I don't even go here [01:04:23] MJ94: it's not doing any harm [01:05:06] MJ94: It's not doing any harm and it was created by an experienced person [01:05:11] http://toolserver.org/~quentinv57/sulinfo/Nemo%20bis [01:05:47] MJ94: I'd prefer to keep pages in beta enwiki that might have even the most vague historical significance. 
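As a side note on the praseodymium host-key symptom earlier in this exchange, a non-destructive way to confirm a key that changes every minute (the FQDN is a guess; ssh-keyscan does not open a login session, so it sidesteps the dropping connections):

```bash
# Fetch the host key twice, a minute apart; differing output confirms the
# "host key has changed" warnings described above.
ssh-keyscan -t rsa praseodymium.eqiad.wmnet   # hostname assumed
sleep 60
ssh-keyscan -t rsa praseodymium.eqiad.wmnet
```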
[01:08:22] MJ94: try again again [01:09:04] Ran out of captcha images [01:09:04] Backtrace: [01:09:04] #0 /srv/deployment/mediawiki/slot0/extensions/ConfirmEdit/Captcha.php(60): FancyCaptcha->getForm() [01:09:04] #1 [internal function]: SimpleCaptcha->editCallback(Object(OutputPage)) [01:09:07] etc [01:09:09] hm. maybe we should bond some ports on the deployment system [01:09:20] meh. we won't need to with bittorrent [01:09:34] Ryan_Lane: Could you let me know when tin is no longer saturated? [01:09:38] I tried to ssh into it but can't [01:09:41] it'll be a little bit [01:09:44] you can't ssh into it? [01:09:54] Once tin isn't under crazy load any more, I need to try git-deploying Parsoid in eqiad [01:10:08] praseodymium is completely fucked, we're giving up on that box for the day, Rob will reinstall it tomorrow [01:10:20] ssh is slow, lemme see if I ultimately get through [01:10:24] Reedy: ^ [01:10:27] ideally it'll never be this bad [01:10:36] latency from fenari to tin is ~1200 ms though [01:10:36] I wasn't fixing the captcha errors [01:10:36] ideally [01:10:38] this is pushing everything to every system [01:11:00] OK I'm in [01:11:03] * AaronSchulz drags Ryan_Lane down from his cloud castle :) [01:11:04] It's slow, but I'm in [01:11:07] :) [01:11:13] AaronSchulz: yeah :( [01:11:34] well, when we switch to bittorrent for l10n it won't be nearly as bad ;) [01:12:46] heh I like the log message prompt [01:12:59] it only writes to redis right now [01:13:30] I need to modify ircecho to pop messages from redis [01:14:15] Ryan_Lane: JSON data wasn't loaded from the pillar call. git-deploy can't configure itself. Exiting. [01:14:20] Is that supposed to happen? [01:14:45] no [01:14:49] use retry [01:15:00] or did it just fail out completely? [01:15:02] quittin time. [01:15:05] cya folks later [01:15:26] OK, retrying [01:15:43] now we can all talk smack about RobH ;) [01:15:51] Ryan_Lane: http://pastebin.com/B50rHhMm [01:16:05] 11 minions pending (11 reporting) [01:16:07] use d [01:16:16] is it just hanging now? [01:16:43] it may not be having an easy time with the network right now [01:16:51] seeing as that it's completely saturated [01:17:03] using d [01:17:09] New review: Reedy; "Seems to be a dupe of https://gerrit.wikimedia.org/r/#/c/44162/" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/44164 [01:17:10] that'll tell you the status [01:17:12] I should note that there's about a half-second lag in my keypresses alone [01:17:18] $DP = $IP; [01:17:23] Reedy: what is that? [01:17:29] RoanKattouw: yes, because of the network load [01:17:39] http://pastebin.com/cY6sABEv [01:17:48] Ryan_Lane: I realize that, just giving an indication of how bad it is [01:17:49] you picked a bad time to deploy. heh [01:18:04] Yeah I did [01:18:07] so, it started [01:18:16] hit enter [01:18:18] that's the same as c [01:18:27] of course it may take longer than normak [01:18:29] *normal [01:18:32] I pressed enter [01:18:34] with parsoid it's usually very quick [01:18:38] Where's that? : [01:18:38] AaronSchulz: It's an assignment of one variable to another [01:18:39] I think you mean that's the same as y? There's no c [01:18:57] C]oncise report [01:18:57] Now it just says the same again [01:18:59] [C]oncise report [01:19:06] Oh [01:19:12] RoanKattouw: that means it's still waiting [01:19:15] Repo: parsoid/Parsoid; checking tag: parsoid/Parsoid-20130116-011222 [01:19:18] 11 minions pending (11 reporting) [01:19:19] Continue? 
([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): [01:19:21] OK [01:19:29] So it's allowing me to periodically check how many have finished? [01:19:33] yes [01:19:37] Ultimately all but one (praseodymium) should finish [01:19:50] you don't want to use 'y' until everthing is fetched [01:20:12] that system won't show up if it never checked in [01:20:44] seeing as that it's broken, I doubt it'll show ;) [01:20:58] damn l10n [01:21:26] Alright I'll just wait a while for it to report success [01:21:35] Or for your push to be done [01:21:37] some of my minions are starting to return [01:21:44] so that means most of them will also be done soonish [01:22:58] I need to fix mw-deployment-vars.erb... Well, the variables used in them... From the private repo.. MW_DBLISTS/MW_DBLISTS_SOURCE [01:23:19] it's in the private repo? [01:23:26] oh, you mean private mw repo [01:23:43] <%= mw_common %> [01:23:47] no, private puppet repo? [01:23:58] ah. that's a password or something? [01:24:17] It's a path [01:24:24] why would it be in the private repo, then? [01:24:36] I've no idea [01:24:40] it isn't [01:24:42] it's in misc/deployment.pp [01:25:15] in misc::deployment::vars set variables $mw_dblists and $mw_dblists_source [01:25:15] * Reedy facepalms [01:25:15] I had that open [01:25:15] Thanks [01:25:19] yw [01:25:33] then modify the template to use those variables instead of fixed locations [01:25:40] RoanKattouw: ok, check now [01:25:45] network spike is dropping [01:25:49] TimStarling: am I supposed to be reviewing the cgroups stuff? [01:25:51] TimStarling: need to fix the dblist locations.. [01:26:25] Ryan_Lane: They're still all out, but tin is more responsive to keypresses now [01:26:29] AaronSchulz: you can if you like [01:27:59] RoanKattouw: use retry [01:28:10] it says started: 55 mins [01:28:26] it must have not been able to make the request [01:30:30] New patchset: Reedy; "Add and update variables for dblist locations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44195 [01:30:50] Reedy: the udp log stuff seems to be split between IS and CS [01:30:52] I think some of the clients timed out on my deploy [01:30:59] maybe that can be it's own file or something [01:31:02] well, initialization, not deploy [01:31:03] AaronSchulz: Haven't we discussed this before? ;) [01:31:19] retrying [01:31:34] AaronSchulz: I think so too tbh [01:31:45] RoanKattouw: thoughts on the interface for this? ok? anything need work? [01:32:08] fail [01:32:10] It's reasonable once you understand what it's reporting [01:32:15] I was totally baffled at first [01:32:37] well, that's not good. heh [01:32:45] New patchset: Reedy; "Add and update variables for dblist locations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44195 [01:32:49] A brief message saying "the minions are taking more than a minute to report, you have to query their status manually and type 'y' when they're done" or something along those lines would be helpful [01:32:58] ah. ok [01:33:07] Because that's what's going on, right? [01:33:12] only kind of [01:33:23] it does the fetch and immediately gives you a report [01:33:38] they may immediately be done [01:33:42] they may take a while [01:33:43] Meh they're still not responding [01:33:46] no? [01:34:07] I retried, 2 mins in now, and none of them have come in yet [01:34:15] Reedy: I with getRealmSpecificFilename was part of a class, heh [01:34:16] [started: 2 mins, last-return: 2 mins] [01:34:23] it looks funny [01:34:29] Oh does that mean they're done? 
[01:34:33] maybe it isn't reporting its tag properly [01:34:35] TimStarling: Could you review https://gerrit.wikimedia.org/r/44195 please? [01:34:40] fetch: 0 [started: 2 mins, last-return: 2 mins] [01:34:43] the way it reports on the fetch tag is odd [01:34:47] fetch: 0 means success? [01:34:49] yes [01:34:53] but... [01:34:56] it may be an old status [01:35:03] celsus and constable are reporting different tags [01:35:15] origin tags, or deployed tags? [01:35:20] oh [01:35:22] you mean git tag [01:35:30] The tags in the detailed report [01:35:39] celsus.wikimedia.org: parsoid/Parsoid-20121211-215118-2-g76249ae (fetch: 0 [started: 2 mins, last-return: 2 mins]) [01:35:44] wtp1001.eqiad.wmnet: parsoid/Parsoid-20121213-004439 (fetch: 0 [started: 2 mins, last-return: 2 mins]) [01:35:47] The latter is the common one [01:35:51] The former is one that only 2 have [01:35:54] * Ryan_Lane nods [01:36:05] Also, there is one minion that is down completely, that one isn't in the list at all [01:36:09] yes [01:36:13] But I don't care about that box right nwo [01:36:19] that's normal. if it's never checked in it won't appear [01:37:28] hm. they all return ping [01:37:36] Reedy: done [01:37:42] I still don't understand what the status reports actually mean [01:37:55] Does fetch:0 mean success? Failure? Not reported yet? [01:38:05] still running? [01:38:35] fetch: 0 says the last reported status was 0 [01:39:11] New patchset: Tim Starling; "Use PHP_SAPI instead of php_sapi_name()" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44198 [01:39:11] New patchset: Tim Starling; "Fix inappropriate use of die()" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44199 [01:39:12] New patchset: Tim Starling; "Prevent MediaWiki maintenance scripts from running as privileged users" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44200 [01:39:16] And what does 0 mean? [01:39:19] New patchset: Reedy; "Add and update variables for dblist locations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44195 [01:39:24] returned successfully [01:39:32] but again, that's the last reported status [01:39:33] New patchset: Reedy; "Update size related dblists" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44201 [01:39:42] OK [01:39:47] so.... [01:39:47] Should I just say 'y' and let it move on? [01:39:51] It looks like it may have just succeeded [01:39:52] I would [01:40:00] I think it's returning the fetch tag improperly [01:40:01] And I'm doing a --force sync just to run the hooks [01:40:06] oh [01:40:09] that's likely why [01:40:34] git describe origin doesn't report that properly sometimes [01:40:36] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44201 [01:40:47] if the tags point to the exact same thing [01:41:07] or not [01:41:30] it was missing the tag, it seems [01:41:39] wow fenari is almost unusable [01:41:40] RoanKattouw: I'd start it again [01:41:56] RoanKattouw: something may have went wonky with the other deployment wreaking havoc [01:42:27] Thought it was just my connection ;) [01:43:16] New review: Tim Starling; "Needs testing." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44200 [01:45:31] RoanKattouw: so, I'd do "y" all the way through, and redo the deploy with --force [01:45:44] seeing as that this one failed [01:48:34] RoanKattouw: poke poke [01:49:12] I need to tell some systems to initialize and would prefer not to break your deploy again by saturating the link again [01:50:21] ok, then I guess I'll go for it [01:52:16] OK will do [01:52:42] well, I'm initializing some hosts. it may saturate again [01:52:49] Reedy: I gave that change +2, do you want me to deploy it? [01:53:14] indeed. saturated [01:53:18] Please. Having working wrapper scripts is helpful [01:53:29] Reedy: way to many hook handlers in CS too [01:53:35] *too [01:54:11] Ryan_Lane: you don't seem to be making much difference to the overall network graphs ;) [01:54:25] yeah. it's just pegging out a single system ;) [01:54:31] The appservers do show otherwise [01:54:43] bittorrent may increase that some [01:54:44] Ryan_Lane: OK now it's failing in a different way [01:54:51] 9 minions successful [01:54:53] But 2 pending: [01:54:55] constable.wikimedia.org: parsoid/Parsoid-20121211-215118-2-g76249ae (fetch: 0 [started: 0 mins, last-return: 0 mins]) [01:54:57] celsus.wikimedia.org: parsoid/Parsoid-20121211-215118-2-g76249ae (fetch: 0 [started: 0 mins, last-return: 0 mins]) [01:54:58] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44195 [01:55:02] I think those two are just f-ed somehow [01:55:12] y-ing through it [01:55:16] ok [01:55:48] Because I'm force-deploying for the new machines' benefit and celsus and constable aren't new, I'm just ignoring them, but this is werid [01:55:50] *weird [01:55:58] yes. that shouldn't happen [01:56:14] I'm going to see what's occuring on the systems [01:56:15] Does it seriously prompt you every time to ask you if yuo want to continue? [01:56:21] If everything succeeds, that's annoying [01:56:29] heh [01:56:32] good point [01:56:35] The second step succeeded completely (0 pending, 11 reporting), and it still prompted me [01:56:39] notpeter: where is the php.ini conf? [01:58:13] RoanKattouw: I think those two systems will report properly for fetch next time you push something real out [01:58:38] git reports tags oddly when it sees things point to the same objects [01:59:11] may be good to find the tag, then follow it through to the object and report on the object [01:59:22] well, the commit, that is [01:59:26] Oh right [01:59:33] I forgot about the damn node modules [01:59:37] The compiled stuff [01:59:38] I found this part of the reporting to be difficult [01:59:40] Will have to rsync that over [01:59:48] OK [02:00:00] That makes sense [02:00:09] Maybe the comparison should be against actual hashes, not tags, then? [02:00:22] yeah, but you need to follow the tag through for that [02:01:11] Reedy: it's deployed to tin, need it anywhere else? [02:01:18] from deployment's POV, it knows you are looking for a specific tag, not necessarily the hash [02:01:30] but that's easy enough to correlate [02:02:25] Instead of a == b, use $(git rev-parse $a) == $(git rev-parse $b) I suppose [02:03:18] * Ryan_Lane nods [02:04:04] TimStarling: No, that's goods thanks. 
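RoanKattouw's $(git rev-parse $a) == $(git rev-parse $b) suggestion at the end of this exchange, spelled out as a sketch. The tag names are the two quoted earlier; peeling with ^{commit} is an addition so annotated tags resolve through to the underlying commit:

```bash
# Compare deployment tags by the commit they point at, not by name, so two
# tags cut for identical content still compare equal in the status report.
a='parsoid/Parsoid-20121213-004439'   # tag a minion reports
b='parsoid/Parsoid-20130116-011222'   # tag the deploy is checking for

if [ "$(git rev-parse "$a^{commit}")" = "$(git rev-parse "$b^{commit}")" ]; then
	echo "minion is on the expected commit"
else
	echo "minion is pending"
fi
```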
Just doing labs bastion now [02:04:16] I need to figure out a way to git-deploy-ify the node_modules dir [02:04:30] I'm tired of using tar&scp to deploy it [02:04:42] It never changes but it still needs to make it out to the hosts for the initial deploy [02:04:45] -_- well, here's a fun bug. some clones failed, but succedded at making an empty repo [02:06:51] Alright, Parsoid in eqiad is up [02:06:58] With one Varnish rather than two, but that's the case in pmtpa as well [02:11:56] RoanKattouw: you tested this read-only mode code right? [02:23:31] I did [02:23:48] I failed to get my MySQL in ro mode so I just gave MW a different user with SELECT rights only [02:23:52] Made sure I couldn't edit pages [02:24:01] Then switched from memc to DB cache and cleared it [02:24:24] Also verified that the short-cache-timeout-on-error bit works [02:24:37] By introducing a typo in a filename [02:26:44] uuuggghhh [02:26:53] snapshot1002 : /dev/sda1 9611492 9433288 0 100% / [02:28:11] !log LocalisationUpdate completed (1.21wmf7) at Wed Jan 16 02:28:11 UTC 2013 [02:28:24] Logged the message, Master [02:29:46] Did they not get "fixed" like the apaches then? [02:29:55] seems not [02:30:30] all the snapshot boxes are now at 100% on / [02:30:54] some have mediawiki on /a [02:32:59] TimStarling: I'm going home in a minute. See above for what I did to test my change. I'll be out tomorrow but in the office on Thursday and Friday [02:33:08] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [02:33:08] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [02:33:08] RECOVERY - Parsoid on caesium is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.055 seconds [02:33:18] I guess I'll make /srv a link to /a? [02:33:22] annoying [02:33:35] RECOVERY - Parsoid on xenon is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.055 seconds [02:33:35] RECOVERY - SSH on msfe1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:35:23] PROBLEM - Parsoid on cerium is CRITICAL: Connection refused [02:36:17] PROBLEM - Parsoid on praseodymium is CRITICAL: Connection refused [02:36:26] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [02:40:05] New patchset: Reedy; "else if -> elseif" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44206 [02:40:14] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44198 [02:40:50] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44199 [02:41:35] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44206 [02:41:45] ok. I think we're ready to try out test [02:41:59] RECOVERY - NTP on msfe1002 is OK: NTP OK: Offset -0.002246379852 secs [02:42:34] Nice [02:42:52] snapshots aren't done updating, but otherwise hosts are initialized [02:43:13] Author: jenkins-bot [02:43:28] it's srv193.pmtpa.wmnet ? [02:44:10] Yup [02:44:42] I don't see an nfs mount for that one [02:45:39] nfs mount for what? [02:45:48] for mediawiki? [02:47:39] I should be seeing that, right? [02:47:51] or does test no longer use nfs? [02:47:59] it uses /home [02:48:40] or did.. [02:48:56] not anymore it seems [02:49:18] ah [02:49:23] ignore me [02:49:58] I probably really need to use test2 [02:50:01] not test [02:50:19] ah. 
that's not a specific host [02:50:23] yeah [02:50:28] * Ryan_Lane grumbles [02:50:29] that's served from anywhere [02:51:02] Can't we change the apache config on srv193? [02:51:05] yes [02:51:06] and tell puppet to go away [02:52:30] oh. wait [02:53:25] I think apache hits /usr/local/apache/common-local/ [02:53:32] then mediawiki says "no, actually use /home" [02:53:44] haha [02:54:03] I don't see any special hacks in apache for test [02:54:42] Just every request for that site is routed to srv193? [02:54:49] yeah, via squid [02:54:52] Of course, this is easily confirmed [02:54:55] that's how I found test [02:54:57] * Reedy breaks testwiki [02:55:15] well, that's how I found which one was test, by looking in squid config [02:55:27] heh [02:55:36] http://test.wikipedia.org/ [02:55:41] Looks broken to me ;) [02:55:48] unbreak it :) [02:56:01] I was going to see if common-local was used by moving it [02:56:35] Yeah, it's not using /usr/local/apache/common-local/php-1.21wmf7/LocalSettings.php [02:56:36] yep [02:56:37] broken [02:56:52] ok. let's test the new code [02:57:07] ohh [02:57:16] it's the multiversion wrapper, isn't it?> [02:57:36] nah, we have weird config in the common repo for this [02:57:59] broken [02:58:02] Unable to open wikiversions.cdb. [02:58:29] yeah, it's not in common [02:58:40] That feels deja vu [02:58:52] it's in the one on tin.... [02:59:04] I'm assuming it's .gitignored? [02:59:09] FYI it's all owned root:root... [02:59:13] which means it won't get pulled either, right? [02:59:18] that's fine [02:59:35] there's nothing suid or sgid [02:59:36] Means I couldn't write a wikiversions.cdb file [02:59:43] eh? [02:59:43] Lets remove it from gitignore [02:59:51] reedy@srv193:/srv/deployment/mediawiki/common$ ./multiversion/refreshWikiversionsCDB [02:59:51] Warning: dba_open(/srv/deployment/mediawiki/common/wikiversions.cdb.tmp): failed to open stream: Permission denied in /srv/deployment/mediawiki/common/multiversion/refreshWikiversionsCDB on line 44 [02:59:51] Unable to create wikiversions.cdb.tmp. [02:59:58] oh [03:00:00] right [03:00:06] we want to do this from the deploy host [03:00:29] you want to do the honors, or me? [03:00:39] New patchset: Reedy; "Remove wikiversions.cdb" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/44207 [03:00:44] oh [03:00:53] you didn't add it locally, did you? [03:00:59] you can did git add -f [03:01:01] *do [03:01:05] New patchset: Reedy; "Remove wikiversions.cdb from .gitignore" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/44207 [03:01:11] then it'll get pulled remotely [03:01:20] ohh [03:01:45] syncing [03:01:59] Change merged: Reedy; [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/44207 [03:02:14] PROBLEM - Apache HTTP on srv193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error [03:02:26] nagios-wm_: yes, yes, we know [03:02:42] srv 193 is *only* used for test, right? [03:02:48] yup [03:02:54] ok. 
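The force-add trick Ryan_Lane describes just above, written out: gitignored files such as wikiversions.cdb get committed locally on the deploy host so the targets pull them, without anyone accidentally pushing them to gerrit. The commit message is illustrative:

```bash
# On the deploy host (tin), inside the git-deploy-managed checkout:
cd /srv/deployment/mediawiki/common
git add -f wikiversions.cdb    # -f overrides the .gitignore entry
git commit -m 'Local-only commit: ship wikiversions.cdb to deploy targets'
# The commit never goes to gerrit; the next deploy sync distributes it.
```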
good :) [03:03:04] no LocalSettings.php [03:03:05] load average: 0.06, 0.13, 0.19 [03:03:13] I need to put that in locally as well [03:04:03] RECOVERY - Apache HTTP on srv193 is OK: HTTP OK HTTP/1.1 200 OK - 1213 bytes in 0.007 seconds [03:04:09] syncing slot0 [03:04:25] I really need to add the damn log bot [03:04:26] http://test.wikipedia.org/wiki/Main_Page/mw-config/mw-config/mw-config/mw-config/mw-config/index.php [03:04:26] heh [03:04:35] :D [03:05:26] bah [03:05:40] damn snapshot systems [03:05:52] they needed to pull slot0 and its l10n [03:06:24] maybe I should put test back to normal temporarily [03:06:41] I don't think anyone is actively using it [03:06:52] ok [03:07:06] no complaints yet anyway [03:07:30] I put it back to normal anyway [03:07:38] looks bad to have that screen up [03:07:38] It's past Chris McMahons working day [03:08:30] I guess I need to make the make-wmf-branch strip LocalSettings.php from .gitignore too [03:08:36] no need [03:08:40] and/or force add [03:08:45] I'm forcing them in [03:08:46] yeah [03:08:58] we don't want someone accidentally pushing one into gerrit [03:09:00] I'll fix that [03:09:12] should revert the cdb change, too [03:09:22] Reedy: make sure to do it inside of git deploy [03:09:49] I'm in the middle of deploying it for slot0 [03:10:13] did you want to do it for slot1? [03:10:34] Oh, LocalSettings etc are just going to be commited locally? [03:10:34] yes [03:10:41] they'll get pulled that way [03:10:44] that's what I did with the cdb [03:11:02] New patchset: Reedy; "Revert "Remove wikiversions.cdb from .gitignore"" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/44208 [03:11:53] Have you done AdminSettings.php and StartProfiler.php too? [03:12:00] those are in the private repo [03:12:10] oh, duhh [03:12:15] We changed the paths for them [03:12:18] yep [03:12:54] ok. syncing slot1 too [03:13:29] damn snapshots [03:13:37] I knew I should have finished those up before starting [03:14:20] blasted thigns [03:14:33] well, they should be done soonish [03:16:40] screw em [03:16:47] they'll finish when they're ready [03:18:36] :D [03:19:38] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 196 seconds [03:20:08] I obviously didn't put enough work into the initialization portion of the system [03:20:23] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 212 seconds [03:20:28] the initial sync_all didn't really work properly [03:21:17] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [03:22:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:14] PROBLEM - Lucene disk space on searchidx1001 is CRITICAL: DISK CRITICAL - free space: / 102 MB (1% inode=66%): [03:23:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.565 seconds [03:24:26] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [03:25:10] ugh. the searchidx systems are also using /a [03:27:44] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [03:28:29] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [03:31:47] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [03:34:21] Reedy: test is running git-deploy [03:35:20] TimStarling: ^^ [03:35:42] excellent [03:36:04] great [03:36:23] so we know that the new CommonSettings.php doesn't instantly spew fatal errors? 
[03:36:28] yep [03:37:10] there are some warnings in /home/wikipedia/syslog/apache.log [03:37:19] 115 Warning: include() [function.include]: Failed opening '/srv/deployment/mediawiki/common/wmf-config/reporting-setup.php' for inclusion (include_path='/srv/deployment/me [03:37:19] diawiki/slot0/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/srv/deployment/mediawiki/common/1.21wmf7:/srv/deployment/mediawiki/common/1.21wmf7/lib:/usr/local/lib/php:/usr/share/php') in /sr [03:37:19] v/deployment/mediawiki/common/wmf-config/CommonSettings.php on line 1074 [03:37:19] 115 Warning: include() [function.include]: Failed opening '/srv/deployment/mediawiki/common/wmf-config/contribution-tracking-setup.php' for inclusion (include_path='/srv/d [03:37:19] eployment/mediawiki/slot0/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/srv/deployment/mediawiki/common/1.21wmf7:/srv/deployment/mediawiki/common/1.21wmf7/lib:/usr/local/lib/php:/usr/share/ [03:37:21] php') in /srv/deployment/mediawiki/common/wmf-config/CommonSettings.php on line 2194 [03:37:27] well that looks awful [03:37:42] both fundraising related [03:38:06] and presumably should be in the private repo.. [03:38:21] yep [03:38:24] it is [03:38:44] /srv/deployment/mediawiki/private/wmf-config/reporting-setup.php [03:38:52] same with the other [03:39:20] include( "$wmfConfigDir/reporting-setup.php" ); [03:39:25] Easily enough fixed tehn [03:41:10] (just doing them) [03:41:13] cool [03:43:24] Ryan_Lane: so you've pushed out the new code to all the tampa mediawiki-installation hosts? [03:43:41] yes, there's a few that timed out on fetch or initial checkout [03:43:47] I'm getting those sync'd now [03:43:53] Change merged: Reedy; [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/44208 [03:44:18] I'm going to check the pooled list vs what's reporting to be totally sure [03:44:23] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [03:47:36] New patchset: Reedy; "Update fundraising related config paths" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/44209 [03:48:01] Change merged: Reedy; [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/44209 [03:49:38] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [03:52:02] New patchset: Asher; "adding $mw_primary to realm.pp, as redis and mha configs needs to be aware of which site is primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44211 [03:52:20] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [03:55:28] * Ryan_Lane groans [03:55:46] snapshot1002 doesn't even have a /a [03:56:50] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [03:57:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:57:47] New patchset: Asher; "adding $mw_primary to realm.pp, as redis and mha configs needs to be aware of which site is primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44211 [04:00:17] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [04:03:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.711 seconds [04:03:25] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44211 [04:04:27] Guess there isn't a great deal more to do till we actually do it (and the usual prep that needs to 
go into deploying a new branch) [04:08:23] PROBLEM - Puppet freshness on mw40 is CRITICAL: Puppet has not run in the last 10 hours [04:20:05] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [04:20:41] Ryan_Lane: Mind deploying the updated common repo? [04:24:19] Wonder if we're gonna have people complaining about the lack of git hashes on Special:Version (again) [04:25:20] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [04:32:47] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [04:37:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:38:05] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [04:38:26] Reedy: want to try it out? [04:39:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.060 seconds [04:45:09] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [04:48:44] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [04:50:01] Reedy: did you already push it out? [04:59:23] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [05:13:20] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [05:13:20] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:13:29] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [05:22:35] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [05:26:02] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [05:33:14] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [05:42:05] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [05:50:56] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [05:57:59] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [06:03:23] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [06:06:59] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [06:10:26] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [06:10:37] New review: Tim Starling; "https://rt.wikimedia.org/Ticket/Display.html?id=4344" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/43991 [06:15:41] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [06:19:17] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [06:20:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:22:17] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 261 seconds [06:22:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.464 seconds [06:22:53] PROBLEM - MySQL Slave Running on db1001 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error writing file UNOPENED (Errcode: 9) [06:22:54] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [06:35:02] anyone around who can help with a large file upload? 
[06:35:57] yes [06:36:20] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [06:36:20] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [06:36:21] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [06:36:21] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [06:36:39] TimStarling, if you have a sec .. I put the video in /tmp/MetricsVideo on hume [06:36:57] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [06:37:20] this used to work: sudo -u apache mwscript maintenance/importImages.php --wiki=commonswiki --comment='WMF monthly metrics meeting video, more metadata to follow shortly' --user=Eloquence --extension=ogv /tmp/(whatever) [06:38:22] Now I get http://pastebin.com/KtLTs22x [06:38:44] interesting [06:39:52] hume doesn't have that NFS volume mounted, and it's missing from its fstab [06:42:19] I'm using hume per previous advice from Roan to avoid I/O error issues when trying to upload from fenari. [06:43:48] but that was pre-netapp and other changes. should I try going via fenari? [06:43:59] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [06:44:51] at least the volumes seem to be available there [06:45:09] I'm editing the relevant puppet manifest [06:45:53] New patchset: Asher; "initial mha config work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44219 [06:46:52] New patchset: Tim Starling; "Mount upload NFS on hume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44220 [06:47:37] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44220 [06:50:53] Eloquence: all done, I ran the script with the parameters you suggested to test it [06:51:24] http://commons.wikimedia.org/wiki/File:MetricsFun.ogv [06:51:39] going for dinner [06:52:21] Tim-away, thank you :-). have a nice dinner. [06:52:24] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [06:54:18] New patchset: Asher; "initial mha config work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44219 [06:57:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:00:16] Change abandoned: Krinkle; "Dupe of Ic54701e3." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44162 [07:02:34] Who's handling Wikivoyage issues? [07:02:37] If anyone. [07:03:29] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [07:08:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.880 seconds [07:10:32] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [07:14:08] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [07:19:41] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [07:21:31] mz, I'm helping. what's up? [07:24:56] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [07:25:54] Susan: ^ [07:26:20] don't know if you saw Eloquence there [07:26:47] Certainly didn't. I stopped stalking "MZ" after YouTube URLs and such kept pinging me. ;-) [07:26:54] Eloquence: Nothing too major, but . [07:26:57] I figured :) [07:26:58] Prodego: Thx. [07:27:01] of course sir [07:27:18] Eloquence: I copied Reedy on the bug. Bit strange that it's just suddenly stopped working.... 
[07:27:32] I wonder if that wiki was upgraded in the past day or two. [07:30:40] wmf7 was deployed to wikivoyage on 1/7 according to https://www.mediawiki.org/wiki/MediaWiki_1.21/Roadmap [07:34:05] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [07:39:10] Susan, don't see anything suspicious in the config or the code changes. reedy's a good person to investigate further. [07:39:20] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [07:40:03] Eloquence: Yeah, I thought so too. Thanks for taking a look. :-) [07:40:16] It may actually be two issues: the user right not working correctly and the error message being broken. [07:40:21] Though it's difficult to say. [07:40:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:43:15] the code does a pretty simple check !$user->isAllowed( 'usermerge' ) when throwing that error, so it looks like bureaucrats don't have the permission for some reason [07:44:22] * ori-l investigates Reedy. [07:45:15] :) [07:45:55] he checks out :P [07:46:20] just be careful before committing. [07:47:19] story of my life [07:49:59] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [07:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.840 seconds [07:59:30] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [08:06:51] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [08:15:33] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [08:22:46] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [08:29:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:31:45] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [08:37:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [08:38:12] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [08:42:33] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [08:43:45] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.003 second response time on port 11000 [08:49:27] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [08:54:51] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [09:07:45] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [09:14:21] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [09:14:39] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [09:18:15] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: Puppet has not run in the last 10 hours [09:18:33] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 190 seconds [09:18:42] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 194 seconds [09:21:51] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [09:25:18] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [09:27:06] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [09:31:10] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay seconds [09:32:03] RECOVERY - MySQL Slave Running on db1001 is OK: OK replication [09:32:17] New patchset: Silke Meyer; "Puppet files to install Wikidata repo / 
client on different labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42786 [09:35:48] PROBLEM - mysqld processes on db1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [09:35:57] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [09:42:06] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [09:42:06] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [09:43:00] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [09:54:46] is get-deploy already in use? [09:58:54] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [10:02:21] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [10:07:45] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [10:13:00] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [10:13:17] Nikerabbit: test.wikipedia.org is running off the new source tree [10:20:12] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [10:22:13] so after today, what does tin deploy to? both eqiad and pmtpa? [10:23:39] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [10:30:42] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [10:39:33] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [10:43:09] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [10:53:11] New patchset: Mark Bergsma; "Deploy apaches, api and rendering LVS services in eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44233 [10:53:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44233 [10:55:45] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [10:58:36] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 201 seconds [10:58:36] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 201 seconds [11:02:00] New patchset: Mark Bergsma; "Correct LVS service IP for eqiad apaches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44234 [11:03:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44234 [11:05:27] New patchset: Mark Bergsma; "Bind apaches, api and image scaler LVS service IPs to LVS balancers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44235 [11:05:39] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [11:05:48] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [11:06:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44235 [11:11:39] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [11:12:31] !log Restarted PyBal on lvs1003 and lvs1006 [11:12:42] Logged the message, Master [11:24:15] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [11:27:51] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [11:28:12] !log Added eqiad API apaches to the LVS pool [11:28:21] Logged the message, Master [11:33:06] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [11:34:45] !log Added eqiad apaches and image scalers to the LVS pools [11:34:55] Logged the message, Master [11:36:42] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: 
Connection refused [11:39:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:40:10] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [11:42:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.225 seconds [11:48:38] hello [11:52:58] New patchset: Mark Bergsma; "Add eqiad bits apaches as unused backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44247 [11:54:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44247 [11:56:03] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [11:56:45] hi hashar [11:56:48] you worked on multi-dc mediawiki config didn't you? [11:56:58] yup been reviewing anomie code [11:57:06] how may I help you ? ;-] [11:57:20] how can I make a mediawiki setting dependent on the datacenter the app server is in now? [11:57:32] I should write a doc on the wikitech I guess [11:57:44] some basic examples would go a long way ;) [11:57:52] short answer: prefix the file with the realm or datacenter name. Ex: db-pmtpa.php [11:58:28] The operations/mediawiki-config.git has a /README file which gives some examples [11:58:29] and then that file is installed as db.php on those app servers? [11:58:34] ok cool [11:58:37] i'll have a look at that, thanks [11:58:43] you will have to be extremely carefuly [11:59:01] cause the realm takes precedent over datacenter [11:59:10] you know me, i'm always extremely careful [11:59:13] ;) [11:59:20] yeah I heard the story already heheh [11:59:57] what story? [12:00:12] I tried a joke sorry does not play well :( [12:00:27] :) [12:00:29] more seriously, you can write your own integration tests using tests/multiversion/MWRealmTest.php , then run the tests with phpunit tests/multiversion/MWRealmTest.php [12:00:51] though I might write them for you. I have no idea how much PHP you know. [12:00:57] and we don't support python yet :( [12:01:27] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [12:01:45] I wasn't planning on writing any implementation tests hehe [12:01:54] I was just thinking of preparing configurations for the eqiad switchover [12:01:59] yeah that test is mostly unit testing [12:02:08] I need an integration test [12:02:39] mark: binasher might have been working on it already. He poked us on friday cause of the lack of documentation [12:02:46] which made me add the doc in /README [12:02:49] yeah [12:03:13] it's for asher's changes [12:03:25] but it's good to have some info on that, thanks [12:03:31] also the repo has a newdeploy branch which hold the changes needed for git-deploy [12:03:38] might contain the new datacenter stuff too [12:04:07] ah it does not [12:04:08] \O/ [12:04:27] i think i'll put all "my" varnish/squid ops changes as a gerrit topic [12:04:48] anyway, you want the file named something like mc-production-eqiad.php and mc-production-pmtpa.php [12:04:55] I see [12:05:04] mc-pmtpa.php would apply to any pmtpa datacenter, which include the beta install on labs [12:05:14] right [12:05:36] so how do we do this with CommonSettings.php then? ;) [12:05:40] need to split out those settings I guess [12:05:59] redis and parser cache for example are in there [12:07:46] CommonSettings.php loads CommonSettings-labs.php whenever the realm is labs [12:07:47] that is hardcoded in the CS.php file [12:07:56] oh it INCLUDES? 
[12:08:00] that is nice [12:08:15] I think so [12:08:22] ok [12:08:30] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [12:10:14] oho the current parser cache seems to be deployed on beta :-] [12:10:57] haha [12:11:08] yeah it's just in CommonSettings.php [12:11:39] we really have to refactor all that stuff [12:11:54] I guess as long as it works for production, we are fine [12:13:57] so that should be in CommonSettings-production.php then [12:14:53] I can't remember how CommonSettings-labs.php is loaded [12:15:19] the parser cache stuff should probably be in its own file [12:15:20] this reminds me of tricks with AUTOEXEC.BAT and CONFIG.SYS to optimize memory usage on dos [12:15:24] such as parsercache-production.php [12:15:33] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [12:15:44] QEMM! [12:16:07] mark: haha, it's been a while since i heard that :D [12:17:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:45] the README is pretty helpful, [12:21:00] but it would be good to be a bit more specific on what exactly is included in which order [12:21:01] like [12:21:10] I assume CommonSettings.php is always included [12:21:23] yeah we lack proper doc [12:21:28] see bug #1 :-] [12:21:29] but then, is CommonSettings-production.php, if found, included, and does it overwrite whatever was in CommonSettings.php? [12:21:55] i assume it does [12:22:17] na the mechanism is not in place for CommonSettings.php [12:22:25] ah [12:22:36] IIRC the main entry point for MediaWiki just includes CommonSettings.php [12:22:48] and then also includes CommonSettings-labs.php whenever we use labs [12:23:02] then we need files for parser cache and redis I think [12:23:07] redis could live in mc.php I suppose [12:23:17] the mechanism in README uses a global function to find out which file to include. Something like: getRealmSpecificFilename( 'db.php' ); [12:23:35] so yeah, you want to split the conf in different files [12:24:00] and include them from CommonSettings.php using something like: include( getRealmSpecificFilename( 'redis.php' ) ); [12:24:05] it's actually ok for now [12:24:24] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [12:24:33] gosh that check is annoying [12:25:50] ah [12:26:01] so the main entry point does include CommonSettings.php [12:26:15] right [12:26:17] thanks hashar! [12:26:19] somewhere at the bottom of that file is code to also load a per-realm specific cs.php file [12:26:24] if ( file_exists( "$wmfConfigDir/CommonSettings-$wmfRealm.php" ) ) { [12:26:24] require( "$wmfConfigDir/CommonSettings-$wmfRealm.php" ); [12:26:25] } [12:26:29] good [12:26:35] that call should have been refactored to use getRealmSpecificFilename() [12:26:40] indeed [12:26:46] but we (I) wanted to avoid refactoring too many things [12:26:50] so [12:27:11] CommonSettings-production.php would include what you need [12:27:31] but for parsercache / redis / memcached, I think you should load them from CommonSettings.php using include( getRealmSpecificFilename() ); [12:27:49] and name the file with a -production.php suffix [12:28:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.056 seconds [12:28:01] then later on we can write the -labs.php equivalent.
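The loading scheme hashar describes above boils down to a small selection routine. A minimal sketch of that logic, assuming the file layout from the discussion (db-pmtpa.php, CommonSettings-production.php, and so on); the real helper lives in the multiversion code of operations/mediawiki-config and may differ in naming and details:

    <?php
    // Sketch of realm/datacenter-specific file selection: the most specific
    // variant that exists wins, and realm takes precedence over datacenter.
    function getRealmSpecificFilename( $file, $realm = 'production', $dc = 'pmtpa' ) {
        $base = preg_replace( '/\.php$/', '', $file );
        $candidates = array(
            "$base-$realm-$dc.php", // e.g. db-production-pmtpa.php
            "$base-$realm.php",     // e.g. db-production.php
            "$base-$dc.php",        // e.g. db-pmtpa.php
            $file,                  // plain db.php as the last resort
        );
        foreach ( $candidates as $candidate ) {
            if ( file_exists( $candidate ) ) {
                return $candidate;
            }
        }
        return $file;
    }

    // CommonSettings.php can then pull in per-realm/per-DC pieces, e.g.:
    // include( getRealmSpecificFilename( "$wmfConfigDir/redis.php" ) );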
[12:28:30] only db.php was a problem for the eqiad switchover actually, so it's covered for now [12:28:50] db-production-pmtpa.php will allow setting pmtpa db shards to read-only [12:29:03] while eqiad goes read-write [12:29:08] !reinstalling sq48 with lucid [12:29:22] lunch is ready [12:29:22] brb [12:31:36] paravoid: lucid ? :-] [12:31:54] PROBLEM - Host sq48 is DOWN: PING CRITICAL - Packet loss = 100% [12:33:15] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [12:33:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:34:18] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [12:34:18] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [12:35:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.197 seconds [12:35:35] apergos: around? [12:35:48] yes [12:36:33] hashar: yeah, precise has squid3 and I'm not going to figure this out now [12:36:44] apergos: how's swift? [12:36:48] paravoid: just forgot sq### are squids (not SQL servers) [12:36:48] moving slowly along [12:36:58] monday a new set of rings went around [12:37:15] I'd suggest not adding new boxes for now [12:37:20] until we get H710s for those [12:37:44] it will be a couple more rouns before we are ready [12:37:45] RECOVERY - Host sq48 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [12:37:52] ? [12:37:56] I imagine the 710s will be in by then but I will keep an eye out [12:38:03] more rounds of ring pushes [12:38:15] the disks will need to be reformatted with 710s [12:38:36] yep [12:38:46] sucks but at least there is apparently a solution [12:38:53] so hold off adding more boxes I'd say [12:40:03] I was expecting we would pull old boxes and put new ones in with the 710s [12:40:04] no? [12:40:09] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [12:40:20] because the old boxes have to go to dell sooner or later and I am sure they would rather have sooner [12:41:16] I'm saying let's shortcut C2100 -> 720xd H310 -> 720xd H710 to C2100 -> 720xd H710 on the rest of them [12:42:10] but you're doing most of the work, so feel free to play it as you like [12:42:22] "most" as in "all" [12:42:42] PROBLEM - SSH on sq48 is CRITICAL: Connection refused [12:42:43] could you do me a favor? [12:42:58] could you give me the SSD models currently used in those boxes? [12:43:06] hdparm -I or smartctl or dmesg [12:43:27] yes, that was my plan, wait for the 710s to come in and put new boxes in with thos, I am guessing the controllers will be here before we are ready for the next new box to go in, is what I am saying [12:43:33] ok, just a sec [12:44:09] cool [12:46:17] INTEL SSDSA2M160G2GC (ms-be6 and ms-be3, I guess the same on all) [12:46:27] RECOVERY - SSH on sq48 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:46:44] these are the new boxes, not the c2100s [12:47:12] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [12:48:57] what model were you testing against? 
I saw the email reports [12:54:15] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [13:03:51] PROBLEM - NTP on sq48 is CRITICAL: NTP CRITICAL: No response from NTP server [13:04:45] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [13:13:51] New review: J; "in the long run beta needs a swift setup, too many things are different in subtle ways if swift is en..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/38307 [13:15:42] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [13:18:39] those SSDs are the original X-25ms [13:19:20] yes [13:19:41] already requested 320s in that RT [13:22:18] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [13:23:22] !log gallium: reapplied Tim's patch to PHPUnit install in /usr/share/php/PHPUnit/Util/Test.php to prevent PHP segfaulting while running tests. {{bug|43972}} [13:23:32] Logged the message, Master [13:28:36] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [13:39:13] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [13:42:05] mark: if you make mediawiki-config changes, feel free to add me and anomie as reviewers. Sam can help too. [13:42:14] thanks [13:43:43] RECOVERY - Frontend Squid HTTP on sq48 is OK: HTTP OK HTTP/1.0 200 OK - 598 bytes in 0.003 seconds [13:45:49] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [13:47:55] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [13:52:17] paravoid: mark: and what do we do for the "wikimedia" puppet module? As I understand it you are both too busy with some SSD / disk / hardware installation. Should I ping you again post eqiad migration? :-] [13:52:31] sorry [13:52:34] I'll have a look today [13:52:34] or should I attempt to bribe you with some Champagne or something ? :-] [13:52:41] alcohol is always welcome [13:52:44] :-) [13:53:01] RECOVERY - NTP on sq48 is OK: NTP OK: Offset 0.03335082531 secs [13:53:12] not ouzo [13:53:30] why not? [13:53:46] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [13:53:50] stupid headache drink ;) [13:54:09] wha?! [13:54:15] you obviously had some very bad ouzo :) [13:54:17] mark: you must know that Ouzo should be mixed with a lot of water, don't you? [13:54:25] ouzo is one of the best drinks headache-wise [13:54:28] no [13:54:32] also because of that [13:54:33] and I've had greeks confirm it [13:57:44] !log reedy synchronized wmf-config/InitialiseSettings.php [13:57:54] Logged the message, Master [14:02:07] New patchset: Reedy; "Bug 44020 - Special:UserMerge giving permission errors on English Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44250 [14:02:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44250 [14:05:29] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [14:09:49] PROBLEM - Puppet freshness on mw40 is CRITICAL: Puppet has not run in the last 10 hours [14:12:31] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [14:14:18] New review: Anomie; "Looks ok to me, unless you want to add in a default somehow in case we add a third datacenter."
[operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/44160 [14:18:42] New patchset: Mark Bergsma; "EQIAD SWITCH: Use eqiad bits app servers in eqiad (not in pmtpa)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44251 [14:18:43] New patchset: Mark Bergsma; "EQIAD SWITCH: Use eqiad bits app servers exclusively." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44252 [14:21:22] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [14:37:07] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [14:40:17] easy merge: adding liblua5.1-0.dev package on gallium (contint host). I have already installed it manually :-] https://gerrit.wikimedia.org/r/#/c/43999/ [14:48:28] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [14:58:22] !log restarted Jenkins on gallium to apply plugin updates. [14:58:32] Logged the message, Master [15:05:16] poor Jenkins locked in futex_wait_queue_me [15:11:34] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [15:14:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:14:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [15:16:00] !log restarted Jenkins again, somehow got stuck :/ [15:16:10] Logged the message, Master [15:20:34] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [15:26:07] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [15:30:18] !log ms-be1005 & ms-be1006 going down [15:30:28] Logged the message, Master [15:33:19] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:33:19] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [15:36:24] ahh [15:36:24] strace is wonderful [15:37:06] !log deployed new webstatscollector filter to collect stats on wikivoyage domains and restarted udp2log on locke [15:37:16] Logged the message, Master [15:38:25] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [15:46:26] noone touch the squid config now [15:46:38] or you might regret it ;) [15:47:51] oh? [15:47:54] until the migration you mean? 
[15:48:00] no [15:48:02] ah, okay [15:48:05] until i have reverted my generated configs [15:48:11] I want to make one change at one point [15:48:23] that's ok, we can rebase later [15:48:32] I want to write a swift middleware (or put in rewrite.py) that responds to /health_ok or something [15:48:43] then switch squids and pybal on checking this instead of /v1/AUTH_.../ [15:49:50] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [15:50:08] squid config is safe again [15:55:14] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [16:00:50] New patchset: Faidon; "autoinstall: switch ceph-sdd from sdm to sda" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44256 [16:03:34] New patchset: Faidon; "autoinstall: switch ceph-sdd from sdm to sda" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44256 [16:04:05] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [16:04:23] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44256 [16:05:13] paravoid: don't forget my wikimedia module :-] [16:05:18] I am out to get my daughter though [16:05:53] hashar: gah [16:06:05] no worry, we can talk about it later tonight or tomorrow [16:06:05] hashar: want to get rid of init.pp and fix up readme a little bit? [16:06:12] or else wait for me [16:06:16] I'm terribly sorry [16:06:29] will wait :-] we can write the readme together if not done alredy [16:06:57] bbl [16:11:30] New patchset: Mark Bergsma; "EQIAD SWITCH: Migrate mobile backend appservers to eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44257 [16:12:56] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [16:19:59] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [16:20:08] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [16:25:05] PROBLEM - SSH on ms-be1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:56] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [16:30:20] RECOVERY - SSH on ms-be1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:30:29] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [16:30:38] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [16:37:14] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [16:37:14] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [16:37:14] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [16:37:14] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [16:37:41] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [16:38:08] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 27.87 ms [16:43:05] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [16:46:32] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [16:53:13] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [16:55:23] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [16:56:53] PROBLEM - NTP on ms-be1006 is CRITICAL: NTP CRITICAL: No response from NTP server [17:00:02] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 181 seconds [17:00:38] PROBLEM - Parsoid Varnish on 
praseodymium is CRITICAL: Connection refused [17:01:06] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 204 seconds [17:04:41] PROBLEM - NTP on ms-be1005 is CRITICAL: NTP CRITICAL: Offset unknown [17:06:02] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [17:06:57] New review: Krinkle; "@Reedy: The other way around (this one is newer)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/44164 [17:09:29] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [17:13:32] RECOVERY - NTP on ms-be1005 is OK: NTP OK: Offset -0.02539801598 secs [17:16:23] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 28 seconds [17:17:09] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [17:18:20] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [17:24:02] RECOVERY - NTP on ms-be1006 is OK: NTP OK: Offset -0.01084697247 secs [17:27:11] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [17:33:25] ^demon, are there known problems with the 'Old version history' feature in gerrit? [17:33:56] <^demon> Yes. There's a couple of fixes in master for it. [17:34:29] !log powering down titanium to verify console redirection status [17:34:39] Logged the message, Master [17:35:47] ^demon: Is there a way to trick gerrit into actually showing me the diff between two patch sets, or should I just download them and diff locally? [17:36:20] <^demon> Well, it will show a diff between patchsets. It just doesn't behave well on merges or rebases. In which case, yeah, downloading is best. [17:36:38] <^demon> We'll be upgrading soon. Probably the week after eqiad stuff is done. [17:37:15] 'k thanks [17:40:37] !log Swapped slot1 from 1.21wmf6 to 1.21wmf8 [17:40:48] Logged the message, Master [17:44:01] robh: console settings in titanium are correct [17:47:58] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [17:51:19] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [17:53:45] robh: drac was hungup...racreset fixed it [17:56:52] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [17:59:43] PROBLEM - LVS HTTP IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:59:43] PROBLEM - Backend Squid HTTP on amssq57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:59:52] PROBLEM - Backend Squid HTTP on amssq61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:01] PROBLEM - Frontend Squid HTTP on knsq19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:01] PROBLEM - Frontend Squid HTTP on knsq22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:02] PROBLEM - Frontend Squid HTTP on amssq53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:02] PROBLEM - Frontend Squid HTTP on amssq51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:28] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [18:00:38] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:42] uh [18:00:52] uh indeed [18:01:31] RECOVERY - Backend Squid HTTP on amssq57 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 9.861 seconds [18:01:40] PROBLEM - Backend Squid HTTP on amssq47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:01:40] RECOVERY - Frontend Squid HTTP on amssq51 is OK: HTTP OK HTTP/1.0 200 OK - 
795 bytes in 2.781 seconds [18:01:40] RECOVERY - Frontend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 4.517 seconds [18:01:40] RECOVERY - Frontend Squid HTTP on knsq22 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 7.883 seconds [18:01:46] 2013/01/16 18:00:30| squidaio_queue_request: WARNING - Disk I/O overloading [18:01:49] RECOVERY - Frontend Squid HTTP on knsq19 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 9.210 seconds [18:01:49] 2013/01/16 18:00:50| storeCossCreateMemOnlyBuf: no free membufs. You may need to increase the value of membufs on the /dev/sdb5 cache_dir [18:01:52] 2013/01/16 18:01:06| storeCossCreateMemOnlyBuf: no free membufs. You may need to increase the value of membufs on the /dev/sdd cache_dir [18:01:55] not sure if this is new though [18:02:00] nor I [18:02:25] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 583 bytes in 7.945 seconds [18:02:34] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:39] strange [18:03:10] RECOVERY - LVS HTTP IPv4 on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 5.779 seconds [18:03:19] RECOVERY - Backend Squid HTTP on amssq47 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.219 seconds [18:03:19] knsq15 seems different than the others [18:03:20] https://streber.wikimedia.org/cgi-bin/smokeping.cgi?target=Europe.esams.upload [18:03:54] hm? [18:04:08] something similar went on on Sunday btw [18:04:27] very mild though, didn't result in an outage, just slowness [18:04:38] I debugged it a bit then with no luck (other than finding a knsq with a broken disk) [18:05:35] swift requests show a spike [18:05:39] small [18:06:55] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [18:07:13] PROBLEM - Frontend Squid HTTP on amssq61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:07:13] PROBLEM - Frontend Squid HTTP on amssq51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:07:29] http://ganglia.wikimedia.org/latest/graph.php?h=knsq19.esams.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1358359627&g=network_report&z=medium&c=Upload%20squids%20esams [18:07:33] vs. [18:07:36] http://ganglia.wikimedia.org/latest/graph.php?h=knsq16.esams.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1358359627&g=network_report&z=medium&c=Upload%20squids%20esams [18:07:58] http://ganglia.wikimedia.org/latest/?c=Upload%20squids%20esams&h=knsq19.esams.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 vs. http://ganglia.wikimedia.org/latest/?c=Upload%20squids%20esams&h=knsq16.esams.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 in general [18:08:04] !log Rerouting 14907->43821 traffic away from AS1257 [18:08:14] Logged the message, Master [18:08:25] you think it's a network issue? 
[18:08:39] yes [18:08:52] PROBLEM - Backend Squid HTTP on amssq59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:52] RECOVERY - Frontend Squid HTTP on amssq51 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 3.222 seconds [18:08:55] CPU is through the roof [18:09:01] PROBLEM - Frontend Squid HTTP on amssq54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:07] see graphs above [18:10:16] knsq19 is not happy [18:10:31] RECOVERY - Backend Squid HTTP on amssq59 is OK: HTTP OK HTTP/1.0 200 OK - 636 bytes in 0.289 seconds [18:10:31] PROBLEM - LVS HTTP IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:31] RECOVERY - Backend Squid HTTP on amssq61 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 5.679 seconds [18:10:45] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Upload%20squids%20esams&h=knsq19.esams.wikimedia.org&v=349096&m=mem_cached&r=hour&z=default&jr=&js=&st=1358359793&vl=KB&ti=Cached%20Memory&z=large [18:10:49] PROBLEM - Frontend Squid HTTP on amssq59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:52] huge drop in cached memory [18:11:16] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [18:12:10] RECOVERY - LVS HTTP IPv4 on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 4.310 seconds [18:12:21] i'll restart knsq19's squid [18:12:24] packets received/sent are quite a lot [18:12:28] RECOVERY - Frontend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 2.349 seconds [18:12:28] RECOVERY - Frontend Squid HTTP on amssq61 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 2.914 seconds [18:12:28] RECOVERY - Frontend Squid HTTP on amssq59 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 4.389 seconds [18:12:36] doubled [18:13:13] PROBLEM - Frontend Squid HTTP on knsq18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:30] hm, 251 days of uptime [18:13:40] heh [18:13:47] although that's probably a recent enough kernel [18:14:16] PROBLEM - Backend Squid HTTP on amssq47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:28] pkts/s are on all but knsq15 decreasing [18:16:40] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [18:16:40] RECOVERY - Frontend Squid HTTP on knsq18 is OK: HTTP OK HTTP/1.0 200 OK - 653 bytes in 0.240 seconds [18:16:40] PROBLEM - LVS HTTPS IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:52] PROBLEM - Backend Squid HTTP on amssq58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:18:01] mark: so? 
[18:18:19] RECOVERY - LVS HTTPS IPv4 on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 777 bytes in 2.126 seconds [18:19:31] RECOVERY - Backend Squid HTTP on amssq47 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.847 seconds [18:21:09] !log praseodymium being odd, its coming down [18:21:18] Logged the message, RobH [18:21:19] RECOVERY - Backend Squid HTTP on amssq58 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 1.994 seconds [18:21:55] getting better [18:24:43] !log Rerouting 43821->14907 traffic away from AS1257 [18:24:53] Logged the message, Master [18:25:02] New patchset: Reedy; "Moved slot1 symlinks to being wmf8" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/44264 [18:25:20] rare earth metals are weird robh [18:25:31] Change merged: Reedy; [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/44264 [18:27:37] PROBLEM - SSH on msfe1002 is CRITICAL: Connection refused [18:28:13] PROBLEM - SSH on praseodymium is CRITICAL: Connection refused [18:28:49] ACKNOWLEDGEMENT - SSH on msfe1002 is CRITICAL: Connection refused daniel_zahn will be renamed - this is not ms-fe1002 [18:35:28] LeslieCarr: maybe, but its still easier than thinking of names [18:35:41] hehe [18:36:37] RobH: should msfe1001/2 be in decom? [18:36:45] or mutante [18:37:55] !log restarted knsq19 earlier [18:38:08] Logged the message, Master [18:40:38] lesliecarr: ping [18:41:08] !log putting logging-only iptables rule on sanger to check out what rules we actually need [18:41:15] cmjohnson1: [18:41:15] yay [18:41:18] perfect timing [18:41:19] Logged the message, Master [18:41:21] call just finished [18:41:23] like 2 seconds ago [18:41:58] cool..so fibers are ready when you are [18:42:03] so i see the card is in asw-c-eqiad [18:42:11] cool, so plug the fibers in please [18:42:17] and i'll turn up the links one at a time [18:42:57] RECOVERY - SSH on msfe1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:42:58] k.done [18:43:06] RECOVERY - SSH on praseodymium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:43:48] okay, turning up now ... [18:43:51] fingers crossed :) [18:44:03] !log turning up xe-1/1/0 on asw-c-eqiad (was disabled due to faulty equipment) [18:44:14] Logged the message, Mistress of the network gear. [18:44:58] huzzah [18:45:10] !log turning up xe-1/1/2 on asw-c-eqiad (was disabled due ot faulty equipment) [18:45:21] Logged the message, Mistress of the network gear. [18:45:27] woot! [18:45:29] thank you chris :) [18:45:30] PROBLEM - NTP on praseodymium is CRITICAL: NTP CRITICAL: No response from NTP server [18:45:38] and fnally asw-c-eqiad is not sucking ! [18:45:39] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [18:46:24] yw...lesliecarr [18:47:45] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [18:51:46] RobH: should msfe1001/2 be in decom? [18:52:00] (I'm really asking, we were chatting with mutante the decom rules the other day) [18:53:08] nope [18:53:11] they werent in nagios [18:53:18] so no need to be in decom, i added it, and it broke shit [18:53:26] had to revert [18:54:35] they are in nagios [18:54:41] ..... [18:54:43] god damn it. 
[18:54:48] 20:27 <+nagios-wm_> PROBLEM - SSH on msfe1002 is CRITICAL: Connection refused [18:54:52] 20:28 <+nagios-wm_> ACKNOWLEDGEMENT - SSH on msfe1002 is CRITICAL: Connection refused daniel_zahn will be renamed - this is not ms-fe1002 [18:54:54] bleh, i'll take care of it [18:54:58] i thought it was yanked [18:54:59] okay :) [18:55:04] no idea [18:55:12] i missed it, thx for spottin [18:57:09] robh: i was chatting w/lesliecarr and want to confirm that the spare ex4500 we have in storage will go in the same rack as the test mx480 [18:59:25] spare? [18:59:29] we have a broken one in a1 [19:00:33] iirc ...there is a spare in storage [19:00:42] * cmjohnson1 goes to check [19:02:18] RobH: did you see rt 4351? [19:02:24] woosters: too [19:03:19] paravoid: there are two tickets already [19:03:22] so going to kill that one [19:03:27] we have a ticket per site already in procurement [19:03:37] ill just merge this into one of them [19:03:44] or depend on them [19:03:51] if there's a broken one we should rma it [19:04:07] cmjohnson1: oh, ok, so uhh [19:04:11] the spare in storage... [19:04:27] cmjohnson1: so yea, psw2 is in a1 [19:04:31] has issues, needs return [19:05:01] cmjohnson1: So why will the spare go in that rack? Does LeslieCarr want a 4500 labs? [19:05:15] if so, thats cool. [19:05:45] cmjohnson1: dont throw the box out [19:05:48] cuz we may need it. [19:06:25] yes i do want a 4500 [19:06:27] (keep a box for the 4500 and for the 4200 if possible) [19:06:32] the radical idea of testing changes ;) [19:06:41] tests belong on production! [19:06:44] so our users can be involved. [19:06:51] https://blogs.msdn.com/cfs-filesystemfile.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-32-02-metablogapi/8054.image_5F00_thumb_5F00_35C6E986.png [19:07:21] man, i'm just not involving my tests with the community [19:07:58] i should have eaten breakfast [19:08:00] im starvin. [19:08:20] robh: ok..cool..yeah so she wants the 4500 for labs...i think it would make sense to keep it all together in one rack [19:08:29] i will not toss juniper boxes [19:08:29] yep, thats fine [19:08:41] ok..cool [19:08:45] keep them close together if you can [19:08:52] so we can use the top half of rack for production [19:09:03] RECOVERY - NTP on praseodymium is OK: NTP OK: Offset -0.01343393326 secs [19:09:53] arise praseodymium, arise, BWAHAHAHAHAHAHAHAHAAHA [19:09:56] ahem. [19:11:47] * ori-l bets $5 you copy/pasted the hostname rather than type it out [19:12:15] at first yes [19:12:21] but i forced myself to type it the past four times [19:12:41] praseodymium isnt that hard. [19:13:12] RoanKattouw_away: i am about to hand this back to you, its been spinning for a few puppet runs without the odd host key crap [19:14:45] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [19:15:12] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [19:18:22] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [19:19:15] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: Puppet has not run in the last 10 hours [19:19:23] and now its borking.... wtf [19:21:57] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [19:23:54] praseodymium hates me. [19:24:06] you gave it its name [19:24:12] i'd hate you for that too [19:24:27] yes well.... if it continues to bork up [19:24:34] and needs another reinstall, the name may change. 
[19:24:49] im starting to pity whoever else has to work on it [19:26:09] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [19:26:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44219 [19:29:07] New patchset: Asher; "fix typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44272 [19:31:42] mark: just fyi, i added this variable to be used by redis + mha4mysql, in case you would like to use it anywhere, or move/rename it, etc.: manifests/realm.pp: $mw_primary = "pmtpa" [19:32:23] ah good to know [19:32:47] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44272 [19:33:43] my firefox is shitting itself... [19:34:51] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [19:38:11] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [19:43:26] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [19:46:35] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [19:46:53] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [19:50:16] New patchset: Alex Monk; "(bug 43851) Allow enwikivoyage bureaucrats to remove sysops" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44276 [19:50:20] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.003 second response time on port 11000 [19:54:47] New patchset: Jgreen; "remove db1013 from fundraisingdb pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44277 [19:55:44] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [19:56:23] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44277 [19:57:12] New patchset: MaxSem; "Enable MobileFrontend on labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44278 [20:01:17] PROBLEM - NTP on mw1008 is CRITICAL: NTP CRITICAL: Offset unknown [20:12:06] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [20:13:46] !log reedy synchronized php-1.21wmf8 [20:13:55] wow [20:13:57] Logged the message, Master [20:14:00] * Reedy tries again [20:14:06] That won't be all the hosts [20:15:02] !log reedy synchronized php-1.21wmf8 [20:15:14] Logged the message, Master [20:15:42] !log reedy synchronized php-1.21wmf8 [20:15:53] Logged the message, Master [20:15:57] binasher: does 1pm work for the read-only test? [20:16:55] !log reedy synchronized docroot [20:17:06] Logged the message, Master [20:17:18] robla: what shard do you want set as read-only, for how long, and who is going to be looking at mediawiki errors? [20:17:31] !log reedy synchronized live-1.5/ [20:17:40] Logged the message, Master [20:18:57] binasher: s1, <5min, and ...um....I don't know [20:19:07] lemme see who I can get [20:19:24] ok, let me know so we can coordinate [20:19:30] but yah, 1pm is fine [20:20:56] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [20:21:45] binasher: I guess I'd look [20:21:51] thanks AaronSchulz! [20:22:18] I'll put a brief note on VPT now [20:22:42] AaronSchulz: ok, i'll let you know before the change. going to deploy db.php with s1 set to read only and also actually put db63 in read-only mode [20:23:10] New patchset: Legoktm; "(bug 44045) enable $wgAbuseFilterNotificationsPrivate for enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44281 [20:23:15] can a root please manually run puppet on fenari ? 
Or at least make world readable the file /etc/ssh/ssh_known_hosts . [20:23:27] there is a puppet bug that makes that file 0600 instead of 0644 :( [20:23:27] mark: laner@deployment-bastion:~$ btmakemetafile http://deployment-bastion.pmtpa.wmflabs:6969/announce /srv/deployment/mediawiki/l10n-slot0/.git [20:23:27] error: name .git disallowed for security reasons [20:23:39] * Ryan_Lane stabs bittornado right in the face [20:23:50] <^demon> wtf? [20:23:53] how about you let *me* decide that. asshole. [20:24:35] hashar: are you aware of the phpunit invoker timeouts? [20:24:56] AaronSchulz: yeah tests are too slow, they either need to be fixed or the timeout raised for the test. [20:25:14] eek, if logmsgbot does not lie, git-deploy is so friggin fast... [20:25:18] AaronSchulz: and even if the test is surely going to be fast, gallium has some I/O spikes from time to time :/ [20:25:23] seems it can't start with a . ? [20:25:32] <^demon> Ryan_Lane: Google says so. [20:25:57] <^demon> I'm seeing people complain about a variety of .foo [20:26:26] <^demon> http://digicomet.blogspot.com/2008/04/bittornado-error.html - dunno if this workaround is still valid. [20:26:39] yeah [20:26:41] that would do it [20:26:48] I was hoping to avoid it :) [20:26:53] I need to patch it anyway [20:27:49] <^demon> Yep, https://github.com/russss/Herd/blob/master/BitTornado/BT1/btformats.py [20:27:50] Ryan_Lane: so what does happen when two people do git deploy start? [20:27:58] AaronSchulz: on the same repo? [20:28:04] yes [20:28:05] AaronSchulz: the second person is told to fuck off [20:28:18] ok, makes sense then [20:28:49] we really need to add some stuff into the hooks for other situations, though [20:29:22] for instance, it should also stop you from pulling/fetching/committing/etc, if you haven't started with: git deploy start [20:30:03] can you alias all git commands to check that first or something? :P [20:30:09] heh [20:30:15] hooks would be better [20:30:20] !log reedy Started syncing Wikimedia installation... : Rebuild message cache for 1.21wmf8 [20:30:26] <^demon> AaronSchulz: You can't alias core commands, iirc. [20:30:30] there are hooks for all those commands [20:30:30] Logged the message, Master [20:30:34] yep [20:30:36] ^demon: yeah, figures [20:30:38] ok, nice [20:30:38] pre hooks for all of them [20:30:45] right [20:30:50] pre-fetch alone should do it [20:30:59] well, and pre-commit [20:31:44] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [20:31:49] pre-commit covers cherry-pick right? Since it makes a commit [20:32:00] I believe so [20:32:09] pre-fetch may also cover it [20:32:17] since it has to fetch it before it can commit it [20:32:31] unless the fetch was piggy-backed onto something else [20:32:37] *for something else [20:32:53] * Ryan_Lane nods [20:32:55] and git deploy *was* used for that fetch [20:33:05] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:33:15] we can check a lock file to see if it's in the deploy or not [20:33:26] it'll allow things to continue if so, and will stop if it not [20:35:45] muahahahaha [20:35:55] python /home/laner/murder_client.py peer http://deployment-bastion.pmtpa.wmflabs/mediawiki/l10n-slot0/.git.torrent test/.git 127.0.0.1 [20:36:19] laner@deployment-bastion:/tmp/test$ git reset --hard [20:36:19] Checking out files: 100% (375/375), done. 
[20:36:19] HEAD is now at 8c7c5e3 Update localisation cache for 1.21wmf7 [20:36:53] that'll do [20:37:29] git fetch can screw itself [20:37:32] Ryan_Lane- We have murder working in place of git fetch now? [20:37:41] * AaronSchulz hands Ryan_Lane the swear jar [20:37:42] anomie: not yet. this is manual testing [20:37:46] AaronSchulz: :) [20:37:52] and give it back when you are done so I can be filthy rich [20:38:02] AaronSchulz: I'm saving up for vacation [20:38:16] I wouldn't be cursing so much if git fetch wasn't terrible [20:38:16] or at least afford a draft beer or two...or three [20:38:24] New patchset: Asher; "struggles with scoping in puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44326 [20:38:29] why is it terrible? [20:38:36] I thought git was awesome? [20:38:40] git is awesome [20:38:42] fetch is not [20:38:48] submodule foreach is also not [20:38:49] o_O [20:38:55] binasher: ? [20:39:08] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44326 [20:39:08] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:39:17] <^demon> fetch is great for a single repository. [20:39:20] AaronSchulz: fetch can write corrupted objects if the network drops out [20:39:34] git-deploy is scaring me [20:39:35] AaronSchulz: once that happens, it'll never re-fetch unless you manually delete all corrupted objects [20:39:41] binasher: why's that? [20:39:44] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [20:39:48] it has no checksums? [20:39:59] AaronSchulz: it does, which is why it knows it is corrupt [20:40:08] Heh. [20:40:09] but it does not do anything automatically [20:40:12] great [20:40:12] right [20:40:16] isn't that dumb? :) [20:40:21] fetch can write corrupted objects? isn't the whole point of git to strongly guarantee against that? [20:40:23] heh [20:40:26] "BTW, corrupt, suckaaaas." [20:40:36] binasher: yes. annoying, right? [20:40:44] binasher: this is where bittorrent comes in :) [20:40:48] hence, skerrrrd [20:40:54] I'm going to replace the fetch stage with bittorrent [20:41:10] it'll actually be way, way faster [20:41:32] and it'll fix corrupted objects [20:41:57] and it'll make us all 2 inches taller :) [20:42:16] hey, it might break in other ways we're not expecting :) [20:42:38] !log reedy synchronized wmf-config/ [20:42:46] binasher: I was switching out the l10n stuff with it anyway [20:42:48] Logged the message, Master [20:43:16] i would still prefer this brave new world to occur after the eqiad migration [20:43:24] it's going to [20:43:32] i'll be needing to make lots of wmf-config changes, and sync-file works well and quickly [20:43:37] ok, phew [20:43:38] we've already decided that we're aborting for now [20:43:46] <^demon> Hmm, interesting core.fsyncobjectfiles [20:43:46] * binasher is no longer skerrd  [20:43:46] hahahaha [20:43:59] binasher: ah. I didn't realize you didn't know we aborted :) [20:44:11] makes sense, I'd be scared in that situation too [20:44:15] <^demon> There's a lot of interesting core.* options. [20:44:23] <^demon> Such as delta window, etc. [20:44:31] ^demon: looking at gerrit's code for this? [20:44:36] <^demon> No, git-config. 
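As an aside on the lock-file idea Ryan sketches a little earlier ("we can check a lock file to see if it's in the deploy or not"): a git pre-commit hook can be any executable, so the check could look roughly like the sketch below. This is hedged illustration only — the deploy.lock name and location are assumptions, not git-deploy's actual lock path:

    #!/usr/bin/env php
    <?php
    // Hypothetical .git/hooks/pre-commit: refuse ad-hoc commits on the deploy
    // host unless `git deploy start` has opened a deployment. The deploy.lock
    // name below is made up for illustration; the real lock may live elsewhere.
    $gitDir = getenv( 'GIT_DIR' );
    $lock = ( $gitDir !== false ? $gitDir : '.git' ) . '/deploy.lock';
    if ( !file_exists( $lock ) ) {
        fwrite( STDERR, "No deployment in progress; run `git deploy start` first.\n" );
        exit( 1 ); // non-zero exit aborts the commit
    }
    exit( 0 );

The pre-fetch hook mentioned in the discussion could run the same check before pulls.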
[20:44:39] i saw something about abortions in -dev but i misinterpreted, lol [20:44:43] <^demon> Ryan_Lane: http://man.he.net/man1/git-config [20:44:44] :D [20:45:13] binasher: what, you don't think we should implement bittorrent based code distribution in 2.5 business days? that's *plenty* of time :) [20:45:29] * Ryan_Lane runs away scared [20:45:59] that's tons of time. [20:46:14] <^demon> Ryan_Lane: If we optimize how the deployment's .git is repacked, the torrenting would almost never have to replicate packs that changed. [20:46:23] <^demon> Only have to replicate subsequent packs. [20:46:36] ^demon: that sounds like a good idea [20:47:03] we should make a git fetch replacement [20:47:19] then modify gerrit to also seed repos ;) [20:47:50] I guess that's difficult with ACLs [20:48:06] Reedy: Why the sync of wmf-config? [20:48:20] To push ExtensionMessages for wmf8 before scap finished [20:48:23] Ryan_Lane - so we have to get Tim to work on modifying scap to work in eqiad then [20:48:29] woosters: yep [20:48:51] I think we were pushing it in a number of other ways, too [20:49:06] we didn't really check how this would affect the misc systems [20:49:18] <^demon> Ryan_Lane: Not necessarily. We've already got all the repos totally replicated to formey. Could use those bare on-disk repos for seeding. [20:49:19] spence's checks probably would have been broken [20:49:35] <^demon> formey's never overloaded. [20:49:41] +1 ^demon [20:49:53] ^demon: I think it's necessary to restart the tracker and the seeder whenever something changes [20:50:07] with a new .torrent [20:50:25] could add that to a hook, though [20:50:43] for deployment the deployment host will always be the seeder [20:51:30] actually, could we just trickle changes to bare repos on each of the hosts we're distributing to, and deploy from those repos? [20:51:43] that's a really good idea [20:51:45] eys [20:51:47] *yes [20:51:58] that's going to be a lot more work, though [20:52:25] since we're not doing the fetch stage via http anymore, some things are actually a little easier [20:52:37] we don't need to worry about bare repos [20:52:39] <^demon> Wait, we were doing fetch over http? [20:52:40] !log removing uncommitted stuff from DNS SVN, creating RT for wikimedia.jp.net and softwarewikipedia.com [20:52:48] ^demon: currently, yes [20:52:50] Logged the message, Master [20:52:53] and we're not using bare repos [20:53:02] I'm running git update-server-info [20:53:12] on the parent repo and all submodules [20:53:23] !log "bak" directory from unclean SVN copied to /root on sockpuppet as dns_bak.tar.gz just in case [20:53:33] Logged the message, Master [20:54:29] we could actually run periodic fetches on all repos [20:54:45] on the deployment host and on the minions [20:54:45] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [20:54:56] we couldn't really do that using http [20:55:04] <^demon> I think we should replicate from manganese -> tin. [20:55:25] ^demon: well, I was thinking of just doing a git fetch on tin [20:55:30] on all the repos [20:55:37] then have a post-fetch hook [20:55:43] that will re-seed [20:55:56] hm. is any hook called for tags? [20:56:01] <^demon> Not afaik. [20:56:04] :( [20:56:23] sad times indeed [20:56:41] <^demon> Anyway, so my idea is replicate gerrit -> tin. Tin seeds that out (via torrents, like we discussed) to the deployment hosts. [20:56:56] that works for bare repos [20:57:10] <^demon> We could set the git dir explicitly from the clone. 
[20:57:16] ah [20:57:27] <^demon> So you could have the bare repo in /foo/bar/bare, and the working copy in /foo/bar/clone [20:57:32] hm. let's write up some docs on this [20:57:33] <^demon> And clone would set bare as it's gitdir. [20:57:40] since this will complicate things some [20:58:05] <^demon> What's kind of cool, is the replication+seeding will constantly happen in the background, so the fetch step becomes irrelevant to the deployer. [20:58:11] yep [20:58:16] <^demon> (as long as they do everything through gerrit, which they should) [20:58:26] we do local hacks as well [20:58:36] LocalSettings.php for instance is a local commit [20:58:37] "Because life is too easy." [20:58:54] New patchset: Andrew Bogott; "Explicitly include packages in generic::mysql::server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44333 [20:59:09] http://etherpad.wmflabs.org/pad/p/git-deploy-bittorrent [20:59:22] <^demon> Ryan_Lane: You'd just have to rebase possibly when checking out your change/tag you want to deploy. [20:59:30] why is puppet fubaring on this server....whyyyyyyyyy [20:59:45] i hate you praseodymium [20:59:52] even though i can now spell your name without cut and paste [21:00:15] praseodymium praseodymium [21:00:15] i can [21:00:19] * hashar flexes [21:00:23] ^demon: let's assume the first iteration still has developers doing the fetch/pull [21:00:36] what you're suggesting will be 2nd iteration [21:00:41] <^demon> *nod* [21:00:51] since we're aiming for next wmf branch release [21:01:40] hashar: see, proves its a good server name. [21:01:47] !log reinstalling sq48 w/lucid [21:01:57] Logged the message, Master [21:02:13] cmjohnson1: thanks for taking that one [21:03:03] mutante: anytime [21:03:43] New patchset: Asher; "more scope fixing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44334 [21:04:25] binasher: wasn't deployed though [21:04:28] err: /Stage[main]/Base::Sysctl/File[wikimedia-base-sysctl]: Could not evaluate: getaddrinfo: Name or service not known Could not retrieve file metadata for puppet:///files/misc/50-wikimedia-base.conf.sysctl: getaddrinfo: Name or service not known at /var/lib/git/operations/puppet/manifests/base.pp:283 [21:04:30] wtf. [21:04:43] PROBLEM - Host sq48 is DOWN: PING CRITICAL - Packet loss = 100% [21:05:18] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [21:05:19] wtf wtf [21:05:22] ....... [21:05:28] i just tried running a sync-file from fenari [21:05:31] how the hell can it not see the file.. [21:05:37] binasher: arent we on git deploy today? [21:05:45] the host keys have changed everywhere?/ [21:06:36] notpeter: do you know why ^ [21:06:49] mutante mentions you may cuz it happened last week if you recall [21:06:57] http://pastebin.mozilla.org/2064430 [21:08:02] deploy is fucked [21:08:10] while s1 is set to read only on some servers [21:08:12] ugh [21:08:13] this is awesome [21:08:20] New patchset: Andrew Bogott; "Explicitly include packages in generic::mysql::server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44333 [21:08:26] now we have two non-functional deployment systems [21:08:32] wait [21:08:39] :) [21:08:45] shouldn't /etc/ssh/known_hosts be readable by everyone? [21:08:52] Reedy: what's the latest status? 
[21:08:57] Ryan_Lane: ah, yes [21:09:00] this happened once before [21:09:03] yes [21:09:06] Ryan_Lane: yes, that's what Reedy was complaining about earlier [21:09:06] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=nfs1.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [21:09:11] fixed [21:09:12] ok, glad that is the prob vs all the keys changing [21:09:13] try now [21:09:23] phew, much better [21:09:25] Ryan_Lane: thank you! [21:09:26] \o/ [21:09:27] yw [21:09:28] I fixed it by adding 398 lines to my known_hosts file [21:09:30] !log asher synchronized wmf-config/db.php 'setting s1 to read-only for several minutes' [21:09:40] Logged the message, Master [21:09:56] <3 logmsgbot [21:09:57] Ryan_Lane: " [21:09:57] Considerations: [21:09:57] - authorized keys at /etc/ssh/userkeys/* and root-owned, instead of in homedirs" [21:10:05] <-- suggested by Jeff on RT 4338 [21:10:11] !log set global read_only=1; on enwiki master for "we have no unit tests, so lets test read-only in production" testing [21:10:21] AaronSchulz: enwiki is read only [21:10:21] Logged the message, Master [21:10:28] please tell me when to set back :) [21:10:34] RECOVERY - Host sq48 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [21:11:13] binasher: most errors are AFT [21:11:28] AFT is mostly errors [21:12:07] almost all extensions, except for Block code [21:12:21] things mostly look fine. let's take it back out of r-o mode [21:12:26] thanks binasher and AaronSchulz [21:13:08] * AaronSchulz is browsing around now [21:13:30] !log asher synchronized wmf-config/db.php 'taking s1 out of read-only' [21:13:33] !log set global read_only=0; on enwiki master [21:13:35] robla: https://blogs.msdn.com/cfs-filesystemfile.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-32-02-metablogapi/8054.image_5F00_thumb_5F00_35C6E986.png [21:13:40] Logged the message, Master [21:13:42] !log reedy synchronized php-1.21wmf7/extensions/Wikibase [21:13:45] I guess that was mostly fine [21:13:49] Logged the message, Master [21:13:59] Logged the message, Master [21:14:18] would be nice to do a wfReadOnly audit [21:14:23] LeslieCarr: yeah yeah :-P [21:14:27] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [21:14:54] PROBLEM - Frontend Squid HTTP on sq48 is CRITICAL: Connection refused [21:15:18] binasher: wait, we are sticking with scap for eqiad now? [21:15:18] i like it when i openly mock the idea of testing in production two hours ago [21:15:22] and now we test in production [21:15:31] PROBLEM - SSH on sq48 is CRITICAL: Connection refused [21:15:37] that awkward moment when you're asked to set the production db master for the 5th largest site to read-only because we have so little test coverage [21:15:40] can i tweet that? [21:15:40] i feel like the universe exists to underline my wit. [21:15:50] oh damn, 147 characters [21:15:55] binasher: I normally hate twitter, but I approve for this occasion [21:16:04] ok, i'll abbreviate some stuffs! 
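For reference on the "wfReadOnly audit" mentioned above: the s1 read-only test mostly surfaced extensions (AFT) that write without checking read-only mode, which fatals instead of rendering the standard "database locked" page. A minimal sketch of the guard such an audit would look for — wfReadOnly() and ReadOnlyError are MediaWiki core, while the function and table names here are hypothetical:

    <?php
    // Guard every write path behind the read-only check; ReadOnlyError renders
    // the standard "database locked" page instead of a fatal error.
    // doSomeWrite() and 'some_table' are made up for illustration.
    function doSomeWrite( $dbw, array $row ) {
        if ( wfReadOnly() ) {
            throw new ReadOnlyError();
        }
        $dbw->insert( 'some_table', $row, __METHOD__ );
    }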
[21:16:13] binasher: switch it to b/c and prod [21:16:13] dat moment [21:16:15] * AaronSchulz hides from LeslieCarr [21:16:16] and there ya go [21:16:16] :) [21:16:19] hehe [21:16:33] no offense :) [21:16:41] !log authdns-update [21:16:51] Logged the message, RobH [21:16:59] I'm not sure how unit testing helps test something systemic like read-only mode [21:17:03] !log https://twitter.com/ashbash/status/291655249075240960 [21:17:12] Logged the message, Master [21:17:17] binasher: at least you didn't try to post to Flutter: http://www.youtube.com/watch?v=BeLZCy-_m3s [21:17:25] heh [21:19:07] robla "unit" may be a misstatement of the type of testing.. but we should automated test cases, browser based if needed, of every major mediawiki function [21:19:15] What the hell is Scap doing [21:19:35] Hmm, only 40 minutes.. [21:19:52] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: Connection refused [21:20:49] binasher: working on that: https://github.com/wikimedia/qa-browsertests so far the test for UploadWizard in particular has prevented all sorts of mayhem [21:20:54] RECOVERY - SSH on sq48 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [21:21:00] New patchset: RobH; "praseodymium sucks, replacing with titanium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44340 [21:21:07] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44333 [21:21:32] Hooray for the QA team! Hip hip! [21:22:10] you mean hip hop! [21:22:10] New patchset: Andrew Bogott; "Mysql version is now computed by the mysql class." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44341 [21:22:19] chrismcmahon: very awesome, i'm glad you're here and working on it! snark about the state of things in this little 12 year old startup != lack of appreciation for progress finally being made! [21:22:27] thanks marktraceur having gone down a rabbit hole designing my last test I was feeling dumb [21:22:39] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44341 [21:22:49] binasher: it will get more public RSN [21:23:10] thank you chrismcmahon -- i also *heart* testing [21:23:10] New review: RobH; "do not self review" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44340 [21:23:11] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44340 [21:23:25] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44334 [21:23:35] testing means pages are more likely to be broken machines/ops fuckups than other weird messups [21:23:36] yay [21:23:37] still early days, but we're proving value in browser tests [21:23:45] my haiku was cut off [21:23:45] do not self review self review leads to failure watch me self review [21:23:47] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=nfs1.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [21:23:56] do not self review, self review leads to failure, watch me self review [21:24:11] ^ Any idea why the graph is so spiky? (compared to scap normally) [21:24:19] RobH: self-review leads to confidence and strength [21:24:50] it leads to being a maverick [21:25:09] Reedy: argh....wow, still mid-scap? yuck [21:25:10] notpeter: is this your stuff for + define mha_coredb_config ($topology={}) { [21:25:15] in puppet waiting merge? 
[21:25:19] Around srv220 [21:25:19] im merging my stuff and saw it [21:25:28] + define mha_coredb_config ($topology={}) { [21:25:29] + $shard = $topology[$name] [21:25:29] RobH: ask binasher, but it's probably legit [21:25:32] binasher: ^ [21:25:49] Just over half way :| [21:26:15] RoanKattouw_away: that server is insane. [21:26:20] im putting you on titanium instead. [21:26:30] anyone available to help Reedy? [21:26:48] I might just cancel it and run sync-dir on the l10n dir [21:27:06] well, we should figure out why its hosed [21:27:07] * Reedy does so [21:27:19] NFS? [21:27:21] RECOVERY - Frontend Squid HTTP on sq48 is OK: HTTP OK HTTP/1.0 200 OK - 601 bytes in 0.020 seconds [21:27:24] that pasted puppet code would be more legit if it worked. grrrr. [21:27:39] PROBLEM - Parsoid Varnish on praseodymium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:29:52] binasher: some code is so legit it doesn't even have to "work" [21:30:03] :) [21:30:11] word [21:30:18] truth [21:31:26] robla: network and CPU are more consistent using sync-dir at least.. [21:34:29] consistent being "flatline bad" instead of "spiky bad" [21:34:39] high [21:34:43] meaning it's doing something at least [21:35:02] pushing over 120 M/s [21:35:11] 123.4 apparently [21:35:19] I suppose that's actually a good thing [21:35:47] that's pegged network [21:35:56] indeed [21:36:04] We all love nfs1 [21:36:04] which machine is this ? [21:36:07] oh nfs1 [21:36:08] heh [21:36:12] crap [21:36:14] LESLIE NOTICED [21:36:16] * Reedy hides [21:36:19] hehehe [21:36:21] :) [21:37:18] It's tailing off.. [21:37:19] srv266: ssh: connect to host srv266 port 22: Connection timed out [21:37:22] Should be nearly done [21:38:08] Yeah, it's doing snapshot [21:38:23] paravoid: still around? (is not urgent if you are not/actually enjoying life) [21:38:28] on the last 10 at most [21:38:50] LeslieCarr: he has an afk responder, so I don't think so [21:38:55] ok [21:39:02] what is the magic secret to make puppet stop creating nagios config for a host that does not exist in site.pp? [21:39:21] put it in decom.pp [21:39:21] !log reedy synchronized php-1.21wmf8/cache/l10n/ [21:39:31] Logged the message, Master [21:39:41] notpeter: ah interesting. okthx [21:39:42] Jeff_Green: or, put a donk on it [21:39:53] a donk. is that a technical term? [21:39:53] Reedy: rats: http://test2.wikipedia.org/wiki/Main_Page [21:39:54] hallelujah! [21:40:08] Jeff_Green: http://www.youtube.com/watch?v=t_MNI8qRb6s [21:40:10] robla: wfm? [21:40:19] "TrustedXFF: hosts file missing. You need to download it." [21:40:29] Jeff_Green: yes - it means to set the nagios host on fire [21:40:30] aha [21:40:31] Jeff_Green: is it ever coming back to puppet with that host name? [21:40:32] Easily fixed [21:40:39] I didn't follow my usual list [21:41:06] mutante: this host in particular is (db1013) but I wondered the same about storage3 when I tried to murder it dead [21:41:20] Jeff_Green: although donk now has another meaning as well: http://jalopnik.com/5974931/your-guide-to-the-worlds-most-hated-car-culture-donks [21:41:39] !log reedy synchronized php-1.21wmf8/cache/interwiki.cdb [21:41:50] Logged the message, Master [21:42:02] !log reedy synchronized php-1.21wmf8/cache/trusted-xff.cdb [21:42:03] notpeter: I guess that would be a ticket to the eqiad queue :-P [21:42:11] Logged the message, Master [21:42:12] Jeff_Green: alternative is to just ACK it in Nagios.. had a discussion with paravoid the other day.. 
he says decom means "never comes back" [21:42:21] Jeff_Green: probably so, yes [21:42:32] Reedy: now we can sing "hallelujah!" :) [21:42:32] robla: Fixed, with less redlinks now too [21:42:52] Jeff_Green: actually.. scheduled downtime [21:42:57] fatal: Not a git repository: /tmp/new-mw-clone-424893351/mw/.git/modules/extensions/Wikibase [21:42:57] RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Wed Jan 16 21:42:41 UTC 2013 [21:43:00] * Reedy facepalms [21:43:03] NOT [21:43:04] A [21:43:05] mutante: ok. [21:43:05] GAIN [21:43:52] !log restarted nginx on ssl3001 (was depooled) [21:43:59] http://wikitech.wikimedia.org/view/Nagios#Scheduling_downtimes_with_a_shell_command [21:44:06] Logged the message, Mistress of the network gear. [21:44:51] (only bother if you really want to do it on a group of hosts, if it's just one or two it's not worth it, since you still have to convert dates to unix timestamp) [21:45:21] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Wed Jan 16 21:45:02 UTC 2013 [21:45:48] RECOVERY - Host virt1008 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [21:46:24] !log reedy synchronized php-1.21wmf8/extensions/Wikibase [21:46:33] Logged the message, Master [21:47:07] hey [21:47:07] so [21:47:16] what dir does git deploy deploy to? [21:47:32] because searchidx1001 is now completely out of / disk [21:47:38] so I'm gonna symlink some shit [21:47:44] oh groovy [21:47:58] eh, edgecase. it's still working. could be worse [21:48:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: mediawikiwiki to wmf8 [21:48:05] also, that's our backup cluster [21:48:14] Logged the message, Master [21:48:19] notpeter: /srv/deployment [21:48:24] Reedy: thanks! [21:50:00] PROBLEM - SSH on virt1008 is CRITICAL: Connection refused [21:50:01] robla: Thinking it's probably best leaving testwiki on wmf7. I think it's still on gitdeploy so moving it won't really make any difference. [21:50:35] Reedy: I think maybe Ryan_Lane may have moved it back to the scap dirs [21:50:57] extensions don't list versions [21:50:57] https://test.wikipedia.org/wiki/Special:Version [21:51:00] One easy way to test it [21:51:21] https://test.wikipedia.org/wiki/Special:Version [21:51:27] You're right it would seem [21:52:03] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki to 1.21wmf8 [21:52:13] Logged the message, Master [21:53:00] testwiki will be partially broken while I re-clone extensions [21:53:44] Reedy: if the house isn't burning down on mw.o I'm going to take a break and see what the browser tests did when I get back [21:54:02] chrismcmahon: sounds like a plan [21:54:31] thanks robla [21:56:18] RECOVERY - Lucene disk space on searchidx1001 is OK: DISK OK [21:58:08] !log reedy synchronized php-1.21wmf8/extensions/ [21:58:18] Logged the message, Master [21:58:57] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikidatawiki to 1.21wmf8 [21:59:06] Logged the message, Master [21:59:25] New patchset: Cmjohnson; "adding labsdb1003 to netboot.cfg & removing entry for virt1008/9 as they do not exist" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44345 [22:02:20] hiyaaa, binasher, when you got a sec, could you answer my question in the x carrier RT?
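On mutante's downtime pointer above: the wikitech recipe amounts to writing SCHEDULE_HOST_DOWNTIME lines into nagios's external command pipe, with the dates pre-converted to unix timestamps (the fiddly step he mentions). A minimal sketch of that approach, assuming a stock command-file path (it varies by install); the command format itself is standard nagios:

    # Schedule fixed downtime for a group of hosts via nagios's external
    # command file (a named pipe nagios polls). Format:
    # [now] SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger;duration;author;comment
    import time

    CMD_FILE = "/var/lib/nagios/rw/nagios.cmd"  # assumed path

    def schedule_downtime(hosts, hours, author, comment):
        now = int(time.time())              # the unix-timestamp conversion step
        start, end = now, now + hours * 3600
        with open(CMD_FILE, "w") as f:
            for host in hosts:
                # fixed=1, trigger=0; duration is ignored for fixed downtime
                f.write("[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;%s;%s\n"
                        % (now, host, start, end, end - start, author, comment))

    # e.g. the two hosts Jeff_Green mentions, quieted for a day
    schedule_downtime(["db1013", "storage3"], 24, "jeff", "pending decom")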
[22:02:22] https://rt.wikimedia.org/Ticket/Display.html?id=3158 [22:02:57] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44345 [22:17:25] Reedy: I moved test off of git deploy [22:17:30] it's back to scap [22:17:41] die( 'lol, nfs' ); told me that ;) [22:17:44] heh [22:18:01] it's going to be another couple weeks minimum before we try again, so no need to keep it on that system [22:18:18] <^demon> Reedy: Don't do it that way or Tim will change you :p [22:18:19] Ryan_Lane: you'll be keeping tin for this though right? [22:18:28] <^demon> To fwrite( STDERR, "Foo" ); die(1); [22:18:42] RobH: yeah, we're just changing a couple parts of the code, not scrapping the whole system ;) [22:21:35] !log reedy synchronized php-1.21wmf8/resources/startup.js [22:21:40] i assumed, but figured id ask =] [22:21:44] Logged the message, Master [22:21:48] * Ryan_Lane nods [22:21:57] we still need the host in pmtpa as well [22:22:11] !log reedy synchronized wmf-config/InitialiseSettings.php [22:22:20] Logged the message, Master [22:25:35] !log removing ganglios requirements to try to make it happily run in precise with icinga [22:25:45] Logged the message, Mistress of the network gear. [22:26:18] New patchset: Asher; "fix creation of $shards array from the top level keys of $role::coredb::config::topology" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44349 [22:27:21] PROBLEM - Parsoid on titanium is CRITICAL: Connection refused [22:32:02] notpeter: good news, the chmod hack's not in icinga's init.d [22:32:16] LeslieCarr: that makes me happy! [22:32:47] eh, i added that hack long time ago afair [22:33:00] without it we sometimes had permission issues that broke nagios [22:34:32] i can't imagine that ever happening… ;) [22:35:27] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [22:35:27] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [22:35:32] New patchset: Lcarr; "both fixing icinga's misccommands to log correctly (and ircbot) and updating init script to use purge script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44350 [22:36:02] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44350 [22:38:54] PROBLEM - NTP on virt1008 is CRITICAL: NTP CRITICAL: No response from NTP server [22:39:38] New patchset: Dzahn; "remove potassium from decom.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44351 [22:40:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44351 [22:40:26] New patchset: Reedy; "mediawikiwiki, testwiki, test2wiki, wikidatawiki to 1.21wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44352 [22:40:39] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44352 [22:44:33] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44349 [22:45:27] New review: Dereckson; "Configuration ok." 
[operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/44281 [23:09:29] New patchset: Asher; "fix lookup for secondary master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44355 [23:11:30] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44355 [23:14:49] New patchset: Lcarr; "puppet_checks.d no longer exists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44356 [23:15:40] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44356 [23:23:27] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [23:33:21] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:33:21] PROBLEM - Apache HTTP on mw73 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:33:21] PROBLEM - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:33:21] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:33:22] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:34:24] PROBLEM - Apache HTTP on srv291 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:34:33] PROBLEM - Apache HTTP on srv218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:34:34] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:34:42] PROBLEM - Apache HTTP on mw64 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:34:52] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:00] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:09] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:09] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:10] PROBLEM - Apache HTTP on srv216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:10] PROBLEM - Apache HTTP on srv250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:10] PROBLEM - Apache HTTP on srv296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:10] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:19] PROBLEM - Apache HTTP on srv292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:19] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:27] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:29] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:36] PROBLEM - Apache HTTP on srv255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:36] PROBLEM - Apache HTTP on srv252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:36] PROBLEM - Apache HTTP on srv257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:40] ok, who killed all the apaches ?
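Those alerts are nagios's check_http giving up after its 10-second socket timeout; the same sweep is easy to reproduce by hand when you want a second opinion on whether it's the monitoring or the apaches. A rough sketch, with the host list abbreviated from the alerts above:

    # Probe each apache the way the alerts do: GET / on port 80 with a
    # 10s timeout, reporting the status line or the timeout.
    import http.client
    import socket

    HOSTS = ["mw63", "mw73", "srv293", "srv253"]  # abbreviated from the flood

    def check_apache(host, timeout=10):
        try:
            conn = http.client.HTTPConnection(host, 80, timeout=timeout)
            conn.request("GET", "/")
            resp = conn.getresponse()
            # a healthy appserver answers quickly (a 301 to the wiki)
            return "OK - HTTP/1.1 %d %s" % (resp.status, resp.reason)
        except (socket.timeout, OSError) as e:
            return "CRITICAL - %s" % e

    for h in HOSTS:
        print(h, check_apache(h))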
[23:35:45] PROBLEM - Apache HTTP on srv215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:46] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:46] PROBLEM - Apache HTTP on srv295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:46] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:54] PROBLEM - Apache HTTP on mw62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:54] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:54] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:54] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:36:01] is it just nagios? [23:36:04] PROBLEM - Apache HTTP on srv214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:36:05] i still see Apache on srv300 f.e. [23:36:08] no, ganglia is down for them [23:36:12] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:36:12] PROBLEM - Apache HTTP on srv251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:36:19] http://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [23:36:40] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:36:46] these are across multiple racks [23:36:48] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:36:57] PROBLEM - Apache HTTP on mw67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:38:09] Request: GET http://en.wikipedia.org/w/api.php, from 208.80.152.74 via sq36.wikimedia.org (squid/2.7.STABLE9) to () [23:38:09] Error: ERR_CANNOT_FORWARD, errno [No Error] at Wed, 16 Jan 2013 23:37:55 GMT [23:39:56] I can ping them, but I can't telnet to port 80 [23:40:27] mine are completing when i telnet localhost 80 but a GET / fails [23:40:31] Reedy: did you deploy 5db75ef6a4fedb5baeb1acec059966b5652209d6? [23:40:45] this makes me think memcache server failing but … ? [23:40:49] To wmf8 [23:40:50] ah. yeah. it's just taking a long time to connect [23:40:51] mine arent [23:40:51] i can restart apache, no errors, see the process, see it listening on 80, yet they dont talk to me [23:41:01] maybe im not waiting long enough [23:41:16] yea... just took a long ass time. [23:41:31] ahha - server reached MaxClients setting, consider raising the MaxClients setting [23:41:36] reedy@fenari:/home/wikipedia/common$ mwscript mctest.php --wiki=enwiki [23:41:37] 127.0.0.1:11000 set: 100 incr: 0 get: 0 time: 0.0029680728912354 [23:41:40] yes [23:41:42] that is broken [23:41:50] it should list off all the memcache servers [23:41:59] i don't care/think localhost on fenari has memcache [23:42:15] Oh, that's a different script [23:42:23] I fixed mcc.php in master [23:42:25] that's what used to do it [23:42:37] that makes sense then [23:42:46] what's the new command for that ? [23:42:52] and so we can update our documentation ? [23:42:56] http://wikitech.wikimedia.org/view/Memcached [23:43:12] I see a decent number of memcache errors [23:43:15] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: Connection timed out [23:43:30] Yeah, mctest is broken in a similar way [23:43:39] enwikisource: Memcached error for key "" server "10.0.12.9:11211": ITEM TOO BIG [23:43:43] whyyyy is our memcache test broken.... thats shit.
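For reference, what mctest.php is meant to do is small: hit every configured memcached server with a burst of set/incr/get and time it, which is the per-server output Reedy pastes once his live hack lands a bit further down. A self-contained sketch in the same spirit, speaking the memcached text protocol directly so no client library is needed; the server list is illustrative, and the single recv() per command is a simplification that is fine for these tiny responses:

    # Per-server memcached smoke test in the spirit of mctest.php:
    # N sets, incrs and gets over the plain text protocol, timed.
    import socket
    import time

    def cmd(sock, line):
        sock.sendall(line)
        return sock.recv(4096)

    def test_server(host, port=11211, n=100):
        s = socket.create_connection((host, port), timeout=2)
        sets = incrs = gets = 0
        start = time.time()
        for i in range(n):
            key = ("mctest-%d" % i).encode()
            if cmd(s, b"set %s 0 30 1\r\n1\r\n" % key).startswith(b"STORED"):
                sets += 1
            if cmd(s, b"incr %s 1\r\n" % key)[:1].isdigit():   # new value on success
                incrs += 1
            if cmd(s, b"get %s\r\n" % key).startswith(b"VALUE"):
                gets += 1
        s.close()
        print("%s:%d set: %d incr: %d get: %d time: %s"
              % (host, port, sets, incrs, gets, time.time() - start))

    for server in ["10.0.12.1", "10.0.12.2"]:  # shortened; the real list is in config
        test_server(server)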
[23:44:05] Memcached error for key "nlwiki:preprocess-xml:ea3b184dadcf4e746431117d4a124f59:0" on server "10.0.12.13:11211": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [23:44:15] especially when i emailed on december 15th [23:44:22] and mentioned it was urgent [23:44:31] Ryan_Lane: those don't matter [23:44:34] so does anyone have a clue on this or do we need to ping someone else? [23:44:41] that's just fuck-all djvu metadata > 1mb [23:45:23] ok [23:45:39] maybe want to look at what I put in the security channel [23:46:29] Reedy: if you know how mcc.php is supposed to work, please update the documentation at http://wikitech.wikimedia.org/view/Memcached [23:47:27] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [23:48:22] peter called [23:48:28] I'm pushing out a fix [23:48:30] PROBLEM - Lucene on search1017 is CRITICAL: Connection timed out [23:48:39] ok [23:48:53] Reedy: from wmf-config: configchange lucene.php [23:48:53] ? [23:48:55] right? [23:49:07] or do I run this from elsewhere? [23:49:10] sync-file wmf-config/lucene.php [23:49:12] ok [23:49:47] !log laner synchronized wmf-config/lucene.php 'Switching search prefix pool to pmtpa' [23:49:56] Logged the message, Master [23:50:00] RECOVERY - Lucene on search1017 is OK: TCP OK - 0.027 second response time on port 8123 [23:50:09] RECOVERY - LVS Lucene on search-prefix.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [23:50:32] we may need to graceful the api apaches [23:50:36] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.730 second response time [23:50:40] any way to graceful just them? [23:50:45] RECOVERY - Apache HTTP on srv251 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.055 second response time [23:50:54] RECOVERY - Apache HTTP on mw64 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.912 second response time [23:50:55] well, they seem to be coming back, I guess [23:50:56] LeslieCarr: mctest.php fixed on fenari with a live hack [23:50:58] will commit properly [23:51:03] reedy@fenari:/home/wikipedia/common$ mwscript mctest.php --wiki=enwiki [23:51:03] 10.0.12.1:11211 set: 100 incr: 100 get: 100 time: 0.08999490737915 [23:51:03] 10.0.12.2:11211 set: 100 incr: 100 get: 100 time: 0.079756021499634 [23:51:04] yada yada [23:51:08] Reedy: was I supposed to push that lucene change into gerrit? 
:) [23:51:12] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [23:51:12] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [23:51:12] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time [23:51:12] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [23:51:12] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.060 second response time [23:51:13] RECOVERY - Apache HTTP on srv250 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [23:51:13] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [23:51:14] RECOVERY - Apache HTTP on srv216 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [23:51:14] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [23:51:15] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.060 second response time [23:51:16] RECOVERY - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 2582 bytes in 0.085 seconds [23:51:19] thank you Reedy [23:51:19] Ryan_Lane: needs doing separately [23:51:21] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [23:51:21] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.072 second response time [23:51:30] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.062 second response time [23:51:30] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [23:51:39] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.067 second response time [23:51:39] RECOVERY - Apache HTTP on srv257 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [23:51:39] RECOVERY - Apache HTTP on srv291 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [23:51:39] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [23:51:39] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.073 second response time [23:51:40] RECOVERY - Apache HTTP on srv255 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.073 second response time [23:51:48] RECOVERY - Apache HTTP on srv295 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [23:51:48] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [23:51:48] RECOVERY - Apache HTTP on srv215 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.061 second response time [23:51:48] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.060 second response time [23:51:57] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [23:51:57] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time [23:51:58] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [23:51:58] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.087 second
response time [23:51:58] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.072 second response time [23:52:06] RECOVERY - Apache HTTP on srv214 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.065 second response time [23:52:24] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.065 second response time [23:52:24] RECOVERY - Apache HTTP on srv218 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [23:52:24] PROBLEM - Puppet freshness on virt1008 is CRITICAL: Puppet has not run in the last 10 hours [23:52:42] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.069 second response time [23:52:42] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.061 second response time [23:54:30] thanks Ryan_Lane and Reedy :) [23:55:24] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [23:55:33] yw [23:55:37] now to push that change in [23:55:53] so, what exactly was the timeline ? when/who did the search move to tampa ? [23:55:55] no clue [23:55:58] :) [23:56:20] hehe [23:56:20] I don't see it in the common repo [23:56:26] ok, who did the change ? [23:56:29] maybe it wasn't checked in? [23:56:29] anyone know? ;) [23:56:32] blame domas (as the bugzilla quip suggests) [23:56:34] either asher or peter [23:56:46] sure, domas's fault [23:56:50] along with arbcom ;) [23:57:04] notpeter: DEFEND YOURSELF SIR [23:57:22] RobH: where is your glove? [23:57:32] i actually have a glove in my desk drawer [23:57:34] its cold here. [23:57:37] have gloves and hat. [23:57:49] New patchset: Ryan Lane; "Switch search prefix pool back to tampa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44364 [23:58:23] RobH: you need to throw it on the ground first [23:58:43] New review: Ryan Lane; "Already live." [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44364 [23:58:43] Change merged: Ryan Lane; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44364 [23:59:09] !log reedy synchronized php-1.21wmf8/maintenance/mctest.php [23:59:19] Logged the message, Master [23:59:36] !log reedy synchronized php-1.21wmf7/maintenance/mctest.php [23:59:46] Logged the message, Master
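One question above is left hanging: "any way to graceful just them?" The crude answer is a loop over just the API pool; a sketch assuming plain ssh access and stock apachectl, where the host list is a stand-in and the real pool would presumably come from a dsh group file or similar rather than being hard-coded:

    # Graceful-restart only the API apaches. "graceful" re-execs worker
    # processes without dropping in-flight connections, which is what you
    # want after a MaxClients pile-up like the one above.
    import subprocess

    API_POOL = ["srv254", "srv290", "mw64"]  # stand-in for the real pool

    for host in API_POOL:
        # check=False: keep sweeping even if one host is unreachable
        subprocess.run(["ssh", host, "sudo apache2ctl graceful"], check=False)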