[00:46:54] Syntax error on line 45 of /etc/apache2/sites-enabled/bugzilla.wikimedia.org:
[00:46:54] SSLCertificateFile: file '/etc/ssl/certs/star.wikimedia.org.crt' does not exist or is empty
[00:46:55] ...fail!
[00:48:00] those certificates are meant to come from puppet, I think
[00:49:39] install_certificate { "star-wikimedia": host => "bugzilla.wikimedia.org" } IIRC
[00:55:34] i wonder if someone changed or removed that recently, heh
[01:05:16] binasher: is that a precise host ?
[01:05:30] oh nm
[01:06:09] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[01:16:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:17:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds
[01:30:08] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[01:42:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 261 seconds
[01:42:35] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 252 seconds
[01:48:35] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 611s
[01:50:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:51:44] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 25 seconds
[01:53:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.391 seconds
[01:55:29] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 10s
[01:56:23] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 2 seconds
[02:20:09] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:26:09] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[02:28:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:38:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds
[02:44:37] RECOVERY - Puppet freshness on virt1002 is OK: puppet ran at Wed Jul 25 02:44:12 UTC 2012
[02:54:30] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Wed Jul 25 02:54:12 UTC 2012
[04:03:59] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours
[04:06:59] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[04:41:02] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours
[06:49:29] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15963
[06:51:33] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/15440
[07:35:16] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[08:34:32] morning
[08:40:33] PROBLEM - Puppet freshness on srv198 is CRITICAL: Puppet has not run in the last 10 hours
[08:47:27] PROBLEM - Puppet freshness on potassium is CRITICAL: Puppet has not run in the last 10 hours
[09:04:31] * apergos checks the timestamp
[09:04:34] yes, it was :-P
[09:05:31] hahaha
[09:24:06] hell o:-)
[09:24:14] damn its is 11:30 already
[09:24:52] paravoid: do you want to do the NFS beta stuff ?
[09:24:59] I would have to move out in 45 minutes though
[09:25:44] should be 5' thing, shouldn't it? :)
[09:26:56] should indeed :-D
[09:27:28] change is https://gerrit.wikimedia.org/r/#/c/15545/
[09:27:41] I did test it in labs
[09:27:45] though not in production :D
[09:27:59] ah, needs rebase
[09:28:04] should I try the rebase button? :)
[09:28:20] although it says "merge or rebase locally"
[09:28:41] if there is a conflict that will not work
[09:29:08] The rebase failed since conflicts occured during the merge.
[09:29:09] yay
[09:29:14] care to rebase it?
[09:29:22] doing it right now
[09:29:48] grmblblb
[09:30:06] <<<<<<< HEAD
[09:30:06] =======
[09:30:08] all my code
[09:30:11] >>>>> my change
[09:30:55] ah that is role::applicationserver::labs that got moved
[09:32:26] someone killed my code apparently :/
[09:33:50] paravoid: notpeter cleaned up the apache files to get ride of the ::labs role :-D
[09:33:57] using if $::realm instead hehe
[09:34:18] oh dear...
[09:35:41] if $lvsrealserver == true {
[09:35:42] include nfs::upload }
[09:35:43] kind of strange
[09:36:16] which also mean the jobrunner do not have ffs::upload anymore
[09:36:21] grr nfs::upload
[09:36:58] okay, let me grab a coffee while you're at it
[09:37:06] ("should be a 5' thing", famous last words)
[09:39:57] indeed
[09:40:04] New patchset: Hashar; "include nfs::upload if $upload, not $lvsrealserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16611
[09:40:22] paravoid: ^^^ apparently wrong parameter was used
[09:40:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16611
[09:40:45] well then rebase my change on that one
[09:40:56] and edit the condition to include the incoming nfs::upload::labs
[09:51:53] New patchset: Hashar; "apaches::monitoring::labs wraps to apaches::monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16612
[09:52:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16612
[09:53:41] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545
[09:53:47] and rebased my change on top of it
[09:54:02] paravoid: so you got to review https://gerrit.wikimedia.org/r/16611 https://gerrit.wikimedia.org/r/16612 which are prerequisites
[09:54:05] :-D
[09:54:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545
[09:55:00] yeah, on it
[09:57:33] gah, I hate this
[09:57:45] mark: you were talking with notpeter about apache.pp, didn't you?
[09:58:18] both approach are valid, I guess that is a matter of agreed on a coding style
[09:58:30] who's idea was the "if $apache == true { include apache }"?
[09:58:32] weren't ops willing to write such a guide?
[09:59:57] puppet is a declarative language
[10:00:04] if/elses should be used seldomly
[10:01:17] New review: Faidon; "I hate the rest of it, but the change itself is sane" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/16611
[10:01:18] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16611
[10:03:14] New review: Faidon; "Labs != beta, but oh well..." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/16612
[10:03:14] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16612
[10:03:39] oh dear
[10:08:36] mark: also, parameterized role classes? doesn't that defeat the point of role classes?
[10:08:59] isn't that just the common one ?
[10:09:11] could be made private or something ;-D
[10:09:23] ?
[10:09:56] oh dear, I just saw that
[10:09:59] oh dear god
[10:10:16] why, oh why
[10:13:39] hashar: so, beta nfs?
[10:14:04] https://gerrit.wikimedia.org/r/15545
[10:14:05] rebased it
[10:14:11] but need to get out to lunch with my wife :/
[10:14:17] well :-]
[10:14:24] but :-/ about getting out right now
[10:14:43] feel free to apply it if it is good
[10:14:48] I will fix the labs post lunch
[10:15:21] (I will just have to stop squid, run rsync again, umount stuff, rerun puppet to have the sym links set)
[10:16:31] I can wait :)
[10:17:15] should be back in an hour and a half roughly :-D Will ping you!
[10:17:19] out for now :-]
[10:18:19] does this thing even work?!?!
[10:20:32] hey
[10:20:40] what are you talking about?
[10:22:11] manifests/role/apache.pp
[10:22:46] what about it?
[10:23:05] the tons of if/elses
[10:23:21] in a common class, that then gets included parameterized below
[10:23:28] yeah I commented about that
[10:23:32] oh you did?
[10:23:33] not sure if peter has changed that yet
[10:23:36] apparently not
[10:23:47] https://gerrit.wikimedia.org/r/#/c/16122/
[10:23:59] i said i'd sit with him after he's changed this initial stuff
[10:24:28] other than the silly stuff, parameters are fine for a common class that's included by the actual role classes
[10:24:41] ah, why didn't we merge that?
[10:24:44] it's just comments, isn't it? :)
[10:25:20] almost just comments
[10:25:42] I don't see the point of doing class common($foo=false) { if $foo { } } class foo { class { "common": foo => true } }
[10:25:46] I think it's very bad style
[10:25:55] that's what I said isn't it
[10:25:59] yes
[10:26:06] I'm not arguing with you, just agreeing :)
[10:26:09] there's no point in putting everything in a common class when it's not common at all
[10:26:15] an "include" works just as well
[10:26:41] I feel a lot better after seeing your comments
[10:26:44] hehe
[10:27:09] there will be more comments, but i'll let him fix this first
[10:27:13] I thought this was after your reviews
[10:27:19] nah I hadn't seen any of this
[10:27:58] btw, your gerrit commit is a total fail of our code-review
[10:28:25] the fact that you had to do # FIXME: instead of commenting on the code review system
[10:28:37] aanyway
[10:28:38] i know
[10:28:46] for a long time I tried to review everything going in
[10:28:48] but only after merge
[10:28:51] and that just doesn't work :(
[10:29:03] yeah, I'd like seeing all the commits too
[10:29:20] not necessarily review them
[10:29:27] yeah
[10:29:33] but i've given up after a while
[10:29:36] too frustrating, the process :(
[10:29:47] yeah
[10:29:56] do you read wikitech? :)
[10:29:58] no followup process
[10:29:58] no
[10:30:20] there's a discussion about a gerrit replacement
[10:30:28] along with this: http://www.mediawiki.org/wiki/Git/Gerrit_evaluation
[10:30:36] i've seen that
[10:30:53] I've stepped in a bit, mostly to talk about the lack of a post-merge workflow with gerrit
[10:31:18] someone even made stats, something like 80% of commits in ops/puppet are self-reviewed
[10:32:07] Ryan is a big fan though :)
[10:33:02] hmm
[10:33:04] ryan fans
[10:33:24] I have a question
[10:33:43] how much punsishment would commons be able to take in terms of number of edits per minute
[10:33:55] has there been any research on this?
[10:34:23] I am not trying to break the site but we are discussing how much trrotelling of the bots are needed
[10:34:32] *hrotelling
[10:34:54] I cant type :(
[10:52:15] paravoid: apparently this is peter's response/followup work: https://gerrit.wikimedia.org/r/#/c/16532/3
[10:52:20] haven't looked at it yet, will do so in a bit
[11:04:44] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.61962793651 (gt 8.0)
[11:07:08] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.81811864
[11:07:35] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[11:31:25] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[11:39:53] New review: Mark Bergsma; "More comments! Inline." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/16532
[11:46:30] and I am back paravoid
[11:50:40] and I'm about to leave :)
[11:52:21] :-(
[11:52:28] will context switch to something else
[11:52:38] paravoid: will you be there later this afternoon?
[11:56:53] yes
[11:56:58] I'm just going out for lunch
[11:57:17] after three of four days at home, I need to see a bit of a sun :)
[11:59:22] yeah
[11:59:26] i'm working outside
[11:59:47] it's lovely
[12:00:53] it's peaking at 36 today and 37 tomorrow
[12:00:57] so, I don't want to do that :)
[12:01:20] anyway, bye for now
[12:07:21] just 31 here
[12:07:24] i'm sitting in the shade ;)
[12:07:26] cya
[12:33:53] PROBLEM - Puppet freshness on cp1020 is CRITICAL: Puppet has not run in the last 10 hours
[12:34:56] PROBLEM - Puppet freshness on mw58 is CRITICAL: Puppet has not run in the last 10 hours
[12:35:59] PROBLEM - Puppet freshness on srv209 is CRITICAL: Puppet has not run in the last 10 hours
[12:36:53] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours
[12:37:56] PROBLEM - Puppet freshness on lvs5 is CRITICAL: Puppet has not run in the last 10 hours
[13:11:28] New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606
[13:11:45] and I though it was a draft
[13:11:46] bah
[13:12:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606
[13:12:42] <^demon> Drafts will show up on IRC anyway.
[13:13:07] New patchset: Bhartshorne; "updating eqiad cluster hash. adding ms-be1009-12 to dhcp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16626
[13:13:44] New review: Hashar; "Still need to complete the over manpages and add a puppet class to install the script in /usr/local/..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606
[13:13:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16626
[13:14:09] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16626
[13:14:57] why are you located ben?
[13:15:01] where
[13:15:08] you're up too early for the west coast ;)
[13:15:55] I'm in Massachussetts atm.
[13:15:57] New review: Hashar; "over -> other." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606
[13:16:10] ah :)
[13:16:27] paravoid: can i merge your apache monitoring changces on sockpuppet?
[13:16:44] maplebed: hello ben :-)
[13:16:53] (I've been on the east coast since wikimania)
[13:16:57] maplebed: we did some cleanup this morning they are probably safe
[13:17:07] probably doesn't make me very happy...
[13:17:10] paravoid is out for lunch / get fresh air.
[13:17:22] it means that when I merge it I have to watch it.
[13:17:31] but I'm kinda in the middle of something else.
[13:17:38] oh
[13:17:46] you can revert his change I guess
[13:18:02] :-(
[13:18:21] lemme look at it again to see what I'd have to watch.
[13:18:35] I have no idea what got merged and what was not.
[13:18:57] https://gerrit.wikimedia.org/r/#/c/16612/ (commit 6ece81f965 ) is a wrapper for a deprecated class
[13:19:12] only going to cause troubles on labs (and I am not even sure labs actually uses that class anyway :-D
[13:19:20] I think it's two changes
[13:19:29] https://gerrit.wikimedia.org/r/#/c/16611/ and https://gerrit.wikimedia.org/r/#/c/16612/
[13:20:05] ah 16611 is some cleanup $lvsrealserver was used to conditionally include nfs::upload
[13:20:25] which would mean that any roles not having lvsrealserver set to true would lack nfs::upload :-)
[13:20:32] and bits apaches probably received /mnt/upload too
[13:20:41] might be a bit more tricky to watch / verify
[13:20:58] well, because of how puppet works it won't unmount upload on those hosts,
[13:21:05] but it does mean that new hosts won't have ti.
[13:21:06] it.
[13:21:38] mark: do you know if it will cause an error to have a conditional with no body in puppet?
[13:21:50] not sure
[13:21:50] i.e. if $var == true {}
[13:22:03] why would you ever have that though?
[13:22:12] I'm just looking at the change.
[13:22:16] odd
[13:22:21] it stripped out the content but left the conditional.
[13:22:24] https://gerrit.wikimedia.org/r/#/c/16611/1/manifests/role/apache.pp
[13:22:48] that's not an empty body
[13:22:56] it's not "include" means: "insert the contents of that class here"
[13:22:59] <^demon> maplebed: How long are you in Boston?
[13:23:17] ^demon: on cape cod actually. south yarmouth.
[13:23:45] <^demon> Ah, I'm going to be in Boston Aug 18-19 visiting some family friends I haven't seen in years. Just wondering :)
[13:24:10] I'm flying home on the 29th of july. no overlap.
[13:25:27] I do love east coast storms though. Yesterday afternoon: http://screencast.com/t/SqhvcIIbk
[13:26:06] <^demon> Yeah, that was yesterday afternoon here too.
[13:26:21] <^demon> Blazingly hot all day, clouds rolled in around 3ish, stormed til about 8.
[13:26:59] hashar: I'm going to revert https://gerrit.wikimedia.org/r/#/c/16611/1/manifests/role/apache.pp - both because it seems like a change that affects many servers and because it leaves a conditional with no body.
[13:27:12] hashar: https://gerrit.wikimedia.org/r/#/c/16612/1/manifests/apaches.pp seems harmless (outside of labs) though, so I'll merge that one.
[13:27:22] are you ok with that?
[13:27:42] PROBLEM - Host srv206 is DOWN: PING CRITICAL - Packet loss = 100%
[13:27:45] maplebed: I'm re-implementing a lot of that in further apache cleanup
[13:28:01] hashar: if this is an issue, ping me
[13:28:10] as I'm going to do a lot more on those manifests in the next couple of days
[13:28:13] mark: I was looking at the $lvsrealserver conditional.
[13:28:26] after removing the include, all it has is a comment.
[13:28:38] yeah
[13:28:39] true
[13:28:45] odd change
[13:29:06] er?
[13:29:12] huh
[13:29:35] maplebed: cleanup continues on those manifests. if you need to revert, go for it
[13:29:46] k.
[13:29:50] New patchset: Bhartshorne; "Revert "include nfs::upload if $upload, not $lvsrealserver"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16627
[13:30:04] that is an arifact of a copy/paste error of mine
[13:30:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16627
[13:30:33] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16627
[13:30:42] paravoid: ^^^
[13:31:20] hooray for reading sockpuppet's fetch diff. :P
[13:33:41] maplebed: sorry had some issue locally
[13:34:01] maplebed: fine with me :-] will have faidon to reapply 16611 (the empty body one)
[13:35:07] mark: do you have time to poke at the switch for ms-be1009-12 or should I wait for leslie?
[13:35:29] I think networking is not yet set up for those hosts.
[13:35:29] what's the issue?
[13:35:32] ok
[13:35:42] I don't have port numbers,
[13:35:55] but they should have all attempted to PXE boot in the last 5 minutes or so
[13:36:23] didn't leslie set up some?
[13:36:30] i see none on the row C switch
[13:36:47] she tried yesterday for ms-be1005-8 but it didn't work.
[13:37:03] what was the problem there?
[13:37:32] she could probably explain it better... I think she couldn't find the mac addresses on the switch
[13:37:41] do you have the macs?
[13:37:48] i do. I'll pm them to you.
[13:37:52] ok
[13:38:01] hashar: I think that I've done a lot of the stuff that you were doing I've done in https://gerrit.wikimedia.org/r/#/c/16532/
[13:38:12] which I'm continuing to work on
[13:38:44] notpeter: ahh good to know :-)
[13:38:49] 1005-8 are in C2, 1009-12 are in C3.
[13:39:51] heh
[13:39:56] that would be because the respective switch ports are disabled
[13:40:04] then the switch can't learn any mac addresses either
[13:40:51] ossm.
[13:41:45] also, wow. that was one of the worst sentences I've written in a long time...
[13:42:09] notpeter: what, only two extra words...
[13:42:21] just take those out and you're fine.
[13:43:12] I need to use | grep --no-stupid
[13:43:16] on all of my IMs
[13:43:17] New review: Nikerabbit; "Scheduling this for i18n deployment next Tuesday." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/16498
[13:44:50] notpeter: I will also have faidon to apply https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=commitdiff;h=2307433e4ac0a882d035287f204565f1c810390f;hp=6ece81f96556979ac9eae1a34156e0efc0b66176
[13:45:12] notpeter: that introduce the new nfs::upload::labs and use that when realm is not production.
[13:45:50] notpeter: the reason is that the beta cluster will just uses symbolic links to some subdirectories of /data/project (a per project shared directory)
[13:48:34] hashar: cool, that looks good
[13:48:50] I was trying to figure out exactly what to do with nfs::upload::labs
[13:48:55] it's already in apaches.pp
[13:49:08] mar_k asked me to move it over, which seemed reasonable
[13:49:14] so yeah, that stuff looks excellent
[13:49:48] well I am not sure how we would prefer to handle it
[13:49:54] s/we/ops/
[13:50:10] I know Faidon is not a fan of if( $realm ) {} else {}
[13:50:30] it's not ideal, but it's how we're doing things currently
[13:50:36] I'm open to other suggestions
[13:51:09] I am fine with any way as long as I can easily hack the puppet files to insert my 'beta' specific kind :-)
[13:51:37] I for sure love the parameterized common class
[14:03:54] New patchset: Mark Bergsma; "Add public1-c-eqiad and private1-c-eqiad subnets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16629
[14:04:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16629
[14:05:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16629
[14:05:12] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours
[14:05:51] New patchset: Hashar; "Fix bits VLC when enable_geoiplookup is disabled" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15445
[14:06:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15445
[14:06:31] New review: Hashar; "Patchset 5 pass the hostname (test.wikipedia.org) and escape it in the bits.inc.vcl.erb template." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15445
[14:09:24] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:35] New patchset: Hashar; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304
[14:17:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304
[14:17:19] New review: Hashar; "Patchset 12 rely on cluster_options hash that was introduced in I033e5b7f and set the default values..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13304
[14:27:55] New review: Hashar; "Patchset 12 was deployed on deployment-cache-bits02. See http://commons.wikimedia.beta.wmflabs.org/w..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13304
[14:28:05] New review: Hashar; "Patchset 5 was deployed on deployment-cache-bits02. See http://commons.wikimedia.beta.wmflabs.org/wi..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15445
[14:37:32] Logged the message, Master
[14:41:39] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours
[14:52:36] RECOVERY - swift-object-replicator on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[14:52:54] RECOVERY - swift-container-replicator on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[14:58:45] RECOVERY - swift-container-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[14:58:54] RECOVERY - swift-object-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[14:59:39] RECOVERY - swift-account-replicator on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[15:11:48] heya
[15:11:57] that took longer than expected :)
[15:12:20] PROBLEM - swift-container-auditor on ms-be1005 is CRITICAL: Connection refused by host
[15:12:21] PROBLEM - swift-container-server on ms-be1006 is CRITICAL: Connection refused by host
[15:12:21] PROBLEM - swift-object-server on ms-be1006 is CRITICAL: Connection refused by host
[15:12:21] PROBLEM - swift-account-replicator on ms-be1006 is CRITICAL: Connection refused by host
[15:12:21] PROBLEM - swift-account-auditor on ms-be1009 is CRITICAL: Connection refused by host
[15:12:21] PROBLEM - swift-object-auditor on ms-be1005 is CRITICAL: Connection refused by host
[15:12:21] PROBLEM - swift-container-auditor on ms-be1009 is CRITICAL: Connection refused by host
[15:12:22] PROBLEM - swift-object-auditor on ms-be1009 is CRITICAL: Connection refused by host
[15:12:22] PROBLEM - swift-account-auditor on ms-be1005 is CRITICAL: Connection refused by host
[15:12:38] PROBLEM - swift-account-reaper on ms-be1005 is CRITICAL: Connection refused by host
[15:12:38] PROBLEM - swift-object-replicator on ms-be1005 is CRITICAL: Connection refused by host
[15:12:38] PROBLEM - swift-container-replicator on ms-be1005 is CRITICAL: Connection refused by host
[15:12:38] PROBLEM - swift-account-reaper on ms-be1009 is CRITICAL: Connection refused by host
[15:12:38] PROBLEM - swift-container-replicator on ms-be1009 is CRITICAL: Connection refused by host
[15:12:39] PROBLEM - swift-container-updater on ms-be1006 is CRITICAL: Connection refused by host
[15:12:39] PROBLEM - swift-object-updater on ms-be1006 is CRITICAL: Connection refused by host
[15:12:40] PROBLEM - swift-object-replicator on ms-be1009 is CRITICAL: Connection refused by host
[15:12:40] PROBLEM - swift-account-server on ms-be1006 is CRITICAL: Connection refused by host
[15:12:56] PROBLEM - swift-object-auditor on ms-be1006 is CRITICAL: Connection refused by host
[15:12:56] PROBLEM - swift-container-auditor on ms-be1006 is CRITICAL: Connection refused by host
[15:12:56] PROBLEM - swift-account-replicator on ms-be1005 is CRITICAL: Connection refused by host
[15:12:56] PROBLEM - swift-container-server on ms-be1005 is CRITICAL: Connection refused by host
[15:12:56] PROBLEM - swift-object-server on ms-be1009 is CRITICAL: Connection refused by host
[15:12:57] PROBLEM - swift-object-server on ms-be1005 is CRITICAL: Connection refused by host
[15:12:57] PROBLEM - swift-account-auditor on ms-be1006 is CRITICAL: Connection refused by host
[15:13:06] PROBLEM - swift-container-server on ms-be1009 is CRITICAL: Connection refused by host
[15:13:14] PROBLEM - swift-account-replicator on ms-be1009 is CRITICAL: Connection refused by host
[15:13:20] paravoid: now look what you've done
[15:13:23] PROBLEM - swift-account-reaper on ms-be1006 is CRITICAL: Connection refused by host
[15:13:23] PROBLEM - swift-account-server on ms-be1009 is CRITICAL: Connection refused by host
[15:13:23] PROBLEM - swift-object-replicator on ms-be1006 is CRITICAL: Connection refused by host
[15:13:23] PROBLEM - swift-container-replicator on ms-be1006 is CRITICAL: Connection refused by host
[15:13:23] PROBLEM - swift-object-updater on ms-be1009 is CRITICAL: Connection refused by host
[15:13:24] PROBLEM - swift-container-updater on ms-be1009 is CRITICAL: Connection refused by host
[15:13:24] PROBLEM - swift-container-updater on ms-be1005 is CRITICAL: Connection refused by host
[15:13:32] PROBLEM - swift-account-server on ms-be1005 is CRITICAL: Connection refused by host
[15:13:32] PROBLEM - swift-object-updater on ms-be1005 is CRITICAL: Connection refused by host
[15:14:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16420
[15:15:47] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16457
[15:20:30] New review: Reedy; "Where's the talk NS?" [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/16432
[15:22:14] paravoid: so not peter is refactoring apache roles :-
[15:22:43] paravoid: and one change got reverted :/
[15:22:55] apparently you forgot to merge it on sock puppet!
[15:23:09] Iefedf2b7 Revert "include nfs::upload if $upload, not $lvsrealserver" (MERGED)
[15:25:20] New patchset: Hashar; "include nfs::upload if $upload, not $lvsrealserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16632
[15:25:59] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16558
[15:25:59] New review: Hashar; "Ben said:" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/16632
[15:25:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16632
[15:26:04] New review: Hashar; "Resubmitted with https://gerrit.wikimedia.org/r/#/c/16632/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16611
[15:27:07] New patchset: Hashar; "include nfs::upload if $upload, not $lvsrealserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16632
[15:27:45] New review: Hashar; "Patchset 2 removes the empty block:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16632
[15:27:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16632
[15:28:59] paravoid: notpeter: added you to https://gerrit.wikimedia.org/r/#/c/16632/ which make the apache role to include nfs::upload when $upload is true instead of when $lvsrealserver is true.
[15:29:26] PROBLEM - NTP on ms-be1006 is CRITICAL: NTP CRITICAL: No response from NTP server
[15:30:11] PROBLEM - NTP on ms-be1009 is CRITICAL: NTP CRITICAL: No response from NTP server
[15:30:11] PROBLEM - NTP on ms-be1005 is CRITICAL: NTP CRITICAL: No response from NTP server
[15:31:36] reverted?
[15:31:57] and yeah, I saw the refactoring before I left
[15:32:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:32:46] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16577
[15:32:56] paravoid: the reason for reverting the change was because of the empty block that was left
[15:33:11] paravoid: the second is because ben had to merge some swift changes ;)
[15:34:11] paravoid: I wasn't sure how wide ranging the effect would be and i didn't want to just merge and walk away. top that off with the empty conditional thing and I just reverted to let you poke again when you returned.
[15:34:20] okay, thanks
[15:34:26] I didn't merge on sockpuppet on purpose
[15:34:34] I was about to leave and didn't want to push something then leave
[15:34:35] how come?
[15:34:55] sadly merging in gerrit and not publishing on sockpuppet means that you block all subsequent merges.
[15:35:02] I couldn't publish my change without reverting yours.
[15:35:03] I was going to merge it now, but you started working early :)
[15:35:18] well, mark's also pushed some stuff while you were gone.
[15:35:50] better to just leave it pending in gerrit (or mark +2 and submit but don't publish)
[15:36:07] but anyway, it all worked out.
[15:36:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.636 seconds
[15:37:51] New patchset: Ottomata; "statisics.pp - Renaming some variables, fixing undefined $gerrit_stats_data_path." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16633
[15:38:26] RECOVERY - swift-account-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[15:38:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16633
[15:39:22] New patchset: Ottomata; "statisics.pp - Renaming some variables, fixing undefined $gerrit_stats_data_path." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16633
[15:39:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16633
[15:40:08] meeester maplebed or paravoid, that is a quick little approve+merge, would one of you have a sec for me there?
[15:40:28] https://gerrit.wikimedia.org/r/16633
[15:41:11] mark: yup.
[15:41:15] err..
[15:41:17] ottomata: yup.
[15:41:28] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16633
[15:41:51] ottomata: done.
[15:41:59] danke
[15:42:24] nitpicking: you shouldn't quote variables, i.e. "$foo" is "wrong"
[15:42:45] hashar: sooo, let's resume
[15:43:03] oh, really?
[15:43:09] i started doing that to be consistent
[15:43:14] what about
[15:43:20] "$path/and/more/path"
[15:43:21] ?
[15:43:32] that's fine, although I'd prefer "${path}/and/more/path"
[15:43:46] i do that in bash scripts
[15:43:49] although not always in puppet
[15:43:54] only when needed I guess
[15:44:22] paravoid, is that your opinion or a more general wm or puppet style recommendation?
[15:44:23] that's also mentioned on the style guide, although we don't use that (yet?)
[15:44:33] haha, you answered my question before I asked it
[15:44:34] ok cool
[15:44:37] :)
[15:44:51] in the future I will not quote variables unless I need to
[15:45:13] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16234
[15:47:54] mark: when you return, something is still preventing the row C hosts from working - they can't run puppet right.
[15:48:01] I'm not sure what's causing it.
[15:48:07] the error is that it can't run apt-get update,
[15:48:38] and looking at the host they're not getting the config to use brewster as a proxy
[15:48:42] but I don't know why they're not getting that config.
[15:49:13] maplebed: you mean on the installer?
[15:49:16] or is this installed systems?
[15:49:54] pxeboot successfully runs and on reboot I get a login prompt. after connecting via sockpuppet and teh install key, and doing the puppetca stuff, puppet fails to run.
[15:50:21] do you want me to have a look?
[15:50:25] a second pair of eyes might help
[15:50:35] but if you prefer mark handling it, that's okay too
[15:50:47] you're more than welcome to.
[15:50:57] give me a hostname?
[15:51:02] the reason I pinged mark is that there've been a bunch of network-related things in between the row c hosts and them working.
[15:51:06] I fear this is another.
[15:51:21] ms-be1006.eqiad.wmnet [15:51:49] this problem manifests on ms-be1005 through ms-be1012 [15:52:13] (but not ms-be1001 or 1002, eqiad hosts not in row C) [15:52:20] these hosts are the first to come online in row C. [15:52:27] paravoid: sure. Was doing some mediawiki review :) [15:53:18] paravoid: so we were at https://gerrit.wikimedia.org/r/#/c/15545/ [15:53:32] paravoid: you attempted to merge it and it got a conflict which I have fixed [15:53:57] maplebed: networking is fine [15:54:05] something else is going on [15:54:10] I'll attempt running puppet and see [15:54:49] back [15:54:59] it's probably the squid on brewster [15:55:06] hi mark! [15:55:13] I looked at that and tested with curl. [15:55:19] brewster let me through [15:55:33] the restriction is 10.64/12, which should include the new hosts' IPs, right? [15:55:55] yeah [15:56:12] I didn't try putting the proxy config in place and testing that, [15:56:20] but puppet really should put that in place before trying to run apt [15:56:24] and it looks like it's not. [15:56:28] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545 [15:56:30] it can't [15:56:43] the apt.conf.d for the proxy is not there [15:56:52] but it runs in a different stage, so it should run before apt-get update [15:56:55] so it contacts directly [15:56:56] I'm running puppet now to check [15:57:01] so it's fine then [15:57:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545 [15:57:05] our repo works, security.ubuntu.com isn't [15:57:13] New review: Hashar; "Rebased on I70fd2fe5 which adds the :" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16632 [15:57:15] +1 paravoid. 
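Editorial note on the "10.64/12" question asked at 15:55:33: the range check can be reproduced with a minimal sketch like the one below, using Python's standard `ipaddress` module. The host addresses are made-up examples for illustration only, not the real IPs of ms-be1005 through ms-be1012, and the helper name `allowed_by_acl` is hypothetical.

```python
import ipaddress

# The squid ACL range quoted in the log ("10.64/12"), written out in full.
ALLOWED = ipaddress.ip_network("10.64.0.0/12")


def allowed_by_acl(host_ip: str) -> bool:
    """Return True if host_ip falls inside the ACL range."""
    return ipaddress.ip_address(host_ip) in ALLOWED


# Hypothetical addresses, chosen only to show one hit and one miss.
for ip in ("10.64.32.10", "10.80.0.1"):
    print(ip, allowed_by_acl(ip))
```

A /12 that starts at 10.64.0.0 spans 10.64.0.0 through 10.79.255.255, which is consistent with the "yeah" reply in the log: any host numbered inside that span would pass the proxy restriction.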
[15:57:17] that's what the proxy is for [15:57:22] yes [15:57:34] there's no http_proxy set in the environment and the apt.conf.d configuration is not there [15:57:48] so then apt update times out [15:57:49] is the proxy config supposed to show up in the install instead of from puppet? [15:57:52] not a big deal right? [15:57:58] no it's from puppet [15:58:03] when the update times out, everything else says 'dependency failed' [15:58:09] (and doesn't execute) [15:58:21] hum [15:58:24] perhaps we should fix that [15:58:29] :P [15:58:59] paravoid: I did some more rebasing :/ Changes are 15545 and 16632 [15:59:02] thing is, this problem didn't crop up on ms-be1001 on monday. [15:59:26] the installer also puts a proxy: [15:59:27] d-i mirror/http/proxy string http://brewster.wikimedia.org:8080 [15:59:40] yeah, I checked that [16:00:00] it's private1-c-eqiad.cfg, which has that [16:00:04] yes [16:00:25] and the install did work ok, so that was probably effective. [16:00:36] this is the first post-install puppet run that's failing. [16:01:16] what's the reason the apt.conf/sources.list aren't in the first stage? [16:01:29] we currently have base::apt (include apt.conf, sources.list and some other stuff) and base::apt::update [16:01:39] base::apt::update runs on the first stage, base::apt runs on the main stage [16:01:46] any particular reason for that? 
[16:02:03] base::apt could run in the first stage as well [16:02:43] that doesn't explain why it works on every other host out there [16:02:44] we could put a few other things there [16:02:47] like setting the root pass [16:02:52] no it doesn't [16:03:27] we can easily work around it too [16:03:36] http_proxy="http://brewster.wikimedia.org:8080/" puppetd -vt [16:03:46] but I'd like to find the root cause first [16:04:26] I think I found it [16:04:44] private1-c-eqiad.cfg wasn't included by the installer [16:04:49] the file exists, but it's not listed in netboot.cfg [16:05:28] indeed [16:05:31] heh [16:05:34] adding [16:05:39] okay [16:05:49] want me to move the apt stuff into the first stage? [16:05:49] I'll tell one of them to do a complete reinstall to verify. [16:05:54] sure go ahead [16:06:08] can we do one change at a time [16:06:11] maplebed: check whether they have static network config [16:06:13] to verify the fix? [16:06:29] mark: they don't [16:06:31] looks dynamic to me. [16:06:35] then reinstall [16:07:01] my git pushes are super slow lately :( [16:08:06] New patchset: Mark Bergsma; "Include row C and D private subnets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16635 [16:08:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16635 [16:09:38] running puppet on brewster [16:10:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:11:38] and that's gonna take a while obviously ;) [16:12:39] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:15:08] paravoid: going to move out soon so I guess we should postpone until tomorrow morning :-D [16:18:10] hashar: argh [16:18:11] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [16:18:56] RECOVERY - Host srv266 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:21:47] PROBLEM - SSH on ms-be1005 is CRITICAL: Connection refused [16:22:00] it ran [16:22:25] 
paravoid: :-] [16:22:32] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused [16:22:41] paravoid: I am out now :-) see you tomorrow! [16:23:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [16:27:14] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: Connection refused [16:27:14] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: Connection refused [16:27:41] PROBLEM - Memcached on ms-fe1003 is CRITICAL: Connection refused [16:27:41] PROBLEM - Memcached on ms-fe1004 is CRITICAL: Connection refused [16:28:17] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [16:33:41] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:41:47] New patchset: J; "migrate wikimedia-job-runner to puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16501 [16:42:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16501 [16:52:39] Logged the message, Master [16:55:08] PROBLEM - Memcached on srv266 is CRITICAL: Connection refused [16:55:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:56:02] PROBLEM - SSH on srv266 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:51] PROBLEM - Host srv266 is DOWN: PING CRITICAL - Packet loss = 100% [16:57:54] New patchset: Mark Bergsma; "Update subnets hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16639 [16:58:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16639 [17:01:41] New patchset: Mark Bergsma; "Change analytics1-c-eqiad subnet to address plan" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16640 [17:02:17] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16640 [17:03:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16639 [17:03:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16640 [17:04:53] RECOVERY - Host srv266 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:05:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.925 seconds [17:05:56] RECOVERY - SSH on srv266 is OK: SSH OK - OpenSSH_5.3 (protocol 2.0) [17:09:43] New patchset: Mark Bergsma; "Add asw-c-eqiad to Torrus" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16641 [17:10:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16641 [17:10:34] !log Added asw-c-eqiad to Torrus, RANCID, Observium [17:10:43] Logged the message, Master [17:15:14] j^: don't we also need package wikimedia-job-runner ensure => absent? [17:15:47] which should run before you other stuff too, or else dpkg will delete the initscript/default file :) [17:17:19] j^: also, you didn't change site.pp, which means it won't actually apply to existing production [17:17:28] and puppet will break instead [17:23:11] who else thinks we are going to see 7? [17:24:14] ? [17:24:25] paravoid: can your approve https://gerrit.wikimedia.org/r/16642 [17:24:40] wrong channel sorry [17:24:50] RECOVERY - Memcached on srv266 is OK: TCP OK - 0.001 second response time on port 11000 [17:25:29] preilly: hi patrick [17:25:40] paravoid: hi [17:26:11] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16642 [17:26:48] merged it, let me find out where I need to force puppet runs [17:28:26] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused [17:28:50] ah, cp1041-1044, running it now [17:31:11] preilly: all done [17:31:29] paravoid: okay great thanks! 
[17:33:59] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.352 second response time [17:36:23] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [17:39:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:12] New patchset: J; "migrate wikimedia-job-runner to puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16501 [17:40:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16501 [17:51:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [18:02:31] !log streaming hot backup of db1033 to db63, intended db12 replacement and first precise enwiki db [18:02:40] Logged the message, Master [18:03:17] hot [18:03:29] :) [18:24:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:35:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.526 seconds [18:41:47] PROBLEM - Puppet freshness on srv198 is CRITICAL: Puppet has not run in the last 10 hours [18:42:23] New patchset: preilly; "revert to strtok_r for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16651 [18:42:24] paravoid: you still around? [18:42:41] notpeter: you around? [18:43:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16651 [18:43:10] I am [18:43:22] preilly: ^ [18:43:50] New patchset: preilly; "revert to strtok_r for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16651 [18:44:26] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16651 [18:44:26] New patchset: Kaldari; "Removing $wgNoticeFundraisingUrl override - the correct value is in the ext now." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16652 [18:44:27] paravoid: can you merge and push this: https://gerrit.wikimedia.org/r/#/c/16651/2/templates/varnish/mobile-frontend.inc.vcl.erb [18:44:51] preilly: it'd be nice to have an explanation & consequences of that :) [18:44:57] preferrably on the commit message [18:45:03] paravoid: well it's what we had before [18:45:15] paravoid: and I've got a carrier waiting to test it [18:45:17] well yeah, but why is it reverted? [18:45:30] paravoid: the new way didn't seem to work correctly [18:47:05] New patchset: J; "migrate wikimedia-job-runner to puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16501 [18:47:09] paravoid: are you able to approve and merge this change? [18:47:41] New patchset: Hashar; "basic README introducing our files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16035 [18:47:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16501 [18:48:01] New review: Hashar; "Patchset 2:" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/16035 [18:48:31] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16651 [18:48:50] PROBLEM - Puppet freshness on potassium is CRITICAL: Puppet has not run in the last 10 hours [18:49:24] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16652 [18:49:28] paravoid: are you forcing a puppet run too? 
[18:49:33] I am doing that now [18:50:05] paravoid: cool — thank you very much [18:50:25] you're welcome [18:50:30] paravoid: can you ping me when that's complete [18:50:47] preilly: ping :-) [18:50:55] paravoid: heh heh heh [18:52:47] New patchset: Kaldari; "Re-enabling PageTriage now that the db is updated." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16653 [18:53:08] I know you're in a rush now, but I like nice commit messages "revert 01ab02ef; brings back strtok_r, the new way is broken due to blah" [18:53:20] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16653 [18:53:24] paravoid: yeah good point [18:53:37] paravoid: sorry about that [18:53:49] no worries [18:55:46] I wonder if we could have a VCL test suite for such things [18:56:24] maybe with vmods, testing a shared library rather than the whole VCL should be easier [18:57:07] paravoid: I've been using a labs instance for testing [18:58:20] paravoid: on the mobile-testing vm I've got varnish configured in front of apache and I test the config with the following request http://mobile-testing.wmflabs.org/carrier-test/?cid=412416436 [18:58:59] nice :) [19:00:56] I was thinking more of a "make test" [19:01:24] alias "make test"="curl -I http://mobile-testing.wmflabs.org/carrier-test/?cid=412416436" [19:01:36] paravoid: yeah I know what you meant [19:01:49] paravoid: a proper test harness would be nice [19:01:51] but anyway, you seem to be handling this, no reason to step on your toes :) [19:02:53] dinner time [19:02:55] paravoid: it's okay I appreciate all of your assistance [19:03:00] paravoid: have a nice dinner [19:05:25] New patchset: J; "Add videoscaler class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16654 [19:06:02] New patchset: J; "migrate wikimedia-job-runner to puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16501 [19:06:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16654 [19:06:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16501 [19:08:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:10] New review: J; "i have no idea where the changes in patch 8 come from. and not able to push patch 7 again? is it pos..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16501 [19:16:53] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:35] New review: Demon; "The extra changes in patch set 8 are due to the rebase against production before pushing. They clutt..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16501 [19:19:17] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.58 ms [19:21:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [19:21:03] PROBLEM - NTP on virt1001 is CRITICAL: NTP CRITICAL: No response from NTP server [19:23:35] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:23:44] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:25:59] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.52 ms [19:26:44] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 35.43 ms [19:34:05] New patchset: Hashar; "(bug 36748) deployment-dbdump is the syslog server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16661 [19:34:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16661 [19:35:16] is there anyone around to merge a horrible hack to our base::remote-syslog class please ? 
https://gerrit.wikimedia.org/r/16661 [19:35:33] beta instance deployment-dbdump is kind of the equivalent of nfs1/nfs2 so it should not have base::remote-syslog deployed on it [19:35:55] that install rsyslog package which conflict with syslog-ng package used as a sysloglog receiver. [19:47:00] cmjohnson1: the new dbs (db63 onwards) that you imaged.. i just realized their hw raid is configured as a raid5 [19:47:36] cmjohnson1: needs to be raid-10, 256k stripe size, writeback caching, with no readahead [19:47:57] need to wipe the disk in the process, so they'll need to reimaged after [19:50:27] binasher: how is the emergency schema change going? :p [19:51:07] well, not really an emergency with git revert :) [19:52:01] hah [19:53:05] AaronSchulz: i haven't started yet, was supposed to be leaving now for lunch with an ops candidate who isn't here [19:53:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:01] starting it and immediately walking away seems like bad form :) [19:58:45] binasher: I'd love to see how osc is launched [19:59:18] AaronSchulz: see fenari:/home/asher/db/run-online-schema-change which uses /home/asher/db/pt-online-schema-change-2.1.1-no_child_table_patch [19:59:18] under your user dir, classy [19:59:18] i think maybe i can switch to 2.1.2 without any modifications, percona addressed the issue i fixed in the version in my home dir ;) [19:59:18] i need to test 2.1.2 still though [19:59:49] if a released version works reliably, i'll make it a documented process [20:00:44] heh [20:00:50] but while there's risk of it breaking replication.. i'd rather it not be [20:04:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.259 seconds [20:16:36] notpeter: do you know anything about srv281? 
it's only been up a few days and it's commented out of the rendering lvs config (so not in rotation) [20:16:46] it seems it doesn't have a recent apache deploy [20:17:07] I'm wondering if it's known broken or just lost by the wayside. [20:19:39] New patchset: Ryan Lane; "Changes for essex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16669 [20:20:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16669 [20:26:48] New patchset: Ryan Lane; "Changes for essex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16669 [20:27:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16669 [20:29:49] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16669 [20:35:13] hi again. Hope you had a good lunch :-] [20:35:20] is there anyone around to merge a horrible hack to our base::remote-syslog class please ? https://gerrit.wikimedia.org/r/16661 [20:35:56] New patchset: Ryan Lane; "Fix glance config file names in essex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16672 [20:36:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16672 [20:38:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16672 [20:40:32] will find out tomorrow :-] have a good day! [20:40:43] New patchset: Ryan Lane; "Adding back in now missing files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16673 [20:41:22] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16673 [20:43:31] !log started abuse_filter_log migration via osc. first step, adding afl_rev_id, afl_log_id columns. 
running in cluster order starting with s1 (enwiki) [20:43:39] Logged the message, Master [20:49:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.384 seconds [20:50:57] AaronSchulz: Do I remember right that you said thumb_handler.php was already deployed to the image scalers? [20:51:55] I don't see it on srv224. [20:54:02] maplebed: I said the code was done...I think live-1.5/ needs a file for it, just like thumb.php [20:54:04] New patchset: Kaldari; "Re-implementing wgNoticeFundraisingUrl override for wmf7 wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16674 [20:54:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [20:54:41] I don't know what "live-1.5/ needs a file for it" means. [20:54:43] New patchset: Kaldari; "Re-implementing wgNoticeFundraisingUrl override for wmf7 wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16674 [20:54:52] maplebed: see /usr/local/apache/common-local/live-1.5 [20:55:04] there are "het deploy" entry wrapper files there [20:55:32] * AaronSchulz should add that [20:55:48] I don't see thumb_handler.php in /usr/local/apache/common-local/live-1.5 either. [20:56:08] that what I was getting at, it's not there yet [20:56:17] it needs a 3 line file [20:56:22] *that's [20:56:48] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16674 [20:56:54] are you waiting on me to finish the rewrite rule before you deploy it? [20:57:13] pretty much, I'll add that file now [20:57:44] well, really what I'm getting at - I want a place I can test the rewrite rule. Without an endpoint to call and see what happens, I don't know where to test. [21:02:37] maplebed: ok everything should have that file now [21:03:14] maplebed: are you trying to take out a scalar and test it? 
[21:03:26] I've already taken one out. I thought your part was done. [21:04:02] yeah I forgot about needing that het deploy wrapper [21:04:30] did you add rewrite rules? [21:04:40] to srv224 only. [21:05:24] didn't work. [21:05:39] what url did you use? [21:05:43] but now that thumb_handler is there I'll keep poking at it. [21:06:06] oh, yeah it wouldn't have worked without that wrapper [21:07:07] my test URL: http://commons.wikimedia.org/thumb/a/a2/Little_kitten_.jpg/1px-Little_kitten_.jpg [21:08:35] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [21:08:36] oh, I see one problem. [21:09:56] grumble. [21:17:17] maplebed: something simple? [21:17:47] yeah - I was testing with commons but I only put the rewrite rule in wikipedia (not wikimedia) to start, so commons wasn't part of my testable sample set. [21:18:09] I have a new test image: http://en.wikipedia.org/thumb/d/d8/Wikiwsy.jpg/1px-Wikiwsy.jpg [21:18:35] aka upload.wikimedia.org/wikipedia/en/thumb/d/d8/Wikiwsy.jpg/120px-Wikiwsy.jpg [21:19:26] AaronSchulz: do you have some time for us to poke at it together? [21:19:33] now is ok [21:19:51] can we move to a different channel? [21:19:55] sure [21:20:33] New patchset: Alex Monk; "(bug 38690) Lots of rights changes for trwiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16675 [21:27:23] moved to #wikimedia-dev for anybody interested. [21:32:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:04:50] New patchset: Alex Monk; "(bug 38690) Lots of rights changes for trwiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16675 [22:04:52] LeslieCarr: oversight-en-wp@wikipedia.org [22:05:39] !log started second step of abuse_filter_log migration via osc, indexes for new columns [22:05:47] Logged the message, Master [22:07:06] New patchset: Alex Monk; "(bug 38690) Lots of rights changes for trwiki." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16675 [22:08:09] notpeter: Ping re srv281, it's throwing disk full errors. Looks like the /a -> /usr/local/apache migration didn't happen there [22:23:58] maplebed: "Got bad URI '/thumb/d/d8/Wikiwsy.jpg/2px-Wikiwsy.jpg' [zone URI /wikipedia/en/thumb]" [22:24:10] gah, it's not making it past the url sanity checks [22:24:31] wrong, channel, arg [22:25:24] so, looks like puppet is broken ? [22:25:35] on stafford, that is [22:25:41] anyone checking that out or should i get it ? [22:26:29] oh yeah, that looks pretty broken [22:26:32] go for it [22:29:43] !log fixed puppetmaster on stafford (bad stafford, no cookie!) [22:29:51] Logged the message, Mistress of the network gear. [22:30:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 The environment must be purely alphanumeric, not - 284 bytes in 0.495 seconds [22:34:32] PROBLEM - Puppet freshness on cp1020 is CRITICAL: Puppet has not run in the last 10 hours [22:35:36] PROBLEM - Puppet freshness on mw58 is CRITICAL: Puppet has not run in the last 10 hours [22:36:38] PROBLEM - Puppet freshness on srv209 is CRITICAL: Puppet has not run in the last 10 hours [22:37:32] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [22:38:35] PROBLEM - Puppet freshness on lvs5 is CRITICAL: Puppet has not run in the last 10 hours [22:39:17] New patchset: Catrope; "Fix double inclusion of misc::udp2log on nfs2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16690 [22:39:56] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16690 [22:40:30] LeslieCarr: nfs2 fix https://gerrit.wikimedia.org/r/16690 [22:40:58] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16690 [22:49:40] New patchset: Catrope; "Make nrpe::monitor_service conditional rather than using present/absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16691 [22:50:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16691 [22:50:20] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16691 [22:54:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:56:57] hrm, anyone know why puppetmaster would start up and stay single threaded ? [22:57:16] it should be spawning itself left and right [23:06:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [23:10:10] New patchset: Catrope; "Add description and split the other one too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16695 [23:10:42] fyi, figured out hte puppetmaster thing, need to not start it via service puppetmaster , start it via service apache2 [23:10:44] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/16695 [23:10:59] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [23:11:18] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [23:12:16] New patchset: Catrope; "Add description and split the other one too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16695 [23:12:55] New patchset: preilly; "fix zero regex and remove opera check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16697 [23:13:31] New review: Lcarr; "squiggly braces are magical" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/16695 [23:13:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16695 [23:13:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16695 [23:13:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16697 [23:15:19] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16697 [23:17:37] New patchset: Catrope; "Don't try to do ensure=>absent, seems to be broken" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16698 [23:18:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16698 [23:18:20] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16698 [23:23:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:23:55] New patchset: Catrope; "Wrap the monitor_service call too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16699 [23:24:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16699 [23:30:17] !log completed abuse_filter_log migration for all wikis [23:30:25] Logged the message, Master [23:31:07] LeslieCarr: https://gerrit.wikimedia.org/r/16699 [23:31:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16699 [23:33:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.600 seconds [23:35:14] New patchset: Catrope; "Also wrap the cron job in an if" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16700 [23:35:46] binasher: We should've just let it error for a few hours ;) [23:35:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16700 [23:36:56] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16700 [23:40:06] Is gerrit "broken" for anyone else? [23:40:11] I'm getting "The page you requested was not found, or you do not have permission to view this page." for anything [23:40:33] RoanKattouw: Krinkle ^ [23:40:54] Oh, wait [23:40:57] Is this this chrome bug.. [23:41:07] Yup [23:41:08] FAIL [23:41:39] to the canary! [23:48:50] New patchset: Lcarr; "include class nfs::home::wikipedia check in nfs::home" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16702 [23:49:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16702 [23:49:39] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16702 [23:52:45] New patchset: Lcarr; "importing nfs.pp so class knows where File/home/wikipedia is" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16703 [23:53:22] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16703 [23:53:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16703 [23:57:29] New patchset: Lcarr; "removing dependency on another class's files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16706 [23:58:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16706 [23:58:54] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16706