[00:15:04] PROBLEM - Puppet freshness on db63 is CRITICAL: Puppet has not run in the last 10 hours
[00:28:37] New patchset: Ryan Lane; "Add keystone config import to nova's api-paste" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17030
[00:29:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17030
[00:29:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17030
[00:43:25] New patchset: Ryan Lane; "Add keystone support to glance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17033
[00:44:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17033
[00:44:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17033
[00:55:07] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours
[01:21:03] New review: Catrope; "Can and should be merged without 16966, see comments on that change." [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/16967
[01:24:55] New patchset: Pyoungmeister; "apache refactor: first half of mark's comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17035
[01:25:33] New patchset: Pyoungmeister; "apache refactor: second half of mark's comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17036
[01:26:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17035
[01:26:10] New patchset: Pyoungmeister; "apache refactor: second half of mark's comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17036
[01:26:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17036
[01:41:53] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 245 seconds
[01:41:53] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 246 seconds
[01:48:29] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 642s
[01:53:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 20 seconds
[01:54:02] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 37s
[01:54:38] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 17 seconds
[03:23:17] RECOVERY - Puppet freshness on mw60 is OK: puppet ran at Tue Jul 31 03:22:52 UTC 2012
[04:18:43] New patchset: Ori.livneh; "*UNTESTED* Calls to bits-lb.eqiad/event.gif 204'd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16724
[04:19:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16724
[04:48:40] New patchset: Ori.livneh; "*UNTESTED* Event tracking endpoint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16724
[04:49:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16724
[06:35:54] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[07:19:51] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[07:32:53] New patchset: Ori.livneh; "(RT 3325) olivneh restricted => mortals" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17040
[07:33:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17040
[07:43:37] New patchset: Ori.livneh; "(RT 3325) olivneh restricted => mortals" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17040
[07:44:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17040
[07:44:17] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[07:45:10] New patchset: Jeremyb; "certs.pp: fix cert IDs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17041
[07:45:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17041
[07:52:12] * jeremyb wonders which ops are awake?
[07:59:46] paravoid: you awake?
[08:00:43] preilly: it is 10am ;-)
[08:01:01] jeremyb: not in SF
[08:01:11] wait, no i can't do math. 11am
[08:01:12] jeremyb: it's 1:01:06 AM
[08:01:47] Tuesday (PDT) - Time in San Francisco, CA
[08:01:55] 31 08:01:12 < preilly> jeremyb: it's 1:01:06 AM
[08:02:03] +3 for CEST
[08:02:07] err, EEST
[08:02:13] * jeremyb is obviously half asleep
[08:05:25] !g I9e1b90579fba24
[08:05:25] https://gerrit.wikimedia.org/r/#q,I9e1b90579fba24,n,z
[08:05:47] well if someone wants to sanity check or review or merge that, great, thanks ;-)
[08:50:21] still no ops...
[08:50:26] * jeremyb runs away shortly
[08:53:44] jeremyb: shouldn't you sleep now?
[10:03:45] New review: Hashar; "Ideally we would want to lookup the openssl package version and use that instead of the distribution..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/17041
[10:08:33] jeremyb: did a basic review. I have sent an email to ops about the certs being broken on Precise. Replied to it and mentioned your change.
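[Editor's note: the certs.pp fix reviewed above concerns OpenSSL's hashed-symlink lookup scheme for CA certificates, which also comes up later in the log when Faidon proposes using c_rehash instead of manual symlinks. A minimal sketch of what that scheme looks like, using a throwaway self-signed cert and a local `certs/` directory (not any production path):]

```shell
# OpenSSL finds a CA cert in a directory via a symlink named after the
# subject-name hash, e.g. "a1b2c3d4.0" -> cert file. c_rehash automates
# creating these links; here we do one by hand with a throwaway cert.
mkdir -p certs
openssl req -x509 -nodes -newkey rsa:2048 -days 1 \
    -subj "/CN=example-ca" \
    -keyout certs/ca.key -out certs/ca.pem 2>/dev/null
hash=$(openssl x509 -hash -noout -in certs/ca.pem)  # subject-name hash
ln -sf ca.pem "certs/${hash}.0"                     # what c_rehash would create
ls -l certs/
```

[With many certs, running `c_rehash certs/` creates all such links at once, which is why it is less error-prone than maintaining the symlinks in a manifest by hand.]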
[10:08:44] hopefully will attract someone :)
[10:15:45] PROBLEM - Puppet freshness on db63 is CRITICAL: Puppet has not run in the last 10 hours
[10:55:57] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours
[12:19:35] !log Depooled cp1042 for reinstall with Precise
[12:19:43] Logged the message, Master
[12:27:54] PROBLEM - Host cp1042 is DOWN: PING CRITICAL - Packet loss = 100%
[12:28:10] hm? so all of the mobile varnishes are going to be internal?
[12:29:24] New patchset: Faidon; "certs: use c_rehash instead of manually symlinking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17065
[12:30:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17065
[12:30:54] New review: Faidon; "Thanks Jeremy. While reviewing this I thought of taking a different approach, namely:" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/17041
[12:31:30] mark: ping?
[12:32:15] yes?
[12:32:15] I've been getting multiple mails per day for the new jobrunner stuff (I've reviewed most of the incarnations)
[12:32:36] it has the potential of breaking jobs cluster-wide
[12:32:44] oh noes
[12:32:59] so, what do you suggest? merge it? or you want to have a look first?
[12:33:03] or?
[12:33:07] i'll have a look today
[12:33:38] I think I can handle it, I'm only asking if /you/ want to double-check :)
[12:34:00] Backend host '"srv193.pmtpa.wmnet"' could not be resolved to an IP address:
[12:34:01] Temporary failure in name resolution
[12:34:01] (Sorry if that error message is gibberish.)
[12:34:01] ('input' Line 129 Pos 17)
[12:34:01] .host = "srv193.pmtpa.wmnet";
[12:34:01] ----------------####################-
[12:34:24] !?
[12:34:59] ah nm
[12:35:05] i changed the vlan of that host hehe
[12:35:10] heh
[12:35:18] so, what's the plan with the mobile cp?
[12:35:23] move them internal
[12:35:33] how will that work?
[12:35:39] how will that not work?
[12:35:50] erm? lvs dr?
[12:35:54] yes?
[12:36:13] if the frontends are internal how will they send traffic to clients?
[12:36:21] like they always do
[12:36:25] send to the router, router sends on
[12:36:54] mind you, lvs service ip is out of subnet
[12:37:02] yeah, I thought of that
[12:37:02] whether it's a public ip subnet or a private ip subnet does not matter one bit ;)
[12:37:08] right
[12:37:12] the internal name confused me
[12:37:27] it's not exactly "internal" in the sense I'm used to
[12:38:37] I just had to reinstall git
[12:38:57] and apple says "your security preferences currently forbid installing apps from unknown developers!"
[12:39:17] shame on you! trying to install applications on your operating system!
[12:42:18] New patchset: Mark Bergsma; "Move cp1042 internal" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17067
[12:42:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17067
[12:43:09] so are you planning to move all cp* to internal?
[12:43:21] and all !lvs I presume?
[12:43:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17067
[12:51:12] yes
[12:51:24] if it doesn't need to talk to the internet
[13:16:16] !log compiling phpllvm tests on bast1001
[13:16:25] Logged the message, Master
[14:33:37] !log Reinstalled cp1042 with Precise
[14:33:45] Logged the message, Master
[14:33:52] !log Repooled cp1042
[14:34:00] Logged the message, Master
[15:58:52] notpeter: oh? are you in SF?
[16:00:27] paravoid: indeed. I was at defcon last week, so I thought I'd come to sf for a week
[16:00:29] !log beginning swift deploy to make thumbnail requests bypass ms5 and go straight from swift to the rendering cluster
[16:00:37] Logged the message, Master
[16:00:41] and I got here just in time to be sick! wooo
[16:01:16] New patchset: Aaron Schulz; "Set the 404 handlerUrl for the thumb zone." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17078
[16:02:41] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17078
[16:03:11] you can even get human viruses at defcon now?
[16:05:44] so it would seem... luckily, I'm pretty confident that my electronics are unscathed, sadly, there were a lot of NSA/DoD people at defcon (like, *a lot*), so this is probably a mind control virus
[16:08:04] paravoid: Around?
[16:08:10] * RoanKattouw just got up and is making breakfast
[16:09:14] hehe
[16:09:17] i'm looking at your commits now
[16:09:22] it's a bit confusing like this
[16:09:28] several gerrit changes
[16:09:44] i'm thinking, perhaps we should make modules out of them
[16:09:54] then we also don't have to worry (now) about keeping current systems intact
[16:10:47] sure, that's reasonable
[16:10:54] I can also squash them all, if that would help
[16:11:16] ideally it would become a new patchset to the existing change
[16:11:23] then I can more easily see what's changed and what has happened to the comments
[16:11:38] but so far it's looking better anyway ;)
[16:11:53] part of me would also like to switch it to a module later, so as to not hold up deploy.
[16:12:47] yes, I believe that I addressed all of your comments. I used it as a checklist
[16:13:51] mark: I'm still waiting for ack on the ssh module
[16:13:57] and possibly varnish, although that has diverged from now
[16:14:01] and wasn't tested in the first place
[16:14:15] hold up what deploy?
[16:14:34] paravoid: well we need to decide on the tabs/spaces don't we? ;)
[16:15:23] AaronSchulz: about to push https://gerrit.wikimedia.org/r/#/c/16821/
[16:15:49] mark: eqiad?
[16:16:03] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16821
[16:16:13] why would that hold up a deploy?
[16:16:43] you're drastically changing things now with existing systems running off those manifests
[16:16:48] that's risky and painful
[16:16:51] true
[16:16:58] and a trivial change to make something into modules would hold up a deployment? :)
[16:17:05] true
[16:17:14] it would allow you to use the new stuff independently for now
[16:17:25] fair enough
[16:17:37] I am a fan of that
[16:18:15] also
[16:18:26] why do you name files differently from the classes contained within them all the time?
[16:18:34] at least the autoloader will deal with that ;)
[16:19:52] usually due to trying to reconcile naming conventions and not quite getting it right. I will make sure to not do that in the future
[16:20:10] so
[16:20:18] if we make modules we can do this right from the start
[16:20:21] what modules should we have
[16:20:34] a module that maintains a functional mediawiki instance
[16:20:45] not an apache service, but just a mediawiki instance on a host
[16:20:49] yep
[16:20:55] so also used for things like cron jobs, dumps, whatever
[16:21:14] it should be as generic as we can make it with our current crap deployment system
[16:21:25] heh
[16:21:54] and... a module called applicationserver I think
[16:21:58] I hate the word 'apache', too confusing here
[16:21:59] mark: I know we do, I'm just saying that we should probably decide that soon if you want other people to start using modules :)
[16:22:03] sounds reasonable
[16:22:10] I think we need both
[16:22:21] both what?
[16:22:26] an apache module for all the apache related stuff
[16:22:28] AaronSchulz: live on ms-fe1 and 2
[16:22:34] yeah that's possible
[16:22:35] (50% deployed)
[16:22:36] packages, sites-avail/enabled, reload etc.
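[Editor's note: the module plan sketched in the discussion above (a "mediawiki" module and an "applicationserver" module) follows Puppet's standard module layout, where the autoloader maps class names to file paths. A hedged sketch of that skeleton, using the module names from the conversation purely as examples:]

```shell
# Skeleton for the two modules proposed above. Puppet's autoloader expects
# class "mediawiki" in modules/mediawiki/manifests/init.pp, and a subclass
# like "mediawiki::cron" in modules/mediawiki/manifests/cron.pp - which is
# why file names must match class names, as mark points out.
for m in mediawiki applicationserver; do
  mkdir -p "modules/$m/manifests" "modules/$m/files" "modules/$m/templates"
  touch "modules/$m/manifests/init.pp"   # would hold: class $m { ... }
done
ls modules
```

[The actual class contents are out of scope here; the point is only the directory convention that makes the modules independently usable.]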
[16:22:45] but we also have more generic apaches
[16:22:46] that's being reused in the appserver module
[16:22:49] not related to mediawiki
[16:22:55] right, the apache module should be usable by itself
[16:23:01] and I don't really want to do that now
[16:23:10] so I'd be ok with getting an apache module later
[16:23:20] the way I envision it is having multiple layers of parameterized classes ending up in a role class
[16:23:21] our apache config right now is a mess anyway, there's no chance we can integrate that into a new apache module in time
[16:23:26] yeah
[16:23:34] but we can't realistically do it all now
[16:23:46] okay, I haven't looked at that much
[16:23:56] so, I wouldn't know, I'm talking purely theoritical here
[16:24:01] theoretical even
[16:24:06] I fully agree
[16:24:14] great
[16:24:18] just being realistic here now
[16:24:33] so, in the future, our application servers (from the role classes) would use an apache module (also used for other stuff)
[16:24:43] for now, they'll probably handle the apache config themselves
[16:24:51] kk
[16:24:55] since it works very differently from how we handle apache/sites config elsewhere (which ALSO sucks, for different reasons ;)
[16:25:04] hehehe
[16:25:17] so I would be ok with it if the apache handling for app servers would exist in... the applicationserver module, for now
[16:25:21] we also have webserver::apache from what I can see
[16:25:29] that's for non-mediawiki stuff
[16:25:32] right
[16:25:52] let's not use that unless it's something simple like "install the apache packages"
[16:25:58] and it already does exactly what we need
[16:26:05] no config handling
[16:26:10] ok, a mediawiki module and an appserver module (that will currently include apache stuff)
[16:26:13] yeah, I was just saying that an apache module could replace both in the future
[16:26:14] yup
[16:26:23] and then we need to do something with all the special case mediawikis
[16:26:29] I mean, special case servers
[16:26:31] yeah...
[16:26:33] for dumps, cron jobs, etc
[16:26:39] part of that is role classes
[16:26:41] part of it may not be
[16:26:43] I'm not sure yet
[16:26:50] perhaps we should start small and simple
[16:26:53] when we use modules, we have that liberty
[16:26:59] yep :)
[16:27:03] since we don't need to worry about the existing pmtpa cluster
[16:27:14] well, hopefully all of those things will be doable with just the mediawiki module
[16:27:20] we can reuse those existing manifest bits, but only if they're good
[16:27:23] (and not much is ;)
[16:27:49] btw, something else: what I've seen some people do, and you might be interested in
[16:27:59] is having a two-level hierarchy for modules
[16:28:12] so, modulepath=/etc/puppet/modules/base;/etc/puppet/modules/site
[16:28:19] with base/ having only software-related modules
[16:28:24] e.g. apache, squid, varnish etc.
[16:28:47] Change abandoned: Jeremyb; "Thanks Faidon, that WORKSFORME" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17041
[16:28:48] and site having "site-specific" modules, appserver, dbserver, mcserver etc.
[16:29:08] I'm not a big fan
[16:29:10] but some people are
[16:29:16] so I'm just putting it on the table :)
[16:29:18] it's much like our role classes
[16:29:43] well, you can do that (as I suggest) without having a different *directory* hierarchy
[16:29:58] yeah
[16:30:05] btw, one problem we have with our role classes now
[16:30:11] is that they're used for two different things:
[16:30:22] one is to tie different manifests/modules/whatever into one system
[16:30:32] so a system has just one role class, normally
[16:30:48] * paravoid nods
[16:30:50] but now, especially for labs, it's also used to set up a common configuration of some manifest classes
[16:30:54] to fill in parameters
[16:31:05] but then multiple role classes end up on one box
[16:31:10] could you give an example of that?
[16:31:18] some of hashar's work recently does that
[16:31:27] so, some misc manifest gets parameterized
[16:31:42] and to use that tiny service on a box with other services on it
[16:31:49] a role class is created, to fill in the parameters
[16:31:58] and then that role class, amongst others (now or in the future :) is used
[16:32:23] iirc one example was fenari/bast1001
[16:32:25] let's see
[16:32:32] hrm, I'm not sure I fully understand but it doesn't sound very good
[16:32:34] nagios-wm:
[16:32:36] oops
[16:33:02] * aude lurking and learning puppet
[16:33:29] ah
[16:33:30] nfs1/2
[16:33:30] btw, the base/site thing I was telling before, explained in better and more words: http://serialized.net/2009/07/puppet-module-patterns/
[16:33:37] which are not puppetized very well of course
[16:33:38] but still:
[16:33:39] include standard,
[16:33:39] misc::nfs-server::home,
[16:33:39] misc::nfs-server::home::backup,
[16:33:39] misc::nfs-server::home::rsyncd,
[16:33:39] misc::syslog-server,
[16:33:40] ldap::server::wmf-cluster,
[16:33:40] ldap::client::wmf-cluster,
[16:33:41] backup::client
[16:33:41] # don't need udp2log monitoring on nfs hosts
[16:33:42] class { "role::logging::mediawiki": monitor => false }
[16:34:09] okay
[16:34:10] that's not a role class which completely defines one system
[16:34:16] why not?
[16:34:16] and of course, that's hard in this case
[16:34:20] New review: Catrope; "Yes. The l10nupdate script works in its current form but this is much nicer (and makes new extension..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/6905
[16:34:23] it's a production bastion host
[16:34:30] no it's nfs1/nfs2
[16:34:42] oh, right, I was confused
[16:34:53] oh you mean it's NFS + syslog + LDAP?
[16:34:57] yeah
[16:35:02] role classes are easy for clustered systems
[16:35:07] not for misc machines with multiple services on them
[16:35:12] aha
[16:35:17] I think that's fine
[16:35:24] because that system essentially fulfills multiple roles
[16:35:25] it might conflict some day
[16:35:28] yeah I agree
[16:35:33] and we could (and are planning to) split it up
[16:35:34] but many role classes also include "standard" for example
[16:35:38] which sets up their base stuff
[16:35:43] so?
[16:35:48] that won't conflict
[16:35:54] you can include a class as many times as you want
[16:35:58] i know
[16:36:05] but we also need to make that parameterized really
[16:36:28] to allow for different resolvers, or a different syslog server, or stuff like that
[16:36:30] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[16:36:31] and then it gets difficult
[16:36:36] i'm not saying this is a problem -yet-
[16:36:51] I'm just signaling that this is slightly different usage of role classes that could become problematic
[16:37:04] hm
[16:37:18] I need a whiteboard dammit
[16:37:23] hehe yeah
[16:37:26] use your windows
[16:37:28] whiteboard paint?
[16:37:31] if you have whiteboard markers that works
[16:37:52] jeremyb: horribly expensive and dirty overall
[16:37:59] (been there done that)
[16:38:06] i think you can expense a whiteboard if you want one ;)
[16:38:25] hahaha
[16:38:43] hah
[16:39:03] paravoid: otoh, there's darst's whiteboard ;)
[16:39:33] which one?
[16:39:43] paravoid: whiteboard.debian.net
[16:39:48] ah
[16:57:57] New patchset: Bhartshorne; "changing swift pmtpa-prod cluster to use the image scalers directly instead of ms5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17082
[16:58:00] AaronSchulz: ^^^
[16:58:11] \o/
[16:58:24] kill kill kill
[16:58:35] i'll be even happier when you'll kill the solaris boxes
[16:58:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17082
[16:58:56] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17082
[16:59:41] * AaronSchulz yays
[17:05:17] ok, live on both 1 and 2
[17:06:50] seems ok
[17:07:17] running puppet on 3 and 4
[17:08:32] restarting the swift proxy on 3 and 4
[17:08:38] paravoid: did you see you review?
[17:08:44] your review*
[17:14:44] aude: http://www.netways.de/puppetcamp/puppetcamp2012/program/
[17:16:53] AaronSchulz: the swift side of things is done.
[17:17:01] \o/
[17:18:48] PROBLEM - SSH on lvs6 is CRITICAL: Server answer:
[17:19:24] mark: ^ lvs ?
[17:19:57] notpeter: search32...dell is sending another mainboard. I will get it when i get back
[17:20:09] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[17:20:27] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[17:21:21] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[17:22:38] New review: Mark Bergsma; "This does not really belong here. This looks like a "role class" which uses a jobrunner with a speci..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/16654
[17:22:49] cmjohnson1: cool!
[17:22:51] thank you
[17:23:13] that will be the 2nd mainboard and countless DIMM...after this..it's going back
[17:23:18] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[17:23:45] it's cursed
[17:23:49] cmjohnson1: i have found a few times that a bad power supply can sometimes keep killing those off
[17:23:55] if it's not delivering the proper voltages
[17:24:06] LeslieCarr: seen lvs6 above?
[17:24:53] lesliecarr: good thought..i will check the power
[17:25:18] sigh
[17:25:21] checking out lvs6
[17:25:29] y u break it ?
[17:26:04] LeslieCarr: it recovered (also above) but i thought maybe it was worth a look anyway
[17:26:09] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[17:26:38] hrm, lvs6 looks happy….
[17:26:42] checking out spence
[17:28:44] heh, someone should have fixed up icinga by now… i wonder who that was… ;)
[17:29:01] * jeremyb blames neon
[17:29:10] yeah, bad neon!
[17:29:15] !log finished swift deploy - image scaling requests now go straight to the rendering cluster
[17:29:23] Logged the message, Master
[17:29:36] cmjohnson1: also, sometimes a lack of being hit with hammers can negatively affect servers. maybe take a look at that
[17:29:56] mark: I was wrong yesterday when I said that ms5 is now completely out of the loop - mediawiki still checks it for images, just not swift.
[17:29:59] notpeter: is there a study?
[17:30:18] maplebed: ok
[17:30:22] maplebed: when is that going to change?
[17:30:38] notpeter: that is a good idea..i did notice that search32 was getting comfy next to dataset1...maybe they worked something out
[17:30:42] mark: when we switch originals, currently planned for next week.
[17:30:51] cool
[17:30:53] jeremyb: http://www.allthingsdistributed.com/images/hammer.JPG
[17:30:55] there is the study
[17:31:04] cmjohnson1: hahaha
[17:31:42] maplebed: or maybe the week after, next week might be a bit close
[17:31:52] notpeter: i don't think it's a reliable source
[17:31:54] :P
[17:32:32] jeremyb: it's on the internet. wikipedia is on the internet. wikipedia is a reliable source. therefore, that "study" is a reliable source
[17:32:44] uhuh
[17:32:54] man, maybe I should be a lawyer. as I understand it, that's the level of understanding of logic that is required...
[17:34:11] New patchset: J; "Add videoscaler class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16654
[17:34:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16654
[17:35:22] New review: J; "thanks for the feedback, moved the class to role::jobrunner::videoscaler" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16654
[17:38:54] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time
[17:45:14] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484
[17:45:30] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[17:45:49] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/13484
[17:46:12] New review: Demon; "In PS10: I removed the conditional inclusion of gerrit::account. Really, any server using gerrit::je..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13484
[17:48:03] ^demon: you know you failed lint?
[17:48:36] <^demon> That was already his hashar's fault.
[17:50:11] RECOVERY - Host cp1042 is UP: PING OK - Packet loss = 0%, RTA = 35.37 ms
[17:50:29] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: Connection refused by host
[17:50:41] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484
[17:50:47] PROBLEM - Varnish HTCP daemon on cp1042 is CRITICAL: Connection refused by host
[17:51:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484
[17:51:35] <^demon> jeremyb: Fixed his mistake :)
[17:51:45] i see ;)
[17:53:11] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa
[17:53:29] RECOVERY - Varnish HTCP daemon on cp1042 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker
[17:54:32] j^: ma_rk is having me move what I'm doing into modules, so what I'm doing will now be independent.
[17:57:10] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16967
[17:57:29] what's the latest on whitespace?
[17:57:35] j^: yep
[18:07:02] jeremyb: it's still white
[18:07:05] More news in an hour
[18:07:24] uhuh
[18:07:43] but tabs and spaces are *both* white!
[18:34:36] PROBLEM - SSH on ms-be1006 is CRITICAL: Connection refused
[18:41:48] RECOVERY - SSH on ms-be1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[18:46:50] New patchset: MaxSem; "Wiki Loves Monuments API server, RT#3221" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16990
[18:47:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16990
[18:50:12] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[18:54:17] New review: awjrichards; "Just some trailing whitespace in latest patchset - also were you going to move the update stuff to a..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16990
[18:55:45] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms
[18:56:39] PROBLEM - SSH on ms-be1009 is CRITICAL: Connection refused
[18:58:09] RECOVERY - swift-container-server on ms-be1006 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:58:09] RECOVERY - swift-object-server on ms-be1006 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[18:58:18] RECOVERY - swift-object-updater on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[18:58:27] RECOVERY - swift-account-server on ms-be1006 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[18:58:36] RECOVERY - swift-container-updater on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[18:58:45] RECOVERY - swift-object-auditor on ms-be1006 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[18:58:45] RECOVERY - swift-container-auditor on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:58:54] RECOVERY - swift-account-auditor on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[18:59:03] RECOVERY - swift-object-replicator on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[18:59:03] RECOVERY - swift-container-replicator on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[18:59:12] PROBLEM - SSH on ms-be1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:59:22] RECOVERY - swift-account-reaper on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[19:00:24] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[19:02:39] RECOVERY - SSH on ms-be1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[19:07:12] New patchset: Bhartshorne; "changing ganglia's idea of the swift eqiad cluster from the test cluster to the to-be-prod cluster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17100
[19:07:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17100
[19:09:26] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17100
[19:13:54] RECOVERY - NTP on ms-be1006 is OK: NTP OK: Offset -0.0161318779 secs
[19:21:24] RECOVERY - Host calcium is UP: PING OK - Packet loss = 0%, RTA = 35.47 ms
[19:26:37] * jeremyb pokes binasher
[19:26:37] !log authdns-update for wtp1 info
[19:26:45] Logged the message, RobH
[19:27:42] hey jeremyb
[19:28:01] binasher: < jeremyb> binasher: hey, slight bump on the OTRS RT i sent
[19:28:24] also, hi!
[19:28:58] hrmmmm, OTRS needs a logo
[19:30:12] New review: Hashar; "Roan> since you are in SF, could you poke an ops IRL to have that 3 months old change to be merged i..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6905
[19:32:01] jeremyb: closed it :) hi
[19:34:19] binasher: oh, cool. should i have gotten an email?
[19:36:59] jeremyb: yeah, from RT
[19:39:04] binasher: just now? or when?
[19:39:28] you opened the ticket right?
[19:39:30] i'm seeing nothing. just had a pass at the spambox too
[19:39:32] yes
[19:40:02] opened: Date: Mon, 23 Jul 2012 16:12:51 -0400
[19:40:03] right?
[19:40:16] well, i resolved the ticket which should trigger mail. mysterious are the ways of rt.
[19:40:31] okey, well thanks!
[19:40:53] its definitely sending some mail.. weird
[19:41:22] so, we're good to do a --single-transaction dump on a slave?
[19:42:29] PROBLEM - NTP on calcium is CRITICAL: NTP CRITICAL: No response from NTP server
[19:43:07] yep, that could be done against just the otrs db on db49. is jeff green going to take care of that part?
[19:43:49] i suppose. i have no access of course
[19:44:02] i'll find him
[19:44:26] 49 is intermediate master?
[19:44:31] * jeremyb wants this in dbtree ;)
[19:44:32] <^demon> binasher: Ah, I was looking for you earlier. I was wondering--is there any way to get db1048 on ishmael?
[19:46:49] ^demon: oh, yeah.. i forgot i'm only doing that on pmtpa dbs. i'll get that done today
[19:47:18] <^demon> Ok, thanks :D
[19:49:50] PROBLEM - Host stat1 is DOWN: CRITICAL - Host Unreachable (208.80.152.146)
[19:49:59] ottomata: ^
[19:50:58] whaaaaaaaaaaaa
[19:51:15] well pooper scoopers
[19:51:17] iunno!
[19:51:39] LeslieCarr, if you got a sec, could you peek at stat1.wikimedia.org
[19:51:39] ?
[19:52:05] it is down
[19:54:52] Ryan_Lane: binasher woosters? ^^^
[19:57:39] RobH: can you raise someone maybe? ^
[20:00:45] i'm taking a look
[20:03:58] sorry, was afk in datacenter
[20:04:05] RECOVERY - Host stat1 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[20:04:51] RobH: that's kinda expected i think ;)
[20:05:03] stats1 is back
[20:05:15] it went oom
[20:05:25] stat1*
[20:05:35] that
[20:05:36] hmmmm, wonder what was happening there
[20:05:39] ottomata: Jul 14 15:46:46 stat1 kernel: [2589787.359676] [ 7742] 602 7742 13459855 7815341 3 0 0 python
[20:06:00] nice, i betcha drdee was running stuff for metrics meeting
[20:06:03] that was the process with the largest rss
[20:06:38] the oom killer killed a perl process before that too
[20:06:38] not guilty
[20:06:40] afaik
[20:06:48] erik z maybe?
[20:06:52] perl's are probably erik z
[20:06:57] aye
[20:07:04] python's could be me or someone else
[20:07:11] i wasn't running anything right now
[20:07:14] ok cool
[20:07:19] thanks binasher_
[20:16:59] PROBLEM - Puppet freshness on db63 is CRITICAL: Puppet has not run in the last 10 hours
[20:19:57] Ryan_Lane: paravoid: can we do something about 17041 / 17065 ?
[20:20:03] !g 17065
[20:20:03] https://gerrit.wikimedia.org/r/#q,17065,n,z
[20:27:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:29:54] New patchset: Aaron Schulz; "Added mwEmbed and TMH to extension-list." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17112
[20:32:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.568 seconds
[20:42:21] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17112
[20:57:03] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours
[20:57:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17065
[21:01:18] New patchset: Ryan Lane; "Follow up to change 17065, adding the rapidssl ca source back in" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17115
[21:01:59] New review: gerrit2; "Lint check passed."
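The kernel line binasher pasted for the stat1 OOM is a row from the OOM killer's process table; the large number before "python" is the process's rss in 4 KiB pages. A quick way to dig into an event like this (log paths are standard Ubuntu locations and an assumption here, and the 7815341 figure is taken from the paste above):

```shell
# Look for recent OOM-killer activity; "|| true" guards against
# no-match exit codes and unreadable logs on other systems.
dmesg | grep -iE 'out of memory|oom-killer|killed process' || true
grep -i 'oom' /var/log/kern.log 2>/dev/null | tail || true

# rss in the OOM table is in 4 KiB pages, so the python process with
# rss = 7815341 pages was using roughly this many GiB:
echo $(( 7815341 * 4 / 1024 / 1024 ))   # prints 29
```

About 29 GiB resident for one python process would comfortably explain stat1 going OOM.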
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17115
[21:02:03] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17115
[21:06:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:15:45] New review: Jeremyb; "followup in Ie2746894b9d525 to address my review" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17065
[21:17:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.167 seconds
[21:18:03] PROBLEM - Puppet freshness on calcium is CRITICAL: Puppet has not run in the last 10 hours
[21:23:06] preilly: https://wikitech.wikimedia.org/view/How_to_do_a_schema_change
[21:23:27] for creating new tables for a new extension, any deployer can do that
[21:23:35] see the sql.php section
[21:24:55] preilly: http://bit.ly/R0Lrt5
[21:37:42] <^demon> binasher: Looking at db1048 in ganglia, it looks like it's showing almost zero load. Would you have any objections to me raising the query limit in gerrit? It wouldn't apply to normal usage, just people who are doing stats queries and the like.
[21:38:08] ^demon: go for it
[21:38:19] <^demon> Cool beans. We'll keep an eye on it just in case :)
[21:38:30] ^demon: can we publish that repo?
[21:42:51] New patchset: Aaron Schulz; "Enabled TMH for testwikis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17128
[21:43:21] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17128
[21:44:44] * Nemo_bis hugs ^demon
[21:45:59] * jeremyb steals Nemo_bis's underscore
[21:46:03] * jeremyb runs away, bbl
[21:47:02] grrrrrrrrrrrrrrr
[21:48:00] btw gerrit doesn't like underscore either
[21:51:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:00:31] New patchset: Aaron Schulz; "Disable ogg handler when TMH is enabled."
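The sql.php step binasher points preilly at is MediaWiki's maintenance script for sourcing an SQL file through the wiki's own database connection (so it picks up the right server, database, and table prefix). A sketch of the kind of invocation meant, assuming the usual mwscript wrapper; the wiki and file names are hypothetical, and the wikitech page linked above is the authoritative procedure:

```shell
# Hypothetical example: create a new extension's tables on testwiki.
# mwscript is the cluster wrapper around MediaWiki maintenance scripts;
# "extensions/MyExtension/tables.sql" is a made-up schema file.
mwscript sql.php --wiki=testwiki extensions/MyExtension/tables.sql
```
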
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17133
[22:01:07] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17133
[22:04:04] !log added memcached_1.4.14-0wmf1_amd64 to precise-wikimedia
[22:04:12] Logged the message, Master
[22:04:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds
[22:09:12] preilly: have you all started yet? AaronSchulz isn't done with the TMH deployment yet
[22:09:49] robla: I can wait
[22:10:45] thanks. trying to make sure that the problem we're seeing is something related to TMH
[22:10:58] test.wikipedia.org is pretty broken right now
[22:17:18] New patchset: Asher; "return an instant http 204 for http://bits.wikimedia.org/event.gif requests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17138
[22:17:21] preilly: can you stash or remove your live wmf-config changes?
[22:17:29] I want to pull a fix to extension-list
[22:17:29] !log authdns-update for ms-be1006 and ms-be1012
[22:17:37] Logged the message, RobH
[22:17:40] AaronSchulz: I just disabled it on testwiki
[22:17:51] AaronSchulz: you can pull over it
[22:18:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17138
[22:18:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17138
[22:21:00] New patchset: Aaron Schulz; "Fixed extension-list entries." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17139
[22:21:14] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17139
[22:33:28] New patchset: Asher; "http 204 responses should not contain a body" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17140
[22:34:08] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17140
[22:36:02] New patchset: Asher; "http 204 responses should not contain a body" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17140
[22:36:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:36:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17140
[22:39:22] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17140
[22:49:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[23:04:33] RECOVERY - Puppet freshness on db63 is OK: puppet ran at Tue Jul 31 23:04:25 UTC 2012
[23:04:47] srv281 being out of disk space is known and is no big deal?
[23:05:10] robla: I think that's true.
[23:06:03] RECOVERY - MySQL disk space on db63 is OK: DISK OK
[23:06:08] okee doke...I guess I won't worry about it then
[23:06:32] robla: the only place it exists in the pybal configs is commented out of the rendering cluster.
[23:06:53] some bits of scap fail because of it being out of space
[23:07:02] but I don't know which bits those are
[23:07:08] I mean, it'd be great to be fixed and all.
[23:07:24] but I don't think it's actively harming any user-facing processes (only dev-facing processes, eg scap).
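Asher's pair of changes, 17138 (return an instant 204 for event.gif) and 17140 (strip the body), follow RFC 2616, which says a 204 response must not include a message body. The effect can be spot-checked from any client; the hostname comes from the change description, and present-day behaviour may of course differ from this 2012 deployment:

```shell
# Fetch event.gif and show the response headers; -s silences the
# progress meter, -i includes headers in the output.
curl -si http://bits.wikimedia.org/event.gif | sed -n '1,10p'
# Once 17138/17140 are deployed, the status line should read
# "HTTP/1.1 204 No Content"; any bytes after the blank header/body
# separator would be exactly the bug change 17140 fixes.
```
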
[23:07:34] got it...ok
[23:12:19] New patchset: Asher; "adding db63 to s1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17144
[23:12:57] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17144
[23:13:09] Change abandoned: Ori.livneh; "Committed by Asher in I190eb906 and I2690406f" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16724
[23:20:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:22:05] New patchset: Bhartshorne; "adding ms-be1007 to dhcp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17146
[23:22:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17146
[23:23:50] j^: proximally, the squids. originally, maybe the image scalers?
[23:24:00] As the headers should say..
[23:24:03] RECOVERY - NTP on db63 is OK: NTP OK: Offset -0.01002800465 secs
[23:24:08] I don't actually remember how swift chooses mimetypes.
[23:24:13] X-Cache: MISS from sq44.wikimedia.org
[23:24:13] X-Cache-Lookup: MISS from sq44.wikimedia.org:3128
[23:24:15] yada yada
[23:25:12] oh wait, that's not a thumbnail.
[23:25:13] nevermind.
[23:25:30] upload squids/apaches
[23:26:16] oh, no, won't be apaches, it'll be ms7?
[23:31:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.457 seconds
[23:38:24] New patchset: J; "Add webm mimetype to apache configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17149
[23:39:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17149
[23:39:24] New review: Spage; "Hi, I'm new to E3. I noticed this rule matches eventXgif as well as event.gif, so maybe prepend a ba..."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/17138
[23:47:04] Ryan_Lane: our scap scripts are at /trunk/debs/wikimedia-task-appserver right? (not the ones called from fenari, which are in puppet)
[23:48:20] I don't know
[23:48:34] Ryan_Lane: you're in good company :)
[23:48:39] I *think* that's it
[23:49:28] TimStarling would know :)
[23:53:00] yes, wikimedia-task-appserver
[23:53:32] which is still in subversion unlike all the debs that have been updated recently
[23:54:02] they can probably be moved out to puppet now that we have puppet
[23:55:32] TimStarling: I made 2 commits to that svn dir now
[23:55:52] I want the texvc stuff to not automatically be triggered
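Spage's review on change 17138 flags an unescaped dot: in a regex, `.` matches any single character, so a rule written as event.gif also matches eventXgif. The actual rule from the change isn't reproduced here, but grep illustrates the difference:

```shell
# Unescaped dot: "." matches any character in that position.
echo "eventXgif" | grep -c 'event.gif'            # prints 1 (unwanted match)
# Escaped dot: matches only a literal "." (|| true guards grep's
# non-zero exit status when nothing matches).
echo "eventXgif" | grep -c 'event\.gif' || true   # prints 0
echo "event.gif" | grep -c 'event\.gif'           # prints 1
```

Anchoring (prepending a boundary or `^` as the review suggests) plus escaping the dot keeps the rule from firing on lookalike URLs.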