[00:09:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:19:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.960 seconds
[00:53:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:04:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.239 seconds
[01:38:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:41:55] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 245 seconds
[01:42:57] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 262 seconds
[01:49:24] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 651s
[01:49:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds
[01:52:27] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds
[01:55:45] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 22s
[01:57:24] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 4 seconds
[01:59:57] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours
[02:02:48] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:30:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.033 seconds
[05:31:03] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[06:05:26] New review: Jeremyb; "change looks good, no comment on policy" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/16237
[07:08:55] New patchset: Jeremyb; "InitialiseSettings.php: reformat some sections" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16273
[07:28:09] New review: Nikerabbit; "Scheduled for I18n deployment tomorrow" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/16252
[07:40:59] hello
[07:42:58] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[07:44:28] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[07:47:37] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[08:01:07] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time
[09:01:34] New patchset: Hashar; "beta: send udp2log messages to -dbdump" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16276
[09:01:48] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16276
[09:02:06] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[09:27:26] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[09:38:49] New patchset: Hashar; "role::logging::labs for udp2log in labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16278
[09:39:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16278
[09:49:47] hello
[09:53:54] good morning :-)
[09:58:06] New patchset: Hashar; "abstract out udp2log for MediaWiki logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16278
[09:58:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16278
[09:59:03] New review: Hashar; "Patchset2 : reuses existing code from nfs nodes and make it a new role: role::logging::mediawiki. Be..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16278
[10:00:16] paravoid: if you want a morning review, got you an easy one with https://gerrit.wikimedia.org/r/#/c/16278/ ;-D
[10:00:24] it is to setup udp2log on beta
[10:09:29] hashar: nak
[10:09:56] as far as I know about our design, role classes are meant to tie other classes together
[10:10:22] ha hm
[10:10:23] wait
[10:10:32] ;-)
[10:10:56] I thought about factoring out the code to misc::udp2log::instance::mediawiki
[10:11:05] and then have the role class to just require that new one
[10:11:22] but that added an extra level which I thought was not really going to help anyone
[10:43:49] paravoid: and, include more wikimedia specifics
[10:44:00] ?
[10:44:31] mark: I don't understand
[10:47:32] sometimes we can pass specifics via variables so not all the details have to live in manifests and templates
[10:47:50] so the actual manifests (to become modules now) can be a little bit more generic than they'd otherwise be
[10:48:05] what are you commenting on?
[10:48:15] ah, the role classes comment
[10:48:18] yep
[10:48:22] okay, that makes sense now
[10:48:28] :)
[10:48:33] I context switched three times since I made that comment
[10:48:52] did you see hashar's commit?
[10:48:56] not yet
[10:49:25] please do, I'm not sure if it fits into your definition of role classes
[10:50:52] no not really
[10:51:04] in general, only one role class should arrange everything for a box
[10:51:52] New review: Mark Bergsma; "This is not really a role class entirely. In general, a role class should (under normal circumstance..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/16278
[10:54:00] back from lucnh
[10:54:07] well technically not a lunch but he .. :-D
[10:54:17] paravoid: have you looked at my change for labs logging ?
[10:55:44] have YOU read the backlog? ;)
[10:58:07] mark: doh cleaned it out :D
[10:58:14] it is kind of a reflex to ^L it
[10:59:32] mark: so should I rename / move that class from role::** to misc::logging::mediawiki ?
[10:59:52] or misc::udp2log::instance::mediawiki
[11:00:09] which would just be a wrapper around the parameterized class
[11:02:18] we only need this for labs, right... since otherwise you could just put it in site.pp directly
[11:04:21] we need to move this crap off of nfs anyway
[11:08:11] mark: it is just for beta indeed
[11:10:38] that's really slightly different from a role class
[11:10:47] but I'm not sure we should add another layer of abstraction for htis
[11:10:47] this
[11:13:29] mark: so what do I do now ? ;-)
[11:13:50] I don't care changing the class around to fit whatever coding style or class organization ops prefer
[11:13:57] I just need some a clear direction
[11:15:40] can you, fix the indenting to use tabs
[11:15:45] and add a system_role definition
[11:15:52] then I guess it's fine for now
[11:15:59] argh
[11:16:01] copy pasted :-D
[11:17:00] New patchset: Hashar; "abstract out udp2log for MediaWiki logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16278
[11:17:27] mark: done (patchset 3 )
[11:17:32] read again
[11:17:35] New review: Hashar; "Patchset3: space to tabs in manifests/role/logging.pp" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16278
[11:17:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16278
[11:17:50] need a system_role
[11:18:44] <^demon> hashar: https://github.com/klaussilveira/gitlist/blob/master/lib/GitList/Application.php -- how is line 30-32 even possible?
[11:19:24] <^demon> Shouldn't PHP yell at you for using an object as an array?
[11:19:37] ^demon: not if the object is an array?
[11:19:46] <^demon> How can an object be an array?
[11:19:47] ^demon: you could make an object implement ArrayObject or something
[11:20:33] <^demon> Would that work still?
[11:20:53] well since it is an array
[11:20:53] oh dear, tabs again :)
[11:20:55] it should :-]
[11:21:45] well regardless of the decision we'll make, we should not mix tabs and spaces in existing files eh
[11:21:53] ^demon: that should be the ArrayAccess interface: http://www.php.net/manual/en/class.arrayaccess.php
[11:22:14] <^demon> hashar: Ahhh, eventually up the symfony stack it implements ArrayAccess.
[11:22:25] <^demon> Still, that's kind of silly. Nobody does that.
[11:22:40] ^demon: expect all the frameworks that use symphony / silex and all ? ;-]
[11:22:46] ^demon: our code base is like 5 years old hehe
[11:22:57] <^demon> Doesn't mean that it's right ;-)
[11:23:06] mark: as I linked the other day, http://www.emacswiki.org/pics/static/TabsSpacesBoth.png
[11:23:23] (yeah, fully agreed)
[11:24:08] ^demon: by reimplementing an Array, you could validate the key given to the array. So $object['invalid_key'] = 'value' , could be made to throw an exception about how 'invalid_key' is … invalid! ;-D
[11:24:38] <^demon> hashar: In any case, that 'gitlist' software isn't pretty. Everything is done via exec() to cli git, and then lazy-cached in a ./cache directory if its expensive.
[11:24:40] <^demon> *shudder*
[11:25:59] hashar: re: jobrunner
[11:26:06] are we waiting for the TMH changes as well?
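[Editor's aside: the array-syntax-on-objects behaviour discussed above is PHP's ArrayAccess interface, and hashar's key-validation idea can be sketched roughly as below. The class and key names are illustrative, not taken from gitlist or symfony.]

```php
<?php
// Minimal sketch of the ArrayAccess mechanism under discussion:
// implementing the interface makes $object['key'] syntax work, and
// offsetSet() can validate keys, as hashar suggests.
class StrictContainer implements ArrayAccess {
    private $values = array();
    private $allowed;

    public function __construct(array $allowedKeys) {
        $this->allowed = $allowedKeys;
    }

    public function offsetExists($key) {
        return isset($this->values[$key]);
    }

    public function offsetGet($key) {
        return isset($this->values[$key]) ? $this->values[$key] : null;
    }

    public function offsetSet($key, $value) {
        if (!in_array($key, $this->allowed, true)) {
            throw new InvalidArgumentException("'$key' is ... invalid!");
        }
        $this->values[$key] = $value;
    }

    public function offsetUnset($key) {
        unset($this->values[$key]);
    }
}

$c = new StrictContainer(array('git.client'));
$c['git.client'] = '/usr/bin/git';   // accepted
// $c['invalid_key'] = 'value';      // would throw InvalidArgumentException
```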
[11:26:10] I've kinda lost track
[11:26:16] (again, heh)
[11:29:42] there is the https://gerrit.wikimedia.org/r/#/c/11610/
[11:29:59] the thing I hate is that operations/debs/wikimedia-job-runner is a Debian package
[11:30:11] which rely on a shell script in a MediaWiki extension ( extensions/WikimediaMaintenance )
[11:30:29] maybe we could ditch the package out and replace it by a puppet class :-D
[11:31:01] anyway, in change 11610, we needed a new timeout parameter
[11:31:55] ewww
[11:31:58] which is closely related to https://gerrit.wikimedia.org/r/#/c/15954/ that adds -t maxtime
[11:32:00] or
[11:32:13] if you are in the mood for it, we can deprecate / kill the debian package
[11:32:17] and move everything in puppet :-]
[11:32:21] I am in the mood for it
[11:32:23] (or move everything in the deb package up to ops)
[11:32:25] hehe
[11:32:35] but I don't see how moving it to puppet will be any better
[11:32:54] one thing we have to be VERY carefully, is that any change to job related script has the potential to kill the production job system :/
[11:33:01] (which we need to rewrite entirely, really)
[11:33:53] people keep telling me that I have to break the site to really be part of this team
[11:34:02] which I haven't done yet
[11:34:08] seriously?
[11:34:13] ;-D
[11:34:33] :P
[11:34:41] you must be very cautious (which is a great competency/skill/ability/something)
[11:35:14] killing the jobrunners doesn't cut it
[11:35:20] that's too boring
[11:35:28] damnit
[11:36:01] so,
[11:36:15] why do we even have the job runner deb?
[11:36:29] isn't that a mediawiki thing?
[11:36:33] I guess that is how we managed dependency / deploying init script and such
[11:36:43] because once upon a time, we didn't have puppet
[11:36:50] and that of course :-D
[11:36:57] hehe
[11:37:04] i'd be fine with that moving into puppet
[11:37:10] so we had to poke mark/tim to get the change to sneak in a .deb
[11:37:21] as long as puppet doesn't need to deploy heaps of files which belong in a db
[11:37:22] deb
[11:37:26] but I don't think that's the case here
[11:37:31] right, fully agreed
[11:37:45] there is a shell script / an init script / a default file in etc. That is about it
[11:38:01] yeah, sounds like something more easily handled in puppet
[11:38:27] could convert it to upstart too
[11:38:32] if that's cleaner
[11:38:39] paravoid: that would need a wait to setup specific init script
[11:38:42] role based
[11:38:53] eh?
[11:38:58] something like having a default /etc/init.d/run-job-${some name}
[11:39:11] with TMH, we will start transcoding video
[11:39:20] so we will have boxes dedicated to only videotranscoding
[11:39:28] says who
[11:39:34] the shell script should be given the type of job to run like webTranscoding
[11:39:44] * hashar finds in puppet an example
[11:39:56] init scripts don't take arguments
[11:40:27] some of them do, but that's not during boot and that's always counterintuitive
[11:40:38] something like the varnish stuff: service { "varnishncsa-${name}":
[11:40:38] require => File["/etc/init.d/varnishncsa-${name}"],
[11:40:47] yes
[11:40:49] I hate that :P
[11:41:03] * hashar git blame the varnish.pp to find out who introduced that :-]]]]]]]]]]]]
[11:41:09] then again, i'm not sure if upstart's INSTANCES are any better
[11:41:22] peter introduced that
[11:41:56] i played around with upstart's INSTANCE env var support, but couldn't get that to work reliably with puppet
[11:42:09] although I believe puppet has some support for upstart jobs now, haven't looked at it yet
[11:42:28] * paravoid knows very little about upstart jobs
[11:42:46] (not on purpose :)
[11:43:31] the problem is, you need to pass an environment variable (or argument) to the init script
[11:43:38] which indeed doesn't work on boot
[11:43:42] and also doesn't work that well in puppet
[11:43:44] or at least, didn't
[12:00:33] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours
[12:03:23] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[12:19:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16273
[12:22:01] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16268
[12:22:11] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16267
[12:22:41] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16237
[12:24:00] Reedy: I did sneak a change for beta this morning
[12:24:27] Reedy: 66ca8b0 - beta: send udp2log messages to -dbdump (3 hours ago)
[12:24:35] haven't synced it on production though :/
[12:24:38] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16264
[12:26:02] ^demon: do you happen to know which DB server Gerrit uses now?
[12:26:15] <^demon> db1048.
[12:26:48] bah it is not in ishmael apparently :/
[12:28:43] <^demon> hashar: https://gerrit.wikimedia.org/r/#/c/16150/
[12:30:29] ^demon: great :-]
[12:30:57] ^demon: latency was definitely improved. that might a point to strike in the github vs gerrit wiki page :-]
[12:34:00] mark: what do we do about my role::logging::mediawiki class https://gerrit.wikimedia.org/r/#/c/16278/ ?
[12:34:29] mark: should I get rid of it in favor on misc::udp2log::instance::mediawiki or is that good to go ? :-]
[12:35:38] <^demon> hashar: Someone already crossed it off the "todo" list on the eval page.
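[Editor's aside: the per-instance init script pattern mark quotes above (the varnishncsa fragments) is usually wrapped in a puppet define, since a stock init script cannot take an instance argument at boot. A rough sketch under that assumption; everything except the `varnishncsa-${name}` naming is illustrative.]

```puppet
# Sketch of the per-instance service pattern discussed above: a define
# stamps out one generated init script and one service per instance,
# so each instance can be started at boot without arguments.
define varnish::ncsa_instance() {
    file { "/etc/init.d/varnishncsa-${name}":
        owner   => 'root',
        group   => 'root',
        mode    => '0755',
        # hypothetical template that bakes ${name} into the script
        content => template('varnish/varnishncsa.init.erb'),
    }

    service { "varnishncsa-${name}":
        ensure  => running,
        require => File["/etc/init.d/varnishncsa-${name}"],
    }
}
```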
[12:36:20] I did it :p
[12:36:50] \O/
[12:38:02] !log rebalanced swift rings moving more content to new object servers
[12:38:10] Logged the message, Master
[12:43:21] i thought we already covered that hashar
[12:43:59] mark: mind copy/pasting / repeating ? :r(
[12:44:28] 13:15:40 <@mark> can you, fix the indenting to use tabs
[12:44:28] 13:15:44 <@mark> and add a system_role definition
[12:44:29] 13:15:52 <@mark> then I guess it's fine for now
[12:45:14] ahh my brain parser skipped the system_role line :-D
[12:45:51] twice :P
[12:45:52] mark: I did some nice alignment for the values passed to parameters
[12:45:53] maplebed: oh hi
[12:45:55] wanna keep them ?
[12:46:01] or should I get rid of them too ?
[12:46:02] you seem like you need vacation hehe
[12:46:04] paravoid: morning
[12:46:13] definitely :-( haven't took any vacations for like 2 years
[12:46:30] maplebed: can I shut down owa1/owa2?
[12:46:30] You can parse brains?
[12:46:31] and my little daughter keep crying every evening so yeah, will definitely just sleep for 3 weeks huuh
[12:46:34] !log rebooting ms-be10 for xfs errors and a clean boot
[12:46:42] Logged the message, Master
[12:46:56] paravoid: I'd rather not; how come?
[12:47:05] oh, and shut down or reboot?
[12:47:43] PROBLEM - swift-account-server on ms-be10 is CRITICAL: Connection refused by host
[12:47:43] PROBLEM - swift-object-updater on ms-be10 is CRITICAL: Connection refused by host
[12:47:43] PROBLEM - swift-container-updater on ms-be10 is CRITICAL: Connection refused by host
[12:47:52] PROBLEM - swift-account-auditor on ms-be10 is CRITICAL: Connection refused by host
[12:47:52] PROBLEM - swift-object-auditor on ms-be10 is CRITICAL: Connection refused by host
[12:47:52] PROBLEM - swift-container-auditor on ms-be10 is CRITICAL: Connection refused by host
[12:48:01] PROBLEM - SSH on ms-be10 is CRITICAL: Connection refused
[12:48:10] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:48:10] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:48:10] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:48:11] PROBLEM - swift-object-replicator on ms-be10 is CRITICAL: Connection refused by host
[12:48:11] PROBLEM - swift-account-reaper on ms-be10 is CRITICAL: Connection refused by host
[12:48:11] PROBLEM - swift-container-replicator on ms-be10 is CRITICAL: Connection refused by host
[12:48:11] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:48:17] maplebed: shutdown, mark was telling me that owa is something obsolete? can we recycle the hardware?
[12:48:29] but CT was telling me that you might still be using it for swift tests
[12:48:38] PROBLEM - swift-container-server on ms-be10 is CRITICAL: Connection refused by host
[12:48:46] PROBLEM - swift-object-server on ms-be10 is CRITICAL: Connection refused by host
[12:48:46] PROBLEM - swift-account-replicator on ms-be10 is CRITICAL: Connection refused by host
[12:48:52] the hardware is effectively being recycled atm (as you say, for swift tests)... I just didn't change the names (since we don't do that).
[12:49:16] oh I thought that had moved to labs already
[12:49:23] mark: not performance testing.
[12:49:26] can't do it there.
[12:49:31] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.011 seconds
[12:49:33] functional testing is in labs.
[12:50:00] ignornig these pages right?
[12:50:17] New patchset: Hashar; "abstract out udp2log for MediaWiki logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16278
[12:50:18] i assume ben is looking at it
[12:50:20] As soon as I get the eqiad cluster up and running perf testing will move to that cluster and we can actually recycle the machines.
[12:50:43] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.016 seconds
[12:50:43] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.018 seconds
[12:50:43] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.010 seconds
[12:50:56] maplebed: so, remove them from decom hosts then
[12:50:58] New review: Hashar; "Patchset 4:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16278
[12:50:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16278
[12:51:02] (I will I mean)
[12:51:07] is this just owa1/owa2?
[12:51:13] mark: got the system_role and removed the spaces : https://gerrit.wikimedia.org/r/#/c/16278/
[12:51:15] or owa3 too?
[12:51:18] paravoid: 3 as well.
[12:51:21] okay
[12:51:35] they hadn't run puppet for a while though
[12:51:46] apergos: ms-be10 paged? or just hit IRC?
[12:51:55] ms-fe LVS paged
[12:51:58] that's system_role is wrong, hashar
[12:52:03] that's not the name of the class is it
[12:52:05] paravoid: that's true; puppet's disabled on them (so as to not wipe out the perf testing changes)
[12:52:19] erm, that's bad
[12:52:36] yeah that's not a good idea
[12:52:36] ms-fe yeah
[12:52:39] esp. considering we do access prov/revocation through puppet
[12:52:43] among other reasons
[12:53:39] can we puppetize or make puppet ignore these perf testing changes?
[12:53:57] mark: I have no idea what the system_role is for I just copied it from above aka role::logging
[12:54:18] hashar: it's the line in /etc/motd, and it the name should match the role class name
[12:55:34] New patchset: Faidon; "Remove owa1/2/3 from decom, still in use" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16294
[12:56:09] New patchset: Faidon; "Remove OWA manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16295
[12:56:34] mark: I fixed both system roles with https://gerrit.wikimedia.org/r/16296 ;-D
[12:56:34] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100%
[12:56:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16294
[12:56:44] New patchset: Hashar; "fix system_log entry in role::logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16296
[12:56:47] maplebed: ^^ deletes owa.pp, ack?
[12:56:52] 16295 that is
[12:56:54] mark: too
[12:57:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16295
[12:57:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16296
[12:57:28] paravoid: fine by me, but I"m no authority. I just use the hardware... :P
[12:57:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16294
[12:58:05] maplebed: :-)
[12:58:36] maplebed: so, < paravoid> can we puppetize or make puppet ignore these perf testing changes?
[12:58:42] paravoid: yeah, fine
[12:59:09] New review: Faidon; "approved my mark on IRC" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/16295
[12:59:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16295
[12:59:17] by even :)
[12:59:23] paravoid: to me, it makes sense to puppetize the results of perf testing changes, but not really to do each one while testing it.
[12:59:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16278
[12:59:41] i.e. when we figure out which changes make performance better, we puppetize and deploy.
[12:59:52] but when poking around at all the various knobs, it's not really useful.
[12:59:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16296
[12:59:58] sure, but can we then make puppet just ignore these local changes?
[13:00:09] maplebed: that's all nice and fun, but you should never disable puppet on a system
[13:00:17] having a disabled puppet on a system is a bad idea imho
[13:00:18] since we rely on it for maintaining our systems
[13:00:25] you can disable it briefly
[13:00:30] but shouldn't do that for more than a day or two
[13:00:53] if you want puppet not to touch something you're working on, you need to configure it not to do that, not disable it
[13:01:11] mark: (I merged your merges on sockpuppet)
[13:01:17] (thanks)
[13:01:58] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[13:03:43] I'm trying to think how I would ask puppet to ignore the swift config files while still using the role class setup we've got.
[13:04:00] don't use the role class setup?
[13:04:05] I suppose I could just not include the swift classes on those hosts (since they're there, puppet won't remove them)
[13:04:08] or use it temporarily to set stuff up, and then remove it
[13:04:17] yeah
[13:04:29] that feels way weird.
[13:05:03] but yeah, it would work.
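[Editor's aside: the system_role that mark asks hashar to add above is a local define in the WMF puppet tree that records what a box does (it ends up in /etc/motd), with a title matching the role class name. A hedged sketch of the convention; the description string is illustrative.]

```puppet
# Sketch of the system_role convention discussed above: the define's
# title matches the enclosing role class, and its description is what
# shows up in /etc/motd on hosts that include the role.
class role::logging::mediawiki {
    system_role { 'role::logging::mediawiki':
        description => 'MediaWiki udp2log logging host',
    }
    # ... rest of the role class ...
}
```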
[13:05:08] feels fine to me
[13:05:20] it feels weird because then puppet is lying about what the host is doing.
[13:05:22] feels way better than disabling config management entirely and letting boxes sit unmanaged anyway
[13:05:47] if you care about it, you can add parameters for a debug/testing mode where it handles the box slightly differently
[13:05:59] but it clutters up the config and takes more time
[13:08:44] apergos: btw, I cleaned up snapshot1/2.wikimedia.org
[13:08:52] thank you
[13:09:20] New patchset: Bhartshorne; "disabling puppet swift configs on test cluster for local perf testing changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16298
[13:09:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16298
[13:10:03] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16298
[13:10:04] maplebed, mark: btw, what's this "do not rename policy"?
[13:10:25] are these machines going to be named owa forever?
[13:12:06] these probably won't
[13:12:16] because we have no owa project, and they're now misc machines
[13:12:38] we try to avoid renaming machines as it's a pita and for misc machines, we have generic names
[13:12:52] aha
[13:12:54] okay :)
[13:12:55] thanks.
[13:13:20] so I try to only use cluster names where we're sure they won't rename (squids, mediawiki, etc)
[13:13:25] and of course this example is right at the line
[13:13:48] hehe
[13:15:28] I got a funny puppet error : Could not find resource 'Class[Misc::Udp2log]' for relationship on 'Misc::Udp2log::Instance[mw]'
[13:15:45] hahahaaha
[13:15:47] manifests/misc/udp2log.pp
[13:15:56] Reedy: at least one laughing :-] thanks!
[13:16:29] the parameterized class misc::udp2log::instance has something that look like a dependency: Class["misc::udp2log"] -> Misc::Udp2log::Instance[$title]
[13:16:38] I am wondering if there is a case mismatch
[13:17:11] that's horrible :)
[13:17:16] no case mismatch I can see though
[13:17:23] ahh
[13:17:35] maybe I needed to include misc::udp2log BEFORE calling the parameterized class
[13:17:51] what do you mean by before?
[13:18:13] I have setup a new role class class role::logging::mediawiki {
[13:18:18] which just call misc::udp2log::instance { "mw":
[13:18:24] maybe it need an include misc::udp2log
[13:18:25] right
[13:18:28] yes
[13:18:30] you need to do that
[13:18:40] the Class[...] -> ... is a depedency, not an include
[13:18:50] can't puppet nicely autoload its classes?
[13:18:52] and you depend on something that was never included, hence the error
[13:19:13] it can, but how would it know that you needed that now? :)
[13:20:48] cant we include the class directly inside the parameterized class ?
[13:20:52] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100%
[13:21:01] you can if you want
[13:21:20] thus the Class["misc::udp2log"] -> Misc::Udp2log::Instance[$title] will self satisfy ;)
[13:26:16] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:26:39] New patchset: Hashar; "role::logging::mediawiki needs misc::udp2log and utilities" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16300
[13:26:58] paravoid: ended up including the needed class before calling the parameterized instance ^^^ ( 16300 )
[13:27:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16300
[13:29:52] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16300
[13:31:30] paravoid: also puppet complain about : require => apache_site['controller', '000_default'] with message Resource references should now be capitalized
[13:31:37] paravoid: so we should do Apache_site
[13:31:57] paravoid: but that might just be a false positive from puppet since apache_site is one of our parameterized class
[13:32:51] nope, that's right
[13:33:01] lemme fix that
[13:33:16] paravoid: I got a change to fix some other deprecations
[13:34:20] ♥
[13:34:33] maplebed: if only we had that love machine
[13:34:50] so true.
[13:35:20] New patchset: Hashar; "Reource references should now be capitalized" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16302
[13:35:41] paravoid: here are some more deprecations https://gerrit.wikimedia.org/r/#/c/16302/
[13:35:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16302
[13:36:07] paravoid: I did not fix the calls to our classes such as require => apache_site[foobar], or require => git::clone[foobar]
[13:36:49] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16302
[13:37:25] grnbmbm
[13:37:28] my puppet syntax file sucks
[13:37:34] New patchset: Faidon; "Fix a non-capitalized resource reference" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16304
[13:37:39] "$swiftcleaner_basedir/swiftcleanermanager -c $swiftcleaner_basedir/swiftcleaner-$name.conf -A /tmp/swiftcleaner-${name}-\$(date +\%Y\%m\%dT\%H\%M\%S) -p /tmp/swiftcleaner-$name.pid >> /tmp/swiftcleaner-${name}-\$(date +\%Y\%m\%dT\%H\%M\%S).log"
[13:37:43] that one is fully in purple
[13:37:54] apparently puppet complain about \% not being a recognized escape sequence
[13:38:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16304
[13:38:21] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16304
[13:40:27] paravoid: mark: I got mediawiki logs again on beta!!!! thanks! ;-]
[13:40:48] yay
[13:44:52] !log authdns-update for ms-be eqiad hosts
[13:44:59] Logged the message, RobH
[14:09:37] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100%
[14:30:27] RECOVERY - Host ms-be10 is UP: PING WARNING - Packet loss = 80%, RTA = 0.25 ms
[14:41:42] our udp2log stuff is a real mess :-D
[14:42:32] the generated init script adds parameters which are not recognized by the udplog daemon :/
[14:44:19] generated by whom?
[14:45:16] it is an erb template, eh?
[14:45:18] haha, hilarious
[14:45:22] i helped refactor the puppet stuff
[14:45:34] so we have several init scripts
[14:45:35] but I tried not to change the end result too much
[14:45:37] yeah
[14:45:41] its really annoying
[14:45:43] one which is a file named udp2log-aft
[14:45:47] the one provided by the debian package
[14:45:47] yeah
[14:45:51] and one which is .erb based
[14:45:55] (which afaik is not used anymore)
[14:45:58] that .erb is used by non -aft puppet classes
[14:46:09] but the .erb includes parameters such as -b and --test
[14:46:14] which are not in udp2log source : /
[14:46:17] the debian package one?
[14:46:22] oh
[14:46:38] yeah, so, afaik, all udp2log instances are puppetized using udp2log::instance
[14:46:39] in puppet
[14:46:43] and we have two source trees for udp2log (one in svn under /trunk/udplog and the other in gerrit analytics/udplog
[14:46:43] hehe
[14:46:52] yeah I have used that puppet class
[14:46:55] which rely on the .erb
[14:46:55] yes, and those afaik are the same
[14:47:04] i created the analytics/udplog one
[14:47:12] the svn / git source tree are similar. I guess the svn has been migrated to git
[14:47:13] but it has not been changed from svn head
[14:47:16] yeah
[14:47:17] and hopefully made readonly
[14:48:31] ok
[14:48:39] will first open a bug about phasing out the svn path
[14:51:14] ok
[14:51:18] thanks!
[14:51:55] https://bugzilla.wikimedia.org/show_bug.cgi?id=38602
[14:51:58] cced you and Tim
[14:52:05] I guess ^demon will take care of it :-]
[14:52:31] ottomata: oh and I have added you to the linkedin (great way to remember names behind nicknames :pD )
[14:52:42] aye! cool
[14:52:49] ottomata: so can I phase out the -b --test parameters in the erb template.
[14:53:18] i think so
[14:53:28] or maybe even check to see if anything is using the default template
[14:53:32] udp2log
[14:53:37] hmmmm no wait it is
[14:53:38] ahhh i dunno
[14:53:39] yes.
[14:53:40] i think so
[14:54:05] maybe
[14:54:08] <^demon> hashar: Why did you waste time opening a bug?
[14:54:13] lolol
[14:54:14] <^demon> Could've just pinged me to begin with.
[14:54:32] ^demon: wasn't sure you were online sorry ;-]
[14:54:39] I like bug because that is like "fire and forget"
[14:54:46] <^demon> I've been around all morning.
[14:54:51] sure sure sorry ;-(
[14:54:53] I am tired
[14:54:54] Not my problem. (tm)
[14:54:56] must be that
[14:54:58] Reedy: ;)))
[14:55:43] ottomata: is the udp2log-aft still around? What was it for?
[14:56:00] <^demon|away> ^ See what I did there? I let you know I'm not around ;-)
[14:56:14] -aft being article feedback tool atta guess ;)
[14:56:23] ^demon|away: ohhh I am filtering nicknames changes and hiding join/part ;-D
[14:56:33] ^demon|away: though I could look up your name in the list or using tab completion
[14:56:37] ^demon|away: sorry ;-(
[14:57:54] I for one welcome bugs even when I'm around
[14:58:12] helps me to not deal with it immediately if I want to and helps me to not forget
[15:05:27] srv281 is fulled ups
[15:05:56] mark: heya
[15:05:56] hey RobH, any news from Dell on stat1001?
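[Editor's aside: the two puppet errors hashar works through above come down to two rules: a chained arrow like Class["misc::udp2log"] -> Misc::Udp2log::Instance[$title] only orders resources already in the catalog (it is a dependency, not an include), and resource references in require must be capitalized. A sketch of the fixed shape; the parameter name is illustrative, not from the real misc::udp2log::instance.]

```puppet
# The chained arrow in misc::udp2log::instance only *orders* the class
# relative to the instance; the class must still be put in the catalog
# explicitly, or puppet fails with
# "Could not find resource 'Class[Misc::Udp2log]'".
class role::logging::mediawiki {
    # pull the base class into the catalog first
    include misc::udp2log

    misc::udp2log::instance { 'mw':
        log_directory => '/var/log/udp2log',   # illustrative parameter
    }
}

# Relatedly, resource references must be capitalized:
#   require => apache_site['controller']   # deprecated, warns
#   require => Apache_site['controller']   # correct
```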
[15:05:57] New patchset: Faidon; "Create a varnish module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16411 [15:06:03] mark: ^^^ [15:06:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16411 [15:06:46] mark: completely untested, but I thought of converting a larger module in case we need to make some decisions [15:06:47] drdee: i dunno why i have not gotten the new part yet, i will call them later today [15:06:57] ty! [15:07:16] srv281 The last Puppet run was at Mon Sep 26 00:48:26 UTC 2011 (434298 minutes ago) [15:07:23] /dev/sda1 7.9G 7.5G 2.2M 100% / [15:07:27] 106 updates are security updates. [15:07:39] yay [15:07:53] I work that out to be 300 days... [15:08:03] and I can't login [15:08:21] with root key? [15:08:22] probably because my key was added less than 434298 minutes ago :) [15:08:26] heh [15:09:39] hashar: re udp2log-aft [15:09:40] yes [15:09:43] i think that is still around [15:09:47] that is article feedback tool [15:10:12] i can't remember why it is a separate instance, i think that the AFT is sending logs to in manually or something, [15:10:27] so it is not consuming webserver access logs, like all of the other udp2log instances [15:10:42] so they wanted to keep the AFT logs separate from the webserver logs, e.g. sampled-1000.log, etc. [15:12:14] ottomata: btw, any news from the scribe people? 
[15:12:31] ottomata: well I need to figure out how it is installed so :-) [15:12:56] ahh misc::udp2log::instance { "aft": [15:13:01] it is templatized already [15:14:42] !log srv281 has a full / and hasn't had a puppet run in over 434298 minutes [15:14:46] Reedy: are you going to file an RT or should I/ [15:14:49] (for completion) [15:14:50] Logged the message, Master [15:14:57] I can do [15:15:11] Or we pick on someone who does have access ;) [15:15:19] heh :-) [15:15:45] Ah, interesting [15:15:52] paravoid, re scribe people, nopers [15:16:01] I'm presuming it's not pooled, it's just still in mediawiki-installation [15:16:16] one guy on the google group said "yay do it" but that's about it [15:16:51] cmjohnson1: agreed [15:17:46] Seems pretty sensible [15:18:02] well, I guess I don't need access to reprovision [15:18:39] apaches:#{ 'host': 'srv281.pmtpa.wmnet', 'weight': 100, 'enabled': False } #testing as a renderer only now [15:18:43] rendering:#{ 'host' : 'srv281.pmtpa.wmnet', 'weight': 40, 'enabled': False } [15:18:50] so, disabled indeed [15:20:28] cmjohnson1: I can do it now, no ticket needed [15:21:02] New patchset: Bhartshorne; "adding puppet rules for eqiad prod swift cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16413 [15:21:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16413 [15:23:16] PROBLEM - MySQL Idle Transactions on db35 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:19] what do you mean by 'disconnected everything'? [15:26:44] would you replace the drives in slots 5, 7, and 8, then power back on? [15:26:58] ping me when you power on and I'll watch it boot via ipmi. 
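[Editor's note: the `misc::udp2log::instance { "aft": }` resource mentioned above is the templatized form of the -aft instance; a hedged sketch of such a declaration follows — only the resource title "aft" comes from the log, the parameters are illustrative.]

```puppet
# Hypothetical invocation of the puppetized udp2log instance define;
# parameter names and values below are assumptions.
misc::udp2log::instance { "aft":
    port          => 8420,                     # assumed port
    log_directory => "/var/log/udp2log/aft",   # assumed path
}
```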
[15:27:44] back [15:27:45] hey paravoid + notpeter, [15:28:00] i want to set up scribe to try some things on our analytics sandbox cluster [15:28:11] i want to scribe_cat from udp2log from oxygen into hadoop [15:28:23] i have no problem installing from .deb manually on sandbox cluster [15:28:26] New patchset: Bhartshorne; "adding puppet rules for eqiad prod swift cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16413 [15:28:46] but to cat from udp2log, the easiest thing to do would be to install scribe package on oxygen [15:28:58] hm. [15:29:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16413 [15:29:04] and then set up a udp2log pipe into scribe_cat [15:29:11] cmjohnson1: there aren't any spare 2TB drives? [15:29:29] since the scribe .debs are not yet in our apt repo [15:29:36] would it be ok to manually dpkg -i them for now? [15:29:44] on oxygen? or should I not do that? [15:30:03] preferably not [15:30:25] mark, that was response for me? [15:30:26] I wouldn't like installing random unreviewed stuff on a production box [15:30:28] ottomata: yes [15:30:34] indeed [15:30:42] cmjohnson1: can you throw in the 1TB drives for now and replace them with 2TBs when you get the chance?
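[Editor's note: the udp2log-into-scribe_cat hookup proposed above could be a single filter line, assuming udp2log's usual `pipe <sampling factor> <command>` filter syntax; the path and category name are illustrative.]

```
# Hypothetical /etc/udp2log/filters entry: pipe every line (factor 1)
# into scribe_cat under an assumed "webrequest" category.
pipe 1 /usr/bin/scribe_cat webrequest
```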
[15:30:54] yeah makes sense [15:30:57] thought i'd ask [15:30:57] hmmmm [15:31:13] i'm waiting for notpeter to try it out on some test lucene search cluster [15:31:20] but i'm not sure how long that will take or when he will have time [15:31:20] cmjohnson1: honestly, any working drive will be ok [15:31:22] gr, hmmm [15:31:25] also, it seems to me that you're experimenting with new stuff on production [15:31:35] cmjohnson1: I just don't want to upset the drive order that the OS sees [15:31:37] ok lemme see if I can figure out a multicast pipe to get the logs there [15:31:38] you should be doing this in labs [15:31:40] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [15:31:40] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:41] even if there were debs for what you want, it'd seem a bad idea to me [15:31:43] and have those debs reviewed [15:31:49] totally [15:31:54] i've been working with peter on this [15:31:59] it's not just about the debs, it's an experimental thing altogether [15:32:01] well, peter shouldn't be doing that either [15:32:11] afaik he isn't doing it in production [15:32:21] and I have been testing these things out on my local vm, and in labs [15:32:23] where is that deb? [15:32:47] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16413 [15:32:47] mark, I will forward you an email I sent a bit ago [15:32:58] ok [15:33:36] tbh, I think the .deb is an implementation detail [15:33:42] ? [15:33:53] you could have used whatever software that is in Ubuntu and it would still be wrong doing that on oxygen [15:34:17] ?
[15:34:28] there isn't anything in ubuntu [15:34:32] and I haven't done anything on oxygen [15:34:33] you're doing experimenting/evaluating on a production system [15:34:37] or planning to do [15:34:39] was asking you guys [15:34:42] if it was ok [15:34:48] and you said nope, which is what I thought :) [15:34:52] right, that's what I'm replying to :) [15:35:05] which is what I am replying to :) [15:35:05] what I'm saying is that it's not about installing the .deb [15:35:08] hah awhat else are we doing? [15:35:23] paravoid, scribe is installed and running on a labs instance [15:35:32] great! :) [15:35:33] i have reviewed all of my committed changes with notpeter [15:35:49] there are two separate things we are talking about here [15:36:00] 1. we are going to try out scribe with lucene search logging [15:36:06] as a trial run [15:36:11] i have tested on local vm and labs [15:36:27] that is just to make sure it works at that scale [15:36:42] i don't expect any problems, but the lucene search logging was the less impactful of the two was to try this [15:36:47] the other is nginx ssl logging [15:37:06] so that's the one thing, i'm just waiting on notpeter to test some stuff out in the lucene test machines (wherever that is) [15:37:10] 2. 
[15:37:26] I want to pipe stuff through scribe to our analytics sandbox [15:37:37] that is not something that is really doable in labs [15:37:43] i'm not going to try to set up a hadoop cluster in labs [15:37:47] lol, srv281 is in a right mess [15:38:00] Reedy: I'm about to reprovision it [15:38:00] so, totally cool that I shouldn't install these .debs on oxygen [15:38:04] since it is a 100% prod machine [15:38:18] i don't consider the analytics cluster production yet [15:38:20] so I have no problem doing it there [15:38:41] we are going to wipe those machines and reinstall everything (with tons of reviews) before those will be considered production [15:38:43] ottomata: I don't really know about the specifics, but can't you simulate the log traffic instead of piping real actual logs there? [15:38:55] paravoid: sure, just saw a spam of permission errors about php-1.17 [15:39:17] would be kinda annoying, i guess I could manually copy logs over and script something to read them and try to pipe them in at about the same rate as the are written from prod servers [15:39:29] but, i think oxygen has a multicast udp2log thing set up [15:39:32] gonna try that [15:39:45] i might be able to subscribe to it from our analytics cluster, without installing anything new or fancy on oxygen [15:39:59] testing scribe on the analytics sandbox cluster is fine [15:40:02] it's not production yet [15:40:04] aye [15:40:44] ottomata: i'd like those debian packages to be pushed to gerrit for review [15:40:49] in a git-buildpackage repository format [15:41:17] we should make repositories for them [15:41:24] ok, i am kinda new to debian packaging, so paravoid, if you could help me with that, i would be much obliged [15:41:33] i did not make these debians/ t hough [15:41:43] they were made by others and modified by me to build the java libs [15:41:46] but, mark [15:41:48] the reason they are not in git [15:42:07] is because paravoid and I are trying to foster more collaboration from the scribe 
community on these [15:42:26] how is that a reason for them not being in git? ;) [15:42:31] they are in git [15:42:31] github [15:42:35] sure [15:42:39] they can be in gerrit as well [15:42:48] well, I didn't really expect that you wanted them installed somewhere until that happened :-) [15:42:50] they are forked from other github repos [15:43:03] if I put them in gerrit, they lose the fork history [15:43:04] i understand [15:43:12] but whatever we are gonna build and run in production, should be in gerrit [15:43:13] RECOVERY - MySQL Idle Transactions on db35 is OK: OK longest blocking idle transaction sleeps for 0 seconds [15:43:13] no they don't [15:43:21] that's the point of git [15:43:30] well, the github fork history perhaps [15:43:34] yeah, exactly [15:43:42] too bad [15:44:19] hm, ok... [15:44:31] why woudl you want to make people outside of wmf have to deal with gerrit? [15:44:31] I don't think so. [15:44:39] and it's ms-be10, not 9. [15:44:39] they don't need to deal with gerrit [15:44:52] you can push whatever branch you want in gerrit where it can be reviewed [15:44:52] I htink I would lose a lot of cooporation potential if I told people to go to our gerrit repos [15:44:57] oic. [15:45:01] what happens in github is separate [15:45:06] cmjohnson1: is there an access light? [15:45:10] I can thrash the drive for a minute [15:45:14] hm, ah ok, so you want me to maintain 2 repositories? [15:45:38] i mean, eyah, i guess there are tons of 'repositories' since it is git... [15:45:42] hm [15:45:49] so have a wmf 'production' branch [15:45:53] hm [15:45:54] i just want whatever we run that's not in ubuntu, be in gerrit [15:45:56] where we review it [15:46:12] we do the same for debian/ubuntu packages that we modify [15:46:20] we pull them in from git.debian.org, make changes, push those to gerrit [15:46:21] right... [15:47:07] is there a way to have gerrit track a branch on github? [15:47:09] cmjohnson1: ? 
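[Editor's note: the `git remote add origin git@github; git remote add review git@gerrit # done` shorthand above expands to something like the following; the repository URLs are placeholders, not the real paths.]

```shell
# One clone, two remotes: the community fork on GitHub and the WMF review
# queue in Gerrit (both URLs hypothetical). Work in a scratch directory.
cd "$(mktemp -d)"
git init -q udplog && cd udplog
git remote add origin git@github.com:example/udplog.git
git remote add review ssh://gerrit.example.org:29418/operations/debs/udplog
git remote -v   # shows both remotes; pull from origin, push to review
```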
[15:47:15] i don't think so [15:47:27] but we don't have to [15:47:34] New patchset: Hashar; "update udp2log init script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16417 [15:47:35] so, i'd have to add instructions in the README on how to pull/push between gerrit and github, right? [15:47:36] it's not like we're gonna build and install changes every day from that [15:47:39] right [15:47:46] it's jsut standard git push/pull [15:47:58] i don't see why that would need instructions in the README [15:48:02] well, they'd have to clone from github, and then set upstream on some branch to push to gerrit [15:48:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16417 [15:48:10] git remote add origin git@github; git remote add review git@gerrit # done [15:48:14] cmjohnson1: not that i know of no [15:48:18] its not a raid controller [15:48:22] right but that is only on my local [15:48:25] it is specifically NOT a raid controller [15:48:29] if someone else cloned the github repo, they'd have to do that too [15:48:39] or just you do the syncing to gerrit [15:48:42] does it not tell you which failed disk? [15:48:49] yeah, i would have to do it [15:48:54] the disk bay #s should be on the case someplace. [15:48:59] whenever we feel we need to build a new package release [15:49:03] which won't be every week now will it ;) [15:49:05] i guess i just like setting up things so that I personally am not a required piece [15:49:15] i don't see how you're a required piece [15:49:20] everyone can pull from github, push to gerrit [15:49:35] right, but the information on how to do so, where to pull, where to push, etc. needs to be known [15:49:40] this is no different from any other packages we maintain [15:49:43] someone would ahve to ask you or me [15:49:53] ok ok ok, anyway, that will work [15:49:55] this can be set as a field in the package control file [15:50:02] bwerrrrrrrrrrr [15:50:02] oh yeah? 
[15:50:05] you can point it at github if you want [15:50:06] yeah [15:50:11] as part of the git-buildpackage or whatever? [15:50:12] if you do "apt-get source " [15:50:19] often it will tell you the VCS url [15:50:22] hmmmmmm [15:50:24] coool [15:50:32] ok, I will get paravoids help with that when it is time [15:50:41] i'm not sure if we are ready for that yet, [15:50:51] we *might* (if I have time) try to build from some newer versions of thrift [15:50:56] fwiw, if we're going to be serious about it then we should use git.debian.org rather than github [15:51:02] which supposedly is doable but was difficult on my first runthrhough) [15:51:13] oh mygoodnes [15:51:24] ok, mark, paravoid, i am not going to think about this right now [15:51:31] although considering how Scribe is unmaintained, I'm not sure if I would want that in Debian [15:51:39] yeah you might be right [15:51:59] flume is another good looking option [15:51:59] I'm not sure I'd like wikimedia to move all of its logging infrastructure to an unmaintained software either [15:52:02] but it is jvm stuff [15:52:03] but that's not my call I guess [15:52:07] yeah, you might be right, i dunno [15:52:20] in general, i think scribe kinda just works [15:52:28] which might be why there isn't much development on it [15:52:32] famous last words? [15:52:35] haha, uyp [15:52:40] except when it won't and then we'll be on our own [15:52:47] (apparently) [15:53:07] yup [15:53:08] heheh [15:53:16] i'm used to that [15:53:45] also somehow, our logging infrastructure failing does not scare _me_ ;) [15:53:49] I can turn that off [15:53:57] but others get upset about that I think, hehe [15:54:05] heheh [15:54:26] esp. 
when one of the design goals of the new system is to not lose data :) [15:54:34] aye [15:54:44] bollocks is what I think about that [15:54:44] well, ok, let me ask the two of you a question [15:54:49] but ok, perhaps not lose a lot of data [15:55:14] for stuff that will be 100% on the analytics cluster, I think we can make decisions about what softwares to use ourselves [15:55:14] but [15:55:20] and it should be known what amount of data is lost [15:55:20] I don't see how loosing 5min of data is a big deal, it's not like you're dealing with high precision measurements or money [15:55:22] when it comes to things we will need to install in existing production machines [15:55:37] i think we need to work closely with you guys [15:55:42] so we know what you expect [15:55:44] you need to work closely with us anyway [15:55:45] and you know what is going on [15:55:50] of course of course [15:55:54] since we need to help maintain it [15:55:55] the final analytics cluster will be all reviewed [15:55:59] are you going to operate the analytics cluster? [15:56:03] yes [15:56:12] didn't know that [15:56:12] but but but [15:56:13] if you don't we can't help there and will have to turn it off if it fails [15:56:21] so, no ops involved in that at all? [15:56:22] haha, yes yes [15:56:26] ottomata will be ops too [15:56:28] no puppet, nagios, ganglia etc.? [15:56:28] yes and no [15:56:29] yeah [15:56:31] yes puppet [15:56:36] yes nagios, ganglia, etc. [15:56:42] but I will be main point of ops contact for that [15:56:46] i guess (?) 
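[Editor's note: returning to the packaging point earlier in the log — the "field in the package control file" that lets `apt-get source` report the VCS URL is `Vcs-Git` (with `Vcs-Browser` for the web view) in the source stanza of debian/control. A hypothetical stanza, with the package name, maintainer, and URLs as placeholders:]

```
Source: udplog
Section: admin
Priority: optional
Maintainer: Example Maintainer <ops@example.org>
Vcs-Git: git://github.com/example/udplog.git
Vcs-Browser: https://github.com/example/udplog
```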
[15:56:48] but yeah [15:56:55] everyone should know what is going on [15:57:06] i think I will be in ops meetings from now on [15:57:07] New patchset: RobH; "Revert "Adding db63 -db77 to the dhcpd file"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16418 [15:57:08] so that will help with that [15:57:22] yeah, you'll have to be a full part of ops or this can't work [15:57:27] yeah totally [15:57:30] i'm all for that and excited about it [15:57:41] but but, for existing prod stuff [15:57:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16418 [15:57:51] how do we go about evaluating options and choosing techs? [15:58:02] mail the lists [15:58:03] scribe vs. flume vs. kafka is a good example of a choice we need to make [15:58:07] hmm, right duh [15:58:09] heheh [15:58:10] I don't understand how that is equivalent to "you'll make decisions about softwares to use yourself" :) [15:58:19] ok ok [15:58:19] so [15:58:20] example [15:58:22] how you being part of ops [15:58:35] do we want to use datastax or do we want to use cloudera hadoop? [15:58:38] but not my call for sure [15:59:53] do you guys want to be involved in evalutating and experimenting with how the two work, how they perform on the cluster under different types of loads and jobs, etc. etc.? [15:59:58] s. [15:59:59] yes. 
[16:00:10] ok, so who else in ops is going to join us in doing that [16:00:14] everyone [16:00:17] would love some help and input for sure [16:00:19] as in, you mail the lists [16:00:22] and you'll get input [16:00:32] i fully expect asher to have some things to say about that, for example [16:01:10] and if you don't get any input, noone can complain that we weren't asked :) [16:01:19] when you do whatever you want :) [16:01:20] ja, ok [16:01:34] hehe [16:01:36] i mean, obviously everything we do is going to have to be reviewed and discussed [16:01:41] since it will all be set up via puppet eventually anyway [16:01:49] yeah right [16:01:57] in 5 years probably :P [16:02:01] ? [16:02:05] i'm always sceptical about that hehe [16:02:37] somehow this "we first test and then review/reinstall" never happens in practice ;) [16:02:46] I think sooner is better than later, the more you delay it the bigger the review workload will get [16:02:51] i am puppetizing as we go [16:02:53] and then it'll be too big to happen [16:03:01] but I have no idea if you've done that already [16:03:06] maybe you have and all is good [16:03:35] a bit of an off-topic question [16:03:45] have you seen the sFlow/HTTP stuff that some people are working on for logging? [16:03:52] New patchset: Platonides; "(Bug 38404) Change $wmgBabelMainCategory for eswiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16420 [16:04:33] no [16:04:50] !log reinstalling srv281; disk full, hasn't run puppet for 300 days, depooled for ages [16:04:58] Logged the message, Master [16:05:38] ottomata: http://sflow.org/draft_sflow_http.txt [16:05:42] ottomata: http://host-sflow.sourceforge.net/relatedlinks.php [16:06:28] PROBLEM - Host srv281 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:34] this is system perf monitoring? [16:06:47] http [16:06:54] hmm, do we do precise for MWs yet? 
[16:07:12] paravoid, mark, re puppetizing as we go [16:07:14] http://git.less.ly/?p=kraken-puppet.git;a=tree [16:07:22] (sandbox, remember) [16:07:23] paravoid: I have added you as a reviewer to https://gerrit.wikimedia.org/r/#/c/16417/ [16:07:47] ottomata: :/ I don't understand why you work from a separate puppet/git tree [16:07:48] paravoid: minor tweaks to the udp2log init script and some cleanup of an old file [16:07:59] because this is a sandbox [16:08:06] what does that mean? [16:08:06] and we can't wait for review every time I want to try soemthing [16:08:20] we won't include all of these configs in the eventual prod cluster [16:08:29] datastax vs CDH3 vs CDH4 [16:09:14] scribe vs flume vs kafka vs etc. [16:09:32] doing benchmarking with different cluster configs, etc. [16:09:33] hashar: will check in a bit [16:09:45] hashar: do you know if we do precise for srvNNN yet? [16:09:45] paravoid: I am out for today, will connect later on for some conf call [16:09:51] paravoid: I have no idea [16:09:59] okay, thanks anyway :) [16:10:01] paravoid: but "beta" has its apaches running Precise [16:10:22] paravoid: and I think you already updated some packages to let us run mw under Precise (such as font packages that got renamed) [16:10:45] yeah, I remember we fixed that for beta, but I'm not sure if we do that for prod yet [16:10:56] probably not :/ ask in ops-l maybe? [16:11:28] paravod, re sFlow, looks cool, but we want a logging solution that is a bit more generic I think [16:11:45] http requests, random application logs, lucene search logging, etc. [16:11:49] clicktracking, etc. [16:11:50] srv194 is precise, so I guess we do that [16:13:08] ottomata: can you imagine every team wanting to work on something creating their own git repo with their own puppet install and then coming a few months later with huge puppet commits? 
[16:13:16] that's my only problem with that [16:13:42] I haven't been involved in the discussions though [16:13:55] so nothing that I say should be considered the "ops team's position" [16:14:06] i think it is either I use this puppet repo as a sandbox, or I don't puppetize until we are done figuring it out [16:14:21] RECOVERY - Host srv281 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [16:14:42] but this isn't even a clone of ours [16:14:45] analytics cluster != labs, so I can't commit our puppet stuff without needing review [16:14:54] I never understood why we didn't make the analytics cluster part of labs [16:14:55] you can commit to your own clones as much as you want [16:15:15] s/clones/branches/ ? [16:15:16] running puppetmaster::self or whatnot there [16:15:24] no, clones [16:15:35] clones of our branches/repo/whatever [16:15:38] how would I get a clone applied to these machines? [16:15:45] oh you mean elsewhere, like we are doing [16:15:49] you have setup a puppetmaster haven't you? [16:15:52] yes [16:16:00] clone our git repo to that, modify that repo [16:16:01] yes i see, you are saying use our repo as a starting point [16:16:04] yes [16:16:10] yep, that would work better too [16:16:12] yeah could do that, but then I get two puppetmasters trying to apply the same configs [16:16:18] huh? [16:16:26] base.pp does a lot of stuff [16:16:38] probably not much of what you should change [16:16:40] i guess the clone node owuldn't include that [16:16:45] why not? [16:16:50] you can modify base.pp on your own clone [16:17:14] this is no different from the puppetmaster::self stuff done in labs [16:17:20] yep [16:17:36] no i mean [16:17:38] with 2 puppetmasters [16:17:40] and you can keep pushing individual branches/commits to gerrit where we can review them as we go [16:17:43] why 2 puppetmasters? 
[16:17:57] PROBLEM - SSH on srv281 is CRITICAL: Connection refused [16:17:59] a single puppetmaster (yours) running the ops repo in your own branch [16:18:28] so, you setup one puppetmaster in your sandbox environment [16:18:29] not following, puppetd by default contacts production puppetmaster running production branch [16:18:30] you clone our repo [16:18:35] you change some stuff [16:18:37] if I make a clone [16:18:49] your puppetmaster runs off YOUR clone [16:18:49] i still need my own puppetmaster to serve up the .pp configs from that clone [16:18:51] right [16:18:56] but that's mostly our repo [16:18:57] ok, paravoid was asking why 2 puppetmasters [16:19:04] you don't need 2 puppetmasters [16:19:06] you need only one, your own [16:19:18] so you have a clone of our repo on there [16:19:21] you modify stuff as you need [16:19:28] say, for hadoop, or whatever [16:19:30] that can probably wait a bit [16:19:37] until we review it [16:19:51] you can submit that when you're positive it works the way you want it [16:20:01] so you are suggesting that analytics cluster does not contact production puppetmaster at all [16:20:02] but other things, like, you need to modify base.pp for something [16:20:10] ( i don't need to modify base.pp) [16:20:18] i'm trying to make it so that everything I do in puppet is modular [16:20:19] you just said that, so i'll use it as an example [16:20:31] you can make that change in base.pp so it works more generically or whatever [16:20:33] well, i said that, meaning that if both puppetmasters were trying to use the same configs [16:20:34] and immediately push that to gerrit [16:20:38] base.pp would be applied by both of them [16:20:43] so it can already be reviewed and incorporated into production [16:20:47] aye true [16:20:57] RECOVERY - SSH on srv281 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:20:58] no you don't need to contact production puppetmaster, that doesn't really work [16:21:08] it's working right now, because
my repo is not a clone [16:21:12] just make sure you keep pulling/merging from our repo [16:21:18] so you stay up to date with that [16:21:27] pulling/rebasing preferrably :) [16:21:38] yeah [16:21:38] i'm tryign to keep my stuff 100% modular, and by having a separate setup anyway, it forces me to do this [16:21:45] this is silly [16:23:20] why would you want to work in two different environments when they need to get merged anyway [16:23:35] mark, can you tell me why it matters? i don't really like having this repo at git.less.ly, so we can put it in gerrit or github or whatever, but aside from where the repo is hosted, why does it matter if I use a clone and my one single puppetmaster, or if I use 2 different repos and 2 puppetmasters? [16:23:49] i like the way it is now, because I am using the produciton puppetmaster for the regular stuff [16:23:52] because you're going to make one huge diff [16:23:54] how are you going to merge it in the end? [16:24:04] and my analytics puppetmaster for analytics stuff [16:24:05] that is not reviewable [16:24:14] one huge single commit adding all of your stuff? [16:24:14] wouldn't it be the same if it i was in gerrit? [16:24:19] all the changes are making are new files [16:24:23] no, we'd have the whole history [16:24:26] of multiple commits [16:24:37] we'd have a normal branch merge [16:24:43] instead of one collapsed commit [16:24:48] so, you are telling me, that when you review this stuff, you are going to go back and read all of my individual commits? 
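[Editor's note: the clone-based workflow under discussion — clone the ops repo onto your own puppetmaster, keep local work on its own branch, pull/rebase to stay current — can be walked through locally. In this sketch a scratch repository stands in for operations/puppet; names and messages are illustrative.]

```shell
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=ed GIT_AUTHOR_EMAIL=ed@example.org \
       GIT_COMMITTER_NAME=ed GIT_COMMITTER_EMAIL=ed@example.org
# a stand-in for the ops repo; in reality you would clone from Gerrit
git init -q upstream && git -C upstream commit -q --allow-empty -m "production base"
# the analytics puppetmaster's working copy: a clone, local work on a branch
git clone -q upstream analytics-puppet && cd analytics-puppet
git checkout -q -b analytics
git commit -q --allow-empty -m "analytics sandbox config"
# stay current with production: fetch and rebase rather than merge
git fetch -q origin
git rebase -q origin/HEAD
```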
[16:24:58] also, you could choose to incrementally push some stuff in the meantime [16:25:00] we're certainly not gonna read one huge diff [16:25:04] that you feel they're ready [16:25:14] but you are telling me to push to a branch or a clone [16:25:18] where it isn't going to be reviewed anyway [16:25:22] yes it is [16:25:24] so when it goes back into production branch/clone [16:25:48] that review is preferably done over time [16:25:49] you guys are going to review something when i do say "ok, now let's try a slightliy different balance of mappers/reducers so I can benchmark"? [16:25:52] over and over again? [16:26:03] yeah, some parts of that anyway [16:26:03] git rebase --interactive [16:26:09] and squash commits [16:26:11] is your friend [16:26:18] isn't that the same as one giant commit? [16:26:23] squashing? [16:26:31] if you squash everything in one huge commit, yes [16:26:31] not squash everything into one commit [16:26:33] but you shouldn't [16:26:35] i see [16:26:38] well how about this [16:26:43] squash stuff into multiple commits [16:26:45] i would love to host this repo in gerrit [16:26:52] and preferrably, *merge early* some of your stuff [16:26:59] you don't need to host it in gerrit [16:27:00] so, let me give you an example [16:27:03] (we /could/) [16:27:05] but I would like to keep the machiens talking to prod puppet master [16:27:07] you're using modules [16:27:09] so I don't have to deal with messing that up [16:27:10] yes [16:27:20] our puppet infrastructure didn't support modules until last Thursday [16:27:24] how would we merge that? [16:27:32] ? [16:27:39] they are all new files, right? [16:27:54] we didn't support modules *at all* [16:28:20] we could place your files into a module/ directory, but they would never got loaded by our puppetmaster [16:28:21] oh you are saying how would we merge that if you hadn't already done the work to support modules? [16:28:25] yes. 
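[Editor's note: the `git rebase --interactive` squashing advice above normally means editing the todo list by hand; here the edit is scripted through GIT_SEQUENCE_EDITOR only so the example runs unattended. All file and commit names are illustrative.]

```shell
cd "$(mktemp -d)" && git init -q
export GIT_AUTHOR_NAME=ed GIT_AUTHOR_EMAIL=ed@example.org \
       GIT_COMMITTER_NAME=ed GIT_COMMITTER_EMAIL=ed@example.org
# four small work-in-progress commits
for i in 1 2 3 4; do echo "$i" >> notes.txt; git add notes.txt; git commit -q -m "wip $i"; done
# fold commits 2-4 into the first; interactively you would change "pick"
# to "fixup" (or "squash") on those todo-list lines yourself
GIT_SEQUENCE_EDITOR="sed -i '2,\$s/^pick/fixup/'" git rebase -i --root
git log --oneline   # a single commit remains, content of all four kept
```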
[16:28:28] right, then we would work on that when it happens [16:28:30] to support modules [16:28:53] so rather than working with what we have, you prefer to work on something completely on your own and then expect the existing setup to adjust itself to support it? [16:29:09] do you see the problem here? :) [16:29:10] or adjust it myself, (with ops' help/review) [16:29:11] yes [16:29:27] this is just an example; modules was fairly easy to add support for [16:29:35] but other things may be like that [16:29:37] yes and no, i understand the desire for history, that's cool [16:29:46] it's not just history [16:29:49] you use none of the patterns that exist in the tree, like role classes [16:29:53] this is just totally not distributed version control [16:30:07] and you're going to realize that months from now, when you'll request everything to be merged into a single commit [16:30:08] this is totally doing your own thing [16:30:22] ok ok , so there are 2 ways that I think this can be done that will make me/us happy [16:30:23] so [16:30:26] "release early, release often" :) [16:30:30] 1., the way I am doing now (that's the me) [16:30:44] 2. use a clone, but not host my own puppetmaster [16:30:48] i'm not sure how to make 2 happen [16:30:54] why can't you run a clone on your own puppetmaster? [16:30:59] that's how labs projects work too [16:31:05] yeah with self, hmmm [16:31:07] hang on [16:31:15] if you find this too hard to manage, perhaps you should be doing it in labs [16:31:21] instead of on real iron [16:31:26] hehe, gimme an equivalent labs cluster and that will be fine [16:31:28] we are benchmarking [16:31:37] you can benchmark after you have this stuff dealt with [16:31:58] don't be mean mark! 
I'm not finding this too hard to manage, we are working together here to find the best solution [16:32:19] the best solution seems rather obvious to me, but ok [16:32:30] RECOVERY - Apache HTTP on srv281 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [16:32:32] right, and that is not a healthy way to convince me [16:32:38] you guys have the final word, sure [16:32:49] but let's try to make it feel like we are working togehter positively, ok? [16:33:04] so what's the problem with running a clone of our repo on your own puppetmaster? [16:33:06] imho, it seems to me that you've tried doing this a bit of your own way and we're already seeing clashes -- before even coming to the merging part [16:33:09] thinking, one sec [16:33:17] I propose that we do the clone thing [16:33:27] (I, too, propose) [16:33:32] and since I worked on puppetmaster::self [16:33:37] I can help you in doing that if you wish [16:33:50] so I do a clone, host my own puppetmaster, occasionally pull from production, commit my stuff to my clone, etc. [16:33:57] yes [16:33:58] but all my machines need to work with my puppetmaster [16:34:05] Can someone run rm -rf /home/wikipedia/common/php-1.20wmf4 as root for me please? Various git objects get permission denied errors [16:34:07] isn't that the case already? [16:34:15] no he does both [16:34:18] Reedy: where? [16:34:21] they work off our puppetmaster, and his own [16:34:24] yeah [16:34:27] on fenari will do it [16:34:33] and he tries to make sure they don't collide [16:34:49] which thus far has worked fine, but you guys are saying that later it will clash [16:34:55] when I want to merge to produciton puppet [16:34:56] but your other machines talk with your own puppetmaster as well, don't they? [16:35:05] yes [16:35:18] yeah, but they all work with prod puppet without any modifications to regular puppet [16:35:27] so, all of your machines "work with your puppetmaster", don't they? 
[16:35:34] there is an /etc/puppet.analytics [16:35:35] yeah [16:35:47] but that analytics puppetd is not running [16:35:48] you don't need to do anything other than just make them not work with the prod puppetmaster [16:35:49] i run it manually [16:35:52] which is fine here too [16:36:07] i need to change puppet.conf in the clone [16:36:23] and remember not to merge that back to prod, but aside from that it'll work just fine yeah [16:36:46] Reedy: done. [16:36:47] puppetmaster::self, if it doesn't already, can be made to work for this use case [16:36:51] Thanks [16:37:03] it doesn't, because it assumes it runs on labs [16:37:11] there is a slight problem with our plan [16:37:16] which is the private repo [16:37:30] ah right [16:37:31] hm [16:37:31] yeah that probably needs to be the labs private repo [16:37:32] the labs-private could be used [16:37:37] right [16:37:41] this is now a separate realm [16:37:47] it's not quite labs [16:37:51] OR [16:37:51] what is? analytics? [16:37:54] what about this [16:37:55] it could be put in labs [16:37:59] right, it's a bit bastardized [16:38:04] can I keep running with 2 puppetmasters and still use a clone? [16:38:11] my clone could just make sure that the analytics nodes [16:38:16] don't include anything from the production repo [16:38:17] although when I first heard about that, my immediate reaction was "let's expand labs to iron too" [16:38:31] since this will surely come up in the future again [16:38:37] certainly will [16:38:39] in fact, it came up today! [16:38:42] with ben's swift stuff [16:39:32] i think that would work and would be minimal work [16:39:44] and effectively the same [16:39:47] how would that help?
[16:40:13] main puppetd would contact production puppetmaster, just as it is now [16:40:14] no change [16:40:19] so no issues with private repo [16:40:36] you can just use the labs private repo [16:40:43] analytics puppetmaster would run off of a clone of production puppet repo [16:40:45] running off two puppetmasters is asking for trouble imho [16:40:50] but in site.pp in the clone [16:40:56] i would change it so analytics nodes do not include anything [16:41:04] except my stuff [16:41:05] could work if you're careful and lucky but it'll always be a headache [16:41:16] it is working great right now [16:41:40] it was mainly just configuring them to work out of different directories [16:42:06] mark, to use the labs private repo, what do I have to do? change puppet.conf so that it points elsewhere? [16:42:24] we need to make sure puppetmaster::self clones from the labs private repo [16:42:26] or sorry [16:42:28] fileserver.conf [16:42:29] ? [16:42:47] mark: it does [16:42:47] ok hmm [16:42:48] ha [16:42:53] yeah it probably does [16:43:00] I'm sure it does, I wrote it :) [16:43:02] we need to make sure puppetmaster::self works outside of labs, basically [16:43:07] that I'm not sure of [16:43:13] paravoid can help with that [16:43:16] hahaha [16:43:22] yes, I can [16:43:30] ok ha [16:43:35] then there's just one puppetmaster, and I don't see why that wouldn't work [16:44:03] how do I get ::self installed if I am running my own puppetmaster? [16:44:20] wipe that box and start over [16:44:47] (i mean, you don't have to, but you made your life hard by not doing it this way ;) [16:46:06] we could look at making these machines fall under realm labs (at least for now) [16:46:37] is there a problem with my latest suggestion?
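[The clone-side change described above -- editing site.pp in the cloned repo so that analytics nodes include nothing from production -- could be sketched roughly as below. This is a hypothetical fragment: the node pattern and role class name are illustrative, not the actual manifest.]

```puppet
# site.pp in the analytics clone (sketch only; node regex and class
# names are hypothetical, not the real operations/puppet contents).
# Analytics nodes include only analytics classes, nothing from the
# production roles, so the two puppetmasters cannot collide.
node /^analytics\d+\.eqiad\.wmnet$/ {
    include role::analytics
}
```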
[16:46:39] but private data is a problem here [16:46:56] 18:40:45 running off two puppetmasters is asking for trouble imho [16:47:01] 18:41:05 could work if you're careful and lucky but it'll always be a headache [16:47:05] right, but it is working [16:47:10] and if it is a headache [16:47:12] then it is my headache [16:47:34] so, we did this for a reason, you know? [16:47:34] fine, do as you want, have your headache [16:47:39] we're experimental atm. [16:47:48] obv we're not going to run it this way in production [16:47:56] dschoon: we're suggesting to do exactly what we're doing in labs [16:48:10] but it's been really helpful to be able to quickly test out puppet changes without pushing to gerrit and waiting for review [16:48:22] please read the discussion before you chime in [16:48:26] we're way past that [16:48:33] i did, but i signed on recently [16:48:37] so i might have missed things [16:48:40] ok [16:48:56] we're saying, "totally run your own puppetmaster, but do it off our (cloned) git repo" [16:48:59] it's not in my scrollback, so :) [16:49:00] dschoon: we're basically asking to do exactly what we do with labs and local puppetmasters, which means no waiting for review [16:49:10] i'll shoosh then :) [16:49:21] i have nothing else to contribute to puppet politics [16:49:24] dschoon: that will help for when the time for pushing into prod (and hence review) comes [16:49:25] hehe [16:49:40] ok, paravoid, mark, I will create a clone (in gerrit?) of operations/puppet repo [16:49:42] ideally these reviews don't come all at once when you're totally done [16:49:56] when you start working off random software projects do you do "git checkout -b foo" or "rm -rf .git; git init"? [16:50:04] of course not. we +2 them quietly while you're all asleep.
[16:50:06] that's basically the discussion :) [16:50:19] I will use that to run my secondary puppetmaster [16:50:41] and I will modify site.pp so that analytics nodes do not include anything but my stuff [16:50:49] if you and paravoid get the labs on iron stuff working [16:50:50] when the time for pushing comes, we need to merge a branch, not have a separate commit of "Commit the past 6 months of analytics work" [16:50:55] we can switch to that, or do that in the future [16:51:24] ottomata: not in gerrit, just git clone on your puppetmaster [16:51:39] aaahhhhhh but I don't want to edit there [16:51:48] then edit elsewhere [16:51:56] you don't have to edit there, you can edit/commit locally and push it to your puppetmaster [16:51:57] i need to commit? [16:51:58] oh [16:51:59] this is not svn :) [16:52:01] commit to branch [16:52:04] clone and checkout branch [16:52:10] you need to change your mindset to distributed version control [16:52:21] you sound like you think very much in svn/cvs ways [16:52:23] probably so [16:52:26] yes, "git clone gerrit:operations/puppet; git checkout -b analytics" [16:52:36] i've only used git in a centralized setup (gerrit, etc) [16:52:39] ok cool [16:52:41] that makes more sense [16:52:52] also note that we haven't discussed this before with mark at all [16:52:57] you can push/pull changes between any arbitrary git repos [16:53:12] no need to have gerrit in the way of things ;) [16:53:24] gerrit is just our gatekeeper of what will be our official/"production" stuff [16:53:37] so we're two people coming to the exact same conclusion separately; it doesn't mean it's right, but it certainly has a bit more weight this way methinks :) [16:53:48] that's also why for scribe, I don't care about what happens in github, as long as what runs in production on our cluster, gets reviewed via gerrit [16:53:48] right, yeah, [16:54:00] indeed [16:55:07] also, all of this isn't to blame you for anything; things have been forming as we go (and on the
management level too with you joining ops etc.) [16:55:26] but we should align your work with the ops work better as we go forward I think [16:55:29] yeah I think me being in ops meetings will be really really helpful [16:55:40] this is a perfect example of why we need to work together along the way [16:55:40] and I can definitely help you there if you need it [16:55:48] you've only just started and already you're doing things differently than we would do [16:55:53] imagine how it would be in a few months from now [16:56:03] if you come by with one huge diff out of a completely separate git repo [16:56:29] hey, i was redirected here from #wikimedia-tech... i notice a suspiciously high amount of accesses to the undefined pages of all wikimedia projects for the last at least 2-3 weeks... just as an example, today 15:00-16:00 utc there were 44716 accesses to http://en.wikipedia.org/wiki/undefined (that's more than 1/10th of the Main_Page with 354000 views!) [16:56:43] anyone here with the permissions to investigate? [16:57:00] btw, I think I'm about as new as you are to the foundation ottomata :) [16:57:10] jorn: interesting... can you file a bugzilla ticket for that? [16:57:17] it's not some old-timers attitude or anything :) [16:58:37] mark: doesn't seem so [16:58:39] :D [16:58:45] my username and pw are invalid [17:01:39] i can't really help there i'm afraid, i'm not a bugzilla admin I believe [17:02:30] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused [17:03:24] just thought that would be a funny reply, seems firefox somehow associated the wrong username / pw [17:03:27] !log rebooting potassium, shouldn't be in use, had tons of cruft ssh connections from a month ago [17:03:35] Logged the message, RobH [17:03:40] New patchset: Cmjohnson; "Replacing linux-host-entries.ttyS1-115200 with corrected version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16423 [17:03:40] category wikimedia / general is ok?
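[The workflow agreed on above -- "git clone gerrit:operations/puppet; git checkout -b analytics", then push/pull between arbitrary git repos without gerrit in the middle -- can be sketched as below. The branch name comes from the conversation; the real remote would be gerrit:operations/puppet, so a local bare repo stands in for it here to keep the sketch self-contained, and all /tmp paths are hypothetical.]

```shell
# Sketch of the clone-and-branch workflow: work on a topic branch of a
# clone instead of "rm -rf .git; git init".
set -e
rm -rf /tmp/operations-puppet.git /tmp/puppet-clone

# Stand-in for the gerrit remote (hypothetical path).
git init --bare /tmp/operations-puppet.git

# Clone it and do all work on an "analytics" topic branch.
git clone /tmp/operations-puppet.git /tmp/puppet-clone
cd /tmp/puppet-clone
git config user.name "Analytics Sketch"
git config user.email "analytics@example.org"
git checkout -b analytics

# Edit and commit locally, then push the branch to the puppetmaster's
# repo; gerrit only gets involved once the branch is merged toward
# production, as a branch merge rather than one huge catch-up commit.
echo "# local analytics change" >> site.pp
git add site.pp
git commit -m "analytics: example local change"
git push origin analytics
```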
[17:04:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16423 [17:06:15] PROBLEM - SSH on potassium is CRITICAL: Connection refused [17:06:19] aha [17:06:21] BBQ is ready [17:08:35] New review: RobH; "everything back like it should, much better" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/16423 [17:08:36] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16423 [17:14:23] bug report is here: https://bugzilla.wikimedia.org/show_bug.cgi?id=38604 gtg, bus [17:22:48] New patchset: Pyoungmeister; "apache overhaul: round one of responding to mark's comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16426 [17:23:24] New patchset: Pyoungmeister; "Initial comments to app server manifests work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16122 [17:23:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16426 [17:23:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16122 [17:38:46] Ryan_Lane: Looks like nova is already using keystone on nova-precise1. Can you suggest some next steps I should pursue? (I presume the interesting bit is getting labsconsole to pull account & project information out of keystone) [17:41:12] Reedy: I reformatted srv281 only to get a system with a disk full again, yaaay [17:41:32] our partitioning scheme is just crazy [17:41:43] andrewbogott: I don't see Ryan online [17:41:58] paravoid: Good point :) [17:42:09] Speaking of seeing people online… anyone seen Asher lately? [18:07:05] RobH: ping [18:07:11] ? [18:07:54] jeremyb: Aye sah [18:07:57] hey [18:07:58] RobH: do we have any high performance miscellaneous servers available?
[18:08:19] RobH: I'm asking in regards to the WLM project [18:08:32] there are none in tampa, we are ordering more [18:08:36] i think i have some in ashburn, checking [18:09:31] preilly: i have a couple in ashburn [18:09:41] is there a procurement ticket for this? [18:09:55] RobH: yes [18:10:02] RobH: let me look it up [18:10:22] jeremyb: I believe the docs I have found so far indicate that the debian source package is just the four files I've created through debuild, so those are all up now [18:10:45] New patchset: Bhartshorne; "adding some of the eqiad swift production hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16431 [18:11:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16431 [18:11:26] marktraceur: that looks much better [18:12:16] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16431 [18:12:38] RobH: https://rt.wikimedia.org/Ticket/Display.html?id=3221 [18:13:38] this isn't procurement, this is a request [18:13:49] and it's not really approved yet, so ct needs to comment and say to do it [18:14:01] asher raises the concerns in his last posting. so i'm not against giving you servers [18:14:09] i just need a more mgmt level approval [18:14:12] paravoid: care to help marktraceur some with packaging? /me has been busy with wikimania and now some with other work. and you're much more experienced of course ;).
I can try to help some in ~1 week but if you want to before then that's great too [18:14:31] RobH: got it [18:14:42] https://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/etherpad-lite/ is old but maybe useful [18:14:51] http://marktraceur.info/shared/packages/etherpad-lite/ is newer [18:14:53] but i have a couple of high performance and standard misc servers [18:14:55] in ashburn [18:14:58] fyi [18:15:39] https://github.com/MarkTraceur/etherpad-lite is the repo [18:16:09] RobH: okay I talked to Tomasz he is going to circle around with CT [18:16:15] cool [18:16:16] RobH: thanks for looking into it [18:16:23] quite welcome [18:16:30] * jeremyb always forgets, binasher is pacific? [18:16:50] he lives in SF, yeah [18:16:51] i guess maybe he's not online today [18:17:21] ryan_lane: can you tell me where you're at regarding OSM and keystone? Partially underway, or not at all underway? (Looks like you switched Nova to use keystone on nova-precise1 already...) [18:17:30] I've not seen an out of office email or anything.. [18:17:48] andrewbogott: yeah, but I didn't import the LDAP data, yet [18:18:06] and the LDAP data needs to be changed to work with keystone's schema [18:18:35] for OSM, I've added some basic REST support to the controller class [18:18:38] OK. And, the git diff in the OSM code on that system is pretty big… that's something unrelated that you're working on? [18:18:51] lemme see [18:19:41] Reedy: well he doesn't seem to have been in this channel in the last 24 hrs [18:20:04] Because yesterday was a weekend [18:20:16] paravoid: apache stuff should be on /a now. If not, we can beat notpeter ;) [18:21:02] Reedy: sure, not complaining, just want to poke him. if i knew he wasn't around today then I would delay poke for a day ;) [18:21:57] andrewbogott: oh. the diff is large because I'm subclassing the controller [18:22:03] so that we can support ec2 and nova [18:22:05] while we switch [18:22:15] ok. 
[18:22:19] it's going to be way larger too [18:22:25] because I need to fix all the stupid model classes [18:22:47] So, any specific subtasks you'd like me to look at? [18:22:50] I should have used JSON for the objects, and I used EC2's data class [18:23:03] (I can also just get out of the way if you have it well in hand.) [18:24:26] andrewbogott: well, I could use help with keystone [18:24:43] you mean, the account migration? [18:24:44] if you can import the current test LDAP data, then massage it for keystone, that would be good [18:24:59] it should be relatively straightforward [18:25:48] How hard was it to set up nova-precise1? Is it mostly puppetized? [18:31:11] andrewbogott: should be fully puppetized [18:31:30] ok, I'll make myself a separate instance so I don't step on your toes [18:34:25] New patchset: Alex Monk; "(bug 19569) Add Portal namespace to urwiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16432 [18:35:17] andrewbogott: no, it's fine to work on that one [18:35:51] lemme import the data really quick [18:37:03] Ryan_Lane: bump on ircecho reviews (but not in a rush either). some might need to pull in more recent changes that have been merged so I'll double check them before merging [18:38:42] * Ryan_Lane nods [18:39:10] jeremyb: sorry about the wait on that. I've been fighting labs stabilization issues for weeks [18:41:32] Ryan_Lane: i think i noticed a small bit of that [18:41:54] * jeremyb suddenly wonders if the hw migrations by the more crappy route ever finished [18:42:15] Ryan_Lane: http://docs.openstack.org/essex/openstack-compute/admin/content/migrating-from-nova-auth.html <- if this is not relevant to what we need, can you explain?
[18:43:15] andrewbogott: I don't think this will deal with ldap at all [18:43:36] andrewbogott: this is probably the best resource available for this: http://adam.younglogic.com/2012/02/openstack-keystone-ldap-redux/ [18:43:47] it at least describes the schema [18:44:29] So the ldap system we use is yet a third auth scheme, neither keystone nor 'nova auth'? [18:45:02] (Hm, I've seen that page but thought it was about the transition between two different keystone versions.) [18:45:10] LDAP is the backend for keystone [18:45:19] like it's currently the backend for nova [18:48:12] Ryan_Lane: Sorry if I'm being dim. That page that I linked purports to be about migrating from nova-auth to keystone. It's in reference to a system that doesn't use ldap as a back-end, but just relies on the nova/keystone sql dbs? [18:48:21] yes [18:48:22] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [18:48:35] it's migrating the auth info from the nova db to the keystone db [18:48:41] both using a sql backend [18:49:00] we use an LDAP backend for nova [18:49:07] using the nova schema [18:49:25] Yep, ok. [18:52:06] hm [18:52:15] most keystone implementations I've seen use a passwrod [18:52:18] *password [18:52:28] I'm not a huge fan of that [18:52:47] I'd much prefer to use a token, like we currently do [18:54:59] I'm not sure that the ldap backend can pull a token, though [18:56:05] woosters: i've been following https://rt.wikimedia.org/Ticket/Display.html?id=3221 and we're at the point where we'd like to get one of the available misc servers [18:56:42] tfinc: high performance misc server from Ashburn ^^ [18:57:25] !log took ms-be10 out of rotation because it ate itself [18:57:33] Logged the message, Master [18:58:10] preilly: yup [18:58:27] tfinc - will discuss it with asher and get back to u [18:59:24] woosters: will he be in today?
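[The LDAP-backed keystone setup discussed above is driven by the [ldap] section of keystone.conf; the younglogic post linked in the conversation describes the schema side. A minimal sketch of that section is below -- every URL and DN here is a placeholder, not the actual Wikimedia/labs values, and option names are those of the Essex-era LDAP identity backend.]

```ini
[ldap]
# Sketch only: all values are hypothetical placeholders.
url = ldap://localhost
user = cn=Manager,dc=example,dc=org
password = secret
suffix = dc=example,dc=org
user_tree_dn = ou=Users,dc=example,dc=org
tenant_tree_dn = ou=Groups,dc=example,dc=org
role_tree_dn = ou=Roles,dc=example,dc=org
```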
[18:59:45] he should be [19:03:13] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [19:05:05] andrewbogott: I have a really good feeling I'm going to be writing some LDAP code for keystone :( [19:05:18] * Ryan_Lane hangs himself [19:05:28] Worried that no one else is using it atm? [19:05:46] I'm pretty sure there's no support for token auth [19:05:56] and I *really* don't want to use password auth [19:06:36] I guess it's possible to use password auth if I set the expiration time of the keystone token to be the same as the expiration time of the labsconsole cookie [19:07:18] it's also possible that we write a token backend for mediawiki [19:16:30] andrewbogott: I guess we'll do simple auth and set the expiration time for the keystone token to be as long (or longer) than mediawiki's long-lived tokens [19:17:46] * andrewbogott nods [19:19:48] New patchset: Faidon; "nrpe: add a missing requires" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16438 [19:20:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16438 [19:20:45] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16438 [19:27:54] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:28:30] !log powering down old transcode1, was camera gateway sandbox, reclaiming name [19:28:38] Logged the message, RobH [19:31:15] !log authdns-update for transcode name updates [19:31:23] Logged the message, RobH [19:45:06] New patchset: Pyoungmeister; "apache overhaul: round two of responding to mark's comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16440 [19:45:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16440 [19:47:36] New patchset: Pyoungmeister; "significantly increase timeout on mw-sync." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/16441 [19:48:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16441 [19:48:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16441 [19:53:34] paravoid: an additional note on the ruby apps for code review. It's almost definite I'd end up needing to write the LDAP code for any replacement. [19:54:18] I don't mind ruby, but I really don't feel like learning it, as I don't really plan on using it much in the future [19:54:37] and I'm *really* tired of writing LDAP code for every single application we use [19:55:01] * Ryan_Lane doesn't understand how applications get this so wrong [19:56:17] heya [19:56:27] that's really the case for every app out there I think [19:56:35] haha [19:56:38] whatever the language [19:56:44] most people get ldap wrong [19:56:56] I know :( [19:57:02] incl. gerrit, although at least in gerrit it works for basic stuff [19:57:07] I've rewritten implementations in almost 20 apps [19:57:19] but the fact that I can't e.g. change my cn is very annoying [19:57:22] I haven't needed to change gerrit at all :D [19:57:29] or the ssh key stuff [19:57:36] ssh key functionality is missing [19:57:41] that's the only thing I'd need to add [19:57:48] it shouldn't let you change your CN [19:58:01] I asked you if I can change my cn in ldap [19:58:05] right [19:58:07] and you told me that gerrit would break :) [19:58:11] gerrit doesn't support user renames [19:58:13] that counts as broken to me [19:58:24] yes. that's broken [19:58:45] both those things are accepted bugs, thankfully, though [19:58:55] so they'll get fixed in some future version [19:59:00] I think user renames will be coming soon [19:59:04] ^demon: ^^ ?
[19:59:07] you didn't actually reply to my question though [19:59:18] which language would be actually ok for the majority of the team [19:59:18] <^demon> Um, scrollback, one sec. [19:59:30] actually, you even added a language/framework :-) [19:59:47] php, or python, mostly [19:59:52] I think Python has the best chance [19:59:58] likely [20:00:04] php? really? I've heard numerous ops people bash php [20:00:05] <^demon> Gerrit supports renames. And LDAP supports renames. Gerrit just doesn't support renames when using ldap. [20:00:08] php is good because the devs would more actively maintain it [20:00:17] ^demon: yep [20:00:32] paravoid: from a maintenance POV, PHP is really easy [20:00:51] so is python [20:00:52] we were talking about what ops hate or love though :) [20:00:58] * jeremyb waves binasher [20:01:10] there's a couple reasons for hating ruby [20:01:17] also, the code quality of most php applications is really reeeally bad [20:01:18] 1. it's really difficult to maintain [20:01:26] binasher: just wondering about OTRS myisam -> innodb. is that going to magically happen or should I make a ticket or? [20:01:27] 2. people seem to code horribly in ruby [20:01:28] oh I know, I have lots to say about ruby [20:01:36] I work in a distro, remember? [20:01:40] yeah :) [20:01:46] jeremyb: making a ticket would be a good reminder :) [20:01:52] for such an elegant language, it's hard to believe people suck so much in ruby [20:02:08] binasher: should you be cc'd on the initial mail? [20:02:09] I think it's due to the fact that ruby has frameworks that hide most of the hard work from you [20:02:14] binasher: ops-reqs?
[20:02:17] it's like the Active Directory of programming languages [20:02:38] it's not like Java is universally loved in the ops world though [20:02:55] you're one of the very few ops person that I know that doesn't go "ewwwww" when he hears Java actually :) [20:02:57] Java applications are generally easier to maintain than ruby ones [20:03:14] paravoid: jeremyb and I tried to ping you a while back, in case you didn't see it; has to do with some packaging review [20:03:15] depends, have you ever deployed/debugged JBoss apps? [20:03:16] drop a war file into a container service. or launch it directly [20:03:20] paravoid: please rank pentabarf [20:03:22] ;) [20:03:24] paravoid: I maintained jboss apps for years [20:03:26] it's easy [20:03:34] you drop the war file in, and it runs, usually [20:03:49] after you untar the 5gb or so in /opt [20:04:06] depends on how you have jboss configured [20:04:11] it can run the war directly [20:04:20] and you have hundreds of megabytes of logfiles in some random /opt location without logrotates or anything [20:04:31] o.O [20:04:42] and after taking about 5' to start after a restart (jboss as7 actually is much better at that) [20:04:45] none of this sounds accurate to me :) [20:04:49] (but also not very popular yet) [20:05:06] at my last job I managed 20+ jboss apps [20:05:27] in general it was just dropping in a config file and the war file [20:05:44] all the logs go into sane locations, and you use the system's logrotate to manage the log files [20:05:51] or you make jboss send it out to syslog [20:06:04] what do you mean sane? 
it's usually /opt/jboss/server/default/logs [20:06:06] or something like that [20:06:17] it's certainly not /var/log, because that's out of the container [20:06:23] the app can (and should) change that [20:07:04] usually they can't, because of tomcat's security policies [20:07:27] you can either use syslog, or modify the security policy [20:07:29] unless you configure that, but now we've crossed by quite a lot the realm of "easy to configure" :) [20:07:43] jeremyb: yes, ops-req [20:07:50] binasher: k [20:08:00] <^demon> Hypothetical "hard to configure" java apps aren't really relevant here. [20:08:06] also, have you ever tried packaging jboss? or jars? [20:08:07] <^demon> Fact is, gerrit is simple to deploy/configure. [20:08:09] indeed. gerrit is simple to configure [20:08:14] paravoid: yes. on red hat [20:08:27] opendj is easy to configure/deploy too [20:08:35] lucene also isn't very hard [20:08:45] <^demon> I've packaged gerrit 3-4 times today already ;-) [20:08:47] for opendj at least, you still have all the embedded by upstream jars [20:08:50] all of the java apps we use are pretty simple in that regard [20:09:04] paravoid: which is why I don't package it for upstream ;) [20:09:11] it's only 2-3 libraries, though [20:09:20] it probably wouldn't be amazingly hard to decouple [20:09:38] that means that we aren't really getting security updates [20:10:07] well, opendj themselves should be handling that [20:10:10] but yes, that's correct [20:10:18] we need to upgrade opendj, thinking of that [20:10:22] I have the new version packaged [20:10:28] I should do that at some point soon [20:10:36] I've tried packaging shibboleth in the past [20:10:42] it was pure hell [20:10:52] new version has an easier way to remove replication agreements [20:12:03] also, I hate how java apps tend to distribute binaries rather than source [20:12:13] yep. 
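[The approach Ryan_Lane describes above -- letting the system's logrotate manage the JBoss log files instead of relying on the container -- is a stock logrotate fragment. The path below is the hypothetical /opt location mentioned in the conversation, not a real deployment path.]

```
# /etc/logrotate.d/jboss (sketch; path is the hypothetical one from
# the discussion)
/opt/jboss/server/default/logs/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    # JBoss keeps its log files open, so truncate in place instead of
    # moving the file out from under it.
    copytruncate
}
```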
that's normal [20:12:18] I hate it :) [20:12:38] gerrit, opendj and lucene all distribute source [20:12:50] I build opendj from source to make the package [20:12:57] oh really? that's cool [20:12:58] I don't with gerrit, yet [20:12:59] it's unusual [20:13:14] the gerrit package was a rush job [20:13:28] <^demon> We could feasibly build our own gerrit *.war, it's not hard [20:13:38] ^demon: it would be ideal if we did [20:13:46] then we could patch if needed [20:13:59] I didn't say it is hard, we were talking about cultures :) [20:14:08] yeah [20:14:16] I prefer java's to ruby's [20:14:22] and I was saying that every culture has its weirdness [20:14:32] yep [20:14:46] ruby's is just too annoying to deal with [20:14:50] I've worked with Java, it was always less pleasant than e.g. Python, but I would never say no to a useful app just because it's in Java [20:14:54] binasher: im trying to wrap my head around how to set up a puppet manifest for the WLM API host. 1) is there a manifest for anything remotely similar to what we'll need to do that i can look at? 2) is manifests/misc the correct place to put our manifest? [20:15:09] * jeremyb has heard ruby has a lot of bits (e.g. mailing lists) where you must know japanese [20:15:10] eventually they'll start being stable, but until they do, I don't like dealing with them [20:15:23] jeremyb: their primary communication language is japanese [20:15:36] I think they've fixed that by now [20:15:36] Ryan_Lane: right [20:15:39] I don't necessarily have an issue with that [20:15:46] but still, the whole gem thing is the problem [20:15:49] yes [20:15:57] and the embedding of whatever version of rails in each and every project [20:16:01] <^demon> gems are just as bad as pear. [20:16:03] paravoid: they haven't fixed the stability issues [20:16:04] and every other dependency of course [20:16:10] <^demon> Same reason I'd loathe any suggestion relying on them. 
[20:16:12] because they have no api stability [20:16:14] they still break API compatibility in point releases [20:16:20] which is *bullshit* [20:16:21] I know [20:16:24] * jeremyb still wants paravoid to rank pentabarf... :P [20:16:35] never worked with penta, heard the worst about it [20:16:46] oh i thought you did [20:16:49] awjr: not sure, notpeter and Ryan_Lane and others have been more on top of app server puppetizing, and paravoid could be a good resource for general "puppet in labs" questions [20:17:07] thanks binasher [20:17:22] ok. need food [20:17:23] * Ryan_Lane waves [20:17:33] binasher: you should have mail. RT never seems to give me autoreplies with ticket #s so I can't give it to you ;-( (i assume that's by design) [20:20:08] yep, got it! [20:24:58] binasher: Did my email about ceph testing make sense? And/or did I maybe not actually send it to you? [20:25:55] andrewbogott: i'll reply later today [20:26:02] ok [20:26:37] you're looking at using it in the mounted block device way aka EBS, not as an object store, right? [20:27:00] notpeter, Ryan_Lane, paravoid: i need to put together a puppet manifest for for misc app server which we're setting up in labs and will hopefully be transitioning to a dedicated host for production. it will be the basic lamp stack for hosting an API for the wiki loves monuments mobile app. i'm not really familiar with puppet (other than looking through existing confs) - can any of you provide some guidance on how best to get started/where i [20:27:30] binasher: shared filesystem. Which I believe runs on top of EBS [20:43:31] binasher: thanks a ton to anyone involved in making Gerrit sooo fast. 
I end up wondering if my action has been correctly handled because the screen just appears :-] [20:43:54] ;) [20:44:24] Logged the message, Master [20:46:53] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100% [21:14:30] New patchset: awjrichards; "Disable mobile redirect for donate.wikimedia.org" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/16449 [21:28:34] awjr: the gerrit link in your rt ticket doesn't work for me. can you add code reviewers in gerrit? [21:30:51] binasher sure. i've been having problems posting to RT. it will occasionally time out, and then when i go back to try posting again, my text is reformatted with escaped html entities, which looks like what happened [21:31:58] New review: Hashar; "Following a discussion with Faidon and Mark on 23rd, we will probably move that Debian package to pu..." [operations/debs/wikimedia-job-runner] (master) C: 0; - https://gerrit.wikimedia.org/r/11610 [21:35:06] Change merged: Asher; [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/16449 [21:36:47] Is there an RT ticket for the timedmediahandler etc transcode hardware? [21:37:47] reedy - https://rt.wikimedia.org/Ticket/Display.html?id=3298 [21:38:09] Thanks [21:44:07] New patchset: Asher; "updated as of https://gerrit.wikimedia.org/r/16449" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16453 [21:44:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16453 [21:46:17] awjr: how about changing the redirector to read its regex from a config file instead of being hardcoded? it'd be nice if we didn't have to compile and push out binaries for further changes [21:46:34] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16453 [21:46:44] binasher that is a good idea [21:49:31] awjr: and maybe a puppet manifest for managing the config file [21:50:08] ok.
i'll get it to make us all coffee too [21:50:36] uh [21:51:06] binasher you don't like coffee? [21:51:50] seems like it'd be cold by the time it got to sf, so make sure you use the coldbrew class [21:52:01] heh [21:53:44] awjr: but seriously the config file should be managed by puppet in this particular case [21:54:15] preilly yeah, sounds good [22:01:23] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [22:04:23] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [22:07:20] !log deploying new mobile redirector to eqiad squids [22:07:28] Logged the message, Master [22:51:34] New patchset: Alex Monk; "Remove weird unset statement from flaggedrevs.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16457 [22:59:18] New patchset: Ryan Lane; "Enforcing a regex for usernames in nslcd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16458 [23:00:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16458 [23:00:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16458 [23:15:22] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 210 seconds [23:29:28] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds
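[binasher's suggestion above -- have the mobile redirector read its regex from a config file, with that file managed by puppet -- could be sketched as a file resource like the one below. The file name, source path, and service name are all hypothetical illustrations, not the actual manifest.]

```puppet
# Hypothetical manifest fragment: ship the redirector's regex config
# via puppet so pattern changes don't require rebuilding and pushing
# the redirector binary.
file { '/etc/squid/redirector-patterns.conf':
    owner  => 'root',
    group  => 'root',
    mode   => '0444',
    source => 'puppet:///files/squid/redirector-patterns.conf',
    # restart squid so its redirector children re-read the patterns
    notify => Service['squid'],
}
```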