[00:00:29] strchr would be enough
[00:01:40] are you sure you own that pointer?
[00:02:42] pretty sure it would be a bad idea to modify that pointer
[00:02:53] that said, he isn't. I just had that discussion with him
[00:04:57] it's a pity C has no efficient substring operation
[00:05:20] you know D has slices, which are a substring reference that can be used like a string
[00:09:52] TimStarling, I think on a gnu/linux system strtok is threadsafe
[00:10:11] it'd have __thread storage
[00:10:14] yes, probably
[00:10:52] and yes, it should be better documented
[00:11:10] well, I said "according to the linux manual"
[00:11:23] I know
[00:11:52] the man pages point what the spec says
[00:12:03] from ISO C or POSIX...
[00:12:07] the glibc manual doesn't say much either
[00:12:13] which have no notion of thread-local
[00:13:15] well, to be fair the glibc manual says it's not reentrant, and gives a link to a discussion of signal handlers
[00:13:23] it doesn't say anything about threads
[00:13:52] oh, right, C std doesn't know about threads either :)
[00:14:34] yes, it may still not be safe for a signal-handler
[00:15:36] New patchset: preilly; "use strtok_r instead of strtok for thread safety reasons" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16095
[00:15:50] Ryan_Lane: ^
[00:15:55] TimStarling, Platonides, Ryan_Lane, binasher ^^
[00:16:05] yeah
[00:16:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16095
[00:16:24] Ryan_Lane said you weren't going to modify the pointer
[00:16:40] seems I had this exact same code in another spot
[00:16:47] using strtok_r, though
[00:16:50] preilly, why are you repeating VRT_GetHdr() ?
[00:16:51] easy enough to switch to
[00:16:54] but you're calling strtok_r() which modifies the pointer
[00:17:02] surely it is much more expensive than keeping the first pointer?
[00:17:17] Change abandoned: preilly; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16095
[00:18:13] you should use that awesome C99 variable-size automatic array feature
[00:18:41] what language is that, by the way?
[00:19:03] int len = strcspn(header, ",");
[00:19:04] looks like a mixture
[00:19:23] char ip[len+1];
[00:19:42] memcpy(ip, header, len);
[00:19:51] ip[len] = '\0';
[00:20:08] char* ip = strchrnul(header, ',');
[00:20:10] you know I discovered it when one of the analytics people used it by mistake
[00:20:15] *ip = '\0';
[00:20:26] I saw it and thought "why the hell does that compile?"
[00:20:39] it has been in gnu for a long time, too
[00:20:46] I'm getting old
[00:20:57] TimStarling: the first call modifies the pointer?
[00:21:01] don't understand this new-fangled C99
[00:21:04] or only subsequent calls?
[00:21:29] Ryan_Lane: strtok overwrites the input string, adding \0 characters where it finds delimiters
[00:21:31] what's that \020 there?
[00:21:46] as does strtok_r
[00:21:59] I was under the impression that only happens on subsequent calls
[00:22:09] then it returns a pointer to within the input string
[00:22:11] not the initial
[00:22:12] no
[00:22:19] it happens in the initial, too
[00:22:26] * Ryan_Lane grumbles
[00:22:39] so strtok("whatever", 'sep') is a way to truncate at sep
[00:23:00] binasher, https://gerrit.wikimedia.org/r/#/c/16097/
[00:23:23] it'd be equivalent of doing
[00:23:27] char *t = strchr("whatever", ','); if (R) *R = '\0';
[00:23:36] * Ryan_Lane nods
[00:23:47] er.. r and R being the same variable :P
[00:24:17] so, this would effectively mangle the XFF that's sent past this point
[00:24:30] clients would likely only get the 1st in the list
[00:25:20] yes
[00:25:24] AFAIK, we only care about the 1st one
[00:25:25] New patchset: preilly; "use strtok_r instead of strtok for thread safety reasons" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16098
[00:25:42] I hope you're not going to argue that that's just fine and elegant and good practice
[00:25:44] we strip XFF if it doesn't come from the HTTPS servers
[00:25:55] it's not elegant
[00:25:58] but it's fine
[00:26:02] wouldn't inet_pton stop at the comma anyway?
[00:26:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16098
[00:26:05] I gave you code that will work and be good practice
[00:26:24] no allocation, leet C99 feature use
[00:26:34] ah. missed that
[00:27:04] you could also set the , to \0, then restore back to a ,
[00:27:09] hehe
[00:27:30] you would think it would return a const char*
[00:27:51] and then hopefully somewhere deep in varnish, a compiler will be screaming in pain
[00:28:06] old tricks when constant section wasn't read-only
[00:28:13] I got some bugs when the compiler changed that
[00:28:19] also there's the little problem of varnish being multithreaded
[00:28:28] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16098
[00:28:34] several threads holding the same header?
[00:28:42] seems unlikely
[00:28:44] what if there is some monitor thread that's dumping current request headers?
[00:28:52] so, we'll merge this in, and later, when we fix this, we'll fix it in both places it's broken
[00:28:54] MaxSem: what was happening without a cutoff in setlimits?
[00:29:04] though in practice it isn't broken
[00:29:12] I'd test inet_pton
[00:29:20] I think it is likely to handle it right
[00:29:37] without any need of string operations
[00:29:57] one problem with writing broken C code and knowing that it will work for some subtle reason is that other people will read it and not know that that's what you're doing
[00:30:03] binasher, the server's limit (1000 by default) gets used. this change is the only one needed for PECL extension to bw used
[00:30:08] TimStarling: indeed
[00:30:10] s/bw/be/
[00:30:14] it's been broken for about 6 months, though
[00:30:18] and they will copy your example and really break something properly
[00:30:23] yeah
[00:31:06] in both of our use cases, we specifically only want one IP
[00:31:13] MaxSem: cutoff should be optional for the pecl sphinxclient?
[00:31:17] and we only want to pass a single one through
[00:31:17] btw that code I gave you will work as long as the headers are limited to some reasonable size, like a few megabytes
[00:31:30] if someone passes a gigabyte header then it will probably segfault
[00:31:39] but most webservers limit header length
[00:31:50] /* This functions returns the first ip provided by a list header like XFF */
[00:31:51] int first_ip_to_inet_addr(const char* src, void* dst) { /* We know inet_pton() does the right thing*/ return inet_pton(src, dst); }
[00:31:53] wasn't this a problem very recently in apache?
[00:32:13] I don't think so
[00:32:17] MaxSem: oh, that's max_matches, not cutoff.. nm
[00:32:42] good night
[00:32:47] segfaults on header length, not this specific example
[00:32:55] right
[00:33:18] well, if you don't know about that variable size automatic feature you might be tempted to do
[00:33:25] heh
[00:33:32] char header[10000]; // surely no header could be bigger than this
[00:33:55] right
[00:34:05] strcpy(header, VRT_GetHdr(...));
[00:34:08] I think for our use case, the solution we have is simple and non-buggy
[00:34:12] lol
[00:34:24] yes, that would indeed be bad.
[00:35:13] maybe documentation specifying our use case would suffice?
[00:35:32] Ryan, the right way to do it is like 3 lines of code
[00:35:42] and you want to document why you did it the wrong way?
[00:35:50] binasher, though on my VM's Linux PECL seems slower than pure-PHP
[00:36:21] TimStarling: so you want it like: http://ideone.com/Ikfrg
[00:36:24] meh. this is a review I'm doing at the end of the day. I don't feel like fixing it right now.
[00:36:29] that's the real issue
[00:36:53] I would commit a fix if I had a test setup
[00:37:41] preilly: yes
[00:38:18] maybe size_t instead of int for len
[00:48:50] TimStarling: how does this look http://ideone.com/OcWIz to you?
[00:48:51] MaxSem: that's odd. slower at any one thing in particular?
[00:49:23] preilly: looks good
[00:50:22] New patchset: preilly; "remove use of strtok_r" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16101
[00:50:46] Ryan_Lane: ^^
[00:51:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16101
[00:51:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16101
[00:56:07] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[01:18:01] I really hope you guys are actually writing a buffer overflow in our varnish instance
[01:19:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[01:21:17] csteipp: how so?
[01:21:51] maybe we should switch channels?
[01:42:32] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 284 seconds
[01:44:20] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 299 seconds
[01:50:11] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 654s
[01:51:32] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 13 seconds
[01:56:20] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 11 seconds
[01:56:38] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 11s
[02:05:38] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[03:53:38] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours
[03:56:38] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[04:16:49] Reedy: Do you know how to /// have you ever // .. deploy bugzilla updates?
[04:17:06] not bugzilla core, but I mean a simple change in svn to your skin and comment parser regex
[04:17:11] our*
[04:17:31] I created an RT ticket, but its been 2 weeks. Something like this shouldn't take weeks, nor require an RT ticket.
[04:17:48] isn't it just svn up on the right machines?
[04:28:38] morning
[06:48:44] yes it is
[07:24:51] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[08:23:17] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[08:54:07] paravoid: I have been looking at your module for ntp.
[08:54:28] paravoid: have you considered making each of our puppet module an independent git repo ?
[08:54:40] this way third parties could easily reuse / include our modules in their project
[08:54:57] we could publish them in http://forge.puppetlabs.com/ ;-D
[09:24:41] sorry, was talking with banks
[09:24:52] multiple silly charges for christ's sakes
[09:24:54] I hate greek banks
[09:25:00] anyway
[09:25:04] <3 greek food
[09:25:13] so, separate git for every tiny module is a bit of an overkill imho
[09:25:33] also, it would make doing site-wide changes atomically oh so much more difficult
[09:25:48] (I presume you also meant using git submodules)
[09:26:59] publishing some of our things in forge (or riseup or just somewhere locally) is something I mentioned in my mail
[09:27:32] so, it might make sense splitting up a very large independent module of ours to a separate repo
[09:27:44] but let's do that later or when a real need arises :)
[09:36:31] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours
[09:58:46] paravoid: totally presumed using submodules :-]
[09:59:20] paravoid: a typical use case would be the git:: classes which are just a wrapper for the git client
[09:59:33] we will see :-] we can always make them a git repo later on
[09:59:48] paravoid: what is your thought about rspec / cucumber ?
[10:04:14] paravoid: oh and by pure coincidence, i was looking at a test system for puppet when you send your mails :D
[10:04:56] well got http://projects.puppetlabs.com/projects/ci-modules/wiki/Blog which compare them
[10:05:18] hashar: http://puppetlabs.com/blog/the-next-generation-of-puppet-module-testing/
[10:05:22] jul 12 :)
[10:05:27] ahah
[10:05:34] i should subscribe to that blog rss
[10:06:05] cucumber-puppet is not compatible with puppet 2.7 anyway
[10:06:54] the reason I was asking is because I am rewriting the squid class to factor out some code
[10:09:04] the draft https://gerrit.wikimedia.org/r/16115
[10:09:43] added you as a reviewer so you can read it :-D
[10:10:14] anyway, daughter duty + lunch
[10:10:17] will be back in roughly 2 hours
[10:38:38] !log Built new varnish 3.0.3~rc1+persistent1-wm1 packages and inserted them into the precise-wikimedia APT repository
[10:38:48] Logged the message, Master
[10:39:15] yay
[10:39:36] hm, !log for reprepro uploads?
[10:39:42] should I do that too?
[10:39:48] I've added several packages yesterday
[10:39:58] yes
[10:40:08] but since noone else does that, I added a log script to reprepro yesterday
[10:40:18] just in time I saw, because I got a whole bunch of stuff immediately ;)
[10:40:42] I saw the mails and was wondering :)
[10:40:55] i noticed a lot of packages got added which I've never seen
[10:41:09] in !log you mean?
[10:41:11] or yesterday?
[10:41:15] and I also noticed leslie saying on labs-l: "feel free to build a package and I'll happily add it to the repo!"
[10:41:19] and I was thinking: "not so fast"
[10:41:26] hahaha
[10:41:31] stuff should get decent reviews
[10:41:38] *at least* I want to notice when stuff gets added
[10:42:05] there are security risks etc
[10:42:09] oh I know :)
[10:42:18] so yeah the existing log script is simply echo $@ | mail
[10:42:19] I was thinking of how to fix that
[10:42:25] as I didn't have time to figure out what all the params were
[10:42:27] but it works well enough ;)
[10:42:33] as in, how to make a list of all the things that we have in the repo
[10:42:43] and notify us on USNs etc.
[10:42:47] we didn't have that much so far
[10:42:51] but it's growing quickly now
[10:42:57] yeah that would be good
[10:43:19] I still haven't deployed the new php5
[10:43:29] and noone seems to care, which is disappointing
[10:43:55] noone really knows how to handle it well
[10:44:14] there's not really any better process than what you've already done
[10:44:24] other than people intimately familiar with mediawiki keeping an eye on it ;)
[10:44:30] (which is typically not ops)
[10:44:35] so that should change
[10:45:10] any ideas how?
[10:46:26] we need to run a mediawiki test suite also for system level changes
[10:46:34] sort of what you did by upgrading the jenkins server (I think?)
[10:46:44] but I have no idea how complete / thorough that is by now
[10:46:50] (I don't follow mediawiki development at all tbh)
[10:48:26] running the phpunit test suite is better than nothing
[10:48:46] certainly
[10:48:58] TimStarling: are there instructions on how to do so?
[10:49:09] I'd be interested
[10:50:22] see tests/phpunit/README
[10:50:27] DO NOT RUN THESE TESTS ON A PRODUCTION SYSTEM OR ON ANY SYSTEM WHERE YOU NEED
[10:50:27] TO RETAIN YOUR DATA.
[10:50:37] that's what README says, so don't do that
[10:50:39] lol
[10:50:44] interesting
[10:51:04] you'll have to create a separate instance of MediaWiki
[10:51:27] we had a problem a while back where running the test suite would create an admin user with a fixed password
[10:51:46] I think that's fixed now, but it's hard to stop people from accidentally accessing the DB
[10:51:57] but I think hashar has been working on that problem
[10:52:55] I was hoping more on getting an answer like "ssh to srv193, cd /foo/bar, run php runtests.php"
[10:53:18] i also always hope my work is already done by others
[10:53:19] ;)
[10:53:29] indeed
[10:53:32] maybe hashar will have something already set up for you
[10:54:00] hashar seems to be the one to talk to
[10:54:09] okay
[10:54:31] he set up jenkins recently, for pre-merge automated testing
[10:54:46] he mostly works on QA stuff these days
[10:55:02] yeah, we've been working a bit together
[10:56:16] speaking of PHP, I was looking at the obama benchmark again today
[10:56:51] the PHP VM seems to be at least 25% of the execution time
[10:56:57] and 25% of that is branch misprediction
[10:57:01] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[10:57:42] there's no way to avoid branch misprediction in a normal VM, there's no way for the processor to guess what handler it will jump to far enough ahead
[10:58:38] so I thought, it would be easy enough to avoid that overhead by converting the sequence of opcodes to a sequence of machine code handler calls instead
[10:59:04] with fixed call addresses
[10:59:11] then the processor will know what is going on
[10:59:24] long story short, someone already did it 3 years ago, integrating LLVM with Zend PHP
[10:59:50] 1000 lines of code, looks as simple as falling off a log
[11:03:19] TimStarling: http://gitorious.org/php-llvm
[11:03:24] Brion Vibber created project php-llvm
[11:03:38] heh
[11:04:05] yeah, that one
[11:04:15] I guess he didn't commit anything to it?
[11:05:16] New patchset: Mark Bergsma; "Puppetize reprepro configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16117
[11:05:33] I'm cleaning up the bitrot, trying to get it to compile against a recent version of LLVM and PHP
[11:05:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16117
[11:06:06] I guess at some point I'll work out why everyone has abandoned it
[11:06:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16117
[11:06:45] btw I also identified the main reason why 5.4 is faster than 5.3
[11:07:56] a "zend_literal" abstraction was introduced for passing string literals around
[11:08:29] it is used to avoid hash calculations, to make hashtable lookups faster
[11:09:00] New patchset: Mark Bergsma; "Fix file paths" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16118
[11:09:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16118
[11:09:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16118
[11:09:58] especially looking up object method calls
[11:10:14] haha
[11:20:49] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[11:22:30] mark: any reason to keep karmic & oneiric?
[11:22:38] we don't have any of these anymore
[11:22:47] I believe so too, but wasn't sure
[11:22:52] the old search servers were karmic
[11:22:53] servermon is your friend :)
[11:23:02] but I believe peter reinstalled them all
[11:23:51] where is servermon again?
[11:23:52] i've never seen it
[11:23:58] bad mark
[11:24:04] http://sockpuppet.pmtpa.wmnet/servermon/
[11:24:19] fwiw 10 hardy, 708 lucid, 63 precise
[11:25:01] django++
[11:25:06] hehe
[11:25:14] I'm fluent in django
[11:25:44] so, go to fact query
[11:25:48] i'll remove those dists
[11:25:54] choose all hosts
[11:26:03] and pick lsbdistcodename at the facts box
[11:26:09] and generate the report
[11:26:20] fact query is quite handy
[11:26:24] i used to use ganglia for this
[11:26:26] but this is better :)
[11:27:04] glad to hear that
[11:27:12] esp. since I've written half of it :)
[11:27:31] yeah good work
[11:27:48] that also means I can implement features that you'd like
[11:28:23] you could do that even when you hadn't written it
[11:28:35] well yes, but now it's easier for me
[11:28:40] i already told mithrandir that you're our debian minion now
[11:28:45] hehe
[11:28:57] ;)
[11:29:00] so, since you're at it
[11:29:07] could you help me with a cleanup?
[11:29:17] on the main page there's the "problematic puppetized hosts" view
[11:29:33] it's basically a list of hosts that the puppetmaster hasn't seen for N runs
[11:29:40] (in our case, 4 hours)
[11:29:42] right
[11:30:11] the list is extensive, which means that there are decommisioned hosts that are still in the puppet db
[11:30:22] yup
[11:30:30] those mw*.eqiad we should put in decommissioning
[11:30:32] but we have a cronjob that cleans up
[11:30:34] we should turn those off until we start using them
[11:30:38] they should have never been installed
[11:30:58] ms1-4 can be decommissioned too
[11:31:09] (not 5-8)
[11:31:20] payments no longer runs our puppet
[11:31:23] that's the fundraising realm
[11:31:25] by decomissioned you mean puppetstoredconfigclean.rb or something else?
[11:31:27] PCI scope etc
[11:31:38] i mean listed in "decommissioning.pp"
[11:31:44] that also means that cron job will run for them
[11:31:49] we can take them out there again when we want to reuse
[11:31:59] it's a bit misnamed, it's basically a cleanup list
[11:32:16] knsq30 is broken
[11:32:19] I don't think i'll fix it
[11:32:19] why does this exist?
[11:32:23] so we can decommission it
[11:32:38] it exists to cleanup stuff, e.g. in the puppet db, in ganglia, ec
[11:32:40] nagios
[11:32:41] etc
[11:33:27] hm, shouldn't nagios be cleaned up automagically?
[11:33:38] how?
[11:33:46] as long as it stays in the puppet db, it stays in nagios
[11:33:50] in fact
[11:33:55] if it doesn't stay in the puppet db, it still stays in nagios :)
[11:34:06] at least using the old method, not sure about naggen
[11:34:10] oh, I meant, after cleaning up the puppet db
[11:34:18] perhaps naggen removes
[11:34:19] with naggen it will cleaned up
[11:34:23] but the old method it will just stay around
[11:34:24] and without naggen there was another way
[11:34:35] to purge unmanaged resources
[11:34:42] in puppet maybe
[11:35:02] i've never really tried that
[11:35:12] lemme find itresources { [ "nagios_service", "nagios_servicegroup", "nagios_host" ]: purge => true; }
[11:35:27] s/^lemme find it//
[11:35:27] :)
[11:35:45] should be slow though
[11:35:58] I don't remember why I didn't use that
[11:36:03] perhaps it didn't exist at the time
[11:36:13] in any case, right now we just do a check on that decommissioning.pp list
[11:36:15] maybe because it's a very esoteric puppet feature
[11:36:16] and if a host is in there
[11:36:18] it's ensure =. absent
[11:36:29] yeah, I'm looking at the manifest right now
[11:36:37] and does that for a few other things
[11:36:44] like in ganglia, it puts those hosts in a decommissioned group
[11:36:49] aha
[11:36:53] okay
[11:36:57] makes sense
[11:36:59] the idea was to do more with it
[11:37:05] but not much has happened with it
[11:37:13] like what?
[11:37:15] one problem is that often decommissioned hosts don't run puppet themselves anymore
[11:37:19] because they're broken or so
[11:37:27] well, general cleanup of services, where necessary
[11:38:29] we wanted to generate node groups from puppet for example (but didn't yet)
[11:38:44] or automatically remove from torrus/other statistics
[11:39:12] it can be useful in some places in our puppet manifests to know that a certain host existed but does no longer
[11:39:53] * paravoid nods
[11:43:50] so, I should add mw*.eqiad, ms1-4 and payments* to decom, right?
[11:44:14] yep
[11:44:21] what about the rest?
[11:44:21] owa* too
[11:44:34] ms7 is solaris
[11:44:38] we're never gonna run puppet on that anymore
[11:44:46] yucks
[11:44:56] knsq30 can be decommissioned
[11:45:10] it's out of warranty, replacement servers are racked already
[11:45:24] knsq25 not sure
[11:45:42] and the rest will need investigation
[11:45:56] mw*.eqiad have been seen sometimes as low as a day
[11:46:04] are you sure someone is not working on them?
[11:46:36] they have been installed with lucid almost a year ago
[11:46:38] and never used
[11:46:47] ct was pushing for them to get installed, I thought that was stupid
[11:46:55] and now we're gonna reinstall them all with precise
[11:47:01] so they can be shutdown now
[11:47:29] it /is/ possible someone is using them for some sort of testing, but I don't think so
[11:47:47] okay
[11:49:36] payments are also monitored externally?
[11:49:57] i.e. if I add them to decom, will be they removed from nagios when they shouldn't?
[11:50:54] hmm
[11:50:56] that i'm not sure about
[11:51:39] they sure to be monitored by our nagios
[11:51:43] *seem to
[11:52:05] check with jeff i guess
[11:52:16] and there's ganglia for them too
[11:52:17] I will
[11:53:20] what's owa?
[11:53:23] New patchset: Mark Bergsma; "Remove karmic and oneiric, no longer used" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16119
[11:53:33] open web analytics
[11:53:37] old project, discontinued
[11:53:43] afterwards the servers were used for swift testing
[11:53:47] now they can be used for whatever
[11:54:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16119
[11:54:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16119
[11:58:19] New patchset: Faidon; "Add a few decommissioned hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16120
[11:58:21] mark: ^^^
[11:58:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16120
[11:59:35] i'm not sure if adding ms7 will also remove monitoring
[12:00:00] other than that ok
[12:01:10] ms7 is not in nagios
[12:01:18] but is in ganglia
[12:06:52] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[12:11:08] I vaguely remember getting ganglia going over there ... I think it was on that host
[12:16:09] ok
[12:17:22] mark i've been working on pxeboot/preseed for the payments cluster, have we considered making the preseed config dynamic instead of flat files?
[12:19:52] yes
[12:19:58] just noone has ever really worked on it
[12:20:12] I think that only really makes sense if we redo the thing with more intelligence / data from some inventory db or something
[12:22:29] heh, we were planning to do that with servermon at grnet
[12:22:41] (servermon has an inventorydb module)
[12:22:59] mark: that's what I was thinking
[12:23:30] for CL we did the dynamic stuff first, and then created a couple of data backends for different purposes
[12:24:19] we gave IBM a stripped down version that had a csv file as the backend--which they used to pre-image a couple hundred blade servers before delivery, and in the end they turned over the .csv which we imported to our asset db
[12:24:46] the prod instance backended in a mysql db
[12:26:16] New patchset: ArielGlenn; "Yet Another S3 library initial commit (old python/curl script sucked bad)" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/16121
[12:27:01] New patchset: Mark Bergsma; "Initial comments to app server manifests work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16122
[12:27:20] notpeter: added a bunch of comments to the manifests
[12:27:30] but I also think we should redo this in a (set of?) modules
[12:27:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16122
[12:28:05] reee
[12:28:23] and think hard about the separation between "mediawiki", an "application server" (and varieties thereof), what belongs in role classes and what doesn't
[12:28:41] :D
[12:28:49] here moving to a module helps
[12:28:55] because it makes us think about decoupling
[12:30:04] puppet would pull host config data from an external canonical data store?
[12:30:06] notpeter: I don't expect you to solve this on your own btw, but have a look at all the comments first, then we should sit down together I think :)
[12:30:20] oh nm
[12:36:14] Jeff_Green: so, payments* is being monitored by normal nagios/ganglia, right?
[12:36:58] New patchset: Faidon; "Add a few decommissioned hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16120
[12:37:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16120
[12:37:37] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16120
[12:37:50] (removed ms7 from the list)
[12:52:20] PROBLEM - Host nescio is DOWN: PING CRITICAL - Packet loss = 100%
[12:52:42] http://www.theonion.com/video/hp-on-that-cloud-thing-that-everyone-else-is-talki,28789/
[12:52:45] where is Ryan
[12:53:06] PROBLEM - Host ns2.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[12:53:29] oh?
[12:53:38] anyone looking at that yet?
[12:53:45] I just started to
[12:54:34] ns2 = nescio, right?
[12:54:44] PROBLEM - Host 91.198.174.6 is DOWN: PING CRITICAL - Packet loss = 100%
[12:54:55] yes
[12:58:00] Broadcast me
[12:58:04] the only thing that is on the console
[12:58:31] weird
[12:59:53] !log powercycling nescio/ns2, unresponsive network & console
[13:00:01] Logged the message, Master
[13:00:23] shouldn't we get a page for that?
[13:00:28] no
[13:02:14] RECOVERY - Host ns2.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 117.85 ms
[13:02:14] RECOVERY - Host 91.198.174.6 is UP: PING OK - Packet loss = 0%, RTA = 117.88 ms
[13:02:32] RECOVERY - Host nescio is UP: PING OK - Packet loss = 0%, RTA = 117.70 ms
[13:04:08] snapshot1.wm.o and snapshot2.wm.o are form the days when those hosts had public ips (no longer true)
[13:04:14] so those names are no longer used
[13:04:17] ( paravoid )
[13:04:48] Jeff_Green will likely have a related story for storage2
[13:04:51] *3
[13:04:53] then you should add them to the list, apergos
[13:05:42] isn't losing one third of our nameservers bad enough for paging?
[13:05:46] so the related names snapshot1.pmtpa etc are in use
[13:06:09] paravoid: I don't think so :P
[13:06:16] there's nothing dumb enough to not use the who fqdn in the decommisioned list, right?
[13:06:21] *whole
[13:06:26] if we get a ton of pages all the time, noone's gonna react to them
[13:06:29] mark: :P
[13:06:51] I believe more in restricting it to really bad events
[13:06:59] since we do tend to notice stuff anyway
[13:07:20] I got like 20 mysql pages the last days, and god knows why
[13:07:34] someone said "ignore that" here when those came
[13:07:46] but yeah, agreed on the general principle
[13:07:52] that doesn't help when i'm not on irc
[13:08:13] those mysql checks should not be marked critical probably
[13:08:19] i want to know when a master goes down
[13:08:23] or anything else mediawiki can't handle
[13:08:35] makes sense
[13:08:42] i don't want to get paged if some process is not running one db I don't really care about
[13:08:51] it's better now it won't wake us up anymore, at least
[13:08:58] that really sucked
[13:09:26] did I tell you what happened to me in Nicaragua?
[13:09:30] I probably did
[13:09:31] yes
[13:09:42] :)
[13:11:20] apergos: yeah the purge script will currently try to remove all possible fqdns
[13:11:29] so feel free to cleanup stuff manually instead
[13:11:40] ok then
[13:11:54] run the purge script, remove the relevant files on the nagios server(s)
[13:11:55] I thought I had but if we are still seeing warnings...
[13:18:27] knsq25 has a broken disk too
[13:18:30] i'm not gonna fix it :)
[13:18:39] wasn't someone looking at hydrogen?
[13:19:12] yeah I think so
[13:19:23] you were apparently, hehe
[13:19:46] !log powercycling hydrogen, down since yesterday
[13:19:53] Logged the message, Master
[13:20:41] RECOVERY - Host hydrogen is UP: PING OK - Packet loss = 0%, RTA = 35.70 ms
[13:24:25] RECOVERY - Host 208.80.154.50 is UP: PING OK - Packet loss = 0%, RTA = 35.43 ms
[13:24:25] RECOVERY - Host 2620:0:861:1:7a2b:cbff:fe09:c21 is UP: PING OK - Packet loss = 0%, RTA = 35.49 ms
[13:24:43] PROBLEM - Recursive DNS on 2620:0:861:1:7a2b:cbff:fe09:c21 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[13:26:04] PROBLEM - Recursive DNS on 208.80.154.50 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[13:27:25] RECOVERY - Recursive DNS on 208.80.154.50 is OK: DNS OK: 0.356 seconds response time. www.wikipedia.org returns 208.80.154.225
[13:27:26] why do those eqiad boxes die all the time
[13:27:34] RECOVERY - Recursive DNS on 2620:0:861:1:7a2b:cbff:fe09:c21 is OK: DNS OK: 0.069 seconds response time. www.wikipedia.org returns 208.80.154.225
[13:28:46] mark: text.pmtpa.wikimedia.org is down on nagios for 168d, I guess it doesn't exist anymore?
[13:29:15] indeed
[13:29:20] virt1.wikimedia.org too, moved to wmnet
[13:29:25] csw5-pmtpa for 139d?
[13:29:49] I'm in a cleanup mode, feel free to tell me to stop if I'm getting too annoying :) [13:30:30] csw5 is also gone [13:30:36] cleanup is good [13:30:39] i certainly won't stop you [13:33:45] huh [13:33:47] who installed hydrogen [13:34:59] ah noone [13:35:01] hehe [13:35:06] it was already installed, but broken [13:35:11] I added puppet config, wanted to reinstall it, couldn't [13:35:13] rob fixed it [13:35:18] server came back up, and added puppet config [13:35:25] but it's lucid, so I still need to reinstall it [13:35:47] for srv in $(cut -d'"' -f 2 -s /var/lib/git/operations/puppet/manifests/decommissioning.pp) [13:35:50] hooooly crap [13:36:19] !log Reinstalling hydrogen [13:36:27] Logged the message, Master [13:38:31] PROBLEM - Host hydrogen is DOWN: PING CRITICAL - Packet loss = 100% [13:38:40] PROBLEM - Host 208.80.154.50 is DOWN: PING CRITICAL - Packet loss = 100% [13:39:43] PROBLEM - Host 2620:0:861:1:7a2b:cbff:fe09:c21 is DOWN: PING CRITICAL - Packet loss = 100% [13:41:22] RECOVERY - Host hydrogen is UP: PING OK - Packet loss = 0%, RTA = 35.44 ms [13:41:48] paravoid: rspec is kind of funny [13:42:02] paravoid: I expect to send some base to be build on by the end of the afternoon [13:43:18] New patchset: Mark Bergsma; "Add server hydrogen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16126 [13:44:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16126 [13:44:13] RECOVERY - Host 208.80.154.50 is UP: PING OK - Packet loss = 0%, RTA = 35.49 ms [13:44:22] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Fri Jul 20 13:43:53 UTC 2012 [13:45:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16126 [13:45:16] RECOVERY - Host 2620:0:861:1:7a2b:cbff:fe09:c21 is UP: PING OK - Packet loss = 0%, RTA = 35.47 ms [13:45:44] hmm [13:45:52] the mw1xxx are up right now? 
[13:46:08] so what I did effectively ping-pongs the puppet db [13:46:25] the cronjob removes them, then puppet runs again on them [13:47:16] hmm [13:47:17] yes [13:47:19] dammit [13:47:24] although they should readd nagios entries [13:47:27] as the manifests recognize that [13:47:37] they will [13:47:41] shouldn't [13:47:46] perhaps for naggen [13:47:59] we're constantly adding and removing a lot of stuff in the database [13:48:11] multiple rows per host, multiple hosts [13:48:40] it's bad, I think I'll just remove them from decom for now [13:48:43] PROBLEM - Recursive DNS on 208.80.154.50 is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:48:51] you can shut them down too :) [13:48:55] oh? can I? [13:48:59] why not [13:49:02] they're unused [13:49:32] we'll start using them within a month or two, but there's no point in keeping them turned on in the mean time [13:49:37] PROBLEM - Recursive DNS on 2620:0:861:1:7a2b:cbff:fe09:c21 is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:49:38] since we'll reinstall anyway [13:51:09] just to confirm, all of mw1[0-9][0-9][0-9].eqiad.wmnet, right? 
[13:51:13] you can also remove their keys on sockpuppet [13:51:18] I'm about to batch this, wouldn't want to shutdown everything :) [13:51:26] so they can't run puppet [13:51:39] nah, shutting them down since more green :) [13:51:52] i'm not aware of any of those servers being in use [13:51:59] but that's not a complete guarantee anymore [13:53:43] $ cat manifests/decommissioning.pp |grep mw1[0-9][0-9][0-9] | sed 's/^.//;s/..$/.eqiad.wmnet/' | while read mw; do ssh root@$mw poweroff & done [13:54:05] !log powering off all of mw1[0-9][0-9][0-9].eqiad.wmnet, unused [13:54:07] PROBLEM - SSH on hydrogen is CRITICAL: Connection refused [13:54:13] Logged the message, Master [13:54:36] should watch our power graphs now [13:54:44] damnit [13:54:46] they're broken [13:54:52] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [13:55:14] are we paying for power or is it a flat fee included in the colo? [13:55:20] flat fee [13:55:32] in esams it's per kWh [13:57:52] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [13:57:55] New review: ArielGlenn; "Note that this is version 0.1, very preliminary."
[operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/16121 [13:57:57] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/16121 [14:02:51] New patchset: ArielGlenn; "remove the done stuff from todo" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/16128 [14:03:07] PROBLEM - NTP on hydrogen is CRITICAL: NTP CRITICAL: No response from NTP server [14:03:14] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/16128 [14:04:37] RECOVERY - SSH on hydrogen is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:05:43] paravoid: I also wrote a DNS monitor for pybal btw [14:06:20] since I put LVS in front of rec DNS [14:08:25] hehehe [14:09:05] i was quite familiar with the twisted async dns code since ipv6 day anyway [14:09:43] RECOVERY - Recursive DNS on 208.80.154.50 is OK: DNS OK: 0.050 seconds response time. www.wikipedia.org returns 208.80.154.225 [14:10:19] RECOVERY - Recursive DNS on 2620:0:861:1:7a2b:cbff:fe09:c21 is OK: DNS OK: 0.062 seconds response time. www.wikipedia.org returns 208.80.154.225 [14:12:07] RECOVERY - NTP on hydrogen is OK: NTP OK: Offset -0.01455581188 secs [14:14:33] 2012-07-20 14:12:24.272522 [dns_rec DNSQuery] hydrogen.wikimedia.org (disabled/up/not pooled): DNS query successful, 0.285 s: www.google.com A 74.125.137.105 74.125.137.147 74.125.137.104 74.125.137.99 74.125.137.103 74.125.137.106 [14:14:33] 2012-07-20 14:12:34.287789 [dns_rec DNSQuery] hydrogen.wikimedia.org (disabled/up/not pooled): DNS query successful, 0.015 s: www.google.com AAAA 2001:4860:800a::93 [14:14:33] 2012-07-20 14:12:34.358283 [dns_rec IdleConnection] hydrogen.wikimedia.org (disabled/up/not pooled): Connection established. [14:14:34] 2012-07-20 14:12:44.407695 [dns_rec DNSQuery] hydrogen.wikimedia.org (disabled/up/not pooled): DNS query successful, 0.120 s: en.wikipedia.org AAAA 2620:0:861:ed1a::1 [14:18:46] paravoid: I was thinking... 
[14:18:52] should make a dbus monitor for pybal [14:19:11] so it can monitor things like upstart events and such [14:19:33] gosh that's evil [14:19:39] evil? [14:19:44] nice idea [14:19:47] I would never think of that [14:19:58] (evil in a good way) [14:20:00] hehe [14:20:46] should poweroff owa2 too? [14:20:49] should I [14:20:51] I think so [14:20:59] better to have unmanaged servers shutdown [14:21:03] especially those which are public ;) [14:22:29] !log powering off owa1/2/3, unused [14:22:37] Logged the message, Master [14:23:16] http://torrus.wikimedia.org/torrus/Facilities?path=/Power_usage/Total_power_usage/Power_per_site [14:23:17] not much lower [14:35:56] !log Added server hydrogen to the dns_rec eqiad LVS pool [14:36:04] Logged the message, Master [14:41:17] New patchset: Mark Bergsma; "Define IPv6 LVS services on all balancers now, but keep BGP disabled for some" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16131 [14:41:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16131 [14:42:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16131 [14:51:22] New patchset: Mark Bergsma; "Enable BGP announcement for all IPv6 (nginx backed) services" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16132 [14:51:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16132 [14:52:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16132 [15:15:21] mark: thank you! saw commits. look good to me! 
[15:17:14] cool [15:17:20] early next week we'll have another look [15:27:44] New patchset: Faidon; "Add all of mw10[0-9][0-9].eqiad.wmnet to decom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16135 [15:28:17] PROBLEM - Host mw1088 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:17] PROBLEM - Host mw1008 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:17] PROBLEM - Host mw1042 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:17] PROBLEM - Host mw1093 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:17] PROBLEM - Host mw1084 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:18] PROBLEM - Host mw1012 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:18] PROBLEM - Host mw1090 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:19] PROBLEM - Host mw1078 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:19] PROBLEM - Host mw1091 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16135 [15:28:27] PROBLEM - Host mw1040 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:27] PROBLEM - Host mw1072 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:27] PROBLEM - Host mw1036 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:27] PROBLEM - Host mw1070 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:27] PROBLEM - Host mw1058 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:27] PROBLEM - Host mw1026 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:34] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16135 [15:28:35] PROBLEM - Host mw1080 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:36] PROBLEM - Host mw1050 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:36] PROBLEM - Host mw1024 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:36] PROBLEM - Host mw1014 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:36] PROBLEM - Host mw1029 is DOWN: PING CRITICAL - Packet loss = 100% 
[15:28:36] PROBLEM - Host mw1016 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:36] PROBLEM - Host mw1082 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:37] PROBLEM - Host mw1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:37] PROBLEM - Host mw1092 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:37] now you're happy those are not paging eh [15:28:38] PROBLEM - Host mw1031 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:38] PROBLEM - Host mw1046 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:39] PROBLEM - Host mw1052 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:39] PROBLEM - Host mw1004 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:44] PROBLEM - Host mw1020 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:44] PROBLEM - Host mw1011 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:44] PROBLEM - Host mw1015 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:44] PROBLEM - Host mw1023 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:44] PROBLEM - Host mw1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:45] PROBLEM - Host mw1022 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:45] PROBLEM - Host mw1017 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:45] that's me [15:28:46] PROBLEM - Host mw1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:46] PROBLEM - Host mw1009 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:47] PROBLEM - Host mw1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:47] PROBLEM - Host mw1013 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:48] PROBLEM - Host mw1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:48] PROBLEM - Host mw1086 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:49] PROBLEM - Host mw1067 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:49] PROBLEM - Host mw1047 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:50] PROBLEM - Host mw1034 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:50] PROBLEM - Host mw1079 is DOWN: PING CRITICAL - Packet loss = 100% 
[15:28:51] I forgot about these before [15:28:51] PROBLEM - Host mw1057 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:51] PROBLEM - Host mw1075 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:53] PROBLEM - Host mw1062 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:53] PROBLEM - Host mw1043 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:53] PROBLEM - Host mw1045 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:53] PROBLEM - Host mw1028 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:54] PROBLEM - Host mw1060 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:54] PROBLEM - Host mw1051 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:55] PROBLEM - Host mw1025 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:55] PROBLEM - Host mw1044 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:56] PROBLEM - Host mw1048 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:56] PROBLEM - Host mw1049 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:57] PROBLEM - Host mw1037 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:57] PROBLEM - Host mw1035 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:58] PROBLEM - Host mw1030 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:58] PROBLEM - Host mw1095 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:59] boo [15:28:59] PROBLEM - Host mw1065 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:59] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:00] PROBLEM - Host mw1064 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:00] PROBLEM - Host mw1076 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:02] PROBLEM - Host mw1069 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:02] PROBLEM - Host mw1068 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:02] PROBLEM - Host mw1081 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:02] PROBLEM - Host mw1073 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:03] PROBLEM - Host mw1087 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:03] PROBLEM - 
Host mw1071 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:04] PROBLEM - Host mw1077 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:11] PROBLEM - Host mw1096 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:11] PROBLEM - Host mw1089 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:11] PROBLEM - Host mw1098 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:11] PROBLEM - Host mw1099 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:12] !log powering off the rest of mw10[0-9][0-9] [15:29:19] Logged the message, Master [15:36:24] # wc -l puppet_services.cfg [15:36:24] 0 puppet_services.cfg [15:36:27] that can't be good [15:38:57] that's fine I think [15:39:05] they're stored per host, separate files [15:40:14] daemon.log shows puppet creating services like crazy [15:40:32] Jul 20 15:40:25 spence puppet-agent[24850]: (/Stage[main]/Nagios::Monitor/Nagios_service[sq45 frontend http]/ensure) created [15:40:35] etc. [15:40:45] those are decommissioned [15:40:50] perhaps they came back? [15:41:00] or hm [15:41:06] no nm [15:41:08] not decommissioned [15:41:21] nevermind me [15:41:24] apparently that's… normal [15:41:59] root@spence:~# egrep -c '^Jul 20.*Nagios_service.*created' /var/log/daemon.log [15:42:03] 63112 [15:42:03] fuck this, I want naggen [15:42:05] root@spence:~# egrep -c '^Jul 20 15:.*Nagios_service.*created' /var/log/daemon.log [15:42:08] 4037 [15:42:35] what's holding up neon? [15:43:04] manifests cleanup iirc [15:46:10] New patchset: Hashar; "very basic rspec layout to build upon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16139 [15:46:23] paravoid: that 16139 is for next week :) [15:46:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16139 [15:48:02] New review: Hashar; "Really hacky for now. I could not manage to make rspec-puppet to autoload file so I had to make it i..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16139 [15:52:16] hashar: a) there's a pending change for ssh.pp to convert it into a module, b) we have to find a better way than to clutter our tree with if (!$testing_in_rspec) everywhere [15:53:28] paravoid: fully agree [15:53:39] but haven't found a way to come around generate( /some/script ) for now :/ [15:57:55] hashar: you liked the idea of having puppet tests eh? :-) [15:58:09] well [15:58:35] yesterday I started refactoring the squid class so I could sneak in a class for beta [15:58:55] I ended up splitting the conf to squid::commons , doing some nasty stuff and so on [15:59:07] then I wondered how to verify that and did some google for "puppet unit testing" [15:59:17] found the rspec / cucumber stuff :-D [15:59:26] so looks like we have the same interest :-]]]]]] [15:59:36] if it can speed up review, I am a lll for test [16:00:43] yououhouu err: Could not retrieve catalog from remote server: Error 400 on SERVER: Unknown function get_var at /etc/puppet/manifests/squid.pp:16 on node i-0000034b.pmtpa.wmflabs [16:01:26] looks like it got removed :/ [16:07:51] hashar: I'm all for RSpec but I'm pretty sure you wouldn't want Cucumber for puppet tests [16:08:33] chrismcmahon: apparently the puppet/cucumber plugin does not support puppet 2.7 (only 2.8) [16:08:51] al rspec seems to be the most used on puppet forge :-] [16:09:49] hashar: usually Cucumber is used as a tool so that the business people and the tech people can agree on what each feature should do. [16:10:14] not to say you *couldn't* use it for puppet, just that it would probably overkill [16:10:49] probably be [16:11:45] <^demon> chrismcmahon: Every time someone mentions the gap between business and tech people, I'm reminded of http://theoatmeal.com/comics/design_hell [16:13:53] chrismcmahon: I have never used rspec nor cucumber. 
I guess we will get rspec anyways since Selenium is probably going to use it and puppet labs does too [16:13:55] chrismcmahon: we will see [16:13:56] ^demon: yep :-) So the Agile people have this thing they call Acceptance Test Driven Development, which is the whole "as an X, I want Y, so that Z" thing. Tools like Cucumber, FitNesse, Robot implement ATDD. [16:14:40] hashar: there is a really good book about RSpec that I keep meaning to buy, nicely enough called "The RSPec Book". it's about 2 years old. [16:15:55] <^demon> chrismcmahon: I prefer the methodology called "Commit what felt right at the time." This is why I prefer ugly internals stuff and don't do well with user-facing features ;-) [16:16:27] ^demon: damn cowboys! ;-) [16:17:05] <^demon> Unicorn-riding cowboys, heck yeah [16:19:25] ^demon: actually, whatever works. I used to be a huge process nerd, but that's pretty much behind me. I've just seen too much great software made with minimal process, and crappy software made with every 'best practice' in the world. [16:20:04] <^demon> My case study for "kool-aid drinking process wonks" is Capital One. [16:20:16] <^demon> I swear, those guys were more in love with the idea of writing software than actually writing it. [16:20:21] I think I am not your typical QA guy. [16:20:38] chrismcmahon: I was working on process toolkits for telecom companies; in the end I conclude we are just delivering tools for firing people and treating them like commodity [16:20:53] <^demon> chrismcmahon: Good. I hated the QA guys at Cap1 ;-) [16:21:36] lemme guess: ignorant pedants who claimed to be misunderstood? [16:22:35] <^demon> We just /didn't understand/ what they had to deal with. [16:22:52] <^demon> If only we /truly appreciated/ their process, we'd understand why QA was so rigorous. [16:22:56] <^demon> There's just /so much/ at stake. [16:24:22] I know a lot of QA people like that, I no longer have the patience to even have those conversations any more. 
[16:26:29] so I am out for now [16:26:38] see you monday :D [16:27:03] <^demon> Adios hashar, have a good weekend. [16:29:57] <^demon> chrismcmahon: The worst part is when you end up working with someone who used to be in Cap1 middle management. They try to export all those amazing processes to your new organization. [16:32:00] ^demon: "I’ve also been tired for years of software people who seem embarrassed to admit that, at some point in the proceedings, someone competent has to write some damn code." -Marick http://www.exampler.com/discipline-and-skill.html [16:34:26] <^demon> It comes back to what I said earlier, people in love with the /idea/ of programming, but not the actual process. [16:34:40] <^demon> I believe there's an analogy involving sausage production that's fitting. [16:35:42] <^demon> Anyway, we're way cooler than those kids. We actually do stuff ;-) [16:37:29] New patchset: Faidon; "Add msfe1001 to decommissioned hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16144 [16:38:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16144 [16:38:12] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16144 [16:39:45] three unhandled hosts down [16:40:02] all of them with an RT ticket associated [16:41:06] mark: you know about ssl3004 btw? 
[16:41:22] died 3 months ago :) [17:25:03] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds [17:25:48] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [17:25:57] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [17:26:51] !log authdns-update run for new servers [17:26:59] Logged the message, Master [17:57:37] New patchset: Ryan Lane; "Switch db and ldap host for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16150 [17:58:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16150 [17:58:25] yay [17:59:06] !log db48 is now replicating from db1048 (and db1048 from db48) [17:59:14] Logged the message, Master [18:11:35] !log starting gerrit [18:11:43] Logged the message, Master [18:14:43] <^demon> Ryan_Lane: I'm still getting 503 [18:14:49] yep [18:15:11] the gerrit downtime window is 11-12.. please wait til we give an all clear :) [18:16:03] <^demon> binasher: Yeah, but if gerrit stays down people blame me. Kinda vested in this :) [18:16:33] * jeremyb waves binasher [18:16:49] because pushing people in critical moments always produces better results, eh? :) [18:17:08] * jeremyb tries to decode that last db48 !log [18:17:13] is that master-master? [18:17:19] ^demon: It's all your fault for making users picky! :D [18:17:44] oops, ^demon stressed me out and i accidentally dropped the gerrt db [18:17:50] no backup, sry [18:18:01] dammit, now we have to switch to another review system [18:18:14] Change merged: Asher; [operations/debs/mysqlatfacebook] (master) - https://gerrit.wikimedia.org/r/15989 [18:18:15] <^demon> binasher: pics or it didn't happen [18:18:33] * Damianz finds a picture of a server on fire [18:18:34] tempting ;) [18:18:51] /usr/local? 
eeewwwww [18:19:00] New patchset: Ryan Lane; "Switch db and ldap host for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16150 [18:19:14] paravoid: where would you put it.. /wikisoftware? [18:19:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16150 [18:19:43] <^demon> oh man, same datacenter is soooo nice. [18:19:47] <^demon> this is so much faster. [18:19:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16150 [18:19:58] yeah it is [18:20:03] better hw too [18:20:12] * Reedy hands ^demon a large black permanent marker [18:20:31] binasher: why do you need coexistence with mysql packages? [18:20:43] <^demon> binasher, Ryan_Lane: i love you. [18:20:50] we all do [18:20:52] ^demon: it's so much faster, right? [18:21:01] <^demon> omg, like night and day. [18:21:07] +1 [18:21:30] <^demon> Hell, groups page loaded in like 1s. [18:21:39] paravoid: because i want to be able to install distro packages for certain things [18:21:57] like libdbd-mysql-perl [18:22:34] and their dependency tree leads to things compiled against libmysqlclient18 from mysql 5.5 which is what's in precise [18:22:58] which is all fine to have side by side with something entirely different [18:23:16] hey guyyyys [18:23:28] i'm getting ready to set up some iptables rules on the analytics cluster [18:23:30] i'm starting to hate using debs for anything intimately tied to the application stack [18:23:36] would love a quick review of them if someone has a sec [18:23:55] i'm an iptables amateur, so pro opinions are helpful :) [18:24:16] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [18:24:30] binasher: libraries have sonames, so coexistence should be fine [18:24:35] php / memcached / mysql / varnish / nginx / squid..
whatever else we want to modify, i'd rather use a [18:24:46] paravoid: indeed they do [18:24:58] without /usr/local/lib I mean :) [18:25:09] <^demon> binasher: Now that we have logging on the gerrit db...if we find queries that suck or just need better indexing, I wanna share that upstream. [18:25:16] its not like mysql-common only installs a shared library and mysql-common-5.1 and mysql-common-5.5 coexist [18:25:47] ^demon: even the group list loads fast :D [18:26:07] <^demon> I know. Now I almost forgive him for the ungodly number of queries that page does. [18:26:09] ah. you already noticed that :D [18:26:13] ^demon: Isn't sucky queries sorta known upstream? [18:26:17] binasher: you'd rather use a …? [18:26:35] <^demon> Damianz: Yeah, but "this query sucks" or "add this index" is more useful than "your queries all suck." [18:26:43] True [18:26:50] ..use tarballs or rsync'd /usr/local containing everything we build, always kept totally separate from distro stuff [18:27:03] No query refactor binging involved :D [18:28:22] <^demon> Ryan_Lane: The number of queries on that page on a cold cache hit is roughly (4 * number of groups) iirc. [18:44:57] Change abandoned: Asher; "not yet, db10 is still in use" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13881 [18:45:23] ^demon: wow. that's absurd [18:46:42] hey binasher if you have a sec sometime in the next hour or so, can I ask you to look over a few iptables rules for me? They should be pretty simple, i just want to make sure I don't lock myself out [18:46:44] <3 orms! [18:46:57] ottomata: where are they? [18:47:27] https://gist.github.com/3152516 [18:47:51] analytics1001 has a public IP [18:48:12] i have an http auth proxy that will allow me to access the HTTP stuff I need [18:48:32] aside from that, i want to block everything else [18:48:38] external [18:48:57] and ssh (which you did already) [18:49:03] yeah [18:49:04] and ping! 
[18:51:56] New patchset: Asher; "adding db28 to decom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16155 [18:52:15] ottomata: i think that looks ok [18:52:30] binasher: what does this mean? master-master? < binasher> !log db48 is now replicating from db1048 (and db1048 from db48) [18:52:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16155 [18:52:40] ok cool, gonna try it on an01 then, thanks [18:52:57] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16155 [18:55:21] jeremyb: db48 is the otrs master, db49 and db1048 slave from it. db1048 is the gerrit master, db1046 and db48 slave from it. [18:55:44] ohhh [18:56:55] binasher: is there an OTRS replica that we can take for schema migration tests? (stop slave, dump OTRS, reimport under new DB name, run test on new DB name, give it back to you) [18:57:11] !log powering on owa1/owa2 again, they're being used as ganglia aggregators for swift [18:57:15] binasher: (or you can reslave from scratch at the end I guess) [18:57:18] Logged the message, Master [18:57:46] \\ [18:58:42] // [18:58:49] || [18:59:58] binasher: anyway, whatchya think? or else maybe slave a new replica on a spare misc box and run tests there (and then make it a spare again when done) [19:00:56] jeremyb: it looks like otrs has a few myisam tables but they're all empty apart from faq_log and link_relation [19:01:10] hrmmm [19:01:14] i think we barely use FAQs [19:01:16] i could convert them to innodb and then you could mysqldump it without messing with replication [19:01:47] ok, so then we just find any spare box? [19:02:05] link_relation maybe is about merged tickets? [19:02:40] what sort of changes do you want to test?
[19:02:55] binasher: OTRS software upgrade [19:03:07] idk how invasive it is [19:03:47] ah, that seems worth testing [19:04:19] it might be slow and clunky but could you use labs, provided you have a dump of the db? [19:04:34] i think labs is out per privacy policy or something [19:04:42] binasher: it's about 350GB on disk, and requires an NDA to handle [19:05:30] sounds like otrs needs a dev server [19:05:39] y [19:07:44] * jeremyb runs away for a bit [19:08:33] RECOVERY - Puppet freshness on db29 is OK: puppet ran at Fri Jul 20 19:08:22 UTC 2012 [19:09:54] RECOVERY - MySQL disk space on db29 is OK: DISK OK [19:13:21] PROBLEM - NTP on analytics1001 is CRITICAL: NTP CRITICAL: No response from NTP server [19:30:45] ottomata: ping [19:31:40] poooooooooooong [19:31:50] ottomata: nagios-wm ^^ [19:32:19] ntp ahhhh [19:32:22] hmmm [19:32:23] ok [19:32:24] danke [19:32:42] binasher, do you know what networks I need to allow? [19:32:46] for ntp, puppet, whatever else? [19:33:33] i just allowed any incoming from our analytics cluster network [19:34:04] i should also allow nrpe/nagios servers, puppet server, ntp server, whatever else, eh? [19:34:06] Change abandoned: Demon; "Was reverted upstream, need to find a better fix." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13325 [19:34:31] (btw, thanks for noticing that jeremyb) [19:34:59] binasher: db63, 65-77 have been provisioned for you....db64 has network issue [19:35:34] ottomata: http://pastebin.com/Usa0i9Cx [19:35:41] sure ;) [19:36:03] great, danke [19:38:15] RECOVERY - NTP on analytics1001 is OK: NTP OK: Offset 0.0009073019028 secs [19:47:49] robhalsell: do you want to take rt 3213? 
[19:48:24] yea there are some things we need to do
[19:48:32] and i need to document them
[19:48:43] since these arent decoms, but reclaims
[19:58:03] robhalsell: okay
[20:08:20] New patchset: Cmjohnson; "adding es5-8 to the dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16184
[20:08:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16184
[20:14:12] robh: plz merge my change.
[20:14:46] reviewing now
[20:15:19] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16184
[20:16:34] cmjohnson1: its live on cluster, you need to login as root on brewster and run puppetd --test
[20:16:41] for force the puppet run so you dont have to wait for it
[20:16:47] your key should have also populated out by now
[20:17:11] yep, i just checked, its on brewster
[20:17:25] so you should be good cluster-wide now
[20:57:56] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[21:01:00] New patchset: J; "job runner now supports being run on a specific job type" [operations/debs/wikimedia-job-runner] (master) - https://gerrit.wikimedia.org/r/11610
[21:02:35] New patchset: Pyoungmeister; "misc::maintenance::pagetriage : added this class to hume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16188
[21:03:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16188
[21:06:21] New patchset: Pyoungmeister; "misc::maintenance::pagetriage : added this class to hume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16188
[21:06:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16188
[21:21:56] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[21:40:18] New patchset: Pyoungmeister; "setting ms-fe1 and ms-fe2 as ganglia agregators for pmtpa swift group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16192
[21:40:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16192
[21:41:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16192
[21:47:47] New patchset: Pyoungmeister; "misc::maintenance::pagetriage : added this class to hume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16188
[21:48:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16188
[22:47:46] I seem to recall the thumb-handler.php being up to date..
[22:48:14] though, core includes a thumbhandler
[23:06:11] New review: Pyoungmeister; "got some review by asher on the script itself. looks good!" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/16188
[23:06:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16188
[23:10:44] New patchset: Asher; "run lvm snapshots on db49 and db1046 (otrs/gerrit)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16197
[23:11:21] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16197
[23:26:30] RECOVERY - MySQL Replication Heartbeat on db48 is OK: OK replication delay 0 seconds
[23:44:38] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[23:47:47] RECOVERY - MySQL Replication Heartbeat on db1048 is OK: OK replication delay 0 seconds
[23:55:44] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours
[23:57:32] PROBLEM - MySQL Replication Heartbeat on db1048 is CRITICAL: NRPE: Unable to read output
[23:58:44] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[23:58:53] RECOVERY - MySQL Replication Heartbeat on db1048 is OK: OK replication delay 0 seconds