[00:07:44] * robla checks gerrit's pulse [00:08:34] ...and it's dead [00:10:00] any opsen around with a defibrillator? [00:12:01] anyone? anyone? [00:12:19] * robla thumps microphone [00:12:22] is this thing on? [00:12:54] * TimStarling will look [00:13:30] thanks! [00:14:15] it looks busy [00:15:05] it's faking it....I know it's just playing Minesweeper [00:17:12] ohai! [00:17:18] * maplebed just got paged. [00:17:21] sorry, wasn't watching IRC. [00:17:43] there's a request going that's asking for a tarball, maybe that's it [00:18:11] just saw a gerrit page ? [00:18:18] yeah, TimStarling's on it so far. [00:18:28] I just joined the party. [00:18:36] ok cool [00:18:45] whee! http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=manganese.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [00:18:49] it's pretty quiet now, is it working again? [00:19:07] it's working again for me. [00:19:14] TimStarling: did you kill something? [00:19:18] or did it finish on its own? [00:19:27] it finished on its own [00:19:32] looks better now [00:19:50] tarball creation, huh? [00:19:51] I attached to a couple of apache processes with gdb [00:20:13] * robla wonders if that's a gitweb thing maybe [00:20:17] I only had time for two before it started working again [00:20:34] one was /r/ and one was /r/gitweb?p=mediawiki/core.git;a=snapshot;h=7a4db900deaf235bd729cd0557b2d2a7ef3d9a40;sf=tgz [00:20:55] maybe there is access log information [00:22:02] * robla wonders if anyone dare try that again [00:22:12] links are here: https://gerrit.wikimedia.org/r/gitweb?p=mediawiki%2Fcore.git;a=shortlog;h=HEAD [00:23:03] there's something called YandexBot crawling it apparently [00:23:39] New patchset: Bhartshorne; "bumping object replication concurrency to 2 to decrease the time necessary to get ms-be5 fully loaded."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/13100 [00:24:19] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13100 [00:24:19] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13100 [00:24:22] ah, but it's 80legs that's hitting the tarball links [00:24:47] funny that the crawlers can find the gitweb links even though our developers can't [00:26:25] <^demon> TimStarling: So gerrit got overloaded due to a spider? [00:27:05] there are several crawlers hitting gitweb [00:27:14] probably one or a combination of them caused the overload [00:27:47] <^demon> Should be pretty trivial to slap a robots.txt on manganese. [00:28:36] dear robots, please don't simultaneously request tarballs for every revision of every repo simultaneously. kthxbai [00:28:44] Boring [00:29:27] pity there's no timing in the access log [00:29:31] anyone mind if I add it? [00:29:52] I say go for it [00:30:04] we don't have any scripts processing these logs do we? [00:30:27] * robla isn't aware of any [00:30:36] ^demon: ? [00:30:46] <^demon> Nope. [00:31:23] it'll just take half an hour [00:31:37] 1 minute for the config change and 29 minutes to get it into puppet [00:35:19] New patchset: Demon; "Stop robots from trying to index gerrit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13103 [00:35:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13103 [00:37:11] oh gerrit [00:37:19] <^demon> TimStarling: I've already got the puppet repo open in front of me, what's the change? [00:37:53] someone get the paddles again [00:38:50] <^demon> Granted, this probably isn't gerrit's fault anyway, gitweb is installed via the package, it's not bundled or anything. 
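Both fixes discussed here land later in the log as changes 13103 (robots.txt) and 13107 (access-log timing). Their exact contents are not shown in the log, but a minimal sketch using standard Apache mod_log_config directives might look like:

```
# /robots.txt served for gerrit -- keep crawlers off gitweb entirely
# (sketch; the actual change is https://gerrit.wikimedia.org/r/13103)
User-agent: *
Disallow: /

# Apache access log with elapsed time appended (sketch of r/13107):
# %T = seconds taken to serve the request, %D = microseconds
LogFormat "%h %l %u %t \"%r\" %>s %b %T/%Dus" timed
CustomLog /var/log/apache2/gerrit.access.log timed
```

With the duration in the log, 146- and 294-second gitweb requests like the ones Tim saw below would stand out immediately.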
[00:38:57] maybe I was exaggerating slightly [00:39:10] I had to spend a bit of time reading the apache manual [00:41:40] and it's still not working [00:42:15] ah, no, there we go, actually the requests were just taking so long that they started before I made the config change [00:42:53] like 146 seconds [00:43:59] or 294 seconds [00:48:15] it looks like only one thing can run git at a time [00:49:14] so it's blocking and waiting? [00:50:03] <^demon> Should be more than one, but we could up the number of processes available to jgit. [00:50:22] <^demon> I think the default is something like ~4 simultaneous clones [00:50:25] if it's gitweb, it's not using jgit, right? [00:50:43] no, no point in increasing gitweb processes [00:51:29] there's 20 already and they're just sitting around doing nothing [00:54:39] actually most of the gitweb seem to be waiting for a write to stdout to complete [00:54:55] waiting for java [00:55:24] <^demon> Heh, silly gerrit docs. "Existing installations have successfully processed change reviews with more than 16,000 files per change. However, since 16,000 modified/new files is a massive amount of code to review, it is more typical to see less than 10 files modified in any single change." [00:58:59] <^demon> If gitweb's performance sucks, we could look at using cgit. Gerrit has integration for that out of the box as well. [00:59:33] seems more like a deadlock [00:59:49] the classic sort of deadlock where you have to read from both stdout and stderr [01:00:35] http://paste.tstarling.com/p/MDxjvJ.html [01:00:47] see, this gerrit thread is reading from gitweb's stderr [01:00:57] but the gitweb is writing to stdout (FD=1) [01:02:19] !log on manganese: killing all gitweb.cgi processes [01:02:25] Logged the message, Master [01:02:51] <^demon> TimStarling: So is this something to be fixed in gerrit? 
[01:03:20] yes [01:03:36] maybe gerrit allows 20 subprocesses [01:04:08] eventually say 17 get used up with hung gitweb processes and we start to wonder why it is slow [01:04:24] I will see if I can find it in the gerrit source [01:04:42] <^demon> So we could up the number of gerrit processes, but it would just prolong the time before they all get tied up? [01:06:24] copyStderrToLog(proc.getErrorStream()); [01:06:24] if (0 < req.getContentLength()) { [01:06:25] copyContentToCGI(req, proc.getOutputStream()); [01:06:25] } else { [01:06:25] proc.getOutputStream().close(); [01:06:25] } [01:06:36] that is the bug [01:07:35] hmm, ok maybe not [01:07:54] it actually starts a new thread to read stderr [01:10:40] I regret killing those gitweb processes now [01:10:45] they were definitely hung [01:11:44] the bot requests are much faster now [01:22:52] I'll restart gerrit [01:23:16] !log on manganese: restarting gerrit [01:23:22] Logged the message, Master [01:41:08] <^demon> gerrit's error log is huge from today (800M and counting). I should look at some kind of monitoring for those logs...having the same error spammed 110149999 times should've been a hint that gerrit wasn't feeling well. [01:41:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 235 seconds [01:43:03] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 268 seconds [01:48:34] <^demon> TimStarling: I'm calling it a night as long as the immediate fire is out. Is there anything re:gerrit you'd like me to look into tomorrow? Or perhaps could you send an e-mail summarizing anything you find? [01:49:30] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 656s [01:49:32] I'll send you an email, good night [01:49:55] <^demon> Thanks. 
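The hang Tim diagnosed above is the classic two-pipe deadlock: the parent was draining the child's stderr while the child sat blocked writing stdout, whose pipe buffer (typically ~64 KiB) nobody was emptying. A minimal Python illustration of the safe pattern, using a stand-in child process rather than gerrit's actual Java code:

```python
import subprocess

# Child writes far more than a pipe buffer to BOTH stdout and stderr.
# Reading one stream to EOF before touching the other can deadlock;
# communicate() drains both concurrently, so it cannot.
child = subprocess.Popen(
    ["python3", "-c",
     "import sys\n"
     "blob = 'x' * 200000\n"      # well past the pipe buffer size
     "sys.stdout.write(blob)\n"
     "sys.stderr.write(blob)\n"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True,
)
out, err = child.communicate()    # reads stdout and stderr together
print(len(out), len(err))
```

Gerrit's snippet above does start a stderr-copying thread, which is why Tim backs off from his first diagnosis; the stuck processes he found were nonetheless blocked writing to FD 1.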
[01:50:24] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [01:52:03] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 16 seconds [01:52:30] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 43s [01:55:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:55:29] New review: Reedy; "Needs adding to wmf-config/extension-list too, so the messages get into the localisations cache with..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/13099 [01:57:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.057 seconds [02:18:48] TimStarling: Did you get anywhere regarding ExtensionDistributor and REL1_19 / Git ? [02:21:41] yes I talked with sam about it a couple of hours ago, it's cloning now [02:21:55] with REL1_19 from subversion and master from git [02:22:19] nice [02:23:02] TimStarling: What about extensions that are only in svn or only in git? Not a problem, just curious how it is handled. [02:23:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [02:25:26] New patchset: Jalexander; "Add WikimediaShopLink settings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13099 [02:26:08] probably if they're only in svn they'll be broken [02:26:15] I'd have to check [02:26:42] so you use clone-all.php and then the svn dir name as dir in that clone?
[02:27:41] I used the extensions.git submodules [02:27:44] New patchset: Tim Starling; "Log time elapsed in gerrit access logs for incident analysis" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13107 [02:28:25] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13107 [02:28:25] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13107 [02:31:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.054 seconds [03:29:43] New patchset: Krinkle; "Add WikimediaShopLink settings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13099 [03:31:07] New patchset: Krinkle; "Add WikimediaShopLink settings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13099 [03:31:58] New review: Krinkle; "Spotted several issues in the extension (mostly small issues). Submitted Ifaf0e574f736264b3d27d376b6..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/13099 [03:57:53] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13103 [03:57:55] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13103 [04:35:15] New review: Jalexander; "(no comment)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/13099 [04:36:25] New review: Jalexander; "dependency on extension merged in as well." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/13099 [04:43:51] mornin [04:44:03] what the hell happened [04:44:12] I only slept for like 5 hours [04:44:25] and all hell broke loose apparently [04:58:07] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [04:59:11] huh I guess I got 5.5 hours [04:59:23] looking at my email now [05:01:22] what you said about bad luck yesterday. surely we should be due for some good luck soon? :-/ [05:02:10] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [06:25:03] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:30:54] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:21:09] good morning [07:22:03] New review: Faidon; "First of all, a minor nitpick: the file won't have a final newline, which will make it a bit unpleas..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/12377 [07:22:13] good morning :) [07:22:44] oh the newline :-D [07:23:03] I did some housework in my gerrit change list, so you might have received review requests from me [07:23:36] you did? [07:24:15] I had several draft changes I abandoned and some that did not have a reviewer [07:27:54] paravoid: for the $realm, my commit message is misleading. I am indeed expecting /etc/wikimedia-realm to let us know which puppet $::realm we are running under [07:29:36] New patchset: Hashar; "/etc/wikimedia-realm containing $::realm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12377 [07:29:53] * hashar find out how to add a newline [07:30:09] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12377 [07:33:32] New patchset: Hashar; "/etc/wikimedia-realm containing $::realm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12377 [07:34:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12377 [07:35:53] New review: Hashar; "I am assuming realm to be the one from puppet. Aka either 'labs' or 'production'. My comment about '..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/12377 [07:43:26] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12377 [07:43:29] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12377 [07:45:07] paravoid: is puppet labs running from the production branch nowadays or should I cherry-pick that change to the test branch ? [07:50:55] Should be running from prod [07:51:28] seems so :-D [07:51:39] I still have to read the mail about puppetmaster::self [07:55:46] Reedy: ping? [07:56:17] paravoid: he is usually not there during the morning (idle time: 6:45) [07:58:02] PROBLEM - Puppet freshness on mw56 is CRITICAL: Puppet has not run in the last 10 hours [07:58:40] New review: Hashar; "Do not submit this yet. Need to arrange $cluster / $realm in our Mediawiki config files." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/12583 [08:01:43] sure, nothing urgent [08:02:07] maybe you can help too: I'm wondering which extensions I should push to Debian [08:02:11] php5-parsekit is one candidate [08:02:36] but I should probably join the PHP packaging team and it'd be nice to have a list of things to do beforehand [08:03:25] hashar: hi, i'll be right there for jenkins [08:03:43] paravoid: I am not sure how many custom extensions we use.
[08:04:04] paravoid: I know we have a wikidiff extension, which is a C implementation of mediawiki PHP differ. But it is most probably already upstream [08:04:22] mutante: take your time :-] [08:04:35] it is [08:07:24] paravoid/hashar: yesterday wondering about "mediawiki-math" which already is a Debian package and appears to be used on PDF servers, but "This is a transitional package and can safely be removed. " [08:08:28] http://packages.debian.org/sid/mediawiki-math [08:08:58] mutante: that's because it's replaced by mediawiki-extensions-math [08:09:31] http://qa.debian.org/developer.php?login=pkg-mediawiki-devel%40lists.alioth.debian.org [08:10:34] looking at that qa page now.. i guess their mailing list and our mailing list for mediawiki package maintainers should exchange messages or something [08:12:29] New patchset: Hashar; "detect cluster with /etc/wikimedia-realm" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12583 [08:13:21] paravoid: pkg-mediawiki-devel@lists.alioth.debian.org <-> mediawiki-distributors@lists.wm ? you think we should invite list members vice versa? [08:14:04] New review: Hashar; "The loaded value from /etc/wikimedia-cluster was not at all what we expected for $cluster. Thanks to..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/12583 [08:17:16] mutante: I think they're talking already [08:17:38] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/12583 [08:17:59] mutante: getting a coffee and I am ready whenever you are :-] [08:21:09] hashar: ok. soo. first question: you really still want 1.470? 
i see 1.472 [08:21:32] mutante: seen that yesterday [08:22:15] mutante: yeah just get 1.472 [08:22:22] hmm, release days are 06/13, 06/18, 06/24 [08:22:30] jenkins is doing contint, releasing a new version every week or so [08:22:42] just like we tag a new MediaWiki wmf branch every 2 weeks [08:22:57] k, i see from the dates, lots of versions yea [08:23:17] I already reviewed the past changelogs [08:23:34] there is some bug fix I am interested in, would most probably fix issues I have with our inst [08:23:44] so might as well get the 1.471 and 1.472 [08:24:16] mutante: can you also have a look at https://wikitech.wikimedia.org/view/Jenkins while doing change ? [08:24:16] wget http://pkg.jenkins-ci.org/debian/binary/jenkins_1.472_all.deb [08:24:20] we might want to update our doc [08:24:35] i got that open, yea [08:24:42] wooster told me about having Diederik trained on doing Jenkins upgrade [08:24:53] (will be for next time though :pD ) [08:25:04] being, lazy, wget direct to brewster [08:25:47] yea, uhm, it needs brewster access [08:25:59] which you dont? [08:26:00] ? [08:26:10] bah lame question [08:26:14] i do, but about training others [08:26:23] ah you are right [08:26:44] !log importing jenkins_1.472_all.deb into lucid-wikimedia using reprepro [08:28:02] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [08:28:11] updating package lists on gallium [08:28:21] hashar: shall i upgrade on gallium using apt? [08:28:26] it offers it now [08:28:44] according to our doc apt-get update && apt-get upgrade jenkins [08:28:49] Inst jenkins [1.458] (1.472 Wikimedia:10.04/lucid-wikimedia) [08:29:07] yeah, did the update, just asking if you want me to right the second [08:29:18] right ? [08:29:20] delete ? 
[08:29:25] yeah just upgrade :) [08:29:55] !log apt-get upgrade on gallium, installs newer jenkins [08:30:00] puppet just have ensure => present [08:30:10] so we don't have it upgraded "by mistake" [08:30:27] hmm, well, you could change to "latest" since we control when there are new packages [08:30:39] but yea, it works just fine like this too [08:31:07] and you can separate the steps import to repo and upgrade on server [08:31:30] "Please wait while Jenkins is getting ready to work..." [08:31:34] + I need to be there to check that Jenkins still work [08:31:53] i see that message on integration.mw/ci/ now [08:32:02] it is loading :) [08:32:32] there it is, yay [08:32:33] INFO: Jenkins is fully up and running [08:33:12] now I am going to run a test [08:33:16] hashar: so that wasn't that much to do after all:) a) wget .. b) reprepro .. c) apt-get upgrade . done [08:33:24] kk, please test [08:34:22] https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/2532/ testing in progress :-] [08:36:17] mutante: works for me :-] [08:36:29] :) [08:36:39] mutante: so that is a success for me as far as I can tell [08:36:53] though we might have some surprise later on. But I guess I can fix them myself [08:37:01] cool! [08:37:13] I love when maintenance run well [08:37:25] hashar: console output looks nicer :) [08:37:33] heh,yeah, just because it breaks sometimes, doesnt mean it has to all the time :) [08:37:34] its indented now, nice [08:37:43] you are right [08:37:51] looks like they replace \t with some   or something [08:38:16] so you can close the RT ticket [08:38:23] will you be there this afternoon? [08:38:30] in case something is screwed ? [08:38:49] are you talking to me? 
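The three-step upgrade mutante summarizes below ("a) wget .. b) reprepro .. c) apt-get upgrade") can be written out as a short runbook. A sketch only: the hostnames and the lucid-wikimedia distribution come from the log above, while the exact reprepro invocation is an assumption (the log only says the .deb was imported "using reprepro"):

```
# 1) on the repo host (brewster): fetch the upstream package
wget http://pkg.jenkins-ci.org/debian/binary/jenkins_1.472_all.deb

# 2) import it into the lucid-wikimedia distribution with reprepro
reprepro includedeb lucid-wikimedia jenkins_1.472_all.deb

# 3) on the jenkins host (gallium): upgrade just this package
apt-get update
apt-get install jenkins     # pulls 1.458 -> 1.472 from the local repo
```

Note that puppet's `ensure => present` only installs the package when it is absent, so the upgrade stays a deliberate manual step; `ensure => latest` would make puppet upgrade whenever a newer build lands in the repo, which is the trade-off discussed above.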
[08:39:50] hashar: RT resolved, pasted the important lines from IRC [08:39:57] thought so :) [08:40:26] argh sorry [08:40:35] your nicks use the same color [08:40:41] lol [08:40:42] so I thought Krinkle was mutante :-] [08:40:50] hashar: yea, ehm, afternoon, at some point i need to do some errands but i can be a bit flexible about it [08:40:55] I don't mutate ;-) [08:41:18] colors? i dont think i have IRC colors [08:41:44] I am pretty sure your cli client could colorize nicknames [08:42:13] yea, guess so, but i never picked any color for me, guess you can pick a color for me in your client [08:42:34] Colloquy only distinguishes between "you" and "the rest" (red/orange) [08:43:31] hashar: Their anchor links are borked though [08:43:32] https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/2532/console#ant-target-4 [08:43:37] ^ goes to "clean" [08:43:44] but you see "create-dirs" [08:43:48] because of that hover toolbar [08:44:07] ohnice, they even right-align the labels (exec vs mkdir) so that the response is left-aligned [08:44:16] someone's been busy [08:44:30] the wonders of doing contint? heh [08:44:59] my client does some kind of hashing of the nickname then selects a color out of a few possibilities [08:45:25] mutante: http://scripts.irssi.org/html/nickcolor.pl.html [08:45:39] that one does add ord() of each character of the nick [08:45:50] then does a modulo 11 to pick a color :-] [08:46:05] yeah, when you talk to me directly, your nick turns yellow, it stays gray if you don't use a nick: prefix [08:46:40] hehe, ok, i might check it out [08:46:47] hashar: Any idea what could cause this error on a fresh* labs instance in a project I own? *fresh= created yesterday, everything is done and set up "running" now, including in ganglia [08:46:48] hashar: > krinkle is not allowed to run sudo on i-000002eb. This incident will be reported. [08:47:01] Why can't I sudo? Without it I can't even run puppet or whatever.
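The nickcolor scheme hashar describes above, summing ord() of every character and taking it modulo 11 to pick a color, is easy to sketch. A toy reimplementation (the color names here are made up, not irssi's actual palette):

```python
# Toy version of the nickcolor.pl hashing hashar describes:
# add up ord() of every character, pick a color by modulo 11.
COLORS = ["red", "green", "yellow", "blue", "magenta",
          "cyan", "white", "orange", "grey", "pink", "teal"]

def nick_color(nick):
    return COLORS[sum(ord(c) for c in nick) % len(COLORS)]

# The same nick always hashes to the same color, and with only 11
# buckets two different nicks can easily collide -- which is how
# "Krinkle" and "mutante" ended up looking alike in hashar's client.
print(nick_color("mutante"), nick_color("Krinkle"))
```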
[08:47:11] probably forgot to add you as a sysadmin [08:47:11] https://labsconsole.wikimedia.org/wiki/Nova_Resource:I-000002eb [08:47:27] hmm, i heard one other report about not being able to sudo anymore [08:47:30] lets switch to #wikimedia-labs [08:47:36] but couldnt reproduce on instance i use [08:47:40] yep [08:48:32] mutante: again thanks for the Jenkins upgrade :-] [08:49:14] yw! glad it worked smoothly. wouldnt really know what needs to be fixed in docs [08:50:28] i mean in a good way, they already said what i did [08:51:34] I think I wrote that doc based on your input [08:51:37] and you even reviewed it [08:51:47] indeed : https://wikitech.wikimedia.org/history/Jenkins [08:51:48] ;) [08:52:02] so you are merely congratulating yourself for writing your own doc :-] [08:52:08] * hashar loves doc [08:52:53] heh. ok [09:23:40] New patchset: Dereckson; "(bug 37981) Babel categories configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13121 [09:29:58] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [09:33:07] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:36:27] !log starting swift-container-auditor on ms-be3 [09:37:28] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:57:40] morebots is dead [10:04:09] he's known as lessbots [10:06:22] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:07:43] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:18:49] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor 
[10:23:10] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:38:44] Damianz: should be $(PAGE)bots [10:38:51] so we can use our favorite pager [10:39:04] less is more ;) [10:43:45] New patchset: Hashar; "subscribe memcached service to /etc/memcached.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13129 [10:44:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13129 [11:08:25] paravoid: hi [11:08:50] Reedy: hi :-) [11:11:15] hiiii [11:13:40] hm [11:13:47] how do you push a fast forward merge into gerrit... [11:13:58] it just says "no new changes" [11:18:55] I dunno, it's always worked for me [11:19:22] i can push directly, but that's not how it should be :) [11:19:35] why a no-ff? [11:19:47] so that all the changes come in as separate commits? [11:19:48] a what? [11:19:55] what do you mean? [11:19:56] oh [11:20:01] nevermind [11:20:13] it's a straight ff, all it needs to do is update the ref :) [11:20:36] but that means there's no merge commit on my end [11:20:49] right [11:20:50] no clue [11:21:04] I guess I should use --no-ff [11:21:20] are all the virtN nodes configured with the same profile? [11:21:26] even the new ones? [11:21:28] same profile? [11:21:30] i.e. do they get the VLAN? [11:21:41] I'd hope they do [11:21:55] mark: how about that router access? :) [11:22:27] http://code.google.com/p/gerrit/issues/detail?id=1145 [11:22:33] right [11:26:11] mark: yep. that kind of sucks [11:28:21] mark: turned out to be a "not cabled properly" issue [11:28:26] so, no access needed for that [11:39:03] hm [11:39:12] instances with public IPs can't talk to the outside world [11:39:57] at the moment? [11:40:00] yes [11:40:11] I can, however, ping en.wikipedia.org [11:40:53] have you changed anything?
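The "no new changes" behaviour above (gerrit issue 1145) comes down to what a fast-forward is: it creates no commit object, only a ref update, so there is nothing for Gerrit to receive as a change. `git merge --no-ff` forces a merge commit even when a fast-forward is possible. A minimal local demonstration in a throwaway repo, with hypothetical branch and file names:

```shell
set -e
rm -rf ffdemo && git init -q ffdemo && cd ffdemo
git config user.email dev@example.org
git config user.name dev
echo base > file && git add file && git commit -qm base
git checkout -qb feature
echo more >> file && git commit -qam feature-work
git checkout -q -                     # back to the original branch
# a plain `git merge feature` here would fast-forward: no new commit.
git merge --no-ff -q -m "merge feature" feature
# a merge commit has two parent lines in its commit object:
echo "merge parents: $(git cat-file -p HEAD | grep -c '^parent ')"
cd ..
```

Pushing the `--no-ff` result gives Gerrit an actual merge commit to review instead of a bare ref update.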
[11:41:12] with the exception of the stuff we did yesterday, no [11:41:51] what did you do yesterday? [11:42:16] basically nothing [11:42:25] I brought virt1 up as a network node [11:42:30] which moved two IP addresses [11:42:38] and the snat/dnat rules [11:42:48] when I took it back down, it moved them back [11:42:59] Ryan_Lane: my wikistats instance has a public IP and can also still fetch data with a shell script from external mediawikis [11:43:00] here's the snat rule for bastion-restricted: SNAT all -- 10.4.0.85 0.0.0.0/0 to:208.80.153.232 [11:43:12] I wonder if it's just instances on virt1 [11:43:18] lemme look at virt1 and virt3's rules [11:43:39] Ryan_Lane: well - you know the network is just setup for virt2 right now, right? [11:43:46] yes [11:43:51] so [11:44:03] nothing can use virt1 as a network node right now [11:44:07] right [11:44:10] everything moved back [11:44:24] I think I see the issue [11:45:50] fixed [11:45:58] virt1 had a bad SNAT rule [11:47:09] New patchset: Mark Bergsma; "Merge branch 'mp-bgp'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/13131 [11:47:34] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13131 [11:47:36] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/13131 [11:49:22] \o/ [11:49:33] gonna deploy that on one lvs machine in a bit [11:49:44] then when it works well for a day or two, on the other ipv6 boxes [11:49:52] then when that remains working well for a week, we can finally upgrade the rest ;) [11:50:28] conveniently, lvs1004 serves no traffic atm [11:50:33] which is perfect for pybal testing [11:50:42] as pybal behaves exactly the same, traffic on it or not [11:51:04] Can someone kick morebots?
Thanks [11:51:46] lvs1005 I mean [11:53:28] 208.80.153.192 64666 2647 2708 0 296 22:02:57 Establ [11:53:28] inet.0: 0/0/0/0 [11:53:28] inet6.0: 1/1/1/0 [11:53:35] 22 hour uptime, good [11:53:49] hehe [11:53:58] paravoid: you used some 32 bit ASNs in the default pybal config file [11:54:10] and bgp.py didn't handle that well ;) [11:54:55] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13099 [11:54:58] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13099 [11:55:03] #bgp-local-asn = 65551 [11:55:04] #bgp-peer-address = 192.0.2.254 [11:55:04] #bgp-as-path = 65551 65552 [11:55:12] was that intentional? :) [11:55:33] that's not 32 bit [11:55:43] well it's not 16 bit either :P [11:57:17] ah, my bad [11:57:26] hehe I didn't see it during review either [11:57:26] that's ASNs from rfc5398 [11:57:34] and it took me a minute to realize as well [11:57:42] when struct.encode was throwing errors ;) [11:57:43] IANA has reserved a contiguous block of 16 Autonomous System numbers from the unallocated number range within the "16-bit" number set for documentation purposes, namely 64496 - 64511, and a contiguous block of 16 Autonomous System numbers from the "32-bit" number set for documentation, namely 65536 - 65551. [11:57:55] ok [11:57:59] i'll change that to 16 bit for now ;) [11:58:02] yep [11:58:11] i'll add 32 bit asn support soon, shouldn't be hard [11:58:17] that's why I pasted the 16-bit ASN for documentation [11:58:18] but I need to get back to other things now ;) [11:58:30] I can commit/push that if you prefer. [11:58:35] i'm already on it [11:59:21] !log kicked morebots [11:59:27] Logged the message, Master [11:59:33] gah, nano being the default editor in labs :P [11:59:35] what's up with that [11:59:52] The vms do run debian :P [11:59:59] no they don't [12:00:03] eh? nano? [12:00:09] for git [12:00:09] Well ubuntu... debianenough [12:00:10] ? 
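The encoding error mark mentions above is easy to reproduce. The classic BGP OPEN message carries "My Autonomous System" as an unsigned 16-bit field, and 65536-65551 are exactly the ASNs RFC 5398 reserves for documentation in the 32-bit range (64496-64511 being the 16-bit block), so packing 65551 overflows. A sketch of the failure mode, not pybal's actual bgp.py:

```python
import struct

# The 2-byte AS field of a classic BGP OPEN cannot hold ASNs above
# 65535; those need 4-octet-AS support (RFC 6793, the AS4 capability).
def pack_asn16(asn):
    return struct.pack("!H", asn)   # raises struct.error if asn > 65535

print(pack_asn16(64496).hex())      # 16-bit documentation ASN: fits
try:
    pack_asn16(65551)               # 32-bit documentation ASN: overflows
except struct.error as exc:
    print("cannot encode 65551:", exc)
```

Hence the fix discussed: switch the example config to the 16-bit documentation block until 32-bit ASN support is added.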
[12:00:14] yes [12:00:20] that's always the default for git [12:00:27] oh [12:00:28] the default for git is $EDITOR [12:00:37] that's what I would expect [12:00:40] it asks you the first time [12:00:44] and I thought we forced that to vim on our cluster [12:00:49] update-alternatives --config editor [12:01:02] yeah but I thought I already changed that once [12:01:33] hm [12:01:38] my pybal versioning scheme sucks I guess [12:01:42] it's 1.01 now [12:01:44] it's not quite 2.00 [12:01:46] but it's a big change [12:01:50] need an extra level ;) [12:02:29] 1.99? 2.0~rc1? [12:02:33] 2.0~beta1? [12:02:52] I guess I could do 2.0 [12:03:00] it's just gonna be a snapshot build now [12:05:05] 2.0 should have docs [12:05:12] so that we can upload it to Debian ;) [12:05:23] see, that's why I'm not doing that :P [12:05:30] it's just extra work from my perspective ;) [12:06:21] I heard paravoid loves writing docs ;) [12:07:59] I see someone who should be silent real quick :P [12:09:47] :P [12:10:10] Some of us are bored enough while eating lunch to annoy others in irc. [12:10:44] hm. maybe I should upgrade gluster today [12:10:51] I was supposed to do it yesterday [12:22:06] labs is so slow again [12:23:28] well, if we get virt6-8 networked properly, we can fix that [12:23:44] what's up with them? [12:23:51] eth1 isn't cabled [12:23:56] doh [12:23:59] ye [12:24:01] *yep [12:24:28] can't you use wireless!? [12:24:34] :D [12:24:54] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:24:55] I'd love to see gluster on wireless when dumps does an update :D [12:25:37] that'll get faster when we have multiple network nodes [12:26:27] we can also trunk multiple ports if that's a problem now [12:26:42] * Ryan_Lane nods [12:26:59] Mmm etherchannels are sexy [12:27:21] etherchannels are old crap [12:28:24] They are useful sometimes, mostly when you don't have a nice server that can handle bonded interfaces in a sane way. 
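For the default-editor grumble above: git resolves its editor through a chain of settings before falling back to the distro default, which on Debian/Ubuntu is the `editor` alternative (nano unless changed). A sketch of the usual fixes, assuming vim is installed:

```
# git picks the first of: GIT_EDITOR, core.editor, VISUAL, EDITOR,
# then the built-in default -- on Debian/Ubuntu the /usr/bin/editor
# alternative, which is typically nano.
git config --global core.editor vim                       # per-user
sudo update-alternatives --set editor /usr/bin/vim.basic  # system-wide
```

`update-alternatives --config editor`, as suggested in the log, does the same thing interactively.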
[12:30:42] New patchset: Faidon; "puppetmaster: fix ::self" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13137 [12:30:44] Ryan_Lane: review ^? [12:30:53] if it breaks, it'll break bad [12:31:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13137 [12:31:18] since it'll mess with stafford [12:35:50] kinda like walter white? [12:35:58] closedmouth: shut up [12:36:54] PROBLEM - Puppet freshness on mw1122 is CRITICAL: Puppet has not run in the last 10 hours [12:36:54] PROBLEM - Puppet freshness on mw37 is CRITICAL: Puppet has not run in the last 10 hours [12:38:01] damnit [12:38:09] facter doesn't list our "static" v4 mapped ipv6 address [12:38:27] do I really need to regen it from the ipv4 address now :P [12:45:57] New patchset: Mark Bergsma; "Add bgp-nexthop-{ipv4,ipv6} global variables to pybal.conf, as is now required" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13141 [12:46:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13141 [12:46:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13141 [12:46:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13141 [12:52:47] paravoid: lemme see [12:53:56] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/13137 [12:54:05] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13137 [12:54:07] paravoid: ^^ [12:54:25] New patchset: Mark Bergsma; "Attempt to fix the inline_template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13142 [12:54:37] thanks! 
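Regarding "regen it from the ipv4 address" above: a hypothetical sketch of one way a template could derive a v4-embedded IPv6 service address when facter does not report the statically configured one, by writing the four IPv4 octets into the low four groups of the host's /64. The prefix and addresses below are illustrative documentation ranges, not Wikimedia's actual addressing plan:

```python
import ipaddress

def v4_in_v6(prefix64, v4):
    # Embed each IPv4 octet as one 16-bit group in the host part of
    # the /64, e.g. 208.80.152.200 -> ...:d0:50:98:c8 (hypothetical
    # scheme for illustration only).
    octets = ipaddress.IPv4Address(v4).packed
    base = int(ipaddress.IPv6Network(prefix64).network_address)
    host = (octets[0] << 48) | (octets[1] << 32) | (octets[2] << 16) | octets[3]
    return ipaddress.IPv6Address(base | host)

print(v4_in_v6("2001:db8:1:1::/64", "208.80.152.200"))
```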
[12:54:53] hurry up gerrit [12:54:57] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13142 [12:54:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13142 [12:55:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13142 [12:55:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13142 [12:55:26] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13137 [12:56:36] mark: gerrit would hurry up if it had databases close to it :) [12:57:09] i'm not convinced :P [12:57:13] it's not ruby, but... :P [12:57:36] it was much more responsive when it was in the same datacenter as the databases [12:57:42] it has far fewer users, though too [12:57:51] *had [12:57:53] <^demon> We might need to play with cache settings again. [12:57:57] yeah [12:58:06] and I need to turn off friggin watchmouse at night [12:58:11] heh [12:58:13] stupid gerrit outage woke us up last night [12:58:15] and I couldn't care less [12:58:39] I was cursing demon at the time, although I believe it wasn't his fault this time ;) [12:58:57] fortunately I didn't hear a thing [12:59:07] <^demon> It totally wasn't my fault. I had no internet all day yesterday until about 15-20 minutes before the outage [12:59:08] <^demon> :) [12:59:12] heh [12:59:17] nah. we need a robots.txt [12:59:26] <^demon> Submitted, Tim merged. [12:59:31] ah. cool [12:59:37] <^demon> https://gerrit.wikimedia.org/r/#/c/13103/ [13:00:30] block all the bots [13:01:09] hm [13:01:17] we need to disable precise's ipv6 privacy extensions [13:01:22] no need for our servers to have privacy :P [13:02:01] heh [13:02:14] they totally need privacy. 
wouldn't want the world to know about them [13:02:36] people might start counting them [13:04:43] !log Started PyBal 1.02 snapshot build on lvs1005 [13:04:49] Logged the message, Master [13:04:52] so far so good [13:09:13] paravoid: puppetmaster::self still broken at the same place for me despite the merge of your fix https://gerrit.wikimedia.org/r/#/c/13137/ [13:10:12] New patchset: Mark Bergsma; "Allow BGP to be enabled for IPv6 services on IPv6-enabled LVS hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13143 [13:10:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13143 [13:12:18] New patchset: Mark Bergsma; "Allow BGP to be enabled for IPv6 services on IPv6-enabled LVS hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13143 [13:12:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13143 [13:12:59] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13143 [13:13:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13143 [13:19:52] hashar: sigh [13:20:16] ah, I know [13:20:28] paravoid: I did enable puppetmaster::self before the change. Maybe it has a copy of the puppet repo and need a manual fix ? [13:20:39] hm. i need food. I'm going to upgrade gluster when I get back [13:20:51] paravoid: feel free to test on deployment-cache-bits / i-00000264 [13:21:24] New patchset: Faidon; "puppetmaster: another fix for ::self" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13144 [13:21:55] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13144 [13:21:56] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13144 [13:22:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13144 [13:24:18] !log Added IPv6 LVS service IPs to the LVS_import policy on cr2-eqiad, for testing with lvs1005 [13:24:23] Logged the message, Master [13:26:47] how nice, we're missing 09087a7f133597f32f806dee72cda9d47f632377 from prod [13:26:51] wonder why [13:27:10] 2620:0:861:ed1a::b/128 [13:27:10] *[Static/100] 3w0d 19:06:47 [13:27:10] > to 2620:0:861:2:208:80:154:138 via ae2.1002 [13:27:10] [OSPF3/150] 3w0d 19:07:26, metric 0, tag 0 [13:27:10] > to fe80::5e5e:abff:fe3d:87c0 via ae0.0 [13:27:11] [BGP/170] 00:00:06, MED 10, localpref 100, from 208.80.154.138 [13:27:11] AS path: 64600 I [13:27:12] > to 2620:0:861:2:208:80:154:138 via ae2.1002 [13:28:08] and to answer your question from a few weeks ago paravoid... we're using MEDs for determining failover/active hosts in BGP ;) [13:32:43] :-) [13:33:52] i'm gonna let this run for a while [13:34:23] brb [13:35:11] New patchset: Demon; "If there's no comment, don't bother telling IRC about it." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13146 [13:35:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13146 [13:36:20] New patchset: Faidon; "Re-add ssh parameter to git::clone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13147 [13:36:35] I wonder what else we lost during the merge [13:36:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13147 [13:37:33] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13147 [13:37:35] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13147 [13:39:04] runs so far, yay [13:39:10] \O/ [13:40:10] mark: ospf? 
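For reference, the robots.txt ^demon submitted earlier (change 13103) to keep crawlers off gerrit only needs a couple of lines; this is a hypothetical bluntest-possible version, not the actual contents of that change:

```
# Hypothetical /robots.txt for gerrit -- block all crawlers from everything,
# including the expensive gitweb snapshot/tarball links.
User-agent: *
Disallow: /
```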
[13:43:36] New patchset: Demon; "Adding tracking ability to gerrit changes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11451 [13:44:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11451 [13:56:21] paravoid: looks like it run on integration-apache1 instance : notice: /Stage[main]/Puppetmaster::Self::Gitclone/File[/var/lib/git/operations]/ensure: created [13:56:22] ;) [13:57:27] Could not prepare for execution: Got 1 failure(s) while initializing: change from absent to directory failed: Could not set 'directory on ensure: File exists - /etc/puppet/manifests [13:57:27] blabh ;) [14:01:06] paravoid: so puppetmaster debian package refuses to install because /etc/puppet/manifest exist (it is a symlink) [14:01:43] paravoid: also /etc/puppet/manifests -> /var/lib/git/operations/puppet/manifests <-- path does not exist, the repo is fetched in /var/lib/git/operations directly (missing 'puppet' ? [14:13:44] what the hell? [14:17:04] git clone has some trouble skipping a dir [14:17:04] :( [14:17:09] New patchset: Hashar; "puppetmaster:self clone puppet prod branch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13155 [14:17:34] paravoid: ^^^ that change make it fetch operations/puppet.git using 'production' branch [14:17:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13155 [14:18:12] grrr [14:18:16] prod -> production [14:18:26] New patchset: Hashar; "puppetmaster:self clone puppet prod branch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13155 [14:18:59] New review: Hashar; "patchset2 rename branch from 'prod' to 'production'." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13155 [14:18:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13155 [14:21:15] New patchset: Faidon; "puppetmaster::self: fix git::clone target" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13156 [14:21:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13156 [14:21:53] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13156 [14:21:55] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13156 [14:22:34] New review: Faidon; "Correct, thanks!" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13155 [14:24:56] Please merge (or rebase) the change locally and upload the resolution for review. [14:25:00] that's a first for me [14:25:26] you've never hit that? [14:25:35] next version of gerrit will have a rebase button :) [14:25:38] can't wait for that [14:28:46] PROBLEM - Host upload-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:28:57] <^demon> Ryan_Lane: We need to schedule a time for that upgrade. [14:29:06] when I get back ;) [14:29:12] next week at some point works for me [14:29:16] mark: [14:29:16] (don't worry about that) [14:29:20] ah, okay :) [14:29:27] paged too [14:29:27] I disabled privacy extensions [14:29:32] yes please [14:29:42] apparently that removed all v6 addresses from all interfaces [14:29:43] how are our servers going to go to the bathroom now? [14:29:44] not so good ;) [14:29:48] ah ok [14:30:00] hahaha [14:30:18] take away the servers privacy and they revolt. 
I see [14:30:37] hm [14:30:42] why didn't puppet put the service IPs back [14:30:58] ah of course [14:31:08] kk [14:31:24] sooo [14:31:28] RECOVERY - Host upload-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [14:31:31] rolling that change on all servers is not such a good idea [14:31:35] :D [14:31:47] it's ipv6, who cares :) [14:31:47] :P [14:32:01] btw, I've installed Klaxon on my android [14:32:05] it's an app for paging [14:32:06] really nice [14:34:18] ah [14:34:35] just removing /etc/sysctl.d/10-ipv6-privacy.conf should do the trick actually [14:34:42] then it won't take effect immediately but won't be reactivated on reboot [14:34:44] which is fine by me [14:34:59] fcking ubuntu [14:35:00] lemme add that to puppet [14:35:11] I *hate* privacy extensions [14:35:41] we go through all this trouble to have public address on all of our machines [14:35:48] to enable end-to-end connectivity and whatnot [14:36:12] and then you introduce a spec where these public addresses randomly change every so often [14:36:27] so you can't add it to your firewalls or DNS etc. [14:37:14] New patchset: Mark Bergsma; "Disable IPv6 privacy extensions on all servers (takes effect on reboot)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13161 [14:37:42] perhaps we should remove this in the installer as well [14:37:45] so it won't even be present on first run [14:37:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13161 [14:37:56] as that affects facter etc [14:39:20] New patchset: Mark Bergsma; "Disable IPv6 privacy extensions on all servers (takes effect on reboot)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13161 [14:39:55] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13161 [14:39:55] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13161 [14:39:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13161 [14:40:40] i have one change of you [14:40:43] can I merge it? [14:40:44] how do I rebase/merge manually hashar's change? [14:40:57] which one? [14:41:09] let me see, sec. [14:41:26] ah, yes, please [14:41:34] git remote update ; git-review -d ; git-review -f [14:41:39] git-review does rebase automatically [14:41:44] you might even skip 'git remote update' [14:41:48] or i can rebase it if you want [14:42:53] paravoid: I did it [14:43:08] New patchset: Hashar; "puppetmaster:self clone puppet prod branch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13155 [14:43:11] directory => "$gitdir/operations/puppet", [14:43:12] - branch => "test", [14:43:13] + branch => "production", [14:43:45] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13155 [14:43:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13155 [14:43:51] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13155 [14:46:44] !log Rebooting lvs1005 (after dist-upgrade) [14:46:49] Logged the message, Master [14:48:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 26, down: 1, shutdown: 1BRPeering with AS64600 not established - BR [14:48:52] PROBLEM - Host upload-lb.eqiad.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::b [14:49:07] ugh [14:49:10] PROBLEM - Host lvs1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:49:16] ok. 
I won't be upgrading gluster today [14:49:25] seems it will require downtime [14:49:43] "Stop all glusterd, glusterfs and glusterfsd processes running in all your servers and clients." [14:49:46] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 27, down: 0, shutdown: 1 [14:50:04] RECOVERY - Host upload-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.88 ms [14:50:13] RECOVERY - Host lvs1005 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [14:50:19] that sounds like a good enough reason to ditch gluster completely [14:50:28] jesus [14:50:31] well, we have no replacement right now [14:50:38] no, I mean in general [14:50:44] earlier upgrades didn't require downtime [14:50:48] not sure why this one does [14:50:50] it's stupid [14:51:04] the whole point of doing things like gluster is avoiding SPOFs and hence downtimes [14:51:44] yes [14:51:58] this is ridiculous [14:54:13] that's the last straw [14:54:24] no more gluster [14:54:47] just think about how that would work with instance storage [14:54:58] I did :) [14:55:06] we'd need to take a full downtime just to upgrade the fucking filesystem [14:56:51] If you're designing a product to be used in 'clustered' systems surely at the forefront of your mind would be 'upgrades without downtime' [14:57:03] yes [14:57:15] "Yes, protocol compatibility is something we have tried to be conscious about with this release. Going forward, we will definitely try to retain backward compatibility so that an online rolling upgrade is possible." [14:57:17] no [14:57:22] "clustered downtime" :) [14:57:53] I love having to cancel upgrades because of stupidity [14:57:53] Hey dawg we heard you like downtime, so we put downtime in your cluster so you can have an outage while having an outage... 
yeah that doesn't work [14:58:21] it's as if gluster actively wants us *not* to use their filesystem [14:58:32] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [14:58:44] well, they've won [14:59:09] mark: is it you keeping my hate list? you can add gluster on there :) [14:59:18] Hmm even if the protocol isn't compatible couldn't you upgrade half the nodes, suffer a massive split brain, upgrade the clients then upgrade the other half and re-stat the whole fs? Or would that screw with how we use clustering stuff. [14:59:21] it was during the ipv6 sprint [14:59:37] New patchset: Faidon; "puppetmaster: split geoip into a subclass" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13167 [15:00:09] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13167 [15:00:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13167 [15:00:18] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13167 [15:00:56] Damianz: I'm not dealing with a split brain again [15:01:00] it's a pain in the ass to fix [15:01:19] Well yeah, and with instance storage you'd probably end up in a really, really bad place [15:01:32] it's probably more difficult with the project storage [15:01:41] because we have 100 of them [15:02:44] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [15:04:20] Ryan_Lane: there are sites for amplifying your hate towards certain software. to make you feel better http://amplicate.com/hate/gluster [15:04:32] hahaha [15:05:02] oh wait, "Did you mean: GlustrFS"! [15:05:02] I set up my smartcard again [15:05:21] so far it holds my 1024 RSA + two 2048 just fine [15:05:24] 66% love glusterfs? 
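Going back to mark's privacy-extensions change (13161): the file he removes is Ubuntu's /etc/sysctl.d/10-ipv6-privacy.conf, and the sysctl knobs behind it are the use_tempaddr settings. A sketch of what a server-side override file could contain (the exact contents of change 13161 are not shown in this log):

```
# Hypothetical /etc/sysctl.d/ override: disable RFC 4941 privacy extensions.
# 0 = never generate temporary addresses; servers want stable, public,
# DNS-able and firewall-able addresses, not ones that rotate.
net.ipv6.conf.all.use_tempaddr = 0
net.ipv6.conf.default.use_tempaddr = 0
```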
[15:05:34] and I can probably switch to a single 2048 key for wmf + wmflabs [15:05:40] paravoid: I need to get one of those [15:06:00] if you do, make sure to get a fast one [15:06:04] yeah [15:06:05] mine's slow, esp. with 2k keys [15:06:14] I've probably told you this before [15:06:18] the old ones I was used to were incredibly slow [15:06:30] Yubi key <3 [15:06:34] esp. when for a typical wmf login when you login twice [15:06:44] once to bastion and then from there to another host [15:06:59] or that's what I do, other people keep their keys in bastion [15:07:09] SSH agent forwarding? [15:07:15] thanks but no thanks [15:07:24] My key gets unlocked when I unlock the screensaver and sorted :D [15:08:12] I use ProxyCommand ssh -W [15:08:16] and ControlMaster [15:08:35] unfortunately Debian stable doesn't have ControlPersist, that's very nice too [15:08:42] paravoid: i tried these in my ssh config: Ciphers arcfour256 MACs umac-64@openssh.com , but i didn't really benchmark or anything, just at some time found that as a suggestion for "fastest" [15:08:59] mutante: try "GSSAPIAuthentication no" [15:09:10] this is a great login time speedup in my experience [15:09:16] thx [15:09:31] indeed. why even try it if it isn't supported? [15:09:53] and yet it it slows down things [15:10:02] mark: I got two general questions for you about apaches for eqiad [15:10:07] well, it doesn't know it isn't supported ;) [15:10:29] so it tries it, fails, then goes onto the next method [15:10:51] 1. asher said you had some idea as to how to solve the need for the nfs::upload class before swift is done. care to explain? or discuss? [15:11:21] notpeter: i had ideas for it, but i'm not really willing to make that work now swift is nearly ready [15:11:26] 2. architecturally, should I make all jobrunners on their own boxes in eqiad? it seems prudent at this point [15:11:27] I think swift should just get deployed there [15:11:34] mark: ok! 
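The ssh tricks traded in this exchange (ProxyCommand with -W, ControlMaster, and turning off GSSAPI) combine into a stanza like this hypothetical ~/.ssh/config; the host names are placeholders, not the real cluster bastions:

```
Host bastion
    HostName bastion.example.org        # placeholder for the real bastion host
    ControlMaster auto                  # multiplex sessions over one connection,
    ControlPath ~/.ssh/cm-%r@%h-%p      # so a slow smartcard only signs once
    # ControlPersist 10m                # keeps the master alive after logout;
                                        # not in Debian stable's OpenSSH yet

Host *.internal
    ProxyCommand ssh -W %h:%p bastion   # tunnel stdio via bastion, no agent forwarding
    GSSAPIAuthentication no             # skip the doomed auth attempt, faster logins
```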
[15:11:41] notpeter: yes [15:11:45] cool cool [15:11:45] (2) [15:11:47] thanks! [15:14:35] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:15:34] eh yeah, sometimes i restart these, sometimes they come back by themselves anyways.. [15:15:56] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:16:26] puppet restarts them [15:16:44] is it important to have it back up asap? [15:16:44] so, mark, you were thinking of using pybal for the relays too? [15:16:53] v6 relays [15:16:54] well [15:16:55] not pybal [15:17:00] a simple script that uses bgp.py :) [15:17:12] it's like 30 lines of code [15:17:19] not a big quagga fan? :) [15:17:24] no [15:17:30] but also because it's again a good test of the code [15:17:49] quagga is a bit overkill for just injecting some bgp routes I think [15:17:58] exabgp would be fine too, but since that's not packaged either [15:18:04] we might as well use this code and test it a little better [15:18:21] I have no objection for bgp.py, but it's not like a quagga setup would be complicated either [15:18:26] well, since I'm not upgrading gluster anytime soon, maybe I'll look at using exabgp with openstack [15:18:29] you only need bgpd, not zebra or any other daemons [15:18:32] I was thinking of making a simple bgpinject.py script that takes routes to be injected from command line parameters [15:18:36] yeah sure [15:19:17] so, why not pybal? 
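On the relay discussion above, mark mentions exabgp as an alternative to a bgpinject.py script. As a rough sketch of what announcing a static service route with exabgp looks like (every address, ASN, and comment here is a placeholder assumption, not our actual config):

```
# Hypothetical exabgp configuration for injecting one service route.
neighbor 192.0.2.1 {                  # the upstream router
    router-id 192.0.2.2;
    local-address 192.0.2.2;          # the host injecting the route
    local-as 64600;
    peer-as 65000;

    static {
        # announce a /128 service IP with ourselves as next-hop
        route 2001:db8::b/128 next-hop 2001:db8::2;
    }
}
```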
[15:19:25] we could even have a process check for miredo :) [15:19:32] hehe [15:19:36] it would work [15:19:40] it sounds a little dirty ;) [15:19:48] pybal in dryrun mode hehe [15:19:53] so it doesn't attempt to modify lvs state [15:20:58] I was maintaining AS112 at grnet [15:21:01] i guess that's not too bad [15:21:04] I had a very complicated monit script [15:21:18] gives me a good reason to add route withdrawals to pybal too [15:21:20] that stopped quagga when dns was not working [15:21:22] i.e. if all realservers are down [15:21:24] (it doesn't currently do that) [15:21:31] yeah [15:22:02] ok let's do that [15:22:07] hahaha [15:22:49] sounds a little dirty -> let's do that in 3' [15:22:52] ;) [15:23:04] well it has its advantages [15:23:12] useful new stuff in pybal [15:23:25] and it's simpler, for now [15:41:16] New patchset: Reedy; "Use cron.php symlink at /home/wikipedia/common/wmf-config/extdist/cron.php so that we have a consistent location for both the files, and the svn-invoker.conf that goes with it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13170 [15:41:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13170 [15:41:50] oh. hey. look at that [15:41:53] we're already running 3.3 [15:41:55] gluster [15:42:03] I can do the upgrade afterall [15:42:10] I forgot about that [15:42:13] we're running a beta [15:46:02] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/12178 [15:53:26] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: host 208.80.152.197, sessions up: 6, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [15:54:44] hm [15:57:56] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [16:21:06] New review: Hashar; "I have installed apaches::service on psm-precise instance then cherry picked this change latest patc..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/12178 [16:27:06] New patchset: Hashar; "gerrit: disable IRC notification on empty comment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13175 [16:27:34] <^demon> hashar: I already pushed that for review. [16:27:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13175 [16:27:43] ^demon: bah [16:27:50] <^demon> https://gerrit.wikimedia.org/r/#/c/13146/ a couple of hours ago [16:28:14] Change abandoned: Hashar; "dupe of https://gerrit.wikimedia.org/r/#/c/13146/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13175 [16:28:51] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/13146 [16:29:08] ^demon: thanks! [16:29:13] <^demon> yw :) [16:29:15] your idea to put that on wiki is a smart one [16:29:20] really easier to follow what people want [16:29:38] I still have to pester someone to get my changes to wikibugs reviewed [16:29:47] (do you know perl by any chance) [16:29:50] or write test [16:29:51] hmm [16:29:56] testing a CLI utility [16:32:18] I still need to bike shed about setting +v(oice) flag to prominent channel users [16:32:30] such as root here :) [16:42:26] New review: Asher; "This would result in mass cache invalidation over too short a time frame and likely a site outage an..." 
[operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/13129 [16:46:03] < EmilyS> Google I/O keynote currently live: https://developers.google.com/live/shows/ahNzfmdvb2dsZS1kZXZlbG9wZXJzcg4LEgVFdmVudBiYjNkCDA/ [16:46:24] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.240:11000 (timeout) 10.0.8.8:11000 (timeout) 10.0.8.29:11000 (timeout) [16:46:38] Getting errors [16:46:51] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:46:51] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:05] mediawikiwiki is down [16:47:09] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:09] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:09] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:09] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:15] oh [16:47:32] memcached died ? 
[16:47:36] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:37] yeah [16:47:45] PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:53] or just that's the first nagios noticed but not the first to die [16:48:03] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [16:48:03] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out [16:48:03] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:12] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:12] PROBLEM - Apache HTTP on srv267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:13] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out [16:48:13] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:13] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:14] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:14] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:14] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:14] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:21] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:21] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:21] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:21] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [16:48:21] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:21] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:21] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:22] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:22] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:23] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:23] PROBLEM - Apache HTTP on srv286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:24] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:25] ^ [16:48:30] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [16:48:30] PROBLEM - Apache HTTP on srv261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:30] PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:30] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:30] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:30] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:30] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:31] PROBLEM - Apache HTTP on srv196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:39] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:39] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:39] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:39] PROBLEM - Apache HTTP on srv213 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds [16:48:39] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:39] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:39] PROBLEM - Apache HTTP on srv209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:40] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:40] PROBLEM - Apache HTTP on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:41] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:41] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:42] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:42] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:43] PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:43] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:44] PANIC! 
[16:48:44] Ouch [16:48:44] PROBLEM - Apache HTTP on srv197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:49] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:49] PROBLEM - Apache HTTP on srv226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:49] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:49] PROBLEM - Apache HTTP on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:49] PROBLEM - Apache HTTP on srv198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:49] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:49] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:50] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:57] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:57] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:57] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:57] PROBLEM - SSH on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:57] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:57] PROBLEM - Apache HTTP on srv201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:57] PROBLEM - Apache HTTP on srv202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:58] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:58] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:59] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:59] PROBLEM - Apache HTTP on srv234 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [16:49:00] ffs [16:49:00] wtf [16:49:00] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:00] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:01] PROBLEM - Apache HTTP on srv245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:01] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:02] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:05] who unplugged servers? [16:49:06] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out [16:49:06] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:15] PROBLEM - Apache HTTP on srv237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:15] PROBLEM - Apache HTTP on srv241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:15] PROBLEM - Apache HTTP on srv288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:15] PROBLEM - Apache HTTP on srv274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:15] PROBLEM - Apache HTTP on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:15] PROBLEM - Apache HTTP on srv208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:16] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:16] all because our caching servers broke [16:49:16] PROBLEM - Apache HTTP on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:16] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:17] apache fall down go boom? 
[16:49:17] srv258 [16:49:21] Apache doesn't like life :( [16:49:22] srv240 [16:49:24] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:24] PROBLEM - Apache HTTP on srv247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:24] PROBLEM - Apache HTTP on srv283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:24] PROBLEM - Apache HTTP on srv243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:24] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:24] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:24] PROBLEM - Apache HTTP on srv276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:25] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:25] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [16:49:33] PROBLEM - Apache HTTP on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:33] PROBLEM - Apache HTTP on srv265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:33] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:33] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:33] PROBLEM - Apache HTTP on srv225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:33] PROBLEM - Apache HTTP on srv262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:33] PROBLEM - Apache HTTP on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:34] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:34] PROBLEM - Apache HTTP on srv263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:35] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:35] 
PROBLEM - Apache HTTP on srv246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:36] goddamnit [16:49:36] PROBLEM - Apache HTTP on srv285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:36] PROBLEM - Apache HTTP on srv242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:37] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:37] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:38] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:38] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:39] PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:39] PROBLEM - Apache HTTP on srv227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:40] PROBLEM - Apache HTTP on srv287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:42] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:42] PROBLEM - Apache HTTP on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:42] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:42] PROBLEM - Apache HTTP on srv271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:42] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:51] PROBLEM - Apache HTTP on srv236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:51] PROBLEM - Apache HTTP on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:51] PROBLEM - Apache HTTP on srv269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:51] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:51] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [16:49:51] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:51] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:52] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:00] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:00] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:00] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:00] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:08] Indeed :D [16:50:09] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:09] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:18] PROBLEM - SSH on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:25] hey [16:50:27] PROBLEM - SSH on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:27] holy shit [16:50:28] !log rebooting srv258, swapdeath [16:50:30] memcache ? [16:50:31] Hi LeslieCarr [16:50:33] Logged the message, Master [16:50:38] yep, memcache just ditched on us [16:50:39] darn [16:50:40] yeah, i got into work just in time, eh ? [16:50:46] Hi Leslie [16:50:50] <^demon> Good morning LeslieCarr :) [16:50:52] paravoid, also 240, 275, 279 I guess? [16:50:55] shit [16:50:59] i'm working on 279 [16:50:59] yes, just noticed [16:51:08] we need to name a server "swapdeath" [16:51:18] LeslieCarr: language please :D [16:51:21] paravoid/mark i don't want to step on your toes, do you want to task me with something ? [16:51:23] could someone debug one of those apaches to see what keys they're hanging on? 
[16:51:47] gonna do 275 now [16:51:53] mark: just did it [16:51:59] ok [16:52:02] !log rebooting srv275, swapdeath [16:52:07] Logged the message, Master [16:52:18] !log Rebooting srv279, swapdeath [16:52:21] So, multiple memcacheds this time? [16:52:23] Logged the message, Master [16:52:27] apparently [16:52:32] this is getting ridiculous [16:52:33] PROBLEM - Host srv258 is DOWN: PING CRITICAL - Packet loss = 100% [16:52:35] yet again [16:52:43] hashar: were you going to deploy a change to log jobs at the start instead of just finish? [16:52:43] I'm afk for 5 mins and miss all the action [16:53:12] back up [16:53:16] binasher: given we were suspecting apache .. I did not add any log [16:53:31] on mgmt of srv258, i see it rebooting. done [16:53:35] logging out again [16:53:53] I'm thinking "swapoff -a" on all of them [16:53:56] binasher: are the apache boxes spiking cause of some job loop? [16:53:58] oom killer > swapdeath [16:54:01] yup [16:54:21] RECOVERY - Host srv258 is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [16:54:42] this time feels different though, in that many boxes puked at once instead of only one. 
and the rest all climbed in memory usage (whereas IIRC before it was really just one that spiked super high) [16:54:57] RECOVERY - SSH on srv258 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:54:57] RECOVERY - SSH on srv279 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:55:22] That was slightly ott compared to the other day [16:55:25] hashar: i thought all of the hosts that entered swap death were also job runners [16:55:30] dang, and this was right after Asher commented a memcached related gerrit change with "this could cause a site outage", yet of course unrelated [16:55:42] paravoid: if the boxes die upon going into swap, i'm for swapoff [16:56:04] that's still a workaround though [16:56:14] if we can isolate an apache box and keep it at 100% CPU that might help debugging the issue [16:56:19] i hadn't seen any that were just apaches die, but haven't looked in the last few days [16:56:21] doing an strace on one of the long-running jobs [16:56:23] yes, but if it keeps the site up for now.... [16:56:30] yes, they die of swapdeath. and a workaround is better than what we have now [16:56:34] it looks like there are a few that are still close to going over the edge: 211, 226, 229, 230, 240 [16:56:36] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60330 bytes in 1.176 seconds [16:56:47] they had 1MB of swap ? 
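The "swapoff -a" idea floated above trades swapdeath for a quick OOM kill: with no swap configured, a runaway apache2 is reaped by the kernel OOM killer instead of dragging the whole box through thrashing. A minimal sketch of the triage step, assuming /proc/meminfo-style input; the helper name is invented for illustration:

```shell
# How much swap is a host actually using? In-use swap = SwapTotal - SwapFree,
# both reported in kB by /proc/meminfo.
swap_in_use_kb() {
  awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} END {print t - f}'
}

# On an affected host one would then run, as root:
#   swap_in_use_kb < /proc/meminfo
#   swapoff -a    # disable all swap devices
#   swapon -s     # confirm nothing is swap-backed any more
```

As the channel notes, this only keeps the site up; the underlying apache memory growth still has to be tracked down.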
[16:56:48] en wiki is kinda sluggish still [16:56:59] NPROCS SYSCPU USRCPU VSIZE RSIZE RDDSK WRDSK RNET SNET CPU CMD 1/14 [16:57:02] 49 1m53s 50m51s 22.5G 6.6G 250e3 103e3 0 0 523% apache2 [16:57:05] 951 17.85s 4m21s 412.7M 84980K 23680 0 0 0 46% php [16:57:08] 1 9.03s 3.72s 2.2G 2.0G 7800 0 0 0 2% memcached [16:57:11] srv258's last moments [16:57:12] ooh [16:57:18] not the jobs' fault [16:57:21] RECOVERY - Apache HTTP on srv260 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.848 second response time [16:57:33] o.0 that's a lot of apache cpu usage [16:57:39] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60136 bytes in 0.145 seconds [16:57:45] also, apaches went from 4.3G resident to 6.6G [16:57:48] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [16:57:48] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.865 second response time [16:57:48] RECOVERY - Apache HTTP on srv267 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.062 second response time [16:57:50] srv258 is about to die [16:57:57] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.455 second response time [16:57:57] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [16:57:57] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [16:57:57] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [16:57:57] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [16:57:58] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.222 second response time [16:57:58] 226 looks ok from being on the host (uptime, top, free) [16:57:58] RECOVERY - LVS HTTP IPv4 on 
wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 47341 bytes in 0.675 seconds [16:57:58] RECOVERY - Apache HTTP on srv268 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [16:57:59] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.619 second response time [16:57:59] RECOVERY - Apache HTTP on srv226 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.197 second response time [16:58:00] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.200 second response time [16:58:00] RECOVERY - Apache HTTP on srv198 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.971 second response time [16:58:01] RECOVERY - Apache HTTP on srv207 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.176 second response time [16:58:03] matanya: erm? [16:58:03] in a single atop cycle (10 minutes) [16:58:06] RECOVERY - Apache HTTP on srv202 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [16:58:06] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [16:58:06] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [16:58:06] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [16:58:06] RECOVERY - Apache HTTP on srv201 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.420 second response time [16:58:06] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.594 second response time [16:58:06] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.031 second response time [16:58:07] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [16:58:07] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [16:58:08] 
RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [16:58:08] RECOVERY - Apache HTTP on srv245 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.070 second response time [16:58:09] RECOVERY - Apache HTTP on srv234 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.185 second response time [16:58:10] paravoid, I guess you don't have its access log or a core? [16:58:13] nice [16:58:15] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60143 bytes in 0.168 seconds [16:58:17] its coming back [16:58:24] RECOVERY - Apache HTTP on srv241 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [16:58:24] RECOVERY - Apache HTTP on srv237 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [16:58:24] RECOVERY - Apache HTTP on srv277 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [16:58:33] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [16:58:34] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [16:58:34] RECOVERY - Apache HTTP on srv276 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [16:58:34] RECOVERY - Apache HTTP on srv288 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.212 second response time [16:58:34] RECOVERY - Apache HTTP on srv274 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [16:58:34] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [16:58:34] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [16:58:35] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [16:58:35] RECOVERY - Apache HTTP on srv265 is OK: HTTP OK - 
HTTP/1.1 301 Moved Permanently - 0.026 second response time [16:58:36] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [16:58:36] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [16:58:37] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.785 second response time [16:58:37] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.282 second response time [16:58:42] RECOVERY - Apache HTTP on srv225 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [16:58:42] RECOVERY - Apache HTTP on srv262 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [16:58:42] RECOVERY - Apache HTTP on srv283 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [16:58:42] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [16:58:42] RECOVERY - Apache HTTP on srv243 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [16:58:42] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [16:58:42] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [16:58:43] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.627 second response time [16:58:51] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [16:58:51] RECOVERY - Apache HTTP on srv263 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [16:58:51] RECOVERY - Apache HTTP on srv287 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [16:58:51] RECOVERY - Apache HTTP on srv227 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time 
[16:58:51] RECOVERY - Apache HTTP on srv271 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [16:58:51] RECOVERY - Apache HTTP on srv242 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.412 second response time [16:58:51] RECOVERY - Apache HTTP on srv285 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.236 second response time [16:58:52] RECOVERY - Apache HTTP on srv236 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.231 second response time [16:58:52] RECOVERY - Apache HTTP on srv269 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.261 second response time [16:59:00] RECOVERY - Apache HTTP on srv231 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time [16:59:00] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 47348 bytes in 1.292 seconds [16:59:09] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [16:59:09] RECOVERY - Apache HTTP on srv196 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [16:59:18] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60329 bytes in 0.806 seconds [16:59:18] PROBLEM - SSH on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:59:26] yeah [16:59:51] srv240 gave an OOM error and killed an apache2 process. 
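The atop snapshot earlier ("apaches went from 4.3G resident to 6.6G" in a single 10-minute cycle) implies a brutal growth rate. A small awk sketch of that arithmetic; the function name and one-decimal output format are illustrative:

```shell
# GB/hour growth rate from two resident-memory samples one atop cycle apart.
rss_growth_gb_per_hour() {
  # args: old_gb new_gb interval_minutes
  awk -v a="$1" -v b="$2" -v m="$3" 'BEGIN { printf "%.1f\n", (b - a) * 60 / m }'
}
```

For srv258 that is rss_growth_gb_per_hour 4.3 6.6 10, roughly 13.8 GB/hour, which is why a box can look fine and be dead ten minutes later.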
[17:00:03] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 47148 bytes in 0.172 seconds [17:00:04] so, 2.3G in 10' [17:00:21] RECOVERY - Apache HTTP on srv279 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [17:00:26] I guess 264 wants kicking [17:00:41] apergos: good guess, doing it [17:01:17] !log rebooting srv264, swapdeath [17:01:22] Logged the message, Master [17:01:36] 211 too [17:01:45] Could not connect: 10.0.8.14:11000 (timeout) 10.0.8.30:11000 (timeout) [17:02:06] I'm on it and watching (211) [17:02:10] ha [17:02:12] me too [17:02:18] 211 is spiking but not dead yet [17:02:30] 280 [17:02:32] load is dropping and slowly it is backing out of swap [17:02:42] srv212 looks like it's dying [17:02:44] or not. [17:02:59] binasher: so my patch is https://gerrit.wikimedia.org/r/13181 [17:03:03] srv212 is out of memory [17:03:09] and swapdeathing now [17:03:11] [5236373.641605] Out of memory: kill process 9493 (apache2) score 239750 or a child [17:03:11] [5236373.656418] Killed process 9493 (apache2) [17:03:13] from 211 [17:03:19] Reedy: https://gerrit.wikimedia.org/r/13181 adds an entry in jobRun whenever a job starts. Might help [17:03:22] !log powercycling srv280 [17:03:28] Logged the message, Master [17:03:28] Reedy: though it is most probably apache2 causing the issue [17:03:40] i'm going to try restarting apache2 on srv212 [17:03:44] can we keep a box up to investigate the apache ? 
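The dmesg lines above are the signature to hunt for on every box. A hedged sketch of pulling OOM-killer victims out of saved dmesg output, matching the kernel message format quoted above; the function name is made up:

```shell
# Extract "Killed process <pid> (<name>)" records from dmesg-style input,
# e.g.:  dmesg | oom_kills   or   oom_kills < srv211-dmesg.txt
oom_kills() {
  grep -o 'Killed process [0-9]* ([^)]*)'
}
```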
[17:03:49] hashar: [17:03:54] hashar: check out srv212 [17:03:57] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:57] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:57] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:57] (until the box dies) [17:03:59] or srv211 [17:04:17] saved dmesg from 211 [17:04:42] PROBLEM - Apache HTTP on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:50] I'd prefer the access log, or even an apache strace :P [17:04:51] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:51] PROBLEM - Apache HTTP on srv246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:51] PROBLEM - Apache HTTP on srv283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:51] PROBLEM - Apache HTTP on srv243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:51] PROBLEM - Apache HTTP on srv225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:51] PROBLEM - Apache HTTP on srv262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:54] down again [17:05:00] PROBLEM - Apache HTTP on srv265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:00] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:00] PROBLEM - Apache HTTP on srv242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:00] PROBLEM - Apache HTTP on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:00] PROBLEM - Apache HTTP on srv285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:00] PROBLEM - Apache HTTP on srv263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:01] god no [17:05:05] Sadtimes [17:05:06] accesslogs no got [17:05:06] doh! 
[17:05:09] PROBLEM - Apache HTTP on srv287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:09] PROBLEM - Apache HTTP on srv227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:09] PROBLEM - Apache HTTP on srv271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:09] PROBLEM - Apache HTTP on srv269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:09] PROBLEM - Apache HTTP on srv247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:09] PROBLEM - Apache HTTP on srv236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:09] PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:10] srv270, srv287, srv264 [17:05:11] !log rebooting srv280 [17:05:16] Logged the message, Master [17:05:20] you guys moving the chatter to -tech? [17:05:22] 270 277 287 [17:05:23] *mind [17:05:27] RECOVERY - SSH on srv264 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:05:27] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:27] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:36] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [17:05:36] PROBLEM - Apache HTTP on srv261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:36] PROBLEM - Apache HTTP on srv267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:36] PROBLEM - SSH on srv287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:36] PROBLEM - Apache HTTP on srv286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:36] PROBLEM - Apache HTTP on srv196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:43] oh darn [17:05:45] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:45] PROBLEM - Apache HTTP on srv279 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [17:05:45] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:45] PROBLEM - Apache HTTP on srv198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:54] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:54] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:54] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:54] PROBLEM - Apache HTTP on srv197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:54] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:54] PROBLEM - Apache HTTP on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:54] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:55] PROBLEM - Apache HTTP on srv213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:55] PROBLEM - Apache HTTP on srv201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:03] PROBLEM - Apache HTTP on srv234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:03] IPs here: http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=2&host=spence&service=check_all_memcacheds [17:06:03] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:03] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:03] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out [17:06:03] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:03] PROBLEM - Apache HTTP on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:03] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [17:06:04] PROBLEM - Apache HTTP on srv237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:12] PROBLEM - Apache HTTP on srv241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:12] PROBLEM - Apache HTTP on srv202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:12] PROBLEM - Apache HTTP on srv245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:12] PROBLEM - Apache HTTP on srv209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:13] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:13] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:13] want me to stop bot? [17:06:13] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:13] PROBLEM - Apache HTTP on srv226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:14] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:21] PROBLEM - Apache HTTP on srv288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:21] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:21] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:21] PROBLEM - Apache HTTP on srv276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:21] PROBLEM - Apache HTTP on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:21] PROBLEM - Apache HTTP on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:21] PROBLEM - Apache HTTP on srv208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:22] yes please [17:06:22] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:22] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:22] mutante: i think moving to -tech is 
good [17:06:23] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:23] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:26] I already just ignored the bot. [17:06:27] or that [17:06:30] PROBLEM - Apache HTTP on srv274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:31] makes the conversation followable. [17:06:39] ok, lets just ignore in clients [17:06:39] PROBLEM - Apache HTTP on srv191 is CRITICAL: Connection refused [17:06:45] just -tech :-D [17:06:46] MOVE TO TECH [17:06:50] robla is looking very festive today: [17:06:51] http://cl.ly/02170B2j3g2S1l3H3S0r [17:07:09] lol [17:07:22] so pirate robla sinked the apaches! [17:07:33] PROBLEM - SSH on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:07:34] yar [17:07:39] I thought this channel was created specially for ops being able to discuss stuff without annoying disturbances [17:07:40] Irony [17:07:55] vvv: nagios-wm didn't get the memo. 
[17:09:57] PROBLEM - Host srv287 is DOWN: PING CRITICAL - Packet loss = 100% [17:11:36] RECOVERY - Host srv287 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [17:11:36] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39274 bytes in 3.896 seconds [17:11:48] mhm [17:11:54] PROBLEM - NTP on srv287 is CRITICAL: NTP CRITICAL: Offset unknown [17:12:12] RECOVERY - Apache HTTP on srv191 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [17:12:12] RECOVERY - SSH on srv287 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:13:33] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.394 second response time [17:13:33] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.008 second response time [17:13:42] RECOVERY - Apache HTTP on srv207 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [17:13:42] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.911 second response time [17:13:42] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.862 second response time [17:13:42] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.332 second response time [17:13:51] RECOVERY - SSH on srv277 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:14:00] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60329 bytes in 0.808 seconds [17:14:07] vvv: -operations is indeed for ops stuff [17:14:09] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [17:14:09] RECOVERY - Apache HTTP on srv225 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [17:14:09] RECOVERY - Apache HTTP on srv226 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:14:09] RECOVERY 
- Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [17:14:09] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [17:14:16] lol [17:14:18] vvv: but -tech is valid too and is a fallback whenever we have an outage :) [17:14:30] vvv: we got bots out of -tech just to make it a discussion channel [17:14:44] also -tech is great to avoid disturbing ops :-D [17:14:55] for example: deploying an extension [17:16:27] notpeter: https://gerrit.wikimedia.org/r/13184 [17:16:50] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/13184 [17:18:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13184 [17:18:29] preilly: ok, I shall push now, yes? [17:18:35] notpeter: yes [17:18:39] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13184 [17:18:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13184 [17:19:17] dead end from dmesg output, just shows the oom (and a bunch more similar), nothing before that for some time [17:19:37] preilly: forcing puppet runs now [17:19:54] https://bugzilla.wikimedia.org/show_bug.cgi?id=37990 [17:24:08] petan: live [17:32:54] * jeremyb wonders why you don't log serial output? seems implied that you don't currently [17:34:09] mark: idk if ToAruShiroiNeko talked to you. he was complaining in #wikimedia-ops earlier and several people told him to wait until the outage was over [17:35:19] * jeremyb couldn't tell exactly what the kick reason was [17:37:11] RobH: can you tell me what your worry about [wikimedia #2996] Create .m mobile domains for remaining WMF projects is ? [17:37:32] the wikitech or the subdomain in subdomain thing? 
[17:37:51] wikitech is not in cluster, doesn't need mobile inclusion
[17:38:02] the subdomain of a subdomain is just bad practices, but i suppose we can do
[17:38:16] i just rather someone who isnt the one and only person who can do onsite things do it.
[17:38:28] as i am on site and have things to do that no one else can =P
[17:38:45] tfinc: jeff took the ticket over
[17:38:51] Jeff_Green: ^
[17:39:12] * tfinc shifts his gaze at Jeff_Green and asks the same question
[17:39:16] uhhh, sub in sub is going away to make ssl cert names feasible?
[17:39:34] sub-subdomains are fine, as long as they'll be covered by *.m.whatever
[17:39:45] * Jeff_Green stares at the sky and whistles
[17:40:02] if for some reason they can't, then we'll have to figure out an alternative
[17:40:08] ohh, i was thinking like arbcom.nl.wikipedia or whatever it was
[17:40:16] yeah, that's not covered
[17:40:26] i guess that's not mobile anyway, so doesn't matter
[17:40:35] we have *.wikipedia, so nl is good, arbcom.nl is not
[17:42:09] well, if weng timeout: 255 seconds)
[17:42:09] 19:40:06 i guess that's not mobile anyway, so doesn't matter
[17:42:09] 19:40:15 we have *.wikipedia, so nl is good, arbcom.nl is not
[17:42:11] ops
[17:42:23] haha
[17:49:46] RobH: in other excitement, row c switch serials ?
[17:51:49] LeslieCarr: all in racktables
[17:52:23] !log cp1017 is offline due to memory error. replacement memory on site, pulling system for swap
[17:52:29] Logged the message, RobH
[17:52:58] cool :)
[17:54:59] urgh, my backlog of datacenter shit is a mess.
[17:55:38] * RobH goes at it, one task at a time.
[17:55:52] fyi, cp1037 to cp1040 are powered down, do not have a role yet but are waiting for you in Nagios with a 3-month downtime (so no notifications) to be reactivated anytime by deleting the downtime
[17:56:18] mutante: did you add them to decomissioned hosts ?
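The exchange above about `*.m.whatever` and `arbcom.nl.wikipedia` hinges on how wildcard certificates match names: a `*` covers exactly one DNS label, so `*.wikipedia.org` matches `nl.wikipedia.org` but never the sub-subdomain `arbcom.nl.wikipedia.org`. A minimal sketch of that single-label rule (a hypothetical helper for illustration, not Wikimedia's actual certificate-checking code):

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Single-label wildcard match: '*' covers exactly one DNS label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False  # a '*' never spans multiple labels
    return all(p == "*" or p == h for p, h in zip(p_labels, h_labels))

# nl.wikipedia.org is covered; the sub-subdomain arbcom.nl.wikipedia.org is not
print(wildcard_matches("*.wikipedia.org", "nl.wikipedia.org"))         # True
print(wildcard_matches("*.wikipedia.org", "arbcom.nl.wikipedia.org"))  # False
```

This is why LeslieCarr's point holds: as long as every mobile hostname is one label under `m.<project>.org`, a `*.m.<project>.org` cert covers it, but anything nested one level deeper falls outside the wildcard.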
[17:56:35] LeslieCarr: no, they are more like _pre_commissioned
[17:57:01] afaik and per notpeter
[17:57:13] hehe, well yeah, but since they were in nagios, putting them in decomissioned will remove the old alerts
[17:57:57] well, it all depends how long it takes i guess. this was an easy way to turn of notifications but keep them warm standby
[18:05:48] !log cp1017 memory replaced
[18:05:53] Logged the message, RobH
[18:16:55] LeslieCarr: "Sorry! We could not process your edit due to a loss of session data. Please try again. If it still does not work, try logging out and logging back in."
[18:18:03] i think that could be related to memcache restarts -- binasher , you know more about this than i do though
[18:18:34] <^demon> Yeah, sessions are in memc so that's likely.
[18:20:07] New patchset: Lcarr; "trying to fix snmptt restarts on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13196
[18:20:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13196
[18:20:57] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13196
[18:21:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13196
[18:22:34] preilly: quote: "its inconvenient to users whose sessions where there"
[18:22:46] New patchset: preilly; "fix landing page" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13197
[18:23:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13197
[18:23:19] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/13197
[18:23:26] notpeter: https://gerrit.wikimedia.org/r/#/c/13197/
[18:32:25] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13197
[18:32:28] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13197
[18:41:56] New patchset: Lcarr; "ensure snmpd running" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13198
[18:42:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13198
[18:47:03] !log added several mobile hostnames to DNS for RT #2996
[18:47:08] Logged the message, Master
[18:48:14] LeslieCarr: i see you are ensuring snmp services, fyi, on labs nagios this was the one that needed/needs extra care to run after reboots: /usr/sbin/snmptrapd -On -Lsd -p /var/run/snmptrapd.pid
[18:49:33] yeah, theoretically if you set in the init file TRAPDRUN=yes it works
[18:51:51] Jeff_Green: did you do those with the script?
[18:51:52] ok, just if you ever happen to see a flood of everything being down in labs nagios it would be that to check first
[18:52:02] thanks :)
[18:52:19] preilly: authdns-update? yes
[18:52:21] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13198
[18:52:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13198
[18:53:13] Jeff_Green: yep
[18:53:31] and i remembered to check them individually this time
[18:53:42] *this* time . . .
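The "loss of session data" errors reported above fit ^demon's explanation: MediaWiki keeps login sessions in memcached, so restarting a memcached instance evicts every session it held and in-flight edits lose their tokens. A toy illustration of that failure mode (a stand-in in-memory store, not MediaWiki's or memcached's actual code):

```python
class ToyCache:
    """Stand-in for a memcached instance: everything lives in RAM only."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def restart(self):
        # A process restart discards the whole keyspace -- there is no disk.
        self._data = {}


cache = ToyCache()
cache.set("session:abc123", {"user": "Example", "edit_token": "t0k3n"})
print(cache.get("session:abc123") is not None)  # True: logged in
cache.restart()
print(cache.get("session:abc123"))              # None: "loss of session data"
```

Hence Hashar's later note about the abandoned change 13129: anything that restarts memcached cluster-wide logs out everyone at once.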
[18:54:25] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12724
[18:54:28] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12724
[18:55:10] heh
[18:56:14] Jeff_Green: thanks for getting that ticket completed
[18:56:20] no problem!
[19:17:21] New patchset: MaxSem; "Add debug log group for MobileFrontend" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13207
[19:35:18] binasher: were you serious about getting the 2nd es cluster?
[19:37:16] yes
[19:44:28] New review: preilly; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13207
[19:44:30] Change merged: preilly; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13207
[19:48:09] binasher: are you back in the office tomorrow?
[19:48:19] yep
[20:00:41] New patchset: Pyoungmeister; "adding olivneh to admins::restricted per RT 3116" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13223
[20:01:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13223
[20:01:37] Change abandoned: Pyoungmeister; "doing this instead in https://gerrit.wikimedia.org/r/#/c/13223/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11572
[20:01:48] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/13223
[20:01:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13223
[20:09:04] notpeter: hi
[20:09:09] notpeter: did you need me?
[20:29:14] petan: I don't think so?
[20:29:24] petan: what is the context?
[20:29:44] [19:24:08] petan: live
[20:29:55] nvm then
[20:30:17] right now I have 22:30 if you were searching own log
[20:30:59] oh! sorry, I meant to say preill_y
[20:31:03] :)
[20:31:07] autocomplete failed me
[20:31:08] sorry!
[20:31:09] np
[20:31:57] mark: You around?
[20:33:50] !log db1003 mgmt is not responsible, I need to remove power and reboot. confirmed iwth asher this is an s3 slave and can do a short downtime without issues
[20:33:56] Logged the message, RobH
[20:34:44] New review: Hashar; "Oh I did not think about that. I will have to remember to restart memcached :-D" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13129
[20:35:11] !log clean mysql shutdown, db1003 now offline
[20:35:16] Logged the message, RobH
[20:35:26] * RobH gets a hammer to fix db1003's attitude problem
[20:35:27] New patchset: Asher; "opening gmetad xml to fenari again to fix dbtree" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13273
[20:36:00] Change abandoned: Hashar; "that change would kill the whole cluster whenever someone sync a new version of /etc/memcached.conf ." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13129
[20:36:00] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13273
[20:36:01] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13273
[20:44:30] !log db1003 mgmt issue due to bad cable, system booting back up, replacing mgmt cable
[20:44:36] Logged the message, RobH
[20:51:53] !log rebooting srv266 as it is unresponsive
[20:51:58] Logged the message, Mistress of the network gear.
[20:55:02] !log db1003 back online, replaced mgmt cable and mgmt is working now as well
[20:55:07] Logged the message, RobH
[21:03:17] * RobH needs to fix dell self dispatch
[21:03:23] tired of hold times =P
[21:16:35] go to their warehouse and ship out the stuff you need? :D
[21:36:39] Reedy: http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#IPv6_is_now_enabled_but_broken
[21:36:51] I'm just looking at it
[21:36:54] I also cannot do anything about it
[21:37:45] well, do you think it's possible that there's a problem at the esams cluster in particular?
[21:37:53] possibly
[21:38:02] I don't have any ipv6 connectivity, so can't test
[21:38:21] there are tunnelbrokers you can use
[21:38:23] Reedy: Maybe its possible to do from toolserver ssh?
[21:38:42] (Or wikimedia's for that matter, I suppose)
[21:39:42] m ark would be the person probably best placed to deal with this. Just not now
[21:41:02] bugzilla?
[21:45:07] full ack @ what Reedy said
[21:45:29] ?
[21:45:36] I'm just trying to find someone else in the EU to confirm it
[21:45:46] full ack?
[21:45:47] I like 2620:0:862:ed1a::1
[21:45:52] full acknowledge
[21:46:15] whose idea was it to have ed1a (like "edia") in the IPv6 addresses?
[21:46:21] Reedy: i dont have v6 at home currently, or i would install tcptraceroute6
[21:46:29] Ditto
[21:46:56] http://p.defau.lt/?n6pbornE5_SJJ9pH7gvFuA
[21:46:57] WFSE?
[21:48:03] apparently so...
[21:48:12] Reedy: could use a sixxs.net tunnel and "aiccu" client .. but ..tomorrow ?
[21:48:28] ^I'd recommend a Hurricane Electric
[21:48:29] My traceroute ends at the lb for eqiad from a server in france
[21:48:30] tunnel
[21:48:47] Damianz: screwed dns?
[21:48:58] NOOOOOOOOOO
[21:49:04] I don't think so
[21:49:05] he.net loads http content on a https page
[21:49:06] It's a HE tunnel to london so probably :P
[21:49:08] * Reedy cries
[21:49:09] it seems to resolve fine for me
[21:49:34] Actually is this a tunnel.... hmm it might have native ipv6 actually
[21:49:52] HE would mean you start with 2001:470
[21:50:24] Ooh it is native, other server is a tunnel.
[21:50:43] IP is not ICMP pingable. Please make sure ICMP is not blocked. If you are blocking ICMP, please allow 66.220.2.74 through your firewall.
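Jasper_Deng's quip about ed1a ("edia") and the rule of thumb that HE tunnel addresses start with 2001:470 are both easier to check with the address written out programmatically. Python's stdlib `ipaddress` module can expand, compress, and test network membership, which is handy when eyeballing v6 addresses in logs like the ones above:

```python
import ipaddress

addr = ipaddress.IPv6Address("2620:0:862:ed1a::1")
print(addr.exploded)    # 2620:0000:0862:ed1a:0000:0000:0000:0001
print(addr.compressed)  # 2620:0:862:ed1a::1

# The Hurricane Electric tunnel space mentioned in the channel: 2001:470::/32
he_net = ipaddress.IPv6Network("2001:470::/32")
print(addr in he_net)   # False -- not HE tunnel space
```

A sketch only: the membership test shows why "HE would mean you start with 2001:470" works as a quick visual check on the first two groups of an address.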
[21:50:45] pffft
[21:50:56] Damianz: you aren't using a tunnel
[21:51:05] with my last home router it even failed for me with aiccu and sixxs because my router just did not like to forward the "strange" 6in4 protocol 41
[21:51:23] It works from from both my he tunnel and a server in frace :P
[21:51:26] The DMZ feature is usually a workaround
[21:51:42] well, in any case, that person's issue is his own config/ISP
[21:51:58] Reminds me I should fix my router at home to start handing out ips from my /48 tunneld subnet, last time it broke the internet somewhat though
[21:52:25] (DNS errors are a common IPv6 misconfiguration)
[21:52:52] It's all autoconf's fault!
[21:53:10] * Damianz runs and hides
[21:53:18] * Jasper_Deng configures his IPv6 DNS manually, and uses DHCPv6 for address distribution
[21:53:43] crap.
[21:53:51] I actually got autoconf working properly then decided I prefered the idea of using dhcp with static leases :(
[21:54:10] ^bad decision
[21:54:15] :(
[21:54:35] DHCPv6 is not a complete solution unless you want to do manual configuration too
[21:55:00] That's amazing
[21:55:12] my irc session survied a router reboot
[21:55:22] ...
[21:55:31] It works enough that my laptop has happie, but yes it has a load of manual config
[21:55:33] maybe it wasn't a reboot after all...?
[21:55:34] Reedy: Welcome to tcp
[21:56:03] !log mw1102 has no nic0, rather than troubleshoot it for a long time, reinstall! (rt 3058)
[21:56:09] Logged the message, RobH
[21:59:44] LeslieCarr: looks like the mainboard was replaced, as the dhcp file has an old mac address
[21:59:54] i dont recall doing this, but hell, it happens often enough
[22:00:05] so easier to reinstall than hack at making it work
[22:01:12] New patchset: RobH; "mac change for mw1102" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13281
[22:01:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13281
[22:01:54] New review: RobH; "mac correction" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13281
[22:01:56] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13281
[22:12:04] mutante: Jasper_Deng WFM via a henet ipv6 tunnel
[22:12:32] ?
[22:12:39] oh
[22:12:44] WFM=works for me
[22:12:53] I'm going to post this on en
[22:23:38] !log temporarily pulling srv211 from pybal
[22:23:43] Logged the message, Master
[22:54:39] * AaronSchulz drowns binasher in patches
[23:10:19] New patchset: preilly; "add rule for m.mediawiki.org structure for domain" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13282
[23:10:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13282
[23:10:57] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/13282
[23:11:29] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13282
[23:11:32] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13282
[23:25:01] New patchset: preilly; "add www. for replacement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13283
[23:25:33] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/13283
[23:25:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13283
[23:28:03] New patchset: preilly; "add www. for replacement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13283
[23:28:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13283
[23:29:07] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13283
[23:29:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13283
[23:39:34] New patchset: MaxSem; "Update $wgMobileUrlTemplate for mediawiki.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13284
[23:41:32] New review: preilly; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13284
[23:41:36] Change merged: preilly; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13284
[23:50:29] * jeremyb wonders if LeslieCarr is unsick?
[23:50:37] jeremyb: pretty much
[23:50:38] hehehe
[23:50:42] woot
[23:50:45] is that a "please check my code review requests"
[23:51:22] no, it's a why is my approved and merged puppet change not having the expecting effect
[23:51:32] did the gdash thing get to sockpuppet?
[23:51:43] expected*
[23:53:30] it did
[23:53:36] hrmmm
[23:53:49] what machine is that on again ?
[23:53:56] professor
[23:54:06] aka gdash.pmtpa.wmnet
[23:54:56] ah
[23:55:07] actually fenari is gdash.wikimedia.org ...
[23:55:13] right
[23:55:18] 2 different boxes
[23:55:19] New patchset: MaxSem; "pngcrush everything" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13285
[23:55:23] ah ok
[23:55:35] running puppet on professor again
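Earlier in the log, RobH traced mw1102's failed reinstall to a DHCP config entry still carrying the MAC address of the old mainboard. When auditing entries like that, it helps to normalize MACs before comparing, since config files and vendor tools mix colon, dash, and Cisco dotted notations. A hypothetical helper for that comparison (not the actual Wikimedia dhcpd tooling, and the addresses below are made up):

```python
import re

def normalize_mac(mac: str) -> str:
    """Reduce any common MAC notation to lowercase colon-separated form."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

# The same (made-up) address in three notations compares equal once normalized
assert (normalize_mac("00:1E:C9:AB:CD:EF")
        == normalize_mac("00-1e-c9-ab-cd-ef")
        == normalize_mac("001e.c9ab.cdef")
        == "00:1e:c9:ab:cd:ef")
```

With a normalizer like this, a stale entry after a mainboard swap shows up as a simple inequality between the MAC in the DHCP file and the one the new board reports.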