[00:02:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time
[00:05:14] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[00:08:44] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 00:08:39 UTC 2013
[00:09:14] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[00:09:44] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 00:09:42 UTC 2013
[00:10:14] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[00:10:44] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 00:10:38 UTC 2013
[00:11:14] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[00:11:37] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 00:11:30 UTC 2013
[00:12:14] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[00:12:14] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 00:12:12 UTC 2013
[00:13:14] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[00:13:24] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 00:13:19 UTC 2013
[00:14:14] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[00:15:24] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[00:32:24] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[00:32:54] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 00:32:48 UTC 2013
[00:33:15] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[00:39:31] PROBLEM - Disk space on cp1041 is CRITICAL: Timeout while attempting connection
[00:40:21] RECOVERY - Disk space on cp1041 is OK: DISK OK
[01:03:35] New patchset: Alex Monk; "(bug 46990) Add the 'editor' restriction level on pl.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58038
[01:04:15] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[01:13:35] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:50:02] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[01:50:02] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[01:50:02] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[02:04:28] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:04:48] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[02:18:18] !log LocalisationUpdate completed (1.22wmf1) at Mon Apr 8 02:18:18 UTC 2013
[02:18:26] Logged the message, Master
[02:25:07] !log LocalisationUpdate completed (1.21wmf12) at Mon Apr 8 02:25:07 UTC 2013
[02:25:13] Logged the message, Master
[02:34:57] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours
[03:04:17] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[03:07:27] PROBLEM - Squid on brewster is CRITICAL: Connection refused
[03:55:16] !log on all apaches: upgrading libpoppler
[03:55:23] Logged the message, Master
[03:59:30] RECOVERY - Squid on brewster is OK: TCP OK - 0.027 second response time on port 8080
[03:59:54] !log on brewster: root partition was full, removed squid access.log and store.log and started squid
[04:00:01] Logged the message, Master
[04:05:17] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[04:08:37] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 04:08:35 UTC 2013
[04:09:17] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[04:09:47] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 04:09:40 UTC 2013
[04:10:17] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[04:10:37] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 04:10:36 UTC 2013
[04:11:17] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[04:11:37] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 04:11:26 UTC 2013
[04:12:17] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[04:12:17] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 04:12:09 UTC 2013
[04:13:17] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[04:13:27] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 04:13:17 UTC 2013
[04:14:17] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[04:31:11] New patchset: Ori.livneh; "udp2log on fluorine: relay MW errors to vanadium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58047
[04:32:04] New patchset: Ori.livneh; "udp2log on fluorine: relay MW errors to vanadium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58047
[04:32:47] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 04:32:43 UTC 2013
[04:33:17] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[04:34:59] TimStarling: if you have a moment, could you look at https://gerrit.wikimedia.org/r/58047 ? it's a small change which adds a udp log filter on fluorine.
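[Note: the 03:59 !log entries above are the classic disk-full recovery. A minimal sketch of that kind of cleanup, assuming typical Squid log paths; the exact commands run on brewster are not in the log. Space held by a deleted but still-open log file is only freed once the daemon closes it, which is why squid was started again afterwards:]

    df -h /                         # confirm the root partition is full
    du -sh /var/log/squid/*         # find what is eating the space
    rm /var/log/squid/access.log /var/log/squid/store.log
    /etc/init.d/squid restart       # releases the deleted-but-open files
                                    # and brings squid back on port 8080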
[05:04:41] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[05:44:43] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[06:00:43] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[06:08:24] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[06:16:24] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours
[06:28:34] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 06:28:27 UTC 2013
[06:29:24] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[06:29:44] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 06:29:40 UTC 2013
[06:30:24] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[06:32:54] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 06:32:50 UTC 2013
[06:33:24] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[07:06:23] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[07:30:23] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:02] !log restarting Zuul for demo purposes :-)
[07:48:10] Logged the message, Master
[07:53:16] apergos: Hi, gerrit got stuck again and refuses to talk to zuul (which is needed for the gerrit/jenkins integration). Could you please restart the gerrit daemon on manganese?
[07:53:24] σεψ
[07:53:25] sec
[07:55:04] qchris: I gave a bit of context on the bug report for history purposes
[07:55:12] apergos: good morning :-]
[07:55:18] hashar: Thanks
[07:55:27] qchris: what struck me is that whenever stream-events is blocked, it is blocked for everyone else, even a new connection.
[07:55:32] done
[07:55:34] please check now
[07:55:50] apergos: Thanks \o/
[07:55:55] yay
[07:56:02] whenever that happened, I tried establishing a new connection with my account. It does not receive any new event either :(
[07:56:06] hashar: Yes, there seems to be something blocked within gerrit
[07:56:29] But I did not yet manage to reproduce reliably.
[07:56:40] But that's on the agenda for this morning :-)
[07:57:47] hashar: Is there some repository that we can use to periodically invoke "recheck" on that is easy on Jenkins tests?
[07:58:28] maybe test/mediawiki
[07:58:32] (As a fix until Chad joins us to install a new gerrit.war)
[07:58:39] Ok. Thanks
[07:58:44] bah it got deleted
[08:00:11] qchris: you can create a test change under integration/zuul-config and spam recheck there :-]
[08:00:27] Ok. I'll try that. Thanks.
[08:00:52] It triggers a YAML linter job https://integration.wikimedia.org/ci/job/integration-zuul-config-yamllint/
[08:01:29] qchris: also I could not find the Gerrit source code we are using. It used to be under operations/gerrit.git
[08:02:00] ^demon: Decided that we run vanilla upstream
[08:02:12] okkk
[08:02:23] In the bottom right of a gerrit page you'll find something like "2.6-rc0-144-gb1dadd2"
[08:02:32] and so the production version shows up a sha1 of gb1dadd2 but that is not in upstream
[08:02:40] So it's the commit starting in b1dadd2
[08:02:46] ..
[08:02:49] g
[08:03:03] not hexadecimal
[08:03:18] 'b1dadd2cc209482f60ca0e52c47e76bc51b87ed7'
[08:03:26] ^ is the full hash
[08:03:48] The string stems from 'git describe'
[08:04:12] And that prefixes the hash with 'g' as in /g/it
[08:04:38] today I learned a new git command "describe" :-]
[08:06:02] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[08:08:11] git log --oneline --no-merges 52fb5ae..b1dadd2 |wc -l
[08:08:12] 38
[08:08:18] we are lucky, only 38 commits to look at :-]
[08:08:22] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 08:08:12 UTC 2013
[08:09:02] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[08:09:22] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 08:09:16 UTC 2013
[08:10:02] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[08:29:35] qchris / hashar : Did you automate that recheck comment yet, or would you need some help?
[08:30:02] siebrand: Done :-)
[08:30:12] cool
[08:31:22] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours
[08:33:12] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 08:33:11 UTC 2013
[08:34:02] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[08:35:18] ah
[08:35:19] Received disconnect from 208.80.154.152: 2: User idle has timed out after 600000ms.
[08:35:20] :-D
[08:35:27] that must be the message zuul can't parse
[08:36:55] qchris: I will create you another repo for the 5 minutes ping
[08:37:18] Ok. Great.
[08:38:40] qchris: test/gerrit-ping owner is ldap/wmf
[08:39:22] Thanks hashar. I am not in ldap/wmf (last time I checked) ... let's see if I can push anyways :-)
[08:40:10] you should be able to create a change against it then add a comment like 'ping'
[08:40:21] Ok. I'll try
[08:43:47] it receives events :-]
[08:44:01] hashar: Now gerrit-wm is spamming #mediawiki with the 'recheck's. Is that ok?
[08:44:13] ahrgh
[08:44:50] seems #mediawiki and #wikimedia-dev are the default ahah
[08:44:55] will amend the hook
[08:44:56] Is there some way for gerrit-wm to ignore me and my comments?
[08:49:11] hashar: besides, jenkins-bot does not seem to act on test/gerrit-ping :-( See
[08:49:14] https://gerrit.wikimedia.org/r/#/c/58060/
[08:49:27] yeah but Zuul receives events nonetheless
[08:49:34] I can create a jenkins job if you want
[08:49:40] Nono.
[08:49:50] If zuul acts, it's ok
[08:50:12] Zuul is just too quick for me to notice :-)
[08:51:24] New patchset: Hashar; "gerrit: no IRC message for test/gerrit-ping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58063
[08:52:18] apergos: sorry to interrupt again but we would need Gerrit to not send notifications for the test/gerrit-ping.git repository. The change is https://gerrit.wikimedia.org/r/58063 :-]
[08:52:43] apergos: you can blindly trust me on this one :-]  I think it just needs a merge + puppet run on manganese
[08:55:53] please give me just two minutes first
[08:56:06] take your time 8)
[08:58:55] morning
[08:59:16] gooood morning !
[09:03:34] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58063
[09:04:01] apergos: Thanks
[09:05:16] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[09:09:03] paravoid: I have officially joined the Debian python module team :-]
[09:09:13] woo!
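[Note: the version string picked apart in the 08:02-08:04 exchange above, "2.6-rc0-144-gb1dadd2", is `git describe` output: the nearest tag, the number of commits since that tag, then "g" plus the abbreviated commit hash. A quick sketch of the commands involved:]

    git describe                    # -> 2.6-rc0-144-gb1dadd2
                                    #    tag, 144 commits since the tag,
                                    #    then 'g' + abbreviated sha1
    git rev-parse b1dadd2           # expands the abbreviation to the full
                                    # b1dadd2cc209482f60ca0e52c47e76bc51b87ed7
    git log --oneline --no-merges 52fb5ae..b1dadd2 | wc -l    # -> 38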
[09:09:15] paravoid: will probably upload my debs tomorrow :-]
[09:09:42] do you have a sponsor?
[09:09:43] just have to figure out my credentials to access the subversion repo, tweak the maintainer field and add myself as uploader
[09:09:53] aren't you my sponsor? ;-]
[09:10:05] haha
[09:10:06] yes I am :)
[09:10:08] ;-]
[09:10:11] so yes
[09:10:16] you can't upload them yourself
[09:10:26] you'll need to commit them to SVN
[09:10:38] then I'll build and upload
[09:10:49] will ping you whenever I have done the commit so
[09:10:59] I thought we could use dput to send the package to some public area
[09:11:06] there's mentors.debian.net
[09:11:10] but no need to
[09:40:49] speaking of deb packages, can someone create operations/debs/libvpx? i want to push a backport with a patch, anything except creating the repo can go via gerrit review i guess
[09:59:22] j^: hi :-]  I can create a git repo named operations/debs/libvpx
[10:02:34] hashar: cool, can i push branches via gerrit (upstream, pristine-tar) or would i need push permissions for those?
[10:02:49] hmm
[10:02:54] http://packages.ubuntu.com/search?keywords=libvpx
[10:03:04] there is a 1.1.0-1 package in ubuntu Quantal and Raring
[10:03:08] maybe we can reuse those
[10:03:47] I am not sure whether you need a git repo
[10:03:59] hashar: yes i started off with the version from quantal imported via git-import-dsc --pristine-tar libvpx_1.1.0-1.dsc and added my patch
[10:04:14] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[10:04:26] ah so that is upstream + a patch :-]
[10:04:45] hashar: https://rt.wikimedia.org/Ticket/Display.html?id=4868
[10:05:14] yes, current release + one patch from git
[10:06:38] was told I should use git here on #wikimedia-operations, so I put it in git, now just a question of where to push it to
[10:09:32] creating creating
[10:13:02] bah I don't know how to create the orphan branches upstream and pristine-tar :(
[10:16:42] j^: I think I screwed it up
[10:16:55] I created the branches master pristine-tar and upstream
[10:17:02] but they all point to the first initial empty commit
[10:17:04] not ideal
[10:17:37] deleted them
[10:20:33] New patchset: J; "Imported Upstream version 1.1.0" [operations/debs/libvpx] (master) - https://gerrit.wikimedia.org/r/58070
[10:21:14] yeah I think we need to be in the upstream branch don't we ?
[10:21:50] ! [remote rejected] 60c5e53cf81d47a0f1d2f4333bc338a2fcf092f8 -> refs/for/upstream (branch upstream not found)
[10:21:50] Change abandoned: J; "target should be upstream" [operations/debs/libvpx] (master) - https://gerrit.wikimedia.org/r/58070
[10:21:50] :(
[10:21:57] I have no idea how to do it
[10:25:06] so https://wikitech.wikimedia.org/wiki/Git-buildpackage only works if one can push changes, using gerrit to set up a repo does not quite work
[10:25:44] * [new branch] 60c5e53cf81d47a0f1d2f4333bc338a2fcf092f8 -> upstream
[10:25:45] Bus error: 10
[10:25:46] oh ho
[10:26:37] j^: bah I have pushed your change as the first revision of the upstream branch
[10:26:49] need to clean that out
[10:29:43] j^: I have created dummy commits for the pristine-tar and upstream branches
[10:29:51] j^: you can restore your change https://gerrit.wikimedia.org/r/58070
[10:30:04] and push its sha1 to refs/for/upstream
[10:30:15] that might update the change to be against the upstream branch
[10:30:52] !log gerrit: created operations/debs/libvpx for j^ . Initialized pristine-tar and upstream branches using empty commits and force push.
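[Note: the !log entry above says the branches were seeded with empty commits and a force push. One way to do that with `git checkout --orphan` is sketched below; this is an assumption about the mechanics, as the exact commands used are not in the log:]

    git checkout --orphan upstream
    git rm -rf --cached . 2>/dev/null || true   # make sure the index is empty
    git commit --allow-empty -m "Seed empty upstream branch"
    git checkout --orphan pristine-tar
    git rm -rf --cached . 2>/dev/null || true
    git commit --allow-empty -m "Seed empty pristine-tar branch"
    git push --force origin upstream pristine-tar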
[10:30:58] Logged the message, Master
[10:31:16] j^: I can even try it for you :]
[10:31:21] Change restored: Hashar; "(no reason)" [operations/debs/libvpx] (master) - https://gerrit.wikimedia.org/r/58070
[10:33:41] cherry picked
[10:33:44] resending
[10:34:00] all this seems way too complicated
[10:34:02] New patchset: Hashar; "Imported Upstream version 1.1.0" [operations/debs/libvpx] (upstream) - https://gerrit.wikimedia.org/r/58071
[10:34:22] and that is a new change huuh
[10:34:51] New review: Hashar; "I have made this change against upstream branch with https://gerrit.wikimedia.org/r/#/c/58070/" [operations/debs/libvpx] (master) - https://gerrit.wikimedia.org/r/58070
[10:35:05] j^: I must agree
[10:35:25] j^: anyway your change is https://gerrit.wikimedia.org/r/#/c/58071/
[10:35:41] and you get dummy branches for master and pristine-tar
[10:36:07] well that was the first of something like 5 commits that
[10:36:39] git-review also keeps messing with my commit history and the branches are no longer in sync
[10:38:42] not sure if getting it through gerrit/review is reasonable for deb packages
[10:39:09] i can see that it should happen for the patch, but review for the upstream tarball?
[10:40:09] in the rt ticket i have a link to the deb files. this is all to 'document' the changes. since i might have to push another package soon i would not mind figuring out a workflow that i can repeat
[10:41:11] New review: Hashar; "Leslie : that is mostly harmless :-]  The reason I added you as a reviewer is because I think you h..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55304
[10:48:28] j^: mortals can't push to operations/debs repositories. So if you want to update an upstream branch you have to submit a change that will be merged by ops
[11:38:37] if the hook adding the Change-Id is installed, I would expect it to work [11:38:38] tried that, does not look like it does [11:38:43] :( [11:50:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [11:50:05] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [11:50:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [11:52:44] <^demon> !log bringing gerrit down for urgent update [11:52:51] Logged the message, Master [11:57:13] <^demon> !log gerrit back, deployed 2.6-rc0-154-gfcdb34b which contains a temporary fix for the stream-events timeout issues. See bug 46917 for info. [11:57:19] Logged the message, Master [12:05:34] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:06:14] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [12:08:44] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 12:08:35 UTC 2013 [12:09:34] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:09:44] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 12:09:39 UTC 2013 [12:10:34] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:10:44] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 12:10:36 UTC 2013 [12:11:34] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:11:34] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 12:11:26 UTC 2013 [12:12:34] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:12:54] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 12:12:47 UTC 2013 [12:13:34] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:32:54] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 12:32:49 UTC 2013 [12:33:15] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [12:33:34] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:35:51] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [12:39:33] bblack--: a little early isn't it [12:39:38] :) [12:39:54] I'm back on central US time at home now :) [12:40:44] :) [12:41:35] So, I got home Saturday night from the airport to find a gargantuan tree had been knocked over by a storm while I was gone last week, right across my driveway. Didn't hit anything important (but crushed a basketball goalpost). 
[12:41:52] ouch [12:41:56] Spent all day yesterday with a chainsaw and a couple friends cutting it up and getting it out of the way [12:42:34] So that was my exercise for the week, now I can just sit in this chair for days and not feel guilty [12:42:54] i've been laying stones/pavement yesterday [12:43:07] so I had trouble fitting my socks this morning [12:43:33] hah [12:56:11] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [13:05:47] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [13:21:37] New patchset: Nemo bis; "Global jobqueue check: mwscript path fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58079 [13:31:07] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [13:31:58] New patchset: Odder; "(bug 41745) Remove ptwiki, ptwikinews from EmergencyCaptcha" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58081 [13:32:57] New review: Nemo bis; "Per bug, nihil obstat." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/58081 [13:40:13] New patchset: Nemo bis; "Prevent gerrit logo from pushing the search bar outside the screen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58082 [13:42:02] New review: Nemo bis; "I've not mentioned https://bugzilla.wikimedia.org/show_bug.cgi?id=36471 because this is only a very ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58082 [13:42:35] ^demon: I hope that patch makes sense [13:48:10] PROBLEM - Varnish traffic logger on dysprosium is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:01:10] RECOVERY - Varnish traffic logger on dysprosium is OK: PROCS OK: 3 processes with command name varnishncsa [14:04:08] PROBLEM - Varnish traffic logger on dysprosium is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [14:04:38] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [14:15:48] hashar: about to merge the cowbuilder stuff [14:15:50] ack? [14:15:57] yeahhhh [14:16:08] New patchset: Faidon; "package-builder learned 'cowbuilder'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382 [14:16:09] paravoid: that is a bit messy though. 
[14:16:20] maybe I should have written a short shell script to generate the images :-] [14:16:30] but hey, it works [14:16:33] I didn't look closely [14:16:58] it's contint material and I trust you enough for that :) [14:17:09] ;-] [14:18:34] paravoid: and you need to merge :-] https://gerrit.wikimedia.org/r/#/c/56382/ [14:18:39] jenkins does not merge for ya [14:18:43] (on ops/puppet [14:19:08] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382 [14:19:15] I know [14:19:28] I rebased and I was waiting for jenkins to give verified [14:20:37] merging the gerrit stuff on sockpuppet to [14:20:39] *too [14:23:58] \O/ [14:26:38] PROBLEM - Varnish HTTP mobile-backend on cp1041 is CRITICAL: Connection timed out [14:27:28] RECOVERY - Varnish HTTP mobile-backend on cp1041 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.643 second response time [14:30:08] RECOVERY - Varnish traffic logger on dysprosium is OK: PROCS OK: 3 processes with command name varnishncsa [14:30:29] New review: Ottomata; "(5 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50452 [14:31:07] I'm listening [14:31:08] :) [14:31:44] New patchset: Ottomata; "Adding puppet-merge for sockpuppet puppet merges." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50452 [14:31:50] hehe [14:32:04] first an easy one: [14:32:11] git clean -dffx -e private [14:32:12] s'ok? [14:32:18] private isn't in sockpuppet's working copy anyway [14:32:20] but just in case [14:32:22] I guess :) [14:32:34] ha, k [14:33:38] uhh, as for flock, sure! [14:34:23] shall I? [14:34:40] we can iterate [14:34:42] i guess it only helps, if someone else is runnign puppet-merge, then the second user shouldn't be able to [14:34:43] if you prefer that [14:36:04] hmm, either way I guess, i'm worried about locks being left behind if someone ctrl-cs or something (haven't used flock much) [14:36:34] yeah let's leave it for later [14:36:56] hmmk [14:37:05] I'll add a TODO comment in the script [14:37:13] have you tested this? [14:37:38] not a lot with the puppet repository, but in lots of different cases with my own test repo [14:37:49] removing, adding, modifying, etc. canceling, etc. [14:38:18] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:44:11] do we still need to build package for the `hardy` distribution ? [14:44:17] it does not have cowbuilder :-] [14:44:35] so I though we could remove hardy from the package building class [14:45:23] no, yes [14:46:10] New review: Faidon; "Seems reasonable, let's iterate if needed." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50452 [14:46:15] ottomata: ^ [14:46:40] COOOOOOLLLL [14:46:51] so, what should I do then, merge it and send an email to #ops explaining? [14:47:10] New patchset: Hashar; "pbuilder: get rid of `hardy` environnement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58090 [14:47:16] merge it, let's start using it and yeah, inform the rest of ops [14:47:17] ;-] [14:47:27] woot [14:47:53] New patchset: Ottomata; "Adding puppet-merge for sockpuppet puppet merges." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50452 [14:48:08] ^ that's just a comment change [14:49:17] New review: coren; "The sense, you are making it." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/58090 [14:50:29] New patchset: Ottomata; "Adding puppet-merge for sockpuppet puppet merges." 
[14:50:41] ^ and that was the rebase
[14:51:09] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50452
[14:51:20] paravoid: removing hardy with https://gerrit.wikimedia.org/r/58090 :-]
[14:54:06] coren: are you about?
[14:54:17] cmjohnson1: I live!
[14:54:45] great...so, changing the cables for you now
[14:54:55] cmjohnson1: /me dances.
[14:55:13] you want both servers connected to both disk shelves...correct?
[14:56:54] cmjohnson1: Exactly
[14:57:06] cmjohnson1: Multi-server rather than multipath. :-)
[14:57:24] okay
[15:00:27] Coren: are you aware of any current issues with project storage? I had a read-only fs /home, and after reboot I can't log in
[15:01:18] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:01:32] mark: It's not so much "current" as it is ongoing. That sounds like you just lost a brick.
[15:02:07] I don't know if Ryan wrote down notes on what he does to fix that. I'll try to find if he did.
[15:04:10] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[15:05:18] coren: should be good but plz check
[15:06:27] hey paravoid, would you have a couple of minutes to look at the kafka external contractor position that I drafted (you should have received the link by email)
[15:06:30] cmjohnson1: I go check now.
[15:10:11] New patchset: Mark Bergsma; "Add device detection to mobile ResourceLoader requests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56774
[15:11:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56774
[15:16:15] cmjohnson1: AFAICT, I only see one shelf from 1001. Lemme check something in the H800 doc.
[15:16:18] thanks mark for your feedback!
[15:18:11] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[15:20:57] coren: 1001 will only have 1 shelf...it's on a different rack....I will have to move it to the same rack as the others
[15:21:10] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:21:17] cmjohnson1: Ah! Which ones did you tie then?
[15:21:22] 1003/1002
[15:21:35] Well, perhaps I should check /those/ then. :-)
[15:22:06] heh..if you want it all tied together ...i can do that...i will just have to move it to a different rack (i have the space)
[15:22:32] cmjohnson1: No, that's all right, I need just the two, it's not really important /which/ two.
[15:26:23] cmjohnson1: 1002 only sees one shelf too. :-(
[15:26:34] ok
[15:26:36] cmjohnson1: They both powered on? :-)
[15:26:41] yep
[15:26:46] Darn.
[15:26:55] Want me to powerdown 1002 before you go play with it?
[15:27:11] if you are on it...go ahead and power it down
[15:27:13] And/or want me to check 1003?
[15:28:14] is off.
[15:29:01] New patchset: Hashar; "pbuilder: cowbuilder image are prefixed with "base-"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58093
[15:32:00] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.01185198473 (gt 8.0)
[15:37:18] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:37:58] PROBLEM - Apache HTTP on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:08] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.247 second response time
[15:38:18] PROBLEM - Apache HTTP on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:28] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:28] PROBLEM - Apache HTTP on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:28] PROBLEM - Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:38] PROBLEM - Apache HTTP on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:48] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.741 second response time
[15:39:08] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.560 second response time
[15:39:19] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.675 second response time
[15:39:28] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.174 second response time
[15:39:28] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.082 second response time
[15:39:38] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.808 second response time
[15:40:58] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:41:48] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 742 bytes in 2.725 second response time
[15:42:33] coren: the controller on 1002 sees 24 disks
[15:43:11] The H800?
[15:43:30] Because the H700 is internal. :-)
[15:44:01] So you should see 36 disks total, 12 on the H700, 24 on the H800. :-)
[15:44:22] Unless Ryan is wrong and the server doesn't have 12 disks itself.
[15:44:34] the h800
[15:44:48] the server does not have 12 disks
[15:44:56] it has 8
[15:45:50] but to confirm 24 on the h800..i left in raid cfg on 1002 if you want to connect
[16:08:41] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 16:08:39 UTC 2013 [16:08:42] labsDB! [16:09:01] cmjohnson1: Have you been playing the the /database/ rather than the /storage/ servers? :-) [16:09:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:09:36] New patchset: Ottomata; "Fixing echo command in puppet-merge" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58094 [16:09:51] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 16:09:41 UTC 2013 [16:09:55] cmjohnson1: ... I didn't specify which, did I? [16:10:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:10:33] cmjohnson1: and you didn't ask... Half a trout each, then? :-) [16:10:37] we are working on 2 diff systems coren...no ..labs1001...which I just connected to disk shelves is what I did...ticket was a bit confusing...so labstores...not labsdb [16:10:41] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 16:10:37 UTC 2013 [16:10:50] yep...so let's fix this now... [16:11:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:11:16] Sorry for the double work. At least we didn't flub this on stuff that was already in production! :-) [16:11:20] i will have to revert the labsdb cabling... [16:11:28] right!.....silver lining [16:11:31] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 16:11:27 UTC 2013 [16:11:54] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58094 [16:12:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:12:12] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 16:12:10 UTC 2013 [16:13:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:13:22] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 16:13:17 UTC 2013 [16:14:12] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:17:01] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [16:22:09] coren: there you go! all finished...checked labstore1002 and i see all 36 disks 12 on h700 and 24 on h800 [16:23:21] New patchset: Faidon; "pbuilder: get rid of `hardy` environnement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58090 [16:23:26] Coren: +2 but not merge? [16:23:45] New patchset: Mark Bergsma; "Set 16 GB malloc storage for dysprosium's frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58095 [16:23:50] cmjohnson1: Success! Thanks! (BTW, which two did you tie in the end?) [16:24:02] paravoid: Oh, needed verify. Sorry I didn't notice. Gimme a sec. [16:24:21] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:24:21] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58090 [16:24:46] New patchset: Faidon; "pbuilder: cowbuilder image are prefixed with "base-"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58093 [16:25:06] New patchset: Mark Bergsma; "Set 16 GB malloc storage for dysprosium's frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58095 [16:25:16] paravoid: Ah, already merged. 
[16:25:17] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58093
[16:25:59] paravoid: I'm used to Jenkins giving the Verified+2; I came in before /it/ did. :-)
[16:26:00] coren: labstore 1002/1
[16:26:01] New patchset: Mark Bergsma; "Set 16 GB malloc storage for dysprosium's frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58095
[16:26:13] ottomata: I just used puppet-merge
[16:26:25] cmjohnson1: Thanks a bundle.
[16:26:25] worked, although I'd prefer using a pager instead of scrolling
[16:26:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58095
[16:26:34] yep..sorry for the confusion
[16:26:34] but I guess that's a matter of taste
[16:26:48] same here. Half a trout each for dinner. :-)
[16:27:03] I just used puppet-merge too
[16:27:13] works, well done :)
[16:27:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:28:04] yay!
[16:28:07] yeah i used it too
[16:28:36] I just wrote up this:
[16:28:36] https://wikitech.wikimedia.org/wiki/Puppet_usage#Updating_operations.2Fpuppet_on_production_nodes
[16:28:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.104 second response time
[16:28:45] I added a bunch of stuff about modules and submodules, although it's not quite finished yet
[16:28:48] so i'm not going to email it out yet
[16:28:49] Sure! Make the process simple *after* I had to stumble figuring out the previous one! :-)
[16:28:54] i want us to use puppet-merge for a day or two as well
[16:29:09] and work out the submodule issue (which will probably be me looking into patching the gerrit replication thing)
[16:33:01] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 16:32:53 UTC 2013
[16:33:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[16:33:12] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[16:36:09] New patchset: Mark Bergsma; "Add HTTP header X-WAP for MobileFrontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32866
[16:36:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32866
[16:51:36] New patchset: Asher; "adding db1058 (precise, mysql 5.1-fb) to s1 for testing" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58097
[17:03:55] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58097
[17:05:01] mark: which project is showing read only?
[17:05:11] !log asher synchronized wmf-config/db-eqiad.php 'adding db1058 to s1 at a low warmup rate'
[17:05:20] Logged the message, Master
[17:06:50] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[17:09:21] !log asher synchronized wmf-config/db-eqiad.php 'adding db1058 to s1 at full weight'
[17:09:27] Logged the message, Master
[17:11:11] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[17:11:33] Ryan_Lane: varnish
[17:12:06] well done dude - forcing me to feel the lab storage pain ;p
[17:13:00] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[17:14:24] mark: :D
[17:14:35] well, gluster is actually relatively stable right now
[17:14:58] of course, that's excluding that some projects have a hung brick
[17:15:10] PROBLEM - SSH on labstore1001 is CRITICAL: Connection refused
[17:15:20] PROBLEM - Disk space on labstore1001 is CRITICAL: Connection refused by host
[17:15:20] PROBLEM - RAID on labstore1001 is CRITICAL: Connection refused by host
[17:15:20] PROBLEM - DPKG on labstore1001 is CRITICAL: Connection refused by host
[17:17:10] so that just means you're unlucky ;)
[17:17:14] New patchset: Ottomata; "mod. use get_project_host_map method to generate map for project to host key." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56576
[17:17:18] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56576
[17:17:56] :D
[17:17:59] is it fixable?
[17:18:16] yes, and I just fixed it
[17:18:26] thank you
[17:18:27] I found the hung brick and killed it
[17:18:41] then restarted the volume
[17:18:45] woot
[17:18:46] works again
[17:19:10] I just finished a custom precise image too
[17:19:27] instance creation now takes 1-3 minutes. it used to take 6-11 minutes
[17:19:38] nice
[17:19:59] yep. it doesn't take a full puppet run to login anymore
[17:21:14] RobH: what's with the permissions of the php- dirs in /usr/local/apache/common-local on terbium?
[17:27:11] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:27:13] Ryan_Lane: Did you jot down notes on what to do if that happens, that I can follow to fix it when I get to it first?
[17:27:43] Coren: gluster volume status
[17:28:01] which will be: <project>-home
[17:28:10] or
[17:28:18] <project>-project
[17:28:26] find the process ids
[17:28:37] kill them on the appropriate hosts
[17:28:39] just sigterm?
[17:28:43] gluster volume start <volume> force
[17:28:47] yes. but...
[17:28:55] check the pid after
[17:29:00] usually one of the two doesn't die
[17:29:36] That one then needs to be killed harder.
[17:30:10] aww damn
[17:30:34] the week in hell starts now.
[17:30:48] lol
[17:31:00] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[17:31:05] someone is sounding like Cpt. Janeway
[17:31:07] mutante, did the guy reply to #822?
[17:31:32] everyone fetch your wishlists for ops requests
[17:31:36] Coren: yeah. -9 usually
[17:32:00] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.31374697674 (gt 8.0)
[17:32:10] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[17:32:18] Nemo_bis: you can, but it has to go in an RT ticket.
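[Note: consolidating the hung-brick procedure Ryan walks through above, as a sketch; <project>, the volume suffix and the pids are placeholders:]

    gluster volume status <project>-home        # or <project>-project; lists
                                                # the brick hosts and brick pids
    kill <brick-pid>                            # SIGTERM on the relevant host
    # re-check the pid afterwards: usually one brick survives SIGTERM,
    # and that is the hung one
    kill -9 <brick-pid>                         # kill that one harder
    gluster volume start <project>-home force   # respawn the brick processes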
[17:32:38] thinking i would actually touch any new tickets, when I have two week old tickets to process, i may hit it by friday ;]
[17:32:45] * RobH is only being slightly sarcastic
[17:39:11] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[17:53:14] urgh, why can my rt user still delete?!?
[17:53:24] * RobH logs in as admin to reduce his own account
[17:54:54] New review: Kaldari; "Agreed. In the long-term we should make the footer links work the same way the sidebar links do and ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649
[17:58:17] New review: RobH; "Please see discussion in RT 4735." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53861
[18:01:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:02:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.159 second response time
[18:03:10] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[18:04:19] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.22wmf1
[18:04:27] Logged the message, Master
[18:05:31] New patchset: Reedy; "enwiki to 1.22wmf1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58115
[18:05:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58115
[18:11:20] New patchset: RobH; "adding matthew walker to deploy access, if you break it, you buy it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58117
[18:12:15] oh damn it i forgot to include the rt in changeset description.
[18:12:54] New patchset: RobH; "RT 4747 adding matthew walker to deploy access, if you break it, you buy it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58117
[18:13:10] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:14:07] New review: RobH; "picard" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/58117
[18:14:08] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58117
[18:14:44] RobH: make it so!
[18:14:55] LeslieCarr: YES someone got it, \o/
[18:15:23] i'd link to an image, but those arent exactly fair use I suppose for inclusion in gerrit ;]
[18:15:42] hehehe
[18:16:46] speaking of that, are you going to http://www.cinemark.com/star-trek-the-next-generation-the-best-of-both-worlds ?
[18:17:40] didnt know it existed
[18:18:01] ahhh haha, i would go to that!
[18:18:02] haha
[18:18:16] oh! ottomata you might be able to get some nyc tix
[18:18:19] sold out in the city
[18:18:26] i am so excited!
[18:18:39] man i've seen that one so many times though!
[18:18:47] so gooood
[18:18:50] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: No successful Puppet run in the last 10 hours
[18:18:53] hrmm
[18:22:21] New patchset: Andrew Bogott; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026
[18:22:28] mutante: after many attempts here, at last, is an rt labs machine that came up straight from puppet: http://rt-testing12.pmtpa.wmflabs/
[18:22:35] Now if I can remember what we were going to test...
[18:25:14] New patchset: Ottomata; "Removing unused udp2log filters on emery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58119
[18:26:40] New patchset: Ottomata; "Removing unused udp2log filters on emery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58119
[18:26:46] andrewbogott: that's great. thanks!:)
[18:26:52] Leslie it is in NYC!~
[18:26:59] ohboyohoby
[18:27:03] andrewbogott: the upgrade i guess
[18:27:31] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58119
[18:27:53] notpeter: can you fix the common-local dir perms and whatnot on terbium?
[18:28:03] :)
[18:28:04] mutante, we were talking about whether it was safe to apply to production. I think it should be now since the db-initialize phase is now marked to only run if there's no existing db.
[18:28:16] But… probably worth thinking about that a bit more before we merge it
[18:29:04] hey ops, quick question: vanadium.eqiad.wmnet should be reachable by all apaches, correct? (i'd like to make it the udp log destination for apache-generated events; currently it's emery.)
[18:29:31] andrewbogott: ok, cool, agreed
[18:30:08] to be more specific: apache-generated eventlogging events, which is a tiny subset of all udp logging that comes from the apaches
[18:30:51] AaronSchulz: sure
[18:31:50] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours
[18:31:54] ori-l, i think it is the only udp2log that comes from apaches
[18:32:01] all the other logs are from frontend caches or nginx
[18:32:11] notpeter: "sudo -u apache php /usr/local/apache/common-local/multiversion/MWScript.php eval.php testwiki" should work when everything is cleaned up
[18:32:16] oh there might be some mediawiki stuff that we don't know much about
[18:32:17] hm
[18:32:20] ottomata: no, there's all the stuff that goes to fluorine, like the exception log etc
[18:32:28] well, at least test2wiki :)
[18:32:30] AaronSchulz: they're kinda fucked in a lot of ways
[18:32:32] ah hm, cool,
[18:32:35] k
[18:32:58] should a sync-common as root work?
[18:33:06] you can try it, I don't know how we initialize boxes for the first time
[18:33:52] AaronSchulz: ok, cool
[18:33:52] !log deployed 1.22 on payments
[18:33:59] Logged the message, Master
[18:38:40] Jeff_Green. hey hey hey locke bye bye?
[18:41:59] ottomata: pas.
[18:42:00] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 9.37440068182 (gt 8.0)
[18:42:01] New patchset: Ori.livneh; "$wgEventLoggingFile: emery => vanadium" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58122
[18:42:18] ottomata: we got held up on netapp replication needing to be reversed
[18:42:42] berrr, hmk?
[18:42:43] i had missed the fact that the new host is at the other datacenter, where we're r/o because of replication
[18:42:59] hmmm
[18:43:13] ma.rk and I were going to fix it last week but I got sick then there were distractions
[18:43:18] what's that mean for us then, hard to do? i mean, we can give you locke and it can be your dedicated FR udp2log host :p
[18:43:18] oh ok
[18:45:18] New patchset: Ori.livneh; "Remove AFT ClickTracking set-up from emery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58123
[18:46:00] nice, thanks Ori!
[18:46:14] New patchset: Aaron Schulz; "Made mwscript use /usr/local/... if there is no source (e.g. "/home/.." dir)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58124
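[Note: change 58124 above makes mwscript fall back to the deployed tree when there is no source checkout. A hedged sketch of that behaviour, not the actual patch; the variable name is an assumption, and the paths come from the surrounding discussion and from /usr/local/lib/mw-deployment-vars.sh mentioned later in the log:]

    . /usr/local/lib/mw-deployment-vars.sh
    if [ -d "$MW_COMMON_SOURCE" ]; then           # e.g. a /home/... checkout
        MW_DIR="$MW_COMMON_SOURCE"
    else                                          # no source on this host
        MW_DIR="/usr/local/apache/common-local"
    fi
    sudo -u apache php "$MW_DIR/multiversion/MWScript.php" eval.php testwiki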
[18:46:16] ottomata: thank you
[18:46:18] hi hashar
[18:46:58] hello ori-l
[18:47:14] New review: Ori.livneh; "-2ing to ensure this is not merged by accident before Ie8097d64a." [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/58122
[18:48:02] ottomata: it's just a longer window of disruption that I have to coordinate, that's all
[18:48:12] ok cool
[18:48:16] I'm up for doing it this week if the fr folks don't throw anything conflicting at me
[18:48:49] cool, danke
[18:49:12] i'm not too worried about it, all of our stuff is over on gadolinium now, so I don't have anything left to do cept decommission locke
[18:49:24] but analytics folks will ask me in our standup how it goes
[18:49:27] thus I ask you too :)
[18:56:13] New patchset: Demon; "Set gerrit idle timeout to 10 days" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58126
[18:57:27] <^demon> Ryan_Lane: Can you please look at ^
[18:58:07] <^demon> We've got some problems with mina sshd, and this should help for now pending a nicer fix.
[18:58:50] ^demon: I'm fine with the change, assuming you've tested it
[18:59:16] <^demon> Well, right now we're defaulting to 0, which is causing a buffer overrun in mina :\
[18:59:19] hm
[18:59:22] <^demon> So technically this is lowering the timeout.
[18:59:23] this may be excessive
[18:59:39] 240 hours?
[19:00:14] <^demon> Well, it needs to be some time > maximum time between patches.
[19:00:23] 10 days? :D
[19:00:24] !log removing some unused filters from emery's udp2log instance
[19:00:31] Logged the message, Master
[19:00:46] <^demon> Well, probably at least > 24h, to cover weekends when we're slow.
[19:00:55] <^demon> How does 72h sound?
[19:01:05] oh. this is idle
[19:01:09] <^demon> Yes :)
[19:01:13] <^demon> Idle timeout.
[19:01:15] so if a person drops the connection it goes away
[19:01:22] <^demon> Yeah.
[19:01:29] this is fine, then
[19:01:30] <^demon> This is so stream-events doesn't time out between events.
[19:01:32] <^demon> Which is breaking zuul.
[19:01:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58126
[19:01:47] <^demon> ty.
[19:02:22] yw
[19:03:11] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[19:06:03] Speaking of Gerrit
[19:06:07] It's 503ing right now
[19:06:12] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[19:06:38] <^demon> Restarting to deploy config change.
[19:06:41] <^demon> Queue was empty.
[19:06:47] <^demon> :)
[19:06:55] * RoanKattouw hands ^demon a !log ;)
[19:07:08] <^demon> !log gerrit restarting, config change to fix idle timeouts.
[19:07:15] Logged the message, Master
[19:07:41] <^demon> !log gerrit back
[19:07:48] Logged the message, Master
[19:07:49] * ^demon grins at RoanKattouw ;-)
[19:07:58] hashar: Re https://gerrit.wikimedia.org/r/#/c/56637/ , should we take away the V+2 right from humans?
[19:08:02] PROBLEM - Disk space on rdb1 is CRITICAL: NRPE: Command check_disk_space not defined
[19:08:13] PROBLEM - RAID on rdb2 is CRITICAL: NRPE: Command check_raid not defined
[19:08:18] <^demon> RoanKattouw: For core? YES
[19:08:22] PROBLEM - RAID on rdb1 is CRITICAL: NRPE: Command check_raid not defined
[19:08:42] PROBLEM - DPKG on rdb2 is CRITICAL: NRPE: Command check_dpkg not defined
[19:08:52] PROBLEM - Disk space on rdb2 is CRITICAL: NRPE: Command check_disk_space not defined
[19:08:52] PROBLEM - DPKG on rdb1 is CRITICAL: NRPE: Command check_dpkg not defined
[19:09:57] I was wondering why my git review was taking 10 minutes
[19:10:16] <^demon> 10 minutes?
[19:10:19] <^demon> Gerrit was down for <1m
[19:10:51] No, it was down for much longer
[19:10:55] At least 5
[19:11:56] that is because of the frontend apache I guess
[19:12:11] RoanKattouw & hashar, please revoke +2 from humans only for mw/core master; deployment and release branches should be mergeable even without tests
[19:12:17] Yes
[19:12:19] They should be
[19:12:44] then people break the tests
[19:12:57] In deployment branches?
[19:13:10] Jenkins doesn't even run the tests correctly in those branches 80% of the time
[19:13:48] <^demon> We should fix that then.
[19:15:42] AaronSchulz: sorry, got distracted, but perms of /usr/local/apache/common are the same on hume and terbium now
[19:15:46] should be fixed
[19:16:20] <^demon> RoanKattouw: If we've got problems with test infrastructure, the proper thing to do is fix that. Continuing to override jenkins just causes people not to trust jenkins, which reduces its utility imho.
[19:16:25] hashar, production branches are different in that we sometimes need to change them really quickly, so quickly that we've no time to fix/update tests
[19:16:56] MaxSem: sorry I don't have the time to talk about it. Please raise the issue on a mailing list :-]
[19:17:16] ^demon: I agree. So then we either fix Jenkins first, then take away V+2 everywhere, or we take away V+2 selectively and fix Jenkins later. Just as long as we don't take V+2 from places where Jenkins isn't reliable yet, that's all
[19:17:29] <^demon> I disagree with that assumption. If something needs deploying that fast, they'll typically deploy directly from fenari anyway and not bother with gerrit.
[19:17:44] <^demon> :)
[19:18:32] <^demon> I'd say take away V+2 from mediawiki group on core, maybe hold off on wmf-deployment group until jenkins is fixed for deployment branches.
[19:18:41] Yeah exactly
[19:18:54] and make sure the release branches we maintain pass too :)
[19:19:00] <^demon> But bring it up on list :)
[19:19:14] <^demon> Tests are important. I don't want people to ignore tests...we had too many years of that :)
[19:19:22] hashar: Mind if I put in a commit that just fixes the jsduck warning rather than reverting the whole commit?
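[Note: gerrit.config uses git-config syntax, so the effect of change 58126 above can be sketched as below. The file path is an assumption; the option is Gerrit's documented sshd.idleTimeout, whose default of 0 ("never time out") was what tickled the mina bug:]

    git config --file /var/lib/gerrit2/review_site/etc/gerrit.config \
        sshd.idleTimeout 240h   # 10 days, so an idle stream-events
                                # connection survives a quiet weekend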
too late [19:19:39] I don't care about fixing other people's stuff [19:19:43] if it is broken, I blindly revert [19:19:44] period [19:19:45] :D [19:19:47] That's fine [19:19:50] that is rude but saves a ton of time [19:19:51] I'll unrevert+fix [19:19:58] doh [19:20:00] ;-] [19:20:01] Don't worry about it, this is my MO too :) [19:20:22] RoanKattouw: just cherry pick https://gerrit.wikimedia.org/r/#/c/56637/ :-] [19:20:31] Yeah [19:20:39] change the Change-Id and done [19:20:43] but you know the story :-] [19:21:56] reopened bug 46401 too [19:22:00] * MaxSem kills a couple of tests to counteract ^demon's TDD oppression :P [19:22:34] the stupid parser tests really need to be improved [19:22:37] they are sooo slow [19:23:20] New patchset: Ram; "Bug: 43663 Fix bad OAI repo URL for Ukraine wiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58133 [19:23:42] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:28:06] New review: Andrew Bogott; "Logged for the upstream, here:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57426 [19:28:52] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:29:00] New review: Hashar; "Thank you! :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57426 [19:29:42] Ryan_Lane, can I get a +2 for https://gerrit.wikimedia.org/r/#/c/57426/ ? [19:29:49] looking [19:30:09] New review: Pyoungmeister; "thank you for the bug fix~!" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/58133 [19:30:10] New patchset: Ryan Lane; "Remove gluster's broken logrotate script." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57426 [19:30:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58133 [19:30:44] I can't merge it [19:30:56] New patchset: Ryan Lane; "Remove gluster's broken logrotate script." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57426 [19:31:04] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57426 [19:31:06] I had to rebase it [19:31:07] twice [19:31:09] how annoying [19:31:12] huh [19:31:16] ff-only on the repo is incredibly annoying [19:31:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:29] we could use cherry-pick :-D [19:31:37] heh [19:31:40] New patchset: RobH; "move account awight from admins::restricted to admins::mortals (RT-4819)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55918 [19:31:41] but then the commit sha1 that lands in the repo is not the same as in the Gerrit interface [19:31:44] which can be troublesome [19:31:51] I have a feeling there was a good reason that mark switched it [19:31:57] Ryan_Lane, this one is also languishing: https://gerrit.wikimedia.org/r/#/c/43886/ [19:32:01] <^demon> Ryan_Lane: He didn't want merge commits :) [19:32:02] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.22180069767 (gt 8.0) [19:32:03] xyzram: ^demon I'll go through the process of restarting the various search nodes in 30 minutes once puppet has run on them [19:32:09] <^demon> Cool beans. [19:33:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [19:33:14] notpeter: just an FYI or related to some specific issue ?
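[Editor's note: the blind-revert-then-unrevert workflow hashar and RoanKattouw settle on above is mechanical: cherry-pick the reverted commit back and strip its Change-Id so Gerrit opens a fresh review. A sketch of the steps, with a placeholder commit hash and branch name:

    # Recover a reverted change for fixing (hash and branch are placeholders).
    git checkout -b unrevert-fix origin/master
    git cherry-pick <sha1-of-original-commit>
    # Delete the old Change-Id footer in the editor; the commit-msg hook
    # generates a new one, so Gerrit treats this as a brand-new change.
    git commit --amend
    git review
]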
xyzram: the patchset you just submitted :) [19:33:59] andrewbogott: any reason those two site files can't be a template? [19:34:11] Oh, ok (that was quick!) [19:34:24] notpeter: does https://gerrit.wikimedia.org/r/#/c/58124/ look ok? [19:34:37] Ryan_Lane: Nope. I'll do that. [19:34:42] cool. thanks [19:34:42] garg. note to self: never reboot office phone just before a scheduled candidate interview... [19:35:41] notpeter: Can you also rebuild the uawikimedia indices ? (How do you normally do this BTW ?) [19:36:02] AaronSchulz: I'm not sure where those vars are being set, but generally, yes. looks good to me. shall I merge? [19:36:52] /usr/local/lib/mw-deployment-vars.sh which is a puppet template filled with those vars imported from some manifest [19:37:29] xyzram: sure. it's the same process as just building a new index [19:38:01] (there are some other ways that are needed for big wikis, but just doing a fresh import is the easiest) [19:38:12] AaronSchulz: cool! sounds good. I'll deploy now [19:38:15] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55918 [19:39:14] New patchset: Pyoungmeister; "Made mwscript use /usr/local/... if there is no source (e.g. "/home/.." dir)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58124 [19:39:33] AaronSchulz: (had to rebase...) [19:40:30] it works in testing [19:40:44] notpeter: can you do a run on terbium? [19:41:08] AaronSchulz: yep, one sec [19:41:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58124 [19:42:02] AaronSchulz: doing so now [19:43:46] AaronSchulz: puppet run done on terbium [19:43:46] New patchset: Ottomata; "Prepping for emery upgrade to precise." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58136 [19:43:56] \o/ [19:46:20] mark and binasher, during tomorrow's mobile deployment (starting at 1pm PST) we will roll out everything needed for new caching. if everything goes well, we can attempt enabling it for testwiki. will you guys be around? [19:47:32] MaxSem: I'm not sure if mark's around now, but he mentioned during the meeting that he won't be available and I think it was agreed to move it to next week [19:47:41] and asher mailed that he's on vacation [19:47:47] whee [19:48:12] jon was present and I thought he'd coordinate with you [19:48:25] I might be paraphrasing though, so better coordinate with the right people [19:48:28] yeah, he told me to communicate with mark [19:48:32] haha [19:50:00] so yeah, we could try enabling it on testwiki for some time just to see if it looks to be working, but we won't keep it on unless there's a varnish guru around [19:50:50] robh: mind if I take this from you? rt4714 [19:57:50] New review: Hashar; "I have removed the patch from the instance." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55406 [20:01:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:02:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [20:02:49] heya mark, is it possible that there are varnishd processes on the mobile frontends that haven't been restarted (or re-read config) in a long time? does varnish have a graceful reload or restart?
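[Editor's note: change 58124 above is easier to follow with the fallback spelled out: mwscript sources /usr/local/lib/mw-deployment-vars.sh (the puppet template AaronSchulz mentions) and only uses the source checkout when it exists. A hypothetical reconstruction; the variable names are assumptions, not the real manifest:

    #!/bin/bash
    # Sketch of the mwscript fallback in 58124 (variable names assumed).
    . /usr/local/lib/mw-deployment-vars.sh
    if [ -d "$MW_COMMON_SOURCE" ]; then
        MW_DIR="$MW_COMMON_SOURCE"           # staging copy, e.g. under /home
    else
        MW_DIR="/usr/local/apache/common"    # deployed tree when no source dir exists
    fi
    exec php "$MW_DIR/multiversion/MWScript.php" "$@"
]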
[20:03:14] we are seeing occasional requests that should have been tagged with X-Analytics/X-CS that aren't [20:03:30] even some that are from the same IP to the same frontend varnish node [20:04:24] New review: Ottomata; "-2ing this until the emery upgrade is complete." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/58136 [20:07:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:09:12] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 20:09:08 UTC 2013 [20:10:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:10:21] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 20:10:11 UTC 2013 [20:10:39] ottomata, in principle `service varnish reload` should do that w/o flushing the caches, however I dunno if there's anything tricky about doing it in our infrastructure [20:10:45] https://www.varnish-software.com/static/book/Getting_started.html [20:11:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:11:21] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 20:11:13 UTC 2013 [20:12:04] ok, thanks MaxSem [20:12:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:12:11] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 20:12:05 UTC 2013 [20:13:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:13:41] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 20:13:38 UTC 2013 [20:13:54] * jeremyb_ grumbles at rt 822 [20:14:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:14:11] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 20:14:04 UTC 2013 [20:15:01] !log aaron synchronized php-1.22wmf1/extensions/FlaggedRevs 'deployed c61688baef682b7d97068fefdcc399256786a387' [20:15:08] Logged the message, Master [20:15:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:16:14] and i just missed mutante-away by a few mins [20:16:37] binasher: I wonder why ChangeHandler OOMs the runners so much [20:18:02] !log Moved ceph sync scripts to terbium and started a second pass [20:18:08] Logged the message, Master [20:18:53] LeslieCarr: not just emergency, but EMERGENCY [20:18:55] wait what ? [20:19:05] AajaadBhonsda: what is emergency ? [20:19:16] FiberNet maintenance ;) [20:19:18] oh [20:19:19] hehe [20:19:23] you worried me! [20:19:33] my work is done then [20:19:40] AaronSchulz: ceph is sick :( [20:19:50] I'm trying to get ahold of the ceph people [20:19:52] you have to keep ops people on their toes, lest they become complacent [20:23:24] paravoid: unusable? [20:23:32] probably not [20:23:35] hard to say for sure [20:23:58] let me know if you see requests hanging or other issues [20:27:37] New patchset: Aaron Schulz; "Bumped file journal ttl to a year and avoided some duplication." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58213 [20:29:18] paravoid: how is the hardware doing? 
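[Editor's note: on the graceful reload MaxSem points at above: `service varnish reload` essentially drives varnishadm, compiling the new VCL into the running child and switching to it without restarting the daemon or dropping the cache. Roughly, with the stock admin port and secret path, which are not confirmed for these hosts:

    # Compile and load the on-disk VCL under a new name, then activate it.
    varnishadm -T 127.0.0.1:6082 -S /etc/varnish/secret vcl.load reload01 /etc/varnish/default.vcl
    varnishadm -T 127.0.0.1:6082 -S /etc/varnish/secret vcl.use reload01
    # Confirm which VCL is active; stale configs from long-unrestarted
    # varnishd processes would show up in this list too.
    varnishadm -T 127.0.0.1:6082 -S /etc/varnish/secret vcl.list
]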
[20:30:22] we had 9 failed disks [20:30:58] I think cmjohnson1 has replaced all of them but I haven't carefully checked/put them back into the cluster [20:31:06] health HEALTH_ERR 33 pgs inconsistent; 33 scrub errors [20:31:11] this is a bit more important :) [20:31:27] could be data corruption, or some other nasty issue [20:31:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.172 second response time [20:32:45] paravoid: all the disk have been replaced [20:32:51] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Mon Apr 8 20:32:44 UTC 2013 [20:32:56] cmjohnson1: thanks :) [20:33:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:36:21] RECOVERY - search indices - check lucene status page on search17 is OK: HTTP OK: HTTP/1.1 200 OK - 55880 bytes in 0.132 second response time [20:37:22] PROBLEM - search indices - check lucene status page on search1018 is CRITICAL: Connection timed out [20:37:31] PROBLEM - search indices - check lucene status page on search1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:51] PROBLEM - search indices - check lucene status page on search1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:21] PROBLEM - search indices - check lucene status page on search13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:21] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:21] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:21] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:22] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:22] PROBLEM - Apache HTTP on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:22] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:23] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:23] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:31] PROBLEM - Apache HTTP on mw1124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:31] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:31] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:31] PROBLEM - Apache HTTP on mw1118 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:31] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:32] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:32] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:35] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58213 [20:38:41] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:41] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:51] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:51] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:51] PROBLEM - Apache HTTP 
on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:01] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:11] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:21] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:21] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:21] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:21] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:22] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:22] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:22] PROBLEM - Apache HTTP on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:23] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:23] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:24] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:24] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:25] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:25] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:26] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:31] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:31] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:41] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:51] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:51] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:51] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:01] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:03] New patchset: Ryan Lane; "Run OSM's echo notification script signing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58215 [20:40:21] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:22] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:22] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:22] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:22] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:22] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:22] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:23] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:28] Ummm [20:40:31] PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:40] We're seeing timeouts on Wikidata [20:40:41] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [20:40:41] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:41:12] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: Connection timed out [20:41:21] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:41:31] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:42:10] looks like its search-prefix.svc.eqiad.wmnet [20:42:17] yeah [20:42:21] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 2.84651953488 [20:42:23] notpeter: are you on it? [20:42:30] looking, but help appreciated [20:42:35] those nodes are up, is the thing [20:42:49] going to try restarting lucene again [20:43:11] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.134 second response time [20:43:11] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [20:43:11] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.319 second response time [20:43:11] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 742 bytes in 0.321 second response time [20:43:11] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.491 second response time [20:43:12] RECOVERY - LVS Lucene on search-prefix.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [20:43:14] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [20:43:14] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [20:43:14] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [20:43:14] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [20:43:14] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.079 second response time [20:43:15] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time [20:43:15] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.078 second response time [20:43:16] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.085 second response time [20:43:16] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.079 second response time [20:43:17] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.072 second response time [20:43:17] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2759 bytes in 0.078 second response time [20:43:18] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.076 second response time [20:43:18] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.086 second response time [20:43:19] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.090 second response time [20:43:19] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.094 second response time [20:43:20] RECOVERY - 
Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [20:43:20] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [20:43:21] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.050 second response time [20:43:21] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [20:43:22] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [20:43:22] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time [20:43:23] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.073 second response time [20:43:23] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.074 second response time [20:43:24] that's fucking ridiculous [20:43:24] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.079 second response time [20:43:24] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.078 second response time [20:43:25] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.567 second response time [20:43:25] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.568 second response time [20:43:26] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [20:43:26] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 742 bytes in 0.062 second response time [20:43:26] would that have been related to wikidata? 
[20:43:27] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [20:43:27] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [20:43:28] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.073 second response time [20:43:28] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time [20:43:29] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.071 second response time [20:43:29] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.078 second response time [20:43:30] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [20:43:30] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [20:43:31] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [20:43:31] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time [20:43:32] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [20:43:32] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.091 second response time [20:43:33] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.962 second response time [20:43:33] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.071 second response time [20:43:36] or at least the wikidata timeouts? 
[20:43:41] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.072 second response time [20:43:41] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [20:43:41] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [20:43:41] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [20:43:41] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [20:43:42] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time [20:43:51] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [20:43:51] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [20:43:55] search is down == every apache is down [20:43:59] I just love this [20:44:20] paravoid: not even search, but the search autocomplete feature specifically ;) [20:44:22] RECOVERY - search indices - check lucene status page on search1017 is OK: HTTP OK: HTTP/1.1 200 OK - 60075 bytes in 0.033 second response time [20:44:25] but yeah, this makes me really sad [20:45:20] this is a case where even the timeout for waiting for the connection to the search prefix hosts was taking a long time and the timeouts aren't managed well by MWSearch / mediawiki [20:45:26] connect(43, {sa_family=AF_INET, sin_port=htons(8123), sin_addr=inet_addr("10.2.2.15")}, 16) = -1 EINPROGRESS (Operation now in progress) [20:45:27] poll([{fd=43, events=POLLOUT|POLLWRNORM}], 1, 1000) = 0 (Timeout) [20:45:28] poll([{fd=43, events=POLLOUT|POLLWRNORM}], 1, 1000) = 0 (Timeout) [20:45:29] poll([{fd=43, events=POLLOUT|POLLWRNORM}], 1, 1000) = 0 (Timeout) [20:45:30] yeah [20:45:30] etc... [20:46:40] the apaches i was watching were all hanging on just connecting to search-prefix up until the timeout [20:46:58] at a request per keystroke.. [20:47:16] New patchset: Andrew Bogott; "Added a basic nginx module and two (labs) use cases." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [20:47:56] can't we farm off those requests to a smaller set of apaches? [20:48:03] so that it doesn't take down the entire site? [20:48:03] they used to go to apis [20:48:11] that will also take down the site, though [20:48:17] woooooo [20:48:24] AaronSchulz: I'm running a deep scrub on all data [20:48:27] it should be its own (small) cluster [20:48:46] AaronSchulz: so it's going to take some time, during which it won't be very fast or possibly even responsive [20:48:51] we should put varnish in front of the search prefix pool [20:49:33] that would also be really helpful. how to invalidate results, though? [20:49:37] cache queries relatively briefly, but also let varnish be aggressive about 503'ing requests from mediawiki [20:49:47] ah [20:49:48] who cares about invalidation [20:50:03] yeah, if you're just caching for a short period of time, that works [20:50:14] it's not like search indexing is real time [20:50:19] cmjohnson1: 4714 is all yours [20:50:24] it sounds like a nice idea, but it also sounds like papering over a problem in another layer :) [20:50:27] mediawiki specifically [20:50:28] k' [20:50:41] paravoid: welcome to ops.
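[Editor's note: binasher's strace above is the whole story: connect() to the search-prefix VIP returns EINPROGRESS and the apaches then poll in one-second slices until the overall timeout expires. The same hang can be checked from any apache's shell by bounding the connect phase separately from the whole request; the VIP and port are taken from the strace, the URL path is a placeholder:

    # ~2s followed by a failure here means the hang is in connect(), not in
    # a slow backend response - the distinction that mattered in this outage.
    time curl -s -o /dev/null --connect-timeout 2 --max-time 10 http://10.2.2.15:8123/
]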
heheh [20:51:03] well, I'd like to compartmentalize it so that it fails separately from everything else [20:51:15] paravoid: deep scrub? [20:51:26] I couldn't imagine it would need much resources to split it into another apache cluster [20:51:45] AaronSchulz: essentially validating that all three copies of each pg contain the same data, i.e. that there's no data corruption [20:51:58] I see [20:53:27] !log aaron synchronized wmf-config/filebackend.php 'Bumped file journal ttl to a year' [20:53:34] Logged the message, Master [20:56:11] RECOVERY - search indices - check lucene status page on search13 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.111 second response time [20:57:51] PROBLEM - search indices - check lucene status page on search1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:11] RECOVERY - search indices - check lucene status page on search14 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.110 second response time [20:58:21] PROBLEM - search indices - check lucene status page on search1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:29] mediawiki's http class implements overall timeouts via $this->curlOptions[CURLOPT_TIMEOUT] = $this->timeout; which gets set to 10 seconds for search, but it provides no facility for setting CURLOPT_CONNECTTIMEOUT_MS which would have prevented this outage [20:58:41] RECOVERY - search indices - check lucene status page on search1012 is OK: HTTP OK: HTTP/1.1 200 OK - 504 bytes in 0.002 second response time [20:59:04] should only be a few lines of php in each of core and the MWSearch extension [20:59:11] RECOVERY - search indices - check lucene status page on search1011 is OK: HTTP OK: HTTP/1.1 200 OK - 504 bytes in 0.009 second response time [20:59:31] i'm going offline for a bit, notpeter want to open a bugzilla ticket with the above ^^ ? [20:59:35] binasher: hhhhmmm, that would be most excellent :) [20:59:47] uh, sure :) [20:59:52] muhaha [20:59:57] <^demon> We haven't deployed the poolcounter changes either. [21:00:00] binasher: well played [21:00:05] ;) [21:00:07] ;) [21:00:11] <^demon> That'd help keep the apaches from stampeding the search indices. [21:00:58] paravoid: can't lvs be set to limit the number of connections per realserver too? [21:01:27] I don't know offhand, let me check [21:02:08] probably not [21:02:20] I don't think LVS would RST [21:04:24] paravoid: see the --u-threshold option in ipvsadm [21:04:31] yeah I saw that [21:04:40] I'm not sure what would happen if all realservers reached their threshold though [21:04:59] if it'd send an RST or just blackhole packets [21:05:23] ah [21:07:11] if (mark->cl == p && mark->cw == mark->di) { [21:07:12] /* back to the start, and no dest is found.
It is only possible when all dests are OVERLOADED */ [21:07:14] dest = NULL; [21:07:20] ip_vs_scheduler_err(svc, [21:07:23] "no destination available: " [21:07:25] "all destinations are overloaded"); [21:07:28] goto out; [21:07:31] } [21:07:35] out: [21:07:35] write_unlock(&svc->sched_lock); [21:07:35] return dest; [21:07:45] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [21:08:50] <- decom'ed and already ran the clean puppet resources thing [21:13:28] binasher: I'm still not sure, it /might/ send an icmp port unreachable [21:16:58] !log temp stopping slave on db55 [21:17:05] Logged the message, notpeter [21:21:18] paravoid: "Goto out;" really? :-) [21:21:29] yeah, why not? [21:21:47] it's fairly common across the kernel (or any sizeable C codebase) [21:22:16] paravoid: I know. Grates me every time. It's a symptom of poor factorization, as a rule. :-) [21:22:23] no it's not [21:22:29] perfectly fine practice [21:23:03] * notpeter breaks out some popcorn [21:23:21] goto
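[Editor's note: the ip_vs excerpt pasted above is the scheduler path taken when every realserver is past its upper threshold; per the discussion it is still unclear whether the packet is then dropped or answered with an ICMP port unreachable. The thresholds themselves are set per real server, so capping prefix-search connections would look roughly like this, with an illustrative VIP, RIP and limits:

    # -x / --u-threshold caps connections to a real server; -y / --l-threshold
    # is the level at which it is re-enabled. Addresses and numbers are made up.
    ipvsadm -e -t 10.2.2.15:8123 -r 10.64.0.100:8123 -x 500 -y 400
    # Inspect the configured thresholds.
    ipvsadm -L -n --thresholds
]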