[00:02:20] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [00:05:00] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:05:20] PROBLEM - Puppet freshness on db1012 is CRITICAL: No successful Puppet run in the last 10 hours [00:06:00] !log maxsem synchronized php-1.22wmf1/extensions/MobileFrontend/includes/MobileFrontend.hooks.php 'https://gerrit.wikimedia.org/r/#/c/58436/' [00:06:06] Logged the message, Master [00:07:27] !log maxsem synchronized php-1.21wmf12/extensions/MobileFrontend/includes/MobileFrontend.hooks.php 'https://gerrit.wikimedia.org/r/#/c/58436/' [00:07:34] Logged the message, Master [00:09:00] RECOVERY - Puppet freshness on db1012 is OK: puppet ran at Wed Apr 10 00:08:52 UTC 2013 [00:09:10] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Wed Apr 10 00:09:08 UTC 2013 [00:09:20] PROBLEM - Puppet freshness on db1012 is CRITICAL: No successful Puppet run in the last 10 hours [00:10:00] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:10:20] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:10:50] RECOVERY - Puppet freshness on db1012 is OK: puppet ran at Wed Apr 10 00:10:43 UTC 2013 [00:11:00] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Wed Apr 10 00:10:59 UTC 2013 [00:11:20] PROBLEM - Puppet freshness on db1012 is CRITICAL: No successful Puppet run in the last 10 hours [00:12:00] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:12:31] RECOVERY - Puppet freshness on db1012 is OK: puppet ran at Wed Apr 10 00:12:28 UTC 2013 [00:12:50] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Wed Apr 10 00:12:43 UTC 2013 [00:13:00] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:13:20] PROBLEM - Puppet freshness on 
db1012 is CRITICAL: No successful Puppet run in the last 10 hours [00:13:37] WTF is going on with these checks? [00:14:10] RECOVERY - Puppet freshness on db1012 is OK: puppet ran at Wed Apr 10 00:14:05 UTC 2013 [00:14:20] PROBLEM - Puppet freshness on db1012 is CRITICAL: No successful Puppet run in the last 10 hours [00:14:20] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Wed Apr 10 00:14:19 UTC 2013 [00:15:00] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:15:40] RECOVERY - Puppet freshness on db1012 is OK: puppet ran at Wed Apr 10 00:15:37 UTC 2013 [00:16:00] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Wed Apr 10 00:15:51 UTC 2013 [00:16:00] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:16:20] PROBLEM - Puppet freshness on db1012 is CRITICAL: No successful Puppet run in the last 10 hours [00:17:10] RECOVERY - Puppet freshness on db1012 is OK: puppet ran at Wed Apr 10 00:17:04 UTC 2013 [00:17:20] PROBLEM - Puppet freshness on db1012 is CRITICAL: No successful Puppet run in the last 10 hours [00:17:30] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Wed Apr 10 00:17:21 UTC 2013 [00:18:00] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:18:30] RECOVERY - Puppet freshness on db1012 is OK: puppet ran at Wed Apr 10 00:18:23 UTC 2013 [00:18:40] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Wed Apr 10 00:18:35 UTC 2013 [00:19:00] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:21:05] !log removing xenon and db1012 from icinga configs, running puppetstoredconfigclean.rb on them, restarting icinga [00:21:12] Logged the message, Master [00:21:44] MaxSem: ^ some issue with decom'ed hosts that don't get removed from monitoring, known issue, but not solved yet, should stop now [00:22:19] PROBLEM - SSH on cp1043 is CRITICAL: 
Server answer: [00:23:19] RECOVERY - SSH on cp1043 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:32:42] New patchset: Dzahn; "turn planet into a puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54493 [00:32:42] New patchset: Dzahn; "rename planet class, per docs init.pp must exist and contain a class matching the module name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54494 [00:33:40] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [00:53:12] New review: Dzahn; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54493 [00:56:11] New patchset: Dzahn; "turn planet into a puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54493 [00:57:41] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [00:58:41] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:58:46] New patchset: Dzahn; "turn planet into a puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54493 [01:01:41] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [01:05:20] PROBLEM - Puppet freshness on db1012 is CRITICAL: No successful Puppet run in the last 10 hours [01:06:11] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [01:20:31] New patchset: Dzahn; "turn planet into a puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54502 [01:23:29] New patchset: Dzahn; "turn planet into a puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54502 [01:46:44] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:46:52] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54502 [01:47:43] New review: MZMcBride; "\o/" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/54502 [01:57:34] New patchset: Dzahn; "remove duplicate definition of locales package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58445 [01:58:19] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58445 [02:01:44] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [02:05:26] PROBLEM - Puppet freshness on db1012 is CRITICAL: No successful Puppet run in the last 10 hours [02:06:06] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [02:17:20] !log LocalisationUpdate completed (1.22wmf1) at Wed Apr 10 02:17:19 UTC 2013 [02:17:27] Logged the message, Master [02:26:46] PROBLEM - SSH on gadolinium is CRITICAL: Server answer: [02:27:38] New patchset: Dzahn; "resource references should now be capitalized" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58446 [02:27:46] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:27:47] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:28:06] !log LocalisationUpdate completed (1.21wmf12) at Wed Apr 10 02:28:06 UTC 2013 [02:28:10] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58446 [02:28:13] Logged the message, Master [02:29:46] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [02:30:26] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1600 bytes in 2.276 second response time [02:30:29] PROBLEM - Apache HTTP on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:36] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:36] PROBLEM - Apache HTTP on mw1113 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [02:30:36] PROBLEM - MySQL Slave Delay on db1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:36] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1600 bytes in 2.254 second response time [02:30:38] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:38] PROBLEM - Apache HTTP on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:38] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:38] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:38] PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:38] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:38] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:39] PROBLEM - Apache HTTP on mw1215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:39] PROBLEM - Apache HTTP on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:40] PROBLEM - Apache HTTP on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:46] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:46] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:46] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:46] PROBLEM - Apache HTTP on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:46] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:46] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:46] PROBLEM - Apache HTTP on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [02:30:47] PROBLEM - Apache HTTP on mw1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:47] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:48] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:48] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:49] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:49] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:50] PROBLEM - Apache HTTP on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:50] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:51] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:51] PROBLEM - Apach [02:33:23] LeslieCarr: I'll update the VPs, sure, once I can reach en.wikipedia.org again... :-/ [02:33:34] well, it's back up [02:33:40] works for me [02:33:51] localization again [02:33:54] apache process also running on a random one (mw1080) [02:33:59] ugh, ok [02:34:33] yeah, works now again [02:35:16] Is there any chance to get something more useful than "(Cannot contact the database server: Unknown error (10.64.16.6))"? [02:36:21] errr, wtf, icinga-wm stopped mid msg? 
[02:36:53] (never seen that before) [02:37:13] andre__: it's related to localization updates and happened yesterday too, already has an Ops thread [02:37:26] ah, saw that one [02:37:29] eh, and should be bug 27320 [02:37:58] jeremyb_: yea, usually it gets kicked at that point :p [02:38:23] for flooding [02:38:54] right [03:00:29] New patchset: Dzahn; "make the planet logo a "per language" thing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58448 [03:01:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58448 [03:10:24] Coren: I just copied you on https://bugzilla.wikimedia.org/show_bug.cgi?id=47067 [03:10:38] I wasn't sure if uberbox was your work Bugzilla account. [03:11:08] Susan: It is; it predates my indentu^W staff-like status. :-) [03:11:26] Susan: OpenID is in the short-term plans, btw. [03:12:05] I don't touch Labs very much. [03:12:05] Coren: you can always change email addresses in bz if you desire [03:12:12] without breaking anything [03:12:24] Coren: I imagine some people will be too lazy to switch from the Toolserver. [03:12:29] I'm not sure I'm moving everything over. [03:12:34] There's a lot of random shit. [03:12:52] * Susan shrugs. [03:13:01] Susan: They'll eventually hit a brick wall; the final posts on the roadmap is tarballs of what's left and /sbin/poweroff [03:13:11] Some projects are sustained by scripts that people have forgotten about. [03:13:15] Wikisource, for example. [03:13:19] arr, so how do i debug more if puppetd -tv just doesn't do anything for minutes and nothing in syslog and it used to finish in seconds until my last change :p [03:13:39] Susan: No doubt. Silke is hard at working trying to ferret those out -- but it's going to be a pain to track the authors down. [03:13:40] Coren: Of the Toolserver? Not in this decade, I don't imagine. [03:13:57] Coren: I'm saying that the authors aren't going to want to move shit over. [03:14:01] Like me. 
[03:14:02] well, at some point they'll just go away, then [03:14:20] It seemed like replication would break one day. [03:14:28] And that would be the beginning of the end for the Toolserver. [03:14:39] mutante: maybe puppetd -td ? [03:14:57] iirc they're not the same [03:15:11] Susan: We're making a welcoming home for refugees, not just new maintainers. The problem is the tools that have been abandoned; we can't just grab them. [03:15:34] mutante: self-hosted puppet? [03:15:34] You can if they're under an open license. [03:15:52] Susan: Yes, but the actual licensing info is often not there at all. [03:15:59] if no one is willing to maintain something, it should go away [03:16:08] Coren: no, the production one, it is just on this node though [03:16:10] One benefit of the Toolserver being so unstable is that tools are starting to use the API more often. [03:16:23] Ryan_Lane: That hurts small projects pretty badly. [03:16:40] Ryan_Lane: I wish it were that simple; but there are projects out there that rely on unmaintained tools. :-( [03:16:40] Like I said, some tools that are very important to the Global South projects aren't maintained. [03:16:46] Even some deployed extensions. [03:16:52] Coren: then someone better step up and maintain them [03:16:57] The API doesn't suffer replay very often, which is nice. [03:16:58] Coren: do you plan on maintaining them? :) [03:16:59] It's much rarer, at least. [03:17:17] localization again [03:17:24] TimStarling: was it not? [03:17:37] Ryan_Lane: Agreed. AFAIK, that's Silke's primary task (enumerate old tools, see what can be saved (has the right licenses), find adopters) [03:17:47] Are localization updates now breaking the site?
[03:17:51] I considered just running LU manually, during off peak yesterday, to generate the errors I needed to isolate the problem [03:18:04] but I thought I might get yelled at for not having a deployment window [03:18:09] hahaha [03:18:16] so I just set up some logging and let it happen by itself [03:18:17] * Coren yells at TimStarling. Just because. [03:18:42] TimStarling: I got your new and improved(R) log2udp(tm) by the way. Where do you want me to stick it? :-) [03:19:27] ® [03:19:28] ™ [03:19:31] They're easy on a Mac. [03:19:33] Coren: excellent [03:20:06] On Linux too. ™ ® Compose key FTW. [03:20:13] I was just lazy. :-) [03:20:15] mutante: tried -td ? [03:20:28] can you push it to gerrit for review? [03:20:30] Susan: you added yourself to a closed bug? :D [03:20:36] TimStarling: New repo or somewhere specific? [03:20:40] I've been doing that lately. [03:20:43] I'm not really sure why. [03:20:45] Coren: log2udp? is this for sending to ircecho? [03:20:50] In case it gets reopened, I guess. [03:20:56] Sometimes I'm sad to have missed the bug. [03:20:59] I éńábĺéd the compose key on my desktop recently [03:21:19] jeremyb_: yea, it stops after "Using cached certificate.." [03:21:30] mutante: strace? :) [03:21:44] https://en.wikipedia.org/wiki/Compose_key # Interesting. [03:21:50] i've had some good results with it. but not tried to strace puppet yet [03:21:52] it let's you type some things on IRĉ that would take a long time with a character map [03:22:04] Alt codes seem like the kind of thing MediaWiki would do. :v [03:22:15] TimStarling: you know ctrl-shift-u ? [03:23:01] TimStarling: new repo or did you have somewhere in mind? [03:24:04] Coren: did you write it from scratch, rather than starting from the existing code? [03:24:13] TimStarling: Scratch. [03:24:23] maybe a new project under analytics then [03:24:25] there's a project called scratch... :P [03:24:46] should it be called something different, to avoid confusion? 
[03:25:00] jeremyb_: the puppetmaster is just that busy.. (stafford) [03:25:15] I can put it in analytics/udplog since I named it (imaginatively) log2udp2 :-) [03:25:27] what's the change? [03:25:35] mutante: is it a particularly heavy catalog generation? e.g. neon [03:25:42] tell ori-l about it, he will probably care [03:26:12] I can make an analytics/log2udp2 project [03:26:13] TimStarling: i've developed an unreasonable attachment to it. stockholm syndrome. [03:26:41] ori-l: A bit more flexible, fixed footprint, should be blazingly fast. Has a numbering and prefix option, and merges lines into packets when possible. [03:26:44] I think udplog should mostly be for things which use that C++ library [03:26:54] kk. I'll make a new repo [03:27:40] merges lines into packets is a good thing and will spare a lot of unneeded udp frames, but be careful with it [03:27:54] lots of scripts that assume 1 datagram = 1 line [03:28:10] can someone review this varnish configuration change? https://gerrit.wikimedia.org/r/#/c/58269/ [03:28:36] ori-l: Need it to be optional, then? When Tim gave me the requirements, it seems like it was core. [03:29:04] tim knows better than me, but i think it's just a matter of socializing the change, esp. to erik zachte [03:29:23] So much socialization lately. [03:29:26] who loyally shepherds a flock of perl scripts that munge udplog data [03:29:30] if it has a different program name then you can migrate to it at your leisure [03:30:32] jeremyb_: it's compiling catalogs for all kinds of servers successfully.. but it's also not getting finished on neon. yea [03:30:36] obviously 1 datagram = 1 line is not a great convention [03:30:51] that's why only log2udp uses it [03:30:59] mutante: i didn't know which node you were focusing on [03:31:03] log3udp [03:32:13] TimStarling: re: that change -- doesn't it just duplicate line 495 in the same file?
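[editor's note] The datagram convention debated above — log2udp2 merging lines into packets versus downstream scripts assuming one datagram = one line — can be sketched as follows. This is an illustrative sender only; the function name, host, and port are made up here and are not log2udp2's actual interface:

```python
import socket

def send_log_lines(lines, host="127.0.0.1", port=8420, merge=False):
    """Send log lines over UDP.

    merge=False follows the 1-datagram-=-1-line convention the chat says
    udplog consumers rely on; merge=True packs the lines into a single
    packet (fewer UDP frames, but it breaks those consumers).
    Returns the number of datagrams sent.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        if merge:
            sock.sendto(("\n".join(lines) + "\n").encode(), (host, port))
            return 1
        for line in lines:
            sock.sendto((line + "\n").encode(), (host, port))
        return len(lines)
    finally:
        sock.close()
```

A consumer written against the one-line-per-datagram convention would misparse the merged form, which is why the chat suggests making merging optional or giving the new tool a different program name.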
[03:33:18] jeremyb_: zirconium, shouldn't have that much to do, it's planet, i'll just wait a bit, getting too late here [03:33:18] that's for upload [03:33:33] otherwise yes [03:33:44] it duplicates line 495 and line 703 [03:33:59] because I think bits should be logged as well as upload and mobile [03:34:04] mutante: ok, gute nacht. please poke me about rt 822 tomorrow? [03:34:07] getting late here too [03:35:00] jeremyb_: yep, don't have a reply yet, maybe the owner email has changed since 2011.. ttyl, good night [03:35:34] mutante: no, i mean about keeping status quo functionality and the confusing terminology "shortening service" [03:36:03] jeremyb_: if it can be done with a change to redirects.conf we should be good [03:36:10] mutante: it can be... [03:36:17] great..ok. [03:36:23] it's a simple regex subst [03:36:35] i wasn't sure either what the term includes and what it doesnt [03:36:40] TimStarling: well, aren't you duplicating the resource name, though? [03:36:44] "service" sounded like a bit more [03:36:53] right [03:37:01] if something requires => Varnish::logging['locke'] or whatever [03:38:09] the same problem would exist for mobile and upload, right? [03:38:47] TimStarling: let me trace this through [03:38:55] s/would/does, maybe [03:40:16] it looks like this change is fine to me [03:40:26] TimStarling, ori-l: https://gerrit.wikimedia.org/r/#/c/58449/ [03:40:58] oooo your code is purty [03:41:38] thanks Coren [03:41:58] i presume "quasi-circular" means pizza slice [03:42:15] * ori-l brbs [03:42:19] Heh. I explain what it means the next statement. But pizza slice sounds good. :-) [03:43:15] ori-l: it would only be a problem if that resource was defined on the same system [03:43:23] Coren: you should add reviewers... [03:43:26] ori-l: and in that case, you'd set instance_name [03:43:41] jeremyb_: I have no idea who I should add. :-) ori and Tim I suppose. [03:43:58] I just added a few. [03:44:11] Coren: that's a big problem with gerrit IMO. 
you can't request review from "whoever feels like reviewing right now" [03:44:14] And Domas, why not. [03:44:47] jeremyb_: unless you add a mailing list as a reviewer? [03:45:11] I hate my overcomplicated makefile. :-) [03:45:13] mutante: uhuh? [03:45:25] mutante: which lists are gerrit users? [03:45:40] jeremyb_: not saying to add ops :p just in general, if you had an appropriate list or alias, you could add it like a user [03:45:59] does gerrit allow adding non-users? [03:46:29] i don't think it would be able to tell the difference, it would just be Foo Bar with a certain email address..no? [03:47:10] 1 way to find out ... [03:47:59] New patchset: Ryan Lane; "Remove labstore3/4 bricks from management (DON'T MERGE)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58451 [03:49:36] jeremyb_: yep, out for real, cya [03:52:25] Oh, I just realized I used my own code format. I hope I didn't commit some WMF sin there. :-) [03:52:50] Uses cino={.5s:.5s=.5sl1g.5sh.5s(0u0U1 [03:55:10] Coren: errr, where are you talking about? [03:58:20] jeremyb_: Indenting and whitespace conventions in C-like languages is... religious. :-) There's the church of K&R, the cult of Allman, the GNU cabal, etc. And then there are the iconoclasts like me that do something a bit different. Mine is close to 1TBS. :-) [03:58:35] the cino= is the vim settings to cindent to make that. :-) [03:59:10] ahhh [03:59:27] well we don't have much compiled stuff. 
AFAIK [03:59:35] Ryan_Lane: well, it's still a mistake [03:59:55] the fact that by chance it doesn't actually engender conflict doesn't mean it isn't worth the extra second it would take to make the names distinct and informative [04:01:12] the fact that tim put 'locke' as the resource name probably betrays a kind of half-conscious belief that this value specifies the destination host name, or that it ought to be the destination host name, or something like that [04:01:23] it's confusing and funny looking [04:02:22] sooner or later there will be the occasion to ask 'which?' -- when some logger indicates a failure of that resource, or something [04:02:30] not a big deal, but i'm sticking to my guns [04:08:19] Coren: re code format, I don't think it matters much for something like this [04:09:03] when I hear "sticking to my guns" I always have the mental image of GWB :) [04:09:11] ori-l: then all of them need to be fixed [04:09:12] ori-l: Besides, 1TBF is Good. :-) [04:09:54] and it should be done as a refactoring [04:10:06] yeah, not suggesting it block this change [04:10:16] can't blame someone for staying consistent with what's there, and can't expect them to refactor everything for it :) [04:10:34] * Ryan_Lane nods [04:10:38] I agree with you otherwise [04:12:25] * ori-l retires to his texas ranch [04:14:50] Aaron|home: BGW? [04:14:55] err, GWB* ? [04:16:37] the ahh, former presi-dent... of Amurica [04:16:46] ahhh [04:16:49] 43 [04:17:09] not GHWB [04:17:22] GWB is also George Washington Bridge [04:30:53] New patchset: Tim Starling; "Add UDP log to bits" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58269 [04:31:03] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58269 [04:38:20] is stafford down? 
[04:38:42] Connection timed out during banner exchange [04:39:06] ah yes, ganglia says swapdeath [04:40:35] oh, I will need the root password for this, I don't think anyone ever gave it to me [04:41:02] faidon was going to send it by email to my PGP key, but then he didn't [04:41:24] how mean [04:41:30] sleeping? fine, I will page your asses in retaliation for not giving me the root password ;) [04:41:53] * Aaron|home can't recall Tim saying "ass" before [04:42:09] * Aaron|home does a whois [04:42:12] checks outs [04:42:33] I had to specially translate it into american so that they would understand [04:42:48] no point insulting americans in english english [04:45:44] what's the english english version? [04:46:07] asses -> arses [04:46:20] i think that's understandable. at least in Brooklyn [04:46:30] have you got it, jeremyb? [04:46:39] the passwd? no [04:46:48] maybe ssh will work, I am trying with -oConnectTimeout=10000 [04:46:56] hehe [04:46:59] TimStarling: one sec [04:47:40] console: Serial Device 2 is currently in use [04:47:40] note: never been to britain and i think it's 20ish years since i was in the commonwealth [04:47:42] * Ryan_Lane grumbles [04:47:55] yeah, that's me [04:48:02] waiting for the root password to arrive in my inbox [04:48:13] but don't bother, I got an ssh shell and I'm killing everything now [04:48:17] ok [04:48:36] was there a specific reason you didn't get root? [04:48:41] I don't believe so [04:48:55] logistics are hard? :) [04:49:23] I think I wasn't in the room at the relevant times [04:49:30] off at some platform meeting or something [04:50:06] only yourself to blame, working with platform instead of ops…. 
[04:50:09] :) [04:50:29] Ryan_Lane: don't you dare [04:50:43] or maybe it's because I actually want to maliciously destroy data and the lack of the root password is the only thing that is stopping me is the lack of the root password [04:50:58] and therefore CT wisely decided not to give it to me [04:50:58] that's definitely the reason [04:51:07] * jeremyb_ detects some redundancy in that statement [04:51:15] and of course he didn't tell you because it would get back to me and then I would DDoS everything [04:51:20] :D [04:51:39] TimStarling: err, not just `halt` ? [04:52:00] CDOS <--- Centralized DOS [04:52:22] where's the fun in that? [04:52:49] !log stafford went into swapdeath. I killed all ruby processes a couple of times, and then eventually stopped apache2 for a while, while it recovered [04:52:56] Logged the message, Master [04:53:02] i didn't realize fun was in the spec [04:53:37] * jeremyb_ wonders if stafford had ever swapdeath'd before [04:55:17] unlikely [04:55:32] ganglia says memory usage massively increased around the end of March [04:56:27] from 3GB to 8GB [04:56:50] so maybe that is related [04:58:34] Beware the ides of March? [05:00:13] right, the load has increased too [05:00:42] so maybe the memory usage per process hasn't increased [05:01:47] syslog:Apr 10 04:10:17 stafford kernel: [24359322.661058] ntpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0 [05:01:58] obviously it's all ntpd's fault [05:02:50] not 150 ruby processes using 150MB each [05:04:24] the syslog doesn't go back far enough to tell if it happened before [05:04:26] hi, TimStarling or someone, if you have a sec, could you merge https://gerrit.wikimedia.org/r/#/c/58265/ pls [05:05:20] it has already been posted to the community http://meta.wikimedia.org/wiki/Meta:Babel#Zero_configuration_namespace_coming_to_meta_near_you [05:05:43] o_O What was the OOM killer smoking to kill ntpd? It it tuned? [05:06:04] Coren: core [05:06:04] yurik: go sleep! 
[05:06:12] * jeremyb_ too [05:06:15] jeremyb_, my flight to s africa in 10 hours [05:06:22] need to finish a whole bunch of crap before that [05:06:30] 20 hours on the plane [05:06:44] yurik: hah, you're following amit or kul? :) (wild guess) [05:06:46] i'll go insane... getting there already [05:06:50] both [05:07:08] oh, not amit - kul & dan [05:07:08] * Coren goes to sleep too, actually. [05:07:18] conference [05:07:23] New patchset: Ori.livneh; "Added "Zero" & "Zero talk" namespaces to metawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58265 [05:07:29] learning USSD [05:07:32] the joy! [05:07:37] ahhhh [05:07:39] fun [05:08:14] any stopover? [05:08:22] jeremyb_, usually we try to work with newer tech, not the antiquated custom protocols... [05:08:37] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58265 [05:08:39] TimStarling, 1 in johannesburg [05:08:41] yurik: i still use ussd! [05:08:57] jeremyb_, to pay bills? or to browse the web? ;) [05:09:24] so like 18 hours to johannesburg and then 1 hour to wherever you're going? [05:09:30] ussd is what telcos supposedly use to show you your network/minute usage [05:09:40] yurik: neither. to check usage. yeah, that [05:09:47] TimStarling, yep - cape town, 2 hrs+ [05:10:00] you could visit stellenbosh! [05:10:10] what/who is that? [05:10:36] ahh, spelled it wrong [05:10:50] https://meta.wikimedia.org/wiki/Wikimania_2012/Bids/Stellenbosch [05:10:57] i'm going to sync yurik's config change, unless anyone objects [05:10:58] ori-l, always lurking ;) [05:10:59] https://gerrit.wikimedia.org/r/#/c/58265/4/wmf-config/InitialiseSettings.php [05:11:10] lol, i started typing that before ori said anything :) [05:11:12] meta folk seem OK with it [05:11:24] thanks ori-l ! 
[05:12:21] i'm going to wait 10 minutes to see if anyone says no [05:12:38] ori-l, i was thinking of introducing a new sec group, but couldn't find a proper way to do it [05:12:57] what would be the right way to do it? [05:12:59] i looked at http://www.mediawiki.org/wiki/Manual:$wgNamespaceProtection [05:15:06] i *think* if you do $wgNamespaceProtection[your_namespace] = array('zero-config-edit'); [05:15:49] and then $wgGroupPermissions['zero-folk'] = array( 'zero-config-edit' => true ) [05:15:59] the assignment will effectively create the group [05:16:03] but i can't remember exactly [05:18:05] right, that seems like the right way to do it, but for some reason initialiseSettings.php doesn't have it [05:18:15] commonSettings does [05:19:41] hmm, seems like wgAddGroups is what is being used for that [05:20:27] i think wgAddGroups is who can add / remove groups for users [05:20:42] lets see [05:20:52] New patchset: Tim Starling; "NRPE on bits caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58455 [05:21:15] ori-l, i guess no one objects :) [05:21:23] Zero, here we come :) [05:21:32] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58455 [05:24:48] yurik: you may want groupOverrides or groupOverrides2 [05:25:10] too tired to decide if that's right or not [05:25:55] yep, that's what it looks like, but not sure how it should be set up: [05:26:34] bon voyage [05:26:46] when do you get back? [05:26:54] 'metawiki' => array( 'zeroadmin' => [05:27:47] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Add 'Zero' & 'Zero talk' namespaces to metawiki (If6b3ce5e4)' [05:27:54] Logged the message, Master [05:28:00] jeremyb_, 23rd i think [05:28:12] but i should be online [05:28:14] hopefully [05:28:16] :) [05:28:42] (i sound like a true new yorker... they have internet in s africa, right?)
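[editor's note] A minimal LocalSettings-style sketch of the mechanism ori-l outlines above. The 'zero-config-edit' right and 'zero-folk' group are the names proposed in the chat, not anything deployed; the namespace constants and IDs here are hypothetical:

```php
// Hypothetical namespace IDs for illustration only.
define( 'NS_ZERO', 480 );
define( 'NS_ZERO_TALK', 481 );

$wgExtraNamespaces[NS_ZERO] = 'Zero';
$wgExtraNamespaces[NS_ZERO_TALK] = 'Zero_talk';

// Editing pages in the Zero namespace requires the 'zero-config-edit'
// right...
$wgNamespaceProtection[NS_ZERO] = array( 'zero-config-edit' );

// ...and granting that right to a group is enough to make the group
// exist -- the assignment "effectively creates the group", as the chat
// puts it.
$wgGroupPermissions['zero-folk']['zero-config-edit'] = true;
```

In wmf-config this would be expressed through the InitialiseSettings.php override arrays (groupOverrides, as yurik finds) rather than direct assignments, but the underlying settings are the same.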
:D [05:29:02] (and something tells me they have much better internet than what i got at my home) [05:29:25] New patchset: Tim Starling; "Varnish bits logging: set port and instance name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58456 [05:30:03] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58456 [05:31:58] yurik: synced, checked, and i noted so on meta:babel [05:32:08] ori-l, you rock! [05:32:10] thank you!!! [05:32:40] http://meta.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces [05:32:42] confirmed! :) [05:33:28] ori-l, would you know by any chance if it's possible to restrict editing in a namespace without changing settings files? [05:33:44] you can do it in your extension [05:33:58] yes, but I'm pretty far from being ready to deploy it [05:34:32] !log on brewster: proxy down due to full root partition, again. Will fix it properly this time. [05:34:39] Logged the message, Master [05:35:10] well, you don't need to deploy the whole thing [05:35:21] create a deployment branch that has a few minimal pieces [05:36:30] anyways, i don't think you need to worry about restricting the namespace [05:36:37] true, not just yte [05:36:39] yet [05:36:44] but pretty soon - for sure [05:36:53] if you're far from being ready to deploy it, then vandalism on that ns won't really have terrible consequences [05:37:03] exactly [05:37:18] and it's not a very attractive target for vandalism [05:37:21] why are you deploying the namespace if you're not ready to use it? [05:37:52] peachey|laptop__, who said i am not ready to use it?
[05:38:09] i am already using it in a way - in all the testing [05:38:22] those are settings, they need to be verified before the new extension goes live [05:38:26] on all wikis [05:38:49] so step 1 - convert existing settings system into Zero pages [05:39:01] and step 2 - once all checks out, deploy extension to all sites [05:39:10] not the other way around :) [05:41:14] New patchset: Tim Starling; "On brewster: disable access.log and store.log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58457 [05:41:34] microscopic, lol [05:42:50] microscopic as in, you can't buy an SD card that small anymore, they've stopped making them [05:43:04] ori-l, http://meta.wikimedia.org/wiki/Zero:250-99 :) [05:43:27] 5GB [05:52:56] New patchset: Tim Starling; "Bits varnish logging: set port correctly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58460 [05:53:07] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58460 [05:57:03] !log on stafford: swapdeath repeat narrowly averted via killall ruby && /etc/init.d/apache2 stop [05:57:10] Logged the message, Master [06:29:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection refused [06:32:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [06:51:04] New patchset: Tim Starling; "On stafford: reduce passenger pool size" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58461 [06:51:26] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58461 [06:56:22] * yurik throws lots of heavy objects at jeremyb_ ! 
[07:03:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:03:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.194 second response time [07:08:05] New review: Ori.livneh; "recheck" [operations/debs/python-jsonschema] (debian/wikimedia) - https://gerrit.wikimedia.org/r/58311 [07:24:06] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [07:24:31] New patchset: Isarra; "Update wikipedia favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58463 [07:26:16] PROBLEM - Varnish traffic logger on cp3020 is CRITICAL: NRPE: Command check_varnishncsa not defined [07:26:16] PROBLEM - Varnish traffic logger on sq69 is CRITICAL: NRPE: Command check_varnishncsa not defined [07:26:36] PROBLEM - Varnish traffic logger on sq70 is CRITICAL: NRPE: Command check_varnishncsa not defined [07:26:46] PROBLEM - Varnish traffic logger on cp3022 is CRITICAL: NRPE: Command check_varnishncsa not defined [07:26:46] PROBLEM - Varnish traffic logger on niobium is CRITICAL: NRPE: Command check_varnishncsa not defined [07:26:46] PROBLEM - Varnish traffic logger on sq67 is CRITICAL: NRPE: Command check_varnishncsa not defined [07:26:56] PROBLEM - Varnish traffic logger on sq68 is CRITICAL: NRPE: Command check_varnishncsa not defined [07:29:35] New patchset: Isarra; "Update wikipedia favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58463 [07:31:15] New review: Ori.livneh; "Duh, Jenkins isn't enabled for this repository. I'm being stupid. This has to be merged manually." [operations/debs/python-jsonschema] (debian/wikimedia) - https://gerrit.wikimedia.org/r/58311 [07:32:16] PROBLEM - RAID on stafford is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:32:56] New review: Isarra; "What the crap is going on here?" 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58463 [07:41:55] New patchset: Isarra; "Update wikipedia favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58463 [07:46:57] New patchset: Isarra; "(Bug 15716) Update wikipedia favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58463 [07:53:04] !log krinkle synchronized php-1.22wmf1/resources/startup.js 'Ia54dd738b3ce0995fa' [07:53:12] Logged the message, Master [08:06:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:19:40] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [08:26:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.328 second response time [08:29:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:34:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.000 second response time [08:37:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:41:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.444 second response time [08:45:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:47:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.204 second response time [08:51:10] PROBLEM - RAID on stafford is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:52:58] New patchset: Tim Starling; "On stafford: reduce PassengerMaxRequests to 5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58470 [08:53:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:53:23] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58470 [08:54:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.646 second response time [08:57:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:06:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.668 second response time [09:10:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:33:41] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [09:53:58] New patchset: ArielGlenn; "make sure mysql client is availavble for en wp job queue checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58473 [09:56:04] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58473 [09:57:54] apergos: thanks! 
there's also https://gerrit.wikimedia.org/r/#/c/58079/ for the other [09:58:25] ah that was going to be next [09:58:31] lemme make sure this one is good first [10:33:40] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [10:59:40] PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100% [11:04:00] RECOVERY - Host cp3006 is UP: PING OK - Packet loss = 0%, RTA = 88.78 ms [11:06:00] PROBLEM - Varnish traffic logger on cp3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [11:06:10] PROBLEM - Varnish HTTP upload-frontend on cp3006 is CRITICAL: Connection refused [11:25:10] PROBLEM - RAID on stafford is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:36:06] New patchset: Hashar; "zuul: typo in labs status_url" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58488 [11:36:33] root: super easy change to merge in https://gerrit.wikimedia.org/r/58488 :-] [11:36:41] fix a typo :-] [11:40:12] meh, I shot two puppet processes over there that were taking 2 and 5 gb respectively (long running times of over a half hour each), so we have some free, but I don't know what that will have done to those two jobs [11:56:11] Nemo_bis: /usr/local/bin/mwscript is on hume (as it should be) [11:56:17] re: https://gerrit.wikimedia.org/r/#/c/58079/1/manifests/ganglia.pp [11:56:36] New patchset: Hashar; "jenkins: actually enable verbose mode for git plugin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58489 [11:57:12] apergos: so /usr/local/bin/mwscript works, although mwscript doesn't? [11:57:25] apergos: can you possibly approve the two changes I have submitted? 
They are fairly simple and I have tested them out :-] https://gerrit.wikimedia.org/r/58488 https://gerrit.wikimedia.org/r/58489 [11:57:31] seems to, it looks like I'm getting cronspam only from one of these but not the other [11:57:38] (so much cronspam it's hard to tell though) [11:57:58] * apergos can't wait for puppet to apply the right config on hume and will add the package now out of desperation [11:58:21] New review: Hashar; "The switch needed to be applied to java (ie in JAVA_ARGS). It seems I forgot to verify that change ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49814 [11:59:13] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58488 [12:02:56] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58489 [12:03:03] apergos: thank you :-]]]] [12:03:14] yw [12:03:21] I'm not going to run puppet though [12:03:31] so just wait a bit for it to go around [12:03:47] apergos: I can do it now :-] [12:03:53] excellent [12:04:22] I should specify it I guess [12:11:08] New patchset: ArielGlenn; "comment out enwikijobqueue monitoring on hume til fixed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58490 [12:12:59] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58490 [12:14:10] PROBLEM - RAID on stafford is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:17:39] Nemo_bis: since you are the one that did changeid I4b67f60a62a370ea327f7fa68eea9ca444baa3bc maybe you know about the en wp jobqueue check brokenness [12:17:59] ERROR 1146 (42S02) at line 1: Table 'enwiki.job' doesn't exist [12:20:56] apergos: no I don't, I just restored a previous code [12:20:59] I see [12:21:24] all sorts of brokenness from those jobqueue changes, 1.20/21 is such a sad release [12:22:10] PROBLEM - RAID on stafford is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:22:52] !log removed outright puppet.log on neon, was > 7gb and no room left on device [12:22:59] Logged the message, Master [12:30:56] New patchset: Mark Bergsma; "Revert "make the planet logo a "per language" thing"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58491 [12:31:05] New patchset: Mark Bergsma; "Revert "resource references should now be capitalized"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58492 [12:31:15] New patchset: Mark Bergsma; "Revert "remove duplicate definition of locales package"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58493 [12:31:38] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58427 [12:31:51] New patchset: Mark Bergsma; "Revert "turn planet into a puppet module"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58494 [12:37:28] grr [12:37:30] grrrr [12:37:46] mark: I was about to do that [12:37:53] strace shows planet on the top [12:37:57] that and mysql lib [12:38:14] I didn't find anything wrong with the module that would cause that though [12:38:18] i did [12:38:24] a recursive definition [12:38:27] ah! 
[12:38:32] that would explain this [12:38:58] yes [12:39:00] New patchset: Mark Bergsma; "Revert "make the planet logo a "per language" thing"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58495 [12:39:04] ok, I see it now too [12:39:07] but I would also like a better review of that module [12:39:09] you are correct sir [12:39:21] yeah, I saw a few things that struck me as odd too [12:39:25] like removing the role class [12:40:20] 104 [pid 22996] stat("/var/lib/git/operations/puppet/modules/stdlib/lib/puppet/type", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [12:40:23] 104 [pid 22996] stat("/var/lib/git/operations/puppet/modules/mysql/lib/puppet/type", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [12:40:26] 104 [pid 22996] stat("/var/lib/git/operations/puppet/modules/apache/lib/puppet/type", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [12:40:30] 54 [pid 22996] stat("/var/lib/git/operations/puppet/modules/planet/manifests/init.pp", {st_mode=S_IFREG|0644, st_size=688, ...}) = 0 [12:40:30] New patchset: Mark Bergsma; "Revert "make the planet logo a "per language" thing"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58495 [12:40:38] brb [12:41:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58495 [12:43:04] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58491 [12:43:12] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58492 [12:43:19] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58493 [12:43:26] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58494 [12:45:33] New patchset: Mark Bergsma; "Revert "On stafford: reduce passenger pool size"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58496 
[12:46:01] New patchset: Mark Bergsma; "Revert "On stafford: reduce PassengerMaxRequests to 5"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58497 [12:46:45] New patchset: Mark Bergsma; "Revert "On stafford: reduce passenger pool size"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58496 [12:46:54] the constant rebasing is quite annoying [12:47:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58496 [12:47:59] New patchset: Mark Bergsma; "Revert "On stafford: reduce PassengerMaxRequests to 5"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58497 [12:48:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58497 [12:55:54] New patchset: MaxSem; "Fix test.m load.php domain, enable $wgMFVaryResources on test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58501 [12:58:23] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58501 [12:59:00] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.60232285714 (gt 8.0) [13:19:00] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 17.1655526891 (gt 8.0) [13:20:10] RECOVERY - Varnish HTTP upload-frontend on cp3006 is OK: HTTP OK: HTTP/1.1 200 OK - 675 bytes in 0.176 second response time [13:21:00] RECOVERY - Varnish traffic logger on cp3006 is OK: PROCS OK: 3 processes with command name varnishncsa [13:21:14] New patchset: J; "Install timidity for Score extension" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58504 [13:23:40] New patchset: J; "Install timidity for Score extension" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58504 [13:29:29] New patchset: Ottomata; "Disabling filters on locke except for fundraising." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58505 [13:30:00] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58505 [13:32:33] New patchset: Ottomata; "Excluding private from git clean, just in case." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58506 [13:32:44] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58506 [13:39:00] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.68819320896 (gt 8.0) [13:40:08] New patchset: Nemo bis; "Global jobqueue check: mwscript path fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58079 [13:40:45] apergos: as you said, I hope I did it right this time... [13:42:17] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/46950 [13:42:34] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/46976 [13:42:49] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/47023 [13:47:50] PROBLEM - Varnish HTTP upload-backend on cp3006 is CRITICAL: Connection refused [13:48:50] RECOVERY - Varnish HTTP upload-backend on cp3006 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.175 second response time [13:51:40] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [13:51:40] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [13:51:40] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:54:39] * jeremyb_ throws lots of heavy objects at yurik!! [14:03:50] argh [14:03:58] just pushed directly into a repo instead of via gerrit [14:04:56] <^demon|sick> :( [14:05:08] how can I reset that? 
[14:06:30] <^demon|sick> Reset your local branch to that commit, then `git push -f ` [14:06:44] doesn't accept it as it's not ff [14:06:52] <^demon|sick> Which repo/branch? [14:06:56] <^demon|sick> I'll grant +force. [14:07:18] operations/debs/varnish [14:07:22] i can do it myself first, whichever you prefer ;) [14:07:36] s/first/also/ [14:07:52] <^demon|sick> Done. [14:07:58] i'd rather not have direct push access, but sometimes you need it :( [14:07:58] thanks [14:08:08] seems to work [14:08:12] <^demon|sick> That's what I do. I tend to grant direct pushing, then revoke it when I'm done. [14:08:20] New review: Jeremyb; "errr, no it wasn't. it was reverted in I093cfc85d1ca8550c19a246ae9ed8fb0e91149e3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54502 [14:08:30] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm9) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/58514 [14:08:30] New patchset: Mark Bergsma; "Update streaming range patch with Martin's token deadlock fixes" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/58515 [14:08:30] New patchset: Mark Bergsma; "Fix race in persistent storage loading of segments" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/58516 [14:08:31] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm10) precise; urgency=medium" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/58517 [14:08:32] uh, why we don't have a test2.*m*.wikipedia.org? [14:09:00] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is -0.449792592593 [14:09:07] MaxSem: because a redundant test wiki where you can't actually test anything is enough? [14:09:26] heh [14:09:34] it's not redundant;) [14:10:05] maybe.. 
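[Editor's note: the recovery ^demon|sick describes above — reset the local branch, then force-push once +force is granted on the ref — can be demonstrated safely against a throwaway remote. The repository paths below are invented for the demo; they are not operations/debs/varnish.]

```shell
#!/bin/sh
# Demo of undoing an accidental direct push: point the local branch back at
# the last good commit, then force-push (non-fast-forward, so -f is needed;
# on a Gerrit-managed repo this also requires +force on the ref).
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/origin.git"        # stand-in for the real remote
git clone -q "$tmp/origin.git" "$tmp/work"
cd "$tmp/work"
git config user.email demo@example.org
git config user.name demo
git checkout -qb master
echo good > file; git add file; git commit -qm 'good commit'
git push -q origin master
echo oops >> file; git commit -qam 'accidental direct push'
git push -q origin master                   # the push we want to undo
git reset --hard -q HEAD~1                  # local branch back to the good commit
git push -qf origin master                  # force-push rewrites the remote ref
```

Note that this rewrites published history, which is why it is normally locked down: anyone who already fetched the bad commit must rebase or reset as well.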
but I never was able to test anything on test2 [14:10:21] although labs is worse, the one time I tried to test something I got an unrelated fatal [14:10:23] do you know the difference between test and test2? [14:10:31] I know the supposed difference :) [14:10:41] it's not supposed [14:10:41] doesn't seem to be that reflected in reality [14:10:53] oki, just personal experience, nothing more [14:11:09] which is what? [14:12:00] !log Inserted varnish 3.0.3plus-rc1-wm10 packages into the APT repository [14:12:07] Logged the message, Master [14:14:29] New patchset: Jeremyb; "(Bug 15716) Update wikipedia favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58463 [14:15:20] New patchset: MaxSem; "Uh-oh, no test2.m" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58518 [14:15:57] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58518 [14:17:00] fraek... [14:18:23] is there already a deb package in operations/debs where upstream and pristine-tar are commited via gerrit? can only find packages without Change-Id commits [14:18:53] !log maxsem synchronized wmf-config 'Enable $wgMFVaryResources on test' [14:19:00] Logged the message, Master [14:19:11] j^: no, initial push we usually do direct [14:19:16] as we're not gonna review all those in gerrit anyway [14:19:19] j^: i imagine it would be direct push for upstream branches where we are not upstream. [14:19:50] mark: whats the workflow to get this pushed since i dont have the right permissions [14:20:19] publish repository elsewhere and ask here for someone to clone and push? [14:20:40] there's not really a working workflow yet i'm afraid [14:20:49] if you're not in ops and already have full push access [14:20:53] mark: i am trying to change vlan for labsdb's from private to labs in row c ge-3/0/8-9. i tried changing membership and deleting membership. The commit goes okay but the change doesn't happen [14:20:59] any suggestions? 
[14:21:02] i guess asking us to do those pushes is easiest [14:21:27] cmjohnson1: checking [14:21:31] thx [14:22:37] cmjohnson1: you made a typo and therefore created a new interface-range that doesn't do anything [14:22:48] interface-range vlan-labs-hosts1-c-eqiad { [14:22:48] member ge-2/0/0; [14:22:48] member-range ge-2/0/0 to ge-2/0/1; [14:22:48] member-range ge-3/0/0 to ge-3/0/1; [14:22:48] unit 0 { [14:22:49] family ethernet-switching { [14:22:49] vlan { [14:22:50] members labs-hosts1-c-eqiad; [14:22:50] } [14:22:51] } [14:22:51] } [14:22:52] } [14:22:52] interface-range vlan-labs-host1-c-eqiad { [14:22:53] member ge-3/0/8; [14:22:59] !paste | mark [14:23:02] :D [14:23:08] * peachey|laptop__ looks at mark [14:23:15] thx mark for looking [14:23:31] why do people always get so worked up about a few lines of pasted code in irc, seriously [14:23:44] if it's under 30 lines i usually don't bother using a pastebin [14:24:40] PROBLEM - Puppet freshness on search34 is CRITICAL: No successful Puppet run in the last 10 hours [14:24:40] PROBLEM - Puppet freshness on search13 is CRITICAL: No successful Puppet run in the last 10 hours [14:24:40] PROBLEM - Puppet freshness on cp1033 is CRITICAL: No successful Puppet run in the last 10 hours [14:24:40] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [14:24:59] cmjohnson1: do you know how to correct it? 
[14:25:14] i will have to delete the typo'd range [14:25:18] yes [14:25:23] be careful ;) [14:25:34] k [14:25:40] PROBLEM - Puppet freshness on cp1024 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:40] PROBLEM - Puppet freshness on db1030 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:40] PROBLEM - Puppet freshness on db1029 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:40] PROBLEM - Puppet freshness on lvs6 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:40] PROBLEM - Puppet freshness on cp1023 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:40] PROBLEM - Puppet freshness on mw1013 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:40] PROBLEM - Puppet freshness on gadolinium is CRITICAL: No successful Puppet run in the last 10 hours [14:25:41] PROBLEM - Puppet freshness on mw121 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:41] PROBLEM - Puppet freshness on search17 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:42] PROBLEM - Puppet freshness on search36 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:42] PROBLEM - Puppet freshness on lvs5 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:43] PROBLEM - Puppet freshness on search22 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:43] PROBLEM - Puppet freshness on search21 is CRITICAL: No successful Puppet run in the last 10 hours [14:29:00] New patchset: MaxSem; "Proper fix for testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58519 [14:29:48] ^demon|sick, ERROR: Possible problem with your *.gwt.xml module file. The compile time user.agent value (safari) does not match the runtime user.agent value (gecko1_8). Expect more errors. [14:30:14] <^demon|sick> Bleh. 
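[Editor's note: the fix cmjohnson1 describes above amounts to deleting the misnamed interface-range ("vlan-labs-host1-c-eqiad", missing the "s", which silently created a new range instead of extending the existing one) and adding the ports to the correctly named range. A hypothetical Junos CLI sketch, with names taken from the paste above; the exact member-range is assumed:]

```
# Delete the typo'd range, then add the ports to the real one and commit:
delete interfaces interface-range vlan-labs-host1-c-eqiad
set interfaces interface-range vlan-labs-hosts1-c-eqiad member-range ge-3/0/8 to ge-3/0/9
commit
```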
[14:30:35] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58519 [14:33:28] !log maxsem synchronized wmf-config/mobile.php [14:33:34] Logged the message, Master [14:35:07] mark: can you clone/push http://r-w-x.org/wmf/debs/libvpx.git/ to operations/debs/libvpx whats currently at operations/debs/libvpx can be removed/reset [14:36:12] ok [14:36:18] (that would be the master/upstream/pristine-tar branch importing libvpx from ubuntu 12.10, will push patch through gerrit once its in) [14:37:40] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [14:39:27] <^demon|sick> MaxSem: What browser were you using? [14:40:12] Opera likely ;) [14:40:41] j^: should be done [14:41:14] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58079 [14:43:44] New review: J; "this is outdated and should be abandoned." [operations/debs/libvpx] (upstream) C: -1; - https://gerrit.wikimedia.org/r/58071 [14:44:26] <^demon|sick> Heh, opera and gerrit don't get along. [14:45:46] Change abandoned: J; "(no reason)" [operations/debs/libvpx] (master) - https://gerrit.wikimedia.org/r/58074 [14:46:00] Change abandoned: J; "(no reason)" [operations/debs/libvpx] (master) - https://gerrit.wikimedia.org/r/58072 [14:46:09] Change abandoned: J; "(no reason)" [operations/debs/libvpx] (master) - https://gerrit.wikimedia.org/r/58070 [14:46:19] ^demon|sick: i would say you have redundant words in that, but people will hurt me [14:46:22] Change abandoned: J; "(no reason)" [operations/debs/libvpx] (master) - https://gerrit.wikimedia.org/r/58073 [14:53:45] New patchset: J; "Import 1.1.0-1+wmf1" [operations/debs/libvpx] (master) - https://gerrit.wikimedia.org/r/58520 [14:54:47] mark: thanks, looks good now, pushed patch to https://gerrit.wikimedia.org/r/#/c/58520/ once that is reviewed, building the package is an ops task again? 
[14:56:28] I suppose so [14:57:39] ok [15:21:06] ^demon|sick, Reedy, FF - gerrit doesn't like opera [15:22:30] <^demon|sick> Meh, gwt does some stuff opera doesn't like. [15:22:33] <^demon|sick> gwt was like fix it. [15:22:37] <^demon|sick> opera was like stop doing that. [15:22:53] <^demon|sick> everyone stopped caring because nobody uses opera. [15:24:33] ^demon|sick: you make me want an xkcd to go with that [15:28:27] New patchset: Demon; "Switch nostalgiawiki to use Nostalgia from extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56402 [15:31:58] New patchset: Demon; "DO NOT MERGE UNTIL 1.22WMF2 IS ON ALL WIKIS" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58523 [15:32:08] lol [15:34:37] WHAT ARE YOU SAYING? [15:36:01] <^demon|sick> I'M SAYING DON'T MERGE KTHNX. [15:36:05] <^demon|sick> ;-) [15:36:40] PROBLEM - DPKG on cp1033 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:37:40] RECOVERY - DPKG on cp1033 is OK: All packages OK [15:38:05] !log demon synchronized php-1.22wmf1/extensions/Nostalgia [15:38:12] Logged the message, Master [15:39:32] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56402 [15:40:02] !log demon synchronized wmf-config/extension-list 'Nostalgia ext' [15:40:08] Logged the message, Master [15:40:28] !log demon synchronized wmf-config/CommonSettings.php 'Nostalgia ext' [15:40:34] Logged the message, Master [15:44:36] ^demon|sick: you should -1 yourself or something [15:45:15] <^demon|sick> -2'd. [15:47:43] !log Ran dist-upgrade (for varnish upgrade) on cp1021-1036 [15:47:50] Logged the message, Master [15:52:50] PROBLEM - Frontend Squid HTTP on cp1005 is CRITICAL: Connection refused [16:00:41] csteipp: 2 ideas about SQL box. 1) make it use a read-only DB user for queries from that box. idk if that's a supported option though. 2) what about stored XSS in a message received by the system? 
[16:00:50] RECOVERY - Frontend Squid HTTP on cp1005 is OK: HTTP OK: HTTP/1.0 200 OK - 1283 bytes in 0.003 second response time [16:01:21] jeremyb_: I would hope the developers thought about #2.... but it wouldn't surprise me. [16:01:45] csteipp: well we've had stored XSS issues with this package before IIRC [16:01:49] But, the attacker would probably have to csrf it to make it useful, so it would be a pretty difficult attack [16:02:08] It's common [16:02:37] i don't follow. couldn't they just inject arbitrary JS and then do almost anything? [16:02:40] For #1, that would be great if it was possible. I honestly haven't looked much at the code. I'm not sure if that's an easy thing to implement or not. [16:02:59] jeremyb_: They could, if they knew that an admin would run that query [16:03:11] oh, yeah, sure :) [16:03:24] To make it reliable, they would need to get the admin to run the query... csrf, or plead with an admin :) [16:03:27] i was mostly thinking about mail already in the system though [16:05:49] New patchset: ArielGlenn; "fix typo in field name rev_content_model, allow NULL for rev_parent_id and rev_text_len" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/58530 [16:06:13] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/58530 [16:07:19] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/58514 [16:08:11] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/58515 [16:08:25] New patchset: ArielGlenn; "mwxml2sql no longer reads text data from stdin, update usage message" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/58533 [16:08:43] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/58516 [16:08:47] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/58533 [16:09:02] Change merged: 
Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/58517 [16:10:40] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 10 hours [16:10:40] PROBLEM - Puppet freshness on db1051 is CRITICAL: No successful Puppet run in the last 10 hours [16:11:20] mark: 58514 and 58517 touch only changelog, nothing else? [16:11:47] why do you ask? [16:12:01] well where is the actual change? [16:12:21] in the commits before [16:13:30] New patchset: ArielGlenn; "fix handling of mysql password option" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/58534 [16:13:47] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/58534 [16:14:03] mark: oh is this the one where you were complaining that you direct pushed? [16:14:15] no [16:14:42] https://gerrit.wikimedia.org/r/gitweb?p=operations/debs/varnish.git;a=commit;h=6ef9caabbf2e6c4826ba4432cc3a7e4c3a649955 links to https://gerrit.wikimedia.org/r/#/q/6ef9caabbf2e6c4826ba4432cc3a7e4c3a649955,n,z which is an empty result set [16:15:11] oh maybe that was pushed direct [16:15:13] I guess it was [16:15:39] ok [16:23:20] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1001.0546 (gt 1000) [16:38:30] drdee: udp-filter does not run on the edge [16:39:15] it runs on locke/emery/oxygen/etc. ? [16:43:52] New patchset: Andrew Bogott; "Modify adminbot for use with a grid engine." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/58538 [16:47:25] Change merged: Andrew Bogott; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/58538 [16:53:38] New patchset: Andrew Bogott; "Bumped to 1.7" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/58539 [16:53:56] Change merged: Andrew Bogott; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/58539 [16:59:24] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [17:00:08] Somebody around for yet another cache purging issue? https://bugzilla.wikimedia.org/show_bug.cgi?id=46976#c11 [17:00:16] * andre__ wonders if LeslieCarr is already around ^^ [17:00:23] nope [17:00:26] not around [17:01:11] plus there is another issue reported in https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#SVG_image_thumbnail_caching_broken.3F , but that looks different (constantly getting the wrong image even after purging) [17:06:47] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [17:08:30] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [17:10:20] RECOVERY - Solr on vanadium is OK: All OK [17:11:52] New review: Faidon; "I didn't look in depth, but I don't understand why we need separate classes for the two cases (i.e. ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [17:14:33] !log authdns-update [17:14:39] Logged the message, RobH [17:18:02] New review: Catrope; "Regarding MZ's comment about scalability: maybe the $wgDBname check can be replaced with a $wmgUseFo..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [17:21:18] mark: i know udp-filter does not run on the edge nor was I implying it should [17:22:19] is udp-filter too slow to anonymize the 1:1000 sampled stream? [17:23:36] !log restart unresponsive cp1037 [17:23:43] Logged the message, Mistress of the network gear. 
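The "Puppet freshness" bot lines above flip between OK and CRITICAL based on a simple age threshold: no successful run in the last 10 hours. A minimal sketch of that rule in Python; the 10-hour window is taken from the alert text, but the function name and return strings are illustrative, not the actual Nagios check:

```python
from datetime import datetime, timedelta

# "No successful Puppet run in the last 10 hours" -- threshold from the alerts
FRESHNESS_WINDOW = timedelta(hours=10)

def puppet_freshness(last_run: datetime, now: datetime) -> str:
    """Return "OK" if the last successful Puppet run is within the
    freshness window, "CRITICAL" otherwise (illustrative only)."""
    if now - last_run <= FRESHNESS_WINDOW:
        return "OK"
    return "CRITICAL"
```

A host that ran Puppet minutes ago reports OK; one whose last run is older than ten hours goes CRITICAL, which is why a single successful run clears the alert immediately.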
[17:24:00] RECOVERY - Puppet freshness on cp1023 is OK: puppet ran at Wed Apr 10 17:23:54 UTC 2013 [17:25:00] RECOVERY - Puppet freshness on cp1024 is OK: puppet ran at Wed Apr 10 17:24:50 UTC 2013 [17:25:35] andre__: all purging should be happy again - i double checked all the htcpd daemons on upload varnishes [17:25:43] andre__: can you respond ? [17:25:50] I'll check [17:26:50] RECOVERY - Puppet freshness on cp1033 is OK: puppet ran at Wed Apr 10 17:26:43 UTC 2013 [17:27:25] LeslieCarr, I can confirm that https://bugzilla.wikimedia.org/show_bug.cgi?id=46976#c11 is fixed [17:27:26] mark: no, udp-filter is, AFAIK, not too slow for 1:1000 sampled streams [17:27:37] LeslieCarr, but https://bugzilla.wikimedia.org/show_bug.cgi?id=47087 is still an issue (and a different problem) [17:27:56] did you try repurging ? [17:28:14] New review: Brion VIBBER; "Confirmed this favicon shows high-resolution in Chrome on a Retina MacBook Pro, and still works in F..." [operations/mediawiki-config] (master); V: 2 C: 1; - https://gerrit.wikimedia.org/r/58463 [17:28:19] just repurged [17:28:20] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1000.4376 (gt 1000) [17:28:23] double check ?
[17:28:52] !log aaron synchronized php-1.22wmf1/maintenance/copyJobQueue.php 'deployed 54f74111901ddebdc69f26847e3904140b3723d5' [17:28:58] Logged the message, Master [17:29:43] LeslieCarr, https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/BirdRespiration.svg/220px-BirdRespiration.svg.png is still wrong for me after purging [17:29:46] LeslieCarr, should be https://en.wikipedia.org/wiki/File:BirdRespiration.svg [17:30:06] I went to https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/BirdRespiration.svg/220px-BirdRespiration.svg.png?whatever and then to https://en.wikipedia.org/wiki/File:BirdRespiration.svg?action=purge [17:32:20] RECOVERY - Solr on vanadium is OK: All OK [17:36:38] ok this is different [17:36:51] so purging from varnish is working [17:37:11] it's just either the imagescalers somehow fuck up the rescaling or the purging isn't purging that image from swift and it gets resent [17:39:32] yeah, different issue (judging from its outcome) [17:40:20] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1000.56415 (gt 1000) [17:41:20] RECOVERY - Solr on vanadium is OK: All OK [17:42:07] New patchset: RobH; "RT 4920 stat1002 deployment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58543 [17:45:22] !log authdns-update to rename db1012 to stat1002 [17:45:28] Logged the message, RobH [17:46:44] New review: RobH; "kirk! wait, i mean picard!" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/58543 [17:46:47] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58543 [17:54:30] jdlrobson: ping [17:54:37] hey preilly! :) [17:55:48] jdlrobson: do you have lunch plans today? [17:56:00] ummm nope [17:56:09] oh shit maybe.. 
[17:56:23] we have a lunch hangout - i'm not sure if that was cancelled or not - will have to ask tomasz when he's back [17:57:18] jdlrobson: okay well let me know [17:57:25] rfaulkner: ping [17:57:29] * preilly heh heh [17:57:40] preilly: hey [17:57:50] rfaulkner: do you have lunch plans today? [17:58:04] * preilly is running down his list…  [17:58:29] i'm, wfh. not in until 2pm … could do a late lunch [17:58:54] or .. meet somewhere around 1:30/45 [17:58:58] preilly: will do [17:59:26] rfaulkner: okay [17:59:36] rfaulkner: I'm at home so anything works for me [17:59:46] New patchset: Aaron Schulz; "Enabled use of redis for null test jobs." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58545 [17:59:57] ok. i'll be cycling in around that time and will give you a call [18:01:05] rfaulkner: sounds good [18:02:20] New patchset: Aaron Schulz; "Switched all remaning wikis to 1.22wmf1." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58546 [18:04:27] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58546 [18:06:23] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Switched all remaning wikis to 1.22wmf1 [18:06:29] Logged the message, Master [18:10:20] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1001.68463 (gt 1000) [18:12:20] RECOVERY - Solr on vanadium is OK: All OK [18:14:17] notpeter: https://gerrit.wikimedia.org/r/#/c/58545/1/wmf-config/jobqueue-eqiad.php [18:14:21] do those IPs look sane? 
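The "do those IPs look sane?" question about the redis addresses in jobqueue-eqiad.php (e.g. 10.64.32.76) is essentially a sanity check that job-queue servers sit in internal address space. A quick way to automate that first-pass check with the Python standard library; the function name and the "private = internal" assumption are illustrative, not part of the actual config review:

```python
import ipaddress

def looks_internal(addr: str) -> bool:
    """Rough sanity check: does addr parse as an IPv4 address in
    private (RFC 1918) space? Not a substitute for confirming which
    host the address actually belongs to."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False  # hostnames, typos, garbage
    return ip.version == 4 and ip.is_private
```

It only catches gross mistakes (a public IP or a typo pasted into the config); "those IPs are what they claim to be" still requires checking DNS or the host itself.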
[18:17:53] Aaron|home: http://www.youtube.com/watch?v=RijB8wnJCN0 [18:18:05] but yes, those IPs are what they claim to be [18:18:26] might be nice if there were vips for such things [18:20:40] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [18:21:15] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58545 [18:23:41] !log aaron synchronized wmf-config/jobqueue-eqiad.php 'Enabled use of redis for null test jobs' [18:23:49] Logged the message, Master [18:25:36] hrm, i think we can fix that :) [18:35:03] Change abandoned: Dzahn; "already squashed into 54502 (and reverted)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54493 [18:35:15] New patchset: Jdlrobson; "Update mobile.uploads.schema and add modules to mobile OutputPage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58551 [18:35:29] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [18:35:45] New review: Jdlrobson; "Merge dependency first." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/58551 [18:36:23] paravoid! hehe you review toO fast! not ready! :) [18:36:26] still testing! 
[18:36:30] Change abandoned: Dzahn; "already squashed into 54502 (and reverted)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54495 [18:36:47] :) [18:37:15] Change abandoned: Dzahn; "already squashed into 54502 (and reverted)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54494 [18:38:16] Change abandoned: Dzahn; "already squashed into 54502 (and reverted)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54501 [18:38:20] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1000.5937 (gt 1000) [18:40:39] Change abandoned: Dzahn; "already squashed into 54502 (and reverted)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54496 [18:41:34] New review: MaxSem; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58551 [18:42:27] Change abandoned: Dzahn; "already squashed into 54502 (and reverted)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54497 [18:42:30] RECOVERY - Puppet freshness on db1029 is OK: puppet ran at Wed Apr 10 18:42:19 UTC 2013 [18:42:51] Change abandoned: Dzahn; "already squashed into 54502 (and reverted)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54498 [18:43:08] notpeter: is rdb1001 in ganglia? [18:44:07] negatory [18:45:23] Change abandoned: Dzahn; "already squashed into 54502 (and reverted)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54499 [18:45:40] Change abandoned: Dzahn; "already squashed into 54502 (and reverted)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54500 [18:46:03] Aaron|home: lemme patch that up, yo [18:46:11] and rdb1002 while at it [18:46:24] seems to work fine, but I want to watch memory usage [18:46:48] yeah [18:48:28] New review: Dzahn; "even though i don't have a very strong opinion about it i still tend to agree with Krinkle, already ..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752 [18:49:41] New patchset: Pyoungmeister; "finishing up making an rdb eqiad ganglia group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58552 [18:51:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58552 [18:51:32] Change abandoned: Ori.livneh; "No consensus" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50306 [18:52:07] Aaron|home: there are like 3 things you have to do to actually get things into ganglia, and it's not uncommon for people to forget one of them... [18:53:29] $pool = RedisConnectionPool::singleton( array( 'connectTimeout' => 1, 'persistent' => false, 'password' => '' ) ); [18:53:30] var_dump( $pool->getConnection( '10.64.32.76' )->info( 'MEMORY' ) ); [18:53:43] that's useful :) [18:54:28] ottomata: wanna do the two debs or should I? [18:55:17] two debs...? [18:55:30] 2 debs 1 build? [18:55:31] eeeewwwww [18:55:33] haha [18:55:40] PROBLEM - Puppet freshness on virt8 is CRITICAL: No successful Puppet run in the last 10 hours [18:55:40] PROBLEM - Puppet freshness on mw7 is CRITICAL: No successful Puppet run in the last 10 hours [18:55:40] PROBLEM - Puppet freshness on mw1019 is CRITICAL: No successful Puppet run in the last 10 hours [18:55:40] PROBLEM - Puppet freshness on mw3 is CRITICAL: No successful Puppet run in the last 10 hours [18:55:46] (i'm in a meeting right now) [18:56:35] oh email! [18:56:35] reading [18:57:12] paravoid, I can probably do it in an hour or something [18:58:40] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: No successful Puppet run in the last 10 hours [18:58:59] notpeter: labsdb1002-3 are fixed...they were in wrong vlan [18:59:04] raid cfg was okay [18:59:25] cmjohnson1: woo! thank you [19:00:03] cmjohnson1: could you check in on labsdb1001 to verify that raid is configured correctly? 
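The RedisConnectionPool one-liner above dumps the MEMORY section of Redis's INFO command to watch memory usage. The raw INFO payload is just "key:value" lines with "#"-prefixed section headers, so it is easy to pull apart in any language. A sketch of such a parser in Python; it reflects the INFO wire format only, not MediaWiki's RedisConnectionPool API:

```python
def parse_redis_info(raw: str) -> dict:
    """Parse the key:value lines of a raw Redis INFO response into a
    dict of strings. Section headers ('# Memory') and blanks are skipped."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info
```

Polling `used_memory` from this dict every few minutes is enough to spot the growth the conversation below is worried about.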
[19:00:07] feel free to shut it down [19:00:27] yep...np [19:00:45] PROBLEM - Puppet freshness on mw4 is CRITICAL: No successful Puppet run in the last 10 hours [19:00:45] PROBLEM - Puppet freshness on mw6 is CRITICAL: No successful Puppet run in the last 10 hours [19:00:45] PROBLEM - Puppet freshness on virt1007 is CRITICAL: No successful Puppet run in the last 10 hours [19:00:45] PROBLEM - Puppet freshness on virt5 is CRITICAL: No successful Puppet run in the last 10 hours [19:02:40] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [19:03:40] PROBLEM - Puppet freshness on mw11 is CRITICAL: No successful Puppet run in the last 10 hours [19:05:30] PROBLEM - Host labsdb1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:05:40] PROBLEM - Puppet freshness on mw1018 is CRITICAL: No successful Puppet run in the last 10 hours [19:05:40] PROBLEM - Puppet freshness on mw16 is CRITICAL: No successful Puppet run in the last 10 hours [19:07:40] PROBLEM - Puppet freshness on mw1017 is CRITICAL: No successful Puppet run in the last 10 hours [19:07:40] PROBLEM - Puppet freshness on mw14 is CRITICAL: No successful Puppet run in the last 10 hours [19:08:40] PROBLEM - Puppet freshness on mw9 is CRITICAL: No successful Puppet run in the last 10 hours [19:09:24] sbernardin: the cable on search23 did not fix the problem..i will have to submit a network ticket [19:09:40] PROBLEM - Puppet freshness on mw5 is CRITICAL: No successful Puppet run in the last 10 hours [19:09:41] notpeter: how much disk space does rdb1001 have btw? [19:09:41] notpeter: the raid cfg was fine on labsdb1001....are you having trouble with it? [19:10:20] cmjohnson1: nope. just wanted to double check. thank you! [19:10:27] cmjohnson1: easier to do before is in prod :) [19:10:33] :-P [19:10:53] cmjohnson1: will put the original cable back....do you need port numbers for search23? 
[19:11:13] yes please [19:11:20] Aaron|home: / is 433G, /a is 278G, /tmp is 19G [19:12:23] ottomata: So I am spinning up the new stat1002 now =] [19:12:33] its identical to stat1 in hardware, actually prolly slightly faster [19:12:35] as both are R510s [19:12:40] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: No successful Puppet run in the last 10 hours [19:12:43] !log running puppetd --enable on all nodes via salt [19:12:47] oh nice! [19:12:48] cool! [19:12:50] Logged the message, notpeter [19:13:02] i put this on internal ip, i dont recall if we discussed [19:13:06] but i think we did and internal was fine. [19:13:21] yes, internal ip is correct [19:13:27] Ryan_Lane: paravoid any ideas as to how puppet might still be getting randomly disabled on nodes? [19:13:30] RECOVERY - Host labsdb1001 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [19:13:32] cool, so i wasnt sure what can run in both sites at same time [19:13:41] so the manifest for stat1002 is pretty much just include you as sudo [19:13:41] this has been happening for a while [19:13:43] no idea [19:13:46] and you can add what you need to site.pp [19:13:46] perfect [19:13:47] yeah thank you [19:13:49] i'll add as needed [19:13:51] danke [19:13:54] paravoid: I was hoping that not using the agent would help [19:13:57] might clean up a bit as I do too (more role classes) [19:13:59] quite welcome, it'll be ready for hand off in a bit [19:14:09] but lots stopped running in the last 24 hours :( [19:14:31] we had a puppet outage this morning [19:14:37] well, european morning I mean [19:14:40] PROBLEM - Puppet freshness on mw8 is CRITICAL: No successful Puppet run in the last 10 hours [19:14:53] there was a change merged that had a recursive define [19:15:19] and puppet was stupid enough that stafford melted [19:15:28] so I'm guessing lots of clients got timeouts/500s [19:16:08] paravoid: oh, hrm [19:16:12] that makes sense [19:18:31] cmjohnson1: Search23 in port 24 (search22 in port 23 and search24 in port 25)
on asw-b3-sdtpa [19:18:38] thx [19:18:40] PROBLEM - Puppet freshness on mw1091 is CRITICAL: No successful Puppet run in the last 10 hours [19:18:40] PROBLEM - Puppet freshness on mw2 is CRITICAL: No successful Puppet run in the last 10 hours [19:20:40] PROBLEM - Puppet freshness on analytics1004 is CRITICAL: No successful Puppet run in the last 10 hours [19:21:40] PROBLEM - Puppet freshness on mw10 is CRITICAL: No successful Puppet run in the last 10 hours [19:21:42] so, it looks like puppet run times are for the most part up to about 1500 seconds or so [19:21:47] and lots are failing [19:22:30] RECOVERY - Puppet freshness on lvs6 is OK: puppet ran at Wed Apr 10 19:22:29 UTC 2013 [19:22:40] PROBLEM - Puppet freshness on mw75 is CRITICAL: No successful Puppet run in the last 10 hours [19:24:59] yeah puppet is dying all over itself [19:29:10] RECOVERY - Puppet freshness on mw1091 is OK: puppet ran at Wed Apr 10 19:29:06 UTC 2013 [19:29:40] PROBLEM - Puppet freshness on mw1139 is CRITICAL: No successful Puppet run in the last 10 hours [19:29:40] PROBLEM - Puppet freshness on mw15 is CRITICAL: No successful Puppet run in the last 10 hours [19:31:40] PROBLEM - Puppet freshness on mw1 is CRITICAL: No successful Puppet run in the last 10 hours [19:31:40] PROBLEM - Puppet freshness on mw1009 is CRITICAL: No successful Puppet run in the last 10 hours [19:31:40] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: No successful Puppet run in the last 10 hours [19:31:49] New patchset: RobH; "have to include a group since no services are listed that do" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58556 [19:32:39] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58556 [19:33:50] RECOVERY - Puppet freshness on db1030 is OK: puppet ran at Wed Apr 10 19:33:44 UTC 2013 [19:34:40] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [19:34:50] PROBLEM - MySQL Slave Delay on db1025 is 
CRITICAL: CRIT replication delay 184 seconds [19:35:40] RECOVERY - Puppet freshness on search34 is OK: puppet ran at Wed Apr 10 19:35:35 UTC 2013 [19:36:10] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Wed Apr 10 19:36:00 UTC 2013 [19:36:50] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 19 seconds [19:39:05] RECOVERY - Puppet freshness on search17 is OK: puppet ran at Wed Apr 10 19:38:56 UTC 2013 [19:39:10] RECOVERY - Puppet freshness on search13 is OK: puppet ran at Wed Apr 10 19:39:06 UTC 2013 [19:39:40] RECOVERY - Puppet freshness on lvs5 is OK: puppet ran at Wed Apr 10 19:39:37 UTC 2013 [19:40:10] RECOVERY - Puppet freshness on search21 is OK: puppet ran at Wed Apr 10 19:40:02 UTC 2013 [19:40:10] RECOVERY - Puppet freshness on search22 is OK: puppet ran at Wed Apr 10 19:40:02 UTC 2013 [19:41:50] RECOVERY - Puppet freshness on search36 is OK: puppet ran at Wed Apr 10 19:41:44 UTC 2013 [19:43:40] PROBLEM - Puppet freshness on stat1 is CRITICAL: No successful Puppet run in the last 10 hours [19:43:40] RECOVERY - Puppet freshness on mw121 is OK: puppet ran at Wed Apr 10 19:43:36 UTC 2013 [19:43:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 228 seconds [19:45:00] RECOVERY - Puppet freshness on gadolinium is OK: puppet ran at Wed Apr 10 19:44:51 UTC 2013 [19:46:04] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [19:46:20] New patchset: Ryan Lane; "Split openstack manager away from controller" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58558 [19:46:50] RECOVERY - Puppet freshness on analytics1004 is OK: puppet ran at Wed Apr 10 19:46:42 UTC 2013 [19:47:43] New review: Hashar; "The issue is about bots spamming #mediawiki and rendering that channel useless and annoying for supp..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752 [19:47:50] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 25 seconds [19:48:50] RECOVERY - Puppet freshness on mw75 is OK: puppet ran at Wed Apr 10 19:48:43 UTC 2013 [19:49:41] New review: Hashar; "I disagree Ori, we have a consensus :-]? Just that mark asked to move that utility to another class..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50306 [19:50:40] PROBLEM - Puppet freshness on mw13 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:29] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [19:54:40] PROBLEM - Puppet freshness on db1045 is CRITICAL: No successful Puppet run in the last 10 hours [19:54:40] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [19:55:24] RECOVERY - Puppet freshness on mw1139 is OK: puppet ran at Wed Apr 10 19:55:12 UTC 2013 [19:55:50] notpeter: let me know when the ganglia stuff is up [19:56:22] New patchset: RobH; "including base statistics role in stat1002" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58559 [19:56:36] New review: Krinkle; "@Hashar: You are working from the (imho incorrect) assumption that most developers working on MediaW..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752 [19:56:37] New patchset: Aaron Schulz; "Moved async upload jobs to redis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58561 [19:57:04] sbernardin: around? 
[19:59:01] RECOVERY - Puppet freshness on mw1009 is OK: puppet ran at Wed Apr 10 19:58:59 UTC 2013 [19:59:16] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58561 [19:59:32] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58559 [19:59:42] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [20:00:27] !log aaron synchronized wmf-config/jobqueue-eqiad.php 'Moved async upload jobs to redis' [20:00:34] Logged the message, Master [20:01:00] RECOVERY - Puppet freshness on ms-fe4 is OK: puppet ran at Wed Apr 10 20:00:50 UTC 2013 [20:01:05] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [20:03:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [20:05:01] RECOVERY - Puppet freshness on mw1013 is OK: puppet ran at Wed Apr 10 20:04:58 UTC 2013 [20:05:50] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 14 seconds [20:12:31] New review: Ottomata; "Ok, so. The need for puppetmaster::self is for backwards compatibility only. There are instances o..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [20:13:40] PROBLEM - Puppet freshness on db1052 is CRITICAL: No successful Puppet run in the last 10 hours [20:18:40] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: No successful Puppet run in the last 10 hours [20:24:28] Change merged: Ottomata; [operations/debs/python-jsonschema] (debian/wikimedia) - https://gerrit.wikimedia.org/r/58311 [20:24:54] New patchset: Ryan Lane; "Split openstack manager away from controller" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58558 [20:24:58] Change abandoned: Ottomata; "This has been done here:" [operations/debs/python-jsonschema] (debian/experimental) - https://gerrit.wikimedia.org/r/56064 [20:25:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58558 [20:26:23] Change abandoned: RobH; "included in another patchset" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/54582 [20:29:50] New patchset: ArielGlenn; "version 0.0.2, scripts that prepare subset of wiki content for import" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/58568 [20:32:40] PROBLEM - Puppet freshness on db1043 is CRITICAL: No successful Puppet run in the last 10 hours [20:33:44] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/58568 [20:34:40] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [20:46:22] !log added python-jsonschema to wikimedia apt repo: ori-l :) [20:46:29] Logged the message, Master [20:47:41] Aaron|home: https://ganglia.wikimedia.org/latest/?c=Redis%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [20:48:55] nice [20:49:23] though the only calling it 'Redis eqiad' is a bit funny since mc1-16 are redis too :) [20:49:36] Change merged: Ottomata; [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/56168 [20:50:53] Change merged: Ottomata; 
[operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/57263 [20:52:39] they're in the memcache eqiad group :) [20:52:44] I don't really care how they're grouped [20:52:51] I was just finishing off what asher started to code up [20:55:38] New patchset: Jgreen; "fundraising.pp file owner cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58603 [20:57:17] New patchset: Anomie; "Comment out call to clearMessageBlobs.php in l10nupdate-1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58604 [21:03:56] anybody know wtf is going with gerrit? ^demon doesn't seem to be around. gerrit has been super slow for me today, and now i'm getting things like 'code review - error; server unavailable; 0' or 5xx errors [21:04:46] +1 what awjr said [21:05:35] ^demon: is sick [21:05:40] New patchset: Ottomata; "Release 0.6.1-4 precise-wikimedia for WMF." [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/58605 [21:05:57] gerrit-wm also appears to be sick :( [21:06:05] Change merged: Ottomata; [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/58605 [21:06:43] !log added python-voluptuous to apt repo (for hashar) [21:06:50] Logged the message, Master [21:09:32] New patchset: Aaron Schulz; "Set $wgJobQueueMigrationConfig for use by scripts." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58606 [21:09:35] Change merged: Ottomata; [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/56602 [21:11:30] New review: Ottomata; "I see at least one 'nice job' comment from Faidon, and he poked me about this today." [operations/debs/python-statsd] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/55069 [21:11:41] Change merged: Ottomata; [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069 [21:15:04] jdlrobson: gerrit or gerrit-wm ? [21:15:23] jeremyb_: sorry gerrit. 
it auto completed for me- seems better now though [21:15:32] right :) [21:15:40] New patchset: Ottomata; "1.5.2-2 release for WMF" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/58609 [21:15:40] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [21:15:55] Change merged: Ottomata; [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/58609 [21:16:03] New patchset: Pyoungmeister; "adding db1001 to eqiad mq shard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58610 [21:17:18] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58606 [21:18:29] !log aaron synchronized wmf-config/jobqueue-eqiad.php [21:18:33] !log added python-statsd to apt repo for hashar [21:18:36] Logged the message, Master [21:18:42] Logged the message, Master [21:19:25] New patchset: Pyoungmeister; "adding db1001 to eqiad mq shard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58610 [21:20:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58610 [21:29:10] PROBLEM - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:29:49] only one -lb? [21:29:53] that's odd [21:30:10] RECOVERY - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 94858 bytes in 4.635 second response time [21:30:24] I EXPECT FULL PAGERSTORM FOR ANY LVS ISSUE [21:30:40] I do :) [21:32:50] New patchset: Aaron Schulz; "Moved htmlCacheUpdate jobs to redis." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58613 [21:33:10] New patchset: Jgreen; "more permissions/ownership hackery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58614 [21:34:10] PROBLEM - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:34:16] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58613 [21:34:41] ok, who is killing esams [21:34:44] Ryan_Lane: is it you ? [21:35:10] RECOVERY - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 94856 bytes in 8.377 second response time [21:35:11] me? :D [21:35:13] no [21:35:21] it's weird that it's only a single -lb reporting [21:35:34] !log aaron synchronized wmf-config/jobqueue-eqiad.php 'Moved htmlCacheUpdate jobs to redis' [21:35:41] Logged the message, Master [21:37:30] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:38:20] this is weird [21:38:20] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 75904 bytes in 0.767 second response time [21:39:30] RECOVERY - Puppet freshness on amslvs1 is OK: puppet ran at Wed Apr 10 21:39:23 UTC 2013 [21:40:46] so puppet had the magically disabled thing [21:40:53] LeslieCarr: yeah :( [21:40:57] but couldn't find anything else seriously wrong - doulbe checked the network [21:41:04] it's also taking like 30 minutes to run on every host [21:44:11] New patchset: Jgreen; "fix users/groups for civicrm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58616 [21:44:50] PROBLEM - Packetloss_Average on analytics1004 is CRITICAL: CRITICAL: packet_loss_average is 8.80552976378 (gt 8.0) [21:44:51] PROBLEM - Packetloss_Average on analytics1005 is CRITICAL: CRITICAL: packet_loss_average is 9.22065704918 (gt 8.0) [21:45:00] PROBLEM - Packetloss_Average on locke is 
CRITICAL: CRITICAL: packet_loss_average is 9.54963076336 (gt 8.0) [21:45:14] yeah [21:45:30] PROBLEM - Packetloss_Average on analytics1006 is CRITICAL: CRITICAL: packet_loss_average is 8.392666 (gt 8.0) [21:45:40] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 10.4121939024 (gt 8.0) [21:46:00] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58616 [21:46:00] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 9.05165941667 (gt 8.0) [21:46:01] PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: CRITICAL: packet_loss_average is 10.9583460656 (gt 8.0) [21:46:02] PROBLEM - Packetloss_Average on gadolinium is CRITICAL: CRITICAL: packet_loss_average is 8.38687435484 (gt 8.0) [21:47:09] also surprisingly enough cpu not pegged on staffor [21:47:10] d [21:47:58] stafford had enough attention for one day [21:50:15] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: NRPE: Unable to read output [21:50:15] New patchset: Aaron Schulz; "Moved refreshLinks jobs to redis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58617 [21:51:40] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: No successful Puppet run in the last 10 hours [21:52:11] hmmm, is there a weirdness now I don't know about? packet loss alerts on all udp2log instances at once [21:53:09] New review: Mwalker; "I'll push this tomorrow with my CentralNotice deploy" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58463 [21:53:25] LeslieCarr: any network problem right now? 
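When packet-loss alerts fire on every udp2log instance at once, as above, it helps to machine-read the bot's lines rather than eyeball them. The "PROBLEM/RECOVERY - check on host is STATE" shape is regular enough for a small parser; the regex below is inferred from the log format in this channel, not taken from the bot's source:

```python
import re

# Matches lines like:
#   PROBLEM - Packetloss_Average on locke is CRITICAL: ...
#   RECOVERY - Puppet freshness on db1012 is OK: ...
ALERT_RE = re.compile(
    r"(?:PROBLEM|RECOVERY) - (?P<check>.+?) on (?P<host>\S+) is "
    r"(?P<status>OK|CRITICAL|WARNING)"
)

def parse_alert(line: str):
    """Return (check, host, status) for an icinga-wm line, or None
    for chat, !log entries, and gerrit-wm lines."""
    m = ALERT_RE.search(line)
    if not m:
        return None
    return m.group("check"), m.group("host"), m.group("status")
```

Grouping the parsed tuples by check name would show instantly that all six CRITICALs here are Packetloss_Average on udp2log hosts, pointing at a shared cause (the overloaded puppetmaster) rather than six independent failures.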
[21:54:33] not that i can tell [21:54:40] lemme check some more [21:57:40] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 3.03045722222 [21:58:00] RECOVERY - Packetloss_Average on analytics1003 is OK: OK: packet_loss_average is 2.55244773438 [21:58:48] I'm going to try restarting apache on stafford [21:59:00] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.42737580882 [21:59:08] !log gracefulling apache on stafford [21:59:14] Logged the message, notpeter [22:00:06] so [22:00:10] I'd guess [22:00:48] that tim reduced the PassengerMaxPoolSize on stafford and restarted apache [22:01:13] later ma_rk reverted that change, but I don't think that he restarted apache [22:01:35] after they found the root cause of stafford's swap-death inducing issue [22:02:03] stafford is now pegging its cpu nicely trying to handle all of the queued puppet run requests [22:02:09] should even out in a little bit [22:02:10] RECOVERY - Puppet freshness on virt1007 is OK: puppet ran at Wed Apr 10 22:02:02 UTC 2013 [22:02:10] RECOVERY - Puppet freshness on virt5 is OK: puppet ran at Wed Apr 10 22:02:02 UTC 2013 [22:02:14] will keep an eye on it [22:02:43] also, will make sure to run puppetd --enable on all the nodes that need it [22:02:48] LeslieCarr: ^^ [22:03:00] RECOVERY - Puppet freshness on virt1 is OK: puppet ran at Wed Apr 10 22:02:52 UTC 2013 [22:03:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.128 second response time [22:03:10] RECOVERY - Puppet freshness on mw6 is OK: puppet ran at Wed Apr 10 22:03:03 UTC 2013 [22:03:10] RECOVERY - Puppet freshness on mw4 is OK: puppet ran at Wed Apr 10 22:03:03 UTC 2013 [22:03:50] RECOVERY - Puppet freshness on mw11 is OK: puppet ran at Wed Apr 10 22:03:44 UTC 2013 [22:05:00] RECOVERY - Puppet freshness on mw16 is OK: puppet ran at Wed Apr 10 22:04:56 UTC 2013 [22:05:35] :) [22:06:00] RECOVERY - Puppet freshness on mw1017 is OK: puppet
ran at Wed Apr 10 22:05:56 UTC 2013 [22:07:10] RECOVERY - Puppet freshness on mw14 is OK: puppet ran at Wed Apr 10 22:07:08 UTC 2013 [22:07:10] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [22:07:18] gerrit is slow again… any ideas why? It's really harming my ability to do anything today :( [22:07:30] RECOVERY - Packetloss_Average on analytics1006 is OK: OK: packet_loss_average is 0.922189756098 [22:07:50] RECOVERY - Puppet freshness on mw9 is OK: puppet ran at Wed Apr 10 22:07:48 UTC 2013 [22:07:50] RECOVERY - Puppet freshness on mw5 is OK: puppet ran at Wed Apr 10 22:07:48 UTC 2013 [22:08:00] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is -0.0117669918699 [22:08:01] RECOVERY - Packetloss_Average on gadolinium is OK: OK: packet_loss_average is 0.335521818182 [22:08:37] Change restored: Tim Starling; "I think it should be installed on all servers. I think it may be possible to convince Mark of the me..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50306 [22:08:50] RECOVERY - Packetloss_Average on analytics1004 is OK: OK: packet_loss_average is 0.331661322314 [22:08:51] RECOVERY - Packetloss_Average on analytics1005 is OK: OK: packet_loss_average is 0.57254632 [22:09:00] RECOVERY - Puppet freshness on mw1018 is OK: puppet ran at Wed Apr 10 22:08:50 UTC 2013 [22:09:01] slow gerrit [22:10:00] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Wed Apr 10 22:09:56 UTC 2013 [22:10:00] RECOVERY - Puppet freshness on mw8 is OK: puppet ran at Wed Apr 10 22:09:56 UTC 2013 [22:10:04] load average is evening back out on stafford [22:10:14] Aaron|home: slow doesn't quite describe it… [22:10:30] RECOVERY - Puppet freshness on searchidx1001 is OK: puppet ran at Wed Apr 10 22:10:22 UTC 2013 [22:11:49] Aaron|home: i'm going to mail wikitech [22:12:00] RECOVERY - Puppet freshness on mw2 is OK: puppet ran at Wed Apr 10 22:11:53 UTC 2013 [22:14:13] Aaron|home: Slow? 
more like unresponsive. I get nothing more but the menu and [Working...] [22:14:16] !log temporarily disabling DynamicSidebar support in OpenStackManager [22:14:20] RECOVERY - Puppet freshness on mw10 is OK: puppet ran at Wed Apr 10 22:14:10 UTC 2013 [22:14:21] yep, same here [22:14:23] Logged the message, Master [22:14:40] PROBLEM - Puppet freshness on cp3007 is CRITICAL: No successful Puppet run in the last 10 hours [22:15:08] anyone remember whether we moved off jetty on gerrit yet? [22:15:20] !log enable DynamicSidebar support in OpenStackManager [22:15:26] apergos: we did not [22:15:27] Logged the message, Master [22:15:29] Krinkle: Aaron|home yeh i can't do anything right now. It's making me extremely frustrated as I have too many things to do [22:15:37] http://code.google.com/p/gerrit/wiki/Scaling I know it was discussed [22:16:20] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58617 [22:16:30] RECOVERY - Puppet freshness on mw13 is OK: puppet ran at Wed Apr 10 22:16:22 UTC 2013 [22:17:58] I also can't understand why 'r' is still broken (going from finishing an inline comment draft to the submit window, instead of manually mouse-clicking on "Up to change" and then "Review") [22:18:13] it just triggers [Working..] and does nothing, that's actually been a bug for almost a month now [22:18:24] funny thing is, this particular [Working..] label sticks [22:18:35] it'll stay there forever while navigating the site and everything else is broken. [22:18:49] May be unfair, but I think that's what you get for writing javascript with java. [22:18:59] It can be done, but why? [22:20:10] oh, and now we're out of service. [22:20:31] <^demon|sick> !log restarting gerrit, was hung [22:20:39] Logged the message, Master [22:20:41] it was working again. [22:20:49] (before the restart) [22:22:08] <^demon|sick> It was still pegged at 100% cpu, and was getting some OrmConcurrency exceptions.
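[Editor's note: the "should even out in a little bit" call on stafford earlier (a CPU-pegged puppetmaster working off runs queued while Passenger had a reduced pool size) is a back-of-the-envelope queue-drain estimate. A minimal sketch of that arithmetic follows; the backlog size, worker count, and per-run compile time are hypothetical numbers chosen for illustration, not measurements from stafford.]

```python
# Rough drain-time estimate for a puppetmaster that accumulated a backlog
# while its Passenger worker pool was undersized. All numbers below are
# made up purely to illustrate the calculation.

def drain_time_s(backlog, pool_size, secs_per_run):
    """Seconds until queued runs are worked off, assuming the pool
    completes `pool_size` catalog compiles in parallel per wave."""
    waves = -(-backlog // pool_size)  # ceiling division
    return waves * secs_per_run

# e.g. 300 queued agents, 15 Passenger workers, ~20 s per catalog compile:
print(drain_time_s(300, 15, 20))  # 400 seconds, i.e. "a little bit"
```

Under these assumed numbers the backlog clears in minutes, which matches the load average evening out by 22:10 in the log.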
[22:22:40] RECOVERY - Puppet freshness on virt4 is OK: puppet ran at Wed Apr 10 22:22:32 UTC 2013 [22:22:40] RECOVERY - Puppet freshness on mw15 is OK: puppet ran at Wed Apr 10 22:22:37 UTC 2013 [22:22:50] RECOVERY - Puppet freshness on db1045 is OK: puppet ran at Wed Apr 10 22:22:42 UTC 2013 [22:23:00] RECOVERY - Puppet freshness on mw1 is OK: puppet ran at Wed Apr 10 22:22:57 UTC 2013 [22:23:06] New review: Kaldari; "Personally, I don't think we need to worry about scalability until we have more than 1 wiki requesti..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [22:23:37] !log aaron synchronized wmf-config/jobqueue-eqiad.php 'moved refreshLinks to redis.' [22:23:40] RECOVERY - Puppet freshness on virt8 is OK: puppet ran at Wed Apr 10 22:23:39 UTC 2013 [22:23:44] Logged the message, Master [22:24:00] RECOVERY - Puppet freshness on mw7 is OK: puppet ran at Wed Apr 10 22:23:49 UTC 2013 [22:24:00] RECOVERY - Puppet freshness on mw3 is OK: puppet ran at Wed Apr 10 22:23:55 UTC 2013 [22:24:10] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:24:27] New patchset: Krinkle; "Add 'Contact Wikipedia' footer link on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [22:25:10] RECOVERY - Puppet freshness on mw1019 is OK: puppet ran at Wed Apr 10 22:25:06 UTC 2013 [22:25:13] !log added live hack to OpenStackManager on wikitech, ensuring that DynamicSidebar support is only enabled for logged-in users [22:25:21] Logged the message, Master [22:25:22] New review: Krinkle; "(1 comment)" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/57649 [22:28:50] RECOVERY - Puppet freshness on searchidx2 is OK: puppet ran at Wed Apr 10 22:28:44 UTC 2013 [22:29:35] New patchset: Kaldari; "Add 'Contact Wikipedia' footer link on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [22:32:10] 
RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [22:34:09] New patchset: Jdlrobson; "Update mobile.uploads.schema and add modules to mobile OutputPage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58551 [22:49:08] RECOVERY - Host labsdb1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [22:50:28] RECOVERY - Puppet freshness on amslvs2 is OK: puppet ran at Wed Apr 10 22:50:25 UTC 2013 [22:51:08] PROBLEM - RAID on stat1002 is CRITICAL: NRPE: Command check_raid not defined [22:51:18] PROBLEM - MySQL Idle Transactions Port 3308 on labsdb1003 is CRITICAL: NRPE: Command check_mysql_idle_transaction_3308 not defined [22:51:18] PROBLEM - MySQL Slave Delay Port 3308 on labsdb1003 is CRITICAL: NRPE: Command check_mysql_slave_delay_3308 not defined [22:51:28] PROBLEM - MySQL Slave Running Port 3306 on labsdb1003 is CRITICAL: NRPE: Command check_mysql_slave_running_3306 not defined [22:51:28] PROBLEM - MySQL Recent Restart Port 3306 on labsdb1003 is CRITICAL: NRPE: Command check_mysql_recent_restart_3306 not defined [22:51:38] PROBLEM - MySQL Slave Running Port 3307 on labsdb1003 is CRITICAL: NRPE: Command check_mysql_slave_running_3307 not defined [22:51:38] PROBLEM - MySQL Recent Restart Port 3307 on labsdb1003 is CRITICAL: NRPE: Command check_mysql_recent_restart_3307 not defined [22:51:38] PROBLEM - DPKG on stat1002 is CRITICAL: NRPE: Command check_dpkg not defined [22:51:48] PROBLEM - Disk space on labsdb1003 is CRITICAL: NRPE: Command check_disk_space not defined [22:51:48] PROBLEM - MySQL Slave Running Port 3308 on labsdb1003 is CRITICAL: NRPE: Command check_mysql_slave_running_3308 not defined [22:51:48] PROBLEM - MySQL Recent Restart Port 3308 on labsdb1003 is CRITICAL: NRPE: Command check_mysql_recent_restart_3308 not defined [22:51:48] PROBLEM - Disk space on stat1002 is CRITICAL: NRPE: Command check_disk_space not defined [22:51:58] PROBLEM - MySQL Slave Delay Port 3306 on 
labsdb1003 is CRITICAL: NRPE: Command check_mysql_slave_delay_3306 not defined [22:51:58] PROBLEM - MySQL Idle Transactions Port 3306 on labsdb1003 is CRITICAL: NRPE: Command check_mysql_idle_transaction_3306 not defined [22:52:08] PROBLEM - MySQL Slave Delay Port 3307 on labsdb1003 is CRITICAL: NRPE: Command check_mysql_slave_delay_3307 not defined [22:52:08] PROBLEM - MySQL Idle Transactions Port 3307 on labsdb1003 is CRITICAL: NRPE: Command check_mysql_idle_transaction_3307 not defined [22:53:48] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:01:48] RECOVERY - Disk space on labsdb1003 is OK: DISK OK [23:01:48] RECOVERY - MySQL Recent Restart Port 3308 on labsdb1003 is OK: OK seconds since restart [23:01:48] RECOVERY - MySQL Slave Running Port 3308 on labsdb1003 is OK: OK replication [23:01:58] RECOVERY - MySQL Idle Transactions Port 3306 on labsdb1003 is OK: OK longest blocking idle transaction sleeps for seconds [23:01:58] RECOVERY - MySQL Slave Delay Port 3306 on labsdb1003 is OK: OK replication delay seconds [23:02:08] RECOVERY - MySQL Idle Transactions Port 3307 on labsdb1003 is OK: OK longest blocking idle transaction sleeps for seconds [23:02:08] RECOVERY - MySQL Slave Delay Port 3307 on labsdb1003 is OK: OK replication delay seconds [23:02:18] RECOVERY - MySQL Idle Transactions Port 3308 on labsdb1003 is OK: OK longest blocking idle transaction sleeps for seconds [23:02:18] RECOVERY - MySQL Slave Delay Port 3308 on labsdb1003 is OK: OK replication delay seconds [23:02:28] RECOVERY - MySQL Recent Restart Port 3306 on labsdb1003 is OK: OK seconds since restart [23:02:28] RECOVERY - MySQL Slave Running Port 3306 on labsdb1003 is OK: OK replication [23:02:38] RECOVERY - MySQL Recent Restart Port 3307 on labsdb1003 is OK: OK seconds since restart [23:02:38] RECOVERY - MySQL Slave Running Port 3307 on labsdb1003 is OK: OK replication [23:02:48] RECOVERY - Varnish traffic logger on cp1041 is 
OK: PROCS OK: 3 processes with command name varnishncsa [23:04:45] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [23:18:35] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:18:45] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:18:47] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:18:50] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:18:52] !log pgehres synchronized php-1.22wmf1/extensions/CentralAuth/maintenance/migrateAccount.php 'Updating CentralAuth maintenance script' [23:19:15] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:19:17] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:19:20] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:19:22] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:19:24] oh shit [23:19:24] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:19:26] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:19:28] PROBLEM - Host foundation-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:19:31] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:20:33] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::4 [23:20:35] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [23:21:01] LeslieCarr: do you know what's up? [23:21:03] is this networking?
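[Editor's note: the burst of "NRPE: Command check_X not defined" alerts on labsdb1003 and stat1002 a few minutes earlier means the monitoring server asked those agents to run checks their local nrpe.cfg did not yet define; the definitions evidently arrived (presumably via puppet) before the 23:01 recoveries. A minimal sketch of the failing lookup, parsing nrpe.cfg-style `command[name]=plugin line` entries; the config text below is a hypothetical example, not the real file.]

```python
import re

# nrpe.cfg lists the commands an agent is willing to run; a request for
# any name absent from it yields "NRPE: Command <name> not defined".
NRPE_CFG = """
command[check_disk_space]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5%
command[check_mysql_slave_delay_3306]=/usr/local/bin/check_mysql_slave_delay --port 3306
"""

def defined_commands(cfg_text):
    # Collect the names inside command[...] stanzas.
    return set(re.findall(r"^command\[([^\]]+)\]=", cfg_text, re.MULTILINE))

def run_check(cfg_text, name):
    if name not in defined_commands(cfg_text):
        return f"NRPE: Command {name} not defined"
    return "OK (would execute the configured plugin line)"

print(run_check(NRPE_CFG, "check_mysql_slave_delay_3308"))
# NRPE: Command check_mysql_slave_delay_3308 not defined
```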
[23:21:13] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [23:21:15] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [23:21:16] dunno, can't reach the host [23:21:18] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [23:21:20] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [23:21:22] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [23:21:24] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [23:21:27] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [23:21:27] just IPv6 ? [23:21:40] yeah [23:21:44] amslvs3 is ipv6 [23:21:49] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [23:22:01] I can't see eqiad over ipv6 either [23:22:02] from here [23:22:05] so it's not esams [23:22:12] last hop is XO [23:22:19] PROBLEM - Host foundation-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [23:22:21] cir1.ashburn-va.us.xo.net specifically [23:22:22] LeslieCarr: ^^ [23:22:27] thanks [23:22:31] back up [23:22:41] ping6 wikipedia-lb from fenari works though [23:22:57] that's over our link though [23:22:59] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.75 ms [23:23:03] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 85.65 ms [23:23:05] RECOVERY - Host foundation-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 85.83 ms [23:23:05] there it is again, yep [23:23:07] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.98 ms [23:23:10] RECOVERY - Host 
wikinews-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.52 ms [23:23:12] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.74 ms [23:23:15] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.00 ms [23:23:17] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.70 ms [23:23:19] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.65 ms [23:23:28] RECOVERY - Host bits-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.80 ms [23:23:31] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.03 ms [23:23:33] RECOVERY - Host upload-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 89.10 ms [23:23:44] still lossy and with a higher latency than usual [23:24:05] and back down again [23:24:22] down as in lower latency [23:24:28] and that is why i have the unlimited texting plan. 
[23:24:28] haha [23:24:40] and back up [23:24:51] so 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=27 ttl=44 time=156 ms [23:24:54] 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=28 ttl=44 time=157 ms [23:24:57] 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=29 ttl=44 time=160 ms [23:25:00] 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=30 ttl=44 time=217 ms [23:25:03] 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=31 ttl=44 time=255 ms [23:25:06] 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=32 ttl=44 time=247 ms [23:25:09] stable at ~250ms [23:25:12] compared to ~150ms which is the usual [23:25:15] same path [23:25:41] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 86.85 ms [23:25:43] RECOVERY - Host upload-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 88.87 ms [23:26:21] RECOVERY - Host wikinews-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 87.59 ms [23:26:22] LeslieCarr: it's doing this 150->250->150 every few minutes [23:26:23] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 86.97 ms [23:26:31] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 86.68 ms [23:26:32] and lots of packet loss [23:26:34] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 87.67 ms [23:26:36] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 86.61 ms [23:26:38] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 87.89 ms [23:26:41] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 88.10 ms [23:27:03] RECOVERY - Host bits-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 86.84 ms
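[Editor's note: the eyeball diagnosis above (a ~150 ms baseline jumping to ~250-280 ms on the same path) can be mechanized by parsing the ping output. A small sketch follows; the sample lines are taken from the paste above, and the 1.5x "degraded" threshold is an arbitrary choice for illustration.]

```python
import re

# Extract RTT samples from ping6-style output and flag a mean RTT well
# above a known baseline, as in the 150 ms -> 250 ms flapping seen here.
SAMPLE = """\
64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=27 ttl=44 time=156 ms
64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=30 ttl=44 time=217 ms
64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=31 ttl=44 time=255 ms
"""

def rtts(ping_output):
    """Pull the time=... values out of ping output, as floats (ms)."""
    return [float(m) for m in re.findall(r"time=([\d.]+) ms", ping_output)]

def degraded(ping_output, baseline_ms=150.0, factor=1.5):
    """True if the mean RTT exceeds baseline_ms by the given factor."""
    samples = rtts(ping_output)
    return sum(samples) / len(samples) > baseline_ms * factor

print(rtts(SAMPLE))      # [156.0, 217.0, 255.0]
print(degraded(SAMPLE))  # False (mean ~209 ms, threshold 225 ms)
```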
[23:27:06] home -> eqiad is over level3->xo, eqiad->home is tele2->level3 [23:27:33] RECOVERY - Host foundation-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 85.84 ms [23:27:48] LeslieCarr: ^^^ [23:28:14] hrm, i'll try taking down xo [23:28:45] as long as you're on top of this [23:29:34] i can login and start killing peers but I'd feel more comfortable if you do :) [23:29:34] hehe [23:30:24] 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=277 ttl=44 time=182 ms [23:30:27] 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=278 ttl=44 time=237 ms [23:30:30] 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=279 ttl=44 time=277 ms [23:30:34] 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=280 ttl=44 time=280 ms [23:30:37] 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=282 ttl=44 time=281 ms [23:30:40] 64 bytes from 2620:0:861:2:7a2b:cbff:fe09:11ba: icmp_seq=284 ttl=44 time=280 ms [23:30:43] stable at 280ms now [23:30:45] !log deactivated XO ipv6 transit [23:30:46] jesus [23:30:48] I think I'm crossing the atlantic three times or something :) [23:30:52] Logged the message, Mistress of the network gear. [23:30:57] interesting traceroutes ? [23:31:12] now I'm going through XO [23:31:13] but over Tampa [23:31:14] haha [23:31:23] hehehe [23:31:26] back to 180ms now [23:31:30] well as long as that one is stable ... [23:31:33] oh via which route ? [23:31:46] ams level3 -> tampa xo [23:31:58] 280ms again [23:33:28] kill xo @ tampa too? [23:35:49] done [23:35:49] tinet now [23:35:51] any better ? 
[23:35:53] cool [23:35:55] 150ms [23:35:58] let's wait a bit [23:35:58] :) [23:37:40] looks stable [23:42:44] New patchset: Dzahn; "add metrics.wikimedia.org SSL cert for RT-4912" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58632 [23:45:38] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:45:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58632 [23:52:08] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:52:08] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [23:52:08] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [23:52:22] New patchset: Dzahn; "add frdata.wikimedia.org SSL cert for RT-4895" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58633 [23:55:46] New patchset: Ori.livneh; "Add python-jsonschema & pymongo to EL dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58635 [23:57:23] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58633
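[Editor's note: throughout this log the "Varnish traffic logger on cp1041" check flaps between OK (3 processes) and CRITICAL (2 processes): a process-count check that expects exactly three varnishncsa instances and goes critical whenever one of them dies. A minimal sketch of that logic, with a simulated process table in place of a real process-table walk; the function name and return codes follow Nagios plugin conventions (0 = OK, 2 = CRITICAL), but this is an illustration, not the actual check_procs plugin.]

```python
# Sketch of a Nagios-style PROCS check: count processes matching a
# command name and compare against the expected count. The process
# list is simulated; a real check would read the live process table.

def check_procs(process_names, command, expected):
    found = sum(1 for name in process_names if name == command)
    if found == expected:
        return (0, f"PROCS OK: {found} processes with command name {command}")
    return (2, f"PROCS CRITICAL: {found} processes with command name {command}")

# One logger has died, leaving 2 of the expected 3:
code, msg = check_procs(["varnishncsa", "varnishncsa", "varnishd"], "varnishncsa", 3)
print(code, msg)  # 2 PROCS CRITICAL: 2 processes with command name varnishncsa
```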