[00:12:27] Change abandoned: awjrichards; "This is no longer a necessary change." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10203
[01:40:45] New patchset: Jeremyb; "make ircecho config sane (not just very long strings)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8344
[01:41:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8344
[01:41:13] New review: Jeremyb; "rebased on top of Ib8c0ea3e5bbb54fd" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/8344
[01:46:54] some people have said that Teredo (2001::/32) should not be blocked b/c Windows Vista/7 uses it, but I disagree.
[01:48:18] isn't that for the stewards to decide?
[01:48:56] a steward told me to talk to ops
[01:54:04] ?
[01:54:34] that means you, Snowolf
[01:54:42] Yes but I didn't tell you to talk to ops
[01:54:50] I told you that I wrote to one of the wmf ops people :P
[01:55:01] I took that as a request to discuss with ops
[01:55:19] i should've said
[01:55:31] "I was advised that the ops opinion was needed"
[01:56:13] In any case I'm not touching or listening to any IPv6 stuff until it's live
[07:49:45] New patchset: Jeremyb; "admins.pp: comment includes of disabled accounts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10226
[07:50:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10226
[07:56:27] morning asher
[07:56:47] heya
[08:11:41] wtf
[08:11:54] if I send PyBal a TERM signal, it kills its BGP connection, not itself
[08:21:39] New patchset: Pyoungmeister; "adding a key that ben gave me. at his request someone else must +2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10227
[08:22:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10227
[08:31:44] !log reimaging searchidx2
[08:31:49] Logged the message, notpeter
[08:45:11] !log restarted mysql on es1004 with default innodb file format as barracuda
[08:45:16] Logged the message, Master
[09:02:25] yay
[09:02:27] nailed that bug
[09:05:48] what was it?
[09:06:04] are you familiar with Twisted (python)?
[09:06:53] yep
[09:07:01] basically I was installing a signal handler to run before twisted shutdown
[09:07:09] the signal handler was returning a Deferred
[09:07:25] the deferred was coming from my twisted BGP library, which was supposed to fire the callbacks on it
[09:07:35] the signal handler was meant to stop/release BGP connections
[09:07:44] but because the actual code was synchronous, no callbacks were ever fired
[09:07:49] and the signal handler did never finish
[09:07:56] so pybal's own terminate handler never got called, either
[09:08:29] the synchronous code made a new fresh Deferred, but nothing did anything with it
[09:08:42] event based code can be fun to debug ;)
[09:09:44] New patchset: Mark Bergsma; "Fix erroneous use of str.find()" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10228
[09:09:45] New patchset: Mark Bergsma; "Catch and handle server list load errors" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10229
[09:09:45] i looked at the pybal code after you mentioned the bug an hour ago and couldn't see anything wrong :)
[09:09:45] New patchset: Mark Bergsma; "Set umask 022 for the daemon" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10230
[09:09:46] New patchset: Mark Bergsma; "Simplify subCommandServer" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10231
[09:09:47] New patchset: Mark Bergsma; "Make sure we don't pass any IPv6 service IP addresses to BGP at the moment" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10232
[09:09:47] New patchset: Mark Bergsma; "Replace some too generic exception handlers by more specific exceptions in bgp.py" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10233
[09:09:48] New patchset: Mark Bergsma; "Fix bug where BGPPeering.manualStop() would not fire callbacks" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10234
[09:09:49] New patchset: Mark Bergsma; "pybal (1.01) precise; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10235
[09:09:55] ^ last change :)
[09:10:01] well, one but last
[09:10:31] it's not really a critical bug, just has been annoying me for some time
[09:10:39] and since i'm rolling a new release now anyway, I just wanted to fix it ;-)
[09:10:44] but now back to installing LVS servers!
[09:14:35] !log Built PyBal 1.01 for precise, and included it in the precise-wikimedia APT repository
[09:14:35] bot is currently down
[09:14:40] Logged the message, Master
[09:14:52] * mark kicks wm-bot in the nuts
[09:16:05] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10228
[09:16:07] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10228
[09:16:41] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10229
[09:16:49] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10229
[09:16:50] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10229
[09:17:18] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10230
[09:17:20] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10230
[09:18:01] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10231
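The manualStop() bug described at 09:07 (a synchronous function hands back a fresh Deferred that nothing ever fires) can be illustrated with a minimal sketch. This is not PyBal's code and does not use Twisted: `MiniDeferred` is a tiny hypothetical stand-in for a Deferred, and `manual_stop_buggy`, `manual_stop_fixed` and `shutdown` are names made up for the illustration.

```python
class MiniDeferred:
    """Tiny stand-in for a Deferred: holds callbacks until it is fired."""

    def __init__(self):
        self.callbacks = []
        self.fired = False
        self.result = None

    def add_callback(self, fn):
        # Run immediately if a result is already available, else queue.
        if self.fired:
            self.result = fn(self.result)
        else:
            self.callbacks.append(fn)
        return self

    def callback(self, result):
        # Fire the chain: each callback receives the previous result.
        self.fired = True
        self.result = result
        for fn in self.callbacks:
            self.result = fn(self.result)
        self.callbacks = []


def manual_stop_buggy():
    # The teardown work is synchronous, but the fresh deferred is
    # returned without ever being fired, so anything waiting on it hangs.
    return MiniDeferred()


def manual_stop_fixed():
    # Same synchronous teardown, but the deferred is fired before return.
    d = MiniDeferred()
    d.callback(None)
    return d


def shutdown(stop_fn):
    """Return the steps that ran when shutdown waits on stop_fn()."""
    steps = ["bgp stop requested"]
    stop_fn().add_callback(lambda _: steps.append("terminate handler ran"))
    return steps
```

With the buggy variant the terminate step never runs; with the fixed one it does. In real Twisted code, `defer.succeed(None)` is one way to return an already-fired Deferred from synchronous code.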
[09:18:03] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10231
[09:18:29] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10232
[09:18:31] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10232
[09:19:13] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10233
[09:19:15] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10233
[09:20:13] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10234
[09:20:14] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10234
[09:20:38] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10235
[09:20:40] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10235
[09:21:59] or, perhaps, breakfast first...
[09:35:01] New patchset: Pyoungmeister; "giving searchdix2 new indexer class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10237
[09:35:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10237
[09:35:28] !log rebooting es1 for kernel+mysql upgrade. dont need to pull from db.php because it was never correctly added or queried?
[09:35:32] Logged the message, Master
[09:35:56] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10237
[09:35:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10237
[09:42:17] mark: !
[09:52:18] !log stopping indexing on searchidx1001 to rsync to searchidx2
[09:52:22] Logged the message, notpeter
[09:53:02] !log cancel that, it's mid-cron. will do later
[09:53:06] Logged the message, notpeter
[09:54:40] innodb compression will not help with our blobs scheme too much
[09:57:34] paravoid?
[09:58:26] i don't think it will either, i'm trying dynamic though since we have some amount of blobs that should fit in the pkey page in that case
[09:59:50] mark: hi
[09:59:55] what to do about lvs servers?
[10:00:00] you told me not to reinstall
[10:00:04] but dunno why
[10:00:19] i'm comparing the size of compact vs. dynamic for one of the enwiki blobs tables.. i will probably try compressed after, just for the hell of it. maybe the page size change will result in better space utilization. innodb compact uses around 30% more space than myisam in this case
[10:03:04] paravoid: you can reinstall lvs1 until it works fully automatically
[10:03:11] but I want to check it completely before we do the others
[10:03:15] a few changes to the manifests too
[10:03:18] no more pdns recursors
[10:03:28] oh?
[10:03:37] no longer needed, pybal now does its own dns resolution
[10:03:46] no timeout on every check anymore
[10:03:50] that was the issue
[10:03:59] if primary resolver was down, checks would timeout and depool everything
[10:04:09] even before the 2nd resolver could help out
[10:04:17] right
[10:04:37] now it resolves everything on startup and then it's fine
[10:04:48] in a future release I'll make it resolve when the ttl expires, but I think it's fine now
[10:04:49] you guys need any help with anything?
[10:05:10] feel free to setup 6to4/miredo relays ;-)
[10:05:19] first we need to fix the lvs server installs
[10:05:22] heh
[10:05:27] then we could use some help in reinstalling the boxes perhaps
[10:05:29] but right now, not much
[10:05:36] where should I stick the relays?
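The "resolve everything on startup" approach mentioned at 10:04 can be sketched roughly as follows. This is only an illustration under stated assumptions, not PyBal's implementation: `resolve_all` and `HealthChecker` are hypothetical names, and the resolver is injectable so that the health-check path can be shown never to touch DNS again after startup.

```python
import socket


def resolve_all(hostnames, resolver=socket.getaddrinfo):
    """Resolve every hostname once and return {hostname: [addresses]}."""
    resolved = {}
    for name in hostnames:
        try:
            infos = resolver(name, None)
        except socket.gaierror:
            # One unresolvable name should not block the other lookups.
            resolved[name] = []
            continue
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address string. Deduplicate, keep order.
        addrs = []
        for info in infos:
            addr = info[4][0]
            if addr not in addrs:
                addrs.append(addr)
        resolved[name] = addrs
    return resolved


class HealthChecker:
    """Hypothetical checker: lookups happen once, in the constructor."""

    def __init__(self, hostnames, resolver=socket.getaddrinfo):
        self.addresses = resolve_all(hostnames, resolver)

    def targets(self, name):
        # Checks read the cached addresses, so a dead resolver cannot
        # make every check time out later on.
        return self.addresses.get(name, [])
```

The trade-off is the one noted in the log: addresses go stale until a restart, which is why re-resolving when the TTL expires is mentioned as a later improvement.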
[10:05:36] try again later today ;-)
[10:05:42] on new misc servers
[10:05:43] low perf
[10:06:01] this of course assumes I know how to do that :D
[10:06:07] hehe
[10:06:08] I mostly know how to do that
[10:06:08] better time than any to learn
[10:06:10] it can also wait on paravoid
[10:06:18] so, feel free to let it for me or do it and ask me
[10:06:30] * Ryan_Lane nods
[10:06:34] I've setup 6to4 in ciscos before and read about miredo
[10:06:35] i'm having breakfast now
[10:06:38] just had shower :D
[10:06:42] will be back soon
[10:06:54] well, I don't like having infrastructure around that I don't understand, so I don't mind learning. don't wait on me, though
[10:07:00] I do my best software releases before breakfast
[10:07:06] like the spirit :)
[10:07:31] Ryan_Lane: 6to4 is easy, miredo setup is easy too but you first have to understand Teredo
[10:07:37] which is a huge mess
[10:07:37] * Ryan_Lane nods
[10:07:45] start reading the Teredo article on wikipedia
[10:07:52] I dislike when things break and I need someone else to fix them
[10:07:58] I needed to read it like 4 times before I understood what's going on
[10:08:01] :D
[10:08:13] what will we be using this for?
[10:08:18] a replacement for doing it on nginx?
[10:08:22] no no
[10:08:29] a lot of people use tunneling
[10:08:30] ah
[10:08:31] I see
[10:08:37] mostly automatically
[10:08:38] this is so that we can use ipv4 internally
[10:08:44] nope
[10:08:59] so, 6to4 for example
[10:09:06] some OS like Windows set it up automatically
[10:09:08] (teredo too)
[10:09:16] 6to4 connects to an anycast relay
[10:09:28] both on the way forward and on the way back (different relays)
[10:09:37] so you hit random tunneling servers on the internet
[10:09:40] some of them may be buggy
[10:09:48] or have packet loss
[10:09:54] or be very far way increasing latency
[10:10:10] by setting up our own relays, we minimize the hop count and can ensure performance
[10:10:30] we're sure people will use a relay that works and their experience won't be very crappy
[10:10:40] oh, this is nasty
[10:10:49] I'm not sure how much tunneled traffic we will really see
[10:10:56] but I guess we'd have to set it up and check :)
[10:11:00] yeah
[10:11:06] (we could also check analytics for the address spaces, but...)
[10:11:12] I get the reasoning, this is dirty as hell, though :(
[10:11:36] the way those protocols work you mean?
[10:11:39] yes
[10:11:42] yes.
[10:12:23] so,
[10:12:27] teredo is symmetric
[10:12:46] as a client, you always use the relay which is closer to the destination (e.g. website that you visit)
[10:13:10] right, so they'd use us, if we have one available
[10:13:16] right
[10:13:22] and we'd use that for the replies too
[10:13:34] so all of a client's traffic would pass through our teredo relay
[10:13:40] both ways
[10:13:50] 6to4 otoh, is asymmetric
[10:14:04] so, we need at least two. can this be load balanced, or set up in a failover way?
[10:14:14] we'd handle it via bgp?
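The 10:11 idea of checking analytics for the tunnel address spaces could be sketched like this, assuming client addresses are available as strings. 6to4 clients sit in 2002::/16 and Teredo clients in 2001::/32; Python's standard `ipaddress` module exposes both through the `sixtofour` and `teredo` properties. The function names (`tunnel_type`, `embedded_ipv4`, `count_tunneled`) are hypothetical.

```python
import ipaddress
from collections import Counter


def tunnel_type(address):
    """Return '6to4', 'teredo', or 'other' for an IPv6 address string."""
    ip = ipaddress.IPv6Address(address)
    if ip.sixtofour is not None:   # address is inside 2002::/16
        return "6to4"
    if ip.teredo is not None:      # address is inside 2001::/32
        return "teredo"
    return "other"


def embedded_ipv4(address):
    """Return the IPv4 address embedded in a 6to4 address, else None."""
    return ipaddress.IPv6Address(address).sixtofour


def count_tunneled(addresses):
    """Tally how many addresses fall into each tunnel category."""
    return Counter(tunnel_type(a) for a in addresses)
```

Running `count_tunneled` over a sample of request addresses would give a rough idea of how much tunneled traffic there is before any relay is built.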
[10:14:24] nah, I think that's an overkill
[10:14:36] well, we don't want a SPOF
[10:14:40] if that fails, another relay would be used on the internet
[10:14:44] it's all dynamic
[10:14:57] ah
[10:15:07] that's easy enough, then
[10:15:18] it's anycast
[10:15:24] we'll have one per dc
[10:15:28] eqiad and tampa
[10:15:30] that's enough ;)
[10:15:32] you just announce 2001:0::/32
[10:15:32] and one in esams
[10:15:36] over bgp
[10:15:44] ah, was gonna say, no esams? :D
[10:15:53] esams is less necessary
[10:15:56] tons of relays in amsterdam ;)
[10:16:00] so, if the one in eqiad dies, they'll hit pmtpa?
[10:16:04] yeah
[10:17:03] oh wow
[10:17:15] mark: the relay I'm hitting from pmtpa is... SURFnet!!!
[10:17:20] haha
[10:17:23] why am I not surprised
[10:17:26] that sucks
[10:17:36] Ryan_Lane: SURFnet is the GRNET of the Netherlands
[10:17:42] so a Teredo user in the U.S.
[10:17:51] :D
[10:17:56] would go back to the netherlands, to be decapsulated
[10:17:57] we could announce it to everyone
[10:18:00] and then back to the u.s.
[10:18:10] hence the need of running our own relays
[10:18:10] but it might hit a lot of traffic
[10:18:14] so let's wait with that ;)
[10:18:17] * apergos starts reading the teredo article
[10:18:19] don't want to have that tomorrow
[10:18:57] also the prefix gets filtered by our transits now anyway ;)
[10:19:07] yeah, I thought of that...
[10:19:15] 6to4 is easier and doesn't have that problems
[10:19:32] we just route 2002::/16 internally for our network to our relay
[10:19:34] and poof
[10:19:46] yep
[10:20:02] how do you want to do bgp injection?
[10:20:18] we could do a simple quagga
[10:20:23] just in case the box goes down
[10:20:26] yeah
[10:20:29] that should be enough
[10:20:30] 6to4 is being done on kernel space
[10:20:35] so, no other daemons to watch for
[10:20:39] indeed
[10:20:46] tcp over udp? uurgghh
[10:20:59] apergos: it's better than tcp over tcp :P
[10:21:04] hahahaha
[10:23:51] Ryan_Lane: http://www.getipv6.info/index.php/Linux_or_BSD_6to4_Relays
[10:23:54] this is getting a little network-y for me abilities
[10:23:59] *my
[10:24:10] you did juniper lesson 1, didn't you
[10:24:13] :D
[10:24:14] get on it already! ;-)
[10:24:25] hey, at least I can set up my own ports and such now
[10:25:01] I really need to tell my old employer that it's a bad idea to split the network group from the ops group
[10:25:09] well, if you have free time, this is nice practice
[10:25:13] true
[10:25:18] we can help out :)
[10:25:27] just play with it on a scratch box
[10:25:31] then when you have it working, puppetize
[10:25:38] i'll handle the router config for you ;)
[10:25:43] * Ryan_Lane nods
[10:25:51] but if you don't have time, fine too
[10:25:54] I do
[10:26:00] I have nothing better to do right now
[10:26:16] you have a day off or something? ;D
[10:26:30] no, but the things I need to work on, I was going to work on with paravoid ;)
[10:26:38] ah ;)
[10:27:08] so, helping frees him up faster anyway. heh
[10:27:15] start with 6to4 i'd say
[10:28:22] so, the router will be setup to advertise a route to the relay?
[10:28:45] or will the relay use quagga to advertise its own route?
[10:29:23] oh? you're waiting for me?
[10:29:32] for labs things
[10:29:37] didn't realize, sorry
[10:29:40] it's no rush, though
[10:29:57] ipv6 obviously has higher priority right now :)
[10:30:23] (and seeing as that I wanted to do this last year, I'd like to see it happen ;) )
[10:31:07] Ryan_Lane: the relay will use quagga to announce its route
[10:31:14] * Ryan_Lane nods
[10:31:16] and the router will accept that and route it on the network
[10:31:26] the routers will choose the closest relay automatically (anycast)
[10:31:47] it sees the same route in multiple places, it picks the shortest/fastest
[10:32:01] wow, this is seriously sketchy... (still reading article)
[10:32:03] heh. the good thing is the juniper class actually made it so that I understand the concepts of this at least ;)
[10:32:11] that the route actually goes to different destinations is not a problem here
[10:32:25] the same we'll do with dns soon
[10:32:29] yep
[10:32:50] though we're doing anycast everywhere will this, rather than just in the US, like we'll do with DNS
[10:32:55] *with
[10:32:58] well
[10:33:03] it's actually the same
[10:33:08] the anycast will not leave our network (for now)
[10:33:11] and esams is a different network
[10:33:28] aren't we going to also have a relay in esams, though?
[10:33:31] so esams will always hit either its own relay, or it will use some external one from the internet
[10:33:39] ahhh ok
[10:33:39] and pmtpa and eqiad will share their two relays
[10:33:49] esams will never use the pmtpa/eqiad relays
[10:33:57] (unless we make that work)
[10:34:58] paravoid: so... lvs1
[10:35:03] yes, on it
[10:35:07] it needs a reinstall, i used it as a test box now ;)
[10:35:14] yep
[10:35:15] let's make that work fully and then we'll move on with the others
[10:35:17] btw
[10:35:26] something to note
[10:35:36] is that the precise installer does indeed IPv6
[10:35:43] yeah
[10:35:46] we should do something with that soon
[10:35:50] just... perhaps not now ;)
[10:35:50] and at some point (probably not today)
[10:35:53] indeed
[10:36:03] heh, glad we're on the same page :-)
[10:36:29] also, if people are now gonna add AAAA records to each and every service on our network i will shoot them
[10:36:44] because the way most apache vhosts are now handled, that's a massive pain, just like https
[10:36:57] that should be redone, and only after we've had this done properly, I'd like to add ipv6 everywhere
[10:37:06] so mostly only core services right now
[10:40:02] do you guys know molly-guard?
[10:40:30] i've seen it
[10:40:33] not used it
[10:40:43] it's interesting
[10:41:02] it prompts you to type the hostname before rebooting, as to not reboot by mistake
[10:41:05] but
[10:41:08] you can also write custom hooks
[10:41:16] what would you like to use it for then?
[10:41:20] so I had one that refused to reboot if you had kvm running processes
[10:41:25] H
[10:41:26] ah
[10:41:29] and we could use it to refuse to reboot if it's an active LVS
[10:41:35] well
[10:41:41] it would automatically failover anyway
[10:41:53] bgp session closes, router switches over
[10:42:01] however, tcp sessions break
[10:42:04] but all the connections
[10:42:06] right
[10:42:06] (lvs state is not shared currently)
[10:42:07] yeah
[10:42:15] currently? is that possible?
[10:42:17] yes
[10:42:20] oh!
[10:42:23] lvs has replication for v4
[10:42:25] (lvs is new for me)
[10:42:32] it uses multicast
[10:42:37] and sends connection tuples over udp
[10:42:54] we don't use it, except manually sometimes when I feel the need
[10:42:59] but failovers are very rare
[10:43:07] and mostly people don't notice when they happen
[10:43:22] (obviously some people will be hitting reload ;-)
[10:44:52] i'm gonna add static routes on the router now
[10:45:00] routers
[10:45:14] static of what?
[10:45:19] what routes?
[10:45:22] lvs routes
[10:45:25] instead of bgp
[10:45:32] ah, right
[10:45:34] perhaps i'll implement bgp soon, but not today ;)
[10:45:40] hahaha
[10:45:45] we always have them anyway, as backup routes
[10:45:51] i.e. if all pybal hosts die, there's still a route to one lvs box
[10:46:11] if all the hosts or daemons?
[10:46:14] daemons
[10:46:21] right, that makes more sense :)
[10:46:26] or if the bgp implementation would have a bug or so
[10:46:29] I just changed it, for example
[10:46:34] yep
[10:46:39] I love how it's all redundant!
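The backup-route idea from 10:45 (a static route that only matters once the BGP-learned route disappears) can be modeled with a small sketch. It assumes a router that ranks candidate routes by a numeric preference where the lowest value wins, as Junos does (its default preference for BGP is 170). The `Route` type, `active_route`, the preference given to the static route, and the documentation-range addresses are all illustrative assumptions, not the production configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    source: str       # "bgp" or "static"
    preference: int   # lower value wins


def active_route(candidates):
    """Pick the candidate with the lowest preference, or None if empty."""
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.preference)


# Hypothetical service prefix announced by a load balancer over BGP,
# plus a less-preferred static route pointing at one fixed box.
bgp = Route("2001:db8:ed1a::b/128", "2001:db8:1::10", "bgp", 170)
static = Route("2001:db8:ed1a::b/128", "2001:db8:1::20", "static", 250)
```

While the BGP session is up, `active_route([bgp, static])` selects the BGP route; once it is withdrawn, `active_route([static])` falls back to the static one, so the service prefix stays reachable.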
[10:46:57] yeah
[10:47:05] people often ask me why we don't just use VRRP/CARP/etc
[10:47:10] but I think this is so much better
[10:47:10] ew
[10:47:14] that feels clumsy ;)
[10:47:53] here's a static route for upload:
[10:47:54] static {
[10:47:54] route 2620:0:860:ed1a::b/128 {
[10:47:54] next-hop 2620:0:860:1:208:80:152:120;
[10:47:54] readvertise;
[10:47:54] no-resolve;
[10:47:55] preference 250;
[10:47:55] }
[10:47:56] }
[10:48:27] yay
[10:49:18] I think we can also have juniper do a simple ping check on a static route, if we don't wanna do quagga on those relays
[10:49:32] iirc that was possible
[10:50:29] iface eth0 inet dhcp
[10:50:30] argh.
[10:50:32] still.
[10:50:34] :(
[10:52:48] oh I know
[10:55:25] mark: oh, you're saying we should have a misc-services address for https and ipv6, rather than having a billion records?
[10:56:14] I'd love that, as I'd really like to get the https certificates off the misc servers
[10:56:42] Ryan_Lane: rather, I want to migrate to a new apache manifest/module in puppet first
[10:56:46] though if we move to stud, that becomes harder
[10:56:50] so we're not rewriting the same vhost stuff each and every time ;-)
[10:56:51] I understood that he said that we should use Include in Apaches
[10:57:02] ah
[10:57:14] I despise our apache config in puppet
[10:57:17] yeah
[10:57:24] just wanna say "enable ipv6!" and be done.
[10:57:31] like Include commonSSLEngine on [...] Include common and so on
[10:57:33] like we did for nginx? :)
[10:57:36] yeah
[10:57:45] yeah or some template
[10:57:50] i started a new style system a while ago
[10:57:51] but tbh
[10:57:57] many people must have written good apache puppet modules ;-)
[10:58:02] yes
[10:58:17] if it's sort of similar to what I started and offers the same flexibility we need, then i'm fine with using that too
[10:58:22] as long as we don't continue the way we do it now
[10:58:26] hate hate hat
[10:58:31] see, can't even type
[10:58:35] agreed. we have 5 different ways to do it
[10:58:46] and all of them suck
[10:58:51] so unlike with https, I will shoot when people add v6 everywhere now
[10:58:55] before we tackle this
[10:59:06] even though v6 is a bit easier than https and new :443 vhosts ;)
[11:02:47] junos needs a 'sort' command
[11:03:05] I hate it when my routes and stuff are not in order :P
[11:05:35] aaaaand reinstalling again
[11:06:00] any smart ideas on how to fix the partitioning while I'm at it?
[11:06:35] what's currently wrong about it?
[11:07:11] there's no partioning for lvs servers
[11:07:14] I'm doing it by hand
[11:07:37] just use partman/lvm.cfg
[11:10:39] any specific misc server to use for this?
[11:11:01] any free low perf misc server
[11:11:05] is robh not around? :)
[11:11:24] don't thin so
[11:11:49] look in rack b4 in pmtpa
[11:11:55] racktables should list which ones are in use
[11:14:06] YES
[11:14:17] iface eth0 inet static
[11:15:31] good!
[11:15:32] what was it?
[11:15:43] disable_dhcp has been renamed to disable_autoconfig
[11:15:48] i hate that
[11:15:50] they rename stuff all the time
[11:16:00] and don't document properly
[11:16:05] make sure lucid installs don't break ;)
[11:16:56] there's nothing in b4
[11:17:02] a4 then?
[11:17:24] I'm leaving both in
[11:17:26] shouldn't hurt
[11:17:26] there's no a4 in pmtpa?
[11:17:34] oh I mean sdtpa
[11:17:35] sorry
[11:17:36] in fact, there's no row a
[11:17:37] hahaha
[11:17:38] ok
[11:17:38] (for both)
[11:17:40] that makes more sense :)
[11:17:52] ohh, pmtpa row A
[11:17:55] *shivers*
[11:18:07] it was the horror
[11:18:38] blondel is a Wikimedia Core Database server (db::core)
[11:18:41] wtf
[11:18:48] why? whyyyyyy?
[11:18:50] db9/db10?
[11:19:41] getting food
[11:19:44] back in a little bit.
[11:21:03] damn foundry
[11:30:37] ok
[11:30:39] static routes are done
[11:47:06] New patchset: Mark Bergsma; "Don't install PDNS recursors on Precise LVS balancers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10241
[11:47:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10241
[11:48:47] New patchset: Mark Bergsma; "Don't install PDNS recursors on Precise LVS balancers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10241
[11:49:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10241
[11:49:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10241
[11:49:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10241
[12:01:34] zhen seems like it's installed, but I can't ssh into it
[12:01:41] I'm assuming it's in use
[12:01:45] New patchset: Mark Bergsma; "Add IPv6 LVS service IPs to bits realservers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10242
[12:01:50] not with the installer key?
[12:02:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10242
[12:02:18] well, I'd imagine if it's installed, then it's there for a reason
[12:02:26] so I shouldn't steal it
[12:02:54] check with serial consoel?
[12:03:01] might just be a test box someone forgot about
[12:03:12] New patchset: Mark Bergsma; "Add IPv6 LVS service IPs to bits realservers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10242
[12:03:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10242
[12:03:45] New patchset: Mark Bergsma; "Add IPv6 LVS service IPs to bits realservers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10242
[12:04:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10242
[12:04:14] it's installed
[12:04:20] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10242
[12:04:20] it has a wikimedia.org dns address
[12:04:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10242
[12:04:31] perhaps it's fundraising
[12:05:48] ah
[12:05:50] it's for vumi
[12:06:12] capella is installed, but isn't in site.pp
[12:06:20] stupid copy/paste in adium
[12:06:41] capella is waiting for me to get back to it
[12:07:02] I don't see any open misc servers, then :(
[12:07:35] take on in eqiad first then
[12:07:37] one
[12:07:38] I wish people should add the role information
[12:07:43] *would
[12:07:54] so that when I log in, I know what the damn box is used for
[12:08:20] B4/A4 in eqiad?
[12:08:22] yes
[12:10:37] !log rebuilding capella as precise
[12:10:42] Logged the message, Master
[12:14:16] New patchset: Ryan Lane; "Changing capella to precise and giving it a partman recipe." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10244
[12:14:38] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10244
[12:14:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10244
[12:14:47] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10244
[12:14:49] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10244
[12:16:56] god damn it
[12:17:00] notice: Run of Puppet configuration client already in progress; skipping
[12:17:04] no puppet in ps
[12:17:07] no lock file
[12:18:24] New patchset: Mark Bergsma; "Assigning IPv6 LVS service IPs to LVS balancers for upload and mobile" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10245
[12:18:32] any idea how to fix that?
[12:18:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10245
[12:18:53] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10245
[12:18:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10245
[12:18:58] no lock file?
[12:19:04] oh, of course when I run it via strace it works
[12:19:06] there's always a lock file when I see that
[12:19:18] I hate puppet so, so much
[12:19:46] wait. no. just took longer to tell me puppet is already running
[12:19:54] where is this?
[12:19:56] ah ha
[12:19:58] /var/lib/puppet/state/puppetdlock
[12:20:01] yeah
[12:20:06] the lock is in a different place
[12:20:08] silly
[12:20:11] that's the normal place
[12:20:21] where did you think it'd be?
[12:20:21] why don't they use /var/run? :(
[12:20:25] that's for pid files
[12:22:36] ugh drac5
[12:24:13] paravoid: status?
[12:28:22] New review: Lydia Pintscher; "Please do not redirect wikidata.org to meta:Wikidata. This is exactly what we want to avoid. wikidat..." [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/9874
[12:29:30] New patchset: Mark Bergsma; "Add IPv4 mapped main IPv6 server IPs to all bits realservers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10246
[12:29:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10246
[12:30:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10246
[12:30:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10246
[12:32:58] ugh
[12:33:03] capella is pmtpa.wmnet
[12:33:09] that needs to be wikimedia.org, doesn't it?
[12:33:16] yeah so you need to change subnet
[12:33:21] fortunately you can do that now! ;-)
[12:33:25] :(
[12:33:33] not if it's foundry or cisco
[12:33:45] b4 or a4?
[12:33:57] lemme see
[12:34:06] "lldpctl" will tell you
[12:34:09] A4
[12:34:13] then it's foundry
[12:34:17] heh
[12:34:20] cisco we don't have ;)
[12:34:26] mark: sorry, was having lunch
[12:34:39] paravoid: you're now the critical path ;-)
[12:34:59] am I?
[12:35:05] lemme push
[12:35:06] i'm nearly done with the rest...
[12:35:07] meh, the disk is too small for partman/lvm.cfg
[12:35:13] so what's the status now?
[12:35:26] Ryan_Lane: I think you can change it into percentages
[12:36:15] it's at 80% right now
[12:37:08] then how is it too small?
[12:37:20] you can use percentages inside the recipe too
[12:37:31] I have a new recipe for lvs
[12:37:38] lemme down a final reboot to check for it
[12:37:47] You asked for 28.5 GB to be used for guided partitioning, but the selected partitioning recipe requires at least 34.9 GB.
[12:38:26] meh
[12:38:41] I'll just not use a partition scheme and do guided with lvm
[12:38:59] I rarely do partitioning schemes when doing one-off misc servers
[12:39:14] well, I often use raid1
[12:39:37] I think this system doesn't have more than one disk, though, so what's the point
[12:40:08] New patchset: Mark Bergsma; "Add IPv6 main server IPs to the mobile realservers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10247
[12:40:14] I think it does
[12:40:21] I don't think we have any servers left that don't
[12:40:28] this is a pretty old one
[12:40:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10247
[12:40:46] well, if it does, then I'll do a raid1
[12:40:47] not that old
[12:40:50] I can say with safety that I've provisioned lvs1 more than a dozen times
[12:40:59] New patchset: Ryan Lane; "Remove auto-partitioning scheme from capella" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10248
[12:41:15] mutante: I hope you're on today at some point :(
[12:41:21] New review: Bhartshorne; "asher's right. don't merge this key until I get home, wipe the laptop, and regenerate." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/10227
[12:41:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10248
[12:41:23] internet still isn't working at my place
[12:41:39] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10248
[12:41:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10248
[12:41:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10247
[12:41:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10247
[12:42:53] hm
[12:42:57] which subnet for capella?
[12:43:04] public-services
[12:43:04] public services 2?
[12:43:06] no
[12:43:10] we're trying to get rid of that
[12:43:10] ok
[12:43:12] or, use squid subnet
[12:43:13] that's not as full
[12:43:15] yeah use that
[12:43:17] ok
[12:44:20] I merged your change
[12:44:25] thanks
[12:44:44] paravoid: so... is lvs1 ready now?
[12:44:50] sec
[12:45:18] mark: I used 208.80.152.115 for capella
[12:46:17] New patchset: Faidon; "autoinstall: another fix for precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10249
[12:46:27] sounds fine
[12:46:40] New patchset: Faidon; "autoinstall: add partman recipe for LVS servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10250
[12:46:44] mind making the change in the foundry? :)
[12:47:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10249
[12:47:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10250
[12:47:02] paravoid: dude
[12:47:07] why didn't you put it in common.cfg ;)
[12:47:14] Ryan_Lane: ok
[12:47:31] !log changing capella's subnet in DNS
[12:47:35] Logged the message, Master
[12:48:24] because disable_dhcp was there
[12:48:34] (running puppet on lvs1)
[12:49:08] so, should I amend or push it as-is?
[12:49:15] the installation works fully automated now btw
[12:49:16] whichever you prefer
[12:49:18] good
[12:49:24] as for the partman recipe
[12:49:34] I was trying to make the swap partition primary
[12:49:39] but it wasn't working for some reason
[12:49:44] so I just left it as sda5
[12:49:46] oh well :)
[12:50:31] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10249
[12:50:34] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10249
[12:50:45] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10250
[12:50:48] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10250
[12:52:24] applied on brewster
[12:53:03] doing a reboot on lvs1 so that lvs kernel settings take effect
[12:53:09] and should be ready to go
[12:53:24] awesome
[12:53:59] the thing with ifup that we did is noisy though
[12:54:03] it logs everytime you write into puppet
[12:54:06] but should be fine for now
[12:54:10] I know
[12:54:14] want to fix that some day
[12:54:16] ...but not now
[12:55:27] so, done
[12:55:30] should I move to lvs1001?
[12:55:32] or amslvs1?
[12:55:38] lvs2
[12:55:39] or 2 or whatever
[12:55:49] i'll check lvs1 real quick
[12:56:10] paravoid: you're brave. most people are unwilling to touch the lvs servers
[12:58:05] paravoid: looks good
[12:58:07] go ahead :)
[12:58:51] hm. it isn't pxe booting
[12:59:12] dns cache?
wrong subnet [12:59:20] probably dns cache [12:59:52] !log rebooting lvs2 to reinstall with precise [12:59:56] Logged the message, Master [13:00:02] bah [13:00:05] hm, haven't heard from the nagios bot in a while [13:00:14] and I've rebooted lvs1 so many times that it should have said something [13:00:30] well often it misses it [13:00:37] the installer has ip too, so does ping reply [13:00:38] but yeah [13:00:40] it's a bit silent [13:00:51] New patchset: Ryan Lane; "Give capella the correct address in dhcp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10251 [13:01:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10251 [13:01:34] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10251 [13:01:35] New patchset: Faidon; "autoinstall: pick precise for lvs2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10252 [13:01:38] oh crap, I sense a conflict [13:01:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10251 [13:01:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10252 [13:02:26] ? [13:02:43] Ryan and I touched the same file [13:02:48] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10252 [13:02:50] gerrit merges anyway [13:02:50] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10252 [13:02:55] we did? 
[13:02:56] which is a bit dangerous [13:03:03] hm, didn't know that [13:03:12] Ryan_Lane: yes, see above [13:03:24] it merges fine if it can cleanly merge [13:03:38] yes [13:03:39] good to know [13:03:39] but yeah, it can be dangerous, depending [13:03:43] doesn't mean it's semantically correct though ;) [13:03:48] exactly [13:04:14] I have drac5 so much [13:04:21] *hate [13:04:29] so much it makes me type improperly [13:04:54] Ryan_Lane: are you building capella with lucid? [13:04:59] with precise [13:05:02] ah, great [13:05:10] otherwise it was already built ;) [13:05:17] :-) [13:05:39] Ryan_Lane: the good thing of this whole exercise is that I know enough of our provisioning [13:05:46] that I might be able to do it for the Ciscos too [13:05:52] and you can improve it [13:05:58] that too :-) [13:06:03] since it's basically still the same as when I set it up in 2006 when we moved from fedora to ubuntu ;) [13:06:04] oh my god. fuck you drac5 [13:06:10] it's a bit primitive [13:06:19] it's not that bad [13:06:22] we used to have FAI at GRNET [13:06:26] they still do I suppose [13:06:28] and I hated it [13:06:36] ubuntu has a new MaaS thing [13:06:41] I did a soft reset of the drac specifically so that I wouldn't get an error before I powercycled [13:06:41] i wonder if that has something good for installs [13:06:46] metal as a service [13:06:57] then I powercycled, and tried to connect, and error :( [13:06:57] dunno [13:07:04] mark: MaaS is cobbler [13:07:11] the rest of it is juju [13:07:15] i've not looked at it at all [13:07:20] so, if we wanted that, let's just use cobbler [13:07:24] the thing that we did "better" was that we had autosign enabled and puppet run automatically after reboot [13:07:47] so, you just pxe and then you ssh'ed normally [13:07:55] with manual cleaning of keys before reinstall? [13:08:03] well, yes, of course [13:08:09] how did you verify the client? 
[13:08:10] but no new_install keys and the such [13:08:15] Ryan_Lane: domain name [13:08:22] * Ryan_Lane twitches [13:08:23] in autosign you can say *.wikimedia.org [13:08:28] it checks reverse and then back forward [13:08:38] if we used dnssec, maybe I'd trust that [13:08:47] fuck dnssec [13:08:55] why's that? [13:09:02] one small error and your zone breaks [13:09:05] we didn't have much private info in the repo either though [13:09:08] * Ryan_Lane nods [13:09:11] just the root password [13:09:15] it's very complicated [13:09:18] which was sha512 with 40k rounds [13:09:39] if we had dnssec we could also stick our ssh fingerprints in there too [13:09:40] let's use that for our next pw change [13:09:59] if puppet would work better ssh fingerprints would be fine [13:10:01] ssh keys [13:10:07] stupid external resources [13:10:09] someone should finally fix that [13:10:30] what's the problem? [13:10:36] that it slows down puppet so much [13:10:50] exporting ssh keys and collecting them back? 
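A minimal sketch of the check described above for autosigning ("it checks reverse and then back forward"), assuming a glob list such as `*.wikimedia.org`. The function names and the injectable `reverse`/`forward` lookups are hypothetical and exist only so the logic can be exercised without live DNS; this is not Puppet's actual implementation.

```python
import fnmatch
import socket


def forward_confirmed_name(ip, reverse=None, forward=None):
    """Return the PTR name for ip only if that name resolves back to ip."""
    # Default lookups use the system resolver; both can be swapped out.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (
        lambda name: {info[4][0] for info in socket.getaddrinfo(name, None)}
    )
    try:
        name = reverse(ip)
        addresses = forward(name)
    except OSError:  # covers socket.herror and socket.gaierror
        return None
    return name if ip in addresses else None


def may_autosign(ip, patterns, **lookups):
    """True when the forward-confirmed name matches one of the glob patterns."""
    name = forward_confirmed_name(ip, **lookups)
    return name is not None and any(fnmatch.fnmatch(name, p) for p in patterns)
```

A PTR record alone is controlled by whoever owns the reverse zone, which is why the name is resolved forward again before it is trusted. Without DNSSEC the answer is still only as trustworthy as the resolver path, which is the objection raised in the channel.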
[13:10:51] we have ssh key collection turned off on most puppet runs to prevent it from slowing down [13:10:54] the collection takes ages [13:10:58] so only fenari does it on every run [13:11:05] yes, because it checks the file first to see if it's there already [13:11:07] the rest of servers, only once every 20 runs or so (random) [13:11:10] stupid way of doing things [13:11:14] we can do many things for that [13:11:28] a) collect each key into a different file then have a hook to cat them all together [13:11:36] I don't see how it's safe to us exported resources for this anyway [13:11:36] in a recurse/purge => true directory [13:11:42] *for [13:11:49] b) adapt naggen to do that [13:11:51] the entire idea is to spot when a key changes [13:12:14] if a key changes, and it gets propagated out and is trusted, then we have no clue it changed [13:12:18] we may as well turn it off [13:12:32] I sync fenari's known hosts to my computer manually btw :) [13:12:44] and only my computer(s) do key verification [13:13:00] I guess it protects against arp poisoning and other MITMs [13:13:04] yes [13:13:06] that's the point of it [13:13:37] capella has one disk [13:13:42] hw raid1 then? [13:13:47] maybe [13:13:54] no wikimedia misc server has one disk. seriously [13:14:02] this one does ;) [13:14:11] Ryan_Lane: on wikitech.wikimedia I keep hitting the silly captcha, as much as I love doing maths, is there a user group you can put me in to bypass it? [13:14:14] who stole it!? [13:14:19] what. the. fuck. 
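Option (a) above, one file per host key plus a hook that concatenates them, could look roughly like the following. This is a sketch under assumptions: the directory layout, file names, and the `build_known_hosts` helper are hypothetical and not taken from the puppet repository.

```python
from pathlib import Path


def build_known_hosts(fragment_dir, output):
    """Concatenate per-host key fragments into one known_hosts file.

    Returns True if the output file was (re)written, False if it was
    already up to date.
    """
    lines = []
    # Sorted order keeps the result stable between runs.
    for fragment in sorted(Path(fragment_dir).iterdir()):
        if fragment.is_file():
            lines.extend(
                line.strip()
                for line in fragment.read_text().splitlines()
                if line.strip()
            )
    text = ("\n".join(lines) + "\n") if lines else ""
    out = Path(output)
    if not out.exists() or out.read_text() != text:
        out.write_text(text)
        return True
    return False
```

Because each host only manages its own small fragment, the expensive part (checking every key against one large file) is replaced by a single rebuild. It does not address the concern raised afterwards, that a changed key would still be propagated and trusted silently.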
[13:14:30] this is the worst guided partitioning I've ever seen [13:14:49] ryan has clearly not done a lot of installs recently ;-) [13:14:56] Ryan_Lane: you can use lvs.cfg :-) [13:15:02] 1.2 GB for /, and 34 GB for swap [13:15:05] actually this might be a question for woosters [13:15:08] paravoid: tried that [13:15:09] the networky 6to4 stuff is not his biggest hurdle today ;) [13:15:20] lvm [13:15:27] Thehelpfulone: It does that for everyone [13:15:33] Thehelpfulone: if you add an external link [13:15:58] Ryan_Lane: and? [13:16:02] lvs.cfg, not lvm [13:16:05] the one I just made [13:16:06] ah [13:16:08] ok [13:16:35] I suppose there's no reason we should add ipv6 addresses to our lvs balancers... [13:16:35] it creates a 1GB swap as sda5 and the rest space sda1, ext3 for / [13:16:46] that'll work perfectly [13:17:04] Ryan_Lane: ok, does it need to (if you have an account then surely you're trusted to link properly?) [13:17:16] Ryan_Lane: me likes you being in an EU tz [13:17:24] me too [13:17:31] you should totally hit up that airbnb girl [13:17:31] Thehelpfulone: dunno. no one wants to make any changes on wikitech, though [13:17:34] ;) [13:17:38] you should totally move permanently [13:17:41] :D [13:17:43] hahaha [13:17:46] heh [13:17:53] I do like berlin.... [13:18:06] I wonder how they'd handle my employment [13:18:06] who knows, you may found someone here [13:18:07] wink wink [13:18:18] :D [13:18:23] just keep a US address and be done with it? [13:19:02] New patchset: Ryan Lane; "Use lvs.cfg on capella" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10253 [13:19:04] that would work [13:19:06] so, where's lvs2? [13:19:08] I could use my mom's [13:19:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10253 [13:19:24] lvm.cfg I hope [13:19:33] mark: collecing SSH keys :-) [13:19:34] no. 
I used lvs.cfg [13:19:40] oh [13:19:47] you guys like to make things confusing [13:19:52] yep [13:19:56] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10253 [13:19:56] does lvs really need its own recipe? [13:19:58] I mean [13:19:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10253 [13:20:00] no [13:20:01] it doesn't [13:20:03] it doesn't care at all about partitioning ;) [13:20:11] it should have been called something else ;) [13:20:16] so it should probably be more generic [13:20:33] find a name and I'll fix it [13:20:49] generic.cfg [13:20:49] :D [13:20:55] (the network shouldn't be called pmtpa-squid either, that cost a reboot and a head-scratching) [13:20:59] cost me [13:21:09] hehe [13:21:17] since I tried the disable_autoconfig on pmtpa.cfg first [13:21:21] if we just took those network parameters from dhcp, we wouldn't need all the subnet files [13:21:35] it's either public or internal, and a few subnet specific params [13:21:41] all of which can come from dhcp I would think [13:22:11] lvs2 is ready [13:22:28] I seriously can't wait till all drac5 systems are dead [13:22:38] I thing nagios-wm needs a restart, remind me how to do that... [13:22:42] ryan has so much hate [13:22:45] puppet, drac5 [13:22:50] paravoid: /etc/init.d/ircecho restart [13:22:55] thanks. [13:23:03] I should rewrite that bot some day [13:23:04] mark: it's healthy to hate terrible things [13:23:08] which bot? [13:23:10] ircecho? 
[13:23:15] all our bots break all the time [13:23:21] seriously, what's so hard to stay online on irc ;) [13:23:21] it just needs like a line to reconnect [13:23:29] cool [13:23:31] then add that :P [13:23:39] the bot itself is actually fairly nice [13:23:52] or i'll rewrite it in twisted some day [13:23:54] with a line to reconnect :P [13:23:59] bah [13:24:02] that's overkill [13:24:08] but it'd work [13:24:11] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.147 seconds [13:24:17] the irclib has support for reconnect [13:24:21] i've been restarting wikimedia irc bots for 8 years now [13:24:34] somehow our irc bot devs can't seem to figure it out [13:24:43] -_- [13:24:47] that's my bot [13:24:57] yes I'm also looking at you ryan lane ;-) [13:25:09] I'm going to write them to all write into a central queue [13:25:13] and have one bot that writes to irc [13:25:17] that'd be good [13:25:21] it's stupid having to have a user for every bot [13:25:28] and to have every machine connecting to irc [13:25:41] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.108 seconds [13:25:56] mark: the hard part about reconnecting is things like netsplits [13:25:58] mark: lvs2 is ready (said that before) [13:26:03] I believe that's why the bot died, btw [13:26:07] paravoid: including reboot? [13:26:15] oops [13:26:40] you can go on with lvs1005 and lvs1006, if they're idle [13:26:49] fuck [13:27:06] I forgot to run puppet on brewster. now I need to reboot this host again [13:27:13] yeah that's annoying [13:27:26] even more so with drac5 [13:27:29] ack [13:27:29] since it's all public anyway, we should probably host autoinstall from the puppetmaster or something [13:28:32] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [13:28:42] hm. 
I wonder if MaaS can handle the rebooting and pxe boot steps of the install [13:28:52] that could be seriously dangerous, though [13:30:03] that would beat paravoid's story of puppet firewalling with exported resources [13:30:14] all servers reinstalling at the same time [13:30:20] hahaha [13:30:21] indeed [13:30:36] New patchset: Faidon; "autoinstall: pick precise for lvs1005 & lvs1006" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10254 [13:30:37] HAHAHAHA [13:30:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10254 [13:31:04] if you're lucky, the install hosts tries that too ;) [13:31:33] -_- [13:31:41] it didn't get the partition config [13:31:46] * Ryan_Lane sighs [13:32:06] not sure why [13:33:28] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10254 [13:33:30] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10254 [13:35:24] eeeh [13:35:43] faidon@fenari:~$ host lvs1005.mgmt.eqiad.wmnet [13:35:43] Host lvs1005.mgmt.eqiad.wmnet not found: 3(NXDOMAIN) [13:35:43] faidon@fenari:~$ host lvs1005.mgmt.pmtpa.wmnet [13:35:43] lvs1005.mgmt.pmtpa.wmnet has address 10.65.3.43 [13:35:45] wtf? [13:35:56] aren't 1xxx eqiad hosts? [13:36:03] yes [13:36:10] that's wrong ;) [13:36:49] first time I was logging into an eqiad host's mgmt (or probably an eqiad host in general) [13:36:56] lucky me. [13:37:09] some people have talent ;) [13:38:44] New patchset: Mark Bergsma; "Configure lvs2 for IPv6 duty" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10255 [13:39:04] the host didn't pick up its hostname somehow? [13:39:05] weird [13:39:06] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10255 [13:39:11] I hate installing systems [13:39:21] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10255 [13:39:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10255 [13:39:45] puppet, drac5, installing systems [13:40:02] there's room for more today [13:40:04] it's a proper list [13:40:23] PROBLEM - Host lvs1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:40:29] hahahahahah [13:40:50] !log rebooting lvs1005 to reinstall with precise [13:40:54] Logged the message, Master [13:41:32] at least dataset1 is no longer on anyobne's hate list [13:41:39] and someday solaris will be off it too [13:41:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 25, down: 1, shutdown: 1BRPeering with AS64600 not established - BR [13:45:56] RECOVERY - Host lvs1005 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [13:47:51] manually partitioned :) [13:48:13] way quicker [13:48:56] PROBLEM - SSH on lvs1005 is CRITICAL: Connection refused [13:49:07] !log reimaging db1042 [13:49:12] Logged the message, notpeter [13:51:50] Logged the message, Master [13:52:53] cmjohnson1: woo! thank you! [13:53:14] who is in dns wmnet? [13:53:31] get out :P [13:53:36] me [13:53:41] got out [13:53:48] thanks [13:53:53] was looking to fix lvs1005.pmtpa etc. [13:54:04] ok ;) [13:54:07] why does this system have a hostname of "unassigned"? [13:54:20] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:54:29] does it resolve? [13:54:32] yes [13:54:35] reverse dns too? [13:54:37] yes [13:54:41] restarted dhcpd? 
[13:54:50] puppet does that automatically [13:54:54] after a while [13:55:01] ah no now it does indeed [13:55:02] i force ran puppet [13:55:05] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [13:55:41] PROBLEM - Host db1042 is DOWN: PING CRITICAL - Packet loss = 100% [13:55:42] mark: ping me when you're done with wmnet so I can fix that [13:55:44] i'm adding v6 dns records [13:55:54] which is the most tedious thing ever [13:55:58] should automate this [13:56:44] paravoid: i'm done in wmnet, so you can edit but not commit [13:56:52] rev dns now [13:57:32] Ryan_Lane: I presume virt1000 is also eqiad? [13:57:38] and everything that's 1000-1999? [13:57:38] yes [13:57:47] good, will fix [13:57:52] cool. thanks [13:58:50] RECOVERY - SSH on lvs1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:59:23] wmnet is too large, maybe we should split it... [14:00:06] do we have a script somewhere to purge puppet for an old host? [14:00:21] it's called puppetca --clean on sockpuppet [14:00:26] or what do you mean [14:00:35] exported resources [14:00:43] yeah [14:00:43] capella.pmtpa.wmnet needs to die [14:00:47] in /usr/local/sbin on sockpuppet iirc [14:01:34] thanks [14:02:10] i'm so gonna automate this [14:02:49] automate what? [14:04:14] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [14:04:20] v6 dns [14:05:41] New patchset: Pyoungmeister; "redeploying db1042 as s1 slave" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10256 [14:05:44] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:05:56] paravoid: i'm gonna svn commit now [14:05:59] should I include your change? [14:06:05] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10256 [14:06:32] didn't make it yet [14:06:38] ok, committing now then [14:06:43] yes [14:07:13] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10256 [14:07:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10256 [14:07:56] make your change, i'll make further changes soon [14:11:08] miredo assumes two servers, it seems [14:11:32] what prefix do we want to use? [14:12:11] 2620:0:862:ed1a ? [14:12:16] for what? [14:12:20] miredo [14:13:23] well, 2620:0:860:ed1a::. since I'm in pmtpa, I guess [14:13:37] doing what? [14:13:52] "This directive specifies the Teredo prefix which the Teredo relay and/or server will advertise. teredo_prefix must be a valid IPv6 prefix." [14:14:00] definitely not that prefix then :) [14:14:05] heh [14:14:06] that's the lvs service IPs prefix [14:14:24] why not 2001::/32? [14:14:44] ah, we're just tunneling through for them right [14:14:54] no [14:15:02] well, sort of [14:15:14] cmjohnson1: yep, one sec [14:15:17] 2001::/32 is the prefix set aside (on the internet) for teredo [14:15:26] ah. I see [14:15:38] "The default value is 2001:0000::." [14:15:38] disclaimer: i've not used teredo myself [14:15:44] i've done 6to4 a lot [14:15:50] cmjohnson1: can't log in. is it booted up? [14:16:30] ah, kk [14:17:17] PROBLEM - Host lvs1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:20] RECOVERY - Host db1042 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [14:19:05] RECOVERY - Host lvs1005 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [14:19:15] paravoid: can i make further dns changes? 
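To make the prefix discussion above concrete, here is a small sketch using Python's standard `ipaddress` module. It assumes the well-known allocations: 2001::/32 for Teredo (as stated in the channel) and 2002::/16 for 6to4 (not stated in the log; taken from the 6to4 specification). The helper names `sixtofour_prefix` and `classify` are made up for illustration.

```python
import ipaddress

TEREDO = ipaddress.ip_network("2001::/32")
SIXTOFOUR = ipaddress.ip_network("2002::/16")


def sixtofour_prefix(public_v4):
    """Derive the /48 a 6to4 client builds from its public IPv4 address."""
    v4 = ipaddress.IPv4Address(public_v4)
    # Layout: 16 bits of 0x2002, then the 32-bit IPv4 address, then zeros.
    return ipaddress.IPv6Network(((0x2002 << 112) | (int(v4) << 80), 48))


def classify(address):
    """Label an IPv6 address as Teredo, 6to4, or neither."""
    addr = ipaddress.IPv6Address(address)
    if addr in TEREDO:
        server, client = addr.teredo  # (Teredo server, client's public IPv4)
        return ("teredo", str(server), str(client))
    if addr in SIXTOFOUR:
        return ("6to4", str(addr.sixtofour))
    return ("native",)
```

For example, `classify("2620:0:860:ed1a::1")` falls in neither range, which matches the point made above that the `ed1a` prefix belongs to the LVS service IPs and is not something a Teredo relay should advertise.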
[14:19:25] yes [14:19:28] k [14:19:52] lvs1005 is done (and rebooted) [14:20:01] cool [14:20:09] !log rebooting lvs1006 to reinstall with precise [14:20:14] Logged the message, Master [14:20:45] the eqiad installs are quite slower for some reason [14:21:00] brewster far away [14:21:11] 25ms, but still [14:21:20] PROBLEM - mysqld processes on db1042 is CRITICAL: Connection refused by host [14:21:20] PROBLEM - MySQL Idle Transactions on db1042 is CRITICAL: Connection refused by host [14:21:29] PROBLEM - MySQL Slave Running on db1042 is CRITICAL: Connection refused by host [14:21:47] PROBLEM - MySQL Slave Delay on db1042 is CRITICAL: Connection refused by host [14:21:47] PROBLEM - Full LVS Snapshot on db1042 is CRITICAL: Connection refused by host [14:21:50] Ryan_Lane: I would suggest starting with 6to4 btw [14:21:51] oh darn [14:21:56] PROBLEM - MySQL disk space on db1042 is CRITICAL: Connection refused by host [14:22:05] Ryan_Lane: ssl3001 is a bit fucked because of that old setup [14:22:23] PROBLEM - MySQL Recent Restart on db1042 is CRITICAL: Connection refused by host [14:22:32] PROBLEM - SSH on db1042 is CRITICAL: Connection refused [14:22:39] mark: in which way? 
[14:22:44] ip address setup [14:22:47] nvm, i'll fix later [14:22:48] oh [14:22:54] right, they were manually added [14:22:59] PROBLEM - SSH on lvs1006 is CRITICAL: Connection refused [14:23:34] paravoid: well, miredo seems fairly straightforward [14:23:41] very few config options [14:23:48] I'll switch to 6to4, though [14:24:02] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [14:25:19] Ryan_Lane: they're both very easy to setup, 6to4 is a bit more straightforward to test though [14:25:25] * Ryan_Lane nods [14:25:50] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.182 seconds [14:26:53] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.110 seconds [14:27:38] New patchset: Pyoungmeister; "adding new mac for bellin's new mobo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10260 [14:28:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10260 [14:28:23] RECOVERY - SSH on db1042 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:31:19] mark: btw, launch day starts at 00:00 UTC, sounds quite risky for us [14:31:42] we're starting at 10:00 UTC [14:32:05] sounds reasonable [14:32:51] should we mail them that we're going to participate after all? [14:32:57] no [14:33:11] RECOVERY - SSH on lvs1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:33:28] why am I doing 6to4 and miredo? [14:33:42] that souns like an existence question [14:33:47] lol [14:34:25] New patchset: Mark Bergsma; "Add IPv6 server IPs to eqiad and esams SSL proxies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10262 [14:34:29] what do you mean? :) [14:34:41] "to give a better experience to our users" [14:34:47] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10262 [14:34:54] you need both for it to work properly? [14:34:57] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10262 [14:35:00] it seems miredo does what 6to4 does [14:35:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10262 [14:35:02] no they're different things [14:35:07] some people use teredo, some 6to4 [14:35:12] ahhhh. ok [14:35:17] 6to4 needs a public IP on the client to work [14:35:23] * Ryan_Lane nods [14:35:28] we need a relay to have reliable service to clients of either [14:35:28] teredo uses UDP but is more complicated [14:35:37] miredo works through a NAT [14:35:42] 6to4 requires a public IP [14:35:46] so, (at least some versions of) Windows try 6to4 if they see a public IP [14:35:55] and fallback to Teredo if they're behind NAT [14:36:04] ok. makes sense now [14:36:44] also, 6to4 uses ipv6-in-ip, so it doesn't pass though restictive firewalls [14:36:52] who only allow tcp/udp/icmp [14:37:20] there's one reason why many websites couldn't enable ipv6 :) [14:37:33] client thinks it has 6to4 connectivity, but firewall blocks it [14:37:43] smarter clients nowadays check, but still [14:38:02] smarter browsers worked around the problem [14:38:05] happy eyeballs [14:38:08] yes [14:38:15] we were at the mozilla offices some years ago [14:38:21] had a little presentation [14:38:30] afterwards I asked an unrelated question to them... 
about ipv6 [14:38:41] before I could even finish many engineers shaked their head and walked away [14:38:46] saying that ipv6 would never happen [14:38:46] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10260 [14:38:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10260 [14:38:53] and it wasn't their responsibility [14:38:55] it was very odd [14:39:04] I'm assuming I want an stf device: http://manpages.ubuntu.com/manpages/hardy/man4/stf.4.html [14:39:05] cmjohnson1: I added the new mac. should I just reimage? [14:39:14] reimage? [14:39:16] for a new mac? :D [14:39:31] well, I hav eno idea how messed up the box is anyway... [14:39:40] and it's very pre-prod [14:39:43] Ryan_Lane: that's hardy [14:39:54] ubuntu has native support for 6to4 in /etc/network/interfaces [14:39:56] PROBLEM - NTP on db1042 is CRITICAL: NTP CRITICAL: No response from NTP server [14:39:56] http://manpages.ubuntu.com/manpages/precise/en/man4/stf.4freebsd.html [14:39:57] check the man page for that [14:39:58] ah [14:40:00] ok [14:40:07] for relays? I'm not so sure [14:40:17] i've configured it through /etc/network/interfaces before at least [14:40:20] and that was years ago [14:40:26] just using a tunnel interface [14:40:32] but I don't have an example ready anymore :( [14:40:58] yeah, ok, I shall do so [14:41:21] mark: lvs1005 & lvs1006 both ready to go [14:41:37] awesome! [14:41:40] amslvs3 and 4 left [14:41:51] ha [14:41:58] i was right and you were wrong [14:42:11] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 26, down: 0, shutdown: 1 [14:42:22] define interface_add_ip6_mapped($interface=undef, $ipv4_address=undef) { [14:42:22] if ! $interface { [14:42:22] $all_interfaces = split($::interfaces, ",") [14:42:22] $interface = $all_interfaces[0] [14:42:22] } [14:42:25] this doesn't work in puppet [14:43:08] scoping? 
[14:43:11] no [14:43:14] can't reassign variable [14:43:23] ... [14:43:41] Ryan_Lane: not sure about Ubuntu's /e/n/i but this is a nice guide http://lists.afrinic.net/pipermail/afripv6-discuss/2007/000067.html [14:44:09] thanks [14:44:40] New patchset: Mark Bergsma; "Fix reassignment of the $interface variable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10263 [14:44:50] these sprints are kinda nice [14:45:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10263 [14:45:14] and we don't really need a hackathon for it either [14:45:16] hmmm [14:45:23] do I need a different bastion host for esams? [14:45:26] yes [14:45:28] aha [14:45:29] but you can login directly [14:45:35] to mgmt? :) [14:45:38] no [14:45:41] use hooft.esams.wikimedia.org [14:45:45] okay [14:45:46] thanks [14:46:38] hmm [14:46:41] gerrit is terribly confused here [14:47:33] hooft sounds very dutch [14:47:40] it is [14:47:49] can't reach gerrit. 
here [14:48:02] hmm now I can [14:48:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10263 [14:48:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10263 [14:49:02] sorry for all the questions, still trying to find my way around [14:49:18] don't worry dude [14:49:25] most 1 year hires still don't dare touching lvs servers ;-) [14:51:05] aaahh crap [14:51:20] RECOVERY - MySQL Idle Transactions on db1042 is OK: OK longest blocking idle transaction sleeps for seconds [14:51:27] (nothing too serious, don't worry) [14:51:29] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay seconds [14:51:29] RECOVERY - MySQL Slave Running on db1042 is OK: OK replication [14:51:47] RECOVERY - Full LVS Snapshot on db1042 is OK: OK no full LVM snapshot volumes [14:51:56] RECOVERY - MySQL Slave Delay on db1042 is OK: OK replication delay seconds [14:52:14] RECOVERY - MySQL Recent Restart on db1042 is OK: OK seconds since restart [14:52:23] RECOVERY - MySQL disk space on db1042 is OK: DISK OK [14:53:02] New patchset: Faidon; "autoinstall: pick precise for amslvs3 & amslvs4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10264 [14:53:17] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: host 91.198.174.244, sessions up: 3, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [14:53:25] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10264 [14:53:25] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10264 [14:53:35] PROBLEM - Host amslvs3 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:06] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10264 [14:56:18] !log rebooting amslvs3 & amslvs4 to reinstall with precise [14:56:23] Logged the message, Master [14:58:48] PXE doesn't seem to have worked [14:58:50] RECOVERY - Host amslvs3 is UP: PING OK - Packet loss = 0%, RTA = 109.16 ms [14:58:59] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [14:59:05] is there a different PXE server or something? [14:59:37] yes [14:59:40] New patchset: Mark Bergsma; "Add IPv6 server IPs to upload.eqiad servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10266 [14:59:41] hooft [14:59:46] wait [14:59:49] this is our first precise install in esams [15:00:02] iirc we setup puppet to setup the tftpboot [15:00:04] but perhaps it's not working [15:00:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10266 [15:00:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10266 [15:00:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10266 [15:01:14] RECOVERY - NTP on db1042 is OK: NTP OK: Offset -0.04734921455 secs [15:03:21] my browser seems to prefer ipv4 when available [15:03:26] so I don't expect we'll be seeing much ipv6 traffic ;-) [15:03:38] hm [15:03:59] ip tunnel shows my tun6to4 tunnel, but not ip -6 [15:04:00] IIRC chrome makes a request on ipv4 and ipv6 and whichever is fastests it uses. [15:04:30] yes [15:04:37] "happy eyeballs", google that [15:05:39] mark: so, hooft:/srv/tftp doesn't have precise, do we provision that manually? [15:05:45] sit0 also shows as down... 
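The "happy eyeballs" behaviour mentioned above can be sketched as a staggered race: try IPv6 first, and if it has not connected within a short head start, start IPv4 as well and use whichever succeeds first. The code below is an illustration only, assuming a fixed head start and using `fake_connect` as a stand-in for real sockets; it is not how any particular browser implements it.

```python
import asyncio


async def fake_connect(family, latency, fail=False):
    """Stand-in for a real connection attempt (hypothetical helper)."""
    await asyncio.sleep(latency)
    if fail:
        raise OSError(f"{family} unreachable")
    return family


async def happy_eyeballs(connect_v6, connect_v4, head_start=0.05):
    """Return the result of the first attempt that succeeds."""
    v6 = asyncio.create_task(connect_v6())
    # Give IPv6 a head start before falling back to IPv4.
    done, _ = await asyncio.wait({v6}, timeout=head_start)
    if done and v6.exception() is None:
        return v6.result()

    v4 = asyncio.create_task(connect_v4())
    pending = {v4} if done else {v6, v4}
    last_error = v6.exception() if done else None
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            if task.exception() is None:
                for other in pending:
                    other.cancel()  # the loser is no longer needed
                return task.result()
            last_error = task.exception()
    raise last_error
```

This is why a client with broken 6to4 connectivity (for instance a firewall dropping the tunnelled packets, as described earlier) no longer stalls: the IPv6 attempt simply loses the race. Python's own `loop.create_connection` exposes a comparable `happy_eyeballs_delay` argument in 3.8 and later.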
[15:05:47] well [15:05:47] no [15:05:50] there's stuff in puppet [15:05:55] that should work, but wasn't tested on hooft [15:05:59] it's in misc/install-server.pp iirc [15:06:03] check that [15:06:04] and tun6to4 shows as UNKNOWN [15:07:24] heh [15:07:31] puppet provisions /srv/tftpboot indeed [15:07:37] but hooft's tftp runs off /srv/tftp :) [15:07:59] change hooft to use what puppet does nowadays [15:08:09] yp [15:08:10] yep [15:08:19] as of recently we have all that in the repo and the volatile file module [15:08:23] before it was rsynced manually [15:08:38] ok... [15:08:42] i'm gonna prepare lvs1005 and 1006 now [15:09:54] hm. why'd I need to manually bring sit0 up? [15:11:26] oh fuck [15:11:33] paravoid: I said lvs1006 didn't I [15:11:37] I should've said lvs1004 ;-) [15:11:43] hahahaha [15:11:52] lvs1006 was inactive too though [15:11:55] yes [15:11:57] I checked before I rebooted [15:11:57] it's internal [15:11:58] it's fine [15:12:03] but not what we need tomorrow ;-) [15:12:04] aha, okay [15:12:09] will do lvs1004 too then [15:12:25] no worries [15:12:56] New patchset: Faidon; "autoinstall: pick precise for lvs1004" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10269 [15:13:14] arghh [15:13:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10269 [15:13:21] amslvs preseeding doesn't work :( [15:13:32] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: host 91.198.174.244, sessions up: 2, down: 2, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS64600 not established - BR [15:13:52] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10269 [15:13:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10269 [15:14:19] has anyone talked to the fundraising folks yet about IPv6? 
[15:14:25] I haven't [15:15:12] is anything you're doing going to touch aluminum/grosley? [15:15:16] no [15:15:21] k [15:15:22] New patchset: Mark Bergsma; "Add IPv6 LVS service IPs to eqiad balancers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10271 [15:15:26] and I was thinking you wouldn't want AAAA rcords on payments either [15:15:29] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [15:15:29] thus, I allocated IPs [15:15:33] but didn't do anything with them [15:15:38] can easily add them whenever [15:15:38] k [15:15:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10271 [15:15:47] PROBLEM - SSH on amslvs3 is CRITICAL: Connection refused [15:15:54] makes sense [15:15:55] geoiplookup will stay ipv4 only also [15:15:57] so that should stay working [15:16:04] and we can make it v6 compatible too [15:16:06] (but not now) [15:16:19] i disabled ipv6 in payments because it was causing lag as described in happy eyeballs [15:16:23] PROBLEM - SSH on amslvs4 is CRITICAL: Connection refused [15:16:58] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10271 [15:17:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10271 [15:17:15] i'll check in with them today about all this--I'd imagine their stats may break when the proxy logs start containing ipv6 addresses [15:17:29] analytics is aware [15:17:39] and they have enough people working on it ;) [15:17:42] !log rebooting es2 for kernel + mysql upgrade [15:17:46] Logged the message, Master [15:18:17] New patchset: Faidon; "autoinstall: fix partman for amslvs*" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10272 [15:18:26] mark: yep, but I don't think they're taking over the payments analytics (yet) [15:18:39] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 
0 C: 2; - https://gerrit.wikimedia.org/r/10272 [15:18:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10272 [15:18:45] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10272 [15:19:00] I'm sure it's not problem for weds either way [15:19:11] and certainly not MY problem ;-p [15:19:14] paravoid: well, though my output doesn't look like it does in the example, I can ping ipv6 addresses via the tun6to4 interface [15:19:18] mark: lol yes [15:19:44] Ryan_Lane: i'll help out once I have LVS in shape [15:19:46] I also found a way to add it to the interface [15:19:49] are we going to be logging anonymous contribs with ipv6 addresses? I guess we will [15:19:50] perhaps paravoid gets to you before that ;) [15:20:03] wonder how that will affect dumps output :-D [15:20:12] yet another not-my-problem [15:20:15] ;-) [15:20:35] PROBLEM - Host es2 is DOWN: PING CRITICAL - Packet loss = 100% [15:20:44] nope not yours and mostly not mine, that's all stuff in core (exports.php) [15:20:52] but it will (perhaps) be fun to watch [15:20:54] http://pastebin.com/UN4wUzy7 [15:21:11] cmjohnson1: so, something is going on where I can't even pxe boot bellin. can you take a look at it? [15:21:46] (although i think I'd like to avoid reimaging if possible due to laziness) [15:22:03] remove /etc/udev.d/*persistent-net* [15:22:05] RECOVERY - Host es2 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [15:22:46] mark: who? [15:22:50] not you [15:22:58] lots of conversations in this channel ;) [15:23:00] peter, if he doesn't want to reimage but get back eth0 [15:25:17] !rebooting lvs1004 and reinstalling with precise [15:25:39] mark: oh! 
ok [15:25:41] PROBLEM - mysqld processes on es2 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:25:43] New patchset: Mark Bergsma; "Add new LVS services IPs to lvs1005" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10273 [15:26:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10273 [15:26:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10273 [15:26:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10273 [15:28:04] hmm [15:28:08] pybal restart still doesn't work [15:28:11] but pybal stop DOES work now [15:28:14] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.164 seconds [15:28:14] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 24, down: 2, shutdown: 1BRPeering with AS64600 not established - BRPeering with AS64600 not established - BR [15:28:15] so perhaps it's the new init script [15:28:16] hahaha [15:28:31] it was just broken on at least 3 levels :P [15:28:37] yep :) [15:28:46] but that's ok, i'll do more pybal releases soon [15:29:51] New patchset: Mark Bergsma; "Configure lvs1005 for IPv6 duty" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10275 [15:30:06] none of our interface function in puppet will work for this :( [15:30:12] *functions [15:30:14] no definitely not [15:30:14] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10275 [15:30:35] you could do it with a new augeas thing [15:30:40] yeah [15:30:47] but we're also thinking about making the entire file a template [15:30:51] ...but not now :P [15:30:55] :( [15:30:56] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [15:31:05] PROBLEM - SSH on lvs1004 is CRITICAL: Connection refused [15:31:12] at least, for augeas, you have lots of examples to take from ;) [15:31:18] -_- [15:31:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10275 [15:31:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10275 [15:31:41] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [15:31:41] funny thing is the tagged interface is the closest one [15:31:41] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [15:31:41] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [15:31:58] mark: that worked, thanks! [15:32:03] yw [15:32:44] RECOVERY - mysqld processes on es2 is OK: PROCS OK: 1 process with command name mysqld [15:32:53] RECOVERY - SSH on amslvs3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:33:07] cmjohnson1: I have run puppet on bellin [15:33:47] RECOVERY - SSH on amslvs4 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:34:11] so far the new pybal seems to be working well [15:34:18] I just fed it a hosts list with no AAAA entries [15:34:24] it's handling it gracefully [15:34:32] cmjohnson1: nah, just started it back up and the network is working properly now [15:35:33] oh, I'm in it.... sorry [15:35:33] mark: why can't I access amslvs3/4 from sockpuppet? [15:35:36] wnat me to get out? 
[15:35:39] seems like a firewall [15:35:45] ssh that is [15:36:33] New patchset: Pyoungmeister; "giving bellin the db-related classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10276 [15:36:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10276 [15:36:58] paravoid: it's a different ASN [15:37:04] and sockpuppet doesn't have internet access [15:37:07] it's an internal server [15:37:08] aah [15:37:15] yeah, you need to do something with that key ;) [15:37:20] hooft doesn't seem to have .ssh/new_install [15:37:21] we don't do esams installs terribly often, hehe [15:37:22] heh [15:37:31] should I just copy it to hooft's .ssh ? [15:37:35] I guess [15:38:00] time for more DNS changes [15:39:54] New review: Jeremyb; "Lydia says:" [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9874 [15:40:04] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10276 [15:40:06] me senses a clusterfuck [15:40:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10276 [15:40:20] ? [15:40:20] how would --ca_server sockpuppet will work with amslvs... [15:40:28] oh puppet and your ca model [15:40:31] haha [15:40:46] good point [15:41:02] should I just sign it from stafford? [15:41:17] if that works [15:41:20] I honestly have no idea [15:41:34] the weirdest things fail with puppet's CA [15:41:35] okay, will figure it out [15:41:44] RECOVERY - Puppet freshness on bellin is OK: puppet ran at Tue Jun 5 15:41:30 UTC 2012 [15:41:44] sorry about that [15:41:50] we're actually working on connectivity between the two networks [15:41:53] tunnels [15:41:57] they're disabled atm [15:42:00] but that'll help with these things [15:42:26] on the junipers? 
[15:42:29] RECOVERY - SSH on lvs1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:42:31] yep [15:42:38] just one esams router is still a foundry [15:42:46] so it's sort of waiting on the replacement mx80 [15:42:51] (which is there, but not connected yet) [15:45:23] root@brewster:/srv/autoinstall# ps aux |grep ngi [15:45:23] root 10225 0.0 0.0 7624 920 pts/2 S+ 15:45 0:00 grep ngi [15:45:27] root@brewster:/srv/autoinstall# ps aux |grep apa [15:45:29] root 10227 0.0 0.0 7624 920 pts/2 S+ 15:45 0:00 grep apa [15:45:32] root@brewster:/srv/autoinstall# ps aux |grep light [15:45:35] root 10231 0.0 0.0 7624 920 pts/2 S+ 15:45 0:00 grep light [15:45:38] heh... [15:45:40] third try [15:46:10] I like that PyBal's 1.00 release was completely broken ;-) [15:46:12] wasn't working for anything [15:46:36] 'stable at being broken' [15:47:13] mark: 05 15:43:32 < rguillebert_> do you have an update on https://bugzilla.wikimedia.org/show_bug.cgi?id=37089 ? [15:47:30] (just now elsewhere. he was redirected to #-tech) [15:47:39] hi [15:47:41] hah [15:47:54] do you have an update on https://bugzilla.wikimedia.org/show_bug.cgi?id=37089 ? [15:49:07] Only 8? Hell I've hit the api harder than that before. [15:49:24] rguillebert_: ah, was that you [15:49:29] i've now removed the null route [15:49:32] don't do that again :) [15:49:55] mark: i think there was also a second block? 
either squid or mediawiki config or something [15:49:58] apergos: ^ [15:50:12] squid [15:52:20] huh [15:52:48] probably I logged it [15:52:59] PROBLEM - Host lvs1004 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:53] RECOVERY - Host lvs1004 is UP: PING OK - Packet loss = 0%, RTA = 27.49 ms [15:54:12] pssshuh [15:54:13] meh don't see it [15:54:29] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 26, down: 0, shutdown: 1 [15:55:44] common-acls.conf , marked with 20120522 atg and reason [15:55:53] I need to go [15:56:01] I should be back on line later tonight [15:58:27] mark: lvs1004 is ready btw [15:58:41] awesome [16:05:35] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [16:08:09] New patchset: Mark Bergsma; "Configuring lvs1004 for IPv6 duty" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10279 [16:08:17] mark: so are amslvs3/4 (rebooted too) [16:08:26] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [16:08:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10279 [16:08:43] awesome [16:08:47] help ryan then ;) [16:08:51] so, I used the partman cp setup as was in puppet [16:08:59] but has a difference in contrast to how they were setup [16:09:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10279 [16:09:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10279 [16:09:18] I'm puppetizing the 6to4 interface right now [16:09:31] are you sure it's working then? 
[16:09:36] both have / on md; the old had swap on md (sda2/sdb2) too, while the new one has two swap devices, sda2/sdb2 [16:09:39] no [16:09:44] it would be nice if he could verify that :) [16:09:52] I *think* its working [16:10:05] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [16:10:07] You'll soon find out :D [16:10:13] no, I have a way to test it [16:10:14] sec. [16:11:26] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.123 seconds response time. www.wikipedia.org returns 208.80.154.225 [16:16:46] Ryan_Lane: we don't need a 2002:: address on capella iirc [16:17:14] why not? [16:17:33] New patchset: Mark Bergsma; "Configure IPv6 LVS service IPs on amslvs3 and 4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10280 [16:17:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10280 [16:18:08] what else would clients route through? [16:18:17] because we have a native IPv6 on that box? [16:18:38] New patchset: Mark Bergsma; "Configure IPv6 LVS service IPs on amslvs3 and 4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10280 [16:18:56] didn't the WMF ask for Teredo/6to4 to avoid relying on third parties? [16:19:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10280 [16:19:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10280 [16:19:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10280 [16:19:07] ask who? [16:19:21] that's what it says on Wikitech's page, however outdated that page is [16:19:34] paravoid: so the tun6to4 would use the ipaddr6_eth0? 
[16:21:27] if so, then that makes my puppet class a hell of a lot easier [16:21:32] wait [16:21:36] I'm trying to debug this [16:21:37] sec [16:21:37] as I won't need to generate the damn address via ruby [16:22:38] ryan regrets his offer of this morning ;) [16:22:44] I do [16:22:51] I have a method to do it in ruby [16:22:57] I'm now trying for a one liner, though [16:22:57] ipaddr? [16:23:14] can I abuse another box at pmtpa to do tests? [16:23:22] abuse how? [16:23:29] break the string into an array, convert each octet into hex, then append [16:23:30] add a 2002::/16 route to capella [16:23:41] i.e. break connectivity with 6to4 for a while [16:23:58] yeah [16:24:02] most boxes don't do anything with v6 [16:24:05] find one that has no services running [16:24:07] on v6 [16:24:23] needs to be in the same subnet eh [16:24:24] take a squid [16:24:30] already on a squid ;) [16:24:47] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [16:24:53] whaddidyado ;p [16:26:41] New patchset: Mark Bergsma; "Configure amslvs3 and amslvs4 for IPv6 duty" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10281 [16:27:05] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10281 [16:27:08] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10281 [16:27:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10281 [16:27:38] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.206 seconds [16:28:23] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.166 seconds [16:31:36] paravoid: you should resolve RT #3012 :) [16:36:38] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [16:36:59] New patchset: Mark Bergsma; "Allow recursive DNS queries from our IPv6 subnets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10283 [16:37:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10283 [16:37:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10283 [16:37:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10283 [16:44:39] !log (UTC) 23:42:14 !log re-enabled es4 monitoring. its currently our only es server without any tables marked as crashed / needing recovery, myisam recovery has been absent for all systems since the ms servers were migrated off of in nov 2011. (Sum of human knowledge * Rényi entropy = ES) [16:47:17] hey, at least there's one! [16:47:45] morebots quit right after that and I dind't see it in the SAL onwiki [16:48:10] of course it didn't respond to me just now either... 
[16:48:58] heh [16:50:04] mark: I was about to [16:50:11] mark: but you were quicker [16:50:37] :) [16:51:08] I'm looking at fcking 6to4 [16:52:02] well, I have an inline template that will compute the 2002: address, if needed [16:52:11] argh, I'm an idiot [16:52:43] !log Pooled ssl1001 [16:52:48] Logged the message, Master [16:52:56] I did ifdown eth0 on capella by mistake [16:52:59] heh :> [16:53:42] http://pastebin.com/1wB728J9 [16:53:53] probably a cleaner way :) [16:55:42] mark: do we do urpf by the way? [16:55:49] or some other kind of antispoofing? [16:55:53] mark: can I add lily to the decom list? any reason not to? [16:56:01] yes you can [16:56:08] it's scrap metal now [16:56:17] cool [16:56:17] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [16:57:11] New patchset: Pyoungmeister; "adding lily (long gone) to decom.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10284 [16:57:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10284 [16:57:45] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10284 [16:57:47] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.188 seconds [16:57:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10284 [16:58:56] paravoid: yes we do [16:59:01] (urpf) [16:59:26] Ryan_Lane: .split(',') not .split('.') ? [17:00:09] mark: hm, this won't work then [17:00:10] ah. yeah. my actual code has that correct [17:00:16] paravoid: can exempt it [17:00:23] the relay should be able to send ip src 192.88.99.1 [17:00:27] yeah [17:00:33] we have an ACL for exceptions [17:00:35] such as LVS [17:00:39] okay [17:00:51] do I have access to routers? 
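(note) The 2002: address discussed above is the 6to4 form: the host's public IPv4 address is written in hex directly after the 2002::/16 prefix, which is what "break the string into an array, convert each octet into hex, then append" amounts to. The following is a minimal Python sketch of that computation, not the Ruby inline template from the pastebin; the function name `sixtofour_prefix` is hypothetical and used only for illustration.

```python
def sixtofour_prefix(ipv4: str) -> str:
    """Return the 6to4 /48 prefix (2002:AABB:CCDD::/48) for a dotted-quad IPv4 address.

    Illustrative sketch only; the name and signature are assumptions,
    not taken from the puppet manifests discussed in the log.
    """
    # Split on '.', not ',' (the mix-up pointed out in the channel).
    parts = ipv4.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a dotted-quad IPv4 address: {ipv4!r}")
    octets = [int(p) for p in parts]
    if not all(0 <= o <= 255 for o in octets):
        raise ValueError(f"octet out of range in {ipv4!r}")
    # Pack the four octets into two 16-bit groups and render them as hex.
    high = (octets[0] << 8) | octets[1]
    low = (octets[2] << 8) | octets[3]
    return f"2002:{high:x}:{low:x}::/48"
```

For the anycast relay address 192.88.99.1 mentioned in the uRPF discussion, the octets are 0xc0, 0x58, 0x63 and 0x01, so the two hex groups are c058 and 6301.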
[17:01:11] if you have the enable pass [17:01:28] mail me your ssh key and i'll add it soon [17:01:42] i'll just update the ACL now [17:01:46] okay [17:01:50] thanks! [17:03:24] term 6to4 { [17:03:24] from { [17:03:24] source-address { [17:03:24] 192.88.99.1/32; [17:03:24] } [17:03:25] } [17:03:25] then accept; [17:03:26] }k [17:03:36] I guess I could restrict on ip proto as well [17:04:06] ipip, right? [17:04:25] !log (UTC) 23:42:14 !log re-enabled es4 monitoring. its currently our only es server without any tables marked as crashed / needing recovery, myisam recovery has been absent for all systems since the ms servers were migrated off of in nov 2011. (Sum of human knowledge * Rényi entropy = ES) [17:04:45] no [17:04:51] it's ipv6-in-ip, not sure how juniper calls that [17:05:00] hmmmmm, there's a char it doesn't like? [17:05:08] !log (UTC) 23:42:14 !log re-enabled es4 monitoring. its currently our only es server without any tables marked as crashed / needing recovery, myisam recovery has been absent for all systems since the ms servers were migrated off of in nov 2011. (Sum of human knowledge * Renyi entropy = ES) [17:05:13] Logged the message, Master [17:05:18] good [17:05:51] from { [17:05:51] source-address { [17:05:51] 192.88.99.1/32; [17:05:51] } [17:05:51] protocol ipv6; [17:05:52] } [17:05:52] then accept; [17:05:55] simply ipv6 [17:05:58] okay [17:06:10] is now live on the active vrrp router [17:06:10] applied I see :) [17:06:13] will do the others now [17:09:38] i am [17:09:48] !log Added uRPF exception for 6to4 traffic on all routers [17:09:53] Logged the message, Master [17:15:33] Ryan_Lane: look at /e/n/i now [17:15:36] that should be it [17:15:40] ok [17:15:42] plus, we need some entries in /proc [17:16:03] we need /proc/sys/net/ipv*/conf/all/forwarding set to 1 [17:16:11] and we need /proc/sys/net/ipv4/conf/eth0/accept_ra set to 2 [17:16:26] isn't that syctl? 
[17:16:29] yes [17:16:37] tsk tsk writing directly into proc ;) [17:18:03] okay, look again [17:18:23] are we going to puppetize this? [17:18:32] I'm doing so now [17:18:35] okay [17:18:41] we also need to puppetize quagga then :-) [17:18:51] let's do a static route now [17:19:01] there's tomorrow also [17:19:03] and the day after ;) [17:20:53] be our guest :) [17:21:05] unless you want to configure quagga without puppetizing it [17:21:09] should be straightforward [17:21:32] paravoid: let's add the sysctl stuff in the sysctl files [17:21:36] rather than in pre-up [17:21:49] for labs a while ago, we looked at it, and there's a nice puppet module for quagga somewhere [17:22:33] Ryan_Lane: whichever you prefer [17:22:45] btw, if you're wondering what accept_ra is [17:22:58] (2 didn't exist e.g. in lucid) [17:23:21] in ipv6 you have router advertisements, the router advertises itself on the network [17:23:28] and that's how we currently do default route [17:23:33] hm, how do I tell puppet that $IFACE isn't a variable? [17:23:38] but if you enable forwarding, it stops accepting them [17:23:40] \$IFACE [17:23:41] \$IFACE [17:23:50] ok. thought so. making sure [17:23:52] because then it's not a "node" but it's a "router" [17:23:54] stupid spec [17:24:12] so now there's "echo 2 >" which makes it be a router but also accept advertisements [17:24:40] this is needed on all DSL routers for example and linux didn't support this until very recently (since it's forbidden by the specs) [17:24:58] so router vendors did all kind of nasty tricks, like parsing RAs in userspace [17:25:20] iow, the people writing the IPv6 specs never thought of the "IPv6 in a DSL" scenario [17:25:55] Ryan_Lane: you're right about doing it in /etc/sysctl.d, since forwarding is not 6to4 per se [17:25:58] it's teredo too [17:26:17] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [17:27:09] should I just add this to 50-advanced-routing.conf.sysctl? 
[17:27:32] preferably not [17:27:36] ok [17:28:16] Ryan_Lane: if you're doing it in sysctl.conf, don't forget to also add "default" in addition to "all" [17:28:34] since by the time it runs, eth0 might not exist yet [17:28:50] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [17:29:08] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.297 seconds [17:30:10] accept_ra = 2? [17:30:16] it's set to 1 in interfaces [17:30:20] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27406 bytes in 0.106 seconds [17:31:14] yes [17:31:19] but we enable forwarding [17:31:29] so 1 + forwarding = (implicit) 0 [17:31:31] see explanation above [17:33:29] o.O [17:33:34] so, in sysctl, do I set 1 or 2? [17:34:20] 2 [17:34:26] * Ryan_Lane nods [17:36:09] hmm, something broke [17:36:21] New patchset: Ryan Lane; "Add 6to4relay role to capella" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10292 [17:36:27] ^^ may not be totally correct [17:36:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10292 [17:37:20] (nevermind) [17:38:25] ugh, augeas [17:39:33] heh [17:39:35] yeah [17:39:36] it sucks [17:39:37] make that >= 12.04, or else sysctl won't work and hence tun6to4 won't work [17:39:44] accept_ra=2 won't work [17:39:49] needs a new kernel [17:40:34] New review: Faidon; "role::ipv6relay is missing" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/10292 [17:41:21] would this result in a delay of IPv6 readiness? [17:41:28] Jasper_Deng: would what> [17:41:28] New patchset: Ryan Lane; "Add 6to4relay role to capella" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10292 [17:41:50] Ryan_Lane the need for a new kernel [17:41:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10292 [17:41:53] hmm! 
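(note) The accept_ra explanation above ("1 + forwarding = (implicit) 0", and 2 to keep accepting advertisements while routing) can be summarised as a small truth table. The sketch below is a Python model of that behaviour as described in the channel, not kernel code; the function name is hypothetical.

```python
def accepts_router_advertisements(accept_ra: int, forwarding: bool) -> bool:
    """Model whether an interface honours IPv6 router advertisements.

    Hypothetical helper that mirrors the semantics described in the log:
      0 -> never accept RAs
      1 -> accept RAs only while forwarding is disabled
      2 -> accept RAs even when forwarding is enabled
    """
    if accept_ra == 0:
        return False
    if accept_ra == 1:
        return not forwarding
    if accept_ra == 2:
        return True
    raise ValueError(f"unsupported accept_ra value: {accept_ra}")
```

Under this model a relay that forwards packets and still wants its default route from router advertisements needs the value 2, which is why the manifests are restricted to hosts whose kernel supports it.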
[17:41:56] as specified above [17:42:01] esams ipv6 is completely broken apparently [17:42:13] Jasper_Deng: no, we're using a new version. I just needed to add a proper check in puppet [17:42:20] New review: Faidon; "files/misc/50-advanced-routing-ipv6.conf.sysctl is missing :-)" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/10292 [17:42:23] hahaha [17:42:34] bah [17:42:36] Ryan_Lane: also, why "advanced routing"? [17:42:39] it's just "routing" [17:42:43] ask mark [17:42:46] it's plai' ol' forwarding [17:42:47] he named it [17:43:00] mark: what do you mean by broken...? [17:43:13] New patchset: Ryan Lane; "Add 6to4relay role to capella" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10292 [17:43:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10292 [17:43:42] I always forget I need to git add new files manually [17:44:35] nevermind [17:44:37] i am an idiot [17:44:49] Ryan_Lane: don't like commit -a? [17:45:10] even when using -a, you till need to add the files manually [17:45:15] if its a new file [17:45:22] hrmmm [17:45:54] i tend to `git status` before commit [17:46:23] and also you get `git status` commented for you in the commit msg. so you can abort if something's missing [17:46:24] New patchset: Mark Bergsma; "Add IPv6 addresses on all LVS balancers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10293 [17:46:29] but you do have to think about it [17:46:32] I don't like commit -a, I always git add [17:46:46] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10293 [17:46:48] and then stare at diffs before committing [17:46:53] me too [17:47:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10293 [17:47:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10293 [17:49:58] New review: Jeremyb; "403 fixed on the WMDE side." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9874 [17:52:39] Ryan_Lane: capella needs a proper IPv6 address, see what we did with the LVS servers [17:52:53] Ryan_Lane: we need that since we'll need to add static routes (or BGP peerings) [17:53:00] * Ryan_Lane nods [17:53:03] so, it's a very bad idea to do it with the autoconfigured IP [17:53:18] !log Redistributed statics in OSPF3 on csw1-esams [17:53:23] Logged the message, Master [17:53:35] interface_add_ip6_mapped ? [17:53:37] yes [17:53:50] what's this GRO stuff? [17:54:11] generic receive offloading, however I have no idea why it would be disabled on LVS servers [17:54:17] completely unrelated to IPv6, that's for sure [17:54:26] also, why don't we simply do this for every host? [17:54:56] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [17:54:56] because I want to sleep tonight [17:54:58] New patchset: Ryan Lane; "Add 6to4relay role to capella" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10292 [17:55:02] heh [17:55:03] haha [17:55:05] good reason [17:55:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10292 [17:55:29] * jeremyb just made an RT for wikidata.org DNS changes [17:55:32] wait there is totally a reason for it to be disabled [17:55:38] now if I can remember the details... 
[17:55:41] jeremyb: one already exists [17:56:02] (and yeah it's unrelated to ipv6) [17:56:24] Ryan_Lane: that's odd, i asked lydia if i should make one and she said yes [17:56:38] lydia doesn't really know [17:56:38] * Ryan_Lane shrugs [17:56:46] gro is broken with 802.1Q [17:56:51] * aude knew there was already one [17:56:52] ah [17:56:53] (as used by LVS) [17:57:04] which caused our MTU issues and slowness a while back [17:57:09] I sent a detailed mail about it back then [17:57:10] thanks [17:57:21] someday I will have a working memory (not really likely) [17:57:27] mark: so, what else can I do? [17:57:32] well, whatever. surely RT must have a dupe function [17:57:36] and/or what's left? [17:57:45] what's the status of 6to4 and miredo now? [17:57:49] sorry, wasn't paying close attention [17:58:03] they're working, Ryan is fixing the puppet manifests [17:58:12] we need the interface_add_ip6_mapped thing for it to get a proper IP [17:58:15] I pushed in my latest change [17:58:50] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [17:59:08] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.205 seconds [17:59:44] i'm gonna have dinner in a bit [17:59:50] and then afterwards, we should test mediawiki and ipv6 [18:00:11] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10292 [18:00:14] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10292 [18:00:20] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27406 bytes in 0.109 seconds [18:00:23] lemme know which statics I should set for 6to4/miredo [18:00:51] I think we can do bgp [18:00:57] there's also the matter of relays in other DCs... 
[18:02:01] I may be gone and without internet access soon ;( [18:02:15] mutante: I need your help, if you happen to be online [18:02:36] don't worry, I'll finish it [18:04:26] !log rebooting capella to make sure things work after a reboot [18:04:31] Logged the message, Master [18:07:44] would really be good to ensure the puppet stuff actually works too :) [18:08:57] I did [18:08:58] before the reboot [18:09:33] !log Redistributing static routes in OSPF on cr1-eqiad and cr2-eqiad [18:09:38] Logged the message, Master [18:11:14] root@capella:/proc/sys/net/ipv6/conf# cat eth0/accept_ra [18:11:15] 1 [18:11:23] gah, I forgot, "all" does not work for accept_ra [18:13:02] mark: static or bgp, whenever you're ready let me know [18:14:03] will do [18:15:22] New patchset: Faidon; "6to4: fix sysctl for accept_ra" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10295 [18:15:44] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10295 [18:15:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10295 [18:16:36] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10295 [18:17:47] Ryan_Lane: here? [18:17:50] or left already? [18:17:58] will you puppetize miredo or should I? [18:18:14] I probably can [18:18:20] does the configuration as is look sane? [18:18:23] yes [18:18:26] ok [18:18:48] afaik, haven't tested it [18:20:03] !log Replaced static LVS IPv6 routes with correct next-hops on cr1-eqiad and cr2-eqiad [18:20:04] there [18:20:07] lvs in eqiad now working [18:20:08] Logged the message, Master [18:20:41] i've been at this for 11 hours now, I need a little break [18:20:44] getting a headache ;) [18:22:13] me too [18:22:27] hmmm, interesting, firefox 13 enabled spdy by default [18:24:41] oh. 
sweet [18:24:42] New patchset: Ryan Lane; "Adding miredo to the ipv6relay role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10297 [18:25:00] hm. is spdy usable with stud? [18:25:05] nginx has plans to add it [18:25:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10297 [18:25:23] paravoid: ^^ [18:26:39] let's add yet another proxy for that [18:26:47] so we can proxy proxy proxy proxy [18:26:49] that's what we do [18:27:04] mark: well, it would just be a config in our current cluster ;) [18:27:28] that's why I was wondering if it would work if we used stud... [18:29:14] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10297 [18:29:16] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10297 [18:29:30] !log restart indexing on searchidex1001 [18:29:34] Logged the message, notpeter [18:30:22] Ryan_Lane: Spedye is meant to handle all TLS/SSL traffic for a website -- it is based upon the ideas in Bump's Stud, but extended to include converting SPDY connections into normal HTTP requests. [18:30:42] not sure if I should be happy or not [18:31:04] supports X-F-* though, unlike spdy [18:31:15] it has all kinds of protocol improvements [18:31:41] spdy, that is. [18:31:56] though it's likely that it'll get rolled into the newer version of http anyway [18:34:21] paravoid: ok, so what statics should I add? 
[18:34:33] i'm happy to do bgp soon, just not now [18:35:25] guess it's time to read up on spdy too [18:35:34] 2002::/16 -> 2620:0:860:1:208:80:152:115 [18:35:38] 2001::/32 -> 2620:0:860:1:208:80:152:115 [18:35:43] lemme also fix DNS [18:35:50] ok [18:36:13] i'll do preference 100, lower than bgp, higher than ospf [18:36:19] Ryan_Lane: http://trac.nginx.org/nginx/roadmap nginx 1.3 will support SPDY, to be released "end of May/start of June" [18:36:26] great [18:38:39] done [18:39:17] !log Added static routes 2002::/16 and 2001::/32 for 6to4 and teredo on the Tampa routers; these are redistributed in OSPF to eqiad [18:39:21] Logged the message, Master [18:40:26] ok, little break now [18:40:27] bbl [18:41:14] mark: hey [18:41:16] before you go [18:41:34] you were doing dns changes before but I see nothing on sockpuppet's pdns-templates [18:41:44] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [18:42:11] ah, nevermind [18:42:22] svn log needed an svn up to get the log lines [18:44:37] !log starting indexing on new searchidx2 [18:44:42] Logged the message, notpeter [18:45:26] !log starting innobackupex dump from blondel to bellin [18:45:31] Logged the message, notpeter [19:00:46] Ryan_Lane: have you actually tested that ruby ipv6 incantation? [19:01:53] yes [19:02:29] hrmmm. (/me is tweaking it to be shorter and maybe clearer) [19:05:14] we won't need it [19:05:56] indeed [19:06:38] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [19:07:26] ok, nvm then [19:09:11] RECOVERY - mysqld processes on bellin is OK: PROCS OK: 1 process with command name mysqld [19:33:38] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:34:49] paravoid: ok. I'm heading out. 
likely won't have internet anymore [19:34:56] okay [19:34:59] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [19:35:01] I found a bug on the miredo that I'm fixing [19:35:05] ah. cool [19:35:05] and I'll be around [19:35:09] * Ryan_Lane nods [19:35:14] btw, there's a hangout for IPv6 right now [19:35:16] with Vint Cerf [19:35:18] hopefully I'll have internet when I get home [19:35:20] and some other people [19:38:25] New patchset: Faidon; "Fix misc::miredo, didn't work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10301 [19:38:47] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10301 [19:38:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10301 [19:39:33] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10301 [19:41:45] New patchset: Faidon; "misc::miredo: fix a minor typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10302 [19:42:07] it ended but there's a recording available for anyone interested: https://www.youtube.com/watch?feature=player_embedded&v=lMcf6LxMgYI [19:42:09] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10302 [19:42:29] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10302 [19:42:31] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10302 [19:45:16] Logged the message, Master [19:45:20] Logged the message, Master [19:45:25] Logged the message, Master [19:45:29] Logged the message, Master [19:45:40] New patchset: Faidon; "misc::miredo init script does not have "status"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10303 [19:46:02] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10303 [19:46:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10303 [19:46:06] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10303 [19:47:44] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [19:50:25] mark: set up 6to4 and teredo at home and tested our relay; both work, woohooo! [19:50:43] mark: not with eqiad though, but neither is normal traffic (admin prohibited), so I guess something's pending from your side [20:03:29] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.107 seconds [20:19:23] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [20:32:08] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.113 seconds [20:53:49] back [20:53:53] paravoid: what is admin prohibited? 
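Note on the next-hop used for the static routes above: 2620:0:860:1:208:80:152:115 reads like an IPv4 address (208.80.152.115) written digit-for-digit into the low four groups of a /64. A minimal sketch of that mapping follows. The function name `mapped_ipv6` is hypothetical, and the idea that this is what the "interface_add_ip6_mapped thing" produces is an assumption inferred from the address alone, not something the log confirms.

```python
import ipaddress


def mapped_ipv6(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Write each decimal IPv4 octet as one 16-bit group under a /64 prefix.

    Assumption: mirrors the visible shape of 2620:0:860:1:208:80:152:115;
    it is not taken from any puppet manifest.
    """
    net = ipaddress.ip_network(prefix)
    if net.version != 6 or net.prefixlen > 64:
        raise ValueError("need an IPv6 prefix of /64 or shorter")

    # Validate the IPv4 address, then reuse its decimal digits as hex digits,
    # so the octet 208 becomes the group 0x0208 (printed as "208").
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    host = 0
    for octet in octets:
        host = (host << 16) | int(octet, 16)

    return net.network_address + host


print(mapped_ipv6("2620:0:860:1::/64", "208.80.152.115"))
```

The digits are reused as-is rather than converted to real hexadecimal, which keeps the IPv4 address readable inside the IPv6 one at the cost of a sparse address space.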
[20:54:26] hi [20:55:00] mark: try pinging one of the cp1xxx hosts over IPv6 [20:55:04] From 2001:668:0:3::8000:13b2 icmp_seq=2 Destination unreachable: Administratively prohibited [20:55:17] that's normal [20:55:20] they are internal [20:55:40] like 10.x hosts for v4 [20:56:55] Totally should have added b00b to all mw ipv6 addresses :D [20:58:15] ah [20:58:23] pmtpa ones are not though [20:58:30] those are not internal [20:58:35] ah hm, I was using sq51 which is exposed to the internet [20:58:35] future installs will be all internal [20:58:38] (squids/varnish) [20:58:47] ha hm [20:58:51] how come? [20:59:00] to save v4 ips [20:59:05] and there's no reason for them to be external [21:00:38] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [21:01:31] so, what's left? :) [21:01:40] not much [21:01:52] yes, I noticed :) [21:04:59] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.107 seconds [21:11:25] New patchset: Mark Bergsma; "Add mediawiki.org IPv6 LVS service IP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10351 [21:11:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10351 [21:11:55] \o/ [21:11:59] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10351 [21:12:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10351 [21:14:11] New patchset: Demon; "(bug 36852) Adding image/* mimetypes as "safe" so gerrit will show diffs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10352 [21:14:18] :o [21:14:33] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10352 [21:21:13] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6005 [21:21:22] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6005 [21:23:24] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10352 [21:23:47] ^demon: ^^ [21:23:53] failed to merge [21:24:03] <^demon> Ah, was afraid of that. [21:24:05] <^demon> Silly gerrit [21:24:07] <^demon> Will rebase. [21:26:45] mark: paravoid: heard of puppetdb? http://docs.puppetlabs.com/puppetdb/0.9/ (trying to fix storeconfigs perf / scaling issues) [21:26:59] yes [21:27:10] New patchset: Demon; "(bug 36852) Adding image/* mimetypes as "safe" so gerrit will show diffs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10352 [21:27:15] <^demon> Ryan_Lane: Rebased :) [21:27:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10352 [21:28:17] (for keys/etc. you mentioned earlier) [21:28:45] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10352 [21:28:48] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10352 [21:29:02] New patchset: Mark Bergsma; "Add IPv6 squid relay service IPs to both LVS classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10356 [21:29:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10356 [21:29:57] ^demon: you realize if I break gerrit, mark is going to kill me, right? [21:30:04] and then I'll kill you ;) [21:30:10] also, for the bare metal question there's a bunch of options but idk enough about them. 
quoting bgupta (dc11 orga): [21:30:14] > Puppet has collaborated with EMC to release a new provisioning tool called Razor. It's a pretty exciting time in provisioning tool development, with Dell's crowbar, PL/EMC's Razor, Ubuntu's MAAS (Metal as a Service), and or course Foreman and Cobbler. [21:30:24] <^demon> Ryan_Lane: Nothing should break :) [21:30:32] I've heard that before [21:30:43] <^demon> But I mean it For Real this time ;-) [21:30:45] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10356 [21:30:47] jeremyb: and I say meh to all of them [21:30:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10356 [21:30:52] I'm getting many "loss of session data messages" [21:30:58] Platonides: where? [21:31:00] in gerrit [21:31:05] Platonides: sounds like memcached? [21:31:06] in wikipedias [21:31:13] ah [21:31:19] error: The requested URL returned error: 503 while accessing https://gerrit.wikimedia.org/r/p/operations/puppet/info/refs [21:31:19] fatal: HTTP request failed [21:31:25] mark: sorry [21:31:25] the fact that I'm trying to edit 250 wikis at once doesn't help, but it shouldn't happen [21:31:29] <^demon> mark: Restarting gerrit, one moment. [21:31:30] it's restarting right now [21:31:36] awesome timing [21:31:41] back up [21:31:54] i'll be there during the next mediawiki release :P [21:31:59] heh [21:32:06] I did warn him ;) [21:32:12] how does that help me? [21:32:19] it doesn't [21:32:21] <^demon> At least gerrit came back up this time ;-) [21:32:21] you're fine now [21:32:33] you fetched in the like 15 seconds it was down [21:33:34] It's a shame it doesn't just work with the tools but feels the need to replace them all (seriously who writes a ssh server when there's already command options that you could wrap around). [21:34:50] you mean like paramiko, which we also use in the python gerrit hooks? 
[21:35:02] or like php-ssh2, like I'd use on labsconsole, if it didn't suck? [21:35:28] I *hate* using the system ssh [21:36:06] who's maintaining the hate list? [21:36:11] paramiko is pretty awesome, I hate automation over ssh but I was thinking more gerrit being all powerful and mighty java [21:36:28] <^demon> Ryan_Lane: I never tried php-ssh2, does it suck? [21:36:32] yes [21:36:43] <^demon> That...sucks :\ [21:36:47] jeremyb: you don't hate like a million things? [21:37:02] i do! but today's your day for a list [21:37:04] apparently [21:37:05] my hate list is almost surely higher than my love list [21:37:22] <^demon> We should really keep a list. [21:37:32] <^demon> [[mw:Things that are fucking awful]] [21:37:56] 05 13:39:45 < mark> puppet, drac5, installing systems [21:37:56] 05 13:40:02 < mark> there's room for more today [21:38:10] 05 13:41:32 < apergos> at least dataset1 is no longer on anyone's hate list [21:38:35] solaris is on it but not for long [21:38:42] drac should be on everyone's hate list [21:38:58] ah foundry is on it pretty much [21:39:30] how's the r710's? aren't those the ones marketed for linux types with less dell extra crap on them? [21:39:42] <^demon> apergos: We can still hate it even once it's gone :) [21:40:19] Thehelpfulone left before i could ping him... [21:40:46] jeremyb: Just don't mention r720's [21:40:53] *shudder driver issues* [21:40:59] we can but I tend to use my hate points for things actively deserving it [21:41:02] why? [21:41:03] drac6 is fine [21:41:06] drac5 is what I hate [21:41:14] +1 on that [21:41:19] Guest64573: ping [21:41:29] jeremyb: Try and get freebsd to work on an r720 and you will see why :D [21:42:19] Damianz: i'm asking specifically about a model the foundation was testing and then i think got more of. which was marketed that way.
idk how r720 was marketed [21:42:51] oooh, a different /quit msg [21:52:23] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [21:58:41] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:05:32] New patchset: Demon; "Setting "Project Creators" as the default group for new repositories" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10375 [22:05:35] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.107 seconds [22:05:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10375 [22:10:41] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [22:15:25] mark: mediawiki-lb.esams.wikimedia.org has the same IP (forward & reverse!) as text.esams.wikimedia.org [22:15:28] is that normal? [22:16:11] text.esams isn't used any more is it? [22:17:00] I think not [22:17:08] I *think* it was replaced by the service IPs [22:17:18] yes that is normal [22:17:24] text is no longer used [22:17:27] but i'm not removing any DNS until I hear back [22:17:32] should I remove the entries? [22:17:36] no [22:17:40] why would you [22:18:07] we have two PTRs for the same IP [22:18:10] it's not invalid [22:18:19] but it's not ideal either [22:18:22] well yeah [22:18:25] remove the reverse then [22:18:26] not the forward [22:18:32] is the forward still being used? [22:18:38] some people may still use it [22:20:00] okay [22:20:04] it's late [22:20:06] we should stop working [22:20:18] i.e. 
I should stop asking questions [22:20:28] thanks very much for all the patience today [22:20:55] thanks a lot for your help :) [22:20:58] see you tomorrow morning [22:21:15] yep :) [22:21:38] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [22:27:55] Ryan_Lane: why we setup 6to4/teredo: http://www.getipv6.info/index.php/Customer_problems_that_could_occur#Use_of_transitional_IPv6_connectivity [22:27:58] read and weep [22:28:26] I totally get that. I was asking why we were setting up both ;) [22:28:40] I know [22:28:47] just read the list of all the broken OS [22:28:50] or vendors [22:31:34] New patchset: Faidon; "role::ipv6relay: add system_role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10379 [22:31:38] yeah. that's slightly lame [22:31:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10379 [22:31:57] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10379 [22:32:00] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10379 [22:36:20] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27406 bytes in 0.135 seconds [23:10:59] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [23:23:17] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:32:44] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [23:36:38] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.185 seconds [23:37:05] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.107 seconds [23:50:26] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
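Note on the two prefixes routed to the relay earlier in the log (2002::/16 for 6to4, 2001::/32 for Teredo): both transition mechanisms embed the client's IPv4 address in the IPv6 address, so a source address can be classified by prefix alone. A minimal sketch using Python's standard `ipaddress` module; the helper names `classify` and `embedded_ipv4` are illustrative and not part of any tool mentioned in the channel.

```python
import ipaddress

# Prefixes named in the 18:39 !log entry.
SIXTOFOUR = ipaddress.ip_network("2002::/16")
TEREDO = ipaddress.ip_network("2001::/32")


def classify(address: str) -> str:
    """Return '6to4', 'teredo' or 'other' for an IPv6 address string."""
    addr = ipaddress.IPv6Address(address)
    if addr in SIXTOFOUR:
        return "6to4"
    if addr in TEREDO:
        return "teredo"
    return "other"


def embedded_ipv4(address: str):
    """Return the client IPv4 address hidden in a 6to4/Teredo address, else None."""
    addr = ipaddress.IPv6Address(address)
    if addr.sixtofour is not None:
        return addr.sixtofour      # bits 16..47 of a 2002::/16 address
    if addr.teredo is not None:
        return addr.teredo[1]      # (server, client); client is stored inverted
    return None


for sample in ("2002:c000:204::1",
               "2001:0:4136:e378:8000:63bf:3fff:fdd2",
               "2620:0:860:1:208:80:152:115"):
    print(sample, classify(sample), embedded_ipv4(sample))
```

The sample addresses use documentation ranges or the relay address quoted in the log; they are not real client traffic.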