[00:00:09] let's see, i'm gonna make some edit [00:00:59] Well, enwiki edits are pretty high frequency, right? :P [00:01:03] Can't join #en.wikipedia [00:01:33] greg-g, tanks for the heads up. [00:01:52] 208.80.152.178 [00:02:02] If it moved, surely its ip changed? [00:02:04] greg-g: Can I get an emergency SWAT addition for a JS error that I accidentally caused during SWAT? [00:02:13] Maybe? [00:02:18] DNS seems to be accurate [00:02:21] 178.152.80.208.in-addr.arpa name = ekrem.wikimedia.org. [00:02:31] ekrem.wikimedia.org has address 208.80.152.178 [00:02:53] Does the bot need restarting? [00:02:56] According to monitor 'IRC RecentChanges', the connect service on [00:02:58] 'irc.wikimedia.org' has been working again as specified since 2014-04-22 [00:02:59] 00:01:34. [00:03:02] RoanKattouw: nope, the channel hasn't been created although i made action on wiki [00:03:19] (03PS4) 10BBlack: Set bnx2x num_queues to match physical cores on lvs nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/127827 [00:03:21] /usr/local/bin/start-ircbot [00:03:24] Down since 2:57pm PDT, that's when the Nimsoft alert first came up [00:03:32] Reedy: Yeah except that's already running. I'll restasrt [00:03:36] Wikitech suggests it might not be started automagically [00:03:39] Ah [00:03:44] If IRC was down in the first place [00:03:47] Chances are it needs rebooting [00:04:05] ouch [00:04:26] Reedy: No I think I got it, hold on [00:05:03] How about now? [00:05:11] I guess I should fire up a second IRC client or something [00:05:21] Yup, fixed [00:05:28] Awesome! [00:05:36] might wanna !log it [00:05:52] !log Restarted ircecho on ekrem, IRC working again now [00:05:55] Or get an IRC client that does multi network ;) [00:05:58] Logged the message, Mr. Obvious [00:06:25] just posted a wmfsf email asking if disabling ipv6 for the office for an extended length of time would be a burden for anyone -- I figured it impacts Ops/Dev more than most.. Wanted to ask in here too... [00:06:29] Concerns? [00:08:35] nice, works now [00:08:39] thanks, RoanKattouw [00:09:25] what is actually the reason for not to have the channels persistent? [00:09:32] I don't know [00:09:41] I have no idea how any of this IRC stuff is set up [00:09:52] after the restart bots can't reconnect to the channel because it doesn't exist [00:10:01] Danny_B: that shouldn't be the reason [00:10:05] It's just that https://wikitech.wikimedia.org/wiki/IRC happens to be perfectly accurate re how to start the service [00:10:09] a normal IRCd will create a channel if it doesn't exist [00:10:11] So it was very easy to bring it back up [00:10:24] although, setting mode +P would kinda fulfill what you want [00:10:27] (opers only) [00:10:28] Jasper_Deng: wmf irc is custom patched ircd [00:10:41] I know it's some charybdis-derived thing [00:10:43] I'm not sure it matters that much [00:10:46] It's rarely restarted [00:10:47] ^ [00:10:49] the channels are created only by irc bot [00:10:57] And rc-pmtpa is 99.999% there [00:11:02] no other user can create it [00:11:36] RoanKattouw: thanks for the SWAT! [00:12:19] greg-g: ping... ? 
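The check above relies on forward and reverse DNS for irc.wikimedia.org agreeing that the service still lives on ekrem (208.80.152.178). A minimal sketch of that same consistency check in Python, assuming the records still resolve the way they did in the log:

    import socket

    ip = "208.80.152.178"                              # ekrem, from the log above
    name, _aliases, _addrs = socket.gethostbyaddr(ip)  # PTR: 178.152.80.208.in-addr.arpa
    forward = socket.gethostbyname(name)               # A record for the PTR name
    print(name, forward)                               # ekrem.wikimedia.org 208.80.152.178
    assert forward == ip, "forward and reverse DNS disagree"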
[00:15:35] !log torrus (on manutius) is down [00:15:41] Logged the message, Master [00:16:10] just a heads up the ircd-ratbox not being in puppet has an rt ticket that has made slow progress [00:16:10] https://rt.wikimedia.org/Ticket/Display.html?id=4784 [00:16:30] so it's a weird manual thing but at least ticketed, thanks for doing it btw [00:17:01] We really need a better way of doing RC feed publishing and kill it completely [00:18:54] !log catrope synchronized php-1.24wmf1/extensions/MobileFrontend/javascripts/modules/editor/VisualEditorOverlay.js 'Fix JS error on save' [00:18:55] I think that's the idea all around, but at least packaging up properly what we have for the time being [00:19:01] Logged the message, Master [00:19:04] heh [00:19:04] greg-g: I went ahead and did it [00:19:09] chasemp: Did you sort your gerrit issues? [00:20:11] Reedy: if you mean for old ircd-ratbox repo -- I believe I did thanks, I haven't gotten a moment to try to sort it out properly yet tho [00:20:23] readding the 'right thing' I mean [00:25:44] Reedy: last time i heard there is xmpp rc feeding completely ready, but the only thing is to set up the xmpp server [00:26:59] I thought Daniel Kinzler gave up on that like 6 years ago? [00:27:09] https://github.com/wikimedia/mediawiki-core/tree/master/includes/rcfeed [00:27:52] good topic for hackathon ;-) [00:29:49] Bikeshed on what we could possibly use [00:29:50] ? [00:30:12] RoanKattouw: hey, thanks, sorry, was eating dinner [02:13:19] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3392 MB (3% inode=99%): [02:20:19] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3676 MB (3% inode=99%): [02:29:36] !log LocalisationUpdate completed (1.23wmf22) at 2014-04-22 02:29:34+00:00 [02:30:20] Logged the message, Master [02:43:03] !log LocalisationUpdate completed (1.24wmf1) at 2014-04-22 02:43:01+00:00 [02:43:10] Logged the message, Master [03:00:19] RECOVERY - Disk space on virt0 is OK: DISK OK [03:32:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Apr 22 03:32:29 UTC 2014 (duration 32m 28s) [03:32:41] Logged the message, Master [03:50:01] oops [03:50:04] Dang [03:50:11] Lost the scrollback [03:50:19] this channel is logged [03:50:22] see the topic [03:50:26] hax [03:50:33] I did see it, Jasper_Deng [03:50:37] ... [03:59:45] werdna: I emailed you a few days about (maybe) changing your production login… any thoughts? [04:35:21] !log reactivate esams<->HE & eqiad<->HE peerings; issues are confirmed to be resolved [04:35:27] Logged the message, Master [04:39:48] paravoid: i imagine this involves grasping some orb that is filled with swirling plasma in both hands as bolts of lightning crack against the sky [04:40:03] lol what? [04:40:28] i don't know anything about peering except that it's important, so my imagination runs off [04:40:50] what would you like to know? [04:40:54] happy to explain :) [04:41:05] well, first of all, how you get the lightnings to appear on command like that [04:41:25] just kidding. i don't even know what to ask. i just thought what you said sounded cool :P [04:41:33] lol [04:42:29] so how does it actually work? [04:42:33] * ori googles it [04:43:01] there is a protocol called BGP [04:43:15] routers talk to each other using that protocol [04:43:59] can i pick it up on my computer? [04:44:10] you probably can't [04:44:12] networks are split into "autonomous systems", e.g. 
Wikimedia has two, one for Europe and another one for the US (although these days the lines have blurred a lot and you could consider it one) [04:44:15] BGP's at the core routers level ^ [04:44:31] we have two instead of one because trans-Atlantic transit is too expensive [04:44:34] each autonomous system has a number assigned to it (Wikimedia's US network is 14907, Europe is 43821) [04:44:55] (no it's not, but anyway) [04:45:54] Wikimedia also has some routes, e.g. 208.80.152.0/22 & 91.198.174.0/24 [04:46:04] #ko.wikinews @ irc.wikimedia.org says "no such channel" [04:46:15] and using BGP, we announce these routes as originated from our AS [04:46:26] Revi: known problem, scroll up [04:46:32] Ok [04:46:50] so we tell our peers (other AS), via BGP, that 208.80.152.0/22 is originated from AS 14907 [04:47:00] to give a perspective of how important BGP is, the Egyption government shutoff Internet access before the Arab spring for the entire country by turning off BGP announcements from Egypt's ISPs [04:47:31] some of them are just using this information for their internal networks (these we usually call "private peers", "IXP peers" or just "peers") [04:47:44] (this is really great by the way; i am rapt) [04:47:52] others provide us with connectivity to reach third-party networks [04:47:59] these are called "transits" [04:48:06] we usually pay for transit [04:48:28] to reach to be reached [04:48:29] yes, i remember this bit because you explained it to me when you were last in sf [04:48:42] *to reach to third-party networks and to be reached by third-party networks [04:49:35] www.google.com has address 74.125.228.83 [04:49:42] A Destination P Prf Metric 1 Metric 2 Next hop AS path [04:49:45] * 74.125.228.0/24 B 170 250 0 >208.80.154.194 15169 ? B 170 100 100 >209.48.42.49 2828 15169 ? B 170 100 0 >129.250.204.189 2914 15169 ? B 170 100 >130.244.6.242 1257 15169 ? [04:49:50] argh, whitespace breakage [04:49:53] A Destination P Prf Metric 1 Metric 2 Next hop AS path [04:49:56] * 74.125.228.0/24 B 170 250 0 >208.80.154.194 15169 ? [04:49:59] B 170 100 100 >209.48.42.49 2828 15169 ? [04:50:02] B 170 100 0 >129.250.204.189 2914 15169 ? [04:50:05] B 170 100 >130.244.6.242 1257 15169 ? [04:50:21] 15169 is Google's ASN (autonomous system number) [04:50:35] 2828, 2914 & 1257 are transits of ours [04:50:57] just like IPv4 addresses have been exhausted, 2-byte ASN's have been exhausted, so all new ASN's are 4-byte (32-bit) [04:51:12] we peer directly with google, and those three transits also peer directly with google [04:51:29] A network that does not have to rely on /any/ transit is a /tier 1 network/ [04:51:55] (well, that is, no transit to reach other tier 1 nets) [04:54:47] what command is that the output of? [04:54:56] show route 74.125.228.83 terse [04:55:00] from the juniper [04:55:05] one of them anyway [04:57:17] 'some of them are just using this information for their internal networks' [04:57:26] why do we peer with them? is it reciprocal somehow? [04:57:50] well, kind of [04:58:15] first of all, yes, it is usually reciprocal [04:58:43] i.e. when we peer, we usually reach them directly and they usually reach us directly as well [04:59:00] but the answer to your "why do we peer with them" question is [04:59:23] if we don't peer directly with those networks, we'd use transit to reach them (and they'd use transit to reach us) [04:59:40] this is (usually) more expensive for both parties [05:00:01] in bits flying over optics terms, what's the difference? 
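The "show route 74.125.228.83 terse" paste above lists one candidate path per line: the prefix (on the first entry only), a next hop, and the AS path ending at Google's ASN 15169. A rough sketch of pulling those fields out of lines shaped like that sample; the field layout is inferred from the paste alone, not from Juniper documentation, so treat it as illustrative only:

    # Parse the terse routing-table lines quoted above into
    # (prefix, next_hop, as_path); continuation lines omit the prefix.
    sample = [
        "* 74.125.228.0/24 B 170 250 0 >208.80.154.194 15169 ?",
        "B 170 100 100 >209.48.42.49 2828 15169 ?",
        "B 170 100 0 >129.250.204.189 2914 15169 ?",
        "B 170 100 >130.244.6.242 1257 15169 ?",
    ]
    prefix = None
    for line in sample:
        tokens = line.lstrip("* ").split()
        if "/" in tokens[0]:                    # only the first entry names the prefix
            prefix = tokens.pop(0)
        hop = next(i for i, t in enumerate(tokens) if t.startswith(">"))
        next_hop = tokens[hop].lstrip(">")
        as_path = tokens[hop + 1:-1]            # drop the trailing origin flag ('?')
        print(prefix, next_hop, as_path)        # e.g. 74.125.228.0/24 208.80.154.194 ['15169']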
[05:00:14] plus, peering directly means that you don't get random networks in between [05:00:15] ie: there's only so many cables going out of the DC [05:00:57] which can reduce latency and help with troubleshooting if there is an issue [05:01:19] greg-g: there are A LOT of cables going out of the DCs we're in [05:01:28] in the order of thousands probably [05:01:32] huh [05:01:34] well then [05:01:44] plus, you can multiplex traffic in many layers on top of a fiber [05:02:08] some (like optical multiplexing) gurantee you bandwidth, others are best effort/can get congested [05:02:13] given the advantages of peering, why wouldn't internet2.edu refuse to peer with us? they share equinix w/ us [05:02:25] why would* [05:02:56] peering is a complicated game [05:03:29] stupid scarce resource politics [05:03:43] internet2 is AS 11537; http://as11537.peeringdb.com/ indicates that they have a "selective" peering policy and that they are not adding more peers [05:04:19] why not? what's the practical cost? [05:04:54] one aspect is the maintenance overhead; dealing with hundreds of peers needs more human resources than dealing with e.g. three [05:04:56] having to do what you did earlier -- i.e., respond to failures in their network [05:04:57] ah [05:05:00] in terms of emails exchanged, monitoring sessions etc. [05:05:05] right [05:05:12] a second cost is that you lower the leverage you have with your transits [05:05:34] since you lower the bandwidth you push to them and hence you're essentially a smaller customer [05:05:44] weird [05:05:51] the third, more important aspect for the overall game is [05:06:05] often networks ask you to *pay* them to peer with you [05:06:31] remember, if two networks don't peer directly, they both pay their respective transits to reach each other [05:07:25] if one is substantially larger than the other, then it pays *less* to their transit per mbps (gets a better discount) and it matters less for their overall budget of their network [05:08:01] (When you're done with this current thread, I'm still confused on the physical layer: are there other ASNs that we are physically unable to peer with? To peer with someone must you share an exchange 1 hop away or something like that?) [05:08:02] then that network can basically ask the other network to pay them to reach to them, e.g. half of what they pay their transit [05:08:25] sometimes you can call this extortion [05:08:47] * ori is still working out the math [05:09:45] right, i can see how it is somehow unsavory [05:09:51] imagine you have a small isp called OriSP, that buys transit out of Level3 (3356) [05:09:53] greg-g: it's easiest when you share the exchange [05:10:02] (as w/ internet2.edu, hence why I asked about them) [05:10:07] since you're small, you probably don't get a good deal and you pay, say $10 per mbps [05:10:10] otherwise you need to make it so or get transit [05:10:44] wikimedia buys transit out of NTT (2914), but is larger so it pays, say, $5 per mbps (no, that's not a real figure) [05:11:24] 2914 & 3356 have a deal and peer with each other (undisclosed terms, but they're both tier-1s, so presumably for free) [05:11:44] so the network path is wikimedia <-> Level3 <-> NTT <-> OriSP [05:11:46] er [05:11:54] wikimedia <-> NTT <-> Level3 <-> OriSP [05:11:58] I meant [05:12:41] (03PS1) 10Springle: Switch s7-analytics-slave to db1007 (eqiad lvm/dump slave). RT #7330. Though apparently already on 12th floor, db68 has fallen offline during pmtpa 10th/12th floor changes for some reason. 
[operations/dns] - 10https://gerrit.wikimedia.org/r/127869 [05:12:44] it's surprising to me that there is any aspect of this that is 'naively' reciprocal -- i.e. where two parties agree to provide the same service for one another without considering their respective sizes [05:12:59] say OriSP's customers are very fond of wikipedia and hence the wikipedia to/from OriSP traffic is 100mbps [05:13:23] (03CR) 10Springle: [C: 032] Switch s7-analytics-slave to db1007 (eqiad lvm/dump slave). RT #7330. Though apparently already on 12th floor, db68 has fallen offline durin [operations/dns] - 10https://gerrit.wikimedia.org/r/127869 (owner: 10Springle) [05:13:24] i would have figured that by now you'd always be paid or be paying, depending on the exact proportion in which you are larger or smaller [05:13:27] ori: AT&T and some networks do what's called "settlement-free" peering [05:13:33] this means that your isp pays $1000 per month to Level3 to reach wikipedia, which may be a substantial figure in your budget [05:13:34] but let paravoid continue [05:14:07] the wmf pays $500 per month to reach orisp, which is a) lower, b) petty cash considering our overall figures [05:14:25] !log db68 down. s1-analytics-slave cname to db1007 [05:14:31] Logged the message, Master [05:14:34] the evil wmf can say to orisp "gimme $500 per month and I'll peer with you directly" [05:15:08] that's a net win of $500 for orisp, and +$1000 to wmf's bank account per month [05:15:32] now imagine that at scale [05:16:10] (insanity) [05:16:25] and add to that the fact that some of these networks you might want to ask to peer with are also transit providers (vendors) that have an incentive for small players to not peer at all :) [05:17:45] i.e. in the above example, imagine if the aforementioned small OriSP went to NTT instead of the WMF and said "hello, we have 100mbps per month with each other, maybe we should peer!" [05:17:49] (good luck with that) [05:18:57] is it still possible to graduate from a dinky isp to a being a major player? i guess if you are attracting a lot of traffic [05:19:31] it's even more complicated than that :) [05:20:04] attracting a lot of traffic by itself isn't always good for your peering partnerships [05:20:24] in the sense that if your traffic isn't balanced (in/out), you're still paying [05:20:28] think Netflix [05:20:52] the whole Netflix/Comcast/AT&T/Level3 deal is a peering issue [05:21:03] http://blog.level3.com/global-connectivity/chicken-game-played-child-isps-internet/ [05:21:06] http://www.attpublicpolicy.com/consumers-2/who-should-pay-for-netflix/ [05:21:12] http://blog.netflix.com/2014/03/internet-tolls-and-case-for-strong-net.html [05:21:30] greg-g: oh yeah, in terms of physical layers [05:21:53] we either do a private peering, i.e. 
we ask the DC provider to physically connect a fiber between our equipment and another network's equipment [05:22:21] which obviously has some costs, both to the DC provider and for the cost of the port on our respective equipment [05:22:47] so we do this only for very large peers of ours/important peerings [05:22:51] (and transits) [05:23:12] for all the rest, we connect with one or two ports to the local internet exchange point (IXP) [05:23:33] which is essentially a large, shared medium (a switch) where lots of companies connect to [05:23:36] then peer with each other [05:23:42] gotcha [05:24:06] http://as14907.peeringdb.com/ [05:24:08] top-right corner [05:24:43] ams-ix is one of the largest in the world (if not the largest), see https://www.ams-ix.net/connected_parties [05:24:51] or well, https://www.ams-ix.net/ [05:25:27] wow [05:25:55] alright, I have to craah, I just didn't want to miss this [05:26:08] crash [05:26:10] :) [05:26:13] * ori waves [05:26:19] the European IXPs are usually non-profits [05:26:19] thanks and g'night! [05:26:20] well, i was not wrong about this being full of high drama [05:26:20] associations [05:26:34] in the US, it's 1-2 companies dominating the market [05:26:40] (like Equinix) [05:26:54] the market being a metropolitan area? [05:27:11] the market being ALL metropolitan areas :) [05:27:20] so equinix dominates the whole uS [05:27:45] has the largest IXP in Ashburn and New York, as well as the San Jose and Palo Alto ones [05:28:00] and all the largest in between (Dallas, Chicago etc.) [05:28:53] so, last year, the three largest non-profit member-run european IXPs (and also the three largest IXPs in the world) decided to invade the US :) [05:30:03] AMS-IX (Amsterdam), DE-CIX (Frankfurt) and LINX (London) decided to fork to AMS-IX New York, DE-CIX New York and LINX NoVa (North Virgina/Washington DC) [05:32:10] basically so they don't have to pay equinix to reach american internet traffic? [05:32:29] they are just operating an "island" [05:32:31] just a switch [05:33:13] and asking networks (whether it's content providers such as wikimedia and google, or ISPs) to connect via them and exchange traffic with each other [05:33:15] right, but attractive to peering partners [05:33:24] instead of/in addition to connecting to Equinix [05:34:12] do the transits actually punish customers for "disloyalty" like that by reducing the quality of service or increasing prices? 
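For concreteness, the hypothetical OriSP numbers from a little earlier work out like this; every figure is one of the invented ones from the conversation ($10 and $5 per mbps, 100 mbps of traffic, a $500 peering fee), not a real price:

    traffic_mbps = 100
    orisp_transit_price = 10    # $/mbps OriSP pays its transit (Level3 in the example)
    wmf_transit_price = 5       # $/mbps WMF pays its transit (NTT in the example)

    orisp_transit_cost = traffic_mbps * orisp_transit_price   # $1000/month to reach WMF
    wmf_transit_cost = traffic_mbps * wmf_transit_price       # $500/month to reach OriSP

    peering_fee = 500           # what "evil WMF" asks OriSP to pay for direct peering
    orisp_saving = orisp_transit_cost - peering_fee           # OriSP still ends up $500/month ahead
    wmf_swing = wmf_transit_cost + peering_fee                # $1000/month swing in WMF's favour
    print(orisp_saving, wmf_swing)                            # 500 1000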
[05:34:33] well [05:34:37] it's not about disloyalty [05:34:49] it's about how much you pay [05:34:58] it's different to approach a vendor for 100gbps and different to approach them for 10gbps [05:35:16] typically, the price per mbps decreases the larger your order becomes [05:35:19] economies of scale [05:35:43] if you're aggressively peering, you lower your transit bandwidth needs and your price per mbps for transit increases, yes [05:38:22] but i take back what i said earlier, i think i can see now how mutualism / reciprocity can be sophisticated [05:38:42] it's quite complicated [05:39:20] fwiw, wmf's policy is to never charge for peering (a so-called "open" peering policy) [05:39:31] and we generally don't pay for private peerings either [05:39:55] (but we pay our transits) [06:44:29] !log db77 - revoke puppet cert,salt key,rm from monitoring [06:44:36] Logged the message, Master [06:56:40] (03PS1) 10Chad: Actually configure $wmgBetaFeaturesWhitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127873 [06:57:49] (03CR) 10Chad: [C: 032] Actually configure $wmgBetaFeaturesWhitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127873 (owner: 10Chad) [06:58:26] !log demon synchronized wmf-config/InitialiseSettings.php 'Unbreak $wmgBetaFeaturesWhitelist' [06:58:32] Logged the message, Master [06:58:51] (03Merged) 10jenkins-bot: Actually configure $wmgBetaFeaturesWhitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127873 (owner: 10Chad) [07:04:46] (03CR) 10Chad: [C: 031] "Looks good. Feel free to merge whenever you're ready to babysit this live." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127793 (owner: 10Manybubbles) [07:15:06] (03PS1) 10Giuseppe Lavagetto: Decommission pdu checks in sdtpa. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127875 [07:17:00] (03CR) 10Giuseppe Lavagetto: [V: 032] Decommission pdu checks in sdtpa. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127875 (owner: 10Giuseppe Lavagetto) [07:17:32] (03CR) 10Giuseppe Lavagetto: [C: 032] Decommission pdu checks in sdtpa. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/127875 (owner: 10Giuseppe Lavagetto) [07:18:02] (03CR) 10Dzahn: "thx, those should remove the "UNKNOWNs" at" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127875 (owner: 10Giuseppe Lavagetto) [07:22:50] (03CR) 10Dzahn: "revoke puppet cert,salt key, rm from Icinga" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127262 (owner: 10coren) [07:23:48] mornining mutante [07:23:55] now back finally [07:24:32] ACKNOWLEDGEMENT - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn already removed from puppet, but #7110 says to contact Coren [07:33:54] matanya: welcome [08:27:18] (03PS4) 10Faidon Liambotis: dataset: fix module path [operations/puppet] - 10https://gerrit.wikimedia.org/r/119212 (owner: 10Matanya) [08:28:11] (03CR) 10Faidon Liambotis: [C: 032] dataset: fix module path [operations/puppet] - 10https://gerrit.wikimedia.org/r/119212 (owner: 10Matanya) [08:28:21] thanks paravoid [08:28:55] (03PS2) 10Faidon Liambotis: url-downloader: fix module path [operations/puppet] - 10https://gerrit.wikimedia.org/r/119245 (owner: 10Matanya) [08:29:06] (03CR) 10Faidon Liambotis: [C: 032 V: 032] url-downloader: fix module path [operations/puppet] - 10https://gerrit.wikimedia.org/r/119245 (owner: 10Matanya) [08:47:58] !log upgrading Bugzilla to 4.4.4 [08:48:05] Logged the message, Master [08:50:36] (03Abandoned) 10Aklapper: Upgrade custom code to Bugzilla 4.4.2 [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/113685 (owner: 10Aklapper) [08:53:10] (03Restored) 10Aklapper: Upgrade custom code to Bugzilla 4.4.2 [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/113685 (owner: 10Aklapper) [08:53:39] nice @ 'restored' feature [09:04:11] !log hooper - revoked puppet cert [09:04:17] Logged the message, Master [09:04:27] apergos: and that was that remnant from your report [09:05:21] yep saw, thanks! [09:06:38] matanya: see the very bottom of that Tampa etherpad.. [09:06:45] new stuff [09:16:02] mutante: mind running nmap on the entire subnet? this will make sure we know all machines that are still up [09:17:35] matanya: apergos just did a ping scan too [09:17:51] ping might not say the whole truth [09:17:56] not a ping scan. [09:18:07] I jut did pings of specific boxes [09:18:32] <_joe_> matanya: how come ping might not say the truth? [09:18:34] well, what i did was nmap with -sn [09:18:41] in the past, but not today yet [09:18:58] * _joe_ wonders if we have machines that do not respond to ping [09:19:02] _joe_: if ping/icmp is blocked [09:19:15] <_joe_> matanya: by whom? :) [09:19:33] there may be a couple [09:19:35] certainly not many [09:20:07] there are machines there dated to the fedora era no? :P how knows what was before puppet on hardy ... [09:21:05] no there aren't [09:21:22] apergos: all "es" _except_ 7 and 10? [09:22:16] apergos: above it says es4 should stay, so changing "es3-6" to "3,5,6" [09:27:21] (03PS1) 10Dzahn: decom: remove es1,2,3,5,6,8,9 from DHCP [operations/puppet] - 10https://gerrit.wikimedia.org/r/127886 [09:28:07] matanya: i'll scan [09:31:32] (03CR) 10Matanya: [C: 031] decom: remove es1,2,3,5,6,8,9 from DHCP [operations/puppet] - 10https://gerrit.wikimedia.org/r/127886 (owner: 10Dzahn) [09:39:46] apergos: what is the fate of stat1? spare? 
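On the point above that a plain ping sweep (nmap -sn) can miss machines that filter ICMP: a complementary check is to try a TCP connect to a port you expect to be open, SSH here. The host names below are placeholders, not a real inventory:

    import socket

    def tcp_alive(host, port=22, timeout=2.0):
        # A host that drops ICMP may still answer on TCP, so a connect()
        # to an expected-open port is a second opinion on "is it up".
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for host in ["host1.example.org", "host2.example.org"]:
        print(host, "answers on tcp/22" if tcp_alive(host) else "no answer on tcp/22")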
[09:48:29] (03PS1) 10ArielGlenn: toss pmtpa mw hosts from api dsh group [operations/puppet] - 10https://gerrit.wikimedia.org/r/127890 [09:49:14] 2 Warning: Maximum number of allowable file uploads has been exceeded in Unknown on line 0 [09:53:05] mutante: thanks, yes 4 should stay [09:53:29] matanya: I think all the pmtpa hosts that are not special or in warranty are going to recycling [09:53:51] (03CR) 10ArielGlenn: [C: 032] toss pmtpa mw hosts from api dsh group [operations/puppet] - 10https://gerrit.wikimedia.org/r/127890 (owner: 10ArielGlenn) [10:03:58] (03CR) 10Dzahn: [C: 032] decom: remove es1,2,3,5,6,8,9 from DHCP [operations/puppet] - 10https://gerrit.wikimedia.org/r/127886 (owner: 10Dzahn) [10:12:31] apergos: regarding stat1 found 6144 it says move to eqiad [10:12:59] yes, but what is meant by that is 'move the service to eqiad' [10:13:08] stat100x is tht host (I forget which number) [10:13:59] matanya: otto shut it down "stat1 is powered down and ready for unracking and wiping." [10:14:03] stat1003 [10:14:08] thanks [10:14:39] mutante: what do we do with mgmt dns entries for decomed hosts? [10:15:33] matanya: in pmtpa, remove them [10:15:48] but dont do the same twice [10:15:55] means ? [10:16:08] apergos: [10:16:55] means probably 3 of us were preparing similar changes locally [10:17:04] oh, sure [10:17:08] /me picks another thing from the list [10:17:55] what is count:2 [10:18:19] !log harmon - delete salt key [10:18:25] Logged the message, Master [10:18:27] matanya: number of hits in apergo's grep [10:20:21] apergos: "lvs" being in dsh, imho it's a false positive [10:20:29] lvs1/2 just match amslvs1/2 [10:20:35] any clue what is tesla host ? not found in RT [10:20:41] matanya: pre-labs labs [10:20:59] should go? stay? [10:21:03] (03PS1) 10ArielGlenn: remove mgmt ips for more tampa boxes [operations/dns] - 10https://gerrit.wikimedia.org/r/127894 [10:21:05] (03PS1) 10Alexandros Kosiaris: Add IPv6 address to carbon [operations/puppet] - 10https://gerrit.wikimedia.org/r/127895 [10:21:31] that's not all of them [10:21:44] matanya: 3801, 2510, 674, 132 ...et al [10:22:01] if someone wants to eyebal that and +1 if there are no issues, I'll push that and leave the rest for the other two commit-preparing folks :-) [10:22:24] matanya: maybe 3801 needs reopen [10:22:34] * matanya thinks the same [10:23:26] matanya: the count"n stuff is when there are multiple files or entries, it will make the report format giant if I list them all [10:23:44] count:2 for dns usually means there's a prod ip and a mgmt ip (for example) [10:23:50] ok, i'll go over this [10:24:40] (03CR) 10Matanya: [C: 031] remove mgmt ips for more tampa boxes [operations/dns] - 10https://gerrit.wikimedia.org/r/127894 (owner: 10ArielGlenn) [10:24:56] okey okey [10:24:57] *dokey [10:25:18] (03CR) 10Dzahn: [C: 031] "yes, all of this has already had it's primary IPs removed and, in pmtpa, mgmt should also be removed now" [operations/dns] - 10https://gerrit.wikimedia.org/r/127894 (owner: 10ArielGlenn) [10:25:49] !log upgraded php-luasandbox to 1.9-1 on test.wikimedia.org [10:25:55] Logged the message, Master [10:26:57] (03CR) 10ArielGlenn: [C: 032] remove mgmt ips for more tampa boxes [operations/dns] - 10https://gerrit.wikimedia.org/r/127894 (owner: 10ArielGlenn) [10:27:05] (03PS3) 10Hashar: puppet-lint role/nova.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/127147 [10:28:13] (03CR) 10Hashar: "PS3 address most issues. I have left the selector inside a selector since I am never how bad it is going to interact. 
Also kept some arr" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127147 (owner: 10Hashar) [10:34:09] i want a PS4 [10:34:45] what's the selling point(s) that gotcha? [10:35:09] is that a reply to hashar saying "PS3 address most issues" [10:35:35] !log upgraded php-luasandbox to 1.9.1 on beta (deployment-apache0{1,2}) [10:35:41] Logged the message, Master [10:38:30] PS4>XBONE! [10:40:54] (03PS1) 10Dzahn: remove even more Tampa mgmt IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/127896 [10:44:10] cleanup \o/ [10:46:12] it's getting cold in Tampa :) Sensor A [10:46:13] is under threshold: 33C [10:52:00] (03PS2) 10Dzahn: remove even more Tampa mgmt IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/127896 [10:58:00] (03CR) 10Steinsplitter: [C: 031] "now okay, pleas merge." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127479 (owner: 10Gerrit Patch Uploader) [11:00:35] you call that cold? [11:02:25] heh, no:) just some sensor in pmtpa thinks so [11:03:52] it was a bit confusing that power consumption appeared to go up a little bit after we shut things down [11:04:11] right, Reedy [11:06:30] (03CR) 10ArielGlenn: [C: 031] remove even more Tampa mgmt IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/127896 (owner: 10Dzahn) [11:07:49] (03CR) 10Dzahn: [C: 032] remove even more Tampa mgmt IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/127896 (owner: 10Dzahn) [11:09:24] (03PS1) 10Aklapper: Port upstream Bugzilla 4.4.4 changes to our modifications [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/127898 [11:14:57] (03Abandoned) 10Hoo man: Increase apache MaxClients to 23 in order to have 40 more scaling slots [operations/puppet] - 10https://gerrit.wikimedia.org/r/127632 (owner: 10Hoo man) [11:23:17] andre__: did you see valhalla replacing wikibugs with pywikibugs? [11:23:30] mutante, uh, no [11:23:35] https://github.com/valhallasw/pywikibugs [11:23:46] 2. Bugzilla sends an e-mail to wikibugs-l@lists.wm.o [11:23:46] 3. Tools mail server receives the e-mail. .forward pipes it to toredis.py [11:23:49] 4. 
toredis.py sends the e-mail to Redis ('PUBLISH') [11:24:04] very nice, compared to old wikibugs, and not running on mail server [11:24:32] nice [11:25:05] 03:58 < mutante> valhallasw: thanks for working on that, maybe you can kill some of these after testing https://bugzilla.wikimedia.org/buglist.cgi?component=wikibugs%20IRC%20bot&list_id=308502&product=Wikimedia&resolution=--- [11:25:16] yay [11:26:57] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 960.0 [11:26:57] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 860.133362 [11:28:57] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [11:28:57] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:02:41] (03PS1) 10Faidon Liambotis: squid.php: add cp3013/cp3014 IPv6 addresses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127902 [12:03:11] (03CR) 10Faidon Liambotis: [C: 032] squid.php: add cp3013/cp3014 IPv6 addresses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127902 (owner: 10Faidon Liambotis) [12:03:18] (03Merged) 10jenkins-bot: squid.php: add cp3013/cp3014 IPv6 addresses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127902 (owner: 10Faidon Liambotis) [12:04:26] !log faidon updated /a/common to {{Gerrit|If8f39abee}}: squid.php: add cp3013/cp3014 IPv6 addresses [12:04:33] Logged the message, Master [12:04:54] !log faidon synchronized wmf-config/squid.php 'add cp3013/cp3014 IPv6 addresses' [12:05:01] Logged the message, Master [12:31:24] (03CR) 10Faidon Liambotis: [C: 031] "Oh! I hadn't realized this is tunable. This is great :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127827 (owner: 10BBlack) [12:33:30] (03CR) 10Mark Bergsma: [C: 031] "Awesome." [operations/puppet] - 10https://gerrit.wikimedia.org/r/127827 (owner: 10BBlack) [12:42:11] paravoid: can you explain how that squid.php update lets cp301[34] reach the eqiad varnish backends on :3128? :) [12:42:21] it still doesn't make sense to me, although I can see it had the right effect [12:42:25] it doesn't [12:42:35] oh, what is it then? [12:42:42] did you see my response to the RT? [12:42:58] ah, reading now :) [12:42:59] I have an excerpt on how I worked around the issue [12:46:15] so what does squid.php do? [12:46:46] that's just purge stuff, as it seems to say? [12:47:02] no [12:47:16] it's what mediawiki uses to know which XFF values to trust [12:47:37] so that e.g. anonymous edits appear from the user's IP, not a random varnish's [12:47:51] oh, just to get the client source IP correct when forwarding from cp301[34] -> eqiad [12:48:20] yep [12:48:22] and the ipv6 address reordering fixed the router ACL issue separately, ok [12:48:28] yes [12:48:42] the squid.php change was unrelated, I just noticed that it wasn't added and fixed it for you :) [12:49:07] thanks :) [12:53:14] cool stuff with the RPS work [12:53:50] I vaguelly recall some optimizations with NUMA as well [12:59:11] (03CR) 10BBlack: [C: 04-2] "This is being actively worked on, and there's an RT ticket here: https://gerrit.wikimedia.org/r/#/c/127456 . 
I expect to add them as prop" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127456 (owner: 10QChris) [13:00:41] paravoid: well, we can try to do more algorithms over sysfs output to tune queues to where their hardware IRQs land via smp_affinity for RPS. Or we could jump right past that and do the actual hardware RSS stuff instead. Both are a bit complicated, though. [13:01:04] okay [13:01:23] happy with whatever you decide :) [13:01:39] I'm kind of inclined to just go forward with what we have now (my small updates to your script + bnx2x num_queues thingy) and then look at that later. [13:01:55] if you want to experiment, amslvs were quite saturated last time around (although we offloaded them a bit with leslie by loadbalancing to both lvs in a pair) [13:02:18] and eqiad's lvs were also frequently saturating their cpu [13:02:38] I have applied some fixes there manually [13:02:42] well with the hardware IRQs and/or RSS thing, bnx2 vs bnx2x are probably different experimental results, too [13:03:17] what were the manual fixes on eqiad? other stuff I should puppet into the ulsfo/esams setups? [13:03:41] it's been months, but iirc it was the RPS script that is in gerrit [13:03:46] ok [13:04:50] we've offloaded eqiad quite a bit since [13:04:59] back on cp301[34]: so the ip6 interface work made them use the "real" global ip6 addr to get through the router ACL, but the squid.php change seems to use the macaddr-based global addr for cp30[1234]? Is that correct? [13:05:11] no, the opposite [13:05:19] the ip6 interface work made them use the autoconfig address [13:05:21] since it came last [13:05:30] and the filter blocks the non-autoconfig address [13:05:32] (confusingly enough) [13:05:56] oh it's a block! [13:06:02] (03PS1) 10Matanya: purge_slow_digest: adding cron to terbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/127909 [13:06:02] I deleted the autoconf address, and then it got autoconf'ed again [13:06:13] it sucks, we should fix this properly [13:06:25] and yet, I can't reach the v6 :3128 from random external ipv6 stuff - so I assumed we had a whitelist somewhere [13:06:58] maybe we do, separately [13:07:02] that's a different ACL [13:07:04] ok [13:07:16] got it [13:07:20] for hosts in internal zones [13:07:27] (i.e. 10/8 for ipv4) [13:07:29] so that block list is just to make sure only the addrs that are defined in squid.php make it through, in case of misconfig [13:07:31] we block the equivalent ipv6 [13:07:45] just to make sure that noone assumes they're globally reachable outside our network [13:07:51] yes, that's my guess for that block list [13:09:39] paravoid: do we need to change the memcaches for ms-fe in pmtpa? [13:09:49] ?? [13:10:42] in swift.pp : class proxy inherits role::swift::pmtpa-prod memcached_servers => [ "ms-fe1.pmtpa.wmnet:11211", "ms-fe2.pmtpa.wmnet:11211", "ms-fe3.pmtpa.wmnet:11211", "ms-fe4.pmtpa.wmnet:11211" ], [13:10:52] what about it? [13:10:59] ignored that because it's a class just for pmtpa [13:11:01] can this class go away? or the entire pmtpa should go? [13:11:33] the entire pmtpa-prod role/subroles should go away, yes [13:11:56] i'll push this, thanks. what about wipe/recycle etc? [13:12:08] this is being handled by chris and rob [13:12:44] so no need for tickets. ok, cool, thanks. 
same question for ganglia paravoid [13:13:03] i guess 'Swift pmtpa' => 'ms-fe1.pmtpa.wmnet ms-fe2.pmtpa.wmnet', can go away [13:13:42] paravoid: so, sysctl -w net.ipv6.conf.eth1.autoconf=0 would kill the automatic global v6 addr (but leave the RA-based fe80 link-local one). We could add that to base.pp, but we'd also have to look around for everywhere the auto addrs are used in configs and update those [13:13:45] matanya: yes [13:13:55] doing, thanks [13:14:13] bblack: I'd say in the mapped definition, not base.pp [13:14:17] s/base.pp/wherever we do our base-level sysctls/ [13:14:54] I mean inside interface::add_ip6_mapped instead [13:15:05] so that hosts that don't have that defined, still get a random ipv6 [13:15:15] oh, good point [13:15:30] but the ones that we manually set addresses to, remove their autoconf one [13:17:16] that would be cool [13:17:57] any objections to shutting down tampa netapps now for move? [13:18:18] I have no dependencies on them. [13:18:39] pybal could get interesting [13:18:55] noone moved that yet I think? [13:19:14] hoo: Regarding Gerrit change 127479, I see no link to any bug or discussion where Commons has decided to link all languages to [[foundation:Privacy policy]] rather than the language-specific versions they are apparently using now? [13:19:22] mark: no, don't think it's moved [13:19:35] anomie: All pages have been created with the same content atm [13:19:41] nice [13:19:49] not sure i'm comfortable with moving the netapps now then [13:20:14] hoo: I don't see that at https://commons.wikimedia.org/wiki/Special:PrefixIndex/MediaWiki:Privacypage [13:20:25] mark: chris moved fenari yesterday though [13:20:35] and nothing bad happened afaik [13:20:43] yeah [13:20:46] but http vs nfs :) [13:20:47] hoo: For example, https://commons.wikimedia.org/wiki/MediaWiki:Privacypage/cs isn't the same [13:21:03] anomie: I can only repeat what I've been told [13:21:21] oh, did it really move? apergos [13:21:23] mark: /etc/init.d/apache2 stop first then? :) [13:21:46] if the netapps don't come up, we'll have to setup a new location [13:21:51] let me copy the pybal config files first [13:21:57] yes fenari moved [13:22:05] hoo: And most languages aren't even overridden from the defaults, so they get what's in the i18n files. [13:22:11] copying to my home dir in iron [13:22:15] there's a mirror in eqiad, but still [13:22:18] worst case, we can mount from eqiad [13:22:19] yeah [13:22:20] (03PS3) 10Matanya: swift: remove swift role from tampa [operations/puppet] - 10https://gerrit.wikimedia.org/r/127247 [13:22:50] well this could be... 
interesting :-) [13:23:49] (03CR) 10Faidon Liambotis: [C: 032] swift: remove swift role from tampa [operations/puppet] - 10https://gerrit.wikimedia.org/r/127247 (owner: 10Matanya) [13:24:00] matanya: thx [13:24:06] :) [13:25:34] !log second pass of swiftrepl eqiad->esams [13:25:38] Logged the message, Master [13:25:41] (03PS1) 10BBlack: Disable IPv6 global autoconf if explicit addr config for interface [operations/puppet] - 10https://gerrit.wikimedia.org/r/128089 [13:25:55] (03CR) 10BBlack: [C: 04-1] Disable IPv6 global autoconf if explicit addr config for interface [operations/puppet] - 10https://gerrit.wikimedia.org/r/128089 (owner: 10BBlack) [13:26:01] !log Disabled puppet and apache on fenari [13:26:07] Logged the message, Master [13:26:33] yeah pybal is complaining but it's fine now [13:26:46] (03CR) 10jenkins-bot: [V: 04-1] Disable IPv6 global autoconf if explicit addr config for interface [operations/puppet] - 10https://gerrit.wikimedia.org/r/128089 (owner: 10BBlack) [13:27:43] ok, gonna shutdown the netapp now [13:27:49] fun [13:27:49] (03PS2) 10BBlack: Disable IPv6 global autoconf if explicit addr config for interface [operations/puppet] - 10https://gerrit.wikimedia.org/r/128089 [13:28:03] (03CR) 10BBlack: [C: 04-1] Disable IPv6 global autoconf if explicit addr config for interface [operations/puppet] - 10https://gerrit.wikimedia.org/r/128089 (owner: 10BBlack) [13:28:07] PROBLEM - HTTP on fenari is CRITICAL: Connection refused [13:28:31] (03PS2) 10Hashar: contint: extract android SDK dependencies to a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/126000 [13:30:13] (03PS3) 10BBlack: Disable IPv6 global autoconf if explicit addr config for interface [operations/puppet] - 10https://gerrit.wikimedia.org/r/128089 [13:30:29] (03CR) 10BBlack: [C: 04-1] Disable IPv6 global autoconf if explicit addr config for interface [operations/puppet] - 10https://gerrit.wikimedia.org/r/128089 (owner: 10BBlack) [13:30:48] hashar: merge? https://gerrit.wikimedia.org/r/#/c/126000/ :) [13:30:52] ok, both controllers shutting down [13:33:34] mutante: regarding https://gerrit.wikimedia.org/r/#/c/117674/4 most of the subnets are not in network::constants, can you point me to what you were referring to? [13:34:46] matanya: i was basically asking if they are in there already and it can be used, if they are not, maybe they should be added [13:35:35] sadly i'm not familiarized with wmf subnets (internal ones for sure) [13:36:17] paravoid: re the CirrusSearch cronspam, I will look into it (been meaning to do that anyway), but as for nfs1 and nfs2 [13:36:23] they are tampa boxes, yes? [13:36:25] are they going away? 
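Back on the autoconf thread just above (and the "Disable IPv6 global autoconf" patch): the knob bblack mentions is an ordinary sysctl, so its effect can be sketched as below. eth1 is only an example interface, and writing the file needs root:

    # Equivalent of "sysctl -w net.ipv6.conf.eth1.autoconf=0": stop the
    # interface from forming a SLAAC global address from router
    # advertisements; the fe80:: link-local address is unaffected.
    from pathlib import Path

    def disable_ipv6_autoconf(iface):
        knob = Path("/proc/sys/net/ipv6/conf") / iface / "autoconf"
        print(iface, "autoconf was", knob.read_text().strip())
        knob.write_text("0")

    disable_ipv6_autoconf("eth1")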
[13:36:38] I'm not entirely sure why mw-udp2log is including on those nodes [13:36:42] I think they are, I'm not entirely sure [13:37:10] (03CR) 10Hashar: "The core file are generated under the job workspace, for the mediawiki unit test job that is under /tests/phpunit/ I guess that is enough" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119225 (owner: 10Faidon Liambotis) [13:37:12] (03PS1) 10BBlack: Add cp3013 to esams mobile cache backends [operations/puppet] - 10https://gerrit.wikimedia.org/r/128228 [13:37:19] matanya: i think you don't have to, because there is $INTERNAL [13:37:54] (03CR) 10Hashar: [C: 04-1] remove syslog service IP (Tampa) [operations/dns] - 10https://gerrit.wikimedia.org/r/125952 (owner: 10Dzahn) [13:37:55] matanya: see how the ferm rule for nrpe does it [13:38:05] bblack: you rock :) [13:38:05] rule => 'proto tcp dport 5666 { saddr $INTERNAL ACCEPT; }' [13:38:54] btw, I hadn't realized you'd reinstall, if I did I would have suggested also moving those servers to private1-esams :) [13:39:06] matanya: that comes from modules/base/templates/firewall/defs.erb [13:39:15] mutante: that is much wider than those allowed by nsca [13:39:26] $INTERNAL_V4 = (10.0.0.0/8) [13:39:33] while nsca allowed is /14 [13:39:43] paravoid: I could hold the above commit and pull them from pybal and do so. It would probably be another day of delay while figuring out misc broken puppet things related to the network change. [13:39:48] yes/no? [13:40:25] cmjohnson1: so this morning I'm going to sync something for you that disables "collection". then I'm going to revert it when you tell me to? [13:41:22] mutante: nfs1 and nfs2 are going away, yes? [13:41:46] i think there are some mw-udp2log instances there, and I think they should go away too, since that instance lives on fluorine already [13:42:05] bblack: your call [13:42:50] bblack: fwiw, I moved the ms-fe/ms-be 30xx boxes to private the other day, fixed a bunch of other broken puppet things :), but I think we should be done with that by now [13:43:19] I'd need to switch VLANs on their ports as well, right? or have someone with keys do it [13:43:28] yeah [13:43:31] I can do that [13:43:46] or I can give you access to [13:43:50] ok let's just do it while we can, at least it will validate things for moving other live varnishes later [13:43:51] how's your juniper foo? :) [13:43:58] old and weak, but I can figure it out [13:44:17] I should at least have access to look around so I don't have to bug you guys so often :P [13:44:42] I don't mind the bugging, I just don't want to stand in your way all the time :) [13:44:46] ottomata: i don't know, actually i'd love some updates on the nfs1/2 ticket [13:45:25] ottomata: #7295 , would be good if you can add something about the logging roles [13:45:35] (03CR) 10BBlack: [C: 04-1] "On hold pending move to private1-esams" [operations/puppet] - 10https://gerrit.wikimedia.org/r/128228 (owner: 10BBlack) [13:45:40] so basically mutante i'm not sure what would be the next step, unless someone can point me at it [13:46:08] shall I move the hosts to private1-esams now? [13:46:09] paravoid: ok add me and I'll pull them from pybal and start on that today [13:46:19] I need to get them out of pybal frontends first! [13:46:23] oh right [13:46:35] and you're blocked on the netapp migration [13:46:38] hehe [13:46:39] which may or may not be complicated by the fenari situation? 
I haven't been following that closely [13:46:43] yeah [13:46:50] no pybal for now [13:46:52] ok [13:47:09] matanya: either changing $INTERNAL, or adding another variable like that, or ignore my comment, heh [13:47:43] matanya: have a look at this http://pastebin.com/D7NunsQW [13:48:09] it might help you figure out the networks. It is the result of the ferm evaluations for network.pp [13:49:57] does it matter if we allow 10.0.0.0/8 or just 10.0.0.0/14 ? what else is the difference between existing $INTERNAL and what is listed separatedly in icinga [13:50:37] ottomata: also, https://gerrit.wikimedia.org/r/#/c/125952/ i did not know [13:51:18] (03CR) 10Hashar: [C: 04-1] "Tiny error, ZUUL_COMMIT is defined by Zuul but the job is run by Jenkins cron system." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/127399 (owner: 10BryanDavis) [13:54:47] PROBLEM - Host tarin is DOWN: PING CRITICAL - Packet loss = 100% [13:58:12] cmjohnson1: when do you want my to sync the change? [14:00:32] (03CR) 10Ottomata: [C: 032 V: 032] Fixing documentation on varnishkafka_ganglia.py [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/127674 (owner: 10Ottomata) [14:05:00] * manybubbles has the conch [14:05:11] (03CR) 10Manybubbles: [C: 032] Opt remaining wikis into Cirrus beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127363 (owner: 10Chad) [14:05:23] (03Merged) 10jenkins-bot: Opt remaining wikis into Cirrus beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127363 (owner: 10Chad) [14:08:09] manybubbles: anytime [14:08:21] cmjohnson1: k. now it is [14:08:46] (03CR) 10Manybubbles: [C: 032] Disable Collection for server move [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127658 (owner: 10Chad) [14:08:52] mutante: tarin will be back up shortly. [14:09:13] !log mchenry and sanger going down for server relocation [14:09:23] Logged the message, Master [14:09:23] cmjohnson1: gotcha, thx [14:10:21] (03Merged) 10jenkins-bot: Disable Collection for server move [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127658 (owner: 10Chad) [14:11:52] thanks chad [14:12:01] and nik [14:12:03] :) [14:12:12] (03CR) 10Chad: "Was this supposed to go live? I thought Greg/Erik had a window." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127658 (owner: 10Chad) [14:12:28] <^d> I guess so :) [14:12:57] PROBLEM - Host 208.80.152.132 is DOWN: PING CRITICAL - Packet loss = 100% [14:12:57] PROBLEM - Host mchenry is DOWN: PING CRITICAL - Packet loss = 100% [14:13:17] PROBLEM - Host sanger is DOWN: PING CRITICAL - Packet loss = 100% [14:13:27] ^d: greg-g told me the window was this morning at , well, now [14:13:31] and that I should sync it [14:13:37] <^d> Ok :) [14:13:37] !log manybubbles synchronized docroot/noc/createTxtFileSymlinks.sh 'noncirrus is removed' [14:13:42] Logged the message, Master [14:13:43] it's now yes [14:14:14] syncing [14:14:24] ^d: do you know if I can sync-file a deleted file? [14:14:31] noncirrus is gone now [14:14:34] <^d> No, you can't. [14:14:38] I was just going to leave it to get cleaned up for the next scap [14:15:01] <^d> Yeah, either have to do it with dsh by hand or just wait for scap to do it for you [14:15:09] this copying to apaches step used to be faster I think [14:15:13] sync-dir/sync-file [14:15:22] manybubbles: /home in tampa is down right now [14:15:27] perhaps that's the reason? 
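On the 10.0.0.0/8 versus 10.0.0.0/14 question earlier in this exchange: the /14 sits entirely inside the /8, so allowing $INTERNAL is strictly wider than the /14 quoted for nsca. A quick check with the standard library:

    import ipaddress

    internal = ipaddress.ip_network("10.0.0.0/8")       # $INTERNAL_V4 from defs.erb
    nsca_allowed = ipaddress.ip_network("10.0.0.0/14")  # the narrower range quoted for nsca

    print(nsca_allowed.subnet_of(internal))                      # True: /14 is inside /8
    print(internal.num_addresses // nsca_allowed.num_addresses)  # 64, i.e. the /8 is 64x larger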
[14:15:39] (03PS1) 10ArielGlenn: remove prod ips for emery, stat1, somehow overlooked during decom [operations/dns] - 10https://gerrit.wikimedia.org/r/128442 [14:15:41] !log manybubbles synchronized wmf-config/ 'cirrus for more wikis and disable collection for more' [14:15:43] mark: dunno. I don't know all the ways that sync-file worms the files out under my name [14:15:47] Logged the message, Master [14:15:49] well, synced [14:15:58] cmjohnson: synced [14:16:05] <^d> mark: We haven't sync'd to Tampa for a couple of weeks :) [14:16:22] i wonder if we still rely on /home at all [14:16:26] apache configs perhaps [14:16:49] thanks akosiaris i'll work with this later today [14:17:10] and noc [14:17:59] !log building new elasticsearch indexes for the last wikis that didn't have them. the cluster may go red as the indexes are assigned. silly nagios check. [14:18:05] Logged the message, Master [14:18:15] fenari: Connection to 2620:0:860:2:208:80:152:165 timed out while waiting to read [14:18:22] that is why the sync took so long [14:18:25] mark: ^^ [14:19:02] that's fenari [14:19:22] indeed [14:19:38] and it didn't want my file [14:19:57] !log added bblack account on all junipers [14:20:03] Logged the message, Master [14:20:43] bblack: that is you ^ give it a try [14:23:28] RECOVERY - Host 208.80.152.132 is UP: PING OK - Packet loss = 0%, RTA = 35.41 ms [14:23:28] RECOVERY - Host mchenry is UP: PING OK - Packet loss = 0%, RTA = 35.34 ms [14:23:46] mchenry back on the 12th floor [14:23:47] RECOVERY - Host sanger is UP: PING OK - Packet loss = 0%, RTA = 35.34 ms [14:23:57] that's where it started off ;) [14:23:58] sanger too [14:24:19] cmjohnson: let me know when you want me to sync the revert [14:25:28] PROBLEM - HTTP on mchenry is CRITICAL: Connection refused [14:25:29] yes some scripts do, in particular the apache deploy scripts do [14:25:50] the speciic things for the apache scripts are noted in a ticker [14:25:52] ticket [14:25:55] (03PS1) 10Yuvipanda: dynamicproxy: Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/128459 [14:25:57] PROBLEM - SSH on sanger is CRITICAL: Connection refused [14:26:02] Coren: ^ trivial fix [14:26:07] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [14:26:07] PROBLEM - SSH on mchenry is CRITICAL: Connection refused [14:26:07] PROBLEM - RAID on sanger is CRITICAL: Connection refused by host [14:26:07] PROBLEM - Disk space on sanger is CRITICAL: Connection refused by host [14:26:07] PROBLEM - DPKG on sanger is CRITICAL: Connection refused by host [14:26:17] PROBLEM - check configured eth on sanger is CRITICAL: Connection refused by host [14:26:17] PROBLEM - Recursive DNS on 208.80.152.132 is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:26:28] PROBLEM - LDAPS on sanger is CRITICAL: Connection refused [14:26:28] PROBLEM - puppet disabled on sanger is CRITICAL: Connection refused by host [14:26:28] PROBLEM - check if dhclient is running on sanger is CRITICAL: Connection refused by host [14:26:50] (03CR) 10Dzahn: [C: 031] remove prod ips for emery, stat1, somehow overlooked during decom [operations/dns] - 10https://gerrit.wikimedia.org/r/128442 (owner: 10ArielGlenn) [14:27:22] check if dhclient is running? :) [14:27:30] (03CR) 10coren: [C: 032] "Is consmetic." [operations/puppet] - 10https://gerrit.wikimedia.org/r/128459 (owner: 10Yuvipanda) [14:28:19] mark: after the dhcp fiasco it made sense ... 
[14:28:29] hehe I guess [14:28:55] perhaps it might make sense to create a basic sanity script though [14:29:01] that does all kinds of checks like that on a host [14:29:05] and reports a single status to icinga [14:30:39] (03CR) 10ArielGlenn: [C: 032] remove prod ips for emery, stat1, somehow overlooked during decom [operations/dns] - 10https://gerrit.wikimedia.org/r/128442 (owner: 10ArielGlenn) [14:33:17] RECOVERY - Host tarin is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms [14:33:51] !log populating cirrus indexes for all remaining wikis [14:33:56] Logged the message, Master [14:33:57] RECOVERY - SSH on sanger is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7.1 (protocol 2.0) [14:34:07] RECOVERY - Disk space on sanger is OK: DISK OK [14:34:08] RECOVERY - DPKG on sanger is OK: All packages OK [14:34:08] RECOVERY - RAID on sanger is OK: OK: optimal, 1 logical, 2 physical [14:34:10] ottomata: any chance you can look at elastic1013 at some point? Its given up on ganglia [14:34:17] RECOVERY - check configured eth on sanger is OK: NRPE: Unable to read output [14:34:28] RECOVERY - LDAPS on sanger is OK: TCP OK - 0.036 second response time on port 636 [14:34:28] RECOVERY - puppet disabled on sanger is OK: OK [14:34:28] RECOVERY - check if dhclient is running on sanger is OK: PROCS OK: 0 processes with command name dhclient [14:34:51] (03PS1) 10ArielGlenn: remove 'static' (cname for hume), dead for a long time [operations/dns] - 10https://gerrit.wikimedia.org/r/128461 [14:34:58] not good... unable to read output but OK ? damn... [14:35:07] RECOVERY - LDAP on sanger is OK: TCP OK - 0.038 second response time on port 389 [14:35:23] ah I know why [14:35:25] let's fix [14:37:47] (03CR) 10ArielGlenn: [C: 032] remove 'static' (cname for hume), dead for a long time [operations/dns] - 10https://gerrit.wikimedia.org/r/128461 (owner: 10ArielGlenn) [14:38:13] (03CR) 10Dzahn: "this already said "currently offline" back in 2009" [operations/dns] - 10https://gerrit.wikimedia.org/r/128461 (owner: 10ArielGlenn) [14:38:27] PROBLEM - NTP on mchenry is CRITICAL: NTP CRITICAL: No response from NTP server [14:38:51] akosiaris: what's the basics on logging into the network gear? do I need to use a bastion or a certain port or? [14:39:21] yeah use a bastion [14:39:27] the routers have public ips but use an ACL [14:39:34] so if your ip isn't in that list... [14:40:54] bblack: iron as a bastion and then just ssh bblack@cr1-eqiad.wikimedia.org etc etc etc [14:41:53] !log marc synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [14:42:01] Logged the message, Master [14:42:55] akosiaris: I get prompted for a password when doing so with my key! [14:44:04] <^d> Coren: That script looks like it worked? [14:44:25] (03PS1) 10ArielGlenn: remove mgmt ips for ms5, 9, 10 [operations/dns] - 10https://gerrit.wikimedia.org/r/128462 [14:45:15] bblack: looking into it [14:45:17] ^d: I'm not sure, actually. It doesn't look it it -- it tried to copy the new cdb to the wrong place so it probably failed to actually update it, and it broke on trying to refer to fenari at some point. Bitrot. [14:45:37] cmjohnson1: ping [14:45:49] mark [14:45:49] cp $file /home/wikipedia/common/wmf-config/interwiki.cdb [14:45:55] i'm wondering why mchenry isn't coming back, its mgmt doesn't work either [14:45:56] Yeah. 
[14:46:07] (sanger is up but management doesn't work) [14:46:12] not sure why it's not coming back [14:46:15] I got into fenari, but /home/wikpedia mount seems broken [14:46:19] lemme check mgmt [14:46:39] bblack: netapp is being moved [14:46:45] (03CR) 10ArielGlenn: [C: 032] remove mgmt ips for ms5, 9, 10 [operations/dns] - 10https://gerrit.wikimedia.org/r/128462 (owner: 10ArielGlenn) [14:46:53] bblack: argh... a single quote in the key fixing [14:46:58] /usr/local/bin/updateinterwikicache [14:47:14] lol, that's out of date [14:47:44] <^d> We should fix it then :) [14:47:45] /home/reedy/updateiwcache is up to date [14:48:33] * Coren runs that one. [14:48:45] We should probably, like, stuff that outside your home though. :-) [14:49:04] ok mchenry is stuck at boot [14:49:15] (03PS1) 10Chad: Correct path to copy cdb file to [operations/puppet] - 10https://gerrit.wikimedia.org/r/128465 [14:49:28] <^d> Coren, Reedy: There's a fix for the official script ^ [14:49:43] rebooting it to see a full boot [14:50:02] mchenry was installed in 2007 [14:50:03] not bad [14:50:03] !log marc synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [14:51:07] PROBLEM - Host sanger is DOWN: PING CRITICAL - Packet loss = 100% [14:51:27] (03CR) 10Reedy: [C: 031] Correct path to copy cdb file to [operations/puppet] - 10https://gerrit.wikimedia.org/r/128465 (owner: 10Chad) [14:51:37] RECOVERY - Host sanger is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [14:52:23] manybubbles: So we have a SWAT this morning, but I'm inclined to say "no" to the change since I don't see that https://meta.wikimedia.org/wiki/Requesting_wiki_configuration_changes was followed at all there. [14:53:17] Which one, anomie? [14:53:46] twkozlowski: There's only one SWAT this morning: https://gerrit.wikimedia.org/r/127479 [14:53:59] (03PS1) 10ArielGlenn: remove entries for srv235-257 from dhcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/128468 [14:54:37] mchenry is in fsck [14:54:44] cmjohnson1: what's the status of pdf & netapp? [14:55:26] mark: mchenry: stuck @ Grub [14:55:31] no it's not [14:55:34] it's in fsck right now [14:55:36] netapp is in process now...robh is doing that [14:55:40] i'll monitor mchenry [14:55:46] ok [14:56:09] (03CR) 10ArielGlenn: [C: 032] remove entries for srv235-257 from dhcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/128468 (owner: 10ArielGlenn) [14:56:22] Steinsplitter is here ^^ [14:56:52] ^^? [14:58:14] anomie: is the the problem that swat isn't for config changes? we've totally used it for that and I don't mind doing so in theory. [14:58:32] ah, I see, it is in the request process [14:59:09] so I assume I shouldn't reenenable the pdf stuff just yet? [14:59:12] ah [14:59:17] !log shutting down and relocating virt0 and pdf2 [14:59:20] Steinsplitter: Regarding your change at https://gerrit.wikimedia.org/r/127479, I'm a little concerned that this is going to change the privacy link for most languages on Commons and there's no link to any sort of discussion where Commons has indicated that they actually want the change or to a bug indicating that the current situation is broken [14:59:23] Logged the message, Master [14:59:35] manybubbles: I don't know about the PDF stuff, that's a different window [15:01:02] anomie: the pdf stuff is the window before. it was to turn it off, do some server moves, then turn it back on [15:01:09] (I also note than no one has +1'd this patch set, so how it made it into a SWAT window, I don't know.) 
[15:01:11] but it sounds like the server moves aren't done [15:01:27] Perhaps we can agree that only patches with at least one +1 can be scheduled for deployment. [15:01:57] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:00] * ^d goes around and puts a +1 on allllll the patches [15:02:14] * yuvipanda writes a bot that does that [15:02:18] anomie: pleas ask jalexander [15:02:19] <^d> Now everyting is swat-elligible ;-) [15:02:22] i have permisson from jalexander [15:02:24] anomie: [15:02:45] anomie: i have chaned onwiki months ago, no oppose [15:02:53] manybubbles: As for config changes, it all depends. [15:02:54] and now removing the page from teh autoplj list [15:02:57] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:03:25] Steinsplitter: You didn't change everything on-wiki. I see a handful of languages using the foundation: link, with most still using the link from the i18n files. [15:03:42] anomie: pls, tak a look at the mediawiki pag [15:04:10] anomie: https://commons.wikimedia.org/w/index.php?title=MediaWiki:Privacypage&action=history [15:04:37] pleas... ther ar a lot of ON WIKI change... + i have permission from wikimedia legal to do so, if you don't belive, write a mail to wikiemdia legal [15:04:46] manybubbles: If you want to do this one, feel free. Otherwise I'm going to deny it on the grounds that consensus and such should be demonstrated before the SWAT window and not argued on IRC at the last minute. [15:05:32] anomie: atm all /SUBPAGES ar redirecting to fondationswiki [15:05:40] i have deleted all local pages :) engough local cahnges? [15:06:53] anomie: I'm not going to do it above your discomfort. I'm happy to get whatever consensus is required and do it tomorrow [15:07:01] (03CR) 10Chad: [C: 031] "Makes sense to me, basically harmless." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127479 (owner: 10Gerrit Patch Uploader) [15:07:01] i need to go now, see you later.... [15:07:07] <^d> Now it's got a +1 [15:07:08] Steinsplitter: For example, I see https://commons.wikimedia.org/?uselang=sv links to https://commons.wikimedia.org/wiki/Commons:Integritetspolicy not https://wikimediafoundation.org/wiki/Privacy_policy. I'm not inclined to change that without seeing some evidence of consensus. [15:07:15] if you don't trust me, ask Jalexaner......... [15:07:17] ........... [15:07:21] * ^d is comfortable with Steinsplitter's explanation [15:07:36] ^d: Is Gerrit sending out email or the the mailserver stuck? [15:07:44] I'm not really close enough to the problem to have an opinion [15:07:51] email server is being moved [15:08:05] * anomie invites ^d to use the SWAT window, if he wants [15:08:07] <^d> siebrand: What mark said. [15:08:11] mark: Ah, thanks. I was wondering why I didn't get mail :) [15:08:19] should be up soon [15:09:18] manybubbles: IMO the change is a good one, I'm just (over?-)cautious about consensus. [15:09:39] (03CR) 10Chad: [C: 032] MediaWiki:Privacypage now redirects to foundationwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127479 (owner: 10Gerrit Patch Uploader) [15:09:46] anomie: ah, well, there, ^d is doing it [15:09:48] (03Merged) 10jenkins-bot: MediaWiki:Privacypage now redirects to foundationwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127479 (owner: 10Gerrit Patch Uploader) [15:09:51] bblack: ok fixed [15:10:07] <^d> Coren: You didn't commit your interwiki.cdb file. [15:10:11] <^d> This is stupid that we commit that. 
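Anomie's spot-check above (loading Commons with ?uselang=sv and eyeballing the footer link) can also be done per-language against the allmessages API, which shows whether MediaWiki:Privacypage is still a local override or the i18n default for a given language. A hedged one-liner; the json.tool pretty-printer is only there for readability:

    curl -s 'https://commons.wikimedia.org/w/api.php?action=query&meta=allmessages&ammessages=privacypage&amlang=sv&format=json' | python -m json.tool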
[15:10:13] <^d> But I'll fix. [15:10:44] ^d: I didn't know it was supposed to be in the first place. Noted for future. [15:10:48] (03PS1) 10Chad: Commit updated interwiki.cdb file [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/128581 [15:10:50] <^d> I wasn't either. [15:11:05] <^d> But tin was complaining about uncommitted changes when I tried to pull. [15:11:26] (03CR) 10Chad: [C: 032] Commit updated interwiki.cdb file [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/128581 (owner: 10Chad) [15:11:34] (03Merged) 10jenkins-bot: Commit updated interwiki.cdb file [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/128581 (owner: 10Chad) [15:11:37] Arguably, the updateiwlinks scripts should probably commit. [15:11:52] akosiaris: thanks, works! [15:11:52] argh... JSDuck 5.x seems to require this ...https://rubygems.org/gems/rkelly ... meh... [15:12:01] bblack: :-) [15:13:00] <^d> Coren: Arguably, we wouldn't even check in a freaking binary cdb file to git at all. [15:13:15] <^d> That's insane. [15:14:07] RECOVERY - SSH on mchenry is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [15:14:07] RECOVERY - Recursive DNS on 208.80.152.132 is OK: DNS OK: 0.326 seconds response time. www.wikipedia.org returns 208.80.154.224 [15:14:10] !log demon synchronized wmf-config/InitialiseSettings.php 'Icb6b4bad: Updated $wgForceUIMsgAsContentMsg for commonswiki' [15:14:16] Logged the message, Master [15:14:28] RECOVERY - HTTP on mchenry is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 174 bytes in 0.082 second response time [15:14:38] ^d: I sorta kinda see the point (backup if the wiki page gets hosed, I suppose) but it does seem like the wrong thing gets committed. [15:15:03] And also. Wiki page. That has its own source control. :-) [15:15:12] <^d> The whole thing is a mess. [15:15:14] * ^d cries a little [15:15:23] <^d> I should eat something. [15:15:28] PROBLEM - SSH on fenari is CRITICAL: Server answer: [15:19:12] oh damn [15:19:13] sodium has a bad disk [15:19:17] RECOVERY - NTP on mchenry is OK: NTP OK: Offset -0.09554314613 secs [15:20:57] I blame Yahoo! :O [15:22:09] I'm sure someone's already mentioned this [15:22:11] that https://noc.wikimedia.org/conf/s1.dblist is down [15:22:22] well, more like https://noc.wikimedia.org/ is down [15:22:32] it's 'cause you're moving fenari right? [15:22:38] yes [15:22:44] k :) [15:23:57] PROBLEM - Host tarin is DOWN: PING CRITICAL - Packet loss = 100% [15:27:40] back in a little while, hope things stay calm-ish [15:27:52] (03PS7) 10BryanDavis: Move beta scap source directory off of NFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/127399 [15:29:46] back [15:31:26] anomie: pls, exmailn me you problem now :) [15:31:36] *explain [15:32:18] RECOVERY - Host tarin is UP: PING OK - Packet loss = 0%, RTA = 35.96 ms [15:32:37] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [15:32:59] mutante: Are you there? What problem does anomie have, exactly? Does she/he even work for the WMF? [15:33:39] Steinsplitter: For example, I see https://commons.wikimedia.org/?uselang=sv links to https://commons.wikimedia.org/wiki/Commons:Integritetspolicy not https://wikimediafoundation.org/wiki/Privacy_policy. I'm not inclined to change that without seeing some evidence of consensus (perhaps I'm over-cautious). But ^d went ahead and deployed it. [15:34:17] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 35.84 ms [15:34:26] <^d> anomie: No it doesn't.
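To tie off the interwiki cache thread: Gerrit 128465 fixes where the wrapper copies the cdb, and 128581 is ^d committing the regenerated file by hand so tin's mediawiki-config checkout stops complaining about uncommitted changes. The manual sequence looks roughly like this — the temporary source path is illustrative, /a/common is taken from the later !log entries, and the sync-file invocation is inferred from the 'synchronized wmf-config/interwiki.cdb' log lines rather than quoted from any documentation:

    # after the update script has built a fresh cdb (illustrative source path):
    cp /tmp/interwiki.cdb /a/common/wmf-config/interwiki.cdb
    cd /a/common
    git add wmf-config/interwiki.cdb
    git commit -m "Commit updated interwiki.cdb file"
    sync-file wmf-config/interwiki.cdb 'Updating interwiki cache'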
[15:34:31] <^d> It links to wikimediafoundation.org [15:34:36] ^d: It did before you deployed it. [15:34:49] ^d: thanks [15:35:17] <^d> you're welcome [15:35:17] RECOVERY - Host nfs1 is UP: PING OK - Packet loss = 16%, RTA = 35.60 ms [15:35:18] <^d> :) [15:35:34] anomie: pleas stop making drama. :) [15:36:39] (03PS1) 10Ottomata: Quieting logster::job cron [operations/puppet] - 10https://gerrit.wikimedia.org/r/128930 [15:37:06] (03PS2) 10Ottomata: Quieting logster::job cron [operations/puppet] - 10https://gerrit.wikimedia.org/r/128930 [15:38:19] (03Abandoned) 10Aklapper: Upgrade custom code to Bugzilla 4.4.2 [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/113685 (owner: 10Aklapper) [15:39:57] (03PS3) 10Ottomata: Quieting logster::job cron [operations/puppet] - 10https://gerrit.wikimedia.org/r/128930 [15:40:06] (03CR) 10Ottomata: [C: 032 V: 032] Quieting logster::job cron [operations/puppet] - 10https://gerrit.wikimedia.org/r/128930 (owner: 10Ottomata) [15:41:30] manybubbles: the Collection extension can be reenabled again [15:41:42] pdf servers have finished moving [15:42:37] PROBLEM - Disk space on dataset2 is CRITICAL: DISK CRITICAL - /data is not accessible: No such file or directory [15:44:16] <^d> mark: On it, I think he went idle. [15:44:26] thanks :) [15:44:34] (03CR) 10Chad: [C: 032] Revert "Disable Collection for server move" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127659 (owner: 10Chad) [15:47:09] (03Merged) 10jenkins-bot: Revert "Disable Collection for server move" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127659 (owner: 10Chad) [15:47:50] !log demon synchronized wmf-config/InitialiseSettings.php 'Collection back on, server move over' [15:47:55] Logged the message, Master [15:49:57] <^d> mark: Just tested on a couple of pages, everything seems to be working fine. [15:50:06] awesome [15:50:07] PROBLEM - NTP on nfs1 is CRITICAL: NTP CRITICAL: Offset unknown [15:51:07] RECOVERY - Host manutius is UP: PING OK - Packet loss = 0%, RTA = 35.56 ms [16:00:35] (03PS1) 10Andrew Bogott: Rearrange andrew vs werdna: [operations/puppet] - 10https://gerrit.wikimedia.org/r/128940 [16:00:50] andrewbogott: :p [16:01:18] Garrett ;) [16:01:23] werdna: I feel like that patch leaks our UIDs… but at least it prevents me from stealing your keys :) [16:01:54] (03PS2) 10Andrew Bogott: Rearrange andrew vs werdna: [operations/puppet] - 10https://gerrit.wikimedia.org/r/128940 [16:01:56] ^ spelling fixed! [16:02:00] :D [16:03:53] (03CR) 10Andrew Bogott: "I'm worried that this patch leaks our old UIDs. Do I need to create two new user classes with the old UIDs and ensure=>absent?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/128940 (owner: 10Andrew Bogott) [16:06:07] PROBLEM - Puppet freshness on fenari is CRITICAL: Last successful Puppet run was Tue 22 Apr 2014 01:05:00 PM UTC [16:08:24] (03PS5) 10BBlack: Set bnx2x num_queues to match physical cores on lvs nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/127827 [16:10:03] (03CR) 10BryanDavis: "@Antione: The script that you gave the -1 for is actually triggered by zuul. See comments inline." 
(031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/127399 (owner: 10BryanDavis) [16:10:55] (03CR) 10BBlack: [C: 032 V: 032] Set bnx2x num_queues to match physical cores on lvs nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/127827 (owner: 10BBlack) [16:10:57] PROBLEM - Host nfs2 is DOWN: PING CRITICAL - Packet loss = 100% [16:13:47] PROBLEM - Host ps1-b4-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [16:15:14] !log reedy updated /a/common to {{Gerrit|I55954c612}}: Commit updated interwiki.cdb file [16:15:21] Logged the message, Master [16:15:22] (03PS1) 10Reedy: Non wikipedias to 1.24wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/128941 [16:15:45] * Nemo_bis testing some ePub production [16:15:55] <^d> Reedy: That log is always wrong. It's always HEAD~1, never HEAD. [16:16:10] Yup [16:16:13] It's done purposely IIRC [16:16:29] <^d> ...purposefully inaccurate? [16:17:09] Something to do with potential security fixes [16:17:09] Reedy: I guess static-current will work once this deploy goes through? [16:17:14] yup [16:17:17] cool [16:18:44] it would be nice to merge https://gerrit.wikimedia.org/r/#/c/127473/ before 1.24wmf1 deploy [16:21:28] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [16:32:32] (03CR) 10Hashar: Move beta scap source directory off of NFS (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/127399 (owner: 10BryanDavis) [16:35:12] ah there we go [16:40:34] !log aaron synchronized php-1.24wmf1/includes/filerepo/file/LocalFile.php 'e9807d0b3339acd791fceb5fd889f7044e9e3471' [16:40:40] Logged the message, Master [16:44:07] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4775 bytes in 0.072 second response time [16:45:13] !log Reenabled Apache and puppet on fenari [16:45:20] Logged the message, Master [16:45:28] PROBLEM - Disk space on nfs1 is CRITICAL: DISK CRITICAL - free space: / 354 MB (3% inode=74%): [16:57:12] too bad mwlib translations have not been updated in ages [17:05:28] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Tue Apr 22 17:05:23 UTC 2014 [17:13:45] can someone take a look at https://rt.wikimedia.org/Ticket/Display.html?id=7343 [17:13:48] should be fairly trivial [17:13:54] nobody on RT duty? [17:14:36] The international person of mystery is on RT duty, obvs. [17:15:50] heh [17:15:56] marktraceur: does that mean you are on ops duty? [17:16:47] I'm only the inter-state man of mystery [17:17:15] Let's see, mark was the last ops person to talk. [17:17:37] marktraceur: yeah, and I saw his email in ops-list a while ago asking for volunteers for this week [17:17:41] So by the principle of no takebacks, he must be it. [17:18:53] Woohoo [17:19:01] wooo! [17:19:15] Coren: look at that RT ticket? https://rt.wikimedia.org/Ticket/Display.html?id=7343 :) [17:19:26] Coren: should be fairly trivial. Let me know if it needs confirmation from tfinc [17:20:28] wmf groups for staffers doesn't need manager okay or the 3 day bit, afaik. Ima do this now. [17:20:29] I had rt last week and no one signed up for it this week because we didn't have the standard meeting [17:20:37] but... I'm not willing to have it again back to back [17:20:52] so I just left ?? and hope that some kind soul will step forwards [17:21:01] Coren just took it [17:21:05] thanks ariel [17:21:09] oh yay [17:21:13] thank you Coren :-) [17:21:29] apergos: No worries; I hadn't taken it in ages because of all the labs stuff. 
[17:21:34] sure [17:27:53] (03Abandoned) 10Ori.livneh: Krinkle: access to LogStash cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/119790 (owner: 10Ori.livneh) [17:28:53] Coren: ty! [17:28:54] Coren: https://rt.wikimedia.org/Ticket/Display.html?id=7344&results=e2a6163d1bbfa7d88f348a3d76ef4811 [17:28:57] Coren: as well? :) [17:28:57] ori, +2 pls when you get a chance - it now uses the settings system and prompts for username on first run - https://gerrit.wikimedia.org/r/#/c/123830/ [17:29:11] yurikR: oh yeah, sorry about that. i'll review irght now. [17:30:53] * Coren wishes all RT tickets were that trivial. :-) [17:31:09] Coren: hehe :D ty! [17:37:30] yurikR: do you know that you can avoid the whole 'gerrit' remote altogether by changing ~/.config/git-review/git-review.conf to have [gerrit] defaultremote=origin ? [17:51:44] bd808: I sent an email your way. [17:52:09] !log disable cp301[34] (mobile varnish frontends) in pybal on fenari [17:52:14] Logged the message, Master [17:52:44] * bd808 waits for gmail to realize that yuvipanda sent email [17:54:19] silly gmail [17:54:37] bd808: nothing urgent, just someone reporting code being old on betalabs, and thought it might be related to scap on betalabs. [17:54:50] Found it. And yes I'm sure ti was [17:55:08] My "fix" yesterday wasn't quite fixed [17:55:20] :) [17:55:32] I would say that it's better now except Jenkins has decided to hate all my update jobs [17:55:42] jenkins usually hates everything [17:55:51] it's also confirmation bias, since we don't remember the times jenkins didn't hate stuff [17:55:57] * bd808 wishes hashar or Krinkle were around to help [17:56:19] (03CR) 10Ori.livneh: [C: 032] Correct path to copy cdb file to [operations/puppet] - 10https://gerrit.wikimedia.org/r/128465 (owner: 10Chad) [17:58:57] bd808: [17:59:07] (03PS1) 10Ottomata: Using offset => stored for kafkatee [operations/puppet] - 10https://gerrit.wikimedia.org/r/128954 [17:59:24] (03PS2) 10Ottomata: Using offset => stored for kafkatee [operations/puppet] - 10https://gerrit.wikimedia.org/r/128954 [17:59:31] (03CR) 10Ottomata: [C: 032 V: 032] Using offset => stored for kafkatee [operations/puppet] - 10https://gerrit.wikimedia.org/r/128954 (owner: 10Ottomata) [18:00:02] Krinkle: Stuck Jenkins jobs due to Jenkins brain damage of some sort -- https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ & https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/ [18:00:17] !log tridge is coming dow for relocation, shouldnt disrupt anything but backups in progress [18:00:26] Logged the message, RobH [18:00:47] Krinkle: Jenkins says there are no available runners on deployment-bastion and that there are 4 free runners on deployment-bastion [18:00:55] bd808: Because https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/607/ is already running [18:01:01] and has taken over 39 minutes at this point [18:01:45] But isn't it also stuck for the same reason? https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/607/console [18:01:49] Yeah [18:03:19] I forced a puppet run on deployment-bastion this morning and local jobs seem to have been busted since. I'll look in my scrollback to see if I can tell which puppet changes were applied. [18:03:58] bd808: I suspect maybe jenkins-slave or something related to it got kicked in the balls and unable to recover. 
[18:04:11] which is a local process and user on that node [18:04:15] qchris: re: cp301[34] analytics: we're taking a step back on those hosts and re-doing them differently. so it will take longer, but they're been removed from pybal frontends for now as well, which should stop the analytics loss. I won't put them back until we're ready to turn on backends as well [18:04:36] bd808: not sure what I can do about it from this end. Hashar would know more. [18:04:46] Krinkle: I stopped and started the slave connection from the Jenkins management console to see if that was the problem [18:05:05] It didn't fix anything as far as I could tell [18:05:32] bd808: Looks like its connection is working fine [18:05:34] println "uname -a".execute().text [18:05:36] https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/script [18:05:44] Linux deployment-bastion ... [18:05:56] so it can ssh and execute fine [18:05:59] strange [18:06:01] Bah, the RT duty docs lie. [18:09:30] Krinkle: I don't see anything obviously scary in the puppet changes that should have been applied -- http://paste.debian.net/hidden/cc32c276/ [18:09:51] bblack: Ok. Thanks. [18:10:09] Jenkins is quite sensitive. I honestly don't have a clue other than having cured it in the past with god-knows-what fix that didn't make sense but pulled it off. [18:10:32] Feel a bit like a rookie Warehouse 13 or Fringe agent. [18:11:09] All we are are experts in not knowing wtf its doing. We still don't know what it's doing, but at least we've had experience and know and awful lot about how we don't know what its doing. [18:11:29] -rookie [18:11:43] * bd808 updates puppet and applies again just for kicks [18:11:59] I bumped # executes to force it to look at it again. [18:12:03] And marked offine / online [18:12:26] chasemp or Coren (or whoever), are you interruptible enough to talk to me about the task of changing production UIDs? [18:12:43] I think it's a simple problem, but I suspect that it only seems simple to me because I'm overlooking some serious security concerns. https://etherpad.wikimedia.org/p/changeuid [18:12:58] * Coren takes a look. [18:13:07] I'm interruptible by definition this week. :-) [18:13:13] you, you're on RT? [18:13:21] Krinkle: So the same random sorts of changes I was trying. I had done online/offline and disconnect/connect before pinging you [18:13:34] I am just now trying to clean up some stuff to submit for diffs, maybe could wait a bit? [18:13:42] but off the top of my head, logged in users will be an issue [18:13:51] if that is actually going on [18:14:26] bd808: will check back in a bit, dinner bbl [18:15:12] chasemp: I'm imagining that I'll do this one at a time and notify each user before I do it. [18:15:29] But, no need for this to be synchronous, just add your thoughts to the etherpad when you have a chance. [18:15:35] sounds good, I would still boot them post-notify [18:15:41] in case [18:19:28] PROBLEM - Host ps1-c2-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:20:45] Coren: try "ldaplist -l passwd | grep uidNumber" on virt1000 [18:20:49] no five-digit uids in there [18:20:57] PROBLEM - Host ps1-c1-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:22:01] Reedy: hey, I'm in and out today, ya good for today? [18:23:15] Yus [18:24:13] andrewbogott: D'oh! I meant >5000 [18:24:41] (03PS1) 10Ottomata: Adding kafkatee::monitoring ganglia module class [operations/puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/128961 [18:24:52] when did that change? 
I can't find a single example of >5000 [18:25:07] seems like 1000-59999 is fair game [18:25:20] (03CR) 10Ottomata: [C: 032 V: 032] Adding kafkatee::monitoring ganglia module class [operations/puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/128961 (owner: 10Ottomata) [18:25:24] is there a way to tell ldap what ranges to allocate from? [18:25:38] Reedy: cool, enjoy :) [18:26:02] andrewbogott: No, you're correct of course. It's my brain mixing up the >50000 service groups w/ the >500 users. We're up to the mid 4000s now it seems. [18:26:07] (03PS1) 10Ottomata: Adding kafkatee monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/128962 [18:26:16] (03PS2) 10Ottomata: Adding kafkatee monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/128962 [18:26:29] Coren: ok! Sorry for pedant-attack, was just worried that something was malfunctioning. [18:27:14] (03PS3) 10Ottomata: Adding kafkatee monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/128962 [18:27:14] I've been using 600 range for systemwide users though, that one was free. [18:27:46] (03PS1) 10Jforrester: Follow-up Ic04c7c8ad: Check if whitelist is set before assigning [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/128963 [18:27:52] Oddly there seems to be one user in ldap (Aaron) with uid of 544. Must've been created by hand to match admins.pp [18:27:54] bd808: ^^^ Should fix that. [18:28:48] James_F: Cool. Now if I can just get Jenkins fixed... ;) [18:29:03] bd808: :-) [18:29:16] (03CR) 10Ottomata: [C: 032 V: 032] Adding kafkatee monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/128962 (owner: 10Ottomata) [18:29:38] (03CR) 10Jforrester: "Followed-up in I72b1067." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 (owner: 10Jforrester) [18:29:45] greg-g: https://secure.phabricator.com/T4843#10 and below [18:29:49] Pure. Awesomeness. [18:32:10] so… most hosts don't have personal login accounts at all. So salting a 'find' on every single server seems highly excessive... [18:32:21] Are there few enough hosts with logins that I can just enumerate them? [18:33:19] !log resurrecting tridge in pmtpa [18:33:24] Logged the message, RobH [18:33:30] twkozlowski: that is pretty awesome, I have heard the Evan ( epriestley ) is really great to work with [18:34:01] greg-g: at my last place we would ask him about a feature and then go to implement [18:34:13] and more than once within a day or so we'd get an email, "ok I did it" [18:34:19] like...wtf dude, damn awesome [18:34:23] chasemp: haha, awesome [18:34:36] *that's* the kind of upstream I want [18:34:47] greg-g: Yeah! [18:35:03] "Oh, that'll be easy to fix", and next time I look, there's a patch already :-) [18:35:16] truly, that is not unusual [18:35:47] mutante, any idea if there's a clever way to tell salt to only act on systems with direct user logins? (I feel like that's not exactly the same as the set of hosts with public dns…) [18:36:12] greg-g: though, they appear to be getting quite a lot of bug reports/feature requests from us, as far as I see [18:36:26] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.24wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/128941 (owner: 10Reedy) [18:36:33] (03Merged) 10jenkins-bot: Non wikipedias to 1.24wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/128941 (owner: 10Reedy) [18:36:54] twkozlowski: yeah, we're a thorough bunch :) [18:37:25] twkozlowski: it may be beneficial to have one wmf phab interface person? 
otherwise we may seem schizophrenic for priorities with each internal team communicating indepedently? [18:37:28] just a thought [18:37:57] not sure, we have a lot of diff use cases, and I don't know if quim can do them all justice [18:38:06] maybe evan just needs to hear them all and make sense of them himself? [18:38:41] quim seems to be coping quite okay with the load so far :-) [18:38:54] probably sane, i just don't want to abuse the good will, not that I'm saying we are [18:39:07] yeah, but being an advocate for mulitple points of view is tough :) [18:39:18] chasemp: yeah, valid concern [18:39:30] On an a bit related note, the notifications in Phabricator are fun. [18:39:46] My first thought on seeing them was "Oh, my Phabricator is on fire!!! " [18:40:45] andrewbogott: is any of that etherpad stuff helpful? can I help in any other way? [18:41:06] yes, it's helpful. [18:41:18] In part I was worried that there was some other place that UIDs live that I was forgetting about... [18:41:22] sounds like not [18:41:55] oh… chasemp, does it matter if there is one or two ldap uids that's outside of the 1000+ range? [18:42:19] I found at least one hand-edited uid in ldap that's 544. In theory that's good news because it's the same as in admins.pp [18:42:28] But it's an outlier [18:42:32] to me personally no, to debian it seems like 1000-59999 is assignable [18:42:46] 'assignable'? [18:42:50] !log Jenkins deposing / repooling deployment-bastion.eqiad.wmflabs slave locked up somehow, the executors are no more taken in account by Jenkins master [18:42:57] Logged the message, Master [18:43:06] andrewbogott: not reserved for debian project, etc. we can issue them as UID's to users. [18:43:13] !log tridge back and accessible [18:43:15] ah, ok. [18:43:19] Logged the message, RobH [18:43:36] We'll I'll regard AaronSchulz's weird uid as not a problem until/unless it proves a problem later [18:43:41] andrewbogott: 100-999 is also assignable, previously I always thought 500-1000 was user convention but [18:44:00] that may be beause I haven't had more than 500 users [18:44:05] hashar: I'm rebooting deployment-bastion too. [18:44:18] bd808: something went wild on it apparently [18:44:38] *Well [18:44:39] bd808: apparently a database update job ended up being blocked [18:44:48] bd808: it might have taken all memory or something like that :/ [18:45:12] hashar: Yeah. Krink-le and I both tried a lot of random Jenkins poking to get the slave there working again [18:46:12] nothing suspicious on ganglia http://ganglia.wmflabs.org/latest/?c=deployment-prep&h=deployment-bastion&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [18:46:18] beside the recent huge spike :D [18:46:36] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: non wikipedias to 1.24wmf1 [18:46:43] Logged the message, Master [18:46:48] hashar: It started being sad right after I ran puppet to fix the rsync server for scap [18:47:04] !log reedy synchronized docroot and w [18:47:10] Logged the message, Master [18:47:29] bd808: maybe the patch has something very weird so :/ [18:49:04] bd808: ssh ed to it \O/ [18:49:35] !log Jenkins deployment-bastion.eqiad.wmflab is back online: Slave successfully connected and online [18:49:42] Logged the message, Master [18:50:42] (pending—Waiting for next available executor on deployment-bastion.eqiad) [18:51:37] I didn't change the jobs today, but it acts like Jenkins has a dependency cycle or something all of a suddne. 
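For the stuck deployment-bastion slave: the gentler options (marking the node offline/online, disconnecting/reconnecting, bumping the executor count) had already been tried above, and what eventually worked — per the later !log entry — was killing the slave agent's java process on the instance so the master could relaunch it. A last-resort sketch; the slave.jar process pattern is an assumption based on the stock Jenkins agent:

    # on deployment-bastion.eqiad.wmflabs
    ps aux | grep '[s]lave.jar'     # inspect the Jenkins agent's java process first
    sudo pkill -f slave.jar         # then kill it so the master relaunches the agent
    # watch https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/
    # for the slave to come back online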
[18:52:25] yeah it is broken :-/ [18:52:47] apparently the master does not recognize the slave has free executors [18:55:45] !log restarted Zuul to clean up some stuck jobs from the queue [18:55:51] Logged the message, Master [18:56:06] (03PS5) 10Ottomata: Add EventLogging Kafka writer plug-in [operations/puppet] - 10https://gerrit.wikimedia.org/r/85337 (owner: 10Ori.livneh) [18:56:22] (03CR) 10Ottomata: [C: 032 V: 032] Add EventLogging Kafka writer plug-in [operations/puppet] - 10https://gerrit.wikimedia.org/r/85337 (owner: 10Ori.livneh) [18:57:13] (03CR) 10Yuvipanda: "I just realized Werdna is Andrew reversed." [operations/puppet] - 10https://gerrit.wikimedia.org/r/128940 (owner: 10Andrew Bogott) [18:57:39] YuviPanda: it really blows your mind doesn't it? [18:58:03] AaronSchulz: indeed. Seeing it on a gerrit page produced a moment of clarity. [18:58:52] "Wikimedia, where amazing happens." [19:00:28] PROBLEM - DPKG on vanadium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:01:19] Well that doesn't sound good. [19:01:37] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/kafka [19:02:27] hi! [19:02:29] that's ori and I [19:03:07] (03PS1) 10Ottomata: Including kafka::config class, not ::client [operations/puppet] - 10https://gerrit.wikimedia.org/r/128967 [19:03:19] Well, that's less worrying then. :) [19:03:23] (03CR) 10Ottomata: [C: 032 V: 032] Including kafka::config class, not ::client [operations/puppet] - 10https://gerrit.wikimedia.org/r/128967 (owner: 10Ottomata) [19:03:32] ACKNOWLEDGEMENT - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/kafka ori.livneh ottomata + ori deplying kafka consumer [19:04:27] RECOVERY - DPKG on vanadium is OK: All packages OK [19:04:38] ori, did you have java installed on this box before? probably not, right? [19:05:03] ottomata: it was doing lucene something or other at one point and wasn't wiped before being repurposed for eventlogging [19:05:06] !log Jenkins killed Jenkins java process on deployment-bastion.eqiad.wmflab to free up the executor and threads entirely. [19:05:10] ottomata: so it probably did have java, but it didn't need it [19:05:11] Logged the message, Master [19:05:14] ok, i'm going to remove it [19:06:20] cool, that looks better [19:08:37] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [19:14:37] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/kafka [19:18:32] (03Abandoned) 10Ottomata: statistics: converted iptables to ferm rule [operations/puppet] - 10https://gerrit.wikimedia.org/r/117670 (owner: 10Matanya) [19:20:37] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [19:24:37] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/kafka [19:25:37] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [19:26:58] Can someone who knows Varnish take a look at https://gerrit.wikimedia.org/r/#/c/127818/ ? [19:27:27] It's changing a IE tag to an HTTP header to make our HTML head validate. We just want to be sure Varnish will pass it on as expected. 
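On superm401's Varnish question: assuming the header introduced by Gerrit 127818 is X-UA-Compatible (the usual replacement for the IE meta tag — the log does not name it), the simplest post-deploy check is to confirm it comes back through the cache layers on both hits and misses, e.g.:

    # response headers as served through the varnish layers
    curl -sI 'https://en.wikipedia.org/wiki/Main_Page' | egrep -i 'x-ua-compatible|x-cache'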
[19:27:39] hashar: \o/ Jenkins seems to be deploying in beta again [19:27:53] We're expecting it will be sent to all browsers, so no IE sniffing needed. [19:27:56] bd808: had to kill the java jenkins slave process on the instance [19:28:12] bd808: I guess it got stuck somehow with ghost threads and the agent believing it had no executor left [19:28:20] Weird [19:28:52] From Jenkins job -- Finished scap: beta-scap-eqiad (build #1511) (duration: 20m 58s) [19:29:01] Too slow :( [19:30:56] bd808: you can add a wrappers: - timestamp to the job [19:31:20] that will prefix each line of the console with the elapsed/system time since beginning of the job [19:31:46] the mw update l10n took 12 minutes :- [19:31:46] ( [19:31:47] !log Reloading Zuul to deploy config change I9c2f94b138244ab8 [19:31:53] Logged the message, Master [19:32:19] hashar: When you're ready, ping me, I need you for a quick minute regarding jsduck. [19:32:31] hashar: Yeah. Looks like the "local" disk in labs isn't drastically faster than the NFS server [19:32:45] Krinkle: go ahead :-] [19:33:06] hashar: You mentioned we can "just use" gem. What do you mean? Run gem install on gallium? [19:33:19] bd808: the instance disk might well be on a NFS share as well, i.e. not on the compute node. I have no idea honestly [19:33:21] hashar: Or on a labs node? Because we need to publish them doc.wikimedia.org, so can't use a labs node right? [19:33:25] Krinkle: na on labs nodes [19:33:42] Krinkle: then there is the issue of moving the doc generated on the labs slaves back to the doc website which is gallium [19:34:04] We could make a way for it to rsync back, but I'd rather not. [19:34:07] Krinkle: and I have no real clue yet how to grant access on random labs instance to write back to a prod machine hosting the doc site [19:34:13] Exactly [19:34:29] especially with the prospect of allowing non-whitelisted users execute tests in the future. [19:34:50] though the publishing back to doc is usually on post merge isn't it ? [19:34:54] btw, what's the progress on that? It looks like we're getting close (aside from jobs that need to be on gallium, such as jsduck publish) [19:35:02] hashar: Hm.. good point. [19:35:17] But then, we still need to execute that job somewhere. [19:35:34] bd808: maxsem proposed at one point to generate the CDB files in memory (i.e. tmpfs) to save the I/O system calls. Then simply copy from tmpfs to disk. [19:36:08] bd808: I can't remember the details, Tim would now. But I think each write to a cdb file cause a chain of fopen(), fwrite() fclose() fsync() or something along that [19:36:12] hashar: The big cost is stating all of the i18n source files on disk [19:36:25] At least it was in my tests [19:36:30] hashar: So we need to move more jobs to labs nodes, make sure the labs nodes are 100% puppetized, and then enable that thing where it automatically creates new instances in a pool in advance, and destruct them after use, and register them as slaves upon spawning. [19:36:39] bd808: ah yeah that is not helpful either :-( [19:36:53] * bd808 wants ssd for this [19:37:11] * hashar wants a new l10n system [19:37:14] Which isn't just to allow tests for non-whitelisted users, it also allows us to do things like testing apache much easier without all these TEST_ID in the paths (e.g. we can simply use localhost:80) [19:37:43] Krinkle: yup. 
Gotta bring that back with ops [19:38:11] Krinkle: so either a pool of nodes on wmflabs or we figure out how to use LXC on some isolated prod machine [19:38:42] I don't think LXC is worth the effort. Way too much complication and side effects. I don't trust that one single bit. Doesn't scale, doesn't work for what we need. [19:38:49] I guess :] [19:39:15] so pool of disposable labs instance seems to fit our needs [19:39:30] bd808: seems to be CPU bound : http://ganglia.wmflabs.org/latest/?c=deployment-prep&h=deployment-bastion&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [19:40:13] bd808: there is a 50% plateau (2 threads out of 4 cpu) [19:40:19] hashar: It does peg all the cores you give it, but I *think* it's mostly iowait [19:40:32] LXC seems nice if you need to run separate linux boxes as one for different purposes (e.g. db server, apache server), but doesn't seem suited to run untrusted code of the same kind (e.g. dozens of instances all doing db/apache/phpunit/qunit with different versions of the code) [19:40:44] bd808: then we will see a lot of WAIT time? [19:41:01] bd808: ah yeah there is some iowait at the start :-] [19:42:25] Krinkle: on the other hands it is way faster to start a LXC since it is merely a user land separation [19:42:43] hashar: speed isn't an issue since they'd be started ahead of time [19:43:26] hashar: can we do things on LXC like install bins globally (apt-get, npm, gem), open ports, mess with puppet etc., make http requests, and then safely remove it without having any long term effects? [19:43:54] I think if this worked, Travis and other parties would be doing it. [19:44:07] how does Travis handle isolation? [19:44:18] separate vms [19:44:24] one-off, spawned by pool [19:44:34] ahead of time, host standby's assigned to each job [19:44:40] hot standby's* [19:45:14] just like openstack NodePool :-D [19:45:17] which make sense [19:45:43] so I gotta ping our favorites ops :D [19:45:44] they're spawned ahead of time, and provisioned on boot (basically all the provisioning does is register the node as available worker, the actual provisioning/installing bins is already done in the vm image that was created for that node) [19:47:20] hashar: we should try an optimise for common things. Especially npm-install is going to be slow. The reason it is super fast right now is because our labs node is re-used, so they're all 304s from local /home/jenkins-slave/.npm/ cache [19:47:20] HTTP 304s* [19:47:33] Maybe a reverse proxy around the nodes for certain url patterns. [19:47:49] that would be a straight proxy :-] [19:47:52] err [19:47:53] a cache [19:47:55] whatever [19:48:07] right, not reversed [19:48:11] outgoing, not incoming. [19:49:11] hashar: Watching scap/l10nupdate run with iotop on deployment-bastion does make it look like disk writes are the big cost right now. [19:50:27] i wonder if i can !log here? [19:50:58] !log I wonder if I can log here. [19:51:03] MatmaRex: if it is related to Wikimedia infrastructure yup. For labs #wikimedia-labs :-D [19:51:05] Logged the message, Master [19:51:09] heh [19:51:15] (I will edit the SAL, of course.) [19:51:28] !log wikibugs is down, let's not bring it back up [19:51:36] Logged the message, Master [19:51:58] Krinkle: a shared caching proxy for labs might be a good addition to wmflabs. 
Maybe yuvipanda will be interested, he implemented the reverse proxy already [19:52:09] (we have a unique opportunity to replace it with something better) [19:52:37] bd808: must be the cdb writes :-( [19:52:51] hashar: we probably want it to be opt-in for wmflabs projects though. Maybe opt-out for tools labs, but wmflabs in general probably opt-in. [19:52:56] hashar & bd808, the situation have changed since then, now there's far morre i18n filees [19:53:10] so I/O patterns are different [19:53:25] bd808: also I have no idea how the json files are generated. Presumably they read each key from the cdb files, craft an in memory array then dump that as json. The cdb files read might long as well. [19:53:47] bd808: maybe we could skip the cdb files entirely. Ie craft the json files from the i18n php files. [19:54:16] MaxSem: if you have any info I am sure Bryan will love some clues / info :D [19:54:51] MaxSem: he switched the beta cluster to use scap (i.e. with everything generated on a host then using rsync to copy to application servers). But the whole process takes quite a long time (22 minutes) [19:55:00] booo [19:55:07] Krinkle: or user configurable [19:55:09] Generating json directly may be possible. It may be worth trying a tmpfs dest dir too [19:55:11] NFS foreva! [19:55:16] Krinkle: i.e. have the instance available for people who care about it. [19:55:45] Yeah, so completely opt-in. [19:56:02] Krinkle: and if it proven useful, we could well set the proxy for npm/gem/pip by default via a puppet snippet applied on all instances. [19:56:18] hashar: I guess there is a way to configure ubuntu to use an http proxy for "everything" (e.g. don't want to have to pass parameters to curl, wget, node, npm etc.) [19:57:30] hashar: as long as there is a way for individual projects to disable / change it on their instance without puppet overriding it again back [19:58:07] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Last successful Puppet run was Tue 22 Apr 2014 04:57:34 PM UTC [19:58:11] Krinkle: with Zeljkof we found out gem can use a local disk cache. And I have setup python pip to use a local disk cache as well [19:58:25] Krinkle: maybe npm supports that as well. So we could just use /data/project [19:58:29] hashar: I did that already. [19:58:39] bd808: 13m 19s \O/ [19:58:43] !! [19:58:52] hashar: but that only works as long as we keep the same instance. [19:59:07] When we use a pool, that will not be an option, because we explicitly don't want any shared state. [19:59:16] yup true [19:59:20] At least not shared state that the execution context can write to [19:59:35] so that disqualify /data/project :-D [19:59:36] http proxy is safe I think, but disk store seems too fragile. [20:00:02] npm-install has a disk cache for all packages by version number in /home/jenkins-slave/.npm/ [20:00:05] hashar: 1m 00s with no l10n changes -- https://integration.wikimedia.org/ci/job/beta-scap-eqiad/1514/console [20:00:25] or whatever ~/.npm/ points to. [20:03:12] (03CR) 10Andrew Bogott: [C: 032] Rearrange andrew vs werdna: [operations/puppet] - 10https://gerrit.wikimedia.org/r/128940 (owner: 10Andrew Bogott) [20:03:24] bd808: looking at https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/4005/consoleFull it simply update InputBox extension which removes a message from en and qqq https://gerrit.wikimedia.org/r/#/c/128939/ [20:03:49] bd808: and that invalidated the whole cache (all files got rebuild https://integration.wikimedia.org/ci/job/beta-scap-eqiad/1513/console ). 
I guess because of the change in the qqq language [20:04:42] bd808: ah forget me, it ADDS two messages. so all languages are probably rewritten to add the english fallback for that key [20:04:57] hashar: That makes more sense [20:05:05] It's really really lame thnough [20:05:40] Krinkle: how is the .npm cache directory set ? [20:06:10] hashar: /etc/npmrc [20:06:17] The default is ~/.npm though [20:06:24] ($ npm config) [20:06:34] Krinkle: have you put it in puppet by any chance ? :] [20:08:57] krinkle ah it default to the $HOME/.npm so for our labs instance that is /mnt/home/jenkins-deploy/.npm and /mnt has plenty of disk [20:10:12] Krinkle: oh and OpenStack as an interesting project named TripleO : OpenStack on OpenStack. That let you create an openstack cloud on top of an existing one :] [20:10:30] so we could reuse the wmflabs infra to build our own openstack cloud on top of it [20:11:00] (03PS1) 10Andrew Bogott: Specify 'user' for removed ssh keys [operations/puppet] - 10https://gerrit.wikimedia.org/r/129026 [20:11:38] stackception [20:12:07] !log rebuilding the search index for a few wikis - might cause the Elasticsearch health check to freak out because it sucks [20:12:09] !log wikibugs replaced by pywikibugs (https://github.com/valhallasw/pywikibugs) and moved to #wikimedia-dev (at last!) [20:12:14] Logged the message, Master [20:12:20] Logged the message, Master [20:13:05] (03CR) 10Andrew Bogott: [C: 032] Specify 'user' for removed ssh keys [operations/puppet] - 10https://gerrit.wikimedia.org/r/129026 (owner: 10Andrew Bogott) [20:13:10] nfs2 [20:13:12] woops [20:15:56] (03PS1) 10Andrew Bogott: One more andrew/werdna renaming bit [operations/puppet] - 10https://gerrit.wikimedia.org/r/129027 [20:18:57] (03CR) 10Andrew Bogott: [C: 032] One more andrew/werdna renaming bit [operations/puppet] - 10https://gerrit.wikimedia.org/r/129027 (owner: 10Andrew Bogott) [20:22:17] PROBLEM - Host ps1-b1-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [20:24:36] (03PS1) 10Ottomata: Now running varnishkafka on text varnishes [operations/puppet] - 10https://gerrit.wikimedia.org/r/129030 [20:28:36] (03CR) 10Ottomata: [C: 032 V: 032] Now running varnishkafka on text varnishes [operations/puppet] - 10https://gerrit.wikimedia.org/r/129030 (owner: 10Ottomata) [20:28:56] !log turning on varnishkafka on text varnishes [20:29:02] Logged the message, Master [20:31:02] (03PS1) 10Odder: Dead blogs are dead. [operations/puppet] - 10https://gerrit.wikimedia.org/r/129031 [20:31:14] (03CR) 10Chad: [C: 032] Only load/enable Lucene on production (not on labs) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126804 (owner: 10Reedy) [20:31:16] (03PS2) 10Odder: Dead blogs are dead [operations/puppet] - 10https://gerrit.wikimedia.org/r/129031 [20:32:00] (03Merged) 10jenkins-bot: Only load/enable Lucene on production (not on labs) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126804 (owner: 10Reedy) [20:35:20] !log demon synchronized wmf-config/CommonSettings.php 'No op in prod, disables lsearchd completely for beta' [20:35:28] Logged the message, Master [20:40:17] hashar: So regarding jsduck [20:40:29] hashar: When you said we can use gem, is that about labs node or not? [20:40:46] Krinkle: sorry almost sleeping follow up by email :) [20:40:55] I want this thing to be upgraded so I can get things done, things are blocked by the upgrade. Can I assume that comement is obsolete, or is there a solution? [20:40:56] for jsduck upgrade, on labs yes [20:41:03] We can't run it on labs. 
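On the shared-cache/proxy tangent above (/etc/npmrc versus the default ~/.npm, plus the gem and pip equivalents): the knobs involved look roughly like the following. The cache paths and proxy host are placeholders, and the pip flag is the pre-6.0 syntax current at the time of this log:

    # npm: relocate the package cache (this is what /etc/npmrc sets globally)
    npm config set cache /data/project/cache/npm --global
    # pip: reuse downloaded packages across runs (pre-pip-6.0 option)
    pip install --download-cache=/data/project/cache/pip -r requirements.txt
    # a plain outbound proxy is honoured by npm, gem, pip, curl and wget alike
    export http_proxy=http://proxy.example:3128
    export https_proxy=$http_proxy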
[20:41:12] so you need it update in production [20:41:13] And test and postmerge must use the same version [20:41:14] obvously [20:41:28] if things are broken on jsduck 4, upgrading the test to labs with jsduck 5 won't help :) [20:41:37] so either you get the package back ported from trusty (unlikely apparently) [20:41:51] or you use an intermediate git repository you can update from time to time [20:42:06] is it talking about dependencies with gem packages, or ruby itself? [20:42:08] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:42:28] no clue :-D [20:42:31] hashar: Because the gem dependencies Faidon packaged aren't going to be exist in newer ubuntu, there is nothing to backport as they don't exist upstream. [20:42:44] potentially the jsduck package in trusty depends on a bunch of other packages that are not available in Precise [20:42:44] we'd have to simply upgrade those ourselves too [20:42:47] paravoid: [20:42:51] so you will have to backport the whole dependency chain [20:43:03] and might end up having to backport a more recent version of ruby itself [20:43:39] so if jsduck5 works with the ruby version of Precise... Just gem install on your desktop and push that as a new git-deploy repo like integration/jsduck [20:44:06] hashar: I tried that in the past with jsduck 4. Didn't work. [20:44:16] or confirm with whoever did the package whether jsduck 5 can be back ported / updated [20:44:21] ah [20:44:22] I mena, I tried doing a gem install with a local directory (instead of global). [20:44:24] mean* [20:44:32] Zeljkof might help. He knows about ruby :] [20:44:50] Maybe because of something incompatible with Mac and Linux (e.g. gem install might do something osx specific for some compilation) [20:44:56] ah yeah [20:45:03] there are native ruby gems [20:45:08] i.e. C code which is compiled [20:45:16] so you would have to do the gem stuff on a Precise instance in labs [20:45:29] so any compile would eventually work on the Precise production boxes [20:46:19] gotta sleep timo sorry :-/ [20:46:51] drop a mail to whoever packaged jsduck4 for you, he might be able to package jsduck5 as well :] [20:46:57] Wikimedians are weird. [20:47:01] else the evil intermediary git repo wilwork [20:47:06] Apologizing for going to bed. [20:47:11] till we have a way to publish docs from labs instance [20:47:16] hashar: :-D [20:47:24] twkozlowski: =) [20:47:47] * hashar escapes [20:49:57] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53365 bytes in 0.489 second response time [20:50:39] (03PS1) 10Andrew Bogott: Don't try to have puppet remove werdna's old keys [operations/puppet] - 10https://gerrit.wikimedia.org/r/129034 [20:57:10] (03CR) 10Andrew Bogott: [C: 032] Don't try to have puppet remove werdna's old keys [operations/puppet] - 10https://gerrit.wikimedia.org/r/129034 (owner: 10Andrew Bogott) [20:58:00] lol [21:05:09] (03PS1) 10Manybubbles: Enable experimental highlighter on testing wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129039 [21:06:04] starting Flow deploy window [21:09:03] ori: can please look at https://gerrit.wikimedia.org/r/#/c/118966/ ? 
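Hashar's fallback for the jsduck upgrade — gem-install jsduck 5 (and native dependencies such as rkelly) on a Precise labs instance so any compiled bits match the production hosts, then publish the result as an intermediate git repo along the lines of integration/jsduck — would go roughly as follows. The version pin, directory names and runtime invocation are all illustrative:

    # on a Precise instance
    mkdir jsduck5 && cd jsduck5
    gem install jsduck --version '~> 5.0' --install-dir ./gems --no-ri --no-rdoc
    GEM_HOME=$PWD/gems GEM_PATH=$PWD/gems ./gems/bin/jsduck --version   # smoke test
    git init && git add . && git commit -m "Vendor jsduck 5.x built on precise"
    # push to a repo the doc host can pull from, e.g. integration/jsduck (per the log)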
[21:09:32] (03CR) 10Spage: [C: 032] "deploy time" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127280 (owner: 10Spage) [21:09:43] (03Merged) 10jenkins-bot: Enable Flow on mw:Talk:Beta_Features/Nearby_Pages [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127280 (owner: 10Spage) [21:12:38] !log spage updated /a/common to {{Gerrit|I851651247}}: Non wikipedias to 1.24wmf1 [21:12:45] Logged the message, Master [21:14:31] !log spage synchronized wmf-config/InitialiseSettings.php 'Enable Flow on Compact Personal Bar talk' [21:14:37] Logged the message, Master [21:31:49] !log spage synchronized php-1.24wmf1/extensions/Flow/modules/discussion/styles/mixins/collapse.less 'Fix Flow collapsed topics on mw.org' [21:31:54] Logged the message, Master [21:34:17] greg-g: apart from waiting for RL to notice the change, we're done [21:36:22] matanya_: hm? [22:02:12] (03PS1) 10Gergő Tisza: Enable survey option in MediaViewer on a few more wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129047 [22:05:28] (03CR) 10Gilles: [C: 04-1] Enable survey option in MediaViewer on a few more wikis (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129047 (owner: 10Gergő Tisza) [22:06:41] (03PS2) 10Gergő Tisza: Enable survey option in MediaViewer on a few more wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129047 [22:10:53] (03CR) 10Gilles: [C: 031] Enable survey option in MediaViewer on a few more wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129047 (owner: 10Gergő Tisza) [22:22:40] !log reedy synchronized php-1.24wmf1/extensions/TimedMediaHandler 'I7483c8b7ec75f5149998da2b530ca04' [22:22:47] Logged the message, Master [22:28:48] !log aaron synchronized php-1.24wmf1/maintenance/populateImageSha1.php '32d9206d1c5b1ba39e7e47cb0a23b57d53772c1b' [22:28:55] Logged the message, Master [22:59:04] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Last successful Puppet run was Tue 22 Apr 2014 04:57:34 PM UTC [22:59:51] hello [22:59:53] i'll do swat today [23:01:02] tgr: is good to go? 
[23:01:42] ori: that one is good, about https://gerrit.wikimedia.org/r/#/c/129049/ though [23:01:58] the jenkins doc test is segfaulting all the time [23:02:36] i'm pretty sure it does not have to do anything with the code, it is a cherry-pick, the two other branches work [23:03:13] so you can just skip the tests if that's ok to do for a prod branch [23:03:54] if you omit that one change, that's fine too, it can wait until Thursday [23:04:24] it's okay to bypass jenkins if jenkins is not working [23:23:02] !log ori synchronized php-1.24wmf1/extensions/MultimediaViewer 'Update MultimediaViewer for I595446dc5: Add more survey languages (fr, de, pt/pr-br)' [23:23:09] Logged the message, Master [23:24:14] !log ori synchronized php-1.23wmf22/extensions/MultimediaViewer 'Update MultimediaViewer for I595446dc5: Add more survey languages (fr, de, pt/pr-br)' [23:24:20] Logged the message, Master [23:24:32] (03CR) 10Ori.livneh: [C: 032] Enable survey option in MediaViewer on a few more wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129047 (owner: 10Gergő Tisza) [23:26:34] (03Merged) 10jenkins-bot: Enable survey option in MediaViewer on a few more wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129047 (owner: 10Gergő Tisza) [23:26:46] !log ori updated /a/common to {{Gerrit|If2c57846f}}: Enable survey option in MediaViewer on a few more wikis [23:26:52] Logged the message, Master [23:27:31] !log ori synchronized wmf-config/InitialiseSettings.php 'If2c57846f: Enable survey option in MediaViewer on a few more wikis' [23:27:37] Logged the message, Master [23:28:45] !log ori synchronized php-1.24wmf1/extensions/EventLogging 'Update EventLogging for Iaa232298e: Set line-height for code icon on schema pages (bug 64251)' [23:28:51] Logged the message, Master [23:30:01] !log ori Started scap: I595446dc5, If2c57846f, Iaa232298e [23:30:07] Logged the message, Master [23:30:46] !log ori Finished scap: I595446dc5, If2c57846f, Iaa232298e (duration: 00m 45s) [23:30:52] Logged the message, Master [23:54:02] (03PS1) 10Ori.livneh: miscellaneous improvements for diamond module [operations/puppet] - 10https://gerrit.wikimedia.org/r/129075
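For reference, the SWAT entries that close out this log ('synchronized php-1.24wmf1/extensions/...', 'Started scap') map onto the usual deploy-host commands. A hedged outline — the helper names sync-dir and scap are inferred from the !log output above rather than from documentation, and the messages are just the ones used in the log:

    # on tin, once the cherry-pick is present in the 1.24wmf1 checkout
    cd /a/common
    sync-dir php-1.24wmf1/extensions/MultimediaViewer 'Update MultimediaViewer for I595446dc5'
    # full scap (l10n rebuild plus rsync to the app servers) when several changes land at once
    scap 'I595446dc5, If2c57846f, Iaa232298e'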