[00:02:42] Krinkle can you pull that list actually? [00:02:56] git pull against gerrit is timing out for me [00:03:00] anyone else having issues? [00:03:39] ToAruShiroiNeko: https://github.com/wikimedia/operations-mediawiki-config/ [00:03:57] https://github.com/wikimedia/operations-mediawiki-config/blob/master/all.dblist [00:03:58] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 00:03:54 UTC 2013 [00:04:00] or which list did you mean? [00:04:28] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [00:10:18] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:15:10] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [00:20:10] RECOVERY - check_mysql on payments1001 is OK: Uptime: 5545520 Threads: 4 Questions: 4453825 Slow queries: 21421 Opens: 627 Flush tables: 1 Open tables: 61 Queries per second avg: 0.803 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [00:24:14] Krinkle brilliant [00:24:15] thanks [00:34:40] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 00:34:35 UTC 2013 [00:35:20] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [01:04:50] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 01:04:48 UTC 2013 [01:05:20] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [01:30:03] Krinkle here: https://www.mediawiki.org/wiki/Requests_for_comment/URL_shortener/WikiMap [01:30:06] just an idea [01:30:40] it lacks some wikis such as non-wmf operated chapter wikis, bugzilla etc [01:31:51] perhaps community input isnt needed as its rather straightforward [01:33:25] is that unacceptable for you?
:/ [01:34:35] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 01:34:33 UTC 2013 [01:35:13] ToAruShiroiNeko: Where is the wiki-id column (the actual wiki database name or domain name like from all.dblist) [01:35:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [01:36:03] ToAruShiroiNeko: www | commonswiki seems odd [01:36:58] don't split 2015 from wikimania etc. [01:38:01] looks like a plausible implementation [01:38:03] its kind of for human use [01:38:14] I put www as its commons.wikimedia [01:38:24] www isn't even valid, even for humans. [01:38:36] I am open to suggestions :) [01:38:56] 2015 wikimania wiki doesnt exist yet I think [01:38:57] I'd use base36 | family (e.g. label of first character of base36) | wikiid or dbname or domainname [01:39:44] those 3 columns. Don't tear apart the supposed prefix, they aren't prefixes and would make it hard to analyse or map back to actual domain names (which is the main purpose of the map for the shortener) [01:40:54] project name is commonswiki and wikimania2015, not www or 2015. Only the family is virtual, the project could be real. That saves having to back fill missing values that we don't have (and can't have). [01:41:25] the list can expand with time [01:41:59] M0C would be wikimania2015 wiki [01:42:18] it doesnt exist yet [01:42:39] I think there's a couple ways to go about it, this is one of them. Though this is maybe not the easiest one to maintain with semi-automated mapping to how families are configured on the cluster. Either way, it'd be a trivial part of the URL shortener implementation. [01:42:58] yeah [01:43:20] vast majority would be simple to maintain like the main big projects [01:43:24] Not something I hope there will be much discussion about (e.g. bikeshedding). This is a good start.
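The column scheme being discussed (base36 prefix plus wiki id) is deterministic, which is the property Krinkle keeps coming back to. A minimal Python sketch of that idea, with all specifics invented for illustration: the "M0C" prefix echoes the example above, but the page id and the short.example.org host are placeholder assumptions, not anything from the RFC.

```python
# Hypothetical sketch of the deterministic base36 mapping discussed above.
# The prefix and domain are invented placeholders; the real per-wiki
# prefixes would come from the RFC's WikiMap page.

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n):
    """Encode a non-negative integer in base 36."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(BASE36[r])
    return "".join(reversed(digits))

def short_url(wiki_prefix, page_id, host="short.example.org"):
    # Deterministic with static data: no http or db request is needed
    # to register an id, unlike a full url shortener.
    return "http://%s/%s%s" % (host, wiki_prefix, to_base36(page_id))
```

Under these assumptions, `short_url("M0C", 36)` yields `http://short.example.org/M0C10`: given the wiki prefix and a page id, the url can be created cheaply with no lookup.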
[01:44:54] the real challenges at this point are 1) the domain name, 2) the open bugs for the ShortUrl extension (if we end up using it); 3) whether or not to drop this format altogether and use full url shortening instead [01:45:12] the latter would fit equally (if not better) in the QR encoding format. [01:45:59] the advantage of having (limited) readability of the short url is something that hasn't been considered before so it's going to be hard to justify I think. [01:47:09] we'd sell it on being deterministic with static data (as opposed to being impossible to calculate, as the full url shortener would be, that would require an http or db request to register it and get the new id) [01:47:24] e.g. given the wiki id and shorturl id, we can create the url cheaply [01:51:20] * Krinkle runs \|\|(.+)\|\|(wikibooks|wikimedia|wiktionary|wikinews|wikiquote|wikisource|wikiversity|wikivoyage|wikipedia)$ on it [01:51:46] Krinkle I added a database column [01:52:09] Krinkle I think it addresses a flaw they mentioned [01:52:19] "Obfuscation and mis-use" [01:53:28] if I see a short/M url I know I am going to a Wikimania wiki [01:53:39] and not to a penis picture [01:54:28] the project column is now obsolete, only contains confusing values [01:54:51] and the family column contains various db names that aren't families [01:55:06] I suppose [01:55:12] editing [01:55:17] please do :) [01:55:35] it was helpful to separate them for sorting to generate the base 36 [01:55:45] but thats been done already :p [01:56:04] what does 'J' stand for? [01:56:10] arbcom wikis [01:56:15] Judicial?
:p [01:56:27] could be stuffed with X wikis I suppose [01:56:49] I dont think we will see loads of tiny URLs to those restricted wikis [01:57:06] board could be bundled with chapters [01:59:39] WMFfoundationwiki should be foundationwiki [01:59:45] letting you edit to avoid EC [02:00:07] there is no WMFfoundationwiki [02:00:18] yes [02:00:26] I can fix it too [02:01:08] * Krinkle is done editing [02:01:28] board wikis are not chapter wikis, per wmf-config saying so. [02:01:33] foundationwiki is not a chapter wiki [02:02:14] no but it manages chapters, kinda [02:02:15] no? [02:03:04] practical purpose is relevant, but not as much as the configuration categories are relevant. [02:03:05] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wikimedia.dblist [02:03:08] those are chapter wikis only [02:03:21] very well it can be moved :) [02:03:26] should I do it? [02:03:41] go ahead :) [02:03:55] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 02:03:49 UTC 2013 [02:04:19] these are all the X wikis: https://github.com/wikimedia/operations-mediawiki-config/blob/master/special.dblist. Could make commons, wikimania* and wikidata and wikispecies separate, but I'm not sure that is clear. [02:04:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [02:04:56] perhaps chapters can be given a country ID [02:05:00] two letter ISO codes [02:05:40] would that make sense? [02:06:18] I'm going to stop working/discussing this further. I'm not disagreeing, just don't think it is useful to discuss further on this detail given the state of the RFC. This could be all wasted if we decide not to use this specific part of the RFC, and specifying it now doesn't make it more likely to be used either. [02:06:46] sure [02:08:35] I am probably more enthusiastic than most :p [02:09:49] !log testing MHA on pmtpa db37, db56, db58, db68.
disabled their notifications' [02:10:05] Logged the message, Master [02:13:22] !log LocalisationUpdate completed (1.22wmf18) at Thu Sep 26 02:13:22 UTC 2013 [02:13:36] Logged the message, Master [02:14:23] Krinkle so for implementation domain name will be key [02:14:28] was there any progress on that? [02:15:38] not that I know of [02:15:42] springle: MHA? [02:16:24] jeremyb: http://code.google.com/p/mysql-master-ha/ [02:16:36] interesting... [02:17:02] Krinkle I hope there is some progress on that eventually [02:18:10] @replag 10.64.16.13 [02:18:11] Krinkle: [db1024: s7] 10.64.16.13: 0s [02:20:27] @replag db50 [02:20:28] Krinkle: [db50: s7] db50: s [02:20:32] @replag s7 [02:20:32] Krinkle: [s7] db1041: 0s, db1007: 0s, db1024: 0s, db1028: 0s [02:21:02] right, db50 isn't in rotation for the main api endpoint that the bot uses [02:21:40] ugh, long running queries, i hates (my own instances) [02:24:19] springle: i wonder if i could pick your brain a little [02:24:38] my mysql can get pretty rusty sometimes [02:25:15] !log LocalisationUpdate completed (1.22wmf17) at Thu Sep 26 02:25:14 UTC 2013 [02:25:26] Logged the message, Master [02:25:44] jeremyb: go ahead [02:25:52] i'll try to answer :) [02:25:52] seconds behind master is increasing, thread is waiting for next event, both threads (sql+io) are running. why might nothing be happening? [02:26:43] (i.e. waiting for master to send) [02:26:47] waiting for next event is io thread. what's sql thread doing? [02:27:43] grrr, so confusing. show processlist has 2 threads. i *think* innotop is combining them into one? [02:27:54] is this a box i can access? [02:28:38] nope, sorry :) [02:29:02] i think i need to implement some time limits. 
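The io-thread/sql-thread distinction being probed here comes straight from the columns of SHOW SLAVE STATUS, and the payments1001 alerts earlier show the same pattern (Slave IO: Yes, Slave SQL: No, Seconds Behind Master: (null)). A rough sketch of that triage, assuming MySQL 5.5 column names; this is an illustration of the decision tree, not a complete diagnostic.

```python
# Rough triage of SHOW SLAVE STATUS output, assuming MySQL 5.5 column
# names. `status` is a dict of the relevant columns. Both threads can
# report "running" while the sql thread is stuck behind a long
# transaction, which is the case being debugged above.

def triage_slave(status):
    if status["Slave_IO_Running"] != "Yes":
        return "io thread stopped - check connectivity/credentials"
    if status["Slave_SQL_Running"] != "Yes":
        return "sql thread stopped - check Last_SQL_Error"
    lag = status["Seconds_Behind_Master"]
    if lag is None:
        # MySQL reports NULL (shown as "(null)" by the icinga check)
        # when it cannot compute the lag.
        return "lag unknown - sql thread likely stopped or recalculating"
    if lag > 0:
        # Both threads up but lag growing: usually a long-running or
        # blocked transaction on the slave (e.g. an hour-long DELETE).
        return "replicating but lagging %ds - inspect sql thread state" % lag
    return "healthy"
```

A monitoring wrapper would feed the parsed SHOW SLAVE STATUS row into this and alert on anything other than "healthy".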
unattended hour long queries (from a webapp) are no good [02:29:12] in this case DELETE [02:29:29] thanks anyway [02:29:47] in any case, if seconds behind master is increasing, probably sql thread is lagging due to long-running transaction, or blocked by something else locally, or perhaps it's a big write txn blocked on io [02:30:13] big delete? [02:30:26] hour-long will suck :) [02:33:23] jeremyb: http://dev.mysql.com/doc/refman/5.5/en/slave-sql-thread-states.html might help [02:33:55] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 02:33:51 UTC 2013 [02:34:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [02:36:40] springle: thanks for jogging my memory :) [02:36:55] yw [02:45:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Sep 26 02:45:41 UTC 2013 [02:45:51] Logged the message, Master [03:06:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [03:29:12] @info db50 [03:29:12] Krinkle: [db50: ?] 
10.0.6.60 [03:33:12] @replag s2 [03:33:13] Krinkle: [s2: zhwiki] db1034: 0s, db1036: 0s, db1002: 0s, db1009: 0s, db1018: 0s [03:34:51] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 03:34:48 UTC 2013 [03:35:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [04:06:51] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 04:06:41 UTC 2013 [04:07:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [04:35:41] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 04:35:36 UTC 2013 [04:36:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [04:59:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [05:00:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 16.133 second response time [05:04:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [05:05:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 29.774 second response time [05:07:11] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 05:07:02 UTC 2013 [05:07:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [05:10:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [05:15:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.770 second response time [05:35:18] (03CR) 10MZMcBride: "Greg: would it be okay to amend this changeset to deploy the MassMessage extension to the three phase 0 wikis (test2wiki, testwiki, and me" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85944 (owner: 10Legoktm) [05:36:41] 
RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 05:36:37 UTC 2013 [05:37:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [05:37:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [05:38:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 21.161 second response time [05:38:16] (03CR) 10MZMcBride: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85944 (owner: 10Legoktm) [05:56:39] (03CR) 10Legoktm: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85944 (owner: 10Legoktm) [05:59:14] (03PS3) 10Legoktm: Enable MassMessage extension on test2.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85944 [06:06:21] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 06:06:19 UTC 2013 [06:06:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [06:14:23] (03CR) 10MZMcBride: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85944 (owner: 10Legoktm) [06:14:44] oops [06:14:47] i a word there. [06:15:04] I almost that joke in the comment. [06:15:11] Heh. [06:15:27] Typing is the worst. 
[06:16:02] (03PS4) 10Legoktm: Enable MassMessage extension on test2.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85944 [06:37:11] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 06:37:01 UTC 2013 [06:37:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:00:11] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 10 hours [07:02:32] (03PS1) 10Faidon Liambotis: varnish: remove X-CS/Analytics tag from vcl_fetch [operations/puppet] - 10https://gerrit.wikimedia.org/r/86079 [07:14:41] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 07:14:33 UTC 2013 [07:15:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:34:51] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 07:34:42 UTC 2013 [07:35:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:58:08] paravoid, yei!!! :) [07:58:48] paravoid, is it going to instantly start producing correct results? Or we need to wait until cache flushes? [08:02:33] sry, never mind, you already answered that in the patch notes. Great to hear that this issue is solved! [08:03:51] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 08:03:46 UTC 2013 [08:04:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [08:16:46] yurik: do you set X-Analytics anywhere in mediawiki? [08:16:56] paravoid, nope [08:17:09] its done only in the varnish [08:17:17] ok, good [08:18:37] paravoid, while you are looking at varnish, could you help with enabling ESI? 
I would really like to start live tests with it [08:18:51] i sent an email yesterday outlining test process [08:19:23] i will be traveling early next week, so won't be as easy to deal with potential issues [08:20:12] has it been tested in labs? [08:21:32] yes, on one instance [08:21:58] at this point ESI won't be enabled unless there is an X-FORCE-ESI header [08:22:06] I haven't read your mail yet [08:22:37] so the only requirement from varnish is to enable tag search. The tag won't be there unless the header is set [08:22:56] any varnish magicians about? (it might not be a varnish issue) https://bugzilla.wikimedia.org/show_bug.cgi?id=54417#c4 [08:23:06] btw, how are the homepage changes coming along? [08:23:11] you've promised those a while ago :) [08:23:28] now that netmapper is finally going in prod, I'd like us to get rid of DfltLang and all those redirects altogether [08:23:45] we really need to make an effort simplifying the VCLs gradually [08:24:15] paravoid, most of the rewrite has already happened and is in production - all redirects are done by the special page now [08:24:20] as well as home page generation [08:25:05] once ESI is done, I can fairly easily enable dflt lang stuff (but you will have to remind me what exact behaviour is needed - my memory is shaky :( [08:25:15] PROBLEM - Puppet freshness on mw1072 is CRITICAL: No successful Puppet run in the last 10 hours [08:25:24] what does ESI have to do with anything? [08:25:45] because that's what Dan wants to be done first :) [08:26:00] yes, but you've promised the homepage fix for many months first :) [08:26:16] (netmapper too) [08:26:23] heh, i knew there was something i forgot - pls remind me exactly what needs to happen [08:26:27] netmapper is not on my plate :( [08:26:37] so, with netmapper, tag_carrier will get simplified a lot, right? [08:26:47] it took me a few hours to get all the needed ips info for Brandon [08:26:50] all those set X-Cs will go [08:26:59] set req.http.X-CS etc.
[08:27:18] correct - it will be X-CS = netmapper (ip) in a loop - to deal with proxies (opera / ssl) [08:27:28] popping off the XFF values [08:27:37] but you still have X-DfltLang [08:27:41] for certain carriers [08:28:00] and then issuing a redirect [08:28:00] if (req.http.X-CS) { [08:28:00] // Carrier detected [08:28:01] error 666 "http://" + req.http.X-DfltLang + ".m.wikipedia.org" + req.http.X-DfltPage; [08:28:08] } else if (req.http.host == "zero.wikipedia.org") { [08:28:08] // All ZERO requests should go to the Special:ZeroRatedMobileAccess, even for unknown carrier [08:28:11] error 666 "http://" + req.http.X-DfltLang + ".zero.wikipedia.org" + req.http.X-DfltPage; [08:28:14] etc. [08:28:36] we've talked a bunch of times (there's even a mail from me about it if you search your inbox) about a proper solution for this [08:28:55] which would be for mobile to have its own homepage that m.wikipedia.org would go to, similar to www.wikipedia.org [08:29:11] a mobile version of www if you'd like [08:29:15] that could be varied on X-CS [08:29:29] and redirect or give different content to different carriers [08:29:38] searching, i think i remember now - should be easy to fix (assuming we never want to allow en.m.wikipedia.org/wiki/Special:ZeroSpecial if the default language is fr) [08:30:09] that page simply shows a list of available languages [08:30:17] the point is, we shouldn't have a default language in varnish at all [08:30:26] or care about language subdomains in general [08:30:54] paravoid, the easiest solution is to send everyone to the English special page, which will redirect to the proper language depending on the carrier [08:30:54] or redirect [08:30:57] no [08:31:14] these redirects are for when users go to "m.wikipedia.org" or "zero.wikipedia.org".
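The "X-CS = netmapper (ip) in a loop" idea above, walking the X-Forwarded-For chain and popping off trusted proxies (the opera/ssl case), can be rendered roughly in Python. Everything concrete here is a placeholder assumption: the trusted proxy range, the carrier netblock, and the 250-99 code are invented for illustration, and the real ip-to-carrier data belongs to netmapper, not this sketch.

```python
# Sketch of carrier tagging from the XFF chain. All networks and the
# MCC-MNC code are invented placeholders, not production data.
import ipaddress

TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/8")]          # placeholder
CARRIERS = {ipaddress.ip_network("198.51.100.0/24"): "250-99"}  # placeholder

def lookup_carrier(hop):
    """Stand-in for a netmapper lookup: ip -> carrier code or None."""
    addr = ipaddress.ip_address(hop)
    for net, cs in CARRIERS.items():
        if addr in net:
            return cs
    return None

def x_cs_from_xff(client_ip, xff):
    # Hops in client-first order; the connecting ip is the final hop.
    hops = [h.strip() for h in xff.split(",") if h.strip()] + [client_ip]
    while hops:
        hop = hops.pop()  # pop the rightmost value first, as the VCL loop would
        cs = lookup_carrier(hop)
        if cs:
            return cs
        if not any(ipaddress.ip_address(hop) in net for net in TRUSTED_PROXIES):
            break  # untrusted hop: stop walking further left
    return None
```

Under these assumptions, a request arriving via a terminator at 10.0.0.5 with X-Forwarded-For "198.51.100.7, 10.0.0.9" gets tagged 250-99, since both proxy hops are popped off before the carrier address is reached.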
[08:31:29] this needs to be a separate, cross-language page IMHO [08:31:32] like www.wikipedia.org is [08:31:42] ignore zero for a moment [08:32:03] right now if you go to m.wikipedia.org from your non-zero browser, you'll get redirected to en.m [08:32:08] that's wrong by itself [08:32:47] doesn't that page (desktop ver) support mobile? [08:32:50] we just need a language-agnostic homepage/landing page; that could also handle zero/welcome $carrier customer [08:32:54] i remember there was some CSS discussion [08:33:00] which one? [08:33:06] main www [08:33:12] no idea [08:33:13] non language specific [08:33:51] hm, seems to [08:33:55] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 08:33:50 UTC 2013 [08:33:55] paravoid, just spoofed user-agent -- no redirect to m [08:34:13] www.wikipedia seems to be mobile-enabled [08:34:20] yep [08:34:21] so shouldn't m.wikipedia.org just use that then? :) [08:34:28] instead of redirecting to en.m [08:34:29] try hitting Ctrl+Shift+M in firefox [08:34:35] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [08:34:38] and set smaller res [08:34:50] seems to be properly scaling [08:35:06] yes, but this is about users going to "m.wikipedia.org" or "zero.wikipedia.org" [08:35:23] I propose(d) that we should just present them the same page as www [08:35:30] max has been talking about killing m & zero outright :) [08:35:38] and you should make that page zero-aware [08:35:45] add the banners there or whatever [08:36:01] and kill special:zerospecial and the dfltlang redirects [08:36:30] i am not sure i can just copy that page -- carriers whitelist specific languages, and moreover, they decide which languages they want to show for their locale [08:36:46] if you look at the meta pages, they list "prefered" languages in their order [08:36:58] fix that homepage to have that [08:37:09] to have the preferred languages on top, for example [08:37:17] and Vary: X-CS [08:37:40] 
it's crazy that we do preferred languages in varnish [08:37:42] so the special:zero shows that list, followed by a dropdown with all langs. The www page AFAIK is static, defined by users [08:37:57] i totally agree that it shouldn't be in varnish [08:38:30] we need a landing page and I don't see why zero's landing page belongs under Special: tbh [08:38:55] and the www page being static, this can change you know :) [08:39:31] that's why i said - the easiest solution is to simply redirect to the correct language's special, and then decide how to change the whole system of main page based on carrier [08:39:41] there is no correct language, is there [08:39:53] there is -- each carrier defines their preferred language [08:40:14] language*s*, for starters [08:40:23] right [08:40:30] but the main one is first in the list [08:40:45] plus, the fact that the carrier provides zero access to some languages, doesn't mean that users may not be interested in other languages and pay for them [08:40:59] yes - that's why we have a dropdown [08:41:06] or that the users speak the language you chose in the first place [08:41:06] with all langs [08:41:14] and no, i don't think its a good solution :) [08:41:29] if I go travel in russia and access m.wp.org I'll get a page in russian, that's not what we do in desktops [08:41:30] (the dropdown) [08:41:48] I'll get a page in russian if I'm accessing it via beeline, but the english wikipedia if I access it from wifi [08:41:51] craziness :) [08:42:18] paravoid, you are asking for 2 things: 1) remove default langs from varnish - i can easily do that, 2) change the behavior of zero sites - you will have to get Dan to approve that [08:42:27] indeed [08:42:34] so let me do #1 [08:42:47] the one I care most about is cleaning up varnish [08:42:48] and we will deal with #2 after discussion with the biz ppl :) [08:43:06] but I do think it's crazy how we do landing pages and I care in general sense :) [08:43:08] it is their "production" - something that
they "sell" to carriers [08:43:10] in the* [08:43:21] and I agree with you :) [08:43:47] so anyway, will fix redirect, thx for reminding! [08:43:57] pls enable request's esi in the mean time [08:43:59] thank you :) [08:44:01] would be great if we can demo it :) [08:44:13] and start measuring impact [08:44:21] (will probaly need your help tracking graphs) [08:44:29] pls take a look at my email - its very short [08:44:34] let's wait for mark though :) [08:44:39] heh :( [08:44:43] i meant :) [08:47:01] (03PS1) 10Yuvipanda: Add labsvagrant helper script to help manage labsvagrant roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/86083 [08:47:27] mmmm! [08:47:37] uhoh [08:47:44] ori-l: you're still uP! [08:48:29] ori-l: I'm now putting this in ops/puppet.git, since it has 'labs' in its name - but it is sortof 'inserted' into the vagrant repo anyway. I'll be happy to move it to the vagrant repo if you're okay with it [08:48:33] run now ori-l, while you still have a chance [08:49:17] ori-l, :( https://dpaste.de/CdMwp/ [08:49:27] i mean :) [08:49:33] and i saw the same issue on mac :) [08:49:56] wait, what? [08:49:59] * yurik throws inflatable duck to ori to prevent him from drowning [08:50:26] that's your roundabout way of saying you still need a varnish role, yes? [08:50:40] i didn't actually accidentally commit some half-written varnish module, did i? [08:50:44] yurik: pull to latest? [08:50:58] yep [08:51:04] I had a similar problem [08:51:05] iirc [08:51:12] just did a git pull [08:51:37] there's no varnish role [08:51:38] i think it still runs though [08:51:58] yurik, it's 2 AM, don't fuck with my head! :P [08:52:11] * YuviPanda orders ori-l a tuna sandwich [08:52:12] hehe, its 5 am here :-P [08:52:24] I actually the subway peppers. [08:52:36] and paravoid just did that with mine, so i pass it on :-P [08:52:38] ironic, considering subway is nicer in the US than here. [08:52:53] ori-l: hmm, mind if I put labsvagrant.rb in mediawiki/vagrant.git [08:52:54] ? 
subway the sandwich place? bleh [08:53:30] the trailing whitespace wasn't confidence-inspiring [08:53:40] there's trailing whitespace? [08:53:50] line 63 [08:53:56] ori-l: gah, I no longer check for it since i've set git to strip it out [08:53:59] apparently it failed this time [08:54:11] yes, you can commit it [08:54:21] as long as it doesn't impact the desktop stuff [08:54:30] it doesn't [08:54:33] and sweet :D [08:54:41] I'll test it for a bit more and then commit [08:54:58] ori-l: it's just a standalone script, doesn't really do anything else. Most of it is sadly copy-pasted off mediawiki-vagrant.rb [08:55:22] i'm an awful rubyist [08:55:30] you should not imitate me [08:55:47] but it's your project, so gl;hf [08:56:03] this is the first time I'm writing ruby [08:56:04] so [08:56:11] (not counting attempts to do so about 4-5 years ago) [08:57:24] ah, hmm. [08:57:25] key_verify failed for server_host_key [08:57:31] (03PS2) 10Yuvipanda: Add labsvagrant helper script to help manage labsvagrant roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/86083 [08:57:37] this is gerrit, yes? [08:57:39] yeah [08:57:44] ok, paravoid [08:57:47] that's #3 [08:57:47] there's a bug filed, IIRC. [08:57:57] do you remember where? [08:58:02] i don't know wtf is up with that server [08:58:13] ok, i might have just sleepily read IRC convos [08:58:17] checking anyway [08:58:26] what is? [08:58:40] ori-l: https://bugzilla.wikimedia.org/show_bug.cgi?id=53895 [08:58:58] been around for a while looks like [08:59:07] ori-l: ? [08:59:25] people keep getting key_verify failed for server_host_key for gerrit [08:59:34] that's the worst bug title ever, it has nothing to do with translatewiki [08:59:44] heh [08:59:45] it usually clears itself up on a second try [09:00:16] e.g.: [09:00:19] 10:10 Krinkle: Hm..
Pushing to gerrit gives me: [09:00:19] 10:10 Krinkle: hash mismatch [09:00:19] 10:10 Krinkle: key_verify failed for server_host_key [09:00:21] 10:10 Krinkle: fatal: Could not read from remote repository. [09:00:23] 10:10 Krinkle: Trying again worked though [09:00:40] that's about 16 hours ago on #wikimedia-dev [09:01:11] and on 9/21: [09:01:14] Krenair: [16:57:17] has gerrit's key changed? I'm getting key_verify failed for server_host_key [09:01:14] Krenair: [16:58:08] .. and it's now working. weird. [09:01:37] (03PS3) 10Yuvipanda: Add labsvagrant helper script to help manage labsvagrant roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/86083 [09:05:37] paravoid: any idea? [09:06:37] Coren: commit 367294d573db933b934c8b9433dd2acb3b0c3c8d [09:06:56] Coren: on ops/dns, "Add dickson to DNS (new freenode server)", very misleading commit message [09:07:19] Coren: you removed manganese in the same commit, didn't mention anything in the message, I got very confused [09:07:52] ori-l: it's hard to debug retroactively [09:08:11] ori-l: we should go log digging the moment it happens [09:08:46] let me see if i can trigger it [09:08:57] i'll pull in a loop with a 5 second sleep [09:11:35] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [09:12:46] paravoid: unrelated, but follow on from the 'Mediawiki Labs' conversation between you, me and Ryan_Lane from the office - https://wikitech.wikimedia.org/wiki/Projects/mediawiki_Labs_project [09:18:13] (03PS4) 10Yuvipanda: Add labsvagrant helper script to help manage labsvagrant roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/86083 [09:24:01] paravoid: haven't been able to trigger it so far :/ it must be an IP conflict somewhere, no? 
[09:33:55] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 09:33:47 UTC 2013 [09:34:35] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [09:49:23] hashar: any probs with the new PHP version ? Can we upgrade production ? [09:50:50] akosiaris: ohh I forgot about that project [09:50:51] sorry [09:51:04] let me look on beta [09:51:27] beta has 5.3.10-1ubuntu3.8+wmf1 [09:52:04] last time I have checked, there we [09:52:12] re no suspicious trace on beta. Will dig again [09:53:09] akosiaris: nothing suspicious on beta. Sounds good to me [09:53:30] ok... i will schedule a cluster wide upgrade then ... [09:53:43] finally ... my chance to bring down the site :-) [09:55:55] hehe :P [10:00:12] \O/ [10:12:46] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [10:29:47] mark: paravoid: looks like the purge requests are not sent to Amsterdam :/ [10:30:05] looking [10:30:05] https://upload.wikimedia.org/wikipedia/commons/c/cc/WiktionaryPt.svg serves me different content on tin and on my local laptop :( [10:30:12] that doesn't say much [10:32:48] the X-Timestamp: is older on the ams cache [10:34:06] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 10:33:59 UTC 2013 [10:34:46] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [10:45:02] (03PS1) 10TTO: Change SUL image for loginwiki to WMF logo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86091 [10:51:04] multicast broken since the 22nd [10:51:15] htcp @ esams that is [10:51:20] let's find out why... [10:54:50] perhaps they don't end up at pmtpa [10:55:07] multicast routing gets fucked sometimes [10:56:35] is udpmcast unpuppetized?
took me a while to find it [10:57:18] it is [10:57:31] it won't be needed soon with mpls, so didn't bother [10:57:39] and your intuition is correct as usual [10:57:50] they don't end up at pmtpa [10:58:01] is leslie's ospf change not reverted yet? [10:58:09] no idea [10:58:34] is the htcp feed monitored somehow? [10:58:43] there are graphs, there are no alerts afaik [10:58:52] interface xe-5/2/1.0 { [10:58:52] interface-type p2p; [10:58:52] metric 30; [10:58:54] that's probably it [10:58:57] * mark changes [10:59:06] bblack was working on them I think [10:59:18] paravoid: there is a nagios plugin that lets you collect ganglia metrics and alert when some threshold is reached. Might be helpful [10:59:24] what mcast routing protocol are you using? [10:59:45] PIM sparse [10:59:54] it fails sometimes because of RPF mismatch [11:00:11] OSPF as IGP? [11:00:15] yes [11:00:29] awesome [11:00:34] all Cisco? [11:00:39] all juniper [11:01:07] i think this will also get fixed once we do separate virtual border routers... [11:01:32] paravoid: does it work again? [11:01:58] it worked for a while then broke again [11:02:02] ha [11:02:14] hm [11:02:34] it works in bursts, I guess that's udpmcast [11:02:46] I'll check in pmtpa to be sure [11:02:47] do the packets arrive as multicast though? [11:02:54] udpmcast might be getting a bit overloaded [11:03:05] we have > 1000 purges a second sometimes now [11:03:24] (03PS5) 10Yuvipanda: Add labsvagrant helper script to help manage labsvagrant roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/86083 [11:03:30] yes, it works [11:03:50] I see them at both pmtpa & esams [11:03:50] we could use socat instead of udpmcast [11:03:52] a bit bursty [11:03:53] but then again [11:03:56] with mpls... [11:04:14] where is that?
:) [11:04:22] waiting on vendor to fix their shit [11:05:11] 2nd vendor finally replied [11:05:39] hm [11:05:51] the purge for the wiktionary logo isn't working [11:05:55] need to ensure it works for ulsfo also [11:06:10] oh heh [11:06:17] I was counting multicast but not htcp [11:06:29] <- feels silly [11:06:35] (03PS1) 10TTO: Wnable $wgUseRCPatrol on fawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86093 [11:06:57] no, htcp still doesn't work in either pmtpa or esams [11:07:29] any more specific? :) [11:07:33] multicast doesn't work [11:07:51] I was counting them, but they were just ganglia [11:08:01] perhaps I should do another network migration today ;) [11:08:18] so it was just local DC traffic [11:09:06] heading out for lunch [11:09:27] so the problem is: cr2-eqiad receives a packet from some apache, wants to send it to pmtpa, but the primary OSPF link is on cr1 [11:09:27] (03Abandoned) 10TTO: Temporary celebration logo for tawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85960 (owner: 10TTO) [11:09:31] so sends it on to cr1 [11:09:47] (03PS6) 10Yuvipanda: Add labsvagrant helper script to help manage labsvagrant roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/86083 [11:09:49] cr1 does RPF check for multicast packet, finds a different destination route for the source apache (directly connected), and drops it [11:09:58] it's pretty annoying and hard to fix [11:10:23] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [11:10:37] meh... [11:11:24] (03CR) 10TTO: [C: 04-1] "The commit message contradicts itself: "for logged-in users only" or "all users"?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 (owner: 10Jforrester) [11:12:06] Could someone fix the multicast relay to esams thingy? 
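The RPF failure described above can be sketched as a toy model: a multicast router accepts a packet only if it arrived on the interface the router itself would use to reach the packet's source. The interface names and prefixes below are hypothetical, purely for illustration of why cr1 drops purges that arrive via cr2.

```python
# Toy reverse-path-forwarding (RPF) check, as multicast routers perform it:
# a packet is accepted only when it arrives on the interface that the
# unicast routing table says is the way *back* to the packet's source.
import ipaddress

def rpf_check(routes, src_ip, in_iface):
    """routes: list of (prefix, iface); longest-prefix match wins."""
    src = ipaddress.ip_address(src_ip)
    best = None
    for prefix, iface in routes:
        net = ipaddress.ip_network(prefix)
        if src in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    # Accept only if the packet came in on the interface we'd route back out of
    return best is not None and best[1] == in_iface

# Hypothetical table on cr1: the source apache subnet is directly connected
# on "ge-eth0", but the purge packet arrives via cr2 on "xe-peer".
routes = [("10.64.0.0/22", "ge-eth0"), ("0.0.0.0/0", "xe-transit")]
print(rpf_check(routes, "10.64.0.5", "ge-eth0"))  # direct arrival: accepted
print(rpf_check(routes, "10.64.0.5", "xe-peer"))  # via cr2: RPF fails, dropped
```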
(Ping ottomata since you are in the topic) https://bugzilla.wikimedia.org/show_bug.cgi?id=54629 [11:12:13] we are working on it [11:12:18] see backlog [11:12:32] Ok, my apologies [11:13:37] Just wanted to make sure it didn't get lost. (I'm on my phone so it was somewhat hard to check the things I usually do) [11:17:43] how does it look now? [11:22:03] broken I see [11:23:30] hm [11:23:37] there are no PIM JOINs from pmtpa at all [11:24:26] now it's working [11:24:31] "clear pim join inet" [11:24:36] sigh :P [11:27:14] :/ [11:27:19] yeah I see it now [11:27:29] (03CR) 10QChris: [C: 031] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86079 (owner: 10Faidon Liambotis) [11:27:47] funnily enough, udpmcast is the most reliable part of this setup ;) [11:28:06] heheh [11:28:10] yeah I guess it is [11:28:47] perhaps I should hack a passive nagios check into it [11:28:59] if udpmcast doesn't receive > N packets, alert [11:29:04] every minute [11:29:27] quick & dirty, but it would be much better than what we have now [11:31:27] I think brandon wanted to do something with vhtcpd [11:31:42] i've been wanting to do something for a long time [11:31:52] I think i'm just gonna do it now [11:31:58] lol :) [11:34:09] but really, this should just be in vhtcpd [11:34:23] given that udpmcast will be redundant in a few weeks, and it would not catch all failures [11:34:30] nod [11:34:38] or a more generic multicast connectivity check [11:34:43] different thing [11:34:47] that too [11:35:05] i'll ping brandon later today and ask what his status is [11:35:13] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 11:35:03 UTC 2013 [11:35:23] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [11:36:03] we used http://www.venaas.no/multicast/ssmping/ at grnet [11:36:43] and dbeacon (which I maintain in Debian and almost became upstream for :) but that would be overkill here [11:36:50] mcfirst
could just do it for us [11:37:19] run for 5s and wait for packets on the htcp group 4827 [11:40:45] yeah [11:40:55] in theory not though [11:40:56] e.g. if all dbs are locked [11:40:56] there are no purges [11:40:57] so ssmping would be better [11:41:09] mark: broken again... [11:41:24] indeed [11:43:24] fixed, but unlikely to be permanent [11:47:48] gone again [11:49:49] :( [11:50:02] may be better now [11:51:36] anyone spare a +2 for a labs related change for a module that isn't really used anywhere yet? https://gerrit.wikimedia.org/r/#/c/86083/ [11:54:41] (03CR) 10Mark Bergsma: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86083 (owner: 10Yuvipanda) [11:56:53] seems to work now indeed [11:56:54] (03PS7) 10Yuvipanda: Add labsvagrant helper script to help manage labsvagrant roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/86083 [11:56:58] mark: thanks for catching that. Updated now. [11:57:30] (03CR) 10Mark Bergsma: [C: 032] Add labsvagrant helper script to help manage labsvagrant roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/86083 (owner: 10Yuvipanda) [11:57:35] ty [11:59:15] yurik: so what's this X-FORCE-ESI header exactly? [11:59:36] a request header to be set by varnish? [12:03:45] mark: what did you do? [12:03:48] to perm fix? [12:03:57] remove the ospf metric on the other side [12:04:53] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 12:04:46 UTC 2013 [12:05:23] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [12:16:40] (03PS2) 10Physikerwelt: Taylor LaTeXML to Mediawiki needs [operations/debs/latexml] - 10https://gerrit.wikimedia.org/r/85646 [12:30:18] paravoid: Sorry if it got confusing; it was clear in context since there were the DHCP patches that mentionned this was a rename; I forgot to make sure that the changesets were understandable in isolation. 
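The quick-and-dirty check discussed above (join the HTCP multicast group, count packets for a few seconds, alert when the count is below a threshold) could be sketched roughly like this. HTCP's port is 4827 as mentioned; the group address and warn/crit thresholds below are made-up placeholders, not the production values.

```python
# Minimal passive check: join a multicast group, count HTCP datagrams seen in
# a short window, and map the count to a Nagios status. The group address and
# thresholds here are placeholders; HTCP's well-known port is 4827.
import socket
import struct
import time

def count_multicast(group, port, window=5.0, bufsize=2048):
    """Return how many datagrams arrive on (group, port) within `window` seconds."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", port))
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    s.settimeout(0.5)
    seen, deadline = 0, time.monotonic() + window
    while time.monotonic() < deadline:
        try:
            s.recv(bufsize)
            seen += 1
        except socket.timeout:
            pass
    s.close()
    return seen

def nagios_status(count, warn=100, crit=10):
    """Map a packet count to a Nagios-style (exit_code, message) pair."""
    if count < crit:
        return 2, "CRITICAL - %d HTCP packets seen" % count
    if count < warn:
        return 1, "WARNING - %d HTCP packets seen" % count
    return 0, "OK - %d HTCP packets seen" % count

# On a cache host one would run something like:
#   nagios_status(count_multicast("239.0.0.1", 4827))   # placeholder group
print(nagios_status(0))    # (2, 'CRITICAL - 0 HTCP packets seen')
print(nagios_status(500))  # (0, 'OK - 500 HTCP packets seen')
```

This catches "no purges arriving at all", which is the failure mode above; as noted in the discussion, it would not distinguish a broken relay from all databases being locked and genuinely producing no purges, so something like ssmping is better for pure connectivity.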
[12:30:29] s/were/should be/ [12:34:33] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 12:34:27 UTC 2013 [12:35:23] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [12:58:17] (03CR) 10Diederik: "Thanks so much Faidon! and great catch!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86079 (owner: 10Faidon Liambotis) [13:00:55] paravoid: so the vcl_deliver function is only temporary? once the cache has been cleaned it can be removed again? [13:01:38] no [13:01:46] part of it is [13:08:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [13:11:14] (03CR) 10Mark Bergsma: "So all this beresp stuff seems to have been added to be able to log it, as request headers are only logged as they are received by the cli" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86079 (owner: 10Faidon Liambotis) [13:12:30] is this a -1 or future note? :) [13:12:50] whatever you feel like [13:12:52] i.e. do you think we should do std.log now rather than later? [13:13:03] can't say I care :) [13:13:18] X-Analytics sent to the client looks clumsy [13:13:33] yeah, I pointed that out too :) [13:13:38] but we do that already [13:13:47] we also sent X-CS... [13:18:07] (03PS2) 10Faidon Liambotis: varnish: remove X-CS/Analytics tag from vcl_fetch [operations/puppet] - 10https://gerrit.wikimedia.org/r/86079 [13:19:37] (03PS3) 10Faidon Liambotis: varnish: remove X-CS/Analytics tag from vcl_fetch [operations/puppet] - 10https://gerrit.wikimedia.org/r/86079 [13:21:11] (03CR) 10Faidon Liambotis: [C: 032] "Let's try VCL_Log later, I added a FIXME." [operations/puppet] - 10https://gerrit.wikimedia.org/r/86079 (owner: 10Faidon Liambotis) [13:23:34] paravoid: hi, did you already switch RT to LDAP? it looks like it [13:23:38] no [13:23:41] did I? 
[13:23:45] I don't think so :) [13:24:02] ehm, i couldn't create a user the old way and then i see LDAP stuff in global config [13:24:19] ExternalAuthPriority [ 'LDAP' [13:25:07] remnants from my tests [13:25:35] ok, disabled again [13:25:36] sorry [13:25:48] can someone ban robertknight? [13:26:57] paravoid: thanks, now i could create a user [13:27:19] how do we get @ here? [13:28:04] paravoid: /query chanserv help access [13:29:01] 06:30 access #wikimedia-operations LIST [13:29:07] yeah I know [13:29:15] <- one of the people in there needs to add [13:29:38] heh, maplebed :p [13:32:58] paravoid: i match the /wikimedia wildcard, but that just has +V, need +0 *!*@*.wikimedia.org +V [13:33:03] +o [13:34:40] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 13:34:38 UTC 2013 [13:35:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [13:37:09] (03CR) 10Milimetric: "A belated +1 with opinions:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86079 (owner: 10Faidon Liambotis) [13:50:50] solved for -tech , now we just need it here [14:04:30] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 14:04:20 UTC 2013 [14:04:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [14:12:22] ahii akosiaris_away! [14:12:46] hey [14:12:53] wassup ? [14:13:17] looking at starting on the ulsfo stuff [14:13:24] me too [14:13:35] i am getting a bit the hang of the DC there.... [14:13:46] there are a couple of inconsistencies I found [14:13:58] oh? [14:14:15] for example ...manifests/network.pp... says public1-ulsfo is 198.35.26.0/27 [14:14:35] but cr1-ulsfo says 198.35.26.2/28 [14:14:38] oh 26 [14:14:38] hm [14:14:43] i think it is 26 [14:14:47] well 198.35.26.0/28 [14:14:55] you mean /26 ? [14:15:04] that is even worse... i found no reference to something like that ... 
pbbt sorry [14:15:19] i just read one of the nets wrong, thought it was 198.35.25 [14:15:24] nono [14:16:16] seems like we got assigned 198.35.26.0/23 but that should not matter because we will obviously NOT allocate everything to ulsfo [14:16:31] so don't ever assume /23 anywhere :P [14:16:54] other than that i am ready to close #5828 [14:18:18] cr1-ulsfo is authoritative [14:18:23] oh ha i just started doing that too [14:18:25] leslie just forgot to update network.pp [14:18:39] mark: thought so [14:18:49] thanx for verifying [14:19:00] also, all networks [14:19:23] hm, so all networks and the public1-ulsfo should be /28 [14:19:24] ? [14:19:30] no [14:19:41] just the public1-ulsfo part [14:20:26] mark: cr2-ulsfo is not up yet ? [14:20:31] no [14:20:36] cool [14:20:46] ahh got it [14:23:38] (03PS1) 10Akosiaris: tftpd server for ulsfo is bast4001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/86099 [14:24:19] there you go alex, one RT ticket nailed ;-p [14:24:36] lol [14:25:00] btw i think i should get an account at the junipers... i hate logging in with root :-) [14:25:10] yes [14:25:25] you can add yourself ;-) [14:25:37] ok I will :-) [14:26:09] mark, are all of the machines we are about to set up in ulsfo going to be on the public subnet? [14:26:20] (03CR) 10Akosiaris: [C: 032] tftpd server for ulsfo is bast4001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/86099 (owner: 10Akosiaris) [14:26:29] most can be private I think [14:26:47] ulsfo has an mpls link to eqiad [14:26:48] ok, so we'll need to add a new subnet too [14:26:56] there's no private subnet yet? [14:27:09] there is according to network.pp [14:27:09] it is in network.pp [14:27:15] i mean add to dhcpd [14:27:17] ok [14:27:17] (03PS1) 10Akosiaris: public1-ulsfo is a /28 not a /27 [operations/puppet] - 10https://gerrit.wikimedia.org/r/86100 [14:27:18] aaa ok [14:27:21] on it!
[14:28:18] (03CR) 10Akosiaris: [C: 032] public1-ulsfo is a /28 not a /27 [operations/puppet] - 10https://gerrit.wikimedia.org/r/86100 (owner: 10Akosiaris) [14:28:27] with 16 IPs, it'd get a bit tight otherwise [14:28:51] oh aye ja [14:31:12] (03PS1) 10Ottomata: Adding private ULSFO subnet in dhcpd.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/86102 [14:31:27] ottomata: did you use ferm for those kafka broker iptables rules? [14:32:47] no it was manual, i was asking around here yesterday about how to use properly, but no one was around, if/when we do this on the actual prod brokers, will def use ferm [14:33:18] do you think I should puppetize that with ferm on those now? [14:33:30] if it's test I don't care, but production, most definitely :) [14:33:34] ok cool, ja [14:34:01] the older iptables stuff in puppet is godawful and unmanageable [14:34:13] yeah ferm seems nice [14:34:17] yeah [14:34:25] speaking of that, paravoid, you around? [14:34:25] you can make macros which contain producer subnets or broker ips [14:34:35] i wanted to add a $srange param to ferm::service [14:34:36] so we can update all that in a sensible place [14:34:46] and if it was set use the R_SERVER function [14:34:50] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 14:34:45 UTC 2013 [14:34:51] rather than SERVICE [14:34:53] make sense? 
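The /27-versus-/28 confusion above is easy to check mechanically with Python's standard `ipaddress` module: it shows the two sizes and confirms that cr1-ulsfo's 198.35.26.2 sits inside the corrected prefix from the 198.35.26.0/23 allocation.

```python
# Sanity-check the prefix fix from network.pp: a /27 holds 32 addresses,
# a /28 holds 16, and cr1-ulsfo's 198.35.26.2 fits in the corrected /28.
import ipaddress

old = ipaddress.ip_network("198.35.26.0/27")   # what network.pp said
new = ipaddress.ip_network("198.35.26.0/28")   # what cr1-ulsfo says
cr1 = ipaddress.ip_address("198.35.26.2")

print(old.num_addresses)   # 32
print(new.num_addresses)   # 16
print(cr1 in new)          # True: the router's /28 config is consistent
print(new.subnet_of(old))  # True: the /28 is the first half of the old /27
# The whole 198.35.26.0/23 allocation subdivides into 32 such /28 blocks:
print(sum(1 for _ in ipaddress.ip_network("198.35.26.0/23").subnets(new_prefix=28)))
```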
[14:34:58] R_SERVICE* [14:35:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [14:36:25] akosiaris: https://gerrit.wikimedia.org/r/#/c/86102/1 [14:37:10] some day we should probably generate dhcpd.conf from a template of subnets ;) [14:37:14] (but not today) [14:37:36] with a template from a data structure of subnets, I mean [14:37:42] (03CR) 10Akosiaris: [C: 032] Adding private ULSFO subnet in dhcpd.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/86102 (owner: 10Ottomata) [14:38:11] ja totally [14:38:56] maybe it's not worth the effort [14:39:05] edits are rather infrequent [14:39:30] its nice to have things DRY and canonical, rather than looking all over the place [14:39:49] akosiaris: just had to fix something that was duplicated in two places [14:40:03] mehmeh [14:40:04] anyway [14:40:09] ? [14:40:24] the /28 /27 thing [14:40:35] a... yeah... [14:40:36] oh i'm saying because of the duplicated net config [14:40:37] yeah I agree [14:40:44] there's a downside to that too, though [14:40:46] if it was in once place that wouldn't have ahppened [14:40:49] once everything uses e.g. network.pp [14:40:53] and you make a single change there [14:40:59] that trickles down to LOTS of places in many different contexts [14:41:09] and unexpected things/breakage can happen [14:41:13] ja true [14:41:39] so at the very least it should be well defined what those data structures mean and can and cannot be used for, then [14:41:57] very simple example: [14:42:05] there's a difference between all ip prefixes we have allocated to us [14:42:11] and all prefixes that we trust because we have full control over it [14:42:20] things like "labs" and "toolserver" make that hard [14:42:27] because they're untrusted pieces of our network [14:42:38] so what does a generic name like $all_networks mean, in such a case :) [14:42:49] and should it include private ip space? or not? 
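The "generate dhcpd.conf with a template from a data structure of subnets" idea floated above could look roughly like this, so that /27-versus-/28 drift between network.pp and dhcpd.conf cannot happen. The public1-ulsfo prefix, the .2 gateway (cr1-ulsfo), and bast4001 as TFTP server are from the discussion; the stanza layout is a simplified sketch, not the production file.

```python
# Rough sketch of rendering dhcpd.conf subnet stanzas from one canonical list.
# public1-ulsfo / 198.35.26.0/28 / bast4001 are from the discussion; the
# stanza contents are a simplified illustration of the idea.
import ipaddress

SUBNETS = [
    {"name": "public1-ulsfo", "cidr": "198.35.26.0/28", "tftp": "bast4001"},
]

STANZA = """\
# {name}
subnet {net} netmask {mask} {{
    option routers {gw};
    next-server {tftp};
}}
"""

def render(subnets):
    out = []
    for s in subnets:
        net = ipaddress.ip_network(s["cidr"])
        out.append(STANZA.format(
            name=s["name"],
            net=net.network_address,
            mask=net.netmask,
            gw=net.network_address + 2,  # cr1-ulsfo is .2 in this block
            tftp=s["tftp"],
        ))
    return "\n".join(out)

print(render(SUBNETS))
```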
it's all not trivial :) [14:42:57] aye [14:43:03] ipv4 only, ipv6 too, etc [14:43:08] aye [14:44:06] ottomata: cps or lvs first ? [14:44:14] akosiaris: i'm going to run puppet on brewster, i think we need that change there [14:44:24] and, i think we are just doing base install for now [14:44:26] so it doesn't matter [14:44:28] they'll all be the same [14:44:31] there are fewer lvs [14:44:33] lets do those [14:44:44] there are a couple of differences [14:44:57] for example ... cps use raid1-varnish.cfg in d-i [14:45:07] ah hm [14:45:11] lvs seems to be lvs* so no need to touch that... [14:45:12] lvs basically doesn't care much about disk [14:45:15] it just needs to be reliable [14:45:20] (raid1) [14:46:03] cmjohnson1: could you take a look at analytics1021? [14:46:06] sdb (i think) is busted [14:46:25] oh sorry [14:46:30] so if LVS currently has its own partitioning recipe [14:46:31] you already did, just went and looked at the RT [14:46:38] we can/should probably swap that out for some very generic raid1 recipe [14:47:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [14:47:03] ah, that's not what you said though [14:47:58] ? [14:48:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 16.860 second response time [14:51:27] mark, i'm looking at the lvs.cfg partman recipe [14:51:40] the current recipe says assume one disk or hw raid [14:52:29] do the new ulsfo lvs's have hw raid? or should we just take raid1-30G.cfg and use it [14:54:07] .mgmt's work, we could find out [14:54:25] i don't know [14:54:43] i think most boxes we buy now do support a simple form of hw raid 1 [14:55:07] hm [14:56:56] akosiaris: how do we check that? i'm in the mgmt interface, not sure where to look [14:57:13] i am afraid you won't like the answer... [14:57:17] but here it goes...
[14:57:26] boot it and look at the hw controller while it boots [14:57:30] haha [14:58:01] (03Abandoned) 10Dzahn: the existing etherpad-lite_1.0-wm2 package in a operations/debs repo for completeness [operations/debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/76654 (owner: 10Dzahn) [14:58:37] i expect some hw controller... but probably unconfigured... [15:00:20] (03Abandoned) 10QChris: Turn on gerrit's database connection pooling [operations/puppet] - 10https://gerrit.wikimedia.org/r/62336 (owner: 10QChris) [15:06:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:06:45] akosiaris: [15:06:49] /admin1-> racadm raid get controllers [15:06:49] STOR0101 : No RAID controller is displayed. [15:06:54] does that count? [15:07:00] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 15:06:57 UTC 2013 [15:07:00] or should I look deeper? [15:07:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [15:08:03] well... I suppose it should count ? [15:08:11] i sure hope it does [15:08:36] ok, so no hw raid on lvs [15:08:41] so, raid1-30G.cfg? [15:08:47] neither on cps then... [15:09:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.379 second response time [15:09:05] well, for cps, can we just use raid1-varnish? [15:09:34] cps can't use hw raid1 [15:09:46] varnish uses the SSDs individually, directly [15:10:11] iirc amslvs* doesn't use hw raid1 either [15:10:18] so you could just use whatever is used for that [15:12:05] raid1-squid [15:14:18] hm, a little harder to match these uniquely in netboot.cfg [15:14:21] lvs400* [15:14:26] and then what, change lvs* to something [15:14:26] hm [15:14:35] i don't know what all the existant lvs* servers are out there [15:15:17] lvs[1-6] and lvs100[1-6]? 
[15:15:19] lvs1-6, lvs1001-1006, amslvs1-4 [15:15:27] ok [15:16:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:16:08] huh.... asw-ulsfo says cp500x, lvs500x in its show interfaces descriptions... [15:16:22] for a moment i was huh ? [15:16:31] rob renamed them afterwards [15:17:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.893 second response time [15:17:04] it was 500x to start with ? [15:17:09] what for ? [15:17:21] why skip 4 ??? anyway... i will fix that [15:17:26] noone knows [15:17:31] at least lets be consistent [15:20:34] i am searching wikitech for how to get MAC addies from racadm... [15:20:43] racadm getsysinfo [15:22:46] ottomata: i was about to collect them all from the switch [15:22:53] oo even better! :) [15:23:03] can you tell which is which from the switch? [15:23:24] welllllll descriptions are there... if they are correct then yes [15:23:29] if they are wrong ... [15:23:38] but we can always crosscheck :P [15:23:58] ok cool, you got descs that's good enoguh for me [15:24:07] actually, that will help for me too, there are multiple nics [15:24:17] can you look at lvs4001 right now? [15:24:19] what's it say for that one? [15:24:25] i will have to boot them once though [15:24:44] for the switch mac table to get populated [15:27:49] ok i'm in lvs4001 right now [15:27:51] shall I boot? [15:28:04] seems powered on ... 
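The hostname-matching worry above (giving the new ulsfo boxes their own netboot.cfg stanza without stealing lvs1-6, lvs1001-1006 or amslvs1-4 from the existing one) can be dry-run with shell-style globs. `fnmatch` is used here as a stand-in for whatever pattern matching netboot.cfg actually does; treat it as an approximation.

```python
# Dry-run of glob matching: check that a pattern for the new ulsfo LVS boxes
# catches lvs4001-lvs4004 without matching any host from the existing
# lvs1-6 / lvs1001-1006 / amslvs1-4 stanzas.
from fnmatch import fnmatch

existing = [f"lvs{i}" for i in range(1, 7)] \
         + [f"lvs{i}" for i in range(1001, 1007)] \
         + [f"amslvs{i}" for i in range(1, 5)]
new = [f"lvs{i}" for i in range(4001, 4005)]

pattern = "lvs400[1-4]"
assert all(fnmatch(h, pattern) for h in new)
assert not any(fnmatch(h, pattern) for h in existing)
print("lvs400[1-4] matches exactly:", [h for h in new + existing if fnmatch(h, pattern)])
```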
[15:28:19] at least that is what ipmitool tells me [15:28:22] oh ok [15:28:27] !log Setup OSPFv3 between cr1-ulsfo and cr2-eqiad [15:28:28] gimme a sec [15:28:30] k [15:28:38] Logged the message, Master [15:29:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:29:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 13.812 second response time [15:32:40] !log Migrated OSPF between cr1-ulsfo and cr2-eqiad to p2p interface-type, removed high metric, TEMP annotation [15:32:55] Logged the message, Master [15:33:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:33:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 19.443 second response time [15:34:55] hashar, looks like you added generic::packages::tree and generic::packages::joe but they aren't used anyplace. Can I prune them, or are they included someplace that I'm missing? [15:36:00] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 15:35:52 UTC 2013 [15:36:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [15:36:45] ottomata: bad disk on an1021 [15:36:50] (03PS1) 10Andrew Bogott: Remove two seemingly-unused classes, 'joe' and 'tree'. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86108 [15:37:26] (03CR) 10Andrew Bogott: "Hashar, I think you added these two classes. Are they still used someplace that I'm missing?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86108 (owner: 10Andrew Bogott) [15:39:27] (03CR) 10Hashar: [C: 031] "That is some good old change I made a while back when building the beta bastion." [operations/puppet] - 10https://gerrit.wikimedia.org/r/86108 (owner: 10Andrew Bogott) [15:39:38] andrewbogott: feel free to nuke joe/tree :] Thx for the cleanup! 
[15:39:45] andrewbogott: you should add Brion as reviewer,hehe [15:40:41] mutante, why Brion? [15:41:05] ottomata: seems like descriptions are not consistent [15:41:11] hashar, another review for you: https://gerrit.wikimedia.org/r/#/c/86035/ [15:41:28] andrewbogott: brion used joe back in the old time [15:41:33] andrewbogott: it is on fenari :-) [15:41:40] andrewbogott: because he likes joe, but just kidding because back then the joe or not was a thing :) [15:41:51] heh, ok. [15:41:51] oooo ok akosiaris [15:41:52] for example lvs4004 says 90:B1:1C:42:C3:7A, 90:B1:1C:42:C3:7C, 90:B1:1C:42:C3:7E, 90:B1:1C:42:C3:80 [15:42:04] but switch says 90:b1:1c:45:5d:91 [15:42:09] so... tough luck :P [15:42:25] but to racadm getsysinfo [15:42:33] let's not trust switch descriptions [15:43:50] (03PS1) 10Dzahn: since we just touched statistics.pp anyways, sneak in the retabbing as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/86110 [15:44:56] hashar, also, what's the story with manifests/misc/contint vs modules/contint? [15:45:04] (03CR) 10Hashar: [C: 04-1] "This is required by https://gerrit.wikimedia.org/r/#/c/75856/ which let us generate the .gitconfig of the jenkins user on Jenkins slaves." [operations/puppet] - 10https://gerrit.wikimedia.org/r/86035 (owner: 10Andrew Bogott) [15:45:29] andrewbogott: manifests/misc/contint is historical but still applied. It has some augeas rules there I don't want to waste time migrating to contint module or site.pp or whatever :-] [15:46:01] andrewbogott: ohh there are no more rules there [15:46:14] Would it be incorrect to move its contents into the module? [15:46:45] andrewbogott: misc::contint::test::jenkins will be deleted eventually. It is used to publish a git repository. [15:47:03] andrewbogott: I have to migrate that snippet to use git-deploy. 
Would do that whenever I managed to understand git-deploy :-] [15:47:21] andrewbogott: but yeah that can most probably be deleted soon [15:47:54] and… back to gitconfig: If I modify https://gerrit.wikimedia.org/r/#/c/75856/ to use the new name, would that overcome your only objection? [15:47:54] misc::contint::android::sdk basically install packages required by the Android SDK, I can surely move that under contint packages.pp [15:48:06] ottomata: PERC S110 Controller [15:48:17] andrewbogott: yeah that works for me. gitconfig was originally named git :-] [15:48:29] lvs4004 says so... racadm hwinventory [15:48:41] Yeah, I kind of went in circles about that naming scheme [15:48:58] andrewbogott: I fall in the same trap constantly ! [15:49:05] huh... i even get the serials off the memory dimms... nice [15:49:23] (03PS1) 10Dzahn: statistics.pp - fix unquoted file modes and resource titles. puppet-lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/86112 [15:49:29] (03CR) 10Hashar: "reseting CR score, I dont mind using git as a module. That probably make more sense." [operations/puppet] - 10https://gerrit.wikimedia.org/r/86035 (owner: 10Andrew Bogott) [15:50:36] (03CR) 10Hashar: [C: 04-1] "The gitconfig module is being renamed to git in https://gerrit.wikimedia.org/r/#/c/86035/ . Whenever that get merged, we will want to upda" [operations/puppet] - 10https://gerrit.wikimedia.org/r/75856 (owner: 10Hashar) [15:52:10] (03CR) 10Andrew Bogott: [C: 032] Remove two seemingly-unused classes, 'joe' and 'tree'. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86108 (owner: 10Andrew Bogott) [15:55:38] (03PS1) 10Dzahn: statistics.pp, puppet-lint, fix WARNINGs: string containing only a variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/86113 [15:59:11] (03CR) 10Dzahn: "Yuvi says he would like to replace mediawiki-singlenode anyways?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/79955 (owner: 10Mattflaschen) [16:00:41] (akosiaris, sorry in meeting) [16:00:52] ottomata: no worries [16:01:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:01:03] hashar, I'm super confused… it looks like your patch https://gerrit.wikimedia.org/r/#/c/75856/4/manifests/role/ci.pp already calls it git::userconfig? [16:01:09] Did you just now update that patch? [16:01:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 10.479 second response time [16:01:57] (03CR) 10Dzahn: "repurposed to use SSLCertificateChainFile" [operations/puppet] - 10https://gerrit.wikimedia.org/r/84901 (owner: 10Dzahn) [16:03:04] andrewbogott: good catch :-] [16:03:32] (03PS3) 10Physikerwelt: Taylor LaTeXML to Mediawiki needs [operations/debs/latexml] - 10https://gerrit.wikimedia.org/r/85646 [16:03:34] the fact that you unconsciously imagined the class to already have the same name that I arrived at seems like a good sign. [16:03:38] andrewbogott: 75856 is still based on the old version of my gitconfig patch which had a module named git. Loks like I rebased carelessly [16:04:29] andrewbogott: we could rebase https://gerrit.wikimedia.org/r/#/c/75856/ on top of your gitconfig -> git rename. And that would work. [16:04:36] Anyway… I will make your patch depend on mine but will not otherwise modify it and then we're done. [16:04:42] yeah, as you said :) [16:05:41] andrewbogott: excellent :-) [16:05:42] (03PS4) 10Physikerwelt: Taylor LaTeXML to Mediawiki needs [operations/debs/latexml] - 10https://gerrit.wikimedia.org/r/85646 [16:06:22] akosiaris: , so which nic? [16:06:31] NIC.Integrated.1-1-1 Ethernet [16:06:31] ? 
[16:06:57] have a look at the *inventory files at root's host on bast4001 [16:07:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:07:09] i got all the info i could from the machines... [16:07:26] i will do some awk and grep magic now and create the freaking dhcpd file [16:07:36] (03PS5) 10Andrew Bogott: contint: generate .gitconfig files for all jenkins users [operations/puppet] - 10https://gerrit.wikimedia.org/r/75856 (owner: 10Hashar) [16:07:37] (03PS3) 10Andrew Bogott: Move and rename the (currently unused) gitconfig module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86035 [16:08:04] i am puzzled by InstanceID: Disk.Direct.1-1:RAID.Embedded.1-1 [16:08:16] seems like the have hardware raid [16:08:37] hm [16:08:50] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 16:08:42 UTC 2013 [16:08:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 20.101 second response time [16:09:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [16:09:39] hm, but akosiaris, even from this info, how do you know which NIC is the correct MAC to use? [16:09:50] i could assume the first one, but dunno [16:09:56] i will assume the first one [16:10:15] mmmmmmk, gonna make some linux-host-entires [16:10:17] that is "bios" order and it should match the way the are in the box [16:10:26] !log Jenkins: deleted /tmp/mw-UIDGenerator-UID-* files on gallium. [16:10:38] Logged the message, Master [16:10:59] I will also assume someone as pedantic as me cabled the boxes and always connected the first one [16:12:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:12:18] hashar, can you give those two patches one last look svp? 
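The getsysinfo-versus-switch MAC mismatch above is exactly the kind of thing a small crosscheck catches: normalize both sides to one format and intersect the sets. The addresses are the ones quoted in the discussion for lvs4004.

```python
# Crosscheck MACs reported by the host (racadm getsysinfo) against what the
# switch port description claims, after normalizing case/format. Addresses
# are the ones quoted above for lvs4004.
def norm(mac):
    """Normalize a MAC to lowercase colon-separated form."""
    digits = mac.lower().replace("-", "").replace(":", "").replace(".", "")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

host_nics = ["90:B1:1C:42:C3:7A", "90:B1:1C:42:C3:7C",
             "90:B1:1C:42:C3:7E", "90:B1:1C:42:C3:80"]
switch_sees = ["90:b1:1c:45:5d:91"]

host_set = {norm(m) for m in host_nics}
seen = {norm(m) for m in switch_sees}
# An empty intersection means the MAC on the switch port does not belong to
# this host, so the port description (or the cabling assumption) is wrong.
print(host_set & seen)           # set()
print(sorted(seen - host_set))   # ['90:b1:1c:45:5d:91']
```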
[16:13:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 27.146 second response time [16:15:30] ottomata: so... i start assigning IPs ... [16:15:42] any preferences ? [16:17:04] (03PS1) 10Mark Bergsma: Repartition ulsfo IP space [operations/dns] - 10https://gerrit.wikimedia.org/r/86121 [16:17:05] (03PS1) 10Mark Bergsma: Add br[12]-ulsfo loopback IPs and transfer nets [operations/dns] - 10https://gerrit.wikimedia.org/r/86122 [16:17:16] hmmm [16:17:35] h [16:17:39] hm [16:17:54] (03CR) 10Mark Bergsma: [C: 032] Repartition ulsfo IP space [operations/dns] - 10https://gerrit.wikimedia.org/r/86121 (owner: 10Mark Bergsma) [16:18:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:18:04] andrewbogott: sure [16:18:05] so cps all internal.. mark LVS public right ? [16:18:12] also internal [16:18:19] they don't need to talk to the internet [16:18:24] aaa only the service IPs are public [16:18:26] ok [16:18:50] akosiaris: start cps at .101, lvs's at .11? iunno [16:19:19] (03PS2) 10Mark Bergsma: Add br[12]-ulsfo loopback IPs and transfer nets [operations/dns] - 10https://gerrit.wikimedia.org/r/86122 [16:19:27] yep, and those get announced with bgp [16:19:47] note, in eqiad lvs is public [16:19:50] but they don't need to be [16:20:37] (03CR) 10Hashar: "Sounds good :-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86035 (owner: 10Andrew Bogott) [16:20:48] (03CR) 10Mark Bergsma: [C: 032] Add br[12]-ulsfo loopback IPs and transfer nets [operations/dns] - 10https://gerrit.wikimedia.org/r/86122 (owner: 10Mark Bergsma) [16:22:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.121 second response time [16:24:15] (03CR) 10Hashar: "Should be fine." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/75856 (owner: 10Hashar) [16:24:56] (03PS1) 10Dzahn: bugzilla.pp - fix unquoted resource titles and file modes (puppet-lint) [operations/puppet] - 10https://gerrit.wikimedia.org/r/86124 [16:24:56] andrewbogott: I need to leave. Both patches are fine. When deploying https://gerrit.wikimedia.org/r/75856 , make sure /var/lib/jenkins/.gitconfig looks legit. I pasted the content of the current file as a comment on the change. [16:25:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:26:00] PROBLEM - DPKG on labstore4 is CRITICAL: Timeout while attempting connection [16:27:30] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [16:29:06] (03PS5) 10Jforrester: Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 [16:29:11] (03CR) 10jenkins-bot: [V: 04-1] Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 (owner: 10Jforrester) [16:29:15] (03CR) 10Jforrester: "Bah, copy-and-paste fail; sorry." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 (owner: 10Jforrester) [16:31:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 26.223 second response time [16:32:54] (03CR) 10Greg Grossmeier: "test. and test2. should be fine to test that. Let's leave it off mediawiki.org for now." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85944 (owner: 10Legoktm) [16:33:10] ah mark, what's this mean? 
[16:33:10] ; 198.35.26.224/27 [16:33:10] ; LVS service IPs [16:33:35] just curious [16:33:48] that's the block where the lvs service ips will go [16:34:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:34:09] (03PS1) 10Ottomata: Adding cp4001-cp4020 and lvs4001-lvs4004 ulsfo hosts in linux-host-entries [operations/puppet] - 10https://gerrit.wikimedia.org/r/86125 [16:34:15] (03CR) 10Greg Grossmeier: [C: 031] Enable MassMessage extension on test2.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85944 (owner: 10Legoktm) [16:34:18] i.e. the ips users will connect to [16:34:53] ahhhhh, that's on a different interface on each lvs then? [16:35:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 26.464 second response time [16:35:01] no [16:35:13] the router will send those ips to them, even though they are on the private subnet [16:35:21] ohhhhh, hm [16:35:23] ok [16:35:58] akosiaris: https://gerrit.wikimedia.org/r/#/c/86125/ [16:36:13] (03PS1) 10Dzahn: retab misc/planet.pp from tabs to 4 spaces, do the cleanup before next attempt to turn into module [operations/puppet] - 10https://gerrit.wikimedia.org/r/86126 [16:37:31] I am off. See you later! [16:37:45] (03PS1) 10Faidon Liambotis: gdnsd ganglia: 2 -> 3 attempts [operations/puppet] - 10https://gerrit.wikimedia.org/r/86128 [16:37:53] ottomata: looks fine... add the lvs and its good to go [16:38:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:38:06] lvs is there too, no? 
[16:38:25] yes it is [16:38:25] (03CR) 10Faidon Liambotis: [C: 032] gdnsd ganglia: 2 -> 3 attempts [operations/puppet] - 10https://gerrit.wikimedia.org/r/86128 (owner: 10Faidon Liambotis) [16:38:28] ma bad [16:38:29] (03PS5) 10Physikerwelt: Taylor LaTeXML to Mediawiki needs [operations/debs/latexml] - 10https://gerrit.wikimedia.org/r/85646 [16:38:31] also, I just assumed this was the correct host-entries file [16:38:36] correct com/baud whatever [16:38:44] yes i think it is [16:38:49] i had chosen the same [16:38:54] ok, so now we need netboot [16:38:57] so, hw raid or no? :p [16:39:30] (03PS1) 10Faidon Liambotis: Disable ms-fe.eqiad.wmnet LVS check [operations/puppet] - 10https://gerrit.wikimedia.org/r/86129 [16:39:53] (03CR) 10Faidon Liambotis: [C: 032] Disable ms-fe.eqiad.wmnet LVS check [operations/puppet] - 10https://gerrit.wikimedia.org/r/86129 (owner: 10Faidon Liambotis) [16:40:22] ottomata: let's boot one and see how it goes. I 'd bet there is a hardware RAID there [16:40:38] ok, lvs4001 [16:41:04] can you hop on console at the same time as me? [16:41:11] i'm not entirely sure what we will be looking for [16:41:37] serial or through java applet ? [16:41:44] i don't think we can share serial [16:41:50] uhh serial, idunno, we can on ciscos [16:42:02] (03CR) 10Andrew Bogott: [C: 032] Move and rename the (currently unused) gitconfig module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86035 (owner: 10Andrew Bogott) [16:42:07] maybe the controller bios won't show up on serial [16:42:13] oh no [16:42:13] ? [16:42:19] ok you watch the java applet then, i don't have that up [16:42:20] maybe... [16:42:26] we will see [16:42:26] i'll watch serial :p [16:42:33] you ready? [16:42:40] gimme a sec [16:42:41] k [16:42:53] (03CR) 10Andrew Bogott: [C: 031] "Let me know when you're around, and I'll merge this while you're watching." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/75856 (owner: 10Hashar) [16:45:57] ottomata: ready [16:46:21] reboot it [16:46:47] k [16:46:48] (03PS1) 10Dzahn: planet.pp - wrong quoting, aligned arrows, ensure first and other puppet lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/86130 [16:47:18] Reedy: piles of ": No such revision found for Q2116001" from job runners [16:48:38] AaronSchulz: Open a bug if there isn't one already [16:50:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.919 second response time [16:52:55] doo dee doo [16:53:05] you seeing anything akosiaris? [16:53:17] mark, X-FORCE-ESI header forces the backend to insert instead of the banner [16:53:29] so that's a request header [16:53:35] correct [16:53:46] ok [16:55:39] ottomata: you all doin ulsfo installs? How goes it? [16:56:55] mark, is there currently an easy way to differentiate requests for the html content from all others (javascript/images)? [16:57:00] in varnish [16:57:38] content type perhaps [16:57:44] but really you should have mediawiki tell varnish what to do [16:58:40] (03PS1) 10Andrew Bogott: Move maven package def into contint module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86136 [16:59:18] perhaps a response header to tell it whether to evaluate ESI or not [16:59:27] mark, the doc shows set beresp.do_esi = true; in vcl_fetch -- https://www.varnish-cache.org/docs/3.0/tutorial/esi.html [16:59:35] Reedy: you already did :) [16:59:41] can we do it post request? [16:59:41] * AaronSchulz just added more whinging [17:00:04] yurik: what do you mean? [17:00:32] * AaronSchulz hmms at https://gdash.wikimedia.org/dashboards/jobq/ [17:01:00] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 10 hours [17:01:25] mark, that tutorial shows how to enable ESI -- and the sample code shows it as part of vcl_fetch state.
I could easily add a response header, but do you know if varnish can process it? [17:01:38] of course it can process it [17:01:59] if beresp.http.Enable-ESI == "true" { set beresp.do_esi = true; } [17:02:04] RobH, going ok [17:02:17] alex and I are stuck at wondering what partman to use for lvs [17:02:19] i mean - can varnish process the response header and set it to true AFTER it fetches the response? [17:02:22] trying to figure out if they ahve hardware raid [17:02:49] AaronSchulz: Probably wants an updated stack trace posting [17:02:59] yurik: that's what the documentation example shows, doesn't it? [17:03:21] getting conflicting info from racadm vs getsysinfo vs racadm hwinventory [17:03:27] RobH, do you know? [17:04:22] /admin1-> racadm raid get controllers [17:04:22] STOR0101 : No RAID controller is displayed. [17:04:40] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [17:04:41] mark, my bad, i thought vcl_fetch is BEFORE fetching. Cool, i will add a response header "X-ENABLE-ESI=1" [17:04:54] vs racadm hwinventory [17:04:58] can it be a bit less ugly? ;)( [17:05:02] no need to have it in all caps [17:05:07] X- is deprecated too [17:05:18] i thought all our headers are like that :) [17:05:21] "Process-ESI" or so [17:05:23] its called convention :)( [17:05:26] [InstanceID: Disk.Virtual.262145:RAID.Embedded.1-1] [17:05:28] its called nonsense [17:06:01] bleh, you no fun. "Xs" sounded much better [17:06:15] Enable-ESI header it is [17:06:26] just make sure you unset it [17:06:36] hmm, not right away though [17:07:00] after full out production [17:07:08] what package contains /usr/local/bin/position-of-the-moon? 
[17:07:14] none [17:07:16] it was part of puppet [17:07:19] I think it's not anymore [17:07:40] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 17:07:30 UTC 2013 [17:07:52] well, my puppet run is failing for lack of it :( [17:08:02] * mark is gonna get dinner [17:08:22] *shrug* [17:08:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:09:17] ottomata: sorry, reading backlog [17:09:39] ottomata: these have no raid [17:09:48] they use no backplane and plug directly into the controllers [17:09:55] requirement for caching use for us. [17:10:06] sorry for the delay, was dug into something else [17:10:43] as for what to use for lvs... lets see [17:11:48] surprisingly, NOT the one called lvs.cfg [17:11:54] as its for older lvs servers with hw raid [17:12:22] ottomata: I suggest using raid1-varnish.cfg [17:12:47] <^d> mark: So I think having gitblit behind varnish is causing some problems. [17:13:24] ottomata: sorry connection problems... i only get some glimpses of the screen [17:13:33] !log reedy synchronized php-1.22wmf19 'Initial sync' [17:13:44] never managed to make out something [17:13:47] Logged the message, Master [17:14:29] akosiaris: you guys are figuring out how to partition? [17:14:40] (03PS5) 10Legoktm: Enable MassMessage extension on testwiki/test2wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/85944 [17:14:41] yes [17:14:57] chose raid1-varnish.cfg? [17:15:05] (thats the one to pick) [17:15:27] its easier to cheat and read netboot.cfg when you know what other servers are identical... [17:15:30] which is only helpful to the dude who orders them ;] [17:15:35] !log reedy synchronized php-1.22wmf19 'Initial sync v2' [17:15:42] btw... are the machines in 3.02 cabled ? [17:15:47] Logged the message, Master [17:16:04] I logged in tin asw-ulsfo and can find no reference to those machines... 
[17:16:05] The machines are cabled, however, the vlans may not yet be setup [17:16:20] and could not find the second switch... [17:16:28] there is no second switch [17:16:30] they are a stack [17:16:32] stacked ? [17:16:45] RobH, what about raid1-30G for lvs? [17:16:56] then why in show interfaces descriptions i am only seeing rack 3.03 ? [17:16:58] do we want to allocate the xfs partition like raid1-varnish does for those? [17:17:00] uh, i'd pick what mark already has in use on the new varnish machines [17:17:06] which is raid1-varnish [17:17:20] right for varnisih [17:17:24] oh, for lvs [17:17:24] i' saying for the lvs [17:17:33] duh, sorry, its early =P [17:17:36] :) [17:17:37] uhh, lemme seee [17:17:54] !log reedy synchronized docroot and w [17:18:07] Logged the message, Master [17:18:08] bleeeeeh [17:18:13] these are slightly different [17:19:06] so yea.... lvs boxen are kinda basic [17:19:59] ottomata: That would work, but mark also has pointed out he wants more boxes setup with LVM for space mgmt issues (and other things) [17:20:08] oh [17:20:10] but perhaps that doesnt apply to lvs boxen [17:20:11] cool! i like that too [17:20:18] but wouldnt hurt i dont think [17:20:26] raid1-lvm [17:20:27] ? [17:20:32] raid1-lvm looks good but 10gb is just silly small. [17:20:40] perhaps its time to update that recipie [17:20:42] pretty standard, I agree [17:20:48] i can make a raid1-lvm-30G? [17:20:52] (we could make new one, but seems like 10GB is now too tiny [17:20:59] i think we should just update the existing one. [17:20:59] orrr we can just edit this one [17:21:01] ok [17:21:04] i like [17:21:07] 30G? [17:21:08] 50G? [17:21:10] 100G? [17:21:10] :) [17:21:13] lets make 30 [17:21:14] k [17:21:17] well [17:21:19] thats also tiny [17:21:24] gotta think bigger! [17:21:30] we never get less than 250GB disk now [17:21:31] ever. [17:21:48] Can you still buy them? [17:21:49] * Reedy looks [17:21:49] is it just the 3rd number on the raid line? 
[17:21:50] 5000 8000 30000 raid \ [17:21:51] ? [17:22:03] yeah, but RobH, its nice to not take up all the space on the root partition [17:22:06] (ignoring SSDs) [17:22:11] true, lets do 50GB [17:22:12] if you have to write lots of data, its nice to ahve that on its own partition [17:22:13] just in case [17:22:15] k [17:22:20] 5000 8000 50000 raid \ [17:22:20] ? [17:22:20] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [17:22:29] (I don't know what those number mean! :p) [17:23:44] im trying to find my notes [17:23:47] to ensure you are right [17:23:53] cuz yea, they are mysterious ;] [17:24:10] ottomata: Numbers with many 0s means lots [17:24:13] haha [17:24:19] MHM what if 0s are on the left side? [17:24:22] i think you are correct [17:24:24] I NEED TO SEE DOCUMENTAION! [17:24:25] but i am not certain [17:24:25] hehe [17:24:28] ok [17:24:29] hah [17:24:34] lemme correlate w other files [17:24:45] raid1-30G has 5000 8000 30000 raid \ [17:25:03] yea the first is priority [17:25:06] second is .... [17:25:24] fuck i dont recall and now its gonna bug me [17:25:24] until i recall and understand [17:25:35] !log reedy Started syncing Wikimedia installation... 
: testwiki to 1.22wmf19 and build l10n cache [17:25:50] Logged the message, Master [17:26:01] http://ftp.dc.volia.com/pub/debian/preseed/partman-auto-recipe.txt [17:26:05] ottomata: ^ [17:26:28] oh minimal size [17:26:29] interesting [17:26:44] so you can set the minimal for the odd non 250gb disk [17:26:48] (03CR) 10Akosiaris: [C: 032] Adding cp4001-cp4020 and lvs4001-lvs4004 ulsfo hosts in linux-host-entries [operations/puppet] - 10https://gerrit.wikimedia.org/r/86125 (owner: 10Ottomata) [17:26:51] aye [17:26:53] ok [17:26:54] but with a max of 50 we are fine i think anyhow [17:26:57] doing 50G [17:26:57] k [17:27:07] but you may wanna bump the minimum up too [17:27:14] 8 is too small [17:27:17] make it like 15 [17:27:48] if we are modernizing the partition data for actual disks today, may as well do it all =] [17:27:59] (but isnt vital) [17:28:26] akosiaris: Are you on the switch stack as root or yourself? [17:28:36] cuz if you arent setup on them as yourslef, one of our netadmins can do that [17:29:00] but if you are on as root, its ok, i just am always nervous when i do it ;] [17:29:01] RobH: root. I will create my one account at some point. I already talked with mark about it [17:30:05] (03PS1) 10Ottomata: Upping raid1-lvm partman recipe to 50G maxsize [operations/puppet] - 10https://gerrit.wikimedia.org/r/86141 [17:30:06] (03PS1) 10Ottomata: Adding new ulsfo lvs4* and cp4* hosts to netboot.cfg. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86142 [17:30:58] anyway got to go. RobH, please talk with LeslieCarr about the rack 3.02 not showing up in the switch interfaces list, cause I have a hunch those interfaces are not configured. [17:31:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [17:31:17] akosiaris: will do [17:31:23] c ya ppl [17:31:26] ttyl [17:33:28] greg-g, any 10 min window of time for me today? 
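Per the partman-auto-recipe.txt page linked above, the three numbers on a recipe line are the minimal size (in MB), a priority used when dividing up leftover space, and the maximal size (in MB), in that order, so the third number is indeed the one that caps the partition. The 50G line being settled on would look roughly like this (the method{ } continuation is a typical recipe shape, not quoted from the actual raid1-lvm recipe):

```
# partman-auto recipe line format (see partman-auto-recipe.txt):
#   <minimal size in MB> <priority> <maximal size in MB> <type>
# The third number caps the partition, here at 50 GB:
5000 8000 50000 raid \
        method{ raid } \
        .
```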
[17:35:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 24.944 second response time [17:35:00] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [17:35:39] yurik: the LD [17:35:49] yurik: http://open.blogs.nytimes.com/tag/varnish/ has X-RUN-ESI [17:39:10] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 17:39:00 UTC 2013 [17:39:24] !log reedy Finished syncing Wikimedia installation... : testwiki to 1.22wmf19 and build l10n cache [17:39:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:39:36] Logged the message, Master [17:46:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [17:47:06] (03PS1) 10Reedy: Add 1.22wmf19 docroot stuffs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86144 [17:47:07] (03PS1) 10Reedy: All wikipedias to 1.22wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86145 [17:47:08] (03PS1) 10Reedy: testwiki, test2wiki, loginwiki, testwikidatawiki and mediawikiwiki to 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86146 [17:47:29] (03CR) 10Reedy: [C: 032] Add 1.22wmf19 docroot stuffs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86144 (owner: 10Reedy) [17:47:43] (03CR) 10Ryan Lane: [C: 032] Applying misc::graphite::gdash makes host a gdash deployment target [operations/puppet] - 10https://gerrit.wikimedia.org/r/86050 (owner: 10Ori.livneh) [17:48:00] I have set up Varnish 3 and it does not seem to be purging. Anon users are seeing stale content. 
Is there anything I need to do in addition to http://www.mediawiki.org/wiki/Manual:Varnish_caching#Configuring_Varnish_3.x [17:48:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 22.716 second response time [17:48:26] Someone at #mediawiki suggested ops guys here may be able to guide [17:49:07] (03Merged) 10jenkins-bot: Add 1.22wmf19 docroot stuffs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86144 (owner: 10Reedy) [17:49:39] Who did? [17:49:51] It's sort of off topic in here as this is Wikimedia ops stuff... [17:50:03] #wikimedia-tech would better, but still slightly off topic [17:50:45] Reedy: was RoanKattouw ; something about varnish purging. (for mediawiki i guess) [17:50:54] bad RoanKattouw [17:50:58] Reedy: I sent him here because the ops people actually know how Varnish purging is set up [17:51:19] With varnishhtcpd and whatever else [17:52:36] It feels THIS close to getting it done. But without the purge it's going to be hard. I'd appreciate any help. [17:52:45] jazzybee: you should also check out the varnish channel. #varnish @ irc.linpro.no [17:53:49] Lol, pointing at a completely different irc network... [17:54:19] Reedy: that's where they live! blame them! [17:54:33] jeremyb: I suspect it's more of a backend issue...but it's getting both to work [17:55:08] Are we even using varnish for text yet? [17:55:30] yes for some domains? have to double check [17:55:44] https://wikitech.wikimedia.org/wiki/Projects#Switch_.22text.22_to_Varnish [17:56:24] I notice we have Text squids EQIAD and Text caches EQIAD [17:56:46] "Text is one of the few services that haven't migrated to Varnish." [17:58:29] I note for upload they all seem to be migrated to upload caches [18:01:17] https://gerrit.wikimedia.org/r/79048 https://gerrit.wikimedia.org/r/77710 [18:03:56] jazzybee: have you checked out our varnish configs/puppet repo ? 
[18:04:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [18:04:08] i mean it's horribly confusing, but that's sadly the best i can do help-wise :( [18:04:21] and double checked that multicast is properly working [18:04:24] cuz grrrrr multicast [18:04:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 10.617 second response time [18:06:07] Don't suppose wikitech has something more copyable? (is that behind a proxy?) [18:06:10] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 18:06:06 UTC 2013 [18:06:23] wikitech is now mixed up with labs [18:06:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [18:07:00] if you meant how wikitech is configured [18:07:03] or did you mean docs on wikitech ? [18:07:24] jazzybee: also #wikimedia-mobile on this network [18:07:25] (03PS1) 10Yurik: Removed Zero namespaces (480 & 481) from META [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86147 [18:07:27] i mean, we have https://wikitech.wikimedia.org/wiki/Varnish but… not exactly great :) [18:07:40] all the varnish experts are in europe and therefore not online righ tnow [18:07:44] I meant wikitech wiki config [18:07:45] Server: Apache/2.2.22 (Ubuntu) [18:07:45] X-Powered-By: PHP/5.3.10-1ubuntu3.6+wmf1 [18:08:02] Expires: Thu, 01 Jan 1970 00:00:00 GMT [18:08:02] Cache-Control: private, must-revalidate, max-age=0 [18:08:07] No caching at all then... [18:08:38] jeremyb and LeslieCarr: Not sure what multicast is but will check it out [18:08:50] !log db1004 removing/replacing disk 2 [18:08:51] usually purges are sent to a multicast address [18:08:53] wikitech should have naturally much more logged in users than a typical wiki? 
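As described above, the Wikimedia cluster relays MediaWiki purges over multicast HTCP (with a listener such as varnishhtcpd turning them into Varnish purges); for a single-server setup like jazzybee's, the usual missing piece is plain HTTP PURGE handling in VCL. A generic Varnish 3 sketch along the lines of the manual page he linked (the ACL entry assumes MediaWiki runs on the same host):

```vcl
# Varnish 3: accept HTTP PURGE requests from trusted hosts
acl purgers {
    "127.0.0.1";   # assumption: MediaWiki on the same machine
}

sub vcl_recv {
    if (req.request == "PURGE") {
        if (!client.ip ~ purgers) {
            error 405 "Purging not allowed.";
        }
        return (lookup);
    }
}

sub vcl_hit {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged.";
    }
}

sub vcl_miss {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged.";
    }
}
```

MediaWiki only emits purges for caches it knows about, e.g. servers listed in $wgSquidServers in LocalSettings.php, so that side needs checking too.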
[18:09:01] https://en.wikipedia.org/wiki/Multicast [18:09:03] (03PS2) 10Ottomata: Upping raid1-lvm partman recipe to 50G maxsize [operations/puppet] - 10https://gerrit.wikimedia.org/r/86141 [18:09:03] Logged the message, Master [18:09:08] which then the varnish machines should be listening in on [18:09:08] (03CR) 10Ottomata: [C: 032 V: 032] Upping raid1-lvm partman recipe to 50G maxsize [operations/puppet] - 10https://gerrit.wikimedia.org/r/86141 (owner: 10Ottomata) [18:09:14] (03PS2) 10Ottomata: Adding new ulsfo lvs4* and cp4* hosts to netboot.cfg. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86142 [18:10:25] so, rack 3.02's interfaces weren't up ? [18:10:27] (03CR) 10Ottomata: [C: 032 V: 032] Adding new ulsfo lvs4* and cp4* hosts to netboot.cfg. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86142 (owner: 10Ottomata) [18:10:45] Alex left [18:11:20] LeslieCarr: don't know what alex was looking at [18:11:24] ok [18:11:29] i'm assuming some machines weren't installing ? [18:12:00] we haven't tried yet :) [18:12:09] i am just about to [18:12:20] ok, lemme check the interfaces first [18:12:20] :) [18:12:23] (03PS1) 10BBlack: Replace spoof_clientip and ACLs with netmapper [operations/puppet] - 10https://gerrit.wikimedia.org/r/86148 [18:12:57] (03CR) 10BBlack: [C: 04-2] "Immediate -2 so nobody accidentally merges this at this time." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/86148 (owner: 10BBlack) [18:13:13] k [18:13:45] yeah, it's not configured yet on the second rack of stuff [18:14:25] lemme get those ports up, but if you want https://racktables.wikimedia.org/index.php?page=rack&rack_id=2004 is all ready [18:14:33] (03CR) 10Reedy: [C: 032] All wikipedias to 1.22wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86145 (owner: 10Reedy) [18:14:34] !log db1006 replacing disk 1 [18:14:42] (03Merged) 10jenkins-bot: All wikipedias to 1.22wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86145 (owner: 10Reedy) [18:14:46] Logged the message, Master [18:15:48] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.22wmf18 [18:15:59] Logged the message, Master [18:26:00] PROBLEM - Puppet freshness on mw1072 is CRITICAL: No successful Puppet run in the last 10 hours [18:26:32] (03CR) 10Reedy: [C: 032] testwiki, test2wiki, loginwiki, testwikidatawiki and mediawikiwiki to 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86146 (owner: 10Reedy) [18:26:41] (03Merged) 10jenkins-bot: testwiki, test2wiki, loginwiki, testwikidatawiki and mediawikiwiki to 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86146 (owner: 10Reedy) [18:34:46] hmm, LeslieCarr, I'm having trouble getting lvs4001 to boot [18:35:20] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 18:35:15 UTC 2013 [18:35:23] oh ? [18:35:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [18:35:32] what's the issue ? 
[18:35:46] maybe it wants high heels instead of boots [18:39:52] lesliecarr: wanna look into 5807 [18:40:38] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki, test2wiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf19 [18:40:48] Logged the message, Master [18:41:48] yeah, sorry (i'm half in a meeting) [18:41:53] lemme try rebooting one more time [18:42:01] but it seems the box just hangs with a blank console [18:42:07] LeslieCarr: Where do they have Tandori fish? [18:43:27] cmjohnson1: still down [18:43:41] kaldari: mehfil on 2nd has it today (they maybe have it once a week) and chaat always has it [18:43:44] on 3rd and folsom [18:45:21] LeslieCarr: Which mehfil on 2nd? The one up the hill or down the hill or both? [18:45:25] FYI: mediawiki.org is returning errors rather than page content. [18:45:34] um [18:45:38] the one closer to market [18:45:47] uh oh [18:45:49] ahhh hm [18:45:52] ok getting more info on brewseter [18:45:57] DHCPDISCOVER from 90:b1:1c:42:f0:9e via 10.128.0.3: network 10.128.0/24: no free leases [18:46:08] ohhh maybe alex didn't put the ips in [18:46:11] oh ithought he had [18:46:13] checking [18:46:34] lesliecarr: this is an easy fix...there isn't a link to 4/3/0 the link i have is xe-4/3/3 fiber #3456 [18:46:52] do you want to keep it or move it? [18:46:56] oh [18:46:57] haha [18:47:15] um, move to 4/3/0 please to match the other links ? [18:47:29] and now i run away to lunch before the crowds hit... [18:47:43] ottomata: make sure forward and reverse match [18:47:47] that's an often problem with that [18:48:11] k [18:49:13] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: revert 1.22wmf19 wikis [18:49:23] Logged the message, Master [18:50:43] LeslieCarr: won't forward/reverse not show up until later? 
(puppet signing time iirc) [18:51:28] anyway, bon appetit :) [18:52:13] !log reedy synchronized php-1.22wmf19/extensions/GlobalBlocking [18:52:26] Logged the message, Master [18:52:58] 2 Warning: Missing boundary in multipart/form-data POST data in Unknown on line 0 [18:53:12] LeslieCarr: in the wmnet file [18:53:12] !log reedy synchronized php-1.22wmf19/extensions/TorBlock [18:53:22] whare are the WMF3088 entries? [18:53:22] Logged the message, Master [18:53:26] WMF3471 1H IN A 10.65.3.85 [18:53:27] etc. [18:53:43] dc server names? [18:55:05] generic names for before they're allocated to a task? a la foo, bar, baz. (don't quote me :P) [18:55:44] Reedy: is that just | sort | uniq -c | sort -n ? or something more advanced? [18:57:09] watch "tail -n 1000 /home/wikipedia/syslog/apache.log | grep -v -i 'swift' | grep 'PHP\|Segmentation fault' | grep -v 'filemtime\|failed to mkdir\|GC cache entry\|cache slam averted\|SHA-1 metadata' | sed -r 's/\[notice\] child pid [0-9]+ exit signal //g' | sed 's/, referer.*$//g' | cut -d ' ' -f 7- | sort | uniq -c | sort -rn" [18:58:20] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:23] (03PS1) 10Ottomata: Adding entries for new ulsfo lvs4* and cp4* hosts. [operations/dns] - 10https://gerrit.wikimedia.org/r/86156 [18:59:31] LeslieCarr: could you check that for me? [18:59:40] i'm only 75% I know what I'm doing there [18:59:41] ottomata: lunch i think [18:59:44] ohp k [19:01:34] others having load issues? [19:01:41] https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception>ype=stack&glegend=show&aggregate=1&embed=1 seems to say yes .... [19:01:57] ori-l: waiting on anything else from me for gdash? [19:01:58] eek [19:02:06] I think I need to create a directory on tin maybe [19:03:00] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [19:03:04] done. 
I really need to replace the perl frontend with sartoris [19:03:30] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 19:03:20 UTC 2013 [19:03:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:05:10] RECOVERY - RAID on db1006 is OK: OK: State is Optimal, checked 2 logical device(s) [19:07:44] PROBLEM - Host payments1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:09:32] !log reedy synchronized php-1.22wmf19/includes/ [19:09:45] Logged the message, Master [19:10:24] RECOVERY - Host payments1003 is UP: PING OK - Packet loss = 0%, RTA = 3.78 ms [19:17:12] The wikis are incredibly slow and bouncy, do we have any sense on what's up? [19:17:57] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki, test2wiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf19 [19:18:11] Logged the message, Master [19:18:13] 117 Exception from line 316 of /usr/local/apache/common-local/php-1.22wmf19/includes/MagicWord.php: Error: invalid magic word 'revisionsize' [19:19:22] ffs [19:19:24] Error: invalid magic word 'revisionsize' [19:19:30] 1 Warning: strcmp() expects parameter 1 to be string, array given in /usr/local/apache/common-local/php-1.22wmf19/includes/GlobalFunctions.php on line 134 [19:19:33] @ tes2 [19:19:53] im getting fatals on mw.org [19:19:58] but i presume that's related to the conversation above [19:20:18] looks like [19:20:26] mw.o was working yet 3 mins ago [19:20:34] https://git.wikimedia.org/commit/mediawiki%2Fcore.git/35065c9db563349bf5f84c9c1cc1853f7e37431b [19:20:45] Because I changed it back to wmf19 [19:20:46] Read up [19:21:16] * Reedy scaps [19:23:27] !log reedy Started syncing Wikimedia installation... 
: revisionsize [19:23:30] for what it's worth I'm having load issues on all sites, not just the ones that were switched to wmf19 (though those are the only ones with fatals so far) [19:23:37] Logged the message, Master [19:23:42] me wanders off to get lunch since meta won't load [19:23:58] meta WFM [19:24:15] sometimes it loads sometimes it fails massively, but I've had issues on meta, commons and en [19:24:19] the past 20 minutes has been hell [19:24:26] at least 3 staff members nearby have had similar issues [19:24:38] so either an internal wmf internet issue (though other sites are fine) or a site problem [19:24:49] Nothing has changed at all on commons or meta [19:25:14] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [19:25:15] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [19:25:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [19:25:23] given the timing reports on status.wikimedia I'm thinking it isn't just us [19:25:43] (03PS1) 10RobH: reclaiming servers: arsenic, niobium, palladium, strontium [operations/puppet] - 10https://gerrit.wikimedia.org/r/86158 [19:25:56] ottomata i will check the servers [19:26:12] mw.org is less breaky for me now [19:26:13] Reedy: where are we? did we go back now?
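Reedy's `watch` one-liner quoted earlier is, at its core: filter the PHP errors out of apache.log, strip the variable parts with sed, then count duplicates with `sort | uniq -c | sort -rn`. The counting step, sketched in Python (the sample lines are invented stand-ins for normalized log entries):

```python
from collections import Counter

def tally(lines):
    """Count identical (pre-normalized) log lines, most frequent first —
    the `sort | uniq -c | sort -rn` step of the pipeline."""
    return Counter(lines).most_common()

# Invented sample lines standing in for normalized apache.log entries:
errors = [
    "Error: invalid magic word 'revisionsize'",
    "Error: invalid magic word 'revisionsize'",
    "Warning: strcmp() expects parameter 1 to be string, array given",
]

for line, count in tally(errors):
    print(count, line)
```

The normalization (the grep/sed/cut stages) matters as much as the counting: without it, timestamps and PIDs make every line unique and nothing aggregates.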
[19:26:21] people still complaining in -tec [19:26:22] h [19:26:27] spoke too soon [19:26:33] wmf internal netowkr wouldn't cause errors, it would cause not being able to reach the page (aka timeouts) [19:26:56] (03CR) 10RobH: [C: 032] reclaiming servers: arsenic, niobium, palladium, strontium [operations/puppet] - 10https://gerrit.wikimedia.org/r/86158 (owner: 10RobH) [19:28:41] though i am seeing a bit of p-loss going on [19:28:58] i have heard rumor of some networks having issues due to the fiber cut … i'm going to try some moving traffic around a little [19:29:01] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: put loginwiki back to 1.22wmf18 [19:29:12] Logged the message, Master [19:29:18] though that shouldn't have affected my path [19:29:24] because we're going via tampa… eww [19:30:14] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:30:15] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:30:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:33:03] manybubbles: heading to a cafe, back on in a bit for search sync up thang [19:34:34] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 19:34:26 UTC 2013 [19:34:52] !log reedy Finished syncing Wikimedia installation... 
: revisionsize [19:35:05] Logged the message, Master [19:35:14] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:35:15] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:35:16] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:35:24] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:39:07] (03PS6) 10Physikerwelt: Taylor LaTeXML to Mediawiki needs [operations/debs/latexml] - 10https://gerrit.wikimedia.org/r/85646 [19:39:32] git review is being slowtoday. [19:39:35] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: loginwiki back to 1.22wmf19 [19:39:48] Logged the message, Master [19:40:14] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:40:15] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:40:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:40:34] PROBLEM - DPKG on labstore4 is CRITICAL: Timeout while attempting connection [19:42:04] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [19:42:54] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [19:43:06] fyi the payments* mysql stuff is me, and fixed [19:43:17] ....why is my git review just halting... 
wtf computer
[19:43:21] getting rid of the performance_schema db
[19:44:04] (03CR) 10Physikerwelt: [C: 032 V: 032] Taylor LaTeXML to Mediawiki needs [operations/debs/latexml] - 10https://gerrit.wikimedia.org/r/85646 (owner: 10Physikerwelt)
[19:44:44] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 19:44:41 UTC 2013
[19:45:14] RECOVERY - check_mysql on payments1003 is OK: Uptime: 2240 Threads: 2 Questions: 17042 Slow queries: 60 Opens: 661 Flush tables: 1 Open tables: 46 Queries per second avg: 7.608 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[19:45:15] RECOVERY - check_mysql on payments1002 is OK: Uptime: 1857 Threads: 2 Questions: 15031 Slow queries: 49 Opens: 662 Flush tables: 1 Open tables: 46 Queries per second avg: 8.094 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[19:45:16] RECOVERY - check_mysql on payments1004 is OK: Uptime: 1307 Threads: 2 Questions: 6617 Slow queries: 11 Opens: 654 Flush tables: 1 Open tables: 46 Queries per second avg: 5.062 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[19:45:24] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[19:47:58] (03PS1) 10Physikerwelt: Add install script :wq# Please enter the commit message for your changes. Lines starting [operations/debs/latexml] - 10https://gerrit.wikimedia.org/r/86164
[19:48:55] !log reedy synchronized php-1.22wmf19/extensions/GlobalBlocking/
[19:49:06] Logged the message, Master
[19:53:07] LeslieCarr: you busy with things or can you check this over for me? https://gerrit.wikimedia.org/r/#/c/86156/
[19:53:28] greg-g: so we pushed an update to cirrus that should fix a bug that affects something like 3000 articles. is it ok if I reindex them sometime soon?
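For context on the payments100x alerts in the log above: the CRITICAL state is specifically "Slave IO: Yes Slave SQL: No", i.e. the replication SQL thread has stopped while the IO thread is still pulling from the master, and the RECOVERY state is both threads back at "Yes". A minimal sketch of that health condition, parsing the alert text as the monitoring lines print it (this is not the actual check_mysql plugin, and the helper name is hypothetical):

```python
# Sketch only: decide replication health from "Slave IO: ... Slave SQL: ..."
# fields as they appear in the alert lines above. Field names follow
# MySQL's SHOW SLAVE STATUS output; slave_status_ok() is a made-up helper.
import re

def slave_status_ok(status_line: str) -> bool:
    """Return True only if both replication threads report Yes."""
    io = re.search(r"Slave IO:\s*(\w+)", status_line)
    sql = re.search(r"Slave SQL:\s*(\w+)", status_line)
    return bool(io and sql and io.group(1) == "Yes" and sql.group(1) == "Yes")

critical = "Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)"
recovered = "Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0"
print(slave_status_ok(critical))   # the CRITICAL state above -> False
print(slave_status_ok(recovered))  # the RECOVERY state -> True
```

Note how "Seconds Behind Master: (null)" accompanies the broken state: with the SQL thread stopped, lag cannot be computed, which is why the value only returns to 0 once both threads run again.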
[19:53:37] I only have to do the affected ones
[20:04:04] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 20:04:00 UTC 2013
[20:04:24] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[20:06:21] ottomata, search meeting?
[20:06:26] oop danke
[20:06:29] time snuck up on me
[20:07:50] (03CR) 10Andrew Bogott: "tested on a VM, seems safe." [operations/puppet] - 10https://gerrit.wikimedia.org/r/86136 (owner: 10Andrew Bogott)
[20:08:43] ottomata: yes, sorry, was meaning to check dns but kept getting stupid tube failures
[20:09:34] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[20:10:44] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[20:10:52] hehe, s'ok
[20:13:24] PROBLEM - Apache HTTP on mw1085 is CRITICAL: Connection refused
[20:14:08] (03CR) 10Lcarr: [C: 032] Adding entries for new ulsfo lvs4* and cp4* hosts. [operations/dns] - 10https://gerrit.wikimedia.org/r/86156 (owner: 10Ottomata)
[20:14:25] RECOVERY - Apache HTTP on mw1085 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time
[20:14:34] ottomata: want to do the dns push?
[20:19:45] RECOVERY - RAID on db1004 is OK: OK: State is Optimal, checked 2 logical device(s)
[20:22:40] (03PS6) 10Reedy: Move logs to /var/log/mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/83574
[20:23:02] damn it
[20:26:25] (03PS7) 10Reedy: Move logs to /var/log/mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/83574
[20:33:55] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Sep 26 20:33:48 UTC 2013
[20:34:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[20:38:47] yurik: you still planning on using the LD today?
[20:38:48] (03PS1) 10RobH: reclaiming old hosts, removing cruft from decom list [operations/puppet] - 10https://gerrit.wikimedia.org/r/86166
[20:38:58] greg-g, yep
[20:39:02] manybubbles: yeah, how long does that take?
[20:39:26] yurik: cool, just checking before I add it to the calendar
[20:39:56] (03CR) 10RobH: [C: 032] reclaiming old hosts, removing cruft from decom list [operations/puppet] - 10https://gerrit.wikimedia.org/r/86166 (owner: 10RobH)
[20:42:56] greg-g: around 10 minutes
[20:45:58] manybubbles: wanna do it during the LD today? :)
[20:46:46] greg-g: 4pm your time?
[20:47:14] ok, LeslieCarr, getting somewhere, maybe.
[20:47:16] i got this far
[20:47:27] Serving precise-installer/ubuntu-installer/amd64/pxelinux.0 to 10.128.0.12:2070
[20:47:27] Serving precise-installer/ubuntu-installer/amd64/pxelinux.0 to 10.128.0.12:2071
[20:47:39] not much movement after that
[20:47:58] i'm doing lvs4002 right now, i think the lvs4001 entry is cached as not found, not sure
[20:47:59] manybubbles: oh yeah, want ^d to do it?
[20:48:00] oh!
[20:48:03] Sep 26 20:47:57 bast4001 atftpd[29865]: atftpd terminating after 300 seconds
[20:48:04] Sep 26 20:47:57 bast4001 atftpd[29865]: Main thread exiting
[20:48:04] hm
[20:48:43] <^d> greg-g: Sup?
[20:48:58] greg-g: I wonder if I'll have wifi where I'm going? If so I can do it.
[20:49:12] ^d: I wanted to update the ~3000 pages your commit for
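The bast4001 syslog excerpt quoted above shows the shape of the PXE-install stall being debugged: atftpd serves pxelinux.0 to the booting host, then exits after a quiet period ("terminating after 300 seconds"), so the daemon going away is distinct from a transfer failure. A small sketch of spotting that idle exit when scanning syslog-style lines (the log lines are copied from the transcript; the function name is a made-up illustration, not a real tool):

```python
# Hypothetical log-scanning sketch: find atftpd's idle-timeout exit
# in syslog-style lines, like the bast4001 excerpt quoted above.
import re

LOG = [
    "Sep 26 20:47:27 bast4001 atftpd[29865]: Serving precise-installer/ubuntu-installer/amd64/pxelinux.0 to 10.128.0.12:2070",
    "Sep 26 20:47:57 bast4001 atftpd[29865]: atftpd terminating after 300 seconds",
    "Sep 26 20:47:57 bast4001 atftpd[29865]: Main thread exiting",
]

def atftpd_idle_exit(lines):
    """Return the idle timeout in seconds if atftpd exited, else None."""
    for line in lines:
        m = re.search(r"atftpd terminating after (\d+) seconds", line)
        if m:
            return int(m.group(1))
    return None

print(atftpd_idle_exit(LOG))  # 300
```

The useful signal here is that the termination message follows successful "Serving ... pxelinux.0" lines: the TFTP side did its part, and the stall lies after the bootloader fetch rather than in the transfer itself.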