[00:03:26] any takers on https://gerrit.wikimedia.org/r/#/c/85796/ ? kaldari also waiting on a puppet merge, see above.
[00:18:27] Krinkle I hope I didn't offend you with that remark, that wasn't my intention :(
[00:19:35] ToAruShiroiNeko: No worries.
[00:19:59] ToAruShiroiNeko: So which QR encoding do you recommend we use? version, % correction and max length it gives us
[00:21:48] that's a matter of debate. I can suggest the shorter the better approach; others may argue error correction would be needed.
[00:22:03] I would go for version 1 or 2
[00:22:15] it all depends on how much error correction we are going for
[00:22:33] We'd shoot for 25 characters?
[00:23:04] I think both options should be put on the table
[00:23:15] ver 1 25 chars - 7% error correction
[00:23:28] ver 2 27 chars - 25% error correction
[00:23:42] sorry
[00:23:47] ver 2 29 chars - 25% error correction
[00:24:02] Right
[00:24:11] ver 1 and 2 don't differ too much in complexity
[00:24:18] 21 vs 25
[00:24:42] it all depends on how short the url is
[00:25:19] to be honest ver2 has one other advantage
[00:25:42] it can handle max 47 chars which can be used to link individual diffs between versions
[00:26:15] as a future improvement of the URL Shortening System
[00:27:19] shorter the url the better though. It almost entirely depends on how short the non encoded part is
[00:28:31] http://wmf.ly/ - 14 chars https://wmf.ly/ - 15 chars
[00:28:46] ToAruShiroiNeko: In the initial proposal, how would we distinguish the project hash from the page hash?
[00:29:03] if we go for ppp instead of pcc, would we pad it? or add a separator?
[00:29:17] I think using 0's would do the trick
[00:29:27] 001 010 100 etc
[00:29:38] Are they not ambiguous in base36 though?
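Editor's sketch of the padding question raised just above: if the project part is always zero-padded to a fixed width, a code can be split back into project and page without a separator. This is only an illustration, not the RFC's scheme; the three-character width (the "ppp" layout), the lowercase alphabet, and the names `to_base36`, `encode` and `decode` are assumptions.

```python
import string

# Assumed alphabet: digits then lowercase letters, 36 symbols in total.
ALPHABET = string.digits + string.ascii_lowercase

# Assumed fixed width of the project part (the "ppp" layout from the chat).
PROJECT_WIDTH = 3


def to_base36(number: int) -> str:
    """Convert a non-negative integer to a base36 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def encode(project_id: int, page_id: int) -> str:
    """Zero-pad the project part to a fixed width, then append the page part."""
    project = to_base36(project_id).rjust(PROJECT_WIDTH, "0")
    if len(project) > PROJECT_WIDTH:
        raise ValueError("project id does not fit in the fixed width")
    return project + to_base36(page_id)


def decode(code: str) -> tuple[int, int]:
    """Split on the fixed width; int() ignores the leading zeros."""
    return int(code[:PROJECT_WIDTH], 36), int(code[PROJECT_WIDTH:], 36)


print(encode(1, 371))    # 001ab
print(decode("001ab"))   # (1, 371)
```

The padding only matters for the first field: because its width is known, the remainder of the string is the page part, which can stay unpadded.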
[00:29:41] separators would be a waste of chars in the long run
[00:29:44] well yes
[00:29:46] indeed
[00:29:47] 00A
[00:29:50] 00Z
[00:29:57] then 010
[00:30:24] I like your idea better than mine in that regard
[00:30:30] rather than PLL you had WWW in mind
[00:30:38] less messy
[00:30:55] hm.. base_convert tells me 0 and 000 both convert to 0
[00:30:59] (36 to 10)
[00:31:01] didn't know that
[00:31:11] well yes
[00:31:20] 00001 decimal is still a 1 :p
[00:31:39] same trick is usable for the page code
[00:32:07] it will have one or two 0's for most projects as 4 base 36 chars should be sufficient for vast majority of the projects
[00:32:32] we could choose not to worry about those though
[00:32:48] we don't have to pad the pageid hash
[00:32:54] yeah
[00:33:05] it may be useful if we want to do more with it
[00:33:16] if we do pageid hashing, we won't be able to use it for special pages, no?
[00:33:33] we can get crafty for that
[00:33:45] the code will never use %s
[00:33:59] %code for special page would be a possibility
[00:34:12] WWW%CCCCC for example
[00:34:37] we can also use the space character
[00:34:38] according to http://qrcode.meetheed.com/question3.php QR codes can have a max of 4,296 chars
[00:34:48] so I don't know why we need to couple the two?
[00:35:05] YuviPanda do you know how that looks?
[00:35:14] how what looks?
[00:35:17] 2D QR codes?
[00:35:31] no a 4296 char QR code
[00:35:39] https://upload.wikimedia.org/wikipedia/commons/e/eb/Qr-code-ver-40.png
[00:35:43] it's scary
[00:35:52] ah
[00:36:07] as you add more chars it becomes... intense
[00:36:47] shrug, is it really that bad? It's not like you are reading them by hand
[00:37:03] well, maybe someone does. Probably the same people that wear binary watches :)
[00:37:15] on a small screen it would be impractical, on a print out with a smaller scaling it wouldn't work at all
[00:37:36] my phone can't read the image I linked for example on a 17 inch screen
[00:37:48] this thing is probably larger than a 17 inch
[00:38:35] 27 inch
[00:40:18] Krinkle I am getting the sense that you aren't trying to shorten stuff such as diffs?
[00:41:27] YuviPanda: Nobody is talking about pageid hashing, as explained in the RFC, we'd use the shorturl table, which has its own pageid (that's the pageid we are talking about)
[00:41:40] which maps to a plain namespace and page title
[00:41:42] kaldari: sure, but that won't help for special pages and stuff
[00:41:47] or fragments
[00:41:48] (CR) Ryan Lane: [C: 2] Add Ganglia view for static asset payload size [operations/puppet] - https://gerrit.wikimedia.org/r/85796 (owner: Ori.livneh)
[00:41:50] which can contain special pages, too, though right now it doesn't
[00:42:09] special pages can be given some sort of special coding
[00:42:10] special pages with query params?
[00:42:32] YuviPanda well don't expect short url to fix all of life's problems
[00:42:32] hmm, they can be considered to go together with 'page title'
[00:42:50] Also, some things aren't problems.
[00:43:13] I don't think we're concerned within this scope about fragments or query parameters.
[00:43:38] ToAruShiroiNeko: https://www.mediawiki.org/wiki/Requests_for_comment/URL_shortener#Wiki_identifier_.282.29:_Map_wiki-id
[00:44:11] I haven't changed it from PCC to PPP yet
[00:44:19] Krinkle I see
[00:44:22] Can you review that?
[00:44:35] sure
[00:44:43] I'm just hoping we don't end up with a different set of shorturl / encoding for pages, other things, etc
[00:45:34] Krinkle you want me to edit?
[00:45:39] sure
[00:45:42] yay!
[00:48:24] what is the SQL column name of the project id?
[00:48:50] There exists none
[00:48:52] YuviPanda: https://www.mediawiki.org/wiki/Requests_for_comment/URL_shortener#Full_url_mapping
[00:49:01] oh so we would map it.
[00:49:29] +1 to that, Krinkle
[00:49:56] hash, stuff in DB, report base36. done
[00:50:29] we could even just do something simpler, like hash, find shortest prefix of hash itself that is usable, and return that.
[00:50:31] that sounds horrendously slow though
[00:50:38] nevermind I said that.
[00:51:05] we can do this trivially, since we can just perma redirect them, and also heavily cache them
[00:51:20] I can even offer to write one up on labs, with the requisite caching...
[00:53:18] (PS1) Ori.livneh: Use static_assets Ganglia view on nickel [operations/puppet] - https://gerrit.wikimedia.org/r/85807
[00:54:10] in fact, I bet this could be trivially done with just a nginx module.
[00:54:47] Sure
[00:54:59] But it all comes down to pros and cons for implementation
[00:55:28] there's lots of shorturl libs that are ready for use. But there's a couple catches. For one, it appears we want to avoid needing ondemand generation.
[00:55:41] Krinkle: Extension:ShortUrl does on demand generation
[00:55:48] eg. have a deterministic url for simple shortening that can be exposed on the page by a gadget.
[00:56:06] and on demand generation in its case involves a couple of db calls
[00:56:21] Yes, but that's somewhat different
[00:56:32] how so?
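Editor's sketch of the "hash, find shortest prefix of hash itself that is usable" idea floated above. It is not how Extension:ShortUrl or the RFC works; the in-memory dictionaries stand in for a database table, and the class name `PrefixShortener`, the SHA-256 choice and the minimum length of 4 are assumptions for illustration.

```python
import hashlib
import string

ALPHABET = string.digits + string.ascii_lowercase  # assumed base36 alphabet


def to_base36(number: int) -> str:
    """Convert a non-negative integer to a base36 string."""
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


class PrefixShortener:
    """Hypothetical shortener: hash the URL, keep the shortest free prefix."""

    def __init__(self, min_length: int = 4):
        self.min_length = min_length
        self.by_code = {}  # code -> url (stand-in for a DB table)
        self.by_url = {}   # url -> code

    def shorten(self, url: str) -> str:
        if url in self.by_url:
            return self.by_url[url]
        digest = int(hashlib.sha256(url.encode("utf-8")).hexdigest(), 16)
        full = to_base36(digest)
        # Each candidate prefix costs one lookup, which is the on-demand
        # cost the chat worries about.
        for length in range(self.min_length, len(full) + 1):
            code = full[:length]
            if code not in self.by_code:
                self.by_code[code] = url
                self.by_url[url] = code
                return code
        raise RuntimeError("no free prefix found")

    def resolve(self, code: str):
        return self.by_code.get(code)


shortener = PrefixShortener()
code = shortener.shorten("https://www.mediawiki.org/wiki/Requests_for_comment/URL_shortener")
print(code, shortener.resolve(code))
```

Unlike a deterministic wiki-id plus page-id code, the result here depends on what is already stored, so a gadget could not compute it client-side without asking the service.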
[00:56:32] also, we'd probably want to normalise the protocol
[00:57:19] (CR) Ryan Lane: [C: 2] Use static_assets Ganglia view on nickel [operations/puppet] - https://gerrit.wikimedia.org/r/85807 (owner: Ori.livneh)
[00:57:22] Krinkle I am adding some images too
[01:02:15] ec :o
[01:02:50] (PS1) Asher: removing self from icinga [operations/puppet] - https://gerrit.wikimedia.org/r/85808
[01:03:42] (CR) Asher: [C: 2 V: 2] removing self from icinga [operations/puppet] - https://gerrit.wikimedia.org/r/85808 (owner: Asher)
[01:03:48] ToAruShiroiNeko: an ec is usually resolved by resolving it, not by replacing it with your version
[01:04:57] k, done
[01:05:07] sure but it was too much work to do in one go for me :p
[01:05:29] I know it's sloppy
[01:05:41] but in the end I end up retaining everything
[01:08:02] Krinkle your example has a problem tho
[01:08:14] http://wmf.co/wenav
[01:08:19] should be encided
[01:08:22] *encoded
[01:08:33] already done
[01:08:34] https://www.mediawiki.org/wiki/Requests_for_comment/URL_shortener#Wiki_identifier_.282.29:_Map_wiki-id
[01:09:24] "av" pageid hash of shorturl
[01:13:38] https://www.mediawiki.org/w/index.php?title=Requests_for_comment%2FURL_shortener&diff=789369&oldid=789368
[01:14:16] ya
[01:14:29] I was trying to merge the two tables
[01:14:36] is this or that better?
[01:14:46] Krinkle: why is pre-generating in Extension:ShortUrl better than just hashing the URL?
[01:15:29] YuviPanda isn't it longer?
[01:15:30] I can't answer to "better" as I think that would be impossible to debate about at this point
[01:15:32] however
[01:15:45] It would allow the forwarder to be dumb
[01:16:06] e.g. not have a database other than the wiki domain map
[01:16:38] and it would allow cheap url generation from both mediawiki javascript and php
[01:16:39] hmm, so is it just a matter of 'the code is already written, why not just use as much of it as possible?'
[01:16:39] ?
[01:16:47] No, quite the contrary
[01:17:29] there is the big downside of not supporting special pages.
[01:17:39] That statement is incorrect.
[01:17:53] as listed in the RFC, the downside is not supporting query parameters.
[01:17:58] special pages work just fine.
[01:18:03] right.
[01:18:41] Krinkle I know this is nitpicking but perhaps we should avoid making en 1st id on this :)
[01:18:41] that's what I had in my head (query params in special pages and elsewhere), and that translated out wrong onto the keyboard :P
[01:18:50] the ShortUrl extension doesn't support special pages right now, but that existing extension means nothing, this RFC is separate. If we end up using ShortUrl as a base, this would be the first bug to fix, and quite an easy one to fix at that.
[01:19:02] righ
[01:19:02] t
[01:20:01] I suppose short url itself can take parameters
[01:20:11] sure
[01:20:20] but that makes the url longer
[01:20:40] say I want to review indef blocked IPs on my iPad (not that I own one :P)
[01:20:52] I can just snap the qr code and fill in the checkboxes by hand
[01:21:21] having that handy while typing out my remark
[01:21:32] https://www.mediawiki.org/w/index.php?title=Requests_for_comment%2FURL_shortener&diff=789375&oldid=789373
[01:22:03] ya dif magic is fine
[01:22:10] tho I'd put the mapping on the other side
[01:22:14] *div
[01:22:21] you put it in this order
[01:22:25] yes I know
[01:22:38] merely a suggestion :)
[01:22:44] I can't type today :o
[01:23:09] Krinkle I think it may be useful to consider putting possible uses for the qr code as well as the shortening idea
[01:23:58] and you may want to mention wmuk trying to acquire qrpedia thing
[01:24:06] https://uk.wikimedia.org/wiki/Water_cooler#QRpedia_update
[01:26:08] Krinkle maybe images may look better on the left, they look like an afterthought :/
[01:49:53] (PS1) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814
[01:49:59] ori-l: ^
[01:50:04] initial stab.
[01:50:07] doing more
[01:50:18] * YuviPanda finds a host to est
[01:50:20] *test
[01:53:08] I read that as finds a host to eat
[01:53:30] heh
[01:53:33] (PS2) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814
[01:53:40] bd808|BUFFER: ^
[01:56:52] YuviPanda: Cool.
[01:57:05] (PS3) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814
[01:59:15] (PS4) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814
[02:04:28] (PS5) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814
[02:06:37] (PS6) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814
[02:15:01] !log LocalisationUpdate completed (1.22wmf18) at Tue Sep 24 02:15:01 UTC 2013
[02:15:18] Logged the message, Master
[02:16:20] ori-l: so once I fix all the errors that crop up there, I'm going to have a very simple enable-role disable-role interface... somewhere.
[02:16:32] ori-l: and then setup a cron that applies the vagrant puppet stuff every 30min
[02:21:28] !log LocalisationUpdate completed (1.22wmf17) at Tue Sep 24 02:21:27 UTC 2013
[02:21:41] Logged the message, Master
[02:40:39] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Sep 24 02:40:38 UTC 2013
[02:40:51] Logged the message, Master
[02:47:06] (PS7) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814
[02:48:57] (PS8) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814
[02:55:28] (PS9) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814
[02:59:08] (PS10) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814
[03:12:11] (PS1) Springle: add db1036 to s2 [operations/puppet] - https://gerrit.wikimedia.org/r/85816
[03:14:01] (CR) Springle: [C: 2] add db1036 to s2 [operations/puppet] - https://gerrit.wikimedia.org/r/85816 (owner: Springle)
[03:16:23] (PS1) Springle: warm up db1036 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85817
[03:17:16] (CR) Springle: [C: 2] warm up db1036 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85817 (owner: Springle)
[03:17:56] !log springle synchronized wmf-config/db-eqiad.php
[03:18:13] Logged the message, Master
[03:40:16] (PS1) Springle: depool db1002 for upgrade, db1036 take up slack [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85818
[03:41:05] (CR) Springle: [C: 2] depool db1002 for upgrade, db1036 take up slack [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85818 (owner: Springle)
[03:42:04] !log springle synchronized wmf-config/db-eqiad.php 'depool db1002 for upgrade, db1036 take up slack'
[03:42:27] Logged the message, Master
[04:57:47] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 10 hours
[06:09:54] (PS1) Springle: icinga auth [operations/puppet] - https://gerrit.wikimedia.org/r/85825
[06:12:04] (CR) Springle: [C: 2] icinga auth [operations/puppet] - https://gerrit.wikimedia.org/r/85825 (owner: Springle)
[06:13:15] <^demon|zzz> !log ES: enwikisource & cawiki done indexing. Really need to revisit batch indexing, way too slow.
[06:13:35] Logged the message, Master
[06:16:34] I guess I'm the only one who constantly reads "ES" as "ExternalStore."
[06:16:56] <^demon|zzz> Yeah, I should probably stop doing that.
[06:17:14] <^demon|zzz> Anyway, zzz.
[06:22:34] PROBLEM - Puppet freshness on mw1072 is CRITICAL: No successful Puppet run in the last 10 hours
[06:36:23] !log upgrading db1002 to precise and mariadb
[06:36:36] Logged the message, Master
[07:05:50] (PS1) Springle: set coredb db1002 to mariadb [operations/puppet] - https://gerrit.wikimedia.org/r/85829
[07:06:48] (CR) Springle: [C: 2] set coredb db1002 to mariadb [operations/puppet] - https://gerrit.wikimedia.org/r/85829 (owner: Springle)
[07:38:21] anyone around?
[07:38:27] Labs NFS seems to be killed again
[07:38:31] * YuviPanda pokes apergos, Coren
[07:39:17] * YuviPanda wonders who else he can poke
[07:42:59] * YuviPanda pokes springle
[07:43:21] !log upgrading Jenkins on gallium from 1.509.2 to 1.509.3
[07:43:34] Logged the message, Master
[07:43:51] YuviPanda: ?
[07:43:54] oh
[07:44:02] springle: labs NFS seems locked up.
[07:44:10] looking for ops people who are awake...
[07:45:11] http://ganglia.wmflabs.org/latest/?c=tools&m=load_one&r=hour&s=by%20name&hc=4&mc=2 seems not great, tho i've no idea how to properly read it
[07:45:47] things worked like 30 min ago
[07:57:35] labstore4 seems completely locked up
[07:57:56] springle: would restarting it help?
[07:58:24] about to see if i have access to do that...
[08:01:22] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Labs%2520NFS%2520cluster%2520pmtpa&tab=m&vn=
[08:01:31] I guess that is labstore3 having some weird nfs issue
[08:02:07] whelp
[08:02:42] Ariel did attempt to restart the NFS box on Sept10 around 4pm UTC
[08:02:55] has it been two weeks since?
[08:03:01] shit
[08:03:03] 14 days almost yeah
[08:03:03] exactly two weeks
[08:03:06] our old overflow bug is back
[08:03:07] na less
[08:03:20] I guess it wasn't exactly two weeks
[08:03:21] but close
[08:03:23] but yeah roughly
[08:03:57] hmm, 1209600 seconds fits inside 21 bits
[08:06:44] !log labstore4 locked up, power cycled
[08:06:59] Logged the message, Master
[08:12:35] !log 8:12am INFO: Jenkins is fully up and running
[08:12:45] Logged the message, Master
[08:12:46] !log jenkins: upgrading plugins
[08:13:01] Logged the message, Master
[08:16:21] hmmmm, guys
[08:16:35] is the IP address 94.23.242.48 somehow blocked by ops or something?
[08:16:48] i get this when trying to edit via API: {"servedby"=>"mw1130", "error"=>{"code"=>"unknownerror", "info"=>"Unknown error: \"globalblocking-ipblocked\""}}
[08:16:53] and it's definitely not globally blocked
[08:17:03] (because i've gotten an error when trying to unblock it locally)
[08:17:19] that IP is the Polish Toolserver
[08:18:21] looking at when my bot stopped editing, this happened ~2 days ago
[08:19:39] (last edit is 2013-09-22T15:28:42 UTC)
[08:19:59] !log restarting Jenkins
[08:20:11] Logged the message, Master
[08:20:57] MatmaRex: have you tried again?
[08:21:14] last time around 10 minutes ago
[08:22:51] (i am trying to edit logged in, btw. and still happening right now, btw.)
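Editor's note on the two-week NFS stall discussed earlier in this stretch of the log: the remark that 1209600 seconds fits inside 21 bits can be checked with a few lines of arithmetic. This only verifies the numbers quoted in the chat; it does not identify which counter, if any, is overflowing.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Fourteen days expressed in seconds, the figure quoted in the chat.
two_weeks = 14 * SECONDS_PER_DAY
print(two_weeks)                       # 1209600

# Number of bits needed to represent that value.
print(two_weeks.bit_length())          # 21

# It sits between 2**20 and 2**21, so it is not itself a power-of-two boundary.
print(2**20 <= two_weeks < 2**21)      # True
```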
[08:23:46] I have no clue how global blocking works nowadays:/
[08:25:09] the thing is that it doesn't seem to be globally blocked
[08:25:13] no logs, i can't unblock it
[08:25:18] yet i get that error
[08:30:12] !log 8:20am INFO: Jenkins is fully up and running
[08:30:25] Logged the message, Master
[08:30:41] !log labstore4 disk issues, /a failed mount in console, ssh key bouncing. awaiting input from ryan or coren
[08:30:42] !log restarting Jenkins again for one last plugin upgrade
[08:30:53] Logged the message, Master
[08:31:03] Logged the message, Master
[08:54:56] springle: any luck?
[08:55:11] just a note, basically exactly the same thing happened a week or so ago
[08:55:22] and Coren said to just phone him :P
[08:55:48] hashar: i have issues with very slow response with https , anything known?
[08:58:56] springle: addshore: i got root at labstore4
[08:59:00] investigating
[08:59:06] labstore3? ;p
[08:59:28] what do you mean 3 ?
[08:59:35] the problem was with 4 right ?
[09:00:00] special device /dev/mapper/vg1-lv1 does not exist
[09:00:07] I was unaware there was a 4 that was running. but http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=cpu_report&s=by+name&c=Labs+NFS+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[09:00:17] akosiaris: thanks
[09:06:20] addshore: multiple nfsd process all in D state ... I am unaware of the setup of these machines so I will assume the two problems are connected for now, and continue with labstore4
[09:09:19] springle: akosiaris: labstore4 is in decommissioning ..
[09:09:27] manifests/decommissioning.pp:'labstore4', #added 9/17 to remove from monitoring
[09:09:39] though if it has been added there solely to remove it from Ganglia… that is a bit lame :-D
[09:10:06] Id1804eb7bf9f708c7fcac29e1fcde712425c0e4f adding labstore1002/labstore3/ssl3004 to decom.pp to remove from monitoring
[09:10:06] :(
[09:13:45] hashar: any idea if labstore4 is used anywhere?
[09:14:33] no clue
[09:14:52] akosiaris: springle: I think you want to reboot labstore3
[09:15:29] labstore4 is missing a PV from a VG, and seems like it has not seen some drives...
[09:16:18] I have not yet investigated labstore3, but apart from high CPU usage, it is still accessible so I am not rebooting it
[09:17:27] akosiaris: labstore3 machine may be accessible but nfsd is not :/
[09:18:08] We have no indication that a reboot will fix that.
[09:18:16] still investigating ...
[09:18:33] akosiaris: labstore3 has a high system cpu usage, which seems to be the consequence of the NFS kernel bug we are facing every 14 days or so
[09:18:47] wasn't this resolved ?
[09:18:50] akosiaris: indeed as last time it was rebooted and it did not fix the problem as nfsd did not come back up
[09:18:53] but I am just making assumptions, I don't know the exact background.
[09:18:58] I could not find the previous bug / rt
[09:19:45] well, Coren moved the firmware forward to get away from this bug, unfortunately the newer firmware had a similar problem that was much more frequent, hence it has been rolled back to this one which apparently still has this here 2 week bug
[09:21:09] akosiaris: http://permalink.gmane.org/gmane.org.wikimedia.labs/1395
[09:21:22] akosiaris: mail from coren in July stating labstore3 was stalled
[09:21:41] akosiaris: maybe we should wake him up so he can take whatever trace he needs :/
[09:22:02] that is why i stated that I thought this was fixed
[09:25:53] hashar: akosiaris https://bugzilla.wikimedia.org/show_bug.cgi?id=52500 this was the last issue the nfs had
[09:28:00] 4 weeks ago I guess (judging by when the ticket was marked as resolved) http://permalink.gmane.org/gmane.org.wikimedia.labs/1395 was reverted. 2 weeks after that the 10th labstore3 did exactly as it is now http://article.gmane.org/gmane.org.wikimedia.labs/1628/match=tools+labs+down and another 2 weeks it is doing it again
[09:29:55] and per what we said 2 weeks ago he doesn't mind being woken up for this and would prefer being woken up to having more downtime :)
[09:32:34] we should `reboot` every 13 days :D
[09:33:21] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:08] rebooted labstore4, it is complaining about an unknown disk configuration ...
[09:34:10] nice...
[09:34:18] crappy controllers ftw
[09:34:31] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 35.41 ms
[09:36:05] addshore: is NFS ok now ?
[09:36:13] akosiaris: you rebooted labstore4? I am almost starting to think these two machines are the same..
[09:36:35] !log rebooted labstore3 (thought the 14days NFS issue was resolved???)
[09:36:47] !log rebooted labstore4, controller configuration lost ?
[09:36:49] akosiaris: if it is it will take a minute or two to be noticeable and then everything will slowly catch up
[09:37:23] akosiaris: just make sure the service is running, after the reboot 2 weeks ago the service didn't come up automatically :) hence further downtime
[09:42:08] as nothing seems to be happening yet I can only imagine the service hasn't restarted again and may need a manual poke
[09:44:52] which I am trying to figure out how to do .... this seems so not standard...
[09:45:34] akosiaris: .bash_history? ;p
[09:51:02] !log Pooled Varnish Text caches in eqiad with weight 1 (affecting wikidata, wikivoyage, *.wikimedia.org)
[10:01:22] upped to weight 5
[10:03:29] akosiaris: did you do something? ;p
[10:03:39] maybe
[10:03:45] seems like the nfs server is running
[10:03:48] akosiaris: whatever you did write it down :P
[10:03:49] and exports are there
[10:03:53] yup
[10:04:03] everything appears to be catching back up :)
[10:05:24] akosiaris: congratulations!
[10:05:32] springle: we both did not have access to labstore4 ... we should both have now...
[10:05:44] +1 to documenting what you did akosiaris
[10:06:03] akosiaris: well done
[10:06:14] I ran an unknown to me script that prompted me to not run it
[10:06:14] you are our new faidon now :-D
[10:06:15] akosiaris: +1 for docs, and what is labstore4? :P I have never seen any reference to it anywhere!
[10:07:01] Whatever that machine was doing, ... it seems to have problems with its disks...
[10:07:40] now time to log those !logs since more is back
[10:07:41] !log [19:36AEST] !log rebooted labstore3 (thought the 14days NFS issue was resolved???)
[10:07:41] !log [19:36AEST] !log rebooted labstore4, controller configuration lost ?
[10:07:51] Logged the message, Master
[10:08:02] Logged the message, Master
[10:08:08] !log [19:51AEST] !log Pooled Varnish Text caches in eqiad with weight 1 (affecting wikidata, wikivoyage, *.wikimedia.org)
[10:08:08] !log [20:01AEST] upped to weight 5
[10:08:19] Logged the message, Master
[10:08:30] Logged the message, Master
[10:09:50] oh whoops, more died after the first two, oh well, dup entries
[10:10:29] akosiaris: hope you enjoyed digging through the maze of things :) and thanks :)
[10:10:34] beta is back up too :0
[10:10:37] =]
[10:11:41] addshore: you are most welcome. I sure hope everything is ok. We need better docs and monitoring for labs
[10:12:18] akosiaris: from what I can see everything is recovering as expected :) and yes more docs and more monitoring :P
[10:12:36] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:17:12] (PS1) Hashar: update git buildpackage conf comments [operations/puppet] - https://gerrit.wikimedia.org/r/85840
[10:18:14] (CR) jenkins-bot: [V: -1] update git buildpackage conf comments [operations/puppet] - https://gerrit.wikimedia.org/r/85840 (owner: Hashar)
[10:19:22] holy
[10:19:57] (CR) Hashar: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/85840 (owner: Hashar)
[10:25:41] (PS1) Hashar: update submodules url so they end with '.git' [operations/puppet] - https://gerrit.wikimedia.org/r/85841
[10:25:59] (CR) jenkins-bot: [V: -1] update submodules url so they end with '.git' [operations/puppet] - https://gerrit.wikimedia.org/r/85841 (owner: Hashar)
[10:26:56] that is totally dumb :/
[10:33:46] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 10:33:42 UTC 2013
[10:34:36] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:48:01] (CR) Hashar: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/85841 (owner: Hashar)
[10:48:53] (Abandoned) Hashar: update submodules url so they end with '.git' [operations/puppet] - https://gerrit.wikimedia.org/r/85841 (owner: Hashar)
[10:49:15] (CR) Hashar: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/85840 (owner: Hashar)
[11:04:29] !log powercycled sq45
[11:04:42] Logged the message, Master
[11:05:33] RECOVERY - Host sq45 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms
[11:07:53] PROBLEM - SSH on sq45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:08:03] PROBLEM - Frontend Squid HTTP on sq45 is CRITICAL: Connection refused
[11:09:25] ACKNOWLEDGEMENT - Backend Squid HTTP on sq45 is CRITICAL: Connection refused daniel_zahn broken hardware - RT #5803
[11:09:25] ACKNOWLEDGEMENT - Frontend Squid HTTP on sq45 is CRITICAL: Connection refused daniel_zahn broken hardware - RT #5803
[11:09:26] ACKNOWLEDGEMENT - Puppet freshness on sq45 is CRITICAL: No successful Puppet run in the last 10 hours daniel_zahn broken hardware - RT #5803
[11:09:26] ACKNOWLEDGEMENT - SSH on sq45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn broken hardware - RT #5803
[11:10:08] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[11:12:27] heh, I forgot for a moment you're in your timezone :)
[11:15:30] hehe, yea
[11:16:36] re the SSL certificate change, ehm.. originally the ticket was to use SSLCACertificatePath, then it turned into not using it
[11:17:02] so now we want SSLCertificateChainFile but not SSLCACertificatePath, ack?
[11:18:03] I haven't followed up on that ticket, I think I filed it just to make sure we don't forget about it :D
[11:19:31] RT #4823 first it was about adding it , then about removing it :p
[11:20:13] ok, i'll suggest another patch on change 84901
[11:27:33] (CR) Akosiaris: [C: 2] update git buildpackage conf comments [operations/puppet] - https://gerrit.wikimedia.org/r/85840 (owner: Hashar)
[11:32:38] RECOVERY - SSH on sq45 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[11:33:58] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 11:33:50 UTC 2013
[11:34:08] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[11:38:21] (PS2) Dzahn: replace SSLCACertificatePath with SSLCertificateChainFile in Apache templates [operations/puppet] - https://gerrit.wikimedia.org/r/84901
[11:47:36] gwicke_away: RoanKattouw_away : parsoid.wmflabs.org down
[12:03:58] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 12:03:51 UTC 2013
[12:04:08] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[12:07:11] mutante: i have issues accessing the site
[12:07:49] matanya: which site?
[12:08:37] commons, atm
[12:08:51] tools.wmflabs.org before
[12:09:08] mutante: it is on and off most of the day
[12:09:24] works for me right now (I'm in EU)
[12:09:57] I got the error page like 4,5 times
[12:10:46] mutante: i'll let you know if it happens again
[12:12:01] matanya: so about tools.wmflabs, there was an issue with the NFS server earlier and it was restarted
[12:12:14] matanya: and the commons issue seems rather unrelated, ok, thx
[12:12:26] matanya: what kind of error page?
[12:12:32] thanks for the info mutante
[12:12:51] wikimedia has error blah blah go to irc
[12:13:02] @ paravoid ^
[12:13:16] on image upload? or just viewing pages on commons
[12:13:36] hm
[12:13:48] go to irc?
[12:13:49] archiving my talkpage, renaming a file
[12:14:15] mutante: what's up with sq45?
[12:14:31] paravoid: broken hardware
[12:14:46] it's disabled in pybal anyways
[12:15:10] not enough
[12:15:18] paravoid: this -> https://www.iiconservation.org/archive/commons.wikimedia.org/wiki/File_M%C3%BCnchen_Karlsplatz_postcard_late_19th_century.html
[12:15:32] [ 43.522538] Uhhuh. NMI received for unknown reason 21 on CPU 0.
[12:15:40] dazed and confused .. bla
[12:15:40] I can't SSH
[12:16:07] it was already down per Chris, RT #5803, just powercycled it before i saw his ticket
[12:16:19] still pooled in /h/w/conf
[12:17:14] upload:{ 'host': 'sq45.wikimedia.org', 'weight': 20, 'enabled': False }
[12:17:43] sorry, I meant /h/w/conf/squid
[12:18:01] i see
[12:18:14] it's not replying on 80/3128 though, can't be it
[12:18:14] shutdown?
[12:18:40] matanya: next time it happens, paste us the last line please
[12:18:53] sure paravoid
[12:18:55] the one below "If you report this error to the Wikimedia System Administrators, please include the details below." :P
[12:19:10] yeah, got that :P
[12:19:57] are you on sq45.mgmt or is it the usual bug that it's in use :)
[12:20:08] I am not
[12:20:10] usual bug I guess
[12:20:32] yea, racreset
[12:21:39] mutante: you can set the racadm to allow two connections
[12:21:49] wow, it sits at login indeed
[12:21:59] before it didn't finish booting ..wth
[12:24:29] matanya: good hint
[12:25:34] sq45 is weird, it sits at login but doesn't allow any login, not even via mgmt locally
[12:25:41] guess i should just shutdown
[12:26:11] yeah
[12:26:47] !log shutting down sq45
[12:27:00] Logged the message, Master
[12:27:54] mutante: you can check the boot messages
[12:28:14] PROBLEM - Host sq45 is DOWN: PING CRITICAL - Packet loss = 100%
[12:28:41] mark: remove from /h/w/conf/squid/upload-settings.php ?
[12:29:00] yeah I guess
[12:30:04] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[12:33:14] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:34:04] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 12:34:00 UTC 2013
[12:34:29] mark: per wikitech docs? should i do a "make" and ./deploy all as well? i don't think i did that before
[12:34:36] just committed in git so far
[12:34:48] per wikitech docs yes
[12:34:53] do the config diff as well
[12:35:04] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[12:35:10] ok
[12:36:22] oh, there are more changes in the diff than the one i made
[12:41:10] mark: fenari cat /tmp/squiddiff .. it's so much more
[12:41:42] Only in deployed/squid.conf: knsq* etc
[12:42:38] strange that it's suddenly using sq36 hostname instead of ip
[12:42:43] does it no longer resolve?
[12:42:51] ah that's the reason
[12:42:54] can you remove sq36 as well?
[12:43:46] ok,yea
[12:44:43] it's in specialParents pmtpa api_php
[12:45:44] even right below # Do NOT remove servers here until they are PERMANENTLY decommissioned! , right
[12:47:19] and the whole array with the coss drives .. removed and committed
[12:47:49] we've missed you in this timezone, mutante
[12:47:58] ran make
[12:48:07] paravoid: :)
[12:48:35] (CR) Ottomata: [C: 2 V: 2] wikivoyage.org TXT Google Webmaster Tools key addition [operations/dns] - https://gerrit.wikimedia.org/r/85794 (owner: Kaldari)
[12:52:42] paravoid, i just merged this change for kaldari: https://gerrit.wikimedia.org/r/#/c/85794/
[12:52:45] thinking it was part of the puppet repo
[12:52:52] but it is not! it is operations/dns
[12:52:53] :0
[12:53:26] you need to run "authdns-update" from any of the nsN servers
[12:54:17] that's it? no git pull anywhere?
[12:54:29] should I update this?
[12:54:29] https://wikitech.wikimedia.org/wiki/Dns#Changing_records_in_a_zonefile
[12:55:02] I have it on my TODO for a while..
[12:55:07] freakin all staff :)
[12:55:15] there's multiple changes to that needed
[12:55:26] I'd be more than fine if you want to update docs
[12:56:05] ah authdns-update pulls! :)
[12:56:11] nice
[12:56:16] yes
[12:56:16] dns-merge :) :p
[12:56:43] heh
[12:58:11] it's a bit more complicated than that
[12:58:13] but yes, similar
[13:02:38] k paravoid, i just updated the changing records and adding a new zone section, they were mostly right
[13:02:52] just needed to talk about the repo and templates/zonename rather than ~/pdns-templates
[13:05:23] paravoid: ottomata " If any auth DNS server failed to response, restart it with /etc/init.d/pdns restart " s/pdns/gndnsd .. right
[13:05:31] gdnsd
[13:05:52] no, that doesn't apply anymore
[13:05:55] gdnsd won't crash
[13:06:00] but I'll fix all that, don't worry
[13:06:04] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 13:05:57 UTC 2013
[13:06:04] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:06:04] :)
[13:09:38] heeyyyy RobH, can you help with a drive error? analytics1021 looks like sdb is busted
[13:09:49] RT, eqiad queue
[13:09:54] chris is taking care of these
[13:10:00] ah ok
[13:10:35] yeah that's how I found it, I'm looking at unowned RT tickets
[13:10:46] I will comment on this one and assign to chris
[13:11:01] by being in eqiad they are practically owned already
[13:11:23] haha
[13:11:24] yeah
[13:11:26] tickets shouldn't be assigned unless there's a good reason to
[13:11:39] queues are there to send tickets to restricted groups of people
[13:11:47] that's hard for core-ops and all, but for dc queues it works well
[13:12:18] oh ok
[13:12:37] except then they are removed from the unassigned tickets page, and easier to look through?
[13:14:15] when in the datacenter, you just look at the queue
[13:14:23] not to unassigned tickets for all queues
[13:14:27] since you can customize the dashboard with any saved search in a widget and stuff..
[13:14:28] which is 90% irrelevant at that point
[13:14:44] but there he is:)
[13:15:33] ottomata: there is that "quick search" thing that shows just the queues and number of open tickets
[13:22:04] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[13:22:34] cp3004 is down
[13:23:38] mgmt unresponsive too
[13:25:08] ACKNOWLEDGEMENT - Host sq45 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #5803
[13:25:14] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:27:01] there's already an rt ticket
[13:27:12] paravoid: ah, 3 existing tickets . 5443 (sda problems), 5606 (degraded raid), and heh. your latest one:)
[13:27:41] wtg :P
[13:28:07] hmm all management is unreachable there
[13:28:13] ?
[13:28:13] not all [13:28:21] I pinged cp3001 an it worked [13:28:57] perhaps it's just that rack then [13:29:57] i'll be there in a few days [13:30:07] i'm waiting for fiber patches to get installed [13:31:12] ACKNOWLEDGEMENT - Host cp3004 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #5443, #5606, #5818 [13:32:24] heh, thanks mutante [13:34:54] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 13:34:49 UTC 2013 [13:35:04] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [13:36:02] <- somebody fixed those for a bunch of other hosts.. right [14:04:01] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [14:07:11] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:13:13] !log demon synchronized php-1.22wmf18/extensions/Elastica 'Sync message fixes' [14:13:24] Logged the message, Master [14:15:16] !log LocalisationUpdate completed (1.22wmf18) at Tue Sep 24 14:15:16 UTC 2013 [14:15:28] Logged the message, Master [14:15:57] !log LocalisationUpdate completed (1.22wmf17) at Tue Sep 24 14:15:56 UTC 2013 [14:16:07] Logged the message, Master [14:21:10] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Sep 24 14:21:10 UTC 2013 [14:21:22] Logged the message, Master [14:28:56] (03CR) 10Dzahn: [C: 032] deployment: abstract out MW_RSYNC_HOST [operations/puppet] - 10https://gerrit.wikimedia.org/r/72491 (owner: 10Hashar) [14:38:32] !log replacing mw1046 hard drive --requires reinstall [14:38:44] Logged the message, Master [14:39:54] (03PS1) 10QChris: Turn off generating geowiki limn files [operations/puppet] - 10https://gerrit.wikimedia.org/r/85853 [14:40:15] (03CR) 10Dzahn: [C: 031] Remove "labs" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/84724 (owner: 10Reedy) [14:40:33] (03CR) 10Dzahn: [C: 031] Remove old commented code [operations/apache-config] - 
10https://gerrit.wikimedia.org/r/84714 (owner: 10Reedy) [14:43:20] (03CR) 10Dzahn: [C: 032] Simplify sql slightly [operations/puppet] - 10https://gerrit.wikimedia.org/r/81616 (owner: 10Reedy) [14:44:57] yurik: you awake? [14:45:04] greg-g, [14:45:05] sadly [14:45:10] :) [14:45:15] (just flew in - red eye) [14:45:19] sorry I missed your message last night, I went to bed early [14:45:22] yuck [14:45:23] np [14:45:30] i can do a depl now if you want [14:45:35] sleeping? traitor [14:45:48] yeah, now until 2.5 hours from now is open [14:45:58] ok, but make sure you monitor this one :) [14:46:27] (03CR) 10Dzahn: "./sql.sh centralauth" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81616 (owner: 10Reedy) [14:46:32] we are changing how we do warnings in zero (i outlined it several months ago) [14:46:47] ... will ops be paged? [14:46:57] shouldn't :) [14:47:02] (famous last words) [14:47:13] cuz man, it's like 3am for them :) [14:48:37] !log db1031 removing/replacing disk 7 [14:48:48] Logged the message, Master [14:49:24] <^d> greg-g: 2.5h? [14:49:31] <^d> I've got a window in 1h10m. [14:49:50] bah [14:49:52] right, 1 hour [14:49:58] yurik: ^^ [14:49:58] no worries, i should be done in 20min [14:50:04] yep [14:50:29] I forgot to get that one on the google calendar [14:50:32] (added) [14:50:51] <^d> greg-g: Can we move "Ongoing" to under the "Week of..." section in deployments? I *always* look at the first chart and get confused. [14:51:05] ah [14:51:28] I need to rework the whole page, so yeah, lemme do that as a part of it, sounds reasonable [14:52:05] <^d> Also, I'd just move the "No friday" and "Be aware of scaptraps" part to the info box at the top. [14:52:42] * greg-g nods [14:55:44] <^d> I updated [[mw:Search/Timeline]] just now. Should all reflect reality again. 
[14:56:04] (03CR) 10Dzahn: [C: 04-1] "http://wikimediastories.com" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/85698 (owner: 10Jeremyb) [14:58:04] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 10 hours [14:59:34] (03CR) 10Dzahn: "it works if you put that section before the jobs/careers section" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/85698 (owner: 10Jeremyb) [14:59:49] !log db1001 removing/replacing disk 7 [15:00:04] Logged the message, Master [15:00:22] <^d> Yo manybubbles, how goes? [15:02:13] (03CR) 10Dginev: [C: 04-1] "I really don't like that you modified the actual LaTeXML::Util::Config file - then you become unable to easily update LaTeXML and keep you" [operations/debs/latexml] - 10https://gerrit.wikimedia.org/r/85646 (owner: 10Physikerwelt) [15:02:18] (03PS2) 10Dzahn: wikimediastories.com/net/org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/85698 (owner: 10Jeremyb) [15:03:22] (03CR) 10Dzahn: [C: 031] "testing 5 urls on 1 servers, totalling 5 requests" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/85698 (owner: 10Jeremyb) [15:03:45] (03CR) 10Dginev: "Also, I don't know if you've noticed but LaTeXML also has some pretty good PNG capabilities, that might be better than whatever you're doi" [operations/debs/latexml] - 10https://gerrit.wikimedia.org/r/85646 (owner: 10Physikerwelt) [15:04:29] ^d: yo! [15:05:01] !log yurik synchronized php-1.22wmf17/extensions/ZeroRatedMobileAccess/ [15:05:13] Logged the message, Master [15:05:18] (03CR) 10Dginev: "What is a "LaTeX header" anyway? Is it something similar to a LaTeX preamble? LaTeXML has a --preamble option." [operations/debs/latexml] - 10https://gerrit.wikimedia.org/r/85646 (owner: 10Physikerwelt) [15:05:24] mw1072 is still dead? 
[15:06:32] yurk mw1072 has hard disk problems....going to need to replace it [15:07:15] yurik ^ [15:07:35] !log db1053 removing/replacing disk 3 [15:07:49] Logged the message, Master [15:08:03] cmjohnson, thx [15:08:11] greg-g, do you see any spikes? [15:08:15] (03CR) 10Dzahn: [C: 032 V: 032] "testing 5 urls on 1 servers, totalling 5 requests" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/85698 (owner: 10Jeremyb) [15:08:46] (03CR) 10Dzahn: [C: 032] Remove old commented code [operations/apache-config] - 10https://gerrit.wikimedia.org/r/84714 (owner: 10Reedy) [15:09:30] (03CR) 10Dzahn: [C: 032] Remove "labs" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/84724 (owner: 10Reedy) [15:09:42] <^d> manybubbles: So, I was curious how long indexing is going to take us today. Running forceSearchIndex with --buildChunks 1 is useful for this :) [15:09:44] <^d> http://p.defau.lt/?w9pI8IsgEn_v1m_t8uBRWQ [15:10:03] * greg-g waits for graphs to load [15:10:19] ^d: nice! [15:10:24] unintendedly useful [15:10:33] <^d> We've only got 2 wikis with >10k pages, largest is only 27k. We should be done in no time. [15:10:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [15:10:43] yeah, so not too long [15:11:00] yurik: looks fine at a quick glance [15:11:20] kulio [15:11:29] <^d> greg-g: Wanna see my graph from yesterday? :p [15:11:36] suuuure [15:11:43] <^d> See if you can spot where we indexed cawiki and enwikisource. [15:11:45] <^d> http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Miscellaneous+eqiad&h=testsearch1001.eqiad.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [15:12:04] <^d> :p [15:12:35] heh, nice [15:13:14] !log during sync-apache: mw1072 - read-only file system [15:13:23] <^d> manybubbles: We don't have anything urgent to deploy before rolling out in 45m, do we? [15:13:27] Logged the message, Master [15:14:00] ^d: not really. 
we should get master deployed at some point but it isn't urgent. [15:14:11] <^d> We're at HEAD~1 right now. [15:14:12] mutante: if you see anything different on mw1072 plz comment rt5809 [15:14:28] ^d: hmmm. let me check [15:14:55] <^d> HEAD should be HtmlFormatter stuff. I went ahead and merged that last night. [15:16:25] cmjohnson: ok [15:16:39] ^d: k. We're good then. I think we should do an in place rebuild to pick up fc30ef7923db at some point. [15:17:15] ^d: no need to run forceSearchUpdate. I can do it whenever it is allowed for me to. [15:17:34] <^d> Yeah, I was going to let you do the deploy. [15:17:52] <^d> And teach you a fun trick at the same time :) [15:18:19] ^d: I love tricks! [15:18:38] uh oh [15:20:13] ^d: so it looks like htmlformatter is in our deployment branch too. [15:21:12] graceful's [15:21:22] !log yurik synchronized php-1.22wmf18/extensions/ZeroRatedMobileAccess/ [15:21:34] Logged the message, Master [15:21:41] greg-g, fin, now will wait for tomatoes [15:21:45] !log sync-apache and graceful-all for wikimediastories and other cleanup [15:21:57] Logged the message, Master [15:22:03] <^d> manybubbles: You sure? I've got us at a7d5386. [15:22:29] <^d> master's got it, core's 1.22wmf18 branch has us at a7d5386. [15:23:37] ^d: ah! my mistake. I as looking in 1.22wmf18 and found HtmlFormatter.php and thought "oh, cool, we can use that now if we need" but I just realized it is in MobileFrontend so we can't. [15:23:43] we're good. [15:23:49] <^d> :) [15:24:03] !log freshly activated http://wikimediastories.org/ (and .com,.net) [15:24:15] Logged the message, Master [15:25:02] nice title: Grandmother & Scientist [15:25:31] !log demon synchronized php-1.22wmf18/extensions/CirrusSearch [15:25:43] Logged the message, Master [15:28:32] greg-g and ^d: so if I want to rebuild some search indecies, do I start that during our deployment window? specifically all of them other than those we built in the last few days could use it. 
[15:29:17] <^d> Yeah I'd say just do it during the window. Should be pretty low impact since you're doing in-place, right? [15:31:00] yeah, agree [15:33:07] ^d: just more load on the elasticsearch servers as they feed themselves. [15:33:11] <^d> Yeah we [15:33:19] <^d> 'll be fine :) [15:34:02] RECOVERY - RAID on db1001 is OK: OK: State is Optimal, checked 2 logical device(s) [15:34:22] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 15:34:12 UTC 2013 [15:34:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [15:34:46] yeah [15:34:51] ^d: and the window is in 25 minutes? [15:35:00] do we have a calendar or something :) [15:35:16] <^d> Yes, we do :) [15:35:34] <^d> https://wikitech.wikimedia.org/wiki/Deployments [15:36:45] heya mark, LeslieCarr, what is the process for getting a public IP? I'm searching on wikitech, have only found labs relevant stuff so far [15:37:02] thanks! [15:37:52] hey [15:37:54] oh just ask [15:38:07] oh ha [15:38:07] really just put in the procurement ticket for the server "needs public ip" [15:38:11] k [15:38:19] we trust you [15:38:24] …. unless we shouldn't.... [15:38:29] * LeslieCarr gives otto a suspicious glare [15:38:34] ^d & manybubbles, we'll deploy the core HtmlFormatter later today btw [15:38:35] (03PS1) 10Dzahn: remove mw1072 from all dsh groups, broken disk, to avoid scap/sync warnings [operations/puppet] - 10https://gerrit.wikimedia.org/r/85862 [15:38:50] <^d> MaxSem: Sweetness [15:38:58] As long as I don't have a beard to stroke, you can trust me! [15:39:01] MaxSem: wonderful! we've started using it in CirrusSearch's master [15:39:10] arg, added useless dependency again [15:39:36] <^d> LeslieCarr: Can I have like 20 IPs then since the process is so easy? ;-) [15:40:34] <^d> Don't need them or anything. Just to hang onto for good luck. 
[15:40:55] (03PS2) 10Dzahn: remove mw1072 from all dsh groups, broken disk, to avoid scap/sync warnings [operations/puppet] - 10https://gerrit.wikimedia.org/r/85862 [15:41:09] reports ^d to RIPE [15:41:10] ^d gets no ip's [15:41:18] gerrit gets its ip taken away [15:41:19] mwhahaha [15:41:42] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [15:42:09] <^d> LeslieCarr: Noooo! [15:42:13] <^d> :) [15:42:45] !change 85862 | cmjohnson [15:42:45] cmjohnson: https://gerrit.wikimedia.org/r/#q,85862,n,z [15:42:56] ^d: how did you get a list of all the closed wikis? I've been looking in mediawiki-config for a list of all wikis with cirrus but I see that all the closed wikis are identified by 'closed'. [15:43:01] these make scap people love you when there are broken servers [15:43:05] <^d> manybubbles: closed.dblist [15:43:11] LeslieCarr: https://rt.wikimedia.org/Ticket/Display.html?id=5821 [15:43:41] ^d: thanks [15:43:45] cool, so it's easiest to reinstall machines since it also changes their hostname ... [15:43:50] oof [15:43:54] hm [15:43:57] that's fine i suppose [15:44:12] mutante: have followups for stories [15:44:27] LeslieCarr, just curious [15:44:33] (03CR) 10Cmjohnson: [C: 031] remove mw1072 from all dsh groups, broken disk, to avoid scap/sync warnings [operations/puppet] - 10https://gerrit.wikimedia.org/r/85862 (owner: 10Dzahn) [15:44:35] actually, you can do most of the work yourself too :) [15:44:39] cool! [15:44:43] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:44:48] do these machines have multiple NICs? do we ever give machines an internal and external IP? [15:45:08] dunno about ciscos, all the new dells have 4 ethernet ports iirc [15:45:10] <^d> mutante: I reported that r/o on mw1072 yesterday. [15:45:17] that's what I used to do at my old job. 
We'd order all of our machines with public NICs, but only configure the public interface if we wanted the IP there [15:45:36] the lvs boxes get multiple ip's [15:45:37] and we'd keep both hostnames, a public dns record for public IPs, and an internal record for the internal IP [15:45:47] ah [15:45:49] mutante: oh, i see you commented on 4942 already. whois / NS records already look fine. we just need dns / apache [15:46:04] also, if we had thought of that, an excellent way to scam for more ip's [15:46:24] ^d: thanks, removing .. [15:46:39] jeremyb: for wikiPedia vs, wikMedia, right [15:46:47] ? from ICANN (or whatever)? [15:46:55] mutante: yes, https://rt.wikimedia.org/Ticket/Display.html?id=4942#txn-111215 :) [15:47:24] lesliecarr there are 4 NIC ports on cisco POS's [15:47:37] i mean, we def don't need that for these machines though, it would just be nice to not have to reinstall just to give something a public IP [15:47:38] cool [15:47:44] jeremyb: yes, #4940 [15:47:52] jeremyb: need domain names [15:47:53] yeah, the biggest thing for reinstalling is that the hostname changes as well [15:47:57] so you have the puppet changes, etc [15:48:07] but if we had everything having multiple names… yeah we could get around that [15:48:13] oh ja, if we did it that way, the 'hostname' o the machine would still be the internal one [15:48:19] since puppet would admin via the internal interface [15:48:20] urgh [15:48:24] public would be for public service stuff only [15:48:34] stop trying to game the system folks, its what makes the system less effective [15:48:36] =p [15:48:39] its just a DNS entry for the public one [15:49:06] (though chatting about how to game the system in a public logged channel I approve of) [15:49:11] ;] [15:50:26] hha [15:51:29] so, LeslieCarr, where is the IP assigned? 
[15:51:43] I can see how to change the hostname during reinstall [15:51:46] hehe [15:51:53] so in dns/pdns [15:52:00] but linx-host-entries is mapping hsotname to MAC [15:52:10] sockpuppet? or new operations/dns repo? [15:52:14] mutante: my point was we already have them. we don't still need them [15:52:15] new repo [15:52:18] the proper way! [15:52:33] then you change the host entries line from blah.eqiad.wmnet to blah.wikimedia.org [15:52:47] jeremyb: oh heh, i didn't notice the change in whois , gotcha:) [15:52:50] and then, you also put the hostname into decomissioning.pp [15:52:55] and make sure puppet runs on neon [15:52:56] mutante: patches on the way [15:53:12] then reinstall [15:53:16] hmm, decommission.pp by short hostname only? is that a problem? [15:53:27] yeah, it's short hostname only [15:53:31] since short hostname will be the same? [15:53:32] well you'll take it out afterwards [15:53:36] it's a little ghetto ;) [15:53:42] after……? [15:54:03] after the box is reinstalled [15:54:07] then you reinstall the box [15:54:21] then on stafford do "puppetstoredconfigclean.rb blah.eqiad.wmnet " [15:54:36] then take the machine out of decom.pp, and pretend it's a new machine [15:54:59] then… profit! [15:55:02] i love that process:) [15:55:17] shodul I add to decommission and then run puppetstoredconfigclean.rb before reinstalling and puppetizing as .wm.org? [15:55:25] (not sure what decom.pp does) [15:55:36] doesn't puppet have to run somewhere for decom.pp to do something? 
[15:55:44] on icinga and ganglia [15:56:22] ok, i think the reason i'm confused is because the short hostname will be the same before and after resinstall [15:56:29] but if you dont run puppetstoredconfigclean.rb it will be one of those dead hosts still reporting on IRC [15:57:17] yea, but we do that, like manganese has been repurposed with the same name, right [15:57:38] so: [15:57:38] - add to decom.pp [15:57:38] - run puppet on neon and nickel [15:57:38] - puppetstoredconfigclean.rb on sockpuppet [15:57:39] - reinstall /puppetize new host under .wm.org fqdn [15:57:39] ? [15:58:26] sounds right.. Leslie? [15:58:30] (03PS1) 10Jeremyb: adding in wikipediastories.com/net/org [operations/dns] - 10https://gerrit.wikimedia.org/r/85867 [15:58:31] (03PS1) 10Jeremyb: wikipediastories.com/net/org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/85868 [15:58:39] mutante: ^ [15:58:54] RECOVERY - RAID on db1031 is OK: OK: State is Optimal, checked 2 logical device(s) [15:59:50] so much stuff for adding an IP! :p [16:00:06] someone would ahve to plug a cable in to get eth1 the public IP, ja? [16:00:46] LeslieCarr: ^? 
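[editor's note] The checklist proposed just above (decommission, clean stored config, reinstall under the public name) can be written out as a small sketch. This is only an illustration of the ordering discussed in the channel, not an official procedure. The function name `public_ip_reinstall_plan` and its parameters are hypothetical; the host and tool names (decom.pp, neon, nickel, sockpuppet, puppetstoredconfigclean.rb) are taken from the conversation. The log mentions both stafford and sockpuppet for the cleanup step; the sketch follows the summary at 15:57:38.

```python
def public_ip_reinstall_plan(short_name,
                             old_domain="eqiad.wmnet",
                             new_domain="wikimedia.org"):
    """Return the ordered steps from the discussion as plain strings.

    Hypothetical helper: it only formats a checklist, it runs nothing.
    """
    old_fqdn = f"{short_name}.{old_domain}"
    new_fqdn = f"{short_name}.{new_domain}"
    return [
        # decom.pp is keyed by short hostname only, per the log
        f"add {short_name} to decom.pp",
        # neon and nickel are the icinga and ganglia hosts mentioned above
        "run puppet on neon and nickel (icinga and ganglia)",
        # the log also mentions stafford for this step
        f"on sockpuppet: puppetstoredconfigclean.rb {old_fqdn}",
        f"reinstall and puppetize the host as {new_fqdn}",
        f"remove {short_name} from decom.pp after the reinstall",
    ]


if __name__ == "__main__":
    for number, step in enumerate(public_ip_reinstall_plan("analytics1003"), 1):
        print(f"{number}. {step}")
```

Keeping the plan as data rather than shell commands reflects that several steps here are manual (editing decom.pp, the reinstall itself) and that the exact host for the cleanup step was still being confirmed at this point in the log.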
[16:01:29] (CR) Chad: [C: 2] Cirrus on all closed wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85239 (owner: Chad) [16:02:06] (Merged) jenkins-bot: Cirrus on all closed wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85239 (owner: Chad) [16:02:27] aw man and analytics1003 was my ganglia aggregator, hm [16:02:35] I'm going to move that to analytics1009 [16:06:33] !log manybubbles synchronized wmf-config/InitialiseSettings.php 'Cirrus on closed wikis' [16:06:45] Logged the message, Master [16:07:51] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [16:07:58] (PS1) Cmjohnson: adding mw1046 back to dsh groups [operations/puppet] - https://gerrit.wikimedia.org/r/85869 [16:08:01] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [16:08:42] (PS1) Ottomata: Decomissioning analytics1003 and analytics1004 in prep for giving them public IPs. [operations/puppet] - https://gerrit.wikimedia.org/r/85870 [16:09:11] RECOVERY - Puppet freshness on mw1046 is OK: puppet ran at Tue Sep 24 16:09:08 UTC 2013 [16:09:21] PROBLEM - DPKG on mw1046 is CRITICAL: Connection refused by host [16:09:21] PROBLEM - twemproxy process on mw1046 is CRITICAL: Connection refused by host [16:09:31] PROBLEM - Disk space on mw1046 is CRITICAL: Connection refused by host [16:09:51] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [16:10:01] (CR) Dzahn: [C: 2] adding in wikipediastories.com/net/org [operations/dns] - https://gerrit.wikimedia.org/r/85867 (owner: Jeremyb) [16:10:01] PROBLEM - RAID on mw1046 is CRITICAL: Connection refused by host [16:10:24] (CR) Ottomata: [C: 2 V: 2] Decomissioning analytics1003 and analytics1004 in prep for giving them public IPs.
[operations/puppet] - https://gerrit.wikimedia.org/r/85870 (owner: Ottomata) [16:10:54] !log building search indecies for all closed wikis [16:11:05] Logged the message, Master [16:11:08] (CR) Dzahn: [C: 2] wikipediastories.com/net/org [operations/apache-config] - https://gerrit.wikimedia.org/r/85868 (owner: Jeremyb) [16:11:12] mutante: https://gerrit.wikimedia.org/r/#/c/85869/ [16:12:26] cmjohnson: is that back up in operation? [16:12:44] hold on, be back after syncing that apache change [16:12:50] i just reinstalled...just needs sync [16:14:02] (PS1) Reedy: Disable CleanChanges on meta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85871 [16:15:14] (CR) Reedy: [C: 2] Disable CleanChanges on meta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85871 (owner: Reedy) [16:15:22] (Merged) jenkins-bot: Disable CleanChanges on meta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85871 (owner: Reedy) [16:16:01] (CR) Dzahn: [C: 1] adding mw1046 back to dsh groups [operations/puppet] - https://gerrit.wikimedia.org/r/85869 (owner: Cmjohnson) [16:16:01] RECOVERY - RAID on mw1046 is OK: OK: no RAID installed [16:16:05] !log reedy synchronized wmf-config/InitialiseSettings.php 'Disable CleanChanges on meta' [16:16:21] RECOVERY - DPKG on mw1046 is OK: All packages OK [16:16:23] Logged the message, Master [16:16:31] RECOVERY - Disk space on mw1046 is OK: DISK OK [16:18:11] PROBLEM - Apache HTTP on mw1046 is CRITICAL: Connection refused [16:18:44] !log sync-apache, graceful for wikipediastories [16:18:58] Logged the message, Master [16:19:08] !log DNS update, add wikipediastories domains [16:19:17] Logged the message, Master [16:19:35] (PS2) Cmjohnson: adding mw1046 back to dsh groups [operations/puppet] - https://gerrit.wikimedia.org/r/85869 [16:19:43] (CR) Cmjohnson: [C: 2] adding mw1046 back to dsh groups [operations/puppet] - https://gerrit.wikimedia.org/r/85869 (owner: 
Cmjohnson) [16:20:34] jeremyb: looks good to me,, ack? [16:21:35] (CR) Dzahn: "Apache was merged before this." [operations/dns] - https://gerrit.wikimedia.org/r/85867 (owner: Jeremyb) [16:21:41] PROBLEM - NTP on mw1046 is CRITICAL: NTP CRITICAL: Offset unknown [16:21:52] (CR) Dzahn: "http://wikipediastories.com" [operations/apache-config] - https://gerrit.wikimedia.org/r/85868 (owner: Jeremyb) [16:22:02] (PS1) RobH: folks complained it wasnt bast4001 [operations/puppet] - https://gerrit.wikimedia.org/r/85872 [16:22:12] hey, sorry i was getting ready for work [16:22:32] ottomata: that was correct [16:23:11] PROBLEM - Puppet freshness on mw1072 is CRITICAL: No successful Puppet run in the last 10 hours [16:23:21] (PS1) RobH: rename bastion4001 => bast4001 [operations/dns] - https://gerrit.wikimedia.org/r/85873 [16:24:20] waiting for the eternal neon puppet run [16:24:27] !log rebuilding search indecies in place for all other wikis running cirrus [16:24:44] Logged the message, Master [16:27:33] !log adding mw1046 back to pybal [16:27:48] Logged the message, Master [16:31:54] (PS1) Ottomata: Moving ganglia aggregator for Analytics Ciscos to analytics1009 [operations/puppet] - https://gerrit.wikimedia.org/r/85875 [16:32:21] RECOVERY - Puppet freshness on mw1046 is OK: puppet ran at Tue Sep 24 16:32:15 UTC 2013 [16:32:41] RECOVERY - NTP on mw1046 is OK: NTP OK: Offset -0.009087443352 secs [16:32:49] (CR) Ottomata: [C: 2 V: 2] Moving ganglia aggregator for Analytics Ciscos to analytics1009 [operations/puppet] - https://gerrit.wikimedia.org/r/85875 (owner: Ottomata) [16:32:51] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [16:33:48] (PS1) Ottomata: analytics1003 and analytics1004 now have public IPs. [operations/puppet] - https://gerrit.wikimedia.org/r/85877 [16:33:50] (PS1) Ottomata: analytics1003 and analytics1004 now have public IPs.
[operations/dns] - https://gerrit.wikimedia.org/r/85878 [16:34:22] (PS2) Ottomata: analytics1003 and analytics1004 now have public IPs. [operations/puppet] - https://gerrit.wikimedia.org/r/85877 [16:34:31] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 16:34:26 UTC 2013 [16:34:34] (PS2) Ottomata: analytics1003 and analytics1004 now have public IPs. [operations/dns] - https://gerrit.wikimedia.org/r/85878 [16:35:01] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [16:38:03] !log rebuilt search indecies in place for all other wikis runnig cirrus except cawiki and enwikisource. skipped them because they are already up to date. [16:38:18] Logged the message, Master [16:38:52] RECOVERY - RAID on db1053 is OK: OK: State is Optimal, checked 2 logical device(s) [16:39:11] RECOVERY - Puppet freshness on mw1046 is OK: puppet ran at Tue Sep 24 16:39:05 UTC 2013 [16:39:51] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [16:42:54] Hi manybubbles. I advertised the new it.wikt search on it.wiki too, I hope to see more testers. So far, response from it.wikt has been enthusiastical. [16:43:10] Nemo_bis: wonderful! [16:43:33] LeslieCarr: [16:43:33] https://gerrit.wikimedia.org/r/#/c/85878 [16:43:33] https://gerrit.wikimedia.org/r/#/c/85877/ [16:43:34] we just did cawiki and enwikisource last night. we haven't heard anything yet. [16:43:52] Nemo_bis: thanks so much for being the first tester! [16:45:09] (PS1) Petr Onderka: Fixed bugs that manifested for diff current dumps [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/85879 [16:45:12] mutante: still here? [16:48:55] manybubbles: en.wikt will be exciting; if *they* don't complain, you can be quite sure the search is very good :P [16:49:11] (and I'm positive they won't have something to complain about) [16:49:56] Nemo_bis: thanks! I'm pretty excited.
I think the most annoying bug we have now is this one: https://bugzilla.wikimedia.org/show_bug.cgi?id=54020 . I've actually just got an email complaining about it! [16:50:39] hahhha: https://www.usenix.org/system/files/1309_14-17_mickens.pdf [16:52:05] ottomata: dystopian bin-packing problem [16:52:31] hahah [16:52:34] "John still believed that somehow, some [16:52:34] way, he could continue to make his transistors smaller. Perhaps [16:52:35] the processor could run multiple copies of each program, comparing the results to detect errors? Perhaps a new video codec [16:52:35] could tolerate persistently hateful levels of hardware error? [16:52:35] All of these techniques could be implemented. However, John [16:52:35] slowly realized that these solutions were just things that he [16:52:35] could do, and inventing “a thing that you could do” is a low bar [16:52:36] for human achievement. If I were walking past your house and [16:52:36] I saw that it was on fire, I could try to put out the fire by finding a dingo and then teaching it how to speak Spanish." [16:52:52] manybubbles: ah, yes, that's a bit annoying; on the other hand one it.wikt user appreciated the * wildcard (though I'm not sure how it works, maybe the * is just ignored and it gives partial matches) [16:53:22] Nemo_bis: the * wildcard causes a prefix match. it is super real. [16:53:38] :) [16:53:53] I still have to read the diff of the features page [16:54:06] and we have a bug around that that I think is upstream: https://bugzilla.wikimedia.org/show_bug.cgi?id=54399 [16:55:33] Nemo_bis: this might be useful: https://www.mediawiki.org/wiki/Search/CirrusSearchFeatures [16:56:50] <^d> How do I create a group in ganglia? It'd be nice to group the new search boxen. [16:57:02] <^d> (Right now they're all under misc. eqiad) [17:05:01] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 17:04:59 UTC 2013 [17:05:28] ^d: have a look at the puppet repositories ganglia.pp lines 286-342. 
[17:05:31] that is my guess [17:06:01] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:06:34] ^d: actually that probably isn't it, sorry [17:07:31] <^d> aw ok [17:14:03] (03CR) 10Petr Onderka: [C: 032 V: 032] Fixed bugs that manifested for diff current dumps [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/85879 (owner: 10Petr Onderka) [17:14:36] !log Recreating GeoData index [17:14:41] RECOVERY - Puppet freshness on mw1046 is OK: puppet ran at Tue Sep 24 17:14:35 UTC 2013 [17:14:50] Logged the message, Master [17:14:52] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [17:19:29] (03CR) 10Lcarr: [C: 04-1] "also including removing these from .eqiad.wmnet and 10." [operations/dns] - 10https://gerrit.wikimedia.org/r/85878 (owner: 10Ottomata) [17:19:56] (03CR) 10Lcarr: [C: 032] analytics1003 and analytics1004 now have public IPs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/85877 (owner: 10Ottomata) [17:20:13] cmjohnson: ready to work with juniper and on networky things ? [17:20:26] give me a sec...going to cage [17:21:08] ok [17:21:11] let me get some water [17:21:20] and possibly cyanide pill after talking to tech support on the phone ;) [17:21:33] which techsupport? ;) [17:21:49] I kid I kid [17:23:19] lesliecarr: whenever you're ready [17:27:28] greg-g: all of them! [17:29:02] (03PS2) 10RobH: folks complained it wasnt bast4001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/85872 [17:29:14] (03CR) 10RobH: [C: 032] rename bastion4001 => bast4001 [operations/dns] - 10https://gerrit.wikimedia.org/r/85873 (owner: 10RobH) [17:29:30] cmjohnson: woot, i am calling them now [17:29:35] LeslieCarr: what issue are you working on? 
:) [17:30:00] the new expansion cards not working completely on the switch stacks [17:30:05] preventing better link bundles [17:30:07] right [17:30:11] ok [17:30:56] (PS3) RobH: folks complained it wasnt bast4001 [operations/puppet] - https://gerrit.wikimedia.org/r/85872 [17:32:51] (PS4) RobH: folks complained it wasnt bast4001 [operations/puppet] - https://gerrit.wikimedia.org/r/85872 [17:33:01] ^d: which is the machine with all the apache logs on it? I thought I'd find it in my bash history [17:33:05] rebase, review, oh wait, someone merged something that touches, rebase. [17:33:09] <^d> manybubbles: fluorine [17:33:54] on a positive note [17:33:58] ^d: thanks! no wonder I couldn't find it. I was misspelling it. [17:34:01] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 17:33:53 UTC 2013 [17:34:04] patch testing is flying along today. [17:34:07] <^d> :) [17:34:56] ottomata: im merging a bunch of stuff that is analytics [17:34:59] is this you? [17:35:01] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:35:35] (CR) RobH: [C: 2] folks complained it wasnt bast4001 [operations/puppet] - https://gerrit.wikimedia.org/r/85872 (owner: RobH) [17:36:14] cmjohnson: so the tech support guy had problems with his computer [17:36:18] let's do other network things! [17:36:22] ok [17:36:31] hah [17:36:55] LeslieCarr: Did you suggest he tried turning it off and on again? [17:36:56] oh the irony [17:37:02] ok, i'm calling up he right now, and we will switch over the peering :) [17:37:06] toggle the ON/OFF switch [17:37:45] :) [17:38:00] LeslieCarr: did you ask the juniper support guy if he was using juniper optics? [17:38:03] in his computer [17:38:11] haha [17:39:45] so let's move psw1-eqiad ge-0/0/0 to cr1-eqiad xe-4/3/1 [17:40:16] and obviously we'll need a new optic [17:41:25] where is ge-0/0/0 going now?
[17:41:33] it's not labeled.....grr [17:41:44] PROBLEM - Puppet freshness on analytics1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:41:44] oh it shoudl be going to a patch panel, to hurricane electric [17:41:52] our peering got too hot to handle ;) [17:42:16] obviously too many people looking at commons [17:43:28] lesliecarr: okay to pull out of ge0/0/0? [17:43:32] one sec [17:43:33] optic is in 4/3/1 [17:43:34] RECOVERY - Puppet freshness on analytics1004 is OK: puppet ran at Tue Sep 24 17:43:30 UTC 2013 [17:43:35] yay [17:43:57] move please [17:45:04] lesliecarr: k [17:45:06] dien [17:45:11] PROBLEM - Puppet freshness on analytics1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:45:19] LeslieCarr: analytics1001 was not removed when it was given a public IP [17:45:19] that's why I left the other ones in [17:45:20] should I remove them all? [17:46:01] yes please [17:49:08] !log Hurricane electric peering moved to 10GE port. the spice packets must flow... [17:49:19] Logged the message, Mistress of the network gear. [17:49:41] yay [17:51:04] so cmjohnson want to do https://rt.wikimedia.org/Ticket/Display.html?id=5781 ? [17:52:03] sure...need to go get a module [17:58:10] !log finished building search indexes on closed wikis [17:58:25] Logged the message, Master [17:58:47] cmjohnson: hey, juniper finally called back [17:59:16] ok [17:59:36] !log bast4001 reinstall [17:59:50] Logged the message, RobH [18:02:23] of course they're being sloooow [18:02:27] join.me isn't that hard! [18:02:31] you download an app, you put in a code [18:02:32] done [18:02:39] i mean he's using windows, not like he has to figure anything out [18:03:25] hah...good times. 
his Norton antivirus is preventing it from working [18:03:33] oh my god, now i am troubleshooting his laptop [18:03:38] tech supporting the tech support [18:03:44] shoot me [18:03:50] robh ^ [18:04:01] haha, he's not int he office today ;) [18:04:09] LeslieCarr: ask for a reimbursement from Juniper for it ;) [18:04:20] :) [18:05:04] ? [18:05:20] i think chris was offering your murdering services [18:05:41] oh [18:05:48] i dont know what you are talking about [18:05:53] i do offer [18:05:59] "IT consluting" [18:06:03] heh [18:06:15] which can involve stripping insultation off mains wires and placing them near lightswitches. [18:06:20] but thats not murder. [18:06:24] thats accidental. [18:06:32] * RobH channels his not so inner bofh [18:06:38] :) [18:07:18] I... am unsure of typo-status [18:07:20] ;) [18:07:57] lesliecarr: figured out tech support yet? [18:08:30] nope [18:08:31] oi! [18:08:49] ok, told him to call me back [18:09:00] let's get asw2-a5-eqiad some more bandwidth! [18:09:17] !log upgraded varnish pkg to -wm16 on cp1046 [18:09:28] k...the module is installed...need to run some fiber....connect to ge-0/0/37 (asw2-a5)? [18:09:30] Logged the message, Master [18:09:39] yes [18:12:23] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [18:12:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [18:13:33] RECOVERY - Puppet freshness on mw1046 is OK: puppet ran at Tue Sep 24 18:13:26 UTC 2013 [18:13:53] PROBLEM - twemproxy process on mw1046 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [18:14:12] damn it, i been out of datacenter long enough when cmjohnson mentions a module my first thought is puppet ;_; [18:14:13] RECOVERY - Puppet freshness on analytics1004 is OK: puppet ran at Tue Sep 24 18:14:07 UTC 2013 [18:14:22] im fine with it not being DC related, but being puppet related is sad. 
[18:14:23] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [18:14:33] PROBLEM - DPKG on analytics1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:14:43] PROBLEM - Disk space on analytics1003 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 9236 MB (3% inode=99%): [18:14:43] PROBLEM - Apache HTTP on mw1046 is CRITICAL: Connection refused [18:15:38] RobH, whaaa? I thought Puppet is second best thing in universe after sliced bread? [18:18:28] end results are good [18:18:33] but the process is sometimes painful. [18:18:48] lesliecarr: just the one link..correct? it's done...the cable # is 2169 [18:19:54] Puppet is the best possible way to do configuration management, implemented about as badly as it could possibly be while still "working". [18:20:10] let's see [18:20:12] maybe both links [18:20:15] checking traffic... [18:20:55] "Describe a configuration in a strictly idempotent state-based manner" -> win. "Actual syntax and semantics of the 'language' used to describe it" -> epic fail. [18:21:50] In fact, at my previous employer, I wrote a reasonably sane configuration language that /output/ .pp files in a desperate bid to avoid that fail. [18:22:02] hrm, let's just hook up both, we have been having a decent steady growth [18:22:06] and there's no real downside [18:22:14] and the upside is longer until we have to pay attention [18:22:47] good thing robh ordered me a bunch of sfp+ optics [18:22:47] It was very nice, and worked well, but wouldn't scale very nicely (it output a flat .pp with node definitions all expanded) [18:22:56] cmjohnson: we havent yet [18:23:04] the net terms are in processing [18:23:05] uhoh [18:23:11] as its FAR more than i can put on a CC [18:23:14] hehe [18:24:17] leslie...#2171 going to 0/36 [18:25:33] thanks [18:27:04] so xe-0/0/36 us going to xe-6/1/0 and xe-0/0/37 is going to xe-6/1/2 (or vice versa?) 
[18:28:01] scap in a second [18:28:08] vice versa...want me to swap [18:28:16] can you swap ? [18:31:10] lesliecarr: #2169 xe-6/1/0 to xe-5/0/36 #2171 xe-6/1/2 to xe-5/0/37 [18:31:37] thanks [18:31:38] lesliecarr: while we are at it...can we make asw-a5 primary switch..i have a 4200 in there that we no longer need in the rack [18:31:44] asw2-a5 [18:32:19] but it's part of the network fabric...so not sure how much is needed first [18:32:24] oh that may be a little more difficult [18:32:27] actually [18:32:33] because we are pushing a lot of traffic [18:32:36] well hrm [18:32:46] actually it has to go over the backplane anywyas [18:32:56] so we could… but to put the switch in the stack it would need a reboot iirc [18:33:02] and that would cause downtime [18:33:10] so i think not for now ? [18:33:16] grr..yes it would and that is not wise [18:33:17] yep [18:33:27] though we can always take out asw1-a5 if we have long enough cables [18:33:53] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 18:33:46 UTC 2013 [18:34:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [18:39:50] robh: reinstalled mw1046 added back to all the dsh groups...i need to apache-sync ...and then do an apache-graceful that server only..correct? [18:42:56] !log bsitu Started syncing Wikimedia installation... : Update Echo and Thanks to master [18:43:09] Logged the message, Master [18:47:03] cmjohnson: apache sync just that server and graceful just that server yes [18:47:12] if you sync all apaches, you should also graceful them [18:47:25] if there is an issue in a config, better to find it immediately than when apaches restart later [18:47:28] =] [18:51:00] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:51:10] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:51:50] RECOVERY - DPKG on mw1125 is OK: All packages OK [18:52:00] RECOVERY - Disk space on mw1125 is OK: DISK OK [18:52:09] cmjohnson: grrr can you maybe try reseating the module in asw-a6 ? [18:52:26] sure..give me a sec [18:52:35] thanks [18:52:50] the juniper guy still hasn't called back, maybe we should just do the troubleshooting steps he was going to and send him the results [18:52:59] !log sync'ing mw1046 [18:53:04] !log varnish updated to -wm16 on all eqiad mobile caches now, yell at me if something starts looking "off" there... [18:53:11] Logged the message, Master [18:53:18] lesliecarr: hahaha probably [18:53:25] Logged the message, Master [18:54:08] !log apache-graceful on mw1046 [18:54:19] Logged the message, Master [18:54:30] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [18:55:19] !log bsitu Finished syncing Wikimedia installation... : Update Echo and Thanks to master [18:55:32] Logged the message, Master [18:56:39] (PS2) Bsitu: Enable Echo and Thanks on Various wikis: [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85778 [18:57:40] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:33] lesliecarr: reseated [19:01:31] (CR) Bsitu: [C: 2] Enable Echo and Thanks on Various wikis: [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85778 (owner: Bsitu) [19:01:43] (Merged) jenkins-bot: Enable Echo and Thanks on Various wikis: [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85778 (owner: Bsitu) [19:02:11] yay reseating fixed it [19:02:12] hrm [19:02:21] can you just reseat the module in asw2 and asw 7 ? [19:02:44] row a? [19:03:08] yes [19:03:20] cmjohnson: I got jenkins upgraded this morning :-) thanks!
[19:03:23] asw2 is not a mondule..i can reseat the optic [19:03:51] oh sorry [19:03:58] i meant in asw-a2-eqiad and asw-a7-eqiad [19:04:12] those two modules which are misbehaving [19:04:14] hashar...mutante got it..apparently i should've done it manually [19:04:25] since it worked so well in asw-a6 [19:05:36] reseating asw-a7 now [19:05:45] woot [19:06:43] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable Echo and Thanks on various wikis' [19:06:57] ooo it sees the optics in the right ports... [19:07:01] Logged the message, Master [19:07:02] !log bsitu synchronized echowikis.dblist 'Enable Echo and Thanks on various wikis' [19:07:18] Logged the message, Master [19:07:20] except the interfaces are down/down [19:07:20] hrm [19:07:45] maybe try reseating the optics as well ? [19:08:01] in 7? [19:08:16] look now..i just put the fibers back [19:08:38] oo up up [19:08:41] lemme see if traffic is passing [19:10:04] yay it is :) [19:10:07] :) [19:10:16] can you do the same for asw-a2 ? [19:10:20] done [19:10:22] check it [19:10:49] shows the correct interfaces... [19:11:33] so ...the optics were in the wrong port....totally my mistake [19:11:48] well only the 1 optic but ... [19:12:04] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:12:04] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [19:12:15] yay working! 
[19:12:16] ok [19:12:23] let's go onto asw-b-eqiad [19:12:27] let me login first [19:13:04] ok asw-b-eqiad only asw-b7 is affected [19:13:04] RECOVERY - Puppet freshness on mw1046 is OK: puppet ran at Tue Sep 24 19:12:57 UTC 2013 [19:14:01] check it now lesliecarr [19:14:04] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [19:14:23] yay [19:14:24] PROBLEM - Apache HTTP on mw1046 is CRITICAL: Connection refused [19:14:34] PROBLEM - twemproxy process on mw1046 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [19:14:52] dr0ptp4kt: i don't see you in #-fundraising ; see history and talk page for https://wikimediafoundation.org/wiki/Thank_You_All FYI [19:14:57] just reseated the optic [19:16:04] ok so reseating solves all ills [19:16:52] that and using the correct ports [19:17:30] jeremyb, thx. amit used to be on that page i think, but he's moved on. i'll raise it with peeps and see who wants to be a smiling face on the page :) [19:18:42] yay all uplinks are 40g now [19:18:43] :) [19:18:51] next choke point… backplane! [19:19:00] dr0ptp4kt: ok, i just mentioned it because you were recently in page history [19:19:16] how do we fix that? [19:19:49] !log bsitu synchronized echowikis.dblist 'Enable Echo and Thanks on various wikis' [19:20:02] Logged the message, Master [19:22:56] jeremyb, it's a good point! we had to update it to make it accurately reflect the org chart, but it may make sense to get another person from the team occupying that old slot. i emailed the biz team about it just a couple of minutes ago. appreciate it! [19:24:38] dr0ptp4kt: ok. i'm not really seeing the biz connection though. 
was just about to ask in #-fundraising where i should raise further questions (that I haven't raised at all yet) [19:26:57] #wikimedia-rfc [19:27:02] brrr [19:27:06] oh, right [19:28:07] MaxSem: except mediawiki not wikimedia [19:28:17] jeremyb, oh, HA, i see, like you said: look at the Talk page! the thing to which i was referring was the lack of anyone from wikipedia zero on that page, which i think is a valuable thing to get the word out about. regarding the Talk page reporting of problems with rendering on mobile, i think that #wikimedia-rfc channel MaxSem mentioned, plus #wikimedia-mobile may be good place to start. someone may ask for an em [19:28:18] to the mobile-l list or even perhaps a Bugzilla report at https://bugzilla.wikimedia.org. [19:29:20] #mediawiki-rfc indeed [19:29:40] dr0ptp4kt: i see absolutely no relevance to rfcs... ? [19:30:46] dr0ptp4kt: anyway, pcoombe will look into the mobile thing when he has a chance. (it's fixed for now because mobile is disabled on that page) i guess i'll just ask him about the other stuff [19:31:56] i now see your edits were very minor. i hadn't actually looked at the diffs. i just saw they were recent and no one else was recent :) [19:32:03] MaxSem, can you speak to the rendering aspects of http://m.wikimediafoundation.org/wiki/Talk:Thank_You_All …would that make sense to for #mediawiki-rfc or is the #wikimedia-mobile irc channel or mobile-l mailing list a better starting point? to that point, would an entry into Bugzilla for the MobileFrontend extesion make sense for jeremyb? [19:32:07] who else should I tell? [19:32:34] so, anyone: I have to head out early today, don't think I'm ignoring you, I'm just not online [19:32:49] jeremyb, your reporting had positive side effects. if only everything worked that way, huh? [19:33:07] dr0ptp4kt: you mean to get biz people onto the page? 
:-) [19:33:14] heh, it's happened before [19:33:18] jeremby, yeah :) [19:33:27] * jeremyb hands dr0ptp4kt a tab key [19:33:34] jeremyb, i needed that :) [19:33:50] jeremyb the tab key is my best and worst friend [19:33:51] i certainly need the tab key for *you* [19:34:04] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 19:34:03 UTC 2013 [19:34:19] jeremyb LOL [19:34:31] dafuq? that page is blacklisted, but it's not working [19:35:04] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:35:14] !log reedy synchronized wmf-config 'touch' [19:35:26] Logged the message, Master [19:36:09] MaxSem: no, the blacklist worked [19:36:17] blacklist was recent [19:36:43] bbiab [19:43:04] RECOVERY - Puppet freshness on mw1046 is OK: puppet ran at Tue Sep 24 19:42:55 UTC 2013 [19:44:04] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [19:55:04] bsitu, are you still deploying? [19:55:43] MaxSem: I am done with it [20:02:56] (CR) MaxSem: [C: 2] 5 years later, we're not interested in a query.php translator [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85652 (owner: MaxSem) [20:03:09] (CR) MaxSem: [C: 2] No need to remove hiddenStructure by now [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85651 (owner: MaxSem) [20:04:44] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 20:04:35 UTC 2013 [20:05:04] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [20:08:24] (PS1) Lcarr: adding ulsfo in ganglia options [operations/puppet] - https://gerrit.wikimedia.org/r/85907 [20:09:35] (CR) Lcarr: [C: 2] adding ulsfo in ganglia options [operations/puppet] - https://gerrit.wikimedia.org/r/85907 (owner: Lcarr) [20:09:49] (CR) MaxSem: [C: 2] Slightly more concise language for wgRightsText [operations/mediawiki-config] -
https://gerrit.wikimedia.org/r/85259 (owner: Kaldari) [20:10:00] (Merged) jenkins-bot: Slightly more concise language for wgRightsText [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85259 (owner: Kaldari) [20:11:12] Krinkle|detached when you return perhaps the PPP codes can be generated with gaps [20:11:50] like 1000-3000 for wikipedias, next 2k for wiktionaries etc [20:27:28] (PS1) Lcarr: adding ulsfo ip information [operations/puppet] - https://gerrit.wikimedia.org/r/85911 [20:27:59] (CR) RobH: [C: 1] adding ulsfo ip information [operations/puppet] - https://gerrit.wikimedia.org/r/85911 (owner: Lcarr) [20:28:20] (CR) Lcarr: [C: 2] adding ulsfo ip information [operations/puppet] - https://gerrit.wikimedia.org/r/85911 (owner: Lcarr) [20:28:23] LeslieCarr: +1s are really just nearly silent support, as they take no blame ;] [20:32:48] ugh.. AaronSchulz, we've got a load of Exception from line 789 of /usr/local/apache/common-local/php-1.22wmf17/includes/job/JobQueueRedis.php: Unable to connect to redis server. [20:34:34] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 20:34:26 UTC 2013 [20:35:04] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [21:04:04] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 21:04:00 UTC 2013 [21:05:04] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [21:20:34] PROBLEM - NTP on bast4001 is CRITICAL: NTP CRITICAL: No response from NTP server [21:31:25] scapping [21:34:20] !log maxsem Started syncing Wikimedia installation...
: Weekly mobile deployment [21:34:34] Logged the message, Master [21:34:44] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 21:34:35 UTC 2013 [21:34:54] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [21:45:49] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [21:46:03] Logged the message, Master [21:48:01] !log routing changes in ulsfo, commit confirmed 5 for safety [21:48:14] Logged the message, Mistress of the network gear. [21:57:35] !log maxsem synchronized php-1.22wmf17/extensions/MobileFrontend/ [21:57:47] Logged the message, Master [21:59:37] !log maxsem synchronized php-1.22wmf18/extensions/MobileFrontend/ [21:59:50] Logged the message, Master [22:07:39] !log maxsem synchronized php-1.22wmf17/extensions/MobileFrontend/ [22:07:50] Logged the message, Master [22:09:32] !log maxsem synchronized php-1.22wmf18/extensions/MobileFrontend/ [22:09:43] Logged the message, Master [22:28:28] RECOVERY - NTP on bast4001 is OK: NTP OK: Offset -0.004374027252 secs [22:29:14] (PS2) Jforrester: Enable VisualEditor on "phase 2" Wikipedias (logged-in users only) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84369 [22:29:18] (CR) jenkins-bot: [V: -1] Enable VisualEditor on "phase 2" Wikipedias (logged-in users only) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84369 (owner: Jforrester) [22:30:04] (PS2) Jforrester: Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84370 [22:30:08] (CR) jenkins-bot: [V: -1] Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84370 (owner: Jforrester) [22:30:19] (PS3) Jforrester: Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84370 [22:30:23] (CR)
jenkins-bot: [V: -1] Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84370 (owner: Jforrester) [22:30:48] (PS4) Jforrester: Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84370 [22:30:52] (CR) jenkins-bot: [V: -1] Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/84370 (owner: Jforrester) [22:39:28] !log maxsem synchronized php-1.22wmf17/extensions/MobileFrontend/ [22:39:40] Logged the message, Master [22:39:47] <^d> MaxSem: That including HtmlFormatter? ^ [22:40:12] yup [22:40:26] <^d> Sweet. [22:40:30] I mean, it's in the core now [22:41:01] !log maxsem synchronized php-1.22wmf18/extensions/MobileFrontend/ [22:41:13] Logged the message, Master [22:47:18] uh-oh, exception log is full of deadlocks [23:06:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [23:21:03] (PS1) Springle: depool db1018 for upgrade, pool db1002 and warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85938 [23:22:17] (CR) Springle: [C: 2] depool db1018 for upgrade, pool db1002 and warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85938 (owner: Springle) [23:23:33] !log springle synchronized wmf-config/db-eqiad.php 'depool db1018 for upgrade, pool db1002 and warm up' [23:23:47] Logged the message, Master [23:30:32] hi springle...so i added the new disk today to the 3 dbs and found 2 more reporting raid problems [23:34:53] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 23:34:45 UTC 2013 [23:35:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [23:37:05] cmjohnson1: thanks.
yeah, 2 more popped up, db1004 last week and db1006 sometime over the weekend [23:37:14] i'll get tickets in shortly [23:37:25] tickets have been created already [23:37:27] thx [23:37:33] oh great. thanks [23:43:02] anybody know what the misc_web servers are used for? [23:45:45] !log upgrading db1018 to precise + mariadb [23:46:00] Logged the message, Master