[00:03:26] any takers on https://gerrit.wikimedia.org/r/#/c/85796/ ? kaldari also waiting on a puppet merge, see above. [00:18:27] Krinkle I hope I didint offend you with that remark, that wasnt my intention :( [00:19:35] ToAruShiroiNeko: No worries. [00:19:59] ToAruShiroiNeko: So which QR encoding do you recommend we use? version, % correction and max length it gives us [00:21:48] thats a matter of debate. I can suggest the shorter the better approach others may argue error correction would be needed. [00:22:03] I would go for verison1 or 2 [00:22:15] it all depends on how much error correction are we going for [00:22:33] We'd shoot for 25 characters? [00:23:04] I think both options should be put on the table [00:23:15] ver 1 25 chars - 7% error correction [00:23:28] ver 2 27 chars - 25% error correction [00:23:42] sorry [00:23:47] ver 2 29 chars - 25% error correction [00:24:02] Right [00:24:11] ver 1 and 2 dont differ too much in complexity [00:24:18] 21 vs 25 [00:24:42] it all depends on how short the url is [00:25:19] to be honest ver2 has one other advantage [00:25:42] it can handle max 47 chars which can be used to link individual diffs between versions [00:26:15] as a future improvement of the URL Shortening System [00:27:19] shorter the url the better though. It almost entirely depends on how short the non encoded part is [00:28:31] http://wmf.ly/ - 14 chars https://wmf.ly/ - 15 chars [00:28:46] ToAruShiroiNeko: In the initial proposal, how would we distinguish the project hash from the page hash? [00:29:03] if we go for ppp instead of pcc, would we pad it? or add a separator? [00:29:17] I think using 0's would do the trick [00:29:27] 001 010 100 etc [00:29:38] Are they not ambiguous in base36 though? 
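Editor's note: the zero-padding idea above (001, 010, 100) and the base36 ambiguity question can be illustrated with a small sketch. It assumes a fixed three-character project prefix (the "ppp" layout) followed by an unpadded page code. The names `PROJECT_WIDTH`, `encode` and `decode` are hypothetical and not taken from the RFC. The point is that "0" and "000" are the same number, but with a fixed-width prefix the split position is known, so no separator is needed.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # base36 digits 0-9, a-z
PROJECT_WIDTH = 3  # assumed fixed width of the project part ("ppp")


def to_base36(n):
    """Return the base36 representation of a non-negative integer."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def encode(project_id, page_id):
    """Zero-pad the project part to a fixed width, then append the page part."""
    project = to_base36(project_id).rjust(PROJECT_WIDTH, "0")
    if len(project) > PROJECT_WIDTH:
        raise ValueError("project id does not fit in the fixed-width prefix")
    return project + to_base36(page_id)


def decode(code):
    """Split at the fixed width; no separator character is required."""
    if len(code) <= PROJECT_WIDTH:
        raise ValueError("code is too short to contain a page part")
    return int(code[:PROJECT_WIDTH], 36), int(code[PROJECT_WIDTH:], 36)
```

With this layout, three base36 characters cover 36^3 = 46656 project ids, and the padding costs at most two characters for low ids.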
[00:29:41] seperators would be a waste of chars on the long run [00:29:44] well yes [00:29:46] indeed [00:29:47] 00A [00:29:50] 00Z [00:29:57] then 010 [00:30:24] I like your idea better than mine in that regard [00:30:30] rather than PLL you had WWW in mind [00:30:38] less messy [00:30:55] hm.. base_convert tells me 0 and 000 both convert to 0 [00:30:59] (36 to 10) [00:31:01] didn't know that [00:31:11] well yes [00:31:20] 00001 decimal is still a 1 :p [00:31:39] same trick is usable for the page code [00:32:07] it will have one or two 0's for most projects as 4 base 36 chars should be sufficient for vast majority of the projects [00:32:32] we could choose not to worry about those though [00:32:48] we don't have to pad the pageid hash [00:32:54] yeah [00:33:05] it may be useful if we want to do more with it [00:33:16] if we do pageid hashing, we won't be able to use it for special pages, no? [00:33:33] we can get crafty for that [00:33:45] the code will never use %s [00:33:59] %code for special page would be a possibility [00:34:12] WWW%CCCCC for example [00:34:37] we can also use the space character [00:34:38] according to http://qrcode.meetheed.com/question3.php QR codes can have a max of 4,296 chars [00:34:48] so I don't know why we need to couple the two? [00:35:05] YuviPanda do you know how that looks? [00:35:14] how what looks? [00:35:17] 2D QR codes? [00:35:31] no a 4296 char QR code [00:35:39] https://upload.wikimedia.org/wikipedia/commons/e/eb/Qr-code-ver-40.png [00:35:43] its scary [00:35:52] ah [00:36:07] as you add more chars it becomes... intense [00:36:47] shrug, is it really that bad? Its not like you are reading them by hand [00:37:03] well, maybe someone does. 
Probably the same people that wear binary watch's :) [00:37:15] on a small screen it would be impractical, on a print out with a smaller scaling it wouldnt work at all [00:37:36] my phone cant read the image I linked for example on a 17 inch screen [00:37:48] this thing is probably larger than a 17 inch [00:38:35] 27 inch [00:40:18] Krinkle I am getting the sense that you arent trying to shorten stuff such as diffs? [00:41:27] YuviPanda: Nobody is talking about pageid hashing, as explained in the RFC, we'd use the shorturl table, which has its own pageid (that's the pageid we are talking about) [00:41:40] which maps to a plain namespace and page title [00:41:42] kaldari: sure, but that won't help for special pages and stuff [00:41:47] or fragments [00:41:48] (CR) Ryan Lane: [C: 2] Add Ganglia view for static asset payload size [operations/puppet] - https://gerrit.wikimedia.org/r/85796 (owner: Ori.livneh) [00:41:50] which can contain special pages, too, though right now it doesn't [00:42:09] special pages can be given some sort of special coding [00:42:10] special pages with query params? [00:42:32] YuviPanda well dont expect short url to fix all of lifes problems [00:42:32] hmm, they can be considered to go together with 'page title' [00:42:50] Also, some things aren't problems. [00:43:13] I don't think we're concerned within this scope about fragments or query parameters. [00:43:38] ToAruShiroiNeko: https://www.mediawiki.org/wiki/Requests_for_comment/URL_shortener#Wiki_identifier_.282.29:_Map_wiki-id [00:44:11] I haven't changed it from PCC to PPP yet [00:44:19] Krinkle I see [00:44:22] Can you review that? [00:44:35] sure [00:44:43] I'm just hoping we don't end up with a different set of shorturl / encoding for pages, other things, etc [00:45:34] Krinkle you want me to edit? [00:45:39] sure [00:45:42] yay! [00:48:24] what is the SQL column name of the project id?
[00:48:50] There exists none [00:48:52] YuviPanda: https://www.mediawiki.org/wiki/Requests_for_comment/URL_shortener#Full_url_mapping [00:49:01] oh so we would map it. [00:49:29] +1 to that, Krinkle [00:49:56] hash, stuff in DB, report base36. done [00:50:29] we could even just do something simpler, like hash, find shortest prefix of hash itself that is usable, and return that. [00:50:31] that sounds horrendously slow though [00:50:38] nevermind I said that. [00:51:05] we can do this trivially, since we can just perma redirect them, and also heavily cache them [00:51:20] I can even offer to write one up on labs, with the requisite caching... [00:53:18] (PS1) Ori.livneh: Use static_assets Ganglia view on nickel [operations/puppet] - https://gerrit.wikimedia.org/r/85807 [00:54:10] in fact, I bet this could be trivially done with just a nginx module. [00:54:47] Sure [00:54:59] But it all comes down to pros and cons for implementation [00:55:28] there's lots of shorturl libs that are ready for use. But there's a couple catches. For one, it appears we want to avoid needing ondemand generation. [00:55:41] Krinkle: Extension:ShortUrl does on demand generation [00:55:48] eg. have a deterministic url for simple shortening that can be exposed on the page by a gadget. [00:56:06] and on demand generation in its case involves a couple of db calls [00:56:21] Yes, but that's somewhat different [00:56:32] how so?
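Editor's note: the "hash, find shortest prefix of hash itself that is usable" idea floated above can be sketched roughly as follows. This is only an illustration under stated assumptions. A plain dict stands in for the database table, SHA-1 is an arbitrary choice of hash, the hex digest is used instead of base36 for brevity, and `shorten` and `min_len` are hypothetical names. Each probe of a longer prefix would be one more lookup against the real store, which is the cost raised in the discussion and the reason such a scheme would lean on permanent redirects and caching.

```python
import hashlib


def shorten(url, table, min_len=4):
    """Return the shortest prefix of the URL's hash that is free or already maps to it.

    `table` is a dict standing in for the database (assumption for this sketch).
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    for length in range(min_len, len(digest) + 1):
        prefix = digest[:length]
        existing = table.get(prefix)
        if existing is None:
            # Prefix is unused: claim it for this URL.
            table[prefix] = url
            return prefix
        if existing == url:
            # Already shortened earlier: return the same code again.
            return prefix
        # Prefix is taken by a different URL: try one character more.
    raise RuntimeError("every prefix of the digest is taken")
```

Calling `shorten` twice with the same URL and the same table returns the same code, so the forwarder only needs a single lookup from code to URL.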
[00:56:32] also, we'd probably want to normalise the protocol [00:57:19] (CR) Ryan Lane: [C: 2] Use static_assets Ganglia view on nickel [operations/puppet] - https://gerrit.wikimedia.org/r/85807 (owner: Ori.livneh) [00:57:22] Krinkle I am adding some images too [01:02:15] ec :o [01:02:50] (PS1) Asher: removing self from icinga [operations/puppet] - https://gerrit.wikimedia.org/r/85808 [01:03:42] (CR) Asher: [C: 2 V: 2] removing self from icinga [operations/puppet] - https://gerrit.wikimedia.org/r/85808 (owner: Asher) [01:03:48] ToAruShiroiNeko: an ec is usually resolved by resolving it, not by replacing it with your version [01:04:57] k, done [01:05:07] sure but it was too much work to do in one go for me :p [01:05:29] I know its sloppy [01:05:41] but in the end I end up retaning everything [01:08:02] Krinkle your example has a problem tho [01:08:14] http://wmf.co/wenav [01:08:19] should be encided [01:08:22] *encoded [01:08:33] already done [01:08:34] https://www.mediawiki.org/wiki/Requests_for_comment/URL_shortener#Wiki_identifier_.282.29:_Map_wiki-id [01:09:24] "av" pageid hash of shorturl [01:13:38] https://www.mediawiki.org/w/index.php?title=Requests_for_comment%2FURL_shortener&diff=789369&oldid=789368 [01:14:16] ya [01:14:29] I was trying to merge the two tables [01:14:36] is this or that better? [01:14:46] Krinkle: why is pre-generating in Extension:ShortUrl better than just hashing the URL? [01:15:29] YuviPanda isnt it longer? [01:15:30] I can't answer to "better" as I think that would be impossible to debate about at this point [01:15:32] however [01:15:45] It would allow the forwarder to be dumb [01:16:06] e.g. not have a database other than the wiki domain map [01:16:38] and it would allow cheap url generation from both mediawiki javascript and php [01:16:39] hmm, so is it just a matter of 'the code is already written, why not just use as much of it as possible?' [01:16:39] ?
[01:16:47] No, quite the contrary [01:17:29] there is the big downside of not supporting special pages. [01:17:39] That statement is incorrect. [01:17:53] as listed in the RFC, the downside is not supporting query parameters. [01:17:58] special pages work just fine. [01:18:03] right. [01:18:41] Krinkle I know this is nitpicking but perhaps we should avoid making en 1st id on this :) [01:18:41] that's what I had in my head (query params in special pages and elsewhere), and that translated out wrong onto the keyboard :P [01:18:50] the ShortUrl extension doesnt' support special pages right now, but that existing extension means nothing, this RFC is separate. If we end up using ShortUrl as a base, this would be the first bug to fix, and quite an easy one to fix at that. [01:19:02] righ [01:19:02] t [01:20:01] I suppose short url itself can take parameters [01:20:11] sure [01:20:20] but that makes the url longer [01:20:40] say I want to review indef blocked IPs on my iPad (not that I own one :P) [01:20:52] I can just snap the qr code and fill in the checkboxes by hand [01:21:21] having that handy while typing out my remark [01:21:32] https://www.mediawiki.org/w/index.php?title=Requests_for_comment%2FURL_shortener&diff=789375&oldid=789373 [01:22:03] ya dif magic is fine [01:22:10] tho I'd put the mapping on the other side [01:22:14] *div [01:22:21] you put it in this order [01:22:25] yes I know [01:22:38] nerely a suggestion :) [01:22:44] I cant type today :o [01:23:09] Krinkle I think it may be useful to consider putting possible uses for the qr code as well as the shortening idea [01:23:58] and you may want to mention wmuk trying to acquire qrpedia thing [01:24:06] https://uk.wikimedia.org/wiki/Water_cooler#QRpedia_update [01:26:08] Krinkle maybe images may look better on the left, they look like an afterthought :/ [01:49:53] (PS1) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814 [01:49:59] ori-l: ^ [01:50:04] initial
stab. [01:50:07] doing more [01:50:18] * YuviPanda finds a host to est [01:50:20] *test [01:53:08] I read that as finds a host to eat [01:53:30] heh [01:53:33] (PS2) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814 [01:53:40] bd808|BUFFER: ^ [01:56:52] YuviPanda: Cool. [01:57:05] (PS3) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814 [01:59:15] (PS4) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814 [02:04:28] (PS5) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814 [02:06:37] (PS6) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814 [02:15:01] !log LocalisationUpdate completed (1.22wmf18) at Tue Sep 24 02:15:01 UTC 2013 [02:15:18] Logged the message, Master [02:16:20] ori-l: so once I fix all the errors that prop up there, I'm going to have a very simple enable-role disable-role interface... somewhere.
[02:16:32] ori-l: and then setup a cron that applies the vagrant puppet stuff every 30min [02:21:28] !log LocalisationUpdate completed (1.22wmf17) at Tue Sep 24 02:21:27 UTC 2013 [02:21:41] Logged the message, Master [02:40:39] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Sep 24 02:40:38 UTC 2013 [02:40:51] Logged the message, Master [02:47:06] (PS7) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814 [02:48:57] (PS8) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814 [02:55:28] (PS9) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814 [02:59:08] (PS10) Yuvipanda: Add labsvagrant module [operations/puppet] - https://gerrit.wikimedia.org/r/85814 [03:12:11] (PS1) Springle: add db1036 to s2 [operations/puppet] - https://gerrit.wikimedia.org/r/85816 [03:14:01] (CR) Springle: [C: 2] add db1036 to s2 [operations/puppet] - https://gerrit.wikimedia.org/r/85816 (owner: Springle) [03:16:23] (PS1) Springle: warm up db1036 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85817 [03:17:16] (CR) Springle: [C: 2] warm up db1036 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85817 (owner: Springle) [03:17:56] !log springle synchronized wmf-config/db-eqiad.php [03:18:13] Logged the message, Master [03:40:16] (PS1) Springle: depool db1002 for upgrade, db1036 take up slack [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85818 [03:41:05] (CR) Springle: [C: 2] depool db1002 for upgrade, db1036 take up slack [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85818 (owner: Springle) [03:42:04] !log springle synchronized wmf-config/db-eqiad.php 'depool db1002 for upgrade, db1036 take up slack' [03:42:27] Logged the message, Master [04:57:47] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful
Puppet run in the last 10 hours [06:09:54] (PS1) Springle: icinga auth [operations/puppet] - https://gerrit.wikimedia.org/r/85825 [06:12:04] (CR) Springle: [C: 2] icinga auth [operations/puppet] - https://gerrit.wikimedia.org/r/85825 (owner: Springle) [06:13:15] <^demon|zzz> !log ES: enwikisource & cawiki done indexing. Really need to revisit batch indexing, way too slow. [06:13:35] Logged the message, Master [06:16:34] I guess I'm the only one who constantly reads "ES" as "ExternalStore." [06:16:56] <^demon|zzz> Yeah, I should probably stop doing that. [06:17:14] <^demon|zzz> Anyway, zzz. [06:22:34] PROBLEM - Puppet freshness on mw1072 is CRITICAL: No successful Puppet run in the last 10 hours [06:36:23] !log upgrading db1002 to precise and mariadb [06:36:36] Logged the message, Master [07:05:50] (PS1) Springle: set coredb db1002 to mariadb [operations/puppet] - https://gerrit.wikimedia.org/r/85829 [07:06:48] (CR) Springle: [C: 2] set coredb db1002 to mariadb [operations/puppet] - https://gerrit.wikimedia.org/r/85829 (owner: Springle) [07:38:21] anyone around? [07:38:27] Labs NFS seems to be killed again [07:38:31] * YuviPanda pokes apergos, Coren [07:39:17] * YuviPanda wonders who else he can poke [07:42:59] * YuviPanda pokes springle [07:43:21] !log upgrading Jenkins on gallium from 1.509.2 to 1.509.3 [07:43:34] Logged the message, Master [07:43:51] YuviPanda: ? [07:43:54] oh [07:44:02] springle: labs NFS seems locked up. [07:44:10] looking for ops people who are awake... [07:45:11] http://ganglia.wmflabs.org/latest/?c=tools&m=load_one&r=hour&s=by%20name&hc=4&mc=2 seems not great, tho i've no idea how to properly read it [07:45:47] things worked like 30 min ago [07:57:35] labstore4 seems completely locked up [07:57:56] springle: would restarting it help? [07:58:24] about to see if i have access to do that...
[08:01:22] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Labs%2520NFS%2520cluster%2520pmtpa&tab=m&vn= [08:01:31] I guess that is labstore3 having some weird nfs issue [08:02:07] whelp [08:02:42] Ariel did attempt to restart the NFS box on Sept10 around 4pm UTC [08:02:55] has it been two weeks since? [08:03:01] shit [08:03:03] 14 days almost yeah [08:03:03] exactly two weeks [08:03:06] our old overflow bug is back [08:03:07] na less [08:03:20] I guess it wasn't exactly two weeks [08:03:21] but close [08:03:23] but yeah roughly [08:03:57] hmm, 1209600 seconds fits inside 21 bits [08:06:44] !log labstore4 locked up, power cycled [08:06:59] Logged the message, Master [08:12:35] !log 8:12am INFO: Jenkins is fully up and running [08:12:45] Logged the message, Master [08:12:46] !log jenkins: upgrading plugins [08:13:01] Logged the message, Master [08:16:21] hmmmm, guys [08:16:35] is the IP address 94.23.242.48 somehow blocked by ops or something? [08:16:48] i get this when trying to edit via API: {"servedby"=>"mw1130", "error"=>{"code"=>"unknownerror", "info"=>"Unknown error: \"globalblocking-ipblocked\""}} [08:16:53] and it's definitely not globally blocked [08:17:03] (because i've gotten an error when trying to unblock it locally) [08:17:19] that IP is the Polish Toolserver [08:18:21] looking at when my bot stopped editing, this happen ~2 days ago [08:19:39] (last edit is 2013-09-22T15:28:42 UTC) [08:19:59] !log restarting Jenkins [08:20:11] Logged the message, Master [08:20:57] MatmaRex: have you tried again? [08:21:14] last time around 10 minutes ago [08:22:51] (i am trying to edit logged in, btw. and still happening right now, btw.) 
[08:23:46] I have no clue how global blocking works nowadays:/ [08:25:09] the thing is that it doesn't seem to be globally blocked [08:25:13] no logs, i can't unblock it [08:25:18] yet i get that error [08:30:12] !log 8:20am INFO: Jenkins is fully up and running [08:30:25] Logged the message, Master [08:30:41] !log labstore4 disk issues, /a failed mount in console, ssh key bouncing. awaiting input from ryan or coren [08:30:42] !log restarting Jenkins again for one last plugin upgrade [08:30:53] Logged the message, Master [08:31:03] Logged the message, Master [08:54:56] springle: any luck? [08:55:11] just a note, basically exactly the same thing happened a week or so ago [08:55:22] and Coren said to just phone him :P [08:55:48] hashar: i have issues with very slow response with https , anything known? [08:58:56] springle: addshore: i got root at labstore4 [08:59:00] investigating [08:59:06] labstore3? ;p [08:59:28] what do you mean 3 ? [08:59:35] the problem was with 4 right ? [09:00:00] special device /dev/mapper/vg1-lv1 does not exist [09:00:07] I was unaware there was a 4 that was running. but http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=cpu_report&s=by+name&c=Labs+NFS+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [09:00:17] akosiaris: thanks [09:06:20] addshore: multiple nfsd process all in D state ... I am unaware of the setup of these machines so I will assume the two problems are connected for now, and continue with labstore4 [09:09:19] springle: akosiaris: labstore4 is in decommissioning .. [09:09:27] manifests/decommissioning.pp:'labstore4', #added 9/17 to remove from monitoring [09:09:39] though if it has been added there solely to remove it from Ganglia… that is a bit lame :-D [09:10:06] Id1804eb7bf9f708c7fcac29e1fcde712425c0e4f adding labstore1002/labstore3/ssl3004 to decom.pp to remove from monitoring [09:10:06] :( [09:13:45] hashar: any idea is labstore4 is used anywhere? 
[09:14:33] no clue [09:14:52] akosiaris: springle: I think you want to reboot labstore3 [09:15:29] labstore4 is missing a PV from a VG, and seems like it has not seen some drives... [09:16:18] I have not yet investigated labstore3, but apart from high CPU usage, it is still accessible so I am not rebooting it [09:17:27] akosiaris: labstore3 machine may be accessible but nfsd is not :/ [09:18:08] We have no indication that a reboot will fix that. [09:18:16] still investigating ... [09:18:33] akosiaris: labstore3 has a high system cpu usage, which seems to be the consequence of the NFS kernel bug we are facing every 14 days or so [09:18:47] wasn't this resolved ? [09:18:50] akosiaris: indeed as last time it was rebooted and it did not fix the problem as nfsd did not come back up [09:18:53] but I am just making assumptions, I don't know the exact background. [09:18:58] I could not find the previous bug / rt [09:19:45] well, Coren moved the firmware forward to get away from this bug, unfortunatly the newer firmware had a similar problem that was much more frequent, hence it has been rolled back to this one which apparently still has this here 2 week bug [09:21:09] akosiaris: http://permalink.gmane.org/gmane.org.wikimedia.labs/1395 [09:21:22] akosiaris: mail from coren in July stating labstore3 was stalled [09:21:41] akosiaris: maybe we should wake him up so he can take whatever trace he needs :/ [09:22:02] that is why i stated that I thought this was fixed [09:25:53] hashar: akosiaris https://bugzilla.wikimedia.org/show_bug.cgi?id=52500 this was the last issue the nfs had [09:28:00] 4 weeks ago I guess (judging by when the ticket was marked as resolved) http://permalink.gmane.org/gmane.org.wikimedia.labs/1395 was reverted. 
2 weeks after that the 10th labstore3 did exactly as it is now http://article.gmane.org/gmane.org.wikimedia.labs/1628/match=tools+labs+down and another 2 weeks it is doing it again [09:29:55] and per what we said 2 weeks ago he doesnt mind being woken up for this and would perfer being worked up to having more downtime :) [09:32:34] we should `reboot` every 13 days :D [09:33:21] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [09:34:08] rebooted labstore4, it is complaining about an unknown disk configuration ... [09:34:10] nice... [09:34:18] crappy controllers ftw [09:34:31] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 35.41 ms [09:36:05] addshore: is NFS ok now ? [09:36:13] akosiaris: you rebooted labstore4? I am almost starting to think these two machines are the same.. [09:36:35] !log rebooted labstore3 (thought the 14days NFS issue was resolved???) [09:36:47] !log rebooted labstore4, controller configuration lost ? [09:36:49] akosiaris: if it is it will take a minute or two to be noticeable and then everything will slowly catch up [09:37:23] akosiaris: just make sure the service is running, after the reboot 2 weeks ago the service didnt come up automatically :) hence further downtime [09:42:08] as nothing seems to be happening yet I can only imagine the service hasn't restarted again and may need a manual poke [09:44:52] which I am trying to figure out how to do .... this seems so not standard... [09:45:34] akosiaris: .bash_history? ;p [09:51:02] !log Pooled Varnish Text caches in eqiad with weight 1 (affecting wikidata, wikivoyage, *.wikimedia.org) [10:01:22] upped to weight 5 [10:03:29] akosiaris: did you do something? ;p [10:03:39] maybe [10:03:45] seems like the nfs server is running [10:03:48] akosiaris: whatever you did write it down :P [10:03:49] and exports are there [10:03:53] yup [10:04:03] everything appears to be catching back up :) [10:05:24] akosiaris: congratulations! 
[10:05:32] springle: we both did not have access to labstore4 ... we should both have now... [10:05:44] +1 to documenting what you did akosiaris [10:06:03] akosiaris: well done [10:06:14] I run an unknown to me script that prompted me to not run it [10:06:14] you are our new faidon now :-D [10:06:15] akosiaris: +1 for docs, and what is labstore4? :P I have never seen any reference to it anywhere! [10:07:01] Whatever that machine was doing, ... it seems to have problems with its disks... [10:07:40] now time to log those !logs since more is back [10:07:41] !log [19:36AEST] !log rebooted labstore3 (thought the 14days NFS issue was resolved???) [10:07:41] !log [19:36AEST] !log rebooted labstore4, controller configuration lost ? [10:07:51] Logged the message, Master [10:08:02] Logged the message, Master [10:08:08] !log [19:51AEST] !log Pooled Varnish Text caches in eqiad with weight 1 (affecting wikidata, wikivoyage, *.wikimedia.org) [10:08:08] !log [20:01AEST] upped to weight 5 [10:08:19] Logged the message, Master [10:08:30] Logged the message, Master [10:09:50] oh whoops, more died after the first two, oh well, dup entries [10:10:29] akosiaris: hope you enjoyed digging through the maze of things :) and thanks :) [10:10:34] beta is back up too :0 [10:10:37] =] [10:11:41] addshore: you are most welcome. I sure hope everything is ok. 
We need better docs and monitoring for labs [10:12:18] akosiaris: from what I can see everything is recovering as expected :) and yes more docs and more monitoring :P [10:12:36] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [10:17:12] (PS1) Hashar: update git buildpackage conf comments [operations/puppet] - https://gerrit.wikimedia.org/r/85840 [10:18:14] (CR) jenkins-bot: [V: -1] update git buildpackage conf comments [operations/puppet] - https://gerrit.wikimedia.org/r/85840 (owner: Hashar) [10:19:22] holy [10:19:57] (CR) Hashar: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/85840 (owner: Hashar) [10:25:41] (PS1) Hashar: update submodules url so they end with '.git' [operations/puppet] - https://gerrit.wikimedia.org/r/85841 [10:25:59] (CR) jenkins-bot: [V: -1] update submodules url so they end with '.git' [operations/puppet] - https://gerrit.wikimedia.org/r/85841 (owner: Hashar) [10:26:56] that is totally dumb :/ [10:33:46] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 10:33:42 UTC 2013 [10:34:36] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [10:48:01] (CR) Hashar: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/85841 (owner: Hashar) [10:48:53] (Abandoned) Hashar: update submodules url so they end with '.git' [operations/puppet] - https://gerrit.wikimedia.org/r/85841 (owner: Hashar) [10:49:15] (CR) Hashar: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/85840 (owner: Hashar) [11:04:29] !log powercycled sq45 [11:04:42] Logged the message, Master [11:05:33] RECOVERY - Host sq45 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [11:07:53] PROBLEM - SSH on sq45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:08:03] PROBLEM - Frontend Squid HTTP on sq45 is CRITICAL: Connection refused [11:09:25] ACKNOWLEDGEMENT -
Backend Squid HTTP on sq45 is CRITICAL: Connection refused daniel_zahn broken hardware - RT #5803 [11:09:25] ACKNOWLEDGEMENT - Frontend Squid HTTP on sq45 is CRITICAL: Connection refused daniel_zahn broken hardware - RT #5803 [11:09:26] ACKNOWLEDGEMENT - Puppet freshness on sq45 is CRITICAL: No successful Puppet run in the last 10 hours daniel_zahn broken hardware - RT #5803 [11:09:26] ACKNOWLEDGEMENT - SSH on sq45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn broken hardware - RT #5803 [11:10:08] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [11:12:27] heh, I forgot for a moment you're in your timezone :) [11:15:30] hehe, yea [11:16:36] re the SSL certificate change, ehm.. originally the ticket was to use SSLCACertificatePath, then it turned into not using it [11:17:02] so now we want SSLCertificateChainFile but not SSLCACertificatePath, ack? [11:18:03] I haven't followed up on that ticket, I think I filled it just to make sure we don't forget about it :D [11:19:31] RT #4823 first it was about adding it , then about removing it :p [11:20:13] ok, i'll suggest another patch on change 84901 [11:27:33] (CR) Akosiaris: [C: 2] update git buildpackage conf comments [operations/puppet] - https://gerrit.wikimedia.org/r/85840 (owner: Hashar) [11:32:38] RECOVERY - SSH on sq45 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:33:58] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Tue Sep 24 11:33:50 UTC 2013 [11:34:08] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [11:38:21] (PS2) Dzahn: replace SSLCACertificatePath with SSLCertificateChainFile in Apache templates [operations/puppet] - https://gerrit.wikimedia.org/r/84901 [11:47:36] gwicke_away: RoanKattouw_away : parsoid.wmflabs.org down [12:03:58]