[00:00:41] * Krenair will amend the patchset [00:01:19] I'm not sure what the connection with meta.wikimedia.org/wiki/MediaWiki:Experiments.js is [00:02:26] if I was doing that task, I would make an administrative interface in PHP which ran on some unrelated domain, like e3control.wikimedia.org [00:02:32] then that admin interface would write to the database [00:02:57] then an RL module would provide access to that data as JS code [00:03:09] the RL module would be part of the existing extension [00:04:09] and I would make the DB query cached in memcached, and I would make the RL module cacheable on the client side [00:04:22] or perhaps even combined with some startup bundle [00:04:39] well, that was my first instinct, but halfway through implementing a mediawiki interface that allows interactive update, is versioned, and references the person making the update [00:04:50] i got the sneaking suspicion that i'm implementing mediawiki in mediawiki [00:04:52] ... Sounds easier to just deploy changes [00:05:18] Fix MediaWikis config handling [00:05:19] Krenair: no. i did that for the past six months. [00:05:26] Reedy: Hah! [00:05:31] I was trying to lure him in on that yesterday. [00:05:32] Brooke: you paid him to say that [00:05:38] I wish. [00:05:57] afaik "mediawikis config handling" == reedy [00:06:00] well, you could still run the admin interface on a separate domain, even if it is MW [00:06:03] Heh. [00:06:15] TimStarling: And use what for user auth? [00:06:16] we have lots of wikis in *.wikimedia.org [00:06:17] No ori-l, I won't come and configure your own MediaWiki installation [00:06:25] Or you mean installing a separate MW instasnce? [00:06:28] instance [00:06:51] yeah, a separate MW instance would be ideal [00:07:05] what would be a good one to use? [00:07:25] I'm not sure what advantage a separate MW instance is over using Meta-Wiki. 
[00:07:34] privilege separation [00:07:42] usually admin interfaces are full of XSS vulnerabilities [00:07:45] Doesn't the entire user rights system already account for that? [00:07:51] so the idea is to put it in a separate cookie domain [00:07:53] Well, let's not introduce those, then. [00:08:00] yeah, sure [00:08:08] TimStarling: well, why create an admin interface? the people who will need to update stuff number 4-5 [00:08:20] it can just be some JSON [00:08:48] Loop. [00:08:50] some JSON changed how? [00:09:06] just edit the article it lives in? [00:09:06] Isn't there a better code editor these days? [00:09:36] New patchset: Alex Monk; "Redirect secure.wikimedia.org URLs to proper HTTPS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13429 [00:09:50] yes, you could do that [00:10:53] and if i did do that, there really wouldn't be a point in keeping it on a separate wiki, no? since i won't be building a custom crud interface for the database, just using the standard edit interface [00:11:09] you could even have the contents of a metawiki page delivered via an RL module on the local wiki [00:11:23] if you wanted to get clever [00:11:28] yeah, that's what i was trying to explain above [00:11:32] that was the idea [00:12:06] i didn't articulate it very clearly [00:12:20] bbl [00:12:49] ori-l, ok, I uploaded a new patchset to redirect wikidata properly [00:13:05] Krenair: would you like me to update the config on the labs machine? [00:13:10] to match the patch? 
[00:13:49] I doubt it can cause any issues, but yes please [00:16:22] Krenair: done [00:18:09] Okay so that was a good idea [00:18:20] Turns out what wasn't a good idea was coding at past midnight [00:18:29] wikidata is at wikidata.org, not wikidata.wm.o -.- [00:20:06] New patchset: Alex Monk; "Redirect secure.wikimedia.org URLs to proper HTTPS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13429 [00:24:25] Krenair: updated labs [00:25:15] It's unclear whether it's at www.wikidata.org or wikidata.org. [00:25:25] I guess it's still being decided or something. [00:26:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:29:31] Brooke, looks like it's www. [00:30:22] Krenair: Kind of. [00:30:51] http://wikidata.org/wiki/Hello and http://www.wikidata.org/wiki/Hello both work. [00:34:52] Brooke, sitematrix points to www. [00:36:18] I don't know what you want me to say. It's in an inconsistent state. There's a bug about it somewhere. [00:39:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.821 seconds [00:40:53] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [00:48:30] Brooke, I'll assume www. here. I don't think it matters a great deal [00:52:54] PROBLEM - Puppet freshness on cp1042 is CRITICAL: Puppet has not run in the last 10 hours [00:57:06] New patchset: Alex Monk; "Redirect secure.wikimedia.org URLs to proper HTTPS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13429 [01:08:26] nak [01:08:40] well, kind of [01:08:48] I'd prefer to just 404 wikidata [01:09:16] who would ever link to wikidata on secure [01:09:18] but anyway [01:09:25] but did testing reveal Krenair? [01:12:18] paravoid, what did it reveal? [01:13:50] er, s/what/but/ [01:13:55] er, s/but/what/ even [01:14:01] too late [01:14:07] That everything worked fine... Until I tried to add wikidata. 
Took me two failed tries to work out what the hell I was doing wrong [01:14:13] I should better get some sleep [01:14:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:29:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [01:42:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 267 seconds [01:42:35] Bot noise. [01:45:41] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [01:46:16] Brooke? [01:46:26] dbbot-wm [01:47:39] I wonder why it keeps joining then closing it's connection [02:00:02] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 256 seconds [02:00:20] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 274 seconds [02:02:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:56] PROBLEM - Puppet freshness on copper is CRITICAL: Puppet has not run in the last 10 hours [02:09:56] PROBLEM - Puppet freshness on db1001 is CRITICAL: Puppet has not run in the last 10 hours [02:09:56] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours [02:09:56] PROBLEM - Puppet freshness on cp1005 is CRITICAL: Puppet has not run in the last 10 hours [02:09:56] PROBLEM - Puppet freshness on hooper is CRITICAL: Puppet has not run in the last 10 hours [02:09:57] PROBLEM - Puppet freshness on mc8 is CRITICAL: Puppet has not run in the last 10 hours [02:09:57] PROBLEM - Puppet freshness on srv251 is CRITICAL: Puppet has not run in the last 10 hours [02:09:58] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [02:09:58] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [02:09:59] PROBLEM - Puppet freshness on srv223 is CRITICAL: Puppet has not run in the last 10 hours [02:10:51] PROBLEM - Puppet 
freshness on es5 is CRITICAL: Puppet has not run in the last 10 hours [02:10:51] PROBLEM - Puppet freshness on mc3 is CRITICAL: Puppet has not run in the last 10 hours [02:10:51] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [02:10:51] PROBLEM - Puppet freshness on sq33 is CRITICAL: Puppet has not run in the last 10 hours [02:10:51] PROBLEM - Puppet freshness on srv272 is CRITICAL: Puppet has not run in the last 10 hours [02:11:53] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [02:11:53] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [02:11:53] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [02:13:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.589 seconds [02:32:41] !log LocalisationUpdate completed (1.21wmf3) at Mon Nov 12 02:32:41 UTC 2012 [02:32:50] Logged the message, Master [02:38:26] RECOVERY - Puppet freshness on search1011 is OK: puppet ran at Mon Nov 12 02:38:16 UTC 2012 [02:39:11] RECOVERY - Puppet freshness on es5 is OK: puppet ran at Mon Nov 12 02:38:43 UTC 2012 [02:39:11] RECOVERY - Puppet freshness on search33 is OK: puppet ran at Mon Nov 12 02:38:55 UTC 2012 [02:39:38] RECOVERY - Puppet freshness on db1031 is OK: puppet ran at Mon Nov 12 02:39:19 UTC 2012 [02:40:50] PROBLEM - Puppet freshness on analytics1009 is CRITICAL: Puppet has not run in the last 10 hours [02:40:50] PROBLEM - Puppet freshness on cp1006 is CRITICAL: Puppet has not run in the last 10 hours [02:40:50] PROBLEM - Puppet freshness on cp1025 is CRITICAL: Puppet has not run in the last 10 hours [02:40:50] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours [02:40:50] PROBLEM - Puppet freshness on srv202 is CRITICAL: Puppet has not run in the last 10 hours [02:41:35] RECOVERY - Puppet freshness on sq33 is 
OK: puppet ran at Mon Nov 12 02:41:14 UTC 2012 [02:41:35] RECOVERY - Puppet freshness on hooper is OK: puppet ran at Mon Nov 12 02:41:17 UTC 2012 [02:42:11] RECOVERY - Puppet freshness on srv251 is OK: puppet ran at Mon Nov 12 02:41:40 UTC 2012 [02:47:08] RECOVERY - Puppet freshness on db1001 is OK: puppet ran at Mon Nov 12 02:46:53 UTC 2012 [02:48:38] RECOVERY - Puppet freshness on mc8 is OK: puppet ran at Mon Nov 12 02:48:17 UTC 2012 [02:49:14] RECOVERY - Puppet freshness on cp1006 is OK: puppet ran at Mon Nov 12 02:48:53 UTC 2012 [02:49:41] RECOVERY - Puppet freshness on srv223 is OK: puppet ran at Mon Nov 12 02:49:25 UTC 2012 [02:51:38] RECOVERY - Puppet freshness on mc3 is OK: puppet ran at Mon Nov 12 02:51:31 UTC 2012 [02:51:38] RECOVERY - Puppet freshness on srv202 is OK: puppet ran at Mon Nov 12 02:51:36 UTC 2012 [02:52:05] RECOVERY - Puppet freshness on cp1005 is OK: puppet ran at Mon Nov 12 02:51:55 UTC 2012 [02:54:11] RECOVERY - Puppet freshness on srv272 is OK: puppet ran at Mon Nov 12 02:53:52 UTC 2012 [02:58:41] RECOVERY - Puppet freshness on cp1025 is OK: puppet ran at Mon Nov 12 02:58:25 UTC 2012 [02:59:08] RECOVERY - Puppet freshness on sodium is OK: puppet ran at Mon Nov 12 02:58:44 UTC 2012 [03:02:44] RECOVERY - Puppet freshness on search17 is OK: puppet ran at Mon Nov 12 03:02:12 UTC 2012 [03:03:38] RECOVERY - Puppet freshness on searchidx1001 is OK: puppet ran at Mon Nov 12 03:03:12 UTC 2012 [03:06:38] RECOVERY - Puppet freshness on analytics1009 is OK: puppet ran at Mon Nov 12 03:06:16 UTC 2012 [03:28:50] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [03:31:42] RECOVERY - Puppet freshness on copper is OK: puppet ran at Mon Nov 12 03:31:20 UTC 2012 [03:43:32] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:49:59] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 13 seconds [05:44:11] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: 
Connection timed out [05:45:32] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.021 second response time on port 8123 [07:02:08] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [07:05:35] New review: Siebrand; "Any updates? Another month went by." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/12188 [08:19:11] what the hell... [08:32:13] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: Puppet has not run in the last 10 hours [08:32:13] PROBLEM - Puppet freshness on analytics1013 is CRITICAL: Puppet has not run in the last 10 hours [08:32:13] PROBLEM - Puppet freshness on analytics1014 is CRITICAL: Puppet has not run in the last 10 hours [08:32:13] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: Puppet has not run in the last 10 hours [08:32:13] PROBLEM - Puppet freshness on analytics1016 is CRITICAL: Puppet has not run in the last 10 hours [08:32:14] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Puppet has not run in the last 10 hours [08:32:14] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: Puppet has not run in the last 10 hours [08:32:15] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Puppet has not run in the last 10 hours [08:32:15] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: Puppet has not run in the last 10 hours [08:32:16] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Puppet has not run in the last 10 hours [08:32:16] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Puppet has not run in the last 10 hours [08:32:17] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: Puppet has not run in the last 10 hours [08:32:17] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [08:32:18] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [09:02:18] apergos: Did you see my ping about the eswiki backups? 
[09:02:25] Should be all good now... [09:02:32] yes, I reran them already, they are past the first problem stage [09:02:40] yay [09:02:46] but you're not on the xml datadumps list or you would have seen today's mail :-P [09:03:04] I saw that the bug fix went in but not when it was deployed so thanks for the ping [10:13:49] New patchset: Dereckson; "(bug 41962) Namespace configuration for bar.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33043 [10:14:18] New review: Dereckson; "PS3: Renaming "Portal Diskussion" to "Portal Dischkrian"" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/33043 [10:15:42] New patchset: Dereckson; "(bug 41962) Namespace configuration for bar.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33043 [10:16:03] New review: Dereckson; "PS4: Fixing whitespace issue" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/33043 [10:41:55] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [10:53:55] PROBLEM - Puppet freshness on cp1042 is CRITICAL: Puppet has not run in the last 10 hours [11:14:22] New patchset: Ori.livneh; "Enable CombineUserTalk for enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33058 [11:14:22] New patchset: Ori.livneh; "Enable event logging for mobile beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32864 [11:17:57] New patchset: Ori.livneh; "Enable CombineUserTalk for enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33058 [11:18:11] New patchset: Faidon; "swift: include Ubuntu Cloud archive for folsom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33059 [11:18:33] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33058 [11:20:36] apergos: ping? 
[11:23:52] paravoid: ponngg [11:25:45] apergos: ms-be6 has synced account/container but zero objects [11:25:49] I'm debugging it now. [11:27:45] ugh [11:28:03] hope I didn't screw up the config somehow [11:28:13] actually, maybe it's better if I did since that would be easily fixed [11:28:38] how do you see those stats? [11:28:57] df? [11:28:59] ls? [11:29:00] :) [11:29:58] yeah, it's a ring screwup [11:30:05] grr [11:30:25] container & object have the wrong ports [11:30:35] oh, woops [11:30:46] also, container seems to not be balanced at all [11:31:04] forgot to run rebalance probably [11:31:05] hmm that's weird, I did rebalance after adding all three [11:31:28] maybe because it told you to rebalance again after a few hours? [11:31:33] yep [11:31:37] anyway [11:31:43] want to push a fix? [11:31:44] for account and container it does that [11:31:54] where is it? [11:32:10] and I'm gonna run puppet on everything so if you have something manually disabled... [11:32:15] now is the time to worry :-P [11:32:38] everywhere? :) [11:32:46] all ms-fe* all ms-be* [11:32:51] and no, puppet is not disabled anywhere [11:32:53] that's everywhere as far as I'm concerned [11:32:55] ok [11:33:01] no, everywhere was the answer to the "where is it?" question [11:33:19] I mean where did you redo the rings? 
[11:33:29] I didn't redo them, that's what I just asked you to do :) [11:33:55] ok, I thought you said you had a fix you wanted me to push out [11:33:57] ok then [11:34:06] no [11:34:08] sorry, my bad [11:34:25] port should be 6000 for object and 6001 for container (see the entries above it) [11:34:37] you have 6002 on all three [11:35:03] look at the port column on the other servers and you'll spot the issue immediately [11:36:29] ok great [11:36:32] * mark is reading his email backlog [11:36:34] thanks for catching that [11:36:34] i'm now at oct 12 [11:36:59] hahahaha [11:37:43] mark: so, I should reformat the thumpers; I looked up wikitech documentation, is it really going to be a PITA? [11:38:28] I saw some "grub doesn't recognize sdac so you have to install grub manually foo" but I'm not sure if it's current [11:38:41] may no longer be current [11:38:54] just try it [11:41:13] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33059 [11:42:03] anyone know what happened to tmh1 yesterday? according to ganglia it was off [11:43:05] !log apt: removing swift 1.5 from precise-wikimedia [11:43:12] Logged the message, Master [11:43:15] j^: 19:58 mutante: powercycling tmh1 [11:44:33] that's not very helpful, is it? :) [11:44:45] sorry, I don't have a better answer for you [11:45:01] ok let's hope that was a one off and see [11:45:31] was considering increasing the number of concurrent jobs to process more videos but if it already crashes with 50% cpu usage [11:46:48] New patchset: Faidon; "Remove spurious notify, doesn't work across stages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33062 [11:47:11] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33062 [11:47:20] j^: do we have a backlog? [11:48:32] paravoid: yes all the existing videos are being transcoded right now [11:49:18] transcoded to what? [11:49:55] webm, smaller resolutions if required (i.e. 
1080p uploads) [11:50:31] ok [11:51:10] no h.264 after all? [11:52:50] might be added at a later point [11:52:58] code is there, not enabled right now [11:54:01] nod [11:55:08] so with tmh out it's down to a legal/political decision to switch it on [11:59:02] New patchset: Faidon; "Repurpose ms[1-3]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33063 [11:59:03] great [11:59:33] apergos: all new swift boxes should run 1.7.4 [11:59:43] ok [11:59:51] apergos: be careful not to create ring files on precise boxes until we switch all of them to precise [12:00:02] I assume the back ends are still on 1.5? [12:00:08] no, ms-be6 is 1.7.4 now [12:00:16] all precise backends will be 1.7 [12:00:17] but that's the only one? [12:00:42] that's the only one I upgraded and we don't have ensure => latest, so yes [12:00:49] ok great [12:00:50] but I think it's also the only one in precise [12:00:55] and there's no 1.7 for lucid in our repo [12:01:03] uh huh [12:01:10] (and not planning to) [12:01:31] fix the ring files so we can be sure it replicates properly across versions [12:02:27] 1.7.5 says "expected in 11 hours", heh [12:02:53] has a few fixes that we expect [12:05:01] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33063 [12:11:47] New patchset: Faidon; "Kill last references to Solaris" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33065 [12:12:39] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [12:12:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:12:40] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [12:12:42] New patchset: Faidon; "Cleanup the base class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33066 [12:12:54] mark: I think you'll enjoy this [12:27:00] ? 
[12:27:00] ah yes [12:27:43] the last two [12:33:11] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33065 [12:35:16] New patchset: Faidon; "Add ms1-3 to autoinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33067 [12:35:43] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33067 [12:37:01] yay, I broke stuff [12:39:31] New patchset: Faidon; "Fix ntp template brekage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33068 [12:39:43] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33068 [12:55:04] mark: any insight on why ms1 won't PXE even when I get confirmation that network boot will be attempted? [12:55:09] I don't get a DHCP request at all [12:57:20] no [12:57:29] should work [12:57:33] it's on the internal vlan isn't it? [13:05:41] hmm [13:05:45] where shall we run the ceph monitors on [13:08:47] lldpctl is unhelpful [13:08:51] which switch should I look into? [13:08:59] they're on csw1-sdtpa I'm pretty sure [13:09:00] maybe bonding affects it [13:09:04] oh right [13:09:05] yes that is it [13:09:10] sorry, should've thought of that [13:09:17] there's an option to make that work [13:09:26] force-up [13:09:30] you'll need to disable the non-eth0 ports and disable the lag [13:09:33] no there isn't [13:10:12] old junos? [13:10:30] I'm sure force-up works, I've used it before :) [13:10:49] if only it were junos eh [13:10:55] aaaaaaw crap [13:10:58] hahaha [13:11:12] sure, on our juniper switches force-up works great [13:15:13] have the patience to guide me through the foundrys? 
[13:15:23] or slap me with a fine manual, since I can't find one on wikitech [13:16:17] oh there's a PDF on fenari [13:19:10] sure [13:19:14] lag ms1 [13:19:26] disable e for all of eth1+ [13:19:31] no deploy [13:19:33] then install [13:19:36] then deploy again [13:19:38] and reenable ports [13:20:20] ms1 dynamic Y 12 15/2 ethe 15/2 [13:20:22] some info here: http://wikitech.wikimedia.org/view/Link_aggregation#Foundry [13:20:25] that's not very useful, is it [13:20:33] what isn't? [13:21:05] isn't that a single port? [13:21:36] seems like it [13:21:40] that's not very useful no [13:21:58] might as well undeploy that lag [13:23:18] looks a lot like ios [13:23:22] (first time in a foundry) [13:24:44] yes [13:24:48] it's like a bad ios clone [13:25:29] hmz so [13:25:44] ceph docs recommend using SSDs for OSD journals and for monitors [13:25:49] yeah [13:25:55] but i don't think we should be using the SSDs currently used for swift [13:26:21] those are not guaranteed for stable data, i.e. fsync() does not really mean it survives a power failure [13:29:46] what do you mean? what's the problem with the SSDs currently used for swift? [13:30:20] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [13:30:27] effectively they use write caching [13:30:30] despite fsync() [13:30:41] so data consistency is not guaranteed [13:30:55] that's fine for squid/varnish, according to ben it was also fine for swift (not sure that's true) [13:30:59] but it sure doesn't seem fine for ceph [13:31:23] you mean caching within the SSD firmware? [13:31:33] yes, inside the controller [13:32:02] not the system's controller but the SSD's controller? [13:32:12] first time I'm hearing this [13:32:57] something that hdparm -W 0 doesn't fix? 
[13:33:04] no [13:33:24] (installer worked after 'no lag ms1', yay) [13:33:51] http://www.evanjones.ca/intel-ssd-durability.html [13:34:16] so we use the X25m and the intel 320, which is an evolution of that [13:34:46] as far as I understand the 320 has more/larger capacitors which help with a bit of power backup on power failures, but that doesn't sound like it really fully fixes the problem [13:34:54] so although those may be better, I still don't fully trust them for this use [13:35:02] we better use the Intel 720 SSDs for that purpose [13:35:39] wow [13:36:02] nasty [13:37:02] ok, installer is running [13:37:11] going to grab a quick lunch [13:53:26] RECOVERY - mysqld processes on es4 is OK: PROCS OK: 1 process with command name mysqld [13:56:35] back [13:59:06] Two file systems are assigned the same mount point (/): RAID1 device #0 and SCSI29 (0,0,0), partition #1 (sdac). [14:00:34] where? [14:00:54] I'm reformatting ms1 [14:04:59] I see [14:13:05] New patchset: Pyoungmeister; "re-adding es4 to db.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33073 [14:14:20] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33073 [14:14:53] !log py synchronized wmf-config/db.php 're-adding es4' [14:15:01] Logged the message, Master [14:22:06] New patchset: Faidon; "partman: remove mdadm/boot_degraded from recipes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33074 [14:22:06] New patchset: Faidon; "partman: new recipe for thumpers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33075 [14:23:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33074 [14:23:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33075 [14:59:12] !log reedy synchronized php-1.21wmf3/extensions/CentralNotice/ [14:59:18] Logged the message, Master [15:18:28] yep, grub-install fails [15:18:39] (among 
other d-i fails that I've solved via the magic of dd) [15:19:04] 'grub-install /dev/sda' failed [15:27:25] !log reedy synchronized php-1.21wmf3/includes/api/ApiEditPage.php [15:27:31] Logged the message, Master [15:28:04] !log reedy synchronized php-1.21wmf3/extensions/EducationProgram/ [15:28:10] Logged the message, Master [15:29:04] Heyaaaa, RobH, are you working today? [15:30:36] paravoid: ms3 and up will fare better [15:34:17] apergos: you're done with ms-be3001 right? [15:34:28] yes, sorry I didn't put it back to some pristine state [15:34:33] also the drac on that one has the normal password now [15:34:39] good [15:34:47] thanks for the loan [15:51:25] formatting ms2 now [15:51:40] starting on ms-be3003 now [15:53:11] esams servers don't have ssds btw [15:53:22] ms3's mgmt doesn't work [15:53:25] and they probably don't have the drive cages for the SSDs either [15:53:27] grr [15:53:41] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [15:55:52] $ telnet ms3.mgmt.pmtpa.wmnet 22 [15:55:52] Trying 10.1.7.3... [15:55:52] Connected to ms3.mgmt.pmtpa.wmnet. [15:55:52] Escape character is '^]'. [15:55:55]