[00:00:06] is it a meeting Monday? [00:00:13] it is. [00:01:10] But neither task should take all day :) [00:01:11] hrmm, that means higher activity when everybody is on and links to tickets [00:01:38] yea, well, what the hack, let's do it [00:01:51] tell people during the meeting we're going to do it $now [00:01:51] I'm on RT duty next week, so if we break RT it'll be win/win [00:01:57] fair:)) [00:02:33] wait, are we also moving to other server? [00:02:51] and if yea, do we know which one to use [00:02:57] That… was not part of my plan. [00:03:00] is RT still in tampa? [00:03:07] yea [00:03:10] Hm. [00:03:19] and if it's puppetized... [00:03:24] using a clean host .. [00:03:24] Is there any reason why we'd want to do both things at once? [00:03:33] Hm, true. [00:03:44] I don't know anything about how to set up the DB for that though. [00:04:00] shrug, it's not absolutely necessary to combine them, but usually we are .. like .. don't want to move unpuppetized stuff to eqiad, but once it's done want a clean host [00:04:08] to gurantee puppetizing worked as a bonus [00:04:43] That's reasonable. Where is the RT db hosted? How would we migrate that part? [00:04:43] db shouldn't be much worry, it used to be db9 [00:04:51] and we should be able to simply switch that over [00:04:57] because it already replicates to another [00:05:05] like Robh did with racktables [00:05:23] Ah, so it's replicated to eqiad already? That's handy. [00:05:44] db1001 [00:05:49] yea [00:05:59] So we need a server for it to run on. [00:06:24] It has its own private box atm, right? [00:06:45] we could use zirconium [00:06:53] unless we don't want to mix it with other public services [00:07:14] Currently it uses lighttpd instead of apache, which might not play well with other services. [00:07:25] urgh [00:07:27] no, can't call it private [00:07:27] Although /probably/ it's fine to have both on one box. [00:07:34] can we not put lighttpd in eqiad [00:07:43] why not? [00:07:51] andrewbogott: so in previous ops meetings we discussed killing lighttpd and moving to just apache and nginx [00:08:00] cuz nginx is lightweight [00:08:02] I don't know that there's any real reason why it uses lighttpd, that's just how it's always been. [00:08:07] (is my understanding) [00:08:20] mostly cuz some folks think apache is overkill. [00:08:20] andrewbogott: it's on streber [00:08:25] RobH, what role would nginx play in that case? [00:08:29] hhhhmmm, nginx doesn't have the word "light" in it, so I don't have any proof that it's lightweight [00:08:34] andrewbogott: i assume as the webserver [00:08:43] notpeter: ok, well, im quoting meeting heresay [00:09:05] OK… I don't object to doing that but I also wouldn't know how to do it. [00:09:05] if there is no reason that RT needs lighttpd, i would say migrate it to apache [00:09:08] and I'm spouting nonsense ;) [00:09:10] you want a _light_ webserver? check out "fnord" http://www.fefe.de/fnord/others.html [00:09:18] Changing the current puppet setup to use apache I could probably figure out... [00:09:20] i personally prefer we use apache. [00:09:26] since we use it on other stuff [00:09:30] Yeah. [00:09:33] i would prefer it all be standardized [00:09:36] meh, same with nginx [00:09:51] currently we only use nginx with the ssl stuff i thought? 
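A rough sketch of the pre-flight checks this plan implies: confirming what currently serves RT, and that its database really is replicating to eqiad. The endpoint, the db1001 FQDN and passwordless client access are assumptions for illustration, not confirmed details from the conversation:

    # which webserver answers for RT today? (expect lighttpd per the discussion)
    curl -sI https://rt.wikimedia.org/ | grep -i '^Server:'

    # is the RT database replicating db9 -> db1001 and keeping up?
    mysql -h db1001.eqiad.wmnet -e 'SHOW SLAVE STATUS\G' \
        | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'

If both look healthy, the move is mostly repointing RT's database host and bringing up the new web frontend, as discussed below.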
[00:10:04] still just handles http(s) requests [00:10:23] yea, so we have the majority of all minor services, and non ssl primary service as apache2 [00:10:29] https service as nginx [00:10:29] so, mutante, if we're going to migrate too then it probably makes sense to switch to apache. I'll mess with the puppet manifests tomorrow and see if that is hard or easy. [00:10:35] and like 3 misc tampa items as lighttpd [00:10:46] so if those 3 misc tampa items became apache, it would be nice. [00:10:51] mailman is also lighttpd [00:10:54] But it sounds like we'll be bringing up a new RT and pointing it at the old db, which is a slightly different problem than then one I was expecting (and maybe slightly easier) [00:11:17] but i dont disaggree with Robh.. if there is no real reason .. don't have to make it complicated by using multiple [00:11:32] what about jetty? it uses the apache license. so it's kinda like apache [00:11:39] notpeter: .... are you trolling? [00:11:53] i cannot tell when remote sometimes ;p [00:11:59] mutante, so, maybe not Monday in this case. But I will research tomorrow and see how things look. [00:12:03] Right now… dinnertime. [00:12:10] RobH: completely [00:12:14] andrewbogott: Tomorrow is the day after today [00:12:15] New patchset: Pgehres; "Actually enabling" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63094 [00:12:17] It also never dies [00:12:25] andrewbogott: sounds all good to me.. i mean, we don't _have_ to combine this, either way.. and thanks for the puppetization:) [00:12:30] Reedy: Correct on both counts. [00:12:51] New review: Pgehres; "Once more with feeling" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/63094 [00:12:51] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63094 [00:12:52] notpeter: cuz if you werent, i was gonna copy that to mark_ and then let you and him argue which is better, jetty or lighttpd [00:13:49] andrewbogott: but we should either do both, or JUST the upgrade first, just not the other way around and copy to eqiad before it's puppetized [00:13:52] to be honest, I think we should get out of http all together [00:14:00] get with some newer standards [00:14:08] gopher? [00:14:32] we should have everyone just use xmpp to talk to the wikipedia elizabot [00:14:45] !log pgehres synchronized wmf-config 'Actually enabling wgCentralAuthAutoMigrate' [00:14:53] Logged the message, Master [00:14:57] root@rt-testing13:~# finger andrew [00:14:58] Login: andrew Name: Andrew Bogott [00:14:59] hehe [00:15:48] notpeter: elizabot!! reminds me .. http://flooterbuck.sourceforge.net/ is fun [00:16:07] infobot rewrite with SQL backend.heh [00:16:26] New patchset: Yurik; "Removed X-Carrier and testing IP ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62867 [00:16:44] ori-l: now back to the bugzilla change [00:17:38] mutante: if you merge the patch to modifications repo, i'd like to do another dry run in labs, setting it up from scratch, optimally without any manual intervention this time [00:17:47] should be fairly quick [00:19:22] ori-l: go ahead, i merged it but didnt deploy yet [00:19:32] confirming again there is no diff left [00:19:38] cool, thanks [00:22:29] all equal minus one tab/space thing to be ignored [00:22:40] brb [00:26:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [00:40:41] ori-l: you know what would also be interesting.. some time .. 
http://bugzilla.wikimedia.org/bzapi doesn't exist in boogs.wmflabs.org yet, and it doesnt work in prod while it's there [00:41:13] "while it's there" -- while what's there? [00:41:27] i didn't understand the last part [00:41:27] andre__: did you see boogs.wmflabs.org yet?:) [00:41:42] he has, i emailed him about it; patience is not my forte [00:41:44] :) [00:42:22] ori-l: so there is that bzapi thing, somebody tried to set that up looooong time ago but it's broken [00:42:29] it's not the regular API [00:42:41] awjr: asked about it recently [00:42:49] there are some remnants on wikitech somewhere [00:43:02] so i checked if it's still on the server and it is, but it doesnt work, it times out [00:43:02] * awjr waves [00:43:27] there is an xmlrpc api that works though [00:43:37] bummer the bzapi restful api doesn't work [00:43:47] but i imagine that will take some wrestling with to make go [00:44:13] https://wikitech.wikimedia.org/wiki/Bugzilla_REST_API [00:44:26] https://wikitech.wikimedia.org/w/index.php?title=Bugzilla_REST_API&action=history [00:45:10] [00:45:15] FastCgiServer /srv/org/wikimedia/bzapi/script/bugzilla_api_fastcgi.pl -processes 3 -idle-timeout 180 [00:45:23] Alias /bzapi /srv/org/wikimedia/bzapi/script/bugzilla_api_fastcgi.pl/ [00:45:39] bugzilla_api.conf bugzilla_api.conf~ bugzilla_api.conf.sample INSTALL lib Makefile.PL root script t TODO [00:46:04] wc -l TODO = 74 :) [00:46:08] It'd be nice if things just worked out of the box ;) [00:46:14] that looks nice to have, i'll check it out [00:46:24] just about to commit a couple of small tweaks to the manifest, but it worked this time [00:47:07] Reedy: the fact that bugzilla needs a separate piece of software to act as a restful api that proxies to bugzilla is… [00:47:43] ohai mediawiki, you've got an api? [00:48:12] awjr: Though, it's not like bugzilla is known for being a highly useable piece of software [00:48:24] heh true story [00:48:32] oh, API is great to find outdated versions via spider :p [00:50:28] ori-l: awjr : this is the TODO file left in the bzapi dir [00:50:35] https://bugzilla.wikimedia.org/TODO [00:51:02] o_O [00:51:13] New patchset: Ori.livneh; "Puppetize Bugzilla" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [00:51:53] New review: Ori.livneh; "PS7 splits up the two invocations of checksetup.pl into two separate exec resources, as opposed to a..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [00:53:07] ori-l: i'll just deploy 62837 [00:53:39] deploy it how? is the git repo checked out on top of the bugzilla source tree? [00:54:19] i'll just assume you have a big red deploy button and not ask what it does [00:54:23] no, it's the git repo checked out in /root/bzmod and then i sync it locally [00:54:35] ah, right [00:56:56] btw, https://gerrit.wikimedia.org/r/#/c/62837/ is inert by itself. 
some manifest has to actually include it for it to do anything, but i wasn't sure where to do that for labs [00:57:12] in this case, sync is simple "cp", replacing diff with cp for those 3 files and nothing else, done [00:57:35] on boogs.wmflabs i just added this file: http://dpaste.de/mbd44/ and imported it from manifests/site.pp [00:58:20] actually it seemed ok to me to manually deploy it, i kind of like the sanity-check and separate deploy from merge [00:58:36] people merging in repos are not always sure if that actually deploys or not [00:58:45] and they might not be able to fix it and need root [00:58:50] if something goes wrong [00:59:14] well, puppet will fail, but i see your point [00:59:22] not like we do that with mediawiki either [01:00:00] so: keep the git::clone to /srv/bugzilla/modifications, but don't apply it? [01:01:14] hmm, i guess my opinion depends based on how we handle the +2 permissions on the repo [01:01:45] people doing +2 should preferably have server access and check when actually merging it [01:02:08] then the rest could be auto [01:02:37] but if we want to have more people with +2 that would also be fine, but then we should have deployment as a separate step [01:02:45] you'd have to ask Reedy & platformers, but afaik that's already the case in practice if not in actual ACLs [01:02:59] people wait around for someone with root to merge for exactly that reason [01:03:34] as long as we don't have that situatuon where stuff is merged and you think it's deployed, but then it's not [01:03:39] so my hunch is that no one who would lose +2 by this change would mind it, since s/he isn't using it anyway [01:04:01] makes sense [01:04:42] let me just get you that tarball :) [01:04:50] stripping password and attachments [01:05:56] i don't think i need it anymore, do i? [01:06:17] oh.. depends [01:06:22] i was just baffled by how url_quote was working in production. we know the answer: it doesn't [01:06:26] if those 3 files were the only reason, then no [01:06:46] yeah, i closed the ticket [01:06:47] unless you wanted to systematically check for any other diffs [01:06:54] ok [01:07:12] oh, well -- sure, yeah -- i thought you have, but if you think there may be other ones lurking then that'd be worth doing [01:07:47] hold on, i'll do it on server ..not that many files [01:07:55] that are actually in the bzmod repo [01:08:11] looks like the people who +2 changes in that repo are you and ^demon, so i guess ask him re: the +2 rights [01:08:36] or just give him and Reedy sudo on that machine, it's a bit silly that they don't have it [01:13:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [01:13:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [01:14:30] there will be a discussion about which commands are really needed and to make sudo rules that are specific vs. using ALL [01:14:35] i found more diff :p [01:14:44] haha [01:15:05] it's harmless and not much though.. making a change [01:16:14] i think we should host a git mirror of the bazaar repo on gerrit and maintain a wmf branch with our patches [01:17:20] that way we just clone the gerrit repo into place and that's it, rather than do patch management across two SCMs and several production systems [01:19:05] unfortunately the tooling for git<->bazaar is not as good as git-svn, but in a pinch i think it'd be ok to just commit the contents of the release tarball and not have a history, but i dunno. 
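The merge-then-deploy-by-hand flow described here (git repo checked out in /root/bzmod, synced onto the live tree with cp) would look roughly like the following sketch; the specific file is only an example, in practice you copy whatever the diff reports:

    # update the local checkout of the modifications repo
    cd /root/bzmod/modifications && git pull

    # list files that differ between the repo and the live bugzilla tree
    diff -qru /root/bzmod/modifications/bugzilla-4.2/ /srv/org/wikimedia/bugzilla/ \
        | grep -v "Only in"

    # copy each differing file from the repo into place (example file shown)
    cp /root/bzmod/modifications/bugzilla-4.2/template/en/custom/global/footer.html.tmpl \
       /srv/org/wikimedia/bugzilla/template/en/custom/global/footer.html.tmpl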
[01:23:21] diff -qru /root/bzmod/modifications/bugzilla-4.2/ /srv/org/wikimedia/bugzilla/ | grep -v "Only in" [01:23:24] = https://gerrit.wikimedia.org/r/#/c/63103/ [01:24:23] ? [01:39:53] i have no idea [01:40:18] the mediawiki bugzilla extension turns [[wikilinks]] into Special:Search URLs, but I don't know how the search box enters into it [01:43:22] ori-l: It's a hack. [01:43:43] to make an endpoint that can take ?title=Special:Search and returns search results? [01:43:52] so that Bugzilla can dress up as MediaWiki? [01:43:56] nice license :) "# its stolen from somewhere but was mostly re-written by Dirk Mueller " [01:44:00] So that more links work. [01:44:08] I guess interwiki links. [01:44:10] And probably some others. [01:44:18] https://en.wikipedia.org/wiki/google:foo [01:44:52] https://en.wikipedia.org/w/index.php?title=Special:Search&search=google:foo [01:44:54] well, but for that to be the case you merely need bugzilla to tolerate the presence of the title param if it happens to be present; you don't need to actually need to embed it into search queries generated on bugzilla itself [01:45:12] I mean for [[foo]]. [01:45:21] [[google:foo]] didn't used to work. [01:45:28] I think that's what you're talking about, at least. [01:45:32] I'm a little lost this evening. [01:46:38] I understand how this is a part of a broader attempt to make MediaWiki and Bugzilla links interchangeable, but not why it has to be added by Bugzilla. If you use the search box on Bugzilla to search for something, and then strip the 'title=Special%3ASearch' fragment from the result page, you get the same page back [01:47:16] maybe it's so that there's a familiar, canonical appearance to bugzilla search links [01:47:45] but I don't think it -- by which I mean the hack to add it to the quick search box on bugzilla itself -- serves a technical purpose. [01:48:17] Oh, the search box. [01:48:23] Yeah, I don't know anything about that. [01:48:29] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62642 [01:48:57] there's some stuff on http://www.mediawiki.org/wiki/Talk:Bugzilla but it's hard to extract the content from nemo and timeshifter growling at each other [01:49:58] " Rather than delete my accurate info, only remove the incorrect parts, if any. And only after you are sure." Response: "I've no idea what you're talking about. Your information was thouroughly wrong, mine is correct [...] Of course I know about Special:Search, I already had to correct a mistake you introduced when you didn't know it." Response: "If you don't know what I am talking about, then that shows you are ignorant.", et [01:49:58] c. [01:50:10] please leave some comments on the patch set, ori and Susan :) [01:50:19] Which patchset? [01:50:27] https://gerrit.wikimedia.org/r/#/c/63103/1 [01:50:35] https://gerrit.wikimedia.org/r/#/c/63103/1/bugzilla-4.2/template/en/custom/global/footer.html.tmpl [01:50:39] Ah. [01:51:17] merged one more RT thingie and about to leave for today [01:51:22] Shouldn't that be &? [01:51:56] Susan: that can indeed be a question but be aware my change in the current form is simply showing the diff between actual prod. and the repo [01:52:21] Oh. [01:52:24] so what it suggests is already the case unless we remove it [01:52:25] All right. 
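ori-l's claim above, that the title=Special:Search part of those URLs is cosmetic rather than functional, can be spot-checked with the URLs quoted in the discussion. A small sketch only; the two responses should carry the same search results, though incidental markup such as cache timestamps may differ, so this is an eyeball comparison rather than a strict test:

    curl -s 'https://en.wikipedia.org/w/index.php?title=Special:Search&search=google:foo' > with_title.html
    curl -s 'https://en.wikipedia.org/w/index.php?search=google:foo' > without_title.html
    diff with_title.html without_title.html | head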
[01:52:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [01:52:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:52:40] Well, I think we should fix both versions to be &. [01:54:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [01:54:26] mutante: I'm ok with +2ing it [01:56:30] alright, since i got a +1 from andre__ and ori-l , doing that now, it should just reflect what is live, can always change later [01:56:53] +1 [01:57:10] Susan, and yeah, escaping would be more correct [01:59:27] mutante: can you merge the puppet patch, too? the class is not applied to any host so it won't affect any production machines, and that way i can follow up with another patch adding a labs role without a rebase party every time something changes [02:01:39] andre__: this is done. now there is really no diff anymore between files that are in bugzilla-4.2 in the repo and stuff on the server [02:01:48] oh lovely [02:01:49] diff -qru /root/bzmod/modifications/bugzilla-4.2/ /srv/org/wikimedia/bugzilla/ | grep -v "Only in" ... nothing [02:02:09] of course i am excluding any file that is not in the repo [02:02:19] with the "Only in" part [02:02:51] mutante: can you 'find /srv/org/wikimedia/bugzilla/' > pastebin? [02:05:20] oh, you were about to leave, i just noticed that -- don't worry about any of this then [02:05:33] ori-l: i'll get you the file list, but not merge the module now...deal? [02:05:57] deal [02:06:42] New review: Tim Starling; "@Ariel: if Wikipedia is down for any length of time, they can turn on their TV and hear about it on ..." [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/61950 [02:09:15] ori-l: cat ~/bugzilla-files [02:09:18] that was easiest [02:09:25] on fenari that is [02:09:42] note /srv/org/wikimedia/bugzilla/Bugzilla/.svn [02:09:43] heh [02:10:04] thanks, i'll diff that against the contents of the same dir on boogs.wmflabs.org [02:10:12] cool, thanks and cu later then [02:10:16] have a good night [02:10:22] you too..out [02:15:49] !log LocalisationUpdate completed (1.22wmf3) at Fri May 10 02:15:49 UTC 2013 [02:15:58] Logged the message, Master [02:27:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time [02:46:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [02:56:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:58:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [03:32:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:33:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [03:37:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 10 03:37:41 UTC 2013 [03:37:50] Logged the message, Master [04:04:31] apergos: swift? 
:) [04:18:34] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [04:24:18] when I looked last night ms-be1 was nowhere near done [04:24:27] I have not looked yet this morning, I just sat down [04:25:17] yep still a long ways to go [04:29:05] good morning :-) [04:29:12] why do you say so? [04:29:15] I think it's done [04:32:46] I look at the objct replication percentage since you pushed the rings [04:32:59] it's only to 34% [04:33:05] where do you see that? [04:33:13] I checked the syslogs from the time of the push on [04:34:32] what specifically? [04:34:52] we are now at this: [04:34:54] May 10 04:20:03 ms-be1 object-replicator 6809/19842 (34.32%) partitions replicated in 338402.97s (0.02/sec, 179h remaining) [04:34:58] 34% [04:36:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [04:38:09] this is bs :) [04:38:37] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-be1.pmtpa.wmnet&m=cpu_report&r=4hr&s=by%20name&hc=4&mc=2&st=1368160689&g=network_report&z=large&c=Swift%20pmtpa [04:38:49] yes, I saw that. and yet it's still working it way through [04:39:10] there is zero traffic on the other boxes [04:39:50] May 10 01:37:38 ms-be2 object-replicator 14129/15207 (92.91%) partitions replicated in 19800.16s (0.71/sec, 25m remaining) [04:39:53] May 10 01:42:38 ms-be2 object-replicator 14680/15207 (96.53%) partitions replicated in 20100.16s (0.73/sec, 12m remaining) [04:39:56] May 10 01:47:38 ms-be2 object-replicator 15108/15207 (99.35%) partitions replicated in 20400.16s (0.74/sec, 2m remaining) [04:39:59] May 10 01:48:16 ms-be2 object-replicator 15207/15207 (100.00%) partitions replicated in 20438.25s (0.74/sec, 0s remaining) [04:40:01] I've already looked at all the other boxes [04:40:02] May 10 01:53:46 ms-be2 object-replicator 35/15182 (0.23%) partitions replicated in 300.00s (0.12/sec, 36h remaining) [04:40:05] May 10 01:58:46 ms-be2 object-replicator 55/15182 (0.36%) partitions replicated in 600.00s (0.09/sec, 45h remaining) [04:40:08] May 10 02:03:46 ms-be2 object-replicator 84/15182 (0.55%) partitions replicated in 900.00s (0.09/sec, 44h remaining) [04:40:12] that's ms-be2 [04:40:14] also see [04:40:17] May 10 04:23:46 ms-be2 object-replicator 2002/15182 (13.19%) partitions replicated in 9300.06s (0.22/sec, 17h remaining) [04:40:18] I'm well aware they have (pastebin. just pastebin it.) made multiple passes hrough [04:40:20] May 10 04:28:46 ms-be2 object-replicator 2640/15182 (17.39%) partitions replicated in 9600.06s (0.27/sec, 12h remaining) [04:40:23] May 10 04:33:46 ms-be2 object-replicator 3162/15182 (20.83%) partitions replicated in 9900.06s (0.32/sec, 10h remaining) [04:40:26] May 10 04:38:46 ms-be2 object-replicator 3868/15182 (25.48%) partitions replicated in 10200.06s (0.38/sec, 8h remaining) [04:40:31] http://xkcd.com/612/ [04:41:30] but ms-be1 has not made a single complete pass yet. so there we are. [04:45:46] it is still replicating things, as shown in the log. 
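The two things being compared in this exchange, the replicator's own progress lines in syslog versus the modification time of a given object on the destination disk, can each be checked directly. The paths below are the ones pasted in the conversation and serve only as examples:

    # replication progress as reported by the object-replicator on ms-be1
    grep 'object-replicator' /var/log/syslog | tail -n 5

    # on the destination host: when did this particular object actually land?
    ls -l /srv/swift-storage/sde1/objects/40071/97f/*/*.data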
[04:45:56] !log swift: pushing new rings, set weight 0 for ms-be5 sdb1, ms-be11 sdh1 (broken disks); balance 999.99, needs further rebalance [04:46:05] Logged the message, Master [04:46:05] May 10 04:45:00 ms-be1 object-replicator May 10 04:45:00 ms-be1 object-replicator Successful rsync of /srv/swift-storage/sdb4/objects/40071/97f at [10.0.6.213]::ob [04:46:05] ject/sde1/objects/40071 (55.200) [04:46:17] *sigh* [04:46:59] ? [04:47:16] that's from replication of an object a couple mins ago on ms-be1 [04:48:07] so? [04:48:18] so it was still moving data [04:48:53] and it still will [04:49:17] I'm not going to wait "179h" though [04:49:32] let's replace ms-be4 with an r720xd today [04:49:49] and ms-be1 on monday and ship them both back [04:50:28] at a certain point it speeds up (the last so many hours turn out to be only a few minutes) [04:51:21] you know we have three replicas of everything, right? :) [04:54:38] the fact that ms-be1 still rsyncs doesn't mean the data hasn't been copied from another replica [04:54:42] case in point, your rsync above [04:55:20] -rw------- 1 swift swift 89816 May 5 21:19 /srv/swift-storage/sde1/objects/40071/97f/9c87ca3b424d8d02fc9508570d55f97f/1367788760.42417.data [04:55:25] synced 5 days ago [04:55:49] I would hope that there are two other copies elsewhere [04:56:07] no, there are three copies of that file since may 5th [04:58:18] is this something we can guarantee? that everything on ms-be1 already has three copies on it elsewhere and it's just wasting time? because I don't know a way to verify that [04:58:32] of course now it's not worth discussing, since new rings went out [04:58:37] but theoretically [04:58:44] anyways, whatever [04:59:17] what do new rings have to do with that? [04:59:31] ms-be1 is unchanged in those new rings [04:59:36] it was 0 before and it's 0 now [04:59:41] I mean, the decision has already been made [04:59:46] so it's not worth discussing [04:59:55] it'll keep doing what it was doing [05:00:06] running rsync through its files [05:00:37] and sure, we can wait a few more days if that's what you wish, that's why I suggested monday [05:01:12] well you want ms-be4 pulled and replaced (today), I assume that means new rings [05:01:22] no [05:01:41] just leave it at 100% to sync [05:01:46] it will take a hit on our 404s [05:02:04] but I don't think it'll be a problem in practice [05:03:01] the only worrying part is having one replica in each of the two failed disks plus another one on ms-be4 [05:03:09] th disk layout is different [05:03:16] which is why it's ridiculous we have two failed disks for weeks without having them at weight 0 or replaced [05:03:32] oh, that's right [05:04:07] two disks change, right? [05:04:11] yep [05:04:20] ok [05:04:24] we can do that on monday then [05:04:28] replace both ms-be1 & 4 [05:05:00] damn, I could have removed those two now [05:05:06] had [05:06:51] did the new rings not actually go around yet? [05:07:21] I still see the old ones on ms-be1 for example [05:07:48] haven't forced run puppet yet [05:07:53] ah [05:09:34] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [05:13:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:35] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=network_report&s=descending&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [05:13:39] see? 
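For context, the ring change !logged above is normally made with swift-ring-builder against the builder files, with the resulting ring files then pushed out via puppet (hence the "haven't forced run puppet yet" above). A sketch of the equivalent commands; the device IDs are made up and would come from listing the builder first:

    # list devices to find the IDs for ms-be5 sdb1 and ms-be11 sdh1
    swift-ring-builder object.builder

    # drop the two broken disks to weight 0 (d12 and d47 are hypothetical IDs)
    swift-ring-builder object.builder set_weight d12 0
    swift-ring-builder object.builder set_weight d47 0

    # recompute partition placement and write the new ring file
    swift-ring-builder object.builder rebalance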
[05:13:41] now ms-be1 copies too [05:13:57] since there's a new destination that doesn't already have the objects [05:14:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [05:15:03] it was already copying some objects, but yes it is doing more work now [05:15:29] no it wasn't [05:16:25] um, I pasted rsync output in the channel [05:19:01] and i pasted you the mtime of that object in the target [05:19:27] which was may 5th :) [05:22:13] so, object A has a set of say (ms-be1, ms-be2, ms-be3) [05:22:18] we push new rings with ms-be1 at weight 0 [05:22:28] the set now becomes (ms-be2, ms-be3, ms-be12) [05:22:39] all of ms-be1/2/3 will try to rsync the file to ms-be12 [05:23:03] if ms-be2 is done first, then ms-be1 will just rsync over it and transfer nothing [05:25:25] and since the change was to set ms-be1's weight to 0, this means that we had sets with all the other boxes in the cluster [05:25:29] who raced ms-be1 [05:25:43] but they were many more, and two of them had each object [05:25:51] which is why ms-be1 is left behind on the rsync I guess [05:25:58] but copies nothing :) [05:26:13] now that I rebalanced the rings, I started the race for some partitions from the start [05:26:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:26:51] so the set now maybe (ms-be2, ms-be9, ms-be12) and ms-be1 has to race ms-be2/3 to copy the object to ms-be9 [05:27:00] which is why it picked up traffic again [05:27:04] but it'll still lose :) [05:27:15] (also, it will be a very short race, considering it's just two disks this time) [05:28:03] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=network_report&s=by+name&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [05:28:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.304 second response time [05:28:59] that's my understanding at least :) [05:29:44] if all holders of an object must rsync their copies to the new holder without checking if the new holder actually already has it already, that's pretty poor [05:30:00] it's just rsync [05:30:12] it won't actually copy [05:30:28] the "checking if it has it already" is happening by rsync essentially [05:31:05] the replicator doesn't do anything per object [05:31:11] it just fires up rsync per partition [05:31:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:31:50] rsync lines had a pile of RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.270 second response time [05:33:15] I would have expected it to actually only copy if the data wasn't already there [05:33:26] that's what rsync does [05:37:56] ori-l: indeed, that discussion must be ignored :) [05:38:06] that is what I would expect but the output indicates otherwise [05:39:11] ori-l: it was mainly trolling in revenge for some quarrel about some bug(s) [05:39:47] Nemo_bis: heh. I wouldn't have stared at it for longer than a millisecond except that it seemed like it could contain the answer to my question :P [05:45:59] New review: Nemo bis; "Well, *officially* status.wikimedia.org is designated place, except it gives no information; in theo..." 
[operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/61950 [05:49:23] ori-l: noo [05:52:34] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [05:52:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [05:52:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [07:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:07:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [07:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:27:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [07:28:34] New review: Hashar; "Nice catch, I guess that will fix a few extensions :-]" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/63080 [08:13:14] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 12.5168948438 (gt 8.0) [08:13:24] PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: CRITICAL: packet_loss_average is 12.1789862595 (gt 8.0) [08:13:34] PROBLEM - Packetloss_Average on analytics1006 is CRITICAL: CRITICAL: packet_loss_average is 13.0341796154 (gt 8.0) [08:14:34] PROBLEM - Packetloss_Average on analytics1004 is CRITICAL: CRITICAL: packet_loss_average is 12.9878309375 (gt 8.0) [08:17:14] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.663575583333 [08:17:24] RECOVERY - Packetloss_Average on analytics1003 is OK: OK: packet_loss_average is 1.12469108333 [08:17:34] PROBLEM - Packetloss_Average on gadolinium is CRITICAL: CRITICAL: packet_loss_average is 11.9558827273 (gt 8.0) [08:17:35] RECOVERY - Packetloss_Average on analytics1006 is OK: OK: packet_loss_average is 0.465591147541 [08:17:41] hello [08:18:07] and here is jenkins slow again :D [08:18:34] RECOVERY - Packetloss_Average on analytics1004 is OK: OK: packet_loss_average is 0.480869576271 [08:18:44] PROBLEM - Packetloss_Average on analytics1008 is CRITICAL: CRITICAL: packet_loss_average is 12.3684900769 (gt 8.0) [08:21:34] RECOVERY - Packetloss_Average on gadolinium is OK: OK: packet_loss_average is -0.0649045555556 [08:22:44] RECOVERY - Packetloss_Average on analytics1008 is OK: OK: packet_loss_average is 1.09076163793 [08:23:14] PROBLEM - Packetloss_Average on analytics1005 is CRITICAL: CRITICAL: packet_loss_average is 10.7205832331 (gt 8.0) [08:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:14] RECOVERY - Packetloss_Average on analytics1005 is OK: OK: packet_loss_average is 0.629512118644 [08:28:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [08:33:33] !log Jenkins hit by the evil bug again {{bug|48025}}. 
Taking as much trace as possible before restarting it [08:33:42] Logged the message, Master [08:55:33] New patchset: Nikerabbit; "ULS config for deployment phase 1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63113 [08:58:50] New review: Siebrand; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63113 [09:01:28] New review: Nemo bis; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63113 [09:17:30] !log gallium: dumping java heap of Jenkins using jmap -dump:format=b,file=/root/jenkins-bug48025.jmap -F 24508 [09:17:39] Logged the message, Master [09:25:05] !log restarting jenkins [09:25:14] Logged the message, Master [09:26:33] !log killed -9 jenkins :( [09:26:41] Logged the message, Master [09:42:55] !log restarted jenkins twice to get it to finally serve something :( [09:43:03] Logged the message, Master [09:53:56] I'm running a batch delete on ms-fe1 [09:54:07] it's going to take a while [09:54:15] it takes a hit on responsiveness but things to be stable so far [10:00:11] New review: Nikerabbit; "Marking -2 to avoid accidental merges before scheduled time." [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/63113 [10:00:50] hashar: it seems you are forming love-hate relationship with jenkins [10:02:24] Nikerabbit: indeed [10:02:30] Nikerabbit: it most probably has a memory leak :( [10:09:11] !log jenkins: upgrading plugins and restarting. [10:09:19] Logged the message, Master [10:27:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [10:32:09] hashar: it lost memory of the good moments between you? [10:32:46] na I got it upgraded [10:32:52] and something must be incompatible somewhere [10:43:43] New patchset: Hydriz; "(bug 41757) Enable special:import on Hindi Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31823 [10:47:32] ahh [10:47:37] I might have find out the actual root cause [10:47:39] yum [11:14:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [11:14:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [11:27:03] !log Moved ssl3001-3004 from csw1-esams to csw2-esams [11:27:12] Logged the message, Master [11:36:33] ah, is that why WP is down for me? [11:37:14] i doubt it [11:41:43] indeed, HTTP is equally down [11:43:13] WORKSFORME [11:44:01] !log Disabled Puppet and stopped PyBal on amslvs2 [11:44:08] !log restarting jenkins [11:44:09] Logged the message, Master [11:44:17] Logged the message, Master [11:45:54] PROBLEM - Host amslvs2 is DOWN: PING CRITICAL - Packet loss = 100% [11:47:54] RECOVERY - Host amslvs2 is UP: PING OK - Packet loss = 0%, RTA = 96.54 ms [11:49:04] !log Reenabled PyBal on amslvs2 [11:49:12] Logged the message, Master [11:50:19] !log Stopped PyBal on amslvs1 [11:50:26] Logged the message, Master [11:51:53] !log Started PyBal on amslvs1 [11:52:01] !log Moved amslvs1 and 2 from csw1-esams to csw2-esams [11:52:01] Logged the message, Master [11:52:09] Logged the message, Master [11:53:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [11:59:29] now ok again [12:08:40] !log Depooled knsq23-knsq30 text squid frontends [12:08:49] Logged the message, Master [12:24:29] !log jenkins: cleaning up all occurrences of an old plugin in build histories. 
The script run in a tmux on gallium (see https://bugzilla.wikimedia.org/show_bug.cgi?id=48025#c19 ) [12:24:37] Logged the message, Master [12:27:33] !log Removed knsq23-knsq30 from the text squid config [12:27:42] Logged the message, Master [12:47:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [13:20:09] !log Migrated toolserver routing to cr1-esams and cr2-esams via csw2-esams access [13:20:18] Logged the message, Master [13:32:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [13:59:12] I'm having trouble writing any data to /home on stat1.wikimedia.org. I get "No space left on device." but there's plenty of space. Can anyone help? [13:59:13] apergos, are you around? we are having a weird error on stat1 [13:59:22] Hey drdee :) [13:59:26] morning [13:59:27] υο [13:59:31] yo even [13:59:32] disk is not full [13:59:42] /home is 85% [14:00:08] lt me get over there and have a look [14:00:12] thanks! [14:00:19] ^ [14:00:53] it seems that another backup process is going crazy as well [14:01:21] what are yu trying to write and how? [14:02:00] i tried ' touch foo' in my home folder [14:02:23] halfak_ tried copying a 4mbfile [14:02:50] My last test was: echo "asdjsdnfjsdnjfsd" > foo [14:03:13] ok, that gives me enough to work with [14:03:14] same error (No space) [14:15:03] New patchset: Hashar; "contint: install colordiff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63130 [14:16:04] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [14:16:54] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:24] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:31] out of inodes [14:17:34] blah [14:17:34] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:35] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:35] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [14:17:35] PROBLEM - Host amssq54 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:35] PROBLEM - Host amssq55 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:35] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:36] PROBLEM - Host amslvs1 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:54] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:04] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:04] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:14] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 94.57 ms [14:18:24] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 95.28 ms [14:18:24] RECOVERY - Host amssq55 is UP: PING OK - Packet loss = 0%, RTA = 96.61 ms [14:18:24] RECOVERY - Host amssq54 is UP: PING OK - Packet loss = 0%, RTA = 96.50 ms [14:18:24] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 96.74 ms [14:18:34] PROBLEM - Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:09] drdee: so we can't change the max number of inodes dynamically [14:19:26] apergos: we are about to move /a/squid/archive to stat1002 so that would solve the problem [14:19:34] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [14:19:34] PROBLEM - Host amssq53 is DOWN: PING CRITICAL - 
Packet loss = 100% [14:19:45] is /a on home? no [14:19:53] no [14:20:01] can you see who the culprit is ? [14:20:14] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:14] PROBLEM - Host knsq29 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:14] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:14] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:34] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:35] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:35] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:44] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:54] PROBLEM - Backend Squid HTTP on amssq58 is CRITICAL: Connection timed out [14:20:54] PROBLEM - nginx HTTP on ms6 is CRITICAL: Connection timed out [14:20:54] PROBLEM - SSH on ms6 is CRITICAL: Connection timed out [14:20:54] PROBLEM - Host knsq20 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:54] PROBLEM - Host amssq60 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:55] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:55] PROBLEM - SSH on amssq58 is CRITICAL: Connection timed out [14:21:04] PROBLEM - Host amssq48 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:35] PROBLEM - Frontend Squid HTTP on amssq58 is CRITICAL: Connection timed out [14:21:48] I will guess /home/erosen/tmp/repos since it is taking forever to list, I made the mistake of not speficying nosort [14:22:04] lemme verify though [14:22:04] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 96.53 ms [14:22:07] let me check with him [14:22:14] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 0%, RTA = 96.31 ms [14:22:24] RECOVERY - Host knsq20 is UP: PING OK - Packet loss = 0%, RTA = 97.01 ms [14:22:24] RECOVERY - Host knsq16 is UP: PING OK - Packet loss = 0%, RTA = 96.05 ms [14:22:24] RECOVERY - Host knsq26 is UP: PING OK - Packet loss = 0%, RTA = 94.95 ms [14:22:24] RECOVERY - Host knsq28 is UP: PING OK - Packet loss = 0%, RTA = 95.07 ms [14:22:34] RECOVERY - Host knsq29 is UP: PING OK - Packet loss = 0%, RTA = 93.86 ms [14:22:58] trying a wc -l now to see how many entries are in there [14:23:13] only 81091 but it would be a start [14:23:34] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:24:04] ok [14:24:13] i've lost the entire history of the article by its undeleting, can anybody fix it by direct db edit, please? 
[14:24:17] stll there must be a real big hog to use so many, that's a small percentage [14:24:26] i just asked erosen to move something [14:25:01] halfak_: we know the cause, asked erosen to delete some of his smaller files [14:25:14] RECOVERY - Host amssq59 is UP: PING OK - Packet loss = 0%, RTA = 94.99 ms [14:25:14] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 94.03 ms [14:25:24] RECOVERY - Host amssq60 is UP: PING OK - Packet loss = 0%, RTA = 96.68 ms [14:25:24] RECOVERY - Host amssq53 is UP: PING OK - Packet loss = 0%, RTA = 96.85 ms [14:25:25] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 95.60 ms [14:25:25] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 96.33 ms [14:25:25] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 0%, RTA = 95.10 ms [14:25:25] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 95.06 ms [14:25:25] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 96.23 ms [14:25:25] RECOVERY - Host amssq48 is UP: PING OK - Packet loss = 0%, RTA = 95.11 ms [14:25:26] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 95.14 ms [14:25:26] Did we run out of inodes or something? [14:25:26] RECOVERY - Host amssq52 is UP: PING OK - Packet loss = 0%, RTA = 96.59 ms [14:25:27] RECOVERY - Frontend Squid HTTP on amssq58 is OK: HTTP OK: HTTP/1.0 200 OK - 567 bytes in 0.194 second response time [14:25:29] yes [14:25:32] yup [14:25:34] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 93.74 ms [14:25:43] I mentioned it above but obviously it scrolled off the screen [14:25:44] RECOVERY - nginx HTTP on ms6 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 0.190 second response time [14:25:44] RECOVERY - Backend Squid HTTP on amssq58 is OK: HTTP OK: HTTP/1.0 200 OK - 663 bytes in 0.193 second response time [14:25:44] RECOVERY - SSH on ms6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:25:54] RECOVERY - SSH on amssq58 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:25:54] RECOVERY - Host amssq49 is UP: PING OK - Packet loss = 0%, RTA = 95.15 ms [14:25:54] RECOVERY - Host maerlant is UP: PING OK - Packet loss = 0%, RTA = 96.62 ms [14:25:54] RECOVERY - Host amslvs1 is UP: PING OK - Packet loss = 0%, RTA = 96.50 ms [14:27:09] I"m trying a recursive listing on that directory (on /home/erosen/tmp/repo) to see if it's a lot more or not [14:30:06] he must have a job that runs that updates that continuously too [14:31:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [14:32:37] those 81000 things are all git repos [14:32:59] we might wait forever for the recursive ls to finish. can I ask you to follow up from here drdee ? [14:33:18] yes, that's fine [14:33:22] i am poking erosen [14:33:32] thank you for your help! [14:33:58] yw [14:41:10] apergos: Thanks! 
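For reference, the failure mode in this incident (writes failing with "No space left on device" while df shows free space) is confirmed and narrowed down with a couple of one-liners, roughly what was done above; the suspect path is the one from the conversation:

    # block usage looks fine, inode usage does not
    df -h /home
    df -i /home        # IUse% at 100% is what turns into 'No space left on device'

    # count entries under the suspected tree (no sorting, it can be huge)
    ls -f /home/erosen/tmp/repos | wc -l
    find /home/erosen/tmp/repos -xdev | wc -l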
[14:42:25] ah for the future, df -i will give inode info [14:42:34] hopefully there will not be a similar future [14:42:38] ;-) [14:42:50] thanks for the tip [14:43:16] sure thing [14:44:47] You think it wouldn't be that hard to actually give a useful error message [14:44:56] hah [14:45:05] Welcome to software engineering [14:45:36] well it returns ENOSPC [15:10:34] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [15:11:24] PROBLEM - Host amssq48 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:34] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:34] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:35] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [15:11:35] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:54] PROBLEM - Host knsq18 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:54] PROBLEM - Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:04] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:04] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:04] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:15] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:25] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:35] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:44] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:54] PROBLEM - Host amssq53 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:04] PROBLEM - Host knsq21 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:04] PROBLEM - Host amssq58 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:14] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:14] PROBLEM - Host knsq29 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:44] RECOVERY - Host amssq48 is UP: PING OK - Packet loss = 0%, RTA = 95.19 ms [15:13:44] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 96.23 ms [15:13:44] RECOVERY - Host knsq21 is UP: PING OK - Packet loss = 0%, RTA = 94.88 ms [15:13:44] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 94.69 ms [15:13:54] RECOVERY - Host amssq49 is UP: PING OK - Packet loss = 0%, RTA = 94.71 ms [15:13:54] RECOVERY - Host amssq52 is UP: PING OK - Packet loss = 0%, RTA = 96.29 ms [15:13:54] RECOVERY - Host amssq53 is UP: PING OK - Packet loss = 0%, RTA = 96.40 ms [15:13:54] RECOVERY - Host knsq18 is UP: PING OK - Packet loss = 0%, RTA = 96.34 ms [15:13:54] RECOVERY - Host ms6 is UP: PING OK - Packet loss = 0%, RTA = 94.85 ms [15:13:55] RECOVERY - Host knsq29 is UP: PING OK - Packet loss = 0%, RTA = 93.44 ms [15:13:55] RECOVERY - Host amssq58 is UP: PING OK - Packet loss = 0%, RTA = 94.96 ms [15:13:56] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 96.28 ms [15:13:56] RECOVERY - Host knsq27 is UP: PING OK - Packet loss = 0%, RTA = 96.39 ms [15:13:57] RECOVERY - Host maerlant is UP: PING OK - Packet loss = 0%, RTA = 96.55 ms [15:13:57] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 93.69 ms [15:13:58] RECOVERY - Host knsq28 is UP: PING OK - Packet loss = 0%, RTA = 94.73 ms [15:13:58] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 96.35 ms [15:13:59] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 0%, RTA = 96.43 ms [15:14:24] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 95.53 ms [15:53:34] 
PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [15:53:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [15:53:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [16:31:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:32:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [16:41:24] New patchset: Twotwotwo; "toy scripts playing with long-range compression" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/63139 [16:47:34] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [16:47:34] PROBLEM - Host amssq55 is DOWN: PING CRITICAL - Packet loss = 100% [16:47:44] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [16:48:14] RECOVERY - Host amssq59 is UP: PING OK - Packet loss = 0%, RTA = 94.72 ms [16:48:24] RECOVERY - Host amssq55 is UP: PING OK - Packet loss = 0%, RTA = 97.16 ms [16:48:24] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 94.97 ms [17:02:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [17:15:46] New patchset: CSteipp; "Enable Local XFF blocking" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63141 [17:18:46] !log jenkins: cleared all occurrences of an old plugin in build histories. {{bug|48025}}. Jenkins should be fine now. Will monitor tonight. [17:18:53] Logged the message, Master [17:22:36] New patchset: Diederik; "Non-performant hack: set threads to 1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63143 [17:39:55] * andre__ still doesn't get a clue about reports about timeouts of http requests to bits in Europe and sends an email to ops@ [17:41:33] !log authdnsupdate - removing racktables2 entry [17:41:41] Logged the message, RobH [17:55:33] New patchset: RobH; "removing old racktables stuff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63147 [17:59:33] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63147 [18:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:07:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [18:16:23] !log authdns-update to remove blog.wikimediafoundation as it wasnt setup and pointed to decom server [18:16:32] Logged the message, RobH [18:19:42] New review: Jdlrobson; "hoping there is no side effects but seems too trivial to bikeshed on" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/63081 [18:22:02] RobH: just a quick heads-up before I leave about https://bugzilla.wikimedia.org/show_bug.cgi?id=48257 | https://rt.wikimedia.org/Ticket/Display.html?id=5118 - in case you have any ideas, they are welcome. TIA... [18:22:29] !log powercycling srv291 [18:22:37] Logged the message, Master [18:23:17] andre__: hrmm, well, this would fall on the person on rt triage, but its after his hours [18:23:23] unfortunatley, he is also the person to look at this [18:23:30] or leslie [18:23:47] so its mark_ or leslie i think [18:23:47] nods [18:24:30] alright. 
So I'll cross fingers and hope that no angry mob with torches will mass-comment on that bug report over the weekend :) [18:25:34] RECOVERY - Host srv291 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [18:27:08] anybody feel like merging the new IPs list for Opera Mini? Some users are apparently getting a red banner without it [18:27:10] https://gerrit.wikimedia.org/r/#/c/63077/ [18:27:54] PROBLEM - Apache HTTP on srv291 is CRITICAL: Connection refused [18:29:06] dear ops, do we have any issues with IPv6 at this point? [18:29:26] will varnish properly recognize an IPv6 traffic? [18:29:37] and be able to do ACL ~ operator on it? [18:39:28] New review: Jdlrobson; "Actually just tested this - it looks really bad and is unacceptable. The whitespace has changed and ..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/63081 [18:41:02] New patchset: RobH; "barium reclaim" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63156 [18:42:43] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63156 [18:44:43] New patchset: RobH; "removed colby and barium from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63160 [18:48:34] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 2 processes with command name varnishncsa [18:49:38] !log killing and restarting nagios-nrpe-server on a bunch of cp10xx boxes to fix monitoring of varnishncsa [18:49:47] Logged the message, Master [18:51:12] Anybody willing to do a quick merge & push to bits for me? Fix for our Firefox OS app, serious localization issue. https://gerrit.wikimedia.org/r/#/c/63067/ thanks! [18:51:25] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 2 processes with command name varnishncsa [18:51:34] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 2 processes with command name varnishncsa [18:51:34] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [18:51:44] paravoid: you aren't still here are you? 
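On yurik's question above about IPv6 in varnish ACLs: one low-risk way to find out whether the VCL compiler accepts an IPv6 CIDR in an acl is to feed it a throwaway config and ask varnishd to compile it. This is a sketch only, using documentation prefixes rather than real Opera Mini ranges, and assuming the Varnish 3 syntax in use here:

    cat > /tmp/acl-test.vcl <<'EOF'
    backend default { .host = "127.0.0.1"; .port = "80"; }
    acl opera_mini {
        "198.51.100.0"/24;     # placeholder v4 range
        "2001:db8::"/32;       # placeholder v6 range, the thing being tested
    }
    sub vcl_recv {
        if (client.ip ~ opera_mini) { set req.http.X-Test = "1"; }
    }
    EOF
    varnishd -C -f /tmp/acl-test.vcl > /dev/null && echo "VCL compiles"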
[18:53:14] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 2 processes with command name varnishncsa [18:53:24] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 2 processes with command name varnishncsa [18:53:35] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:54:14] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 2 processes with command name varnishncsa [18:54:24] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 2 processes with command name varnishncsa [18:54:34] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 2 processes with command name varnishncsa [18:56:24] PROBLEM - Disk space on cp1030 is CRITICAL: Connection refused by host [18:56:40] PROBLEM - DPKG on cp1030 is CRITICAL: Connection refused by host [18:56:40] PROBLEM - RAID on cp1030 is CRITICAL: Connection refused by host [18:57:04] PROBLEM - Varnish HTCP daemon on cp1030 is CRITICAL: Connection refused by host [19:01:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:03:41] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63160 [19:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:07:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:13:04] RECOVERY - Varnish HTCP daemon on cp1030 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [19:13:24] RECOVERY - Disk space on cp1030 is OK: DISK OK [19:13:34] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 2 processes with command name varnishncsa [19:13:35] RECOVERY - DPKG on cp1030 is OK: All packages OK [19:13:35] RECOVERY - RAID on cp1030 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [19:28:31] csteipp: can you just use value 1 for https://gerrit.wikimedia.org/r/#/c/63150/1 ? [19:37:30] New review: Hashar; "I am not going to bikeshed about it. So feel free to merge :-D" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/62125 [19:38:41] yurik: I got IPv6 if you want a second test :-] [19:44:31] hashar, do you know if we can use IPv6 CIDRs in the varnish VCL files? [19:53:55] yurik: nooooo idea :( [19:55:19] yurik: you probably want to ask the varnish folks [19:55:42] !log aaron synchronized php-1.22wmf3/includes '6c01414775bb9c5269b391bee7ffd1611f9ff74f & d1940caa1f4a11e7c7d8d2ff5b178da7de8e2a92' [19:55:44] hashar, i sent an email to ops, hope i'll get a reply :) [19:55:51] Logged the message, Master [19:56:01] yurik: ping varnish devs too. They probably have an IRC channel there :-] [19:56:17] yurik: apparently we are running 3.0.3plus~rc1-wm10 [19:56:23] !log aaron synchronized php-1.22wmf3/maintenance [19:56:31] Logged the message, Master [20:01:03] hashar, thx, asking on #varnish [20:01:15] Anybody want to push a small update to bits? Fix for FirefoxOS application which is hosted there: https://gerrit.wikimedia.org/r/#/c/63067/ thanks! [20:01:26] brion: Don't you have working shell again? 
:p [20:01:39] we tried to remove it ;) [20:01:52] New patchset: Reedy; "Update FirefoxOS app with language fix" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63067 [20:01:55] i'm not trained on deploys and don't want to blow it up while i have unused perms [20:02:04] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63067 [20:02:08] thx :DD [20:02:18] New review: Nemo bis; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63113 [20:02:50] New patchset: Tim Landscheidt; "Tool Labs: Add libstring-shellquote-perl to exec_environ." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63211 [20:05:22] !log reedy synchronized docroot/bits/WikipediaMobileFirefoxOS [20:05:29] Logged the message, Master [20:08:31] New patchset: Andrew Bogott; "Added manifest for rt4 running with Apache." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63213 [20:09:44] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay seconds [20:09:49] New patchset: Cmjohnson; "Adding macs for wtp1005-1010 and wtp1015-1018" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63214 [20:10:14] OK is a good number of seconds [20:10:24] RECOVERY - MySQL Replication Heartbeat on db35 is OK: OK replication delay seconds [20:17:34] mutante, notpeter: the remaining step for RT in eqiad is configuring the db connection. Is it customary to puppetize that, or just hand edit? [20:17:43] For reference, I'm talking about filling in the fields of this: http://dpaste.org/JwSgZ/ [20:19:13] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63214 [20:22:25] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63211 [20:28:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [20:31:08] !log aaron synchronized php-1.22wmf3/includes/filebackend [20:31:16] Logged the message, Master [20:35:24] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay seconds [20:36:54] robh: around? [20:37:14] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay seconds [20:40:05] New patchset: Diederik; "Added per dc / server role breakdown of udp2log packetloss monitoring." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63220 [20:42:21] New review: Hashar; "Honestly, I have no idea how that stuff work. I have simply uncommented the existing block :-]" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62606 [20:43:24] PROBLEM - RAID on db26 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:43:29] New patchset: Diederik; "Added per dc / server role breakdown of udp2log packetloss monitoring." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63220 [20:43:54] PROBLEM - DPKG on db26 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:24] PROBLEM - Disk space on db26 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:45:45] BTW, when I see this: Are broken DB servers automatically taken out of http://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php & Co., or does this require human intervention? [20:46:55] scfc_de: It requires human intervention to actually take them out, but MediaWiki is largely intelligent about not querying broken servers [20:48:02] RoanKattouw: So MediaWiki tries the next server in the list when one is not available?
[20:48:25] It queries the slave lag on all of them and uses the least-lagged one [20:48:47] save killing the master when situation is too bad to believe [20:48:57] With a short cache so it only queries the lag once every few seconds [20:49:05] Yes, that too [20:49:18] It avoids dead slaves, but also avoids sending queries to slaves that are lagged >30s [20:49:28] And if all slaves are lagged >30s, the whole system goes into read-only mode [20:50:01] RoanKattouw: Ah, okay, that makes sense. Thanks! [20:51:14] PROBLEM - SSH on db26 is CRITICAL: Server answer: [20:52:19] binasher, ping [20:57:35] !log aaron synchronized php-1.22wmf3/includes/filebackend/FileBackendStore.php '2717d876ba2507df26ac2c5ab746ad7895af1bbf' [20:57:42] Logged the message, Master [20:58:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:22] AaronSchulz: since paravoid isn't here, I'll ask you: we're going to start reading from Ceph next week, do you/him have a plan for which day that will happen? [20:59:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [21:00:21] greg-g: tentatively monday, unless some problems come up [21:01:25] AaronSchulz: sounds good. Can I put it down at a specific time, eg: 1pm pacific? [21:01:42] Isure [21:01:43] *sure [21:03:02] AaronSchulz: thanks. [21:03:46] cmjohnson1: am now, was at lunch [21:04:07] ok, so doing wtp partitioning? [21:04:22] robh: cool...so looking at partman recipe for parsoid....thinking raid1.cfg? [21:04:24] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 2 processes with command name varnishncsa [21:04:31] !log fixing some more varnishncsa monitoring [21:04:38] so raid1 would work, but is non ideal, and i'll explain why [21:04:39] Logged the message, Master [21:04:55] mark has stated that if we have no good reason not to, we should use lvm in normal use cases [21:05:06] so in the event of bad shit, disk filling, etc, we can expand and buy time. [21:05:14] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 2 processes with command name varnishncsa [21:05:34] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 2 processes with command name varnishncsa [21:05:55] hrmm [21:05:59] i dislike raid1-lvm [21:06:12] as it isnt at all like lvm.cfg [21:06:13] wtf. [21:06:30] oh well [21:06:47] cmjohnson1: So if you look at raid1-lvm, most of the newer misc servers use that when they dont do hw raid [21:07:12] so i would use raid1-lvm [21:07:15] not raid1 [21:07:33] but only so we have the lvm stuff for expanding rather than a set XFS partition (like raid1) [21:07:41] its also odd that raid1 uses xfs, oh well. [21:07:58] we tend to not use xfs unless we are dealing with a large filesystem, larger than what fits on two disks [21:08:01] lvm...is a good thing so why waste the space as xfs [21:08:18] or why use xfs at all when there isnt a needed usecase. [21:08:20] New patchset: Yurik; "* Optimized opera_mini ACLs and added more IPv4 ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63077 [21:08:35] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 2 processes with command name varnishncsa [21:08:43] Hey LeslieCarr, could you please check whether https://rt.wikimedia.org/Ticket/Display.html?id=4433 can be closed or not?
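To put RoanKattouw's description of replica selection (earlier in this exchange, before the partman thread) into configuration terms: the per-server entries MediaWiki reads carry a load weight and a lag ceiling, and those two knobs produce the behaviour he describes. A minimal sketch with placeholder hosts and credentials — not the actual db-eqiad.php, which expresses the same settings through the LBFactory configuration:

```php
<?php
// Sketch only: hostnames, credentials and weights are placeholders.
$wgDBservers = array(
    array(
        'host'     => 'db-master.example',   // master: all writes go here
        'dbname'   => 'wikidb',
        'user'     => 'wikiuser',
        'password' => 'secret',
        'type'     => 'mysql',
        'load'     => 0,                      // weight 0 keeps reads off the master
    ),
    array(
        'host'     => 'db-replica1.example',  // replica: serves reads by weight
        'dbname'   => 'wikidb',
        'user'     => 'wikiuser',
        'password' => 'secret',
        'type'     => 'mysql',
        'load'     => 100,
        'max lag'  => 30,   // replicas lagging more than this are skipped;
                            // if every replica is over the limit, MediaWiki
                            // switches the wiki to read-only mode
    ),
);
```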
[21:08:51] cool..so i will use raid1-lvm [21:09:34] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 2 processes with command name varnishncsa [21:09:34] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 2 processes with command name varnishncsa [21:10:06] drdee: leslie is out of office, conferences and travel and the like [21:10:12] so she may not be responsive on irc. [21:10:14] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 2 processes with command name varnishncsa [21:10:17] aight ty [21:10:17] (fyi) [21:10:24] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 2 processes with command name varnishncsa [21:10:25] quite welcome [21:10:48] New patchset: Yurik; "Added IPv6 CIDRs to opera mini IP block" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63226 [21:10:50] who else besides mark and LeslieCarr can confirm whether ticket https://rt.wikimedia.org/Ticket/Display.html?id=4433 can be closed? [21:11:10] it's about setting up ACL's around the analytics machines [21:11:14] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 2 processes with command name varnishncsa [21:14:44] RECOVERY - Varnish traffic logger on cp3004 is OK: PROCS OK: 2 processes with command name varnishncsa [21:14:45] RECOVERY - Varnish traffic logger on cp3009 is OK: PROCS OK: 2 processes with command name varnishncsa [21:15:04] RECOVERY - Varnish traffic logger on cp3010 is OK: PROCS OK: 2 processes with command name varnishncsa [21:15:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [21:15:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [21:15:37] Anybody wants to merge a few extra IP (v4 only) ranges to opera mini - https://gerrit.wikimedia.org/r/#/c/63077/ thx! [21:16:11] users are seeing big red warnings without that [21:24:04] New patchset: Cmjohnson; "updating netboot.cfg with recipe for wtp1005-29" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63227 [21:26:04] New patchset: Dzahn; "decom bellin and blondel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63228 [21:26:23] !log Running listing based copy-sync scripts in terbium for all originals+transcodes on all wikis for swift->ceph to fix any outdated/missing files [21:26:31] Logged the message, Master [21:26:46] !log decom of bellen/blondel : setting network ports to disabled [21:26:54] Logged the message, RobH [21:27:04] New patchset: Cmjohnson; "updating netboot.cfg with recipe for wtp1005-29" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63227 [21:31:04] New patchset: Dzahn; "decom bellin and blondel and remove them from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63228 [21:33:51] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63227 [21:34:44] PROBLEM - NTP on db26 is CRITICAL: NTP CRITICAL: No response from NTP server [21:36:25] come on jenkins, now you say you verified it and at the same time you are telling me it needs Verified [21:37:02] ah, the first one was still for PS1 [21:37:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63228 [21:37:13] that makes more sense. 
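The payoff RobH is pointing at with raid1-lvm above is that an LVM-backed filesystem can be grown on a live host if a partition starts filling up, which is the "expand and buy time" option a fixed raid1/XFS layout does not give you. A rough sketch of that operation, with illustrative volume names and size, assuming the usual md RAID1 → volume group → ext4 root layout:

```sh
# Sketch only: VG/LV names and the size are illustrative.
vgs                              # check how much free space the volume group still has
lvextend -L +10G /dev/vg0/root   # grow the logical volume by 10 GB
resize2fs /dev/vg0/root          # grow ext3/ext4 online to fill the enlarged LV
# an XFS filesystem would use xfs_growfs instead of resize2fs
```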
[21:37:42] yea, if you create a second patch set in less time than it needs to validate the first :p [21:39:03] !log switch port disabled for blondel & bellin [21:39:11] Logged the message, RobH [21:40:04] PROBLEM - Host blondel is DOWN: PING CRITICAL - Packet loss = 100% [21:41:04] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [21:42:58] New patchset: Cmjohnson; "Fixing error on wtp1016 entry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63229 [21:45:01] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63229 [21:45:23] !log shutdown -h blondel .. bye [21:45:34] Logged the message, Master [21:54:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [22:01:58] RECOVERY - Puppet freshness on neon is OK: puppet ran at Fri May 10 22:01:47 UTC 2013 [22:03:19] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [22:05:39] PROBLEM - Disk space on virt1005 is CRITICAL: NRPE: Command check_disk_space not defined [22:05:59] PROBLEM - Disk space on ms2 is CRITICAL: NRPE: Command check_disk_space not defined [22:27:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:28:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [22:47:59] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [22:48:55] !log pgehres synchronized php-1.22wmf3/extensions/CentralAuth/ 'Small little update to maintenance script' [22:49:03] Logged the message, Master [22:55:28] New patchset: Andrew Bogott; "Added manifest for rt4 running with Apache." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63213 [23:19:51] !log DNS update - remove various old Tampa hosts [23:20:00] Logged the message, Master