[00:00:06] is it a meeting Monday? [00:00:13] it is. [00:01:10] But neither task should take all day :) [00:01:11] hrmm, that means higher activity when everybody is on and links to tickets [00:01:38] yea, well, what the hack, let's do it [00:01:51] tell people during the meeting we're going to do it $now [00:01:51] I'm on RT duty next week, so if we break RT it'll be win/win [00:01:57] fair:)) [00:02:33] wait, are we also moving to other server? [00:02:51] and if yea, do we know which one to use [00:02:57] That… was not part of my plan. [00:03:00] is RT still in tampa? [00:03:07] yea [00:03:10] Hm. [00:03:19] and if it's puppetized... [00:03:24] using a clean host .. [00:03:24] Is there any reason why we'd want to do both things at once? [00:03:33] Hm, true. [00:03:44] I don't know anything about how to set up the DB for that though. [00:04:00] shrug, it's not absolutely necessary to combine them, but usually we are .. like .. don't want to move unpuppetized stuff to eqiad, but once it's done want a clean host [00:04:08] to gurantee puppetizing worked as a bonus [00:04:43] That's reasonable. Where is the RT db hosted? How would we migrate that part? [00:04:43] db shouldn't be much worry, it used to be db9 [00:04:51] and we should be able to simply switch that over [00:04:57] because it already replicates to another [00:05:05] like Robh did with racktables [00:05:23] Ah, so it's replicated to eqiad already? That's handy. [00:05:44] db1001 [00:05:49] yea [00:05:59] So we need a server for it to run on. [00:06:24] It has its own private box atm, right? [00:06:45] we could use zirconium [00:06:53] unless we don't want to mix it with other public services [00:07:14] Currently it uses lighttpd instead of apache, which might not play well with other services. [00:07:25] urgh [00:07:27] no, can't call it private [00:07:27] Although /probably/ it's fine to have both on one box. [00:07:34] can we not put lighttpd in eqiad [00:07:43] why not? [00:07:51] andrewbogott: so in previous ops meetings we discussed killing lighttpd and moving to just apache and nginx [00:08:00] cuz nginx is lightweight [00:08:02] I don't know that there's any real reason why it uses lighttpd, that's just how it's always been. [00:08:07] (is my understanding) [00:08:20] mostly cuz some folks think apache is overkill. [00:08:20] andrewbogott: it's on streber [00:08:25] RobH, what role would nginx play in that case? [00:08:29] hhhhmmm, nginx doesn't have the word "light" in it, so I don't have any proof that it's lightweight [00:08:34] andrewbogott: i assume as the webserver [00:08:43] notpeter: ok, well, im quoting meeting heresay [00:09:05] OK… I don't object to doing that but I also wouldn't know how to do it. [00:09:05] if there is no reason that RT needs lighttpd, i would say migrate it to apache [00:09:08] and I'm spouting nonsense ;) [00:09:10] you want a _light_ webserver? check out "fnord" http://www.fefe.de/fnord/others.html [00:09:18] Changing the current puppet setup to use apache I could probably figure out... [00:09:20] i personally prefer we use apache. [00:09:26] since we use it on other stuff [00:09:30] Yeah. [00:09:33] i would prefer it all be standardized [00:09:36] meh, same with nginx [00:09:51] currently we only use nginx with the ssl stuff i thought? 
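A rough sketch of the pre-flight checks this plan implies: confirming what currently serves RT, and that its database really is replicating to eqiad. The endpoint, the db1001 FQDN and passwordless client access are assumptions for illustration, not confirmed details from the conversation:

    # which webserver answers for RT today? (expect lighttpd per the discussion)
    curl -sI https://rt.wikimedia.org/ | grep -i '^Server:'

    # is the RT database replicating db9 -> db1001 and keeping up?
    mysql -h db1001.eqiad.wmnet -e 'SHOW SLAVE STATUS\G' \
        | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'

If both look healthy, the move is mostly repointing RT's database host and bringing up the new web frontend, as discussed below.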
[00:10:04] still just handles http(s) requests [00:10:23] yea, so we have the majority of all minor services, and non ssl primary service as apache2 [00:10:29] https service as nginx [00:10:29] so, mutante, if we're going to migrate too then it probably makes sense to switch to apache. I'll mess with the puppet manifests tomorrow and see if that is hard or easy. [00:10:35] and like 3 misc tampa items as lighttpd [00:10:46] so if those 3 misc tampa items became apache, it would be nice. [00:10:51] mailman is also lighttpd [00:10:54] But it sounds like we'll be bringing up a new RT and pointing it at the old db, which is a slightly different problem than then one I was expecting (and maybe slightly easier) [00:11:17] but i dont disaggree with Robh.. if there is no real reason .. don't have to make it complicated by using multiple [00:11:32] what about jetty? it uses the apache license. so it's kinda like apache [00:11:39] notpeter: .... are you trolling? [00:11:53] i cannot tell when remote sometimes ;p [00:11:59] mutante, so, maybe not Monday in this case. But I will research tomorrow and see how things look. [00:12:03] Right now… dinnertime. [00:12:10] RobH: completely [00:12:14] andrewbogott: Tomorrow is the day after today [00:12:15] New patchset: Pgehres; "Actually enabling" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63094 [00:12:17] It also never dies [00:12:25] andrewbogott: sounds all good to me.. i mean, we don't _have_ to combine this, either way.. and thanks for the puppetization:) [00:12:30] Reedy: Correct on both counts. [00:12:51] New review: Pgehres; "Once more with feeling" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/63094 [00:12:51] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63094 [00:12:52] notpeter: cuz if you werent, i was gonna copy that to mark_ and then let you and him argue which is better, jetty or lighttpd [00:13:49] andrewbogott: but we should either do both, or JUST the upgrade first, just not the other way around and copy to eqiad before it's puppetized [00:13:52] to be honest, I think we should get out of http all together [00:14:00] get with some newer standards [00:14:08] gopher? [00:14:32] we should have everyone just use xmpp to talk to the wikipedia elizabot [00:14:45] !log pgehres synchronized wmf-config 'Actually enabling wgCentralAuthAutoMigrate' [00:14:53] Logged the message, Master [00:14:57] root@rt-testing13:~# finger andrew [00:14:58] Login: andrew Name: Andrew Bogott [00:14:59] hehe [00:15:48] notpeter: elizabot!! reminds me .. http://flooterbuck.sourceforge.net/ is fun [00:16:07] infobot rewrite with SQL backend.heh [00:16:26] New patchset: Yurik; "Removed X-Carrier and testing IP ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62867 [00:16:44] ori-l: now back to the bugzilla change [00:17:38] mutante: if you merge the patch to modifications repo, i'd like to do another dry run in labs, setting it up from scratch, optimally without any manual intervention this time [00:17:47] should be fairly quick [00:19:22] ori-l: go ahead, i merged it but didnt deploy yet [00:19:32] confirming again there is no diff left [00:19:38] cool, thanks [00:22:29] all equal minus one tab/space thing to be ignored [00:22:40] brb [00:26:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [00:40:41] ori-l: you know what would also be interesting.. some time .. 
http://bugzilla.wikimedia.org/bzapi doesn't exist in boogs.wmflabs.org yet, and it doesnt work in prod while it's there [00:41:13] "while it's there" -- while what's there? [00:41:27] i didn't understand the last part [00:41:27] andre__: did you see boogs.wmflabs.org yet?:) [00:41:42] he has, i emailed him about it; patience is not my forte [00:41:44] :) [00:42:22] ori-l: so there is that bzapi thing, somebody tried to set that up looooong time ago but it's broken [00:42:29] it's not the regular API [00:42:41] awjr: asked about it recently [00:42:49] there are some remnants on wikitech somewhere [00:43:02] so i checked if it's still on the server and it is, but it doesnt work, it times out [00:43:02] * awjr waves [00:43:27] there is an xmlrpc api that works though [00:43:37] bummer the bzapi restful api doesn't work [00:43:47] but i imagine that will take some wrestling with to make go [00:44:13] https://wikitech.wikimedia.org/wiki/Bugzilla_REST_API [00:44:26] https://wikitech.wikimedia.org/w/index.php?title=Bugzilla_REST_API&action=history [00:45:10] [00:45:15] FastCgiServer /srv/org/wikimedia/bzapi/script/bugzilla_api_fastcgi.pl -processes 3 -idle-timeout 180 [00:45:23] Alias /bzapi /srv/org/wikimedia/bzapi/script/bugzilla_api_fastcgi.pl/ [00:45:39] bugzilla_api.conf bugzilla_api.conf~ bugzilla_api.conf.sample INSTALL lib Makefile.PL root script t TODO [00:46:04] wc -l TODO = 74 :) [00:46:08] It'd be nice if things just worked out of the box ;) [00:46:14] that looks nice to have, i'll check it out [00:46:24] just about to commit a couple of small tweaks to the manifest, but it worked this time [00:47:07] Reedy: the fact that bugzilla needs a separate piece of software to act as a restful api that proxies to bugzilla is… [00:47:43] ohai mediawiki, you've got an api? [00:48:12] awjr: Though, it's not like bugzilla is known for being a highly useable piece of software [00:48:24] heh true story [00:48:32] oh, API is great to find outdated versions via spider :p [00:50:28] ori-l: awjr : this is the TODO file left in the bzapi dir [00:50:35] https://bugzilla.wikimedia.org/TODO [00:51:02] o_O [00:51:13] New patchset: Ori.livneh; "Puppetize Bugzilla" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [00:51:53] New review: Ori.livneh; "PS7 splits up the two invocations of checksetup.pl into two separate exec resources, as opposed to a..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [00:53:07] ori-l: i'll just deploy 62837 [00:53:39] deploy it how? is the git repo checked out on top of the bugzilla source tree? [00:54:19] i'll just assume you have a big red deploy button and not ask what it does [00:54:23] no, it's the git repo checked out in /root/bzmod and then i sync it locally [00:54:35] ah, right [00:56:56] btw, https://gerrit.wikimedia.org/r/#/c/62837/ is inert by itself. 
some manifest has to actually include it for it to do anything, but i wasn't sure where to do that for labs [00:57:12] in this case, sync is simple "cp", replacing diff with cp for those 3 files and nothing else, done [00:57:35] on boogs.wmflabs i just added this file: http://dpaste.de/mbd44/ and imported it from manifests/site.pp [00:58:20] actually it seemed ok to me to manually deploy it, i kind of like the sanity-check and separate deploy from merge [00:58:36] people merging in repos are not always sure if that actually deploys or not [00:58:45] and they might not be able to fix it and need root [00:58:50] if something goes wrong [00:59:14] well, puppet will fail, but i see your point [00:59:22] not like we do that with mediawiki either [01:00:00] so: keep the git::clone to /srv/bugzilla/modifications, but don't apply it? [01:01:14] hmm, i guess my opinion depends based on how we handle the +2 permissions on the repo [01:01:45] people doing +2 should preferably have server access and check when actually merging it [01:02:08] then the rest could be auto [01:02:37] but if we want to have more people with +2 that would also be fine, but then we should have deployment as a separate step [01:02:45] you'd have to ask Reedy & platformers, but afaik that's already the case in practice if not in actual ACLs [01:02:59] people wait around for someone with root to merge for exactly that reason [01:03:34] as long as we don't have that situatuon where stuff is merged and you think it's deployed, but then it's not [01:03:39] so my hunch is that no one who would lose +2 by this change would mind it, since s/he isn't using it anyway [01:04:01] makes sense [01:04:42] let me just get you that tarball :) [01:04:50] stripping password and attachments [01:05:56] i don't think i need it anymore, do i? [01:06:17] oh.. depends [01:06:22] i was just baffled by how url_quote was working in production. we know the answer: it doesn't [01:06:26] if those 3 files were the only reason, then no [01:06:46] yeah, i closed the ticket [01:06:47] unless you wanted to systematically check for any other diffs [01:06:54] ok [01:07:12] oh, well -- sure, yeah -- i thought you have, but if you think there may be other ones lurking then that'd be worth doing [01:07:47] hold on, i'll do it on server ..not that many files [01:07:55] that are actually in the bzmod repo [01:08:11] looks like the people who +2 changes in that repo are you and ^demon, so i guess ask him re: the +2 rights [01:08:36] or just give him and Reedy sudo on that machine, it's a bit silly that they don't have it [01:13:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [01:13:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [01:14:30] there will be a discussion about which commands are really needed and to make sudo rules that are specific vs. using ALL [01:14:35] i found more diff :p [01:14:44] haha [01:15:05] it's harmless and not much though.. making a change [01:16:14] i think we should host a git mirror of the bazaar repo on gerrit and maintain a wmf branch with our patches [01:17:20] that way we just clone the gerrit repo into place and that's it, rather than do patch management across two SCMs and several production systems [01:19:05] unfortunately the tooling for git<->bazaar is not as good as git-svn, but in a pinch i think it'd be ok to just commit the contents of the release tarball and not have a history, but i dunno. 
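The merge-then-deploy-by-hand flow described here (git repo checked out in /root/bzmod, synced onto the live tree with cp) would look roughly like the following sketch; the specific file is only an example, in practice you copy whatever the diff reports:

    # update the local checkout of the modifications repo
    cd /root/bzmod/modifications && git pull

    # list files that differ between the repo and the live bugzilla tree
    diff -qru /root/bzmod/modifications/bugzilla-4.2/ /srv/org/wikimedia/bugzilla/ \
        | grep -v "Only in"

    # copy each differing file from the repo into place (example file shown)
    cp /root/bzmod/modifications/bugzilla-4.2/template/en/custom/global/footer.html.tmpl \
       /srv/org/wikimedia/bugzilla/template/en/custom/global/footer.html.tmpl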
[01:23:21] diff -qru /root/bzmod/modifications/bugzilla-4.2/ /srv/org/wikimedia/bugzilla/ | grep -v "Only in" [01:23:24] = https://gerrit.wikimedia.org/r/#/c/63103/ [01:24:23] ? [01:39:53] i have no idea [01:40:18] the mediawiki bugzilla extension turns [[wikilinks]] into Special:Search URLs, but I don't know how the search box enters into it [01:43:22] ori-l: It's a hack. [01:43:43] to make an endpoint that can take ?title=Special:Search and returns search results? [01:43:52] so that Bugzilla can dress up as MediaWiki? [01:43:56] nice license :) "# its stolen from somewhere but was mostly re-written by Dirk Mueller " [01:44:00] So that more links work. [01:44:08] I guess interwiki links. [01:44:10] And probably some others. [01:44:18] https://en.wikipedia.org/wiki/google:foo [01:44:52] https://en.wikipedia.org/w/index.php?title=Special:Search&search=google:foo [01:44:54] well, but for that to be the case you merely need bugzilla to tolerate the presence of the title param if it happens to be present; you don't need to actually need to embed it into search queries generated on bugzilla itself [01:45:12] I mean for [[foo]]. [01:45:21] [[google:foo]] didn't used to work. [01:45:28] I think that's what you're talking about, at least. [01:45:32] I'm a little lost this evening. [01:46:38] I understand how this is a part of a broader attempt to make MediaWiki and Bugzilla links interchangeable, but not why it has to be added by Bugzilla. If you use the search box on Bugzilla to search for something, and then strip the 'title=Special%3ASearch' fragment from the result page, you get the same page back [01:47:16] maybe it's so that there's a familiar, canonical appearance to bugzilla search links [01:47:45] but I don't think it -- by which I mean the hack to add it to the quick search box on bugzilla itself -- serves a technical purpose. [01:48:17] Oh, the search box. [01:48:23] Yeah, I don't know anything about that. [01:48:29] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62642 [01:48:57] there's some stuff on http://www.mediawiki.org/wiki/Talk:Bugzilla but it's hard to extract the content from nemo and timeshifter growling at each other [01:49:58] " Rather than delete my accurate info, only remove the incorrect parts, if any. And only after you are sure." Response: "I've no idea what you're talking about. Your information was thouroughly wrong, mine is correct [...] Of course I know about Special:Search, I already had to correct a mistake you introduced when you didn't know it." Response: "If you don't know what I am talking about, then that shows you are ignorant.", et [01:49:58] c. [01:50:10] please leave some comments on the patch set, ori and Susan :) [01:50:19] Which patchset? [01:50:27] https://gerrit.wikimedia.org/r/#/c/63103/1 [01:50:35] https://gerrit.wikimedia.org/r/#/c/63103/1/bugzilla-4.2/template/en/custom/global/footer.html.tmpl [01:50:39] Ah. [01:51:17] merged one more RT thingie and about to leave for today [01:51:22] Shouldn't that be &? [01:51:56] Susan: that can indeed be a question but be aware my change in the current form is simply showing the diff between actual prod. and the repo [01:52:21] Oh. [01:52:24] so what it suggests is already the case unless we remove it [01:52:25] All right. 
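ori-l's claim above, that the title=Special:Search part of those URLs is cosmetic rather than functional, can be spot-checked with the URLs quoted in the discussion. A small sketch only; the two responses should carry the same search results, though incidental markup such as cache timestamps may differ, so this is an eyeball comparison rather than a strict test:

    curl -s 'https://en.wikipedia.org/w/index.php?title=Special:Search&search=google:foo' > with_title.html
    curl -s 'https://en.wikipedia.org/w/index.php?search=google:foo' > without_title.html
    diff with_title.html without_title.html | head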
[01:52:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [01:52:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:52:40] Well, I think we should fix both versions to be &. [01:54:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [01:54:26] mutante: I'm ok with +2ing it [01:56:30] alright, since i got a +1 from andre__ and ori-l , doing that now, it should just reflect what is live, can always change later [01:56:53] +1 [01:57:10] Susan, and yeah, escaping would be more correct [01:59:27] mutante: can you merge the puppet patch, too? the class is not applied to any host so it won't affect any production machines, and that way i can follow up with another patch adding a labs role without a rebase party every time something changes [02:01:39] andre__: this is done. now there is really no diff anymore between files that are in bugzilla-4.2 in the repo and stuff on the server [02:01:48] oh lovely [02:01:49] diff -qru /root/bzmod/modifications/bugzilla-4.2/ /srv/org/wikimedia/bugzilla/ | grep -v "Only in" ... nothing [02:02:09] of course i am excluding any file that is not in the repo [02:02:19] with the "Only in" part [02:02:51] mutante: can you 'find /srv/org/wikimedia/bugzilla/' > pastebin? [02:05:20] oh, you were about to leave, i just noticed that -- don't worry about any of this then [02:05:33] ori-l: i'll get you the file list, but not merge the module now...deal? [02:05:57] deal [02:06:42] New review: Tim Starling; "@Ariel: if Wikipedia is down for any length of time, they can turn on their TV and hear about it on ..." [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/61950 [02:09:15] ori-l: cat ~/bugzilla-files [02:09:18] that was easiest [02:09:25] on fenari that is [02:09:42] note /srv/org/wikimedia/bugzilla/Bugzilla/.svn [02:09:43] heh [02:10:04] thanks, i'll diff that against the contents of the same dir on boogs.wmflabs.org [02:10:12] cool, thanks and cu later then [02:10:16] have a good night [02:10:22] you too..out [02:15:49] !log LocalisationUpdate completed (1.22wmf3) at Fri May 10 02:15:49 UTC 2013 [02:15:58] Logged the message, Master [02:27:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time [02:46:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [02:56:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:58:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [03:32:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:33:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [03:37:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 10 03:37:41 UTC 2013 [03:37:50] Logged the message, Master [04:04:31] apergos: swift? 
:) [04:18:34] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [04:24:18] when I looked last night ms-be1 was nowhere near done [04:24:27] I have not looked yet this morning, I just sat down [04:25:17] yep still a long ways to go [04:29:05] good morning :-) [04:29:12] why do you say so? [04:29:15] I think it's done [04:32:46] I look at the objct replication percentage since you pushed the rings [04:32:59] it's only to 34% [04:33:05] where do you see that? [04:33:13] I checked the syslogs from the time of the push on [04:34:32] what specifically? [04:34:52] we are now at this: [04:34:54] May 10 04:20:03 ms-be1 object-replicator 6809/19842 (34.32%) partitions replicated in 338402.97s (0.02/sec, 179h remaining) [04:34:58] 34% [04:36:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [04:38:09] this is bs :) [04:38:37] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-be1.pmtpa.wmnet&m=cpu_report&r=4hr&s=by%20name&hc=4&mc=2&st=1368160689&g=network_report&z=large&c=Swift%20pmtpa [04:38:49] yes, I saw that. and yet it's still working it way through [04:39:10] there is zero traffic on the other boxes [04:39:50] May 10 01:37:38 ms-be2 object-replicator 14129/15207 (92.91%) partitions replicated in 19800.16s (0.71/sec, 25m remaining) [04:39:53] May 10 01:42:38 ms-be2 object-replicator 14680/15207 (96.53%) partitions replicated in 20100.16s (0.73/sec, 12m remaining) [04:39:56] May 10 01:47:38 ms-be2 object-replicator 15108/15207 (99.35%) partitions replicated in 20400.16s (0.74/sec, 2m remaining) [04:39:59] May 10 01:48:16 ms-be2 object-replicator 15207/15207 (100.00%) partitions replicated in 20438.25s (0.74/sec, 0s remaining) [04:40:01] I've already looked at all the other boxes [04:40:02] May 10 01:53:46 ms-be2 object-replicator 35/15182 (0.23%) partitions replicated in 300.00s (0.12/sec, 36h remaining) [04:40:05] May 10 01:58:46 ms-be2 object-replicator 55/15182 (0.36%) partitions replicated in 600.00s (0.09/sec, 45h remaining) [04:40:08] May 10 02:03:46 ms-be2 object-replicator 84/15182 (0.55%) partitions replicated in 900.00s (0.09/sec, 44h remaining) [04:40:12] that's ms-be2 [04:40:14] also see [04:40:17] May 10 04:23:46 ms-be2 object-replicator 2002/15182 (13.19%) partitions replicated in 9300.06s (0.22/sec, 17h remaining) [04:40:18] I'm well aware they have (pastebin. just pastebin it.) made multiple passes hrough [04:40:20] May 10 04:28:46 ms-be2 object-replicator 2640/15182 (17.39%) partitions replicated in 9600.06s (0.27/sec, 12h remaining) [04:40:23] May 10 04:33:46 ms-be2 object-replicator 3162/15182 (20.83%) partitions replicated in 9900.06s (0.32/sec, 10h remaining) [04:40:26] May 10 04:38:46 ms-be2 object-replicator 3868/15182 (25.48%) partitions replicated in 10200.06s (0.38/sec, 8h remaining) [04:40:31] http://xkcd.com/612/ [04:41:30] but ms-be1 has not made a single complete pass yet. so there we are. [04:45:46] it is still replicating things, as shown in the log. 
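The two things being compared in this exchange, the replicator's own progress lines in syslog versus the modification time of a given object on the destination disk, can each be checked directly. The paths below are the ones pasted in the conversation and serve only as examples:

    # replication progress as reported by the object-replicator on ms-be1
    grep 'object-replicator' /var/log/syslog | tail -n 5

    # on the destination host: when did this particular object actually land?
    ls -l /srv/swift-storage/sde1/objects/40071/97f/*/*.data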
[04:45:56] !log swift: pushing new rings, set weight 0 for ms-be5 sdb1, ms-be11 sdh1 (broken disks); balance 999.99, needs further rebalance [04:46:05] Logged the message, Master [04:46:05] May 10 04:45:00 ms-be1 object-replicator May 10 04:45:00 ms-be1 object-replicator Successful rsync of /srv/swift-storage/sdb4/objects/40071/97f at [10.0.6.213]::ob [04:46:05] ject/sde1/objects/40071 (55.200) [04:46:17] *sigh* [04:46:59] ? [04:47:16] that's from replication of an object a couple mins ago on ms-be1 [04:48:07] so? [04:48:18] so it was still moving data [04:48:53] and it still will [04:49:17] I'm not going to wait "179h" though [04:49:32] let's replace ms-be4 with an r720xd today [04:49:49] and ms-be1 on monday and ship them both back [04:50:28] at a certain point it speeds up (the last so many hours turn out to be only a few minutes) [04:51:21] you know we have three replicas of everything, right? :) [04:54:38] the fact that ms-be1 still rsyncs doesn't mean the data hasn't been copied from another replica [04:54:42] case in point, your rsync above [04:55:20] -rw------- 1 swift swift 89816 May 5 21:19 /srv/swift-storage/sde1/objects/40071/97f/9c87ca3b424d8d02fc9508570d55f97f/1367788760.42417.data [04:55:25] synced 5 days ago [04:55:49] I would hope that there are two other copies elsewhere [04:56:07] no, there are three copies of that file since may 5th [04:58:18] is this something we can guarantee? that everything on ms-be1 already has three copies on it elsewhere and it's just wasting time? because I don't know a way to verify that [04:58:32] of course now it's not worth discussing, since new rings went out [04:58:37] but theoretically [04:58:44] anyways, whatever [04:59:17] what do new rings have to do with that? [04:59:31] ms-be1 is unchanged in those new rings [04:59:36] it was 0 before and it's 0 now [04:59:41] I mean, the decision has already been made [04:59:46] so it's not worth discussing [04:59:55] it'll keep doing what it was doing [05:00:06] running rsync through its files [05:00:37] and sure, we can wait a few more days if that's what you wish, that's why I suggested monday [05:01:12] well you want ms-be4 pulled and replaced (today), I assume that means new rings [05:01:22] no [05:01:41] just leave it at 100% to sync [05:01:46] it will take a hit on our 404s [05:02:04] but I don't think it'll be a problem in practice [05:03:01] the only worrying part is having one replica in each of the two failed disks plus another one on ms-be4 [05:03:09] th disk layout is different [05:03:16] which is why it's ridiculous we have two failed disks for weeks without having them at weight 0 or replaced [05:03:32] oh, that's right [05:04:07] two disks change, right? [05:04:11] yep [05:04:20] ok [05:04:24] we can do that on monday then [05:04:28] replace both ms-be1 & 4 [05:05:00] damn, I could have removed those two now [05:05:06] had [05:06:51] did the new rings not actually go around yet? [05:07:21] I still see the old ones on ms-be1 for example [05:07:48] haven't forced run puppet yet [05:07:53] ah [05:09:34] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [05:13:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:35] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=network_report&s=descending&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [05:13:39] see? 
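For context, the ring change !logged above is normally made with swift-ring-builder against the builder files, with the resulting ring files then pushed out via puppet (hence the "haven't forced run puppet yet" above). A sketch of the equivalent commands; the device IDs are made up and would come from listing the builder first:

    # list devices to find the IDs for ms-be5 sdb1 and ms-be11 sdh1
    swift-ring-builder object.builder

    # drop the two broken disks to weight 0 (d12 and d47 are hypothetical IDs)
    swift-ring-builder object.builder set_weight d12 0
    swift-ring-builder object.builder set_weight d47 0

    # recompute partition placement and write the new ring file
    swift-ring-builder object.builder rebalance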
[05:13:41] now ms-be1 copies too [05:13:57] since there's a new destination that doesn't already have the objects [05:14:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [05:15:03] it was already copying some objects, but yes it is doing more work now [05:15:29] no it wasn't [05:16:25] um, I pasted rsync output in the channel [05:19:01] and i pasted you the mtime of that object in the target [05:19:27] which was may 5th :) [05:22:13] so, object A has a set of say (ms-be1, ms-be2, ms-be3) [05:22:18] we push new rings with ms-be1 at weight 0 [05:22:28] the set now becomes (ms-be2, ms-be3, ms-be12) [05:22:39] all of ms-be1/2/3 will try to rsync the file to ms-be12 [05:23:03] if ms-be2 is done first, then ms-be1 will just rsync over it and transfer nothing [05:25:25] and since the change was to set ms-be1's weight to 0, this means that we had sets with all the other boxes in the cluster [05:25:29] who raced ms-be1 [05:25:43] but they were many more, and two of them had each object [05:25:51] which is why ms-be1 is left behind on the rsync I guess [05:25:58] but copies nothing :) [05:26:13] now that I rebalanced the rings, I started the race for some partitions from the start [05:26:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:26:51] so the set now maybe (ms-be2, ms-be9, ms-be12) and ms-be1 has to race ms-be2/3 to copy the object to ms-be9 [05:27:00] which is why it picked up traffic again [05:27:04] but it'll still lose :) [05:27:15] (also, it will be a very short race, considering it's just two disks this time) [05:28:03] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=network_report&s=by+name&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [05:28:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.304 second response time [05:28:59] that's my understanding at least :) [05:29:44] if all holders of an object must rsync their copies to the new holder without checking if the new holder actually already has it already, that's pretty poor [05:30:00] it's just rsync [05:30:12] it won't actually copy [05:30:28] the "checking if it has it already" is happening by rsync essentially [05:31:05] the replicator doesn't do anything per object [05:31:11] it just fires up rsync per partition [05:31:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:31:50] rsync lines had a pile of RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.270 second response time [05:33:15] I would have expected it to actually only copy if the data wasn't already there [05:33:26] that's what rsync does [05:37:56] ori-l: indeed, that discussion must be ignored :) [05:38:06] that is what I would expect but the output indicates otherwise [05:39:11] ori-l: it was mainly trolling in revenge for some quarrel about some bug(s) [05:39:47] Nemo_bis: heh. I wouldn't have stared at it for longer than a millisecond except that it seemed like it could contain the answer to my question :P [05:45:59] New review: Nemo bis; "Well, *officially* status.wikimedia.org is designated place, except it gives no information; in theo..." 
[operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/61950 [05:49:23] ori-l: noo [05:52:34] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [05:52:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [05:52:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [07:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:07:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [07:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:27:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [07:28:34] New review: Hashar; "Nice catch, I guess that will fix a few extensions :-]" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/63080 [08:13:14] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 12.5168948438 (gt 8.0) [08:13:24] PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: CRITICAL: packet_loss_average is 12.1789862595 (gt 8.0) [08:13:34] PROBLEM - Packetloss_Average on analytics1006 is CRITICAL: CRITICAL: packet_loss_average is 13.0341796154 (gt 8.0) [08:14:34] PROBLEM - Packetloss_Average on analytics1004 is CRITICAL: CRITICAL: packet_loss_average is 12.9878309375 (gt 8.0) [08:17:14] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.663575583333 [08:17:24] RECOVERY - Packetloss_Average on analytics1003 is OK: OK: packet_loss_average is 1.12469108333 [08:17:34] PROBLEM - Packetloss_Average on gadolinium is CRITICAL: CRITICAL: packet_loss_average is 11.9558827273 (gt 8.0) [08:17:35] RECOVERY - Packetloss_Average on analytics1006 is OK: OK: packet_loss_average is 0.465591147541 [08:17:41] hello [08:18:07] and here is jenkins slow again :D [08:18:34] RECOVERY - Packetloss_Average on analytics1004 is OK: OK: packet_loss_average is 0.480869576271 [08:18:44] PROBLEM - Packetloss_Average on analytics1008 is CRITICAL: CRITICAL: packet_loss_average is 12.3684900769 (gt 8.0) [08:21:34] RECOVERY - Packetloss_Average on gadolinium is OK: OK: packet_loss_average is -0.0649045555556 [08:22:44] RECOVERY - Packetloss_Average on analytics1008 is OK: OK: packet_loss_average is 1.09076163793 [08:23:14] PROBLEM - Packetloss_Average on analytics1005 is CRITICAL: CRITICAL: packet_loss_average is 10.7205832331 (gt 8.0) [08:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:14] RECOVERY - Packetloss_Average on analytics1005 is OK: OK: packet_loss_average is 0.629512118644 [08:28:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [08:33:33] !log Jenkins hit by the evil bug again {{bug|48025}}. 
Taking as much trace as possible before restarting it [08:33:42] Logged the message, Master [08:55:33] New patchset: Nikerabbit; "ULS config for deployment phase 1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63113 [08:58:50] New review: Siebrand; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63113 [09:01:28] New review: Nemo bis; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63113 [09:17:30] !log gallium: dumping java heap of Jenkins using jmap -dump:format=b,file=/root/jenkins-bug48025.jmap -F 24508 [09:17:39] Logged the message, Master [09:25:05] !log restarting jenkins [09:25:14] Logged the message, Master [09:26:33] !log killed -9 jenkins :( [09:26:41] Logged the message, Master [09:42:55] !log restarted jenkins twice to get it to finally serve something :( [09:43:03] Logged the message, Master [09:53:56] I'm running a batch delete on ms-fe1 [09:54:07] it's going to take a while [09:54:15] it takes a hit on responsiveness but things to be stable so far [10:00:11] New review: Nikerabbit; "Marking -2 to avoid accidental merges before scheduled time." [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/63113 [10:00:50] hashar: it seems you are forming love-hate relationship with jenkins [10:02:24] Nikerabbit: indeed [10:02:30] Nikerabbit: it most probably has a memory leak :( [10:09:11] !log jenkins: upgrading plugins and restarting. [10:09:19] Logged the message, Master [10:27:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [10:32:09] hashar: it lost memory of the good moments between you? [10:32:46] na I got it upgraded [10:32:52] and something must be incompatible somewhere [10:43:43] New patchset: Hydriz; "(bug 41757) Enable special:import on Hindi Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31823 [10:47:32] ahh [10:47:37] I might have find out the actual root cause [10:47:39] yum [11:14:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [11:14:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [11:27:03] !log Moved ssl3001-3004 from csw1-esams to csw2-esams [11:27:12] Logged the message, Master [11:36:33] ah, is that why WP is down for me? [11:37:14] i doubt it [11:41:43] indeed, HTTP is equally down [11:43:13] WORKSFORME [11:44:01] !log Disabled Puppet and stopped PyBal on amslvs2 [11:44:08] !log restarting jenkins [11:44:09] Logged the message, Master [11:44:17] Logged the message, Master [11:45:54] PROBLEM - Host amslvs2 is DOWN: PING CRITICAL - Packet loss = 100% [11:47:54] RECOVERY - Host amslvs2 is UP: PING OK - Packet loss = 0%, RTA = 96.54 ms [11:49:04] !log Reenabled PyBal on amslvs2 [11:49:12] Logged the message, Master [11:50:19] !log Stopped PyBal on amslvs1 [11:50:26] Logged the message, Master [11:51:53] !log Started PyBal on amslvs1 [11:52:01] !log Moved amslvs1 and 2 from csw1-esams to csw2-esams [11:52:01] Logged the message, Master [11:52:09] Logged the message, Master [11:53:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [11:59:29] now ok again [12:08:40] !log Depooled knsq23-knsq30 text squid frontends [12:08:49] Logged the message, Master [12:24:29] !log jenkins: cleaning up all occurrences of an old plugin in build histories. 
The script run in a tmux on gallium (see https://bugzilla.wikimedia.org/show_bug.cgi?id=48025#c19 ) [12:24:37] Logged the message, Master [12:27:33] !log Removed knsq23-knsq30 from the text squid config [12:27:42] Logged the message, Master [12:47:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [13:20:09] !log Migrated toolserver routing to cr1-esams and cr2-esams via csw2-esams access [13:20:18] Logged the message, Master [13:32:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [13:59:12] I'm having trouble writing any data to /home on stat1.wikimedia.org. I get "No space left on device." but there's plenty of space. Can anyone help? [13:59:13] apergos, are you around? we are having a weird error on stat1 [13:59:22] Hey drdee :) [13:59:26] morning [13:59:27] υο [13:59:31] yo even [13:59:32] disk is not full [13:59:42] /home is 85% [14:00:08] lt me get over there and have a look [14:00:12] thanks! [14:00:19] ^ [14:00:53] it seems that another backup process is going crazy as well [14:01:21] what are yu trying to write and how? [14:02:00] i tried ' touch foo' in my home folder [14:02:23] halfak_ tried copying a 4mbfile [14:02:50] My last test was: echo "asdjsdnfjsdnjfsd" > foo [14:03:13] ok, that gives me enough to work with [14:03:14] same error (No space) [14:15:03] New patchset: Hashar; "contint: install colordiff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63130 [14:16:04] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [14:16:54] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:24] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:31] out of inodes [14:17:34] blah [14:17:34] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:35] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:35] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [14:17:35] PROBLEM - Host amssq54 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:35] PROBLEM - Host amssq55 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:35] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:36] PROBLEM - Host amslvs1 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:54] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:04] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:04] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:14] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 94.57 ms [14:18:24] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 95.28 ms [14:18:24] RECOVERY - Host amssq55 is UP: PING OK - Packet loss = 0%, RTA = 96.61 ms [14:18:24] RECOVERY - Host amssq54 is UP: PING OK - Packet loss = 0%, RTA = 96.50 ms [14:18:24] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 96.74 ms [14:18:34] PROBLEM - Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:09] drdee: so we can't change the max number of inodes dynamically [14:19:26] apergos: we are about to move /a/squid/archive to stat1002 so that would solve the problem [14:19:34] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [14:19:34] PROBLEM - Host amssq53 is DOWN: PING CRITICAL - 
Packet loss = 100% [14:19:45] is /a on home? no [14:19:53] no [14:20:01] can you see who the culprit is ? [14:20:14] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:14] PROBLEM - Host knsq29 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:14] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:14] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:34] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:35] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:35] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:44] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:54] PROBLEM - Backend Squid HTTP on amssq58 is CRITICAL: Connection timed out [14:20:54] PROBLEM - nginx HTTP on ms6 is CRITICAL: Connection timed out [14:20:54] PROBLEM - SSH on ms6 is CRITICAL: Connection timed out [14:20:54] PROBLEM - Host knsq20 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:54] PROBLEM - Host amssq60 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:55] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:55] PROBLEM - SSH on amssq58 is CRITICAL: Connection timed out [14:21:04] PROBLEM - Host amssq48 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:35] PROBLEM - Frontend Squid HTTP on amssq58 is CRITICAL: Connection timed out [14:21:48] I will guess /home/erosen/tmp/repos since it is taking forever to list, I made the mistake of not speficying nosort [14:22:04] lemme verify though [14:22:04] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 96.53 ms [14:22:07] let me check with him [14:22:14] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 0%, RTA = 96.31 ms [14:22:24] RECOVERY - Host knsq20 is UP: PING OK - Packet loss = 0%, RTA = 97.01 ms [14:22:24] RECOVERY - Host knsq16 is UP: PING OK - Packet loss = 0%, RTA = 96.05 ms [14:22:24] RECOVERY - Host knsq26 is UP: PING OK - Packet loss = 0%, RTA = 94.95 ms [14:22:24] RECOVERY - Host knsq28 is UP: PING OK - Packet loss = 0%, RTA = 95.07 ms [14:22:34] RECOVERY - Host knsq29 is UP: PING OK - Packet loss = 0%, RTA = 93.86 ms [14:22:58] trying a wc -l now to see how many entries are in there [14:23:13] only 81091 but it would be a start [14:23:34] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:24:04] ok [14:24:13] i've lost the entire history of the article by its undeleting, can anybody fix it by direct db edit, please? 
[14:24:17] stll there must be a real big hog to use so many, that's a small percentage [14:24:26] i just asked erosen to move something [14:25:01] halfak_: we know the cause, asked erosen to delete some of his smaller files [14:25:14] RECOVERY - Host amssq59 is UP: PING OK - Packet loss = 0%, RTA = 94.99 ms [14:25:14] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 94.03 ms [14:25:24] RECOVERY - Host amssq60 is UP: PING OK - Packet loss = 0%, RTA = 96.68 ms [14:25:24] RECOVERY - Host amssq53 is UP: PING OK - Packet loss = 0%, RTA = 96.85 ms [14:25:25] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 95.60 ms [14:25:25] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 96.33 ms [14:25:25] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 0%, RTA = 95.10 ms [14:25:25] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 95.06 ms [14:25:25] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 96.23 ms [14:25:25] RECOVERY - Host amssq48 is UP: PING OK - Packet loss = 0%, RTA = 95.11 ms [14:25:26] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 95.14 ms [14:25:26] Did we run out of inodes or something? [14:25:26] RECOVERY - Host amssq52 is UP: PING OK - Packet loss = 0%, RTA = 96.59 ms [14:25:27] RECOVERY - Frontend Squid HTTP on amssq58 is OK: HTTP OK: HTTP/1.0 200 OK - 567 bytes in 0.194 second response time [14:25:29] yes [14:25:32] yup [14:25:34] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 93.74 ms [14:25:43] I mentioned it above but obviously it scrolled off the screen [14:25:44] RECOVERY - nginx HTTP on ms6 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 0.190 second response time [14:25:44] RECOVERY - Backend Squid HTTP on amssq58 is OK: HTTP OK: HTTP/1.0 200 OK - 663 bytes in 0.193 second response time [14:25:44] RECOVERY - SSH on ms6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:25:54] RECOVERY - SSH on amssq58 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:25:54] RECOVERY - Host amssq49 is UP: PING OK - Packet loss = 0%, RTA = 95.15 ms [14:25:54] RECOVERY - Host maerlant is UP: PING OK - Packet loss = 0%, RTA = 96.62 ms [14:25:54] RECOVERY - Host amslvs1 is UP: PING OK - Packet loss = 0%, RTA = 96.50 ms [14:27:09] I"m trying a recursive listing on that directory (on /home/erosen/tmp/repo) to see if it's a lot more or not [14:30:06] he must have a job that runs that updates that continuously too [14:31:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [14:32:37] those 81000 things are all git repos [14:32:59] we might wait forever for the recursive ls to finish. can I ask you to follow up from here drdee ? [14:33:18] yes, that's fine [14:33:22] i am poking erosen [14:33:32] thank you for your help! [14:33:58] yw [14:41:10] apergos: Thanks! 
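For reference, the failure mode in this incident (writes failing with "No space left on device" while df shows free space) is confirmed and narrowed down with a couple of one-liners, roughly what was done above; the suspect path is the one from the conversation:

    # block usage looks fine, inode usage does not
    df -h /home
    df -i /home        # IUse% at 100% is what turns into 'No space left on device'

    # count entries under the suspected tree (no sorting, it can be huge)
    ls -f /home/erosen/tmp/repos | wc -l
    find /home/erosen/tmp/repos -xdev | wc -l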
[14:42:25] ah for the future, df -i will give inode info [14:42:34] hopefully there will not be a similar future [14:42:38] ;-) [14:42:50] thanks for the tip [14:43:16] sure thing [14:44:47] You think it wouldn't be that hard to actually give a useful error message [14:44:56] hah [14:45:05] Welcome to software engineering [14:45:36] well it returns ENOSPC [15:10:34] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [15:11:24] PROBLEM - Host amssq48 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:34] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:34] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:35] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [15:11:35] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:54] PROBLEM - Host knsq18 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:54] PROBLEM - Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:04] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:04] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:04] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:15] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:25] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:35] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:44] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:54] PROBLEM - Host amssq53 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:04] PROBLEM - Host knsq21 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:04] PROBLEM - Host amssq58 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:14] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:14] PROBLEM - Host knsq29 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:44] RECOVERY - Host amssq48 is UP: PING OK - Packet loss = 0%, RTA = 95.19 ms [15:13:44] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 96.23 ms [15:13:44] RECOVERY - Host knsq21 is UP: PING OK - Packet loss = 0%, RTA = 94.88 ms [15:13:44] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 94.69 ms [15:13:54] RECOVERY - Host amssq49 is UP: PING OK - Packet loss = 0%, RTA = 94.71 ms [15:13:54] RECOVERY - Host amssq52 is UP: PING OK - Packet loss = 0%, RTA = 96.29 ms [15:13:54] RECOVERY - Host amssq53 is UP: PING OK - Packet loss = 0%, RTA = 96.40 ms [15:13:54] RECOVERY - Host knsq18 is UP: PING OK - Packet loss = 0%, RTA = 96.34 ms [15:13:54] RECOVERY - Host ms6 is UP: PING OK - Packet loss = 0%, RTA = 94.85 ms [15:13:55] RECOVERY - Host knsq29 is UP: PING OK - Packet loss = 0%, RTA = 93.44 ms [15:13:55] RECOVERY - Host amssq58 is UP: PING OK - Packet loss = 0%, RTA = 94.96 ms [15:13:56] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 96.28 ms [15:13:56] RECOVERY - Host knsq27 is UP: PING OK - Packet loss = 0%, RTA = 96.39 ms [15:13:57] RECOVERY - Host maerlant is UP: PING OK - Packet loss = 0%, RTA = 96.55 ms [15:13:57] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 93.69 ms [15:13:58] RECOVERY - Host knsq28 is UP: PING OK - Packet loss = 0%, RTA = 94.73 ms [15:13:58] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 96.35 ms [15:13:59] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 0%, RTA = 96.43 ms [15:14:24] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 95.53 ms [15:53:34] 
PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [15:53:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [15:53:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [16:31:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:32:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [16:41:24] New patchset: Twotwotwo; "toy scripts playing with long-range compression" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/63139 [16:47:34] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [16:47:34] PROBLEM - Host amssq55 is DOWN: PING CRITICAL - Packet loss = 100% [16:47:44] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [16:48:14] RECOVERY - Host amssq59 is UP: PING OK - Packet loss = 0%, RTA = 94.72 ms [16:48:24] RECOVERY - Host amssq55 is UP: PING OK - Packet loss = 0%, RTA = 97.16 ms [16:48:24] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 94.97 ms [17:02:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [17:15:46] New patchset: CSteipp; "Enable Local XFF blocking" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63141 [17:18:46] !log jenkins: cleared all occurrences of an old plugin in build histories. {{bug|48025}}. Jenkins should be fine now. Will monitor tonight. [17:18:53] Logged the message, Master [17:22:36] New patchset: Diederik; "Non-performant hack: set threads to 1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63143 [17:39:55] * andre__ still doesn't get a clue about reports about timeouts of http requests to bits in Europe and sends an email to ops@ [17:41:33] !log authdnsupdate - removing racktables2 entry [17:41:41] Logged the message, RobH [17:55:33] New patchset: RobH; "removing old racktables stuff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63147 [17:59:33] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63147 [18:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:07:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [18:16:23] !log authdns-update to remove blog.wikimediafoundation as it wasnt setup and pointed to decom server [18:16:32] Logged the message, RobH [18:19:42] New review: Jdlrobson; "hoping there is no side effects but seems too trivial to bikeshed on" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/63081 [18:22:02] RobH: just a quick heads-up before I leave about https://bugzilla.wikimedia.org/show_bug.cgi?id=48257 | https://rt.wikimedia.org/Ticket/Display.html?id=5118 - in case you have any ideas, they are welcome. TIA... [18:22:29] !log powercycling srv291 [18:22:37] Logged the message, Master [18:23:17] andre__: hrmm, well, this would fall on the person on rt triage, but its after his hours [18:23:23] unfortunatley, he is also the person to look at this [18:23:30] or leslie [18:23:47] so its mark_ or leslie i think [18:23:47] nods [18:24:30] alright. 
So I'll cross fingers and hope that no angry mob with torches will mass-comment on that bug report over the weekend :) [18:25:34] RECOVERY - Host srv291 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [18:27:08] anybody feel like merging the new IPs list for Opera Mini? Some users are apparently getting a red banner without it [18:27:10] https://gerrit.wikimedia.org/r/#/c/63077/ [18:27:54] PROBLEM - Apache HTTP on srv291 is CRITICAL: Connection refused [18:29:06] dear ops, do we have any issues with IPv6 at this point? [18:29:26] will varnish properly recognize an IPv6 traffic? [18:29:37] and be able to do ACL ~ operator on it? [18:39:28] New review: Jdlrobson; "Actually just tested this - it looks really bad and is unacceptable. The whitespace has changed and ..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/63081 [18:41:02] New patchset: RobH; "barium reclaim" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63156 [18:42:43] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63156 [18:44:43] New patchset: RobH; "removed colby and barium from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63160 [18:48:34] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 2 processes with command name varnishncsa [18:49:38] !log killing and restarting nagios-nrpe-server on a bunch of cp10xx boxes to fix monitoring of varnishncsa [18:49:47] Logged the message, Master [18:51:12] Anybody willing to do a quick merge & push to bits for me? Fix for our Firefox OS app, serious localization issue. https://gerrit.wikimedia.org/r/#/c/63067/ thanks! [18:51:25] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 2 processes with command name varnishncsa [18:51:34] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 2 processes with command name varnishncsa [18:51:34] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [18:51:44] paravoid: you aren't still here are you? 
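On yurik's question above about IPv6 in varnish ACLs: one low-risk way to find out whether the VCL compiler accepts an IPv6 CIDR in an acl is to feed it a throwaway config and ask varnishd to compile it. This is a sketch only, using documentation prefixes rather than real Opera Mini ranges, and assuming the Varnish 3 syntax in use here:

    cat > /tmp/acl-test.vcl <<'EOF'
    backend default { .host = "127.0.0.1"; .port = "80"; }
    acl opera_mini {
        "198.51.100.0"/24;     # placeholder v4 range
        "2001:db8::"/32;       # placeholder v6 range, the thing being tested
    }
    sub vcl_recv {
        if (client.ip ~ opera_mini) { set req.http.X-Test = "1"; }
    }
    EOF
    varnishd -C -f /tmp/acl-test.vcl > /dev/null && echo "VCL compiles"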
[18:53:14] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 2 processes with command name varnishncsa [18:53:24] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 2 processes with command name varnishncsa [18:53:35] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:54:14] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 2 processes with command name varnishncsa [18:54:24] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 2 processes with command name varnishncsa [18:54:34] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 2 processes with command name varnishncsa [18:56:24] PROBLEM - Disk space on cp1030 is CRITICAL: Connection refused by host [18:56:40] PROBLEM - DPKG on cp1030 is CRITICAL: Connection refused by host [18:56:40] PROBLEM - RAID on cp1030 is CRITICAL: Connection refused by host [18:57:04] PROBLEM - Varnish HTCP daemon on cp1030 is CRITICAL: Connection refused by host [19:01:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:03:41] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63160 [19:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:07:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:13:04] RECOVERY - Varnish HTCP daemon on cp1030 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [19:13:24] RECOVERY - Disk space on cp1030 is OK: DISK OK [19:13:34] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 2 processes with command name varnishncsa [19:13:35] RECOVERY - DPKG on cp1030 is OK: All packages OK [19:13:35] RECOVERY - RAID on cp1030 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [19:28:31] csteipp: can you just use value 1 for https://gerrit.wikimedia.org/r/#/c/63150/1 ? [19:37:30] New review: Hashar; "I am not going to bikeshed about it. So feel free to merge :-D" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/62125 [19:38:41] yurik: I got IPv6 if you want a second test :-] [19:44:31] hashar, do you know if we can use IPv6 CIDRs in the varnish VCL files? [19:53:55] yurik: nooooo idea :( [19:55:19] yurik: you probably want to ask the varnish folks [19:55:42] !log aaron synchronized php-1.22wmf3/includes '6c01414775bb9c5269b391bee7ffd1611f9ff74f & d1940caa1f4a11e7c7d8d2ff5b178da7de8e2a92' [19:55:44] hashar, i sent an email to ops, hope i'll get a reply :) [19:55:51] Logged the message, Master [19:56:01] yurik: ping varnish devs too. They probably have an IRC channel there :-] [19:56:17] yurik: apparently we are running 3.0.3plus~rc1-wm10 [19:56:23] !log aaron synchronized php-1.22wmf3/maintenance [19:56:31] Logged the message, Master [20:01:03] hashar, thx, asking on #varnish [20:01:15] Anybody want to push a small update to bits? Fix for FirefoxOS application which is hosted there: https://gerrit.wikimedia.org/r/#/c/63067/ thanks! [20:01:26] brion: Don't you have working shell again? 
:p [20:01:39] we tried to remove it ;) [20:01:52] New patchset: Reedy; "Update FirefoxOS app with language fix" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63067 [20:01:55] i'm not trained on deploys and don't want to blow it up while i have unused perms [20:02:04] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63067 [20:02:08] thx :DD [20:02:18] New review: Nemo bis; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63113 [20:02:50] New patchset: Tim Landscheidt; "Tool Labs: Add libstring-shellquote-perl to exec_environ." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63211 [20:05:22] !log reedy synchronized docroot/bits/WikipediaMobileFirefoxOS [20:05:29] Logged the message, Master [20:08:31] New patchset: Andrew Bogott; "Added manifest for rt4 running with Apache." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63213 [20:09:44] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay seconds [20:09:49] New patchset: Cmjohnson; "Adding macs for wtp1005-1010 and wtp1015-1018" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63214 [20:10:14] OK is a good number of seconds [20:10:24] RECOVERY - MySQL Replication Heartbeat on db35 is OK: OK replication delay seconds [20:17:34] mutante, notpeter: the remaining step for RT in eqiad is configuring the db connection. Is it customary to puppetize that, or just hand edit? [20:17:43] For reference, I'm talking about filling in the fields of this: http://dpaste.org/JwSgZ/ [20:19:13] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63214 [20:22:25] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63211 [20:28:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [20:31:08] !log aaron synchronized php-1.22wmf3/includes/filebackend [20:31:16] Logged the message, Master [20:35:24] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay seconds [20:36:54] robh: around? [20:37:14] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay seconds [20:40:05] New patchset: Diederik; "Added per dc / server role breakdown of udp2log packetloss monitoring." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63220 [20:42:21] New review: Hashar; "Honestly, I have no idea how that stuff work. I have simply uncommented the existing block :-]" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62606 [20:43:24] PROBLEM - RAID on db26 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:43:29] New patchset: Diederik; "Added per dc / server role breakdown of udp2log packetloss monitoring." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63220 [20:43:54] PROBLEM - DPKG on db26 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:24] PROBLEM - Disk space on db26 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:45:45] BTW, when I see this: Are broken DB servers automatically taken out of http://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php & Co., or does this require human intervention? [20:46:55] scfc_de: It requires human intervention to actually take them out, but MediaWiki is largely intelligent about not querying broken servers [20:48:02] RoanKattouw: So MediaWiki tries the next server in the list when one is not available?
[20:48:25] It queries the slave lag on all of them and uses the least-lagged one [20:48:47] save killing the master when situation is too bad to believe [20:48:57] With a short cache so it only queries the lag once every few seconds [20:49:05] Yes, that too [20:49:18] It avoids dead slaves, but also avoids sending queries to slaves that are lagged >30s [20:49:28] And if all slaves are lagged >30s, the whole system goes into read-only mode [20:50:01] RoanKattouw: Ah, okay, that makes sense. Thanks! [20:51:14] PROBLEM - SSH on db26 is CRITICAL: Server answer: [20:52:19] binasher, ping [20:57:35] !log aaron synchronized php-1.22wmf3/includes/filebackend/FileBackendStore.php '2717d876ba2507df26ac2c5ab746ad7895af1bbf' [20:57:42] Logged the message, Master [20:58:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:22] AaronSchulz: since paravoid isn't here, I'll ask you: we're going to start reading from Ceph next week, do you/him have a plan for which day that will happen? [20:59:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [21:00:21] greg-g: tentatively monday, unless some problems come up [21:01:25] AaronSchulz: sounds good. Can I put it down at a specific time, eg: 1pm pacific? [21:01:42] Isure [21:01:43] *sure [21:03:02] AaronSchulz: thanks. [21:03:46] cmjohnson1: am now, was at lunch [21:04:07] ok, so doing wtp partitioning? [21:04:22] robh: cool...so looking at partman recipe for parsoid....thinking raid1.cfg? [21:04:24] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 2 processes with command name varnishncsa [21:04:31] !log fixing some more varnishncsa monitoring [21:04:38] so raid1 would work, but is non ideal, and i'll explain why [21:04:39] Logged the message, Master [21:04:55] mark has stated that if we have no good reason not to, we should use lvm in normal use cases [21:05:06] so in the event of bad shit, disk filling, etc, we can expand and buy time. [21:05:14] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 2 processes with command name varnishncsa [21:05:34] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 2 processes with command name varnishncsa [21:05:55] hrmm [21:05:59] i dislike raid1-lvm [21:06:12] as it isnt at all like lvm.cfg [21:06:13] wtf. [21:06:30] oh well [21:06:47] cmjohnson1: So if you look at raid1-lvm, most of the newer misc servers use that when they dont do hw raid [21:07:12] so i would use raid1-lvm [21:07:15] not raid1 [21:07:33] but only so we have the lvm stuff for expanding rather than a set XFS partition (like raid1) [21:07:41] its also odd that raid1 uses xfs, oh well. [21:07:58] we tend to not use xfs unless we are dealing with a large filesystem, larger than what fits on two disks [21:08:01] lvm...is a good thing so why waste the space as xfs [21:08:18] or why use xfs at all when there isnt a needed usecase. [21:08:20] New patchset: Yurik; "* Optimized opera_mini ACLs and added more IPv4 ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63077 [21:08:35] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 2 processes with command name varnishncsa [21:08:43] Hey LeslieCarr, could you please check whether https://rt.wikimedia.org/Ticket/Display.html?id=4433 can be closed or not?
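To put RoanKattouw's description of replica selection (earlier in this exchange, before the partman thread) into configuration terms: the per-server entries MediaWiki reads carry a load weight and a lag ceiling, and those two knobs produce the behaviour he describes. A minimal sketch with placeholder hosts and credentials — not the actual db-eqiad.php, which expresses the same settings through the LBFactory configuration:

```php
<?php
// Sketch only: hostnames, credentials and weights are placeholders.
$wgDBservers = array(
    array(
        'host'     => 'db-master.example',   // master: all writes go here
        'dbname'   => 'wikidb',
        'user'     => 'wikiuser',
        'password' => 'secret',
        'type'     => 'mysql',
        'load'     => 0,                      // weight 0 keeps reads off the master
    ),
    array(
        'host'     => 'db-replica1.example',  // replica: serves reads by weight
        'dbname'   => 'wikidb',
        'user'     => 'wikiuser',
        'password' => 'secret',
        'type'     => 'mysql',
        'load'     => 100,
        'max lag'  => 30,   // replicas lagging more than this are skipped;
                            // if every replica is over the limit, MediaWiki
                            // switches the wiki to read-only mode
    ),
);
```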
[21:08:51] cool..so i will use raid1-lvm [21:09:34] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 2 processes with command name varnishncsa [21:09:34] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 2 processes with command name varnishncsa [21:10:06] drdee: leslie is out of office, conferences and travel and the like [21:10:12] so she may not be responsive on irc. [21:10:14] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 2 processes with command name varnishncsa [21:10:17] aight ty [21:10:17] (fyi) [21:10:24] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 2 processes with command name varnishncsa [21:10:25] quite welcome [21:10:48] New patchset: Yurik; "Added IPv6 CIDRs to opera mini IP block" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63226 [21:10:50] who else besides mark and LeslieCarr can confirm whether ticket https://rt.wikimedia.org/Ticket/Display.html?id=4433 can be closed? [21:11:10] it's about setting up ACL's around the analytics machines [21:11:14] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 2 processes with command name varnishncsa [21:14:44] RECOVERY - Varnish traffic logger on cp3004 is OK: PROCS OK: 2 processes with command name varnishncsa [21:14:45] RECOVERY - Varnish traffic logger on cp3009 is OK: PROCS OK: 2 processes with command name varnishncsa [21:15:04] RECOVERY - Varnish traffic logger on cp3010 is OK: PROCS OK: 2 processes with command name varnishncsa [21:15:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [21:15:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [21:15:37] Anybody wants to merge a few extra IP (v4 only) ranges to opera mini - https://gerrit.wikimedia.org/r/#/c/63077/ thx! [21:16:11] users are seeing big red warnings without that [21:24:04] New patchset: Cmjohnson; "updating netboot.cfg with recipe for wtp1005-29" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63227 [21:26:04] New patchset: Dzahn; "decom bellin and blondel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63228 [21:26:23] !log Running listing based copy-sync scripts in terbium for all originals+transcodes on all wikis for swift->ceph to fix any outdated/missing files [21:26:31] Logged the message, Master [21:26:46] !log decom of bellen/blondel : setting network ports to disabled [21:26:54] Logged the message, RobH [21:27:04] New patchset: Cmjohnson; "updating netboot.cfg with recipe for wtp1005-29" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63227 [21:31:04] New patchset: Dzahn; "decom bellin and blondel and remove them from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63228 [21:33:51] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63227 [21:34:44] PROBLEM - NTP on db26 is CRITICAL: NTP CRITICAL: No response from NTP server [21:36:25] come on jenkins, now you say you verified it and at the same time you are telling me it needs Verified [21:37:02] ah, the first one was still for PS1 [21:37:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63228 [21:37:13] that makes more sense. 
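The payoff RobH is pointing at with raid1-lvm above is that an LVM-backed filesystem can be grown on a live host if a partition starts filling up, which is the "expand and buy time" option a fixed raid1/XFS layout does not give you. A rough sketch of that operation, with illustrative volume names and size, assuming the usual md RAID1 → volume group → ext4 root layout:

```sh
# Sketch only: VG/LV names and the size are illustrative.
vgs                              # check how much free space the volume group still has
lvextend -L +10G /dev/vg0/root   # grow the logical volume by 10 GB
resize2fs /dev/vg0/root          # grow ext3/ext4 online to fill the enlarged LV
# an XFS filesystem would use xfs_growfs instead of resize2fs
```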
[21:37:42] yea, if you create a second patch set in less time than it needs to validate the first :p [21:39:03] !log switch port disabled for blondel & bellin [21:39:11] Logged the message, RobH [21:40:04] PROBLEM - Host blondel is DOWN: PING CRITICAL - Packet loss = 100% [21:41:04] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [21:42:58] New patchset: Cmjohnson; "Fixing error on wtp1016 entry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63229 [21:45:01] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63229 [21:45:23] !log shutdown -h blondel .. bye [21:45:34] Logged the message, Master [21:54:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [22:01:58] RECOVERY - Puppet freshness on neon is OK: puppet ran at Fri May 10 22:01:47 UTC 2013 [22:03:19] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [22:05:39] PROBLEM - Disk space on virt1005 is CRITICAL: NRPE: Command check_disk_space not defined [22:05:59] PROBLEM - Disk space on ms2 is CRITICAL: NRPE: Command check_disk_space not defined [22:27:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:28:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [22:47:59] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [22:48:55] !log pgehres synchronized php-1.22wmf3/extensions/CentralAuth/ 'Small little update to maintenance script' [22:49:03] Logged the message, Master [22:55:28] New patchset: Andrew Bogott; "Added manifest for rt4 running with Apache." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63213 [23:19:51] !log DNS update - remove various old Tampa hosts [23:20:00] Logged the message, Master