[00:00:36] RECOVERY - Disk space on search1004 is OK: DISK OK [00:01:24] if you submit the same diff/comment on the same parent commit repeatedly is it deterministic? i.e. does the diff/comment/parent combo accurately predict if the hook will fail? [00:01:36] generally, yes [00:01:37] s/accurately/consistently/ [00:02:09] oh [00:02:14] and that's true for both commits that fail and those that succeed? [00:02:15] RECOVERY - Disk space on ms-be4 is OK: DISK OK [00:02:15] this has nothing to do with the comments and such [00:02:21] the comment hooks work fine [00:02:24] RECOVERY - DPKG on ms-be4 is OK: All packages OK [00:02:27] it's the patchset-created hook that fails [00:02:42] RECOVERY - RAID on ms-be4 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [00:03:28] New patchset: Ryan Lane; "Test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2951 [00:03:36] RECOVERY - RAID on search1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:04:21] RECOVERY - DPKG on search1005 is OK: All packages OK [00:04:57] RECOVERY - Disk space on search1005 is OK: DISK OK [00:05:37] !log put ms-be2 into rotation as a new production swift backend storage node [00:05:40] Logged the message, Master [00:05:57] it may be that I'm never closing my connection [00:06:02] err [00:07:12] RECOVERY - NTP on search1003 is OK: NTP OK: Offset -0.008622169495 secs [00:07:39] RECOVERY - RAID on search1006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:08:24] RECOVERY - Disk space on search1006 is OK: DISK OK [00:08:24] RECOVERY - DPKG on search1006 is OK: All packages OK [00:10:30] RECOVERY - DPKG on search1007 is OK: All packages OK [00:11:42] RECOVERY - RAID on search1007 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:11:42] RECOVERY - Disk space on search1007 is OK: DISK OK [00:11:42] RECOVERY - NTP on search1005 is OK: NTP OK: Offset -0.03319811821 secs [00:12:00] RECOVERY - NTP on search1004 is OK: NTP OK: Offset 0.006244897842 secs [00:12:54] RECOVERY - NTP on ms-be4 is OK: NTP OK: Offset -0.01974022388 secs [00:13:03] RECOVERY - RAID on search1009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:13:30] RECOVERY - Disk space on search1009 is OK: DISK OK [00:13:48] RECOVERY - DPKG on search1009 is OK: All packages OK [00:17:51] RECOVERY - NTP on search1007 is OK: NTP OK: Offset 0.004569292068 secs [00:17:51] RECOVERY - NTP on search1006 is OK: NTP OK: Offset -0.0298012495 secs [00:18:27] New patchset: Ryan Lane; "Test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2951 [00:19:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:21:10] !log put ms-be3 into rotation as a new production swift backend storage node [00:21:13] Logged the message, Master [00:23:24] RECOVERY - NTP on search1009 is OK: NTP OK: Offset 0.02111792564 secs [00:23:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.268 seconds [00:27:09] RECOVERY - RAID on search1010 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:27:15] !log put ms-be4 into rotation as a new production swift backend storage node [00:27:18] RECOVERY - Disk space on search1010 is OK: DISK OK [00:27:18] Logged the message, Master [00:27:27] RECOVERY - DPKG on search1010 is OK: All packages OK [00:29:30] maplebed: :) [00:29:42] yup. [00:29:42] RECOVERY - Disk space on search1011 is OK: DISK OK [00:29:50] it'll be hours before the rings are balanced again though. 
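A minimal sketch of how the ring balancing mentioned above can be inspected, assuming the stock swift-ring-builder CLI and the /etc/swift builder paths that appear later in this log (output details vary by swift version):
# with no subcommand the builder prints its device table and the current balance figure
swift-ring-builder /etc/swift/object.builder
# once min_part_hours has elapsed, rebalance reassigns partitions; background replication then moves the data, which is the part that takes hours
swift-ring-builder /etc/swift/object.builder rebalance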
[00:29:51] RECOVERY - DPKG on search1011 is OK: All packages OK [00:30:13] sure [00:30:54] RECOVERY - RAID on search1011 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:31:59] maplebed: i'm reading a sci fi book right now were everyone lives on rings … and there's one where they have to rebalance the rings as they start to fall apart [00:32:28] can't they just run world-ring-builder --rebalance? [00:32:33] :) [00:33:09] RECOVERY - RAID on search1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:33:18] RECOVERY - DPKG on search1012 is OK: All packages OK [00:33:45] RECOVERY - Disk space on search1012 is OK: DISK OK [00:33:45] RECOVERY - NTP on search1010 is OK: NTP OK: Offset -0.01545226574 secs [00:35:03] New patchset: Ryan Lane; "Test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2951 [00:35:41] * Ryan_Lane groans [00:35:45] damn you paramiko [00:36:54] RECOVERY - RAID on search1013 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:37:12] RECOVERY - NTP on search1011 is OK: NTP OK: Offset 0.02468931675 secs [00:37:30] RECOVERY - DPKG on search1013 is OK: All packages OK [00:37:57] RECOVERY - Disk space on search1013 is OK: DISK OK [00:40:12] RECOVERY - NTP on search1012 is OK: NTP OK: Offset 0.02842032909 secs [00:41:33] RECOVERY - DPKG on search1015 is OK: All packages OK [00:41:33] RECOVERY - RAID on search1015 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:42:00] RECOVERY - Disk space on search1015 is OK: DISK OK [00:42:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2951 [00:43:00] \o/ [00:43:57] RECOVERY - NTP on search1013 is OK: NTP OK: Offset 0.03063797951 secs [00:44:51] RECOVERY - Disk space on search1016 is OK: DISK OK [00:44:51] RECOVERY - RAID on search1016 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:45:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2952 [00:45:27] RECOVERY - DPKG on search1016 is OK: All packages OK [00:46:31] huh.. lint check but no preceding new patchset message [00:47:00] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2952 [00:47:02] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2952 [00:49:09] yeah. weird [00:49:27] that's obviously something I've screwed up. heh [00:50:03] New patchset: Ryan Lane; "Fixing lint check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2953 [00:50:13] nah. it's there [00:50:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2953 [00:50:28] damn it. [00:50:31] it depends on another change [00:50:42] RECOVERY - NTP on search1015 is OK: NTP OK: Offset -0.01510822773 secs [00:51:22] New patchset: Ryan Lane; "Fixing lint check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2953 [00:51:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2953 [00:51:37] better [00:51:44] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2953 [00:51:47] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2953 [00:52:04] Change abandoned: Ryan Lane; "Was just a test change." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2951 [00:53:02] now to make sure it works :) [00:54:09] RECOVERY - NTP on search1016 is OK: NTP OK: Offset 0.03278541565 secs [00:59:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:03:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.268 seconds [01:05:36] New patchset: Ryan Lane; "Test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2954 [01:05:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2954 [01:08:42] RECOVERY - SSH on ms-be5 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [01:13:17] cool. hooks and lint checks are working properly again [01:20:51] RECOVERY - Puppet freshness on ms-be5 is OK: puppet ran at Wed Mar 7 01:20:37 UTC 2012 [01:23:24] RECOVERY - DPKG on ms-be5 is OK: All packages OK [01:24:27] RECOVERY - Disk space on ms-be5 is OK: DISK OK [01:24:54] RECOVERY - RAID on ms-be5 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [01:31:37] heads up .. about to import a 1.3G file on commons via fenari [01:31:48] RECOVERY - NTP on ms-be5 is OK: NTP OK: Offset 0.01478517056 secs [01:31:57] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [01:39:04] Z [01:39:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:40:03] Eloquence: it may be better to use /var/tmp on fenari instead of home for larger files, made me think of the mail "[Private-l] Avoiding NFS slowness" [01:41:47] does not seem like there is a problem right now, though [01:45:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.089 seconds [01:49:34] mutante, thanks, we moved it over anyway due to permission issue .. but looks like we're running into an issue (file size limit) with the new file backend code which we'll try to resolve tomorrow [01:54:20] Yes, please use /tmp or /var/tmp [01:54:25] Don't download large files onto /home [02:19:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.340 seconds [03:06:00] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours [03:15:22] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [03:15:22] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [04:47:01] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [04:53:58] !log added ms-be5 drives to swift cluster [04:54:01] Logged the message, Master [04:56:01] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [04:56:26] mutante: where's the best place to watch these new spindles? ganglia? [04:56:44] yeah, ganglia it should be [04:57:21] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&c=Swift+pmtpa&h=Swift+pmtpa+prod&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [04:58:11] hmm, thats the one maplebed mentioned last , because of this change "The graphs previously had been measuring the number of objects created (or destroyed) every 30 seconds. I changed them to normalize to per-second to match the rest of the swift graphs. 
" [04:59:54] i'm not a big fan of having this fake host in ganglia to do cluster-wide stats. or at least that's what i assume it is [05:04:26] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [05:04:26] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [05:05:26] hrmmmm... i see no monitoring at all for some of the swift services [05:06:07] i.e. to make sure the container server and account server &c are actually running where they should be [05:06:30] also, capacity/usage is graphed but maybe should be monitored too? [05:09:19] jeremyb: you're probably right.. let me add a ticket for that and i'll copy/paste your suggestions [05:09:52] hah, https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=templates/nagios/checkcommands.cfg.erb#l293 [05:10:38] that means [[file:Little_kitten_.jpg]] can never be deleted at commons [05:10:48] ;) [05:10:52] i see cronjobs like this [05:10:55] "Cron /usr/local/bin/swift-ganglia-report-global-stats " [05:11:30] heh, yea, kitten needs to live forever,, but who would delete a kitten ;) [05:11:52] mutante: all these new ms-be's have account+container running? [05:12:50] account, object and container [05:13:01] speaking about ms-be5 [05:13:08] but i suppose all of them ,yeah [05:13:19] yeah, well would be kinda pointless without object ;) [05:13:22] i just did this the first time ever myself:) [05:13:31] following maplebeds instructions [05:13:37] just to get this last box up as well [05:14:08] http://wikitech.wikimedia.org/view/Swift/How_To#Add_a_device_.28drive.29_to_a_ring [05:15:06] i added all drives (sda4,sdb4 and sdc1 to sdl1) to all three rings [05:15:58] and since after that the builder files are rsynced to all other nodes.. they should be all the same..yep [05:18:33] > for node in ms-fe2 ms1 ms2 ms3 ms-be1 ms-be5 ; [05:18:41] http://wikitech.wikimedia.org/view/User:Dzahn/swift-add [05:18:45] yea [05:18:46] where are ms-be{2..4}? [05:19:33] they have just been built earlier today [05:19:55] mutante: i mean in the rsync list... [05:20:10] yes [05:20:28] they werent done when these docs were writen i suppose [05:20:41] mutante: they're older than be5 and be5 is in the list... [05:21:01] * jeremyb doesn't follow [05:22:45] i dont know, they should be in the list by now , right [05:23:44] oh, yeah, thats cause ms-be5 was up before, but then went down again [05:23:52] oh [05:23:54] and then was reinstalled.. [05:24:23] what do ya'll think about having a single designated host (fenari?) where all ring operations are done and then you have a sync script on each host that updates it's local rings from the rw host. then you can dsh from a swift dblist to run the sync script everywhere (after an operation) [05:25:11] sounds reasonable [05:26:22] idk what standard practice is for swift but maybe even put the rings in version control [05:26:52] i wonder what ring size you used [05:27:44] mark suggested we put the ring files in puppet too. [05:28:18] I've been doing the ring operations on ms-fe1 because the host has to have the swift packages. fenari doesn't. [05:29:03] maplebed: yeah, i was just reading the docs and assuming these ops were done right on the new machine that was being added [05:29:06] I set ring size to 16 [05:29:07] (http://wikitech.wikimedia.org/view/Swift/Setup_New_Swift_Cluster) [05:29:20] ah, hi maplebed [05:29:24] hi! [05:29:29] thanks for setting up ms-be5! 
[05:29:32] i was about to say, please lets send all this to him or the list as well:) [05:29:44] sadly I think I want it in a different zone. [05:30:04] maplebed: so we should rsync to ms-be-2,3,4 as well, right? [05:30:16] IIRC ms-be1 is already in zone 5 and I want one zone per rack. [05:30:35] yeah... the rsync command should be more clearly indicated that it's only an example and that the files need to go to all swift hosts. [05:30:42] maplebed: yeah, reading that page now [05:30:53] maplebed: be-1 is zone 4 [05:30:54] the docs shouldn't need to maintain a full list of the hosts in the cluster; it's redundant and will be out of date. [05:31:05] oh. is ms-be2 in zone 5 then? [05:31:09] yes [05:31:12] eh, no, wait [05:31:36] ms-be2 is zone 5. [05:31:43] right, then you probably want zone 8 for ms-be-5 [05:31:52] (from the output of swift-ring-builder /etc/swift/account.builder | less) [05:31:54] i wasnt even sure if they are each in their own rack [05:32:14] I checked racktables a while ago; I'm pretty sure they're all different. [05:33:06] robh didn't want to give me a full rack separation but he wound up finding space. [05:33:22] maplebed: i'm not sure where puppet fits in but certainly some of this should be made into bash scripts instead of just sitting on the wiki [05:33:59] I wound up just running things like this on the command line: [05:33:59] for dev in sd{a..b}4; do swift-ring-builder /etc/swift/account.builder add z${zone}-${ip}:6002/${dev} $weight; swift-ring-builder /etc/swift/container.builder add z${zone}-${ip}:6001/${dev} $weight; swift-ring-builder /etc/swift/object.builder add z${zone}-${ip}:6000/${dev} $weight; done [05:34:05] obviously suboptimal. [05:34:21] mutante just wrote a bash script to do all the devices for a new ms-be host. [05:34:22] well, i just rsynced to be2,3,4 now as well [05:35:23] that script could probably also find out the device names, and the weight to be used, based on their size..all by itself [05:35:40] not sure about the zone.. [05:35:52] maplebed: can you say some more about how the ring would fit in puppet? [05:36:08] mutante: are you sure you added ms-be5 to the rings? [05:36:58] maplebed: i can see the devices from ms-be1, searching the IP of ms-be5 [05:37:10] jeremyb: after running commands to adjust the rings wherever you need to run them (probably on some host in the cluster) you copy them into puppet instead of rsyncing them to the other hosts. Puppet then deploys the files to all the hosts in the cluster. [05:37:14] @ms-be1:~# swift-ring-builder /etc/swift/container.builder search 10.0.6.204 [05:37:47] I think you may have missed updating ms-fe1. [05:38:30] oh, most likely yes, it wasnt in the rsync example [05:38:30] ms-fe1 is also missing from the list i copied in here from the wiki [05:38:42] gotcha,, doing it now [05:38:46] because that's where I was running the commands. [05:38:47] :( [05:39:03] (so that was the source) [05:39:04] mutante: did you do yours on ms-be5? or where? [05:39:09] yes [05:39:09] that part of the doc clearly needs improving. [05:39:10] ;) [05:39:44] mutante: you must have copied the ring files onto ms-be5 from somewhere first, though, right? [05:40:04] yes i copied them from be1, before i could add devices [05:40:38] added the devices, then rsynced from be-5 back to the others [05:40:48] cool. [05:41:34] there, ms-fe1 done [05:41:51] thankfully it's ok for the ring files to be different on different hosts (for a while) because of the algorithms swift uses to figure out which bits are where. 
At least that's what the docs claim. [05:42:14] ok, good [05:42:27] mutante: feel free to update the docs as you see fit; I'll look at them a bit tomorrow as well. [05:42:46] alright [05:42:53] seems like the bits that need improving are: [05:43:05] * make sure different racks get different zones [05:43:21] * copy the files to all hosts in the cluster (or maybe we should just start putting them in puppet now) [05:43:38] * the ring-building stuff can be done on any node, not necessarily the one you just built [05:43:52] looks like the whole part involving parted/mkpart/mkfs is not necessary anymore.. puppet did it [05:44:01] I don't think so. [05:44:12] I think the partition map didn't get fully nuked when the host was rebuilt. [05:44:17] mark built ms-be5 once already [05:44:18] i saw puppet creating the xfs filesystems on first run [05:44:34] but on reboot it went back to pxe and broke everything. [05:44:39] the partitions map is another thing.. alright.. yea [05:45:04] I say this because ms-be2 and ms-be3 needed the partitions created and the filesystem made but ms-be4 (and ms-be5) didn't. [05:45:18] so clearly there's something wonky going on. [05:45:33] we're supposed to get more hardware delivered this week; we can see what happens on those hosts. [05:45:37] makes sense..yeah.. ms-be5 was at a "no root filesystem defined" prompt [05:45:51] this time it did not break on reboot though.. i tested..at least once [05:45:58] good. [05:46:16] all the swift stuff should start on boot, so I like to have the hosts rebooted as the last thing before putting them into service. [05:46:20] just to make sure. [05:46:21] after installing the new kernel i had a reason anyways [05:46:33] ok, I gotta bail. [05:46:34] note: be1 has 1GB swap, be-5 has 4GB [05:46:34] thanks again! [05:46:41] also taking a break.. yw [05:46:43] ttyl [05:46:45] how about be2-4? [05:46:53] did not check yet [05:46:55] ms-be1 was made by hand by robh. [05:47:00] it's probably the most different. [05:47:06] i think 4, the docs also say we want 4 [05:47:09] yep [05:47:28] (honestly, if we hit swap the host is fucked, so I can't imagine it really matters too much...) [05:47:35] * mutante nods [05:47:47] ok. g'night! [05:47:48] ok, cya later [05:47:49] night [06:11:50] New patchset: Dzahn; "add missing locale pl_PL.UTF-8 to fix broken pl.planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2955 [06:12:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2955 [06:12:37] New review: Dzahn; "pl.planet needs this. RT 2416" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2955 [06:12:40] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2955 [06:27:54] mutante: should i edit your swift-add page directly or just tell you what i think should be changed? [06:27:57] or [06:27:59] ? [06:30:07] jeremyb: go ahead and edit, there is always history [06:31:03] how certain you are (that history will always be there) ;) [06:31:20] jeremyb: about something else, gerrit change 2012, you said back then you just repushed it for the logo change, but that has already been done in another change. can it be abandoned then? [06:31:49] i have to open it to look [06:32:00] !change 2012 | jeremyb [06:32:00] jeremyb: https://gerrit.wikimedia.org/r/2012 [06:33:50] jeremyb: re: the swift-add script. even better if we jus did that via gerrit as well. 
i'll just have to check out first, if that new repo for "tools/scripts" exists already [06:34:17] eh, "project" i should say [06:34:20] mutante: idk what's up with any of those. well... now i see gerrit says 2 are in the production branch but i didn't know until just now. but http://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/ still looks wrong [06:34:35] mutante: well i'm sleeping soon. should i hold off? [06:35:16] no rush, we'd just like to clean up a bit in the gerrit queue [06:35:40] i mean should i not edit the wiki copy? [06:36:10] mutante: about /r/2012 in particular: i'm not about to edit war over it. i still think it's the right change to make but if no one agrees then no one agrees [06:36:56] if you want to edit it, go ahead.. or use the Tlak: page for your version, or wait :) [06:37:00] Talk: [06:37:16] ok, i'm editing [06:37:35] you can always import an old version and i can submit a change to match the new version [06:38:34] jeremyb: i'm not sure i even have an opinion about changing link to enter_bug.cgi. I'm just pointing out the logo change has already happened now. so this would not merge anyways [06:38:47] alright [06:39:10] mutante: and i'm pointing out it doesn't look changed... [06:39:24] "but http://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/ still looks wrong" [06:39:51] it's also not ssl'd [06:40:00] but https://gerrit.wikimedia.org/r/#patch,sidebyside,2013,3,files/svn/viewvc.conf [06:40:37] mutante: but see my link? ;) [06:41:12] i suggest you create a new change then [06:42:14] to do what? [06:42:27] maybe puppet hasn't run on the box? [06:42:35] fix the logo? i dont know, you said it still looks wrong [06:43:05] that link doesnt show either, not the old and not the new logo.. i have no idea right now [06:43:53] mutante: you sure it doesn't show the old logo? looks the same as months ago to me [06:43:53] dont worry about it now, or comment on the gerrit change [06:44:34] mutante: i have to assume there's been no puppet run there. or alternatively there's something wrong with the puppet config for that service or with the way i made the change [06:45:24] checking if puppet ran [06:45:57] puppet ran at Wed Mar 7 06:26:00 UTC 2012 [06:46:09] what host is this? [06:46:14] formey [06:48:28] ok, i see the old logo in viewvc.conf on the host [06:48:42] src="https://donate.mozilla.org/page/-/bugzilla.png" [06:49:04] yay! i was getting ready to ask for an apache restart ;P [06:49:21] but puppet ran, so what is most likely missing is that puppet has not been told to actively put this file in the /etc/viewvc/ dir [06:52:45] it is defined in class viewvc in svn.pp, and it says it requires svn::server [06:53:01] but in site.pp formey just includes svn::server and not viewvc .. [06:53:58] seems like it should be the other way around, svn::server requiring viewvc [06:55:08] would be nice if there was some notification when a change was merged to an additional branch. so that i knew when it went to production so i could try to verify it took then [06:55:23] or is it svn::server::viewvc already, but then why would it require svn::server again in that place [07:02:51] PROBLEM - Puppet freshness on search1020 is CRITICAL: Puppet has not run in the last 10 hours [07:03:46] jeremyb: gerrit sends mail, you can "watch projects", Settings -> Watched projects, email notifications e.g. "Gerrit-MessageType: merged" [07:03:49] bbiaw [07:04:23] mutante: i meant specifically for cahnges you were a part of (or watching). 
or at least for ones you submitted [07:04:29] changes* [07:09:28] i guess w=while? [07:09:46] anyway, no rush, just a mystery ;) [07:27:36] RECOVERY - Disk space on ms1004 is OK: DISK OK [07:31:33] New patchset: Hashar; "allow hashar on formey host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2821 [07:31:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2821 [07:39:46] New review: Dzahn; "your change seems fine, just makes me wonder about the sudo users that don't have an account then. (..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/2821 [07:42:45] PROBLEM - Puppet freshness on search1019 is CRITICAL: Puppet has not run in the last 10 hours [08:24:03] New review: Hashar; "I have no idea. I guess the accounts were added manually over time." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2821 [08:40:53] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:50] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [08:47:11] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours [09:15:09] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:17:06] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [11:33:12] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [13:00:05] !log updated mwlib to 0.13.6 [13:00:07] !log updated mwlib to 0.13.6 [13:00:07] !log updated mwlib to 0.13.6 [13:00:09] Logged the message, Master [13:00:12] Logged the message, Master [13:00:18] Logged the message, Master [13:01:59] New review: Mark Bergsma; "I think it would be better to make a list of Nagios/Icinga monitoring IPs in a class (icinga::config..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2934 [13:03:37] New patchset: Mark Bergsma; "There is one in network.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2956 [13:03:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2956 [13:07:33] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours [13:16:38] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [13:16:38] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [13:28:30] !log Removed torrus from streber [13:28:34] Logged the message, Master [13:34:53] New patchset: Mark Bergsma; "Remove csw5-pmtpa from monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2957 [13:35:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2957 [13:35:57] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2957 [13:36:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2957 [13:43:55] New review: Mark Bergsma; "Please fix indentation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2936 [13:47:05] New review: Mark Bergsma; "Do you see that FIXME above the code you just copy pasted? 
:)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2936 [13:51:52] New review: Mark Bergsma; "So why are these files owned by rainman if they're managed by Puppet? rainman can't really edit them..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2947 [14:10:08] New patchset: Hashar; "allow hashar on formey host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2821 [14:10:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2821 [14:11:49] New review: Demon; "Per IRC discussion--doesn't need an account created, just needed the default shell adjusted in LDAP ..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2821 [14:34:26] Change abandoned: Hashar; "This is now unneeded. Just had to be allowed in the LDAP directory :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2821 [14:48:03] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [14:57:03] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [15:06:03] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [15:06:03] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [15:17:55] RECOVERY - Puppet freshness on search1019 is OK: puppet ran at Wed Mar 7 15:17:47 UTC 2012 [15:20:09] RECOVERY - Disk space on search1019 is OK: DISK OK [15:21:03] RECOVERY - RAID on search1019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:21:30] RECOVERY - Puppet freshness on search1020 is OK: puppet ran at Wed Mar 7 15:21:23 UTC 2012 [15:21:48] RECOVERY - DPKG on search1019 is OK: All packages OK [15:22:15] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.030 second response time on port 8123 [15:23:36] RECOVERY - Disk space on search1020 is OK: DISK OK [15:24:12] RECOVERY - RAID on search1020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:25:24] RECOVERY - DPKG on search1020 is OK: All packages OK [15:26:36] RECOVERY - Lucene on search1003 is OK: TCP OK - 0.027 second response time on port 8123 [15:28:06] RECOVERY - Lucene on search1006 is OK: TCP OK - 0.027 second response time on port 8123 [15:28:33] RECOVERY - Lucene on search1004 is OK: TCP OK - 0.027 second response time on port 8123 [15:28:42] RECOVERY - Lucene on search1005 is OK: TCP OK - 0.027 second response time on port 8123 [15:28:51] RECOVERY - Lucene on search1007 is OK: TCP OK - 0.027 second response time on port 8123 [15:29:54] RECOVERY - Lucene on search1012 is OK: TCP OK - 0.026 second response time on port 8123 [15:31:19] RECOVERY - Lucene on search1010 is OK: TCP OK - 0.026 second response time on port 8123 [15:31:19] RECOVERY - Lucene on search1011 is OK: TCP OK - 0.027 second response time on port 8123 [15:31:19] RECOVERY - Lucene on search1013 is OK: TCP OK - 0.027 second response time on port 8123 [15:31:19] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123 [15:31:19] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123 [15:33:43] RECOVERY - NTP on search1019 is OK: NTP OK: Offset 0.03436875343 secs [15:35:40] RECOVERY - NTP on search1020 is OK: NTP OK: Offset 0.02554428577 secs [15:37:52] New patchset: Pyoungmeister; "that's a bit better" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2958 [15:38:03] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2958 [15:40:23] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2958 [15:40:23] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2958 [15:40:55] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=62%): /var/lib/ureadahead/debugfs 284 MB (3% inode=62%): [15:40:55] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:40:55] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 179 MB (2% inode=61%): /var/lib/ureadahead/debugfs 179 MB (2% inode=61%): [15:40:55] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=62%): /var/lib/ureadahead/debugfs 0 MB (0% inode=62%): [15:53:28] robh: can we swap db22 disk today? [15:53:45] !rt 2497 [15:53:45] https://rt.wikimedia.org/Ticket/Display.html?id=2497 [15:54:21] prolly, lemme check it out [15:56:02] cmjohnson1: Ok, it is a slave on s4 now [15:56:23] so we can attempt a hot swap. I am going to go ahead and tell it to identify drives 9 and 11, since drive 10 is the bad one, it cannot accept the identify command [15:56:27] it seems, or its just not working [15:56:49] RECOVERY - Disk space on srv224 is OK: DISK OK [15:56:58] RECOVERY - Disk space on srv219 is OK: DISK OK [15:56:58] RECOVERY - Disk space on srv220 is OK: DISK OK [15:56:58] RECOVERY - Disk space on srv223 is OK: DISK OK [15:57:08] cmjohnson1: actually, going to try to ID drive 10 now [15:57:21] i had bad syntax [15:58:02] bleh [15:58:03] The device specified does not exist. [15:58:28] cmjohnson1: ok, none of the damned id commands work. [15:58:33] Can you attempt to id drive 10? [15:58:40] i have the disk diagram up [15:58:43] (meaning from the layout on the top of the case [15:58:44] cool [15:58:59] just under dvd rom 2nd from left [15:59:01] so this starts at 0 [15:59:09] so if the diagram starts at one, this is drive 11 thats bad. [15:59:29] I want you to go ahead and find it, and then pull it, and do not replace it [15:59:36] i want to then run a scan and confirm we pulled the right driv e [15:59:38] http://docs.oracle.com/cd/E19121-01/sf.x4240/820-3835-14/820-3835-14.pdf [15:59:38] ok? 
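For reference, a hedged sketch of the scan step just described, assuming the x4240's RAID controller is the Adaptec-based one driven by arcconf (the "Device #10 ... State : Rebuilding" output further down looks like arcconf output); the controller number 1 and the channel/device numbers are placeholders:
# list physical devices and their states to confirm which slot reports the failure
arcconf getconfig 1 pd
# blink the locate LED on one device (form: identify <controller> device <channel> <id>)
arcconf identify 1 device 0 10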
[16:00:14] it runs backwards....0 is the most bottom on left side working its way up to 3 [16:00:57] !log pulling disk 10 from db22 [16:01:00] Logged the message, Master [16:01:45] robh: ok [16:02:00] that appears to be the correct drive [16:02:03] go ahead and replace =] [16:02:23] !log replacing hdd for disk 10 on db22 [16:02:26] Logged the message, Master [16:03:08] robh: fyi that was the last of replacement disks i have for the x4240's [16:03:20] good, that means we can stop using the damned servers ;] [16:03:46] i also need to bring down db59 to add the cards [16:04:25] lemme know if disk 10 is rebuilding [16:04:39] cmjohnson1: its not powered on, you can bring down db59 whenever [16:04:47] will check shortly, on call with dell [16:05:02] okay [16:06:54] cmjohnson1: confirmed, rebuilding [16:07:04] Device #10 [16:07:04] Device is a Hard drive [16:07:05] State : Rebuilding [16:07:05] Supported : Yes [16:09:16] RECOVERY - Lucene on search1020 is OK: TCP OK - 0.027 second response time on port 8123 [16:09:16] RECOVERY - Lucene on search1019 is OK: TCP OK - 0.026 second response time on port 8123 [16:10:36] cool...updating ticket [16:15:35] Do I dare ask why /var/log/mw/fatal.log isn't logrotated? It's currently 6.7G on fenari... [16:16:14] lol, there's an RT ticket for it [16:52:54] RECOVERY - RAID on db22 is OK: OK: 1 logical device(s) checked [17:03:47] meh, sync-apache + apache-graceful-all routinely fail to keep servers in sync [17:06:36] apergos: are you about? Could you put the 1.19.0beta1 files onto download.wm.o for me please? [17:07:00] hello [17:07:20] where are they? [17:07:22] http://noc.wikimedia.org/~reedy/upload-1.19.0beta1.tar [17:08:35] or /home/reedy/public_html/upload-1.19.0beta1.tar [17:08:57] not gz? I'll fix that. but you still want to provide a sig file too [17:09:01] of the gz [17:09:07] so you might as well do it yourself [17:09:41] following the example of: mediawiki-1.18.0beta1.tar.gz mediawiki-1.18.0beta1.tar.gz.sig [17:09:49] yeah, extract that file [17:09:55] ah [17:09:55] 1.19 dir, and the tar.gz/sig in there [17:11:55] check that [17:12:04] (I had already maded the dir and stuff :-D) [17:12:10] heh [17:12:13] thanks [17:12:22] you can grab them ok? [17:12:42] yeah, just downloading now [17:13:03] http://www.mediawiki.org/wiki/Download [17:13:09] better update [17:13:44] ok, back to playing with gdb [17:14:01] yup, email to send etc now [17:14:25] cool [17:44:40] robh: the I/O fusion cards...when does 30 days begin? today ...day of shipping or day of arrival? [17:45:08] it started the day they shipped [17:45:18] so whatever it says on packing slip, which should be scanned into the procurement ticket [17:45:30] k [17:45:41] then since its at your center, i am leaving it to you to head up that they go back on time, so please check in with asher regularly [17:46:15] i will...set a reminder to send them back! [18:10:45] ok, cmjohnson1 i am reviewing your email about the two sq servers [18:10:59] ok [18:10:59] I know you cannot do what I am about to yet without root, but I am going to atleast tell ya in here what i am doin [18:11:02] fyi and all [18:11:36] Ok, so I am NOT going to decomm them until you do some hardware tests to confirm its bad controller [18:11:49] but I will go ahead and do a clean shutdown on both. 
Squid is nice in that you do not need to manually depool anything [18:11:50] ok [18:11:55] pybal is smart, and it handles that stuff [18:12:03] pybal being the load balancing agent that mark wrote [18:12:36] !log shutting down sq38 and sq46 per rt 2581 for testing [18:12:39] Logged the message, RobH [18:12:50] heh, i may not be able to shut them down, they are hung [18:13:17] cmjohnson1: yea, they are hung, go ahead and manually reboot and test (i expect if you let them post the errors will scroll by) [18:13:43] ok...i will ping you in a few [18:13:50] basically i tried to ssh in, ssh is down, and sq38 isnt taking drac [18:13:56] checking sq46 drac now [18:14:23] sq38 is already listed for decom [18:14:25] meh, drac needs reset on 46, just feel free to down both manually and test [18:14:34] cool, makes this easier [18:14:42] i have removed all cabling [18:14:56] ok, if sq46 is also dead, comment in ticket and assign to me [18:15:04] ok [18:15:11] then i need to put them in the decommission.pp in puppet, which when runs yanks those servers off monitoring and the like [18:21:14] New patchset: Bhartshorne; "Add 500 tracking to the Swift proxy logtailer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2960 [18:21:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2960 [18:22:03] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2960 [18:22:05] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2960 [18:30:33] New patchset: Bhartshorne; "Add 503 tracking to the Swift proxy logtailer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2961 [18:30:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2961 [18:31:07] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2961 [18:31:10] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2961 [18:32:59] cmjohnson1: ok, if sq46 is bad, you can unrack and pull drives for wipe [18:33:12] it doesnt need to be there for the rest of the decom process, i will take it from here, thanks for testing =] [18:33:19] ok [18:35:22] huh, seems someone, i imagine mark_ already added it to the decommission list in puppet [18:36:53] probably [18:36:56] !log pulled sq39 from text pybal config, pulled sq46 from upload pybal config [18:36:59] Logged the message, RobH [18:44:04] RobH: you were talking about 38 before but now !log'd 39? [18:44:25] i did the wrong thing, yea [18:44:33] !log correction sq39 [18:44:35] Logged the message, RobH [18:44:37] just on the log though [18:44:59] sq38 isnt on ticket, was transposing [18:45:03] thx for checkin though =] [18:45:30] jeremyb: though sq38 is also gone, so i couldnt have broken anthing =] [18:45:38] hah [18:49:10] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours [19:00:00] New patchset: Bhartshorne; "counting all swift hits by status code" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2962 [19:00:11] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2962 [19:00:24] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2962 [19:00:27] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2962 [19:23:09] does anyone know what this puppetbarf is about? "err: /Stage[main]/Mediawiki::Sync/Exec[mw-sync]: Failed to call refresh: Command exceeded timeout at /var/lib/git/operations/puppet/manifests/mediawiki.pp:24" [19:24:04] oh nm. i see, it's sync-common failing [19:27:24] New patchset: Bhartshorne; "corrected hits by status code to per-second instead of per-measurement interval. added percentage of hits by status code." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2963 [19:27:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2963 [19:27:46] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2963 [19:27:49] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2963 [19:27:58] !log manual apt-upgrade, puppetd --refresh, and repeat on srv265 because it was running on outdated apache config [19:28:01] Logged the message, Master [19:41:21] New patchset: Asher; "it's wikipedia zero time!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2964 [19:41:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2964 [19:44:07] New patchset: Bhartshorne; "changed the name of the ganglia logtailer from being proxy specific to being http specific since it works for the backend stoarge nodes too. moved the puppet config from proxy-specific to swift::base to include it on all nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2965 [19:44:19] New patchset: Bhartshorne; "changed the name of the ganglia logtailer from being proxy specific to being http specific since it works for the backend stoarge nodes too. moved the puppet config from proxy-specific to swift::base to include it on all nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2966 [19:44:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2965 [19:44:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2966 [19:44:38] New patchset: Asher; "it's wikipedia zero time!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2964 [19:44:49] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2964 [19:45:09] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2965 [19:45:12] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2965 [19:45:19] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2966 [19:45:22] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2966 [19:45:57] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2964 [19:46:00] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2964 [19:46:15] wheee gerrit spam!!! wheee!!!!!! [19:47:02] binasher: looks like I merged your change. [19:54:16] mutante: did we ever figure out svn? [19:54:30] he's probably not around right now. [19:54:35] but since you are... [19:54:49] you mentioned last night that you weren't too fond of the pseudo-host for cluster stats idea. [19:54:53] i can't keep everyone's TZs straight [19:55:03] maplebed: not that i know a better way! [19:55:18] ah, damn. [19:55:22] I thought you were going to come back with a suggestion. [19:55:35] maplebed: never set up ganglia myself [20:00:31] it's not.. [20:00:48] oh. I guess it is evening so people who aren't behind in their work in this time zone would be somewhere else [20:02:17] * jeremyb wonders who apergos is responding to [20:02:50] that mu tante isn't around right now [20:03:17] "it's not"? [20:03:42] we need a bot that reassembles nicknames for apergos [20:03:57] "mu tante" has to become "mutante" [20:04:02] !log deployed support for zero.wikipedia.org and carrier tagging to mobile varnish servers [20:04:05] I was going to say iit's not that late [20:04:06] Logged the message, Master [20:04:09] but it is, I guess [20:04:10] cuz it wasnt intentional so not to ping him? [20:04:16] (thats what i do) [20:04:25] I was deliberately not pinging, that's true [20:04:31] wasn't he pinged five minutes before? [20:04:34] apergos: ahh [20:04:37] and generally people like to see themselves in the context [20:04:53] I don't ping unless I meanit [20:05:01] preilly: right?! [20:05:01] however I am happy to make exceptions domas [20:05:11] what is 'ping'? [20:05:13] domas: i *was* trying to ping though [20:05:42] !log reverted no-pagecache rsync on search nodes - without corresponding index warmup in lsearchd, it just pushes back the pain a bit and does more harm than good [20:05:43] cause your irc client to whine at you that your name has been bandied about and that you should really look into it [20:05:45] Logged the message, Master [20:05:47] = "ping" [20:05:57] why would an irc client whine [20:06:10] *shrug*, it puts it in different color for me, I look at it eventually [20:07:24] domas: i think whine implies make noise [20:07:32] domas: what? [20:07:33] domas: do you not experience noise? [20:07:41] preilly: I wanted to know if you think the same! [20:07:55] it is only visual here, why would I ever want anyone to disturb my music? [20:09:29] preilly: don't you feel good that I picked you out?! [20:09:42] domas: not so much [20:09:45] :( [20:10:03] * preilly is waiting for it…. whatever it is... [20:10:17] for what? 
[20:10:24] picking [20:10:28] * preilly is confused  [20:10:31] we were discussing about highlighting on IRC [20:10:33] obviously [20:10:34] :) [20:17:26] !log yet another redirects.conf change, per RT#2498 redirect wikimedia.com-->wikimedia.org [20:17:29] Logged the message, Master [20:25:47] New patchset: Pyoungmeister; "required for catch-all term" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2967 [20:25:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2967 [20:26:27] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2967 [20:26:29] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2967 [20:30:37] hi domas [20:30:45] hey aude [20:31:03] domas: you've moved to DC, right? [20:31:06] not yet [20:31:09] I was there recently tho! [20:31:10] ah, okay :( [20:31:19] we have a meetup on saturday [20:31:33] RobH is welcome also [20:32:04] if i dont do sailing sounds cool =] [20:32:14] oooh sailing! [20:32:33] http://en.wikipedia.org/wiki/Wikipedia:Meetup/DC_28 [20:58:40] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused [21:15:28] RECOVERY - RAID on srv194 is OK: OK: no RAID installed [21:17:56] !log running apt upgrades and puppetd --test on srv194, srv197, srv203, srv212, srv213, srv230, srv244, srv245, srv252, srv282 and manually restarting nrpe because they're reporting funky in nagios [21:17:59] Logged the message, Master [21:22:40] RECOVERY - RAID on srv203 is OK: OK: no RAID installed [21:34:31] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [21:40:07] New patchset: Bhartshorne; "corrections to capture log lines from swift storage nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2968 [21:40:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2968 [21:40:49] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2968 [21:40:52] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2968 [21:43:13] lol nagios is from hell [21:43:22] * maplebed muted nagios yesterady. [21:43:58] it's alerting on half the world, strangly a lot of nrpe procs are owned by dzahn [21:44:59] 25404 root 20 0 1337m 1.2g 3060 R 100 31.6 291:33.65 puppet [21:45:00] awesome [21:45:52] what's the trick when the nagios UI gives you the "Whoops!" page? [21:52:08] ah, corrupt or incomplete nagios config. [21:59:11] PROBLEM - RAID on mw40 is CRITICAL: Connection refused by host [22:01:08] RECOVERY - RAID on mw40 is OK: OK: no RAID installed [22:04:35] RECOVERY - RAID on srv254 is OK: OK: no RAID installed [22:39:37] !log set swift weight for ms1 to 0 initiating the process to move data off the host in preparation for decomissioning it. [22:39:41] Logged the message, Master [22:44:20] robh: so ms4, memory all checks out. !rt 885 [22:44:31] !rt 885 [22:44:31] https://rt.wikimedia.org/Ticket/Display.html?id=885 [22:44:41] thx jeremyb...i was just going leave it alone [22:45:58] cmjohnson1: so do those slots that indicate fault not detect that memory? [22:46:09] that is what I believe [22:46:14] ie: memory is fine, bad mainboard? 
[22:46:29] yes...that is the conclusion [22:46:56] that sucks cos it is hard to get the motherboard [22:46:57] ok, so we have a ton of useless memory in a system with a bad mainboard [22:47:06] which is no longer under warranty [22:47:34] can we use the memory in any of the other ms's? [22:47:42] could sell the good parts on ebay ;-P [22:47:47] should be able to yes [22:47:48] and the disks [22:47:51] but still sucks =P [22:48:53] this would be more concerning if maplebed didnt already replace this hardware with swift ;] [22:49:07] robh: can we run just using the known working memory slots? [22:49:20] it will be slow but I think it is rather slow already [22:49:20] how many slots are bad? [22:49:26] I hate running bad hardware [22:49:33] if those slots are bad, who is to say what else is now bad [22:49:39] it could lead to unpredictable results [22:49:43] true ...but run it till it dies [22:49:50] i call this dead ;] [22:50:06] but, lets get mark_ to agree. taking ticket from you and assigning to him. [22:50:42] k [22:51:15] Please update the ticket with what slots are bad (the #'s and how many) cuz it matters, you have to populate the dimms in a specific order [22:51:25] so if the first to populate slots are dead, then it reall ysucks [22:52:21] yep...np [23:01:23] PROBLEM - Disk space on search10 is CRITICAL: DISK CRITICAL - free space: /a 5248 MB (3% inode=99%): [23:09:29] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours [23:18:29] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [23:18:29] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
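Closing note on the 22:39 !log above about setting ms1's swift weight to 0: a minimal sketch of that drain with the stock swift-ring-builder CLI, using a placeholder for ms1's IP as the search value; as discussed earlier in the log, the rebalanced ring files then have to be copied to every node in the cluster:
# weight 0 stops new partitions being assigned to the host and lets existing ones migrate away
swift-ring-builder /etc/swift/object.builder set_weight <ms1-ip> 0
swift-ring-builder /etc/swift/object.builder rebalance
# repeat for the account and container builders, then push the updated .ring.gz files to all hosts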