[00:00:36] RECOVERY - Disk space on search1004 is OK: DISK OK [00:01:24] if you submit the same diff/comment on the same parent commit repeatedly is it deterministic? i.e. does the diff/comment/parent combo accurately predict if the hook will fail? [00:01:36] generally, yes [00:01:37] s/accurately/consistently/ [00:02:09] oh [00:02:14] and that's true for both commits that fail and those that succeed? [00:02:15] RECOVERY - Disk space on ms-be4 is OK: DISK OK [00:02:15] this has nothing to do with the comments and such [00:02:21] the comment hooks work fine [00:02:24] RECOVERY - DPKG on ms-be4 is OK: All packages OK [00:02:27] it's the patchset-created hook that fails [00:02:42] RECOVERY - RAID on ms-be4 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [00:03:28] New patchset: Ryan Lane; "Test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2951 [00:03:36] RECOVERY - RAID on search1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:04:21] RECOVERY - DPKG on search1005 is OK: All packages OK [00:04:57] RECOVERY - Disk space on search1005 is OK: DISK OK [00:05:37] !log put ms-be2 into rotation as a new production swift backend storage node [00:05:40] Logged the message, Master [00:05:57] it may be that I'm never closing my connection [00:06:02] err [00:07:12] RECOVERY - NTP on search1003 is OK: NTP OK: Offset -0.008622169495 secs [00:07:39] RECOVERY - RAID on search1006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:08:24] RECOVERY - Disk space on search1006 is OK: DISK OK [00:08:24] RECOVERY - DPKG on search1006 is OK: All packages OK [00:10:30] RECOVERY - DPKG on search1007 is OK: All packages OK [00:11:42] RECOVERY - RAID on search1007 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:11:42] RECOVERY - Disk space on search1007 is OK: DISK OK [00:11:42] RECOVERY - NTP on search1005 is OK: NTP OK: Offset -0.03319811821 secs [00:12:00] RECOVERY - NTP on search1004 is OK: NTP OK: Offset 0.006244897842 secs [00:12:54] RECOVERY - NTP on ms-be4 is OK: NTP OK: Offset -0.01974022388 secs [00:13:03] RECOVERY - RAID on search1009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:13:30] RECOVERY - Disk space on search1009 is OK: DISK OK [00:13:48] RECOVERY - DPKG on search1009 is OK: All packages OK [00:17:51] RECOVERY - NTP on search1007 is OK: NTP OK: Offset 0.004569292068 secs [00:17:51] RECOVERY - NTP on search1006 is OK: NTP OK: Offset -0.0298012495 secs [00:18:27] New patchset: Ryan Lane; "Test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2951 [00:19:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:21:10] !log put ms-be3 into rotation as a new production swift backend storage node [00:21:13] Logged the message, Master [00:23:24] RECOVERY - NTP on search1009 is OK: NTP OK: Offset 0.02111792564 secs [00:23:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.268 seconds [00:27:09] RECOVERY - RAID on search1010 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:27:15] !log put ms-be4 into rotation as a new production swift backend storage node [00:27:18] RECOVERY - Disk space on search1010 is OK: DISK OK [00:27:18] Logged the message, Master [00:27:27] RECOVERY - DPKG on search1010 is OK: All packages OK [00:29:30] maplebed: :) [00:29:42] yup. [00:29:42] RECOVERY - Disk space on search1011 is OK: DISK OK [00:29:50] it'll be hours before the rings are balanced again though. 
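A minimal sketch of how the ring balancing mentioned above can be inspected, assuming the stock swift-ring-builder CLI and the /etc/swift builder paths that appear later in this log (output details vary by swift version):
# with no subcommand the builder prints its device table and the current balance figure
swift-ring-builder /etc/swift/object.builder
# once min_part_hours has elapsed, rebalance reassigns partitions; background replication then moves the data, which is the part that takes hours
swift-ring-builder /etc/swift/object.builder rebalance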
[00:29:51] RECOVERY - DPKG on search1011 is OK: All packages OK [00:30:13] sure [00:30:54] RECOVERY - RAID on search1011 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:31:59] maplebed: i'm reading a sci fi book right now were everyone lives on rings … and there's one where they have to rebalance the rings as they start to fall apart [00:32:28] can't they just run world-ring-builder --rebalance? [00:32:33] :) [00:33:09] RECOVERY - RAID on search1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:33:18] RECOVERY - DPKG on search1012 is OK: All packages OK [00:33:45] RECOVERY - Disk space on search1012 is OK: DISK OK [00:33:45] RECOVERY - NTP on search1010 is OK: NTP OK: Offset -0.01545226574 secs [00:35:03] New patchset: Ryan Lane; "Test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2951 [00:35:41] * Ryan_Lane groans [00:35:45] damn you paramiko [00:36:54] RECOVERY - RAID on search1013 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:37:12] RECOVERY - NTP on search1011 is OK: NTP OK: Offset 0.02468931675 secs [00:37:30] RECOVERY - DPKG on search1013 is OK: All packages OK [00:37:57] RECOVERY - Disk space on search1013 is OK: DISK OK [00:40:12] RECOVERY - NTP on search1012 is OK: NTP OK: Offset 0.02842032909 secs [00:41:33] RECOVERY - DPKG on search1015 is OK: All packages OK [00:41:33] RECOVERY - RAID on search1015 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:42:00] RECOVERY - Disk space on search1015 is OK: DISK OK [00:42:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2951 [00:43:00] \o/ [00:43:57] RECOVERY - NTP on search1013 is OK: NTP OK: Offset 0.03063797951 secs [00:44:51] RECOVERY - Disk space on search1016 is OK: DISK OK [00:44:51] RECOVERY - RAID on search1016 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:45:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2952 [00:45:27] RECOVERY - DPKG on search1016 is OK: All packages OK [00:46:31] huh.. lint check but no preceding new patchset message [00:47:00] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2952 [00:47:02] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2952 [00:49:09] yeah. weird [00:49:27] that's obviously something I've screwed up. heh [00:50:03] New patchset: Ryan Lane; "Fixing lint check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2953 [00:50:13] nah. it's there [00:50:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2953 [00:50:28] damn it. [00:50:31] it depends on another change [00:50:42] RECOVERY - NTP on search1015 is OK: NTP OK: Offset -0.01510822773 secs [00:51:22] New patchset: Ryan Lane; "Fixing lint check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2953 [00:51:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2953 [00:51:37] better [00:51:44] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2953 [00:51:47] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2953 [00:52:04] Change abandoned: Ryan Lane; "Was just a test change." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2951 [00:53:02] now to make sure it works :) [00:54:09] RECOVERY - NTP on search1016 is OK: NTP OK: Offset 0.03278541565 secs [00:59:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:03:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.268 seconds [01:05:36] New patchset: Ryan Lane; "Test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2954 [01:05:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2954 [01:08:42] RECOVERY - SSH on ms-be5 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [01:13:17] cool. hooks and lint checks are working properly again [01:20:51] RECOVERY - Puppet freshness on ms-be5 is OK: puppet ran at Wed Mar 7 01:20:37 UTC 2012 [01:23:24] RECOVERY - DPKG on ms-be5 is OK: All packages OK [01:24:27] RECOVERY - Disk space on ms-be5 is OK: DISK OK [01:24:54] RECOVERY - RAID on ms-be5 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [01:31:37] heads up .. about to import a 1.3G file on commons via fenari [01:31:48] RECOVERY - NTP on ms-be5 is OK: NTP OK: Offset 0.01478517056 secs [01:31:57] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [01:39:04] Z [01:39:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:40:03] Eloquence: it may be better to use /var/tmp on fenari instead of home for larger files, made me think of the mail "[Private-l] Avoiding NFS slowness" [01:41:47] does not seem like there is a problem right now, though [01:45:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.089 seconds [01:49:34] mutante, thanks, we moved it over anyway due to permission issue .. but looks like we're running into an issue (file size limit) with the new file backend code which we'll try to resolve tomorrow [01:54:20] Yes, please use /tmp or /var/tmp [01:54:25] Don't download large files onto /home [02:19:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.340 seconds [03:06:00] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours [03:15:22] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [03:15:22] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [04:47:01] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [04:53:58] !log added ms-be5 drives to swift cluster [04:54:01] Logged the message, Master [04:56:01] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [04:56:26] mutante: where's the best place to watch these new spindles? ganglia? [04:56:44] yeah, ganglia it should be [04:57:21] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&c=Swift+pmtpa&h=Swift+pmtpa+prod&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [04:58:11] hmm, thats the one maplebed mentioned last , because of this change "The graphs previously had been measuring the number of objects created (or destroyed) every 30 seconds. I changed them to normalize to per-second to match the rest of the swift graphs. 
" [04:59:54] i'm not a big fan of having this fake host in ganglia to do cluster-wide stats. or at least that's what i assume it is [05:04:26] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [05:04:26] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [05:05:26] hrmmmm... i see no monitoring at all for some of the swift services [05:06:07] i.e. to make sure the container server and account server &c are actually running where they should be [05:06:30] also, capacity/usage is graphed but maybe should be monitored too? [05:09:19] jeremyb: you're probably right.. let me add a ticket for that and i'll copy/paste your suggestions [05:09:52] hah, https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=templates/nagios/checkcommands.cfg.erb#l293 [05:10:38] that means [[file:Little_kitten_.jpg]] can never be deleted at commons [05:10:48] ;) [05:10:52] i see cronjobs like this [05:10:55] "Cron /usr/local/bin/swift-ganglia-report-global-stats " [05:11:30] heh, yea, kitten needs to live forever,, but who would delete a kitten ;) [05:11:52] mutante: all these new ms-be's have account+container running? [05:12:50] account, object and container [05:13:01] speaking about ms-be5 [05:13:08] but i suppose all of them ,yeah [05:13:19] yeah, well would be kinda pointless without object ;) [05:13:22] i just did this the first time ever myself:) [05:13:31] following maplebeds instructions [05:13:37] just to get this last box up as well [05:14:08] http://wikitech.wikimedia.org/view/Swift/How_To#Add_a_device_.28drive.29_to_a_ring [05:15:06] i added all drives (sda4,sdb4 and sdc1 to sdl1) to all three rings [05:15:58] and since after that the builder files are rsynced to all other nodes.. they should be all the same..yep [05:18:33] > for node in ms-fe2 ms1 ms2 ms3 ms-be1 ms-be5 ; [05:18:41] http://wikitech.wikimedia.org/view/User:Dzahn/swift-add [05:18:45] yea [05:18:46] where are ms-be{2..4}? [05:19:33] they have just been built earlier today [05:19:55] mutante: i mean in the rsync list... [05:20:10] yes [05:20:28] they werent done when these docs were writen i suppose [05:20:41] mutante: they're older than be5 and be5 is in the list... [05:21:01] * jeremyb doesn't follow [05:22:45] i dont know, they should be in the list by now , right [05:23:44] oh, yeah, thats cause ms-be5 was up before, but then went down again [05:23:52] oh [05:23:54] and then was reinstalled.. [05:24:23] what do ya'll think about having a single designated host (fenari?) where all ring operations are done and then you have a sync script on each host that updates it's local rings from the rw host. then you can dsh from a swift dblist to run the sync script everywhere (after an operation) [05:25:11] sounds reasonable [05:26:22] idk what standard practice is for swift but maybe even put the rings in version control [05:26:52] i wonder what ring size you used [05:27:44] mark suggested we put the ring files in puppet too. [05:28:18] I've been doing the ring operations on ms-fe1 because the host has to have the swift packages. fenari doesn't. [05:29:03] maplebed: yeah, i was just reading the docs and assuming these ops were done right on the new machine that was being added [05:29:06] I set ring size to 16 [05:29:07] (http://wikitech.wikimedia.org/view/Swift/Setup_New_Swift_Cluster) [05:29:20] ah, hi maplebed [05:29:24] hi! [05:29:29] thanks for setting up ms-be5! 
[05:29:32] i was about to say, please lets send all this to him or the list as well:) [05:29:44] sadly I think I want it in a different zone. [05:30:04] maplebed: so we should rsync to ms-be-2,3,4 as well, right? [05:30:16] IIRC ms-be1 is already in zone 5 and I want one zone per rack. [05:30:35] yeah... the rsync command should be more clearly indicated that it's only an example and that the files need to go to all swift hosts. [05:30:42] maplebed: yeah, reading that page now [05:30:53] maplebed: be-1 is zone 4 [05:30:54] the docs shouldn't need to maintain a full list of the hosts in the cluster; it's redundant and will be out of date. [05:31:05] oh. is ms-be2 in zone 5 then? [05:31:09] yes [05:31:12] eh, no, wait [05:31:36] ms-be2 is zone 5. [05:31:43] right, then you probably want zone 8 for ms-be-5 [05:31:52] (from the output of swift-ring-builder /etc/swift/account.builder | less) [05:31:54] i wasnt even sure if they are each in their own rack [05:32:14] I checked racktables a while ago; I'm pretty sure they're all different. [05:33:06] robh didn't want to give me a full rack separation but he wound up finding space. [05:33:22] maplebed: i'm not sure where puppet fits in but certainly some of this should be made into bash scripts instead of just sitting on the wiki [05:33:59] I wound up just running things like this on the command line: [05:33:59] for dev in sd{a..b}4; do swift-ring-builder /etc/swift/account.builder add z${zone}-${ip}:6002/${dev} $weight; swift-ring-builder /etc/swift/container.builder add z${zone}-${ip}:6001/${dev} $weight; swift-ring-builder /etc/swift/object.builder add z${zone}-${ip}:6000/${dev} $weight; done [05:34:05] obviously suboptimal. [05:34:21] mutante just wrote a bash script to do all the devices for a new ms-be host. [05:34:22] well, i just rsynced to be2,3,4 now as well [05:35:23] that script could probably also find out the device names, and the weight to be used, based on their size..all by itself [05:35:40] not sure about the zone.. [05:35:52] maplebed: can you say some more about how the ring would fit in puppet? [05:36:08] mutante: are you sure you added ms-be5 to the rings? [05:36:58] maplebed: i can see the devices from ms-be1, searching the IP of ms-be5 [05:37:10] jeremyb: after running commands to adjust the rings wherever you need to run them (probably on some host in the cluster) you copy them into puppet instead of rsyncing them to the other hosts. Puppet then deploys the files to all the hosts in the cluster. [05:37:14] @ms-be1:~# swift-ring-builder /etc/swift/container.builder search 10.0.6.204 [05:37:47] I think you may have missed updating ms-fe1. [05:38:30] oh, most likely yes, it wasnt in the rsync example [05:38:30] ms-fe1 is also missing from the list i copied in here from the wiki [05:38:42] gotcha,, doing it now [05:38:46] because that's where I was running the commands. [05:38:47] :( [05:39:03] (so that was the source) [05:39:04] mutante: did you do yours on ms-be5? or where? [05:39:09] yes [05:39:09] that part of the doc clearly needs improving. [05:39:10] ;) [05:39:44] mutante: you must have copied the ring files onto ms-be5 from somewhere first, though, right? [05:40:04] yes i copied them from be1, before i could add devices [05:40:38] added the devices, then rsynced from be-5 back to the others [05:40:48] cool. [05:41:34] there, ms-fe1 done [05:41:51] thankfully it's ok for the ring files to be different on different hosts (for a while) because of the algorithms swift uses to figure out which bits are where. 
At least that's what the docs claim. [05:42:14] ok, good [05:42:27] mutante: feel free to update the docs as you see fit; I'll look at them a bit tomorrow as well. [05:42:46] alright [05:42:53] seems like the bits that need improving are: [05:43:05] * make sure different racks get different zones [05:43:21] * copy the files to all hosts in the cluster (or maybe we should just start putting them in puppet now) [05:43:38] * the ring-building stuff can be done on any node, not necessarily the one you just built [05:43:52] looks like the whole part involving parted/mkpart/mkfs is not necessary anymore.. puppet did it [05:44:01] I don't think so. [05:44:12] I think the partition map didn't get fully nuked when the host was rebuilt. [05:44:17] mark built ms-be5 once already [05:44:18] i saw puppet creating the xfs filesystems on first run [05:44:34] but on reboot it went back to pxe and broke everything. [05:44:39] the partitions map is another thing.. alright.. yea [05:45:04] I say this because ms-be2 and ms-be3 needed the partitions created and the filesystem made but ms-be4 (and ms-be5) didn't. [05:45:18] so clearly there's something wonky going on. [05:45:33] we're supposed to get more hardware delivered this week; we can see what happens on those hosts. [05:45:37] makes sense..yeah.. ms-be5 was at a "no root filesystem defined" prompt [05:45:51] this time it did not break on reboot though.. i tested..at least once [05:45:58] good. [05:46:16] all the swift stuff should start on boot, so I like to have the hosts rebooted as the last thing before putting them into service. [05:46:20] just to make sure. [05:46:21] after installing the new kernel i had a reason anyways [05:46:33] ok, I gotta bail. [05:46:34] note: be1 has 1GB swap, be-5 has 4GB [05:46:34] thanks again! [05:46:41] also taking a break.. yw [05:46:43] ttyl [05:46:45] how about be2-4? [05:46:53] did not check yet [05:46:55] ms-be1 was made by hand by robh. [05:47:00] it's probably the most different. [05:47:06] i think 4, the docs also say we want 4 [05:47:09] yep [05:47:28] (honestly, if we hit swap the host is fucked, so I can't imagine it really matters too much...) [05:47:35] * mutante nods [05:47:47] ok. g'night! [05:47:48] ok, cya later [05:47:49] night [06:11:50] New patchset: Dzahn; "add missing locale pl_PL.UTF-8 to fix broken pl.planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2955 [06:12:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2955 [06:12:37] New review: Dzahn; "pl.planet needs this. RT 2416" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2955 [06:12:40] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2955 [06:27:54] mutante: should i edit your swift-add page directly or just tell you what i think should be changed? [06:27:57] or [06:27:59] ? [06:30:07] jeremyb: go ahead and edit, there is always history [06:31:03] how certain you are (that history will always be there) ;) [06:31:20] jeremyb: about something else, gerrit change 2012, you said back then you just repushed it for the logo change, but that has already been done in another change. can it be abandoned then? [06:31:49] i have to open it to look [06:32:00] !change 2012 | jeremyb [06:32:00] jeremyb: https://gerrit.wikimedia.org/r/2012 [06:33:50] jeremyb: re: the swift-add script. even better if we jus did that via gerrit as well. 
i'll just have to check out first, if that new repo for "tools/scripts" exists already [06:34:17] eh, "project" i should say [06:34:20] mutante: idk what's up with any of those. well... now i see gerrit says 2 are in the production branch but i didn't know until just now. but http://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/ still looks wrong [06:34:35] mutante: well i'm sleeping soon. should i hold off? [06:35:16] no rush, we'd just like to clean up a bit in the gerrit queue [06:35:40] i mean should i not edit the wiki copy? [06:36:10] mutante: about /r/2012 in particular: i'm not about to edit war over it. i still think it's the right change to make but if no one agrees then no one agrees [06:36:56] if you want to edit it, go ahead.. or use the Tlak: page for your version, or wait :) [06:37:00] Talk: [06:37:16] ok, i'm editing [06:37:35] you can always import an old version and i can submit a change to match the new version [06:38:34] jeremyb: i'm not sure i even have an opinion about changing link to enter_bug.cgi. I'm just pointing out the logo change has already happened now. so this would not merge anyways [06:38:47] alright [06:39:10] mutante: and i'm pointing out it doesn't look changed... [06:39:24] "but http://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/ still looks wrong" [06:39:51] it's also not ssl'd [06:40:00] but https://gerrit.wikimedia.org/r/#patch,sidebyside,2013,3,files/svn/viewvc.conf [06:40:37] mutante: but see my link? ;) [06:41:12] i suggest you create a new change then [06:42:14] to do what? [06:42:27] maybe puppet hasn't run on the box? [06:42:35] fix the logo? i dont know, you said it still looks wrong [06:43:05] that link doesnt show either, not the old and not the new logo.. i have no idea right now [06:43:53] mutante: you sure it doesn't show the old logo? looks the same as months ago to me [06:43:53] dont worry about it now, or comment on the gerrit change [06:44:34] mutante: i have to assume there's been no puppet run there. or alternatively there's something wrong with the puppet config for that service or with the way i made the change [06:45:24] checking if puppet ran [06:45:57] puppet ran at Wed Mar 7 06:26:00 UTC 2012 [06:46:09] what host is this? [06:46:14] formey [06:48:28] ok, i see the old logo in viewvc.conf on the host [06:48:42] src="https://donate.mozilla.org/page/-/bugzilla.png" [06:49:04] yay! i was getting ready to ask for an apache restart ;P [06:49:21] but puppet ran, so what is most likely missing is that puppet has not been told to actively put this file in the /etc/viewvc/ dir [06:52:45] it is defined in class viewvc in svn.pp, and it says it requires svn::server [06:53:01] but in site.pp formey just includes svn::server and not viewvc .. [06:53:58] seems like it should be the other way around, svn::server requiring viewvc [06:55:08] would be nice if there was some notification when a change was merged to an additional branch. so that i knew when it went to production so i could try to verify it took then [06:55:23] or is it svn::server::viewvc already, but then why would it require svn::server again in that place [07:02:51] PROBLEM - Puppet freshness on search1020 is CRITICAL: Puppet has not run in the last 10 hours [07:03:46] jeremyb: gerrit sends mail, you can "watch projects", Settings -> Watched projects, email notifications e.g. "Gerrit-MessageType: merged" [07:03:49] bbiaw [07:04:23] mutante: i meant specifically for cahnges you were a part of (or watching). 
or at least for ones you submitted [07:04:29] changes* [07:09:28] i guess w=while? [07:09:46] anyway, no rush, just a mystery ;) [07:27:36] RECOVERY - Disk space on ms1004 is OK: DISK OK [07:31:33] New patchset: Hashar; "allow hashar on formey host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2821 [07:31:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2821 [07:39:46] New review: Dzahn; "your change seems fine, just makes me wonder about the sudo users that don't have an account then. (..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/2821 [07:42:45] PROBLEM - Puppet freshness on search1019 is CRITICAL: Puppet has not run in the last 10 hours [08:24:03] New review: Hashar; "I have no idea. I guess the accounts were added manually over time." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2821 [08:40:53] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:50] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [08:47:11] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours [09:15:09] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:17:06] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [11:33:12] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [13:00:05] !log updated mwlib to 0.13.6 [13:00:07] !log updated mwlib to 0.13.6 [13:00:07] !log updated mwlib to 0.13.6 [13:00:09] Logged the message, Master [13:00:12] Logged the message, Master [13:00:18] Logged the message, Master [13:01:59] New review: Mark Bergsma; "I think it would be better to make a list of Nagios/Icinga monitoring IPs in a class (icinga::config..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2934 [13:03:37] New patchset: Mark Bergsma; "There is one in network.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2956 [13:03:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2956 [13:07:33] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours [13:16:38] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [13:16:38] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [13:28:30] !log Removed torrus from streber [13:28:34] Logged the message, Master [13:34:53] New patchset: Mark Bergsma; "Remove csw5-pmtpa from monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2957 [13:35:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2957 [13:35:57] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2957 [13:36:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2957 [13:43:55] New review: Mark Bergsma; "Please fix indentation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2936 [13:47:05] New review: Mark Bergsma; "Do you see that FIXME above the code you just copy pasted? 
:)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2936 [13:51:52] New review: Mark Bergsma; "So why are these files owned by rainman if they're managed by Puppet? rainman can't really edit them..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2947 [14:10:08] New patchset: Hashar; "allow hashar on formey host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2821 [14:10:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2821 [14:11:49] New review: Demon; "Per IRC discussion--doesn't need an account created, just needed the default shell adjusted in LDAP ..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2821 [14:34:26] Change abandoned: Hashar; "This is now unneeded. Just had to be allowed in the LDAP directory :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2821 [14:48:03] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [14:57:03] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [15:06:03] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [15:06:03] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [15:17:55] RECOVERY - Puppet freshness on search1019 is OK: puppet ran at Wed Mar 7 15:17:47 UTC 2012 [15:20:09] RECOVERY - Disk space on search1019 is OK: DISK OK [15:21:03] RECOVERY - RAID on search1019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:21:30] RECOVERY - Puppet freshness on search1020 is OK: puppet ran at Wed Mar 7 15:21:23 UTC 2012 [15:21:48] RECOVERY - DPKG on search1019 is OK: All packages OK [15:22:15] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.030 second response time on port 8123 [15:23:36] RECOVERY - Disk space on search1020 is OK: DISK OK [15:24:12] RECOVERY - RAID on search1020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:25:24] RECOVERY - DPKG on search1020 is OK: All packages OK [15:26:36] RECOVERY - Lucene on search1003 is OK: TCP OK - 0.027 second response time on port 8123 [15:28:06] RECOVERY - Lucene on search1006 is OK: TCP OK - 0.027 second response time on port 8123 [15:28:33] RECOVERY - Lucene on search1004 is OK: TCP OK - 0.027 second response time on port 8123 [15:28:42] RECOVERY - Lucene on search1005 is OK: TCP OK - 0.027 second response time on port 8123 [15:28:51] RECOVERY - Lucene on search1007 is OK: TCP OK - 0.027 second response time on port 8123 [15:29:54] RECOVERY - Lucene on search1012 is OK: TCP OK - 0.026 second response time on port 8123 [15:31:19] RECOVERY - Lucene on search1010 is OK: TCP OK - 0.026 second response time on port 8123 [15:31:19] RECOVERY - Lucene on search1011 is OK: TCP OK - 0.027 second response time on port 8123 [15:31:19] RECOVERY - Lucene on search1013 is OK: TCP OK - 0.027 second response time on port 8123 [15:31:19] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123 [15:31:19] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123 [15:33:43] RECOVERY - NTP on search1019 is OK: NTP OK: Offset 0.03436875343 secs [15:35:40] RECOVERY - NTP on search1020 is OK: NTP OK: Offset 0.02554428577 secs [15:37:52] New patchset: Pyoungmeister; "that's a bit better" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2958 [15:38:03] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2958 [15:40:23] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2958 [15:40:23] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2958 [15:40:55] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=62%): /var/lib/ureadahead/debugfs 284 MB (3% inode=62%): [15:40:55] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:40:55] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 179 MB (2% inode=61%): /var/lib/ureadahead/debugfs 179 MB (2% inode=61%): [15:40:55] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=62%): /var/lib/ureadahead/debugfs 0 MB (0% inode=62%): [15:53:28] robh: can we swap db22 disk today? [15:53:45] !rt 2497 [15:53:45] https://rt.wikimedia.org/Ticket/Display.html?id=2497 [15:54:21] prolly, lemme check it out [15:56:02] cmjohnson1: Ok, it is a slave on s4 now [15:56:23] so we can attempt a hot swap. I am going to go ahead and tell it to identify drives 9 and 11, since drive 10 is the bad one, it cannot accept the identify command [15:56:27] it seems, or its just not working [15:56:49] RECOVERY - Disk space on srv224 is OK: DISK OK [15:56:58] RECOVERY - Disk space on srv219 is OK: DISK OK [15:56:58] RECOVERY - Disk space on srv220 is OK: DISK OK [15:56:58] RECOVERY - Disk space on srv223 is OK: DISK OK [15:57:08] cmjohnson1: actually, going to try to ID drive 10 now [15:57:21] i had bad syntax [15:58:02] bleh [15:58:03] The device specified does not exist. [15:58:28] cmjohnson1: ok, none of the damned id commands work. [15:58:33] Can you attempt to id drive 10? [15:58:40] i have the disk diagram up [15:58:43] (meaning from the layout on the top of the case [15:58:44] cool [15:58:59] just under dvd rom 2nd from left [15:59:01] so this starts at 0 [15:59:09] so if the diagram starts at one, this is drive 11 thats bad. [15:59:29] I want you to go ahead and find it, and then pull it, and do not replace it [15:59:36] i want to then run a scan and confirm we pulled the right driv e [15:59:38] http://docs.oracle.com/cd/E19121-01/sf.x4240/820-3835-14/820-3835-14.pdf [15:59:38] ok? 
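For reference, a hedged sketch of the scan step just described, assuming the x4240's RAID controller is the Adaptec-based one driven by arcconf (the "Device #10 ... State : Rebuilding" output further down looks like arcconf output); the controller number 1 and the channel/device numbers are placeholders:
# list physical devices and their states to confirm which slot reports the failure
arcconf getconfig 1 pd
# blink the locate LED on one device (form: identify <controller> device <channel> <id>)
arcconf identify 1 device 0 10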
[16:00:14] it runs backwards....0 is the most bottom on left side working its way up to 3 [16:00:57] !log pulling disk 10 from db22 [16:01:00] Logged the message, Master [16:01:45] robh: ok [16:02:00] that appears to be the correct drive [16:02:03] go ahead and replace =] [16:02:23] !log replacing hdd for disk 10 on db22 [16:02:26] Logged the message, Master [16:03:08] robh: fyi that was the last of replacement disks i have for the x4240's [16:03:20] good, that means we can stop using the damned servers ;] [16:03:46] i also need to bring down db59 to add the cards [16:04:25] lemme know if disk 10 is rebuilding [16:04:39] cmjohnson1: its not powered on, you can bring down db59 whenever [16:04:47] will check shortly, on call with dell [16:05:02] okay [16:06:54] cmjohnson1: confirmed, rebuilding [16:07:04] Device #10 [16:07:04] Device is a Hard drive [16:07:05] State : Rebuilding [16:07:05] Supported : Yes [16:09:16] RECOVERY - Lucene on search1020 is OK: TCP OK - 0.027 second response time on port 8123 [16:09:16] RECOVERY - Lucene on search1019 is OK: TCP OK - 0.026 second response time on port 8123 [16:10:36] cool...updating ticket [16:15:35] Do I dare ask why /var/log/mw/fatal.log isn't logrotated? It's currently 6.7G on fenari... [16:16:14] lol, there's an RT ticket for it [16:52:54] RECOVERY - RAID on db22 is OK: OK: 1 logical device(s) checked [17:03:47] meh, sync-apache + apache-graceful-all routinely fail to keep servers in sync [17:06:36] apergos: are you about? Could you put the 1.19.0beta1 files onto download.wm.o for me please? [17:07:00] hello [17:07:20] where are they? [17:07:22] http://noc.wikimedia.org/~reedy/upload-1.19.0beta1.tar [17:08:35] or /home/reedy/public_html/upload-1.19.0beta1.tar [17:08:57] not gz? I'll fix that. but you still want to provide a sig file too [17:09:01] of the gz [17:09:07] so you might as well do it yourself [17:09:41] following the example of: mediawiki-1.18.0beta1.tar.gz mediawiki-1.18.0beta1.tar.gz.sig [17:09:49] yeah, extract that file [17:09:55] ah [17:09:55] 1.19 dir, and the tar.gz/sig in there [17:11:55] check that [17:12:04] (I had already maded the dir and stuff :-D) [17:12:10] heh [17:12:13] thanks [17:12:22] you can grab them ok? [17:12:42] yeah, just downloading now [17:13:03] http://www.mediawiki.org/wiki/Download [17:13:09] better update [17:13:44] ok, back to playing with gdb [17:14:01] yup, email to send etc now [17:14:25] cool [17:44:40] robh: the I/O fusion cards...when does 30 days begin? today ...day of shipping or day of arrival? [17:45:08] it started the day they shipped [17:45:18] so whatever it says on packing slip, which should be scanned into the procurement ticket [17:45:30] k [17:45:41] then since its at your center, i am leaving it to you to head up that they go back on time, so please check in with asher regularly [17:46:15] i will...set a reminder to send them back! [18:10:45] ok, cmjohnson1 i am reviewing your email about the two sq servers [18:10:59] ok [18:10:59] I know you cannot do what I am about to yet without root, but I am going to atleast tell ya in here what i am doin [18:11:02] fyi and all [18:11:36] Ok, so I am NOT going to decomm them until you do some hardware tests to confirm its bad controller [18:11:49] but I will go ahead and do a clean shutdown on both. 
Squid is nice in that you do not need to manually depool anything [18:11:50] ok [18:11:55] pybal is smart, and it handles that stuff [18:12:03] pybal being the load balancing agent that mark wrote [18:12:36] !log shutting down sq38 and sq46 per rt 2581 for testing [18:12:39] Logged the message, RobH [18:12:50] heh, i may not be able to shut them down, they are hung [18:13:17] cmjohnson1: yea, they are hung, go ahead and manually reboot and test (i expect if you let them post the errors will scroll by) [18:13:43] ok...i will ping you in a few [18:13:50] basically i tried to ssh in, ssh is down, and sq38 isnt taking drac [18:13:56] checking sq46 drac now [18:14:23] sq38 is already listed for decom [18:14:25] meh, drac needs reset on 46, just feel free to down both manually and test [18:14:34] cool, makes this easier [18:14:42] i have removed all cabling [18:14:56] ok, if sq46 is also dead, comment in ticket and assign to me [18:15:04] ok [18:15:11] then i need to put them in the decommission.pp in puppet, which when runs yanks those servers off monitoring and the like [18:21:14] New patchset: Bhartshorne; "Add 500 tracking to the Swift proxy logtailer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2960 [18:21:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2960 [18:22:03] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2960 [18:22:05] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2960 [18:30:33] New patchset: Bhartshorne; "Add 503 tracking to the Swift proxy logtailer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2961 [18:30:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2961 [18:31:07] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2961 [18:31:10] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2961 [18:32:59] cmjohnson1: ok, if sq46 is bad, you can unrack and pull drives for wipe [18:33:12] it doesnt need to be there for the rest of the decom process, i will take it from here, thanks for testing =] [18:33:19] ok [18:35:22] huh, seems someone, i imagine mark_ already added it to the decommission list in puppet [18:36:53] probably [18:36:56] !log pulled sq39 from text pybal config, pulled sq46 from upload pybal config [18:36:59] Logged the message, RobH [18:44:04] RobH: you were talking about 38 before but now !log'd 39? [18:44:25] i did the wrong thing, yea [18:44:33] !log correction sq39 [18:44:35] Logged the message, RobH [18:44:37] just on the log though [18:44:59] sq38 isnt on ticket, was transposing [18:45:03] thx for checkin though =] [18:45:30] jeremyb: though sq38 is also gone, so i couldnt have broken anthing =] [18:45:38] hah [18:49:10] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours [19:00:00] New patchset: Bhartshorne; "counting all swift hits by status code" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2962 [19:00:11] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2962 [19:00:24] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2962 [19:00:27] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2962 [19:23:09] does anyone know what this puppetbarf is about? "err: /Stage[main]/Mediawiki::Sync/Exec[mw-sync]: Failed to call refresh: Command exceeded timeout at /var/lib/git/operations/puppet/manifests/mediawiki.pp:24" [19:24:04] oh nm. i see, it's sync-common failing [19:27:24] New patchset: Bhartshorne; "corrected hits by status code to per-second instead of per-measurement interval. added percentage of hits by status code." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2963 [19:27:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2963 [19:27:46] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2963 [19:27:49] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2963 [19:27:58] !log manual apt-upgrade, puppetd --refresh, and repeat on srv265 because it was running on outdated apache config [19:28:01] Logged the message, Master [19:41:21] New patchset: Asher; "it's wikipedia zero time!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2964 [19:41:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2964 [19:44:07] New patchset: Bhartshorne; "changed the name of the ganglia logtailer from being proxy specific to being http specific since it works for the backend stoarge nodes too. moved the puppet config from proxy-specific to swift::base to include it on all nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2965 [19:44:19] New patchset: Bhartshorne; "changed the name of the ganglia logtailer from being proxy specific to being http specific since it works for the backend stoarge nodes too. moved the puppet config from proxy-specific to swift::base to include it on all nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2966 [19:44:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2965 [19:44:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2966 [19:44:38] New patchset: Asher; "it's wikipedia zero time!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2964 [19:44:49] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2964 [19:45:09] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2965 [19:45:12] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2965 [19:45:19] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2966 [19:45:22] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2966 [19:45:57] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2964 [19:46:00] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2964 [19:46:15] wheee gerrit spam!!! wheee!!!!!! [19:47:02] binasher: looks like I merged your change. [19:54:16] mutante: did we ever figure out svn? [19:54:30] he's probably not around right now. [19:54:35] but since you are... [19:54:49] you mentioned last night that you weren't too fond of the pseudo-host for cluster stats idea. [19:54:53] i can't keep everyone's TZs straight [19:55:03] maplebed: not that i know a better way! [19:55:18] ah, damn. [19:55:22] I thought you were going to come back with a suggestion. [19:55:35] maplebed: never set up ganglia myself [20:00:31] it's not.. [20:00:48] oh. I guess it is evening so people who aren't behind in their work in this time zone would be somewhere else [20:02:17] * jeremyb wonders who apergos is responding to [20:02:50] that mu tante isn't around right now [20:03:17] "it's not"? [20:03:42] we need a bot that reassembles nicknames for apergos [20:03:57] "mu tante" has to become "mutante" [20:04:02] !log deployed support for zero.wikipedia.org and carrier tagging to mobile varnish servers [20:04:05] I was going to say iit's not that late [20:04:06] Logged the message, Master [20:04:09] but it is, I guess [20:04:10] cuz it wasnt intentional so not to ping him? [20:04:16] (thats what i do) [20:04:25] I was deliberately not pinging, that's true [20:04:31] wasn't he pinged five minutes before? [20:04:34] apergos: ahh [20:04:37] and generally people like to see themselves in the context [20:04:53] I don't ping unless I meanit [20:05:01] preilly: right?! [20:05:01] however I am happy to make exceptions domas [20:05:11] what is 'ping'? [20:05:13] domas: i *was* trying to ping though [20:05:42] !log reverted no-pagecache rsync on search nodes - without corresponding index warmup in lsearchd, it just pushes back the pain a bit and does more harm than good [20:05:43] cause your irc client to whine at you that your name has been bandied about and that you should really look into it [20:05:45] Logged the message, Master [20:05:47] = "ping" [20:05:57] why would an irc client whine [20:06:10] *shrug*, it puts it in different color for me, I look at it eventually [20:07:24] domas: i think whine implies make noise [20:07:32] domas: what? [20:07:33] domas: do you not experience noise? [20:07:41] preilly: I wanted to know if you think the same! [20:07:55] it is only visual here, why would I ever want anyone to disturb my music? [20:09:29] preilly: don't you feel good that I picked you out?! [20:09:42] domas: not so much [20:09:45] :( [20:10:03] * preilly is waiting for it…. whatever it is... [20:10:17] for what? 
[20:10:24] picking [20:10:28] * preilly is confused  [20:10:31] we were discussing about highlighting on IRC [20:10:33] obviously [20:10:34] :) [20:17:26] !log yet another redirects.conf change, per RT#2498 redirect wikimedia.com-->wikimedia.org [20:17:29] Logged the message, Master [20:25:47] New patchset: Pyoungmeister; "required for catch-all term" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2967 [20:25:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2967 [20:26:27] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2967 [20:26:29] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2967 [20:30:37] hi domas [20:30:45] hey aude [20:31:03] domas: you've moved to DC, right? [20:31:06] not yet [20:31:09] I was there recently tho! [20:31:10] ah, okay :( [20:31:19] we have a meetup on saturday [20:31:33] RobH is welcome also [20:32:04] if i dont do sailing sounds cool =] [20:32:14] oooh sailing! [20:32:33] http://en.wikipedia.org/wiki/Wikipedia:Meetup/DC_28 [20:58:40] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused [21:15:28] RECOVERY - RAID on srv194 is OK: OK: no RAID installed [21:17:56] !log running apt upgrades and puppetd --test on srv194, srv197, srv203, srv212, srv213, srv230, srv244, srv245, srv252, srv282 and manually restarting nrpe because they're reporting funky in nagios [21:17:59] Logged the message, Master [21:22:40] RECOVERY - RAID on srv203 is OK: OK: no RAID installed [21:34:31] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [21:40:07] New patchset: Bhartshorne; "corrections to capture log lines from swift storage nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2968 [21:40:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2968 [21:40:49] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2968 [21:40:52] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2968 [21:43:13] lol nagios is from hell [21:43:22] * maplebed muted nagios yesterady. [21:43:58] it's alerting on half the world, strangly a lot of nrpe procs are owned by dzahn [21:44:59] 25404 root 20 0 1337m 1.2g 3060 R 100 31.6 291:33.65 puppet [21:45:00] awesome [21:45:52] what's the trick when the nagios UI gives you the "Whoops!" page? [21:52:08] ah, corrupt or incomplete nagios config. [21:59:11] PROBLEM - RAID on mw40 is CRITICAL: Connection refused by host [22:01:08] RECOVERY - RAID on mw40 is OK: OK: no RAID installed [22:04:35] RECOVERY - RAID on srv254 is OK: OK: no RAID installed [22:39:37] !log set swift weight for ms1 to 0 initiating the process to move data off the host in preparation for decomissioning it. [22:39:41] Logged the message, Master [22:44:20] robh: so ms4, memory all checks out. !rt 885 [22:44:31] !rt 885 [22:44:31] https://rt.wikimedia.org/Ticket/Display.html?id=885 [22:44:41] thx jeremyb...i was just going leave it alone [22:45:58] cmjohnson1: so do those slots that indicate fault not detect that memory? [22:46:09] that is what I believe [22:46:14] ie: memory is fine, bad mainboard? 
[22:46:29] yes...that is the conclusion [22:46:56] that sucks cos it is hard to get the motherboard [22:46:57] ok, so we have a ton of useless memory in a system with a bad mainboard [22:47:06] which is no longer under warranty [22:47:34] can we use the memory in any of the other ms's? [22:47:42] could sell the good parts on ebay ;-P [22:47:47] should be able to yes [22:47:48] and the disks [22:47:51] but still sucks =P [22:48:53] this would be more concerning if maplebed didnt already replace this hardware with swift ;] [22:49:07] robh: can we run just using the known working memory slots? [22:49:20] it will be slow but I think it is rather slow already [22:49:20] how many slots are bad? [22:49:26] I hate running bad hardware [22:49:33] if those slots are bad, who is to say what else is now bad [22:49:39] it could lead to unpredictable results [22:49:43] true ...but run it till it dies [22:49:50] i call this dead ;] [22:50:06] but, lets get mark_ to agree. taking ticket from you and assigning to him. [22:50:42] k [22:51:15] Please update the ticket with what slots are bad (the #'s and how many) cuz it matters, you have to populate the dimms in a specific order [22:51:25] so if the first to populate slots are dead, then it reall ysucks [22:52:21] yep...np [23:01:23] PROBLEM - Disk space on search10 is CRITICAL: DISK CRITICAL - free space: /a 5248 MB (3% inode=99%): [23:09:29] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours [23:18:29] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [23:18:29] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
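Closing note on the 22:39 !log above about setting ms1's swift weight to 0: a minimal sketch of that drain with the stock swift-ring-builder CLI, using a placeholder for ms1's IP as the search value; as discussed earlier in the log, the rebalanced ring files then have to be copied to every node in the cluster:
# weight 0 stops new partitions being assigned to the host and lets existing ones migrate away
swift-ring-builder /etc/swift/object.builder set_weight <ms1-ip> 0
swift-ring-builder /etc/swift/object.builder rebalance
# repeat for the account and container builders, then push the updated .ring.gz files to all hosts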