[00:23:13] !log applying new loopback filter to cr1-eqiad - higher risk of issues [00:23:14] Logged the message, Mistress of the network gear. [01:01:11] thanks LeslieCarr, I have colloquy working on the mifi [01:01:18] awesome [02:16:23] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1411s [02:20:03] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1631s [02:26:03] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:29:43] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:41:54] PROBLEM - Puppet freshness on db22 is CRITICAL: Puppet has not run in the last 10 hours [04:15:16] RECOVERY - Disk space on es1004 is OK: DISK OK [04:20:46] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [04:40:19] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [09:53:53] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 447089 MB (3% inode=99%): [09:54:53] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 442063 MB (3% inode=99%): [10:16:53] RECOVERY - MySQL slave status on es1004 is OK: OK: [12:15:33] PROBLEM - Puppet freshness on srv272 is CRITICAL: Puppet has not run in the last 10 hours [12:50:54] PROBLEM - Puppet freshness on db22 is CRITICAL: Puppet has not run in the last 10 hours [13:21:32] New patchset: Dzahn; "add the nagios apache site config as it is now and enable it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1829 [13:21:49] New patchset: Dzahn; "now fix it,install star ssl cert,admin email,tabs,.." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1830 [13:22:03] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1829 [13:26:37] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1829 [13:26:37] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1829 [13:27:05] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1830 [13:27:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1830 [13:47:27] New patchset: Dzahn; "special.cfg was removed a while ago, do not require it any longer, broken dep." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1831 [13:47:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1831 [13:48:12] New review: Dzahn; "Dependency File[/etc/nagios/special.cfg] has failures: true" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1831 [13:48:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1831 [14:09:47] !log fixed Apache VirtualHost warnings on spence, NameVirtualHost *:443 in ports.conf, in sites-available,.. [14:09:48] Logged the message, Master [14:11:44] !log nagios https now serves real SSL cert [14:11:45] Logged the message, Master [14:27:17] New review: Dzahn; "just noticed this one when using SSL:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1827 [14:28:37] New review: Catrope; "Yeah, SSL to the new Ganglia is broken, but that's not a regression cause SSL to the old Ganglia was..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/1827 [14:28:50] New review: Dzahn; "lol" [test/mediawiki/core] (master); V: -1 C: 0; - https://gerrit.wikimedia.org/r/1826 [14:29:17] New review: Demon; "Looks AWESOME -> approved" [test/mediawiki/core] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1826 [14:29:29] Change merged: Demon; [test/mediawiki/core] (master) - https://gerrit.wikimedia.org/r/1826 [14:29:48] hahaa [14:30:00] <^demon> I love test repos. [14:30:04] <^demon> We can commit anything to them :) [14:30:07] :) [14:33:46] oh btw, if you have huge files, like dumps (?), and you still want to use git, this might be it http://git-annex.branchable.com/ [14:57:04] did you just review with a comment of "lol"? yes you did [14:58:33] New patchset: Mark Bergsma; "Use private key file which also contains the public SSL cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1832 [14:58:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1832 [14:58:58] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1832 [14:58:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1832 [15:03:51] RECOVERY - HTTP on sodium is OK: HTTP OK HTTP/1.0 200 OK - 434 bytes in 0.053 seconds [15:22:33] New patchset: Mark Bergsma; "Add hold_domains class parameter for easier debugging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1833 [15:23:05] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1833 [15:23:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1833 [15:26:17] New patchset: Mark Bergsma; "Hold all mails on sodium for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1834 [15:26:31] New review: 
gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1834 [15:26:38] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1834 [15:26:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1834 [15:33:59] New patchset: Mark Bergsma; "Puppetize htdigest file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1835 [15:34:04] RobH: the wiki is down and linked IRC channel is #wikipedia [15:34:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1835 [15:34:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1835 [15:34:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1835 [15:35:00] Thehelpfulone: ? [15:35:39] everything seems online to me, what wiki? [15:35:52] it was down for a minute, Reedy fixed it [15:35:59] ok. [15:36:07] RobH: perhaps it wasn't you, we requested that #wikipedia be changed to #wikimedia-downtime in the message, but someone said it should go to #wikimedia-tech [15:36:18] is there a bugzilla request for this? [15:36:28] just asking in irc is pretty much not going to get it done. [15:36:47] they thought it was on #wikimedia-tech, but someone said that it should be at #wikimedia-tech [15:37:06] I thought it linked to #wikimedia-tech [15:37:16] yeah, it links to #wikipedia [15:37:18] but if not, any changes on it, someone is going to have to file a bug request [15:37:30] or else there is no record of why it was changed [15:37:45] oh, found it [15:37:53] oh there is a bug request? 
[15:37:57] * Thehelpfulone didn't realise [15:38:06] old bug [15:38:07] https://bugzilla.wikimedia.org/show_bug.cgi?id=16043 [15:38:21] seems there was disagreement and discussion on what exactly to list in the page [15:38:24] New patchset: Mark Bergsma; "Fix private file path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1836 [15:38:39] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1836 [15:38:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1836 [15:38:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1836 [15:41:12] so appears it was reopened to get pushed through [15:41:34] you may want to poke hexmode on it once the bug settles out, and he knows who to push on in wmf [15:41:36] =] [15:42:07] basically getting him involved saves you having to run folks down for help, since he knows who to go to =] [15:42:19] great thanks [15:42:47] New patchset: Mark Bergsma; "Fix port 443 in use issue" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1837 [15:43:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1837 [15:43:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1837 [15:43:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1837 [15:43:53] RobH: do you know of a wikimedia testwiki where crats can't remove sysop/crat when they've assigned it? 
[I need one to update some documentation images] [15:44:25] nope, but i dont deal much with testwiki stuff anymore these days [15:45:07] okay thanks anyways :) [15:49:08] !log Started rsync of lily:/var/lib/mailman/data to sodium (in a screen on sodium) [15:49:09] Logged the message, Master [15:50:45] RECOVERY - HTTPS on sodium is OK: OK - Certificate will expire on 08/22/2015 22:23. [16:12:35] New patchset: Mark Bergsma; "Use a service IP for lists.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1838 [16:13:02] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1838 [16:13:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1838 [16:23:36] !log Started rsync of lily:/var/lib/mailman/archives to sodium (in a screen on sodium) [16:23:37] Logged the message, Master [16:37:10] New patchset: Mark Bergsma; "Enable CGI module for Mailman" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1839 [16:37:24] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1839 [16:37:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1839 [16:37:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1839 [17:11:58] New patchset: Mark Bergsma; "Enable mod_redirect" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1840 [17:12:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1840 [17:12:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1840 [17:23:33] New patchset: RobLa; "Revert "Telling the world mediawiki rocks cuz it duz"" [test/mediawiki/core] (master) - https://gerrit.wikimedia.org/r/1841 [17:24:35] Change abandoned: Mark Bergsma; "Vandal permblocked." [test/mediawiki/core] (master) - https://gerrit.wikimedia.org/r/1841 [17:25:13] * mark ducks [17:25:13] New review: RobLa; "Planning to rock in a sort of stoic James Dean kind of way instead" [test/mediawiki/core] (master) - https://gerrit.wikimedia.org/r/1826 [17:26:15] Change restored: RobLa; ":-P" [test/mediawiki/core] (master) - https://gerrit.wikimedia.org/r/1841 [17:28:00] mark: it looks like we're going to need per repository bots or something on Gerrit maybe [17:28:18] gerrit already reports in a different channel for some projects [17:28:20] (labs) [17:28:33] but yeah [17:28:39] oh...ok, so we probably just need to switch the test repo over to #mediawiki [17:28:47] you'll have a ton of repos, and we'll have a bunch too [17:29:12] we can do it on prefix [17:29:21] operations/*, labs/*, mediawiki/* [17:29:23] or something like that [17:29:56] I wonder if you're gonna miss the same sort of functionality that I miss most [17:30:11] we let ops engineers merge everything themselves most of the time, but I review everything afterwards [17:30:16] gerrit 
doesn't support that well at all [17:30:24] I believe mediawiki code review does that better, with its FIXMEs [17:30:34] but you probably won't merge until it's reviewed, right? [17:30:49] yeah, merge will happen post review [17:31:01] unfortunately that doesn't work so well for us of course ;) [17:31:12] Well you guys really have a post-commit review model [17:31:19] Because you literally put it in production first, then review it later [17:31:36] in theory everything will go via labs/test repo first [17:31:41] but in practice, we're a long way from there ;) [17:31:42] Right [17:31:45] anyway, food now [17:31:46] bbl [17:31:48] see ya [17:32:43] robla: Just switching over the channel isn't trivial. I started reorganizing some stuff to also support PHP/JS lint checks in https://gerrit.wikimedia.org/r/#change,1794 [17:33:21] I suppose I'd better get something in BZ about that [17:33:28] But essentially it's all code, no config [17:33:43] That rev needs someone with Python chops to finish it [17:57:39] New patchset: Dzahn; "git rid of IP in Apache virtual host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1842 [17:57:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1842 [17:58:01] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1842 [17:58:01] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1842 [18:26:53] !log gracefulling apache on spence to deactivate nmis.w.o (abandoned install of nedi) [18:26:55] Logged the message, and now dispatching a T1000 to your position to terminate you. [18:28:14] !log lvs1003 repaired, now needs install and setup. 
rt1549 and rt 2241 [18:28:16] Logged the message, RobH [18:38:11] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 9, down: 0, shutdown: 0 [18:40:36] !log running authdns-update on dobson to pick up new dns temps [18:40:39] Logged the message, and now dispaching a T1000 to your position to terminate you. [18:48:01] Ryan_Lane: you were doing the SSL certs, IIRC, right? [18:48:15] in which way do you mean? [18:48:22] https://bugzilla.wikimedia.org/show_bug.cgi?id=33657 [18:48:39] Ryan_Lane: installing them and/or getting them from digicert [18:49:04] dude, IE8 …. grrrrr...... [18:49:40] exactly... we just killed IE6 last year, didn't we? [18:49:44] user needs to install RapidSSL CA cert in browser manually? [18:49:45] no [18:49:50] We still support IE6 [18:50:03] Erik Z is working on new browser stats from the squid reports [18:50:11] Hopefully see if we've dropped from octobers 2.22% [18:50:42] http://stats.wikimedia.org/wikimedia/squids/SquidReportClients.htm [18:50:48] hexmode: ... [18:50:49] Reedy: I know *we* didn't. I mostly meant the larger web community [18:51:04] hexmode: you really should ask for more info when things like this come in [18:51:10] he gave basically no information [18:51:16] Ryan_Lane: just saw it [18:51:39] well, you have browser info... and OS [18:51:50] I can try on this spare laptop here [18:51:55] it still has IE8 [18:52:27] yeah, but he could be hitting some x.x.wikipedia.org address [18:52:35] or x.wikipedia.com [18:52:38] or x.wikipedia.net [18:53:24] and I have *really* hard time believing IE8 doesn't trust the cert [18:54:06] all IE uses NSS (the microsoft one, not the mozilla one), so we'd also get this behavior on other IEs [18:55:00] WFM on en.wikipedia.org [18:55:03] there are supposedly issues with some mobile devices..but dont see IE8 issues so far [18:55:12] "From 9th December 2010 RapidSSL and Geotrust certificates have been issued from a new 2048Bit root. 
The GeoTrust Global CA root is not installed on various mobile devices." [18:55:26] but thats been said almost a year ago now [18:58:28] New patchset: Lcarr; "Adding lcarr to sms group, removing dcooper" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1843 [18:58:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1843 [18:59:10] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1843 [18:59:11] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1843 [19:03:37] New patchset: Asher; "user class for dartar + adding to admins::restricted" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1844 [19:03:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1844 [19:05:10] New patchset: Asher; "user class for dartar + adding to admins::restricted" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1844 [19:05:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1844 [19:07:02] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1844 [19:07:03] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1844 [19:09:22] robh: srv191 new hdd arrived [19:11:01] cmjohnson1: admin logged? [19:14:31] disregard [19:15:01] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [19:15:07] so srv191 is a single disk apache with a bad disk [19:15:12] and the repalcement is on site, server is offline [19:15:26] ? [19:15:34] seems that way [19:15:46] yea, ok [19:15:59] cmjohnson1: so yea, just confirm the drive is same, you can pull, system is powered down [19:16:16] git experts ? 
how do i reset just 1 file to origin/production's version but not all the other ones ? [19:16:55] git reset --hard -- path/to/file [19:17:48] hrm, doesn't work on just one file [19:17:56] and without hard it doesn't fix it... [19:17:57] git checkout origin/production -- path/to/file then? [19:18:03] ah, good idea [19:20:04] robh: srv191 is multiple disks [19:22:40] urgh [19:22:48] why does an apache have multiple disks [19:23:12] checking something [19:23:35] ok, it shouldn't have that many disks, taking on serial console to verify some things [19:24:07] okay [19:25:25] cmjohnson1: what makes you think it has multiple disks? srv190 and srv191 are single disk [19:25:30] are you sure there are disks in all those slots? [19:25:35] cuz usually they are empty. [19:26:13] and the filler slot is different looking than the normal one with a disk [19:26:48] no...i am not sure..i didn't pull them out...i will check [19:27:02] there is no reason they should have more than one disk. [19:27:08] i am booting it now [19:27:22] the filler slots are quite distinctly different [19:27:26] no hdd leds and the like. [19:27:39] and confirmed in its sas, it's only the one disk [19:27:51] so take note of the differences between active disks and what the filler slots look like [19:27:53] you are correct...only one disk [19:28:12] so that system is now powered down [19:28:31] you can swap the disk, boot with crash cart and confirm it sees the disk (ctrl+c in the SAS bios prompt) [19:28:36] if it sees it, power it back down and update the ticket [19:28:49] ok [19:28:55] and make a new ticket in core-ops, stating that srv191 has been repaired and needs a new installation =] [19:29:14] since we replaced its one and only disk, it has no data. 
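Note on the git exchange at 19:16-19:18: `git reset --hard` refuses path arguments, and a plain `git reset -- path` only touches the index, which is consistent with the two failures reported in the channel. `git checkout <ref> -- <path>` rewrites both the index and the working tree for that one file. A minimal Python sketch of that last command follows; the helper names `restore_file_cmd` and `restore_file` are hypothetical, invented here for illustration, and only the git invocation itself comes from the log.

```python
import subprocess


def restore_file_cmd(ref, path):
    """Build the argv for: git checkout <ref> -- <path> (hypothetical helper)."""
    # "--" separates the ref from the path so a file name is never parsed as a ref.
    return ["git", "checkout", ref, "--", path]


def restore_file(ref, path, repo_dir="."):
    """Overwrite one file (index and working tree) with its version at `ref`."""
    # check=True raises CalledProcessError if git exits non-zero.
    subprocess.run(restore_file_cmd(ref, path), cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Same command as suggested in the channel; printed here rather than executed.
    print(" ".join(restore_file_cmd("origin/production", "path/to/file")))
```

Other files in the working tree are left alone, which is what the question asked for; any local edits to the named file are discarded without a prompt.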
[19:30:40] !log working on mw1102, disregard flapping [19:30:42] Logged the message, RobH [19:37:05] !log mw1102 offline due to bad mainboard until replacement arrives tomorrow or next [19:37:07] Logged the message, RobH [19:40:52] !log mw1103 hardware issues, disregard nagios flapping [19:40:53] Logged the message, RobH [19:49:21] PROBLEM - Host mw1102 is DOWN: PING CRITICAL - Packet loss = 100% [19:51:01] RECOVERY - Host srv191 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [19:52:39] !log replaced HDD srv191 [19:52:40] Logged the message, Master [19:53:04] !log shutting down srv191 for new install [19:53:05] Logged the message, Master [20:01:51] diederik: for the glam filter - do you care if the full log line is written, or would it be ok to just write the nara filename per request? [20:03:03] pref. full line, else filename & timestamp & referrer as minimum [20:08:17] ok, full line it will be [20:10:42] New patchset: Dzahn; "in the generic check_disk command used in base, ignore tmpfs filesystems" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1845 [20:11:12] diederik: actually, i'm going to write just the three you mentioned - let me know if you actually want more [20:11:28] perfect! [20:12:54] log lines will look like: [20:12:56] 2012-01-11T00:03:37.021 http://upload.wikimedia.org/wikipedia/commons/thumb/2/2f/Photograph_of_White_House_Meeting_with_Civil_Rights_Leaders._June_22%2C_1963_-_NARA_-_194190.tif/lossy-page1-220px-Photograph_of_White_House_Meeting_with_Civil_Rights_Leaders._June_22%2C_1963_-_NARA_-_194190.tif.jpg http://en.wikipedia.org/wiki/Martin_Luther_King,_Jr. [20:14:46] !log adjusted firewall rules on payments* to restore ganglia reporting since we switched to nickel [20:14:48] Logged the message, Master [20:17:37] is there a rasonable way to create a .deb from java source? [20:18:37] New review: Dzahn; "check_disk_6_3 is what is used as generic check_disk in base. (if you really want to check tmpfs fil..." 
[operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1845 [20:22:48] PROBLEM - Host srv191 is DOWN: PING CRITICAL - Packet loss = 100% [20:24:07] New patchset: Asher; "glam filter for national archive - rt2212" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1846 [20:24:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1846 [20:25:30] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1846 [20:25:31] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1846 [20:26:51] robh: i told you the wrong rack that is nearing 100% of power usage. B5 not B4 [20:32:41] huh? [20:32:50] cmjohnson1: told me where? [20:32:55] if email, please send correction email [20:33:06] cuz im in the middle of other things and not even paying attention to tampa [20:35:16] i am updating the email...now [20:46:33] LeslieCarr: yer names kill me [20:46:40] im not sure that meets our requirements for naming ;p [20:46:46] ganglia1001 ;p [20:46:54] we dont name things by the software on them! [20:47:04] i guess this is clustered but still [20:47:20] LeslieCarr: this is all they will ever run on them? [20:50:27] kicked ticket back to you with questions. [20:50:57] short break, brb [20:58:00] New patchset: Hashar; "gallium: allow postgreSQL administration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1847 [20:58:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1847 [20:58:59] New review: Petrb; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1847 [21:00:25] thanks :) [21:01:18] ^demon: can you have a look at gerrit change 1847. 
It is to let us administrate postgreSQL :-) [21:16:44] !log i just took nickel offline by mistake [21:16:46] Logged the message, RobH [21:16:54] the thing was mislabeled on the back >_< [21:17:01] woops [21:17:08] !log ganglia offline for a moment, sorry folks [21:17:09] Logged the message, RobH [21:18:46] ok looks like I have a good image... saving and going to do something else [21:18:53] PROBLEM - ps1-a5-sdtpa-infeed-load-tower-A-phase-Y on ps1-a5-sdtpa is CRITICAL: ps1-a5-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2875* [21:19:01] good times, now i get to redo the labels for that entire rack [21:20:19] oh well, better i discover it now and have fixable downtime [21:20:30] than later with accidental downtime compounded by this. [21:21:03] and ganglia doesn't seem to be starting up smoothly, also good to know [21:22:39] yea, was just noticing that, you are taking care of? [21:22:52] !log updated dns for neon/cobalt to ganglia1001/1002 [21:22:54] Logged the message, RobH [21:23:02] thats just mgmt dns mind you LeslieCarr [21:23:08] i didnt assign you any ip info for the nic [21:23:10] yep :) [21:23:24] also, confirm you have ganglia startup issue ? [21:23:31] so i dont need to go pokin at it [21:23:37] can you set up ip info, puppetize aggregation, and push it out plz? ;) [21:23:43] confirmed gmetad didn't start up [21:23:54] i am trying ot figure out why [21:24:09] !log leslie is handling the ganglia not starting back up issue even though i caused it to die, yay me [21:24:10] Logged the message, RobH [21:25:25] i don't know why gmetad is not starting up tho [21:25:58] anyone else want to check out nickel ? it's in rc.d in all the proper run levels, in init.d, service start works [21:29:43] it fired up after system reboot in the lab instance? [21:30:15] if i had completed the lab instance, perhaps it would have ? 
[21:30:25] bad me :( [21:31:20] if it's actually working enough to be online i am going to ignore the issue since i'm in eqiad [21:31:29] sorry ;] [21:33:03] LeslieCarr: it seems psw1-eqiad has no mgmt connection up [21:33:07] i will run it shortly [21:33:10] thanks [21:36:53] !log psw1-eqiad mgmt connected [21:36:55] Logged the message, RobH [21:36:58] LeslieCarr: ^ [21:37:04] huzzah [21:38:16] RobH did you see the update on psw1 ? it's got a sfp not sfp+ optic in [21:38:23] and it doesn't see the 2nd link :( [21:38:37] then we need to order more optics [21:38:47] cuz we have no more of the sfp+, all the ones i installed were the same. [21:38:57] they are the ones in psw2 presently [21:39:01] just a bunch of them. [21:39:10] okay, let's order a new bunch :) should i put a ticket in ? [21:39:21] or just order a bunch and expense them? ;) [21:39:29] yea since you can pull the model info off the ones you want [21:39:33] put in a procurement ticket [21:39:40] i will then order with my card [21:39:44] i should need 4, so i would say we get 6, since we should have a few extra [21:39:56] let's double it to 8 really [21:40:06] i think ideally we want quite a few spare of this [21:40:08] since it's something we use a lot of. [21:40:17] okay, we'll be needing some in the future anyways, next transit connection hopefully coming soon, etc [21:40:25] and row C i am ordering this week [21:40:40] so shortly i will be getting quote for the networking gear for that, and those will need more [21:40:52] 4 more in fact, so let's add that. 
[21:40:54] oh definitely, that's 8 right there [21:40:58] 4 for each side [21:41:06] yea, my bad [21:41:08] so we need like 16 [21:41:13] no worries, i often forget that stuff :) [21:41:15] 8 for row c and then 4 for this [21:41:20] and then 4 for spare [21:41:31] lets us add 2 new connections without issue or ordering [21:41:42] since i know we already have to have 2 reserved for something upcoming [21:41:49] i'm gonna update it to 18 [21:42:01] i wanna round it to 20 [21:42:03] just because. [21:42:06] okay [21:42:10] that's a valid reason i think =] [21:42:21] i also am thinking 2 multimode, just in case we have to use one in an emergency [21:42:23] plus with more in use, we should keep more spare for defective items [21:42:40] it's odd for one to suddenly die i assume [21:42:43] but still [21:42:50] actually [21:42:59] lemme find the site we order from so you can see [21:43:09] i mean it's unexpected but not rare as far as issues that happen commonly [21:43:15] also, do you have a fiber cleaner ? [21:44:02] the paper thing that comes with some switches? [21:44:16] the cheap one like that yea... [21:44:18] usually in a little green box thing ? [21:44:21] somewhere around here [21:44:25] but no real cleaner no [21:44:29] okay [21:44:42] if we should order, and you see it online, link in ticket [21:44:57] i am trying to get a good list of standard items we want in each deployment [21:45:10] cuz finding we are short on things always of course =] [22:15:39] New patchset: Lcarr; "Adding in traceroute to bastion hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1848 [22:15:50] RECOVERY - ps1-a5-sdtpa-infeed-load-tower-A-phase-Y on ps1-a5-sdtpa is OK: ps1-a5-sdtpa-infeed-load-tower-A-phase-Y OK - 2400 [22:15:59] lily seems to have died (lists.wm.o is down), SSH isn't responding [22:16:03] Can someone have a look please? 
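Note on the SFP+ optics count worked out between 21:39 and 21:42: the figures quoted in the channel can be tallied as below. This is a rough Python sketch, not something that was run in the channel. The dictionary keys and the `round_up` helper are made-up labels, and reading the jump from 16 to 18 as the 2 reserved optics, and the final 20 as rounding up to a multiple of 10, are assumptions, since the log does not spell either out.

```python
# Quantities as stated in the channel (SFP+ optics).
needs = {
    "row_c": 8,  # "4 for each side"
    "psw1": 4,   # "4 for this"
    "spare": 4,  # "4 for spare"
}

base = sum(needs.values())    # the "like 16" figure
reserved = 2                  # assumption: the 2 reserved for something upcoming
ticket_qty = base + reserved  # matches "update it to 18"


def round_up(n, multiple):
    """Round n up to the next multiple (ceiling division on integers)."""
    return -(-n // multiple) * multiple


order_qty = round_up(ticket_qty, 10)  # assumption: "round it to 20"
print(base, ticket_qty, order_qty)
```

The 2 multimode optics mentioned at 21:42:21 are a separate item and are not included in this count.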
[22:16:10] !log lists.wikimedia.org is down [22:16:11] Logged the message, Master [22:17:01] anyone have a prob with adding traceroute into the bastion boxes ? [22:18:45] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1848 [22:18:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1848 [22:19:07] LeslieCarr: and mtr please. [22:19:20] okay, adding that as well :) [22:19:30] oh mtr is already on [22:19:30] yay [22:21:04] is somethign up with puppet ? [22:21:15] oh i'm not root, nm [22:24:50] PROBLEM - Puppet freshness on srv272 is CRITICAL: Puppet has not run in the last 10 hours [22:27:34] New patchset: Lcarr; "changing traceroute-nanog to traceroute due to perm issues" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1849 [22:27:55] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1849 [22:27:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1849 [22:30:20] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Jan 11 22:30:01 UTC 2012 [22:31:40] Reedy: someone fixed i assume? [22:31:43] cuz it seems ok to me. [22:32:16] !log no its not ;] [22:32:17] Logged the message, RobH [22:32:20] fixed itself [22:32:24] It was down [22:33:04] when skynet rises up, we know it started on lily. [22:34:25] RobH: do you want RT-2250? [22:34:48] or should I assign it to cmjohnson or mark or someone else? [22:35:33] uhhh, isnt that half done in another ticket already? [22:35:48] yea, it was... [22:35:48] I only saw 'purchase' in the other ticket. [22:36:17] maplebed: yea, they are already racked. [22:36:18] oh wait, now I remember. [22:36:21] in different racks. [22:36:24] you created a ticket to rack them that I made you redo. [22:36:25] we had to move one [22:36:34] I'm sorry, I forgot about that one (and didn't link it to the swift ticket.) 
[22:36:37] ms-fe1 and ms-fe2 [22:36:43] grr.... [22:36:51] * maplebed looks [22:36:51] ? [22:37:00] (I'm annoyed I forgot.) [22:37:43] 2211. [22:38:17] thanks. [22:39:38] !log mw1081 ready for install rt2251 [22:39:40] Logged the message, RobH [22:41:57] LeslieCarr: I assigned RT2211 back to cmj since the last message there was you asking him for MAC address info. [22:42:09] !log mw1099 repaired, ready for os install per rt2252 [22:42:11] Logged the message, RobH [22:42:39] thanks maplebed [22:42:39] but I have another question about that ticket - it lists 'public vlan', but I would like ms-fe{1,2} to have private addresses. Should I change that ticket to 'private vlan'? [22:42:52] (I initially thought they would need public addresses, but that's not the case.) [22:43:35] oh. nevermind. "You can only reassign tickets that you own or that are unowned." I didn't reassign it to cmj. but I think you should. ;) [22:43:54] you can steal and then assign, its kinda annoying [22:45:40] !log mw1108 online and ready for install per rt2253 [22:45:41] Logged the message, RobH [22:45:47] i reassigned [22:56:05] !log poking searchidx1001 for memory error [22:56:07] Logged the message, RobH [22:59:59] PROBLEM - Puppet freshness on db22 is CRITICAL: Puppet has not run in the last 10 hours [23:18:54] !log searchidx1001 offline and powered down until replacement memory arrives (2012-01-13) rt 2208 [23:18:55] Logged the message, RobH