[00:00:41] Looks like I need chown -h [00:01:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:01:17] That seems to work [00:01:59] !log Restarting the find-chown, this time with -h so symlinks are handled correctly (for some reason there are a bunch of broken symlinks with weird characters out there...) [00:02:08] Logged the message, Mr. Obvious [00:02:11] RoanKattouw: Wouldn't it be better to just wipe those broken links? [00:02:17] They point to nowhere anyway [00:02:18] I bet those files are 2004ish [00:02:22] !log find /export/upload/wik*/*/{0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f,archive,math,temp,timeline} ! -user apache -exec /root/fixownership2 \{\} \; where fixownership2 = chown -h apache $1 [00:02:30] Logged the message, Mr. Obvious [00:02:30] AaronSchulz: 2008 actually [00:02:50] weird [00:02:54] hoo: Hmm, maybe, but I don't feel comfortable deleting stuff from this system without talking to Aaron (and maybe Ariel) first [00:04:13] Sure, just saying... broken symlinks only cause trouble [00:04:56] if you're having spare time, you could join it against the DB to look whether one of those files still "exists" (or at least the wiki things so) [00:05:03] PROBLEM - NTP on analytics1011 is CRITICAL: NTP CRITICAL: No response from NTP server [00:10:57] New patchset: Ryan Lane; "Fix ldap client settings for nfs1/2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19968 [00:11:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.892 seconds [00:11:42] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19968 [00:11:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19968 [00:19:23] !log running mwscript purgeParserCache.php --wiki=enwiki --age=1209600 [00:19:32] Logged the message, Master [00:21:49] New patchset: Dzahn; "move zirconium from private to public" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19969 [00:22:30] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19969 [00:24:12] !log running "mwscript purgeParserCache.php --wiki=$db --age=1814400" instead [00:24:22] Logged the message, Master [00:27:45] New patchset: Ryan Lane; "Fixing script user info" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19972 [00:28:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19972 [00:29:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19972 [00:40:58] !log installing package upgrades on singer [00:41:07] Logged the message, Master [00:43:36] Getting a database error [00:43:44] saying there are too many concurrent transactions [00:43:47] on enwiki [00:53:21] Jasper_Deng: thanks [00:53:33] binasher: what happened? 
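Editorial note on the `chown -h` fix logged at 00:01:59 and 00:02:22: the point of `-h` is to change the owner of the link itself rather than the file it points to, which is what fails when the target no longer exists. Below is a minimal Python sketch of the same idea. It assumes nothing about the real `fixownership2` script beyond the one-liner quoted in the log, and the function names are hypothetical. Unlike the `find` command, it does not test the starting directory itself.

```python
import os


def not_owned_by(root, uid):
    """Yield paths under root whose owner is not uid, without following symlinks."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat() inspects the link itself, so a broken symlink does not raise.
            if os.lstat(path).st_uid != uid:
                yield path


def fix_ownership(path, uid):
    """Rough equivalent of `chown -h <user> path`: never dereferences a symlink."""
    os.lchown(path, uid, -1)  # -1 leaves the group unchanged
```

Changing the owner to a different user normally requires root, as it did for the run in the log.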
[01:00:31] New patchset: Dzahn; "let partman take all the space that is left instead of fixed value that was too high for zirconium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19977 [01:01:21] New review: Dzahn; "thanks Ben" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/19977 [01:01:21] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19977 [01:20:03] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19765 [01:20:22] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19751 [01:32:37] New patchset: Ryan Lane; "labs puppetmasters should be cas..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19979 [01:33:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19979 [01:33:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19979 [02:02:57] New patchset: Jeremyb; "followup Iec13c027653f21d0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19981 [02:03:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19981 [02:05:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19981 [02:10:51] !log fixed arecord issues with labsconsole by adding an exception handling live hack for the jobs [02:11:05] Logged the message, Master [02:12:17] !log pushed in large puppet change for ldap, openstack, gerrit and ldap pdns to make it more modular and to add support for eqiad region [02:12:27] Logged the message, Master [03:17:46] New patchset: Dzahn; "add nagios monitoring group "misc_pmtpa" because Nagios is broken without it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19983 [03:18:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19983 [03:18:59] New review: Dzahn; "just to fix Nagios right now.." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/19983 [03:19:00] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19983 [03:47:07] PROBLEM - Puppet freshness on lvs1 is CRITICAL: Puppet has not run in the last 10 hours [03:47:07] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Puppet has not run in the last 10 hours [03:47:07] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [03:47:07] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [03:47:07] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [03:57:28] !log nagios back up after adding missing monitor groups misc_pmtpa in appserver role (srv194) [03:57:34] out [03:57:37] Logged the message, Master [04:34:28] RoanKattouw_away: best to look at one of the symlinks directly and see if it's referenced anywhere in the db [04:35:06] I would expect titles like that to have been cleaned up but I've run across a few bad ones before [08:37:07] heya 
[08:40:02] who's the heya to? [08:44:09] everyone :) [08:53:39] apergos: can you please resolve two bugs for me? [08:54:17] it's going to depend on what they are [08:54:54] https://bugzilla.wikimedia.org/show_bug.cgi?id=39402 [08:54:59] https://bugzilla.wikimedia.org/show_bug.cgi?id=39399 [08:57:13] um [08:57:20] where are these two new statuses if they have been added? [08:57:57] what do you mean where? [08:58:21] shouldn't I see them in the drop down list under status at the bottom of the bug? [08:58:39] after you add them, you will [08:59:20] oh, you are asking me to make these changes to bugzilla [08:59:22] I see [08:59:55] I think you should ask someone who actually is involved in bugzilla maintenance or admin to some degree [09:00:05] such as? [09:00:18] I don't follow any of the conversations about it, and I have no idea what anyone wants over there [09:00:56] who does? [09:01:05] I'm looking into that now [09:02:38] well it looks like thehelpfulone is the most active recently [09:02:40] https://bugzilla.wikimedia.org/buglist.cgi?list_id=139066&resolution=FIXED&query_format=advanced&component=Bugzilla&product=Wikimedia [09:03:50] but reedy seems to also be doing things [09:04:05] so I would check with one of them and see if that's appropriate [09:10:39] thanks [09:10:50] Reedy: ping [09:10:52] I'll be lurking to see what they say [09:11:04] great, thank you [09:11:47] sure [10:27:57] Logged the message, Master [10:28:06] Logged the message, Master [10:28:16] Logged the message, Master [10:28:25] Logged the message, Master [10:28:34] Logged the message, Master [10:28:43] Logged the message, Master [10:28:52] Logged the message, Master [10:29:01] Logged the message, Master [10:29:10] Logged the message, Master [10:29:20] Logged the message, Master [10:29:29] Logged the message, Master [10:29:39] Logged the message, Master [10:35:40] fucker [10:50:05] !log Setup semi-sync snapmirror from nas1-a:home_pmtpa to nas1001-a:home_pmtpa [10:50:16] Logged the message, 
Master [11:26:05] Logged the message, Master [11:26:13] Logged the message, Master [11:26:22] Logged the message, Master [11:27:04] grrr [11:27:24] does anyone has op around here? [11:27:28] Logged the message, Master [11:27:37] Logged the message, Master [11:27:40] we should at least ban pp-pdf2 and pp-pdf3 [11:27:45] Logged the message, Master [11:27:50] although I'd say pp-pdf1 one too [11:30:47] apergos: do you have op? [11:30:59] or mail contacts for those people [11:31:06] I don't think so [11:31:09] (for op) [11:31:16] and I definitely don't (have email contacts) [11:31:39] tfinc? [11:32:58] New patchset: Matthias Mullie; "Bug 36772 - Article Feedback - Supporting feedback on help pages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17503 [11:37:15] paravoid: why ban? [11:37:33] they had been specifically asked to log it when they do updates [11:37:38] by Tim IIRC [11:37:43] can't look at channel access list so can't tell but I think that counts as a "no" [11:39:28] just stop caring [11:40:30] https://wiki.openstreetmap.org/w/index.php?title=Talk%3AWiki&action=historysubmit&diff=796940&oldid=796938 [11:40:38] ehm https://wikitech.wikimedia.org/index.php?title=Server_admin_log&curid=3768&diff=50285&oldid=50284 [11:43:41] heh [11:43:51] they could leave em in for the one server [11:43:57] *shrug* [12:05:41] !log Removed OSPF metric on xe-5/2/1.0 on cr2-eqiad, to move eqiad->pmtpa traffic to the lower latency link [12:05:50] Logged the message, Master [12:12:18] paravoid: what's the labs glustermanager cron spam that can't contact ldap? 
[13:26:41] !log Enabled and started SIS deduplication on home_pmtpa on nas1-a [13:26:50] Logged the message, Master [13:37:48] No working slave server: Unknown error (10.0.6.43)) [13:38:11] seems a glitch [13:39:54] but it's quite slow [13:47:56] mark: sorry just came on again [13:48:10] mark: labstore2 can't connect to virt0:389 apparently, trying to understand why [13:55:25] New review: Faidon; "I don't disagree with the change but rather with its stated effect (and hence commit message). $is_l..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/19786 [13:59:49] *sigh* [14:01:58] oh? [14:09:26] Isn't labs on prod now anyway? So that change should pretty much have 0 effect. [14:09:47] huh? [14:09:55] how do you figure labs is on prod? [14:10:14] we decided to drop the test branch a while ago [14:10:23] ^ [14:10:37] oh, *that* production [14:10:44] The branch not the server. [14:10:53] not e.g. realm or some other things [14:10:53] s/server/puppetmaster/ [14:15:37] !log stopping puppet on brewster to continue partman troubleshooting for analytics dells [14:15:46] Logged the message, Master [14:32:20] New patchset: Jgreen; "remove fetch_udplogs from aluminium/grosley, it's handled by netapp replication now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20023 [14:33:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20023 [14:34:40] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20023 [14:54:02] New patchset: Mark Bergsma; "Add some classes for NFS mounts from the NetApps" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20025 [14:54:43] New patchset: Mark Bergsma; "Mount the home_pmtpa volume on bast1001:/srv/home_pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20026 [14:55:23] New review: gerrit2; "Change did not pass lint check. 
You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/20025 [14:55:23] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/20026 [14:55:50] New patchset: Mark Bergsma; "Add some classes for NFS mounts from the NetApps" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20025 [14:56:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20025 [14:58:05] New patchset: Mark Bergsma; "Mount NFS home_pmtpa on bast1001:/srv/home_pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20027 [14:59:04] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20026 [14:59:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20027 [14:59:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20025 [14:59:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20027 [15:04:28] New patchset: Mark Bergsma; "Correct eqiad hostname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20028 [15:05:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20028 [15:07:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20028 [15:10:27] New patchset: Jgreen; "deprecate manual replication of gzipped fundraising udplogs, instead netapp replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20030 [15:11:10] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20030 [15:11:12] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20030 [15:20:41] New patchset: Mark Bergsma; "Need to mount the othersite NFS volumes readonly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20032 [15:21:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20032 [15:21:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20032 [15:24:48] New patchset: Mark Bergsma; "Fix volume paths" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20033 [15:25:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20033 [15:25:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20033 [15:51:12] Hey opsies + RobH [15:51:19] agh, ottomata1! [15:51:20] brb [15:51:40] ok phew [15:51:42] that's better [15:51:43] yeah heya [15:51:49] Can someone get me a wikitech account? [15:52:55] New patchset: Ottomata; "analytics-dell.cfg - Not using swap (for now) Confirming to skip past no swap warning, also confirming to overwrite partition table." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20036 [15:53:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20036 [15:56:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20036 [15:58:57] !log starting puppet on brewster. (Woo, partman looking better!) [15:59:06] Logged the message, Master [16:01:00] Hmmm. Getting a lot of parser/language related OOMs atm [16:07:23] Logged the message, Master [16:09:06] cmjohnson1: i don't think account creation's been open for a while? 
[16:10:11] i doubt it [16:10:24] also l10n cache is broken there so stuff like http://wikitech.wikimedia.org/view/Special:Log/newusers makes no sense [16:11:06] and regular users also can't make new accounts: You do not have permission to create this user account, for the following reason: The action you have requested is limited to users in the group: Administrators. [16:11:35] so you need one of these ppl: http://wikitech.wikimedia.org/view/Special:ListUsers/sysop [16:12:07] haha, morebots is an administrator? [16:13:35] morebots is a helpful guy! [16:13:51] or LeslieCarr ;) [16:13:55] ? [16:14:01] nagios-wm is the most helpful [16:14:04] LeslieCarr: ottomata needs a wikitechwiki account [16:14:11] oh [16:14:15] let's see if i have admin access [16:14:20] :) [16:14:20] you do! [16:14:29] oo thank you! [16:14:33] oh yay [16:14:57] LeslieCarr: you're in order right before ma rk coincidentally [16:15:02] what's your user ? [16:15:08] haha, did mark do that on purpose ? ;) [16:15:18] i think it's alphabetical [16:18:27] hah, paravoid's a crat but not a sysop? [16:21:05] i'm in! thanks LeslieCarr [16:25:06] * mark always hides behind leslie when there's work to be done ;-) [16:25:18] haha [16:25:19] ) [16:25:21] :) [16:29:26] !log Stopping puppet on brewster again. Sigh. PARTMAAAAN! [16:29:36] Logged the message, Master [16:33:41] maplebed: here? [16:33:50] yeah. [16:33:53] on the phone [16:35:07] okay [16:35:23] ok, back. [16:35:25] so, the window is in 1½ hour [16:35:36] what were you planning to observe? [16:35:40] any particuarly interesting graphs? 
[16:35:59] there are a few graphs and logs [16:36:21] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=swift+frontend+proxies [16:36:44] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Swift%2520pmtpa&tab=m&vn=swift+frontend+proxies [16:38:13] then tailing /home/w/logs/syslog/swift on fenari [16:38:25] (I'd probably set up a tail specific to originals) [16:38:43] but I don't actually have much expectation that we'll see interesting things in any of those sources [16:38:48] cmjohnson1: i can do what when back? [16:38:54] make a wikitech account? [16:39:10] so tomorrow all ms servers can die with no issues, right? ;) [16:39:23] another thing I've done in the past is run tcpdump on one of the proxies filtering out just GET requests [16:39:23] no [16:39:25] mark: not yet. [16:39:33] mark: squids still point to ms [16:39:34] but close. [16:39:35] bummer [16:39:38] and this isn't scheduled for today [16:39:57] still [16:40:02] if they'd die, we wouldn't be in a lot of trouble [16:40:03] and I don't think that we should do two things at the time too [16:40:25] +1 paravoid [16:40:29] agreed [16:40:38] yeah, it'd probably be less work to fix squid rather than fixing ms in case of an incident, as I see it [16:40:44] correct me if I'm wrong Ben [16:41:23] paravoid: I think the thing that will actually make me comfortable it's working right is to do tests around original upload and fetcthing and see that it's coming from MW i [16:42:56] RobH: it was done already (wikitech acct) [16:43:21] mark, can I ask you a question about some partman stuff? 
[16:43:28] yes [16:43:33] notpeter has been helping me but I think we are a little stumped, maybe you'd have an idea [16:43:42] i'm trying to get the new analytics dells installed [16:43:44] i'm so so so close [16:43:48] i'd like [16:43:52] 30GB physical / [16:44:02] and 12GB swap, or no swap at all, i don't really care right now [16:44:16] if I do physical swap, no matter what I say, it fills up the rest of the disk [16:44:25] if I remove the swap partition in the recipe, then / fills up the whole disk [16:45:16] paravoid: I'm going to go talk to aaron and see what other things we can look at. [16:45:22] biab. [16:46:06] this is the current recipe [16:46:07] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=files/autoinstall/partman/analytics-dell.cfg;h=3d61c91350c21f00e0fdab33326a62d72937b417;hb=HEAD [16:47:57] ok, so i see it as just having 30GB physical right now... [16:48:03] if i remember my partman) [16:48:12] yup [16:48:15] except [16:48:17] after install [16:48:20] it fills all of sda [16:48:29] if I make a second partition (swap) [16:48:35] / will be 30GB [16:48:49] and the second partition will fill the rest of the space on sda [16:49:33] did you try d-i partman-auto-lvm/guided_size string 80% [16:49:42] no, but I am not using LVM [16:49:53] right? [16:49:56] why are you not [16:49:59] for / [16:50:00] ? 
[16:50:35] just use partman/lvm.cfg and be happy [16:51:16] ah it has /boot [16:51:19] physical [16:51:21] hmmm [16:51:23] you know what [16:51:24] OK [16:51:25] don't even make your own [16:51:27] that is fine with me [16:51:27] just use that [16:51:37] I have no idea why people keep making custom new recipes when it doesn't seem to matter at all [16:51:44] I think it's a big waste of time [16:51:52] it will matter eventually [16:51:56] once we know how we want these things partitioned [16:52:00] i will need to make one [16:52:04] for the others we wanted mirrored raid on / [16:52:07] this has an SSD on / [16:52:10] so no mirrored raid [16:52:13] sorry [16:52:17] SSD on sda* [16:52:25] but ja, i will try this [16:52:50] Logged the message, Master [16:57:54] ok, I have those two ganglia graphs open [17:01:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20040 [17:02:06] ms-be3 in the pool but not up. hm [17:15:53] cool, lvm.cfg works fine [17:15:54] thanks mark! [17:31:11] paravoid: aaron suggests that the best test will be to verify that thumbnail creation and purging works as expected. [17:32:23] okay [17:32:47] apergos: seriously?! ::sigh:: ms-be1 went down yesterday, and ms-be3 today. grumblegrumble. [17:32:57] anything on console? or just time to powercycle... [17:33:34] I didn't check thaat, I just saw it in the log with 'error syncing with node' [17:33:55] do we know what makes em fall over? [17:34:20] ms-be1 was looking like the 219d bug [17:34:25] but I checked uptime on the rest. [17:34:51] look at the dates on the boot logs on ms-be1 [17:35:56] yesterday ms-be3 had an uptime of 146 days. [17:36:21] do you want to poke at ms-be3 or should I? [17:37:00] oh, I can power cycle it [17:37:25] or not [17:37:28] uh [17:38:03] please go ahead. I usually also watch the console on boot to see if it does something funny like a fs check [17:38:47] how do I get on mgmt though? 
[17:38:49] cause [17:38:55] ariel@fenari:~$ ssh -l root ms-be3.mgmt.pmtpa.wmnet [17:38:55] ssh: connect to host ms-be3.mgmt.pmtpa.wmnet port 22: Connection refused [17:39:00] what stupid thing am I missing? [17:39:08] that's right! these are dell c2100s. they don't do ssh. only ipmi. [17:39:12] \o/ [17:39:30] http://wikitech.wikimedia.org/view/Dell_PowerEdge_C2100 [17:39:43] ohjoy [17:39:54] :) [17:39:59] hey guys [17:40:24] dschoon wants to use oracle's java for the analytics machines [17:40:27] (hadoop, etc.) [17:40:30] why? [17:40:50] drdee, dschoon? [17:41:10] using oracle java has a lot of issues [17:41:29] Hadoop requires Java 1.6+. It is built and tested on Oracle (nee Sun) Java, which is the only "supported" JVM. [17:41:31] we do use it for certain things [17:41:33] among philosophical concerns, there's also legal problems; we can't e.g. put it into apt.wikimedia.org (as far as I understand) [17:41:34] ok that failed: [17:41:41] right, but there is this: [17:41:43] since that would be redistibution which is forbidden [17:41:44] root@sockpuppet:~# ipmitool -U root -H ms-be3.mgmt sol activate [17:41:47] http://www.webupd8.org/2012/01/install-oracle-java-jdk-7-in-ubuntu-via.html [17:41:49] Error: This command is only available over the lanplus interface [17:41:56] which is an 'installer package' [17:42:06] that does not keep the java package itself in apt [17:42:14] but an installer that dls and installs from oracle [17:42:16] apergos: use "the easy way" [17:42:29] right [17:42:29] but trusts a remote website to run code on our machines. [17:42:43] help.ubuntu suggests using this [17:42:58] the installer we can put in our own apt [17:43:00] sun java? grrrr [17:43:18] ottomata: we shouldn't blindly trust remote code in production machines imho [17:43:27] why can't you use the openjdk though? 
[17:43:37] there is basically no alternative a openjdk is not compatible with hadoop [17:43:38] reason one: [17:43:39] http://wiki.apache.org/hadoop/HadoopJavaVersions [17:43:45] it is possible to do [17:43:46] i hate when ipmi sucks [17:43:47] but not supported [17:44:27] OpenJDK6 has some open bugs w.r.t handling of generics (https://bugs.launchpad.net/ubuntu/+source/openjdk-6/+bug/611284, https://bugs.launchpad.net/ubuntu/+source/openjdk-6/+bug/716959), so OpenJDK cannot be used to compile hadoop mapreduce code in branch-0.23 and beyond, please use other JDKs. [17:44:33] (might be ok in 7, who knows) [17:44:42] i did the puppet work for this in lucid [17:44:47] to use sun jdk 6 [17:44:49] using alternatives [17:44:56] could do the same for this [17:44:59] keeping openjdk as default [17:45:18] ottomata: that's for *compiling* [17:45:25] and it says "Hadoop does build and run on OpenJDK" [17:45:26] so from the console (i.e. ipmi_mgmt console) I should be able to get a linux login prompt (typically)? [17:45:29] yes, compiling maprreduce code [17:45:50] apergos: if the machine was up and responsive, yes; hitting 'return' would trigger a prompt refresh. [17:45:50] oh okay [17:45:58] we need to be able to compile mapreduce code, people will be using it [17:46:04] ok, just making sure it works the way I think it ought to [17:46:05] the launchpad bugs seems to suggest that it works with openjdk 7 though [17:46:09] yup. [17:46:19] can we test openjdk 7 first and if that fails try to find a way around oracle? [17:48:23] ah I see, it wants -I lanplus [17:48:33] (to not use the "easy" way) [17:48:35] asking dschoon, he's got more info on this than i do [17:48:42] I would love to find a way around snoracle. [17:49:04] though I'm not involved in the hadoop stuff [17:49:29] New patchset: Aaron Schulz; "Make reads come from swift for all wikis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20043 [17:50:04] !log powercycled ms-be3 ... 
in theory [17:50:04] qatop [17:50:11] moops. [17:50:14] Logged the message, Master [17:50:31] apergos: connect to the console! you should see it booting in a few. [17:50:39] I'm on it in another window [17:50:46] ah finally. sorry but it was doing *nada* [17:50:53] for quite a while. [17:50:55] it takes a while to post. [17:51:15] I just wanted to see a bios message come up. any bios message. [17:52:06] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20043 [17:52:21] paravoid: no. hotspot is significantly faster than openjdk. [17:52:29] oh boy [17:52:35] it is a huge waste of time to test or work with openjdk. [17:52:38] also from gassandra [17:52:41] Check which version of the Java runtime environment your system is using. If your system is using the OpenJDK Runtime Environment, you will need to change it to use the Oracle Sun JRE. [17:52:45] cassandra* [17:53:03] every major java project says "don't use OpenJDK" [17:53:16] it's a world of hurt for absolutely no gain. we are not using it anywhere, ever. [17:53:18] period. [17:53:43] (i speak from experience -- i really don't want to go down that rathole again.) [17:54:38] I don't think it helps to say "anywhere, ever, period" in a discussion where we're trying to find the best way for everyone [17:54:46] well, yes. [17:54:49] my head hurts :( [17:55:31] although, dschoon, I think most of these pages that say don't use openjdk are all referring to 6 [17:55:33] not 7 [17:55:47] *nod* 7 is a recent release. 
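Editorial note on the ms-be3 console thread: the error pasted at 17:41:49 and the remark at 17:48:23 ("it wants -I lanplus") suggest an invocation along the lines below. This sketch only assembles the argument list and runs nothing. The helper name is hypothetical, and the host and user are the ones from the paste.

```python
def sol_command(host, user="root"):
    """Build an ipmitool serial-over-LAN command that uses the lanplus interface."""
    return ["ipmitool", "-I", "lanplus", "-U", user, "-H", host, "sol", "activate"]
```

The resulting list could be handed to `subprocess.run` from a host that can reach the management network, such as sockpuppet in the paste.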
[17:56:10] shipping the oracle jdk from our servers is a license violation [17:56:11] the main issue is that the java certification kit was never released [17:56:15] the installer package is pure crap [17:56:48] and installing java on systems by hand is difficult, time-consuming and error prone [17:57:07] New patchset: Pyoungmeister; "appserver module: making nagios group defined by $::group, (was $::cluster )" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20047 [17:57:09] why is pure crap? [17:57:18] so the only ways I see (I'm open to ideas) is creating either a package or a puppet manifest of ours to install java [17:57:25] we have one. [17:57:27] what's the difference between doing that and doing it by hand? [17:57:37] ottomata: what do you mean? [17:57:40] well, for 6. [17:57:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20047 [17:57:55] it probably just is java's equivalent of "wget oracle-jdk-y.tgz" … make install [17:58:13] why is the installer• pure crap? [17:58:20] the one that ubuntu recommends? [17:58:48] what do you mean by "ubuntu recommends"? [17:59:00] it's a third-party ppa [17:59:23] https://help.ubuntu.com/community/Java [17:59:31] actually Ubuntu says "It is not advisable to install Oracle (Sun) Java 6 unless you have some specific need.." [17:59:53] haha [17:59:58] aye, we are talking about 7 [18:00:03] we were talking about 7 [18:00:20] oh, ok [18:00:35] ah you know, i think this installer only does JRE [18:00:52] i could do something in puppet, paravoid [18:00:58] wouldn't be hard [18:00:58] COOKIE_URL=http://launchpadlibrarian.net/98645053/cookie.txt #required by the latest JDK7, wget doesn't work without it [18:01:12] exec wget && tar -xvf && update-alternatives [18:01:17] that's from the installer's postinst... 
[18:01:18] i think there's a puppet class already [18:01:31] !log authdns-update for new es1005-1010 [18:01:39] yeah, i've used that a bunch [18:01:40] Logged the message, RobH [18:01:45] messed with hat puppet class [18:01:51] it doesn't do oracle java 7 though [18:01:52] but it could! [18:02:12] I'm not terribly excited to have something download things off a http url with no other authentication and then run it as root, but we really really can't do anything else... [18:02:24] I guess we could find ways around that too... [18:02:29] *sigh* [18:02:36] do an md5 check [18:02:40] yeah [18:02:43] can do [18:02:51] and obviously, don't run it as root [18:02:53] that's what I was thinking [18:02:56] it's an installer... [18:03:03] the downloading i mean [18:03:05] it install java into the system [18:03:07] once the md5 check has run... [18:03:12] can do as well [18:03:14] right [18:03:21] or SHA224 :P [18:03:31] 256? [18:03:32] whatever it was [18:03:37] 512 [18:03:44] 1024? [18:03:48] (this game is fun!) [18:04:01] * mark bumps dschoon's head to make it hurt more [18:04:06] owww :( [18:04:08] paravoid: curl | bash [18:04:13] paravoid: don't be scared, just do it [18:04:22] New patchset: Ottomata; "Using lvm.cfg for analytics dell partman recipe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20051 [18:04:23] Ryan_Lane: curl | sudo ruby [18:04:24] that's how openstack does it [18:04:27] :D [18:05:01] thankfully openstack is now packaged by the distro ;) [18:05:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20051 [18:06:05] yeah, I'm very grateful they didn't go the RedHat side [18:06:07] or just toss the jdk in a non public repo on our servers of course [18:06:11] puppet private repo, whatever [18:06:15] redhat is also packaging it [18:06:26] they just announced a fully supported openstack distro [18:06:28] i don't think that would violate any licenses would it? 
[18:06:31] mark: yeah thought of that too [18:06:31] is that better mark? the .tgz somewhere? [18:06:37] but md5 is better I think [18:06:42] it's easier, not better [18:06:48] i'd rather dl from them with md5 [18:06:50] having the puppet fileserver ship a 20mb file is not very good [18:06:54] just beacuse if we want to upgrade, it's easier [18:07:06] we could put it somewhere internal with a vhost [18:07:08] and creating another facility to serve files is too complicated [18:07:11] and wget it internally [18:07:12] aye ok [18:07:45] well, if you're willing to do it I'm not going to stop you :P [18:07:52] yup, can do [18:08:27] so, yeah, before doing all that, how hard would it be to check if openjdk 7 is any better than 6? [18:08:41] maybe they fixed whatever it was that made it slower than oracle java? [18:09:15] er, hotspot is the product of a decade of proprietary research. so i doubt it. but we can do some googling. [18:09:38] but i haven't heard any yelling, which means nothing has probably changed. [18:10:25] so was Solaris :P [18:10:35] and solaris was pretty good! [18:10:36] (kidding) [18:10:42] :) [18:11:04] it's really too bad about sun. they didn't always suck horribly. [18:11:45] i've been googling, not much info, looks like people are talking about hadoop maybe being ok [18:11:49] no info on cassandra [18:12:08] well, we could benchmark during our data warehouse testing [18:12:13] dschoon: yeah we were saying this the other day too [18:12:25] (about Sun) [18:13:09] hmmm, actually I see some threads about cassandra not liking java 1.7 at all (which == 7, no?) [18:13:09] but i think we still need to install the oracle jdk [18:13:26] yeah, 1.7 is jdk7 [18:17:34] * apergos looks for a q-tip [18:17:39] "solaris was pretty good"? [18:17:46] must have not heard right... [18:17:53] it was interesting! [18:18:02] that's a good word for it! 
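Editorial note on the download discussion above: the approach that was settled on (fetch the tarball, check it against a pinned digest, and only then unpack it as root) can be sketched roughly as follows. This is an illustration, not the puppet class that was eventually written, and the function names are hypothetical. It uses SHA-256 rather than the md5 first suggested, since md5 is no longer considered collision-resistant.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_download(path, expected_hex, algorithm="sha256"):
    """True only if the file's digest matches the pinned value (case-insensitive)."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

The pinned digest would live in the manifest next to the URL, so an upgrade means changing both together; the unprivileged download step and the root-owned install step stay separate, as suggested at 18:02:51.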
[18:27:23] New patchset: Dzahn; "zirconium, use 1T partman recipe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20053 [18:28:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20053 [18:28:27] New review: Dzahn; "SEAGATE ST91000640SS AS02 931 GB" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/20053 [18:28:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20053 [18:35:43] RobH: is it like a known thing that if you ever entered the Unified Server Configurator stuff on a Dell it keeps trying to enter that on every boot ? [18:36:37] like i would not expect to keep doing that after powercycling, but it does :p [18:36:59] mutante: yea, its kinda insanely annoying [18:37:05] we should disable it from the drac interface options [18:37:17] (im sure theres a flag, cuz its a drac option if you physically console it) [18:37:30] alright, i'll look for it [18:37:38] if we ever need it to troubleshoot something physically then the onsite tech can reenable [18:37:50] cuz the behavior of permana looping until you enter and exit the ucs is nuts [18:38:08] since you cannot exit it via serial. [18:38:45] yep, it gives me the "to cancel enter iDRAC6 configuration utility" [18:39:22] cmjohnson1: right, so we should disable it [18:39:27] cuz if you accidentally launch it on serial [18:39:35] it sticks the computer in a semi-permanent ucs loop [18:39:43] every reboot until a console is attached and its exited properly [18:39:45] its horrible. [18:39:48] i should have never touched that F key:) [18:40:15] yep [18:48:33] New patchset: Aaron Schulz; "Make foreign reads of commons actually work with swift reads." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20056 [18:51:05] New patchset: Aaron Schulz; "Make foreign reads of commons actually work with swift reads." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20056 [18:54:03] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20056 [18:54:24] RobH: cmjohnson1: Ctrl+E during startup -> firmware setup -> system services -> cancel system services -> ignore warning -> save and exit .. boots normal again [18:55:18] mutante: yes, but there is a way to tell it that via command line drac [18:55:23] so it doesnt require a reboot of a live server [18:55:30] for existing services [19:01:19] New patchset: RobH; "added es1005-1010 into files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20058 [19:02:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20058 [19:02:15] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20058 [19:04:55] ottomata: shit, i didnt see your admin log that you stopped puppet on brewster [19:05:08] yup, its ok though [19:05:09] you can run it [19:05:12] and i just ran it and i saw a bunch of partman stuff copy down =[ [19:05:13] sorry dude =[ [19:05:19] no probs, [19:05:19] ok, its running again [19:05:25] that's fine, log it? [19:05:31] !log puppet restarted on brewster [19:05:40] Logged the message, RobH [19:05:44] i actually just hadn't turned it back on yet, because I wasn't yet sure if I would have to make more changes [19:06:00] no worries, i pushed a bunch of chagnes for new es servers so had to update to install them [19:11:43] New patchset: Ottomata; "Using lvm.cfg for analytics dell partman recipe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20051 [19:12:01] ACK [19:12:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20051 [19:14:22] Change abandoned: Ottomata; "Not sure what happened there. Abandoning and recommitting." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/20051 [19:15:29] New patchset: Ottomata; "netboot.cfg - using lvm.cfg for analytics dells" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20063 [19:16:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20063 [19:16:46] New patchset: Pyoungmeister; "re-add --daemon option to udp2log init script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20064 [19:17:01] !log running purgeParserCache.php again [19:17:10] Logged the message, Master [19:17:29] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20063 [19:17:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20064 [19:19:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20047 [19:21:32] i'm about to merge puppet on sockpuppet [19:21:37] someone else's change is coming in too: [19:21:44] role::applicationserver [19:21:44] - $nagios_group = "${::cluster}_${::site}" [19:21:44] + $nagios_group = "${::group}_${::site}" [19:21:48] s'ok? [19:21:57] -@monitor_group { "misc_pmtpa": description => "misc pmtpa application servers" } [19:27:29] ottomata: yeah [19:27:33] sorry, thoght I merged that [19:27:42] s'ok, someone else merged before I had a chance to! [19:27:47] New patchset: Cmjohnson; "fixing dhcpd entry for es9 and 10" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20065 [19:27:52] ah, then maybe it was me :) [19:28:01] yep [19:28:02] heh [19:28:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20065 [19:29:32] !log stopping puppet on oxygen and restarting both udp2log instances to switch user. 
will keep puppet off until I merge new init script into puppet [19:29:37] cmjohnson1: sure [19:29:41] Logged the message, notpeter [19:30:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20065 [19:30:27] cmjohnson1: ok, all merged up [19:36:48] New patchset: Ryan Lane; "Change LDAP iptables rules to allow all of our network" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20066 [19:37:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20066 [19:38:04] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20066 [19:44:02] !log stopping puppet on emery and restarting both udp2log instances to switch user. will keep puppet off until I merge new init script into puppet [19:44:11] Logged the message, notpeter [19:46:30] PXE! [19:46:43] oops wrong chat [19:49:30] hi ops [19:49:44] do any of you have hair left after dealing with c2100s? [19:49:51] i've pulled most of mine out :p [19:50:01] so [19:50:13] i've got 1 out of 11 to install and boot properly [19:50:17] using the exact same configs I had before [19:50:22] i am working on the second one [19:50:23] this time [19:50:28] during PXE boot and install [19:50:33] it asks me to confirm my partition layout [19:50:34] I say yes [19:50:38] it finishes the install [19:50:56] but then boots into PXE again (and if I let it will ask me to confirm the partition layout again) [19:50:58] so [19:51:01] I boot into BIOS [19:51:06] check boot device priority [19:51:19] I manually set the SATA drive to 1st boot priority [19:51:24] then I save changes and exit [19:51:30] but when it boots up, it starts PXE again! [19:51:43] has anyone encountered this before? 
[19:51:50] could be it's just not writing a bootloader [19:52:03] hm [19:52:17] (which i think happens last) [19:52:24] i do see it say that it does it [19:52:27] installs GRUB [19:52:49] but maybe it is failing, and since it can't boot from disk, it falls back to pxe? [19:53:24] you could turn up logging for d-i and check the log [19:53:53] or whatever it uses, i guess maybe it's not called d-i ;-P [19:54:07] (=debian-installer) [19:54:30] logs on the new machine? [19:54:38] yes [19:54:45] how do I get them? whenever I boot pxe it tries to reformat everything [19:54:56] i'd have to change the boot recipe, no? [19:55:19] http://d-i.alioth.debian.org/doc/internals/ch02.html#id319503 [19:57:18] !!log ms-be1004 down for ssd install [19:57:21] bleh [19:57:26] !log ms-be1004 down for ssd install [19:57:35] Logged the message, RobH [19:58:15] ottomata: so you have a box that's currently expected to boot ubuntu but going to pxe instead? [19:58:33] yup [19:58:36] ottomata: boot into the installer, drop to a shell, mount the filesytem, what's there? [19:58:42] does it mount at all? [19:58:54] so, the recipe is trying to auto confirm everythign when it boots pxe [19:58:58] chroot into the installed system and run grub? [19:59:18] ideas on how to get to the install menu? [20:00:02] well one well tested obvious solution is plug in the ethernet to the wrong port ;0) [20:00:06] ;-)* [20:03:43] * jeremyb could maybe look more later... have to get some real work done [20:04:01] ah got menu, because it asks for confirmation to write part table [20:04:04] ottomata: did you solve the pxe problem? [20:04:07] nope [20:04:10] k. [20:04:13] I have your answer! [20:04:16] oh!? [20:04:25] http://wikitech.wikimedia.org/view/Dell_PowerEdge_C2100#Initial_Setup [20:04:41] I think the Force PXE First step got skipped. [20:04:45] hmmmmmm [20:04:47] k [20:04:57] will try that! [20:05:10] check everything else in the list while you're there... 
[20:05:14] k [20:05:25] (those instructions are focused on having 2ssds + 12 disks, btw. [20:05:26] ) [20:05:44] k [20:08:08] New patchset: Ryan Lane; "Up the user-management ldap tools version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20067 [20:08:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20067 [20:12:25] cmjohnson1: hang on a minute [20:12:27] !log ms-be1004 is ready to be installed into service [20:12:28] we just had a swift deploy [20:12:36] Logged the message, RobH [20:12:38] and I want to make sure everything's cool before mucking with the cluster. [20:25:20] New patchset: Ryan Lane; "Make virt1000 an openstack nova controller" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20136 [20:26:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20136 [20:27:13] cmjohnson1: I think you're clear to work on ms-be6. Please let me know when you turn it on; I need to watch the first boot. [20:29:09] I'll be onsite again tomorrow, I need to get back into town for another engagement before 8pm =P (so leaving eqiad now, back online from home shortly, i hope) [20:38:38] !log stopping puppet on locke and restarting both udp2log instances to switch user. 
will keep puppet off until I merge new init script into puppet [20:38:47] Logged the message, notpeter [20:39:18] notp eter is quite reptitive today ;) [20:39:39] ah the joys of doing things by hand :) [20:40:14] New patchset: Asher; "ssd specific mysql tuning, addressed write stalls observed while running purgeParserCache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20217 [20:40:59] New patchset: Dzahn; "add zirconium to site.pp, standard, admins, role::planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20218 [20:41:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20217 [20:41:43] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20218 [20:43:01] binasher: merged your stuff on sock. typo though? paresercache.my.cnf.erb [20:45:38] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20136 [20:48:02] Logged the message, Master [20:54:24] chrismcmahon: I think that means it's the wrong password. [20:54:27] oops. [20:54:30] cmjohnson1: ^^ [20:54:37] (sorry chrismcmahon ) [20:56:09] hahahaha [20:56:17] chris is so popular ;-) [20:58:17] !log attempting to upgrade virt1000 to precise [20:58:27] Logged the message, Master [20:58:32] such confidence ;) [20:58:36] heh [20:58:44] cmjohnson1: sure; one sec. [20:58:49] well, if it fails I'm just going to reinstall ;) [21:03:28] oh, I see the problem cmjohnson1 [21:03:38] you have to give the fqdn for the mgmt interface to ipmi_mgmt. [21:04:12] want to give it a try just to confirm? [21:04:37] good. [21:04:38] yw! [21:12:53] !log ms-be6 down for bad disk repair [21:13:01] Logged the message, Master [21:15:07] ok, cmjohnson1 go ahead. [21:18:29] cmjohnson1: bad news. [21:18:42] which drives did you replace? [21:20:08] !log stopping puppet on nfs1 and restarting both udp2log instances to switch user. 
will keep puppet off until I merge new init script into puppet [21:20:19] Logged the message, notpeter [21:21:44] no. it hasn't fully booted yet [21:22:48] sdc and sde are throwing errors [21:24:16] 43.502478] end_request: I/O error, dev sde, sector 1955470264 [21:24:23] 43.502543] Filesystem "sde1": I/O Error Detected. Shutting down filesystem: sde1 [21:24:38] 43.502547] I/O error in filesystem ("sde1") meta-data dev sde1 block 0xaea91370 ("xlog_recover_iodone") error 5 buf [21:24:53] but they might just be fs errors, which is to be expected if they're blank drives. [21:24:58] wtf is going with those systems? [21:25:07] how many disks have died so far? [21:25:14] the first one doesn't seem like a fs error though, [21:25:20] 43.502369] end_request: I/O error, dev sde, sector 976793160 [21:25:54] sigh, maplebed, notpeter [21:26:03] soup? [21:26:10] i flipped around the settings in bios so they would correspond with http://wikitech.wikimedia.org/view/Dell_PowerEdge_C2100#Initial_Setup [21:26:12] what hapened? [21:26:28] pxe install works without a hitch (aside from asking me to confirm writing the partition table) [21:26:31] cmjohnson1: umm... unscientifically. [21:26:40] after pxe install finishes, it starts to boot pxe again [21:26:49] (I counted and guessed based on somethingerother) [21:26:50] even though the boot order stays correct in bios [21:26:57] !log stopping puppet on nfs2 and restarting both udp2log instances to switch user. will keep puppet off until I merge new init script into puppet [21:27:06] Logged the message, notpeter [21:27:22] trying one more time to boot after exiting bios [21:27:28] ... [21:27:40] ottomata: you can also give ipmi_mgmt a command 'bootdisk' [21:27:44] (then powercycle) [21:27:55] but that's just a good test to make sure that osmething else isn't broken, [21:27:58] not a good end state for the box. [21:28:37] cmjohnson1: by pressing 's' to skip somethingerother, [21:28:42] it's continuing to boot. 
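Editor's sketch for tallying which devices the kernel is complaining about. It assumes console output with `end_request: I/O error, dev <name>, sector <n>` lines like the ones pasted above; the function name is hypothetical and this is not a tool used in the channel.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   end_request: I/O error, dev sde, sector 1955470264
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def count_io_errors(lines):
    """Count I/O error lines per block device name."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A nonzero count only says the kernel failed a request on that device; as noted in the discussion, a blank replacement disk will also fail to mount without being dead.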
[21:29:00] it's claiming errors on sdc, sde, and maybe sdl [21:29:06] if it makes it the rest of the way I'll look again. [21:29:19] ok, it's at login. [21:29:22] notpeter: did you chown everything in /home/w/logs as part of the udplog stuff? [21:29:23] oh noi think i got it! [21:29:25] !!!! [21:29:37] YES [21:29:42] i don't know what I did different this time [21:29:49] but after entering bios (not changing antyhing) and then exiting [21:29:55] it booted and I could ssh in! [21:30:02] cmjohnson1: it failed to mount sdc, e, l, and m. [21:30:20] I vaguely remember some command to tell me what slot it was in... [21:30:31] (booted user command, not bios thing) [21:30:49] binasher: yes. udp2log user needs to be able to write on nfs1 [21:30:58] do I need to chown to something else? [21:31:42] maybe chgrp to 500 and make group readable [21:31:58] so they can be read by regular users on fenari [21:32:20] 500 = wikidev [21:32:23] ja [21:33:03] cmjohnson1: do you know which drive you pulled from which slot? [21:33:10] in theory they could go back in, [21:33:15] binasher: better? [21:33:28] I mean which disk (that's now out) came from which slot. [21:33:39] (I can read it now, but verification would be nice :) ) [21:33:40] notpeter: yup [21:33:48] cool. sorry about that [21:34:11] cmjohnson1: if your'e not 100% sure 4 fresh disks would be better; [21:34:23] getting them backwards woudl be ... well, I don't actually know what it would xdo. [21:34:33] but if you are, rock on. [21:34:35] let's try again. [21:36:05] actually... the filesystems are labeled. [21:36:12] so it'll be fine. [21:36:18] go ahead with the two you pluled out. [21:36:46] down please. [21:37:18] Logged the message, Master [21:39:13] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20064 [21:39:54] !log restarting puppet on oxygen [21:40:04] Logged the message, notpeter [21:41:07] yup. 
[21:42:56] cmjohnson1: I see it booting [21:43:11] ERR: MEM TEST [21:43:13] whee! [21:43:17] maybe that's the root of all this. [21:43:56] so did you just put the two you'd taken out back in or did you swap 4? [21:45:49] ok, it's up, [21:46:33] so the ones that didn't mount are c, e, h, [21:47:36] mount: special device LABEL=swift-sdh1 does not exist [21:48:28] New patchset: Pyoungmeister; "udp2log: also ensure that logging dirs are owned by udp2log user, not root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20229 [21:48:31] maplebed: how tedious would it be to upgrade copper? [21:48:56] * AaronSchulz wonders wtf running unit tests in labs just gives "killed" [21:48:57] AaronSchulz: I can't right now, but http://wikitech.wikimedia.org/view/User:Bhartshorne/swift_upgrade_notes_2012-08 details what's necessary. [21:49:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20229 [21:49:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20229 [21:50:42] cmjohnson1: sde and sdh are swapped. [21:51:04] cmjohnson1: if there's any way you can stick around I'd rather not leave it like this overnight. [21:51:26] from parted, I asked it to print the partition table for /dev/sdh and it prints the label swift-sde and vice versa. [21:51:37] !log restarting puppet on emery [21:51:46] Logged the message, notpeter [21:52:51] k. [21:52:59] I really wish there were a better method than this. [21:53:37] [1217224.205659] Out of memory: kill process 24899 (php) score 59271 or a child [21:53:39] * AaronSchulz sighs [21:54:52] !log restarting puppet on locke [21:55:00] Logged the message, notpeter [21:55:25] I can see it on console. [21:58:02] ok, now the unmounted disks are c, e, h, [21:58:06] let's see what parted says [21:58:25] /dev/sdh has swift-sdh. [21:59:16] ok, they're all back in the right spots. 
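Editor's sketch of the "map the drives" check done by hand above. It assumes filesystem labels follow the `swift-<device>` pattern seen in the mount error and the parted output; the input mapping is hypothetical, and collecting it (from parted or similar) is left out.

```python
def find_mislabeled(labels, prefix="swift-"):
    """Return {device: device_named_in_label} for disks sitting in the wrong slot.

    `labels` maps the kernel device name to the label read from that disk,
    for example {"sdh": "swift-sde"}. Labels without the prefix are skipped,
    since a blank disk carries no label to compare.
    """
    swapped = {}
    for device, label in labels.items():
        if not label.startswith(prefix):
            continue
        # Drop a trailing partition number, so "swift-sdh1" names "sdh".
        named = label[len(prefix):].rstrip("0123456789")
        if named != device:
            swapped[device] = named
    return swapped
```

For the situation described in the log, `find_mislabeled({"sdh": "swift-sde", "sde": "swift-sdh"})` reports both disks, each pointing at the other's slot.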
[22:00:22] trying to mount sdh manually [22:00:47] starting xfs filesystem recovery. [22:00:56] sdh might just be in pain from getting powercycled. [22:01:10] nope. [22:01:20] XFS: log mount/recovery failed: error 5 [22:01:43] i 0:0:16:0: SSP: enclosure_logical_id(0x500065b36789abff), slot(5) [22:01:52] is that the same numbering system? [22:02:07] sd 0:0:16:0: Attached scsi generic sg7 type 0 [22:02:25] was disk7 one of the ones we've been swapping? [22:02:38] huh. [22:02:48] ::sigh:: [22:02:56] ok so. [22:03:08] at the moment, can we swap out c and e and h for fresh disks? [22:03:15] now that we've been playing map-the-drives? [22:03:42] ok. [22:03:47] !log restarting puppet on nfs1/2 [22:03:56] Logged the message, notpeter [22:04:01] running. [22:04:54] and there's the kernel spam [22:05:14] running e [22:05:39] does that match up with the ones you pulled earlier? [22:06:28] maplebed, notpeter!!!!! [22:06:35] i have login prompts on all of the new analytics dells!!!!! [22:06:40] woo! [22:06:47] yayayayayyayayayayayayay [22:06:51] 1 week later yayayyayayay [22:06:54] cmjohnson1: but when we swapped e and h... did either of those hit disk 9? [22:06:54] maplebed: ls /sys/class/sas_host/host0/device ? [22:07:18] hm.. [22:07:41] New patchset: Pyoungmeister; "correcting scoping on nagios group def in appserver role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20234 [22:08:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20234 [22:08:23] cmjohnson1: ok, let's try. last one. [22:08:28] we're basically out of time. [22:08:48] yes. [22:08:56] with brand new empty disks. [22:09:10] +1 carry on. [22:09:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20234 [22:11:54] udevadm info --query=all --path=/block/sda has DEVNAME and ID_PATH [22:12:00] bye guys! [22:15:42] k. 
[22:15:53] if you want to take off, I can email you whether it succeeds [22:16:00] and if it doesn't I'll just take the server down for the night. [22:16:17] which would be too bad, but we'll survive. [22:18:02] umm... yayish? [22:18:20] issues with sdf. [22:19:11] unmounted disks are: c e f h [22:20:00] (of course, blank disks should not mount, so unmounted != dead) [22:20:20] so parted shows [22:20:26] sdc: unrecognised disk label [22:20:30] (sounds like a new disk!) [22:20:35] so does sde [22:20:38] and sdh [22:21:18] trying to mount sdf1 by hand [22:21:57] ok. [22:21:59] thanks a bunch! [22:46:08] New patchset: Dzahn; "fix SSL cert name .crt -> .pem and remove trailing /www from doc roots" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20239 [22:46:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20239 [22:47:15] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20239 [22:48:27] running scap... [23:09:08] New patchset: Asher; "puppet classes for new es servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20242 [23:09:50] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/20242 [23:10:01] nagios-wm: are you still alive? ... [23:12:09] !log killing nagios-wm. it stopped talking even though stuff gets written to --infile [23:12:18] Logged the message, Master [23:13:25] poor nagios-wm [23:16:22] New patchset: Asher; "puppet classes for new es servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20242 [23:17:04] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20242 [23:17:19] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20242 [23:26:49] !log fix date/NTP on sq36 [23:26:57] Logged the message, Master [23:29:38] New patchset: Ryan Lane; "Fix glance db host for eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20246 [23:30:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20246 [23:30:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20246 [23:31:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:38:26] RECOVERY - mysqld processes on es1002 is OK: PROCS OK: 1 process with command name mysqld [23:39:01] !log reslaved es1002 after conversion to innodb [23:39:10] Logged the message, Master [23:44:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.105 seconds [23:45:11] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [23:49:23] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [23:49:23] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [23:49:23] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [23:49:23] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [23:56:34] New patchset: Ryan Lane; "Initial config for eqiad labs region compute, api and network nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20248 [23:57:16] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20248 [23:57:29] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [23:57:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20248 [23:59:09] Ryan_Lane: how do people get accounts on wikitech? [23:59:16] we make them for them