[00:04:34] PROBLEM - SSH on amslvs2 is CRITICAL: Server answer: [00:05:19] RECOVERY - SSH on nescio is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:06:13] RECOVERY - SSH on ssl3001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:07:34] RECOVERY - SSH on amslvs2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:11:55] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [00:21:03] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [00:21:22] PROBLEM - SSH on nescio is CRITICAL: Server answer: [00:27:04] RECOVERY - SSH on nescio is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:29:46] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:31:25] PROBLEM - SSH on nescio is CRITICAL: Server answer: [00:32:18] PROBLEM - SSH on ssl3001 is CRITICAL: Server answer: [00:33:40] RECOVERY - SSH on ssl3001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:36:57] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [00:38:00] PROBLEM - SSH on ssl3001 is CRITICAL: Server answer: [00:38:27] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:45:12] RECOVERY - SSH on ssl3001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [01:32:28] RECOVERY - SSH on nescio is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [01:41:55] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 246 seconds [01:42:49] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 266 seconds [01:49:16] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 657s [01:55:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds [01:57:57] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [01:58:34] RECOVERY - MySQL Slave Delay on 
storage3 is OK: OK replication delay 16 seconds [03:23:55] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [03:33:57] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [03:38:00] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [03:38:00] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [03:38:00] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [04:14:55] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [05:50:01] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [06:09:57] morning [06:12:58] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:25:40] moin [06:26:41] not yet fixed i think: 11 18:03:32 < jeremyb> spence is up to 86 hrs since last puppet run... [06:52:30] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [07:04:24] wtf, why does ganglia have """Wikimedia Grid > Miscellaneous pmtpa > en.wikipedia.org""" ? [07:04:42] and there's a 10.0.0.34 in there too [07:05:58] hrmmm, and virt0 seems to be missing entirely from ganglia [07:07:30] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [07:10:43] morning [07:13:01] hey [07:19:17] oh, i forgot to mention: amslvs2, ssl3001, and nescio (all ams) flapped more than a few times each last night. 
maybe worth a look [07:22:30] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [07:34:48] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [07:37:30] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [08:59:42] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [09:06:27] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [09:06:27] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [09:36:25] New patchset: ArielGlenn; "ms10 to public hostname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19206 [09:37:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19206 [09:37:53] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19206 [10:01:37] New patchset: ArielGlenn; "add ms10 to list of media rsyncers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19208 [10:02:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19208 [10:02:18] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19208 [10:12:27] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [10:49:16] !log starting initial rsync from ms8 to ms10 (should take approximately forever) [10:49:27] Logged the message, Master [10:57:24] hm? [10:58:44] apergos: what's the plan there? [10:59:07] right now we rsync to eqiad [10:59:34] anyways the idea is to have two hosts, one in eqiad and one in tampa, with full copies of the original media [10:59:48] aren't we going to use swift instead of ms*? [10:59:52] one as primary and one in case the primary dies... 
used to feed media mirrors outside [10:59:55] (which we do already) [10:59:57] like this week? [11:00:08] yep [11:00:34] these hosts will have a flat copy exposed from rsyncing to the outside [11:01:03] they will get it by polling upload.wm.o (which will mostly be the squids I guess, but sometimes it will fall through to swift) for new/changed media [11:01:27] *for rsyncing to the outside [11:01:57] I have a script already in place for it, which I'll be testing once the switchover is done and seems stable-ish [11:02:00] polling how? [11:02:12] http gets [11:02:15] standard stuff [11:02:32] we'll have lists of changed/new media from, I guess I get that out of the db [11:02:38] idea is to run that once a day [11:03:05] much better than trying to ask swift for a list of everything in all its containers [11:03:15] also back end independent [11:03:53] anyways the silly thing right now is that we copy from tampa to equiad and then our mirrors get from there [11:05:06] it would have been a nice way to stress-test the backup fiber cable :p [11:05:13] no it woudln't :-D [11:05:22] hihi [11:05:41] in fact we do our daily copy over that wire, which I quietly let hang that day [11:05:53] given we were limping along on one link [11:08:04] !log apt: remove php-wikidiff2/1.1.0-2 from lucid-wikimedia/universe (lucid-wikimedia/main has 1.1.2-1) [11:08:13] Logged the message, Master [11:08:33] !log apt: remove wikidiff2/0.0.1wm1 (obsolete), copysrc php-wikidiff2 {lucid,precise}-wikimedia [11:08:42] Logged the message, Master [11:08:47] hmmm [11:58:57] !log apt: include php5 5.3.10-1ubuntu3.2+wmf1 @ precise-wikimedia [11:59:05] Logged the message, Master [12:04:06] !log apt: include php-memcached 2.0.1-6~wmf+lucid2 @ lucid-wikimedia [12:04:14] Logged the message, Master [12:11:37] apergos can't fill that one fiber anyway [12:11:48] no matter how much he tries to stress test with those server copies ;) [12:12:32] * apergos files that away as a challenge for later :-P [12:13:36] later? 
that's lame [12:13:42] in 10 years you'll fill it with ease [12:13:44] i'm talking NOW [12:22:15] later before the end of *this* year [12:24:09] !log Restored SnapMirror relationship between nas1-a and nas1001-a, 70 days lagged, transfer initiated [12:24:18] Logged the message, Master [12:27:02] just 70 days? I'm impressed [12:27:09] hehe [12:27:15] do you know anything about the "labs" aggregate? [12:27:21] I think ryan did that for testing once, not entirely sure [12:27:24] first time I'm hearing of this [12:27:25] it can't be in production use ;) [12:27:43] well, considering the netapps didn't work until what? yesterday? [12:27:47] no, it can't be in production :) [12:27:51] they've been up and down [12:28:06] until friday you mean ;) [12:28:15] yeah [12:28:53] btw, the ontap web interface is not completely useless [12:29:06] any reason to use it over the cli? [12:29:15] somtimes it's easier [12:29:18] sometimes it's not [12:29:33] I used to use both, depending on the task [12:30:54] snapmirror back in sync [12:31:11] how do I login to the nas? via root pwd? [12:31:13] not that we need it for anything [12:31:14] yes [12:32:38] ah of course, "too many users logged in" [12:32:44] hehe [12:32:50] didn't get fixed in 8 I see [12:34:53] I thought we were going to use snapmirror? [12:35:05] yeah, but not the existing test volume [12:35:06] or you mean it's useless now that it has no useful data? [12:35:11] aha [12:35:12] i just copied images onto it for testing [12:35:19] and I wanted to see if it could recover after all the failures ;) [12:35:21] it did [12:35:23] it's back in sync [12:35:41] :-) [12:36:05] nas{1,1001}-{a,b} [12:36:06] nice. [12:37:08] mark: did you see this? 13 07:19:17 < jeremyb> oh, i forgot to mention: amslvs2, ssl3001, and nescio (all ams) flapped more than a few times each last night. 
maybe worth a look [12:37:45] hmm, perhaps connectivity problems [12:38:25] oh, just ssh [12:38:31] some ssh login scanner i'm sure [12:38:36] hah [12:39:20] paravoid: what do you think about ontap cluster mode? [12:40:51] mark: also, i noticed (before now) that the ARIN /22 had new contacts but now I see that the ams /24 has old contacts. maybe want to update there too [12:41:44] mark: never used it, it's an 8+ feature [12:41:47] can we even use it? [12:41:58] if we restrict to non-block-level access, I think so [12:42:06] they used to have completely separate product lines, the OnTAP and OnTAP GX [12:42:09] still so [12:42:15] well, we can use either [12:42:19] but they're completely different OSs [12:42:22] and with OnTAP 8 I heard they merged the two [12:42:43] I think the GX needed separate hardware [12:42:44] at least, you have to install a different image and there's no conversion [12:43:16] * paravoid googles it [12:43:37] dammit mark, why can't you ask all the easy questions that I have the answers for. [12:45:40] ONTAP isn’t a Unified OS, it’s a brand. They could call StorageGrid or Engenio ONTAP and it’ll be about as Unified as Cluster-Mode and 7-Mode. There is no unification here beyond the block diagram they’ll draw on your white board and that diagram is not connected to reality but to brand marketing. [12:45:45] *sigh* [12:45:56] i guess we'll just use 7-mode [12:47:37] !log Setup NTP on all NetApps [12:47:46] Logged the message, Master [12:59:54] !log Doing NDU of IOM3 shelf firmware on nas1-a [13:00:03] Logged the message, Master [13:05:18] !log Doing NDU of IOM3 shelf firmware on nas1001-a [13:05:26] Logged the message, Master [13:08:52] hi w [13:08:57] arrived ok? 
[13:10:57] paravoid: so the new ontap version we run now does support ipv6 quite well it seems [13:11:07] yay [13:11:11] 8.0 didn't afaik [13:11:20] indeed [13:15:23] i hate the current network setup of the netapps, but they don't support anything fancy [13:15:33] i'm considering running a RIP instance on our routers, just for the netapps :-( [13:16:58] nah, looks like that also can't do what I want [13:17:29] what do you want? [13:17:47] not both controllers down whenever we upgrade row B access switches [13:18:18] i'd want to connect each controller to a different row stack, but then they can't take over eachother's ips anymore [13:18:23] since they're different subnets [13:19:05] I lack information about our network topology... [13:19:10] do we have that documented anywhere? [13:19:17] not really [13:19:21] basically, the routers just route [13:19:32] every eqiad row has its own subnets [13:19:42] and every row consists of an ex4200 stack [13:19:51] so no l2 across rows atm [13:20:05] every stack connects to the router directly? [13:20:06] mostly to avoid sTP [13:20:08] yes [13:20:28] if the netapps spoke OSPF or so, i'd consider hooking them up to the routers directly [13:20:29] local cross-connect? :) [13:20:31] ew? :) [13:20:37] how's that ew [13:20:44] that's very elegant [13:20:59] you probably didn't understand what I proposed [13:21:11] no [13:21:12] local switching on the mx [13:21:20] don't want local switching on the mx [13:21:26] because "ew" [13:21:34] yeah mostly [13:21:43] yeah, that's what I figured [13:21:47] if we would do it, then for a special netapp only subnet [13:21:53] I said it myself to avoid making it sound like a real proposal :P [13:21:57] hehe [13:22:31] we COULD use vlans on the netapp [13:22:40] and connect it to both row A and row B vlans or something [13:22:56] how would we do failover then? [13:23:03] how would we access it from the hosts that is? [13:23:03] nah still sucks [13:23:13] via the row A VLAN or the row B VLAN? 
[13:23:20] or you mean do RIP over the two VLANs? [13:23:26] yeah [13:23:31] but not sure it really does that [13:23:34] it's not very configurable [13:23:40] i believe it has rip just to get routes from the network [13:23:49] even if it does, I'd expect it to be so uncommon to be completely buggy [13:23:53] yeah [13:24:27] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [13:24:31] yay for ethernet topologies [13:25:07] good morning you two! (nycer here) you sound busy, but I could use some help troubleshooting boot pxe for the analytics dells if you've got a sec sometime [13:25:18] we could hook them up to the routers directly and only bridge those ports between [13:25:22] but it would eat up 4 MX ports [13:25:33] are these 10g? [13:25:39] the MX ports yes [13:25:44] so we'd have to buy 10G cards for the netapps too [13:25:45] the netapps? [13:25:48] I believe it doesn't have them yet [13:25:51] only for internal use [13:25:54] (afaik) [13:25:57] internal use? [13:26:02] it says this: [13:26:02] slot 0: Dual 10G Ethernet Controller T320E-SFP/KR [13:26:02] c0a MAC Address: 00:a0:98:14:6f:ac (auto-unknown-cfg_down) [13:26:03] c0b MAC Address: 00:a0:98:14:6f:ad (auto-10g_kr-fd-up) [13:26:08] but I believe that's internal to the chassis [13:26:13] I don't think so [13:26:13] for between the two controllers or so? dunno [13:26:27] the internal link between the two controllers is infiniband iirc [13:26:36] well it's worth investigating [13:26:53] ottomata: so what's the status? [13:27:50] paravoid: it has private IPs bound to those controllers [13:27:54] I think it's using them for something [13:28:02] they're UP [13:28:24] one of the two is UP, both have private IPs bound [13:29:03] mark, the status is [13:29:07] as far as I know [13:29:32] i think last i heard he booted to the installer fine (via pxe) and then the installer itself was unable to get a lease. 
and tcpdump on brewster showed no request showing up [13:29:43] they are able to reach brewster/dhcp during initial boot, but once they start to go through installer configs (at the auto configure networking step) they don't seem to be able to reach them anymore [13:29:54] i'm not really sure how this step works, and notpeter and I are a little stumped [13:29:55] ok [13:29:56] also installer's shell said a different MAC than the BIOS? [13:29:59] which machine should I try? [13:30:04] oh it did? [13:30:08] analytics1011 [13:30:14] ottomata: i think so [13:30:34] what kind of machines are these? [13:30:34] slot 0: Dual 10G Ethernet Controller T320E-SFP/KR [13:30:34] c0a MAC Address: 00:a0:98:14:6f:ac (auto-unknown-cfg_down) [13:30:37] whoops [13:30:44] mark@fenari:~$ ssh root@analytics1011.mgmt [13:30:44] ssh: connect to host analytics1011.mgmt port 22: Connection refused [13:32:27] hello? [13:32:32] nakr [13:32:33] mark [13:32:34] http://wikitech.wikimedia.org/view/Dell_PowerEdge_C2100 [13:32:39] sorry, was finding link [13:32:41] C2100s? ew [13:34:04] 04:7d:7b:a5:e1:b2 [13:34:15] is what I got from BIOS [13:34:30] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [13:34:37] 2: eth0: mtu 1500 qdisc mq state DOWN qlen 1000 [13:34:38] link/ether 90:e2:ba:11:7f:40 brd ff:ff:ff:ff:ff:ff [13:34:46] clearly what linux thinks is eth0 is not connected [13:35:08] e1:b2 is not in the lot [13:36:29] hm [13:36:42] well that would be why it wouldn't work then... [13:36:45] is that nic1 or nic2? 
[13:37:09] all other macs are higher and incement by one [13:37:13] seems like it should be the first nic [13:37:44] weird, i'm going to bios boot 1012 and see what the MAC is there, to make sure i have it right [13:37:47] 1012 behaved the same way [13:38:17] open a ticket for rob in the eqiad queue to check on that [13:38:32] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [13:38:32] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [13:38:32] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [13:38:55] to check to see why MACs are reporting wrong (if they are?) [13:39:06] to check what the hell is up with the connections on those boxes [13:39:09] he can figure it out [13:39:23] ok [13:46:49] well well [13:46:54] there you go ottomata [13:47:33] paravoid: meh [13:47:52] perhaps i'll make one vlan which we bridge between the two MXes [13:48:08] hm [13:48:19] that goes through the EXes? [13:48:24] i'd prefer not [13:48:31] this has the potential of looping at some point in the future [13:48:35] i'd prefer to hookup the netapp controllers directly [13:48:43] would cost 2 MX ports [13:49:01] and it might save us from a loop [13:49:06] yeah [13:49:11] so, I agree with you [13:49:13] direct is better [13:49:16] just probably need to buy 10G NICs for the netapps [13:49:46] oh, I have an idea what those 10Gs might be [13:49:51] how do we do inter-site replication? [13:50:03] just over the normal network interfaces [13:50:23] it's LACP over two GigEs per controller [13:50:55] https://communities.netapp.com/thread/12200 [13:51:08] "So those 2 ports (c0a, c0b) would be for the cluster interconnect only. Makes sense for this to no longer be infiniband as I imagine its easier for manufacturing, and the bandwidth is suitable." 
[13:51:15] so, the IB that I knew got replaced by those [13:51:39] yeah [13:53:32] you can add up to 4x dual 10GbE adapters apparently [13:54:00] the filers are usually normal x86 boxes and they get PCIs [13:54:05] yes [13:54:12] at least the ones I've seen, clearly some things have changed :) [13:54:18] still so on ours [13:55:52] I wonder how much overpriced will those cards be :P [13:56:10] probably very [13:57:14] i wonder if you could just put some generic ones in ;) [13:57:29] moorning RobH! [13:57:32] probably not [13:57:34] it needs a driver of course [13:57:37] heya [13:57:52] could you help me figure out some issues with the auto install of the new dells? [13:57:55] we still haven't figured it out [13:58:15] it looks like maybe bios is reporting different MAC addresses (not sure about that, could be my fault, trying to figure that out now) [14:01:25] ottomata: what's the problem? [14:01:38] i just looked at it [14:01:48] all ethernet NICs on the system report disconnected [14:02:05] ah, this was to RobH [14:02:16] yes [14:03:11] oh they report disconnected? [14:03:30] sorry, i missed that when you said it before [14:03:33] crazy, ok [14:03:40] no.. [14:03:44] eth4 is connected [14:03:50] 6: eth4: mtu 1500 qdisc mq state UP qlen 1000 [14:03:51] link/ether 04:7d:7b:a5:e6:94 brd ff:ff:ff:ff:ff:ff [14:03:51] inet6 2620:0:861:106:67d:7bff:fea5:e694/64 scope global dynamic [14:03:51] valid_lft 2591999sec preferred_lft 604799sec [14:03:51] inet6 fe80::67d:7bff:fea5:e694/64 scope link [14:03:51] valid_lft forever preferred_lft forever [14:03:54] so that's the problem [14:04:16] hm [14:04:20] perhaps rob can see what available NICs there are [14:04:23] and probably use the other one [14:04:27] who bought these servers? 
[14:04:32] I don't recall signing off on these [14:04:38] so I just checked one of the macs I had on 1012 [14:04:40] and it was off [14:04:58] I may have pasted in the wrong place, I have 1012's mac on 1013 in linux-host-entries [14:05:13] so this might be my fault [14:05:26] i'm going to change 1012 and try to boot and see if it works [14:06:51] New patchset: Ottomata; "Fixing MAC address on analytics1012." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19214 [14:07:11] mark or paravoid, could one of you approve that? I can merge and apply [14:07:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19214 [14:07:32] I will [14:07:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19214 [14:07:43] too late :P [14:08:04] nope [14:08:09] I'm in sockpuppet :P [14:08:35] paravoid, are you merging on sockpuppet? [14:08:42] apergos: I merged your ms10 hosts.allow too [14:10:04] crap did I no do that? mybad [14:10:34] thanks [14:15:27] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [14:25:10] ok, mark, RobH, I just double checked on analytics1011 [14:25:15] bios reports e1:b2 as NIC1 [14:25:22] which is what I have in puppet [14:25:27] but linux doesn't [14:25:35] so as long as you can boot from what linux thinks is eth0 [14:25:38] that's probably easier to use [14:25:41] PXE boot i mean [14:25:59] or just disable the secondary nic in bios [14:26:08] so weird, why would they see different MACs? 
[14:26:13] its odd that ubuntu is detecting nic2 for nic1 [14:26:18] isn't the secondary NIC the mgmt interface [14:26:22] it used to happen on older super micro systems [14:26:22] naw, as far as I can tell, neither are correct [14:26:24] no [14:26:26] NIC2 is e1:b3 [14:26:34] the mgmt interface is not either of the primary nics [14:26:36] mark saw: [14:26:36]     link/ether 04:7d:7b:a5:e6:94 brd [14:26:40] ah ok [14:27:00] not nic2 [14:27:04] nic5 [14:27:08] eth4 [14:27:09] nic5! [14:27:11] aye [14:27:11] ok [14:27:17] uhh, and those aren't in bios…ok [14:27:20] at least I don't see them [14:27:26] there are only two nics in the c2100 [14:27:32] ok, how do I find the proper MACs on each of these dells? [14:27:37] in the bios [14:27:45] for analytics1011 [14:27:50] the bios shows [14:27:53] * NIC1 Mac Address [04-7D-7B-A5-E1-B2] [14:27:54] * NIC2 Mac Address [04-7D-7B-A5-E1-B3] [14:27:59] ok [14:28:23] mark says he saw eth4 link/ether 04:7d:7b:a5:e6:94 brd in linux [14:28:29] right mark? [14:28:44] i can confirm the proper nic1 is attached to the network [14:28:50] ys [14:28:54] that was analytics1011 [14:29:21] right [14:29:49] mark, how did you get that MAC from linux? executed shell and then…? [14:29:53] ip addr show [14:29:55] k [14:34:59] yeah ok, mark, i see e1:b2 as eth4 [14:35:02] on 1011 [14:35:11] 6: eth4: mtu 1500 qdisc mq state DOWN qlen 1000 [14:35:11] link/ether 04:7d:7b:a5:e1:b2 brd ff:ff:ff:ff:ff:ff [14:35:36] which is what BIOS shows for NIC1 [14:36:50] so, in this case at least, MAC is (and has been) correct in puppet [14:37:03] RobH, any idea what might be going on here? [14:37:21] MAC is correct, but auto install cannot configure network on boot [14:37:54] notpeter saw packets reaching brewster dhcp just fine, but it won't auto configure [14:45:46] hm [14:46:02] so, is it possibly trying to use the wrong interface? 
[14:46:21] when I try to 'configure network', the prompt says 'detecting link on eth0' [14:46:29] then it attempts to configure the network with DHCP [14:46:31] but that fails [14:46:42] how many interfaces does it have (besides mgmt)? [14:46:45] apergos, which logs are you lookingat on brewster? [14:46:58] /var/log/messages, it's where dhcpd writes its info [14:47:02] linux reports eth0-eth5 [14:47:05] eth0 says [14:47:06] 2: eth0: mtu 1500 qdisc mq state DOWN qlen 1000 [14:47:37] but is not that MAC that BIOS reports as NIC1 [14:47:39] that is on eth4 [14:47:46] oh ho [14:47:47] well [14:47:53] New patchset: RobH; "added fluorine to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19221 [14:48:05] who cares what the bios says, q is whether linux consistently sees something else as eth0 [14:48:21] i would disable all but one nic in bios. [14:48:23] and see what it does. [14:48:32] (leave nic1 enabled, disable nic2) [14:48:34] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19221 [14:48:36] ok, will try that [14:48:42] damn, what did i fubar in my set [14:50:36] bleh, i hit some bad key combo and threw random characters throughout site.pp... awesome [14:50:36] is it like 2 onboard and 3 other NICs? or how does it manage to be 5 interfaces? [14:50:53] heh [14:52:05] New patchset: RobH; "added fluorine to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19221 [14:52:26] for people who spin up virtual instances to play with puppet configs locally: you use this puppet,master::self setup... and then when you want to sync from the master puppet repo again, do you have to toss that instance and start over or is there some way to force the update? [14:52:44] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19221 [14:53:07] I have no idea how the install for ottomata is showing 5 nics =P [14:53:16] heh [14:53:18] ottomata: im pretty distracted on site but other folks seem to be helping [14:53:22] there's only two plus mgmt? [14:53:25] but if you guys make no headway i can glance at it later [14:53:47] yeah, afaik [14:53:48] New review: RobH; "just adding a server to site.pp" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/19221 [14:53:49] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19221 [14:54:22] RobH: fwiw, i think 3-4 ppl have looked at it and it's been more than a few days. but does also sound like more stuff can be tried (e.g. disabling all but one and seeing what happens) [14:54:40] yeah in bios now, trying that [14:55:07] um, i think both of my NICs are currently disabled... [14:55:20] that would be odd [14:55:25] ok never mind [14:55:28] they aren't [14:55:32] the bios just reads funny [14:55:38] you can disable them all of course ;] [14:55:41] ha [14:55:43] disabling nic2 [14:56:22] ok, here is a stupid question [14:56:30] how do I tell bios to save with F10 [14:56:38] from a mac? my F- buttons are all mapped to do stuff [14:57:34] ah nm, i found a save without F [14:58:00] ok, two 'neverminds' in a row here, I will try harder before asking so fast :p [15:05:17] rats, RobH [15:05:19] same deal [15:05:23] with NIC2 disabled [15:06:17] and apergos [15:06:33] (since rob is busy, sorry for poking) [15:06:42] yea, onsite today =P, sorry [15:06:50] s'ok! [15:06:56] onsite means in data center? [15:07:28] yep [15:07:43] from the mgmt console if you list the macs what does it show you (how long a list)? 
[15:07:52] ip addr show [15:07:57] eth0-eth5 [15:08:02] so 6 ifs [15:08:10] not including lo [15:08:22] https://gist.github.com/3341607 [15:09:04] and yeha, I do see dhcp logs on brewster [15:09:16] i don't know what they are supposed to look like, but they look normal i think [15:09:18] * jeremyb is pretty sure that means you have some kind of extra card (PCI) with 4 NICs and also 2 onboards [15:09:41] and the last 2 are the onboard [15:09:46] the BIOS is the onboard [15:09:47] aye, and, on boot auto network configure, it says it is trying to use eth0 [15:09:50] eth0 is PCI [15:10:08] but nothing's plugged in there, right? [15:10:14] physically [15:10:21] ah, apergos, i take that back [15:10:24] after disabling NIC2 [15:10:28] the last 2 must be nic1 and 2 I think [15:10:34] I only see eth0-eth4 [15:10:35] which makes sense [15:10:36] yeah totoally [15:10:42] eth4 is nic1, eth5 is nic2 [15:10:46] the MACs match up [15:10:47] so yeah [15:11:11] 04:7d:7b:a5:e1:b2 this is the only mac of yours in the logs on brewster [15:11:17] yes, and that is the proper one [15:11:21] that is eth4/NIC1 [15:11:36] ok [15:11:50] and it claims to get he correct ip addr for you [15:11:52] so, when I say'configure the network' [15:11:57] I see this message first [15:11:57] ┌───────────────┤ Detecting link on eth0; please wait... ├────────────────┐ [15:12:17] then it does 'Configuring the network with DHCP' [15:12:20] but fails that step [15:12:43] can you get to anoother console from there? (there's how not well I know the ubuntu installer :-/) [15:12:45] right, it's not plugged in [15:13:02] apergos, you want a console? [15:13:04] DHCP will always fail if there's no layer 1 [15:13:09] I can get out and you can use it [15:13:15] no [15:13:28] ? [15:13:29] I mean another console in ubuntu with maybe a root prompt [15:13:37] oh yeah, i can get a shell [15:13:43] that's how I did ip addr show [15:14:35] ok and can you paste that output? [15:14:49] ip addr show? 
[15:14:56] uh huh [15:15:03] https://gist.github.com/3341681 [15:15:31] you can see that eth4 is e1:b2 [15:15:50] also, notable, the logs on brewster only show up when the thing is booting pxe [15:15:57] NOT when I try to autoconfigure the network from the installer [15:17:03] well I assume that if it's sending from some other random ethn, with no cable, no packets are going to get anywhere. [15:17:08] for the install. [15:17:12] right [15:17:19] makes sense [15:17:28] so next up is how to get linux to reorder the nics [15:17:30] soooo, who knows what those other eths are [15:17:32] yeah [15:17:38] or disable 0-3 [15:17:52] I don't suppose you can re-order them in the bios somehow? (though that might not propogate to the os) [15:18:04] !log reprovisioning srv281 with a new fs layout [15:18:12] i dunno, the bios only shows NIC1 and NIC2 (eth4 and eth5) [15:18:15] Logged the message, Master [15:18:23] at least, where I looked (in PCI configs [15:18:28] i'll reboot in bios and see what I can find [15:19:03] ok [15:19:12] I'm looking on teh interwebs also [15:19:18] k danke [15:19:48] PROBLEM - Host srv281 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:57] hm, nope, I don't see anything in bios other than NIC1 and NIC2 [15:24:25] Remove the entries in udev rules and let the os pick the adapters up again? It should use those for persistant naming based off hardware [15:25:19] its a fresh os installer, it should be doing that every reboot. [15:25:20] RECOVERY - Host srv281 is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [15:27:29] i've reenabled NIC2 (since this didn't help) and am rebooting pxe again [15:27:35] will look at udev rules...? [15:28:13] those rules are remade on every boot into the isntaller [15:28:20] since its the installer, it doesnt save that stuff locally [15:28:32] * RobH has interesting dell behavior [15:28:40] ottomata: is 1012 the same? 
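(editor's sketch) The mapping worked out above, where the BIOS NIC1 address 04-7D-7B-A5-E1-B2 turns out to be what Linux calls eth4 while the installer probes eth0, can be checked mechanically from pasted `ip addr show` output. This is a minimal sketch, not existing tooling: the function names are made up for illustration, and the sample text is only the two interfaces quoted in this log, with the flag field missing exactly as it is in the paste.

```python
import re

# "2: eth0: ... state DOWN ..." -> interface name and link state.
# The <FLAGS> field is optional here because the pasted output in the log lacks it.
HEADER = re.compile(r"^\d+:\s+(?P<name>[^:\s]+):.*\bstate\s+(?P<state>\S+)")
ETHER = re.compile(r"link/ether\s+(?P<mac>[0-9A-Fa-f:]{17})")


def normalize_mac(mac):
    """Accept BIOS style (04-7D-7B-...) or Linux style (04:7d:7b:...)."""
    return mac.strip().lower().replace("-", ":")


def parse_ip_addr(text):
    """Return {interface: {"state": ..., "mac": ...}} from `ip addr show` text."""
    interfaces = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            current = header.group("name")
            interfaces[current] = {"state": header.group("state"), "mac": None}
            continue
        ether = ETHER.search(line)
        if ether and current is not None:
            interfaces[current]["mac"] = ether.group("mac").lower()
    return interfaces


def interface_for_mac(interfaces, mac):
    """Name of the interface carrying `mac`, or None if it is not present."""
    wanted = normalize_mac(mac)
    for name, info in interfaces.items():
        if info["mac"] == wanted:
            return name
    return None


# Only the lines quoted in the log; the other interfaces are not reproduced here.
SAMPLE = """\
2: eth0: mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 90:e2:ba:11:7f:40 brd ff:ff:ff:ff:ff:ff
6: eth4: mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 04:7d:7b:a5:e1:b2 brd ff:ff:ff:ff:ff:ff
"""

if __name__ == "__main__":
    parsed = parse_ip_addr(SAMPLE)
    print(interface_for_mac(parsed, "04-7D-7B-A5-E1-B2"))  # eth4
```

Feeding it the BIOS-formatted address keeps the comparison independent of the dash/colon and upper/lower-case differences between the two sources.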
[15:28:40] two 2tb disks wont detect in same r310 =P [15:28:54] yeah, so far, same same everywhere [15:29:05] if so, can't you just set these to use eth4 for the installer? [15:29:13] can I? [15:29:15] PROBLEM - SSH on srv281 is CRITICAL: Connection refused [15:29:17] sure [15:29:26] the question is what are the other nics [15:29:26] anyway, back in ~30 mins [15:29:34] cuz we dont have this issue on the other c2100s [15:29:36] so somethign is odd about this [15:29:51] (the swift hosts dont have this issue, im not convinced its not networking, but i have not looked at it yet) [15:29:56] aye, hm [15:29:59] these were ordered with extra NICs and the others weren't? [15:30:02] ok, lemme try a random other one [15:30:09] jeremyb, do you know how to tell the installer to use eth4? [15:30:25] The real fun is when you add a drive and suddenly it won't load the bios, even after removing the drive and you find out it's a bug in dell's EFI/BIOS dual stack thing. :( [15:32:53] looking [15:33:36] RECOVERY - SSH on srv281 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:34:49] ottomata: see files/{tftpboot,dhcpd}/* [15:35:19] hmmk [15:35:23] ottomata: there's seperate pxelinux.cfg files for the different types of serial consoles. you could add another for your funky network [15:35:31] why the other c2100s work, that's very interesting [15:35:39] trying 1016, seeing if it is the same... [15:36:00] ottomata: on the append line in pxelinux.cfg try switching netcfg/choose_interface=auto to be e.g. netcfg/choose_interface=eth4 [15:36:16] although idk what auto does. shouldn't that just pick the one that's got a live link? [15:36:19] if there's only one [15:36:28] maybe look that up [15:36:30] i have to run [15:36:58] ok, thanks jeremyb! [15:37:07] if other hosts are well behaved we are better off trying to find out why this one is not [15:37:11] yeah true [15:37:32] i guess the first thing to do is to figure out why we have some many ifs, right? 
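(editor's sketch) The pxelinux.cfg change suggested above, switching `netcfg/choose_interface=auto` to `netcfg/choose_interface=eth4` on the append line, could look roughly like the helper below. Only the `netcfg/choose_interface` key comes from the discussion; the function name and the rest of the sample append line are hypothetical and not copied from the real files/tftpboot config.

```python
KEY = "netcfg/choose_interface="


def set_choose_interface(append_line, interface):
    """Return the append line with netcfg/choose_interface set to `interface`.

    Existing occurrences are rewritten; if the key is absent it is appended.
    Note that runs of whitespace are collapsed to single spaces.
    """
    tokens = append_line.split()
    replaced = False
    for index, token in enumerate(tokens):
        if token.startswith(KEY):
            tokens[index] = KEY + interface
            replaced = True
    if not replaced:
        tokens.append(KEY + interface)
    return " ".join(tokens)


if __name__ == "__main__":
    # Hypothetical append line, for illustration only.
    line = "append initrd=initrd.gz netcfg/choose_interface=auto"
    print(set_choose_interface(line, "eth4"))
```

As noted in the conversation, this only works around the symptom; it does not explain why these hosts enumerate extra interfaces ahead of the onboard ones.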
[15:37:39] can rob check that out since he is in datacenter? [15:37:47] well you're trying to look at 1016 [15:37:47] do we need someone to look at the back of the box and see? [15:37:51] so let's see what that does [15:37:54] yeah [15:37:55] just finished [15:37:57] same deal [15:37:57] ok [15:38:00] lemme see if it has 6 ifs [15:38:01] hm [15:38:24] yup [15:38:24] same [15:38:31] eth4 is NIC1 [15:38:43] what's a host that is on one of these and is live? [15:39:28] mark: what's the status with the mediawiki module stuff? [15:39:51] we currently have a conflict, mediawiki::sync is defined in both modules/mediawiki/manifests/sync.pp and manifests/mediawiki.pp [15:40:01] I'd presume the situation is the same with some of the other stuff too [15:40:27] ugh, right [15:40:43] i think we should temporarily rename the module or so [15:44:20] won't help much [15:44:28] since all the classes are defined with their full name [15:44:33] we'd need to rename all the classes too [15:44:35] "yay"! [15:44:59] so, what's the status? can we get rid of manifests/mediawiki.pp instead? [15:45:19] no not really [15:45:22] at least, not without thorough review [15:45:35] we have not looked at backwards compatibility at all, since we didn't realize they would conflict [15:45:36] apergos [15:45:42] none of them, afaik, maybe one of the swifts? [15:45:42] yes? [15:45:43] i'm not sure [15:46:02] these dells (analytics1011-22) are all new to me [15:46:26] there might be other c2100s, someone said something about the swifts [15:47:22] check racktables? [15:47:47] can you check that the bios settings for the network interfaces look like wikitech says ( http://wikitech.wikimedia.org/view/Dell/PowerEdge_C2100 ) ? [15:48:00] if not then don't change anything, but I'm interested to see if something's different [15:48:26] yeah well I'm digging around in racktables [15:48:34] and it's not set up to find stuf in that field [15:49:58] ok checking... 
[15:50:15] ah there we go, (turns out one can simply click on it to get the list :-/) [15:50:33] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [15:50:52] checking, but most of the stuff on that page doesn't have much to do with networking [15:50:54] but will check [15:51:05] yeah just the networking piece please [15:51:07] so [15:51:09] ignore the rest [15:51:18] what's the status? [15:51:34] our stuff mark? [15:51:38] yes [15:51:44] linux should see only the two nics with 04:7d (I am on another c2100) [15:51:48] the swift C2100s don't have 6 nics [15:51:50] only 2 [15:51:56] i don't know why these were ordered with more [15:52:06] do they show 6 in bios? [15:52:06] were they? or is the configration wrong? [15:52:08] or just in linux? [15:52:12] probably not [15:52:17] since the 4 additional ones are an addon card [15:52:22] and the bios usually doesn't know about those [15:52:32] so, just make that card not enable itself somehow [15:52:40] usually you can do that in the bios, or in the NIC bios for those 4 cards [15:53:06] PROBLEM - NTP on srv281 is CRITICAL: NTP CRITICAL: No response from NTP server [15:53:46] what's the problem? [15:54:02] apergos, networking stuff in bios looks normal [15:54:04] some nic reordering issue [15:54:06] ok [15:54:18] they bought C2100 with additional nics, linux detects those first [15:54:18] hm, [15:54:20] ah, usual stuff [15:54:22] should just disable them [15:54:23] there's gotta be a way to disable the cards [15:54:27] i don't see anything in the bios for those cards, so not there [15:54:29] or even pull em out [15:54:29] that's why Dell authored biosdevname [15:54:30] NIC bios? [15:54:40] pull em out works too [15:54:46] pull em out is fine with me [15:54:48] ok, which one can I play with? [15:55:09] you're on 1011 right? 
[15:55:25] not realyl [15:55:29] but i'll take that [15:55:50] I mean ottomata [15:56:12] was kicking that one I belive [15:56:13] i just took 1015 [15:56:36] 1015 is good [15:56:42] if you boot to bios at all [15:56:46] double check that I got the MAC right [15:56:56] should be 04:7d:7b:a5:e4:76 [15:58:10] it is [15:59:50] ok good [16:01:34] yeah let's just have rob pull those cards out [16:01:38] no point in having them anyway [16:01:42] and probably easier than disabling them [16:02:44] RobH: when you are back, could you please take out the additional NICs in the analytics C2100 so they can continue? I assume it's an addon card anyway... thanks [16:03:13] did no one actually want them to begin with? [16:03:39] they showed up by magic [16:10:11] i unnoooo [16:10:39] afaik we only need one if on most of these (although it might be nice to have a public and private IP on one of them...) [16:10:59] nono [16:11:23] what's the point of that [16:13:30] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:14:51] not asking for it, just saying it'd be nice... [16:15:00] i dunno, so I could set up a VPN and not have to do this annoying proxying tunnel stuff [16:15:01] why would that be nice? [16:15:19] if it has a public ip it's a public server [16:16:49] aye, cool, i'm sure there are tons of networking details I don't about that you guys have going, which is why I am not asking for it :) just saying it would be nice not to have to proxy to get to web guis on internal nodes [16:16:50] how is proxycommand annoying? pretty seamless to me. only complaint i have is the killed by signal msg when i disconnect [16:17:16] oh, web proxy [16:17:35] will you have one per node? or one for the whole cluster? 
[16:17:36] yeah, and I just got my ssh configs worked out last week [16:17:44] so the bastion stuff is working now better for me than before [16:17:58] you could just have a permanent proxy like gdash/ishmael/graphite do [16:18:02] those are all internal boxes [16:18:14] they proxy through fenari's apache [16:18:27] who's going to use the web GUIs? [16:20:26] New patchset: Alex Monk; "(bug 39306) Add a flood group to itwiktionary." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19230 [16:20:36] i have one set up with http auth on analytics1001 to use [16:20:57] its working ok, and i need to setup fancy foxyproxy or something to make it easier for me [16:21:02] its just more and more things to set up and maintain [16:21:08] a vpn means I wouldn't have to deal with that stuff [16:21:13] we had that at couchsurfing, was so nice [16:21:21] we only had a very few machines with public IPs [16:21:23] everything else was internal [16:21:26] but we could sign onto the vpn [16:21:31] and then we were on the internal net [16:21:43] but but but [16:21:49] I am not asking for this! [16:21:50] just sayyyyin [16:22:08] anyway, i gotta get some food, I guess we are waiting for RobH to have time to remove the excess NICs, ja? [16:22:35] ottomata: so, there's a permanent proxy on fenari for those services. why not just add another proxy on fenari? [16:23:02] ottomata: i still don't know who your intended audience is. who will use these services [16:23:24] back in a while (before the meeting)... [16:24:01] mainly just me and othjer analytics folks right now, we might need somethign later [16:24:03] agg, be back in abit [16:36:59] New patchset: Pyoungmeister; "temp hack to differentiate applicationserve module and the old applicationserver role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19232 [16:37:40] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19232 [16:38:02] i hear the mediawiki class also conflicts with the new module [16:38:24] yes. there's a double def of the package [16:38:28] going to track that down today [16:38:45] just manifests/mediawiki.pp, no? [16:39:19] the new role class shouldn't get that though, I don't think? [16:39:39] I believe that what's happening is that it's getting the new mediawiki module and the old role class [16:39:46] which pulls inthe old package class [16:39:49] leading to the double def [16:40:03] there's a module mediawiki and there's a class mediawiki outside the module [16:40:05] they simply conflict [16:40:20] kk [16:44:51] New patchset: Pyoungmeister; "temp hack to differentiate applicationserve module and the old applicationserver role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19232 [16:45:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19232 [16:46:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19232 [16:49:09] food, bbl [16:50:56] * jeremyb waves LeslieCarr moin [16:51:42] *grumble* [16:52:30] another fine monday [16:52:39] hehe [16:52:45] where by fine I mean as in "the fine manual" [16:53:33] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [16:55:27] New patchset: Pyoungmeister; "changing scoping on include in mediawiki module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19233 [16:56:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19233 [16:57:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19233 [17:05:33] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:07:01] hey RobH, any news on stat1001? 
[17:08:33] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [17:09:17] cmjohnson1: yay! [17:09:23] search32 is the best albatross ever :) [17:14:45] heheheh [17:18:22] New patchset: Aaron Schulz; "Switched backend reads to swift for testwikis and mw.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19235 [17:19:01] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19235 [17:21:34] drdee_: uhh, we cleared that ticket awhile ago [17:21:40] drdee_: its been fixed [17:22:17] RobH, a while ago? i thought we were waiting for a new motherboard as the CPU2 socket was broken [17:22:27] well, awhile ago being last week [17:22:30] RobH: did you see the latest on the new C2100s? [17:22:36] jeremyb: ? [17:22:54] RobH: they want you to remove the addon (PCI?) ethernet from those boxes [17:23:20] RobH: okay cool! and is precise installed? [17:23:33] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [17:24:16] oh bugger, those do have a ton of ethernet [17:24:33] i forgot we added them since analytics orignally planned to bond 2+ interfaces [17:24:42] jeremyb: cant we just move the nic to the first port on the card? [17:24:47] or disable it in bios? [17:25:02] drdee_: i didnt do anything but fix the mainboard power issue [17:25:12] it will have to be reinstalled [17:25:25] RobH: idk... i think mark looked and gave up [17:25:32] 13 16:02:44 < mark> RobH: when you are back, could you please take out the additional NICs in the analytics C2100 so they can continue? I assume it's an addon card anyway... 
thanks [17:25:33] RobH: okay, i will open an RT ticket [17:25:44] drdee_: cool, i may snag it from you later [17:25:56] jeremyb: urgh, thats more annoying, because I dont have the slot covers [17:26:01] so i will have a bunch of slightly open cases [17:26:08] best to disable in bios, lemme take a look at it [17:26:18] jeremyb: i take it i can do whatever to analytics1011? [17:27:39] yeah totally [17:27:50] RobH, fire at will [17:28:00] wait, we originally wanted to use 2+ interfaces? [17:28:08] maybe not with real fire [17:28:12] try blanks first [17:28:21] ottomata: originally it was explained that these may need an insane amount of network throughput [17:28:30] that's true... [17:28:36] since then, its less, but they were already ordered at that point [17:28:45] dschoon would know more about the original order [17:28:53] but the idea was if you guys needed to, we could bond multiple 1tb connections together [17:29:06] its annoying that the OS sees the addon card first [17:29:09] but bios the other way around [17:29:14] yes, there is. [17:29:29] aye yeah, i guess if we wanted, that originally, we might want to try to keep it. if we could reorder them it would be best then [17:29:29] OR [17:29:32] ottomata: if i can make it work with the add on cards primary port, we may go that route, and leave the cards installed. [17:29:55] as jeremb suggested earlier, it should be possible to modify pxelinux.cfg for these, right? [17:30:01] and force them to use eth4? [17:30:09] and there's a serious argument for setting up the network topology between the machines to dedicate interfaces to specific applications (like allocating an interface to each mapper, etc) [17:30:16] but we don't need to start there [17:30:21] netcfg/choose_interface=auto [17:30:25] ^^ ottomata RobH [17:30:27] ottomata: why modify the installer? [17:30:36] why not just plug the network into port1 on the addon card and it all works? [17:30:42] that's cool too! 
[17:30:43] even better [17:31:08] ottomata: i think thats the way to go, but i will test it in a few (gotta finish fluorine first) [17:31:48] ok cool [17:31:48] danke [17:33:29] ottomata: maybe you need some firmware for the addon card? [17:34:05] but yes, plugging in the right place does seem like a good thing to try ;) [17:34:10] the dell just uses addon ethernet before mainboard ethernet in the os is all [17:34:25] i didnt realize these had the addon, it explains it all quite well [17:38:32] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [17:38:40] cmjohnson1: you didnt get in the copper attached SFPs from stayonline for them yet? [17:39:25] cmjohnson1: y u break it ? [17:39:34] also what rack are these again ? [17:40:05] isn't that waiting on the SFP+ cables? [17:41:28] cool [17:58:47] New patchset: Pyoungmeister; "another namespace hack. mediawiki module is now mediawiki_new" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19237 [17:59:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19237 [17:59:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19237 [18:01:07] yeah, those aren't showing up … [18:01:08] sigh [18:01:14] i mean they show as something plugged in [18:01:19] they just don't recognize it as valid [18:02:56] New patchset: Pyoungmeister; "mediawiki module namespace stuff: must rename dir as well." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19239 [18:03:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19239 [18:04:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19239 [18:13:26] New patchset: Demon; "Explicitly specify FollowSymlinks for the Mobile nightly builds" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19240 [18:14:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19240 [18:16:18] ottomata: ops meeting [18:16:20] please call in [18:20:50] ottomata: 13 18:16:17 < mark> ottomata: ops meeting [18:26:57] cool, yeah, trying to remember how... [18:27:22] x2002 [18:27:36] but if it doesn't work, and you're remote, dial the office number and ask em to transfer you [18:27:56] right phone.... [18:28:36] 1-415-839-6885 [18:28:36] ? [18:28:53] yes [18:29:33] great got it [18:29:34] thanks! [18:29:38] sweet [19:03:51] Jeff_Green: ok fun stuffs [19:03:56] ok [19:04:08] so, it should be forwarding to boron, 10.64.40.66, right ? [19:04:20] ya yes [19:05:46] so i don't have sudo on boron [19:05:48] can you fix that ? [19:06:03] sure [19:07:29] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [19:07:29] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [19:07:35] LeslieCarr: try now [19:08:32] nope [19:08:55] did you log out and back in? [19:09:17] ah [19:09:34] yay [19:10:06] omg why the hell is the default visudo editor pico ?! the horror [19:11:00] well, want to try another dhcp ? [19:11:15] tcpdumping the incoming interface as well as boron [19:11:30] sure [19:12:17] let me know which machine to scrutinize/kick :) [19:12:42] you can do any of the payments100* [19:12:47] i just logged into payments1001 [19:13:11] cool [19:13:15] want to kick payments1001 ? 
[19:13:21] yup [19:13:24] doing so [19:13:59] i'll let you know when it gets to dhcp [19:16:53] ok it's requesting [19:17:20] interesting... [19:17:26] not seing any traffic... [19:17:29] ya [19:17:31] on the interface at all [19:17:42] on boron? [19:17:46] oh shit [19:17:52] blocked? [19:18:11] no, i was looking at the wrong interface [19:18:14] ha [19:18:18] :( [19:18:21] fail [19:18:56] Jeff_Green: nano is the system's default editor; have a look at update-alternatives --config editor [19:19:29] paravoid: ahh, thanks [19:19:33] visudo just uses /usr/bin/editor which in turns uses /etc/alternatives/editor [19:19:43] symlinked to it [19:19:50] and that's managed by update-alternatives [19:19:52] yeah, makes sense--i was looking for pico, didn't notice it was nano [19:19:52] ottomata: im looking at analytics1011 [19:20:12] don't say pico, it shows your age :P [19:20:31] i was looking for Smith Corona [19:20:53] there's no pico anymore [19:20:55] nor pine [19:21:24] Jeff_Green: can you restart the dhcp funness ? [19:21:31] yu[ [19:21:51] paravoid: you know I use alpine . . . [19:22:01] ah RobH, danke! [19:22:03] I know :) [19:22:51] thankfully the kids writing these fancy new operating systems provide a handy symlink [19:22:59] Jeff_Green: i just put "export VISUAL=.." in .bashrc [19:23:01] LeslieCarr: rebooting [19:23:11] ottomata: so the add on NICs do NOT have PXE options enabled (sounds like the 10g issue) [19:23:16] im just going to yank the damned things out [19:23:32] hmm, ok... [19:23:36] of course, i dont have nice slot covers on hand to install, so they will be slightly ugly (not that you will see that ;) [19:23:39] mutante: ya. i was just confused, nano (or pico or anything related) seemed like just about the last choice anyone would set as a default, and I was baffled as to why that's what I was getting without any override [19:23:45] i mean, it sounds like they were ordered for a reason [19:24:03] was 10g never solved? 
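A small sketch of the symlink chain described above (/usr/bin/editor pointing at /etc/alternatives/editor, which update-alternatives manages). The function name symlink_chain is an assumption for illustration. It only follows links hop by hop and does not call update-alternatives.

```python
import os

def symlink_chain(path: str) -> list[str]:
    """Follow path one symlink hop at a time and return every path visited."""
    chain = [path]
    seen = {path}
    while os.path.islink(path):
        target = os.readlink(path)
        # Relative targets are resolved against the directory holding the link.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        if path in seen:  # guard against symlink loops
            break
        seen.add(path)
        chain.append(path)
    return chain

# On a Debian-style system this would be expected to show the hops to the
# real editor binary; elsewhere it just prints the path unchanged.
print(" -> ".join(symlink_chain("/usr/bin/editor")))
```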
[19:24:07] pxe i mean [19:24:17] it was, needed special bios flash [19:24:29] this is another card, different firmware, same issue, but we dont need them anymore. [19:24:33] * Jeff_Green remembers being mocked as a junior admin in the last century for editing things with pico [19:24:42] im checking to see if i can set bios to what to detect first [19:24:46] k [19:24:54] Jeff_Green: all you need is vi man. [19:25:00] leslie dhcping [19:25:01] Jeff_Green: editor wars will go on forever:) "joe" :) [19:25:05] ok, i see the request... [19:25:10] mutante: you and brion are joe users. [19:25:34] mutante: I'm going to start working with WP51 [19:25:39] nah,i'm vim, i just suggest joe as an alternative over nano/pico [19:26:18] LeslieCarr: you saw the request hit eth0 on boron? no response made it back [19:26:49] ha, sounds like Jeff_Green and Leslie are troubleshooting the same thing I was on Friday [19:27:00] and peter was as well? [19:27:03] yeahg [19:27:04] except behind a few additional firewalls ;] [19:27:09] i saw it on the incoming interface on payments1001 [19:27:12] not on boron [19:27:12] sigh [19:27:48] ottomata: maybe? we're troubleshooting network/firewall config for the frack SRX's [19:27:54] Jeff_Green: WordPerfect? haha [19:28:03] yes! it's text-only friendly! [19:28:32] * jeremyb groans @ WP [19:28:36] Jeff_Green: can i donate to WP using lynx and w3m?:) [19:28:48] probably different problem, but we were doing the same thing: watching packets on dhcp servers to see if they got through [19:29:08] speaking of which, after about 2 hours of pounding my head on the desk last friday night I finally got pxelinux+DRAC to play nicely together, so I've got proper pxeboot menus working [19:30:56] Jeff_Green: it has Y2K issues:) [19:31:14] WP? no problem. I'll just run it in a vm set always to 1999. [19:31:47] yeah. 
hehe [19:32:21] !log powering off analytics1011-1014 to remove the add on nics (disrupting installer) [19:32:30] Logged the message, RobH [19:33:50] hrm, maybe putting each interface specifically instead of interfaces "all" [19:36:12] Jeff_Green: reboot again ? [19:36:19] k [19:38:44] LeslieCarr:I think it just got an IP [19:38:52] oh wait . . . nm [19:42:32] sigh [19:42:35] food guy just showed up [19:42:36] sorry about that [19:43:56] heya so, while waiting for RobH to figure out what to do with c2100 interfaces [19:44:03] i'm going to see if I can't install stat1001 [19:44:13] anyone know which type of hw it is? or how to find out? [19:44:21] ottomata: it's Cisco [19:44:44] aye ok…do you know if there is a wikitech page on it (checking...) [19:44:46] ottomata: http://wikitech.wikimedia.org/view/Cisco_UCS_C250_M1 [19:44:57] u so fast [19:45:03] grrr [19:45:19] oh boy cisco, so do you think this will be more or less of a headache than the c2100s? [19:45:40] ottomata: eh.. same in a different way?:) [19:45:42] ottomata: so i removed the add on nic in 1011-1014 [19:45:47] and now doing the next set of 4 [19:45:53] you can have abck 1011 and try to install now [19:45:58] should be fine with the mac stuff you have setup. [19:46:01] (now) [19:46:04] ottomata: but you already got the MACs and DHCP and DNS and stuff... [19:47:05] for stat1001, yeah, puppet seems all set [19:47:09] just a matter of booting i think [19:47:29] RobH, ok, cool! 
although am a bit sad about the loss of those IFs if we had originally ordered them on purpose [19:47:37] ottomata: it should be as long as partman works [19:47:41] k [19:48:11] ottomata: there was some reason though you did not want precise when i installed it [19:48:18] but i'm not sure which [19:49:16] Jeff_Green: i'm gonna eat then do some investigating [19:49:23] LeslieCarr: k [19:49:30] um, not anymore, i think we are ok with precise for everything now [19:49:36] on the other ciscos [19:49:39] hmmmm [19:49:45] actually i'm not sure, stat1001 is fine for precise though [19:49:53] so for the analytics1001-1010 ciscos [19:49:58] cmjohnson1: uhhh, ipmi_mgmt script is the easy way to do that stuff [19:50:06] we are not sure which hadoop/cloudera distribution we are going to use [19:50:08] cmjohnson1: if you run imp_mgmt in iron, or on sockpuppet, it has a built in help [19:50:10] if we use CDH3 [19:50:17] we need lucid, precise is not supported [19:50:17] ottomata: Well, they arent going elsewhere [19:50:27] ottomata: so if you need the cards in teh future, we can install them bak and fiture them out [19:50:32] ok cool, sounds good [19:50:34] wow, typos. [19:50:40] haha [19:55:20] anyway, but yeah, stat1001 is set to use precise in puppet [19:55:21] so should be cool [19:55:22] hrmm [19:55:28] cmjohnson1: i bet the password or user is wrong [19:55:33] the script hardcodes root [19:55:43] pxe booting 1011 now, lets see! [19:56:14] cmjohnson1: mw-be5.mgmt [19:56:17] oh poo but I did not change the MAC addies yet [19:56:18] doh [19:56:19] if you dont include mgmt, no dice [19:56:24] ottomata: no mac changes [19:56:29] the one you pulled from bios is nic1 [19:56:29] no changes? [19:56:38] oh right, cause you just unplugged the others [19:56:39] cool [19:56:43] so nic1 is eth0 now [19:56:45] right right [19:56:47] yep [19:56:57] looking bettter! [19:58:36] woot network setup good! 
[19:58:50] erg [19:58:51] Error while setting up RAID │ [19:58:51] │ An unexpected error occurred while setting up a preseeded RAID │ [19:58:51] │ configuration. [20:00:06] it dislikes your partman [20:00:13] yeah grr [20:00:13] more than likely cuz its not identical to the other systems used [20:00:24] ottomata: ugh, yeah, sounds familiar :/ [20:00:27] (these have a single ssd installed inside plus 8 disks) [20:00:31] the dells? [20:00:37] the c2100s, yep [20:00:43] analytics1011-1022 [20:00:49] hmmmm, ok, someone else wrote a partman for analytics dell [20:01:02] wasnt me, so i cannot speak to it [20:01:26] mutante twas you! [20:01:30] heheh [20:01:34] need to separate analytics partman - cisco vs. dell [20:01:36] ottomata: you mean analytics-cisco, not dell [20:01:56] ottomata: but now you are set on networking, im finishing up the last 4 of the batch now for it [20:01:59] stat1001 is cisco, actually haven't started that yet [20:01:59] =] [20:02:02] we are talking about analytics1011 [20:02:06] which is a dell c2100 [20:02:08] ah, gotcha [20:02:10] stat1001 isnt a cisco. [20:02:13] its a dell [20:02:16] r510. [20:02:17] wth [20:02:18] ooook [20:02:20] haha [20:02:21] good to know [20:02:26] thanks RobH! [20:02:36] analytics1001-1010 are the ciscos [20:02:37] sorry, i thought analytics1001 :p [20:02:42] yep [20:02:48] right yup [20:03:58] hm, ok i don't know partman [20:04:14] but, mutante, the diff between the -dell and -cisco recipes is which devices [20:04:28] did you mean to leave sdc and sdd for swap on dells? [20:04:28] 1 2 0 swap - \ [20:04:28] /dev/sdc2#/dev/sdd2 \ [20:04:48] you changed it to sda and sdb for root / [20:05:33] actually what i remember is changing it for analytics-cisco to NOT use sda and sdb.. ehmm.. [20:05:42] right hm [20:05:49] looking [20:05:50] you started with a -dell? 
they look mostly copy pasted [20:06:29] i copied an existing config that did not have either -dell or -cisco in it and renamed it to -cisco [20:06:39] then for consistency renamed the original to -dell [20:06:43] afair [20:08:17] yeah, git log analytics-dell.cfg [20:08:25] yeah [20:08:27] k [20:08:40] so maybe dells should use sda2 and sdb2 for swap? [20:08:43] https://gerrit.wikimedia.org/r/#/c/9387/ [20:08:51] files/autoinstall/partman/analytics-cisco.cfg [20:08:52] renamed from files/autoinstall/partman/analytics.cfg [20:08:56] ah cool [20:09:01] it was just "analytics.cfg" before that [20:09:05] aye [20:09:08] ok internet, tell me why you don't want the machines to install... [20:09:08] but i did not write the original file either :p [20:09:19] qwigohaerhinafs [20:09:28] analytics1021 poppedout of the rails onto me [20:09:33] my hand is now all bleeding. [20:09:47] ouch :/ [20:09:49] ow [20:09:51] go fix it! [20:09:54] don't type about it [20:10:02] geez [20:10:03] haha [20:10:15] its all cosmetic [20:10:15] and now his compy is dead from blood juice frying motherboard [20:10:30] nah, just now i have a bunch of shallow open cuts on my left hand fingers [20:10:39] it would be good to note if analytics1021 is possessed [20:10:40] gotta get a hammer and fix the rail now [20:10:43] for future reference when we are using it [20:10:45] ottomata: it will now be the fastest one [20:10:48] put it on the wiki [20:10:49] blood makes the servers run. [20:11:08] ah sysadmin blood libel? [20:12:01] ottomata: "# 2 drives with 1 raid1 partition + 2 swap partitions" is in thumper.cfg , maybe that helps? [20:13:12] reading, uhhhhhh [20:13:30] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [20:13:55] derpy derpy, oh this is an 'expert recipe' [20:14:05] hmmmm, am I an expert? 
[20:14:05] /dev/sdy and /dev/sdac are unusual :p [20:14:07] survey says, no [20:14:53] ottomata: http://wikitech.wikimedia.org/view/Partman#Dissecting_a_Semi-working_configuration [20:15:00] i wish i had more for you... [20:15:40] so yeah, wouldn't hte fact that sdc and sdd are not mentioned on [20:15:43] d-i partman-auto/disk string /dev/sda /dev/sdb [20:15:47] in -dell [20:15:53] mean that [20:15:53] 1 2 0 swap - \ [20:15:53] /dev/sdc2#/dev/sdd2 \ [20:15:56] would not work? [20:16:07] Jeff_Green: hrm, this may be it For Linux based DHCP clients it is seen that the "seconds elapsed” field is set to 0 in the DHCP Discover message and the SRX drops those packets instead of relaying them to the DHCP server, if the "minimum-wait-time" vale is configured with anything except 0. [20:16:48] wee, now my left hand is all bandaged up [20:16:51] and i have a hammer. [20:16:55] servers tremble. [20:16:58] uh oh, watch out 1022! [20:17:29] the rails on the sides of the c2100 have a sliding stud, and a bunch of screws [20:17:32] rather than two snap studs [20:17:35] i hate teh c2100. [20:17:48] LeslieCarr: we're not even getting to linux, this is just the interface trying to pxeboot [20:17:53] ok, mutante, iunno what i'm doing, but i'm going to change that, swap shouldn't be on sdc/sdd [20:18:58] yeah [20:19:04] sigh [20:19:18] ottomata: yeah, i wouldn't know better right now [20:19:21] is junos logging anything at all? [20:19:51] LeslieCarr: hey while you're in there can you burn holes for [all] --> spence/neon for ntp? [20:19:53] New patchset: Ottomata; "Fixing MAC for analytics1012" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19380 [20:20:19] it's showing that it sees 0 requests [20:20:39] New patchset: Ottomata; "Using sda/sdb for swap partition, not sdc/sdb on analytics dells." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19381 [20:20:57] mutante, could you +2 merge those bitteplease? 
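A minimal sketch of the consistency check behind the question above: does a recipe reference partitions on disks that are not listed in partman-auto/disk? The function name and the simplified one-line recipe are assumptions for illustration. This compares strings only and makes no claim about how partman itself treats such a mismatch.

```python
import re

def recipe_devices_outside_disk_list(disk_line: str, recipe: str) -> list[str]:
    """Return partitions named in recipe whose disk is not in disk_line.

    disk_line is the device part of the partman-auto/disk setting,
    e.g. "/dev/sda /dev/sdb".
    """
    disks = set(disk_line.split())
    # Partition paths such as /dev/sdc2; dropping the trailing digits gives the disk.
    partitions = re.findall(r"/dev/[a-z]+\d+", recipe)
    missing = []
    for part in partitions:
        disk = re.sub(r"\d+$", "", part)
        if disk not in disks and part not in missing:
            missing.append(part)
    return missing

# Values quoted in the discussion: disks sda/sdb, swap entry on sdc2/sdd2.
disk_line = "/dev/sda /dev/sdb"
recipe = "1 2 0 swap - /dev/sdc2#/dev/sdd2"
print(recipe_devices_outside_disk_list(disk_line, recipe))
```

An empty list would mean every partition in the recipe sits on a listed disk, which is what the proposed switch to sda2/sdb2 aims for.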
[20:21:00] but i can basically tcpdump the incoming port and see what looks like a request [20:21:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19380 [20:21:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19381 [20:21:41] LeslieCarr: the dhcp helper is seeing none? could filters block the request before it gets to the dhcp relay process? [20:22:13] they could but right now i have everything set to open so you could get this going and started up [20:22:21] there could be some incoming to the router issues... [20:22:45] but there's no way to block or open system-service dhcp on each interfacae in 11.4 ... [20:22:52] ottomata: +2'ed but not merged yet ....ehm [20:23:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19381 [20:23:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19380 [20:23:11] there they go [20:23:13] i can do sockpuppet [20:23:24] alright [20:23:47] bbl then, food [20:24:02] trying payments1003 [20:24:39] k [20:24:54] ah, can't remmber what to do on sockpuppet after 'git merge origin/production' [20:25:07] or, is that it? [20:25:10] there's a hook? hmmm [20:25:13] where is that wikitech page [20:25:15] ... [20:28:31] LeslieCarr, do you know where the wiki page is on how to merge puppet configs on sockpuppet? 
[20:28:35] i feel like I am missing a step [20:28:59] wikitech search is failling me [20:29:17] iirc it's not terribly well documented, sec [20:29:29] there was a page that had what I needed [20:29:30] git fetch [20:29:34] it's there yeah, sec [20:29:35] git diff HEAD origin/production [20:29:44] git merge origin/production (i think) [20:29:46] feel like i'm missing something [20:29:48] aye [20:30:10] it may be with the labs docs actually [20:30:36] oo [20:30:55] https://labsconsole.wikimedia.org/wiki/Git#Git.2FGerrit_and_the_puppet_repositories [20:31:11] yes! [20:31:12] https://labsconsole.wikimedia.org/wiki/Help:Git#Making_the_changes_live_in_puppet [20:31:14] danke [20:31:16] http://wikitech.wikimedia.org/view/Puppet#Making_changes --> "See the documentation on Labs for this. " [20:31:35] it's like a rat maze! you get more cheese the quicker you track that down next time. [20:31:36] hm, think I got it, ok... [20:31:52] i found it by searching for 'origin/production' [20:31:53] ha [20:32:33] ha glorious [20:32:51] ottomata: ok, all the addon nics are removed [20:32:56] i usually end up doing 'history' on sockpuppet because I forget and it's easier than looking it up in the docs [20:32:57] you should be all set for networking now on the batch [20:33:23] thank youuuuu [20:33:26] ahhh right [20:33:37] yeah hopefully i can get partman right now [20:33:56] my recent experience with partman: it's a good way to burn cycles with trial and error! [20:36:29] What the ... [20:36:35] srv194: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @ [20:36:38] srv281: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @ [20:37:00] RoanKattouw: yeah, reimaging those [20:37:08] they will get updated once puppet is run [20:37:12] but... troubleshooting. yay! [20:37:15] should be better soon [20:37:24] OK [20:37:40] Also, ops people, I have another request [20:38:15] Aaron needs a bunch of files with wrong ownership on ms7 to be cleaned up. 
I was offering to do this because I originally caused the bad ownership like two years ago [20:38:58] However, ms7 doesn't trust my key for root, so I can't do this on the machine directly. I could do it via NFS but I kind of don't want to because I think that'll just slow down the NFS share terribly [20:39:18] So could anyone either give me access to ms7, or do the cleanup for me? [20:47:35] RoanKattouw: thats cuz ms7 is solaris [20:47:40] doesnt puppetize. [20:47:51] Right [20:54:21] New patchset: Pyoungmeister; "sudo::group : allowing for multiple priv. definitions for one group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19384 [20:54:57] ottomata: easiest to just look at bash history for the right way to merge on sockpuppet. [20:55:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19384 [20:55:23] !log updating root auth keys file on ms7 by hand. [20:55:32] Logged the message, notpeter [20:56:50] aye, thanks maplebed [21:06:01] New patchset: Catrope; "Another path fix for Parsoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19387 [21:06:15] Could someone approve that one please? ----^^ [21:06:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19387 [21:10:52] RoanKattouw: sure [21:11:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19387 [21:11:40] welp, mutante, no movement with raid stuff [21:11:40] growly [21:11:46] i'm about to head out for the day [21:11:50] RoanKattouw: want me to merge on sockpuppet as well? [21:11:53] so! will bug people about partman stuff tomorrow [21:12:03] ottomata: I have some experience with it [21:12:07] I can halp if you want [21:12:14] cool! 
[21:12:20] yeah man, if you are still working [21:12:27] try pxebooting analytics1011 [21:12:31] probably will for a bit longer [21:12:31] ok [21:12:44] actually, I don't know what partman you're looking for [21:12:49] so why don't we just talk tomorrow [21:12:59] !log reinstalling virt1004 to see if the raid issue re-appears [21:13:03] notpeter: Please do, yeah [21:13:08] Logged the message, RobH [21:13:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19384 [21:13:09] notpeter: Thanks man [21:13:19] RoanKattouw: no problem [21:16:54] New patchset: Dzahn; "install star SSL cert in role class, variable domain name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19390 [21:17:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19390 [21:17:48] !log Searching for files with wrong ownership on ms7 using [21:17:53] notpeter [21:17:55] sorry [21:17:56] um [21:17:57] Logged the message, Mr. Obvious [21:17:58] analytics-dell.cfg [21:18:06] they are super close to working [21:18:10] just a problem setting up raid [21:18:17] just using the first 2 disks [21:18:21] mirrored raid for root and swap [21:18:26] should be pretty simple [21:18:36] !log ms7# find /export/upload/wik*/*/{0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f,archive,math,temp,timeline} ! -user apache [21:18:45] Logged the message, Mr. Obvious [21:19:08] ottomata: ok, I'll try to take a look today [21:19:12] !log ...piping result into ms7:/root/badownershipfiles [21:19:20] Logged the message, Mr. 
Obvious [21:19:40] thanks notpeter [21:19:43] send me an email with how it goes [21:19:46] otto@wikmedia.org [21:19:57] PROBLEM - SSH on virt1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:20:21] kk [21:21:45] PROBLEM - Host virt1004 is DOWN: PING CRITICAL - Packet loss = 100% [21:25:12] i know nagios, sheesshhhhh [21:25:37] New patchset: Dzahn; "install star SSL cert in role class, variable domain name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19390 [21:26:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19390 [21:27:18] RECOVERY - Host virt1004 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [21:31:24] Hey ops people [21:31:37] If I want to force a specific version of a package in puppet, how do I do that? [21:32:56] Hmm, looks like ensure => "0.6.12~dfsg1-1ubuntu1" should work [21:33:14] Yeah, basically put a version instead of 'latest' in ensure [21:33:41] IIRC it only sucks for >= stuff [21:33:43] https://github.com/evolvingweb/puppet-apt [21:33:58] havent used, but its to use APT pinning with puppet [21:34:08] New patchset: Catrope; "Specifically force version 0.6.12 of nodejs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19394 [21:34:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19394 [21:45:53] PROBLEM - Host virt1004 is DOWN: PING CRITICAL - Packet loss = 100% [21:49:56] ACKNOWLEDGEMENT - Host virt1004 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT-#3261disk fail [21:52:47] New patchset: Catrope; "Install build-essential because g++ and make are needed for npm install" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19397 [21:53:27] Change abandoned: Catrope; "Not actually needed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19394 [21:53:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19397 [21:56:35] hrmm, i cannot tell if this cisco is in a reboot loop [21:56:43] it takes so long by the time it posts i am looking at something else. =P [21:57:08] RECOVERY - Host virt1004 is UP: PING OK - Packet loss = 0%, RTA = 35.34 ms [21:57:30] argh its in a reinstall loop.....whyyyyyyyy [21:58:49] Logged the message, Master [22:00:49] cmjohnson1: then when you are done, you may wanna dig against them all [22:01:01] powerdns has a known bug where it will hang on reload on secondary nameservers on occasion [22:01:16] so just dig @ns0/1/2.wikimedia.org hostname_to_lookup [22:01:46] New patchset: Dzahn; "install star SSL cert in role class, dependency using new arrow syntax" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19390 [22:02:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19390 [22:03:44] New patchset: Bhartshorne; "swift changes for the upgrade to 1.5.0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18264 [22:04:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18264 [22:04:47] PROBLEM - Host virt1004 is DOWN: PING CRITICAL - Packet loss = 100% [22:05:46] New patchset: Bhartshorne; "swift changes for the upgrade to 1.5.0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18264 [22:06:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18264 [22:06:27] if it answers with an ip you are set =] [22:06:57] (but that paste didn't give an ip.) 
[22:06:57] cmjohnson1: labsdb1.mgmt.pmtpa.wmnet is what you dig [22:07:06] right, but he didnt dig the right fqdn [22:07:20] New patchset: Dzahn; "install star SSL cert in role class, dependency using new arrow syntax" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19390 [22:07:24] cmjohnson1: dig @ns0.wikimedia.org labsdb1.mgmt.pmtpa.wmnet [22:07:56] RECOVERY - SSH on virt1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:07:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19390 [22:08:05] RECOVERY - Host virt1004 is UP: PING OK - Packet loss = 0%, RTA = 35.36 ms [22:08:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19390 [22:23:34] New patchset: Hoo man; "idwiki logo changed to locally upload version" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19399 [22:23:39] jeremyb: ^ [22:29:24] PROBLEM - NTP on virt1004 is CRITICAL: NTP CRITICAL: No response from NTP server [22:31:03] New patchset: Pyoungmeister; "fixing broken puppet on hume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19401 [22:31:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19401 [22:33:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19401 [22:40:25] New patchset: Bhartshorne; "swift changes for the upgrade to 1.5.0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18264 [22:41:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18264 [22:51:55] New patchset: Pyoungmeister; "more unbreaking of puppet due to switch to mediawiki module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19404 [22:52:34] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19404 [23:03:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19404 [23:08:53] AaronSchulz: hey, I have a question about how deploy works... [23:08:56] got a sec [23:08:56] ? [23:09:10] * AaronSchulz meows [23:09:37] hey, so, I newly puppetized the l10nupdate user and the mwdeploy user [23:09:49] looking at puppet, it made a couple of changes on fenari [23:10:01] 1. /User[l10nupdate]/shell: shell changed '/bin/bash' to '/bin/false' [23:10:13] 2. [mwdeploy]/User[mwdeploy]/home: home changed '/var/www' to '/var/lib/mwdeploy' [23:10:27] is this going to fuck things up? [23:10:54] bin/false? [23:11:15] i.e.: its default shell is to just close [23:11:19] like, if you su to that user [23:13:07] I can also just change them back [23:13:11] why did it change? [23:13:34] I changed them from users to systemusers. slightly different puppet definitions [23:13:59] as long as /var/lib/mwdeploy is not writable by mwdeploy [23:14:11] you shouldn't give unprivileged users writable home directories [23:14:12] it's part of migration to the new puppet modules that I've been writing [23:14:37] TimStarling: ah, kk. 
thank you [23:17:58] there we go, now it's right, nagios-wm [23:25:30] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [23:26:57] New review: Dzahn; "the channel names were/are still hardcoded in /usr/local/bin/start-nagios-bot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2675 [23:33:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:35:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.370 seconds [23:35:32] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [23:39:27] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [23:39:27] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [23:39:27] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours