[00:00:01] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 198 seconds
[00:25:07] PROBLEM - MySQL Recent Restart on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:25] PROBLEM - MySQL Slave Running on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:25] PROBLEM - MySQL Replication Heartbeat on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:52] PROBLEM - Full LVS Snapshot on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:52] PROBLEM - MySQL Slave Delay on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:10] PROBLEM - mysqld processes on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:28] PROBLEM - MySQL Idle Transactions on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:28] PROBLEM - MySQL disk space on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:27:31] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 1 seconds
[00:27:58] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds
[00:31:06] db1034 is having scheduling issues, upgrading kernel now
[00:35:49] nice
[00:35:57] that reminds me, I have one very nice check_tainted nagios check
[00:36:15] that checks if the kernel is tainted (like OOM kicking in, backtraces and whatnot)
[00:36:21] do you think it would be of value here?
[00:37:26] New patchset: Asher; "re-enabling slow query collection cron, just on pmtpa dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6897
[00:37:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6897
[00:37:52] RECOVERY - Disk space on db1004 is OK: DISK OK
[00:37:58] paravoid: that would be great!
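The check_tainted check paravoid mentions at 00:35 isn't shown in the log; a minimal sketch of a Nagios plugin along those lines, assuming only the standard /proc/sys/kernel/tainted flag file, might look like:

    #!/bin/bash
    # check_tainted (sketch) -- flag a tainted kernel to Nagios.
    # /proc/sys/kernel/tainted is a bitmask; 0 means untainted.
    tainted=$(cat /proc/sys/kernel/tainted)
    if [ "$tainted" -ne 0 ]; then
        echo "TAINTED CRITICAL: kernel taint flags = $tainted"
        exit 2   # Nagios CRITICAL
    fi
    echo "TAINTED OK: kernel not tainted"
    exit 0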
[00:38:46] RECOVERY - MySQL disk space on db1004 is OK: DISK OK
[00:44:45] !log rebooted db1034
[00:44:47] Logged the message, Master
[00:47:01] PROBLEM - Host db1034 is DOWN: PING CRITICAL - Packet loss = 100%
[00:48:22] RECOVERY - MySQL Slave Running on db1034 is OK: OK replication
[00:48:31] RECOVERY - Host db1034 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms
[00:48:49] RECOVERY - MySQL Slave Delay on db1034 is OK: OK replication delay seconds
[00:48:50] RECOVERY - Full LVS Snapshot on db1034 is OK: OK no full LVM snapshot volumes
[00:49:07] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: CRIT replication delay 254 seconds
[00:49:07] RECOVERY - MySQL disk space on db1034 is OK: DISK OK
[00:49:16] RECOVERY - MySQL Idle Transactions on db1034 is OK: OK longest blocking idle transaction sleeps for seconds
[00:49:16] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 262 seconds
[00:49:34] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay seconds
[00:49:34] RECOVERY - MySQL Recent Restart on db1034 is OK: OK seconds since restart
[00:55:43] PROBLEM - Host db1034 is DOWN: PING CRITICAL - Packet loss = 100%
[00:56:55] RECOVERY - Host db1034 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[01:02:01] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[01:02:01] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds
[02:18:30] New patchset: Catrope; "Make l10nupdate script work with submodules correctly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6905
[02:18:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6905
[02:35:07] yo ops folks! did we experiment on april 15th with sending our mobile traffic to https?
[03:44:23] New patchset: Lcarr; "removed random "GG", made "standard footers" section removed old argument parsing code" [operations/software] (master) - https://gerrit.wikimedia.org/r/6906
[03:46:20] New patchset: Lcarr; "minor cleanups" [operations/software] (master) - https://gerrit.wikimedia.org/r/6906
[03:48:44] New review: pugmajere; "(no comment)" [operations/software] (master) C: 1; - https://gerrit.wikimedia.org/r/6906
[03:48:55] New review: Lcarr; "(no comment)" [operations/software] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6906
[03:49:07] New review: Lcarr; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6906
[03:49:09] Change merged: Lcarr; [operations/software] (master) - https://gerrit.wikimedia.org/r/6906
[04:10:59] New patchset: Lcarr; "Switched from tabs to double spaces" [operations/software] (master) - https://gerrit.wikimedia.org/r/6907
[04:12:47] thems fighting words
[04:13:50] New patchset: Lcarr; "Switched from tabs to double spaces" [operations/software] (master) - https://gerrit.wikimedia.org/r/6907
[04:14:03] New review: pugmajere; "Missed a documentation reference here." [operations/software] (master) - https://gerrit.wikimedia.org/r/6907
[04:15:52] New patchset: Lcarr; "Switched from tabs to double spaces" [operations/software] (master) - https://gerrit.wikimedia.org/r/6907
[04:16:25] drdee: there were zero related changes on the 16th and the session leak issues on the 14th (so all traffic was sent from esams to pmtpa). maybe someone else knows something but that's about all i see at a glance. you should look yourself: http://wikitech.wikimedia.org/view/Server_admin_log#April_16
[04:19:43] New review: pugmajere; "(no comment)" [operations/software] (master) C: 0; - https://gerrit.wikimedia.org/r/6907
[04:20:55] New review: Lcarr; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6907
[04:20:57] Change merged: Lcarr; [operations/software] (master) - https://gerrit.wikimedia.org/r/6907
[04:24:26] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-eqiad:xe-5/2/1 (FPL/GBLX, CV71026) [10Gbps wave]BR
[04:28:55] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6463
[04:28:57] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6463
[04:41:08] TimStarling: `echo "sync failed"` should be >&2 ?
[04:41:39] all the other errors go to stdout don't they?
[04:41:59] haven't a clue
[04:56:32] PROBLEM - udp2log processes for emery on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/local/bin/packet-loss, /var/log/squid/filters/india-filter, /usr/local/bin/sqstat, /var/log/squid/filters/latlongCountry-writer,
[04:59:23] RECOVERY - udp2log processes for emery on emery is OK: OK: all filters present
[05:08:59] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[05:10:02] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[05:14:05] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[05:16:34] New patchset: Lcarr; "Fix SLAX syntax output" [operations/software] (master) - https://gerrit.wikimedia.org/r/6909
[05:17:09] New review: Lcarr; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6909
[05:17:13] Change merged: Lcarr; [operations/software] (master) - https://gerrit.wikimedia.org/r/6909
[05:35:14] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time
[05:43:00] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[06:13:09] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[06:38:57] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /var/log/squid/teahouse.log, have not been written to in 24 hours
[06:41:57] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[06:48:41] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[07:30:43] PROBLEM - Host knsq25 is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:10] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active
[08:54:30] PROBLEM - Frontend Squid HTTP on amssq52 is CRITICAL: Connection refused
[08:55:24] PROBLEM - Backend Squid HTTP on amssq52 is CRITICAL: Connection refused
[08:56:00] RECOVERY - Frontend Squid HTTP on amssq52 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 0.219 seconds
[08:56:45] RECOVERY - Backend Squid HTTP on amssq52 is OK: HTTP OK HTTP/1.0 200 OK - 636 bytes in 0.327 seconds
[09:01:33] PROBLEM - Lighttpd HTTP on dataset2 is CRITICAL: Connection refused
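The >&2 question at 04:41 is the usual shell convention: diagnostics go to stderr so stdout stays clean for pipelines. A two-line illustration, with the rsync call standing in for whatever the sync step actually is:

    # send the failure message to stderr, not stdout
    if ! rsync -a /src/ /dst/; then
        echo "sync failed" >&2
        exit 1
    fi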
[09:12:23] RECOVERY - Lighttpd HTTP on dataset2 is OK: HTTP OK HTTP/1.0 200 OK - 5349 bytes in 0.003 seconds
[09:15:32] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:16:44] PROBLEM - Backend Squid HTTP on amssq53 is CRITICAL: Connection refused
[09:17:11] PROBLEM - Frontend Squid HTTP on amssq53 is CRITICAL: Connection refused
[09:29:20] RECOVERY - Backend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.219 seconds
[09:29:47] RECOVERY - Frontend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.219 seconds
[09:54:23] PROBLEM - Backend Squid HTTP on amssq54 is CRITICAL: Connection refused
[09:54:32] PROBLEM - Frontend Squid HTTP on amssq54 is CRITICAL: Connection refused
[09:57:14] RECOVERY - Backend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.218 seconds
[09:57:14] RECOVERY - Frontend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.220 seconds
[10:31:14] PROBLEM - Frontend Squid HTTP on amssq55 is CRITICAL: Connection refused
[10:32:17] PROBLEM - Backend Squid HTTP on amssq55 is CRITICAL: Connection refused
[10:41:10] New patchset: Dzahn; "adding DHCP entries / MAC addresses for analytics1002 to 1010" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6919
[10:41:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6919
[10:52:14] RECOVERY - Frontend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.218 seconds
[10:52:14] RECOVERY - Backend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 636 bytes in 0.327 seconds
[11:03:31] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6905
[11:30:24] is Reedy around?
[11:31:12] PROBLEM - Frontend Squid HTTP on amssq56 is CRITICAL: Connection refused
[11:31:21] PROBLEM - Backend Squid HTTP on amssq56 is CRITICAL: Connection refused
[11:38:15] RECOVERY - Frontend Squid HTTP on amssq56 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.219 seconds
[11:38:24] RECOVERY - Backend Squid HTTP on amssq56 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.218 seconds
[11:59:56] !log merging CSS fix for broken mobile site table layout
[12:00:01] Logged the message, Master
[12:01:54] wait, we can merge those in non ops branches? I thought we didn't have rights
[12:03:11] it says it merged
[12:03:16] huh
[12:03:25] apergos: would you know about the next step to push it out? never done mobile
[12:03:29] no
[12:03:35] I don't have a clue
[12:03:56] I haven't touched mobile since the move from the old gateway
[12:04:01] maybe before that
[12:04:21] it's just an extension i think
[12:04:22] well the fix was pretty obvious, do not "display: none" in CSS on table
[12:05:04] so just merged it for demon who couldnt right now and it breaks all tables
[12:05:08] sure
[12:05:15] * Katie_WMDE not sure how deployment works though
[12:08:45] talking in -mobile channel ...
[12:30:25] Katie_WMDE: so, BZ says a deployment is planned for today anyways. and reopened the ticket that was called resolved just for having the fix in gerrit, unmerged. afraid that's all i have for now then.
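The Frontend/Backend Squid HTTP checks that keep flapping through this stretch are plain HTTP liveness probes; a hedged curl equivalent (hostname and port are placeholders, not the real check_http invocation):

    # exit 2 (CRITICAL) if the squid frontend refuses or stalls
    if curl -sf --max-time 10 -o /dev/null http://amssq52:80/; then
        echo "HTTP OK"
    else
        echo "HTTP CRITICAL: connection refused or timed out"
        exit 2
    fi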
[12:30:50] don't know when today
[12:30:59] ok
[12:35:19] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100%
[12:36:49] !log updated mwlib to 0.13.7
[12:36:52] Logged the message, Master
[12:37:34] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 108.83 ms
[12:39:10] !log updated mwlib to 0.13.7
[12:39:13] Logged the message, Master
[12:40:08] !log updated mwlib to 0.13.7
[12:40:11] Logged the message, Master
[12:40:43] PROBLEM - Frontend Squid HTTP on amssq57 is CRITICAL: Connection refused
[12:41:55] PROBLEM - Backend Squid HTTP on amssq57 is CRITICAL: Connection refused
[13:03:21] RECOVERY - Backend Squid HTTP on amssq57 is OK: HTTP OK HTTP/1.0 200 OK - 636 bytes in 0.327 seconds
[13:03:39] RECOVERY - Frontend Squid HTTP on amssq57 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.218 seconds
[13:19:32] New patchset: Ottomata; "templates/udp2log/filters.oxygen.erb - adding more Wikipedia Zero filters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6923
[13:19:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6923
[13:32:04] New review: Diederik; "Wrong filename for saudi filter." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/6923
[13:35:29] Change abandoned: Diederik; "Outdated." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4127
[13:48:42] New patchset: Ottomata; "templates/udp2log/filters.oxygen.erb - adding more Wikipedia Zero filters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6923
[13:49:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6923
[13:49:58] New review: Diederik; "Is ready to be merged." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6923
[13:55:39] yo opsies
[13:55:46] can someone check and merge that for me?
[13:56:25] Probably not the most polite way to ask.
[13:59:45] hiya Damianz
[13:59:47] <^demon> ottomata: If it's not critical, it's best to add someone to review it and someone will review it when they get a chance.
[13:59:49] sorry if so!
[13:59:56] didn't mean it that way
[14:00:07] i think if you could hear me say it you wouldn't think so
[14:00:41] PROBLEM - Backend Squid HTTP on amssq58 is CRITICAL: Connection refused
[14:00:51] opsies is a term of endearment!
[14:01:08] PROBLEM - Frontend Squid HTTP on amssq58 is CRITICAL: Connection refused
[14:01:34] ^demon, i can and will do that, but that is opposite of what everyone else has told me to do so far :p
[14:02:35] <^demon> Don't know who's told you what.
[14:02:52] <^demon> I'm just saying what makes most sense...
[14:16:26] RECOVERY - Backend Squid HTTP on amssq58 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.219 seconds
[14:16:35] RECOVERY - Frontend Squid HTTP on amssq58 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.219 seconds
[14:16:37] New patchset: Pyoungmeister; "giving ayush shell on stat1 per rt 2888" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6930
[14:16:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6930
[14:18:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6930
[14:18:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6930
[14:29:56] cmjohnson1: are you sure about the raid card in srv217 being the same?
[14:30:05] that seems rather unlikely... srv217 as an app server is a very different box
[14:36:53] mark: yes..same model number
[14:37:05] you looked at the actual card?
[14:37:13] yes
[14:37:20] well then I support that idea
[14:37:24] srv217 sucks
[14:37:26] both sas1 and sas2
[14:37:41] might as well demolish it for parts
[14:37:50] i'm not even sure it's still in warranty
[14:38:16] How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?
[14:38:23] it is not...ended in Feb 2012
[14:38:33] then srv217 is never gonna be fixed anyway
[14:39:49] then I will get w/ robh about a decom ticket for it
[14:39:53] ok
[15:16:38] !log shutting down storage3 to replace raid card
[15:16:41] Logged the message, Master
[15:20:52] No one is scripting reboots right now are they?
[15:25:43] not fully automatic, no
[15:26:11] but rebooting squids every once in a while
[15:43:52] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[15:56:51] New patchset: Jgreen; "adding sudoers rules for datacenter techs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6933
[15:57:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6933
[15:57:46] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6933
[15:57:49] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6933
[16:01:17] jeff_green: when you get a chance look at storage3. I see the raid cfg.
[16:01:50] I can look now--did you restore the config?
[16:01:53] PROBLEM - Backend Squid HTTP on amssq59 is CRITICAL: Connection refused
[16:02:11] PROBLEM - Frontend Squid HTTP on amssq59 is CRITICAL: Connection refused
[16:02:40] i installed the new card and did not make any changes...I went into raid bios and see the cfg....before I do anything else I want to see if you can access the drives
[16:03:27] so I should go into the RAID bios?
[16:04:09] yes, I don't know how you had it configured...if I don't do the same way it could screw everything up
[16:04:25] unfortunately the configuration long predates me
[16:04:44] RECOVERY - Backend Squid HTTP on amssq59 is OK: HTTP OK HTTP/1.0 200 OK - 636 bytes in 0.436 seconds
[16:05:02] RECOVERY - Frontend Squid HTTP on amssq59 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.220 seconds
[16:05:29] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[16:18:23] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[16:21:43] cmjohnson1: when I hop onto com2, I'm at a dhcpfail pxeboot(?) page, any reason I should not power cycle?
[16:22:03] no...go ahead
[16:23:22] jeff_green: the boot process went like normal here...curious to see if it works.
[16:23:56] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[16:23:57] ok
[16:24:21] * pgehres sends happy thoughts storage3's way
[16:24:25] hahaha
[16:25:33] if happy thoughts don't work, Plan B is "percussive maintenance"
[16:25:58] e.g. kick the living shit out of it
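The "rebooting squids every once in a while" workflow from 15:26 is semi-manual; a cautious sketch of that kind of rolling loop, with hostnames purely illustrative, could be:

    # one squid at a time: reboot, then wait for it to answer pings again
    for host in amssq52 amssq53 amssq54; do
        ssh "$host" sudo reboot
        sleep 60
        until ping -c1 -W2 "$host" >/dev/null 2>&1; do sleep 10; done
    done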
[16:26:19] prolly could ask rob about the raid config
[16:26:29] it's likely to be in his brain cells someplace
[16:27:24] hrmf, it popped into grub before i noticed and stopped it
[16:31:17] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[16:31:21] cmjohnson1: ok i'm in the RAID bios finally
[16:32:14] does it appear to be "normal"
[16:32:29] too early to say
[16:32:35] I'm not familiar with this controller
[16:32:42] looking at raid properties
[16:33:12] me neither. but it is not in a foreign state, it sees all drives...at first glance looks ok
[16:33:32] it has only two drives?
[16:34:48] cmjohnson1: I think we should have robh take a look, I can make guesses but he's more likely to know how it would have been built
[16:35:40] ok..sounds good to me...are you able to access the file system?
[16:36:10] I don't know anything
[16:36:26] I see a controller that appears to have only two physical (sata) drives, and that's surprising to start
[16:40:25] storage3 has two disks for os, then the disk shelf for data
[16:40:50] Jeff_Green: so you see the chassis disks, but the disk md1200 shelf has 12 additional disks
[16:40:58] RobH: i just rebooted
[16:41:17] RobH: you want to look, you know the hardware
[16:41:41] there are only a few things to check, that the raid adapter sees the old config during post and boot
[16:41:48] and that the OS sees the hardware when its booted
[16:42:32] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms
[16:42:38] LeslieCarr: I just finished wiring serial to all the new row C switches. I need to run the fibers for the row ends to the opposing cr1/2
[16:42:49] what was the port for asw-c8-eqiad on cr1?
[16:43:17] Jeff_Green: i can take a glance
[16:43:37] so its rebooting now, as ping is up but ssh isnt
[16:43:56] it came up and it's trying to pxeinstall
[16:44:11] i'll log off the console, you'll want to power cycle
[16:44:21] I'm out
[16:45:20] awesome RobH :)
[16:46:25] LeslieCarr: so what ports should i use, i know you told me this once
[16:46:31] i cannot find my notebook
[16:46:39] let me look that up ..
[16:46:46] so once i have those two fibers run, row C is ready for full network setup
[16:46:57] 4 fibers then eh
[16:47:00] the power strips need more network cable for mgmt connection, but i took care of them
[16:47:03] two fibers left
[16:47:10] i ran the c1 to a1 and the c8 to a8
[16:47:19] ok
[16:47:30] also need to migrate the last two fibers not in raceway
[16:47:35] one is the wave, the other transit
[16:47:53] both i would like to move today if possible
[16:47:55] c1 goes to xe-5/0/2 on cr1 and c8 goes to xe-5/1/2 on cr1
[16:48:02] and the same with cr2 (only the second ports)
[16:48:33] are those ports shutdown?
[16:48:52] yep
[16:55:26] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:43] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[17:01:55] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[17:02:49] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[17:05:40] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[17:14:04] !log asw-c1-eqiad connected to both cr1 and cr2
[17:14:08] Logged the message, RobH
[17:18:25] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time
[17:23:13] PROBLEM - NTP on storage3 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:25:08] LeslieCarr: if you've got a minute, can you do me a huge favor and flush the mobile varnish cache?
[17:25:29] didn't preilly tell you that you're supposed to say how much more awesome i am than asher first? ;)
[17:25:41] LeslieCarr: i figured it was implied :D
[17:25:45] !log flushed the mobile cache
[17:25:48] haha
[17:25:48] but, you are SO much more awesome than asher.
[17:25:48] Logged the message, Mistress of the network gear.
[17:26:01] \o/
[17:26:02] ty
[17:31:38] LeslieCarr: Ok, I have the fibers run, want to work on the bad linecard?
[17:32:33] RobH: mind doing it in about 5 minutes ?
[17:32:47] yea i will take a short break
[17:32:50] back in 5
[17:35:09] heads up, I'm about to try a 1.2GB upload from hume following Roan's instructions.
[17:35:36] * ^demon goes down into the nagios-proof bunker
[17:36:05] heh
[17:36:30] cmjohnson1: Ok, we are going to discuss this in here
[17:36:41] storage3 is in the OS installer, so it didnt boot properly from the OS disks
[17:36:44] What exactly did you do?
[17:37:14] oh, you pulled the wrong card i bet.
[17:37:23] cmjohnson1: Did you replace the SAS controller, or the raid controller for the MD enclosure?
[17:37:28] the sas controller was fine.
[17:38:03] the RAID controller is what needs swapping, and some random srv isnt going to have a spare raid controller
[17:38:06] so not sure what you guys are doing.
[17:38:23] cmjohnson1: srv217 wouldnt have a disk array, or the raid controller.
[17:39:22] i did swap the sas controller and not the controller card on the riser
[17:42:16] RobH: are you ready ….. to rumble? :)
[17:42:44] !log switching all masterships over to cr2-eqiad in preparation to reseat cr1 linecard
[17:42:47] Logged the message, Mistress of the network gear.
[17:43:03] robh: so we are goin to have to order that part
[17:44:37] RoanKattouw, you were right .. from hume it's completely painless :)
[17:45:03] cmjohnson1: order a raid controller?
[17:45:13] cmjohnson1: Did dell confirm that it means the controller is dead?
[17:45:37] yes, we walked through steps yesterday..including moving the controller to the other riser
[17:45:46] I am not sure why mark suggested you change out the SAS controller, I expect he didn't understand what error you were having.
[17:45:50] but still receive the same error
[17:46:03] what about with updating the firmware on the raid controller, same error?
[17:46:04] i confused him with my confusion
[17:46:46] we didn't update firmware
[17:48:19] I would suggest attempting that before we call the system bad
[17:48:37] the complexity comes from keeping the data intact
[17:49:45] can we just have dell send a new RAID controller?
[17:49:55] how much can they cost, a couple $hundred?
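The mobile cache flush at 17:25 most likely amounts to a site-wide ban on the mobile Varnish instances; a sketch assuming Varnish 3's ban.url CLI command (the admin address and secret file here are guesses, not the real setup):

    # ban every cached URL on this varnish instance
    varnishadm -T localhost:6082 -S /etc/varnish/secret "ban.url ."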
[17:52:01] PROBLEM - BGP status on cr1-eqiad is CRITICAL: CRITICAL: Requested table is empty or does not exist,
[17:52:22] also, wtb crash kit for every flavor of hardware for which data loss is an issue
[17:53:58] PROBLEM - LVS HTTPS on upload-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:54:08] LeslieCarr: ....
[17:54:15] LeslieCarr: I think we're having network issues
[17:54:23] looking
[17:54:24] inside of the network I can get to gerrit
[17:54:29] outside of the network we can't
[17:54:36] there were some peering outage emails
[17:54:38] <^demon> Same with jenkins I think.
[17:54:40] maybe related?
[17:54:57] Do we not have Nagios alerts for gerrit?
[17:55:10] RECOVERY - LVS HTTPS on upload-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 597 bytes in 0.111 seconds
[17:55:11] Gerrit works here
[17:55:14] <^demon> gerrit is in watchmouse now :)
[17:55:18] weird
[17:55:18] Aha OK
[17:55:25] only port 443 isn't working
[17:55:28] <^demon> So is jenkins.
[17:55:32] <^demon> 443 wfm.
[17:55:48] it's working for me now too
[17:56:00] i think it was bgp reconvergence
[17:56:13] * Ryan_Lane nods
[17:56:13] <^demon> RoanKattouw: We should get it in nagios though, yes.
[17:56:27] Ryan_Lane: do you know the status of the netapp deployment?
[17:56:34] halfway done
[17:56:48] eta?
[17:56:54] whenever someone does it?
[17:57:10] ok, now for the riskier part RobH … are you ready ?
[17:57:19] "the riskier part"?
[17:57:21] <^demon> Riskier part?
[17:57:27] You already took down a service and now there's a riskier part?
[17:57:29] <^demon> Haha, guess now ;-)
[17:57:31] :p
[17:57:43] Hmm I guess the "riskier part" was breaking Rob's internet
[17:57:45] we have to reseat the linecard and possibly replace it
[17:57:56] lets not break my shit
[17:58:04] i just lost about 4 items in progress
[17:58:13] i get to reenter about 2 dozen seials.
[17:58:15] serial.
[17:58:19] * Damianz thinks we could just put RobH and LeslieCarr in their own vlan and watch the fight
[17:58:23] haha
[17:58:26] mark: quick question or are you done for the day?
[17:58:32] which means im not gonna do it
[17:58:43] LeslieCarr: now all the serials for racktables and new switches are your problem.
[17:59:01] paravoid: he's at dinner
[17:59:49] dude, when doing network maintenances like taking down 1 of the 2 main routers, sometimes these things happen
[18:00:07] ;]
[18:00:23] LeslieCarr: So am I standing by for something for you i imagine?
[18:00:30] LeslieCarr: You need some of that 100% updating guaranteed magic :D
[18:00:32] im on my mifi, advise when i can swap back
[18:00:35] now, comes the fun part!
[18:00:46] s/updating/uptime/
[18:00:55] <^demon> LeslieCarr: You replaced riskier with fun.
[18:01:00] haha
[18:01:31] RobH: I am going to power off fpc5 on cr1-eqiad and when you see the lights stop blinking, can you pull it out and reseat it ?
[18:02:23] RobH: are you ready ?
[18:02:44] ok, lemme get screwdriver
[18:03:34] RobH: if this is unsuccessful, then we need to swap the fpc
[18:03:35] ok
[18:03:38] LeslieCarr: ready
[18:03:56] !log powering off fpc5 on cr1-eqiad in order for RobH to physically reseat the card
[18:03:59] Logged the message, Mistress of the network gear.
[18:04:31] LeslieCarr: reseated
[18:05:14] RobH: awesome, onlining the fpc ….
[18:05:50] !log powering on fpc 5 on cr1-eqiad
[18:05:53] Logged the message, Mistress of the network gear.
[18:05:58] it's still powering on :)
[18:06:21] i also have the box with the replacement
[18:06:24] well, i assume its the replacement
[18:06:34] 'Mistress of the network gear.' makes me think of a BDSM room covered in Juniper gear.
[18:06:38] is this for the packet loss on peering?
[18:07:06] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-eqiad:xe-5/2/1 (FPL/GBLX, CV71026) [10Gbps wave]BR
[18:07:07] i assume its not, since the same issue was on cr1 and cr2
[18:07:09] yes, however, it showed any large ping sourced from an interface to have some packet loss
[18:07:20] however, it seems like this is probably a software issue ....
[18:07:21] or did cr2 not have this issue?
[18:07:30] but we need to cover the bases before we can escalate :(
[18:07:56] cuz if cr2 and cr1 have the same issue, swapping a card on cr1 aint gonna do shit
[18:08:06] (so did cr2 have same problem?)
[18:08:25] yes, and so do our mx 80's
[18:08:30] i think it's a trio chipset problem
[18:08:36] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[18:08:39] (fpc5 still booting up...)
[18:08:40] oh, and the new card doesnt have that chipset?
[18:08:46] oh the new card does
[18:08:59] just that when you go through support they have a list of things you can do ...
[18:09:03] so all the shit we are doing now is just checkboxing
[18:09:06] ok.
[18:09:08] yes
[18:09:25] card is up, ping tests running now
[18:09:26] so want me to install this new card in the same slot that 5 was in or in a slot below it?
[18:09:35] when this fails to fix it that is ;]
[18:10:27] yep in the same slot would be best, yep, this has the exact same pattern
[18:10:39] every 48 to 53 packets
[18:11:15] so swap it in existing slot and reinsert all the fibers in the same port.
[18:11:16] let me offline this card again and then can we replace ?
[18:11:17] yep
[18:11:21] also before you pull it
[18:11:22] hehe you beat me to it
[18:11:25] copy down the label port #s
[18:11:30] incase i misorder them
[18:11:32] ok
[18:11:33] one sec
[18:12:16] http://pastebin.com/Fn7AEJ3z
[18:13:44] AaronSchulz: http://wikitech.wikimedia.org/view/Swift/Deploy_Plan_-_Shard_Smaller_Wikis
[18:14:08] ok, fine to pull and swap now right?
[18:14:27] !log turned off fpc5 on cr1-eqiad to swap
[18:14:27] yep
[18:14:28] AaronSchulz: I don't think that's any different from what you described earlier.
[18:14:30] Logged the message, Mistress of the network gear.
[18:15:30] PROBLEM - Host cr1-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.196)
[18:15:39] yes nagios-wm it is down
[18:15:43] and you are all alone nagios-wm
[18:15:54] AaronSchulz: when you return. please poke me
[18:16:23] I enjoy demoralizing nonsentient bots way too much
[18:16:23] !log updating md1000 controller card firmware on storage3
[18:16:26] Logged the message, Master
[18:17:09] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-eqiad:xe-5/2/1 (FPL/GBLX, CV71026) [10Gbps wave]BR
[18:17:21] maplebed: looks similar, as long as it's the same wikis from my last email
[18:17:29] $wmfSwiftBigWikis
[18:17:37] yeah, I took that list from your mail.
[18:17:39] I didn't verify it.
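The "every 48 to 53 packets" diagnosis at 18:10 comes from streaming large pings and watching where the icmp_seq gaps fall; a sketch of that kind of test (the destination address is a placeholder):

    # print sequence-number gaps so a periodic drop pattern stands out
    ping -c 300 -s 1400 192.0.2.1 | awk -F'icmp_seq=' '/icmp_seq=/ {
        split($2, a, " "); seq = a[1] + 0
        if (seen && seq != prev + 1) print "gap: " prev " -> " seq
        prev = seq; seen = 1 }'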
[18:17:55] LeslieCarr, does this mean that we won't be able to stick a "No bots have been hurt for this product" label on our websites?
[18:18:20] Nemo_bis: nope, we'll be dissed on by PETB
[18:20:09] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[18:20:38] LeslieCarr, 13 activists have been put in jail recently in Italy because they liberated a couple dozens of dogs used for tests. They had a list of criminal charges as long as the gerrit bugs list
[18:20:44] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 0; - https://gerrit.wikimedia.org/r/6897
[18:20:58] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6897
[18:21:00] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6897
[18:21:06] :( what happened to the dogs ?
[18:21:13] They get merged?
[18:21:15] :D
[18:21:16] LeslieCarr, dunno, run away
[18:21:31] mebbe they merged each other in a happy reproduction party
[18:21:49] hehehe
[18:22:10] AaronSchulz: when do you want to do the sharding?
[18:22:26] (we should put it on the deployment calendar!)
[18:22:27] ;)
[18:23:14] I can make the switch right before you do, so I guess we need to agree on a time
[18:23:17] * apergos lurks
[18:23:33] apergos: http://wikitech.wikimedia.org/view/Swift/Deploy_Plan_-_Shard_Smaller_Wikis
[18:23:38] * AaronSchulz pushes apergos back in the caves
[18:23:39] LeslieCarr: ok, new one is all plugged in
[18:24:18] * apergos growls and moves out of the way of the sharp sticks
[18:24:41] thanks RobH, powering on now
[18:25:10] if you do a show log messages or show log chassisd you can see 5 million pieces of information about the powering on
[18:25:12] robla: do you have any advice regarding when AaronSchulz and I should push live a change to move more wikis to sharded containers?
[18:25:24] it takes about 5 to 7 minutes
[18:25:35] LeslieCarr: cool, lemme know when ya dont need me anymore and i will then take lunch
[18:25:48] also lemme know if i can switch off mifi and back to normal stuff
[18:26:28] ok, will do on both accounts, i'll just ping test the links after it comes up ….
[18:27:36] maplebed: whenever there's a free spot here: http://wikitech.wikimedia.org/view/Software_deployments seems fine to me
[18:27:51] robla: perfect. that was what I was looking for. thanks.
[18:28:33] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[18:28:49] LeslieCarr: So will we be swapping the cards back or should I pack up the old card for return?
[18:29:06] pack up the old one for return
[18:29:15] do you need me to arrange a ups pickup or do you have that ?
[18:29:30] the vendor isnt paying for it?
[18:29:50] they're paying, just have to arrange ups to come pick it up
[18:29:57] unless they stop by eqiad constantly
[18:30:56] So they did not include any kind of return tag. I assume that if you call they just dispatch UPS with the label?
[18:31:53] they didn't ?
[18:32:00] usually it's in the outside little plastic thingy
[18:32:06] a shipping invoice and a folded label
[18:32:09] RECOVERY - Host cr1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 31.36 ms
[18:32:15] no label in this
[18:32:17] just the packing slip
[18:32:21] so they need to send that stuff over
[18:32:32] usually they include it, and I can schedule a pickup with the label #
[18:32:41] old card is sealed up ready to go.
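The sharding threshold discussed here (wikis whose containers hold more than 25k objects, per the patchset later in the log) can be eyeballed with the python-swiftclient CLI; the container name and the auth environment variables it relies on are assumptions:

    # count objects in a wiki's container to see if it crosses the 25k bar
    swift stat wikipedia-de-local-public | awk '/Objects:/ {print $2}'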
[18:32:59] grrr juniper
[18:33:53] sigh, exact same pattern
[18:34:05] i mean not sigh in that it confirms my hypothesis but sigh in that it was correct
[18:34:06] so you wanna contact them about the lack of return tag?
[18:34:32] Is the network back to where the mgmt network is going to work again? (can i swap back off my mifi)
[18:35:29] and if you are done with me for a bit, going to go get lunch
[18:35:33] lemme know.
[18:35:34] let me double check
[18:35:47] * cmjohnson1 stepping out for a few to grab some food
[18:37:06] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[18:37:16] !log reenabled services on fpc5 of cr1-eqiad
[18:37:20] Logged the message, Mistress of the network gear.
[18:38:09] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 10, down: 0, shutdown: 0
[18:38:42] heh, seems wasnt ready to swap back yet
[18:38:46] nothing on wmfwiki =[
[18:38:49] wmf_wifi even
[18:41:01] LeslieCarr: yay
[18:41:25] yay good :)
[18:41:33] preilly: RoanKattouw: looks like there's an overlap Thursday 10am: http://wikitech.wikimedia.org/view/Software_deployments
[18:41:42] LeslieCarr: So you going to be around in about two hours?
[18:41:53] i would like to move the cr2 wave and cr2 transit into the raceway today if we can
[18:42:04] they are the last items needing to move for me to do a proper cable cleanup
[18:42:07] robla: we talked about it and we should be okay
[18:42:16] RobH: ok we can do that
[18:42:24] robla: the zero stuff is just a configuration change to varnish no code
[18:43:01] okee doke
[18:44:41] LeslieCarr: found the label, it fell out in the hallway
[18:44:47] went down to shipping, they had it there
[18:44:52] sorry about that ;_;
[18:45:31] yay
[18:56:54] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563*
[19:01:06] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575*
[19:05:18] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575*
[19:11:39] LeslieCarr since you're like a bagillion times awesomer than asher, would you mind flushing the varnish cache again?
[19:11:48] hahaha
[19:11:51] poor asher :)
[19:12:06] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563*
[19:12:07] it's ok he can't help your awesomeness
[19:12:19] !log flushed mobile varnish cache
[19:12:20] done
[19:12:21] New patchset: Asher; "run ptqd in 10m increments on prod dbs, add safeguards" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6952
[19:12:22] Logged the message, Mistress of the network gear.
[19:12:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6952
[19:13:27] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6952
[19:13:29] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6952
[19:15:53] LeslieCarr: you should ask the mobile dudes to fix their deployment process / cdn usage every time they ask for a cache flush. site reliability awesome points are way awesomer than dev points.
[19:16:14] hehehe
[19:16:18] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[19:18:03] RobHalsell: what's the largest capacity disk we can get from Dell for an R310?
[19:19:32] and have it housed in more than one location? ;)
[19:20:21] LeslieCarr: make it so
[19:21:56] notpeter: has ayush access to stat1?
[19:22:54] drdee: FYI, nagios has been complaining about no puppet runs on stat1
[19:23:19] drdee: also did you see what i said the other day?
[19:23:39] 08 04:16:24 < jeremyb> drdee: there were zero related changes on the 16th and the session leak issues on the 14th (so all traffic was sent from esams to pmtpa). maybe someone else knows something but that's about all i see at a glance. you should look yourself: http://wikitech.wikimedia.org/view/Server_admin_log#April_16
[19:25:45] RECOVERY - Puppet freshness on db1004 is OK: puppet ran at Tue May 8 19:25:39 UTC 2012
[19:27:33] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563*
[19:31:54] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2613*
[19:35:55] LeslieCarr: very sorry to ask again but… mobile varnish cache flush? i had not fully deployed changes :(
[19:36:16] jeremyb: thanks, i hadn't seen your reply
[19:37:10] about puppet, who can have a look at that?
[19:37:27] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2213
[19:37:38] binasher: does have a very good point that you guys should make a very concerted effort to not need to flush the cache whenever changes happen
[19:37:40] awjr: it'd be good to test deployments before requesting a cache flush
[19:37:41] and, it is flushed
[19:39:25] binasher: how would you test whether or not you forgot to deploy something if there's a chance you're seeing the old version because of the cache? (or maybe ya'll have a way to bypass the cache layer?
[19:39:30] )
[19:39:39] anyway, +1 to making flushes unnecessary
[19:41:13] jeremyb: would it be feasible to script tests behind the proxy layer, i.e. from fenari?
[19:42:10] Jeff_Green: i can only guess. but you might end up needing to write new tests per deploy. (you're not just testing does it work, but also, "are all of the changes live or not")
[19:42:41] maybe you could write a test that loads deploy-specific config
[19:43:03] roll to the deploy host which is out of lvs, run against it, then roll sitewide
[19:43:31] oh, you guys are taking backends out of rotation and then returning them?
[19:44:04] i think we have a host permanently out of lvs
[19:44:14] in sas is 600, in nearline sas is 3tb, in sata is 2tb
[19:44:16] sec, i can tell you more
[19:44:26] ok
[19:44:34] RobHalsell: sweet! i put in an RT ticket for buying some :-)
[19:45:17] jeremyb: last time I did this kind of testing it was srv193
[19:45:32] that's testwiki. i'm 98% sure
[19:45:42] (as in test.wp.o)
[19:45:44] is that good or bad?
[19:46:00] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2725*
[19:46:05] Jeff_Green: buying some hdd's or some servers?
[19:46:16] i just don't know what it has to do with mobile. (as in maybe it does, i have no idea)
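The "Puppet freshness" alerts recurring through the log (stat1, db1004, storage3) check how long it has been since the last agent run; a local sketch assuming the stock last_run_summary.yaml state file, with the 36000s threshold matching the "10 hours" in the alert text:

    # CRITICAL if puppet has not run in the last 10 hours (36000s)
    last=$(awk '/last_run:/ {print $2}' /var/lib/puppet/state/last_run_summary.yaml)
    age=$(( $(date +%s) - last ))
    if [ "$age" -gt 36000 ]; then
        echo "CRITICAL: puppet has not run in ${age}s"; exit 2
    fi
    echo "OK: last puppet run ${age}s ago"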
[19:46:24] RobHalsell: 2 addl hdds each for aluminium/grosley
[19:46:30] jeremyb: me either
[19:46:37] oh, we wouldnt order from dell
[19:46:46] we would just order some on newegg
[19:47:06] RobHalsell: ah, I didn't know if we needed trays or whatever. I'm happy with newegg or whatever you prefer
[19:47:23] i just want some very quick storage for all the log data that's on storage3, assuming we get it back
[19:48:46] LeslieCarr: so the ups tag isnt a return label, so we would have to pay for the ups pickup, and blah blah
[19:48:52] so i am just gonna take it with me and drop off to ups store
[19:48:58] ok
[19:49:03] seriously not a return label ?
[19:49:07] sigh
[19:49:15] they suck
[19:49:29] they didnt pay ups for the pickup
[19:52:00] hi guys
[19:52:05] having puppet trouble on stat1
[19:52:07] Could not find class misc::statistics::mediwiki
[19:52:11] but the class exists afaik
[19:52:17] its in my local updated prod
[19:52:22] and it used to work
[19:53:58] 08 15:43:52 <+nagios-wm> PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[19:57:02] New patchset: Bhartshorne; "introducing sharding to all wikis with more than 25k objects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6954
[19:57:15] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575*
[19:57:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6954
[19:59:12] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[19:59:59] ottomata: I don't see where that is being pulled into the config for stat1
[20:00:25] i see only misc::statistics::site in site.pp
[20:01:32] line 39?
[20:01:41] statistics.pp
[20:01:41] ?
[20:01:51] oh
[20:01:52] roles
[20:01:53] sorry
[20:01:57] roles/statistics.pp
[20:02:04] class role::statistics
[20:02:22] role/statistics.pp *
[20:03:56] i see
[20:05:31] ottomata: ok if i run puppet etc on stat1?
[20:05:39] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400
[20:06:55] yes
[20:12:13] ottomata: that is indeed odd
[20:13:01] do you see it on the puppetmaster copy of puppet repo?
[20:13:11] checking
[20:14:54] fyi opsen - AaronSchulz and I are starting to deploy sharded containers for 'dewiki','fiwiki', 'frwiki', 'hewiki', 'huwiki', 'idwiki', 'itwiki', 'jawiki', 'rowiki', 'ruwiki', 'thwiki', 'trwiki', 'ukwiki', 'zhwiki'
[20:15:01] ottomata: i do see it there yeah
[20:17:39] hmm
[20:17:48] are you allowed to restart puppet master?
[20:17:59] i don't think that's the problem
[20:18:20] i think it's a subtle config error
[20:20:02] hm
[20:20:06] on stat1 or elsewhere?
[20:20:35] in the manifests
[20:21:38] this is incredibly unlikely, but I wonder if puppet could be stupid about manifest filenames
[20:21:59] this is the only case I see where we have the same manifest filename in different directories
[20:28:48] Jeff_Green: swift.pp and role/swift.pp both exist.
[20:28:57] ah
[20:29:03] (and work ok)
[20:29:07] yeah that seems implausible anyway
[20:29:24] jeremyb: it isn't difficult to ensure that you aren't hitting the frontend cache, from adding a random query string, to explicitly purging a page via mediawiki
[20:30:05] i barely know how mobiles works
[20:30:09] mobile*
[20:30:14] e.g. does it use bits?
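binasher's first suggestion at 20:29, adding a random query string so the frontend cache can't serve a stale copy, is a one-liner; the URL and the string being grepped for are placeholders:

    # a unique query string forces a cache miss on URL-keyed frontend caches
    curl -s "http://en.m.wikipedia.org/?nocache=$RANDOM$RANDOM" | grep -c 'my-deployed-change'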
[20:31:11] oh ha
[20:31:16] misc::statistics::mediwiki,
[20:31:22] anyway, i think the solution is not better testing to be sure deploy is complete before flushing
[20:31:24] typo. missing an "a"
[20:31:35] it's just not doing flushing at all ;)
[20:32:00] definitely
[20:32:44] mobile does use bits
[20:34:00] New patchset: Jgreen; "fixed typo mediwiki-->mediawiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6988
[20:34:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6988
[20:34:41] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6988
[20:34:43] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6988
[20:35:27] * AaronSchulz wonders what's with dbbot-wm
[20:36:37] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Tue May 8 20:36:31 UTC 2012
[20:36:41] ottomata: fixed
[20:39:43] ottomata: stat1 needs a dist-upgrade and reboot or it's likely to fall afoul of the surprise 209-day reboot kernel bug
[20:40:22] well, we're talking to mark about upgrading it to precise anyway
[20:40:25] which will be a full install
[20:40:27] but thank you!
[20:40:38] will keep that in mind
[20:41:03] k
[20:41:33] you've 37 days :-P tick tock tick tock!
[20:44:56] AaronSchulz: and then there were none
[20:50:27] aww no more @replag chorus
[20:52:07] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6954
[20:55:39] New patchset: Jgreen; "removing temporary civicrm database from hume, removed unused fundraising entries from deprecated host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6991
[20:55:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6991
[20:58:15] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6954
[20:58:17] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6954
[20:59:07] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 0; - https://gerrit.wikimedia.org/r/6991
[20:59:25] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6991
[20:59:27] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6991
[21:10:57] !log shutting down mysql across all eqiad core db slaves
[21:11:00] Logged the message, Master
[21:15:20] PROBLEM - mysqld processes on db1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:16:31] PROBLEM - mysqld processes on db1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:17:16] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: CRIT replication delay 205 seconds
[21:17:43] PROBLEM - mysqld processes on db1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:17:43] PROBLEM - mysqld processes on db1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:18:01] PROBLEM - mysqld processes on db1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:18:15] its going to get noisy in here.. i really need to get the eqiad dbs into their own nagios service group
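The stat1 mystery from 19:52-20:36 turned out to be a one-character class-name typo; the quickest way to surface that kind of mismatch is to grep the manifests for both spellings side by side (the output lines here are illustrative, not verbatim):

    grep -rn 'misc::statistics::medi' manifests/
    # role/statistics.pp:  include misc::statistics::mediwiki   <- the typo
    # statistics.pp:       class misc::statistics::mediawiki {  <- the class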
[21:18:38] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: CRIT replication delay 285 seconds
[21:18:46] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 231 seconds
[21:18:55] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 239 seconds
[21:19:04] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 248 seconds
[21:19:13] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 258 seconds
[21:19:13] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 75, down: 4, dormant: 0, excluded: 0, unused: 0BRae3: down - BRxe-0/0/3: down - BRae2: down - BRae1: down - BR
[21:19:22] PROBLEM - mysqld processes on db1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:19:43] you can ignore the router interfaces down message
[21:20:07] PROBLEM - mysqld processes on db1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:20:16] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay seconds
[21:22:10] PROBLEM - mysqld processes on db1019 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:22:10] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay seconds
[21:22:19] PROBLEM - mysqld processes on db1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:23:04] PROBLEM - mysqld processes on db1024 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:23:16] * binasher is load testing nagios-wm
[21:23:49] PROBLEM - mysqld processes on db1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:24:07] PROBLEM - mysqld processes on db1033 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:24:07] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay seconds
[21:24:52] PROBLEM - mysqld processes on db1034 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:25:46] PROBLEM - mysqld processes on db1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:26:22] PROBLEM - mysqld processes on db1038 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:26:22] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay seconds
[21:27:07] PROBLEM - mysqld processes on db1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:27:52] PROBLEM - mysqld processes on db1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:28:37] PROBLEM - mysqld processes on db1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:29:49] PROBLEM - mysqld processes on db1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:32:46] !log rebooting eqiad core db slaves for kernel upgrade
[21:32:48] Logged the message, Master
[21:33:16] PROBLEM - Host db1004 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:16] PROBLEM - Host db1002 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:16] PROBLEM - Host db1006 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:16] PROBLEM - Host db1007 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:16] PROBLEM - Host db1018 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:16] PROBLEM - Host db1005 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:17] PROBLEM - Host db1017 is DOWN: PING CRITICAL - Packet loss = 100%
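The "mysqld processes" alerts flooding in here are process-count checks; the stock Nagios check_procs plugin expresses exactly that (the plugin path and thresholds are assumed, not taken from the production config):

    # CRITICAL unless at least one process named mysqld is running
    /usr/lib/nagios/plugins/check_procs -c 1: -C mysqld
    # -> "PROCS CRITICAL: 0 processes with command name 'mysqld'" while shut down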
[21:33:25] PROBLEM - Host db1024 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:25] PROBLEM - Host db1034 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:25] PROBLEM - Host db1033 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:25] PROBLEM - Host db1040 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:25] PROBLEM - Host db1039 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:25] PROBLEM - Host db1041 is DOWN: PING CRITICAL - Packet loss = 100%
[21:34:25] we have a db bot ?
[21:34:46] PROBLEM - Host db1019 is DOWN: PING CRITICAL - Packet loss = 100%
[21:34:55] PROBLEM - Host db1035 is DOWN: PING CRITICAL - Packet loss = 100%
[21:34:55] PROBLEM - Host db1021 is DOWN: PING CRITICAL - Packet loss = 100%
[21:34:55] PROBLEM - Host db1038 is DOWN: PING CRITICAL - Packet loss = 100%
[21:35:04] PROBLEM - Host db1043 is DOWN: PING CRITICAL - Packet loss = 100%
[21:35:09] binasher: if that's noisy, what about those problems with esams some time ago? ;)
[21:35:22] RECOVERY - Host db1004 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms
[21:35:31] PROBLEM - MySQL Idle Transactions on db1022 is CRITICAL: Connection refused by host
[21:35:31] PROBLEM - MySQL Slave Running on db1022 is CRITICAL: Connection refused by host
[21:35:31] RECOVERY - Host db1006 is UP: PING OK - Packet loss = 0%, RTA = 27.02 ms
[21:35:40] RECOVERY - Host db1002 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[21:35:49] PROBLEM - MySQL Recent Restart on db1022 is CRITICAL: Connection refused by host
[21:35:49] RECOVERY - Host db1034 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms
[21:35:49] RECOVERY - Host db1019 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms
[21:35:58] RECOVERY - Host db1043 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms
[21:35:58] RECOVERY - Host db1021 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms
[21:35:58] RECOVERY - Host db1040 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms
[21:35:58] RECOVERY - Host db1017 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms
[21:35:58] RECOVERY - Host db1038 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms
[21:35:58] RECOVERY - Host db1024 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms
[21:35:59] RECOVERY - Host db1039 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms
[21:37:01] RECOVERY - Host db1041 is UP: PING OK - Packet loss = 0%, RTA = 26.83 ms
[21:37:01] RECOVERY - Host db1018 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[21:37:01] RECOVERY - Host db1035 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms
[21:37:10] RECOVERY - MySQL Idle Transactions on db1022 is OK: OK longest blocking idle transaction sleeps for seconds
[21:37:10] RECOVERY - MySQL Slave Running on db1022 is OK: OK replication
[21:37:10] RECOVERY - mysqld processes on db1007 is OK: PROCS OK: 1 process with command name mysqld
[21:37:10] RECOVERY - MySQL Recent Restart on db1022 is OK: OK seconds since restart
[21:37:10] RECOVERY - Host db1007 is UP: PING OK - Packet loss = 0%, RTA = 26.98 ms
[21:37:19] RECOVERY - Host db1005 is UP: PING OK - Packet loss = 0%, RTA = 26.68 ms
[21:37:25] LeslieCarr: @info servername, @info clustername, @info dbname (db45, s5, dewiki)
[21:37:46] T3rminat0r: this is intentional though
[21:38:21] @replag all
[21:38:23] jeremyb: [s1] db38: 0s, db36: 0s, db32: 0s, db59: 0s, db60: 0s, db12: 0s; [s2] db52: 0s, db53: 0s, db54: 0s, db57: 0s; [s3] db39: 0s, db34: 0s, db25: 0s, db11: 0s
[21:38:24] jeremyb: [s4] db31: 0s, db22: 0s, db33: 0s, db51: 0s; [s5] db35: 0s, db45: 0s, db44: 1s, db55: 0s; [s6] db43: 0s, db47: 0s, db50: 0s; [s7] db37: 0s, db56: 0s, db58: 0s, db26: 0s
[21:38:59] this is all based on mediawiki api calls, so shows the app's view of the dbs
[21:39:21] hrm, one of the eqiad dbs isn't coming back up
[21:40:01] binasher: mostly, not entirely. it also fetches db.php from noc
[21:40:28] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 1574 seconds
[21:41:00] the important stuff
[21:41:22] RECOVERY - mysqld processes on db1005 is OK: PROCS OK: 1 process with command name mysqld
[21:41:40] RECOVERY - mysqld processes on db1004 is OK: PROCS OK: 1 process with command name mysqld
[21:41:46] so replag should show mw's cached picture of replag / where it's sending slave queries, vs. necessarily real time checks
[21:41:49] RECOVERY - mysqld processes on db1017 is OK: PROCS OK: 1 process with command name mysqld
[21:42:07] RECOVERY - mysqld processes on db1019 is OK: PROCS OK: 1 process with command name mysqld
[21:42:16] RECOVERY - mysqld processes on db1021 is OK: PROCS OK: 1 process with command name mysqld
[21:42:16] RECOVERY - mysqld processes on db1002 is OK: PROCS OK: 1 process with command name mysqld
[21:42:25] binasher: How does MW have a cached picture of replag?
[21:42:25] RECOVERY - mysqld processes on db1006 is OK: PROCS OK: 1 process with command name mysqld
[21:42:43] RECOVERY - mysqld processes on db1018 is OK: PROCS OK: 1 process with command name mysqld
[21:42:52] RECOVERY - Host db1033 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[21:43:01] RECOVERY - mysqld processes on db1033 is OK: PROCS OK: 1 process with command name mysqld
[21:43:01] RECOVERY - mysqld processes on db1024 is OK: PROCS OK: 1 process with command name mysqld
[21:43:19] RECOVERY - mysqld processes on db1034 is OK: PROCS OK: 1 process with command name mysqld
[21:43:37] RECOVERY - mysqld processes on db1038 is OK: PROCS OK: 1 process with command name mysqld
[21:43:37] RECOVERY - mysqld processes on db1022 is OK: PROCS OK: 1 process with command name mysqld
[21:43:40] New review: Demon; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6997
[21:43:46] RECOVERY - mysqld processes on db1040 is OK: PROCS OK: 1 process with command name mysqld
[21:43:55] RECOVERY - mysqld processes on db1043 is OK: PROCS OK: 1 process with command name mysqld
[21:44:13] RECOVERY - mysqld processes on db1039 is OK: PROCS OK: 1 process with command name mysqld
[21:44:13] RECOVERY - mysqld processes on db1041 is OK: PROCS OK: 1 process with command name mysqld
[21:44:13] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: CRIT replication delay 1983 seconds
[21:44:19] Change abandoned: Demon; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/6997
[21:44:22] PROBLEM - MySQL Replication Heartbeat on db1004 is CRITICAL: CRIT replication delay 1958 seconds
[21:44:22] RECOVERY - mysqld processes on db1035 is OK: PROCS OK: 1 process with command name mysqld
[21:44:28] RoanKattouw: lag data is stuffed in memcached
[21:44:36] with a low ttl
[21:44:40] PROBLEM - MySQL Slave Delay on db1006 is CRITICAL: CRIT replication delay 772 seconds
[21:44:41] Really? How low is the ttl
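The pattern binasher describes, stuffing lag data into memcached with a low TTL so readers get a cheap, slightly stale value, can be sketched end to end; the host names, the key, and the memcached address are all assumptions, and the 5-second TTL matches the figure given just below:

    # measure slave lag, then cache it for 5 seconds via the memcached text protocol
    lag=$(mysql -h db1002 -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print $2}')
    printf 'set db1002:lag 0 5 %s\r\n%s\r\n' "${#lag}" "$lag" | nc memcached-host 11211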
How low is the ttl [21:45:06] Cause I've never seen the API's replag numbers be stuck for any noticeable length of time [21:45:07] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: CRIT replication delay 1926 seconds [21:45:24] i'd have to check the code again but it might just be 30 sec or so [21:45:29] !log adding adminbot to the repo [21:45:32] Logged the message, Master [21:45:34] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 1516 seconds [21:45:34] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 1520 seconds [21:45:34] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 1517 seconds [21:45:34] PROBLEM - MySQL Slave Delay on db1017 is CRITICAL: CRIT replication delay 1514 seconds [21:45:39] It would have to be much much lower [21:45:43] PROBLEM - MySQL Replication Heartbeat on db1006 is CRITICAL: CRIT replication delay 305 seconds [21:45:43] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: CRIT replication delay 1503 seconds [21:45:43] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 1502 seconds [21:45:52] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: CRIT replication delay 1535 seconds [21:45:52] PROBLEM - MySQL Slave Delay on db1002 is CRITICAL: CRIT replication delay 1784 seconds [21:46:10] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 1659 seconds [21:46:28] PROBLEM - MySQL Slave Delay on db1022 is CRITICAL: CRIT replication delay 1070 seconds [21:46:28] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 1597 seconds [21:46:28] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 1694 seconds [21:46:28] PROBLEM - MySQL Replication Heartbeat on db1039 is CRITICAL: CRIT replication delay 1057 seconds [21:46:28] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: CRIT replication delay 1066 seconds [21:46:28] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 1694 seconds [21:46:29] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: CRIT replication delay 1375 seconds [21:46:46] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 1604 seconds [21:46:46] PROBLEM - MySQL Slave Delay on db1041 is CRITICAL: CRIT replication delay 824 seconds [21:46:55] PROBLEM - MySQL Replication Heartbeat on db1034 is CRITICAL: CRIT replication delay 803 seconds [21:46:55] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: CRIT replication delay 958 seconds [21:46:55] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: CRIT replication delay 1701 seconds [21:47:04] PROBLEM - MySQL Slave Delay on db1034 is CRITICAL: CRIT replication delay 762 seconds [21:47:04] RECOVERY - MySQL Replication Heartbeat on db1006 is OK: OK replication delay 0 seconds [21:47:04] PROBLEM - MySQL Slave Delay on db1039 is CRITICAL: CRIT replication delay 1002 seconds [21:47:13] PROBLEM - MySQL Slave Delay on db1040 is CRITICAL: CRIT replication delay 867 seconds [21:47:22] PROBLEM - MySQL Replication Heartbeat on db1041 is CRITICAL: CRIT replication delay 692 seconds [21:47:31] RECOVERY - MySQL Slave Delay on db1006 is OK: OK replication delay 0 seconds [21:47:48] RoanKattouw: it's actually just 5sec [21:47:49] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 1666 seconds [21:47:49] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 1494 seconds [21:48:03] so the api should be very up to date
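The per-server lag being discussed here is public: MediaWiki's siteinfo API module reports every slave's lag as the application sees it, which is also what bots like the one above poll. A quick Python 3 sketch of the call (the wiki URL and user agent are just examples):

    # Ask MediaWiki for its view of slave lag, per server (sketch).
    import json
    import urllib.request

    URL = ("https://en.wikipedia.org/w/api.php?action=query"
           "&meta=siteinfo&siprop=dbrepllag&sishowalldb=1&format=json")

    req = urllib.request.Request(URL, headers={"User-Agent": "replag-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    # Each entry carries a slave's hostname and its lag in seconds.
    for server in data["query"]["dbrepllag"]:
        print("%s: %ss" % (server["host"], server["lag"]))
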
[21:48:07] PROBLEM - MySQL Slave Delay on db1024 is CRITICAL: CRIT replication delay 907 seconds [21:48:25] PROBLEM - MySQL Slave Delay on db1021 is CRITICAL: CRIT replication delay 1322 seconds [21:48:25] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 1475 seconds [21:48:34] RECOVERY - MySQL Replication Heartbeat on db1004 is OK: OK replication delay 0 seconds [21:48:50] Aaah I see it now [21:48:52] PROBLEM - MySQL Slave Delay on db1005 is CRITICAL: CRIT replication delay 1688 seconds [21:48:57] LoadMonitor.php [21:49:30] the random recache logic in there is kinda neat.. so should always be < 5sec [21:49:38] Yeah [21:49:41] Just read that code
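The idea being admired: each lag value sits in memcached with a roughly 5-second TTL, and readers occasionally volunteer to refresh an entry before it expires, so expiry never sends a stampede of probes at the databases. A rough sketch of that probabilistic early-recache pattern, using a hypothetical fetch_lag() and a memcached-style client; this is not the actual LoadMonitor.php code:

    import random
    import time

    TTL = 5  # seconds, per the discussion above

    def get_lag(cache, server, fetch_lag):
        """Return the cached lag for server, recaching early at random.

        cache is any memcached-style client (get/set); fetch_lag is a
        hypothetical callable that measures real lag on the server.
        """
        entry = cache.get("lag:%s" % server)
        if entry is not None:
            lag, stored_at = entry
            age = time.time() - stored_at
            # Most readers take the cached value; as the entry ages, a
            # growing fraction volunteers to refresh it ahead of expiry.
            if age < TTL and random.random() > age / TTL:
                return lag
        lag = fetch_lag(server)
        cache.set("lag:%s" % server, (lag, time.time()), TTL)
        return lag

With a 5-second ceiling on staleness, the API's numbers track real lag closely, which matches what was observed above.
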
[21:49:46] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay -1 seconds [21:49:46] RECOVERY - MySQL Slave Delay on db1034 is OK: OK replication delay 0 seconds [21:49:55] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay 0 seconds [21:50:04] RECOVERY - MySQL Slave Delay on db1040 is OK: OK replication delay 0 seconds [21:50:40] RECOVERY - MySQL Slave Delay on db1022 is OK: OK replication delay 1 seconds [21:50:58] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay 0 seconds [21:50:58] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds [21:50:58] RECOVERY - MySQL Slave Delay on db1024 is OK: OK replication delay 0 seconds [21:50:58] RECOVERY - MySQL Slave Delay on db1041 is OK: OK replication delay 0 seconds [21:51:43] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay seconds [21:51:43] RECOVERY - MySQL Replication Heartbeat on db1041 is OK: OK replication delay 0 seconds [21:52:01] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 1377 seconds [21:52:01] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay seconds [21:52:55] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 1 seconds [21:52:55] RECOVERY - MySQL Slave Delay on db1002 is OK: OK replication delay 0 seconds [21:53:01] !log rebooting db1018 one more time [21:53:04] Logged the message, Master [21:53:49] RECOVERY - MySQL Replication Heartbeat on db1039 is OK: OK replication delay 0 seconds [21:54:07] RECOVERY - MySQL Slave Delay on db1039 is OK: OK replication delay 0 seconds [21:54:52] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 1120 seconds [21:55:01] PROBLEM - Host db1018 is DOWN: PING CRITICAL - Packet loss = 100% [21:55:19] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay 0 seconds [21:55:28] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 1 seconds [21:56:49] RECOVERY - MySQL Slave Delay on db1021 is OK: OK replication delay 0 seconds [21:56:49] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 1 seconds [21:57:07] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay -1 seconds [21:57:25] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [21:57:43] RECOVERY - Host db1018 is UP: PING OK - Packet loss = 0%, RTA = 26.86 ms [21:59:31] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 5 seconds [21:59:40] RECOVERY - MySQL Slave Delay on db1017 is OK: OK replication delay 0 seconds [21:59:49] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 1 seconds [21:59:58] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 0 seconds [22:00:16] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [22:00:16] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [22:00:27] haha, know what fixed the problem i was up last night trying to figure out ? disabling some of the debug logging on the box ;) [22:01:46] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 1229 seconds [22:01:55] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 1213 seconds [22:02:49] RECOVERY - MySQL Slave Delay on db1005 is OK: OK replication delay 0 seconds [22:03:25] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 0 seconds [22:03:25] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [22:03:43] RECOVERY - MySQL Replication Heartbeat on db1005 is OK: OK replication delay 0 seconds [22:03:43] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay 0 seconds [22:04:01] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [22:12:23] New patchset: Asher; "test server support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7005 [22:12:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7005 [22:13:01] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [22:13:09] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7005 [22:13:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7005 [22:13:19] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [22:23:24] New patchset: Asher; "don't cache for test.m" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7007 [22:23:43] New patchset: Reedy; "Updated config files per current usage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7008 [22:23:43] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7007 [22:23:58] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7007 [22:24:01] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7007 [22:26:16] New patchset: Reedy; "Updated config files per current usage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7008 [22:27:02] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7008 [22:27:03] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7008 [22:30:48] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds [22:31:42] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [22:38:05] New patchset: Reedy; "Start a .gitignore" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7010 [22:39:04] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7010 [22:39:06] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7010 [22:46:06] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 184 seconds [22:47:09] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 181 seconds [22:51:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [23:27:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:29:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.130 seconds [23:33:16] binasher or LeslieCarr: can i get one more mobile varnish cache flush? this time we tested first, i swear. this should be it for the day [23:33:33] awjr: i'll get it [23:33:57] instead of beer i'll make you rebuild my fw maker package next release i do :) [23:34:21] LeslieCarr: hahaha [23:34:35] !log purged varnish mobile cache [23:34:37] what could possibly go wrong [23:34:38] Logged the message, Mistress of the network gear.
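For reference, a flush like the one just logged need not always be a full wipe: if the VCL honours the PURGE method from trusted addresses, individual objects can be evicted over HTTP. A hedged Python sketch of a per-URL purge (hostnames and path are illustrative, not the procedure actually used here):

    # Evict one URL from a Varnish frontend over HTTP (sketch).
    # Assumes a VCL rule allowing PURGE from trusted hosts.
    import http.client

    def purge(cache_host, site, path):
        conn = http.client.HTTPConnection(cache_host, 80, timeout=5)
        try:
            conn.request("PURGE", path, headers={"Host": site})
            # Typically 200 if the object was cached, 404 otherwise,
            # depending on how the VCL answers the purge.
            return conn.getresponse().status
        finally:
            conn.close()

    print(purge("cache-frontend.example.org", "en.m.wikipedia.org", "/"))
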
[23:34:42] oh you think i'm kidding? ;) [23:34:48] i hate rebuilding the package [23:35:10] you'd think with all this tech i could just give it a freaking python script, and have it ask me where it should go and if i want anything special and do it like magic [23:35:11] but nope [23:35:44] fortunately, i have black magic at my disposal. [23:36:06] are you saying all it would have taken me was sacrificing a goat ? [23:36:11] cuz i could do that for an easy package build [23:36:14] and maybe a virgin [23:42:34] Reedy prepare your anus [23:43:07] * RoanKattouw ChanServ op #wikimedia-operations RoanKattouw [23:43:11] ::sigh:: [23:43:25] RoanKattouw: you noob [23:43:44] * blackman sodomizes maplebed [23:44:38] Ryan_Lane: ^ kick please? [23:44:51] ... [23:45:06] !ops blackman [23:45:11] let him finish the alphabet though :-] [23:45:18] guess we don't have asm here ;) [23:45:34] blackman: omg ur so clver! [23:45:35] Stuck record much? [23:45:39] cmon Ryan_Lane, keep blackman down [23:45:40] * Ryan_Lane yawns [23:45:44] finally [23:45:54] welcome back [23:45:56] learn to ban asshole [23:46:00] hahahaha [23:46:01] ahahaha [23:46:05] hashar: let's have a picnic [23:46:06] * AaronSchulz saw that coming [23:46:06] hahahahahaha [23:46:10] me too [23:46:14] blackman: there is like a whole openspace making fun of you right now :-))) [23:46:14] I was too lazy to ban [23:46:17] blackman: thanks for the laugh [23:46:17] thanks for the laugh! [23:46:27] i hope blackman doesn't figure out that anyone can edit wikipedia [23:46:30] we'd be so fucked [23:46:34] Ryan_Lane: you should be ashamed of yourself as an op [23:46:34] hahahaha [23:46:39] * hashar laugh [23:46:45] blackman: or I can just continue not giving a fuck ;) [23:46:47] 9_9 [23:46:56] I wish we could edit our users [23:46:57] there's Thehelpfulone, collector of flags [23:47:09] blackman: exactly 0 fucks are given this day [23:47:27] finally a channel with some cajones [23:47:38] blackman: now derp, I'm in most channels. [23:47:52] blackman: T middle last in W [23:47:55] blackman: VISA [23:47:57] 445 [23:48:07] visa is accepted everywhere [23:48:10] 0479 [23:48:31] blackman: it isn't [23:48:42] Reedy: where is it not? [23:48:50] (everywhere i want to be, anyway) [23:49:04] blackman: how's shellmix these days? [23:49:08] blackman: getraenke hoffmann [23:49:12] blackman: apparently there are bigger trolls than you around here :-) [23:49:13] on an unrelated note: http://en.wikipedia.org/wiki/Help:Starting_editing [23:49:18] preilly: lots of disposable IPs [23:49:29] Stop wasting the IPv4 addresses [23:49:44] we need them for labs! :P [23:49:48] Reedy: but i'm trying to use them all so we can push more people to ipv6 [23:50:21] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [23:50:30] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds [23:50:34] * binasher gots more ipv4 than a horse gots hair [23:51:50] i feel like nagios-wm is by far the biggest troll in this room. [23:51:56] just spamming uncontrollably [23:51:59] all hours of the day [23:52:03] christ. [23:52:21] sodomizes you all [23:52:30] hawt [23:52:42] bot fetish? [23:53:08] * Ryan_Lane yawns [23:53:12] and kick [23:53:15] good laugh for a bit [23:53:15] you have to do both [23:53:22] though +b is probably muting him at the same time [23:53:26] hashar: yeah. I've been getting a laugh from him :) [23:53:27] it's a conspiracy. nagios-wm and blackman [23:53:43] yeah we should kick nagios-wm too [23:53:46] heh [23:53:48] it is getting annoying [23:53:54] i've been arguing that for ages [23:54:03] Problem - blackman is CRITICAL [23:54:13] why do *i* care what's broken? *i'm* not ops. [23:54:14] actually, we have kicked it out of #wikimedia-tech [23:54:21] so people could use -tech to chat ;-D [23:54:22] Thank God. [23:54:22] you can always just not join this channel :) [23:54:33] sadly, we need nagios-wm [23:54:49] I would love to see it shunted to its own channel though... [23:54:50] true. most of my harassment of you involves labs, anyway. [23:54:52] you can't quit me [23:54:53] so that we can talk while shit's blowing up.
[23:55:07] yeah, that's a good idea maplebed [23:55:17] #wikimedia-nagios [23:55:23] I think we should make it PM us all the notifications [23:55:23] not a big fan tbh :) [23:55:25] aka, #wikimedia-yelling-into-the-void [23:55:46] but I could probably cope by telling irssi to print it into the same window [23:55:47] I want everyone's growl notifications to stop them from working [23:55:48] hi kids [23:55:58] ohai [23:56:04] rapey-nagios-wm: how's it goin? [23:56:13] CRITCAL CRITICAL NO LUBEclear [23:56:16] rapey-nagios-wm: you feeling a certain may today> [23:56:19] have you seen my van? [23:56:22] *way [23:56:25] Does it say FBI on it? [23:56:26] it's full of candy [23:56:37] does it have free candy spray painted on it? [23:56:40] cause that makes it totally legit [23:57:01] what was i trying to work on today.. oh well