[00:00:01] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 198 seconds
[00:25:07] PROBLEM - MySQL Recent Restart on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:25] PROBLEM - MySQL Slave Running on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:25] PROBLEM - MySQL Replication Heartbeat on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:52] PROBLEM - Full LVS Snapshot on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:52] PROBLEM - MySQL Slave Delay on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:10] PROBLEM - mysqld processes on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:28] PROBLEM - MySQL Idle Transactions on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:28] PROBLEM - MySQL disk space on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:27:31] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 1 seconds
[00:27:58] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds
[00:31:06] db1034 is having scheduling issues, upgrading kernel now
[00:35:49] nice
[00:35:57] that reminds me, I have one very nice check_tainted nagios check
[00:36:15] that checks if the kernel is tainted (like OOM kicking in, backtraces and whatnot)
[00:36:21] do you think it would be of value here?
[00:37:26] New patchset: Asher; "re-enabling slow query collection cron, just on pmtpa dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6897
[00:37:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6897
[00:37:52] RECOVERY - Disk space on db1004 is OK: DISK OK
[00:37:58] paravoid: that would be great!
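The check_tainted check paravoid mentions at 00:35 isn't shown in the log; a minimal sketch of a Nagios plugin along those lines, assuming only the standard /proc/sys/kernel/tainted flag file, might look like:

    #!/bin/bash
    # check_tainted (sketch) -- flag a tainted kernel to Nagios.
    # /proc/sys/kernel/tainted is a bitmask; 0 means untainted.
    tainted=$(cat /proc/sys/kernel/tainted)
    if [ "$tainted" -ne 0 ]; then
        echo "TAINTED CRITICAL: kernel taint flags = $tainted"
        exit 2   # Nagios CRITICAL
    fi
    echo "TAINTED OK: kernel not tainted"
    exit 0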
[00:38:46] RECOVERY - MySQL disk space on db1004 is OK: DISK OK
[00:44:45] !log rebooted db1034
[00:44:47] Logged the message, Master
[00:47:01] PROBLEM - Host db1034 is DOWN: PING CRITICAL - Packet loss = 100%
[00:48:22] RECOVERY - MySQL Slave Running on db1034 is OK: OK replication
[00:48:31] RECOVERY - Host db1034 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms
[00:48:49] RECOVERY - MySQL Slave Delay on db1034 is OK: OK replication delay seconds
[00:48:50] RECOVERY - Full LVS Snapshot on db1034 is OK: OK no full LVM snapshot volumes
[00:49:07] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: CRIT replication delay 254 seconds
[00:49:07] RECOVERY - MySQL disk space on db1034 is OK: DISK OK
[00:49:16] RECOVERY - MySQL Idle Transactions on db1034 is OK: OK longest blocking idle transaction sleeps for seconds
[00:49:16] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 262 seconds
[00:49:34] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay seconds
[00:49:34] RECOVERY - MySQL Recent Restart on db1034 is OK: OK seconds since restart
[00:55:43] PROBLEM - Host db1034 is DOWN: PING CRITICAL - Packet loss = 100%
[00:56:55] RECOVERY - Host db1034 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[01:02:01] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[01:02:01] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds
[02:18:30] New patchset: Catrope; "Make l10nupdate script work with submodules correctly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6905
[02:18:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6905
[02:35:07] yo ops folks! did we experiment on april 15th with sending our mobile traffic to https?
[03:44:23] New patchset: Lcarr; "removed random "GG", made "standard footers" section removed old argument parsing code" [operations/software] (master) - https://gerrit.wikimedia.org/r/6906
[03:46:20] New patchset: Lcarr; "minor cleanups" [operations/software] (master) - https://gerrit.wikimedia.org/r/6906
[03:48:44] New review: pugmajere; "(no comment)" [operations/software] (master) C: 1; - https://gerrit.wikimedia.org/r/6906
[03:48:55] New review: Lcarr; "(no comment)" [operations/software] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6906
[03:49:07] New review: Lcarr; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6906
[03:49:09] Change merged: Lcarr; [operations/software] (master) - https://gerrit.wikimedia.org/r/6906
[04:10:59] New patchset: Lcarr; "Switched from tabs to double spaces" [operations/software] (master) - https://gerrit.wikimedia.org/r/6907
[04:12:47] thems fighting words
[04:13:50] New patchset: Lcarr; "Switched from tabs to double spaces" [operations/software] (master) - https://gerrit.wikimedia.org/r/6907
[04:14:03] New review: pugmajere; "Missed a documentation reference here." [operations/software] (master) - https://gerrit.wikimedia.org/r/6907
[04:15:52] New patchset: Lcarr; "Switched from tabs to double spaces" [operations/software] (master) - https://gerrit.wikimedia.org/r/6907
[04:16:25] drdee: there were zero related changes on the 16th and the session leak issues on the 14th (so all traffic was sent from esams to pmtpa). maybe someone else knows something but that's about all i see at a glance. you should look yourself: http://wikitech.wikimedia.org/view/Server_admin_log#April_16
[04:19:43] New review: pugmajere; "(no comment)" [operations/software] (master) C: 0; - https://gerrit.wikimedia.org/r/6907
[04:20:55] New review: Lcarr; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6907
[04:20:57] Change merged: Lcarr; [operations/software] (master) - https://gerrit.wikimedia.org/r/6907
[04:24:26] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-eqiad:xe-5/2/1 (FPL/GBLX, CV71026) [10Gbps wave]BR
[04:28:55] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6463
[04:28:57] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6463
[04:41:08] TimStarling: `echo "sync failed"` should be >&2 ?
[04:41:39] all the other errors go to stdout don't they?
[04:41:59] haven't a clue
[04:56:32] PROBLEM - udp2log processes for emery on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/local/bin/packet-loss, /var/log/squid/filters/india-filter, /usr/local/bin/sqstat, /var/log/squid/filters/latlongCountry-writer,
[04:59:23] RECOVERY - udp2log processes for emery on emery is OK: OK: all filters present
[05:08:59] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[05:10:02] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[05:14:05] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[05:16:34] New patchset: Lcarr; "Fix SLAX syntax output" [operations/software] (master) - https://gerrit.wikimedia.org/r/6909
[05:17:09] New review: Lcarr; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6909
[05:17:13] Change merged: Lcarr; [operations/software] (master) - https://gerrit.wikimedia.org/r/6909
[05:35:14] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time
[05:43:00] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[06:13:09] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[06:38:57] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /var/log/squid/teahouse.log, have not been written to in 24 hours
[06:41:57] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[06:48:41] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[07:30:43] PROBLEM - Host knsq25 is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:10] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active
[08:54:30] PROBLEM - Frontend Squid HTTP on amssq52 is CRITICAL: Connection refused
[08:55:24] PROBLEM - Backend Squid HTTP on amssq52 is CRITICAL: Connection refused
[08:56:00] RECOVERY - Frontend Squid HTTP on amssq52 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 0.219 seconds
[08:56:45] RECOVERY - Backend Squid HTTP on amssq52 is OK: HTTP OK HTTP/1.0 200 OK - 636 bytes in 0.327 seconds
[09:01:33] PROBLEM - Lighttpd HTTP on dataset2 is CRITICAL: Connection refused
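The >&2 question at 04:41 is the usual shell convention: diagnostics go to stderr so stdout stays clean for pipelines. A two-line illustration, with the rsync call standing in for whatever the sync step actually is:

    # send the failure message to stderr, not stdout
    if ! rsync -a /src/ /dst/; then
        echo "sync failed" >&2
        exit 1
    fi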
[09:12:23] RECOVERY - Lighttpd HTTP on dataset2 is OK: HTTP OK HTTP/1.0 200 OK - 5349 bytes in 0.003 seconds
[09:15:32] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:16:44] PROBLEM - Backend Squid HTTP on amssq53 is CRITICAL: Connection refused
[09:17:11] PROBLEM - Frontend Squid HTTP on amssq53 is CRITICAL: Connection refused
[09:29:20] RECOVERY - Backend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.219 seconds
[09:29:47] RECOVERY - Frontend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.219 seconds
[09:54:23] PROBLEM - Backend Squid HTTP on amssq54 is CRITICAL: Connection refused
[09:54:32] PROBLEM - Frontend Squid HTTP on amssq54 is CRITICAL: Connection refused
[09:57:14] RECOVERY - Backend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.218 seconds
[09:57:14] RECOVERY - Frontend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.220 seconds
[10:31:14] PROBLEM - Frontend Squid HTTP on amssq55 is CRITICAL: Connection refused
[10:32:17] PROBLEM - Backend Squid HTTP on amssq55 is CRITICAL: Connection refused
[10:41:10] New patchset: Dzahn; "adding DHCP entries / MAC addresses for analytics1002 to 1010" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6919
[10:41:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6919
[10:52:14] RECOVERY - Frontend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.218 seconds
[10:52:14] RECOVERY - Backend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 636 bytes in 0.327 seconds
[11:03:31] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6905
[11:30:24] is Reedy around?
[11:31:12] PROBLEM - Frontend Squid HTTP on amssq56 is CRITICAL: Connection refused
[11:31:21] PROBLEM - Backend Squid HTTP on amssq56 is CRITICAL: Connection refused
[11:38:15] RECOVERY - Frontend Squid HTTP on amssq56 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.219 seconds
[11:38:24] RECOVERY - Backend Squid HTTP on amssq56 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.218 seconds
[11:59:56] !log merging CSS fix for broken mobile site table layout
[12:00:01] Logged the message, Master
[12:01:54] wait, we can merge those in non ops branches? I thought we didn't have rights
[12:03:11] it says it merged
[12:03:16] huh
[12:03:25] apergos: would you know about the next step to push it out? never done mobile
[12:03:29] no
[12:03:35] I don't have a clue
[12:03:56] I haven't touched mobile since the move from the old gateway
[12:04:01] maybe before that
[12:04:21] it's just an extension i think
[12:04:22] well the fix was pretty obvious, do not "display: none" in CSS on table
[12:05:04] so just merged it for demon who couldnt right now and it breaks all tables
[12:05:08] sure
[12:05:15] * Katie_WMDE not sure how deployment works though
[12:08:45] talking in -mobile channel ...
[12:30:25] Katie_WMDE: so, BZ says a deployment is planned for today anyways. and reopened the ticket that was called resolved just for having the fix in gerrit, unmerged. afraid that's all i have for now then.
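The Frontend/Backend Squid HTTP checks that keep flapping through this stretch are plain HTTP liveness probes; a hedged curl equivalent (hostname and port are placeholders, not the real check_http invocation):

    # exit 2 (CRITICAL) if the squid frontend refuses or stalls
    if curl -sf --max-time 10 -o /dev/null http://amssq52:80/; then
        echo "HTTP OK"
    else
        echo "HTTP CRITICAL: connection refused or timed out"
        exit 2
    fi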
[12:30:50] don't know when today
[12:30:59] ok
[12:35:19] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100%
[12:36:49] !log updated mwlib to 0.13.7
[12:36:52] Logged the message, Master
[12:37:34] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 108.83 ms
[12:39:10] !log updated mwlib to 0.13.7
[12:39:13] Logged the message, Master
[12:40:08] !log updated mwlib to 0.13.7
[12:40:11] Logged the message, Master
[12:40:43] PROBLEM - Frontend Squid HTTP on amssq57 is CRITICAL: Connection refused
[12:41:55] PROBLEM - Backend Squid HTTP on amssq57 is CRITICAL: Connection refused
[13:03:21] RECOVERY - Backend Squid HTTP on amssq57 is OK: HTTP OK HTTP/1.0 200 OK - 636 bytes in 0.327 seconds
[13:03:39] RECOVERY - Frontend Squid HTTP on amssq57 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.218 seconds
[13:19:32] New patchset: Ottomata; "templates/udp2log/filters.oxygen.erb - adding more Wikipedia Zero filters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6923
[13:19:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6923
[13:32:04] New review: Diederik; "Wrong filename for saudi filter." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/6923
[13:35:29] Change abandoned: Diederik; "Outdated." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4127
[13:48:42] New patchset: Ottomata; "templates/udp2log/filters.oxygen.erb - adding more Wikipedia Zero filters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6923
[13:49:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6923
[13:49:58] New review: Diederik; "Is ready to be merged." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6923
[13:55:39] yo opsies
[13:55:46] can someone check and merge that for me?
[13:56:25] Probably not the most polite way to ask.
[13:59:45] hiya Damianz
[13:59:47] <^demon> ottomata: If it's not critical, it's best to add someone to review it and someone will review it when they get a chance.
[13:59:49] sorry if so!
[13:59:56] didn't mean it that way
[14:00:07] i think if you could hear me say it you wouldn't think so
[14:00:41] PROBLEM - Backend Squid HTTP on amssq58 is CRITICAL: Connection refused
[14:00:51] opsies is a term of endearment!
[14:01:08] PROBLEM - Frontend Squid HTTP on amssq58 is CRITICAL: Connection refused
[14:01:34] ^demon, i can and will do that, but that is opposite of what everyone else has told me to do so far :p
[14:02:35] <^demon> Don't know who's told you what.
[14:02:52] <^demon> I'm just saying what makes most sense...
[14:16:26] RECOVERY - Backend Squid HTTP on amssq58 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.219 seconds
[14:16:35] RECOVERY - Frontend Squid HTTP on amssq58 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.219 seconds
[14:16:37] New patchset: Pyoungmeister; "giving ayush shell on stat1 per rt 2888" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6930
[14:16:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6930
[14:18:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6930
[14:18:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6930
[14:29:56] cmjohnson1: are you sure about the raid card in srv217 being the same?
[14:30:05] that seems rather unlikely... srv217 as an app server is a very different box
[14:36:53] mark: yes..same model number
[14:37:05] you looked at the actual card?
[14:37:13] yes
[14:37:20] well then I support that idea
[14:37:24] srv217 sucks
[14:37:26] both sas1 and sas2
[14:37:41] might as well demolish it for parts
[14:37:50] i'm not even sure it's still in warranty
[14:38:16] How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?
[14:38:23] it is not...ended in Feb 2012
[14:38:33] then srv217 is never gonna be fixed anyway
[14:39:49] then I will get w/ robh about a decom ticket for it
[14:39:53] ok
[15:16:38] !log shutting down storage3 to replace raid card
[15:16:41] Logged the message, Master
[15:20:52] No one is scripting reboots right now are they?
[15:25:43] not fully automatic, no
[15:26:11] but rebooting squids every once in a while
[15:43:52] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[15:56:51] New patchset: Jgreen; "adding sudoers rules for datacenter techs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6933
[15:57:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6933
[15:57:46] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6933
[15:57:49] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6933
[16:01:17] jeff_green: when you get a chance look at storage3. I see the raid cfg.
[16:01:50] I can look now--did you restore the config?
[16:01:53] PROBLEM - Backend Squid HTTP on amssq59 is CRITICAL: Connection refused
[16:02:11] PROBLEM - Frontend Squid HTTP on amssq59 is CRITICAL: Connection refused
[16:02:40] i installed the new card and did not make any changes...I went into raid bios and see the cfg....before I do anything else I want to see if you can access the drives
[16:03:27] so I should go into the RAID bios?
[16:04:09] yes, I don't know how you had it configured...if I don't do the same way it could screw everything up
[16:04:25] unfortunately the configuration long predates me
[16:04:44] RECOVERY - Backend Squid HTTP on amssq59 is OK: HTTP OK HTTP/1.0 200 OK - 636 bytes in 0.436 seconds
[16:05:02] RECOVERY - Frontend Squid HTTP on amssq59 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.220 seconds
[16:05:29] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[16:18:23] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[16:21:43] cmjohnson1: when I hop onto com2, I'm at a dhcpfail pxeboot(?) page, any reason I should not power cycle?
[16:22:03] no...go ahead
[16:23:22] jeff_green: the boot process went like normal here...curious to see if it works.
[16:23:56] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[16:23:57] ok
[16:24:21] * pgehres sends happy thoughts storage3's way
[16:24:25] hahaha
[16:25:33] if happy thoughts don't work, Plan B is "percussive maintenance"
[16:25:58] e.g. kick the living shit out of it
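The "rebooting squids every once in a while" workflow from 15:26 is semi-manual; a cautious sketch of that kind of rolling loop, with hostnames purely illustrative, could be:

    # one squid at a time: reboot, then wait for it to answer pings again
    for host in amssq52 amssq53 amssq54; do
        ssh "$host" sudo reboot
        sleep 60
        until ping -c1 -W2 "$host" >/dev/null 2>&1; do sleep 10; done
    done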
[16:26:19] prolly could ask rob about the raid config
[16:26:29] it's likely to be in his brain cells someplace
[16:27:24] hrmf, it popped into grub before i noticed and stopped it
[16:31:17] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[16:31:21] cmjohnson1: ok i'm in the RAID bios finally
[16:32:14] does it appear to be "normal"
[16:32:29] too early to say
[16:32:35] I'm not familiar with this controller
[16:32:42] looking at raid properties
[16:33:12] me neither. but it is not in a foreign state, it sees all drives...at first glance looks ok
[16:33:32] it has only two drives?
[16:34:48] cmjohnson1: I think we should have robh take a look, I can make guesses but he's more likely to know how it would have been built
[16:35:40] ok..sounds good to me...are you able to access the file system?
[16:36:10] I don't know anything
[16:36:26] I see a controller that appears to have only two physical (sata) drives, and that's surprising to start
[16:40:25] storage3 has two disks for os, then the disk shelf for data
[16:40:50] Jeff_Green: so you see the chassis disks, but the disk md1200 shelf has 12 additional disks
[16:40:58] RobH: i just rebooted
[16:41:17] RobH: you want to look, you know the hardware
[16:41:41] there are only a few things to check, that the raid adapter sees the old config during post and boot
[16:41:48] and that the OS sees the hardware when its booted
[16:42:32] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms
[16:42:38] LeslieCarr: I just finished wiring serial to all the new row C switches. I need to run the fibers for the row ends to the opposing cr1/2
[16:42:49] what was the port for asw-c8-eqiad on cr1?
[16:43:17] Jeff_Green: i can take a glance
[16:43:37] so its rebooting now, as ping is up but ssh isnt
[16:43:56] it came up and it's trying to pxeinstall
[16:44:11] i'll log off the console, you'll want to power cycle
[16:44:21] I'm out
[16:45:20] awesome RobH :)
[16:46:25] LeslieCarr: so what ports should i use, i know you told me this once
[16:46:31] i cannot find my notebook
[16:46:39] let me look that up ..
[16:46:46] so once i have those two fibers run, row C is ready for full network setup
[16:46:57] 4 fibers then eh
[16:47:00] the power strips need more network cable for mgmt connection, but i took care of them
[16:47:03] two fibers left
[16:47:10] i ran the c1 to a1 and the c8 to a8
[16:47:19] ok
[16:47:30] also need to migrate the last two fibers not in raceway
[16:47:35] one is the wave, the other transit
[16:47:53] both i would like to move today if possible
[16:47:55] c1 goes to xe-5/0/2 on cr1 and c8 goes to xe-5/1/2 on cr1
[16:48:02] and the same with cr2 (only the second ports)
[16:48:33] are those ports shutdown?
[16:48:52] yep
[16:55:26] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:43] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[17:01:55] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[17:02:49] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[17:05:40] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[17:14:04] !log asw-c1-eqiad connected to both cr1 and cr2
[17:14:08] Logged the message, RobH
[17:18:25] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time
[17:23:13] PROBLEM - NTP on storage3 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:25:08] LeslieCarr: if you've got a minute, can you do me a huge favor and flush the mobile varnish cache?
[17:25:29] didn't preilly tell you that you're supposed to say how much more awesome i am than asher first? ;)
[17:25:41] LeslieCarr: i figured it was implied :D
[17:25:45] !log flushed the mobile cache
[17:25:48] haha
[17:25:48] but, you are SO much more awesome than asher.
[17:25:48] Logged the message, Mistress of the network gear.
[17:26:01] \o/
[17:26:02] ty
[17:31:38] LeslieCarr: Ok, I have the fibers run, want to work on the bad linecard?
[17:32:33] RobH: mind doing it in about 5 minutes ?
[17:32:47] yea i will take a short break
[17:32:50] back in 5
[17:35:09] heads up, I'm about to try a 1.2GB upload from hume following Roan's instructions.
[17:35:36] * ^demon goes down into the nagios-proof bunker
[17:36:05] heh
[17:36:30] cmjohnson1: Ok, we are going to discuss this in here
[17:36:41] storage3 is in the OS installer, so it didnt boot properly from the OS disks
[17:36:44] What exactly did you do?
[17:37:14] oh, you pulled the wrong card i bet.
[17:37:23] cmjohnson1: Did you replace the SAS controller, or the raid controller for the MD enclosure?
[17:37:28] the sas controller was fine.
[17:38:03] the RAID controller is what needs swapping, and some random srv isnt going to have a spare raid controller
[17:38:06] so not sure what you guys are doing.
[17:38:23] cmjohnson1: srv217 wouldnt have a disk array, or the raid controller.
[17:39:22] i did swap the sas controller and not the controller card on the riser
[17:42:16] RobH: are you ready ….. to rumble? :)
[17:42:44] !log switching all masterships over to cr2-eqiad in preparation to reseat cr1 linecard
[17:42:47] Logged the message, Mistress of the network gear.
[17:43:03] robh: so we are goin to have to order that part
[17:44:37] RoanKattouw, you were right .. from hume it's completely painless :)
[17:45:03] cmjohnson1: order a raid controller?
[17:45:13] cmjohnson1: Did dell confirm that it means the controller is dead?
[17:45:37] yes, we walked through steps yesterday..including moving the controller to the other riser
[17:45:46] I am not sure why mark suggested you change out the SAS controller, I expect he didn't understand what error you were having.
[17:45:50] but still receive the same error
[17:46:03] what about with updating the firmware on the raid controller, same error?
[17:46:04] i confused him with my confusion
[17:46:46] we didn't update firmware
[17:48:19] I would suggest attempting that before we call the system bad
[17:48:37] the complexity comes from keeping the data intact
[17:49:45] can we just have dell send a new RAID controller?
[17:49:55] how much can they cost, a couple $hundred?
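The mobile cache flush at 17:25 most likely amounts to a site-wide ban on the mobile Varnish instances; a sketch assuming Varnish 3's ban.url CLI command (the admin address and secret file here are guesses, not the real setup):

    # ban every cached URL on this varnish instance
    varnishadm -T localhost:6082 -S /etc/varnish/secret "ban.url ."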
[17:52:01] PROBLEM - BGP status on cr1-eqiad is CRITICAL: CRITICAL: Requested table is empty or does not exist,
[17:52:22] also, wtb crash kit for every flavor of hardware for which data loss is an issue
[17:53:58] PROBLEM - LVS HTTPS on upload-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:54:08] LeslieCarr: ....
[17:54:15] LeslieCarr: I think we're having network issues
[17:54:23] looking
[17:54:24] inside of the network I can get to gerrit
[17:54:29] outside of the network we can't
[17:54:36] there were some peering outage emails
[17:54:38] <^demon> Same with jenkins I think.
[17:54:40] maybe related?
[17:54:57] Do we not have Nagios alerts for gerrit?
[17:55:10] RECOVERY - LVS HTTPS on upload-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 597 bytes in 0.111 seconds
[17:55:11] Gerrit works here
[17:55:14] <^demon> gerrit is in watchmouse now :)
[17:55:18] weird
[17:55:18] Aha OK
[17:55:25] only port 443 isn't working
[17:55:28] <^demon> So is jenkins.
[17:55:32] <^demon> 443 wfm.
[17:55:48] it's working for me now too
[17:56:00] i think it was bgp reconvergence
[17:56:13] * Ryan_Lane nods
[17:56:13] <^demon> RoanKattouw: We should get it in nagios though, yes.
[17:56:27] Ryan_Lane: do you know the status of the netapp deployment?
[17:56:34] halfway done
[17:56:48] eta?
[17:56:54] whenever someone does it?
[17:57:10] ok, now for the riskier part RobH … are you ready ?
[17:57:19] "the riskier part"?
[17:57:21] <^demon> Riskier part?
[17:57:27] You already took down a service and now there's a riskier part?
[17:57:29] <^demon> Haha, guess now ;-)
[17:57:31] :p
[17:57:43] Hmm I guess the "riskier part" was breaking Rob's internet
[17:57:45] we have to reseat the linecard and possibly replace it
[17:57:56] lets not break my shit
[17:58:04] i just lost about 4 items in progress
[17:58:13] i get to reenter about 2 dozen seials.
[17:58:15] serial.
[17:58:19] * Damianz thinks we could just put RobH and LeslieCarr in their own vlan and watch the fight
[17:58:23] haha
[17:58:26] mark: quick question or are you done for the day?
[17:58:32] which means im not gonna do it
[17:58:43] LeslieCarr: now all the serials for racktables and new switches are your problem.
[17:59:01] paravoid: he's at dinner
[17:59:49] dude, when doing network maintenances like taking down 1 of the 2 main routers, sometimes these things happen
[18:00:07] ;]
[18:00:23] LeslieCarr: So am I standing by for something for you i imagine?
[18:00:30] LeslieCarr: You need some of that 100% updating guaranteed magic :D
[18:00:32] im on my mifi, advise when i can swap back
[18:00:35] now, comes the fun part!
[18:00:46] s/updating/uptime/
[18:00:55] <^demon> LeslieCarr: You replaced riskier with fun.
[18:01:00] haha
[18:01:31] RobH: I am going to power off fpc5 on cr1-eqiad and when you see the lights stop blinking, can you pull it out and reseat it ?
[18:02:23] RobH: are you ready ?
[18:02:44] ok, lemme get screwdriver
[18:03:34] RobH: if this is unsuccessful, then we need to swap the fpc
[18:03:35] ok
[18:03:38] LeslieCarr: ready
[18:03:56] !log powering off fpc5 on cr1-eqiad in order for RobH to physically reseat the card
[18:03:59] Logged the message, Mistress of the network gear.
[18:04:31] LeslieCarr: reseated
[18:05:14] RobH: awesome, onlining the fpc ….
[18:05:50] !log powering on fpc 5 on cr1-eqiad
[18:05:53] Logged the message, Mistress of the network gear.
[18:05:58] it's still powering on :)
[18:06:21] i also have the box with the replacement
[18:06:24] well, i assume its the replacement
[18:06:34] 'Mistress of the network gear.' makes me think of a BDSM room covered in Juniper gear.
[18:06:38] is this for the packet loss on peering?
[18:07:06] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-eqiad:xe-5/2/1 (FPL/GBLX, CV71026) [10Gbps wave]BR
[18:07:07] i assume its not, since the same issue was on cr1 and cr2
[18:07:09] yes, however, it showed any large ping sourced from an interface to have some packet loss
[18:07:20] however, it seems like this is probably a software issue ....
[18:07:21] or did cr2 not have this issue?
[18:07:30] but we need to cover the bases before we can escalate :(
[18:07:56] cuz if cr2 and cr1 have the same issue, swapping a card on cr1 aint gonna do shit
[18:08:06] (so did cr2 have same problem?)
[18:08:25] yes, and so do our mx 80's
[18:08:30] i think it's a trio chipset problem
[18:08:36] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[18:08:39] (fpc5 still booting up...)
[18:08:40] oh, and the new card doesnt have that chipset?
[18:08:46] oh the new card does
[18:08:59] just that when you go through support they have a list of things you can do ...
[18:09:03] so all the shit we are doing now is just checkboxing
[18:09:06] ok.
[18:09:08] yes
[18:09:25] card is up, ping tests running now
[18:09:26] so want me to install this new card in the same slot that 5 was in or in a slot below it?
[18:09:35] when this fails to fix it that is ;]
[18:10:27] yep in the same slot would be best, yep, this has the exact same pattern
[18:10:39] every 48 to 53 packets
[18:11:15] so swap it in existing slot and reinsert all the fibers in the same port.
[18:11:16] let me offline this card again and then can we replace ?
[18:11:17] yep
[18:11:21] also before you pull it
[18:11:22] hehe you beat me to it
[18:11:25] copy down the label port #s
[18:11:30] incase i misorder them
[18:11:32] ok
[18:11:33] one sec
[18:12:16] http://pastebin.com/Fn7AEJ3z
[18:13:44] AaronSchulz: http://wikitech.wikimedia.org/view/Swift/Deploy_Plan_-_Shard_Smaller_Wikis
[18:14:08] ok, fine to pull and swap now right?
[18:14:27] !log turned off fpc5 on cr1-eqiad to swap
[18:14:27] yep
[18:14:28] AaronSchulz: I don't think that's any different from what you described earlier.
[18:14:30] Logged the message, Mistress of the network gear.
[18:15:30] PROBLEM - Host cr1-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.196)
[18:15:39] yes nagios-wm it is down
[18:15:43] and you are all alone nagios-wm
[18:15:54] AaronSchulz: when you return. please poke me
[18:16:23] I enjoy demoralizing nonsentient bots way too much
[18:16:23] !log updating md1000 controller card firmware on storage3
[18:16:26] Logged the message, Master
[18:17:09] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-eqiad:xe-5/2/1 (FPL/GBLX, CV71026) [10Gbps wave]BR
[18:17:21] maplebed: looks similar, as long as it's the same wikis from my last email
[18:17:29] $wmfSwiftBigWikis
[18:17:37] yeah, I took that list from your mail.
[18:17:39] I didn't verify it.
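The "every 48 to 53 packets" diagnosis at 18:10 comes from streaming large pings and watching where the icmp_seq gaps fall; a sketch of that kind of test (the destination address is a placeholder):

    # print sequence-number gaps so a periodic drop pattern stands out
    ping -c 300 -s 1400 192.0.2.1 | awk -F'icmp_seq=' '/icmp_seq=/ {
        split($2, a, " "); seq = a[1] + 0
        if (seen && seq != prev + 1) print "gap: " prev " -> " seq
        prev = seq; seen = 1 }'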
[18:17:55] LeslieCarr, does this mean that we won't be able to stick a "No bots have been hurt for this product" label on our websites?
[18:18:20] Nemo_bis: nope, we'll be dissed on by PETB
[18:20:09] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[18:20:38] LeslieCarr, 13 activists have been put in jail recently in Italy because they liberated a couple dozens of dogs used for tests. They had a list of criminal charges as long as the gerrit bugs list
[18:20:44] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 0; - https://gerrit.wikimedia.org/r/6897
[18:20:58] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6897
[18:21:00] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6897
[18:21:06] :( what happened to the dogs ?
[18:21:13] They get merged?
[18:21:15] :D
[18:21:16] LeslieCarr, dunno, run away
[18:21:31] mebbe they merged each other in a happy reproduction party
[18:21:49] hehehe
[18:22:10] AaronSchulz: when do you want to do the sharding?
[18:22:26] (we should put it on the deployment calendar!)
[18:22:27] ;)
[18:23:14] I can make the switch right before you do, so I guess we need to agree on a time
[18:23:17] * apergos lurks
[18:23:33] apergos: http://wikitech.wikimedia.org/view/Swift/Deploy_Plan_-_Shard_Smaller_Wikis
[18:23:38] * AaronSchulz pushes apergos back in the caves
[18:23:39] LeslieCarr: ok, new one is all plugged in
[18:24:18] * apergos growls and moves out of the way of the sharp sticks
[18:24:41] thanks RobH, powering on now
[18:25:10] if you do a show log messages or show log chassisd you can see 5 million pieces of information about the powering on
[18:25:12] robla: do you have any advice regarding when AaronSchulz and I should push live a change to move more wikis to sharded containers?
[18:25:24] it takes about 5 to 7 minutes
[18:25:35] LeslieCarr: cool, lemme know when ya dont need me anymore and i will then take lunch
[18:25:48] also lemme know if i can switch off mifi and back to normal stuff
[18:26:28] ok, will do on both accounts, i'll just ping test the links after it comes up ….
[18:27:36] maplebed: whenever there's a free spot here: http://wikitech.wikimedia.org/view/Software_deployments seems fine to me
[18:27:51] robla: perfect. that was what I was looking for. thanks.
[18:28:33] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[18:28:49] LeslieCarr: So will we be swapping the cards back or should I pack up the old card for return?
[18:29:06] pack up the old one for return
[18:29:15] do you need me to arrange a ups pickup or do you have that ?
[18:29:30] the vendor isnt paying for it?
[18:29:50] they're paying, just have to arrange ups to come pick it up
[18:29:57] unless they stop by eqiad constantly
[18:30:56] So they did not include any kind of return tag. I assume that if you call they just dispatch UPS with the label?
[18:31:53] they didn't ?
[18:32:00] usually it's in the outside little plastic thingy
[18:32:06] a shipping invoice and a folded label
[18:32:09] RECOVERY - Host cr1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 31.36 ms
[18:32:15] no label in this
[18:32:17] just the packing slip
[18:32:21] so they need to send that stuff over
[18:32:32] usually they include it, and I can schedule a pickup with the label #
[18:32:41] old card is sealed up ready to go.
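The sharding threshold discussed here (wikis whose containers hold more than 25k objects, per the patchset later in the log) can be eyeballed with the python-swiftclient CLI; the container name and the auth environment variables it relies on are assumptions:

    # count objects in a wiki's container to see if it crosses the 25k bar
    swift stat wikipedia-de-local-public | awk '/Objects:/ {print $2}'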
[18:32:59] grrr juniper
[18:33:53] sigh, exact same pattern
[18:34:05] i mean not sigh in that it confirms my hypothesis but sigh in that it was correct
[18:34:06] so you wanna contact them about the lack of return tag?
[18:34:32] Is the network back to where the mgmt network is going to work again? (can i swap back off my mifi)
[18:35:29] and if you are done with me for a bit, going to go get lunch
[18:35:33] lemme know.
[18:35:34] let me double check
[18:35:47] * cmjohnson1 stepping out for a few to grab some food
[18:37:06] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[18:37:16] !log reenabled services on fpc5 of cr1-eqiad
[18:37:20] Logged the message, Mistress of the network gear.
[18:38:09] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 10, down: 0, shutdown: 0
[18:38:42] heh, seems wasnt ready to swap back yet
[18:38:46] nothing on wmfwiki =[
[18:38:49] wmf_wifi even
[18:41:01] LeslieCarr: yay
[18:41:25] yay good :)
[18:41:33] preilly: RoanKattouw: looks like there's an overlap Thursday 10am: http://wikitech.wikimedia.org/view/Software_deployments
[18:41:42] LeslieCarr: So you going to be around in about two hours?
[18:41:53] i would like to move the cr2 wave and cr2 transit into the raceway today if we can
[18:42:04] they are the last items needing to move for me to do a proper cable cleanup
[18:42:07] robla: we talked about it and we should be okay
[18:42:16] RobH: ok we can do that
[18:42:24] robla: the zero stuff is just a configuration change to varnish no code
[18:43:01] okee doke
[18:44:41] LeslieCarr: found the label, it fell out in the hallway
[18:44:47] went down to shipping, they had it there
[18:44:52] sorry about that ;_;
[18:45:31] yay
[18:56:54] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563*
[19:01:06] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575*
[19:05:18] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575*
[19:11:39] LeslieCarr since you're like a bagillion times awesomer than asher, would you mind flushing the varnish cache again?
[19:11:48] hahaha
[19:11:51] poor asher :)
[19:12:06] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563*
[19:12:07] it's ok he can't help your awesomeness
[19:12:19] !log flushed mobile varnish cache
[19:12:20] done
[19:12:21] New patchset: Asher; "run ptqd in 10m increments on prod dbs, add safeguards" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6952
[19:12:22] Logged the message, Mistress of the network gear.
[19:12:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6952
[19:13:27] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6952
[19:13:29] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6952
[19:15:53] LeslieCarr: you should ask the mobile dudes to fix their deployment process / cdn usage every time they ask for a cache flush. site reliability awesome points are way awesomer than dev points.
[19:16:14] hehehe
[19:16:18] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[19:18:03] RobHalsell: what's the largest capacity disk we can get from Dell for an R310?
[19:19:32] and have it housed in more than one location? ;)
[19:20:21] LeslieCarr: make it so
[19:21:56] notpeter: has ayush access to stat1?
[19:22:54] drdee: FYI, nagios has been complaining about no puppet runs on stat1
[19:23:19] drdee: also did you see what i said the other day?
[19:23:39] 08 04:16:24 < jeremyb> drdee: there were zero related changes on the 16th and the session leak issues on the 14th (so all traffic was sent from esams to pmtpa). maybe someone else knows something but that's about all i see at a glance. you should look yourself: http://wikitech.wikimedia.org/view/Server_admin_log#April_16
[19:25:45] RECOVERY - Puppet freshness on db1004 is OK: puppet ran at Tue May 8 19:25:39 UTC 2012
[19:27:33] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563*
[19:31:54] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2613*
[19:35:55] LeslieCarr: very sorry to ask again but… mobile varnish cache flush? i had not fully deployed changes :(
[19:36:16] jeremyb: thanks, i hadn't seen your reply
[19:37:10] about puppet, who can have a look at that?
[19:37:27] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2213
[19:37:38] binasher: does have a very good point that you guys should make a very concerted effort to not need to flush the cache whenever changes happen
[19:37:40] awjr: it'd be good to test deployments before requesting a cache flush
[19:37:41] and, it is flushed
[19:39:25] binasher: how would you test whether or not you forgot to deploy something if there's a chance you're seeing the old version because of the cache? (or maybe ya'll have a way to bypass the cache layer?
[19:39:30] )
[19:39:39] anyway, +1 to making flushes unnecessary
[19:41:13] jeremyb: would it be feasible to script tests behind the proxy layer, i.e. from fenari?
[19:42:10] Jeff_Green: i can only guess. but you might end up needing to write new tests per deploy. (you're not just testing does it work, but also, "are all of the changes live or not")
[19:42:41] maybe you could write a test that loads deploy-specific config
[19:43:03] roll to the deploy host which is out of lvs, run against it, then roll sitewide
[19:43:31] oh, you guys are taking backends out of rotation and then returning them?
[19:44:04] i think we have a host permanently out of lvs
[19:44:14] in sas is 600, in nearline sas is 3tb, in sata is 2tb
[19:44:16] sec, i can tell you more
[19:44:26] ok
[19:44:34] RobHalsell: sweet! i put in an RT ticket for buying some :-)
[19:45:17] jeremyb: last time I did this kind of testing it was srv193
[19:45:32] that's testwiki. i'm 98% sure
[19:45:42] (as in test.wp.o)
[19:45:44] is that good or bad?
[19:46:00] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2725*
[19:46:05] Jeff_Green: buying some hdd's or some servers?
[19:46:16] i just don't know what it has to do with mobile. (as in maybe it does, i have no idea)
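The "Puppet freshness" alerts recurring through the log (stat1, db1004, storage3) check how long it has been since the last agent run; a local sketch assuming the stock last_run_summary.yaml state file, with the 36000s threshold matching the "10 hours" in the alert text:

    # CRITICAL if puppet has not run in the last 10 hours (36000s)
    last=$(awk '/last_run:/ {print $2}' /var/lib/puppet/state/last_run_summary.yaml)
    age=$(( $(date +%s) - last ))
    if [ "$age" -gt 36000 ]; then
        echo "CRITICAL: puppet has not run in ${age}s"; exit 2
    fi
    echo "OK: last puppet run ${age}s ago"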
[19:46:24] RobHalsell: 2 addl hdds each for aluminium/grosley
[19:46:30] jeremyb: me either
[19:46:37] oh, we wouldnt order from dell
[19:46:46] we would just order some on newegg
[19:47:06] RobHalsell: ah, I didn't know if we needed trays or whatever. I'm happy with newegg or whatever you prefer
[19:47:23] i just want some very quick storage for all the log data that's on storage3, assuming we get it back
[19:48:46] LeslieCarr: so the ups tag isnt a return label, so we would have to pay for the ups pickup, and blah blah
[19:48:52] so i am just gonna take it with me and drop off to ups store
[19:48:58] ok
[19:49:03] seriously not a return label ?
[19:49:07] sigh
[19:49:15] they suck
[19:49:29] they didnt pay ups for the pickup
[19:52:00] hi guys
[19:52:05] having puppet trouble on stat1
[19:52:07] Could not find class misc::statistics::mediwiki
[19:52:11] but the class exists afaik
[19:52:17] its in my local updated prod
[19:52:22] and it used to work
[19:53:58] 08 15:43:52 <+nagios-wm> PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[19:57:02] New patchset: Bhartshorne; "introducing sharding to all wikis with more than 25k objects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6954
[19:57:15] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575*
[19:57:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6954
[19:59:12] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[19:59:59] ottomata: I don't see where that is being pulled into the config for stat1
[20:00:25] i see only misc::statistics::site in site.pp
[20:01:32] line 39?
[20:01:41] statistics.pp
[20:01:41] ?
[20:01:51] oh
[20:01:52] roles
[20:01:53] sorry
[20:01:57] roles/statistics.pp
[20:02:04] class role::statistics
[20:02:22] role/statistics.pp *
[20:03:56] i see
[20:05:31] ottomata: ok if i run puppet etc on stat1?
[20:05:39] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400
[20:06:55] yes
[20:12:13] ottomata: that is indeed odd
[20:13:01] do you see it on the puppetmaster copy of puppet repo?
[20:13:11] checking
[20:14:54] fyi opsen - AaronSchulz and I are starting to deploy sharded containers for 'dewiki','fiwiki', 'frwiki', 'hewiki', 'huwiki', 'idwiki', 'itwiki', 'jawiki', 'rowiki', 'ruwiki', 'thwiki', 'trwiki', 'ukwiki', 'zhwiki'
[20:15:01] ottomata: i do see it there yeah
[20:17:39] hmm
[20:17:48] are you allowed to restart puppet master?
[20:17:59] i don't think that's the problem
[20:18:20] i think it's a subtle config error
[20:20:02] hm
[20:20:06] on stat1 or elsewhere?
[20:20:35] in the manifests
[20:21:38] this is incredibly unlikely, but I wonder if puppet could be stupid about manifest filenames
[20:21:59] this is the only case I see where we have the same manifest filename in different directories
[20:28:48] Jeff_Green: swift.pp and role/swift.pp both exist.
[20:28:57] ah
[20:29:03] (and work ok)
[20:29:07] yeah that seems implausible anyway
[20:29:24] jeremyb: it isn't difficult to ensure that you aren't hitting the frontend cache, from adding a random query string, to explicitly purging a page via mediawiki
[20:30:05] i barely know how mobiles works
[20:30:09] mobile*
[20:30:14] e.g. does it use bits?
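binasher's first suggestion at 20:29, adding a random query string so the frontend cache can't serve a stale copy, is a one-liner; the URL and the string being grepped for are placeholders:

    # a unique query string forces a cache miss on URL-keyed frontend caches
    curl -s "http://en.m.wikipedia.org/?nocache=$RANDOM$RANDOM" | grep -c 'my-deployed-change'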
[20:31:11] oh ha
[20:31:16] misc::statistics::mediwiki,
[20:31:22] anyway, i think the solution is not better testing to be sure deploy is complete before flushing
[20:31:24] typo. missing an "a"
[20:31:35] it's just not doing flushing at all ;)
[20:32:00] definitely
[20:32:44] mobile does use bits
[20:34:00] New patchset: Jgreen; "fixed typo mediwiki-->mediawiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6988
[20:34:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6988
[20:34:41] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6988
[20:34:43] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6988
[20:35:27] * AaronSchulz wonders what's with dbbot-wm
[20:36:37] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Tue May 8 20:36:31 UTC 2012
[20:36:41] ottomata: fixed
[20:39:43] ottomata: stat1 needs a dist-upgrade and reboot or it's likely to fall afoul of the surprise 209-day reboot kernel bug
[20:40:22] well, we're talking to mark about upgrading it to precise anyway
[20:40:25] which will be a full install
[20:40:27] but thank you!
[20:40:38] will keep that in mind
[20:41:03] k
[20:41:33] you've 37 days :-P tick tock tick tock!
[20:44:56] AaronSchulz: and then there were none
[20:50:27] aww no more @replag chorus
[20:52:07] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6954
[20:55:39] New patchset: Jgreen; "removing temporary civicrm database from hume, removed unused fundraising entries from deprecated host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6991
[20:55:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6991
[20:58:15] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6954
[20:58:17] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6954
[20:59:07] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 0; - https://gerrit.wikimedia.org/r/6991
[20:59:25] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6991
[20:59:27] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6991
[21:10:57] !log shutting down mysql across all eqiad core db slaves
[21:11:00] Logged the message, Master
[21:15:20] PROBLEM - mysqld processes on db1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:16:31] PROBLEM - mysqld processes on db1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:17:16] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: CRIT replication delay 205 seconds
[21:17:43] PROBLEM - mysqld processes on db1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:17:43] PROBLEM - mysqld processes on db1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:18:01] PROBLEM - mysqld processes on db1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:18:15] its going to get noisy in here.. i really need to get the eqiad dbs into their own nagios service group
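The stat1 mystery from 19:52-20:36 turned out to be a one-character class-name typo; the quickest way to surface that kind of mismatch is to grep the manifests for both spellings side by side (the output lines here are illustrative, not verbatim):

    grep -rn 'misc::statistics::medi' manifests/
    # role/statistics.pp:  include misc::statistics::mediwiki   <- the typo
    # statistics.pp:       class misc::statistics::mediawiki {  <- the class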
[21:18:38] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: CRIT replication delay 285 seconds
[21:18:46] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 231 seconds
[21:18:55] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 239 seconds
[21:19:04] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 248 seconds
[21:19:13] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 258 seconds
[21:19:13] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 75, down: 4, dormant: 0, excluded: 0, unused: 0BRae3: down - BRxe-0/0/3: down - BRae2: down - BRae1: down - BR
[21:19:22] PROBLEM - mysqld processes on db1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:19:43] you can ignore the router interfaces down message
[21:20:07] PROBLEM - mysqld processes on db1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:20:16] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay seconds
[21:22:10] PROBLEM - mysqld processes on db1019 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:22:10] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay seconds
[21:22:19] PROBLEM - mysqld processes on db1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:23:04] PROBLEM - mysqld processes on db1024 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:23:16] * binasher is load testing nagios-wm
[21:23:49] PROBLEM - mysqld processes on db1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:24:07] PROBLEM - mysqld processes on db1033 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:24:07] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay seconds
[21:24:52] PROBLEM - mysqld processes on db1034 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:25:46] PROBLEM - mysqld processes on db1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:26:22] PROBLEM - mysqld processes on db1038 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:26:22] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay seconds
[21:27:07] PROBLEM - mysqld processes on db1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:27:52] PROBLEM - mysqld processes on db1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:28:37] PROBLEM - mysqld processes on db1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:29:49] PROBLEM - mysqld processes on db1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:32:46] !log rebooting eqiad core db slaves for kernel upgrade
[21:32:48] Logged the message, Master
[21:33:16] PROBLEM - Host db1004 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:16] PROBLEM - Host db1002 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:16] PROBLEM - Host db1006 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:16] PROBLEM - Host db1007 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:16] PROBLEM - Host db1018 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:16] PROBLEM - Host db1005 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:17] PROBLEM - Host db1017 is DOWN: PING CRITICAL - Packet loss = 100%
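The "mysqld processes" alerts flooding in here are process-count checks; the stock Nagios check_procs plugin expresses exactly that (the plugin path and thresholds are assumed, not taken from the production config):

    # CRITICAL unless at least one process named mysqld is running
    /usr/lib/nagios/plugins/check_procs -c 1: -C mysqld
    # -> "PROCS CRITICAL: 0 processes with command name 'mysqld'" while shut down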
[21:33:25] PROBLEM - Host db1024 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:25] PROBLEM - Host db1034 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:25] PROBLEM - Host db1033 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:25] PROBLEM - Host db1040 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:25] PROBLEM - Host db1039 is DOWN: PING CRITICAL - Packet loss = 100%
[21:33:25] PROBLEM - Host db1041 is DOWN: PING CRITICAL - Packet loss = 100%
[21:34:25] we have a db bot ?
[21:34:46] PROBLEM - Host db1019 is DOWN: PING CRITICAL - Packet loss = 100%
[21:34:55] PROBLEM - Host db1035 is DOWN: PING CRITICAL - Packet loss = 100%
[21:34:55] PROBLEM - Host db1021 is DOWN: PING CRITICAL - Packet loss = 100%
[21:34:55] PROBLEM - Host db1038 is DOWN: PING CRITICAL - Packet loss = 100%
[21:35:04] PROBLEM - Host db1043 is DOWN: PING CRITICAL - Packet loss = 100%
[21:35:09] binasher: if that's noisy, what about those problems with esams some time ago? ;)
[21:35:22] RECOVERY - Host db1004 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms
[21:35:31] PROBLEM - MySQL Idle Transactions on db1022 is CRITICAL: Connection refused by host
[21:35:31] PROBLEM - MySQL Slave Running on db1022 is CRITICAL: Connection refused by host
[21:35:31] RECOVERY - Host db1006 is UP: PING OK - Packet loss = 0%, RTA = 27.02 ms
[21:35:40] RECOVERY - Host db1002 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[21:35:49] PROBLEM - MySQL Recent Restart on db1022 is CRITICAL: Connection refused by host
[21:35:49] RECOVERY - Host db1034 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms
[21:35:49] RECOVERY - Host db1019 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms
[21:35:58] RECOVERY - Host db1043 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms
[21:35:58] RECOVERY - Host db1021 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms
[21:35:58] RECOVERY - Host db1040 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms
[21:35:58] RECOVERY - Host db1017 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms
[21:35:58] RECOVERY - Host db1038 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms
[21:35:58] RECOVERY - Host db1024 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms
[21:35:59] RECOVERY - Host db1039 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms
[21:37:01] RECOVERY - Host db1041 is UP: PING OK - Packet loss = 0%, RTA = 26.83 ms
[21:37:01] RECOVERY - Host db1018 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[21:37:01] RECOVERY - Host db1035 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms
[21:37:10] RECOVERY - MySQL Idle Transactions on db1022 is OK: OK longest blocking idle transaction sleeps for seconds
[21:37:10] RECOVERY - MySQL Slave Running on db1022 is OK: OK replication
[21:37:10] RECOVERY - mysqld processes on db1007 is OK: PROCS OK: 1 process with command name mysqld
[21:37:10] RECOVERY - MySQL Recent Restart on db1022 is OK: OK seconds since restart
[21:37:10] RECOVERY - Host db1007 is UP: PING OK - Packet loss = 0%, RTA = 26.98 ms
[21:37:19] RECOVERY - Host db1005 is UP: PING OK - Packet loss = 0%, RTA = 26.68 ms
[21:37:25] LeslieCarr: @info servername, @info clustername, @info dbname (db45, s5, dewiki)
[21:37:46] T3rminat0r: this is intentional though
[21:38:21] @replag all
[21:38:23] jeremyb: [s1] db38: 0s, db36: 0s, db32: 0s, db59: 0s, db60: 0s, db12: 0s; [s2] db52: 0s, db53: 0s, db54: 0s, db57: 0s; [s3] db39: 0s, db34: 0s, db25: 0s, db11: 0s
[21:38:24] jeremyb: [s4] db31: 0s, db22: 0s, db33: 0s, db51: 0s; [s5] db35: 0s, db45: 0s, db44: 1s, db55: 0s; [s6] db43: 0s, db47: 0s, db50: 0s; [s7] db37: 0s, db56: 0s, db58: 0s, db26: 0s
[21:38:59] this is all based on mediawiki api calls, so shows the app's view of the dbs
[21:39:21] hrm, one of the eqiad dbs isn't coming back up
[21:40:01] binasher: mostly, not entirely. it also fetches db.php from noc
[21:40:28] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 1574 seconds
[21:41:00] the important stuff
[21:41:22] RECOVERY - mysqld processes on db1005 is OK: PROCS OK: 1 process with command name mysqld
[21:41:40] RECOVERY - mysqld processes on db1004 is OK: PROCS OK: 1 process with command name mysqld
[21:41:46] so replag should show mw's cached picture of replag / where it's sending slave queries, vs. necessarily real time checks
[21:41:49] RECOVERY - mysqld processes on db1017 is OK: PROCS OK: 1 process with command name mysqld
[21:42:07] RECOVERY - mysqld processes on db1019 is OK: PROCS OK: 1 process with command name mysqld
[21:42:16] RECOVERY - mysqld processes on db1021 is OK: PROCS OK: 1 process with command name mysqld
[21:42:16] RECOVERY - mysqld processes on db1002 is OK: PROCS OK: 1 process with command name mysqld
[21:42:25] binasher: How does MW have a cached picture of replag?
[21:42:25] RECOVERY - mysqld processes on db1006 is OK: PROCS OK: 1 process with command name mysqld
[21:42:43] RECOVERY - mysqld processes on db1018 is OK: PROCS OK: 1 process with command name mysqld
[21:42:52] RECOVERY - Host db1033 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[21:43:01] RECOVERY - mysqld processes on db1033 is OK: PROCS OK: 1 process with command name mysqld
[21:43:01] RECOVERY - mysqld processes on db1024 is OK: PROCS OK: 1 process with command name mysqld
[21:43:19] RECOVERY - mysqld processes on db1034 is OK: PROCS OK: 1 process with command name mysqld
[21:43:37] RECOVERY - mysqld processes on db1038 is OK: PROCS OK: 1 process with command name mysqld
[21:43:37] RECOVERY - mysqld processes on db1022 is OK: PROCS OK: 1 process with command name mysqld
[21:43:40] New review: Demon; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6997
[21:43:46] RECOVERY - mysqld processes on db1040 is OK: PROCS OK: 1 process with command name mysqld
[21:43:55] RECOVERY - mysqld processes on db1043 is OK: PROCS OK: 1 process with command name mysqld
[21:44:13] RECOVERY - mysqld processes on db1039 is OK: PROCS OK: 1 process with command name mysqld
[21:44:13] RECOVERY - mysqld processes on db1041 is OK: PROCS OK: 1 process with command name mysqld
[21:44:13] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: CRIT replication delay 1983 seconds
[21:44:19] Change abandoned: Demon; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/6997
[21:44:22] PROBLEM - MySQL Replication Heartbeat on db1004 is CRITICAL: CRIT replication delay 1958 seconds
[21:44:22] RECOVERY - mysqld processes on db1035 is OK: PROCS OK: 1 process with command name mysqld
[21:44:28] RoanKattouw: lag data is stuffed in memcached
[21:44:36] with a low ttl
[21:44:40] PROBLEM - MySQL Slave Delay on db1006 is CRITICAL: CRIT replication delay 772 seconds
[21:44:41] Really? How low is the ttl
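The pattern binasher describes, stuffing lag data into memcached with a low TTL so readers get a cheap, slightly stale value, can be sketched end to end; the host names, the key, and the memcached address are all assumptions, and the 5-second TTL matches the figure given just below:

    # measure slave lag, then cache it for 5 seconds via the memcached text protocol
    lag=$(mysql -h db1002 -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print $2}')
    printf 'set db1002:lag 0 5 %s\r\n%s\r\n' "${#lag}" "$lag" | nc memcached-host 11211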
How low is the ttl [21:45:06] Cause I've never seen the API's replag numbers be stuck for any noticeable length of time [21:45:07] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: CRIT replication delay 1926 seconds [21:45:24] i'd have to check the code again but it might just be 30 sec or so [21:45:29] !log adding adminbot to the repo [21:45:32] Logged the message, Master [21:45:34] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 1516 seconds [21:45:34] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 1520 seconds [21:45:34] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 1517 seconds [21:45:34] PROBLEM - MySQL Slave Delay on db1017 is CRITICAL: CRIT replication delay 1514 seconds [21:45:39] It would have to be much much lower [21:45:43] PROBLEM - MySQL Replication Heartbeat on db1006 is CRITICAL: CRIT replication delay 305 seconds [21:45:43] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: CRIT replication delay 1503 seconds [21:45:43] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 1502 seconds [21:45:52] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: CRIT replication delay 1535 seconds [21:45:52] PROBLEM - MySQL Slave Delay on db1002 is CRITICAL: CRIT replication delay 1784 seconds [21:46:10] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 1659 seconds [21:46:28] PROBLEM - MySQL Slave Delay on db1022 is CRITICAL: CRIT replication delay 1070 seconds [21:46:28] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 1597 seconds [21:46:28] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 1694 seconds [21:46:28] PROBLEM - MySQL Replication Heartbeat on db1039 is CRITICAL: CRIT replication delay 1057 seconds [21:46:28] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: CRIT replication delay 1066 seconds [21:46:28] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 1694 seconds [21:46:29] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: CRIT replication delay 1375 seconds [21:46:46] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 1604 seconds [21:46:46] PROBLEM - MySQL Slave Delay on db1041 is CRITICAL: CRIT replication delay 824 seconds [21:46:55] PROBLEM - MySQL Replication Heartbeat on db1034 is CRITICAL: CRIT replication delay 803 seconds [21:46:55] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: CRIT replication delay 958 seconds [21:46:55] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: CRIT replication delay 1701 seconds [21:47:04] PROBLEM - MySQL Slave Delay on db1034 is CRITICAL: CRIT replication delay 762 seconds [21:47:04] RECOVERY - MySQL Replication Heartbeat on db1006 is OK: OK replication delay 0 seconds [21:47:04] PROBLEM - MySQL Slave Delay on db1039 is CRITICAL: CRIT replication delay 1002 seconds [21:47:13] PROBLEM - MySQL Slave Delay on db1040 is CRITICAL: CRIT replication delay 867 seconds [21:47:22] PROBLEM - MySQL Replication Heartbeat on db1041 is CRITICAL: CRIT replication delay 692 seconds [21:47:31] RECOVERY - MySQL Slave Delay on db1006 is OK: OK replication delay 0 seconds [21:47:48] RoanKattouw: it's actually just 5sec [21:47:49] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 1666 seconds [21:47:49] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 1494 seconds [21:48:03] so the api should be very up to date
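The per-server lag being discussed here is public: MediaWiki's siteinfo API module reports every slave's lag as the application sees it, which is also what bots like the one above poll. A quick Python 3 sketch of the call (the wiki URL and user agent are just examples):

    # Ask MediaWiki for its view of slave lag, per server (sketch).
    import json
    import urllib.request

    URL = ("https://en.wikipedia.org/w/api.php?action=query"
           "&meta=siteinfo&siprop=dbrepllag&sishowalldb=1&format=json")

    req = urllib.request.Request(URL, headers={"User-Agent": "replag-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    # Each entry carries a slave's hostname and its lag in seconds.
    for server in data["query"]["dbrepllag"]:
        print("%s: %ss" % (server["host"], server["lag"]))
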
[21:48:07] PROBLEM - MySQL Slave Delay on db1024 is CRITICAL: CRIT replication delay 907 seconds [21:48:25] PROBLEM - MySQL Slave Delay on db1021 is CRITICAL: CRIT replication delay 1322 seconds [21:48:25] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 1475 seconds [21:48:34] RECOVERY - MySQL Replication Heartbeat on db1004 is OK: OK replication delay 0 seconds [21:48:50] Aaah I see it now [21:48:52] PROBLEM - MySQL Slave Delay on db1005 is CRITICAL: CRIT replication delay 1688 seconds [21:48:57] LoadMonitor.php [21:49:30] the random recache logic in there is kinda neat.. so should always be < 5sec [21:49:38] Yeah [21:49:41] Just read that code
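The idea being admired: each lag value sits in memcached with a roughly 5-second TTL, and readers occasionally volunteer to refresh an entry before it expires, so expiry never sends a stampede of probes at the databases. A rough sketch of that probabilistic early-recache pattern, using a hypothetical fetch_lag() and a memcached-style client; this is not the actual LoadMonitor.php code:

    import random
    import time

    TTL = 5  # seconds, per the discussion above

    def get_lag(cache, server, fetch_lag):
        """Return the cached lag for server, recaching early at random.

        cache is any memcached-style client (get/set); fetch_lag is a
        hypothetical callable that measures real lag on the server.
        """
        entry = cache.get("lag:%s" % server)
        if entry is not None:
            lag, stored_at = entry
            age = time.time() - stored_at
            # Most readers take the cached value; as the entry ages, a
            # growing fraction volunteers to refresh it ahead of expiry.
            if age < TTL and random.random() > age / TTL:
                return lag
        lag = fetch_lag(server)
        cache.set("lag:%s" % server, (lag, time.time()), TTL)
        return lag

With a 5-second ceiling on staleness, the API's numbers track real lag closely, which matches what was observed above.
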
[21:49:46] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay -1 seconds [21:49:46] RECOVERY - MySQL Slave Delay on db1034 is OK: OK replication delay 0 seconds [21:49:55] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay 0 seconds [21:50:04] RECOVERY - MySQL Slave Delay on db1040 is OK: OK replication delay 0 seconds [21:50:40] RECOVERY - MySQL Slave Delay on db1022 is OK: OK replication delay 1 seconds [21:50:58] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay 0 seconds [21:50:58] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds [21:50:58] RECOVERY - MySQL Slave Delay on db1024 is OK: OK replication delay 0 seconds [21:50:58] RECOVERY - MySQL Slave Delay on db1041 is OK: OK replication delay 0 seconds [21:51:43] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay seconds [21:51:43] RECOVERY - MySQL Replication Heartbeat on db1041 is OK: OK replication delay 0 seconds [21:52:01] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 1377 seconds [21:52:01] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay seconds [21:52:55] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 1 seconds [21:52:55] RECOVERY - MySQL Slave Delay on db1002 is OK: OK replication delay 0 seconds [21:53:01] !log rebooting db1018 one more time [21:53:04] Logged the message, Master [21:53:49] RECOVERY - MySQL Replication Heartbeat on db1039 is OK: OK replication delay 0 seconds [21:54:07] RECOVERY - MySQL Slave Delay on db1039 is OK: OK replication delay 0 seconds [21:54:52] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 1120 seconds [21:55:01] PROBLEM - Host db1018 is DOWN: PING CRITICAL - Packet loss = 100% [21:55:19] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay 0 seconds [21:55:28] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 1 seconds [21:56:49] RECOVERY - MySQL Slave Delay on db1021 is OK: OK replication delay 0 seconds [21:56:49] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 1 seconds [21:57:07] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay -1 seconds [21:57:25] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [21:57:43] RECOVERY - Host db1018 is UP: PING OK - Packet loss = 0%, RTA = 26.86 ms [21:59:31] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 5 seconds [21:59:40] RECOVERY - MySQL Slave Delay on db1017 is OK: OK replication delay 0 seconds [21:59:49] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 1 seconds [21:59:58] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 0 seconds [22:00:16] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [22:00:16] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [22:00:27] haha, know what fixed the problem i was up last night trying to figure out ? disabling some of the debug logging on the box ;) [22:01:46] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 1229 seconds [22:01:55] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 1213 seconds [22:02:49] RECOVERY - MySQL Slave Delay on db1005 is OK: OK replication delay 0 seconds [22:03:25] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 0 seconds [22:03:25] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [22:03:43] RECOVERY - MySQL Replication Heartbeat on db1005 is OK: OK replication delay 0 seconds [22:03:43] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay 0 seconds [22:04:01] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [22:12:23] New patchset: Asher; "test server support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7005 [22:12:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7005 [22:13:01] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [22:13:09] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7005 [22:13:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7005 [22:13:19] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [22:23:24] New patchset: Asher; "don't cache for test.m" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7007 [22:23:43] New patchset: Reedy; "Updated config files per current usage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7008 [22:23:43] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7007 [22:23:58] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7007 [22:24:01] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7007 [22:26:16] New patchset: Reedy; "Updated config files per current usage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7008 [22:27:02] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7008 [22:27:03] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7008 [22:30:48] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds [22:31:42] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [22:38:05] New patchset: Reedy; "Start a .gitignore" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7010 [22:39:04] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7010 [22:39:06] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7010 [22:46:06] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 184 seconds [22:47:09] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 181 seconds [22:51:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [23:27:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:29:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.130 seconds [23:33:16] binasher or LeslieCarr: can i get one more mobile varnish cache flush? this time we tested first, i swear. this should be it for the day [23:33:33] awjr: i'll get it [23:33:57] instead of beer i'll make you rebuild my fw maker package next release i do :) [23:34:21] LeslieCarr: hahaha [23:34:35] !log purged varnish mobile cache [23:34:37] what could possibly go wrong [23:34:38] Logged the message, Mistress of the network gear.
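For reference, a flush like the one just logged need not always be a full wipe: if the VCL honours the PURGE method from trusted addresses, individual objects can be evicted over HTTP. A hedged Python sketch of a per-URL purge (hostnames and path are illustrative, not the procedure actually used here):

    # Evict one URL from a Varnish frontend over HTTP (sketch).
    # Assumes a VCL rule allowing PURGE from trusted hosts.
    import http.client

    def purge(cache_host, site, path):
        conn = http.client.HTTPConnection(cache_host, 80, timeout=5)
        try:
            conn.request("PURGE", path, headers={"Host": site})
            # Typically 200 if the object was cached, 404 otherwise,
            # depending on how the VCL answers the purge.
            return conn.getresponse().status
        finally:
            conn.close()

    print(purge("cache-frontend.example.org", "en.m.wikipedia.org", "/"))
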
[23:34:42] oh you think i'm kidding? ;) [23:34:48] i hate rebuilding the package [23:35:10] you'd think with all this tech i could just give it a freaking python script, and have it ask me where it should go and if i want anything special and do it like magic [23:35:11] but nope [23:35:44] fortunately, i have black magic at my disposal. [23:36:06] are you saying all it would have taken me was sacrificing a goat ? [23:36:11] cuz i could do that for an easy package build [23:36:14] and maybe a virgin [23:42:34] Reedy prepare your anus [23:43:07] * RoanKattouw ChanServ op #wikimedia-operations RoanKattouw [23:43:11] ::sigh:: [23:43:25] RoanKattouw: you noob [23:43:44] * blackman sodomizes maplebed [23:44:38] Ryan_Lane: ^ kick please? [23:44:51] ... [23:45:06] !ops blackman [23:45:11] let him finish the alphabet though :-] [23:45:18] guess we don't have asm here ;) [23:45:34] blackman: omg ur so clver! [23:45:35] Stuck record much? [23:45:39] cmon Ryan_Lane, keep blackman down [23:45:40] * Ryan_Lane yawns [23:45:44] finally [23:45:54] welcome back [23:45:56] learn to ban asshole [23:46:00] hahahaha [23:46:01] ahahaha [23:46:05] hashar: let's have a picnic [23:46:06] * AaronSchulz saw that coming [23:46:06] hahahahahaha [23:46:10] me too [23:46:14] blackman: there is like a whole openspace making fun of you right now :-))) [23:46:14] I was too lazy to ban [23:46:17] blackman: thanks for the laugh [23:46:17] thanks for the laugh! [23:46:27] i hope blackman doesn't figure out that anyone can edit wikipedia [23:46:30] we'd be so fucked [23:46:34] Ryan_Lane: you should be ashamed of yourself as an op [23:46:34] hahahaha [23:46:39] * hashar laugh [23:46:45] blackman: or I can just continue not giving a fuck ;) [23:46:47] 9_9 [23:46:56] I wish we could edit our users [23:46:57] there's Thehelpfulone, collector of flags [23:47:09] blackman: exactly 0 fucks are given this day [23:47:27] finally a channel with some cajones [23:47:38] blackman: now derp, I'm in most channels. [23:47:52] blackman: T middle last in W [23:47:55] blackman: VISA [23:47:57] 445 [23:48:07] visa is accepted everywhere [23:48:10] 0479 [23:48:31] blackman: it isn't [23:48:42] Reedy: where is it not? [23:48:50] (everywhere i want to be, anyway) [23:49:04] blackman: how's shellmix these days? [23:49:08] blackman: getraenke hoffmann [23:49:12] blackman: apparently there are bigger trolls than you around here :-) [23:49:13] on an unrelated note: http://en.wikipedia.org/wiki/Help:Starting_editing [23:49:18] preilly: lots of disposable IPs [23:49:29] Stop wasting the IPv4 addresses [23:49:44] we need them for labs! :P [23:49:48] Reedy: but i'm trying to use them all so we can push more people to ipv6 [23:50:21] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [23:50:30] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds [23:50:34] * binasher gots more ipv4 than a horse gots hair [23:51:50] i feel like nagios-wm is by far the biggest troll in this room. [23:51:56] just spamming uncontrollably [23:51:59] all hours of the day [23:52:03] christ. [23:52:21] sodomizes you all [23:52:30] hawt [23:52:42] bot fetish? [23:53:08] * Ryan_Lane yawns [23:53:12] and kick [23:53:15] good laugh for a bit [23:53:15] you have to do both [23:53:22] though +b is probably muting him at the same time [23:53:26] hashar: yeah. I've been getting a laugh from him :) [23:53:27] it's a conspiracy. nagios-wm and blackman [23:53:43] yeah we should kick nagios-wm too [23:53:46] heh [23:53:48] it is getting annoying [23:53:54] i've been arguing that for ages [23:54:03] Problem - blackman is CRITICAL [23:54:13] why do *i* care what's broken? *i'm* not ops. [23:54:14] actually, we have kicked it out of #wikimedia-tech [23:54:21] so people could use -tech to chat ;-D [23:54:22] Thank God. [23:54:22] you can always just not join this channel :) [23:54:33] sadly, we need nagios-wm [23:54:49] I would love to see it shunted to its own channel though... [23:54:50] true. most of my harassment of you involves labs, anyway. [23:54:52] you can't quit me [23:54:53] so that we can talk while shit's blowing up.
[23:55:07] yeah, that's a good idea maplebed [23:55:17] #wikimedia-nagios [23:55:23] I think we should make it PM us all the notifications [23:55:23] not a big fan tbh :) [23:55:25] aka, #wikimedia-yelling-into-the-void [23:55:46] but I could probably cope by telling irssi to print it into the same window [23:55:47] I want everyone's growl notifications to stop them from working [23:55:48] hi kids [23:55:58] ohai [23:56:04] rapey-nagios-wm: how's it goin? [23:56:13] CRITCAL CRITICAL NO LUBEclear [23:56:16] rapey-nagios-wm: you feeling a certain may today> [23:56:19] have you seen my van? [23:56:22] *way [23:56:25] Does it say FBI on it? [23:56:26] it's full of candy [23:56:37] does it have free candy spray painted on it? [23:56:40] cause that makes it totally legit [23:57:01] what was i trying to work on today.. oh well