[00:00:29] perfect. (if it was too long, I would suggest checkpointing less often)
[00:01:45] New review: Bhartshorne; "see inline comments." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/1797
[00:02:45] Change restored: Catrope; "Bah, apparently I can't amend abandoned changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1794
[00:02:57] New patchset: Catrope; "WIP for breaking out puppet-specific hooks to puppet.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1794
[00:03:23] New review: Catrope; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/1794
[00:13:12] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797
[00:21:48] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797
[00:27:45] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797
[00:28:26] maplebed: patchset 4 ?
[00:30:34] New patchset: Asher; "test new varnish pkgs on mobile cp servers out of production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1798
[00:31:41] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1798
[00:31:41] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1798
[00:42:20] PROBLEM - DPKG on cp1042 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[00:43:10] thats me
[00:43:39] Ryan_Lane: use 10.64.129.0/24 for the range for the virt1 ip
[00:43:43] 1,2,3 are reserved
[00:43:53] I need a public IP
[00:43:54] (updated in rdns)
[00:44:03] this isn't a compute node
[00:44:07] those use private IPs
[00:44:24] ah yes
[00:46:31] all i want for christmas is another /22
[00:46:40] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[00:48:31] New patchset: Asher; "fix for varnish pkg testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1799
[00:49:06] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1799
[00:49:06] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1799
[00:49:52] How easy is it to ip squat? :D
[00:52:00] RECOVERY - DPKG on cp1042 is OK: All packages OK
[00:52:56] Reedy: impossible - everyone is using ip's plus all decent transit providers filter based on which ip blocks you own
[00:53:49] Shame
[00:56:30] RECOVERY - mobile traffic loggers on cp1042 is OK: PROCS OK: 2 processes with command name varnishncsa
[00:57:25] As I've said before... I know my old uni won't be using all 65k of the ips they own. They should be made to give up ranges they know they won't use
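Lcarr's change 1797, iterated above and merged later in the log, puppetizes ganglia and gangliaweb along with saving and restoring the rrd files between tmpfs and disk. The patch itself is not quoted here, so the following Puppet manifest is only an illustrative sketch of that general approach; the mount point, tmpfs size, helper-script name and cron interval are all assumptions, not the contents of the actual change.

```puppet
# Illustrative sketch only -- the real contents of change 1797 are not quoted
# in the log.  Paths, sizes, script names and timings are assumptions.
class ganglia::rrd_tmpfs {
    # Keep gmetad's rrd files on tmpfs so metric writes don't hammer the disk.
    mount { '/var/lib/ganglia/rrds':
        ensure  => mounted,
        device  => 'tmpfs',
        fstype  => 'tmpfs',
        options => 'noatime,size=2g',    # size is a guess
    }

    # Hypothetical helper that copies the tmpfs rrds to persistent storage;
    # the same script can be run at gmetad startup to import the saved rrds.
    file { '/usr/local/bin/save-ganglia-rrds':
        owner  => 'root',
        group  => 'root',
        mode   => '0755',
        source => 'puppet:///files/ganglia/save-ganglia-rrds',
    }

    # Flush to disk regularly so a reboot loses at most a few minutes of data.
    cron { 'save-ganglia-rrds':
        command => '/usr/local/bin/save-ganglia-rrds',
        user    => 'root',
        minute  => '*/15',
        require => File['/usr/local/bin/save-ganglia-rrds'],
    }
}
```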
[01:00:53] New patchset: Asher; "putting backend instances of new mobile varnish servers into production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1800
[01:02:01] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1800
[01:02:02] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1800
[01:26:28] !log backend varnish instance on cp1042 running 3.0.2 is in production for 1/3 of mobile requests
[01:26:31] Logged the message, Master
[01:28:44] !log puppet is being deliberately left broken on cp1043 and 1044 until tomorrow
[01:28:45] Logged the message, Master
[01:32:40] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797
[01:36:01] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1797
[01:36:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1797
[01:37:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797
[01:46:35] New patchset: Lcarr; "fixing type in ganglia.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1801
[01:46:50] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1801
[01:46:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1801
[01:52:30] now for the moment of truth....
[01:53:02] restarting nickel
[02:20:05] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[02:56:55] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 677s
[03:01:15] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 689s
[03:42:59] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Fri Jan 6 03:42:38 UTC 2012
[04:19:34] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:19:34] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:46:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[08:00:50] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 205 MB (2% inode=60%): /var/lib/ureadahead/debugfs 205 MB (2% inode=60%):
[08:10:40] RECOVERY - Disk space on srv222 is OK: DISK OK
[10:00:49] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 429355 MB (3% inode=99%):
[10:04:29] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 413604 MB (3% inode=99%):
[10:08:19] RECOVERY - MySQL slave status on es1004 is OK: OK:
[10:34:09] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours
[10:35:09] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours
[10:46:53] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours
[10:46:53] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours
[10:46:53] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours
[10:46:53] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours
[10:46:53] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours
[10:52:43] PROBLEM - Puppet freshness on niobium is CRITICAL: Puppet has not run in the last 10 hours
[10:53:43] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours
[10:54:43] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours
[12:29:31] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[12:46:00] PROBLEM - Puppet freshness on db1003 is CRITICAL: Puppet has not run in the last 10 hours
[13:21:30] ACKNOWLEDGEMENT - Host db43 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #2170
[13:24:00] ACKNOWLEDGEMENT - Host db41 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn was used for OWA, re-claimed by CT for miscellaneous tomfoolery
[13:25:00] nice
[13:30:54] ACKNOWLEDGEMENT - Host db19 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn http://rt.wikimedia.org/Ticket/Display.html?id=2034
[13:39:24] ACKNOWLEDGEMENT - Host dataset1 is DOWN: CRITICAL - Host Unreachable (208.80.152.166) daniel_zahn RT #1345
[13:52:14] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[14:06:02] New review: Dzahn; "reverting to use NFS path like before (for now)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1643
[14:06:03] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1643
[14:07:59] apergos: heh, just quoting ticket :)
[14:08:22] we need a hostname "tomfoolery" now
[14:08:25] it's required
[14:08:39] *gg*
[14:09:39] hrmm..noticing puppet freshness problems on ..it seems..random hosts
[14:09:58] joy
[14:10:13] Does running puppet by hand cause errors?
[14:10:58] trying on sq67
[14:11:05] yes, it does :/
[14:11:15] err: Failed to apply catalog: Could not find dependency Package[varnish3] for Mount[/var/lib/varnish] at /var/lib/git/operations/puppet/manifests/varnish.pp:40
[14:11:35] checking gerrit for changes..
[14:12:07] !change 1799
[14:12:07] https://gerrit.wikimedia.org/r/1799
[14:13:23] hmmm.. cant find package varnish3, seems unrelated though
[14:18:42] New patchset: Dzahn; "require Class varnish::packages, instead of Package[varnish3] in Mount, because this broke puppet fe. on sq67" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1802
[14:19:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1802
[14:28:01] New review: Dzahn; "fixed dependency problem ?!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1802
[14:28:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1802
[14:30:12] RECOVERY - Puppet freshness on sq67 is OK: puppet ran at Fri Jan 6 14:29:51 UTC 2012
[14:30:23] New review: Dzahn; "yes, it did. puppet runs again on sq67" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1802
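The failed-dependency error above and the fix Dzahn describes in change 1802 amount to pointing the Mount resource's require at the class that manages the varnish packages instead of at the Package['varnish3'] resource, which is not present in every catalog that ends up with the mount (sq67, for example). A rough sketch of the shape of that change follows; it is not the literal diff, and the mount options shown are assumptions.

```puppet
# Sketch of the idea behind change 1802, not the literal manifest.
# Previously the resource had:  require => Package['varnish3'],
# which fails catalog compilation on hosts (e.g. sq67) whose catalog
# contains this Mount but no Package['varnish3'] resource.
mount { '/var/lib/varnish':
    ensure  => mounted,
    device  => 'tmpfs',
    fstype  => 'tmpfs',
    options => 'noatime,defaults,size=512M',   # assumed options
    require => Class['varnish::packages'],     # depend on the class instead
}
```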
[14:32:42] RECOVERY - Puppet freshness on sq68 is OK: puppet ran at Fri Jan 6 14:32:27 UTC 2012
[14:33:12] RECOVERY - Puppet freshness on sq70 is OK: puppet ran at Fri Jan 6 14:33:10 UTC 2012
[14:38:02] !log saw the log about cp1043/44 being deliberately left broken, but requirement in varnish.pp also broke others, fixed on sq67,68,69 (gerrit change 1802)
[14:38:04] Logged the message, Master
[14:39:12] RECOVERY - Puppet freshness on sq69 is OK: puppet ran at Fri Jan 6 14:38:57 UTC 2012
[14:41:12] RECOVERY - Puppet freshness on arsenic is OK: puppet ran at Fri Jan 6 14:40:46 UTC 2012
[14:44:42] RECOVERY - Puppet freshness on cp3002 is OK: puppet ran at Fri Jan 6 14:44:33 UTC 2012
[14:46:42] RECOVERY - Puppet freshness on niobium is OK: puppet ran at Fri Jan 6 14:46:24 UTC 2012
[14:51:48] yep, those recovered for the same reason
[14:56:42] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Fri Jan 6 14:56:19 UTC 2012
[16:27:02] New patchset: Hashar; "WikipediaMobile: add direct link to latest nightly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1803
[16:27:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1803
[16:48:37] New patchset: Demon; "Fix my public key, was off-by-one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1804
[16:48:51] New patchset: Demon; "For the last time, fixing my public key. I swear this is it. (With a new comment too)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1805
[16:49:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1805
[16:50:10] New review: Demon; "Cherry picked from test: https://gerrit.wikimedia.org/r/#change,1219" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1805
[16:50:31] New review: Demon; "Stupid change, but was needed for the dependency to cleanly merge." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1804
[17:03:13] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Fri Jan 6 17:02:44 UTC 2012
[18:00:21] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1804
[18:00:22] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1804
[18:01:55] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1805
[18:01:56] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1805
[18:05:30] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1803
[18:05:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1803
[18:12:57] hey kids...some people are getting DB errors when connecting to commons...whatever host 10.0.6.32 is, production machines are still pointing to it even tho it's down
[18:13:22] woosters: ^^
[18:15:19] Yup
[18:15:22] We know
[18:15:25] db22 is down
[18:15:28] db22 is S4 master
[18:15:31] i am on mgmt right now..
[18:15:37] its Sun :p
[18:15:40] aah ok
[18:15:54] And I think S4 is centralauth also
[18:16:08] https://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?host=db22
[18:16:14] but db22 already had an existing ticket
[18:16:17] for failed RAID
[18:16:19] ouc
[18:16:20] h
[18:16:22] :|
[18:17:11] !rt 1136
[18:17:12] http://rt.wikimedia.org/Ticket/Display.html?id=1136
[18:17:45] Want me to ping asher?
[18:17:55] yes
[18:18:08] or RobH? thx
[18:18:19] i was thinking asher as it was db
[18:18:21] i can do both
[18:19:05] PROBLEM - Host db22 is DOWN: PING CRITICAL - Packet loss = 100%
[18:19:43] all...db22 shouldn't be down...i am working on a ticket to replace a HDD but it is hot swappable
[18:19:56] Ah
[18:20:00] ah
[18:20:01] cmjohnson1, seemingly not :(
[18:20:09] obviously...
[18:20:19] Asher should be inbound
[18:20:33] cmjohnson1: glad to hear you are working on it though!
[18:20:55] cmjohnson1: i thought it just failed completely, the second disk as well
[18:21:19] cmjohnson1: i am on Sun LOM of db22 right now, i wont do anything though
[18:21:25] Hey binasher, raid/disk issue, cmjohnson1 is/was changing a supposedly hot swappable disk, seemingly not
[18:21:46] no...my error
[18:21:48] .hi.
[18:22:15] can someone give a brief status?
[18:22:23] db22 is currently down and is s4 master.
[18:22:28] is it rebooting or still down?
[18:22:41] db22 had an existing ticket for harddisk / RAID failure
[18:23:16] which is the ticket i was working, 1136
[18:23:17] sounds like promoting a slave to be a new master is a good idea. binasher - agree?
[18:23:19] cmjohnson was working on that ticket to replace disk
[18:23:28] of course
[18:23:39] binasher: do you want to run it or shall I?
[18:23:52] I'm happy to if you're still in transit.
[18:25:05] Reedy: heya
[18:25:07] s4 has a sad amount of capacity
[18:25:35] Ohai
[18:25:42] Reedy: asher taking care of it?
[18:25:52] binasher: do you want point or should I take it?
[18:26:08] oh good, i dont have to do shit, huzzah!
[18:27:02] :) hi all
[18:27:21] maplebed: whatever you want to do is fine
[18:27:29] ok, I'll take point
[18:27:30] deploying db.php that puts s4 in read only
[18:27:46] mutante: you mind running coms?
[18:27:49] and puts db31 in the master position
[18:28:06] i'm not even sure what you mean by point :)
[18:28:23] binasher: to confirm, you are currently deploying the new db.php; I'll repoint db33 to slave off of db31.
[18:28:23] You get shot first :p
[18:28:37] maplebed: reporting to wikitech? sure. i am still on mgmt of db22. but not doing anything. should i logout?
[18:28:58] mutante: you're ok on db22.
[18:28:59] maplebed: you must also switch: s4-secondary.eqiad.wmnet is an alias for db1038.eqiad.wmnet.
[18:29:10] binasher: ok, will do.
[18:30:16] cmjohnson1: do you think db22 will be back up today?
[18:30:30] I am done...the drive has been swapped.
[18:30:36] mutante: expected recovery: 10min.
[18:30:45] I apologize...the instructions did state that it was hot swappable
[18:30:58] mutante: current effect: commons is read-only.
[18:31:13] cmjohnson1: it happens. its booting now though?
[18:31:51] for the record: master log file on db31 is db31-bin.000213, master_log_pos is 205612709
[18:31:55] PROBLEM - RAID on db23 is CRITICAL: CRITICAL: logical devices: 1 defunct
[18:32:04] so the raid had too many failed disks to hot swap?
[18:32:11] (i am sure I put in that it was hot swap ;)
[18:32:39] cmjohnson1, for future, might be worth !log'ing things like that :)
[18:33:00] reedy: definitely
[18:33:05] binasher: db33 is changed to slave off of db31. I will set read_only to false on db31 in 2min unless you object.
[18:33:09] maplebed: !log the hostname + binlog file + log pos of the new s4 master as it starts. toolserver will need that (and it will be good to have it recorded if db22 comes back up with data)
[18:33:09] yea, log every disk swap or shutdown
[18:33:25] currently changing slave status for db1038
[18:33:37] im interested to know why it died when the hot swap disk was removed.
[18:33:54] too many failed drives or improper raid configuration?
[18:34:01] !log new master for s4: db31, log file db31-bin.000213 log pos is 205612709
[18:34:03] Logged the message, Master
[18:34:48] !log old master for s4 log file db22-bin.000106 log pos 631618956
[18:34:49] Logged the message, Master
[18:36:04] maplebed: i'll change db1038 unless you're in the middle of it. just go for setting db31 rw and taking s4 out of ro in db.php
[18:36:14] binasher: just finished changing db1038
[18:36:24] currently setting db31 rw
[18:36:44] hmm
[18:36:49] on db1038: Seconds_Behind_Master: 8729495
[18:36:59] feck
[18:37:22] fascinating.
[18:37:27] it might be thoroughly busted.
[18:37:30] db33 looks fine though.
[18:37:37] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[18:38:06] and looks like they're pointing at the same positions
[18:38:15] !log db31 made read-write as the new master for s4
[18:38:16] Logged the message, Master
[18:38:16] hmm.. go ahead
[18:39:15] ok, now db1038 shows 0 sec behind, as soon as transactions were made
[18:39:49] maplebed: can you also update dns? (s4-master cname)
[18:39:56] sure, in a sec.
[18:41:10] !log pushed out new db.php setting s4 to read-write
[18:41:11] Logged the message, Master
[18:41:32] Reedy: would you test commons for me?
[18:42:27] RECOVERY - RAID on db23 is OK: OK: 1 logical device(s) checked
[18:43:47] maplebed, test edit is fine, recent changes looks good (activity again)
[18:43:48] commons looks good in dberror.log
[18:43:55] mutante: I believe everything is back online.
[18:44:04] thanks reedy.
[18:44:11] 36 minutes downtime it seems
[18:44:13] maplebed: sent a short summary to wikitech-l
[18:45:04] binasher: what I meant by 'take point' is that I've found it very beneficial to have a single person as the lead in an outage situation to coordinate activity and prevent duplicate action. assignment of that person is what I was talking about.
[18:45:46] got it, that went nice and smoothly
[18:45:47] there are all sorts of things that become easier when there's that single figurehead for the event, such as coordinating communications, making efficient use of people for investigations, etc.
[18:46:47] mutante: are you still on the db22 console? if so, is it actually booting?
[18:46:52] !log s4 database rotation complete. outage duration 36 minutes.
[18:46:53] Logged the message, Master
[18:47:17] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time
[18:47:24] the only s4 slave is also the lvm snapshot host, so it's going to be slow
[18:48:11] binasher: i am on the Sun LOM command line but dont see console output yet.. let me check
[18:49:21] Serial console started.
[18:49:52] but i dont see more
[18:50:01] start /SP/console , right
[18:50:11] updated dns to set s4-master to db31.
[18:52:42] binasher: i dont see any output
[18:53:01] cmjohnson1: still here?
[18:53:05] so is db22 fully out of the loop and can be recovered normally?
[18:53:13] cuz if so i wanna take a look at a few things before we bring it up
[18:53:15] RobH: yes it is.
[18:53:28] ok, can we hold for five minutes so i can poke at some stuff in mgmt?
[18:53:33] binasher yes
[18:53:38] imo, take your time.
[18:53:51] mutante: can ya hop off serial if yer on it?
[18:54:22] Connection to db22.mgmt closed.
[18:54:29] thank you =]
[18:54:32] binasher: db1038 is now caught up in replication. do you want to run any sort of checks on it to see if it's broken?
[18:54:33] yw:)
[18:55:11] thanks for doing comms, mutante.
[18:55:32] all thx for fixing my blunder!
[18:56:43] np, thanks to malafaya, he reported it first, 10 minutes before Nagios, heh
[18:56:53] cmjohnson1: db22 still needs to be brought up
[18:58:10] binasher: okay, i think robh is looking at a few things first
[18:58:28] maplebed: poking at db1038 but i think it's fine. replag reported normally as soon as db31 wrote to its binlog
[18:58:42] yea gimme a few more minutes and we can bring it back
[18:59:36] ok, I am done poking at the stuff in mgmt, mutante did you want to take back over or want me to just boot it?
[18:59:59] i dont mind, i wanna check a couple things after boot
[19:01:38] no answer assumes compliance? (maplebed ?)
[19:01:43] yer heading the outage, yer call
[19:01:50] he's away from the desk right now
[19:02:06] RobH: eh, yeah, please go ahead, i was distracted
[19:02:24] ok, so all the other slaves are going to new master and this is out of the loop right?
[19:02:55] right
[19:02:56] RobH: just boot please, real life calls me..thx
[19:03:16] heh
[19:03:18] well timed
[19:03:20] RobH: sorry, stepped away for a minute. thought I declared the outage over.
[19:03:29] oh, ok
[19:03:48] db22 needs to be reattached to the mysql tree sometime, but it's an hours-to-a-day timeframe, not minutes-to-an-hour.
[19:04:02] but thanks for checking.
[19:04:03] :)
[19:04:05] ok, attempting to bring it up
[19:04:49] robh: let me know if the disk swap fixed the original problem
[19:06:12] RobH: yeah, i already sent mail to wikitech saying its over
[19:06:32] RobH: so i dont think we need to rush on db22 now..
[19:06:33] right?
[19:06:56] k
[19:07:21] i lost the prompt before by reading backlog of errors, had to restart the boot, so glad its not time restricted ;]
[19:07:48] yeah, seems like we are fine for now. at least no more user reports
[19:07:50] nearly all commons db queries are going to a single host that is rather slow thanks to running two lvm snapshots
[19:08:07] its not "we're down" urgent, but it's still pretty urgent
[19:09:20] it appears that only 8 of the 12 drives are online in the disk array
[19:09:23] timeframe of hours is still ok though
[19:09:37] That sounds somewhat suspect
[19:09:41] i think i know what caused it
[19:09:45] i am just confirming things
[19:11:18] i think i'll build db51 as a new s4 slave today
[19:17:07] cool. outage response worked pretty good though, didnt it? thanks everyone, heading out for dinner then, cu
[19:17:18] cyas
[19:17:37] seeya, thanks mutante
[19:17:40] cya: thx
[19:18:45] binasher: so the raid array is degraded to the point of non-use. Too many disks fell out within the same mirror sets.
[19:18:54] I think that all disks are working
[19:19:06] but would need to redo the array config and start over
[19:19:32] means you have a full recovery to do though
[19:19:45] the alternative is booting via rescue and trying to force it online
[19:19:51] ok, in that case - we might as well do that for the sake of future use of the machine if you think its worth it (its always good to have spare db hardware)
[19:19:55] which is kind of a pain in the ass
[19:20:12] wipe and rebuild would probably be better
[19:20:38] ok, I will go ahead and rebuild a fresh raid10 array and get the OS load started, its fully automated for the dbs.
[19:21:37] i'm going to build db51 as an s4 slave today, db22 can be added as an s4 next week if the os reload goes ok. s4 needs one more slave than it had anyways, and was going to get one from that last hw order
[19:24:31] New patchset: Asher; "db51 -> s4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1806
[19:24:46] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1806
[19:24:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1806
[19:24:56] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1806
[19:24:56] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1806
[19:28:12] !log started innodb hot backup of db1038 to db51
[19:28:13] Logged the message, Master
[19:30:14] cmjohnson1: ok, i am working db22, which drive did you swap?
[19:30:23] sorry if you said before, but i have it in front of me now
[19:30:35] (which drive did you id as the serial from the ticket that is)
[19:30:50] the slot #s are written on the top of the case, shows the layout
[19:30:57] the ticket was 3PD1TR3R
[19:31:03] so you can usually look at top of one in a rack
[19:31:12] since we have db racks
[19:31:29] right, but I need to know what slot it was in
[19:31:36] there are 16 drive slots
[19:32:23] if you arent sure just lemme know that
[19:34:05] i am not sure of the drive slot...it was 3 rows below the fan
[19:34:26] ok, so you know the physical drive slot, just not how they are numbered?
[19:34:34] we can figure that out if so
[19:35:02] yes
[19:35:49] ok, so if you snag the stepstool or whatever, you can go over to the row A database rack
[19:35:56] and look on the top cover of the top db
[19:36:01] it has the drive layout on the top cover
[19:36:11] or any rack where the top server is a 4240
[19:36:36] just need to know the # of the slot, and the diagram will have that. if you cannot read it off anything, we can shutdown db22 and pull it for that diagram
[19:36:41] but usually easier to read off top of another
[19:41:42] RobH: I made an eqiad ticket for you yesterday; just for scheduling (not urgent), do you have an idea when you'll be onsite to do it?
[19:41:46] !rt2220
[19:41:50] im onsite now =]
[19:42:15] yea i planned to do this today, didnt even know the ticket was there yet
[19:42:23] i need to fix it and the two cp servers i offlined to do it
[19:42:27] will happen today
[19:43:04] cool. thanks!
[19:43:14] welcome
[19:46:25] robh: it was drive slot 4 (last on 2nd Col)
[19:47:54] all disks back in place?
[19:48:02] yes
[19:48:04] ok, rescanning
[20:04:49] LeslieCarr, is ganglia on nickel fully set up? as all servers seem to be showing in all groups
[20:05:24] yeah, trying to figure that problem out :(
[20:05:32] if you have any ideas..
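For background on the symptom described here (every server showing up under every group): gangliaweb's grouping comes from the cluster name each gmond reports into and the data_source entries gmetad polls, and a common cause of hosts appearing in every group is several clusters sharing an aggregator or multicast channel. The log does not show the actual configuration, so the following Puppet-managed gmetad.conf is purely an illustrative sketch with invented cluster names, hosts and ports, not the real setup on nickel.

```puppet
# Illustrative only: cluster names, hosts and ports are invented.  The point
# is that each cluster gets its own data_source line (and usually its own
# port or multicast channel), which is what lets gmetad keep groups apart.
service { 'gmetad':
    ensure => running,
}

file { '/etc/ganglia/gmetad.conf':
    owner   => 'root',
    group   => 'root',
    mode    => '0444',
    notify  => Service['gmetad'],
    content => "gridname \"Example Grid\"
data_source \"Application servers\" 15 srv190.example.org:8649 srv191.example.org:8649
data_source \"MySQL\" 15 db31.example.org:8650 db33.example.org:8650
",
}
```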
[20:07:31] !log db22 reinstalling
[20:07:32] Logged the message, RobH
[20:08:45] Should there be data there for each machine too?
[20:13:37] ok, once db22 gets to the partitioning and passes in installer, i am going afk for a short break (5 minutes)
[20:13:45] then will be back to work on eqiad open issues
[20:23:08] !log db22 reinstalled and booting into OS. No puppet runs yet, now its Asher's problem ;]
[20:23:09] Logged the message, RobH
[20:25:03] !log rt2226 - redeploy db22 for asher
[20:25:04] Logged the message, RobH
[20:25:26] afk, back shortly.
[20:27:24] RECOVERY - Host db22 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms
[20:44:44] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours
[20:46:58] PROBLEM - MySQL disk space on db22 is CRITICAL: Connection refused by host
[20:52:08] PROBLEM - DPKG on db22 is CRITICAL: Connection refused by host
[20:53:47] I have made a sun java package that will successfully install via dpkg
[20:53:48] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours
[20:53:58] PROBLEM - Disk space on db22 is CRITICAL: Connection refused by host
[20:56:06] notpeter, I think it's weekend time now :p
[20:57:20] that sucked soooo hard...
[20:57:28] no no, now I get to put it in the repo! yay!
[20:57:31] haha
[20:57:39] I was suggesting that was enough pain for the week
[20:57:46] but if you wanna carry on ;)
[20:57:50] hahaha
[20:58:21] man, so there's an app for debian/ubuntu that takes the sun java binary installers
[20:58:23] and packages them
[20:58:30] called, creatively, java-package
[20:58:34] nice
[20:58:39] unfortunately, it's extremely poorly supported
[20:58:50] so I hacked out large portions of it, and then it ran!
[20:58:51] yay!
[20:58:58] now to see if I hacked out anything important...
[20:59:17] like, I broke all of the java browser support... but I don't think that that's going to matter
[21:00:45] !log started slaving db51 off of db31
[21:00:46] Logged the message, Master
[21:04:19] binasher: saw i ticketed db22 to you?
[21:04:31] yup, thanks!
[21:05:54] i should have db22 back in prod by the end of the day
[21:06:02] but still going to put db51 in s4 too
[21:06:58] RECOVERY - MySQL disk space on db22 is OK: DISK OK
[21:08:56] !log putting db51 into production as an s4 slave
[21:08:57] Logged the message, Master
[21:11:58] RECOVERY - DPKG on db22 is OK: All packages OK
[21:13:48] RECOVERY - Disk space on db22 is OK: DISK OK
[21:13:58] New patchset: Asher; "upgrading db22 to new pkgs / fully puppetized config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1807
[21:13:58] RECOVERY - RAID on db22 is OK: OK: 1 logical device(s) checked
[21:15:43] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1807
[21:15:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1807
[21:18:04] binasher: I will get that ssd installed for your testing in a bit
[21:18:48] RobH: right on
[21:21:44] !log es1002 back ready for service use per #2220: replace original RAID card in es1002
[21:21:45] Logged the message, RobH
[21:21:47] maplebed: ^
[21:22:16] woo!!!
[21:22:22] now more boring work for me!
[21:22:26] :P
[21:22:48] !log restoring db22 from a live hotbackup of db1038
[21:22:49] Logged the message, Master
[21:27:57] ryan_lane: howdy
[21:28:42] drdee: howdy
[21:28:43] sup?
[21:29:24] ryan_lane: not much :)
[21:29:46] !log cp1014 and cp1019 hdd controller cables replaced (removed for testing controllers), both can be used normally
[21:29:47] Logged the message, RobH
[21:30:08] ryan_lane: would you have a spare IP address for me? the pageviews virtual instance is ready and Erik M. would like to have it running in a slightly more open manner for experimentation
[21:30:21] i meant spare public IP address
[21:32:13] I think so
[21:32:27] let's talk in -labs
[21:32:28] PROBLEM - Host es1002 is DOWN: PING CRITICAL - Packet loss = 100%
[21:37:32] binasher: so i can shutdown db1029?
[21:37:53] looks like its doing nothing but sitting there
[21:38:20] yea, not running mysql, shutting down
[21:38:30] !log db1029 coming down for ssd testing
[21:38:31] Logged the message, Master
[21:44:25] !log db1029 powering back up with ssd testing hardware installed
[21:44:26] Logged the message, RobH
[21:44:29] binasher: ^ all yers
[21:46:28] RECOVERY - Host es1002 is UP: PING OK - Packet loss = 0%, RTA = 30.86 ms
[21:46:28] PROBLEM - Host db1029 is DOWN: PING CRITICAL - Packet loss = 100%
[21:47:28] RECOVERY - Host db1029 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[22:16:40] New patchset: Lcarr; "removed statically spec'ed cluster from nickel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1808
[22:16:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1808
[22:17:02] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1808
[22:17:03] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1808
[22:19:12] New patchset: Ryan Lane; "Enabling ipv6 proxy on ssl3001 to re-enable upload ipv6 proxy." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1809
[22:19:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1809
[22:19:37] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1809
[22:19:38] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1809
[22:32:14] LeslieCarr: you about?
[22:32:32] RobH: yeah
[22:32:37] i need the info on what you need connected
[22:32:39] sorry work keeps distracting me
[22:32:44] lemme get you the ticket
[22:32:44] i see https://rt.wikimedia.org/Ticket/Display.html?id=2219 in network for you
[22:32:47] but nothing else
[22:33:06] also have this http://rt.wikimedia.org/Ticket/Display.html?id=1882 for lab switch, is this now invalid since we shipping it back?
[22:33:13] or should it stall until it comes back?
[22:33:27] http://rt.wikimedia.org/Ticket/Display.html?id=1207
[22:33:53] the 1882 ticket we should hold on until the new labs switch is back here
[22:34:40] so am i relocating existing fiber runs or adding new ones?
[22:35:05] adding new ones
[22:36:30] hey RobH es1002 was caught in a reinstall loop. I checked the bios and it had network listed before disk in boot order. I switched it, but I'm curious how that happened and whether it has any relation to our failures to get it to load into a bunch of raid-0 disks.
[22:37:01] (or if I'm totally confused and it's supposed to be network first, for some reason I can't fathom)
[22:37:08] LeslieCarr: can you happen to tell me a cable that is cr1 to cr2
[22:37:14] just so i can find it and see its length
[22:37:19] a fiber that is
[22:37:38] maplebed: it happened cuz the controller was removed and added back
[22:37:39] ok, 1 sec
[22:37:42] and i forgot to confirm that in bios is all
[22:37:51] from trying to boot with the r610 controller
[22:37:54] xe-5/2/0
[22:38:06] RobH: ok. so it's new from the perc card, and wasn't the case for the previous card we were trying?
[22:38:32] in which case it's probably unrelated, since we tried the raid-0 thing before swapping cards.
[22:38:42] it remembers the raid0 stuff
[22:38:50] but the bios was set to boot from the r610 controller
[22:38:59] so when it didnt detect the raid controller, bios removed it from the boot order
[22:39:04] now that its back, it defaults to last
[22:39:07] or not at all
[22:39:09] gotcha.
[22:39:11] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[22:39:13] if you moved it and saved it, should be good
[22:39:18] i just forgot to do that for ya, slipped my mind
[22:39:19] ok, so setting it back to the controller first was the right thing.
[22:39:21] yep
[22:39:24] and in fact it's now booted into an OS.
[22:39:26] \o/
[22:39:50] LeslieCarr: that works fine, thanks! getting the fibers now for the runs. one is short in rack, cr1 to psw1, but cr2 to psw2 is a1 to a8
[22:40:21] RECOVERY - SSH on es1002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:43:06] LeslieCarr: fyi: the cr1-eqiad - 5/2/3 to psw1-eqiad xe-0/1/0 will be an sfp attached cable
[22:43:14] rather than a fiber, easier for in rack stuff, shouldnt matter to you
[22:43:17] but wanted to let you know
[22:43:28] (i think they are marked differently in port labels in software?)
[22:43:38] okay
[22:43:39] cool
[22:44:05] you can see that sfp-t's are different than sfp-sx or lx
[22:44:14] sfp-t's being the technical term for a copper sfp
[22:46:46] cool
[22:48:58] LeslieCarr: so i am not sure where you want this on psw1
[22:49:08] 0/1/0 sounds like port 1 on the main 24 ports
[22:49:30] is that right? (or does this plug into the sfp port addon?)
[22:49:41] i was thinking into the sfp port addon
[22:49:49] ok, cool
[22:49:56] but if it's copper on one side you could also plug it into the last port
[22:49:56] 1m fits for that, not the other
[22:50:02] okay
[22:50:05] the last copper port
[22:50:11] the addon is fine
[22:50:13] you dont need to move it
[22:50:16] cool
[22:50:18] thats the side i was routing to
[22:50:23] so the first of the two 10g?
[22:50:32] or is this in the 1g?
[22:51:16] cuz i think its 0/1/0 through 0/1/3
[22:51:31] oh wait a sec
[22:51:31] and 0/1/0 and 0/1/2 are 10g,
[22:51:40] we need to use the 10g
[22:51:42] so copper won't work
[22:51:53] aww =P
[22:51:53] sorry, brain fart - it's 10g cards on the other side
[22:51:55] yeah
[22:51:58] :(
[22:52:17] hrmm, lemme track down a fiber, we may need to use a longer one today and replace with proper one later
[22:52:21] brb, checking
[22:52:27] okay
[22:52:38] frick
[22:52:40] fruck adfawefdsfvawd
[22:52:41] if you need more fiber please put in an order for some more (extra fiber is always useful imho)
[22:52:44] i just pulled the wrong cable
[22:52:53] pulled 5/3/3
[22:53:02] then realized it and clicked it back in, was out a mm
[22:53:06] but i saw the lights hop.
[22:53:09] LeslieCarr: ^
[22:53:16] oh
[22:53:19] yeah
[22:53:27] you broke the site!! :-D
[22:53:33] i didnt just have an hour long recap of the outage with chris or anything...
[22:53:36] about being more careful.
[22:53:37] you only pulled the 2nd worst cable
[22:53:41] RECOVERY - Puppet freshness on es1002 is OK: puppet ran at Fri Jan 6 22:53:38 UTC 2012
[22:53:41] it could have been the tinet one
[22:53:43] just kidding. have a good day plugging cables :-) see you next week.
[22:54:00] which one was this one?
[22:54:04] HE peering
[22:54:06] ohh, hurricane
[22:54:08] yea
[22:54:09] yeah
[22:54:11] oh well =P
[22:54:13] hehehe
[22:54:15] shit happens
[22:54:26] maybe you can label the cables on both sides with the peering partner name?
[22:54:30] someone tell chris when he shows back up, it will cheer him up
[22:54:47] this wasnt a partner cable
[22:54:50] was core to peering router
[22:54:58] so if we had more than one peer, it would have taken them all out
[22:55:02] but we have just the lonely single peer.
[22:55:11] PROBLEM - Puppet freshness on db1003 is CRITICAL: Puppet has not run in the last 10 hours
[22:55:22] :)
[22:59:04] !log db22 is back in s4
[22:59:05] Logged the message, Master
[22:59:30] LeslieCarr: so these need to be 10g, so
[22:59:36] cr1-eqiad - 5/2/3 to psw1-eqiad xe-0/1/0
[22:59:43] cr2-eqiad - 5/2/3 to psw2-eqiad xe-0/1/1 to cr2-eqiad - 5/2/3 to psw2-eqiad xe-0/1/2 ?
[22:59:45] right?
[22:59:57] cuz 0/1/1 is not 10g
[23:00:55] LeslieCarr: can you confirm if the sfp i just plugged into cr1 5/2/3 is right?
[23:01:07] cuz i just have a bunch of them in psw2
[23:01:15] so took it from there, its where we put the spares
[23:01:32] ok
[23:02:11] haha it's showing up as "unknown" ;)
[23:02:25] lemme remove the disable
[23:02:25] well it doesnt match the sfps in the core
[23:02:31] but dunno
[23:02:41] try another pull and replug, just in case
[23:02:48] third party sfp's can be finicky
[23:03:01] swapped
[23:03:11] ahha, it does say now unknown 10GBASE optic
[23:03:15] so looks like it is 10g :)
[23:03:38] ok, that was my concern, so hooking it up to psw1 now
[23:03:53] sfp installed in psw1
[23:03:58] if ya wanna check it works?
[23:04:47] i do but have a friend helping me debug the gangliaweb issues
[23:05:00] ok, should i just connect it all and then you can check?
[23:05:18] i assume i need to change cr2-eqiad - 5/2/3 to psw2-eqiad xe-0/1/1
[23:05:27] cool
[23:05:30] actually
[23:05:32] urgh
[23:05:35] psw2 is ex4500
[23:05:43] RECOVERY - NTP on es1002 is OK: NTP OK: Offset -0.02113974094 secs
[23:05:44] ?
[23:05:51] there isnt a 0/1/1
[23:05:56] hehe
[23:06:01] its only 0/0/X
[23:06:02] what is the 10g port called ?
[23:06:03] ok
[23:06:11] its all the same port as hurricane
[23:06:15] i assume they can go 10g
[23:06:36] just dont pick 0/0/1
[23:06:48] the sfp in 0/0/0 needs to have its lever flipped to the shut position
[23:06:52] as its blocking the port below it
[23:06:59] but one must remove the fiber for that, so not now.
[23:07:10] higher ports are better, right side of switch
[23:07:29] I'll run the fiber in the trays and leave unattached for now
[23:07:35] okay
[23:07:41] just attach to whichever is easiest
[23:07:44] and tell me what you used
[23:08:10] ok, will update ticket and assign back to you
[23:08:51] cool
[23:08:54] thanks
[23:22:53] LeslieCarr: http://rt.wikimedia.org/Ticket/Display.html?id=1207 is updated with ports attached
[23:23:00] thank you RobH
[23:25:27] !log working rt1549 lvs1003 may flap, it is presently not in service due to possible hdd failure
[23:25:28] Logged the message, RobH
[23:31:37] ok, have not come up for air for 3 hours, taking short break, back in 5 to 10.
[23:31:43] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 85, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/2/3: down - psw1-eqiad:xe-0/1/0 (cable #TBD)
[23:33:17] New patchset: Lcarr; "adding in new ganglia site for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1810
[23:39:18] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1810
[23:39:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1810
[23:41:33] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 88, down: 0, dormant: 0, excluded: 0, unused: 0
[23:47:03] RECOVERY - RAID on es1002 is OK: OK: State is Optimal, checked 2 logical device(s)
[23:47:24] lvs1003 is all kinds of messed up
[23:49:03] RECOVERY - MySQL disk space on es1002 is OK: DISK OK
[23:49:20] nickel is about to be all kinds of messed up if i can't get ganglia to work!
[23:55:13] RECOVERY - DPKG on es1002 is OK: All packages OK
[23:56:33] RECOVERY - Disk space on es1002 is OK: DISK OK