[00:00:29] perfect. (if it was too long, I would suggest checkpointing less often)
[00:01:45] New review: Bhartshorne; "see inline comments." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/1797
[00:02:45] Change restored: Catrope; "Bah, apparently I can't amend abandoned changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1794
[00:02:57] New patchset: Catrope; "WIP for breaking out puppet-specific hooks to puppet.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1794
[00:03:23] New review: Catrope; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/1794
[00:13:12] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797
[00:21:48] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797
[00:27:45] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797
[00:28:26] maplebed: patchset 4 ?
[00:30:34] New patchset: Asher; "test new varnish pkgs on mobile cp servers out of production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1798
[00:31:41] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1798
[00:31:41] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1798
[00:42:20] PROBLEM - DPKG on cp1042 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[00:43:10] thats me
[00:43:39] Ryan_Lane: use 10.64.129.0/24 for the range for the virt1 ip
[00:43:43] 1,2,3 are reserved
[00:43:53] I need a public IP
[00:43:54] (updated in rdns)
[00:44:03] this isn't a compute node
[00:44:07] those use private IPs
[00:44:24] ah yes
[00:46:31] all i want for christmas is another /22
[00:46:40] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[00:48:31] New patchset: Asher; "fix for varnish pkg testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1799
[00:49:06] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1799
[00:49:06] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1799
[00:49:52] How easy is it to ip squat? :D
[00:52:00] RECOVERY - DPKG on cp1042 is OK: All packages OK
[00:52:56] Reedy: impossible - everyone is using ip's plus all decent transit providers filter based on which ip blocks you own
[00:53:49] Shame
[00:56:30] RECOVERY - mobile traffic loggers on cp1042 is OK: PROCS OK: 2 processes with command name varnishncsa
[00:57:25] As I've said before... I know my old uni won't be using all 65k of the ips they own. They should be made to give up ranges they know they won't use
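Lcarr's change 1797, iterated above and merged later in the log, puppetizes ganglia and gangliaweb along with saving and restoring the rrd files between tmpfs and disk. The patch itself is not quoted here, so the following Puppet manifest is only an illustrative sketch of that general approach; the mount point, tmpfs size, helper-script name and cron interval are all assumptions, not the contents of the actual change.

```puppet
# Illustrative sketch only -- the real contents of change 1797 are not quoted
# in the log.  Paths, sizes, script names and timings are assumptions.
class ganglia::rrd_tmpfs {
    # Keep gmetad's rrd files on tmpfs so metric writes don't hammer the disk.
    mount { '/var/lib/ganglia/rrds':
        ensure  => mounted,
        device  => 'tmpfs',
        fstype  => 'tmpfs',
        options => 'noatime,size=2g',    # size is a guess
    }

    # Hypothetical helper that copies the tmpfs rrds to persistent storage;
    # the same script can be run at gmetad startup to import the saved rrds.
    file { '/usr/local/bin/save-ganglia-rrds':
        owner  => 'root',
        group  => 'root',
        mode   => '0755',
        source => 'puppet:///files/ganglia/save-ganglia-rrds',
    }

    # Flush to disk regularly so a reboot loses at most a few minutes of data.
    cron { 'save-ganglia-rrds':
        command => '/usr/local/bin/save-ganglia-rrds',
        user    => 'root',
        minute  => '*/15',
        require => File['/usr/local/bin/save-ganglia-rrds'],
    }
}
```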
[01:00:53] New patchset: Asher; "putting backend instances of new mobile varnish servers into production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1800
[01:02:01] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1800
[01:02:02] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1800
[01:26:28] !log backend varnish instance on cp1042 running 3.0.2 is in production for 1/3 of mobile requests
[01:26:31] Logged the message, Master
[01:28:44] !log puppet is being deliberately left broken on cp1043 and 1044 until tomorrow
[01:28:45] Logged the message, Master
[01:32:40] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797
[01:36:01] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1797
[01:36:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1797
[01:37:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797
[01:46:35] New patchset: Lcarr; "fixing type in ganglia.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1801
[01:46:50] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1801
[01:46:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1801
[01:52:30] now for the moment of truth....
[01:53:02] restarting nickel
[02:20:05] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[02:56:55] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 677s
[03:01:15] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 689s
[03:42:59] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Fri Jan 6 03:42:38 UTC 2012
[04:19:34] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:19:34] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:46:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[08:00:50] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 205 MB (2% inode=60%): /var/lib/ureadahead/debugfs 205 MB (2% inode=60%):
[08:10:40] RECOVERY - Disk space on srv222 is OK: DISK OK
[10:00:49] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 429355 MB (3% inode=99%):
[10:04:29] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 413604 MB (3% inode=99%):
[10:08:19] RECOVERY - MySQL slave status on es1004 is OK: OK:
[10:34:09] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours
[10:35:09] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours
[10:46:53] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours
[10:46:53] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours
[10:46:53] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours
[10:46:53] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours
[10:46:53] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours
[10:52:43] PROBLEM - Puppet freshness on niobium is CRITICAL: Puppet has not run in the last 10 hours
[10:53:43] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours
[10:54:43] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours
[12:29:31] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[12:46:00] PROBLEM - Puppet freshness on db1003 is CRITICAL: Puppet has not run in the last 10 hours
[13:21:30] ACKNOWLEDGEMENT - Host db43 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #2170
[13:24:00] ACKNOWLEDGEMENT - Host db41 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn was used for OWA, re-claimed by CT for miscellaneous tomfoolery
[13:25:00] nice
[13:30:54] ACKNOWLEDGEMENT - Host db19 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn http://rt.wikimedia.org/Ticket/Display.html?id=2034
[13:39:24] ACKNOWLEDGEMENT - Host dataset1 is DOWN: CRITICAL - Host Unreachable (208.80.152.166) daniel_zahn RT #1345
[13:52:14] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[14:06:02] New review: Dzahn; "reverting to use NFS path like before (for now)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1643
[14:06:03] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1643
[14:07:59] apergos: heh, just quoting ticket :)
[14:08:22] we need a hostname "tomfoolery" now
[14:08:25] it's required
[14:08:39] *gg*
[14:09:39] hrmm..noticing puppet freshness problems on ..it seems..random hosts
[14:09:58] joy
[14:10:13] Does running puppet by hand cause errors?
[14:10:58] trying on sq67
[14:11:05] yes, it does :/
[14:11:15] err: Failed to apply catalog: Could not find dependency Package[varnish3] for Mount[/var/lib/varnish] at /var/lib/git/operations/puppet/manifests/varnish.pp:40
[14:11:35] checking gerrit for changes..
[14:12:07] !change 1799
[14:12:07] https://gerrit.wikimedia.org/r/1799
[14:13:23] hmmm.. cant find package varnish3, seems unrelated though
[14:18:42] New patchset: Dzahn; "require Class varnish::packages, instead of Package[varnish3] in Mount, because this broke puppet fe. on sq67" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1802
[14:19:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1802
[14:28:01] New review: Dzahn; "fixed dependency problem ?!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1802
[14:28:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1802
[14:30:12] RECOVERY - Puppet freshness on sq67 is OK: puppet ran at Fri Jan 6 14:29:51 UTC 2012
[14:30:23] New review: Dzahn; "yes, it did. puppet runs again on sq67" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1802
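The failed-dependency error above and the fix Dzahn describes in change 1802 amount to pointing the Mount resource's require at the class that manages the varnish packages instead of at the Package['varnish3'] resource, which is not present in every catalog that ends up with the mount (sq67, for example). A rough sketch of the shape of that change follows; it is not the literal diff, and the mount options shown are assumptions.

```puppet
# Sketch of the idea behind change 1802, not the literal manifest.
# Previously the resource had:  require => Package['varnish3'],
# which fails catalog compilation on hosts (e.g. sq67) whose catalog
# contains this Mount but no Package['varnish3'] resource.
mount { '/var/lib/varnish':
    ensure  => mounted,
    device  => 'tmpfs',
    fstype  => 'tmpfs',
    options => 'noatime,defaults,size=512M',   # assumed options
    require => Class['varnish::packages'],     # depend on the class instead
}
```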
[14:32:42] RECOVERY - Puppet freshness on sq68 is OK: puppet ran at Fri Jan 6 14:32:27 UTC 2012
[14:33:12] RECOVERY - Puppet freshness on sq70 is OK: puppet ran at Fri Jan 6 14:33:10 UTC 2012
[14:38:02] !log saw the log about cp1043/44 being deliberately left broken, but requirement in varnish.pp also broke others, fixed on sq67,68,69 (gerrit change 1802)
[14:38:04] Logged the message, Master
[14:39:12] RECOVERY - Puppet freshness on sq69 is OK: puppet ran at Fri Jan 6 14:38:57 UTC 2012
[14:41:12] RECOVERY - Puppet freshness on arsenic is OK: puppet ran at Fri Jan 6 14:40:46 UTC 2012
[14:44:42] RECOVERY - Puppet freshness on cp3002 is OK: puppet ran at Fri Jan 6 14:44:33 UTC 2012
[14:46:42] RECOVERY - Puppet freshness on niobium is OK: puppet ran at Fri Jan 6 14:46:24 UTC 2012
[14:51:48] yep, those recovered for the same reason
[14:56:42] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Fri Jan 6 14:56:19 UTC 2012
[16:27:02] New patchset: Hashar; "WikipediaMobile: add direct link to latest nightly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1803
[16:27:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1803
[16:48:37] New patchset: Demon; "Fix my public key, was off-by-one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1804
[16:48:51] New patchset: Demon; "For the last time, fixing my public key. I swear this is it. (With a new comment too)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1805
[16:49:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1805
[16:50:10] New review: Demon; "Cherry picked from test: https://gerrit.wikimedia.org/r/#change,1219" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1805
[16:50:31] New review: Demon; "Stupid change, but was needed for the dependency to cleanly merge." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1804
[17:03:13] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Fri Jan 6 17:02:44 UTC 2012
[18:00:21] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1804
[18:00:22] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1804
[18:01:55] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1805
[18:01:56] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1805
[18:05:30] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1803
[18:05:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1803
[18:12:57] hey kids...some people are getting DB errors when connecting to commons...whatever host 10.0.6.32 is, production machines are still pointing to it even tho it's down
[18:13:22] woosters: ^^
[18:15:19] Yup
[18:15:22] We know
[18:15:25] db22 is down
[18:15:28] db22 is S4 master
[18:15:31] i am on mgmt right now..
[18:15:37] its Sun :p
[18:15:40] aah ok
[18:15:54] And I think S4 is centralauth also
[18:16:08] https://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?host=db22
[18:16:14] but db22 already had an existing ticket
[18:16:17] for failed RAID
[18:16:19] ouc
[18:16:20] h
[18:16:22] :|
[18:17:11] !rt 1136
[18:17:12] http://rt.wikimedia.org/Ticket/Display.html?id=1136
[18:17:45] Want me to ping asher?
[18:17:55] yes
[18:18:08] or RobH? thx
[18:18:19] i was thinking asher as it was db
[18:18:21] i can do both
[18:19:05] PROBLEM - Host db22 is DOWN: PING CRITICAL - Packet loss = 100%
[18:19:43] all...db22 shouldn't be down...i am working on a ticket to replace a HDD but it is hot swappable
[18:19:56] Ah
[18:20:00] ah
[18:20:01] cmjohnson1, seemingly not :(
[18:20:09] obviously...
[18:20:19] Asher should be inbound
[18:20:33] cmjohnson1: glad to hear you are working on it though!
[18:20:55] cmjohnson1: i thought it just failed completely, the second disk as well
[18:21:19] cmjohnson1: i am on Sun LOM of db22 right now, i wont do anything though
[18:21:25] Hey binasher, raid/disk issue, cmjohnson1 is/was changing a supposedly hot swappable disk, seemingly not
[18:21:46] no...my error
[18:21:48] .hi.
[18:22:15] can someone give a brief status?
[18:22:23] db22 is currently down and is s4 master.
[18:22:28] is it rebooting or still down?
[18:22:41] db22 had an existing ticket for harddisk / RAID failure
[18:23:16] which is the ticket i was working, 1136
[18:23:17] sounds like promoting a slave to be a new master is a good idea. binasher - agree?
[18:23:19] cmjohnson was working on that ticket to replace disk
[18:23:28] of course
[18:23:39] binasher: do you want to run it or shall I?
[18:23:52] I'm happy to if you're still in transit.
[18:25:05] Reedy: heya
[18:25:07] s4 has a sad amount of capacity
[18:25:35] Ohai
[18:25:42] Reedy: asher taking care of it?
[18:25:52] binasher: do you want point or should I take it?
[18:26:08] oh good, i dont have to do shit, huzzah!
[18:27:02] :) hi all
[18:27:21] maplebed: whatever you want to do is fine
[18:27:29] ok, I'll take point
[18:27:30] deploying db.php that puts s4 in read only
[18:27:46] mutante: you mind running coms?
[18:27:49] and puts db31 in the master position
[18:28:06] i'm not even sure what you mean by point :)
[18:28:23] binasher: to confirm, you are currently deploying the new db.php; I'll repoint db33 to slave off of db31.
[18:28:23] You get shot first :p
[18:28:37] maplebed: reporting to wikitech? sure. i am still on mgmt of db22. but not doing anything. should i logout?
[18:28:58] mutante: you're ok on db22.
[18:28:59] maplebed: you must also switch: s4-secondary.eqiad.wmnet is an alias for db1038.eqiad.wmnet.
[18:29:10] binasher: ok, will do.
[18:30:16] cmjohnson1: do you think db22 will be back up today?
[18:30:30] I am done...the drive has been swapped.
[18:30:36] mutante: expected recovery: 10min.
[18:30:45] I apologize...the instructions did state that it was hot swappable
[18:30:58] mutante: current effect: commons is read-only.
[18:31:13] cmjohnson1: it happens. its booting now though?
[18:31:51] for the record: master log file on db31 is db31-bin.000213, master_log_pos is 205612709
[18:31:55] PROBLEM - RAID on db23 is CRITICAL: CRITICAL: logical devices: 1 defunct
[18:32:04] so the raid had too many failed disks to hot swap?
[18:32:11] (i am sure I put in that it was hot swap ;)
[18:32:39] cmjohnson1, for future, might be worth !log'ing things like that :)
[18:33:00] reedy: definitely
[18:33:05] binasher: db33 is changed to slave off of db31. I will set read_only to false on db31 in 2min unless you object.
[18:33:09] maplebed: !log the hostname + binlog file + log pos of the new s4 master as it starts. toolserver will need that (and it will be good to have it recorded if db22 comes back up with data)
[18:33:09] yea, log every disk swap or shutdown
[18:33:25] currently changing slave status for db1038
[18:33:37] im interested to know why it died when the hot swap disk was removed.
[18:33:54] too many failed drives or improper raid configuration?
[18:34:01] !log new master for s4: db31, log file db31-bin.000213 log pos is 205612709
[18:34:03] Logged the message, Master
[18:34:48] !log old master for s4 log file db22-bin.000106 log pos 631618956
[18:34:49] Logged the message, Master
[18:36:04] maplebed: i'll change db1038 unless you're in the middle of it. just go for setting db31 rw and taking s4 out of ro in db.php
[18:36:14] binasher: just finished changing db1038
[18:36:24] currently setting db31 rw
[18:36:44] hmm
[18:36:49] on db1038: Seconds_Behind_Master: 8729495
[18:36:59] feck
[18:37:22] fascinating.
[18:37:27] it might be thoroughly busted.
[18:37:30] db33 looks fine though.
[18:37:37] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[18:38:06] and looks like they're pointing at the same positions
[18:38:15] !log db31 made read-write as the new master for s4
[18:38:16] Logged the message, Master
[18:38:16] hmm.. go ahead
[18:39:15] ok, now db1038 shows 0 sec behind, as soon as transactions were made
[18:39:49] maplebed: can you also update dns? (s4-master cname)
[18:39:56] sure, in a sec.
[18:41:10] !log pushed out new db.php setting s4 to read-write
[18:41:11] Logged the message, Master
[18:41:32] Reedy: would you test commons for me?
[18:42:27] RECOVERY - RAID on db23 is OK: OK: 1 logical device(s) checked
[18:43:47] maplebed, test edit is fine, recent changes looks good (activity again)
[18:43:48] commons looks good in dberror.log
[18:43:55] mutante: I believe everything is back online.
[18:44:04] thanks reedy.
[18:44:11] 36 minutes downtime it seems
[18:44:13] maplebed: sent a short summary to wikitech-l
[18:45:04] binasher: what I meant by 'take point' is that I've found it very beneficial to have a single person as the lead in an outage situation to coordinate activity and prevent duplicate action. assignment of that person is what I was talking about.
[18:45:46] got it, that went nice and smoothly
[18:45:47] there are all sorts of things that become easier when there's that single figurehead for the event, such as coordinating communications, making efficient use of people for investigations, etc.
[18:46:47] mutante: are you still on the db22 console? if so, is it actually booting?
[18:46:52] !log s4 database rotation complete. outage duration 36 minutes.
[18:46:53] Logged the message, Master
[18:47:17] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time
[18:47:24] the only s4 slave is also the lvm snapshot host, so it's going to be slow
[18:48:11] binasher: i am on the Sun LOM command line but dont see console output yet.. let me check
[18:49:21] Serial console started.
[18:49:52] but i dont see more
[18:50:01] start /SP/console , right
[18:50:11] updated dns to set s4-master to db31.
[18:52:42] binasher: i dont see any output
[18:53:01] cmjohnson1: still here?
[18:53:05] so is db22 fully out of the loop and can be recovered normally?
[18:53:13] cuz if so i wanna take a look at a few things before we bring it up
[18:53:15] RobH: yes it is.
[18:53:28] ok, can we hold for five minutes so i can poke at some stuff in mgmt?
[18:53:33] binasher yes
[18:53:38] imo, take your time.
[18:53:51] mutante: can ya hop off serial if yer on it?
[18:54:22] Connection to db22.mgmt closed.
[18:54:29] thank you =]
[18:54:32] binasher: db1038 is now caught up in replication. do you want to run any sort of checks on it to see if it's broken?
[18:54:33] yw:)
[18:55:11] thanks for doing comms, mutante.
[18:55:32] all thx for fixing my blunder!
[18:56:43] np, thanks to malafaya, he reported it first, 10 minutes before Nagios, heh
[18:56:53] cmjohnson1: db22 still needs to be brought up
[18:58:10] binasher: okay, i think robh is looking at a few things first
[18:58:28] maplebed: poking at db1038 but i think it's fine. replag reported normally as soon as db31 wrote to its binlog
[18:58:42] yea gimme a few more minutes and we can bring it back
[18:59:36] ok, I am done poking at the stuff in mgmt, mutante did you want to take back over or want me to just boot it?
[18:59:59] i dont mind, i wanna check a couple things after boot
[19:01:38] no answer assumes compliance? (maplebed ?)
[19:01:43] yer heading the outage, yer call
[19:01:50] he's away from the desk right now
[19:02:06] RobH: eh, yeah, please go ahead, i was distracted
[19:02:24] ok, so all the other slaves are going to new master and this is out of the loop right?
[19:02:55] right
[19:02:56] RobH: just boot please, real life calls me..thx
[19:03:16] heh
[19:03:18] well timed
[19:03:20] RobH: sorry, stepped away for a minute. thought I declared the outage over.
[19:03:29] oh, ok
[19:03:48] db22 needs to be reattached to the mysql tree sometime, but it's an hours-to-a-day timeframe, not minutes-to-an-hour.
[19:04:02] but thanks for checking.
[19:04:03] :)
[19:04:05] ok, attempting to bring it up
[19:04:49] robh: let me know if the disk swap fixed the original problem
[19:06:12] RobH: yeah, i already sent mail to wikitech saying its over
[19:06:32] RobH: so i dont think we need to rush on db22 now..
[19:06:33] right?
[19:06:56] k
[19:07:21] i lost the prompt before by reading backlog of errors, had to restart the boot, so glad its not time restricted ;]
[19:07:48] yeah, seems like we are fine for now. at least no more user reports
[19:07:50] nearly all commons db queries are going to a single host that is rather slow thanks to running two lvm snapshots
[19:08:07] its not "we're down" urgent, but it's still pretty urgent
[19:09:20] it appears that only 8 of the 12 drives are online in the disk array
[19:09:23] timeframe of hours is still ok though
[19:09:37] That sounds somewhat suspect
[19:09:41] i think i know what caused it
[19:09:45] i am just confirming things
[19:11:18] i think i'll build db51 as a new s4 slave today
[19:17:07] cool. outage response worked pretty good though, didnt it? thanks everyone, heading out for dinner then, cu
[19:17:18] cyas
[19:17:37] seeya, thanks mutante
[19:17:40] cya: thx
[19:18:45] binasher: so the raid array is degraded to the point of non-use. Too many disks fell out within the same mirror sets.
[19:18:54] I think that all disks are working
[19:19:06] but would need to redo the array config and start over
[19:19:32] means you have a full recovery to do though
[19:19:45] the alternative is booting via rescue and trying to force it online
[19:19:51] ok, in that case - we might as well do that for the sake of future use of the machine if you think its worth it (its always good to have spare db hardware)
[19:19:55] which is kind of a pain in the ass
[19:20:12] wipe and rebuild would probably be better
[19:20:38] ok, I will go ahead and rebuild a fresh raid10 array and get the OS load started, its fully automated for the dbs.
[19:21:37] i'm going to build db51 as an s4 slave today, db22 can be added as an s4 next week if the os reload goes ok. s4 needs one more slave than it had anyways, and was going to get one from that last hw order
[19:24:31] New patchset: Asher; "db51 -> s4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1806
[19:24:46] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1806
[19:24:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1806
[19:24:56] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1806
[19:24:56] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1806
[19:28:12] !log started innodb hot backup of db1038 to db51
[19:28:13] Logged the message, Master
[19:30:14] cmjohnson1: ok, i am working db22, which drive did you swap?
[19:30:23] sorry if you said before, but i have it in front of me now
[19:30:35] (which drive did you id as the serial from the ticket that is)
[19:30:50] the slot #s are written on the top of the case, shows the layout
[19:30:57] the ticket was 3PD1TR3R
[19:31:03] so you can usually look at top of one in a rack
[19:31:12] since we have db racks
[19:31:29] right, but I need to know what slot it was in
[19:31:36] there are 16 drive slots
[19:32:23] if you arent sure just lemme know that
[19:34:05] i am not sure of the drive slot...it was 3 rows below the fan
[19:34:26] ok, so you know the physical drive slot, just not how they are numbered?
[19:34:34] we can figure that out if so
[19:35:02] yes
[19:35:49] ok, so if you snag the stepstool or whatever, you can go over to the row A database rack
[19:35:56] and look on the top cover of the top db
[19:36:01] it has the drive layout on the top cover
[19:36:11] or any rack where the top server is a 4240
[19:36:36] just need to know the # of the slot, and the diagram will have that. if you cannot read it off anything, we can shutdown db22 and pull it for that diagram
[19:36:41] but usually easier to read off top of another
[19:41:42] RobH: I made an eqiad ticket for you yesterday; just for scheduling (not urgent), do you have an idea when you'll be onsite to do it?
[19:41:46] !rt2220
[19:41:50] im onsite now =]
[19:42:15] yea i planned to do this today, didnt even know the ticket was there yet
[19:42:23] i need to fix it and the two cp servers i offlined to do it
[19:42:27] will happen today
[19:43:04] cool. thanks!
[19:43:14] welcome
[19:46:25] robh: it was drive slot 4 (last on 2nd Col)
[19:47:54] all disks back in place?
[19:48:02] yes
[19:48:04] ok, rescanning
[20:04:49] LeslieCarr, is ganglia on nickel fully set up? as all servers seem to be showing in all groups
[20:05:24] yeah, trying to figure that problem out :(
[20:05:32] if you have any ideas..
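For background on the symptom described here (every server showing up under every group): gangliaweb's grouping comes from the cluster name each gmond reports into and the data_source entries gmetad polls, and a common cause of hosts appearing in every group is several clusters sharing an aggregator or multicast channel. The log does not show the actual configuration, so the following Puppet-managed gmetad.conf is purely an illustrative sketch with invented cluster names, hosts and ports, not the real setup on nickel.

```puppet
# Illustrative only: cluster names, hosts and ports are invented.  The point
# is that each cluster gets its own data_source line (and usually its own
# port or multicast channel), which is what lets gmetad keep groups apart.
service { 'gmetad':
    ensure => running,
}

file { '/etc/ganglia/gmetad.conf':
    owner   => 'root',
    group   => 'root',
    mode    => '0444',
    notify  => Service['gmetad'],
    content => "gridname \"Example Grid\"
data_source \"Application servers\" 15 srv190.example.org:8649 srv191.example.org:8649
data_source \"MySQL\" 15 db31.example.org:8650 db33.example.org:8650
",
}
```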
[20:07:31] !log db22 reinstalling
[20:07:32] Logged the message, RobH
[20:08:45] Should there be data there for each machine too?
[20:13:37] ok, once db22 gets to the partitioning and passes in installer, i am going afk for a short break (5 minutes)
[20:13:45] then will be back to work on eqiad open issues
[20:23:08] !log db22 reinstalled and booting into OS. No puppet runs yet, now its Asher's problem ;]
[20:23:09] Logged the message, RobH
[20:25:03] !log rt2226 - redeploy db22 for asher
[20:25:04] Logged the message, RobH
[20:25:26] afk, back shortly.
[20:27:24] RECOVERY - Host db22 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms
[20:44:44] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours
[20:46:58] PROBLEM - MySQL disk space on db22 is CRITICAL: Connection refused by host
[20:52:08] PROBLEM - DPKG on db22 is CRITICAL: Connection refused by host
[20:53:47] I have made a sun java package that will successfully install via dpkg
[20:53:48] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours
[20:53:58] PROBLEM - Disk space on db22 is CRITICAL: Connection refused by host
[20:56:06] notpeter, I think it's weekend time now :p
[20:57:20] that sucked soooo hard...
[20:57:28] no no, now I get to put it in the repo! yay!
[20:57:31] haha
[20:57:39] I was suggesting that was enough pain for the week
[20:57:46] but if you wanna carry on ;)
[20:57:50] hahaha
[20:58:21] man, so there's an app for debian/ubuntu that takes the sun java binary installers
[20:58:23] and packages them
[20:58:30] called, creatively, java-package
[20:58:34] nice
[20:58:39] unfortunately, it's extremely poorly supported
[20:58:50] so I hacked out large portions of it, and then it ran!
[20:58:51] yay!
[20:58:58] now to see if I hacked out anything important...
[20:59:17] like, I broke all of the java browser support... but I don't think that that's going to matter
[21:00:45] !log started slaving db51 off of db31
[21:00:46] Logged the message, Master
[21:04:19] binasher: saw i ticketed db22 to you?
[21:04:31] yup, thanks!
[21:05:54] i should have db22 back in prod by the end of the day
[21:06:02] but still going to put db51 in s4 too
[21:06:58] RECOVERY - MySQL disk space on db22 is OK: DISK OK
[21:08:56] !log putting db51 into production as an s4 slave
[21:08:57] Logged the message, Master
[21:11:58] RECOVERY - DPKG on db22 is OK: All packages OK
[21:13:48] RECOVERY - Disk space on db22 is OK: DISK OK
[21:13:58] New patchset: Asher; "upgrading db22 to new pkgs / fully puppetized config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1807
[21:13:58] RECOVERY - RAID on db22 is OK: OK: 1 logical device(s) checked
[21:15:43] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1807
[21:15:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1807
[21:18:04] binasher: I will get that ssd installed for your testing in a bit
[21:18:48] RobH: right on
[21:21:44] !log es1002 back ready for service use per #2220: replace original RAID card in es1002
[21:21:45] Logged the message, RobH
[21:21:47] maplebed: ^
[21:22:16] woo!!!
[21:22:22] now more boring work for me!
[21:22:26] :P
[21:22:48] !log restoring db22 from a live hotbackup of db1038
[21:22:49] Logged the message, Master
[21:27:57] ryan_lane: howdy
[21:28:42] drdee: howdy
[21:28:43] sup?
[21:29:24] ryan_lane: not much :)
[21:29:46] !log cp1014 and cp1019 hdd controller cables replaced (removed for testing controllers), both can be used normally
[21:29:47] Logged the message, RobH
[21:30:08] ryan_lane: would you have a spare IP address for me? the pageviews virtual instance is ready and Erik M. would like to have it running in a slightly more open manner for experimentation
[21:30:21] i meant spare public IP address
[21:32:13] I think so
[21:32:27] let's talk in -labs
[21:32:28] PROBLEM - Host es1002 is DOWN: PING CRITICAL - Packet loss = 100%
[21:37:32] binasher: so i can shutdown db1029?
[21:37:53] looks like its doing nothing but sitting there
[21:38:20] yea, not running mysql, shutting down
[21:38:30] !log db1029 coming down for ssd testing
[21:38:31] Logged the message, Master
[21:44:25] !log db1029 powering back up with ssd testing hardware installed
[21:44:26] Logged the message, RobH
[21:44:29] binasher: ^ all yers
[21:46:28] RECOVERY - Host es1002 is UP: PING OK - Packet loss = 0%, RTA = 30.86 ms
[21:46:28] PROBLEM - Host db1029 is DOWN: PING CRITICAL - Packet loss = 100%
[21:47:28] RECOVERY - Host db1029 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[22:16:40] New patchset: Lcarr; "removed statically spec'ed cluster from nickel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1808
[22:16:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1808
[22:17:02] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1808
[22:17:03] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1808
[22:19:12] New patchset: Ryan Lane; "Enabling ipv6 proxy on ssl3001 to re-enable upload ipv6 proxy." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1809
[22:19:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1809
[22:19:37] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1809
[22:19:38] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1809
[22:32:14] LeslieCarr: you about?
[22:32:32] RobH: yeah
[22:32:37] i need the info on what you need connected
[22:32:39] sorry work keeps distracting me
[22:32:44] lemme get you the ticket
[22:32:44] i see https://rt.wikimedia.org/Ticket/Display.html?id=2219 in network for you
[22:32:47] but nothing else
[22:33:06] also have this http://rt.wikimedia.org/Ticket/Display.html?id=1882 for lab switch, is this now invalid since we shipping it back?
[22:33:13] or should it stall until it comes back?
[22:33:27] http://rt.wikimedia.org/Ticket/Display.html?id=1207
[22:33:53] the 1882 ticket we should hold on until the new labs switch is back here
[22:34:40] so am i relocating existing fiber runs or adding new ones?
[22:35:05] adding new ones
[22:36:30] hey RobH es1002 was caught in a reinstall loop. I checked the bios and it had network listed before disk in boot order. I switched it, but I'm curious how that happened and whether it has any relation to our failures to get it to load into a bunch of raid-0 disks.
[22:37:01] (or if I'm totally confused and it's supposed to be network first, for some reason I can't fathom)
[22:37:08] LeslieCarr: can you happen to tell me a cable that is cr1 to cr2
[22:37:14] just so i can find it and see its length
[22:37:19] a fiber that is
[22:37:38] maplebed: it happened cuz the controller was removed and added back
[22:37:39] ok, 1 sec
[22:37:42] and i forgot to confirm that in bios is all
[22:37:51] from trying to boot with the r610 controller
[22:37:54] xe-5/2/0
[22:38:06] RobH: ok. so it's new from the perc card, and wasn't the case for the previous card we were trying?
[22:38:32] in which case it's probably unrelated, since we tried the raid-0 thing before swapping cards.
[22:38:42] it remembers the raid0 stuff
[22:38:50] but the bios was set to boot from the r610 controller
[22:38:59] so when it didnt detect the raid controller, bios removed it from the boot order
[22:39:04] now that its back, it defaults to last
[22:39:07] or not at all
[22:39:09] gotcha.
[22:39:11] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[22:39:13] if you moved it and saved it, should be good
[22:39:18] i just forgot to do that for ya, slipped my mind
[22:39:19] ok, so setting it back to the controller first was the right thing.
[22:39:21] yep
[22:39:24] and in fact it's now booted into an OS.
[22:39:26] \o/
[22:39:50] LeslieCarr: that works fine, thanks! getting the fibers now for the runs. one is short in rack, cr1 to psw1, but cr2 to psw2 is a1 to a8
[22:40:21] RECOVERY - SSH on es1002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:43:06] LeslieCarr: fyi: the cr1-eqiad - 5/2/3 to psw1-eqiad xe-0/1/0 will be an sfp attached cable
[22:43:14] rather than a fiber, easier for in rack stuff, shouldnt matter to you
[22:43:17] but wanted to let you know
[22:43:28] (i think they are marked differently in port labels in software?)
[22:43:38] okay
[22:43:39] cool
[22:44:05] you can see that sfp-t's are different than sfp-sx or lx
[22:44:14] sfp-t's being the technical term for a copper sfp
[22:46:46] cool
[22:48:58] LeslieCarr: so i am not sure where you want this on psw1
[22:49:08] 0/1/0 sounds like port 1 on the main 24 ports
[22:49:30] is that right? (or does this plug into the sfp port addon?)
[22:49:41] i was thinking into the sfp port addon
[22:49:49] ok, cool
[22:49:56] but if it's copper on one side you could also plug it into the last port
[22:49:56] 1m fits for that, not the other
[22:50:02] okay
[22:50:05] the last copper port
[22:50:11] the addon is fine
[22:50:13] you dont need to move it
[22:50:16] cool
[22:50:18] thats the side i was routing to
[22:50:23] so the first of the two 10g?
[22:50:32] or is this in the 1g?
[22:51:16] cuz i think its 0/1/0 through 0/1/3
[22:51:31] oh wait a sec
[22:51:31] and 0/1/0 and 0/1/2 are 10g,
[22:51:40] we need to use the 10g
[22:51:42] so copper won't work
[22:51:53] aww =P
[22:51:53] sorry, brain fart - it's 10g cards on the other side
[22:51:55] yeah
[22:51:58] :(
[22:52:17] hrmm, lemme track down a fiber, we may need to use a longer one today and replace with proper one later
[22:52:21] brb, checking
[22:52:27] okay
[22:52:38] frick
[22:52:40] fruck adfawefdsfvawd
[22:52:41] if you need more fiber please put in an order for some more (extra fiber is always useful imho)
[22:52:44] i just pulled the wrong cable
[22:52:53] pulled 5/3/3
[22:53:02] then realized it and clicked it back in, was out a mm
[22:53:06] but i saw the lights hop.
[22:53:09] LeslieCarr: ^
[22:53:16] oh
[22:53:19] yeah
[22:53:27] you broke the site!! :-D
[22:53:33] i didnt just have an hour long recap of the outage with chris or anything...
[22:53:36] about being more careful.
[22:53:37] you only pulled the 2nd worst cable
[22:53:41] RECOVERY - Puppet freshness on es1002 is OK: puppet ran at Fri Jan 6 22:53:38 UTC 2012
[22:53:41] it could have been the tinet one
[22:53:43] just kidding. have a good day plugging cables :-) see you next week.
[22:54:00] which one was this one?
[22:54:04] HE peering
[22:54:06] ohh, hurricane
[22:54:08] yea
[22:54:09] yeah
[22:54:11] oh well =P
[22:54:13] hehehe
[22:54:15] shit happens
[22:54:26] maybe you can label the cables on both sides with the peering partner name?
[22:54:30] someone tell chris when he shows back up, it will cheer him up
[22:54:47] this wasnt a partner cable
[22:54:50] was core to peering router
[22:54:58] so if we had more than one peer, it would have taken them all out
[22:55:02] but we have just the lonely single peer.
[22:55:11] PROBLEM - Puppet freshness on db1003 is CRITICAL: Puppet has not run in the last 10 hours
[22:55:22] :)
[22:59:04] !log db22 is back in s4
[22:59:05] Logged the message, Master
[22:59:30] LeslieCarr: so these need to be 10g, so
[22:59:36] cr1-eqiad - 5/2/3 to psw1-eqiad xe-0/1/0
[22:59:43] cr2-eqiad - 5/2/3 to psw2-eqiad xe-0/1/1 to cr2-eqiad - 5/2/3 to psw2-eqiad xe-0/1/2 ?
[22:59:45] right?
[22:59:57] cuz 0/1/1 is not 10g
[23:00:55] LeslieCarr: can you confirm if the sfp i just plugged into cr1 5/2/3 is right?
[23:01:07] cuz i just have a bunch of them in psw2
[23:01:15] so took it from there, its where we put the spares
[23:01:32] ok
[23:02:11] haha it's showing up as "unknown" ;)
[23:02:25] lemme remove the disable
[23:02:25] well it doesnt match the sfps in the core
[23:02:31] but dunno
[23:02:41] try another pull and replug, just in case
[23:02:48] third party sfp's can be finicky
[23:03:01] swapped
[23:03:11] ahha, it does say now unknown 10GBASE optic
[23:03:15] so looks like it is 10g :)
[23:03:38] ok, that was my concern, so hooking it up to psw1 now
[23:03:53] sfp installed in psw1
[23:03:58] if ya wanna check it works?
[23:04:47] i do but have a friend helping me debug the gangliaweb issues
[23:05:00] ok, should i just connect it all and then you can check?
[23:05:18] i assume i need to change cr2-eqiad - 5/2/3 to psw2-eqiad xe-0/1/1
[23:05:27] cool
[23:05:30] actually
[23:05:32] urgh
[23:05:35] psw2 is ex4500
[23:05:43] RECOVERY - NTP on es1002 is OK: NTP OK: Offset -0.02113974094 secs
[23:05:44] ?
[23:05:51] there isnt a 0/1/1
[23:05:56] hehe
[23:06:01] its only 0/0/X
[23:06:02] what is the 10g port called ?
[23:06:03] ok
[23:06:11] its all the same port as hurricane
[23:06:15] i assume they can go 10g
[23:06:36] just dont pick 0/0/1
[23:06:48] the sfp in 0/0/0 needs to have its lever flipped to the shut position
[23:06:52] as its blocking the port below it
[23:06:59] but one must remove the fiber for that, so not now.
[23:07:10] higher ports are better, right side of switch
[23:07:29] I'll run the fiber in the trays and leave unattached for now
[23:07:35] okay
[23:07:41] just attach to whichever is easiest
[23:07:44] and tell me what you used
[23:08:10] ok, will update ticket and assign back to you
[23:08:51] cool
[23:08:54] thanks
[23:22:53] LeslieCarr: http://rt.wikimedia.org/Ticket/Display.html?id=1207 is updated with ports attached
[23:23:00] thank you RobH
[23:25:27] !log working rt1549 lvs1003 may flap, it is presently not in service due to possible hdd failure
[23:25:28] Logged the message, RobH
[23:31:37] ok, have not come up for air for 3 hours, taking short break, back in 5 to 10.
[23:31:43] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 85, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/2/3: down - psw1-eqiad:xe-0/1/0 (cable #TBD)
[23:33:17] New patchset: Lcarr; "adding in new ganglia site for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1810
[23:39:18] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1810
[23:39:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1810
[23:41:33] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 88, down: 0, dormant: 0, excluded: 0, unused: 0
[23:47:03] RECOVERY - RAID on es1002 is OK: OK: State is Optimal, checked 2 logical device(s)
[23:47:24] lvs1003 is all kinds of messed up
[23:49:03] RECOVERY - MySQL disk space on es1002 is OK: DISK OK
[23:49:20] nickel is about to be all kinds of messed up if i can't get ganglia to work!
[23:55:13] RECOVERY - DPKG on es1002 is OK: All packages OK
[23:56:33] RECOVERY - Disk space on es1002 is OK: DISK OK