[00:16:23] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 5.709 seconds
[00:20:44] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:22:05] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 6.459 seconds
[00:22:32] @replag
[00:22:34] Krinkle: [s1] db36: 1s, db32: 1s, db59: 1s, db60: 1s, db12: 1s; [s5] db44: 1s
[00:26:35] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:37:59] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 2.717 seconds
[00:39:00] Do we know more about bugzilla's issues by now? I haven't heard any news on engineering-l nor here.
[00:40:55] there's a thread on ops@ about it
[00:41:02] ah
[00:41:32] I'm not on there, was surprised nobody responded to siebrand's mail
[00:42:08] maplebed reported "something is bloating memory on bugzilla (see ganglia graph for kaulen[1]), causing it to tip into swap death and require a reboot."
[00:42:18] but I've not seen it pinpointed to an exact cause
[00:42:55] kaulen has been having issues all day (nagios-wm has been reporting `PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds` all day)
[00:43:24] but it recovers, and then times out again a few minutes later
[00:43:54] ping Ryan_Lane notpeter - if either of you is around, would be good if you could take a look
[00:43:54] I didn't know bugzilla ran on kaulen. now that makes sense
[00:44:11] http://wikitech.wikimedia.org/view/Kaulen is actually up to date.
[00:44:21] :p
[00:45:20] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:46:34] I'll send something out to wikitech-l.
[00:48:02] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.008 seconds
[00:52:32] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:08:08] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[01:08:08] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours
[01:08:08] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[01:08:08] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[01:42:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 279 seconds
[01:44:59] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds
[01:50:14] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[02:17:14] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[02:28:38] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.663 seconds
[02:33:48] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:41:09] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 8.644 seconds
[02:48:21] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:55:33] RECOVERY - Puppet freshness on dobson is OK: puppet ran at Sun May 27 02:55:18 UTC 2012
[03:10:24] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 8.777 seconds
[03:14:45] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:23:18] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.006 seconds
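
A minimal sketch of how the memory bloat maplebed reports could be confirmed on a host like kaulen, using standard procps tools (not commands taken from the log):

    free -m                            # how deep into swap the box is
    ps aux --sort=-rss | head -n 15    # biggest resident processes; bloated apache2 children would show here
    vmstat 5                           # si/so columns show ongoing swap-in/swap-out ("swap death")
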
[03:27:47] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:38] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.017 seconds
[03:38:08] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:39:29] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.560 seconds
[03:43:59] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:49:50] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 4.155 seconds
[03:55:50] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:01:41] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 3.855 seconds
[04:15:15] is anyone on the console for kaulen?
[04:20:07] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[04:27:07] !log kaulen - temporarily disabled swap and set oom_adj score to 15 for apache
[04:27:11] Logged the message, Master
[04:39:22] New patchset: Asher; "block bot / spider useragents from accessing bugzilla" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9010
[04:39:43] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9010
[04:39:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9010
[04:39:45] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9010
[04:42:55] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[04:49:58] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[04:52:58] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours
[04:52:58] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[04:58:58] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[04:58:58] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours
[04:58:58] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
[05:10:51] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[05:51:57] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:53:54] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[07:32:10] PROBLEM - Puppet freshness on kaulen is CRITICAL: Puppet has not run in the last 10 hours
[11:09:29] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[11:09:29] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[11:09:29] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours
[11:09:29] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[11:29:04] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 185 seconds
[11:29:04] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 185 seconds
[11:30:24] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds
[11:30:33] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds
[11:50:48] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
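
The 04:27 !log entry above presumably corresponds to something like the following; a sketch only, using the legacy oom_adj interface where a score of 15 makes a process the preferred OOM-kill target (the exact commands run are not in the log):

    # as root on kaulen
    swapoff -a                          # stop the box from thrashing in swap; it will OOM instead
    for pid in $(pgrep apache2); do
        echo 15 > /proc/$pid/oom_adj    # sacrifice apache workers first if memory runs out
    done
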
[12:12:51] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[12:18:06] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[12:33:41] New patchset: Reedy; "Bump pcre.recursion_limit upto 5000 as part of bug 36839" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9012
[12:34:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9012
[12:56:03] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours
[13:03:24] PROBLEM - Host db51 is DOWN: PING CRITICAL - Packet loss = 100%
[13:04:45] PROBLEM - MySQL Replication Heartbeat on db1038 is CRITICAL: CRIT replication delay 227 seconds
[13:05:12] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 253 seconds
[13:05:12] PROBLEM - MySQL Replication Heartbeat on db22 is CRITICAL: CRIT replication delay 253 seconds
[13:05:21] PROBLEM - MySQL Replication Heartbeat on db1004 is CRITICAL: CRIT replication delay 262 seconds
[13:05:30] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 271 seconds
[13:05:30] PROBLEM - MySQL Slave Delay on db1038 is CRITICAL: CRIT replication delay 272 seconds
[13:05:39] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 281 seconds
[13:05:44] this appears to be a bad thing
[13:05:48] PROBLEM - MySQL Slave Delay on db22 is CRITICAL: CRIT replication delay 289 seconds
[13:07:09] RECOVERY - MySQL Slave Delay on db22 is OK: OK replication delay NULL seconds
[13:08:03] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay NULL seconds
[13:08:21] RECOVERY - MySQL Slave Delay on db1038 is OK: OK replication delay NULL seconds
[13:12:33] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - DB
[13:13:08] (Can't contact the database server: Unknown error (10.0.6.61))
[13:13:09] - commons.wikimedia.org
[13:13:17] @info 10.0.6.61
[13:13:17] Krinkle: [10.0.6.61: s4] db51
[13:13:29] @replag
[13:13:31] Krinkle: [s5] db44: 1s; [s7] db26: 1s
[13:13:36] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - DB
[13:14:06] Commons just went down. or so it appears? "Sorry! This site is experiencing technical difficulties." tried several refreshes
[13:14:24] See #wikimedia-tech
[13:14:31] viewing works, but edit pages are consistently failing
[13:14:34] Yes
[13:14:40] Yeah, master
[13:14:43] I thought that'd be it
[13:14:45] edit uses master
[13:15:11] Ops are notified
[13:15:33] oh god, it's Sunday too...
[13:15:52] It's a lovely sunny sunday too so they're probably drunk
[13:16:11] yes, and I was around odd times yesterday. *and* I just sat down to reconnect an unplugged cable and think about eating lunch (it's 4 pm here)
[13:16:21] so I can't ssh in
[13:16:54] looking at IRC RC seems to contradict the info here
[13:17:14] can bots edit?
[13:17:16] good morning.
[13:17:27] I just took it out of db.php
[13:17:48] Reedy paged me for s4 being down. I assume that's what you're working on? how can I help?
[13:17:58] a few bots though
[13:18:07] just see #en.wikipedia on irc.wikimedia.org :)
[13:18:09] do you know how to do a master switch?
[13:18:13] I do.
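
A sketch of the first checks one would run from fenari at this point, mirroring what Tim and apergos describe just below (host and slave names as they appear in the log; the loop is illustrative, not a transcript):

    ping -c 3 db51                          # nagios already reports 100% packet loss
    mysql -h db51 -e 'SELECT 1'             # does mysqld answer at all?
    for s in db22 db33 db1038; do           # the other s4 slaves named later in the log
        mysql -h "$s" -e 'SHOW SLAVE STATUS\G' | grep -E 'Master_Host|Seconds_Behind_Master'
    done
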
[13:18:27] db51 is rachable from mgmt and shows low load, very responsive
[13:18:31] *reachable
[13:18:47] ganglia shows it just went completely idle
[13:18:51] you can take over then
[13:19:08] I had just gone to bed and would be happy to get back in it
[13:19:22] You mean you got out of bed? :D
[13:19:23] ping to fenari (by ip) fails
[13:19:26] http://wikitech.wikimedia.org/view/Master_switch
[13:19:27] get back in it
[13:19:34] Thanks Tim
[13:20:12] trying a random host (i.e. 4.2.2.2) also fails
[13:21:21] can someone give me a summary of what's happened and what symptoms have been seen?
[13:21:54] last thing I see in the logs about the interface is eth0 up and ready.
[13:22:12] from the irc logs, it seems it just started to give "can't connect to db" errors
[13:22:16] I can summarize what I know about db51 having looked at it for 5 mins; someone else has to tell you about irc
[13:22:19] and what people saw.
[13:22:39] Time just took db51 out of the config and set it to read-only and was going to do a master switch as the wiki was failing to connect to the master.
[13:22:43] s/Time/Tim/
[13:22:46] Yeah, Can't contact the database server: Unknown error (10.0.6.61) across the cluster (due to globalusage etc)
[13:22:56] Dunno if it's actually dead or the wiki network to db is just goofed
[13:22:57] Angela told me the s4 master is down, so I got out of bed, logged on to fenari, identified the s4 master and tried to connect to it with a mysql client
[13:23:12] $ mysql -h db51
[13:23:12] ERROR 2003 (HY000): Can't connect to MySQL server on 'db51' (110)
[13:23:25] silly question, is mysqld up there?
[13:23:32] so I took that as confirmed, commented it out of db.php and reduced the load on the next server down to 0
[13:23:53] and simultaneously switched to read-only mode so that page views would work more reliably
[13:24:34] TimStarling: did you change the replication chain at all on the s4 slaves or only touch db.php?
[13:24:41] only db.php
[13:25:04] argh
[13:25:10] ok
[13:25:11] mysql> select @@read_only;
[13:25:14] | 0 |
[13:25:32] looking at this channel's logs:
[13:25:32] 15:03:03 PROBLEM - Host db51 is DOWN: PING CRITICAL - Packet loss = 100%
[13:25:46] !log on db31 set global read_only=1
[13:25:49] Logged the message, Master
[13:25:58] followed by a number of "MySQL Replication Heartbeat: replication delay" errors
[13:26:02] ok now I have also set read_only mode on db31
[13:26:07] which would be the slaves
[13:26:20] read_only should never be zero on a slave, this is bad maintenance
[13:26:35] ok, this swap will be easy. all slaves are at the same position on the dead master.
[13:26:36] pretty much, this points to db31 as the one to be promoted to master
[13:27:22] you'll have to check for any binlog entries on db31 which were committed after I switched the configuration
[13:27:30] they could cause replication to fail
[13:27:42] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[13:27:47] wouldn't that be fixed if db31 were the new master?
[13:27:47] you'll have to start replication from before the point in the binlog that represents the configuration change
[13:28:16] damn. I haven't done that in ages. looking now.
[13:28:45] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[13:28:46] at one time, master switches due to downtime were relatively common :)
[13:29:08] maybe it is set up with multimaster replication or something?
[13:29:49] yeah, log_slave_updates=1
[13:29:50] I don't think so.
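
A sketch of the two checks just discussed: forcing read_only back on for the candidate master, and verifying that every slave stopped at the same position on the dead master (illustrative commands; the log only quotes the db31 read_only step):

    mysql -h db31 -e 'SET GLOBAL read_only = 1'
    for s in db31 db22 db33 db1038; do
        echo "== $s"
        mysql -h "$s" -e 'SHOW SLAVE STATUS\G' | grep -E 'Relay_Master_Log_File|Exec_Master_Log_Pos'
    done
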
[13:30:16] I mean, it is set up that way, but inserts to multiple masters isn't allowed.
[13:31:09] there's nothing replicating from db31 at the moment
[13:31:39] I'm comparing the incoming relay binlog to db31's binlog to see if any additional queries hit it.
[13:36:14] I see queries: SiteStatsUpdate, GlobalUsage
[13:36:16] ok, there are additional updates to db31 that aren't present in db51's master log.
[13:36:55] Tim, do you see the deleteTitleProtection Vaxunxula entry?
[13:37:04] after 13:26 it's just heartbeat queries
[13:37:13] I assume they can be ignored
[13:37:18] that happens quite commonly, some changes didn't go to the slaves, but succeeded on the master
[13:37:38] the heartbeat queries can be ignored.
[13:38:10] yes, I see that deleteTitleProtection query
[13:39:32] so we have two choices - use a different master (which will blow away all changes while db31 was master) or use db31 and try and recreate the changes (or do complete reslaves) on the rest of the hosts. Opinions?
[13:40:14] how many changes are we talking roughly?
[13:40:28] use db31 and start replication from some point before db.php was changed but after db51 went down
[13:40:40] ah right.
[13:40:45] sorry, not thinking quickly yet.
[13:40:55] it's simple enough, it's risky to try to lose the changes
[13:40:58] 583444038 is the ID in db31's log file.
[13:41:09] (a point after the crash and before the new changes)
[13:41:51] the new changes begin around 583601469
[13:42:28] yah, that number sounds reasonable.
[13:44:45] the earliest is about 583315125
[13:45:27] ok, before I hit go, any objections to setting the other slaves to replicate from db31 starting from 583315125?
[13:45:42] I've checked that position, I'm fine with it
[13:45:59] ok.
[13:46:03] going forward then.
[13:49:51] ran change master to MASTER_HOST="db31.pmtpa.wmnet", master_log_file="db31-bin.000334", master_log_pos=583315125; on db22; now starting slave.
[13:50:33] db33 is also a s4 slave
[13:50:41] it looks like it's ok; seconds_behind_master is dropping, no errors showing up.
[13:50:50] ok, caught up.
[13:50:53] continuing on db33
[13:52:13] ok, next up db1038
[13:53:09] silly question, is mysqld up there?
[13:53:43] ok, I think the tree is repaired.
[13:54:04] based on what apergos said, mysqld is probably running on db51, it's just its network link that failed
[13:54:17] setting read_only to off on db31
[13:54:27] ok
[13:54:31] it's running all right
[13:54:35] I supposed it was still running, looking at ganglia
[13:54:44] sorry, I am trying to find out (and failing) what switch it's connected to
[13:54:51] would have been too easy
[13:54:52] so I can have a look at the interface
[13:54:55] ok, read only is off.
[13:55:22] TimStarling: do you agree we're ready to set s4 to rw in db.php and push it out to the cluster?
[13:55:40] did you RESET SLAVE on db31?
[13:55:58] I ran 'change master to master_host=""'
[13:55:59] it's not necessary to do RESET SLAVE
[13:56:14] I read it from Master_switch :(
[13:56:18] hmmm
[13:56:24] actually maybe you are right
[13:56:37] no, it's not necessary.
[13:56:38] or at least, maybe whoever wrote that document (possibly me) was right
[13:56:42] it cycles the log files.
[13:57:00] RESET SLAVE will stop it from replicating lost transactions from db51 if db51 comes back up
[13:57:01] which makes stuff easier (because log position is 0) but it's not required.
[13:57:13] hm..
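
Pulling the repointing steps above into one place, as a sketch that reuses the coordinates quoted in the log (the mysqlbinlog inspection is an assumption about how the binlog comparison was done, and STOP SLAVE is implied rather than quoted):

    # on db31, inspect what sits between the crash point and the stray local updates
    mysqlbinlog --start-position=583315125 --stop-position=583601469 db31-bin.000334 | less
    # on each remaining s4 slave (db22, db33, db1038 in turn):
    mysql -e 'STOP SLAVE;
              CHANGE MASTER TO MASTER_HOST="db31.pmtpa.wmnet",
                               MASTER_LOG_FILE="db31-bin.000334",
                               MASTER_LOG_POS=583315125;
              START SLAVE;'
    mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Seconds_Behind_Master|Last_Error'   # should catch up cleanly
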
[13:57:22] as opposed to RESET MASTER which is the thing we don't want
[13:57:28] those edits were added in http://wikitech.wikimedia.org/index.php?title=Switch_master&diff=16159&oldid=12773 with comment "two more steps following observed pitfall"
[13:57:37] ok, I can run it...
[13:57:46] yeah, very smart person added it
[13:57:50] I'm sure it won't hurt.
[13:57:58] ;)
[13:58:08] :)
[13:58:13] Query OK, 0 rows affected.
[13:58:15] :)
[13:58:48] it's interesting that db51 is still in ganglia
[13:58:59] it must be able to send network packets but not receive them
[13:59:20] so good to update db.php and push it out?
[13:59:35] yes
[13:59:55] want me to do it?
[14:00:17] yes, I think it can send them because a couple other dbs in the same rack report it as an lldp neighbor
[14:00:25] the only change is commenting out the s4 maintenance, right?
[14:00:30] yes
[14:00:37] ok.
[14:01:13] TimStarling: I'm half way through; I'll finish it out.
[14:02:58] ok, change synced and pushed into git.
[14:03:34] someone want to upload a file to commons and see that everything's ok?
[14:03:52] my fear is that it's on an unmanaged switch eg mws-c2-sdtpa
[14:03:59] users do that for us
[14:03:59] Could just watch http://commons.wikimedia.org/wiki/Special:NewFiles
[14:04:10] ahh, that's the URL I wanted.
[14:04:33] has anybody been logged into #wikimedia-commons?
[14:04:35] oh, hi reedy.
[14:04:44] There's 3 of them :)
[14:04:44] (Upload log); 14:04 . . Filip em (talk | contribs | block) uploaded "File:Johannes Marcinowski.jpg" ({{Information |Description ={{en|1=Johannes Jaroslaw Marcinowski (1868-1935), psychiatrist, psychoanalyst}} {{pl|1=Johannes Jaroslaw Marcinowski (1868-1935), psychiatra, psychoanalityk}} {{de|1=Johannes Jaroslaw Marcinowski (1868-1935), Psychiater, ...)
[14:04:56] thanks TimStarling and maplebed !
[14:05:25] you're welcome. sorry it took so long... good to be careful though.
[14:05:55] TimStarling: thanks very much for sticking around and letting us walk through it together.
[14:05:56] maplebed, I have been
[14:06:11] no problem, good night
[14:06:17] first upload post-crash: http://commons.wikimedia.org/wiki/File:HadassasiyurDSCN3920.JPG
[14:06:18] thanks, night
[14:06:21] hey binasher. all sorted now ;)
[14:06:43] kind of
[14:06:59] binasher: I figured there'd be some more cleanup, but I'm not sure what it is.
[14:07:00] good night
[14:07:15] maplebed: did you look at the wikitech doc?
[14:07:18] binasher: why was read-only mode off on db31?
[14:07:32] this one? http://wikitech.wikimedia.org/view/Master_switch
[14:07:32] hrm
[14:07:43] todo: check that the other slaves have read-only set :P
[14:08:56] ah, I haven't updated DNS or puppet yet.
[14:09:19] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:09:28] RECOVERY - Host db51 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[14:09:51] !log new s4 master position post-rotation is master_log_file="db31-bin.000334", master_log_pos=583315125
[14:09:55] Logged the message, Master
[14:09:57] Right. I'm AFK for a few hours
[14:10:43] Tim-away: db31 was the last master and i rebooted it right after it was switched out. currently, the script sets it to read only, and puppet sets the my.cnf back to read only, and i must have brought mysql back up after the reboot before puppet updated the cnf
[14:11:43] ok, DNS updated.
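
The cleanup on the new master, gathered from the discussion above into a sketch (in practice the steps were run separately, and read_only was switched off before RESET SLAVE; RESET SLAVE discards the old relay logs and master info, whereas RESET MASTER would wipe db31's own binlogs, which the other slaves now replicate from):

    # on db31, now the active s4 master
    mysql -e 'STOP SLAVE; CHANGE MASTER TO MASTER_HOST=""; RESET SLAVE;'
    mysql -e 'SET GLOBAL read_only = 0'
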
[14:12:26] pushing to ns0 et al
[14:13:17] maplebed, the mysql.pp change https://gerrit.wikimedia.org/r/9015
[14:13:30] oh hey, thanks. I was just starting on that.
[14:13:38] hooray for gerrit!
[14:13:42] :)
[14:13:44] the list didn't seem to match at all with db.php
[14:14:33] maplebed: i reslaved db51 off the position you !logged
[14:14:46] excellent.
[14:14:53] yeah, that's a lot of changes.
[14:15:01] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:15:28] PROBLEM - MySQL Replication Heartbeat on db51 is CRITICAL: CRIT replication delay 574 seconds
[14:15:48] do puppet variables not set default to false?
[14:15:59] looks like if $writable should be false in that else
[14:16:11] binasher: you rebooted db51?
[14:16:22] binasher: do you know what db48 does? it's not an sX master but it was in the masters list in puppet.
[14:17:02] otrs-master.pmtpa.wmnet is an alias for db48.pmtpa.wmnet.
[14:17:07] apergos: yep
[14:17:07] Platonides: we need to keep db48 in the list.
[14:17:32] how did you guess that would clear the network issue? (I could see nothing about it in the logs)
[14:17:39] Platonides: email me about things you think are wrong in there
[14:18:09] !log rebooted db51, reslaved
[14:18:12] Logged the message, Master
[14:19:41] binasher: ?
[14:20:18] maplebed, sent new patchset
[14:20:35] cool.
[14:21:26] looks good. merged.
[14:21:41] looks like gerrit's not talking to IRC atm.
[14:21:48] wait
[14:21:49] aggh
[14:21:57] oh.
[14:22:04] that was on the test branch.
[14:22:07] binasher: waiting. what's up?
[14:22:11] ouch
[14:22:56] Platonides: did you not run a git pull before editing that?
[14:23:08] or using a checkout of the dev branch?
[14:23:16] I did, but I was in test branch
[14:23:18] the left hand side of that is very ancient
[14:23:22] um
[14:23:38] that was why I said the master list seemed out of date :P
[14:24:17] and you deleted a master
[14:24:55] there are the right number.
[14:24:58] no
[14:25:01] there aren't
[14:25:19] I count 8 hosts - the 7 clusters + db48
[14:25:30] do you see the s3 master?
[14:25:45] yeah, 39.
[14:25:55] New patchset: Platonides; "Mster switch for s4 from db51 to db31 after db51 failed." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9017
[14:25:58] they appear out of order...
[14:26:11] ok, that one is against production branch
[14:26:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9017
[14:26:16] just db51 -> db31
[14:26:34] what i see in the diff is $hostname =~ /^db(36|30|31|45|34|47|16|48)$/ {
[14:27:22] that's the old labs version and as you said earlier, way out of date.
[14:28:01] Platonides: I liked your reordering, but +1 do it as a separate change.
[14:28:12] https://gerrit.wikimedia.org/r/#/c/9017/ looks right to me.
[14:28:22] ok, gerrit bot is slow
[14:28:34] the link above was for 9015
[14:29:03] New patchset: Platonides; "Reorder masters to be ordered by cluster. Note: db48 is otrs master." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9018
[14:29:09] maplebed, I was doing it this other way now :)
[14:29:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9018
[14:29:44] I replaced the entries in the first one as they were a bit... nonsensical
[14:29:53] binasher: you're cool with 9017? If so I'll merge it.
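
Two of the follow-ups mentioned above, sketched: checking what db48 actually is before touching the masters regex, and putting db51 back in as a slave off the !logged coordinates (the exact reslave statement is an assumption; the coordinates are the ones logged at 14:09:51):

    host otrs-master.pmtpa.wmnet        # reports it as an alias for db48.pmtpa.wmnet
    mysql -h db51 -e 'CHANGE MASTER TO MASTER_HOST="db31.pmtpa.wmnet",
                                       MASTER_LOG_FILE="db31-bin.000334",
                                       MASTER_LOG_POS=583315125;
                      START SLAVE;'
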
[14:30:16] yeah
[14:30:24] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9017
[14:30:27] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9017
[14:30:45] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9018
[14:30:47] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9018
[14:31:00] hey, my first puppet change to production :)
[14:31:10] ::sigh: There's a pending change to bugzilla.
[14:31:23] what's going on with the wmf-config change
[14:31:52] binasher: I ran sync-file for the new db.php during rotation; tim ran it before to set readonly and disable db51
[14:31:58] those are the only ones I know of.
[14:32:02] git wise
[14:32:08] they need to be committed
[14:32:17] I committed them to git.
[14:32:44] oh, ok
[14:32:44] my git checkout doesn't reflect what's live.. hmm, did you push to origin?
[14:32:54] (though I may have done it wrong.)
[14:33:12] I can git commit filename -m "message"; git push origin
[14:33:13] ah, what's on fenari is ahead of 'origin/master' by 2 commits
[14:33:30] thanks
[14:33:48] i just want to put db51 back in as a slave
[14:33:50] binasher: I'm pushing https://gerrit.wikimedia.org/r/#/c/9010/ live on sockpuppet as well.
[14:34:00] (it was merged but not pushed to sockpuppet)
[14:34:45] !log dns and puppet changes for s4 master rotation done.
[14:34:48] Logged the message, Master
[14:36:58] i should restart puppet on kaulen now
[14:38:07] RECOVERY - MySQL Replication Heartbeat on db51 is OK: OK replication delay 0 seconds
[14:40:28] binasher: is there any additional cleanup for the db rotation? do we need to tell toolserver or anything? (maybe that's why we !log it...)
[14:40:30] maplebed: i just reset fenari wmf-config to origin, so don't worry about trying to push from fenari
[14:40:42] excellent.
[14:40:55] is http://wikitech.wikimedia.org/view/How_to_do_a_configuration_change#Change_wiki_configuration still accurate?
[14:41:39] yeah, the !log should be enough for the toolserver people but it's nice to email them too
[14:41:50] other than that, all looks good
[14:42:50] binasher: when you get a minute, how did you know to reboot db51, that it would clear up the network issue?
[14:42:51] i think that doc is accurate, to do a git push origin from fenari though, you have to have your git email set up to match what's in gerrit
[14:43:04] it will fail if it appears as @fenari
[14:43:31] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[14:44:41] I don't think I know the toolserver folks' email addresses.
[14:46:06] maplebed: marlen.caemmerer@wikimedia.de is currently the person to email, i'll add that to the master switch doc
[14:46:28] ok, I'll send that now.
[14:47:10] apergos: i wasn't actually aware of a networking issue.. i saw mention of a switch name when i got online but i don't have context beyond that
[14:47:20] oh
[14:47:40] how did you get on the host?
[14:47:42] mgmt?
[14:47:55] oh wait, maybe i should put email addresses in public wikitech docs
[14:48:05] yes, mgmt
[14:48:47] (cause I was on the mgmt console and saw the reboot messages later)
[14:49:04] RECOVERY - Puppet freshness on kaulen is OK: puppet ran at Sun May 27 14:48:49 UTC 2012
[14:49:20] apergos: did you try getting the host up?
[14:49:25] it was up
[14:49:33] just not on the network?
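
The fenari wmf-config situation a few lines up, as a sketch of the usual git checks before pushing (the checkout path is hypothetical, not from the log):

    cd /home/wikipedia/common/wmf-config    # hypothetical path to the live config checkout on fenari
    git status                              # shows "ahead of 'origin/master' by 2 commits"
    git log --oneline origin/master..HEAD   # the read-only toggle and the master swap commits
    git push origin                         # fails if your committer email shows up as @fenari
                                            # rather than the address gerrit knows
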
[14:49:36] but networking from it was busted
[14:49:54] best guess is that incoming packets weren't being received but maybe it could send some outgoing packets
[14:50:05] pings, traceroutes, all failed
[14:50:06] that's a new way our dell r510's are screwy!
[14:50:34] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[14:50:37] so I was trying to track down what interface it was attached to, to see if I could get at it that way
[14:50:53] i wonder if it could have received packets from hosts on the same switch
[14:50:58] but then you rebooted and now it's working, how weird is that. there was no indicator in dmesg, syslog, nothing
[14:51:19] I tried pinging from a host in the same rack (or to it rather), and it failed
[14:51:49] do you know what kind of rack switch it is?
[14:52:29] apergos: did you try an ifdown eth0 / ifup?
[14:52:33] nope, still don't know what switch it's on. it might.. be on msw2 (switch in rack c2 stdpa)
[14:52:58] which might be an unmanaged switch. but both of those are guesses, I don't have any docs on the setup and couldn't get on that switch to see.
[14:53:04] I think I'm gonna bail. are we all good?
[14:53:31] nope, didn't do that
[14:54:24] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours
[14:54:24] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[14:54:59] uh oh
[14:55:11] I just came online, I'm about to read the backlog
[14:55:13] i'm curious if that would have fixed it, and if rebooting just accomplished sending an arp
[14:55:29] maplebed: i think so, thanks!
[14:55:31] paravoid: summary - s4 db master rotation, with a little extra fun.
[14:55:41] heh
[14:55:42] why?
[14:55:47] I'm curious too
[14:55:53] the extra fun?
[14:55:59] why the rotation for starters :)
[14:56:16] its network connectivity went out to lunch
[14:56:36] ah, there's a mail
[14:56:45] because the new master was writable and the master was promoted in db.php before readonly was set, so the new master had updates that none of the others had. we had to go binlog diving to find the actual binlog position from which to replicate the other slaves.
[14:56:55] if that makes any sense.
[14:56:58] yiikes
[14:57:02] yeah it didn't actually crash, it just became unreachable
[14:57:02] it sounds like what would happen if another host came up with db51's ip
[14:57:18] did you get pages?
[14:57:19] I didn't
[14:57:26] reedy paged me.
[14:57:36] i didn't either.. nagios should have paged for that
[14:57:48] no pages
[14:58:01] I sat down to reconnect cables after putting my room back together
[14:58:05] nagios did notice the host being down: PROBLEM - Host db51 is DOWN: PING CRITICAL - Packet loss = 100%
[14:58:05] and walked in on the fun
[14:58:07] hosts set as masters in db.php are actually supposed to page for their alerts
[14:58:48] binasher: and fixed kaulen at the same time?
[14:58:50] you're on a roll!
[14:59:18] 10G swap partition iirc
[14:59:33] not 4
[14:59:33] nagios only shows having notified IRC.
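
A sketch of the interface triage being discussed; per the log, none of this was actually run before the reboot, and the gateway address is a placeholder:

    # from the mgmt console on db51
    dmesg | tail -n 30                       # no link-flap messages were seen here
    ip link show eth0                        # is the link administratively and physically up?
    ethtool eth0 | grep -i 'link detected'
    arping -c 3 -I eth0 <gateway-ip>         # can the host even ARP its gateway?
    ifdown eth0 && ifup eth0                 # the reset binasher asked about; a reboot achieves the
                                             # same re-ARP as a side effect
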
[14:59:40] maplebed fixed s4, i slept thru reedy's page and looked at my phone a while later when i actually woke up
[14:59:45] I wonder if blocking the bots took care of the issue
[14:59:53] (on kaulen)
[14:59:57] anyways it was a good move
[15:00:24] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[15:00:24] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
[15:00:24] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours
[15:00:40] ok, now really afk.
[15:00:59] later
[15:01:27] db51.cfg on spence shows contact_groups admins,sms
[15:01:43] for the mysql checks up until the most recent puppet run 10 min ago
[15:02:48] only for a couple checks though, and they didn't go off to irc either
[15:03:04] binasher, how does puppet treat unset variables?
[15:03:11] i should make sure sms alerting is set for the host instead of specific checks
[15:04:39] Platonides: i believe as either false or null in manifests, but as an error in templates
[15:08:52] ok, so there's no harm in not explicitly setting $writable = "false" in the manifest, then
[15:11:21] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[15:12:46] Platonides: it would be an error in the template
[15:13:36] the manifest would be fine, but it would break the my.cnf erb
[15:52:47] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:55:02] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[16:34:36] Platonides: Could you resolve the ambiguity I raised at http://wikitech.wikimedia.org/view/Talk:Switch_master ?
[16:34:50] Maybe it makes perfect sense, I don't know (I'm not an op), but maybe it could be clarified :)
[16:36:05] ah, nevermind
[16:36:13] It is about undoing the change in 4.2
[16:39:14] (right?)
[16:42:14] Krinkle, I didn't read it
[16:42:24] and you already deleted it
[16:42:37] sorry
[16:42:52] It said that the "undo local changes" didn't make sense with the "commit your changes"
[16:43:05] but then I saw that it said to make another change after the commit
[16:43:12] so it makes sense
[16:43:15] * Krinkle added it to the document
[16:43:25] http://wikitech.wikimedia.org/index.php?title=Switch_master&action=historysubmit&diff=47466&oldid=47465
[16:49:48] ah, ok
[17:18:58] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 198 seconds
[17:19:43] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 212 seconds
[17:35:01] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 29 seconds
[17:36:58] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[18:04:40] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:18:55] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:49:13] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[19:08:46] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:48:18] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 188 seconds
[20:58:21] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[21:10:03] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[21:10:03] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[21:10:03] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours
[21:10:03] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[21:51:34] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[22:13:37] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[22:18:34] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[22:26:22] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiktionary (22181)
[22:26:37] say wut
[22:27:59] someone touched something high use...
[22:31:21] OVER 9000
[22:32:49] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiktionary (19739)
[22:32:51] Exciting, isn't it?
[22:57:16] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours
[23:34:37] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[23:36:49] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
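
For the enwiktionary job-queue spike above, a one-line sketch of how the backlog can be checked by hand; the mwscript wrapper and MediaWiki's showJobs.php maintenance script are assumptions about the setup that check_job_queue is effectively watching:

    mwscript showJobs.php --wiki=enwiktionary    # plain count of queued jobs for the wiki named in the alert
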