[00:03:44] RECOVERY - Host es3 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [00:05:05] !log es3:~# rm -rf /usr/local/mysql* [00:05:10] Logged the message, Master [00:09:41] !log pointed es3 to MASTER_LOG_FILE='es1-bin.000788', MASTER_LOG_POS=453509865 [00:09:45] Logged the message, Master [00:12:53] PROBLEM - MySQL Slave Delay on es3 is CRITICAL: CRIT replication delay 1207 seconds [00:21:35] New patchset: Asher; "es1 = master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11013 [00:23:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11013 [00:24:44] !log shutdown mysql on es3. stopped slaving on es1002, rsyncing cluster23 tables to es3 [00:24:48] Logged the message, Master [00:26:08] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11013 [00:26:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11013 [00:27:08] PROBLEM - MySQL Slave Delay on es1003 is CRITICAL: CRIT replication delay 260 seconds [00:28:20] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 334 seconds [00:32:50] RECOVERY - MySQL Slave Delay on es1003 is OK: OK replication delay 0 seconds [00:34:11] RECOVERY - MySQL Slave Delay on es1004 is OK: OK replication delay 1 seconds [00:42:33] New patchset: Asher; "ben's thir^H^H^H^Hsecond prod ssh key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11015 [00:42:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11015 [00:43:19] maplebed: want to review it in gerrit? ^^ [00:43:26] yup. [00:43:49] nice snark in the comment there. 
[00:44:39] it might be snake, but seriously - there are places where not replacing your key after losing it would not be ok [00:44:44] *snark [00:45:27] luckily, wiki doesn't care about security :) [00:48:18] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/11015 [00:48:27] reviewed. [00:49:34] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11015 [00:49:37] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11015 [00:51:04] hm, I think I just added edns-client-subnet support to pdns' geo backend [00:51:33] that would solve the "I'm in Europe, use 8.8.8.8 as a DNS and hit eqiad instead of esams" little problem [00:51:42] binasher: which host should I try? (did you run puppet manually on one of the bastions?) [00:53:31] puppet is being slow but it will be on bast1001 [00:54:06] oh well, 4am [00:54:09] k. [00:54:11] time to sleep [00:54:21] srsly, paravoid I think you're not human. [00:54:28] ? [00:54:29] you never sleep! [00:54:34] I sleep a lot [00:54:57] hmmm.... [00:55:10] I'll believe it when I see it. [00:55:12] ;) [00:55:26] nope, i added the key incorrectly [00:55:29] have you considered that our sleep times might overlap? :-) [00:55:30] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [00:55:56] binasher: a conflict with the existing ensure absent key? [00:56:13] just a typo - Parameter type failed: Invalid value "ssh-dsa" [00:56:54] grumble. sorry I didn't catch that. [00:57:03] geobackend.cc | 16 ++++++++++++++-- [00:57:03] 1 file changed, 14 insertions(+), 2 deletions(-) [00:57:15] New patchset: Asher; "key type typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11016 [00:57:19] * paravoid loves when things are so easy [00:57:30] loves it even [00:57:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11016 [00:57:49] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/11016 [00:58:16] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11016 [00:58:18] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11016 [01:00:40] notice: /Stage[main]/Accounts::Ben/Ssh_authorized_key[ben@JDoe-LinuxBookAir-3.local]/ensure: created [01:01:14] success. thanks binasher [01:12:57] hrm, slight problem with my change [01:13:21] would DoS our NS kind of change, dammit [01:15:25] :( [01:22:54] !log removing one slave from each db shard to upgrade/restart [01:23:00] Logged the message, notpeter [01:35:48] !log passes the dba mantel to notpeter [01:35:53] Logged the message, Master [01:41:30] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 259 seconds [01:43:54] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 375 seconds [01:46:00] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:46:36] PROBLEM - mysqld processes on db54 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [01:46:36] PROBLEM - Host db32 is DOWN: PING CRITICAL - Packet loss = 100% [01:48:15] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [01:50:03] RECOVERY - Host db32 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [01:51:24] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 24 seconds [02:40:18] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [04:14:29] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [05:58:53] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [05:59:11] RECOVERY - MySQL Slave Delay 
on db1047 is OK: OK replication delay 1 seconds [06:23:42] !Log db1047 looks like the aft_article_filter_count is missing a few rows compared to the master (after replication caught up), presumably this is a side effect of the repair, have pinged binasher for help, leaving everything running and hope it's tolerable error for a day [06:23:59] grrr [06:24:06] !log db1047 looks like the aft_article_filter_count is missing a few rows compared to the master (after replication caught up), presumably this is a side effect of the repair, have pinged binasher for help, leaving everything running and hope it's tolerable error for a day [06:24:12] Logged the message, Master [06:24:24] you really need to learn to read capital letters [07:25:35] New review: Dzahn; "in retrorespect: please also leave a line for the existing / non-1000 virt servers in here, mapping ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10951 [07:48:24] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [07:48:24] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [07:48:24] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [08:19:17] New review: Dereckson; "I confirm the Extension:Collection doesn't have any ZIM reference in the code." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/10855 [08:42:27] PROBLEM - Puppet freshness on srv232 is CRITICAL: Puppet has not run in the last 10 hours [08:52:22] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [09:21:21] New review: Hashar; "Mail sent to operations mailing list to get some feedback about this change." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/7773 [09:28:21] New review: Hashar; "I have asked Ryan to take a look at this old change :-]" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4145 [09:35:16] New review: Hashar; "I have asked Tim for review and send him an email to get his opinion on that limit." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/9130 [09:56:34] New patchset: Hashar; "bug 37391 - Install Translate extension on be.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10593 [09:56:40] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10593 [09:57:15] New review: Hashar; "Lets merge it!" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10593 [09:57:18] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10593 [10:00:54] New patchset: Mark Bergsma; "Add RFC 4760 to the supported features list" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11020 [10:00:55] New patchset: Mark Bergsma; "During IPPrefix constructors, address family is not set yet" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11021 [10:00:55] New patchset: Mark Bergsma; "Set self.addressfamily when not None" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11022 [10:00:56] New patchset: Mark Bergsma; "Fix some newly introduced bugs" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11023 [10:00:57] New patchset: Mark Bergsma; "Add rudimentary support for BGP capability advertisements" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11024 [10:00:58] New patchset: Mark Bergsma; "Simplify BGP MP attribute encoding through code reuse" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11025 [10:00:58] New patchset: Mark 
Bergsma; "Fix several bugs in IPPrefix and Attribute handling" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11026 [10:00:59] New patchset: Mark Bergsma; "Make missing attribute checking optional, as it breaks with multi protocol" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11027 [10:00:59] New patchset: Mark Bergsma; "Fix IPv6 address __str__ method" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11028 [10:01:00] New patchset: Mark Bergsma; "Increase code reuse in MP attribute classes, add NotificationSent __str__ method" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11029 [10:02:27] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10859 [10:02:37] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10859 [10:02:39] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10859 [10:03:12] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10860 [10:03:14] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10860 [10:03:39] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10861 [10:03:41] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10861 [10:04:12] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10862 [10:04:15] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10862 [10:04:57] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10865 [10:04:59] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) 
- https://gerrit.wikimedia.org/r/10865 [10:05:23] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11020 [10:05:25] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11020 [10:05:47] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11021 [10:05:49] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11021 [10:06:11] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11022 [10:06:13] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11022 [10:06:44] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11023 [10:06:46] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11023 [10:07:34] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11024 [10:07:36] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11024 [10:08:08] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11025 [10:08:14] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11025 [10:08:58] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11026 [10:08:59] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11026 [10:09:29] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11027 [10:09:30] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - 
https://gerrit.wikimedia.org/r/11027 [10:09:54] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11028 [10:09:56] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11028 [10:10:40] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11029 [10:10:42] Change merged: Mark Bergsma; [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11029 [10:33:35] PROBLEM - Host db54 is DOWN: PING CRITICAL - Packet loss = 100% [10:36:35] RECOVERY - Host db54 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [10:41:32] RECOVERY - mysqld processes on db54 is OK: PROCS OK: 1 process with command name mysqld [10:43:47] PROBLEM - MySQL Slave Delay on db54 is CRITICAL: CRIT replication delay 31738 seconds [10:43:56] PROBLEM - MySQL Replication Heartbeat on db54 is CRITICAL: CRIT replication delay 31695 seconds [10:49:49] New review: Hashar; "Deployed and bug 37391 marked resolved." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10593 [10:56:42] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [11:01:57] PROBLEM - mysqld processes on db22 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [11:03:09] PROBLEM - mysqld processes on db11 is CRITICAL: Connection refused by host [11:03:32] all of those DBs are me, btw. and it's fine. 
they're out of db.php right now for kern upgrades [11:03:45] PROBLEM - Host db44 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:54] PROBLEM - Host db47 is DOWN: PING CRITICAL - Packet loss = 100% [11:04:23] going to cycle mw1042 / mw1071 (down since ~ 3d) unless they are special cases [11:04:30] PROBLEM - mysqld processes on db26 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [11:04:46] mutante: I believe it is still the case that none of those are in use yet [11:04:59] so should be fine [11:05:11] although, if they have hardware issues, that'd be good to note [11:05:12] ok [11:05:15] PROBLEM - Host db22 is DOWN: PING CRITICAL - Packet loss = 100% [11:05:15] PROBLEM - Host db11 is DOWN: PING CRITICAL - Packet loss = 100% [11:06:09] RECOVERY - Host db47 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [11:06:45] RECOVERY - Host db44 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [11:06:45] RECOVERY - Host db22 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [11:07:21] RECOVERY - Host db11 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [11:07:57] RECOVERY - mysqld processes on db22 is OK: PROCS OK: 1 process with command name mysqld [11:08:33] !log powercycled mw1042 to check for hardware issues and fscked. 
appears to be just unused (though down since ~3d like mw1071 per nagios) [11:08:37] Logged the message, Master [11:09:18] PROBLEM - mysqld processes on db47 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [11:10:30] RECOVERY - mysqld processes on db11 is OK: PROCS OK: 1 process with command name mysqld [11:10:39] RECOVERY - Host mw1042 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [11:12:27] RECOVERY - mysqld processes on db47 is OK: PROCS OK: 1 process with command name mysqld [11:14:24] PROBLEM - MySQL Replication Heartbeat on db11 is CRITICAL: CRIT replication delay 931 seconds [11:14:24] PROBLEM - MySQL Slave Delay on db11 is CRITICAL: CRIT replication delay 931 seconds [11:14:24] PROBLEM - MySQL Replication Heartbeat on db47 is CRITICAL: CRIT replication delay 691 seconds [11:14:42] PROBLEM - MySQL Slave Delay on db47 is CRITICAL: CRIT replication delay 675 seconds [11:14:51] PROBLEM - Host db26 is DOWN: PING CRITICAL - Packet loss = 100% [11:16:21] RECOVERY - Host db26 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [11:17:24] ACKNOWLEDGEMENT - SSH on gilman is CRITICAL: Server answer: daniel_zahn grep gilman site.pp. is gilman dead? can we remove this? [11:17:33] ACKNOWLEDGEMENT - jenkins_service_running on gilman is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn grep gilman site.pp. is gilman dead? can we remove this? [11:19:09] ACKNOWLEDGEMENT - Host mw1071 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn most likely not in use yet. please remove flag once it is. [11:19:39] RECOVERY - mysqld processes on db26 is OK: PROCS OK: 1 process with command name mysqld [11:19:48] ACKNOWLEDGEMENT - Host mw1102 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn most likely not in use yet. please remove flag once it is. 
[11:21:09] ACKNOWLEDGEMENT - Host knsq25 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #2918 - hardware fail [11:23:24] ACKNOWLEDGEMENT - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours daniel_zahn puppet actually runs - fix firewalling for UDP / snmptraps? [11:25:57] RECOVERY - MySQL Replication Heartbeat on db47 is OK: OK replication delay 0 seconds [11:26:15] RECOVERY - MySQL Slave Delay on db47 is OK: OK replication delay 4 seconds [11:28:14] !log powercycling downed srv232 (also cause for check_all_memcached crit) [11:28:19] Logged the message, Master [11:28:48] RECOVERY - MySQL Slave Delay on db11 is OK: OK replication delay 0 seconds [11:28:57] RECOVERY - MySQL Replication Heartbeat on db11 is OK: OK replication delay 0 seconds [11:29:56] MySQL error: 1637: Too many active concurrent transactions (10.0.6.50) :o [11:30:45] PROBLEM - Host srv232 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:03] RECOVERY - Memcached on srv232 is OK: TCP OK - 0.002 second response time on port 11000 [11:31:12] RECOVERY - Host srv232 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [11:31:38] dang it [11:31:39] RECOVERY - SSH on srv232 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:31:48] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:32:15] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [11:32:42] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:32:51] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:33:09] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.312 second response time [11:33:17] whats going on with the mws now [11:33:54] 10.0.6.50 = db40 [11:34:03] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.537 second response time [11:34:12] RECOVERY - Apache HTTP on mw39 
is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.378 second response time [11:34:21] better:) [11:36:02] aude / notpeter: just temp.? / db40 out of db.php? [11:36:36] mutante: works now [11:36:41] cool [11:36:46] just a glitch [11:37:00] * mutante nods [11:38:00] ok [11:38:42] RECOVERY - Puppet freshness on srv232 is OK: puppet ran at Tue Jun 12 11:38:37 UTC 2012 [11:39:45] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 2093 seconds [11:40:13] mutante: db40 is the parser cache [11:40:38] the page rendered except the article cache [11:40:45] article text, errr [11:40:48] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [11:40:51] ah, well, that's potentially awful [11:40:57] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 1967 seconds [11:41:35] yeah, db40 load spiked crazily, but seems to be evening back out [11:41:39] ACKNOWLEDGEMENT - Host srv266 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn went down repeatedly in the past. hardware RT #2896 [11:41:41] good [11:41:54] notpeter: ok, gotcha [11:42:08] well, hrm, maybe [11:42:57] New patchset: Hashar; "wmfHostnames array to easily change hostnames on a cluster basis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11034 [11:43:02] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11034 [11:43:57] nagios-wm: can have a seperator between user and message @ "daniel_zahn went down repeatedly in the past." ? heh [11:44:12] haha [11:44:13] s/user/host [11:44:43] daniel_zahn went high repeatedly in the past [11:45:01] up it is [11:45:40] mark: ganglia seems to be showing that the parsecache isn't able to purge fast enough [11:45:58] what do you know about parsercache, or who that is online knows stuff [11:46:03] tim and asher are both not online... 
[11:46:10] I don't know anything about the parsercache [11:46:15] damn. [11:47:11] transactions unpurged seems to have a downward trend at the moment [11:47:21] so this might recover, but if not... hello, asher! [11:48:45] is asher the ops DBA ? [11:48:53] I mean the main DBA ? [11:49:16] New patchset: Hashar; "vary wgUploadStashScalerBaseUrl based on cluster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11035 [11:49:21] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11035 [11:49:48] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [11:49:48] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds [11:49:48] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds [11:52:29] RECOVERY - MySQL Replication Heartbeat on db54 is OK: OK replication delay 0 seconds [11:52:56] RECOVERY - MySQL Slave Delay on db54 is OK: OK replication delay 0 seconds [11:53:22] hashar: yeah, I'd say so [11:53:34] db40 seems to be calming down [11:54:17] hashar: more specifically, if there were a parsercache problem, I'd say that he and tim are the two people well qualified to fix it. 
perhaps doma_s as well [11:59:02] New patchset: Hashar; "move mobile related conf to their own files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11036 [11:59:03] New patchset: Hashar; "cleanup whitespace in mobile-pmtpa.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11037 [11:59:08] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11036 [11:59:10] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11037 [11:59:36] notpeter: hoo the parser cache that is a crazy part of MediaWiki [11:59:46] I am pretty sure only a handful of people actually understand it [11:59:50] (and I am not one of them) [12:00:35] yeah, nor am I. but I know that when db40 becomes angry, the site goes down very quickly :) [12:00:45] hehe [12:26:20] mark: here? [12:26:27] yes [12:26:42] so [12:26:59] I was bored late last night and was playing a bit with powerdns [12:27:09] I managed to add edns-client-subnet to the geobackend [12:27:15] (which I found out that you wrote!) 
[12:27:20] I did [12:27:21] nice :) [12:27:31] although I'm kind of planning to rewrite geobackend with something better [12:27:33] it's a bit aged ;) [12:27:54] that means that we'll get clients using 8.8.8.8 as their NS to the correct geolocated site [12:28:01] yeah :) [12:28:05] very nice [12:29:02] put the commits in gerrit and I'll happily review [12:29:09] I got twisted bgp multiprotocol support working btw [12:29:11] it's for pdns 3.1 ;-) [12:29:21] didn't even check for 2.x, didn't see the point [12:29:25] yeah [12:29:35] btw, they've put the edns fields into the pipe's backend proto v3 [12:29:38] I dunno what SCM they're using atm [12:29:47] (we're using v2) [12:29:49] yep [12:30:01] so we can even do that in pipe if you feel like dumping the geobackend [12:30:09] we might [12:30:15] could just write something around maxmind [12:30:17] geo needs a lot of changes to support ipv6 as I saw it [12:30:23] yeah [12:30:29] oh, and another problem [12:30:29] and something which does straight A records instead of CNAMES [12:30:34] and AAAA [12:30:52] the edns spec specifies that you get a request that has an address and a netmask [12:31:02] so, e.g. 
google sends no less than /24 for privacy reasons [12:31:17] then on your replies, you reply with a so-called scope netmask [12:31:35] so you have to get that from [12:31:38] what's it called again [12:31:40] ipreftre ;) [12:31:55] as in, you might reply for the whole /16 or for a more specific block [12:31:59] yeah [12:32:01] ippreftree [12:32:09] yes [12:32:14] basically python radix ;) [12:32:15] I haven't done that yet [12:32:17] and maxmind too [12:32:30] I just reply with the same netmask as the request [12:32:42] sounds reasonable [12:32:47] I had a look at ipt, but it would have been more complicated [12:32:55] yeah [12:32:59] it's implemented in a recursive function rather than an iteration [12:33:07] well I don't know what powerdns wants, but I basically just want to make a geobackend v2 [12:33:10] which is quite different [12:33:17] yes [12:33:23] although I don't recall the specifics [12:33:27] fuck, it's like 10 years ago now [12:33:28] so it's difficult to keep the real mask on the side [12:33:50] plus, I was hoping to find an external library that does prefix trees for both v4/v6 in the meantime :) [12:33:55] yup [12:34:08] you see why I'd rather start over ;) [12:34:12] heh :) [12:37:16] anyway [12:37:24] so the real questions now are [12:37:38] what's the plan for ns[0-4]? :) [12:37:51] 4? [12:38:01] 3, sorry [12:38:05] ns0-2 ;) [12:38:08] 2?? [12:38:12] just 2? really? 
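(Editor's note on the edns-client-subnet exchange at 12:30-12:33.) The two reply strategies discussed there can be sketched roughly as follows. This is only an illustration in Python, not the geobackend patch itself, which is C++ and not shown in the log. The prefix table, the prefix-to-site mapping, and the function names are all hypothetical, and a linear scan stands in for the prefix tree mentioned in the chat.

```python
import ipaddress

# Hypothetical prefix -> site table; the real geobackend loads its own IP maps.
GEO_TABLE = {
    ipaddress.ip_network("10.1.0.0/16"): "eqiad",
    ipaddress.ip_network("192.0.2.0/24"): "esams",
}
DEFAULT_SITE = "eqiad"  # assumption: fallback when no prefix matches


def client_network(address, source_prefix_len):
    """Mask the client address down to the prefix length sent in the query."""
    return ipaddress.ip_network(f"{address}/{source_prefix_len}", strict=False)


def lookup(net):
    """Return the longest (prefix, site) entry that contains net, or None."""
    best = None
    for prefix, site in GEO_TABLE.items():
        if prefix.version == net.version and prefix.supernet_of(net):
            if best is None or prefix.prefixlen > best[0].prefixlen:
                best = (prefix, site)
    return best


def answer_naive(address, source_prefix_len):
    """Echo the request's prefix length back as the scope netmask."""
    match = lookup(client_network(address, source_prefix_len))
    site = match[1] if match else DEFAULT_SITE
    return site, source_prefix_len


def answer_scoped(address, source_prefix_len):
    """Use the matched prefix's own length as the scope when one is found."""
    match = lookup(client_network(address, source_prefix_len))
    if match is None:
        return DEFAULT_SITE, source_prefix_len
    prefix, site = match
    return site, prefix.prefixlen
```

With this table, `answer_naive("10.1.2.3", 24)` keeps the scope at 24, while `answer_scoped("10.1.2.3", 24)` widens it to 16 because the whole /16 maps to the same site. That corresponds to the "reply for the whole /16" case above, which is the part that needs the prefix tree to expose the matched mask.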
[12:38:16] 3 [12:38:20] learn to count dude ;p [12:38:29] yeah one in each DC [12:38:31] haha [12:38:59] we need something newer than hardy preferrably :) [12:39:26] basically the plan is to redo the dns deployment system with something git, start using a newer geobackend (possibly pipe based, possibly native C++) [12:39:33] put in something more automated for ipv6 perhaps [12:39:44] and definitely everything on precise [12:40:15] even newer than that I'd say for pdns [12:40:28] yes, pdns3 based anyway [12:40:29] precise has 3.0, there's 3.1 available in quantal+ [12:40:47] and 3.1 is supposed to fix a lot of .0 bugs [12:40:52] yup [12:40:58] habbie has done a lot of good work on it [12:41:23] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [12:42:13] Habbie: apparently a colleague of mine added edns-client-subnet support to geobackend last night [12:42:20] er [12:42:22] hehe [12:42:31] wrong channel [12:42:57] hahaha [12:42:59] i'll let them know so we can sync up plans [12:46:14] hmm [12:46:31] i'm thinking about how to redo NaiveBGPPeering [12:46:34] with ipv6 support [12:46:43] it's so naive that I kind of hate it [12:46:47] but doing it well takes a bit of time ;) [12:47:50] also I need a good way to get the main ipv6 address in python [12:47:56] twisted doesn't help me yet [13:01:37] 15:00:50 <@Habbie> mark, oh, cute - of course client-subnet support is a 3 line patch ;) [13:01:37] 15:01:04 <@Habbie> mark, i also have a patch lying around that allows running multiple geobackends with separate configs, but it eats RAM like crazy [13:01:37] 15:01:27 <@mark> the one that loaded every ip database separately right [13:02:39] it's more like 16, but yes, simple [13:38:03] New review: Tim Starling; "I don't know how many bytes would have to change for it to saturate the network or whether that's ev..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9130 [13:41:24] !log added awight to fr-tech@wikimedia.org email alias [13:41:29] Logged the message, Master [13:45:33] New patchset: Pyoungmeister; "re-adding one db per shard after kernel upgrade" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11039 [13:45:38] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11039 [13:50:39] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 223 seconds [14:28:45] New patchset: Ottomata; "Installing reportcard.wikimedia.org on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11042 [14:29:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11042 [14:30:18] New review: Ottomata; "This needs to wait until stat1 is reinstalled with Precise, so that we have the newer nodejs version." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/11042 [14:32:49] so, [14:32:56] I have installed the new php5 on srv193 aka test.wp.org [14:32:58] and it seems to work [14:33:08] shall I go ahead and put it in the repo? [14:33:25] reminder: puppet has ensure => latest, so that means that it will hit servers immediately [14:33:41] mark? [14:34:21] I don't want it to immediately be in effect on the snapshot hosts [14:34:46] it's just a security update, it shouldn't affect anything [14:35:00] I would try it on one production server first [14:35:04] before you roll it everywhere [14:35:07] build was identical to the old build but with a few patches? same configs etc? [14:35:11] but other than that... [14:35:27] mark: other than srv193 you mean? [14:35:36] apergos: yes [14:35:52] ok, that is reassuring [14:36:17] ah this is not the build for precise. 
duh ok [14:37:11] no, I have that too [14:37:17] although I don't see the point of rolling that out [14:37:27] and it's not the php 5.4 one either [14:38:11] yeah then I'm fine with it after it's run on a production server (i.e. serving things other than just test.wp) [14:38:20] RECOVERY - MySQL Slave Delay on es1004 is OK: OK replication delay 0 seconds [14:39:01] okay [14:40:53] silly questions then: a) how do I know which of the MWs are getting real traffic b) when I upgrade one of them how do I check if things indeed work? is there a log of some kind? [14:42:58] paravoid: yes [14:43:09] yes? :-) [14:43:12] a) /home/w/conf/pybal/pmtpa/apaches [14:43:36] b) you dont. check logs, do manual tests [14:44:05] !log resumed replication on es3, es1002 after cluster23 sync completed [14:44:10] Logged the message, Master [14:44:21] thanks a lot [14:44:54] mw1 has Nagios Apache HTTP and memcached monitoring. other mw100x hosts have just NTP / SSH / puppet. https://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?host=mw1xxx [14:45:29] !log putting kern-upgraded DBs back into pools [14:45:33] Logged the message, notpeter [14:45:40] New review: Pyoungmeister; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11039 [14:45:42] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11039 [14:52:26] !log set innodb_max_dirty_pages_pct = 0 on db40 in prep for shutdown [14:52:31] Logged the message, Master [14:53:29]