[00:12:44] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [00:13:02] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.161 seconds [00:13:47] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [00:22:47] PROBLEM - Puppet freshness on db23 is CRITICAL: Puppet has not run in the last 10 hours [00:23:41] PROBLEM - Puppet freshness on db10 is CRITICAL: Puppet has not run in the last 10 hours [00:26:41] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [00:32:41] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [00:34:47] PROBLEM - Puppet freshness on db21 is CRITICAL: Puppet has not run in the last 10 hours [00:36:19] Thehelpfulone: thanks for the revert, even though the edit to my user page was legit. I didn't realize I wasn't logged in. [00:36:31] err... [00:36:34] one sec. [00:36:41] heh [00:37:48] no problem, can't quite remember where it was or when I did it though [00:38:01] just 5m ago. [00:38:57] oh wait. [00:39:02] I'm misreading the changelog. [00:39:04] are you sure it was me, I haven't edited for a few hours? now you've intrigued me [00:39:13] it was reverted *to* the last version edited by you, not by you. [00:39:18] heh [00:39:31] so... uhh... nevermind. [00:39:34] :D [00:39:41] thanks for nothing! [00:39:44] :P [00:42:29] I just categorised your user page, that was a lot of effort [00:42:55] every grain of sand helps make the beach a wonderful place. [00:43:11] and now, with thanks and misguided platitudes, off to bed. [00:48:44] PROBLEM - Puppet freshness on linne is CRITICAL: Puppet has not run in the last 10 hours [00:56:14] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:56:32] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:57:35] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27409 bytes in 0.109 seconds [00:58:38] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [00:59:23] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27546 bytes in 9.713 seconds [01:04:56] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:50] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [01:06:35] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:07:47] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 0.115 seconds [01:08:59] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:11] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27409 bytes in 0.109 seconds [01:15:53] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [01:19:38] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [01:22:02] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [01:40:56] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 234 seconds [01:43:38] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.236 seconds [01:44:41] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27406 bytes in 0.106 seconds [01:46:29] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 
seconds [01:47:46] New patchset: Jeremyb; "bug 37006 - fawiki: add Book namespace + aliases" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10084 [01:47:53] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10084 [01:49:52] New review: Jeremyb; "needs a local to sanity check that I didn't butcher the chars and put them in the right place." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/10084 [02:02:14] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [02:03:35] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [02:07:20] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [02:09:08] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:10:38] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27409 bytes in 0.113 seconds [02:21:08] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [02:36:53] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [02:37:47] RECOVERY - Puppet freshness on linne is OK: puppet ran at Mon Jun 4 02:37:32 UTC 2012 [02:38:14] RECOVERY - Puppet freshness on db21 is OK: puppet ran at Mon Jun 4 02:38:07 UTC 2012 [02:40:11] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Mon Jun 4 02:40:07 UTC 2012 [02:50:41] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:23] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27546 bytes in 4.215 seconds [02:56:05] RECOVERY - Puppet freshness on hooft is OK: puppet ran at Mon Jun 4 02:55:51 UTC 2012 [02:57:44] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.161 seconds [03:01:56] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:41] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [03:05:14] RECOVERY - Puppet freshness on db23 is OK: puppet ran at Mon Jun 4 03:04:52 UTC 2012 [03:06:17] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 2.338 seconds [03:06:44] RECOVERY - Puppet freshness on spence is OK: puppet ran at Mon Jun 4 03:06:27 UTC 2012 [03:14:18] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [03:29:18] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [03:43:15] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [03:58:15] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:33] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.161 seconds [03:59:36] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27546 bytes in 2.971 seconds [04:04:59] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [04:19:05] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.131 seconds [05:41:46] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [05:55:34] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [05:58:07] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python 
/usr/bin/swift-container-auditor [06:01:25] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.108 seconds [06:02:19] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:52:47] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [06:55:20] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [07:05:05] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [07:06:35] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [07:23:24] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.117 seconds [07:32:44] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.170 seconds [08:14:08] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [08:34:01] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27398 bytes in 0.109 seconds [08:45:52] PROBLEM - Puppet freshness on bellin is CRITICAL: Puppet has not run in the last 10 hours [08:58:01] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [09:03:43] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.161 seconds [09:25:57] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [09:28:30] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [09:28:30] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [09:28:30] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [09:34:21] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.161 seconds [09:45:27] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [09:52:57] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [09:59:40] what's up with cp1001/1002? [10:01:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9982 [10:01:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9982 [10:05:24] RECOVERY - Puppet freshness on sq68 is OK: puppet ran at Mon Jun 4 10:05:04 UTC 2012 [10:05:33] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.110 seconds [10:07:21] RECOVERY - Puppet freshness on sq69 is OK: puppet ran at Mon Jun 4 10:06:55 UTC 2012 [10:08:15] RECOVERY - Puppet freshness on sq70 is OK: puppet ran at Mon Jun 4 10:07:59 UTC 2012 [10:11:06] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [10:19:15] New patchset: Mark Bergsma; "Add IPv6 addresses to pmtpa SSL servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10098 [10:19:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10098 [10:19:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10098 [10:19:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10098 [10:26:21] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [10:27:24] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.117 seconds [10:28:20] New patchset: Mark Bergsma; "Multiple interface stanzas for the same interface name and different families are allowed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10099 [10:28:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10099 [10:28:48] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10099 [10:28:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10099 [10:28:54] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [10:34:00] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [10:49:09] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [10:52:30] New patchset: Mark Bergsma; "Decommission sq40" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10102 [10:52:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10102 [10:54:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10102 [10:54:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10102 [10:56:21] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.124 seconds response time. www.wikipedia.org returns 208.80.154.225 [11:28:08] New patchset: Mark Bergsma; "Don't upgrade wikimedia-lvs-realserver while I test it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10104 [11:28:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10104 [11:28:37] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10104 [11:28:39] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10104 [11:32:14] !log Copied wikimedia-lvs-realserver 0.08 from APT distribution precise-wikimedia to lucid-wikimedia [11:32:21] Logged the message, Master [11:32:45] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [11:36:48] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [11:43:28] New patchset: Mark Bergsma; "Factor prefers the lo ipv6 address for ::ipaddress6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10106 [11:43:35] morning mr lane [11:43:44] service IPs have been allocated [11:43:44] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/10106 [11:44:32] New patchset: Mark Bergsma; "Factor prefers the lo ipv6 address for ::ipaddress6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10106 [11:44:49] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/10106 [11:45:39] New patchset: Mark Bergsma; "Facter prefers the lo ipv6 address for ::ipaddress6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10106 [11:45:56] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/10106 [11:47:04] mark: morning :D [11:47:07] oh? [11:47:15] see dns [11:47:24] cool [11:47:49] stupid puppet takes the loopback addresses for the facter $ipaddress6 variable :( [11:48:07] :D [11:48:23] yay puppet! [11:49:27] oh god it uses ifconfig [11:49:38] was the lvs realserver stuff for ipv6 done yet? [11:49:56] yes [11:50:00] it's not deployed yet but in the repo [11:50:05] don't use it yet [11:50:07] just do nginx conf [11:50:12] i'm still fiddling :) [11:50:35] well, part of that needs to assign addresses via lvs_realserver [11:50:49] i'll handle that [11:50:52] if I don't do that, then I'm going to cause an outage [11:50:58] why? [11:51:05] because the addresses won't be bound [11:51:09] so what? [11:51:13] noone's using it yet [11:51:19] nginx will restart [11:51:24] and fail [11:51:31] ah it's not INADDR_ANY [11:51:34] right [11:51:35] then wait [11:51:38] * Ryan_Lane nods [11:51:46] brb [11:51:58] I can push it into gerrit and do a −1 if you' dlike [11:54:26] ok [11:56:17] heh [11:56:21] New patchset: Mark Bergsma; "Facter prefers the lo ipv6 address for ::ipaddress6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10106 [11:56:30] have a look at interface_add_ip6_address in generic-definitions.py [11:56:37] dirty hacks fest [11:56:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10106 [11:57:10] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10106 [11:57:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10106 [12:04:16] hmmz [12:04:22] ssh -6 ssl1.wikimedia.org doesn't work [12:04:24] why not... [12:11:59] Ryan_Lane: I can remove the old ipv6 service IP stuff from lvs1, right? [12:12:02] it's completely broken anyway [12:12:21] there was stuff there? [12:12:28] iface eth0 inet6 static [12:12:28] address 2620:0:860:2::80:2 [12:12:28] netmask 64 [12:12:28] gateway 2620:0:860:2::1 [12:12:33] that's the wrong subnet ;) [12:15:19] I've now removed it [12:15:23] so that may mean that nginx won't restart now [12:15:25] on ssl1 [12:15:39] but I don't have time now to fix that [12:16:00] i'm gonna reboot the box [12:16:07] it either does or does not come back up with nginx [12:17:09] yeah. 
probably won't restart [12:17:13] I'll depool it [12:17:32] it can be repooled after we have the new ipv6 config in [12:17:45] PROBLEM - Host ssl1 is DOWN: PING CRITICAL - Packet loss = 100% [12:18:05] !log depooling ssl1 [12:18:09] Logged the message, Master [12:18:11] yep [12:19:25] faidon's new wikimedia-lvs-realserver with v6 support seems to work fine [12:19:33] RECOVERY - Host ssl1 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [12:19:37] he's our new packaging ninja [12:19:46] heh [12:20:09] PROBLEM - HTTPS on ssl1 is CRITICAL: Connection refused [12:20:17] i'm gonna ask puppet to upgrade it everywhere now [12:20:27] PROBLEM - NTP on ssl1 is CRITICAL: NTP CRITICAL: Offset unknown [12:20:46] New patchset: Mark Bergsma; "Revert "Don't upgrade wikimedia-lvs-realserver while I test it"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10109 [12:21:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10109 [12:21:09] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10109 [12:21:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10109 [12:22:47] oh damn [12:22:56] now we need to have puppet manage the ssh keys for the ipv6 addresses as well [12:23:13] it'll become even slower [12:23:39] !log Upgrading wikimedia-lvs-realserver to version 0.08 across the cluster (by Puppet) [12:23:43] Logged the message, Master [12:24:48] RECOVERY - NTP on ssl1 is OK: NTP OK: Offset 0.0009467601776 secs [12:25:10] ugh. yeah [12:25:26] dnssec and keys in the dns entries? [12:25:30] and we'll have all kinds of weirdness soon with all the hosts that now also have AAAA records for their hostnames [12:25:51] well, fingerprints in the dns entries, that is [12:25:54] especially right after fresh install, when the ipv6 address is not present yet [12:28:15] mark: any estimated date when ipv6 is enabled on prod [12:28:39] hm. I'm looking in dns, were's the forward lookups? [12:28:43] for the -lb addresses? [12:28:58] like Erik wrote in email 3 days ago that it was about to be enabled during hackaton [12:29:50] petan|wk: wednesday [12:29:56] ok [12:29:58] Ryan_Lane: think [12:30:29] it's not in the pipe one [12:30:41] no [12:30:55] because giving wikipedia a AAAA address in DNS before infrastructure is ipv6 enabled is a really bright idea ;-) [12:31:05] hahaha [12:31:07] good point [12:31:34] we should have this in git, then it could be sitting in a change [12:31:42] soon [12:31:46] * Ryan_Lane nods [12:32:03] hrm darn [12:32:18] if I add the v6 ips to the service ip hash in lvs.pp then pybal will try to add it everywhere [12:34:15] yep [12:34:20] go ahead on the ssl servers now [12:34:20] wait [12:34:23] is that true? [12:34:23] those will actually be the best test [12:34:28] since they DONT use that hash yet ;) [12:34:32] ah [12:34:34] that's why :) [12:34:45] they have their own service ip listing right now [12:34:47] I want to change that [12:34:52] but right now it's good that we haven't done that yet ;) [12:35:03] ok, I just add the addresses to the list? [12:35:16] Ryan_Lane: btw if you want I can help you with tagging on wikitech and such boring work [12:35:32] regarding the email [12:35:32] Ryan_Lane: I think. 
I've not looked at the protoproxy puppet config at all yet (in a while) [12:35:46] petan|wk: ok [12:35:49] as long as it only affects the ssl servers and not lvs/pybal you should be good [12:36:00] Ryan_Lane: petan|wk: count me in, which wiki are you moving stuff to? [12:36:03] it'll only affect ssl, yeag [12:36:11] wikimedia-lvs-realserver you can give the addresses too though [12:36:12] wait. is that tre? [12:36:19] to [12:36:20] it should be [12:36:51] ugh, reading these addresses from the reverse is horrible [12:36:56] hehe [12:37:07] it's all 2620:0:860:ed1a::0 to ::11 [12:37:13] (for pmtpa) [12:37:14] Ryan_Lane: you can create account Petrb with email benapetr at gmail dot com when you aren't busy, let me know then and I will take a look there [12:37:23] petan|wk: ok [12:38:52] so, wikimedia-lb is 2620:0:860:ed1a::0 ? [12:38:58] yes [12:39:05] (or 2620:0:860:ed1a:: ) [12:39:15] it seemed appropriate ;) [12:39:18] fucking hate ipv6's scheme [12:40:04] well, mobile-lb is 12 [12:40:21] hh [12:40:23] *heh [12:40:24] 12? [12:40:28] c? [12:40:31] yes [12:42:52] i've made wikimedia-lvs-realserver really flexible yesterday btw [12:43:00] you can now give it any combination of arrays and hashes [12:43:07] and it will just compile a list of ip addresses out of that [12:43:10] so you can use that if you want [12:43:26] instead of manually specifying each value from key in a hash [12:43:50] eqiad is: 2620:0:861:ed1a:: ? [12:44:04] correct [12:44:08] so for example, this also works: [12:44:09] # TEMP: during ipv6 migration [12:44:09] if $::site == "pmtpa" { [12:44:09] class { "lvs::realserver": realserver_ips => [ $lvs::configuration::lvs_service_ips[$::realm]['bits'][$::site], "2620:0:860:ed1a::a" ] } [12:44:09] } [12:44:09] else { [12:44:10] class { "lvs::realserver": realserver_ips => $lvs::configuration::lvs_service_ips[$::realm]['bits'][$::site] } [12:44:10] } [12:44:35] just putting an entire hash in an array with a literal value [12:45:24] hm [12:45:24] ok [12:45:50] New patchset: Mark Bergsma; "Add IPv6 service IP to pmtpa bits servers for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10110 [12:46:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10110 [12:46:20] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10110 [12:46:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10110 [12:47:55] hrm [12:48:07] why didn't faidon make the lvs service IPs scope host instead of global [12:48:20] ah to distinguish them [12:52:45] Geo = {} [12:52:51] is what geoiplookup returns for v6 clients [12:52:53] that's fine for now [12:53:08] * Ryan_Lane nods [12:54:02] I think i'm gonna modify the pybal conf template to filter out ipv6 addresses/services except if the host is in a special ipv6 class [12:54:03] for now [12:54:09] until all pybals have been upgraded [12:54:12] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [12:54:19] seems like the easiest way to handle things [12:56:15] New patchset: Ryan Lane; "Adding ipv6 support for all sites for protoproxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10111 [12:56:22] * Ryan_Lane is stealing all the credit ^^ [12:56:35] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/10111 [12:56:38] bah [12:57:46] New patchset: Ryan Lane; "Adding ipv6 support for all sites for protoproxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10111 [12:58:07] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/10111 [12:58:16] all the blame [12:59:19] brb [13:00:38] New patchset: Ryan Lane; "Adding ipv6 support for all sites for protoproxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10111 [13:01:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10111 [13:01:09] \o/ [13:01:16] mark: review? ^^ [13:01:22] yeah I will [13:01:29] damn, just got a bag of crisps :P [13:01:50] I need to get some food :( [13:03:21] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [13:04:37] New review: Mark Bergsma; "Many service IPs are wrong for the different sites." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/10111 [13:05:03] how so? [13:05:09] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [13:05:10] 860 for all sites [13:05:16] see inline comments for an example [13:05:18] damn it [13:05:19] right [13:07:02] New patchset: Ryan Lane; "Adding ipv6 support for all sites for protoproxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10111 [13:07:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10111 [13:08:27] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.165 seconds [13:15:30] notpeter: regarding bellin, going to have to replace the main board. The problem did not follow the DIMM. Should get the new board today. [13:16:38] New review: Mark Bergsma; "The existing ipv6 address *is* in use in Amsterdam (it's in DNS there for certain providers), so we ..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/10111 [13:16:51] Ryan_Lane: perhaps a better idea to do this change in small steps at a time? ;) [13:17:27] which existing ip address? [13:17:28] upload? [13:17:30] yes [13:17:34] I can change that back [13:17:50] hrm [13:17:53] that one is gonna be a pain in the butt [13:17:56] perhaps I just disable that now [13:18:00] didn't know if you had changed it already [13:18:08] haven't [13:18:11] that's been like that for years [13:18:12] why not put it in the pipe? [13:18:18] what? [13:18:20] then disable the pipe later? [13:18:32] I think i'll disable that ipv6 thing now [13:18:36] ok [13:18:40] then when dns ttl expires for that, we can make this change [13:18:48] * Ryan_Lane nods [13:19:05] people won't have ipv6 for a few days [13:19:08] and will complain about that ;) [13:19:12] heh [13:19:14] but in 2 days they should be happy [13:19:27] we're going backwards in ipv6 support [13:19:38] briefly ;) [13:19:46] that should be our report [13:20:21] back... [13:20:29] oh [13:20:32] there's an even easier way to do that [13:20:41] that thing i still called upload.esams instead of upload-lb.esams [13:20:47] wb paravoid [13:20:55] that took a while [13:21:38] hm. can I put more than one ipv6 address in there.... 
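For context on the review of r/10111 above ("Many service IPs are wrong for the different sites" / "860 for all sites"): a hypothetical sketch of the per-site service-IP layout being discussed. The hash name and the esams prefix are assumptions; the pmtpa/eqiad prefixes, the ed1a -lb subnet, and the wikimedia-lb (::) and mobile-lb (::c) suffixes are the values quoted in the conversation.

    # Sketch only, not the merged change: one IPv6 service IP per site per service.
    # pmtpa = 2620:0:860, eqiad = 2620:0:861 (esams = 2620:0:862 is assumed), -lb subnet ed1a.
    $ipv6_service_ips = {
        'pmtpa' => { 'wikimedia-lb' => '2620:0:860:ed1a::', 'mobile-lb' => '2620:0:860:ed1a::c' },
        'eqiad' => { 'wikimedia-lb' => '2620:0:861:ed1a::', 'mobile-lb' => '2620:0:861:ed1a::c' },
        'esams' => { 'wikimedia-lb' => '2620:0:862:ed1a::', 'mobile-lb' => '2620:0:862:ed1a::c' },
    }
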
[13:21:41] paravoid: i'm gonna disable selective answer for a few days so it's not in our way during our changes [13:22:00] maybe I can just have both [13:22:04] naah [13:22:05] let's not [13:22:07] ok [13:22:09] i want to get rid of it anyway [13:22:16] might as well do that now [13:22:27] where are we? [13:22:48] what can I do? [13:22:56] you can tell me what you did with the lvs balancers [13:23:00] (after having a shower, since I stink atm) [13:23:06] TMI [13:23:19] d-i partitioning wasn't ready [13:23:25] and I was one of the last people left at the venue [13:23:30] so aborted and left [13:23:33] ok [13:23:44] Ryan told me it'd be risky to leave lvs1 down for a lengthy period of time [13:23:53] in case it's pair had a fault [13:23:59] so I just put it back into prod with lucid [13:24:05] good [13:24:36] so, I can do that [13:25:00] please do [13:25:03] make sure lvs1 is not used in any way [13:25:22] nice [13:25:28] dns scenario pmtpa-down still has yaseo in it [13:25:32] I think i'm not gonna touch it now :P [13:25:49] no idea what yaseo is [13:25:57] :| [13:26:04] a squid cluster in south korea we got rid of in 2008 [13:26:40] lol! [13:27:16] !log Changed upload.esams.wikimedia.org CNAME to upload-lb.esams, effectively disabling the IPv6 selective answer script [13:27:20] Logged the message, Master [13:27:50] heh [13:29:03] in an hour or so we can touch the esams ssl servers then [13:29:11] time for a shower [13:30:12] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [13:33:30] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.153 seconds [13:36:44] yaseo? that's been dead how many years now? [13:37:01] well, I'm going to go get some food, then [13:40:31] ok [13:47:49] New patchset: Mark Bergsma; "Add variable $ipv6_hosts to enable/disable IPv6 on certain PyBal hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10113 [13:48:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10113 [13:48:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10113 [13:48:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10113 [13:53:19] New patchset: Mark Bergsma; "Define the IPv6 LVS service IP for bits.pmtpa in the LVS hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10114 [13:53:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10114 [13:54:05] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10114 [13:54:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10114 [14:22:57] argh [14:22:59] ipvsadm is stupid [14:25:08] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [14:27:09] how come? 
[14:27:19] I never checked whether ipvsadm actually wroked [14:27:20] worked [14:27:26] it doesn't use [ ] around the realserver [14:27:28] just for the service [14:28:14] New patchset: Mark Bergsma; "ipvsadm doesn't use square brackets for realservers" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10116 [14:28:56] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10116 [14:29:04] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10116 [14:29:06] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10116 [14:29:17] hi paravoid, I had a funny thing happen over the weekend. One of the candidates for the QA Engineer position, Alister Scott, was experimenting with doing some browser automation on http://commons.wikimedia.beta.wmflabs.org/ and I think you blocked him for editing with nonsense. I've spoken with him, can you unblock him (or give me the privs to do it?) [14:29:50] chrismcmahon: no, I didn't block anyone [14:29:56] chrismcmahon: and I don't know how to block or unblock [14:30:11] paravoid, one sec... [14:33:41] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [14:35:19] there's nothing in the block log on that wiki [14:35:20] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.293 seconds [14:35:20] http://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special%3ALog&type=block&user=&page=&year=&month=-1&tagfilter= [14:36:26] I have no idea what you're talking about :-) [14:36:45] I haven't touched mediawiki *at all* on beta [14:36:56] and I've never blocked anyone in any other mean [14:37:06] and I've done nothing wrt to beta this weekend [14:37:26] if I can do something to help I'd be happy to, although I don't know how mediawiki works [14:39:22] paravoid there is a user "Orashmatash" doing a lot of maintenance: http://en.wikipedia.beta.wmflabs.org/wiki/Special:RecentChanges but he doesn't have a user page or talk page [14:39:55] paravoid: I thought that might be you, seems I was wrong [14:40:09] oh, okay :) [14:40:12] nope, not me [14:40:15] I either go by faidon or paravoid [14:40:44] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.106 seconds [14:41:32] petan|wk: do you happen to know who "Orashmatash" is? http://en.wikipedia.beta.wmflabs.org/wiki/Special:RecentChanges [14:41:59] chrismcmahon: no [14:42:26] I will try to find it out [14:42:55] root@brewster:/srv/wikimedia/incoming# reprepro ls pybal [14:42:55] pybal | 0.1+r74215 | lucid-wikimedia | amd64, source [14:42:56] pybal | 1.00 | precise-wikimedia | amd64, i386, source [14:43:27] petan|wk: I want to get the account there for Alister Scott unblocked, but I'd like to get a message to whoever Orashmatash is about that. [14:43:39] can you please take this discussion elsewhere? [14:43:47] chrismcmahon: ok, let's move to -labs [14:43:48] New patchset: Mark Bergsma; "pybal (1.00) precise; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10117 [14:44:17] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10117 [14:44:19] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/10117 [14:44:39] paravoid: so... 
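A hedged illustration of the ipvsadm detail fixed above in r/10116 ("ipvsadm doesn't use square brackets for realservers"): the virtual service address is written with brackets, the realserver address without. The service IP below is the bits.pmtpa address quoted earlier in the log; the realserver address is a documentation-prefix placeholder, and the scheduler/forwarding flags are ordinary ipvsadm usage rather than anything taken from PyBal.

    # Add an IPv6 virtual service (brackets around the service address)...
    ipvsadm -A -t [2620:0:860:ed1a::a]:80 -s wrr
    # ...then a realserver for it, written without brackets (2001:db8::10 is a placeholder).
    ipvsadm -a -t [2620:0:860:ed1a::a]:80 -r 2001:db8::10 -g
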
[14:44:51] the new wikimedia-lvs-realserver has been deployed everywhere by puppet [14:45:12] sq67-70 (bits.pmtpa) and ssl1-4 have ipv6 addresses bound [14:45:15] YAY [14:45:21] ipv6 service IPs have been allocated for all LVS services [14:45:32] and sq67-70 have those service IPs bound to loopback as well [14:45:47] pybal has been rebuilt for precise and is in the apt repo [14:45:52] wow, you rock [14:45:57] so it's now waiting on your reinstall of lvs1 [14:46:04] i'll help you with that in a bit ;) [14:46:04] so, there's a partman recipe for amslvs* already [14:46:08] I can begin with that [14:46:11] but I'm also getting visitors at some point [14:46:14] yeah [14:46:20] then I can do the manual install on lvs1 [14:46:26] obviously partitioning is not very important for lvs [14:46:28] oh yeah [14:46:31] do get-selections --installer [14:46:34] I also modified the pybal template to ignore ipv6 addresses [14:46:39] get the values and put them on puppet [14:46:41] so we can add them to the service ip hash safely [14:46:42] then reinstall the rest [14:46:47] that's the plan atm, I was about to begin [14:46:50] only if you add the lvs host to the $ipv6_hosts variable in that class [14:46:55] they will be added to the pybal config [14:47:04] oh? how come? [14:47:11] I modified the manifest [14:47:21] otherwise the current lucid pybals will try to add ipv6 services [14:47:27] I don't think that will fare well ;-) [14:47:28] ah, right [14:47:37] so once lvs1 has been reinstalled, and pybal is up on it [14:47:42] we can add it to the $ipv6_hosts list [14:47:46] amslvs! [14:47:49] and then ipv6 lvs services will be added [14:47:50] right, amslvs [14:48:04] all in all, going well [14:48:16] screw hackathons, working from home is just better ;-) [14:48:20] hahaha [14:48:26] well, it was a noisy hackathon [14:48:31] yeah exactly [14:48:35] so once we have tested one lvs service [14:48:40] I'll add ips to the other sites and stuff as well [14:48:46] no point in doing that now [14:49:00] actually [14:49:08] that's one reason why I'd like you to start on pmtpa lvs first [14:49:12] there we have inactive realservers [14:49:13] in esams we don't [14:49:36] hm, okay then [14:49:40] manual install it is :-) [14:49:45] partitioning that is [14:49:50] not that crazy [14:51:15] yeah [14:51:21] you can use one of the standard recipes [14:51:27] install with LVM, root only, etc [14:51:32] it doesn't matter [14:51:38] there's no stored data, everything is brought in by puppet [14:51:45] as long as it doesn't die with one broken disk I'm happy [14:51:58] lvs1 has one disk :-) [14:52:02] not sure about the rest [14:52:04] raid1 I think [14:52:06] hw raid [14:52:12] ah, maybe [14:52:16] I just saw an sda [14:52:16] i'm pretty sure [14:52:20] so don't worry about sw raid [14:52:31] okay, RAC has hanged [14:52:44] have to reset it from within linux, don't remember how [14:52:55] just do "racadm racreset" in the drac [14:52:58] mark: ready for me to make the ssl changes? [14:53:04] Ryan_Lane: i'd break it up in smaller changes [14:53:08] this is kind of silly [14:53:16] no, I can't SSH [14:53:17] why? [14:53:20] it either works or breaks everything [14:53:24] it only affects one service [14:53:25] Trying 10.1.4.1... [14:53:25] Connected to lvs1.mgmt.pmtpa.wmnet. [14:53:25] paravoid: oh, odd [14:53:26] Escape character is '^]' [14:53:29] hangs there [14:53:33] darn [14:53:44] the only service it can break is upload [14:53:53] Ryan_Lane: why? 
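One small how-to from the exchange above ("do get-selections --installer, get the values and put them on puppet"): after the manual lvs1 install, the installer's debconf answers can be dumped and the relevant ones folded back into the autoinstall config. A sketch, assuming the standard debconf-utils tool; the output path is made up.

    # On the freshly installed host: capture the installer's answers...
    debconf-get-selections --installer > /tmp/lvs1-installer-selections.txt
    # ...and pick out the partitioning answers worth preseeding for the other amslvs boxes.
    grep -i partman /tmp/lvs1-installer-selections.txt
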
[14:53:56] I can test the change on ssl1 first, which is depooled [14:54:01] that sounds fair [14:54:14] I wish we could have environments [14:54:38] then I could merge this in without worrying if the rest will pull the change [14:55:06] !log disabling puppet on all ssl hosts [14:55:11] Logged the message, Master [14:55:36] i'm going to add static routes to the routers now [14:55:38] deactivated [14:58:38] hrm [14:58:44] i'll need ipv6 addresses for the balancers first ;) [14:58:46] i'll wait with that [15:01:35] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10111 [15:01:38] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10111 [15:02:43] !log depooling ssl1001 and ssl3001 [15:02:47] Logged the message, Master [15:03:14] RECOVERY - HTTPS on ssl1 is OK: OK - Certificate will expire on 08/22/2015 22:23. [15:03:54] * paravoid sighs [15:04:00] installing ia32-libs on lvs1... [15:04:09] huh? [15:04:11] so that the binary racadm that I extracted from an RPM can work [15:04:31] (ipmitool bmc reset cold didn't work) [15:04:40] so... [15:04:44] chris is in the data center right now :) [15:04:45] mark: well, it seems that all the IPs were added and nginx starts [15:04:46] might be easier! [15:04:56] Ryan_Lane: cool, i'll check it out in a bit [15:05:11] lemme also do this in eqiad and esams [15:05:55] uhh okay [15:06:00] how do I get in contact with him? [15:06:08] he's here in the channel [15:06:10] cmjohnson1: ping [15:06:24] worked in eqiad [15:06:30] can you help paravoid? [15:07:06] hi! [15:07:14] lvs1's drac is hanged [15:07:22] can you physically powercycle? [15:07:41] thanks a lot! [15:08:05] Logged the message, Master [15:08:20] I'd say remove power cables and re-add [15:08:32] power button might not do it [15:09:54] perfect [15:13:17] PROBLEM - Host lvs1 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:33] !log Added new IPv6 LVS prefixes to all routers for uRPF filters; BGP import filters still need adjusting for dual-family sessions [15:13:37] Logged the message, Master [15:14:48] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: host 208.80.152.197, sessions up: 6, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [15:14:58] (that's lvs1, ignore that) [15:15:15] awesome, tahnks! [15:16:08] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [15:16:18] and down again :-) [15:16:28] (because I'm resetting it) [15:17:44] Ryan_Lane: why are the ipv6 vhosts listening on ipv4 ips also again? [15:18:00] I was wondering that myself too [15:18:01] I don't remember [15:18:24] on protoproxies you mean, right? [15:18:32] yes [15:18:35] because it also does ssl? [15:18:48] # IPv6 proxying [15:18:48] server { [15:18:48] listen 208.80.152.201:80; [15:18:48] listen [2620:0:860:ed1a::1]:80; [15:18:50] RECOVERY - Host lvs1 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [15:18:52] ah [15:19:01] I don't remember.... 
[15:19:05] lemme think for a minute [15:19:12] its probably documented [15:19:15] you were quite rigid about that [15:19:32] well, that was for ssl [15:19:40] I don't remember a good reason for ipv6 [15:19:50] let's try it without [15:19:52] now is the time ;-) [15:19:53] ok [15:19:55] yep [15:19:56] gimme a sec [15:20:10] if there was a reason, it should have been documented [15:20:17] yeah take your time [15:20:26] we have 2 days left ;) [15:20:29] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: host 208.80.152.197, sessions up: 6, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [15:20:34] well, it's not documented, so obviously there's no good reason ;) [15:20:40] hehe [15:21:00] !log reinstalling lvs1 with precise [15:21:00] oh [15:21:04] Logged the message, Master [15:21:06] because the template doesn't check [15:21:23] if it's in proxy_addresses, it gets added [15:21:58] hm, how do I check to see if it's an ipv6 address or not. [15:22:26] PROBLEM - SSH on lvs1 is CRITICAL: Connection refused [15:26:47] RECOVERY - SSH on lvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:28:26] is there anything in the new row C yet? [15:28:32] they're going in that row [15:28:37] where was that SSH key that Rob was telling me about? [15:28:38] sounds fine [15:28:45] New patchset: Ryan Lane; "Only add ipv6 addresses to the ipv6 proxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10118 [15:28:45] paravoid: for installs? [15:28:47] to login to newly-provisioned servers [15:28:48] yes [15:28:49] sockpuppet's root dir [15:28:54] ah, sockpuppet [15:28:58] was looking at it on brewster [15:29:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10118 [15:29:52] review, anyone? [15:29:56] ok [15:31:06] I guess that works [15:31:12] don't know ruby's continue syntax and such ;) [15:31:33] I had to look it up ;) [15:31:39] then it's probably fine ;) [15:31:41] go ahead [15:31:43] I also looked up how to do a one line if [15:32:03] hehe [15:32:16] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10118 [15:32:18] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10118 [15:32:35] I like how we're not using autosigning but our instructions do not mention how to verify the key... [15:33:08] that's still better than autosigning ;) [15:33:38] the current crap with sockpuppet as CA first twice is a bit annoying too [15:33:41] should really fix that soon [15:33:54] ok. seems that worked [15:35:46] hooray for clean configs [15:36:13] :) [15:36:58] err: Could not retrieve catalog from remote server: Server hostname 'sockpuppet.pmtpa.wmnet' did not match server certificate; expected sockpuppet.pmtpa.wmnet [15:37:04] oh puppet, you're so very useful [15:37:28] hahaha [15:37:29] so the first two times, use --server sockpuppet (even though it complains the 2nd time) [15:37:31] then switch to stafford [15:37:40] yeah yeah, I figured it out [15:37:43] ok ;) [15:37:47] I was just pointing out the "funny" error messge [15:37:51] yes [15:37:56] so annoying [15:41:09] argh, precise has privacy extensions enabled by default [15:41:42] does that impact lvs? 
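For context on r/10118 above ("Only add ipv6 addresses to the ipv6 proxy") and the "ruby's continue syntax" / one-line-if remarks: a hypothetical sketch of that kind of template guard, not the actual protoproxy template. The variable name and the naive colon test are assumptions; Ruby's "continue" is "next".

    <%# Sketch: emit listen lines only for the IPv6 literals in proxy_addresses. %>
    server {
    <% proxy_addresses.each do |addr| -%>
    <% next unless addr.include?(":") -%>
        listen [<%= addr %>]:80;
    <% end -%>
    }
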
[15:42:14] oh damn [15:42:33] need to make sure pybal doesn't announce the v6 addresses over bgp [15:42:37] since that will surely break ;) [15:42:40] hmm can do that in the template [15:45:13] New patchset: Mark Bergsma; "Hard disable BGP for IPv6 LVS services, PyBal doesn't support that yet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10119 [15:45:36] New patchset: Faidon; "LVS: remove dependency on Linux 2.6.36" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10120 [15:45:39] please review [15:45:43] ah yes [15:45:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10119 [15:45:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10120 [15:47:07] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10119 [15:47:09] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10119 [15:47:11] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10120 [15:47:13] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10120 [15:47:26] I have another bug, don't merge yet [15:47:30] ok [15:52:01] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [15:54:43] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [15:55:15] New patchset: Faidon; "Make interface_setting to work when having inet6 too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10121 [15:55:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10121 [15:55:46] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [15:56:40] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [15:56:41] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10121 [15:56:45] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10121 [15:58:19] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:00:25] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [16:01:01] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 6.912 seconds [16:02:22] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:06:52] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:07:33] New patchset: Ryan Lane; "Enable ipv6 on the ssl hosts for all datacenters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10122 [16:08:23] New patchset: Faidon; "autoinstall: make early_command work with precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10123 [16:08:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10122 [16:08:25] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10123 [16:08:25] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10122 [16:08:25] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10122 [16:08:46] single space commit, woo! [16:09:34] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 8.337 seconds [16:09:43] where'd the stupid bot go? [16:09:43] !log restarting ircecho on manganese [16:09:43] that didn't seem to help [16:09:43] hm. maybe a split? [16:09:47] Logged the message, Master [16:09:51] yep [16:09:56] Ryan_Lane: merged puppet btw [16:10:03] 10123 is live [16:10:10] a split [16:10:10] heh [16:10:16] thanks [16:10:22] was looking for it :) [16:11:40] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.162 seconds [16:13:34] re-reinstalling lvs1 :-) [16:14:49] PROBLEM - SSH on lvs1 is CRITICAL: Connection refused [16:15:22] New patchset: ArielGlenn; "make worker script take a specific wiki arg so en dumps can be like the rest" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/10124 [16:16:46] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:16:50] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10124 [16:16:52] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/10124 [16:18:07] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 1.102 seconds [16:19:28] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:20:49] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27409 bytes in 9.729 seconds [16:23:58] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:52] RECOVERY - SSH on lvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:26:40] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27542 bytes in 5.495 seconds [16:34:55] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [16:39:39] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.185 seconds [16:42:57] PROBLEM - SSH on lvs1 is CRITICAL: Connection refused [16:47:00] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [16:48:56] !log upgraded kernel on db1047 / analytics [16:49:00] Logged the message, Master [16:50:27] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 101010 seconds [16:50:36] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 101021 seconds [16:52:06] PROBLEM - Host lvs1 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:09] RECOVERY - SSH on lvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:56:18] RECOVERY - Host lvs1 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [17:02:01] New review: Jeremyb; "I read the consensus a little in google translate and I think it's good." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/10084 [17:05:27] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [17:12:31] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.161 seconds [17:16:51] New patchset: Faidon; "autoinstall: also fix netboot.cfg to work with precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10126 [17:17:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10126 [17:18:13] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10126 [17:18:15] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10126 [17:20:54] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [17:23:09] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [17:24:21] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:35:11] New patchset: Mark Bergsma; "Add IPv6 addresses on reinstalled (Precise) LVS servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10128 [17:35:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10128 [17:35:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10128 [17:35:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10128 [17:54:34] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [17:56:40] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [17:58:56] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [17:59:04] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [18:01:19] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [18:05:22] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [18:12:25] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: host 208.80.152.197, sessions up: 6, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [18:13:10] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.162 seconds [18:14:58] PROBLEM - Apache HTTP on mw64 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [18:15:16] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [18:17:31] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:18:52] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27542 bytes in 4.820 seconds [18:22:01] RECOVERY - Apache HTTP on mw64 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [18:27:43] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [18:32:17] New patchset: Mark Bergsma; "Allow IPv6 LVS services on lvs1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10132 [18:32:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10132 [18:32:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10132 [18:32:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10132 [18:34:01] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [18:43:55] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.268 seconds [18:46:28] PROBLEM - Puppet freshness on bellin is CRITICAL: Puppet has not run in the last 10 hours [18:53:28] New patchset: Hashar; "tests for databases configurations" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10133 [18:53:33] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10133 [18:54:43] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [18:55:55] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [19:26:55] New patchset: Mark Bergsma; "Add IPv6 LVS service IPs for upload.pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10137 [19:27:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10137 [19:27:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10137 [19:27:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10137 [19:30:30] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [19:30:30] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [19:30:30] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [19:32:10] !log force running puppet on ssl servers [19:32:15] Logged the message, Master [19:38:11] New patchset: Mark Bergsma; "Add Squid IPv6 IPs to a special 'ipv6' LVS service, similar to https" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10139 [19:38:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10139 [19:39:31] New patchset: Mark Bergsma; "Add Squid IPv6 IPs to a special 'ipv6' LVS service, similar to https" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10139 [19:39:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10139 [19:39:53] man lvs is getting complicated ;) [19:40:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10139 [19:40:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10139 [19:47:46] New patchset: Mark Bergsma; "Add 'ipv6' LVS service for ipv6 protoproxies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10140 [19:48:08] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10140 [19:48:12] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10140 [19:49:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10140 [19:49:42] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27409 bytes in 9.311 seconds [20:06:49] New patchset: Mark Bergsma; "Bind IPv6 LVS service IPs to lvs1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10141 [20:07:12] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/10141 [20:07:43] New patchset: Mark Bergsma; "Bind IPv6 LVS service IPs to lvs1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10141 [20:08:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10141 [20:08:15] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10141 [20:08:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10141 [20:12:41] New patchset: Mark Bergsma; "Fix LVS service IP lookup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10143 [20:13:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10143 [20:13:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10143 [20:13:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10143 [20:13:42] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [20:15:03] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [20:16:11] New patchset: Mark Bergsma; "Fix duplicate key in LVS service IP hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10145 [20:16:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10145 [20:16:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10145 [20:16:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10145 [20:23:18] PROBLEM - mysqld processes on es4 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [20:26:00] PROBLEM - Host es4 is DOWN: PING CRITICAL - Packet loss = 100% [20:27:21] RECOVERY - Host es4 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [20:28:10] hiya [20:29:35] internet stopped working [20:29:45] using my host's internet stick [20:29:49] trying to get mine registered [20:29:50] ouch [20:30:04] yep :( [20:30:09] you any idea why ssl1 had the wrong ipv6 subnet configured? [20:30:17] 2620:0:862:2 instead of :1 [20:30:22] it did? [20:30:29] did you do anything for it? [20:30:32] is it wrong in puppet? [20:30:38] well it can't be [20:30:42] it takes values from facter [20:31:01] however once something is wrong, it can stay wrong [20:31:02] that was assigned to lo? 
[20:31:04] so may be nothing [20:31:06] no [20:31:08] eth0 [20:31:11] main server ip, v6 [20:31:12] ah [20:31:17] I didn't do anything there [20:31:20] alright [20:31:39] it's strange, since I removed your old ::80:2 ip (with that wrong subnet) earlier and rebooted the box [20:31:43] ouch. this connection is terrible too [20:31:47] so it seems strange that it now came back with the new v4 encoding in it [20:31:48] hehe [20:31:51] PROBLEM - MySQL Slave Delay on es4 is CRITICAL: CRIT replication delay 14051 seconds [20:32:28] also [20:32:32] (I put this in the RT ticket) [20:32:39] we need to make nginx listen on the main ipv6 ip [20:32:46] why's that? [20:32:48] port 80 [20:32:48] oh [20:32:50] right [20:32:52] otherwise pybal can't contact [20:32:54] yeah [20:32:56] oh [20:32:58] no not ipv6 [20:32:59] just like the ssl ones [20:32:59] ipv4 is fine [20:33:02] is that what that ip did? [20:33:05] which you removed earlier :D [20:33:09] hahaha [20:33:12] pybal contacts over ipv4 only [20:33:13] maybe so [20:33:14] (for now) [20:33:21] -_- [20:33:24] well... [20:33:27] so it needs to listen on port 80 and 443 of the *main server ip * [20:33:28] I'll put that back in ;) [20:33:33] check if it's that though [20:33:38] I thought it was a different ip [20:33:39] not sure now [20:33:48] oh. no. it was the service ip [20:33:51] which is wrong [20:33:52] yeah that's no help [20:34:02] so we need a vhost which does nothing but listen on those ports [20:34:07] just for pybal [20:34:18] well, not necessarily that [20:34:31] I'll do whatever I'm doing for ssl [20:34:38] which is? [20:34:42] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [20:35:12] listen <%= ipaddress %>:443 [20:35:12] so, listen <%= ipaddress %>:80 [20:35:22] ok [20:35:24] yes that will work [20:35:29] it won't work for ipaddress6 in the future [20:35:35] since that takes some lvs service ip [20:35:40] ipaddress6_eth0 would work though [20:35:52] ipaddress => 208.80.152.120 [20:35:52] ipaddress6 => 2620:0:860:ed1a::b [20:35:52] ipaddress6_eth0 => 2620:0:860:1:208:80:152:120 [20:36:04] no? because puppet is broken? [20:36:07] yeah [20:36:27] well [20:36:28] bleh [20:36:29] also the kernel [20:36:30] ipaddress6_eth0 will do, I guess [20:36:30] whatever [20:36:39] you don't need that now [20:36:43] pybal doesn't do v6 for monitoring yet [20:36:43] New patchset: Ryan Lane; "Add the server's IP to the listen addresses, so that it can be used by pybal." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10148 [20:36:44] I kno [20:36:46] w [20:36:48] ok [20:36:58] we can write a custom fact to fix that better later [20:37:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10148 [20:37:12] now to figure out how to register my sim.... [20:37:43] paravoid: please don't reinstall more LVS servers yet [20:38:11] and now google's translation service is breaking :( [20:45:09] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [20:47:23] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.161 seconds [20:48:41] this translation thing in chrome is really the most useful thing ever [20:50:11] mutante: you on?
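A minimal sketch of the nginx change being discussed (Gerrit change 10148, "Add the server's IP to the listen addresses, so that it can be used by pybal"), assuming an ERB-templated vhost and the Puppet facts quoted above; this is not the actual template from operations/puppet, just an illustration of binding the listen directives to the host's own IPv4 address so pybal's health checks can reach the box directly:

    # Sketch only; assumes ssl_certificate/ssl_certificate_key are set elsewhere.
    server {
        # Listen on the main server IPv4 address (the "ipaddress" fact), so
        # pybal can poll this host on ports 80 and 443 in addition to the
        # LVS service IP it serves.
        listen <%= ipaddress %>:80;
        listen <%= ipaddress %>:443 ssl;

        # For IPv6 the per-interface fact would be the safer choice, since
        # "ipaddress6" can resolve to an LVS service IP on these hosts; pybal
        # does not monitor over v6 yet, so this stays commented out:
        # listen [<%= ipaddress6_eth0 %>]:443 ssl;

        server_name _;
        return 204;   # placeholder; the real vhost terminates SSL and proxies to the backends
    }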
[20:50:33] I can't read the phone number attached to this sim :D [20:58:05] New review: Demon; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10133 [20:58:07] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10133 [21:03:06] !log restarting nginx on all ssl hosts [21:03:10] Logged the message, Master [21:03:17] RECOVERY - MySQL Slave Delay on es4 is OK: OK replication delay 0 seconds [21:04:25] !log repooling ssl1, ssl1001, ssl3001 [21:04:29] Logged the message, Master [21:04:56] crap [21:05:05] I should have merged and pushed out that change before doing that [21:05:24] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10148 [21:05:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10148 [21:07:04] !log force running puppet on all ssl hosts again [21:07:08] Logged the message, Master [21:16:13] !log restarting nginx on all ssl boxes again [21:16:18] Logged the message, Master [21:33:08] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:34:30] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27409 bytes in 0.119 seconds [21:46:02] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:47:23] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27409 bytes in 0.110 seconds [21:48:57] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:48:57] PROBLEM - LVS HTTP on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:06] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:06] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:06] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:06] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:06] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:24] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:33] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:33] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:33] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:33] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:33] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:34] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:34] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:34] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:35] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:36] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:42] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:42] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:42] PROBLEM - Apache HTTP on mw17 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [21:49:42] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:42] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:43] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:43] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:18] PROBLEM - Apache HTTP on srv190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:18] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:27] PROBLEM - LVS HTTP on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:27] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:27] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:36] PROBLEM - Apache HTTP on srv245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:36] PROBLEM - Apache HTTP on srv227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:36] PROBLEM - Apache HTTP on srv196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:36] PROBLEM - Apache HTTP on srv242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:36] PROBLEM - Apache HTTP on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:37] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:37] PROBLEM - Apache HTTP on srv208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:45] PROBLEM - Apache HTTP on srv226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:45] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:45] PROBLEM - Apache HTTP on srv262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:45] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:45] PROBLEM - Apache HTTP on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:46] PROBLEM - Apache HTTP on srv267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:46] PROBLEM - Apache HTTP on srv265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:47] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:47] PROBLEM - Apache HTTP on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:48] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:48] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:49] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:49] PROBLEM - Apache HTTP on srv274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:50] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:50] PROBLEM - Apache HTTP on srv285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:51] PROBLEM - Apache HTTP on srv286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:51] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:52] PROBLEM - LVS HTTP on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:54] mark? 
[21:50:54] PROBLEM - LVS HTTPS on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:54] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:54] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:03] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:03] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:03] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:03] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:03] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:04] PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:04] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:04] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:05] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:12] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:12] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:12] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:12] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:12] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:12] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:13] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:13] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:14] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:21] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:21] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:21] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:21] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:21] PROBLEM - Apache HTTP on srv197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:22] PROBLEM - Apache HTTP on srv198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:22] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:23] PROBLEM - Apache HTTP on srv202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:23] PROBLEM - Apache HTTP on srv209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:30] PROBLEM - LVS HTTPS on foundation-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out [21:51:57] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:06] PROBLEM - LVS HTTP on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [21:52:06] PROBLEM - Apache HTTP on srv276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:15] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:15] PROBLEM - Apache HTTP on srv225 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [21:52:15] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:15] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:15] PROBLEM - Apache HTTP on srv234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:28] mark: that is not good :/ [21:52:33] weird [21:52:40] could it be some network switch? [21:53:49] nope [21:54:45] dberror.log got a ton of db related issues [21:54:54] about 10.0.0.228 [21:54:54] yup [21:54:59] should be resolved [21:55:04] good [21:55:10] I was just done for the day [21:55:50] * mark off again ;) [21:55:54] things should be ok now [21:55:58] thanks [21:56:02] * hashar waves at mark [21:56:04] have a good night [21:56:19] es4 not happy then? [22:01:11] yeah :/ [22:04:38] binasher: I wish we used something else for es sometimes [22:05:00] the es is a fucking nightmare [22:05:11] ERROR 144 (HY000) at line 382: Table './arwiki/blobs_cluster22' is marked as crashed and last (automatic?) repair failed [22:05:24] having this all in myisam is terrible [22:06:51] binasher: do you know of a key/value store with decent compression, where you could perhaps hint that "this blob derives from that blob"? [22:06:54] new stuff will be in innodb whenever the hardware gets here but hopefully we can replace mysql entirely [22:07:14] all our custom compression code is scary [22:07:20] yeah it is [22:07:22] and any mistakes can cause dataloss (as in the past) [22:08:50] binasher: funny, my myisam search table crashed on my testwiki [22:08:59] * AaronSchulz did TRUNCATE to "fix" :p [22:09:13] what would you want the 'derives from other blob' for? storing diffs instead of full revisions, or just having them stored on the same or neighbor pages for faster diffs? [22:10:19] well, the delta compression greatly reduced the space...though at the expense of read time, so it was only done on the older, less-used, stuff [22:10:39] though, if "disk space is cheap", we can just compress single objects and not care [22:11:20] i'll have to read more on innodb compression / talk to domas - facebook has been heavily developing it. i think its compression is on a per-page basis [22:11:21] If I recall it was about 20x compression [22:11:47] blobs often won't be stored w/ the primary key and will get their own page though, so it might not be the best fit [22:14:41] it might be though, depending on what the actual avg size of the blobs is as fully compressed [22:15:38] i was talking to one of the couchdb cofounders at the hackathon [22:16:45] it has a big flaw around b+tree compaction that would make it annoying for some applications, but it could be manageable for the es [22:17:21] it natively supports gz compression for its blob equivalent ("attachments") [22:18:38] no compressing of related blobs together though [22:19:55] how much space are we using on the recompressed es stores now? [22:20:52] i don't think anything recent has been recompressed [22:21:42] right, since it would hurt performance [22:21:54] if you have to fetch 1000 blobs to read one ;) [22:22:19] I'm curious how much we can just suck up using more space and migrate to something else [22:22:44] * AaronSchulz wonders how big those binlogs get... [22:23:48] 3-5GB a day [22:25:37] mark: back [22:28:43] hah! another one working late [22:39:24] !log stopping mysql on es4.
all tables marked as having repair fails are in cluster22, resyncing just those from es1002 [22:39:28] Logged the message, Master [22:39:48] crap [22:49:58] !log started an experiment on es1004 - altering all es tables from myisam to innodb one at a time with file_per_table enabled [22:50:02] Logged the message, Master [23:13:03] paravoid: Looks like the ldap server for labs is down... am I the last one to notice that? [23:15:55] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/8344 [23:17:13] <^demon> andrewbogott: I don't think LDAP is down...otherwise couldn't login to gerrit. [23:17:36] Can you still log in as of two minutes ago? [23:17:50] I haven't tried gerrit yet, but can't sudo in labs instances and can't log into labsconsole. [23:18:21] <^demon> Just tried 2 seconds ago. [23:18:22] <^demon> WFM [23:18:30] New patchset: Asher; "setting myisam-recover to quick mode since es tables should never have deletes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10197 [23:18:52] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10197 [23:18:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10197 [23:18:54] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10197 [23:19:48] ^demon: labs console, too? [23:20:02] <^demon> Hmm, labsconsole is yelling at me on login. [23:20:05] <^demon> But gerrit works [23:20:42] "No Nova credentials found for your account." ? [23:21:00] <^demon> No, "Incorrect password entered. Please try again." [23:21:09] hm [23:22:07] <^demon> Can't sudo on one of my instances either. [23:22:10] <^demon> Hrm. [23:24:03] andrewbogott: same problems here. [23:24:22] Does anyone have a guess where those credentials are served from? I would've thought formey. [23:25:31] maybe virt0? [23:25:37] <^demon> virt0, iirc. [23:25:39] <^demon> Not formey [23:26:40] Grrrr the default vim settings on all production machines hurt my eyes [23:26:59] <^demon> server = ldaps://virt0.wikimedia.org [23:27:02] Well, nova on virt0 complains that it can't contact ldap. You think that ldap is also running on virt0? [23:27:06] Oh, fair enough :) [23:27:06] <^demon> According to gerrit.config [23:28:12] <^demon> Wonder if gerrit's still working cuz it caches credentials. [23:28:18] <^demon> (Which would be mildly scary, tbh) [23:28:42] I was hoping there would be an 'ldapd' in the service list. [23:32:28] New patchset: Asher; "fix perms" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10200 [23:32:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10200 [23:32:56] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10200 [23:33:02] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10200 [23:33:51] ^demon, ssmollett: better? [23:34:24] <^demon> Yup yup :) [23:34:51] fixed [23:36:03] Cool. No idea why it went down, though :( [23:36:52] <^demon> Unicorns weren't fed today? [23:42:14] !log re-enabled es4 monitoring. its currently our only es server without any tables marked as crashed / needing recovery, myisam recovery has been absent for all systems since the ms servers were migrated off of in nov 2011. 
(Sum of human knowledge * Rényi entropy = ES) [23:48:55] New patchset: awjrichards; "Make it possible to use wgExtensionAssetsPath for mobile frontend custom logos" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10203 [23:49:02] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10203 [23:51:34] New patchset: awjrichards; "Make it possible to use wgExtensionAssetsPath for mobile frontend custom logos" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10203 [23:51:40] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10203
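A rough sketch of the engine-conversion experiment !logged above for es1004 ("altering all es tables from myisam to innodb one at a time with file_per_table enabled"), assuming it was done with plain ALTER TABLE statements; the table name is borrowed from the crashed-table error quoted earlier and the exact procedure used is not shown in the log:

    -- Assumed form of the per-table MyISAM -> InnoDB conversion.
    -- innodb_file_per_table is a global server option; on older MySQL builds
    -- it may need to be set in my.cnf rather than at runtime.
    SET GLOBAL innodb_file_per_table = 1;

    -- Convert one external-storage blob table at a time, e.g. the cluster22
    -- table that was marked as crashed on es4:
    ALTER TABLE blobs_cluster22 ENGINE = InnoDB;
    -- ...and so on for each remaining blobs table, one table per ALTER.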