[00:03:12] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 00:03:07 UTC 2013
[00:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[00:11:21] !log pooled new https node ssl1005.wikimedia.org
[00:18:08] !log pooled new https node ssl1006.wikimedia.org (with hyperthreading enabled)
[00:20:52] PROBLEM - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:21:06] hm
[00:21:15] that's probably not a good sign
[00:21:52] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:22:00] -_-
[00:22:42] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61459 bytes in 0.006 second response time
[00:23:42] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:24:18] hm. it's bound on lo on ssl1005/6
[00:24:42] RECOVERY - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61459 bytes in 0.013 second response time
[00:24:44] (CR) Ori.livneh: [C: 1] "Yay. Thanks!" [operations/puppet] - https://gerrit.wikimedia.org/r/75265 (owner: Pyoungmeister)
[00:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 00:24:46 UTC 2013
[00:25:04] and it's listening...
[00:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours
[00:25:42] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61459 bytes in 0.003 second response time
[00:25:52] PROBLEM - LVS HTTP IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:26:30] notpeter_, you can merge it, then I'll help Niklas tomorrow with migrating to that box
[00:26:42] RECOVERY - LVS HTTP IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61460 bytes in 0.013 second response time
[00:29:02] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:29:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 00:28:54 UTC 2013
[00:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours
[00:29:52] RECOVERY - LVS HTTP IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61460 bytes in 0.035 second response time
[00:30:42] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:31:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 00:31:01 UTC 2013
[00:31:19] ok. wtf.
[00:31:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[00:31:42] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61460 bytes in 0.013 second response time
[00:31:45] PROBLEM - LVS HTTP IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:31:52] PROBLEM - LVS HTTP IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:32:50] LeslieCarr: ^^
[00:32:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 00:32:42 UTC 2013
[00:33:01] LeslieCarr: would there be routing issues with the new ssl hosts and ipv6?
[00:33:19] !log depooling ssl1005/6 due to ipv6 issues
[00:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[00:33:32] RECOVERY - LVS HTTP IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 601 bytes in 0.002 second response time
[00:33:42] RECOVERY - LVS HTTP IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61460 bytes in 0.010 second response time
[00:37:42] * Elsie wonders when Ryan_Lane will notice that morebots isn't here.
[00:37:45] :-)
[00:37:49] ugh
[00:37:50] heh
[00:38:43] !log pooled new https node ssl1005.wikimedia.org
[00:38:48] !log pooled new https node ssl1006.wikimedia.org (with hyperthreading enabled)
[00:38:54] Logged the message, Master
[00:38:59] !log depooling ssl1005/6 due to ipv6 issues
[00:39:04] Logged the message, Master
[00:39:13] Logged the message, Master
[00:41:22] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[00:55:12] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 00:55:09 UTC 2013
[00:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours
[00:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 00:57:41 UTC 2013
[00:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[00:59:22] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 00:59:13 UTC 2013
[00:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours
[01:01:34] !log enabled hyperthreading on ssl1005
[01:01:52] Logged the message, Master
[01:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 01:02:37 UTC 2013
[01:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[01:12:02] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:12:32] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.40 ms
[01:15:02] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:15:32] RECOVERY - Host wikinews-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.59 ms
[01:15:42] ...
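[Editor's note] For the IPv6 timeouts above ([00:20]-[00:33]): a minimal, hedged sketch of how one might confirm that an LVS service address is bound and answering on a new SSL terminator. The hostname and address below are placeholders, not taken from the log; the ip/ss/curl options are standard.
    ip -6 addr show dev lo                  # is the service address configured on loopback?
    ss -6 -tln                              # is anything listening on :443 for that address?
    curl -g -6 -skI 'https://[2620:0:861:ed1a::1]/' -H 'Host: en.wikipedia.org'   # placeholder address; -k since the cert will not match an IP literal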
[01:25:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 01:24:56 UTC 2013 [01:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [01:29:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 01:28:55 UTC 2013 [01:29:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 01:28:55 UTC 2013 [01:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [01:29:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [01:33:22] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 01:33:14 UTC 2013 [01:34:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:22] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [01:47:22] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:22] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:22] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:22] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:23] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:23] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [01:52:22] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours [01:55:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 01:54:53 UTC 2013 [01:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [01:58:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 01:58:10 UTC 2013 [01:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [01:58:22] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours [01:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 01:58:46 UTC 2013 [01:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [02:02:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 02:02:43 UTC 2013 [02:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [02:03:22] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours [02:08:03] !log LocalisationUpdate completed (1.22wmf11) at Tue Jul 23 02:08:03 UTC 2013 [02:08:16] Logged the message, Master [02:12:36] !log LocalisationUpdate completed (1.22wmf10) at Tue Jul 23 02:12:36 UTC 2013 [02:12:46] Logged the message, Master [02:19:22] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [02:21:20] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 23 02:21:20 UTC 2013 [02:21:30] Logged the message, Master [02:25:12] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 02:25:11 UTC 2013 [02:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [02:29:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 02:28:56 UTC 2013 [02:29:22] 
RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 02:29:12 UTC 2013 [02:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [02:29:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [02:33:22] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 02:33:20 UTC 2013 [02:34:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [02:55:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 02:54:54 UTC 2013 [02:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [02:57:52] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 02:57:48 UTC 2013 [02:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [02:59:12] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 02:59:09 UTC 2013 [02:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [03:02:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 03:02:42 UTC 2013 [03:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [03:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 03:24:51 UTC 2013 [03:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [03:28:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 03:27:53 UTC 2013 [03:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [03:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 03:28:44 UTC 2013 [03:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [03:31:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:32:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [03:33:32] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 03:33:25 UTC 2013 [03:34:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [03:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 03:54:48 UTC 2013 [03:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [03:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 03:57:40 UTC 2013 [03:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [03:59:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 03:58:57 UTC 2013 [03:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [04:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 04:02:35 UTC 2013 [04:02:42] PROBLEM - Varnish HTTP mobile-frontend on cp1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [04:03:33] RECOVERY - Varnish HTTP mobile-frontend on cp1060 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.002 second response time [04:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue 
Jul 23 04:24:47 UTC 2013
[04:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours
[04:27:52] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 04:27:45 UTC 2013
[04:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[04:29:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 04:28:57 UTC 2013
[04:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours
[04:32:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 04:32:37 UTC 2013
[04:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[04:55:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 04:54:52 UTC 2013
[04:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours
[04:58:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 04:58:11 UTC 2013
[04:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[04:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 04:58:47 UTC 2013
[04:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours
[05:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 05:02:39 UTC 2013
[05:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[05:05:52] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:07:52] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[05:13:52] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:42] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[05:34:41] !log authdns-update for cr1-ulsfo <-> cr2-eqiad, dns-auth-lb -> nsN(-new), ssl100x fixed IPv6
[05:34:52] Logged the message, Master
[05:44:27] !log pooling ssl1005/1006 for https/ipv6
[05:44:37] Logged the message, Master
[05:44:46] Ryan_Lane/LeslieCarr: ^^^
[05:44:55] oh cool
[05:45:09] i'll keep an eye out for a little bit
[05:45:11] and this is just fyi, don't you dare deal with it if it goes south
[05:45:27] 11pm and everything
[05:46:25] :)
[05:46:37] haha
[05:47:24] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=SSL+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[05:47:28] cluster looks a lot happier
[05:47:59] and even with weight 30/50, the old ones use double the CPU as the new ones
[05:48:00] paravoid: yep and the weight on the new two is 50, rather than 30 as well
[05:48:08] yeah I saw
[05:48:11] and HT helped by nearly 2x
[05:48:13] wow
[05:48:18] yep
[05:48:28] I was expecting it to be much better, I didn't expect 2x
[05:48:37] same
[05:48:43] so we should probably enable it on the old ones too
[05:48:53] it's already enabled on them
[05:49:01] look at /proc/cpuinfo
[05:49:06] ah
[05:49:13] ok, so then I should adjust weights
[05:49:33] 28% vs. 13%
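[Editor's note] "look at /proc/cpuinfo" ([05:49:01]): a small sketch of how hyperthreading can be confirmed from userspace. Nothing here is specific to the ssl100x hosts.
    grep -c ^processor /proc/cpuinfo                             # logical CPUs the kernel sees
    lscpu | egrep 'Thread\(s\) per core|Core\(s\) per socket'    # more than 1 thread per core means HT/SMT is on
    awk -F: '/siblings|cpu cores/ {print $1 $2}' /proc/cpuinfo | sort -u    # siblings > cpu cores also indicates HT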
[05:52:28] ok, sh may be crap
[05:52:39] weight 30/50 and yet they get the exact same bandwidth
[05:53:10] hmm, http://tracker.ceph.com/issues/5675 was fixed
[05:53:18] yes
[05:53:43] then we had http://tracker.ceph.com/issues/5691
[05:53:55] "
[05:53:55] All bucket ACLs lost after upgrade
[05:54:14] and I currently run the wip-5691 branch on ms-fe1002 iirc
[05:54:42] waiting for 0.67-rc...
[05:55:07] I wish their commits mentioned had links
[05:55:21] what do you mean?
[05:55:29] oh in the comments?
[05:55:36] there's an infobox on the right that has a link
[05:56:03] nice, I was already looking at my local checkout
[05:56:28] so the current plans are
[05:56:34] sounds like b/c fail
[05:56:38] install 0.67-rc when it gets released
[05:56:52] and carefully work towards re-enabling it in production
[05:57:01] at the same time, build a new swift cluster over some spare boxes in eqiad
[05:57:10] to have as a backup for when tampa gets decom'ed
[05:57:24] or when ceph fails again, whichever comes first :)
[05:57:30] when is tampa getting decom'd?
[05:58:04] end of the calendar year afaik
[05:58:42] do we have any spare machines with public ips?
[05:58:47] what do you mean?
[05:59:11] a server that isn't doing much that i could use
[05:59:29] um, i guess we can give you a decom'ed server ?
[05:59:31] probably
[05:59:41] RobH would know
[05:59:45] what do you need it for?
[06:00:01] ori-l: this sounds like ticket time
[06:00:17] yeah, probably, was just feeling out the territory :P
[06:00:28] it was really productive to have vanadium to try out different things with production data, but now it's stable and puppetized and i can't really mess with it
[06:01:28] * Aaron|home looks at http://tracker.ceph.com/rb/master_backlog/ceph
[06:01:53] i have a few things i'd like to try; i'll write them up in a ticket
[06:02:14] Aaron|home: their bug tracker looks gorgeous, btw
[06:03:12] ori-l: I think I'd reserve that word for things like http://www.doctormacro.com/Images/Taylor,%20Elizabeth/Annex/Annex%20-%20Taylor,%20Elizabeth_02.jpg
[06:03:20] hahahaha
[06:03:21] but yes it is fairly pretty
[06:03:32] paravoid: sh should increase the number of active connections going to the higher weights
[06:03:42] weird that the bandwidth is the same
[06:03:50] I lowered 30 to 25, no difference
[06:04:06] increasing ssl1005/1006 to 75
[06:05:18] I like how you can really visualize what is being worked on
[06:05:25] * Aaron|home missed that with PivotalTracker
[06:05:31] all we have is suck bugzilla
[06:06:05] and mingle :)
[06:06:09] and rt
[06:06:33] and Trello
[06:06:38] right
[06:06:52] yeah, that is indeed lame
[06:07:04] Ryan_Lane: 25 vs. 75 and no difference
[06:07:08] :(
[06:07:09] * paravoid officially declares sh as crap
[06:07:19] yeah. sh is definitely crap
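[Editor's note] On the weight experiment above ([06:03]-[06:07]), a hedged ipvsadm sketch of how such weights are typically inspected and changed; the VIP and real-server addresses are placeholders. With the source-hash (sh) scheduler clients are pinned to a bucket by source IP, which is consistent with the observation here that weight changes did not move traffic.
    ipvsadm -L -n --stats                               # per real server connection/byte counters
    ipvsadm -L -n -t 208.80.154.224:443                 # scheduler and current weights for one virtual service (placeholder VIP)
    ipvsadm -e -t 208.80.154.224:443 -r 10.64.0.35:443 -w 75    # raise one real server's weight (placeholder RIP)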
[06:07:20]
[06:08:35] lol
[06:09:01] (PS1) Lcarr: fixing wikidata-lb monitoring ip address [operations/puppet] - https://gerrit.wikimedia.org/r/75281
[06:09:03] * Aaron|home pictures a literal stamp of disapproval that says "crap"
[06:09:14] yes
[06:09:21] I should have that made
[06:09:40] huh I thought http://tracker.ceph.com/issues/3188 was older than it was
[06:11:40] paravoid: I bet when that's fixed some downtime bug regression will pop up :)
[06:12:39] (CR) Lcarr: [C: 2] fixing wikidata-lb monitoring ip address [operations/puppet] - https://gerrit.wikimedia.org/r/75281 (owner: Lcarr)
[06:12:40] (Merged) Lcarr: fixing wikidata-lb monitoring ip address [operations/puppet] - https://gerrit.wikimedia.org/r/75281 (owner: Lcarr)
[06:14:33] https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=ssl1004.wikimedia.org&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=SSL+cluster+eqiad
[06:14:36] wow
[06:14:39] pegged at 100%
[06:14:59] well, 95%, but still
[06:15:40] hopefully ulsfo would help with this
[06:15:56] but we really should do something about esams too, especially if we're about to turn more traffic to it via esams
[06:16:04] er, via DNS
[06:21:27] (PS1) Ori.livneh: Tweak 'collect_every' and 'name_match' in EL's Ganglia module [operations/puppet] - https://gerrit.wikimedia.org/r/75284
[06:22:52] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:23:43] "Although I don't have a use-case at this specific point it would be very cool to have node.js bindings."
[06:23:44] hehe
[06:23:51] haha
[06:24:03] http://tracker.ceph.com/issues/4230
[06:24:32] did you hear that ori-l?
[06:24:42] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[06:24:58] "#4230: Replace existing bug tracker with photo of Elizabeth Taylor"
[06:25:15] clearly you mean Jeri Ryan
[06:26:00] Aaron|home: are you pegging me as the no-use-case guy or the node.js guy? :P
[06:26:27] ori-l is semi-secretly the SNOBOL guy
[06:26:32] shhh.
[06:26:49] paravoid: I think the ssl spike may be related to google's increasing indexing of ssl
[06:27:08] ok, i'm heading out again
[06:27:10] I'm pretty sure the rel=canonical change never went in
[06:27:20] bye
[06:27:28] which is what we need to ensure google lists http, rather than https for anon
[06:27:34] ohh, more VE comments on wikitech
[06:27:42] * Aaron|home pulls out the popcorn
[06:27:44] while i'm gone, check out this - http://cuteoverload.com/2013/07/22/care-and-care-alike/
[06:27:51] it will aid you in all the ssl
[06:28:08] awwwww
[06:29:25] that is pretty adorable
[06:30:22] * YuviPanda is terrified of clicking
[06:32:11] it's meh
[06:36:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:37:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[06:39:22] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[06:40:12] RECOVERY - Puppet freshness on neon is OK: puppet ran at Tue Jul 23 06:40:02 UTC 2013
[06:41:44] RECOVERY - Host lanthanum is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[06:46:03] Ryan_Lane: soo, how you like the facebook UI changes?
[06:46:28] facebook ui changes?
[06:46:37] there are changes?
[06:46:42] maybe I haven't gotten them [06:46:57] the top bar changed [06:47:02] ooohhh [06:47:04] it did [06:47:09] the search dropdown looks kind of mobile-ish [06:47:26] yeah [06:47:32] it's fine [06:47:33] I feel like the top-right links a bit less readable...low contrast [06:47:34] doesn't bother me [06:47:39] otherwise OK [06:47:43] yep [06:47:49] yeah, slightly less readable [06:47:55] * YuviPanda pats his stayfocusd [06:49:10] Ryan_Lane: they didn't get community consensus though :) [/troll] [06:49:42] hahaha [06:49:51] one of the nice things about being in ops [06:50:14] I rarely need to deal with community consensus [06:50:43] I actually like how FB does this, I enjoy the surprise [06:51:11] I've never been a giant hater of change [06:51:29] just an occasional small hater ;) [06:52:03] Ryan_Lane: http://perennialreflection.files.wordpress.com/2011/05/haterade.png [06:52:34] looks like quite a refreshing beverage [06:52:55] It's rarely about change, per se. It's about whether something is an improvement. [06:53:35] I don't drink those kind of drinks anymore though...too much Powerade in college [06:54:06] damn you ubiquitous vending machines [06:54:41] Elsie: lots of people think no change is an improvement [06:55:05] and those people tend to be way more vocal than the people who are fine with change [06:55:25] not that I have any strong opinion on VE in particular [06:55:54] I do find myself using "edit source" a lot when I run into oddities, and haven't done many edits lately [06:56:07] I don't edit much ;) [06:56:16] some degree of frustration could thus be understood, heh [06:56:38] that said, the last 20 or so edits I did used VE [06:56:47] thinking of that, I need to update wikitech [06:56:52] maybe I'll do that tomorrow [06:57:01] I always try to use it first [06:58:36] I wouldn't say lots. [06:59:33] we most not be editing the same stuff or I got unlucky [06:59:37] who knows [07:04:11] * Aaron|home reads http://www.businessinsider.com/the-worst-part-about-working-at-google-2013-4 comments [07:58:38] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:00:40] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:01:20] (PS1) Mark Bergsma: Disable backend request queuing on the frontend layer [operations/puppet] - https://gerrit.wikimedia.org/r/75296 [08:02:05] hello [08:02:13] I got a new server for jenkins \O/ [08:02:49] (PS1) Mark Bergsma: Disable the hit_for_pass action on ttl <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75297 [08:03:02] (CR) Mark Bergsma: [C: 2] Disable backend request queuing on the frontend layer [operations/puppet] - https://gerrit.wikimedia.org/r/75296 (owner: Mark Bergsma) [08:03:02] hi mark :) I got yet another varnish change for you. 
Related to xff trusted sources this time https://gerrit.wikimedia.org/r/#/c/75085/
[08:03:03] (Merged) Mark Bergsma: Disable backend request queuing on the frontend layer [operations/puppet] - https://gerrit.wikimedia.org/r/75296 (owner: Mark Bergsma)
[08:03:31] i'll have a look in a bit
[08:03:44] (CR) Mark Bergsma: [C: 2] Disable the hit_for_pass action on ttl <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75297 (owner: Mark Bergsma)
[08:03:45] (Merged) Mark Bergsma: Disable the hit_for_pass action on ttl <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75297 (owner: Mark Bergsma)
[08:04:20] take your time :)
[08:05:02] don't say that
[08:06:20] ;-D
[08:07:13] to get that change I have actually read the varnishncsa and varnishlog man pages, they are powerful tools
[08:07:26] took me a while to figure out the X-Forwarded-For proto thing though
[08:07:38] RECOVERY - search indices - check lucene status page on search1001 is OK: HTTP OK: HTTP/1.1 200 OK - 213 bytes in 0.004 second response time
[08:13:16] (PS1) Mark Bergsma: Set req.hash_ignore_busy in vcl_recv instead [operations/puppet] - https://gerrit.wikimedia.org/r/75298
[08:14:01] (CR) Mark Bergsma: [C: 2] Set req.hash_ignore_busy in vcl_recv instead [operations/puppet] - https://gerrit.wikimedia.org/r/75298 (owner: Mark Bergsma)
[08:14:02] (Merged) Mark Bergsma: Set req.hash_ignore_busy in vcl_recv instead [operations/puppet] - https://gerrit.wikimedia.org/r/75298 (owner: Mark Bergsma)
[08:16:08] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours
[08:35:54] (PS1) Physikerwelt: Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301
[08:39:21] (CR) Physikerwelt: "can someone have a look if I have set up the repository correctly?" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[08:50:40] (CR) Hashar: "Andrew, Alexandros and Faidon should be able to help review the debian package. I have added them as reviewers." [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[08:51:58] (PS1) Mark Bergsma: Append the Varnish default internal vcl_recv [operations/puppet] - https://gerrit.wikimedia.org/r/75303
[08:52:15] (PS1) ArielGlenn: more dumps written to dataset1001, update rsync script, also fix comments [operations/puppet] - https://gerrit.wikimedia.org/r/75304
[08:53:05] MaxSem: good morning
[08:53:20] check this horror: https://gerrit.wikimedia.org/r/#/c/75303/1
[08:53:36] (CR) ArielGlenn: [C: 2] more dumps written to dataset1001, update rsync script, also fix comments [operations/puppet] - https://gerrit.wikimedia.org/r/75304 (owner: ArielGlenn)
[08:53:37] (Merged) ArielGlenn: more dumps written to dataset1001, update rsync script, also fix comments [operations/puppet] - https://gerrit.wikimedia.org/r/75304 (owner: ArielGlenn)
[08:54:15] mark: I am still wondering how you manage to spot all those caching bugs :)
[08:54:37] just because I'm actively looking at it and also cleaning up VCL
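[Editor's note] Re hashar's remark at [08:07:13] about varnishncsa/varnishlog: a hedged example of watching the X-Forwarded-For handling on a Varnish 3 cache (the tag name is Varnish 3's; the format string is illustrative).
    varnishlog -c -m 'RxHeader:X-Forwarded-For'          # client-side transactions carrying an XFF request header
    varnishncsa -F '%h %{X-Forwarded-For}i "%r" %s'      # compare the connecting IP with the XFF chain per request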
[08:54:53] (CR) Physikerwelt: "Thanks. Up to now there is no debian package this is just a setup of the svn-repository clone and git review." [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[08:56:39] i did know there were some default VCL functions being run where they shouldn't
[08:56:47] but I didn't see this one
[08:56:54] (CR) AzaToth: [C: 1] Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[08:57:18] (CR) AzaToth: "I see nothing wrong with the .gitreview file." [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[08:57:40] (CR) Mark Bergsma: [C: 2] varnish: backends trust 127.0.0.1 for XFF [operations/puppet] - https://gerrit.wikimedia.org/r/75085 (owner: Hashar)
[08:57:41] (Merged) Mark Bergsma: varnish: backends trust 127.0.0.1 for XFF [operations/puppet] - https://gerrit.wikimedia.org/r/75085 (owner: Hashar)
[08:57:48] awesome
[08:59:19] hashar: still wonders why you made happy smiley when you told me there wouldn't be any attempts before mid september
[08:59:33] AzaToth: ohhh
[09:00:07] AzaToth: I put smileys at the end of most of my chat lines. They are usually irrelevant
[09:00:08] sorry
[09:00:47] hashar: that was the first time I saw you use one...
[09:00:49] mark: also I got an issue on beta where puppet tries to downgrade varnish* wm14 to wm13 but can't find the packages. Trace is in https://rt.wikimedia.org/Ticket/Display.html?id=5489
[09:01:17] AzaToth: sorry
[09:01:42] AzaToth: anyway the issue is that building packages usually needs root and/or installing package dependencies (the build-depends: field in debian packages)
[09:01:58] AzaToth: so we can't have them run on the Jenkins production slaves since they do not have sudo / su rights.
[09:02:29] hashar: erm, they build in chroots
[09:02:29] AzaToth: one solution would be to build them in a varnish instance, but I have not looked at it yet. The second solution is to use a labs instance which let us setup some sudo rights
[09:03:50] some sudo acc would be needed though, but no full ac to install packages ヾ
[09:04:50] hashar: but I would assume it's a 3 min job to fire up a labs instance and use it ツ
[09:05:49] 1. create instance; 2. setup jenkins user; 3. give jenkins user ALL sudo access; 4. done
[09:05:59] yeah that is what I did on the labs instance
[09:06:13] then to build the package on patchset submission, we need the Jenkins slave to fetch the patchset from the Zuul git repository
[09:06:21] which is only locally available right now
[09:06:34] local where?
[09:06:34] the idea is to have it published over http so jenkins slaves can fetch from it
[09:06:41] on gallium the contint server
[09:06:48] which is in prod
[09:07:05] (PS1) ArielGlenn: re-enable rsync dumps cron job between dataset hosts [operations/puppet] - https://gerrit.wikimedia.org/r/75305
[09:07:24] does zuul use a jenkins plugin or is it fully external?
[09:07:44] (CR) ArielGlenn: [C: 2] re-enable rsync dumps cron job between dataset hosts [operations/puppet] - https://gerrit.wikimedia.org/r/75305 (owner: ArielGlenn)
[09:07:45] (Merged) ArielGlenn: re-enable rsync dumps cron job between dataset hosts [operations/puppet] - https://gerrit.wikimedia.org/r/75305 (owner: ArielGlenn)
[09:08:12] AzaToth: it is an independent python daemon. It listens for Gerrit events via ssh 'stream-events' which is a json stream of events happening in Gerrit
[09:08:48] AzaToth: whenever a patch is submitted against a configured repository, Zuul updates the branch, attempts to merge the patchset on top of the branch and triggers a Jenkins job using the resulting commit
[09:08:59] AzaToth: Jenkins then fetches the change from the Zuul git repository and runs the tasks
[09:09:05] ok
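[Editor's note] A hedged illustration of the Gerrit event stream described above ([09:08:12]): Zuul consumes 'gerrit stream-events' over Gerrit's SSH port. The account name is a placeholder; the command itself is standard Gerrit.
    ssh -p 29418 some-bot@gerrit.wikimedia.org gerrit stream-events | head -1 | python -m json.tool    # one JSON event, pretty-printed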
[09:10:46] hashar: I would assume the slaves could clone the git from gallium via ssh
[09:11:43] I will use http
[09:11:55] and have the slave use a gerrit replication of the git repos as a reference
[09:12:44] ok
[09:13:18] but you don't have time to do this before mid september?
[09:14:18] I will try :)
[09:14:34] I am on vacations in 7 days and still have a bunch of things to handle
[09:14:38] better use of smiley :-P
[09:14:45] hehe
[09:15:14] in the end sorry for not having that in place in a timely manner, that is mostly a matter of priority on my side
[09:15:16] vacation... orly
[09:15:35] and unfortunately debian packaging via jenkins is not that urgent despite your investment
[09:15:38] you will just sit in front of a computer coding :-P
[09:15:46] naaa
[09:16:01] no probs
[09:16:07] I got a family so will spend most of my time disconnected hehe
[09:22:18] (PS9) Hashar: lanthanum as a jenkins slave [operations/puppet] - https://gerrit.wikimedia.org/r/64601
[09:23:34] (CR) Hashar: "PS9:" [operations/puppet] - https://gerrit.wikimedia.org/r/64601 (owner: Hashar)
[09:27:21] mark, duh
[09:27:23] (PS2) Hashar: Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[09:28:02] (CR) Hashar: [C: 1 V: 2] "Added a newline at end of file" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[09:30:42] MaxSem: ?
[09:31:39] response to that caching fun:)
[09:31:58] so the XFF handling part in that is needed
[09:32:00] and it's kinda broken
[09:32:10] XFF is done in 3 places now, and not working together very well
[09:32:16] but at least that cookie pass thing we can take out
[09:34:42] (CR) Mark Bergsma: [C: 2] Append the Varnish default internal vcl_recv [operations/puppet] - https://gerrit.wikimedia.org/r/75303 (owner: Mark Bergsma)
[09:34:43] (Merged) Mark Bergsma: Append the Varnish default internal vcl_recv [operations/puppet] - https://gerrit.wikimedia.org/r/75303 (owner: Mark Bergsma)
[09:36:35] mark, I've been working on XVO yesterday, will probably finish today. this includes getting rid of one of cookies
[09:36:51] cool
[09:37:10] (CR) QChris: "It's not really enforced or required, but we typically [1]" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[09:40:07] does MF use the XFF header for anything?
[09:42:01] (PS1) Mark Bergsma: Remove unnecessary sections of the default vcl_recv code [operations/puppet] - https://gerrit.wikimedia.org/r/75312
[09:42:09] Heja, this is the daily request to restart git to those who are awake (ref. https://bugzilla.wikimedia.org/show_bug.cgi?id=51769#c9 )
[09:45:41] ahh
[09:45:45] that needs a stack trace
[09:45:49] (PS2) Mark Bergsma: Remove unnecessary sections of the default vcl_recv code [operations/puppet] - https://gerrit.wikimedia.org/r/75312
[09:46:12] qchris: hi! do you have access on the git.wikimedia.org box to take a stack trace ?
[09:46:17] qchris: it is dead again hehe
[09:46:24] hashar: Nope. I haven't.
[09:46:35] time to restart it via cronjob?:P
[09:46:36] hashar: This this dies 1000 deaths lately :-(
[09:46:53] MaxSem: That sounds like a really good idea :-)
[09:47:00] is a trace taken with: jstack -F
[09:47:37] hashar: That should work. yes.
[09:48:15] mark: would you mind taking some java traces for us on antimony.wikimedia.org . It hosts git.wikimedia.org which is currently dead
[09:48:41] mark: it is a java process running the gitblit.jar , a stack trace can be taken using: jstack -F
[09:48:54] ok
[09:49:08] root 29861 178 25.7 7761404 4211624 ? Sl Jul22 1058:17 java -jar gitblit.jar
[09:49:09] then you can kill it and restart it with: cd /var/lib/gitblit, java -jar gitblit.jar &
[09:49:10] this pid I assume?
[09:49:14] yup
[09:49:32] why does that run as root...
[09:49:38] yeah that is lame
[09:49:51] maybe there is a gitblit user?
[09:50:10] and why is there no init script?
[09:50:24] file all the bugs
[09:50:26] about it
[09:50:27] I guess upstream does not provide any
[09:50:48] oh that's ok then
[09:51:06] http://p.defau.lt/?wN1DAoBu1MZdReP0hiQhEw
[09:51:50] lol, jstack choked on it
[09:51:51] qchris: would that stack trace be enough ?
[09:52:07] hashar: :-D
[09:52:18] Caused by: sun.jvm.hotspot.runtime.VMVersionMismatchException: Supported versions are 23.7-b01. Target VM is 20.0-b12
[09:52:20] arhghggg
[09:52:26] hashar: Well that should allow to get a real stack trace next time.
[09:52:34] how?
[09:52:45] Supported versions are 23.7-b01. Target VM is 20.0-b12
[09:53:01] By making those numbers match.
[09:53:04] seems there is a mismatch between jstack version and the java that got used
[09:54:43] /usr/lib/jvm/java-6-openjdk-amd64/bin/jstack versus /usr/lib/jvm/java-7-openjdk-amd64/bin/jstack
[09:54:44] maybe
[09:55:21] i had to remove the temp dir too
[09:55:27] Maybe. But when selecting the java implementation to use, shouldn't that switch to the correct directory?
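[Editor's note] A hedged aside on the jstack/JVM mismatch being diagnosed here ([09:52]-[09:58]): jstack has to come from the same JDK generation as the target VM. A generic way to check, assuming the matching JDK (not just the JRE) is installed, which, per the log below, it was not on antimony:
    readlink -f /proc/29861/exe                     # which java binary the gitblit process is actually running
    update-alternatives --display jstack            # which JDK the jstack symlink currently points at
    /usr/lib/jvm/java-6-openjdk-amd64/bin/jstack -F 29861    # use the jstack shipped with that same JDK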
[09:55:44] mark: could you also check the java / jstack alternative versions with:
[09:55:44] ls -1l /etc/alternatives/java /etc/alternatives/jstack
[09:56:13] oh, those are separate for ubuntu :-D
[09:56:23] lrwxrwxrwx 1 root root 46 Jun 25 17:41 /etc/alternatives/java -> /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java
[09:56:23] lrwxrwxrwx 1 root root 44 Jun 25 17:47 /etc/alternatives/jstack -> /usr/lib/jvm/java-7-openjdk-amd64/bin/jstack
[09:56:30] yeah that is it
[09:56:40] i'll update jstack to java-6 then
[09:56:43] so Gitblit was started with jdk 6
[09:56:51] yeah that would be nice
[09:57:33] if you haven't killed the process yet you can then attempt a jstack again
[09:57:48] except the jdk6 jstack doesn't exist
[09:57:52] i have killed the process
[09:57:54] oh my god
[09:58:11] * mark thinks that whoever did this shitty install can support it themselves
[09:58:30] filing bugs
[09:58:41] thank you
[09:59:16] (CR) Mark Bergsma: [C: 2] Remove unnecessary sections of the default vcl_recv code [operations/puppet] - https://gerrit.wikimedia.org/r/75312 (owner: Mark Bergsma)
[09:59:16] (Merged) Mark Bergsma: Remove unnecessary sections of the default vcl_recv code [operations/puppet] - https://gerrit.wikimedia.org/r/75312 (owner: Mark Bergsma)
[10:02:07] (PS1) Mark Bergsma: Remove install of (no longer existing) varnish 3.0.3plus~rc1-wm13 package [operations/puppet] - https://gerrit.wikimedia.org/r/75313
[10:03:14] (PS2) Mark Bergsma: Remove install of (no longer existing) varnish 3.0.3plus~rc1-wm13 package [operations/puppet] - https://gerrit.wikimedia.org/r/75313
[10:03:38] (CR) Mark Bergsma: [C: 2] Remove install of (no longer existing) varnish 3.0.3plus~rc1-wm13 package [operations/puppet] - https://gerrit.wikimedia.org/r/75313 (owner: Mark Bergsma)
[10:04:04] (Merged) Mark Bergsma: Remove install of (no longer existing) varnish 3.0.3plus~rc1-wm13 package [operations/puppet] - https://gerrit.wikimedia.org/r/75313 (owner: Mark Bergsma)
[10:04:16] Jstack issue filed as https://bugzilla.wikimedia.org/show_bug.cgi?id=51859
[10:09:19] and init/upstart script request is https://bugzilla.wikimedia.org/show_bug.cgi?id=51861
[10:09:39] (PS3) QChris: Upon first publicly visible patch set in gerrit, update bug status [operations/puppet] - https://gerrit.wikimedia.org/r/69843
[10:19:59] (PS1) Mark Bergsma: Mobile Cookie Vary caching optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/75316
[10:22:22] (PS2) Mark Bergsma: Mobile Cookie Vary caching optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/75316
[10:25:02] poor vhtcpd doesn't start on labs :D
[10:26:08] (PS3) Mark Bergsma: Mobile Cookie Vary caching optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/75316
[10:32:37] (PS1) Mark Bergsma: Factor out XFF append handling into a common function [operations/puppet] - https://gerrit.wikimedia.org/r/75317
[10:34:36] (CR) Mark Bergsma: [C: 2] Factor out XFF append handling into a common function [operations/puppet] - https://gerrit.wikimedia.org/r/75317 (owner: Mark Bergsma)
[10:34:37] (Merged) Mark Bergsma: Factor out XFF append handling into a common function [operations/puppet] - https://gerrit.wikimedia.org/r/75317 (owner: Mark Bergsma)
[10:41:01] mark: is vhtcpd a custom Wikimedia utility
[10:41:12] I am looking up its configuration options :)
[10:41:29] DAEMON_OPTS="-F -m 239.128.0.112 -c 127.0.0.1:80 -c 127.0.0.1:3128" does not seem to play nice for beta since we do not use multicast
[10:42:38] (PS1) Mark Bergsma: Add new SSL proxies ssl1005 and ssl1006 to the XFF list [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75318
[10:42:40] * 1819f9b - vhtcpd - basically complete (9 weeks ago)
[10:42:42] ah indeed
[10:42:44] (CR) jenkins-bot: [V: -1] Add new SSL proxies ssl1005 and ssl1006 to the XFF list [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75318 (owner: Mark Bergsma)
[10:42:47] hashar: yes
[10:43:46] (PS2) Mark Bergsma: Add new SSL proxies ssl1005 and ssl1006 to the XFF list [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75318
[10:44:47] (CR) Mark Bergsma: [C: 2] Add new SSL proxies ssl1005 and ssl1006 to the XFF list [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75318 (owner: Mark Bergsma)
[10:44:48] (Merged) Mark Bergsma: Add new SSL proxies ssl1005 and ssl1006 to the XFF list [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75318 (owner: Mark Bergsma)
[10:46:45] !log mark synchronized wmf-config/squid.php 'Add ssl1005/ssl1006'
[10:46:55] will look that up after lunch
[10:46:56] Logged the message, Master
[10:46:58] bbl
[10:50:42] (PS4) Mark Bergsma: Mobile Cookie Vary caching optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/75316
[10:57:23] oh duh, ssl1005/1006
[10:58:04] time to put nginx on the varnish boxes
[10:58:45] yeah, if only we had a distributed ssl session cache
[10:59:00] or stud
[10:59:05] stud doesn't do XFF
[10:59:12] it does PROXY
[10:59:20] that's what I like about it
[11:00:17] I like the distributed session cache over udp multicast :)
[11:00:32] have you looked at all at how difficult would PROXY be for varnish?
[11:00:44] no
[11:00:57] why would it be difficult
[11:01:01] clearly replacing the client ip works ;-)
[11:01:50] because it's not http
[11:02:19] it's like the first few bytes sent on the connection right?
[11:04:16] afaik yes
[11:08:28] would we prefer stud over nginx though
[11:15:52] there's a patch for nginx too
[11:21:42] (CR) AzaToth: [C: -1] "I realize now that latexml already exists in debian, so making a new package is not practical use of someones time." [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[11:22:51] what I tried to say is that we should build it on top of the already existing latexml package instead of reinventing the wheel
[11:22:56] (CR) Aklapper: [C: 1] Upon first publicly visible patch set in gerrit, update bug status [operations/puppet] - https://gerrit.wikimedia.org/r/69843 (owner: QChris)
[11:23:57] Me wonders whether someone from the ops team might have a second to have a look at https://gerrit.wikimedia.org/r/#/c/69843/3
[11:24:20] if there even is any changes
[11:25:07] PROBLEM - search indices - check lucene status page on search1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 723 bytes in 0.004 second response time
[11:25:12] I second qchris - would be lovely to get that two-liner deployed so we can test if that enhancement works as expected
[11:27:04] And who is on RT duty this week?
[11:27:08] or is it not latexml he's talking about?
[11:27:51] andre__: the one who first asks
[11:28:48] AzaToth: So andre__ is on RT duty :-) ... wait ... :-(
[11:31:37] hehe
[11:32:18] Is there any place where we can lookup who's on RT duty if the channel topic is outdated?
[11:33:56] qchris: officewiki, ops meeting minutes
[11:34:09] paravoid: Thanks.
[11:38:36] qchris: just look who is running away screaming
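[Editor's note] For the stud/PROXY exchange above ([10:59]-[11:04]): yes, the PROXY protocol (v1) is a single plain-text line sent before the application data. A hedged sketch; the backend name is a placeholder and the backend must be configured to expect the header.
    { printf 'PROXY TCP4 203.0.113.9 10.64.0.1 51501 80\r\n'
      printf 'GET / HTTP/1.1\r\nHost: example.org\r\nConnection: close\r\n\r\n'
    } | nc backend.example.org 80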
[11:39:25] p858snake|l: :-) No one running. No one screaming. And officewiki does not allow me to log in :-/
[11:40:06] Only now I see the new channel topic... :-)
[11:48:05] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:05] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:05] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:06] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:06] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:41] bah I got two varnish htcpd packages :(
[11:49:16] /usr/local/bin/varnishhtcpd with an upstart job and /usr/sbin/vhtcpd from the debian package :)
[11:52:06] paravoid: May I trick you into taking a look and deploying a two-liner for Chris and me? https://gerrit.wikimedia.org/r/#/c/69843/3
[11:53:05] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours
[11:53:16] I have no idea what that does
[11:53:34] (PS1) Hashar: get rid of varnishhtcpd upstart job [operations/puppet] - https://gerrit.wikimedia.org/r/75323
[11:54:13] (CR) Physikerwelt: "Yes I know that there is a package... of 2009 or something like that. LaTeXML has improved a lot in that period." [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[11:54:23] paravoid: That's no problem. Just +2 :-)
[11:54:45] paravoid: No, seriously.
[11:55:01] paravoid: We just set the bug's status to PATCH_TO_REVIEW
[11:55:19] ...and that status now exists in Bugzilla.
[11:55:24] paravoid: Which is the new way to mark that a change in gerrit is associated to the bug
[11:55:30] andre__: woot
[11:55:31] yay
[11:55:46] (CR) Faidon: [C: 2] Upon first publicly visible patch set in gerrit, update bug status [operations/puppet] - https://gerrit.wikimedia.org/r/69843 (owner: QChris)
[11:55:47] (Merged) Faidon: Upon first publicly visible patch set in gerrit, update bug status [operations/puppet] - https://gerrit.wikimedia.org/r/69843 (owner: QChris)
[11:55:59] andre__: does it come before or after ASSIGNED in the workflow? (after, i assume?)
[11:56:20] It's merged! Thanks Faidon.
[11:56:30] Yay! Thanks!
[11:56:58] MatmaRex: in a perfect world should be after ASSIGNED, but I don't want to impose such expectations via the bug workflow yet.
[11:57:37] yeah
[11:57:47] are you planning to migrate the keyword?
[11:58:04] MatmaRex, yeah, but I'm still thinking about the best way to do this.
[11:58:45] ** Remove "patch-in-gerrit" keyword from closed tickets
[11:58:46] ** Retriage and set PATCH_TO_REVIEW status for tickets with patch-in-gerrit keyword
[11:58:46] ** Query for tickets with gerrit link but without patch-in-gerrit keyword and set status
[11:58:47] etc.
[11:59:05] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours
[12:04:05] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours
[12:05:35] (CR) AzaToth: [C: 1] "I see, then I assume it's as good to just ignore the current debian package."
[operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt) [12:06:47] to fill a feature request against vhtcpd (namely adding support for unicast) should I fill a bug in RT ? ;) [12:07:26] hashar: over my dead body :-P [12:08:23] jokes aside; I think RT should only handle stuff that must be kept out of the prying eyes of the public [12:09:21] if RT is the main tool that ops use, then it's where the task for ops should be. The usual meta-discussion is already in https://bugzilla.wikimedia.org/show_bug.cgi?id=30413 :) [12:09:53] yeah going to fill a RT [12:11:39] andre__: sadly the ops have no incensitive to make it more transparent :( [12:20:08] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [12:22:40] AzaToth: because ops have to handle really nasty stuff which is safer in RT [12:22:51] AzaToth: and they interact with lot of third parties over email. [12:23:08] hashar: yes [12:23:16] so they use RT for good reasons :-] [12:23:27] hashar: I assume suport for unicast is really nasty stuff :-P [12:23:36] I dont know [12:24:51] I could use a root to verify some disk partition layout on lanthanum.eqiad.wmnet . It got installed last week with a SSD but I have no idea what the /dev/ is . I need the information to complete the puppet change that will make it a jenkins slave :-] [12:24:58] (related change https://gerrit.wikimedia.org/r/#/c/64601/ ) [12:27:25] AzaToth: I'm aware of at least one workflow in RT that would not work in Bugzilla (excluding the original reporter of a ticket from bugmail notifications). [12:27:55] andre__: oh [12:28:03] For other safety-related aspects my understanding so far is that things can be worked out by using group access restrictions for specific Bugzilla products [12:28:16] but I would need to understand the RT usage way more before I could really "judge" that [12:29:14] could all switch to redmine and http://redminecrm.com/pages/main [12:29:41] andre__: I've maintained a RT for a while ago, not totally though [12:30:13] I know RT is a bit intimidating to grasp [12:31:16] I think somebody in WMF was even playing with Redmine. Now if I could only remember names :) [12:31:36] I can't remember names either [12:31:50] especially when people like Elsie changes names all the time [12:32:44] andre__: guillaume paumier looked at red mine a few years ago [12:33:13] Yes, Priyanka and I looked at it a few years back [12:33:13] andre__: https://guillaumepaumier.com/2010/03/05/scaling-up-software-development-for-wikimedia-websites-tools/ [12:33:16] there is, I think, another Wikimedia organization that uses Redmine [12:33:28] I wouldn't necessarily recommend it right now, but it made sense at the time. [12:34:28] would be nice to be able to combine tickets and bugs in the same system [12:35:00] That would be nice indeed. [12:35:12] http://lists.wikimedia.org/pipermail/wikilovesmonuments/2011-December/002225.html [12:35:39] AzaToth: oh no, redmine sucks [12:35:52] MatmaRex: orly [12:35:58] yarly [12:36:04] it's interface is worse than bugzilla's [12:36:08] and that says something [12:36:27] MatmaRex: that's subjective [12:36:39] MatmaRex: I feel the opposite [12:37:12] I could say "oh no, bugzilla sucks", "it's interface is worse than redmine's" [12:37:40] its* :P [12:37:42] MatmaRex: thus unless you can reference a objective analysis [12:37:57] guillom: I had to type what he typed :-P [12:37:57] yeah, probably [12:38:08] its*, yes. 
:D
[12:38:39] MatmaRex: thank you for fixing the parameters-not-scrolling bug in the VE's template editor, by the way.
[12:38:51] (PS7) Hashar: contint: publish Zuul git repositories [operations/puppet] - https://gerrit.wikimedia.org/r/71968
[12:39:06] guillom: VEs
[12:39:07] guillom: it's not merged yet :(
[12:39:28] MatmaRex: I know :/
[12:39:30] guillom: so if you could pull a few strings… ;)
[12:39:41] AzaToth: Don't make my eyes bleed kthxbai :P
[12:39:46] hehe
[12:40:18] MatmaRex: Unfortunately, I don't have that kind of powers, but I promise I'll pester and harass as much as I can.
[12:40:26] I felt like making a grammar-Patton on your grammar-nazi arse :-P
[12:40:36] heh
[12:41:51] PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host
[12:43:52] RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.004 second response time
[12:51:32] (CR) AzaToth: [C: 1] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/71968 (owner: Hashar)
[12:58:53] (PS10) Hashar: lanthanum as a jenkins slave [operations/puppet] - https://gerrit.wikimedia.org/r/64601
[13:07:20] (PS11) Hashar: lanthanum as a jenkins slave [operations/puppet] - https://gerrit.wikimedia.org/r/64601
[13:08:45] (CR) ArielGlenn: [C: 2] lanthanum as a jenkins slave [operations/puppet] - https://gerrit.wikimedia.org/r/64601 (owner: Hashar)
[13:08:46] (Merged) ArielGlenn: lanthanum as a jenkins slave [operations/puppet] - https://gerrit.wikimedia.org/r/64601 (owner: Hashar)
[13:13:49] PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host
[13:16:12] RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 8.068 second response time
[13:19:13] (PS2) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263
[13:19:58] (PS1) Mark Bergsma: Use hit_for_pass on TTL <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75327
[13:20:03] (PS3) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263
[13:20:37] (CR) Mark Bergsma: [C: 2] Use hit_for_pass on TTL <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75327 (owner: Mark Bergsma)
[13:21:39] (Merged) Mark Bergsma: Use hit_for_pass on TTL <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75327 (owner: Mark Bergsma)
[13:25:14] (PS4) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263
[13:26:45] yoyo paravoid, could you check this when you get a sec?
[13:26:46] https://gerrit.wikimedia.org/r/#/c/74686/
[13:26:46] (PS5) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263
[13:28:17] hah, I was just loooking at it
[13:30:02] * paravoid grumbles at the hacks needed to do something so simple
[13:30:21] (PS1) Odder: (bug 51788) Close amwikiquote and uzwikibooks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75329
[13:30:53] (CR) Ottomata: [C: 1] Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott)
[13:31:09] yeah i grumbled a fair amount myself
[13:35:38] (CR) Faidon: [C: -1] "(4 comments)" [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 (owner: Ottomata)
[13:36:47] the 4644 was puzzling
[13:36:58] is this a typo when converting 0644 to 0444?
[13:37:21] uhhhhhh good question
[13:37:24] i think so
[13:39:20] ok ja will fix those paravoid, and as for snakeoil
[13:39:24] ja, hue didn't like it
[13:39:26] not exactly sure why
[13:39:39] must have been the type of cert it created or something, dunno
[13:39:50] this was more like what the hue docs said to do
[13:39:51] and it works
[13:39:53] so :/
[13:39:56] thanks!
[13:40:03] i gotta run for a few mins, back in a bit
[13:46:28] (PS1) Hashar: admin::roots is plural (for lanthanum) [operations/puppet] - https://gerrit.wikimedia.org/r/75332
[13:47:28] (CR) ArielGlenn: [C: 2] admin::roots is plural (for lanthanum) [operations/puppet] - https://gerrit.wikimedia.org/r/75332 (owner: Hashar)
[13:47:29] (Merged) ArielGlenn: admin::roots is plural (for lanthanum) [operations/puppet] - https://gerrit.wikimedia.org/r/75332 (owner: Hashar)
[13:51:21] PROBLEM - DPKG on lanthanum is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[13:52:21] RECOVERY - DPKG on lanthanum is OK: All packages OK
[14:08:03] (PS8) Hashar: contint: publish Zuul git repositories [operations/puppet] - https://gerrit.wikimedia.org/r/71968
[14:16:21] AzaToth: I got a way to publish the Zuul git repositories over http (thanks to openstack folks)
[14:16:36] AzaToth: so slaves will be able to fetch !
[14:17:01] (CR) Hashar: "added some trailing slashes." [operations/puppet] - https://gerrit.wikimedia.org/r/71968 (owner: Hashar)
[14:17:07] (PS9) Hashar: contint: publish Zuul git repositories [operations/puppet] - https://gerrit.wikimedia.org/r/71968
[14:22:17] !log Jenkins: lanthanum has been added as a Jenkins slave node though NO job should be running there.
[14:22:29] Logged the message, Master
[14:32:32] PROBLEM - Packetloss_Average on gadolinium is CRITICAL: CRITICAL: packet_loss_average is 8.6478998913 (gt 8.0)
[14:33:44] hashar: nice nice
[14:34:16] hashar: how do you test it?
[14:34:34] or do you just publish it and cross the fingers?
[14:38:00] (CR) AzaToth: "(2 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/71968 (owner: Hashar)
[14:42:14] AzaToth: I have a copy of the CI production infrastructure in labs :-]
[14:42:28] RECOVERY - Packetloss_Average on gadolinium is OK: OK: packet_loss_average is 3.99124366412
[14:44:12] AzaToth: basically Zuul + Gerrit + Jenkins all on the same instance with a copy of the Apache conf of integration.wikimedia.org (though on the instance that points to 127.0.0.1)
[14:47:25] !log stopping puppet on gadolinium, looking into filters causing socat process packet loss
[14:47:35] Logged the message, Master
[14:50:48] (PS2) Reedy: remove obsolete wgUseDynamicDates variable [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74838 (owner: TTO)
[14:50:52] (CR) Reedy: [C: 2] remove obsolete wgUseDynamicDates variable [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74838 (owner: TTO)
[14:51:02] (Merged) jenkins-bot: remove obsolete wgUseDynamicDates variable [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74838 (owner: TTO)
[14:51:49] (CR) Cmcmahon: "Bug 51884" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon)
[14:51:50] (CR) Reedy: "Roan, can you confirm this config is correct?"
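[Editor's note] On "is this a typo when converting 0644 to 0444?" just above: very likely, since a leading 4 in a four-digit mode is the setuid bit, not a read bit. A quick sketch of the difference, using a scratch file:
    touch demo
    chmod 4644 demo && stat -c '%a %A' demo    # 4644 -rwSr--r--  (setuid set, owner-writable)
    chmod 0444 demo && stat -c '%a %A' demo    # 444 -r--r--r--   (read-only for everyone, the intended mode)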
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon) [14:52:24] (PS2) Reedy: (bug 51788) Close amwikiquote and uzwikibooks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75329 (owner: Odder) [14:52:30] (CR) Reedy: [C: 2] (bug 51788) Close amwikiquote and uzwikibooks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75329 (owner: Odder) [14:52:40] (Merged) jenkins-bot: (bug 51788) Close amwikiquote and uzwikibooks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75329 (owner: Odder) [14:53:08] (PS2) Reedy: enable VisualEditor for all users on test2wiki, experimental also [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon) [14:53:08] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:53:40] (PS2) Reedy: (bug 51684) Modify two variables for Ukrainian Wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74691 (owner: Odder) [14:53:47] (CR) Reedy: [C: 2] (bug 51684) Modify two variables for Ukrainian Wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74691 (owner: Odder) [14:53:55] (Merged) jenkins-bot: (bug 51684) Modify two variables for Ukrainian Wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74691 (owner: Odder) [14:55:19] (PS2) Reedy: (bug 51715) allow sysops to add/remove confirmed group on ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74825 (owner: TTO) [14:55:24] (CR) Reedy: [C: 2] (bug 51715) allow sysops to add/remove confirmed group on ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74825 (owner: TTO) [14:55:35] (Merged) jenkins-bot: (bug 51715) allow sysops to add/remove confirmed group on ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74825 (owner: TTO) [14:55:48] (PS2) Reedy: (bug 42113) remove ability to debureaucrat from enwiktionary bureaucrats [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74828 (owner: TTO) [14:55:54] (CR) Reedy: [C: 2] (bug 42113) remove ability to debureaucrat from enwiktionary bureaucrats [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74828 (owner: TTO) [14:56:06] (Merged) jenkins-bot: (bug 42113) remove ability to debureaucrat from enwiktionary bureaucrats [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74828 (owner: TTO) [14:56:58] YuviPanda: perhaps some more color to grrrit-wm [14:56:58] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [14:57:10] AzaToth: I was thinking about it [14:57:22] AzaToth: unsure though about how well IRC clients handle it [14:57:34] !log dist-upgrade & reboot payments2 [14:57:43] Logged the message, Master [14:57:46] YuviPanda: either they show the color, or they only show uncolored text [14:57:48] AzaToth: the library I use supports easy colors [14:57:59] don't they get garbage text if not colors? [14:58:04] YuviPanda: if someone is using a client made in 1970, it's their own fault ツ [14:58:14] hehe, I see [14:58:24] does irssi support colors? [14:58:29] YuviPanda: yes [14:58:34] that's settled then [14:58:49] not right now, though. Will do over the weekend [14:59:15] andre__: can we have a component in bugzilla for grrrit-wm? [14:59:22] under 'Tools' perhaps? [14:59:24] only for gerrit-vm [14:59:28] :-P [14:59:38] * YuviPanda suddenly gets very distracted [14:59:44] hehe [15:09:38] MatmaRex: What version is the fawiki collation updates in? 
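On the grrrit-wm colour question a little further up: mIRC-style formatting is just in-band control bytes (0x02 for bold, 0x03 plus a colour number, 0x0F to reset), so clients that understand them render colour, clients that don't usually show plain uncoloured text, and only a truly ancient client will display the raw control characters. A rough bash sketch of what a coloured bot line would carry on the wire; the colour numbers and message text are arbitrary examples:

    # \x02 = bold, \x0303 = green, \x0304 = red, \x0f = reset
    printf '\x02(CR)\x02 \x0303Verified +1\x0f \x0304Lint -1\x0f jenkins-bot\n'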
[15:09:52] nvm [15:09:53] RTFB [15:11:00] (CR) Reedy: [C: -1] "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 (owner: Aude) [15:11:54] (PS2) Reedy: Disable "Mark as helpful" extension on English Wikipedia. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75075 (owner: Eloquence) [15:11:59] (CR) Reedy: [C: 2] Disable "Mark as helpful" extension on English Wikipedia. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75075 (owner: Eloquence) [15:12:10] (Merged) jenkins-bot: Disable "Mark as helpful" extension on English Wikipedia. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75075 (owner: Eloquence) [15:13:26] !log dist-upgrade & reboot payments2-4 [15:13:37] Logged the message, Master [15:13:37] !log reedy synchronized database lists files: [15:13:47] Logged the message, Master [15:14:13] !log reedy synchronized wmf-config/InitialiseSettings.php [15:14:23] Logged the message, Master [15:14:32] (PS2) Reedy: Change fa wikis to use uca-fa sort order [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75248 (owner: Brian Wolff) [15:14:45] (CR) Reedy: [C: 2] Change fa wikis to use uca-fa sort order [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75248 (owner: Brian Wolff) [15:14:54] (Merged) jenkins-bot: Change fa wikis to use uca-fa sort order [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75248 (owner: Brian Wolff) [15:20:03] YuviPanda, https://www.mediawiki.org/wiki/Bug_management/Project_Maintainers#To_add_a_project_or_component [15:21:15] !log dist-upgrade &reboot db1025 [15:21:25] Logged the message, Master [15:21:37] (PS1) Ottomata: Moving as many easy filters to oxygen. gadolinium's socat process is dropping packets with udp2log running. [operations/puppet] - https://gerrit.wikimedia.org/r/75342 [15:22:52] PROBLEM - Host db1025 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:16] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:02] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [15:24:03] (CR) Ottomata: [C: 2 V: 2] Moving as many easy filters to oxygen. gadolinium's socat process is dropping packets with udp2log running. [operations/puppet] - https://gerrit.wikimedia.org/r/75342 (owner: Ottomata) [15:24:04] (Merged) Ottomata: Moving as many easy filters to oxygen. gadolinium's socat process is dropping packets with udp2log running. [operations/puppet] - https://gerrit.wikimedia.org/r/75342 (owner: Ottomata) [15:25:13] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid [15:25:22] RECOVERY - Host db1025 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [15:26:50] (PS1) Ottomata: Need role::cache::configuration class in oxygen logging role [operations/puppet] - https://gerrit.wikimedia.org/r/75344 [15:27:11] (CR) Ottomata: [C: 2 V: 2] Need role::cache::configuration class in oxygen logging role [operations/puppet] - https://gerrit.wikimedia.org/r/75344 (owner: Ottomata) [15:27:12] (Merged) Ottomata: Need role::cache::configuration class in oxygen logging role [operations/puppet] - https://gerrit.wikimedia.org/r/75344 (owner: Ottomata) [15:27:12] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
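For context on the filters being shuffled from gadolinium to oxygen in the commits above: udp2log fans every incoming log line out to the destinations listed in its config file, one line per destination. A hedged sketch of what such a config looks like, assuming the usual "file <sampling-factor> <path>" and "pipe <sampling-factor> <command>" line formats; the paths and the awk filter are invented for illustration:

    # /etc/udp2log/webrequest -- illustrative only
    # keep a 1-in-1000 sample of everything on disk
    file 1000 /a/log/webrequest/sampled-1000.tsv.log
    # feed every line through a filter process (a stand-in awk filter here)
    pipe 1 /usr/bin/awk -F'\t' '/Special:BannerRandom/' >> /a/log/webrequest/banner.log

Moving a filter to another udp2log host is then essentially moving its config line, plus whatever the filter command itself needs, to that host.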
[15:28:02] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [15:30:12] RECOVERY - check_squid on payments3 is OK: PROCS OK: 1 process with command name squid [15:30:58] (PS1) Ottomata: Need webrequest_filter_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75345 [15:31:31] (CR) Ottomata: [C: 2 V: 2] Need webrequest_filter_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75345 (owner: Ottomata) [15:31:32] (Merged) Ottomata: Need webrequest_filter_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75345 (owner: Ottomata) [15:40:29] Off for today :) [15:42:36] (PS1) Andrew Bogott: Create an 'interface' module. [operations/puppet] - https://gerrit.wikimedia.org/r/75347 [15:43:53] (CR) Andrew Bogott: "Just getting the ball rolling." [operations/puppet] - https://gerrit.wikimedia.org/r/75347 (owner: Andrew Bogott) [15:52:29] (PS1) Ottomata: Moving webstatscollector filter process to oxygen udp2log instance [operations/puppet] - https://gerrit.wikimedia.org/r/75348 [15:53:12] (CR) Ottomata: [C: 2 V: 2] Moving webstatscollector filter process to oxygen udp2log instance [operations/puppet] - https://gerrit.wikimedia.org/r/75348 (owner: Ottomata) [15:53:13] (Merged) Ottomata: Moving webstatscollector filter process to oxygen udp2log instance [operations/puppet] - https://gerrit.wikimedia.org/r/75348 (owner: Ottomata) [15:55:07] (PS4) Aude: Move property-create for * to after loading of Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 [15:57:26] !log stop icinga notifications for db1008 [15:57:35] Logged the message, Master [15:57:43] !log stop icinga notifications for db1025 mysql [15:57:52] Logged the message, Master [15:59:28] Jeff_Green [15:59:38] I'm having a lot of trouble with udp2log on gadolinium right now [16:00:10] you there? [16:00:38] i am but I'm moments from a fundraising db master swap [16:00:52] so I may not be able to follow well for a bit [16:05:08] hmm ok [16:05:14] well, basically, udp2log there is not running righ tnow [16:05:19] i've moved everything except for your FR stuff to another host [16:05:23] not sure what to do about FR [16:05:36] q, can we upgrade locke and give it to you 100%? [16:05:42] we'd have to deal with the NFS mount move stuff again [16:05:50] but more of the FR job stuff is puppetized now, so it should be easier to move again [16:06:02] Jeff_Green ^ [16:08:04] ottomata: uhhwhut? so we're not logging banners at the moment? [16:08:10] (CR) Aude: "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 (owner: Aude) [16:08:16] right now, no we are not [16:08:27] did you tell the fundraisers already? [16:08:38] its only been an hourish since I started looking at this [16:08:43] i've been trying to save as much as possible [16:09:00] i'm not entirely sure why this is happening all of the sudden, but gadolinium is running a socat multicast relay [16:09:05] which basically feeds all udp2log instances [16:09:05] i see. luckily this is happening right in the middle of planned maintenance, so they pulled hte banners down earlier today [16:09:06] righ tnow [16:09:28] with gadolinium running udp2log at the same time as socat, the socat process drops packets [16:09:33] if I stop udp2log, it is fine [16:09:37] so, since socat is so important [16:09:40] ok. I can't deal with this for about an hour, but... 
[16:09:42] and very difficult to move [16:09:52] i decided to move udp2log [16:10:12] can you get m.ark to allow whichever host it is that you want fundraising to use to mount the fr share r/w? [16:10:39] also host has to be at the same datacenter as gadolinium or he has to reverse netapp replication [16:11:08] aaawww poof [16:11:10] locke is pmtpa [16:11:12] yeah we shouldn't use that [16:11:17] ok, i dunno then [16:11:44] I have to go back to the db move for an hour or so [16:12:01] ok [16:12:15] how do I notifiy fundraisers? [16:13:28] i just let them know in #wikimedia-fundraising, but if youre gonna send email include fr-tech [16:14:45] hmm, mark do we vary desktop HTML based on protocol? [16:14:57] (PS1) Demon: Adding init script for gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75350 [16:15:44] (PS1) Reedy: Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 [16:33:11] Heya, LeslieCarr, you there? [16:33:16] question about frack and networking [16:33:27] they can/should be able to consume from the udp2log multicast group, right? [16:35:46] (CR) Matmarex: [C: -1] "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 (owner: Reedy) [16:36:18] (PS2) Pyoungmeister: adding zinc as rtt solr host [operations/puppet] - https://gerrit.wikimedia.org/r/75265 [16:38:30] (CR) Pyoungmeister: [C: 2] adding zinc as rtt solr host [operations/puppet] - https://gerrit.wikimedia.org/r/75265 (owner: Pyoungmeister) [16:38:32] (Merged) Pyoungmeister: adding zinc as rtt solr host [operations/puppet] - https://gerrit.wikimedia.org/r/75265 (owner: Pyoungmeister) [16:39:53] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [16:40:32] !log authdns-update to propagate fundraisingdb-* cname changes [16:40:42] Logged the message, Master [16:43:57] (PS2) Reedy: Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 [16:45:36] (CR) Matmarex: [C: 1] Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 (owner: Reedy) [17:00:16] MaxSem: no [17:02:38] MaxSem: also check https://gerrit.wikimedia.org/r/#/c/75316/ [17:02:51] mark, that's why I'm asking:) [17:02:58] ok [17:03:22] (CR) MaxSem: [C: -1] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75316 (owner: Mark Bergsma) [17:04:30] ottomata: hey i'm here [17:04:52] ottomata: possibly , would need to check the firewall rules to see if it was allowed , but if not i can poke the hole [17:06:02] i just tried to consume it on aluminum, but it didnt' work [17:06:27] if you could poke hole, woudl be much appreciated [17:06:32] i want to give FR their own udp2log instance [17:09:41] (PS5) Mark Bergsma: Mobile Cookie Vary caching optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/75316 [17:09:41] so what's the source/dest/port pair ? 
and can you make a ticket (for pci compliance should all be ticketd) [17:11:04] (CR) Aaron Schulz: [C: 2] Added DB performance log [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75260 (owner: Aaron Schulz) [17:11:50] (Merged) jenkins-bot: Added DB performance log [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75260 (owner: Aaron Schulz) [17:13:22] !log aaron synchronized wmf-config/InitialiseSettings.php '26162fbc97ebbe25fee84bb333b0b0b2efe4c447' [17:13:33] Logged the message, Master [17:15:10] LeslieCarr: , can do [17:16:49] https://rt.wikimedia.org/Ticket/Display.html?id=5505&results=d40c70ee966f7c29b4ae623c07c24c79 [17:16:55] LeslieCarr: ^ [17:17:06] thanks [17:17:35] thank you! [17:20:08] oh ottomata , aluminium is still outside of frack and just using iptables [17:20:17] oh [17:20:26] hm, let me ask if that's ok then [17:20:31] which makes it easier and should just be in puppet [17:20:39] (PS1) Jgreen: flip db1008 to fundraisingdb slave, remove references to db1013 [operations/puppet] - https://gerrit.wikimedia.org/r/75361 [17:21:33] LeslieCarr: wanna join #wikimedia-fundraising? [17:25:50] !log reedy synchronized php-1.22wmf10/extensions/Wikibase [17:26:00] Logged the message, Master [17:26:00] notpeter_: what are the next steps? niklas updates translatewiki, then we can remove the role from vanadium? [17:27:39] !log reedy synchronized php-1.22wmf11/extensions/Wikibase [17:27:49] Logged the message, Master [17:28:07] (PS1) Aude: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 [17:28:17] (PS5) Reedy: Move property-create for * to after loading of Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 (owner: Aude) [17:28:51] (PS2) Aude: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 [17:29:03] (CR) Reedy: [C: 2] Move property-create for * to after loading of Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 (owner: Aude) [17:29:38] (Merged) jenkins-bot: Move property-create for * to after loading of Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 (owner: Aude) [17:30:34] ori-l: chris had a little bit of trouble installing zinc htis morning [17:30:56] !log reedy synchronized wmf-config/ [17:30:59] so that needs to happen, then a couple of puppet runs, and then maybe do some gets against each/work with niklas to make sure things are correct [17:31:07] Logged the message, Master [17:31:39] \o/ [17:32:07] Reedy: just added you as a reviewer to this: https://gerrit.wikimedia.org/r/#/c/74514/ I was going to +2 it, but I don't have +2 in mediawiki-config, I assume you do :) [17:32:10] * Jeff_Green wonders how long to wait for gerrit to finish reviewing my tiny merge [17:33:25] (CR) Jgreen: [C: 2 V: 2] flip db1008 to fundraisingdb slave, remove references to db1013 [operations/puppet] - https://gerrit.wikimedia.org/r/75361 (owner: Jgreen) [17:33:26] (Merged) Jgreen: flip db1008 to fundraisingdb slave, remove references to db1013 [operations/puppet] - https://gerrit.wikimedia.org/r/75361 (owner: Jgreen) [17:34:24] greg-g: Isn't that deploy on thursday? 
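A quick way to check whether a host can actually consume the udp2log multicast feed, before assuming it is a firewall problem, is to join the group and watch for packets. The group address, port and interface below are placeholders, not necessarily the real values:

    # join the multicast group and dump whatever arrives; silence after a few
    # seconds usually means the stream is not reaching this host
    socat -u UDP4-RECV:8420,ip-add-membership=233.0.0.1:eth0 - | head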
[17:37:39] (PS3) Aude: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 [17:38:00] (PS4) Aude: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 [17:39:22] (PS5) Reedy: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 (owner: Aude) [17:39:28] (CR) Reedy: [C: 2] enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 (owner: Aude) [17:39:40] (Merged) jenkins-bot: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 (owner: Aude) [17:40:41] !log reedy synchronized wmf-config/ [17:45:03] (PS1) Demon: Allow the user's browser to cache resources [operations/puppet] - https://gerrit.wikimedia.org/r/75366 [17:45:49] Ryan_Lane, Shall we gut sockpuppet today? [17:46:04] https://gerrit.wikimedia.org/r/#/c/75263/ [17:46:14] notpeter_: cool, thanks for the update [17:46:26] I decided to do the private repo bits by hand because getting a keypair set up for a puppet merge was getting messy [17:49:02] Reedy: yeah.... right... good thing I don't have +2 on -config :P [17:49:29] that repo is automatically sucked down and pushed out by puppet then, right? [17:55:50] andrewbogott: let me review [17:56:40] andrewbogott: did you test this in labs? [17:56:53] it looks like it'll work to me [17:58:37] !log Populated sites table on all wikivoyage wikis [17:58:47] Logged the message, Master [18:00:08] !log reedy synchronized wmf-config/ [18:02:34] !log reedy synchronized database lists files: [18:02:43] Logged the message, Master [18:03:41] Ryan_Lane, it's hard to test thorougly because of the deps on private. It compiles at least... [18:03:48] * Ryan_Lane nods [18:04:21] RobH: we need another udp2log host for eqiad urgently, is there something available in the spares pool? [18:04:43] The patch also moves the files for the puppet master on stafford. Not strictly necessary but… it seems better to have both systems the same. [18:06:13] !log reedy synchronized wmf-config/ [18:06:26] RobH, Jeff_Green: should I file a ticket? [18:06:27] Ryan_Lane, since this may break puppet refreshes for a bit, think I should schedule a window? Or is just shouting something out on IRC enough? [18:06:37] for new box? [18:06:41] ottomata: yes pls [18:06:48] andrewbogott: if it breaks we can revert [18:06:53] * Jeff_Green grabbing lunch. back in ~20 [18:06:54] Jeff_Green: hrmm [18:06:56] I'd say go for it [18:07:00] hm [18:07:01] Jeff_Green: alternatively [18:07:03] wait.... [18:07:04] Jeff_Green: what kinda host? [18:07:05] I see one issue [18:07:08] we could find a beter home for the socat relay [18:07:12] but, that one is harder to move [18:07:14] ie: whats the existing one so i copy it? 
[18:07:17] it requires config changes to all frontend hosts [18:07:21] (PS3) Reedy: Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 [18:07:28] (CR) Reedy: [C: 2] Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 (owner: Reedy) [18:07:39] RobH, something like oxygen or gadolinium would be fine [18:07:41] (Merged) jenkins-bot: Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 (owner: Reedy) [18:07:47] ottomata: I'm cool whichever makes more sense to you in terms of logging architecture [18:07:51] andrewbogott: the user is now gitpuppet, but root still owns the repo and still writes to it [18:08:08] ottomata: file a ticket yea, but i'll start snagging it now [18:08:09] one upside of moving socat is we don't have to retool the netapp mount [18:08:14] just have ticket so i can have a paper trail [18:08:21] yeah [18:08:22] and i do feel a bit shameful moving udp2log off of gadolinium [18:08:25] since it was allocated for that purpose [18:08:27] hm [18:08:43] i just want keep socat and udp2log separate [18:08:44] !log reedy synchronized wmf-config/InitialiseSettings.php [18:08:52] Ryan_Lane: Oh, hm. That's fine from the perspective of the git server but is not so good on the client. [18:08:54] Logged the message, Master [18:08:57] ………………. hm. [18:08:58] andrewbogott: maybe this won't be a problem on stafford, as long as the repo is owned by gitpuppet [18:09:12] i wish the socat thing wasn't so annoying to move [18:09:13] because the repo is readable by all [18:09:23] Jeff_Green: go get some lunch, i'll discuss with RobH [18:09:29] so you guys need to list the justifications in the ticket [18:09:38] cuz if we already allocated one machine to this, ops mgmt will wanna know whats up [18:09:40] ;] [18:09:40] Ryan_Lane, yeah, I think it won't matter... [18:09:43] yeah totally [18:09:50] RobH, maybe you can help me figure out what's best to do [18:09:50] so a single cpu misc server? [18:09:51] ottomata: cool. thanks! [18:09:54] so [18:10:03] i think we put in larger disks for the logging hosts didnt we? [18:10:06] yeah [18:10:08] if so, i dont have one like that [18:10:12] a socat box wouldn't need disk at all [18:10:13] does socat need the larger disks? [18:10:16] then move it. [18:10:18] not logs [18:10:19] ok [18:10:22] hm [18:10:34] I presume the puppet master doesn't have to own the files, it just needs to read them [18:10:35] But gitpuppet will need to write to the repo dir. Lemme check that [18:10:36] this is from hw perspective, i dunno how hard it is to do that [18:10:38] it uses a quite a bit of CPU [18:10:40] but all it does [18:10:44] is accept unicast udp traffic [18:10:50] and then forward all of it to a multicast group [18:10:57] andrewbogott: yeah, it needs to write to the repo on stafford [18:11:00] well i have single cpu misc boxes, and i have higher performance dual cpu misc boxes [18:11:02] but, this is *all* webrequest log traffic [18:11:49] the socat will only use one cpu, but it will use it up [18:11:55] so a couple of cpus at least might be nice [18:12:02] doesn't need much mem either [18:12:28] ok, so gadolinium is existing high performance misc host with dual 500GB disks [18:12:36] thats really not that large a disk in restrospect. 
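For reference, the relay being described really is that small: it accepts the unicast UDP log stream the frontends send and re-emits every datagram to a multicast group that each udp2log instance subscribes to. A minimal sketch of that shape, with placeholder port and group (the puppetised invocation will carry more options, such as multicast TTL and bind interface):

    # single-threaded copy loop: unicast in, multicast out; which is also why it
    # pegs exactly one core under full webrequest volume
    socat -u UDP4-RECV:8420 UDP4-DATAGRAM:233.0.0.1:8420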
[18:12:36] yeah, we should keep udp2log there, you are right [18:12:43] yeah, you are probaly thikning of the stat hosts [18:12:46] those have huge disks [18:12:49] but, still [18:12:50] PROBLEM - SSH on pdf3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:56] i think udp2log shoudl stay on gadolinium, now that we are talking about it [18:12:58] we should move socat [18:13:00] it is a pain to move [18:13:03] ottomata: so i think if we allocate a dual cpu host for another host it'll be identical [18:13:05] so we should move it once and then leave it [18:13:14] in which case i dont care which you move, i revoke my statement ;] [18:13:28] well, gadolinium has 12 cores [18:13:29] since the other dual cpu host i would give you will have identical spec to this one [18:13:33] it has dual cpu yes [18:13:39] all our dual cpu misc hosts have 12 cores total [18:13:42] ah k [18:13:43] 6 core cpus [18:13:51] k [18:13:51] =] [18:13:57] so if its easier for you to move logging, thats cool [18:14:00] oh, hm, so it would be identical then? [18:14:00] lemme get you one [18:14:01] yeah it would be [18:14:13] way easier to move logging, that's just applying puppet to that box [18:14:27] socat change requires deploying varnish, squid and nginx configs to all frontend hosts [18:14:31] so if i give you a box and set its vlan can you handle rest? [18:14:34] yup [18:14:40] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [18:14:41] will file RT ticket now [18:14:46] cool, lemme know RT when you have it in and i'll comment on it [18:14:49] i'll allocate it now [18:15:10] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 183 seconds [18:16:10] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [18:16:49] RobH [18:16:49] https://rt.wikimedia.org/Ticket/Display.html?id=5507 [18:17:09] ottomata: Oh so gadolinium is external ip [18:17:10] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 20 seconds [18:17:17] does socat need external ip and should it be on internal [18:17:24] and does the logging need internal or external ip? [18:17:57] cuz now my concern is if one needs external and one internal, whatever is external shoudl stay on gadolinium [18:18:08] (erbium will be new host, fyi) [18:18:12] HMMMM [18:18:38] i think the socat needs external, udp2log does not [18:18:45] because socat receives traffic from esams [18:18:52] or something, right? [18:18:56] i don't really know the details of that [18:19:59] well, then thats perfect. [18:20:05] cuz the new one can be internal then [18:20:13] * RobH jealously guards external ip addresses [18:20:22] hehe, k [18:20:24] i think tha tshoudl be fine [18:20:28] I'll get your dns setup too for it. [18:20:32] we run udp2log instances on analytics boxes with intenral IPs [18:20:34] so that's fine [18:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [18:24:06] (PS6) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [18:24:25] (CR) jenkins-bot: [V: -1] Simplify our puppet master setup. 
[operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [18:24:29] Ryan_Lane, ok, now repos should be owned by gitpuppet (on sockpuppet and on stafford) and puppet-merge does sudo -u gitpuppet [18:24:49] oops, and there's a syntax error apparently [18:25:32] (PS7) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [18:26:06] andrewbogott: you should also add a git hook that disallows root from making changes [18:26:25] sorry for the piecemeal reviews :( [18:27:18] good idea [18:27:48] I think you want pre-commit [18:28:20] andrewbogott: http://pastebin.com/70zBztE1 [18:28:26] I'm using that elsewhere [18:28:28] (CR) Ottomata: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [18:28:47] there's already a pre-commit, I think we need pre-merge as well [18:28:48] though "your own user" should likely get changed ;) [18:28:55] * Ryan_Lane nods [18:30:22] ottomata: So erbium is now allocated to this task. It has internal vlan and ip set, but won't be in the dhcpd lease files yet [18:30:50] awesome danke [18:31:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:36] RobH, no OS, right? [18:33:19] no os yet [18:33:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [18:33:24] i see the ticket, so will comment on it [18:35:35] (PS8) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [18:38:58] !log reedy synchronized php-1.22wmf10/extensions/Wikibase [18:39:55] * Elsie bites AzaToth. [18:41:04] no biting in -operations! [18:44:06] (PS9) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [18:46:06] (CR) Ottomata: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [18:49:03] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 239 seconds [18:49:45] (PS10) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [18:50:03] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds [18:50:12] RobH [18:50:18] NIC 1 or 2? [18:50:23] Embedded NIC MAC Addresses: [18:50:23] NIC.Embedded.1-1-1 Ethernet = 90:B1:1C:2D:7E:CD [18:50:23] WWN = 90:B1:1C:2D:7E:CD [18:50:23] NIC.Embedded.2-1-1 Ethernet = 90:B1:1C:2D:7E:CE [18:50:23] WWN = 90:B1:1C:2D:7E:CE [18:50:35] oh sorry [18:50:56] hmmm, or 'MAC address'? from above? [18:51:01] MAC Address = 90:B1:1C:2D:7E:CF [18:51:02] hmm [18:51:08] that' is probably the racadmin addy [18:51:22] I guess: NIC 1? [18:51:22] 90:B1:1C:2D:7E:CD [18:51:23] ? [18:52:03] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:53:03] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [18:58:23] (PS1) Ottomata: Big ol' commit prepping for erbium udp2log instance. [operations/puppet] - https://gerrit.wikimedia.org/r/75474 [19:00:10] (PS2) Ottomata: Big ol' commit prepping for erbium udp2log instance. [operations/puppet] - https://gerrit.wikimedia.org/r/75474 [19:00:19] (CR) Ottomata: [C: 2 V: 2] Big ol' commit prepping for erbium udp2log instance. 
[operations/puppet] - https://gerrit.wikimedia.org/r/75474 (owner: Ottomata) [19:00:23] (Merged) Ottomata: Big ol' commit prepping for erbium udp2log instance. [operations/puppet] - https://gerrit.wikimedia.org/r/75474 (owner: Ottomata) [19:03:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [19:05:34] (PS1) Jgreen: don't fetch dumps from db1025, it's fundraisingdb master now [operations/puppet] - https://gerrit.wikimedia.org/r/75475 [19:06:18] (PS1) Ottomata: Fixing missing semicolon [operations/puppet] - https://gerrit.wikimedia.org/r/75476 [19:06:37] (CR) Ottomata: [C: 2 V: 2] Fixing missing semicolon [operations/puppet] - https://gerrit.wikimedia.org/r/75476 (owner: Ottomata) [19:06:38] (Merged) Ottomata: Fixing missing semicolon [operations/puppet] - https://gerrit.wikimedia.org/r/75476 (owner: Ottomata) [19:06:45] (CR) Jgreen: [C: 2 V: 2] don't fetch dumps from db1025, it's fundraisingdb master now [operations/puppet] - https://gerrit.wikimedia.org/r/75475 (owner: Jgreen) [19:06:46] (Merged) Jgreen: don't fetch dumps from db1025, it's fundraisingdb master now [operations/puppet] - https://gerrit.wikimedia.org/r/75475 (owner: Jgreen) [19:07:06] oop, jeff i' about to run puppet-merge [19:07:09] will merge for ytou [19:07:14] Jeff_Green: ^ [19:07:28] oop i just did [19:07:35] oo mee too! [19:07:39] ok then! [19:08:27] ottomata: sorry, was writing tech doc and zoned out [19:08:30] nic1 [19:08:36] no prob, i found the mention of first nic on build a new server doc [19:08:37] danke [19:08:53] yea im workign on improving that doc this week [19:08:55] awesome [19:08:58] (CR) Ryan Lane: [C: 2] Allow the user's browser to cache resources [operations/puppet] - https://gerrit.wikimedia.org/r/75366 (owner: Demon) [19:09:01] having to get new row d buildout stuff done first but its on list [19:09:02] (Merged) Ryan Lane: Allow the user's browser to cache resources [operations/puppet] - https://gerrit.wikimedia.org/r/75366 (owner: Demon) [19:11:08] Elsie: :-P [19:11:58] ^demon: so, what heppened to gerrit? [19:12:32] <^demon> Gerrit's fine? [19:12:39] understand it's not top prio, but I've not seen you rereviewed it [19:12:49] the "update" [19:12:57] <^demon> Ahh, that. Yeah, I do need to do that. [19:13:26] ^demon: https://gerrit.wikimedia.org/r/#/c/68485/ [19:13:33] still a big red X from you there :( [19:14:10] as you are not verybusy atm ツ [19:14:31] <^demon> Hehe, I'm trying to make gitblit less crashy first :) [19:15:04] (PS2) Demon: Adding init script for gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75350 [19:15:13] ah [19:16:39] (CR) Ryan Lane: [C: 2] Adding init script for gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75350 (owner: Demon) [19:16:40] (Merged) Ryan Lane: Adding init script for gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75350 (owner: Demon) [19:17:26] ^demon: when you make init scripts, it's a good idea to use /etc/init.d/skeleton as boilerplate [19:17:41] when there's a "chkconfig" line in a init script, it smells redhat [19:18:17] ^demon: you hadn't init info in the file [19:18:43] <^demon> It's basically copied from upstream :p [19:18:51] and you should have a "status" command [19:18:54] I know [19:19:54] the vars should be read from /etc/default/gitblit as well... :-P [19:21:22] mixing chkconfig and start-stop-daemon... 
it's like having a bash script running System32.exe [19:22:25] (PS7) Pyoungmeister: Add elasticsearch module and role. [operations/puppet] - https://gerrit.wikimedia.org/r/74534 (owner: Manybubbles) [19:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:27] AzaToth: it's like have upstart manage an init job [19:23:03] i kid. it'd be better still to just write an upstart job, but i think the availability of a good init script upstream trumps other arguments. [19:23:11] ori-l: btw, I noticed you added me to Editor-engagement [19:23:21] i did? [19:23:24] did you ask me to add you? [19:23:33] I've not afaik [19:24:02] https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Editor-engagement&diff=72400&oldid=67639 [19:24:05] (CR) Manybubbles: [C: 1] "The rebase looks right." [operations/puppet] - https://gerrit.wikimedia.org/r/74534 (owner: Manybubbles) [19:24:12] perhaps I've forgot I asked you ヾ [19:24:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:24:35] * AzaToth scratches heards [19:25:10] I don't know if I'm a member of that group [19:25:35] AzaToth, there's an infinite beauty in crappy shit that works as opposed to nice shit that doesn't [19:25:59] AzaToth: probably I just made a mistake. but now you have to start doing editor engagement work! [19:26:03] * ori-l looks forward to your patches [19:26:07] ori-l: hehe [19:26:32] if there's no other option, be bold and run system32 from bash [19:26:49] there's no such exe, however:P [19:26:56] hehe ツ [19:29:26] PROBLEM - Host payments1 is DOWN: PING CRITICAL - Packet loss = 100% [19:29:43] !log dist-upgrade & reboot for payments* [19:29:51] oh noes, no payments! [19:29:54] Logged the message, Master [19:30:26] whee! 
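On the gitblit init script review comments above (missing LSB header, no status action, redhat-ish chkconfig lines): /etc/init.d/skeleton on Debian/Ubuntu shows the expected shape. A trimmed, hedged sketch of what that script could look like; the daemon path, arguments and pidfile are placeholders, not the real packaging:

    #!/bin/sh
    ### BEGIN INIT INFO
    # Provides:          gitblit
    # Required-Start:    $remote_fs $network
    # Required-Stop:     $remote_fs $network
    # Default-Start:     2 3 4 5
    # Default-Stop:      0 1 6
    # Short-Description: Gitblit git repository viewer
    ### END INIT INFO

    . /lib/lsb/init-functions
    # site settings belong in /etc/default/gitblit, as suggested in the review
    [ -r /etc/default/gitblit ] && . /etc/default/gitblit
    DAEMON=${DAEMON:-/usr/bin/java}
    PIDFILE=/var/run/gitblit.pid

    case "$1" in
      start)
        start-stop-daemon --start --background --make-pidfile --pidfile "$PIDFILE" \
            --exec "$DAEMON" -- $DAEMON_ARGS
        ;;
      stop)
        start-stop-daemon --stop --pidfile "$PIDFILE" --retry 10
        ;;
      status)
        status_of_proc -p "$PIDFILE" "$DAEMON" gitblit
        ;;
      restart)
        "$0" stop
        "$0" start
        ;;
      *)
        echo "Usage: $0 {start|stop|status|restart}" >&2
        exit 3
        ;;
    esac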
[19:30:46] RECOVERY - Host payments1 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [19:32:17] (PS1) Ottomata: erbium is internal name, not .wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/75482 [19:32:32] (CR) Ottomata: [C: 2 V: 2] erbium is internal name, not .wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/75482 (owner: Ottomata) [19:32:37] (Merged) Ottomata: erbium is internal name, not .wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/75482 (owner: Ottomata) [19:35:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Connecting Slave SQL: No Seconds Behind Master: (null) [19:35:17] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Connecting Slave SQL: No Seconds Behind Master: (null) [19:35:18] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Connecting Slave SQL: No Seconds Behind Master: (null) [19:35:51] (PS3) Physikerwelt: Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 [19:39:57] (PS1) Ottomata: Fixing location of packet-loss.log on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75484 [19:39:58] (PS1) Ottomata: role/logging.pp: tabs -> spaces [operations/puppet] - https://gerrit.wikimedia.org/r/75485 [19:40:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:40:18] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:40:19] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:40:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:51] (CR) Ottomata: [C: 2 V: 2] Fixing location of packet-loss.log on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75484 (owner: Ottomata) [19:40:52] (Merged) Ottomata: Fixing location of packet-loss.log on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75484 (owner: Ottomata) [19:41:09] (CR) Ottomata: [C: 2 V: 2] role/logging.pp: tabs -> spaces [operations/puppet] - https://gerrit.wikimedia.org/r/75485 (owner: Ottomata) [19:41:13] (Merged) Ottomata: role/logging.pp: tabs -> spaces [operations/puppet] - https://gerrit.wikimedia.org/r/75485 (owner: Ottomata) [19:41:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [19:41:18] (CR) Physikerwelt: "I'm not sure if I got that right." 
[operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt) [19:44:54] (PS1) Ottomata: Using webrequest_log_directory as base erbium udp2log log_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75486 [19:45:09] (CR) Ottomata: [C: 2 V: 2] Using webrequest_log_directory as base erbium udp2log log_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75486 (owner: Ottomata) [19:45:10] (Merged) Ottomata: Using webrequest_log_directory as base erbium udp2log log_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75486 (owner: Ottomata) [19:45:53] (PS1) Ottomata: Don't need filter dir on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75487 [19:46:06] (CR) Ottomata: [C: 2 V: 2] Don't need filter dir on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75487 (owner: Ottomata) [19:46:07] (Merged) Ottomata: Don't need filter dir on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75487 (owner: Ottomata) [19:46:42] !log reedy synchronized php-1.22wmf11/extensions/Wikibase [19:47:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:54] agh i gotta run for a bit, be back on to finish up erbium (Jeff_Green) [19:47:55] byeye [19:49:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:51:08] LeslieCarr: the Ohio RTT pin has no number? why? :( [19:51:35] (CR) Pyoungmeister: [C: 2] Add elasticsearch module and role. [operations/puppet] - https://gerrit.wikimedia.org/r/74534 (owner: Manybubbles) [19:51:38] (Merged) Pyoungmeister: Add elasticsearch module and role. [operations/puppet] - https://gerrit.wikimedia.org/r/74534 (owner: Manybubbles) [19:51:55] manybubbles: ok, all merged up [19:52:03] as for nagios and ganglia stuffs, I can take a look [19:52:22] notpeter_: manybubbles: I guess you can soon get an instance in beta :-] [19:52:27] oh that is because that probe was reachabe but didn't send back a response [19:54:10] !log reedy synchronized php-1.22wmf11/extensions/Wikibase [19:55:18] LeslieCarr: lazy probe [19:55:54] that probe does not speak for all of Ohio! [19:57:26] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [19:58:23] totally lazy! [20:00:06] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 182 seconds [20:02:06] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 29 seconds [20:05:43] (PS11) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [20:06:17] (PS1) Demon: Add one more misbehaving bot, copy full bad_browser list to gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75492 [20:07:18] ^demon: one day we will have to make that a shared list for all tools to reuse [20:07:32] or maybe I will finally work on having a frontend varnish for our misc tools [20:07:53] <^demon> Ryan_Lane: Misbehaving bot patch ^ [20:08:26] PROBLEM - udp2log log age for erbium on erbium is CRITICAL: NRPE: Command check_udp2log_log_age-erbium not defined [20:08:47] (CR) Andrew Bogott: [C: 2] Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [20:08:50] (Merged) Andrew Bogott: Simplify our puppet master setup. 
[operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [20:08:53] (CR) Ryan Lane: [C: 2] Add one more misbehaving bot, copy full bad_browser list to gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75492 (owner: Demon) [20:08:59] (Merged) Ryan Lane: Add one more misbehaving bot, copy full bad_browser list to gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75492 (owner: Demon) [20:09:06] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 239 seconds [20:09:39] andrewbogott: I merged your change on sockpuppet [20:09:47] thanks [20:10:14] yw [20:12:20] (PS1) Demon: Upping heap to something reasonable [operations/puppet] - https://gerrit.wikimedia.org/r/75493 [20:15:56] PROBLEM - Host payments1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:18:25] andrewbogott: hi sorry I missed our checkin this morning (your time) [20:18:38] hashar, no problem, I haven't worked on CI this week anyway [20:18:42] andrewbogott: there was not much to say anyway [20:18:44] hehe [20:20:06] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 13 seconds [20:20:26] RECOVERY - Host payments1002 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:21:00] (PS1) Andrew Bogott: Actually include gitpuppet class. [operations/puppet] - https://gerrit.wikimedia.org/r/75494 [20:21:58] (CR) Andrew Bogott: [C: 2] Actually include gitpuppet class. [operations/puppet] - https://gerrit.wikimedia.org/r/75494 (owner: Andrew Bogott) [20:22:01] (Merged) Andrew Bogott: Actually include gitpuppet class. [operations/puppet] - https://gerrit.wikimedia.org/r/75494 (owner: Andrew Bogott) [20:29:49] (PS1) Andrew Bogott: Use 'private/' rather than '..private/' [operations/puppet] - https://gerrit.wikimedia.org/r/75495 [20:30:07] (CR) jenkins-bot: [V: -1] Use 'private/' rather than '..private/' [operations/puppet] - https://gerrit.wikimedia.org/r/75495 (owner: Andrew Bogott) [20:30:50] ryan_lane, any concerns about https://gerrit.wikimedia.org/r/#/c/75495/? [20:31:07] I'm going to link 'private' in /etc/puppet, I don't want to link it directly in /etc [20:31:30] hm, jenkins hates that [20:33:20] (PS1) Andrew Bogott: Link private repo to /etc/private [operations/puppet] - https://gerrit.wikimedia.org/r/75496 [20:34:17] (PS2) Andrew Bogott: Link private repo to /etc/private [operations/puppet] - https://gerrit.wikimedia.org/r/75496 [20:34:22] eh, nm, I'll just recreate the old model in /etc/private [20:34:56] (Abandoned) Andrew Bogott: Use 'private/' rather than '..private/' [operations/puppet] - https://gerrit.wikimedia.org/r/75495 (owner: Andrew Bogott) [20:35:11] (PS3) Andrew Bogott: Link private repo to /etc/private [operations/puppet] - https://gerrit.wikimedia.org/r/75496 [20:36:03] (CR) Andrew Bogott: [C: 2] Link private repo to /etc/private [operations/puppet] - https://gerrit.wikimedia.org/r/75496 (owner: Andrew Bogott) [20:36:04] (Merged) Andrew Bogott: Link private repo to /etc/private [operations/puppet] - https://gerrit.wikimedia.org/r/75496 (owner: Andrew Bogott) [20:36:55] ooooo mid puppet master cleanup, eh? 
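For readers following the puppet-master cleanup thread: the last few commits are only about where the checked-out private repository is exposed to puppet. A heavily hedged sketch of the idea, with both paths invented for illustration (the real clone location and link target are whatever those commits settled on):

    # make the gitpuppet-owned private checkout visible where the manifests expect it
    ln -sfn /srv/git/operations/private /etc/puppet/private
    # the fallback mentioned above, recreating the old layout under /etc
    ln -sfn /srv/git/operations/private /etc/private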
[20:39:39] ottomata, yeah [20:41:00] (PS1) Hashar: phase out misc::contint::test::packages [operations/puppet] - https://gerrit.wikimedia.org/r/75497 [20:41:01] (PS1) Hashar: creates role::contint::jenkins::slave [operations/puppet] - https://gerrit.wikimedia.org/r/75498 [20:41:02] (PS1) Hashar: replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499 [20:41:03] (PS1) Hashar: replicate Gerrit repos to Jenkins slave gallium [operations/puppet] - https://gerrit.wikimedia.org/r/75500 [20:41:57] (CR) Hashar: "This is part of the manifests/misc/contint.pp cleanup which I have been slowly doing on a step by step basis." [operations/puppet] - https://gerrit.wikimedia.org/r/75497 (owner: Hashar) [20:43:07] (PS2) Hashar: creates role::contint::jenkins::slave [operations/puppet] - https://gerrit.wikimedia.org/r/75498 [20:44:27] (CR) J: [C: 1] Proposed settings for VIPS. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74514 (owner: Brian Wolff) [20:44:35] (PS1) Manybubbles: Make config file depend on elasticsearch package. [operations/puppet] - https://gerrit.wikimedia.org/r/75501 [20:45:18] https://gerrit.wikimedia.org/r/#/c/73565/ [20:45:27] Can we get this merged, please? Pretty please? [20:45:41] (PS2) Hashar: replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499 [20:45:42] (PS2) Hashar: replicate Gerrit repos to Jenkins slave gallium [operations/puppet] - https://gerrit.wikimedia.org/r/75500 [20:45:58] (PS2) Manybubbles: Make config file depend on elasticsearch package. [operations/puppet] - https://gerrit.wikimedia.org/r/75501 [20:46:09] twkozlowski: not going to happen until the discussion on wikitech-l reach some point [20:46:23] twkozlowski: or we will end up in a commit war. sorry [20:46:40] hashar: have you followed that thread? [20:46:45] no [20:46:56] then please do, or at least skim it [20:47:01] I have no interest in bikeshedding over a preference :) [20:47:06] (there's a lot of text there) [20:47:26] i will happily merge and deploy that change if there is a clear instruction to have it deployed [20:47:28] the point is rather clearly reached, but, just as before, not acknowledged. [20:47:41] so get it acknowledged :-] [20:47:44] heh [20:47:57] in one or another then we can either abandon or merge+deploy [20:47:57] the later I can do [20:48:08] sadly i don't have physical access to james, necessary to do that [20:48:11] but I will only be the weapon there, not the solider [20:48:13] soldier [20:49:21] hashar: the discussion on wikitech-l has pretty much finished [20:49:38] without /any at all/ involvement from James or other VE people, as MatmaRex duly noted [20:49:48] I don't want to be involved in that discussion [20:49:49] well… ^demon, is gerrit suffering a crisis? [20:49:54] but will happily merge the change if needed. [20:49:55] well, james did restate his opinion [20:49:59] Possible I broke it but I don't think so... [20:50:02] (with regards to the preference, not the technical details of the VE) [20:50:06] anyway, i'm off to do something useful instead [20:50:18] (PS1) Andrew Bogott: Move the puppetmaster private link yet again. [operations/puppet] - https://gerrit.wikimedia.org/r/75503 [20:50:18] andrewbogott: lots of weirdness I should just wait on right now, right? [20:50:21] haha [20:50:22] I think I'll just abandon this shit [20:50:23] guess so :) [20:50:28] ottomata, yep [20:50:32] k danke [20:50:37] twkozlowski: why would you do that? 
[20:50:39] andrewbogott: gerrit been slow on my side too. [20:50:53] slow is fine, I was just worried it had died entirely [20:51:16] ottomata, it's going to be an hour or two before I feel confident that the new setup is stable [20:53:59] off to bed [20:54:05] MatmaRex: I feel uneasy with being the submitter of the patch. [20:54:25] and moreover, "Erik Möller has said that they will not be offering an integrated option to disable Visual Editor, essentially for the reason that they don't want to make it easy to disable." [20:55:52] (PS3) Hashar: replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499 [20:55:53] (PS3) Hashar: replicate Gerrit repos to Jenkins slave gallium [operations/puppet] - https://gerrit.wikimedia.org/r/75500 [20:55:58] (CR) Andrew Bogott: [C: 2] Move the puppetmaster private link yet again. [operations/puppet] - https://gerrit.wikimedia.org/r/75503 (owner: Andrew Bogott) [20:57:18] (Merged) Andrew Bogott: Move the puppetmaster private link yet again. [operations/puppet] - https://gerrit.wikimedia.org/r/75503 (owner: Andrew Bogott) [20:59:24] (CR) Hashar: "PS2 removed a change dependency" [operations/puppet] - https://gerrit.wikimedia.org/r/75499 (owner: Hashar) [21:04:58] ryan_lane, ping me when you're back? [21:07:50] (PS1) Aude: enable wikidata dispatcher for wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75505 [21:15:28] (PS2) Aude: enable wikidata dispatcher for wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75505 [21:16:52] (PS1) Manybubbles: Enable CirrusSearch in beta. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75507 [21:17:35] (CR) Manybubbles: [C: -1] "Not ready because we don't know the names of the servers and I'm sure there are other things wrong with it :)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75507 (owner: Manybubbles) [21:19:31] (PS1) Andrew Bogott: Move puppet templatedir again. [operations/puppet] - https://gerrit.wikimedia.org/r/75508 [21:20:24] (CR) Andrew Bogott: [C: 2] Move puppet templatedir again. [operations/puppet] - https://gerrit.wikimedia.org/r/75508 (owner: Andrew Bogott) [21:20:25] (Merged) Andrew Bogott: Move puppet templatedir again. [operations/puppet] - https://gerrit.wikimedia.org/r/75508 (owner: Andrew Bogott) [21:22:16] (CR) Reedy: [C: 2] "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75505 (owner: Aude) [21:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:22:32] (Merged) jenkins-bot: enable wikidata dispatcher for wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75505 (owner: Aude) [21:24:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [21:26:08] springle: This is Nik from the database email. Good morning! [21:26:22] manybubbles, hi :) [21:26:55] I don't get to talk to many folks in Australia just because of the time difference. I normally log off pretty soon. [21:27:14] !log reedy synchronized wmf-config/CommonSettings.php [21:27:25] Logged the message, Master [21:28:39] and I've got to head out - really soon - oh bother timing! [21:29:04] manybubbles, when do you usually start? 
[21:29:21] i'm often around late [21:29:27] about nine hours ago [21:29:37] (PS1) Andrew Bogott: Don't use group 'gitpuppet' [operations/puppet] - https://gerrit.wikimedia.org/r/75510 [21:29:54] well, anywhere from 7.5 hours ago to 9 hours ago, depending on the morning [21:30:34] (CR) Andrew Bogott: [C: 2] Don't use group 'gitpuppet' [operations/puppet] - https://gerrit.wikimedia.org/r/75510 (owner: Andrew Bogott) [21:30:35] (Merged) Andrew Bogott: Don't use group 'gitpuppet' [operations/puppet] - https://gerrit.wikimedia.org/r/75510 (owner: Andrew Bogott) [21:31:01] springle: so Coren was saying that there is hardware waiting to be sprinkled with magic database dust. [21:32:12] springle: Long story short, I have a changeset that just blindly installs a vanilla mysql; I'm pretty sure the defaults would be All Wrong™. :-) https://gerrit.wikimedia.org/r/#/c/74158/ [21:33:02] (PS1) Andrew Bogott: And now we've entered the typo-heavy part of the afternoon [operations/puppet] - https://gerrit.wikimedia.org/r/75512 [21:33:34] springle: Also we want labsudb2 to be a slave of labsudb1. I could do this with my eyes closed with postgres, but I have no idea where to start with mysql. :-) [21:34:08] Coren: I've always enjoyed postgres too.... [21:35:13] (CR) Andrew Bogott: [C: 2] And now we've entered the typo-heavy part of the afternoon [operations/puppet] - https://gerrit.wikimedia.org/r/75512 (owner: Andrew Bogott) [21:35:14] (Merged) Andrew Bogott: And now we've entered the typo-heavy part of the afternoon [operations/puppet] - https://gerrit.wikimedia.org/r/75512 (owner: Andrew Bogott) [21:35:48] Coren and springle: I don't strictly need the slave but it would certainly make my labs work even more prodlike. I do need the performance and for my rampant IO not to crush everyone else.... [21:36:42] manybubbles: *We* need the slave, remember you're just the most immediate user but this box is going to replace a pile of smaller cruddy mysql running in VMs. :-) [21:36:59] Coren: sounds good to me! [21:37:16] (Running DBs in VMs has never been a good idea) [21:37:30] manybubbles, Coren, can you shoot me a quick email primer? say, description of the dataset. types of queries you expect. read/write ratio guess. ram on the vm, etc [21:38:02] I can send my summary [21:38:18] thanks [21:44:14] RobH: about ? [21:45:04] notpeter_: am now, was at lunch [21:45:06] sup? [21:45:08] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 205 seconds [21:45:48] manybubbles, MySQL or MariaDB? 
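Since "where to start with mysql" replication came up: the classic master/slave setup is a handful of steps, namely binary logging plus a replication account on the master, then CHANGE MASTER TO and START SLAVE on the slave. A rough sketch with placeholder host names, credentials and log coordinates (the same steps apply to MariaDB):

    # --- on the master (labsdb1 here) ---
    # my.cnf needs at least: server-id=1 and log-bin=mysql-bin, then restart mysqld
    mysql -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'labsdb2.example' IDENTIFIED BY 'secret';"
    mysql -e "SHOW MASTER STATUS;"    # note the File and Position columns

    # --- on the slave (labsdb2), after loading a consistent copy of the data ---
    # my.cnf needs a distinct server-id, e.g. server-id=2
    mysql <<'SQL'
    CHANGE MASTER TO
      MASTER_HOST='labsdb1.example',
      MASTER_USER='repl',
      MASTER_PASSWORD='secret',
      MASTER_LOG_FILE='mysql-bin.000001',  -- File from SHOW MASTER STATUS
      MASTER_LOG_POS=107;                  -- Position from the same output
    START SLAVE;
    SQL
    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'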
[21:46:08] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 17 seconds [21:46:21] RobH: I'm assuming yo ujust back from lunch with leslie as she just responded to the pertinent question :) [21:46:24] so, you're in the clear [21:48:08] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [21:48:08] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [21:48:08] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [21:48:08] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [21:48:08] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [21:48:09] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [21:48:09] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [21:49:24] woooo [21:49:27] \o/ [21:53:08] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours [21:53:19] Hm. anyone around who knows how puppet certs are supposed to work? [21:53:26] I seem to've broken them without touching them :( [21:56:11] haha, the puppet cert server is sockpuppet though the master is stafford ... [21:56:14] what's th eerror ? [21:59:08] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours [21:59:23] <^demon> Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/75493/ also, that ups the heap from 1G to 4G [21:59:33] LeslieCarr: The answer is I'm being dumb, I think -- trying to run puppet as user [21:59:54] yep [22:00:08] (PS1) AzaToth: Creating initial debianization [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 [22:00:19] ah [22:00:20] andrewbogott: I'm back. need help with anything? [22:00:55] Ryan_Lane: I did but now maybe I don't :) taking stock [22:01:25] heh [22:01:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [22:03:35] <^demon> err: /Stage[main]/Gitblit::Instance/Service[gitblit]: Could not evaluate: Could not find init script for 'gitblit' [22:03:35] <^demon> err: /Stage[main]/Gitblit::Instance/File[/etc/init.d/gitblit]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///files/gitblit/gitblit-ubuntu at /etc/puppet/manifests/misc/gitblit.pp:47 [22:03:39] <^demon> Ryan_Lane: ^ [22:03:46] Ryan_Lane, lots of that patch doesn't work because root isn't allowed to use 'sudo'. Should I add root to sudoers, or is there some use of 'su' that will replace my calls to sudo? [22:04:08] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours [22:04:38] andrewbogott: you can use su -s [22:04:49] err [22:05:16] su - gitpuppet -c 'blah' [22:05:52] notpeter_: i see why - the dhcp files have zinc as its public name, not .eqiad.wmnet [22:06:13] LeslieCarr: you're the best [22:06:14] thanks! 
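Given that sudo is effectively disabled on these hosts, the workable pattern is the su form pasted above, with an explicit shell because gitpuppet has no login shell. A minimal sketch of a puppet-merge-style wrapper built on it; the repository path is an assumption:

    #!/bin/bash
    # run the merge as gitpuppet even though the operator is root; -s supplies a
    # shell for the shell-less user, and the trailing argument is the target user
    REPO=/var/lib/git/operations/puppet   # placeholder path
    su - -s /bin/bash -c "cd '$REPO' && git pull --ff-only" gitpuppet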
[22:06:15] hm [22:06:23] I guess I herped when I should have derped [22:06:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:07:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.155 second response time [22:07:18] LeslieCarr: did you just fix? [22:07:30] (CR) Physikerwelt: [C: 2 V: 2] Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt) [22:07:31] (Merged) Physikerwelt: Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt) [22:07:37] andrewbogott: ah. it has no shell [22:07:52] Ah, I will fix that in this patch as well. [22:08:02] (PS1) Demon: Typofix [operations/puppet] - https://gerrit.wikimedia.org/r/75514 [22:08:09] <^demon> Ryan_Lane: Fixed that ^ [22:08:12] andrewbogott: for instance: su - -s '/bin/bash' -c 'touch /tmp/test' gitpuppet [22:08:48] (PS1) Andrew Bogott: Use su instead of sudo [operations/puppet] - https://gerrit.wikimedia.org/r/75515 [22:08:49] (CR) Ryan Lane: [C: 2] Typofix [operations/puppet] - https://gerrit.wikimedia.org/r/75514 (owner: Demon) [22:08:50] notpeter_: i did not fix [22:08:50] (Merged) Ryan Lane: Typofix [operations/puppet] - https://gerrit.wikimedia.org/r/75514 (owner: Demon) [22:08:59] LeslieCarr: where are you seeing this? [22:09:02] oh wait [22:09:03] looks right to me [22:09:07] old puppet [22:09:08] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 239 seconds [22:09:09] stupid puppet [22:09:13] ah, gotcha [22:09:14] :D puppet-merge is broken [22:09:29] Ryan_Lane: you're broken! [22:10:05] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 5 seconds [22:10:38] oh no :( [22:10:48] physikerwelt___: I assume you know how to build it now [22:11:44] (CR) Andrew Bogott: [C: 2] Use su instead of sudo [operations/puppet] - https://gerrit.wikimedia.org/r/75515 (owner: Andrew Bogott) [22:11:45] notpeter_: hrm, it's bad on brewster - i'm guessing that it is either related to puppet merge or something ? [22:11:57] physikerwelt___: I don't know if it was right to place the sty:s in /usr/share/texmf/ (which is $TEXMFDEBIAN), but it was totally wrong to have them in TEXMFLOCAL due to the fact we are making a package [22:12:03] I'm no tex package master [22:12:08] root@brewster:/etc/dhcp3# grep -i zinc * [22:12:09] linux-host-entries.ttyS1-115200:host zinc { [22:12:10] linux-host-entries.ttyS1-115200: fixed-address zinc.wikimedia.org; [22:12:11] YuviPanda: wtf? [22:12:17] ah, gotcha [22:12:19] (Merged) Andrew Bogott: Use su instead of sudo [operations/puppet] - https://gerrit.wikimedia.org/r/75515 (owner: Andrew Bogott) [22:13:20] root not allowed to sudo? [22:13:34] why not add sudo to root? in sudoers? [22:14:25] which offcourse shouldn't be needed [22:14:51] I assume you have "root ALL=(ALL:ALL) ALL" in sudoers [22:14:59] probably not [22:15:05] we basically have sudo disabled [22:15:50] hmm [22:16:09] not that it's a good thing [22:16:09] Ryan_Lane: wouldn't using sudo generally be more safe than using su? 
[22:16:09] we just handled that poorly
[22:16:47] {{sofixit}}
[22:16:50] * AzaToth hides
[22:17:11] (CR) Physikerwelt: [V: 2] Creating initial debianization [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[22:18:24] (PS1) Andrew Bogott: When using su - we have to explicitly set cwd [operations/puppet] - https://gerrit.wikimedia.org/r/75517
[22:19:14] (CR) Andrew Bogott: [C: 2] When using su - we have to explicitly set cwd [operations/puppet] - https://gerrit.wikimedia.org/r/75517 (owner: Andrew Bogott)
[22:19:15] (Merged) Andrew Bogott: When using su - we have to explicitly set cwd [operations/puppet] - https://gerrit.wikimedia.org/r/75517 (owner: Andrew Bogott)
[22:20:25] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours
[22:21:10] (CR) Pyoungmeister: [C: 2] Make config file depend on elasticsearch package. [operations/puppet] - https://gerrit.wikimedia.org/r/75501 (owner: Manybubbles)
[22:21:11] (Merged) Pyoungmeister: Make config file depend on elasticsearch package. [operations/puppet] - https://gerrit.wikimedia.org/r/75501 (owner: Manybubbles)
[22:24:09] Ryan_Lane: What in blazes is nagios doing on labstore3? Its check_disk process runs at 100% CPU for 30 seconds every minute(!)
[22:24:54] And it grows to an RSS of 8g!
[22:25:00] whaaaaaaat?
[22:25:22] well, it seems like it would be the check_disk process that's an issue
[22:25:42] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
[22:25:42] 7944 nagios 20 0 6373m 6.2g 688 R 100 19.8 0:16.85 check_disk
[22:26:04] right, so it's check disk
[22:26:18] need to figure out what in the hell it's doing
[22:26:27] wow.
[22:26:59] Well, it didn't *use* to do that; I was tracking down odd spikes in the usage.
[22:26:59] maybe it sees 80918401840 devices and loses its shit? :)
[22:28:34] (PS1) Demon: Block spiders from indexing archives [operations/puppet] - https://gerrit.wikimedia.org/r/75518
[22:28:39] (PS1) Andrew Bogott: Meaningless test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75519
[22:28:41] Hm. Something odd is going on with the snapshots, that's probably why it's going crazy.
[22:28:53] * Coren|Away chuckles.
[22:29:03] Apparently, I no longer unmount snapshots. Ever. :-)
[22:29:08] hahahahaha
[22:29:35] that could be… problematic
[22:29:37] (PS2) Demon: Upping heap to something reasonable [operations/puppet] - https://gerrit.wikimedia.org/r/75493
[22:29:47] Hm. We probably want to have it not check the snapshots though.
[22:29:52] At best, it's useless.
[22:29:55] (CR) Andrew Bogott: [C: 2] Meaningless test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75519 (owner: Andrew Bogott)
[22:29:56] (Merged) Andrew Bogott: Meaningless test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75519 (owner: Andrew Bogott)
[22:31:03] (CR) Ryan Lane: [C: 2] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75493 (owner: Demon)
[22:31:04] (Merged) Ryan Lane: Upping heap to something reasonable [operations/puppet] - https://gerrit.wikimedia.org/r/75493 (owner: Demon)
[22:31:29] LeslieCarr: livehacked dhcp files on brewster, still no dice
[22:31:37] straight up no requests coming in
[22:31:39] andrewbogott: puppet-merge shows the login text for stafford now ;)
[22:31:41] and restarted dhcp ?
[22:31:48] well, the MOTD anyway
[22:32:02] Yeah, but it doesn't actually do anything...
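
On the labstore3 check_disk problem above: check_disk walks every mounted filesystem by default, so thousands of lingering snapshot and bind mounts make it burn CPU and memory even though none of them are interesting. The plugin can be pinned to the paths that matter or told to ignore mount points by regex; a sketch, with the thresholds, plugin path, and snapshot path pattern being illustrative rather than taken from the log:

    # only check the filesystems we actually care about
    /usr/lib/nagios/plugins/check_disk -w 10% -c 5% -l -p / -p /srv
    # or keep the default sweep but skip anything that looks like a snapshot mount
    /usr/lib/nagios/plugins/check_disk -w 10% -c 5% -l -i 'snapshot'
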
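
On the brewster/zinc thread: the grep output earlier shows the host entry handing out zinc's public wikimedia.org name, while an eqiad install should be getting the .eqiad.wmnet one, and live edits to the files do nothing until the DHCP daemon is restarted, which is exactly how this resolves just below. Roughly, with the file name taken from the log and the service name assumed from the dhcp3 packaging:

    # /etc/dhcp3/linux-host-entries.ttyS1-115200 on brewster (sketch, other options elided)
    host zinc {
        ...
        fixed-address zinc.eqiad.wmnet;   # was zinc.wikimedia.org, the public name
    }

    # the daemon only rereads its config on restart
    /etc/init.d/dhcp3-server restart
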
[22:32:09] ah
[22:32:17] stdin: is not a tty
[22:32:26] notpeter_: restart the machine again, i've got the network port listening
[22:32:26] does the user have a shell?
[22:32:30] oh it's down right now
[22:32:40] ok back up
[22:32:40] just restarted
[22:32:40] if you can restart
[22:33:05] Ryan_Lane, that's if you have a forwarded key. If you don't… Permission denied (publickey).
[22:33:14] ah
[22:33:43] the key has no passphrase and is a default id_rsa key?
[22:33:58] (CR) Ryan Lane: [C: 2] Block spiders from indexing archives [operations/puppet] - https://gerrit.wikimedia.org/r/75518 (owner: Demon)
[22:33:59] (Merged) Ryan Lane: Block spiders from indexing archives [operations/puppet] - https://gerrit.wikimedia.org/r/75518 (owner: Demon)
[22:34:03] LeslieCarr: hadn't restarted dhcp. derp derp derp
[22:34:13] Ryan_Lane: Yeah, I don't know any better. How should I be handling that instead?
[22:34:23] well, no, that's correct ;)
[22:34:37] Ah, then, yes. Yes it is.
[22:34:50] hm. /bin/sh?
[22:34:55] you should really use bash
[22:35:16] root@sockpuppet:~/puppet# su - gitpuppet
[22:35:16] $ ssh stafford
[22:35:16] Permission denied (publickey).
[22:35:19] notpeter_: is it happier now ? i saw the request
[22:35:26] ah ha
[22:35:37] andrewbogott: gitpuppet.key
[22:35:48] it should be id_rsa
[22:35:58] ah, right.
[22:36:01] if an rsa key, or id_dsa if a dsa key
[22:36:01] otherwise ssh won't use it
[22:36:09] unless you specify the key, anyway
[22:36:50] LeslieCarr: yep
[22:36:56] that seems to help, although behavior still seems wrong...
[22:37:02] LeslieCarr: I was just being particularly bone-headed
[22:37:40] andrewbogott: command needs to be before the key, not after
[22:37:44] anything after the key is a comment
[22:37:57] no prob
[22:38:57] there it goes
[22:39:29] now I have to sort out the no tty thing
[22:41:28] There might have been something wrong with my snapshot system. I count ~16k mounts. :-)
[22:42:30] (PS1) Andrew Bogott: Fix several issues with gitpuppet's keypair [operations/puppet] - https://gerrit.wikimedia.org/r/75520
[22:44:36] RECOVERY - Disk space on labstore3 is OK: DISK OK
[22:46:23] (PS2) Andrew Bogott: Fix several issues with gitpuppet's keypair [operations/puppet] - https://gerrit.wikimedia.org/r/75520
[22:47:02] (CR) Demon: [C: 1] replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499 (owner: Hashar)
[22:47:41] (PS3) Andrew Bogott: Fix several issues with gitpuppet's keypair [operations/puppet] - https://gerrit.wikimedia.org/r/75520
[22:47:53] (CR) Demon: [C: -1] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75500 (owner: Hashar)
[22:48:15] Coren|Away: :D
[22:48:30] (PS2) Physikerwelt: Creating initial debianization [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[22:48:49] Aha. Found why. I have an umount that presumes xfs.
[22:48:49] (CR) Andrew Bogott: [C: 2] Fix several issues with gitpuppet's keypair [operations/puppet] - https://gerrit.wikimedia.org/r/75520 (owner: Andrew Bogott)
[22:48:50] (Merged) Andrew Bogott: Fix several issues with gitpuppet's keypair [operations/puppet] - https://gerrit.wikimedia.org/r/75520 (owner: Andrew Bogott)
[22:50:31] Coren|Away: ah, right, and it's ext4 now
[22:52:06] It was still a bad idea to presume.
[22:53:08] notpeter_, can you review https://gerrit.wikimedia.org/r/74521 please?
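
Two ssh details from the exchange above, spelled out: by default the client only offers key files with the standard names (~/.ssh/id_rsa for an RSA key, ~/.ssh/id_dsa for a DSA key), so a key stored as gitpuppet.key is never tried unless it is pointed at explicitly; and in authorized_keys, options such as a forced command have to come before the key type, because everything after the base64 key material is treated as a comment. A sketch; the forced-command path is hypothetical and the key material is elided:

    # client side: either rename the key to the default name ssh looks for...
    mv ~gitpuppet/.ssh/gitpuppet.key ~gitpuppet/.ssh/id_rsa   # and the matching .pub, if any
    # ...or keep the name and reference it from ~gitpuppet/.ssh/config:
    #     Host stafford stafford.pmtpa.wmnet
    #         IdentityFile ~/.ssh/gitpuppet.key

    # server side, one line in ~/.ssh/authorized_keys (options first, comment last):
    # command="/usr/local/bin/sync-puppet-checkout" ssh-rsa AAAA...elided... gitpuppet@sockpuppet
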
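
On the ~16k mounts and the umount that presumes xfs: since the volumes are ext4 now, it is safer to drive the cleanup off the mount table itself rather than the filesystem type. A sketch; the assumption that the leaked mount points have "snapshot" in their path is illustrative, not the real labstore3 layout:

    # how many filesystems a full sweep (or check_disk) has to walk
    wc -l < /proc/mounts
    # list leaked snapshot mounts by mount point, whatever the fstype, and lazily unmount them
    awk '$2 ~ /snapshot/ {print $2}' /proc/mounts | xargs -r -n1 umount -l
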
[22:53:10] * Ryan_Lane nods
[22:53:32] (PS1) Andrew Bogott: Tiny test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75521
[22:53:34] MaxSem: I was actually *just* looking at that :)
[22:54:07] (CR) Pyoungmeister: [C: 2] solr: __version__ magic field to GeoData schema [operations/puppet] - https://gerrit.wikimedia.org/r/74521 (owner: MaxSem)
[22:54:08] (Merged) Pyoungmeister: solr: __version__ magic field to GeoData schema [operations/puppet] - https://gerrit.wikimedia.org/r/74521 (owner: MaxSem)
[22:54:21] :):)
[22:54:44] (CR) Andrew Bogott: [C: 2] Tiny test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75521 (owner: Andrew Bogott)
[22:54:44] (Merged) Andrew Bogott: Tiny test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75521 (owner: Andrew Bogott)
[22:54:47] RoanKattouw, ori-l, gwicke_away: are you guys going to be doing any deploys soon?
[22:54:54] Not today
[22:54:56] I'll be merging my deployment changes soon
[22:54:57] ok
[22:55:08] I have a window for 8am-10am tomorrow, but nothing there involves git-deploy
[22:55:35] ok. good. let me know your next deploy so that I'm sure to be around
[22:56:17] I've tested it in labs, but things happen
[22:56:28] (PS5) Ryan Lane: Use grains for deployment targets [operations/puppet] - https://gerrit.wikimedia.org/r/74108
[22:56:37] Ryan_Lane: Please send Gabriel an email to that extent, CC Subbu. Gabriel is out on vacation this week and next, and Subbu will be doing any deploys should they be needed
[22:56:43] ok. will do
[22:56:47] what's subbu's email?
[22:56:51] ssastry@
[22:56:53] (CR) AzaToth: [C: -1] "(1 comment)" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[22:56:57] thanks
[22:57:18] (And I'm around to help Subbu out given that he hasn't done a lot of deploys yet)
[22:57:47] * Coren|Away is actually impressed the box held on with only check_disk throwing a fit despite having several thousand bind mounts, and a couple hundred extra snapshots mounted.
[22:58:41] bah, still getting 'permission denied' from within the post-merge hook
[22:58:48] Coren|Away: indeed
[22:59:02] andrewbogott: test by su-ing to the user
[22:59:11] Ryan_Lane, it works when I do that.
[22:59:18] root@sockpuppet:~/puppet# su - gitpuppet
[22:59:18] $ ssh stafford
[22:59:18] Already up-to-date.
[22:59:18] Submodule 'modules/cdh4' () registered for path 'modules/cdh4'
[22:59:18] Submodule 'modules/jmxtrans' () registered for path 'modules/jmxtrans'
[22:59:18] Submodule 'modules/zookeeper' () registered for path 'modules/zookeeper'
[22:59:19] Connection to stafford closed.
[22:59:38] oh
[22:59:39] I know
[22:59:57] andrewbogott: what does your su look like? you need to switch environment
[23:00:00] su -
[23:00:06] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 239 seconds
[23:00:16] so: su - -c 'blah' gitpuppet
[23:00:17] su - $git_user -c "cd ${basedir} && ${cmd}"
[23:00:42] you're running that in puppet-merge?
[23:00:55] yes
[23:01:13] rather than just running ssh?
[23:01:20] or are both needed?
[23:01:20] (PS1) Pyoungmeister: removing spence from site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/75523
[23:01:33] * Ryan_Lane looks at the code
[23:01:35] ${cmd} is 'git merge'
[23:01:40] which causes the post-merge hook
[23:01:42] which does the ssh
[23:01:45] ahh
[23:01:49] maybe the hook is getting called as some other user...
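
The question left hanging above has a simple answer: git runs hooks as whatever account ran the git command, in that account's environment. So if puppet-merge wraps the fetch in su but runs the merge itself as root, the post-merge hook, and the ssh to stafford inside it, fire as root and never see gitpuppet's key. A throwaway way to confirm which user a hook runs as (the stub below is illustrative):

    #!/bin/sh
    # temporary stand-in for .git/hooks/post-merge: record who actually runs the hook
    echo "post-merge ran as $(id -un) in $PWD" >> /tmp/post-merge-debug.log

Make it executable, run puppet-merge once, and the log file will keep saying root until the merge itself is wrapped.
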
[23:01:57] I couldn't imagine how
[23:02:09] (Abandoned) Pyoungmeister: re-kill spence from site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/73370 (owner: Dzahn)
[23:02:12] I prefer to have the stafford update in the post-merge hook so it happens if a user routes around puppet-merge
[23:02:27] not vital though
[23:03:25] (CR) Pyoungmeister: [C: 2] removing spence from site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/75523 (owner: Pyoungmeister)
[23:03:26] (Merged) Pyoungmeister: removing spence from site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/75523 (owner: Pyoungmeister)
[23:04:11] su - gitpuppet -c 'ssh stafford' <— that also works
[23:05:06] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 5 seconds
[23:05:07] there's a pending patch, you can run puppet-merge now and see
[23:06:19] it worked for me, but I have a forwarded agent
[23:06:46] so for some reason it's running as root
[23:07:08] the merge is occurring as root
[23:08:07] the post merge hook is: ssh -t -t stafford.pmtpa.wmnet 'cd /var/lib/git/operations/puppet && git pull && git submodule update --init'
[23:08:30] being called as root
[23:08:43] ok, so that's the problem… why running as root?
[23:09:13] git merge --quiet --ff-only "${fetch_head_sha1}" && \
[23:09:17] git submodule update --quiet --init)
[23:09:27] ^^ that causes the post-merge hook to be called as root
[23:09:58] Oh, that's a different stage of the process, but maybe that's breaking things...
[23:10:02] you're only wrapping the fetch as the git user as far as I can tell
[23:10:48] oh. wait
[23:10:56] su - $git_user -c "cd ${basedir} && ${cmd}"
[23:10:57] bleh
[23:10:59] that should work
[23:11:32] ooohhh
[23:11:32] also
[23:11:32] We should send the bug upstream to icinga. "check_disk performance drops to unacceptable levels when there are more than 20000 mounted filesystems" just to see their reaction. :-)
[23:11:39] I ran that from /root/puppet
[23:11:45] andrewbogott: are you also using /root/puppet?
[23:11:56] its post-merge hook is different
[23:12:01] Coren|Away: that'd be awesome
[23:12:09] ssh root@stafford.pmtpa.wmnet 'cd /var/lib/git/operations/puppet && git pull'
[23:12:21] (PS1) Andrew Bogott: Don't run our post-merge magic every time [operations/puppet] - https://gerrit.wikimedia.org/r/75525
[23:13:03] Ryan_Lane, In theory I'm not having anything to do with /root/puppet -- hoping to delete that soon.
[23:13:18] andrewbogott: puppet-merge runs based on the location
[23:13:32] The above patch seems right, but it doesn't quite explain the order that you're seeing.
[23:13:37] defaults to /root/puppet.
[23:13:46] oohhh
[23:13:48] docs are just wrong
[23:13:55] Yeah, I just changed it
[23:14:13] PROBLEM - Solr on solr1003 is CRITICAL: Average request time is 965.3333 (gt 400)
[23:14:19] Coren|Away: :D
[23:14:26] Coren|Away: "are you kidding me?"
[23:15:20] (PS2) Andrew Bogott: Don't run our post-merge magic every time [operations/puppet] - https://gerrit.wikimedia.org/r/75525
[23:15:43] RECOVERY - Solr on solr1 is OK: All OK
[23:15:50] (CR) Ryan Lane: [C: 2] Use grains for deployment targets [operations/puppet] - https://gerrit.wikimedia.org/r/74108 (owner: Ryan Lane)
[23:15:50] (Merged) Ryan Lane: Use grains for deployment targets [operations/puppet] - https://gerrit.wikimedia.org/r/74108 (owner: Ryan Lane)
[23:17:14] RECOVERY - Solr on solr1003 is OK: All OK
[23:17:28] andrewbogott: I merged a change if you're wanting to test it ;)
[23:17:52] It wasn't what I thought.
That last merge isn't happening as gitpuppet
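
Given the diagnosis above, the merge is the one git command still running as root, so its post-merge hook inherits root. The fix that lands just below as "I wrapped every git command except the one that mattered" is roughly shaped like this, reusing the variable names quoted from puppet-merge in the log; the rest is a sketch, not the exact patch:

    # inside puppet-merge: run the merge (and submodule update) as the git user too,
    # so the post-merge hook's ssh to stafford happens as gitpuppet, not root
    su - "$git_user" -c "cd ${basedir} && \
        git merge --quiet --ff-only '${fetch_head_sha1}' && \
        git submodule update --quiet --init"
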
[23:18:33] PROBLEM - Solr on solr1002 is CRITICAL: Average request time is 1080.3334 (gt 400)
[23:18:43] oh, may've found it
[23:19:26] (PS3) Physikerwelt: Creating initial debianization [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[23:19:34] PROBLEM - Solr on solr1001 is CRITICAL: Average request time is 1452.5 (gt 400)
[23:19:55] (PS3) Andrew Bogott: I wrapped every git command except the one that mattered. [operations/puppet] - https://gerrit.wikimedia.org/r/75525
[23:20:42] (CR) Andrew Bogott: [C: 2] I wrapped every git command except the one that mattered. [operations/puppet] - https://gerrit.wikimedia.org/r/75525 (owner: Andrew Bogott)
[23:20:42] (Merged) Andrew Bogott: I wrapped every git command except the one that mattered. [operations/puppet] - https://gerrit.wikimedia.org/r/75525 (owner: Andrew Bogott)
[23:21:25] OK, now I need another patch to test on
[23:21:40] * Coren|Away notes a minor difference:
[23:21:43] http://ganglia.wikimedia.org/latest/graph.php?r=1hr&z=xlarge&h=labstore3.pmtpa.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Labs+NFS+cluster+pmtpa
[23:23:31] (PS1) Andrew Bogott: Another tiny test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75529
[23:23:34] RECOVERY - Solr on solr1002 is OK: All OK
[23:24:33] RECOVERY - Solr on solr1001 is OK: All OK
[23:24:50] (CR) Andrew Bogott: [C: 2] "Lookin' good, champ!" [operations/puppet] - https://gerrit.wikimedia.org/r/75529 (owner: Andrew Bogott)
[23:24:51] (Merged) Andrew Bogott: Another tiny test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75529 (owner: Andrew Bogott)
[23:25:34] grrrrrrrrrr
[23:27:34] (PS1) Andrew Bogott: Probably not the last tiny test patch. [operations/puppet] - https://gerrit.wikimedia.org/r/75530
[23:29:42] (PS3) Cmcmahon: Enable VisualEditor for all users on test2wiki, experimental also [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140
[23:30:31] (CR) Catrope: [C: 1] Enable VisualEditor for all users on test2wiki, experimental also [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon)
[23:30:46] chrismcmahon: Looks good. I'll pick that one up when I deploy tomorrow morning
[23:31:11] thanks RoanKattouw it is making some UI tests fail right now
[23:31:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:32:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.139 second response time
[23:34:27] !log done messing with sockpuppet, stafford, puppet-merge, etc for now.
[23:34:27] (CR) Andrew Bogott: [C: 2] Probably not the last tiny test patch. [operations/puppet] - https://gerrit.wikimedia.org/r/75530 (owner: Andrew Bogott)
[23:34:28] (Merged) Andrew Bogott: Probably not the last tiny test patch. [operations/puppet] - https://gerrit.wikimedia.org/r/75530 (owner: Andrew Bogott)
[23:34:38] Logged the message, Master
[23:40:34] !log Rebuilding Solr index to catch up new schema
[23:40:45] Logged the message, Master
[23:49:31] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 203 seconds
[23:49:44] MaxSem: ori-l zinc is up and puppeted
[23:49:48] no idea where niklas is, though
[23:50:31] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 22 seconds
[23:50:31] in #-dev:P
[23:50:41] poking him...
[23:50:47] though too late
[23:51:08] cool
[23:51:32] MaxSem: can you help him from here?
if there's anything else I can do, I will do so gladly
[23:51:42] sure
[23:53:31] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 202 seconds
[23:55:31] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 22 seconds