[00:03:12] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 00:03:07 UTC 2013
[00:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[00:11:21] !log pooled new https node ssl1005.wikimedia.org
[00:18:08] !log pooled new https node ssl1006.wikimedia.org (with hyperthreading enabled)
[00:20:52] PROBLEM - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:21:06] hm
[00:21:15] that's probably not a good sign
[00:21:52] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:22:00] -_-
[00:22:42] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61459 bytes in 0.006 second response time
[00:23:42] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:24:18] hm. it's bound on lo on ssl1005/6
[00:24:42] RECOVERY - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61459 bytes in 0.013 second response time
[00:24:44] (CR) Ori.livneh: [C: 1] "Yay. Thanks!" [operations/puppet] - https://gerrit.wikimedia.org/r/75265 (owner: Pyoungmeister)
[00:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 00:24:46 UTC 2013
[00:25:04] and it's listening...
[00:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours
[00:25:42] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61459 bytes in 0.003 second response time
[00:25:52] PROBLEM - LVS HTTP IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:26:30] notpeter_, you can merge it, then I'll help Niklas tomorrow with migrating to that box
[00:26:42] RECOVERY - LVS HTTP IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61460 bytes in 0.013 second response time
[00:29:02] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:29:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 00:28:54 UTC 2013
[00:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours
[00:29:52] RECOVERY - LVS HTTP IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61460 bytes in 0.035 second response time
[00:30:42] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:31:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 00:31:01 UTC 2013
[00:31:19] ok. wtf.
[00:31:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[00:31:42] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61460 bytes in 0.013 second response time
[00:31:45] PROBLEM - LVS HTTP IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:31:52] PROBLEM - LVS HTTP IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:32:50] LeslieCarr: ^^
[00:32:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 00:32:42 UTC 2013
[00:33:01] LeslieCarr: would there be routing issues with the new ssl hosts and ipv6?
[00:33:19] !log depooling ssl1005/6 due to ipv6 issues
[00:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[00:33:32] RECOVERY - LVS HTTP IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 601 bytes in 0.002 second response time
[00:33:42] RECOVERY - LVS HTTP IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61460 bytes in 0.010 second response time
[00:37:42] * Elsie wonders when Ryan_Lane will notice that morebots isn't here.
[00:37:45] :-)
[00:37:49] ugh
[00:37:50] heh
[00:38:43] !log pooled new https node ssl1005.wikimedia.org
[00:38:48] !log pooled new https node ssl1006.wikimedia.org (with hyperthreading enabled)
[00:38:54] Logged the message, Master
[00:38:59] !log depooling ssl1005/6 due to ipv6 issues
[00:39:04] Logged the message, Master
[00:39:13] Logged the message, Master
[00:41:22] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[00:55:12] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 00:55:09 UTC 2013
[00:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours
[00:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 00:57:41 UTC 2013
[00:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[00:59:22] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 00:59:13 UTC 2013
[00:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours
[01:01:34] !log enabled hyperthreading on ssl1005
[01:01:52] Logged the message, Master
[01:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 01:02:37 UTC 2013
[01:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[01:12:02] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:12:32] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.40 ms
[01:15:02] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:15:32] RECOVERY - Host wikinews-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.59 ms
[01:15:42] ...
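[Editor's note] For the IPv6 timeouts above ([00:20]-[00:33]): a minimal, hedged sketch of how one might confirm that an LVS service address is bound and answering on a new SSL terminator. The hostname and address below are placeholders, not taken from the log; the ip/ss/curl options are standard.
    ip -6 addr show dev lo                  # is the service address configured on loopback?
    ss -6 -tln                              # is anything listening on :443 for that address?
    curl -g -6 -skI 'https://[2620:0:861:ed1a::1]/' -H 'Host: en.wikipedia.org'   # placeholder address; -k since the cert will not match an IP literal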
[01:25:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 01:24:56 UTC 2013 [01:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [01:29:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 01:28:55 UTC 2013 [01:29:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 01:28:55 UTC 2013 [01:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [01:29:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [01:33:22] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 01:33:14 UTC 2013 [01:34:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:22] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [01:47:22] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:22] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:22] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:22] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:23] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:23] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [01:52:22] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours [01:55:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 01:54:53 UTC 2013 [01:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [01:58:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 01:58:10 UTC 2013 [01:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [01:58:22] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours [01:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 01:58:46 UTC 2013 [01:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [02:02:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 02:02:43 UTC 2013 [02:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [02:03:22] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours [02:08:03] !log LocalisationUpdate completed (1.22wmf11) at Tue Jul 23 02:08:03 UTC 2013 [02:08:16] Logged the message, Master [02:12:36] !log LocalisationUpdate completed (1.22wmf10) at Tue Jul 23 02:12:36 UTC 2013 [02:12:46] Logged the message, Master [02:19:22] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [02:21:20] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 23 02:21:20 UTC 2013 [02:21:30] Logged the message, Master [02:25:12] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 02:25:11 UTC 2013 [02:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [02:29:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 02:28:56 UTC 2013 [02:29:22] 
RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 02:29:12 UTC 2013 [02:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [02:29:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [02:33:22] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 02:33:20 UTC 2013 [02:34:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [02:55:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 02:54:54 UTC 2013 [02:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [02:57:52] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 02:57:48 UTC 2013 [02:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [02:59:12] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 02:59:09 UTC 2013 [02:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [03:02:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 03:02:42 UTC 2013 [03:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [03:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 03:24:51 UTC 2013 [03:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [03:28:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 03:27:53 UTC 2013 [03:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [03:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 03:28:44 UTC 2013 [03:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [03:31:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:32:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [03:33:32] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 03:33:25 UTC 2013 [03:34:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [03:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 03:54:48 UTC 2013 [03:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [03:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 03:57:40 UTC 2013 [03:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [03:59:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 03:58:57 UTC 2013 [03:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [04:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 04:02:35 UTC 2013 [04:02:42] PROBLEM - Varnish HTTP mobile-frontend on cp1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [04:03:33] RECOVERY - Varnish HTTP mobile-frontend on cp1060 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.002 second response time [04:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue 
Jul 23 04:24:47 UTC 2013
[04:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours
[04:27:52] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 04:27:45 UTC 2013
[04:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[04:29:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 04:28:57 UTC 2013
[04:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours
[04:32:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 04:32:37 UTC 2013
[04:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[04:55:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Jul 23 04:54:52 UTC 2013
[04:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours
[04:58:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Jul 23 04:58:11 UTC 2013
[04:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[04:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Jul 23 04:58:47 UTC 2013
[04:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours
[05:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Jul 23 05:02:39 UTC 2013
[05:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[05:05:52] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:07:52] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[05:13:52] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:42] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[05:34:41] !log authdns-update for cr1-ulsfo <-> cr2-eqiad, dns-auth-lb -> nsN(-new), ssl100x fixed IPv6
[05:34:52] Logged the message, Master
[05:44:27] !log pooling ssl1005/1006 for https/ipv6
[05:44:37] Logged the message, Master
[05:44:46] Ryan_Lane/LeslieCarr: ^^^
[05:44:55] oh cool
[05:45:09] i'll keep an eye out for a little bit
[05:45:11] and this is just fyi, don't you dare deal with it if it goes south
[05:45:27] 11pm and everything
[05:46:25] :)
[05:46:37] haha
[05:47:24] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=SSL+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[05:47:28] cluster looks a lot happier
[05:47:59] and even with weight 30/50, the old ones use double the CPU as the new ones
[05:48:00] paravoid: yep and the weight on the new two is 50, rather than 30 as well
[05:48:08] yeah I saw
[05:48:11] and HT helped by nearly 2x
[05:48:13] wow
[05:48:18] yep
[05:48:28] I was expecting it to be much better, I didn't expect 2x
[05:48:37] same
[05:48:43] so we should probably enable it on the old ones too
[05:48:53] it's already enabled on them
[05:49:01] look at /proc/cpuinfo
[05:49:06] ah
[05:49:13] ok, so then I should adjust weights
[05:49:33] 28% vs. 13%
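[Editor's note] "look at /proc/cpuinfo" ([05:49:01]): a small sketch of how hyperthreading can be confirmed from userspace. Nothing here is specific to the ssl100x hosts.
    grep -c ^processor /proc/cpuinfo                             # logical CPUs the kernel sees
    lscpu | egrep 'Thread\(s\) per core|Core\(s\) per socket'    # more than 1 thread per core means HT/SMT is on
    awk -F: '/siblings|cpu cores/ {print $1 $2}' /proc/cpuinfo | sort -u    # siblings > cpu cores also indicates HT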
[05:52:28] ok, sh may be crap
[05:52:39] weight 30/50 and yet they get the exact same bandwidth
[05:53:10] hmm, http://tracker.ceph.com/issues/5675 was fixed
[05:53:18] yes
[05:53:43] then we had http://tracker.ceph.com/issues/5691
[05:53:55] "
[05:53:55] All bucket ACLs lost after upgrade
[05:54:14] and I currently run the wip-5691 branch on ms-fe1002 iirc
[05:54:42] waiting for 0.67-rc...
[05:55:07] I wish their commits mentioned had links
[05:55:21] what do you mean?
[05:55:29] oh in the comments?
[05:55:36] there's an infobox on the right that has a link
[05:56:03] nice, I was already looking at my local checkout
[05:56:28] so the current plans are
[05:56:34] sounds like b/c fail
[05:56:38] install 0.67-rc when it gets released
[05:56:52] and carefully work towards re-enabling it in production
[05:57:01] at the same time, build a new swift cluster over some spare boxes in eqiad
[05:57:10] to have as a backup for when tampa gets decom'ed
[05:57:24] or when ceph fails again, whichever comes first :)
[05:57:30] when is tampa getting decom'd?
[05:58:04] end of the calendar year afaik
[05:58:42] do we have any spare machines with public ips?
[05:58:47] what do you mean?
[05:59:11] a server that isn't doing much that i could use
[05:59:29] um, i guess we can give you a decom'ed server ?
[05:59:31] probably
[05:59:41] RobH would know
[05:59:45] what do you need it for?
[06:00:01] ori-l: this sounds like ticket time
[06:00:17] yeah, probably, was just feeling out the territory :P
[06:00:28] it was really productive to have vanadium to try out different things with production data, but now it's stable and puppetized and i can't really mess with it
[06:01:28] * Aaron|home looks at http://tracker.ceph.com/rb/master_backlog/ceph
[06:01:53] i have a few things i'd like to try; i'll write them up in a ticket
[06:02:14] Aaron|home: their bug tracker looks gorgeous, btw
[06:03:12] ori-l: I think I'd reserve that word for things like http://www.doctormacro.com/Images/Taylor,%20Elizabeth/Annex/Annex%20-%20Taylor,%20Elizabeth_02.jpg
[06:03:20] hahahaha
[06:03:21] but yes it is fairly pretty
[06:03:32] paravoid: sh should increase the number of active connections going to the higher weights
[06:03:42] weird that the bandwidth is the same
[06:03:50] I lowered 30 to 25, no difference
[06:04:06] increasing ssl1005/1006 to 75
[06:05:18] I like how you can really visualize what is being worked on
[06:05:25] * Aaron|home missed that with PivotalTracker
[06:05:31] all we have is suck bugzilla
[06:06:05] and mingle :)
[06:06:09] and rt
[06:06:33] and Trello
[06:06:38] right
[06:06:52] yeah, that is indeed lame
[06:07:04] Ryan_Lane: 25 vs. 75 and no difference
[06:07:08] :(
[06:07:09] * paravoid officially declares sh as crap
[06:07:19] yeah. sh is definitely crap
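[Editor's note] On the weight experiment above ([06:03]-[06:07]), a hedged ipvsadm sketch of how such weights are typically inspected and changed; the VIP and real-server addresses are placeholders. With the source-hash (sh) scheduler clients are pinned to a bucket by source IP, which is consistent with the observation here that weight changes did not move traffic.
    ipvsadm -L -n --stats                               # per real server connection/byte counters
    ipvsadm -L -n -t 208.80.154.224:443                 # scheduler and current weights for one virtual service (placeholder VIP)
    ipvsadm -e -t 208.80.154.224:443 -r 10.64.0.35:443 -w 75    # raise one real server's weight (placeholder RIP)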
[06:07:20]
[06:08:35] lol
[06:09:01] (PS1) Lcarr: fixing wikidata-lb monitoring ip address [operations/puppet] - https://gerrit.wikimedia.org/r/75281
[06:09:03] * Aaron|home pictures a literal stamp of disapproval that says "crap"
[06:09:14] yes
[06:09:21] I should have that made
[06:09:40] huh I thought http://tracker.ceph.com/issues/3188 was older than it was
[06:11:40] paravoid: I bet when that's fixed some downtime bug regression will pop up :)
[06:12:39] (CR) Lcarr: [C: 2] fixing wikidata-lb monitoring ip address [operations/puppet] - https://gerrit.wikimedia.org/r/75281 (owner: Lcarr)
[06:12:40] (Merged) Lcarr: fixing wikidata-lb monitoring ip address [operations/puppet] - https://gerrit.wikimedia.org/r/75281 (owner: Lcarr)
[06:14:33] https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=ssl1004.wikimedia.org&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=SSL+cluster+eqiad
[06:14:36] wow
[06:14:39] pegged at 100%
[06:14:59] well, 95%, but still
[06:15:40] hopefully ulsfo would help with this
[06:15:56] but we really should do something about esams too, especially if we're about to turn more traffic to it via esams
[06:16:04] er, via DNS
[06:21:27] (PS1) Ori.livneh: Tweak 'collect_every' and 'name_match' in EL's Ganglia module [operations/puppet] - https://gerrit.wikimedia.org/r/75284
[06:22:52] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:23:43] "Although I don't have a use-case at this specific point it would be very cool to have node.js bindings."
[06:23:44] hehe
[06:23:51] haha
[06:24:03] http://tracker.ceph.com/issues/4230
[06:24:32] did you hear that ori-l?
[06:24:42] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[06:24:58] "#4230: Replace existing bug tracker with photo of Elizabeth Taylor"
[06:25:15] clearly you mean Jeri Ryan
[06:26:00] Aaron|home: are you pegging me as the no-use-case guy or the node.js guy? :P
[06:26:27] ori-l is semi-secretly the SNOBOL guy
[06:26:32] shhh.
[06:26:49] paravoid: I think the ssl spike may be related to google's increasing indexing of ssl
[06:27:08] ok, i'm heading out again
[06:27:10] I'm pretty sure the rel=canonical change never went in
[06:27:20] bye
[06:27:28] which is what we need to ensure google lists http, rather than https for anon
[06:27:34] ohh, more VE comments on wikitech
[06:27:42] * Aaron|home pulls out the popcorn
[06:27:44] while i'm gone, check out this - http://cuteoverload.com/2013/07/22/care-and-care-alike/
[06:27:51] it will aid you in all the ssl
[06:28:08] awwwww
[06:29:25] that is pretty adorable
[06:30:22] * YuviPanda is terrified of clicking
[06:32:11] it's meh
[06:36:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:37:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[06:39:22] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[06:40:12] RECOVERY - Puppet freshness on neon is OK: puppet ran at Tue Jul 23 06:40:02 UTC 2013
[06:41:44] RECOVERY - Host lanthanum is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[06:46:03] Ryan_Lane: soo, how you like the facebook UI changes?
[06:46:28] facebook ui changes?
[06:46:37] there are changes?
[06:46:42] maybe I haven't gotten them [06:46:57] the top bar changed [06:47:02] ooohhh [06:47:04] it did [06:47:09] the search dropdown looks kind of mobile-ish [06:47:26] yeah [06:47:32] it's fine [06:47:33] I feel like the top-right links a bit less readable...low contrast [06:47:34] doesn't bother me [06:47:39] otherwise OK [06:47:43] yep [06:47:49] yeah, slightly less readable [06:47:55] * YuviPanda pats his stayfocusd [06:49:10] Ryan_Lane: they didn't get community consensus though :) [/troll] [06:49:42] hahaha [06:49:51] one of the nice things about being in ops [06:50:14] I rarely need to deal with community consensus [06:50:43] I actually like how FB does this, I enjoy the surprise [06:51:11] I've never been a giant hater of change [06:51:29] just an occasional small hater ;) [06:52:03] Ryan_Lane: http://perennialreflection.files.wordpress.com/2011/05/haterade.png [06:52:34] looks like quite a refreshing beverage [06:52:55] It's rarely about change, per se. It's about whether something is an improvement. [06:53:35] I don't drink those kind of drinks anymore though...too much Powerade in college [06:54:06] damn you ubiquitous vending machines [06:54:41] Elsie: lots of people think no change is an improvement [06:55:05] and those people tend to be way more vocal than the people who are fine with change [06:55:25] not that I have any strong opinion on VE in particular [06:55:54] I do find myself using "edit source" a lot when I run into oddities, and haven't done many edits lately [06:56:07] I don't edit much ;) [06:56:16] some degree of frustration could thus be understood, heh [06:56:38] that said, the last 20 or so edits I did used VE [06:56:47] thinking of that, I need to update wikitech [06:56:52] maybe I'll do that tomorrow [06:57:01] I always try to use it first [06:58:36] I wouldn't say lots. [06:59:33] we most not be editing the same stuff or I got unlucky [06:59:37] who knows [07:04:11] * Aaron|home reads http://www.businessinsider.com/the-worst-part-about-working-at-google-2013-4 comments [07:58:38] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:00:40] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:01:20] (PS1) Mark Bergsma: Disable backend request queuing on the frontend layer [operations/puppet] - https://gerrit.wikimedia.org/r/75296 [08:02:05] hello [08:02:13] I got a new server for jenkins \O/ [08:02:49] (PS1) Mark Bergsma: Disable the hit_for_pass action on ttl <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75297 [08:03:02] (CR) Mark Bergsma: [C: 2] Disable backend request queuing on the frontend layer [operations/puppet] - https://gerrit.wikimedia.org/r/75296 (owner: Mark Bergsma) [08:03:02] hi mark :) I got yet another varnish change for you. 
Related to xff trusted sources this time https://gerrit.wikimedia.org/r/#/c/75085/
[08:03:03] (Merged) Mark Bergsma: Disable backend request queuing on the frontend layer [operations/puppet] - https://gerrit.wikimedia.org/r/75296 (owner: Mark Bergsma)
[08:03:31] i'll have a look in a bit
[08:03:44] (CR) Mark Bergsma: [C: 2] Disable the hit_for_pass action on ttl <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75297 (owner: Mark Bergsma)
[08:03:45] (Merged) Mark Bergsma: Disable the hit_for_pass action on ttl <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75297 (owner: Mark Bergsma)
[08:04:20] take your time :)
[08:05:02] don't say that
[08:06:20] ;-D
[08:07:13] to get that change I have actually read the varnishncsa and varnishlog man pages, they are powerful tools
[08:07:26] took me a while to figure out the X-Forwarded-For proto thing though
[08:07:38] RECOVERY - search indices - check lucene status page on search1001 is OK: HTTP OK: HTTP/1.1 200 OK - 213 bytes in 0.004 second response time
[08:13:16] (PS1) Mark Bergsma: Set req.hash_ignore_busy in vcl_recv instead [operations/puppet] - https://gerrit.wikimedia.org/r/75298
[08:14:01] (CR) Mark Bergsma: [C: 2] Set req.hash_ignore_busy in vcl_recv instead [operations/puppet] - https://gerrit.wikimedia.org/r/75298 (owner: Mark Bergsma)
[08:14:02] (Merged) Mark Bergsma: Set req.hash_ignore_busy in vcl_recv instead [operations/puppet] - https://gerrit.wikimedia.org/r/75298 (owner: Mark Bergsma)
[08:16:08] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours
[08:35:54] (PS1) Physikerwelt: Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301
[08:39:21] (CR) Physikerwelt: "can someone have a look if I have set up the repository correctly?" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[08:50:40] (CR) Hashar: "Andrew, Alexandros and Faidon should be able to help review the debian package. I have added them as reviewers." [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[08:51:58] (PS1) Mark Bergsma: Append the Varnish default internal vcl_recv [operations/puppet] - https://gerrit.wikimedia.org/r/75303
[08:52:15] (PS1) ArielGlenn: more dumps written to dataset1001, update rsync script, also fix comments [operations/puppet] - https://gerrit.wikimedia.org/r/75304
[08:53:05] MaxSem: good morning
[08:53:20] check this horror: https://gerrit.wikimedia.org/r/#/c/75303/1
[08:53:36] (CR) ArielGlenn: [C: 2] more dumps written to dataset1001, update rsync script, also fix comments [operations/puppet] - https://gerrit.wikimedia.org/r/75304 (owner: ArielGlenn)
[08:53:37] (Merged) ArielGlenn: more dumps written to dataset1001, update rsync script, also fix comments [operations/puppet] - https://gerrit.wikimedia.org/r/75304 (owner: ArielGlenn)
[08:54:15] mark: I am still wondering how you manage to spot all those caching bugs :)
[08:54:37] just because I'm actively looking at it and also cleaning up VCL
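[Editor's note] Re hashar's remark at [08:07:13] about varnishncsa/varnishlog: a hedged example of watching the X-Forwarded-For handling on a Varnish 3 cache (the tag name is Varnish 3's; the format string is illustrative).
    varnishlog -c -m 'RxHeader:X-Forwarded-For'          # client-side transactions carrying an XFF request header
    varnishncsa -F '%h %{X-Forwarded-For}i "%r" %s'      # compare the connecting IP with the XFF chain per request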
[08:54:53] (CR) Physikerwelt: "Thanks. Up to now there is no debian package this is just a setup of the svn-repository clone and git review." [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[08:56:39] i did know there were some default VCL functions being run where they shouldn't
[08:56:47] but I didn't see this one
[08:56:54] (CR) AzaToth: [C: 1] Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[08:57:18] (CR) AzaToth: "I see nothing wrong with the .gitreview file." [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[08:57:40] (CR) Mark Bergsma: [C: 2] varnish: backends trust 127.0.0.1 for XFF [operations/puppet] - https://gerrit.wikimedia.org/r/75085 (owner: Hashar)
[08:57:41] (Merged) Mark Bergsma: varnish: backends trust 127.0.0.1 for XFF [operations/puppet] - https://gerrit.wikimedia.org/r/75085 (owner: Hashar)
[08:57:48] awesome
[08:59:19] hashar: still wonders why you made happy smiley when you told me there wouldn't be any attempts before mid september
[08:59:33] AzaToth: ohhh
[09:00:07] AzaToth: I put smileys at the end of most of my chat lines. They are usually irrelevant
[09:00:08] sorry
[09:00:47] hashar: that was the first time I saw you use one...
[09:00:49] mark: also I got an issue on beta where puppet tries to downgrade varnish* wm14 to wm13 but can't find the packages. Trace is in https://rt.wikimedia.org/Ticket/Display.html?id=5489
[09:01:17] AzaToth: sorry
[09:01:42] AzaToth: anyway the issue is that building packages usually needs root and/or installing package dependencies (the build-depends: field in debian packages)
[09:01:58] AzaToth: so we can't have them run on the Jenkins production slaves since they do not have sudo / su rights.
[09:02:29] hashar: erm, they build in chroots
[09:02:29] AzaToth: one solution would be to build them in a varnish instance, but I have not looked at it yet. The second solution is to use a labs instance which let us setup some sudo rights
[09:03:50] some sudo acc would be needed though, but no full ac to install packages ヾ
[09:04:50] hashar: but I would assume it's a 3 min job to fire up a labs instance and use it ツ
[09:05:49] 1. create instance; 2. setup jenkins user; 3. give jenkins user ALL sudo access; 4. done
[09:05:59] yeah that is what I did on the labs instance
[09:06:13] then to build the package on patchset submission, we need the Jenkins slave to fetch the patchset from the Zuul git repository
[09:06:21] which is only locally available right now
[09:06:34] local where?
[09:06:34] the idea is to have it published over http so jenkins slaves can fetch from it
[09:06:41] on gallium the contint server
[09:06:48] which is in prod
[09:07:05] (PS1) ArielGlenn: re-enable rsync dumps cron job between dataset hosts [operations/puppet] - https://gerrit.wikimedia.org/r/75305
[09:07:24] does zuul use a jenkins plugin or is it fully external?
[09:07:44] (CR) ArielGlenn: [C: 2] re-enable rsync dumps cron job between dataset hosts [operations/puppet] - https://gerrit.wikimedia.org/r/75305 (owner: ArielGlenn)
[09:07:45] (Merged) ArielGlenn: re-enable rsync dumps cron job between dataset hosts [operations/puppet] - https://gerrit.wikimedia.org/r/75305 (owner: ArielGlenn)
[09:08:12] AzaToth: it is an independent python daemon. It listens for Gerrit events via ssh 'stream-events' which is a json stream of events happening in Gerrit
[09:08:48] AzaToth: whenever a patch is submitted against a configured repository, Zuul updates the branch, attempts to merge the patchset on top of the branch and triggers a Jenkins job using the resulting commit
[09:08:59] AzaToth: Jenkins then fetches the change from the Zuul git repository and runs the tasks
[09:09:05] ok
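[Editor's note] A hedged illustration of the Gerrit event stream described above ([09:08:12]): Zuul consumes 'gerrit stream-events' over Gerrit's SSH port. The account name is a placeholder; the command itself is standard Gerrit.
    ssh -p 29418 some-bot@gerrit.wikimedia.org gerrit stream-events | head -1 | python -m json.tool    # one JSON event, pretty-printed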
[09:10:46] hashar: I would assume the slaves could clone the git from gallium via ssh
[09:11:43] I will use http
[09:11:55] and have the slave use a gerrit replication of the git repos as a reference
[09:12:44] ok
[09:13:18] but you don't have time to do this before mid september?
[09:14:18] I will try :)
[09:14:34] I am on vacations in 7 days and still have a bunch of things to handle
[09:14:38] better use of smiley :-P
[09:14:45] hehe
[09:15:14] in the end sorry for not having that in place in a timely manner, that is mostly a matter of priority on my side
[09:15:16] vacation... orly
[09:15:35] and unfortunately debian packaging via jenkins is not that urgent despite your investment
[09:15:38] you will just sit in front of a computer coding :-P
[09:15:46] naaa
[09:16:01] no probs
[09:16:07] I got a family so will spend most of my time disconnected hehe
[09:22:18] (PS9) Hashar: lanthanum as a jenkins slave [operations/puppet] - https://gerrit.wikimedia.org/r/64601
[09:23:34] (CR) Hashar: "PS9:" [operations/puppet] - https://gerrit.wikimedia.org/r/64601 (owner: Hashar)
[09:27:21] mark, duh
[09:27:23] (PS2) Hashar: Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[09:28:02] (CR) Hashar: [C: 1 V: 2] "Added a newline at end of file" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[09:30:42] MaxSem: ?
[09:31:39] response to that caching fun:)
[09:31:58] so the XFF handling part in that is needed
[09:32:00] and it's kinda broken
[09:32:10] XFF is done in 3 places now, and not working together very well
[09:32:16] but at least that cookie pass thing we can take out
[09:34:42] (CR) Mark Bergsma: [C: 2] Append the Varnish default internal vcl_recv [operations/puppet] - https://gerrit.wikimedia.org/r/75303 (owner: Mark Bergsma)
[09:34:43] (Merged) Mark Bergsma: Append the Varnish default internal vcl_recv [operations/puppet] - https://gerrit.wikimedia.org/r/75303 (owner: Mark Bergsma)
[09:36:35] mark, I've been working on XVO yesterday, will probably finish today. this includes getting rid of one of cookies
[09:36:51] cool
[09:37:10] (CR) QChris: "It's not really enforced or required, but we typically [1]" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[09:40:07] does MF use the XFF header for anything?
[09:42:01] (PS1) Mark Bergsma: Remove unnecessary sections of the default vcl_recv code [operations/puppet] - https://gerrit.wikimedia.org/r/75312
[09:42:09] Heja, this is the daily request to restart git to those who are awake (ref. https://bugzilla.wikimedia.org/show_bug.cgi?id=51769#c9 )
[09:45:41] ahh
[09:45:45] that needs a stack trace
[09:45:49] (PS2) Mark Bergsma: Remove unnecessary sections of the default vcl_recv code [operations/puppet] - https://gerrit.wikimedia.org/r/75312
[09:46:12] qchris: hi! do you have access on the git.wikimedia.org box to take a stack trace ?
[09:46:17] qchris: it is dead again hehe
[09:46:24] hashar: Nope. I haven't.
[09:46:35] time to restart it via cronjob?:P
[09:46:36] hashar: This this dies 1000 deaths lately :-(
[09:46:53] MaxSem: That sounds like a really good idea :-)
[09:47:00] is a trace taken with: jstack -F
[09:47:37] hashar: That should work. yes.
[09:48:15] mark: would you mind taking some java traces for us on antimony.wikimedia.org . It hosts git.wikimedia.org which is currently dead
[09:48:41] mark: it is a java process running the gitblit.jar , a stack trace can be taken using: jstack -F
[09:48:54] ok
[09:49:08] root 29861 178 25.7 7761404 4211624 ? Sl Jul22 1058:17 java -jar gitblit.jar
[09:49:09] then you can kill it and restart it with: cd /var/lib/gitblit, java -jar gitblit.jar &
[09:49:10] this pid I assume?
[09:49:14] yup
[09:49:32] why does that run as root...
[09:49:38] yeah that is lame
[09:49:51] maybe there is a gitblit user?
[09:50:10] and why is there no init script?
[09:50:24] file all the bugs
[09:50:26] about it
[09:50:27] I guess upstream does not provide any
[09:50:48] oh that's ok then
[09:51:06] http://p.defau.lt/?wN1DAoBu1MZdReP0hiQhEw
[09:51:50] lol, jstack choked on it
[09:51:51] qchris: would that stack trace be enough ?
[09:52:07] hashar: :-D
[09:52:18] Caused by: sun.jvm.hotspot.runtime.VMVersionMismatchException: Supported versions are 23.7-b01. Target VM is 20.0-b12
[09:52:20] arhghggg
[09:52:26] hashar: Well that should allow to get a real stack trace next time.
[09:52:34] how?
[09:52:45] Supported versions are 23.7-b01. Target VM is 20.0-b12
[09:53:01] By making those numbers match.
[09:53:04] seems there is a mismatch between jstack version and the java that got used
[09:54:43] /usr/lib/jvm/java-6-openjdk-amd64/bin/jstack versus /usr/lib/jvm/java-7-openjdk-amd64/bin/jstack
[09:54:44] maybe
[09:55:21] i had to remove the temp dir too
[09:55:27] Maybe. But when selecting the java implementation to use, shouldn't that switch to the correct directory?
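[Editor's note] A hedged aside on the jstack/JVM mismatch being diagnosed here ([09:52]-[09:58]): jstack has to come from the same JDK generation as the target VM. A generic way to check, assuming the matching JDK (not just the JRE) is installed, which, per the log below, it was not on antimony:
    readlink -f /proc/29861/exe                     # which java binary the gitblit process is actually running
    update-alternatives --display jstack            # which JDK the jstack symlink currently points at
    /usr/lib/jvm/java-6-openjdk-amd64/bin/jstack -F 29861    # use the jstack shipped with that same JDK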
[09:55:44] mark: could you also check the java / jstack alternative versions with:
[09:55:44] ls -1l /etc/alternatives/java /etc/alternatives/jstack
[09:56:13] oh, those are separate for ubuntu :-D
[09:56:23] lrwxrwxrwx 1 root root 46 Jun 25 17:41 /etc/alternatives/java -> /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java
[09:56:23] lrwxrwxrwx 1 root root 44 Jun 25 17:47 /etc/alternatives/jstack -> /usr/lib/jvm/java-7-openjdk-amd64/bin/jstack
[09:56:30] yeah that is it
[09:56:40] i'll update jstack to java-6 then
[09:56:43] so Gitblit was started with jdk 6
[09:56:51] yeah that would be nice
[09:57:33] if you haven't killed the process yet you can then attempt a jstack again
[09:57:48] except the jdk6 jstack doesn't exist
[09:57:52] i have killed the process
[09:57:54] oh my god
[09:58:11] * mark thinks that whoever did this shitty install can support it themselves
[09:58:30] filing bugs
[09:58:41] thank you
[09:59:16] (CR) Mark Bergsma: [C: 2] Remove unnecessary sections of the default vcl_recv code [operations/puppet] - https://gerrit.wikimedia.org/r/75312 (owner: Mark Bergsma)
[09:59:16] (Merged) Mark Bergsma: Remove unnecessary sections of the default vcl_recv code [operations/puppet] - https://gerrit.wikimedia.org/r/75312 (owner: Mark Bergsma)
[10:02:07] (PS1) Mark Bergsma: Remove install of (no longer existing) varnish 3.0.3plus~rc1-wm13 package [operations/puppet] - https://gerrit.wikimedia.org/r/75313
[10:03:14] (PS2) Mark Bergsma: Remove install of (no longer existing) varnish 3.0.3plus~rc1-wm13 package [operations/puppet] - https://gerrit.wikimedia.org/r/75313
[10:03:38] (CR) Mark Bergsma: [C: 2] Remove install of (no longer existing) varnish 3.0.3plus~rc1-wm13 package [operations/puppet] - https://gerrit.wikimedia.org/r/75313 (owner: Mark Bergsma)
[10:04:04] (Merged) Mark Bergsma: Remove install of (no longer existing) varnish 3.0.3plus~rc1-wm13 package [operations/puppet] - https://gerrit.wikimedia.org/r/75313 (owner: Mark Bergsma)
[10:04:16] Jstack issue filed as https://bugzilla.wikimedia.org/show_bug.cgi?id=51859
[10:09:19] and init/upstart script request is https://bugzilla.wikimedia.org/show_bug.cgi?id=51861
[10:09:39] (PS3) QChris: Upon first publicly visible patch set in gerrit, update bug status [operations/puppet] - https://gerrit.wikimedia.org/r/69843
[10:19:59] (PS1) Mark Bergsma: Mobile Cookie Vary caching optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/75316
[10:22:22] (PS2) Mark Bergsma: Mobile Cookie Vary caching optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/75316
[10:25:02] poor vhtcpd doesn't start on labs :D
[10:26:08] (PS3) Mark Bergsma: Mobile Cookie Vary caching optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/75316
[10:32:37] (PS1) Mark Bergsma: Factor out XFF append handling into a common function [operations/puppet] - https://gerrit.wikimedia.org/r/75317
[10:34:36] (CR) Mark Bergsma: [C: 2] Factor out XFF append handling into a common function [operations/puppet] - https://gerrit.wikimedia.org/r/75317 (owner: Mark Bergsma)
[10:34:37] (Merged) Mark Bergsma: Factor out XFF append handling into a common function [operations/puppet] - https://gerrit.wikimedia.org/r/75317 (owner: Mark Bergsma)
[10:41:01] mark: is vhtcpd a custom Wikimedia utility
[10:41:12] I am looking up its configuration options :)
[10:41:29] DAEMON_OPTS="-F -m 239.128.0.112 -c 127.0.0.1:80 -c 127.0.0.1:3128" does not seem to play nice for beta since we do not use multicast
[10:42:38] (PS1) Mark Bergsma: Add new SSL proxies ssl1005 and ssl1006 to the XFF list [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75318
[10:42:40] * 1819f9b - vhtcpd - basically complete (9 weeks ago)
[10:42:42] ah indeed
[10:42:44] (CR) jenkins-bot: [V: -1] Add new SSL proxies ssl1005 and ssl1006 to the XFF list [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75318 (owner: Mark Bergsma)
[10:42:47] hashar: yes
[10:43:46] (PS2) Mark Bergsma: Add new SSL proxies ssl1005 and ssl1006 to the XFF list [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75318
[10:44:47] (CR) Mark Bergsma: [C: 2] Add new SSL proxies ssl1005 and ssl1006 to the XFF list [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75318 (owner: Mark Bergsma)
[10:44:48] (Merged) Mark Bergsma: Add new SSL proxies ssl1005 and ssl1006 to the XFF list [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75318 (owner: Mark Bergsma)
[10:46:45] !log mark synchronized wmf-config/squid.php 'Add ssl1005/ssl1006'
[10:46:55] will look that up after lunch
[10:46:56] Logged the message, Master
[10:46:58] bbl
[10:50:42] (PS4) Mark Bergsma: Mobile Cookie Vary caching optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/75316
[10:57:23] oh duh, ssl1005/1006
[10:58:04] time to put nginx on the varnish boxes
[10:58:45] yeah, if only we had a distributed ssl session cache
[10:59:00] or stud
[10:59:05] stud doesn't do XFF
[10:59:12] it does PROXY
[10:59:20] that's what I like about it
[11:00:17] I like the distributed session cache over udp multicast :)
[11:00:32] have you looked at all at how difficult would PROXY be for varnish?
[11:00:44] no
[11:00:57] why would it be difficult
[11:01:01] clearly replacing the client ip works ;-)
[11:01:50] because it's not http
[11:02:19] it's like the first few bytes sent on the connection right?
[11:04:16] afaik yes
[11:08:28] would we prefer stud over nginx though
[11:15:52] there's a patch for nginx too
[11:21:42] (CR) AzaToth: [C: -1] "I realize now that latexml already exists in debian, so making a new package is not practical use of someones time." [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[11:22:51] what I tried to say is that we should build it on top of the already existing latexml package instead of reinventing the wheel
[11:22:56] (CR) Aklapper: [C: 1] Upon first publicly visible patch set in gerrit, update bug status [operations/puppet] - https://gerrit.wikimedia.org/r/69843 (owner: QChris)
[11:23:57] Me wonders whether someone from the ops team might have a second to have a look at https://gerrit.wikimedia.org/r/#/c/69843/3
[11:24:20] if there even is any changes
[11:25:07] PROBLEM - search indices - check lucene status page on search1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 723 bytes in 0.004 second response time
[11:25:12] I second qchris - would be lovely to get that two-liner deployed so we can test if that enhancement works as expected
[11:27:04] And who is on RT duty this week?
[11:27:08] or is it not latexml he's talking about?
[11:27:51] andre__: the one who first asks
[11:28:48] AzaToth: So andre__ is on RT duty :-) ... wait ... :-(
[11:31:37] hehe
[11:32:18] Is there any place where we can lookup who's on RT duty if the channel topic is outdated?
[11:33:56] qchris: officewiki, ops meeting minutes
[11:34:09] paravoid: Thanks.
[11:38:36] qchris: just look who is running away screaming
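[Editor's note] For the stud/PROXY exchange above ([10:59]-[11:04]): yes, the PROXY protocol (v1) is a single plain-text line sent before the application data. A hedged sketch; the backend name is a placeholder and the backend must be configured to expect the header.
    { printf 'PROXY TCP4 203.0.113.9 10.64.0.1 51501 80\r\n'
      printf 'GET / HTTP/1.1\r\nHost: example.org\r\nConnection: close\r\n\r\n'
    } | nc backend.example.org 80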
[11:39:25] p858snake|l: :-) No one running. No one screaming. And officewiki does not allow me to log in :-/
[11:40:06] Only now I see the new channel topic... :-)
[11:48:05] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:05] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:05] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:06] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:06] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[11:48:41] bah I got two varnish htcpd packages :(
[11:49:16] /usr/local/bin/varnishhtcpd with an upstart job and /usr/sbin/vhtcpd from the debian package :)
[11:52:06] paravoid: May I trick you into taking a look and deploying a two-liner for Chris and me? https://gerrit.wikimedia.org/r/#/c/69843/3
[11:53:05] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours
[11:53:16] I have no idea what that does
[11:53:34] (PS1) Hashar: get rid of varnishhtcpd upstart job [operations/puppet] - https://gerrit.wikimedia.org/r/75323
[11:54:13] (CR) Physikerwelt: "Yes I know that there is a package... of 2009 or something like that. LaTeXML has improved a lot in that period." [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt)
[11:54:23] paravoid: That's no problem. Just +2 :-)
[11:54:45] paravoid: No, seriously.
[11:55:01] paravoid: We just set the bug's status to PATCH_TO_REVIEW
[11:55:19] ...and that status now exists in Bugzilla.
[11:55:24] paravoid: Which is the new way to mark that a change in gerrit is associated to the bug
[11:55:30] andre__: woot
[11:55:31] yay
[11:55:46] (CR) Faidon: [C: 2] Upon first publicly visible patch set in gerrit, update bug status [operations/puppet] - https://gerrit.wikimedia.org/r/69843 (owner: QChris)
[11:55:47] (Merged) Faidon: Upon first publicly visible patch set in gerrit, update bug status [operations/puppet] - https://gerrit.wikimedia.org/r/69843 (owner: QChris)
[11:55:59] andre__: does it come before or after ASSIGNED in the workflow? (after, i assume?)
[11:56:20] It's merged! Thanks Faidon.
[11:56:30] Yay! Thanks!
[11:56:58] MatmaRex: in a perfect world should be after ASSIGNED, but I don't want to impose such expectations via the bug workflow yet.
[11:57:37] yeah
[11:57:47] are you planning to migrate the keyword?
[11:58:04] MatmaRex, yeah, but I'm still thinking about the best way to do this.
[11:58:45] ** Remove "patch-in-gerrit" keyword from closed tickets
[11:58:46] ** Retriage and set PATCH_TO_REVIEW status for tickets with patch-in-gerrit keyword
[11:58:46] ** Query for tickets with gerrit link but without patch-in-gerrit keyword and set status
[11:58:47] etc.
[11:59:05] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours
[12:04:05] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours
[12:05:35] (CR) AzaToth: [C: 1] "I see, then I assume it's as good to just ignore the current debian package."
[operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt) [12:06:47] to fill a feature request against vhtcpd (namely adding support for unicast) should I fill a bug in RT ? ;) [12:07:26] hashar: over my dead body :-P [12:08:23] jokes aside; I think RT should only handle stuff that must be kept out of the prying eyes of the public [12:09:21] if RT is the main tool that ops use, then it's where the task for ops should be. The usual meta-discussion is already in https://bugzilla.wikimedia.org/show_bug.cgi?id=30413 :) [12:09:53] yeah going to fill a RT [12:11:39] andre__: sadly the ops have no incensitive to make it more transparent :( [12:20:08] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [12:22:40] AzaToth: because ops have to handle really nasty stuff which is safer in RT [12:22:51] AzaToth: and they interact with lot of third parties over email. [12:23:08] hashar: yes [12:23:16] so they use RT for good reasons :-] [12:23:27] hashar: I assume suport for unicast is really nasty stuff :-P [12:23:36] I dont know [12:24:51] I could use a root to verify some disk partition layout on lanthanum.eqiad.wmnet . It got installed last week with a SSD but I have no idea what the /dev/ is . I need the information to complete the puppet change that will make it a jenkins slave :-] [12:24:58] (related change https://gerrit.wikimedia.org/r/#/c/64601/ ) [12:27:25] AzaToth: I'm aware of at least one workflow in RT that would not work in Bugzilla (excluding the original reporter of a ticket from bugmail notifications). [12:27:55] andre__: oh [12:28:03] For other safety-related aspects my understanding so far is that things can be worked out by using group access restrictions for specific Bugzilla products [12:28:16] but I would need to understand the RT usage way more before I could really "judge" that [12:29:14] could all switch to redmine and http://redminecrm.com/pages/main [12:29:41] andre__: I've maintained a RT for a while ago, not totally though [12:30:13] I know RT is a bit intimidating to grasp [12:31:16] I think somebody in WMF was even playing with Redmine. Now if I could only remember names :) [12:31:36] I can't remember names either [12:31:50] especially when people like Elsie changes names all the time [12:32:44] andre__: guillaume paumier looked at red mine a few years ago [12:33:13] Yes, Priyanka and I looked at it a few years back [12:33:13] andre__: https://guillaumepaumier.com/2010/03/05/scaling-up-software-development-for-wikimedia-websites-tools/ [12:33:16] there is, I think, another Wikimedia organization that uses Redmine [12:33:28] I wouldn't necessarily recommend it right now, but it made sense at the time. [12:34:28] would be nice to be able to combine tickets and bugs in the same system [12:35:00] That would be nice indeed. [12:35:12] http://lists.wikimedia.org/pipermail/wikilovesmonuments/2011-December/002225.html [12:35:39] AzaToth: oh no, redmine sucks [12:35:52] MatmaRex: orly [12:35:58] yarly [12:36:04] it's interface is worse than bugzilla's [12:36:08] and that says something [12:36:27] MatmaRex: that's subjective [12:36:39] MatmaRex: I feel the opposite [12:37:12] I could say "oh no, bugzilla sucks", "it's interface is worse than redmine's" [12:37:40] its* :P [12:37:42] MatmaRex: thus unless you can reference a objective analysis [12:37:57] guillom: I had to type what he typed :-P [12:37:57] yeah, probably [12:38:08] its*, yes. 
:D
[12:38:39] MatmaRex: thank you for fixing the parameters-not-scrolling bug in the VE's template editor, by the way.
[12:38:51] (PS7) Hashar: contint: publish Zuul git repositories [operations/puppet] - https://gerrit.wikimedia.org/r/71968
[12:39:06] guillom: VEs
[12:39:07] guillom: it's not merged yet :(
[12:39:28] MatmaRex: I know :/
[12:39:30] guillom: so if you could pull a few strings… ;)
[12:39:41] AzaToth: Don't make my eyes bleed kthxbai :P
[12:39:46] hehe
[12:40:18] MatmaRex: Unfortunately, I don't have that kind of powers, but I promise I'll pester and harass as much as I can.
[12:40:26] I felt like making a grammar-Patton on your grammar-nazi arse :-P
[12:40:36] heh
[12:41:51] PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host
[12:43:52] RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.004 second response time
[12:51:32] (CR) AzaToth: [C: 1] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/71968 (owner: Hashar)
[12:58:53] (PS10) Hashar: lanthanum as a jenkins slave [operations/puppet] - https://gerrit.wikimedia.org/r/64601
[13:07:20] (PS11) Hashar: lanthanum as a jenkins slave [operations/puppet] - https://gerrit.wikimedia.org/r/64601
[13:08:45] (CR) ArielGlenn: [C: 2] lanthanum as a jenkins slave [operations/puppet] - https://gerrit.wikimedia.org/r/64601 (owner: Hashar)
[13:08:46] (Merged) ArielGlenn: lanthanum as a jenkins slave [operations/puppet] - https://gerrit.wikimedia.org/r/64601 (owner: Hashar)
[13:13:49] PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host
[13:16:12] RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 8.068 second response time
[13:19:13] (PS2) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263
[13:19:58] (PS1) Mark Bergsma: Use hit_for_pass on TTL <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75327
[13:20:03] (PS3) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263
[13:20:37] (CR) Mark Bergsma: [C: 2] Use hit_for_pass on TTL <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75327 (owner: Mark Bergsma)
[13:21:39] (Merged) Mark Bergsma: Use hit_for_pass on TTL <= 0 objects [operations/puppet] - https://gerrit.wikimedia.org/r/75327 (owner: Mark Bergsma)
[13:25:14] (PS4) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263
[13:26:45] yoyo paravoid, could you check this when you get a sec?
[13:26:46] https://gerrit.wikimedia.org/r/#/c/74686/
[13:26:46] (PS5) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263
[13:28:17] hah, I was just loooking at it
[13:30:02] * paravoid grumbles at the hacks needed to do something so simple
[13:30:21] (PS1) Odder: (bug 51788) Close amwikiquote and uzwikibooks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75329
[13:30:53] (CR) Ottomata: [C: 1] Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott)
[13:31:09] yeah i grumbled a fair amount myself
[13:35:38] (CR) Faidon: [C: -1] "(4 comments)" [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 (owner: Ottomata)
[13:36:47] the 4644 was puzzling
[13:36:58] is this a typo when converting 0644 to 0444?
[13:37:21] uhhhhhh good question
[13:37:24] i think so
[13:39:20] ok ja will fix those paravoid, and as for snakeoil
[13:39:24] ja, hue didn't like it
[13:39:26] not exactly sure why
[13:39:39] must have been the type of cert it created or something, dunno
[13:39:50] this was more like what the hue docs said to do
[13:39:51] and it works
[13:39:53] so :/
[13:39:56] thanks!
[13:40:03] i gotta run for a few mins, back in a bit
[13:46:28] (PS1) Hashar: admin::roots is plural (for lanthanum) [operations/puppet] - https://gerrit.wikimedia.org/r/75332
[13:47:28] (CR) ArielGlenn: [C: 2] admin::roots is plural (for lanthanum) [operations/puppet] - https://gerrit.wikimedia.org/r/75332 (owner: Hashar)
[13:47:29] (Merged) ArielGlenn: admin::roots is plural (for lanthanum) [operations/puppet] - https://gerrit.wikimedia.org/r/75332 (owner: Hashar)
[13:51:21] PROBLEM - DPKG on lanthanum is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[13:52:21] RECOVERY - DPKG on lanthanum is OK: All packages OK
[14:08:03] (PS8) Hashar: contint: publish Zuul git repositories [operations/puppet] - https://gerrit.wikimedia.org/r/71968
[14:16:21] AzaToth: I got a way to publish the Zuul git repositories over http (thanks to openstack folks)
[14:16:36] AzaToth: so slaves will be able to fetch !
[14:17:01] (CR) Hashar: "added some trailing slashes." [operations/puppet] - https://gerrit.wikimedia.org/r/71968 (owner: Hashar)
[14:17:07] (PS9) Hashar: contint: publish Zuul git repositories [operations/puppet] - https://gerrit.wikimedia.org/r/71968
[14:22:17] !log Jenkins: lanthanum has been added as a Jenkins slave node though NO job should be running there.
[14:22:29] Logged the message, Master
[14:32:32] PROBLEM - Packetloss_Average on gadolinium is CRITICAL: CRITICAL: packet_loss_average is 8.6478998913 (gt 8.0)
[14:33:44] hashar: nice nice
[14:34:16] hashar: how do you test it?
[14:34:34] or do you just publish it and cross the fingers?
[14:38:00] (CR) AzaToth: "(2 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/71968 (owner: Hashar)
[14:42:14] AzaToth: I have a copy of the CI production infrastructure in labs :-]
[14:42:28] RECOVERY - Packetloss_Average on gadolinium is OK: OK: packet_loss_average is 3.99124366412
[14:44:12] AzaToth: basically Zuul + Gerrit + Jenkins all on the same instance with a copy of the Apache conf of integration.wikimedia.org (though on the instance that points to 127.0.0.1)
[14:47:25] !log stopping puppet on gadolinium, looking into filters causing socat process packet loss
[14:47:35] Logged the message, Master
[14:50:48] (PS2) Reedy: remove obsolete wgUseDynamicDates variable [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74838 (owner: TTO)
[14:50:52] (CR) Reedy: [C: 2] remove obsolete wgUseDynamicDates variable [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74838 (owner: TTO)
[14:51:02] (Merged) jenkins-bot: remove obsolete wgUseDynamicDates variable [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74838 (owner: TTO)
[14:51:49] (CR) Cmcmahon: "Bug 51884" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon)
[14:51:50] (CR) Reedy: "Roan, can you confirm this config is correct?"
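[Editor's note] On "is this a typo when converting 0644 to 0444?" just above: very likely, since a leading 4 in a four-digit mode is the setuid bit, not a read bit. A quick sketch of the difference, using a scratch file:
    touch demo
    chmod 4644 demo && stat -c '%a %A' demo    # 4644 -rwSr--r--  (setuid set, owner-writable)
    chmod 0444 demo && stat -c '%a %A' demo    # 444 -r--r--r--   (read-only for everyone, the intended mode)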
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon) [14:52:24] (PS2) Reedy: (bug 51788) Close amwikiquote and uzwikibooks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75329 (owner: Odder) [14:52:30] (CR) Reedy: [C: 2] (bug 51788) Close amwikiquote and uzwikibooks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75329 (owner: Odder) [14:52:40] (Merged) jenkins-bot: (bug 51788) Close amwikiquote and uzwikibooks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75329 (owner: Odder) [14:53:08] (PS2) Reedy: enable VisualEditor for all users on test2wiki, experimental also [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon) [14:53:08] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:53:40] (PS2) Reedy: (bug 51684) Modify two variables for Ukrainian Wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74691 (owner: Odder) [14:53:47] (CR) Reedy: [C: 2] (bug 51684) Modify two variables for Ukrainian Wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74691 (owner: Odder) [14:53:55] (Merged) jenkins-bot: (bug 51684) Modify two variables for Ukrainian Wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74691 (owner: Odder) [14:55:19] (PS2) Reedy: (bug 51715) allow sysops to add/remove confirmed group on ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74825 (owner: TTO) [14:55:24] (CR) Reedy: [C: 2] (bug 51715) allow sysops to add/remove confirmed group on ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74825 (owner: TTO) [14:55:35] (Merged) jenkins-bot: (bug 51715) allow sysops to add/remove confirmed group on ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74825 (owner: TTO) [14:55:48] (PS2) Reedy: (bug 42113) remove ability to debureaucrat from enwiktionary bureaucrats [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74828 (owner: TTO) [14:55:54] (CR) Reedy: [C: 2] (bug 42113) remove ability to debureaucrat from enwiktionary bureaucrats [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74828 (owner: TTO) [14:56:06] (Merged) jenkins-bot: (bug 42113) remove ability to debureaucrat from enwiktionary bureaucrats [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74828 (owner: TTO) [14:56:58] YuviPanda: perhaps some more color to grrrit-wm [14:56:58] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [14:57:10] AzaToth: I was thinking about it [14:57:22] AzaToth: unsure though about how well IRC clients handle it [14:57:34] !log dist-upgrade & reboot payments2 [14:57:43] Logged the message, Master [14:57:46] YuviPanda: either they show the color, or they only show uncolored text [14:57:48] AzaToth: the library I use supports easy colors [14:57:59] don't they get garbage text if not colors? [14:58:04] YuviPanda: if someone is using a client made in 1970, it's their own fault ツ [14:58:14] hehe, I see [14:58:24] does irssi support colors? [14:58:29] YuviPanda: yes [14:58:34] that's settled then [14:58:49] not right now, though. Will do over the weekend [14:59:15] andre__: can we have a component in bugzilla for grrrit-wm? [14:59:22] under 'Tools' perhaps? [14:59:24] only for gerrit-vm [14:59:28] :-P [14:59:38] * YuviPanda suddenly gets very distracted [14:59:44] hehe [15:09:38] MatmaRex: What version is the fawiki collation updates in? 
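On the grrrit-wm colour question a little further up: mIRC-style formatting is just in-band control bytes (0x02 for bold, 0x03 plus a colour number, 0x0F to reset), so clients that understand them render colour, clients that don't usually show plain uncoloured text, and only a truly ancient client will display the raw control characters. A rough bash sketch of what a coloured bot line would carry on the wire; the colour numbers and message text are arbitrary examples:

    # \x02 = bold, \x0303 = green, \x0304 = red, \x0f = reset
    printf '\x02(CR)\x02 \x0303Verified +1\x0f \x0304Lint -1\x0f jenkins-bot\n'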
[15:09:52] nvm [15:09:53] RTFB [15:11:00] (CR) Reedy: [C: -1] "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 (owner: Aude) [15:11:54] (PS2) Reedy: Disable "Mark as helpful" extension on English Wikipedia. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75075 (owner: Eloquence) [15:11:59] (CR) Reedy: [C: 2] Disable "Mark as helpful" extension on English Wikipedia. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75075 (owner: Eloquence) [15:12:10] (Merged) jenkins-bot: Disable "Mark as helpful" extension on English Wikipedia. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75075 (owner: Eloquence) [15:13:26] !log dist-upgrade & reboot payments2-4 [15:13:37] Logged the message, Master [15:13:37] !log reedy synchronized database lists files: [15:13:47] Logged the message, Master [15:14:13] !log reedy synchronized wmf-config/InitialiseSettings.php [15:14:23] Logged the message, Master [15:14:32] (PS2) Reedy: Change fa wikis to use uca-fa sort order [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75248 (owner: Brian Wolff) [15:14:45] (CR) Reedy: [C: 2] Change fa wikis to use uca-fa sort order [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75248 (owner: Brian Wolff) [15:14:54] (Merged) jenkins-bot: Change fa wikis to use uca-fa sort order [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75248 (owner: Brian Wolff) [15:20:03] YuviPanda, https://www.mediawiki.org/wiki/Bug_management/Project_Maintainers#To_add_a_project_or_component [15:21:15] !log dist-upgrade &reboot db1025 [15:21:25] Logged the message, Master [15:21:37] (PS1) Ottomata: Moving as many easy filters to oxygen. gadolinium's socat process is dropping packets with udp2log running. [operations/puppet] - https://gerrit.wikimedia.org/r/75342 [15:22:52] PROBLEM - Host db1025 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:16] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:02] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [15:24:03] (CR) Ottomata: [C: 2 V: 2] Moving as many easy filters to oxygen. gadolinium's socat process is dropping packets with udp2log running. [operations/puppet] - https://gerrit.wikimedia.org/r/75342 (owner: Ottomata) [15:24:04] (Merged) Ottomata: Moving as many easy filters to oxygen. gadolinium's socat process is dropping packets with udp2log running. [operations/puppet] - https://gerrit.wikimedia.org/r/75342 (owner: Ottomata) [15:25:13] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid [15:25:22] RECOVERY - Host db1025 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [15:26:50] (PS1) Ottomata: Need role::cache::configuration class in oxygen logging role [operations/puppet] - https://gerrit.wikimedia.org/r/75344 [15:27:11] (CR) Ottomata: [C: 2 V: 2] Need role::cache::configuration class in oxygen logging role [operations/puppet] - https://gerrit.wikimedia.org/r/75344 (owner: Ottomata) [15:27:12] (Merged) Ottomata: Need role::cache::configuration class in oxygen logging role [operations/puppet] - https://gerrit.wikimedia.org/r/75344 (owner: Ottomata) [15:27:12] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
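For context on the filters being shuffled from gadolinium to oxygen in the commits above: udp2log fans every incoming log line out to the destinations listed in its config file, one line per destination. A hedged sketch of what such a config looks like, assuming the usual "file <sampling-factor> <path>" and "pipe <sampling-factor> <command>" line formats; the paths and the awk filter are invented for illustration:

    # /etc/udp2log/webrequest -- illustrative only
    # keep a 1-in-1000 sample of everything on disk
    file 1000 /a/log/webrequest/sampled-1000.tsv.log
    # feed every line through a filter process (a stand-in awk filter here)
    pipe 1 /usr/bin/awk -F'\t' '/Special:BannerRandom/' >> /a/log/webrequest/banner.log

Moving a filter to another udp2log host is then essentially moving its config line, plus whatever the filter command itself needs, to that host.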
[15:28:02] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [15:30:12] RECOVERY - check_squid on payments3 is OK: PROCS OK: 1 process with command name squid [15:30:58] (PS1) Ottomata: Need webrequest_filter_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75345 [15:31:31] (CR) Ottomata: [C: 2 V: 2] Need webrequest_filter_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75345 (owner: Ottomata) [15:31:32] (Merged) Ottomata: Need webrequest_filter_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75345 (owner: Ottomata) [15:40:29] Off for today :) [15:42:36] (PS1) Andrew Bogott: Create an 'interface' module. [operations/puppet] - https://gerrit.wikimedia.org/r/75347 [15:43:53] (CR) Andrew Bogott: "Just getting the ball rolling." [operations/puppet] - https://gerrit.wikimedia.org/r/75347 (owner: Andrew Bogott) [15:52:29] (PS1) Ottomata: Moving webstatscollector filter process to oxygen udp2log instance [operations/puppet] - https://gerrit.wikimedia.org/r/75348 [15:53:12] (CR) Ottomata: [C: 2 V: 2] Moving webstatscollector filter process to oxygen udp2log instance [operations/puppet] - https://gerrit.wikimedia.org/r/75348 (owner: Ottomata) [15:53:13] (Merged) Ottomata: Moving webstatscollector filter process to oxygen udp2log instance [operations/puppet] - https://gerrit.wikimedia.org/r/75348 (owner: Ottomata) [15:55:07] (PS4) Aude: Move property-create for * to after loading of Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 [15:57:26] !log stop icinga notifications for db1008 [15:57:35] Logged the message, Master [15:57:43] !log stop icinga notifications for db1025 mysql [15:57:52] Logged the message, Master [15:59:28] Jeff_Green [15:59:38] I'm having a lot of trouble with udp2log on gadolinium right now [16:00:10] you there? [16:00:38] i am but I'm moments from a fundraising db master swap [16:00:52] so I may not be able to follow well for a bit [16:05:08] hmm ok [16:05:14] well, basically, udp2log there is not running righ tnow [16:05:19] i've moved everything except for your FR stuff to another host [16:05:23] not sure what to do about FR [16:05:36] q, can we upgrade locke and give it to you 100%? [16:05:42] we'd have to deal with the NFS mount move stuff again [16:05:50] but more of the FR job stuff is puppetized now, so it should be easier to move again [16:06:02] Jeff_Green ^ [16:08:04] ottomata: uhhwhut? so we're not logging banners at the moment? [16:08:10] (CR) Aude: "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 (owner: Aude) [16:08:16] right now, no we are not [16:08:27] did you tell the fundraisers already? [16:08:38] its only been an hourish since I started looking at this [16:08:43] i've been trying to save as much as possible [16:09:00] i'm not entirely sure why this is happening all of the sudden, but gadolinium is running a socat multicast relay [16:09:05] which basically feeds all udp2log instances [16:09:05] i see. luckily this is happening right in the middle of planned maintenance, so they pulled hte banners down earlier today [16:09:06] righ tnow [16:09:28] with gadolinium running udp2log at the same time as socat, the socat process drops packets [16:09:33] if I stop udp2log, it is fine [16:09:37] so, since socat is so important [16:09:40] ok. I can't deal with this for about an hour, but... 
[16:09:42] and very difficult to move [16:09:52] i decided to move udp2log [16:10:12] can you get m.ark to allow whichever host it is that you want fundraising to use to mount the fr share r/w? [16:10:39] also host has to be at the same datacenter as gadolinium or he has to reverse netapp replication [16:11:08] aaawww poof [16:11:10] locke is pmtpa [16:11:12] yeah we shouldn't use that [16:11:17] ok, i dunno then [16:11:44] I have to go back to the db move for an hour or so [16:12:01] ok [16:12:15] how do I notifiy fundraisers? [16:13:28] i just let them know in #wikimedia-fundraising, but if youre gonna send email include fr-tech [16:14:45] hmm, mark do we vary desktop HTML based on protocol? [16:14:57] (PS1) Demon: Adding init script for gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75350 [16:15:44] (PS1) Reedy: Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 [16:33:11] Heya, LeslieCarr, you there? [16:33:16] question about frack and networking [16:33:27] they can/should be able to consume from the udp2log multicast group, right? [16:35:46] (CR) Matmarex: [C: -1] "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 (owner: Reedy) [16:36:18] (PS2) Pyoungmeister: adding zinc as rtt solr host [operations/puppet] - https://gerrit.wikimedia.org/r/75265 [16:38:30] (CR) Pyoungmeister: [C: 2] adding zinc as rtt solr host [operations/puppet] - https://gerrit.wikimedia.org/r/75265 (owner: Pyoungmeister) [16:38:32] (Merged) Pyoungmeister: adding zinc as rtt solr host [operations/puppet] - https://gerrit.wikimedia.org/r/75265 (owner: Pyoungmeister) [16:39:53] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [16:40:32] !log authdns-update to propagate fundraisingdb-* cname changes [16:40:42] Logged the message, Master [16:43:57] (PS2) Reedy: Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 [16:45:36] (CR) Matmarex: [C: 1] Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 (owner: Reedy) [17:00:16] MaxSem: no [17:02:38] MaxSem: also check https://gerrit.wikimedia.org/r/#/c/75316/ [17:02:51] mark, that's why I'm asking:) [17:02:58] ok [17:03:22] (CR) MaxSem: [C: -1] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75316 (owner: Mark Bergsma) [17:04:30] ottomata: hey i'm here [17:04:52] ottomata: possibly , would need to check the firewall rules to see if it was allowed , but if not i can poke the hole [17:06:02] i just tried to consume it on aluminum, but it didnt' work [17:06:27] if you could poke hole, woudl be much appreciated [17:06:32] i want to give FR their own udp2log instance [17:09:41] (PS5) Mark Bergsma: Mobile Cookie Vary caching optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/75316 [17:09:41] so what's the source/dest/port pair ? 
and can you make a ticket (for pci compliance should all be ticketd) [17:11:04] (CR) Aaron Schulz: [C: 2] Added DB performance log [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75260 (owner: Aaron Schulz) [17:11:50] (Merged) jenkins-bot: Added DB performance log [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75260 (owner: Aaron Schulz) [17:13:22] !log aaron synchronized wmf-config/InitialiseSettings.php '26162fbc97ebbe25fee84bb333b0b0b2efe4c447' [17:13:33] Logged the message, Master [17:15:10] LeslieCarr: , can do [17:16:49] https://rt.wikimedia.org/Ticket/Display.html?id=5505&results=d40c70ee966f7c29b4ae623c07c24c79 [17:16:55] LeslieCarr: ^ [17:17:06] thanks [17:17:35] thank you! [17:20:08] oh ottomata , aluminium is still outside of frack and just using iptables [17:20:17] oh [17:20:26] hm, let me ask if that's ok then [17:20:31] which makes it easier and should just be in puppet [17:20:39] (PS1) Jgreen: flip db1008 to fundraisingdb slave, remove references to db1013 [operations/puppet] - https://gerrit.wikimedia.org/r/75361 [17:21:33] LeslieCarr: wanna join #wikimedia-fundraising? [17:25:50] !log reedy synchronized php-1.22wmf10/extensions/Wikibase [17:26:00] Logged the message, Master [17:26:00] notpeter_: what are the next steps? niklas updates translatewiki, then we can remove the role from vanadium? [17:27:39] !log reedy synchronized php-1.22wmf11/extensions/Wikibase [17:27:49] Logged the message, Master [17:28:07] (PS1) Aude: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 [17:28:17] (PS5) Reedy: Move property-create for * to after loading of Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 (owner: Aude) [17:28:51] (PS2) Aude: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 [17:29:03] (CR) Reedy: [C: 2] Move property-create for * to after loading of Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 (owner: Aude) [17:29:38] (Merged) jenkins-bot: Move property-create for * to after loading of Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 (owner: Aude) [17:30:34] ori-l: chris had a little bit of trouble installing zinc htis morning [17:30:56] !log reedy synchronized wmf-config/ [17:30:59] so that needs to happen, then a couple of puppet runs, and then maybe do some gets against each/work with niklas to make sure things are correct [17:31:07] Logged the message, Master [17:31:39] \o/ [17:32:07] Reedy: just added you as a reviewer to this: https://gerrit.wikimedia.org/r/#/c/74514/ I was going to +2 it, but I don't have +2 in mediawiki-config, I assume you do :) [17:32:10] * Jeff_Green wonders how long to wait for gerrit to finish reviewing my tiny merge [17:33:25] (CR) Jgreen: [C: 2 V: 2] flip db1008 to fundraisingdb slave, remove references to db1013 [operations/puppet] - https://gerrit.wikimedia.org/r/75361 (owner: Jgreen) [17:33:26] (Merged) Jgreen: flip db1008 to fundraisingdb slave, remove references to db1013 [operations/puppet] - https://gerrit.wikimedia.org/r/75361 (owner: Jgreen) [17:34:24] greg-g: Isn't that deploy on thursday? 
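A quick way to check whether a host can actually consume the udp2log multicast feed, before assuming it is a firewall problem, is to join the group and watch for packets. The group address, port and interface below are placeholders, not necessarily the real values:

    # join the multicast group and dump whatever arrives; silence after a few
    # seconds usually means the stream is not reaching this host
    socat -u UDP4-RECV:8420,ip-add-membership=233.0.0.1:eth0 - | head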
[17:37:39] (PS3) Aude: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 [17:38:00] (PS4) Aude: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 [17:39:22] (PS5) Reedy: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 (owner: Aude) [17:39:28] (CR) Reedy: [C: 2] enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 (owner: Aude) [17:39:40] (Merged) jenkins-bot: enable Wikibase for Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75363 (owner: Aude) [17:40:41] !log reedy synchronized wmf-config/ [17:45:03] (PS1) Demon: Allow the user's browser to cache resources [operations/puppet] - https://gerrit.wikimedia.org/r/75366 [17:45:49] Ryan_Lane, Shall we gut sockpuppet today? [17:46:04] https://gerrit.wikimedia.org/r/#/c/75263/ [17:46:14] notpeter_: cool, thanks for the update [17:46:26] I decided to do the private repo bits by hand because getting a keypair set up for a puppet merge was getting messy [17:49:02] Reedy: yeah.... right... good thing I don't have +2 on -config :P [17:49:29] that repo is automatically sucked down and pushed out by puppet then, right? [17:55:50] andrewbogott: let me review [17:56:40] andrewbogott: did you test this in labs? [17:56:53] it looks like it'll work to me [17:58:37] !log Populated sites table on all wikivoyage wikis [17:58:47] Logged the message, Master [18:00:08] !log reedy synchronized wmf-config/ [18:02:34] !log reedy synchronized database lists files: [18:02:43] Logged the message, Master [18:03:41] Ryan_Lane, it's hard to test thorougly because of the deps on private. It compiles at least... [18:03:48] * Ryan_Lane nods [18:04:21] RobH: we need another udp2log host for eqiad urgently, is there something available in the spares pool? [18:04:43] The patch also moves the files for the puppet master on stafford. Not strictly necessary but… it seems better to have both systems the same. [18:06:13] !log reedy synchronized wmf-config/ [18:06:26] RobH, Jeff_Green: should I file a ticket? [18:06:27] Ryan_Lane, since this may break puppet refreshes for a bit, think I should schedule a window? Or is just shouting something out on IRC enough? [18:06:37] for new box? [18:06:41] ottomata: yes pls [18:06:48] andrewbogott: if it breaks we can revert [18:06:53] * Jeff_Green grabbing lunch. back in ~20 [18:06:54] Jeff_Green: hrmm [18:06:56] I'd say go for it [18:07:00] hm [18:07:01] Jeff_Green: alternatively [18:07:03] wait.... [18:07:04] Jeff_Green: what kinda host? [18:07:05] I see one issue [18:07:08] we could find a beter home for the socat relay [18:07:12] but, that one is harder to move [18:07:14] ie: whats the existing one so i copy it? 
[18:07:17] it requires config changes to all frontend hosts [18:07:21] (PS3) Reedy: Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 [18:07:28] (CR) Reedy: [C: 2] Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 (owner: Reedy) [18:07:39] RobH, something like oxygen or gadolinium would be fine [18:07:41] (Merged) jenkins-bot: Category sorting order for sv.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75351 (owner: Reedy) [18:07:47] ottomata: I'm cool whichever makes more sense to you in terms of logging architecture [18:07:51] andrewbogott: the user is now gitpuppet, but root still owns the repo and still writes to it [18:08:08] ottomata: file a ticket yea, but i'll start snagging it now [18:08:09] one upside of moving socat is we don't have to retool the netapp mount [18:08:14] just have ticket so i can have a paper trail [18:08:21] yeah [18:08:22] and i do feel a bit shameful moving udp2log off of gadolinium [18:08:25] since it was allocated for that purpose [18:08:27] hm [18:08:43] i just want keep socat and udp2log separate [18:08:44] !log reedy synchronized wmf-config/InitialiseSettings.php [18:08:52] Ryan_Lane: Oh, hm. That's fine from the perspective of the git server but is not so good on the client. [18:08:54] Logged the message, Master [18:08:57] ………………. hm. [18:08:58] andrewbogott: maybe this won't be a problem on stafford, as long as the repo is owned by gitpuppet [18:09:12] i wish the socat thing wasn't so annoying to move [18:09:13] because the repo is readable by all [18:09:23] Jeff_Green: go get some lunch, i'll discuss with RobH [18:09:29] so you guys need to list the justifications in the ticket [18:09:38] cuz if we already allocated one machine to this, ops mgmt will wanna know whats up [18:09:40] ;] [18:09:40] Ryan_Lane, yeah, I think it won't matter... [18:09:43] yeah totally [18:09:50] RobH, maybe you can help me figure out what's best to do [18:09:50] so a single cpu misc server? [18:09:51] ottomata: cool. thanks! [18:09:54] so [18:10:03] i think we put in larger disks for the logging hosts didnt we? [18:10:06] yeah [18:10:08] if so, i dont have one like that [18:10:12] a socat box wouldn't need disk at all [18:10:13] does socat need the larger disks? [18:10:16] then move it. [18:10:18] not logs [18:10:19] ok [18:10:22] hm [18:10:34] I presume the puppet master doesn't have to own the files, it just needs to read them [18:10:35] But gitpuppet will need to write to the repo dir. Lemme check that [18:10:36] this is from hw perspective, i dunno how hard it is to do that [18:10:38] it uses a quite a bit of CPU [18:10:40] but all it does [18:10:44] is accept unicast udp traffic [18:10:50] and then forward all of it to a multicast group [18:10:57] andrewbogott: yeah, it needs to write to the repo on stafford [18:11:00] well i have single cpu misc boxes, and i have higher performance dual cpu misc boxes [18:11:02] but, this is *all* webrequest log traffic [18:11:49] the socat will only use one cpu, but it will use it up [18:11:55] so a couple of cpus at least might be nice [18:12:02] doesn't need much mem either [18:12:28] ok, so gadolinium is existing high performance misc host with dual 500GB disks [18:12:36] thats really not that large a disk in restrospect. 
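For reference, the relay being described really is that small: it accepts the unicast UDP log stream the frontends send and re-emits every datagram to a multicast group that each udp2log instance subscribes to. A minimal sketch of that shape, with placeholder port and group (the puppetised invocation will carry more options, such as multicast TTL and bind interface):

    # single-threaded copy loop: unicast in, multicast out; which is also why it
    # pegs exactly one core under full webrequest volume
    socat -u UDP4-RECV:8420 UDP4-DATAGRAM:233.0.0.1:8420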
[18:12:36] yeah, we should keep udp2log there, you are right [18:12:43] yeah, you are probaly thikning of the stat hosts [18:12:46] those have huge disks [18:12:49] but, still [18:12:50] PROBLEM - SSH on pdf3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:56] i think udp2log shoudl stay on gadolinium, now that we are talking about it [18:12:58] we should move socat [18:13:00] it is a pain to move [18:13:03] ottomata: so i think if we allocate a dual cpu host for another host it'll be identical [18:13:05] so we should move it once and then leave it [18:13:14] in which case i dont care which you move, i revoke my statement ;] [18:13:28] well, gadolinium has 12 cores [18:13:29] since the other dual cpu host i would give you will have identical spec to this one [18:13:33] it has dual cpu yes [18:13:39] all our dual cpu misc hosts have 12 cores total [18:13:42] ah k [18:13:43] 6 core cpus [18:13:51] k [18:13:51] =] [18:13:57] so if its easier for you to move logging, thats cool [18:14:00] oh, hm, so it would be identical then? [18:14:00] lemme get you one [18:14:01] yeah it would be [18:14:13] way easier to move logging, that's just applying puppet to that box [18:14:27] socat change requires deploying varnish, squid and nginx configs to all frontend hosts [18:14:31] so if i give you a box and set its vlan can you handle rest? [18:14:34] yup [18:14:40] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [18:14:41] will file RT ticket now [18:14:46] cool, lemme know RT when you have it in and i'll comment on it [18:14:49] i'll allocate it now [18:15:10] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 183 seconds [18:16:10] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [18:16:49] RobH [18:16:49] https://rt.wikimedia.org/Ticket/Display.html?id=5507 [18:17:09] ottomata: Oh so gadolinium is external ip [18:17:10] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 20 seconds [18:17:17] does socat need external ip and should it be on internal [18:17:24] and does the logging need internal or external ip? [18:17:57] cuz now my concern is if one needs external and one internal, whatever is external shoudl stay on gadolinium [18:18:08] (erbium will be new host, fyi) [18:18:12] HMMMM [18:18:38] i think the socat needs external, udp2log does not [18:18:45] because socat receives traffic from esams [18:18:52] or something, right? [18:18:56] i don't really know the details of that [18:19:59] well, then thats perfect. [18:20:05] cuz the new one can be internal then [18:20:13] * RobH jealously guards external ip addresses [18:20:22] hehe, k [18:20:24] i think tha tshoudl be fine [18:20:28] I'll get your dns setup too for it. [18:20:32] we run udp2log instances on analytics boxes with intenral IPs [18:20:34] so that's fine [18:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [18:24:06] (PS6) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [18:24:25] (CR) jenkins-bot: [V: -1] Simplify our puppet master setup. 
[operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [18:24:29] Ryan_Lane, ok, now repos should be owned by gitpuppet (on sockpuppet and on stafford) and puppet-merge does sudo -u gitpuppet [18:24:49] oops, and there's a syntax error apparently [18:25:32] (PS7) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [18:26:06] andrewbogott: you should also add a git hook that disallows root from making changes [18:26:25] sorry for the piecemeal reviews :( [18:27:18] good idea [18:27:48] I think you want pre-commit [18:28:20] andrewbogott: http://pastebin.com/70zBztE1 [18:28:26] I'm using that elsewhere [18:28:28] (CR) Ottomata: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [18:28:47] there's already a pre-commit, I think we need pre-merge as well [18:28:48] though "your own user" should likely get changed ;) [18:28:55] * Ryan_Lane nods [18:30:22] ottomata: So erbium is now allocated to this task. It has internal vlan and ip set, but won't be in the dhcpd lease files yet [18:30:50] awesome danke [18:31:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:36] RobH, no OS, right? [18:33:19] no os yet [18:33:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [18:33:24] i see the ticket, so will comment on it [18:35:35] (PS8) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [18:38:58] !log reedy synchronized php-1.22wmf10/extensions/Wikibase [18:39:55] * Elsie bites AzaToth. [18:41:04] no biting in -operations! [18:44:06] (PS9) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [18:46:06] (CR) Ottomata: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [18:49:03] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 239 seconds [18:49:45] (PS10) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [18:50:03] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds [18:50:12] RobH [18:50:18] NIC 1 or 2? [18:50:23] Embedded NIC MAC Addresses: [18:50:23] NIC.Embedded.1-1-1 Ethernet = 90:B1:1C:2D:7E:CD [18:50:23] WWN = 90:B1:1C:2D:7E:CD [18:50:23] NIC.Embedded.2-1-1 Ethernet = 90:B1:1C:2D:7E:CE [18:50:23] WWN = 90:B1:1C:2D:7E:CE [18:50:35] oh sorry [18:50:56] hmmm, or 'MAC address'? from above? [18:51:01] MAC Address = 90:B1:1C:2D:7E:CF [18:51:02] hmm [18:51:08] that' is probably the racadmin addy [18:51:22] I guess: NIC 1? [18:51:22] 90:B1:1C:2D:7E:CD [18:51:23] ? [18:52:03] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:53:03] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [18:58:23] (PS1) Ottomata: Big ol' commit prepping for erbium udp2log instance. [operations/puppet] - https://gerrit.wikimedia.org/r/75474 [19:00:10] (PS2) Ottomata: Big ol' commit prepping for erbium udp2log instance. [operations/puppet] - https://gerrit.wikimedia.org/r/75474 [19:00:19] (CR) Ottomata: [C: 2 V: 2] Big ol' commit prepping for erbium udp2log instance. 
[operations/puppet] - https://gerrit.wikimedia.org/r/75474 (owner: Ottomata) [19:00:23] (Merged) Ottomata: Big ol' commit prepping for erbium udp2log instance. [operations/puppet] - https://gerrit.wikimedia.org/r/75474 (owner: Ottomata) [19:03:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [19:05:34] (PS1) Jgreen: don't fetch dumps from db1025, it's fundraisingdb master now [operations/puppet] - https://gerrit.wikimedia.org/r/75475 [19:06:18] (PS1) Ottomata: Fixing missing semicolon [operations/puppet] - https://gerrit.wikimedia.org/r/75476 [19:06:37] (CR) Ottomata: [C: 2 V: 2] Fixing missing semicolon [operations/puppet] - https://gerrit.wikimedia.org/r/75476 (owner: Ottomata) [19:06:38] (Merged) Ottomata: Fixing missing semicolon [operations/puppet] - https://gerrit.wikimedia.org/r/75476 (owner: Ottomata) [19:06:45] (CR) Jgreen: [C: 2 V: 2] don't fetch dumps from db1025, it's fundraisingdb master now [operations/puppet] - https://gerrit.wikimedia.org/r/75475 (owner: Jgreen) [19:06:46] (Merged) Jgreen: don't fetch dumps from db1025, it's fundraisingdb master now [operations/puppet] - https://gerrit.wikimedia.org/r/75475 (owner: Jgreen) [19:07:06] oop, jeff i' about to run puppet-merge [19:07:09] will merge for ytou [19:07:14] Jeff_Green: ^ [19:07:28] oop i just did [19:07:35] oo mee too! [19:07:39] ok then! [19:08:27] ottomata: sorry, was writing tech doc and zoned out [19:08:30] nic1 [19:08:36] no prob, i found the mention of first nic on build a new server doc [19:08:37] danke [19:08:53] yea im workign on improving that doc this week [19:08:55] awesome [19:08:58] (CR) Ryan Lane: [C: 2] Allow the user's browser to cache resources [operations/puppet] - https://gerrit.wikimedia.org/r/75366 (owner: Demon) [19:09:01] having to get new row d buildout stuff done first but its on list [19:09:02] (Merged) Ryan Lane: Allow the user's browser to cache resources [operations/puppet] - https://gerrit.wikimedia.org/r/75366 (owner: Demon) [19:11:08] Elsie: :-P [19:11:58] ^demon: so, what heppened to gerrit? [19:12:32] <^demon> Gerrit's fine? [19:12:39] understand it's not top prio, but I've not seen you rereviewed it [19:12:49] the "update" [19:12:57] <^demon> Ahh, that. Yeah, I do need to do that. [19:13:26] ^demon: https://gerrit.wikimedia.org/r/#/c/68485/ [19:13:33] still a big red X from you there :( [19:14:10] as you are not verybusy atm ツ [19:14:31] <^demon> Hehe, I'm trying to make gitblit less crashy first :) [19:15:04] (PS2) Demon: Adding init script for gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75350 [19:15:13] ah [19:16:39] (CR) Ryan Lane: [C: 2] Adding init script for gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75350 (owner: Demon) [19:16:40] (Merged) Ryan Lane: Adding init script for gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75350 (owner: Demon) [19:17:26] ^demon: when you make init scripts, it's a good idea to use /etc/init.d/skeleton as boilerplate [19:17:41] when there's a "chkconfig" line in a init script, it smells redhat [19:18:17] ^demon: you hadn't init info in the file [19:18:43] <^demon> It's basically copied from upstream :p [19:18:51] and you should have a "status" command [19:18:54] I know [19:19:54] the vars should be read from /etc/default/gitblit as well... :-P [19:21:22] mixing chkconfig and start-stop-daemon... 
it's like having a bash script running System32.exe [19:22:25] (PS7) Pyoungmeister: Add elasticsearch module and role. [operations/puppet] - https://gerrit.wikimedia.org/r/74534 (owner: Manybubbles) [19:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:27] AzaToth: it's like have upstart manage an init job [19:23:03] i kid. it'd be better still to just write an upstart job, but i think the availability of a good init script upstream trumps other arguments. [19:23:11] ori-l: btw, I noticed you added me to Editor-engagement [19:23:21] i did? [19:23:24] did you ask me to add you? [19:23:33] I've not afaik [19:24:02] https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Editor-engagement&diff=72400&oldid=67639 [19:24:05] (CR) Manybubbles: [C: 1] "The rebase looks right." [operations/puppet] - https://gerrit.wikimedia.org/r/74534 (owner: Manybubbles) [19:24:12] perhaps I've forgot I asked you ヾ [19:24:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:24:35] * AzaToth scratches heards [19:25:10] I don't know if I'm a member of that group [19:25:35] AzaToth, there's an infinite beauty in crappy shit that works as opposed to nice shit that doesn't [19:25:59] AzaToth: probably I just made a mistake. but now you have to start doing editor engagement work! [19:26:03] * ori-l looks forward to your patches [19:26:07] ori-l: hehe [19:26:32] if there's no other option, be bold and run system32 from bash [19:26:49] there's no such exe, however:P [19:26:56] hehe ツ [19:29:26] PROBLEM - Host payments1 is DOWN: PING CRITICAL - Packet loss = 100% [19:29:43] !log dist-upgrade & reboot for payments* [19:29:51] oh noes, no payments! [19:29:54] Logged the message, Master [19:30:26] whee! 
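On the gitblit init script review comments above (missing LSB header, no status action, redhat-ish chkconfig lines): /etc/init.d/skeleton on Debian/Ubuntu shows the expected shape. A trimmed, hedged sketch of what that script could look like; the daemon path, arguments and pidfile are placeholders, not the real packaging:

    #!/bin/sh
    ### BEGIN INIT INFO
    # Provides:          gitblit
    # Required-Start:    $remote_fs $network
    # Required-Stop:     $remote_fs $network
    # Default-Start:     2 3 4 5
    # Default-Stop:      0 1 6
    # Short-Description: Gitblit git repository viewer
    ### END INIT INFO

    . /lib/lsb/init-functions
    # site settings belong in /etc/default/gitblit, as suggested in the review
    [ -r /etc/default/gitblit ] && . /etc/default/gitblit
    DAEMON=${DAEMON:-/usr/bin/java}
    PIDFILE=/var/run/gitblit.pid

    case "$1" in
      start)
        start-stop-daemon --start --background --make-pidfile --pidfile "$PIDFILE" \
            --exec "$DAEMON" -- $DAEMON_ARGS
        ;;
      stop)
        start-stop-daemon --stop --pidfile "$PIDFILE" --retry 10
        ;;
      status)
        status_of_proc -p "$PIDFILE" "$DAEMON" gitblit
        ;;
      restart)
        "$0" stop
        "$0" start
        ;;
      *)
        echo "Usage: $0 {start|stop|status|restart}" >&2
        exit 3
        ;;
    esac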
[19:30:46] RECOVERY - Host payments1 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [19:32:17] (PS1) Ottomata: erbium is internal name, not .wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/75482 [19:32:32] (CR) Ottomata: [C: 2 V: 2] erbium is internal name, not .wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/75482 (owner: Ottomata) [19:32:37] (Merged) Ottomata: erbium is internal name, not .wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/75482 (owner: Ottomata) [19:35:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Connecting Slave SQL: No Seconds Behind Master: (null) [19:35:17] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Connecting Slave SQL: No Seconds Behind Master: (null) [19:35:18] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Connecting Slave SQL: No Seconds Behind Master: (null) [19:35:51] (PS3) Physikerwelt: Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 [19:39:57] (PS1) Ottomata: Fixing location of packet-loss.log on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75484 [19:39:58] (PS1) Ottomata: role/logging.pp: tabs -> spaces [operations/puppet] - https://gerrit.wikimedia.org/r/75485 [19:40:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:40:18] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:40:19] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:40:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:51] (CR) Ottomata: [C: 2 V: 2] Fixing location of packet-loss.log on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75484 (owner: Ottomata) [19:40:52] (Merged) Ottomata: Fixing location of packet-loss.log on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75484 (owner: Ottomata) [19:41:09] (CR) Ottomata: [C: 2 V: 2] role/logging.pp: tabs -> spaces [operations/puppet] - https://gerrit.wikimedia.org/r/75485 (owner: Ottomata) [19:41:13] (Merged) Ottomata: role/logging.pp: tabs -> spaces [operations/puppet] - https://gerrit.wikimedia.org/r/75485 (owner: Ottomata) [19:41:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [19:41:18] (CR) Physikerwelt: "I'm not sure if I got that right." 
[operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt) [19:44:54] (PS1) Ottomata: Using webrequest_log_directory as base erbium udp2log log_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75486 [19:45:09] (CR) Ottomata: [C: 2 V: 2] Using webrequest_log_directory as base erbium udp2log log_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75486 (owner: Ottomata) [19:45:10] (Merged) Ottomata: Using webrequest_log_directory as base erbium udp2log log_directory [operations/puppet] - https://gerrit.wikimedia.org/r/75486 (owner: Ottomata) [19:45:53] (PS1) Ottomata: Don't need filter dir on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75487 [19:46:06] (CR) Ottomata: [C: 2 V: 2] Don't need filter dir on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75487 (owner: Ottomata) [19:46:07] (Merged) Ottomata: Don't need filter dir on erbium [operations/puppet] - https://gerrit.wikimedia.org/r/75487 (owner: Ottomata) [19:46:42] !log reedy synchronized php-1.22wmf11/extensions/Wikibase [19:47:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:54] agh i gotta run for a bit, be back on to finish up erbium (Jeff_Green) [19:47:55] byeye [19:49:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:51:08] LeslieCarr: the Ohio RTT pin has no number? why? :( [19:51:35] (CR) Pyoungmeister: [C: 2] Add elasticsearch module and role. [operations/puppet] - https://gerrit.wikimedia.org/r/74534 (owner: Manybubbles) [19:51:38] (Merged) Pyoungmeister: Add elasticsearch module and role. [operations/puppet] - https://gerrit.wikimedia.org/r/74534 (owner: Manybubbles) [19:51:55] manybubbles: ok, all merged up [19:52:03] as for nagios and ganglia stuffs, I can take a look [19:52:22] notpeter_: manybubbles: I guess you can soon get an instance in beta :-] [19:52:27] oh that is because that probe was reachabe but didn't send back a response [19:54:10] !log reedy synchronized php-1.22wmf11/extensions/Wikibase [19:55:18] LeslieCarr: lazy probe [19:55:54] that probe does not speak for all of Ohio! [19:57:26] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [19:58:23] totally lazy! [20:00:06] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 182 seconds [20:02:06] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 29 seconds [20:05:43] (PS11) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [20:06:17] (PS1) Demon: Add one more misbehaving bot, copy full bad_browser list to gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75492 [20:07:18] ^demon: one day we will have to make that a shared list for all tools to reuse [20:07:32] or maybe I will finally work on having a frontend varnish for our misc tools [20:07:53] <^demon> Ryan_Lane: Misbehaving bot patch ^ [20:08:26] PROBLEM - udp2log log age for erbium on erbium is CRITICAL: NRPE: Command check_udp2log_log_age-erbium not defined [20:08:47] (CR) Andrew Bogott: [C: 2] Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [20:08:50] (Merged) Andrew Bogott: Simplify our puppet master setup. 
[operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [20:08:53] (CR) Ryan Lane: [C: 2] Add one more misbehaving bot, copy full bad_browser list to gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75492 (owner: Demon) [20:08:59] (Merged) Ryan Lane: Add one more misbehaving bot, copy full bad_browser list to gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75492 (owner: Demon) [20:09:06] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 239 seconds [20:09:39] andrewbogott: I merged your change on sockpuppet [20:09:47] thanks [20:10:14] yw [20:12:20] (PS1) Demon: Upping heap to something reasonable [operations/puppet] - https://gerrit.wikimedia.org/r/75493 [20:15:56] PROBLEM - Host payments1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:18:25] andrewbogott: hi sorry I missed our checkin this morning (your time) [20:18:38] hashar, no problem, I haven't worked on CI this week anyway [20:18:42] andrewbogott: there was not much to say anyway [20:18:44] hehe [20:20:06] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 13 seconds [20:20:26] RECOVERY - Host payments1002 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:21:00] (PS1) Andrew Bogott: Actually include gitpuppet class. [operations/puppet] - https://gerrit.wikimedia.org/r/75494 [20:21:58] (CR) Andrew Bogott: [C: 2] Actually include gitpuppet class. [operations/puppet] - https://gerrit.wikimedia.org/r/75494 (owner: Andrew Bogott) [20:22:01] (Merged) Andrew Bogott: Actually include gitpuppet class. [operations/puppet] - https://gerrit.wikimedia.org/r/75494 (owner: Andrew Bogott) [20:29:49] (PS1) Andrew Bogott: Use 'private/' rather than '..private/' [operations/puppet] - https://gerrit.wikimedia.org/r/75495 [20:30:07] (CR) jenkins-bot: [V: -1] Use 'private/' rather than '..private/' [operations/puppet] - https://gerrit.wikimedia.org/r/75495 (owner: Andrew Bogott) [20:30:50] ryan_lane, any concerns about https://gerrit.wikimedia.org/r/#/c/75495/? [20:31:07] I'm going to link 'private' in /etc/puppet, I don't want to link it directly in /etc [20:31:30] hm, jenkins hates that [20:33:20] (PS1) Andrew Bogott: Link private repo to /etc/private [operations/puppet] - https://gerrit.wikimedia.org/r/75496 [20:34:17] (PS2) Andrew Bogott: Link private repo to /etc/private [operations/puppet] - https://gerrit.wikimedia.org/r/75496 [20:34:22] eh, nm, I'll just recreate the old model in /etc/private [20:34:56] (Abandoned) Andrew Bogott: Use 'private/' rather than '..private/' [operations/puppet] - https://gerrit.wikimedia.org/r/75495 (owner: Andrew Bogott) [20:35:11] (PS3) Andrew Bogott: Link private repo to /etc/private [operations/puppet] - https://gerrit.wikimedia.org/r/75496 [20:36:03] (CR) Andrew Bogott: [C: 2] Link private repo to /etc/private [operations/puppet] - https://gerrit.wikimedia.org/r/75496 (owner: Andrew Bogott) [20:36:04] (Merged) Andrew Bogott: Link private repo to /etc/private [operations/puppet] - https://gerrit.wikimedia.org/r/75496 (owner: Andrew Bogott) [20:36:55] ooooo mid puppet master cleanup, eh? 
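For readers following the puppet-master cleanup thread: the last few commits are only about where the checked-out private repository is exposed to puppet. A heavily hedged sketch of the idea, with both paths invented for illustration (the real clone location and link target are whatever those commits settled on):

    # make the gitpuppet-owned private checkout visible where the manifests expect it
    ln -sfn /srv/git/operations/private /etc/puppet/private
    # the fallback mentioned above, recreating the old layout under /etc
    ln -sfn /srv/git/operations/private /etc/private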
[20:39:39] ottomata, yeah [20:41:00] (PS1) Hashar: phase out misc::contint::test::packages [operations/puppet] - https://gerrit.wikimedia.org/r/75497 [20:41:01] (PS1) Hashar: creates role::contint::jenkins::slave [operations/puppet] - https://gerrit.wikimedia.org/r/75498 [20:41:02] (PS1) Hashar: replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499 [20:41:03] (PS1) Hashar: replicate Gerrit repos to Jenkins slave gallium [operations/puppet] - https://gerrit.wikimedia.org/r/75500 [20:41:57] (CR) Hashar: "This is part of the manifests/misc/contint.pp cleanup which I have been slowly doing on a step by step basis." [operations/puppet] - https://gerrit.wikimedia.org/r/75497 (owner: Hashar) [20:43:07] (PS2) Hashar: creates role::contint::jenkins::slave [operations/puppet] - https://gerrit.wikimedia.org/r/75498 [20:44:27] (CR) J: [C: 1] Proposed settings for VIPS. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74514 (owner: Brian Wolff) [20:44:35] (PS1) Manybubbles: Make config file depend on elasticsearch package. [operations/puppet] - https://gerrit.wikimedia.org/r/75501 [20:45:18] https://gerrit.wikimedia.org/r/#/c/73565/ [20:45:27] Can we get this merged, please? Pretty please? [20:45:41] (PS2) Hashar: replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499 [20:45:42] (PS2) Hashar: replicate Gerrit repos to Jenkins slave gallium [operations/puppet] - https://gerrit.wikimedia.org/r/75500 [20:45:58] (PS2) Manybubbles: Make config file depend on elasticsearch package. [operations/puppet] - https://gerrit.wikimedia.org/r/75501 [20:46:09] twkozlowski: not going to happen until the discussion on wikitech-l reach some point [20:46:23] twkozlowski: or we will end up in a commit war. sorry [20:46:40] hashar: have you followed that thread? [20:46:45] no [20:46:56] then please do, or at least skim it [20:47:01] I have no interest in bikeshedding over a preference :) [20:47:06] (there's a lot of text there) [20:47:26] i will happily merge and deploy that change if there is a clear instruction to have it deployed [20:47:28] the point is rather clearly reached, but, just as before, not acknowledged. [20:47:41] so get it acknowledged :-] [20:47:44] heh [20:47:57] in one or another then we can either abandon or merge+deploy [20:47:57] the later I can do [20:48:08] sadly i don't have physical access to james, necessary to do that [20:48:11] but I will only be the weapon there, not the solider [20:48:13] soldier [20:49:21] hashar: the discussion on wikitech-l has pretty much finished [20:49:38] without /any at all/ involvement from James or other VE people, as MatmaRex duly noted [20:49:48] I don't want to be involved in that discussion [20:49:49] well… ^demon, is gerrit suffering a crisis? [20:49:54] but will happily merge the change if needed. [20:49:55] well, james did restate his opinion [20:49:59] Possible I broke it but I don't think so... [20:50:02] (with regards to the preference, not the technical details of the VE) [20:50:06] anyway, i'm off to do something useful instead [20:50:18] (PS1) Andrew Bogott: Move the puppetmaster private link yet again. [operations/puppet] - https://gerrit.wikimedia.org/r/75503 [20:50:18] andrewbogott: lots of weirdness I should just wait on right now, right? [20:50:21] haha [20:50:22] I think I'll just abandon this shit [20:50:23] guess so :) [20:50:28] ottomata, yep [20:50:32] k danke [20:50:37] twkozlowski: why would you do that? 
[20:50:39] andrewbogott: gerrit been slow on my side too. [20:50:53] slow is fine, I was just worried it had died entirely [20:51:16] ottomata, it's going to be an hour or two before I feel confident that the new setup is stable [20:53:59] off to bed [20:54:05] MatmaRex: I feel uneasy with being the submitter of the patch. [20:54:25] and moreover, "Erik Möller has said that they will not be offering an integrated option to disable Visual Editor, essentially for the reason that they don't want to make it easy to disable." [20:55:52] (PS3) Hashar: replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499 [20:55:53] (PS3) Hashar: replicate Gerrit repos to Jenkins slave gallium [operations/puppet] - https://gerrit.wikimedia.org/r/75500 [20:55:58] (CR) Andrew Bogott: [C: 2] Move the puppetmaster private link yet again. [operations/puppet] - https://gerrit.wikimedia.org/r/75503 (owner: Andrew Bogott) [20:57:18] (Merged) Andrew Bogott: Move the puppetmaster private link yet again. [operations/puppet] - https://gerrit.wikimedia.org/r/75503 (owner: Andrew Bogott) [20:59:24] (CR) Hashar: "PS2 removed a change dependency" [operations/puppet] - https://gerrit.wikimedia.org/r/75499 (owner: Hashar) [21:04:58] ryan_lane, ping me when you're back? [21:07:50] (PS1) Aude: enable wikidata dispatcher for wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75505 [21:15:28] (PS2) Aude: enable wikidata dispatcher for wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75505 [21:16:52] (PS1) Manybubbles: Enable CirrusSearch in beta. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75507 [21:17:35] (CR) Manybubbles: [C: -1] "Not ready because we don't know the names of the servers and I'm sure there are other things wrong with it :)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75507 (owner: Manybubbles) [21:19:31] (PS1) Andrew Bogott: Move puppet templatedir again. [operations/puppet] - https://gerrit.wikimedia.org/r/75508 [21:20:24] (CR) Andrew Bogott: [C: 2] Move puppet templatedir again. [operations/puppet] - https://gerrit.wikimedia.org/r/75508 (owner: Andrew Bogott) [21:20:25] (Merged) Andrew Bogott: Move puppet templatedir again. [operations/puppet] - https://gerrit.wikimedia.org/r/75508 (owner: Andrew Bogott) [21:22:16] (CR) Reedy: [C: 2] "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75505 (owner: Aude) [21:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:22:32] (Merged) jenkins-bot: enable wikidata dispatcher for wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75505 (owner: Aude) [21:24:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [21:26:08] springle: This is Nik from the database email. Good morning! [21:26:22] manybubbles, hi :) [21:26:55] I don't get to talk to many folks in Australia just because of the time difference. I normally log off pretty soon. [21:27:14] !log reedy synchronized wmf-config/CommonSettings.php [21:27:25] Logged the message, Master [21:28:39] and I've got to head out - really soon - oh bother timing! [21:29:04] manybubbles, when do you usually start? 
[21:29:21] i'm often around late [21:29:27] about nine hours ago [21:29:37] (PS1) Andrew Bogott: Don't use group 'gitpuppet' [operations/puppet] - https://gerrit.wikimedia.org/r/75510 [21:29:54] well, anywhere from 7.5 hours ago to 9 hours ago, depending on the morning [21:30:34] (CR) Andrew Bogott: [C: 2] Don't use group 'gitpuppet' [operations/puppet] - https://gerrit.wikimedia.org/r/75510 (owner: Andrew Bogott) [21:30:35] (Merged) Andrew Bogott: Don't use group 'gitpuppet' [operations/puppet] - https://gerrit.wikimedia.org/r/75510 (owner: Andrew Bogott) [21:31:01] springle: so Coren was saying that there is hardware waiting to be sprinkled with magic database dust. [21:32:12] springle: Long story short, I have a changeset that just blindly installs a vanilla mysql; I'm pretty sure the defaults would be All Wrong™. :-) https://gerrit.wikimedia.org/r/#/c/74158/ [21:33:02] (PS1) Andrew Bogott: And now we've entered the typo-heavy part of the afternoon [operations/puppet] - https://gerrit.wikimedia.org/r/75512 [21:33:34] springle: Also we want labsudb2 to be a slave of labsudb1. I could do this with my eyes closed with postgres, but I have no idea where to start with mysql. :-) [21:34:08] Coren: I've always enjoyed postgres too.... [21:35:13] (CR) Andrew Bogott: [C: 2] And now we've entered the typo-heavy part of the afternoon [operations/puppet] - https://gerrit.wikimedia.org/r/75512 (owner: Andrew Bogott) [21:35:14] (Merged) Andrew Bogott: And now we've entered the typo-heavy part of the afternoon [operations/puppet] - https://gerrit.wikimedia.org/r/75512 (owner: Andrew Bogott) [21:35:48] Coren and springle: I don't strictly need the slave but it would certainly make my labs work even more prodlike. I do need the performance and for my rampant IO not to crush everyone else.... [21:36:42] manybubbles: *We* need the slave, remember you're just the most immediate user but this box is going to replace a pile of smaller cruddy mysql running in VMs. :-) [21:36:59] Coren: sounds good to me! [21:37:16] (Running DBs in VMs has never been a good idea) [21:37:30] manybubbles, Coren, can you shoot me a quick email primer? say, description of the dataset. types of queries you expect. read/write ratio guess. ram on the vm, etc [21:38:02] I can send my summary [21:38:18] thanks [21:44:14] RobH: about ? [21:45:04] notpeter_: am now, was at lunch [21:45:06] sup? [21:45:08] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 205 seconds [21:45:48] manybubbles, MySQL or MariaDB? 
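Since "where to start with mysql" replication came up: the classic master/slave setup is a handful of steps, namely binary logging plus a replication account on the master, then CHANGE MASTER TO and START SLAVE on the slave. A rough sketch with placeholder host names, credentials and log coordinates (the same steps apply to MariaDB):

    # --- on the master (labsdb1 here) ---
    # my.cnf needs at least: server-id=1 and log-bin=mysql-bin, then restart mysqld
    mysql -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'labsdb2.example' IDENTIFIED BY 'secret';"
    mysql -e "SHOW MASTER STATUS;"    # note the File and Position columns

    # --- on the slave (labsdb2), after loading a consistent copy of the data ---
    # my.cnf needs a distinct server-id, e.g. server-id=2
    mysql <<'SQL'
    CHANGE MASTER TO
      MASTER_HOST='labsdb1.example',
      MASTER_USER='repl',
      MASTER_PASSWORD='secret',
      MASTER_LOG_FILE='mysql-bin.000001',  -- File from SHOW MASTER STATUS
      MASTER_LOG_POS=107;                  -- Position from the same output
    START SLAVE;
    SQL
    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'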
[21:46:08] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 17 seconds [21:46:21] RobH: I'm assuming yo ujust back from lunch with leslie as she just responded to the pertinent question :) [21:46:24] so, you're in the clear [21:48:08] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [21:48:08] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [21:48:08] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [21:48:08] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [21:48:08] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [21:48:09] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [21:48:09] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [21:49:24] woooo [21:49:27] \o/ [21:53:08] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours [21:53:19] Hm. anyone around who knows how puppet certs are supposed to work? [21:53:26] I seem to've broken them without touching them :( [21:56:11] haha, the puppet cert server is sockpuppet though the master is stafford ... [21:56:14] what's th eerror ? [21:59:08] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours [21:59:23] <^demon> Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/75493/ also, that ups the heap from 1G to 4G [21:59:33] LeslieCarr: The answer is I'm being dumb, I think -- trying to run puppet as user [21:59:54] yep [22:00:08] (PS1) AzaToth: Creating initial debianization [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 [22:00:19] ah [22:00:20] andrewbogott: I'm back. need help with anything? [22:00:55] Ryan_Lane: I did but now maybe I don't :) taking stock [22:01:25] heh [22:01:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [22:03:35] <^demon> err: /Stage[main]/Gitblit::Instance/Service[gitblit]: Could not evaluate: Could not find init script for 'gitblit' [22:03:35] <^demon> err: /Stage[main]/Gitblit::Instance/File[/etc/init.d/gitblit]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///files/gitblit/gitblit-ubuntu at /etc/puppet/manifests/misc/gitblit.pp:47 [22:03:39] <^demon> Ryan_Lane: ^ [22:03:46] Ryan_Lane, lots of that patch doesn't work because root isn't allowed to use 'sudo'. Should I add root to sudoers, or is there some use of 'su' that will replace my calls to sudo? [22:04:08] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours [22:04:38] andrewbogott: you can use su -s [22:04:49] err [22:05:16] su - gitpuppet -c 'blah' [22:05:52] notpeter_: i see why - the dhcp files have zinc as its public name, not .eqiad.wmnet [22:06:13] LeslieCarr: you're the best [22:06:14] thanks! 
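Given that sudo is effectively disabled on these hosts, the workable pattern is the su form pasted above, with an explicit shell because gitpuppet has no login shell. A minimal sketch of a puppet-merge-style wrapper built on it; the repository path is an assumption:

    #!/bin/bash
    # run the merge as gitpuppet even though the operator is root; -s supplies a
    # shell for the shell-less user, and the trailing argument is the target user
    REPO=/var/lib/git/operations/puppet   # placeholder path
    su - -s /bin/bash -c "cd '$REPO' && git pull --ff-only" gitpuppet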
[22:06:15] hm [22:06:23] I guess I herped when I should have derped [22:06:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:07:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.155 second response time [22:07:18] LeslieCarr: did you just fix? [22:07:30] (CR) Physikerwelt: [C: 2 V: 2] Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt) [22:07:31] (Merged) Physikerwelt: Initial commit and setup of git review [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75301 (owner: Physikerwelt) [22:07:37] andrewbogott: ah. it has no shell [22:07:52] Ah, I will fix that in this patch as well. [22:08:02] (PS1) Demon: Typofix [operations/puppet] - https://gerrit.wikimedia.org/r/75514 [22:08:09] <^demon> Ryan_Lane: Fixed that ^ [22:08:12] andrewbogott: for instance: su - -s '/bin/bash' -c 'touch /tmp/test' gitpuppet [22:08:48] (PS1) Andrew Bogott: Use su instead of sudo [operations/puppet] - https://gerrit.wikimedia.org/r/75515 [22:08:49] (CR) Ryan Lane: [C: 2] Typofix [operations/puppet] - https://gerrit.wikimedia.org/r/75514 (owner: Demon) [22:08:50] notpeter_: i did not fix [22:08:50] (Merged) Ryan Lane: Typofix [operations/puppet] - https://gerrit.wikimedia.org/r/75514 (owner: Demon) [22:08:59] LeslieCarr: where are you seeing this? [22:09:02] oh wait [22:09:03] looks right to me [22:09:07] old puppet [22:09:08] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 239 seconds [22:09:09] stupid puppet [22:09:13] ah, gotcha [22:09:14] :D puppet-merge is broken [22:09:29] Ryan_Lane: you're broken! [22:10:05] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 5 seconds [22:10:38] oh no :( [22:10:48] physikerwelt___: I assume you know how to build it now [22:11:44] (CR) Andrew Bogott: [C: 2] Use su instead of sudo [operations/puppet] - https://gerrit.wikimedia.org/r/75515 (owner: Andrew Bogott) [22:11:45] notpeter_: hrm, it's bad on brewster - i'm guessing that it is either related to puppet merge or something ? [22:11:57] physikerwelt___: I don't know if it was right to place the sty:s in /usr/share/texmf/ (which is $TEXMFDEBIAN), but it was totally wrong to have them in TEXMFLOCAL due to the fact we are making a package [22:12:03] I'm no tex package master [22:12:08] root@brewster:/etc/dhcp3# grep -i zinc * [22:12:09] linux-host-entries.ttyS1-115200:host zinc { [22:12:10] linux-host-entries.ttyS1-115200: fixed-address zinc.wikimedia.org; [22:12:11] YuviPanda: wtf? [22:12:17] ah, gotcha [22:12:19] (Merged) Andrew Bogott: Use su instead of sudo [operations/puppet] - https://gerrit.wikimedia.org/r/75515 (owner: Andrew Bogott) [22:13:20] root not allowed to sudo? [22:13:34] why not add sudo to root? in sudoers? [22:14:25] which offcourse shouldn't be needed [22:14:51] I assume you have "root ALL=(ALL:ALL) ALL" in sudoers [22:14:59] probably not [22:15:05] we basically have sudo disabled [22:15:50] hmm [22:16:09] not that it's a good thing [22:16:09] Ryan_Lane: wouldn't using sudo generally be more safe than using su? 
[22:16:09] we just handled that poorly
[22:16:47] {{sofixit}}
[22:16:50] * AzaToth hides
[22:17:11] (CR) Physikerwelt: [V: 2] Creating initial debianization [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[22:18:24] (PS1) Andrew Bogott: When using su - we have to explicitly set cwd [operations/puppet] - https://gerrit.wikimedia.org/r/75517
[22:19:14] (CR) Andrew Bogott: [C: 2] When using su - we have to explicitly set cwd [operations/puppet] - https://gerrit.wikimedia.org/r/75517 (owner: Andrew Bogott)
[22:19:15] (Merged) Andrew Bogott: When using su - we have to explicitly set cwd [operations/puppet] - https://gerrit.wikimedia.org/r/75517 (owner: Andrew Bogott)
[22:20:25] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours
[22:21:10] (CR) Pyoungmeister: [C: 2] Make config file depend on elasticsearch package. [operations/puppet] - https://gerrit.wikimedia.org/r/75501 (owner: Manybubbles)
[22:21:11] (Merged) Pyoungmeister: Make config file depend on elasticsearch package. [operations/puppet] - https://gerrit.wikimedia.org/r/75501 (owner: Manybubbles)
[22:24:09] Ryan_Lane: What in blazes is nagios doing on labstore3? Its check_disk process runs at 100% CPU for 30 seconds every minute(!)
[22:24:54] And it grows to an RSS of 8g!
[22:25:00] whaaaaaaat?
[22:25:22] well, it seems like it would be the check_disk process that's an issue
[22:25:42] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
[22:25:42] 7944 nagios 20 0 6373m 6.2g 688 R 100 19.8 0:16.85 check_disk
[22:26:04] right, so it's check disk
[22:26:18] need to figure out what in the hell it's doing
[22:26:27] wow.
[22:26:59] Well, it didn't *use* to do that; I was tracking down odd spikes in the usage.
[22:26:59] maybe it sees 80918401840 devices and loses its shit? :)
[22:28:34] (PS1) Demon: Block spiders from indexing archives [operations/puppet] - https://gerrit.wikimedia.org/r/75518
[22:28:39] (PS1) Andrew Bogott: Meaningless test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75519
[22:28:41] Hm. Something odd is going on with the snapshots, that's probably why it's going crazy.
[22:28:53] * Coren|Away chuckles.
[22:29:03] Apparently, I no longer unmount snapshots. Ever. :-)
[22:29:08] hahahahaha
[22:29:35] that could be… problematic
[22:29:37] (PS2) Demon: Upping heap to something reasonable [operations/puppet] - https://gerrit.wikimedia.org/r/75493
[22:29:47] Hm. We probably want to have it not check the snapshots though.
[22:29:52] At best, it's useless.
[22:29:55] (CR) Andrew Bogott: [C: 2] Meaningless test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75519 (owner: Andrew Bogott)
[22:29:56] (Merged) Andrew Bogott: Meaningless test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75519 (owner: Andrew Bogott)
[22:31:03] (CR) Ryan Lane: [C: 2] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75493 (owner: Demon)
[22:31:04] (Merged) Ryan Lane: Upping heap to something reasonable [operations/puppet] - https://gerrit.wikimedia.org/r/75493 (owner: Demon)
[22:31:29] LeslieCarr: livehacked dhcp files on brewster, still no dice
[22:31:37] straight up no requests coming in
[22:31:39] andrewbogott: puppet-merge shows the login text for stafford now ;)
[22:31:41] and restarted dhcp ?
[22:31:48] well, the MOTD anyway
[22:32:02] Yeah, but it doesn't actually do anything...
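
On the labstore3 check_disk problem above: check_disk walks every mounted filesystem by default, so thousands of lingering snapshot and bind mounts make it burn CPU and memory even though none of them are interesting. The plugin can be pinned to the paths that matter or told to ignore mount points by regex; a sketch, with the thresholds, plugin path, and snapshot path pattern being illustrative rather than taken from the log:

    # only check the filesystems we actually care about
    /usr/lib/nagios/plugins/check_disk -w 10% -c 5% -l -p / -p /srv
    # or keep the default sweep but skip anything that looks like a snapshot mount
    /usr/lib/nagios/plugins/check_disk -w 10% -c 5% -l -i 'snapshot'
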
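
On the brewster/zinc thread: the grep output earlier shows the host entry handing out zinc's public wikimedia.org name, while an eqiad install should be getting the .eqiad.wmnet one, and live edits to the files do nothing until the DHCP daemon is restarted, which is exactly how this resolves just below. Roughly, with the file name taken from the log and the service name assumed from the dhcp3 packaging:

    # /etc/dhcp3/linux-host-entries.ttyS1-115200 on brewster (sketch, other options elided)
    host zinc {
        ...
        fixed-address zinc.eqiad.wmnet;   # was zinc.wikimedia.org, the public name
    }

    # the daemon only rereads its config on restart
    /etc/init.d/dhcp3-server restart
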
[22:32:09] ah
[22:32:17] stdin: is not a tty
[22:32:26] notpeter_: restart the machine again, i've got the network port listening
[22:32:26] does the user have a shell?
[22:32:30] oh it's down right now
[22:32:40] ok back up
[22:32:40] just restarted
[22:32:40] if you can restart
[22:33:05] Ryan_Lane, that's if you have a forwarded key. If you don't… Permission denied (publickey).
[22:33:14] ah
[22:33:43] the key has no passphrase and is a default id_rsa key?
[22:33:58] (CR) Ryan Lane: [C: 2] Block spiders from indexing archives [operations/puppet] - https://gerrit.wikimedia.org/r/75518 (owner: Demon)
[22:33:59] (Merged) Ryan Lane: Block spiders from indexing archives [operations/puppet] - https://gerrit.wikimedia.org/r/75518 (owner: Demon)
[22:34:03] LeslieCarr: hadn't restarted dhcp. derp derp derp
[22:34:13] Ryan_Lane: Yeah, I don't know any better. How should I be handling that instead?
[22:34:23] well, no, that's correct ;)
[22:34:37] Ah, then, yes. Yes it is.
[22:34:50] hm. /bin/sh?
[22:34:55] you should really use bash
[22:35:16] root@sockpuppet:~/puppet# su - gitpuppet
[22:35:16] $ ssh stafford
[22:35:16] Permission denied (publickey).
[22:35:19] notpeter_: is it happier now ? i saw the request
[22:35:26] ah ha
[22:35:37] andrewbogott: gitpuppet.key
[22:35:48] it should be id_rsa
[22:35:58] ah, right.
[22:36:01] if an rsa key, or id_dsa if a dsa key
[22:36:01] otherwise ssh won't use it
[22:36:09] unless you specify the key, anyway
[22:36:50] LeslieCarr: yep
[22:36:56] that seems to help, although behavior still seems wrong...
[22:37:02] LeslieCarr: I was just being particularly bone-headed
[22:37:40] andrewbogott: command needs to be before the key, not after
[22:37:44] anything after the key is a comment
[22:37:57] no prob
[22:38:57] there it goes
[22:39:29] now I have to sort out the no tty thing
[22:41:28] There might have been something wrong with my snapshot system. I count ~16k mounts. :-)
[22:42:30] (PS1) Andrew Bogott: Fix several issues with gitpuppet's keypair [operations/puppet] - https://gerrit.wikimedia.org/r/75520
[22:44:36] RECOVERY - Disk space on labstore3 is OK: DISK OK
[22:46:23] (PS2) Andrew Bogott: Fix several issues with gitpuppet's keypair [operations/puppet] - https://gerrit.wikimedia.org/r/75520
[22:47:02] (CR) Demon: [C: 1] replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499 (owner: Hashar)
[22:47:41] (PS3) Andrew Bogott: Fix several issues with gitpuppet's keypair [operations/puppet] - https://gerrit.wikimedia.org/r/75520
[22:47:53] (CR) Demon: [C: -1] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75500 (owner: Hashar)
[22:48:15] Coren|Away: :D
[22:48:30] (PS2) Physikerwelt: Creating initial debianization [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[22:48:49] Aha. Found why. I have an umount that presumes xfs.
[22:48:49] (CR) Andrew Bogott: [C: 2] Fix several issues with gitpuppet's keypair [operations/puppet] - https://gerrit.wikimedia.org/r/75520 (owner: Andrew Bogott)
[22:48:50] (Merged) Andrew Bogott: Fix several issues with gitpuppet's keypair [operations/puppet] - https://gerrit.wikimedia.org/r/75520 (owner: Andrew Bogott)
[22:50:31] Coren|Away: ah, right, and it's ext4 now
[22:52:06] It was still a bad idea to presume.
[22:53:08] notpeter_, can you review https://gerrit.wikimedia.org/r/74521 please?
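
Two ssh details from the exchange above, spelled out: by default the client only offers key files with the standard names (~/.ssh/id_rsa for an RSA key, ~/.ssh/id_dsa for a DSA key), so a key stored as gitpuppet.key is never tried unless it is pointed at explicitly; and in authorized_keys, options such as a forced command have to come before the key type, because everything after the base64 key material is treated as a comment. A sketch; the forced-command path is hypothetical and the key material is elided:

    # client side: either rename the key to the default name ssh looks for...
    mv ~gitpuppet/.ssh/gitpuppet.key ~gitpuppet/.ssh/id_rsa   # and the matching .pub, if any
    # ...or keep the name and reference it from ~gitpuppet/.ssh/config:
    #     Host stafford stafford.pmtpa.wmnet
    #         IdentityFile ~/.ssh/gitpuppet.key

    # server side, one line in ~/.ssh/authorized_keys (options first, comment last):
    # command="/usr/local/bin/sync-puppet-checkout" ssh-rsa AAAA...elided... gitpuppet@sockpuppet
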
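
On the ~16k mounts and the umount that presumes xfs: since the volumes are ext4 now, it is safer to drive the cleanup off the mount table itself rather than the filesystem type. A sketch; the assumption that the leaked mount points have "snapshot" in their path is illustrative, not the real labstore3 layout:

    # how many filesystems a full sweep (or check_disk) has to walk
    wc -l < /proc/mounts
    # list leaked snapshot mounts by mount point, whatever the fstype, and lazily unmount them
    awk '$2 ~ /snapshot/ {print $2}' /proc/mounts | xargs -r -n1 umount -l
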
[22:53:10] * Ryan_Lane nods
[22:53:32] (PS1) Andrew Bogott: Tiny test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75521
[22:53:34] MaxSem: I was actually *just* looking at that :)
[22:54:07] (CR) Pyoungmeister: [C: 2] solr: __version__ magic field to GeoData schema [operations/puppet] - https://gerrit.wikimedia.org/r/74521 (owner: MaxSem)
[22:54:08] (Merged) Pyoungmeister: solr: __version__ magic field to GeoData schema [operations/puppet] - https://gerrit.wikimedia.org/r/74521 (owner: MaxSem)
[22:54:21] :):)
[22:54:44] (CR) Andrew Bogott: [C: 2] Tiny test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75521 (owner: Andrew Bogott)
[22:54:44] (Merged) Andrew Bogott: Tiny test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75521 (owner: Andrew Bogott)
[22:54:47] RoanKattouw, ori-l, gwicke_away: are you guys going to be doing any deploys soon?
[22:54:54] Not today
[22:54:56] I'll be merging my deployment changes soon
[22:54:57] ok
[22:55:08] I have a window for 8am-10am tomorrow, but nothing there involves git-deploy
[22:55:35] ok. good. let me know your next deploy so that I'm sure to be around
[22:56:17] I've tested it in labs, but things happen
[22:56:28] (PS5) Ryan Lane: Use grains for deployment targets [operations/puppet] - https://gerrit.wikimedia.org/r/74108
[22:56:37] Ryan_Lane: Please send Gabriel an email to that extent, CC Subbu. Gabriel is out on vacation this week and next, and Subbu will be doing any deploys should they be needed
[22:56:43] ok. will do
[22:56:47] what's subbu's email?
[22:56:51] ssastry@
[22:56:53] (CR) AzaToth: [C: -1] "(1 comment)" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[22:56:57] thanks
[22:57:18] (And I'm around to help Subbu out given that he hasn't done a lot of deploys yet)
[22:57:47] * Coren|Away is actually impressed the box held on with only check_disk throwing a fit despite having several thousand bind mounts, and a couple hundred extra snapshots mounted.
[22:58:41] bah, still getting 'permission denied' from within the post-merge hook
[22:58:48] Coren|Away: indeed
[22:59:02] andrewbogott: test by su-ing to the user
[22:59:11] Ryan_Lane, it works when I do that.
[22:59:18] root@sockpuppet:~/puppet# su - gitpuppet
[22:59:18] $ ssh stafford
[22:59:18] Already up-to-date.
[22:59:18] Submodule 'modules/cdh4' () registered for path 'modules/cdh4'
[22:59:18] Submodule 'modules/jmxtrans' () registered for path 'modules/jmxtrans'
[22:59:18] Submodule 'modules/zookeeper' () registered for path 'modules/zookeeper'
[22:59:19] Connection to stafford closed.
[22:59:38] oh
[22:59:39] I know
[22:59:57] andrewbogott: what does your su look like? you need to switch environment
[23:00:00] su -
[23:00:06] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 239 seconds
[23:00:16] so: su - -c 'blah' gitpuppet
[23:00:17] su - $git_user -c "cd ${basedir} && ${cmd}"
[23:00:42] you're running that in puppet-merge?
[23:00:55] yes
[23:01:13] rather than just running ssh?
[23:01:20] or are both needed?
[23:01:20] (PS1) Pyoungmeister: removing spence from site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/75523
[23:01:33] * Ryan_Lane looks at the code
[23:01:35] ${cmd} is 'git merge'
[23:01:40] which causes the post-merge hook
[23:01:42] which does the ssh
[23:01:45] ahh
[23:01:49] maybe the hook is getting called as some other user...
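
The question left hanging above has a simple answer: git runs hooks as whatever account ran the git command, in that account's environment. So if puppet-merge wraps the fetch in su but runs the merge itself as root, the post-merge hook, and the ssh to stafford inside it, fire as root and never see gitpuppet's key. A throwaway way to confirm which user a hook runs as (the stub below is illustrative):

    #!/bin/sh
    # temporary stand-in for .git/hooks/post-merge: record who actually runs the hook
    echo "post-merge ran as $(id -un) in $PWD" >> /tmp/post-merge-debug.log

Make it executable, run puppet-merge once, and the log file will keep saying root until the merge itself is wrapped.
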
[23:01:57] I couldn't imagine how
[23:02:09] (Abandoned) Pyoungmeister: re-kill spence from site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/73370 (owner: Dzahn)
[23:02:12] I prefer to have the stafford update in the post-merge hook so it happens if a user routes around puppet-merge
[23:02:27] not vital though
[23:03:25] (CR) Pyoungmeister: [C: 2] removing spence from site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/75523 (owner: Pyoungmeister)
[23:03:26] (Merged) Pyoungmeister: removing spence from site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/75523 (owner: Pyoungmeister)
[23:04:11] su - gitpuppet -c 'ssh stafford' <— that also works
[23:05:06] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 5 seconds
[23:05:07] there's a pending patch, you can run puppet-merge now and see
[23:06:19] it worked for me, but I have a forwarded agent
[23:06:46] so for some reason it's running as root
[23:07:08] the merge is occurring as root
[23:08:07] the post merge hook is: ssh -t -t stafford.pmtpa.wmnet 'cd /var/lib/git/operations/puppet && git pull && git submodule update --init'
[23:08:30] being called as root
[23:08:43] ok, so that's the problem… why running as root?
[23:09:13] git merge --quiet --ff-only "${fetch_head_sha1}" && \
[23:09:17] git submodule update --quiet --init)
[23:09:27] ^^ that causes the post-merge hook to be called as root
[23:09:58] Oh, that's a different stage of the process, but maybe that's breaking things...
[23:10:02] you're only wrapping the fetch as the git user as far as I can tell
[23:10:48] oh. wait
[23:10:56] su - $git_user -c "cd ${basedir} && ${cmd}"
[23:10:57] bleh
[23:10:59] that should work
[23:11:32] ooohhh
[23:11:32] also
[23:11:32] We should send the bug upstream to icinga. "check_disk performance drops to unacceptable levels when there are more than 20000 mounted filesystems" just to see their reaction. :-)
[23:11:39] I ran that from /root/puppet
[23:11:45] andrewbogott: are you also using /root/puppet?
[23:11:56] its post-merge hook is different
[23:12:01] Coren|Away: that'd be awesome
[23:12:09] ssh root@stafford.pmtpa.wmnet 'cd /var/lib/git/operations/puppet && git pull'
[23:12:21] (PS1) Andrew Bogott: Don't run our post-merge magic every time [operations/puppet] - https://gerrit.wikimedia.org/r/75525
[23:13:03] Ryan_Lane, In theory I'm not having anything to do with /root/puppet -- hoping to delete that soon.
[23:13:18] andrewbogott: puppet-merge runs based on the location
[23:13:32] The above patch seems right, but it doesn't quite explain the order that you're seeing.
[23:13:37] defaults to /root/puppet.
[23:13:46] oohhh
[23:13:48] docs are just wrong
[23:13:55] Yeah, I just changed it
[23:14:13] PROBLEM - Solr on solr1003 is CRITICAL: Average request time is 965.3333 (gt 400)
[23:14:19] Coren|Away: :D
[23:14:26] Coren|Away: "are you kidding me?"
[23:15:20] (PS2) Andrew Bogott: Don't run our post-merge magic every time [operations/puppet] - https://gerrit.wikimedia.org/r/75525
[23:15:43] RECOVERY - Solr on solr1 is OK: All OK
[23:15:50] (CR) Ryan Lane: [C: 2] Use grains for deployment targets [operations/puppet] - https://gerrit.wikimedia.org/r/74108 (owner: Ryan Lane)
[23:15:50] (Merged) Ryan Lane: Use grains for deployment targets [operations/puppet] - https://gerrit.wikimedia.org/r/74108 (owner: Ryan Lane)
[23:17:14] RECOVERY - Solr on solr1003 is OK: All OK
[23:17:28] andrewbogott: I merged a change if you're wanting to test it ;)
[23:17:52] It wasn't what I thought.
That last merge isn't happening as gitpuppet
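
Given the diagnosis above, the merge is the one git command still running as root, so its post-merge hook inherits root. The fix that lands just below as "I wrapped every git command except the one that mattered" is roughly shaped like this, reusing the variable names quoted from puppet-merge in the log; the rest is a sketch, not the exact patch:

    # inside puppet-merge: run the merge (and submodule update) as the git user too,
    # so the post-merge hook's ssh to stafford happens as gitpuppet, not root
    su - "$git_user" -c "cd ${basedir} && \
        git merge --quiet --ff-only '${fetch_head_sha1}' && \
        git submodule update --quiet --init"
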
[23:18:33] PROBLEM - Solr on solr1002 is CRITICAL: Average request time is 1080.3334 (gt 400)
[23:18:43] oh, may've found it
[23:19:26] (PS3) Physikerwelt: Creating initial debianization [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[23:19:34] PROBLEM - Solr on solr1001 is CRITICAL: Average request time is 1452.5 (gt 400)
[23:19:55] (PS3) Andrew Bogott: I wrapped every git command except the one that mattered. [operations/puppet] - https://gerrit.wikimedia.org/r/75525
[23:20:42] (CR) Andrew Bogott: [C: 2] I wrapped every git command except the one that mattered. [operations/puppet] - https://gerrit.wikimedia.org/r/75525 (owner: Andrew Bogott)
[23:20:42] (Merged) Andrew Bogott: I wrapped every git command except the one that mattered. [operations/puppet] - https://gerrit.wikimedia.org/r/75525 (owner: Andrew Bogott)
[23:21:25] OK, now I need another patch to test on
[23:21:40] * Coren|Away notes a minor difference:
[23:21:43] http://ganglia.wikimedia.org/latest/graph.php?r=1hr&z=xlarge&h=labstore3.pmtpa.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Labs+NFS+cluster+pmtpa
[23:23:31] (PS1) Andrew Bogott: Another tiny test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75529
[23:23:34] RECOVERY - Solr on solr1002 is OK: All OK
[23:24:33] RECOVERY - Solr on solr1001 is OK: All OK
[23:24:50] (CR) Andrew Bogott: [C: 2] "Lookin' good, champ!" [operations/puppet] - https://gerrit.wikimedia.org/r/75529 (owner: Andrew Bogott)
[23:24:51] (Merged) Andrew Bogott: Another tiny test patch [operations/puppet] - https://gerrit.wikimedia.org/r/75529 (owner: Andrew Bogott)
[23:25:34] grrrrrrrrrr
[23:27:34] (PS1) Andrew Bogott: Probably not the last tiny test patch. [operations/puppet] - https://gerrit.wikimedia.org/r/75530
[23:29:42] (PS3) Cmcmahon: Enable VisualEditor for all users on test2wiki, experimental also [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140
[23:30:31] (CR) Catrope: [C: 1] Enable VisualEditor for all users on test2wiki, experimental also [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon)
[23:30:46] chrismcmahon: Looks good. I'll pick that one up when I deploy tomorrow morning
[23:31:11] thanks RoanKattouw it is making some UI tests fail right now
[23:31:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:32:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.139 second response time
[23:34:27] !log done messing with sockpuppet, stafford, puppet-merge, etc for now.
[23:34:27] (CR) Andrew Bogott: [C: 2] Probably not the last tiny test patch. [operations/puppet] - https://gerrit.wikimedia.org/r/75530 (owner: Andrew Bogott)
[23:34:28] (Merged) Andrew Bogott: Probably not the last tiny test patch. [operations/puppet] - https://gerrit.wikimedia.org/r/75530 (owner: Andrew Bogott)
[23:34:38] Logged the message, Master
[23:40:34] !log Rebuilding Solr index to catch up new schema
[23:40:45] Logged the message, Master
[23:49:31] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 203 seconds
[23:49:44] MaxSem: ori-l zinc is up and puppeted
[23:49:48] no idea where niklas is, though
[23:50:31] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 22 seconds
[23:50:31] in #-dev:P
[23:50:41] poking him...
[23:50:47] though too late
[23:51:08] cool
[23:51:32] MaxSem: can you help him from here?
if there's anything else I can do, I will do so gladly
[23:51:42] sure
[23:53:31] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 202 seconds
[23:55:31] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 22 seconds