[00:16:20] ori-l, agree :)
[00:16:26] will be a fun project to change all of it :)
[00:16:55] and we really ought to do the generic json config...
[00:16:59] one of these days ...
[00:53:45] (PS2) Legoktm: Add $wmgCirrusSearchEnablePref, which adds a BetaFeature [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98046
[00:55:18] (CR) Legoktm: "Ok, didn't know that. I've added in another variable to control it." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98046 (owner: Legoktm)
[00:56:40] addshore: ansible is so slow!
[00:56:56] addshore: but I'm running it on my computer to a remote host in the US, so... :)
[02:10:13] !log LocalisationUpdate completed (1.23wmf4) at Sat Nov 30 02:10:13 UTC 2013
[02:10:30] Logged the message, Master
[02:18:36] !log LocalisationUpdate completed (1.23wmf5) at Sat Nov 30 02:18:36 UTC 2013
[02:18:51] Logged the message, Master
[02:47:22] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Nov 30 02:47:22 UTC 2013
[02:47:36] Logged the message, Master
[03:27:07] Need the go-ahead for a 10k bigdelete\
[03:36:52] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:36:52] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:02] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:02] PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:02] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:02] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:02] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:02] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:02] PROBLEM - Apache HTTP on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:12] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:32] PROBLEM - Apache HTTP on mw1120 is CRITICAL: Connection timed out
[03:37:32] PROBLEM - Apache HTTP on mw1140 is CRITICAL: Connection timed out
[03:37:32] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:32] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:32] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:32] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:32] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:33] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:33] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:34] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:34] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:35] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:35] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:36] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:36] PROBLEM - Apache HTTP on mw1124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:37] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:42] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:42] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:42] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:42] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:42] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:42] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:42] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:43] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:43] PROBLEM - Apache HTTP on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:44] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:52] PROBLEM - Apache HTTP on mw1117 is CRITICAL: Connection timed out
[03:37:52] PROBLEM - Apache HTTP on mw1134 is CRITICAL: Connection timed out
[03:37:52] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:02] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:02] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:02] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:02] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:02] PROBLEM - Apache HTTP on mw1118 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:02] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:02] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:03] PROBLEM - Apache HTTP on mw1125 is CRITICAL: Connection timed out
[03:38:03] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:04] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:04] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:05] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:12] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:12] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:12] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:12] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:22] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.184 second response time
[03:38:32] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.140 second response time
[03:38:32] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[03:39:02] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:39:52] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.020 second response time
[03:40:02] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.679 second response time
[03:40:22] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.118 second response time
[03:40:32] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.971 second response time
[03:41:02] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.764 second response time
[03:41:32] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:42:22] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.887 second response time
[03:42:32] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.134 second response time
[03:42:32] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.569 second response time
[03:42:42] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.214 second response time
[03:42:42] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.108 second response time
[03:42:52] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.194 second response time
[03:42:52] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.493 second response time
[03:42:53] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.759 second response time
[03:43:02] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.334 second response time
[03:43:02] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.211 second response time
[03:43:02] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.032 second response time
[03:43:22] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.136 second response time
[03:43:22] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.042 second response time
[03:43:22] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.045 second response time
[03:43:23] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.274 second response time
[03:43:23] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.323 second response time
[03:43:32] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.809 second response time
[03:43:32] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.039 second response time
[03:43:32] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.226 second response time
[03:43:32] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.331 second response time
[03:43:32] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.783 second response time
[03:43:32] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.229 second response time
[03:43:42] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.274 second response time
[03:43:42] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.108 second response time
[03:43:42] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.273 second response time
[03:43:52] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.108 second response time
[03:43:52] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.295 second response time
[03:43:52] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.039 second response time
[03:43:52] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.199 second response time
[03:43:52] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.223 second response time
[03:43:52] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.217 second response time
[03:43:53] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.754 second response time
[03:43:53] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.870 second response time
[03:43:54] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.037 second response time
[03:43:54] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.042 second response time
[03:43:55] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.232 second response time
[03:43:55] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.337 second response time
[03:43:56] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.235 second response time
[03:43:56] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.161 second response time
[03:43:57] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.212 second response time
[03:44:02] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.091 second response time
[03:44:02] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.188 second response time
[03:44:02] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.236 second response time
[03:44:22] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.189 second response time
[03:44:22] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.190 second response time
[03:44:22] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2944 bytes in 0.074 second response time
[03:44:24] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.234 second response time
[03:44:25] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.201 second response time
[03:44:32] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.165 second response time
[03:44:32] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.213 second response time
[04:11:21] yuck
[04:45:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[04:55:40] (CR) Legoktm: [C: -1] "Discussion says account blocks will be indef, and anon blocks are leaning towards 3 months. Current patch will make everything be 3 months" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98002 (owner: John F. Lewis)
[05:20:52] (CR) Legoktm: [C: 1] enable Echo on all beta.wmflabs.org-wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95450 (owner: Umherirrender)
[05:21:49] * p858snake|l echo's legoktm out to the shell
[05:43:18] wtf
[05:47:15] paravoid: it was all legoktm's fault >.> <.<
[05:47:25] hmm?
[05:47:26] whaaaaa
[05:54:28] !log rebooting cp1052 with the same kernel (control)
[05:54:43] Logged the message, Master
[05:55:10] PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100%
[05:56:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[05:57:40] RECOVERY - Host cp1052 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[06:27:27] PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[06:29:27] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active
[07:10:37] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[08:00:27] ori-l: https://github.com/vimeo/graph-explorer
[08:21:48] jeremyb: looks cool. want to puppetize?
[12:30:43] (CR) Faidon Liambotis: [C: 2] Setting up varnishkafka on mobile varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/94169 (owner: Ottomata)
[13:27:05] (PS4) John F. Lewis: Enable AbuseFilter block option on Wikidata [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98002
[14:09:46] (CR) Odder: Enable AbuseFilter block option on Wikidata (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98002 (owner: John F. Lewis)
[14:27:39] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:37:26] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4775 bytes in 0.071 second response time
[15:05:31] (CR) Ottomata: Initial deb packaging (13 comments) [operations/debs/python-kafka] (debian) - https://gerrit.wikimedia.org/r/97848 (owner: Ottomata)
[15:06:03] hey
[15:06:05] (PS7) Ottomata: Initial Debian packaging [operations/debs/python-kafka] (debian) - https://gerrit.wikimedia.org/r/97848
[15:06:11] hey!
[15:06:36] saturday morning work time woooo
[15:06:53] hows it goin faidon?!
[15:06:56] okay
[15:07:15] how's the sleep vs. emergency paging these days?
[15:07:33] less emergencies lately
[15:07:46] but it is saturday and I've worked for > 8h now
[15:07:53] tsk tsk tsk
[15:07:54] so... :)
[15:08:06] thanks for reviewing my stuff, much appreciated
[15:08:13] no worries
[15:08:17] will get new patchsets in shortly, but don't worry about getting to them today
[15:09:50] (CR) Faidon Liambotis: [C: 2] "(No idea about the (WMF), maybe Debian tools ignore it?)" [operations/debs/python-kafka] (debian) - https://gerrit.wikimedia.org/r/97848 (owner: Ottomata)
[15:10:04] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.500,%22500%20resp/min%22%29%29,%22red%22%29&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29
[15:10:09] finally
[15:10:32] is that a normal level?
[15:10:44] the blue one
[15:10:44] yes
[15:11:08] the red one, it's also normal but it's basically mediawiki
[15:11:29] i haven't been following the 5xx saga there, was most of that packet loss issues?
[15:11:34] nope
[15:12:05] some code deployment?
[15:14:25] it was packet loss, the strange GET \\ requests, the centralautologin stuff, ULS, the page cache kernel bug, mobile varnish crashing every 10', XFS kernel bug deadlock, varnish bug resulting in XFS fragmentation
[15:14:32] that's the last two weeks
[15:15:00] heh
[15:15:22] well, plus other unrelated outages that cascaded, e.g. some database overloads, parser cache crash, elasticsearch issues, vectorbeta deployment
[15:15:41] (also the last two weeks)
[15:16:00] let's just hope december will be a better month
[15:17:46] yeah
[15:17:47] three varnish bugs, two kernel bugs, 4 network outages by the same vendor in two different sites
[15:17:54] I mean, the odds...
[15:17:56] ee yeah
[15:29:43] paravoid: how about last nginx bugs? :)
[15:29:51] which one?
[15:30:06] http://mailman.nginx.org/pipermail/nginx-announce/2013/000125.html
[15:30:25] (PS3) Ottomata: Initial Debian packaging [operations/debs/logster] (debian) - https://gerrit.wikimedia.org/r/95556
[15:30:42] (CR) Ottomata: Initial Debian packaging (9 comments) [operations/debs/logster] (debian) - https://gerrit.wikimedia.org/r/95556 (owner: Ottomata)
[15:31:29] I don't think that affects us anywhere
[15:31:38] nice
[15:32:14] paravoid, got a +1 from alex on the varnishkafka .deb, once we are good with an actual tag, are you ok if I merge that?
[15:32:42] ottomata: yes, but see magnus' latest commits
[15:32:46] before you tag
[15:32:56] ok cool, yeah i see them, some need merging still
[15:33:03] (CR) Faidon Liambotis: [C: 2] Initial Debian packaging [operations/debs/logster] (debian) - https://gerrit.wikimedia.org/r/95556 (owner: Ottomata)
[15:33:05] we'll get that running for at least a day or something before we tag
[15:33:54] VCL_Log:key is also useful
[15:34:03] I want us to start using it
[15:34:05] oh ok
[15:34:06] cool
[15:34:19] that can/should be done in puppet though, right?
[15:34:23] not as a package default
[15:34:54] btw, what is it? :)
[15:34:59] that's varnishkafka support (r98135) + a VCL change + format string change
[15:35:27] it's so that you can do from VCL std.log("analytics: zero=123")
[15:35:34] without setting it as a response header (X-Analytics)
[15:35:36] and logging that
[15:35:38] ah coooool
[15:35:50] so it just makes a varnishtag without mucking with http headers?
[15:35:52] arbitrary logging from VCL basically
[15:36:12] yes
[15:36:13] (dunno if 'varnishtag' is correct term, but varnish stuff, not http headers)
[15:36:19] cool
[15:36:22] well, whatever you want to log
[15:36:27] right aye
[15:36:27] you can do all kinds of crazy shit in VCL anyway
[15:36:40] but X-Analytics is one we can do
[15:36:52] we set it now just so that we can log it
[15:38:31] hm
[15:38:33] question
[15:38:45] on the varnishkafka change
[15:38:46] format =>
[15:38:51] is space separated, not tab
[15:38:52] is this right?
[15:39:28] in the package, the default I set is tab
[15:39:42] I'm talking about https://gerrit.wikimedia.org/r/#/c/94169/8/manifests/role/cache.pp
[15:39:44] in the puppetization change, it's json, so neither.
[15:39:51] ah, right
[15:40:34] may I suggest a very trivial change?
[15:40:36] sure
[15:40:38] make it
[15:40:42] format => join(
[15:40:46] er, format => join([
[15:40:50] hmmmm, cool
[15:40:51] i like that
[15:40:51] ' 484
[15:40:55] yeah that will be much easier to read
[15:41:11] "%{fake_tag0@hostname?${::fqdn}}", newline, ...
[15:41:15] one per line
[15:41:26] i coouuuuld make the puppet module fancier
[15:41:35] and if it is passed a list, do that in the template
[15:41:40] heh, I guess you could, yes
[15:41:55] I don't mind either way, as long as we make it somewhat more readable :)
[15:41:58] hm, would have to add a separator param if it was not json though
[15:41:58] meh
[15:42:30] so, remind me, if it's json does varnishkafka split the string with spaces?
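The join() suggestion in the exchange above would look roughly like this in the puppet manifest. This is only a sketch: the first token is the one quoted in the log, the remaining tokens and the surrounding resource are hypothetical placeholders.

```puppet
# Sketch of the readability suggestion: build the space-separated format
# string from a one-token-per-line array via stdlib's join().
# All tokens except the first are illustrative placeholders.
format => join([
  "%{fake_tag0@hostname?${::fqdn}}",
  '%l',
  '%n',
  '%t',
], ' '),
```

The same idea works for the JSON format string; only the separator argument would differ.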
[15:42:41] yes, possibly any whitespace, dunno
[15:42:47] okay
[15:43:16] and I think that varnishkafka will output the actual json data in the order that it is listed in the format string
[15:43:37] even though that doesn't actually matter for parsing it, it is convenient to be able to read it the same way every time
[15:43:45] nod
[15:43:56] it's nice to see all of this coming along
[15:43:59] finally
[15:44:02] haha
[15:44:07] uh huhhhhh FINALLY
[15:44:09] haha
[15:44:27] i was hoping to deploy a month ago, and then was going to deploy this monday and will probably deploy in a week, and then probably say that again in a month
[15:44:30] hopefully soon!
[15:44:34] but i'll believe it when I see it :p
[15:44:36] you did see my comment about deploying gradually, right? :P
[15:44:46] sounds familiar…
[15:44:50] like, what, one mobile at a time?
[15:44:52] I sent an email today
[15:44:55] or at least, start with one
[15:44:55] ah
[15:45:00] haven't read all emails yet
[15:45:09] ah reading
[15:45:11] yeah, some form of that
[15:45:21] like half of eqiad, then all of eqiad then esams
[15:45:23] or something like that
[15:45:25] but not all together
[15:45:43] and wait a few days maybe between step 1 & 2
[15:46:25] yeah, good idea
[15:47:10] oo on Kafka mailing list:
[15:47:10] how's the consumer side coming along?
[15:47:11] 'The 0.8 final release is being voted right now. If the vote passes, it
[15:47:11] should be available next week.'
[15:47:15] camus is it?
[15:47:17] i haven't been working on it much
[15:47:18] yeah
[15:47:19] i mean
[15:47:20] i have it working
[15:47:22] it works great
[15:47:23] but
[15:47:27] its a hacky setup right now
[15:47:32] prebuilt .jar + user cron job
[15:48:01] cron job?
[15:48:04] oh, and also automatic hive partitioning
[15:48:06] to consume?
[15:48:08] yeah, camus is just a MR job
[15:48:20] but, it doesn't matter when it is run, really
[15:48:28] it'll launch a bunch of kafka consumer procs on the datanodes
[15:48:32] each consuming from a topic-partition
[15:48:51] and each will write the messages to proper time-bucketed hdfs dirs
[15:48:57] based on content timestamp
[15:49:05] then, dan and I wrote some other code (that has to be run separately)
[15:49:23] to automatically look at the partitions of a hive table, and what dirs exist in hdfs
[15:49:30] and then create whatever hive partitions need to be created
[15:49:39] so, once that is all puppetized
[15:49:54] it'll basically be automated queryable hive tables of webrequest logs
[15:50:02] so how is linkedin doing it?
[15:50:05] we'll probably run it once an hour
[15:50:06] without yours/dan's code?
[15:50:14] oh i doubt they are using hive much, I don't know
[15:50:36] aha
[15:50:46] so, wait, if we run it once an hour
[15:50:53] you mean just leave logs being buffered inside kafka?
[15:51:02] well, we can import as often as we want
[15:51:13] but, the data is partitioned hourly
[15:51:18] it doesn't actually matter
[15:51:34] it just means that if we run more often, the latest partition won't be full once it shows up in hive
[15:51:49] so if you query and do some kind of grouping on hour, it might look a little weird for the latest hour
[15:51:52] buuut, it doesn't really matter
[15:51:55] we can do however often we want
[15:52:14] my camus cron job runs hourly right now, and it actually consumes from kafka pretty fast
[15:52:19] but, that's just the 2 upload host's logs
[15:52:22] so meh?
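The hourly time-bucketing described above (each message written into an HDFS directory derived from its content timestamp, so a run of the import job only ever appends to the newest buckets) amounts to something like this sketch. The base directory, topic name, and path layout here are invented for illustration and are not camus's actual output scheme.

```python
from datetime import datetime, timezone

def hourly_bucket(base_dir, topic, epoch_ts):
    """Map a message's content timestamp to an hourly directory path.

    Mirrors the idea from the discussion: partition by UTC hour so a
    hive partition can later be attached to each directory. The layout
    is illustrative only.
    """
    dt = datetime.fromtimestamp(epoch_ts, tz=timezone.utc)
    return "{}/{}/{:04d}/{:02d}/{:02d}/{:02d}".format(
        base_dir, topic, dt.year, dt.month, dt.day, dt.hour)

# Two requests from the same UTC hour land in the same partition dir:
a = hourly_bucket("/wmf/raw", "webrequest_mobile", 1385769600)        # 2013-11-30 00:00:00 UTC
b = hourly_bucket("/wmf/raw", "webrequest_mobile", 1385769600 + 1800)
assert a == b
```

Because the bucket is a pure function of the message timestamp, it does not matter when the import job runs; only the newest hour's directory is ever incomplete, which is exactly the "latest partition won't be full" effect mentioned above.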
[15:52:29] well, I'm thinking also some more realtime log analysis
[15:52:36] sure, we can import as often as you like :)
[15:52:45] well, almost
[15:52:49] i think right now it doesn't matter
[15:52:52] since we aren't using hadoop so much,
[15:53:01] but, these are MR jobs, so they do take up some hadoop resources
[15:53:27] if we run toooo often, the overhead time of launching the camus job could be greater than the time it actually takes to consume the data from kafka
[15:53:32] heh
[15:53:50] but we could probably do up to every 10 or 5 minutes or something
[15:53:51] also
[15:53:57] if you really want to do real time analysis
[15:54:01] just consume directly from kafka
[15:54:03] you can do that whenever you want
[15:54:06] without affecting hadoop
[15:54:15] I want both usually :)
[15:54:20] yeah
[15:54:29] well, I guess I can consume the whole kafka buffer
[15:54:33] how long is it going to be? 7 days?
[15:54:38] yeah
[15:54:45] unless we change it, but that's the default, and it sounds good to me
[15:54:52] we've def got room for that with mobiles for now
[15:54:55] sure
[15:55:05] wait until desktop comes though :)
[15:55:08] heheh
[15:55:08] yeah
[15:55:24] it's about 10x I think
[15:56:06] are there any alternatives to camus?
[15:56:11] yeah i think that's right
[15:56:28] sure, there's a hadoop consumer in contrib/ that ships with kafka (we aren't including it in our package)
[15:56:31] but that is less useful
[15:56:40] it just consumes from kafka into some directory
[15:56:45] or we could write our own mr job
[15:56:46] its not hard
[15:56:52] camus is just nice because its a pluggable framework
[15:57:01] i didn't have to write much to get it to work for our data
[15:57:02] why is writing into some directory less useful?
[15:57:05] just a couple of short classes
[15:57:16] no time buckets?
[15:57:17] well, we'd like the data to be time bucketed in hdfs
[15:57:18] yeah
[15:57:20] right
[15:57:55] so the time partitions in hive are mostly useful for optimizing your queries
[15:58:04] if you know that you only want to look at the last month
[15:58:09] you can tell hive that in your query in the where clause
[15:58:24] and it knows to only read that data
[15:58:27] rather than passing over all of it
[15:59:10] oh, one other thing we'll have to do before we do regular analysis of these imports
[15:59:14] is run a deduplication job of some kind
[15:59:24] the imports as is could have duplicates in them, since kafka is at-least-once delivery
[15:59:29] hopefully they will be rare
[15:59:44] but, we will need to deduplicate before we use it for regular analysis
[15:59:50] that's a todo on analytics side
[16:00:06] (meaning, I don't really want to write it :p )
[16:00:40] heh
[16:01:25] remind me, are we doing snappy per message?
[16:01:29] or per 10 messages?
[16:01:38] I remember such a discussion, but I forget the details now
[16:01:47] its not per message, um, i think it is per batch size
[16:02:07] that's internally to kafka, right?
[16:02:19] hmmm to librdkafka/varnishkafka yes
[16:02:23] there are two values
[16:02:25] camus gets & stores decoded data, right?
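The deduplication pass discussed above is needed because delivery is at-least-once, so an import can contain redelivered records. A minimal sketch of the idea, assuming each record carries fields that jointly identify a request (the field names and sample values here are hypothetical):

```python
def dedupe(records, key_fields=("hostname", "sequence")):
    """Keep the first occurrence of each record, dropping repeats.

    Assumes records are dicts with fields that jointly identify a
    request (e.g. producing cache host + per-host sequence number);
    the field names are made up for illustration.
    """
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

batch = [
    {"hostname": "cp1046", "sequence": 1, "uri": "/wiki/Main_Page"},
    {"hostname": "cp1046", "sequence": 2, "uri": "/wiki/Kafka"},
    {"hostname": "cp1046", "sequence": 1, "uri": "/wiki/Main_Page"},  # redelivered
]
deduped = dedupe(batch)
```

At scale the same first-occurrence-per-key logic would run as a distributed job over each time bucket rather than in memory, but the invariant is the same.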
[16:02:28] a time and a batch size
[16:02:31] er, decompressed
[16:02:36] whichever it hits first
[16:02:50] hmmm, no, data in hdfs is snappy compressed by default
[16:02:52] so yeah
[16:03:00] there are a lot of burnt cycles compressing and decompressing here
[16:03:06] but we decompress & compress again
[16:03:07] right
[16:03:28] varnishkafka compresses, kafka broker decompresses when sending to camus, camus writes to hadoop, hadoop compresses again
[16:03:29] yeah
[16:05:46] so, i think the batch size defaults are 1000 messages or 1000ms, whichever comes first
[16:06:11] we could probably increase that, i think sending once a second is probably good
[16:06:22] and the uploads are doing something like 2K messages per second, i think
[16:06:56] more messages = better compression ratio
[16:07:03] biiunno
[16:07:19] it's all microoptimizations at this point I think
[16:09:12] This means that there is a single process that loads all data feeds into
[16:09:15] Hadoop, handles data partitioning, and creates and maintains Hive tables to match the most recent schema
[16:09:21] To support this we have built custom Avro plugins for Pig and Hive to allow them to work directly with Avro
[16:09:24] data and complement the built-in Avro support for Java MapReduce. We have contributed these back to the open
[16:09:27] source community. This allows all these common tools to share a single authoritative source of data and allow
[16:09:30] the output of a processing step built with one technology to seamlessly integrate with other tools
[16:10:01] When a new
[16:10:02] topic is created we automatically detect this, load the data, and use the Avro schema for that data to create an
[16:10:04] appropriate Hive
[16:10:35] also, the camus job:
[16:10:36] This job
[16:10:36] is run on a ten minute interval and takes on average two minutes to complete the data load for that interval
[16:10:49] that's from LinkedIn's paper, fwiw
[16:11:16] oh cool, they have automatic hive partitioning? i have not heard that
[16:11:21] that's not in camus though, is it?
[16:11:33] dunno, I'm just reading the paper :)
[16:11:36] http://sites.computer.org/debull/A12june/pipeline.pdf
[16:11:43] oh that paper
[16:11:53] ah camus isn't mentioned there
[16:12:04] it is, just not by name
[16:12:18] PROBLEM - DPKG on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:13:18] RECOVERY - DPKG on cp1046 is OK: All packages OK
[16:15:08] PROBLEM - Varnish traffic logger on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:16:01] (CR) Manybubbles: [C: -1] "It doesn't look like the change made it into this revision." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98046 (owner: Legoktm)
[16:16:08] PROBLEM - SSH on cp1046 is CRITICAL: Server answer:
[16:16:38] PROBLEM - puppet disabled on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:16:41] Last login: Wed Nov 20 22:59:25 2013 from 2620:0:861:2:7a2b:cbff:fe09:11ba
[16:16:44] -bash: xmalloc: ../bash/parse.y:4701: cannot allocate 3 bytes (540672 bytes allocated)
[16:16:47] Connection to cp1046.eqiad.wmnet closed.
[16:16:50] what. the. hell.
[16:16:53] uh oh
[16:17:08] RECOVERY - SSH on cp1046 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[16:17:08] RECOVERY - Varnish traffic logger on cp1046 is OK: PROCS OK: 2 processes with command name varnishncsa
[16:17:46] jesus
[16:17:50] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=cp1046.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1385828228&g=mem_report&z=large&c=Mobile%20caches%20eqiad
[16:18:38] RECOVERY - puppet disabled on cp1046 is OK: OK
[16:19:25] that's just this one?
[16:19:41] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=mem_report&s=by+name&c=Mobile+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [16:19:44] * paravoid cries [16:19:53] http://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=mem_report&s=by+name&c=Mobile+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 too [16:20:46] got in [16:20:48] time for echo f > /proc/sysrq-trigger [16:20:51] ahh, just reboot once a month [16:21:21] that's what josef k sysop would do [16:29:58] (03CR) 10Ottomata: [C: 032 V: 032] Initial Debian packaging [operations/debs/logster] (debian) - 10https://gerrit.wikimedia.org/r/95556 (owner: 10Ottomata) [16:30:52] (03CR) 10Ottomata: [C: 032 V: 032] Adding JsonLogster parser. [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/97830 (owner: 10Ottomata) [16:43:08] paravoid, good luck with your stuffs, thanks for the help. i'm signing off for the day [16:43:10] laters! [16:43:13] I found it :) [16:43:15] bye [17:14:33] PROBLEM - SSH on amslvs3 is CRITICAL: Server answer: [17:16:55] (03PS1) 10Faidon Liambotis: check-raid: fix megaraid_sas detection [operations/puppet] - 10https://gerrit.wikimedia.org/r/98292 [17:17:19] (03CR) 10Faidon Liambotis: [C: 032] check-raid: fix megaraid_sas detection [operations/puppet] - 10https://gerrit.wikimedia.org/r/98292 (owner: 10Faidon Liambotis) [17:18:33] RECOVERY - SSH on amslvs3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:27:33] PROBLEM - DPKG on cp1065 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:43:36] RECOVERY - DPKG on cp1065 is OK: All packages OK [17:47:36] RECOVERY - HTTPS on ssl1 is OK: OK - Certificate will expire on 01/20/2016 12:00. 
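[editor's note] Circling back to the producer batching numbers discussed earlier (defaults of 1000 messages or 1000 ms, whichever comes first, against roughly 2K messages per second on uploads): a minimal Python sketch of which threshold fires first. The defaults are taken from the conversation above, not from actual varnishkafka/librdkafka configuration.

```python
def batch_trigger(rate_msgs_per_sec, max_msgs=1000, max_wait_ms=1000):
    """Return which producer batching threshold fires first at a given rate.

    Defaults mirror the '1000 messages or 1000ms' figures quoted in the
    discussion; they are assumptions for illustration only.
    """
    # Time (ms) needed to accumulate a full batch at the given message rate.
    ms_to_fill = max_msgs / rate_msgs_per_sec * 1000
    if ms_to_fill <= max_wait_ms:
        return ("size", ms_to_fill)   # batch fills before the timer expires
    return ("time", max_wait_ms)      # timer expires first, partial batch

print(batch_trigger(2000))  # at ~2K msg/s the size limit wins: ('size', 500.0)
```

At the quoted 2K msg/s a batch ships every ~500 ms, so raising the message cap would trade a little latency for the larger batches (and better compression ratio) mentioned above.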
[17:50:33] (03PS1) 10Faidon Liambotis: protoproxy: fix pmtpa mobile stanza [operations/puppet] - 10https://gerrit.wikimedia.org/r/98296 [17:51:09] (03CR) 10Faidon Liambotis: [C: 032] protoproxy: fix pmtpa mobile stanza [operations/puppet] - 10https://gerrit.wikimedia.org/r/98296 (owner: 10Faidon Liambotis) [18:01:50] (03PS1) 10Faidon Liambotis: openstack-manager: add nrpe [operations/puppet] - 10https://gerrit.wikimedia.org/r/98297 [18:02:09] (03CR) 10Faidon Liambotis: [C: 032] openstack-manager: add nrpe [operations/puppet] - 10https://gerrit.wikimedia.org/r/98297 (owner: 10Faidon Liambotis) [18:05:29] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.000 second response time on port 11000 [18:29:49] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:59] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.43 ms [18:46:01] (03PS1) 10BBlack: allocate varnish strings from the correct storage pool? [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/98299 [19:01:01] (03CR) 10BBlack: [C: 032 V: 032] allocate varnish strings from the correct storage pool? 
[operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/98299 (owner: 10BBlack) [19:01:50] (03PS1) 10BBlack: bump netmapper patch to 553fa29a for possible memleak fix [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/98302 [19:01:51] (03PS1) 10BBlack: varnish (3.0.3plus~rc1-wm19) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/98303 [19:02:32] (03CR) 10BBlack: [C: 032 V: 032] bump netmapper patch to 553fa29a for possible memleak fix [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/98302 (owner: 10BBlack) [19:02:53] (03CR) 10BBlack: [C: 032 V: 032] varnish (3.0.3plus~rc1-wm19) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/98303 (owner: 10BBlack) [19:41:33] (03PS1) 10Faidon Liambotis: openstack: convert iptables to ferm [operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 [19:42:04] (03PS5) 10John F. Lewis: Enable AbuseFilter block option on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98002 [19:44:55] (03CR) 10Odder: [C: 031] Enable AbuseFilter block option on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98002 (owner: 10John F. Lewis) [19:48:13] paravoid: hi, how can I pull your most recent changes to my patch to work locally? [19:59:20] matanya: git review -d NNNN [19:59:25] !mw amend [19:59:26] mediawiki but we usually call them appserver [19:59:31] whaa. [19:59:38] https://www.mediawiki.org/wiki/Amend [19:59:45] thanks MatmaRex [20:00:23] may i, and can i, kill the mw key and replace it with something useful? 
[20:00:32] i think that is for the mw server prefix [20:00:32] @trusted [20:00:33] I trust: petan|w.*wikimedia/Petrb (2admin), .*@wikimedia/.* (2trusted), .*@mediawiki/.* (2trusted), .*@mediawiki/Catrope (2admin), .*@wikimedia/RobH (2admin), .*@wikimedia/Ryan-lane (2admin), petan!.*@wikimedia/Petrb (2admin), .*@wikimedia/Krinkle (2admin), [20:00:34] !cp [20:00:34] caching proxy (squid or varnish) [20:00:44] hm [20:00:48] mkay then [20:00:55] i wanted !mw is https://www.mediawiki.org/wiki/$wiki_encoded_* [21:36:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 8d 12h 17m 54s [21:38:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 2m 0s [21:40:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 4m 0s [21:42:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 6m 0s [21:44:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 8m 0s [21:44:41] 8 minutes!? 
[21:44:45] * Reedy kicks icinga-wm [21:45:29] maybe it heh, weird base icinga-wm uses [21:46:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 10m 0s [21:48:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 12m 0s [21:50:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 14m 0s [21:52:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 16m 0s [21:54:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 18m 0s [21:56:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 20m 0s [21:58:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 22m 0s [22:00:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 24m 1s [22:00:55] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Sat Nov 30 22:00:51 UTC 2013 [22:03:15] PROBLEM - Puppet freshness on mw1109 is CRITICAL: No successful Puppet run for 0d 0h 2m 14s [22:06:38] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Sat Nov 30 22:06:32 UTC 2013 [22:49:57] (03PS1) 10Yuvipanda: toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 [22:50:56] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 (owner: 10Yuvipanda) [22:51:28] ori-l: legoktm python code that's complete-ish, would be nice if I could get style/idiom based review :) [22:51:33] for https://gerrit.wikimedia.org/r/#/c/98352/ [22:51:38] (only py code there) [23:16:45] sure [23:20:18] (03CR) 10Legoktm: toollabs: Add proxylistener that runs on the dynamicproxy machine (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 (owner: 10Yuvipanda) [23:30:49] legoktm: sweet
[23:31:14] legoktm: hah! thanks : [23:31:32] legoktm: not sure if that's ERROR, since it's actually not going to cause a problem for the server itself. it's not going down [23:31:51] dunno, I don't usually use logging [23:32:00] print 'ahhh its broken' [23:32:28] legoktm: hehe :P [23:32:38] (03PS2) 10Yuvipanda: toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 [23:32:46] legoktm: anyway, i've not written any major python in like 2 years, so this wasn't too bad :D [23:32:56] legoktm: thanks for the review! fixed everything except the logg level [23:32:59] :) [23:33:33] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 (owner: 10Yuvipanda) [23:33:36] what now [23:34:12] legoktm: what, jenkins has a strict 80char limit? [23:34:15] NOOOOOOO! [23:34:19] lol [23:34:29] that's the most fucking stupidest part of PEP8 [23:34:59] * YuviPanda|away decides to fucking ignore that. [23:35:03] stupidest rule EVER
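[editor's note] On the log-level debate above (whether a failed client connection merits ERROR when the server itself keeps running), Python's logging levels map onto the argument roughly like this. The logger and function names are hypothetical, not taken from the proxylistener patch; the jenkins -1 itself was presumably pep8's line-length check (E501).

```python
import logging

# Hypothetical logger name; any module-level logger works the same way.
log = logging.getLogger("proxylistener")
logging.basicConfig(level=logging.INFO)

def handle_client(conn_ok: bool) -> str:
    """Toy handler showing WARNING vs ERROR for a misbehaving client."""
    if conn_ok:
        log.info("client handled fine")
        return "ok"
    # One bad client does not take the service down, so WARNING (degraded
    # but alive) arguably fits better than ERROR here -- which is the point
    # YuviPanda makes above. ERROR would suit failures of the server itself.
    log.warning("dropping misbehaving client; server still up")
    return "dropped"
```

Using `logging` over bare `print` also means the level threshold, formatting, and destination can be changed in one place rather than per call site.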