[00:05:50] hashar: i don't know any background about gallium, but i'm going to take a look at it [00:06:21] binasher: hi:) it has been upgraded using Ubuntu dist-upgrade [00:06:33] since we had some stuff not fully puppetized [00:07:19] good to know [00:08:02] \/win 38 [00:08:11] I thought it could be PHP related but it is hard to know really since everything got upgraded [00:08:23] php5-apc is installed at least [00:08:40] also the machine went to swap earlier (around 8pm UTC) [00:09:10] some PHP process started eating all memory. but that must be a bug in either Jenkins or our php scripts. [00:10:44] and it looks like jenkins eats a lot of disk [00:10:47] according to atop [00:11:42] binasher: actually, killing the job_random inequality and leaving the order by works too [00:12:54] !log installing package upgrades on marmontel (blog) [00:13:02] Logged the message, Master [00:21:40] binasher: sorry, heading to bed. 1:20am there :/ [00:21:59] if you find anything, simply reply to the email :-] [00:22:11] have a good night! [00:27:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:31:09] New patchset: Jérémie Roquet; "(bug 41526) Disable the Contest extension on mwwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31165 [00:35:59] binasher: here is what I have so far https://gerrit.wikimedia.org/r/#/c/31129/3/includes/job/JobQueueDB.php [00:40:30] TimStarling: can you look at https://gerrit.wikimedia.org/r/#/c/31129/3 ? [00:40:59] I'm in a meeting [00:42:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [00:43:00] AaronSchulz: why does claimOldest order by job_random? [00:43:08] vs timestamp [00:43:28] it doesn't seem to actually claim oldest [00:43:31] binasher: job_random is indexed and is timestamp based in that case [00:43:46] well, it's always indexed [00:44:46] binasher: I may do it by job_id later by an the index, which I may do while making some other changes (dropping the old job_cmd index, adding a job_retries column) [00:45:33] *by a new index, erm [00:45:53] ordering by job_id asc will always be oldest to newest [00:46:30] yes, I might do that when I make some other db changes, but right now there is no job_cmd,job_id index [00:46:34] but ok, job_cmd_token index [00:46:51] maybe I'll rename job_cmd_token -> job_cmd_token_rand while at it [00:47:17] I hate it when indexes don't mention everything in it ;) [00:48:15] where is job_random changed into a time based value? [00:48:48] binasher: look at insertFields() [00:49:40] just a (job_cmd, job_token) index might be ok for that case, pk is at the end of every secondary [00:50:10] does that work for sorting? [00:50:27] actually I was looking for a straight answer in the docs for that just earlier [00:50:49] is insertFields() in a different patch?
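For readers following the index discussion above: the claim pattern being debated boils down to "find the first unclaimed row of a given command in index order, then mark it with a token so concurrent runners don't grab the same job". Below is a toy sketch of that idea, not MediaWiki's actual JobQueueDB code, with a heavily simplified schema, and using SQLite via the better-sqlite3 npm package purely to keep the demo self-contained (production runs MySQL/InnoDB).

```js
// Toy job-claim sketch -- illustrative only, not MediaWiki's JobQueueDB.
const Database = require('better-sqlite3');
const db = new Database(':memory:');

db.exec(`
  CREATE TABLE job (
    job_id     INTEGER PRIMARY KEY,
    job_cmd    TEXT NOT NULL,
    job_token  TEXT NOT NULL DEFAULT '',
    job_random INTEGER NOT NULL  -- timestamp-seeded in the scheme discussed above
  );
  CREATE INDEX job_cmd_token ON job (job_cmd, job_token, job_random);
`);

function claimOne(cmd, token) {
  // Find the first unclaimed job of this command type in index order.
  const row = db.prepare(
    `SELECT job_id FROM job
      WHERE job_cmd = ? AND job_token = ''
      ORDER BY job_random ASC LIMIT 1`
  ).get(cmd);
  if (!row) return null;

  // Claim it; the job_token = '' guard makes a concurrent claimer lose the race cleanly.
  const res = db.prepare(
    `UPDATE job SET job_token = ? WHERE job_id = ? AND job_token = ''`
  ).run(token, row.job_id);
  return res.changes === 1 ? row.job_id : null;
}
```

("pk is at the end of every secondary" above refers to InnoDB appending the primary key to every secondary index, which is why a plain (job_cmd, job_token) index could also serve an ORDER BY job_id.)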
[00:51:04] binasher: it's in that file, not the patch [00:51:48] ah, it's already supported, ok [00:52:21] i do think it would word in that case [00:52:45] binasher: I was too cheap to add another index...though I'm ok doing that if the (job_cmd, job_namespace, job_title, job_params(128)); index is nuked [00:53:16] * AaronSchulz imagines that one is a little expensive [00:53:45] which it can be since we use job_sha1 now [00:57:57] i think job_random meaning different things in different cases is a bit counter intuitive [00:57:58] but [00:57:59] eh [00:58:25] binasher: yeah, it will be dealt with soon :) [00:58:27] frankly, i'd rather see all of this as a stepping stone towards replacing JobQueueDB with something else [00:58:43] JobQueueShinyThing [00:59:03] first I need to finish up performance and retry attempts for the db one [00:59:05] so i don't know if it's really worth putting too much effort into polishing this beyond the point of functioning well [00:59:12] then I'll look into something else [00:59:41] binasher: how php do you know? ;) [01:00:18] and getting the indexes and naming perfect isn't critical to make it function well [01:00:27] what's php? [01:15:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:17:06] binasher: meh, I won't bother rearranging the deck chairs...I mean renaming that one index [01:17:29] ;) [01:24:08] binasher: in the meantime, you can compile your ideas for other subclasses, like redis [01:29:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.092 seconds [01:32:50] New patchset: Asher; "adding redis class to mc pmtpa servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31166 [01:33:53] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31166 [01:42:31] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 285 seconds [01:44:39] New patchset: Asher; "remove superfluous variable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31167 [01:45:00] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31167 [01:47:47] New patchset: Asher; "fix typo not caught by puppet parser validate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31168 [01:48:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31168 [01:54:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 15 seconds [01:54:10] New patchset: Asher; "fix default pkg version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31169 [01:54:36] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31169 [01:57:10] New patchset: Asher; "fnord" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31170 [01:57:25] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31170 [02:00:41] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 290 seconds [02:01:02] New patchset: Asher; "by default, name redis server after package name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31171 [02:01:16] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [02:01:16] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [02:01:16] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 
hours [02:01:31] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31171 [02:03:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:18:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [02:28:04] !log LocalisationUpdate completed (1.21wmf3) at Thu Nov 1 02:28:03 UTC 2012 [02:28:13] Logged the message, Master [02:54:45] !log LocalisationUpdate completed (1.21wmf2) at Thu Nov 1 02:54:45 UTC 2012 [02:54:55] Logged the message, Master [02:58:00] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Thu Nov 1 02:57:57 UTC 2012 [03:56:04] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [03:59:15] !log tstarling synchronized php-1.21wmf3/includes/job/JobQueueDB.php 'fix deadlocks' [03:59:21] Logged the message, Master [04:11:37] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [04:47:20] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files: reverted wmf3 deployment made earlier today. [04:47:24] Logged the message, Master [04:54:14] TimStarling: Hm.. just revisiting http://tstarling.com/presentations/Tim%20Lua%202012.pdf and couldn't help notice that table indexes are 1 based, not 0. [04:54:21] looking it up confirms it [04:54:40] how annoying. Is that influenced from somewhere or just a Lua oddity? [04:55:20] I mean, its not bad. just different than pretty much every other language I know. [04:56:13] FORTRAN arrays start at 1 [04:56:48] Aaron|home: that's a large community of sixty year olds that we could tap to work on templates [04:57:09] Success rate of builds on Travis CI: https://twitter.com/konstantinhaase/status/263235120151027712 [04:57:09] interesting [04:57:09] * ori-l is just trolling. Disregard. [04:57:09] and it supports complex numbers! [04:57:10] you just don't know enough languages ;) [04:57:41] I know about FORTRAN's existence and rough place in history / family, but that's about it. [04:58:23] arrays in old dialects of BASIC were 1-based, and you couldn't set a zero element if you wanted to [04:58:33] you just had to offset [04:58:33] TimStarling: not the languages that you think count, i guess :P [04:58:51] at least in lua, you can set elements with keys less than 1 [04:59:01] I also find odd how it creates a new property in a table by referring to the property in the function name of a function declaration [04:59:29] TimStarling: Keys can be strings as well, right? [04:59:31] it's just that array constructors start from 1 if you don't specify a key, and some library functions return arrays indexed from 1 [04:59:35] anyway, that's the sort of superficial property that people outside the language huff and puff about and that doesn't end up mattering at all, like semantic whitespace in python. [04:59:48] yes, or tables or functions [04:59:57] any value can be a key [05:00:07] Hm.. even functions and other tables? [05:00:11] Interesting.. [05:00:12] two days after you start working with python you forget that "semantic whitespace" is even a thing; the only time it comes up is when you're talking with skeptics who don't know the language [05:00:16] So they go by reference then I suppose? [05:00:20] Or is it serialized? [05:00:21] ditto for go's object system more recently [05:00:26] or lack thereof [05:01:01] people approach a language by looking for the things that they know and if their expectations are violated they get upset. 
[05:01:03] I mean in javascript you can do obj[function foo() { return 123; }] = 'Hi', but it will toString() the function body and use that as a key, so a similar (but not the same by reference) function will work for the same key. [05:01:24] better sort out this operations issue first [05:01:36] sure thing [05:02:23] ori-l: meh, as an observing javascript developer I know better than that. The most common problem and most commonly with this one language, javascript, is that people don't learn it before they write it. [05:03:31] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [05:03:34] i don't think it's such a severe problem any more. it's fun to rage about people who don't know what they're doing, but the mean quality of JS code "in the wild" has improved tremendously over the past five years, and no one mistakes the language for a toy anymore. no one who isn't complete idiot, anyway. [05:03:34] Douglas Crockford even uses that as his general theme throughout his presentations. How people try to beat it into something its not, thus limiting themselves to the overlap with other languages, and ignoring some of the most powerful features. [05:03:57] ori-l: I disagree. It depends a lot on where you look. [05:04:18] Krinkle: you have to look hard nowadays to find things like var x = Array(); [05:05:07] I often help out on StackOverflow. Just a few days ago I got a student asking a question, who claimed to be in a class room. The examples the professor was giving were awful, to be ashamed of. [05:05:19] the page was deleted so I can't share it anymore [05:05:34] good javascript books written a few years ago, like stoyanov's JS patterns, still devoted half their bulk to ill-conceived ways of implementing classical inheritance in JS. but that stuff jsut disappeared from the recent literature. [05:05:38] The problem is transforming [05:06:21] it isn't ignorant people who don't want to be writing javascript, its the new generation who want to do it, but are given shit from the old generation (some of them, that is. There is many great devs, too, of course) [05:07:41] e.g. leaving off 'var', everything global. incorrectly using new, or writing x = Array(1, 2, 3, 4) indeed. [05:07:43] There is so much. [05:08:26] yeah. the traps and edge cases are js's big problem [05:08:34] most of it is due to the flaws in the language, but that only proves the writer didn't "read the manual". [05:09:06] if you know the language, you wouldn't fall in the traps as you wouldn't try to write such code to begin with. [05:09:47] there's still edge cases that everybody traps into from time to time, but fortunately we have code quality checks for that now (like jshint) [05:09:58] and ES5 strict mode. [05:10:45] yes, but -- i'm a pretty experienced JS developer, and if you had to sit behind me and watch me code and were not allowed to say anything i bet you would go barking mad in ten minutes. [05:11:03] today? [05:11:37] yeah, there are just sooooo many tiny things one has to know to avoid [05:11:44] Maybe, it depends on whether I'm in the mode of what's good enough and what I would want to do instead. [05:12:26] although 'good enough' is a tricky one. I can't define it myself. 
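Krinkle's obj[function ...] example above (which the conversation comes back to just below) is easy to verify; a small runnable sketch of the property-key coercion being described, in plain JavaScript:

```js
var obj = {};
obj[function foo() { return 123; }] = 'Hi';

// The function is never invoked -- it is passed through toString() and the resulting
// source text becomes the key (the exact string is engine-dependent).
console.log(Object.keys(obj)); // e.g. [ 'function foo() { return 123; }' ]

// Arrays are just objects with digit-string keys, so x[0] and x['0'] hit the same slot.
var x = ['x'];
console.log(x[0] === x['0']); // true

// Any non-string key goes through the same coercion; plain objects become '[object Object]',
// so two different objects used as keys would collide.
var k = {};
obj[k] = 'collides';
console.log(obj['[object Object]']); // 'collides'
```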
[05:12:32] hello gerrit-wm & gerrit-wm_ [05:13:08] logmsgbot_, logmsgbot [05:13:18] example: obj[function foo() { return 123; }] = 'Hi' [05:13:21] what the flying fuck [05:13:34] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [05:13:35] Objects are hashes, indexed by strings [05:13:42] this just made my head explode. i would have sworn you'd end up with obj[123] [05:13:48] of course not [05:13:50] er nevermind i guess you aren't invoking it [05:13:51] there is no invocation [05:13:56] yeah [05:13:58] i just misread that, nevermind [05:14:10] any non-string is passed through [[toString]] [05:14:17] So they go by reference then I suppose? [05:14:17] Or is it serialized? [05:14:24] by reference [05:14:38] TimStarling: that's pretty cool [05:14:48] and it has both strong and weak references [05:15:13] say if you have some objects managed by some other module, and you're writing an extension to that module [05:15:28] and you want some data associated with the foreign object [05:15:54] TimStarling: often in JS there is the issue of "private" keys. For example in an implementation of the Purse model or Safe system. You'd get an object (could just be an empty object) which is then the unique key for whatever data. [05:16:02] so you make a table with weak keys, indexed by object reference, and store your data in it [05:16:30] but since in JS objects can't be keys, one has to work around it with an array. Then look up the object in the array and use the index as the ID internally. [05:16:38] then when the foreign module deletes all references to the object you're interested in, the garbage collector magically deletes it from the weakly keyed array [05:16:47] very nice [05:17:18] ori-l: Since arrays are just objects in javascript, even "array" indexes are string keys. var x = ['x']; return x['0']; [05:17:34] Although I know from V8 that it optimises for this (in that x[0] is faster than x['0 [05:17:44] yes, i know. i read an extra () into that earlier [05:18:04] whereas it is usually the opposite since it has to convert 0 into a string to do the lookup. [05:19:16] TimStarling: aha, that's even more awesome. in the JS implementation that wouldn't work since the object reference would be stored in the array as well, so it'd never garbage collect on its own. [05:19:20] if you write "var x = [ 14, 23, 1232 ];" v8 will store that as an array of ints [05:19:34] if you then do x.push('foo'); it has to box the array and it's expensive [05:20:20] I'm not sure if it optimises for the kind of values, but it does optimise for the kind of keys (e.g. for simple arrays [0] will be faster than ['0'], until it gets a string property) [05:20:22] it works very well most of the time without you having to think about it, but it's useful to know to avoid violently breaking the engine's expectations about the types flowing through [05:20:49] ori-l: you know this? (I don't) [05:20:58] Or hypothetical. [05:20:58] yes [05:20:58] nice [05:21:48] for example, if you have a function, function sum(a, b) { return a + b; } [05:21:55] and you invoke it often, and always with ints [05:22:03] * robla reads backlog and notes that Pascal also had 1-based arrays [05:22:10] *has even [05:22:19] robla: okay, people in their 60s *and* 50s [05:22:24] * ori-l ducks [05:22:30] 40s even :-P [05:22:43] * ori-l was taught pascal in elementary school [05:22:46] shh [05:23:02] child abuse! 
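Tim's weak-keyed-table pattern a few messages up now has a direct JavaScript analogue. A minimal sketch using WeakMap (standardized in ES2015, well after this conversation, so it genuinely wasn't an option at the time):

```js
// Associate private data with objects owned by another module, without keeping them alive.
const extensionData = new WeakMap();

function tag(foreignObj, data) {
  extensionData.set(foreignObj, data); // keyed by object reference, held weakly
}

function getTag(foreignObj) {
  return extensionData.get(foreignObj); // undefined if never tagged
}

// Usage: some other module owns `widget`; we attach our own bookkeeping to it.
let widget = { id: 42 };
tag(widget, { lastSeen: Date.now() });
console.log(getTag(widget)); // { lastSeen: ... }

// Once the owning module drops its last reference, the garbage collector is free to
// reclaim both the object and its WeakMap entry -- the behaviour Tim describes for
// Lua tables with weak keys.
widget = null;
```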
[05:23:07] you whippersnappers and your 0-based arrays [05:23:15] Krinkle: anyways v8 will compile that into code that works on ints [05:23:25] 1-based makes more sense, its just that the world has turned somehow. [05:23:35] humanly speaking anyway. [05:23:41] it's all C's fault [05:23:49] if after 1000 calls the 1001th call is sum(4, "hello"), the engine has to slam on the breaks [05:24:19] arrays are just pointers....bah! [05:24:21] and basically reinterpret your code [05:24:52] anyway, as Trevor so eloquently puts it, It is important not to optimise for the optimiser! [05:24:55] robla: arrays are just objects that have special syntax for getting and setting keys that are strings of digits! [05:24:57] it's so simple! [05:25:16] (especially given that there are more than 1) [05:25:31] ... in js. [05:26:21] ori-l: + some Array.prototype methods for convenience, which naturally inherit from Object.prototype of course. [05:26:40] It becomes especially tricky when prototype objects themselves are objects that inherit from the Object.prototype [05:27:24] so even when there is a 20-long inheritance chain (e.g. HTMLDivElement > HTMLElement > Element > Node ... > Object) every prototype object inbetween is also an object [05:28:12] ori-l: btw, did you know that the worst thing in javascript is also what we use inevitably in the browser? [05:28:18] with statement. [05:28:32] Every inline and external script is evaluated in with (window) { .. } [05:29:20] which is why 'document' is a "global" variable and why "x = 5" is an implied global (because that's how a with statement works). var a = {}; with (a) { foo = true; } . creates a.foo [05:30:06] which is yet another reason why browsers are evil, but javascript itself isn't so bad. js doesn't really have implied globals I believe. [05:30:21] rant! 
brb later [05:42:55] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [05:42:55] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [05:42:55] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:22:58] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:37:19] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.032 second response time on port 8123 [06:47:22] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:48:52] RECOVERY - Lucene on search1016 is OK: TCP OK - 9.027 second response time on port 8123 [06:55:55] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [06:59:31] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [07:05:50] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.026 second response time on port 8123 [07:24:34] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [07:28:55] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:30:20] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.031 second response time on port 8123 [07:35:58] RECOVERY - Lucene on search1016 is OK: TCP OK - 9.029 second response time on port 8123 [07:38:49] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:40:12] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.031 second response time on port 8123 [07:40:20] !log Killed and restarted lucene on search1016 [07:40:26] Logged the message, Master [07:41:45] New patchset: Mark Bergsma; "Make the backend weights equal to upstream" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31182 [07:42:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31182 [07:44:40] sigh [07:55:36] New patchset: Mark Bergsma; "Fix double spaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31183 [07:57:06] New patchset: Mark Bergsma; "Fix double spaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31183 [07:57:51] New patchset: Mark Bergsma; "Fix double spaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31183 [07:58:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31183 [07:59:13] PROBLEM - Puppet freshness on db51 is CRITICAL: Puppet has not run in the last 10 hours [08:00:16] d'oh [08:00:21] i need more coffee again [08:01:12] New patchset: Mark Bergsma; "Revert "Fix double spaces"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31184 [08:01:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31184 [08:07:48] New patchset: Mark Bergsma; "Fix double spaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31185 [08:08:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31185 [08:08:59] :) [08:10:10] really more coffee [08:10:31] New patchset: Mark Bergsma; "Fix method name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31186 [08:10:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31186 [08:16:33] hehe [08:16:43] cp3003 sees all eqiad backends as sick [08:16:56] the moment I turn on prefer_ipv6 [08:16:56] it's all happy [08:22:01] New review: Hydriz; "Hmm, this is 
weird, not sure why it doesn't work." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30588 [08:24:30] New patchset: Mark Bergsma; "Allow extra runtime parameters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31188 [08:24:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31188 [08:27:56] New patchset: Mark Bergsma; "Puppet's lack of string concatenation sucks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31189 [08:28:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31189 [08:35:12] New patchset: Mark Bergsma; "Could not use $extraopts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31190 [08:35:39] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31190 [08:40:20] !log running some nonpriority jobs manually on mw12 (so people later don't get weirded out bu the ganglia graphs) [08:40:25] Logged the message, Master [08:40:38] New patchset: Mark Bergsma; "Prefer IPv6 when contacting eqiad backends in esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31191 [08:41:22] New patchset: Mark Bergsma; "Prefer IPv6 when contacting eqiad backends in esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31191 [08:41:42] hehe [08:41:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31191 [08:42:24] can I restart more upload caches in eqiad? [08:45:01] yes [08:46:14] !log Restarted backend varnish instance on cp1025 [08:46:18] Logged the message, Master [08:48:19] swift req/s doubled [08:50:29] New patchset: Mark Bergsma; "Add Ganglia cluster Upload caches esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31193 [08:55:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31193 [09:22:00] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [09:24:51] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [09:30:10] New patchset: Mark Bergsma; "Fix ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31197 [09:30:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31197 [09:45:48] hello [10:23:08] New patchset: Mark Bergsma; "Fix tcptweaks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31199 [10:24:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31199 [10:25:30] Anyone available for deploying? [10:26:47] no, let me change that: Is anyone available to deploy code for me? [10:34:23] New patchset: Mark Bergsma; "Move generic::tcptweaks to base, where it belongs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31200 [10:35:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31200 [10:37:41] New patchset: Mark Bergsma; "Fix the dependencies of base::tcptweaks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31201 [10:37:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31201 [10:46:58] hello OrenBochman [10:55:06] !log Restarted backend varnish instance on cp1026 [10:55:12] Logged the message, Master [11:02:23] New patchset: Nikerabbit; "Space attack, reduce. 
See I3aa4e3a3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31205 [11:29:54] New patchset: Mark Bergsma; "Significantly lower the streaming threshold on backend instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31213 [11:30:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31213 [11:57:39] New patchset: Mark Bergsma; "Lower stream threshold for esams frontends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31217 [11:58:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31217 [12:01:35] hm? [12:01:54] New patchset: Mark Bergsma; "Pass $cluster_tier to frontends as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31219 [12:02:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31219 [12:02:22] aha [12:02:24] got it :) [12:02:35] it takes longer to download large files to esams [12:02:39] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [12:02:39] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [12:02:39] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:02:40] and there are more http hops [12:02:46] yeah yeah [12:02:47] got it :) [12:02:49] if they all wait for the entire object to be in it adds up [12:03:00] thought of that before you said it [12:03:57] waiting 4s for a 5 mb file sucks :) [12:06:04] hey we don't have initcwnd 10 for ipv6 [12:06:04] hah! [12:06:04] I guess we don't [12:06:04] kernel supports it [12:06:08] but it's a little annoying to manage with puppet [12:06:26] i just fixed up the ipv4 one too [12:06:32] after I found some more varnish boxes without it applied [12:06:47] didn't you reboot them? [12:06:50] i'm pretty sure I did [12:07:15] but also this change applies only on the 2nd puppet run [12:07:16] which we should fix [12:07:21] because the first puppet run deploys the facts [12:07:29] and then the 2nd one the initcwnd change [12:07:35] the facts should be deployed by the fileserver facts module [12:07:37] or a puppet module [12:07:43] not through puppet, that's a deadlock [12:07:52] I put that in the rt ticket for leslie to fix, but she hasn't yet [12:07:58] what do you mean? [12:08:12] when are facts run? [12:08:13] before a puppet run [12:08:16] yes [12:08:18] if a fact is broken... [12:08:20] puppet won't run [12:08:29] so once she had a broken fact script [12:08:33] and it blocked all future puppet runs [12:08:41] for that reason [12:08:49] i manually fixed some boxes which were in that state [12:08:57] yeah, I've had a similar problem in the past [12:09:15] so if puppet would simply download them from the puppetmaster from the factsync module [12:09:16] the fact wasn't broken, it just didn't detect virtual correctly [12:09:19] instead of as a file resource [12:09:24] then we wouldn't have that problem [12:09:28] so it removed ntp and stuff, until the next run [12:09:52] are we deploying them as files? [12:09:53] oh dear :) [12:09:53] yes [12:10:06] so I asked leslie to fix that, but I guess she didn't understand it [12:10:06] yeah, we should deploy them as part of a respective module [12:10:10] yup [12:10:20] not a factsync module though [12:10:26] just place the facts where they belong [12:10:29] in fact... 
[12:10:34] yeah that's what I meant [12:10:38] i confused it with the old pluginsync [12:10:39] I started coding yesterday an apt module for an entirely differnt reason :-) [12:11:25] (I want to add the Ubuntu Cloud Archive and don't want to hack it up like some other repos in our puppet) [12:11:39] right [12:11:53] Change abandoned: Cmjohnson; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31070 [12:12:22] sigh, so much work to do [12:13:14] why won't firefox respect my /etc/hosts anymore [12:13:16] it's annoying [12:25:53] New review: Hydriz; "Look good, but..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/30319 [12:29:33] !log Pooled cp3003 as upload Varnish cluster in PyBal with weight 1 [12:29:39] Logged the message, Master [12:31:52] are there more detailed logs for varnish errors? [12:31:56] i.e. http://upload.wikimedia.org/wikipedia/test2/thumb/7/78/Floating_in_the_dead_sea.webm/800px--Floating_in_the_dead_sea.webm.jpg [12:32:08] sometimes gives me a varnish error and sometimes an error from the imagescaler [12:35:17] !log Depooled cp3003, high rate of 500 responses [12:35:22] no there aren't [12:35:22] Logged the message, Master [12:37:25] hmm [12:37:31] lots of requests for one thumb which gives a 500 response [12:37:45] Wappen_Reinerzau.png [12:38:41] all 180px [12:38:42] while the original is 140 [12:44:12] sec [12:44:53] hi folks, mutante are you here? [12:45:12] there is some problem with instance creation on labs [12:50:11] back [12:53:28] petan: IIRC it's known (plus for mutante it's 5AM) [12:53:51] andre__ where [12:53:56] is it known [12:54:11] afaik I am subscribed to all wmflabs bugs [12:55:00] I'm not sure if it is properly logged [12:56:04] mark: could you give the exact URL? [12:56:19] upload.wikimedia.org /wikipedia/commons/thumb/a/ac/Wappen_Reinerzau.png/180px-Wappen_Reinerzau.png [12:56:19] or I'll find it from the logs I guess [12:56:21] ah cool [12:56:24] looking [12:56:24] i think it's just the image scaler saying "I won't scale to larger than original" [12:56:33] so not necessarily a problem [12:56:37] although a 404 would be a more appropriate response [12:57:01] yes that's what it is [12:57:35] 500 is what mediawiki sends [12:57:44] swift just proxies that [12:59:27] yes [12:59:29] varnish too [12:59:59] New patchset: Mark Bergsma; "Cache Retry-After 5xx responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31223 [13:00:00] until now :) [13:00:08] care to review? [13:01:41] looking [13:02:23] who sets Retry-After? [13:02:45] varnish backends? [13:02:52] I don't know [13:02:59] mediawiki? [13:03:32] no [13:03:56] varnish it seems [13:04:25] and it also replaces the body of the 500 [13:04:25] hmm wait [13:04:25] it's caching these already [13:04:25] i'm seeing HITs [13:04:32] try hitting http://ms-fe.pmtpa.wmnet/wikipedia/commons/thumb/a/ac/Wappen_Reinerzau.png/180px-Wappen_Reinerzau.png [13:04:33] lookup hits, not cache hits [13:04:46] that doesn't have a Retry-After [13:04:55] and it says "Error creating thumbnail: Image was not scaled, is the requested width bigger than the source? [13:05:09] indeed [13:05:11] but if you hit via upload.wm.org you get guru meditation [13:05:12] varnish should not replace that [13:05:15] ah I see why [13:05:15] and retry-after [13:05:19] i'll fix that [13:05:29] vcl_error? 
[13:05:42] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31223 [13:05:52] no [13:05:52] retry5xx [13:06:16] hmm [13:06:24] i can disable that [13:06:31] and replace it with more upload specific logic if needed [13:06:36] so yeah, we should tell Aaron to make it return 404 or something [13:06:38] that too [13:08:26] New patchset: Mark Bergsma; "Don't retry 5xx" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31225 [13:08:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31225 [13:10:05] tomorrow i'm going to the ceph workshop [13:10:13] cool! [13:10:20] not that I know anything about ceph yet [13:10:26] I should read up really ;) [13:10:48] ok i'm going to repool cp3003 [13:10:55] this wasn't its fault [13:12:12] !log Repooled cp3003 [13:12:20] Logged the message, Master [13:20:55] haha [13:21:00] I was looking at cp3003's backend instance [13:21:09] and I noticed about 100 requests/s, but 0 hits [13:21:15] and I thought "something's wrong" [13:21:42] took me a while to realize that there is only one frontend in front of cp3003, cp3003's frontend instance itself [13:21:54] and it has a 1 GB memory backend [13:21:57] it wasn't full yet ;) [13:22:22] took about 7 mins to fillup, and now the backend is having some cache hits [13:23:00] hehe [13:39:59] is there some way to get the current configuration for a request like http://upload.wikimedia.org/wikipedia/test2/thumb/7/78/Floating_in_the_dead_sea.webm/800px--Floating_in_the_dead_sea.webm.jpg from the imagescaler? $wgMaxShellMemory was upped to 400mb yesterday but the error still shows up, it does not show up if I upload the same file to a mediawiki instance with 400mb MaxShellMemory on 64bit, am I missing some other configuration that could [13:40:59] the error looks like its memory constraint, or is it possible that it starts to extract before the file is fully downloaded from swift? [13:41:39] j^: we ops really don't know much about mediawiki I'm afraid [13:42:03] for this specific problem Aaron from the platform team would know [13:42:23] he's being doing all the swift coding on mediawiki's side [13:44:31] ok, was just hoping there is an easy answer from the ops side since it works in labs, my vm etc, so its an issue with the production setup [13:44:45] will discuss further with Aaron [13:45:48] so, wait [13:45:50] what's the problem again? [13:46:25] I see some avconv errors [13:46:56] New patchset: Hashar; "link MediaWiki core nightly builds on CI portal" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31234 [13:46:56] New patchset: Hashar; "sort nightly MediaWiki builds by descending date" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31235 [13:47:03] paravoid: the problem is that image extraction with avconv on the imagescalers gives those errors [13:47:44] other installations with precice, 64bit, avconv dont show this behaviour [13:48:30] i assume they have enough space in /tmp [13:48:58] /dev/sda3 19G 251M 18G 2% /tmp [13:49:13] I surely hope we don't have 20GB videos :P [13:49:53] we might have eventually [13:50:00] so far not [13:50:20] then we should fix imagescalers to not fetch the whole video to create a thumb [13:50:43] in the longer term i think image extraction from videos should directly read from http swift urls [13:50:52] what do you mean? 
[13:51:01] oh you mean without writing it on /tmp first [13:51:16] to avoid the local copy, just open http to some range requests and save the jpg to tmp [13:51:20] do you generate one thumb per video or do you have multiple thumbs over the course of the video? [13:52:00] upped the weight on cp3003 [13:52:28] there is one thumb per video but also other resolutions. could extract one at upload time and derive the others from that instead of extracting it from the video [13:52:57] can't you create it from only a partial segment of the video? [13:53:11] i.e. range request the first 10MB or so and get the thumb? [13:53:12] not really [13:53:21] as long as you have a keyframe? [13:53:35] seeking does not work and you need the inital headers to setup the video codec [13:54:02] so make a sparse file, get first range, end range, and the range you want for thumb? [13:54:08] just getting the first part might be an option if we just seek some seconds in, right now its 50% [13:54:10] root@srv220:~# ps aux |grep -c avconv [13:54:10] 34 [13:54:41] can you send me the full output of the commands the run? [13:55:17] like the avconv arguments, there state etc [13:55:31] it's less actually, that counted the ulimit.sh and shell calls [13:55:43] /usr/bin/avconv -ss 21 -y -i /tmp/localcopy_3f3bb69f5e4b-1.webm -ss 2 -s 640x480 -f mjpeg -an -vframes 1 /tmp/transform_24955c6362e8-1.jpg [13:55:48] would want to see all theat, including ulimit [13:56:14] -rw-r--r-- 1 apache apache 3768145 Nov 1 12:20 /tmp/localcopy_15112730559e-1.webm [13:56:17] -rw-r--r-- 1 apache apache 3768145 Nov 1 12:21 /tmp/localcopy_38a26cce6ef5-1.webm [13:56:20] -rw-r--r-- 1 apache apache 3768145 Nov 1 12:31 /tmp/localcopy_3f3bb69f5e4b-1.webm [13:56:25] etc [13:56:26] 11 of these in srv220, 12 in srv222 [13:56:31] same file size, so presumably the same video [13:56:40] let me get you the command-line [13:56:54] apache 31498 0.0 0.0 4400 612 ? S 12:31 0:00 sh -c /bin/bash '/usr/local/apache/common-local/php-1.21wmf3/bin/ulimit4.sh' 50 400000 102400 ''\''/usr/bin/avconv'\'' -ss 21 -y -i '\''/tmp/localcopy_3f3bb69f5e4b-1.webm'\'' -ss 2 -s 640x480 -f mjpeg -an -vframes 1 '\''/tmp/transform_24955c6362e8-1.jpg'\'' 2>&1' [13:56:58] apache 31499 0.0 0.0 20752 1592 ? S 12:31 0:00 /bin/bash /usr/local/apache/common-local/php-1.21wmf3/bin/ulimit4.sh 50 400000 102400 '/usr/bin/avconv' -ss 21 -y -i '/tmp/localcopy_3f3bb69f5e4b-1.webm' -ss 2 -s 640x480 -f mjpeg -an -vframes 1 '/tmp/transform_24955c6362e8-1.jpg' 2>&1 [13:57:03] apache 31500 0.0 0.1 385788 10088 ? Sl 12:31 0:00 /usr/bin/avconv -ss 21 -y -i /tmp/localcopy_3f3bb69f5e4b-1.webm -ss 2 -s 640x480 -f mjpeg -an -vframes 1 /tmp/transform_24955c6362e8-1.jpg [13:57:24] all of them are the same, in two variations [13:57:37] one for 640x480 and one for 120x90 [13:58:34] starting from 12:13 UTC and ending at 12:41 UTC [13:58:38] these are same in two boxes, I guess the same on all of them [13:58:49] dude, you're killing the cluster :-) [13:58:54] so ulimit limits to 50 seconds why are they still running? [13:59:34] S and Sl also look more like they hang somehow [14:01:00] do the imagescalers have anything special in /etc/security/limits.conf that could explain why they hang? 
[14:01:57] no [14:02:01] limits.conf is empty [14:03:53] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [14:04:36] can you run md5sum on one of the webm files in /tmp [14:04:57] or some other idea how to map it back to the file thats causing this [14:05:11] I straced an avconv and it's not very helpful [14:05:20] just futex calls, nothing else [14:05:33] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.003 second response time on port 11000 [14:05:48] md5sum is 9b5969edf2bd1c25944b9076387ebbd7 [14:06:45] !log Restarted backend varnish instance on cp1027 [14:06:50] Logged the message, Master [14:08:24] ok found the file, http://test2.wikipedia.org/wiki/File:Romney_on_FEMA_Government_Spending.webm [14:08:31] how? :) [14:11:00] New patchset: Mark Bergsma; "No point in having Squid backends on Varnish frontends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31239 [14:11:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31239 [14:12:26] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [14:12:40] paravoid: http://test2.wikipedia.org/wiki/Special:FileList looking for webm files and md5sum on my local copy [14:12:50] mark: you missed the TODO :) [14:12:57] ? [14:13:09] no that still stands [14:13:18] j^: oh heh [14:13:31] j^: so, any ideas? [14:13:33] that should get replaced by the list I just removed [14:13:38] but that has squids now, not varnish servers ;) [14:13:44] you're our avconv expert I guess :) [14:13:52] paravoid: so testing it here locally if extracts fine without any errors [14:13:59] <^demon> Aw, Ryan's not around. [14:14:01] did you try with the ulimit? [14:14:02] in 300ms or so [14:14:58] paravoid: yes with same ulimit settings [14:15:03] heh, I tried running /usr/bin/avconv and those arguments and it did finish too [14:17:46] okay [14:17:50] so I reproduced it [14:18:02] I ran the ulimit [14:18:06] then re-ran avconv [14:18:21] and now I get [14:18:21] Error while decoding stream #0:0 Last message repeated 5 times [14:18:21] [vp8 @ 0xcbbb60] Discarding interframe without a prior keyframe! [14:18:21] [vp8 @ 0xcce700] Discarding interframe without a prior keyframe! [14:18:22] etc. [14:19:55] in your shell you ran ulimit? [14:20:01] yes [14:20:23] so [14:20:28] root 32098 0.0 0.1 385788 10096 pts/0 Sl+ 14:18 0:00 /usr/bin/avconv -ss 21 -y -i /tmp/localcopy_4d3acc727d00-1.webm -ss 2 -s 640x480 -f mjpeg -an -vframes 1 foo.jpg [14:20:40] see the 385788? [14:20:46] that's hitting the 400000 memory limit [14:20:58] and avconv probably doesn't handle memory exhaustion well [14:21:44] yes it fails randomly [14:21:52] ulimit -t 50 -v 400000 -f 102400 [14:21:59] just -v 400000 is enough [14:22:09] now i have no idea why does not fail here [14:22:21] what os do you run? [14:22:21] is it 64-bit? [14:22:25] is it the same version of avconv? [14:23:37] ubuntu 12.04/64bit avconv from ubuntu repositories [14:23:50] nice [14:24:34] avconv version 0.8.3-4:0.8.3-0ubuntu0.12.04.1, Copyright (c) 2000-2012 the Libav developers built on Jun 12 2012 16:52:09 with gcc 4.6.3 [14:24:59] avconv version 0.8.3-4:0.8.3-0ubuntu0.12.04.1 same here [14:25:13] anything in /usr/local on the imagescalers? 
[14:25:27] some libs in /usr/local/lib [14:25:55] no [14:27:08] * j^ makes sure all packages in imagescaler::packages are installed in vm [14:29:28] http://pastebin.mozilla.org/1898156 [14:29:58] Linux srv222 3.2.0-32-generic #51-Ubuntu SMP Wed Sep 26 21:33:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux [14:30:02] (just in case) [14:32:04] j^: ^^^ [14:34:41] i get the errors here only once i limit it to ulimit -v 320000 [14:34:42] 3.2.0-32-generic #51-Ubuntu SMP Wed Sep 26 21:33:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux [14:35:13] saw the pastebin? [14:35:27] here the errors stop at 800.000 [14:35:38] 750.000 is not enough, I stopped bisecting there [14:35:47] paravoid: yes just went through all of them and could not spot any difference [14:35:53] ah how many cores? [14:36:07] ha! good question [14:36:07] 8 [14:36:21] here we go [14:36:35] can you try adding -threads 1 just after avconv [14:37:04] yep! [14:37:06] worked [14:37:18] 2 worked too [14:37:23] and 3, but not 4 [14:37:35] with 400.000 that is [14:38:22] wow, cores of cause, will limit it to 1 thread, might even be able to go down to 300mb again with that [14:38:29] can you try that [14:38:51] 300.000 works with -threads 2 but not 3 [14:38:57] (and 1 obviously) [14:40:52] great [14:41:10] congrats :) [14:41:35] so libav folks forked off ffmpeg-mt? [14:41:35] not ffmpeg? [14:41:39] or did they reimplement multithreading? [14:42:08] everyone merged ffmpeg-mt [14:42:19] ohrly [14:42:21] I feel old [14:43:16] some changes only made it into ffmpeg and not libav i think but both have improved multithreading since [14:43:47] just that we hit it now with the precise update and having more cores... [14:44:03] I'm reading through libav.org [14:44:13] I've heard of the fork but haven't ever seen the site [14:44:13] *so* cool [14:44:21] finally, the ffmpeg madness stops [14:45:27] i've got 3500 req/s on cp3003 now [14:53:00] New patchset: Demon; "Adding daily gerrit backups to amanda" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31247 [14:53:09] hi mark, paravoid [14:53:40] does anyone know about squids (if any config is in gerrit) and/or can look at caching issues with ULS and wikidata for anons? [14:53:43] see https://bugzilla.wikimedia.org/show_bug.cgi?id=41451 [14:54:27] * aude doesn't know who the right person is for squids stuf [14:54:28] f [14:54:37] New patchset: Demon; "Adding daily gerrit backups to amanda" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31247 [14:55:11] New review: Demon; "PS2 is just a rebase." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/31247 [14:55:46] <^demon> Could someone please take a look at https://gerrit.wikimedia.org/r/#/c/31247/ for me? I will sleep much better at night. [14:56:03] paravoid: thanks for your help debugging this [14:56:03] New patchset: Jgreen; "fix local archive paths for db78 backup script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31248 [14:56:07] replied [14:56:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:31] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31248 [14:56:49] I was about to reply too [14:56:51] it's gonna be fun if we have to vary over hundreds of languages with varnish [14:57:11] hehe [14:57:26] and we were worrying about thumbs... [14:57:47] why is it the same domains/URLs for all languages? [14:57:54] indeed [14:58:09] j^: no worries! 
[14:58:28] <^demon> paravoid: It's like commons or mediawiki.org -- multilingual. [14:58:56] <^demon> Well, mw.org is mostly english. But like commons. [14:59:10] that's gonna cache badly [14:59:55] apergos: aren't you our amanda expert? [15:03:28] mark: it's even worse than that btw [15:03:34] accept-language isn't a single language [15:03:52] we need to normalize it in VCL if we want to be sane [15:04:08] wikidata isn't gonna serve public traffic anyway right? ;-) [15:04:16] just wikimedians [15:04:30] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [15:04:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [15:04:38] at least I hope [15:05:17] is the ULS coming form varnish or squid? [15:05:31] i have no idea what "ULS" means [15:05:37] universal language selector [15:05:42] me neither :) [15:05:42] sorry :) [15:05:56] it's the language selector at top of wikidata.org [15:06:10] sounds like something we should switch off :) [15:06:15] hahaha [15:06:24] which if you're logged out, usually is in english but sometimes we get random pages like in norwegian, icelandic, etc. that are stuck in cache [15:06:31] logged in, no problem because it does things using preferences [15:06:45] * aude also has problems just staying logged in [15:06:46] it's probably served by mediawiki proxied/cached by squid [15:06:48] sometimes [15:06:58] ok [15:06:59] no, no problem because you're bypassing the caches [15:07:07] when you're logged in [15:07:07] right [15:07:21] is there anything squid stuff to see in gerrit? [15:07:24] any configs? [15:07:26] no [15:07:29] :( [15:07:30] but that's not needed either [15:07:36] * aude hoping to understand better [15:07:36] this has nothing to do with squid's config [15:07:37] ok [15:07:43] when you serve a page [15:07:45] if squid is caching the wrong things, you're not sending the right HTTP caching headers [15:07:52] ah, okay [15:08:00] so probably someone should read up on HTTP and caching headers :) [15:08:05] it needs to cache cookies i think? or something [15:08:11] squid is going to cache it for everyone, because it doesn't know that the content is different based on an incoming http header [15:08:19] hmm [15:08:19] specifically the Vary header, and X-Vary-Options [15:08:21] but you can tell that to squid on the reply [15:08:25] ok [15:08:34] * aude looks at my headers [15:08:34] and Cache-Control [15:08:38] ok [15:08:56] you can tell it Vary: Accept-Language, meaning that "this content is valid only for *this* value of this incoming HTTP header(s)" [15:09:18] so squid/varnish can keep multiple copies of the same URL [15:09:30] one for each value of Accept-Language [15:09:38] that's standard HTTP [15:09:44] *but* [15:09:52] having too many variants of the same page doesn't really work well with varnish [15:09:53] ok [15:09:58] or squid really [15:10:00] oh :( [15:10:02] so we shouldn't really vary per language, much less so for Accept-Language [15:10:36] so, basically, this probably never work for e.g. wikipedia [15:10:41] would* [15:10:47] ugh? [15:10:48] really? [15:11:02] we'd end up with hundreds or thousands of cached copies of the same articles [15:11:05] end we'd go down [15:11:11] and [15:11:17] hmm [15:11:24] and cache lookups would be slow [15:11:27] because of the linear search [15:12:45] what about if we somehow magically configure domains like de.wikidata.org/wiki/Q100 to be wikidata.org/wiki/Q100?uselang=de ? 
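A minimal Node.js sketch of the Vary mechanics paravoid describes above; illustration only, not how MediaWiki or the Wikimedia caches are actually configured:

```js
const http = require('http');

http.createServer((req, res) => {
  // Crude language pick from the request header (real parsing would honour q-values,
  // and a cache would want the header normalized to a small set of values).
  const lang = (req.headers['accept-language'] || 'en').split(',')[0].split('-')[0];

  // "Vary: Accept-Language" tells Squid/Varnish to keep one copy of this URL per
  // distinct header value -- potentially hundreds of variants per page, which is the
  // cache-fragmentation problem described above.
  res.setHeader('Vary', 'Accept-Language');
  res.setHeader('Cache-Control', 's-maxage=3600, must-revalidate');
  res.end('<p>Interface language: ' + lang + '</p>\n');
}).listen(8080);
```

Putting the language into the URL instead (commons-style ?uselang=de) sidesteps this: each language becomes a distinct URL and therefore a distinct cache object, with no per-header variant explosion.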
[15:12:54] like with apache rewrite rules? [15:13:01] * aude no expert at this stuff [15:13:02] Q100? [15:13:07] Q100 is an item page [15:13:13] I have no idea how wikidata works or is structured [15:13:17] http://www.wikidata.org/wiki/Q100 [15:13:36] those are especially important for the use to be able to choose a language [15:14:10] why is it called Q100? [15:14:19] that's offtopic, but I'm still wondering [15:14:34] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [15:14:38] it's the item id [15:15:02] we also have special:itembytitle/dewiki/Berlin which does a redirect to the item [15:15:25] we could do apache rewrites to that, but doesn't necessarily work for non data pages [15:17:12] New patchset: Jgreen; "more fundraising offhost_backups adjustments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31249 [15:18:32] aude: I think you send a mail to ops@ or engineering@ [15:19:03] 10kreq/s on cp3003 now [15:19:09] unless mark has an answer already [15:19:14] :-) [15:19:14] for apache rewrites? a bugzilla ticket works [15:19:32] no rob [15:19:32] no [15:19:34] ? [15:19:37] not for architectural design decisions [15:19:43] wikidata & caching [15:20:01] (and btw I am not anyone's amanda expert, that was all fred) [15:20:01] oh [15:20:14] apergos: but you do know amanda? [15:20:15] (I don't) [15:20:30] I was actually a developer stuck in office it land (cause our office it guys kept getting fired) at that time [15:20:44] no, sadly I don't [15:20:58] i used amanda, a long time ago [15:21:16] who knows amanda on our team? [15:21:22] at some point I looked at it for like 5 seconds when invistigating backups for the office, that was 3.5 years ago. and when I say looked at it, I mean features,issues, not a test install [15:21:22] noone [15:21:25] I mean, I can always merge it as it is [15:21:39] but that defeats the point of review, doesn't it [15:21:44] :-D [15:22:03] apergos: https://gerrit.wikimedia.org/r/#/c/31247/ enjoy [15:22:20] geethanks [15:22:23] looked at it 3.5 years ago is better than at all, so feel free. [15:22:23] ha, cp3003 filled its 96 GB of memory now [15:22:39] mark: I'm looking at the ganglia graphs and I don't see 10k req/s [15:22:40] liar :P [15:22:53] then you need glasses [15:23:42] although the correct ganglia metric for that is a bit fucked up, I do think it's around 10kreq/s RMS :) [15:23:49] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31249 [15:24:58] why do the req graphs have these ups and downs? [15:25:10] because there's something wrong with the ganglia plugin [15:25:11] ganglia plugin bug? [15:25:11] anyway [15:25:15] use varnishstat -n frontend [15:25:20] okay [15:25:20] ganglia is just difficult in that regard [15:27:19] New patchset: Jgreen; "grr sneaky typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31251 [15:28:42] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31251 [15:28:54] in this config, cp3003 would of course work better without the frontend/backend ;) [15:29:02] the frontend uses the most cpu by far [15:29:38] i'll have to experiment with bigger frontend cache memory as well [15:29:44] see if that makes much of a difference [15:29:58] what, you haven't coded the part where we're going to merge frontend/backends yet? 
[15:30:04] :P [15:30:18] no [15:30:33] i decided it's not important enough right now ;) [15:30:38] :P [15:31:32] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31247 [15:31:50] pybal has depooled it two times briefly [15:32:01] so cp3003 is now serving 25% of european upload traffic, right? [15:32:01] back, sorry [15:32:23] or is 33% [15:32:31] we'll be chatting with robla but [15:32:48] yeah 33% [15:32:50] think it might be nice to schedule a time to chat this over more with ops people [15:34:35] yes [15:35:14] I think we generally should default to mail and only if that doesn't work resort to phone calls [15:35:28] New patchset: Andrew Bogott; "Allow roles to insert arbitary lines into LocalSettings.php." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31252 [15:36:55] mail works for us [15:37:10] is there anyone specific we should email? or where does ops@ go? [15:37:20] ops@ would be best [15:37:23] ok [15:37:24] ops goes to the whole ops team plus a few platform people [15:37:25] perhaps wikitech as well [15:37:36] good idea [15:37:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:37:44] plus a few other people like Erik I guess [15:38:11] i'm not sure we are the best people to draft a strategy for this but i think something should be put somewhere? mediawiki.org or something [15:38:11] mark: so, what's stopping you from increasing the weight more? :) [15:38:23] try to outline the problem and solutions [15:38:25] i.e., when do you know when to stop [15:38:32] the fact that it was slightly unstable for the first time ;) [15:38:32] how even [15:38:32] but also the cache miss rate [15:38:48] 600 misses/s is quite a lot [15:39:01] it's going down [15:39:22] i'll increase it more but i'm waiting a bit [15:39:45] hehe [15:39:57] varnishhist still looks ok-ish [15:40:07] some requests taking just under a second, but very few [15:40:28] we hit ~1.1k req/s on swift again [15:40:44] ah [15:40:45] not now, half an hour ago [15:40:47] haven't really been watching that [15:40:50] what about our issues of being logged out? [15:40:54] i restarted some more boxes today [15:41:22] aude: that's either a caching issue if the right headers are not sent wrt cookies [15:41:30] or it's actual sessions getting lost on the mediawiki / redis / memcached layer [15:41:31] i keep getting logged out as [[User:Aude]], especially in chrome (even when clearing my cookies, etc.) [15:41:38] hmmm.... [15:41:43] !log reedy synchronized php-1.21wmf3/extensions/Wikibase [15:41:51] Logged the message, Master [15:41:51] * aude tries to log into wikipedia [15:42:13] yes, wikipedia works fine [15:42:43] i think something's not configured right, yet for wikidata but don't know where to look or which config [15:43:17] first read up on HTTP caching [15:43:33] this is never going to work well if wikidata developers don't know how that works very well [15:44:30] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:44:30] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [15:44:30] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [15:44:33] ok :) [15:44:58] the HTTP 1.1 RFC is a good read [15:45:10] i just wonder how we're doing things differently, if we have SUL, central auth, etc. 
all the same stuff as wikipedia and elsewhere [15:45:12] there's also a good o'reilly book on squid which explains it well [15:45:16] ok [15:45:30] well, for one, you have the same domain for all languages [15:45:31] i've poked at varnish but not squid [15:45:39] paravoid: ah ok [15:45:47] but commons seems to work ok? [15:45:52] varnish works the same in that respect [15:46:06] mostly english but [15:46:12] ok [15:46:58] commons uses ?uselang=$lang from what I can see [15:47:06] sure [15:47:26] wonder if there's a way to rewrite that but at the same time cache [15:47:34] <^demon> paravoid: uselang + javascript hackery. [15:47:41] :) [15:47:41] so foo?uselang=en is a different URL than foo?uselang=de [15:47:51] * aude nods [15:47:55] so they'll be separate cache objects [15:48:00] makes sense [15:48:05] ^demon: do tell! [15:48:43] <^demon> paravoid: They have some site JS that detects language, and then hides stuff for other languages. [15:48:56] jesus [15:48:58] <^demon> All of the the language stuff is sent to you. [15:49:11] hrm [15:49:39] New review: Demon; "Thank you!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31247 [15:53:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [15:54:45] !log reedy synchronized php-1.21wmf3/extensions/Wikibase/lib/includes/Utils.php [15:54:51] Logged the message, Master [15:56:05] !log reedy synchronized php-1.21wmf3/includes/Revision.php [15:56:10] Logged the message, Master [15:56:37] !log reedy synchronized php-1.21wmf3/includes/specials/SpecialUndelete.php [15:56:45] Logged the message, Master [15:58:12] so cp3003 got really unstable and I've pooled it at regular load [15:58:29] how's so? [15:59:00] i don't know the reason yet, i wasn't gonna figure it out while it was really unstable ;) [15:59:11] i'll look at it carefully on a next careful rampup [15:59:18] what do you mean unstable? [15:59:20] slow or 500s? 
[15:59:25] slow, nonresponding [15:59:26] pybal depooling [15:59:35] 15k load average :) [15:59:39] so probably a lot of stuck threads [16:01:22] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Upload%20caches%20esams&h=cp3003.esams.wikimedia.org&v=14930&m=frontend.n_wrk&r=hour&z=default&jr=&js=&st=1351785586&vl=N&ti=N%20worker%20threads&z=large [16:01:35] oh hah [16:04:58] i'm going to reinstall bits boxes with precise [16:05:01] and then call it a day [16:14:07] !log Reinstalling sq67 with Precise [16:14:12] Logged the message, Master [16:16:26] PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100% [16:22:07] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:25:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:36] PROBLEM - SSH on sq67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:12] PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:06] RECOVERY - SSH on sq67 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:32:34] !log Restarted backend varnish instance on cp1029 [16:32:39] Logged the message, Master [16:34:35] !log reedy synchronized php-1.21wmf3/extensions/TimedMediaHandler/ 'Update to master' [16:34:41] Logged the message, Master [16:35:10] !log reedy synchronized php-1.21wmf2/extensions/TimedMediaHandler/ 'Update to master' [16:35:15] Logged the message, Master [16:35:56] !log reedy synchronized php-1.21wmf2/extensions/MwEmbedSupport/ [16:36:06] Logged the message, Master [16:36:29] !log reedy synchronized php-1.21wmf3/extensions/MwEmbedSupport/ [16:36:35] Logged the message, Master [16:37:03] annoying puppet overloads [16:38:36] New patchset: Reedy; "Enable TMH and MwEmbed on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31265 [16:38:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.618 seconds [16:38:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31265 [16:39:13] Reedy: can we fix the issue that j^ was investigating earlier first? [16:39:16] oh too late I guess [16:39:26] What's that? [16:39:29] I've not deployed anything yet... [16:39:40] avconv hanged processes on all imagescalers [16:39:43] paravoid: that will be fixed with the update [16:39:48] ah, okay [16:40:09] I should killall avconv now then [16:40:53] All good then? :) [16:41:22] yes [16:42:18] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable TMH and mwembed on enwiki' [16:42:20] kdone [16:42:24] done even [16:42:24] Logged the message, Master [16:42:26] thanks! 
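On the cp3003 side, the pybal depools, the 15k load average and the worker-thread graph all point at thread pile-up rather than real CPU work. A rough way to confirm that from a shell on the box, sketched with the caveat that varnishstat counter names vary between Varnish versions and that a named instance needs -n:

```sh
# A five-digit load average means thousands of runnable or blocked tasks,
# i.e. piled-up worker threads, not genuine CPU load.
uptime

# Worker-thread counters (the n_wrk* family in Varnish 3); add "-n <name>"
# when querying a named instance such as a separate frontend.
varnishstat -1 | grep '^n_wrk'
```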
[16:45:02] New patchset: Mark Bergsma; "Fix interface_aggregate / interface_add_ip6_mapped deadlock" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31267 [16:45:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31267 [16:45:51] PROBLEM - NTP on sq67 is CRITICAL: NTP CRITICAL: No response from NTP server [16:50:33] !log reedy synchronized wmf-config/InitialiseSettings.php 'Disable TMH on enwiki' [16:50:38] Logged the message, Master [16:50:47] PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:40] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [16:57:54] sigh [16:57:59] sq67's LACP doesn't want to come up [17:00:13] root@sq67:~# ifenslave ifenslave -d bond0 eth1 eth2 eth3 [17:00:13] Master 'ifenslave': Error: handshake with driver failed. Aborting [17:00:28] what is this shit [17:01:37] whatever [17:01:41] i'll look at it tomorrow or so [17:08:14] !log gallium is sent to swap from time to time. The cause seems to be the Ext-Wikibase job for which update.php eats all memory. I have disabled the job meanwhile. In case of trouble, simply kill the wild php process. [17:08:20] Logged the message, Master [17:10:16] hashar: Hi [17:10:29] (just Hi) [17:11:31] New patchset: Reedy; "Enable PageTriage on test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31275 [17:11:55] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31275 [17:12:40] Krinkle: just hi :-] [17:12:52] Krinkle: I am leaving again. Just came for gallium madness :-] [17:13:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:34] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable PageTriage on test2wiki' [17:14:39] Logged the message, Master [17:17:07] Reedy: Any idea why there's 2 gerrit-wm and logmsgbot? I thought those ran on different servers? [17:17:50] !log reedy synchronized php-1.21wmf3/extensions/TimedMediaHandler [17:17:55] Logged the message, Master [17:18:36] !log reedy synchronized php-1.21wmf3/includes/ [17:18:45] Logged the message, Master [17:19:27] paravoid: now you can killall avconv on the imagescalers if they are still running [17:19:34] I already did [17:26:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [17:27:14] paravoid: j^: why'd you need to killall on avconv? [17:28:40] robla: turns out the memory issue was related to threads: imagescalers have more cores than the test equivalents, and by default avconv creates threads based on the available cores; now it just uses one thread. But some avconv processes were hanging on the imagescalers from testing earlier today.
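A sketch of the two-part fix j^ describes: cap avconv at one worker thread instead of one per core, and clear out any transcodes still hanging from the earlier testing. File names and codec choices below are placeholders; only the -threads flag and the cleanup commands are the point.

```sh
# Cap avconv at a single worker thread regardless of core count.
avconv -i input.ogv -threads 1 -c:v libvpx -c:a libvorbis output.webm

# Clean up transcodes left hanging on an image scaler from earlier runs.
pgrep -fl avconv     # list any leftover processes first
killall avconv       # then ask them to exit
```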
[17:29:09] I'm going to be offline for a couple of minutes, but should come back on in a bit [17:34:08] !log reedy synchronized php-1.21wmf3/includes/Message.php [17:34:13] Logged the message, Master [17:34:36] !log reedy synchronized php-1.21wmf3/includes/cache/MessageCache.php [17:34:40] Logged the message, Master [17:41:42] !log reedy synchronized wmf-config/InitialiseSettings.php 'Re-enable TMH on enwiki' [17:41:52] Logged the message, Master [17:47:28] !log mlitn synchronized php-1.21wmf2/extensions/ArticleFeedbackv5 'desc' [17:47:34] Logged the message, Master [17:48:17] !log mlitn synchronized php-1.21wmf3/extensions/ArticleFeedbackv5 'desc' [17:48:23] Logged the message, Master [17:49:09] !log reedy synchronized php-1.21wmf2/includes/ [17:49:14] Logged the message, Master [18:00:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:36] PROBLEM - Puppet freshness on db51 is CRITICAL: Puppet has not run in the last 10 hours [18:02:14] !log reedy synchronized php-1.21wmf2/extensions/TimedMediaHandler [18:02:18] Logged the message, Master [18:11:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.877 seconds [18:14:08] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Test [18:14:12] Logged the message, Master [18:39:18] !log deleting duplicate docroot for wikidata in /h/w/common/docroot/ [18:39:23] Logged the message, Master [18:46:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.708 seconds [19:00:19] !log pulled IPv6 (only) announcements from AS1257 due to routing issues [19:00:25] Logged the message, Mistress of the network gear. [19:01:08] !log (modifying above ^^) pulled IPv6 (only) announcements from AS1257 on cr1-eqiad ONLY due to routing issues [19:01:12] Logged the message, Mistress of the network gear. [19:07:30] New patchset: Demon; "Moving github replication to wikimedia account" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31300 [19:08:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31300 [19:33:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:46] !log reedy synchronized php-1.21wmf3/extensions/ProofreadPage [19:37:52] Logged the message, Master [19:43:04] New patchset: MaxSem; "Kill wlm.wikimedia.org with fire" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31302 [19:47:41] New patchset: MaxSem; "Redirection rules for wlm.wikimedia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/31303 [19:48:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [20:01:55] LeslieCarr: just curious to know if you ever looked into why the varnish boxen have interesting daily spikes/death -- we have a couple of points where slow response times (as seen by watchmouse) seem to correlate with the spikes [20:02:32] i did not see anything myself but am not the best person for this -- binasher may have some more insight [20:02:46] ok; I'll poke him about it [20:03:06] actually; might just draft an email so that people can look when they have time [20:03:39] sounds good [20:15:12] how do I move files from one machine on the cluster to another? 
I'd like to move logs to some permanent storage before leaving yttrium [20:16:09] <^demon> MaxSem: scp? [20:20:16] I just got a "The table 'mw_text' is full (localhost)" error when trying to edit wikitech... [20:20:44] anyone know who I can bug about that? [20:21:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:49] <^demon> Ouch. [20:22:26] <^demon> binasher, perhaps? [20:24:41] mutante maybe [20:25:38] would it be better to just file an RT ticket about it? [20:26:08] <^demon> Well if mw_text is full then we've got a problem. [20:27:05] yep :) no one can schedule deployments for one :p [20:33:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.073 seconds [20:33:18] RobH: which cabinet do we use for XC orders again ? (0000 ? ) [20:33:48] yes, the rest should be 10X for row A, 20X for B, etc. [20:33:59] iirc [20:34:28] lemme check to make sure [20:34:47] LeslieCarr: yep. [20:34:57] 0000 is the dmarc xconnect cabinet on side of cage [20:34:59] cool thanks :) [20:35:10] ^demon, yes that worked, thanks [20:35:14] ordering the new TiNET cross connect :) [20:35:41] <^demon> MaxSem: You're welcome. [21:08:09] New patchset: Ori.livneh; "Deploy PostEdit to {fi,fr,id,te,vi}wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31320 [21:09:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:09:20] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31320 [21:17:46] !log freeing disk space on wikitech [21:17:54] Logged the message, Master [21:19:19] mwalker: (quick) fixed, but if you created a ticket pls keep it open. [21:19:54] mutante: nope; didn't create a ticket [21:20:45] ok, its fine, it can be considered part of an existin one (backup issue) [21:20:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.836 seconds [21:21:37] we'll just get a new instance anyways [21:23:09] AaronSchulz: fenari:/home/asher/db/bug41649/ [21:25:06] !log spage synchronized php-1.21wmf3/extensions/E3Experiments 'ACUX updates' [21:25:11] Logged the message, Master [21:26:34] binasher: how just need to parse that [21:26:43] *now I'll just [21:26:47] * AaronSchulz can't type [21:28:09] the s3 file has mediawikiwiki going back to the 29th [21:51:37] New patchset: Asher; "base pidfile on servicename" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31327 [21:52:04] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31327 [21:56:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:46] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [22:03:46] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:03:46] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [22:07:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.949 seconds [22:10:58] New review: Dzahn; "15:08 < effeietsanders> mutante: please check with multichill" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/31302 [22:16:16] New patchset: Asher; "create parent to /a/redis/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31335 [22:16:39] Change merged: Asher; 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/31335 [22:29:30] New patchset: Reedy; "Enable import from meta on wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31336 [22:29:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31336 [22:30:10] binasher: is there a window for a wider memcached deploy? [22:30:46] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable wikidata import from metawiki' [22:30:48] Logged the message, Master [22:33:55] !log spage synchronized php-1.21wmf3/extensions/PostEdit 'latest PostEdit' [22:33:57] Logged the message, Master [22:34:25] Where would I file a bug about the ganglia reporting? [22:38:46] awight: https://bugzilla.wikimedia.org Wikimedia>General, probably [22:39:13] awight: RT (you just got access) [22:39:52] [22:40:00] heh [22:40:09] heh, or both and link them or which you prefer :p [22:41:12] mutante: rad, thanks. I will throw a party in RT [22:41:45] !log spage synchronized php-1.21wmf3/extensions/E3Experiments 'ACUX bump versions' [22:41:50] Logged the message, Master [22:43:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:43:47] !log Created translate tables on wikidatawiki [22:43:55] Logged the message, Master [22:46:32] New patchset: Reedy; "Bug 41585 - Add the extensions Translate and TranslateNotifications on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31342 [22:47:01] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31342 [22:48:31] !log reedy synchronized wmf-config/InitialiseSettings.php [22:48:36] Logged the message, Master [22:50:59] Reedy: same config as on Meta? [22:51:05] yup [22:51:53] Reedy: also for user rights? [22:52:27] probably not [22:53:49] New patchset: Reedy; "Disable and remove contest extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31343 [22:54:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31343 [22:55:25] New review: Alex Monk; "Reedy just did this in I4066878c." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/31165 [22:56:23] !log reedy synchronized wmf-config/ 'Disable Contest extension' [22:56:29] Logged the message, Master [22:58:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [22:59:41] !log Dropped contest tables from testwiki [22:59:49] Logged the message, Master [23:22:04] !log reedy synchronized wmf-config/InitialiseSettings.php 'Temporarily disable translate and translation notification on wikidatawiki' [23:22:07] Logged the message, Master [23:23:16] AaronSchulz: nothing scheduled, want to switch a few bigger wikis to pecl-memcached tomorrow? [23:23:51] as in Friday tomorrow? [23:24:27] ;) [23:24:40] !log reedy synchronized wmf-config/InitialiseSettings.php 'Revert that, didn't fix the issue' [23:24:45] Logged the message, Master [23:26:34] ok, my script seems to be parsing those log dumps fine [23:28:31] AaronSchulz: let's do it fri at 5pm, then go home [23:28:36] but not tell anyone [23:28:47] and go home drunk so no one can do anything [23:29:00] New patchset: Asher; "redis nagios check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31349 [23:29:09] and forget the multiwrite [23:29:46] enwiki is good to test with, right? 
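Going back to MaxSem's question about moving logs off yttrium: ^demon's one-word answer, spelled out, looks roughly like the following; the destination host and both paths are invented for illustration, and scp only needs SSH access from the source machine to the destination.

```sh
# Run on yttrium: copy the log directory to longer-term storage,
# preserving modification times and modes (host and paths are placeholders).
scp -rp /var/log/myapp/ someuser@storage-host.wikimedia.org:/srv/archive/yttrium-logs/
```

rsync -a over SSH would be the usual alternative if the copy is large enough to need resuming.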
[23:30:05] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31349 [23:32:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:33:23] I'm running scap on fenari for E3 and Fundraising [23:33:37] binasher: https://gerrit.wikimedia.org/r/#/c/31350/1 [23:34:06] LeslieCarr: http://www.cisco.com/en/US/products/hw/wireless/ps4570/products_tech_note09186a0080bb1d7c.shtml [23:34:33] preilly: fyi it's a different cisco model [23:34:47] LeslieCarr: yeah I figured [23:36:03] binasher: 1pm mon? [23:36:12] for mc [23:36:20] hey ops experts, I'm seeing screenfuls of include_once failures during scap? I think it's in "Updating ExtensionMessages-1.21wmf3.php" step. [23:36:26] ok, fine.. be reasonable [23:36:58] spagewmf: i think you want the non-ops experts in -tech or -dev [23:36:59] spagewmf: ignore them [23:37:23] preilly: there is an un-cisco recommendation to try a huge igmp timeout cuz apple sucks at multicast [23:37:44] LeslieCarr: ha ha [23:37:44] preilly: try again ? [23:38:39] hmm, still need to call updateIfNewerOn [23:39:30] LeslieCarr: thanks for looking into this for me [23:39:52] is it working now ? [23:40:27] AaronSchulz: have you done a dry run? [23:41:09] binasher: I've only tested the first part mostly, I'll do a dry-run after I make some tweaks [23:42:52] !log spage Started syncing Wikimedia installation... : PostEdit i18n, E3 ACUX updates, fixing Cross site reqest problem in CentralNotice [23:42:58] Logged the message, Master [23:47:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.017 seconds
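As a closing note on the wikitech hiccup above: a MySQL "The table 'mw_text' is full" error usually means the filesystem holding the data directory (or the temp directory) has run out of space rather than anything table-specific, which matches the fix being to free disk space. A quick check on the database host might look like this; the paths and the 'wikitech' database name are assumptions for illustration.

```sh
# Free space where MySQL keeps its data and temporary files
# (paths assume a stock Debian/Ubuntu layout).
df -h /var/lib/mysql /tmp

# Size and status of the text table ("mw_" prefix taken from the error message).
mysql -e "SHOW TABLE STATUS LIKE 'mw_text'\G" wikitech
```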