[00:05:22] Elsie: Tried qchris? [00:06:29] :-D He found me in the analytics channel :-) [00:06:30] iptables isn't in 12.04 stock.. should I be using ufw or something else? [00:06:48] in iptables I'd just set up a TCP redirect to another port... [00:07:10] By syock, you mean installed by default? [00:07:12] ufw would have to use iptables [00:07:13] *stock [00:07:45] yes, installed by default [00:07:56] apt-get install iptables, and I'm happy [00:08:12] but I can go learn ufw [00:08:26] ufw is just a wrapper for iptables [00:08:28] Can't see why that's going to be an issue... Just make sure your puppet code is making sure it's installed and should be good to go [00:08:40] (rather than expecting it to be installed manually) [00:09:27] I always just use iptables and iptables-persistent [00:11:14] or https://forge.puppetlabs.com/arusso/iptables for puppet [00:11:22] ufw isn't install stock either.. :) [00:11:31] so I just went with iptables -- I'm happy [00:11:52] thx [00:13:02] cajoel: we use ferm in production [00:13:32] bast1001 has iptables [00:14:17] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:15:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [00:16:38] (03CR) 10Chad: [C: 031] "I don't think we need a second variable. The amount of time it takes to index most wikis is usually very minimal considering the increased" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98046 (owner: 10Legoktm) [00:17:18] <^d> greg-g: Today's LD totally taken now? [00:18:53] (03PS1) 10Aaron Schulz: Added logstash role and applied it to logging logstash servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/99278 [00:20:07] (03CR) 10Ori.livneh: [C: 032] Added logstash role and applied it to logging logstash servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/99278 (owner: 10Aaron Schulz) [00:20:44] ...I started reviewing that :) [00:23:41] i was sitting next to aaron as he wrote it [00:24:21] Pair programming! [00:25:17] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:25:33] but review still welcome! [00:25:39] (03PS1) 10Ori.livneh: Qualify elasticsearch includes in role::logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/99279 [00:26:46] (03CR) 10Ori.livneh: [C: 032] Qualify elasticsearch includes in role::logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/99279 (owner: 10Ori.livneh) [00:27:10] paravoid: we just copied role::elasticsearch and modified it [00:27:18] <^d> Does anyone have dibs on the remainder of today's LD? I don't see anything on-wiki. [00:28:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [00:36:44] (03PS1) 10Ori.livneh: Qualify elasticsearch class names in role::elasticsearch, too [operations/puppet] - 10https://gerrit.wikimedia.org/r/99280 [00:37:38] (03CR) 10Ori.livneh: [C: 032] Qualify elasticsearch class names in role::elasticsearch, too [operations/puppet] - 10https://gerrit.wikimedia.org/r/99280 (owner: 10Ori.livneh) [00:56:06] (03PS1) 10Springle: depool es1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99288 [00:56:22] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
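For reference, the TCP redirect cajoel describes above looks roughly like the following with plain iptables plus iptables-persistent. The port numbers are placeholders, and the rule-saving command varies a little between iptables-persistent versions:

    # Install the tools (and, as suggested above, have puppet ensure this rather than relying on manual installs)
    sudo apt-get install iptables iptables-persistent

    # Redirect inbound TCP traffic on port 80 to a local service listening on 8080 (example ports only)
    sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8080

    # Persist the current rules across reboots; older iptables-persistent releases expose this via the init script
    sudo service iptables-persistent save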
[00:56:35] (03CR) 10Springle: [C: 032] depool es1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99288 (owner: 10Springle) [00:56:44] (03Merged) 10jenkins-bot: depool es1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99288 (owner: 10Springle) [00:57:22] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [00:58:04] !log springle synchronized wmf-config/db-eqiad.php 'depool es1003 for upgrade' [00:58:20] Logged the message, Master [01:06:17] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:27:17] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:30:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:35:15] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:36:15] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:44:42] (03CR) 10Tim Starling: [C: 04-1] "MZMcBride, please put your changes in a separate commit." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/53885 (owner: 10Reedy) [01:45:23] Bleh. [01:47:15] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:48:15] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:48:42] (03PS9) 10MZMcBride: Update wgServer, wgCanonicalServer for sub.subdomain wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/53885 (owner: 10Reedy) [01:51:37] (03PS1) 10Springle: switch es1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/99296 [01:53:29] (03CR) 10Springle: [C: 032] switch es1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/99296 (owner: 10Springle) [01:53:58] (03PS1) 10Ori.livneh: Disable elasticsearch nagios/ganglia monitoring on logstash cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/99297 [01:55:06] (03CR) 10Ori.livneh: [C: 032] Disable elasticsearch nagios/ganglia monitoring on logstash cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/99297 (owner: 10Ori.livneh) [01:55:24] (03PS1) 10MZMcBride: Specify HTTPS for $wgCanonicalServer for all private wikis. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99299 [01:57:15] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:58:05] (03CR) 10MZMcBride: "Tim, okay: ." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/53885 (owner: 10Reedy) [01:58:54] (03CR) 10Dr0ptp4kt: [C: 031 V: 031] "My own generate.php run resulted in the same records. My usernames and UID don't allow for deployment, but this looks good to me." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98036 (owner: 10Tim Starling) [01:59:15] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:01:38] (03CR) 10Bsitu: Enable Flow discussions on a few test wiki pages (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 (owner: 10Spage) [02:08:10] (03PS1) 10BBlack: temporarily add cp3013 to esams mobile list [operations/puppet] - 10https://gerrit.wikimedia.org/r/99303 [02:09:29] (03CR) 10BBlack: [C: 032 V: 032] temporarily add cp3013 to esams mobile list [operations/puppet] - 10https://gerrit.wikimedia.org/r/99303 (owner: 10BBlack) [02:15:43] PROBLEM - Varnish HTTP mobile-backend on cp3013 is CRITICAL: Connection refused [02:18:43] RECOVERY - Varnish HTTP mobile-backend on cp3013 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.193 second response time [02:20:23] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:21:13] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:22:00] !log LocalisationUpdate completed (1.23wmf5) at Thu Dec 5 02:22:00 UTC 2013 [02:22:17] Logged the message, Master [02:25:12] (03PS1) 10Ori.livneh: Rename role::elasticsearch -> role::elasticsearch::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/99307 [02:28:02] (03CR) 10Ori.livneh: [C: 032] Rename role::elasticsearch -> role::elasticsearch::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/99307 (owner: 10Ori.livneh) [02:28:23] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:06] Hah! [02:31:42] (03CR) 10Ori.livneh: "Theory confirmed; this was indeed the issue." [operations/puppet] - 10https://gerrit.wikimedia.org/r/99307 (owner: 10Ori.livneh) [02:32:18] (03PS1) 10Ori.livneh: Revert "Disable elasticsearch nagios/ganglia monitoring on logstash cluster" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99309 [02:33:13] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:33:27] (03CR) 10Ori.livneh: [C: 032] Revert "Disable elasticsearch nagios/ganglia monitoring on logstash cluster" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99309 (owner: 10Ori.livneh) [02:36:53] AaronSchulz: figured it out [02:38:26] !log LocalisationUpdate completed (1.23wmf4) at Thu Dec 5 02:38:26 UTC 2013 [02:38:42] Logged the message, Master [02:49:25] (03CR) 10Tim Landscheidt: [C: 031] "I assume the deletion of apache2.2-common is deliberate, so might be nice to mention it in the commit message." [operations/puppet] - 10https://gerrit.wikimedia.org/r/98379 (owner: 10Matanya) [02:50:31] Could someone help me fix permissions for my .ssh/known_hosts on fenari.wikimedia.org? [02:50:31] $ touch /home/krinkle/.ssh/known_hosts [02:50:31] touch: cannot touch `/home/krinkle/.ssh/known_hosts': Permission denied [02:50:33] As a result I get an ssh fingerprint yes/no everytime I connect to anything from fenari [02:53:26] Krinkle: try now? [02:54:45] andrewbogott: works :) [02:55:22] thanks [02:55:32] Krinkle: you didn't have a known_hosts, file and didn't have write permissions to the .ssh dir. [02:55:42] So I just made you an empty known_hosts :) [02:56:08] andrewbogott: should the directory be chgrp'ed to me instead of root? [02:56:13] or is that a good thing? 
[02:56:16] (03PS2) 10Matanya: toollabs: remove old tips absent declartions [operations/puppet] - 10https://gerrit.wikimedia.org/r/98379 [02:56:58] hm, now I'm typing in the wrong channel too [02:57:12] Don't you own it but it's just not writeable? [02:57:18] Anyway, I think it's correct as it is. [02:57:23] Or, at least, fine as it is. [03:00:04] k, chmod u+w .ssh fixed it so that my user can create files in the future [03:01:41] (03PS1) 10Springle: repool es1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99314 [03:02:27] (03CR) 10Springle: [C: 032] repool es1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99314 (owner: 10Springle) [03:03:32] andrewbogott: would be nice to have some reviews from you :) [03:03:40] !log springle synchronized wmf-config/db-eqiad.php 'repool es1003 after upgrade, max_connections lowered during warm up' [03:03:52] matanya: I'll try to catch up a bit tomorrow. [03:03:53] Logged the message, Master [03:04:03] thanks andrewbogott [03:16:18] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [03:23:22] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Dec 5 03:23:22 UTC 2013 [03:23:37] Logged the message, Master [03:32:28] (03PS1) 10Springle: depool es1002 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99318 [03:33:08] (03CR) 10Springle: [C: 032] depool es1002 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99318 (owner: 10Springle) [03:34:04] !log springle synchronized wmf-config/db-eqiad.php 'depool es1002 for upgrade' [03:34:25] Logged the message, Master [03:34:39] (03CR) 10Tim Starling: [C: 04-1] "Still needs redirects." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/53885 (owner: 10Reedy) [03:35:03] Awww, no mutante [03:36:17] (03CR) 10Tim Starling: [C: 031] Specify HTTPS for $wgCanonicalServer for all private wikis. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99299 (owner: 10MZMcBride) [03:37:54] For some reason I thought Reedy was on vacation. [03:38:11] Wassat? [03:38:20] I was just rebasing to be polite in his absence. Now I have a changeset of my own, hrm. [03:38:29] I'm not sure why I thought that. [03:43:17] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:45:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [03:45:43] Elsie: I was last week? [03:45:52] I mean, that's a true statement, just, you know. [03:46:14] The whole U.S. kind of was. [03:49:56] Ryan_Lane, I saw a trello card about autoredirect to https in various languages. What's the status of that, and would it be possible to avoid it based on the presence of the X-CS header? [03:50:38] Elsie: I just also took M-W, but yeah, pretty slow week [03:53:10] greg-g: I'm proud of you. [03:53:17] Elsie: I was too [03:53:23] I doubt I confused you with Reedy, though. ;-) [03:53:41] Wouldn't imagine, just I couldn't think who else was on vacation. [03:53:45] anywho, g'evening. [03:56:21] Bye. [03:57:52] greg-g that means i will deploy a few more things this week :) [03:58:11] i have enough for another depl window, or at least for a quick depl for sure :) [03:58:40] any good time tomorrow? :) [04:10:54] yurik-road2: what is it? [04:11:16] yurik-road2: bug numbers/etc? 
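On the ~/.ssh permission problem Krinkle hit earlier in this exchange: when the directory is owned by the user but not writable, the self-service fix is just to restore write permission and create the file, roughly as follows (the modes shown are the usual conventions, not something mandated by the discussion above):

    # Assumes ~/.ssh is owned by you; if ownership itself is wrong, a root chown is needed first
    chmod u+w ~/.ssh                    # or the stricter conventional mode: chmod 700 ~/.ssh
    touch ~/.ssh/known_hosts
    chmod 644 ~/.ssh/known_hosts
    ls -ld ~/.ssh ~/.ssh/known_hosts    # verify ownership and modes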
[04:11:28] yurik-road2: feel free to email me, since you're on the road right now :) [04:31:19] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:32:19] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [04:41:38] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:41:16 AM UTC [04:45:38] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:45:22 AM UTC [04:47:18] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:48:08] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [04:55:18] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:57:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [05:20:04] PROBLEM - Puppet freshness on cp3013 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 02:19:29 AM UTC [05:29:24] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:14] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [05:36:23] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:37:23] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [05:45:03] (03PS1) 10Springle: switch es1002 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/99327 [05:46:00] (03CR) 10Spage: [C: 04-1] "I think Benny's suggestion that we need two patches is right." (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 (owner: 10Spage) [05:46:38] (03CR) 10Springle: [C: 032] switch es1002 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/99327 (owner: 10Springle) [05:57:23] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:59:23] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [06:12:06] (03PS1) 10Springle: repool es1002 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99329 [06:12:07] (03PS1) 10Ori.livneh: logstash: add Ganglia group and specify aggregators [operations/puppet] - 10https://gerrit.wikimedia.org/r/99330 [06:13:22] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:14:12] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [06:14:25] (03CR) 10Springle: [C: 032] repool es1002 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99329 (owner: 10Springle) [06:15:20] !log springle synchronized wmf-config/db-eqiad.php 'repool es1002 after upgrade, max_connections lowered during warm up' [06:15:34] Logged the message, Master [06:24:22] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:56] !log cp301[123] puppet freshness is me, please leave them disabled [06:31:11] Logged the message, Master [06:31:12] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [06:38:47] bblack, hey, read your bug hunt, pretty impressive. One of those days I should learn some of the tricks you used [06:39:18] any luck finding theculprit? 
[06:40:18] well, the culprit in that particular case is "jemalloc bugs", the question is how we fix it :) [06:40:41] switching to a new malloc lib? :/ :) [06:41:10] yeah that's my current plan, actually, but I'm waiting for Faidon to wake up and say my plan isn't insane first [06:41:21] I rebuilt with s/jemalloc/tcmalloc/, and I like it [06:41:23] i'm sure there other libs have no bugs [06:42:11] jemalloc comes from FreeBSD, so the Linux port is kind of a 2nd-class citizen and not as well-tested, IMHO. tcmalloc from google has a lot of the same goals and properties, but was developed on Linux and is pretty stable. [06:42:19] are they really drop-in replacements like that? [06:42:46] as long as you're not using allocator-specific APIs, but in this case varnish is just using the normal posix APIs [06:43:16] moreover, what malloc lib is typically used by C? isn't there a standard c lib of some sort that everyone adhears to? [06:43:45] malloc normally comes from libc, so in our case normal is glibc's allocator [06:44:14] but glibc's allocator is fairly mundane and generic, which is why memory+thread-intensive software tends to want a better allocator [06:44:18] right, so these libs provide substantially better perfs for massive alloc/dealoc? [06:45:55] it's mostly not about "massive", it's about being efficient with gobs of small allocations, not fragmenting the heap in the face of repeated dealloc/realloc of small stuff, and being thread-aware so that hundreds of threads in the same proc don't trash each other on inter-cpu cache stuff and/or any necessary mutexes in the malloc implementation [06:45:55] has there been a push to migrate to tcmalloc in general? [06:46:08] no - honestly the glibc one is fine for most purposes, tcmalloc is considered special-purpose, people put it in place for specific apps when they have intensive needs that fit it [06:47:19] interesting - i would have thought that any threaded app which uses cpu for more than UI would want a highly efficient malloc [06:48:00] and more than net/file access [06:48:08] well there's no free lunch, implementations like tcmalloc have tradeoffs, such as wasting more memory to get things done more efficienctly in the big picture [06:48:47] i see. But I guess memory is cheaper nowadays, whereas performance is as critical as ever... [06:48:55] if tcmalloc replaced glibc's alloc for the whole system, you'd be making those tradeoffs for every random shellscript and simple utility and daemon, etc, where it's really not worth it. It would be a net loss of free ram for no noticeable improvement to the user. [06:50:19] true. Why has free bsd developed it? I would understand google's specific need [06:50:21] there are a lot of allocator implementations, and there's no one best answer that's optimal for every scenario [06:50:36] are they basic kernel on it somehow? [06:51:01] (03CR) 10Ori.livneh: [C: 031] "LGTM; I'll wait for you to be around before merging." [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [06:51:14] freebsd uses jemalloc as their default allocator AFAIK, but I don't keep up with FreeBSD as well as I could/should [06:51:30] tcmalloc isn't meant to be like that, it's special purpose for intense loads and lots of threads [06:51:31] on the other hand - not sure why - kernel doesn't rotate memobjects as often [06:53:15] this is old but insightful if you want to know about jemalloc: http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf [06:55:18] thanks! 
will learn something new :) good luck!!! [06:55:37] re-reading the above and my email, sometimes it comes off like I think jemalloc is bad, because I keep just saying "jemalloc" - obviously, our problems are with the Linux port of it specifically [06:55:54] I'm sure in FreeBSD it works great, because that implementation gets exercised and tested a lot more :) [06:56:07] bblack: notice whose malloc implementation it replaced :P [06:56:22] yeah [06:56:24] i got that. I wonder how a port would have those bugs though [06:57:03] not even sure how much is needed to port - i always thought freebsd and linux had fairly similar posix kernel api [06:57:15] well, all complex software has bugs, and allocators get pretty complicated. I think the key here is that one installed as a system default gets lots of exercise to wring out those bugs, and one that isn't doesn't. [06:59:13] possibly dependency libs of sorts... but still [06:59:15] a lot of the linux port diffs are about pthread lock differences, some other stuff like madvise() as well [06:59:59] but why are they different - from my bad memory of pthread, its all done in the user space [07:00:24] just minor differences [07:00:35] are they kernel dependent? [07:00:47] but it's a port, and the copy in varnish is very old, lots of bugs were fixed in the real jemalloc since [07:01:02] and it was exported as libjemalloc for linux since then as well, which has several releases and a long line of bugfixes to look at [07:01:30] everything is kernel dependent when you get down to it :) [07:02:47] how so? for malloc i would think you need lock management (you don't have to start your own threads), and an ability to ask OS for large globs of memory. not much more? [07:04:02] btw, i don't want to keep you away from doing other stuff - just curiosity :) [07:04:31] if I really knew the answer to every question about allocators, I'd be writing one instead of debugging one :) [07:04:56] hehe. Just wondering why they keep a fork when one doesn't appear to be badly needed [07:05:01] but at the very least, there's going to be differences in the underlying behaviors of mmap() and brk() [07:05:01] esp for something so useful [07:06:18] well, in this particular case, I think the chain of events goes something like this: varnish is developed on FreeBSD. FreeBSD gets jemalloc with better threads and less contention, and it makes varnish better. someone tries to use varnish on Linux and notices glibc's allocator isn't fancy like that and is a performance problem there, so jemalloc gets ported over to linux to use with varnish. [07:06:51] independently (and later in time, IIRC), Google develops tcmalloc natively on Linux with a lot of the same rough goals about handling thread concurrency better [07:07:35] yes, but wouldn't jemalloc devs want to develop for linux too? or is there political rivalry going on between the lower level devs? :) [07:08:06] since when they were developing jemalloc, they clearly were not happy with glibc [07:08:07] jemalloc was written for FreeBSD by FreeBSD people I believe [07:08:17] FreeBSD has its own libc, it's not glibc [07:08:24] oh, didn't know [07:08:32] glibc is GNU, FreeBSD is BSD, two different parts of the open source world [07:08:46] yes, makes sense ... in a sad way [07:09:11] both groups trying to make the world better... and duplicate the effort :( [07:09:30] aaanyway, thanks for all the info! [07:09:37] yeah that's a whole other subject of infinite debate, the great BSD-vs-GPL debate! 
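As an aside to the allocator discussion above: the usual way to try an allocator like tcmalloc for a single process, without replacing glibc malloc system-wide, is LD_PRELOAD. The package and library names below vary by release and are only illustrative, and note this is not what bblack did here — varnish bundled its own in-tree jemalloc, so he rebuilt the package instead:

    # Install Google's tcmalloc; the package name differs across releases
    # (e.g. libtcmalloc-minimal0 on older Ubuntu, libtcmalloc-minimal4 on newer ones)
    sudo apt-get install libtcmalloc-minimal4

    # Run one process with tcmalloc instead of glibc malloc; everything else keeps the default allocator.
    # 'my-threaded-daemon' is a stand-in name, and the library path may be under /usr/lib/x86_64-linux-gnu/.
    LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 ./my-threaded-daemon

    # Confirm which allocator the running process actually mapped in
    grep -E 'tcmalloc|jemalloc' "/proc/$(pidof my-threaded-daemon)/maps"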
[07:10:01] i wouldn't want to be in between - just sad that it exists [07:10:38] if you think about it though, apple exists because of freebsd :) [07:10:57] they would have been dead if they didn't switch to the new kernel [07:11:12] and now they have largest capitalization :( [07:15:10] hmm, not entirelly accurate - netbsd was also a player [07:33:28] greg-g, Reedy, is it possible for me to piggy-back on the deploy to Wikipedia? I just need to update the GettingStarted submodule. [07:33:34] Otherwise, I need to back another commit out. [07:34:22] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:35:16] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [07:41:56] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:41:16 AM UTC [07:45:56] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:45:22 AM UTC [07:56:26] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:57:16] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [07:57:48] (03PS1) 10Ori.livneh: Carbon: allow storage aggregation rules to be specified as class parameters [operations/puppet] - 10https://gerrit.wikimedia.org/r/99333 [07:58:21] (03CR) 10jenkins-bot: [V: 04-1] Carbon: allow storage aggregation rules to be specified as class parameters [operations/puppet] - 10https://gerrit.wikimedia.org/r/99333 (owner: 10Ori.livneh) [08:00:26] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:00:38] how dare you, jenkins [08:01:14] morning [08:01:40] hi paravoid [08:03:11] i ran into the weirdest puppet bug [08:05:06] * paravoid waits for it [08:05:16] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:06:29] aaron asked me to look over the logstash manifest with him, so i was all excited -- here's my chance! [08:06:35] look at me, i'm not a complete idiot! [08:06:38] (03PS2) 10Ori.livneh: Carbon: allow storage aggregation rules to be specified as class parameters [operations/puppet] - 10https://gerrit.wikimedia.org/r/99333 [08:06:55] so of course it didn't work and i got a duplicate class definition that i couldn't for the life of me fix [08:07:56] nik had this weird thing: [08:08:11] class role::elasticsearch inherits role::elasticsearch::config [08:10:09] okay...? [08:10:19] I'm here, I'm waiting for it :) [08:10:30] well, i thought it was a bit funny to have a class extend a class that was lower than it in the hierarchy [08:10:46] so i just renamed role::elasticsearch to role::elasticsearch::server and it fixed it [08:11:09] the "class foo inherits foo::params" is fairly common [08:11:20] well, the [08:11:26] i never use inheritance [08:11:40] "class foo($server = $foo::params::server) inherits foo::params" [08:11:47] it's not in my imaginary "puppet: the good parts" book [08:12:04] might be a pretty short book [08:12:07] (good m orning) [08:12:22] goooood morning [08:12:48] (03CR) 10Ori.livneh: [C: 032] Carbon: allow storage aggregation rules to be specified as class parameters [operations/puppet] - 10https://gerrit.wikimedia.org/r/99333 (owner: 10Ori.livneh) [08:14:19] so, where was the bug? [08:16:26] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [08:17:30] well, why was there a duplicate class definition? 
[08:17:49] references to 'elasticsearch' were all qualified to disambiguate them from the role class of the same name [08:20:36] PROBLEM - Puppet freshness on cp3013 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 02:19:29 AM UTC [08:22:48] (03PS1) 10Ori.livneh: graphite/carbon: Leave default aggregation pattern unspecified [operations/puppet] - 10https://gerrit.wikimedia.org/r/99334 [08:23:50] (03CR) 10Ori.livneh: [C: 032] graphite/carbon: Leave default aggregation pattern unspecified [operations/puppet] - 10https://gerrit.wikimedia.org/r/99334 (owner: 10Ori.livneh) [08:41:38] lo [08:43:13] apergos: if you are around, i got a few tiny changes for contint/beta in ops/puppet :-] [08:45:52] lay em on me [08:48:38] hashar: [08:49:30] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning. [08:49:39] apergos: here they are [08:49:46] https://gerrit.wikimedia.org/r/97526 needs curl on Jenkins slaves [08:50:00] that might cause duplicate definition though, not sure how to test it [08:50:34] I would just merge and fix if something wrong happens :D [08:50:53] https://gerrit.wikimedia.org/r/#/c/98155/ add another field in Zuul configuration erb template [08:51:20] https://gerrit.wikimedia.org/r/99196 djvulibre-bin package on contint slaves (needed for some MediaWiki core tests to exercise djvu rendering [08:51:47] then I have two changes for my python script that continuously update beta https://gerrit.wikimedia.org/r/99052 and https://gerrit.wikimedia.org/r/99053 [08:51:52] all of them are https://gerrit.wikimedia.org/r/#/q/status:open+owner:hashar+project:operations/puppet,n,z :-D [08:53:47] ok, lemme look [08:59:40] can't you if !defined(Package['blah']) { .. } for the first one? adding it to other places in the manifests that include curl as well [09:04:27] and while we're in here, a style question on the second one: what's preferred, package { [ 'blah' ]: attribs...} or package { 'blah': attribs...} ? because I see both in this file [09:05:55] (03PS2) 10ArielGlenn: beta: update Parsoid dependencies only on changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/99052 (owner: 10Hashar) [09:06:25] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:07:46] (03CR) 10ArielGlenn: [C: 032] beta: update Parsoid dependencies only on changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/99052 (owner: 10Hashar) [09:08:16] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:08:29] apergos: will later on :-) [09:08:31] thx! [09:08:38] apergos: in an audio right now [09:08:40] k [09:09:06] (03CR) 10Odder: [C: 031] Specify HTTPS for $wgCanonicalServer for all private wikis. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99299 (owner: 10MZMcBride) [09:09:13] (03PS2) 10ArielGlenn: beta: missing docstring in autoupdater [operations/puppet] - 10https://gerrit.wikimedia.org/r/99053 (owner: 10Hashar) [09:10:35] (03CR) 10ArielGlenn: [C: 032] beta: missing docstring in autoupdater [operations/puppet] - 10https://gerrit.wikimedia.org/r/99053 (owner: 10Hashar) [09:15:25] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:17:15] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:33:25] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:36:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:50:12] akosiaris, what's the deal with the Wikimedia PA wiki (pa-us.wikimedia.org)? [09:50:23] It's not accessible, but it's still in at least some of the config. [09:50:25] Was it killed? [09:51:27] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:52:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:52:23] (03CR) 10Mattflaschen: "There are problems with two of the dash ones (seems related to bug 31335)." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99299 (owner: 10MZMcBride) [09:55:19] (03PS1) 10Faidon Liambotis: varnish: sort bits esams backends [operations/puppet] - 10https://gerrit.wikimedia.org/r/99342 [09:55:27] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:55:44] (03CR) 10Faidon Liambotis: [C: 032] varnish: sort bits esams backends [operations/puppet] - 10https://gerrit.wikimedia.org/r/99342 (owner: 10Faidon Liambotis) [09:58:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [10:10:30] (03PS1) 10Faidon Liambotis: ganglia_new: write a gmond.pid pidfile [operations/puppet] - 10https://gerrit.wikimedia.org/r/99345 [10:11:01] (03CR) 10Faidon Liambotis: [C: 032 V: 032] ganglia_new: write a gmond.pid pidfile [operations/puppet] - 10https://gerrit.wikimedia.org/r/99345 (owner: 10Faidon Liambotis) [10:13:57] paravoid morning! [10:14:54] any questions for varnish patch? [10:15:08] (if you have time of course) [10:15:14] hi yurik, I haven't looked at it yet [10:15:58] paravoid, you know when Reedy generally comes on? [10:16:50] superm401: sorry, no [10:38:11] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:39:11] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [10:41:11] PROBLEM - Host ms5 is DOWN: PING CRITICAL - Packet loss = 100% [10:42:41] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:41:16 AM UTC [10:43:01] RECOVERY - Host ms5 is UP: PING OK - Packet loss = 0%, RTA = 35.67 ms [10:46:41] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:45:22 AM UTC [10:49:17] superm401: pa-us was closed by Reedy on Mar 14 but you are right that configs should not exist if the wiki is closed. I 'll talk with Reedy to figure this out [10:51:28] apergos: thank you for the merge. out for lunch, will revisit this afternoon. :-] [11:14:04] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
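The recurring terbium alerts above are the monitoring side's check_nrpe call giving up after its default 10-second client timeout while the job-queue check runs slowly, rather than necessarily a real job-queue outage — the matching RECOVERY usually follows within a minute or two. One way to confirm that from the monitoring host is to rerun the check by hand with a longer timeout; the plugin path and flags below are the standard NRPE ones, and the host/command names are taken from the alerts:

    # Re-run the same NRPE check manually, allowing 60 seconds instead of the default 10
    /usr/lib/nagios/plugins/check_nrpe -H terbium -c check_job_queue -t 60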
[11:15:04] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [11:21:24] PROBLEM - Puppet freshness on cp3013 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 02:19:29 AM UTC [11:43:20] (03PS1) 10Faidon Liambotis: Silence interface::ip with an onlyif [operations/puppet] - 10https://gerrit.wikimedia.org/r/99358 [11:44:04] (03CR) 10Faidon Liambotis: [C: 032] Silence interface::ip with an onlyif [operations/puppet] - 10https://gerrit.wikimedia.org/r/99358 (owner: 10Faidon Liambotis) [11:45:59] (03CR) 10Faidon Liambotis: [V: 032] Silence interface::ip with an onlyif [operations/puppet] - 10https://gerrit.wikimedia.org/r/99358 (owner: 10Faidon Liambotis) [11:48:18] (03PS1) 10Faidon Liambotis: Fixup for interface::ip [operations/puppet] - 10https://gerrit.wikimedia.org/r/99360 [11:48:29] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Fixup for interface::ip [operations/puppet] - 10https://gerrit.wikimedia.org/r/99360 (owner: 10Faidon Liambotis) [11:54:05] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:56:05] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [11:56:56] (03PS1) 10Faidon Liambotis: s/onlyif/unless/ on interface::ip (doh) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99361 [11:57:39] (03CR) 10Faidon Liambotis: [C: 032 V: 032] s/onlyif/unless/ on interface::ip (doh) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99361 (owner: 10Faidon Liambotis) [12:02:55] (03PS1) 10Matanya: puppet: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99365 [12:11:02] Nemo_bis still online? [12:12:53] (03PS1) 10Matanya: nrpe: two space to 4 space [operations/puppet] - 10https://gerrit.wikimedia.org/r/99369 [12:17:54] (03CR) 10Manybubbles: "Huh? I'm really confused. If this is the standard then great we'll follow it. Elasticsearch doesn't really have a ::client, though. Wh" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99307 (owner: 10Ori.livneh) [12:23:10] (03CR) 10Faidon Liambotis: [C: 04-1] logstash: add Ganglia group and specify aggregators (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99330 (owner: 10Ori.livneh) [12:28:53] (03CR) 10Faidon Liambotis: [C: 04-1] salt: lint cleanup (038 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99051 (owner: 10Matanya) [12:31:11] (03CR) 10Manybubbles: "I'll remove the -1 but I don't agree. My objection comes from the setting: "Automatically enable all new beta features". 
If someone chec" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98046 (owner: 10Legoktm) [12:43:37] (03PS1) 10Matanya: mysql_wmf : lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99374 [12:48:02] (03PS4) 10Matanya: salt: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99051 [12:50:07] (03PS2) 10Faidon Liambotis: Add redirects for mobile/wap.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/98058 [12:50:08] (03PS1) 10Faidon Liambotis: Make m/zero landing page redirect less aggressive [operations/apache-config] - 10https://gerrit.wikimedia.org/r/99375 [12:50:49] (03CR) 10Faidon Liambotis: [C: 032] Make m/zero landing page redirect less aggressive [operations/apache-config] - 10https://gerrit.wikimedia.org/r/99375 (owner: 10Faidon Liambotis) [13:05:52] (03PS1) 10Physikerwelt: added basic hbase support [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/99381 [13:07:29] (03CR) 10Physikerwelt: [C: 04-1] "author is wrong" [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/99381 (owner: 10Physikerwelt) [13:07:58] (03CR) 10jenkins-bot: [V: 04-1] added basic hbase support [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/99381 (owner: 10Physikerwelt) [13:09:51] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:10:07] (03PS2) 10Physikerwelt: added basic hbase support [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/99381 [13:10:49] (03CR) 10Faidon Liambotis: [C: 032] Add redirects for mobile/wap.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/98058 (owner: 10Faidon Liambotis) [13:11:41] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [13:14:11] http://whatthecommit.com/ [13:14:12] rofl [13:17:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [13:17:51] bblack: that you? [13:18:23] shouldn't be, no [13:18:34] even a restart could do it [13:18:52] nope [13:18:53] never mind [13:19:03] someone playing [13:19:20] POST http://www.wikipedia.org/ [13:19:25] heh [13:19:27] now why these are 503s, is beyond me [13:19:34] we have to dig through logs again i guess [13:20:25] I'm about to upgrade + restart 3012 though, last chance to object/delay :) [13:20:26] paravoid: can you imagine how things were when we had 3 people? [13:20:45] Ryan_Lane: I have a pretty good picture right now [13:20:48] :D [13:20:51] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:41] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [13:25:09] !log cp3012 running test varnish pkg w/ jemalloc 3.4.1 [13:25:25] Logged the message, Master [13:27:51] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:27] !log hashar synchronized php-1.23wmf5/extensions/ProofreadPage 'Update Proofreadpage {{gerrit|99042}}' [13:28:41] Logged the message, Master [13:28:41] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [13:34:46] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
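The lint cleanups going through review above (salt, nrpe, mysql_wmf) are the sort of thing puppet-lint reports from the command line. A rough local run against a manifest, with one check disabled, might look like this — the paths are examples and the exact check names depend on the puppet-lint version:

    # Install the linter (or `gem install puppet-lint` if it isn't packaged for your release)
    sudo apt-get install puppet-lint

    # Report style problems (indentation, quoting, alignment, ...) in a single manifest
    puppet-lint manifests/site.pp

    # Individual checks can be switched off, e.g. the 80-character line-length check
    puppet-lint --no-80chars-check modules/salt/manifests/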
[13:36:46] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [13:39:52] (03PS1) 10BBlack: remove cp3013 from mobile esams backend list [operations/puppet] - 10https://gerrit.wikimedia.org/r/99388 [13:43:36] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:41:16 AM UTC [13:43:56] PROBLEM - Varnish HTTP mobile-backend on cp3013 is CRITICAL: Connection refused [13:43:56] PROBLEM - Varnish HTTP mobile-frontend on cp3013 is CRITICAL: Connection refused [13:47:36] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:45:22 AM UTC [13:48:35] (03CR) 10BBlack: [C: 032 V: 032] remove cp3013 from mobile esams backend list [operations/puppet] - 10https://gerrit.wikimedia.org/r/99388 (owner: 10BBlack) [13:54:46] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:55:36] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [13:55:46] RECOVERY - Puppet freshness on cp3013 is OK: puppet ran at Thu Dec 5 13:55:45 UTC 2013 [14:01:20] (03Abandoned) 10Faidon Liambotis: Removed X-DfltLang and X-DfltPage headers [operations/puppet] - 10https://gerrit.wikimedia.org/r/86721 (owner: 10Yurik) [14:02:06] (03CR) 10Faidon Liambotis: [C: 032] Removed X-DfltLang & X-DfltPage from zero VCLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/97122 (owner: 10Yurik) [14:09:32] (03PS1) 10Faidon Liambotis: varnish: remove .{wap,mobile} . rewrites/redirects [operations/puppet] - 10https://gerrit.wikimedia.org/r/99394 [14:10:04] (03CR) 10Faidon Liambotis: [C: 032] varnish: remove .{wap,mobile} . rewrites/redirects [operations/puppet] - 10https://gerrit.wikimedia.org/r/99394 (owner: 10Faidon Liambotis) [14:10:11] (03CR) 10Faidon Liambotis: [V: 032] varnish: remove .{wap,mobile} . rewrites/redirects [operations/puppet] - 10https://gerrit.wikimedia.org/r/99394 (owner: 10Faidon Liambotis) [14:11:25] paravoid, you are doing it?!?!?!? [14:11:35] * yurik-road2 hides [14:12:20] am I doing what? [14:12:22] did I break something? [14:12:34] no, but you +2 the varnish change :) [14:12:38] checking.... [14:12:40] not just that [14:12:43] oh? [14:12:45] see the commits above [14:12:47] what else did you break? [14:12:53] * yurik-road2 looking... [14:13:00] moar cleanups [14:13:16] (03CR) 10Nikerabbit: varnish: remove .{wap,mobile} . rewrites/redirects (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99394 (owner: 10Faidon Liambotis) [14:13:39] Nikerabbit: correct, too late :( [14:14:29] paravoid [14:14:34] yes [14:14:40] i really not sure about that cleanup [14:14:48] what if we had some weird obligations semwhere [14:14:55] obligations of what? [14:15:00] for [14:15:08] for some legacy app or something else [14:15:16] i would have to check with brion [14:15:20] when he wakes up [14:15:20] I don't understand [14:15:28] what change are you talking about? [14:15:55] bblack: I need to run puppet on the mobile esams boxes I'm afraid [14:15:59] i vagly remember someone talking long time ago about strange attempts at mobile or wap or some other stuff via a separate domain [14:16:05] https://gerrit.wikimedia.org/r/#/c/99394/ [14:16:15] maybe you're thinking of the apple dictionary gateway? [14:16:39] paravoid: you can, cp3011+2. 
The only reason they're still disabled at the moment is I know it's going to trigger a varnish restart, and I didn't want that while I was watching close for problems [14:16:47] again - i don't know myself, just remember someone talking about it. It could be anything, but i really would rather ask brion & dan about this [14:16:48] but go ahead [14:16:53] bblack: reload you mean? [14:17:04] well, we'll see [14:17:19] !log bouncing labstore1001 (kernel upgrade) [14:17:33] paravoid: it happens, just left a comment in case someone is looking at it afterwards [14:17:36] Logged the message, Master [14:18:41] hmm [14:18:49] bots are hitting en.mobile.wikipedia.org for some reason [14:19:17] told you, something might be weird there - it could be an old app [14:19:22] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:30] googlebot is an old app? :) [14:19:40] it certainly is :) [14:19:56] paravoid, could it be the google's wap gateway? [14:20:06] no [14:20:09] they apparently have something like that, although it probably just refactors mobile [14:20:22] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:20:32] RECOVERY - DPKG on labstore1001 is OK: All packages OK [14:21:01] paravoid: it's enabled whenever you want to run, if you didn't already [14:21:03] 301ing those is right nevertheless [14:21:04] paravoid,in any case, i don't think its a good idea to delete domains like that without an email to the mailing list [14:21:10] I didn't delete them [14:21:20] i thought you did? [14:21:22] no [14:21:40] I'm emitting 301s from apache instead of serving them directly [14:21:53] but since there are hits, I'll play it safe and revert [14:22:04] thx [14:22:20] and btw, in that cleanup you should have removed 666 handling i think, since now nothing throws it [14:22:34] wrong, again [14:22:36] look closer [14:22:51] my commit message links to another commit, go read that. [14:23:03] paravoid: also, jemalloc-3.4.1 looks pretty stable so far on cp3012, still have old setup on cp3011. How long do you think we should let that go before it's sane enough to push elsewhere? wait for the daily assert again? [14:23:46] (03PS1) 10Faidon Liambotis: Revert "varnish: remove .{wap,mobile}. rewrites/redirects" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99395 [14:25:13] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [14:25:42] (03PS2) 10Faidon Liambotis: Revert "varnish: remove .{wap,mobile}. rewrites/redirects" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99395 [14:25:57] bblack: sounds good to me [14:26:12] paravoid, i looked at the linked patch https://gerrit.wikimedia.org/r/#/c/98058/ -- it does not produce 666 errors from what i can see [14:26:14] bblack: but now wouldn't hurt much either I think [14:26:38] yurik-road2: there is no 666 error in HTTP [14:26:48] this is an internal varnish hack to do 302 redirects [14:27:10] (03CR) 10Faidon Liambotis: [C: 032] Revert "varnish: remove .{wap,mobile}. rewrites/redirects" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99395 (owner: 10Faidon Liambotis) [14:27:29] that would be awesome though :-) [14:27:42] MaxSem: would you happen to know why google & bingbot hit en.mobile. instead of en.m. ? [14:28:06] paravoid, correct, but you have removed (now reverted) all the raise 666 from the mobile.frontend file - so you don't need to handle if (obj.status == 666) { anymore [14:28:07] paravoid, there are links to it? 
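A quick sanity check when swapping allocator builds like the jemalloc 3.4.1 test package discussed above is to see what the installed binary links against. This only shows dynamically linked libraries, so the old in-tree jemalloc copy bundled into varnish would not appear here — which is itself a useful signal:

    # Does the installed varnishd link an external allocator at all?
    ldd "$(which varnishd)" | grep -Ei 'jemalloc|tcmalloc' \
        || echo "no external allocator linked (bundled copy or plain glibc malloc)"

    # And which varnish package version is actually installed on this box
    dpkg -l varnish | tail -n 1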
[14:28:40] hm, should link rel="canonical" take care of that? [14:29:07] paravoid, links on the interwebs [14:30:02] was .mobile. used before .m. ? [14:31:47] if i recall it correctly, yes [14:32:52] RECOVERY - Varnish HTTP mobile-backend on cp3013 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.194 second response time [14:32:52] RECOVERY - Varnish HTTP mobile-frontend on cp3013 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.198 second response time [14:34:34] 74 transcode.php, 795 .mobile., 1838608 .m. [14:34:35] http://en.wikipedia.org/wiki/Help:Options_to_hide_an_image#Go_to_the_mobile_Wikipedia_site_.28en.mobile.wikipedia.org.29_and_disable_all_images [14:34:37] in a day [14:34:40] bugagaga [14:34:42] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: Offset unknown [14:35:05] mozilla firefox 2.0 [14:35:06] awesome [14:35:39] yurik-road2: again, I didn't delete the domain before [14:35:51] links would continue to work normally as they were before [14:36:07] stop panicing :) [14:36:42] paravoid, not panicing - just saw a link to mobile, was funny. As for redirects - yes, understood what you did. As for 666 - i don't see why you wanted to keep it [14:36:49] * bblack panics [14:36:52] keep what [14:36:53] ? [14:37:16] the error 666 handling in mobile.frontend when you deleted the code that raised them [14:37:30] you said i was wrong, trying to figure out how [14:37:32] I didn't want to keep it, that's why I deleted it [14:37:38] but you didn't [14:37:44] ? [14:38:24] paravoid, sub vcl_error had an if (obj.status == 666) { left in it [14:38:49] so i'm trying to understand why you wanted to keep it [14:38:52] it doesn't matter [14:39:14] it's dead code right now [14:39:24] but it's useful to have something to throw redirects if needed [14:39:27] right, that's what i thought and was surprised when you said i was wrong [14:39:42] oki, gotcha [14:39:53] we could deploy https://www.varnish-cache.org/vmod/redirect at some point [14:39:56] anyway, i tried the m landing - seems to be working [14:40:08] will play with zero landing in a sec [14:40:37] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset -0.00190615654 secs [14:40:47] yeah, seems like a much cleaner solution [14:40:57] RECOVERY - Puppet freshness on cp3012 is OK: puppet ran at Thu Dec 5 14:40:54 UTC 2013 [14:41:34] !log mobile esams caches restarted again on new packages w/ new jemalloc, puppet's enabled there and should be stable [14:41:51] Logged the message, Master [14:42:01] paravoid, have you seen my email about WAP deprecation? [14:42:17] MaxSem: I have, haven't gotten to it yet [14:42:55] MaxSem: I don't expect to find the time to reply this week [14:42:58] probably next... [14:43:13] I'm sorry, I know it sucks [14:43:17] RECOVERY - Puppet freshness on cp3011 is OK: puppet ran at Thu Dec 5 14:43:11 UTC 2013 [14:43:25] np, just don't forget about it:) [14:43:36] I've starred it [14:43:47] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
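Counts like the .mobile. vs .m. figures quoted above can be pulled from a sampled request log with a one-liner along these lines; the file name and the assumption that full URLs appear verbatim in the log are hypothetical here, so adjust to the actual log format:

    # Tally sampled requests hitting the legacy hostname pattern vs the current one
    grep -oE '\.(m|mobile)\.wikipedia\.org' sampled-requests.log | sort | uniq -c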
[14:44:06] (03PS1) 10BBlack: varnish (3.0.3plus~rc1-wm23) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/99403 [14:44:15] my starred label has 28 mails right now [14:44:47] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [14:45:46] time for gmail to implement a very-starred status to sort out the starred queue :) [14:45:56] bblack: don't upgrade eqiad yet [14:46:03] ok [14:46:11] let's have something to fall back on if this crashes spectularly [14:46:27] and by that I mean losing the persistent store like the day before yesterday [14:46:40] what's on cp301[12] now supposedly has the netmapper memleak fix as well. if it's stable we should see whether that's true in the graphs [14:46:48] nod [14:48:10] bblack, you multiple inboxes labs feature - you can set up whichever filter to show the mail you intersted in the most in a separate section(s) at the top [14:48:18] s/you/use [14:49:08] MaxSem: I added this to the SoS dependency wall, card #54 [14:49:10] I want the labs feature that just shows me 1 email in my whole inbox every time I look at gmail, and it's the 1 email I need to look at right now to work on whatever I'm doing next. [14:49:27] https://mingle.corp.wikimedia.org/projects/scrum_of_scrums/cards/grid?color_by=status&filters[]=[Show+on+Wall][is][Yes]&group_by[lane]=team+dependency&group_by[row]=team&lanes=+%2CAnalytics%2CCore+Features%2CGrowth%2CLanguage%2CMobile%2CMobile+apps%2COperations%2CParsoid%2CPlatform%2CQA%2CVisual+Editor%2CWikipedia+Zero&tab=Dependency+Wall [14:49:31] cool [14:49:48] bblack, that system is called secretary [14:49:58] yes, that's what I need, a secretary [14:50:36] or a brain upgrade that supports multitasking [14:51:10] (03CR) 10BBlack: [C: 032 V: 032] varnish (3.0.3plus~rc1-wm23) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/99403 (owner: 10BBlack) [14:51:14] (03PS1) 10Aude: Add Item and Item_talk namespace aliases for Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99405 [14:51:50] bblack, found it! http://www.treasuresoftware.com/ps.html [14:51:56] (03CR) 10Aude: "no idea why we never had these. I always had these in my test wikis" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99405 (owner: 10Aude) [14:53:15] lol, that's for managing a bowling league :) [14:53:54] but at least its "perfect" !!! [14:54:51] seriously though - try multiple inboxes - i use it to show top 10 "drafts & starred" at the top of the screen [14:56:47] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:47] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [15:01:47] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:02:47] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [15:05:05] wonder why it socket timeout :/ [15:06:10] (03PS1) 10Yurik: Zero: Removed explicit 404-01 carrier because will be covered by default case [operations/puppet] - 10https://gerrit.wikimedia.org/r/99408 [15:06:28] error 404: carrier not found [15:06:54] hehe [15:07:20] (03CR) 10Faidon Liambotis: [C: 032] Zero: Removed explicit 404-01 carrier because will be covered by default case [operations/puppet] - 10https://gerrit.wikimedia.org/r/99408 (owner: 10Yurik) [15:07:29] wow [15:07:31] i mean [15:07:32] wow [15:07:40] paravoid, you ok? 
[15:07:51] this has by far been the quickest +2!!! [15:08:08] * yurik-road2 gives paravoid a chocolate cookie ! [15:08:33] he secretly set up a bot that auto-reviews your commits, it +2's a random 1 out of every 10 :) [15:08:47] hahahaha [15:09:00] yei!!! so if i do lots of patches with minor variation.... mmmm.... i'm the king of the world!!! [15:09:16] brandon's guide to hacking ops... [15:10:04] * yurik-road2 secretly expects a revert patch to come in any second now... [15:14:37] (03CR) 10Akosiaris: [C: 032] "Some minor nitpicks. If you decide to fix them, submit another patch else ping me to merge" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99051 (owner: 10Matanya) [15:15:10] pushing a patch akosiaris [15:15:22] :-) [15:17:36] (03PS5) 10Matanya: salt: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99051 [15:18:03] here ^ you go :) [15:18:09] (03CR) 10Akosiaris: [C: 032] salt: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99051 (owner: 10Matanya) [15:19:08] akosiaris: i really appricate all your review efforts. (you too paravoid) [15:19:39] no worries [15:23:38] bblack: are there any relevent alternatives to autoconf? [15:23:46] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:26:33] Snaps: some people like cmake, but it's really not as complete a solution, even though it's much simpler [15:27:22] or I guess scons too :) [15:27:54] (03PS1) 10Hashar: nrpe: let us specify timeout of check_nrpe [operations/puppet] - 10https://gerrit.wikimedia.org/r/99410 [15:27:55] (03PS1) 10Hashar: icinga: raise timeout of check_job_queue nrpe command [operations/puppet] - 10https://gerrit.wikimedia.org/r/99411 [15:27:59] but mostly, for portable stuff posixy systems-level stuff, I view autoconf as still being a necessary evil, unless the project is very simple [15:28:12] the topic is librdkafka [15:28:46] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [15:30:33] if it were me, I'd use autoconf for it [15:30:52] (/automake/libtool) [15:31:26] bblack: so why don't you use autoconf ? :D [15:32:36] if you want, I could do an autoconf conversion pull req on librdkafka's github to get you started, copy a bunch of boilerplate from elsewhere [15:36:51] you would be considered evil bblack :D [15:37:56] :) [15:43:15] !log added librdkafka 0.8.1-1~precise1 to apt [15:43:29] Logged the message, Master [15:49:53] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:50:43] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [15:55:53] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:53] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [16:04:11] (03PS29) 10Ottomata: Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [16:05:42] (03CR) 10Ottomata: [C: 032 V: 032] Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [16:05:57] (03CR) 10BryanDavis: "Ori has given a +1 with intent to merge on Ie568f268b1. 
This still doesn't need to go out until we are cleared for prod release (still pen" [operations/dns] - 10https://gerrit.wikimedia.org/r/98849 (owner: 10BryanDavis) [16:05:59] woot!, 28 patchsets later :) [16:06:59] !log added varnishkafka 1.0.0-1 to apt [16:07:15] Logged the message, Master [16:07:52] heya bblack, i'm going to install varnishkafka on 3 mobiles today [16:08:03] faidon said I should check with you to make sure we don't bump heads [16:08:18] you're working debugging some mobile varnish stufff, right? [16:10:47] (03CR) 10BryanDavis: "Production release of app is blocked pending resolution of Bug 57546 (security review). This probably should not be merged until that is r" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [16:11:50] (03CR) 10Andrew Bogott: [C: 032] toollabs: remove old tips absent declarations. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98379 (owner: 10Matanya) [16:13:13] thanks andrewbogott :) [16:19:34] heya akosiaris, you around? [16:19:44] want to walk through this varnishkafka thing with me? [16:19:55] (03CR) 10Andrew Bogott: [C: 032] puppet: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99365 (owner: 10Matanya) [16:20:29] ottomata: yes [16:21:15] cool, ok [16:21:29] i just got a 10 minute popup reminder that I have my 1on1 with toby in 10 minutes [16:21:40] real quick, i'll show what i'm about to do [16:21:43] (03CR) 10Matanya: "some nitpicks, based on style guide." (036 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [16:21:45] and we can create the new topic in kafka [16:21:48] so [16:22:25] i'm going to deploy this to cp1046,cp3011,cp4011 [16:22:27] one mobile in each dc [16:22:36] there's no traffic in ulsfo right now though, right? [16:22:44] exactly [16:22:48] k [16:23:22] mwalker|away: reedy didn't get back to you, right? [16:23:41] mwalker|away: generally, yeah, probably, just be ready to help before so reedy knows what's going one :) [16:23:44] er on [16:23:46] ok, so we are going to produce to kafka on topic => 'webrequest-mobile' [16:23:57] that topic doesn't exist yet, and librdkafka doesn't auto create topics [16:24:08] so we have to create it ourselves using the kafka cli [16:24:24] akosiaris: want to join a screen on analytics1021.eqiad.wmnet [16:24:24] ? [16:25:02] k [16:25:12] you there? 
[16:25:15] screen -x kakfa [16:25:18] screen -x kafka [16:25:44] coool :) [16:25:50] (03PS8) 10Dzahn: bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 [16:26:23] (03CR) 10jenkins-bot: [V: 04-1] bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [16:26:27] (03CR) 10Dzahn: bugzilla module (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [16:26:32] so so, notice that there is a kafka.sh profile.d script that sets ZOOKEEPER_URL [16:26:36] this is installed by puppet [16:26:46] and it a convenience, the kafka cli will look for that env var [16:26:57] if it is set, you don't have to set that flag all the time on all the kafka commands [16:27:16] ok you can see that I have one topic created [16:27:18] called 'test' [16:27:30] it has 10 partitions (there are 10 kafka log disk partitions) [16:27:34] and 2 replicas [16:27:52] you can also see that the leaders for each topic-partition are spread evenly between each broker [16:28:05] if you ever have to restart a broker, the leader will change [16:28:12] so if we restarted the broker on an21 [16:28:17] the leader for this topic would switch to an22 [16:28:28] it never changes back automatically, you have to tell it to do so [16:28:41] that's what preferred-replica-election [16:28:45] it just rebalacnes normally [16:28:48] anyway, we don't have to do that now [16:28:52] (03PS9) 10Dzahn: bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 [16:28:54] we want to create the webrequest-mobile topic [16:29:12] so, i'm going to create it with 10 partitions and 2 replicas [16:29:14] just like test [16:29:31] easy enough :-) [16:29:31] so, there we go, now it exists [16:29:34] yup [16:29:55] ok, my 1on1 with toby is about to start, let's pick this back up when I'm done? [16:30:01] ok [16:30:44] (03CR) 10Dzahn: [C: 031] "weekday and other comments by alex on PS7: done" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [16:31:49] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:31:53] akosiaris: thanks for all your reviews, didn't get to it, yesterday and today on datacenter visits.. but alll comments done [16:32:21] and, yea, wasnt about to merge that firewall bastion thing :p very good catch to -1 it though [16:32:45] hm akosiaris, no toby yet and he's not on irc [16:32:49] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [16:32:54] so uhhh, let's merge! [16:33:27] i betcha there will be at least one puppet error [16:33:27] :p [16:34:18] (03PS12) 10Ottomata: Setting up varnishkafka on 3 mobile varnish hosts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 [16:34:31] (03CR) 10Ottomata: [C: 032 V: 032] Setting up varnishkafka on 3 mobile varnish hosts. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [16:35:47] ok, running puppet on cp1046 and cp3011 [16:36:49] (03CR) 10Akosiaris: [C: 032] bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [16:37:19] (03PS2) 10Dzahn: remove outdated tesla subnet from dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/96489 [16:37:25] mutante: ^ :) Thanks for all the work :) [16:37:30] akosiaris: weee:) [16:37:46] (03CR) 10Andrew Bogott: [C: 032] mysql_wmf : lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99374 (owner: 10Matanya) [16:39:42] (03CR) 10Dzahn: [C: 031] "i changed this patch to ONLY remove the tesla subnet but not touch any other networks" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96489 (owner: 10Dzahn) [16:39:47] matanya: both of those linting patches applied with no diff -- nice work! [16:40:08] (03PS1) 10Ottomata: Depending on proper package name ganglia-monitor [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/99417 [16:40:08] thanks andrewbogott :) [16:40:09] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 2352 MB (6% inode=93%): [16:40:26] ottomata: seems like you were right ? [16:40:35] hehe [16:40:39] (03CR) 10Ottomata: [C: 032 V: 032] Depending on proper package name ganglia-monitor [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/99417 (owner: 10Ottomata) [16:40:46] fixing the lutetium issue [16:41:49] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:42:13] akosiaris: i'll do the role/site.pp change putting it on zirconium tomorrow probably.. gotta run to visit dc.. thx again, cya [16:42:26] ok bye :-) [16:42:37] (sacramento) [16:42:56] (03PS1) 10Ottomata: Fixing kafka broker config in role/cache.pp, updating varnishkafka module [operations/puppet] - 10https://gerrit.wikimedia.org/r/99418 [16:43:19] (03CR) 10Ottomata: [C: 032 V: 032] Fixing kafka broker config in role/cache.pp, updating varnishkafka module [operations/puppet] - 10https://gerrit.wikimedia.org/r/99418 (owner: 10Ottomata) [16:43:19] see you later guys [16:45:09] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 2352 MB (6% inode=93%): [16:45:49] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [16:48:11] (03PS1) 10Ottomata: Fixing parameter name for logline_scratch_size [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/99420 [16:48:21] (03CR) 10Ottomata: [C: 032 V: 032] Fixing parameter name for logline_scratch_size [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/99420 (owner: 10Ottomata) [16:49:07] (03PS1) 10Ottomata: Updating varnishkafka module with logline_scratch_size fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/99421 [16:49:20] (03CR) 10Ottomata: [C: 032 V: 032] Updating varnishkafka module with logline_scratch_size fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/99421 (owner: 10Ottomata) [16:50:09] RECOVERY - check_disk on lutetium is OK: DISK OK - free space: / 26779 MB (75% inode=93%): [16:50:50] akosiaris: almost! i've started a console-consumer for the webrequest-mobile topic in our screen [16:50:56] wop, there it goes! [16:50:59] i 've seen it [16:51:02] a lot of data... [16:51:16] so that is just mobile ? [16:51:22] cool, i see both hosts [16:51:23] sampling ? 
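For reference, the topic-creation walkthrough from the screen session above maps onto Kafka's stock command-line tools roughly as follows. This is a sketch using upstream 0.8.1-style script names; the puppet-installed kafka wrapper mentioned earlier reads ZOOKEEPER_URL from the profile.d kafka.sh script, so on the brokers the explicit --zookeeper flag can usually be dropped (the wrapper's exact syntax is an assumption):

    # List existing topics (only 'test' before this change).
    kafka-topics.sh --zookeeper "$ZOOKEEPER_URL" --list

    # Create the new topic with the same layout as 'test':
    # 10 partitions (one per kafka log disk) and 2 replicas.
    kafka-topics.sh --zookeeper "$ZOOKEEPER_URL" --create \
        --topic webrequest-mobile --partitions 10 --replication-factor 2

    # Show which broker leads each partition.
    kafka-topics.sh --zookeeper "$ZOOKEEPER_URL" --describe --topic webrequest-mobile

    # After a broker restart, leadership does not move back on its own;
    # trigger the rebalance to the preferred replicas explicitly.
    kafka-preferred-replica-election.sh --zookeeper "$ZOOKEEPER_URL"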
[16:51:27] that's just 2 mobile hosts [16:51:30] no sampling [16:51:47] hmmm cool :-) [16:52:36] COOOOL [16:54:43] !log deployed varnishkafka 1.0.0-1 to cp1046,cp3011,cp4011 [16:54:59] Logged the message, Master [16:57:49] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:01:49] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:02:05] (03CR) 10Legoktm: [C: 031] Correct capitalization of "ShoutWiki". [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/96018 (owner: 10Jack Phoenix) [17:13:50] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:15:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:18:50] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:19:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:23:23] (03CR) 10Ori.livneh: "stet!" (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [17:23:41] hi is the puppet zookeeper module used somewhere . If I try to install it I get The following packages have unmet dependencies: zookeeperd : Depends: zookeeper (= 3.3.5+dfsg1-1ubuntu1) but 3.4.5+20-1.cdh4.3.1.p0.76~precise-cdh4.3.1 is to be installed E: Unable to correct problems, you have held broken packages. [17:25:50] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:29:37] !log catrope synchronized php-1.23wmf5/resources/oojs/oojs.js 'Add oojs' [17:29:52] Logged the message, Master [17:29:52] !log catrope synchronized php-1.23wmf5/resources/Resources.php 'Add oojs' [17:30:08] Logged the message, Master [17:30:24] !log catrope synchronized php-1.23wmf5/extensions/VisualEditor 'oojs fixes' [17:30:39] Logged the message, Master [17:30:42] !log catrope synchronized php-1.23wmf5/extensions/MultimediaViewer 'oojs fixes' [17:30:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:30:58] Logged the message, Master [17:36:28] paravoid, would you mind taking a look at https://gerrit.wikimedia.org/r/99272 and adding your feedback? yurik and i are in violent agreement and could use your advice [17:38:00] !log catrope synchronized php-1.23wmf5/resources/oojs/oojs.js 'I synced you before, now start existing' [17:38:17] Logged the message, Master [17:38:38] !log catrope synchronized php-1.23wmf5/resources/ [17:38:55] Logged the message, Master [17:42:42] andrewbogott: great [17:43:25] * andrewbogott reads about plural support [17:43:26] andrewbogott: Here's some of the recent work on undeclared class properties in core: https://gerrit.wikimedia.org/r/#/q/project:mediawiki/core+branch:master+topic:visibility,n,z [17:43:39] andrewbogott: https://www.mediawiki.org/wiki/Manual:Messages_API [17:45:17] dr0ptp4kt: I'm lacking too much context :( [17:47:04] paravoid, okay. you able to do a google hangout real quick? [17:47:39] ok [17:48:16] "calling" [17:48:44] (03CR) 10Matanya: Add configuration for Wikimania Scholarships (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [17:50:32] greg-g: I'm a bit confused; what was reed_ supposed to get back to me on? [17:54:57] mwalker: you pinged him and me about a deploy today? [17:55:07] gah, nope, not you [17:55:10] mwalker: my bad [17:55:14] no worries! 
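The sanity check described above -- watching the new topic from a console consumer while the first varnishkafka instances produce into it -- looks roughly like this with the stock 0.8 tools (the locally installed wrapper may spell it differently):

    # Tail webrequest-mobile; with two mobile caches producing unsampled
    # requests this should scroll fast.
    kafka-console-consumer.sh --zookeeper "$ZOOKEEPER_URL" \
        --topic webrequest-mobile --from-beginning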
[17:55:28] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:55:36] superm401: reedy didn't get back to you, right? [17:55:56] matt vs matt [17:56:01] in my head [17:56:14] very much like spy vs spy [17:57:11] (03CR) 10BryanDavis: Add configuration for Wikimania Scholarships (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [17:57:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [18:00:28] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:01:11] but he's super matt [18:01:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [18:01:52] mwalker: so you have it cut out for it, good luck ;) [18:09:29] oh hey Reedy, did you see that thing I sent you re download.wikimedia? [18:10:06] Nope [18:10:13] I can't do anything with it anyway... [18:10:23] right, well... [18:11:02] Reedy: ah, I buried it in an email, the "RelEng & QA followup" subject one [18:11:13] last paragraph that no one reads ;) [18:11:28] (I mean that seriously, I totally buried it) [18:13:35] greg-g: Can you summarise? :P [18:13:46] good call [18:14:30] basically, was chatting with Antoine, and he and I thought you might be able to help with more reliably getting tarballs on downloaddot, dealing with the swift backend [18:14:51] no in that you know everything now, but in that, it'd be a cool project to learn stuff [18:14:57] s/no/not/ [18:21:02] (03CR) 10MaxSem: Add configuration for Wikimania Scholarships (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [18:25:22] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:26:22] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [18:28:01] greg-g, no, no one got back to me yet. I'm glad to explain it. [18:28:23] Basically, I moved some messages out of GuidedTour, and I want to move them into GettingStarted before prod breaks. [18:28:30] Meant to do this earlier, but there was an oversight. [18:29:03] this is different than the pa-us issue, right? [18:29:15] paravoid, yeah, totally unrelated. [18:29:47] okay, let's just invoke Reedy's wisdom to this issue too then [18:30:09] :) [18:30:29] well, sounds like we should do it pre deploy/with the deploy that's about to happen :) [18:30:36] In theory... [18:30:49] Are you also changing the key names? [18:31:38] Reedy, yeah, that's what I'm requesting. [18:31:53] To do piggy back on the one at 11 PST. [18:31:59] Not changing the key names. [18:32:01] Or values. [18:32:12] I'm confused [18:32:15] What are you requesting? [18:32:44] To do an extension submodule update on wmf5 before you rotate wmf5 to Wikipedia. [18:33:15] Oh, right [18:33:24] Yeah, should be fine [18:33:29] Okay, thanks. [18:33:43] Have you made a commit to do the update? [18:34:02] Yes [18:34:08] https://gerrit.wikimedia.org/r/99338 [18:34:31] One big scappy family [18:34:58] Reedy, thanks, do you want me to update the submodule on tin, or will that get taken care of? [18:35:06] I'm doing it now :) [18:35:21] Thank you [18:35:22] "One big scappy family" made me laugh way harder than it should have. 
:D [18:35:49] php-1.22wmf15 through php-1.23wmf6 [18:35:52] I should cleanup [18:39:49] bd808: sorry to dirty your module :| [18:40:04] matanya: No worries [18:40:58] I'd only be mad if you slapped a -2 on it :) [18:41:09] :) [18:41:11] And I'd get over that [18:45:31] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:31] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [18:54:31] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:41] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [18:55:41] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [18:56:01] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [18:56:01] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [18:56:01] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [18:56:04] !log reedy synchronized php-1.23wmf6 'Staging' [18:56:18] hrm [18:56:18] Logged the message, Master [18:56:21] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [18:56:31] uh oh [18:56:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [18:56:41] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [18:56:45] ugh [18:56:52] those are image scalers [18:57:01] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [18:57:07] huh, right, rendering.svc, makes sense [18:58:02] PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:58:14] what's going on? [18:58:15] i see a flood of /usr/bin/convert jobs [18:58:28] image scalers went nuts [18:58:51] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [18:58:52] Reedy: do you know of any large uploads going on? [18:59:00] I'm not doing any [18:59:07] well then [19:00:02] PROBLEM - RAID on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:09] Oh there we go [19:00:11] PROBLEM - DPKG on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:11] PROBLEM - Disk space on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:11] PROBLEM - Swift HTTP backend on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:11] PROBLEM - puppet disabled on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:11] PROBLEM - SSH on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:17] RoanKattouw: ? [19:00:24] Now the media storage backends are overloading [19:00:30] Or at least 1004 is [19:00:33] * greg-g nods [19:00:44] (It probably doesn't actually have RAID problems, but if the checks are timing out, that means it's under high stress) [19:00:47] AaronSchulz: ^^^ do you know anything about what's going on? 
[19:01:23] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND [19:01:23] 24157 apache 20 0 322m 263m 2648 R 100 2.2 0:35.09 convert [19:02:01] RECOVERY - Disk space on ms-fe1004 is OK: DISK OK [19:02:11] RECOVERY - puppet disabled on ms-fe1004 is OK: OK [19:02:11] RECOVERY - DPKG on ms-fe1004 is OK: All packages OK [19:02:11] RECOVERY - SSH on ms-fe1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:02:45] welcome back 1004 [19:03:52] RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.310 second response time [19:05:16] now what about the renderers though, did they just blow up completely and the storage is back becuase it isn't getting hammered anymore? [19:06:56] "Conversion" sounds so... cultish. :-) [19:06:59] I mean I'm on mw1159 and it doesn't seem to be unhappy or down [19:07:09] PROBLEM - Disk space on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:07:09] PROBLEM - DPKG on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:07:46] apergos: but icinga doesn't think it's recovered though, right? [19:07:59] RECOVERY - Disk space on ms-fe1004 is OK: DISK OK [19:07:59] RECOVERY - DPKG on ms-fe1004 is OK: All packages OK [19:08:03] it's been ooming processes though [19:08:09] converts [19:08:09] RECOVERY - Swift HTTP backend on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.078 second response time [19:08:09] * greg-g nods [19:08:38] but that seems to be par for the course [19:08:45] !log reedy updated /a/common to {{Gerrit|I18d8a12b1}}: repool es1002 after upgrade [19:08:46] I mean I see them from the time the syslog started [19:08:50] RECOVERY - RAID on ms-fe1004 is OK: NRPE: Unable to read output [19:08:59] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.460 second response time [19:09:01] Logged the message, Master [19:09:39] * greg-g waits for it... [19:09:40] icinga thinks apache is unreachable [19:09:46] (on mw1159) [19:09:58] hmmmm [19:10:10] which of our extensions touch the parser cache? cirrus, maybe? [19:10:23] (asking re: https://bugzilla.wikimedia.org/show_bug.cgi?id=58042 ) [19:11:19] PROBLEM - Swift HTTP backend on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:29] bah, still going [19:11:59] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [19:12:21] ori-l: have you been able to figure out where the requests are coming from? [19:13:35] greg-g: The internet [19:13:37] !log reedy started scap: testwiki to 1.23wmf6, build l10n cache and rebuild for 1.23wmf5 [19:14:14] Reedy: jerk [19:14:39] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:15:08] Logged the message, Master [19:15:19] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:15:29] that was me resarting the apache on there [19:15:29] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:16:19] PROBLEM - DPKG on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
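When the scalers pile up like in the top snippet above, a quick way to see how bad the convert backlog is on a given box (plain GNU ps, nothing WMF-specific) is:

    # ImageMagick convert jobs, busiest first, with how long each has run.
    ps -C convert -o pid,pcpu,pmem,etime,args --sort=-pcpu | head -n 20

    # Count them; dozens of long-running converts on one scaler is a red flag.
    pgrep -c convert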
[19:17:09] RECOVERY - Swift HTTP backend on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.077 second response time [19:17:19] RECOVERY - DPKG on ms-fe1001 is OK: All packages OK [19:17:39] PROBLEM - SSH on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:50] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.109 second response time [19:17:59] PROBLEM - Swift HTTP backend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:59] PROBLEM - puppet disabled on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:17:59] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:06] Reedy: why would you scap? [19:18:09] PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:29] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [19:18:39] PROBLEM - Swift HTTP frontend on ms-fe1001 is CRITICAL: Connection timed out [19:18:50] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.025 second response time [19:18:53] RECOVERY - puppet disabled on ms-fe1001 is OK: OK [19:18:57] right, so, there's been no recovery yet, further deploys on hold [19:18:59] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:59] RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 1.057 second response time [19:19:54] paravoid: ^^ thoughts on the image scalers dieing? [19:20:19] PROBLEM - Swift HTTP backend on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:39] PROBLEM - Disk space on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:39] RECOVERY - Swift HTTP frontend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 137 bytes in 4.251 second response time [19:20:49] RECOVERY - Swift HTTP backend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 3.325 second response time [19:20:59] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:21:02] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Jobrunners+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [19:21:20] AaronSchulz: any idea? [19:21:39] RECOVERY - Disk space on ms-fe1001 is OK: DISK OK [19:21:55] I'd like to know what went down just before they started falling over [19:21:59] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 8.037 second response time [19:22:09] PROBLEM - SSH on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:09] PROBLEM - Disk space on ms-fe1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:09] RECOVERY - Swift HTTP backend on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.078 second response time [19:22:10] PROBLEM - RAID on ms-fe1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:19] PROBLEM - RAID on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:19] PROBLEM - DPKG on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:22:38] apergos: seems you are workign the outage, let me know if you need me to pull myself, or chris/daniel/leslie out of this meeting [19:22:39] RECOVERY - SSH on ms-fe1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:22:59] RECOVERY - Disk space on ms-fe1002 is OK: DISK OK [19:22:59] RECOVERY - SSH on ms-fe1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:23:06] Wow, what happened [19:23:09] not working it very well I must say [19:23:09] RECOVERY - RAID on ms-fe1002 is OK: NRPE: Unable to read output [19:23:09] RECOVERY - DPKG on ms-fe1001 is OK: All packages OK [19:23:19] RECOVERY - RAID on ms-fe1001 is OK: NRPE: Unable to read output [19:23:19] RobH: help would be appreciated [19:23:24] paravoid: you there? [19:23:29] PROBLEM - puppet disabled on ms-fe1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:47] (03PS1) 10coren: Tool Labs: add mod_setenv by default to lighttpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/99449 [19:24:00] on mw1156 at least it's this image: http://commons.wikimedia.org/wiki/File:Planetoid_90377_sedna_animation_location.gif [19:24:16] Image scalers are dying? Anything else causing trouble? [19:25:19] PROBLEM - Swift HTTP backend on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:34] marktraceur: just that (and the repurcussions) [19:25:36] we culd shoot the current converts [19:25:39] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:25:43] 'kay, topic changed in -tech [19:25:48] but again I didn't see that on mw1159 (for example) [19:25:53] 2013-12-05 19:25:40 mw1155 commonswiki: Thumbnail failed on mw1155: could not get local copy of "Rengo_19_10_2013_(10370081265).jpg" [19:25:57] (03CR) 10coren: [C: 032] "Trivial change." [operations/puppet] - 10https://gerrit.wikimedia.org/r/99449 (owner: 10coren) [19:26:09] PROBLEM - SSH on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:09] PROBLEM - Disk space on ms-fe1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:26:09] PROBLEM - RAID on ms-fe1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:26:11] 2013-12-05 19:25:57 mw1155 commonswiki: thumbnail failed on mw1155: error 137 "" from "'/usr/bin/convert' -background white '/tmp/localcopy_0bdc284c0c88-1.gif' -coalesce -thumbnail '330x248!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Light_dispersion_conceptual_waves.gif' -depth 8 -rotate -0 -fuzz 5% -layers optimizeTransparency '/tmp/transform_c82db1200635-1.gif'" [19:26:19] RECOVERY - puppet disabled on ms-fe1002 is OK: OK [19:26:36] cmjohnson1: can you relay to those who are with you that image scalers are dead/dying, and mediastorage is also flapping [19:26:44] Mostly unable to get local copy errors [19:26:58] greg-g [19:27:00] k [19:28:00] RECOVERY - Disk space on ms-fe1002 is OK: DISK OK [19:28:09] RECOVERY - SSH on ms-fe1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:28:09] RECOVERY - RAID on ms-fe1002 is OK: NRPE: Unable to read output [19:28:25] exec.log is filled with animated gifs specifically [19:29:09] RECOVERY - Swift HTTP backend on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.077 second response time [19:29:14] why would there be a lot of animated gifs all at once? 
[19:29:40] Someone deciding to purge them all [19:29:50] hey [19:29:54] hey [19:29:59] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:05] Certainly doesn't look to be like a mass upload of them [19:30:21] 0 in the last 500 uploads to commons [19:30:28] Nemo_bis: any idea? [19:30:29] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:30:36] i looked at three of these gifs and they all were linked to from User:Nemo bis/Sandbox [19:30:40] Reedy: looks like that, that planet gif was from 2005 [19:30:49] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.008 second response time [19:31:23] Nemo_bis: hey do you know anything about htis ? [19:31:36] I didn't even know I had such a subpage [19:31:42] There's 1129 gifs on that page [19:32:00] PROBLEM - MySQL InnoDB on db1059 is CRITICAL: CRIT longest blocking idle transaction sleeps for 739 seconds [19:32:08] Nemo_bis: you created it :) [19:32:12] * Reedy delete it [19:32:14] * Reedy deleted it [19:32:17] thanks [19:32:22] :) [19:32:43] I'm sure we all went to the same page potentially making the problem possibly worse.. [19:32:43] * greg-g prays [19:32:47] I have seen this one [19:32:49] File:DNA_orbit_animated.gif [19:32:53] ABUSE [19:32:53] crop up a few times now [19:32:58] was it created a long time ago ? [19:33:05] now I lost my sandbox history [19:33:06] or is it possible your account pw was compromised ? [19:33:15] the page was a year old [19:33:31] LeslieCarr: well, if someone can look at the log, it was created by nemo, then others were the last to edit, no one I knew, but that doesn't mean anything [19:33:33] Nemo_bis: you can have it back later [19:33:39] It just seems a bit suspect [19:33:40] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:33:59] iirc I just used it to glance over a few GIFs; there aren't many on Commons so it's normal that a good portion of them was present in it [19:33:59] greg-g: we really don't know that's nemo's page [19:34:00] Most of the errors are still Thumbnail failed on mw1154: could not get local copy of in the thumb logs [19:34:00] RECOVERY - MySQL InnoDB on db1059 is OK: OK longest blocking idle transaction sleeps for 1 seconds [19:34:07] it could easily easily be a coincidence [19:34:12] ori-l: completely [19:34:19] PROBLEM - Swift HTTP backend on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:32] i'll check out mw1154's health .... [19:34:39] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:34:40] Which suggests it's the image storage, not the scalers having issues (that and the flapping of SWIF stuff) [19:34:46] ooo, lots of kernel dumps [19:34:53] crap, brb [19:35:05] lots [19:35:08] yeah but you see those all day long (oom fr cgroup blah) [19:35:10] That's not good [19:35:11] might just be that it's flipping out ? [19:35:17] unless you are seeing something different [19:35:20] let me compare to a healthy image scaler [19:35:29] Is it me... Or did the thumb logs just get a lot quieter? 
[19:35:46] oh, same on mw1160 [19:35:48] Nope, just a pause [19:36:13] RECOVERY - Swift HTTP backend on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.077 second response time [19:36:16] icinga still thinks mw1159 is unhappy but I'm seeing the converts come in [19:36:22] and not an overwhelming number wither [19:36:24] either [19:36:54] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.044 second response time [19:36:54] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:37:03] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 64262 bytes in 0.308 second response time [19:37:13] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.610 second response time [19:37:28] marktraceur: could a recent change to MultimediaViewer have caused this? [19:37:41] Image scalers? Naw, MMV is all frontend [19:37:42] !log reedy finished scap: testwiki to 1.23wmf6, build l10n cache and rebuild for 1.23wmf5 [19:37:49] And not enough people use it to cause a heavy load [19:37:58] Logged the message, Master [19:38:25] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki back to 1.23wmf5 for now [19:38:34] (03PS1) 10Reedy: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99453 [19:38:42] Logged the message, Master [19:39:04] (03CR) 10Reedy: [C: 032] Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99453 (owner: 10Reedy) [19:39:13] (03Merged) 10jenkins-bot: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99453 (owner: 10Reedy) [19:39:54] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.040 second response time [19:39:54] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:40:20] well, that's good, but what happened? [19:40:23] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.436 second response time [19:40:23] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.415 second response time [19:40:37] Reedy reverted? [19:40:54] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [19:40:58] I dont think so [19:41:03] I think it's unrelated [19:41:04] oh, yeah [19:41:08] missed that [19:41:12] well I have not been doing anything here but looking [19:41:24] testwiki was actually only on 1.23wmf6 for less than a minute [19:41:30] looks better now : https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Image%2520scalers%2520eqiad&tab=m&vn=&hide-hf=false [19:41:33] !log Quick bounce of labstore1001 (kernel tweak) [19:41:33] and it was only testwiki? [19:41:37] indeed [19:41:49] Logged the message, Master [19:42:04] It seemingly started around the time of syncing the wmf6 code [19:42:08] But nothing was using it at that point [19:42:27] Thumb logs look no happier [19:42:48] do they look any happier from earlier, say 2 hours ago? 
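One way to turn the thumbnail.log noise on fluorine into something actionable -- which scalers and which files dominate the failures -- is a quick aggregation; the field positions assume the log format quoted in this incident:

    # Failures per scaler host over the last 10k lines.
    tail -n 10000 thumbnail.log | grep -i 'thumbnail failed' \
        | awk '{print $3}' | sort | uniq -c | sort -rn

    # Most frequently failing files, to spot a purge storm on specific media.
    tail -n 10000 thumbnail.log | grep -o 'local copy of "[^"]*"' \
        | sort | uniq -c | sort -rn | head -n 20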
[19:43:00] 2013-12-05 19:42:40 mw1160 commonswiki: Thumbnail failed on mw1160: could not get local copy of "Kevin_Federline.jpg" [19:43:00] 2013-12-05 19:42:40 mw1156 commonswiki: Thumbnail failed on mw1156: could not get local copy of "Finis_gloriae_mundi_from_Juan_Valdez_Leal.jpg" [19:43:08] Wasn't looking at them that long ago... [19:43:16] good to compare [19:43:34] it was wihing the same minute (icinga notified in the :25 minute, !log that Reedy started scap in the :26 minute) [19:44:03] yeah it was in the same minute (I looked at the irc timestamps too) [19:44:03] PROBLEM - DPKG on labstore1001 is CRITICAL: Connection refused by host [19:44:03] PROBLEM - RAID on labstore1001 is CRITICAL: Connection refused by host [19:44:03] PROBLEM - Disk space on labstore1001 is CRITICAL: Connection refused by host [19:44:03] PROBLEM - puppet disabled on labstore1001 is CRITICAL: Connection refused by host [19:44:31] well i don't think there's much i can do to help right now ... feel free to text me if need be [19:44:38] [18:54:28] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:38] [18:55:58] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [19:44:38] [18:56:01] !log reedy synchronized php-1.23wmf6 'Staging' [19:44:53] PROBLEM - SSH on labstore1001 is CRITICAL: Connection refused [19:45:12] yeah, but the log msg is on sync complete [19:45:19] Right [19:45:24] But it was pushing unused code [19:46:09] there go the front ends again [19:47:40] reedy@fluorine:/a/mw-log$ tail -n 10000 thumbnail.log | grep -c local [19:47:40] 7092 [19:47:43] and now magically ok (watching icinga) [19:51:38] Reedy: unrelated, but lots of docroot 404s in apache2.log [19:51:44] on fluorine, i mean [19:52:30] oooooold versions [19:53:19] Looks mostly like noise [19:53:23] RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.008 second response time [19:53:37] slowly.. [19:53:56] 2013-12-05 19:46:56 mw1084 commonswiki: [53da5566] /w/index.php?title=Special:UserLogin&returnto=Special%3AGlobalUsage&returntoquery=offset%3DShanghai_Transrapid_002.jpg%7Cruwiki%7C1881751&type=signup&uselang=sl&campaign=loginCTA&fromhttp=1&fromhttp=1 Exception from line 1057 of /usr/local/apache/common-local/php-1.23wmf5/includes/filebackend/SwiftFileBackend.php: Got InvalidResponseException exception. [19:54:25] ok well I restarted the proxy server on ms-fe1003 so [19:54:33] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:56:01] all the swift eqiad frontends look lik they are stabilized but at higher load than before, same is true of the backends [19:56:08] http://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=cpu_report&s=by+name&c=Swift+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [19:56:13] yeah, was just looking at that [19:56:33] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:56:33] Other than updating the localisation cache in wmf5, there's nothing new been deployed/in use [19:57:05] we should page paravoid [19:57:24] please do [19:57:29] I asked that 20 minutes ago ;) [19:57:31] not near a phone [19:58:21] [5431487.795544] swift-object-se: page allocation failure: order:5, mode:0x4020 [19:58:44] calling his work extension... 
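The kernel line quoted just above ("swift-object-se: page allocation failure: order:5") means an allocation of 2^5 contiguous pages -- 128 KiB with 4 KiB pages -- could not be satisfied, i.e. memory on that swift box was exhausted or badly fragmented at that moment. Fragmentation per allocation order can be eyeballed with:

    # Free chunks per allocation order (columns are order 0, 1, 2, ...);
    # near-zero counts from order 5 upward are consistent with the failure above.
    cat /proc/buddyinfo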
[19:59:16] greg-g: He's marked /away [19:59:22] Probably better texting him [19:59:30] he's usally pretty responsive [20:00:42] Reedy: I don't have international texting, I beliebe [20:00:44] v [20:00:55] I can do it [20:01:01] * greg-g is on cheap reseller cell plan [20:01:07] What are we wanting to say... Swift was unhappy, but isn't anymore? ;) [20:01:08] ty [20:01:20] it's still a little unhappy [20:01:30] well, I'm just worried we won't be able to figure out what happened if we don't look at it now [20:01:54] I can text him [20:02:21] just let's not two of us text/call him [20:02:36] I called, left message on his work extension, reedy's texting [20:02:50] ok [20:02:56] I hadn't started [20:03:00] ah [20:03:03] Would be cheapest for apergos to do it :P [20:03:06] then I will cause it's very cheap [20:03:07] yep [20:03:10] thanks [20:03:11] :D [20:03:18] brb [20:04:11] !log reedy updated /a/common to {{Gerrit|Ifda85f2ce}}: Add/update symlinks [20:04:27] Logged the message, Master [20:04:28] Still wrong logmsgbot [20:04:51] what do you do to update the repo? [20:04:54] it works for everyone else [20:05:03] I often commit from tin [20:05:15] so that was after comitting [20:05:30] (03PS1) 10Reedy: Everything else to 1.23wmf5 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99458 [20:06:05] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: everything else to 1.23wmf5 [20:06:08] it's a local-only commit at the time you're committing so the script assumes it's a security patch [20:06:19] Ah [20:06:20] Logged the message, Master [20:06:30] PHP Fatal error: Call to undefined method WikitextContent::getHeader() in /usr/local/apache/common-local/php-1.23wmf5/extensions/ProofreadPage/includes/page/EditProofreadPagePage.php on line 137 [20:06:32] No tpt... [20:07:05] (03CR) 10Reedy: [C: 032] Everything else to 1.23wmf5 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99458 (owner: 10Reedy) [20:07:16] (03Merged) 10jenkins-bot: Everything else to 1.23wmf5 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99458 (owner: 10Reedy) [20:07:44] we'll see, he might actually be out [20:08:06] https://gerrit.wikimedia.org/r/#/c/99042/ [20:08:31] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:08:43] !log image scalers started overloading at 18:56, cause appears to be a spike of convert jobs exceeding limits & getting killed; swift-backend.log on fluoine has lots of InvalidResponseException; syslog on swift has "swift-object-se: page allocation failure". [20:09:00] Logged the message, Master [20:09:52] !log increase in load coincided with sync of wmf6 to apaches and subsided on roll-back, but wmf6 was not enabled anywhere at the time of syncing [20:09:52] yeah except that the front and back ends still continue to have more load now [20:10:01] this is the part I don't believe... [20:10:08] Logged the message, Master [20:10:31] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [20:10:32] not "reedy did it" but "the spike caused it", that's what I don't believe [20:10:47] It's never my fault! [20:10:48] * Reedy grins [20:10:51] heh [20:10:58] well it would have been easier if it was [20:11:00] revert and done [20:11:08] anyways.... 
[20:11:35] actually the front ends are (mostly) now back at their same level [20:12:20] lemme look at one of these here backends now that things aren't broken [20:12:45] root@ms-be1008:/var/log# ps aux | grep swift-object-server | wc -l [20:12:45] 102 [20:12:53] most from oct 3 [20:14:50] I'm on ms-be1003 and 1004, both report around the same load [20:16:06] Can someone as root on tin please run rm -rf /a/common/php-1.22wmf17/extensions/Elastica [20:16:35] !log reedy updated /a/common to {{Gerrit|Ida4a0d980}}: Everything else to 1.23wmf5 [20:16:41] (03PS1) 10Reedy: Remofve 1.22wmf15 through 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99459 [20:16:52] Logged the message, Master [20:17:02] (03CR) 10Reedy: [C: 032] Remofve 1.22wmf15 through 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99459 (owner: 10Reedy) [20:17:04] atop thinks the ganglia cpu load graphs are a lie [20:17:05] hm [20:17:13] (03Merged) 10jenkins-bot: Remofve 1.22wmf15 through 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99459 (owner: 10Reedy) [20:17:32] Reedy: done [20:17:42] thanks [20:18:03] !log reedy synchronized docroot and w [20:18:18] Logged the message, Master [20:19:57] (03PS2) 10Dan-nl: Enable GWToolset on betacommons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98684 (owner: 10MarkTraceur) [20:20:41] RECOVERY - DPKG on labstore1001 is OK: All packages OK [20:20:51] RECOVERY - SSH on labstore1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:20:51] RECOVERY - Disk space on labstore1001 is OK: DISK OK [20:21:01] RECOVERY - RAID on labstore1001 is OK: OK: optimal, 60 logical, 60 physical [20:21:01] RECOVERY - puppet disabled on labstore1001 is OK: OK [20:21:13] apergos: strace of swift-storage on mw-be1008 has a lot of sendto(3, "<131>object-server STDOUT: Traceback (most recent call last): (txn: txeee8519afa454128ab2e2fe69235d950)\0", 104, 0, NULL, 0) = -1 ENOTCONN (Transport endpoint is not connected) [20:24:18] the object server? [20:25:10] yeah [20:27:36] i think that started because oom killer killed swift-proxy-server on ms-fe1004 [20:27:40] restarted object server on ms-be1005 to see if it makes a difference, I saw several process pegged at 100% [20:27:44] which started at 18:58 [20:28:08] and happened every few minutes after that until 19:18 [20:28:11] stupid that it can't recover, if that's what it is [20:29:11] doing swift-container too, there was one of those at 100% [20:29:46] don't remember what regular behavior is for the container server [20:29:53] but the object server should not be doing that [20:30:38] that was it [20:30:45] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Swift+eqiad&h=ms-be1005.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS [20:31:34] Reedy: so, with the proofreadpage stuff, is it reasonable to downgrade proofread extension to the one previous this major rewrite? [20:31:45] I'lll look at 1003 and 1007 and do the same if needed [20:32:06] Reedy: I'd prefer it to be, and let tpt work on it in betacluster [20:32:57] yep, objct server several process stuck there also [20:34:15] will do 1001 and 1006 but leave 1008 so there is a sample [20:35:26] uh [20:36:22] I previously wanted to say that 1100 GIFs do nothing, but if I wanted to try a DoS I'd just preview a page with a few thousands huge DjVy in 1px thumbs... refrained for WP:BEANS but it's what actually happened [20:36:26] ? 
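For the record, the stuck-object-server symptom above -- workers pegged at 100% CPU, strace showing syslog writes failing with ENOTCONN -- can be confirmed and cleared per backend roughly like this. swift-init is Swift's standard control tool; the host choices are just the ones from this incident:

    # How many object-server workers are running (the same check used above).
    ps aux | grep '[s]wift-object-server' | wc -l

    # Peek at what one busy worker is doing (network syscalls only).
    strace -f -e trace=network -p "$(pgrep -of swift-object-server)"

    # Bounce the object server on this backend, and the container server too
    # if one of its workers is also pegged; leave one host (ms-be1008 here)
    # untouched for later inspection.
    swift-init object-server restart
    swift-init container-server restart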
[20:39:14] !log reedy synchronized php-1.23wmf6 [20:39:30] Logged the message, Master [20:39:35] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:40:25] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [20:43:21] (03PS12) 10Ori.livneh: Add configuration for Wikimania Scholarships [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [20:44:27] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: phase1 wikis to 1.23wmf6 [20:44:43] Logged the message, Master [20:44:45] !log reedy updated /a/common to {{Gerrit|Ifae924950}}: Remofve 1.22wmf15 through 1.22wmf19 [20:44:49] (03PS1) 10Reedy: phase1 wikis to 1.23wmf6 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99464 [20:44:59] (03CR) 10Reedy: [C: 032] phase1 wikis to 1.23wmf6 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99464 (owner: 10Reedy) [20:45:01] Logged the message, Master [20:45:11] (03Merged) 10jenkins-bot: phase1 wikis to 1.23wmf6 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99464 (owner: 10Reedy) [20:45:12] !log over the last while, restarted swift-object-server on ms-be100* except for 1008, left that for poking at [20:45:27] Logged the message, Master [20:45:53] and I am strving, how did I not get/make or eat dinner? [20:46:36] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:37] I had to double take what you said then [20:46:51] In the #Cyanogenmod build environments they use stuff like make lunch, make dinner [20:46:55] awfully confusing [20:47:33] (03CR) 10Ori.livneh: [C: 032] Add configuration for Wikimania Scholarships [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [20:47:35] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [20:51:07] (03CR) 10Ori.livneh: [C: 032 V: 032] added groups::wikidev and accounts::bd808 to zirconium for scholarship app [operations/puppet] - 10https://gerrit.wikimedia.org/r/99466 (owner: 10Ori.livneh) [20:51:43] you'll have to request sudo, that's not up to me [20:52:04] ori-l: Will I need it? [20:52:18] bd808: not with that attitude [20:52:44] ori-l: My license plate says "sudo" that should give me rights everywhere [20:53:35] you can mention that in the RT ticket [20:53:37] (03PS3) 10Reedy: Fix Wikibase noc symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97903 (owner: 10Aude) [20:53:41] (03CR) 10Reedy: [C: 032] Fix Wikibase noc symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97903 (owner: 10Aude) [20:54:00] (03Merged) 10jenkins-bot: Fix Wikibase noc symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97903 (owner: 10Aude) [20:56:35] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:57:57] (03PS1) 10BryanDavis: Fix passwords::mysql::wikimania_scholarships include [operations/puppet] - 10https://gerrit.wikimedia.org/r/99469 [20:58:03] !log reedy synchronized docroot and w [20:58:08] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix passwords::mysql::wikimania_scholarships include [operations/puppet] - 10https://gerrit.wikimedia.org/r/99469 (owner: 10BryanDavis) [20:58:20] Logged the message, Master [20:58:35] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:03:35] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:35] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:06:00] Are the image scalers doing OK? [21:08:18] Looks like yes [21:19:23] !log stopping puppet on cp1046 to troubleshoot some ganglia stuff [21:19:38] Logged the message, Master [21:21:03] I haven't done cyanongenmod in a long tme [21:21:19] comes of having an old phone that doesn't support aything current (and no data plan either) [21:22:11] !log catrope synchronized php-1.23wmf5/extensions/VisualEditor/modules/oojs-ui/oojs-ui.js 'touch' [21:22:26] Logged the message, Master [21:23:09] foooood [21:25:15] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:05] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:31:20] (03PS1) 10Ottomata: Fixing version number on latest logster, source for JsonLogster had changed. [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99521 [21:31:37] (03CR) 10Ottomata: [C: 032 V: 032] Fixing version number on latest logster, source for JsonLogster had changed. [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99521 (owner: 10Ottomata) [21:39:53] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [21:50:12] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:53:12] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:54:46] https://www.mediawiki.org/wiki/Special:Watchlist?uselang=en is missing messages [21:54:48] (three in english, apparently only one in other languages) [22:02:23] (03CR) 10Aklapper: [C: 031] bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [22:07:10] (reedy is working on that now) [22:10:04] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:12:04] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:14:12] hume: Failed to add the RSA host key for IP address '2620:0:860:2:21d:9ff:fe33:f235' to the list of known hosts (/home/l10nupdate/.ssh/known_ho). [22:14:18] known_ho? [22:21:52] !log LocalisationUpdate completed (1.23wmf5) at Thu Dec 5 22:21:52 UTC 2013 [22:22:07] Logged the message, Master [22:22:25] slooooooooow [22:22:35] woot, was it running for all that time? [22:23:05] [22:06:48] I'm running localisation update [22:23:08] 17 minutes for 1 version [22:23:15] …lols. [22:23:16] Running updates for 1.23wmf5 (on aawikibooks) [22:23:16] 38 MediaWiki messages are updated [22:23:16] Updated 560 messages in total [22:23:16] Done [22:23:20] also, doesn't look fixed to me. 
:( [22:23:25] https://www.mediawiki.org/wiki/Special:Watchlist [22:23:29] mw.org isn't running 1.23wmf5 [22:23:33] this shouldn't be cached or anything [22:23:34] ah [22:23:57] but that was only happening on mw.org, no? [22:24:12] easier to just run it everywhere [22:24:20] heh, alright [22:28:17] Running updates for 1.23wmf6 (on mediawikiwiki) [22:28:18] 534478 MediaWiki messages are updated [22:28:18] Updated 1187824 messages in total [22:28:18] Done [22:28:18] All done in 370.33610486984 seconds [22:28:31] another 15 minutes or more to wait... [22:29:04] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:29:08] (03PS1) 10Ottomata: Fixing JsonLogster bug when keys contain '/' [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99537 [22:29:21] (03CR) 10Addshore: [C: 031] Add Item and Item_talk namespace aliases for Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99405 (owner: 10Aude) [22:29:30] (03PS2) 10Ottomata: Fixing JsonLogster bug when keys contain '/' [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99537 [22:29:37] (03CR) 10Ottomata: [C: 032 V: 032] Fixing JsonLogster bug when keys contain '/' [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99537 (owner: 10Ottomata) [22:29:40] Reedy, jmust rebuild it in 100 threads [22:30:04] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:33:01] (03PS1) 10Ottomata: Updating changelog for changes from master [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99538 [22:33:07] (03PS2) 10Ottomata: Updating changelog for changes from master [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99538 [22:33:20] (03CR) 10Ottomata: [C: 032 V: 032] Updating changelog for changes from master [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99538 (owner: 10Ottomata) [22:40:36] paravoid: are all the mc servers on the same row? [22:47:09] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:48:09] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:49:42] (03PS1) 10Aaron Schulz: Tweaked $wgJobQueueAggregator to use redis job servers and have failover [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99543 [22:50:54] still going [22:52:04] * Reedy finds somewhere to die of boredom [22:52:05] ugh, fucking submodule [22:54:59] !log LocalisationUpdate completed (1.23wmf6) at Thu Dec 5 22:54:59 UTC 2013 [22:55:09] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:55:15] Logged the message, Master [22:55:28] Yay [22:55:58] MatmaRex: Fixed [22:55:59] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:57:08] MatmaRex: Doesn't look doubly parsed either.. 
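For context on the runs above: the update and the follow-up cache refresh are ordinary maintenance scripts, normally driven through mwscript on this setup. A rough sketch of the two pieces -- the wiki name is taken from the log output, and the exact wrapper the l10nupdate job uses is an assumption:

    # Fetch updated translations for one branch; per the output above a single
    # wiki per branch is enough (aawikibooks for 1.23wmf5).
    mwscript extensions/LocalisationUpdate/update.php --wiki=aawikibooks

    # Rebuild the localisation cache so the new messages are actually served.
    mwscript rebuildLocalisationCache.php --wiki=aawikibooks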
[23:00:28] (03PS2) 10Aaron Schulz: Tweaked $wgJobQueueAggregator to use redis job servers and have failover [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99543 [23:05:45] Reedy: well, there's nothing to escape in there normally [23:08:00] (03PS1) 10BBlack: get rid of init mutex - really isn't needed [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99548 [23:08:01] (03PS1) 10BBlack: Fix minor infrequent memleak (during vcl reload) [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99549 [23:08:02] (03PS1) 10BBlack: Fix big leak - 'struct addrinfo' leak on every .map() [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99550 [23:09:17] (03CR) 10BBlack: [C: 032 V: 032] get rid of init mutex - really isn't needed [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99548 (owner: 10BBlack) [23:09:45] (03CR) 10BBlack: [C: 032 V: 032] Fix minor infrequent memleak (during vcl reload) [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99549 (owner: 10BBlack) [23:10:19] (03CR) 10BBlack: [C: 032 V: 032] Fix big leak - 'struct addrinfo' leak on every .map() [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99550 (owner: 10BBlack) [23:16:09] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:09] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:18:43] (03CR) 10Aaron Schulz: [C: 032] Tweaked $wgJobQueueAggregator to use redis job servers and have failover [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99543 (owner: 10Aaron Schulz) [23:19:48] (03Merged) 10jenkins-bot: Tweaked $wgJobQueueAggregator to use redis job servers and have failover [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99543 (owner: 10Aaron Schulz) [23:21:09] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:22:14] !log aaron synchronized wmf-config/jobqueue-pmtpa.php 'Tweaked $wgJobQueueAggregator to use redis job servers and have failover' [23:22:29] Logged the message, Master [23:22:47] !log aaron synchronized wmf-config/jobqueue-eqiad.php 'Tweaked $wgJobQueueAggregator to use redis job servers and have failover' [23:22:59] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:23:01] Logged the message, Master [23:28:00] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [23:29:37] (03CR) 10GWicke: "Ping ;)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99251 (owner: 10GWicke) [23:33:36] (03PS1) 10BBlack: varnish (3.0.3plus~rc1-wm24) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/99554 [23:33:56] (03CR) 10BBlack: [C: 032 V: 032] varnish (3.0.3plus~rc1-wm24) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/99554 (owner: 10BBlack) [23:36:07] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:37:08] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:37:18] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Dec 5 23:37:18 UTC 2013 [23:37:33] Logged the message, Master [23:40:39] who can I bother about irc.wikimedia.org stuff, namely blocking broken noisy bots (~yahoo_age@anonymous.user) on the #pt.wikipedia channel? [23:40:39] see https://bugzilla.wikimedia.org/show_bug.cgi?id=54821 [23:43:08] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:45:07] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:50:31] someone broke central notice? [23:50:42] not it [23:50:44] 20 PHP Warning: Missing argument 3 for CentralNoticeCampaignLogPager::testBooleanChange(), called in /usr/local/apache/common-local/php-1.23wmf5/extensions/CentralNotice/CentralNoticeCampaig [23:50:46] nLogPager.php on line 251 and defined in /usr/local/apache/common-local/php-1.23wmf5/extensions/CentralNotice/CentralNoticeCampaignLogPager.php on line 309 [23:50:59] 3 of them = 60 [23:51:32] K4-713: mwalker: ^ ? [23:51:50] that should've been fixed... [23:52:04] mwalker, they weren't there about 10 min ago [23:52:08] We didn't push anything today, did we? [23:52:10] no [23:52:15] Didn't think so. [23:52:16] Hm. [23:52:20] and it just means that someone hit that form [23:52:23] I really don't care about it [23:52:26] or rather I do [23:52:48] but I care about other things than PHPs apparent inability to withold it's tendencies to touch it's childrens privates [23:53:02] at this exact moment [23:53:24] bblack: do you have the set of commands for adding a patch and rebuilding? [23:53:27] there are 80 of them now - line 251,250,249,247 [23:53:33] exactly 20 each [23:53:35] bblack: in your terminal buffer, i mean [23:53:37] kinda weird numbering [23:53:52] yurik: ya; it's building a paged list [23:53:56] with 20 entries on it [23:54:00] ori-l: adding a patch and rebuilding what? [23:54:24] bblack: a debian package [23:54:48] it really depends on the package I think, whether it uses gbp and how the branches are set up [23:55:06] this is my last varnish build today: [23:55:07] git buildpackage --git-debian-branch=testing/3.0.3plus-rc1 --git-upstream-branch=upstream-3.0.3plus-rc1 --git-upstream-tree=branch --git-export-dir=../build-area --git-no-create-orig -us -uc [23:55:27] (after committing the changes to debian/patches/) [23:55:34] thanks, that's useful [23:56:11] in that example, we're doing our debian/ work in branch testing/3.0.3plus-rc1, and that upstream branch should be identical other than the lack of debian/, comes from upstream [23:56:31] you probably won't need --git-no-create-orig, but I've needed to do that for this package for odd reasons
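To flesh out the answer to ori-l's question at the end: with a gbp-managed package like this varnish tree, "adding a patch and rebuilding" is usually a quilt patch dropped into debian/patches plus a changelog bump, followed by the build command bblack pasted. A sketch -- the patch file name and the next version number are made up:

    # On the packaging branch (testing/3.0.3plus-rc1 here):
    cp ~/fix-vcl-reload-leak.patch debian/patches/
    echo fix-vcl-reload-leak.patch >> debian/patches/series

    # New changelog entry with a bumped version.
    dch -v 3.0.3plus~rc1-wm25 "Fix memleak during vcl reload"

    git add debian && git commit -m "varnish (3.0.3plus~rc1-wm25) precise; urgency=low"

    # Then build exactly as quoted above:
    git buildpackage --git-debian-branch=testing/3.0.3plus-rc1 \
        --git-upstream-branch=upstream-3.0.3plus-rc1 --git-upstream-tree=branch \
        --git-export-dir=../build-area --git-no-create-orig -us -uc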