[00:05:22] Elsie: Tried qchris? [00:06:29] :-D He found me in the analytics channel :-) [00:06:30] iptables isn't in 12.04 stock.. should I be using ufw or something else? [00:06:48] in iptables I'd just set up a TCP redirect to another port... [00:07:10] By syock, you mean installed by default? [00:07:12] ufw would have to use iptables [00:07:13] *stock [00:07:45] yes, installed by default [00:07:56] apt-get install iptables, and I'm happy [00:08:12] but I can go learn ufw [00:08:26] ufw is just a wrapper for iptables [00:08:28] Can't see why that's going to be an issue... Just make sure your puppet code is making sure it's installed and should be good to go [00:08:40] (rather than expecting it to be installed manually) [00:09:27] I always just use iptables and iptables-persistent [00:11:14] or https://forge.puppetlabs.com/arusso/iptables for puppet [00:11:22] ufw isn't install stock either.. :) [00:11:31] so I just went with iptables -- I'm happy [00:11:52] thx [00:13:02] cajoel: we use ferm in production [00:13:32] bast1001 has iptables [00:14:17] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:15:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [00:16:38] (03CR) 10Chad: [C: 031] "I don't think we need a second variable. The amount of time it takes to index most wikis is usually very minimal considering the increased" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98046 (owner: 10Legoktm) [00:17:18] <^d> greg-g: Today's LD totally taken now? [00:18:53] (03PS1) 10Aaron Schulz: Added logstash role and applied it to logging logstash servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/99278 [00:20:07] (03CR) 10Ori.livneh: [C: 032] Added logstash role and applied it to logging logstash servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/99278 (owner: 10Aaron Schulz) [00:20:44] ...I started reviewing that :) [00:23:41] i was sitting next to aaron as he wrote it [00:24:21] Pair programming! [00:25:17] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:25:33] but review still welcome! [00:25:39] (03PS1) 10Ori.livneh: Qualify elasticsearch includes in role::logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/99279 [00:26:46] (03CR) 10Ori.livneh: [C: 032] Qualify elasticsearch includes in role::logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/99279 (owner: 10Ori.livneh) [00:27:10] paravoid: we just copied role::elasticsearch and modified it [00:27:18] <^d> Does anyone have dibs on the remainder of today's LD? I don't see anything on-wiki. [00:28:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [00:36:44] (03PS1) 10Ori.livneh: Qualify elasticsearch class names in role::elasticsearch, too [operations/puppet] - 10https://gerrit.wikimedia.org/r/99280 [00:37:38] (03CR) 10Ori.livneh: [C: 032] Qualify elasticsearch class names in role::elasticsearch, too [operations/puppet] - 10https://gerrit.wikimedia.org/r/99280 (owner: 10Ori.livneh) [00:56:06] (03PS1) 10Springle: depool es1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99288 [00:56:22] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
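For reference, the TCP redirect cajoel describes above looks roughly like the following with plain iptables plus iptables-persistent. The port numbers are placeholders, and the rule-saving command varies a little between iptables-persistent versions:

    # Install the tools (and, as suggested above, have puppet ensure this rather than relying on manual installs)
    sudo apt-get install iptables iptables-persistent

    # Redirect inbound TCP traffic on port 80 to a local service listening on 8080 (example ports only)
    sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8080

    # Persist the current rules across reboots; older iptables-persistent releases expose this via the init script
    sudo service iptables-persistent save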
[00:56:35] (03CR) 10Springle: [C: 032] depool es1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99288 (owner: 10Springle) [00:56:44] (03Merged) 10jenkins-bot: depool es1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99288 (owner: 10Springle) [00:57:22] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [00:58:04] !log springle synchronized wmf-config/db-eqiad.php 'depool es1003 for upgrade' [00:58:20] Logged the message, Master [01:06:17] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:27:17] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:30:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:35:15] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:36:15] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:44:42] (03CR) 10Tim Starling: [C: 04-1] "MZMcBride, please put your changes in a separate commit." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/53885 (owner: 10Reedy) [01:45:23] Bleh. [01:47:15] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:48:15] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:48:42] (03PS9) 10MZMcBride: Update wgServer, wgCanonicalServer for sub.subdomain wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/53885 (owner: 10Reedy) [01:51:37] (03PS1) 10Springle: switch es1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/99296 [01:53:29] (03CR) 10Springle: [C: 032] switch es1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/99296 (owner: 10Springle) [01:53:58] (03PS1) 10Ori.livneh: Disable elasticsearch nagios/ganglia monitoring on logstash cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/99297 [01:55:06] (03CR) 10Ori.livneh: [C: 032] Disable elasticsearch nagios/ganglia monitoring on logstash cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/99297 (owner: 10Ori.livneh) [01:55:24] (03PS1) 10MZMcBride: Specify HTTPS for $wgCanonicalServer for all private wikis. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99299 [01:57:15] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:58:05] (03CR) 10MZMcBride: "Tim, okay: ." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/53885 (owner: 10Reedy) [01:58:54] (03CR) 10Dr0ptp4kt: [C: 031 V: 031] "My own generate.php run resulted in the same records. My usernames and UID don't allow for deployment, but this looks good to me." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98036 (owner: 10Tim Starling) [01:59:15] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:01:38] (03CR) 10Bsitu: Enable Flow discussions on a few test wiki pages (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 (owner: 10Spage) [02:08:10] (03PS1) 10BBlack: temporarily add cp3013 to esams mobile list [operations/puppet] - 10https://gerrit.wikimedia.org/r/99303 [02:09:29] (03CR) 10BBlack: [C: 032 V: 032] temporarily add cp3013 to esams mobile list [operations/puppet] - 10https://gerrit.wikimedia.org/r/99303 (owner: 10BBlack) [02:15:43] PROBLEM - Varnish HTTP mobile-backend on cp3013 is CRITICAL: Connection refused [02:18:43] RECOVERY - Varnish HTTP mobile-backend on cp3013 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.193 second response time [02:20:23] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:21:13] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:22:00] !log LocalisationUpdate completed (1.23wmf5) at Thu Dec 5 02:22:00 UTC 2013 [02:22:17] Logged the message, Master [02:25:12] (03PS1) 10Ori.livneh: Rename role::elasticsearch -> role::elasticsearch::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/99307 [02:28:02] (03CR) 10Ori.livneh: [C: 032] Rename role::elasticsearch -> role::elasticsearch::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/99307 (owner: 10Ori.livneh) [02:28:23] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:06] Hah! [02:31:42] (03CR) 10Ori.livneh: "Theory confirmed; this was indeed the issue." [operations/puppet] - 10https://gerrit.wikimedia.org/r/99307 (owner: 10Ori.livneh) [02:32:18] (03PS1) 10Ori.livneh: Revert "Disable elasticsearch nagios/ganglia monitoring on logstash cluster" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99309 [02:33:13] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:33:27] (03CR) 10Ori.livneh: [C: 032] Revert "Disable elasticsearch nagios/ganglia monitoring on logstash cluster" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99309 (owner: 10Ori.livneh) [02:36:53] AaronSchulz: figured it out [02:38:26] !log LocalisationUpdate completed (1.23wmf4) at Thu Dec 5 02:38:26 UTC 2013 [02:38:42] Logged the message, Master [02:49:25] (03CR) 10Tim Landscheidt: [C: 031] "I assume the deletion of apache2.2-common is deliberate, so might be nice to mention it in the commit message." [operations/puppet] - 10https://gerrit.wikimedia.org/r/98379 (owner: 10Matanya) [02:50:31] Could someone help me fix permissions for my .ssh/known_hosts on fenari.wikimedia.org? [02:50:31] $ touch /home/krinkle/.ssh/known_hosts [02:50:31] touch: cannot touch `/home/krinkle/.ssh/known_hosts': Permission denied [02:50:33] As a result I get an ssh fingerprint yes/no everytime I connect to anything from fenari [02:53:26] Krinkle: try now? [02:54:45] andrewbogott: works :) [02:55:22] thanks [02:55:32] Krinkle: you didn't have a known_hosts, file and didn't have write permissions to the .ssh dir. [02:55:42] So I just made you an empty known_hosts :) [02:56:08] andrewbogott: should the directory be chgrp'ed to me instead of root? [02:56:13] or is that a good thing? 
[02:56:16] (03PS2) 10Matanya: toollabs: remove old tips absent declartions [operations/puppet] - 10https://gerrit.wikimedia.org/r/98379 [02:56:58] hm, now I'm typing in the wrong channel too [02:57:12] Don't you own it but it's just not writeable? [02:57:18] Anyway, I think it's correct as it is. [02:57:23] Or, at least, fine as it is. [03:00:04] k, chmod u+w .ssh fixed it so that my user can create files in the future [03:01:41] (03PS1) 10Springle: repool es1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99314 [03:02:27] (03CR) 10Springle: [C: 032] repool es1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99314 (owner: 10Springle) [03:03:32] andrewbogott: would be nice to have some reviews from you :) [03:03:40] !log springle synchronized wmf-config/db-eqiad.php 'repool es1003 after upgrade, max_connections lowered during warm up' [03:03:52] matanya: I'll try to catch up a bit tomorrow. [03:03:53] Logged the message, Master [03:04:03] thanks andrewbogott [03:16:18] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [03:23:22] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Dec 5 03:23:22 UTC 2013 [03:23:37] Logged the message, Master [03:32:28] (03PS1) 10Springle: depool es1002 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99318 [03:33:08] (03CR) 10Springle: [C: 032] depool es1002 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99318 (owner: 10Springle) [03:34:04] !log springle synchronized wmf-config/db-eqiad.php 'depool es1002 for upgrade' [03:34:25] Logged the message, Master [03:34:39] (03CR) 10Tim Starling: [C: 04-1] "Still needs redirects." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/53885 (owner: 10Reedy) [03:35:03] Awww, no mutante [03:36:17] (03CR) 10Tim Starling: [C: 031] Specify HTTPS for $wgCanonicalServer for all private wikis. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99299 (owner: 10MZMcBride) [03:37:54] For some reason I thought Reedy was on vacation. [03:38:11] Wassat? [03:38:20] I was just rebasing to be polite in his absence. Now I have a changeset of my own, hrm. [03:38:29] I'm not sure why I thought that. [03:43:17] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:45:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [03:45:43] Elsie: I was last week? [03:45:52] I mean, that's a true statement, just, you know. [03:46:14] The whole U.S. kind of was. [03:49:56] Ryan_Lane, I saw a trello card about autoredirect to https in various languages. What's the status of that, and would it be possible to avoid it based on the presence of the X-CS header? [03:50:38] Elsie: I just also took M-W, but yeah, pretty slow week [03:53:10] greg-g: I'm proud of you. [03:53:17] Elsie: I was too [03:53:23] I doubt I confused you with Reedy, though. ;-) [03:53:41] Wouldn't imagine, just I couldn't think who else was on vacation. [03:53:45] anywho, g'evening. [03:56:21] Bye. [03:57:52] greg-g that means i will deploy a few more things this week :) [03:58:11] i have enough for another depl window, or at least for a quick depl for sure :) [03:58:40] any good time tomorrow? :) [04:10:54] yurik-road2: what is it? [04:11:16] yurik-road2: bug numbers/etc? 
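On the ~/.ssh permission problem Krinkle hit earlier in this exchange: when the directory is owned by the user but not writable, the self-service fix is just to restore write permission and create the file, roughly as follows (the modes shown are the usual conventions, not something mandated by the discussion above):

    # Assumes ~/.ssh is owned by you; if ownership itself is wrong, a root chown is needed first
    chmod u+w ~/.ssh                    # or the stricter conventional mode: chmod 700 ~/.ssh
    touch ~/.ssh/known_hosts
    chmod 644 ~/.ssh/known_hosts
    ls -ld ~/.ssh ~/.ssh/known_hosts    # verify ownership and modes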
[04:11:28] yurik-road2: feel free to email me, since you're on the road right now :) [04:31:19] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:32:19] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [04:41:38] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:41:16 AM UTC [04:45:38] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:45:22 AM UTC [04:47:18] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:48:08] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [04:55:18] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:57:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [05:20:04] PROBLEM - Puppet freshness on cp3013 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 02:19:29 AM UTC [05:29:24] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:14] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [05:36:23] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:37:23] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [05:45:03] (03PS1) 10Springle: switch es1002 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/99327 [05:46:00] (03CR) 10Spage: [C: 04-1] "I think Benny's suggestion that we need two patches is right." (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 (owner: 10Spage) [05:46:38] (03CR) 10Springle: [C: 032] switch es1002 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/99327 (owner: 10Springle) [05:57:23] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:59:23] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [06:12:06] (03PS1) 10Springle: repool es1002 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99329 [06:12:07] (03PS1) 10Ori.livneh: logstash: add Ganglia group and specify aggregators [operations/puppet] - 10https://gerrit.wikimedia.org/r/99330 [06:13:22] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:14:12] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [06:14:25] (03CR) 10Springle: [C: 032] repool es1002 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99329 (owner: 10Springle) [06:15:20] !log springle synchronized wmf-config/db-eqiad.php 'repool es1002 after upgrade, max_connections lowered during warm up' [06:15:34] Logged the message, Master [06:24:22] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:56] !log cp301[123] puppet freshness is me, please leave them disabled [06:31:11] Logged the message, Master [06:31:12] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [06:38:47] bblack, hey, read your bug hunt, pretty impressive. One of those days I should learn some of the tricks you used [06:39:18] any luck finding theculprit? 
[06:40:18] well, the culprit in that particular case is "jemalloc bugs", the question is how we fix it :) [06:40:41] switching to a new malloc lib? :/ :) [06:41:10] yeah that's my current plan, actually, but I'm waiting for Faidon to wake up and say my plan isn't insane first [06:41:21] I rebuilt with s/jemalloc/tcmalloc/, and I like it [06:41:23] i'm sure there other libs have no bugs [06:42:11] jemalloc comes from FreeBSD, so the Linux port is kind of a 2nd-class citizen and not as well-tested, IMHO. tcmalloc from google has a lot of the same goals and properties, but was developed on Linux and is pretty stable. [06:42:19] are they really drop-in replacements like that? [06:42:46] as long as you're not using allocator-specific APIs, but in this case varnish is just using the normal posix APIs [06:43:16] moreover, what malloc lib is typically used by C? isn't there a standard c lib of some sort that everyone adhears to? [06:43:45] malloc normally comes from libc, so in our case normal is glibc's allocator [06:44:14] but glibc's allocator is fairly mundane and generic, which is why memory+thread-intensive software tends to want a better allocator [06:44:18] right, so these libs provide substantially better perfs for massive alloc/dealoc? [06:45:55] it's mostly not about "massive", it's about being efficient with gobs of small allocations, not fragmenting the heap in the face of repeated dealloc/realloc of small stuff, and being thread-aware so that hundreds of threads in the same proc don't trash each other on inter-cpu cache stuff and/or any necessary mutexes in the malloc implementation [06:45:55] has there been a push to migrate to tcmalloc in general? [06:46:08] no - honestly the glibc one is fine for most purposes, tcmalloc is considered special-purpose, people put it in place for specific apps when they have intensive needs that fit it [06:47:19] interesting - i would have thought that any threaded app which uses cpu for more than UI would want a highly efficient malloc [06:48:00] and more than net/file access [06:48:08] well there's no free lunch, implementations like tcmalloc have tradeoffs, such as wasting more memory to get things done more efficienctly in the big picture [06:48:47] i see. But I guess memory is cheaper nowadays, whereas performance is as critical as ever... [06:48:55] if tcmalloc replaced glibc's alloc for the whole system, you'd be making those tradeoffs for every random shellscript and simple utility and daemon, etc, where it's really not worth it. It would be a net loss of free ram for no noticeable improvement to the user. [06:50:19] true. Why has free bsd developed it? I would understand google's specific need [06:50:21] there are a lot of allocator implementations, and there's no one best answer that's optimal for every scenario [06:50:36] are they basic kernel on it somehow? [06:51:01] (03CR) 10Ori.livneh: [C: 031] "LGTM; I'll wait for you to be around before merging." [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [06:51:14] freebsd uses jemalloc as their default allocator AFAIK, but I don't keep up with FreeBSD as well as I could/should [06:51:30] tcmalloc isn't meant to be like that, it's special purpose for intense loads and lots of threads [06:51:31] on the other hand - not sure why - kernel doesn't rotate memobjects as often [06:53:15] this is old but insightful if you want to know about jemalloc: http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf [06:55:18] thanks! 
will learn something new :) good luck!!! [06:55:37] re-reading the above and my email, sometimes it comes off like I think jemalloc is bad, because I keep just saying "jemalloc" - obviously, our problems are with the Linux port of it specifically [06:55:54] I'm sure in FreeBSD it works great, because that implementation gets exercised and tested a lot more :) [06:56:07] bblack: notice whose malloc implementation it replaced :P [06:56:22] yeah [06:56:24] i got that. I wonder how a port would have those bugs though [06:57:03] not even sure how much is needed to port - i always thought freebsd and linux had fairly similar posix kernel api [06:57:15] well, all complex software has bugs, and allocators get pretty complicated. I think the key here is that one installed as a system default gets lots of exercise to wring out those bugs, and one that isn't doesn't. [06:59:13] possibly dependency libs of sorts... but still [06:59:15] a lot of the linux port diffs are about pthread lock differences, some other stuff like madvise() as well [06:59:59] but why are they different - from my bad memory of pthread, its all done in the user space [07:00:24] just minor differences [07:00:35] are they kernel dependent? [07:00:47] but it's a port, and the copy in varnish is very old, lots of bugs were fixed in the real jemalloc since [07:01:02] and it was exported as libjemalloc for linux since then as well, which has several releases and a long line of bugfixes to look at [07:01:30] everything is kernel dependent when you get down to it :) [07:02:47] how so? for malloc i would think you need lock management (you don't have to start your own threads), and an ability to ask OS for large globs of memory. not much more? [07:04:02] btw, i don't want to keep you away from doing other stuff - just curiosity :) [07:04:31] if I really knew the answer to every question about allocators, I'd be writing one instead of debugging one :) [07:04:56] hehe. Just wondering why they keep a fork when one doesn't appear to be badly needed [07:05:01] but at the very least, there's going to be differences in the underlying behaviors of mmap() and brk() [07:05:01] esp for something so useful [07:06:18] well, in this particular case, I think the chain of events goes something like this: varnish is developed on FreeBSD. FreeBSD gets jemalloc with better threads and less contention, and it makes varnish better. someone tries to use varnish on Linux and notices glibc's allocator isn't fancy like that and is a performance problem there, so jemalloc gets ported over to linux to use with varnish. [07:06:51] independently (and later in time, IIRC), Google develops tcmalloc natively on Linux with a lot of the same rough goals about handling thread concurrency better [07:07:35] yes, but wouldn't jemalloc devs want to develop for linux too? or is there political rivalry going on between the lower level devs? :) [07:08:06] since when they were developing jemalloc, they clearly were not happy with glibc [07:08:07] jemalloc was written for FreeBSD by FreeBSD people I believe [07:08:17] FreeBSD has its own libc, it's not glibc [07:08:24] oh, didn't know [07:08:32] glibc is GNU, FreeBSD is BSD, two different parts of the open source world [07:08:46] yes, makes sense ... in a sad way [07:09:11] both groups trying to make the world better... and duplicate the effort :( [07:09:30] aaanyway, thanks for all the info! [07:09:37] yeah that's a whole other subject of infinite debate, the great BSD-vs-GPL debate! 
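As an aside to the allocator discussion above: the usual way to try an allocator like tcmalloc for a single process, without replacing glibc malloc system-wide, is LD_PRELOAD. The package and library names below vary by release and are only illustrative, and note this is not what bblack did here — varnish bundled its own in-tree jemalloc, so he rebuilt the package instead:

    # Install Google's tcmalloc; the package name differs across releases
    # (e.g. libtcmalloc-minimal0 on older Ubuntu, libtcmalloc-minimal4 on newer ones)
    sudo apt-get install libtcmalloc-minimal4

    # Run one process with tcmalloc instead of glibc malloc; everything else keeps the default allocator.
    # 'my-threaded-daemon' is a stand-in name, and the library path may be under /usr/lib/x86_64-linux-gnu/.
    LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 ./my-threaded-daemon

    # Confirm which allocator the running process actually mapped in
    grep -E 'tcmalloc|jemalloc' "/proc/$(pidof my-threaded-daemon)/maps"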
[07:10:01] i wouldn't want to be in between - just sad that it exists [07:10:38] if you think about it though, apple exists because of freebsd :) [07:10:57] they would have been dead if they didn't switch to the new kernel [07:11:12] and now they have largest capitalization :( [07:15:10] hmm, not entirelly accurate - netbsd was also a player [07:33:28] greg-g, Reedy, is it possible for me to piggy-back on the deploy to Wikipedia? I just need to update the GettingStarted submodule. [07:33:34] Otherwise, I need to back another commit out. [07:34:22] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:35:16] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [07:41:56] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:41:16 AM UTC [07:45:56] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:45:22 AM UTC [07:56:26] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:57:16] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [07:57:48] (03PS1) 10Ori.livneh: Carbon: allow storage aggregation rules to be specified as class parameters [operations/puppet] - 10https://gerrit.wikimedia.org/r/99333 [07:58:21] (03CR) 10jenkins-bot: [V: 04-1] Carbon: allow storage aggregation rules to be specified as class parameters [operations/puppet] - 10https://gerrit.wikimedia.org/r/99333 (owner: 10Ori.livneh) [08:00:26] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:00:38] how dare you, jenkins [08:01:14] morning [08:01:40] hi paravoid [08:03:11] i ran into the weirdest puppet bug [08:05:06] * paravoid waits for it [08:05:16] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:06:29] aaron asked me to look over the logstash manifest with him, so i was all excited -- here's my chance! [08:06:35] look at me, i'm not a complete idiot! [08:06:38] (03PS2) 10Ori.livneh: Carbon: allow storage aggregation rules to be specified as class parameters [operations/puppet] - 10https://gerrit.wikimedia.org/r/99333 [08:06:55] so of course it didn't work and i got a duplicate class definition that i couldn't for the life of me fix [08:07:56] nik had this weird thing: [08:08:11] class role::elasticsearch inherits role::elasticsearch::config [08:10:09] okay...? [08:10:19] I'm here, I'm waiting for it :) [08:10:30] well, i thought it was a bit funny to have a class extend a class that was lower than it in the hierarchy [08:10:46] so i just renamed role::elasticsearch to role::elasticsearch::server and it fixed it [08:11:09] the "class foo inherits foo::params" is fairly common [08:11:20] well, the [08:11:26] i never use inheritance [08:11:40] "class foo($server = $foo::params::server) inherits foo::params" [08:11:47] it's not in my imaginary "puppet: the good parts" book [08:12:04] might be a pretty short book [08:12:07] (good m orning) [08:12:22] goooood morning [08:12:48] (03CR) 10Ori.livneh: [C: 032] Carbon: allow storage aggregation rules to be specified as class parameters [operations/puppet] - 10https://gerrit.wikimedia.org/r/99333 (owner: 10Ori.livneh) [08:14:19] so, where was the bug? [08:16:26] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [08:17:30] well, why was there a duplicate class definition? 
[08:17:49] references to 'elasticsearch' were all qualified to disambiguate them from the role class of the same name [08:20:36] PROBLEM - Puppet freshness on cp3013 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 02:19:29 AM UTC [08:22:48] (03PS1) 10Ori.livneh: graphite/carbon: Leave default aggregation pattern unspecified [operations/puppet] - 10https://gerrit.wikimedia.org/r/99334 [08:23:50] (03CR) 10Ori.livneh: [C: 032] graphite/carbon: Leave default aggregation pattern unspecified [operations/puppet] - 10https://gerrit.wikimedia.org/r/99334 (owner: 10Ori.livneh) [08:41:38] lo [08:43:13] apergos: if you are around, i got a few tiny changes for contint/beta in ops/puppet :-] [08:45:52] lay em on me [08:48:38] hashar: [08:49:30] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning. [08:49:39] apergos: here they are [08:49:46] https://gerrit.wikimedia.org/r/97526 needs curl on Jenkins slaves [08:50:00] that might cause duplicate definition though, not sure how to test it [08:50:34] I would just merge and fix if something wrong happens :D [08:50:53] https://gerrit.wikimedia.org/r/#/c/98155/ add another field in Zuul configuration erb template [08:51:20] https://gerrit.wikimedia.org/r/99196 djvulibre-bin package on contint slaves (needed for some MediaWiki core tests to exercise djvu rendering [08:51:47] then I have two changes for my python script that continuously update beta https://gerrit.wikimedia.org/r/99052 and https://gerrit.wikimedia.org/r/99053 [08:51:52] all of them are https://gerrit.wikimedia.org/r/#/q/status:open+owner:hashar+project:operations/puppet,n,z :-D [08:53:47] ok, lemme look [08:59:40] can't you if !defined(Package['blah']) { .. } for the first one? adding it to other places in the manifests that include curl as well [09:04:27] and while we're in here, a style question on the second one: what's preferred, package { [ 'blah' ]: attribs...} or package { 'blah': attribs...} ? because I see both in this file [09:05:55] (03PS2) 10ArielGlenn: beta: update Parsoid dependencies only on changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/99052 (owner: 10Hashar) [09:06:25] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:07:46] (03CR) 10ArielGlenn: [C: 032] beta: update Parsoid dependencies only on changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/99052 (owner: 10Hashar) [09:08:16] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:08:29] apergos: will later on :-) [09:08:31] thx! [09:08:38] apergos: in an audio right now [09:08:40] k [09:09:06] (03CR) 10Odder: [C: 031] Specify HTTPS for $wgCanonicalServer for all private wikis. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99299 (owner: 10MZMcBride) [09:09:13] (03PS2) 10ArielGlenn: beta: missing docstring in autoupdater [operations/puppet] - 10https://gerrit.wikimedia.org/r/99053 (owner: 10Hashar) [09:10:35] (03CR) 10ArielGlenn: [C: 032] beta: missing docstring in autoupdater [operations/puppet] - 10https://gerrit.wikimedia.org/r/99053 (owner: 10Hashar) [09:15:25] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:17:15] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:33:25] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:36:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:50:12] akosiaris, what's the deal with the Wikimedia PA wiki (pa-us.wikimedia.org)? [09:50:23] It's not accessible, but it's still in at least some of the config. [09:50:25] Was it killed? [09:51:27] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:52:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [09:52:23] (03CR) 10Mattflaschen: "There are problems with two of the dash ones (seems related to bug 31335)." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99299 (owner: 10MZMcBride) [09:55:19] (03PS1) 10Faidon Liambotis: varnish: sort bits esams backends [operations/puppet] - 10https://gerrit.wikimedia.org/r/99342 [09:55:27] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:55:44] (03CR) 10Faidon Liambotis: [C: 032] varnish: sort bits esams backends [operations/puppet] - 10https://gerrit.wikimedia.org/r/99342 (owner: 10Faidon Liambotis) [09:58:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [10:10:30] (03PS1) 10Faidon Liambotis: ganglia_new: write a gmond.pid pidfile [operations/puppet] - 10https://gerrit.wikimedia.org/r/99345 [10:11:01] (03CR) 10Faidon Liambotis: [C: 032 V: 032] ganglia_new: write a gmond.pid pidfile [operations/puppet] - 10https://gerrit.wikimedia.org/r/99345 (owner: 10Faidon Liambotis) [10:13:57] paravoid morning! [10:14:54] any questions for varnish patch? [10:15:08] (if you have time of course) [10:15:14] hi yurik, I haven't looked at it yet [10:15:58] paravoid, you know when Reedy generally comes on? [10:16:50] superm401: sorry, no [10:38:11] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:39:11] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [10:41:11] PROBLEM - Host ms5 is DOWN: PING CRITICAL - Packet loss = 100% [10:42:41] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:41:16 AM UTC [10:43:01] RECOVERY - Host ms5 is UP: PING OK - Packet loss = 0%, RTA = 35.67 ms [10:46:41] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:45:22 AM UTC [10:49:17] superm401: pa-us was closed by Reedy on Mar 14 but you are right that configs should not exist if the wiki is closed. I 'll talk with Reedy to figure this out [10:51:28] apergos: thank you for the merge. out for lunch, will revisit this afternoon. :-] [11:14:04] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
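The recurring terbium alerts above are the monitoring side's check_nrpe call giving up after its default 10-second client timeout while the job-queue check runs slowly, rather than necessarily a real job-queue outage — the matching RECOVERY usually follows within a minute or two. One way to confirm that from the monitoring host is to rerun the check by hand with a longer timeout; the plugin path and flags below are the standard NRPE ones, and the host/command names are taken from the alerts:

    # Re-run the same NRPE check manually, allowing 60 seconds instead of the default 10
    /usr/lib/nagios/plugins/check_nrpe -H terbium -c check_job_queue -t 60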
[11:15:04] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [11:21:24] PROBLEM - Puppet freshness on cp3013 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 02:19:29 AM UTC [11:43:20] (03PS1) 10Faidon Liambotis: Silence interface::ip with an onlyif [operations/puppet] - 10https://gerrit.wikimedia.org/r/99358 [11:44:04] (03CR) 10Faidon Liambotis: [C: 032] Silence interface::ip with an onlyif [operations/puppet] - 10https://gerrit.wikimedia.org/r/99358 (owner: 10Faidon Liambotis) [11:45:59] (03CR) 10Faidon Liambotis: [V: 032] Silence interface::ip with an onlyif [operations/puppet] - 10https://gerrit.wikimedia.org/r/99358 (owner: 10Faidon Liambotis) [11:48:18] (03PS1) 10Faidon Liambotis: Fixup for interface::ip [operations/puppet] - 10https://gerrit.wikimedia.org/r/99360 [11:48:29] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Fixup for interface::ip [operations/puppet] - 10https://gerrit.wikimedia.org/r/99360 (owner: 10Faidon Liambotis) [11:54:05] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:56:05] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [11:56:56] (03PS1) 10Faidon Liambotis: s/onlyif/unless/ on interface::ip (doh) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99361 [11:57:39] (03CR) 10Faidon Liambotis: [C: 032 V: 032] s/onlyif/unless/ on interface::ip (doh) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99361 (owner: 10Faidon Liambotis) [12:02:55] (03PS1) 10Matanya: puppet: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99365 [12:11:02] Nemo_bis still online? [12:12:53] (03PS1) 10Matanya: nrpe: two space to 4 space [operations/puppet] - 10https://gerrit.wikimedia.org/r/99369 [12:17:54] (03CR) 10Manybubbles: "Huh? I'm really confused. If this is the standard then great we'll follow it. Elasticsearch doesn't really have a ::client, though. Wh" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99307 (owner: 10Ori.livneh) [12:23:10] (03CR) 10Faidon Liambotis: [C: 04-1] logstash: add Ganglia group and specify aggregators (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99330 (owner: 10Ori.livneh) [12:28:53] (03CR) 10Faidon Liambotis: [C: 04-1] salt: lint cleanup (038 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99051 (owner: 10Matanya) [12:31:11] (03CR) 10Manybubbles: "I'll remove the -1 but I don't agree. My objection comes from the setting: "Automatically enable all new beta features". 
If someone chec" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98046 (owner: 10Legoktm) [12:43:37] (03PS1) 10Matanya: mysql_wmf : lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99374 [12:48:02] (03PS4) 10Matanya: salt: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99051 [12:50:07] (03PS2) 10Faidon Liambotis: Add redirects for mobile/wap.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/98058 [12:50:08] (03PS1) 10Faidon Liambotis: Make m/zero landing page redirect less aggressive [operations/apache-config] - 10https://gerrit.wikimedia.org/r/99375 [12:50:49] (03CR) 10Faidon Liambotis: [C: 032] Make m/zero landing page redirect less aggressive [operations/apache-config] - 10https://gerrit.wikimedia.org/r/99375 (owner: 10Faidon Liambotis) [13:05:52] (03PS1) 10Physikerwelt: added basic hbase support [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/99381 [13:07:29] (03CR) 10Physikerwelt: [C: 04-1] "author is wrong" [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/99381 (owner: 10Physikerwelt) [13:07:58] (03CR) 10jenkins-bot: [V: 04-1] added basic hbase support [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/99381 (owner: 10Physikerwelt) [13:09:51] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:10:07] (03PS2) 10Physikerwelt: added basic hbase support [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/99381 [13:10:49] (03CR) 10Faidon Liambotis: [C: 032] Add redirects for mobile/wap.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/98058 (owner: 10Faidon Liambotis) [13:11:41] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [13:14:11] http://whatthecommit.com/ [13:14:12] rofl [13:17:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [13:17:51] bblack: that you? [13:18:23] shouldn't be, no [13:18:34] even a restart could do it [13:18:52] nope [13:18:53] never mind [13:19:03] someone playing [13:19:20] POST http://www.wikipedia.org/ [13:19:25] heh [13:19:27] now why these are 503s, is beyond me [13:19:34] we have to dig through logs again i guess [13:20:25] I'm about to upgrade + restart 3012 though, last chance to object/delay :) [13:20:26] paravoid: can you imagine how things were when we had 3 people? [13:20:45] Ryan_Lane: I have a pretty good picture right now [13:20:48] :D [13:20:51] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:41] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [13:25:09] !log cp3012 running test varnish pkg w/ jemalloc 3.4.1 [13:25:25] Logged the message, Master [13:27:51] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:27] !log hashar synchronized php-1.23wmf5/extensions/ProofreadPage 'Update Proofreadpage {{gerrit|99042}}' [13:28:41] Logged the message, Master [13:28:41] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [13:34:46] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
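The lint cleanups going through review above (salt, nrpe, mysql_wmf) are the sort of thing puppet-lint reports from the command line. A rough local run against a manifest, with one check disabled, might look like this — the paths are examples and the exact check names depend on the puppet-lint version:

    # Install the linter (or `gem install puppet-lint` if it isn't packaged for your release)
    sudo apt-get install puppet-lint

    # Report style problems (indentation, quoting, alignment, ...) in a single manifest
    puppet-lint manifests/site.pp

    # Individual checks can be switched off, e.g. the 80-character line-length check
    puppet-lint --no-80chars-check modules/salt/manifests/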
[13:36:46] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [13:39:52] (03PS1) 10BBlack: remove cp3013 from mobile esams backend list [operations/puppet] - 10https://gerrit.wikimedia.org/r/99388 [13:43:36] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:41:16 AM UTC [13:43:56] PROBLEM - Varnish HTTP mobile-backend on cp3013 is CRITICAL: Connection refused [13:43:56] PROBLEM - Varnish HTTP mobile-frontend on cp3013 is CRITICAL: Connection refused [13:47:36] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Thu 05 Dec 2013 01:45:22 AM UTC [13:48:35] (03CR) 10BBlack: [C: 032 V: 032] remove cp3013 from mobile esams backend list [operations/puppet] - 10https://gerrit.wikimedia.org/r/99388 (owner: 10BBlack) [13:54:46] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:55:36] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [13:55:46] RECOVERY - Puppet freshness on cp3013 is OK: puppet ran at Thu Dec 5 13:55:45 UTC 2013 [14:01:20] (03Abandoned) 10Faidon Liambotis: Removed X-DfltLang and X-DfltPage headers [operations/puppet] - 10https://gerrit.wikimedia.org/r/86721 (owner: 10Yurik) [14:02:06] (03CR) 10Faidon Liambotis: [C: 032] Removed X-DfltLang & X-DfltPage from zero VCLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/97122 (owner: 10Yurik) [14:09:32] (03PS1) 10Faidon Liambotis: varnish: remove .{wap,mobile} . rewrites/redirects [operations/puppet] - 10https://gerrit.wikimedia.org/r/99394 [14:10:04] (03CR) 10Faidon Liambotis: [C: 032] varnish: remove .{wap,mobile} . rewrites/redirects [operations/puppet] - 10https://gerrit.wikimedia.org/r/99394 (owner: 10Faidon Liambotis) [14:10:11] (03CR) 10Faidon Liambotis: [V: 032] varnish: remove .{wap,mobile} . rewrites/redirects [operations/puppet] - 10https://gerrit.wikimedia.org/r/99394 (owner: 10Faidon Liambotis) [14:11:25] paravoid, you are doing it?!?!?!? [14:11:35] * yurik-road2 hides [14:12:20] am I doing what? [14:12:22] did I break something? [14:12:34] no, but you +2 the varnish change :) [14:12:38] checking.... [14:12:40] not just that [14:12:43] oh? [14:12:45] see the commits above [14:12:47] what else did you break? [14:12:53] * yurik-road2 looking... [14:13:00] moar cleanups [14:13:16] (03CR) 10Nikerabbit: varnish: remove .{wap,mobile} . rewrites/redirects (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99394 (owner: 10Faidon Liambotis) [14:13:39] Nikerabbit: correct, too late :( [14:14:29] paravoid [14:14:34] yes [14:14:40] i really not sure about that cleanup [14:14:48] what if we had some weird obligations semwhere [14:14:55] obligations of what? [14:15:00] for [14:15:08] for some legacy app or something else [14:15:16] i would have to check with brion [14:15:20] when he wakes up [14:15:20] I don't understand [14:15:28] what change are you talking about? [14:15:55] bblack: I need to run puppet on the mobile esams boxes I'm afraid [14:15:59] i vagly remember someone talking long time ago about strange attempts at mobile or wap or some other stuff via a separate domain [14:16:05] https://gerrit.wikimedia.org/r/#/c/99394/ [14:16:15] maybe you're thinking of the apple dictionary gateway? [14:16:39] paravoid: you can, cp3011+2. 
The only reason they're still disabled at the moment is I know it's going to trigger a varnish restart, and I didn't want that while I was watching close for problems [14:16:47] again - i don't know myself, just remember someone talking about it. It could be anything, but i really would rather ask brion & dan about this [14:16:48] but go ahead [14:16:53] bblack: reload you mean? [14:17:04] well, we'll see [14:17:19] !log bouncing labstore1001 (kernel upgrade) [14:17:33] paravoid: it happens, just left a comment in case someone is looking at it afterwards [14:17:36] Logged the message, Master [14:18:41] hmm [14:18:49] bots are hitting en.mobile.wikipedia.org for some reason [14:19:17] told you, something might be weird there - it could be an old app [14:19:22] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:30] googlebot is an old app? :) [14:19:40] it certainly is :) [14:19:56] paravoid, could it be the google's wap gateway? [14:20:06] no [14:20:09] they apparently have something like that, although it probably just refactors mobile [14:20:22] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:20:32] RECOVERY - DPKG on labstore1001 is OK: All packages OK [14:21:01] paravoid: it's enabled whenever you want to run, if you didn't already [14:21:03] 301ing those is right nevertheless [14:21:04] paravoid,in any case, i don't think its a good idea to delete domains like that without an email to the mailing list [14:21:10] I didn't delete them [14:21:20] i thought you did? [14:21:22] no [14:21:40] I'm emitting 301s from apache instead of serving them directly [14:21:53] but since there are hits, I'll play it safe and revert [14:22:04] thx [14:22:20] and btw, in that cleanup you should have removed 666 handling i think, since now nothing throws it [14:22:34] wrong, again [14:22:36] look closer [14:22:51] my commit message links to another commit, go read that. [14:23:03] paravoid: also, jemalloc-3.4.1 looks pretty stable so far on cp3012, still have old setup on cp3011. How long do you think we should let that go before it's sane enough to push elsewhere? wait for the daily assert again? [14:23:46] (03PS1) 10Faidon Liambotis: Revert "varnish: remove .{wap,mobile}. rewrites/redirects" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99395 [14:25:13] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [14:25:42] (03PS2) 10Faidon Liambotis: Revert "varnish: remove .{wap,mobile}. rewrites/redirects" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99395 [14:25:57] bblack: sounds good to me [14:26:12] paravoid, i looked at the linked patch https://gerrit.wikimedia.org/r/#/c/98058/ -- it does not produce 666 errors from what i can see [14:26:14] bblack: but now wouldn't hurt much either I think [14:26:38] yurik-road2: there is no 666 error in HTTP [14:26:48] this is an internal varnish hack to do 302 redirects [14:27:10] (03CR) 10Faidon Liambotis: [C: 032] Revert "varnish: remove .{wap,mobile}. rewrites/redirects" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99395 (owner: 10Faidon Liambotis) [14:27:29] that would be awesome though :-) [14:27:42] MaxSem: would you happen to know why google & bingbot hit en.mobile. instead of en.m. ? [14:28:06] paravoid, correct, but you have removed (now reverted) all the raise 666 from the mobile.frontend file - so you don't need to handle if (obj.status == 666) { anymore [14:28:07] paravoid, there are links to it? 
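A quick sanity check when swapping allocator builds like the jemalloc 3.4.1 test package discussed above is to see what the installed binary links against. This only shows dynamically linked libraries, so the old in-tree jemalloc copy bundled into varnish would not appear here — which is itself a useful signal:

    # Does the installed varnishd link an external allocator at all?
    ldd "$(which varnishd)" | grep -Ei 'jemalloc|tcmalloc' \
        || echo "no external allocator linked (bundled copy or plain glibc malloc)"

    # And which varnish package version is actually installed on this box
    dpkg -l varnish | tail -n 1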
[14:28:40] hm, should link rel="canonical" take care of that? [14:29:07] paravoid, links on the interwebs [14:30:02] was .mobile. used before .m. ? [14:31:47] if i recall it correctly, yes [14:32:52] RECOVERY - Varnish HTTP mobile-backend on cp3013 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.194 second response time [14:32:52] RECOVERY - Varnish HTTP mobile-frontend on cp3013 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.198 second response time [14:34:34] 74 transcode.php, 795 .mobile., 1838608 .m. [14:34:35] http://en.wikipedia.org/wiki/Help:Options_to_hide_an_image#Go_to_the_mobile_Wikipedia_site_.28en.mobile.wikipedia.org.29_and_disable_all_images [14:34:37] in a day [14:34:40] bugagaga [14:34:42] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: Offset unknown [14:35:05] mozilla firefox 2.0 [14:35:06] awesome [14:35:39] yurik-road2: again, I didn't delete the domain before [14:35:51] links would continue to work normally as they were before [14:36:07] stop panicing :) [14:36:42] paravoid, not panicing - just saw a link to mobile, was funny. As for redirects - yes, understood what you did. As for 666 - i don't see why you wanted to keep it [14:36:49] * bblack panics [14:36:52] keep what [14:36:53] ? [14:37:16] the error 666 handling in mobile.frontend when you deleted the code that raised them [14:37:30] you said i was wrong, trying to figure out how [14:37:32] I didn't want to keep it, that's why I deleted it [14:37:38] but you didn't [14:37:44] ? [14:38:24] paravoid, sub vcl_error had an if (obj.status == 666) { left in it [14:38:49] so i'm trying to understand why you wanted to keep it [14:38:52] it doesn't matter [14:39:14] it's dead code right now [14:39:24] but it's useful to have something to throw redirects if needed [14:39:27] right, that's what i thought and was surprised when you said i was wrong [14:39:42] oki, gotcha [14:39:53] we could deploy https://www.varnish-cache.org/vmod/redirect at some point [14:39:56] anyway, i tried the m landing - seems to be working [14:40:08] will play with zero landing in a sec [14:40:37] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset -0.00190615654 secs [14:40:47] yeah, seems like a much cleaner solution [14:40:57] RECOVERY - Puppet freshness on cp3012 is OK: puppet ran at Thu Dec 5 14:40:54 UTC 2013 [14:41:34] !log mobile esams caches restarted again on new packages w/ new jemalloc, puppet's enabled there and should be stable [14:41:51] Logged the message, Master [14:42:01] paravoid, have you seen my email about WAP deprecation? [14:42:17] MaxSem: I have, haven't gotten to it yet [14:42:55] MaxSem: I don't expect to find the time to reply this week [14:42:58] probably next... [14:43:13] I'm sorry, I know it sucks [14:43:17] RECOVERY - Puppet freshness on cp3011 is OK: puppet ran at Thu Dec 5 14:43:11 UTC 2013 [14:43:25] np, just don't forget about it:) [14:43:36] I've starred it [14:43:47] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
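Counts like the .mobile. vs .m. figures quoted above can be pulled from a sampled request log with a one-liner along these lines; the file name and the assumption that full URLs appear verbatim in the log are hypothetical here, so adjust to the actual log format:

    # Tally sampled requests hitting the legacy hostname pattern vs the current one
    grep -oE '\.(m|mobile)\.wikipedia\.org' sampled-requests.log | sort | uniq -c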
[14:44:06] (03PS1) 10BBlack: varnish (3.0.3plus~rc1-wm23) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/99403 [14:44:15] my starred label has 28 mails right now [14:44:47] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [14:45:46] time for gmail to implement a very-starred status to sort out the starred queue :) [14:45:56] bblack: don't upgrade eqiad yet [14:46:03] ok [14:46:11] let's have something to fall back on if this crashes spectularly [14:46:27] and by that I mean losing the persistent store like the day before yesterday [14:46:40] what's on cp301[12] now supposedly has the netmapper memleak fix as well. if it's stable we should see whether that's true in the graphs [14:46:48] nod [14:48:10] bblack, you multiple inboxes labs feature - you can set up whichever filter to show the mail you intersted in the most in a separate section(s) at the top [14:48:18] s/you/use [14:49:08] MaxSem: I added this to the SoS dependency wall, card #54 [14:49:10] I want the labs feature that just shows me 1 email in my whole inbox every time I look at gmail, and it's the 1 email I need to look at right now to work on whatever I'm doing next. [14:49:27] https://mingle.corp.wikimedia.org/projects/scrum_of_scrums/cards/grid?color_by=status&filters[]=[Show+on+Wall][is][Yes]&group_by[lane]=team+dependency&group_by[row]=team&lanes=+%2CAnalytics%2CCore+Features%2CGrowth%2CLanguage%2CMobile%2CMobile+apps%2COperations%2CParsoid%2CPlatform%2CQA%2CVisual+Editor%2CWikipedia+Zero&tab=Dependency+Wall [14:49:31] cool [14:49:48] bblack, that system is called secretary [14:49:58] yes, that's what I need, a secretary [14:50:36] or a brain upgrade that supports multitasking [14:51:10] (03CR) 10BBlack: [C: 032 V: 032] varnish (3.0.3plus~rc1-wm23) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/99403 (owner: 10BBlack) [14:51:14] (03PS1) 10Aude: Add Item and Item_talk namespace aliases for Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99405 [14:51:50] bblack, found it! http://www.treasuresoftware.com/ps.html [14:51:56] (03CR) 10Aude: "no idea why we never had these. I always had these in my test wikis" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99405 (owner: 10Aude) [14:53:15] lol, that's for managing a bowling league :) [14:53:54] but at least its "perfect" !!! [14:54:51] seriously though - try multiple inboxes - i use it to show top 10 "drafts & starred" at the top of the screen [14:56:47] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:47] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [15:01:47] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:02:47] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [15:05:05] wonder why it socket timeout :/ [15:06:10] (03PS1) 10Yurik: Zero: Removed explicit 404-01 carrier because will be covered by default case [operations/puppet] - 10https://gerrit.wikimedia.org/r/99408 [15:06:28] error 404: carrier not found [15:06:54] hehe [15:07:20] (03CR) 10Faidon Liambotis: [C: 032] Zero: Removed explicit 404-01 carrier because will be covered by default case [operations/puppet] - 10https://gerrit.wikimedia.org/r/99408 (owner: 10Yurik) [15:07:29] wow [15:07:31] i mean [15:07:32] wow [15:07:40] paravoid, you ok? 
[15:07:51] this has by far been the quickest +2!!! [15:08:08] * yurik-road2 gives paravoid a chocolate cookie ! [15:08:33] he secretly set up a bot that auto-reviews your commits, it +2's a random 1 out of every 10 :) [15:08:47] hahahaha [15:09:00] yei!!! so if i do lots of patches with minor variation.... mmmm.... i'm the king of the world!!! [15:09:16] brandon's guide to hacking ops... [15:10:04] * yurik-road2 secretly expects a revert patch to come in any second now... [15:14:37] (03CR) 10Akosiaris: [C: 032] "Some minor nitpicks. If you decide to fix them, submit another patch else ping me to merge" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99051 (owner: 10Matanya) [15:15:10] pushing a patch akosiaris [15:15:22] :-) [15:17:36] (03PS5) 10Matanya: salt: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99051 [15:18:03] here ^ you go :) [15:18:09] (03CR) 10Akosiaris: [C: 032] salt: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99051 (owner: 10Matanya) [15:19:08] akosiaris: i really appricate all your review efforts. (you too paravoid) [15:19:39] no worries [15:23:38] bblack: are there any relevent alternatives to autoconf? [15:23:46] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:26:33] Snaps: some people like cmake, but it's really not as complete a solution, even though it's much simpler [15:27:22] or I guess scons too :) [15:27:54] (03PS1) 10Hashar: nrpe: let us specify timeout of check_nrpe [operations/puppet] - 10https://gerrit.wikimedia.org/r/99410 [15:27:55] (03PS1) 10Hashar: icinga: raise timeout of check_job_queue nrpe command [operations/puppet] - 10https://gerrit.wikimedia.org/r/99411 [15:27:59] but mostly, for portable stuff posixy systems-level stuff, I view autoconf as still being a necessary evil, unless the project is very simple [15:28:12] the topic is librdkafka [15:28:46] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [15:30:33] if it were me, I'd use autoconf for it [15:30:52] (/automake/libtool) [15:31:26] bblack: so why don't you use autoconf ? :D [15:32:36] if you want, I could do an autoconf conversion pull req on librdkafka's github to get you started, copy a bunch of boilerplate from elsewhere [15:36:51] you would be considered evil bblack :D [15:37:56] :) [15:43:15] !log added librdkafka 0.8.1-1~precise1 to apt [15:43:29] Logged the message, Master [15:49:53] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:50:43] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [15:55:53] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:53] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [16:04:11] (03PS29) 10Ottomata: Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [16:05:42] (03CR) 10Ottomata: [C: 032 V: 032] Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [16:05:57] (03CR) 10BryanDavis: "Ori has given a +1 with intent to merge on Ie568f268b1. 
This still doesn't need to go out until we are cleared for prod release (still pen" [operations/dns] - 10https://gerrit.wikimedia.org/r/98849 (owner: 10BryanDavis) [16:05:59] woot!, 28 patchsets later :) [16:06:59] !log added varnishkafka 1.0.0-1 to apt [16:07:15] Logged the message, Master [16:07:52] heya bblack, i'm going to install varnishkafka on 3 mobiles today [16:08:03] faidon said I should check with you to make sure we don't bump heads [16:08:18] you're working debugging some mobile varnish stufff, right? [16:10:47] (03CR) 10BryanDavis: "Production release of app is blocked pending resolution of Bug 57546 (security review). This probably should not be merged until that is r" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [16:11:50] (03CR) 10Andrew Bogott: [C: 032] toollabs: remove old tips absent declarations. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98379 (owner: 10Matanya) [16:13:13] thanks andrewbogott :) [16:19:34] heya akosiaris, you around? [16:19:44] want to walk through this varnishkafka thing with me? [16:19:55] (03CR) 10Andrew Bogott: [C: 032] puppet: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99365 (owner: 10Matanya) [16:20:29] ottomata: yes [16:21:15] cool, ok [16:21:29] i just got a 10 minute popup reminder that I have my 1on1 with toby in 10 minutes [16:21:40] real quick, i'll show what i'm about to do [16:21:43] (03CR) 10Matanya: "some nitpicks, based on style guide." (036 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [16:21:45] and we can create the new topic in kafka [16:21:48] so [16:22:25] i'm going to deploy this to cp1046,cp3011,cp4011 [16:22:27] one mobile in each dc [16:22:36] there's no traffic in ulsfo right now though, right? [16:22:44] exactly [16:22:48] k [16:23:22] mwalker|away: reedy didn't get back to you, right? [16:23:41] mwalker|away: generally, yeah, probably, just be ready to help before so reedy knows what's going one :) [16:23:44] er on [16:23:46] ok, so we are going to produce to kafka on topic => 'webrequest-mobile' [16:23:57] that topic doesn't exist yet, and librdkafka doesn't auto create topics [16:24:08] so we have to create it ourselves using the kafka cli [16:24:24] akosiaris: want to join a screen on analytics1021.eqiad.wmnet [16:24:24] ? [16:25:02] k [16:25:12] you there? 
[16:25:15] screen -x kakfa [16:25:18] screen -x kafka [16:25:44] coool :) [16:25:50] (03PS8) 10Dzahn: bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 [16:26:23] (03CR) 10jenkins-bot: [V: 04-1] bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [16:26:27] (03CR) 10Dzahn: bugzilla module (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [16:26:32] so so, notice that there is a kafka.sh profile.d script that sets ZOOKEEPER_URL [16:26:36] this is installed by puppet [16:26:46] and it a convenience, the kafka cli will look for that env var [16:26:57] if it is set, you don't have to set that flag all the time on all the kafka commands [16:27:16] ok you can see that I have one topic created [16:27:18] called 'test' [16:27:30] it has 10 partitions (there are 10 kafka log disk partitions) [16:27:34] and 2 replicas [16:27:52] you can also see that the leaders for each topic-partition are spread evenly between each broker [16:28:05] if you ever have to restart a broker, the leader will change [16:28:12] so if we restarted the broker on an21 [16:28:17] the leader for this topic would switch to an22 [16:28:28] it never changes back automatically, you have to tell it to do so [16:28:41] that's what preferred-replica-election [16:28:45] it just rebalacnes normally [16:28:48] anyway, we don't have to do that now [16:28:52] (03PS9) 10Dzahn: bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 [16:28:54] we want to create the webrequest-mobile topic [16:29:12] so, i'm going to create it with 10 partitions and 2 replicas [16:29:14] just like test [16:29:31] easy enough :-) [16:29:31] so, there we go, now it exists [16:29:34] yup [16:29:55] ok, my 1on1 with toby is about to start, let's pick this back up when I'm done? [16:30:01] ok [16:30:44] (03CR) 10Dzahn: [C: 031] "weekday and other comments by alex on PS7: done" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [16:31:49] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:31:53] akosiaris: thanks for all your reviews, didn't get to it, yesterday and today on datacenter visits.. but alll comments done [16:32:21] and, yea, wasnt about to merge that firewall bastion thing :p very good catch to -1 it though [16:32:45] hm akosiaris, no toby yet and he's not on irc [16:32:49] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [16:32:54] so uhhh, let's merge! [16:33:27] i betcha there will be at least one puppet error [16:33:27] :p [16:34:18] (03PS12) 10Ottomata: Setting up varnishkafka on 3 mobile varnish hosts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 [16:34:31] (03CR) 10Ottomata: [C: 032 V: 032] Setting up varnishkafka on 3 mobile varnish hosts. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [16:35:47] ok, running puppet on cp1046 and cp3011 [16:36:49] (03CR) 10Akosiaris: [C: 032] bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [16:37:19] (03PS2) 10Dzahn: remove outdated tesla subnet from dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/96489 [16:37:25] mutante: ^ :) Thanks for all the work :) [16:37:30] akosiaris: weee:) [16:37:46] (03CR) 10Andrew Bogott: [C: 032] mysql_wmf : lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/99374 (owner: 10Matanya) [16:39:42] (03CR) 10Dzahn: [C: 031] "i changed this patch to ONLY remove the tesla subnet but not touch any other networks" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96489 (owner: 10Dzahn) [16:39:47] matanya: both of those linting patches applied with no diff -- nice work! [16:40:08] (03PS1) 10Ottomata: Depending on proper package name ganglia-monitor [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/99417 [16:40:08] thanks andrewbogott :) [16:40:09] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 2352 MB (6% inode=93%): [16:40:26] ottomata: seems like you were right ? [16:40:35] hehe [16:40:39] (03CR) 10Ottomata: [C: 032 V: 032] Depending on proper package name ganglia-monitor [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/99417 (owner: 10Ottomata) [16:40:46] fixing the lutetium issue [16:41:49] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:42:13] akosiaris: i'll do the role/site.pp change putting it on zirconium tomorrow probably.. gotta run to visit dc.. thx again, cya [16:42:26] ok bye :-) [16:42:37] (sacramento) [16:42:56] (03PS1) 10Ottomata: Fixing kafka broker config in role/cache.pp, updating varnishkafka module [operations/puppet] - 10https://gerrit.wikimedia.org/r/99418 [16:43:19] (03CR) 10Ottomata: [C: 032 V: 032] Fixing kafka broker config in role/cache.pp, updating varnishkafka module [operations/puppet] - 10https://gerrit.wikimedia.org/r/99418 (owner: 10Ottomata) [16:43:19] see you later guys [16:45:09] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 2352 MB (6% inode=93%): [16:45:49] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [16:48:11] (03PS1) 10Ottomata: Fixing parameter name for logline_scratch_size [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/99420 [16:48:21] (03CR) 10Ottomata: [C: 032 V: 032] Fixing parameter name for logline_scratch_size [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/99420 (owner: 10Ottomata) [16:49:07] (03PS1) 10Ottomata: Updating varnishkafka module with logline_scratch_size fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/99421 [16:49:20] (03CR) 10Ottomata: [C: 032 V: 032] Updating varnishkafka module with logline_scratch_size fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/99421 (owner: 10Ottomata) [16:50:09] RECOVERY - check_disk on lutetium is OK: DISK OK - free space: / 26779 MB (75% inode=93%): [16:50:50] akosiaris: almost! i've started a console-consumer for the webrequest-mobile topic in our screen [16:50:56] wop, there it goes! [16:50:59] i 've seen it [16:51:02] a lot of data... [16:51:16] so that is just mobile ? [16:51:22] cool, i see both hosts [16:51:23] sampling ? 
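For reference, the topic-creation walkthrough from the screen session above maps onto Kafka's stock command-line tools roughly as follows. This is a sketch using upstream 0.8.1-style script names; the puppet-installed kafka wrapper mentioned earlier reads ZOOKEEPER_URL from the profile.d kafka.sh script, so on the brokers the explicit --zookeeper flag can usually be dropped (the wrapper's exact syntax is an assumption):

    # List existing topics (only 'test' before this change).
    kafka-topics.sh --zookeeper "$ZOOKEEPER_URL" --list

    # Create the new topic with the same layout as 'test':
    # 10 partitions (one per kafka log disk) and 2 replicas.
    kafka-topics.sh --zookeeper "$ZOOKEEPER_URL" --create \
        --topic webrequest-mobile --partitions 10 --replication-factor 2

    # Show which broker leads each partition.
    kafka-topics.sh --zookeeper "$ZOOKEEPER_URL" --describe --topic webrequest-mobile

    # After a broker restart, leadership does not move back on its own;
    # trigger the rebalance to the preferred replicas explicitly.
    kafka-preferred-replica-election.sh --zookeeper "$ZOOKEEPER_URL"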
[16:51:27] that's just 2 mobile hosts [16:51:30] no sampling [16:51:47] hmmm cool :-) [16:52:36] COOOOL [16:54:43] !log deployed varnishkafka 1.0.0-1 to cp1046,cp3011,cp4011 [16:54:59] Logged the message, Master [16:57:49] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:01:49] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:02:05] (03CR) 10Legoktm: [C: 031] Correct capitalization of "ShoutWiki". [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/96018 (owner: 10Jack Phoenix) [17:13:50] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:15:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:18:50] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:19:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:23:23] (03CR) 10Ori.livneh: "stet!" (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [17:23:41] hi is the puppet zookeeper module used somewhere . If I try to install it I get The following packages have unmet dependencies: zookeeperd : Depends: zookeeper (= 3.3.5+dfsg1-1ubuntu1) but 3.4.5+20-1.cdh4.3.1.p0.76~precise-cdh4.3.1 is to be installed E: Unable to correct problems, you have held broken packages. [17:25:50] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:29:37] !log catrope synchronized php-1.23wmf5/resources/oojs/oojs.js 'Add oojs' [17:29:52] Logged the message, Master [17:29:52] !log catrope synchronized php-1.23wmf5/resources/Resources.php 'Add oojs' [17:30:08] Logged the message, Master [17:30:24] !log catrope synchronized php-1.23wmf5/extensions/VisualEditor 'oojs fixes' [17:30:39] Logged the message, Master [17:30:42] !log catrope synchronized php-1.23wmf5/extensions/MultimediaViewer 'oojs fixes' [17:30:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:30:58] Logged the message, Master [17:36:28] paravoid, would you mind taking a look at https://gerrit.wikimedia.org/r/99272 and adding your feedback? yurik and i are in violent agreement and could use your advice [17:38:00] !log catrope synchronized php-1.23wmf5/resources/oojs/oojs.js 'I synced you before, now start existing' [17:38:17] Logged the message, Master [17:38:38] !log catrope synchronized php-1.23wmf5/resources/ [17:38:55] Logged the message, Master [17:42:42] andrewbogott: great [17:43:25] * andrewbogott reads about plural support [17:43:26] andrewbogott: Here's some of the recent work on undeclared class properties in core: https://gerrit.wikimedia.org/r/#/q/project:mediawiki/core+branch:master+topic:visibility,n,z [17:43:39] andrewbogott: https://www.mediawiki.org/wiki/Manual:Messages_API [17:45:17] dr0ptp4kt: I'm lacking too much context :( [17:47:04] paravoid, okay. you able to do a google hangout real quick? [17:47:39] ok [17:48:16] "calling" [17:48:44] (03CR) 10Matanya: Add configuration for Wikimania Scholarships (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [17:50:32] greg-g: I'm a bit confused; what was reed_ supposed to get back to me on? [17:54:57] mwalker: you pinged him and me about a deploy today? [17:55:07] gah, nope, not you [17:55:10] mwalker: my bad [17:55:14] no worries! 
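The sanity check described above -- watching the new topic from a console consumer while the first varnishkafka instances produce into it -- looks roughly like this with the stock 0.8 tools (the locally installed wrapper may spell it differently):

    # Tail webrequest-mobile; with two mobile caches producing unsampled
    # requests this should scroll fast.
    kafka-console-consumer.sh --zookeeper "$ZOOKEEPER_URL" \
        --topic webrequest-mobile --from-beginning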
[17:55:28] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:55:36] superm401: reedy didn't get back to you, right? [17:55:56] matt vs matt [17:56:01] in my head [17:56:14] very much like spy vs spy [17:57:11] (03CR) 10BryanDavis: Add configuration for Wikimania Scholarships (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [17:57:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [18:00:28] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:01:11] but he's super matt [18:01:18] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [18:01:52] mwalker: so you have it cut out for it, good luck ;) [18:09:29] oh hey Reedy, did you see that thing I sent you re download.wikimedia? [18:10:06] Nope [18:10:13] I can't do anything with it anyway... [18:10:23] right, well... [18:11:02] Reedy: ah, I buried it in an email, the "RelEng & QA followup" subject one [18:11:13] last paragraph that no one reads ;) [18:11:28] (I mean that seriously, I totally buried it) [18:13:35] greg-g: Can you summarise? :P [18:13:46] good call [18:14:30] basically, was chatting with Antoine, and he and I thought you might be able to help with more reliably getting tarballs on downloaddot, dealing with the swift backend [18:14:51] no in that you know everything now, but in that, it'd be a cool project to learn stuff [18:14:57] s/no/not/ [18:21:02] (03CR) 10MaxSem: Add configuration for Wikimania Scholarships (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [18:25:22] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:26:22] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [18:28:01] greg-g, no, no one got back to me yet. I'm glad to explain it. [18:28:23] Basically, I moved some messages out of GuidedTour, and I want to move them into GettingStarted before prod breaks. [18:28:30] Meant to do this earlier, but there was an oversight. [18:29:03] this is different than the pa-us issue, right? [18:29:15] paravoid, yeah, totally unrelated. [18:29:47] okay, let's just invoke Reedy's wisdom to this issue too then [18:30:09] :) [18:30:29] well, sounds like we should do it pre deploy/with the deploy that's about to happen :) [18:30:36] In theory... [18:30:49] Are you also changing the key names? [18:31:38] Reedy, yeah, that's what I'm requesting. [18:31:53] To do piggy back on the one at 11 PST. [18:31:59] Not changing the key names. [18:32:01] Or values. [18:32:12] I'm confused [18:32:15] What are you requesting? [18:32:44] To do an extension submodule update on wmf5 before you rotate wmf5 to Wikipedia. [18:33:15] Oh, right [18:33:24] Yeah, should be fine [18:33:29] Okay, thanks. [18:33:43] Have you made a commit to do the update? [18:34:02] Yes [18:34:08] https://gerrit.wikimedia.org/r/99338 [18:34:31] One big scappy family [18:34:58] Reedy, thanks, do you want me to update the submodule on tin, or will that get taken care of? [18:35:06] I'm doing it now :) [18:35:21] Thank you [18:35:22] "One big scappy family" made me laugh way harder than it should have. 
:D [18:35:49] php-1.22wmf15 through php-1.23wmf6 [18:35:52] I should cleanup [18:39:49] bd808: sorry to dirty your module :| [18:40:04] matanya: No worries [18:40:58] I'd only be mad if you slapped a -2 on it :) [18:41:09] :) [18:41:11] And I'd get over that [18:45:31] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:31] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [18:54:31] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:41] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [18:55:41] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [18:56:01] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [18:56:01] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [18:56:01] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [18:56:04] !log reedy synchronized php-1.23wmf6 'Staging' [18:56:18] hrm [18:56:18] Logged the message, Master [18:56:21] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [18:56:31] uh oh [18:56:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [18:56:41] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [18:56:45] ugh [18:56:52] those are image scalers [18:57:01] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [18:57:07] huh, right, rendering.svc, makes sense [18:58:02] PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:58:14] what's going on? [18:58:15] i see a flood of /usr/bin/convert jobs [18:58:28] image scalers went nuts [18:58:51] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [18:58:52] Reedy: do you know of any large uploads going on? [18:59:00] I'm not doing any [18:59:07] well then [19:00:02] PROBLEM - RAID on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:09] Oh there we go [19:00:11] PROBLEM - DPKG on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:11] PROBLEM - Disk space on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:11] PROBLEM - Swift HTTP backend on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:11] PROBLEM - puppet disabled on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:11] PROBLEM - SSH on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:17] RoanKattouw: ? [19:00:24] Now the media storage backends are overloading [19:00:30] Or at least 1004 is [19:00:33] * greg-g nods [19:00:44] (It probably doesn't actually have RAID problems, but if the checks are timing out, that means it's under high stress) [19:00:47] AaronSchulz: ^^^ do you know anything about what's going on? 
[19:01:23] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND [19:01:23] 24157 apache 20 0 322m 263m 2648 R 100 2.2 0:35.09 convert [19:02:01] RECOVERY - Disk space on ms-fe1004 is OK: DISK OK [19:02:11] RECOVERY - puppet disabled on ms-fe1004 is OK: OK [19:02:11] RECOVERY - DPKG on ms-fe1004 is OK: All packages OK [19:02:11] RECOVERY - SSH on ms-fe1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:02:45] welcome back 1004 [19:03:52] RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.310 second response time [19:05:16] now what about the renderers though, did they just blow up completely and the storage is back becuase it isn't getting hammered anymore? [19:06:56] "Conversion" sounds so... cultish. :-) [19:06:59] I mean I'm on mw1159 and it doesn't seem to be unhappy or down [19:07:09] PROBLEM - Disk space on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:07:09] PROBLEM - DPKG on ms-fe1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:07:46] apergos: but icinga doesn't think it's recovered though, right? [19:07:59] RECOVERY - Disk space on ms-fe1004 is OK: DISK OK [19:07:59] RECOVERY - DPKG on ms-fe1004 is OK: All packages OK [19:08:03] it's been ooming processes though [19:08:09] converts [19:08:09] RECOVERY - Swift HTTP backend on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.078 second response time [19:08:09] * greg-g nods [19:08:38] but that seems to be par for the course [19:08:45] !log reedy updated /a/common to {{Gerrit|I18d8a12b1}}: repool es1002 after upgrade [19:08:46] I mean I see them from the time the syslog started [19:08:50] RECOVERY - RAID on ms-fe1004 is OK: NRPE: Unable to read output [19:08:59] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.460 second response time [19:09:01] Logged the message, Master [19:09:39] * greg-g waits for it... [19:09:40] icinga thinks apache is unreachable [19:09:46] (on mw1159) [19:09:58] hmmmm [19:10:10] which of our extensions touch the parser cache? cirrus, maybe? [19:10:23] (asking re: https://bugzilla.wikimedia.org/show_bug.cgi?id=58042 ) [19:11:19] PROBLEM - Swift HTTP backend on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:29] bah, still going [19:11:59] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [19:12:21] ori-l: have you been able to figure out where the requests are coming from? [19:13:35] greg-g: The internet [19:13:37] !log reedy started scap: testwiki to 1.23wmf6, build l10n cache and rebuild for 1.23wmf5 [19:14:14] Reedy: jerk [19:14:39] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:15:08] Logged the message, Master [19:15:19] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:15:29] that was me resarting the apache on there [19:15:29] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:16:19] PROBLEM - DPKG on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
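When the scalers pile up like in the top snippet above, a quick way to see how bad the convert backlog is on a given box (plain GNU ps, nothing WMF-specific) is:

    # ImageMagick convert jobs, busiest first, with how long each has run.
    ps -C convert -o pid,pcpu,pmem,etime,args --sort=-pcpu | head -n 20

    # Count them; dozens of long-running converts on one scaler is a red flag.
    pgrep -c convert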
[19:17:09] RECOVERY - Swift HTTP backend on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.077 second response time [19:17:19] RECOVERY - DPKG on ms-fe1001 is OK: All packages OK [19:17:39] PROBLEM - SSH on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:50] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.109 second response time [19:17:59] PROBLEM - Swift HTTP backend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:59] PROBLEM - puppet disabled on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:17:59] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:06] Reedy: why would you scap? [19:18:09] PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:29] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [19:18:39] PROBLEM - Swift HTTP frontend on ms-fe1001 is CRITICAL: Connection timed out [19:18:50] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.025 second response time [19:18:53] RECOVERY - puppet disabled on ms-fe1001 is OK: OK [19:18:57] right, so, there's been no recovery yet, further deploys on hold [19:18:59] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:59] RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 1.057 second response time [19:19:54] paravoid: ^^ thoughts on the image scalers dieing? [19:20:19] PROBLEM - Swift HTTP backend on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:39] PROBLEM - Disk space on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:39] RECOVERY - Swift HTTP frontend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 137 bytes in 4.251 second response time [19:20:49] RECOVERY - Swift HTTP backend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 3.325 second response time [19:20:59] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:21:02] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Jobrunners+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [19:21:20] AaronSchulz: any idea? [19:21:39] RECOVERY - Disk space on ms-fe1001 is OK: DISK OK [19:21:55] I'd like to know what went down just before they started falling over [19:21:59] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 8.037 second response time [19:22:09] PROBLEM - SSH on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:09] PROBLEM - Disk space on ms-fe1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:09] RECOVERY - Swift HTTP backend on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.078 second response time [19:22:10] PROBLEM - RAID on ms-fe1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:19] PROBLEM - RAID on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:19] PROBLEM - DPKG on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:22:38] apergos: seems you are workign the outage, let me know if you need me to pull myself, or chris/daniel/leslie out of this meeting [19:22:39] RECOVERY - SSH on ms-fe1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:22:59] RECOVERY - Disk space on ms-fe1002 is OK: DISK OK [19:22:59] RECOVERY - SSH on ms-fe1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:23:06] Wow, what happened [19:23:09] not working it very well I must say [19:23:09] RECOVERY - RAID on ms-fe1002 is OK: NRPE: Unable to read output [19:23:09] RECOVERY - DPKG on ms-fe1001 is OK: All packages OK [19:23:19] RECOVERY - RAID on ms-fe1001 is OK: NRPE: Unable to read output [19:23:19] RobH: help would be appreciated [19:23:24] paravoid: you there? [19:23:29] PROBLEM - puppet disabled on ms-fe1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:47] (03PS1) 10coren: Tool Labs: add mod_setenv by default to lighttpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/99449 [19:24:00] on mw1156 at least it's this image: http://commons.wikimedia.org/wiki/File:Planetoid_90377_sedna_animation_location.gif [19:24:16] Image scalers are dying? Anything else causing trouble? [19:25:19] PROBLEM - Swift HTTP backend on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:34] marktraceur: just that (and the repurcussions) [19:25:36] we culd shoot the current converts [19:25:39] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:25:43] 'kay, topic changed in -tech [19:25:48] but again I didn't see that on mw1159 (for example) [19:25:53] 2013-12-05 19:25:40 mw1155 commonswiki: Thumbnail failed on mw1155: could not get local copy of "Rengo_19_10_2013_(10370081265).jpg" [19:25:57] (03CR) 10coren: [C: 032] "Trivial change." [operations/puppet] - 10https://gerrit.wikimedia.org/r/99449 (owner: 10coren) [19:26:09] PROBLEM - SSH on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:09] PROBLEM - Disk space on ms-fe1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:26:09] PROBLEM - RAID on ms-fe1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:26:11] 2013-12-05 19:25:57 mw1155 commonswiki: thumbnail failed on mw1155: error 137 "" from "'/usr/bin/convert' -background white '/tmp/localcopy_0bdc284c0c88-1.gif' -coalesce -thumbnail '330x248!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Light_dispersion_conceptual_waves.gif' -depth 8 -rotate -0 -fuzz 5% -layers optimizeTransparency '/tmp/transform_c82db1200635-1.gif'" [19:26:19] RECOVERY - puppet disabled on ms-fe1002 is OK: OK [19:26:36] cmjohnson1: can you relay to those who are with you that image scalers are dead/dying, and mediastorage is also flapping [19:26:44] Mostly unable to get local copy errors [19:26:58] greg-g [19:27:00] k [19:28:00] RECOVERY - Disk space on ms-fe1002 is OK: DISK OK [19:28:09] RECOVERY - SSH on ms-fe1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:28:09] RECOVERY - RAID on ms-fe1002 is OK: NRPE: Unable to read output [19:28:25] exec.log is filled with animated gifs specifically [19:29:09] RECOVERY - Swift HTTP backend on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.077 second response time [19:29:14] why would there be a lot of animated gifs all at once? 
[19:29:40] Someone deciding to purge them all [19:29:50] hey [19:29:54] hey [19:29:59] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:05] Certainly doesn't look to be like a mass upload of them [19:30:21] 0 in the last 500 uploads to commons [19:30:28] Nemo_bis: any idea? [19:30:29] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:30:36] i looked at three of these gifs and they all were linked to from User:Nemo bis/Sandbox [19:30:40] Reedy: looks like that, that planet gif was from 2005 [19:30:49] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.008 second response time [19:31:23] Nemo_bis: hey do you know anything about htis ? [19:31:36] I didn't even know I had such a subpage [19:31:42] There's 1129 gifs on that page [19:32:00] PROBLEM - MySQL InnoDB on db1059 is CRITICAL: CRIT longest blocking idle transaction sleeps for 739 seconds [19:32:08] Nemo_bis: you created it :) [19:32:12] * Reedy delete it [19:32:14] * Reedy deleted it [19:32:17] thanks [19:32:22] :) [19:32:43] I'm sure we all went to the same page potentially making the problem possibly worse.. [19:32:43] * greg-g prays [19:32:47] I have seen this one [19:32:49] File:DNA_orbit_animated.gif [19:32:53] ABUSE [19:32:53] crop up a few times now [19:32:58] was it created a long time ago ? [19:33:05] now I lost my sandbox history [19:33:06] or is it possible your account pw was compromised ? [19:33:15] the page was a year old [19:33:31] LeslieCarr: well, if someone can look at the log, it was created by nemo, then others were the last to edit, no one I knew, but that doesn't mean anything [19:33:33] Nemo_bis: you can have it back later [19:33:39] It just seems a bit suspect [19:33:40] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:33:59] iirc I just used it to glance over a few GIFs; there aren't many on Commons so it's normal that a good portion of them was present in it [19:33:59] greg-g: we really don't know that's nemo's page [19:34:00] Most of the errors are still Thumbnail failed on mw1154: could not get local copy of in the thumb logs [19:34:00] RECOVERY - MySQL InnoDB on db1059 is OK: OK longest blocking idle transaction sleeps for 1 seconds [19:34:07] it could easily easily be a coincidence [19:34:12] ori-l: completely [19:34:19] PROBLEM - Swift HTTP backend on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:32] i'll check out mw1154's health .... [19:34:39] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:34:40] Which suggests it's the image storage, not the scalers having issues (that and the flapping of SWIF stuff) [19:34:46] ooo, lots of kernel dumps [19:34:53] crap, brb [19:35:05] lots [19:35:08] yeah but you see those all day long (oom fr cgroup blah) [19:35:10] That's not good [19:35:11] might just be that it's flipping out ? [19:35:17] unless you are seeing something different [19:35:20] let me compare to a healthy image scaler [19:35:29] Is it me... Or did the thumb logs just get a lot quieter? 
[19:35:46] oh, same on mw1160 [19:35:48] Nope, just a pause [19:36:13] RECOVERY - Swift HTTP backend on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.077 second response time [19:36:16] icinga still thinks mw1159 is unhappy but I'm seeing the converts come in [19:36:22] and not an overwhelming number wither [19:36:24] either [19:36:54] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.044 second response time [19:36:54] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:37:03] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 64262 bytes in 0.308 second response time [19:37:13] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.610 second response time [19:37:28] marktraceur: could a recent change to MultimediaViewer have caused this? [19:37:41] Image scalers? Naw, MMV is all frontend [19:37:42] !log reedy finished scap: testwiki to 1.23wmf6, build l10n cache and rebuild for 1.23wmf5 [19:37:49] And not enough people use it to cause a heavy load [19:37:58] Logged the message, Master [19:38:25] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki back to 1.23wmf5 for now [19:38:34] (03PS1) 10Reedy: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99453 [19:38:42] Logged the message, Master [19:39:04] (03CR) 10Reedy: [C: 032] Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99453 (owner: 10Reedy) [19:39:13] (03Merged) 10jenkins-bot: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99453 (owner: 10Reedy) [19:39:54] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.040 second response time [19:39:54] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:40:20] well, that's good, but what happened? [19:40:23] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.436 second response time [19:40:23] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.415 second response time [19:40:37] Reedy reverted? [19:40:54] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [19:40:58] I dont think so [19:41:03] I think it's unrelated [19:41:04] oh, yeah [19:41:08] missed that [19:41:12] well I have not been doing anything here but looking [19:41:24] testwiki was actually only on 1.23wmf6 for less than a minute [19:41:30] looks better now : https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Image%2520scalers%2520eqiad&tab=m&vn=&hide-hf=false [19:41:33] !log Quick bounce of labstore1001 (kernel tweak) [19:41:33] and it was only testwiki? [19:41:37] indeed [19:41:49] Logged the message, Master [19:42:04] It seemingly started around the time of syncing the wmf6 code [19:42:08] But nothing was using it at that point [19:42:27] Thumb logs look no happier [19:42:48] do they look any happier from earlier, say 2 hours ago? 
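One way to turn the thumbnail.log noise on fluorine into something actionable -- which scalers and which files dominate the failures -- is a quick aggregation; the field positions assume the log format quoted in this incident:

    # Failures per scaler host over the last 10k lines.
    tail -n 10000 thumbnail.log | grep -i 'thumbnail failed' \
        | awk '{print $3}' | sort | uniq -c | sort -rn

    # Most frequently failing files, to spot a purge storm on specific media.
    tail -n 10000 thumbnail.log | grep -o 'local copy of "[^"]*"' \
        | sort | uniq -c | sort -rn | head -n 20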
[19:43:00] 2013-12-05 19:42:40 mw1160 commonswiki: Thumbnail failed on mw1160: could not get local copy of "Kevin_Federline.jpg" [19:43:00] 2013-12-05 19:42:40 mw1156 commonswiki: Thumbnail failed on mw1156: could not get local copy of "Finis_gloriae_mundi_from_Juan_Valdez_Leal.jpg" [19:43:08] Wasn't looking at them that long ago... [19:43:16] good to compare [19:43:34] it was wihing the same minute (icinga notified in the :25 minute, !log that Reedy started scap in the :26 minute) [19:44:03] yeah it was in the same minute (I looked at the irc timestamps too) [19:44:03] PROBLEM - DPKG on labstore1001 is CRITICAL: Connection refused by host [19:44:03] PROBLEM - RAID on labstore1001 is CRITICAL: Connection refused by host [19:44:03] PROBLEM - Disk space on labstore1001 is CRITICAL: Connection refused by host [19:44:03] PROBLEM - puppet disabled on labstore1001 is CRITICAL: Connection refused by host [19:44:31] well i don't think there's much i can do to help right now ... feel free to text me if need be [19:44:38] [18:54:28] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:38] [18:55:58] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [19:44:38] [18:56:01] !log reedy synchronized php-1.23wmf6 'Staging' [19:44:53] PROBLEM - SSH on labstore1001 is CRITICAL: Connection refused [19:45:12] yeah, but the log msg is on sync complete [19:45:19] Right [19:45:24] But it was pushing unused code [19:46:09] there go the front ends again [19:47:40] reedy@fluorine:/a/mw-log$ tail -n 10000 thumbnail.log | grep -c local [19:47:40] 7092 [19:47:43] and now magically ok (watching icinga) [19:51:38] Reedy: unrelated, but lots of docroot 404s in apache2.log [19:51:44] on fluorine, i mean [19:52:30] oooooold versions [19:53:19] Looks mostly like noise [19:53:23] RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.008 second response time [19:53:37] slowly.. [19:53:56] 2013-12-05 19:46:56 mw1084 commonswiki: [53da5566] /w/index.php?title=Special:UserLogin&returnto=Special%3AGlobalUsage&returntoquery=offset%3DShanghai_Transrapid_002.jpg%7Cruwiki%7C1881751&type=signup&uselang=sl&campaign=loginCTA&fromhttp=1&fromhttp=1 Exception from line 1057 of /usr/local/apache/common-local/php-1.23wmf5/includes/filebackend/SwiftFileBackend.php: Got InvalidResponseException exception. [19:54:25] ok well I restarted the proxy server on ms-fe1003 so [19:54:33] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:56:01] all the swift eqiad frontends look lik they are stabilized but at higher load than before, same is true of the backends [19:56:08] http://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=cpu_report&s=by+name&c=Swift+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [19:56:13] yeah, was just looking at that [19:56:33] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:56:33] Other than updating the localisation cache in wmf5, there's nothing new been deployed/in use [19:57:05] we should page paravoid [19:57:24] please do [19:57:29] I asked that 20 minutes ago ;) [19:57:31] not near a phone [19:58:21] [5431487.795544] swift-object-se: page allocation failure: order:5, mode:0x4020 [19:58:44] calling his work extension... 
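The kernel line quoted just above ("swift-object-se: page allocation failure: order:5") means an allocation of 2^5 contiguous pages -- 128 KiB with 4 KiB pages -- could not be satisfied, i.e. memory on that swift box was exhausted or badly fragmented at that moment. Fragmentation per allocation order can be eyeballed with:

    # Free chunks per allocation order (columns are order 0, 1, 2, ...);
    # near-zero counts from order 5 upward are consistent with the failure above.
    cat /proc/buddyinfo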
[19:59:16] greg-g: He's marked /away [19:59:22] Probably better texting him [19:59:30] he's usally pretty responsive [20:00:42] Reedy: I don't have international texting, I beliebe [20:00:44] v [20:00:55] I can do it [20:01:01] * greg-g is on cheap reseller cell plan [20:01:07] What are we wanting to say... Swift was unhappy, but isn't anymore? ;) [20:01:08] ty [20:01:20] it's still a little unhappy [20:01:30] well, I'm just worried we won't be able to figure out what happened if we don't look at it now [20:01:54] I can text him [20:02:21] just let's not two of us text/call him [20:02:36] I called, left message on his work extension, reedy's texting [20:02:50] ok [20:02:56] I hadn't started [20:03:00] ah [20:03:03] Would be cheapest for apergos to do it :P [20:03:06] then I will cause it's very cheap [20:03:07] yep [20:03:10] thanks [20:03:11] :D [20:03:18] brb [20:04:11] !log reedy updated /a/common to {{Gerrit|Ifda85f2ce}}: Add/update symlinks [20:04:27] Logged the message, Master [20:04:28] Still wrong logmsgbot [20:04:51] what do you do to update the repo? [20:04:54] it works for everyone else [20:05:03] I often commit from tin [20:05:15] so that was after comitting [20:05:30] (03PS1) 10Reedy: Everything else to 1.23wmf5 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99458 [20:06:05] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: everything else to 1.23wmf5 [20:06:08] it's a local-only commit at the time you're committing so the script assumes it's a security patch [20:06:19] Ah [20:06:20] Logged the message, Master [20:06:30] PHP Fatal error: Call to undefined method WikitextContent::getHeader() in /usr/local/apache/common-local/php-1.23wmf5/extensions/ProofreadPage/includes/page/EditProofreadPagePage.php on line 137 [20:06:32] No tpt... [20:07:05] (03CR) 10Reedy: [C: 032] Everything else to 1.23wmf5 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99458 (owner: 10Reedy) [20:07:16] (03Merged) 10jenkins-bot: Everything else to 1.23wmf5 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99458 (owner: 10Reedy) [20:07:44] we'll see, he might actually be out [20:08:06] https://gerrit.wikimedia.org/r/#/c/99042/ [20:08:31] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:08:43] !log image scalers started overloading at 18:56, cause appears to be a spike of convert jobs exceeding limits & getting killed; swift-backend.log on fluoine has lots of InvalidResponseException; syslog on swift has "swift-object-se: page allocation failure". [20:09:00] Logged the message, Master [20:09:52] !log increase in load coincided with sync of wmf6 to apaches and subsided on roll-back, but wmf6 was not enabled anywhere at the time of syncing [20:09:52] yeah except that the front and back ends still continue to have more load now [20:10:01] this is the part I don't believe... [20:10:08] Logged the message, Master [20:10:31] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [20:10:32] not "reedy did it" but "the spike caused it", that's what I don't believe [20:10:47] It's never my fault! [20:10:48] * Reedy grins [20:10:51] heh [20:10:58] well it would have been easier if it was [20:11:00] revert and done [20:11:08] anyways.... 
[20:11:35] actually the front ends are (mostly) now back at their same level [20:12:20] lemme look at one of these here backends now that things aren't broken [20:12:45] root@ms-be1008:/var/log# ps aux | grep swift-object-server | wc -l [20:12:45] 102 [20:12:53] most from oct 3 [20:14:50] I'm on ms-be1003 and 1004, both report around the same load [20:16:06] Can someone as root on tin please run rm -rf /a/common/php-1.22wmf17/extensions/Elastica [20:16:35] !log reedy updated /a/common to {{Gerrit|Ida4a0d980}}: Everything else to 1.23wmf5 [20:16:41] (03PS1) 10Reedy: Remofve 1.22wmf15 through 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99459 [20:16:52] Logged the message, Master [20:17:02] (03CR) 10Reedy: [C: 032] Remofve 1.22wmf15 through 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99459 (owner: 10Reedy) [20:17:04] atop thinks the ganglia cpu load graphs are a lie [20:17:05] hm [20:17:13] (03Merged) 10jenkins-bot: Remofve 1.22wmf15 through 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99459 (owner: 10Reedy) [20:17:32] Reedy: done [20:17:42] thanks [20:18:03] !log reedy synchronized docroot and w [20:18:18] Logged the message, Master [20:19:57] (03PS2) 10Dan-nl: Enable GWToolset on betacommons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98684 (owner: 10MarkTraceur) [20:20:41] RECOVERY - DPKG on labstore1001 is OK: All packages OK [20:20:51] RECOVERY - SSH on labstore1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:20:51] RECOVERY - Disk space on labstore1001 is OK: DISK OK [20:21:01] RECOVERY - RAID on labstore1001 is OK: OK: optimal, 60 logical, 60 physical [20:21:01] RECOVERY - puppet disabled on labstore1001 is OK: OK [20:21:13] apergos: strace of swift-storage on mw-be1008 has a lot of sendto(3, "<131>object-server STDOUT: Traceback (most recent call last): (txn: txeee8519afa454128ab2e2fe69235d950)\0", 104, 0, NULL, 0) = -1 ENOTCONN (Transport endpoint is not connected) [20:24:18] the object server? [20:25:10] yeah [20:27:36] i think that started because oom killer killed swift-proxy-server on ms-fe1004 [20:27:40] restarted object server on ms-be1005 to see if it makes a difference, I saw several process pegged at 100% [20:27:44] which started at 18:58 [20:28:08] and happened every few minutes after that until 19:18 [20:28:11] stupid that it can't recover, if that's what it is [20:29:11] doing swift-container too, there was one of those at 100% [20:29:46] don't remember what regular behavior is for the container server [20:29:53] but the object server should not be doing that [20:30:38] that was it [20:30:45] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Swift+eqiad&h=ms-be1005.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS [20:31:34] Reedy: so, with the proofreadpage stuff, is it reasonable to downgrade proofread extension to the one previous this major rewrite? [20:31:45] I'lll look at 1003 and 1007 and do the same if needed [20:32:06] Reedy: I'd prefer it to be, and let tpt work on it in betacluster [20:32:57] yep, objct server several process stuck there also [20:34:15] will do 1001 and 1006 but leave 1008 so there is a sample [20:35:26] uh [20:36:22] I previously wanted to say that 1100 GIFs do nothing, but if I wanted to try a DoS I'd just preview a page with a few thousands huge DjVy in 1px thumbs... refrained for WP:BEANS but it's what actually happened [20:36:26] ? 
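For the record, the stuck-object-server symptom above -- workers pegged at 100% CPU, strace showing syslog writes failing with ENOTCONN -- can be confirmed and cleared per backend roughly like this. swift-init is Swift's standard control tool; the host choices are just the ones from this incident:

    # How many object-server workers are running (the same check used above).
    ps aux | grep '[s]wift-object-server' | wc -l

    # Peek at what one busy worker is doing (network syscalls only).
    strace -f -e trace=network -p "$(pgrep -of swift-object-server)"

    # Bounce the object server on this backend, and the container server too
    # if one of its workers is also pegged; leave one host (ms-be1008 here)
    # untouched for later inspection.
    swift-init object-server restart
    swift-init container-server restart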
[20:39:14] !log reedy synchronized php-1.23wmf6 [20:39:30] Logged the message, Master [20:39:35] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:40:25] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [20:43:21] (03PS12) 10Ori.livneh: Add configuration for Wikimania Scholarships [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [20:44:27] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: phase1 wikis to 1.23wmf6 [20:44:43] Logged the message, Master [20:44:45] !log reedy updated /a/common to {{Gerrit|Ifae924950}}: Remofve 1.22wmf15 through 1.22wmf19 [20:44:49] (03PS1) 10Reedy: phase1 wikis to 1.23wmf6 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99464 [20:44:59] (03CR) 10Reedy: [C: 032] phase1 wikis to 1.23wmf6 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99464 (owner: 10Reedy) [20:45:01] Logged the message, Master [20:45:11] (03Merged) 10jenkins-bot: phase1 wikis to 1.23wmf6 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99464 (owner: 10Reedy) [20:45:12] !log over the last while, restarted swift-object-server on ms-be100* except for 1008, left that for poking at [20:45:27] Logged the message, Master [20:45:53] and I am strving, how did I not get/make or eat dinner? [20:46:36] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:37] I had to double take what you said then [20:46:51] In the #Cyanogenmod build environments they use stuff like make lunch, make dinner [20:46:55] awfully confusing [20:47:33] (03CR) 10Ori.livneh: [C: 032] Add configuration for Wikimania Scholarships [operations/puppet] - 10https://gerrit.wikimedia.org/r/98740 (owner: 10BryanDavis) [20:47:35] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [20:51:07] (03CR) 10Ori.livneh: [C: 032 V: 032] added groups::wikidev and accounts::bd808 to zirconium for scholarship app [operations/puppet] - 10https://gerrit.wikimedia.org/r/99466 (owner: 10Ori.livneh) [20:51:43] you'll have to request sudo, that's not up to me [20:52:04] ori-l: Will I need it? [20:52:18] bd808: not with that attitude [20:52:44] ori-l: My license plate says "sudo" that should give me rights everywhere [20:53:35] you can mention that in the RT ticket [20:53:37] (03PS3) 10Reedy: Fix Wikibase noc symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97903 (owner: 10Aude) [20:53:41] (03CR) 10Reedy: [C: 032] Fix Wikibase noc symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97903 (owner: 10Aude) [20:54:00] (03Merged) 10jenkins-bot: Fix Wikibase noc symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97903 (owner: 10Aude) [20:56:35] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:57:57] (03PS1) 10BryanDavis: Fix passwords::mysql::wikimania_scholarships include [operations/puppet] - 10https://gerrit.wikimedia.org/r/99469 [20:58:03] !log reedy synchronized docroot and w [20:58:08] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix passwords::mysql::wikimania_scholarships include [operations/puppet] - 10https://gerrit.wikimedia.org/r/99469 (owner: 10BryanDavis) [20:58:20] Logged the message, Master [20:58:35] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:03:35] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:35] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:06:00] Are the image scalers doing OK? [21:08:18] Looks like yes [21:19:23] !log stopping puppet on cp1046 to troubleshoot some ganglia stuff [21:19:38] Logged the message, Master [21:21:03] I haven't done cyanongenmod in a long tme [21:21:19] comes of having an old phone that doesn't support aything current (and no data plan either) [21:22:11] !log catrope synchronized php-1.23wmf5/extensions/VisualEditor/modules/oojs-ui/oojs-ui.js 'touch' [21:22:26] Logged the message, Master [21:23:09] foooood [21:25:15] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:05] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:31:20] (03PS1) 10Ottomata: Fixing version number on latest logster, source for JsonLogster had changed. [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99521 [21:31:37] (03CR) 10Ottomata: [C: 032 V: 032] Fixing version number on latest logster, source for JsonLogster had changed. [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99521 (owner: 10Ottomata) [21:39:53] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [21:50:12] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:53:12] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:54:46] https://www.mediawiki.org/wiki/Special:Watchlist?uselang=en is missing messages [21:54:48] (three in english, apparently only one in other languages) [22:02:23] (03CR) 10Aklapper: [C: 031] bugzilla module [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [22:07:10] (reedy is working on that now) [22:10:04] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:12:04] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:14:12] hume: Failed to add the RSA host key for IP address '2620:0:860:2:21d:9ff:fe33:f235' to the list of known hosts (/home/l10nupdate/.ssh/known_ho). [22:14:18] known_ho? [22:21:52] !log LocalisationUpdate completed (1.23wmf5) at Thu Dec 5 22:21:52 UTC 2013 [22:22:07] Logged the message, Master [22:22:25] slooooooooow [22:22:35] woot, was it running for all that time? [22:23:05] [22:06:48] I'm running localisation update [22:23:08] 17 minutes for 1 version [22:23:15] …lols. [22:23:16] Running updates for 1.23wmf5 (on aawikibooks) [22:23:16] 38 MediaWiki messages are updated [22:23:16] Updated 560 messages in total [22:23:16] Done [22:23:20] also, doesn't look fixed to me. 
:( [22:23:25] https://www.mediawiki.org/wiki/Special:Watchlist [22:23:29] mw.org isn't running 1.23wmf5 [22:23:33] this shouldn't be cached or anything [22:23:34] ah [22:23:57] but that was only happening on mw.org, no? [22:24:12] easier to just run it everywhere [22:24:20] heh, alright [22:28:17] Running updates for 1.23wmf6 (on mediawikiwiki) [22:28:18] 534478 MediaWiki messages are updated [22:28:18] Updated 1187824 messages in total [22:28:18] Done [22:28:18] All done in 370.33610486984 seconds [22:28:31] another 15 minutes or more to wait... [22:29:04] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:29:08] (03PS1) 10Ottomata: Fixing JsonLogster bug when keys contain '/' [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99537 [22:29:21] (03CR) 10Addshore: [C: 031] Add Item and Item_talk namespace aliases for Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99405 (owner: 10Aude) [22:29:30] (03PS2) 10Ottomata: Fixing JsonLogster bug when keys contain '/' [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99537 [22:29:37] (03CR) 10Ottomata: [C: 032 V: 032] Fixing JsonLogster bug when keys contain '/' [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99537 (owner: 10Ottomata) [22:29:40] Reedy, jmust rebuild it in 100 threads [22:30:04] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:33:01] (03PS1) 10Ottomata: Updating changelog for changes from master [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99538 [22:33:07] (03PS2) 10Ottomata: Updating changelog for changes from master [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99538 [22:33:20] (03CR) 10Ottomata: [C: 032 V: 032] Updating changelog for changes from master [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/99538 (owner: 10Ottomata) [22:40:36] paravoid: are all the mc servers on the same row? [22:47:09] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:48:09] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:49:42] (03PS1) 10Aaron Schulz: Tweaked $wgJobQueueAggregator to use redis job servers and have failover [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99543 [22:50:54] still going [22:52:04] * Reedy finds somewhere to die of boredom [22:52:05] ugh, fucking submodule [22:54:59] !log LocalisationUpdate completed (1.23wmf6) at Thu Dec 5 22:54:59 UTC 2013 [22:55:09] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:55:15] Logged the message, Master [22:55:28] Yay [22:55:58] MatmaRex: Fixed [22:55:59] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:57:08] MatmaRex: Doesn't look doubly parsed either.. 
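For context on the runs above: the update and the follow-up cache refresh are ordinary maintenance scripts, normally driven through mwscript on this setup. A rough sketch of the two pieces -- the wiki name is taken from the log output, and the exact wrapper the l10nupdate job uses is an assumption:

    # Fetch updated translations for one branch; per the output above a single
    # wiki per branch is enough (aawikibooks for 1.23wmf5).
    mwscript extensions/LocalisationUpdate/update.php --wiki=aawikibooks

    # Rebuild the localisation cache so the new messages are actually served.
    mwscript rebuildLocalisationCache.php --wiki=aawikibooks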
[23:00:28] (03PS2) 10Aaron Schulz: Tweaked $wgJobQueueAggregator to use redis job servers and have failover [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99543 [23:05:45] Reedy: well, there's nothing to escape in there normally [23:08:00] (03PS1) 10BBlack: get rid of init mutex - really isn't needed [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99548 [23:08:01] (03PS1) 10BBlack: Fix minor infrequent memleak (during vcl reload) [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99549 [23:08:02] (03PS1) 10BBlack: Fix big leak - 'struct addrinfo' leak on every .map() [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99550 [23:09:17] (03CR) 10BBlack: [C: 032 V: 032] get rid of init mutex - really isn't needed [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99548 (owner: 10BBlack) [23:09:45] (03CR) 10BBlack: [C: 032 V: 032] Fix minor infrequent memleak (during vcl reload) [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99549 (owner: 10BBlack) [23:10:19] (03CR) 10BBlack: [C: 032 V: 032] Fix big leak - 'struct addrinfo' leak on every .map() [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/99550 (owner: 10BBlack) [23:16:09] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:09] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:18:43] (03CR) 10Aaron Schulz: [C: 032] Tweaked $wgJobQueueAggregator to use redis job servers and have failover [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99543 (owner: 10Aaron Schulz) [23:19:48] (03Merged) 10jenkins-bot: Tweaked $wgJobQueueAggregator to use redis job servers and have failover [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99543 (owner: 10Aaron Schulz) [23:21:09] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:22:14] !log aaron synchronized wmf-config/jobqueue-pmtpa.php 'Tweaked $wgJobQueueAggregator to use redis job servers and have failover' [23:22:29] Logged the message, Master [23:22:47] !log aaron synchronized wmf-config/jobqueue-eqiad.php 'Tweaked $wgJobQueueAggregator to use redis job servers and have failover' [23:22:59] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:23:01] Logged the message, Master [23:28:00] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [23:29:37] (03CR) 10GWicke: "Ping ;)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/99251 (owner: 10GWicke) [23:33:36] (03PS1) 10BBlack: varnish (3.0.3plus~rc1-wm24) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/99554 [23:33:56] (03CR) 10BBlack: [C: 032 V: 032] varnish (3.0.3plus~rc1-wm24) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/99554 (owner: 10BBlack) [23:36:07] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:37:08] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:37:18] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Dec 5 23:37:18 UTC 2013 [23:37:33] Logged the message, Master [23:40:39] who can I bother about irc.wikimedia.org stuff, namely blocking broken noisy bots (~yahoo_age@anonymous.user) on the #pt.wikipedia channel? [23:40:39] see https://bugzilla.wikimedia.org/show_bug.cgi?id=54821 [23:43:08] PROBLEM - check_job_queue on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:45:07] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:50:31] someone broke central notice? [23:50:42] not it [23:50:44] 20 PHP Warning: Missing argument 3 for CentralNoticeCampaignLogPager::testBooleanChange(), called in /usr/local/apache/common-local/php-1.23wmf5/extensions/CentralNotice/CentralNoticeCampaig [23:50:46] nLogPager.php on line 251 and defined in /usr/local/apache/common-local/php-1.23wmf5/extensions/CentralNotice/CentralNoticeCampaignLogPager.php on line 309 [23:50:59] 3 of them = 60 [23:51:32] K4-713: mwalker: ^ ? [23:51:50] that should've been fixed... [23:52:04] mwalker, they weren't there about 10 min ago [23:52:08] We didn't push anything today, did we? [23:52:10] no [23:52:15] Didn't think so. [23:52:16] Hm. [23:52:20] and it just means that someone hit that form [23:52:23] I really don't care about it [23:52:26] or rather I do [23:52:48] but I care about other things than PHPs apparent inability to withold it's tendencies to touch it's childrens privates [23:53:02] at this exact moment [23:53:24] bblack: do you have the set of commands for adding a patch and rebuilding? [23:53:27] there are 80 of them now - line 251,250,249,247 [23:53:33] exactly 20 each [23:53:35] bblack: in your terminal buffer, i mean [23:53:37] kinda weird numbering [23:53:52] yurik: ya; it's building a paged list [23:53:56] with 20 entries on it [23:54:00] ori-l: adding a patch and rebuilding what? [23:54:24] bblack: a debian package [23:54:48] it really depends on the package I think, whether it uses gbp and how the branches are set up [23:55:06] this is my last varnish build today: [23:55:07] git buildpackage --git-debian-branch=testing/3.0.3plus-rc1 --git-upstream-branch=upstream-3.0.3plus-rc1 --git-upstream-tree=branch --git-export-dir=../build-area --git-no-create-orig -us -uc [23:55:27] (after committing the changes to debian/patches/) [23:55:34] thanks, that's useful [23:56:11] in that example, we're doing our debian/ work in branch testing/3.0.3plus-rc1, and that upstream branch should be identical other than the lack of debian/, comes from upstream [23:56:31] you probably won't need --git-no-create-orig, but I've needed to do that for this package for odd reasons
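To flesh out the answer to ori-l's question at the end: with a gbp-managed package like this varnish tree, "adding a patch and rebuilding" is usually a quilt patch dropped into debian/patches plus a changelog bump, followed by the build command bblack pasted. A sketch -- the patch file name and the next version number are made up:

    # On the packaging branch (testing/3.0.3plus-rc1 here):
    cp ~/fix-vcl-reload-leak.patch debian/patches/
    echo fix-vcl-reload-leak.patch >> debian/patches/series

    # New changelog entry with a bumped version.
    dch -v 3.0.3plus~rc1-wm25 "Fix memleak during vcl reload"

    git add debian && git commit -m "varnish (3.0.3plus~rc1-wm25) precise; urgency=low"

    # Then build exactly as quoted above:
    git buildpackage --git-debian-branch=testing/3.0.3plus-rc1 \
        --git-upstream-branch=upstream-3.0.3plus-rc1 --git-upstream-tree=branch \
        --git-export-dir=../build-area --git-no-create-orig -us -uc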