[00:38:18] New review: Ori.livneh; "Hashar, are you waiting on me to merge / deploy this? (I'd be happy to; just let me know.)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71777
[01:02:27] New review: Andrew Bogott; "I don't have time to implement this just now, but here's what I've just learned (which you maybe alr..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72721
[01:12:42] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1000.16394 (gt 1000)
[01:13:22] !log starting Parsoid config update with latest dependencies
[01:13:32] Logged the message, Master
[01:15:10] !log finished Parsoid config update with latest dependencies
[01:15:20] Logged the message, Master
[01:17:14] New review: GWicke; "Roan has such a purge script in his home directory." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72653
[01:18:35] New review: Catrope; "....which is a hack. That said, /usr/local/bin/purge-varnish is basically this but uses a different ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72653
[01:43:47] RECOVERY - Solr on vanadium is OK: All OK
[01:55:47] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1849.4286 (gt 1000)
[01:59:47] PROBLEM - Puppet freshness on db78 is CRITICAL: No successful Puppet run in the last 10 hours
[02:06:27] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:20] RECOVERY - Disk space on mc15 is OK: DISK OK
[02:15:49] !log LocalisationUpdate completed (1.22wmf9) at Wed Jul 10 02:15:40 UTC 2013
[02:16:00] Logged the message, Master
[02:21:44] !log on tin: attempted to do a git pull in wmf8 but it updated a whole lot of extensions for some reason. Checking out the old versions manually.
[02:21:54] Logged the message, Master
[02:29:29] !log LocalisationUpdate completed (1.22wmf8) at Wed Jul 10 02:29:28 UTC 2013
[02:29:39] Logged the message, Master
[02:36:07] !log tstarling synchronized php-1.22wmf8/includes/MappedIterator.php
[02:36:17] Logged the message, Master
[02:36:38] !log tstarling synchronized php-1.22wmf8/includes/job/JobQueue.php
[02:36:49] Logged the message, Master
[02:37:18] !log tstarling synchronized php-1.22wmf8/includes/job/JobQueueRedis.php
[02:37:28] Logged the message, Master
[02:38:07] !log tstarling synchronized php-1.22wmf8/maintenance/showJobs.php
[02:38:16] Logged the message, Master
[02:42:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 10 02:42:32 UTC 2013
[02:42:51] Logged the message, Master
[02:52:27] !log tstarling synchronized php-1.22wmf9/extensions/ProofreadPage/ProofreadPage.body.php
[02:52:37] Logged the message, Master
[02:58:00] !log tstarling Started syncing Wikimedia installation... : ProofreadPage update for bug 51085
[02:58:09] Logged the message, Master
[03:02:09] !log tstarling Finished syncing Wikimedia installation... : ProofreadPage update for bug 51085
[03:02:19] Logged the message, Master
[03:05:11] what happened?
[03:05:45] re: 4-minute scap
[03:08:48] that's about right isn't it?
[03:09:37] +/- 36 minutes
[03:10:52] maybe before the network-aware thing was set up
[03:12:16] on the day we had downtime, it was 8 minutes with lots of changes, and 4 with none
[03:13:01] I had to check a random apache to make sure you weren't trolling
[03:17:10] https://zh.wikipedia.org/w/index.php?title=%E9%80%A0%E5%8C%96%E8%A1%97%E9%81%93&action=edit
[03:17:53] it is Liangent, of course
[03:17:59] it takes 8 seconds to render
[03:18:58] and he recently updated all the data templates, that is why zhwiki has 25k jobs in its queue now
[03:19:45] $ perl ~/job-stats.pl runJobs.log
[03:19:45] count   time (s)  DB
[03:19:45] 768592  1245.641  enwiki
[03:19:45] 366038  834.929   zhwiki
[03:19:45] 121166  512.333   ocwiki
[03:19:46] 285373  471.334   commonswiki
[03:19:48] 661429  392.787   enwiktionary
[03:19:50] 983006  332.544   frwiktionary
[03:21:24] it's easy to tell which wikis have the craziest wikitext template programmers, isn't it?
[03:23:52] You could still nab a spot in this month's metrics meeting if you put that on a colorful chart
[03:24:08] also, $('a[onclick]').length -> 444
[03:26:47] OK, I'm going to head off. I confirmed the ProofreadPage fatal went away.
[03:26:53] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:53] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:53] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:53] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:53] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:54] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:54] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:55] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[03:27:03] Thanks for reviewing / deploying.
[03:27:10] np
[04:01:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:02:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[04:16:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:17:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[04:27:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:29:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.905 second response time
[05:11:00] The metrics joke was funny.
[05:11:02] ori-l: ^
[07:13:59] PROBLEM - RAID on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:14:49] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[07:41:52] good morning :-)
[07:51:00] gerrit-wm: why you silent?
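Tim's job-stats.pl itself isn't shown in the log, only its output above; a minimal Perl sketch of that kind of aggregator might look like the following. The assumed runJobs.log line format (a "<db>:" token and a "t=<milliseconds>" duration) is a guess for illustration, not the real format.

```perl
#!/usr/bin/perl
# Hypothetical stand-in for ~/job-stats.pl: tally job count and total
# runtime per wiki from runJobs.log. The input format assumed here --
# a "<db>:" token and a "t=<ms>" field per line -- is a guess.
use strict;
use warnings;

my (%count, %time);
while (<>) {
    next unless /\b([a-z_0-9]+(?:wiki|wiktionary)):.*\bt=(\d+)/;
    $count{$1}++;
    $time{$1} += $2 / 1000;    # ms -> seconds
}
printf "%-10s %-10s %s\n", 'count', 'time (s)', 'DB';
printf "%-10d %-10.3f %s\n", $count{$_}, $time{$_}, $_
    for sort { $time{$b} <=> $time{$a} } keys %time;
```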
[07:55:24] I guess the poor ircecho might need to be restarted
[07:57:11] New patchset: Nemo bis; "(bug 15434) Periodical run of currently disabled special pages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713
[07:57:11] hm, perhaps it just doesn't like abandonment
[07:57:17] ah here it is
[08:10:23] New patchset: Hashar; "beta: upload cache rebuild to use varnish" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72900
[08:11:13] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72900
[08:14:03] apergos: mark: I have migrated the beta upload cache to point to the varnish instance \O/ The basic functionality seems to be working
[08:14:14] ok great
[08:14:24] though https does not hehe https://en.wikipedia.beta.wmflabs.org/wiki/File_talk:Polar_bear.jpeg
[08:14:45] nice
[08:15:45] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[08:16:15] seems to be because of the untrusted certificate on bits, accepting it fixes the css
[08:16:22] need to enable nginx on upload now
[08:16:45] there was an ssl terminator for upload before everything got re-arranged
[08:17:27] I forgot to reapply role::protoproxy::ssl::beta
[08:17:45] yay can't log into deployment-cache-upload03, gluster home I bet, and it can't create the home dir
[08:18:01] can we just make all the instances not use gluster home? it's getting old
[08:18:29] ahh
[08:18:40] apergos: I am going to get upload03 shut down
[08:18:56] ok well before it goes away entirely
[08:19:00] also yesterday a patch got merged that applies the role::labsnfs class on all of the deployment-prep instances
[08:19:14] the class is included in base.pp to make sure everything uses NFS for /home
[08:19:53] yes I saw that but apparently
[08:19:57] it's not actually applied everywhere
[08:20:04] or the old mount point is still in use instead
[08:20:34] yup the instances need to be rebooted apparently
[08:20:49] since /home has two automount snippets. I will get them all rebooted
[08:21:20] that would be awesome
[08:23:40] so what we had was on deployment-cache-upload03 there was nginx with the cert chain and key from the deployment-squid nginx instance
[08:24:19] using the "wikimedia" conf file on deployment-squid and calling it 'upload', with a tiny amount of editing
[08:24:23] this is what made https work
[08:24:35] 2013/07/10 08:19:03 [emerg] 10569#0: SSL_CTX_use_certificate_chain_file("/etc/ssl/certs/star.wmflabs.org.chained.pem") failed (SSL: error:02001002:system library:fopen:No such file or directory error:20074002:BIO routines:FILE_CTRL:system lib error:140DC002:SSL routines:SSL_CTX_use_certificate_chain_file:system lib)
[08:24:39] that is on the cache-upload04
[08:24:51] see above ^^
[08:26:12] so now you can repeat this by stealing the cert stuff and the 'upload' conf from deployment-cache-upload03 and putting it on deployment-cache-upload04 (with probably again a tiny bit of editing)
[08:26:15] :-P
[08:26:44] can't we get that puppetized ?
[08:27:19] yes, that's the plan
[08:27:25] seems the role::protoproxy::ssl::beta might need to include the cert
[08:28:22] role::protoproxy::ssl::beta::common has it
[08:28:30] so...?
[08:29:07] missing the /etc/ssl/certs/star.wmflabs.org.chained.pem
[08:31:25] install_certificate should do that, no?
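The SSL_CTX_use_certificate_chain_file error pasted above is nginx failing to open the chained cert at startup. A quick hedged way to check for that condition (ordinary coreutils/openssl invocations; the path is the one from the error message):

```
# does the file the error complains about actually exist yet?
ls -l /etc/ssl/certs/star.wmflabs.org.chained.pem
# and is it a parseable certificate?
openssl x509 -noout -subject -enddate -in /etc/ssl/certs/star.wmflabs.org.chained.pem
```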
[08:31:38] I restarted nginx, that fixed it
[08:31:47] probably the cert got installed after nginx
[08:31:51] meh
[08:31:59] or some nfs cache
[08:32:03] whatever, it works :-]
[08:32:05] anchor, included class, floating resources (maybe)
[08:32:08] ok
[08:32:41] I see the little bear thumbs now
[08:33:20] with all certs approved https://en.wikipedia.beta.wmflabs.org/wiki/File_talk:Polar_bear.jpeg :-]
[08:33:33] yes that's where I am
[08:33:43] they're very cute at that size :-)
[08:35:09] what do you think about rebooting aggregator1?
[08:35:19] I have no idea what it is for
[08:35:23] I can't get on that box either (can't create home dir)... same old thing
[08:35:26] ganglia and icinga
[08:35:36] ah
[08:35:40] in order to try to resolve the ganglia bug
[08:35:57] go ahead :-]
[08:36:21] did so
[08:36:54] for folks who don't want to live in the labs channel, can I log from somewhere else?
[08:37:03] RECOVERY - Solr on vanadium is OK: All OK
[08:38:44] apergos you have a moment for a labs related question?
[08:38:45] http://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Current_event_needs_photography
[08:39:14] does that sound like something hostable by labs?
[08:39:25] or is it not in the scope?
[08:40:03] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1000.4558 (gt 1000)
[08:40:08] really no idea what bots are ok
[08:40:15] I'm not at all involved with that project
[08:40:21] ah
[08:40:28] ToAruShiroiNeko: there is google news for that isn't it ? :-]
[08:40:30] I think of you as a know-all of everything :p
[08:40:42] hashar the idea is getting the message to the people
[08:40:46] twitter is there too
[08:41:09] ToAruShiroiNeko: you can ask petan in #wikimedia-labs , but to me that seems to be overlapping with existing tools such as mail notifications in google news, or twitter or whatever.
[08:41:10] if a plane crashes near wikipedians and they hear about it in evening news its an opportunity lost
[08:41:24] sure I can do that
[08:41:36] hashar point is notifying people who aren't news buffs that would use such tools
[08:42:31] Creating directory '/home/ariel'.
[08:42:31] Unable to create and initialize directory '/home/ariel'.
[08:42:35] same old thing from aggregator1
[08:42:59] I don't know how to tell if it really has the right classes for /home cause .. I can't get on there to look at it
[08:46:33] apergos: I am wondering if you get access to the instance using root ssh
[08:48:03] RECOVERY - Solr on vanadium is OK: All OK
[09:04:24] I got it fixed after restarting nginx [09:04:30] maybe it started up before the cert got generated [09:05:02] so is varnish going to do the SSL terminaison ? [09:05:03] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1000.7904 (gt 1000) [09:05:17] no, varnish can't [09:05:22] but we're gonna put nginx on the varnish boxes [09:05:26] (or something else which does similar) [09:05:31] so pretty much like you have done [09:07:08] yup apergos found that architecture hack [09:07:26] architecture hack? [09:08:47] it's been planned for production too for a while [09:09:06] I thought about using iptables on each cache to redirect the traffic to the nginx ssl cache, but ariel went with a clever idea which was to get nginx directly on each cache instance [09:09:08] great! [09:10:17] I am pretty out of the loop on the varnish stuff n prod [09:10:30] I know you guys talk about it a lot but there's already a lot to keep track of, so... [09:11:52] hashar: unless you have any other thoughts about how I can get access to the aggregator instance or get the nfs home mount going on there, I'm going to give up on that bug for now, I commented on the bug and ryan will see it anyways [09:12:04] aggregator instance? [09:12:07] labs [09:12:14] apergos: I have no clue [09:13:25] ok [09:14:00] how about the aft thing? https://bugzilla.wikimedia.org/show_bug.cgi?id=50623 logged in works, lgged out still fails... [09:14:08] (if you are doing other stuff, tell me and I'll not bug you) [09:15:56] !log Inserted varnish 3.0.3plus-rc1-wm12 packages into the precise-wikimedia APT repository [09:16:07] Logged the message, Master [09:17:13] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm12) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/72903 [09:20:59] New patchset: Mark Bergsma; "Allow persistent connections with vcl_error" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/72904 [09:20:59] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm12) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/72905 [09:21:53] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/72904 [09:22:08] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/72905 [09:45:23] New patchset: Mark Bergsma; "Allow persistent connections for HTTP PURGE (error) responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72530 [09:45:23] New patchset: Mark Bergsma; "Maintain persistent connections for geoip, redirects, 204 responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72910 [09:47:24] !log Upgrading Varnish on bits servers [09:47:34] Logged the message, Master [09:53:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:54:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [10:00:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:03:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [10:21:26] went back home :-D [10:21:26] and recovered internet access in the process [10:35:22] Change merged: Mark Bergsma; [operations/puppet] (production) - 
[10:38:57] RECOVERY - Solr on vanadium is OK: All OK
[10:44:48] nice change
[10:53:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:54:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[11:01:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:02:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[11:10:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:11:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.600 second response time
[11:26:24] New patchset: Mark Bergsma; "Allow persistent connections for HTTP PURGE (error) responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72530
[11:26:24] New patchset: Mark Bergsma; "Maintain persistent connections when serving mobile redirects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72929
[11:32:59] !log Upgraded and restarted eqiad mobile caches (front/back)
[11:33:09] Logged the message, Master
[11:36:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72929
[11:56:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:57:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[12:00:20] PROBLEM - Puppet freshness on db78 is CRITICAL: No successful Puppet run in the last 10 hours
[12:11:51] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1005.5813 (gt 1000)
[12:21:49] New patchset: Ottomata; "Installing nrpe::monitor_service for Kafka producer processes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72934
[12:22:59] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72934
[12:31:51] RECOVERY - Solr on vanadium is OK: All OK
[12:37:52] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1000.09357 (gt 1000)
[12:38:52] RECOVERY - Solr on vanadium is OK: All OK
[12:44:05] New patchset: Ottomata; "Adding support for tmax, dmax and sendMetadata." [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72935
[12:46:59] New patchset: coren; "Labs: make autofs forcibly use nfs4 for labsnfs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72936
[12:47:05] hashar: ^^
[12:49:34] New review: coren; "Seems okay to me to add both." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/69289
[12:50:47] https://gerrit.wikimedia.org/r/72936 affects stuff outside of tool labs so I'd rather not self +2. Anyone kind enough to sanity check me?
[12:51:02] Coren: nice :-)
[12:52:36] * Coren needs breakfast and coffee.
[12:52:56] hashar: As soon as this goes into puppet and gets applied to your lucid instances, the problem "should" be fixed.
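Coren's "make autofs forcibly use nfs4 for labsnfs" change above amounts to pinning the filesystem type in the automount map. A hedged sketch of what such map entries look like (the server name and export path here are invented; the -fstype=nfs4 pinning is the point):

```
# /etc/auto.master: hand /home to an indirect map
/home  /etc/auto.home  --timeout=300

# /etc/auto.home: force NFSv4 for every home directory;
# "&" expands to the looked-up key (the username)
*  -fstype=nfs4,rw,hard,intr  labsnfs.example.wmflabs:/exp/home/&
```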
[12:58:31] New review: coren; "(that was intended to be -1)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/67055
[13:10:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:11:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[13:12:20] New patchset: Ottomata; "Adding support for more per-attribute Ganglia settings. See: https://github.com/jmxtrans/jmxtrans/wiki/GangliaWriter" [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72935
[13:13:40] New patchset: Ottomata; "Adding support for more per-attribute Ganglia settings. See: https://github.com/jmxtrans/jmxtrans/wiki/GangliaWriter" [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72935
[13:15:38] New patchset: Ottomata; "Adding support for more per-attribute Ganglia settings. See: https://github.com/jmxtrans/jmxtrans/wiki/GangliaWriter" [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72935
[13:16:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:17:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[13:21:27] New review: Hashar; "I am not sure how to test it :/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72936
[13:24:05] apergos: I guess we could give that patch a try on a precise instance
[13:26:55] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:55] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:55] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:55] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:55] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:56] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:56] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:57] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:44:06] New patchset: Ottomata; "Puppetizing jmxtrans for analytics udp2log kafka producers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72943
[13:53:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:54:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time
[14:02:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[14:06:49] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:14:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[15:18:27] New patchset: Ottomata; "Fixing dmax on jmxtrans" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72965
[15:18:37] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72965
[15:21:10] ottomata: out of pure curiosity, what's your plan with jmxtrans?
[15:23:27] right now, just puppetizing some stuff that is already there
[15:23:52] but what's the purpose?
[15:23:57] pretty ganglia graphs for packet loss?
[15:23:58] New review: Andrew Bogott; "Note also that we do probably want this function in a standalone script so that we can eventually in..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72721
[15:24:36] graphs and monitoring alerts, i'm using it for kafka
[15:24:47] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=kafka
[15:25:39] drdee: btw, I fixed up https://gerrit.wikimedia.org/r/#/c/68711/ see ps13 & my comments
[15:25:49] ottomata too, obviously :)
[15:26:57] ottomata: and that's for the udp2log->kafka thing you have?
[15:27:05] yes
[15:27:20] that and the kafka brokers, but right now i'm focusing on the udp2log -> kafka thing
[15:27:33] for the last week or so, the udp2log -> kafka piece has been very very fragile
[15:27:37] i'm not sure why yet
[15:27:41] is jmxtrans a thing the producers need to do or can it be done on brokers too?
[15:27:56] brokers are fine
[15:27:59] the question is, if it'll work when we switch to varnishkafka :)
[15:28:01] jmxtrans works the same for any jvm exposing jmx stats
[15:28:07] no, because that won't be a jvm
[15:28:13] yes exactly
[15:28:14] New review: Andrew Bogott; "Oh, my mistake, it is standalone :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72721
[15:28:19] so we'd need to do it on the java brokers
[15:28:26] yes, i have it there too, not puppetized yet
[15:28:33] right
[15:28:34] that's where most of those stats in that view is coming from
[15:28:38] but it's possible, okay
[15:28:49] yup, but they are slightly different stats
[15:29:09] its nice to have stats directly from producers
[15:29:28] right
[15:29:28] how do varnish stats currently get into ganglia?
[15:30:16] with a ganglia python module
[15:30:18] a ganglia plugin that calls varnishstat iirc
[15:30:21] but it doesn't work that well
[15:30:36] the concept is nice though, the implementation not so much
[15:31:19] and vhtcpd writes a json with stats on /tmp
[15:31:43] New patchset: Ottomata; "Upcasing slope value, Ganglia output writer expects this to be all caps." [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72969
[15:32:10] is it possible to expose extra metrics in varnishkafka?
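For context on the jmxtrans settings being shuffled in the patches above (tmax, dmax, and the all-caps slope the Ganglia writer insists on): a jmxtrans job is a JSON document pairing JMX queries with output writers. This is a rough sketch only; the JMX bean, attribute, and Ganglia group/host/port are invented for illustration.

```json
{
  "servers": [{
    "host": "analytics1001.example.wmnet",
    "port": "9999",
    "queries": [{
      "obj": "kafka:type=kafka.SocketServerStats",
      "attr": ["ProduceRequestsPerSecond"],
      "outputWriters": [{
        "@class": "com.googlecode.jmxtrans.model.output.GangliaWriter",
        "settings": {
          "groupName": "kafka",
          "host": "239.192.1.32",
          "port": 8649,
          "slope": "BOTH",
          "tmax": 60,
          "dmax": 300
        }
      }]
    }]
  }]
}
```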
[15:32:15] Change merged: Ottomata; [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72969
[15:32:34] not via varnishstat afaik, you'd need an entirely different infrastructure for that
[15:32:46] hm k
[15:33:05] implementing something like a statsd or ganglia client in there shouldn't be too difficult though
[15:33:09] New patchset: Ottomata; "Updating jmxtrans submodule" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72970
[15:33:26] ja would be real useful
[15:34:46] New patchset: Ottomata; "Updating jmxtrans submodule" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72970
[15:34:56] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72970
[15:35:01] I think the statsd protocol is dead simple
[15:35:39] it's something like a simple string of concatenated values sent over udp
[15:35:45] hey, while you are both here, should I modify the post-merge hook on sockpuppet so that it also runs git submodule update --init on stafford?
[15:36:02] paravoid: yup saw it, thanks so much! does this mean it can be merged and is now done?
[15:38:27] drdee: as I mentioned there, can we switch to upstream's JNI?
[15:38:41] I'm not sure of the specifics there but it looks like a better way forward?
[15:39:01] plus I *really* don't want to be messing with autoconf
[15:39:15] oh ah I said autoconf and bblack joined
[15:39:29] don't have any objection, they basically took our JNI implementation as an example
[15:39:32] freenode is screwing with me on nickserv / sasl again :P
[15:39:34] and reimplemented it
[15:40:15] ok, but if that's the case let's just use upstream's then
[15:40:22] paravoid: nuke it from orbit, it's the only way to free yourself from autotools
[15:40:59] bblack: since you're here
[15:41:10] bblack: we should probably go with 1.8.3 for our setup rather than 1.9.0 for now, right?
[15:41:47] paravoid: were you using edns-client-subnet already with powerdns or something? as in, your NS IPs already whitelisted with opendns/google?
[15:41:52] no
[15:41:58] New patchset: Cmjohnson; "decommissioning and removing spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72972
[15:42:06] then I guess it doesn't matter, much
[15:42:28] it didn't support it; I added some support for it at some point but it's non-trivial to support it properly using the existing code
[15:42:34] it's actually how I ended up finding gdnsd btw :)
[15:42:43] we probably do want to turn that on pretty soon though, and then 1.9.0 becomes necessary, since they're phasing out the old option code in a month
[15:42:47] yeah
[15:42:56] we will, but as I was saying at the meeting
[15:43:04] I'd like to avoid shifting too much traffic with the DNS update
[15:43:10] get it out the door first, yeah
[15:43:13] right
[15:43:27] it's also dangerous, esams doesn't get all of the traffic that it should
[15:43:41] dangerous sounds fun
[15:43:41] and messing too much with it can result in a traffic surge to esams
[15:43:48] but yeah
[15:43:58] New patchset: Ottomata; "Selecting proper metric in Ganglia kafka view" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72973
[15:44:02] like shifting India, Africa etc. to esams which now go to eqiad :)
[15:44:25] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72973
[15:44:28] did you have a chance to see new-ns0?
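paravoid's description of statsd above is accurate: the wire format is literally one "name:value|type" string per metric, fired over UDP and forgotten. A sketch in a few lines of Perl (the metric names and collector address are made up):

```perl
# Sketch of the statsd wire protocol: "name:value|type" datagrams over
# UDP. |c is a counter, |g a gauge, |ms a timer. No reply, no handshake.
use IO::Socket::INET;

my $statsd = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',   # collector address: invented
    PeerPort => 8125,          # conventional statsd port
    Proto    => 'udp',
) or die "socket: $!";

$statsd->send('varnishkafka.msgs_sent:1|c');      # counter increment
$statsd->send('varnishkafka.queue_depth:42|g');   # gauge
```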
[15:44:48] last time I tried my ssh login there kept timing out
[15:44:55] still seems to be
[15:45:43] how are you trying to login?
[15:45:52] (works for me)
[15:45:59] maybe you're not going via labs' bastion?
[15:46:06] I have ssh set up to go through bastion and forward a key, which I use for my labs varnish-work host in the same spot
[15:46:26] I get far enough to see "If you are having access problems, please see: https://wikitech.wikimedia.org/wiki/Access#Accessing_public_and_private_instances" - so I think the problem is post-ssh, something to do with home directories or login scripts or whatever
[15:46:49] hm maybe you don't have access to the project
[15:47:19] oh, yeah, I bet so. I think I know where to fix that too
[15:47:32] what's your labs username?
[15:48:00] bblack
[15:48:01] got it
[15:48:20] done
[15:52:06] paravoid: just out of curiosity, "include_optional_ns = true" - is this just in the name of "change as little as possible about our responses", or is there some good reason for it I'm unaware of?
[15:52:20] the former
[15:52:52] ok
[15:53:08] it's not set in stone though, if you disagree I can change it
[15:53:40] that's some serious templating :)
[15:53:59] heh not so much yet
[15:54:00] I don't care much, it's really just a packet-size optimization. can always play with it later once other risks are mitigated :)
[15:54:08] I have more coming up
[15:54:18] using jinja's extend syntax
[15:54:24] to have a common template for our zones
[15:54:40] most of our zones are very similar
[15:57:09] notpeter: around?
[15:57:19] how does it look?
[15:57:57] paravoid: can you look at this for me. https://gerrit.wikimedia.org/r/#/c/72972/
[15:58:01] plz
[15:59:13] paravoid: looks pretty good. if it's not a PITA, I would do explicit listen and http_listen options, assuming the interface IPs are easily templated in puppet. to avoid a pointless thread listening on localhost
[15:59:31] New review: Faidon; ""git grep spence" reveals a few more occurences." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/72972
[15:59:57] bblack: so, that's actually a problem I've been banging my head at
[16:00:00] it's not just localhost
[16:00:04] we have separate service IPs for DNS
[16:00:35] so e.g. right now we have dobson as ns0, but there's no DNS being served at dobson's IP
[16:00:51] is the IP failed around via something like heartbeat?
[16:01:02] no
[16:01:09] it just makes it easier to move the service
[16:01:12] ok
[16:01:20] I think that was the purpose at least, I wasn't around when that happened
[16:01:25] but it certainly makes my life easier now :)
[16:01:26] so what's the problem?
[16:01:43] the problem is that I envisioned having a different git tree with templates & gdnsd.conf
[16:02:01] not puppet
[16:02:01] ah!
[16:02:27] have puppet stuff necessary factoids into some file like /etc/gdnsd/facts, and have the separate git+template system suck them from there?
[16:02:55] I wanted to avoid having gdnsd.conf also be templated for simplicity
[16:03:18] but yeah, it's possible...
[16:04:37] hm, maybe an include conf directive would be useful here
[16:04:45] :P
[16:04:47] :)
[16:05:06] I just thought of that, it wasn't some elaborate plan on asking you to implement that :)
[16:05:47] it's been mentioned before :) I'm really not a huge fan of the custom config language anymore though, I feel like that should all be refactored to use something standard that has cool features for repetition/templates/includes or whatever
[16:05:48] but yeah, maybe gdnsd.conf should be managed by puppet and including lb.conf or something
[16:05:52] like Lua
[16:06:17] fwiw, I like it as a config language
[16:06:19] and make the plugin config dynamic like the zonefiles would be nice, too. or really everything, if possible
[16:06:49] nod, that'd be nice indeed
[16:07:22] the catch there is really dynamic listen addrs and runtime create/destroy of new listen threads, etc. especially given permissions issues on sockets
[16:07:32] it's possible, it's just complicated
[16:07:43] well, I think having to restart for that is fine
[16:08:16] for dynamic plugin config the plugin API would need a lot of big changes too, but I've never claimed stability there, so :)
[16:08:33] hehe
[16:09:08] there's also a bad side-effect about listening on all interfaces right now
[16:09:15] dobson also serves as recursor0
[16:09:24] also listening on 53, on a different service IP
[16:09:24] yeah
[16:09:31] that's just not possible now
[16:09:44] bleh
[16:10:23] if we had config includes, I'd just template gdnsd.conf via puppet and have it include a separate file for the plugin config from the separate dns git
[16:10:32] yes, that was my idea
[16:10:44] gdnsd.conf be a puppet erb including gdnsd-lb.conf or something
[16:10:55] it wouldn't be hard to implement, I'm just stubborn about not adding features to things I think should go away in general
[16:11:00] hehe
[16:11:11] I guess I could just do cat gdnsd-head.conf gdnsd-lb.conf > gdnsd.conf too
[16:11:22] true
[16:13:04] yeah that sounds like a viable plan for now
[16:13:06] is someone attacking chanserv/nickserv or is this just random general issues?
[16:13:31] I think freenode has been suffering from DDoSes lately
[16:14:12] someone was thinking for us to give them a few labs instances to set up servers but I didn't like much that idea
[16:14:12] yep (as usual)
[16:16:35] yet another of the many woes that could be solved if people would implement: http://tools.ietf.org/html/rfc3013#section-4.3
[16:16:58] aka BCP 38
[16:17:13] I mean, you could still DDoS from actual bot armies, but cutting out forged reflections would be really nice for the internet
[16:21:30] what's the biggest organization that benefits from Freenode?
[16:21:40] odder: FSF?
[16:21:45] biggest in what sense?
[16:21:45] I think Canonical and Ubuntu have a few channels here?
[16:21:46] :P
[16:21:56] budget? employees?
[16:21:57] apergos: number of channels and users
[16:22:29] number of channels could be us, one per project per language (how many are actually registered, I dunno)
[16:22:31] New patchset: Cmjohnson; "decommissioning and removing spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72972
[16:22:47] http://freenode.net/acknowledgements.shtml we're not listed here
[16:31:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[16:34:13] no. it would be nice to figure out how we can help them without simply becoming a target
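Going back to the gdnsd thread: the "just cat the two files together" plan paravoid floats at 16:11 could be glued into puppet with something like the following. A sketch only; the resource names and /etc/gdnsd paths beyond the two file names mentioned in the log are invented.

```puppet
# Rebuild gdnsd.conf whenever the puppet-managed head changes;
# gdnsd-lb.conf would come from the separate dns git.
exec { 'assemble-gdnsd-conf':
    command     => '/bin/sh -c "cat /etc/gdnsd/gdnsd-head.conf /etc/gdnsd/gdnsd-lb.conf > /etc/gdnsd/gdnsd.conf"',
    refreshonly => true,
    subscribe   => File['/etc/gdnsd/gdnsd-head.conf'],
    notify      => Service['gdnsd'],
}
```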
[16:34:19] !log upgrading ruby and other packages on sockpuppet and stafford
[16:34:28] Logged the message, Master
[16:37:16] I'd say Canonical aside from us.
[16:43:23] New patchset: GWicke; "Increase Parsoid backend timeout to 5 minutes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72681
[16:43:39] mark: ^^
[16:44:24] you'll need to do frontend too I think?
[16:44:35] yes
[16:44:44] otherwise the client will close the connection and varnish still won't cache it
[16:44:57] ah, I was wondering about that case
[16:45:09] is there a way to force the backend to continue even if the client aborted?
[16:45:41] the VE currently has a timeout of 100s, so for some pathological pages it will time out before the backend is done
[16:45:42] New review: Mark Bergsma; "The frontend needs a higher timeout than the backend, as otherwise it will close the connection even..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/72681
[16:46:51] no that won't work
[16:48:38] mark: so a client disconnect from the frontend will always propagate to the backend?
[16:48:50] yes
[16:49:13] New patchset: GWicke; "Increase Parsoid backend timeout to 5 minutes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72681
[16:49:53] mark: well, at least we'll get to 100 seconds then
[16:50:47] New review: Mark Bergsma; "I think we'll regret this." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/72681
[16:50:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72681
[16:51:14] note that I restarted one of the two varnish caches today
[16:51:18] and it came up with an empty cache
[16:51:36] that is odd
[16:52:15] do you have suggestions on how to improve timeouts & varnish setup in general?
[16:53:00] Roan used ban to purge yesterday, not sure if that is related
[16:54:13] no it's a bug in the persistent storage stuff of varnish
[16:54:17] it's very experimental
[16:54:40] PROBLEM - DPKG on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:55:25] yoo ori-l
[16:55:27] you there?
[16:55:30] RECOVERY - DPKG on mc15 is OK: All packages OK
[16:57:02] mark: will the cache currently be wiped on restart in general?
[16:57:21] it's not supposed to, but because of the bug, very likely
[16:57:37] the majority of varnish processes I restarted today had one or two storage files not come back
[16:58:02] grr.. at least the traffic we are seeing so far is low enough that this does not matter too much
[16:58:18] paravoid: can you check again... i got everything and will remove certs next https://gerrit.wikimedia.org/r/72972
[16:58:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:59:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.160 second response time
[16:59:51] topic gitisapain, hahaha
[17:00:38] yeah... got that from robh
[17:01:14] New review: Faidon; "It does what it's supposed to do, +1 for that." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/72972
[17:03:40] New patchset: Matthias Mullie; "Use https in oversight request emails" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72980
[17:06:12] are you fucking kidding me
[17:06:28] varnish includes the mmap address in the storage signature
[17:06:38] whaa?!
[17:06:45] tries to mmap at the same address next time
[17:06:49] doesn't even check if that's the case
[17:06:55] and then fails on the signature check
[17:07:11] New review: Dzahn; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72972
[17:07:15] how does that even work!?
[17:07:35] presumably linux usually mmaps at the same address when given the choice the first time (target NULL)
[17:07:49] did you just answer "accidentally"?
[17:07:53] and can usually honour that next time, except when it an't? :)
[17:07:56] can't
[17:07:59] i think so
[17:08:04] i have to verify this, which is a bit annoying
[17:08:13] but from reading the code, that's what I get
[17:08:34] that's the signature check part that's failing, the mmapped address
[17:08:55] and it does really fish for that signature mmap address before doing the big mmap on the storage file
[17:09:20] fish how?
[17:09:32] /* Try to determine correct mmap address */
[17:09:32] i = read(sc->fd, &sgn, sizeof sgn);
[17:09:32] assert(i == sizeof sgn);
[17:09:32] if (!strcmp(sgn.ident, "SILO"))
[17:09:32] target = (void*)(uintptr_t)sgn.mapped;
[17:09:33] else
[17:09:35] target = NULL;
[17:09:42] and then
[17:09:43] sc->base = mmap(target, sc->mediasize, PROT_READ|PROT_WRITE,
[17:09:44] MAP_NOCORE | MAP_NOSYNC | MAP_SHARED, sc->fd, 0);
[17:09:45] jesus
[17:09:58] and then it creates a signature context from the signature at offset 0
[17:10:07] (offset 0 from sc->base I mean)
[17:10:23] and then if that signature's mmap doesn't match what it read IN the mmap'ed file, the signature check fails
[17:10:27] and it doesn't load the storage file
[17:10:33] i believe that's what's happening anyway
[17:10:53] so any random parameter change can cause this to suddenly trigger across your clusters ;)
[17:12:04] it's not so bad if it doesn't actually rely on that target address being correct
[17:12:46] wait
[17:12:47] which I haven't verified yet
[17:12:48] linux has ASLR
[17:12:57] that can't be, it'd always fail
[17:13:02] no no
[17:13:12] if you ASK mmap to map at that address
[17:13:19] then it can often honor that
[17:13:21] except if it can't
[17:13:30] so the first time, linux chooses, target == NULL
[17:13:36] ah, right
[17:13:38] next time, it reads the mmap address from the file, target is X
[17:13:40] the address *is* random
[17:13:43] it's just stored
[17:13:44] yes
[17:14:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72936
[17:14:49] i'll test this tomorrow
[17:15:16] so it's possible varnish *relies* on the address being still the same
[17:15:26] because every object ptr in there needs to live at the same address
[17:15:38] right
[17:15:45] this can't be accidental :)
[17:15:58] i don't know why you'd do this if not
[17:16:07] not just to check if a signature is correct anyway
[17:16:34] it's still baffling why this hasn't been a problem before
[17:16:40] well
[17:17:03] the larger your storage size, the more chance of hitting it
[17:17:08] New patchset: MaxSem; "Murderdeathkill bits special casing for mobile" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72983
[17:17:32] you mean the more likely is the kernel can't satisfy your mmap request?
[17:17:37] New review: MaxSem; "Next week." [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/72983
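The behaviour mark pieces together from the pasted code above comes down to mmap's address-hint semantics: without MAP_FIXED, the stored address is only a request the kernel may decline. A condensed C sketch of the failure mode (not Varnish's actual code; the function and names are made up):

```c
/* Illustration of the failure mode discussed above, not Varnish code.
 * The first run maps wherever the kernel likes and records that address
 * in the file; later runs pass it back as a hint, and if the kernel
 * can't honour it, the stored signature (which embeds the old address)
 * no longer matches and the silo is rejected. */
#include <sys/mman.h>
#include <stdint.h>

void *remap_silo(int fd, size_t size, uintptr_t stored_addr)
{
    void *base = mmap((void *)stored_addr, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);      /* a hint, not MAP_FIXED */
    if (base == MAP_FAILED)
        return NULL;
    if ((uintptr_t)base != stored_addr) {
        /* every pointer persisted inside the silo now dangles, so the
         * signature check fails and the cache comes up empty */
        munmap(base, size);
        return NULL;
    }
    return base;
}
```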
[17:19:48] PROBLEM - Memcached on mc15 is CRITICAL: Connection timed out
[17:20:02] !log db78 switched to frack puppetmaster
[17:20:11] Logged the message, Master
[17:20:12] !log grosley switched to frack puppetmaster
[17:20:21] Logged the message, Master
[17:20:38] RECOVERY - Memcached on mc15 is OK: TCP OK - 0.027 second response time on port 11211
[17:20:43] !log dist-upgrade and reboot aluminium, db78
[17:20:53] Logged the message, Master
[17:25:45] mark: that sounds awful
[17:26:15] yes
[17:26:23] hopefully they don't store addresses in the file
[17:26:28] they do
[17:26:36] ouch
[17:26:38] and to cheer you up, you just based your storage architecture on this code
[17:26:48] just to save the conversion between offsets and addresses?
[17:27:02] yup
[17:27:14] because it mmaps into memory and then references directly everywhere else in varnish
[17:27:15] that seems like a really dumb thing to do
[17:27:31] read https://www.varnish-cache.org/trac/wiki/ArchitectNotes just for fun
[17:27:55] should be easy to fix though
[17:28:05] mutant have all the services from spence been moved?
[17:28:07] not at all
[17:28:19] mutante ^
[17:28:23] well
[17:28:33] i guess in this case it is easy to fix
[17:28:33] instead of directly writing / reading a pointer, store an offset and add the base to it when reading
[17:28:43] yeah but it doesn't work like that
[17:28:53] it doesn't "read"
[17:29:00] well, dereference
[17:29:06] it CAN be fixed here, on reading the file yes
[17:29:24] but they're shooting themselves in the foot in multiple ways with this
[17:29:30] because they also can't relocate objects within the file
[17:29:58] that is true for offsets too
[17:30:17] offsets would be relative to the segment they're in
[17:30:18] you'll have to update everything that points to it, using direct pointer or offset
[17:30:23] and segments could then be relocated
[17:30:34] yeah, that could help a bit
[17:30:41] they do use segments
[17:30:44] they just don't help a lot ;)
[17:31:04] cmjohnson1: just been talking about that and made a last check, we can't find anything that would still be needed on it. ishmael stuff has been moved, and last thing was stat.wp which is now on stat1001
[17:31:26] mark: good thing we don't plan to use Varnish for this in the longer term
[17:31:47] if your "short term" is potentially 2 years as you stated, that's still problematic enough ;)
[17:32:10] cmjohnson1: go ahead, spence is already killed from DNS, so it can't really break things that worked before
[17:32:12] nah, I think that we'll get HTML storage this fiscal year
[17:32:23] HTML-only wikis otoh might take longer
[17:32:25] anyway
[17:32:28] mutante: okay
[17:32:35] that's good, but then we still have all our other varnish clusters ;)
[17:32:47] every single one of them is hitting this atm
[17:32:50] except bits
[17:32:51] New review: Dzahn; "made a last check. everything should have been moved. spence was out of wikimedia DNS already" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/72972
[17:32:52] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72972
[17:33:35] thx mutante
[17:33:40] mark: technically all it should take is to replace all direct pointer stores / dereferences with a macro that converts between offset and pointer
[17:34:00] granted, that might be tedious as there are probably many of those
[17:34:05] but not very hard
[17:34:10] mark: weren't you contemplating writing your own storage engine? :)
[17:34:14] i was
[17:34:22] varnish could also fix up all the pointers when reading the segments
[17:34:25] it would take a bit longer on startup
[17:34:34] every object needs to be touched...
[17:34:57] mark: that would be more of a hack
[17:35:03] New patchset: Ottomata; "Puppetizing hive client, server and metastore." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/71569
[17:35:05] you'd also know where all pointers live
[17:35:15] if you always store offsets you don't have that problem
[17:35:30] and adding an offset should be free these days
[17:38:34] New review: Dzahn; "git grep spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72972
[17:38:39] yeah, maybe that works, maybe that doesn't
[17:38:45] but that's not a very practical solution at this point ;)
[17:40:07] the practical solution is probably "don't restart varnish" ;(
[17:40:21] no that's not very practical either ;)
[17:40:25] anyway
[17:40:27] i've had enough for today
[17:41:58] I can imagine
[17:46:36] !log DNS update - killing spence
[17:46:46] Logged the message, Master
[17:47:22] New patchset: Ottomata; "Ensuring /var/lib/hadoop exists in labs kraken cdh4 puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72989
[17:47:29] New patchset: Ottomata; "Ensuring /var/lib/hadoop exists in labs kraken cdh4 puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72989
[17:47:46] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72989
[17:47:57] RIP spence - Host spence not found: 3(NXDOMAIN)
[17:48:55] New patchset: Andrew Bogott; "rake wrapper to run puppet module tests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72721
[17:50:20] PROBLEM - NTP on williams is CRITICAL: NTP CRITICAL: No response from NTP server
[17:50:21] PROBLEM - NTP on mw1123 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:10] PROBLEM - NTP on sq45 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:21] PROBLEM - NTP on es1005 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:30] PROBLEM - NTP on mw1127 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:31] PROBLEM - NTP on mw1065 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:31] PROBLEM - NTP on db1019 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:31] mutante, can you review https://gerrit.wikimedia.org/r/72734 please?
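The offset-based fix argued for above (and seconded by bblack a bit further down) is mechanically simple, even if touching every dereference in Varnish would be tedious. A C sketch of the macro pair, not actual Varnish code:

```c
/* Sketch of the store-offsets-not-pointers idea: persist everything as
 * an offset from the silo base and convert at the point of use, so the
 * silo survives being mapped at any address. */
#include <stdint.h>

typedef uint64_t silo_off_t;   /* offset relative to the silo base */

#define PTR2OFF(base, ptr) ((silo_off_t)((char *)(ptr) - (char *)(base)))
#define OFF2PTR(base, off) ((void *)((char *)(base) + (off)))

/* e.g. instead of persisting obj->next as a raw pointer:
 *     hdr->next_off = PTR2OFF(silo_base, obj->next);
 * and on access:
 *     struct object *next = OFF2PTR(silo_base, hdr->next_off);
 * "adding an offset should be free these days", as the log says. */
```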
[17:51:31] PROBLEM - NTP on mw1095 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:31] PROBLEM - NTP on mw1052 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:31] PROBLEM - NTP on db53 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:40] PROBLEM - NTP on mw1044 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:41] PROBLEM - NTP on mw1042 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:41] PROBLEM - NTP on es1002 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:41] PROBLEM - NTP on ms5 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:50] PROBLEM - NTP on mw1114 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:52:02] mutante: spence ntp service issue
[17:52:21] PROBLEM - NTP on mw1111 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:52:31] PROBLEM - NTP on mw1126 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:52:43] cmjohnson1: hrmm, is it.. hrmm
[17:53:00] PROBLEM - NTP on vanadium is CRITICAL: NTP CRITICAL: No response from NTP server
[17:53:30] PROBLEM - NTP on mw1147 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:53:40] PROBLEM - NTP on ms-fe3 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:53:47] poor spence
[17:54:00] PROBLEM - NTP on mw1081 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:03] did we setup an ntp server elsewhere as part of replacing spence?
[17:54:31] PROBLEM - NTP on mw1038 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:31] PROBLEM - NTP on mw1115 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:31] PROBLEM - NTP on mw1089 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:33] fwiw, I'm with gwicke on the above. the code should use offsets into whatever's mmapped. it's probably a large ugly patch at this point, though.
[17:54:40] PROBLEM - NTP on mw1057 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:40] PROBLEM - NTP on es6 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:40] PROBLEM - NTP on sq42 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:41] PROBLEM - NTP on stat1 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:41] PROBLEM - NTP on mw1056 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:41] PROBLEM - NTP on mw1140 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:55:21] PROBLEM - NTP on mw1032 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:55:21] PROBLEM - NTP on db39 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:55:31] PROBLEM - NTP on mw1106 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:55:31] PROBLEM - NTP on mw1087 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:55:40] PROBLEM - NTP on db34 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:56:21] PROBLEM - NTP on db1022 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:56:50] PROBLEM - NTP on mw1046 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:57:03] !log reverting removing spence, NTP server issue
[17:57:14] Logged the message, Master
[17:57:21] PROBLEM - NTP on mw1122 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:57:21] PROBLEM - NTP on mw1033 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:57:21] PROBLEM - NTP on fluorine is CRITICAL: NTP CRITICAL: No response from NTP server
[17:57:21] PROBLEM - NTP on mw1043 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:57:41] PROBLEM - NTP on mw1024 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:58:21] PROBLEM - NTP on mw1129 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:58:31] PROBLEM - NTP on mw1107 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:58:31] PROBLEM - NTP on mw1118 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:58:41] PROBLEM - NTP on mw1025 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:59:21] PROBLEM - NTP on mw1066 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:59:21] PROBLEM - NTP on mw1054 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:00:21] PROBLEM - NTP on mw1021 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:00:21] PROBLEM - NTP on mw1084 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:00:21] cmjohnson1: we need the change in ntp-server.erb reverted too
[18:00:22] on it
[18:00:27] meh, the netsplit doesnt help
[18:00:40] PROBLEM - NTP on mw1047 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:00:41] PROBLEM - NTP on mw1131 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:01:00] PROBLEM - NTP on es9 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:01:09] New patchset: Petr Onderka; "saving page metadata" [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/72993
[18:01:11] PROBLEM - NTP on mw1105 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:01:20] PROBLEM - NTP on mw1049 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:01:34] Change merged: Petr Onderka; [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/72993
[18:01:42] PROBLEM - NTP on mw1128 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:01:42] PROBLEM - NTP on sanger is CRITICAL: NTP CRITICAL: No response from NTP server
[18:04:44] New patchset: Dzahn; "revert removing spence from ntp-server.erb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72994
[18:05:50] PROBLEM - NTP on mw1022 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:21] PROBLEM - NTP on mw1134 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:21] PROBLEM - NTP on mw1139 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:38] New patchset: Andrew Bogott; "Remove all .fixtures.yml files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72996
[18:06:40] PROBLEM - NTP on db31 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:40] PROBLEM - NTP on db60 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:40] PROBLEM - NTP on mw1080 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:41] PROBLEM - NTP on mw1093 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72994
[18:07:21] PROBLEM - NTP on es1008 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:31] PROBLEM - NTP on mw1121 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:40] PROBLEM - NTP on mw1090 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:40] PROBLEM - NTP on mw1145 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:40] PROBLEM - NTP on mw1112 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:53] binasher, what's the eta for archive?
[18:07:53] mutante: we yanked the certs from spence
[18:08:04] Cyberpower678: what?
[18:08:10] PROBLEM - NTP on mw1059 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:08:10] PROBLEM - NTP on mw1026 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:08:21] PROBLEM - NTP on mw1141 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:08:27] binasher, when will the archive table be added to replication?
[18:08:33] New patchset: Ottomata; "Allowing configuration of namenode_hostname via $:hadoop_namenode global variable in labs role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72997
[18:08:51] New patchset: Ottomata; "Allowing configuration of namenode_hostname via $:hadoop_namenode global variable in labs role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72997
[18:09:11] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72997
[18:09:12] !log updated Parsoid to 8ea8ab0
[18:09:22] Logged the message, Master
[18:09:29] Cyberpower678: sorry, don't know
[18:09:37] I was told to ask you.
[18:09:46] Because you manage the labs dbs.
[18:09:59] Cyberpower678: i think you want Coren
[18:10:28] binasher: Actually, he's more interested in when bz 49189 will be done. :-)
[18:10:45] cmjohnson1: RobH was going to resign it, we need to also re-add to site.pp
[18:11:36] Coren: probably next month
[18:12:23] Cyberpower678: ^^
[18:12:32] Cyberpower678: So that means archive "after Wikimania"
[18:12:45] At the earliest.
[18:12:47] Robh: https://gerrit.wikimedia.org/r/#/c/72994/1/modules/ntp/templates/ntp-server.erb
[18:12:51] Coren, next month we'll archive? YES!
[18:12:57] Robh: i already reverted that, but that totally looks like it, right
[18:13:05] and i agree, dobson should be NTP
[18:14:46] mutante: you didn't fix site.pp yet did you?
[18:15:16] i havent resigned spence
[18:15:38] nad it went from no responnce to offset unknown
[18:15:43] so somethings goin on
[18:15:48] wow, typos.
[18:15:49] that's a good thing
[18:16:01] offset unknown is like .. ehm.. cough.. normal..for a little while
[18:16:10] it should fix itself
[18:16:26] so i think we are ok to kill and move on.
[18:16:32] New patchset: Ottomata; "Ensuring datanode_mounts are created in labs hadoop puppetization." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72999
[18:16:47] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72999
[18:16:52] what about this: https://gerrit.wikimedia.org/r/#/c/72972/2/manifests/misc/icinga.pp
[18:17:19] $nagios_config_dir = '/etc/nagios'
[18:17:22] shouldn't that come out?
[18:17:35] nah, neon has it too
[18:17:51] root@neon:/etc/nagios# ls
[18:17:52] nrpe.cfg nrpe.d
[18:18:09] yea but is that from the migration or for daily use?
[18:18:11] doesn't matter if it's there, just askin.
[18:18:12] it's used
[18:18:16] bleh.
[18:18:19] New patchset: Ottomata; "Fixing ensure" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73000
[18:18:24] ah i need to slow down a bit! typos!
[18:18:49] MaxSem: i'll get to that Apache change as soon as this spence issue is sorted out
[18:19:05] thanks
[18:19:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73000
[18:19:56] cmjohnson1: can you put it back in site.pp for now
[18:20:07] Robh: yes, it's running ntpd /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 104:111
[18:22:11] New patchset: Cmjohnson; "adding spence back to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73002
[18:24:15] !log nikerabbit synchronized php-1.22wmf9/extensions/UniversalLanguageSelector/ 'ULS perf fix'
[18:24:23] cmjohnson1: Robh https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=NTP
[18:24:25] Logged the message, Master
[18:24:39] a bit random?
[18:25:38] New review: GWicke; "It did fix the problem for http://en.wikipedia.org/wiki/List_of_Advanced_Dungeons_%26_Dragons_2nd_ed..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72681
[18:26:00] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73002
[18:26:02] command_line $USER1$/check_ntp_time -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$
[18:26:27] New patchset: Ottomata; "Don't want to ensure that mounts exist." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73003
[18:27:21] RECOVERY - NTP on es1008 is OK: NTP OK: Offset -0.002594947815 secs
[18:27:21] RECOVERY - NTP on mw1121 is OK: NTP OK: Offset -0.001339793205 secs
[18:27:26] mutante: back in site.pp and yes very random
[18:27:28] there
[18:27:30] RECOVERY - NTP on mw1145 is OK: NTP OK: Offset -0.005273580551 secs
[18:27:37] cmjohnson1: so, we didn't have to, it seems
[18:27:40] RECOVERY - NTP on mw1112 is OK: NTP OK: Offset -0.00389111042 secs
[18:27:43] just the one i reverted first
[18:27:52] mutante: that's fubar dude
[18:28:10] cmjohnson1: so adding back to the file for ntp allowed query servers fixed the offset error
[18:28:10] RECOVERY - NTP on mw1026 is OK: NTP OK: Offset -0.0005013942719 secs
[18:28:10] RECOVERY - NTP on mw1135 is OK: NTP OK: Offset 4.255771637e-05 secs
[18:28:11] but
[18:28:14] it shouldn't
[18:28:19] it shouldn't have anything to do with it.
[18:28:20] that makes no sense
[18:28:20] RECOVERY - NTP on mw1148 is OK: NTP OK: Offset 2.205371857e-05 secs
[18:28:20] RECOVERY - NTP on mw1141 is OK: NTP OK: Offset -0.00254368782 secs
[18:28:20] RECOVERY - NTP on mw1082 is OK: NTP OK: Offset -0.006798863411 secs
[18:28:25] New patchset: Ottomata; "Don't want to ensure that mounts exist." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73003
[18:28:25] indeed, it's fucking nuts.
[18:28:30] RECOVERY - NTP on mw1110 is OK: NTP OK: Offset -0.001755833626 secs
[18:28:31] RECOVERY - NTP on mw1090 is OK: NTP OK: Offset -0.00218307972 secs
[18:28:31] RECOVERY - NTP on sq49 is OK: NTP OK: Offset -0.001191616058 secs
[18:28:33] thereby, it's annoying the hell out of me.
[18:28:36] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73003
[18:28:40] RECOVERY - NTP on mw1073 is OK: NTP OK: Offset -0.004191994667 secs
[18:28:47] !log nikerabbit synchronized php-1.22wmf8/extensions/UniversalLanguageSelector/ 'ULS perf fix'
[18:28:49] heh, well https://gerrit.wikimedia.org/r/#/c/72994/1/modules/ntp/templates/ntp-server.erb
[18:28:56] Logged the message, Master
[18:29:00] RECOVERY - NTP on mw1108 is OK: NTP OK: Offset -0.0005559921265 secs
[18:29:20] RECOVERY - NTP on mw1102 is OK: NTP OK: Offset -0.002832889557 secs
[18:29:31] RECOVERY - NTP on mw1119 is OK: NTP OK: Offset -0.001158237457 secs
[18:29:31] RECOVERY - NTP on sq36 is OK: NTP OK: Offset -0.001199364662 secs
[18:29:31] RECOVERY - NTP on mw1019 is OK: NTP OK: Offset -0.0005331039429 secs
[18:29:59] so i yanked out spence manually on mw1059
[18:30:09] going to see if it goes back into ntp error
[18:30:20] RECOVERY - NTP on mw1039 is OK: NTP OK: Offset -0.001277327538 secs
[18:30:30] RECOVERY - NTP on ms-fe4 is OK: NTP OK: Offset -0.0003541707993 secs
[18:31:20] RECOVERY - NTP on nfs2 is OK: NTP OK: Offset -0.01575374603 secs
[18:31:30] RECOVERY - NTP on hooper is OK: NTP OK: Offset -0.003059267998 secs
[18:31:31] RoanKattouw: reminds me of http://knowyourmeme.com/memes/you-cant-cut-back-on-funding-you-will-regret-this
[18:32:20] RECOVERY - NTP on mw1036 is OK: NTP OK: Offset -0.004727602005 secs
[18:32:30] RECOVERY - NTP on mw1076 is OK: NTP OK: Offset -1.668930054e-05 secs
[18:32:30] RECOVERY - NTP on db60 is OK: NTP OK: Offset 0.0003219842911 secs
[18:32:40] RECOVERY - NTP on mw1079 is OK: NTP OK: Offset -0.0004614591599 secs
[18:32:40] RECOVERY - NTP on mw1109 is OK: NTP OK: Offset -0.001217603683 secs
[18:33:40] RECOVERY - NTP on mw1133 is OK: NTP OK: Offset -0.002179145813 secs
[18:33:40] RECOVERY - NTP on nfs1 is OK: NTP OK: Offset 0.0003098249435 secs
[18:33:47] I only played sim city once. I remember I funded everything reasonably except for one thing: the cops....
[18:33:55] game rules are skewed :-P
[18:34:00] RECOVERY - NTP on mw1098 is OK: NTP OK: Offset 0.000586271286 secs
[18:34:20] RECOVERY - NTP on mw1034 is OK: NTP OK: Offset 0.001031041145 secs
[18:34:23] apergos had high crime rates in his sim city
[18:34:30] RECOVERY - NTP on mw1077 is OK: NTP OK: Offset 6.890296936e-05 secs
[18:34:30] RECOVERY - NTP on db49 is OK: NTP OK: Offset 0.001464128494 secs
[18:34:37] I did
[18:35:14] ottomata: 6 or later?
[18:35:20] RECOVERY - NTP on ms-fe1 is OK: NTP OK: Offset 0.0007516145706 secs
[18:35:22] speaking of nothing, what does it look like for ms-be5 today?
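The check_ntp_time command line quoted earlier is a standard Nagios plugin; the "No response from NTP server" criticals and the Offset-based recoveries above are its exit states seen from the bot. As a rough illustration of what it measures — a sketch only, using the third-party ntplib package, with placeholder thresholds rather than the production -w/-c values:

```python
#!/usr/bin/env python
# Rough equivalent of "check_ntp_time -H <host> -w <warn> -c <crit>" using
# the third-party ntplib package (pip install ntplib). The thresholds and
# the default host below are placeholders, not the production values.
import sys

import ntplib

WARN_SECS = 0.5   # would come from -w in the real plugin
CRIT_SECS = 1.0   # would come from -c in the real plugin

def check_offset(host):
    try:
        stats = ntplib.NTPClient().request(host, version=3, timeout=10)
    except (ntplib.NTPException, OSError) as exc:
        # Matches the "No response from NTP server" case in the log above.
        return 2, "NTP CRITICAL: No response from NTP server (%s)" % exc
    offset = stats.offset  # local clock minus server clock, in seconds
    if abs(offset) >= CRIT_SECS:
        return 2, "NTP CRITICAL: Offset %.10f secs" % offset
    if abs(offset) >= WARN_SECS:
        return 1, "NTP WARNING: Offset %.10f secs" % offset
    return 0, "NTP OK: Offset %.10f secs" % offset

if __name__ == "__main__":
    code, message = check_offset(sys.argv[1] if len(sys.argv) > 1 else "pool.ntp.org")
    print(message)
    sys.exit(code)  # 0/1/2 = OK/WARNING/CRITICAL, as Nagios expects
```

This also shows why restricting which servers may query the NTP daemon can flip a whole fleet of checks at once: the plugin only reports what the server will answer.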
[18:35:30] RECOVERY - NTP on locke is OK: NTP OK: Offset -0.001107811928 secs
[18:35:39] cmjohnson1:
[18:35:40] RECOVERY - NTP on mw1028 is OK: NTP OK: Offset 0.0008904933929 secs
[18:36:10] RECOVERY - NTP on mw1097 is OK: NTP OK: Offset 0.0009825229645 secs
[18:36:10] RECOVERY - NTP on mw1029 is OK: NTP OK: Offset 0.0009075403214 secs
[18:36:20] RECOVERY - NTP on mw1023 is OK: NTP OK: Offset -0.0002377033234 secs
[18:36:20] RECOVERY - NTP on es1001 is OK: NTP OK: Offset 0.001837611198 secs
[18:36:20] RECOVERY - NTP on mw1067 is OK: NTP OK: Offset 0.0001199245453 secs
[18:36:56] apergos: so ms-be5 is still lingering… steve didn't get to it yesterday
[18:37:04] ok
[18:37:13] RECOVERY - NTP on mw1113 is OK: NTP OK: Offset 0.0004721879959 secs
[18:37:19] should I be checking on it tomorrow morning you think?
[18:37:23] RECOVERY - NTP on mw1037 is OK: NTP OK: Offset -0.0004260540009 secs
[18:37:43] RECOVERY - NTP on mw1064 is OK: NTP OK: Offset 0.0006532669067 secs
[18:38:09] no, he will not be there today. I expect it to be finished tomorrow during my afternoon
[18:38:13] RECOVERY - NTP on mw1088 is OK: NTP OK: Offset -0.0001447200775 secs
[18:38:23] RECOVERY - NTP on mw1018 is OK: NTP OK: Offset -0.001129865646 secs
[18:38:23] RECOVERY - NTP on mw1100 is OK: NTP OK: Offset 0.001982688904 secs
[18:38:33] RECOVERY - NTP on mw1117 is OK: NTP OK: Offset 0.0005354881287 secs
[18:38:34] RECOVERY - NTP on mw1099 is OK: NTP OK: Offset 7.677078247e-05 secs
[18:38:40] ok
[18:39:09] thanks for the info
[18:39:41] is ms-be5 down?
[18:40:34] RECOVERY - NTP on mw1103 is OK: NTP OK: Offset -0.0004615783691 secs
[18:40:53] paravoid https://rt.wikimedia.org/Ticket/Display.html?id=5428
[18:41:48] New review: Faidon; "Really nice code." [operations/puppet/cdh4] (master) C: 2; - https://gerrit.wikimedia.org/r/71569
[18:42:57] sigh
[18:43:36] paravoid: we are going to use this as an opportunity to replace the controller, swap the ssds and fix the drac
[18:43:50] makes sense
[18:44:15] paravoid: no it's not down
[18:44:56] and why sigh?
[18:45:12] and now I see the ticket was linked so I didn't have to answer that :-/
[18:45:41] sigh for the amount of Dell breakage we get
[18:45:47] esp. on ms-fe boxes
[18:45:54] or be?
[18:46:01] er, yes, ms-be I meant
[18:46:24] the disks really do seem to have a higher death rate than I'm used to
[18:46:34] yeah and eqiad was crap too
[18:46:42] cmjohnson1 was replacing disks like crazy
[18:47:02] a bit late for you isn't it
[18:47:04] I can't imagine it's just bad batches
[18:47:15] yeah I've been on most of the day, I was just popping in to see how things were
[18:47:20] paravoid: is there any reason for that?
[18:47:21] are thos SSD's?
[18:47:21] *e
[18:47:34] no, they're regular sata drives
[18:47:48] I don't think it is bad batches… the analytics boxes have not experienced any failures
[18:47:56] vendor decided to lower their QA standards?
[18:48:00] suspicious isn't it?
[18:48:25] probably high data transfer rates causing disks to fail IMO
[18:48:27] these were replacements, maybe they did give us a bad batch they had spare :)
[18:48:28] which drive are those? samsung?
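For the death-rate question above, one low-tech way to sweep a box for dying drives is smartctl from smartmontools. A minimal sketch, assuming directly attached /dev/sdX devices — drives hidden behind a PERC controller would instead need smartctl's "-d megaraid,N" addressing — and not the actual tooling used on the ms-be boxes:

```python
#!/usr/bin/env python
# Sketch: sweep local drives with smartctl (from smartmontools, run as root)
# and flag any whose SMART health check fails. The /dev/sd? glob assumes
# directly attached disks; behind a RAID controller, per-drive addressing
# ("-d megaraid,N") would be needed instead.
import glob
import subprocess

def smart_failing(dev):
    result = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True)
    # smartctl's exit status is a bitmask; bit 3 (value 8) is set when the
    # health check comes back "DISK FAILING".
    return bool(result.returncode & 0x08)

if __name__ == "__main__":
    for dev in sorted(glob.glob("/dev/sd?")):
        print("%s: %s" % (dev, "FAILING" if smart_failing(dev) else "ok"))
```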
we had western digitals but I don't remember what these are now
[18:49:22] I remember some vendor had a recent firmware issue, but can't recall which
[18:49:41] that firmware caused early drive death
[18:49:48] matanya… the Seagates have been failing
[18:49:55] toshiba has been the replacements
[18:49:59] correct, thanks
[18:50:19] New patchset: Springle; "require percona toolkit on deployment servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73007
[18:50:24] the wds were crap
[18:50:34] ^ agree
[18:50:35] * apergos looks: yeah it's logged. well, they were
[18:51:43] RECOVERY - NTP on ms-fe2 is OK: NTP OK: Offset -0.002739667892 secs
[18:52:01] oh wow, so google "seagate hard drives high rate of failure" and there is quite a bit
[18:52:06] I suffer from memory multibit error lately, not drives death
[18:52:07] * apergos is gonna watch some news and then afk for the evening. back balcony/garden for some reading in the cool evening air
[18:52:53] We went through a run of multi-bit errors on the R410s but it has quieted down lately
[18:53:13] I suffer it a lot on the R710s
[18:53:13] RECOVERY - NTP on brewster is OK: NTP OK: Offset -0.0008366107941 secs
[18:53:22] the 720s are better
[18:53:46] these are R720xd fwiw
[18:53:56] good deal
[18:54:22] I can check here and see if i had such issues lately
[18:54:42] what are the models you on that batch?
[18:54:57] *you have in
[18:57:03] Change merged: Springle; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73007
[18:58:43] the old controllers are h310s and the new ones (I don't know if they have as many problems once the new controllers are in) are h710s .. I think
[18:58:55] that's correct
[18:59:05] to clarify, h310->h710
[18:59:19] the trailing s is for plural, not part of the model :)
[18:59:19] which firmware?
[18:59:25] (there is an h710p)
[18:59:55] * matanya is looking in his inventory
[19:00:00] matanya… the dimm mod 00AD00B380AD
[19:00:12] from what i can tell it is used in the 710 as well
[19:00:15] Subsystem: Dell PERC H710 Mini
[19:00:17] these say
[19:00:22] (lspci output)
[19:01:07] is it the integrated one? or the external one?
[19:01:23] these are external cards
[19:01:48] FW Package Build: 21.0.2-0001
[19:01:58] FW Version : 3.130.05-1587
[19:02:22] and if the bios version is useful too it's
[19:02:23] BIOS Version : 5.30.00_4.12.05.00_0x05110000
[19:02:38] mini is the embedded one iirc
[19:02:43] the small card
[19:03:52] what is an SN for example?
[19:04:01] one would be enough
[19:04:12] lemme pm you one
[19:04:18] (of the server of course)
[19:04:21] I dunno why I'm not sure it should be logged but
[19:04:22] sure
[19:11:29] any luck?
[19:11:54] I see a fix for the controller
[19:12:05] quite important
[19:12:38] linky?
[19:12:45] - Corrected an issue where the controller would hang while performing IO on a degraded VD.
[19:13:13] https://www.dell.com/support/drivers/us/en/555/DriverDetails/Product/poweredge-r720xd?driverId=C1VYX&osCode=LNUX&fileId=3197089357&languageCode=EN&categoryId=SF#
[19:13:20] ah one more thing, these (in tampa) are all set up as either jbod or (h710) raid 0 for each disk
[19:13:27] looking
[19:13:31] some other enhancements too
[19:13:47] and if you use encryption there is a critical fix there
[19:15:07] HA! i knew it. URGENT fix
[19:15:13] for seagate drives
[19:15:34] oh?
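To answer "what are the models in that batch" across many machines, the model and firmware strings can be pulled with "smartctl -i" and compared against whatever models a vendor advisory names. A sketch under stated assumptions: the AFFECTED_MODELS set is a placeholder (the Dell advisory linked above would be the authoritative list), the device list is illustrative, and ST32000645SS is simply the Constellation ES.2 model read off the box in channel:

```python
#!/usr/bin/env python
# Sketch: collect drive model and firmware strings via "smartctl -i" and
# flag models named in a vendor advisory. Handles both the ATA output form
# ("Device Model" / "Firmware Version") and the SAS form ("Product" /
# "Revision"). AFFECTED_MODELS is a placeholder, not a confirmed list.
import re
import subprocess

AFFECTED_MODELS = {"ST32000645SS"}  # placeholder: models the advisory names

def drive_identity(dev):
    out = subprocess.run(["smartctl", "-i", dev],
                         capture_output=True, text=True).stdout
    model = re.search(r"(?:Device Model|Product):\s+(\S+)", out)
    firmware = re.search(r"(?:Firmware Version|Revision):\s+(\S+)", out)
    return (model.group(1) if model else "?",
            firmware.group(1) if firmware else "?")

if __name__ == "__main__":
    for dev in ("/dev/sda", "/dev/sdb"):  # illustrative device list
        model, firmware = drive_identity(dev)
        note = "check advisory" if model in AFFECTED_MODELS else "ok"
        print("%s  model=%s  firmware=%s  %s" % (dev, model, firmware, note))
```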
[19:16:04] I remembered applying a critical fix, now i know which
[19:16:23] see the SAS section
[19:16:34] has a firmware marked as urgent
[19:17:37] what are the drive models you said you have?
[19:18:09] the new ones coming in are toshibas, cmjohnson1 was saying
[19:18:31] and they are replacing seagates, don't know what model
[19:19:07] yeah but most are seagate
[19:19:07] barracuda
[19:19:11] so the seagates must have this fix, they die like hell if not applied
[19:19:33] * matanya is so happy his memory leak has slowed down
[19:19:38] sorry but which sas section? I didn't see it in the release notes
[19:19:54] no, i'll link it again
[19:20:04] https://www.dell.com/support/drivers/us/en/04/DriversHome/ShowProductSelector
[19:20:12] use this link and put your SN in
[19:20:14] oh.. wrong on that… they're constellation es.2 hdd
[19:20:14] ah
[19:20:36] then pick linux and scroll to the SAS drives
[19:21:07] cmjohnson1: didn't understand
[19:23:39] that was the seagate model
[19:24:22] oh, so the patch applies to them then?
[19:25:14] I guess we need the model numbers
[19:25:33] yes, seems so
[19:25:41] you should have it in hdparm
[19:26:39] yeah I wasn't on the box any more
[19:26:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:26:51] hdparm -i, i guess you know that :)
[19:26:54] I keep saying I'm just going to watch the news and then get going but there's soccer instead
[19:27:26] or you can say it is too interesting here...
[19:27:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[19:27:55] ST32000645SS at least on the box I'm on right now
[19:28:58] that si the constellation
[19:29:01] *is
[19:30:24] not on the list
[19:30:24] ok I'm going to get going, news or no news... thanks for looking into this
[19:30:41] thank you, have fun
[19:42:17] cmjohnson1: I show two deliveries, one may be SFP+
[19:42:44] i see that too.. i am going to head back there now and get it… i wasn't expecting anything
[19:42:56] brb
[19:51:45] robh: https://rt.wikimedia.org/Ticket/Display.html?id=5266 is what I received and the lc adapter
[19:52:12] not sure why I got parts for a 4550
[20:01:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:02:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[20:06:22] New review: Dzahn; "this would make it 301 Moved Permanently http://en.wikipedia.org/ because redirects.conf has " # Sen..." [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/72734
[20:08:52] weird, I couldn't connect to freenode most of the morning - SASL auth kept timing out.
[20:09:25] orenwolf: me too :) maybe a canada bell issue :D
[20:09:35] i used a different port to connect
[20:09:53] Well, I'm not on bell, but I am in Canada, so there might be something there :)
[20:10:09] i had the same issue from the office most of this morning
[20:11:01] Ah, well then.
[20:13:03] you know freenode had a DDoS most of the day?
[20:16:51] matanya: hard not to notice ;)
[20:17:14] orenwolf: binasher drdee: not an issue between you and freenode i think. services (e.g. NickServ) have been up and down and you probably can't SASL without a NickServ
[20:17:36] 10 19:59:44 [freenode] -mist(~mrmist@freenode/staff/mist)- [Global Notice] We're still working on getting services (nickserv, chanserv, alis, etc.) back up and running. Another global notice will be sent once we're happy that they are back properly.
[20:17:51] 10 20:01:48 [freenode] -mquin(~mquin@freenode/staff/mquin)- [Global Notice] Services are now back but may be lagged for a little while as everyone identifies. Thank you for your patience and for flying freenode!
[20:18:00] Cool, thanks. :)
[20:18:19] my fault for not being persistently in IRC! ;)
[20:20:02] how about them netsplits
[20:21:28] New patchset: Ottomata; "Removing IP based filters on oxygen, replacing with the X-Analytics zero= based filter." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73019
[20:21:35] hey ottomata. i just learned there was an encyclopedist named otto
[20:21:43] oh yeah?
[20:21:46] does that mean I get a server?!
[20:21:56] errr, you're behind the times
[20:22:08] * jeremyb too thouggh
[20:22:08] though*
[20:22:43] ottomata: https://rt.wikimedia.org/Ticket/Display.html?id=3912 https://rt.wikimedia.org/Ticket/Display.html?id=3911
[20:22:44] New patchset: MaxSem; "Rewrite rule for m.wikimediafoundation.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73020
[20:23:08] ottomata: if you read https://rt.wikimedia.org/Ticket/Display.html?id=3406 the wrong way it could be interpreted as server otto :-P
[20:23:26] (the subject at least)
[20:23:53] haha
[20:27:33] New patchset: Ottomata; "Removing IP based filters on oxygen, replacing with the X-Analytics zero= based filter." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73019
[20:28:02] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73019
[20:30:13] New review: Dzahn; "m.wikimediafoundation.org is an alias for m.wikimedia.org." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73020
[20:32:14] New patchset: Ottomata; "Rsyncing zero*.gz to relevant stat servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73024
[20:49:01] New review: MaxSem; "> m.wikimediafoundation.org is an alias for m.wikimedia.org." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73020
[20:51:30] New patchset: Yurik; "Renamed 405-0* to 405-25 for TATA India zero carrier" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73027
[20:51:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73020
[20:54:54] hi, analytics have requested zero ID rename, i would like to rename a corresponding META page in semi-sync with the merge. Could someone take a look please? :) https://gerrit.wikimedia.org/r/#/c/73027/
[20:55:10] http://codereview-proxy.wikimedia.org/ hrmmpff
[20:55:11] would like to get rid of it
[20:55:11] https://wikitech.wikimedia.org/wiki/Codereview-proxy.wikimedia.org
[20:56:23] New patchset: Ottomata; "Rsyncing public-datasets to stat1001 every 30 minutes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73032
[20:56:38] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73032
[20:57:05] New patchset: Ottomata; "Rsyncing zero*.gz to relevant stat servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73024
[20:57:13] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73024
[21:02:03] !log package upgrades on hooper (etherpad)
[21:02:12] Logged the message, Master
[21:35:41] !log removing old racktables Apache site and other remnants from hooper
[21:35:49] Logged the message, Master
[22:01:03] PROBLEM - Puppet freshness on db78 is CRITICAL: No successful Puppet run in the last 10 hours
[22:26:44] !log reprepro copying etherpad from wikimedia-lucid to wikimedia-precise
[22:26:53] Logged the message, Master
[22:41:05] New patchset: Dzahn; "use misc::etherpad on zirconium to verify it works, not changing db or DNS yet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73114
[22:44:58] New patchset: Dzahn; "use misc::etherpad on zirconium to verify it works, not changing db or DNS yet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73114
[22:46:12] New review: Dzahn; "existing class from hooper, copied package lucid->precise, confirm it works to move out of Tampa" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/73114
[22:46:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73114
[23:08:46] PROBLEM - Etherpad HTTP on zirconium is CRITICAL: Connection refused
[23:10:22] yea, it's not there yet, ack
[23:23:43] !updated Parsoid to f571475
[23:27:46] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:46] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:46] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:46] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:46] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:46] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:47] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:47] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[23:28:57] !log csteipp synchronized php-1.22wmf9/extensions/CentralAuth 'Updating CentralAuth on wmf9 for SUL'
[23:29:07] Logged the message, Master
[23:37:29] gwicke: is !update a keyword, or did you forget !log?
[23:37:33] !log catrope synchronized php-1.22wmf8/extensions/VisualEditor 'Update VE to master'
[23:37:43] Logged the message, Master
[23:37:47] !log [00:23] gwicke !updated Parsoid to f571475
[23:37:56] Logged the message, Mr. Obvious
[23:37:59] !log catrope synchronized php-1.22wmf9/extensions/VisualEditor 'Update VE to master'
[23:38:08] Logged the message, Master
[23:38:25] AzaToth: I made that mistake a few times now, so clearly it should be an alias
[23:38:34]
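The !updated/!log slip above is exactly the kind of thing a one-line alias in the logging bot would absorb. A hypothetical sketch of that idea — this is not morebots' actual code or configuration format; the alias table and expand_aliases() helper are invented for illustration:

```python
# Hypothetical sketch: have the logging bot treat "!updated ..." as
# shorthand for "!log updated ...". Names below are invented, not the
# bot's real API.
ALIASES = {"!updated": "!log updated"}

def expand_aliases(line):
    for alias, expansion in ALIASES.items():
        if line.startswith(alias + " "):
            return expansion + line[len(alias):]
    return line

# The slip from earlier in the evening would then have been logged anyway:
assert expand_aliases("!updated Parsoid to f571475") == "!log updated Parsoid to f571475"
```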