[00:38:18] New review: Ori.livneh; "Hashar, are you waiting on me to merge / deploy this? (I'd be happy to; just let me know.)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71777
[01:02:27] New review: Andrew Bogott; "I don't have time to implement this just now, but here's what I've just learned (which you maybe alr..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72721
[01:12:42] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1000.16394 (gt 1000)
[01:13:22] !log starting Parsoid config update with latest dependencies
[01:13:32] Logged the message, Master
[01:15:10] !log finished Parsoid config update with latest dependencies
[01:15:20] Logged the message, Master
[01:17:14] New review: GWicke; "Roan has such a purge script in his home directory." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72653
[01:18:35] New review: Catrope; "....which is a hack. That said, /usr/local/bin/purge-varnish is basically this but uses a different ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72653
[01:43:47] RECOVERY - Solr on vanadium is OK: All OK
[01:55:47] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1849.4286 (gt 1000)
[01:59:47] PROBLEM - Puppet freshness on db78 is CRITICAL: No successful Puppet run in the last 10 hours
[02:06:27] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:20] RECOVERY - Disk space on mc15 is OK: DISK OK
[02:15:49] !log LocalisationUpdate completed (1.22wmf9) at Wed Jul 10 02:15:40 UTC 2013
[02:16:00] Logged the message, Master
[02:21:44] !log on tin: attempted to do a git pull in wmf8 but it updated a whole lot of extensions for some reason. Checking out the old versions manually.
[02:21:54] Logged the message, Master
[02:29:29] !log LocalisationUpdate completed (1.22wmf8) at Wed Jul 10 02:29:28 UTC 2013
[02:29:39] Logged the message, Master
[02:36:07] !log tstarling synchronized php-1.22wmf8/includes/MappedIterator.php
[02:36:17] Logged the message, Master
[02:36:38] !log tstarling synchronized php-1.22wmf8/includes/job/JobQueue.php
[02:36:49] Logged the message, Master
[02:37:18] !log tstarling synchronized php-1.22wmf8/includes/job/JobQueueRedis.php
[02:37:28] Logged the message, Master
[02:38:07] !log tstarling synchronized php-1.22wmf8/maintenance/showJobs.php
[02:38:16] Logged the message, Master
[02:42:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 10 02:42:32 UTC 2013
[02:42:51] Logged the message, Master
[02:52:27] !log tstarling synchronized php-1.22wmf9/extensions/ProofreadPage/ProofreadPage.body.php
[02:52:37] Logged the message, Master
[02:58:00] !log tstarling Started syncing Wikimedia installation... : ProofreadPage update for bug 51085
[02:58:09] Logged the message, Master
[03:02:09] !log tstarling Finished syncing Wikimedia installation... : ProofreadPage update for bug 51085
[03:02:19] Logged the message, Master
[03:05:11] what happened?
[03:05:45] re: 4-minute scap
[03:08:48] that's about right isn't it?
[03:09:37] +/- 36 minutes
[03:10:52] maybe before the network-aware thing was set up
[03:12:16] on the day we had downtime, it was 8 minutes with lots of changes, and 4 with none
[03:13:01] I had to check a random apache to make sure you weren't trolling
[03:17:10] https://zh.wikipedia.org/w/index.php?title=%E9%80%A0%E5%8C%96%E8%A1%97%E9%81%93&action=edit
[03:17:53] it is Liangent, of course
[03:17:59] it takes 8 seconds to render
[03:18:58] and he recently updated all the data templates, that is why zhwiki has 25k jobs in its queue now
[03:19:45] $ perl ~/job-stats.pl runJobs.log
[03:19:45] count   time (s)  DB
[03:19:45] 768592  1245.641  enwiki
[03:19:45] 366038  834.929   zhwiki
[03:19:45] 121166  512.333   ocwiki
[03:19:46] 285373  471.334   commonswiki
[03:19:48] 661429  392.787   enwiktionary
[03:19:50] 983006  332.544   frwiktionary
[03:21:24] it's easy to tell which wikis have the craziest wikitext template programmers, isn't it?
[03:23:52] You could still nab a spot in this month's metrics meeting if you put that on a colorful chart
[03:24:08] also, $('a[onclick]').length -> 444
[03:26:47] OK, I'm going to head off. I confirmed the ProofreadPage fatal went away.
[03:26:53] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:53] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:53] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:53] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:53] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:54] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:54] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[03:26:55] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[03:27:03] Thanks for reviewing / deploying.
[03:27:10] np
[04:01:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:02:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[04:16:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:17:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[04:27:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:29:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.905 second response time
[05:11:00] The metrics joke was funny.
[05:11:02] ori-l: ^
[07:13:59] PROBLEM - RAID on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:14:49] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[07:41:52] good morning :-)
[07:51:00] gerrit-wm: why you silent?
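Tim's job-stats.pl itself isn't shown in the log, only its output above; a minimal Perl sketch of that kind of aggregator might look like the following. The assumed runJobs.log line format (a "<db>:" token and a "t=<milliseconds>" duration) is a guess for illustration, not the real format.

```perl
#!/usr/bin/perl
# Hypothetical stand-in for ~/job-stats.pl: tally job count and total
# runtime per wiki from runJobs.log. The input format assumed here --
# a "<db>:" token and a "t=<ms>" field per line -- is a guess.
use strict;
use warnings;

my (%count, %time);
while (<>) {
    next unless /\b([a-z_0-9]+(?:wiki|wiktionary)):.*\bt=(\d+)/;
    $count{$1}++;
    $time{$1} += $2 / 1000;    # ms -> seconds
}
printf "%-10s %-10s %s\n", 'count', 'time (s)', 'DB';
printf "%-10d %-10.3f %s\n", $count{$_}, $time{$_}, $_
    for sort { $time{$b} <=> $time{$a} } keys %time;
```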
[07:55:24] I guess the poor ircecho might need to be restarted
[07:57:11] New patchset: Nemo bis; "(bug 15434) Periodical run of currently disabled special pages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713
[07:57:11] hm, perhaps it just doesn't like abandonment
[07:57:17] ah here it is
[08:10:23] New patchset: Hashar; "beta: upload cache rebuild to use varnish" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72900
[08:11:13] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72900
[08:14:03] apergos: mark: I have migrated the beta upload cache to point to the varnish instance \O/ The basic functionality seems to be working
[08:14:14] ok great
[08:14:24] though https does not hehe https://en.wikipedia.beta.wmflabs.org/wiki/File_talk:Polar_bear.jpeg
[08:14:45] nice
[08:15:45] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[08:16:15] seems to be because of the untrusted certificate on bits, accepting it fixes the css
[08:16:22] need to enable nginx on upload now
[08:16:45] there was an ssl terminator for upload before everything got re-arranged
[08:17:27] I forgot to reapply role::protoproxy::ssl::beta
[08:17:45] yay can't log into deployment-cache-upload03, gluster home I bet, and it can't create the home dir
[08:18:01] can we just make all the instances not use gluster home? it's getting old
[08:18:29] ahh
[08:18:40] apergos: I am going to get upload03 shut down
[08:18:56] ok well before it goes away entirely
[08:19:00] also yesterday a patch got merged that applies the role::labsnfs class on all of the deployment-prep instances
[08:19:14] the class is included in base.pp to make sure everything uses NFS for /home
[08:19:53] yes I saw that but apparently
[08:19:57] it's not actually applied everywhere
[08:20:04] or the old mount point is still in use instead
[08:20:34] yup the instances need to be rebooted apparently
[08:20:49] since /home has two automount snippets. I will get them all rebooted
[08:21:20] that would be awesome
[08:23:40] so what we had was on deployment-cache-upload03 there was nginx with the cert chain and key from the deployment-squid nginx instance
[08:24:19] using the "wikimedia" conf file on deployment-squid and calling it 'upload', with a tiny amount of editing
[08:24:23] this is what made https work
[08:24:35] 2013/07/10 08:19:03 [emerg] 10569#0: SSL_CTX_use_certificate_chain_file("/etc/ssl/certs/star.wmflabs.org.chained.pem") failed (SSL: error:02001002:system library:fopen:No such file or directory error:20074002:BIO routines:FILE_CTRL:system lib error:140DC002:SSL routines:SSL_CTX_use_certificate_chain_file:system lib)
[08:24:39] that is on the cache-upload04
[08:24:51] see above ^^
[08:26:12] so now you can repeat this by stealing the cert stuff and the 'upload' conf from deployment-cache-upload03 and putting it on deployment-cache-upload04 (with probably again a tiny bit of editing)
[08:26:15] :-P
[08:26:44] can't we get that puppetized ?
[08:27:19] yes, that's the plan
[08:27:25] seems the role::protoproxy::ssl::beta might need to include the cert
[08:28:22] role::protoproxy::ssl::beta::common has it
[08:28:30] so...?
[08:29:07] missing the /etc/ssl/certs/star.wmflabs.org.chained.pem
[08:31:25] install_certificate should do that, no?
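The SSL_CTX_use_certificate_chain_file error pasted above is nginx failing to open the chained cert at startup. A quick hedged way to check for that condition (ordinary coreutils/openssl invocations; the path is the one from the error message):

```
# does the file the error complains about actually exist yet?
ls -l /etc/ssl/certs/star.wmflabs.org.chained.pem
# and is it a parseable certificate?
openssl x509 -noout -subject -enddate -in /etc/ssl/certs/star.wmflabs.org.chained.pem
```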
[08:31:38] I restarted nginx, that fixed it
[08:31:47] probably the cert got installed after nginx
[08:31:51] meh
[08:31:59] or some nfs cache
[08:32:03] whatever, it works :-]
[08:32:05] anchor, included class, floating resources (maybe)
[08:32:08] ok
[08:32:41] I see the little bear thumbs now
[08:33:20] with all certs approved https://en.wikipedia.beta.wmflabs.org/wiki/File_talk:Polar_bear.jpeg :-]
[08:33:33] yes that's where I am
[08:33:43] they're very cute at that size :-)
[08:35:09] what do you think about rebooting aggregator1?
[08:35:19] I have no idea what it is for
[08:35:23] I can't get on that box either (can't create home dir)... same old thing
[08:35:26] ganglia and icinga
[08:35:36] ah
[08:35:40] in order to try to resolve the ganglia bug
[08:35:57] go ahead :-]
[08:36:21] did so
[08:36:54] for folks who don't want to live in the labs channel, can I log from somewhere else?
[08:37:03] RECOVERY - Solr on vanadium is OK: All OK
[08:38:44] apergos you have a moment for a labs related question?
[08:38:45] http://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Current_event_needs_photography
[08:39:14] does that sound like something hostable by labs?
[08:39:25] or is it not in the scope?
[08:40:03] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1000.4558 (gt 1000)
[08:40:08] really no idea what bots are ok
[08:40:15] I'm not at all involved with that project
[08:40:21] ah
[08:40:28] ToAruShiroiNeko: there is google news for that isn't it ? :-]
[08:40:30] I think of you as a know-all of everything :p
[08:40:42] hashar the idea is getting the message to the people
[08:40:46] twitter is there too
[08:41:09] ToAruShiroiNeko: you can ask petan in #wikimedia-labs , but to me that seems to be overlapping with existing tools such as mail notifications in google news, or twitter or whatever.
[08:41:10] if a plane crashes near wikipedians and they hear about it in evening news its an opportunity lost
[08:41:24] sure I can do that
[08:41:36] hashar point is notifying people who aren't news buffs that would use such tools
[08:42:31] Creating directory '/home/ariel'.
[08:42:31] Unable to create and initialize directory '/home/ariel'.
[08:42:35] same old thing from aggregator1
[08:42:59] I don't know how to tell if it really has the right classes for /home cause .. I can't get on there to look at it
[08:46:33] apergos: I am wondering if you get access to the instance using root ssh
[08:48:03] RECOVERY - Solr on vanadium is OK: All OK
[09:04:24] I got it fixed after restarting nginx [09:04:30] maybe it started up before the cert got generated [09:05:02] so is varnish going to do the SSL terminaison ? [09:05:03] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1000.7904 (gt 1000) [09:05:17] no, varnish can't [09:05:22] but we're gonna put nginx on the varnish boxes [09:05:26] (or something else which does similar) [09:05:31] so pretty much like you have done [09:07:08] yup apergos found that architecture hack [09:07:26] architecture hack? [09:08:47] it's been planned for production too for a while [09:09:06] I thought about using iptables on each cache to redirect the traffic to the nginx ssl cache, but ariel went with a clever idea which was to get nginx directly on each cache instance [09:09:08] great! [09:10:17] I am pretty out of the loop on the varnish stuff n prod [09:10:30] I know you guys talk about it a lot but there's already a lot to keep track of, so... [09:11:52] hashar: unless you have any other thoughts about how I can get access to the aggregator instance or get the nfs home mount going on there, I'm going to give up on that bug for now, I commented on the bug and ryan will see it anyways [09:12:04] aggregator instance? [09:12:07] labs [09:12:14] apergos: I have no clue [09:13:25] ok [09:14:00] how about the aft thing? https://bugzilla.wikimedia.org/show_bug.cgi?id=50623 logged in works, lgged out still fails... [09:14:08] (if you are doing other stuff, tell me and I'll not bug you) [09:15:56] !log Inserted varnish 3.0.3plus-rc1-wm12 packages into the precise-wikimedia APT repository [09:16:07] Logged the message, Master [09:17:13] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm12) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/72903 [09:20:59] New patchset: Mark Bergsma; "Allow persistent connections with vcl_error" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/72904 [09:20:59] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm12) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/72905 [09:21:53] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/72904 [09:22:08] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/72905 [09:45:23] New patchset: Mark Bergsma; "Allow persistent connections for HTTP PURGE (error) responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72530 [09:45:23] New patchset: Mark Bergsma; "Maintain persistent connections for geoip, redirects, 204 responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72910 [09:47:24] !log Upgrading Varnish on bits servers [09:47:34] Logged the message, Master [09:53:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:54:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [10:00:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:03:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [10:21:26] went back home :-D [10:21:26] and recovered internet access in the process [10:35:22] Change merged: Mark Bergsma; [operations/puppet] (production) - 
[10:38:57] RECOVERY - Solr on vanadium is OK: All OK
[10:44:48] nice change
[10:53:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:54:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[11:01:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:02:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[11:10:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:11:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.600 second response time
[11:26:24] New patchset: Mark Bergsma; "Allow persistent connections for HTTP PURGE (error) responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72530
[11:26:24] New patchset: Mark Bergsma; "Maintain persistent connections when serving mobile redirects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72929
[11:32:59] !log Upgraded and restarted eqiad mobile caches (front/back)
[11:33:09] Logged the message, Master
[11:36:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72929
[11:56:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:57:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[12:00:20] PROBLEM - Puppet freshness on db78 is CRITICAL: No successful Puppet run in the last 10 hours
[12:11:51] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1005.5813 (gt 1000)
[12:21:49] New patchset: Ottomata; "Installing nrpe::monitor_service for Kafka producer processes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72934
[12:22:59] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72934
[12:31:51] RECOVERY - Solr on vanadium is OK: All OK
[12:37:52] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1000.09357 (gt 1000)
[12:38:52] RECOVERY - Solr on vanadium is OK: All OK
[12:44:05] New patchset: Ottomata; "Adding support for tmax, dmax and sendMetadata." [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72935
[12:46:59] New patchset: coren; "Labs: make autofs forcibly use nfs4 for labsnfs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72936
[12:47:05] hashar: ^^
[12:49:34] New review: coren; "Seems okay to me to add both." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/69289
[12:50:47] https://gerrit.wikimedia.org/r/72936 affects stuff outside of tool labs so I'd rather not self +2. Anyone kind enough to sanity check me?
[12:51:02] Coren: nice :-)
[12:52:36] * Coren needs breakfast and coffee.
[12:52:56] hashar: As soon as this goes into puppet and gets applied to your lucid instances, the problem "should" be fixed.
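Coren's "make autofs forcibly use nfs4 for labsnfs" change above amounts to pinning the filesystem type in the automount map. A hedged sketch of what such map entries look like (the server name and export path here are invented; the -fstype=nfs4 pinning is the point):

```
# /etc/auto.master: hand /home to an indirect map
/home  /etc/auto.home  --timeout=300

# /etc/auto.home: force NFSv4 for every home directory;
# "&" expands to the looked-up key (the username)
*  -fstype=nfs4,rw,hard,intr  labsnfs.example.wmflabs:/exp/home/&
```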
[12:58:31] New review: coren; "(that was intended to be -1)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/67055
[13:10:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:11:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[13:12:20] New patchset: Ottomata; "Adding support for more per-attribute Ganglia settings. See: https://github.com/jmxtrans/jmxtrans/wiki/GangliaWriter" [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72935
[13:13:40] New patchset: Ottomata; "Adding support for more per-attribute Ganglia settings. See: https://github.com/jmxtrans/jmxtrans/wiki/GangliaWriter" [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72935
[13:15:38] New patchset: Ottomata; "Adding support for more per-attribute Ganglia settings. See: https://github.com/jmxtrans/jmxtrans/wiki/GangliaWriter" [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72935
[13:16:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:17:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[13:21:27] New review: Hashar; "I am not sure how to test it :/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72936
[13:24:05] apergos: I guess we could give that patch a try on a precise instance
[13:26:55] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:55] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:55] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:55] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:55] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:56] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:56] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[13:26:57] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:44:06] New patchset: Ottomata; "Puppetizing jmxtrans for analytics udp2log kafka producers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72943
[13:53:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:54:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time
[14:02:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[14:06:49] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:14:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[15:18:27] New patchset: Ottomata; "Fixing dmax on jmxtrans" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72965
[15:18:37] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72965
[15:21:10] ottomata: out of pure curiosity, what's your plan with jmxtrans?
[15:23:27] right now, just puppetizing some stuff that is already there
[15:23:52] but what's the purpose?
[15:23:57] pretty ganglia graphs for packet loss?
[15:23:58] New review: Andrew Bogott; "Note also that we do probably want this function in a standalone script so that we can eventually in..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72721
[15:24:36] graphs and monitoring alerts, i'm using it for kafka
[15:24:47] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=kafka
[15:25:39] drdee: btw, I fixed up https://gerrit.wikimedia.org/r/#/c/68711/ see ps13 & my comments
[15:25:49] ottomata too, obviously :)
[15:26:57] ottomata: and that's for the udp2log->kafka thing you have?
[15:27:05] yes
[15:27:20] that and the kafka brokers, but right now i'm focusing on the udp2log -> kafka thing
[15:27:33] for the last week or so, the udp2log -> kafka piece has been very very fragile
[15:27:37] i'm not sure why yet
[15:27:41] is jmxtrans a thing the producers need to do or can it be done on brokers too?
[15:27:56] brokers are fine
[15:27:59] the question is, if it'll work when we switch to varnishkafka :)
[15:28:01] jmxtrans works the same for any jvm exposing jmx stats
[15:28:07] no, because that won't be a jvm
[15:28:13] yes exactly
[15:28:14] New review: Andrew Bogott; "Oh, my mistake, it is standalone :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72721
[15:28:19] so we'd need to do it on the java brokers
[15:28:26] yes, i have it there too, not puppetized yet
[15:28:33] right
[15:28:34] that's where most of those stats in that view is coming from
[15:28:38] but it's possible, okay
[15:28:49] yup, but they are slightly different stats
[15:29:09] its nice to have stats directly from producers
[15:29:28] right
[15:29:28] how do varnish stats currently get into ganglia?
[15:30:16] with a ganglia python module
[15:30:18] a ganglia plugin that calls varnishstat iirc
[15:30:21] but it doesn't work that well
[15:30:36] the concept is nice though, the implementation not so much
[15:31:19] and vhtcpd writes a json with stats on /tmp
[15:31:43] New patchset: Ottomata; "Upcasing slope value, Ganglia output writer expects this to be all caps." [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72969
[15:32:10] is it possible to expose extra metrics in varnishkafka?
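For context on the jmxtrans settings being shuffled in the patches above (tmax, dmax, and the all-caps slope the Ganglia writer insists on): a jmxtrans job is a JSON document pairing JMX queries with output writers. This is a rough sketch only; the JMX bean, attribute, and Ganglia group/host/port are invented for illustration.

```json
{
  "servers": [{
    "host": "analytics1001.example.wmnet",
    "port": "9999",
    "queries": [{
      "obj": "kafka:type=kafka.SocketServerStats",
      "attr": ["ProduceRequestsPerSecond"],
      "outputWriters": [{
        "@class": "com.googlecode.jmxtrans.model.output.GangliaWriter",
        "settings": {
          "groupName": "kafka",
          "host": "239.192.1.32",
          "port": 8649,
          "slope": "BOTH",
          "tmax": 60,
          "dmax": 300
        }
      }]
    }]
  }]
}
```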
[15:32:15] Change merged: Ottomata; [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72969
[15:32:34] not via varnishstat afaik, you'd need an entirely different infrastructure for that
[15:32:46] hm k
[15:33:05] implementing something like a statsd or ganglia client in there shouldn't be too difficult though
[15:33:09] New patchset: Ottomata; "Updating jmxtrans submodule" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72970
[15:33:26] ja would be real useful
[15:34:46] New patchset: Ottomata; "Updating jmxtrans submodule" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72970
[15:34:56] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72970
[15:35:01] I think the statsd protocol is dead simple
[15:35:39] it's something like a simple string of concatenated values sent over udp
[15:35:45] hey, while you are both here, should I modify the post-merge hook on sockpuppet so that it also runs git submodule update --init on stafford?
[15:36:02] paravoid: yup saw it, thanks so much! does this mean it can be merged and is now done?
[15:38:27] drdee: as I mentioned there, can we switch to upstream's JNI?
[15:38:41] I'm not sure of the specifics there but it looks like a better way forward?
[15:39:01] plus I *really* don't want to be messing with autoconf
[15:39:15] oh ah I said autoconf and bblack joined
[15:39:29] don't have any objection, they basically took our JNI implementation as an example
[15:39:32] freenode is screwing with me on nickserv / sasl again :P
[15:39:34] and reimplemented it
[15:40:15] ok, but if that's the case let's just use upstream's then
[15:40:22] paravoid: nuke it from orbit, it's the only way to free yourself from autotools
[15:40:59] bblack: since you're here
[15:41:10] bblack: we should probably go with 1.8.3 for our setup rather than 1.9.0 for now, right?
[15:41:47] paravoid: were you using edns-client-subnet already with powerdns or something? as in, your NS IPs already whitelisted with opendns/google?
[15:41:52] no
[15:41:58] New patchset: Cmjohnson; "decommissioning and removing spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72972
[15:42:06] then I guess it doesn't matter, much
[15:42:28] it didn't support it; I added some support for it at some point but it's non-trivial to support it properly using the existing code
[15:42:34] it's actually how I ended up finding gdnsd btw :)
[15:42:43] we probably do want to turn that on pretty soon though, and then 1.9.0 becomes necessary, since they're phasing out the old option code in a month
[15:42:47] yeah
[15:42:56] we will, but as I was saying at the meeting
[15:43:04] I'd like to avoid shifting too much traffic with the DNS update
[15:43:10] get it out the door first, yeah
[15:43:13] right
[15:43:27] it's also dangerous, esams doesn't get all of the traffic that it should
[15:43:41] dangerous sounds fun
[15:43:41] and messing too much with it can result in a traffic surge to esams
[15:43:48] but yeah
[15:43:58] New patchset: Ottomata; "Selecting proper metric in Ganglia kafka view" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72973
[15:44:02] like shifting India, Africa etc. to esams which now go to eqiad :)
[15:44:25] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72973
[15:44:28] did you have a chance to see new-ns0?
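paravoid's description of statsd above is accurate: the wire format is literally one "name:value|type" string per metric, fired over UDP and forgotten. A sketch in a few lines of Perl (the metric names and collector address are made up):

```perl
# Sketch of the statsd wire protocol: "name:value|type" datagrams over
# UDP. |c is a counter, |g a gauge, |ms a timer. No reply, no handshake.
use IO::Socket::INET;

my $statsd = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',   # collector address: invented
    PeerPort => 8125,          # conventional statsd port
    Proto    => 'udp',
) or die "socket: $!";

$statsd->send('varnishkafka.msgs_sent:1|c');      # counter increment
$statsd->send('varnishkafka.queue_depth:42|g');   # gauge
```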
[15:44:48] last time I tried my ssh login there kept timing out
[15:44:55] still seems to be
[15:45:43] how are you trying to login?
[15:45:52] (works for me)
[15:45:59] maybe you're not going via labs' bastion?
[15:46:06] I have ssh set up to go through bastion and forward a key, which I use for my labs varnish-work host in the same spot
[15:46:26] I get far enough to see "If you are having access problems, please see: https://wikitech.wikimedia.org/wiki/Access#Accessing_public_and_private_instances" - so I think the problem is post-ssh, something to do with home directories or login scripts or whatever
[15:46:49] hm maybe you don't have access to the project
[15:47:19] oh, yeah, I bet so. I think I know where to fix that too
[15:47:32] what's your labs username?
[15:48:00] bblack
[15:48:01] got it
[15:48:20] done
[15:52:06] paravoid: just out of curiosity, "include_optional_ns = true" - is this just in the name of "change as little as possible about our responses", or is there some good reason for it I'm unaware of?
[15:52:20] the former
[15:52:52] ok
[15:53:08] it's not set in stone though, if you disagree I can change it
[15:53:40] that's some serious templating :)
[15:53:59] heh not so much yet
[15:54:00] I don't care much, it's really just a packet-size optimization. can always play with it later once other risks are mitigated :)
[15:54:08] I have more coming up
[15:54:18] using jinja's extend syntax
[15:54:24] to have a common template for our zones
[15:54:40] most of our zones are very similar
[15:57:09] notpeter: around?
[15:57:19] how does it look?
[15:57:57] paravoid: can you look at this for me. https://gerrit.wikimedia.org/r/#/c/72972/
[15:58:01] plz
[15:59:13] paravoid: looks pretty good. if it's not a PITA, I would do explicit listen and http_listen options, assuming the interface IPs are easily templated in puppet. to avoid a pointless thread listening on localhost
[15:59:31] New review: Faidon; ""git grep spence" reveals a few more occurences." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/72972
[15:59:57] bblack: so, that's actually a problem I've been banging my head at
[16:00:00] it's not just localhost
[16:00:04] we have separate service IPs for DNS
[16:00:35] so e.g. right now we have dobson as ns0, but there's no DNS being served at dobson's IP
[16:00:51] is the IP failed around via something like heartbeat?
[16:01:02] no
[16:01:09] it just makes it easier to move the service
[16:01:12] ok
[16:01:20] I think that was the purpose at least, I wasn't around when that happened
[16:01:25] but it certainly makes my life easier now :)
[16:01:26] so what's the problem?
[16:01:43] the problem is that I envisioned having a different git tree with templates & gdnsd.conf
[16:02:01] not puppet
[16:02:01] ah!
[16:02:27] have puppet stuff necessary factoids into some file like /etc/gdnsd/facts, and have the separate git+template system suck them from there?
[16:02:55] I wanted to avoid having gdnsd.conf also be templated for simplicity
[16:03:18] but yeah, it's possible...
[16:04:37] hm, maybe an include conf directive would be useful here
[16:04:45] :P
[16:04:47] :)
[16:05:06] I just thought of that, it wasn't some elaborate plan on asking you to implement that :)
[16:05:47] it's been mentioned before :) I'm really not a huge fan of the custom config language anymore though, I feel like that should all be refactored to use something standard that has cool features for repetition/templates/includes or whatever
[16:05:48] but yeah, maybe gdnsd.conf should be managed by puppet and including lb.conf or something
[16:05:52] like Lua
[16:06:17] fwiw, I like it as a config language
[16:06:19] and make the plugin config dynamic like the zonefiles would be nice, too. or really everything, if possible
[16:06:49] nod, that'd be nice indeed
[16:07:22] the catch there is really dynamic listen addrs and runtime create/destroy of new listen threads, etc. especially given permissions issues on sockets
[16:07:32] it's possible, it's just complicated
[16:07:43] well, I think having to restart for that is fine
[16:08:16] for dynamic plugin config the plugin API would need a lot of big changes too, but I've never claimed stability there, so :)
[16:08:33] hehe
[16:09:08] there's also a bad side-effect about listening on all interfaces right now
[16:09:15] dobson also serves as recursor0
[16:09:24] also listening on 53, on a different service IP
[16:09:24] yeah
[16:09:31] that's just not possible now
[16:09:44] bleh
[16:10:23] if we had config includes, I'd just template gdnsd.conf via puppet and have it include a separate file for the plugin config from the separate dns git
[16:10:32] yes, that was my idea
[16:10:44] gdnsd.conf be a puppet erb including gdnsd-lb.conf or something
[16:10:55] it wouldn't be hard to implement, I'm just stubborn about not adding features to things I think should go away in general
[16:11:00] hehe
[16:11:11] I guess I could just do cat gdnsd-head.conf gdnsd-lb.conf > gdnsd.conf too
[16:11:22] true
[16:13:04] yeah that sounds like a viable plan for now
[16:13:06] is someone attacking chanserv/nickserv or is this just random general issues?
[16:13:31] I think freenode has been suffering from DDoSes lately
[16:14:12] someone was thinking for us to give them a few labs instances to set up servers but I didn't like much that idea
[16:14:12] yep (as usual)
[16:16:35] yet another of the many woes that could be solved if people would implement: http://tools.ietf.org/html/rfc3013#section-4.3
[16:16:58] aka BCP 38
[16:17:13] I mean, you could still DDoS from actual bot armies, but cutting out forged reflections would be really nice for the internet
[16:21:30] what's the biggest organization that benefits from Freenode?
[16:21:40] odder: FSF?
[16:21:45] biggest in what sense?
[16:21:45] I think Canonical and Ubuntu have a few channels here?
[16:21:46] :P
[16:21:56] budget? employees?
[16:21:57] apergos: number of channels and users
[16:22:29] number of channels could be us, one per project per language (how many are actually registered, I dunno)
[16:22:31] New patchset: Cmjohnson; "decommissioning and removing spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72972
[16:22:47] http://freenode.net/acknowledgements.shtml we're not listed here
[16:31:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[16:34:13] no. it would be nice to figure out how we can help them without simply becoming a target
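Going back to the gdnsd thread: the "just cat the two files together" plan paravoid floats at 16:11 could be glued into puppet with something like the following. A sketch only; the resource names and /etc/gdnsd paths beyond the two file names mentioned in the log are invented.

```puppet
# Rebuild gdnsd.conf whenever the puppet-managed head changes;
# gdnsd-lb.conf would come from the separate dns git.
exec { 'assemble-gdnsd-conf':
    command     => '/bin/sh -c "cat /etc/gdnsd/gdnsd-head.conf /etc/gdnsd/gdnsd-lb.conf > /etc/gdnsd/gdnsd.conf"',
    refreshonly => true,
    subscribe   => File['/etc/gdnsd/gdnsd-head.conf'],
    notify      => Service['gdnsd'],
}
```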
[16:34:19] !log upgrading ruby and other packages on sockpuppet and stafford
[16:34:28] Logged the message, Master
[16:37:16] I'd say Canonical aside from us.
[16:43:23] New patchset: GWicke; "Increase Parsoid backend timeout to 5 minutes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72681
[16:43:39] mark: ^^
[16:44:24] you'll need to do frontend too I think?
[16:44:35] yes
[16:44:44] otherwise the client will close the connection and varnish still won't cache it
[16:44:57] ah, I was wondering about that case
[16:45:09] is there a way to force the backend to continue even if the client aborted?
[16:45:41] the VE currently has a timeout of 100s, so for some pathological pages it will time out before the backend is done
[16:45:42] New review: Mark Bergsma; "The frontend needs a higher timeout than the backend, as otherwise it will close the connection even..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/72681
[16:46:51] no that won't work
[16:48:38] mark: so a client disconnect from the frontend will always propagate to the backend?
[16:48:50] yes
[16:49:13] New patchset: GWicke; "Increase Parsoid backend timeout to 5 minutes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72681
[16:49:53] mark: well, at least we'll get to 100 seconds then
[16:50:47] New review: Mark Bergsma; "I think we'll regret this." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/72681
[16:50:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72681
[16:51:14] note that I restarted one of the two varnish caches today
[16:51:18] and it came up with an empty cache
[16:51:36] that is odd
[16:52:15] do you have suggestions on how to improve timeouts & varnish setup in general?
[16:53:00] Roan used ban to purge yesterday, not sure if that is related
[16:54:13] no it's a bug in the persistent storage stuff of varnish
[16:54:17] it's very experimental
[16:54:40] PROBLEM - DPKG on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:55:25] yoo ori-l
[16:55:27] you there?
[16:55:30] RECOVERY - DPKG on mc15 is OK: All packages OK
[16:57:02] mark: will the cache currently be wiped on restart in general?
[16:57:21] it's not supposed to, but because of the bug, very likely
[16:57:37] the majority of varnish processes I restarted today had one or two storage files not come back
[16:58:02] grr.. at least the traffic we are seeing so far is low enough that this does not matter too much
[16:58:18] paravoid: can you check again... i got everything and will remove certs next https://gerrit.wikimedia.org/r/72972
[16:58:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:59:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.160 second response time
[16:59:51] topic gitisapain, hahaha
[17:00:38] yeah... got that from robh
[17:01:14] New review: Faidon; "It does what it's supposed to do, +1 for that." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/72972
[17:03:40] New patchset: Matthias Mullie; "Use https in oversight request emails" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72980
[17:06:12] are you fucking kidding me
[17:06:28] varnish includes the mmap address in the storage signature
[17:06:38] whaa?!
[17:06:45] tries to mmap at the same address next time
[17:06:49] doesn't even check if that's the case
[17:06:55] and then fails on the signature check
[17:07:11] New review: Dzahn; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72972
[17:07:15] how does that even work!?
[17:07:35] presumably linux usually mmaps at the same address when given the choice the first time (target NULL)
[17:07:49] did you just answer "accidentally"?
[17:07:53] and can usually honour that next time, except when it an't? :)
[17:07:56] can't
[17:07:59] i think so
[17:08:04] i have to verify this, which is a bit annoying
[17:08:13] but from reading the code, that's what I get
[17:08:34] that's the signature check part that's failing, the mmapped address
[17:08:55] and it does really fish for that signature mmap address before doing the big mmap on the storage file
[17:09:20] fish how?
[17:09:32] /* Try to determine correct mmap address */
[17:09:32] i = read(sc->fd, &sgn, sizeof sgn);
[17:09:32] assert(i == sizeof sgn);
[17:09:32] if (!strcmp(sgn.ident, "SILO"))
[17:09:32] target = (void*)(uintptr_t)sgn.mapped;
[17:09:33] else
[17:09:35] target = NULL;
[17:09:42] and then
[17:09:43] sc->base = mmap(target, sc->mediasize, PROT_READ|PROT_WRITE,
[17:09:44] MAP_NOCORE | MAP_NOSYNC | MAP_SHARED, sc->fd, 0);
[17:09:45] jesus
[17:09:58] and then it creates a signature context from the signature at offset 0
[17:10:07] (offset 0 from sc->base I mean)
[17:10:23] and then if that signature's mmap doesn't match what it read IN the mmap'ed file, the signature check fails
[17:10:27] and it doesn't load the storage file
[17:10:33] i believe that's what's happening anyway
[17:10:53] so any random parameter change can cause this to suddenly trigger across your clusters ;)
[17:12:04] it's not so bad if it doesn't actually rely on that target address being correct
[17:12:46] wait
[17:12:47] which I haven't verified yet
[17:12:48] linux has ASLR
[17:12:57] that can't be, it'd always fail
[17:13:02] no no
[17:13:12] if you ASK mmap to map at that address
[17:13:19] then it can often honor that
[17:13:21] except if it can't
[17:13:30] so the first time, linux chooses, target == NULL
[17:13:36] ah, right
[17:13:38] next time, it reads the mmap address from the file, target is X
[17:13:40] the address *is* random
[17:13:43] it's just stored
[17:13:44] yes
[17:14:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72936
[17:14:49] i'll test this tomorrow
[17:15:16] so it's possible varnish *relies* on the address being still the same
[17:15:26] because every object ptr in there needs to live at the same address
[17:15:38] right
[17:15:45] this can't be accidental :)
[17:15:58] i don't know why you'd do this if not
[17:16:07] not just to check if a signature is correct anyway
[17:16:34] it's still baffling why this hasn't been a problem before
[17:16:40] well
[17:17:03] the larger your storage size, the more chance of hitting it
[17:17:08] New patchset: MaxSem; "Murderdeathkill bits special casing for mobile" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72983
[17:17:32] you mean the more likely is the kernel can't satisfy your mmap request?
[17:17:37] New review: MaxSem; "Next week." [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/72983
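The behaviour mark pieces together from the pasted code above comes down to mmap's address-hint semantics: without MAP_FIXED, the stored address is only a request the kernel may decline. A condensed C sketch of the failure mode (not Varnish's actual code; the function and names are made up):

```c
/* Illustration of the failure mode discussed above, not Varnish code.
 * The first run maps wherever the kernel likes and records that address
 * in the file; later runs pass it back as a hint, and if the kernel
 * can't honour it, the stored signature (which embeds the old address)
 * no longer matches and the silo is rejected. */
#include <sys/mman.h>
#include <stdint.h>

void *remap_silo(int fd, size_t size, uintptr_t stored_addr)
{
    void *base = mmap((void *)stored_addr, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);      /* a hint, not MAP_FIXED */
    if (base == MAP_FAILED)
        return NULL;
    if ((uintptr_t)base != stored_addr) {
        /* every pointer persisted inside the silo now dangles, so the
         * signature check fails and the cache comes up empty */
        munmap(base, size);
        return NULL;
    }
    return base;
}
```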
[17:19:48] PROBLEM - Memcached on mc15 is CRITICAL: Connection timed out
[17:20:02] !log db78 switched to frack puppetmaster
[17:20:11] Logged the message, Master
[17:20:12] !log grosley switched to frack puppetmaster
[17:20:21] Logged the message, Master
[17:20:38] RECOVERY - Memcached on mc15 is OK: TCP OK - 0.027 second response time on port 11211
[17:20:43] !log dist-upgrade and reboot aluminium, db78
[17:20:53] Logged the message, Master
[17:25:45] mark: that sounds awful
[17:26:15] yes
[17:26:23] hopefully they don't store addresses in the file
[17:26:28] they do
[17:26:36] ouch
[17:26:38] and to cheer you up, you just based your storage architecture on this code
[17:26:48] just to save the conversion between offsets and addresses?
[17:27:02] yup
[17:27:14] because it mmaps into memory and then references directly everywhere else in varnish
[17:27:15] that seems like a really dumb thing to do
[17:27:31] read https://www.varnish-cache.org/trac/wiki/ArchitectNotes just for fun
[17:27:55] should be easy to fix though
[17:28:05] mutant have all the services from spence been moved?
[17:28:07] not at all
[17:28:19] mutante ^
[17:28:23] well
[17:28:33] i guess in this case it is easy to fix
[17:28:33] instead of directly writing / reading a pointer, store an offset and add the base to it when reading
[17:28:43] yeah but it doesn't work like that
[17:28:53] it doesn't "read"
[17:29:00] well, dereference
[17:29:06] it CAN be fixed here, on reading the file yes
[17:29:24] but they're shooting themselves in the foot in multiple ways with this
[17:29:30] because they also can't relocate objects within the file
[17:29:58] that is true for offsets too
[17:30:17] offsets would be relative to the segment they're in
[17:30:18] you'll have to update everything that points to it, using direct pointer or offset
[17:30:23] and segments could then be relocated
[17:30:34] yeah, that could help a bit
[17:30:41] they do use segments
[17:30:44] they just don't help a lot ;)
[17:31:04] cmjohnson1: just been talking about that and made a last check, we can't find anything that would still be needed on it. ishmael stuff has been moved, and last thing was stat.wp which is now on stat1001
[17:31:26] mark: good thing we don't plan to use Varnish for this in the longer term
[17:31:47] if your "short term" is potentially 2 years as you stated, that's still problematic enough ;)
[17:32:10] cmjohnson1: go ahead, spence is already killed from DNS, so it can't really break things that worked before
[17:32:12] nah, I think that we'll get HTML storage this fiscal year
[17:32:23] HTML-only wikis otoh might take longer
[17:32:25] anyway
[17:32:28] mutante: okay
[17:32:35] that's good, but then we still have all our other varnish clusters ;)
[17:32:47] every single one of them is hitting this atm
[17:32:50] except bits
[17:32:51] New review: Dzahn; "made a last check. everything should have been moved. spence was out of wikimedia DNS already" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/72972
[17:32:52] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72972
[17:33:35] thx mutante
[17:33:40] mark: technically all it should take is to replace all direct pointer stores / dereferences with a macro that converts between offset and pointer
[17:34:00] granted, that might be tedious as there are probably many of those
[17:34:05] but not very hard
[17:34:10] mark: weren't you contemplating writing your own storage engine? :)
[17:34:14] i was
[17:34:22] varnish could also fix up all the pointers when reading the segments
[17:34:25] it would take a bit longer on startup
[17:34:34] every object needs to be touched...
[17:34:57] mark: that would be more of a hack
[17:35:03] New patchset: Ottomata; "Puppetizing hive client, server and metastore." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/71569
[17:35:05] you'd also know where all pointers live
[17:35:15] if you always store offsets you don't have that problem
[17:35:30] and adding an offset should be free these days
[17:38:34] New review: Dzahn; "git grep spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72972
[17:38:39] yeah, maybe that works, maybe that doesn't
[17:38:45] but that's not a very practical solution at this point ;)
[17:40:07] the practical solution is probably "don't restart varnish" ;(
[17:40:21] no that's not very practical either ;)
[17:40:25] anyway
[17:40:27] i've had enough for today
[17:41:58] I can imagine
[17:46:36] !log DNS update - killing spence
[17:46:46] Logged the message, Master
[17:47:22] New patchset: Ottomata; "Ensuring /var/lib/hadoop exists in labs kraken cdh4 puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72989
[17:47:29] New patchset: Ottomata; "Ensuring /var/lib/hadoop exists in labs kraken cdh4 puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72989
[17:47:46] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72989
[17:47:57] RIP spence - Host spence not found: 3(NXDOMAIN)
[17:48:55] New patchset: Andrew Bogott; "rake wrapper to run puppet module tests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72721
[17:50:20] PROBLEM - NTP on williams is CRITICAL: NTP CRITICAL: No response from NTP server
[17:50:21] PROBLEM - NTP on mw1123 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:10] PROBLEM - NTP on sq45 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:21] PROBLEM - NTP on es1005 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:30] PROBLEM - NTP on mw1127 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:31] PROBLEM - NTP on mw1065 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:31] PROBLEM - NTP on db1019 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:31] mutante, can you review https://gerrit.wikimedia.org/r/72734 please?
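The offset-based fix argued for above (and seconded by bblack a bit further down) is mechanically simple, even if touching every dereference in Varnish would be tedious. A C sketch of the macro pair, not actual Varnish code:

```c
/* Sketch of the store-offsets-not-pointers idea: persist everything as
 * an offset from the silo base and convert at the point of use, so the
 * silo survives being mapped at any address. */
#include <stdint.h>

typedef uint64_t silo_off_t;   /* offset relative to the silo base */

#define PTR2OFF(base, ptr) ((silo_off_t)((char *)(ptr) - (char *)(base)))
#define OFF2PTR(base, off) ((void *)((char *)(base) + (off)))

/* e.g. instead of persisting obj->next as a raw pointer:
 *     hdr->next_off = PTR2OFF(silo_base, obj->next);
 * and on access:
 *     struct object *next = OFF2PTR(silo_base, hdr->next_off);
 * "adding an offset should be free these days", as the log says. */
```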
[17:51:31] PROBLEM - NTP on mw1095 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:31] PROBLEM - NTP on mw1052 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:31] PROBLEM - NTP on db53 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:40] PROBLEM - NTP on mw1044 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:41] PROBLEM - NTP on mw1042 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:41] PROBLEM - NTP on es1002 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:41] PROBLEM - NTP on ms5 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:50] PROBLEM - NTP on mw1114 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:52:02] mutante: spence ntp service issue
[17:52:21] PROBLEM - NTP on mw1111 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:52:31] PROBLEM - NTP on mw1126 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:52:43] cmjohnson1: hrmm, is it.. hrmm
[17:53:00] PROBLEM - NTP on vanadium is CRITICAL: NTP CRITICAL: No response from NTP server
[17:53:30] PROBLEM - NTP on mw1147 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:53:40] PROBLEM - NTP on ms-fe3 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:53:47] poor spence
[17:54:00] PROBLEM - NTP on mw1081 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:03] did we setup an ntp server elsewhere as part of replacing spence?
[17:54:31] PROBLEM - NTP on mw1038 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:31] PROBLEM - NTP on mw1115 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:31] PROBLEM - NTP on mw1089 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:33] fwiw, I'm with gwicke on the above. the code should use offsets into whatever's mmapped. it's probably a large ugly patch at this point, though.
[17:54:40] PROBLEM - NTP on mw1057 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:40] PROBLEM - NTP on es6 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:40] PROBLEM - NTP on sq42 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:41] PROBLEM - NTP on stat1 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:41] PROBLEM - NTP on mw1056 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:54:41] PROBLEM - NTP on mw1140 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:55:21] PROBLEM - NTP on mw1032 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:55:21] PROBLEM - NTP on db39 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:55:31] PROBLEM - NTP on mw1106 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:55:31] PROBLEM - NTP on mw1087 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:55:40] PROBLEM - NTP on db34 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:56:21] PROBLEM - NTP on db1022 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:56:50] PROBLEM - NTP on mw1046 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:57:03] !log reverting removing spence, NTP server issue
[17:57:14] Logged the message, Master
[17:57:21] PROBLEM - NTP on mw1122 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:57:21] PROBLEM - NTP on mw1033 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:57:21] PROBLEM - NTP on fluorine is CRITICAL: NTP CRITICAL: No response from NTP server
[17:57:21] PROBLEM - NTP on mw1043 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:57:41] PROBLEM - NTP on mw1024 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:58:21] PROBLEM - NTP on mw1129 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:58:31] PROBLEM - NTP on mw1107 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:58:31] PROBLEM - NTP on mw1118 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:58:41] PROBLEM - NTP on mw1025 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:59:21] PROBLEM - NTP on mw1066 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:59:21] PROBLEM - NTP on mw1054 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:00:21] PROBLEM - NTP on mw1021 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:00:21] PROBLEM - NTP on mw1084 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:00:21] cmjohnson1: we need the change in ntp-server.erb reverted too
[18:00:22] on it
[18:00:27] meh, the netsplit doesnt help
[18:00:40] PROBLEM - NTP on mw1047 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:00:41] PROBLEM - NTP on mw1131 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:01:00] PROBLEM - NTP on es9 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:01:09] New patchset: Petr Onderka; "saving page metadata" [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/72993
[18:01:11] PROBLEM - NTP on mw1105 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:01:20] PROBLEM - NTP on mw1049 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:01:34] Change merged: Petr Onderka; [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/72993
[18:01:42] PROBLEM - NTP on mw1128 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:01:42] PROBLEM - NTP on sanger is CRITICAL: NTP CRITICAL: No response from NTP server
[18:04:44] New patchset: Dzahn; "revert removing spence from ntp-server.erb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72994
[18:05:50] PROBLEM - NTP on mw1022 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:21] PROBLEM - NTP on mw1134 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:21] PROBLEM - NTP on mw1139 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:38] New patchset: Andrew Bogott; "Remove all .fixtures.yml files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72996
[18:06:40] PROBLEM - NTP on db31 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:40] PROBLEM - NTP on db60 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:40] PROBLEM - NTP on mw1080 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:06:41] PROBLEM - NTP on mw1093 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72994
[18:07:21] PROBLEM - NTP on es1008 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:31] PROBLEM - NTP on mw1121 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:40] PROBLEM - NTP on mw1090 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:40] PROBLEM - NTP on mw1145 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:40] PROBLEM - NTP on mw1112 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:07:53] binasher, what's the eta for archive?
[18:07:53] mutante: we yanked the certs from spence
[18:08:04] Cyberpower678: what?
[18:08:10] PROBLEM - NTP on mw1059 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:08:10] PROBLEM - NTP on mw1026 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:08:21] PROBLEM - NTP on mw1141 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:08:27] binasher, when will the archive table be added to replication?
[18:08:33] New patchset: Ottomata; "Allowing configuration of namenode_hostname via $:hadoop_namenode global variable in labs role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72997
[18:08:51] New patchset: Ottomata; "Allowing configuration of namenode_hostname via $:hadoop_namenode global variable in labs role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72997
[18:09:11] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72997
[18:09:12] !log updated Parsoid to 8ea8ab0
[18:09:22] Logged the message, Master
[18:09:29] Cyberpower678: sorry, don't know
[18:09:37] I was told to ask you.
[18:09:46] Because you manage the labs dbs.
[18:09:59] Cyberpower678: i think you want Coren
[18:10:28] binasher: Actually, he's more interested in when bz 49189 will be done. :-)
[18:10:45] cmjohnson1: RobH was going to resign it, we need to also re-add to site.pp
[18:11:36] Coren: probably next month
[18:12:23] Cyberpower678: ^^
[18:12:32] Cyberpower678: So that means archive "after Wikimania"
[18:12:45] At the earliest.
[18:12:47] Robh: https://gerrit.wikimedia.org/r/#/c/72994/1/modules/ntp/templates/ntp-server.erb
[18:12:51] Coren, next month we'll archive? YES!
[18:12:57] Robh: i already reverted that, but that totally looks like it, right
[18:13:05] and i agree, dobson should be NTP
[18:14:46] mutante: you didn't fix site.pp yet did you?
[18:15:16] i havent resigned spence
[18:15:38] nad it went from no responnce to offset unknown
[18:15:43] so somethings goin on
[18:15:48] wow, typos.
[18:15:49] that's a good thing
[18:16:01] offset unknown is like .. ehm.. cough.. normal..for a little while
[18:16:10] it should fix itself
[18:16:26] so i think we are ok to kill and move on.
[18:16:32] New patchset: Ottomata; "Ensuring datanode_mounts are created in labs hadoop puppetization." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72999
[18:16:47] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72999
[18:16:52] what about this: https://gerrit.wikimedia.org/r/#/c/72972/2/manifests/misc/icinga.pp
[18:17:19] $nagios_config_dir = '/etc/nagios'
[18:17:22] shouldn't that come out?
[18:17:35] nah, neon has it too
[18:17:51] root@neon:/etc/nagios# ls
[18:17:52] nrpe.cfg nrpe.d
[18:18:09] yea but is that from the migration or for daily use?
[18:18:11] doesn't matter if it's there, just askin.
[18:18:12] it's used
[18:18:16] bleh.
[18:18:19] New patchset: Ottomata; "Fixing ensure" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73000
[18:18:24] ah i need to slow down a bit! typos!
[18:18:49] MaxSem: i'll get to that Apache change as soon as this spence issue is sorted out
[18:19:05] thanks
[18:19:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73000
[18:19:56] cmjohnson1: can you put it back in site.pp for now
[18:20:07] Robh: yes, it's running ntpd /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 104:111
[18:22:11] New patchset: Cmjohnson; "adding spence back to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73002
[18:24:15] !log nikerabbit synchronized php-1.22wmf9/extensions/UniversalLanguageSelector/ 'ULS perf fix'
[18:24:23] cmjohnson1: Robh https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=NTP
[18:24:25] Logged the message, Master
[18:24:39] a bit random?
[18:25:38] New review: GWicke; "It did fix the problem for http://en.wikipedia.org/wiki/List_of_Advanced_Dungeons_%26_Dragons_2nd_ed..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72681
[18:26:00] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73002
[18:26:02] command_line $USER1$/check_ntp_time -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$
[18:26:27] New patchset: Ottomata; "Don't want to ensure that mounts exist." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73003
[18:27:21] RECOVERY - NTP on es1008 is OK: NTP OK: Offset -0.002594947815 secs
[18:27:21] RECOVERY - NTP on mw1121 is OK: NTP OK: Offset -0.001339793205 secs
[18:27:26] mutante: back in site.pp and yes very random
[18:27:28] there
[18:27:30] RECOVERY - NTP on mw1145 is OK: NTP OK: Offset -0.005273580551 secs
[18:27:37] cmjohnson1: so, we didn't have to, it seems
[18:27:40] RECOVERY - NTP on mw1112 is OK: NTP OK: Offset -0.00389111042 secs
[18:27:43] just the one i reverted first
[18:27:52] mutante: that's fubar dude
[18:28:10] cmjohnson1: so adding back to the file for ntp allowed query servers fixed the offset error
[18:28:10] RECOVERY - NTP on mw1026 is OK: NTP OK: Offset -0.0005013942719 secs
[18:28:10] RECOVERY - NTP on mw1135 is OK: NTP OK: Offset 4.255771637e-05 secs
[18:28:11] but
[18:28:14] it shouldn't
[18:28:19] it shouldn't have anything to do with it.
[18:28:20] that makes no sense
[18:28:20] RECOVERY - NTP on mw1148 is OK: NTP OK: Offset 2.205371857e-05 secs
[18:28:20] RECOVERY - NTP on mw1141 is OK: NTP OK: Offset -0.00254368782 secs
[18:28:20] RECOVERY - NTP on mw1082 is OK: NTP OK: Offset -0.006798863411 secs
[18:28:25] New patchset: Ottomata; "Don't want to ensure that mounts exist." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73003
[18:28:25] indeed, it's fucking nuts.
[18:28:30] RECOVERY - NTP on mw1110 is OK: NTP OK: Offset -0.001755833626 secs
[18:28:31] RECOVERY - NTP on mw1090 is OK: NTP OK: Offset -0.00218307972 secs
[18:28:31] RECOVERY - NTP on sq49 is OK: NTP OK: Offset -0.001191616058 secs
[18:28:33] thereby, it's annoying the hell out of me.
[18:28:36] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73003
[18:28:40] RECOVERY - NTP on mw1073 is OK: NTP OK: Offset -0.004191994667 secs
[18:28:47] !log nikerabbit synchronized php-1.22wmf8/extensions/UniversalLanguageSelector/ 'ULS perf fix'
[18:28:49] heh, well https://gerrit.wikimedia.org/r/#/c/72994/1/modules/ntp/templates/ntp-server.erb
[18:28:56] Logged the message, Master
[18:29:00] RECOVERY - NTP on mw1108 is OK: NTP OK: Offset -0.0005559921265 secs
[18:29:20] RECOVERY - NTP on mw1102 is OK: NTP OK: Offset -0.002832889557 secs
[18:29:31] RECOVERY - NTP on mw1119 is OK: NTP OK: Offset -0.001158237457 secs
[18:29:31] RECOVERY - NTP on sq36 is OK: NTP OK: Offset -0.001199364662 secs
[18:29:31] RECOVERY - NTP on mw1019 is OK: NTP OK: Offset -0.0005331039429 secs
[18:29:59] so i yanked out spence manually on mw1059
[18:30:09] going to see if it goes back into ntp error
[18:30:20] RECOVERY - NTP on mw1039 is OK: NTP OK: Offset -0.001277327538 secs
[18:30:30] RECOVERY - NTP on ms-fe4 is OK: NTP OK: Offset -0.0003541707993 secs
[18:31:20] RECOVERY - NTP on nfs2 is OK: NTP OK: Offset -0.01575374603 secs
[18:31:30] RECOVERY - NTP on hooper is OK: NTP OK: Offset -0.003059267998 secs
[18:31:31] RoanKattouw: reminds me of http://knowyourmeme.com/memes/you-cant-cut-back-on-funding-you-will-regret-this
[18:32:20] RECOVERY - NTP on mw1036 is OK: NTP OK: Offset -0.004727602005 secs
[18:32:30] RECOVERY - NTP on mw1076 is OK: NTP OK: Offset -1.668930054e-05 secs
[18:32:30] RECOVERY - NTP on db60 is OK: NTP OK: Offset 0.0003219842911 secs
[18:32:40] RECOVERY - NTP on mw1079 is OK: NTP OK: Offset -0.0004614591599 secs
[18:32:40] RECOVERY - NTP on mw1109 is OK: NTP OK: Offset -0.001217603683 secs
[18:33:40] RECOVERY - NTP on mw1133 is OK: NTP OK: Offset -0.002179145813 secs
[18:33:40] RECOVERY - NTP on nfs1 is OK: NTP OK: Offset 0.0003098249435 secs
[18:33:47] I only played sim city once. I remember I funded everything reasonably except for one thing: the cops....
[18:33:55] game rules are skewed :-P
[18:34:00] RECOVERY - NTP on mw1098 is OK: NTP OK: Offset 0.000586271286 secs
[18:34:20] RECOVERY - NTP on mw1034 is OK: NTP OK: Offset 0.001031041145 secs
[18:34:23] apergos had high crime rates in his sim city
[18:34:30] RECOVERY - NTP on mw1077 is OK: NTP OK: Offset 6.890296936e-05 secs
[18:34:30] RECOVERY - NTP on db49 is OK: NTP OK: Offset 0.001464128494 secs
[18:34:37] I did
[18:35:14] ottomata: 6 or later?
[18:35:20] RECOVERY - NTP on ms-fe1 is OK: NTP OK: Offset 0.0007516145706 secs
[18:35:22] speaking of nothing, what does it look like for ms-be5 today?
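The check_ntp_time command line quoted earlier is a standard Nagios plugin; the "No response from NTP server" criticals and the Offset-based recoveries above are its exit states seen from the bot. As a rough illustration of what it measures — a sketch only, using the third-party ntplib package, with placeholder thresholds rather than the production -w/-c values:

```python
#!/usr/bin/env python
# Rough equivalent of "check_ntp_time -H <host> -w <warn> -c <crit>" using
# the third-party ntplib package (pip install ntplib). The thresholds and
# the default host below are placeholders, not the production values.
import sys

import ntplib

WARN_SECS = 0.5   # would come from -w in the real plugin
CRIT_SECS = 1.0   # would come from -c in the real plugin

def check_offset(host):
    try:
        stats = ntplib.NTPClient().request(host, version=3, timeout=10)
    except (ntplib.NTPException, OSError) as exc:
        # Matches the "No response from NTP server" case in the log above.
        return 2, "NTP CRITICAL: No response from NTP server (%s)" % exc
    offset = stats.offset  # local clock minus server clock, in seconds
    if abs(offset) >= CRIT_SECS:
        return 2, "NTP CRITICAL: Offset %.10f secs" % offset
    if abs(offset) >= WARN_SECS:
        return 1, "NTP WARNING: Offset %.10f secs" % offset
    return 0, "NTP OK: Offset %.10f secs" % offset

if __name__ == "__main__":
    code, message = check_offset(sys.argv[1] if len(sys.argv) > 1 else "pool.ntp.org")
    print(message)
    sys.exit(code)  # 0/1/2 = OK/WARNING/CRITICAL, as Nagios expects
```

This also shows why restricting which servers may query the NTP daemon can flip a whole fleet of checks at once: the plugin only reports what the server will answer.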
[18:35:30] RECOVERY - NTP on locke is OK: NTP OK: Offset -0.001107811928 secs
[18:35:39] cmjohnson1:
[18:35:40] RECOVERY - NTP on mw1028 is OK: NTP OK: Offset 0.0008904933929 secs
[18:36:10] RECOVERY - NTP on mw1097 is OK: NTP OK: Offset 0.0009825229645 secs
[18:36:10] RECOVERY - NTP on mw1029 is OK: NTP OK: Offset 0.0009075403214 secs
[18:36:20] RECOVERY - NTP on mw1023 is OK: NTP OK: Offset -0.0002377033234 secs
[18:36:20] RECOVERY - NTP on es1001 is OK: NTP OK: Offset 0.001837611198 secs
[18:36:20] RECOVERY - NTP on mw1067 is OK: NTP OK: Offset 0.0001199245453 secs
[18:36:56] apergos: so ms-be5 is still lingering… steve didn't get to it yesterday
[18:37:04] ok
[18:37:13] RECOVERY - NTP on mw1113 is OK: NTP OK: Offset 0.0004721879959 secs
[18:37:19] should I be checking on it tomorrow morning you think?
[18:37:23] RECOVERY - NTP on mw1037 is OK: NTP OK: Offset -0.0004260540009 secs
[18:37:43] RECOVERY - NTP on mw1064 is OK: NTP OK: Offset 0.0006532669067 secs
[18:38:09] no, he will not be there today. I expect it to be finished tomorrow during my afternoon
[18:38:13] RECOVERY - NTP on mw1088 is OK: NTP OK: Offset -0.0001447200775 secs
[18:38:23] RECOVERY - NTP on mw1018 is OK: NTP OK: Offset -0.001129865646 secs
[18:38:23] RECOVERY - NTP on mw1100 is OK: NTP OK: Offset 0.001982688904 secs
[18:38:33] RECOVERY - NTP on mw1117 is OK: NTP OK: Offset 0.0005354881287 secs
[18:38:34] RECOVERY - NTP on mw1099 is OK: NTP OK: Offset 7.677078247e-05 secs
[18:38:40] ok
[18:39:09] thanks for the info
[18:39:41] is ms-be5 down?
[18:40:34] RECOVERY - NTP on mw1103 is OK: NTP OK: Offset -0.0004615783691 secs
[18:40:53] paravoid https://rt.wikimedia.org/Ticket/Display.html?id=5428
[18:41:48] New review: Faidon; "Really nice code." [operations/puppet/cdh4] (master) C: 2; - https://gerrit.wikimedia.org/r/71569
[18:42:57] sigh
[18:43:36] paravoid: we are going to use this as an opportunity to replace the controller, swap the ssds and fix the drac
[18:43:50] makes sense
[18:44:15] paravoid: no it's not down
[18:44:56] and why sigh?
[18:45:12] and now I see the ticket was linked so I didn't have to answer that :-/
[18:45:41] sigh for the amount of Dell breakage we get
[18:45:47] esp. on ms-fe boxes
[18:45:54] or be?
[18:46:01] er, yes, ms-be I meant
[18:46:24] the disks really do seem to have a higher death rate than I'm used to
[18:46:34] yeah and eqiad was crap too
[18:46:42] cmjohnson1 was replacing disks like crazy
[18:47:02] a bit late for you isn't it
[18:47:04] I can't imagine it's just bad batches
[18:47:15] yeah I've been on most of the day, I was just popping in to see how things were
[18:47:20] paravoid: is there any reason for that?
[18:47:21] are thos SSD's?
[18:47:21] *e
[18:47:34] no, they're regular sata drives
[18:47:48] I don't think it is bad batches… the analytics boxes have not experienced any failures
[18:47:56] vendor decided to lower their QA standards?
[18:48:00] suspicious isn't it?
[18:48:25] probably high data transfer rates causing disks to fail IMO
[18:48:27] these were replacements, maybe they did give us a bad batch they had spare :)
[18:48:28] which drive are those? samsung?
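For the death-rate question above, one low-tech way to sweep a box for dying drives is smartctl from smartmontools. A minimal sketch, assuming directly attached /dev/sdX devices — drives hidden behind a PERC controller would instead need smartctl's "-d megaraid,N" addressing — and not the actual tooling used on the ms-be boxes:

```python
#!/usr/bin/env python
# Sketch: sweep local drives with smartctl (from smartmontools, run as root)
# and flag any whose SMART health check fails. The /dev/sd? glob assumes
# directly attached disks; behind a RAID controller, per-drive addressing
# ("-d megaraid,N") would be needed instead.
import glob
import subprocess

def smart_failing(dev):
    result = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True)
    # smartctl's exit status is a bitmask; bit 3 (value 8) is set when the
    # health check comes back "DISK FAILING".
    return bool(result.returncode & 0x08)

if __name__ == "__main__":
    for dev in sorted(glob.glob("/dev/sd?")):
        print("%s: %s" % (dev, "FAILING" if smart_failing(dev) else "ok"))
```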
we had western digitals but I don't remember what these are now
[18:49:22] I remember some vendor had a recent firmware issue, but can't recall which
[18:49:41] that firmware caused early drive death
[18:49:48] matanya… the Seagates have been failing
[18:49:55] toshiba has been the replacements
[18:49:59] correct, thanks
[18:50:19] New patchset: Springle; "require percona toolkit on deployment servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73007
[18:50:24] the wds were crap
[18:50:34] ^ agree
[18:50:35] * apergos looks: yeah it's logged. well, they were
[18:51:43] RECOVERY - NTP on ms-fe2 is OK: NTP OK: Offset -0.002739667892 secs
[18:52:01] oh wow, so google "seagate hard drives high rate of failure" and there is quite a bit
[18:52:06] I suffer from memory multibit error lately, not drives death
[18:52:07] * apergos is gonna watch some news and then afk for the evening. back balcony/garden for some reading in the cool evening air
[18:52:53] We went through a run of multi-bit errors on the R410s but it has quieted down lately
[18:53:13] I suffer it a lot on the R710s
[18:53:13] RECOVERY - NTP on brewster is OK: NTP OK: Offset -0.0008366107941 secs
[18:53:22] the 720s are better
[18:53:46] these are R720xd fwiw
[18:53:56] good deal
[18:54:22] I can check here and see if i had such issues lately
[18:54:42] what are the models you on that batch?
[18:54:57] *you have in
[18:57:03] Change merged: Springle; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73007
[18:58:43] the old controllers are h310s and the new ones (I don't know if they have as many problems once the new controllers are in) are h710s .. I think
[18:58:55] that's correct
[18:59:05] to clarify, h310->h710
[18:59:19] the trailing s is for plural, not part of the model :)
[18:59:19] which firmware?
[18:59:25] (there is an h710p)
[18:59:55] * matanya is looking in his inventory
[19:00:00] matanya… the dimm mod 00AD00B380AD
[19:00:12] from what i can tell it is used in the 710 as well
[19:00:15] Subsystem: Dell PERC H710 Mini
[19:00:17] these say
[19:00:22] (lspci output)
[19:01:07] is it the integrated one? or the external one?
[19:01:23] these are external cards
[19:01:48] FW Package Build: 21.0.2-0001
[19:01:58] FW Version : 3.130.05-1587
[19:02:22] and if the bios version is useful too it's
[19:02:23] BIOS Version : 5.30.00_4.12.05.00_0x05110000
[19:02:38] mini is the embedded one iirc
[19:02:43] the small card
[19:03:52] what is an SN for example?
[19:04:01] one would be enough
[19:04:12] lemme pm you one
[19:04:18] (of the server of course)
[19:04:21] I dunno why I'm not sure it should be logged but
[19:04:22] sure
[19:11:29] any luck?
[19:11:54] I see a fix for the controller
[19:12:05] quite important
[19:12:38] linky?
[19:12:45] - Corrected an issue where the controller would hang while performing IO on a degraded VD.
[19:13:13] https://www.dell.com/support/drivers/us/en/555/DriverDetails/Product/poweredge-r720xd?driverId=C1VYX&osCode=LNUX&fileId=3197089357&languageCode=EN&categoryId=SF#
[19:13:20] ah one more thing, these (in tampa) are all set up as either jbod or (h710) raid 0 for each disk
[19:13:27] looking
[19:13:31] some other enhancements too
[19:13:47] and if you use encryption there is a critical fix there
[19:15:07] HA! i knew it. URGENT fix
[19:15:13] for seagate drives
[19:15:34] oh?
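To answer "what are the models in that batch" across many machines, the model and firmware strings can be pulled with "smartctl -i" and compared against whatever models a vendor advisory names. A sketch under stated assumptions: the AFFECTED_MODELS set is a placeholder (the Dell advisory linked above would be the authoritative list), the device list is illustrative, and ST32000645SS is simply the Constellation ES.2 model read off the box in channel:

```python
#!/usr/bin/env python
# Sketch: collect drive model and firmware strings via "smartctl -i" and
# flag models named in a vendor advisory. Handles both the ATA output form
# ("Device Model" / "Firmware Version") and the SAS form ("Product" /
# "Revision"). AFFECTED_MODELS is a placeholder, not a confirmed list.
import re
import subprocess

AFFECTED_MODELS = {"ST32000645SS"}  # placeholder: models the advisory names

def drive_identity(dev):
    out = subprocess.run(["smartctl", "-i", dev],
                         capture_output=True, text=True).stdout
    model = re.search(r"(?:Device Model|Product):\s+(\S+)", out)
    firmware = re.search(r"(?:Firmware Version|Revision):\s+(\S+)", out)
    return (model.group(1) if model else "?",
            firmware.group(1) if firmware else "?")

if __name__ == "__main__":
    for dev in ("/dev/sda", "/dev/sdb"):  # illustrative device list
        model, firmware = drive_identity(dev)
        note = "check advisory" if model in AFFECTED_MODELS else "ok"
        print("%s  model=%s  firmware=%s  %s" % (dev, model, firmware, note))
```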
[19:16:04] I remembered applying a critical fix, now i know which
[19:16:23] see the SAS section
[19:16:34] has a firmware marked as urgent
[19:17:37] what are the drive models you said you have?
[19:18:09] the new ones coming in are toshibas, cmjohnson1 was saying
[19:18:31] and they are replacing seagates, don't know what model
[19:19:07] yeah but most are seagate
[19:19:07] barracuda
[19:19:11] so the seagates must have this fix, they die like hell if not applied
[19:19:33] * matanya is so happy his memory leak has slowed down
[19:19:38] sorry but which sas section? I didn't see it in the release notes
[19:19:54] no, i'll link it again
[19:20:04] https://www.dell.com/support/drivers/us/en/04/DriversHome/ShowProductSelector
[19:20:12] use this link and put your SN in
[19:20:14] oh.. wrong on that… they're constellation es.2 hdd
[19:20:14] ah
[19:20:36] then pick linux and scroll to the SAS drives
[19:21:07] cmjohnson1: didn't understand
[19:23:39] that was the seagate model
[19:24:22] oh, so the patch applies to them then?
[19:25:14] I guess we need the model numbers
[19:25:33] yes, seems so
[19:25:41] you should have it in hdparm
[19:26:39] yeah I wasn't on the box any more
[19:26:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:26:51] hdparm -i, i guess you know that :)
[19:26:54] I keep saying I'm just going to watch the news and then get going but there's soccer instead
[19:27:26] or you can say it is too interesting here...
[19:27:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[19:27:55] ST32000645SS at least on the box I'm on right now
[19:28:58] that si the constellation
[19:29:01] *is
[19:30:24] not on the list
[19:30:24] ok I'm going to get going, news or no news... thanks for looking into this
[19:30:41] thank you, have fun
[19:42:17] cmjohnson1: I show two deliveries, one may be SFP+
[19:42:44] i see that too.. i am going to head back there now and get it… i wasn't expecting anything
[19:42:56] brb
[19:51:45] robh: https://rt.wikimedia.org/Ticket/Display.html?id=5266 is what I received and the lc adapter
[19:52:12] not sure why I got parts for a 4550
[20:01:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:02:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[20:06:22] New review: Dzahn; "this would make it 301 Moved Permanently http://en.wikipedia.org/ because redirects.conf has " # Sen..." [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/72734
[20:08:52] weird, I couldn't connect to freenode most of the morning - SASL auth kept timing out.
[20:09:25] orenwolf: me too :) maybe a canada bell issue :D
[20:09:35] i used a different port to connect
[20:09:53] Well, I'm not on bell, but I am in Canada, so there might be something there :)
[20:10:09] i had the same issue from the office most of this morning
[20:11:01] Ah, well then.
[20:13:03] you know freenode had a DDoS most of the day?
[20:16:51] matanya: hard not to notice ;)
[20:17:14] orenwolf: binasher drdee: not an issue between you and freenode i think. services (e.g. NickServ) have been up and down and you probably can't SASL without a NickServ
[20:17:36] 10 19:59:44 [freenode] -mist(~mrmist@freenode/staff/mist)- [Global Notice] We're still working on getting services (nickserv, chanserv, alis, etc.) back up and running. Another global notice will be sent once we're happy that they are back properly.
[20:17:51] 10 20:01:48 [freenode] -mquin(~mquin@freenode/staff/mquin)- [Global Notice] Services are now back but may be lagged for a little while as everyone identifies. Thank you for your patience and for flying freenode!
[20:18:00] Cool, thanks. :)
[20:18:19] my fault for not being persistently in IRC! ;)
[20:20:02] how about them netsplits
[20:21:28] New patchset: Ottomata; "Removing IP based filters on oxygen, replacing with the X-Analytics zero= based filter." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73019
[20:21:35] hey ottomata. i just learned there was an encyclopedist named otto
[20:21:43] oh yeah?
[20:21:46] does that mean I get a server?!
[20:21:56] errr, you're behind the times
[20:22:08] * jeremyb too thouggh
[20:22:08] though*
[20:22:43] ottomata: https://rt.wikimedia.org/Ticket/Display.html?id=3912 https://rt.wikimedia.org/Ticket/Display.html?id=3911
[20:22:44] New patchset: MaxSem; "Rewrite rule for m.wikimediafoundation.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73020
[20:23:08] ottomata: if you read https://rt.wikimedia.org/Ticket/Display.html?id=3406 the wrong way it could be interpreted as server otto :-P
[20:23:26] (the subject at least)
[20:23:53] haha
[20:27:33] New patchset: Ottomata; "Removing IP based filters on oxygen, replacing with the X-Analytics zero= based filter." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73019
[20:28:02] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73019
[20:30:13] New review: Dzahn; "m.wikimediafoundation.org is an alias for m.wikimedia.org." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73020
[20:32:14] New patchset: Ottomata; "Rsyncing zero*.gz to relevant stat servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73024
[20:49:01] New review: MaxSem; "> m.wikimediafoundation.org is an alias for m.wikimedia.org." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73020
[20:51:30] New patchset: Yurik; "Renamed 405-0* to 405-25 for TATA India zero carrier" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73027
[20:51:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73020
[20:54:54] hi, analytics have requested zero ID rename, i would like to rename a corresponding META page in semi-sync with the merge. Could someone take a look please? :) https://gerrit.wikimedia.org/r/#/c/73027/
[20:55:10] http://codereview-proxy.wikimedia.org/ hrmmpff
[20:55:11] would like to get rid of it
[20:55:11] https://wikitech.wikimedia.org/wiki/Codereview-proxy.wikimedia.org
[20:56:23] New patchset: Ottomata; "Rsyncing public-datasets to stat1001 every 30 minutes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73032
[20:56:38] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73032
[20:57:05] New patchset: Ottomata; "Rsyncing zero*.gz to relevant stat servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73024
[20:57:13] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73024
[21:02:03] !log package upgrades on hooper (etherpad)
[21:02:12] Logged the message, Master
[21:35:41] !log removing old racktables Apache site and other remnants from hooper
[21:35:49] Logged the message, Master
[22:01:03] PROBLEM - Puppet freshness on db78 is CRITICAL: No successful Puppet run in the last 10 hours
[22:26:44] !log reprepro copying etherpad from wikimedia-lucid to wikimedia-precise
[22:26:53] Logged the message, Master
[22:41:05] New patchset: Dzahn; "use misc::etherpad on zirconium to verify it works, not changing db or DNS yet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73114
[22:44:58] New patchset: Dzahn; "use misc::etherpad on zirconium to verify it works, not changing db or DNS yet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73114
[22:46:12] New review: Dzahn; "existing class from hooper, copied package lucid->precise, confirm it works to move out of Tampa" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/73114
[22:46:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/73114
[23:08:46] PROBLEM - Etherpad HTTP on zirconium is CRITICAL: Connection refused
[23:10:22] yea, it's not there yet, ack
[23:23:43] !updated Parsoid to f571475
[23:27:46] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:46] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:46] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:46] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:46] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:46] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:47] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[23:27:47] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[23:28:57] !log csteipp synchronized php-1.22wmf9/extensions/CentralAuth 'Updating CentralAuth on wmf9 for SUL'
[23:29:07] Logged the message, Master
[23:37:29] gwicke: is !update a keyword, or did you forget !log?
[23:37:33] !log catrope synchronized php-1.22wmf8/extensions/VisualEditor 'Update VE to master'
[23:37:43] Logged the message, Master
[23:37:47] !log [00:23] gwicke !updated Parsoid to f571475
[23:37:56] Logged the message, Mr. Obvious
[23:37:59] !log catrope synchronized php-1.22wmf9/extensions/VisualEditor 'Update VE to master'
[23:38:08] Logged the message, Master
[23:38:25] AzaToth: I made that mistake a few times now, so clearly it should be an alias
[23:38:34]
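The !updated/!log slip above is exactly the kind of thing a one-line alias in the logging bot would absorb. A hypothetical sketch of that idea — this is not morebots' actual code or configuration format; the alias table and expand_aliases() helper are invented for illustration:

```python
# Hypothetical sketch: have the logging bot treat "!updated ..." as
# shorthand for "!log updated ...". Names below are invented, not the
# bot's real API.
ALIASES = {"!updated": "!log updated"}

def expand_aliases(line):
    for alias, expansion in ALIASES.items():
        if line.startswith(alias + " "):
            return expansion + line[len(alias):]
    return line

# The slip from earlier in the evening would then have been logged anyway:
assert expand_aliases("!updated Parsoid to f571475") == "!log updated Parsoid to f571475"
```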