[02:05:46] !log LocalisationUpdate completed (1.22wmf12) at Tue Aug 13 02:05:45 UTC 2013
[02:06:00] Logged the message, Master
[02:15:59] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Aug 13 02:15:58 UTC 2013
[02:16:09] Logged the message, Master
[02:22:13] PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours
[03:08:18] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours
[03:37:40] (PS3) MZMcBride: Add ganglia monitoring for vhtcpd. [operations/puppet] - https://gerrit.wikimedia.org/r/77975 (owner: BryanDavis)
[04:37:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error
[05:01:34] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100%
[05:03:04] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[05:05:54] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused
[05:17:24] PROBLEM - NTP on mw27 is CRITICAL: NTP CRITICAL: Offset unknown
[05:22:24] RECOVERY - NTP on mw27 is OK: NTP OK: Offset 0.00118970871 secs
[05:44:58] (PS1) ArielGlenn: disallow / for bots on gitblit, as /blobdiff/ and /zip/ isn't enough [operations/puppet] - https://gerrit.wikimedia.org/r/78934
[05:45:38] dup
[05:46:20] (Abandoned) Faidon: disallow / for bots on gitblit, as /blobdiff/ and /zip/ isn't enough [operations/puppet] - https://gerrit.wikimedia.org/r/78934 (owner: ArielGlenn)
[05:46:25] (CR) Faidon: [C: 2] Disallowing all indexing for now [operations/puppet] - https://gerrit.wikimedia.org/r/78919 (owner: Demon)
[05:56:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[05:56:47] !log restarted apache on stafford, puppet was dead
[05:56:58] Logged the message, Master
[05:59:17] !log ran puppet on antimony and restarted
apache to pick up new robots.txt (see r/78919 )
[05:59:29] Logged the message, Master
[06:01:38] (PS1) Tim Starling: Revoke Aaron's SSH key [operations/puppet] - https://gerrit.wikimedia.org/r/78935
[06:01:54] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.256 second response time
[06:02:07] (CR) Tim Starling: [C: 2 V: 2] Revoke Aaron's SSH key [operations/puppet] - https://gerrit.wikimedia.org/r/78935 (owner: Tim Starling)
[06:02:34] TimStarling: thanks
[06:04:27] (PS4) Akosiaris: Refactor nrpe to a module [operations/puppet] - https://gerrit.wikimedia.org/r/77720
[06:04:47] (CR) Akosiaris: [C: 2] Refactor nrpe to a module [operations/puppet] - https://gerrit.wikimedia.org/r/77720 (owner: Akosiaris)
[06:14:14] PROBLEM - RAID on solr1002 is CRITICAL: Connection refused by host
[06:39:01] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[06:41:22] PROBLEM - RAID on analytics1014 is CRITICAL: Connection refused by host
[06:52:22] (PS1) Akosiaris: Some more etherpad-lite apache config tunings [operations/puppet] - https://gerrit.wikimedia.org/r/78937
[07:12:45] (CR) Akosiaris: [C: 2] Some more etherpad-lite apache config tunings [operations/puppet] - https://gerrit.wikimedia.org/r/78937 (owner: Akosiaris)
[07:16:11] RECOVERY - Puppet freshness on zirconium is OK: puppet ran at Tue Aug 13 07:16:05 UTC 2013
[07:21:55] (PS1) ArielGlenn: twemproxy for snapshot hosts [operations/puppet] - https://gerrit.wikimedia.org/r/78939
[07:24:27] (CR) ArielGlenn: [C: 2] twemproxy for snapshot hosts [operations/puppet] - https://gerrit.wikimedia.org/r/78939 (owner: ArielGlenn)
[07:29:08] !log enabled https redir for epl.wikimedia.org.
[07:29:19] Logged the message, Master
[07:39:01] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[07:39:01] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[07:39:01] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours
[07:39:01] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours
[07:39:01] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[07:39:02] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours
[07:52:40] PROBLEM - DPKG on snapshot4 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:53:40] RECOVERY - DPKG on snapshot4 is OK: All packages OK
[08:05:40] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:40] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[08:10:00] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[08:10:00] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[08:10:00] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[08:10:00] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[08:10:00] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[08:10:01] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[08:47:23] PROBLEM - Host foundation-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:47:43] PROBLEM - Host wikidata-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:47:53] RECOVERY - Host foundation-lb.esams.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss
= 58%, RTA = 86.80 ms
[08:47:55] RECOVERY - Host wikidata-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.72 ms
[08:48:33] wtf is going on
[08:48:54] dunno just got the pages myself
[08:48:59] works now
[08:48:59] got em all at once
[08:49:03] network hiccup?
[08:49:04] gah
[08:49:10] transit?
[08:49:20] hm, there's still packet loss
[08:50:57] !log taking down epl.wikimedia.org for maintanance
[08:51:08] Logged the message, Master
[09:51:28] If anyone is around still, gitblit needs the defib again
[09:56:08] p858snake|l: for reals ?
[09:56:27] rstarted
[09:58:59] the bot hasn't picked up the new robots.txt file yet
[10:27:14] (PS1) QChris: Switch to changes table for gerrit's reviewer count file [operations/puppet] - https://gerrit.wikimedia.org/r/78944
[11:47:27] (PS1) TTO: Clean up abusefilter.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78951
[11:50:12] (PS1) Springle: Script to handle purging of query digest data older than 4 weeks. [operations/puppet] - https://gerrit.wikimedia.org/r/78952
[12:12:16] (PS1) Petr Onderka: Use at() for bounds checking [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/78953
[12:14:19] (PS2) Hashar: reenable otrs GenericAgent.pm [operations/puppet] - https://gerrit.wikimedia.org/r/78819 (owner: Jgreen)
[12:22:58] PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours
[13:22:48] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:23:01] (PS1) Akosiaris: Force epl.wikimedia.org -> etherpad.wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/78957
[13:23:48] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[13:26:44] (CR) Akosiaris: [C: 2] Force epl.wikimedia.org -> etherpad.wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/78957 (owner: Akosiaris)
[13:59:17] cmjohnson1: hey, have you seen the ms-be* disk tickets?
[13:59:49] Hi paravoid: yes...i have seen them. ms-be1005 is easy cuz the controller shows it failed ...1008 is not failed
[14:08:28] (CR) Petr Onderka: [C: 2 V: 2] Use at() for bounds checking [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/78953 (owner: Petr Onderka)
[14:29:24] hi all,
[14:29:42] I need to install a package for jenkins, but none of the puppet classes included on gallium seem like a good place to do so
[14:34:27] (CR) BBlack: [C: 1] "+1 on this because we need it and it looks sane functionally. Would be nice to have someone else who's more familiar with our puppet modu" [operations/puppet] - https://gerrit.wikimedia.org/r/77975 (owner: BryanDavis)
[14:43:42] (PS1) Ottomata: Installing dclass libs on gallium for device classification unit tests. [operations/puppet] - https://gerrit.wikimedia.org/r/78958
[14:44:00] (CR) Ottomata: [C: 2 V: 2] Installing dclass libs on gallium for device classification unit tests.
[operations/puppet] - https://gerrit.wikimedia.org/r/78958 (owner: Ottomata)
[14:47:32] (PS1) Ottomata: Fixing name of libdclass0-dev package [operations/puppet] - https://gerrit.wikimedia.org/r/78959
[14:47:44] (CR) Ottomata: [C: 2 V: 2] Fixing name of libdclass0-dev package [operations/puppet] - https://gerrit.wikimedia.org/r/78959 (owner: Ottomata)
[15:09:36] PROBLEM - Apache HTTP on mw131 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50272 bytes in 0.155 second response time
[15:10:35] RECOVERY - Apache HTTP on mw131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.132 second response time
[15:33:58] Can I find robots.txt for Bugzilla somewhere? operations/puppet/files/apache/sites/bugzilla.wikimedia.org is not the place, hence wondering if it's puppetized and where
[15:37:19] looks like it is not
[15:37:58] april 12 2012
[15:54:36] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:57:25] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[16:27:35] binasher: are you in the office?
[16:39:15] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[16:40:48] Aaron|home: no one from ops is (yet).
[16:41:06] hah, speak of the devil, RobH just came in
[16:43:19] devil :-D
[16:54:10] greg-g: what was ops needed for?
[16:55:59] RobH: dunno, see Aaron|home above
[16:56:23] * Aaron|home was just curious
[16:56:26] :-D
[16:56:43] somehow my name tag fell off my chair
[16:56:46] and i had a nice chair
[16:56:56] so now i found one almost the same, but is slightly more broken
[16:57:00] but fuck it, its my chair now.
[16:57:13] (PS1) Cmjohnson: puppetizing /etc/lograte.d/squid [operations/puppet] - https://gerrit.wikimedia.org/r/78965
[16:57:41] robh: you are always getting your chair stolen
[16:58:00] I assembled my chair
[16:58:06] someone else is now sitting in it I'm sure
[16:58:10] for years now
[16:58:25] "apergos was here ~~~~"
[16:58:26] <^d> apergos: Thanks for the merge earlier btw.
[16:58:34] that was paravoid
[16:58:44] cmjohnson1: its cuz i had a nice aeon chair
[16:58:47] all greeks look the same!
[16:58:48] I didn't see your change, (and that was just a gratuitous ping, rats)
[16:59:00] jeremyb: you mean 'its all greek to me' apergos loves that expression.
[16:59:05] cause I didn't see you had a change in there. anyways googlebot picked upt he new file since then
[16:59:17] <^d> apergos: Heh, too many e-mails. All blurred together :)
[16:59:19] * apergos stabs RobH slowly with a rusty spoon
[16:59:31] yeah pretty much :-)
[16:59:31] RobH: added another link to the burnin page
[16:59:56] so now the trick is I guess, how to allow indexing without it falling over, but I leave that in your hands
[16:59:58] good luck!
[17:00:03] as in "good luck with that then"
[17:00:32] <^d> Step 1) Make gitblit emit some expire headers that are more than a minute into the future.
[17:00:42] <^d> Step 2) Varnish will actually cache things decently then
[17:00:46] <^d> Step 3) ???
[17:00:49] <^d> Step 4) Profit!!!
[17:01:04] (PS2) Cmjohnson: puppetizing /etc/lograte.d/squid [operations/puppet] - https://gerrit.wikimedia.org/r/78965
[17:02:13] sounds great to me
[17:02:44] jeremyb: cool
[17:02:58] cmjohnson1: just in case you werent aware
[17:02:58] https://wikitech.wikimedia.org/wiki/Automated_hardware_testing
[17:03:41] cool..stresslinux is the one I have been messing with
[17:19:45] (PS1) Manybubbles: Configure elasticearch multicast per datacenter [operations/puppet] - https://gerrit.wikimedia.org/r/78966
[17:39:02] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[17:39:02] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[17:39:02] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours
[17:39:02] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours
[17:39:02] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours
[17:39:03] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[17:44:04] hey ^d, do you control gerritadmin@wikimedia.org?
[17:44:26] <^d> The e-mail address, or the bugzilla bot using the address?
[17:44:32] the e-mail address
[17:44:33] <^d> The address is just an alias.
[17:44:54] asking because we should set a wmf logo as a gravatar
[17:45:11] to beautify wmf github activity stream :P
[17:45:28] <^d> It's an alias and it goes to me and...Antoine I think?
[17:46:59] (PS2) Manybubbles: Configure elasticearch multicast per datacenter [operations/puppet] - https://gerrit.wikimedia.org/r/78966
[17:47:55] !log rob is testing his new ssh settings to see if he can dsh graceful all apaches without errors, no actual changes are being made
[17:48:06] Logged the message, RobH
[17:49:34] ARGH it didnt fix problem
[17:49:37] still auth errors
[17:49:40] so many...
i hate you dsh
[17:49:44] you scale for shit
[17:50:17] did someone say salt?
[17:50:37] yea so apache-gracefull-all dsh loops through all the apaches
[17:50:50] and it does it in a fashion that causes ssh authentication errors to crop up for a number of us
[17:50:57] (seems to do with there being so damned many apaches now)
[17:51:16] i tried to tweak my settings with the controlmaster/controlpath settings
[17:51:23] and i still get the errors
[17:51:37] salt isnt the 'official' way we apply apache changes
[17:51:47] but i may havbe to use it, annoying.
[18:03:44] ^d: heya, mind deploying a revert/cherry-pick for the parsoid team today?
[18:04:36] <^d> That's cool, what's the change?
[18:05:33] ^d: https://gerrit.wikimedia.org/r/#/c/78967/
[18:07:50] just waiting for jenkins (as always) ;)
[18:08:17] there, jenkins is done
[18:10:02] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[18:10:02] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[18:10:02] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[18:10:02] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[18:10:02] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[18:10:03] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[18:33:41] (PS1) RobH: adding in wikipedia.in redirection support [operations/apache-config] - https://gerrit.wikimedia.org/r/78970
[18:34:32] (CR) RobH: [C: 2] adding in wikipedia.in redirection support [operations/apache-config] - https://gerrit.wikimedia.org/r/78970 (owner: RobH)
[18:40:07] <^d> Dumb question. Where do we graph the parser cache hit rates?
[18:40:14] <^d> I thought it was...somewhere...but I can't find it.
[18:40:31] graphite?
[18:40:49] !log updated apache redirects, but dsh stuff wont workfor me so hacking an apache gracefull all via salt
[18:41:00] Logged the message, RobH
[18:41:17] ^d: http://gdash.wikimedia.org/dashboards/pcache/
[18:41:37] <^d> A-ha! Thank you binasher and Aaron|home
[18:42:47] * Aaron|home needs tim-tams
[18:51:18] !log wikipedia.in is now working, huzzah.
[18:51:30] Logged the message, RobH
[19:08:42] (Abandoned) Cmjohnson: adding mc1-16 to decom list [operations/puppet] - https://gerrit.wikimedia.org/r/73193 (owner: Cmjohnson)
[19:12:05] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[19:15:15] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:19:52] (PS3) Cmjohnson: puppetizing /etc/lograte.d/squid [operations/puppet] - https://gerrit.wikimedia.org/r/78965
[19:26:24] (CR) ArielGlenn: [C: 1] puppetizing /etc/lograte.d/squid [operations/puppet] - https://gerrit.wikimedia.org/r/78965 (owner: Cmjohnson)
[19:27:05] (CR) Cmjohnson: [C: 2 V: 2] puppetizing /etc/lograte.d/squid [operations/puppet] - https://gerrit.wikimedia.org/r/78965 (owner: Cmjohnson)
[19:47:15] (CR) Ori.livneh: [C: 1] "@BBlack: the files are in the right place, IMO. It deviates from https://wikitech.wikimedia.org/wiki/Puppet_coding#WMF_Design_conventions " [operations/puppet] - https://gerrit.wikimedia.org/r/77975 (owner: BryanDavis)
[20:05:47] <^d> Aaron|home: I saw you +1'd my config change. Do you think that's reasonably sane?
[20:05:56] <^d> Getting that out sooner rather than later would be nice :)
[20:09:45] (CR) Demon: [C: 1] Switch to changes table for gerrit's reviewer count file [operations/puppet] - https://gerrit.wikimedia.org/r/78944 (owner: QChris)
[20:10:01] ^d: sure
[20:13:39] ottomata: your turn!
[20:13:49] * RobH runs away
[20:15:55] ohhhh
[20:15:57] woohoo!
[20:16:00] i missed yesterday too!
[20:16:03] <^d> Aaron|home: Actually, it's not as clever as I thought.
[20:16:05] tsk tsk, shame on me
[20:16:22] <^d> If $wgSearchType is still LuceneSearch, SearchUpdate won't pick the right backend to update.
[20:17:05] <^d> So wikis running with it as an alternative won't do updates :(
[20:18:36] ^d: that not the config though but the core code right?
[20:19:01] but, yeah, you'd need to update all the alternatives somehow
[20:26:35] <^d> Aaron|home: Easiest solution imho would be to turn SearchUpdate back on for wmf, and make MWSearch implement it as a no-op.
[20:26:49] <^d> Then we could easily iterate the backends.
[20:30:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:31:02] trying to think of how best the code would be refactored for that
[20:32:10] <^d> Do it in SearchUpdate rather than the 4 or so callers. They shouldn't care what backends you use.
[20:32:48] right, maybe doUpdate could iterate and $wgDisableSearchUpdate could optionally be an array or something
[20:33:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[20:33:43] <^d> Sounds like a reasonable approach. I'll start mucking about with a patch and see what I end up.
[20:33:47] <^d> *end up with
[21:03:30] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[21:03:48] (PS1) RobH: RT5511 ytterbium setup info [operations/puppet] - https://gerrit.wikimedia.org/r/79014
[21:05:43] (CR) RobH: [C: 2] RT5511 ytterbium setup info [operations/puppet] - https://gerrit.wikimedia.org/r/79014 (owner: RobH)
[21:06:40] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:07:39] ^d: yo, i have your server nearly ready
[21:07:46] just have os install now, so i'll hand off to you tomorrow.
[21:07:55] <^d> Cool beans
[21:29:04] (PS1) Danny B.: skwikisource: Project name localization [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79016
[22:03:53] (CR) Demon: [C: -2] "Needs a bit more work before I merge." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78083 (owner: Demon)
[22:23:57] PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours
[22:55:24] !log authdns update
[22:55:36] Logged the message, Master
[23:04:44] (PS1) Cmjohnson: adding hafnium and removing ocg1-3 [operations/puppet] - https://gerrit.wikimedia.org/r/79027
[23:05:48] (CR) Cmjohnson: [C: 2 V: 2] adding hafnium and removing ocg1-3 [operations/puppet] - https://gerrit.wikimedia.org/r/79027 (owner: Cmjohnson)