[00:00:00] <Krinkle>	 legoktm: Hm.. that frwikt filter did check 'bot'
[00:00:04] <Krinkle>	 how come it denied still
[00:00:14] <legoktm>	 I didn't give the fake account any rights
[00:00:24] <Krinkle>	 $user->addGroup('bot')
[00:00:33] <Krinkle>	 at the start of any maintenance script using a fake username
[00:00:37] <icinga-wm>	 RECOVERY - Host ms-fe1003 is UP: PING OK - Packet loss = 0%, RTA = 2.38 ms  
[00:00:40] <grrrit-wm>	 (03CR) 10Chad: "Heh, I've been calling it with sudo myself for ages. Glad it's finally gonna be fixed :p" [puppet] - 10https://gerrit.wikimedia.org/r/157013 (owner: 10Reedy)
[00:00:42] <Krinkle>	 dont worry, doesn't need to exist.
[00:00:48] <Krinkle>	 add that as well :)
[00:01:07] <Krinkle>	 so frwikt fix wasn't needed
[00:01:11] <Krinkle>	 but enwikt will still fail
[00:01:42] <legoktm>	 we should also set $wgUser so the abusefilter log entries don't say 127.0.0.1
[00:01:59] <Krinkle>	 Yep
[00:02:17] <Krinkle>	 legoktm: for comparison, this is how I delete pages
[00:02:17] <Krinkle>	 function kfDeleteDefault($title,$reason,$user) { if (!$user) { print "Invalid user\n"; return; } $user->addGroup( 'bot' );global $wgUser; $wgUser = $user;$title = Title::newFromText( $title ); if (!$title->exists()) { print "Title [[$title]] not found\n"; return; } $dbw = wfGetDB( DB_MASTER ); $dbw->begin( 'eval' ); $page = WikiPage::factory( $title ); $error = ''; $success = $page->doDeleteArti
[00:02:18] <Krinkle>	 cle( $reason, false, 0, false, $error, $user ); $dbw->commit( 'eval' ); echo "Deleted [[$title]]\n"; wfWaitForSlaves(); } 
[00:02:32] <Krinkle>	 (if I have to resort to eval.php for sysadmin reasons)
[00:03:02] <Krinkle>	 e.g.
[00:03:04] <Krinkle>	 $title = 'Commons:Auto-protected files/wikipedia/zh/Archive 4';
[00:03:05] <Krinkle>	 $user = User::newFromName( 'Maintenance script' );
[00:03:06] <Krinkle>	 $reason = 'Delete page with over 5000 revisions (requested by Krinkle)';
[00:03:08] <Krinkle>	 kfDeleteDefault( $title, $reason, $user );
[00:03:18] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Move scap files back to /usr [puppet] - 10https://gerrit.wikimedia.org/r/157014 
[00:04:13] <grrrit-wm>	 (03PS1) 10Dzahn: base monitoring - set hostgroups based on $cluster [puppet] - 10https://gerrit.wikimedia.org/r/157015 
[00:04:19] <cmjohnson1>	 !log shutting down ms-fe1004 to relocate racks
[00:04:25] <morebots>	 Logged the message, Master
[00:05:48] <icinga-wm>	 RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures  
[00:05:58] <icinga-wm>	 PROBLEM - Host ms-fe1004 is DOWN: PING CRITICAL - Packet loss = 100%  
[00:07:03] <spagewmf>	 greg-g, (OuKB) : backports fixed Flow and Echo bugs on mediawiki.org
[00:07:13] <MaxSem>	 whee
[00:13:36] <grrrit-wm>	 (03CR) 10Dzahn: "group has been added fine but stays empty, which is a global, not codfw-specific problem. see https://gerrit.wikimedia.org/r/#/c/157015/1 " [puppet] - 10https://gerrit.wikimedia.org/r/157003 (owner: 10Dzahn)
[00:14:15] <mutante>	 " # For unclear historic reasons, this box has a massive /a drive." :)
[00:19:50] <grrrit-wm>	 (03CR) 10Dzahn: "we already have icinga-wm reporting all the Icinga notifications to IRC, is this an attempt to replace that? or just because you want the " [puppet] - 10https://gerrit.wikimedia.org/r/136095 (owner: 10Christopher Johnson (WMDE))
[00:21:58] <icinga-wm>	 RECOVERY - Host ms-fe1004 is UP: PING WARNING - Packet loss = 86%, RTA = 2.64 ms  
[00:27:18] <icinga-wm>	 PROBLEM - Host ms-fe1004 is DOWN: PING CRITICAL - Packet loss = 100%  
[00:27:26] <mutante>	 !log restarting gmetad on nickel
[00:27:31] <morebots>	 Logged the message, Master
[00:28:02] <mutante>	 godog: still on ? ^ swift host down again?
[00:28:26] <godog>	 mutante: ye it is me and cmjohnson1 poking no worries :)
[00:28:42] <mutante>	 just saw, ok
[00:29:04] <mutante>	 recovery at 85% packet loss :p :)
[00:29:18] <godog>	 classy
[00:31:27] <icinga-wm>	 RECOVERY - Host ms-fe1004 is UP: PING OK - Packet loss = 0%, RTA = 2.77 ms  
[00:32:34] <godog>	 !log repool ms-fe1004
[00:32:40] <morebots>	 Logged the message, Master
[00:34:41] <godog>	 !log depool ms-fe1001
[00:34:47] <morebots>	 Logged the message, Master
[00:38:47] <cmjohnson1>	 !log shutting down ms-fe1001 for rack relocation
[00:38:54] <morebots>	 Logged the message, Master
[00:39:41] <grrrit-wm>	 (03PS1) 10Dzahn: add ganglia_new aggregator to install2001/codfw [puppet] - 10https://gerrit.wikimedia.org/r/157020 
[00:40:58] <icinga-wm>	 PROBLEM - Host ms-fe1001 is DOWN: PING CRITICAL - Packet loss = 100%  
[00:43:17] <icinga-wm>	 PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Thu 28 Aug 2014 22:42:28 UTC  
[00:45:47] <icinga-wm>	 PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[00:46:22] <spagewmf>	 uh, how do I view fatals in logstash? Under TYPES I see runJobs, Hadoop, etc., but not "exceptions" and "fatals".  It ties with graphite for "awesome power, no idea how to work it".
[00:46:34] <grrrit-wm>	 (03PS2) 10Dzahn: add ganglia_new aggregator to install2001/codfw [puppet] - 10https://gerrit.wikimedia.org/r/157020 
[00:46:47] <icinga-wm>	 RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 8.763 second response time  
[00:46:47] <icinga-wm>	 PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 39 data above and 0 below the confidence bounds  
[00:46:47] <icinga-wm>	 PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 39 data above and 0 below the confidence bounds  
[00:49:37] <ori>	 i'm going to deploy a small update to wikimediaevents
[00:52:09] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] add ganglia_new aggregator to install2001/codfw [puppet] - 10https://gerrit.wikimedia.org/r/157020 (owner: 10Dzahn)
[00:56:34] <logmsgbot>	 !log ori Synchronized php-1.24wmf19/extensions/WikimediaEvents: Ib44fe0898: Inject 'wgPoweredByHHVM' JS config var if powered by HHVM (duration: 00m 04s)
[00:56:40] <morebots>	 Logged the message, Master
[00:57:41] <mutante>	 Duplicate declaration: File[/usr/lib/ganglia/python_modules] is already declared   ...grrrrr
[00:57:45] <logmsgbot>	 !log ori Synchronized php-1.24wmf18/extensions/WikimediaEvents: Ib44fe0898: Inject 'wgPoweredByHHVM' JS config var if powered by HHVM (duration: 00m 03s)
[00:57:51] <morebots>	 Logged the message, Master
[00:58:40] <Krinkle>	 ori: Hm.. I saw some CR earlier about wfishiphop instead of defined
[00:59:03] <Krinkle>	 ori: did you mean to put it in page output instead of startup module?
[00:59:08] <ori>	 yes
[00:59:14] <ori>	 it's request-specific
[00:59:37] <icinga-wm>	 PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: Epic puppet fail  
[00:59:39] <Krinkle>	 yeah, but startup is a request, too
[00:59:50] <Krinkle>	 could've worked either way I suppose
[00:59:56] <Krinkle>	 but caching is key
[01:00:09] <Krinkle>	 bypasses cache or is fragmented by it?
[01:00:28] <ori>	 Krinkle: https://gerrit.wikimedia.org/r/#/c/152903/
[01:00:38] <ori>	 it's a lot to explain :P
[01:00:47] <icinga-wm>	 RECOVERY - Host ms-fe1001 is UP: PING OK - Packet loss = 0%, RTA = 5.48 ms  
[01:01:11] <Krinkle>	 ori: so it'd work for bits too, right?
[01:01:17] <icinga-wm>	 PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC  
[01:01:47] <ori>	 yes, but the var ought to describe what generated the page, not what generated the startup module
[01:01:47] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).  
[01:01:52] <ori>	 because there's a chance that they're different
[01:02:01] <godog>	 !log repool ms-fe1001
[01:02:05] <Krinkle>	 ah, and it's not the same cookie for both
[01:02:08] <morebots>	 Logged the message, Master
[01:02:08] <Krinkle>	 gotcha
[01:02:18] <ori>	 since varnish will route requests to the zend backends if the hhvm backends are sick
[01:02:23] <ori>	 it's the same cookie
[01:02:31] <Krinkle>	 no
[01:02:40] <Krinkle>	 if you set cookie hhvm on enwiki, your bits request wont be hhvm
[01:03:12] <ori>	 oh, that's what you meant. yes, right.
[01:03:20] <ori>	 that's an even better point
[01:03:29] <ori>	 since it's more likely than the possibility i mentioned above
[01:03:34] <Krinkle>	 yeah
[01:03:37] <Krinkle>	 good :)
[01:03:55] <ori>	 using wfIsHHVM() is probably neater, yeah
[01:04:10] <grrrit-wm>	 (03PS1) 10Dzahn: do not set $cluster on install2001 [puppet] - 10https://gerrit.wikimedia.org/r/157022 
[01:04:17] <godog>	 !log depool ms-fe1002
[01:04:24] <morebots>	 Logged the message, Master
[01:05:16] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] do not set $cluster on install2001 [puppet] - 10https://gerrit.wikimedia.org/r/157022 (owner: 10Dzahn)
[01:06:07] <ori>	 Krinkle: do you have a javascript handy for superimposing some image on the interface if a variable is set? (for a user script, not something that would actually be forced on anyone)
[01:06:18] <ori>	 *a javascript snippet, even
[01:06:30] <ori>	 i guess it's easy enough to write one
[01:06:47] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.  
[01:06:58] <cmjohnson1>	 !log shutting down ms-fe1002 to relocate racks
[01:07:04] <morebots>	 Logged the message, Master
[01:07:17] <Krinkle>	 if mw.config.get isHip: $('#p-logo').css('background-color', 'pink')
[01:07:17] <Krinkle>	 ori: 
[01:07:27] <Krinkle>	 I tend to use logo background color for different temp purposes
[01:07:52] <Krinkle>	 stick it into your global.js :)
[01:08:07] <ori>	 weee! :)
[01:09:26] <Krinkle>	 Hm.. ukwikimedia is still being iterated over
[01:09:27] <icinga-wm>	 PROBLEM - Host ms-fe1002 is DOWN: PING CRITICAL - Packet loss = 100%  
[01:09:31] <Krinkle>	 ins't that off cluster now?
[01:09:46] <Krinkle>	 https://wikimedia.org.uk/wiki/User:Krinkle/common.css
[01:09:55] <legoktm>	 it's still on the cluster, but there's a redirect in place
[01:10:01] <Krinkle>	 right
[01:10:03] <Krinkle>	 inaccessilbe
[01:10:16] <Krinkle>	 (well, short of /etc/hosts
[01:10:25] <Krinkle>	 Im sure the servers still respond to it :P
[01:10:54] <Krinkle>	 Hm.. nah, redirects still catch it
[01:11:44] <Krenair>	 Krinkle, so we're not serving it now?
[01:11:51] <Krinkle>	 indeed
[01:11:56] <Krinkle>	 legoktm: https://wikimania2009.wikimedia.org/wiki/User:Krinkle/common.js didn't match pattern
[01:11:59] <Krenair>	 But it's not been marked as closed?
[01:12:00] <Krinkle>	 mw.loader.load('http://meta.wikimedia.org/w/index.php?title=User:Krinkle/global.js&action=raw&ctype=text/javascript','text/javascript'); 
[01:12:04] <Krinkle>	 interesting second argument
[01:12:07] <Krinkle>	 that's valid
[01:12:12] <Krinkle>	 very rare
[01:12:13] <Krinkle>	 no idea what I was doing
[01:12:39] <legoktm>	 the regex wasn't expecting the second argument
[01:14:10] <Krinkle>	 Krenair: Yeah, it shouldn't be iterated over by maintenance scripts any more
[01:14:19] <Krinkle>	 legoktm: btw, what does "does not load global modules on this wiki" mean
[01:14:30] <Krinkle>	 wikimania2015wiki:  Krinkle does not load global modules on this wiki. 
[01:14:33] <Krenair>	 hm, wikimania 2008-2012 wikis still point to wikimania 2013 as being 'future'
[01:14:51] <Krinkle>	 aawiktionary:  Krinkle does not load global modules on this wiki. 
[01:14:56] <Krenair>	 It's on closed.dblist
[01:15:08] <Krinkle>	 right
[01:15:14] <legoktm>	 Krinkle: means the account doesn't exist locally or isn't attached in CA. 
[01:15:16] <Krinkle>	 or did I just not autocreate on those
[01:15:19] <Krinkle>	 ic
[01:15:25] <legoktm>	 		if ( !$user->getId() || !GlobalCssJsHooks::loadForUser( $user ) ) {
[01:15:25] <legoktm>	 			$this->output( "$userName does not load global modules on this wiki.\n" );
[01:16:13] <legoktm>	 the script should only delete it globalcssjs would replace it, if your account isn't attached in CA, it won't load, hence no deletion
[01:17:39] <Krinkle>	 legoktm: yeah, and no local js/css pages if theres no account
[01:18:07] <Krinkle>	 unless some other privileged user created them for them without checking account first
[01:26:47] <icinga-wm>	 RECOVERY - Host ms-fe1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms  
[01:27:10] <godog>	 !log repool ms-fe1002
[01:27:17] <morebots>	 Logged the message, Master
[01:31:27] <grrrit-wm>	 (03PS1) 10Dzahn: install2001 - just incl. base, not standard [puppet] - 10https://gerrit.wikimedia.org/r/157026 
[01:31:42] <grrrit-wm>	 (03PS2) 10Dzahn: install2001 - just incl. base, not standard [puppet] - 10https://gerrit.wikimedia.org/r/157026 
[01:31:44] <Krinkle>	 legoktm: ori: btw, submodule commits mystyle: https://gist.github.com/Krinkle/479399ac9a11e9ff8b62
[01:32:28] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] install2001 - just incl. base, not standard [puppet] - 10https://gerrit.wikimedia.org/r/157026 (owner: 10Dzahn)
[01:34:37] <icinga-wm>	 RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures  
[01:41:49] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).  
[01:42:56] <grrrit-wm>	 (03CR) 10Dzahn: "Aug 29 01:34:07 install2001 puppet-agent[23260]: (/Stage[main]/Ganglia_new::Monitor::Aggregator/Service[ganglia-monitor-aggregator]) Unsch" [puppet] - 10https://gerrit.wikimedia.org/r/157026 (owner: 10Dzahn)
[01:46:35] <grrrit-wm>	 (03PS1) 10Springle: depool db1070 for maintenance. pool db1072 in place. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157030 
[01:48:14] <grrrit-wm>	 (03CR) 10Springle: [C: 032] depool db1070 for maintenance. pool db1072 in place. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157030 (owner: 10Springle)
[01:48:19] <grrrit-wm>	 (03Merged) 10jenkins-bot: depool db1070 for maintenance. pool db1072 in place. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157030 (owner: 10Springle)
[01:49:37] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: depool db1070. pool db1072. (duration: 00m 06s)
[01:49:46] <morebots>	 Logged the message, Master
[02:04:59] <icinga-wm>	 PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3871 MB (3% inode=99%):  
[02:06:26] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Move glance images to /a where there's more room. [puppet] - 10https://gerrit.wikimedia.org/r/157012 (owner: 10Andrew Bogott)
[02:07:16] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Move scap files back to /usr [puppet] - 10https://gerrit.wikimedia.org/r/157014 
[02:07:50] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.  
[02:08:59] <icinga-wm>	 RECOVERY - Disk space on virt0 is OK: DISK OK  
[02:09:20] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Move scap files back to /usr [puppet] - 10https://gerrit.wikimedia.org/r/157014 (owner: 10Andrew Bogott)
[02:10:49] <icinga-wm>	 PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner  
[02:10:59] <icinga-wm>	 PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl  
[02:11:49] <icinga-wm>	 RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner  
[02:11:59] <icinga-wm>	 RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl  
[02:17:03] <grrrit-wm>	 (03PS1) 10Dzahn: ganglia-aggregators on install2001 as data sources [puppet] - 10https://gerrit.wikimedia.org/r/157033 
[02:18:00] <grrrit-wm>	 (03PS2) 10Dzahn: ganglia-aggregators on install2001 as data sources [puppet] - 10https://gerrit.wikimedia.org/r/157033 
[02:19:39] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "using it like hooft is used" [puppet] - 10https://gerrit.wikimedia.org/r/157033 (owner: 10Dzahn)
[02:30:09] <icinga-wm>	 RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Fri Aug 29 02:30:04 UTC 2014  
[02:31:10] <icinga-wm>	 PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 2 failures  
[02:38:23] <logmsgbot>	 !log LocalisationUpdate completed (1.24wmf18) at 2014-08-29 02:37:20+00:00
[02:38:31] <morebots>	 Logged the message, Master
[02:42:12] <grrrit-wm>	 (03PS1) 10Springle: reassign db1070 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/157035 
[02:48:43] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: depool db1070. pool db1072. (duration: 00m 07s)
[02:48:50] <morebots>	 Logged the message, Master
[02:51:49] <icinga-wm>	 PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner  
[02:52:00] <icinga-wm>	 PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl  
[02:54:49] <icinga-wm>	 RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner  
[02:54:59] <icinga-wm>	 RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl  
[03:01:29] <icinga-wm>	 PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC  
[03:02:09] <icinga-wm>	 RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures  
[03:12:13] <logmsgbot>	 !log LocalisationUpdate completed (1.24wmf19) at 2014-08-29 03:10:26+00:00
[03:20:46] <morebots>	 Logged the message, Master
[03:21:36] <grrrit-wm>	 (03CR) 10Springle: [C: 032] reassign db1070 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/157035 (owner: 10Springle)
[03:29:53] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: reduce db1056 load while cloning (duration: 00m 06s)
[03:34:10] <morebots>	 Logged the message, Master
[03:34:12] <springle>	 !log xtrabackup clone db1056 to db1070
[03:34:14] <morebots>	 Logged the message, Master
[04:01:04] <grrrit-wm>	 (03CR) 10Withoutaname: "Mainly tried to move only the group permissions. Some configuration settings not related to permissions changes were moved into CommonSett" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/156081 (https://bugzilla.wikimedia.org/58247) (owner: 10Withoutaname)
[04:14:09] <logmsgbot>	 !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Aug 29 04:13:03 UTC 2014 (duration 13m 2s)
[04:14:15] <morebots>	 Logged the message, Master
[04:20:47] <icinga-wm>	 PROBLEM - Disk space on elastic1009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19635 MB (3% inode=99%):  
[04:29:47] <icinga-wm>	 RECOVERY - Disk space on elastic1009 is OK: DISK OK  
[04:44:26] <grrrit-wm>	 (03CR) 10Hoo man: "I'm not actually a fan of this as I actually like the one file per extension configuration approach. Also what this does seems a little me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/156081 (https://bugzilla.wikimedia.org/58247) (owner: 10Withoutaname)
[05:01:54] <icinga-wm>	 PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC  
[06:28:04] <icinga-wm>	 PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Epic puppet fail  
[06:28:34] <icinga-wm>	 PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:28:45] <icinga-wm>	 PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:29:14] <icinga-wm>	 PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:29:14] <icinga-wm>	 PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:29:14] <icinga-wm>	 PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:29:14] <icinga-wm>	 PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:29:54] <icinga-wm>	 PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:30:44] <icinga-wm>	 RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures  
[06:33:28] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: return db1056 to normal load (duration: 00m 06s)
[06:33:34] <morebots>	 Logged the message, Master
[06:38:14] <icinga-wm>	 PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: Puppet has 3 failures  
[06:45:34] <icinga-wm>	 RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures  
[06:46:14] <icinga-wm>	 RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures  
[06:46:14] <icinga-wm>	 RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures  
[06:46:15] <icinga-wm>	 RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures  
[06:46:15] <icinga-wm>	 RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures  
[06:46:54] <icinga-wm>	 RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures  
[06:47:04] <icinga-wm>	 RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures  
[06:49:15] <icinga-wm>	 PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:51:24] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]  
[06:51:24] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]  
[06:56:14] <icinga-wm>	 RECOVERY - puppet last run on db1027 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures  
[07:00:22] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 04-2] "Already being done in monitor_host definition. manifests/nagios.pp:23. Perhaps some nodes are missing the $cluster variable ?" [puppet] - 10https://gerrit.wikimedia.org/r/157015 (owner: 10Dzahn)
[07:02:54] <icinga-wm>	 PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC  
[07:04:34] <icinga-wm>	 PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Epic puppet fail  
[07:05:04] <icinga-wm>	 RECOVERY - Puppet freshness on mw1053 is OK: puppet ran at Fri Aug 29 07:04:55 UTC 2014  
[07:05:20] <_joe_>	 !log re-enabling puppet on the jobrunner, to check if the luasandbox fix works
[07:05:24] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0]  
[07:05:24] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]  
[07:05:26] <morebots>	 Logged the message, Master
[07:06:34] <icinga-wm>	 RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures  
[07:07:14] <icinga-wm>	 RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures  
[07:17:40] <grrrit-wm>	 (03CR) 10Ori.livneh: "I'm not sure I understand the documentation. What does this do that https://github.com/puppetlabs/hiera/blob/master/lib/hiera/backend/yaml" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/151869 (owner: 10Giuseppe Lavagetto)
[07:26:40] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: "@ori:" [puppet] - 10https://gerrit.wikimedia.org/r/151869 (owner: 10Giuseppe Lavagetto)
[07:38:13] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Add DNS views in ganglia [puppet] - 10https://gerrit.wikimedia.org/r/157045 
[07:53:55] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 031] "It would suck to have to squish all variable data into a single YAML file, so I see the value. It's also similar to the pattern used in PH" [puppet] - 10https://gerrit.wikimedia.org/r/151869 (owner: 10Giuseppe Lavagetto)
[08:04:49] <hashar>	 !log Jenkins: in the jenkins-job-builder-config branch 'cloudbees' has been merged in 'master'. Unifying CI and browser tests jobs!  \O/
[08:04:55] <morebots>	 Logged the message, Master
[08:04:55] <hashar>	 morebots: come on
[08:04:56] <morebots>	 I am a logbot running on tools-exec-06.
[08:04:56] <morebots>	 Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[08:04:56] <morebots>	 To log a message, type !log <msg>.
[08:14:19] <hashar>	 anyone around familiar with Zero X-CS X-CS2 headers by any chance ?
[08:14:22] <hashar>	 got some questions :D
[08:30:54] <icinga-wm>	 PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner  
[08:31:34] <icinga-wm>	 PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl  
[08:31:54] <icinga-wm>	 RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner  
[08:32:34] <icinga-wm>	 RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl  
[08:47:17] <_joe_>	 hashar_: can't say I'm familiar, but I've seen those
[08:47:25] <_joe_>	 working on varnish
[08:47:39] <hashar_>	 _joe_:  I found out we have some super fun X-CS header for Zero :]
[08:47:53] <hashar_>	 I am playing with the browser tests and found out what i needed
[08:50:30] <_joe_>	 yes we do indeed
[08:50:49] <_joe_>	 look at zero.inc.vcl.erb in the varnish module :)
[08:50:54] <icinga-wm>	 PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner  
[08:51:49] <hashar_>	 _joe_: also I noticed overnight that the poor puppet compiler instance has full disk :(
[08:51:54] <icinga-wm>	 RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner  
[08:51:59] <_joe_>	 hashar_: again?
[08:52:01] <_joe_>	 grr
[08:52:05] <hashar_>	 yeah /tmp filled up
[08:52:10] <_joe_>	 ok I'll fix that as well
[08:52:11] <grrrit-wm>	 (03PS3) 10Giuseppe Lavagetto: beta: manage virtualhosts via puppet [puppet] - 10https://gerrit.wikimedia.org/r/156762 
[08:52:16] <_joe_>	 it's not tmp alone
[08:52:43] <_joe_>	 I'll fix it
[08:53:55] <hashar>	 _joe_: the instance should probably has extended disk enabled with role::labs::lvm::srv  and script made to point to something like /srv/tmp :D
[08:56:11] <_joe_>	 no
[08:56:13] <_joe_>	 :)
[08:56:28] <_joe_>	 I think I know the best way to manage a filesystem
[08:56:46] <_joe_>	 and let's say I'm not super-fond of how we do that in labs in general
[08:57:04] <_joe_>	 so I'm going to do that on my own
[08:58:33] <hashar>	 if you have a better proposal, I am sure labs users would love a fix :D
[08:58:50] <_joe_>	 oh no I don't, I'm just complaining
[08:58:59] <_joe_>	 I'm a grumpy, old opsen
[08:59:02] <_joe_>	 :)
[08:59:42] <_joe_>	 or better, I would love to, but I have nooo time
[09:30:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1206 is CRITICAL: Connection timed out  
[09:30:25] <icinga-wm>	 PROBLEM - puppet last run on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:30:26] <icinga-wm>	 PROBLEM - RAID on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:30:34] <icinga-wm>	 PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl  
[09:30:44] <icinga-wm>	 PROBLEM - Apache HTTP on mw1146 is CRITICAL: Connection timed out  
[09:30:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:30:54] <icinga-wm>	 PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner  
[09:30:55] <icinga-wm>	 PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:04] <icinga-wm>	 PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:04] <icinga-wm>	 PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:14] <icinga-wm>	 PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:15] <icinga-wm>	 PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:15] <icinga-wm>	 PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:15] <icinga-wm>	 PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:15] <icinga-wm>	 PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:24] <icinga-wm>	 PROBLEM - RAID on mw1121 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:31:24] <icinga-wm>	 PROBLEM - RAID on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:31:25] <icinga-wm>	 PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:25] <icinga-wm>	 PROBLEM - puppet last run on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:31:25] <icinga-wm>	 PROBLEM - puppet last run on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:31:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1208 is CRITICAL: Connection timed out  
[09:31:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1199 is CRITICAL: Connection timed out  
[09:31:34] <icinga-wm>	 RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl  
[09:31:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:35] <icinga-wm>	 PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:44] <icinga-wm>	 PROBLEM - Apache HTTP on mw1197 is CRITICAL: Connection timed out  
[09:31:45] <icinga-wm>	 PROBLEM - puppet last run on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:31:45] <icinga-wm>	 PROBLEM - DPKG on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:31:45] <icinga-wm>	 PROBLEM - Apache HTTP on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:45] <icinga-wm>	 PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1132 is CRITICAL: Connection timed out  
[09:31:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:54] <icinga-wm>	 PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:31:54] <icinga-wm>	 RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner  
[09:31:55] <icinga-wm>	 PROBLEM - Apache HTTP on mw1147 is CRITICAL: Connection timed out  
[09:32:04] <icinga-wm>	 PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:32:04] <icinga-wm>	 PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:32:14] <icinga-wm>	 PROBLEM - Apache HTTP on mw1193 is CRITICAL: Connection timed out  
[09:32:14] <icinga-wm>	 RECOVERY - RAID on mw1128 is OK: OK: no RAID installed  
[09:32:15] <icinga-wm>	 PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:32:15] <icinga-wm>	 RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 933 seconds ago with 0 failures  
[09:32:15] <icinga-wm>	 RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 360 seconds ago with 0 failures  
[09:32:15] <icinga-wm>	 PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:32:24] <icinga-wm>	 RECOVERY - RAID on mw1123 is OK: OK: no RAID installed  
[09:32:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.611 second response time  
[09:32:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:32:35] <icinga-wm>	 PROBLEM - check configured eth on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:32:44] <icinga-wm>	 RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 734 seconds ago with 0 failures  
[09:32:44] <icinga-wm>	 RECOVERY - DPKG on mw1147 is OK: All packages OK  
[09:32:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.243 second response time  
[09:32:44] <icinga-wm>	 RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)  
[09:32:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.220 second response time  
[09:32:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.235 second response time  
[09:32:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.985 second response time  
[09:32:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:32:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.124 second response time  
[09:33:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.227 second response time  
[09:33:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.252 second response time  
[09:33:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.127 second response time  
[09:33:05] <icinga-wm>	 RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.416 second response time  
[09:33:14] <icinga-wm>	 RECOVERY - RAID on mw1121 is OK: OK: no RAID installed  
[09:33:15] <icinga-wm>	 RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 1082 seconds ago with 0 failures  
[09:33:24] <icinga-wm>	 PROBLEM - RAID on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:33:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.320 second response time  
[09:33:35] <icinga-wm>	 PROBLEM - RAID on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:34:09] <_joe_>	 api again
[09:34:16] <icinga-wm>	 RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.125 second response time  
[09:34:16] <icinga-wm>	 RECOVERY - RAID on mw1147 is OK: OK: no RAID installed  
[09:34:24] <icinga-wm>	 RECOVERY - RAID on mw1122 is OK: OK: no RAID installed  
[09:34:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.889 second response time  
[09:34:35] <_joe_>	 mw1196 has a load of 171 :P
[09:35:05] <icinga-wm>	 RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time  
[09:35:14] <icinga-wm>	 RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.393 second response time  
[09:35:24] <icinga-wm>	 PROBLEM - RAID on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:35:24] <icinga-wm>	 PROBLEM - DPKG on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:35:24] <icinga-wm>	 PROBLEM - puppet last run on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:35:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.519 second response time  
[09:35:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[09:36:40] <icinga-wm>	 RECOVERY - RAID on mw1128 is OK: OK: no RAID installed  
[09:36:40] <icinga-wm>	 RECOVERY - DPKG on mw1128 is OK: All packages OK  
[09:36:40] <icinga-wm>	 RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 1173 seconds ago with 0 failures  
[09:36:40] <icinga-wm>	 RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.075 second response time  
[09:36:40] <icinga-wm>	 RECOVERY - check configured eth on mw1146 is OK: NRPE: Unable to read output  
[09:37:20] <icinga-wm>	 PROBLEM - DPKG on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:37:20] <icinga-wm>	 PROBLEM - RAID on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:37:24] <icinga-wm>	 RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.380 second response time  
[09:37:24] <icinga-wm>	 RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.378 second response time  
[09:37:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.853 second response time  
[09:37:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.874 second response time  
[09:37:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.210 second response time  
[09:37:34] <icinga-wm>	 RECOVERY - DPKG on mw1146 is OK: All packages OK  
[09:37:35] <icinga-wm>	 RECOVERY - RAID on mw1146 is OK: OK: no RAID installed  
[09:37:35] <icinga-wm>	 RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.175 second response time  
[09:37:35] <icinga-wm>	 RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.175 second response time  
[09:37:45] <icinga-wm>	 RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.093 second response time  
[09:37:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.190 second response time  
[09:37:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.410 second response time  
[09:37:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.557 second response time  
[09:37:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.581 second response time  
[09:37:55] <icinga-wm>	 RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.148 second response time  
[09:37:55] <icinga-wm>	 RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.504 second response time  
[09:37:55] <icinga-wm>	 RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.031 second response time  
[09:38:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.314 second response time  
[09:38:05] <icinga-wm>	 RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.254 second response time  
[09:38:06] <icinga-wm>	 RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.671 second response time  
[09:38:06] <icinga-wm>	 PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 40 data above and 0 below the confidence bounds  
[09:38:14] <icinga-wm>	 PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 40 data above and 0 below the confidence bounds  
[09:38:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.062 second response time  
[09:51:28] <MaxSem>	 _joe_, I see a lot of DB errors around that time
[09:52:05] <MaxSem>	 "Connection error: No working slave server: Unknown error (10.64.16.42)"
[09:55:28] <_joe_>	 MaxSem: mmmh what does that mean in the context of mediawiki?
[09:55:58] <MaxSem>	 appservers were waiting for slaves?
[09:56:34] <_joe_>	 oh it's es1004.eqiad.wmnet
[09:57:12] <_joe_>	 so maybe it's some problem with es, which will also mean the slowdown of expandtemplates makes sense
[09:57:35] <_joe_>	 springle_: around?
[09:58:21] <_joe_>	 MaxSem: eh, there was a huge spike of load on es1004
[09:58:27] <_joe_>	 thanks for spotting that
[09:58:39] <_joe_>	 I was searching for incoming traffic patterns
[09:58:39] <MaxSem>	 srsly, we shouldn't use mysql for that
[09:58:46] <_joe_>	 :))
[09:59:19] <_joe_>	 maybe cassandra with a cache layer in front...
[09:59:31] <MaxSem>	 es1003 has the same network spike
[09:59:36] <_joe_>	 or couchbase, IDK
[09:59:46] <_joe_>	 yes I'm looking at tendril
[09:59:55] <MaxSem>	 we already have plans about cassie
[10:00:23] <_joe_>	 I used couchbase for read-heavy payloads
[10:00:28] <_joe_>	 but mysql is not bad for this in general
[10:01:14] <grrrit-wm>	 (03CR) 10Hashar: "That fixed the display of ruby lint test results on https://gerrit.wikimedia.org/r/#/c/143591/" [puppet] - 10https://gerrit.wikimedia.org/r/156103 (owner: 10Hashar)