[00:00:00] legoktm: Hm.. that frwikt filter did check 'bot'
[00:00:04] how come it denied still
[00:00:14] I didn't give the fake account any rights
[00:00:24] $user->addGroup('bot')
[00:00:33] at the start of any maintenance script using a fake username
[00:00:37] RECOVERY - Host ms-fe1003 is UP: PING OK - Packet loss = 0%, RTA = 2.38 ms
[00:00:40] (03CR) 10Chad: "Heh, I've been calling it with sudo myself for ages. Glad it's finally gonna be fixed :p" [puppet] - 10https://gerrit.wikimedia.org/r/157013 (owner: 10Reedy)
[00:00:42] don't worry, doesn't need to exist.
[00:00:48] add that as well :)
[00:01:07] so the frwikt fix wasn't needed
[00:01:11] but enwikt will still fail
[00:01:42] we should also set $wgUser so the abusefilter log entries don't say 127.0.0.1
[00:01:59] Yep
[00:02:17] legoktm: for comparison, this is how I delete pages
[00:02:17] function kfDeleteDefault( $title, $reason, $user ) {
               if ( !$user ) { print "Invalid user\n"; return; }
               $user->addGroup( 'bot' );
               global $wgUser;
               $wgUser = $user;
               $title = Title::newFromText( $title );
               if ( !$title->exists() ) { print "Title [[$title]] not found\n"; return; }
               $dbw = wfGetDB( DB_MASTER );
               $dbw->begin( 'eval' );
               $page = WikiPage::factory( $title );
               $error = '';
               $success = $page->doDeleteArticle( $reason, false, 0, false, $error, $user );
               $dbw->commit( 'eval' );
               echo "Deleted [[$title]]\n";
               wfWaitForSlaves();
           }
[00:02:32] (if I have to resort to eval.php for sysadmin reasons)
[00:03:02] e.g.
[00:03:04] $title = 'Commons:Auto-protected files/wikipedia/zh/Archive 4';
[00:03:05] $user = User::newFromName( 'Maintenance script' );
[00:03:06] $reason = 'Delete page with over 5000 revisions (requested by Krinkle)';
[00:03:08] kfDeleteDefault( $title, $reason, $user );
[00:03:18] (03PS1) 10Andrew Bogott: Move scap files back to /usr [puppet] - 10https://gerrit.wikimedia.org/r/157014
[00:04:13] (03PS1) 10Dzahn: base monitoring - set hostgroups based on $cluster [puppet] - 10https://gerrit.wikimedia.org/r/157015
[00:04:19] !log shutting down ms-fe1004 to relocate racks
[00:04:25] Logged the message, Master
[00:05:48] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[00:05:58] PROBLEM - Host ms-fe1004 is DOWN: PING CRITICAL - Packet loss = 100%
[00:07:03] greg-g, (OuKB) : backports fixed Flow and Echo bugs on mediawiki.org
[00:07:13] whee
[00:13:36] (03CR) 10Dzahn: "group has been added fine but stays empty, which is a global, not codfw-specific problem. see https://gerrit.wikimedia.org/r/#/c/157015/1 " [puppet] - 10https://gerrit.wikimedia.org/r/157003 (owner: 10Dzahn)
[00:14:15] " # For unclear historic reasons, this box has a massive /a drive." :)
[00:19:50] (03CR) 10Dzahn: "we already have icinga-wm reporting all the Icinga notifications to IRC, is this an attempt to replace that? or just because you want the " [puppet] - 10https://gerrit.wikimedia.org/r/136095 (owner: 10Christopher Johnson (WMDE))
[00:21:58] RECOVERY - Host ms-fe1004 is UP: PING WARNING - Packet loss = 86%, RTA = 2.64 ms
[00:27:18] PROBLEM - Host ms-fe1004 is DOWN: PING CRITICAL - Packet loss = 100%
[00:27:26] !log restarting gmetad on nickel
[00:27:31] Logged the message, Master
[00:28:02] godog: still on ? ^ swift host down again?
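For reference, the fake-user setup discussed above boils down to a few lines at the top of a maintenance script. This is a minimal sketch assembled from the log itself (the account name follows the eval.php example above; per the 00:00:42 remark, the account does not need to exist):

    // Sketch only, using the 1.24-era MediaWiki API seen elsewhere in this log.
    $user = User::newFromName( 'Maintenance script' );
    // Satisfies AbuseFilter rules that check for the 'bot' group; for a
    // nonexistent account (id 0) this only changes the in-memory group list.
    $user->addGroup( 'bot' );
    // Attribute the edits to the fake user so the abuse filter log entries
    // don't say 127.0.0.1.
    global $wgUser;
    $wgUser = $user;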
[00:28:26] mutante: ye it is me and cmjohnson1 poking no worries :)
[00:28:42] just saw, ok
[00:29:04] recovery at 85% packet loss :p :)
[00:29:18] classy
[00:31:27] RECOVERY - Host ms-fe1004 is UP: PING OK - Packet loss = 0%, RTA = 2.77 ms
[00:32:34] !log repool ms-fe1004
[00:32:40] Logged the message, Master
[00:34:41] !log depool ms-fe1001
[00:34:47] Logged the message, Master
[00:38:47] !log shutting down ms-fe1001 for rack relocation
[00:38:54] Logged the message, Master
[00:39:41] (03PS1) 10Dzahn: add ganglia_new aggregator to install2001/codfw [puppet] - 10https://gerrit.wikimedia.org/r/157020
[00:40:58] PROBLEM - Host ms-fe1001 is DOWN: PING CRITICAL - Packet loss = 100%
[00:43:17] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Thu 28 Aug 2014 22:42:28 UTC
[00:45:47] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:46:22] uh, how do I view fatals in logstash? Under TYPES I see runJobs, Hadoop, etc., but not "exceptions" and "fatals". It ties with graphite for "awesome power, no idea how to work it".
[00:46:34] (03PS2) 10Dzahn: add ganglia_new aggregator to install2001/codfw [puppet] - 10https://gerrit.wikimedia.org/r/157020
[00:46:47] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 8.763 second response time
[00:46:47] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 39 data above and 0 below the confidence bounds
[00:46:47] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 39 data above and 0 below the confidence bounds
[00:49:37] i'm going to deploy a small update to wikimediaevents
[00:52:09] (03CR) 10Dzahn: [C: 032] add ganglia_new aggregator to install2001/codfw [puppet] - 10https://gerrit.wikimedia.org/r/157020 (owner: 10Dzahn)
[00:56:34] !log ori Synchronized php-1.24wmf19/extensions/WikimediaEvents: Ib44fe0898: Inject 'wgPoweredByHHVM' JS config var if powered by HHVM (duration: 00m 04s)
[00:56:40] Logged the message, Master
[00:57:41] Duplicate declaration: File[/usr/lib/ganglia/python_modules] is already declared ...grrrrr
[00:57:45] !log ori Synchronized php-1.24wmf18/extensions/WikimediaEvents: Ib44fe0898: Inject 'wgPoweredByHHVM' JS config var if powered by HHVM (duration: 00m 03s)
[00:57:51] Logged the message, Master
[00:58:40] ori: Hm.. I saw some CR earlier about wfishiphop instead of defined
[00:59:03] ori: did you mean to put it in page output instead of startup module?
[00:59:08] yes
[00:59:14] it's request-specific
[00:59:37] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: Epic puppet fail
[00:59:39] yeah, but startup is a request, too
[00:59:50] could've worked either way I suppose
[00:59:56] but caching is key
[01:00:09] bypasses cache or is fragmented by it?
[01:00:28] Krinkle: https://gerrit.wikimedia.org/r/#/c/152903/
[01:00:38] it's a lot to explain :P
[01:00:47] RECOVERY - Host ms-fe1001 is UP: PING OK - Packet loss = 0%, RTA = 5.48 ms
[01:01:11] ori: so it'd work for bits too, right?
[01:01:17] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC
[01:01:47] yes, but the var ought to describe what generated the page, not what generated the startup module
[01:01:47] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[01:01:52] because there's a chance that they're different
[01:02:01] !log repool ms-fe1001
[01:02:05] ah, and it's not the same cookie for both
[01:02:08] Logged the message, Master
[01:02:08] gotcha
[01:02:18] since varnish will route requests to the zend backends if the hhvm backends are sick
[01:02:23] it's the same cookie
[01:02:31] no
[01:02:40] if you set cookie hhvm on enwiki, your bits request won't be hhvm
[01:03:12] oh, that's what you meant. yes, right.
[01:03:20] that's an even better point
[01:03:29] since it's more likely than the possibility i mentioned above
[01:03:34] yeah
[01:03:37] good :)
[01:03:55] using wfIsHHVM() is probably neater, yeah
[01:04:10] (03PS1) 10Dzahn: do not set $cluster on install2001 [puppet] - 10https://gerrit.wikimedia.org/r/157022
[01:04:17] !log depool ms-fe1002
[01:04:24] Logged the message, Master
[01:05:16] (03CR) 10Dzahn: [C: 032] do not set $cluster on install2001 [puppet] - 10https://gerrit.wikimedia.org/r/157022 (owner: 10Dzahn)
[01:06:07] Krinkle: do you have a javascript handy for superimposing some image on the interface if a variable is set? (for a user script, not something that would actually be forced on anyone)
[01:06:18] *a javascript snippet, even
[01:06:30] i guess it's easy enough to write one
[01:06:47] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[01:06:58] !log shutting down ms-fe1002 to relocate racks
[01:07:04] Logged the message, Master
[01:07:17] if ( mw.config.get( 'wgPoweredByHHVM' ) ) { $( '#p-logo' ).css( 'background-color', 'pink' ); }
[01:07:17] ori:
[01:07:27] I tend to use logo background color for different temp purposes
[01:07:52] stick it into your global.js :)
[01:08:07] weee! :)
[01:09:26] Hm.. ukwikimedia is still being iterated over
[01:09:27] PROBLEM - Host ms-fe1002 is DOWN: PING CRITICAL - Packet loss = 100%
[01:09:31] isn't that off the cluster now?
[01:09:46] https://wikimedia.org.uk/wiki/User:Krinkle/common.css
[01:09:55] it's still on the cluster, but there's a redirect in place
[01:10:01] right
[01:10:03] inaccessible
[01:10:16] (well, short of /etc/hosts)
[01:10:25] I'm sure the servers still respond to it :P
[01:10:54] Hm.. nah, redirects still catch it
[01:11:44] Krinkle, so we're not serving it now?
[01:11:51] indeed
[01:11:56] legoktm: https://wikimania2009.wikimedia.org/wiki/User:Krinkle/common.js didn't match pattern
[01:11:59] But it's not been marked as closed?
[01:12:00] mw.loader.load('http://meta.wikimedia.org/w/index.php?title=User:Krinkle/global.js&action=raw&ctype=text/javascript','text/javascript');
[01:12:04] interesting second argument
[01:12:07] that's valid
[01:12:12] very rare
[01:12:13] no idea what I was doing
[01:12:39] the regex wasn't expecting the second argument
[01:14:10] Krenair: Yeah, it shouldn't be iterated over by maintenance scripts any more
[01:14:19] legoktm: btw, what does "does not load global modules on this wiki" mean
[01:14:30] wikimania2015wiki: Krinkle does not load global modules on this wiki.
[01:14:33] hm, wikimania 2008-2012 wikis still point to wikimania 2013 as being 'future'
[01:14:51] aawiktionary: Krinkle does not load global modules on this wiki.
[01:14:56] It's on closed.dblist
[01:15:08] right
[01:15:14] Krinkle: means the account doesn't exist locally or isn't attached in CA.
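Tying the wgPoweredByHHVM thread above together: the conclusion was to emit the variable with the page output, where it describes the backend that actually generated the page, rather than baking it into the cached startup module. A hedged sketch of that pattern, using the MakeGlobalVariablesScript hook and the wfIsHHVM() helper suggested at 01:03:55 (illustrative only, not the literal contents of change Ib44fe0898):

    // Sketch: MakeGlobalVariablesScript vars are serialized into each page's
    // HTML, so they track the backend (HHVM or Zend) that served that request,
    // even when Varnish routes the page and bits requests differently.
    $wgHooks['MakeGlobalVariablesScript'][] = function ( array &$vars ) {
        $vars['wgPoweredByHHVM'] = wfIsHHVM();
        return true;
    };

With the variable in place, the user-script idea from 01:07:17 works as written: mw.config.get( 'wgPoweredByHHVM' ) reflects the page that was actually rendered.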
[01:15:16] or did I just not autocreate on those
[01:15:19] ic
[01:15:25] if ( !$user->getId() || !GlobalCssJsHooks::loadForUser( $user ) ) {
[01:15:25] 	$this->output( "$userName does not load global modules on this wiki.\n" );
[01:16:13] the script should only delete what globalcssjs would replace; if your account isn't attached in CA, it won't load, hence no deletion
[01:17:39] legoktm: yeah, and no local js/css pages if there's no account
[01:18:07] unless some other privileged user created them for them without checking the account first
[01:26:47] RECOVERY - Host ms-fe1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[01:27:10] !log repool ms-fe1002
[01:27:17] Logged the message, Master
[01:31:27] (03PS1) 10Dzahn: install2001 - just incl. base, not standard [puppet] - 10https://gerrit.wikimedia.org/r/157026
[01:31:42] (03PS2) 10Dzahn: install2001 - just incl. base, not standard [puppet] - 10https://gerrit.wikimedia.org/r/157026
[01:31:44] legoktm: ori: btw, submodule commits my style: https://gist.github.com/Krinkle/479399ac9a11e9ff8b62
[01:32:28] (03CR) 10Dzahn: [C: 032] install2001 - just incl. base, not standard [puppet] - 10https://gerrit.wikimedia.org/r/157026 (owner: 10Dzahn)
[01:34:37] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[01:41:49] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[01:42:56] (03CR) 10Dzahn: "Aug 29 01:34:07 install2001 puppet-agent[23260]: (/Stage[main]/Ganglia_new::Monitor::Aggregator/Service[ganglia-monitor-aggregator]) Unsch" [puppet] - 10https://gerrit.wikimedia.org/r/157026 (owner: 10Dzahn)
[01:46:35] (03PS1) 10Springle: depool db1070 for maintenance. pool db1072 in place. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157030
[01:48:14] (03CR) 10Springle: [C: 032] depool db1070 for maintenance. pool db1072 in place. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157030 (owner: 10Springle)
[01:48:19] (03Merged) 10jenkins-bot: depool db1070 for maintenance. pool db1072 in place. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157030 (owner: 10Springle)
[01:49:37] !log springle Synchronized wmf-config/db-eqiad.php: depool db1070. pool db1072. (duration: 00m 06s)
[01:49:46] Logged the message, Master
[02:04:59] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3871 MB (3% inode=99%):
[02:06:26] (03CR) 10Andrew Bogott: [C: 032] Move glance images to /a where there's more room. [puppet] - 10https://gerrit.wikimedia.org/r/157012 (owner: 10Andrew Bogott)
[02:07:16] (03PS2) 10Andrew Bogott: Move scap files back to /usr [puppet] - 10https://gerrit.wikimedia.org/r/157014
[02:07:50] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
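For context on the 01:15:25 snippet quoted above, here is how that guard plausibly sits in the GlobalCssJs cleanup script. Everything other than the two quoted lines is a hedged reconstruction, not the actual script:

    // Hedged reconstruction; only the if-condition and the output line are
    // quoted from the log, the rest is assumed scaffolding.
    $user = User::newFromName( $userName );
    if ( !$user->getId() || !GlobalCssJsHooks::loadForUser( $user ) ) {
        // Account doesn't exist locally or isn't attached in CentralAuth:
        // global modules never load here, so delete nothing.
        $this->output( "$userName does not load global modules on this wiki.\n" );
        return;
    }
    // Only past this point is it safe to delete redundant local JS/CSS pages,
    // since the global versions will load in their place.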
[02:08:59] RECOVERY - Disk space on virt0 is OK: DISK OK
[02:09:20] (03CR) 10Andrew Bogott: [C: 032] Move scap files back to /usr [puppet] - 10https://gerrit.wikimedia.org/r/157014 (owner: 10Andrew Bogott)
[02:10:49] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[02:10:59] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl
[02:11:49] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[02:11:59] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl
[02:17:03] (03PS1) 10Dzahn: ganglia-aggregators on install2001 as data sources [puppet] - 10https://gerrit.wikimedia.org/r/157033
[02:18:00] (03PS2) 10Dzahn: ganglia-aggregators on install2001 as data sources [puppet] - 10https://gerrit.wikimedia.org/r/157033
[02:19:39] (03CR) 10Dzahn: [C: 032] "using it like hooft is used" [puppet] - 10https://gerrit.wikimedia.org/r/157033 (owner: 10Dzahn)
[02:30:09] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Fri Aug 29 02:30:04 UTC 2014
[02:31:10] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 2 failures
[02:38:23] !log LocalisationUpdate completed (1.24wmf18) at 2014-08-29 02:37:20+00:00
[02:38:31] Logged the message, Master
[02:42:12] (03PS1) 10Springle: reassign db1070 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/157035
[02:48:43] !log springle Synchronized wmf-config/db-eqiad.php: depool db1070. pool db1072. (duration: 00m 07s)
[02:48:50] Logged the message, Master
[02:51:49] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[02:52:00] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl
[02:54:49] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[02:54:59] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl
[03:01:29] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC
[03:02:09] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[03:12:13] !log LocalisationUpdate completed (1.24wmf19) at 2014-08-29 03:10:26+00:00
[03:20:46] Logged the message, Master
[03:21:36] (03CR) 10Springle: [C: 032] reassign db1070 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/157035 (owner: 10Springle)
[03:29:53] !log springle Synchronized wmf-config/db-eqiad.php: reduce db1056 load while cloning (duration: 00m 06s)
[03:34:10] Logged the message, Master
[03:34:12] !log xtrabackup clone db1056 to db1070
[03:34:14] Logged the message, Master
[04:01:04] (03CR) 10Withoutaname: "Mainly tried to move only the group permissions. Some configuration settings not related to permissions changes were moved into CommonSett" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/156081 (https://bugzilla.wikimedia.org/58247) (owner: 10Withoutaname)
[04:14:09] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Aug 29 04:13:03 UTC 2014 (duration 13m 2s)
[04:14:15] Logged the message, Master
[04:20:47] PROBLEM - Disk space on elastic1009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19635 MB (3% inode=99%):
[04:29:47] RECOVERY - Disk space on elastic1009 is OK: DISK OK
[04:44:26] (03CR) 10Hoo man: "I'm not actually a fan of this as I actually like the one file per extension configuration approach. Also what this does seems a little me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/156081 (https://bugzilla.wikimedia.org/58247) (owner: 10Withoutaname)
[05:01:54] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC
[06:28:04] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:34] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:45] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:14] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:14] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:14] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:14] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:54] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:44] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:33:28] !log springle Synchronized wmf-config/db-eqiad.php: return db1056 to normal load (duration: 00m 06s)
[06:33:34] Logged the message, Master
[06:38:14] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:45:34] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:46:14] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:46:14] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:46:15] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:46:15] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:46:54] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:47:04] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:49:15] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:51:24] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[06:51:24] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[06:56:14] RECOVERY - puppet last run on db1027 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[07:00:22] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Already being done in monitor_host definition. manifests/nagios.pp:23. Perhaps some nodes are missing the $cluster variable ?" [puppet] - 10https://gerrit.wikimedia.org/r/157015 (owner: 10Dzahn)
[07:02:54] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC
[07:04:34] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Epic puppet fail
[07:05:04] RECOVERY - Puppet freshness on mw1053 is OK: puppet ran at Fri Aug 29 07:04:55 UTC 2014
[07:05:20] <_joe_> !log re-enabling puppet on the jobrunner, to check if the luasandbox fix works
[07:05:24] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:05:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[07:05:26] Logged the message, Master
[07:06:34] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[07:07:14] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[07:17:40] (03CR) 10Ori.livneh: "I'm not sure I understand the documentation. What does this do that https://github.com/puppetlabs/hiera/blob/master/lib/hiera/backend/yaml" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/151869 (owner: 10Giuseppe Lavagetto)
[07:26:40] (03CR) 10Giuseppe Lavagetto: "@ori:" [puppet] - 10https://gerrit.wikimedia.org/r/151869 (owner: 10Giuseppe Lavagetto)
[07:38:13] (03PS1) 10Alexandros Kosiaris: Add DNS views in ganglia [puppet] - 10https://gerrit.wikimedia.org/r/157045
[07:53:55] (03CR) 10Ori.livneh: [C: 031] "It would suck to have to squish all variable data into a single YAML file, so I see the value. It's also similar to the pattern used in PH" [puppet] - 10https://gerrit.wikimedia.org/r/151869 (owner: 10Giuseppe Lavagetto)
[08:04:49] !log Jenkins: in the jenkins-job-builder-config branch 'cloudbees' has been merged in 'master'. Unifying CI and browser tests jobs!  \O/
[08:04:55] Logged the message, Master
[08:04:55] morebots: come on
[08:04:56] I am a logbot running on tools-exec-06.
[08:04:56] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[08:04:56] To log a message, type !log <msg>.
[08:14:19] anyone around familiar with Zero X-CS X-CS2 headers by any chance ?
[08:14:22] got some questions :D
[08:30:54] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[08:31:34] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl
[08:31:54] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[08:32:34] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl
[08:47:17] <_joe_> hashar_: can't say I'm familiar, but I've seen those
[08:47:25] <_joe_> working on varnish
[08:47:39] _joe_: I found out we have some super fun X-CS header for Zero :]
[08:47:53] I am playing with the browser tests and found out what I needed
[08:50:30] <_joe_> yes we do indeed
[08:50:49] <_joe_> look at zero.inc.vcl.erb in the varnish module :)
[08:50:54] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[08:51:49] _joe_: also I noticed overnight that the poor puppet compiler instance has a full disk :(
[08:51:54] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[08:51:59] <_joe_> hashar_: again?
[08:52:01] <_joe_> grr
[08:52:05] yeah /tmp filled up
[08:52:10] <_joe_> ok I'll fix that as well
[08:52:11] (03PS3) 10Giuseppe Lavagetto: beta: manage virtualhosts via puppet [puppet] - 10https://gerrit.wikimedia.org/r/156762
[08:52:16] <_joe_> it's not tmp alone
[08:52:43] <_joe_> I'll fix it
[08:53:55] _joe_: the instance should probably have extended disk enabled with role::labs::lvm::srv and scripts made to point to something like /srv/tmp :D
[08:56:11] <_joe_> no
[08:56:13] <_joe_> :)
[08:56:28] <_joe_> I think I know the best way to manage a filesystem
[08:56:46] <_joe_> and let's say I'm not super-fond of how we do that in labs in general
[08:57:04] <_joe_> so I'm going to do that on my own
[08:58:33] if you have a better proposal, I am sure labs users would love a fix :D
[08:58:50] <_joe_> oh no I don't, I'm just complaining
[08:58:59] <_joe_> I'm a grumpy, old opsen
[08:59:02] <_joe_> :)
[08:59:42] <_joe_> or better, I would love to, but I have nooo time
[09:30:17] PROBLEM - Apache HTTP on mw1206 is CRITICAL: Connection timed out
[09:30:25] PROBLEM - puppet last run on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:30:26] PROBLEM - RAID on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:30:34] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl
[09:30:44] PROBLEM - Apache HTTP on mw1146 is CRITICAL: Connection timed out
[09:30:54] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:30:54] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[09:30:55] PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:04] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:04] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:14] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:15] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:15] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:15] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:15] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:24] PROBLEM - RAID on mw1121 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:24] PROBLEM - RAID on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:25] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:25] PROBLEM - puppet last run on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:25] PROBLEM - puppet last run on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:34] PROBLEM - Apache HTTP on mw1208 is CRITICAL: Connection timed out
[09:31:34] PROBLEM - Apache HTTP on mw1199 is CRITICAL: Connection timed out
[09:31:34] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl
[09:31:34] PROBLEM - Apache HTTP on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:34] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:34] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:34] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:35] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:44] PROBLEM - Apache HTTP on mw1197 is CRITICAL: Connection timed out
[09:31:45] PROBLEM - puppet last run on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:45] PROBLEM - DPKG on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:45] PROBLEM - Apache HTTP on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:45] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:54] PROBLEM - Apache HTTP on mw1132 is CRITICAL: Connection timed out
[09:31:54] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:54] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:54] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:54] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:54] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:54] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[09:31:55] PROBLEM - Apache HTTP on mw1147 is CRITICAL: Connection timed out
[09:32:04] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:32:04] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:32:14] PROBLEM - Apache HTTP on mw1193 is CRITICAL: Connection timed out
[09:32:14] RECOVERY - RAID on mw1128 is OK: OK: no RAID installed
[09:32:15] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:32:15] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 933 seconds ago with 0 failures
[09:32:15] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 360 seconds ago with 0 failures
[09:32:15] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:32:24] RECOVERY - RAID on mw1123 is OK: OK: no RAID installed
[09:32:34] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.611 second response time
[09:32:34] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:32:35] PROBLEM - check configured eth on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:32:44] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 734 seconds ago with 0 failures
[09:32:44] RECOVERY - DPKG on mw1147 is OK: All packages OK
[09:32:44] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.243 second response time
[09:32:44] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[09:32:44] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.220 second response time
[09:32:44] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.235 second response time
[09:32:44] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.985 second response time
[09:32:54] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:32:54] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.124 second response time
[09:33:04] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.227 second response time
[09:33:04] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.252 second response time
[09:33:04] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.127 second response time
[09:33:05] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.416 second response time
[09:33:14] RECOVERY - RAID on mw1121 is OK: OK: no RAID installed
[09:33:15] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 1082 seconds ago with 0 failures
[09:33:24] PROBLEM - RAID on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:33:34] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.320 second response time
[09:33:35] PROBLEM - RAID on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:34:09] <_joe_> api again
[09:34:16] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.125 second response time
[09:34:16] RECOVERY - RAID on mw1147 is OK: OK: no RAID installed
[09:34:24] RECOVERY - RAID on mw1122 is OK: OK: no RAID installed
[09:34:34] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.889 second response time
[09:34:35] <_joe_> mw1196 has a load of 171 :P
[09:35:05] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time
[09:35:14] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.393 second response time
[09:35:24] PROBLEM - RAID on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:35:24] PROBLEM - DPKG on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:35:24] PROBLEM - puppet last run on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:35:34] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.519 second response time
[09:35:34] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:36:40] RECOVERY - RAID on mw1128 is OK: OK: no RAID installed
[09:36:40] RECOVERY - DPKG on mw1128 is OK: All packages OK
[09:36:40] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 1173 seconds ago with 0 failures
[09:36:40] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.075 second response time
[09:36:40] RECOVERY - check configured eth on mw1146 is OK: NRPE: Unable to read output
[09:37:20] PROBLEM - DPKG on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:37:20] PROBLEM - RAID on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:37:24] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.380 second response time
[09:37:24] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.378 second response time
[09:37:34] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.853 second response time
[09:37:34] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.874 second response time
[09:37:34] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.210 second response time
[09:37:34] RECOVERY - DPKG on mw1146 is OK: All packages OK
[09:37:35] RECOVERY - RAID on mw1146 is OK: OK: no RAID installed
[09:37:35] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.175 second response time
[09:37:35] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.175 second response time
[09:37:45] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.093 second response time
[09:37:54] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.190 second response time
[09:37:54] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.410 second response time
[09:37:54] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.557 second response time
[09:37:54] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.581 second response time
[09:37:55] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.148 second response time
[09:37:55] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.504 second response time
[09:37:55] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.031 second response time
[09:38:04] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.314 second response time
[09:38:05] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.254 second response time
[09:38:06] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.671 second response time
[09:38:06] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 40 data above and 0 below the confidence bounds
[09:38:14] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 40 data above and 0 below the confidence bounds
[09:38:15] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.062 second response time
[09:51:28] _joe_, I see a lot of DB errors around that time
[09:52:05] "Connection error: No working slave server: Unknown error (10.64.16.42)"
[09:55:28] <_joe_> MaxSem: mmmh what does that mean in the context of mediawiki?
[09:55:58] appservers were waiting for slaves?
[09:56:34] <_joe_> oh it's es1004.eqiad.wmnet
[09:57:12] <_joe_> so maybe it's some problem with es, which will also mean the slowdown of expandtemplates makes sense
[09:57:35] <_joe_> springle_: around?
[09:58:21] <_joe_> MaxSem: eh, there was a huge spike of load on es1004
[09:58:27] <_joe_> thanks for spotting that
[09:58:39] <_joe_> I was searching for incoming traffic patterns
[09:58:39] srsly, we shouldn't use mysql for that
[09:58:46] <_joe_> :))
[09:59:19] <_joe_> maybe cassandra with a cache layer in front...
[09:59:31] es1003 has the same network spike
[09:59:36] <_joe_> or couchbase, IDK
[09:59:46] <_joe_> yes I'm looking at tendril
[09:59:55] we already have plans about cassie
[10:00:23] <_joe_> I used couchbase for read-heavy payloads
[10:00:28] <_joe_> but mysql is not bad for this in general
[10:01:14] (03CR) 10Hashar: "That fixed the display of ruby lint test results on https://gerrit.wikimedia.org/r/#/c/143591/" [puppet] - 10https://gerrit.wikimedia.org/r/156103 (owner: 10Hashar)
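A closing note on the 09:52:05 error MaxSem quoted: "No working slave server" is what the 1.24-era MediaWiki load balancer reports when no replica in a cluster (here the external storage servers es1003/es1004) will accept a connection. A sketch of how it surfaces to calling code, with the cluster name made up for illustration:

    // Sketch only, assuming the 1.24-era Database API; 'cluster25' is a
    // hypothetical external storage cluster name.
    try {
        $lb = wfGetLBFactory()->getExternalLB( 'cluster25' );
        // Throws DBConnectionError ("No working slave server") when every
        // replica in the cluster is unreachable or overloaded.
        $dbr = $lb->getConnection( DB_SLAVE );
    } catch ( DBConnectionError $e ) {
        wfDebugLog( 'dberror', $e->getMessage() );
        throw $e;
    }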