[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151218T0000). Please do the needful. [00:00:05] ebernhardson Krinkle jdlrobson legoktm: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:08] !log restarted mathoid on sca1001 [00:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:00:26] I'll do it [00:00:50] o/ [00:01:39] (03CR) 10Catrope: [C: 032] Remove unused $wgObjectCaches['resourceloader'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258165 (owner: 10Krinkle) [00:01:48] (03CR) 10Catrope: [C: 032] $wmfUdp2logDest: replace IPs with hostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251647 (owner: 10Ori.livneh) [00:02:04] you ruined my streak of unscheduled deploys [00:02:24] (03Merged) 10jenkins-bot: Remove unused $wgObjectCaches['resourceloader'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258165 (owner: 10Krinkle) [00:02:24] lol, how dare I adhere to the schedule [00:02:30] ori, if the streak goes on too long, at a certain point it becomes scheduled. ;) [00:02:35] by deploying your own change :D [00:02:48] heh [00:02:51] RoanKattouw: Please sync to mw1017 first just in case (both of those) [00:02:53] If you haven't already [00:03:00] I haven't yet but I will [00:03:16] Wanna make sure it won't cause a notice [00:03:16] I do that by doing sync-common on mw1017, right? [00:03:19] Yeah [00:03:26] it'd be nice to have a scap task for that [00:03:29] (03Merged) 10jenkins-bot: $wmfUdp2logDest: replace IPs with hostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251647 (owner: 10Ori.livneh) [00:03:30] so it could easily be done from tin [00:03:51] ori: https://phabricator.wikimedia.org/T121597 [00:04:26] meh, ssh mw1017 sync-common and sudo -u mwdeploy ssh mw1017 sync-common don't work [00:04:52] * RoanKattouw stops being lazy and opens another window [00:04:52] SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mw1017 sync-common [00:04:53] Hah! That worked! Thanks Krenair [00:05:07] Krinkle: Deployed on mw1017, please verify [00:05:29] OK. Checking [00:07:12] RoanKattouw: OK. I think it looks fine. [00:07:24] But testwiki.log and our error logs in general are too spammy to be certain it didn't add introduce something [00:07:30] (on fluroine) [00:08:26] !log catrope@tin Synchronized wmf-config/CommonSettings.php: SWAT: cleanup (duration: 00m 30s) [00:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:08:54] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [00:12:03] (03PS1) 10Yuvipanda: ores: Split cache and queue redis [puppet] - 10https://gerrit.wikimedia.org/r/259897 (https://phabricator.wikimedia.org/T121658) [00:12:13] halfak: ^ [00:12:23] (03CR) 10jenkins-bot: [V: 04-1] ores: Split cache and queue redis [puppet] - 10https://gerrit.wikimedia.org/r/259897 (https://phabricator.wikimedia.org/T121658) (owner: 10Yuvipanda) [00:12:25] ebernhardson: Around for SWAT? [00:12:36] halfak: 512M for the queue redis, 3G for cache [00:12:49] Nice. [00:12:50] (03PS2) 10Yuvipanda: ores: Split cache and queue redis [puppet] - 10https://gerrit.wikimedia.org/r/259897 (https://phabricator.wikimedia.org/T121658) [00:13:01] ori: ^ multiinstance outside of mw :D [00:13:33] nice nice! [00:13:33] WOOO. 
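A minimal sketch of how the split Yuvi describes above (a small "queue" Redis and a larger "cache" Redis on ores-redis-01) could be smoke-checked once the puppet change lands. The ports are an assumption inferred from the later "79 for queue and 80 for cache" remark, i.e. 6379 and 6380:

    # Assumed layout: queue instance on 6379 (~512M), cache instance on 6380 (~3G).
    # redis-cli reports maxmemory in bytes.
    for port in 6379 6380; do
        echo "== redis instance on port ${port} =="
        redis-cli -p "${port}" ping                 # expect PONG
        redis-cli -p "${port}" config get maxmemory
    done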
Merging configs works exactly as intended *AND* now it's easier to test on local machine [00:14:00] halfak: :D [00:14:15] halfak: I'm going to merge my change and go to town on staging. I guess it doesn't matter if it dies? [00:14:25] RoanKattouw: yea [00:14:47] YuviPanda, that's right. Nothing is depending on staging right now [00:15:19] So, now how do we get the contextual config files up? [00:15:22] halfak: so I suggest that I just merge it and setup the two instances for staging and see how that goes? [00:15:32] halfak: hmm, good point. if you commit a sample file I'll make a puppet patch [00:15:34] Oh. Sure [00:16:10] YuviPanda, I was thinking that we could do the "available/" "enabled/" with symlinks dance. [00:16:21] meh or yeah? [00:16:39] halfak: hmm, I think we shouldn't do that until the first time we go 'I want to keep this config file around but not enable it just yet' [00:16:42] RoanKattouw: should be an easy one, just flips a flag in cirrussearch [00:16:44] we hardly use sites-available, for example [00:16:55] and for most of the time they're the exact same [00:16:57] YuviPanda, that's my local dev simplification file :) [00:16:59] so let's not do that until we need to [00:17:02] !log catrope@tin Synchronized php-1.27.0-wmf.9/extensions/MobileFrontend: SWAT: Schema:MobileWebSectionUsage: always log the isTestA field (duration: 00m 31s) [00:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:17:13] halfak: ah, I see. but we don't need to do that in production, no? [00:17:14] jdlrobson: ---^^ [00:17:33] Yeah. Part of my deployment cycle is local testing before staging. [00:17:37] RoanKattouw: checking [00:17:38] But I could just not check it in [00:17:42] * YuviPanda nods [00:17:46] OK until I have to work from another machine. [00:17:51] RoanKattouw: unfortunately will take a while to check [00:18:04] (03PS3) 10Yuvipanda: ores: Split cache and queue redis [puppet] - 10https://gerrit.wikimedia.org/r/259897 (https://phabricator.wikimedia.org/T121658) [00:18:10] But yeah. I'm happy to keep it simple. [00:18:25] halfak: hmm, are you locally testing from ores-wikimedia-config? [00:18:32] !log catrope@tin Synchronized php-1.27.0-wmf.9/resources/src/mediawiki.messagePoster/mediawiki.messagePoster.factory.js: SWAT: fix error in messagePoster (duration: 00m 29s) [00:18:32] yeah. [00:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:18:49] legoktm: ---^^ [00:18:54] * legoktm tests [00:18:54] hmm, yeah, doing that dance might be a bit strange [00:19:01] I don't like to debug against staging because I might affect the environment. Happened once before. [00:19:13] halfak: I can make it work if need be, but would prefer to not right now. (the symlink stuff) [00:19:16] matt_flaschen: We need to do the Nuke patch first, then the Flow patch, right? [00:19:21] but if you feel strongly about it I'm totally up for doing it [00:19:26] YuviPanda, OK. I'll just not check it in. [00:19:27] RoanKattouw, correct. [00:19:41] So next Q. Do we need more than one config file to merge? [00:19:52] Or are we still going to do the ores-redis hack? 
[00:20:11] halfak: we definitely need to split 'connections' and 'config' if we want to use two instances (for cache vs queue) [00:20:22] halfak: hmm, actually, we can do that with ores-redis-queue [00:20:22] !log catrope@tin Synchronized php-1.27.0-wmf.9/extensions/Nuke/: SWAT: Nuke support in Flow, part 1 (duration: 00m 30s) [00:20:24] and ores-redis-cache [00:20:25] RoanKattouw: still not working, I'm going to wait 5 minutes [00:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:20:30] halfak: so I can do that too, yeah. [00:20:38] YuviPanda, actually we don't since cache and queue are two different parts of the config. [00:20:46] (03CR) 10Yuvipanda: [C: 032] ores: Split cache and queue redis [puppet] - 10https://gerrit.wikimedia.org/r/259897 (https://phabricator.wikimedia.org/T121658) (owner: 10Yuvipanda) [00:20:55] (03CR) 10Catrope: [C: 032] Enable completion suggester beta on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259701 (https://phabricator.wikimedia.org/T119989) (owner: 10DCausse) [00:21:05] halfak: no, the split is so we can vary it for staging vs prod [00:21:12] Gotcha. OK. [00:21:40] So, I'll gist my local simplification and commit everything else. [00:21:46] * YuviPanda nods [00:21:48] (03Merged) 10jenkins-bot: Enable completion suggester beta on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259701 (https://phabricator.wikimedia.org/T119989) (owner: 10DCausse) [00:22:50] ebernhardson: Yours is going out now [00:22:53] Then Flow+Nuke part 2 [00:22:59] * Deskana waits patiently [00:23:13] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: enable completion suggester beta on all wikis except wikidata (duration: 00m 29s) [00:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:23:44] !log catrope@tin Synchronized wmf-config/CirrusSearch-production.php: SWAT: enable completion suggester beta on all wikis except wikidata (duration: 00m 30s) [00:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:24:19] Deskana: done ---^^ [00:25:05] RoanKattouw: thanks [00:25:11] https://gist.github.com/halfak/53e65cbbb525b5ee2da6 [00:25:13] RoanKattouw: confirmed working [00:25:21] YuviPanda, ^ [00:25:23] sadly can't know 100% until EventLogging tables catch up [00:25:24] I'm not seeing it on enwiki. [00:25:27] thanks for the help [00:25:33] Deskana: me either :S [00:25:34] !log catrope@tin Synchronized php-1.27.0-wmf.9/extensions/Flow: SWAT: Nuke support for Flow, part 2 (duration: 00m 32s) [00:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:25:44] YuviPanda, see also https://github.com/wiki-ai/ores-wikimedia-config/tree/master/config [00:25:52] matt_flaschen: ---^^ could you see if that's working? [00:25:56] YuviPanda, and https://github.com/wiki-ai/ores-wikimedia-config/blob/master/ores_wsgi.py#L10 [00:25:57] RoanKattouw: i think you might have to re-sync initialiseSettings.php, [00:26:03] ugh right of course [00:26:05] RoanKattouw: caching always bites me ... [00:26:35] halfak: ok, figuring out the puppet stuff right now let me look at it in a moment [00:26:52] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: grumble grumble touch InitialiseSettings grumble (duration: 00m 30s) [00:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:27:24] RoanKattouw: working now [00:28:01] YuviPanda, cool. 
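A sketch of the "touch and re-sync" workaround used above when a synced config change is hidden by the cached InitialiseSettings.php; the staging path and the use of scap's sync-file from the deployment host are assumptions:

    # Run on the deployment host (path assumed to be the scap staging dir).
    cd /srv/mediawiki-staging
    # Bump the mtime so the config cache is invalidated, then push the file out.
    touch wmf-config/InitialiseSettings.php
    sync-file wmf-config/InitialiseSettings.php 'touch InitialiseSettings to bust the config cache'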
I'll hold tight [00:28:39] RoanKattouw: hmm mine still not showing up in beta features list...debugging. [00:28:52] oh duh, its the whitelist [00:28:55] ebernhardson: Did you add it to the .... yes [00:29:10] incoming with another patch [00:29:30] 7Blocked-on-Operations, 6operations, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1889443 (10ori) >>! In T119637#1884977, @faidon wrote: > Status update: > - As of late last week, we have 3.11 packages prepared *in the upstream Debian pkg-hhvm repository* (t... [00:30:32] (03PS1) 10EBernhardson: Add Cirrus completion suggester to beta feature whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259899 [00:30:33] RoanKattouw: ^ [00:30:39] RoanKattouw, it doesn't work with the production cluster setup. :( I'll fix. [00:31:21] (03PS1) 10Yuvipanda: ores: Create the redis data folders manually [puppet] - 10https://gerrit.wikimedia.org/r/259900 [00:31:38] (03CR) 10Catrope: [C: 032] Add Cirrus completion suggester to beta feature whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259899 (owner: 10EBernhardson) [00:31:39] 7Blocked-on-Operations, 6operations, 6Performance-Team, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1889458 (10ori) [00:31:50] ebernhardson: You know, if you'd done that the first time, we wouldn't have had the caching problem either :P [00:32:02] since that patch touches InitialiseSettings.php [00:32:14] RoanKattouw: :P [00:32:26] (03Merged) 10jenkins-bot: Add Cirrus completion suggester to beta feature whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259899 (owner: 10EBernhardson) [00:33:55] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) [00:34:48] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Add completion suggester to BetaFeatures whitelist (duration: 00m 30s) [00:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:35:00] RoanKattouw, ebernhardson: It's aliiiiive! [00:35:26] RoanKattouw, ebernhardson: Thanks, gents. [00:35:30] (03CR) 10Yuvipanda: [C: 032] ores: Create the redis data folders manually [puppet] - 10https://gerrit.wikimedia.org/r/259900 (owner: 10Yuvipanda) [00:35:42] RoanKattouw, https://gerrit.wikimedia.org/r/259924 [00:35:53] Sorry, Vagrant uses single-cluster, but I should have seen it in code review. [00:36:22] RoanKattouw: it looks to be happy now, thanks! [00:36:26] All Nuke (not just Flow) is broken until that's deployed. [00:38:04] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. 
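For context, a rough sketch of how a follow-up fix like https://gerrit.wikimedia.org/r/259924 would normally be pulled onto the deployed branch and synced; the patchset number, paths and sync message below are illustrative, not what was actually run:

    # Illustrative SWAT flow only (patchset "1" and the sync message are hypothetical).
    cd /srv/mediawiki-staging/php-1.27.0-wmf.9/extensions/Flow
    git fetch https://gerrit.wikimedia.org/r/mediawiki/extensions/Flow refs/changes/24/259924/1
    git cherry-pick FETCH_HEAD
    cd /srv/mediawiki-staging
    sync-dir php-1.27.0-wmf.9/extensions/Flow 'SWAT: Nuke/Flow follow-up fix'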
[00:38:17] Thanks matt_flaschen [00:39:46] (03PS1) 10Yuvipanda: ores: Account for more magic [puppet] - 10https://gerrit.wikimedia.org/r/259945 [00:40:07] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Account for more magic [puppet] - 10https://gerrit.wikimedia.org/r/259945 (owner: 10Yuvipanda) [00:40:11] I think I'm gonna force-merge that into wmf9 because Zuul appears to be freaking out right now [00:41:13] ...which is because the security release just happened and all those patches are going through right nwo [00:42:02] !log catrope@tin Synchronized php-1.27.0-wmf.9/extensions/Flow: SWAT: Nuke support for Flow, part 3 (duration: 00m 32s) [00:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:42:32] 10Ops-Access-Requests, 6operations: Add James Alexander to Security@ - https://phabricator.wikimedia.org/T121807#1889526 (10csteipp) I approve of Jalexander being on security@. In case something thinks he forged that irc interaction :) [00:42:53] matt_flaschen: ---^^ how about now [00:42:56] halfak: ok the puppet stuff is done [00:42:59] looking at your patch now [00:43:50] halfak: ok, so for staging we have to override the ores-redis-01 to say 'localhost' [00:44:05] RoanKattouw, I should have looked more closely when I fixed this. This whole thing won't work. We need to revert. [00:44:28] It works locally, but it won't in productions, because it's joining between tables on entirely different clusters. [00:44:40] YuviPanda, yeah. Probably very similar to https://gist.github.com/halfak/53e65cbbb525b5ee2da6 [00:45:02] The only real difference is that we'll leave the metrics_collector alone. [00:45:06] halfak: ok [00:45:13] * YuviPanda puts [00:45:15] * halfak makes another gist [00:45:56] RoanKattouw, I could try to fix it, but I don't want to rush it at this point. [00:46:00] YuviPanda, https://gist.github.com/halfak/936759245c94f1398f77 [00:46:20] Assuming that everything is on the same machine and default port in staging [00:46:49] halfak: ok [00:48:54] matt_flaschen: OK, I'll pull it [00:50:06] Sorry [00:51:02] matt_flaschen: Could you file a task for fixing it? [00:51:10] Cause we should still fix this before the next train [00:51:14] (which isn't for a few weeks, but still) [00:51:18] YuviPanda, do you need anything from me now? [00:51:37] Yeah. I will. [00:52:10] OK so I'm reverting in wmf9 but not in master [00:52:19] And I'm not touching Nuke, because that change just added hooks [00:52:38] (03PS1) 10Yuvipanda: ores: Setup separate staging config [puppet] - 10https://gerrit.wikimedia.org/r/259958 [00:52:40] halfak: ^ [00:52:44] halfak: can you take a look at that? [00:53:30] !log catrope@tin Synchronized php-1.27.0-wmf.9/extensions/Flow: Revert Nuke-Flow integration, doesn't work (duration: 00m 32s) [00:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:54:11] (03PS2) 10Yuvipanda: ores: Setup separate staging config [puppet] - 10https://gerrit.wikimedia.org/r/259958 [00:54:19] halfak: just the config bits of it. [00:54:20] Thanks, filed. [00:54:28] YuviPanda, looks like we'll need to overwrite the port for the cache *OR* I could just take it out of the main config. [00:54:37] halfak: oh? why so? [00:54:39] * YuviPanda looks at main config [00:54:44] halfak: oh, right. [00:54:49] halfak: I'll just overwrite it. 
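A sketch of the kind of staging-only override being discussed, pointing both pools at local Redis instances and overriding the cache port; the file name and keys are hypothetical (loosely modeled on halfak's gists), not the real ores-wikimedia-config schema:

    # Hypothetical staging override fragment; key names are illustrative only.
    cat > 99-staging-redis.yaml <<'EOF'
    cache_redis:
      host: localhost   # staging runs everything on one box instead of ores-redis-01
      port: 6380        # cache instance (port overridden here)
    queue_redis:
      host: localhost
      port: 6379        # queue instance
    EOF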
[00:54:52] OK [00:55:02] halfak: wait [00:55:09] oh [00:55:11] no you're right [00:55:24] we're using 79 for queue and 80 for cache [00:55:53] (03PS3) 10Yuvipanda: ores: Setup separate staging config [puppet] - 10https://gerrit.wikimedia.org/r/259958 [00:56:05] halfak: done ^ [00:56:06] * halfak looks again [00:57:24] OK. Looks good. [00:57:58] halfak: ok [00:58:12] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Setup separate staging config [puppet] - 10https://gerrit.wikimedia.org/r/259958 (owner: 10Yuvipanda) [00:58:28] Ha. You beat my +1 [00:58:41] :D [00:58:45] hmm [00:58:46] forgot gerrit used wikitech creds. [00:58:57] config should probably restart some things [00:59:03] uwsgi / celery whatever I suppose [00:59:18] halfak: am running on ores-staging-01 [00:59:22] halfak: this shouldn't affect prod at all [00:59:28] OK [00:59:32] +1 [00:59:58] (03PS1) 10Yuvipanda: ores: Fix stupid typo [puppet] - 10https://gerrit.wikimedia.org/r/259961 [01:00:38] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Fix stupid typo [puppet] - 10https://gerrit.wikimedia.org/r/259961 (owner: 10Yuvipanda) [01:02:16] (03PS1) 10Yuvipanda: ores: Fix another stupid typo [puppet] - 10https://gerrit.wikimedia.org/r/259962 [01:02:29] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Fix another stupid typo [puppet] - 10https://gerrit.wikimedia.org/r/259962 (owner: 10Yuvipanda) [01:03:18] halfak: can you do a deploy to staging? [01:03:21] halfak: with all the new stuff? [01:03:37] Will do. [01:03:44] halfak: ok. puppet is failing until that [01:04:05] doing. [01:04:12] Waiting for uwsgi to restart [01:04:16] .... [01:06:02] OK [01:06:05] Restart complete [01:06:23] Something is wrong [01:06:37] yeah [01:06:42] the puppet run should fix that [01:06:52] it's pointing to ores-redis-01 when that isn't setup properly yet [01:07:09] Gotcha. [01:07:16] halfak: can you gimme a test url again before you go off? [01:07:23] http://ores-staging.wmflabs.org/scores/testwiki/reverted/32234234/ [01:07:31] thanks [01:07:33] Should work no matter what number you put in it [01:07:45] Can stay for the puppet run. [01:07:47] Should I do it? [01:07:56] halfak: nope, go! [01:08:04] OK. Have a good night! [01:08:05] o/ [01:08:07] you too [01:08:09] \o [01:10:44] (03PS1) 10Ori.livneh: Add static variants of [[en:2015 in sports]] for perf testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259964 [01:11:01] (03CR) 10Ori.livneh: [C: 032] Add static variants of [[en:2015 in sports]] for perf testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259964 (owner: 10Ori.livneh) [01:12:05] PROBLEM - HHVM rendering on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:12:54] (03CR) 10Ori.livneh: [V: 032] Add static variants of [[en:2015 in sports]] for perf testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259964 (owner: 10Ori.livneh) [01:13:05] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:13:15] (03PS1) 10Yuvipanda: Revert "redis: upstart should track PID after one fork" [puppet] - 10https://gerrit.wikimedia.org/r/259965 [01:13:56] ori: ^ [01:14:06] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:16] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:16] ori: should we disable puppet on all the rdbs and try it on one? [01:14:24] PROBLEM - salt-minion processes on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
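The staging smoke test halfak points to above can be repeated from the shell once puppet and the Redis instances are sorted out; per the discussion, any revision ID should do:

    # Fetch a score from ORES staging; a non-error JSON response suggests the
    # web worker and its redis backends are wired up correctly.
    curl -s 'http://ores-staging.wmflabs.org/scores/testwiki/reverted/32234234/'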
[01:14:26] PROBLEM - configured eth on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:45] PROBLEM - dhclient process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:55] PROBLEM - SSH on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:14:55] PROBLEM - nutcracker port on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:15:05] PROBLEM - HHVM processes on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:15:18] PROBLEM - nutcracker process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:15:25] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:15:35] PROBLEM - Check size of conntrack table on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:15:44] PROBLEM - puppet last run on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:19:55] RECOVERY - Disk space on mw1125 is OK: DISK OK [01:19:56] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed [01:20:05] RECOVERY - salt-minion processes on mw1125 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:20:14] RECOVERY - configured eth on mw1125 is OK: OK - interfaces up [01:20:26] RECOVERY - dhclient process on mw1125 is OK: PROCS OK: 0 processes with command name dhclient [01:20:36] RECOVERY - SSH on mw1125 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [01:20:45] RECOVERY - nutcracker port on mw1125 is OK: TCP OK - 0.000 second response time on port 11212 [01:20:46] RECOVERY - HHVM processes on mw1125 is OK: PROCS OK: 6 processes with command name hhvm [01:20:54] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.723 second response time [01:20:56] RECOVERY - nutcracker process on mw1125 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [01:21:06] RECOVERY - DPKG on mw1125 is OK: All packages OK [01:21:15] RECOVERY - Check size of conntrack table on mw1125 is OK: OK: nf_conntrack is 5 % full [01:21:25] RECOVERY - puppet last run on mw1125 is OK: OK: Puppet is currently enabled, last run 41 minutes ago with 0 failures [01:21:34] !log krinkle@tin Synchronized docroot and w: (no message) (duration: 00m 32s) [01:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:21:45] RECOVERY - HHVM rendering on mw1125 is OK: HTTP OK: HTTP/1.1 200 OK - 71912 bytes in 1.137 second response time [01:25:30] !log in preparation for Iaefb2d191e, disabling puppet on mc* and rdb* [01:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:26:45] PROBLEM - DPKG on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:27:04] PROBLEM - Disk space on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:27:15] PROBLEM - puppet last run on mw1125 is CRITICAL: CRITICAL: Puppet has 3 failures [01:27:16] PROBLEM - RAID on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:27:45] PROBLEM - salt-minion processes on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:27:56] PROBLEM - puppet last run on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:28:15] PROBLEM - Check size of conntrack table on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
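A sketch of the disable-then-canary approach behind the "disabling puppet on mc* and rdb*" log entry above, assuming it is driven from the salt master; the targeting and canary host are illustrative:

    # Illustrative only: pause puppet on the redis/memcached hosts, apply the
    # change on one canary, then re-enable the rest once it looks healthy.
    salt 'rdb*' cmd.run 'puppet agent --disable "testing Iaefb2d191e"'
    salt 'mc*'  cmd.run 'puppet agent --disable "testing Iaefb2d191e"'
    salt 'rdb1001.eqiad.wmnet' cmd.run 'puppet agent --enable && puppet agent --test'  # canary (hypothetical pick)
    salt 'rdb*' cmd.run 'puppet agent --enable'
    salt 'mc*'  cmd.run 'puppet agent --enable'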
[01:28:15] PROBLEM - configured eth on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:28:25] PROBLEM - dhclient process on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:29:41] (03PS2) 10Ori.livneh: Revert "redis: upstart should track PID after one fork" [puppet] - 10https://gerrit.wikimedia.org/r/259965 (owner: 10Yuvipanda) [01:41:05] RECOVERY - puppet last run on mw1125 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [01:43:45] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: puppet fail [01:45:44] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:52:04] !log re-enabled puppet on rdb* / mc* [01:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:00:22] (03Abandoned) 10Yuvipanda: Revert "redis: upstart should track PID after one fork" [puppet] - 10https://gerrit.wikimedia.org/r/259965 (owner: 10Yuvipanda) [02:05:05] PROBLEM - SSH on planet1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:06:46] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [02:07:16] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: puppet fail [02:22:16] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 08m 45s) [02:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:22:35] PROBLEM - SSH on planet1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Dec 18 02:29:10 UTC 2015 (duration 6m 55s) [02:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:56] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [02:34:55] RECOVERY - puppet last run on mw2121 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:46:26] (03PS1) 10Aaron Schulz: Adjust queue "maxPartitionsTry" and timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259969 [02:51:46] PROBLEM - salt-minion processes on elastic1010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:51:46] PROBLEM - salt-minion processes on logstash1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:51:46] PROBLEM - salt-minion processes on mw2117 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:51:54] PROBLEM - salt-minion processes on mw2128 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:51:54] PROBLEM - salt-minion processes on mw2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:51:54] PROBLEM - salt-minion processes on restbase2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:51:55] PROBLEM - salt-minion processes on mw2073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:51:55] PROBLEM - salt-minion processes on mw2055 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:51:55] PROBLEM - salt-minion processes on mw1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:51:55] PROBLEM - salt-minion processes on mw2151 is CRITICAL: PROCS 
CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:51:56] PROBLEM - salt-minion processes on mw1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:51:56] PROBLEM - salt-minion processes on mw1099 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:54:14-03:08:44] [several hundred repeated icinga-wm notices of the form "RECOVERY - salt-minion processes on <host> is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion" for mw*, mw2*, wtp*, elastic*, restbase*, logstash*, snapshot* and assorted misc hosts omitted] [02:54:15] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [02:56:45] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [03:00:55] PROBLEM - puppet last run on wtp2019 is CRITICAL: CRITICAL: puppet fail [03:07:06] much recovery [03:07:42] !log ori@tin Synchronized php-1.27.0-wmf.9/includes/api/ApiStashEdit.php: ab32f4e740: Make ApiStashEdit use statsd metrics (duration: 00m 49s) [03:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:08:44] RECOVERY - salt-minion processes on mw2190 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:08:44] RECOVERY - salt-minion processes on mw1095 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:08:45] RECOVERY - salt-minion processes on mw2011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:08:45] RECOVERY - salt-minion processes on mw1175 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:08:45] RECOVERY - salt-minion processes on mw2119 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:08:55] RECOVERY - salt-minion processes on mw2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:08:55] RECOVERY - salt-minion processes on mw1126 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:05] RECOVERY - salt-minion processes on mw2178 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:15] RECOVERY - salt-minion processes on elastic1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:15] RECOVERY - salt-minion processes on mw1182 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:16] RECOVERY - salt-minion processes on mw2029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:17] RECOVERY - salt-minion processes on mw2086 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:24] RECOVERY - salt-minion processes on mw2192 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:26] RECOVERY - salt-minion processes on mw1101 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:44] RECOVERY - salt-minion processes on mw1247 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:44] RECOVERY - salt-minion processes on wtp1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:45] RECOVERY - salt-minion processes on wtp2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:45] RECOVERY - salt-minion processes on mw1227 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:45] RECOVERY - salt-minion processes on restbase1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:54] RECOVERY - salt-minion processes on mw1190 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:55] RECOVERY - salt-minion processes on elastic1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:04] RECOVERY - salt-minion processes on mw2151 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:16] RECOVERY - salt-minion processes on mw2165 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:25] RECOVERY - salt-minion processes on elastic1026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:25] RECOVERY - salt-minion processes on mw2135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:34] RECOVERY - salt-minion processes on mw1125 is OK: PROCS OK: 1 
process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:44] RECOVERY - salt-minion processes on mw1081 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:44] RECOVERY - salt-minion processes on mw1127 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:45] RECOVERY - salt-minion processes on wtp2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:45] RECOVERY - salt-minion processes on mw2149 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:45] RECOVERY - salt-minion processes on mw2147 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:54] RECOVERY - salt-minion processes on mw2194 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:54] RECOVERY - salt-minion processes on elastic1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:55] RECOVERY - salt-minion processes on mw2089 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:55] RECOVERY - salt-minion processes on mw1124 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:04] RECOVERY - salt-minion processes on mw1196 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:04] RECOVERY - salt-minion processes on mw2065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:14] RECOVERY - salt-minion processes on mw2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:15] RECOVERY - salt-minion processes on mw2103 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:15] RECOVERY - salt-minion processes on mw2005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:15] RECOVERY - salt-minion processes on mw2042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:16] RECOVERY - salt-minion processes on restbase2005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:24] RECOVERY - salt-minion processes on mw1146 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:24] RECOVERY - salt-minion processes on mw2006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:25] RECOVERY - salt-minion processes on mw1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:25] RECOVERY - salt-minion processes on mw1159 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:25] RECOVERY - salt-minion processes on mw1221 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:25] RECOVERY - salt-minion processes on mw2179 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:35] RECOVERY - salt-minion processes on mw1111 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:36] RECOVERY - salt-minion processes on mw2041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:45] RECOVERY - salt-minion processes on wtp1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:45] RECOVERY - salt-minion processes on mw1130 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/python /usr/bin/salt-minion [03:12:05] RECOVERY - salt-minion processes on mw1232 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:06] RECOVERY - salt-minion processes on mw2102 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:06] RECOVERY - salt-minion processes on mw2197 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:14] RECOVERY - salt-minion processes on mw2116 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:15] RECOVERY - salt-minion processes on mw1210 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:24] RECOVERY - salt-minion processes on wtp1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:24] RECOVERY - salt-minion processes on wtp1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:25] RECOVERY - salt-minion processes on mw1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:25] RECOVERY - salt-minion processes on mw1087 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:26] RECOVERY - salt-minion processes on mw2112 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:34] RECOVERY - salt-minion processes on mw1147 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:35] RECOVERY - salt-minion processes on mw2099 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:35] RECOVERY - salt-minion processes on mw2170 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:44] RECOVERY - salt-minion processes on wtp1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:44] RECOVERY - salt-minion processes on mw2139 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:45] RECOVERY - salt-minion processes on mw2189 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:45] RECOVERY - salt-minion processes on mw2161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:45] RECOVERY - salt-minion processes on mw2201 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:55] RECOVERY - salt-minion processes on mw1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:14] RECOVERY - salt-minion processes on mw1243 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:14] RECOVERY - salt-minion processes on wtp2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:15] RECOVERY - salt-minion processes on mw2183 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:15] RECOVERY - salt-minion processes on mw1089 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:15] RECOVERY - salt-minion processes on mw1106 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:15] RECOVERY - salt-minion processes on mw1188 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:15] RECOVERY - salt-minion processes on mw2064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion 
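An aside on the flood of notices above and below: each one is the same Icinga process check clearing, host by host, as salt-minion comes back everywhere. The "PROCS OK" output suggests a check_procs-style count of processes whose command line matches the given regex. The actual plugin is not shown in the log; the following Python sketch only illustrates what each notice asserts, namely that exactly one matching process is running:

    import os
    import re
    import sys

    PATTERN = re.compile(r'^/usr/bin/python /usr/bin/salt-minion')

    def matching_processes():
        count = 0
        for pid in filter(str.isdigit, os.listdir('/proc')):
            try:
                with open('/proc/%s/cmdline' % pid, 'rb') as f:
                    # argv entries are NUL-separated in /proc/<pid>/cmdline
                    cmdline = f.read().replace(b'\x00', b' ').decode('utf-8', 'replace').strip()
            except OSError:
                continue  # process exited while we were reading
            if PATTERN.match(cmdline):
                count += 1
        return count

    if __name__ == '__main__':
        n = matching_processes()
        if n == 1:
            print('PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion')
            sys.exit(0)
        print('PROCS CRITICAL: %d processes with regex args ^/usr/bin/python /usr/bin/salt-minion' % n)
        sys.exit(2)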
[03:13:16] RECOVERY - salt-minion processes on elastic1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:24] RECOVERY - salt-minion processes on mw1063 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:25] RECOVERY - salt-minion processes on mw1201 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:25] RECOVERY - salt-minion processes on mw1115 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:25] RECOVERY - salt-minion processes on wtp2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:25] RECOVERY - salt-minion processes on mw2160 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:34] RECOVERY - salt-minion processes on aqs1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:34] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:35] RECOVERY - salt-minion processes on mw1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:35] RECOVERY - salt-minion processes on mw1197 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:35] RECOVERY - salt-minion processes on mw1171 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:44] RECOVERY - salt-minion processes on mw1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:54] RECOVERY - salt-minion processes on logstash1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:56] RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:13:56] RECOVERY - salt-minion processes on mw1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:04] RECOVERY - salt-minion processes on elastic1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:05] RECOVERY - salt-minion processes on mw2044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:05] RECOVERY - salt-minion processes on mw2009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:05] RECOVERY - salt-minion processes on mw2140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:05] RECOVERY - salt-minion processes on mw2188 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:06] RECOVERY - salt-minion processes on mw2153 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:15] RECOVERY - salt-minion processes on mw2109 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:15] RECOVERY - salt-minion processes on mw2100 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:16] RECOVERY - salt-minion processes on wtp2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:25] RECOVERY - salt-minion processes on mw1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:25] RECOVERY - salt-minion processes on mw1091 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:25] RECOVERY - 
salt-minion processes on mw1225 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:34] RECOVERY - salt-minion processes on mw2105 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:54] RECOVERY - salt-minion processes on mw1122 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:54] RECOVERY - salt-minion processes on restbase-test2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:55] RECOVERY - salt-minion processes on wtp2005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:14:55] RECOVERY - salt-minion processes on mw2010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:05] RECOVERY - salt-minion processes on mw2164 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:05] RECOVERY - salt-minion processes on mw2199 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:14] RECOVERY - salt-minion processes on mw1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:14] RECOVERY - salt-minion processes on mw2075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:15] RECOVERY - salt-minion processes on mw1254 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:24] RECOVERY - salt-minion processes on mw2163 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:24] RECOVERY - salt-minion processes on mw2080 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:24] RECOVERY - salt-minion processes on wtp1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:25] RECOVERY - salt-minion processes on mw1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:34] RECOVERY - salt-minion processes on mw1205 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:34] RECOVERY - salt-minion processes on mw1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:34] RECOVERY - salt-minion processes on mw2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:35] RECOVERY - salt-minion processes on elastic1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:35] RECOVERY - salt-minion processes on mw1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:35] RECOVERY - salt-minion processes on mw1066 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:45] PROBLEM - NTP on planet1001 is CRITICAL: NTP CRITICAL: No response from NTP server [03:15:54] RECOVERY - salt-minion processes on mw2117 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:54] RECOVERY - salt-minion processes on mw2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:55] RECOVERY - salt-minion processes on mw1131 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:04] RECOVERY - salt-minion processes on mw1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:04] RECOVERY - salt-minion processes on mw1104 is OK: PROCS OK: 1 process with 
regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:04] RECOVERY - salt-minion processes on mw2015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:05] RECOVERY - salt-minion processes on mw2087 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:05] RECOVERY - salt-minion processes on mw1107 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:14] RECOVERY - salt-minion processes on mw2067 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:14] RECOVERY - salt-minion processes on elastic1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:14] RECOVERY - salt-minion processes on mw1253 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:14] RECOVERY - salt-minion processes on mw1241 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:25] RECOVERY - salt-minion processes on mw1068 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:26] RECOVERY - salt-minion processes on mw1150 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:26] RECOVERY - salt-minion processes on mw1189 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:26] RECOVERY - salt-minion processes on mw1155 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:26] RECOVERY - salt-minion processes on mw1143 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:34] RECOVERY - salt-minion processes on mw2184 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:35] RECOVERY - salt-minion processes on mw2070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:35] RECOVERY - salt-minion processes on mw1153 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:35] RECOVERY - salt-minion processes on mw1173 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:36] RECOVERY - salt-minion processes on mw2093 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:44] RECOVERY - salt-minion processes on restbase2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:45] RECOVERY - salt-minion processes on elastic1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:46] RECOVERY - salt-minion processes on mw2114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:55] RECOVERY - salt-minion processes on mw1204 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:55] RECOVERY - salt-minion processes on mw2096 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:56] RECOVERY - salt-minion processes on mw1235 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:56] RECOVERY - salt-minion processes on mw2082 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:56] RECOVERY - salt-minion processes on mw2030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:56] RECOVERY - salt-minion processes on mw2134 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/salt-minion [03:17:05] RECOVERY - salt-minion processes on mw2143 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:05] RECOVERY - salt-minion processes on mw2123 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:05] RECOVERY - salt-minion processes on mw1211 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:14] RECOVERY - salt-minion processes on mw2079 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:14] RECOVERY - salt-minion processes on mw1154 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:16] RECOVERY - salt-minion processes on mw2090 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:16] RECOVERY - salt-minion processes on mw1118 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:16] RECOVERY - salt-minion processes on mw2127 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:16] RECOVERY - salt-minion processes on mw2176 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:16] RECOVERY - salt-minion processes on mw2196 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:16] RECOVERY - salt-minion processes on mw2056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:24] RECOVERY - salt-minion processes on mw1194 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:24] RECOVERY - salt-minion processes on mw2212 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:25] RECOVERY - salt-minion processes on wtp2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:34] RECOVERY - salt-minion processes on mw1128 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:34] RECOVERY - salt-minion processes on mw1092 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:45] RECOVERY - salt-minion processes on mw2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:45] RECOVERY - salt-minion processes on mw2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:45] RECOVERY - salt-minion processes on wtp2010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:54] RECOVERY - salt-minion processes on mw2110 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:55] RECOVERY - salt-minion processes on mw2055 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:55] RECOVERY - salt-minion processes on mw1213 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:18:04] RECOVERY - salt-minion processes on mw2131 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:18:05] RECOVERY - salt-minion processes on mw2113 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:18:14] RECOVERY - salt-minion processes on mw1137 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:18:14] RECOVERY - salt-minion processes on mw2142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:18:15] RECOVERY - 
salt-minion processes on mw1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:18:15] RECOVERY - salt-minion processes on mw1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:18:24] RECOVERY - salt-minion processes on mw2092 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:18:34] RECOVERY - salt-minion processes on mw2047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:18:45] RECOVERY - salt-minion processes on elastic1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:18:46] RECOVERY - salt-minion processes on mw1206 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:18:54] RECOVERY - salt-minion processes on mw1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:19:04] RECOVERY - salt-minion processes on mw1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:19:14] RECOVERY - salt-minion processes on restbase1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:19:24] RECOVERY - salt-minion processes on mw2101 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:19:54] RECOVERY - salt-minion processes on mw1230 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:28:36] RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:28:56] RECOVERY - salt-minion processes on wtp2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:34:05] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [03:37:54] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [03:43:30] (03PS1) 10Aaron Schulz: New hardware token protected ssh key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/259971 [03:49:34] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [03:53:25] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [04:08:55] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 15.38% of data above the critical threshold [100000000.0] [04:13:05] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [04:28:45] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [04:49:27] (03PS2) 10BBlack: Text VCL: Fix up logged-in users caching [puppet] - 10https://gerrit.wikimedia.org/r/259882 [04:49:29] (03PS1) 10BBlack: Text VCL: support testwiki for /static/, too [puppet] - 10https://gerrit.wikimedia.org/r/259972 [04:49:31] (03PS1) 10BBlack: Text VCL: no need to re-detect testwiki hostname in backend [puppet] - 10https://gerrit.wikimedia.org/r/259973 [04:49:33] (03PS1) 10BBlack: Text VCL: explicit pass for Debug/SecAudit headers [puppet] - 10https://gerrit.wikimedia.org/r/259974 [04:50:16] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [04:52:15] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [04:53:09] (03CR) 10BBlack: [C: 032] Text VCL: support testwiki for /static/, too [puppet] - 10https://gerrit.wikimedia.org/r/259972 (owner: 10BBlack) [04:53:18] (03CR) 10BBlack: [V: 032] Text VCL: support testwiki for /static/, too [puppet] - 10https://gerrit.wikimedia.org/r/259972 (owner: 10BBlack) [04:53:31] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: no need to re-detect 
testwiki hostname in backend [puppet] - 10https://gerrit.wikimedia.org/r/259973 (owner: 10BBlack) [04:54:44] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: explicit pass for Debug/SecAudit headers [puppet] - 10https://gerrit.wikimedia.org/r/259974 (owner: 10BBlack) [05:04:05] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [05:08:04] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [05:09:25] PROBLEM - RAID on dataset1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) [05:13:55] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [05:15:54] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [05:25:35] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [05:33:35] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [05:37:26] (03PS3) 10BBlack: Text VCL: Fix up logged-in users caching [puppet] - 10https://gerrit.wikimedia.org/r/259882 [05:37:27] (03PS1) 10BBlack: Text VCL: move filter_(headers|noise) from common [puppet] - 10https://gerrit.wikimedia.org/r/259976 [05:37:29] (03PS1) 10BBlack: Text VCL: de-duplicate common recv code [puppet] - 10https://gerrit.wikimedia.org/r/259977 [05:37:31] (03PS1) 10BBlack: Text VCL: move normalize_path from common [puppet] - 10https://gerrit.wikimedia.org/r/259978 [05:37:33] (03PS1) 10BBlack: VCL: move retry5xx to parsoid (only user left) [puppet] - 10https://gerrit.wikimedia.org/r/259979 [05:37:35] (03PS1) 10BBlack: VCL: standardize retry503, move oddball to parsoid-backend [puppet] - 10https://gerrit.wikimedia.org/r/259980 [05:39:25] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [05:43:15] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [06:00:55] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [06:02:54] PROBLEM - Disk space on restbase1004 is CRITICAL: DISK CRITICAL - free space: /var 106020 MB (3% inode=99%) [06:04:18] it looks like this might actually finish ^^ [06:04:45] a large compaction at 95.80%, 105G remaining [06:05:19] /cc godog [06:30:25] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: puppet fail [06:31:25] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:55] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:05] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:15] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:26] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:55] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 3 failures [06:33:17] 6operations, 6Performance-Team, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1889811 (10faidon) [06:34:24] PROBLEM - puppet last run on wtp2003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:25] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [06:42:15] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [06:47:38] !log restbase1004: nodetool stop -- COMPACTION to avoid running out of disk space [06:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:47:55] RECOVERY - Disk space on restbase1004 is OK: DISK OK [06:52:15] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:56:45] RECOVERY - puppet last run on 
wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:15] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:25] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:57:34] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:57:45] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:46] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:58:15] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:45] RECOVERY - puppet last run on wtp2003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:10:24] PROBLEM - puppet last run on mw2001 is CRITICAL: CRITICAL: Puppet has 1 failures [07:11:35] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [07:17:25] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [07:21:24] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [07:21:40] 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1889836 (10QuimGil) It looks like there has been a **regression**? Now both clicking the topic link and the rectangle will le... [07:25:54] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [07:29:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [07:31:15] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [07:33:26] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [07:35:15] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [07:36:04] RECOVERY - puppet last run on mw2001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:39:24] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:41:05] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [08:10:35] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [08:16:25] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [08:41:40] Hi there [08:42:49] hi [08:43:55] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [08:44:28] are there some severe issues with the job queues? [08:44:54] (we cannot even delete jobs from Tool Labs) [08:52:13] <_joe_> Linedwell: delete jobs from toollabs? what do you mean? [08:52:53] _joe_, we have a bot that does not answer anymore and refuses to restart while its job is running [08:53:16] job 381185 is already in deletion [08:53:17] Waiting for bot to stop so it can be restarted [08:53:18] <_joe_> Linedwell: ok so that's a toollabs problem I guess? 
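An aside on the stuck Tool Labs job above: "Waiting for bot to stop so it can be restarted" amounts to polling the grid engine until the job disappears and then starting the tool again. A minimal sketch of that loop, assuming SGE's qstat is available; the job id is the one from the log, and RESTART_CMD is a placeholder for however the tool is actually launched:

    import subprocess
    import time

    JOB_ID = '381185'                       # the job id mentioned in the log
    RESTART_CMD = ['jstart', 'mybot.sh']    # placeholder: whatever actually (re)starts the tool

    def job_still_listed(job_id):
        out = subprocess.check_output(['qstat']).decode('utf-8', 'replace')
        return any(line.split()[:1] == [job_id] for line in out.splitlines())

    while job_still_listed(JOB_ID):
        time.sleep(30)                      # still "in deletion"; keep waiting

    subprocess.check_call(RESTART_CMD)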
[08:53:27] Yes [08:53:51] <_joe_> ok, not the mediawiki jobqueue - I asked because I couldn't see an issue with that [08:54:13] sorry for not having been clear :) [08:54:17] <_joe_> Linedwell: I am quite unfamiliar with toollabs's architecture, did you try to ask in #wikimedia-labs? [08:54:27] not yet, I will [08:54:31] thanks [08:54:48] <_joe_> Linedwell: if no one helps, I can try to, but I don't promise resolution :) [08:55:15] ok [08:55:36] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [09:05:36] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [09:06:04] 6operations, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1889935 (10mobrovac) [09:06:08] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 7 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1889932 (10mobrovac) 5Open>3Resolved This has been deployed yesterday. Thank you everyone! [09:11:16] (03PS5) 10Giuseppe Lavagetto: pybal: introduce role for testing machines [puppet] - 10https://gerrit.wikimedia.org/r/259704 [09:13:10] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/259704 (owner: 10Giuseppe Lavagetto) [09:20:18] !log Killed Zuul entirely, the queues were full / deadlocked. Patches need to be retriggered [09:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:21:26] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [09:25:24] (03PS1) 10Mobrovac: Revert "Mathoid: Increase the number of workers temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/259990 [09:25:34] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [09:26:38] <_joe_> mobrovac: need me to merge ^^ [09:26:39] <_joe_> ? 
[09:26:49] _joe_: hehe, yup :) [09:27:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/259990 (owner: 10Mobrovac) [09:31:45] PROBLEM - SSH on planet1001 is CRITICAL: Server answer [09:34:35] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures [09:39:57] RECOVERY - salt-minion processes on planet1001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:40:45] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail [09:41:14] RECOVERY - NTP on planet1001 is OK: NTP OK: Offset 0.02541458607 secs [09:41:34] looking into what's going on on planet1001 [09:41:55] ah Out of memory: Kill process 19807 (planet) score 7 or sacrifice child [09:41:59] ok that's what's going on [09:43:46] !log rebooting planet1001, memory exhaustion, OOM showed up [09:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:44:56] RECOVERY - DPKG on planet1001 is OK: All packages OK [09:44:56] RECOVERY - Check size of conntrack table on planet1001 is OK: OK: nf_conntrack is 0 % full [09:44:56] RECOVERY - dhclient process on planet1001 is OK: PROCS OK: 0 processes with command name dhclient [09:44:56] RECOVERY - configured eth on planet1001 is OK: OK - interfaces up [09:45:24] RECOVERY - Disk space on planet1001 is OK: DISK OK [09:45:55] RECOVERY - SSH on planet1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [09:46:05] RECOVERY - RAID on planet1001 is OK: OK: no RAID installed [09:46:26] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:57:26] (03PS2) 10Hashar: contint: monitor Jenkins has a ZMQ publisher [puppet] - 10https://gerrit.wikimedia.org/r/257568 (https://phabricator.wikimedia.org/T120669) [10:00:55] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:09:05] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:35:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 794 [10:40:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 998 [10:45:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 76110 Threads: 125 Questions: 5710575 Slow queries: 932 Opens: 9066 Flush tables: 2 Open tables: 64 Queries per second avg: 75.030 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [11:15:14] PROBLEM - check_apache2 on payments1002 is CRITICAL: PROCS CRITICAL: 256 processes with command name apache2 [11:15:15] PROBLEM - check_apache2 on payments1003 is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [11:15:24] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [11:20:22] (03PS1) 10Giuseppe Lavagetto: pybal: allow configuration of the protocol, host [puppet] - 10https://gerrit.wikimedia.org/r/259993 [11:20:24] (03PS1) 10Giuseppe Lavagetto: Allow pybal::testing to vary the pybal config [puppet] - 10https://gerrit.wikimedia.org/r/259994 [11:21:18] (03PS2) 10Alexandros Kosiaris: trebuchet: change ferm rule to deployable networks [puppet] - 10https://gerrit.wikimedia.org/r/256752 [11:21:24] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] trebuchet: change ferm rule to deployable networks [puppet] - 
10https://gerrit.wikimedia.org/r/256752 (owner: 10Alexandros Kosiaris) [11:28:45] PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [11:29:34] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.002 second response time [11:29:46] <_joe_> moritzm: is that you? [11:31:01] _joe_: this one got restarted, let me have a look [11:32:56] _joe_: could you have a look at the error message in /var/log/hhvm/error.log, did you see that before? [11:33:07] <_joe_> taking a look [11:34:24] <_joe_> moritzm: what specifically looked suspicious to you? [11:34:40] <_joe_> from what I can see, hhvm exited cleanly and that's the last thing I see [11:34:53] * ori guesses without looking: lightprocess terminated [11:35:00] <_joe_> ori: eheh [11:35:02] the "Lost parent, lightProcess exiting"? [11:35:13] that's two mind-reads in close succession [11:35:15] ori: your reading my mind [11:35:29] <_joe_> ori: come on you can read minds over IRC [11:35:42] ori: I thought you could write minds, not read [11:36:49] moritzm: Tim's explanation was great so I saved it: [11:36:51] "We know that fork/exec is slow on HHVM, especially under concurrent load. When you fork/exec, you briefly acquire a lock on the kernel's view of the address space of the whole process. LightProcess is a child process that HHVM forks in advance. Then subsequent shell executions are forked from the child process. If you do your high volume forks from a separate worker process, you avoid acquiring the lock on the main proce [11:36:51] ss." [11:37:13] when the main process dies for whatever reason (including a normal shutdown), you each lightprocess exits too, with the log message you saw [11:37:23] s/you each/each/ [11:38:22] ok, so if that is benign, then I'm wondering why mw1107 misbehaves after all, shall I restart hhvm again? [11:38:48] <_joe_> let me see after that [11:39:01] ok [11:39:58] <_joe_> hhvm-dump-debug tells us it's in a deadlock [11:40:01] <_joe_> while starting [11:40:09] <_joe_> if you want to take a look, moritzm [11:40:18] <_joe_> I'll wait before restarting it [11:40:41] !log logstash: reorganized list of dashboards per sections https://logstash.wikimedia.org/#/dashboard/elasticsearch/default [11:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:40:49] was fed up with the long list of links [11:45:09] _joe_: nice, I wasn't aware of hhvm-debug-debug, feel free to restart now [11:46:48] <_joe_> {{done}} [11:47:01] <_joe_> !log restarted hhvm on mw1107, stuck at startup [11:47:05] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.156 second response time [11:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:47:45] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [11:50:14] RECOVERY - check_apache2 on payments1003 is OK: PROCS OK: 127 processes with command name apache2 [11:50:15] RECOVERY - check_apache2 on payments1001 is OK: PROCS OK: 130 processes with command name apache2 [11:58:54] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 69856 MB (3% inode=99%) [11:59:39] godog: ^ [11:59:50] *sigh* [12:00:43] godog: let's just bounce it? 
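For context on the LightProcess explanation quoted above: the pattern is to fork a small helper very early, keep a socket to it, and have that helper do all later fork/exec work, so the large multi-threaded parent never pays the fork cost itself. The toy Python sketch below illustrates the pattern only; it is not HHVM's actual implementation, and the framing protocol is deliberately naive:

    import os
    import socket
    import subprocess

    def start_light_process():
        parent_end, child_end = socket.socketpair()
        if os.fork() == 0:                          # child: the pre-forked helper
            parent_end.close()
            while True:
                cmd = child_end.recv(4096)
                if not cmd:                         # parent gone: "Lost parent, lightProcess exiting"
                    os._exit(0)
                result = subprocess.run(cmd.decode().split(), capture_output=True)
                child_end.sendall(result.stdout or b'(no output)\n')
        child_end.close()
        return parent_end                           # parent submits commands over this end

    def shell_exec(helper, command):
        # the parent never forks here; the helper does the fork/exec on its behalf
        helper.sendall(command.encode())
        return helper.recv(65536)

    if __name__ == '__main__':
        helper = start_light_process()
        print(shell_exec(helper, 'uname -r').decode().strip())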
[12:01:11] mobrovac: heh, no there's a decommision on that's streaming to 1008 iirc, checking [12:01:21] damn [12:02:03] mobrovac: 13:41 < paravoid> mobrovac: btw, speaking of outages, mathoid outage report please :) [12:02:36] ah yes, paravoid, veni, vidi, didn't react [12:02:38] sorry [12:03:03] mobrovac: yeah that's 20G left, let's wait for it to finish [12:03:07] paravoid: will do [12:03:21] thanks :) [12:03:38] godog: ok, that should be alright, there's 60G left on rb1008 [12:03:49] godog: let's at least stop compactions there? [12:04:29] <_joe_> we should hire someone and give him 'cassandra nanny' as a job title [12:04:41] <_joe_> it's almost a full time job... [12:04:41] haha [12:05:11] "we are looking for a cassandra-sitter" [12:15:44] (03PS2) 10Giuseppe Lavagetto: pybal: allow configuration of the protocol, host [puppet] - 10https://gerrit.wikimedia.org/r/259993 [12:16:33] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM; the compiler confirms no differences in the config file." [puppet] - 10https://gerrit.wikimedia.org/r/259993 (owner: 10Giuseppe Lavagetto) [12:18:04] <_joe_> !log disabled puppet on all lvs hosts for a potentially harmful change (should be a noop) [12:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:19:07] !log upgrading tor on radium, rebooting for kernel upgrade [12:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:20:26] (03PS1) 10Muehlenhoff: Log more errors [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259996 [12:20:28] (03PS1) 10Muehlenhoff: Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259997 [12:20:30] (03PS1) 10Muehlenhoff: Fix local minion apt logging in case of incomplete minion setup [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259998 [12:20:32] (03PS1) 10Muehlenhoff: Catch an additional dpkg error in log processing [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259999 [12:20:34] (03PS1) 10Muehlenhoff: Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/260000 [12:20:56] RECOVERY - check_apache2 on payments1002 is OK: PROCS OK: 11 processes with command name apache2 [12:23:21] PROBLEM - Host radium is DOWN: CRITICAL - Host Unreachable (208.80.154.39) [12:23:45] RECOVERY - Host radium is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [12:34:15] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:35:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:40:13] morebots: what exactly healthcheck_url does? [12:40:13] I am a logbot running on tools-exec-1201. [12:40:13] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [12:40:13] To log a message, type !log . [12:40:22] my bad :/ [12:41:00] mobrovac: healthcheck_url - what exactly it does? It seems empty in some places. [12:42:02] <_joe_> kart_: what do you mean "empty"? [12:42:16] hehehe kart_ you seem to confuse me with morebots often :P [12:42:31] modules/mathoid/manifests/init.pp:46 [12:42:32] kart_: that's a small detail we need to deal with for cxserver yet [12:42:34] yep [12:42:38] <_joe_> kart_: it is a python script that fetches the service spec and checks that responses correspond to what the spec dictates [12:42:52] kart_: localhost:8080/v1/?spec gets the cxserver's spec, right? [12:42:58] _joe_: just read it. has_spec - true. [12:43:18] <_joe_> kart_: read what exactly? [12:43:34] yes. 
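Back on the restbase disk space thread above: the judgement call ("3% left, let it finish or stop it?") comes down to comparing free space on the data volume with what the running compaction still needs. A minimal sketch of that check, assuming nodetool is on the PATH; the 50G margin is an arbitrary example value, and the stop command is the same one !logged for restbase1004 earlier:

    import shutil
    import subprocess

    DATA_DIR = '/srv'
    MIN_FREE = 50 * 1024**3                 # arbitrary example safety margin: 50G

    free = shutil.disk_usage(DATA_DIR).free
    # show what is currently compacting and how much work remains
    print(subprocess.check_output(['nodetool', 'compactionstats']).decode())

    if free < MIN_FREE:
        subprocess.check_call(['nodetool', 'stop', 'COMPACTION'])
        print('stopped compactions with %.1fG free' % (free / 1024**3))
    else:
        print('%.1fG free, letting the compaction finish' % (free / 1024**3))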
[12:43:55] _joe_: was reading service::node documentation. [12:44:02] mobrovac: yes. [12:44:10] <_joe_> oh, ok [12:44:49] kart_: lemme take a look at that spec and we'll take it from there [12:44:57] s/that/cxserver's/ [12:45:16] ok! [12:52:57] (03PS2) 10Giuseppe Lavagetto: Allow pybal::testing to vary the pybal config [puppet] - 10https://gerrit.wikimedia.org/r/259994 [12:54:20] (03CR) 10Giuseppe Lavagetto: [C: 032] Allow pybal::testing to vary the pybal config [puppet] - 10https://gerrit.wikimedia.org/r/259994 (owner: 10Giuseppe Lavagetto) [12:55:13] (03PS1) 10Mobrovac: CXServer: Set up automatic monitoring [puppet] - 10https://gerrit.wikimedia.org/r/260001 [12:56:32] (03CR) 10Mobrovac: "Verified to work on sca:" [puppet] - 10https://gerrit.wikimedia.org/r/260001 (owner: 10Mobrovac) [12:56:50] kart_: _joe_: akosiaris: ^^ [12:57:21] healthcheck url == '' ? [12:57:27] that's weird [12:57:31] <_joe_> wow, it works? [12:57:36] <_joe_> I'm impressed [12:57:52] _joe_: that's what I asked :) [12:58:20] <_joe_> no I meant "the spec is actually accurate and doesn't expose bugs in check_service" [12:58:26] <_joe_> of course an empty string works [12:58:39] <_joe_> it's just interpolated in a template [12:59:41] (03PS2) 10KartikMistry: CXServer: Set up automatic monitoring [puppet] - 10https://gerrit.wikimedia.org/r/260001 (https://phabricator.wikimedia.org/T121776) (owner: 10Mobrovac) [13:00:08] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM but I have no time to merge it now" [puppet] - 10https://gerrit.wikimedia.org/r/260001 (https://phabricator.wikimedia.org/T121776) (owner: 10Mobrovac) [13:00:14] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [13:00:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [13:00:14] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [13:00:54] <_joe_> mobrovac: I'll merge that patch after lunch, I don't think it's risky even for a friday afternoon [13:01:23] _joe_: cool, worst case scenario, we revert it, but i've already checked in on sca1001 [13:01:48] <_joe_> yup [13:03:31] mobrovac: thanks! 
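On the healthcheck discussion above: as _joe_ describes it, the monitoring check fetches the swagger spec a service exposes and verifies that responses match what the spec declares. The sketch below is a much-simplified version of that idea, using the cxserver spec URL mentioned in the log; the real check_service script does considerably more (templated paths, response validation), and the basePath handling here is an assumption:

    import json
    import urllib.error
    import urllib.request

    BASE = 'http://localhost:8080'
    SPEC_URL = BASE + '/v1/?spec'           # cxserver's spec URL, per the discussion above

    spec = json.load(urllib.request.urlopen(SPEC_URL))
    base_path = spec.get('basePath', '/v1') # assumed default if the spec omits it

    for path, methods in sorted(spec.get('paths', {}).items()):
        if 'get' not in methods or '{' in path:
            continue                        # only plain GET paths in this sketch
        url = BASE + base_path + path
        try:
            status = urllib.request.urlopen(url, timeout=5).status
        except urllib.error.HTTPError as err:
            status = err.code
        print(status, url)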
[13:03:44] np [13:05:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [13:05:14] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [13:05:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [13:05:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [13:05:15] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [13:05:15] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [13:05:15] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: puppet fail [13:05:15] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail [13:05:16] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: puppet fail [13:10:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [13:10:14] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [13:10:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [13:10:14] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [13:10:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [13:10:15] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [13:10:15] RECOVERY - check_puppetrun on payments2001 is OK: OK: Puppet is currently enabled, last run 97 seconds ago with 0 failures [13:10:15] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: puppet fail [13:10:16] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail [13:15:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [13:15:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [13:15:15] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [13:15:15] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 149 seconds ago with 0 failures [13:15:15] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [13:15:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [13:15:15] RECOVERY - check_puppetrun on payments2002 is OK: OK: Puppet is currently enabled, last run 163 seconds ago with 0 failures [13:15:15] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail [13:18:02] (03PS2) 10BBlack: Text VCL: move filter_(headers|noise) from common [puppet] - 10https://gerrit.wikimedia.org/r/259976 [13:18:09] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: move filter_(headers|noise) from common [puppet] - 10https://gerrit.wikimedia.org/r/259976 (owner: 10BBlack) [13:19:07] (03PS2) 10BBlack: Text VCL: de-duplicate common recv code [puppet] - 10https://gerrit.wikimedia.org/r/259977 [13:19:14] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: de-duplicate common recv code [puppet] - 10https://gerrit.wikimedia.org/r/259977 (owner: 10BBlack) [13:19:52] (03PS2) 10BBlack: Text VCL: move normalize_path from common [puppet] - 10https://gerrit.wikimedia.org/r/259978 [13:19:59] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: move normalize_path from common [puppet] - 10https://gerrit.wikimedia.org/r/259978 (owner: 10BBlack) [13:20:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [13:20:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [13:20:14] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 286 seconds ago with 0 failures [13:20:14] 
RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [13:20:15] RECOVERY - check_puppetrun on pay-lvs2002 is OK: OK: Puppet is currently enabled, last run 68 seconds ago with 0 failures [13:20:15] RECOVERY - check_puppetrun on payments2003 is OK: OK: Puppet is currently enabled, last run 194 seconds ago with 0 failures [13:21:16] mobrovac: one more novice question: once healthcheck is placed. Is there any way to get email notification or only IRC? [13:21:55] kart_: for now it's irc only, but there's a future idea to be able to get emails about it as well [13:25:14] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:25:14] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 204 seconds ago with 0 failures [13:25:45] (03PS2) 10BBlack: VCL: move retry5xx to parsoid (only user left) [puppet] - 10https://gerrit.wikimedia.org/r/259979 [13:25:59] (03CR) 10BBlack: [C: 032 V: 032] VCL: move retry5xx to parsoid (only user left) [puppet] - 10https://gerrit.wikimedia.org/r/259979 (owner: 10BBlack) [13:26:56] mobrovac: OK. Emails will be nice! [13:31:01] (03PS2) 10BBlack: VCL: standardize retry503, move oddball to parsoid-backend [puppet] - 10https://gerrit.wikimedia.org/r/259980 [13:32:44] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [13:33:19] (03PS3) 10BBlack: VCL: standardize retry503, move oddball to parsoid-backend [puppet] - 10https://gerrit.wikimedia.org/r/259980 [13:34:17] (03CR) 10BBlack: [C: 032 V: 032] VCL: standardize retry503, move oddball to parsoid-backend [puppet] - 10https://gerrit.wikimedia.org/r/259980 (owner: 10BBlack) [13:46:46] (03PS4) 10BBlack: Text VCL: Fix up logged-in users caching [puppet] - 10https://gerrit.wikimedia.org/r/259882 [13:56:45] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:30] 6operations, 7Monitoring: switch diamond to use graphite line protocol - https://phabricator.wikimedia.org/T121861#1890371 (10fgiunchedi) 3NEW [14:16:37] 6operations, 7Monitoring: switch diamond to use graphite line protocol - https://phabricator.wikimedia.org/T121861#1890378 (10fgiunchedi) a:3fgiunchedi [14:18:49] 6operations, 7Graphite, 5Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#1890384 (10fgiunchedi) [14:18:50] 6operations, 7Graphite, 5Patch-For-Review: diamond should send statsd metrics in batches - https://phabricator.wikimedia.org/T116033#1890380 (10fgiunchedi) 5Open>3Resolved fixed, but see also {T121861} [14:41:53] (03PS3) 10Giuseppe Lavagetto: CXServer: Set up automatic monitoring [puppet] - 10https://gerrit.wikimedia.org/r/260001 (https://phabricator.wikimedia.org/T121776) (owner: 10Mobrovac) [14:43:23] (03CR) 10Giuseppe Lavagetto: [C: 032] CXServer: Set up automatic monitoring [puppet] - 10https://gerrit.wikimedia.org/r/260001 (https://phabricator.wikimedia.org/T121776) (owner: 10Mobrovac) [14:43:38] !log set temp-url-key for mw:media account in swift codfw [14:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:08] !log update privatesettings with swift codfw configuration [14:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:30] do I need to propagate that to mira too? 
^ [14:45:40] <_joe_> godog: the next scap should do it [14:45:46] thnx _joe_ ! [14:47:04] _joe_: ack, thanks! [14:47:17] godog: r u keeping track of rb1008? [14:47:18] <_joe_> godog: but ask for confirmations [14:51:16] mobrovac: I saw the decom has finished towards 1008, we could stop compactions if it gets tight again [14:51:50] godog: it still is tight, 50G left free [14:52:09] godog: restart + stop compactions might be in order given it's firday? [14:53:57] mobrovac: I forgot to ask but I don't know what restarting cassandra really does in this case, afaik nothing? [14:54:15] !log gallium: restarted apache2 , was deadlocked/unresponsive somehow [14:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:53] godog: discards and cleans the compaction queue (apart from other things which we do not care about here) [14:55:08] !log SET GLOBAL query_cache_type = 0; on db1025 [14:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:20] godog: stopping compactions might have the same effect but i'm not sure [14:55:40] _joe_: good idea, I'll ask one of the deployers [14:56:12] mobrovac: I'll give stop a try first [14:56:22] k [14:57:17] !log stop compactions on restbase1008 [14:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:49] hey folks, let's say pl.wp would want to delete https://pl.wikipedia.org/wiki/Szablon:Potrzebne , which has 35k revisions. 'bigdelete' prevents that. could somebody very carefully delete it server-side? :D [15:02:19] jynus: i suppose that's a question to you ^. how problematic would it be? [15:03:21] (if it's impossible, i'll report back. if it would require some planning, but you're willing to do it, i'll file a task) [15:05:07] the revisions are not a problem, there, but the links here, etc. [15:05:55] that could be done in batches, I suppose there is a maintenance script? [15:06:46] i've always been told it's the revisions. since page deletion is super lame and requires moving the rows to a different table [15:07:05] (03PS1) 10Faidon Liambotis: tor: drop motd::script from the module [puppet] - 10https://gerrit.wikimedia.org/r/260010 [15:07:07] (03PS1) 10Faidon Liambotis: tor: pass hashed_control_password as a class variable [puppet] - 10https://gerrit.wikimedia.org/r/260011 [15:07:09] there's deleteBacth.php, but that just does normal deletion. [15:07:09] (03PS1) 10Faidon Liambotis: tor: drop tor_ prefix from all class variables [puppet] - 10https://gerrit.wikimedia.org/r/260012 [15:07:11] (03PS1) 10Faidon Liambotis: tor: drop default parameter values [puppet] - 10https://gerrit.wikimedia.org/r/260013 [15:07:13] (03PS1) 10Faidon Liambotis: tor: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/260014 [15:07:15] deleteRevision.php, maybe [15:07:17] (03PS1) 10Faidon Liambotis: tor: add a second relay running on the same server [puppet] - 10https://gerrit.wikimedia.org/r/260015 [15:08:35] MatmaRex, yes, what I meant is that 35K doesn't seem like a lot of them [15:08:59] (assuming a batched delete, not "all at once") [15:11:05] <_joe_> is gerrit more or less dead? 
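For readers following the tor refactor series just uploaded above (r260010–r260015): the end state it points at is one relay per instance on the same host, each with its own torrc. The sketch below only illustrates that shape, with assumed resource names, a flattened inline torrc instead of the module's real template, and an assumed systemctl path; it is not the contents of those patches.

    # Shared reload hook, one per host; each instance's torrc notifies it.
    exec { 'tor-systemd-reload':
        command     => '/bin/systemctl daemon-reload',
        refreshonly => true,
    }

    # One relay per instance, configured under /etc/tor/instances/<name>/.
    # Two instances can share an Address while listening on different ORPorts,
    # which is what "a second relay running on the same server" amounts to.
    define tor_instance_example (
        $address,
        $orport,
        $hashed_control_password,
    ) {
        $instance_dir = "/etc/tor/instances/${title}"

        file { $instance_dir:
            ensure => directory,
        }

        file { "${instance_dir}/torrc":
            ensure  => present,
            content => "Address ${address}\nORPort ${orport}\nHashedControlPassword ${hashed_control_password}\nSocksPort 0\n",
            require => File[$instance_dir],
            notify  => Exec['tor-systemd-reload'],
        }
    }

Declaring the define twice with different titles and ORPorts would give the second relay on the same server; the control password arrives as a parameter rather than being hardcoded, in the spirit of r260011.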
[15:11:34] MatmaRex: usually stewards do 'bigdelete' deletions after approval from a sysadmin [15:12:24] (03CR) 10Faidon Liambotis: [C: 032] tor: drop motd::script from the module [puppet] - 10https://gerrit.wikimedia.org/r/260010 (owner: 10Faidon Liambotis) [15:13:30] (03CR) 10Faidon Liambotis: [C: 032] tor: pass hashed_control_password as a class variable [puppet] - 10https://gerrit.wikimedia.org/r/260011 (owner: 10Faidon Liambotis) [15:13:40] (03CR) 10Faidon Liambotis: [C: 032] tor: drop tor_ prefix from all class variables [puppet] - 10https://gerrit.wikimedia.org/r/260012 (owner: 10Faidon Liambotis) [15:13:50] (03CR) 10Faidon Liambotis: [C: 032] tor: drop default parameter values [puppet] - 10https://gerrit.wikimedia.org/r/260013 (owner: 10Faidon Liambotis) [15:15:58] oh Reedy might know if PrivateSettings is synced to mira automatically (?) [15:17:06] (03PS1) 10Ottomata: Fix for kafka broker port ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/260016 [15:17:52] <_joe_> is gerrit super slow just for me? [15:18:10] 6operations, 7Availability: Figure out a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#1890459 (10fgiunchedi) [15:18:12] 6operations, 7Availability: Set $wmfSwiftCodfwConfig in PrivateSettings - https://phabricator.wikimedia.org/T119651#1890457 (10fgiunchedi) 5Open>3Resolved completed, `PrivateSettings.php` might need to be synced to mira tho [15:18:23] (03CR) 10Ottomata: [C: 032] Fix for kafka broker port ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/260016 (owner: 10Ottomata) [15:18:39] (03PS1) 10Giuseppe Lavagetto: deployment-prep: add conftool stanza to hiera config [puppet] - 10https://gerrit.wikimedia.org/r/260017 [15:21:16] (03PS2) 10Giuseppe Lavagetto: deployment-prep: add conftool stanza to hiera config [puppet] - 10https://gerrit.wikimedia.org/r/260017 [15:21:26] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] deployment-prep: add conftool stanza to hiera config [puppet] - 10https://gerrit.wikimedia.org/r/260017 (owner: 10Giuseppe Lavagetto) [15:24:17] 10Ops-Access-Requests, 6operations: Add James Alexander to Security@ - https://phabricator.wikimedia.org/T121807#1890464 (10mark) Makes total sense, yes. :) [15:27:43] 6operations, 6WMF-Legal, 7domains: wikipedia.lol - https://phabricator.wikimedia.org/T88861#1890467 (10Mschon) 5Resolved>3Open >dzahn: >Unless...we think it's worth the price to buy >SSL certs for a redirect to wikipedia.org. does wmf support https://letsencrypt.org ? couldn't wikipedia.lol be a good s... 
[15:35:12] (03PS2) 10Muehlenhoff: Log more errors [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259996 [15:35:18] (03CR) 10Muehlenhoff: [C: 032 V: 032] Log more errors [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259996 (owner: 10Muehlenhoff) [15:35:31] (03PS2) 10Muehlenhoff: Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259997 [15:35:49] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259997 (owner: 10Muehlenhoff) [15:36:02] (03PS2) 10Muehlenhoff: Fix local minion apt logging in case of incomplete minion setup [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259998 [15:36:26] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix local minion apt logging in case of incomplete minion setup [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259998 (owner: 10Muehlenhoff) [15:36:47] (03PS2) 10Muehlenhoff: Catch an additional dpkg error in log processing [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259999 [15:36:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Catch an additional dpkg error in log processing [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/259999 (owner: 10Muehlenhoff) [15:37:08] (03PS2) 10Muehlenhoff: Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/260000 [15:37:21] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/260000 (owner: 10Muehlenhoff) [15:39:28] (03PS1) 10Hashar: contint: stop cloning mediawiki/tools/codesniffer.git [puppet] - 10https://gerrit.wikimedia.org/r/260018 (https://phabricator.wikimedia.org/T66371) [15:40:01] jynus: yeah, mediawiki seems to only do it all at once, so it'd have to be manual-ish. [15:40:07] Glaisher: how? they don't seem to have the right [15:40:38] uh.. they regularly do bigdeletes on requests on SRM [15:41:16] MatmaRex: They do. https://meta.wikimedia.org/wiki/Special:GlobalGroupPermissions/steward [15:41:39] (03CR) 10Hashar: "Cherry picked on integration puppet master." [puppet] - 10https://gerrit.wikimedia.org/r/260018 (https://phabricator.wikimedia.org/T66371) (owner: 10Hashar) [15:41:42] Glaisher: huh. weird that it's not listed on https://meta.wikimedia.org/wiki/Special:ListGroupRights . [15:41:57] yeah, that's the *local* group [15:42:05] Meta has a local steward group. [15:42:08] mobrovac: thanks for helping with the report, assuming it's you [15:42:18] for rights that shouldn't be assigned globally [15:43:01] * MatmaRex hates crosswiki stuff [15:43:34] Nikerabbit: np, did some very minor tweaks [15:43:45] Nikerabbit: btw, all times need to be in UTC [15:44:23] mobrovac: there is one in IST, everything that I added is in UTC unless I made typos [15:52:00] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1890517 (10fgiunchedi) thanks @pchelolo ! another strategy I've seen is letting the user pick MTU as opposed to number of metrics, I think this would be more predictable... [15:52:28] (03PS1) 10Cmjohnson: yubikey chris [puppet] - 10https://gerrit.wikimedia.org/r/260024 [15:58:37] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1890533 (10Eevans) [16:00:44] 6operations, 6WMF-Legal, 7domains: wikipedia.lol - https://phabricator.wikimedia.org/T88861#1890537 (10jeremyb) >>! In T88861#1890467, @Mschon wrote: > does wmf support https://letsencrypt.org ? 
Please skim {T92002} from T92002#1876438, have a look at top of {T101048} and read T101048#1826868. [16:01:36] lol [16:03:54] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1890562 (10RobH) 5stalled>3Resolved I'm resolving this parent task, as it compared Dell and HP system specifications during the refresh. We've gone with the Dell systems... [16:07:12] (03PS1) 10Ottomata: new_wmf_service should create new branch from origin/production [puppet] - 10https://gerrit.wikimedia.org/r/260029 [16:08:16] 6operations, 6Labs: labswiki cannot connect to x1-slave (db1031), and soon, x1-master, either [Error connecting to 10.64.16.20: :real_connect(): (HY000/2003): Can't connect to MySQL server on '10.64.16.20' (4)] - https://phabricator.wikimedia.org/T121866#1890567 (10jcrespo) 3NEW [16:11:58] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1890576 (10Pchelolo) @fgiunchedi Hm, indeed that would work better. I'll amend the PR. [16:12:49] (03CR) 10Cmjohnson: [C: 032] yubikey chris [puppet] - 10https://gerrit.wikimedia.org/r/260024 (owner: 10Cmjohnson) [16:17:31] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 70705 MB (3% inode=99%) [16:18:15] (03PS2) 10Ottomata: new_wmf_service should create new branch from origin/production [puppet] - 10https://gerrit.wikimedia.org/r/260029 [16:18:21] (03CR) 10Ottomata: [C: 032 V: 032] new_wmf_service should create new branch from origin/production [puppet] - 10https://gerrit.wikimedia.org/r/260029 (owner: 10Ottomata) [16:20:21] !log restarting and reconfiguring mysql on db1047 [16:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:52] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1890591 (10RobH) [16:27:59] !log restarted apache on iridium to deploy redirect script changes [16:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:41] (03CR) 1020after4: [C: 031] Gerrit: redirect old gitweb project urls to Diffusion instead of Gitblit [puppet] - 10https://gerrit.wikimedia.org/r/257523 (owner: 10Chad) [16:31:32] (03PS2) 10Andrew Bogott: WIP: nova-network: have dnsmasq advertise the network host as a tftp server [puppet] - 10https://gerrit.wikimedia.org/r/259788 [16:31:34] (03PS2) 10Andrew Bogott: WIP: Set up special dhcp behavior for bare-metal boxes [puppet] - 10https://gerrit.wikimedia.org/r/259787 [16:31:36] (03PS1) 10Andrew Bogott: Trivial: Correct an inaccuracy in a comment. [puppet] - 10https://gerrit.wikimedia.org/r/260036 [16:31:38] (03PS1) 10Andrew Bogott: Insert dns entries for labs bare-metal systems. [puppet] - 10https://gerrit.wikimedia.org/r/260037 [16:32:30] (03PS1) 10Ottomata: Attempt to move eventbus::eventbus role back to just eventbus where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/260038 [16:32:52] godog: Did you find out? [16:33:21] Reedy: not for sure, looks like it should be sync'ed just everything else [16:33:28] It should, yeah [16:33:36] Certainly if explicitly synced [16:35:37] ottomata: btw there are some kafka replica alerts unknown in icinga [16:36:44] oh hm! 
[16:36:57] different than I expected, ok thanks godog [16:37:40] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [16:38:57] ottomata: that's you ^ :) [16:39:11] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [16:39:43] (03PS1) 10Ottomata: Fix icinga check from arg in kafka::server::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/260042 [16:39:58] (03PS1) 10Jcrespo: Reconfiguring db1047 mysql [puppet] - 10https://gerrit.wikimedia.org/r/260043 [16:40:01] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: puppet fail [16:40:07] (sorry about merge warning, fixed.) [16:40:33] (03CR) 10Ottomata: [C: 032] Fix icinga check from arg in kafka::server::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/260042 (owner: 10Ottomata) [16:40:53] (03PS1) 10Ottomata: Update kafka submodule with icinga from param fix [puppet] - 10https://gerrit.wikimedia.org/r/260044 [16:41:09] (03CR) 10Ottomata: [C: 032 V: 032] Update kafka submodule with icinga from param fix [puppet] - 10https://gerrit.wikimedia.org/r/260044 (owner: 10Ottomata) [16:41:11] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:41:41] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [16:42:27] (03PS2) 10Andrew Bogott: Insert dns entries for labs bare-metal systems. [puppet] - 10https://gerrit.wikimedia.org/r/260037 [16:43:30] (03CR) 10jenkins-bot: [V: 04-1] Insert dns entries for labs bare-metal systems. [puppet] - 10https://gerrit.wikimedia.org/r/260037 (owner: 10Andrew Bogott) [16:44:01] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:44:20] (03PS2) 10Jcrespo: Reconfiguring db1047 mysql [puppet] - 10https://gerrit.wikimedia.org/r/260043 [16:45:32] (03PS3) 10Jcrespo: Reconfiguring db1047 mysql [puppet] - 10https://gerrit.wikimedia.org/r/260043 [16:47:39] (03PS3) 10Andrew Bogott: WIP: nova-network: have dnsmasq advertise the network host as a tftp server [puppet] - 10https://gerrit.wikimedia.org/r/259788 [16:47:41] PROBLEM - zotero on sca1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:41] (03PS3) 10Andrew Bogott: WIP: Set up special dhcp behavior for bare-metal boxes [puppet] - 10https://gerrit.wikimedia.org/r/259787 [16:47:43] (03PS3) 10Andrew Bogott: Insert dns entries for labs bare-metal systems. [puppet] - 10https://gerrit.wikimedia.org/r/260037 [16:47:45] (03PS2) 10Andrew Bogott: Trivial: Correct an inaccuracy in a comment. [puppet] - 10https://gerrit.wikimedia.org/r/260036 [16:48:54] (03Abandoned) 10Andrew Bogott: Trivial: Correct an inaccuracy in a comment. [puppet] - 10https://gerrit.wikimedia.org/r/260036 (owner: 10Andrew Bogott) [16:49:33] (03CR) 10jenkins-bot: [V: 04-1] Insert dns entries for labs bare-metal systems. 
[puppet] - 10https://gerrit.wikimedia.org/r/260037 (owner: 10Andrew Bogott) [16:49:50] (03PS4) 10Jcrespo: Reconfiguring db1047 mysql [puppet] - 10https://gerrit.wikimedia.org/r/260043 [16:50:40] (03PS1) 10Ottomata: Add LVS/PyBal config for eventbus [puppet] - 10https://gerrit.wikimedia.org/r/260047 (https://phabricator.wikimedia.org/T118780) [16:51:45] (03PS5) 10Jcrespo: Reconfiguring db1047 mysql [puppet] - 10https://gerrit.wikimedia.org/r/260043 [16:52:12] (03PS4) 10Andrew Bogott: WIP: nova-network: have dnsmasq advertise the network host as a tftp server [puppet] - 10https://gerrit.wikimedia.org/r/259788 [16:52:14] (03PS4) 10Andrew Bogott: WIP: Set up special dhcp behavior for bare-metal boxes [puppet] - 10https://gerrit.wikimedia.org/r/259787 [16:52:16] (03PS4) 10Andrew Bogott: Insert dns entries for labs bare-metal systems. [puppet] - 10https://gerrit.wikimedia.org/r/260037 [16:53:17] 6operations, 10ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1890743 (10RobH) 3NEW a:3RobH [16:56:35] 6operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#1890778 (10Ottomata) 3NEW a:3Joe [16:59:26] (03CR) 10Jcrespo: [C: 032] Reconfiguring db1047 mysql [puppet] - 10https://gerrit.wikimedia.org/r/260043 (owner: 10Jcrespo) [17:06:18] 6operations, 10ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1890806 (10RobH) [17:06:27] 6operations, 10ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1890808 (10RobH) a:5RobH>3Papaul [17:07:05] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:11:22] 6operations, 10ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1890843 (10RobH) [17:14:10] I will love you ethernally if you can do that for me and I can focus on the data and mysql parts [17:17:55] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1890863 (10RobH) [17:18:03] (03PS1) 10BBlack: varnish: appropriate t2-fe -> t1-be backend_options [puppet] - 10https://gerrit.wikimedia.org/r/260048 (https://phabricator.wikimedia.org/T121564) [17:19:00] (03CR) 10jenkins-bot: [V: 04-1] varnish: appropriate t2-fe -> t1-be backend_options [puppet] - 10https://gerrit.wikimedia.org/r/260048 (https://phabricator.wikimedia.org/T121564) (owner: 10BBlack) [17:20:34] 6operations, 10ops-eqiad: rack/setup pc1004-1006 - https://phabricator.wikimedia.org/T121888#1890877 (10RobH) 3NEW a:3Cmjohnson [17:21:16] (03PS1) 10Jcrespo: Reconfiguring db1047 and cleaning up db servers [puppet] - 10https://gerrit.wikimedia.org/r/260049 [17:21:26] (03PS2) 10Jcrespo: Reconfiguring db1047 and cleaning up db servers [puppet] - 10https://gerrit.wikimedia.org/r/260049 [17:22:44] (03CR) 10Jcrespo: [C: 032] Reconfiguring db1047 and cleaning up db servers [puppet] - 10https://gerrit.wikimedia.org/r/260049 (owner: 10Jcrespo) [17:26:40] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 10procurement: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1890897 (10mark) @cmjohnson: so where is the original controller in labstore1001 now, and what was wrong with it? If it's not usable, let's order a... 
[17:28:24] (03PS1) 10Jcrespo: Adding SSL support to eventlogging databases [puppet] - 10https://gerrit.wikimedia.org/r/260050 [17:29:01] (03CR) 10Jcrespo: [C: 032] Adding SSL support to eventlogging databases [puppet] - 10https://gerrit.wikimedia.org/r/260050 (owner: 10Jcrespo) [17:29:10] 6operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#1890908 (10RobH) We do indeed lack conf2XXX deployments in codfw. After a IRC chat with @ottomata, we're going to assign this task to @joe for his input on how he p... [17:29:24] 6operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#1890909 (10RobH) p:5Triage>3High [17:34:55] (03PS2) 10BBlack: varnish: appropriate t2-fe -> t1-be backend_options [puppet] - 10https://gerrit.wikimedia.org/r/260048 (https://phabricator.wikimedia.org/T121564) [17:40:37] PROBLEM - puppet last run on restbase-test2003 is CRITICAL: CRITICAL: puppet fail [17:51:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:51:19] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:58:42] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 10procurement: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1890973 (10Cmjohnson) I do not think there were any real problems with it but @coren wanted to replace it jic there was a h/w issue. [17:59:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:59:19] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:03:20] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 10procurement: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1890991 (10mark) @cmjohnson, do we have any ticket or other communication on that? I'm wondering what the issue was, i'm only aware of issues with l... [18:06:48] RECOVERY - puppet last run on restbase-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:14:07] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1261827 (10RobH) [18:14:41] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1261827 (10RobH) I've created T121893 for the pricining/quoting/ordering of the controllers. Keeping that order to a sub-task will allow this troubleshooting task t... [18:18:18] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail [18:21:36] 10Ops-Access-Requests, 6operations: Add James Alexander to Security@ - https://phabricator.wikimedia.org/T121807#1891061 (10Dzahn) a:3Dzahn [18:23:57] 10Ops-Access-Requests, 6operations: Add James Alexander to Security@ - https://phabricator.wikimedia.org/T121807#1891076 (10Dzahn) I was about to do it and saw James has already been added. ( 'root' did it ). 
[18:25:58] 10Ops-Access-Requests, 6operations: Add James Alexander to Security@ - https://phabricator.wikimedia.org/T121807#1891089 (10Dzahn) 5Open>3Resolved [18:26:29] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1891092 (10Ottomata) Ok! We just had a very helpful meeting about this. I'm going to try to summarize and make a recom... [18:42:29] (03PS2) 10Dzahn: Gerrit: redirect old gitweb project urls to Diffusion instead of Gitblit [puppet] - 10https://gerrit.wikimedia.org/r/257523 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [18:42:45] (03PS3) 10Dzahn: Gerrit: redirect old gitweb project urls to Diffusion instead of Gitblit [puppet] - 10https://gerrit.wikimedia.org/r/257523 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [18:43:53] !log Created account "Krinkle" on collabwiki [18:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:35] !log ditto [18:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:45:57] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:46:53] (03CR) 10Dzahn: [C: 032] "please expect a gerrit restart resulting in very short downtime - for this redirect change - it's to ..kill gitblit a bit more" [puppet] - 10https://gerrit.wikimedia.org/r/257523 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [18:47:47] !log gerrit will restart in a moment and be right back [18:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:17] !log disregard that, apache config only is enough [18:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:04:52] (03PS2) 10Dzahn: tor: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/260014 (owner: 10Faidon Liambotis) [19:07:58] (03CR) 10Aaron Schulz: [C: 031] New hardware token protected ssh key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/259971 (owner: 10Aaron Schulz) [19:08:25] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1511/radium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/260014 (owner: 10Faidon Liambotis) [19:14:34] (03PS2) 10RobH: New hardware token protected ssh key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/259971 (owner: 10Aaron Schulz) [19:14:50] (03CR) 10RobH: [C: 032] New hardware token protected ssh key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/259971 (owner: 10Aaron Schulz) [19:16:45] (03CR) 10Dzahn: [C: 04-1] "the address => 'tor-eqiad-1.wikimedia.org', is used for both" [puppet] - 10https://gerrit.wikimedia.org/r/260015 (owner: 10Faidon Liambotis) [19:17:07] (03PS3) 10RobH: New hardware token protected ssh key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/259971 (owner: 10Aaron Schulz) [19:17:34] (03CR) 10RobH: [C: 032] New hardware token protected ssh key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/259971 (owner: 10Aaron Schulz) [19:18:07] (03PS2) 10Dzahn: tor: add a second relay running on the same server [puppet] - 10https://gerrit.wikimedia.org/r/260015 (owner: 10Faidon Liambotis) [19:24:10] (03CR) 10Aaron Schulz: [C: 032] Adjust queue "maxPartitionsTry" and timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259969 (owner: 10Aaron Schulz) [19:24:28] 6operations, 10Incident-Labs-NFS-20151216: 
Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#1891298 (10chasemp) 3NEW [19:24:38] (03Merged) 10jenkins-bot: Adjust queue "maxPartitionsTry" and timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259969 (owner: 10Aaron Schulz) [19:24:46] 6operations, 10ops-eqiad, 10Incident-Labs-NFS-20151216, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1891304 (10chasemp) [19:24:59] 6operations, 10Incident-Labs-NFS-20151216, 6Labs: Investigate better way of deferring activation of Labs LVM volumes (and corresponding snapshots) until after system boot - https://phabricator.wikimedia.org/T121629#1891305 (10chasemp) [19:26:07] !log aaron@tin Synchronized wmf-config/jobqueue-eqiad.php: Adjust queue "maxPartitionsTry" and timeouts (duration: 00m 30s) [19:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:26:22] (03PS3) 10Dzahn: tor: add a second relay running on the same server [puppet] - 10https://gerrit.wikimedia.org/r/260015 (owner: 10Faidon Liambotis) [19:26:55] 6operations, 10Incident-Labs-NFS-20151216: Reinstall labstore1002 to ensure consistency with labstore1001 - https://phabricator.wikimedia.org/T121905#1891322 (10chasemp) 3NEW [19:26:59] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 70614 MB (3% inode=99%) [19:27:25] 6operations, 10Incident-Labs-NFS-20151216: Reinstall labstore1002 to ensure consistency with labstore1001 - https://phabricator.wikimedia.org/T121905#1891330 (10chasemp) [19:27:46] (03CR) 10Aaron Schulz: [C: 032] Set $wgCentralAuthUseSlaves in betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259611 (owner: 10Aaron Schulz) [19:28:32] (03Merged) 10jenkins-bot: Set $wgCentralAuthUseSlaves in betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259611 (owner: 10Aaron Schulz) [19:30:05] 6operations, 10Incident-Labs-NFS-20151216: Add step in start-nfs to ask operator to consider dropping some snapshots - https://phabricator.wikimedia.org/T121890#1891335 (10chasemp) [19:35:22] (03CR) 10Dzahn: [C: 032] tor: add a second relay running on the same server [puppet] - 10https://gerrit.wikimedia.org/r/260015 (owner: 10Faidon Liambotis) [19:36:19] 6operations, 10Incident-Labs-NFS-20151216, 6Labs: Investigate better way of deferring activation of Labs LVM volumes (and corresponding snapshots) until after system boot - https://phabricator.wikimedia.org/T121629#1891348 (10chasemp) [19:37:13] (03CR) 10Dzahn: "we have > 0.2.7.6" [puppet] - 10https://gerrit.wikimedia.org/r/260014 (owner: 10Faidon Liambotis) [19:41:19] (03PS1) 10Dzahn: tor: use tor::instance to set up both [puppet] - 10https://gerrit.wikimedia.org/r/260064 [19:42:18] (03PS2) 10Dzahn: tor: use tor::instance to set up both [puppet] - 10https://gerrit.wikimedia.org/r/260064 [19:46:51] (03CR) 10Dzahn: "tor.service loaded active exited Anonymizing overlay network for TCP (mu" [puppet] - 10https://gerrit.wikimedia.org/r/260015 (owner: 10Faidon Liambotis) [19:46:58] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [19:47:02] 6operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992#1891366 (10RobH) a:3MoritzMuehlenhoff We need to migrate the pwstore from iron to palladium. Since we already keep the private repo there, this seems to be an in scope use of palladium. 
Does this sound reasonable? Since @MoritzMuehl... [19:47:08] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [19:51:47] (03CR) 10Dzahn: [C: 04-1] "Error: Could not find resource 'Exec[tor-systemd-reload]' for relationship from 'File[/etc/tor/instances/wikimediaeqiad1/torrc]'" [puppet] - 10https://gerrit.wikimedia.org/r/260064 (owner: 10Dzahn) [20:04:20] _joe_: should I just amend https://gerrit.wikimedia.org/r/#/c/197499/ ? [20:06:48] (03PS1) 10Dzahn: tor: move role to module/role [puppet] - 10https://gerrit.wikimedia.org/r/260065 [20:09:25] (03CR) 10Ottomata: "Dcausse, I just saw this in my queue. Ping me about it Monday and I will merge." [puppet] - 10https://gerrit.wikimedia.org/r/256954 (owner: 10DCausse) [20:22:54] (03PS2) 10Dzahn: tor: move role to module/role [puppet] - 10https://gerrit.wikimedia.org/r/260065 [20:38:10] 6operations, 6Analytics-Kanban, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1891481 (10Ottomata) [20:39:23] mutante: if you get that role/tor.pp thing to work, let me know [20:39:30] ohh [20:39:31] hm [20:39:36] it is tor/relay.pp [20:39:37] that will be fine [20:40:07] mutante: [20:40:09] do you know why [20:40:10] Use tor::relay because that doesn't cause a problem [20:40:10] with the role keyword (unlike for role::tor -> tor/init.pp) [20:40:10] and it makes sense to be more specific what it is. [20:40:11] doesn't work? [20:40:17] why can't role/manifests/tor.pp work? [20:44:16] ottomata: so i had the exact same question and the other day i was told that it's a limitation of the role keyword implementation [20:44:35] so i did the above to kind of cheat around the issue [20:44:45] while i would also argue that it's a better name like this too [20:44:47] yeah, that's why I have role/manifests/eventbus/eventbus [20:44:51] which is not a better name :/ [20:45:14] i also thought once that would be like module/role/manifests/tor/ini.pp [20:45:18] init.pp [20:45:27] like with the other modules, but no [20:45:41] yeah, i keep suggesting that we should just use a different module_path [20:45:50] so what i did so far is pick all the examples where it's easy to move [20:45:52] and treat roles just like modules, but in a separate directory [20:45:56] because you dont have this issue [20:46:18] roles that are already structured like role::bar::foo and role::bar::baz and simply not role::bar [20:47:00] ottomata: we should ask joe about it because he did the role keyword [20:47:03] yeah [20:47:10] his opinion about that limitation [20:47:25] i have asked once, he said try restarting puppetmaster. 
I finally got around to trying it in beta today, but ja, doesn't work [20:47:45] or we rename all the roles [20:48:19] if it was role::foo and there is just one then make it role::foo::main or something [20:48:44] (03PS1) 10Ottomata: Use role::kafka::analytics::broker on analytics kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/260070 [20:48:46] nawww don't like it :) [20:48:52] secondary module path is the way to go [20:50:56] i haven't thought about that, i'm open about any solution that helps us kill the manifest/role structure, yea [20:51:35] (03CR) 10Ottomata: [C: 032] Use role::kafka::analytics::broker on analytics kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/260070 (owner: 10Ottomata) [20:52:33] it would work just like modules/ [20:52:35] except it would be [20:52:40] role// [20:52:47] autoloading would work excatly the same as it does with modules [20:53:03] i would just allow us for the distinction between generic modules and more specific role classes [20:53:44] aaah, interesting [20:53:48] i see [20:58:22] (03PS1) 10Ottomata: Move kafka role hiera to new location [puppet] - 10https://gerrit.wikimedia.org/r/260072 [20:58:55] (03CR) 10Ottomata: [C: 032 V: 032] Move kafka role hiera to new location [puppet] - 10https://gerrit.wikimedia.org/r/260072 (owner: 10Ottomata) [21:05:02] (03PS1) 10Ottomata: Fix variable assignment in role::kafka::analytics::broker [puppet] - 10https://gerrit.wikimedia.org/r/260075 [21:06:51] (03CR) 10Ottomata: [C: 032] Fix variable assignment in role::kafka::analytics::broker [puppet] - 10https://gerrit.wikimedia.org/r/260075 (owner: 10Ottomata) [21:06:59] (03PS1) 10BBlack: add yubi piv key for bblack [puppet] - 10https://gerrit.wikimedia.org/r/260076 [21:10:50] (03PS2) 10BBlack: add yubi piv key for bblack [puppet] - 10https://gerrit.wikimedia.org/r/260076 [21:10:56] (03CR) 10BBlack: [C: 032 V: 032] add yubi piv key for bblack [puppet] - 10https://gerrit.wikimedia.org/r/260076 (owner: 10BBlack) [21:17:08] hey Jeff_Green [21:17:17] is the role::logging::kafkatee::webrequest::fundraising class used anywhere anymore? [21:17:25] i originally made it to test fundraising kafkatee stuff [21:17:37] but I think you moved that to your own configs in frack, ja? 
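To make the "secondary module path" idea from the exchange above concrete, here is a sketch. Everything in it is an assumption about how it could look, not the current ops/puppet layout: the premise is that if a role tree sat on the masters' module path, role classes would autoload exactly like module classes, so a one-word role such as role::tor would no longer need workarounds like tor::relay or role::eventbus::eventbus.

    # One way to get there (assumed layout): make "role" a module of its own,
    # e.g. modules/role/manifests/, or a separate directory added to
    # basemodulepath in puppet.conf. Autoloading then maps class names to
    # files the same way it does for ordinary modules:
    #   role::tor                      -> role/manifests/tor.pp
    #   role::kafka::analytics::broker -> role/manifests/kafka/analytics/broker.pp

    # role/manifests/tor.pp
    class role::tor {
        include ::tor::relay
    }

Under that layout the class is resolved by normal autoloading rather than by files under manifests/role/, which is the behaviour being argued for above.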
[21:17:41] i don't see it anywhere in ops puppet [21:20:41] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1018_v4 [21:20:57] (03PS1) 10Ottomata: Make kafkatee on oxygen use new kafka role class [puppet] - 10https://gerrit.wikimedia.org/r/260078 [21:21:13] (03PS2) 10Ottomata: Make kafkatee on oxygen use new kafka role class [puppet] - 10https://gerrit.wikimedia.org/r/260078 [21:21:24] (03PS1) 10BBlack: remove previous ssh key for bblack [puppet] - 10https://gerrit.wikimedia.org/r/260079 [21:21:46] (03CR) 10BBlack: [C: 032 V: 032] remove previous ssh key for bblack [puppet] - 10https://gerrit.wikimedia.org/r/260079 (owner: 10BBlack) [21:22:33] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 28 ESP OK [21:23:11] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1022_v4 [21:23:12] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1018_v4 [21:23:12] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet has 1 failures [21:24:50] (03PS3) 10Ottomata: Make kafkatee on oxygen use new kafka role class [puppet] - 10https://gerrit.wikimedia.org/r/260078 [21:24:56] (03CR) 10Ottomata: [C: 032 V: 032] Make kafkatee on oxygen use new kafka role class [puppet] - 10https://gerrit.wikimedia.org/r/260078 (owner: 10Ottomata) [21:25:03] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 20 ESP OK [21:25:03] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 20 ESP OK [21:26:59] (03PS1) 10Ottomata: Remove unused role::logging::kafkatee::webrequest::fundraising class [puppet] - 10https://gerrit.wikimedia.org/r/260082 [21:27:21] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1013_v4 [21:27:25] (03CR) 10Ottomata: "pretty sure this is unused, I will abandon if it is not." [puppet] - 10https://gerrit.wikimedia.org/r/260082 (owner: 10Ottomata) [21:27:31] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1013_v4 [21:27:36] ottomata: yeah that's unused [21:28:13] ok thanks! [21:28:22] (03CR) 10Ottomata: [C: 032 V: 032] Remove unused role::logging::kafkatee::webrequest::fundraising class [puppet] - 10https://gerrit.wikimedia.org/r/260082 (owner: 10Ottomata) [21:28:41] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 18 connecting: kafka1013_v4,kafka1018_v4 [21:29:01] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1022_v4 [21:29:13] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 20 ESP OK [21:29:21] (03PS1) 10Ottomata: Use new kafka role class for refinery camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/260083 [21:29:23] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 28 ESP OK [21:29:43] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 18 connecting: kafka1014_v4,kafka1022_v6 [21:29:54] HmmMMm bblack could that be something I just did? [21:30:13] i didn't change anything that I know of.... 
[21:30:45] (03PS2) 10Ottomata: Use new kafka role class for refinery camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/260083 [21:31:11] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1020_v4 [21:32:02] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1018_v4 [21:32:59] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1014_v4 [21:33:00] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 20 ESP OK [21:34:10] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1020_v4 [21:34:28] Hmmm dunno what is going on here.>.> [21:34:39] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 28 ESP OK [21:34:49] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 20 ESP OK [21:35:11] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 28 ESP OK [21:35:20] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1022_v4 [21:35:30] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1013_v4 [21:36:47] ottomata: don't know, what have you done lately? :) [21:36:49] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 37 connecting: kafka1014_v6 [21:37:07] ha, well, i moved the kafka brokers to use a new role class, but there should have been no config changes [21:37:11] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 20 ESP OK [21:37:11] its just a puppet refactor [21:37:31] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 20 ESP OK [21:37:31] RECOVERY - IPsec on cp3014 is OK: Strongswan OK - 28 ESP OK [21:37:31] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1018_v4 [21:37:32] i haven't changed the cache varnishkafka yet, its still the same [21:37:35] they still have the same names, and still have the ipsec role in site.pp, right? [21:37:38] yes [21:37:50] was there a major bump in traffic, maybe packet loss? [21:37:52] role kafka::analytics::broker, ipsec [21:38:13] normally I'd say minor link quality issues, but we're seeing random failures to all 3x remote DCs above [21:38:16] not that i know of, but it is pretty correlated with my change [21:38:16] yeah [21:38:36] 6operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering-Team, 7HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1891582 (10Krinkle) [21:38:39] https://gerrit.wikimedia.org/r/#/c/260070/1/manifests/site.pp [21:39:20] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 28 ESP OK [21:39:58] bblack is debdeploy at all related,, no right? 
i did have to fix this: https://gerrit.wikimedia.org/r/#/c/260072/ [21:39:59] after [21:40:01] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 37 connecting: kafka1012_v4 [21:40:04] but, that was only on one kafka broker [21:40:10] i stopped puppet on all of them until i was sure one was ok [21:41:04] (03CR) 10Ottomata: [C: 032] Use new kafka role class for refinery camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/260083 (owner: 10Ottomata) [21:41:50] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 28 ESP OK [21:42:11] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1022_v4 [21:42:11] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1012_v4 [21:42:30] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1020_v4 [21:42:32] bblack, can you tell if this is just temporary, and these are actually recovering? [21:42:35] i can't tell what is going on at all [21:43:07] i see charon logs in syslog on some of these, but those logs are there for a while [21:43:21] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [21:44:09] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 20 ESP OK [21:44:09] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 37 connecting: kafka1013_v4 [21:44:30] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 38 ESP OK [21:45:51] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 38 ESP OK [21:45:58] bblack are you still with me? ::-S [21:46:10] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1013_v4 [21:46:30] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 28 ESP OK [21:47:11] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 37 connecting: kafka1018_v4 [21:47:52] yeah, def not working [21:47:52] im [21:47:58] i'm looking at cp3016 [21:48:00] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 38 ESP OK [21:48:05] varnishkafka not production [21:48:09] restarting vk there... [21:49:29] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:29] ottomata: no I'm not entirely with you, I have a bunch of windows open :P [21:50:00] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 37 connecting: kafka1022_v4 [21:50:04] ottomata: varnishkafka isn't the issue: ipsec connections are randomly timing out and failing [21:50:10] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 28 ESP OK [21:50:20] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 37 connecting: kafka1013_v4 [21:50:31] interface traffic is satured? changes to network config? changes to ferm rules? [21:50:51] (03PS1) 10BBlack: add backup yubikeys for bblack [puppet] - 10https://gerrit.wikimedia.org/r/260087 [21:51:08] (03CR) 10BBlack: [C: 032 V: 032] add backup yubikeys for bblack [puppet] - 10https://gerrit.wikimedia.org/r/260087 (owner: 10BBlack) [21:51:09] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 38 ESP OK [21:51:16] yeah the vk restart was a wild swing in the dark... [21:52:00] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1013_v4 [21:52:03] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 20 ESP OK [21:52:03] shoudln't have been any changes... 
[21:52:12] and I didn't see anything relevant in puppet diffs whne I ran puppet [21:52:39] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1012_v4 [21:52:50] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1014_v4 [21:53:31] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 37 connecting: kafka1012_v4 [21:53:33] I see some recent big changes to jmxtrans config [21:53:36] hmm ,bblack the ferm rule did change file names [21:53:37] yes [21:53:51] Dec 18 21:13:46 kafka1020 puppet-agent[11329]: (/Stage[main]/Kafka::Server::Jmxtrans/Jmxtrans::Metrics[kafka-kafka1020-9999]/File[/etc/jmxtrans/kafka-kafka1020-9999.json]/content) - { "host": "208.80.154.10" [21:53:52] those because i am using the config from hiera for ganglia now [21:53:55] Dec 18 21:13:46 kafka1020 puppet-agent[11329]: (/Stage[main]/Kafka::Server::Jmxtrans/Jmxtrans::Metrics[kafka-kafka1020-9999]/File[/etc/jmxtrans/kafka-kafka1020-9999.json]/content) - , "port": 9694 [21:53:58] instead of a hardcoded one [21:53:59] Dec 18 21:13:46 kafka1020 puppet-agent[11329]: (/Stage[main]/Kafka::Server::Jmxtrans/Jmxtrans::Metrics[kafka-kafka1020-9999]/File[/etc/jmxtrans/kafka-kafka1020-9999.json]/content) + { "host": "carbon.wikimedia.org" [21:54:03] Dec 18 21:13:46 kafka1020 puppet-agent[11329]: (/Stage[main]/Kafka::Server::Jmxtrans/Jmxtrans::Metrics[kafka-kafka1020-9999]/File[/etc/jmxtrans/kafka-kafka1020-9999.json]/content) + , "port": 9649 [21:54:07] was the port change from 9694 to 9649 intentional? [21:54:31] that was unexpected, but I assume the hiera one is more correct...and I don't much use kafka stats in ganglia anyroe [21:54:37] so i haven't looked into if that still is working yet [21:54:49] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 28 ESP OK [21:55:09] heh [21:55:39] there is a little funky network in spike [21:55:39] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=kafka1012.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Analytics+Kafka+cluster+eqiad [21:55:39] hm [21:55:46] who set the one in hiera. I mean on the surface, it looks like a transposition typo, if the point was s/puppetdata/hieradata/ [21:55:50] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 20 ESP OK [21:56:18] but only on 1022? [21:56:27] that was on 1020 where I was looking [21:56:30] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 28 ESP OK [21:56:32] sorry [21:56:41] i meant to say, not on 1022 [21:56:41] also, why is that jxmtrans config saying the same shit 400 times in a row? 
[21:56:43] the network spike [21:57:04] heh [21:57:06] yeah not the best [21:57:07] but it is [21:57:15] that's how it needed to be rendered to work [21:57:30] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 38 ESP OK [21:57:31] unrelated, but whatever [21:57:33] anyyyway, i'm pretty sure any jmxrans config is not causing this [21:57:34] yeah [21:57:50] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: kafka1013_v4,kafka1022_v4 [21:57:52] only a small spike in networkin overall [21:57:52] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Analytics+Kafka+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [21:58:10] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 38 ESP OK [21:58:16] well my thought was maybe a bad port# in the jmxtrans config could cause some shitty software to retry to send some data over and over and saturate a network link [21:58:25] but I don't see a big bump in interface traffic anyways [21:58:32] aye hm [21:58:40] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1018_v4 [21:58:40] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 37 connecting: kafka1012_v4 [21:58:58] and, whatever that bump is on kafka1012 is network in, and is over now anyway [21:59:00] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1014_v4 [21:59:00] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1013_v4 [21:59:20] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: kafka1018_v4,kafka1020_v6 [21:59:38] bblack, should I revert that change and just see if this goes away? i really can't think of what would ahve changed that would cause this [21:59:58] which change? [22:00:07] the role change [22:00:28] this https://gerrit.wikimedia.org/r/#/c/260083/ [22:00:28] and this: https://gerrit.wikimedia.org/r/#/c/260072/ [22:00:41] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 28 ESP OK [22:00:42] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 38 ESP OK [22:00:53] hm, i'll make a commit and use puppet compiler to see diffs [22:01:22] I see 260070, 260072, 260075 [22:01:29] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 38 ESP OK [22:01:29] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1013_v4 [22:01:39] all 3 of those in a row just before the ipsec fallout, seems all related, could've been merged close on palladium? [22:01:49] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1018_v4 [22:02:20] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1020_v4 [22:02:41] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1013_v4 [22:02:41] merged close? 
[22:03:09] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [22:03:21] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 20 ESP OK [22:04:00] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 38 ESP OK [22:04:01] oops, sorry, yeah 260070, not 260083 [22:04:19] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 20 ESP OK [22:04:33] (03PS1) 10Ottomata: Revert new kafka role inclusion on kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/260089 [22:04:41] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 20 ESP OK [22:05:09] RECOVERY - IPsec on cp3014 is OK: Strongswan OK - 28 ESP OK [22:05:24] (03CR) 10Alex Monk: [C: 031] "lgtm, did not test" [puppet] - 10https://gerrit.wikimedia.org/r/260037 (owner: 10Andrew Bogott) [22:05:27] Dec 18 21:13:40 kafka1020 puppet-agent[11329]: (/Stage[main]/Ferm/File[/etc/ferm/conf.d/10_kafka-ipsec-ike]) Filebucketed /etc/ferm/conf.d/10_kafka-ipsec-ike to puppet with sum f7888eb4958be6 [22:05:32] 8a9b7e3d8cc3f4b8d3 [22:05:34] Dec 18 21:13:40 kafka1020 puppet-agent[11329]: (/Stage[main]/Ferm/File[/etc/ferm/conf.d/10_kafka-ipsec-ike]/ensure) removed [22:05:43] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for cwdent - https://phabricator.wikimedia.org/T121916#1891651 (10cwdent) [22:06:56] ottomata: ok so the problem is: when I did the ipsec work I added new ferm rules to your kafka broker role [22:07:05] and then you switched them to a new role which doesn't have those ferm rules [22:07:17] ! [22:07:29] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 18 connecting: kafka1012_v4,kafka1013_v4 [22:07:32] this says no changes though: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1514/changes [22:07:33] ! [22:07:34] looking [22:07:40] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 20 ESP OK [22:07:46] OMG [22:07:49] totally my fault [22:07:49] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 38 ESP OK [22:07:50] man [22:07:53] thanks bblack [22:08:12] ok so, at the bottom of manifests/role/analytics/kafka.pp [22:08:20] the last two ferm things there, need to be in whatever the new role is [22:08:27] see it [22:08:28] fixing... [22:08:36] ferm::rule { 'kafka-ipsec-esp': + ferm::service { 'kafka-ipsec-ike': [22:08:40] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1018_v6 [22:08:50] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1012_v4 [22:09:08] (03PS1) 10Ottomata: Include IPSec ferm rules for analytics kafka broker role [puppet] - 10https://gerrit.wikimedia.org/r/260093 [22:09:08] whythecrap did jenkins say no changes?!?!?! [22:09:09] grrrr [22:09:14] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for cwdent - https://phabricator.wikimedia.org/T121916#1891670 (10K4-713) Approved! 
[22:09:20] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: kafka1013_v4,kafka1014_v4 [22:09:41] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: kafka1014_v4,kafka1020_v4 [22:09:46] (03CR) 10Ottomata: [C: 032 V: 032] Include IPSec ferm rules for analytics kafka broker role [puppet] - 10https://gerrit.wikimedia.org/r/260093 (owner: 10Ottomata) [22:10:50] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 20 ESP OK [22:10:59] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1018_v4 [22:11:22] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [22:11:22] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 20 ESP OK [22:11:40] grooooowwwwwl crap crackers, thanks bblack, i'm embarrassed I didn't see that. [22:11:41] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK [22:11:46] and annoyed that i trusted jenkins to tell me [22:11:52] am I doing the jenkins puppet compiler thing wrong? [22:12:31] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 20 ESP OK [22:12:51] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [22:14:54] ottomata: no idea what the catalog compiler was looking at really [22:14:57] mutante: it was on the same address intentionally... [22:15:29] ottomata: you already puppeting the brokers? [22:16:11] it's all ok in icinga now in any case [22:16:12] yes bblack [22:16:13] done [22:16:45] i see data from remote dc caches too [22:16:48] sheesh [22:16:48] mutante: can you please revert to my version and merge etc.?? [22:20:37] yikes, ok, thanks for helping me figure that out bblack, [22:20:39] !log aaron@tin Synchronized php-1.27.0-wmf.9/includes/jobqueue/aggregator/JobQueueAggregator.php: 2c942ba1782c42ee68622278a5e0a77e9027945d (duration: 00m 31s) [22:20:45] things look good now [22:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:20:53] i gotta run for the weekend [22:21:08] will check back in an hour just to be sure [22:21:11] latersrs [22:30:17] !log ebernhardson@tin Synchronized php-1.27.0-wmf.9/extensions/CirrusSearch/resources/ext.cirrus.suggest.js: override suggestion type reported in event logging (duration: 00m 30s) [22:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:34:09] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 69607 MB (3% inode=99%) [22:37:23] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1891745 (10Legoktm) p:5Normal>3Unbreak! [22:37:52] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1883244 (10Legoktm) Can reproduce easily with https://en.m.wikipedia.org/wiki/User:Legoktm/test [22:38:07] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1891749 (10Legoktm) [22:41:43] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1891759 (10Legoktm) https://en.m.wikipedia.org/wiki/User:Legoktm/test is broken, https://en.m.wikipedia.org/wiki/User:Leg...
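[Editor's note] Legoktm's test page gives a quick way to compare what the mobile Varnish cluster is serving against what the application servers render. A hedged shell sketch follows, assuming the X-Wikimedia-Debug header (the "x-debug header" Tgr mentions below) still routed requests around the caches the way it did at the time; the header value "1" is an assumption about the 2015-era setup.

```bash
URL='https://en.m.wikipedia.org/wiki/User:Legoktm/test'

# What the mobile Varnish cluster is serving (note Age / X-Cache / Last-Modified).
curl -sI "$URL" | grep -Ei '^(age|x-cache|last-modified):'

# Same page routed around Varnish via the debug header; if this copy is newer,
# purges are not reaching (or not being honored by) the mobile caches.
curl -sI -H 'X-Wikimedia-Debug: 1' "$URL" | grep -Ei '^(age|x-cache|last-modified):'
```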
[22:41:52] 6operations, 10Fundraising Tech Backlog, 10Traffic, 5Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1891760 (10ori) [22:43:34] 6operations, 10Traffic: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1891771 (10Krinkle) [22:44:44] 6operations, 6Performance-Team, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1891774 (10Reedy) Noting 3.12 is the next HHVM LTS (to be released end of Jan 2016) 3.9 is out of LTS support mid July 2016 The MW benchmarks are somewhat of a con. They're not in... [22:44:51] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1891775 (10dr0ptp4kt) @jdlrobson, @tgr, @bblack any insight? @ori pointed out that the Star Wars: The Force Awakens... [22:46:55] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1891786 (10Tgr) Routing around Varnish with the x-debug header makes the issue go away, so this is a problem with purge r... [22:47:18] 6operations, 10Traffic, 10Unplanned-Sprint-Work, 3Fundraising Sprint Zapp: Firefox SPDY is buggy and is causing geoip lookup errors for IPv6 users - https://phabricator.wikimedia.org/T121922#1891797 (10faidon) p:5Triage>3High [22:49:26] 6operations, 10Traffic, 10Unplanned-Sprint-Work, 3Fundraising Sprint Zapp: Firefox SPDY is buggy and is causing geoip lookup errors for IPv6 users - https://phabricator.wikimedia.org/T121922#1891801 (10faidon) This looks like a Firefox bug and it seems to affect fundraising right now. It doesn't appear to... [22:50:04] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1891803 (10Legoktm) a:3BBlack @bblack and I discussed this on IRC, and he figured out what's going wrong in the varnish...
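[Editor's note] Tgr's comment narrows T121594 to purge routing: the application layer serves the current revision, but the mobile caches keep the stale one. bblack's VCL patches below remove the mobile cluster's hash differentiation and stop blocking purges for text requests. A hedged sketch of an end-to-end purge check, assuming the long-standing on-wiki action=purge interface (a POST for anonymous users) still fans purges out to the cache layer:

```bash
URL='https://en.m.wikipedia.org/wiki/User:Legoktm/test'

# Cache age before the purge.
curl -sI "$URL" | grep -i '^age:'

# Ask MediaWiki to purge the page; this should emit purges toward the caches.
curl -s -X POST "$URL?action=purge" -o /dev/null

# A few seconds later the Age header should be near zero if the purge reached
# the mobile frontends; a large Age means it was dropped along the way.
sleep 5
curl -sI "$URL" | grep -i '^age:'
```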
[22:50:21] !log aaron@tin Synchronized php-1.27.0-wmf.9/includes/jobqueue/aggregator/JobQueueAggregatorRedis.php: 2c942ba1782c42ee68622278a5e0a77e9027945d (duration: 00m 30s) [22:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:21] 6operations, 10Traffic: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1891817 (10Krinkle) [22:52:24] 6operations, 10Fundraising Tech Backlog, 10Traffic, 5Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1891818 (10Krinkle) [22:54:23] aaron@mw1001:~$ netstat -a | grep 'rdb1001.eqiad.wmne:6378' | grep WAIT | wc -l [22:54:23] 0 [22:54:30] ori: no longer 4k \o/ [22:54:38] woohoo [22:55:46] \o/ [22:56:52] !log ebernhardson@tin Synchronized php-1.27.0-wmf.9/extensions/CirrusSearch/resources/ext.cirrus.suggest.js: override suggestion type reported in event logging (duration: 00m 30s) [22:57:54] !log ebernhardson@tin Synchronized php-1.27.0-wmf.9/resources/src/mediawiki/mediawiki.searchSuggest.js: allow override of suggestion type reported in event loggin (duration: 00m 29s) [22:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:40] (03PS1) 10BBlack: Text VCL: do not hash-differentiate on the mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/260154 (https://phabricator.wikimedia.org/T121594) [23:07:38] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: do not hash-differentiate on the mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/260154 (https://phabricator.wikimedia.org/T121594) (owner: 10BBlack) [23:14:32] (03CR) 10Dr0ptp4kt: Text VCL: do not hash-differentiate on the mobile cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/260154 (https://phabricator.wikimedia.org/T121594) (owner: 10BBlack) [23:23:34] (03PS3) 10BBlack: varnish: appropriate t2-fe -> t1-be backend_options [puppet] - 10https://gerrit.wikimedia.org/r/260048 (https://phabricator.wikimedia.org/T121564) [23:35:07] (03PS1) 10BBlack: Text VCL: do not block purge for text reqs to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/260162 (https://phabricator.wikimedia.org/T121594) [23:35:26] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: do not block purge for text reqs to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/260162 (https://phabricator.wikimedia.org/T121594) (owner: 10BBlack) [23:36:10] 6operations, 10MobileFrontend, 5Patch-For-Review: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1891911 (10dr0ptp4kt) Hmm. I'm still getting the stale copy for https://en.m.wikipedia.org/wiki/Star... [23:53:52] 6operations, 10Fundraising-Backlog, 10Traffic, 10Unplanned-Sprint-Work, 3Fundraising Sprint Zapp: Firefox SPDY is buggy and is causing geoip lookup errors for IPv6 users - https://phabricator.wikimedia.org/T121922#1891963 (10AndyRussG) [23:54:39] 6operations, 10Fundraising-Backlog, 10Traffic, 10Unplanned-Sprint-Work, 3Fundraising Sprint Zapp: Firefox SPDY is buggy and is causing geoip lookup errors for IPv6 users - https://phabricator.wikimedia.org/T121922#1891764 (10AndyRussG) [23:58:08] 6operations, 10Fundraising Tech Backlog, 10Traffic, 5Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1891973 (10ori)
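[Editor's note] The netstat one-liner earlier in this block (aaron checking connections to the job-queue redis on rdb1001, port 6378, after the JobQueueAggregator sync) is worth keeping around as a reusable check. A slightly generalized sketch, grepping on the port rather than the hostname since netstat truncates long hostnames to column width:

```bash
# Count connections to the job-queue redis stuck in TIME_WAIT
# (mirrors the netstat | grep WAIT | wc -l check from the log).
netstat -ant | awk '$6 == "TIME_WAIT"' | grep -c ':6378'

# Break TIME_WAIT down per remote endpoint to see who is churning connections;
# a regression back toward the earlier ~4k pile-up would stand out here.
netstat -ant | awk '$6 == "TIME_WAIT" {print $5}' | sort | uniq -c | sort -rn | head
```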