[00:00:04] ^d: The time is nigh to deploy Gerrit Maint (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140708T0000) [00:00:55] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.920 second response time [00:02:13] * YuviPanda burrs a bit [00:02:52] <^d> !log gerrit upgraded to 2.8.1-4-ga1048ce from 2.8.1-2-g724b796, back up. Might be slow for a bit while caches warm. [00:02:56] Logged the message, Master [00:03:09] YuviPanda: caps False and True [00:03:11] ? [00:03:17] chasemp: no, I checked the source code. [00:03:23] there's no code path for this [00:03:51] not sure what you mean https://github.com/BrightcoveOS/Diamond/wiki/Configuration [00:03:57] enabled - True or False - Run this collector? [00:04:21] chasemp: that's for collectors, not handlers [00:04:36] ah I see what you are doing now [00:04:49] (03PS2) 10Ori.livneh: Deploy jobrunner to MW job runners via git-deploy [operations/puppet] - 10https://gerrit.wikimedia.org/r/144612 [00:05:42] chasemp: think I'll have to add an ensure => absent case in diamond/init.pp [00:09:37] (03PS6) 10Yuvipanda: [WIP]diamond: Disable on all projects except tools, beta & graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615 [00:09:49] * ori is a chaotic neutral half-orc diamond collector [00:10:33] * YuviPanda attacks ori with a storage-schema [00:10:39] * ori aggregates YuviPanda [00:10:50] (03CR) 10jenkins-bot: [V: 04-1] [WIP]diamond: Disable on all projects except tools, beta & graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615 (owner: 10Yuvipanda) [00:10:57] uh [00:11:17] YuviPanda: I'm not sure this is the way to go here, but I'm also out of time I think today. can we circle back in the morning? [00:11:30] chasemp: yeah, makes sense. It's also 6AM here. [00:11:44] my morning then not yours :D [00:11:49] chasemp: :) of course :D [00:12:03] it's all relative. 
I think my body thinks this is about midnight [00:12:28] (03PS7) 10Yuvipanda: [WIP]diamond: Disable on all projects except tools, beta & graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615 [00:14:40] any objections to +2ing a labs-only MW settings fix? https://gerrit.wikimedia.org/r/#/c/144339/ [00:15:20] greg-g, ^ [00:15:42] none here [00:15:54] +2 & git pull, not sync [00:16:01] (03CR) 10Yurik: [C: 032] LABS: Set domain override for zero banners [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144339 (owner: 10Yurik) [00:16:22] (03Merged) 10jenkins-bot: LABS: Set domain override for zero banners [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144339 (owner: 10Yurik) [00:22:21] (03PS8) 10Yuvipanda: [WIP]diamond: Disable on all projects except tools, beta & graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615 [00:22:52] (03CR) 10Yuvipanda: [C: 04-1] "Note that this doesn't actually *work*. the enabled = false has no effect." [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615 (owner: 10Yuvipanda) [00:23:02] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:23:22] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:23:22] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:23:28] poor tungsten [00:23:52] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical [00:24:00] I guess tungsten is facing the same problems diamond-collector is [00:24:12] RECOVERY - DPKG on tungsten is OK: All packages OK [00:24:12] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are running. [00:26:15] (03PS1) 10Ori.livneh: Remove beta::hhvm [operations/puppet] - 10https://gerrit.wikimedia.org/r/144624 [00:26:22] ^ bd808 [00:27:04] ori: Sweet. 
I'll cherry-pick it into beta [00:27:12] cool, thanks [00:30:24] (03CR) 10BryanDavis: [C: 031] "Cherry-picked on deployment-salt and applied on deployment-apache0[12]." [operations/puppet] - 10https://gerrit.wikimedia.org/r/144624 (owner: 10Ori.livneh) [00:31:12] (03PS3) 10Mwalker: New Variables for OCG Service [operations/puppet] - 10https://gerrit.wikimedia.org/r/144610 [00:36:52] (03PS3) 10BryanDavis: Manage /usr/local/apache from ::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 [00:38:12] (03CR) 10jenkins-bot: [V: 04-1] Manage /usr/local/apache from ::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 (owner: 10BryanDavis) [00:41:01] Deskana|Away: ^d: (in absence of manybubbles): Do you know whether this is a regression in Search results? http://i.imgur.com/1UrQBfy.png / http://i.imgur.com/14RtkAO.png [00:41:07] E.g. the search excerpt being fragmented template parameters [00:41:14] Not exactly useful [00:41:53] <^d> That's not right. [00:41:53] I know the old search engine used to do it, not sure if the new one is supposed to [00:42:00] (03PS4) 10Mwalker: New Variables for OCG Service [operations/puppet] - 10https://gerrit.wikimedia.org/r/144610 [00:42:03] <^d> Either a regression or a page is stale. [00:42:18] <^d> Either way, file a bug so I don't forget to poke it. [00:42:30] ^d: in core /search? [00:42:55] <^d> Actually, which wiki is this? [00:43:00] enwiki [00:43:07] <^d> Do you have beta turned on? 
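Circling back to the Diamond discussion earlier in the log: the `enabled` flag documented in the upstream wiki applies only to collector configuration, and per YuviPanda's reading of the source there is no equivalent code path for handlers, which is why the fix ends up being `ensure => absent` on the handler config file rather than a setting. A rough sketch of the distinction (file paths and section names are illustrative, not the actual WMF layout):

```ini
# Per-collector config: Diamond honours this flag.
# e.g. /etc/diamond/collectors/CPUCollector.conf
enabled = False

# Handler sections (in diamond.conf) have no such flag; a handler is
# disabled by removing it from the server's handler list -- or, via
# puppet, by ensuring its config file is absent:
# [server]
# handlers = diamond.handler.graphite.GraphiteHandler
```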
[00:44:02] Nope, the mobile one is anon, the desktop one is my plain enwiki account [00:44:24] https://en.wikipedia.org/wiki/Special:Search/Thielt?srbackend=CirrusSearch [00:44:25] https://en.wikipedia.org/wiki/Special:Search/Thielt?srbackend=default [00:44:34] Looks like cirrus doesn't have it [00:44:41] but it also doesn't have useful results in the first place it seems [00:45:02] it suggests "Thiel" instead of "Tielt" in the "did you mean", and the actual pages don't even mention Tielt [00:45:06] (which has a redirect to it) [00:45:06] (03PS4) 10BryanDavis: Manage /usr/local/apache from ::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 [00:45:15] <^d> So the bogus wikitext is lsearchd, wontfix. [00:45:22] <^d> But missing the obvious result is bad. [00:45:25] <^d> Let's file a bug for that. [00:46:55] It picked it up when I re-created the redirect [00:47:11] or rather, it didn't exist before. The old search engine just understood it better without needing a redirect [00:47:16] Or is it a lucky guess? [00:47:16] https://en.wikipedia.org/w/index.php?title=Special:Search&search=Thielt&fulltext=1&srbackend=CirrusSearch [00:47:17] https://en.wikipedia.org/w/index.php?title=Special:Search&search=Thielt&fulltext=1&srbackend=default [00:47:20] <^d> bleh. [00:48:04] <^d> It's getting lucky. [00:52:03] <^d> Krinkle: bug 67645 filed. 
[00:52:37] (03PS1) 10BryanDavis: Check for puppet-lint in path rather than hardcoded location [operations/puppet] - 10https://gerrit.wikimedia.org/r/144626 [00:57:20] ^demon|food: Thx [00:57:28] ^demon|food: updated bug with de.wikipedia example where it still happens [00:57:36] (since I created the redirect on enwiki) [00:57:53] !log springle Synchronized wmf-config/db-eqiad.php: depool db1061 for upgrade (duration: 00m 07s) [00:57:57] Logged the message, Master [01:05:29] (03PS4) 10BryanDavis: beta: New script to restart apaches [operations/puppet] - 10https://gerrit.wikimedia.org/r/125888 (https://bugzilla.wikimedia.org/36422) [01:07:20] (03PS5) 10BryanDavis: beta: Add mediawiki/core/vendor to beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/137463 [01:07:39] (03CR) 10BryanDavis: [C: 031] beta: Add mediawiki/core/vendor to beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/137463 (owner: 10BryanDavis) [01:09:16] (03PS8) 10BryanDavis: Send Vary header on http to https redirect [operations/apache-config] - 10https://gerrit.wikimedia.org/r/111925 [01:14:26] (03PS4) 10Ori.livneh: Improvements to asset-check [operations/puppet] - 10https://gerrit.wikimedia.org/r/137258 (owner: 10Krinkle) [01:14:45] (03PS5) 10Ori.livneh: Improvements to asset-check [operations/puppet] - 10https://gerrit.wikimedia.org/r/137258 (owner: 10Krinkle) [01:16:11] Krinkle: apologies for squashing your commits, but I'd like to merge this, and my entitlement to do so here is a special case, and I didn't want to have to justify it repeatedly for 7 commits. 
[01:16:58] (03CR) 10Ori.livneh: [C: 032] Improvements to asset-check [operations/puppet] - 10https://gerrit.wikimedia.org/r/137258 (owner: 10Krinkle) [01:17:09] (03Abandoned) 10Krinkle: asset-check: Implement --debug [operations/puppet] - 10https://gerrit.wikimedia.org/r/137241 (owner: 10Krinkle) [01:17:16] (03Abandoned) 10Krinkle: asset-check: Use "response.stage" property to filter out duplicates [operations/puppet] - 10https://gerrit.wikimedia.org/r/137242 (owner: 10Krinkle) [01:17:22] (03Abandoned) 10Krinkle: asset-check: Use content-length header when response.bodySize is missing [operations/puppet] - 10https://gerrit.wikimedia.org/r/137252 (owner: 10Krinkle) [01:17:28] (03Abandoned) 10Krinkle: asset-check: Track POST requests, redirects, http4xx, and http5xx [operations/puppet] - 10https://gerrit.wikimedia.org/r/137248 (owner: 10Krinkle) [01:17:32] !log springle Synchronized wmf-config/db-eqiad.php: repool db1061, warm up (duration: 00m 06s) [01:17:33] Krinkle: thanks. i know it's annoying; it was proper to split it up into multiple commits. 
[01:17:33] (03Abandoned) 10Krinkle: asset-check: Track whether requests are compressed with gzip [operations/puppet] - 10https://gerrit.wikimedia.org/r/137253 (owner: 10Krinkle) [01:17:39] (03Abandoned) 10Krinkle: asset-check: Track uncaught exceptions in javascript [operations/puppet] - 10https://gerrit.wikimedia.org/r/137257 (owner: 10Krinkle) [01:17:40] Logged the message, Master [01:18:10] ori: OK :) [01:18:52] nn [01:22:03] Krinkle|detached: deployed on hafnium [02:30:03] !log LocalisationUpdate completed (1.24wmf11) at 2014-07-08 02:29:00+00:00 [02:30:10] Logged the message, Master [03:00:29] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Tue 08 Jul 2014 01:00:00 UTC [03:00:38] !log LocalisationUpdate completed (1.24wmf12) at 2014-07-08 02:59:33+00:00 [03:00:41] Logged the message, Master [03:26:58] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 8 03:25:51 UTC 2014 (duration 25m 50s) [03:27:02] Logged the message, Master [03:33:04] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Tue 08 Jul 2014 01:32:30 UTC [03:40:34] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Tue Jul 8 03:40:29 UTC 2014 [03:55:24] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:57:15] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are running. 
[04:04:11] PROBLEM - MySQL Replication Heartbeat on db74 is CRITICAL: CRIT replication delay 320 seconds [04:13:11] RECOVERY - MySQL Replication Heartbeat on db74 is OK: OK replication delay -1 seconds [04:13:21] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Tue Jul 8 04:13:20 UTC 2014 [04:15:54] (03PS7) 10Ori.livneh: role::mediawiki::webserver: set maxclients dynamically [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 [04:41:14] PROBLEM - mysqld processes on db74 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld [04:41:44] PROBLEM - mysqld processes on db73 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld [04:42:10] uh oh.. they're breeding [04:51:39] !log springle Synchronized wmf-config/db-eqiad.php: depool db1010 for upgrade (duration: 00m 06s) [04:51:45] Logged the message, Master [05:10:16] !log springle Synchronized wmf-config/db-eqiad.php: repool db1010, warm up (duration: 00m 06s) [05:10:21] Logged the message, Master [05:28:50] ahem, is gerrit down? 
[05:30:00] hmm, seems git review finally went through after 3 tries [06:13:51] (03CR) 10ArielGlenn: [C: 032] dumps deployment setup: replace with salt-master based system [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/144457 (owner: 10ArielGlenn) [06:16:57] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:10] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:10] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:21] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:21] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:30] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:30] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:40] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:40] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:50] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:00] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:00] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:00] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:00] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:01] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:01] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:10] PROBLEM - puppet last run on mw1069 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:10] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:11] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet 
has 2 failures [06:29:20] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:30] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 3 failures [06:29:30] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:40] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 3 failures [06:29:40] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:41] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:52] oh joy [06:30:00] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:18] (03PS1) 10ArielGlenn: add index.html pages for various directories on dataset hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/144640 [06:30:31] well let's see what's borken [06:31:50] (03CR) 10jenkins-bot: [V: 04-1] add index.html pages for various directories on dataset hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/144640 (owner: 10ArielGlenn) [06:36:00] RECOVERY - puppet last run on cp1044 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:40:15] <_joe_> [06:45:23] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:45:23] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:45:23] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:45:32] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:45:42] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:45:42] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:46:02] 
RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:02] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:02] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:46:02] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:46:02] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:03] RECOVERY - puppet last run on mw1069 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [06:46:12] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:12] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:46:12] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:13] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:46:23] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:46:51] note I did nothing [06:46:52] RECOVERY - puppet last run on cp4003 is OK: OK: 
Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:47:02] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:47:22] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:47:38] <_joe_> apergos: did you check one of the backends? [06:47:53] I was looking on palladium [06:47:59] <_joe_> apergos: I think this happens if the check is done while puppet is running [06:48:03] but again I didn't change anything [06:48:50] Jul 8 06:25:24 mw1025 puppet-agent[6982]: (/Stage[main]/Base::Puppet/File[/etc/logrotate.d/puppet]) Could not evaluate: end of file reached Could not retrieve file metadata for puppet:///modules/base/logrotate/puppet: end of file reached [06:48:57] I saw this on a couple of the failed servers [06:49:13] which might indicate the puppet master disconnecting in the middle of the run [06:49:31] <_joe_> Error: /Stage[main]/Mediawiki::Users/File[/home/l10nupdate/.ssh/authorized_keys]: Could not evaluate: end of file reached Could not retrieve file metadata for puppet:///modules/mediawiki/authorized_keys.l10nupdate: end of file reached [06:49:34] <_joe_> mmmh [06:49:49] <_joe_> this is very interesting, seems we rotate logs in a very very stupid way [06:50:05] <_joe_> ok, going to the market to buy some food, bbiab [06:50:13] okey dokey [06:50:42] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:52:43] http://ganglia.wikimedia.org/latest/graph.php?c=Miscellaneous%20eqiad&h=palladium.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1404802285&v=5.50&m=load_one&vl=%20&ti=One%20Minute%20Load%20Average load spike last hour on palladium [06:52:50] anyways, seems to be ok now [06:53:13] (03PS1) 10Springle: raise db traffic samplers to normal load [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144643 [06:56:37] (03CR) 
10Springle: [C: 032] raise db traffic samplers to normal load [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144643 (owner: 10Springle) [06:56:43] (03Merged) 10jenkins-bot: raise db traffic samplers to normal load [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144643 (owner: 10Springle) [06:57:32] !log springle Synchronized wmf-config/db-eqiad.php: raise db traffic samplers to normal load (duration: 00m 06s) [06:57:36] Logged the message, Master [07:02:01] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [07:12:08] <_joe_> apergos: we should at least open a ticket for this [07:19:58] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [07:54:44] (03CR) 10Hashar: [C: 031] beta: Add mediawiki/core/vendor to beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/137463 (owner: 10BryanDavis) [07:56:20] (03CR) 10Hashar: [C: 031] "The hdf and upstart files will have to be cleaned out manually I guess." [operations/puppet] - 10https://gerrit.wikimedia.org/r/144624 (owner: 10Ori.livneh) [08:12:32] good morning ops! [08:15:17] hey hashar [08:15:24] morning [08:18:15] hi apergos! saw the index.html review flying by, thanks! [08:20:05] hashar: i don't know if you saw, but i pushed all the firewall stuff for ci [08:20:25] yeah I'll be cleaning all that up these days [08:20:33] can't believe that stuff wasn't in there [08:20:38] in puppet I mean [08:22:38] matanya: I have seen it. 
I need a few more pending puppet patches to land in before having a look at the CI firewall overhaul [08:22:52] ok, thanks [08:40:00] (03CR) 10Filippo Giunchedi: Packaging of php-mailparse from the pecl (032 comments) [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [08:40:08] (03CR) 10Filippo Giunchedi: [C: 04-1] Packaging of php-mailparse from the pecl [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [08:40:48] (03CR) 10Filippo Giunchedi: "still some work to do, mainly around the exact distribution to target" [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [08:59:01] * YuviPanda waves at godog [08:59:05] around? [09:01:08] hey YuviPanda, sure [09:01:29] godog: so, heard from chasemp you were looking at our graphite problems in production [09:01:36] mostly with graphite not being able to keep up [09:01:53] so I've been looking at the same problem in labs, with quite a bit lesser metrics. [09:02:05] he said you were looking at cassandra backed graphite... and other options. [09:03:06] YuviPanda: yep that's true, tungsten is fine wrt disk io, but won't be for much longer I think, maybe 6-9 months [09:03:36] godog: hmm, right. diamond-collector (the labs collector) is pretty much out already, with 40-50% iowait and lots of dropped metrics [09:03:42] unusably large number of dropped metrics, even [09:04:16] yep I can imagine it won't be getting much disk io on a vm shared with everybody else on a spinning disk [09:04:21] yeah [09:04:39] godog: how's the cassandra thing going? [09:05:02] YuviPanda: I'm playing with it on db1017, trying out this https://github.com/pyr/cyanite [09:06:39] YuviPanda: not sure yet how/if it works though, seen some github issue re: performance and so on [09:06:44] godog: right. 
[09:07:12] godog: my labs solution, at least for now, is to cut the number of metrics received by about 70%, by not collecting stats for anything outside of toollabs/betalabs [09:07:36] I guess clustering graphite will also run out of IO soon on the VMs [09:09:39] godog: so I'll await the results of your experiments. Feel free to run them on labs as well, if you want :) [09:09:57] YuviPanda: yep, afaict disks on those is software raid10 with 8 disks or sth like that [09:10:10] godog: in the VMs? [09:10:21] I wouldn't be surprised. Not meant for performance, I'd think [09:13:49] yep, well tungsten is beefy on disk i/o, I think it can do ~5k w/s [09:14:37] heh, I don't think the VM can do anything of that sort [09:14:53] I've been trying to tune graphite's cache queueing to let it queue more things and write them at once but doesn't seem to be helping much [09:17:28] akosiaris: that delete on db1001.. think we could make it use partitioning in the future? [09:19:12] springle: it is a python script I am working on to rotate logs on librenms. We can do anything we want with it [09:19:23] for now it does query = """DELETE FROM syslog WHERE timestamp <= DATE(NOW() - INTERVAL %s);""" % (interval) [09:19:36] YuviPanda: unfortunately not, one metric -> one file kills the disk and that's the end of it :( [09:19:42] (03PS10) 10Giuseppe Lavagetto: zuul: split conf file for server and merger [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 (owner: 10Hashar) [09:20:06] godog: yeah, I guess at 70k metrics it probably has trouble even keeping all the file handles open at the same time [09:21:10] springle: it is the first run now, it will probably be much faster in the future, on account of not having to purge 60+GBytes of data [09:21:28] akosiaris: ah ok :) [09:23:21] YuviPanda: that might be fine, but you can't really get much more than ~150-200 iops on a spinning disk, even with sw raid10 unlikely it could do more than ~1000-1200 perhaps [09:24:12] godog: right. 
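The librenms log-purge one-liner quoted above builds its SQL with `%` string interpolation, which works but invites injection or malformed statements if `interval` ever comes from config or user input. Below is a minimal sketch of the same DELETE with the value bound as a driver parameter instead; the function name, the `DAY` unit, and the validation are assumptions for illustration (the log does not say which unit the script used), while the `syslog` table is from the log itself.

```python
def purge_syslog_statement(days):
    """Build a parameterized DELETE for syslog rows older than `days` days.

    The original snippet interpolated the whole interval into the SQL text;
    binding it as a parameter lets the driver handle quoting. Assumes a
    day-denominated retention period.
    """
    days = int(days)  # fail fast on non-numeric input
    if days <= 0:
        raise ValueError("retention must be a positive number of days")
    # MySQL accepts a bound parameter inside INTERVAL ... DAY
    sql = "DELETE FROM syslog WHERE timestamp <= DATE(NOW() - INTERVAL %s DAY)"
    return sql, (days,)

# With any DB-API driver this would run as:
#   cursor.execute(*purge_syslog_statement(30))
```

springle's partitioning suggestion is the cheaper long-term fix: with `syslog` partitioned by day or week, the purge becomes an `ALTER TABLE ... DROP PARTITION`, which drops files instead of scanning and deleting 60+ GB of rows.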
[09:24:18] godog: have you thought of looking into influxdb? [09:24:58] (03CR) 10Giuseppe Lavagetto: [C: 032] zuul: split conf file for server and merger [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 (owner: 10Hashar) [09:26:03] YuviPanda: only a quick glance, it looks promising! did you? [09:27:11] (03PS6) 10Hashar: zuul: migrate statsd_host to zuul::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/141657 [09:34:45] godog: no, am thinking of doing that. [09:35:01] godog: however, I was also wondering if we could just get a physical box for this, put that in labsnet. [09:36:34] YuviPanda: yep if it can keep up with the writes why not [09:37:06] springle: some way I can find when a database on db1001 was last accessed ? [09:37:12] you had some look IIRC ? [09:37:17] tool* [09:37:18] godog: yeah, a physical disk probably won't have problems at all. [09:37:28] (03CR) 10Giuseppe Lavagetto: [C: 032] zuul: migrate statsd_host to zuul::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/141657 (owner: 10Hashar) [09:37:50] godog: hmm, but I guess new hardware being assigned would be a bit of b'cratic process where I'd need to poke around to find an idle server that can be used for this... [09:39:06] (03PS1) 10Alexandros Kosiaris: backup db1043 [operations/puppet] - 10https://gerrit.wikimedia.org/r/144653 [09:39:12] (03PS7) 10Hashar: zuul: patch of doom (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141663 [09:39:23] patch of doom ? [09:39:24] hahaha [09:39:29] yeah wip :D [09:39:39] I am not even sure what I am doing in this patch hehe [09:40:32] <_joe_> akosiaris: "last accessed" means what? [09:40:44] <_joe_> last connection, last update... what? [09:41:02] springle: also kindly requesting a quick review on this https://gerrit.wikimedia.org/r/#/c/144653/ . Also I think I will need the user for the dump created on db1043. Should I do it myself ? 
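On the graphite side, the cache-queue tuning YuviPanda describes above ("let it queue more things and write them at once") maps to a few knobs in carbon.conf's `[cache]` section. The setting names below exist in stock carbon; the values are purely illustrative, not what tungsten or the labs collector actually ran:

```ini
[cache]
MAX_CACHE_SIZE = 10000000       # datapoints held in RAM before carbon starts dropping
MAX_UPDATES_PER_SECOND = 500    # throttle whisper writes; queued points for the same
                                # metric coalesce into one larger write per file
MAX_CREATES_PER_MINUTE = 50     # rate-limit creation of new .wsp files (seek-heavy)
```

Lowering MAX_UPDATES_PER_SECOND trades memory for fewer, fatter writes, but as the discussion notes it only postpones the problem: with one whisper file per metric, a spinning-disk VM at ~150-200 IOPS cannot sustain the write load, which is why a dedicated box or a different backend (cyanite, influxdb) comes up.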
[09:41:06] _joe_: connection [09:41:27] <_joe_> akosiaris: AFAIK, if you don't do some sort of stored procedure, you have no way to know that [09:42:25] <_joe_> or using percona server, but maybe mariadb has some new fancy way to do that [09:43:14] _joe_: yeah, which is why I deferred to springle, cause he mentioned once a tool he had for logging connections or something like that. My memory fails me though [09:44:21] <_joe_> akosiaris: oh yuou mean ngrep? [09:44:22] <_joe_> :P [09:44:34] ahahah [09:48:22] (03CR) 10Alexandros Kosiaris: [C: 032] bacula-dir reads a copy of puppet's private key [operations/puppet] - 10https://gerrit.wikimedia.org/r/143901 (owner: 10Alexandros Kosiaris) [09:50:59] (03PS8) 10Hashar: zuul: further split merger and server [operations/puppet] - 10https://gerrit.wikimedia.org/r/141663 [09:51:13] akosiaris: mariadb has an audit plugin, but it won't help if you're asking about the past since it isn't installed :) [09:51:41] _joe_: see ? I knew he had the answer :-) [09:51:55] springle: no, not past. I wanna delete a database and want to be sure nothing is using it [09:52:06] namely observium db [09:52:13] some 100Gbytes of data [09:52:25] <_joe_> akosiaris: then do it the way I do that usually, if you're 99% sure noone uses it [09:52:28] <_joe_> revoke grants [09:52:30] using it right now, or ever? [09:52:33] <_joe_> see if someone complains [09:52:34] right now [09:52:36] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Puppet has 1 failures [09:52:37] <_joe_> profit :) [09:52:51] _joe_: yeah, I can do that too [09:53:05] I am pretty certain nothing is using it [09:53:09] <_joe_> akosiaris: that's the BOFH way [09:53:23] <_joe_> you can't be wrong if you follow the bofh way :P [09:53:35] shutting down / removing thing is the best way to figure out it is actually needed. 
[09:53:37] akosiaris: SHOW OPEN TABLES [09:53:40] <_joe_> (joking, and reading about the mariadb audit plugin) [09:54:00] https://mariadb.com/kb/en/mariadb/mariadb-documentation/sql-commands/administration-commands/show/show-open-tables/ [09:54:22] akosiaris: actually, no [09:54:29] those will stay open in the cache [09:55:30] akosiaris: select * from information_schema.processlist where db = 'observium'; [09:55:48] (03CR) 10Hashar: contint: install Zuul on all CI slaves (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141758 (owner: 10Hashar) [09:57:41] hmm here's an alternative of show processlist I haven't used yet [09:58:19] springle: ok thanks. I think it should be ok to delete that DB [09:58:25] yep [09:58:31] that audit plugin seems interesting though [09:58:34] (and yay \o/) [09:58:50] right, the audit plugin is on my list to play with [09:59:00] we're already using percona userstat's [09:59:02] (03PS9) 10Giuseppe Lavagetto: zuul: further split merger and server [operations/puppet] - 10https://gerrit.wikimedia.org/r/141663 (owner: 10Hashar) [09:59:10] together they pretty much cover all the bases [09:59:56] <_joe_> oh we are using userstat? [10:00:18] and sucking it alll into tendril [10:00:25] so i can watch how accounts are used [10:01:00] _joe_: see %log tables on db1011 sometime [10:01:11] most are not exposed to the web ui yet [10:01:13] doesn't show user statistics also answer my question but about the past as well ? [10:02:45] well assuming that I know the users that can connect to the observium db [10:02:53] which I do [10:02:53] erm.. dont think user stats tracks db [10:03:06] yeah, making an assumption here [10:04:20] (03PS10) 10Hashar: zuul: further split merger and server [operations/puppet] - 10https://gerrit.wikimedia.org/r/141663 [10:04:45] (03CR) 10Hashar: "Moved '/etc/zuul/gearman-logging.conf' definition from zuul class to zuul::server which has the embedded gearman server." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/141663 (owner: 10Hashar) [10:08:26] (03PS1) 10Alexandros Kosiaris: Fix typo in bacula module [operations/puppet] - 10https://gerrit.wikimedia.org/r/144660 [10:08:53] (03CR) 10Springle: [C: 04-1] "During a past discussion with Chase I mentioned simply replicating m3 to dbstore and allowing the existing backup jobs to handle it along " (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144653 (owner: 10Alexandros Kosiaris) [10:09:15] (03CR) 10Hashar: [C: 031] "Tried on labs, that is a noop :-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141663 (owner: 10Hashar) [10:10:27] (03CR) 10Hashar: [C: 04-1] "This patch has another issue. The ::zuul common class depends on having a jenkins user which is provided by the jenkins package. I guess" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141758 (owner: 10Hashar) [10:16:32] (03CR) 10Alexandros Kosiaris: "That would work too. And quite well obviously. The only pros I can think of are the slightly easier restore (not that important) and the m" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144653 (owner: 10Alexandros Kosiaris) [10:16:55] (03CR) 10Alexandros Kosiaris: [C: 032] Fix typo in bacula module [operations/puppet] - 10https://gerrit.wikimedia.org/r/144660 (owner: 10Alexandros Kosiaris) [10:18:33] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:23:18] (03CR) 10Springle: "Good points. 
Just for the record: We always have a minimum of 7 days binlogs, plus the existing m3 slave is doing lvm snapshots every 8 ho" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144653 (owner: 10Alexandros Kosiaris) [10:29:58] (03PS11) 10Giuseppe Lavagetto: zuul: further split merger and server [operations/puppet] - 10https://gerrit.wikimedia.org/r/141663 (owner: 10Hashar) [10:30:06] (03PS1) 10Alexandros Kosiaris: Bacula client stanzas also use the synced key [operations/puppet] - 10https://gerrit.wikimedia.org/r/144669 [10:30:50] (03CR) 10Giuseppe Lavagetto: [C: 032] zuul: further split merger and server [operations/puppet] - 10https://gerrit.wikimedia.org/r/141663 (owner: 10Hashar) [10:31:27] (03Abandoned) 10Alexandros Kosiaris: backup db1043 [operations/puppet] - 10https://gerrit.wikimedia.org/r/144653 (owner: 10Alexandros Kosiaris) [10:32:29] akosiaris: :) [10:32:40] will get the replication setup [10:33:00] cool [10:34:24] (03CR) 10Alexandros Kosiaris: [C: 032] Bacula client stanzas also use the synced key [operations/puppet] - 10https://gerrit.wikimedia.org/r/144669 (owner: 10Alexandros Kosiaris) [10:35:08] _joe_: merging the the zuul change on palladium as well as mine [10:35:34] <_joe_> akosiaris: oh shit, thanks [10:35:41] <_joe_> got distracted by my own patchset [10:35:43] <_joe_> :/ [10:35:48] ahaha [10:39:05] sigh, tungsten is not well at all https://graphite.wikimedia.org/render/?title=Memory&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=&target=alias%28servers.tungsten.memory.Buffers.value,%22buffers%22%29&target=alias%28servers.tungsten.memory.Active.value,%22active%22%29&target=alias%28servers.tungsten.memory.Cached.value,%22cached%22%29&target=alias%28servers.tungsten.memory.Inactive.value,%22inactive%22%29&target=a [10:39:25] I think it is mwprof going nuts [10:39:50] 40G spikes ? [10:39:58] for the love of ... 
[10:40:38] seems like they last for at least an hour at a time [10:40:47] so it should be easy to verify that godog [10:41:43] indeed akosiaris ! I'll keep an eye on it [10:50:08] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures [10:50:28] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Puppet has 1 failures [10:50:28] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet has 1 failures [10:50:38] PROBLEM - puppet last run on amssq46 is CRITICAL: CRITICAL: Puppet has 1 failures [10:50:48] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet has 1 failures [10:56:02] ok, so I managed to reproduce that [10:56:21] kill -HUP `cat /var/run/apache2.pid` on both palladium and strontium at the same time [10:56:34] which happens on ? logrotate every day :-( [10:56:53] anyway off to lunch, will figure later how to deal with it [10:59:43] (03PS5) 10Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 [10:59:45] (03PS2) 10Giuseppe Lavagetto: appservers: mediawiki config in puppet, debianized, 2.4-compatible [WIP] [operations/puppet] - 10https://gerrit.wikimedia.org/r/144453 [10:59:47] (03PS1) 10Giuseppe Lavagetto: mediawiki: add apache-config as a submodule, upload it via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/144672 [11:01:36] (03CR) 10jenkins-bot: [V: 04-1] appservers: mediawiki config in puppet, debianized, 2.4-compatible [WIP] [operations/puppet] - 10https://gerrit.wikimedia.org/r/144453 (owner: 10Giuseppe Lavagetto) [11:04:51] <_joe_> and of course splitting commits made git forget about the submodule [11:04:54] <_joe_> shit [11:04:59] <_joe_> I hate submodules [11:07:04] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:07:54] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0
failures [11:08:23] (03PS6) 10Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 [11:08:25] (03PS3) 10Giuseppe Lavagetto: appservers: mediawiki config in puppet, debianized, 2.4-compatible [WIP] [operations/puppet] - 10https://gerrit.wikimedia.org/r/144453 [11:08:27] (03PS2) 10Giuseppe Lavagetto: mediawiki: add apache-config as a submodule, upload it via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/144672 [11:08:35] RECOVERY - puppet last run on amssq46 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [11:08:44] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [11:08:44] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:10:19] (03CR) 10jenkins-bot: [V: 04-1] appservers: mediawiki config in puppet, debianized, 2.4-compatible [WIP] [operations/puppet] - 10https://gerrit.wikimedia.org/r/144453 (owner: 10Giuseppe Lavagetto) [11:12:19] (03CR) 10Giuseppe Lavagetto: [C: 032] "This just adds the files in a dedicated directory. it's completely a noop AFAICT for the rest." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/144672 (owner: 10Giuseppe Lavagetto) [11:12:49] <_joe_> ok, will try a tagged run of puppet to refresh apache-config [11:12:59] <_joe_> (which is still not used anyways) [11:13:20] <_joe_> so, if palladium/strontium cry, that's me [11:19:45] (03PS1) 10Giuseppe Lavagetto: mediawiki: do not notify apache for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/144673 [11:20:07] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: do not notify apache for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/144673 (owner: 10Giuseppe Lavagetto) [11:21:55] <_joe_> load average: 18.96 poor palladium [11:22:20] <_joe_> it's completely cpu-bound [11:28:14] (03PS1) 10Springle: monitor m3 replication on dbstore boxes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144674 [11:31:09] (03CR) 10Springle: [C: 032] monitor m3 replication on dbstore boxes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144674 (owner: 10Springle) [11:31:23] <_joe_> bbl, lunch and errands [11:34:29] akosiaris: m3 is on dbstore. will watch the next bacula run tomorrow, but i expect that's all we have to do [11:53:12] apergos: varnish doesn't link with libssl [11:53:29] apergos: so it wouldn't need a restart on libssl upgrades [11:53:47] (unless I'm misunderstanding what you wrote on #7806) [12:03:37] oh the ssl terminators [12:03:47] * apergos is still asleep and it's noon... [12:04:55] noon UTC :P [12:06:49] yes, noon utc, my calendar has reset itself sometime this morning... kde calendar acting up again [12:08:38] ori: mwprof / report.py make tungsten very unhappy (using up all memory -> oom -> repeat) where would it be a good place to track these bugs?
[12:08:55] ah I've been looking at that because of the ticket [12:09:07] https://rt.wikimedia.org/Ticket/Display.html?id=7786 [12:10:02] I got as far as /srv/deployment/mwprof/mwprof/profiler-to-carbon and then I don't know if it's getting the data and failing to send, or failing to get the data [12:10:06] no log entries that I can find [12:11:15] logs would make things too easy, where's the fun in that? [12:11:39] so, one thing we should probably do is disable profiler.py [12:12:14] the one that apache spawns? [12:12:19] yeah [12:12:39] yeah afaict mwprof goes nuts when report.py comes around [12:12:40] mwprof's metric storage is volatile and it's exposed as xml over a raw tcp socket [12:13:19] right, I figured that much out [12:13:47] yeah maybe that's a good first step I'd say [12:13:48] it has two (internal) consumers, the profiler.py thing that provides a 'top' like list of hot mediawiki code [12:14:12] its primary purpose in life is to act as a piece of obscure engineering trivia [12:14:19] "did you know we have this profiler.py thing?" [12:14:44] the other one is graphite, which is actually important and useful [12:17:37] ah ok, is it only humans that consume report.py ori ? [12:17:53] only hu-mans! [12:18:09] nerds, really [12:18:35] basically me and aaron [12:18:47] so the source of the data for the mw php latency graphs is not profiler-to-carbon? [12:19:01] I'm just trying to trace it back and find out where the problem is [12:20:20] it is.
report.py is the cgi script, as godog pointed out [12:21:13] ok [12:22:41] ori: haha okay so nobody will cry, apergos happy to push out a code review to disable report.py, that should make things better [12:22:51] still mwprof -> 23gb :( [12:23:08] the memory storage of metrics is a hash table, and it keeps accumulating keys ad infinitum [12:23:25] because metric keys include the branch, which is ever incrementing [12:24:21] ok well that sounds like a good idea for sure [12:24:23] in the past this wasn't an issue in practice because the way report.py was used was to quickly diagnose hotspots in cases of overload, so it had a mechanism for manually clearing the table [12:24:55] s/had/has; there's a 'purge' param iirc [12:25:41] now that it isn't used so often, the metrics just pile up [12:26:33] i'll file an RT for some infinite RAM [12:27:07] * YuviPanda gives ori an ideal turing machine [12:27:08] haha RobH will be thrilled at procuring infinite RAM [12:28:38] so camping on 3811 and listening for tcp packets gives a whole lot of nothing... mwprof should be sending stuff there, yes? [12:28:49] i remember being surprised by the memory usage so i checked for memory leaks very carefully [12:28:55] I would like some infinite ram, could I have aleph-null of that please? [12:29:08] it's not leaking, it just has an unreasonable approach to metric retention [12:29:29] much as i am loathe to suggest it..... having a cron job restart it once every few days would be a viable short-term fix [12:29:48] (03PS1) 10Filippo Giunchedi: disable performance.wikimedia.org/profiler/report [operations/puppet] - 10https://gerrit.wikimedia.org/r/144685 [12:30:09] apergos: you connect to mwprof to get metrics [12:31:30] so people that ask for /profiler/report will now get a 404 I guess? [12:31:39] ( godog ) [12:32:00] s/people/Aaron Schulz/ [12:32:01] correct apergos [12:32:15] let me put him up too [12:32:17] ori: yes, is that not tcp 3811?
I should at least see the requests for the data [12:33:43] (03CR) 10ArielGlenn: [C: 031] disable performance.wikimedia.org/profiler/report [operations/puppet] - 10https://gerrit.wikimedia.org/r/144685 (owner: 10Filippo Giunchedi) [12:35:10] yes, you should [12:35:24] they are periodic (in the case of profiler-to-carbon) [12:35:47] all right, I'll be a bit more patient [12:36:42] oh, here's a theory about what might be happening [12:37:23] as the number of metrics grow, so does the time it takes to generate the xml, which is done in one go [12:37:43] !log graphite reduced metrics count from 65k to 25k, monitoring io performance [12:37:47] gah [12:37:49] Logged the message, Master [12:38:02] !log disregard previous log message, was meant for labs [12:38:06] Logged the message, Master [12:38:21] profiler-to-carbon connects, mwprof starts building up the xml, but it takes a while, so it times out [12:38:56] profiler-to-carbon connects again, another thread from the pool starts building up the xml, etc. [12:40:08] ah I see there is nothing sent from the profiler-to-carbon side, it just connects, reads data and goes away [12:40:09] (03CR) 10Ori.livneh: [C: 031] disable performance.wikimedia.org/profiler/report [operations/puppet] - 10https://gerrit.wikimedia.org/r/144685 (owner: 10Filippo Giunchedi) [12:40:18] so I would not in fact expect to see anything if mwprof is timing out [12:40:27] that's a good theory [12:40:54] what happened on June 20? [12:43:29] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] disable performance.wikimedia.org/profiler/report [operations/puppet] - 10https://gerrit.wikimedia.org/r/144685 (owner: 10Filippo Giunchedi) [12:44:43] re [12:48:18] so how can we test your theory, ori? 
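[Editor's note] ori's theory above — mwprof's metric hash table grows without bound because the keys embed the ever-incrementing deployment branch, and nothing evicts old branches — can be sketched in a few lines. This is a hypothetical illustration of the retention problem (mwprof itself is not Python, and these names are invented), not its actual code:

```python
from collections import defaultdict

# Hypothetical sketch: when the branch name is part of the metric key,
# every new wmf branch adds a whole new key namespace, and keys from old
# branches are never evicted, so the table grows forever.
class MetricTable:
    def __init__(self):
        self.counts = defaultdict(int)

    def record(self, branch, func, ms):
        # Branch is baked into the key -- this is the retention mistake.
        self.counts[(branch, func)] += ms

    def purge(self):
        # report.py's manual 'purge' param used to keep this in check.
        self.counts.clear()

table = MetricTable()
for wmf in range(9, 13):                     # four deployment branches
    for func in ("main", "parse", "save"):
        table.record("1.24wmf%d" % wmf, func, 1)

# Keys accumulate across branches instead of replacing each other:
assert len(table.counts) == 4 * 3
table.purge()
assert len(table.counts) == 0
```

With weekly branch cuts and thousands of profiled functions, this multiplies out to the multi-gigabyte growth observed on tungsten.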
[12:48:28] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [12:48:38] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures [12:48:48] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 1 failures [12:48:48] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures [12:48:48] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [12:48:55] argh no [12:48:58] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [12:48:58] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: Puppet has 1 failures [12:49:08] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 1 failures [12:49:08] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [12:49:18] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [12:49:28] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [12:49:38] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Puppet has 1 failures [12:49:38] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [12:49:39] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: Puppet has 1 failures [12:49:39] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures [12:49:39] strace + tcpdump? [12:49:39] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [12:49:58] Could not evaluate: end of file reached again... 
[12:49:58] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [12:50:38] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [12:50:40] changing mwprof to just write the XML to the socket as it's generating it rather than build up the whole thing in memory would probably fix this [12:52:25] tcpdump shows nada on that port [12:52:32] after waiting several minutes [12:53:04] check /proc//maps [12:53:46] just visually inspecting the contents of the memory arena should indicate whether it's mostly XML [12:54:23] the more i think about it the more convinced i am it has to be [12:57:37] I am those puppet failures btw [12:58:08] hmmm sounds like something I could stamp to a t-shirt [12:58:22] ah hah [12:58:41] care to give a one line summary? [12:59:12] of the failure cause, not the t-shirt :-P [12:59:33] ori: I am not getting much out of the maps stuff, a lot of anon inodes but can't tell anything from that [12:59:40] kill -HUP `cat /var/run/apache2.pid` happening due to logrotate every day at 6:25 UTC on both strontium and palladium [13:00:04] ahh of course it would [13:00:17] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:23] which is why you see the EOF and blah blah. [13:00:27] here we go again btw [13:00:34] yes that would be 'palladium goes away' indeed [13:00:57] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:37] the funny thing is that both must be reloaded before the error shows up. If either one of them happens it won't [13:01:56] :-D [13:01:58] which I am thinking I could exploit to solve the problem :-) [13:02:05] stagger?
:-D [13:04:37] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [13:05:27] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [13:05:37] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [13:05:37] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:05:37] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [13:05:48] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [13:05:48] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [13:05:57] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [13:05:57] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:06:17] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [13:06:27] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [13:06:47] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:06:57] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [13:06:57] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:07:07] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [13:07:37] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is 
currently enabled, last run 55 seconds ago with 0 failures [13:12:59] (03PS1) 10Hashar: zuul: migrate server logging.conf out of zuulwikimedia [operations/puppet] - 10https://gerrit.wikimedia.org/r/144688 [13:17:17] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [13:17:57] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [13:18:06] (03CR) 10Hashar: "Puppet catalog compiler seems happy http://puppet-compiler.wmflabs.org/121/change/144688/html/gallium.wikimedia.org.html" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144688 (owner: 10Hashar) [13:21:28] gi11es: Re SWAT this morning, are you going to prepare the patches to mediawiki/core for updating the extension submodules? [13:28:01] (03CR) 10Giuseppe Lavagetto: [C: 032] zuul: migrate server logging.conf out of zuulwikimedia [operations/puppet] - 10https://gerrit.wikimedia.org/r/144688 (owner: 10Hashar) [13:28:17] <_joe_> hashar: well done :) [13:28:21] hehe [13:30:50] !log replacing disk 4 ms-be1007 [13:30:54] Logged the message, Master [13:32:00] !log replacing disk disk 6 ms-be1005 [13:32:05] Logged the message, Master [13:32:51] arghhh [13:32:56] I need to split my roles as well [13:32:59] the never ending task [13:33:00] RECOVERY - RAID on ms-be1005 is OK: OK: optimal, 13 logical, 13 physical [13:33:14] icinga-wm: well done RAID recovery [13:33:29] <^demon|food> hashar::__clone() hasn't been implemented. [13:33:47] hey Chad [13:34:02] ^demon|food: sorry about your Zend phpunit session issue :-/ No clue what might be happening [13:34:12] <^demon|food> me either! [13:34:16] if it happens on vagrant, maybe it can be reproduced? [13:34:20] !log slow transaction rollback in progress on db1001 librenms. 
other databases not affected, but librenms writes are timing out [13:34:23] at worst you can get the vm uploaded somewhere [13:34:24] Logged the message, Master [13:38:05] _joe_: now I remember why we have zuulwikimedia. It is to avoid copy pasting in the labs and production role classes :-D [13:38:56] ori: if you're still here do you want to add your theory to the ticket https://rt.wikimedia.org/Ticket/Display.html?id=7786 and proposed solution? if you're not here just ... raise your hand :-P [13:42:10] (03PS1) 10Alexandros Kosiaris: Increase bacula poolsize to 50 [operations/puppet] - 10https://gerrit.wikimedia.org/r/144691 [13:42:45] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Increase bacula poolsize to 50 [operations/puppet] - 10https://gerrit.wikimedia.org/r/144691 (owner: 10Alexandros Kosiaris) [13:44:42] PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Puppet has 1 failures [13:45:22] PROBLEM - puppet last run on mw1089 is CRITICAL: CRITICAL: Puppet has 1 failures [13:46:02] sigh... [13:46:50] awww [13:49:12] is ottomata out this week? [13:50:13] also, can someone review this and merge it if it is okay? https://gerrit.wikimedia.org/r/#/c/142543/ [13:51:43] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [13:53:04] <^demon|food> dogeydogey: He's just out today afaik.
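[Editor's note] The fix ori floated at 12:50:40 — have mwprof write the XML to the socket as it is generated instead of building the whole document in memory — looks roughly like this generator-based sketch. Names and the XML shape are illustrative assumptions, not mwprof's real wire format:

```python
import io

# Sketch of the streaming suggestion: emit the metrics XML chunk by
# chunk so peak memory is one chunk, not the full document. A slow or
# timing-out consumer then no longer leaves a fully-built XML blob
# sitting in memory per connection.
def xml_chunks(metrics):
    yield b"<metrics>"
    for name, value in metrics.items():
        yield ('<metric name="%s">%d</metric>' % (name, value)).encode()
    yield b"</metrics>"

def send_metrics(sock, metrics):
    for chunk in xml_chunks(metrics):
        sock.write(chunk)   # on a real socket: sock.sendall(chunk)

buf = io.BytesIO()
send_metrics(buf, {"main": 12, "parse": 7})
assert buf.getvalue() == (
    b'<metrics><metric name="main">12</metric>'
    b'<metric name="parse">7</metric></metrics>'
)
```

The same output is produced either way; only the peak memory profile changes, which is what matters once the table holds hundreds of thousands of keys.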
[13:53:19] thanks [13:56:39] (03PS1) 10Hashar: zuul: install zuul from role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144692 [13:56:41] (03PS1) 10Hashar: zuul: move Icinga checks to zuul::monitoring::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/144693 [13:56:43] (03PS1) 10Hashar: zuul: monitor Zuul merger via nrpe [operations/puppet] - 10https://gerrit.wikimedia.org/r/144694 [13:57:09] (03PS1) 10Alexandros Kosiaris: Puppetmaster's logrotate made graceful [operations/puppet] - 10https://gerrit.wikimedia.org/r/144695 [13:57:11] (03CR) 10coren: [C: 031] "Okay lint in principle, but there's a couple of instances of trailing whitespace added." (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142542 (owner: 10Scottlee) [14:02:12] anomie: I can try [14:02:13] (03PS1) 10Hashar: zuul: remove /var/lib/git from server [operations/puppet] - 10https://gerrit.wikimedia.org/r/144696 [14:03:23] RECOVERY - puppet last run on mw1089 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:03:33] RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:06:53] (03CR) 10Giuseppe Lavagetto: [C: 031] "This is good and I don't think we can't live with overlapping puppetmaster logs." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144695 (owner: 10Alexandros Kosiaris) [14:13:12] anomie: should the commit message be the same as the extension's commit? or should it just be mentioning the submoduleupdate? [14:14:04] gi11es: Just mentioning the submodule update is standard [14:14:10] alright [14:14:50] (03PS2) 10Scottlee: Fixed spacing and puppet-lint issues in manifests/role. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142542 [14:19:25] anomie: added them to the deployments page [14:20:17] gi11es: Good! 
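[Editor's note] On the puppetmaster logrotate race discussed earlier (both palladium and strontium getting `kill -HUP` from logrotate in the same minute, so agents hit EOF): the deployed fix above made the reload graceful, but the "stagger?" idea also works and can be sketched as a deterministic per-host delay. This is an illustration, not the merged change:

```python
import hashlib

# Derive a stable per-host offset so two puppetmasters never reload
# apache at the same moment. Offsets can still collide within the
# window; a real deployment would check or assign them explicitly.
def stagger_minutes(hostname, window=30):
    digest = hashlib.md5(hostname.encode()).hexdigest()
    return int(digest, 16) % window

offsets = {h: stagger_minutes(h) for h in ("palladium", "strontium")}
# Each host sleeps its offset (in minutes) before the logrotate reload.
assert all(0 <= m < 30 for m in offsets.values())
assert stagger_minutes("palladium") == stagger_minutes("palladium")
```

As noted in the log, the error only appears when both masters are reloaded together, so either staggering or a graceful reload removes it.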
[14:20:33] (03PS1) 10Tim Landscheidt: ldap: Fix pipe error [operations/puppet] - 10https://gerrit.wikimedia.org/r/144706 [14:22:01] (03CR) 10coren: [C: 032] "Spick and span" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142542 (owner: 10Scottlee) [14:23:59] (03CR) 10Hashar: "Will add a new check in Icinga to verify we have a 'zuul-merger' process running on the node." [operations/puppet] - 10https://gerrit.wikimedia.org/r/144694 (owner: 10Hashar) [14:24:19] (03CR) 10Hashar: "Merely a cleanup change :-D" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144696 (owner: 10Hashar) [14:24:50] (03CR) 10coren: [V: 032] "It's a linting noop." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142542 (owner: 10Scottlee) [14:41:18] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [14:46:32] dzahn [14:49:01] dogeydogey: dzahn is 'mutante' on irc. He's in California so probably not online yet. [14:59:48] gi11es: If you're ready, I'll do your SWAT patches first [15:00:02] anomie: ready [15:00:04] manybubbles, anomie, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140708T1500) [15:00:10] * anomie begins SWAT [15:00:22] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:00:52] (03PS1) 10Hashar: zuul: introduced config hash in role::zuul::configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/144708 [15:00:54] (03PS1) 10Hashar: (WIP) zuul: patch of quake (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144709 [15:01:33] cause 'doom' is getting old [15:01:40] heh [15:02:01] hashar: you can call it the patch of duke nukem, and get it merged a few years later in a very underwhelming shape... 
:D [15:03:04] YuviPanda: oh the serie of duke nukem shipped on time [15:03:10] YuviPanda: at least 1 to 3 :-D [15:03:16] hehe [15:03:27] I can still remember the day my dad came back home with the duke nuke 1 floppy disks [15:03:38] GAME!!!!!!!!!!! at last that computer is going to serve a purpose! [15:03:45] hashar: :D [15:04:02] hashar: I think Quake 3 was out by the time I started playing anything? [15:04:16] !log anomie Synchronized php-1.24wmf12/extensions/UploadWizard/UploadWizard.config.php: SWAT: Flickr API is https-only now [[gerrit:144583]] (duration: 00m 10s) [15:04:17] gi11es: ^ Test please [15:04:20] Logged the message, Master [15:04:26] meanwhile in Finland, while I was discovering game on a PC, linus was porting unix to x86 ... [15:04:54] (03CR) 10jenkins-bot: [V: 04-1] (WIP) zuul: patch of quake (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144709 (owner: 10Hashar) [15:04:57] hashar: heh [15:05:12] and inflation adjusted the lame comp costed 2300€ dooh [15:06:04] anomie: I don't know if we have any wiki on 1.24wmf12 where UploadWizard is turned on. 
can't really test it [15:06:26] and I've just realized that I can't find flickr import on Commons, I wonder if they've turned it off or something [15:06:32] gi11es: ok, going ahead with wmf11 then [15:06:39] gi11es: test has UploadWizard, though [15:06:44] unsure if that's wmf12 [15:06:54] (https://test.wikipedia.org/wiki/Special:UploadWizard) [15:07:06] gi11es, YuviPanda: testwiki should be wmf12, commons is still wmf11 until this afternoon [15:07:23] right, so can be tested [15:07:52] YuviPanda: thanks [15:08:09] anomie: flickr import confirmed to still work on test [15:10:12] !log anomie Synchronized php-1.24wmf11/extensions/UploadWizard/UploadWizard.config.php: SWAT: Flickr API is https-only now [[gerrit:144584]] (duration: 00m 10s) [15:10:14] gi11es: ^ Test please [15:10:18] Logged the message, Master [15:10:36] huh, that's neat (re flickr api https-only) [15:11:03] are we still shipping our 'official' flickr API key to the client though? :) [15:11:13] hashar: Does Jenkins run extension tests on merges to core wmf branches? Or is it pointless to wait for it to run tests when the diff is just a submodule update? [15:11:33] YuviPanda: shhhh [15:11:48] api keys and FLOSS don't mix [15:11:56] anomie: I'm guessing flickr import on commons is limited by user rights which I don't have, or something like that. YuviPanda any suggestion for an '11 wiki with UploadWizard on it? :) [15:12:28] gi11es: :D none that I can think of, but someone else might have commons adminship [15:12:44] gi11es: check #wikimedia-commons? 
[15:13:06] meh, it's the least dangerous change of all times and we've checked that it doesn't break '12 [15:13:15] * anomie moves on to the next change [15:13:16] anomie: it just run core tests, the extensions /submodules are not fetched [15:13:52] anomie: thanks for the swatting [15:14:14] !log anomie Synchronized php-1.24wmf12/extensions/Scribunto/: SWAT: Fix regression in os.date and os.time at module scope [[gerrit:144511]] (duration: 00m 11s) [15:14:20] Logged the message, Master [15:14:22] * anomie confirms fix [15:15:26] !log anomie Synchronized php-1.24wmf11/extensions/Scribunto/: SWAT: Fix regression in os.date and os.time at module scope [[gerrit:144559]] (duration: 00m 10s) [15:15:31] Logged the message, Master [15:15:32] * anomie confirms fix [15:15:41] * anomie is done with SWAT [15:15:50] hashar: Thanks, that made my swatting much faster ;) [15:15:55] (03PS2) 10Hashar: (WIP) zuul: patch of quake (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144709 [15:16:12] anomie: we will eventually run core + all extensions unit tests + qunit of the resulting install [15:16:20] but that is a long way :D [15:16:34] hashar: Let me know when it happens so I can stop skipping Jenkins on submodule updates [15:16:45] oh there will be a bunch of announcements [15:17:07] hashar: and you should rope in the release management RFP winners for that work, too (to be announce real soon now) [15:17:10] anomie: I have yet to write the vision and first email though [15:17:29] greg-g: yeah I will get Jenkins to fill next year RFP :D [15:18:29] hashar: haha :) [15:18:43] hashar: we should get phabricator to fill in next year's RFP [15:19:05] hashar: lol, I... uhh... would love if you did that. 
"must accept RfP on pain of V+2's taking 2h each otherwise' [15:19:53] s/'$/"/ [15:20:40] in theory, we could get rid of mwcore wmf branches and the lame submodule [15:21:08] cause if we know that core@sha1 x + extension @ sha1 y...z pass tests together [15:21:15] we could automatically craft a deployable version [15:21:32] yes! [15:21:35] !log reedy Purged l10n cache for 1.24wmf9 [15:21:40] Logged the message, Master [15:21:43] (03PS7) 10Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 [15:21:58] we need tests though [15:22:00] !log reedy Purged l10n cache for 1.24wmf10 [15:22:05] Logged the message, Master [15:22:09] greg-g: hashar sounds suspiciously close to 'continuous deploy' :) [15:22:11] at least closer than today [15:22:11] hashar: https://www.mediawiki.org/wiki/Wikimedia_Release_and_QA_Team/Wishlist#True_code_pipeline [15:22:19] YuviPanda: yes, see ^ [15:22:36] especially the proposal (just one option, to think about) [15:22:47] YuviPanda: I first met Greg during Amsterdam hackathon and we talked about one click deploy.
Seems like a worthwhile goal to aim to [15:23:43] hashar: Just to press the button, you've go to go to the SF office ;) [15:24:11] (03CR) 10Giuseppe Lavagetto: "I deployed the directory to all puppet hosts, we can deploy this patch for just one host, verify nothing has changed and quickly deploy ev" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [15:25:34] ah puppet [15:27:27] (03CR) 10Hashar: zuul: move Icinga checks to zuul::monitoring::server (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144693 (owner: 10Hashar) [15:27:50] (03PS2) 10Hashar: zuul: move Icinga checks to zuul::monitoring::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/144693 [15:28:59] (03PS1) 10Reedy: Non Wikipedias to 1.24wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144712 [15:29:18] (03CR) 10Hashar: zuul: monitor Zuul merger via nrpe (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144694 (owner: 10Hashar) [15:30:01] (03PS2) 10Hashar: zuul: monitor Zuul merger via nrpe [operations/puppet] - 10https://gerrit.wikimedia.org/r/144694 [15:32:14] (03CR) 10Hashar: zuul: monitor Zuul merger via nrpe (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144694 (owner: 10Hashar) [15:32:31] (03PS3) 10Hashar: zuul: monitor Zuul merger via nrpe [operations/puppet] - 10https://gerrit.wikimedia.org/r/144694 [15:33:29] (03PS2) 10Hashar: zuul: remove /var/lib/git from server [operations/puppet] - 10https://gerrit.wikimedia.org/r/144696 [15:33:33] (03PS2) 10Hashar: zuul: introduced config hash in role::zuul::configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/144708 [15:34:40] (03CR) 10Hashar: "On integration-dev.eqiad.wmflabs that triggers:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144694 (owner: 10Hashar) [15:35:07] (03PS3) 10Hashar: (WIP) zuul: patch of quake (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144709 [15:36:28] 
(03CR) 10Hashar: "Compile fine on integration-dev.eqiad.wmflabs and it is a noop!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144709 (owner: 10Hashar) [15:36:44] enough puppet for now [15:37:30] RobH: hey! around? [15:37:41] ? [15:37:57] RobH: Coren asked me to poke you if we've any misc servers around that can be used for labs graphite [15:38:01] VMs don't seem to be doing much [15:38:06] (going to file an RT ticket shortly) [15:38:18] I just allocate them, but Mark also approves them [15:38:33] so i have spare servers, but i'd make sure this isnt the first time this was discussed ;] [15:38:47] RobH: I told him to ask you because you'd be the one who knows whether there was one to ask for in the first place. :-) [15:38:52] yeah [15:39:02] https://wikitech.wikimedia.org/wiki/Server_Spares [15:39:07] aha! [15:39:12] please note you cannot just take off that list, as it say on the top [15:39:16] but its where i track my spares [15:39:40] so if what you need matches the spec of one of those, its a LOT easier to get ;] [15:39:40] YuviPanda: Labs monitoring is on the plan for the year, so I don't think it'd be a hard sell with Mark. I'll poke him once he gets online. [15:39:51] indeed, if its already planned then its much easier [15:40:01] Coren: cool! Should I still file an RT or shall I leave it to you? [15:40:01] and i'd just put in the procurement ticket with the details and reasoning [15:40:08] (we're going to make you have one of those tickets no matter what) [15:40:11] ah [15:40:12] ok [15:40:34] but yea, as coren points out, if its already in our roadmap then its just all a matter of course [15:40:47] Coren: I don't think I've access to any RT queue other than ops request, can you file under procurement? [15:41:00] YuviPanda: you can [15:41:01] YuviPanda: if at all reasonable (separate disks) I'd like to combine this with an alert/monitoring. Yeah, I'll handle the ticket. [15:41:05] everyone has procurement! 
[15:41:10] (not sure why folks still think they dont ;) [15:41:24] but yea, however you guys wanna work it out [15:41:30] Coren: cool! [15:41:31] (but everyone in the org has access to every queue) [15:41:50] RobH: oh? I remember trying it out a long time ago and not getting through... [15:42:00] you said it, a long time ago [15:42:02] hmm, actually, I might not have been part of the WMF at that point, unsure. [15:42:06] i changed those permissions like a year ago [15:42:11] RobH: right. will remember that next time [15:42:14] and folks still think its locked down, heh ;] [15:42:24] but indeed, it was locked when we started using RT [15:42:35] which was annoying to everyone asking for servers, ehh [15:42:39] heh, right [15:42:46] Coren: can you cc me on the ticket? [15:43:06] just list both of you as requestors and you'll get all updates [15:43:20] but yes, i have a spare server to use for it ;] [15:43:33] (03PS1) 10BBlack: move amssq47 to private1-esams [operations/dns] - 10https://gerrit.wikimedia.org/r/144714 [15:43:58] (03PS2) 10Scottlee: Fixed puppet-lint issues on manifests/role/analytics/hadoop.pp. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142543 [15:44:39] (03CR) 10BBlack: [C: 032] move amssq47 to private1-esams [operations/dns] - 10https://gerrit.wikimedia.org/r/144714 (owner: 10BBlack) [15:44:56] RobH: ty! :) [15:49:14] (03PS1) 10BBlack: convert amssq47 to normal text cache in private1-esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/144715 [15:51:07] YuviPanda: RT 7814 [15:51:14] RobH: Should I CC Mark on this? [15:51:14] (03CR) 10BBlack: [C: 032] convert amssq47 to normal text cache in private1-esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/144715 (owner: 10BBlack) [15:51:15] RobH: do you know where I can look to see the physical configuration of tungsten? [15:51:19] Coren: ty! [15:52:06] Coren: also, I don't think we need that much disk.
A good amount of metrics with the same granularity as prod was under 100G fully done, I think. I'd rather have us max out RAM than disk. [15:52:48] YuviPanda: Then the 2x500Gs would suffice; so we'd have 16G or 64G of RAM for one. Comment on the ticket. [15:53:05] Coren: yeah, will do. Looking for specs of tungsten atm, since that's prod graphite server. [15:53:20] Coren: it collects 250k metrics, and has been having trouble of late (just now, just a little bit, not as much as ours) [15:53:22] Coren: no, just put in the procurement ticket, mark loops through that queue regularly [15:53:32] YuviPanda: At the cost of no disk redundancy, mind you, so we need to decide how valuable historical data is. [15:53:46] Coren: can't we Raid the 2x500 to give us just 500G of storage? [15:53:54] so [15:53:59] there's a plan to improve monitoring in labs [15:54:03] but that plan isn't yet defined [15:54:13] before we start buying hardware for it, i'd like to see a little bit of a plan :) [15:54:20] YuviPanda: Yes, but I'd very much like to have the metrics on its own spinning rust to reduce iowait. [15:54:55] mark: I think, to be fair, at this stage we are in exploratory experiment mode; hence a misc server. [15:55:41] exploratory experiments you can do in labs VMs [15:56:00] (03CR) 10Andrew Bogott: [C: 032] "hm, needs manual rebase" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141588 (https://bugzilla.wikimedia.org/66962) (owner: 10Tim Landscheidt) [15:56:03] mark: so I've been playing with graphite.wmflabs.org for the last couple of weeks.
It's been unable to handle the load of all of labs, and is struggling now even with just toollabs + betalabs [15:56:08] it is on the biggest VM we've got [15:56:20] (03PS2) 10Andrew Bogott: Tools: Install npm for users' use [operations/puppet] - 10https://gerrit.wikimedia.org/r/132238 (owner: 10Tim Landscheidt) [15:56:34] mark: IOPS is the bottleneck [15:56:35] mark: To a point; YuviPanda has already run into I/O issues, and there is an insurmountable problem with monitoring the /infrastructure/ from inside itself (which, IMO, is the bigger issue for me) [15:56:40] PROBLEM - Host amssq54 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:19] mark: so my other option at this moment is to set up a graphite cluster which will complicate the code for now (and code that will probably go away when we move this to 'real' hardware) [15:57:52] (03CR) 10Andrew Bogott: [C: 032] Tools: Install npm for users' use [operations/puppet] - 10https://gerrit.wikimedia.org/r/132238 (owner: 10Tim Landscheidt) [15:57:56] what sort of metrics are being monitored at the moment? [15:58:01] mark: There is also the obvious issue of recursion; if activity in Labs increases, the load on monitoring increases... which increases the labs activity. :-) [15:58:10] (03PS1) 10Hoo man: Capture Wikibase\UpdateRepoOnMoveJob debug logs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144716 [15:58:14] aude: ^ [15:58:37] ACKNOWLEDGEMENT - Host amssq54 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Probably related to current amssq47 work due to mislabeled / miswired switch ports in esams, investigating [15:58:39] panic [15:58:40] (03CR) 10coren: [C: 032 V: 032] "Moar lint!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142543 (owner: 10Scottlee) [15:58:43] aude: ??
[15:58:43] ok :) [15:58:47] mark: right now is just generics (CPU/Network/Disk) + some toollabs specific ones (Redis Keys/Clients, nginx load/active connections, exim queue) [15:58:56] * aude panics whenever i am mentioned in this channel :p [15:59:19] aude: :) [15:59:19] mark: I was going to work on getting toollabs 5xx/4xx counts be logged, plus the grid engine states be logged, but have been sidetracked with actually stabilizing it instead [15:59:29] puppet stats are also logged [15:59:43] (minimally, will be added a bit more) [15:59:47] andrewbogott hi [15:59:58] so how much in sync is this with similar monitoring in production? :) [16:00:19] mark: it's the same code, essentially (diamond + txstatsd + graphite) [16:00:26] all puppetized, of course :) [16:00:33] dogeydogey: howdy [16:01:11] mark: On my side, I'm looking for a clean way to allow labs user to monitor tools/services in their projects; unless we have wider plans to move away from Icinga in prod I'm gunning in that direction myself, with some happy trickery to allow generation of configuration per-project. [16:01:19] (03CR) 10Aude: [C: 031] "looks good and ready whenever we can deploy it" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144716 (owner: 10Hoo man) [16:01:55] (03CR) 10Andrew Bogott: [C: 031] "Looks fine to me, but needs a fresh rebase" [operations/puppet] - 10https://gerrit.wikimedia.org/r/120347 (owner: 10Tim Landscheidt) [16:02:12] Coren: right, that as well. I wrote up some initial thoughts a long time ago at https://wikitech.wikimedia.org/wiki/User:Yuvipanda/Icinga_for_tools (out of date a bit) [16:02:24] mark: So project admins can get notifications and ack/suppress alerts as needed; this is often requested by tool labs users but I'm certain the deployment-prep gang is going to be happy about it. 
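The per-host generics YuviPanda lists (plus the Redis/nginx/exim extras) are gathered by diamond, which switches each collector on through its own small config file; a hedged example follows — the file path and interval value are assumptions based on common diamond layouts, not values from this log:

```ini
# /etc/diamond/collectors/RedisCollector.conf
# (path and interval are illustrative assumptions)
enabled = True
interval = 60
```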
[16:02:54] yeah, greg-g filed a bug about alerts for deployment-prep recently as well, IIRC [16:03:45] Coren: we do have some plans to move away from icinga in prod in the future [16:03:56] mark: akosiaris was considering shinken, I think. [16:04:02] yes [16:04:19] so what I'd like to avoid is this to diverge [16:04:29] that said, a graphite install in labs for all kinds of metrics does make sense to me [16:05:13] mark: I know, but the question is "do we make monitoring labs dependent on that being done, use labs to try shinken or other alternatives, or use Icinga for now and switch to $production once that has been decided and specced?" [16:05:36] (03PS2) 10Tim Landscheidt: Tools: Install xsltproc [operations/puppet] - 10https://gerrit.wikimedia.org/r/141588 (https://bugzilla.wikimedia.org/66962) [16:05:39] Coren: fwiw, I don't think we can use icinga that easily without work on the prod side anyway. The code is very prod specific atm. [16:06:02] RECOVERY - Host amssq54 is UP: PING OK - Packet loss = 0%, RTA = 95.89 ms [16:06:15] Coren: mark so ideally, I'd like for us to get our graphite stuff straight, work with akosiaris to either move to shinken or fix the puppet icinga code, and then use that in labs. [16:06:26] (03CR) 10Andrew Bogott: [C: 032] Tools: Install xsltproc [operations/puppet] - 10https://gerrit.wikimedia.org/r/141588 (https://bugzilla.wikimedia.org/66962) (owner: 10Tim Landscheidt) [16:06:43] YuviPanda: Sane. [16:06:51] that sounds good to me, although i'm not sure how much time akosiaris has to work on this /at the moment/ [16:07:38] mark: It might be worthwhile to try shinken as an experiment; the lessons learned there will certainly be of use to akosiaris once he does have the time to sit down. [16:07:55] mark: I don't expect the graphite work to be done for a month or two anyway, so after icinga/shinken is decided I'd help work on that as well. 
[16:08:46] alright [16:08:54] please create a procurement request for the labs graphite server [16:09:00] and then send a link and short plan to the ops list [16:09:08] i'd like everyone to be able to be aware of it and weigh in [16:09:20] mark: 7814 [16:09:31] mark: Will do. [16:09:33] thanks [16:09:37] :D [16:11:25] (03CR) 10Andrew Bogott: [C: 032] "This has been annoying me for ages!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144706 (owner: 10Tim Landscheidt) [16:18:12] (03PS8) 10Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 [16:33:44] (03PS1) 10Rush: stop diamond on non-specified labs projects [operations/puppet] - 10https://gerrit.wikimedia.org/r/144718 [16:43:15] (03CR) 10Yuvipanda: [C: 031] "With the minor nit of mixing 'running' and 'true' to get a service running :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144718 (owner: 10Rush) [16:43:21] chasemp: ^ [16:43:33] (03PS2) 10Rush: stop diamond on non-specified labs projects [operations/puppet] - 10https://gerrit.wikimedia.org/r/144718 [16:43:48] (03CR) 10Rush: [C: 032 V: 032] stop diamond on non-specified labs projects [operations/puppet] - 10https://gerrit.wikimedia.org/r/144718 (owner: 10Rush) [16:43:51] <_joe_> matanya: hey, I know I make you wait, but there's a ton of stuff you may grab as small tasks for puppet -> http://etherpad.wikimedia.org/p/Puppet3 [16:44:14] <_joe_> basically, a good part of our templates are not linted [16:44:26] <_joe_> and they use deprecated variable access syntax [16:45:40] chasemp: btw, since yesterday, I've been using whitelist.conf to only let tools/beta/graphite stats to get in [16:45:48] nice [16:45:48] chasemp: much better, but still maxing out IO [16:45:51] at least not much gaps [16:45:56] http://tools.wmflabs.org/giraffe/index.html#dashboard=ToolLabs+Basics&timeFrame=1h [16:46:05] and the gaps disappear after a minute or so, so I'll bog that down to lag 
[16:46:15] which is strange since that shouldn't technically affect graphite [16:56:20] (03PS5) 10Reedy: Set wgGitInfoCacheDirectory to point to scap managed location [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142320 (https://bugzilla.wikimedia.org/53972) (owner: 10BryanDavis) [16:56:27] (03CR) 10Reedy: [C: 031] Set wgGitInfoCacheDirectory to point to scap managed location [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142320 (https://bugzilla.wikimedia.org/53972) (owner: 10BryanDavis) [16:59:03] chasemp: I'll check in an hour or so to see if non tools/betalabs/graphite projects are still sending metrics [17:23:39] (03PS3) 10Andrew Bogott: Add archive-project-volumes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144063 [17:30:11] (03CR) 10Krinkle: [C: 031] "I was worried that using $IP/cache (instead of our wmf-config value of wgCacheDirectory, which is in /tmp), might be unexpected as it is u" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142320 (https://bugzilla.wikimedia.org/53972) (owner: 10BryanDavis) [17:35:23] (03CR) 10BryanDavis: "Using $IP/cache was deliberate as this is a location that scap can sync across the cluster. 
The git json files have been syncing on each f" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142320 (https://bugzilla.wikimedia.org/53972) (owner: 10BryanDavis) [17:37:23] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:37:57] ^ I fixed fenari by manually creating /var/lib/pybal-check (pybal-check's homedir) [17:38:25] erbium and probably osmium and dataset2 have similar failures from relatively-recent puppet changes to how users are managed [17:38:43] I hate to wade in any deeper and screw it up in puppet or mask the problems by doing one-off local fixups on the hosts themselves [17:39:32] (you can see these in icinga tactical overview of unhandled/critical, they've been in that state since approx 1 week ago) [17:39:48] (also ms-be1005 [17:39:55] and ms-be-3003) [17:40:24] it looks like /var/lib/pybal-check is defined as the home dir in puppet for pylib [17:40:44] weird it's not created, but it is in mediawiki/manifests/users [17:40:48] maybe that's not being applied? [17:40:49] my best guess at the generic cause is that the userids for these users already existed on these hosts before the relevant puppet changes, and manage_homedir doesn't actually do anything if it didn't need to create the user? 
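bblack's guess above matches how Puppet's `user` type behaves: `managehome` only creates the home directory when the user resource is first *created*, so it silently does nothing for a uid that already existed on the host. A hedged puppet sketch of the failure mode and the usual workaround (user name and path per the log; ownership/mode details are assumptions):

```puppet
user { 'pybal-check':
  ensure     => present,
  home       => '/var/lib/pybal-check',
  system     => true,
  managehome => true,  # no-op if the user already existed
}

# An explicit file resource converges regardless of how the user
# came to exist:
file { '/var/lib/pybal-check':
  ensure  => directory,
  owner   => 'pybal-check',
  require => User['pybal-check'],
}
```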
[17:41:38] (which is why I did a manual fixup on fenari, but tbh I'm not sure if that's correct or just masking a symptom) [17:41:50] ori: James_F: https://graphite.wikimedia.org/render/?width=1048&height=680&from=-2weeks&hideLegend=false&target=frontend.assets.modules.missing.*.max&target=frontend.assets.modules.error.*.max&target=frontend.assets.combined.http4xx.*.max&target=frontend.assets.combined.http5xx.*.max&target=frontend.assets.combined.uncaughtException.*.max [17:42:08] We now have 404, 500, uncaught js error, and module load error in tracking :) [17:43:13] I only looked at erbium from that list that have all been failing puppet recently, there could be other causes [17:43:24] but in the erbium case, it was a similar issue with the homedir for file_mover [17:43:29] osmium has problems out the wazoo [17:43:37] but I don't see pybal in there [17:43:53] osmium and lots of it 'Hhvm::Dev/Package[libboost-dev]/ensure: change from purged to present failed' [17:44:07] is that stuff trusty packaged? idk [17:45:13] yeah and ms-be is also different: [17:45:13] ori: are you the hhvm gentlemen :) could you peek at osmium [17:45:16] Jul 8 17:26:48 ms-be3003 puppet-agent[32421]: (/Stage[main]/Main/Node[ms-be3001-4.esams.wmnet]/Swift::Create_filesystem[/dev/sdk]/Swift::Mount_filesystem[/dev/sdk1]/Mount[/srv/swift-storage/sdk1]) Could not eval [17:45:20] uate: Execution of '/bin/mount /srv/swift-storage/sdk1' returned 32: mount: special device LABEL=swift-sdk1 does not exist [17:46:05] interesting dataset2 is a similar mount issue [17:46:24] so yeah I think the weird problem with /var/lib/whatever homedirs for system users is isolated to just erbium file_mover and fenari pybal-check [17:47:09] I think some of these were outstanding for a while, we just weren't seeing it and now we are thanks to the new icinga puppetfail checks [17:47:22] dataset2 is failing on some nfs labs mount then several things follow. andrewbogott_afk or Coren got a minute to look at dataset2? 
[17:47:24] probably up your alley [17:47:38] chasemp: Sure, gimme a sec [17:48:12] chasemp: What's the instance name? 'dataset2' fails. [17:48:22] dataset2.wikimedia.org [17:48:40] "err: /Stage[main]/Dataset::Cron::Rsync::Labs/Mount[/mnt/dumps]: Could not evaluate: Execution of '/bin/mount -o rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc /mnt/dumps' returned 32: mount.nfs: an incorrect mount option was specified" [17:48:44] labs-ish? [17:48:52] Ah, I need the actual instance name, not the public IP's name. :-) [17:49:14] it's a prod box [17:49:14] ... "incorrect mount option"? Dafu? [17:49:18] Oh! [17:49:29] doing some kind of labs magic idk [17:49:38] copying over data for analysis? [17:50:07] Yeah, that's the regular sync mount. There's obviously a missing dependency there. [17:50:27] Probably something that's pulled in from another, normally included class. Hang tight. [17:52:14] chasemp: Ah, the proximate cause is obvious: this attempts an NFS4 mount, but this is Lucid? [17:52:48] Ubuntu 10.04.4 LTS \n \l [17:53:16] I think these are apergos's domain? [17:53:17] not sure [17:53:47] (03PS1) 10Rush: managed file_mover home dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/144724 [17:54:08] Eeew. Lucid doesn't speak NFS4; no way to have that mount work from that box. [17:54:48] Hm. At least it doesn't speak it by default; lemme see if there is a package that can be pulled in. [17:55:20] Krinkle: Nice [17:55:49] chasemp: Hm; it looks like it does but the mount syntax might have to be tweaked. [17:55:52] Coren: idk...hit up apergos? I think. I dont' even really know what that's doing :) [17:55:59] oh well that's cool then [17:56:00] Krinkle: Is there a way of fixing the scale? [17:56:17] (03CR) 10Rush: [C: 032] managed file_mover home dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/144724 (owner: 10Rush) [17:56:31] (03CR) 10Dzahn: "aah, gotcha. 
ok, thanks" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144092 (owner: 10Dzahn) [17:56:35] James_F: what do you mean? [17:56:45] (03Abandoned) 10Dzahn: snmptt - add system group for system user [operations/puppet] - 10https://gerrit.wikimedia.org/r/144092 (owner: 10Dzahn) [17:57:14] chasemp: Yeah, that can be made to work, at least as far as mounting the filesystem, with a tweak of the options. We'll have to special-case it in puppet. [17:57:14] Krinkle: There's a sliver of pink at the bottom of the graph but it's either at zero (no data yet) or the scale is wrong so you can't judge? [17:57:41] bblack: https://gerrit.wikimedia.org/r/#/c/144724/ for erbium seems good [17:57:41] It's all 0. Else there'd be a 1 [17:57:49] I'm not sure graphite has a non-automatic scale [17:57:51] (It's Hardy that doesn't peak v4, but thankfully they're gone now) [17:57:52] Krinkle: 1… what's the scale? [17:57:59] James_F: Well, 0 or more [17:58:02] Krinkle: Events per time period? [17:58:09] number of failures [17:58:13] the value of those counts [17:58:25] individual counts [17:58:33] its run once a day [17:58:37] Krinkle: Sure, but per what? Per minute? Per hour? Per day? [17:58:43] Oh, so it's events-per-day? [17:58:44] or maybe more often, but either way, individual data points are for that run [17:58:49] OK. [17:58:53] No, events on that wikis' main page [17:58:58] from a single load [17:59:03] Oh, right. [17:59:19] chasemp: Specifically, Lucid needs fstype explicitly at 'nfs4' and does not understand the 'nofsc' option. [17:59:54] are you ok with putting up a patch? 
you're the NFS-savvy one :) I'm just tracking down failed puppet runs in icinga [18:00:05] Reedy, greg-g: The time is nigh to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140708T1800) [18:00:25] James_F: https://graphite.wikimedia.org/render/?width=1048&height=680&from=-2weeks&hideLegend=false&yStep=1&target=frontend.assets.modules.missing.*.max&target=frontend.assets.modules.error.*.max&target=frontend.assets.combined.http4xx.*.max&target=frontend.assets.combined.http5xx.*.max&target=frontend.assets.combined.uncaughtException.*.max [18:00:28] yStep=1 :) [18:00:33] The time is nigh for Bill Nye. [18:00:54] Krinkle: OK. :-) [18:05:23] eh? I see I was pinged [18:06:52] apergos: dataset2 box nfs mount issues [18:06:57] oh [18:06:57] coren I think knows how to fix [18:07:08] lucid meh [18:07:26] ok cool [18:09:10] (03PS1) 10BBlack: Add v4 reverse for amssq47 [operations/dns] - 10https://gerrit.wikimedia.org/r/144726 [18:09:36] (03CR) 10BBlack: [C: 032] Add v4 reverse for amssq47 [operations/dns] - 10https://gerrit.wikimedia.org/r/144726 (owner: 10BBlack) [18:10:09] chasemp: i'll take care of osmium [18:10:31] ori: I'll let ya! [18:14:49] AaronSchulz: what do you think makes sense in terms of ordering the deployment of jobrunner relative to HHVM? [18:14:59] should it happen before or after? [18:15:20] ori: is this all the mediawiki jobs themselves? [18:16:19] ori: probably before [18:16:28] given all the hhvm bugs [18:17:06] <_joe_> AaronSchulz: to get better debug options, right? [18:17:15] * aude misread as jobrunner on hhvm [18:18:14] i think before makes sense too [18:18:46] should we start with a labs instance?
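Coren's Lucid-specific fix described earlier — declare the fstype as `nfs4` explicitly and drop the `nofsc` option Lucid's mount.nfs doesn't understand — would look roughly like this as a puppet special case. The export device name and the exact option strings are assumptions for illustration, not the deployed values:

```puppet
# Sketch only: special-case the dumps mount for Lucid hosts.
$dumps_device = 'dumps-server:/dumps'  # hypothetical export name

case $::lsbdistcodename {
  'lucid': {
    mount { '/mnt/dumps':
      ensure  => mounted,
      device  => $dumps_device,
      fstype  => 'nfs4',  # Lucid needs the fstype spelled out
      options => 'rw,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime',
    }
  }
  default: {
    mount { '/mnt/dumps':
      ensure  => mounted,
      device  => $dumps_device,
      fstype  => 'nfs',
      options => 'rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc',
    }
  }
}
```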
[18:18:47] (03CR) 10Reedy: [C: 032] Non Wikipedias to 1.24wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144712 (owner: 10Reedy) [18:18:54] (03Merged) 10jenkins-bot: Non Wikipedias to 1.24wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144712 (owner: 10Reedy) [18:20:50] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non Wikipedias to 1.24wmf12 [18:20:56] Logged the message, Master [18:20:56] ori: if it's not too much work [18:21:02] probably won't be many jobs to run though [18:21:21] i'll amend my patch to make it easy to apply selectively on beta [18:21:31] bd808: would you have time to set up trebuchet deployment for jobrunner on labs? [18:22:07] (03PS2) 10Reedy: Capture Wikibase\UpdateRepoOnMoveJob debug logs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144716 (owner: 10Hoo man) [18:22:07] (03CR) 10Reedy: [C: 032] Capture Wikibase\UpdateRepoOnMoveJob debug logs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144716 (owner: 10Hoo man) [18:22:13] (03Merged) 10jenkins-bot: Capture Wikibase\UpdateRepoOnMoveJob debug logs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144716 (owner: 10Hoo man) [18:22:38] (03PS6) 10Reedy: Set wgGitInfoCacheDirectory to point to scap managed location [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142320 (https://bugzilla.wikimedia.org/53972) (owner: 10BryanDavis) [18:22:43] (03CR) 10Reedy: [C: 032] Set wgGitInfoCacheDirectory to point to scap managed location [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142320 (https://bugzilla.wikimedia.org/53972) (owner: 10BryanDavis) [18:22:47] ori: What does it need? Just the first sync? 
[18:22:50] (03Merged) 10jenkins-bot: Set wgGitInfoCacheDirectory to point to scap managed location [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142320 (https://bugzilla.wikimedia.org/53972) (owner: 10BryanDavis) [18:23:13] bd808|LUNCH: yeah. (but not yet, i still have to land the patch) [18:23:27] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: Puppet has 2 failures [18:24:00] ori: Sure. Just ping me when it needs running. I'll be back after I find some food in my kitchen. [18:24:45] (03PS3) 10Reedy: Allow Internet Archive's Wayback machine to get stuff from bits etc. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144364 (https://bugzilla.wikimedia.org/65464) (owner: 10Nemo bis) [18:24:52] (03CR) 10Reedy: [C: 032] Allow Internet Archive's Wayback machine to get stuff from bits etc. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144364 (https://bugzilla.wikimedia.org/65464) (owner: 10Nemo bis) [18:24:58] (03Merged) 10jenkins-bot: Allow Internet Archive's Wayback machine to get stuff from bits etc. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144364 (https://bugzilla.wikimedia.org/65464) (owner: 10Nemo bis) [18:25:33] (03PS2) 10Reedy: Remove remaining surveys for Media Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143750 (owner: 10MarkTraceur) [18:25:42] (03CR) 10Reedy: [C: 032] Remove remaining surveys for Media Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143750 (owner: 10MarkTraceur) [18:25:49] (03Merged) 10jenkins-bot: Remove remaining surveys for Media Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143750 (owner: 10MarkTraceur) [18:25:51] Huzzah [18:25:55] was someone just on cp1043? 
I see a recent login + recent puppetfail [18:26:37] last week yes assuming that's not recent enough [18:27:11] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 15s) [18:27:16] Logged the message, Master [18:27:27] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:28:28] !log reedy Synchronized robots-private.txt: (no message) (duration: 00m 14s) [18:28:33] Logged the message, Master [18:29:28] bblack: i was on it to check the openssl version but i did not know about puppet fail [18:29:38] looked at bug 53259 [18:29:51] ah ok [18:30:02] it succeeded on next run anyways, must have been some temporary apt-locking thing [18:30:38] oh, i see, i did run apt-get with -s for simuated [18:30:42] simulated [18:31:26] ^d: wait, are you using tests/phpunit/phpunit.php? That are uses LCStoreNull [18:31:53] <^d> Yeah, how else would I run tests? [18:32:27] *already uses [18:32:41] ^d: so how where you hitting that LCStoreDB problem a while back? [18:32:47] (03PS1) 10Ottomata: Temporarily remove analytics1010 from netboot partman for troubleshooting [operations/puppet] - 10https://gerrit.wikimedia.org/r/144730 [18:33:11] (03CR) 10Ottomata: [C: 032 V: 032] Temporarily remove analytics1010 from netboot partman for troubleshooting [operations/puppet] - 10https://gerrit.wikimedia.org/r/144730 (owner: 10Ottomata) [18:33:25] <^d> AaronSchulz: Tests that do batshit things? [18:33:33] <^d> Honestly, nothing surprises me about our phpunit wrapper. [18:40:20] (03PS1) 10Dzahn: update SSL cipher list for gerrit to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/144731 [18:42:38] (03CR) 10Dzahn: [C: 031] update SSL cipher list for gerrit to support PFS (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144731 (owner: 10Dzahn) [18:45:12] (03CR) 10Chad: [C: 031] "I know jack shit about SSL, so I'm just +1'ing because it shouldn't affect Gerrit." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/144731 (owner: 10Dzahn) [18:45:33] Hm.. which apaches does wikitech run on again? [18:45:50] <^d> It runs on the virt* boxes I thought? [18:46:26] yup outside of the main cluster [18:47:08] Krinkle: should be on virt1000 , under ops responsibility. [18:47:18] i.e. we don't have access to it [18:49:16] hashar: sure, I don't want to access it [18:49:29] hashar: I want to know whether it's firewalled or whether it can communicate regularly with production [18:49:34] it's not yet inside labs, right? [18:49:40] e.g. can it communicate over udp to irc.wikimedia.org [18:50:38] !log patch for bug66608 deployed to wmf11/12 [18:50:43] Logged the message, Master [18:50:50] <^d> Krinkle: It's not completely firewalled, it at least lets us hit LVS. [18:51:01] <^d> So my guess would be "yes" [18:51:05] <^d> Dunno how "regular" though. [18:51:06] <^d> :) [18:51:15] (03PS1) 10Dzahn: update SSL cipher list for OTRS to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/144734 [18:51:37] ^d: https://bugzilla.wikimedia.org/show_bug.cgi?id=34685 [18:51:52] (03CR) 10JanZerebecki: [C: 031] update SSL cipher list for gerrit to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/144731 (owner: 10Dzahn) [18:52:43] <^d> Krinkle: Yeah, your comment 10 sounds right. Should just be a matter of configuring it. 
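For context, the PFS cipher-list patches under review here boil down to putting ECDHE suites first in the server's `ssl_ciphers` string so forward-secret key exchange is preferred. An illustrative nginx fragment — these are real OpenSSL cipher names, but not necessarily the exact list WMF deployed:

```nginx
# Sketch of an ECDHE-first (PFS-preferring) cipher configuration.
ssl_prefer_server_ciphers on;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA:DHE-RSA-AES128-SHA:AES128-SHA:!aNULL:!eNULL:!MD5;
```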
[18:53:07] cool [18:53:19] <^d> If not we'll find out real quick when it doesn't work ;-) [18:55:41] (03CR) 10Jgreen: [C: 031] update SSL cipher list for OTRS to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/144734 (owner: 10Dzahn) [18:55:57] !log disconnecting serial cable from psw1-c2-eqiad [18:56:02] Logged the message, Master [18:57:10] (03PS1) 10Dzahn: update SSL cipher list on wikitech to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/144736 [18:57:57] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, but: did you test this in labs and with the ssllabs test suite?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144734 (owner: 10Dzahn) [18:59:07] (03CR) 10Dzahn: "i did not, i figured since i use the same thing we agreed on on the main cluster browser exclusion should not be an issue" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144734 (owner: 10Dzahn) [18:59:25] Krinkle: I have no idea :/ [18:59:35] Krinkle: ask andrew/marc-andré in #wikimedia-labs that is your best bet [19:00:14] ^d: is wikitech config versioned somewhere by now? [19:00:29] <^d> Hahaha, you must be new here. [19:00:32] <^d> Ahem. [19:00:35] <^d> No, I don't think so. [19:00:36] <^d> :) [19:00:46] (03PS1) 10Scottlee: Bug 67673 -- added line to collect Puppet failures. [operations/puppet] - 10https://gerrit.wikimedia.org/r/144737 [19:02:06] ^d: I know it wasn't when Ryan Lane initially set it up. But at some point.. [19:02:08] even here.. 
you'd think it was done [19:06:30] (03CR) 10Giuseppe Lavagetto: [C: 032] Add twemproxy::decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/144350 (owner: 10Ori.livneh) [19:08:08] (03CR) 10CSteipp: [C: 031] update SSL cipher list for gerrit to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/144731 (owner: 10Dzahn) [19:09:29] (03PS1) 10Dzahn: make misc varnish support PFS like main cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/144739 [19:12:10] (03CR) 10Dzahn: [C: 04-1] "eh wait, it's already in /etc/nginx/nginx.conf but for some reason it doesnt work on the misc. cluster" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144739 (owner: 10Dzahn) [19:15:31] (03PS1) 10Cmjohnson: adding mgmt dns entries for an1028-41 [operations/dns] - 10https://gerrit.wikimedia.org/r/144741 [19:15:44] !log restarting varnish on cp1043/cp1044 (misc cluster) [19:15:49] Logged the message, Master [19:16:19] YuviPanda: do you have any advice for debugging my cookie issue with novaproxy on togetherjs.wmflabs.org ? [19:16:34] i'm getting the 'nocookiesfornew' error message when i try to create an account [19:18:04] alternatively, is there any way i can get togetherjs.wmflabs.org to use SUL, so that people don't have to create new accounts for testing? [19:20:35] ottomata: we may be looking at the same problem? I'm also chasing a netboot.cfg issue right now [19:20:59] !log arr, i meant "nginx", not varnish [19:21:03] Logged the message, Master [19:21:13] ottomata: the first symptom I ran into was my expected automatic ubuntu install on the console keeps pausing and asking for manual/guided partition setup [19:21:27] hmm, interesting, what are you installing?
[19:21:36] i'm messing with ciscos, which always give me problems [19:22:00] ottomata: I've tracked that down to the fact that, when preseed.cfg is parsing stuff and does debconf-get netcfg/get_hostname to pick the partman config, it's still set to "unassigned", so it's not picking up partman stuff [19:22:01] also, the dells r720s need a confirmation [19:22:12] partman works, i just have to say 'yes' at some point before it continues the install [19:22:38] hm [19:22:38] in the installer log eventually get_hostname gets set correctly, but too late to pick up partman [19:22:46] whoa, what is this installer log you speak of? [19:23:19] (03PS9) 10Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 [19:23:37] when you get to the manual partitioning prompt, you can "Go Back" to the menu, and execute a shell and dig around in /var/log (or use the menu option "Save debug log" or something like that, which will let you fetch it over http to a useful playground) [19:23:42] oh ok [19:23:44] in the shell [19:23:44] aye [19:23:48] hm, interesting, bblack, that sounds a little different than my problem, i think [19:23:55] ok [19:23:58] not sure though [19:24:06] well I noticed your recent commit to netboot.cfg and thought maybe you were seeing the same [19:24:09] yeha [19:24:17] as far as I can tell, the partition scheme that is chosen is correct [19:24:25] but, it also needs a confirmation [19:24:32] yeah on mine it doesn't choose anything and pops the confirm [19:24:37] but when I confirm, it gives me an error about not being able to write the partition scheme [19:24:45] then back in the main menu [19:24:47] are you sure the partitions don't just happen to be correct from a previous install on the same machine? 
[19:24:48] i can't do anything at all [19:24:51] hmm [19:24:51] (03CR) 10Cmjohnson: [C: 032] adding mgmt dns entries for an1028-41 [operations/dns] - 10https://gerrit.wikimedia.org/r/144741 (owner: 10Cmjohnson) [19:24:53] bblack, i'm not sure [19:24:54] actually [19:24:58] you could be right, this is a reinstall [19:25:14] try using fdisk from the installer shell and wiping out the partition tables and reinstalling, I bet you don't get them back :) [19:25:17] i didn't have trouble with the r720s last week though [19:25:43] hm, aye, i just went through the hassle of manually partitioning, didn't realize there could be a wider problem [19:25:47] yeah I installed a ton of machines exactly like the one I'm doing today two weeks ago, and no issue [19:25:48] thought it was just due to ciscos being a pain [19:26:34] something's changed, but I haven't been able to track down what it is. whatever it is, it makes debconf-get netcfg/get_hostname not return the correct hostname when the preseed include_command stuff for partman wants to look it up [19:26:45] (but then the hostname is ok later) [19:27:40] hm weird [19:27:42] !log this should have fixed all the services behind misc. 
varnish now getting an actual "A" rating on ssllabs [19:27:45] so, if this install succeeds (after manually partitioning), i'm not going to wipe and try again [19:27:47] Logged the message, Master [19:27:51] but, i still have 1009 to due [19:28:02] cmjohnson1 and I are having a different problem with it [19:28:09] but, i will have 14 new dells to install this week [19:28:13] soooo, i will have ample time to try [19:28:47] mutante: good work [19:29:02] jzerebecki: :) thx [19:29:31] A, not A-, thanks to you [19:29:34] (03PS3) 10Ori.livneh: Add jobrunner class [operations/puppet] - 10https://gerrit.wikimedia.org/r/144612 [19:30:03] mail.wikipedia.org's ssl certificate is only for *.wikimedia.org, and that makes browsers complain [19:30:42] jackmcbarn: could the link just be replaced with "lists.wikimedia.org" ? [19:30:58] s/due/do* [19:31:04] sometimes typing doesn't work! [19:31:06] iunno! :) [19:31:08] imho we should not have service names in wikiPedia [19:31:09] mutante: the link in question is a google search result [19:31:21] jackmcbarn: i see [19:31:36] jackmcbarn: what is the query? [19:31:46] second result for "wikitech-l" [19:32:37] i guess we should fix that in apache config (redirects) [19:32:47] there might be existing ticket.. hold on [19:33:31] (03PS4) 10Ori.livneh: Add jobrunner class [operations/puppet] - 10https://gerrit.wikimedia.org/r/144612 [19:34:28] jackmcbarn, jzerebecki https://bugzilla.wikimedia.org/show_bug.cgi?id=44731 [19:34:35] godog, _joe_: could you guys have a look at the jobrunner patch ^^ ? the service it configures is set to 'disabled'; the next step will be enabling it on beta on a dedicated runner [19:35:06] jzerebecki: RT 6981 :p [19:36:14] ;) [19:40:25] <_joe_> ori: will do [19:40:47] <_joe_> ori: I already added both me and godog to reviewers :) [19:41:32] * YuviPanda just sent a long winded message to ops@, wonder if it was too much / too little [19:42:29] cscott: hey, sorry, was caught up elsewhere. 
[19:42:45] cool, yuvi [19:42:47] cscott: not sure what I can do :( I can tail logs for you if you want, but unsure what help it'll be. [19:42:55] shame about graphite IO [19:43:02] i wish opentsdb were as cool as graphite [19:43:04] (03CR) 10Ori.livneh: [C: 031] mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [19:43:11] ori: yup, will take a look later or tomorrow [19:43:25] thanks guys [19:43:26] jgage: yeah, it now is getting just metrics from toollabs/betalabs and can hardly keep up with that either. all of labs was pretty much dead, dropping metrics all the time [19:43:36] :( [19:43:43] jgage: heh, the main thing missing from opentsdb is aggregation so that fetching say 1d doesn't fetch million of points [19:43:46] jgage: but a physical machine *should* be ok [19:43:55] <_joe_> ggrrr I'm going to miss brazil-germany this evening [19:44:06] there's also influxdb that I wanted to try out, will probably do on labs itself at some point [19:44:24] but, it is too young and VC backed (which makes me distrust their marketing a bit (see also MongoDB)) [19:44:31] I know of some setups using ganglia for overviews and opentsdb to not miss anything (i.e. to zoom in) [19:44:52] I guess once we have a solid diamond+graphite setup, we can kill ganglia? [19:45:27] that's what i would hope for, eventually [19:45:42] yeah, me too [19:45:53] I guess it needs dashboards, since graphite isn't much at dashboards. 
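On the aggregation point above: graphite (carbon/whisper) rolls data up at write time according to storage-schemas.conf, and a query is served from the highest-resolution archive that spans the whole requested window, so fetching a day doesn't mean reading millions of raw points. A minimal sketch (retention values are examples, not the production config):

```
# storage-schemas.conf — per-metric retention ladders. Whisper keeps
# one archive per resolution; how points are merged when rolling up
# (average, max, sum, ...) is governed by storage-aggregation.conf.
[default]
pattern = .*
retentions = 10s:1d,1m:7d,15m:1y
```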
[19:46:08] godog: jgage http://tools.wmflabs.org/giraffe/index.html#dashboard=ToolLabs+Basics&timeFrame=1h is from the labs graphite, looks pretty good [19:46:19] plus you can keep the dashboard spec in git, unlike grafana [19:46:34] neat [19:46:46] tessera looks potentially cool also [19:46:59] jgage: yeah, but it has the 'build dashboard from UI, we will store it in db' thing going [19:47:12] yeah that is problematic [19:47:14] which means authentication everywhere, so can't be made fully public, which I *like* [19:47:24] !log running migrateAccount.php --safe for accounts only existing on one wiki (bug 39817) [19:47:30] Logged the message, Master [19:47:40] YuviPanda: yep looks cool [19:48:11] godog: there's a grafana patch for production somewhere, I'll probably add a giraffe patch instead at some point [19:48:18] (03CR) 10JanZerebecki: [C: 031] update SSL cipher list for OTRS to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/144734 (owner: 10Dzahn) [19:49:02] YuviPanda: yeah https://gerrit.wikimedia.org/r/#/c/133274/ from ori [19:49:15] godog: yeah, I made a few modifications to it. [19:50:15] (03CR) 10Dzahn: [C: 04-2] "already fixed, just needed a service restart" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144739 (owner: 10Dzahn) [19:50:21] (03Abandoned) 10Dzahn: make misc varnish support PFS like main cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/144739 (owner: 10Dzahn) [19:51:53] (03CR) 10Dzahn: [C: 032] bastion: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/144362 (owner: 10Matanya) [19:53:48] (03CR) 10Dzahn: "duplicate of I2b9e45368234 ?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/144427 (owner: 10Matanya) [19:57:33] (03CR) 10Dzahn: [C: 032] Check for puppet-lint in path rather than hardcoded location [operations/puppet] - 10https://gerrit.wikimedia.org/r/144626 (owner: 10BryanDavis) [20:00:00] (03CR) 10Dzahn: [C: 032] deprecated syntax in mysql/generic_my.cnf.erb [operations/puppet] - 10https://gerrit.wikimedia.org/r/143529 (owner: 10Dzahn) [20:03:27] (03CR) 10Dzahn: [C: 032] update SSL cipher list for gerrit to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/144731 (owner: 10Dzahn) [20:04:46] (03PS3) 10Ori.livneh: Add twemproxy::decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/144350 [20:05:16] !log restarted apache on ytterbium [20:05:17] (03PS2) 10Tim Landscheidt: Add line to collect Puppet failures [operations/puppet] - 10https://gerrit.wikimedia.org/r/144737 (https://bugzilla.wikimedia.org/67673) (owner: 10Scottlee) [20:05:22] Logged the message, Master [20:06:02] PROBLEM - nutcracker process on mw1041 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [20:06:24] that's me [20:06:52] PROBLEM - twemproxy port on mw1041 is CRITICAL: Connection refused [20:06:52] PROBLEM - twemproxy process on mw1041 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [20:07:02] RECOVERY - nutcracker process on mw1041 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [20:07:19] !log finished migrateAccount.php --safe, now starting migrateAccount.php --attachbroken [20:07:22] Logged the message, Master [20:07:52] RECOVERY - twemproxy port on mw1041 is OK: TCP OK - 0.000 second response time on port 11211 [20:07:52] RECOVERY - twemproxy process on mw1041 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [20:09:50] jzerebecki: gerrit, config change merged, service restarted, but not yet? (as opposed to misc varnish)? 
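For context on the PFS cipher-list changes being merged above, a hedged sketch of the nginx-style configuration that produced the ssllabs "A" on the misc cluster (the cipher string here is illustrative, not the deployed list, which lives in the linked gerrit changes):

```
# Illustrative nginx SSL settings enabling forward secrecy: ECDHE
# suites listed first, weak/anonymous suites excluded, and the
# server's preference order enforced over the client's.
ssl_ciphers 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:DHE-RSA-AES128-SHA:AES128-SHA:!aNULL:!eNULL:!MD5';
ssl_prefer_server_ciphers on;
```

As noted later in the log, the equivalent Apache-side change only takes full effect once the server supports ECDHE, which is why services behind older Apache versions don't flip immediately.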
[20:10:03] <_joe_> ori: oh god I'm in a meeting, I've seen the alarms first, and I almost fainted before I got to read "that's me" [20:10:41] _joe_: i was just testing and didn't think it would alert, but good thing i checked because i caught a bug [20:10:47] (03PS4) 10Ori.livneh: Add twemproxy::decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/144350 [20:10:49] (03PS1) 10Ori.livneh: nutcracker: fix username arg for service check [operations/puppet] - 10https://gerrit.wikimedia.org/r/144752 [20:10:55] _joe_: namely, the username thing ^ [20:11:47] that's why we got PROBLEM - nutcracker process on mw1041 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [20:11:52] even though nutcracker was running [20:11:59] (it wasn't running as nobody) [20:12:33] (03CR) 10Dzahn: [C: 031] nutcracker: fix username arg for service check [operations/puppet] - 10https://gerrit.wikimedia.org/r/144752 (owner: 10Ori.livneh) [20:12:34] mutante: older libssl than on the bugzilla host? [20:12:50] (03CR) 10Ori.livneh: [C: 032] nutcracker: fix username arg for service check [operations/puppet] - 10https://gerrit.wikimedia.org/r/144752 (owner: 10Ori.livneh) [20:13:29] thanks mutante [20:13:32] jzerebecki: no, 1.0.1-4ubuntu5.16 [20:13:41] ori: np [20:14:28] ori: it's running as "112" ? [20:14:44] /usr/sbin/nutcracker vs /usr/local/bin/nutcracker , fwiw [20:14:59] yeah, that's twemproxy package (which we're getting rid of) vs. proper nutcracker package [20:15:07] got it [20:15:39] and 112 is that user, ack [20:16:38] (03PS1) 10Ori.livneh: role::mediawiki: don't monitor twemproxy; about to be decom'd [operations/puppet] - 10https://gerrit.wikimedia.org/r/144756 [20:17:56] mutante: mind doing that one too? that way joe can actually decom it once he's free.
otherwise we might get alerts until a puppet run completes on neon [20:19:02] (03CR) 10Dzahn: [C: 031] "for decom" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144756 (owner: 10Ori.livneh) [20:19:16] (03CR) 10Ori.livneh: [C: 032] role::mediawiki: don't monitor twemproxy; about to be decom'd [operations/puppet] - 10https://gerrit.wikimedia.org/r/144756 (owner: 10Ori.livneh) [20:19:37] danke [20:19:40] bitte [20:20:10] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 850 [20:20:16] (03PS5) 10Ori.livneh: Add twemproxy::decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/144350 [20:20:44] (03PS6) 10Ori.livneh: Add and apply twemproxy::decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/144350 [20:21:55] ACKNOWLEDGEMENT - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 850 Jeff_Green this is me dumping dbs [20:22:38] !log finished running migrateAccount.php --attachbroken --attachmissing (bug 61876) [20:22:42] Logged the message, Master [20:23:37] (03PS1) 10Ori.livneh: twemproxy: remove leftovers post-decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/144757 [20:23:52] (03CR) 10Aaron Schulz: [C: 031] Add jobrunner class [operations/puppet] - 10https://gerrit.wikimedia.org/r/144612 (owner: 10Ori.livneh) [20:25:10] RECOVERY - check_mysql on lutetium is OK: Uptime: 1124313 Threads: 2 Questions: 8790245 Slow queries: 2154 Opens: 9926 Flush tables: 2 Open tables: 64 Queries per second avg: 7.818 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [20:25:30] (03CR) 10Dzahn: "this did not enabled PFS yet because the Apache version does not support ECDHE yet, but it's still an enhancement compared to before" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144731 (owner: 10Dzahn) [20:26:08] (03PS1) 10BBlack: Remove install-time use of pmtpa DNS servers [operations/puppet] - 
10https://gerrit.wikimedia.org/r/144758 [20:28:31] (03CR) 10BBlack: [C: 032] Remove install-time use of pmtpa DNS servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/144758 (owner: 10BBlack) [20:29:15] bblack: what should recursor0.wikimedia resolve to ? (because that is that pmtpa IP you removed there, still) [20:30:05] ori: do we already have Apache 2.4 somewhere? [20:30:50] mutante: I'll update those as well. in theory dobson/mchenry still work fine, I'm just getting rid of them to eliminate variables in this installer problem, and prep for eventual pmtpa decom [20:31:10] basically we only have two resolver IPs we should be using: the LVS one in eqiad, and the nescio one in esams, right now. [20:31:52] (03CR) 10JanZerebecki: [C: 031] update SSL cipher list on wikitech to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/144736 (owner: 10Dzahn) [20:31:55] bblack: alright, the comment said something about "bayle". gotcha [20:32:55] (03CR) 10Dzahn: "..though unless we have Apache 2.4 it will not actually enable PFS (yet, still an improvement)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144736 (owner: 10Dzahn) [20:32:58] (03CR) 10Yuvipanda: "Looks good! Can you rename the metric to be failed_events rather than events_failure?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144737 (https://bugzilla.wikimedia.org/67673) (owner: 10Scottlee) [20:35:07] gahhahahahaha. I just trashed the drupal database. [20:35:48] (03PS1) 10Milimetric: Add CORS support to public files [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/144761 [20:36:47] [on the dev instance but it's still a pita] [20:37:44] mutante: any idea what the hostnames recursorN are used for anyways? 
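Circling back to the nutcracker false alarm above: the check counted processes owned by nobody (UID 65534), while the packaged nutcracker runs under its own user, so the count was zero even with the daemon up. A hedged sketch of the kind of corrected NRPE definition the "fix username arg" patch implies (path, command name, and thresholds are assumptions, not the actual puppet-generated check):

```
# nrpe.cfg fragment — count nutcracker processes owned by the real
# service user instead of "nobody". check_procs: -c 1:1 goes
# CRITICAL when the count leaves the 1..1 range, -C matches the
# command name, -u matches the process owner.
command[check_nutcracker]=/usr/lib/nagios/plugins/check_procs -c 1:1 -C nutcracker -u nutcracker
```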
[20:39:22] bblack: no, not really, i don't see it appear in other places in puppet [20:39:25] yeah [20:39:49] I'm going to leave the names in just in case and just remap it so they resolve to the two current resolvers [20:39:57] * mutante nods [20:41:15] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Tue 08 Jul 2014 18:40:28 UTC [20:41:24] (03PS1) 10BBlack: switch recursor[01] aliases off of pmtpa recursors [operations/dns] - 10https://gerrit.wikimedia.org/r/144763 [20:41:58] mutante: seem reasonable? ^ [20:45:06] (03CR) 10Dzahn: "seems reasonable. just.. should it not also add the removed PTR in the new place?" [operations/dns] - 10https://gerrit.wikimedia.org/r/144763 (owner: 10BBlack) [20:46:52] (03CR) 10BBlack: "The new addresses already have other names with reverse PTRs for them (dns-rec-lb.eqiad, and recursor0.esams). recursor[01] forward names" [operations/dns] - 10https://gerrit.wikimedia.org/r/144763 (owner: 10BBlack) [20:47:14] (03CR) 10Dzahn: [C: 031] switch recursor[01] aliases off of pmtpa recursors [operations/dns] - 10https://gerrit.wikimedia.org/r/144763 (owner: 10BBlack) [20:47:28] I wonder if maybe some labs stuff uses recursor[01] and it's not obvious in puppet? 
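The remap bblack describes (keep the legacy names, point them at the current resolvers) can be pictured as a zone fragment like the following. This is a sketch only; the real change is in the linked gerrit patch, the targets are the names mentioned in the review (dns-rec-lb.eqiad and recursor0.esams), and it is shown as CNAMEs for brevity where the actual patch may use A records:

```
; Illustrative zone fragment: legacy recursor aliases re-pointed
; away from the pmtpa hosts to the current eqiad/esams resolvers.
recursor0    1H  IN CNAME  dns-rec-lb.eqiad.wikimedia.org.
recursor1    1H  IN CNAME  recursor0.esams.wikimedia.org.
```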
[20:47:39] or they could just be names someone stuck in DNS to be helpful and nobody ever used [20:48:01] (03CR) 10BBlack: [C: 032] switch recursor[01] aliases off of pmtpa recursors [operations/dns] - 10https://gerrit.wikimedia.org/r/144763 (owner: 10BBlack) [20:48:03] bblack: my guess is it's the latter [20:48:07] from before labs even [20:48:17] of course that does not guarantee it's not also used in labs [20:48:26] tried to grep mw-config repo though (who knows) [20:50:35] PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: packet_loss_average CRITICAL: 15.3968160504 [20:51:55] PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 12.8006601681 [20:54:33] (although really, there shouldn't be a functional use for a DNS hostname for a DNS cache. After all, if you can look up the cache's hostname you already have a working cache :P) [20:54:44] (03PS8) 10Ori.livneh: role::mediawiki::webserver: set maxclients dynamically [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 [20:54:53] it's probably just there to make debugging easy, e.g. dig @recursor0 foo [20:55:42] (03CR) 10Ori.livneh: [C: 032 V: 032] role::mediawiki::webserver: set maxclients dynamically [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 (owner: 10Ori.livneh) [20:56:58] bblack: for debugging sounds like it, i guess Mark would know [20:59:43] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Complete puppet failure [21:00:04] spagewmf: The time is nigh to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140708T2100) [21:00:07] (03PS3) 10Scottlee: Add line to collect Puppet failures. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/144737 [21:00:33] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Tue Jul 8 21:00:29 UTC 2014 [21:00:36] (03PS1) 10Ori.livneh: Fixup for Ia424c6433 [operations/puppet] - 10https://gerrit.wikimedia.org/r/144788 [21:01:07] (03CR) 10Ori.livneh: [C: 032 V: 032] Fixup for Ia424c6433 [operations/puppet] - 10https://gerrit.wikimedia.org/r/144788 (owner: 10Ori.livneh) [21:03:03] (03PS1) 10Ori.livneh: Fix-up for Ia424c6433: resolve duplicate def'n [operations/puppet] - 10https://gerrit.wikimedia.org/r/144814 [21:03:15] grr, mutante, could you look at that one quickly? fixes a puppet fail on the appserver [21:03:20] https://gerrit.wikimedia.org/r/#/c/144814/ [21:03:43] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:03:46] (03CR) 10JanZerebecki: "Did you want to change the port?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140948 (owner: 10Dzahn) [21:04:11] mutante: note the class is included on L31 below [21:05:21] i'm going to merge because i'm freaked out about the possibility icinga-wm spam from all app servers [21:05:41] (03CR) 10Ori.livneh: [C: 032 V: 032] "i'm going to merge because i'm freaked out about the possibility icinga-wm spam from all app servers" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144814 (owner: 10Ori.livneh) [21:05:53] RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 2.44271184874 [21:09:16] (03PS2) 10Ori.livneh: Remove beta::hhvm [operations/puppet] - 10https://gerrit.wikimedia.org/r/144624 [21:09:38] (03CR) 10Ori.livneh: [C: 032 V: 032] Remove beta::hhvm [operations/puppet] - 10https://gerrit.wikimedia.org/r/144624 (owner: 10Ori.livneh) [21:10:33] RECOVERY - Packetloss_Average on analytics1003 is OK: packet_loss_average OKAY: 1.16412831933 [21:10:42] (03PS2) 10Tim Landscheidt: Tools: Remove unused syslog role [operations/puppet] - 
10https://gerrit.wikimedia.org/r/120347 [21:11:09] (03PS1) 10Ori.livneh: typo fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/144826 [21:12:43] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Complete puppet failure [21:12:49] (03CR) 10coren: [C: 031] "Better than without the fix, for sure." [operations/puppet] - 10https://gerrit.wikimedia.org/r/144826 (owner: 10Ori.livneh) [21:13:00] (03CR) 10Ori.livneh: [C: 032 V: 032] typo fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/144826 (owner: 10Ori.livneh) [21:15:43] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [21:15:53] (03PS1) 10Ori.livneh: (another) typo fix for Ia424c6433 [operations/puppet] - 10https://gerrit.wikimedia.org/r/144828 [21:16:48] (03CR) 10coren: [C: 031] "Do it right?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144828 (owner: 10Ori.livneh) [21:17:01] (03CR) 10Ori.livneh: [C: 032 V: 032] (another) typo fix for Ia424c6433 [operations/puppet] - 10https://gerrit.wikimedia.org/r/144828 (owner: 10Ori.livneh) [21:17:12] not my brightest moment [21:17:37] thanks for rescuing me from my own stupidity Coren [21:35:57] PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 10.1998965254 [21:40:08] I don't fully comprehend why yet, but getting rid of installer DNS -> pmtpa seems to have solved my partman issues [21:40:17] ottomata: ^ [21:40:33] PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: packet_loss_average CRITICAL: 12.2718278814 [21:41:10] I think I've also figured out why most of my pxe installs have had an inexplicable 15 minute delay (this has been going on for a long time). it's a bug in the installer image, and we can update to fix it. 
[21:55:53] RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 3.85670813559 [21:58:50] (03PS2) 10Alexandros Kosiaris: Puppetmaster's logrotate made graceful [operations/puppet] - 10https://gerrit.wikimedia.org/r/144695 [22:00:33] RECOVERY - Packetloss_Average on analytics1003 is OK: packet_loss_average OKAY: 0.561909576271 [22:00:38] (03CR) 10Alexandros Kosiaris: Puppetmaster's logrotate made graceful (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144695 (owner: 10Alexandros Kosiaris) [22:02:03] PROBLEM - RAID on amssq47 is CRITICAL: Connection refused by host [22:02:13] PROBLEM - Varnish HTCP daemon on amssq47 is CRITICAL: Connection refused by host [22:02:13] PROBLEM - Varnish HTTP text-backend on amssq47 is CRITICAL: Connection refused [22:02:23] PROBLEM - Varnish HTTP text-frontend on amssq47 is CRITICAL: Connection refused [22:02:24] PROBLEM - Varnish traffic logger on amssq47 is CRITICAL: Connection refused by host [22:02:33] PROBLEM - Varnishkafka log producer on amssq47 is CRITICAL: Connection refused by host [22:02:33] PROBLEM - check configured eth on amssq47 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:02:43] PROBLEM - check if dhclient is running on amssq47 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:02:43] PROBLEM - DPKG on amssq47 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:02:43] PROBLEM - puppet disabled on amssq47 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:02:50] (03CR) 10Alexandros Kosiaris: [C: 032] Puppetmaster's logrotate made graceful [operations/puppet] - 10https://gerrit.wikimedia.org/r/144695 (owner: 10Alexandros Kosiaris) [22:02:53] PROBLEM - puppet last run on amssq47 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:02:53] PROBLEM - Disk space on amssq47 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[22:04:33] RECOVERY - check configured eth on amssq47 is OK: NRPE: Unable to read output [22:04:43] RECOVERY - check if dhclient is running on amssq47 is OK: PROCS OK: 0 processes with command name dhclient [22:04:43] RECOVERY - puppet disabled on amssq47 is OK: OK [22:04:43] RECOVERY - DPKG on amssq47 is OK: All packages OK [22:04:53] RECOVERY - Disk space on amssq47 is OK: DISK OK [22:05:03] RECOVERY - RAID on amssq47 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [22:05:13] RECOVERY - Varnish HTCP daemon on amssq47 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [22:06:13] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures [22:06:34] ^ is there some way we can automatically suppress notifications on brand-new hosts for their first hour or so? [22:07:23] RECOVERY - Varnish traffic logger on amssq47 is OK: PROCS OK: 2 processes with command name varnishncsa [22:08:10] Jul 8 22:02:44 cp4006 puppet-agent[24934]: (/Stage[main]/Certificates::Rapidssl_ca_2/File[/etc/ssl/certs/RapidSSL_CA_2.pem]) Could not evaluate: Error 502 on SERVER [22:09:20] apacheconf issues for the source server due to whatever refactoring? 
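One way to implement the suppression bblack asks about above is to push a scheduled downtime through Icinga's external command interface whenever a host is (re)installed. A sketch, where the one-hour window, author string, and command-pipe path are assumptions; the block only prints the command line, and in production you would redirect it into the pipe (e.g. /var/lib/icinga/rw/icinga.cmd):

```shell
# Build a standard Icinga 1.x external command that silences all
# service checks on a freshly installed host for one hour.
# Format: SCHEDULE_HOST_SVC_DOWNTIME;<host>;<start>;<end>;<fixed>;
#         <trigger_id>;<duration>;<author>;<comment>
now=$(date +%s)
end=$((now + 3600))
host="amssq47"   # illustrative host name
printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;%s;%s;%s;1;0;3600;naggen;new install\n' \
    "$now" "$host" "$now" "$end"
```

Triggering this from whatever regenerates the Icinga config for new hosts (as suggested for neon later in the discussion) would make the suppression automatic.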
[22:10:32] (03PS1) 10BBlack: add mpt-status deps [operations/puppet] - 10https://gerrit.wikimedia.org/r/144837 [22:12:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [22:12:13] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [22:13:53] PROBLEM - NTP on amssq47 is CRITICAL: NTP CRITICAL: Offset unknown [22:14:25] (03CR) 10BBlack: [C: 032] add mpt-status deps [operations/puppet] - 10https://gerrit.wikimedia.org/r/144837 (owner: 10BBlack) [22:15:03] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [22:15:12] (03PS4) 10Andrew Bogott: Add archive-project-volumes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144063 [22:16:33] RECOVERY - Varnishkafka log producer on amssq47 is OK: PROCS OK: 1 process with command name varnishkafka [22:17:01] (03CR) 10Andrew Bogott: [C: 032] Add archive-project-volumes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144063 (owner: 10Andrew Bogott) [22:18:53] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 08 Jul 2014 20:17:43 UTC [22:19:56] RECOVERY - NTP on amssq47 is OK: NTP OK: Offset -0.006737351418 secs [22:21:16] RECOVERY - Varnish HTTP text-frontend on amssq47 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.198 second response time [22:21:52] guys just to recount some of analytics' thinking on the packet loss [22:22:00] Oxygen and Analytics 1003 are experiencing UDP packet loss [22:22:07] we think it's due to the world cup being insane atm. 
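A quick back-of-the-envelope on the loss these hosts are seeing: at the roughly 600 Mb/s plateau visible in the ganglia graphs, the packet rate a single UDP consumer must absorb is substantial, which is typically where drops begin. The 500-byte average log line below is an assumed figure, not a measurement:

```shell
# Estimate packets/sec at a ~600 Mb/s plateau, assuming ~500-byte
# UDP payloads plus UDP/IPv4/Ethernet header overhead per packet.
payload=500
frame=$((payload + 8 + 20 + 14))   # + UDP + IPv4 + Ethernet headers
bits=600000000                     # ~600 Mb/s observed in ganglia
pps=$((bits / (frame * 8)))
echo "approx ${pps} packets/sec"
```

Under these assumptions that works out to well over a hundred thousand packets per second into one socket, so per-packet kernel and socket-buffer limits can bite long before nominal gigabit line rate.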
[22:22:34] if you see related problems, we're tracking the issue here: https://bugzilla.wikimedia.org/show_bug.cgi?id=67694 [22:22:44] I've gotta head out, but I'll be back shortly [22:23:07] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=oxygen.wikimedia.org&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [22:23:21] ^ looks like an artificial limit to me [22:24:17] ditto on a1003: http://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&h=analytics1003.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [22:24:36] huh [22:25:18] those ceilings they're hitting are in the ballpark of 600Mbps [22:25:45] which isn't that far of "what you can reasonably expect to push through a gigabit interface" [22:25:49] s/of/off/ [22:25:56] right [22:31:31] (03PS8) 10Rush: Phabricator for iridium [operations/puppet] - 10https://gerrit.wikimedia.org/r/142059 [22:32:16] RECOVERY - Varnish HTTP text-backend on amssq47 is OK: HTTP OK: HTTP/1.1 200 OK - 190 bytes in 0.191 second response time [22:32:55] (03PS1) 10Dzahn: schedule icinga downtimes for new installs [operations/puppet] - 10https://gerrit.wikimedia.org/r/144839 [22:33:55] bblack: re "suppress notifications on brand-new hosts" ^ [22:34:06] just find a way to trigger that on neon on new install ? [22:34:13] nice!
[22:34:42] yeah ideally we could trigger this from whatever adds the host to icinga in the first place (naggen, etc) [22:34:46] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [22:35:14] (03CR) 10Rush: [C: 032] Phabricator for iridium [operations/puppet] - 10https://gerrit.wikimedia.org/r/142059 (owner: 10Rush) [22:38:39] (03PS1) 10Rush: phab.wm.org no https yet [operations/puppet] - 10https://gerrit.wikimedia.org/r/144841 [22:38:57] (03CR) 10Rush: [C: 032 V: 032] phab.wm.org no https yet [operations/puppet] - 10https://gerrit.wikimedia.org/r/144841 (owner: 10Rush) [22:41:06] yurikR2: we may want to do that thurs instead of weds. I assume we'll have some varnish things to sort out as well? but in any case, weds we're also taking a big ulsfo outage, not ideal timing. [22:43:07] bblack, which part? [22:43:23] you were talking about rolling out the defrag stuff tomorrow? [22:43:26] the code will be deployed as usual, but we won't be switiching over anyone juts yet [22:43:31] oh ok [22:43:33] until varnih changes [22:43:40] sure then [22:43:53] bblack, basically with the new change it is up to varnih to either set or not set X-CS=ON [22:43:59] just saying, tomorrow we'll be shuffling traffic around, it's a dangerous day in general [22:44:02] if you don't set it, we get original request [22:44:18] and don't insert javascript blob [22:44:41] in any case, thx for the heads up [22:45:44] (03PS1) 10BBlack: turn on amssq47 text backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/144843 [22:48:53] (03CR) 10BBlack: [C: 032] turn on amssq47 text backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/144843 (owner: 10BBlack) [22:50:49] !log radon (phab)- package and kernel upgrades, rebooting [22:50:53] Logged the message, Master [22:54:10] (03PS1) 10QChris: Use hive serde jar from site's hive setup [operations/puppet] - 10https://gerrit.wikimedia.org/r/144845 [22:54:53] is there a known 
procedure for removing hosts from ganglia? [22:56:42] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Jul 8 22:56:40 UTC 2014 [22:59:27] bblack: try this: puppetstoredconfigclean.rb hostname.wikimedia.org on palladium, then puppet run on ganglia host [22:59:52] yeah did that several hours ago. it cleaned up icinga, but not ganglia [23:00:04] RoanKattouw, mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140708T2300) [23:00:06] that does it for icinga, and [[Server Lifecycle]] makes it sound like it's the same for ganglia [23:00:15] uhm.. ok... then i don't know :( [23:00:17] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Text%2520caches%2520esams&tab=m&vn=&hide-hf=false [23:00:33] yeah I donno either, it's not on wikitech [23:00:41] this sentence: [23:00:48] " [23:00:48] Your host should now appear in puppet stored configs and therefore in ganglia and icinga. [23:00:55] that makes it totally sound like it was the same [23:00:56] * MaxSem and ori are busy in a meeting [23:01:01] but maybe it's ganglia_new now ? [23:01:13] (03PS6) 10Rush: search - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137996 [23:02:10] bblack: is the host already shut down? [23:02:20] it can re-add itself if still running [23:02:41] (03PS7) 10Rush: search - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137996 [23:02:54] mutante: yeah the host was renamed and is online in the same cluster under a new hostname [23:03:04] there's no storedconfig left for the old name [23:03:20] ok! 
/me is doing swta [23:03:22] *swat [23:05:13] Reedy, marktraceur; did you already deploy https://gerrit.wikimedia.org/r/#/c/143750/ -- it's merged on tin, so I assume so [23:05:44] (03CR) 10Rush: [C: 032] search - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137996 (owner: 10Rush) [23:06:08] Pretty sure yes, mwalker [23:06:18] cool [23:06:21] easiest swat ever [23:06:32] (03PS6) 10Rush: parsoid - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137997 [23:06:36] (03CR) 10Rush: [C: 032] parsoid - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137997 (owner: 10Rush) [23:06:39] * mwalker runs away giggling before anyone adds anything to the window [23:06:49] (03CR) 10Rush: [V: 032] parsoid - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137997 (owner: 10Rush) [23:07:11] (03PS1) 10Tim Landscheidt: ldap: Move ldapsupportlib.py to standard location [operations/puppet] - 10https://gerrit.wikimedia.org/r/144848 [23:13:18] (03CR) 10Rush: "removing myself I think this can be abandoned?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615 (owner: 10Yuvipanda) [23:14:42] (03PS7) 10Rush: dataset-replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138000 [23:17:05] (03PS8) 10Dzahn: dataset-replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138000 (owner: 10Rush) [23:21:43] (03CR) 10Rush: [C: 032] dataset-replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138000 (owner: 10Rush) [23:22:31] (03CR) 10Dzahn: [C: 032] bugzilla apache: Enable required modules for caching [operations/puppet] - 10https://gerrit.wikimedia.org/r/127254 (https://bugzilla.wikimedia.org/49720) (owner: 10JanZerebecki) [23:34:57] (03PS1) 10JanZerebecki: Add mtime argument to css link. 
[wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/144855 [23:39:56] (03CR) 10Dzahn: Add mtime argument to css link. (031 comment) [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/144855 (owner: 10JanZerebecki) [23:48:45] (03PS2) 10JanZerebecki: Add mtime argument to css link. [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/144855 [23:49:24] (03PS1) 10Jforrester: Enable TemplateData GUI for Russian Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144857 (https://bugzilla.wikimedia.org/67704) [23:49:30] (03CR) 10JanZerebecki: Add mtime argument to css link. (031 comment) [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/144855 (owner: 10JanZerebecki)