[00:00:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [00:02:19] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 6.223 second response time [00:07:56] !log restarting gitblit on antimony [00:08:04] Logged the message, Master [00:09:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 237850 bytes in 7.317 second response time [00:34:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [00:38:26] (03CR) 10TTO: [C: 04-1] "Flow is only enabled on a single page, and is causing no problems or interference with normal community activity on Meta, so I am inclined" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115412 (owner: 10Odder) [00:44:04] hello [00:57:40] (03PS2) 10Ori.livneh: Set enable_geoiplookup on cp1066 for geo_cookie [operations/puppet] - 10https://gerrit.wikimedia.org/r/115525 [00:59:07] bblack: I amended the patch to scope it to a single text varnish, picked at random. There's nothing relying on the GeoIP cookie being there at the moment and no consequence to setting it only on some responses, so it's an easy way to lower to stakes. Dunno why I hadn't thought of that earlier. [01:29:02] (03CR) 10TTO: "> Although I am extremely concerned about Terry's statement" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115412 (owner: 10Odder) [01:37:09] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:48:52] ori: https://gerrit.wikimedia.org/r/#/c/115553/ ;) [01:51:09] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200677) [01:52:46] !log aaron synchronized php-1.23wmf15/includes/filebackend '5a7a77cf3fd118bc70aa79993b85fc5e737d7526' [01:52:57] Logged the message, Master [01:53:10] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:12:36] !log LocalisationUpdate completed (1.23wmf14) at 2014-02-26 02:12:36+00:00 [02:12:44] Logged the message, Master [02:13:39] (03CR) 10Jeremyb: "(not actually reverted, revert abandoned)" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/113877 (owner: 10Jeremyb) [02:13:59] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC [02:15:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [02:41:01] !log LocalisationUpdate completed (1.23wmf15) at 2014-02-26 02:41:01+00:00 [02:41:09] Logged the message, Master [02:43:09] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (205530) [02:54:19] (03CR) 10Andrew Bogott: [C: 032] Add some labs management scripts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115360 (owner: 10Andrew Bogott) [03:00:51] (03PS1) 10Andrew Bogott: Add standard headers to the virtscripts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115563 [03:01:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [03:02:42] (03PS2) 10Andrew Bogott: Add standard headers to the virtscripts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115563 [03:04:26] (03CR) 10Andrew Bogott: [C: 032] Add standard headers to the virtscripts. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/115563 (owner: 10Andrew Bogott) [03:11:09] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Wed Feb 26 03:11:05 UTC 2014 [03:14:38] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-26 03:14:37+00:00 [03:14:46] Logged the message, Master [03:23:09] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [03:32:25] (03PS1) 10Andrew Bogott: Puppetize the new_install key on palladium, add to iron. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115566 [03:35:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [03:57:09] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:59:59] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.980 second response time [04:08:30] (03CR) 10Ori.livneh: [C: 032] "I have a bit of time now to watch this roll out, so I'll give it a shot." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115525 (owner: 10Ori.livneh) [04:14:13] (03PS1) 10Ori.livneh: Revert "Set enable_geoiplookup on cp1066 for geo_cookie" [operations/puppet] - 10https://gerrit.wikimedia.org/r/115567 [04:14:24] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Set enable_geoiplookup on cp1066 for geo_cookie" [operations/puppet] - 10https://gerrit.wikimedia.org/r/115567 (owner: 10Ori.livneh) [04:17:24] !log enabling geo_cookie on cp1066 caused general protection fault, so reverted and restarted. [04:17:33] Logged the message, Master [04:18:39] PROBLEM - Varnish traffic logger on cp1066 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [04:20:39] RECOVERY - Varnish traffic logger on cp1066 is OK: PROCS OK: 2 processes with command name varnishncsa [04:42:58] (03PS6) 10Andrew Bogott: Add a script that updates labs instances after migration. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115342 [05:35:40] springle: I'm curious what you think of https://bugzilla.wikimedia.org/show_bug.cgi?id=57176 [05:38:02] AaronSchulz: ah that one. reading... [05:38:29] I was also curious about those revision insert timeouts...saw those in the logs earlier [05:38:52] did you see https://bugzilla.wikimedia.org/show_bug.cgi?id=61898 ? [05:39:31] yeah just read that [05:47:06] (03CR) 10Greg Grossmeier: "I wanna close this out, so I'll revoke my bike shed and just make it match the other previous values." [operations/puppet] - 10https://gerrit.wikimedia.org/r/114503 (owner: 10Greg Grossmeier) [05:47:16] (03PS2) 10Greg Grossmeier: Modify login credential hint [operations/puppet] - 10https://gerrit.wikimedia.org/r/114503 [06:01:21] (03CR) 10MZMcBride: "I thought we decided to use Bugzilla for discussion, not Gerrit. I've replied on bug 61729." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115412 (owner: 10Odder) [06:02:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [06:08:29] AaronSchulz: replied on 57176. tricky one. do you know if anyone ever tried an index on domain name only without leading protocol? [06:08:39] reversed or not [06:09:17] not sure, I wasn't involved in the table design either [06:11:42] springle: I think B would probably work fine (at least for up to a pretty huge number of matching results) [06:16:36] i'm not clear on how it would blend with the LIMIT. 
for the example in comment #1, wouldn't it mean running 1023 queries each time and doing some sort of app-side sorting/paging? [06:21:07] springle: if the user wants X rows per page, you'd go through shard 0 first, and give results of X rows, and give an ?continue parameter like 0|. Once you reach the end of that or get less that X rows, you move on to shard 1. The server code is only looking at one (or sometimes a few) shards at a time. [06:21:50] it would work just like the other APIs that page on a tuple instead of a single column [06:22:30] so we really don't care about result order here [06:22:41] you wouldn't have to hit 1024 shards for the small case due to doing an EXPLAIN SELECT to estimate how many rows there are and doing it the current way if it's not super high [06:23:19] you just want some stable ordering even if it is totally meaningless (just like el_id and they OFFSET are anyway) [06:23:41] *and OFFSET [06:25:10] the actually queries hitting the DB (for the shard usage case) would be have WHERE index=X and an OFFSET with that [06:25:24] the filesort would be 1024 times smaller than it is now though [06:26:07] of course, with 30 million hits, it will take some time to traverse but at least the queries won't time out then [06:26:54] there is no filesort presently, but yes i get the point [06:26:58] actually that probably would not filesort [06:27:05] gah, you just said that ;) [06:27:19] * AaronSchulz was thinking of ORDER BY el_id for a second [06:27:44] i had to go back and check the bug ... .oO(did i mention filesort?) :) [06:28:33] in any case, breaking the queries up is A Good Thing for the database. if it works for the api users too, then great [06:29:09] "the actually queries" [06:29:13] * AaronSchulz must be getting tired [06:29:36] springle: anyway, if that works for you I can ping anomie about it, since he seems to do more API stuff [06:29:47] yeah get his input [06:35:39] also you might want to leave a comment there ;) [06:36:23] * AaronSchulz needs to look at WebVideoTranscode::updateJobQueue later [06:36:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [06:44:41] heh, looks like MessageGroupStats::forItemInternal tries to insert the same row from a bunch of servers at once [06:52:14] (03PS1) 10Matanya: remove sockpuppet, decom [operations/dns] - 10https://gerrit.wikimedia.org/r/115578 [07:12:17] (03CR) 10Matanya: [C: 031] Setting up kafkatee on analytics1003 to log mobile webrequest logs [operations/puppet] - 10https://gerrit.wikimedia.org/r/115411 (owner: 10Ottomata) [07:16:51] (03PS1) 10Matanya: removed pdf1, decom [operations/dns] - 10https://gerrit.wikimedia.org/r/115581 [07:26:13] (03PS1) 10Matanya: remove locke, decom [operations/dns] - 10https://gerrit.wikimedia.org/r/115583 [07:26:46] (03PS1) 10Andrew Bogott: Wipe out /etc/resolv.conf before migrating instances. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115584 [07:28:23] hi andrewbogott :) [07:28:23] (03PS2) 10Andrew Bogott: Wipe out /etc/resolv.conf before migrating instances. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115584 [07:28:33] * andrewbogott waves [07:28:50] matanya, I was going to ask you to review something but 'git review' is hanging on my dev box :( [07:29:16] i'm here, ping me when you need [07:29:49] Ah, ok, it didn't hang it's just VERY slow. So maybe in a few more minutes... [07:30:03] (03PS1) 10Andrew Bogott: Add a fact to pull the ec2id out of instance metadata. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/115585 [07:30:16] There it goes. matanya, check my ruby? ^ [07:31:00] (03CR) 10Matanya: Wipe out /etc/resolv.conf before migrating instances. (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115584 (owner: 10Andrew Bogott) [07:31:01] (03CR) 10Andrew Bogott: [C: 032] Wipe out /etc/resolv.conf before migrating instances. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115584 (owner: 10Andrew Bogott) [07:31:34] oops, I will take your advice in a subsequent patch :) [07:32:39] (03PS1) 10Andrew Bogott: rm -f resolv.conf. -rf was overkill. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115586 [07:33:03] (03CR) 10Matanya: [C: 031] Add a fact to pull the ec2id out of instance metadata. (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115585 (owner: 10Andrew Bogott) [07:33:52] thanks! [07:35:32] I will never get used to the implicit return thingy that ruby does [07:35:58] it actully makes sense [07:36:13] (03PS2) 10Andrew Bogott: Add a fact to pull the ec2id out of instance metadata. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115585 [07:36:15] how so? [07:36:28] I mean, in an assembly 'what was last in the register' sense it makes sense... [07:36:28] why declare something we know? [07:37:14] I guess the fact that the behavior depends on the position of the line bothers me? [07:37:38] think of python indentation :) [07:37:42] Also, what if my intent in a function is to return nothing in particular? Then an arbitrary value can leak out, and a caller might come to depend on that. [07:38:22] But, anyway, I won't argue that it's incorrect, only that it makes me uneasy :) [07:38:31] (03CR) 10Andrew Bogott: [C: 032] rm -f resolv.conf. -rf was overkill. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115586 (owner: 10Andrew Bogott) [07:39:43] logstash is a pain in the ... [07:43:27] (03PS3) 10Andrew Bogott: Add a fact to pull the ec2id out of instance metadata. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115585 [07:44:39] (03Abandoned) 10Andrew Bogott: Attempt to get ssh keys working, pre-puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/114408 (owner: 10Andrew Bogott) [07:46:51] (03CR) 10Andrew Bogott: [C: 032] Add a fact to pull the ec2id out of instance metadata. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115585 (owner: 10Andrew Bogott) [07:47:59] RECOVERY - RAID on labstore1 is OK: OK: optimal, 2 logical, 24 physical [08:03:55] /join #logstash [08:04:01] arrg [08:27:59] (03PS1) 10Andrew Bogott: Slight change to prod.sh [operations/puppet] - 10https://gerrit.wikimedia.org/r/115587 [08:29:01] (03CR) 10Andrew Bogott: [C: 032] Slight change to prod.sh [operations/puppet] - 10https://gerrit.wikimedia.org/r/115587 (owner: 10Andrew Bogott) [09:03:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [09:37:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [09:49:03] matanya: a puppet question: When a client changes from one master to another, what do I need to do to encourage it to accept the cert from the new master? [09:49:05] Any idea? [09:49:26] I know that if I erase /everything/ in /var/lib/puppet it helps, but I'd prefer to be more surgical :) [09:49:42] iirc, you need to tell him in his own config about the new master [09:49:43] hashar, same question, in case you know... 
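For readers following the "pull the ec2id out of instance metadata" fact merged above (gerrit 115585): a minimal sketch of the lookup such a fact presumably wraps. The endpoint below is the standard EC2-compatible metadata service nova exposes to instances; it is an assumption here, since the patch body itself is not quoted in this log.

    # hypothetical shell equivalent of the fact's lookup
    curl -s http://169.254.169.254/latest/meta-data/instance-id
    # prints an EC2-style id such as i-000004d2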
[09:49:52] Ah, yes, I'm updating puppet.conf already. [09:49:59] But there's a validation phase that's failing for me. [09:50:07] ok, that is a good start :) [09:50:19] I think because client is comparing the old cert to the new master which of course doesn't work... [09:50:21] andrewbogott: can't remember sorry :-( [09:50:29] I think I restart the puppet services [09:50:29] ok [09:50:32] on the client [09:50:34] oh, on the client? [09:50:40] and it eventually end up catching the new cert [09:50:56] apergos and I attempted to write a class that would revert from puppetmaster::self back to the normal labs puppet [09:50:59] never went far though [09:51:03] hm, nope, same [09:51:27] "err: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed. This is often because the time is out of sync on the server or client" <- sound familiar? [09:51:33] ohh [09:51:44] I mean, it's smart that the client is rejecting the new master, could me mitm [09:51:45] never seen that [09:51:48] *could be [09:52:04] is time in sync? aka ntp on both hosts [09:52:14] yes [09:52:21] might want to try with --debug, it might show up the cert being used [09:52:46] at least, to my eyeball they time looks the same [09:52:47] or you want to regenerate the cert for that host. I think puppet client gives you the command to run on the master [09:53:34] :-( [09:53:59] you might need a new cert on the client. [09:54:30] does the client have the correct cert at all? [09:54:52] the client cert is probably not know on the new puppet master [09:55:02] might need to sign it on puppet master [09:55:16] No, it's the other way around I think. [09:55:21] The client cert /is/ signed and ready. [09:55:27] It's the client, rejecting the master. [09:55:29] It checks both ways, right? [09:55:34] yes [09:55:34] no idea :-] [09:55:50] matanya: so, probably the client does not have the correct cert. I'm wondering how to get it the right one. [09:56:06] Here's a debug run, not very useful: https://dpaste.de/A7GA [09:56:24] the easiest way i think is to revoke the key on master a [09:56:32] and create a new one on master b [09:57:16] ok, so I'd run revoke on the client? [09:58:09] yes, but also on the old master [09:58:31] why would that matter? The instance isn't even talking to the old master anymore, doesn't know about it. [09:58:54] I should explain: I'm trying to duplicate an existing instance w/a new puppet master. So I don't want to break things for the original instance. [09:59:04] I just want to make a copy, and make the copy work with the new master. [09:59:04] just for security reasons, not anything else [09:59:08] Ah, sure. [09:59:17] andrewbogott: rm -rf /var/lib/puppet/ssl on instance and puppetd -t [09:59:24] and puppetca -s -a on your puppetmaster [09:59:28] and you are done [09:59:31] i love you akosiaris [09:59:33] akosiaris: that will revoke the client cert as well, which I specifically do not want to do. [09:59:48] ? [09:59:55] revoke it where ? [09:59:55] "andrewbogott: I know that if I erase /everything/ in /var/lib/puppet it helps, but I'd prefer to be more surgical :)" [10:00:16] well i was slightly more surgical. I added /ssl at the end :-) [10:00:16] akosiaris: well… currently my clieng has a cert that is signed on the master. [10:00:21] If I erase the cert on the client side... [10:00:38] then the master will /definitely/ have a cert that doesn't match mine and I will have to revoke that one, etc. etc. [10:00:45] which master ? 
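A minimal sketch of the reset akosiaris suggests above, with the commands exactly as quoted; note (as andrewbogott goes on to point out) that this also discards the client's own certificate, not just its cached copy of the CA.

    # -- on the labs instance --
    rm -rf /var/lib/puppet/ssl
    puppetd -t            # generates a new key and sends a CSR to the new master
    # -- on the new puppetmaster --
    puppetca -s -a        # sign all pending requests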
[10:00:51] old or new ? [10:01:07] old, who cares about it anyway, new won't have anything [10:01:24] until the first puppet run that is [10:01:29] and you running puppetca -s -a [10:02:10] but anyway puppett clients can only have one master and only one ca [10:02:22] OK, let me begin at the beginning. [10:02:44] and since a master is also a ca you basically are well off if you make the client forget about the old CA and be done with it [10:03:03] Yes, so, how do I do that? [10:03:08] Without making the master forget about the client? [10:03:21] again which master? the old one ? [10:03:24] or the new one ? [10:03:25] new one [10:03:30] Here is what happens: [10:03:32] a) I move the instance [10:03:53] that is pmtpa => eqiad labs move ? [10:03:53] b) New instance talks to old puppet master, which prompts an update to puppet.conf, pointing at the new master [10:03:58] yes [10:04:09] c) client talks to new master, gets a cert, and the new master signs the cert [10:04:11] All good, right? [10:04:19] yes [10:04:19] Except, then d) https://dpaste.de/A7GA [10:04:31] Now, why does c work? I don't know. [10:04:31] wait c is wrong [10:04:40] yeah, it shouldn't work, right? [10:04:45] client will not be able to talk to new master [10:04:53] cause it will be a different CA [10:04:57] I'd think. [10:05:03] And yet I see a signed cert on the master. [10:05:10] So maybe something else stupid is happening :( [10:05:18] ok . virt0 and virt1000 ? [10:05:23] may I login and see what is going on ? [10:05:24] andrewbogott: Have you looked at a more general setup like http://docs.puppetlabs.com/guides/scaling_multiple_masters.html ? [10:05:44] scfc_de: he wants to migrate, not scale [10:05:45] masters are virt0 and virt1000 [10:05:51] virt0 will be decomissioned [10:05:55] eventually that is [10:05:56] the instance in question is testmigrate7.eqiad.wmflabs [10:05:58] I will add you to the project [10:06:39] Apparently, the guy in http://stuckinadoloop.wordpress.com/2012/02/16/automated-migration-of-systems-to-a-new-puppet-master-server/ had the same problem. [10:06:51] akosiaris: migrate = scale, then downsize :-). [10:07:22] scfc_de: that approach might cost more that it is worth [10:14:34] i still think my approach will be easiest, though it doesn't fix the root cause [10:23:49] andrewbogott: If both puppet masters are accessible from the client, this is no blocker for you? [10:25:19] scfc_de: I need to move them at some point though [10:25:28] since the old puppet master will be shut down eventually [10:26:05] Yeah, but before you get stuck here, I'd rather pick problems easier to solve :-). [10:28:36] scfc_de: right now puppet by default changes the master to match whatever domain an instance is in. [10:28:45] That clearly doesn't work, but I'm trying to roll with it :) [10:30:19] (03PS7) 10Andrew Bogott: Add a script that updates labs instances after migration. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115342 [10:42:21] (03CR) 10Tim Landscheidt: "To me, this doesn't feel "puppety" :-). I would do something along the lines of (in role::labs::instance):" [operations/puppet] - 10https://gerrit.wikimedia.org/r/115342 (owner: 10Andrew Bogott) [10:46:44] scfc_de: ok, but the whole point is that puppet doesn't work [10:46:54] so… a puppet-based solution sounds nice but I don't think it helps [10:47:24] Still… there may be some way to use virt0 as a more effective bootstrap. I'll think about it a bit [10:51:49] andrewbogott: Why doesn't it work? I. 
e., if you move an instance 1:1 from pmtpa to eqiad and then run Puppet? [11:10:59] scfc_de: one question is… how do I remove the virt0 cert without just always removing every cert? [11:21:15] (03PS1) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [11:21:27] akosiaris: what do you think? ^ [11:26:33] well, not in love definitely. If anything changes in the erb, all clients will have to refetch the CA and CRL [11:27:00] Want me to limit it to labs? [11:27:15] It would be better to detect that just the master is changing... [11:27:51] Could we manage /var/lib/puppet/ssl/certs/ca.pem as a file resource in Labs? To be able to change it in concert with $base::puppet::server? [11:28:08] well there is a security concern here [11:28:32] if we have ca.pem refresh on an erb change [11:28:48] that may very well open the door for anyone to take over the entire fleet [11:28:58] Isn't an erb change already a wide open door? [11:29:22] yes but you remove a layer of security there [11:29:26] true [11:30:02] I would be surpised if mark showed up and voted a huge -2 on it [11:30:09] I would not* [11:30:15] ! [11:30:19] where is it [11:30:24] ahahaha [11:30:41] what is the issue, also? [11:30:44] Well, I don't want to apply it universally. I need some way to detect that a migration is happening... [11:30:56] mark, clearing the puppet master cert after an instance moves to the new dc [11:31:06] why is that necessary? [11:31:09] different hostname? [11:31:15] yes, different puppet master [11:31:22] both that and virt0 => virt1000 move [11:31:30] ah, yes, both [11:31:50] perhaps we should use an alias now [11:31:56] that won't solve our current problem but will prevent the next [11:32:46] Hm [11:32:48] like puppet.wmflabs ? [11:32:53] yes [11:33:14] assuming we will want to support instance migrations in the future also [11:33:33] Actually that could solve our current problem as well... [11:33:43] because for /working/ instances I can clear the certs with salt. [11:33:48] It's only instances that are post-migration that are broken [11:33:49] as for detecting a migration is happening [11:33:56] would it be feasible to create some file in the filesystem before migration [11:34:02] and remove it when puppet runs and fixes up the system? [11:34:14] mark: yes, that's easy. [11:34:19] So, probably a good solution. [11:34:24] But... [11:34:25] ok, wait: [11:34:26] in that case it would be reasonable to remove the cert [11:34:31] if we can make sure it can't get abused [11:34:36] the issue with the puppet master isn't the /name/ of the master, it's the cert. [11:34:40] yes [11:34:41] So using an alias doesn't help. [11:34:49] well you could use the same cert everywhere then [11:34:52] Unless we copy the cert when we change masters. [11:35:01] Ah, yes, that would solve the problem w/out using an alias :) [11:35:03] i don't see a huge reason why we couldn't do that [11:35:07] But, I defer to akosiaris on that point [11:35:13] right now that's not optimal as using the virt0 name is a bit weird ;) [11:35:23] the only problem will be the CRL i think [11:35:59] and of course we will need a procedure to say "No longer sign certs on virt0, from now on only on virt1000" [11:36:15] and then phase out virt0 [11:36:17] how does signing work in labs now anyway? [11:36:20] ? If they're the same then... 
[11:36:35] but I think that the issuer in the CA is virt0 in labs [11:36:36] the CA needs to remain in sync [11:36:51] as in the issuer field in the cert [11:36:57] Ah, then sharing doesn't solve anything, it just creates new different problems. [11:38:15] Does Puppet check the issuer or does it only test that the keys match? [11:39:18] it definitely checks the chain [11:40:18] Issuer: CN=Puppet CA: virt0.wikimedia.org [11:40:57] (03PS2) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [11:41:03] like that better? [11:41:15] akosiaris: I know, but does it test that the hostnames match? [11:41:25] yes it does [11:42:29] akosiaris, mark, is ^ ok? Or should I also limit it to labs? Or... [11:42:39] * andrewbogott doesn't know for sure that that will even solve the problem, but it might :/ [11:43:29] I think limiting it to labs for now wouldn't be a bad idea [11:43:46] also, you may want to have a generic "migration-in-progess" file instead which you could use for other things [11:43:48] scfc_de: well to be clear, the hostname of the master. It will obviously not chase down the entire chain hostnames(that would make no sense) [11:44:24] mark: a generic file wouldn't be rm'd at such a good time though [11:44:36] hence having a task-specific flag that gets removed in one go with the cert clearing [11:44:53] Also otherwise I'm not sure how to make sure the file gets cleared /after/ it's used for the test :) [11:44:58] akosiaris: ACK. [11:45:18] BTW, with Labs and network people here, 208.80.152.234 (amaranth) times out from pmtpa-Labs (incidentally, the server itself is status 503 from the InterNet, but different story). Is this some firewall on Labs or in the network? [11:45:19] I 'd say limited to labs for sure. As I already said, a single merge with a changed 10-main-conf.erb has the potential of sending the entire fleet to another puppetmaster [11:46:08] Are there disadvantages to managing /var/lib/puppet/ssl/certs/ca.pem as a file resource in Labs? [11:46:28] scfc_de: and doing what with it ? [11:47:16] Setting it to the cert of the puppet master selected by $base::puppet::server? [11:48:49] in general it is a security concern. That file is your anchor with the entire puppet infra [11:49:31] (03PS3) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [11:50:18] Yes, and AFAICS, at the moment, it's not managed at all, but depends on the initial Puppet connection, I think? [11:50:58] mark, akosiaris, ^ [11:51:05] yes, which is the way it was designed to be by the puppet people [11:51:45] it's not supposed to be managed [11:51:50] at least, not by Puppet :) [11:53:00] (03CR) 10Mark Bergsma: Clear master certs if we change puppet.conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [11:53:45] Then how are puppet masters supposed to be changed in Puppet if puppet.conf is managed, but ca.pem not? On change to puppet.conf, check that ca.pem's "Issuer" matches "server", otherwise delete ca.pem? 
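For the issuer question above, a quick way to inspect which CA an instance currently trusts (the path is the stock puppet ssldir; on a pmtpa-era instance the output matches the virt0 issuer quoted just above):

    openssl x509 -in /var/lib/puppet/ssl/certs/ca.pem -noout -issuer
    # issuer= /CN=Puppet CA: virt0.wikimedia.org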
[11:54:06] puppet masters are not supposed to be changed in Puppet I think [11:54:08] doesn't mean we can't [11:54:12] but it's a bit hairy [11:55:21] (03CR) 10Alexandros Kosiaris: Clear master certs if we change puppet.conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [11:56:17] (03PS4) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [11:56:39] Ah, much tidier that way. [11:56:46] Presuming that 'subscribe' works like that [11:58:53] As it's a bit hairy, I'm looking for a simple solution :-). [11:58:59] (03CR) 10Matanya: Clear master certs if we change puppet.conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [11:59:02] (03CR) 10Alexandros Kosiaris: Clear master certs if we change puppet.conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [11:59:24] arrg [11:59:30] conflict [12:01:15] (03PS5) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [12:04:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [12:05:51] matanya, akosiaris, better? [12:07:37] (03CR) 10Matanya: Clear master certs if we change puppet.conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [12:08:46] (03PS6) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [12:10:23] (03CR) 10Matanya: [C: 031] Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [12:12:18] (03CR) 10Tim Landscheidt: [C: 031] "Tested on Tools (removed ca.pem and crl.pem), and Puppet recovered on its own." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [12:12:59] (03PS1) 10Hashar: contint: python-dev on labs slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/115605 [12:13:08] (03CR) 10Andrew Bogott: [C: 032] Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [12:15:05] * andrewbogott starts a test & crosses fingers [12:25:44] mark, akosiaris, matanya, scfc_de: It goes! A clean migration -- first puppet run requests a cert and exits, second puppet run (after an appropriate delay) works. [12:25:57] nice! [12:26:21] Salt isn't working -- if anyone wants to look at that I'd welcome the help. testmigrate10.eqiad.wmflabs and salt master virt1000 [12:27:06] :) [12:27:45] oh, nm, salt is working now too, just took a minute. [12:27:54] Lemme see if it works a second time... [12:29:56] (03PS1) 10Andrew Bogott: Updated dc-migrate a bit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115606 [12:33:07] andrewbogott: good to know :-) [12:38:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [12:41:38] And a second test works as well. Thanks, all, for helping me sort this out. [12:45:59] (03PS2) 10Andrew Bogott: Updated dc-migrate a bit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115606 [12:46:38] mark: communications between virt hosts looks good now. Thanks! Now I just have to figure out how to set up host keys so they can do unattended rsyncs... 
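Going by Tim Landscheidt's test comment on the merged change (115594), its manual equivalent is roughly the following: only the cached CA material is dropped, the client's own cert stays in place, and the next agent run re-fetches the CA/CRL from whichever master puppet.conf now names. The file list is taken from that comment, not from the patch itself.

    rm -f /var/lib/puppet/ssl/certs/ca.pem /var/lib/puppet/ssl/crl.pem
    puppet agent --test    # "puppetd -t" on older agents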
[12:49:29] (03PS3) 10Andrew Bogott: Updated dc-migrate a bit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115606 [12:51:00] (03CR) 10Andrew Bogott: [C: 032] Updated dc-migrate a bit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115606 (owner: 10Andrew Bogott) [13:02:17] (03PS1) 10Tim Landscheidt: Tools: Set group for $sysdir according to $::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/115609 [13:29:57] (03PS1) 10Tim Landscheidt: Tools: Restore local symlinks for jobutils [operations/puppet] - 10https://gerrit.wikimedia.org/r/115612 [13:41:15] (03CR) 10PiRSquared17: [C: 04-1] Remove Flow from Meta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115412 (owner: 10Odder) [13:43:28] (03PS1) 10Andrew Bogott: Add a line specifying the nova api rate limits. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115614 [13:43:30] (03PS1) 10Andrew Bogott: Turn rate limits WAY up for nova api. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115615 [13:59:36] (03PS1) 10Hashar: beta: lower memcached memory usage [operations/puppet] - 10https://gerrit.wikimedia.org/r/115617 [14:02:31] (03CR) 10coren: [C: 04-1] "It's been many months; those really should not be needed anymore, and will not be present in eqiad." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115612 (owner: 10Tim Landscheidt) [14:03:37] (03PS1) 10Hashar: mediawiki: stop timidity after it got installed [operations/puppet] - 10https://gerrit.wikimedia.org/r/115618 [14:15:37] (03PS1) 10Alexandros Kosiaris: Create OSM labs db partitioning scheme and dhcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/115622 [14:20:03] (03CR) 10Ottomata: "Uh, does that mean keys() doesn't work if there isn't at least two entries in the hash?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/115524 (owner: 10BBlack) [14:20:38] (03PS1) 10Hashar: beta: memcached multiwrite to pmtpa and eqiad [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115623 [14:25:53] (03PS1) 10Hashar: deployment::target does not work in labs, skip it [operations/puppet] - 10https://gerrit.wikimedia.org/r/115624 [14:28:35] (03PS1) 10Ottomata: Putting PTR records back for analytics100{3,4}.eqiad.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/115625 [14:33:15] (03PS1) 10Ottomata: Pointing analytics100{3,4} macs back at their .eqiad.wmnet addresses [operations/puppet] - 10https://gerrit.wikimedia.org/r/115626 [14:34:33] (03CR) 10Alexandros Kosiaris: "If I understand the commit message correctly, you should also remove the analytics100[34].wikimedia.org forward and reverse records." [operations/dns] - 10https://gerrit.wikimedia.org/r/115625 (owner: 10Ottomata) [14:37:02] (03CR) 10Alexandros Kosiaris: [C: 032] Adding OSM role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/115409 (owner: 10Alexandros Kosiaris) [14:37:08] (03CR) 10Ottomata: "Yeah, I can do that. 
I left them because I noticed that the .eqiad.wmnet records were mostly left intact from when we changed before...bu" [operations/dns] - 10https://gerrit.wikimedia.org/r/115625 (owner: 10Ottomata) [14:37:40] (03CR) 10Alexandros Kosiaris: [C: 032] Create OSM labs db partitioning scheme and dhcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/115622 (owner: 10Alexandros Kosiaris) [14:37:59] (03PS2) 10Ottomata: Putting PTR records back for analytics100{3,4}.eqiad.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/115625 [14:38:04] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce labsdb100[45].eqiad.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/115410 (owner: 10Alexandros Kosiaris) [14:38:26] (03PS2) 10Hashar: mediawiki: stop timidity only once it got installed [operations/puppet] - 10https://gerrit.wikimedia.org/r/115618 [14:38:46] (03CR) 10Hashar: "Apparently timidity no more install any daemon..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115618 (owner: 10Hashar) [14:39:38] (03Abandoned) 10Tim Landscheidt: Tools: Restore local symlinks for jobutils [operations/puppet] - 10https://gerrit.wikimedia.org/r/115612 (owner: 10Tim Landscheidt) [14:40:09] (03CR) 10Alexandros Kosiaris: [C: 032] Putting PTR records back for analytics100{3,4}.eqiad.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/115625 (owner: 10Ottomata) [14:50:06] error: server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none while accessing https://git.wikimedia.org/git/mediawiki/tools/release.git/info/refs [14:51:15] Getting that trying to clone any repos from gerrit onto tin/bast1001 [14:51:26] (03PS1) 10Ottomata: README.md updates [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/115628 [14:51:28] (03PS1) 10Hashar: Configuration for beta cluster caches in eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/115629 [14:51:40] (03CR) 10Ottomata: [C: 032 V: 032] README.md updates [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/115628 (owner: 10Ottomata) [14:56:20] (03PS1) 10Andrew Bogott: Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 [14:57:44] (03PS2) 10Ottomata: Pointing analytics100{3,4} macs back at their .eqiad.wmnet addresses [operations/puppet] - 10https://gerrit.wikimedia.org/r/115626 [14:57:50] (03CR) 10Ottomata: [C: 032 V: 032] Pointing analytics100{3,4} macs back at their .eqiad.wmnet addresses [operations/puppet] - 10https://gerrit.wikimedia.org/r/115626 (owner: 10Ottomata) [14:59:02] (03PS2) 10Andrew Bogott: Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 [15:01:20] (03PS3) 10Andrew Bogott: Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 [15:01:58] (03CR) 10jenkins-bot: [V: 04-1] Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 (owner: 10Andrew Bogott) [15:02:56] (03PS4) 10Andrew Bogott: Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 [15:04:04] (03CR) 10Andrew Bogott: [C: 04-2] "OK, pretty sure we don't need this now. Keeping it in gerrit for a few days just to be sure." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/115342 (owner: 10Andrew Bogott) [15:05:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [15:15:52] (03CR) 10Andrew Bogott: [C: 032] Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 (owner: 10Andrew Bogott) [15:26:53] !log rebooting analytics1004 [15:27:00] Logged the message, Master [15:27:49] PROBLEM - Host analytics1004 is DOWN: PING CRITICAL - Packet loss = 100% [15:39:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [15:45:02] (03PS1) 10Jgreen: grant hashar,reedy access to caesium per RT 6861 [operations/puppet] - 10https://gerrit.wikimedia.org/r/115633 [15:46:04] (03CR) 10Reedy: [C: 04-1] grant hashar,reedy access to caesium per RT 6861 (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115633 (owner: 10Jgreen) [15:46:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [15:48:55] (03CR) 10BBlack: [C: 031] remove sockpuppet from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/115527 (owner: 10Dzahn) [15:51:25] (03PS2) 10Jgreen: grant hashar,reedy access to caesium per RT 6861 [operations/puppet] - 10https://gerrit.wikimedia.org/r/115633 [15:54:41] (03CR) 10Jgreen: [C: 032 V: 031] grant hashar,reedy access to caesium per RT 6861 [operations/puppet] - 10https://gerrit.wikimedia.org/r/115633 (owner: 10Jgreen) [16:02:01] ori: ping [16:23:12] (03CR) 10Alexandros Kosiaris: [C: 032] "And no reason to keep it around as anything. It is out of sync anyway with both salt and puppet for quite some time now." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115527 (owner: 10Dzahn) [16:39:23] !log disabling puppet on sodium [16:39:25] (03PS1) 10BBlack: Varnish should restart on initscript/defaults changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/115637 [16:39:31] Logged the message, Master [16:40:02] bblack: that's kinda scary isn't it [16:40:12] you change one file and it restarts the whole cluster [16:40:22] without depooling or anything [16:40:27] yeah, that's why I didn't +2 it :) [16:40:30] even frontends :) [16:40:47] but the current situation sucks pretty bad [16:41:35] in the case of the geoip_cookie patches (which have other issues, but ...), flipping on geoip for a type of cache edits the initscript, pushes the VCL, then VCL reload fails everywhere because it hasn't been restarted with the new CC_COMMAND [16:43:06] I guess in the larger view, this again highlights the mmap problem. If it weren't for that, restarts wouldn't be quite as scary. [16:43:09] (03PS1) 10Jgreen: swap in tantalum for rhodium for pdf/trusty testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/115638 [16:43:17] no they'd be, for frontends [16:43:28] you can alter cc_command without a restart [16:43:32] varnishadm param.* [16:43:43] oh really? [16:43:44] varnishadm param.show cc_command [16:43:55] param.set cc_command ... [16:43:56] etc. 
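A sketch of the no-restart route paravoid describes above; the parameter value shown is the varnishd default and the VCL name/path are illustrative, since the real CC_COMMAND and filenames aren't quoted in this log.

    varnishadm param.show cc_command
    varnishadm param.set cc_command 'exec cc -fpic -shared -Wl,-x -o %o %s'
    # a subsequent vcl.load / vcl.use then compiles with the new command
    varnishadm vcl.load withgeoip /etc/varnish/wikimedia_text-backend.vcl
    varnishadm vcl.use withgeoip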
[16:44:11] hmm maybe the puppet stuff can be re-worked to invoke that (in addition to editing the initscript) [16:45:26] re: frontend restarts, perhaps all in a 30-minute window would be scary, yeah [16:46:11] no, they are in general without a depool [16:46:22] you're basically throw errors to a bunch of clients :) [16:46:36] yeah I know [16:46:37] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [16:46:43] even more so for all of them in < 30 minutes (and we are generally trying to lower the 30' interval too) [16:46:56] but percentage-wise, one at time is bearable with delays [16:47:59] (03CR) 10BBlack: [C: 04-2] "Don't actually merge this!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/115637 (owner: 10BBlack) [16:48:01] (03CR) 10Jgreen: [C: 032 V: 031] swap in tantalum for rhodium for pdf/trusty testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/115638 (owner: 10Jgreen) [16:48:05] welll, ideally we should depool from pybal, wait for most sessions to expire, restart, then pool again [16:48:20] x83 [16:48:27] manually [16:48:29] most things you can do without a restart [16:48:42] via VCL or param.set [16:48:46] while waiting for settlement between, and watching for the impact of delayed backend restarts for failed SILO load [16:49:01] I'm talking about the still-ongoing 3.0.5 rollout [16:49:08] oh [16:49:20] I was still at cc_command :) [16:49:49] well, your comments about frontend restarts lead to "I've not been depooling while upgrading these 83 servers over a period of days to 3.0.5" [16:50:05] which is why we get little spikes on 5xx, but only when I'm not asleep :P [16:50:10] heh, okay, I said "ideally" :) [16:50:41] ideally we'd have a better mechanism for depooling servers automatically :) [16:51:16] ideally software would do its job without all this handholding. if only one soul on this whole planet could write perfect software :P [16:53:13] anyways, I'm on server# 60/83 now, the process should be finished today [16:53:33] cool [16:54:59] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:27] ideally, maybe we could build some puppet mechanism for queuing up delayed/randomized actions [16:59:57] e.g. 
on an event that requires varnish restart, it enqueues the restart command locally to happend at $random_time_over_48_hrs or whatever [17:12:24] paravoid, bblack: could write a salt runner that iterates over the list of minions, where it depools the minion, does the action, waits for success, then repools the minion [17:12:41] the goes to the next minion [17:12:43] *then [17:13:21] should be relatively easy to write a salt module that can pool/depool/enable/disable hosts in the pybal config [17:13:45] ideally pybal would have some support for that [17:13:55] that would be nice too :) [17:14:17] I had wanted to add an API to pybal for a while [17:14:22] yeah [17:16:09] (03PS1) 10coren: Labs: manage LVM volumes on instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/115641 [17:17:03] I, otoh, want to switch pybal to etcd :) [17:18:52] or zookeeper, but etcd seems much much simpler from the cursory look I've given it [17:25:05] (03PS1) 10Cmjohnson: Adding mgmt ip's for row d pdus [operations/dns] - 10https://gerrit.wikimedia.org/r/115645 [17:26:15] (03CR) 10Cmjohnson: [C: 032] Adding mgmt ip's for row d pdus [operations/dns] - 10https://gerrit.wikimedia.org/r/115645 (owner: 10Cmjohnson) [17:28:06] (03PS2) 10coren: Labs: manage LVM volumes on instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/115641 [17:30:17] (03CR) 10coren: [C: 032] "Noop for now." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115641 (owner: 10coren) [17:33:19] (03CR) 10Gage: [C: 032] Setting up kafkatee on analytics1003 to log mobile webrequest logs [operations/puppet] - 10https://gerrit.wikimedia.org/r/115411 (owner: 10Ottomata) [17:33:27] !log upgraded to librdkafka1 0.8.3 on cp3019, restarting varnishkafka [17:33:31] (03PS1) 10Andrew Bogott: Create new instance with duplicate puppet classes and vars. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115646 [17:33:34] Logged the message, Master [17:38:01] (03PS1) 10coren: Labs: fixes to the test role [operations/puppet] - 10https://gerrit.wikimedia.org/r/115647 [17:40:50] (03CR) 10coren: [C: 032] "Small fix." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115647 (owner: 10coren) [17:43:00] (03PS1) 10Alexandros Kosiaris: Add labs-support1-c-eqiad to autoinstall subnets [operations/puppet] - 10https://gerrit.wikimedia.org/r/115648 [17:50:54] (03PS1) 10coren: Labs: Fix directory name in labs_lvm module [operations/puppet] - 10https://gerrit.wikimedia.org/r/115650 [17:51:19] (03PS2) 10Alexandros Kosiaris: Add labs-support1-c-eqiad to autoinstall subnets [operations/puppet] - 10https://gerrit.wikimedia.org/r/115648 [17:52:00] (03CR) 10coren: [C: 032] "There's your problem." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/115650 (owner: 10coren) [17:56:30] (03CR) 10Alexandros Kosiaris: [C: 032] Add labs-support1-c-eqiad to autoinstall subnets [operations/puppet] - 10https://gerrit.wikimedia.org/r/115648 (owner: 10Alexandros Kosiaris) [17:59:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [18:05:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [18:06:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [18:18:18] (03PS1) 10coren: Fixes to the labs_lvm class [operations/puppet] - 10https://gerrit.wikimedia.org/r/115653 [18:18:58] (03PS2) 10coren: Fixes to the labs_lvm class [operations/puppet] - 10https://gerrit.wikimedia.org/r/115653 [18:25:39] (03CR) 10Dzahn: [C: 031] "+1 but we need to agree on removing mgmt as well or only after disk wiping has been confirmed" [operations/dns] - 10https://gerrit.wikimedia.org/r/115581 (owner: 10Matanya) [18:26:18] (03CR) 10Dzahn: [C: 031] "+1 but we need to agree on removing mgmt as well or only after disk wiping has been confirmed" [operations/dns] - 10https://gerrit.wikimedia.org/r/115578 (owner: 10Matanya) [18:27:03] mutante: personally dns should go last in my opinion [18:27:56] matanya: there's 2 kinds of DNS, regular and mgmt [18:28:05] yes [18:28:07] matanya: it's about if the 'mgmt' part should stay or not [18:28:27] because as RobH explained the other day [18:28:35] if we keep them then we can also do disk wiping from remote [18:28:42] and if we remove it we rely on DC tech [18:28:46] yea but the lifecycle supports what you are doin [18:28:48] not what i said [18:28:59] so im the only one who wants to keep [18:29:02] ok, but that was also a good point about the wiping [18:29:09] actually, hold on lemme check the tampa specific entyr [18:29:15] so..yea. that's why i made the comment on that change:) [18:29:25] so you guys can chime in [18:30:01] so yea, i think its nice to leave it until its unracked, in fact i wanna make that policy [18:30:03] its just not yet [18:30:06] cmjohnson1: what do you think? [18:30:17] leave mgmt dns as long as its in rack with power was my intent all along [18:30:23] but the lifecycle document didnt reflect it clearly [18:30:27] the good thing is being able to still do stuff from remote [18:30:28] didnt, still doesnt [18:30:31] the drawback is having 2 changes [18:30:37] if its in a rack with power, we should have mgmt [18:30:41] and having to do another clean up later [18:30:48] there isnt a good reason to me not to (having two dns changes isnt that big a deal imo) [18:30:49] robh: i agree [18:30:51] i think not, what do you? 
[18:31:04] so cmjohnson1 and i agree [18:31:09] im inclined to say its how it is ;] [18:31:37] "with power" means the rack itself has power, right [18:31:42] but the server is already shutdown -h [18:32:02] robh: can you put the idrac license for ms-be1005 in my home folder sometime today plz [18:32:19] mutante: yeah...the servers should be shutdown [18:32:34] mutante: correct [18:32:37] cmjohnson1: ok, yea, i do, last one was kaulen [18:32:37] i've updated lifecycle docs [18:32:49] RobH: ok, thanks, i'll leave them in [18:32:59] matanya: that means your changes need to be amended i guess [18:33:04] ok [18:33:58] thanks RobH, mutante cmjohnson1 i'll amend [18:34:34] well, the lifecycle doc is now fully updated to reflect [18:34:39] it wasnt clear at all before [18:34:43] so i understand the confusion [18:35:09] the reasoning before was that wiping wasn't done via mgmt anyways [18:35:16] but this makes us more flexible.. yea [18:35:36] so for sockpuppet you want: 2» 1H» IN PTR» sockpuppet.mgmt.pmtpa.wmnet. back ? [18:35:49] or any other entry as well? [18:36:06] well, did it have asset tag mgmt? [18:36:15] all the mgmt entries should ideally just stay in place [18:36:20] matanya: forward and reverse for the mgmt entry [18:36:21] but yea, at minimum that [18:36:32] most systems have dual mgmt entries [18:36:38] one based off asset tag (static) [18:36:44] and one based off hostname [18:36:45] matanya: so wmnet has 2 IPs, one in the mgmt network and one that isnt [18:37:00] the dual entries point to same mgmt ip [18:37:41] that being said, we just made this change [18:37:50] so if some items dont reflect it, meh. [18:37:57] from the 4 changes i did, which 2 are need to leave in? [18:38:06] matanya: you can see the network if you scroll up to the beginning of one of those blocks, see line 243 of 'wmnet" [18:38:17] yeah, i'm there [18:39:49] so, yea, keep the IP in that network, nuke the other one [18:40:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [18:47:28] !log es1006 swapping failed disk [18:47:37] Logged the message, Master [18:48:12] (03CR) 10Dzahn: [C: 031] Modify login credential hint [operations/puppet] - 10https://gerrit.wikimedia.org/r/114503 (owner: 10Greg Grossmeier) [18:50:39] dr0ptp4kt: hello [18:50:39] paravoid: hello [18:50:39] so, we think we'll need to go with the outbound cookie approach for the time being, deferring on auto-redirect until the future. [18:50:39] bblack: are you around by any chance? [18:50:39] okay; how come? [18:50:39] paravoid: yes [18:50:47] bblack: got a sec to chat about the zero/contributory features I was telling you about? [18:50:52] nervousness that it will break on phones. [18:51:01] dr0ptp4kt: fair enough [18:51:02] yeah [18:51:03] (03CR) 10BryanDavis: [C: 031] Modify login credential hint [operations/puppet] - 10https://gerrit.wikimedia.org/r/114503 (owner: 10Greg Grossmeier) [18:51:17] I assume we're just going to set some odd cookie in vcl_deliver, right? 
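To make the outcome of the decommissioning discussion above concrete: the production record goes away now, while the mgmt record stays until the box is unracked. Hostnames below are taken from the sockpuppet change under review and are illustrative.

    dig +short sockpuppet.pmtpa.wmnet        # production entry, removed at decom
    dig +short sockpuppet.mgmt.pmtpa.wmnet   # mgmt entry, kept while racked and powered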
[18:52:43] alright, so, quick question: (1) should we do it in varnish esi and have it take effect nearly instantaneously (that is, after the js is updated and it's reflected in resourceloader), or (2) should we do it from the origin, which would necessitate a cache flush between now and 30 days out for any carriers supporting https....we can't just put it in the origin, as it would have inconsistent effects across fresh versus stale pages [18:53:01] (that is, we can't put it in the origin without a cache flush afaik) [18:53:04] ^paravoid, bblack [18:53:12] I assume you mean vcl rather than esi? [18:53:14] "quick question" :) [18:53:19] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200233) [18:53:24] yeah, sorry, vcl, not esi. brain wires entangled [18:53:46] not sure if there's a way to cache flush on a vary header [18:54:03] ok maybe my understanding is flawed, let me state what I think is happening and you can tell me where I'm off [18:54:19] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [18:55:00] 1) You're adding some JS code that says "hey if CookieX is set, show the edit button" and 2) We're setting that cookie in vcl_deliver, iff it was a zero-rated access (X-CS would be set) and this carrier supports zero-over-https in the if/else logic [18:55:10] paravoid, bblack i just lost my connection, can you resend previous 2-3 messages? [18:55:29] bblack, correct [18:56:05] so you're saying you're worried about cache staleness on the updated javascript stuff? [18:56:14] bblack, exactly a non-HTTPOnly cookie so js can read it. as the contributory features generally require js, we use the js to override the display:hidden on all of the currently hidden elements [18:57:05] I think he's saying that you can do the Set-Cookie in MediaWiki too, as the page is Varied on X-CS anyway [18:57:07] bblack, regarding cache staleness, i'm only concerned in the case that we didn't do the cookie from varnish. that is, if we did it from the origin, i believe we would still have objects in the cache that don't have that header, so they would be inaccurate and the js wouldn't run. does that make sense? [18:57:13] do we need the cookie when we're already https? I mean, at that point the js could just unhide itself because the connection is secure already, right? [18:57:49] good question - so we want to show the contrib features on cleartex http as well so that the user will discover them. [18:58:03] for the moment, we can't just do a redirect. too much concern it will bust on a nontrivial number of phones. [18:58:15] plus there's the overhead of the redirect. [18:58:26] right, but what I mean is, can't we set the cookie from http, only for http, and not continue to send it once they've switched to https by clicking an edit button? [18:58:35] we'll want to examine that, though, in time, as it's (i hope) inevitable we'll start funneling most everything to https by default in time [18:58:41] bblack, oh [18:59:13] yeah, i think that's fine. the js can look at scheme://...if https, show....if http, examine cookie [18:59:19] do i have that right? [18:59:38] I guess, I mostly asked because of your "non-HTTPOnly cookie so js can read it" comment [19:00:33] okay, i think we're on the same page. just wanted to indicate that the cookie shouldn't have the flag 'HTTPOnly' because that would bar js from reading the cookie at all. 
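A quick way to see the Vary behaviour paravoid mentions above (URL illustrative; it is the mobile pages that carry the zero/X-CS logic):

    curl -sI http://en.m.wikipedia.org/wiki/Main_Page | grep -i '^vary'
    # if X-CS is listed, freshly cached objects can carry the per-carrier
    # Set-Cookie from the origin; objects cached before the change would not,
    # which is the staleness concern raised above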
[19:01:10] oh, ok, I thought that meant "this cookie can't have some flag that prevents its use over HTTPS" [19:01:13] that said, do we care if we're sending the cookie on outbound https requests? is that a problem? [19:01:21] bblack, yeah, it's an unfortunate flag name :( [19:01:35] they should have named it httpprotocolonly or something like that [19:01:35] I don't think it's an issue either way [19:01:50] ok, cool...one less check in the vcl if-elseif subsections [19:02:17] regarding your earlier question, I think that ultimately it'd be better if the origin does this as it encodes less logic into our VCL [19:02:32] and I think you already do the carrier lookup early on in the zero extension, right? [19:02:39] so computationally it wouldn't make a difference for you [19:03:00] that being said, we could add it optionally (with a guard to check if it's already set) in VCL for 30 days [19:03:03] so, assuming we keep origin and the vcl logic in sync (which is already an assumption), we could do it both places [19:03:09] and then remove the VCL in 30 days [19:03:09] so that you can deploy immediately [19:03:22] hehe :) [19:03:55] yeah, so, I'll work on a gerrit patch for that later today and we can review/edit from that baseline and sort out the details [19:03:56] sweet, ok, i will proceed in that fashion. can't guarantee i won't be back to ask more questions, but i have my marching orders now [19:04:02] do you have a name/format for the cookie yet? [19:04:17] bblack, you're going to update the vcl? or you want me to update it? [19:04:34] I'll do it. At least, I'll start the process [19:04:49] we can call the cookie anything. we should probably make it feature flag-ish. that is, let's call it ZeroOpts [19:05:14] PROBLEM - Puppet freshness on mw6 is CRITICAL: Last successful Puppet run was Wed 26 Feb 2014 07:00:31 PM UTC [19:05:25] yeah but then if we have multiple options, we'll have to parse them out to modify them in some future VCL update or something, right? [19:05:31] and let's have it be a colon separated thing. for now, it would only contain the value 'tls' [19:06:14] in general, what's the plan when another option hits this, which causes the same issues with a 30-day VCL change, etc? [19:06:51] you can also check "age" from VCL [19:07:01] let's plan to avoid migrating more logic to varnish for this purpose. we'll work to make it an origin-only thing except for this one time. [19:07:14] PROBLEM - Puppet freshness on mw6 is CRITICAL: Last successful Puppet run was Wed 26 Feb 2014 07:00:31 PM UTC [19:07:39] paravoid, you saying varnish will observe the age on the cookie? it's pleasing when caches actually conform to the spec! [19:07:44] I don't see how that can happen tbh. the first time you figure out how to move the stuff to origin-only, you'll have a varnish cache issue deploying that code :) [19:07:59] bblack, lol [19:08:13] bblack, i'm thinking future state with esi [19:08:34] although i guess at that point it probably wouldn't be an esi fragment for a header, but rather a
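A hypothetical smoke test once the vcl_deliver change lands, run from a client that varnish classifies as zero-rated (X-CS is assigned inside VCL from the carrier's IP ranges, so it can't simply be injected from outside); the cookie name and format are the ones being settled in the discussion above.

    curl -sI http://en.m.wikipedia.org/wiki/Main_Page | grep -i '^set-cookie'
    # expect something like: Set-Cookie: ZeroOpts=tls; ... with no HttpOnly flag,
    # so the mobile JS can read it and unhide the edit UI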