[00:00:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [00:02:19] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 6.223 second response time [00:07:56] !log restarting gitblit on antimony [00:08:04] Logged the message, Master [00:09:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 237850 bytes in 7.317 second response time [00:34:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [00:38:26] (03CR) 10TTO: [C: 04-1] "Flow is only enabled on a single page, and is causing no problems or interference with normal community activity on Meta, so I am inclined" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115412 (owner: 10Odder) [00:44:04] hello [00:57:40] (03PS2) 10Ori.livneh: Set enable_geoiplookup on cp1066 for geo_cookie [operations/puppet] - 10https://gerrit.wikimedia.org/r/115525 [00:59:07] bblack: I amended the patch to scope it to a single text varnish, picked at random. There's nothing relying on the GeoIP cookie being there at the moment and no consequence to setting it only on some responses, so it's an easy way to lower to stakes. Dunno why I hadn't thought of that earlier. [01:29:02] (03CR) 10TTO: "> Although I am extremely concerned about Terry's statement" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115412 (owner: 10Odder) [01:37:09] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:48:52] ori: https://gerrit.wikimedia.org/r/#/c/115553/ ;) [01:51:09] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200677) [01:52:46] !log aaron synchronized php-1.23wmf15/includes/filebackend '5a7a77cf3fd118bc70aa79993b85fc5e737d7526' [01:52:57] Logged the message, Master [01:53:10] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:12:36] !log LocalisationUpdate completed (1.23wmf14) at 2014-02-26 02:12:36+00:00 [02:12:44] Logged the message, Master [02:13:39] (03CR) 10Jeremyb: "(not actually reverted, revert abandoned)" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/113877 (owner: 10Jeremyb) [02:13:59] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC [02:15:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [02:41:01] !log LocalisationUpdate completed (1.23wmf15) at 2014-02-26 02:41:01+00:00 [02:41:09] Logged the message, Master [02:43:09] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (205530) [02:54:19] (03CR) 10Andrew Bogott: [C: 032] Add some labs management scripts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115360 (owner: 10Andrew Bogott) [03:00:51] (03PS1) 10Andrew Bogott: Add standard headers to the virtscripts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115563 [03:01:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [03:02:42] (03PS2) 10Andrew Bogott: Add standard headers to the virtscripts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115563 [03:04:26] (03CR) 10Andrew Bogott: [C: 032] Add standard headers to the virtscripts. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/115563 (owner: 10Andrew Bogott) [03:11:09] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Wed Feb 26 03:11:05 UTC 2014 [03:14:38] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-26 03:14:37+00:00 [03:14:46] Logged the message, Master [03:23:09] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [03:32:25] (03PS1) 10Andrew Bogott: Puppetize the new_install key on palladium, add to iron. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115566 [03:35:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [03:57:09] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:59:59] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.980 second response time [04:08:30] (03CR) 10Ori.livneh: [C: 032] "I have a bit of time now to watch this roll out, so I'll give it a shot." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115525 (owner: 10Ori.livneh) [04:14:13] (03PS1) 10Ori.livneh: Revert "Set enable_geoiplookup on cp1066 for geo_cookie" [operations/puppet] - 10https://gerrit.wikimedia.org/r/115567 [04:14:24] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Set enable_geoiplookup on cp1066 for geo_cookie" [operations/puppet] - 10https://gerrit.wikimedia.org/r/115567 (owner: 10Ori.livneh) [04:17:24] !log enabling geo_cookie on cp1066 caused general protection fault, so reverted and restarted. [04:17:33] Logged the message, Master [04:18:39] PROBLEM - Varnish traffic logger on cp1066 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [04:20:39] RECOVERY - Varnish traffic logger on cp1066 is OK: PROCS OK: 2 processes with command name varnishncsa [04:42:58] (03PS6) 10Andrew Bogott: Add a script that updates labs instances after migration. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115342 [05:35:40] springle: I'm curious what you think of https://bugzilla.wikimedia.org/show_bug.cgi?id=57176 [05:38:02] AaronSchulz: ah that one. reading... [05:38:29] I was also curious about those revision insert timeouts...saw those in the logs earlier [05:38:52] did you see https://bugzilla.wikimedia.org/show_bug.cgi?id=61898 ? [05:39:31] yeah just read that [05:47:06] (03CR) 10Greg Grossmeier: "I wanna close this out, so I'll revoke my bike shed and just make it match the other previous values." [operations/puppet] - 10https://gerrit.wikimedia.org/r/114503 (owner: 10Greg Grossmeier) [05:47:16] (03PS2) 10Greg Grossmeier: Modify login credential hint [operations/puppet] - 10https://gerrit.wikimedia.org/r/114503 [06:01:21] (03CR) 10MZMcBride: "I thought we decided to use Bugzilla for discussion, not Gerrit. I've replied on bug 61729." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115412 (owner: 10Odder) [06:02:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [06:08:29] AaronSchulz: replied on 57176. tricky one. do you know if anyone ever tried an index on domain name only without leading protocol? [06:08:39] reversed or not [06:09:17] not sure, I wasn't involved in the table design either [06:11:42] springle: I think B would probably work fine (at least for up to a pretty huge number of matching results) [06:16:36] i'm not clear on how it would blend with the LIMIT. 
for the example in comment #1, wouldn't it mean running 1023 queries each time and doing some sort of app-side sorting/paging? [06:21:07] springle: if the user wants X rows per page, you'd go through shard 0 first, and give results of X rows, and give an ?continue parameter like 0|. Once you reach the end of that or get less that X rows, you move on to shard 1. The server code is only looking at one (or sometimes a few) shards at a time. [06:21:50] it would work just like the other APIs that page on a tuple instead of a single column [06:22:30] so we really don't care about result order here [06:22:41] you wouldn't have to hit 1024 shards for the small case due to doing an EXPLAIN SELECT to estimate how many rows there are and doing it the current way if it's not super high [06:23:19] you just want some stable ordering even if it is totally meaningless (just like el_id and they OFFSET are anyway) [06:23:41] *and OFFSET [06:25:10] the actually queries hitting the DB (for the shard usage case) would be have WHERE index=X and an OFFSET with that [06:25:24] the filesort would be 1024 times smaller than it is now though [06:26:07] of course, with 30 million hits, it will take some time to traverse but at least the queries won't time out then [06:26:54] there is no filesort presently, but yes i get the point [06:26:58] actually that probably would not filesort [06:27:05] gah, you just said that ;) [06:27:19] * AaronSchulz was thinking of ORDER BY el_id for a second [06:27:44] i had to go back and check the bug ... .oO(did i mention filesort?) :) [06:28:33] in any case, breaking the queries up is A Good Thing for the database. if it works for the api users too, then great [06:29:09] "the actually queries" [06:29:13] * AaronSchulz must be getting tired [06:29:36] springle: anyway, if that works for you I can ping anomie about it, since he seems to do more API stuff [06:29:47] yeah get his input [06:35:39] also you might want to leave a comment there ;) [06:36:23] * AaronSchulz needs to look at WebVideoTranscode::updateJobQueue later [06:36:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [06:44:41] heh, looks like MessageGroupStats::forItemInternal tries to insert the same row from a bunch of servers at once [06:52:14] (03PS1) 10Matanya: remove sockpuppet, decom [operations/dns] - 10https://gerrit.wikimedia.org/r/115578 [07:12:17] (03CR) 10Matanya: [C: 031] Setting up kafkatee on analytics1003 to log mobile webrequest logs [operations/puppet] - 10https://gerrit.wikimedia.org/r/115411 (owner: 10Ottomata) [07:16:51] (03PS1) 10Matanya: removed pdf1, decom [operations/dns] - 10https://gerrit.wikimedia.org/r/115581 [07:26:13] (03PS1) 10Matanya: remove locke, decom [operations/dns] - 10https://gerrit.wikimedia.org/r/115583 [07:26:46] (03PS1) 10Andrew Bogott: Wipe out /etc/resolv.conf before migrating instances. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115584 [07:28:23] hi andrewbogott :) [07:28:23] (03PS2) 10Andrew Bogott: Wipe out /etc/resolv.conf before migrating instances. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115584 [07:28:33] * andrewbogott waves [07:28:50] matanya, I was going to ask you to review something but 'git review' is hanging on my dev box :( [07:29:16] i'm here, ping me when you need [07:29:49] Ah, ok, it didn't hang it's just VERY slow. So maybe in a few more minutes... [07:30:03] (03PS1) 10Andrew Bogott: Add a fact to pull the ec2id out of instance metadata. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/115585 [07:30:16] There it goes. matanya, check my ruby? ^ [07:31:00] (03CR) 10Matanya: Wipe out /etc/resolv.conf before migrating instances. (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115584 (owner: 10Andrew Bogott) [07:31:01] (03CR) 10Andrew Bogott: [C: 032] Wipe out /etc/resolv.conf before migrating instances. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115584 (owner: 10Andrew Bogott) [07:31:34] oops, I will take your advice in a subsequent patch :) [07:32:39] (03PS1) 10Andrew Bogott: rm -f resolv.conf. -rf was overkill. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115586 [07:33:03] (03CR) 10Matanya: [C: 031] Add a fact to pull the ec2id out of instance metadata. (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115585 (owner: 10Andrew Bogott) [07:33:52] thanks! [07:35:32] I will never get used to the implicit return thingy that ruby does [07:35:58] it actully makes sense [07:36:13] (03PS2) 10Andrew Bogott: Add a fact to pull the ec2id out of instance metadata. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115585 [07:36:15] how so? [07:36:28] I mean, in an assembly 'what was last in the register' sense it makes sense... [07:36:28] why declare something we know? [07:37:14] I guess the fact that the behavior depends on the position of the line bothers me? [07:37:38] think of python indentation :) [07:37:42] Also, what if my intent in a function is to return nothing in particular? Then an arbitrary value can leak out, and a caller might come to depend on that. [07:38:22] But, anyway, I won't argue that it's incorrect, only that it makes me uneasy :) [07:38:31] (03CR) 10Andrew Bogott: [C: 032] rm -f resolv.conf. -rf was overkill. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115586 (owner: 10Andrew Bogott) [07:39:43] logstash is a pain in the ... [07:43:27] (03PS3) 10Andrew Bogott: Add a fact to pull the ec2id out of instance metadata. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115585 [07:44:39] (03Abandoned) 10Andrew Bogott: Attempt to get ssh keys working, pre-puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/114408 (owner: 10Andrew Bogott) [07:46:51] (03CR) 10Andrew Bogott: [C: 032] Add a fact to pull the ec2id out of instance metadata. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115585 (owner: 10Andrew Bogott) [07:47:59] RECOVERY - RAID on labstore1 is OK: OK: optimal, 2 logical, 24 physical [08:03:55] /join #logstash [08:04:01] arrg [08:27:59] (03PS1) 10Andrew Bogott: Slight change to prod.sh [operations/puppet] - 10https://gerrit.wikimedia.org/r/115587 [08:29:01] (03CR) 10Andrew Bogott: [C: 032] Slight change to prod.sh [operations/puppet] - 10https://gerrit.wikimedia.org/r/115587 (owner: 10Andrew Bogott) [09:03:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [09:37:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [09:49:03] matanya: a puppet question: When a client changes from one master to another, what do I need to do to encourage it to accept the cert from the new master? [09:49:05] Any idea? [09:49:26] I know that if I erase /everything/ in /var/lib/puppet it helps, but I'd prefer to be more surgical :) [09:49:42] iirc, you need to tell him in his own config about the new master [09:49:43] hashar, same question, in case you know... 
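For readers following the "pull the ec2id out of instance metadata" fact merged above (gerrit 115585): a minimal sketch of the lookup such a fact presumably wraps. The endpoint below is the standard EC2-compatible metadata service nova exposes to instances; it is an assumption here, since the patch body itself is not quoted in this log.

    # hypothetical shell equivalent of the fact's lookup
    curl -s http://169.254.169.254/latest/meta-data/instance-id
    # prints an EC2-style id such as i-000004d2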
[09:49:52] Ah, yes, I'm updating puppet.conf already. [09:49:59] But there's a validation phase that's failing for me. [09:50:07] ok, that is a good start :) [09:50:19] I think because client is comparing the old cert to the new master which of course doesn't work... [09:50:21] andrewbogott: can't remember sorry :-( [09:50:29] I think I restart the puppet services [09:50:29] ok [09:50:32] on the client [09:50:34] oh, on the client? [09:50:40] and it eventually end up catching the new cert [09:50:56] apergos and I attempted to write a class that would revert from puppetmaster::self back to the normal labs puppet [09:50:59] never went far though [09:51:03] hm, nope, same [09:51:27] "err: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed. This is often because the time is out of sync on the server or client" <- sound familiar? [09:51:33] ohh [09:51:44] I mean, it's smart that the client is rejecting the new master, could me mitm [09:51:45] never seen that [09:51:48] *could be [09:52:04] is time in sync? aka ntp on both hosts [09:52:14] yes [09:52:21] might want to try with --debug, it might show up the cert being used [09:52:46] at least, to my eyeball they time looks the same [09:52:47] or you want to regenerate the cert for that host. I think puppet client gives you the command to run on the master [09:53:34] :-( [09:53:59] you might need a new cert on the client. [09:54:30] does the client have the correct cert at all? [09:54:52] the client cert is probably not know on the new puppet master [09:55:02] might need to sign it on puppet master [09:55:16] No, it's the other way around I think. [09:55:21] The client cert /is/ signed and ready. [09:55:27] It's the client, rejecting the master. [09:55:29] It checks both ways, right? [09:55:34] yes [09:55:34] no idea :-] [09:55:50] matanya: so, probably the client does not have the correct cert. I'm wondering how to get it the right one. [09:56:06] Here's a debug run, not very useful: https://dpaste.de/A7GA [09:56:24] the easiest way i think is to revoke the key on master a [09:56:32] and create a new one on master b [09:57:16] ok, so I'd run revoke on the client? [09:58:09] yes, but also on the old master [09:58:31] why would that matter? The instance isn't even talking to the old master anymore, doesn't know about it. [09:58:54] I should explain: I'm trying to duplicate an existing instance w/a new puppet master. So I don't want to break things for the original instance. [09:59:04] I just want to make a copy, and make the copy work with the new master. [09:59:04] just for security reasons, not anything else [09:59:08] Ah, sure. [09:59:17] andrewbogott: rm -rf /var/lib/puppet/ssl on instance and puppetd -t [09:59:24] and puppetca -s -a on your puppetmaster [09:59:28] and you are done [09:59:31] i love you akosiaris [09:59:33] akosiaris: that will revoke the client cert as well, which I specifically do not want to do. [09:59:48] ? [09:59:55] revoke it where ? [09:59:55] "andrewbogott: I know that if I erase /everything/ in /var/lib/puppet it helps, but I'd prefer to be more surgical :)" [10:00:16] well i was slightly more surgical. I added /ssl at the end :-) [10:00:16] akosiaris: well… currently my clieng has a cert that is signed on the master. [10:00:21] If I erase the cert on the client side... [10:00:38] then the master will /definitely/ have a cert that doesn't match mine and I will have to revoke that one, etc. etc. [10:00:45] which master ? 
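A minimal sketch of the reset akosiaris suggests above, with the commands exactly as quoted; note (as andrewbogott goes on to point out) that this also discards the client's own certificate, not just its cached copy of the CA.

    # -- on the labs instance --
    rm -rf /var/lib/puppet/ssl
    puppetd -t            # generates a new key and sends a CSR to the new master
    # -- on the new puppetmaster --
    puppetca -s -a        # sign all pending requests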
[10:00:51] old or new ? [10:01:07] old, who cares about it anyway, new won't have anything [10:01:24] until the first puppet run that is [10:01:29] and you running puppetca -s -a [10:02:10] but anyway puppett clients can only have one master and only one ca [10:02:22] OK, let me begin at the beginning. [10:02:44] and since a master is also a ca you basically are well off if you make the client forget about the old CA and be done with it [10:03:03] Yes, so, how do I do that? [10:03:08] Without making the master forget about the client? [10:03:21] again which master? the old one ? [10:03:24] or the new one ? [10:03:25] new one [10:03:30] Here is what happens: [10:03:32] a) I move the instance [10:03:53] that is pmtpa => eqiad labs move ? [10:03:53] b) New instance talks to old puppet master, which prompts an update to puppet.conf, pointing at the new master [10:03:58] yes [10:04:09] c) client talks to new master, gets a cert, and the new master signs the cert [10:04:11] All good, right? [10:04:19] yes [10:04:19] Except, then d) https://dpaste.de/A7GA [10:04:31] Now, why does c work? I don't know. [10:04:31] wait c is wrong [10:04:40] yeah, it shouldn't work, right? [10:04:45] client will not be able to talk to new master [10:04:53] cause it will be a different CA [10:04:57] I'd think. [10:05:03] And yet I see a signed cert on the master. [10:05:10] So maybe something else stupid is happening :( [10:05:18] ok . virt0 and virt1000 ? [10:05:23] may I login and see what is going on ? [10:05:24] andrewbogott: Have you looked at a more general setup like http://docs.puppetlabs.com/guides/scaling_multiple_masters.html ? [10:05:44] scfc_de: he wants to migrate, not scale [10:05:45] masters are virt0 and virt1000 [10:05:51] virt0 will be decomissioned [10:05:55] eventually that is [10:05:56] the instance in question is testmigrate7.eqiad.wmflabs [10:05:58] I will add you to the project [10:06:39] Apparently, the guy in http://stuckinadoloop.wordpress.com/2012/02/16/automated-migration-of-systems-to-a-new-puppet-master-server/ had the same problem. [10:06:51] akosiaris: migrate = scale, then downsize :-). [10:07:22] scfc_de: that approach might cost more that it is worth [10:14:34] i still think my approach will be easiest, though it doesn't fix the root cause [10:23:49] andrewbogott: If both puppet masters are accessible from the client, this is no blocker for you? [10:25:19] scfc_de: I need to move them at some point though [10:25:28] since the old puppet master will be shut down eventually [10:26:05] Yeah, but before you get stuck here, I'd rather pick problems easier to solve :-). [10:28:36] scfc_de: right now puppet by default changes the master to match whatever domain an instance is in. [10:28:45] That clearly doesn't work, but I'm trying to roll with it :) [10:30:19] (03PS7) 10Andrew Bogott: Add a script that updates labs instances after migration. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115342 [10:42:21] (03CR) 10Tim Landscheidt: "To me, this doesn't feel "puppety" :-). I would do something along the lines of (in role::labs::instance):" [operations/puppet] - 10https://gerrit.wikimedia.org/r/115342 (owner: 10Andrew Bogott) [10:46:44] scfc_de: ok, but the whole point is that puppet doesn't work [10:46:54] so… a puppet-based solution sounds nice but I don't think it helps [10:47:24] Still… there may be some way to use virt0 as a more effective bootstrap. I'll think about it a bit [10:51:49] andrewbogott: Why doesn't it work? I. 
e., if you move an instance 1:1 from pmtpa to eqiad and then run Puppet? [11:10:59] scfc_de: one question is… how do I remove the virt0 cert without just always removing every cert? [11:21:15] (03PS1) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [11:21:27] akosiaris: what do you think? ^ [11:26:33] well, not in love definitely. If anything changes in the erb, all clients will have to refetch the CA and CRL [11:27:00] Want me to limit it to labs? [11:27:15] It would be better to detect that just the master is changing... [11:27:51] Could we manage /var/lib/puppet/ssl/certs/ca.pem as a file resource in Labs? To be able to change it in concert with $base::puppet::server? [11:28:08] well there is a security concern here [11:28:32] if we have ca.pem refresh on an erb change [11:28:48] that may very well open the door for anyone to take over the entire fleet [11:28:58] Isn't an erb change already a wide open door? [11:29:22] yes but you remove a layer of security there [11:29:26] true [11:30:02] I would be surpised if mark showed up and voted a huge -2 on it [11:30:09] I would not* [11:30:15] ! [11:30:19] where is it [11:30:24] ahahaha [11:30:41] what is the issue, also? [11:30:44] Well, I don't want to apply it universally. I need some way to detect that a migration is happening... [11:30:56] mark, clearing the puppet master cert after an instance moves to the new dc [11:31:06] why is that necessary? [11:31:09] different hostname? [11:31:15] yes, different puppet master [11:31:22] both that and virt0 => virt1000 move [11:31:30] ah, yes, both [11:31:50] perhaps we should use an alias now [11:31:56] that won't solve our current problem but will prevent the next [11:32:46] Hm [11:32:48] like puppet.wmflabs ? [11:32:53] yes [11:33:14] assuming we will want to support instance migrations in the future also [11:33:33] Actually that could solve our current problem as well... [11:33:43] because for /working/ instances I can clear the certs with salt. [11:33:48] It's only instances that are post-migration that are broken [11:33:49] as for detecting a migration is happening [11:33:56] would it be feasible to create some file in the filesystem before migration [11:34:02] and remove it when puppet runs and fixes up the system? [11:34:14] mark: yes, that's easy. [11:34:19] So, probably a good solution. [11:34:24] But... [11:34:25] ok, wait: [11:34:26] in that case it would be reasonable to remove the cert [11:34:31] if we can make sure it can't get abused [11:34:36] the issue with the puppet master isn't the /name/ of the master, it's the cert. [11:34:40] yes [11:34:41] So using an alias doesn't help. [11:34:49] well you could use the same cert everywhere then [11:34:52] Unless we copy the cert when we change masters. [11:35:01] Ah, yes, that would solve the problem w/out using an alias :) [11:35:03] i don't see a huge reason why we couldn't do that [11:35:07] But, I defer to akosiaris on that point [11:35:13] right now that's not optimal as using the virt0 name is a bit weird ;) [11:35:23] the only problem will be the CRL i think [11:35:59] and of course we will need a procedure to say "No longer sign certs on virt0, from now on only on virt1000" [11:36:15] and then phase out virt0 [11:36:17] how does signing work in labs now anyway? [11:36:20] ? If they're the same then... 
[11:36:35] but I think that the issuer in the CA is virt0 in labs [11:36:36] the CA needs to remain in sync [11:36:51] as in the issuer field in the cert [11:36:57] Ah, then sharing doesn't solve anything, it just creates new different problems. [11:38:15] Does Puppet check the issuer or does it only test that the keys match? [11:39:18] it definitely checks the chain [11:40:18] Issuer: CN=Puppet CA: virt0.wikimedia.org [11:40:57] (03PS2) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [11:41:03] like that better? [11:41:15] akosiaris: I know, but does it test that the hostnames match? [11:41:25] yes it does [11:42:29] akosiaris, mark, is ^ ok? Or should I also limit it to labs? Or... [11:42:39] * andrewbogott doesn't know for sure that that will even solve the problem, but it might :/ [11:43:29] I think limiting it to labs for now wouldn't be a bad idea [11:43:46] also, you may want to have a generic "migration-in-progess" file instead which you could use for other things [11:43:48] scfc_de: well to be clear, the hostname of the master. It will obviously not chase down the entire chain hostnames(that would make no sense) [11:44:24] mark: a generic file wouldn't be rm'd at such a good time though [11:44:36] hence having a task-specific flag that gets removed in one go with the cert clearing [11:44:53] Also otherwise I'm not sure how to make sure the file gets cleared /after/ it's used for the test :) [11:44:58] akosiaris: ACK. [11:45:18] BTW, with Labs and network people here, 208.80.152.234 (amaranth) times out from pmtpa-Labs (incidentally, the server itself is status 503 from the InterNet, but different story). Is this some firewall on Labs or in the network? [11:45:19] I 'd say limited to labs for sure. As I already said, a single merge with a changed 10-main-conf.erb has the potential of sending the entire fleet to another puppetmaster [11:46:08] Are there disadvantages to managing /var/lib/puppet/ssl/certs/ca.pem as a file resource in Labs? [11:46:28] scfc_de: and doing what with it ? [11:47:16] Setting it to the cert of the puppet master selected by $base::puppet::server? [11:48:49] in general it is a security concern. That file is your anchor with the entire puppet infra [11:49:31] (03PS3) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [11:50:18] Yes, and AFAICS, at the moment, it's not managed at all, but depends on the initial Puppet connection, I think? [11:50:58] mark, akosiaris, ^ [11:51:05] yes, which is the way it was designed to be by the puppet people [11:51:45] it's not supposed to be managed [11:51:50] at least, not by Puppet :) [11:53:00] (03CR) 10Mark Bergsma: Clear master certs if we change puppet.conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [11:53:45] Then how are puppet masters supposed to be changed in Puppet if puppet.conf is managed, but ca.pem not? On change to puppet.conf, check that ca.pem's "Issuer" matches "server", otherwise delete ca.pem? 
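For the issuer question above, a quick way to inspect which CA an instance currently trusts (the path is the stock puppet ssldir; on a pmtpa-era instance the output matches the virt0 issuer quoted just above):

    openssl x509 -in /var/lib/puppet/ssl/certs/ca.pem -noout -issuer
    # issuer= /CN=Puppet CA: virt0.wikimedia.org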
[11:54:06] puppet masters are not supposed to be changed in Puppet I think [11:54:08] doesn't mean we can't [11:54:12] but it's a bit hairy [11:55:21] (03CR) 10Alexandros Kosiaris: Clear master certs if we change puppet.conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [11:56:17] (03PS4) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [11:56:39] Ah, much tidier that way. [11:56:46] Presuming that 'subscribe' works like that [11:58:53] As it's a bit hairy, I'm looking for a simple solution :-). [11:58:59] (03CR) 10Matanya: Clear master certs if we change puppet.conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [11:59:02] (03CR) 10Alexandros Kosiaris: Clear master certs if we change puppet.conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [11:59:24] arrg [11:59:30] conflict [12:01:15] (03PS5) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [12:04:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [12:05:51] matanya, akosiaris, better? [12:07:37] (03CR) 10Matanya: Clear master certs if we change puppet.conf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [12:08:46] (03PS6) 10Andrew Bogott: Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 [12:10:23] (03CR) 10Matanya: [C: 031] Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [12:12:18] (03CR) 10Tim Landscheidt: [C: 031] "Tested on Tools (removed ca.pem and crl.pem), and Puppet recovered on its own." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [12:12:59] (03PS1) 10Hashar: contint: python-dev on labs slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/115605 [12:13:08] (03CR) 10Andrew Bogott: [C: 032] Clear master certs if we change puppet.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/115594 (owner: 10Andrew Bogott) [12:15:05] * andrewbogott starts a test & crosses fingers [12:25:44] mark, akosiaris, matanya, scfc_de: It goes! A clean migration -- first puppet run requests a cert and exits, second puppet run (after an appropriate delay) works. [12:25:57] nice! [12:26:21] Salt isn't working -- if anyone wants to look at that I'd welcome the help. testmigrate10.eqiad.wmflabs and salt master virt1000 [12:27:06] :) [12:27:45] oh, nm, salt is working now too, just took a minute. [12:27:54] Lemme see if it works a second time... [12:29:56] (03PS1) 10Andrew Bogott: Updated dc-migrate a bit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115606 [12:33:07] andrewbogott: good to know :-) [12:38:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [12:41:38] And a second test works as well. Thanks, all, for helping me sort this out. [12:45:59] (03PS2) 10Andrew Bogott: Updated dc-migrate a bit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115606 [12:46:38] mark: communications between virt hosts looks good now. Thanks! Now I just have to figure out how to set up host keys so they can do unattended rsyncs... 
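Going by Tim Landscheidt's test comment on the merged change (115594), its manual equivalent is roughly the following: only the cached CA material is dropped, the client's own cert stays in place, and the next agent run re-fetches the CA/CRL from whichever master puppet.conf now names. The file list is taken from that comment, not from the patch itself.

    rm -f /var/lib/puppet/ssl/certs/ca.pem /var/lib/puppet/ssl/crl.pem
    puppet agent --test    # "puppetd -t" on older agents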
[12:49:29] (03PS3) 10Andrew Bogott: Updated dc-migrate a bit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115606 [12:51:00] (03CR) 10Andrew Bogott: [C: 032] Updated dc-migrate a bit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115606 (owner: 10Andrew Bogott) [13:02:17] (03PS1) 10Tim Landscheidt: Tools: Set group for $sysdir according to $::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/115609 [13:29:57] (03PS1) 10Tim Landscheidt: Tools: Restore local symlinks for jobutils [operations/puppet] - 10https://gerrit.wikimedia.org/r/115612 [13:41:15] (03CR) 10PiRSquared17: [C: 04-1] Remove Flow from Meta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115412 (owner: 10Odder) [13:43:28] (03PS1) 10Andrew Bogott: Add a line specifying the nova api rate limits. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115614 [13:43:30] (03PS1) 10Andrew Bogott: Turn rate limits WAY up for nova api. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115615 [13:59:36] (03PS1) 10Hashar: beta: lower memcached memory usage [operations/puppet] - 10https://gerrit.wikimedia.org/r/115617 [14:02:31] (03CR) 10coren: [C: 04-1] "It's been many months; those really should not be needed anymore, and will not be present in eqiad." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115612 (owner: 10Tim Landscheidt) [14:03:37] (03PS1) 10Hashar: mediawiki: stop timidity after it got installed [operations/puppet] - 10https://gerrit.wikimedia.org/r/115618 [14:15:37] (03PS1) 10Alexandros Kosiaris: Create OSM labs db partitioning scheme and dhcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/115622 [14:20:03] (03CR) 10Ottomata: "Uh, does that mean keys() doesn't work if there isn't at least two entries in the hash?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/115524 (owner: 10BBlack) [14:20:38] (03PS1) 10Hashar: beta: memcached multiwrite to pmtpa and eqiad [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115623 [14:25:53] (03PS1) 10Hashar: deployment::target does not work in labs, skip it [operations/puppet] - 10https://gerrit.wikimedia.org/r/115624 [14:28:35] (03PS1) 10Ottomata: Putting PTR records back for analytics100{3,4}.eqiad.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/115625 [14:33:15] (03PS1) 10Ottomata: Pointing analytics100{3,4} macs back at their .eqiad.wmnet addresses [operations/puppet] - 10https://gerrit.wikimedia.org/r/115626 [14:34:33] (03CR) 10Alexandros Kosiaris: "If I understand the commit message correctly, you should also remove the analytics100[34].wikimedia.org forward and reverse records." [operations/dns] - 10https://gerrit.wikimedia.org/r/115625 (owner: 10Ottomata) [14:37:02] (03CR) 10Alexandros Kosiaris: [C: 032] Adding OSM role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/115409 (owner: 10Alexandros Kosiaris) [14:37:08] (03CR) 10Ottomata: "Yeah, I can do that. 
I left them because I noticed that the .eqiad.wmnet records were mostly left intact from when we changed before...bu" [operations/dns] - 10https://gerrit.wikimedia.org/r/115625 (owner: 10Ottomata) [14:37:40] (03CR) 10Alexandros Kosiaris: [C: 032] Create OSM labs db partitioning scheme and dhcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/115622 (owner: 10Alexandros Kosiaris) [14:37:59] (03PS2) 10Ottomata: Putting PTR records back for analytics100{3,4}.eqiad.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/115625 [14:38:04] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce labsdb100[45].eqiad.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/115410 (owner: 10Alexandros Kosiaris) [14:38:26] (03PS2) 10Hashar: mediawiki: stop timidity only once it got installed [operations/puppet] - 10https://gerrit.wikimedia.org/r/115618 [14:38:46] (03CR) 10Hashar: "Apparently timidity no more install any daemon..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115618 (owner: 10Hashar) [14:39:38] (03Abandoned) 10Tim Landscheidt: Tools: Restore local symlinks for jobutils [operations/puppet] - 10https://gerrit.wikimedia.org/r/115612 (owner: 10Tim Landscheidt) [14:40:09] (03CR) 10Alexandros Kosiaris: [C: 032] Putting PTR records back for analytics100{3,4}.eqiad.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/115625 (owner: 10Ottomata) [14:50:06] error: server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none while accessing https://git.wikimedia.org/git/mediawiki/tools/release.git/info/refs [14:51:15] Getting that trying to clone any repos from gerrit onto tin/bast1001 [14:51:26] (03PS1) 10Ottomata: README.md updates [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/115628 [14:51:28] (03PS1) 10Hashar: Configuration for beta cluster caches in eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/115629 [14:51:40] (03CR) 10Ottomata: [C: 032 V: 032] README.md updates [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/115628 (owner: 10Ottomata) [14:56:20] (03PS1) 10Andrew Bogott: Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 [14:57:44] (03PS2) 10Ottomata: Pointing analytics100{3,4} macs back at their .eqiad.wmnet addresses [operations/puppet] - 10https://gerrit.wikimedia.org/r/115626 [14:57:50] (03CR) 10Ottomata: [C: 032 V: 032] Pointing analytics100{3,4} macs back at their .eqiad.wmnet addresses [operations/puppet] - 10https://gerrit.wikimedia.org/r/115626 (owner: 10Ottomata) [14:59:02] (03PS2) 10Andrew Bogott: Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 [15:01:20] (03PS3) 10Andrew Bogott: Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 [15:01:58] (03CR) 10jenkins-bot: [V: 04-1] Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 (owner: 10Andrew Bogott) [15:02:56] (03PS4) 10Andrew Bogott: Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 [15:04:04] (03CR) 10Andrew Bogott: [C: 04-2] "OK, pretty sure we don't need this now. Keeping it in gerrit for a few days just to be sure." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/115342 (owner: 10Andrew Bogott) [15:05:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [15:15:52] (03CR) 10Andrew Bogott: [C: 032] Turn the labs::nfs::client class off for eqiad. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115631 (owner: 10Andrew Bogott) [15:26:53] !log rebooting analytics1004 [15:27:00] Logged the message, Master [15:27:49] PROBLEM - Host analytics1004 is DOWN: PING CRITICAL - Packet loss = 100% [15:39:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [15:45:02] (03PS1) 10Jgreen: grant hashar,reedy access to caesium per RT 6861 [operations/puppet] - 10https://gerrit.wikimedia.org/r/115633 [15:46:04] (03CR) 10Reedy: [C: 04-1] grant hashar,reedy access to caesium per RT 6861 (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/115633 (owner: 10Jgreen) [15:46:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [15:48:55] (03CR) 10BBlack: [C: 031] remove sockpuppet from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/115527 (owner: 10Dzahn) [15:51:25] (03PS2) 10Jgreen: grant hashar,reedy access to caesium per RT 6861 [operations/puppet] - 10https://gerrit.wikimedia.org/r/115633 [15:54:41] (03CR) 10Jgreen: [C: 032 V: 031] grant hashar,reedy access to caesium per RT 6861 [operations/puppet] - 10https://gerrit.wikimedia.org/r/115633 (owner: 10Jgreen) [16:02:01] ori: ping [16:23:12] (03CR) 10Alexandros Kosiaris: [C: 032] "And no reason to keep it around as anything. It is out of sync anyway with both salt and puppet for quite some time now." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115527 (owner: 10Dzahn) [16:39:23] !log disabling puppet on sodium [16:39:25] (03PS1) 10BBlack: Varnish should restart on initscript/defaults changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/115637 [16:39:31] Logged the message, Master [16:40:02] bblack: that's kinda scary isn't it [16:40:12] you change one file and it restarts the whole cluster [16:40:22] without depooling or anything [16:40:27] yeah, that's why I didn't +2 it :) [16:40:30] even frontends :) [16:40:47] but the current situation sucks pretty bad [16:41:35] in the case of the geoip_cookie patches (which have other issues, but ...), flipping on geoip for a type of cache edits the initscript, pushes the VCL, then VCL reload fails everywhere because it hasn't been restarted with the new CC_COMMAND [16:43:06] I guess in the larger view, this again highlights the mmap problem. If it weren't for that, restarts wouldn't be quite as scary. [16:43:09] (03PS1) 10Jgreen: swap in tantalum for rhodium for pdf/trusty testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/115638 [16:43:17] no they'd be, for frontends [16:43:28] you can alter cc_command without a restart [16:43:32] varnishadm param.* [16:43:43] oh really? [16:43:44] varnishadm param.show cc_command [16:43:55] param.set cc_command ... [16:43:56] etc. 
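A sketch of the no-restart route paravoid describes above; the parameter value shown is the varnishd default and the VCL name/path are illustrative, since the real CC_COMMAND and filenames aren't quoted in this log.

    varnishadm param.show cc_command
    varnishadm param.set cc_command 'exec cc -fpic -shared -Wl,-x -o %o %s'
    # a subsequent vcl.load / vcl.use then compiles with the new command
    varnishadm vcl.load withgeoip /etc/varnish/wikimedia_text-backend.vcl
    varnishadm vcl.use withgeoip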
[16:44:11] hmm maybe the puppet stuff can be re-worked to invoke that (in addition to editing the initscript) [16:45:26] re: frontend restarts, perhaps all in a 30-minute window would be scary, yeah [16:46:11] no, they are in general without a depool [16:46:22] you're basically throw errors to a bunch of clients :) [16:46:36] yeah I know [16:46:37] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [16:46:43] even more so for all of them in < 30 minutes (and we are generally trying to lower the 30' interval too) [16:46:56] but percentage-wise, one at time is bearable with delays [16:47:59] (03CR) 10BBlack: [C: 04-2] "Don't actually merge this!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/115637 (owner: 10BBlack) [16:48:01] (03CR) 10Jgreen: [C: 032 V: 031] swap in tantalum for rhodium for pdf/trusty testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/115638 (owner: 10Jgreen) [16:48:05] welll, ideally we should depool from pybal, wait for most sessions to expire, restart, then pool again [16:48:20] x83 [16:48:27] manually [16:48:29] most things you can do without a restart [16:48:42] via VCL or param.set [16:48:46] while waiting for settlement between, and watching for the impact of delayed backend restarts for failed SILO load [16:49:01] I'm talking about the still-ongoing 3.0.5 rollout [16:49:08] oh [16:49:20] I was still at cc_command :) [16:49:49] well, your comments about frontend restarts lead to "I've not been depooling while upgrading these 83 servers over a period of days to 3.0.5" [16:50:05] which is why we get little spikes on 5xx, but only when I'm not asleep :P [16:50:10] heh, okay, I said "ideally" :) [16:50:41] ideally we'd have a better mechanism for depooling servers automatically :) [16:51:16] ideally software would do its job without all this handholding. if only one soul on this whole planet could write perfect software :P [16:53:13] anyways, I'm on server# 60/83 now, the process should be finished today [16:53:33] cool [16:54:59] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:27] ideally, maybe we could build some puppet mechanism for queuing up delayed/randomized actions [16:59:57] e.g. 
on an event that requires varnish restart, it enqueues the restart command locally to happend at $random_time_over_48_hrs or whatever [17:12:24] paravoid, bblack: could write a salt runner that iterates over the list of minions, where it depools the minion, does the action, waits for success, then repools the minion [17:12:41] the goes to the next minion [17:12:43] *then [17:13:21] should be relatively easy to write a salt module that can pool/depool/enable/disable hosts in the pybal config [17:13:45] ideally pybal would have some support for that [17:13:55] that would be nice too :) [17:14:17] I had wanted to add an API to pybal for a while [17:14:22] yeah [17:16:09] (03PS1) 10coren: Labs: manage LVM volumes on instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/115641 [17:17:03] I, otoh, want to switch pybal to etcd :) [17:18:52] or zookeeper, but etcd seems much much simpler from the cursory look I've given it [17:25:05] (03PS1) 10Cmjohnson: Adding mgmt ip's for row d pdus [operations/dns] - 10https://gerrit.wikimedia.org/r/115645 [17:26:15] (03CR) 10Cmjohnson: [C: 032] Adding mgmt ip's for row d pdus [operations/dns] - 10https://gerrit.wikimedia.org/r/115645 (owner: 10Cmjohnson) [17:28:06] (03PS2) 10coren: Labs: manage LVM volumes on instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/115641 [17:30:17] (03CR) 10coren: [C: 032] "Noop for now." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115641 (owner: 10coren) [17:33:19] (03CR) 10Gage: [C: 032] Setting up kafkatee on analytics1003 to log mobile webrequest logs [operations/puppet] - 10https://gerrit.wikimedia.org/r/115411 (owner: 10Ottomata) [17:33:27] !log upgraded to librdkafka1 0.8.3 on cp3019, restarting varnishkafka [17:33:31] (03PS1) 10Andrew Bogott: Create new instance with duplicate puppet classes and vars. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115646 [17:33:34] Logged the message, Master [17:38:01] (03PS1) 10coren: Labs: fixes to the test role [operations/puppet] - 10https://gerrit.wikimedia.org/r/115647 [17:40:50] (03CR) 10coren: [C: 032] "Small fix." [operations/puppet] - 10https://gerrit.wikimedia.org/r/115647 (owner: 10coren) [17:43:00] (03PS1) 10Alexandros Kosiaris: Add labs-support1-c-eqiad to autoinstall subnets [operations/puppet] - 10https://gerrit.wikimedia.org/r/115648 [17:50:54] (03PS1) 10coren: Labs: Fix directory name in labs_lvm module [operations/puppet] - 10https://gerrit.wikimedia.org/r/115650 [17:51:19] (03PS2) 10Alexandros Kosiaris: Add labs-support1-c-eqiad to autoinstall subnets [operations/puppet] - 10https://gerrit.wikimedia.org/r/115648 [17:52:00] (03CR) 10coren: [C: 032] "There's your problem." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/115650 (owner: 10coren) [17:56:30] (03CR) 10Alexandros Kosiaris: [C: 032] Add labs-support1-c-eqiad to autoinstall subnets [operations/puppet] - 10https://gerrit.wikimedia.org/r/115648 (owner: 10Alexandros Kosiaris) [17:59:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [18:05:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [18:06:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC [18:18:18] (03PS1) 10coren: Fixes to the labs_lvm class [operations/puppet] - 10https://gerrit.wikimedia.org/r/115653 [18:18:58] (03PS2) 10coren: Fixes to the labs_lvm class [operations/puppet] - 10https://gerrit.wikimedia.org/r/115653 [18:25:39] (03CR) 10Dzahn: [C: 031] "+1 but we need to agree on removing mgmt as well or only after disk wiping has been confirmed" [operations/dns] - 10https://gerrit.wikimedia.org/r/115581 (owner: 10Matanya) [18:26:18] (03CR) 10Dzahn: [C: 031] "+1 but we need to agree on removing mgmt as well or only after disk wiping has been confirmed" [operations/dns] - 10https://gerrit.wikimedia.org/r/115578 (owner: 10Matanya) [18:27:03] mutante: personally dns should go last in my opinion [18:27:56] matanya: there's 2 kinds of DNS, regular and mgmt [18:28:05] yes [18:28:07] matanya: it's about if the 'mgmt' part should stay or not [18:28:27] because as RobH explained the other day [18:28:35] if we keep them then we can also do disk wiping from remote [18:28:42] and if we remove it we rely on DC tech [18:28:46] yea but the lifecycle supports what you are doin [18:28:48] not what i said [18:28:59] so im the only one who wants to keep [18:29:02] ok, but that was also a good point about the wiping [18:29:09] actually, hold on lemme check the tampa specific entyr [18:29:15] so..yea. that's why i made the comment on that change:) [18:29:25] so you guys can chime in [18:30:01] so yea, i think its nice to leave it until its unracked, in fact i wanna make that policy [18:30:03] its just not yet [18:30:06] cmjohnson1: what do you think? [18:30:17] leave mgmt dns as long as its in rack with power was my intent all along [18:30:23] but the lifecycle document didnt reflect it clearly [18:30:27] the good thing is being able to still do stuff from remote [18:30:28] didnt, still doesnt [18:30:31] the drawback is having 2 changes [18:30:37] if its in a rack with power, we should have mgmt [18:30:41] and having to do another clean up later [18:30:48] there isnt a good reason to me not to (having two dns changes isnt that big a deal imo) [18:30:49] robh: i agree [18:30:51] i think not, what do you? 
[18:31:04] so cmjohnson1 and i agree [18:31:09] im inclined to say its how it is ;] [18:31:37] "with power" means the rack itself has power, right [18:31:42] but the server is already shutdown -h [18:32:02] robh: can you put the idrac license for ms-be1005 in my home folder sometime today plz [18:32:19] mutante: yeah...the servers should be shutdown [18:32:34] mutante: correct [18:32:37] cmjohnson1: ok, yea, i do, last one was kaulen [18:32:37] i've updated lifecycle docs [18:32:49] RobH: ok, thanks, i'll leave them in [18:32:59] matanya: that means your changes need to be amended i guess [18:33:04] ok [18:33:58] thanks RobH, mutante cmjohnson1 i'll amend [18:34:34] well, the lifecycle doc is now fully updated to reflect [18:34:39] it wasnt clear at all before [18:34:43] so i understand the confusion [18:35:09] the reasoning before was that wiping wasn't done via mgmt anyways [18:35:16] but this makes us more flexible.. yea [18:35:36] so for sockpuppet you want: 2» 1H» IN PTR» sockpuppet.mgmt.pmtpa.wmnet. back ? [18:35:49] or any other entry as well? [18:36:06] well, did it have asset tag mgmt? [18:36:15] all the mgmt entries should ideally just stay in place [18:36:20] matanya: forward and reverse for the mgmt entry [18:36:21] but yea, at minimum that [18:36:32] most systems have dual mgmt entries [18:36:38] one based off asset tag (static) [18:36:44] and one based off hostname [18:36:45] matanya: so wmnet has 2 IPs, one in the mgmt network and one that isnt [18:37:00] the dual entries point to same mgmt ip [18:37:41] that being said, we just made this change [18:37:50] so if some items dont reflect it, meh. [18:37:57] from the 4 changes i did, which 2 are need to leave in? [18:38:06] matanya: you can see the network if you scroll up to the beginning of one of those blocks, see line 243 of 'wmnet" [18:38:17] yeah, i'm there [18:39:49] so, yea, keep the IP in that network, nuke the other one [18:40:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [18:47:28] !log es1006 swapping failed disk [18:47:37] Logged the message, Master [18:48:12] (03CR) 10Dzahn: [C: 031] Modify login credential hint [operations/puppet] - 10https://gerrit.wikimedia.org/r/114503 (owner: 10Greg Grossmeier) [18:50:39] dr0ptp4kt: hello [18:50:39] paravoid: hello [18:50:39] so, we think we'll need to go with the outbound cookie approach for the time being, deferring on auto-redirect until the future. [18:50:39] bblack: are you around by any chance? [18:50:39] okay; how come? [18:50:39] paravoid: yes [18:50:47] bblack: got a sec to chat about the zero/contributory features I was telling you about? [18:50:52] nervousness that it will break on phones. [18:51:01] dr0ptp4kt: fair enough [18:51:02] yeah [18:51:03] (03CR) 10BryanDavis: [C: 031] Modify login credential hint [operations/puppet] - 10https://gerrit.wikimedia.org/r/114503 (owner: 10Greg Grossmeier) [18:51:17] I assume we're just going to set some odd cookie in vcl_deliver, right? 
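To make the outcome of the decommissioning discussion above concrete: the production record goes away now, while the mgmt record stays until the box is unracked. Hostnames below are taken from the sockpuppet change under review and are illustrative.

    dig +short sockpuppet.pmtpa.wmnet        # production entry, removed at decom
    dig +short sockpuppet.mgmt.pmtpa.wmnet   # mgmt entry, kept while racked and powered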
[18:52:43] alright, so, quick question: (1) should we do it in varnish esi and have it take effect nearly instantaneously (that is, after the js is updated and it's reflected in resourceloader), or (2) should we do it from the origin, which would necessitate a cache flush between now and 30 days out for any carriers supporting https....we can't just put it in the origin, as it would have inconsistent effects across fresh versus stale pages [18:53:01] (that is, we can't put it in the origin without a cache flush afaik) [18:53:04] ^paravoid, bblack [18:53:12] I assume you mean vcl rather than esi? [18:53:14] "quick question" :) [18:53:19] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200233) [18:53:24] yeah, sorry, vcl, not esi. brain wires entangled [18:53:46] not sure if there's a way to cache flush on a vary header [18:54:03] ok maybe my understanding is flawed, let me state what I think is happening and you can tell me where I'm off [18:54:19] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [18:55:00] 1) You're adding some JS code that says "hey if CookieX is set, show the edit button" and 2) We're setting that cookie in vcl_deliver, iff it was a zero-rated access (X-CS would be set) and this carrier supports zero-over-https in the if/else logic [18:55:10] paravoid, bblack i just lost my connection, can you resend previous 2-3 messages? [18:55:29] bblack, correct [18:56:05] so you're saying you're worried about cache staleness on the updated javascript stuff? [18:56:14] bblack, exactly a non-HTTPOnly cookie so js can read it. as the contributory features generally require js, we use the js to override the display:hidden on all of the currently hidden elements [18:57:05] I think he's saying that you can do the Set-Cookie in MediaWiki too, as the page is Varied on X-CS anyway [18:57:07] bblack, regarding cache staleness, i'm only concerned in the case that we didn't do the cookie from varnish. that is, if we did it from the origin, i believe we would still have objects in the cache that don't have that header, so they would be inaccurate and the js wouldn't run. does that make sense? [18:57:13] do we need the cookie when we're already https? I mean, at that point the js could just unhide itself because the connection is secure already, right? [18:57:49] good question - so we want to show the contrib features on cleartex http as well so that the user will discover them. [18:58:03] for the moment, we can't just do a redirect. too much concern it will bust on a nontrivial number of phones. [18:58:15] plus there's the overhead of the redirect. [18:58:26] right, but what I mean is, can't we set the cookie from http, only for http, and not continue to send it once they've switched to https by clicking an edit button? [18:58:35] we'll want to examine that, though, in time, as it's (i hope) inevitable we'll start funneling most everything to https by default in time [18:58:41] bblack, oh [18:59:13] yeah, i think that's fine. the js can look at scheme://...if https, show....if http, examine cookie [18:59:19] do i have that right? [18:59:38] I guess, I mostly asked because of your "non-HTTPOnly cookie so js can read it" comment [19:00:33] okay, i think we're on the same page. just wanted to indicate that the cookie shouldn't have the flag 'HTTPOnly' because that would bar js from reading the cookie at all. 
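A quick way to see the Vary behaviour paravoid mentions above (URL illustrative; it is the mobile pages that carry the zero/X-CS logic):

    curl -sI http://en.m.wikipedia.org/wiki/Main_Page | grep -i '^vary'
    # if X-CS is listed, freshly cached objects can carry the per-carrier
    # Set-Cookie from the origin; objects cached before the change would not,
    # which is the staleness concern raised above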
[19:01:10] oh, ok, I thought that meant "this cookie can't have some flag that prevents its use over HTTPS" [19:01:13] that said, do we care if we're sending the cookie on outbound https requests? is that a problem? [19:01:21] bblack, yeah, it's an unfortunate flag name :( [19:01:35] they should have named it httpprotocolonly or something like that [19:01:35] I don't think it's an issue either way [19:01:50] ok, cool...one less check in the vcl if-elseif subsections [19:02:17] regarding your earlier question, I think that ultimately it'd be better if the origin does this as it encodes less logic into our VCL [19:02:32] and I think you already do the carrier lookup early on in the zero extension, right? [19:02:39] so computationally it wouldn't make a difference for you [19:03:00] that being said, we could add it optionally (with a guard to check if it's already set) in VCL for 30 days [19:03:03] so, assuming we keep origin and the vcl logic in sync (which is already an assumption), we could do it both places [19:03:09] and then remove the VCL in 30 days [19:03:09] so that you can deploy immediately [19:03:22] hehe :) [19:03:55] yeah, so, I'll work on a gerrit patch for that later today and we can review/edit from that baseline and sort out the details [19:03:56] sweet, ok, i will proceed in that fashion. can't guarantee i won't be back to ask more questions, but i have my marching orders now [19:04:02] do you have a name/format for the cookie yet? [19:04:17] bblack, you're going to update the vcl? or you want me to update it? [19:04:34] I'll do it. At least, I'll start the process [19:04:49] we can call the cookie anything. we should probably make it feature flag-ish. that is, let's call it ZeroOpts [19:05:14] PROBLEM - Puppet freshness on mw6 is CRITICAL: Last successful Puppet run was Wed 26 Feb 2014 07:00:31 PM UTC [19:05:25] yeah but then if we have multiple options, we'll have to parse them out to modify them in some future VCL update or something, right? [19:05:31] and let's have it be a colon separated thing. for now, it would only contain the value 'tls' [19:06:14] in general, what's the plan when another option hits this, which causes the same issues with a 30-day VCL change, etc? [19:06:51] you can also check "age" from VCL [19:07:01] let's plan to avoid migrating more logic to varnish for this purpose. we'll work to make it an origin-only thing except for this one time. [19:07:14] PROBLEM - Puppet freshness on mw6 is CRITICAL: Last successful Puppet run was Wed 26 Feb 2014 07:00:31 PM UTC [19:07:39] paravoid, you saying varnish will observe the age on the cookie? it's pleasing when caches actually conform to the spec! [19:07:44] I don't see how that can happen tbh. the first time you figure out how to move the stuff to origin-only, you'll have a varnish cache issue deploying that code :) [19:07:59] bblack, lol [19:08:13] bblack, i'm thinking future state with esi [19:08:34] although i guess at that point it probably wouldn't be an esi fragment for a header, but rather a
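A hypothetical smoke test once the vcl_deliver change lands, run from a client that varnish classifies as zero-rated (X-CS is assigned inside VCL from the carrier's IP ranges, so it can't simply be injected from outside); the cookie name and format are the ones being settled in the discussion above.

    curl -sI http://en.m.wikipedia.org/wiki/Main_Page | grep -i '^set-cookie'
    # expect something like: Set-Cookie: ZeroOpts=tls; ... with no HttpOnly flag,
    # so the mobile JS can read it and unhide the edit UI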