[16:09:41] _joe_: hiii! I wanted to ask you something. Remember we talked about switching sessions to kask. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570395 does that and we're quite ready to deploy it. Maybe in tomorrow's SWAT.
[16:10:00] This will boost the # of HTTP reqs from MW to Kask to 30k/s
[16:10:41] <_joe_> yeah I told you I don't think it's a good idea at this point, I'll comment on the patch
[16:10:59] <_joe_> I think its deploy can wait for us (serviceops) to install the middleware proxy first
[16:12:20] ok. thank you.
[16:12:45] I've somehow missed the conclusion of the conversation we had
[16:12:51] sorry about that
[16:13:31] <_joe_> akosiaris: what do you think? ^^
[16:15:55] * akosiaris looking
[16:16:53] heh, this will probably cause the exact same outage as https://wikitech.wikimedia.org/wiki/Incident_documentation/20200108-mw-api
[16:17:14] unless there is something magically different about it that we haven't figured out yet
[16:17:42] so yeah, we need something in between to do that HTTPS termination, otherwise php-fpm will probably not survive.
[16:19:09] _joe_: Pchelolo: we can verify that however by only making that change for a single mw host. One with high traffic, preferably
[16:19:40] <_joe_> that would be disruptive for users though?
[16:20:15] aren't sessions being written to both stores?
[16:20:35] or do you mean if we cause that mw host's php-fpm to lock up?
[16:22:01] yeah they are akosiaris, but 1) currently kask is first (we can change that) and 2) nobody knows what happens if it times out.
[16:22:03] for the former, Pchelolo is probably better than me to answer; for the latter, it's a canary approach anyway?
[16:23:23] so far we can expect traffic to be equal to what redis gets, e.g. https://grafana.wikimedia.org/d/000000174/redis?orgId=1&fullscreen&panelId=8
[16:23:45] 2 is pretty concerning from my PoV. For 1, I am not sure why kask would be first in the beginning, to be honest. I would expect it to be inserted second, then become first, and then redis to be removed.
[16:24:24] so, to return to 2: what happens if kask doesn't answer?
[16:24:38] should we try to artificially create the scenario before moving to group2?
[16:26:05] code says the request times out with the normal MW timeout and we go to redis
[16:28:43] fwiw it's rather easy (and probably prudent) to indeed test this. We will just mangle firewalling rules on a selected host (e.g. mwdebug1001) to drop all traffic to sessionstore.svc.eqiad.wmnet.
[16:28:56] as for switching the order - we can, that will make kask have no reads and only writes, but this is just kicking the can down the road - when we turn off redis there will be a huge jump in traffic
[16:29:47] Pchelolo: sure, but it will be 2 steps, not 1. A little bit more certainty gained, and the problem is broken into 2 smaller problems
[16:31:34] e.g. 7k peaks of writes to kask (at least as per https://grafana.wikimedia.org/d/000000174/redis?orgId=1&fullscreen&panelId=12) is definitely less traffic than 45k
[16:31:43] sorry, 35k
[16:32:16] sorry, not 7k, 3.6k
[16:32:27] akosiaris: I can do it if you think it will make a lot of difference. as I understand it, we're not sure that the final state of the transition will work at all, so it matters less how we get there
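A minimal Python sketch of the dual-store arrangement discussed above (hypothetical names; not the actual MediaWiki session manager code): writes go to both backends to keep them warm during the migration, while reads try the ordered primary and fall back on timeout, matching "the request times out with the normal MW timeout and we go to redis".

```python
import socket

class SessionStore:
    """Wraps one backend (e.g. kask or redis) behind a common interface."""
    def __init__(self, name, get_fn, set_fn):
        self.name, self.get_fn, self.set_fn = name, get_fn, set_fn

class MultiStore:
    def __init__(self, stores):
        self.stores = stores  # ordered: stores[0] is the read-primary

    def set(self, key, value):
        # Write to every store; a failed write to one backend must not
        # fail the whole request during the transition.
        for store in self.stores:
            try:
                store.set_fn(key, value)
            except (socket.timeout, OSError):
                pass

    def get(self, key):
        # Read from the primary; on timeout fall through to the next store.
        # Swapping the list order is the kask-first vs. redis-first knob
        # debated above.
        for store in self.stores:
            try:
                return store.get_fn(key)
            except (socket.timeout, OSError):
                continue
        return None
```

Note the cost this sketch makes visible: with the fallback ordering, every primary-store timeout still burns a full MW request timeout before redis answers, which is why an unresponsive kask can tie up php-fpm workers fleet-wide.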
[16:33:53] Pchelolo: I think both mine and _joe_'s bet right now is that moving directly into the final state will cause an outage. We aren't 100% sure, but if https://wikitech.wikimedia.org/wiki/Incident_documentation/20200108-mw-api is any guide, php-fpm will lock up across the fleet and bring everything down
[16:34:09] possibly including the caches
[16:34:34] so, not necessarily easy to get out of (more than just a revert will be required)
[16:35:24] ok. then we certainly hold off. My point is that if we split the rollout into any number of phases, one of them will cause an outage. Thus we just don't do anything at all right now
[16:35:59] when do you think you will get that middleware set up?
[16:36:48] we are reprioritizing OKR work in order to do just that. We are wrapping up the various outage documents to make sure we have a good grasp of what went wrong, so probably next week?
[16:37:44] cool. I'll wait. In the meantime I can look into whether that many requests are even needed or could be optimized away
[16:37:46] thank you
[16:48:53] <_joe_> akosiaris: if we write to both it's ok to test the change on one appserver I guess
[16:50:26] Pchelolo: does that interest you? I can schedule selectively killing sessionstore for an appserver tomorrow morning ^
[16:51:38] akosiaris: more info is always good, but it's mostly up to you - is there any result of this test that would make you say "yeah, no worries at all, don't wait for us, deploy"?
[16:55:38] Pchelolo: probably not. If anything it might say the opposite :-(
[16:56:13] yeah, so I guess we can skip the test and wait for you to deploy the middleware
[17:28:12] serviceops, Core Platform Team, MediaWiki-Cache, Operations: WanObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (Ladsgroup) Apparently this is a built-in and not-properly documented (I didn't know it existed) feature...
[17:34:00] serviceops, Core Platform Team, MediaWiki-Cache, Operations: WanObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (CDanis) [edit conflict ;)] The debug messages point to this code path in `WANObjectCache.php`: `lang=p...
[17:36:58] serviceops, Product-Infrastructure-Team-Backlog, Core Platform Team Workboards (Clinic Duty Team): PCS internal request rates tripled on 2019-11-19 - https://phabricator.wikimedia.org/T238832 (Mholloway) Update: PCS load has been at pretty much the same level since 12/6: not high enough to cause serv...
[17:41:13] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog: Mobileapps flapping since 2019-11-26 0:00 UTC - https://phabricator.wikimedia.org/T239344 (Mholloway) @jcrespo If it's still the case that you're seeing soft mobileapps endpoint timeout alerts on codfw, IMO a new task would be best.
[17:42:11] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul)
[17:42:36] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog: Mobileapps flapping since 2019-11-26 0:00 UTC - https://phabricator.wikimedia.org/T239344 (Mholloway) Open→Resolved Per T238832#5736879, this round of instability seems to have been resolved when Parsoid/JS linting was tur...
[17:44:22] serviceops, Parsoid, Product-Infrastructure-Team-Backlog, Core Platform Team Workboards (Clinic Duty Team): PCS internal request rates tripled on 2019-11-19 - https://phabricator.wikimedia.org/T238832 (Mholloway) Tagging #parsoid since this appears related to Parsoid config changes.
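The T244877 behavior referenced above, where a slow fetch causes the cache set() to be skipped, can be pictured with a minimal Python sketch; this is not MediaWiki's actual WANObjectCache implementation, and the threshold and names here are hypothetical:

```python
import time

MAX_SET_DELAY = 1.0  # seconds; hypothetical threshold

def get_with_set_callback(cache: dict, key: str, callback):
    """On a cache miss, compute the value; skip caching it if the
    computation was slow, since a slow fetch suggests the data may
    have come from a lagged snapshot and could already be stale."""
    value = cache.get(key)
    if value is not None:
        return value
    start = time.monotonic()
    value = callback()
    elapsed = time.monotonic() - start
    if elapsed > MAX_SET_DELAY:
        # Analogous to the "Rejected set() for {cachekey} due to
        # snapshot lag" debug message quoted from T244877.
        print(f"Rejected set() for {key}: fetch took {elapsed:.2f}s")
    else:
        cache[key] = value
    return value
```

The surprising consequence, and the reason the task was filed, is that a consistently slow callback never populates the cache at all, so every caller keeps paying the full fetch cost.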
[17:55:56] thanks to papaul for new codfw appservers.. looking at DNS change now to add their mgmt and IPs
[17:57:18] serviceops, Core Platform Team, MediaWiki-Cache, Operations: WanObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (Krinkle) The case of `Rejected set() for {cachekey} due to snapshot lag.` is meant to cover situations l...
[18:02:13] serviceops, Analytics, Operations, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey) >>! In T244719#5872221, @Ottomata wrote: > I've never understood why we do this. Wouldn't it be better to create the VM in the private...
[18:29:58] serviceops, Analytics, Operations, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (Ottomata) Don't we want to make the service HA?
[19:30:51] serviceops, Release-Engineering-Team-TODO, Scap, Release-Engineering-Team (Deployment services): Define a mediawiki "version" - https://phabricator.wikimedia.org/T218412 (thcipriani) >>! In T218412#5027492, @Dzahn wrote: > How about this: > > We calculate the sha1 sum of each file in /srv/medi...
[19:45:41] serviceops, Analytics, Operations, vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey) >>! In T244719#5873618, @Ottomata wrote: > Don't we want to make the service HA? I don't think it is a goal for now, we'd need a replacem...
[20:06:36] serviceops, Analytics, Operations, vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (Ottomata) I thought one of the issues with the current IRC daemon was that it can't ever be taken offline. Without an HA or at least auto-failove...
[22:00:11] serviceops, Release-Engineering-Team-TODO, Scap, Release-Engineering-Team (Deployment services): Define a mediawiki "version" - https://phabricator.wikimedia.org/T218412 (mmodell) >>! In T218412#5866447, @Dzahn wrote: > Kind of have a hard time imagining a situation where we actually want one or...
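The sha1-per-file "version" idea quoted from T218412 above could look roughly like the following minimal Python sketch; the deployment root path, the path-sorted ordering, and the function name are assumptions for illustration, not the scheme the task settled on:

```python
import hashlib
from pathlib import Path

def tree_fingerprint(root: str) -> str:
    """Derive one stable identifier for a deployed tree by hashing
    every file under it in a deterministic (sorted-path) order."""
    digest = hashlib.sha1()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Include the relative path so renames change the version too.
            digest.update(str(path.relative_to(root)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

# e.g. tree_fingerprint("/srv/mediawiki")  # hypothetical deployment root
# -> one hex id that changes whenever any deployed file changes
```

Two identical trees then yield the same identifier regardless of how or when they were deployed, which is the property a scap-level "version" needs.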