[08:43:01] 10serviceops, 10Operations, 10Wikidata: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10LSobanski) Removing #DBA as there's nothing specific for us to do right now, do add us back if anything comes up.
[10:18:45] Envoy 1.16 is out! We'd like to move towards it in API gateway quite soon as it has lots of features that Clara contributed, but I'm also a little concerned about rolling forward with our current approach - moving to the latest version of 1.15 raised the issue of hopping back and forth when it comes to patches coming out for versions in use in prod
[10:19:11] would it make sense to have an envoy-future branch in our repo when moving minor versions ahead of what's generally accepted for prod?
[10:39:01] * jayme cries
[10:40:34] hnowlan: yeah... create a second branch in the gbp repo for envoy-future to build from, I would say
[10:41:49] And please also take a look at https://gerrit.wikimedia.org/r/c/operations/debs/envoyproxy/+/632479 - would be nice if we could verify a stretch-compatible build of 1.16 as well
[10:41:59] (in envoy-future I mean)
[10:43:32] I have not yet thought about a good way to fix the dependency on an older buildsystem image (like building the image as well and pushing it to our registry). So currently the "working version" is only available on the envoy builder host
[10:47:30] "Error: error converting YAML to JSON: yaml: line 226069: mapping values are not allowed in this context" ... well, this is a lot of YAML :D
[10:47:38] haha D:
[10:51:41] merged the envoy changes, hnowlan. Curious to hear if that still works out with 1.16
[12:26:39] 10serviceops, 10Operations, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10kostajh) >>! In T252391#6387745, @MMiller_WMF wrote: > @kostajh -- maybe we should do that, but I would like to hear from @ne...
[13:06:45] I was under the impression that "upstream latency" for local_service in envoy (https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?viewPanel=14&orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=push-notifications&var-destination=local_service) is the latency for requests between envoy and the service it is terminating TLS for. Am I wrong here _joe_?
[13:11:57] trying to make sense of why the latency of push-not on the service dashboard (https://grafana.wikimedia.org/d/NQO_pqvMk/push-notifications?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s&var-service=push-notifications) is so different from the envoy telemetry
[13:12:52] and why the latency spiked like crazy after the envoy update there
[13:34:20] 10serviceops, 10Operations: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10MoritzMuehlenhoff) After a lot of fist shaking and head scratching I think I've found a workable solution to the problem that the PHP build depends on ICU 63 (for intl) and indirectly...
[14:02:14] mholloway: are you around by any chance?
[14:02:32] jayme: hi, what's up?
[14:02:41] aha, reading backscroll
[14:02:53] I've created a strange pattern in push-not eqiad
[14:03:26] rolled back my change some time ago but the latency issue persists, so I guess it's something else
[14:09:25] jayme: i have to run out the door in a minute but i've pinged a couple of teammates about it, they should be here soon
[14:09:45] cool. Thanks
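(Aside: a minimal sketch of the envoy-future workflow suggested above at 10:40, assuming the usual git-buildpackage layout in operations/debs/envoyproxy. The base branch name and version string below are illustrative, not the repo's actual state.)

    # In a checkout of operations/debs/envoyproxy: branch the packaging off the
    # current debian branch ("master" is a stand-in for whatever the repo really uses).
    git checkout -b envoy-future master

    # Record a candidate 1.16 package version in debian/changelog.
    dch -v 1.16.0-1 "Import Envoy 1.16 for evaluation ahead of production"

    # Build from the new branch instead of the default debian branch.
    gbp buildpackage --git-debian-branch=envoy-future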
[14:10:31] hey jayme, checking
[14:10:43] o/
[14:24:08] <_joe_> jayme: you're correct yes
[14:24:39] then something is quite off here
[14:25:08] So first things first, the service doesn't get any production traffic at this point, so it's not that the latency is affecting users
[14:25:18] <_joe_> yeah also it's eqiad
[14:25:21] <_joe_> but
[14:25:45] <_joe_> what is that "latency" measuring?
[14:26:02] <_joe_> express_router_request_duration_seconds_sum
[14:26:21] <_joe_> so that's an average
[14:26:24] <_joe_> not even p50
[14:27:09] but there are histograms as well
[14:27:14] <_joe_> jayme: another strange thing: the latency basically doubled
[14:27:33] indeed
[14:27:49] <_joe_> jayme: is push-not using the native prometheus support in service-runner?
[14:27:58] <_joe_> did you have to set up the statsd exporter I mean
[14:28:37] <_joe_> anyways there is a 10x difference between the two measures
[14:28:41] looking at staging it gets even more confusing as that shows a different (not less strange) pattern :)
[14:28:54] <_joe_> I'm thinking we're measuring two very different things
[14:29:16] <_joe_> I think envoy is measuring all requests, including the requests to monitoring and static pages like ?spec
[14:29:19] Guess so, yeah. But the "different thing" has doubled anyways
[14:29:29] <_joe_> jayme: can you try with a service that actually receives traffic?
[14:29:41] <_joe_> this isn't really indicative of anything
[14:29:48] _joe_: I don't see this for other services!
[14:29:55] <_joe_> also, you rolled back and the latency remained the same?
[14:30:10] I'm almost done rolling out the envoy update. This is a push-not only issue
[14:30:15] <_joe_> ok then it's definitely not the envoy update
[14:30:22] nono, it's not
[14:30:31] I also rolled back, yes
[14:30:36] <_joe_> there is something fundamentally broken there
[14:30:58] <_joe_> jayme: is it possible you also deployed a new version of push-not?
[14:31:04] Nope
[14:31:17] <_joe_> sorry, grasping at straws here :P
[14:31:31] I do see that :D
[14:34:39] I'm inclined to do a rolling restart of push-not in codfw without applying any change to see if that triggers this pattern as well
[14:34:53] <_joe_> +1
[14:34:57] <_joe_> but maybe on monday?
[14:35:06] 10serviceops, 10GrowthExperiments-NewcomerTasks, 10Operations, 10Product-Infrastructure-Team-Backlog: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10Joe) Adding some notes after yesterday's meeting: - the current script is using `sqlitedict` right now, and t...
[14:35:27] yeah, no rush there.
[14:37:13] The only thing that comes to mind that might be different about push-notifications compared to other node services is that it's using the `prometheus_metrics` branch of service-runner, but the changelog doesn't show anything that might be causing this issue
[14:39:18] hmm...
[14:40:08] nemo-yiannis: do you know what express_router_request_duration_seconds actually is?
[14:40:25] what is measured there, I mean
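(Aside: a rough way to see the gap _joe_ describes at 14:26, assuming the metric is exported as a standard Prometheus histogram with _bucket series; the query endpoint below is a placeholder for the relevant Prometheus/Thanos instance, and label matchers for push-notifications are omitted.)

    # $PROM is a placeholder for the Prometheus/Thanos query API.
    PROM=http://prometheus.example/api/v1/query

    # Mean latency: total duration divided by request count. A panel built on
    # express_router_request_duration_seconds_sum alone effectively shows this,
    # and it hides the slow tail.
    curl -sG "$PROM" --data-urlencode 'query=rate(express_router_request_duration_seconds_sum[5m]) / rate(express_router_request_duration_seconds_count[5m])'

    # p99 from the histogram buckets, which is where a doubling of the tail would show up.
    curl -sG "$PROM" --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(express_router_request_duration_seconds_bucket[5m])))'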
[14:42:48] _joe_: what indeed seems a bit slower with envoy 1.15 is the admin_interface downstream: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?viewPanel=13&orgId=1&from=now-3h&to=now&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=push-notifications&var-destination=local_service
[14:42:55] the time spent on each express request/response cycle
[14:43:47] <_joe_> jayme: that's truly irrelevant :P
[14:45:54] Sure, just wanted to share :D
[14:47:52] nemo-yiannis: are you able to reproduce the increased latency? Just to rule out issues with the measurement
[14:50:00] nemo-yiannis: also, if you zoom out, this has happened already some time ago: https://grafana.wikimedia.org/d/NQO_pqvMk/push-notifications?orgId=1&from=1600425021209&to=1602254944984&var-dc=eqiad%20prometheus%2Fk8s&var-service=push-notifications
[15:00:06] i sent some requests to the /_info endpoint and the latency checks out
[15:06:53] that's strange. I do get the absolute majority of responses in <100ms
[15:26:57] Here is what I get from deploy1001: https://phabricator.wikimedia.org/P12957
[15:29:54] yeah, same for me with ab... I had looked at curl before and that seems way better
[15:42:14] but that means we're both unable to catch those very high latencies that produce the p99...
[16:05:32] I am a bit late to the party, but just to add that push-not has limited resources for now
[16:05:51] I have not looked at the graphs, but keep it in mind
[16:06:26] o/
[16:07:18] \o/!
[16:07:25] Yeah. That's probably not the issue here. I basically just restarted the pods in eqiad for things to go south. Nothing else changed AFAICT
[16:08:01] alright
[16:08:52] nemo-yiannis: not the case here, but at some point we may need to give the service a bit more resources
[16:08:59] in which case you can just ping us
[16:09:36] I used to have a look at the graphs in the beginning but I don't any more, so give us a shout if you feel the service is starving
[16:09:55] sounds good
[16:14:31] nemo-yiannis: I'll head off for the weekend. Will take another look on monday. Maybe, if you can, enable some kind of access logging for the service to eventually get an idea of which requests take so long to complete
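(Aside: a sketch of the kind of spot check discussed above, aimed at the slow tail rather than the median; the host and port are placeholders for wherever /_info is actually reachable from.)

    # Time the endpoint many times and keep the slowest samples; a few manual curl
    # runs will mostly show the fast median and can easily miss what drives the p99.
    for i in $(seq 1 500); do
      curl -s -o /dev/null -w '%{time_total}\n' http://push-notifications.example:8080/_info
    done | sort -n | tail -5

    # Roughly the same with ApacheBench, as in the deploy1001 paste (P12957).
    ab -n 500 -c 5 http://push-notifications.example:8080/_info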
[18:47:56] 10serviceops, 10Operations, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10nettrom_WMF) @kostajh : Thanks for picking this up and pinging me about it. I think we should switch off EditorJourney since...
[23:24:51] 10serviceops, 10Operations: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Dzahn) There was already T245757 with dependency tickets.
[23:25:07] 10serviceops, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Deployment services): upgrade MediaWiki appservers to Debian 10 (buster) - https://phabricator.wikimedia.org/T245757 (10Dzahn) A new ticket for this has been created at T264991.
[23:26:07] 10serviceops, 10Operations: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Dzahn) As well as T250515 for the PHP packages.
[23:26:26] 10serviceops, 10Packaging: Please provide our special component/php72 in buster-wikimedia - https://phabricator.wikimedia.org/T250515 (10Dzahn) T264991#6532447
[23:31:14] 10serviceops, 10Operations: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Dzahn) > Fix all of our puppet code for MediaWiki for incompatibilities with buster I applied the puppet role on a buster test VM in eqiad and the following packages are missing:...
[23:51:39] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Legoktm) >>! In T264991#6533968, @Dzahn wrote: > - ploticus {T253377} > - php7.2-opcache > - php7.2-common These should be php7.3 now.
[23:53:53] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Dzahn)
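(Aside: a quick check one could run on the buster test VM for the packages listed above; buster ships PHP 7.3, so the php7.2-* names are only expected to resolve if the component/php72 packages from T250515 are published for buster-wikimedia.)

    # On the buster host with the Wikimedia apt sources configured:
    apt-cache policy ploticus php7.2-opcache php7.2-common php7.3-opcache php7.3-common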