[04:05:35] serviceops, Graphoid, Operations, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Platform Team (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[04:06:04] serviceops, Graphoid, Operations, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Platform Team (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[04:06:08] serviceops, Graphoid, Operations, Patch-For-Review, Platform Team (Icebox): Undeploy graphoid for phase 1 wiki's - https://phabricator.wikimedia.org/T257402 (Jseddon) Open→Resolved
[04:11:43] serviceops, Graphoid, Operations, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Platform Team (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[04:23:35] serviceops, Graphoid, Operations, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Platform Team (Icebox): Undeploy graphoid for phase 2 wiki's - https://phabricator.wikimedia.org/T258463 (Jseddon)
[04:26:49] serviceops, Graphoid, Operations, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid for phase 2 wiki's - https://phabricator.wikimedia.org/T258463 (Jseddon)
[07:50:07] serviceops, Prod-Kubernetes, Kubernetes: Move termbox to use TLS only - https://phabricator.wikimedia.org/T254581 (JMeybohm) p:Triage→Medium
[07:53:37] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move citoid to use TLS only - https://phabricator.wikimedia.org/T255868 (JMeybohm) p:Triage→Medium
[07:53:45] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 (JMeybohm) p:Triage→Medium
[07:53:53] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-analytics to use TLS only - https://phabricator.wikimedia.org/T255870 (JMeybohm) p:Triage→Medium
[07:53:59] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm) p:Triage→Medium
[07:54:03] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-logging-external to use TLS only - https://phabricator.wikimedia.org/T255872 (JMeybohm) p:Triage→Medium
[07:54:10] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-main to use TLS only - https://phabricator.wikimedia.org/T255873 (JMeybohm) p:Triage→Medium
[07:54:15] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventstreams to use TLS only - https://phabricator.wikimedia.org/T255874 (JMeybohm) p:Triage→Medium
[07:54:22] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (JMeybohm) p:Triage→Medium
[07:54:25] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (JMeybohm) p:Triage→Medium
[07:54:35] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 (JMeybohm) p:Triage→Medium
[07:54:43] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move cxserver to use TLS only - https://phabricator.wikimedia.org/T255879 (JMeybohm) p:Triage→Medium
[08:14:55] serviceops, Patch-For-Review: Replace rdb200[34] with rdb200[78] - https://phabricator.wikimedia.org/T255250 (akosiaris) >>! In T255250#6321309, @jijiki wrote: > rdb2003 is primary for ores2* servers and changeprop/changeprop-jobqueue. > > * changeprop/changeprop-jobqueue is generally ok with briefly l...
[08:47:19] serviceops, Operations: Update deprecated extension names in envoy config - https://phabricator.wikimedia.org/T258140 (Joe) p:Triage→High a:Joe I think we should just move to the v3 api as soon as possible. I'll think of how to test it easily.
[09:33:23] serviceops, Patch-For-Review: Replace rdb200[34] with rdb200[78] - https://phabricator.wikimedia.org/T255250 (jijiki) >>! In T255250#6321852, @akosiaris wrote: >>>! In T255250#6321309, @jijiki wrote: >> rdb2003 is primary for ores2* servers and changeprop/changeprop-jobqueue. >> >> * changeprop/changep...
[10:32:03] akosiaris: just a quick check to be sure I recall correctly. Both for k8s and ganeti instances the cali*/tap* interfaces are all dynamic and "useless" from a tracking point of view in Netbox right?
[10:32:17] volans: yes
[10:32:26] and from a puppetdb PoV
[10:32:36] they are in puppetdb AFAIK right now
[10:32:37] so if you want to get rid of a ton of useless facts
[10:32:48] that would solve 2 stones with one bird
[10:32:54] as we import in netbox from puppetdb
[10:33:05] so we could directly remove them from puppetdb facts
[10:33:10] is what you're saying?
[10:33:34] yes, exactly that
[10:33:42] great, I'll look into that, thanks!
[11:17:09] akosiaris: unfortunately it's more complex than it seems, there is no easy way to override a built-in fact AFAICT and adding a custom one filtered would only add stuff, not remove it
[11:17:38] I think I'll filter them at the puppetdb-proxy that we're using for the netbox import, so that others might take advantage of the filtered view in the future
[11:24:32] dear serviceops, we can consider at some point to edit this dashboard
[11:24:34] https://gerrit.wikimedia.org/r/projects/wikimedia,dashboards/sre-serviceops:main
[11:24:42] oh no wait
[11:25:15] abort abort
[11:28:35] https://gerrit.wikimedia.org/r/p/wikimedia/+/dashboard/teams:sre-serviceops
[11:28:38] ^ that one
[11:29:15] timo gave me a tl;dr how to do it
[11:53:33] <_joe_> effie: be bold
[12:02:34] serviceops, Operations, Traffic, affects-Kiwix-and-openZIM: ETAG response headers not always with double-quotes - https://phabricator.wikimedia.org/T256217 (ema) As it turns out, this is due to a deliberate bug in Swift: https://bugs.launchpad.net/swift/+bug/1099087 > When swift was originally...
[13:38:30] you should advertise that more widely to the team; maybe you did and I was not paying attention, but if not...
[14:07:56] oh I had to run :/
[14:08:15] that is why I didn't sell it enough
[14:27:51] serviceops, Operations, Wikimedia-Apache-configuration: Encoding discrepancy in https://wikipedia.org redirects - https://phabricator.wikimedia.org/T257608 (Joe) p:Triage→Low
[14:35:15] https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=34&fullscreen&orgId=1&from=now-1h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=mobileapps&var-container_name=All&refresh=5m sigh. I'll revert back to 72%
[14:36:17] _joe_: jayme ^
[14:36:48] Oops :-o
[14:36:49] I'll revert back to 72% and will have a look tomorrow. But funny how increasing traffic from 72% to 96% threw the system into unstable equilibrium
[14:37:18] pods started getting killed every now and then btw
[14:37:35] <_joe_> akosiaris: need more pods?
[14:38:36] could be. Not easy to tell. Per saturation panels, CPU and memory are ok
[14:38:47] there is a lot of throttling however that I cannot account for
[14:38:58] e.g. https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=78&fullscreen&orgId=1&from=1595338489749&to=1595342071227&var-dc=codfw%20prometheus%2Fk8s&var-service=mobileapps&var-container_name=All
[14:39:11] with the limit being at 1.2s.. why is it being throttled so much?
[14:39:15] is it again 512c999 ?
[14:45:11] logs point to ETIMEDOUT. 171 times vs 22 for null, 10 for unknown error and 2 for upstream request timeout (so restbase envoy?)
[14:47:40] hmm there seem to be service-runner worker deaths as well.
[14:48:05] sigh
[14:48:25] I am an idiot. avg() led me down the wrong path
[14:48:47] when switching https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=78&fullscreen&orgId=1&from=now-1h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=mobileapps&var-container_name=All&refresh=5m to max the throttling becomes obvious
[14:50:23] akosiaris: https://wikitech.wikimedia.org/wiki/User:CDanis/Use_more_heatmaps ;)
[14:50:42] (... I really need to put actual text on that page ...)
[14:51:27] good point
[14:52:12] guess I'll need to revisit my approach. Since I need to look at the bright side, at least this is codfw and the US is only now starting to wake up
[14:55:31] and of course it's memory exhaustion :(
[14:55:47] should have caught it earlier, only to be misled by my own faulty gauges
[16:09:16] Hi dear serviceops. I have a question, not sure if I'm addressing it to the correct group of people though.. Do we have a capability to run a fork of a composer library?
[16:09:40] temporarily, we're working on changes accepted upstream, but it's going very slowly
[16:12:58] maybe I should ask releng..
[16:16:42] hi Pchelolo, not ignoring you, I just have no idea what the answer is :)
[16:17:05] +1
[16:18:17] asked releng as well
[16:18:19] rzl: are you in the mood of going over *a few* CRs for *some* values.yaml files? :D
[16:18:31] ahaha
[16:18:42] this is a trap but I am stepping into it willingly and with my eyes open
[16:18:42] sure
[16:19:02] it's a trap, obviously :)
[16:20:15] Hope I got them all... :/
[16:20:53] rzl: https://gerrit.wikimedia.org/r/q/topic:%22envoy_1.14.4%22+(status:open%20OR%20status:merged)
[16:23:41] ahahaha looking
[16:23:45] thank you for doing all those
[16:24:20] what a mess, we ought to hire somebody to re-engineer all this
[16:24:55] j.o.e started looking into it I think
[16:25:14] I mean I was pretty sure it was going to be you <_< but either way!
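On the avg() vs max point from 14:48: a minimal sketch, with invented numbers, of how averaging a per-pod throttling ratio can hide one badly throttled pod that a max (or a heatmap) would show. The pod names, ratios and the 0.5 threshold are hypothetical and are not read from the mobileapps dashboard.

```python
from statistics import mean

# Hypothetical fraction of CPU periods throttled, per pod.
# These values are made up for illustration only.
throttled_ratio_by_pod = {
    "pod-a": 0.02,
    "pod-b": 0.01,
    "pod-c": 0.03,
    "pod-d": 0.85,  # a single pod that is throttled most of the time
}


def summarize(ratios):
    """Return the two aggregations being compared: average and maximum."""
    values = list(ratios.values())
    return {"avg": mean(values), "max": max(values)}


def pods_over(ratios, threshold):
    """List the pods whose throttling ratio exceeds the given threshold."""
    return sorted(pod for pod, ratio in ratios.items() if ratio > threshold)


if __name__ == "__main__":
    summary = summarize(throttled_ratio_by_pod)
    print(f"avg={summary['avg']:.2f} max={summary['max']:.2f}")
    print("pods over 0.5:", pods_over(throttled_ratio_by_pod, 0.5))
```

The average stays low because three quiet pods dilute the one in trouble, which is roughly the situation described above; the max and the per-pod listing surface it right away.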
[16:25:59] Yeah...but looks like I'm still stuck with other stuff :D
[16:26:32] We should at least start to enforce a yaml style so that we can patch those using yq or so without having to deal with too big diffs
[16:27:14] (I might have tried to sneak some style fixes in here and there, though)
[16:27:22] Thanks for looking over that
[16:27:39] that sounds good to me
[16:28:19] and, looking now -- do you know where the default is configured, for cases where "image_version" isn't set? should we just be bumping that instead?
[16:29:20] (in particular I'm wondering if this has already been factored out to a single variable somewhere since the last time we had to do this, and the Envoy wikitech page just didn't get updated)
[16:29:48] Good point, need to bump that too! But that's in the tls_helper.tpl ... changing that would require us bumping each chart version
[16:30:01] ahh okay, so we have to do something across the board anyway
[16:30:08] yep :-/
[16:30:13] makes sense, thanks
[16:30:47] We need to restructure the helmfile stuff to be able to have fleet wide defaults I guess
[16:31:50] nod
[16:32:03] ello - not to complicate ongoing envoy stuff, but we need to package envoy 0.15.0 for the API gateway. Would that cause any problems? We don't plan on rolling it out to anywhere else but we would like to build the corresponding docker image
[16:32:26] (1.15.0, right?)
[16:32:27] * jayme panics
[16:33:15] haha oops, yes - 1.15.0
[16:34:13] I can't think of any reasons that would cause trouble, except for maybe spurious debmonitor "hey this is ready for an upgrade" messages
[16:34:37] we haven't evaluated any of the deprecations/behavior changes from 1.14, is the only reason we didn't go straight there fleetwide
[16:34:57] _j.oe_ and jayme could tell you for real though
[16:35:04] but new servers would install 1.15.0 then if we import that...
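On the 16:26 to 16:29 exchange about patching values.yaml files without large diffs: a hedged sketch of a line-based bump of `image_version` that leaves every other byte of the file alone, as one alternative to a YAML round-trip that might reflow the file. Only the key name `image_version` comes from the chat; the sample file layout and both version strings are hypothetical.

```python
import re

# Matches a line such as `  image_version: 1.2.3  # comment`, keeping the
# indentation, optional quotes and any trailing comment in separate groups.
_IMAGE_VERSION = re.compile(
    r'''^([ \t]*image_version:[ \t]*)(["']?)([^\s"'#]+)\2(.*)$''',
    re.MULTILINE,
)


def bump_image_version(yaml_text: str, new_version: str) -> str:
    """Replace the value of every `image_version:` key in the text.

    Indentation, quoting style and trailing comments are preserved, so the
    resulting diff only touches the matched lines.
    """
    return _IMAGE_VERSION.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{new_version}{m.group(2)}{m.group(4)}",
        yaml_text,
    )


# Hypothetical values.yaml fragment; the real charts may be laid out differently.
sample = """\
main_app:
  image: wikimedia/example
tls:
  enabled: true
  image_version: 1.13.1-2  # keep this comment
"""

if __name__ == "__main__":
    print(bump_image_version(sample, "1.14.4-1"))
```

This does not help with the default that lives in tls_helper.tpl, which, as noted in the chat, still needs its own bump and a chart version bump.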
[16:35:13] hmm, that's true
[16:35:21] the deb repo just holds the latest version...that's pretty bad
[16:35:36] yeah of course you're right :/
[16:36:00] what I did for helm (2 vs. 3) was using a different package name (helm vs. helm3)
[16:36:05] oh, argh. I won't do that in a hurry in that case
[16:36:27] jayme: hmm, that could work for the time being.
[16:36:58] once we've got this upgrade finished we can also start looking at 1.15 and see if we can just upgrade to it, that'd be the simplest if there aren't any incompatible changes
[16:37:31] but I don't know how urgent your 1.15 upgrade for the API gateway is
[16:37:39] if you can't wait for us, jayme's solution makes sense
[16:37:54] (I've added one more CR rzl)
[16:37:58] 👍
[16:39:12] I'll have a think and I'll get back to you. I'd like to avoid creating any hassle for the time being but unfortunately 1.15 includes some of the team's changes that we need for rate limiting
[16:40:09] nod, understood
[16:40:31] jayme: I think I'm caught up, let me know if you're still waiting for any
[16:42:37] rzl: all done, great. Thanks!
[16:42:46] thank you!
[16:43:27] going to deploy all that tomorrow (well, see how far I get without setting something on fire :P)
[17:32:09] <_joe_> my suggestion would be to upload 1.15.0 in a separate component
[17:32:20] <_joe_> something component/apigateway
[17:33:01] rzl: called it
[17:34:34] <_joe_> or using the official envoy image until we catch up :P
[17:36:29] I would like for us to make at least one thing about the envoy build and deployment process incrementally LESS crazy before we make it crazier
[17:36:39] I don't care what that thing is, I would just like to see us moving in that direction
[17:37:04] <_joe_> ok, I am indeed working on restructuring helmfile :)
[17:37:31] <_joe_> there is one thing we can't avoid... bumping all charts if we change the envoy config because of deprecations
[19:46:53] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team, Patch-For-Review: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (Krinkle) >>! In T245464#5891265, @Krinkle wrote: > Some ideas: > > * There is a PECL extensi...