[05:22:06] 10serviceops, 10Operations: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10Marostegui) p:05Triage→03Medium
[05:23:16] 10serviceops, 10Operations, 10Security-Team, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10Marostegui) p:05Triage→03Medium
[05:42:56] 10serviceops, 10Operations, 10Security-Team, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10Dzahn) a:03Dzahn
[06:11:58] 10serviceops, 10Operations, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10elukey)
[06:43:30] 10serviceops, 10Operations: cronspam for slow queries in PageAssessments - https://phabricator.wikimedia.org/T197564 (10Marostegui) 05Open→03Resolved Closing this as the last email is from 4th Feb 2019
[07:31:05] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Traffic, and 4 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (10Joe) This change was released to production to all wikis yesterday. The effect can be seen in this 12h moving average of purge r...
[09:12:43] _joe_: there is that one appserver, mw1280, which has been breaking multiple times in the past. while it would theoretically still have a year of life, dcops is asking if we can just decom it. i tend to agree that a single server does not make a difference for us. it would be the 5th ticket, and it already had CPU and RAM replaced before... it's a lemon
[09:13:04] <_joe_> yeah just kill it.
[09:13:09] ack
[09:14:22] 10serviceops, 10Operations, 10ops-eqiad: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10Dzahn) @wiki_willy Yea, we agree we can just decom the server at this point.
[09:52:15] 10serviceops, 10Operations, 10Traffic, 10Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10Joe) >>! In T133821#6118058, @aaron wrote: >>>! In T133821#6092867, @Joe wrote: >> At a later time, we could think of changing the logic, and make purges avoid ra...
[09:58:50] 10serviceops, 10Operations, 10Security-Team, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10Dzahn) @chasemp Did you want peek1001 or should we rather use something a bit more generic like sectool1001? Will it maybe hos...
[10:03:43] 10serviceops, 10Operations, 10Security-Team, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10MoritzMuehlenhoff) Why does this need a complete VM, though? If this simply sends some notifications triggered by cron jobs, si...
[10:53:51] 10serviceops, 10Operations, 10ops-eqiad, 10Patch-For-Review: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1280.eqiad.wmnet` - mw1280.eqiad.wmnet (**FAIL**) -...
[10:54:45] 10serviceops, 10Operations, 10ops-eqiad, 10Patch-For-Review: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1280.eqiad.wmnet` - mw1280.eqiad.wmnet (**FAIL**) -...
[11:04:16] 10serviceops, 10Operations, 10ops-eqiad, 10Patch-For-Review: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10Dzahn) @wiki_willy @Jclark-ctr We decom'ed mw1280 on our end and you can remove it any time.
[12:31:56] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Traffic, and 4 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (10Ladsgroup) Amazing work. Thank you!
[12:49:29] _joe_: jayme was wondering about adding TLS support via the envoy sidecar to the kask chart. It does seem unnecessary, since kask already has TLS support itself. However, having envoy there would a) promote consistency, b) move the burden of maintaining TLS ciphers and all that from CPT to SRE, who are already paying it anyway, and c) give us some telemetry stats as well
[12:49:42] what's your opinion?
[12:50:03] <_joe_> I was explicitly avoiding putting envoy there
[12:50:25] <_joe_> there is also the fact that kask latencies are very important. We could experiment :)
[12:51:15] why were you avoiding it?
[12:51:31] <_joe_> because it seemed like an unnecessary complication
[12:51:48] <_joe_> kask is never going to call other services, it's basically a storage service
[12:53:58] it's true we aren't using the telemetry stats yet, so c) doesn't apply yet. And yes, latency is pretty important indeed.
[12:55:02] we could run some experiments and see how much latency envoy generally adds, I guess
[12:55:19] why is "kask is never going to call other services" an argument against envoy? Just because we would not need its functionality as an outgoing proxy there?
[12:55:26] not sure it's worth it right now, though, on the other hand
[12:56:21] jayme: yeah, but on the other hand we don't use it as an outgoing proxy for any services in k8s right now either.
[12:58:50] hmm https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-destination=sessionstore
[12:59:09] I wonder how much envoy would add to that
[12:59:36] I don't feel as if I'm able to estimate the "burden" part akosiaris mentioned. But if that's not such a big deal, and we aren't missing the metrics envoy brings in, I would probably opt for "keep as is" for now...
[13:01:36] it's becoming an interesting experiment indeed, not an easy decision for sure. I guess we can postpone it for now and revisit it at some later point, with more time and energy to run an experiment and see what happens
[14:04:34] <_joe_> akosiaris: I remember booking.com reported a ~1ms cost for adding envoy + TLS in front of all services
[14:16:14] 1ms? that's borderline not measurable in our infrastructure
[14:23:14] <_joe_> I'm more worried about the "more moving parts" issue
[14:51:38] akosiaris: o/ i think i might be missing a network rule for eventgate-analytics-external
[14:52:12] am trying to find the places to look for that and I can't remember them now
[14:52:23] it needs to talk to https://schema.discovery.wmnet
[14:52:41] * akosiaris looking
[14:52:44] eventgate-analytics should have that too
[14:53:39] eventgate-analytics already has it
[14:53:50] oh, and analytics-external as well
[14:54:04] port 8190
[14:54:50] ottomata: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/helmfile.d/admin/common/calico/default-kubernetes-policy.yaml#239
[14:56:10] hm
[14:56:42] should be port 80...?
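(Editor's note: the port question above is resolved in the exchange just below; since eventgate reaches schema.discovery.wmnet over TLS via envoy, the rule ends up needing port 443 alongside 8190. Purely as a hedged illustration of what such an egress stanza can look like in Kubernetes NetworkPolicy form, not the actual default-kubernetes-policy.yaml; the policy name, labels, and CIDR are hypothetical.)

```yaml
# Illustrative sketch only, not the real Wikimedia calico policy.
# Only the ports come from the chat: 8190 for the plain-HTTP schema
# endpoint, 443 for the TLS (envoy) endpoint added afterwards.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-schema-egress     # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: eventgate            # hypothetical label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.2.2.0/24   # placeholder for the service VIP range
      ports:
        - protocol: TCP
          port: 8190
        - protocol: TCP
          port: 443
```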
[14:57:23] hm
[14:57:31] looking, hang on
[15:00:03] ah akosiaris, tls envoy
[15:00:15] i guess we need port 443 there too
[15:03:07] there is also profile::tlsproxy::envoy::tls_port: if for some reason you need to use a different one
[15:03:21] aye, 443 is good
[15:03:26] akosiaris:
[15:03:26] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/595953
[15:06:47] 10serviceops, 10Operations, 10Patch-For-Review: upgrade people.wikimedia.org backend to buster - https://phabricator.wikimedia.org/T247649 (10Dzahn)
[15:12:28] akosiaris: will I need to upgrade the eventgate-analytics-* releases?
[15:12:37] nope
[15:13:18] applying change now
[15:17:36] ottomata: you should be good to go now.
[15:17:50] checking
[15:18:26] ok, new problem heh
[15:18:26] unable to verify the first certificate
[15:28:28] i guess node needs to be given the puppet root ca?
[15:31:12] Pchelolo: have you encountered this?
[15:31:26] node is connecting to an internal envoy tls endpoint
[15:31:45] the internal envoy endpoint uses the puppet-signed cert
[15:42:33] ottomata: hnowlan can explain how we fixed that
[15:42:56] I think we've had the same problem...
[15:43:29] ok cool, anticipating some incoming wisdom from hnowlan then... :p
[15:43:44] ottomata: more or less this :) https://phabricator.wikimedia.org/source/operations-deployment-charts/browse/master/charts/changeprop/templates/deployment.yaml$79
[15:44:04] it is the least-gross option that we could come up with at the time
[15:44:17] ha, yeah that is what I was about to do.
[15:44:26] ok
[15:44:55] wow, finally there's something in k8s we are not borrowing from ottomata! the tides have turned! :)
[15:45:13] :D
[15:47:42] heheheh
[16:10:12] 10serviceops, 10Operations, 10ops-eqiad: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10wiki_willy) Thanks @Dzahn . @Jclark-ctr - I'll move this task over to the "decommission" column on the workboard.
[16:35:54] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Traffic, and 4 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (10Krinkle) 05Open→03Resolved a:05daniel→03Krinkle
[16:36:00] 10serviceops, 10Core Platform Team, 10Operations, 10Traffic, and 2 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (10Krinkle)
[16:42:35] 10serviceops, 10ChangeProp, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: Improve resource-purge request_id and dt propagation - https://phabricator.wikimedia.org/T252127 (10Pchelolo)
[16:42:43] 10serviceops, 10Operations, 10Traffic, 10Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10Pchelolo)
[16:44:52] 10serviceops, 10ChangeProp, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: Improve resource-purge request_id and dt propagation - https://phabricator.wikimedia.org/T252127 (10Krinkle) > 'dt' field would contain the timestamp when the resource being purged has changed. If this sche...
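(Editor's note: the changeprop fix hnowlan links at 15:43:44 amounts to handing the container the puppet root CA so Node can verify the envoy endpoint's puppet-signed certificate. Below is a minimal sketch of that general pattern; NODE_EXTRA_CA_CERTS is Node.js's standard environment variable for appending trusted CAs, but the ConfigMap, paths, and names here are hypothetical, not the actual changeprop template.)

```yaml
# Sketch of a pod template fragment, assuming a ConfigMap named
# "puppet-ca" already carries the Puppet root CA; all names invented.
spec:
  containers:
    - name: eventgate                 # hypothetical container
      env:
        # Make Node.js trust the Puppet CA in addition to its
        # bundled well-known CAs (supported since Node 7.3).
        - name: NODE_EXTRA_CA_CERTS
          value: /etc/ssl/puppet/ca.crt
      volumeMounts:
        - name: puppet-ca
          mountPath: /etc/ssl/puppet
          readOnly: true
  volumes:
    - name: puppet-ca
      configMap:
        name: puppet-ca
```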
[16:47:49] <_joe_> ottomata: so the important part is to get the puppet ca from a values file that's written by puppet, not embedding it as-is, like today
[16:52:30] 10serviceops, 10MediaWiki-Cache, 10Performance-Team, 10Sustainability (Incident Prevention): Let WANObjectCache store "sister keys" on the same backend as the main value key - https://phabricator.wikimedia.org/T252564 (10Krinkle)
[16:53:00] 10serviceops, 10MediaWiki-Cache, 10Performance-Team, 10Sustainability (Incident Prevention): Let WANObjectCache store "sister keys" on the same backend as the main value key - https://phabricator.wikimedia.org/T252564 (10Krinkle) > From **Gerrit**: > [mediawiki/core] objectcache: add "coalesceKeys" option...
[16:56:02] 10serviceops, 10MediaWiki-Cache, 10Performance-Team, 10Patch-For-Review, 10Sustainability (Incident Prevention): Let WANObjectCache store "sister keys" on the same backend as the main value key - https://phabricator.wikimedia.org/T252564 (10Krinkle) > From **Gerrit** by @aaron: > [mediawiki/core] objectc...
[16:56:06] _joe_: ya, that is already happening; was using that for the kafka ca crt, now also for this
[16:56:40] 10serviceops, 10MediaWiki-Cache, 10Performance-Team, 10Patch-For-Review, 10Sustainability (Incident Prevention): Let WANObjectCache store "sister keys" on the same backend as the main value key - https://phabricator.wikimedia.org/T252564 (10Krinkle)
[16:57:28] 10serviceops, 10MediaWiki-Cache, 10Performance-Team, 10Patch-For-Review, 10Sustainability (Incident Prevention): Let WANObjectCache store "sister keys" on the same backend as the main value key - https://phabricator.wikimedia.org/T252564 (10Krinkle) a:03aaron
[16:57:39] 10serviceops, 10MediaWiki-Cache, 10Performance-Team, 10Patch-For-Review, 10Sustainability (Incident Prevention): Let WANObjectCache store "sister keys" on the same backend as the main value key - https://phabricator.wikimedia.org/T252564 (10Krinkle) p:05Triage→03Medium
[17:17:29] 10serviceops, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (10Papaul)
[18:01:11] 10serviceops, 10ChangeProp, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: Improve resource-purge request_id and dt propagation - https://phabricator.wikimedia.org/T252127 (10Pchelolo) > It seems to me like the only thing useful for the purge handler is the timestamp of when the pu...
[18:02:10] 10serviceops, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (10Papaul)
[18:24:39] 10serviceops, 10Operations, 10Security-Team, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) >>! In T252210#6128718, @MoritzMuehlenhoff wrote: > Why does this need a complete VM, though? If this simply sends som...
[20:51:53] 10serviceops, 10Anti-Harassment, 10Operations, 10Traffic: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (10aezell) I wanted to clarify that this is just in the experiment and investigation stage. We want to start a discussion about using MaxMind to g...
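(Editor's note: _joe_'s point at 16:47:49 is that the CA should flow in through a values file that puppet writes, rather than being committed into the chart; ottomata replies at 16:56:06 that the same mechanism already carries the Kafka CA cert. A hypothetical sketch of such a puppet-rendered values file follows; the key names are invented for illustration.)

```yaml
# Hypothetical values file rendered by puppet and consumed by
# helmfile; key names are invented. The PEM body would be injected
# by puppet from its CA file on disk, so the chart never embeds the
# certificate and rotation stays with puppet.
tls:
  ca_certificates:
    puppet_ca: |
      -----BEGIN CERTIFICATE-----
      ...Puppet root CA PEM, filled in by puppet...
      -----END CERTIFICATE-----
```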
[22:09:35] 10serviceops, 10Operations, 10observability: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10CDanis)
[22:11:40] 10serviceops, 10Operations, 10observability: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10CDanis) p:05Triage→03Medium
[22:20:44] 10serviceops, 10Operations, 10observability: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10Krinkle) >>! @CDanis wrote: > (The reason for the difference between the per-process reported state and the Status: "Processes active: 0, idle 8 state that php-fpm alread...
[22:32:16] 10serviceops, 10Operations, 10observability: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10CDanis) >>! In T252605#6131597, @Krinkle wrote: >>>! @CDanis wrote: >> (The reason for the difference between the per-process reported state and the Status: "Processes ac...