[06:51:11] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (jcrespo) FYI backups on these hosts `etcd[1001-1003]` are still configured on bacula, and failing to run as the...
[07:02:23] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (jcrespo) ^this should be merged before closing this ticket :-)
[07:29:47] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1001 for hosts: `etcd[1001-1003].eqi...
[07:36:39] serviceops, OTRS, Operations, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (akosiaris) I'll split this off in its own task, but worthy to point out in order not to forget it. Znuny's QuickClose package seems to be...
[08:08:28] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (JMeybohm) Open→Resolved Hosts decommissioned, puppet, ignore_list and dns clean. Thanks @jcrespo and @V...
[09:08:13] serviceops, Product-Infrastructure-Team-Backlog, Push-Notification-Service: Set up secrets for Token clean-up - https://phabricator.wikimedia.org/T262957 (jijiki)
[09:08:31] serviceops, Product-Infrastructure-Team-Backlog, Push-Notification-Service, User-jijiki: Set up secrets for Token clean-up - https://phabricator.wikimedia.org/T262957 (jijiki)
[09:22:44] _joe_: since refining those connection reuse params there's been 0 connections destroyed with active requests https://grafana.wikimedia.org/d/UOH-5IDMz/api-gateway?viewPanel=38&orgId=1&from=1600161750381&to=1600248150381&refresh=30s&var-datasource=codfw%20prometheus%2Fk8s
[09:23:12] which I think means no more user-visible issues with the gateway connecting to upstream
[09:23:24] <_joe_> hnowlan: so what did you set as max connections?
[09:23:31] <_joe_> 1000 like we did in the service proxy?
[09:23:40] yep
[09:24:03] still seeing those "rc=-1" openssl errors in the logs however, but maybe they're at least contained
[09:24:06] <_joe_> we still saw occasional failure, but it's like 1 in 10M requests, which I'm sure is outweighed by the gains we had by using persistent connections
[09:31:47] serviceops, OTRS, Operations, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (jcrespo) All m2 dbs are back to sync with primary server: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=6&orgId=1&var-server=d...
[12:44:32] serviceops, Operations, ops-codfw: mw2256 went down with thermal issues / fail-safe voltage is out of range - https://phabricator.wikimedia.org/T263022 (MoritzMuehlenhoff)
[13:42:21] <_joe_> akosiaris: is there a good reason to still have proton code in puppet?
[13:43:36] _joe_: no, please take out your axe :-)
[13:43:44] pretend it's OCG :P
[13:43:48] <_joe_> ahah
[13:43:53] which it is btw... ocgv4 or something
[13:46:25] we also still have the proton* hosts up and running
[13:47:03] <_joe_> yeah I was about to say
[13:47:47] <_joe_> proton-http is served by the proton cluster
[14:02:36] FYI wtp1042 got rebooted
[14:02:58] not sure if related to upgrade work or pdu work
[14:03:06] in the latter case might have the wrong cabling
[14:14:14] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (Joe)
[14:23:55] <_joe_> volans: I would guess dcops is the channel where you want to say that
[15:16:05] quick one for you: I noticed that restbase-async has the same A record of restbase (as opposed to a CNAME). What's the context/future plan for them? To make sure we represent them correctly in Netbox
[15:28:18] <_joe_> restbase-async normally runs in codfw, to spread the load
[15:29:57] I guess it couldn't be a CNAME right?
[15:30:43] yeah, they're only temporarily colocated because of the switchover
[15:31:32] restbase is a tourist in codfw, restbase-async is a permanent resident
[15:31:48] rzl: I'm referring to the fact that they use the same IP and related DNS record
[15:34:14] serviceops, Push-Notification-Service, Product-Infrastructure-Team-Backlog (Kanban), User-jijiki: Set up secrets for Token clean-up - https://phabricator.wikimedia.org/T262957 (LGoto) p:Triage→High a:MSantos
[15:56:17] <_joe_> volans: what do you mean? it's a discovery record
[15:56:24] <_joe_> they can point to different IPs
[15:56:38] <_joe_> say restbase is pooled in eqiad, and restbase-async is pooled only in codfw
[15:56:44] <_joe_> the IPs would be different
[15:56:56] sure, I'm referring to the svc records
[15:57:04] restbase 1H IN A 10.2.2.17
[15:57:04] restbase-async 1H IN A 10.2.2.17
[15:57:16] <_joe_> they have separate svc records? ok so maybe I didn't get that
[15:57:23] <_joe_> yes it's useless now
[15:57:38] it's the only case in which we have 2 names for the same IP in svc
[15:57:49] and wanted to understand if it's a use case to support or an exception
[15:57:59] <_joe_> _joe_> yes it's useless now
[15:58:06] <_joe_> it's a relic of pre-discovery times
[15:58:59] * volans lost
[15:59:14] discovery usually points to one of the svc
[16:02:28] but if it can be deleted, by all means :)
[16:10:13] serviceops, MediaWiki-General, Operations, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe)
[16:11:05] <_joe_> discovery does NOT point to the svc, it handles the IPs itself
[16:11:58] <_joe_> anyways, we can revisit this tomorrow, when I'm fresh :P
[16:29:37] serviceops, Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (jijiki)
[16:44:16] serviceops, Operations, User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (jijiki)
[17:19:33] serviceops, MW-on-K8s, Operations, TechCom-RFC, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (daniel) This RFC is up for public discussion today at 21:00 UTC (23:00 CEST, 2pm PDT). The discussion is taking place on IRC,...
[17:28:29] serviceops, Product-Infrastructure-Team-Backlog, Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Mholloway) Ref {T255878}
[17:28:45] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Mholloway)
[17:36:00] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Mholloway) I suspect that this line is our culprit. Does it need to be updated to use https? https://gerrit.wikimedia.org/r/plugins/giti...
[17:38:18] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (MSantos) >>! In T263043#6467333, @Mholloway wrote: > I suspect that this line is our culprit. Does it need to be updated to use https? h...
[17:38:33] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Joe) >>! In T263043#6467333, @Mholloway wrote: > I suspect that this line is our culprit. Does it need to be updated to use https? https...
[17:48:13] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Mholloway) For the record, the service doesn't seem to be having any trouble in staging: ` mholloway-shell@deploy1001:/srv/deployment-c...
[17:53:01] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Joe) The configuration is exactly the same for staging and eqiad/codfw right now, so I'm not sure what's going on here. I'll have to dig...
[18:04:43] serviceops, MW-on-K8s, Operations, TechCom-RFC, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (dr0ptp4kt) First of all, cool stuff! Second, I noticed the following: > I previously considered making the service be aware...
[18:04:57] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (MSantos) p:Unbreak!→High Keeping this alive until we can figure out the root cause. But this should not be broken anymore.
[18:05:23] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Joe) a:Joe The revert solved the issue for now. Still, we need to figure out what was going wrong, most apparently in the wikifeeds -...
[18:14:29] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Joe) for the record, I just confirmed: eqiad gives a correct result as well: ` $ curl -s http://wikifeeds.svc.eqiad.wmnet:8889/en.wikipe...
[18:18:00] serviceops, Operations, User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (jijiki) Open→Stalled TL;DR: we were unable to reproduce a corruption in this iteration * I run the full set of URLs a few times using `opcache.protect_memory = 1`....
[18:18:04] serviceops, Performance-Team, Patch-For-Review, Sustainability (Incident Followup), User-jijiki: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (jijiki)
[18:18:18] serviceops, Operations, User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (jijiki) p:Triage→Low
[18:36:37] serviceops, Push-Notification-Service, Product-Infrastructure-Team-Backlog (Kanban), User-jijiki: Set up secrets for Token clean-up - https://phabricator.wikimedia.org/T262957 (Mholloway) FYI, for the Beta Cluster deployment, just yesterday I added a local commit to labs/private.git on deployment...
[18:37:00] serviceops, MW-on-K8s, Operations, TechCom-RFC, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (dpifke) >>! In T260330#6408193, @tstarling wrote: > Has anyone got an idea for giving the HMAC key to the server without allow...
[19:00:18] serviceops, MW-on-K8s, Operations, TechCom-RFC, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (daniel) >>! In T260330#6467549, @dpifke wrote: > The kernel keyring can have a key loaded onto it (via `keyctl`) which is usab...
[19:06:10] serviceops, MW-on-K8s, Operations, TechCom-RFC, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (dpifke) >>! In T260330#6467753, @daniel wrote: > Can this be used from inside a docker container? I've used it with LXC conta...
[20:01:42] serviceops, OTRS, Operations, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (Framawiki)
[20:07:46] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team, User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (MoritzMuehlenhoff) As we're building custom 7.2 packages anyway; adding a backport for 1.) would n...
[20:51:07] serviceops, MW-on-K8s, Operations, TechCom-RFC, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Legoktm) >>! In T260330#6458637, @tstarling wrote: > An open question is what to do about shell pipelines. I didn't see any...
[21:02:02] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (RobH) >>! In T262151#6462585, @Volans wrote: > The device is still active in Netbox, shouldn't be marked as failed? Yep, it's not online so I'm putting it failed so the reports clear up in netbox....
[21:29:44] serviceops, DC-Ops, Operations: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (Dzahn)
[21:30:18] serviceops, DC-Ops, Operations: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (Dzahn) server is down and depooled. it can be worked on anytime
[21:30:30] serviceops, DC-Ops, Operations: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (Dzahn) p:Triage→Medium
[21:37:24] serviceops, DC-Ops, Operations, ops-eqiad: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (Dzahn)
[22:51:56] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team, User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (Krinkle) a:Krinkle
[22:52:02] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team, User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (Krinkle) Thanks. We'll start work on that basis then, coding MW only for the standard PHP function.