[02:52:26] serviceops, MediaWiki-Debug-Logger, Developer Productivity, Release-Engineering-Team (Logspam): Fix unhelpful/duplicate "in on " in php7-fatal-error.php messages - https://phabricator.wikimedia.org/T275075 (Krinkle) Open→Resolved
[09:34:26] serviceops, SRE, observability, Patch-For-Review: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (jijiki)
[09:34:29] serviceops, SRE, observability, Patch-For-Review, User-jijiki: alert on too many close-to-saturated appservers / apiservers - https://phabricator.wikimedia.org/T267176 (jijiki) Open→Resolved
[09:58:22] serviceops, SRE, Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (Joe) >>! In T271573#6832845, @elukey wrote: > @wkandek Hi! Do you think that we could find somebody in your team to work with me on this task? It seems very important and potentially bloc...
[09:59:44] serviceops, SRE: Support proxying to etcd v3 storage on buster or later - https://phabricator.wikimedia.org/T275600 (Joe)
[11:36:13] I need to run to lunch, but this looks like a riddle to me so far: https://phabricator.wikimedia.org/T274589
[11:40:38] serviceops, SRE, Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (fgiunchedi) Looks like mtail didn't like that: ` root@mw2239:~# curl -s localhost:3903/metrics An error has occurred during metrics gathering: 3 erro...
[12:45:13] serviceops, SRE, Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (hnowlan) Reverted, required an mtail restart on all affected hosts also.
[13:02:49] serviceops, Prod-Kubernetes, Patch-For-Review, User-fsero: Set up PodSecurityPolicies in clusters - https://phabricator.wikimedia.org/T228967 (JMeybohm)
[13:02:54] serviceops, Prod-Kubernetes, Patch-For-Review: Check/Rebuild all docker-pkg build docker images running on kubernetes - https://phabricator.wikimedia.org/T274254 (JMeybohm) Open→Resolved Updated images have been rolled out to all clusters as of today.
[13:50:32] serviceops, Prod-Kubernetes, observability, Kubernetes: Kubernetes 1.16 dropped deprecated cadvisor metric labels pod_name and container_name - https://phabricator.wikimedia.org/T275618 (JMeybohm)
[13:50:52] serviceops, Prod-Kubernetes, observability, Kubernetes: Kubernetes 1.16 dropped deprecated cadvisor metric labels pod_name and container_name - https://phabricator.wikimedia.org/T275618 (JMeybohm) p:Triage→Medium
[14:45:29] serviceops, SRE, Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (hnowlan) It looks like that without a restart, mtail's reloading of rules will cause it to see multiple definitions: https://phabricator.wikimedia.org/...
[15:02:05] serviceops, Prod-Kubernetes, observability, Kubernetes: Kubernetes 1.16 dropped deprecated cadvisor metric labels pod_name and container_name - https://phabricator.wikimedia.org/T275618 (akosiaris) I 'll try and draft out a possible way out of this. I am adding members of the observability team f...
[15:53:03] serviceops, Prod-Kubernetes, Kubernetes: Investigate/Fix missing metrics from k8s-node and k8s-node-proxy jobs - https://phabricator.wikimedia.org/T275641 (JMeybohm)
[15:53:16] serviceops, Prod-Kubernetes, Kubernetes: Investigate/Fix missing metrics from k8s-node and k8s-node-proxy jobs - https://phabricator.wikimedia.org/T275641 (JMeybohm) p:Triage→Medium
[16:04:17] serviceops, Prod-Kubernetes, observability, Kubernetes: Kubernetes 1.16 dropped deprecated cadvisor metric labels pod_name and container_name - https://phabricator.wikimedia.org/T275618 (colewhite) >>! In T275618#6856620, @akosiaris wrote: > The proposal I got is the following: > > * Fetch the...
[17:04:15] serviceops, Prod-Kubernetes, observability, Kubernetes: Kubernetes 1.16 dropped deprecated cadvisor metric labels pod_name and container_name - https://phabricator.wikimedia.org/T275618 (akosiaris) >>! In T275618#6856943, @colewhite wrote: > >>>! In T275618#6856620, @akosiaris wrote: >> The prop...
[17:13:56] serviceops, Prod-Kubernetes, Kubernetes: Investigate/Fix missing metrics from k8s-node and k8s-node-proxy jobs - https://phabricator.wikimedia.org/T275641 (JMeybohm) a:JMeybohm * https://grafana-rw.wikimedia.org/d/G8zPL7-Wz/kubernetes-node * `http_request_*_seconds_*` metrics from `job="k8s-node...
[18:51:35] serviceops, DC-Ops, SRE, ops-codfw: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (RobH)
[20:44:51] serviceops, envoy, observability, User-fgiunchedi: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (BBlack) Hi #serviceops - I've run into some of the effects of this recently and tracked down this ticket, which seems a relevant/recent reference point. The curren...
[21:04:23] serviceops, envoy, observability, User-fgiunchedi: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (Joe) Just for the record, the restbase cluster that has ipv6_compat activated is the dev cluster. Nothing serving production traffic.
[21:10:12] serviceops, envoy, observability, User-fgiunchedi: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (BBlack) @Joe yeah I'm not sure which layer is causing the logstash appearance there. It's from restbase1019 as a client towards something, maybe parsoid?
[21:18:47] serviceops, envoy, observability, User-fgiunchedi: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (Joe) No, that entry is for testreduce, so another test instance too. So I doubt that what you're seeing in the logs has anything to do with this setting. In fact,...
[21:20:24] bblack: I removed that on testreduce1001 right now
[21:20:38] - ipv4_compat: true
[21:28:50] serviceops, envoy, observability, User-fgiunchedi: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (Dzahn) I removed that setting from the testreduce1001 envoy, just to make sure. ` - address: '::' - ipv4_compat: true + address: `
[22:44:44] serviceops, envoy, observability, User-fgiunchedi: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (akosiaris) https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-restbase-2020.12.24?id=L_3hlnYBjr5R1RLC5PlW points out that this is before the envo...
[22:54:39] serviceops, envoy, observability, User-fgiunchedi: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (akosiaris) >>! In T255568#6781621, @jijiki wrote: >>>! In T255568#6779477, @akosiaris wrote: >> I 've left a comment in the merged change, duplicating here for visi...
[23:20:58] the cron that used to sync /srv/deployment between deploy1001 and deploy2001, is now a systemd timer instead.
[23:21:20] the way it works is that the inactive server pulls from the active server every minute (before and after)
[23:21:30] but new is that it's a unit
[23:21:34] [deploy2001:~] $ sudo systemctl status sync_deployment_dir.timer
[23:37:58] serviceops, Platform Engineering Roadmap Decision Making, SRE, Traffic, and 3 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (CCicalese_WMF)