[09:30:04] _joe_: effie: https://grafana.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?orgId=1&var-dc=eqiad%20prometheus%2Fk8s-staging
[09:30:14] and with that I think I am ready to enable coredns in codfw + eqiad
[09:30:32] we will at least have an idea what's going on if all hell breaks loose
[09:38:21] <_joe_> akosiaris: yep, I was thinking an alternative could be to keep using our resolvers and make them query coredns for the k8s subdomains
[09:38:58] <_joe_> because well, I am more confident in pdns than in coredns battle testedness
[09:40:06] assigning a specific zone like k8s.eqiad.wmnet ?
[09:41:32] I thought about it too, but when designing it in my mind, it ended a bit more complex than I'd like
[09:53:23] <_joe_> ok
[09:54:10] <_joe_> the complexity being exposing coredns to the rest of production?
[10:33:18] yes
[10:33:33] I can be convinced otherwise ofc
[10:34:01] btw, I made sure that 1 coredns is scheduled on each node right now. I'll try to make sure requests to it also make local
[13:17:56] _joe_: effie: All of the codfw pods are now on coredns. I've had an exhaustive look in the latencies of cxserver, eventgate, termbox, citoid, zotero (the services that are actually in use and use DNS) and I can see no changes. I'll proceed to eqiad
[14:43:49] hi team, I'm seeing Sep 16 14:37:22 cp1075 traffic_manager[99211]: [Sep 16 14:37:22.638] {0x2ac8ba8a9700} ERROR: SSL connection failed for 'docker-registry.wikimedia.org': error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed, has anything changed on your side?
[14:43:57] also... Sep 16 14:41:34 cp1075 traffic_manager[99211]: [Sep 16 14:37:22.638] {0x2ac8ba8a9700} WARNING: SNI (docker-registry.wikimedia.org) not in certificate. Action=Terminate server=docker-registry.discovery.wmnet(10.2.1.44)
[14:44:39] and from cp1075 connecting to docker-registry.discovery.wmnet:443 the certificate only has the following alternative names: DNS:docker-registry-rw.discovery.wmnet, DNS:docker-registry.discovery.wmnet
[14:50:44] so right now clients going through cp1075 to reach the docker registry are getting 5xx
[14:54:52] serviceops, cloud-services-team, wikitech.wikimedia.org, Patch-For-Review, Performance-Team (Radar): Disable now-redundant mediawiki/TorBlock/loadExitNodes.php cron script - https://phabricator.wikimedia.org/T229736 (Joe) p:Triage→Normal
[14:59:35] serviceops, cloud-services-team, wikitech.wikimedia.org, Patch-For-Review, Performance-Team (Radar): Disable now-redundant mediawiki/TorBlock/loadExitNodes.php cron script - https://phabricator.wikimedia.org/T229736 (Joe) Open→Resolved a:Joe
[15:00:21] I may not be able to join our meeting
[15:00:38] akosiaris: apergos ^
[15:00:47] serviceops, Prod-Kubernetes, User-fsero: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836 (Joe) Open→Resolved a:Joe
[15:00:49] serviceops, Prod-Kubernetes, User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (Joe)
[15:00:51] noted
[15:01:29] ok
[15:04:18] serviceops, Thumbor, Performance-Team (Radar), User-jijiki: Terminate Thumbor with SSL - https://phabricator.wikimedia.org/T180696 (Joe) >>! In T180696#5349006, @jijiki wrote: > TLS on haproxy it is then:) We're trying to standardize TLS termination on envoy, and I think we should concentrate on...
[15:05:25] and my batt is dying
[15:06:02] serviceops, Operations, Traffic, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Vgutierrez) Please note that the docker-registry certificate is missing the public hostname: `docker-registry.wikimedia.org`
[15:07:23] serviceops, DBA, Operations, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo)
[15:07:47] serviceops, DBA, Operations, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` ['backup2001.codfw.wmnet'] ` The log can be fo...
[15:14:48] serviceops, DBA, Operations, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` ['backup2001.codfw.wmnet'] ` The log can be found in `/var/log/wmf-a...
[15:16:02] serviceops, DBA, Operations, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) Got stuck at kernel boot, could it be the same issue as T216240 ?
[15:28:50] serviceops, DBA, Operations, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (Marostegui) >>! In T229209#5496239, @jcrespo wrote: > Got stuck at kernel boot, could it be the same issue as T216240 ? Maybe, even if it is not, it wouldn't hurt to get t...
[15:31:30] serviceops, DBA, Operations, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (MoritzMuehlenhoff) >>! In T229209#5496266, @Marostegui wrote: >>>! In T229209#5496239, @jcrespo wrote: >> Got stuck at kernel boot, could it be the same issue as T216240 ?...
[15:39:04] serviceops, DBA, Operations, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) In other order of things, the RAID controller I think now has a random device id, so the boot installer failed. I am not sure we will be able to install it without...
[15:43:57] serviceops, DBA, Operations, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) Sadly, I cannot setup the RAID remotelly, because the server no longer boots and mgmt interface says: ` Unified Server Configurator does not support console redir...
[15:44:35] serviceops, Performance-Team, Scap, Continuous-Integration-Config, and 4 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Jdforrester-WMF)
[18:51:59] serviceops, Core Platform Team, Performance-Team, Scap, and 5 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Anomie)
[18:53:39] serviceops, Core Platform Team, Performance-Team, Scap, and 5 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Jdforrester-WMF)
[19:25:38] serviceops, Core Platform Team, Performance-Team, Scap, and 5 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Anomie) > * InitialiseSettings.php (and much of CommonSettings.php) is replaced with per-wiki inheritable YAML fi...
[20:22:35] serviceops, Core Platform Team, Performance-Team, Scap, and 5 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Jdforrester-WMF) >>! In T223602#5496973, @Anomie wrote: >> * InitialiseSettings.php (and much of CommonSettings.p...
[20:22:54] serviceops, Core Platform Team, Performance-Team, Scap, and 5 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Jdforrester-WMF)
[20:50:00] serviceops, cloud-services-team, wikitech.wikimedia.org, Patch-For-Review, Performance-Team (Radar): Disable now-redundant mediawiki/TorBlock/loadExitNodes.php cron script - https://phabricator.wikimedia.org/T229736 (Dzahn) Resolved→Open reverted the last merge because it broke puppet...
[20:51:10] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (Dzahn) server still alerting nowadays
[20:56:45] serviceops, cloud-services-team, wikitech.wikimedia.org, Patch-For-Review, Performance-Team (Radar): Disable now-redundant mediawiki/TorBlock/loadExitNodes.php cron script - https://phabricator.wikimedia.org/T229736 (Dzahn) it's because the periodic jobs class does not have an "ensure" parame...
[21:25:08] serviceops, cloud-services-team, wikitech.wikimedia.org, Patch-For-Review, Performance-Team (Radar): Disable now-redundant mediawiki/TorBlock/loadExitNodes.php cron script - https://phabricator.wikimedia.org/T229736 (Dzahn) Open→Resolved
[23:22:21] serviceops, Operations, observability, PHP 7.2 support, and 2 others: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm. - https://phabricator.wikimedia.org/T223336 (Krinkle) Open→Declined OK. I'm fine with this staying as it is. It's not really broken. I...