[00:02:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:02:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:05:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:11:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:14:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[00:16:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:17:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:18:17] FIRING: [7x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:20:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[00:21:13] (PS2) Ottomata: EventStreamConfig - add mediawiki.user_change.dev0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1285525 (https://phabricator.wikimedia.org/T423952)
[00:23:17] FIRING: [8x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:23:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:26:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:26:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:28:17] FIRING: [8x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:29:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[00:31:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:32:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:33:17] FIRING: [10x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:35:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:36:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:36:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:38:17] FIRING: [11x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:39:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:40:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:42:38] SRE, Acme-chief, Traffic, Patch-For-Review: Allow specifying a custom period of time before deploying a newly issued certificate - https://phabricator.wikimedia.org/T213737#11906132 (Krinkle)
[00:43:40] FIRING: SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:52:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:52:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:56:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:59:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:07:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:07:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:08:17] FIRING: [11x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:09:42] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1285531
[01:09:43] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1285531 (owner: TrainBranchBot)
[01:10:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:10:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:12:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:13:17] FIRING: [9x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:15:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:15:23] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 11h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[01:16:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:17:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:20:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:21:09] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1285531 (owner: TrainBranchBot)
[01:21:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:31:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:32:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:33:17] FIRING: [10x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:34:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:35:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[01:46:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:47:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:49:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:50:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:51:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:51:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:54:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:54:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:57:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:58:17] FIRING: [12x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:59:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:00:17] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:03:17] FIRING: [11x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:04:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:06:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:07:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:07:16] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 58s)
[02:08:17] FIRING: [13x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:09:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:10:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[02:11:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:12:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:15:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[02:15:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:26:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:27:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:28:17] FIRING: [10x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:30:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:31:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[02:31:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:33:17] FIRING: [8x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:34:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:36:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:37:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:40:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[02:40:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:45:13] (CR) Danielyepezgarces: [C:+1] Enabling RSS extension for cowikimedia chapter (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1283048 (https://phabricator.wikimedia.org/T425440) (owner: Danielyepezgarces)
[02:47:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:50:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:51:06] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283048 (https://phabricator.wikimedia.org/T425440) (owner: Danielyepezgarces)
[02:56:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:58:17] FIRING: [8x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:59:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:59:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:03:03] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:06:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:08:17] FIRING: [6x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:09:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:13:17] FIRING: [5x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:16:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:18:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:21:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:21:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:26:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:27:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:30:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:30:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:33:17] FIRING: [5x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:37:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:37:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:40:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[03:40:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[03:41:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:41:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:44:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:44:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:47:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:47:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:50:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:51:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:56:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:57:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:00:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:00:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:03:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:12:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:12:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:13:17] FIRING: [4x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:15:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:19:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[04:22:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:23:17] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:25:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:27:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:33:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:38:17] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:41:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:43:40] FIRING: SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:44:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:46:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:50:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:51:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:53:17] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:54:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443:
Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:56:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:00:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [05:01:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:02:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:05:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [05:05:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but 
pooled https://wikitech.wikimedia.org/wiki/PyBal [05:06:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:06:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:09:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:09:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/2 (Core: asw1-b12-drmrs:et-0/0/50 {#D0103}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:10:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:10:45] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr2-drmrs (185.15.58.140) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:10:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b12-drmrs:et-0/0/50 (Core: cr2-drmrs:et-0/0/2 {#D0103}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b12-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:11:13] 
RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:11:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:14:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [05:14:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:15:23] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 15h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [05:15:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b12-drmrs:et-0/0/50 (Core: cr2-drmrs:et-0/0/2 {#D0103}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b12-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:17:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:18:17] FIRING: [4x] ProbeDown: Service wdqs1011:443 has 
failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:20:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:21:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:25:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:26:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:26:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:30:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:30:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:31:55] (CR) Muehlenhoff: [C:+2] redis::master: Remove obsolete code only used for old ferm service [puppet] - https://gerrit.wikimedia.org/r/1282353 (https://phabricator.wikimedia.org/T419976) (owner: Muehlenhoff) [05:34:49] (CR) Muehlenhoff: [C:+2] Remove ganeti5004 from eqsin cluster [puppet] - https://gerrit.wikimedia.org/r/1284665 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff) [05:36:44] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11906319 (MoritzMuehlenhoff) [05:37:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:38:17] FIRING: [5x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:40:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:40:17] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but 
pooled https://wikitech.wikimedia.org/wiki/PyBal [05:42:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:42:17] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:42:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:43:17] FIRING: [7x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:45:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:45:17] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:45:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:45:32] SRE, Infrastructure-Foundations, Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#11906325 (MoritzMuehlenhoff) [05:46:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL 
OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:53:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5004.eqsin.wmnet with OS bookworm [05:53:42] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11906328 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5004.eqsin.wmnet with OS bookworm [05:56:12] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:57:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:57:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM bast6003.wikimedia.org [05:57:58] SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11906331 (ops-monitoring-bot) VM bast6003.wikimedia.org rebooted by jmm@cumin2002 with reason: None [05:59:12] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:00:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:03:17] FIRING: [9x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:03:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM bast6003.wikimedia.org [06:05:58] SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11906333 (MoritzMuehlenhoff) [06:07:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:07:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:08:17] FIRING: [11x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:08:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM acmechief2002.codfw.wmnet [06:08:48] SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11906334 (ops-monitoring-bot) VM acmechief2002.codfw.wmnet rebooted by jmm@cumin2002 with reason: None [06:10:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:10:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, 
wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:11:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:11:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:12:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM acmechief2002.codfw.wmnet [06:14:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:15:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [06:16:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:19:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:22:34] SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - 
https://phabricator.wikimedia.org/T422596#11906346 (MoritzMuehlenhoff) [06:23:17] FIRING: [11x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:24:02] SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11906347 (MoritzMuehlenhoff) Open→Resolved All VMs are updated to 2G RAM and we enforce 2G as the lower limit in sre.ganeti.makevm, so resolving this. [06:25:06] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5004.eqsin.wmnet with reason: host reimage [06:26:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:27:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:29:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5004.eqsin.wmnet with reason: host reimage [06:29:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:30:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, 
wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [06:31:15] (CR) Muehlenhoff: Added cwilliams to users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1285368 (owner: CWilliams) [06:32:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:35:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:36:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:38:17] FIRING: [9x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:39:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [06:41:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:41:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:42:54] (PS5) 
Muehlenhoff: ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - https://gerrit.wikimedia.org/r/1283620 [06:44:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:44:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:45:14] jayme, effie ^ not paging but is something bad going on with WDQS in eqiad ? [06:47:55] looks like it... [06:50:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5004.eqsin.wmnet with OS bookworm [06:51:00] SRE, Ganeti, Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11906358 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5004.eqsin.wmnet with OS bookworm completed: - ganeti5004 (**PASS**) - Dow... 
[06:51:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:53:12] (PS1) Muehlenhoff: use_linux612_on_bookworm: Bump kernel to 6.12.86 [puppet] - https://gerrit.wikimedia.org/r/1285537 [06:55:53] (PS2) Muehlenhoff: use_linux612_on_bookworm: Bump kernel to 6.12.86 [puppet] - https://gerrit.wikimedia.org/r/1285537 [06:57:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:00:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:00:04] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T0700) [07:00:04] sfaci and dyepezg: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:22] o/ [07:01:29] (PS1) Muehlenhoff: Add ganeti5004 to the routed Ganeti cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/1285538 (https://phabricator.wikimedia.org/T421863) [07:02:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:02:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:04:37] (CR) Brouberol: [C:+2] Add x_wmf_ratelimit_class and x_trusted_request to Turnilo [deployment-charts] - https://gerrit.wikimedia.org/r/1284876 (https://phabricator.wikimedia.org/T419736) (owner: Aleksandar Mastilovic) [07:05:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:06:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:08:55] !log elukey@cumin1003 START - Cookbook sre.dns.wipe-cache zarcillo.discovery.wmnet on all recursors [07:08:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zarcillo.discovery.wmnet on all recursors [07:10:27] Is anyone around to run the deployment window? 
[07:11:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:11:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:11:52] btullis: wdqs seems to be under constant high load since thursday last week [07:13:17] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:13:31] (CR) Ayounsi: [C:+1] Add ganeti5004 to the routed Ganeti cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/1285538 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff) [07:15:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:15:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:17:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:20:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, 
wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:20:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:21:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [07:21:25] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [07:22:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:23:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:23:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [07:24:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [07:25:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:26:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:27:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:27:17] (03CR) 10Brouberol: [C:03+1] "This looks great, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1283621 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [07:27:28] btullis, brouberol|ooo: ^ wdqs seems to be under constant high load since Thursday last week [07:27:54] (03CR) 10Tiziano Fogli: [C:03+2] logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [07:30:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:31:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:32:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:35:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:35:26] (03CR) 10Elukey: [C:03+2] confluent::kafka: introduce the super-user-client.properties for Kafka 3 [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [07:35:49] (03PS1) 10Brouberol: flink: build the flink 2.0.0 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1285540 (https://phabricator.wikimedia.org/T412978) [07:36:14] RECOVERY - 
PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:36:57] (03PS2) 10Brouberol: flink: build the flink 2.0.0 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1285540 (https://phabricator.wikimedia.org/T412978) [07:37:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:37:43] (03CR) 10JMeybohm: wikikube: add wikikube-ctrl2006 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1249321 (https://phabricator.wikimedia.org/T406596) (owner: 10Jasmine) [07:39:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:41:00] (03CR) 10JMeybohm: [C:03+1] mediawiki-common: add rdb2011 and rdb2012 IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285336 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [07:42:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:42:42] (03CR) 10JMeybohm: [C:03+2] ratelimit-media: Initial service deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226814 (https://phabricator.wikimedia.org/T414439) (owner: 10Clément Goubert) [07:44:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:45:10] (03CR) 10JavierMonton: [C:03+1] flink: build the flink 2.0.0 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1285540 
(https://phabricator.wikimedia.org/T412978) (owner: 10Brouberol) [07:45:18] (03Merged) 10jenkins-bot: ratelimit-media: Initial service deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226814 (https://phabricator.wikimedia.org/T414439) (owner: 10Clément Goubert) [07:46:19] jayme: FWIW, a potentially related Phab task was filed by a WDQS end-user on Saturday: T425861 [07:46:20] T425861: Wikidata SPARQL query performance regression: frequent 502-bad gateway errors - https://phabricator.wikimedia.org/T425861 [07:46:54] (03CR) 10Blake: [C:03+1] mw: Remove references to rsyslogd image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282107 (https://phabricator.wikimedia.org/T418200) (owner: 10JMeybohm) [07:47:03] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply [07:47:08] (03PS1) 10Jelto: miscweb: bump wmf-navigator images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285543 (https://phabricator.wikimedia.org/T414405) [07:47:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:48:17] FIRING: [4x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:48:25] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[07:48:50] (03CR) 10Slyngshede: [C:03+1] idp: migrate IDP to Redis 8 [puppet] - 10https://gerrit.wikimedia.org/r/1285324 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [07:48:54] (03CR) 10Slyngshede: [C:03+2] idp: migrate IDP to Redis 8 [puppet] - 10https://gerrit.wikimedia.org/r/1285324 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [07:49:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:49:44] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [07:50:35] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [07:51:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:51:40] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:54:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:54:54] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti5004 to the routed Ganeti cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1285538 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [07:55:33] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [07:55:54] (03PS1) 10JMeybohm: ratelimit-media: Fix missing unit in ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285546 
(https://phabricator.wikimedia.org/T414439) [07:55:58] (03PS1) 10Slyngshede: IDP: Redis migration [dns] - 10https://gerrit.wikimedia.org/r/1285547 (https://phabricator.wikimedia.org/T419976) [07:56:40] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [07:57:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:57:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:58:17] FIRING: [4x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:58:32] (03CR) 10JMeybohm: [C:03+2] ratelimit-media: Fix missing unit in ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285546 (https://phabricator.wikimedia.org/T414439) (owner: 10JMeybohm) [08:00:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:00:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [08:00:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, 
wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:00:41] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old eqsin ganeti cluster VIP - ayounsi@cumin1003" [08:00:47] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old eqsin ganeti cluster VIP - ayounsi@cumin1003" [08:00:47] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:00:53] (03Merged) 10jenkins-bot: ratelimit-media: Fix missing unit in ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285546 (https://phabricator.wikimedia.org/T414439) (owner: 10JMeybohm) [08:02:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:03:55] (03Abandoned) 10Abijeet Patro: Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1282325 (owner: 10L10n-bot) [08:05:03] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply [08:05:07] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [08:05:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:05:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/1285547 (https://phabricator.wikimedia.org/T419976) (owner: 10Slyngshede) [08:08:35] (03CR) 10Slyngshede: [C:03+2] IDP: Redis migration [dns] - 10https://gerrit.wikimedia.org/r/1285547 (https://phabricator.wikimedia.org/T419976) (owner: 10Slyngshede) [08:08:44] !log slyngshede@dns1004 START - running authdns-update [08:10:12] !log slyngshede@dns1004 END - running authdns-update [08:13:17] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:15:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254) (owner: 10Santiago Faci) [08:15:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 
10https://gerrit.wikimedia.org/r/1285352 (https://phabricator.wikimedia.org/T415254) (owner: 10Jforrester) [08:15:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285406 (https://phabricator.wikimedia.org/T425785) (owner: 10Santiago Faci) [08:16:19] (03CR) 10Jelto: [C:03+2] miscweb: bump wmf-navigator images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285543 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [08:16:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [08:16:46] (03PS4) 10JMeybohm: ratelimit: Add ingress support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439) [08:16:46] (03PS3) 10JMeybohm: ratelimit-media: Enable ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285401 (https://phabricator.wikimedia.org/T414439) [08:17:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:18:57] (03Merged) 10jenkins-bot: miscweb: bump wmf-navigator images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285543 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [08:20:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [08:21:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [08:23:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:25:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:26:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:26:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:26:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [08:27:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:29:14] (03PS1) 10Elukey: confluent: add space in kafka's COMMAND_CONFIG_OPT [puppet] - 10https://gerrit.wikimedia.org/r/1285725 [08:30:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [08:30:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:30:33] (03CR) 10Elukey: [C:03+2] confluent: add space in kafka's COMMAND_CONFIG_OPT [puppet] - 10https://gerrit.wikimedia.org/r/1285725 (owner: 10Elukey) [08:31:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:35:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:36:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:36:30] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:38:17] FIRING: [8x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:39:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [08:39:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:40:44] (03PS1) 10Muehlenhoff: Blacklist more network protocols as defense in depth [puppet] - 10https://gerrit.wikimedia.org/r/1285727 [08:41:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5004.eqsin.wmnet to cluster eqsin02 and group 01 [08:41:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:41:44] (03PS3) 10Atsuko: Add auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 (owner: 10Joal) [08:42:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:42:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance [08:42:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti5004.eqsin.wmnet to cluster eqsin02 and group 01 [08:42:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2218 (T419961)', diff saved to https://phabricator.wikimedia.org/P92450 and previous config saved to /var/cache/conftool/dbconfig/20260511-084236-fceratto.json [08:43:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install5004.wikimedia.org to drbd [08:43:40] FIRING: SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:43:54] (03CR) 10Tiziano Fogli: [C:03+2] rsyslog: forward thanos-query-frontend logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1275799 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [08:44:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11906646 (10MoritzMuehlenhoff) [08:44:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11906651 (10ops-monitoring-bot) VM install5004.wikimedia.org switching disk type to drbd [08:45:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:45:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:45:44] (03PS1) 10Kosta Harlan: hCaptcha: Enable for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285731 (https://phabricator.wikimedia.org/T425354) [08:47:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:47:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:47:35] 10ops-drmrs: 
Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T418136#11906663 (10phaultfinder) [08:47:55] FIRING: [2x] ProbeDown: Ripe Atlas anchor atlas5001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:48:37] (03CR) 10Ayounsi: [C:03+1] QoS: Move DSCP AF41 from 'low' to 'normal' priority class [homer/public] - 10https://gerrit.wikimedia.org/r/1285350 (https://phabricator.wikimedia.org/T424640) (owner: 10Cathal Mooney) [08:49:01] (03CR) 10Ayounsi: [C:03+1] Blacklist more network protocols as defense in depth [puppet] - 10https://gerrit.wikimedia.org/r/1285727 (owner: 10Muehlenhoff) [08:49:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T419961)', diff saved to https://phabricator.wikimedia.org/P92451 and previous config saved to /var/cache/conftool/dbconfig/20260511-084945-fceratto.json [08:50:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:50:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:50:38] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [08:51:20] (03PS1) 10JMeybohm: gateway-check.lua: Route some LiftWing endpoints through the REST gateway. [puppet] - 10https://gerrit.wikimedia.org/r/1285732 (https://phabricator.wikimedia.org/T422804) (owner: 10Blake) [08:51:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283620 (owner: 10Muehlenhoff) [08:52:36] 10ops-esams, 06DC-Ops: Alert for device asw1-bw27-esams.mgmt.esams.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T425917 (10phaultfinder) 03NEW [08:53:09] (03PS1) 10Elukey: confluent: add KAFKA_BOOTSTRAP_TLS_SERVERS and use it in kafka3.sh [puppet] - 10https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528) [08:53:12] (03CR) 10Ayounsi: [C:03+1] eqsin: remove sandbox ACL on now gone interface [homer/public] - 10https://gerrit.wikimedia.org/r/1275925 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [08:53:15] (03CR) 10Ayounsi: [C:03+2] eqsin: remove sandbox ACL on now gone interface [homer/public] - 10https://gerrit.wikimedia.org/r/1275925 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [08:53:29] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [08:54:22] FIRING: JobUnavailable: Reduced availability for job squid in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:54:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:55:23] (03Merged) 10jenkins-bot: eqsin: remove sandbox ACL on now gone interface [homer/public] - 10https://gerrit.wikimedia.org/r/1275925 (https://phabricator.wikimedia.org/T421863) 
(owner: 10Ayounsi) [08:55:45] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [08:55:45] (03CR) 10Blake: [C:03+2] gateway-check.lua: Route some LiftWing endpoints through the REST gateway. [puppet] - 10https://gerrit.wikimedia.org/r/1285732 (https://phabricator.wikimedia.org/T422804) (owner: 10Blake) [08:56:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:56:43] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1230: T419635 [08:56:46] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:57:18] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1284616 (owner: 10L10n-bot) [08:57:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:58:17] FIRING: [8x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:58:29] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [08:58:47] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [08:59:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked 
down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:59:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P92453 and previous config saved to /var/cache/conftool/dbconfig/20260511-085954-fceratto.json [09:02:49] (03CR) 10JMeybohm: ratelimit: Add ingress support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439) (owner: 10JMeybohm) [09:04:12] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11906723 (10SLyngshede-WMF) {F80836950} Memory is also having issues [09:04:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install5004.wikimedia.org to drbd [09:04:32] PROBLEM - Host install5004 is DOWN: PING CRITICAL - Packet loss = 100% [09:04:38] RECOVERY - Host install5004 is UP: PING OK - Packet loss = 0%, RTA = 235.88 ms [09:05:23] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11906724 (10SLyngshede-WMF) Puppet currently fails to run and hasn't done so since May 10th, at 21:23 [09:05:25] (03CR) 10Ayounsi: "In case it got missed, I left a comment on the task: https://phabricator.wikimedia.org/T424683#11885878" [puppet] - 10https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683) (owner: 10Cathal Mooney) [09:06:03] (03CR) 10Brouberol: [C:03+2] flink: build the flink 2.0.0 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1285540 (https://phabricator.wikimedia.org/T412978) (owner: 10Brouberol) [09:06:05] (03CR) 10Brouberol: [V:03+2 C:03+2] flink: build the flink 2.0.0 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1285540 (https://phabricator.wikimedia.org/T412978) (owner: 10Brouberol) [09:06:44] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: 
apply
[09:07:14] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:07:17] (PS2) Elukey: confluent: add KAFKA_BOOTSTRAP_TLS_SERVERS and use it in kafka3.sh [puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528)
[09:07:55] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas5001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:07:58] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[09:08:17] FIRING: [8x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:08:41] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[09:09:00] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[09:09:22] RESOLVED: JobUnavailable: Reduced availability for job squid in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:09:59] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11906733 (cmooney) >>! In T424683#11885878, @ayounsi wrote: > Nice! > > We can also filter out the `.16386`, `.16384`, `.1...
[09:10:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P92454 and previous config saved to /var/cache/conftool/dbconfig/20260511-091001-fceratto.json
[09:10:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/2 (Core: asw1-b12-drmrs:et-0/0/50 {#D0103}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:10:45] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr2-drmrs (185.15.58.140) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:11:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:12:21] (CR) Brouberol: [C:+1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1285537 (owner: Muehlenhoff)
[09:12:22] jouncebot: nowandnext
[09:12:22] No deployments scheduled for the next 0 hour(s) and 47 minute(s)
[09:12:22] In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1000)
[09:12:54] (CR) Brouberol: [C:+1] "Beautiful, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[09:13:17] FIRING: [9x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:13:49] (CR) Elukey: [C:+1] ratelimit: Add ingress support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[09:13:51] (CR) JMeybohm: [C:+1] confluent: add KAFKA_BOOTSTRAP_TLS_SERVERS and use it in kafka3.sh [puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[09:14:26] (CR) Elukey: [C:+2] confluent: add KAFKA_BOOTSTRAP_TLS_SERVERS and use it in kafka3.sh [puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[09:14:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:15:23] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 19h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[09:15:30] (PS5) JMeybohm: ratelimit: Add ingress support [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439)
[09:15:30] (PS4) JMeybohm: ratelimit-media: Enable ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1285401 (https://phabricator.wikimedia.org/T414439)
[09:15:37] SRE, LDAP-Access-Requests: Grant
Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11906740 (catherine.kelsey.wmde) Thank you both! I've now signed the NDA :)
[09:16:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:17:57] (PS3) A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355)
[09:18:02] (CR) JMeybohm: ratelimit: Add ingress support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[09:18:17] FIRING: [9x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:19:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:19:18] (CR) A-pizzata: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[09:19:22] (CR) A-pizzata: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[09:19:59] (CR) CI reject: [V:-1] changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[09:20:10]
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T419961)', diff saved to https://phabricator.wikimedia.org/P92456 and previous config saved to /var/cache/conftool/dbconfig/20260511-092010-fceratto.json
[09:20:31] (CR) JMeybohm: [C:+2] ratelimit: Add ingress support [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[09:20:45] (PS4) A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355)
[09:21:09] (CR) A-pizzata: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[09:21:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:22:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:22:48] (Merged) jenkins-bot: ratelimit: Add ingress support [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[09:24:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:24:47] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:25:25] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:26:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers
wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:26:38] PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
[09:27:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:27:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:28:17] FIRING: [10x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:29:28] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Thanos
[09:30:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:30:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:31:34] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:31:59]
!log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:33:17] FIRING: [11x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:36:44] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:37:11] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:38:17] FIRING: [12x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:38:31] ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921 (ayounsi) NEW p:Triage→High
[09:38:57] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285731 (https://phabricator.wikimedia.org/T425354) (owner: Kosta Harlan)
[09:40:09] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:40:23] (Merged) jenkins-bot: hCaptcha: Enable for group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1285731 (https://phabricator.wikimedia.org/T425354) (owner: Kosta Harlan)
[09:40:35] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:41:34] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1285731|hCaptcha: Enable for group0 wikis (T425354)]]
[09:41:38] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354
[09:42:09] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1230: T419635
[09:42:12] T419635: Drop il_to
column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:45:21] (CR) Majavah: [C:+2] P:openstack: nova: Set MTU on flat VLAN interface in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1284675 (https://phabricator.wikimedia.org/T425674) (owner: Majavah)
[09:45:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:46:07] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[09:46:36] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[09:48:17] FIRING: [12x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:48:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:48:30] (PS1) Elukey: admin: add spare Yubikey public key and remove the old one [puppet] - https://gerrit.wikimedia.org/r/1285738
[09:48:45] (PS3) Btullis: Add conda-analytics-next to the Hadoop test cluster nodes [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057)
[09:49:46] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[09:49:49] (CR) Elukey: "Sent the confirmation via Slack to Moritz" [puppet] - https://gerrit.wikimedia.org/r/1285738 (owner: Elukey)
[09:50:27] (CR) Btullis: Add conda-analytics-next to the Hadoop test cluster nodes (2 comments) [puppet] -
https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[09:51:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:51:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:52:41] (PS4) Atsuko: Add auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[09:52:43] (PS1) Atsuko: Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763)
[09:52:55] ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11906844 (cmooney) Very weird, box still sees the optic but thinks interface is invalid. ` cmooney@asw1-b12-drmrs> show chassis pic fpc-slot 0 pic-slot 0 | match "^ 50" 50 40GBASE SR4 MM FS Q...
[09:52:58] (CR) Ayounsi: [C:+1] "lgtm!
I agree the dynamic prefix-list are best, we can revisit it once your PR is merged in Aerleon" [homer/public] - https://gerrit.wikimedia.org/r/1285362 (https://phabricator.wikimedia.org/T425703) (owner: Cathal Mooney)
[09:54:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:56:08] (PS1) Btullis: Add the spark 3.5.8 shuffler to the prod hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/1285740 (https://phabricator.wikimedia.org/T338057)
[09:57:06] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs2012.codfw.wmnet with reason: Hardware failure
[09:57:10] ops-codfw, SRE, DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11906855 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2035ffcb-d2a2-45d6-871f-0db385ff6b6e) set by slyngshede@cumin1003 for 4:00:00 on 1 host(s) and their services with reason: Hardware f...
[09:57:51] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on lvs2012.codfw.wmnet with reason: Hardware failure
[09:57:57] ops-codfw, SRE, DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11906858 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a1a96c21-2484-4d03-b4f4-1baa89a1745e) set by slyngshede@cumin1003 for 30 days, 0:00:00 on 1 host(s) and their services with reason: H...
[09:57:58] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[09:58:15] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1285731|hCaptcha: Enable for group0 wikis (T425354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:58:15] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[09:58:17] FIRING: [11x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:58:18] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354
[09:58:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:58:33] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:58:43] (PS4) Btullis: Add conda-analytics-next to the Hadoop test cluster nodes [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057)
[09:58:57] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:59:04] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285740 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[09:59:37] !log kharlan@deploy1003 kharlan: Continuing with deployment
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1000)
[10:01:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status
of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:01:09] !log fceratto@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[10:01:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[10:01:45] (PS1) Filippo Giunchedi: wmcs: update filippo cloudvps root key [puppet] - https://gerrit.wikimedia.org/r/1285742
[10:01:55] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[10:04:06] (PS1) Sergio Gimeno: loggedOutWarning: set lastEditor used earlier [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285743 (https://phabricator.wikimedia.org/T425604)
[10:04:18] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285743 (https://phabricator.wikimedia.org/T425604) (owner: Sergio Gimeno)
[10:04:47] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[10:06:12] (PS5) Atsuko: Add auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[10:06:12] (PS2) Atsuko: Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763)
[10:06:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[10:10:41] !log rebalance routed Ganeti
cluster in eqsin T421863
[10:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:44] T421863: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863
[10:11:17] (PS1) Ayounsi: Icinga: Add Nokia icon to more Nokia switches mgmt [puppet] - https://gerrit.wikimedia.org/r/1285745
[10:11:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:11:49] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285731|hCaptcha: Enable for group0 wikis (T425354)]] (duration: 30m 15s)
[10:11:54] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354
[10:11:56] (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1285401 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[10:12:13] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1282350 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[10:12:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:12:30] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285745 (owner: Ayounsi)
[10:12:57] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply
[10:13:08] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[10:13:16] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply
[10:13:17] FIRING: [10x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:13:33] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit:
apply
[10:13:45] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply
[10:14:23] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[10:14:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:14:45] ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11906886 (cmooney) @RobH can you raise a task with Digital Realty to take a look at this in MRS2? The link in question is [[ https://netbox.wikimedia.org/dcim/interfaces/21198/trace/ | this one ]]. It's a pink MTP cable...
[10:15:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:15:34] (CR) JMeybohm: [C:+2] Test updated rsyslog image on mw-experimental and mw-web canary [deployment-charts] - https://gerrit.wikimedia.org/r/1280324 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:15:39] (CR) JMeybohm: [C:+2] mw: Remove references to rsyslogd image [deployment-charts] - https://gerrit.wikimedia.org/r/1282107 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:16:13] (CR) Brouberol: Add auth_proxy.httpd_cas module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[10:16:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy
https://wikitech.wikimedia.org/wiki/PyBal
[10:16:34] !log Migrate of lvs2012 due to hardware issues
[10:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:00] (PS1) Btullis: dse-k8s: Switch the services to use IPIP load-balancing [puppet] - https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437)
[10:17:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:17:48] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437) (owner: Btullis)
[10:18:08] (PS2) Ayounsi: Icinga: Add Nokia icon to more Nokia switches mgmt [puppet] - https://gerrit.wikimedia.org/r/1285745
[10:19:00] (Merged) jenkins-bot: mw: Remove references to rsyslogd image [deployment-charts] - https://gerrit.wikimedia.org/r/1282107 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:19:01] (PS5) Btullis: Add conda-analytics-next to the Hadoop test cluster nodes [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057)
[10:19:15] (Merged) jenkins-bot: Test updated rsyslog image on mw-experimental and mw-web canary [deployment-charts] - https://gerrit.wikimedia.org/r/1280324 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:19:38] (PS3) Muehlenhoff: tlsproxy::envoy: Bump default now that services have moved [puppet] - https://gerrit.wikimedia.org/r/1279340 (https://phabricator.wikimedia.org/T420993)
[10:20:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:21:05] !log jayme@deploy1003 helmfile [codfw] START
helmfile.d/services/ratelimit: apply
[10:21:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:21:40] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply
[10:21:57] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply
[10:22:17] (PS2) Btullis: dse-k8s: Switch the services to use IPIP load-balancing [puppet] - https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437)
[10:22:24] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437) (owner: Btullis)
[10:22:33] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply
[10:22:44] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[10:22:49] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285745 (owner: Ayounsi)
[10:23:21] !log jayme@deploy1003 Started scap sync-world: update rsyslog image
[10:25:47] (PS1) JMeybohm: Revert "Add ratelimit-media namespace to wikikube" [deployment-charts] - https://gerrit.wikimedia.org/r/1285749 (https://phabricator.wikimedia.org/T414439)
[10:25:57] (PS1) JMeybohm: Revert "wikikube: Add ratelimit-media namespace" [puppet] - https://gerrit.wikimedia.org/r/1285750 (https://phabricator.wikimedia.org/T414439)
[10:26:18] (CR) CI reject: [V:-1] Revert "wikikube: Add ratelimit-media namespace" [puppet] - https://gerrit.wikimedia.org/r/1285750 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[10:26:24] !log jayme@deploy1003 Finished scap sync-world: update rsyslog image (duration: 03m 48s)
[10:32:16] RECOVERY -
PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:32:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:35:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:35:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:36:36] (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1285738 (owner: Elukey)
[10:37:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:37:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:38:22] (PS1) JMeybohm: Revert "Test updated rsyslog image on mw-experimental and mw-web canary" [deployment-charts] - https://gerrit.wikimedia.org/r/1285751 (https://phabricator.wikimedia.org/T418200)
[10:39:44] (CR) Blake: [C:+1] Revert "Test updated rsyslog image on mw-experimental and mw-web canary" [deployment-charts] - https://gerrit.wikimedia.org/r/1285751 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:40:02] (CR) Cathal Mooney: [C:+2] Nokia: adjust cpm filters to
restrict BGP connections to our ranges [homer/public] - https://gerrit.wikimedia.org/r/1285362 (https://phabricator.wikimedia.org/T425703) (owner: Cathal Mooney)
[10:40:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:40:40] (CR) Ayounsi: [C:+1] "one nit overall lgtm!" [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813) (owner: Cathal Mooney)
[10:40:43] (PS1) Majavah: P:openstack: nova: Fix hiera key [puppet] - https://gerrit.wikimedia.org/r/1285752
[10:40:43] (PS1) Majavah: P:openstack: neutron: Set MTU on cloudnet codfw1dev interfaces [puppet] - https://gerrit.wikimedia.org/r/1285753 (https://phabricator.wikimedia.org/T425674)
[10:41:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[10:41:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:41:28] (CR) Ayounsi: [C:+2] Icinga: Add Nokia icon to more Nokia switches mgmt [puppet] - https://gerrit.wikimedia.org/r/1285745 (owner: Ayounsi)
[10:42:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:43:14] (PS1) Majavah: admin: Remove cgoubert ssh keys [puppet] - https://gerrit.wikimedia.org/r/1285754
[10:43:17] FIRING: [10x] ProbeDown: Service wdqs1013:443 has failed probes
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:44:09] (CR) Majavah: [C:+2] admin: Remove cgoubert ssh keys [puppet] - https://gerrit.wikimedia.org/r/1285754 (owner: Majavah)
[10:44:23] (Merged) jenkins-bot: Nokia: adjust cpm filters to restrict BGP connections to our ranges [homer/public] - https://gerrit.wikimedia.org/r/1285362 (https://phabricator.wikimedia.org/T425703) (owner: Cathal Mooney)
[10:44:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:44:55] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1279340 (https://phabricator.wikimedia.org/T420993) (owner: Muehlenhoff)
[10:44:56] (PS2) Hnowlan: prometheus, thanos: move recording rule [puppet] - https://gerrit.wikimedia.org/r/1270480 (https://phabricator.wikimedia.org/T249663)
[10:46:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:46:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:48:04] (CR) Trueg: openjdk-25-jre/openjdk-25-jdk (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[10:49:16] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet,
wdqs1015.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:49:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:50:28] (PS4) Cathal Mooney: Nokia: add module to enable BFD on interfaces that need it [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813)
[10:50:31] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:51:25] (CR) Btullis: [C:+1] openjdk-25-jre/openjdk-25-jdk (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[10:51:50] (CR) CI reject: [V:-1] Nokia: add module to enable BFD on interfaces that need it [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813) (owner: Cathal Mooney)
[10:52:10] !log taavi@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Clément Goubert out of all services on: 2459 hosts
[10:53:17] FIRING: [10x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:53:31] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet,
wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:53:40] (CR) Muehlenhoff: [C:+1] "Looks good. I'll report this to the Debian Java maintainers, so that we eventually get it properly fixed." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[10:55:46] (CR) Btullis: [V:+2 C:+2] openjdk-25-jre/openjdk-25-jdk [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[10:57:03] (PS2) Majavah: P:openstack: nova: Fix hiera key [puppet] - https://gerrit.wikimedia.org/r/1285752
[10:57:03] (PS2) Majavah: P:openstack: neutron: Set MTU on cloudnet codfw1dev interfaces [puppet] - https://gerrit.wikimedia.org/r/1285753 (https://phabricator.wikimedia.org/T425674)
[10:57:19] (CR) JMeybohm: [C:+2] Bump default rsyslog container version to 8.2504.0-1 [puppet] - https://gerrit.wikimedia.org/r/1280317 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:57:27] (CR) JMeybohm: [C:+2] Revert "Test updated rsyslog image on mw-experimental and mw-web canary" [deployment-charts] - https://gerrit.wikimedia.org/r/1285751 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:57:31] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:58:09] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1285753 (https://phabricator.wikimedia.org/T425674) (owner: Majavah)
[10:59:18] (CR) Muehlenhoff: [C:+1] "Actually, noted one more issue: Please adapt the versions to reflect current JRE releases, that way we can see in our container
tooling wh" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[10:59:47] !log uprading rsyslog to 8.2504.0-1 in all mediawiki deployments - T418200
[10:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:53] (Merged) jenkins-bot: Revert "Test updated rsyslog image on mw-experimental and mw-web canary" [deployment-charts] - https://gerrit.wikimedia.org/r/1285751 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:59:55] T418200: Migrate Service Ops Docker images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200
[11:00:33] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:00:47] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[11:01:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:03:21] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[11:05:46] (CR) Muehlenhoff: [C:+2] use_linux612_on_bookworm: Bump kernel to 6.12.86 [puppet] - https://gerrit.wikimedia.org/r/1285537 (owner: Muehlenhoff)
[11:08:32] !log jayme@deploy1003 Started scap sync-world: upgrade rsyslog on all deployments T418200
[11:08:35] T418200: Migrate Service Ops Docker images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200
[11:11:20] (CR) JMeybohm: [C:+2]
ratelimit-media: Enable ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1285401 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[11:11:59] (CR) JMeybohm: [C:+2] Revert "Add ratelimit-media namespace to wikikube" [deployment-charts] - https://gerrit.wikimedia.org/r/1285749 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[11:12:08] (PS1) Cathal Mooney: py3-style: allow lines up to 120 chars in length [homer/public] - https://gerrit.wikimedia.org/r/1285756
[11:14:30] (PS1) Trueg: Fixed changelog version number. [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285757 (https://phabricator.wikimedia.org/T425636)
[11:15:11] (CR) Muehlenhoff: [C:+2] tlsproxy::envoy: Bump default now that services have moved [puppet] - https://gerrit.wikimedia.org/r/1279340 (https://phabricator.wikimedia.org/T420993) (owner: Muehlenhoff)
[11:16:19] (PS2) JMeybohm: Revert "wikikube: Add ratelimit-media namespace" [puppet] - https://gerrit.wikimedia.org/r/1285750 (https://phabricator.wikimedia.org/T414439)
[11:16:22] (Merged) jenkins-bot: ratelimit-media: Enable ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1285401 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[11:16:37] (CR) Atsuko: Add auth_proxy.httpd_cas module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[11:17:41] (CR) Btullis: [V:+2 C:+2] openjdk-25-jre/openjdk-25-jdk (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[11:18:02] (CR) Ayounsi: [C:+1] "much better indeed!!"
[homer/public] - https://gerrit.wikimedia.org/r/1285756 (owner: Cathal Mooney)
[11:18:47] (CR) Majavah: [C:+2] P:openstack: nova: Fix hiera key [puppet] - https://gerrit.wikimedia.org/r/1285752 (owner: Majavah)
[11:18:57] (CR) Majavah: [V:+1 C:+2] P:openstack: neutron: Set MTU on cloudnet codfw1dev interfaces [puppet] - https://gerrit.wikimedia.org/r/1285753 (https://phabricator.wikimedia.org/T425674) (owner: Majavah)
[11:19:24] (CR) Atsuko: Add auth_proxy.httpd_cas module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[11:19:28] (CR) Muehlenhoff: [C:+1] "Looks good, thanks" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285757 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[11:19:52] SRE, SRE-Access-Requests: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930 (CWilliams-WMF) NEW
[11:20:11] (CR) Btullis: [C:+2] Fixed changelog version number. [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285757 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[11:20:14] (CR) Btullis: [V:+2 C:+2] Fixed changelog version number.
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1285757 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[11:21:13] !log jayme@deploy1003 Rolling back deployment
[11:21:14] !log jayme@deploy1003 Finished scap sync-world: upgrade rsyslog on all deployments T418200 (duration: 13m 28s)
[11:21:18] T418200: Migrate Service Ops Docker images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200
[11:22:46] (Merged) jenkins-bot: Revert "Add ratelimit-media namespace to wikikube" [deployment-charts] - https://gerrit.wikimedia.org/r/1285749 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[11:25:37] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:25:57] (CR) Joal: [C:+1] Add the spark 3.5.8 shuffler to the prod hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/1285740 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[11:26:37] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:26:38] (CR) Blake: [C:+1] changeprop: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285342 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:26:44] (CR) Btullis: [C:+2] Add the spark 3.5.8 shuffler to the prod hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/1285740 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[11:26:53] (CR) Blake: [C:+1] api-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285340 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:27:08] (CR) Blake: [C:+1] changeprop-jobqueue:
codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285341 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:27:10] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[11:27:22] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[11:27:29] (CR) Blake: [C:+1] ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285343 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:27:56] (CR) Blake: [C:+1] redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:28:20] (CR) Blake: [C:+1] mediawiki-common: add rdb2011 and rdb2012 IPs [deployment-charts] - https://gerrit.wikimedia.org/r/1285336 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:31:29] (CR) Cathal Mooney: [C:+2] py3-style: allow lines up to 120 chars in length [homer/public] - https://gerrit.wikimedia.org/r/1285756 (owner: Cathal Mooney)
[11:31:53] !log installing Linux 6.12.86 on Trixie hosts
[11:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:29] (PS1) Majavah: P:openstack: neutron: Set MTU on cloudnet eqiad1 VLAN interfaces [puppet] - https://gerrit.wikimedia.org/r/1285759 (https://phabricator.wikimedia.org/T425674)
[11:33:18] (PS1) JMeybohm: Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200)
[11:33:56] (CR) CI reject: [V:-1] Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[11:34:54] (CR) Blake: [C:+1] Revert
"Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[11:35:52] (Merged) jenkins-bot: py3-style: allow lines up to 120 chars in length [homer/public] - https://gerrit.wikimedia.org/r/1285756 (owner: Cathal Mooney)
[11:35:56] (PS2) JMeybohm: Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200)
[11:36:21] (PS5) Cathal Mooney: Nokia: add module to enable BFD on interfaces that need it [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813)
[11:36:26] (CR) CI reject: [V:-1] Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[11:38:17] (PS6) Cathal Mooney: Nokia: add module to enable BFD on interfaces that need it [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813)
[11:39:17] (PS3) JMeybohm: Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200)
[11:39:57] (CR) JMeybohm: [C:+2] Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[11:40:24] (CR) Cathal Mooney: [C:+2] Nokia: add module to enable BFD on interfaces that need it (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813) (owner: Cathal Mooney)
[11:41:51] (Merged) jenkins-bot: Nokia: add module to enable BFD on interfaces that need it [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813) (owner:
Cathal Mooney)
[11:42:59] (CR) Daniel Kinzler: [C:+1] changeprop: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285342 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:43:17] FIRING: [8x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:43:22] (CR) Daniel Kinzler: [C:+1] api-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285340 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:43:41] (CR) Daniel Kinzler: [C:+1] ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285343 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:46:17] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:46:33] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:48:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2185.codfw.wmnet with reason: Reboot
[11:48:17] RESOLVED: [8x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:51:40] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:57] (CR) Brouberol: [C:+1] Add conda-analytics-next to the Hadoop test cluster nodes [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[11:52:10] (PS1) Bartosz Dziewoński: list=globalusers: Avoid querying group permissions with empty group list [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859)
[11:52:29] (CR) Bartosz Dziewoński: "Done: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1285761" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: Bartosz Dziewoński)
[11:52:37] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859) (owner: Bartosz Dziewoński)
[11:54:15] (CR) Brouberol: [C:+1] dse-k8s: Switch the services to use IPIP load-balancing [puppet] - https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437) (owner: Btullis)
[11:54:40] (PS1) Effie Mouzeli: site.pp: make mc1055 a memcached server [puppet] - https://gerrit.wikimedia.org/r/1285762 (https://phabricator.wikimedia.org/T418263)
[11:55:14] (CR) Btullis: [C:+2] Add conda-analytics-next to the Hadoop test cluster nodes [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[11:56:16] (PS2) Effie Mouzeli: site.pp: make mc1055 a memcached server [puppet] - https://gerrit.wikimedia.org/r/1285762 (https://phabricator.wikimedia.org/T418263)
[11:56:37] (CR) Blake: [C:+1] site.pp: make mc1055 a memcached server
[puppet] - https://gerrit.wikimedia.org/r/1285762 (https://phabricator.wikimedia.org/T418263) (owner: Effie Mouzeli)
[11:58:14] (CR) Effie Mouzeli: [C:+2] site.pp: make mc1055 a memcached server [puppet] - https://gerrit.wikimedia.org/r/1285762 (https://phabricator.wikimedia.org/T418263) (owner: Effie Mouzeli)
[11:58:56] (PS2) Bartosz Dziewoński: Remove unused 'writeapi' right [mediawiki-config] - https://gerrit.wikimedia.org/r/1283106
[11:59:10] (PS2) Bartosz Dziewoński: Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes [mediawiki-config] - https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386)
[12:00:15] (PS2) Bartosz Dziewoński: API: Introduce list=globalusers [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752)
[12:00:21] (PS2) Bartosz Dziewoński: list=globalusers: Avoid querying group permissions with empty group list [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859)
[12:00:30] (CR) Muehlenhoff: [C:+2] Blacklist more network protocols as defense in depth [puppet] - https://gerrit.wikimedia.org/r/1285727 (owner: Muehlenhoff)
[12:00:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1094:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1094 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:02:39] (CR) Btullis: [C:+2] dse-k8s: Switch the services to use IPIP load-balancing [puppet] - https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437) (owner: Btullis)
[12:04:43] !log push out updated ACL to Nokia switches for BGP connections (T425703) and add BFD config (T425813)
[12:04:46] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:47] T425813: Nokia SR-Linux: BFD broken with default homer configuration - https://phabricator.wikimedia.org/T425813
[12:04:55] (PS1) Btullis: Add dse-k8s-wdqs-test hosts to site.pp and preseed.yaml [puppet] - https://gerrit.wikimedia.org/r/1285763 (https://phabricator.wikimedia.org/T425653)
[12:05:31] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1055.eqiad.wmnet with OS trixie
[12:05:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1094:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1094 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:10:58] (CR) Brouberol: Add auth_proxy.httpd_cas module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[12:15:08] (CR) Brouberol: Add auth_proxy.httpd_cas module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[12:18:18] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage
[12:19:15] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1285770 (owner: L10n-bot)
[12:23:35] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net.
[software/mailman-templates] - https://gerrit.wikimedia.org/r/1285773 (owner: L10n-bot)
[12:25:17] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage
[12:26:41] (CR) Lucas Werkmeister (WMDE): change logo at zh-classical wikipedia (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: WAN233)
[12:27:22] (PS1) Muehlenhoff: thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1285784
[12:28:17] (PS1) Effie Mouzeli: mcrouter_wancache: replace mc1037 with mc1055 [puppet] - https://gerrit.wikimedia.org/r/1285785 (https://phabricator.wikimedia.org/T412255)
[12:34:09] (PS3) CWilliams: data.yaml: Adding cwilliams to users [puppet] - https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930)
[12:34:40] (CR) Kamila Součková: [C:+1] "🍿" [puppet] - https://gerrit.wikimedia.org/r/1228995 (https://phabricator.wikimedia.org/T414440) (owner: Clément Goubert)
[12:34:57] (CR) CI reject: [V:-1] data.yaml: Adding cwilliams to users [puppet] - https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: CWilliams)
[12:36:36] (PS1) Kosta Harlan: hCaptcha: Enable editing on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1285789 (https://phabricator.wikimedia.org/T425354)
[12:36:59] jouncebot: nowandnext
[12:36:59] No deployments scheduled for the next 0 hour(s) and 23 minute(s)
[12:36:59] In 0 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1300)
[12:38:25] FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:39:53] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285789 (https://phabricator.wikimedia.org/T425354) (owner: Kosta Harlan)
[12:41:31] (Merged) jenkins-bot: hCaptcha: Enable editing on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1285789 (https://phabricator.wikimedia.org/T425354) (owner: Kosta Harlan)
[12:41:34] ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907442 (VRiley-WMF) Hey @ssingh Is it okay to make this change today?
[12:41:45] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1285789|hCaptcha: Enable editing on group0 wikis (T425354)]]
[12:41:48] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354
[12:42:12] ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907451 (VRiley-WMF) Also, I do apologize, I was planning on doing this today
[12:43:15] (PS4) CWilliams: data.yaml: Adding cwilliams to users [puppet] - https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930)
[12:43:47] (PS5) WAN233: change logo at zh-classical wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128)
[12:44:02] (CR) CI reject: [V:-1] data.yaml: Adding cwilliams to users [puppet] - https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: CWilliams)
[12:44:15] (CR) WAN233: change logo at zh-classical wikipedia (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: WAN233)
[12:44:17] kostajh: o/ after yours is
done mind if I fit in a config deployment before the backport window?
[12:44:56] ottomata: sure
[12:45:01] ty
[12:45:32] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1285789|hCaptcha: Enable editing on group0 wikis (T425354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:47:26] !log kharlan@deploy1003 kharlan: Continuing with deployment
[12:48:00] (PS5) CWilliams: data.yaml: Adding cwilliams to users [puppet] - https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930)
[12:49:52] ottomata: over to you once https://spiderpig.wikimedia.org/jobs/1949 is done
[12:51:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[12:51:05] thanks ya, watching!
[12:51:38] (PS1) JMeybohm: Revert^2 "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285793 (https://phabricator.wikimedia.org/T418200)
[12:53:53] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285789|hCaptcha: Enable editing on group0 wikis (T425354)]] (duration: 12m 07s)
[12:53:56] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354
[12:53:57] (PS1) JMeybohm: Bump release generation for mercurius to pick up rsyslog upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1285794 (https://phabricator.wikimedia.org/T418200)
[12:54:54] (CR) JMeybohm: [C:+2] Revert "wikikube: Add ratelimit-media namespace" [puppet] - https://gerrit.wikimedia.org/r/1285750 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[12:55:52] SRE, Infrastructure-Foundations, netops: Nokia SR-Linux: BFD broken with default homer configuration - https://phabricator.wikimedia.org/T425813#11907507 (cmooney) Open→Resolved Patch merged and config pushed to all Nokia devices now.
[12:56:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[12:56:19] (CR) Elukey: [C:+2] admin: add spare Yubikey public key and remove the old one [puppet] - https://gerrit.wikimedia.org/r/1285738 (owner: Elukey)
[12:56:46] (CR) TrainBranchBot: [C:+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285525 (https://phabricator.wikimedia.org/T423952) (owner: Ottomata)
[12:56:47] jayme: ok to merge?
[12:57:00] ottomata: ok, done [12:57:11] elukey: Revert "wikikube: Add ratelimit-media namespace" [12:57:12] yes [12:57:27] thanks [12:58:31] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:59:06] (PS6) Atsuko: Add auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal) [12:59:06] (PS3) Atsuko: Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763) [12:59:07] (Merged) jenkins-bot: EventStreamConfig - add mediawiki.user_change.dev0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1285525 (https://phabricator.wikimedia.org/T423952) (owner: Ottomata) [12:59:14] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:59:23] !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1285525|EventStreamConfig - add mediawiki.user_change.dev0 (T423952)]] [12:59:26] T423952: Create mediawiki.user_change event stream - https://phabricator.wikimedia.org/T423952 [12:59:27] thanks kostajh ! [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1300). [13:00:05] yerdua_wmde, codenamenoreste, MatmaRex, and sfaci: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:21] hi [13:01:07] !log otto@deploy1003 otto: Backport for [[gerrit:1285525|EventStreamConfig - add mediawiki.user_change.dev0 (T423952)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:01:11] o/ I can deploy in a few minutes [13:01:44] (CR) Blake: [C:+1] Revert^2 "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285793 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm) [13:01:54] (CR) Atsuko: "added defaults both in templates and in config" [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal) [13:02:25] i need a deployer to ship my changes :) all of my wmf.1 backports should go out together, the config changes can be shipped as you like [13:03:19] !log otto@deploy1003 otto: Continuing with deployment [13:05:51] (CR) Elukey: [C:+2] role::pki: remove the 'discovery' intermediate's config [puppet] - https://gerrit.wikimedia.org/r/1282350 (https://phabricator.wikimedia.org/T420993) (owner: Elukey) [13:06:30] !log remove old discovery pki intermediate [13:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:55] (CR) Brouberol: Add auth_proxy.httpd_cas module (9 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal) [13:07:28] !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285525|EventStreamConfig - add mediawiki.user_change.dev0 (T423952)]] (duration: 08m 05s) [13:07:31] T423952: Create mediawiki.user_change event stream - https://phabricator.wikimedia.org/T423952 [13:11:49] ottomata: are you done deploying? can we do the backport+config window now?
[13:11:59] (CR) Elukey: [C:+2] Set pki1001 to insetup to ease decom [puppet] - https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: Elukey) [13:12:09] codenamenoreste doesn’t seem to be around yet [13:12:57] (CR) Lucas Werkmeister (WMDE): Enable and configure WikiProjects prototype on WikiData beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven) [13:13:30] * Lucas_WMDE looks at MatmaRex’ changes [13:13:37] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11907552 (MoritzMuehlenhoff) [13:13:47] * MatmaRex waves [13:13:53] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11907554 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff eqsin is now fully on routed Ganeti \o/ [13:14:07] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:14:07] * Lucas_WMDE is confused by “Prevent username registration if the username previously existed” and “Prevent username registration if the username previously existed (v2)” [13:14:13] FIRING: [2x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_discovery_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:23] like… I would usually assume that “v2” supersedes the previous version [13:14:25] but we’re deploying both?
[13:14:26] SRE: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11907559 (brouberol) FYI, I created https://gitlab.wikimedia.org/repos/sre/kafka-configurator ~2 years ago thinking it could be useful for 3 things: - managing topics - managing topics configuration - managing ACLs In its cur... [13:14:43] (CR) Filippo Giunchedi: [C:+1] P:openstack: neutron: Set MTU on cloudnet eqiad1 VLAN interfaces [puppet] - https://gerrit.wikimedia.org/r/1285759 (https://phabricator.wikimedia.org/T425674) (owner: Majavah) [13:14:56] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:14:58] (PS1) Muehlenhoff: Fix Cumin alias for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1285797 (https://phabricator.wikimedia.org/T421863) [13:15:13] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:15:23] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 23h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [13:15:26] Lucas_WMDE: the v2 supersedes the one we wrote in 2018 [13:15:44] so yes, both patches are meant to be deployed [13:15:52] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:16:38] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[13:16:40] (PS5) Audrey Penven: Enable and configure WikiProjects prototype on Wikidata beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) [13:16:48] ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907565 (ssingh) >>! In T421421#11907442, @VRiley-WMF wrote: > Hey @ssingh Is it okay to make this change today? Yes, please, the host is not in service so you can start whenever... [13:17:00] (CR) Audrey Penven: Enable and configure WikiProjects prototype on Wikidata beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven) [13:17:24] (CR) Lucas Werkmeister (WMDE): [C:+1] "Note that the stale right still shows up in API output ([example](https://login.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283106 (owner: Bartosz Dziewoński) [13:17:35] let’s start with yerdua_wmde’s changes [13:17:38] *change [13:17:46] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [13:17:46] because that doesn’t involve rebuilding the l10n cache :D [13:18:38] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven) [13:18:40] (and I assume ottomata is done) [13:18:56] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:19:02] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[13:19:13] (CR) Brouberol: [C:+1] Add dse-k8s-wdqs-test hosts to site.pp and preseed.yaml [puppet] - https://gerrit.wikimedia.org/r/1285763 (https://phabricator.wikimedia.org/T425653) (owner: Btullis) [13:19:14] and then drop writeapi next, as its own deploy because it has some risk of breakage [13:19:23] and then all the MatmaRex backports can go through gate-and-submit while that merges [13:19:28] and then we’ll see how much further we get [13:19:35] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:19:46] (Merged) jenkins-bot: Enable and configure WikiProjects prototype on Wikidata beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven) [13:20:03] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1270482|Enable and configure WikiProjects prototype on Wikidata beta (T421850)]] [13:20:06] T421850: [WIPR] Prototype - Display Wikiproject link on Beta Item pages using properties - https://phabricator.wikimedia.org/T421850 [13:21:28] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply [13:21:29] Lucas_WMDE: hmm, if you think the 'writeapi' removal is risky, let's reschedule that one. i want to get the other ones a lot more. i'll update the backports and calendar [13:21:43] !log lucaswerkmeister-wmde@deploy1003 audreypenven, lucaswerkmeister-wmde: Backport for [[gerrit:1270482|Enable and configure WikiProjects prototype on Wikidata beta (T421850)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:21:54] yerdua_wmde: anything to test on mwdebug for this change?
[13:22:01] (the correct answer is “no”, beta doesn’t have mwdebug ;)) [13:22:02] (CR) Btullis: [C:+2] Add dse-k8s-wdqs-test hosts to site.pp and preseed.yaml [puppet] - https://gerrit.wikimedia.org/r/1285763 (https://phabricator.wikimedia.org/T425653) (owner: Btullis) [13:22:15] MatmaRex: okay, we can do the backports first [13:22:15] no, nothing to test [13:22:23] !log lucaswerkmeister-wmde@deploy1003 audreypenven, lucaswerkmeister-wmde: Continuing with deployment [13:22:25] (PS3) Bartosz Dziewoński: Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes [mediawiki-config] - https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386) [13:22:29] * Lucas_WMDE clicks +2 a couple times [13:22:40] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285460 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:22:43] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285461 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:22:48] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: Bartosz Dziewoński) [13:22:50] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859) (owner: Bartosz Dziewoński) [13:22:55] (CR) Bartosz Dziewoński: "Hmm.
I'll schedule this for a less busy window…" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283106 (owner: Bartosz Dziewoński) [13:22:58] !log jiji@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc1055.eqiad.wmnet with OS trixie [13:23:09] MatmaRex: fyi I would also do the other config change separately because I haven’t looked at it yet [13:23:34] (also, sfaci are you around? should your changes be deployed separately or together?) [13:23:37] sure [13:23:51] FIRING: [3x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:13] FIRING: [4x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_discovery_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:24:17] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1055.eqiad.wmnet with OS bookworm [13:24:45] (CR) CI reject: [V:-1] Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes [mediawiki-config] - https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:25:08] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 23h 25m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [13:25:43] (CR) Lucas Werkmeister (WMDE): [C:+1] "Seems uncontroversial, I think we can skip the on-wiki notifications / consensus-finding here."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:25:52] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:25:53] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [13:26:04] (CR) Lucas Werkmeister (WMDE): [C:+1] "recheck (T419488?)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:26:31] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270482|Enable and configure WikiProjects prototype on Wikidata beta (T421850)]] (duration: 06m 28s) [13:26:34] T421850: [WIPR] Prototype - Display Wikiproject link on Beta Item pages using properties - https://phabricator.wikimedia.org/T421850 [13:27:01] (PS1) JMeybohm: ratelimit-media: Set default gateway hostname [deployment-charts] - https://gerrit.wikimedia.org/r/1285798 (https://phabricator.wikimedia.org/T414439) [13:27:13] yerdua_wmde: all done, should be effective on beta soon (if not already) ^^ [13:27:21] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285460 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:27:22] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285461 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:27:22] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport"
[extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: Bartosz Dziewoński) [13:27:23] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859) (owner: Bartosz Dziewoński) [13:28:39] (CR) CI reject: [V:-1] Prevent username registration if the username previously existed [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285460 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:28:41] (CR) CI reject: [V:-1] Prevent username registration if the username previously existed (v2) [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285461 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:28:41] (CR) CI reject: [V:-1] API: Introduce list=globalusers [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: Bartosz Dziewoński) [13:28:42] bah, the backports are already failing in zuul [13:28:42] (CR) CI reject: [V:-1] list=globalusers: Avoid querying group permissions with empty group list [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859) (owner: Bartosz Dziewoński) [13:28:50] (CR) Elukey: [C:+1] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1285784 (owner: Muehlenhoff) [13:28:56] ffs https://integration.wikimedia.org/ci/job/quibble-with-Wikibase-extensions-browser-tests-only-vendor-php83/7672/console [13:29:00] “Skipping remaining commands due to success cache hit” [13:29:09] and then T419488 changed it to a failure anyway [13:29:10] T419488:
PostBuild changing the status of successful builds to failure for no apparent reason - https://phabricator.wikimedia.org/T419488 [13:29:42] let’s click the convenient retry button in spiderpig [13:29:45] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285460 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:29:45] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285461 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:29:46] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: Bartosz Dziewoński) [13:29:47] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859) (owner: Bartosz Dziewoński) [13:30:13] the regular gate-and-submit queue is also super full [13:30:14] (PS1) JMeybohm: Add ratelimit-media CNAMEs [dns] - https://gerrit.wikimedia.org/r/1285799 (https://phabricator.wikimedia.org/T414439) [13:30:31] Lucas_WMDE: Sorry I'm late. sfaci can't make the window. I'm here in their place [13:30:42] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h).
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:30:43] !log restarting pybal on lvs1019 and lvs1020 for T420437 [13:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:47] T420437: Migrate DSE k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420437 [13:30:48] phuedx: with a lot of luck we might get to your changes in this window [13:30:58] (CR) JMeybohm: [C:+2] ratelimit-media: Set default gateway hostname [deployment-charts] - https://gerrit.wikimedia.org/r/1285798 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm) [13:30:58] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:31:28] phuedx: should those changes be deployed together or separately?
[13:32:22] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply [13:32:29] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [13:33:05] (Merged) jenkins-bot: ratelimit-media: Set default gateway hostname [deployment-charts] - https://gerrit.wikimedia.org/r/1285798 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm) [13:33:25] FIRING: [5x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:33] (CR) Ayounsi: [C:+1] Fix Cumin alias for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1285797 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff) [13:33:54] (PS7) Atsuko: Add auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal) [13:33:54] (PS4) Atsuko: Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763) [13:34:06] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply [13:34:16] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [13:34:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:27] (Merged) jenkins-bot: Prevent username registration if the username previously existed [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285460 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:34:31] (CR) Atsuko: Add
auth_proxy.httpd_cas module (8 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal) [13:34:32] (Merged) jenkins-bot: Prevent username registration if the username previously existed (v2) [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285461 (https://phabricator.wikimedia.org/T196386) (owner: Bartosz Dziewoński) [13:35:14] Lucas_WMDE: They can be deployed together [13:35:20] * phuedx crosses fingers [13:35:25] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: Bking) [13:35:52] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:36:16] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage [13:37:20] (CR) DCausse: [C:+1] dse-k8s: raise vm.max_map_count for OpenSearch [puppet] - https://gerrit.wikimedia.org/r/1284792 (https://phabricator.wikimedia.org/T425681) (owner: Bking) [13:38:22] (Merged) jenkins-bot: API: Introduce list=globalusers [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: Bartosz Dziewoński) [13:38:25] FIRING: [6x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:38:25] (Merged) jenkins-bot: list=globalusers: Avoid querying group permissions with empty group list [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285761
(https://phabricator.wikimedia.org/T425859) (owner: Bartosz Dziewoński) [13:38:35] woohoo [13:38:44] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1285460|Prevent username registration if the username previously existed (T196386)]], [[gerrit:1285461|Prevent username registration if the username previously existed (v2) (T196386)]], [[gerrit:1285462|API: Introduce list=globalusers (T261752)]], [[gerrit:1285761|list=globalusers: Avoid querying group permissions with empty group list (T [13:38:44] 425859)]] [13:38:49] T196386: MediaWiki should prevent username registration if the username previously existed - https://phabricator.wikimedia.org/T196386 [13:38:55] T261752: Add an API module to display status of multiple globally locked users - https://phabricator.wikimedia.org/T261752 [13:39:17] ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907708 (VRiley-WMF) Open→In progress [13:39:30] (CR) CWilliams: data.yaml: Adding cwilliams to users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: CWilliams) [13:39:56] ops-codfw, SRE, DC-Ops, Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11907712 (Fabfur) [13:40:07] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage [13:40:38] (CR) Brouberol: [C:+1] "Perfect!" [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal) [13:40:42] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed.
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:40:58] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:42:45] (PS2) CWilliams: Added cwilliams to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1285369 [13:43:23] (Abandoned) Btullis: Update spark shufflers on the test cluster to deploy version 3.5 [puppet] - https://gerrit.wikimedia.org/r/1093394 (https://phabricator.wikimedia.org/T380040) (owner: Btullis) [13:43:25] FIRING: [7x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:43:43] (PS3) CWilliams: data.yaml: Adding cwilliams to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1285369 (https://phabricator.wikimedia.org/T425930) [13:44:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:32] (still building the image… this’ll probably take a while) [13:45:52] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:46:10] (CR) Brouberol: [C:+1] "Let's go!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko) [13:47:12] !log btullis@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: dse-k8s-worker-codfw@codfw [13:47:48] wondering if the other MatmaRex config change (grant createpreviouslyrenamedaccount) and phuedx’ changes can be deployed together afterwards [13:48:05] probably [13:48:07] though I guess the WikiLambda event stream(?) stuff could be risky [13:48:25] FIRING: [10x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:13] RESOLVED: [2x] ProbeDown: Service pki2002:443 has failed probes (http_PKI_discovery_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:38] !log btullis@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [13:50:11] Lucas_WMDE: That event stream change is a NOOP tidy up AFAICT.
The instrument that sends analytics events to the stream is configured via TestKitchen now [13:50:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [13:50:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: dse-k8s-worker-codfw@codfw [13:50:59] !log btullis@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: dse-k8s-worker-eqiad@eqiad [13:51:30] scap has been running docker-pusher for almost ten minutes now [13:51:39] and `top` says dockerd is a bit above 100% CPU usage [13:51:49] * Lucas_WMDE wonders how pushing data over the network can be CPU bound [13:53:38] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Wed 27 May 2026 01:53:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [13:56:05] !log btullis@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [13:56:07] (PS1) Slyngshede: Update to CAS version 7.3.6 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1285804 [13:56:13] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1055.eqiad.wmnet with OS bookworm [13:57:01] (CR) Brouberol: [C:+1] dse-k8s: raise vm.max_map_count for OpenSearch [puppet] - https://gerrit.wikimedia.org/r/1284792 (https://phabricator.wikimedia.org/T425681) (owner: Bking) [13:57:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [13:57:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias:
dse-k8s-worker-eqiad@eqiad [13:58:59] ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907811 (VRiley-WMF) [13:59:08] image builds completed :o [13:59:32] MatmaRex will the backports be testable btw? [14:00:40] Lucas_WMDE: yeah, i have some API queries prepared [14:01:21] ok, great [14:03:25] FIRING: [10x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:32] !log lucaswerkmeister-wmde@deploy1003 matmarex, lucaswerkmeister-wmde: Backport for [[gerrit:1285460|Prevent username registration if the username previously existed (T196386)]], [[gerrit:1285461|Prevent username registration if the username previously existed (v2) (T196386)]], [[gerrit:1285462|API: Introduce list=globalusers (T261752)]], [[gerrit:1285761|list=globalusers: Avoid querying group permissions with empty group [14:04:32] list (T425859)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:04:36] T196386: MediaWiki should prevent username registration if the username previously existed - https://phabricator.wikimedia.org/T196386 [14:04:37] T261752: Add an API module to display status of multiple globally locked users - https://phabricator.wikimedia.org/T261752 [14:04:37] T425859: InvalidArgumentException in list=globalusers API module with gusprop=rights - https://phabricator.wikimedia.org/T425859 [14:04:40] (03CR) 10Atsuko: [C:03+1] Migrated turnilo to auth_proxy.httpd_cas module (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:04:53] (03CR) 10Atsuko: [C:03+2] Add auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: 10Joal) [14:05:00] (03CR) 10Atsuko: [C:03+2] Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:05:28] Lucas_WMDE: thanks, looks good [14:05:32] !log lucaswerkmeister-wmde@deploy1003 matmarex, lucaswerkmeister-wmde: Continuing with deployment [14:05:33] \o/ [14:05:51] jouncebot: nowandnext [14:05:51] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [14:05:51] In 0 hour(s) and 24 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1430) [14:06:00] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285352 (https://phabricator.wikimedia.org/T415254) (owner: 10Jforrester) [14:06:03] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285406 (https://phabricator.wikimedia.org/T425785) (owner: 10Santiago Faci) 
[14:06:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:09] (03CR) 10Xcollazo: "( I don't think I have the expertise to review here. Will let @joal@wikimedia.org review. )" [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [14:08:25] FIRING: [12x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:12] (03Merged) 10jenkins-bot: Add auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: 10Joal) [14:09:15] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:09:26] (03Merged) 10jenkins-bot: Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:09:50] (03CR) 10CDanis: [C:03+2] Revert "haproxy: webrequest: capture ratelimiting headers" [puppet] - 10https://gerrit.wikimedia.org/r/1283858 (owner: 10CDanis) [14:10:21] meanwhile, zuul is still in hell /o\ [14:12:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:12:58] (03Merged) 10jenkins-bot: WikiLambdaApi instrument: Sets the custom schemaID [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 
10https://gerrit.wikimedia.org/r/1285352 (https://phabricator.wikimedia.org/T415254) (owner: 10Jforrester) [14:13:04] (03Merged) 10jenkins-bot: editSaves: getExperiment returns a promise now [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285406 (https://phabricator.wikimedia.org/T425785) (owner: 10Santiago Faci) [14:13:25] FIRING: [19x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907908 (10VRiley-WMF) [14:15:32] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bookworm [14:15:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907926 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host lvs1017.eqiad.wmnet with OS bookworm [14:16:35] TIL SpiderPig disables the “new backport” form when one is already running [14:16:47] I wanted to speed things up by already pasting the URLs for the next deploy but computer says no [14:17:53] what does “Waiting 20 seconds for production traffic” do, btw? 
[14:18:07] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285460|Prevent username registration if the username previously existed (T196386)]], [[gerrit:1285461|Prevent username registration if the username previously existed (v2) (T196386)]], [[gerrit:1285462|API: Introduce list=globalusers (T261752)]], [[gerrit:1285761|list=globalusers: Avoid querying group permissions with empty group list ( [14:18:07] T425859)]] (duration: 39m 22s) [14:18:12] T196386: MediaWiki should prevent username registration if the username previously existed - https://phabricator.wikimedia.org/T196386 [14:18:12] T261752: Add an API module to display status of multiple globally locked users - https://phabricator.wikimedia.org/T261752 [14:18:12] T425859: InvalidArgumentException in list=globalusers API module with gusprop=rights - https://phabricator.wikimedia.org/T425859 [14:18:25] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński) [14:18:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254) (owner: 10Santiago Faci) [14:20:12] (03Merged) 10jenkins-bot: Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński) [14:20:16] (03Merged) 10jenkins-bot: WikiLambdaApi: update stream 
configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254) (owner: 10Santiago Faci) [14:20:38] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1285448|Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes (T196386)]], [[gerrit:1278704|WikiLambdaApi: update stream configuration (T415254)]], [[gerrit:1285352|WikiLambdaApi instrument: Sets the custom schemaID (T415254)]], [[gerrit:1285406|editSaves: getExperiment returns a promise now (T425785)]] [14:20:41] MatmaRex, phuedx: ^ fyi [14:20:47] T415254: Migrate "WikiLambda API" instrument to use the Test Kitchen SDK - https://phabricator.wikimedia.org/T415254 [14:20:47] T425785: TypeError: experiment.send is not a function - https://phabricator.wikimedia.org/T425785 [14:21:14] Lucas_WMDE: ty [14:21:22] Lucas_WMDE: exactly what it says, waits for some traffic to hit the canaries so it can check if that traffic is failing or not [14:21:52] taavi: but it says production traffic, not canary traffic [14:21:54] (“Waiting 20 seconds for canary traffic” is earlier) [14:22:05] the canaries are serving some subset of production traffic [14:22:30] (03PS1) 10Tiziano Fogli: logstash: adjust param_time parsing for thanos-query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/1285823 (https://phabricator.wikimedia.org/T423986) [14:22:48] I’m still confused [14:23:04] we deploy to the canaries, then wait for canary traffic, then check logstash. so far so good [14:23:18] then we deploy to all of production, and… wait for traffic and check logstash again? 
[14:23:25] FIRING: [25x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:39] think of canaries as a subset of production, not a separate thing [14:24:03] * Lucas_WMDE digs up some older spiderpig logs [14:24:31] yeah there’s no “Waiting 20 seconds for production traffic” in https://spiderpig.wikimedia.org/jobs/1000, this is something newer [14:25:24] is this also part of T225207? [14:25:25] T225207: Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207 [14:26:17] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, jforrester, matmarex, sfaci: Backport for [[gerrit:1285448|Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes (T196386)]], [[gerrit:1278704|WikiLambdaApi: update stream configuration (T415254)]], [[gerrit:1285352|WikiLambdaApi instrument: Sets the custom schemaID (T415254)]], [[gerrit:1285406|editSaves: getExperiment returns a promise now [14:26:18] (T425785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:26:22] T196386: MediaWiki should prevent username registration if the username previously existed - https://phabricator.wikimedia.org/T196386 [14:26:23] T415254: Migrate "WikiLambda API" instrument to use the Test Kitchen SDK - https://phabricator.wikimedia.org/T415254 [14:26:23] T425785: TypeError: experiment.send is not a function - https://phabricator.wikimedia.org/T425785 [14:26:40] MatmaRex, phuedx: please test [14:26:55] apparently it’s T317405 / https://gitlab.wikimedia.org/repos/releng/scap/-/commit/ec14e688b8 [14:26:56] T317405: Add failure rate triggered rollback to scap - https://phabricator.wikimedia.org/T317405 [14:29:22] looking, sorry [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1430) [14:30:16] still deploying, sorry test kitcheners [14:30:54] phuedx: can you test the changes in WikiLambdaApi/WikimediaEvents? [14:31:03] looks good [14:31:08] ok, thanks [14:31:11] Lucas_WMDE: I'm just looking at the Wikilambda one [14:31:20] ok [14:31:33] The WikimediaEvents one is tricky to test but I'm confident that it will fix the error :) [14:32:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908107 (10ssingh) ` Record: 410 Date/Time: 05/10/2026 04:22:34 Source: system Severity: Critical Description: A critical diagnostic event occurred in the memory device at B2. Conta... [14:33:03] OK. Plenty of Wikilambda API requests are succeeding on regular abstractwiki pageviews 👍 [14:33:05] Lucas_WMDE: ^ [14:33:06] LGTM [14:33:08] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, jforrester, matmarex, sfaci: Continuing with deployment [14:33:11] alright, thanks! 
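Note on the canary exchange above (14:21 to 14:26): the flow described in the chat is deploy to the canaries, wait for traffic, check the error logs, then deploy to all of production, wait again, and check again, with T317405 adding a failure-rate-triggered rollback. What follows is a minimal sketch of that two-stage pattern and not scap's actual code. Every name in it (`staged_rollout`, `deploy`, `error_rate`, `rollback`) and the 5% threshold are hypothetical; only the 20-second wait comes from the log message quoted in the chat.

```python
import time
from typing import Callable, Sequence


def staged_rollout(
    canaries: Sequence[str],
    all_hosts: Sequence[str],
    deploy: Callable[[Sequence[str]], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    threshold: float = 0.05,       # hypothetical failure-rate limit
    wait_seconds: float = 20.0,    # "Waiting 20 seconds for ... traffic"
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deploy to canaries, then to every host; roll back on a high error rate.

    Returns True if both stages passed, False if a rollback was triggered.
    """
    # Canaries are a subset of production, so the same check runs twice:
    # once after the canary stage and once after the full deployment.
    for hosts in (canaries, all_hosts):
        deploy(hosts)
        sleep(wait_seconds)        # let real traffic reach the new code
        if error_rate() > threshold:
            rollback()
            return False
    return True
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; an actual deployment tool would query its logging backend where `error_rate` is called here.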
[14:34:49] (03CR) 10CDanis: [C:03+2] turnilo: webrequest: add ja4h sub-component dimensions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279447 (owner: 10CDanis) [14:37:00] (03Merged) 10jenkins-bot: turnilo: webrequest: add ja4h sub-component dimensions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279447 (owner: 10CDanis) [14:38:45] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [14:39:28] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285448|Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes (T196386)]], [[gerrit:1278704|WikiLambdaApi: update stream configuration (T415254)]], [[gerrit:1285352|WikiLambdaApi instrument: Sets the custom schemaID (T415254)]], [[gerrit:1285406|editSaves: getExperiment returns a promise now (T425785)]] (duration: 18 [14:39:28] m 50s) [14:39:34] T196386: MediaWiki should prevent username registration if the username previously existed - https://phabricator.wikimedia.org/T196386 [14:39:34] T415254: Migrate "WikiLambda API" instrument to use the Test Kitchen SDK - https://phabricator.wikimedia.org/T415254 [14:39:35] T425785: TypeError: experiment.send is not a function - https://phabricator.wikimedia.org/T425785 [14:39:45] !log UTC afternoon backport+config window done [14:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:54] * Lucas_WMDE done deploying, Test Kitchen can take over now [14:39:58] sorry for the delay [14:40:04] thanks for deploying Lucas_WMDE [14:40:41] +1 Thanks Lucas_WMDE <3 [14:40:47] (meanwhile Zuul remains firmly stuck in hell) [14:41:25] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:41:52] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [14:42:17] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [14:42:28] !log vriley@cumin1003 START - Cookbook 
sre.network.configure-switch-interfaces for host lvs1017 [14:43:46] 06SRE, 06Infrastructure-Foundations: Look into feasibility of disabling sha-1 host keys on our ssh daemons - https://phabricator.wikimedia.org/T167966#11908173 (10LSobanski) p:05Medium→03Low This will be addressed automatically with Debian version upgrades. [14:43:57] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1017 [14:44:50] 06SRE, 06Infrastructure-Foundations: Investigate GID allocation for system users - https://phabricator.wikimedia.org/T235163#11908178 (10LSobanski) p:05Medium→03Low [14:45:35] (03CR) 10Elukey: [C:03+2] sre.network: handle dry-run outputs in run_junos_commands (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1269011 (owner: 10Elukey) [14:46:29] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:47:28] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:48:07] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Puppet-Core, 07Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411#11908191 (10LSobanski) p:05Medium→03Low Considering the age of this task, is this still a valid request? [14:49:54] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Puppet does not undo manual "systemctl mask $unit" - https://phabricator.wikimedia.org/T285425#11908214 (10LSobanski) 05Open→03Declined Exim doesn't have a fully fledged systemd unit and masking is expected to work fine otherwise. Please reopen if y... [14:50:00] (03PS1) 10Effie Mouzeli: gateway-check.lua: Route some LiftWing endpoints through the REST gateway. 
[puppet] - 10https://gerrit.wikimedia.org/r/1285827 (https://phabricator.wikimedia.org/T422804) (owner: 10Blake) [14:51:36] (03PS1) 10Elukey: admin_ng: update the opentelemetry's collector to Trixie [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285835 (https://phabricator.wikimedia.org/T416452) [14:51:52] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core, 07Puppet (Puppet 7.0): Puppet Profiler - https://phabricator.wikimedia.org/T341448#11908226 (10LSobanski) p:05Medium→03Low [14:52:09] (03CR) 10CDanis: [C:03+1] admin_ng: update the opentelemetry's collector to Trixie [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285835 (https://phabricator.wikimedia.org/T416452) (owner: 10Elukey) [14:54:00] (03CR) 10Ssingh: [C:03+1] Geo-maps: Update Meta PoPs [dns] - 10https://gerrit.wikimedia.org/r/1282956 (owner: 10Slyngshede) [14:54:05] !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [14:54:29] !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [14:54:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908254 (10Jhancock.wm) i pulled a replacement DIMM and a ssd from our offlined hosts. @ssingh safe to power down the host? [14:55:21] (03CR) 10Blake: [C:03+2] gateway-check.lua: Route some LiftWing endpoints through the REST gateway. 
[puppet] - 10https://gerrit.wikimedia.org/r/1285827 (https://phabricator.wikimedia.org/T422804) (owner: 10Blake) [14:55:23] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2012.codfw.wmnet with reason: DIMM replacement [14:55:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908259 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8830f0f1-94da-40cc-8ac8-4aef8e53c8f4) set by sukhe@cumin1003 for 1:00:00 on 1 host(s) and their services with reason: DI... [14:55:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908262 (10ssingh) >>! In T425890#11908254, @Jhancock.wm wrote: > i pulled a replacement DIMM and a ssd from our offlined hosts. > @ssingh safe to power down the host? @Jhancock.wm: Yes, please... [14:59:57] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2144.codfw.wmnet - https://phabricator.wikimedia.org/T425522#11908288 (10Jhancock.wm) 05Open→03Resolved [15:09:48] (03CR) 10Elukey: [C:03+2] admin_ng: update the opentelemetry's collector to Trixie [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285835 (https://phabricator.wikimedia.org/T416452) (owner: 10Elukey) [15:17:27] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bookworm [15:17:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11908393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host lvs1017.eqiad.wmnet with OS bookworm [15:19:55] 10ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11908396 (10RobH) Putting in a remote hands ticket with the following: Support, One of our router to switch links has unexpectedly gone down. 
We would like you to observe both ports, note the lack of link light, then pro... [15:21:02] !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [15:21:11] !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [15:22:20] PROBLEM - Host lsw1-a3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:22:58] PROBLEM - Host ps1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:24:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908449 (10Jhancock.wm) @ssingh replaced both. not seeing any errors in the idrac logs at this moment. You should be good to rebuild it. [15:24:32] (03PS1) 10CDanis: turnilo: webrequest: bool dimension for resiproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285846 [15:25:07] 10ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11908458 (10RobH) [15:27:17] (03CR) 10Bking: [C:03+1] turnilo: webrequest: bool dimension for resiproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285846 (owner: 10CDanis) [15:27:41] (03CR) 10CDanis: [C:03+2] turnilo: webrequest: bool dimension for resiproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285846 (owner: 10CDanis) [15:29:00] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:29:23] (03CR) 10Joal: [C:03+1] turnilo: webrequest: bool dimension for resiproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285846 (owner: 10CDanis) [15:29:27] (03CR) 10Bking: [C:03+2] dse-k8s: raise vm.max_map_count for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1284792 (https://phabricator.wikimedia.org/T425681) (owner: 10Bking) [15:29:52] (03Merged) 10jenkins-bot: turnilo: webrequest: bool dimension for resiproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285846 (owner: 
10CDanis) [15:29:58] RECOVERY - Host lsw1-a3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.67 ms [15:30:01] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:30:04] RECOVERY - Host ps1-a3-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.83 ms [15:30:05] jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1530). [15:30:19] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:30:44] ACKNOWLEDGEMENT - MD RAID on lvs2012 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T425965 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:30:57] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425965 (10ops-monitoring-bot) 03NEW [15:31:40] (03CR) 10Cwhite: [C:03+2] logstash: adjust param_time parsing for thanos-query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/1285823 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [15:32:13] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908546 (10ssingh) @Jhancock.wm: Thanks for the quick turnaround! Host is back and serving traffic, will keep a close watch for a bit before resolving this. 
[15:33:00] PROBLEM - Host ps1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:33:14] RECOVERY - Host ps1-a3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms [15:33:30] (03CR) 10Muehlenhoff: [C:03+2] Fix Cumin alias for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1285797 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [15:33:35] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425965#11908555 (10Jhancock.wm) →14Duplicate dup:03T425890 [15:33:36] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908553 (10Jhancock.wm) [15:36:41] jouncebot: nowandnext [15:36:41] For the next 0 hour(s) and 23 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1530) [15:36:41] In 1 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1700) [15:36:41] In 1 hour(s) and 23 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1700) [15:36:47] (03CR) 10Zabe: [C:03+2] Start reading from new file tables on testwiki (2nd try) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280418 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [15:38:25] FIRING: [25x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:39:17] (03Merged) 10jenkins-bot: Start reading from new file tables on testwiki (2nd try) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280418 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [15:39:47] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1280418|Start reading 
from new file tables on testwiki (2nd try) (T416548)]] [15:39:51] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [15:40:01] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:40:30] PROBLEM - Host lsw1-a5-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:22] PROBLEM - Host ps1-a5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:41:35] !log zabe@deploy1003 zabe: Backport for [[gerrit:1280418|Start reading from new file tables on testwiki (2nd try) (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:42:10] !log zabe@deploy1003 zabe: Continuing with deployment [15:44:06] (03PS4) 10CDanis: puppetserver: install cidergrinder, run daily grind on primary [puppet] - 10https://gerrit.wikimedia.org/r/1270971 [15:44:11] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270971 (owner: 10CDanis) [15:44:42] RECOVERY - Host ps1-a5-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.06 ms [15:44:56] RECOVERY - Host lsw1-a5-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.06 ms [15:46:13] (03PS5) 10CDanis: puppetserver: install cidergrinder, run daily grind on primary [puppet] - 10https://gerrit.wikimedia.org/r/1270971 [15:46:19] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1280418|Start reading from new file tables on testwiki (2nd try) (T416548)]] (duration: 06m 32s) [15:46:23] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [15:46:29] (03PS3) 10Zabe: Remove custom user groups from Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281491 (https://phabricator.wikimedia.org/T423578) [15:46:35] (03CR) 10Zabe: [C:03+2] Remove custom user groups from Wikinews [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1281491 (https://phabricator.wikimedia.org/T423578) (owner: 10Zabe) [15:47:58] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#11908655 (10elukey) ` Updating the root user's password on the BMC. Changing password for the account with username root: /redfish/v1/AccountService/Accounts/3 Updating the ADMIN user's password... [15:48:44] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270971 (owner: 10CDanis) [15:50:28] (03Merged) 10jenkins-bot: Remove custom user groups from Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281491 (https://phabricator.wikimedia.org/T423578) (owner: 10Zabe) [15:50:48] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:50:51] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1281491|Remove custom user groups from Wikinews (T423578)]] [15:50:54] T423578: Remove custom user groups from Wikinews (in core-Permissions.php) - https://phabricator.wikimedia.org/T423578 [15:52:33] !log zabe@deploy1003 zabe: Backport for [[gerrit:1281491|Remove custom user groups from Wikinews (T423578)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:53:26] (03PS1) 10Eevans: sessionstore: Upgrade to Kask v1.0.19 (Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285852 (https://phabricator.wikimedia.org/T425308) [15:53:49] (03CR) 10Dzahn: [C:03+2] admin: upgrade user gweld to shell, analytics-privatedata and kerberos [puppet] - 10https://gerrit.wikimedia.org/r/1285413 (https://phabricator.wikimedia.org/T425727) (owner: 10Dzahn) [15:54:26] !log zabe@deploy1003 zabe: Continuing with deployment [15:54:48] PROBLEM - Host lsw1-a7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:46] PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:57:00] (03CR) 10Eevans: [C:03+2] sessionstore: Upgrade to Kask v1.0.19 (Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285852 (https://phabricator.wikimedia.org/T425308) (owner: 10Eevans) [15:58:35] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity and wmf LDAP group for GWeld - https://phabricator.wikimedia.org/T425727#11908744 (10Dzahn) No problem, Manuel. With your +1 I merged and deployed it. Then I created the Kerberos principal.... 
[15:58:39] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281491|Remove custom user groups from Wikinews (T423578)]] (duration: 07m 48s) [15:58:43] T423578: Remove custom user groups from Wikinews (in core-Permissions.php) - https://phabricator.wikimedia.org/T423578 [15:58:48] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity and wmf LDAP group for GWeld - https://phabricator.wikimedia.org/T425727#11908746 (10Dzahn) 05In progress→03Resolved a:03Dzahn [15:59:20] (03Merged) 10jenkins-bot: sessionstore: Upgrade to Kask v1.0.19 (Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285852 (https://phabricator.wikimedia.org/T425308) (owner: 10Eevans) [15:59:46] (03PS1) 10Zabe: Start reading from new file tables on all small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285853 (https://phabricator.wikimedia.org/T416548) [16:00:20] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply [16:00:35] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [16:00:49] (03PS1) 10Andrew Bogott: wmfkeystonehooks: double-check that we're adding a new user to a project [puppet] - 10https://gerrit.wikimedia.org/r/1285854 (https://phabricator.wikimedia.org/T379550) [16:01:28] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:01:30] (03CR) 10CI reject: [V:04-1] wmfkeystonehooks: double-check that we're adding a new user to a project [puppet] - 10https://gerrit.wikimedia.org/r/1285854 (https://phabricator.wikimedia.org/T379550) (owner: 10Andrew Bogott) [16:01:47] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:02:15] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision 
(exit_code=99) for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:02:20] RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.29 ms [16:02:20] RECOVERY - Host lsw1-a7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.59 ms [16:03:23] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:05:44] (03PS2) 10Andrew Bogott: wmfkeystonehooks: double-check that we're adding a new user to a project [puppet] - 10https://gerrit.wikimedia.org/r/1285854 (https://phabricator.wikimedia.org/T379550) [16:08:22] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#11908821 (10Jclark-ctr) @elukey is there anything I can do to help with this? [16:09:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:56] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908830 (10Dzahn) @KOfori Hi, this says it needs your approval. Does it look good? [16:12:20] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908834 (10Dzahn) 05Open→03In progress [16:12:24] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908836 (10ssingh) @KOfori is out, deferring to @Kappakayala as the approver in the interim. 
[16:14:14] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:14:32] (03PS1) 10JMeybohm: Add TLS for the ratelimit namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285858 (https://phabricator.wikimedia.org/T414439) [16:15:29] 10SRE-swift-storage, 10Cloud-VPS (Quota-requests): Quota increase request for project swift - https://phabricator.wikimedia.org/T425975 (10MatthewVernon) 03NEW [16:15:43] (03CR) 10Elukey: [C:03+1] Add TLS for the ratelimit namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285858 (https://phabricator.wikimedia.org/T414439) (owner: 10JMeybohm) [16:16:19] (03CR) 10Dzahn: "patch looks good, just needs approvals" [puppet] - 10https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: 10CWilliams) [16:16:26] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:02] (03PS3) 10Zabe: Disable FlaggedRevs on wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281506 (https://phabricator.wikimedia.org/T423577) [16:17:09] (03CR) 10Zabe: [C:03+2] Disable FlaggedRevs on wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281506 (https://phabricator.wikimedia.org/T423577) (owner: 10Zabe) [16:17:50] PROBLEM - Host cloudsw1-b1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:18:20] PROBLEM - Host ps1-b1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:18:21] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908905 (10Dzahn) [16:18:31] (03CR) 10Dzahn: "(and out-of-band confirmation of the SSH key is needed)" 
[puppet] - 10https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: 10CWilliams) [16:18:32] (03PS1) 10Eevans: echostore: Upgrade (staging) to Kask v1.0.19 (Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285859 (https://phabricator.wikimedia.org/T425308) [16:19:42] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908913 (10Dzahn) confirmed in Dayforce. NDA checkbox not needed for staff. L3 checked [16:19:47] (03Merged) 10jenkins-bot: Disable FlaggedRevs on wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281506 (https://phabricator.wikimedia.org/T423577) (owner: 10Zabe) [16:20:01] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908918 (10Dzahn) [16:20:26] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1281506|Disable FlaggedRevs on wikinews (T423577)]] [16:20:29] T423577: Undeploy FlaggedRevs from Wikinews and drop FlaggedRevs tables - https://phabricator.wikimedia.org/T423577 [16:21:16] (03CR) 10Eevans: [C:03+2] echostore: Upgrade (staging) to Kask v1.0.19 (Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285859 (https://phabricator.wikimedia.org/T425308) (owner: 10Eevans) [16:21:26] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317#11908923 (10Dzahn) Please let us know if you run into specific problems. Probably this is about "upgrade from level 2 to level 3" which would mean access to (more) privat... 
[16:21:56] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317#11908924 (10Dzahn) a:03ArthurTaylor [16:22:04] RECOVERY - Host ps1-b1-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.97 ms [16:22:07] !log zabe@deploy1003 zabe: Backport for [[gerrit:1281506|Disable FlaggedRevs on wikinews (T423577)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:22:22] RECOVERY - Host cloudsw1-b1-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.58 ms [16:23:09] !log zabe@deploy1003 zabe: Continuing with deployment [16:23:22] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11908935 (10RobH) Update from email: * finally got an answer back after escalating both on the ticket, via our dell sg team, and via the accounts payable folks @ dell sg who want to be paid for the m... [16:23:28] (03Merged) 10jenkins-bot: echostore: Upgrade (staging) to Kask v1.0.19 (Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285859 (https://phabricator.wikimedia.org/T425308) (owner: 10Eevans) [16:23:29] (03CR) 10JMeybohm: [C:03+2] Add TLS for the ratelimit namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285858 (https://phabricator.wikimedia.org/T414439) (owner: 10JMeybohm) [16:23:56] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/echostore: apply [16:24:16] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/echostore: apply [16:25:03] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/echostore: apply [16:25:20] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/echostore: apply [16:25:22] 06SRE, 10SRE-Access-Requests: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11908952 (10Dzahn) Hi @KartikMistry we need to do an "out-of-band verification" that this is really 
your new key. Could you maybe drop a file in some home directory on a production server that confirms it? A... [16:25:39] 06SRE, 10SRE-Access-Requests: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11908953 (10Dzahn) a:03KartikMistry [16:26:10] 06SRE, 10SRE-Access-Requests: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11908956 (10Dzahn) 05Open→03In progress [16:26:47] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11908963 (10Dzahn) a:03KFrancis [16:27:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11908964 (10Dzahn) 05In progress→03Stalled waiting for NDA signing to be completed (in linked task) [16:27:20] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281506|Disable FlaggedRevs on wikinews (T423577)]] (duration: 06m 54s) [16:27:23] T423577: Undeploy FlaggedRevs from Wikinews and drop FlaggedRevs tables - https://phabricator.wikimedia.org/T423577 [16:27:49] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11908967 (10Dzahn) a:03AnnieKim_WMDE [16:28:22] (03PS1) 10Eevans: echostore: Upgrade (prod) to Kask v1.0.19 (Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285861 (https://phabricator.wikimedia.org/T425308) [16:28:24] PROBLEM - Host ps1-b3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:28:32] PROBLEM - Host lsw1-b3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:30:34] (03CR) 10Dzahn: "a +1 from traffic would be nice - it's just about a sanity check that the IPs are what is in netbox though" [dns] - 10https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) 
(owner: 10Dzahn) [16:31:05] (03CR) 10Dzahn: "@dduvall I think you said we probably won't need this. Should I abandon?" [puppet] - 10https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:31:10] (03CR) 10Eevans: [C:03+2] echostore: Upgrade (prod) to Kask v1.0.19 (Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285861 (https://phabricator.wikimedia.org/T425308) (owner: 10Eevans) [16:31:18] (03CR) 10Ssingh: "Yes that's my bad, I just didn't get to it. Sorry. I will get to it today." [dns] - 10https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [16:31:28] (03Merged) 10jenkins-bot: Add TLS for the ratelimit namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285858 (https://phabricator.wikimedia.org/T414439) (owner: 10JMeybohm) [16:31:37] (03CR) 10Dzahn: "waiting for releng to check what other things (software) could be affected by this change" [puppet] - 10https://gerrit.wikimedia.org/r/1282395 (https://phabricator.wikimedia.org/T240266) (owner: 10Dzahn) [16:32:10] (03CR) 10Dzahn: "let's schedule the switch-over" [puppet] - 10https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [16:32:34] RECOVERY - Host ps1-b3-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.28 ms [16:32:34] RECOVERY - Host lsw1-b3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.99 ms [16:34:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909017 (10VRiley-WMF) @ssingh I have almost gotten it all the way through. However, it doesn't seem to take the reimage. It's seemingly is getting stuck at the raid. I tried to log... 
[16:34:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909018 (10VRiley-WMF) 05In progress→03Open [16:34:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:18] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.3.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285345 (owner: 10Santiago Faci) [16:36:38] PROBLEM - Host lsw1-b5-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:36:47] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:36:56] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:37:12] (03PS4) 10Dzahn: codesearch: create script/timer to delete zombie lock files [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) [16:37:16] PROBLEM - Host ps1-b5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:37:29] (03CR) 10Dzahn: "Done! changed the find command to "\( -name "index.lock" -o -name "shallow.lock" \)]"." [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn) [16:37:35] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:37:48] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.3.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285346 (https://phabricator.wikimedia.org/T424958) (owner: 10Santiago Faci) [16:37:53] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[16:37:55] (03PS3) 10HakanIST: Remove MinervaNightMode config after skin cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T415930) [16:38:33] (03CR) 10Dzahn: "@ssingh@wikimedia.org would also like to deploy this one some time" [puppet] - 10https://gerrit.wikimedia.org/r/1215329 (owner: 10Dzahn) [16:38:48] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [16:39:02] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:39:06] (03Merged) 10jenkins-bot: echostore: Upgrade (prod) to Kask v1.0.19 (Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285861 (https://phabricator.wikimedia.org/T425308) (owner: 10Eevans) [16:39:09] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285345 (owner: 10Santiago Faci) [16:39:12] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:39:19] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:39:37] (03CR) 10Dzahn: [C:03+1] "and this :)" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [16:40:11] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply [16:40:13] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285346 (https://phabricator.wikimedia.org/T424958) (owner: 10Santiago Faci) [16:40:14] (03CR) 10Dzahn: "thank you:) it has all week" [dns] - 10https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [16:41:18] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:41:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909067 (10ssingh) >>! 
In T421421#11909017, @VRiley-WMF wrote: > @ssingh I have almost gotten it all the way through. However, it doesn't seem to take the reimage. It's seemingly is... [16:41:34] RECOVERY - Host ps1-b5-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.76 ms [16:41:40] RECOVERY - Host lsw1-b5-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.04 ms [16:41:40] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:44:44] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:45:58] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:47:23] 10ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11909107 (10RobH) > Please Check. The port seems Up right now After replacing it. We found a 2m MTP in your rack. DO100 number mtp fiber. on switch ` et-0/0/50 Core: cr2-drmrs:et-0/0/2 {#D0103} em0... [16:48:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [16:49:28] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.10 ms [16:49:36] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.01 ms [16:49:59] 10ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11909125 (10cmooney) a:05RobH→03None It still shows no light incoming to asw1-b12-drmrs on lane 3: ` cmooney@asw1-b12-drmrs> show interfaces diagnostics optics xe-0/0/50:2 | match "Laser receiver power" | match dB... 
[16:50:19] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply [16:51:11] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply [16:51:48] (03PS1) 10Jdlrobson: Exclude sitesupport from button/icon treatment, remove manual styling [skins/Vector] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285864 (https://phabricator.wikimedia.org/T425721) [16:53:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [16:53:48] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285866 (https://phabricator.wikimedia.org/T128546) [16:56:55] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1700) [17:00:05] ryankemper: gettimeofday() says it's time for Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1700) [17:00:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1047.eqiad.wmnet with reason: Maintenance [17:00:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1047 (T419961)', diff saved to https://phabricator.wikimedia.org/P92460 and previous config saved to /var/cache/conftool/dbconfig/20260511-170024-fceratto.json [17:03:44] (03PS1) 10Elukey: sre.hosts.reimage: use ADMIN for redfish when reimaging Supermicro hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1285868 [17:05:00] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1006.eqiad.wmnet with OS trixie [17:05:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11909228 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host kafka-logging1006.eqiad.wmnet with OS tr... 
[17:06:23] (03PS1) 10Jgreen: Remove deprecated /etc/icinga/objects/nsca_frack.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1285870 (https://phabricator.wikimedia.org/T425424) [17:06:52] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [17:07:01] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [17:07:09] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [17:07:17] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [17:07:28] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [17:07:33] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [17:07:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047 (T419961)', diff saved to https://phabricator.wikimedia.org/P92461 and previous config saved to /var/cache/conftool/dbconfig/20260511-170739-fceratto.json [17:11:04] !log dancy@deploy1003 Installing scap version "4.263.0" for 2 host(s) [17:12:38] (03CR) 10Dduvall: "Sounds good to me. It's not needed for Zuul migration." 
[puppet] - 10https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:12:55] !log dancy@deploy1003 Installation of scap version "4.263.0" completed for 2 hosts [17:14:59] (03Abandoned) 10Dzahn: gerrit: allow zuul machines to port 22 ssh [puppet] - 10https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:15:25] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye [17:15:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909290 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye [17:17:44] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11909294 (10KFrancis) Hi all, the NDA is complete! Thanks! [17:17:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047', diff saved to https://phabricator.wikimedia.org/P92462 and previous config saved to /var/cache/conftool/dbconfig/20260511-171747-fceratto.json [17:17:59] (03PS3) 10Andrew Bogott: wmfkeystonehooks: double-check that we're adding a new user to a project [puppet] - 10https://gerrit.wikimedia.org/r/1285854 (https://phabricator.wikimedia.org/T379550) [17:25:08] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -8d 3h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [17:25:53] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [17:27:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047', diff saved to https://phabricator.wikimedia.org/P92463 and previous config saved to 
/var/cache/conftool/dbconfig/20260511-172756-fceratto.json [17:29:43] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1268] - vriley@cumin1003" [17:29:50] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1268] - vriley@cumin1003" [17:29:50] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:34:03] 10ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11909396 (10RobH) Ok, swapped the cr optic and it fixed it. Followups on ticket: * snap a photo of the defective optic with serial for me to process a repair/return * clarify if the 2M fiber they call temp is temp due to be... [17:34:39] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1268 [17:35:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1268 [17:36:18] (03PS1) 10Eevans: echostore: add missing restbase nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285874 [17:36:37] 10ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11909412 (10cmooney) All looks good with the link, traffic flowing again. {F80931115 width=600} Light good either side: ` cmooney@asw1-b12-drmrs> show interfaces diagnostics optics et-0/0/50 | except "warn|alarm"... 
[17:38:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047 (T419961)', diff saved to https://phabricator.wikimedia.org/P92464 and previous config saved to /var/cache/conftool/dbconfig/20260511-173804-fceratto.json [17:38:25] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1268.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:40:05] (03CR) 10Eevans: [C:03+2] echostore: add missing restbase nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285874 (owner: 10Eevans) [17:42:30] (03Merged) 10jenkins-bot: echostore: add missing restbase nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285874 (owner: 10Eevans) [17:43:16] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply [17:44:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:45:58] vriley@cumin1003 provision (PID 2669614) is awaiting input [17:47:32] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply [17:52:15] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1017.eqiad.wmnet with OS bullseye [17:52:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909506 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL*... 
[17:53:12] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye [17:53:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909511 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye [17:55:23] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1268.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:56:14] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1268.eqiad.wmnet with OS bookworm [17:56:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1268.eqiad.wmnet with OS bookworm [17:56:33] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1268.eqiad.wmnet with OS bookworm [17:56:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909517 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1268.eqiad.wmnet with OS bookworm executed with errors: - db1268 (**F... 
[18:00:53] vriley@cumin1003 reimage (PID 2684286) is awaiting input [18:07:10] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1268.eqiad.wmnet with OS bookworm [18:07:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909565 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1268.eqiad.wmnet with OS bookworm [18:11:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11909574 (10elukey) weird error while doing pxe: ` >>Checking Media Presence...... >>Media Present...... >>Start PXE over IPv4 on MAC: 90-5A-08-A4-D1... [18:12:03] !log roll restarting eventgate-main to pick up changes for T423583 [18:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:06] T423583: mediawiki.page_change.v1 event - Add revision revert details - https://phabricator.wikimedia.org/T423583 [18:12:10] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1006.eqiad.wmnet with OS trixie [18:12:19] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: sync [18:12:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11909578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host kafka-logging1006.eqiad.wmnet with OS trixie... 
[18:12:22] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [18:12:29] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [18:12:54] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [18:13:25] FIRING: [25x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:13:37] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [18:13:58] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [18:18:47] (03CR) 10SBassett: Enable CSPUseReportURIDirective in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058) (owner: 10SBassett) [18:18:54] (03CR) 10SBassett: [V:03+1 C:03+1] Enable CSPUseReportURIDirective in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058) (owner: 10SBassett) [18:22:31] (03CR) 10Ssingh: [C:03+1] add load balancer IPs for gitlab to geo DNS [dns] - 10https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [18:25:23] (03CR) 10Ssingh: "Thanks for the ping @dzahn@wikimedia.org. 
I see various +1s from the folks but no clear indication on if we have verified any recent match" [puppet] - 10https://gerrit.wikimedia.org/r/1215329 (owner: 10Dzahn) [18:25:44] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1017.eqiad.wmnet with OS bullseye [18:25:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909629 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL*... [18:26:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909630 (10ssingh) @VRiley-WMF: We may need to check this host; I can't seem to get it to come back up after a reboot (checked twice). Is there something else missing here? Perhaps... [18:31:26] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11909651 (10cmooney) >>! In T424683#11906733, @cmooney wrote: >>>! In T424683#11885878, @ayounsi wrote: >> Nice! >> >> We ca... [18:36:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909670 (10VRiley-WMF) [18:36:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909673 (10VRiley-WMF) checking, stand by [18:42:33] (03CR) 10Ssingh: [V:03+1] "Yeah this has been in the backlog for a while. I was hoping for some buy-in from the frontline-defense group, so I will try again." 
[puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [18:44:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909724 (10VRiley-WMF) Yes, it's getting stuck at the same spot I was getting stuck at. It looks like it's looking for a specific RAID. [18:47:35] (03PS13) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [18:47:54] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [18:49:09] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1268.eqiad.wmnet with reason: host reimage [18:50:45] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:54:28] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:54:42] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1268.eqiad.wmnet with reason: host reimage [18:55:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, 
wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:56:19] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye [18:56:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye [18:57:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:00:01] (03PS5) 10Cathal Mooney: gnmic: add subscriptions to openconfig subinterface path [puppet] - 10https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683) [19:00:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:00:56] (03CR) 10Cathal Mooney: "Thanks yep I missed the reply. Good call, I'd been testing against the interfaces with actual sub-ints, but didn't realise we'd get all t" [puppet] - 10https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683) (owner: 10Cathal Mooney) [19:01:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:04:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909801 (10Jclark-ctr) @ssingh you are booting with UEFI? 
the YAML file need to be updated for lvs1017 -partman/standard-efi.cfg -partman/raid1-2dev-efi.cfg [19:05:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:06:28] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:07:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909810 (10ssingh) >>! In T421421#11909801, @Jclark-ctr wrote: > @ssingh you are booting with UEFI? > > the YAML file need to be updated for lvs1017 > > -partman/standard-efi.cfg... [19:10:28] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:11:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:11:12] !log bking@archiva1002 `sudo rm -rfv /var/cache/archiva/temp* && sudo systemctl restart archiva`. 
to free up disk space [19:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:30] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:12:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909839 (10ssingh) ` Forced UEFI HTTP Boot for next reboot Resetting chassis power status for lvs1017 to ForceRestart Host rebooted via Redfish ` [19:12:39] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [19:14:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:14:30] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:14:34] (03CR) 10Dzahn: [C:03+2] add load balancer IPs for gitlab to geo DNS [dns] - 10https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [19:14:56] !log dzahn@dns1005 START - running authdns-update [19:15:28] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:15:45] vriley@cumin1003 reimage (PID 2684286) is awaiting input [19:16:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:16:43] !log dzahn@dns1005 END - running authdns-update [19:16:45] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [19:16:46] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1268.eqiad.wmnet with OS bookworm [19:16:57] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909845 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1268.eqiad.wmnet with OS bookworm completed: - db1268 (**PASS**) -... [19:17:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909846 (10VRiley-WMF) [19:18:47] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [19:19:06] (03CR) 10Dzahn: "which tool would you use for that?" [puppet] - 10https://gerrit.wikimedia.org/r/1215329 (owner: 10Dzahn) [19:19:24] (03CR) 10Dzahn: [C:03+1] "ah! cool, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [19:20:30] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:21:12] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11909858 (10Dzahn) a:05KFrancis→03None [19:21:50] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. 
for catherinekelsey - https://phabricator.wikimedia.org/T425566#11909859 (10Dzahn) a:03Dzahn [19:22:30] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:22:36] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1269] - vriley@cumin1003" [19:22:42] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1269] - vriley@cumin1003" [19:22:42] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:23:24] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1269 [19:24:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1269 [19:25:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:25:12] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1269.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:26:19] (03PS1) 10Dzahn: admin: add Catherine Kelsey of WMDE as ldap_only user [puppet] - 10https://gerrit.wikimedia.org/r/1285890 (https://phabricator.wikimedia.org/T425566) [19:26:23] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11909867 (10Dzahn) Thanks Katie! @catherine.kelsey.wmde Now we just need approval from one of the WMDE managers listed at https://wikitech.wikimedia.org/wi... 
[19:26:54] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11909871 (10Dzahn) NDA is completed. Please get one of the WMDE managers to approve (https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_re... [19:27:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:27:58] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11909873 (10Dzahn) [19:29:25] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11909880 (10Dzahn) Could you also provide an example of what tasks or tools this is actually intended for (for that open checkbox from the template... [19:29:47] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11909884 (10Dzahn) [19:30:04] (03CR) 10Dzahn: [C:04-1] "needs WMDE manager approval" [puppet] - 10https://gerrit.wikimedia.org/r/1285890 (https://phabricator.wikimedia.org/T425566) (owner: 10Dzahn) [19:30:30] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:31:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:34:13] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde & ldap/nda. 
for catherinekelsey - https://phabricator.wikimedia.org/T425566#11909905 (10Dzahn) a:05Dzahn→03catherine.kelsey.wmde [19:35:30] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:35:58] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [19:36:07] (03PS1) 10Muehlenhoff: Remove access for bvibber [puppet] - 10https://gerrit.wikimedia.org/r/1285893 [19:36:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:37:56] (03CR) 10Muehlenhoff: [C:03+2] Remove access for bvibber [puppet] - 10https://gerrit.wikimedia.org/r/1285893 (owner: 10Muehlenhoff) [19:39:08] !log [bking@cumin2002] ~$ sudo cumin 'A:wdqs-main and A:codfw' 'systemctl restart wdqs-blazegraph' <- restart after banning scraper [19:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:09] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bvibber out of all services on: 2453 hosts [19:43:39] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1269.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:44:32] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1017.eqiad.wmnet with OS bullseye [19:44:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL*... 
[19:54:22] (03CR) 10Zabe: [C:03+2] Start reading from new file tables on all small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285853 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [19:55:18] (03Merged) 10jenkins-bot: Start reading from new file tables on all small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285853 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [19:55:44] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1285853|Start reading from new file tables on all small and medium wikis (T416548)]] [19:55:48] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [19:57:27] !log zabe@deploy1003 zabe: Backport for [[gerrit:1285853|Start reading from new file tables on all small and medium wikis (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:58:32] !log zabe@deploy1003 zabe: Continuing with deployment [20:00:00] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1269.eqiad.wmnet with OS bookworm [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T2000). nyaa~ [20:00:05] Sergi0 and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1269.eqiad.wmnet with OS bookworm [20:02:41] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285853|Start reading from new file tables on all small and medium wikis (T416548)]] (duration: 06m 57s) [20:02:46] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [20:05:15] I'm late for the backport window, I might make it later in the hour, in about 30min. [20:05:31] My patch can wait for me until then [20:12:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11910009 (10VRiley-WMF) [20:15:36] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1269.eqiad.wmnet with reason: host reimage [20:16:41] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:19:34] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1269.eqiad.wmnet with reason: host reimage [20:20:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11910031 (10VRiley-WMF) [20:23:31] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:28:41] ok back at my computer, going to deploy the portal donor patch now. 
[20:30:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285866 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:30:58] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [20:31:52] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285866 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:32:06] !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1285866|Bumping portals to master (T128546)]] [20:32:09] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [20:32:58] PROBLEM - Confd vcl based reload on cp6016 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [20:33:48] !log jdrewniak@deploy1003 jdrewniak: Backport for [[gerrit:1285866|Bumping portals to master (T128546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:36:06] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:37:50] !log jdrewniak@deploy1003 jdrewniak: Continuing with deployment [20:39:12] vriley@cumin1003 reimage (PID 2773592) is awaiting input [20:39:33] (03PS1) 10Alex.sanford: Enforce 2FA requirements for phase 2 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285905 (https://phabricator.wikimedia.org/T423119) [20:41:49] (03PS1) 10Eevans: echostore: refactored egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285906 [20:41:57] !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285866|Bumping portals to master (T128546)]] (duration: 09m 51s) [20:42:00] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [20:42:13] (03PS1) 10Jdlrobson: Skin: Correct thumbnail class [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285907 (https://phabricator.wikimedia.org/T424910) [20:45:19] (03CR) 10Eevans: [C:03+2] echostore: refactored egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285906 (owner: 10Eevans) [20:46:03] i might have something to deploy in this current window - i think the queue is finished? 
[20:47:31] (Merged) jenkins-bot: echostore: refactored egress rules [deployment-charts] - https://gerrit.wikimedia.org/r/1285906 (owner: Eevans) [20:48:14] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285905 (https://phabricator.wikimedia.org/T423119) (owner: Alex.sanford) [20:48:25] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:48:26] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1269.eqiad.wmnet with OS bookworm [20:48:32] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11910142 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1269.eqiad.wmnet with OS bookworm completed: - db1269 (**PASS**) -... 
[20:49:49] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [20:53:30] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1270] - vriley@cumin1003" [20:53:35] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1270] - vriley@cumin1003" [20:53:35] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:53:40] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply [20:54:41] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1270.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:00:05] Reedy, sbassett, Maryum, and manfredi: That opportune time for a Weekly Security deployment window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T2100). [21:02:11] hi! I have a config patch to get out today that I'll use spiderpig for [21:02:17] and then a security patch to deploy [21:02:28] maryum: can i deploy something after you're done? [21:02:34] yes of course [21:02:41] cool - thanks! 
[21:03:10] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply [21:07:39] running the backport now [21:07:46] (03PS1) 10Eevans: echostore: rollback to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285910 [21:07:47] a config backport [21:07:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058) (owner: 10SBassett) [21:08:08] (03PS2) 10Eevans: echostore: rollback to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285910 [21:08:50] (03Merged) 10jenkins-bot: Enable CSPUseReportURIDirective in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058) (owner: 10SBassett) [21:09:07] !log mstyles@deploy1003 Started scap sync-world: Backport for [[gerrit:1284008|Enable CSPUseReportURIDirective in Wikimedia production (T424058)]] [21:09:10] T424058: Properly set the Reporting-Endpoints header and the report-to directive via MediaWiki's CSP implementation - https://phabricator.wikimedia.org/T424058 [21:10:50] !log mstyles@deploy1003 sbassett, mstyles: Backport for [[gerrit:1284008|Enable CSPUseReportURIDirective in Wikimedia production (T424058)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:10:58] (03CR) 10Eevans: [C:03+2] echostore: rollback to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285910 (owner: 10Eevans) [21:11:32] !log mstyles@deploy1003 sbassett, mstyles: Continuing with deployment [21:13:04] (03Merged) 10jenkins-bot: echostore: rollback to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285910 (owner: 10Eevans) [21:14:10] PROBLEM - SSH on an-druid1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:15:11] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1270.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:15:43] !log mstyles@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284008|Enable CSPUseReportURIDirective in Wikimedia production (T424058)]] (duration: 06m 36s) [21:15:46] T424058: Properly set the Reporting-Endpoints header and the report-to directive via MediaWiki's CSP implementation - https://phabricator.wikimedia.org/T424058 [21:16:21] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply [21:16:30] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply [21:16:46] now preparing to deploy the security patch [21:21:57] scap is running [21:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -8d 7h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [21:28:07] (03PS1) 10Anne Tomasevich: Add ReadingLists Account Creation CTA campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285913 (https://phabricator.wikimedia.org/T422169) [21:29:44] !log Deployed security fix for T425406 [21:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:54] cjming: you can go ahead with your deploy now [21:30:04] 
tysm! [21:30:29] (03PS1) 10Clare Ming: WikiLambdaApi instrument: update schema [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285916 (https://phabricator.wikimedia.org/T415254) [21:33:00] (03CR) 10Santiago Faci: [C:03+1] WikiLambdaApi instrument: update schema [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285916 (https://phabricator.wikimedia.org/T415254) (owner: 10Clare Ming) [21:35:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285916 (https://phabricator.wikimedia.org/T415254) (owner: 10Clare Ming) [21:36:00] RECOVERY - SSH on an-druid1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:36:20] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:37:20] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:37:59] (03Merged) 10jenkins-bot: WikiLambdaApi instrument: update schema [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285916 (https://phabricator.wikimedia.org/T415254) (owner: 10Clare Ming) [21:38:00] cjming let me know when you're done. I might need to rollback the security patch I just deployed [21:38:27] maryum: will do - should be done momentarily [21:39:01] uh oh [21:39:12] mine just err'd out [21:40:25] not sure what to do in this situation - it merged but https://spiderpig.wikimedia.org/jobs/1957 [21:41:30] should i revert? retry? 
[21:41:51] it's a super minor change [21:42:28] cjming: not an expert but that looks like it's failing because of the uncommitted file in /srv/patches -- needs maryum's attention maybe [21:42:40] oh that's my bad, committing that now [21:43:01] gtk [21:44:09] cjming just committed [21:44:29] thx - i guess i will retry [21:44:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:45:14] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1285916|WikiLambdaApi instrument: update schema (T415254)]] [21:45:18] T415254: Migrate "WikiLambda API" instrument to use the Test Kitchen SDK - https://phabricator.wikimedia.org/T415254 [21:47:00] !log cjming@deploy1003 cjming: Backport for [[gerrit:1285916|WikiLambdaApi instrument: update schema (T415254)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:47:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 21.37% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:47:32] !log cjming@deploy1003 cjming: Continuing with deployment [21:49:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int releases routed via main (k8s) 935.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:51:40] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285916|WikiLambdaApi instrument: update schema (T415254)]] (duration: 06m 26s) [21:51:44] T415254: Migrate "WikiLambda API" instrument to use the Test Kitchen SDK - https://phabricator.wikimedia.org/T415254 [21:51:55] maryum: back to you - all yours [21:52:03] cjming thanks! 
[21:52:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 20.23% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:54:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int releases routed via main (k8s) 837.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:58:02] (03CR) 10Bking: "The PCC failure is expected on cirrussearch2070, as it is a Bullseye node and Bullseye nodes are blocked from installing atop (for good re" [puppet] - 10https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [22:03:18] (03PS4) 10Ecarg: Wikifunctions: add helm values for function-evaluator in Rust [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) [22:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:11:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:13:40] 
FIRING: [25x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:23:53] (CR) Dzahn: "can not compile this for some reason: https://puppet-compiler.wmflabs.org/output/1285488/8539/" [puppet] - https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: Dzahn) [22:25:43] (CR) Dzahn: [V:-1] "nevermind, just typo in the hostname. here it goes: https://puppet-compiler.wmflabs.org/output/1285488/8540/codesearch9.codesearch.eqiad1" [puppet] - https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: Dzahn) [22:32:40] (PS5) Dzahn: codesearch: create script/timer to delete zombie lock files [puppet] - https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) [22:44:35] (PS6) Dzahn: codesearch: create script/timer to delete zombie lock files [puppet] - https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) [22:45:05] (CR) CI reject: [V:-1] codesearch: create script/timer to delete zombie lock files [puppet] - https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: Dzahn) [22:45:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.122) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:47:14] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1285488/8542/codesearch9.codesearch.eqiad1.wikimedia.cloud/index.html" [puppet] - https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: Dzahn) [22:47:54] (PS7) Dzahn: codesearch: create script/timer to delete zombie lock files [puppet] 
- https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) [22:51:00] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:51:15] (CR) Dzahn: [V:+1 C:+1] "https://puppet-compiler.wmflabs.org/output/1285488/8542/codesearch9.codesearch.eqiad1.wikimedia.cloud/index.html" [puppet] - https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: Dzahn) [22:58:31] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [23:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T2300) [23:00:25] Here. Let me know if there are any reasons not to use the readers deploy window.
[23:03:17] (PS1) Dduvall: zuul: Set mode of SSH private key to 0400 [puppet] - https://gerrit.wikimedia.org/r/1285923 [23:05:14] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/Vector] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285464 (https://phabricator.wikimedia.org/T424571) (owner: Jdlrobson) [23:10:13] (PS4) Jdlrobson: Remove MinervaNightMode config after skin cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T415930) (owner: HakanIST) [23:10:28] (CR) Jdlrobson: [C:+1] "(Note: I48a7c82bdad0e2697bea175e7a04846e5a8b2cf0 needs to be merged first and in production before we can backport this)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T415930) (owner: HakanIST) [23:10:38] (CR) CI reject: [V:-1] Remove MinervaNightMode config after skin cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T415930) (owner: HakanIST) [23:15:16] (PS1) Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) [23:15:46] (CR) CI reject: [V:-1] Presto memory tuning, resource groups [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: Aleksandar Mastilovic) [23:17:58] (Merged) jenkins-bot: Add support for icons in toolbox [skins/Vector] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285464 (https://phabricator.wikimedia.org/T424571) (owner: Jdlrobson) [23:18:16] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1285464|Add support for icons in toolbox (T424571)]] [23:18:20] T424571: Temporary watchstar status not reflected in dropdown: Add icon support for toolbox in Vector 2022 - https://phabricator.wikimedia.org/T424571 [23:19:56] !log
jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1285464|Add support for icons in toolbox (T424571)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:20:34] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment [23:21:32] (PS2) Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) [23:24:45] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285464|Add support for icons in toolbox (T424571)]] (duration: 06m 29s) [23:24:48] T424571: Temporary watchstar status not reflected in dropdown: Add icon support for toolbox in Vector 2022 - https://phabricator.wikimedia.org/T424571 [23:25:17] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/Vector] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285864 (https://phabricator.wikimedia.org/T425721) (owner: Jdlrobson) [23:38:35] (Merged) jenkins-bot: Exclude sitesupport from button/icon treatment, remove manual styling [skins/Vector] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285864 (https://phabricator.wikimedia.org/T425721) (owner: Jdlrobson) [23:38:51] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1285864|Exclude sitesupport from button/icon treatment, remove manual styling (T425721)]] [23:38:54] T425721: Revert the header donate button back to a normal link - https://phabricator.wikimedia.org/T425721 [23:39:48] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1285927 [23:39:48] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1285927 (owner: TrainBranchBot) [23:40:32] !log jdlrobson@deploy1003 jdlrobson: Backport for
[[gerrit:1285864|Exclude sitesupport from button/icon treatment, remove manual styling (T425721)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:41:05] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment [23:43:42] (PS7) Jdlrobson: Limit and standardize thumbnail options [mediawiki-config] - https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) [23:45:13] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285864|Exclude sitesupport from button/icon treatment, remove manual styling (T425721)]] (duration: 06m 21s) [23:45:16] T425721: Revert the header donate button back to a normal link - https://phabricator.wikimedia.org/T425721 [23:45:34] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285907 (https://phabricator.wikimedia.org/T424910) (owner: Jdlrobson) [23:50:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.122) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [23:52:16] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1285927 (owner: TrainBranchBot) [23:55:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.122) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [23:59:52] (Merged) jenkins-bot: Skin: Correct thumbnail class [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285907 (https://phabricator.wikimedia.org/T424910) (owner: Jdlrobson)