[00:02:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:02:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:05:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:11:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:14:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[00:16:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:17:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:18:17] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:20:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[00:21:13] <wikibugs>	 (PS2) Ottomata: EventStreamConfig - add mediawiki.user_change.dev0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1285525 (https://phabricator.wikimedia.org/T423952)
[00:23:17] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:23:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:26:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:26:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:28:17] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:29:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[00:31:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:32:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:33:17] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:35:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:36:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:36:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:38:17] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:39:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:40:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:42:38] <wikibugs>	 SRE, Acme-chief, Traffic, Patch-For-Review: Allow specifying a custom period of time before deploying a newly issued certificate - https://phabricator.wikimedia.org/T213737#11906132 (Krinkle)
[00:43:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:52:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:52:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:56:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:59:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:07:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:07:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:08:17] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:09:42] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1285531
[01:09:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1285531 (owner: TrainBranchBot)
[01:10:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:10:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:12:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:13:17] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:15:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:15:23] <jinxer-wm>	 FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 11h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[01:16:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:17:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:20:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:21:09] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1285531 (owner: TrainBranchBot)
[01:21:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:31:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:32:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:33:17] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:34:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:35:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[01:46:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:47:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:49:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:50:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:51:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:51:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:54:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:54:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:57:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:58:17] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:59:05] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:00:17] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:03:17] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:04:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:06:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:07:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:07:16] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 58s)
[02:08:17] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:09:22] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:10:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[02:11:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:12:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:15:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[02:15:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:26:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:27:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:28:17] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:30:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:31:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[02:31:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:33:17] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:34:22] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:36:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:37:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:40:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[02:40:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:45:13] <wikibugs>	 (CR) Danielyepezgarces: [C:+1] Enabling RSS extension for cowikimedia chapter (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1283048 (https://phabricator.wikimedia.org/T425440) (owner: Danielyepezgarces)
[02:47:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:50:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:51:06] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283048 (https://phabricator.wikimedia.org/T425440) (owner: Danielyepezgarces)
[02:56:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:58:17] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:59:05] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:59:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:03:03] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:06:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:08:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:09:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:13:17] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:16:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:18:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:21:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:21:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:26:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:27:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:30:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:30:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:33:17] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:37:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:37:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:40:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[03:40:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[03:41:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:41:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:44:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:44:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:47:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:47:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:50:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:51:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:56:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:57:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:00:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:00:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:03:03] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:12:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:12:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:13:17] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:15:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:19:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[04:22:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:23:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:25:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:27:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:33:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:38:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:41:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:43:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:44:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:46:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:50:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:51:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:53:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:54:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:56:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:00:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[05:01:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:02:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:05:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[05:05:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:06:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:06:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:09:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:09:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/2 (Core: asw1-b12-drmrs:et-0/0/50 {#D0103}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:10:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:10:45] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr2-drmrs (185.15.58.140) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:10:51] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b12-drmrs:et-0/0/50 (Core: cr2-drmrs:et-0/0/2 {#D0103}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b12-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:11:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:11:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:14:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[05:14:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:15:23] <jinxer-wm>	 FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 15h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[05:15:51] <jinxer-wm>	 RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b12-drmrs:et-0/0/50 (Core: cr2-drmrs:et-0/0/2 {#D0103}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b12-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:17:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:18:17] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:20:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:21:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:25:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:26:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:26:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:30:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:30:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:31:55] <wikibugs>	 (CR) Muehlenhoff: [C:+2] redis::master: Remove obsolete code only used for old ferm service [puppet] - https://gerrit.wikimedia.org/r/1282353 (https://phabricator.wikimedia.org/T419976) (owner: Muehlenhoff)
[05:34:49] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove ganeti5004 from eqsin cluster [puppet] - https://gerrit.wikimedia.org/r/1284665 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[05:36:44] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11906319 (MoritzMuehlenhoff)
[05:37:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:38:17] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:40:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:40:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:42:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:42:17] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:42:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:43:17] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:45:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:45:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:45:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:45:32] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#11906325 (MoritzMuehlenhoff)
[05:46:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:53:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5004.eqsin.wmnet with OS bookworm
[05:53:42] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11906328 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5004.eqsin.wmnet with OS bookworm
[05:56:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:57:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:57:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM bast6003.wikimedia.org
[05:57:58] <wikibugs>	 SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11906331 (ops-monitoring-bot) VM bast6003.wikimedia.org rebooted by jmm@cumin2002 with reason: None
[05:59:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:00:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:03:17] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:03:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM bast6003.wikimedia.org
[06:05:58] <wikibugs>	 SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11906333 (MoritzMuehlenhoff)
[06:07:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:07:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:08:17] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:08:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM acmechief2002.codfw.wmnet
[06:08:48] <wikibugs>	 SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11906334 (ops-monitoring-bot) VM acmechief2002.codfw.wmnet rebooted by jmm@cumin2002 with reason: None
[06:10:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:10:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:11:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:11:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:12:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM acmechief2002.codfw.wmnet
[06:14:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:15:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[06:16:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:19:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:22:34] <wikibugs>	 SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11906346 (MoritzMuehlenhoff)
[06:23:17] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:24:02] <wikibugs>	 SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11906347 (MoritzMuehlenhoff) Open→Resolved All VMs are updated to 2G RAM and we enforce 2G as the lower limit in sre.ganeti.makevm, so resolving this.
[06:25:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5004.eqsin.wmnet with reason: host reimage
[06:26:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:27:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:29:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5004.eqsin.wmnet with reason: host reimage
[06:29:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:30:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[06:31:15] <wikibugs>	 (CR) Muehlenhoff: Added cwilliams to users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1285368 (owner: CWilliams)
[06:32:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:35:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:36:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:38:17] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:39:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[06:41:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:41:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:42:54] <wikibugs>	 (PS5) Muehlenhoff: ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - https://gerrit.wikimedia.org/r/1283620
[06:44:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:44:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:45:14] <XioNoX>	 jayme, effie ^ not paging, but is something bad going on with WDQS in eqiad?
[06:47:55] <jayme>	 looks like it...
[06:50:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5004.eqsin.wmnet with OS bookworm
[06:51:00] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11906358 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5004.eqsin.wmnet with OS bookworm completed: - ganeti5004 (**PASS**)   - Dow...
[06:51:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:53:12] <wikibugs>	 (PS1) Muehlenhoff: use_linux612_on_bookworm: Bump kernel to 6.12.86 [puppet] - https://gerrit.wikimedia.org/r/1285537
[06:55:53] <wikibugs>	 (PS2) Muehlenhoff: use_linux612_on_bookworm: Bump kernel to 6.12.86 [puppet] - https://gerrit.wikimedia.org/r/1285537
[06:57:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:00:04] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T0700)
[07:00:04] <jouncebot>	 sfaci and dyepezg: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:22] <sfaci>	 o/
[07:01:29] <wikibugs>	 (PS1) Muehlenhoff: Add ganeti5004 to the routed Ganeti cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/1285538 (https://phabricator.wikimedia.org/T421863)
[07:02:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:02:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:04:37] <wikibugs>	 (CR) Brouberol: [C:+2] Add x_wmf_ratelimit_class and x_trusted_request to Turnilo [deployment-charts] - https://gerrit.wikimedia.org/r/1284876 (https://phabricator.wikimedia.org/T419736) (owner: Aleksandar Mastilovic)
[07:05:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:06:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:08:55] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.dns.wipe-cache zarcillo.discovery.wmnet on all recursors
[07:08:59] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zarcillo.discovery.wmnet on all recursors
[07:10:27] <sfaci>	 Is anyone around to run the deployment window?
[07:11:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:11:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:11:52] <jayme>	 btullis: wdqs seems to have been under constant high load since Thursday last week
[07:13:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:13:31] <wikibugs>	 (CR) Ayounsi: [C:+1] Add ganeti5004 to the routed Ganeti cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/1285538 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:15:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:15:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:17:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:20:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:20:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:21:00] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply
[07:21:25] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply
[07:22:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:23:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:23:59] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply
[07:24:21] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply
[07:25:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:26:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:27:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:27:17] <wikibugs>	 (CR) Brouberol: [C:+1] "This looks great, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1283621 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[07:27:28] <jayme>	 btullis, brouberol|ooo: ^ wdqs seems to have been under constant high load since Thursday last week
[07:27:54] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] logstash: add thanos-query-frontend filter [puppet] - https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: Tiziano Fogli)
[07:30:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:31:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:32:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:35:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:35:26] <wikibugs>	 (CR) Elukey: [C:+2] confluent::kafka: introduce the super-user-client.properties for Kafka 3 [puppet] - https://gerrit.wikimedia.org/r/1283621 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[07:35:49] <wikibugs>	 (PS1) Brouberol: flink: build the flink 2.0.0 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285540 (https://phabricator.wikimedia.org/T412978)
[07:36:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:36:57] <wikibugs>	 (PS2) Brouberol: flink: build the flink 2.0.0 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285540 (https://phabricator.wikimedia.org/T412978)
[07:37:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:37:43] <wikibugs>	 (CR) JMeybohm: wikikube: add wikikube-ctrl2006 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1249321 (https://phabricator.wikimedia.org/T406596) (owner: Jasmine)
[07:39:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:41:00] <wikibugs>	 (CR) JMeybohm: [C:+1] mediawiki-common: add rdb2011 and rdb2012 IPs [deployment-charts] - https://gerrit.wikimedia.org/r/1285336 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[07:42:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:42:42] <wikibugs>	 (CR) JMeybohm: [C:+2] ratelimit-media: Initial service deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1226814 (https://phabricator.wikimedia.org/T414439) (owner: Clément Goubert)
[07:44:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:45:10] <wikibugs>	 (CR) JavierMonton: [C:+1] flink: build the flink 2.0.0 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285540 (https://phabricator.wikimedia.org/T412978) (owner: Brouberol)
[07:45:18] <wikibugs>	 (Merged) jenkins-bot: ratelimit-media: Initial service deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1226814 (https://phabricator.wikimedia.org/T414439) (owner: Clément Goubert)
[07:46:19] <A_smart_kitten>	 jayme: FWIW, a potentially related Phab task was filed by a WDQS end-user on Saturday: T425861
[07:46:20] <stashbot>	 T425861: Wikidata SPARQL query performance regression: frequent 502-bad gateway errors - https://phabricator.wikimedia.org/T425861
[07:46:54] <wikibugs>	 (CR) Blake: [C:+1] mw: Remove references to rsyslogd image [deployment-charts] - https://gerrit.wikimedia.org/r/1282107 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[07:47:03] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply
[07:47:08] <wikibugs>	 (PS1) Jelto: miscweb: bump wmf-navigator images [deployment-charts] - https://gerrit.wikimedia.org/r/1285543 (https://phabricator.wikimedia.org/T414405)
[07:47:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:48:17] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:48:25] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:48:50] <wikibugs>	 (CR) Slyngshede: [C:+1] idp: migrate IDP to Redis 8 [puppet] - https://gerrit.wikimedia.org/r/1285324 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[07:48:54] <wikibugs>	 (CR) Slyngshede: [C:+2] idp: migrate IDP to Redis 8 [puppet] - https://gerrit.wikimedia.org/r/1285324 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[07:49:00] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:49:44] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[07:50:35] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[07:51:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:51:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:54:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:54:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add ganeti5004 to the routed Ganeti cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/1285538 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:55:33] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[07:55:54] <wikibugs>	 (PS1) JMeybohm: ratelimit-media: Fix missing unit in ratelimit config [deployment-charts] - https://gerrit.wikimedia.org/r/1285546 (https://phabricator.wikimedia.org/T414439)
[07:55:58] <wikibugs>	 (PS1) Slyngshede: IDP: Redis migration [dns] - https://gerrit.wikimedia.org/r/1285547 (https://phabricator.wikimedia.org/T419976)
[07:56:40] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[07:57:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:57:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:58:17] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:58:32] <wikibugs>	 (CR) JMeybohm: [C:+2] ratelimit-media: Fix missing unit in ratelimit config [deployment-charts] - https://gerrit.wikimedia.org/r/1285546 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[08:00:04] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:00:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[08:00:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:00:41] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old eqsin ganeti cluster VIP - ayounsi@cumin1003"
[08:00:47] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old eqsin ganeti cluster VIP - ayounsi@cumin1003"
[08:00:47] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:00:53] <wikibugs>	 (Merged) jenkins-bot: ratelimit-media: Fix missing unit in ratelimit config [deployment-charts] - https://gerrit.wikimedia.org/r/1285546 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[08:02:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:03:55] <wikibugs>	 (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1282325 (owner: L10n-bot)
[08:05:03] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply
[08:05:07] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[08:05:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:05:21] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [dns] - https://gerrit.wikimedia.org/r/1285547 (https://phabricator.wikimedia.org/T419976) (owner: Slyngshede)
[08:08:35] <wikibugs>	 (CR) Slyngshede: [C:+2] IDP: Redis migration [dns] - https://gerrit.wikimedia.org/r/1285547 (https://phabricator.wikimedia.org/T419976) (owner: Slyngshede)
[08:08:44] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[08:10:12] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[08:13:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:15:20] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254) (owner: Santiago Faci)
[08:15:47] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285352 (https://phabricator.wikimedia.org/T415254) (owner: Jforrester)
[08:15:58] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285406 (https://phabricator.wikimedia.org/T425785) (owner: Santiago Faci)
[08:16:19] <wikibugs>	 (CR) Jelto: [C:+2] miscweb: bump wmf-navigator images [deployment-charts] - https://gerrit.wikimedia.org/r/1285543 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[08:16:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet
[08:16:46] <wikibugs>	 (PS4) JMeybohm: ratelimit: Add ingress support [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439)
[08:16:46] <wikibugs>	 (PS3) JMeybohm: ratelimit-media: Enable ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1285401 (https://phabricator.wikimedia.org/T414439)
[08:17:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:18:57] <wikibugs>	 (Merged) jenkins-bot: miscweb: bump wmf-navigator images [deployment-charts] - https://gerrit.wikimedia.org/r/1285543 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[08:20:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[08:21:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:23:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:25:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:26:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:26:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:26:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet
[08:27:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:29:14] <wikibugs>	 (PS1) Elukey: confluent: add space in kafka's COMMAND_CONFIG_OPT [puppet] - https://gerrit.wikimedia.org/r/1285725
[08:30:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[08:30:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:30:33] <wikibugs>	 (CR) Elukey: [C:+2] confluent: add space in kafka's COMMAND_CONFIG_OPT [puppet] - https://gerrit.wikimedia.org/r/1285725 (owner: Elukey)
[08:31:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:35:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:36:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:36:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:38:17] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:39:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:39:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:40:44] <wikibugs>	 (PS1) Muehlenhoff: Blacklist more network protocols as defense in depth [puppet] - https://gerrit.wikimedia.org/r/1285727
[08:41:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5004.eqsin.wmnet to cluster eqsin02 and group 01
[08:41:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:41:44] <wikibugs>	 (PS3) Atsuko: Add auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (owner: Joal)
[08:42:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:42:28] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance
[08:42:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti5004.eqsin.wmnet to cluster eqsin02 and group 01
[08:42:36] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2218 (T419961)', diff saved to https://phabricator.wikimedia.org/P92450 and previous config saved to /var/cache/conftool/dbconfig/20260511-084236-fceratto.json
[08:43:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install5004.wikimedia.org to drbd
[08:43:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:43:54] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] rsyslog: forward thanos-query-frontend logs to kafka [puppet] - https://gerrit.wikimedia.org/r/1275799 (https://phabricator.wikimedia.org/T423986) (owner: Tiziano Fogli)
[08:44:02] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11906646 (MoritzMuehlenhoff)
[08:44:47] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11906651 (ops-monitoring-bot) VM install5004.wikimedia.org switching disk type to drbd
[08:45:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:45:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:45:44] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Enable for group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1285731 (https://phabricator.wikimedia.org/T425354)
[08:47:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:47:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:47:35] <wikibugs>	 ops-drmrs: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T418136#11906663 (phaultfinder)
[08:47:55] <jinxer-wm>	 FIRING: [2x] ProbeDown: Ripe Atlas anchor atlas5001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:48:37] <wikibugs>	 (CR) Ayounsi: [C:+1] QoS: Move DSCP AF41 from 'low' to 'normal' priority class [homer/public] - https://gerrit.wikimedia.org/r/1285350 (https://phabricator.wikimedia.org/T424640) (owner: Cathal Mooney)
[08:49:01] <wikibugs>	 (CR) Ayounsi: [C:+1] Blacklist more network protocols as defense in depth [puppet] - https://gerrit.wikimedia.org/r/1285727 (owner: Muehlenhoff)
[08:49:46] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T419961)', diff saved to https://phabricator.wikimedia.org/P92451 and previous config saved to /var/cache/conftool/dbconfig/20260511-084945-fceratto.json
[08:50:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:50:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:50:38] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[08:51:20] <wikibugs>	 (PS1) JMeybohm: gateway-check.lua: Route some LiftWing endpoints through the REST gateway. [puppet] - https://gerrit.wikimedia.org/r/1285732 (https://phabricator.wikimedia.org/T422804) (owner: Blake)
[08:51:46] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1283620 (owner: Muehlenhoff)
[08:52:36] <wikibugs>	 ops-esams, DC-Ops: Alert for device asw1-bw27-esams.mgmt.esams.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T425917 (phaultfinder) NEW
[08:53:09] <wikibugs>	 (PS1) Elukey: confluent: add KAFKA_BOOTSTRAP_TLS_SERVERS and use it in kafka3.sh [puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528)
[08:53:12] <wikibugs>	 (CR) Ayounsi: [C:+1] eqsin: remove sandbox ACL on now gone interface [homer/public] - https://gerrit.wikimedia.org/r/1275925 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[08:53:15] <wikibugs>	 (CR) Ayounsi: [C:+2] eqsin: remove sandbox ACL on now gone interface [homer/public] - https://gerrit.wikimedia.org/r/1275925 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[08:53:29] <wikibugs>	 (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[08:54:22] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job squid in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:54:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:55:23] <wikibugs>	 (Merged) jenkins-bot: eqsin: remove sandbox ACL on now gone interface [homer/public] - https://gerrit.wikimedia.org/r/1275925 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[08:55:45] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[08:55:45] <wikibugs>	 (CR) Blake: [C:+2] gateway-check.lua: Route some LiftWing endpoints through the REST gateway. [puppet] - https://gerrit.wikimedia.org/r/1285732 (https://phabricator.wikimedia.org/T422804) (owner: Blake)
[08:56:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:56:43] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1230: T419635
[08:56:46] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[08:57:18] <wikibugs>	 (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1284616 (owner: L10n-bot)
[08:57:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:58:17] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:58:29] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[08:58:47] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[08:59:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:59:54] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P92453 and previous config saved to /var/cache/conftool/dbconfig/20260511-085954-fceratto.json
[09:02:49] <wikibugs>	 (CR) JMeybohm: ratelimit: Add ingress support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[09:04:12] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11906723 (SLyngshede-WMF) {F80836950}  Memory is also having issues
[09:04:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install5004.wikimedia.org to drbd
[09:04:32] <icinga-wm>	 PROBLEM - Host install5004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:04:38] <icinga-wm>	 RECOVERY - Host install5004 is UP: PING OK - Packet loss = 0%, RTA = 235.88 ms
[09:05:23] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11906724 (SLyngshede-WMF) Puppet currently fails to run and hasn't done so since May 10th, at 21:23
[09:05:25] <wikibugs>	 (CR) Ayounsi: "In case it got missed, I left a comment on the task: https://phabricator.wikimedia.org/T424683#11885878" [puppet] - https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683) (owner: Cathal Mooney)
[09:06:03] <wikibugs>	 (CR) Brouberol: [C:+2] flink: build the flink 2.0.0 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285540 (https://phabricator.wikimedia.org/T412978) (owner: Brouberol)
[09:06:05] <wikibugs>	 (CR) Brouberol: [V:+2 C:+2] flink: build the flink 2.0.0 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285540 (https://phabricator.wikimedia.org/T412978) (owner: Brouberol)
[09:06:44] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:07:14] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:07:17] <wikibugs>	 (PS2) Elukey: confluent: add KAFKA_BOOTSTRAP_TLS_SERVERS and use it in kafka3.sh [puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528)
[09:07:55] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas5001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:07:58] <wikibugs>	 (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[09:08:17] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:08:41] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[09:09:00] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[09:09:22] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job squid in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:09:59] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11906733 (cmooney) >>! In T424683#11885878, @ayounsi wrote: > Nice! >  > We can also filter out the `.16386`, `.16384`, `.1...
[09:10:02] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P92454 and previous config saved to /var/cache/conftool/dbconfig/20260511-091001-fceratto.json
[09:10:06] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/2 (Core: asw1-b12-drmrs:et-0/0/50 {#D0103}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:10:45] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr2-drmrs (185.15.58.140) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:11:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:12:21] <wikibugs>	 (CR) Brouberol: [C:+1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1285537 (owner: Muehlenhoff)
[09:12:22] <kostajh>	 jouncebot: nowandnext
[09:12:22] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 47 minute(s)
[09:12:22] <jouncebot>	 In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1000)
[09:12:54] <wikibugs>	 (CR) Brouberol: [C:+1] "Beautiful, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[09:13:17] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:13:49] <wikibugs>	 (CR) Elukey: [C:+1] ratelimit: Add ingress support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[09:13:51] <wikibugs>	 (CR) JMeybohm: [C:+1] confluent: add KAFKA_BOOTSTRAP_TLS_SERVERS and use it in kafka3.sh [puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[09:14:26] <wikibugs>	 (CR) Elukey: [C:+2] confluent: add KAFKA_BOOTSTRAP_TLS_SERVERS and use it in kafka3.sh [puppet] - https://gerrit.wikimedia.org/r/1285733 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[09:14:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:15:23] <jinxer-wm>	 FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 19h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[09:15:30] <wikibugs>	 (PS5) JMeybohm: ratelimit: Add ingress support [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439)
[09:15:30] <wikibugs>	 (PS4) JMeybohm: ratelimit-media: Enable ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1285401 (https://phabricator.wikimedia.org/T414439)
[09:15:37] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11906740 (catherine.kelsey.wmde) Thank you both! I've now signed the NDA :)
[09:16:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:17:57] <wikibugs>	 (PS3) A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355)
[09:18:02] <wikibugs>	 (CR) JMeybohm: ratelimit: Add ingress support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[09:18:17] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:19:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:19:18] <wikibugs>	 (CR) A-pizzata: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[09:19:22] <wikibugs>	 (CR) A-pizzata: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[09:19:59] <wikibugs>	 (CR) CI reject: [V:-1] changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[09:20:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T419961)', diff saved to https://phabricator.wikimedia.org/P92456 and previous config saved to /var/cache/conftool/dbconfig/20260511-092010-fceratto.json
[09:20:31] <wikibugs>	 (CR) JMeybohm: [C:+2] ratelimit: Add ingress support [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[09:20:45] <wikibugs>	 (PS4) A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355)
[09:21:09] <wikibugs>	 (CR) A-pizzata: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[09:21:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:22:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:22:48] <wikibugs>	 (Merged) jenkins-bot: ratelimit: Add ingress support [deployment-charts] - https://gerrit.wikimedia.org/r/1285400 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[09:24:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:24:47] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:25:25] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:26:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:26:38] <icinga-wm>	 PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
[09:27:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:27:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:28:17] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:29:28] <icinga-wm>	 RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Thanos
[09:30:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:30:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:31:34] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:31:59] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:33:17] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:36:44] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:37:11] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:38:17] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:38:31] <wikibugs>	 ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921 (ayounsi) NEW p:Triage→High
[09:38:57] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285731 (https://phabricator.wikimedia.org/T425354) (owner: Kosta Harlan)
[09:40:09] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:40:23] <wikibugs>	 (Merged) jenkins-bot: hCaptcha: Enable for group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1285731 (https://phabricator.wikimedia.org/T425354) (owner: Kosta Harlan)
[09:40:35] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:41:34] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1285731|hCaptcha: Enable for group0 wikis (T425354)]]
[09:41:38] <stashbot>	 T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354
[09:42:09] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1230: T419635
[09:42:12] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:45:21] <wikibugs>	 (CR) Majavah: [C:+2] P:openstack: nova: Set MTU on flat VLAN interface in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1284675 (https://phabricator.wikimedia.org/T425674) (owner: Majavah)
[09:45:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:46:07] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[09:46:36] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[09:48:17] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:48:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:48:30] <wikibugs>	 (PS1) Elukey: admin: add spare Yubikey public key and remove the old one [puppet] - https://gerrit.wikimedia.org/r/1285738
[09:48:45] <wikibugs>	 (PS3) Btullis: Add conda-analytics-next to the Hadoop test cluster nodes [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057)
[09:49:46] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[09:49:49] <wikibugs>	 (CR) Elukey: "Sent the confirmation via Slack to Moritz" [puppet] - https://gerrit.wikimedia.org/r/1285738 (owner: Elukey)
[09:50:27] <wikibugs>	 (CR) Btullis: Add conda-analytics-next to the Hadoop test cluster nodes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[09:51:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:51:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:52:41] <wikibugs>	 (PS4) Atsuko: Add auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[09:52:43] <wikibugs>	 (PS1) Atsuko: Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763)
[09:52:55] <wikibugs>	 ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11906844 (cmooney) Very weird, box still sees the optic but thinks interface is invalid. ` cmooney@asw1-b12-drmrs> show chassis pic fpc-slot 0 pic-slot 0 | match "^  50"    50   40GBASE SR4       MM    FS                 Q...
[09:52:58] <wikibugs>	 (CR) Ayounsi: [C:+1] "lgtm! I agree the dynamic prefix-list are best, we can revisit it once your PR is merged in Aerleon" [homer/public] - https://gerrit.wikimedia.org/r/1285362 (https://phabricator.wikimedia.org/T425703) (owner: Cathal Mooney)
[09:54:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:56:08] <wikibugs>	 (PS1) Btullis: Add the spark 3.5.8 shuffler to the prod hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/1285740 (https://phabricator.wikimedia.org/T338057)
[09:57:06] <logmsgbot>	 !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs2012.codfw.wmnet with reason: Hardware failure
[09:57:10] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11906855 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2035ffcb-d2a2-45d6-871f-0db385ff6b6e) set by slyngshede@cumin1003 for 4:00:00 on 1 host(s) and their services with reason: Hardware f...
[09:57:51] <logmsgbot>	 !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on lvs2012.codfw.wmnet with reason: Hardware failure
[09:57:57] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11906858 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a1a96c21-2484-4d03-b4f4-1baa89a1745e) set by slyngshede@cumin1003 for 30 days, 0:00:00 on 1 host(s) and their services with reason: H...
[09:57:58] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[09:58:15] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1285731|hCaptcha: Enable for group0 wikis (T425354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:58:15] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[09:58:17] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:58:18] <stashbot>	 T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354
[09:58:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:58:33] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[09:58:43] <wikibugs>	 (PS4) Btullis: Add conda-analytics-next to the Hadoop test cluster nodes [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057)
[09:58:57] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[09:59:04] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285740 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[09:59:37] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with deployment
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1000)
[10:01:04] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:01:09] <logmsgbot>	 !log fceratto@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[10:01:23] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[10:01:45] <wikibugs>	 (PS1) Filippo Giunchedi: wmcs: update filippo cloudvps root key [puppet] - https://gerrit.wikimedia.org/r/1285742
[10:01:55] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[10:04:06] <wikibugs>	 (PS1) Sergio Gimeno: loggedOutWarning: set lastEditor used earlier [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285743 (https://phabricator.wikimedia.org/T425604)
[10:04:18] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285743 (https://phabricator.wikimedia.org/T425604) (owner: Sergio Gimeno)
[10:04:47] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[10:06:12] <wikibugs>	 (PS5) Atsuko: Add auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[10:06:12] <wikibugs>	 (PS2) Atsuko: Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763)
[10:06:22] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[10:10:41] <moritzm>	 !log rebalance routed Ganeti cluster in eqsin T421863
[10:10:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:44] <stashbot>	 T421863: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863
[10:11:17] <wikibugs>	 (PS1) Ayounsi: Icinga: Add Nokia icon to more Nokia switches mgmt [puppet] - https://gerrit.wikimedia.org/r/1285745
[10:11:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:11:49] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285731|hCaptcha: Enable for group0 wikis (T425354)]] (duration: 30m 15s)
[10:11:54] <stashbot>	 T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354
[10:11:56] <wikibugs>	 (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1285401 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[10:12:13] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1282350 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[10:12:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:12:30] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285745 (owner: Ayounsi)
[10:12:57] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply
[10:13:08] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[10:13:16] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply
[10:13:17] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:13:33] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[10:13:45] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply
[10:14:23] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[10:14:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:14:45] <wikibugs>	 ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11906886 (cmooney) @RobH can you raise a task with Digital Realty to take a look at this in MRS2?  The link in question is [[ https://netbox.wikimedia.org/dcim/interfaces/21198/trace/ | this one ]].  It's a pink MTP cable...
[10:15:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:15:34] <wikibugs>	 (CR) JMeybohm: [C:+2] Test updated rsyslog image on mw-experimental and mw-web canary [deployment-charts] - https://gerrit.wikimedia.org/r/1280324 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:15:39] <wikibugs>	 (CR) JMeybohm: [C:+2] mw: Remove references to rsyslogd image [deployment-charts] - https://gerrit.wikimedia.org/r/1282107 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:16:13] <wikibugs>	 (CR) Brouberol: Add auth_proxy.httpd_cas module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[10:16:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:16:34] <slyngs>	 !log Migration of lvs2012 due to hardware issues
[10:16:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:00] <wikibugs>	 (PS1) Btullis: dse-k8s: Switch the services to use IPIP load-balancing [puppet] - https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437)
[10:17:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:17:48] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437) (owner: Btullis)
[10:18:08] <wikibugs>	 (PS2) Ayounsi: Icinga: Add Nokia icon to more Nokia switches mgmt [puppet] - https://gerrit.wikimedia.org/r/1285745
[10:19:00] <wikibugs>	 (Merged) jenkins-bot: mw: Remove references to rsyslogd image [deployment-charts] - https://gerrit.wikimedia.org/r/1282107 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:19:01] <wikibugs>	 (PS5) Btullis: Add conda-analytics-next to the Hadoop test cluster nodes [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057)
[10:19:15] <wikibugs>	 (Merged) jenkins-bot: Test updated rsyslog image on mw-experimental and mw-web canary [deployment-charts] - https://gerrit.wikimedia.org/r/1280324 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:19:38] <wikibugs>	 (PS3) Muehlenhoff: tlsproxy::envoy: Bump default now that services have moved [puppet] - https://gerrit.wikimedia.org/r/1279340 (https://phabricator.wikimedia.org/T420993)
[10:20:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:21:05] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply
[10:21:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:21:40] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply
[10:21:57] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply
[10:22:17] <wikibugs>	 (PS2) Btullis: dse-k8s: Switch the services to use IPIP load-balancing [puppet] - https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437)
[10:22:24] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437) (owner: Btullis)
[10:22:33] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply
[10:22:44] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[10:22:49] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285745 (owner: Ayounsi)
[10:23:21] <logmsgbot>	 !log jayme@deploy1003 Started scap sync-world: update rsyslog image
[10:25:47] <wikibugs>	 (PS1) JMeybohm: Revert "Add ratelimit-media namespace to wikikube" [deployment-charts] - https://gerrit.wikimedia.org/r/1285749 (https://phabricator.wikimedia.org/T414439)
[10:25:57] <wikibugs>	 (PS1) JMeybohm: Revert "wikikube: Add ratelimit-media namespace" [puppet] - https://gerrit.wikimedia.org/r/1285750 (https://phabricator.wikimedia.org/T414439)
[10:26:18] <wikibugs>	 (CR) CI reject: [V:-1] Revert "wikikube: Add ratelimit-media namespace" [puppet] - https://gerrit.wikimedia.org/r/1285750 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[10:26:24] <logmsgbot>	 !log jayme@deploy1003 Finished scap sync-world: update rsyslog image (duration: 03m 48s)
[10:32:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:32:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:35:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:35:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:36:36] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1285738 (owner: Elukey)
[10:37:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:37:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:38:22] <wikibugs>	 (PS1) JMeybohm: Revert "Test updated rsyslog image on mw-experimental and mw-web canary" [deployment-charts] - https://gerrit.wikimedia.org/r/1285751 (https://phabricator.wikimedia.org/T418200)
[10:39:44] <wikibugs>	 (CR) Blake: [C:+1] Revert "Test updated rsyslog image on mw-experimental and mw-web canary" [deployment-charts] - https://gerrit.wikimedia.org/r/1285751 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:40:02] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Nokia: adjust cpm filters to restrict BGP connections to our ranges [homer/public] - https://gerrit.wikimedia.org/r/1285362 (https://phabricator.wikimedia.org/T425703) (owner: Cathal Mooney)
[10:40:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:40:40] <wikibugs>	 (CR) Ayounsi: [C:+1] "one nit overall lgtm!" [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813) (owner: Cathal Mooney)
[10:40:43] <wikibugs>	 (PS1) Majavah: P:openstack: nova: Fix hiera key [puppet] - https://gerrit.wikimedia.org/r/1285752
[10:40:43] <wikibugs>	 (PS1) Majavah: P:openstack: neutron: Set MTU on cloudnet codfw1dev interfaces [puppet] - https://gerrit.wikimedia.org/r/1285753 (https://phabricator.wikimedia.org/T425674)
[10:41:25] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[10:41:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:41:28] <wikibugs>	 (CR) Ayounsi: [C:+2] Icinga: Add Nokia icon to more Nokia switches mgmt [puppet] - https://gerrit.wikimedia.org/r/1285745 (owner: Ayounsi)
[10:42:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:43:14] <wikibugs>	 (PS1) Majavah: admin: Remove cgoubert ssh keys [puppet] - https://gerrit.wikimedia.org/r/1285754
[10:43:17] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:44:09] <wikibugs>	 (CR) Majavah: [C:+2] admin: Remove cgoubert ssh keys [puppet] - https://gerrit.wikimedia.org/r/1285754 (owner: Majavah)
[10:44:23] <wikibugs>	 (Merged) jenkins-bot: Nokia: adjust cpm filters to restrict BGP connections to our ranges [homer/public] - https://gerrit.wikimedia.org/r/1285362 (https://phabricator.wikimedia.org/T425703) (owner: Cathal Mooney)
[10:44:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:44:55] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1279340 (https://phabricator.wikimedia.org/T420993) (owner: Muehlenhoff)
[10:44:56] <wikibugs>	 (PS2) Hnowlan: prometheus, thanos: move recording rule [puppet] - https://gerrit.wikimedia.org/r/1270480 (https://phabricator.wikimedia.org/T249663)
[10:46:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:46:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:48:04] <wikibugs>	 (CR) Trueg: openjdk-25-jre/openjdk-25-jdk (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[10:49:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:49:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:50:28] <wikibugs>	 (PS4) Cathal Mooney: Nokia: add module to enable BFD on interfaces that need it [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813)
[10:50:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:51:25] <wikibugs>	 (CR) Btullis: [C:+1] openjdk-25-jre/openjdk-25-jdk (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[10:51:50] <wikibugs>	 (CR) CI reject: [V:-1] Nokia: add module to enable BFD on interfaces that need it [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813) (owner: Cathal Mooney)
[10:52:10] <logmsgbot>	 !log taavi@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Clément Goubert out of all services on: 2459 hosts
[10:53:17] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:53:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:53:40] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good. I'll report this to the Debian Java maintainers, so that we eventually get it properly fixed." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[10:55:46] <wikibugs>	 (CR) Btullis: [V:+2 C:+2] openjdk-25-jre/openjdk-25-jdk [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[10:57:03] <wikibugs>	 (PS2) Majavah: P:openstack: nova: Fix hiera key [puppet] - https://gerrit.wikimedia.org/r/1285752
[10:57:03] <wikibugs>	 (PS2) Majavah: P:openstack: neutron: Set MTU on cloudnet codfw1dev interfaces [puppet] - https://gerrit.wikimedia.org/r/1285753 (https://phabricator.wikimedia.org/T425674)
[10:57:19] <wikibugs>	 (CR) JMeybohm: [C:+2] Bump default rsyslog container version to 8.2504.0-1 [puppet] - https://gerrit.wikimedia.org/r/1280317 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:57:27] <wikibugs>	 (CR) JMeybohm: [C:+2] Revert "Test updated rsyslog image on mw-experimental and mw-web canary" [deployment-charts] - https://gerrit.wikimedia.org/r/1285751 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:57:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:58:09] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1285753 (https://phabricator.wikimedia.org/T425674) (owner: Majavah)
[10:59:18] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Actually, noted one more issue: Please adapt the versions to reflect current JRE releases, that way we can see in our container tooling wh" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[10:59:47] <jayme>	 !log upgrading rsyslog to 8.2504.0-1 in all mediawiki deployments - T418200
[10:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:53] <wikibugs>	 (Merged) jenkins-bot: Revert "Test updated rsyslog image on mw-experimental and mw-web canary" [deployment-charts] - https://gerrit.wikimedia.org/r/1285751 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[10:59:55] <stashbot>	 T418200: Migrate Service Ops Docker images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200
[11:00:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:00:47] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[11:01:05] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:03:21] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[11:05:46] <wikibugs>	 (CR) Muehlenhoff: [C:+2] use_linux612_on_bookworm: Bump kernel to 6.12.86 [puppet] - https://gerrit.wikimedia.org/r/1285537 (owner: Muehlenhoff)
[11:08:32] <logmsgbot>	 !log jayme@deploy1003 Started scap sync-world: upgrade rsyslog on all deployments T418200
[11:08:35] <stashbot>	 T418200: Migrate Service Ops Docker images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200
[11:11:20] <wikibugs>	 (CR) JMeybohm: [C:+2] ratelimit-media: Enable ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1285401 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[11:11:59] <wikibugs>	 (CR) JMeybohm: [C:+2] Revert "Add ratelimit-media namespace to wikikube" [deployment-charts] - https://gerrit.wikimedia.org/r/1285749 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[11:12:08] <wikibugs>	 (PS1) Cathal Mooney: py3-style: allow lines up to 120 chars in length [homer/public] - https://gerrit.wikimedia.org/r/1285756
[11:14:30] <wikibugs>	 (PS1) Trueg: Fixed changelog version number. [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285757 (https://phabricator.wikimedia.org/T425636)
[11:15:11] <wikibugs>	 (CR) Muehlenhoff: [C:+2] tlsproxy::envoy: Bump default now that services have moved [puppet] - https://gerrit.wikimedia.org/r/1279340 (https://phabricator.wikimedia.org/T420993) (owner: Muehlenhoff)
[11:16:19] <wikibugs>	 (PS2) JMeybohm: Revert "wikikube: Add ratelimit-media namespace" [puppet] - https://gerrit.wikimedia.org/r/1285750 (https://phabricator.wikimedia.org/T414439)
[11:16:22] <wikibugs>	 (Merged) jenkins-bot: ratelimit-media: Enable ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1285401 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[11:16:37] <wikibugs>	 (CR) Atsuko: Add auth_proxy.httpd_cas module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[11:17:41] <wikibugs>	 (CR) Btullis: [V:+2 C:+2] openjdk-25-jre/openjdk-25-jdk (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[11:18:02] <wikibugs>	 (CR) Ayounsi: [C:+1] "much better indeed!!" [homer/public] - https://gerrit.wikimedia.org/r/1285756 (owner: Cathal Mooney)
[11:18:47] <wikibugs>	 (CR) Majavah: [C:+2] P:openstack: nova: Fix hiera key [puppet] - https://gerrit.wikimedia.org/r/1285752 (owner: Majavah)
[11:18:57] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:openstack: neutron: Set MTU on cloudnet codfw1dev interfaces [puppet] - https://gerrit.wikimedia.org/r/1285753 (https://phabricator.wikimedia.org/T425674) (owner: Majavah)
[11:19:24] <wikibugs>	 (CR) Atsuko: Add auth_proxy.httpd_cas module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[11:19:28] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good, thanks" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285757 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[11:19:52] <wikibugs>	 SRE, SRE-Access-Requests: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930 (CWilliams-WMF) NEW
[11:20:11] <wikibugs>	 (CR) Btullis: [C:+2] Fixed changelog version number. [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285757 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[11:20:14] <wikibugs>	 (CR) Btullis: [V:+2 C:+2] Fixed changelog version number. [docker-images/production-images] - https://gerrit.wikimedia.org/r/1285757 (https://phabricator.wikimedia.org/T425636) (owner: Trueg)
[11:21:13] <logmsgbot>	 !log jayme@deploy1003 Rolling back deployment
[11:21:14] <logmsgbot>	 !log jayme@deploy1003 Finished scap sync-world: upgrade rsyslog on all deployments T418200 (duration: 13m 28s)
[11:21:18] <stashbot>	 T418200: Migrate Service Ops Docker images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200
[11:22:46] <wikibugs>	 (Merged) jenkins-bot: Revert "Add ratelimit-media namespace to wikikube" [deployment-charts] - https://gerrit.wikimedia.org/r/1285749 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[11:25:37] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:25:57] <wikibugs>	 (CR) Joal: [C:+1] Add the spark 3.5.8 shuffler to the prod hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/1285740 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[11:26:37] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:26:38] <wikibugs>	 (CR) Blake: [C:+1] changeprop: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285342 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:26:44] <wikibugs>	 (CR) Btullis: [C:+2] Add the spark 3.5.8 shuffler to the prod hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/1285740 (https://phabricator.wikimedia.org/T338057) (owner: Btullis)
[11:26:53] <wikibugs>	 (CR) Blake: [C:+1] api-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285340 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:27:08] <wikibugs>	 (CR) Blake: [C:+1] changeprop-jobqueue: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285341 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:27:10] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[11:27:22] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[11:27:29] <wikibugs>	 (CR) Blake: [C:+1] ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285343 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:27:56] <wikibugs>	 (CR) Blake: [C:+1] redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:28:20] <wikibugs>	 (CR) Blake: [C:+1] mediawiki-common: add rdb2011 and rdb2012 IPs [deployment-charts] - https://gerrit.wikimedia.org/r/1285336 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:31:29] <wikibugs>	 (CR) Cathal Mooney: [C:+2] py3-style: allow lines up to 120 chars in length [homer/public] - https://gerrit.wikimedia.org/r/1285756 (owner: Cathal Mooney)
[11:31:53] <moritzm>	 !log installing Linux 6.12.86 on Trixie hosts
[11:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:29] <wikibugs>	 (PS1) Majavah: P:openstack: neutron: Set MTU on cloudnet eqiad1 VLAN interfaces [puppet] - https://gerrit.wikimedia.org/r/1285759 (https://phabricator.wikimedia.org/T425674)
[11:33:18] <wikibugs>	 (PS1) JMeybohm: Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200)
[11:33:56] <wikibugs>	 (CR) CI reject: [V:-1] Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[11:34:54] <wikibugs>	 (CR) Blake: [C:+1] Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[11:35:52] <wikibugs>	 (Merged) jenkins-bot: py3-style: allow lines up to 120 chars in length [homer/public] - https://gerrit.wikimedia.org/r/1285756 (owner: Cathal Mooney)
[11:35:56] <wikibugs>	 (PS2) JMeybohm: Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200)
[11:36:21] <wikibugs>	 (PS5) Cathal Mooney: Nokia: add module to enable BFD on interfaces that need it [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813)
[11:36:26] <wikibugs>	 (CR) CI reject: [V:-1] Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[11:38:17] <wikibugs>	 (PS6) Cathal Mooney: Nokia: add module to enable BFD on interfaces that need it [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813)
[11:39:17] <wikibugs>	 (PS3) JMeybohm: Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200)
[11:39:57] <wikibugs>	 (CR) JMeybohm: [C:+2] Revert "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285760 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[11:40:24] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Nokia: add module to enable BFD on interfaces that need it (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813) (owner: Cathal Mooney)
[11:41:51] <wikibugs>	 (Merged) jenkins-bot: Nokia: add module to enable BFD on interfaces that need it [homer/public] - https://gerrit.wikimedia.org/r/1285421 (https://phabricator.wikimedia.org/T425813) (owner: Cathal Mooney)
[11:42:59] <wikibugs>	 (CR) Daniel Kinzler: [C:+1] changeprop: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285342 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:43:17] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:43:22] <wikibugs>	 (CR) Daniel Kinzler: [C:+1] api-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285340 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:43:41] <wikibugs>	 (CR) Daniel Kinzler: [C:+1] ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285343 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[11:46:17] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:46:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:48:15] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2185.codfw.wmnet with reason: Reboot
[11:48:17] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:51:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:57] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add conda-analytics-next to the Hadoop test cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis)
[11:52:10] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: list=globalusers: Avoid querying group permissions with empty group list [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859)
[11:52:29] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "Done: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1285761" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: 10Bartosz Dziewoński)
[11:52:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859) (owner: 10Bartosz Dziewoński)
[11:54:15] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] dse-k8s: Switch the services to use IPIP load-balancing [puppet] - 10https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis)
[11:54:40] <wikibugs>	 (03PS1) 10Effie Mouzeli: site.pp: make mc1055 a memcached server [puppet] - 10https://gerrit.wikimedia.org/r/1285762 (https://phabricator.wikimedia.org/T418263)
[11:55:14] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add conda-analytics-next to the Hadoop test cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/1285361 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis)
[11:56:16] <wikibugs>	 (03PS2) 10Effie Mouzeli: site.pp: make mc1055 a memcached server [puppet] - 10https://gerrit.wikimedia.org/r/1285762 (https://phabricator.wikimedia.org/T418263)
[11:56:37] <wikibugs>	 (03CR) 10Blake: [C:03+1] site.pp: make mc1055 a memcached server [puppet] - 10https://gerrit.wikimedia.org/r/1285762 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli)
[11:58:14] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] site.pp: make mc1055 a memcached server [puppet] - 10https://gerrit.wikimedia.org/r/1285762 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli)
[11:58:56] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Remove unused 'writeapi' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106
[11:59:10] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386)
[12:00:15] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: API: Introduce list=globalusers [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752)
[12:00:21] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: list=globalusers: Avoid querying group permissions with empty group list [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859)
[12:00:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Blacklist more network protocols as defense in depth [puppet] - 10https://gerrit.wikimedia.org/r/1285727 (owner: 10Muehlenhoff)
[12:00:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1094:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1094 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:02:39] <wikibugs>	 (03CR) 10Btullis: [C:03+2] dse-k8s: Switch the services to use IPIP load-balancing [puppet] - 10https://gerrit.wikimedia.org/r/1285747 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis)
[12:04:43] <topranks>	 !log push out updated ACL to Nokia switches for BGP connections (T425703) and add BFD config (T425813)
[12:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:47] <stashbot>	 T425813: Nokia SR-Linux: BFD broken with default homer configuration - https://phabricator.wikimedia.org/T425813
[12:04:55] <wikibugs>	 (03PS1) 10Btullis: Add dse-k8s-wdqs-test hosts to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1285763 (https://phabricator.wikimedia.org/T425653)
[12:05:31] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1055.eqiad.wmnet with OS trixie
[12:05:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1094:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1094 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:10:58] <wikibugs>	 (CR) Brouberol: Add auth_proxy.httpd_cas module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[12:15:08] <wikibugs>	 (CR) Brouberol: Add auth_proxy.httpd_cas module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[12:18:18] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage
[12:19:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1285770 (owner: 10L10n-bot)
[12:23:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1285773 (owner: 10L10n-bot)
[12:25:17] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage
[12:26:41] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): change logo at zh-classical wikipedia (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: WAN233)
[12:27:22] <wikibugs>	 (03PS1) 10Muehlenhoff: thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1285784
[12:28:17] <wikibugs>	 (03PS1) 10Effie Mouzeli: mcrouter_wancache: replace mc1037 with mc1055 [puppet] - 10https://gerrit.wikimedia.org/r/1285785 (https://phabricator.wikimedia.org/T412255)
[12:34:09] <wikibugs>	 (03PS3) 10CWilliams: data.yaml: Adding cwilliams to users [puppet] - 10https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930)
[12:34:40] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "🍿" [puppet] - 10https://gerrit.wikimedia.org/r/1228995 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert)
[12:34:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] data.yaml: Adding cwilliams to users [puppet] - 10https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: 10CWilliams)
[12:36:36] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Enable editing on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285789 (https://phabricator.wikimedia.org/T425354)
[12:36:59] <kostajh>	 jouncebot: nowandnext
[12:36:59] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 23 minute(s)
[12:36:59] <jouncebot>	 In 0 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1300)
[12:38:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:39:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285789 (https://phabricator.wikimedia.org/T425354) (owner: 10Kosta Harlan)
[12:41:31] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Enable editing on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285789 (https://phabricator.wikimedia.org/T425354) (owner: 10Kosta Harlan)
[12:41:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907442 (10VRiley-WMF) Hey @ssingh Is it okay to make this change today?
[12:41:45] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1285789|hCaptcha: Enable editing on group0 wikis (T425354)]]
[12:41:48] <stashbot>	 T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354
[12:42:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907451 (10VRiley-WMF) Also, I do apologize, I was planning on doing this today
[12:43:15] <wikibugs>	 (03PS4) 10CWilliams: data.yaml: Adding cwilliams to users [puppet] - 10https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930)
[12:43:47] <wikibugs>	 (03PS5) 10WAN233: change logo at zh-classical wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128)
[12:44:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] data.yaml: Adding cwilliams to users [puppet] - 10https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: 10CWilliams)
[12:44:15] <wikibugs>	 (CR) WAN233: change logo at zh-classical wikipedia (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: WAN233)
[12:44:17] <ottomata>	 kostajh: o/ after yours is done mind if I fit in a config deployment before the backport window?
[12:44:56] <kostajh>	 ottomata: sure
[12:45:01] <ottomata>	 ty
[12:45:32] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1285789|hCaptcha: Enable editing on group0 wikis (T425354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:47:26] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with deployment
[12:48:00] <wikibugs>	 (03PS5) 10CWilliams: data.yaml: Adding cwilliams to users [puppet] - 10https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930)
[12:49:52] <kostajh>	 ottomata: over to you once https://spiderpig.wikimedia.org/jobs/1949 is done
[12:51:04] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[12:51:05] <ottomata>	 thanks ya, watching!
[12:51:38] <wikibugs>	 (03PS1) 10JMeybohm: Revert^2 "Bump default rsyslog container version to 8.2504.0-1" [puppet] - 10https://gerrit.wikimedia.org/r/1285793 (https://phabricator.wikimedia.org/T418200)
[12:53:53] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285789|hCaptcha: Enable editing on group0 wikis (T425354)]] (duration: 12m 07s)
[12:53:56] <stashbot>	 T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354
[12:53:57] <wikibugs>	 (03PS1) 10JMeybohm: Bump release generation for mercurius to pick up rsyslog upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285794 (https://phabricator.wikimedia.org/T418200)
[12:54:54] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "wikikube: Add ratelimit-media namespace" [puppet] - 10https://gerrit.wikimedia.org/r/1285750 (https://phabricator.wikimedia.org/T414439) (owner: 10JMeybohm)
[12:55:52] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Nokia SR-Linux: BFD broken with default homer configuration - https://phabricator.wikimedia.org/T425813#11907507 (cmooney) Open→Resolved Patch merged and config pushed to all Nokia devices now.
[12:56:04] <jinxer-wm>	 RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[12:56:19] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin: add spare Yubikey public key and remove the old one [puppet] - 10https://gerrit.wikimedia.org/r/1285738 (owner: 10Elukey)
[12:56:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285525 (https://phabricator.wikimedia.org/T423952) (owner: 10Ottomata)
[12:56:47] <elukey>	 jayme: ok to merge?
[12:57:00] <kostajh>	 ottomata: ok, done
[12:57:11] <jayme>	 elukey: Revert "wikikube: Add ratelimit-media namespace"
[12:57:12] <jayme>	 yes
[12:57:27] <jayme>	 thanks
[12:58:31] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[12:59:06] <wikibugs>	 (03PS6) 10Atsuko: Add auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: 10Joal)
[12:59:06] <wikibugs>	 (03PS3) 10Atsuko: Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763)
[12:59:07] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig - add mediawiki.user_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285525 (https://phabricator.wikimedia.org/T423952) (owner: 10Ottomata)
[12:59:14] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[12:59:23] <logmsgbot>	 !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1285525|EventStreamConfig - add mediawiki.user_change.dev0 (T423952)]]
[12:59:26] <stashbot>	 T423952: Create mediawiki.user_change event stream - https://phabricator.wikimedia.org/T423952
[12:59:27] <ottomata>	 thanks kostajh !
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1300).
[13:00:05] <jouncebot>	 yerdua_wmde, codenamenoreste, MatmaRex, and sfaci: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:21] <MatmaRex>	 hi
[13:01:07] <logmsgbot>	 !log otto@deploy1003 otto: Backport for [[gerrit:1285525|EventStreamConfig - add mediawiki.user_change.dev0 (T423952)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:01:11] <Lucas_WMDE>	 o/ I can deploy in a few minutes
[13:01:44] <wikibugs>	 (03CR) 10Blake: [C:03+1] Revert^2 "Bump default rsyslog container version to 8.2504.0-1" [puppet] - 10https://gerrit.wikimedia.org/r/1285793 (https://phabricator.wikimedia.org/T418200) (owner: 10JMeybohm)
[13:01:54] <wikibugs>	 (03CR) 10Atsuko: "added defaults both in templates and in config" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: 10Joal)
[13:02:25] <MatmaRex>	 i need a deployer to ship my changes :) all of my wmf.1 backports should go out together, the config changes can be shipped as you like
[13:03:19] <logmsgbot>	 !log otto@deploy1003 otto: Continuing with deployment
[13:05:51] <wikibugs>	 (03CR) 10Elukey: [C:03+2] role::pki: remove the 'discovery' intermediate's config [puppet] - 10https://gerrit.wikimedia.org/r/1282350 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[13:06:30] <elukey>	 !log remove old discovery pki intermediate
[13:06:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:55] <wikibugs>	 (CR) Brouberol: Add auth_proxy.httpd_cas module (9 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: Joal)
[13:07:28] <logmsgbot>	 !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285525|EventStreamConfig - add mediawiki.user_change.dev0 (T423952)]] (duration: 08m 05s)
[13:07:31] <stashbot>	 T423952: Create mediawiki.user_change event stream - https://phabricator.wikimedia.org/T423952
[13:11:49] <Lucas_WMDE>	 ottomata: are you done deploying? can we do the backport+config window now?
[13:11:59] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Set pki1001 to insetup to ease decom [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[13:12:09] <Lucas_WMDE>	 codenamenoreste doesn’t seem to be around yet
[13:12:57] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Enable and configure WikiProjects prototype on WikiData beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven)
[13:13:30] * Lucas_WMDE looks at MatmaRex’ changes
[13:13:37] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11907552 (10MoritzMuehlenhoff)
[13:13:47] * MatmaRex waves
[13:13:53] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11907554 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff eqsin is now fully on routed Ganeti \o/
[13:14:07] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[13:14:07] * Lucas_WMDE is confused by “Prevent username registration if the username previously existed” and “Prevent username registration if the username previously existed (v2)”
[13:14:13] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_discovery_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:14:23] <Lucas_WMDE>	 like… I would usually assume that “v2” supersedes the previous version
[13:14:25] <Lucas_WMDE>	 but we’re deploying both?
[13:14:26] <wikibugs>	 06SRE: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11907559 (10brouberol) FYI, I created https://gitlab.wikimedia.org/repos/sre/kafka-configurator ~2 years ago thinking it could be useful for 3 things: - managing topics - managing topics configuration - managing ACLs  In its cur...
[13:14:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] P:openstack: neutron: Set MTU on cloudnet eqiad1 VLAN interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1285759 (https://phabricator.wikimedia.org/T425674) (owner: 10Majavah)
[13:14:56] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[13:14:58] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix Cumin alias for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1285797 (https://phabricator.wikimedia.org/T421863)
[13:15:13] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[13:15:23] <jinxer-wm>	 FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 23h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[13:15:26] <MatmaRex>	 Lucas_WMDE: the v2 supersedes the one we wrote in 2018
[13:15:44] <MatmaRex>	 so yes, both patches are meant to be deployed
[13:15:52] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:16:38] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[13:16:40] <wikibugs>	 (03PS5) 10Audrey Penven: Enable and configure WikiProjects prototype on Wikidata beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850)
[13:16:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907565 (10ssingh) >>! In T421421#11907442, @VRiley-WMF wrote: > Hey @ssingh Is it okay to make this change today?  Yes, please, the host is not in service so you can start whenever...
[13:17:00] <wikibugs>	 (CR) Audrey Penven: Enable and configure WikiProjects prototype on Wikidata beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven)
[13:17:24] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Note that the stale right still shows up in API output ([example](https://login.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński)
[13:17:35] <Lucas_WMDE>	 let’s start with yerdua_wmde’s changes
[13:17:38] <Lucas_WMDE>	 *change
[13:17:46] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[13:17:46] <Lucas_WMDE>	 because that doesn’t involve rebuilding the l10n cache :D
[13:18:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: 10Audrey Penven)
[13:18:40] <Lucas_WMDE>	 (and I assume ottomata is done)
[13:18:56] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[13:19:02] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[13:19:13] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add dse-k8s-wdqs-test hosts to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1285763 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis)
[13:19:14] <Lucas_WMDE>	 and then drop writeapi next, as its own deploy because it has some risk of breakage
[13:19:23] <Lucas_WMDE>	 and then all the MatmaRex backports can go through gate-and-submit while that merges
[13:19:28] <Lucas_WMDE>	 and then we’ll see how much further we get
[13:19:35] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[13:19:46] <wikibugs>	 (03Merged) 10jenkins-bot: Enable and configure WikiProjects prototype on Wikidata beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: 10Audrey Penven)
[13:20:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1270482|Enable and configure WikiProjects prototype on Wikidata beta (T421850)]]
[13:20:06] <stashbot>	 T421850: [WIPR] Prototype - Display Wikiproject link on Beta Item pages using properties - https://phabricator.wikimedia.org/T421850
[13:21:28] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply
[13:21:29] <MatmaRex>	 Lucas_WMDE: hmm, if you think the 'writeapi' removal is risky, let's reschedule that one. i want to get the other ones a lot more. i'll update the backports and calendar
[13:21:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 audreypenven, lucaswerkmeister-wmde: Backport for [[gerrit:1270482|Enable and configure WikiProjects prototype on Wikidata beta (T421850)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:21:54] <Lucas_WMDE>	 yerdua_wmde: anything to test on mwdebug for this change?
[13:22:01] <Lucas_WMDE>	 (the correct answer is “no”, beta doesn’t have mwdebug ;))
[13:22:02] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add dse-k8s-wdqs-test hosts to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1285763 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis)
[13:22:15] <Lucas_WMDE>	 MatmaRex: okay, we can do the backports first
[13:22:15] <yerdua_wmde>	 no, nothing to test
[13:22:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 audreypenven, lucaswerkmeister-wmde: Continuing with deployment
[13:22:25] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386)
[13:22:29] * Lucas_WMDE clicks +2 a couple times
[13:22:40] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285460 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:22:43] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285461 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:22:48] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: 10Bartosz Dziewoński)
[13:22:50] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859) (owner: 10Bartosz Dziewoński)
[13:22:55] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "Hmm. I'll schedule this for a less busy window…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński)
[13:22:58] <logmsgbot>	 !log jiji@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc1055.eqiad.wmnet with OS trixie
[13:23:09] <Lucas_WMDE>	 MatmaRex: fyi I would also do the other config change separately because I haven’t looked at it yet
[13:23:34] <Lucas_WMDE>	 (also, sfaci are you around? should your changes be deployed separately or together?)
[13:23:37] <MatmaRex>	 sure
[13:23:51] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:24:13] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_discovery_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:24:17] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1055.eqiad.wmnet with OS bookworm
[13:24:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:25:08] <jinxer-wm>	 FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 23h 25m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[13:25:43] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Seems uncontroversial, I think we can skip the on-wiki notifications / consensus-finding here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:25:52] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:25:53] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[13:26:04] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "recheck (T419488?)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:26:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270482|Enable and configure WikiProjects prototype on Wikidata beta (T421850)]] (duration: 06m 28s)
[13:26:34] <stashbot>	 T421850: [WIPR] Prototype - Display Wikiproject link on Beta Item pages using properties - https://phabricator.wikimedia.org/T421850
[13:27:01] <wikibugs>	 (03PS1) 10JMeybohm: ratelimit-media: Set default gateway hostname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285798 (https://phabricator.wikimedia.org/T414439)
[13:27:13] <Lucas_WMDE>	 yerdua_wmde: all done, should be effective on beta soon (if not already) ^^
[13:27:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285460 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:27:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285461 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:27:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: 10Bartosz Dziewoński)
[13:27:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859) (owner: 10Bartosz Dziewoński)
[13:28:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Prevent username registration if the username previously existed [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285460 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:28:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Prevent username registration if the username previously existed (v2) [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285461 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:28:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] API: Introduce list=globalusers [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: 10Bartosz Dziewoński)
[13:28:42] <Lucas_WMDE>	 bah, the backports are already failing in zuul
[13:28:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] list=globalusers: Avoid querying group permissions with empty group list [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859) (owner: 10Bartosz Dziewoński)
[13:28:50] <wikibugs>	 (03CR) 10Elukey: [C:03+1] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1285784 (owner: 10Muehlenhoff)
[13:28:56] <Lucas_WMDE>	 ffs https://integration.wikimedia.org/ci/job/quibble-with-Wikibase-extensions-browser-tests-only-vendor-php83/7672/console
[13:29:00] <Lucas_WMDE>	 “Skipping remaining commands due to success cache hit”
[13:29:09] <Lucas_WMDE>	 and then T419488 changed it to a failure anyway
[13:29:10] <stashbot>	 T419488: PostBuild changing the status of successful builds to failure for no apparent reason - https://phabricator.wikimedia.org/T419488
[13:29:42] <Lucas_WMDE>	 let’s click the convenient retry button in spiderpig
[13:29:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285460 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:29:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285461 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:29:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: 10Bartosz Dziewoński)
[13:29:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859) (owner: 10Bartosz Dziewoński)
[13:30:13] <Lucas_WMDE>	 the regular gate-and-submit queue is also super full
[13:30:14] <wikibugs>	 (03PS1) 10JMeybohm: Add ratelimit-media CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/1285799 (https://phabricator.wikimedia.org/T414439)
[13:30:31] <phuedx>	 Lucas_WMDE: Sorry I'm late. sfaci can't make the window. I'm here in their place
[13:30:42] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:30:43] <btullis>	 !log restarting pybal on lvs1019 and lvs1020 for T420437
[13:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:47] <stashbot>	 T420437: Migrate DSE k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420437
[13:30:48] <Lucas_WMDE>	 phuedx: with a lot of luck we might get to your changes in this window
[13:30:58] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] ratelimit-media: Set default gateway hostname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285798 (https://phabricator.wikimedia.org/T414439) (owner: 10JMeybohm)
[13:30:58] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:31:28] <Lucas_WMDE>	 phuedx: should those changes be deployed together or separately?
[13:32:22] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply
[13:32:29] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[13:33:05] <wikibugs>	 (03Merged) 10jenkins-bot: ratelimit-media: Set default gateway hostname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285798 (https://phabricator.wikimedia.org/T414439) (owner: 10JMeybohm)
[13:33:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:33:33] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Fix Cumin alias for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1285797 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[13:33:54] <wikibugs>	 (03PS7) 10Atsuko: Add auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: 10Joal)
[13:33:54] <wikibugs>	 (03PS4) 10Atsuko: Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763)
[13:34:06] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply
[13:34:16] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[13:34:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:34:27] <wikibugs>	 (03Merged) 10jenkins-bot: Prevent username registration if the username previously existed [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285460 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:34:31] <wikibugs>	 (03CR) 10Atsuko: Add auth_proxy.httpd_cas module (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: 10Joal)
[13:34:32] <wikibugs>	 (03Merged) 10jenkins-bot: Prevent username registration if the username previously existed (v2) [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285461 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[13:35:14] <phuedx>	 Lucas_WMDE: They can be deployed together
[13:35:20] * phuedx crosses fingers
[13:35:25] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking)
[13:35:52] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:36:16] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage
[13:37:20] <wikibugs>	 (03CR) 10DCausse: [C:03+1] dse-k8s: raise vm.max_map_count for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1284792 (https://phabricator.wikimedia.org/T425681) (owner: 10Bking)
[13:38:22] <wikibugs>	 (03Merged) 10jenkins-bot: API: Introduce list=globalusers [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285462 (https://phabricator.wikimedia.org/T261752) (owner: 10Bartosz Dziewoński)
[13:38:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:38:25] <wikibugs>	 (03Merged) 10jenkins-bot: list=globalusers: Avoid querying group permissions with empty group list [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285761 (https://phabricator.wikimedia.org/T425859) (owner: 10Bartosz Dziewoński)
[13:38:35] <Lucas_WMDE>	 woohoo
[13:38:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1285460|Prevent username registration if the username previously existed (T196386)]], [[gerrit:1285461|Prevent username registration if the username previously existed (v2) (T196386)]], [[gerrit:1285462|API: Introduce list=globalusers (T261752)]], [[gerrit:1285761|list=globalusers: Avoid querying group permissions with empty group list (T425859)]]
[13:38:49] <stashbot>	 T196386: MediaWiki should prevent username registration if the username previously existed - https://phabricator.wikimedia.org/T196386
[13:38:55] <stashbot>	 T261752: Add an API module to display status of multiple globally locked users - https://phabricator.wikimedia.org/T261752
[13:39:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907708 (10VRiley-WMF) 05Open→03In progress
[13:39:30] <wikibugs>	 (03CR) 10CWilliams: data.yaml: Adding cwilliams to users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: 10CWilliams)
[13:39:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11907712 (10Fabfur)
[13:40:07] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage
[13:40:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Perfect!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: 10Joal)
[13:40:42] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:40:58] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:42:45] <wikibugs>	 (03PS2) 10CWilliams: Added cwilliams to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1285369
[13:43:23] <wikibugs>	 (03Abandoned) 10Btullis: Update spark shufflers on the test cluster to deploy version 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/1093394 (https://phabricator.wikimedia.org/T380040) (owner: 10Btullis)
[13:43:25] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:43:43] <wikibugs>	 (03PS3) 10CWilliams: data.yaml: Adding cwilliams to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1285369 (https://phabricator.wikimedia.org/T425930)
[13:44:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:45:32] <Lucas_WMDE>	 (still building the image… this’ll probably take a while)
[13:45:52] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:46:10] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Let's go!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko)
[13:47:12] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: dse-k8s-worker-codfw@codfw
[13:47:48] <Lucas_WMDE>	 wondering if the other MatmaRex config change (grant createpreviouslyrenamedaccount) and phuedx’ changes can be deployed together afterwards
[13:48:05] <MatmaRex>	 probably
[13:48:07] <Lucas_WMDE>	 though I guess the WikiLambda event stream(?) stuff could be risky
[13:48:25] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:13] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service pki2002:443 has failed probes (http_PKI_discovery_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:49:38] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[13:50:11] <phuedx>	 Lucas_WMDE: That event stream change is a NOOP tidy up AFAICT. The instrument that sends analytics events to the stream is configured via TestKitchen now
[13:50:38] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[13:50:38] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: dse-k8s-worker-codfw@codfw
[13:50:59] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: dse-k8s-worker-eqiad@eqiad
[13:51:30] <Lucas_WMDE>	 scap has been running docker-pusher for almost ten minutes now
[13:51:39] <Lucas_WMDE>	 and `top` says dockerd is a bit above 100% CPU usage
[13:51:49] * Lucas_WMDE wonders how pushing data over the network can be CPU bound
[13:53:38] <icinga-wm>	 PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Wed 27 May 2026 01:53:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[13:56:05] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[13:56:07] <wikibugs>	 (03PS1) 10Slyngshede: Update to CAS version 7.3.6 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1285804
[13:56:13] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1055.eqiad.wmnet with OS bookworm
[13:57:01] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] dse-k8s: raise vm.max_map_count for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1284792 (https://phabricator.wikimedia.org/T425681) (owner: 10Bking)
[13:57:21] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[13:57:21] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: dse-k8s-worker-eqiad@eqiad
[13:58:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907811 (10VRiley-WMF)
[13:59:08] <Lucas_WMDE>	 image builds completed :o
[13:59:32] <Lucas_WMDE>	  MatmaRex will the backports be testable btw?
[14:00:40] <MatmaRex>	 Lucas_WMDE: yeah, i have some API queries prepared
[14:01:21] <Lucas_WMDE>	 ok, great
[14:03:25] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:04:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 matmarex, lucaswerkmeister-wmde: Backport for [[gerrit:1285460|Prevent username registration if the username previously existed (T196386)]], [[gerrit:1285461|Prevent username registration if the username previously existed (v2) (T196386)]], [[gerrit:1285462|API: Introduce list=globalusers (T261752)]], [[gerrit:1285761|list=globalusers: Avoid querying group permissions with empty group list (T425859)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:04:36] <stashbot>	 T196386: MediaWiki should prevent username registration if the username previously existed - https://phabricator.wikimedia.org/T196386
[14:04:37] <stashbot>	 T261752: Add an API module to display status of multiple globally locked users - https://phabricator.wikimedia.org/T261752
[14:04:37] <stashbot>	 T425859: InvalidArgumentException in list=globalusers API module with gusprop=rights - https://phabricator.wikimedia.org/T425859
[14:04:40] <wikibugs>	 (03CR) 10Atsuko: [C:03+1] Migrated turnilo to auth_proxy.httpd_cas module (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko)
[14:04:53] <wikibugs>	 (03CR) 10Atsuko: [C:03+2] Add auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: 10Joal)
[14:05:00] <wikibugs>	 (03CR) 10Atsuko: [C:03+2] Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko)
[14:05:28] <MatmaRex>	 Lucas_WMDE: thanks, looks good
[14:05:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 matmarex, lucaswerkmeister-wmde: Continuing with deployment
[14:05:33] <Lucas_WMDE>	 \o/
[14:05:51] <Lucas_WMDE>	 jouncebot: nowandnext
[14:05:51] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 24 minute(s)
[14:05:51] <jouncebot>	 In 0 hour(s) and 24 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1430)
[14:06:00] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285352 (https://phabricator.wikimedia.org/T415254) (owner: 10Jforrester)
[14:06:03] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285406 (https://phabricator.wikimedia.org/T425785) (owner: 10Santiago Faci)
[14:06:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:09] <wikibugs>	 (03CR) 10Xcollazo: "( I don't think I have the expertise to review here. Will let @joal@wikimedia.org review. )" [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata)
[14:08:25] <jinxer-wm>	 FIRING: [12x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:09:12] <wikibugs>	 (03Merged) 10jenkins-bot: Add auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 (https://phabricator.wikimedia.org/T348763) (owner: 10Joal)
[14:09:15] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:09:26] <wikibugs>	 (03Merged) 10jenkins-bot: Migrated turnilo to auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285739 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko)
[14:09:50] <wikibugs>	 (03CR) 10CDanis: [C:03+2] Revert "haproxy: webrequest: capture ratelimiting headers" [puppet] - 10https://gerrit.wikimedia.org/r/1283858 (owner: 10CDanis)
[14:10:21] <Lucas_WMDE>	 meanwhile, zuul is still in hell /o\
[14:12:54] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:12:58] <wikibugs>	 (03Merged) 10jenkins-bot: WikiLambdaApi instrument: Sets the custom schemaID [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285352 (https://phabricator.wikimedia.org/T415254) (owner: 10Jforrester)
[14:13:04] <wikibugs>	 (03Merged) 10jenkins-bot: editSaves: getExperiment returns a promise now [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285406 (https://phabricator.wikimedia.org/T425785) (owner: 10Santiago Faci)
[14:13:25] <jinxer-wm>	 FIRING: [19x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:13:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907908 (10VRiley-WMF)
[14:15:32] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bookworm
[14:15:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907926 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host lvs1017.eqiad.wmnet with OS bookworm
[14:16:35] <Lucas_WMDE>	 TIL SpiderPig disables the “new backport” form when one is already running
[14:16:47] <Lucas_WMDE>	 I wanted to speed things up by already pasting the URLs for the next deploy but computer says no
[14:17:53] <Lucas_WMDE>	 what does “Waiting 20 seconds for production traffic” do, btw?
[14:18:07] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285460|Prevent username registration if the username previously existed (T196386)]], [[gerrit:1285461|Prevent username registration if the username previously existed (v2) (T196386)]], [[gerrit:1285462|API: Introduce list=globalusers (T261752)]], [[gerrit:1285761|list=globalusers: Avoid querying group permissions with empty group list (T425859)]] (duration: 39m 22s)
[14:18:12] <stashbot>	 T196386: MediaWiki should prevent username registration if the username previously existed - https://phabricator.wikimedia.org/T196386
[14:18:12] <stashbot>	 T261752: Add an API module to display status of multiple globally locked users - https://phabricator.wikimedia.org/T261752
[14:18:12] <stashbot>	 T425859: InvalidArgumentException in list=globalusers API module with gusprop=rights - https://phabricator.wikimedia.org/T425859
[14:18:25] <jinxer-wm>	 FIRING: [23x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:18:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[14:18:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254) (owner: 10Santiago Faci)
[14:20:12] <wikibugs>	 (03Merged) 10jenkins-bot: Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285448 (https://phabricator.wikimedia.org/T196386) (owner: 10Bartosz Dziewoński)
[14:20:16] <wikibugs>	 (03Merged) 10jenkins-bot: WikiLambdaApi: update stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254) (owner: 10Santiago Faci)
[14:20:38] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1285448|Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes (T196386)]], [[gerrit:1278704|WikiLambdaApi: update stream configuration (T415254)]], [[gerrit:1285352|WikiLambdaApi instrument: Sets the custom schemaID (T415254)]], [[gerrit:1285406|editSaves: getExperiment returns a promise now (T425785)]]
[14:20:41] <Lucas_WMDE>	 MatmaRex, phuedx: ^ fyi
[14:20:47] <stashbot>	 T415254: Migrate "WikiLambda API" instrument to use the Test Kitchen SDK - https://phabricator.wikimedia.org/T415254
[14:20:47] <stashbot>	 T425785: TypeError: experiment.send is not a function - https://phabricator.wikimedia.org/T425785
[14:21:14] <phuedx>	 Lucas_WMDE: ty
[14:21:22] <taavi>	 Lucas_WMDE: exactly what it says, waits for some traffic to hit the canaries so it can check if that traffic is failing or not
[14:21:52] <Lucas_WMDE>	 taavi: but it says production traffic, not canary traffic
[14:21:54] <Lucas_WMDE>	 (“Waiting 20 seconds for canary traffic” is earlier)
[14:22:05] <taavi>	 the canaries are serving some subset of production traffic
[14:22:30] <wikibugs>	 (03PS1) 10Tiziano Fogli: logstash: adjust param_time parsing for thanos-query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/1285823 (https://phabricator.wikimedia.org/T423986)
[14:22:48] <Lucas_WMDE>	 I’m still confused
[14:23:04] <Lucas_WMDE>	 we deploy to the canaries, then wait for canary traffic, then check logstash. so far so good
[14:23:18] <Lucas_WMDE>	 then we deploy to all of production, and… wait for traffic and check logstash again?
[14:23:25] <jinxer-wm>	 FIRING: [25x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:23:39] <taavi>	 think of canaries as a subset of production, not a separate thing
[14:24:03] * Lucas_WMDE digs up some older spiderpig logs
[14:24:31] <Lucas_WMDE>	 yeah there’s no “Waiting 20 seconds for production traffic” in https://spiderpig.wikimedia.org/jobs/1000, this is something newer
[14:25:24] <Lucas_WMDE>	 is this also part of T225207?
[14:25:25] <stashbot>	 T225207: Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207
[14:26:17] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, jforrester, matmarex, sfaci: Backport for [[gerrit:1285448|Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes (T196386)]], [[gerrit:1278704|WikiLambdaApi: update stream configuration (T415254)]], [[gerrit:1285352|WikiLambdaApi instrument: Sets the custom schemaID (T415254)]], [[gerrit:1285406|editSaves: getExperiment returns a promise now (T425785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:26:22] <stashbot>	 T196386: MediaWiki should prevent username registration if the username previously existed - https://phabricator.wikimedia.org/T196386
[14:26:23] <stashbot>	 T415254: Migrate "WikiLambda API" instrument to use the Test Kitchen SDK - https://phabricator.wikimedia.org/T415254
[14:26:23] <stashbot>	 T425785: TypeError: experiment.send is not a function - https://phabricator.wikimedia.org/T425785
[14:26:40] <Lucas_WMDE>	 MatmaRex, phuedx: please test
[14:26:55] <Lucas_WMDE>	 apparently it’s T317405 / https://gitlab.wikimedia.org/repos/releng/scap/-/commit/ec14e688b8
[14:26:56] <stashbot>	 T317405: Add failure rate triggered rollback to scap - https://phabricator.wikimedia.org/T317405
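A rough sketch of the sequence discussed above (deploy to the canaries, wait for traffic, check the error rate, then do the same for all of production), assuming nothing about scap's real internals. The function `staged_deploy`, its callbacks, and the 1% threshold are hypothetical names and values chosen for illustration; only the stage order and the 20-second wait come from the conversation.

```python
import time
from typing import Callable, List, Sequence, Tuple


def staged_deploy(
    stages: Sequence[Tuple[str, List[str]]],
    deploy: Callable[[List[str]], None],
    error_rate: Callable[[List[str]], float],
    rollback: Callable[[List[str]], None],
    max_error_rate: float = 0.01,  # hypothetical threshold, not scap's value
    wait_seconds: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deploy stage by stage; roll back everything if errors spike.

    Returns True if every stage passed its check, False after a rollback.
    """
    deployed: List[str] = []
    for name, hosts in stages:
        deploy(hosts)
        deployed.extend(hosts)
        # Give real traffic time to reach the newly updated hosts.
        print(f"Waiting {wait_seconds} seconds for {name} traffic")
        sleep(wait_seconds)
        # Canaries are a subset of production, so the check always covers
        # every host updated so far, not just the latest stage.
        if error_rate(deployed) > max_error_rate:
            rollback(deployed)
            return False
    return True
```

Called with the stages `[("canary", [...]), ("production", [...])]`, this runs the same wait-and-check step twice, which would explain why a second "Waiting 20 seconds" message shows up after the full sync.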
[14:29:22] <MatmaRex>	 looking, sorry
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1430)
[14:30:16] <Lucas_WMDE>	 still deploying, sorry test kitcheners
[14:30:54] <Lucas_WMDE>	 phuedx: can you test the changes in WikiLambdaApi/WikimediaEvents?
[14:31:03] <MatmaRex>	 looks good
[14:31:08] <Lucas_WMDE>	 ok, thanks
[14:31:11] <phuedx>	 Lucas_WMDE: I'm just looking at the Wikilambda one
[14:31:20] <Lucas_WMDE>	 ok
[14:31:33] <phuedx>	 The WikimediaEvents one is tricky to test but I'm confident that it will fix the error :)
[14:32:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908107 (10ssingh) ` Record:      410 Date/Time:   05/10/2026 04:22:34 Source:      system Severity:    Critical Description: A critical diagnostic event occurred in the memory device at B2. Conta...
[14:33:03] <phuedx>	 OK. Plenty of Wikilambda API requests are succeeding on regular abstractwiki pageviews 👍
[14:33:05] <phuedx>	 Lucas_WMDE: ^
[14:33:06] <phuedx>	 LGTM
[14:33:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, jforrester, matmarex, sfaci: Continuing with deployment
[14:33:11] <Lucas_WMDE>	 alright, thanks!
[14:34:49] <wikibugs>	 (03CR) 10CDanis: [C:03+2] turnilo: webrequest: add ja4h sub-component dimensions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279447 (owner: 10CDanis)
[14:37:00] <wikibugs>	 (03Merged) 10jenkins-bot: turnilo: webrequest: add ja4h sub-component dimensions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279447 (owner: 10CDanis)
[14:38:45] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[14:39:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285448|Grant 'createpreviouslyrenamedaccount' to account creators and sysop-likes (T196386)]], [[gerrit:1278704|WikiLambdaApi: update stream configuration (T415254)]], [[gerrit:1285352|WikiLambdaApi instrument: Sets the custom schemaID (T415254)]], [[gerrit:1285406|editSaves: getExperiment returns a promise now (T425785)]] (duration: 18m 50s)
[14:39:34] <stashbot>	 T196386: MediaWiki should prevent username registration if the username previously existed - https://phabricator.wikimedia.org/T196386
[14:39:34] <stashbot>	 T415254: Migrate "WikiLambda API" instrument to use the Test Kitchen SDK - https://phabricator.wikimedia.org/T415254
[14:39:35] <stashbot>	 T425785: TypeError: experiment.send is not a function - https://phabricator.wikimedia.org/T425785
[14:39:45] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:39:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:54] * Lucas_WMDE done deploying, Test Kitchen can take over now
[14:39:58] <Lucas_WMDE>	 sorry for the delay
[14:40:04] <MatmaRex>	 thanks for deploying Lucas_WMDE
[14:40:41] <phuedx>	 +1 Thanks Lucas_WMDE <3
[14:40:47] <Lucas_WMDE>	 (meanwhile Zuul remains firmly stuck in hell)
[14:41:25] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:41:52] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply
[14:42:17] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply
[14:42:28] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host lvs1017
[14:43:46] <wikibugs>	 SRE, Infrastructure-Foundations: Look into feasibility of disabling sha-1 host keys on our ssh daemons - https://phabricator.wikimedia.org/T167966#11908173 (LSobanski) p:Medium→Low This will be addressed automatically with Debian version upgrades.
[14:43:57] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1017
[14:44:50] <wikibugs>	 SRE, Infrastructure-Foundations: Investigate GID allocation for system users - https://phabricator.wikimedia.org/T235163#11908178 (LSobanski) p:Medium→Low
[14:45:35] <wikibugs>	 (CR) Elukey: [C:+2] sre.network: handle dry-run outputs in run_junos_commands (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1269011 (owner: Elukey)
[14:46:29] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:47:28] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:48:07] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Puppet-Core, Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411#11908191 (LSobanski) p:Medium→Low Considering the age of this task, is this still a valid request?
[14:49:54] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: Puppet does not undo manual "systemctl mask $unit" - https://phabricator.wikimedia.org/T285425#11908214 (LSobanski) Open→Declined Exim doesn't have a fully fledged systemd unit and masking is expected to work fine otherwise. Please reopen if y...
[14:50:00] <wikibugs>	 (PS1) Effie Mouzeli: gateway-check.lua: Route some LiftWing endpoints through the REST gateway. [puppet] - https://gerrit.wikimedia.org/r/1285827 (https://phabricator.wikimedia.org/T422804) (owner: Blake)
[14:51:36] <wikibugs>	 (PS1) Elukey: admin_ng: update the opentelemetry's collector to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1285835 (https://phabricator.wikimedia.org/T416452)
[14:51:52] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Puppet Profiler - https://phabricator.wikimedia.org/T341448#11908226 (LSobanski) p:Medium→Low
[14:52:09] <wikibugs>	 (CR) CDanis: [C:+1] admin_ng: update the opentelemetry's collector to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1285835 (https://phabricator.wikimedia.org/T416452) (owner: Elukey)
[14:54:00] <wikibugs>	 (CR) Ssingh: [C:+1] Geo-maps: Update Meta PoPs [dns] - https://gerrit.wikimedia.org/r/1282956 (owner: Slyngshede)
[14:54:05] <logmsgbot>	 !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply
[14:54:29] <logmsgbot>	 !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply
[14:54:35] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908254 (Jhancock.wm) i pulled a replacement DIMM and a ssd from our offlined hosts.  @ssingh safe to power down the host?
[14:55:21] <wikibugs>	 (CR) Blake: [C:+2] gateway-check.lua: Route some LiftWing endpoints through the REST gateway. [puppet] - https://gerrit.wikimedia.org/r/1285827 (https://phabricator.wikimedia.org/T422804) (owner: Blake)
[14:55:23] <logmsgbot>	 !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2012.codfw.wmnet with reason: DIMM replacement
[14:55:33] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908259 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8830f0f1-94da-40cc-8ac8-4aef8e53c8f4) set by sukhe@cumin1003 for 1:00:00 on 1 host(s) and their services with reason: DI...
[14:55:58] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908262 (ssingh) >>! In T425890#11908254, @Jhancock.wm wrote: > i pulled a replacement DIMM and a ssd from our offlined hosts.  > @ssingh safe to power down the host?  @Jhancock.wm: Yes, please...
[14:59:57] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops, decommission-hardware: decommission db2144.codfw.wmnet - https://phabricator.wikimedia.org/T425522#11908288 (Jhancock.wm) Open→Resolved
[15:09:48] <wikibugs>	 (CR) Elukey: [C:+2] admin_ng: update the opentelemetry's collector to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1285835 (https://phabricator.wikimedia.org/T416452) (owner: Elukey)
[15:17:27] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bookworm
[15:17:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11908393 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host lvs1017.eqiad.wmnet with OS bookworm
[15:19:55] <wikibugs>	 ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11908396 (RobH) Putting in a remote hands ticket with the following:   Support,  One of our router to switch links has unexpectedly gone down.  We would like you to observe both ports, note the lack of link light, then pro...
[15:21:02] <logmsgbot>	 !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply
[15:21:11] <logmsgbot>	 !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply
[15:22:20] <icinga-wm>	 PROBLEM - Host lsw1-a3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:22:58] <icinga-wm>	 PROBLEM - Host ps1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:01] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908449 (Jhancock.wm) @ssingh replaced both. not seeing any errors in the idrac logs at this moment. You should be good to rebuild it.
[15:24:32] <wikibugs>	 (PS1) CDanis: turnilo: webrequest: bool dimension for resiproxy [deployment-charts] - https://gerrit.wikimedia.org/r/1285846
[15:25:07] <wikibugs>	 ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11908458 (RobH)
[15:27:17] <wikibugs>	 (CR) Bking: [C:+1] turnilo: webrequest: bool dimension for resiproxy [deployment-charts] - https://gerrit.wikimedia.org/r/1285846 (owner: CDanis)
[15:27:41] <wikibugs>	 (CR) CDanis: [C:+2] turnilo: webrequest: bool dimension for resiproxy [deployment-charts] - https://gerrit.wikimedia.org/r/1285846 (owner: CDanis)
[15:29:00] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:29:23] <wikibugs>	 (CR) Joal: [C:+1] turnilo: webrequest: bool dimension for resiproxy [deployment-charts] - https://gerrit.wikimedia.org/r/1285846 (owner: CDanis)
[15:29:27] <wikibugs>	 (CR) Bking: [C:+2] dse-k8s: raise vm.max_map_count for OpenSearch [puppet] - https://gerrit.wikimedia.org/r/1284792 (https://phabricator.wikimedia.org/T425681) (owner: Bking)
[15:29:52] <wikibugs>	 (Merged) jenkins-bot: turnilo: webrequest: bool dimension for resiproxy [deployment-charts] - https://gerrit.wikimedia.org/r/1285846 (owner: CDanis)
[15:29:58] <icinga-wm>	 RECOVERY - Host lsw1-a3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.67 ms
[15:30:01] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:30:04] <icinga-wm>	 RECOVERY - Host ps1-a3-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.83 ms
[15:30:05] <jouncebot>	 jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1530).
[15:30:19] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:30:44] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on lvs2012 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T425965 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[15:30:57] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425965 (ops-monitoring-bot) NEW
[15:31:40] <wikibugs>	 (CR) Cwhite: [C:+2] logstash: adjust param_time parsing for thanos-query-frontend [puppet] - https://gerrit.wikimedia.org/r/1285823 (https://phabricator.wikimedia.org/T423986) (owner: Tiziano Fogli)
[15:32:13] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908546 (ssingh) @Jhancock.wm: Thanks for the quick turnaround! Host is back and serving traffic, will keep a close watch for a bit before resolving this.
[15:33:00] <icinga-wm>	 PROBLEM - Host ps1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:33:14] <icinga-wm>	 RECOVERY - Host ps1-a3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms
[15:33:30] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Fix Cumin alias for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1285797 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[15:33:35] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425965#11908555 (Jhancock.wm) →Duplicate dup:T425890
[15:33:36] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908553 (Jhancock.wm)
[15:36:41] <zabe>	 jouncebot: nowandnext
[15:36:41] <jouncebot>	 For the next 0 hour(s) and 23 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1530)
[15:36:41] <jouncebot>	 In 1 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1700)
[15:36:41] <jouncebot>	 In 1 hour(s) and 23 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1700)
[15:36:47] <wikibugs>	 (CR) Zabe: [C:+2] Start reading from new file tables on testwiki (2nd try) [mediawiki-config] - https://gerrit.wikimedia.org/r/1280418 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[15:38:25] <jinxer-wm>	 FIRING: [25x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:39:17] <wikibugs>	 (Merged) jenkins-bot: Start reading from new file tables on testwiki (2nd try) [mediawiki-config] - https://gerrit.wikimedia.org/r/1280418 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[15:39:47] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1280418|Start reading from new file tables on testwiki (2nd try) (T416548)]]
[15:39:51] <stashbot>	 T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[15:40:01] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:40:30] <icinga-wm>	 PROBLEM - Host lsw1-a5-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:22] <icinga-wm>	 PROBLEM - Host ps1-a5-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:35] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1280418|Start reading from new file tables on testwiki (2nd try) (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:42:10] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with deployment
[15:44:06] <wikibugs>	 (PS4) CDanis: puppetserver: install cidergrinder, run daily grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270971
[15:44:11] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis)
[15:44:42] <icinga-wm>	 RECOVERY - Host ps1-a5-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.06 ms
[15:44:56] <icinga-wm>	 RECOVERY - Host lsw1-a5-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.06 ms
[15:46:13] <wikibugs>	 (PS5) CDanis: puppetserver: install cidergrinder, run daily grind on primary [puppet] - https://gerrit.wikimedia.org/r/1270971
[15:46:19] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1280418|Start reading from new file tables on testwiki (2nd try) (T416548)]] (duration: 06m 32s)
[15:46:23] <stashbot>	 T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[15:46:29] <wikibugs>	 (PS3) Zabe: Remove custom user groups from Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1281491 (https://phabricator.wikimedia.org/T423578)
[15:46:35] <wikibugs>	 (CR) Zabe: [C:+2] Remove custom user groups from Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1281491 (https://phabricator.wikimedia.org/T423578) (owner: Zabe)
[15:47:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#11908655 (elukey) ` Updating the root user's password on the BMC. Changing password for the account with username root: /redfish/v1/AccountService/Accounts/3 Updating the ADMIN user's password...
[15:48:44] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis)
[15:50:28] <wikibugs>	 (Merged) jenkins-bot: Remove custom user groups from Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1281491 (https://phabricator.wikimedia.org/T423578) (owner: Zabe)
[15:50:48] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:50:51] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1281491|Remove custom user groups from Wikinews (T423578)]]
[15:50:54] <stashbot>	 T423578: Remove custom user groups from Wikinews (in core-Permissions.php) - https://phabricator.wikimedia.org/T423578
[15:52:33] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1281491|Remove custom user groups from Wikinews (T423578)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:53:26] <wikibugs>	 (PS1) Eevans: sessionstore: Upgrade to Kask v1.0.19 (Trixie) [deployment-charts] - https://gerrit.wikimedia.org/r/1285852 (https://phabricator.wikimedia.org/T425308)
[15:53:49] <wikibugs>	 (CR) Dzahn: [C:+2] admin: upgrade user gweld to shell, analytics-privatedata and kerberos [puppet] - https://gerrit.wikimedia.org/r/1285413 (https://phabricator.wikimedia.org/T425727) (owner: Dzahn)
[15:54:26] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with deployment
[15:54:48] <icinga-wm>	 PROBLEM - Host lsw1-a7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:55:46] <icinga-wm>	 PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:57:00] <wikibugs>	 (CR) Eevans: [C:+2] sessionstore: Upgrade to Kask v1.0.19 (Trixie) [deployment-charts] - https://gerrit.wikimedia.org/r/1285852 (https://phabricator.wikimedia.org/T425308) (owner: Eevans)
[15:58:35] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity and wmf LDAP group for GWeld - https://phabricator.wikimedia.org/T425727#11908744 (Dzahn) No problem, Manuel. With your +1 I merged and deployed it.  Then I created the Kerberos principal....
[15:58:39] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281491|Remove custom user groups from Wikinews (T423578)]] (duration: 07m 48s)
[15:58:43] <stashbot>	 T423578: Remove custom user groups from Wikinews (in core-Permissions.php) - https://phabricator.wikimedia.org/T423578
[15:58:48] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity and wmf LDAP group for GWeld - https://phabricator.wikimedia.org/T425727#11908746 (Dzahn) In progress→Resolved a:Dzahn
[15:59:20] <wikibugs>	 (Merged) jenkins-bot: sessionstore: Upgrade to Kask v1.0.19 (Trixie) [deployment-charts] - https://gerrit.wikimedia.org/r/1285852 (https://phabricator.wikimedia.org/T425308) (owner: Eevans)
[15:59:46] <wikibugs>	 (PS1) Zabe: Start reading from new file tables on all small and medium wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1285853 (https://phabricator.wikimedia.org/T416548)
[16:00:20] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply
[16:00:35] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[16:00:49] <wikibugs>	 (PS1) Andrew Bogott: wmfkeystonehooks: double-check that we're adding a new user to a project [puppet] - https://gerrit.wikimedia.org/r/1285854 (https://phabricator.wikimedia.org/T379550)
[16:01:28] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:01:30] <wikibugs>	 (CR) CI reject: [V:-1] wmfkeystonehooks: double-check that we're adding a new user to a project [puppet] - https://gerrit.wikimedia.org/r/1285854 (https://phabricator.wikimedia.org/T379550) (owner: Andrew Bogott)
[16:01:47] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:02:15] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:02:20] <icinga-wm>	 RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.29 ms
[16:02:20] <icinga-wm>	 RECOVERY - Host lsw1-a7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.59 ms
[16:03:23] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:05:44] <wikibugs>	 (PS2) Andrew Bogott: wmfkeystonehooks: double-check that we're adding a new user to a project [puppet] - https://gerrit.wikimedia.org/r/1285854 (https://phabricator.wikimedia.org/T379550)
[16:08:22] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#11908821 (Jclark-ctr) @elukey  is there anything I can do to help with this?
[16:09:22] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:10:56] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908830 (Dzahn) @KOfori Hi, this says it needs your approval. Does it look good?
[16:12:20] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908834 (Dzahn) Open→In progress
[16:12:24] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908836 (ssingh) @KOfori is out, deferring to @Kappakayala as the approver in the interim.
[16:14:14] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:14:32] <wikibugs>	 (PS1) JMeybohm: Add TLS for the ratelimit namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1285858 (https://phabricator.wikimedia.org/T414439)
[16:15:29] <wikibugs>	 SRE-swift-storage, Cloud-VPS (Quota-requests): Quota increase request for project swift - https://phabricator.wikimedia.org/T425975 (MatthewVernon) NEW
[16:15:43] <wikibugs>	 (CR) Elukey: [C:+1] Add TLS for the ratelimit namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1285858 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[16:16:19] <wikibugs>	 (CR) Dzahn: "patch looks good, just needs approvals" [puppet] - https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: CWilliams)
[16:16:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:17:02] <wikibugs>	 (PS3) Zabe: Disable FlaggedRevs on wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1281506 (https://phabricator.wikimedia.org/T423577)
[16:17:09] <wikibugs>	 (CR) Zabe: [C:+2] Disable FlaggedRevs on wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1281506 (https://phabricator.wikimedia.org/T423577) (owner: Zabe)
[16:17:50] <icinga-wm>	 PROBLEM - Host cloudsw1-b1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:18:20] <icinga-wm>	 PROBLEM - Host ps1-b1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[16:18:21] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908905 (Dzahn)
[16:18:31] <wikibugs>	 (CR) Dzahn: "(and out-of-band confirmation of the SSH key is needed)" [puppet] - https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: CWilliams)
[16:18:32] <wikibugs>	 (PS1) Eevans: echostore: Upgrade (staging) to Kask v1.0.19 (Trixie) [deployment-charts] - https://gerrit.wikimedia.org/r/1285859 (https://phabricator.wikimedia.org/T425308)
[16:19:42] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908913 (Dzahn) confirmed in Dayforce.  NDA checkbox not needed for staff.  L3 checked
[16:19:47] <wikibugs>	 (Merged) jenkins-bot: Disable FlaggedRevs on wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1281506 (https://phabricator.wikimedia.org/T423577) (owner: Zabe)
[16:20:01] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11908918 (Dzahn)
[16:20:26] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1281506|Disable FlaggedRevs on wikinews (T423577)]]
[16:20:29] <stashbot>	 T423577: Undeploy FlaggedRevs from Wikinews and drop FlaggedRevs tables - https://phabricator.wikimedia.org/T423577
[16:21:16] <wikibugs>	 (CR) Eevans: [C:+2] echostore: Upgrade (staging) to Kask v1.0.19 (Trixie) [deployment-charts] - https://gerrit.wikimedia.org/r/1285859 (https://phabricator.wikimedia.org/T425308) (owner: Eevans)
[16:21:26] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317#11908923 (Dzahn) Please let us know if you run into specific problems.  Probably this is about "upgrade from level 2 to level 3" which would mean access to (more) privat...
[16:21:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317#11908924 (Dzahn) a:ArthurTaylor
[16:22:04] <icinga-wm>	 RECOVERY - Host ps1-b1-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.97 ms
[16:22:07] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1281506|Disable FlaggedRevs on wikinews (T423577)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:22:22] <icinga-wm>	 RECOVERY - Host cloudsw1-b1-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.58 ms
[16:23:09] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with deployment
[16:23:22] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11908935 (RobH) Update from email:  * finally got an answer back after escalating both on the ticket, via our dell sg team, and via the accounts payable folks @ dell sg who want to be paid for the m...
[16:23:28] <wikibugs>	 (Merged) jenkins-bot: echostore: Upgrade (staging) to Kask v1.0.19 (Trixie) [deployment-charts] - https://gerrit.wikimedia.org/r/1285859 (https://phabricator.wikimedia.org/T425308) (owner: Eevans)
[16:23:29] <wikibugs>	 (CR) JMeybohm: [C:+2] Add TLS for the ratelimit namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1285858 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[16:23:56] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/echostore: apply
[16:24:16] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/echostore: apply
[16:25:03] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/echostore: apply
[16:25:20] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/echostore: apply
[16:25:22] <wikibugs>	 SRE, SRE-Access-Requests: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11908952 (Dzahn) Hi @KartikMistry we need to do an "out-of-band verification" that this is really your new key.  Could you maybe drop a file in some home directory on a production server that confirms it?    A...
[16:25:39] <wikibugs>	 SRE, SRE-Access-Requests: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11908953 (Dzahn) a:KartikMistry
[16:26:10] <wikibugs>	 SRE, SRE-Access-Requests: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11908956 (Dzahn) Open→In progress
[16:26:47] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11908963 (Dzahn) a:KFrancis
[16:27:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11908964 (Dzahn) In progress→Stalled waiting for NDA signing to be completed (in linked task)
[16:27:20] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281506|Disable FlaggedRevs on wikinews (T423577)]] (duration: 06m 54s)
[16:27:23] <stashbot>	 T423577: Undeploy FlaggedRevs from Wikinews and drop FlaggedRevs tables - https://phabricator.wikimedia.org/T423577
[16:27:49] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11908967 (Dzahn) a:AnnieKim_WMDE
[16:28:22] <wikibugs>	 (PS1) Eevans: echostore: Upgrade (prod) to Kask v1.0.19 (Trixie) [deployment-charts] - https://gerrit.wikimedia.org/r/1285861 (https://phabricator.wikimedia.org/T425308)
[16:28:24] <icinga-wm>	 PROBLEM - Host ps1-b3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[16:28:32] <icinga-wm>	 PROBLEM - Host lsw1-b3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:30:34] <wikibugs>	 (CR) Dzahn: "a +1 from traffic would be nice - it's just about a sanity check that the IPs are what is in netbox though" [dns] - https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: Dzahn)
[16:31:05] <wikibugs>	 (CR) Dzahn: "@dduvall I think you said we probably won't need this. Should I abandon?" [puppet] - https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[16:31:10] <wikibugs>	 (CR) Eevans: [C:+2] echostore: Upgrade (prod) to Kask v1.0.19 (Trixie) [deployment-charts] - https://gerrit.wikimedia.org/r/1285861 (https://phabricator.wikimedia.org/T425308) (owner: Eevans)
[16:31:18] <wikibugs>	 (CR) Ssingh: "Yes that's my bad, I just didn't get to it. Sorry. I will get to it today." [dns] - https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: Dzahn)
[16:31:28] <wikibugs>	 (Merged) jenkins-bot: Add TLS for the ratelimit namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1285858 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[16:31:37] <wikibugs>	 (CR) Dzahn: "waiting for releng to check what other things (software) could be affected by this change" [puppet] - https://gerrit.wikimedia.org/r/1282395 (https://phabricator.wikimedia.org/T240266) (owner: Dzahn)
[16:32:10] <wikibugs>	 (CR) Dzahn: "let's schedule the switch-over" [puppet] - https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109) (owner: Dzahn)
[16:32:34] <icinga-wm>	 RECOVERY - Host ps1-b3-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.28 ms
[16:32:34] <icinga-wm>	 RECOVERY - Host lsw1-b3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.99 ms
[16:34:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909017 (VRiley-WMF) @ssingh I have almost gotten it all the way through. However, it doesn't seem to take the reimage. It's seemingly is getting stuck at the raid. I tried to log...
[16:34:22] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909018 (VRiley-WMF) In progress→Open
[16:34:22] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:36:18] <wikibugs>	 (CR) Clare Ming: [C:+2] Test Kitchen UI: Deploy v1.3.3 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1285345 (owner: Santiago Faci)
[16:36:38] <icinga-wm>	 PROBLEM - Host lsw1-b5-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:36:47] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[16:36:56] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[16:37:12] <wikibugs>	 (PS4) Dzahn: codesearch: create script/timer to delete zombie lock files [puppet] - https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147)
[16:37:16] <icinga-wm>	 PROBLEM - Host ps1-b5-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[16:37:29] <wikibugs>	 (CR) Dzahn: "Done! changed the find command to "\( -name "index.lock" -o -name "shallow.lock" \)]"." [puppet] - https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: Dzahn)
[16:37:35] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[16:37:48] <wikibugs>	 (CR) Clare Ming: [C:+2] Test Kitchen UI: Deploy v1.3.3 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1285346 (https://phabricator.wikimedia.org/T424958) (owner: Santiago Faci)
[16:37:53] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[16:37:55] <wikibugs>	 (PS3) HakanIST: Remove MinervaNightMode config after skin cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T415930)
[16:38:33] <wikibugs>	 (CR) Dzahn: "@ssingh@wikimedia.org would also like to deploy this one some time" [puppet] - https://gerrit.wikimedia.org/r/1215329 (owner: Dzahn)
[16:38:48] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[16:39:02] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[16:39:06] <wikibugs>	 (Merged) jenkins-bot: echostore: Upgrade (prod) to Kask v1.0.19 (Trixie) [deployment-charts] - https://gerrit.wikimedia.org/r/1285861 (https://phabricator.wikimedia.org/T425308) (owner: Eevans)
[16:39:09] <wikibugs>	 (Merged) jenkins-bot: Test Kitchen UI: Deploy v1.3.3 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1285345 (owner: Santiago Faci)
[16:39:12] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[16:39:19] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[16:39:37] <wikibugs>	 (CR) Dzahn: [C:+1] "and this :)" [puppet] - https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: Ssingh)
[16:40:11] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply
[16:40:13] <wikibugs>	 (Merged) jenkins-bot: Test Kitchen UI: Deploy v1.3.3 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1285346 (https://phabricator.wikimedia.org/T424958) (owner: Santiago Faci)
[16:40:14] <wikibugs>	 (CR) Dzahn: "thank you:) it has all week" [dns] - https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: Dzahn)
[16:41:18] <logmsgbot>	 !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply
[16:41:21] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909067 (ssingh) >>! In T421421#11909017, @VRiley-WMF wrote: > @ssingh I have almost gotten it all the way through. However, it doesn't seem to take the reimage. It's seemingly is...
[16:41:34] <icinga-wm>	 RECOVERY - Host ps1-b5-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.76 ms
[16:41:40] <icinga-wm>	 RECOVERY - Host lsw1-b5-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.04 ms
[16:41:40] <logmsgbot>	 !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply
[16:44:44] <icinga-wm>	 PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:45:58] <icinga-wm>	 PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[16:47:23] <wikibugs>	 ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11909107 (RobH) > Please Check. The port seems Up right now After replacing it. We found a 2m MTP in your rack. DO100 number mtp fiber.    on switch ` et-0/0/50                  Core: cr2-drmrs:et-0/0/2 {#D0103} em0...
[16:48:04] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[16:49:28] <icinga-wm>	 RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.10 ms
[16:49:36] <icinga-wm>	 RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.01 ms
[16:49:59] <wikibugs>	 ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11909125 (cmooney) a:RobH→None It still shows no light incoming to asw1-b12-drmrs on lane 3: ` cmooney@asw1-b12-drmrs> show interfaces diagnostics optics xe-0/0/50:2 | match "Laser receiver power" | match dB...
[16:50:19] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[16:51:11] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply
[16:51:48] <wikibugs>	 (PS1) Jdlrobson: Exclude sitesupport from button/icon treatment, remove manual styling [skins/Vector] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285864 (https://phabricator.wikimedia.org/T425721)
[16:53:04] <jinxer-wm>	 RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[16:53:48] <wikibugs>	 (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1285866 (https://phabricator.wikimedia.org/T128546)
[16:56:55] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1700)
[17:00:05] <jouncebot>	 ryankemper: gettimeofday() says it's time for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T1700)
[17:00:16] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1047.eqiad.wmnet with reason: Maintenance
[17:00:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1047 (T419961)', diff saved to https://phabricator.wikimedia.org/P92460 and previous config saved to /var/cache/conftool/dbconfig/20260511-170024-fceratto.json
[17:03:44] <wikibugs>	 (PS1) Elukey: sre.hosts.reimage: use ADMIN for redfish when reimaging Supermicro hosts [cookbooks] - https://gerrit.wikimedia.org/r/1285868
[17:05:00] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1006.eqiad.wmnet with OS trixie
[17:05:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops, observability, Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11909228 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host kafka-logging1006.eqiad.wmnet with OS tr...
[17:06:23] <wikibugs>	 (PS1) Jgreen: Remove deprecated /etc/icinga/objects/nsca_frack.cfg [puppet] - https://gerrit.wikimedia.org/r/1285870 (https://phabricator.wikimedia.org/T425424)
[17:06:52] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply
[17:07:01] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply
[17:07:09] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply
[17:07:17] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply
[17:07:28] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply
[17:07:33] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply
[17:07:40] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047 (T419961)', diff saved to https://phabricator.wikimedia.org/P92461 and previous config saved to /var/cache/conftool/dbconfig/20260511-170739-fceratto.json
[17:11:04] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.263.0" for 2 host(s)
[17:12:38] <wikibugs>	 (CR) Dduvall: "Sounds good to me. It's not needed for Zuul migration." [puppet] - https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[17:12:55] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.263.0" completed for 2 hosts
[17:14:59] <wikibugs>	 (Abandoned) Dzahn: gerrit: allow zuul machines to port 22 ssh [puppet] - https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[17:15:25] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye
[17:15:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909290 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye
[17:17:44] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11909294 (KFrancis) Hi all, the NDA is complete! Thanks!
[17:17:48] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047', diff saved to https://phabricator.wikimedia.org/P92462 and previous config saved to /var/cache/conftool/dbconfig/20260511-171747-fceratto.json
[17:17:59] <wikibugs>	 (PS3) Andrew Bogott: wmfkeystonehooks: double-check that we're adding a new user to a project [puppet] - https://gerrit.wikimedia.org/r/1285854 (https://phabricator.wikimedia.org/T379550)
[17:25:08] <jinxer-wm>	 FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -8d 3h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[17:25:53] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[17:27:56] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047', diff saved to https://phabricator.wikimedia.org/P92463 and previous config saved to /var/cache/conftool/dbconfig/20260511-172756-fceratto.json
[17:29:43] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1268] - vriley@cumin1003"
[17:29:50] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1268] - vriley@cumin1003"
[17:29:50] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:34:03] <wikibugs>	 ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11909396 (RobH) Ok, swapped the cr optic and it fixed it.  Followups on ticket: * snap a photo of the defective optic with serial for me to process a repair/return * clarify if the 2M fiber they call temp is temp due to be...
[17:34:39] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1268
[17:35:54] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1268
[17:36:18] <wikibugs>	 (PS1) Eevans: echostore: add missing restbase nodes [deployment-charts] - https://gerrit.wikimedia.org/r/1285874
[17:36:37] <wikibugs>	 ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11909412 (cmooney) All looks good with the link, traffic flowing again.  {F80931115 width=600}  Light good either side: ` cmooney@asw1-b12-drmrs> show interfaces diagnostics optics et-0/0/50 | except "warn|alarm"...
[17:38:05] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1047 (T419961)', diff saved to https://phabricator.wikimedia.org/P92464 and previous config saved to /var/cache/conftool/dbconfig/20260511-173804-fceratto.json
[17:38:25] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1268.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:40:05] <wikibugs>	 (CR) Eevans: [C:+2] echostore: add missing restbase nodes [deployment-charts] - https://gerrit.wikimedia.org/r/1285874 (owner: Eevans)
[17:42:30] <wikibugs>	 (Merged) jenkins-bot: echostore: add missing restbase nodes [deployment-charts] - https://gerrit.wikimedia.org/r/1285874 (owner: Eevans)
[17:43:16] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply
[17:44:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:45:58] <logmsgbot>	 vriley@cumin1003 provision (PID 2669614) is awaiting input
[17:47:32] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[17:52:15] <logmsgbot>	 !log sukhe@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1017.eqiad.wmnet with OS bullseye
[17:52:23] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909506 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL*...
[17:53:12] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye
[17:53:19] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909511 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye
[17:55:23] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1268.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:56:14] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1268.eqiad.wmnet with OS bookworm
[17:56:25] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909515 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1268.eqiad.wmnet with OS bookworm
[17:56:33] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1268.eqiad.wmnet with OS bookworm
[17:56:41] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909517 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1268.eqiad.wmnet with OS bookworm executed with errors: - db1268 (**F...
[18:00:53] <logmsgbot>	 vriley@cumin1003 reimage (PID 2684286) is awaiting input
[18:07:10] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1268.eqiad.wmnet with OS bookworm
[18:07:19] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909565 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1268.eqiad.wmnet with OS bookworm
[18:11:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops, observability, Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11909574 (elukey) weird error while doing pxe:  ` >>Checking Media Presence...... >>Media Present...... >>Start PXE over IPv4 on MAC: 90-5A-08-A4-D1...
[18:12:03] <ottomata>	 !log roll restarting eventgate-main to pick up changes for T423583
[18:12:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:06] <stashbot>	 T423583: mediawiki.page_change.v1 event - Add revision revert details - https://phabricator.wikimedia.org/T423583
[18:12:10] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1006.eqiad.wmnet with OS trixie
[18:12:19] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: sync
[18:12:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops, observability, Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11909578 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host kafka-logging1006.eqiad.wmnet with OS trixie...
[18:12:22] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync
[18:12:29] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[18:12:54] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[18:13:25] <jinxer-wm>	 FIRING: [25x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:13:37] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
[18:13:58] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
[18:18:47] <wikibugs>	 (CR) SBassett: Enable CSPUseReportURIDirective in Wikimedia production [mediawiki-config] - https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058) (owner: SBassett)
[18:18:54] <wikibugs>	 (CR) SBassett: [V:+1 C:+1] Enable CSPUseReportURIDirective in Wikimedia production [mediawiki-config] - https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058) (owner: SBassett)
[18:22:31] <wikibugs>	 (CR) Ssingh: [C:+1] add load balancer IPs for gitlab to geo DNS [dns] - https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: Dzahn)
[18:25:23] <wikibugs>	 (CR) Ssingh: "Thanks for the ping @dzahn@wikimedia.org. I see various +1s from the folks but no clear indication on if we have verified any recent match" [puppet] - https://gerrit.wikimedia.org/r/1215329 (owner: Dzahn)
[18:25:44] <logmsgbot>	 !log sukhe@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1017.eqiad.wmnet with OS bullseye
[18:25:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909629 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL*...
[18:26:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909630 (ssingh) @VRiley-WMF: We may need to check this host; I can't seem to get it to come back up after a reboot (checked twice). Is there something else missing here? Perhaps...
[18:31:26] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11909651 (cmooney) >>! In T424683#11906733, @cmooney wrote: >>>! In T424683#11885878, @ayounsi wrote: >> Nice! >>  >> We ca...
[18:36:07] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909670 (VRiley-WMF)
[18:36:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909673 (VRiley-WMF) checking, stand by
[18:42:33] <wikibugs>	 (CR) Ssingh: [V:+1] "Yeah this has been in the backlog for a while. I was hoping for some buy-in from the frontline-defense group, so I will try again." [puppet] - https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: Ssingh)
[18:44:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909724 (VRiley-WMF) Yes, it's getting stuck at the same spot I was getting stuck at. It looks like it's looking for a specific RAID.
[18:47:35] <wikibugs>	 (PS13) CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948)
[18:47:54] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[18:49:09] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1268.eqiad.wmnet with reason: host reimage
[18:50:45] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[18:54:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:54:42] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1268.eqiad.wmnet with reason: host reimage
[18:55:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:56:19] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye
[18:56:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye
[18:57:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:00:01] <wikibugs>	 (PS5) Cathal Mooney: gnmic: add subscriptions to openconfig subinterface path [puppet] - https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683)
[19:00:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:00:56] <wikibugs>	 (CR) Cathal Mooney: "Thanks yep I missed the reply.  Good call, I'd been testing against the interfaces with actual sub-ints, but didn't realise we'd get all t" [puppet] - https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683) (owner: Cathal Mooney)
[19:01:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:04:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909801 (Jclark-ctr) @ssingh you are booting with UEFI?   the YAML file need to be updated for lvs1017  -partman/standard-efi.cfg -partman/raid1-2dev-efi.cfg
[19:05:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:06:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:07:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909810 (ssingh) >>! In T421421#11909801, @Jclark-ctr wrote: > @ssingh you are booting with UEFI? >  >  the YAML file need to be updated for lvs1017 >  > -partman/standard-efi.cfg...
[19:10:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:11:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:11:12] <inflatador>	 !log bking@archiva1002 `sudo rm -rfv /var/cache/archiva/temp* && sudo systemctl restart archiva`. to free up disk space
[19:11:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:12:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909839 (ssingh) ` Forced UEFI HTTP Boot for next reboot Resetting chassis power status for lvs1017 to ForceRestart Host rebooted via Redfish `
[19:12:39] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[19:14:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:14:30] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:14:34] <wikibugs>	 (CR) Dzahn: [C:+2] add load balancer IPs for gitlab to geo DNS [dns] - https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: Dzahn)
[19:14:56] <logmsgbot>	 !log dzahn@dns1005 START - running authdns-update
[19:15:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:15:45] <logmsgbot>	 vriley@cumin1003 reimage (PID 2684286) is awaiting input
[19:16:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:16:43] <logmsgbot>	 !log dzahn@dns1005 END - running authdns-update
[19:16:45] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[19:16:46] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1268.eqiad.wmnet with OS bookworm
[19:16:57] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909845 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1268.eqiad.wmnet with OS bookworm completed: - db1268 (**PASS**)   -...
[19:17:27] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909846 (VRiley-WMF)
[19:18:47] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[19:19:06] <wikibugs>	 (CR) Dzahn: "which tool would you use for that?" [puppet] - https://gerrit.wikimedia.org/r/1215329 (owner: Dzahn)
[19:19:24] <wikibugs>	 (CR) Dzahn: [C:+1] "ah! cool, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: Ssingh)
[19:20:30] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:21:12] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11909858 (Dzahn) a:KFrancis→None
[19:21:50] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11909859 (Dzahn) a:Dzahn
[19:22:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:22:36] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1269] - vriley@cumin1003"
[19:22:42] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1269] - vriley@cumin1003"
[19:22:42] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:23:24] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1269
[19:24:36] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1269
[19:25:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:25:12] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1269.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:26:19] <wikibugs>	 (03PS1) 10Dzahn: admin: add Catherine Kelsey of WMDE as ldap_only user [puppet] - 10https://gerrit.wikimedia.org/r/1285890 (https://phabricator.wikimedia.org/T425566)
[19:26:23] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11909867 (10Dzahn) Thanks Katie!  @catherine.kelsey.wmde Now we just need approval from one of the WMDE managers listed at https://wikitech.wikimedia.org/wi...
[19:26:54] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11909871 (10Dzahn) NDA is completed.  Please get one of the WMDE managers to approve (https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_re...
[19:27:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:27:58] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11909873 (10Dzahn)
[19:29:25] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11909880 (10Dzahn) Could you also provide an example of what tasks or tools this is actually intended for (for that open checkbox from the template...
[19:29:47] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11909884 (10Dzahn)
[19:30:04] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "needs WMDE manager approval" [puppet] - 10https://gerrit.wikimedia.org/r/1285890 (https://phabricator.wikimedia.org/T425566) (owner: 10Dzahn)
[19:30:30] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:31:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:34:13] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11909905 (10Dzahn) a:05Dzahn→03catherine.kelsey.wmde
[19:35:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:35:58] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[19:36:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access for bvibber [puppet] - 10https://gerrit.wikimedia.org/r/1285893
[19:36:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:37:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove access for bvibber [puppet] - 10https://gerrit.wikimedia.org/r/1285893 (owner: 10Muehlenhoff)
[19:39:08] <inflatador>	 !log [bking@cumin2002] ~$ sudo cumin 'A:wdqs-main and A:codfw' 'systemctl restart wdqs-blazegraph' <- restart after banning scraper
[19:39:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:09] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bvibber out of all services on: 2453 hosts
[19:43:39] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1269.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:44:32] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1017.eqiad.wmnet with OS bullseye
[19:44:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL*...
[19:54:22] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Start reading from new file tables on all small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285853 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe)
[19:55:18] <wikibugs>	 (03Merged) 10jenkins-bot: Start reading from new file tables on all small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285853 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe)
[19:55:44] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1285853|Start reading from new file tables on all small and medium wikis (T416548)]]
[19:55:48] <stashbot>	 T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[19:57:27] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1285853|Start reading from new file tables on all small and medium wikis (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:58:32] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with deployment
[20:00:00] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1269.eqiad.wmnet with OS bookworm
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T2000). nyaa~
[20:00:05] <jouncebot>	 Sergi0 and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:09] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11909993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1269.eqiad.wmnet with OS bookworm
[20:02:41] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285853|Start reading from new file tables on all small and medium wikis (T416548)]] (duration: 06m 57s)
[20:02:46] <stashbot>	 T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[20:05:15] <jan_drewniak>	 I'm late for the backport window, I might make it later in the hour, in about 30min.
[20:05:31] <jan_drewniak>	 My patch can wait for me until then
[20:12:35] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11910009 (10VRiley-WMF)
[20:15:36] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1269.eqiad.wmnet with reason: host reimage
[20:16:41] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:34] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1269.eqiad.wmnet with reason: host reimage
[20:20:21] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11910031 (10VRiley-WMF)
[20:23:31] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[20:28:41] <jan_drewniak>	 ok back at my computer, going to deploy the portal donor patch now.
[20:30:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285866 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[20:30:58] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[20:31:52] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285866 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[20:32:06] <logmsgbot>	 !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1285866|Bumping portals to master (T128546)]]
[20:32:09] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[20:32:58] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6016 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[20:33:48] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak: Backport for [[gerrit:1285866|Bumping portals to master (T128546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:36:06] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[20:37:50] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak: Continuing with deployment
[20:39:12] <logmsgbot>	 vriley@cumin1003 reimage (PID 2773592) is awaiting input
[20:39:33] <wikibugs>	 (03PS1) 10Alex.sanford: Enforce 2FA requirements for phase 2 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285905 (https://phabricator.wikimedia.org/T423119)
[20:41:49] <wikibugs>	 (03PS1) 10Eevans: echostore: refactored egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285906
[20:41:57] <logmsgbot>	 !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285866|Bumping portals to master (T128546)]] (duration: 09m 51s)
[20:42:00] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[20:42:13] <wikibugs>	 (03PS1) 10Jdlrobson: Skin: Correct thumbnail class [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285907 (https://phabricator.wikimedia.org/T424910)
[20:45:19] <wikibugs>	 (03CR) 10Eevans: [C:03+2] echostore: refactored egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285906 (owner: 10Eevans)
[20:46:03] <cjming>	 i might have something to deploy in this current window - i think the queue is finished?
[20:47:31] <wikibugs>	 (03Merged) 10jenkins-bot: echostore: refactored egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285906 (owner: 10Eevans)
[20:48:14] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285905 (https://phabricator.wikimedia.org/T423119) (owner: 10Alex.sanford)
[20:48:25] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[20:48:26] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1269.eqiad.wmnet with OS bookworm
[20:48:32] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11910142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1269.eqiad.wmnet with OS bookworm completed: - db1269 (**PASS**)   -...
[20:49:49] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[20:53:30] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1270] - vriley@cumin1003"
[20:53:35] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1270] - vriley@cumin1003"
[20:53:35] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:53:40] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply
[20:54:41] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1270.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: That opportune time for a Weekly Security deployment window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T2100).
[21:02:11] <maryum>	 hi! I have a config patch to get out today that I'll use spiderpig for
[21:02:17] <maryum>	 and then a security patch to deploy
[21:02:28] <cjming>	 maryum: can i deploy something after you're done?
[21:02:34] <maryum>	 yes of course
[21:02:41] <cjming>	 cool - thanks!
[21:03:10] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[21:07:39] <maryum>	 running the backport now
[21:07:46] <wikibugs>	 (03PS1) 10Eevans: echostore: rollback to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285910
[21:07:47] <maryum>	 a config backport
[21:07:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058) (owner: 10SBassett)
[21:08:08] <wikibugs>	 (03PS2) 10Eevans: echostore: rollback to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285910
[21:08:50] <wikibugs>	 (03Merged) 10jenkins-bot: Enable CSPUseReportURIDirective in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058) (owner: 10SBassett)
[21:09:07] <logmsgbot>	 !log mstyles@deploy1003 Started scap sync-world: Backport for [[gerrit:1284008|Enable CSPUseReportURIDirective in Wikimedia production (T424058)]]
[21:09:10] <stashbot>	 T424058: Properly set the Reporting-Endpoints header and the report-to directive via MediaWiki's CSP implementation - https://phabricator.wikimedia.org/T424058
[21:10:50] <logmsgbot>	 !log mstyles@deploy1003 sbassett, mstyles: Backport for [[gerrit:1284008|Enable CSPUseReportURIDirective in Wikimedia production (T424058)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:10:58] <wikibugs>	 (03CR) 10Eevans: [C:03+2] echostore: rollback to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285910 (owner: 10Eevans)
[21:11:32] <logmsgbot>	 !log mstyles@deploy1003 sbassett, mstyles: Continuing with deployment
[21:13:04] <wikibugs>	 (03Merged) 10jenkins-bot: echostore: rollback to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285910 (owner: 10Eevans)
[21:14:10] <icinga-wm>	 PROBLEM - SSH on an-druid1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:15:11] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1270.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:15:43] <logmsgbot>	 !log mstyles@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284008|Enable CSPUseReportURIDirective in Wikimedia production (T424058)]] (duration: 06m 36s)
[21:15:46] <stashbot>	 T424058: Properly set the Reporting-Endpoints header and the report-to directive via MediaWiki's CSP implementation - https://phabricator.wikimedia.org/T424058
[21:16:21] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply
[21:16:30] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[21:16:46] <maryum>	 now preparing to deploy the security patch
[21:21:57] <maryum>	 scap is running
[21:25:23] <jinxer-wm>	 FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -8d 7h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[21:28:07] <wikibugs>	 (03PS1) 10Anne Tomasevich: Add ReadingLists Account Creation CTA campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285913 (https://phabricator.wikimedia.org/T422169)
[21:29:44] <maryum>	 !log Deployed security fix for T425406
[21:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:54] <maryum>	 cjming: you can go ahead with your deploy now
[21:30:04] <cjming>	 tysm!
[21:30:29] <wikibugs>	 (03PS1) 10Clare Ming: WikiLambdaApi instrument: update schema [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285916 (https://phabricator.wikimedia.org/T415254)
[21:33:00] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] WikiLambdaApi instrument: update schema [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285916 (https://phabricator.wikimedia.org/T415254) (owner: 10Clare Ming)
[21:35:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285916 (https://phabricator.wikimedia.org/T415254) (owner: 10Clare Ming)
[21:36:00] <icinga-wm>	 RECOVERY - SSH on an-druid1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:36:20] <icinga-wm>	 PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[21:37:20] <icinga-wm>	 RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[21:37:59] <wikibugs>	 (03Merged) 10jenkins-bot: WikiLambdaApi instrument: update schema [extensions/WikiLambda] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285916 (https://phabricator.wikimedia.org/T415254) (owner: 10Clare Ming)
[21:38:00] <maryum>	 cjming let me know when you're done. I might need to roll back the security patch I just deployed
[21:38:27] <cjming>	 maryum: will do - should be done momentarily
[21:39:01] <cjming>	 uh oh
[21:39:12] <cjming>	 mine just err'd out
[21:40:25] <cjming>	 not sure what to do in this situation - it merged but https://spiderpig.wikimedia.org/jobs/1957
[21:41:30] <cjming>	 should i revert? retry?
[21:41:51] <cjming>	 it's a super minor change
[21:42:28] <rzl>	 cjming: not an expert but that looks like it's failing because of the uncommitted file in /srv/patches -- needs maryum's attention maybe
[21:42:40] <maryum>	 oh that's my bad, committing that now
[21:43:01] <cjming>	 gtk
[21:44:09] <maryum>	 cjming just committed
[21:44:29] <cjming>	 thx - i guess i will retry
[21:44:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:45:14] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1285916|WikiLambdaApi instrument: update schema (T415254)]]
[21:45:18] <stashbot>	 T415254: Migrate "WikiLambda API" instrument to use the Test Kitchen SDK - https://phabricator.wikimedia.org/T415254
[21:47:00] <logmsgbot>	 !log cjming@deploy1003 cjming: Backport for [[gerrit:1285916|WikiLambdaApi instrument: update schema (T415254)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:47:15] <jinxer-wm>	 FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 21.37% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:47:32] <logmsgbot>	 !log cjming@deploy1003 cjming: Continuing with deployment
[21:49:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int releases routed via main (k8s) 935.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:51:40] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285916|WikiLambdaApi instrument: update schema (T415254)]] (duration: 06m 26s)
[21:51:44] <stashbot>	 T415254: Migrate "WikiLambda API" instrument to use the Test Kitchen SDK - https://phabricator.wikimedia.org/T415254
[21:51:55] <cjming>	 maryum: back to you - all yours
[21:52:03] <maryum>	 cjming thanks!
[21:52:15] <jinxer-wm>	 RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 20.23% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:54:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int releases routed via main (k8s) 837.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:58:02] <wikibugs>	 (03CR) 10Bking: "The PCC failure is expected on cirrussearch2070, as it is a Bullseye node and Bullseye nodes are blocked from installing atop (for good re" [puppet] - 10https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking)
[22:03:18] <wikibugs>	 (03PS4) 10Ecarg: Wikifunctions: add helm values for function-evaluator in Rust [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627)
[22:06:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:11:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:13:40] <jinxer-wm>	 FIRING: [25x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:23:53] <wikibugs>	 (03CR) 10Dzahn: "can not compile this for some reason: https://puppet-compiler.wmflabs.org/output/1285488/8539/" [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn)
[22:25:43] <wikibugs>	 (03CR) 10Dzahn: [V:04-1] "nevermind, just typo in the hostname. here it goes:  https://puppet-compiler.wmflabs.org/output/1285488/8540/codesearch9.codesearch.eqiad1" [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn)
[22:32:40] <wikibugs>	 (03PS5) 10Dzahn: codesearch: create script/timer to delete zombie lock files [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147)
[22:44:35] <wikibugs>	 (03PS6) 10Dzahn: codesearch: create script/timer to delete zombie lock files [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147)
[22:45:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] codesearch: create script/timer to delete zombie lock files [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn)
[22:45:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.122) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:47:14] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1285488/8542/codesearch9.codesearch.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn)
[22:47:54] <wikibugs>	 (03PS7) 10Dzahn: codesearch: create script/timer to delete zombie lock files [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147)
[22:51:00] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:51:15] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1285488/8542/codesearch9.codesearch.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn)
[22:58:31] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[23:00:04] <jouncebot>	 Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260511T2300)
[23:00:25] <Jdlrobson>	 Here. Let me know if there are any reasons not to use the readers deploy window.
[23:03:17] <wikibugs>	 (03PS1) 10Dduvall: zuul: Set mode of SSH private key to 0400 [puppet] - 10https://gerrit.wikimedia.org/r/1285923
[23:05:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/Vector] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285464 (https://phabricator.wikimedia.org/T424571) (owner: 10Jdlrobson)
[23:10:13] <wikibugs>	 (03PS4) 10Jdlrobson: Remove MinervaNightMode config after skin cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T415930) (owner: 10HakanIST)
[23:10:28] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] "(Note: I48a7c82bdad0e2697bea175e7a04846e5a8b2cf0 needs to be merged first and in production before we can backport this)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T415930) (owner: 10HakanIST)
[23:10:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove MinervaNightMode config after skin cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T415930) (owner: 10HakanIST)
[23:15:16] <wikibugs>	 (03PS1) 10Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112)
[23:15:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Presto memory tuning, resource groups [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic)
[23:17:58] <wikibugs>	 (03Merged) 10jenkins-bot: Add support for icons in toolbox [skins/Vector] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285464 (https://phabricator.wikimedia.org/T424571) (owner: 10Jdlrobson)
[23:18:16] <logmsgbot>	 !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1285464|Add support for icons in toolbox (T424571)]]
[23:18:20] <stashbot>	 T424571: Temporary watchstar  status not reflected in dropdown: Add icon support for toolbox in Vector 2022 - https://phabricator.wikimedia.org/T424571
[23:19:56] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1285464|Add support for icons in toolbox (T424571)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:20:34] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment
[23:21:32] <wikibugs>	 (03PS2) 10Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112)
[23:24:45] <logmsgbot>	 !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285464|Add support for icons in toolbox (T424571)]] (duration: 06m 29s)
[23:24:48] <stashbot>	 T424571: Temporary watchstar  status not reflected in dropdown: Add icon support for toolbox in Vector 2022 - https://phabricator.wikimedia.org/T424571
[23:25:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/Vector] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285864 (https://phabricator.wikimedia.org/T425721) (owner: Jdlrobson)
[23:38:35] <wikibugs>	 (Merged) jenkins-bot: Exclude sitesupport from button/icon treatment, remove manual styling [skins/Vector] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285864 (https://phabricator.wikimedia.org/T425721) (owner: Jdlrobson)
[23:38:51] <logmsgbot>	 !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1285864|Exclude sitesupport from button/icon treatment, remove manual styling (T425721)]]
[23:38:54] <stashbot>	 T425721: Revert the header donate button back to a normal link - https://phabricator.wikimedia.org/T425721
[23:39:48] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1285927
[23:39:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1285927 (owner: TrainBranchBot)
[23:40:32] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1285864|Exclude sitesupport from button/icon treatment, remove manual styling (T425721)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:41:05] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment
[23:43:42] <wikibugs>	 (PS7) Jdlrobson: Limit and standardize thumbnail options [mediawiki-config] - https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152)
[23:45:13] <logmsgbot>	 !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285864|Exclude sitesupport from button/icon treatment, remove manual styling (T425721)]] (duration: 06m 21s)
[23:45:16] <stashbot>	 T425721: Revert the header donate button back to a normal link - https://phabricator.wikimedia.org/T425721
[23:45:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285907 (https://phabricator.wikimedia.org/T424910) (owner: Jdlrobson)
[23:50:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.122) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[23:52:16] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1285927 (owner: TrainBranchBot)
[23:55:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.122) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[23:59:52] <wikibugs>	 (Merged) jenkins-bot: Skin: Correct thumbnail class [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1285907 (https://phabricator.wikimedia.org/T424910) (owner: Jdlrobson)