[02:12:44] (PS1) Dzahn: secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938)
[02:15:43] (CR) Dzahn: [V:+2 C:+2] secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[02:15:51] (PS2) Dzahn: secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938)
[02:15:55] (CR) Dzahn: [V:+2] secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[03:21:37] (PS1) Andrew Bogott: Comment back in cinder ldap passwords [labs/private] - https://gerrit.wikimedia.org/r/1161116
[03:22:02] (CR) Andrew Bogott: [V:+2 C:+2] Comment back in cinder ldap passwords [labs/private] - https://gerrit.wikimedia.org/r/1161116 (owner: Andrew Bogott)
[04:25:22] FIRING: [12x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:26:17] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:26:22] FIRING: [2x] HAProxyServiceUnavailable: HAProxy service neutron-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[04:26:32] cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T397390 (phaultfinder) NEW
[04:27:17] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:28:35] PROBLEM - nova-compute proc minimum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:29:35] RECOVERY - nova-compute proc minimum on cloudvirt1071 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:30:03] PROBLEM - nova-compute proc minimum on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:30:17] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:30:56] FIRING: SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:31:03] RECOVERY - nova-compute proc minimum on cloudvirt1073 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:31:17] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:33:37] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:33:37] PROBLEM - nova-compute proc minimum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:34:03] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:34:37] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:34:37] RECOVERY - nova-compute proc minimum on cloudvirt1067 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:35:03] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:36:23] PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:37:23] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:40:17] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services
[04:45:56] FIRING: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:48:07] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services
[04:50:56] RESOLVED: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:55:52] RESOLVED: [24x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:57:52] RESOLVED: [9x] HAProxyServiceUnavailable: HAProxy service heat-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[06:52:49] VPS-project-Phabricator, collaboration-services: Requesting manual activation of phabricator.wmcloud.org accounts - https://phabricator.wikimedia.org/T397280#10930493 (A_smart_kitten) Thank you @dzahn! All seems to work okay :)
[07:49:14] cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T397390#10930715 (dcaro) @Andrew This seems fixed now, though it happened during your working hours I think and I see maybe it's related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1161148 ?
[07:49:37] cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T397390#10930716 (dcaro) p:Triage→Low
[07:53:26] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10930735 (dcaro) This is still happening, it seems to be timing out when waiting for the reverse DNS cleanup: ` Jun 19 0...
[07:58:08] cloud-services-team: SystemdUnitDown The systemd unit backup_cinder_volumes.service on node cloudbackup1001-dev has been failing for more than two hours. - https://phabricator.wikimedia.org/T397105#10930752 (dcaro) Open→Resolved a:dcaro
[07:58:54] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10930755 (dcaro) It seems it has been flapping very often lately (https://grafana-rw.wikimedia.org/d/ebJoA6VWz/wmcs-opens...
[08:05:43] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10930852 (dcaro) Got specially choppy in the last couple of days: {F62387966}
[08:05:54] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10930864 (dcaro) Cleaned up all the existing VMs, trying to get a clean run
[08:16:10] wikitech.wikimedia.org: Wikitech double redirect bot needs new SUL OAuth credentials after Wikitech authn changes - https://phabricator.wikimedia.org/T376224#10930958 (taavi) Open→Resolved a:taavi
[08:16:30] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10930961 (dcaro) Hmm.... the fullstack logs did successfully remove the VM, but the DNS records are still there for a dif...
[08:28:12] (open) dcaro: README: add dev notes about authentication [repos/cloud/cloud-vps/horizon/deploy] (support_podman) - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/6
[08:28:27] (update) dcaro: README: add dev notes about authentication [repos/cloud/cloud-vps/horizon/deploy] (support_podman) - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/6
[08:29:18] (update) dcaro: makefile: support podman [repos/cloud/cloud-vps/horizon/deploy] (use_markdown) - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/5
[08:29:22] (update) dcaro: makefile: support podman [repos/cloud/cloud-vps/horizon/deploy] (use_markdown) - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/5
[08:29:38] (update) dcaro: README: add dev notes about authentication [repos/cloud/cloud-vps/horizon/deploy] (support_podman) - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/6
[08:29:41] (update) dcaro: README: use makrdown for nice presentation in gitlab [repos/cloud/cloud-vps/horizon/deploy] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/4
[08:31:28] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10931044 (dcaro) Open→Resolved a:dcaro I cleaned up all the VMs, and ran the `wmcs-dnsleaks --delpoyment...
[08:33:12] cloud-services-team, Horizon, Patch-For-Review: Horizon proxy tab Edit buttons not working - https://phabricator.wikimedia.org/T397272#10931063 (dcaro) Open→In progress p:Triage→Medium a:dcaro
[08:33:28] cloud-services-team (FY2024/2025-Q3-Q4), Horizon, Patch-For-Review: Horizon proxy tab Edit buttons not working - https://phabricator.wikimedia.org/T397272#10931068 (dcaro)
[08:33:42] cloud-services-team: SystemdUnitDown The systemd unit backup_cinder_volumes.service on node cloudbackup1002-dev has been failing for more than two hours. - https://phabricator.wikimedia.org/T397100#10931082 (dcaro) Open→Resolved a:dcaro This is fixed now
[08:35:44] cloud-services-team: KernelErrors Server cloudcephosd1024 logged kernel errors - https://phabricator.wikimedia.org/T396937#10931091 (dcaro) Open→Resolved a:dcaro Expected: ` root@cloudcephosd1024:~# journalctl -k -p err -- Journal begins at Sat 2025-06-14 21:29:07 UTC, ends at Thu 2025-06-19...
[08:36:01] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T396934#10931097 (dcaro) Open→Resolved a:dcaro Cleaned up and restarted, and now it's working.
[08:37:56] cloud-services-team: KernelErrors Server cloudcephosd1015 logged kernel errors - https://phabricator.wikimedia.org/T396796#10931107 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS...
[08:38:05] cloud-services-team: KernelErrors Server cloudcephosd1016 logged kernel errors - https://phabricator.wikimedia.org/T396801#10931111 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS...
[08:38:10] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T396810#10931115 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS...
[08:38:16] cloud-services-team: KernelErrors Server cloudcephosd1017 logged kernel errors - https://phabricator.wikimedia.org/T396832#10931120 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS...
[08:38:25] cloud-services-team: KernelErrors Server cloudcephosd1018 logged kernel errors - https://phabricator.wikimedia.org/T396859#10931124 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS...
[08:38:33] cloud-services-team: KernelErrors Server cloudcephosd1019 logged kernel errors - https://phabricator.wikimedia.org/T396909#10931128 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS...
[08:38:39] cloud-services-team: KernelErrors Server cloudcephosd1020 logged kernel errors - https://phabricator.wikimedia.org/T396917#10931132 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS...
[08:38:45] cloud-services-team: KernelErrors Server cloudcephosd1022 logged kernel errors - https://phabricator.wikimedia.org/T396921#10931136 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS...
[08:38:53] cloud-services-team: KernelErrors Server cloudcephosd1023 logged kernel errors - https://phabricator.wikimedia.org/T396929#10931151 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS...
[08:41:06] cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol2010-dev:9100 - https://phabricator.wikimedia.org/T396769#10931156 (dcaro) Open→Resolved a:dcaro Not failing anymore.
[08:43:20] Tool-query-chest, Wikidata, Wikidata Query UI: Use query-chest for short URLs when the w.wiki shortener fails for long queries - https://phabricator.wikimedia.org/T334893#10931160 (jhsoby)
[08:52:56] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, collaboration-services, GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10931174 (dcaro) p:Triage→High
[08:53:02] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, collaboration-services, GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10931176 (dcaro)
[08:53:13] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, collaboration-services, GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10931177 (dcaro) a:Andrew
[08:57:25] !log dcaro@acme toolsbeta-logging START - Cookbook wmcs.vps.create_project for project toolsbeta-logging in eqiad1 (T397339)
[08:57:26] wmbot~dcaro@acme: Unknown project "toolsbeta-logging"
[08:57:27] T397339: Request creation of toolsbeta-logging VPS project - https://phabricator.wikimedia.org/T397339
[08:57:30] !log dcaro@acme toolsbeta-logging END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project toolsbeta-logging in eqiad1 (T397339)
[08:57:31] wmbot~dcaro@acme: Unknown project "toolsbeta-logging"
[08:57:33] !log dcaro@acme toolsbeta-logging START - Cookbook wmcs.vps.create_project for project toolsbeta-logging in eqiad1 (T397339)
[08:57:34] wmbot~dcaro@acme: Unknown project "toolsbeta-logging"
[08:57:44] !log dcaro@acme toolsbeta-logging END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project toolsbeta-logging in eqiad1 (T397339)
[08:57:44] wmbot~dcaro@acme: Unknown project "toolsbeta-logging"
[08:59:24] !log dcaro@cloudcumin1001 toolsbeta-logging START - Cookbook wmcs.vps.create_project for project toolsbeta-logging in eqiad1 (T397339)
[08:59:24] dcaro@cloudcumin1001: Unknown project "toolsbeta-logging"
[09:00:03] (open) group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49: projects: added project toolsbeta-logging [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/249 (https://phabricator.wikimedia.org/T397339)
[09:03:28] (approved) taavi: projects: added project toolsbeta-logging [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/249 (https://phabricator.wikimedia.org/T397339) (owner: group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49)
[09:03:34] dcaro@cloudcumin1001 create_project (PID 3616594) is awaiting input
[09:03:48] (merge) dcaro: projects: added project toolsbeta-logging [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/249 (https://phabricator.wikimedia.org/T397339) (owner: group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49)
[09:06:19] !log dcaro@cloudcumin1001 toolsbeta-logging END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project toolsbeta-logging in eqiad1 (T397339)
[09:06:23] T397339: Request creation of toolsbeta-logging VPS project - https://phabricator.wikimedia.org/T397339
[09:08:28] !log dcaro@cloudcumin1001 toolsbeta-logging START - Cookbook wmcs.vps.create_project for project toolsbeta-logging in eqiad1 (T397339)
[09:09:29] !log dcaro@cloudcumin1001 toolsbeta-logging END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project toolsbeta-logging in eqiad1 (T397339)
[09:09:48] cloud-services-team (FY2024/2025-Q3-Q4), Data-Services, Data-Persistence: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372#10931248 (Marostegui) Thank you!
[09:34:16] !log dcaro@cloudcumin1001 toolsbeta-logging START - Cookbook wmcs.vps.create_project for project toolsbeta-logging in eqiad1 (T397339)
[09:34:19] T397339: Request creation of toolsbeta-logging VPS project - https://phabricator.wikimedia.org/T397339
[09:36:14] !log dcaro@cloudcumin1001 toolsbeta-logging END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for project toolsbeta-logging in eqiad1 (T397339)