[00:14:27] (open) lucaswerkmeister: package: add toolforge- prefix to more files [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/5 (https://phabricator.wikimedia.org/T399238)
[00:16:04] cloud-services-team, Toolforge, Patch-For-Review: Missing bash completion for `become` - https://phabricator.wikimedia.org/T399238#10993855 (LucasWerkmeister) >>! In T399238#10993703, @bd808 wrote: > My theory would be that `debian/misctools.bash-completion` and `debian/misctools.lintian-overrides` n...
[00:30:28] cloud-services-team, Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#10993869 (bd808)
[00:47:34] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993884 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1037.eqiad.wm...
[00:54:10] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [00:54:38] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0) [00:59:58] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [01:09:58] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [01:31:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-54 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [01:41:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [01:59:28] PROBLEM - SSH on cloudcephosd1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:02:58] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 
has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [02:03:50] PROBLEM - Host cloudcephosd1035 is DOWN: PING CRITICAL - Packet loss = 100% [02:08:18] RECOVERY - SSH on cloudcephosd1035 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:08:20] RECOVERY - Host cloudcephosd1035 is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [02:22:58] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [02:26:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [03:16:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [03:21:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, 
and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:26:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:31:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:54:40] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993983 (Andrew) Open→Resolved I upgraded the firmware on all of these. My attempts to get them to bookworm at the s...
[04:20:12] PROBLEM - SSH on cloudcephosd1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:29:10] FIRING: CephSlowOps: Ceph cluster in eqiad has 30 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[04:29:20] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 30 slow ops - https://phabricator.wikimedia.org/T399255 (phaultfinder) NEW
[05:02:02] RECOVERY - SSH on cloudcephosd1036 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:04:15] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 30 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[05:28:22] PROBLEM - SSH on cloudcephosd1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:31:12] RECOVERY - SSH on cloudcephosd1037 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:53:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[06:54:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[06:58:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[06:59:28] RESOLVED: 
TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[07:06:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:11:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:18:42] PROBLEM - SSH on cloudcephosd1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:18:42] FIRING: CephSlowOps: Ceph cluster in eqiad has 1451 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[07:18:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1451 slow ops - https://phabricator.wikimedia.org/T399260 (phaultfinder) NEW
[07:18:56] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[07:19:28] FIRING: [2x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [07:19:28] FIRING: [3x] InstanceDown: Project tools instance tools-elastic-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [07:19:32] FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [07:19:39] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [07:21:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 779 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [07:23:53] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudgw1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:23:55] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [07:24:28] RESOLVED: [2x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [07:24:28] RESOLVED: [3x] InstanceDown: Project tools instance tools-elastic-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [07:24:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [07:24:39] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[07:27:39] FIRING: CephSlowOps: Ceph cluster in eqiad has 908 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[07:27:46] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 908 slow ops - https://phabricator.wikimedia.org/T399262 (phaultfinder) NEW
[07:30:12] RECOVERY - SSH on cloudcephosd1037 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:32:39] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 908 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[08:09:56] Cloud-VPS (Project-requests): Request creation of Clipi VPS project - https://phabricator.wikimedia.org/T399237#10994288 (Aklapper) Open→Stalled @IhsannKhan: Hi and welcome! Please fill out the Purpose. Please also provide a description of the project and not a description of yourself. Thanks.
[08:10:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 1678 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [08:13:13] PROBLEM - SSH on cloudcephosd1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:14:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-77 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:18:05] FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [08:18:05] FIRING: [2x] InstanceDown: Project cloudinfra instance cloudinfra-db04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:18:05] FIRING: [2x] InstanceDown: Project gitlab-runners instance runner-1033 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:18:05] FIRING: WidespreadInstanceDown: Widespread instances down in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [08:18:05] FIRING: InstanceDown: Project toolsbeta instance toolsbeta-puppetdb-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:18:28] FIRING: InstanceDown: Project cvn instance cvn-app10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:18:28] FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [08:18:32] FIRING: InstanceDown: Project metricsinfra instance metricsinfra-thanos-fe-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:18:35] FIRING: TargetDown: Job mariadb is unreachable in project cloudinfra instance cloudinfra-db04 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [08:18:58] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:19:28] RESOLVED: [10x] InstanceDown: Project tools instance tools-elastic-4 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:20:10] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 2284 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [08:22:16] RESOLVED: [3x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [08:22:28] RESOLVED: [3x] InstanceDown: Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:22:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project gitlab-runners - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [08:22:28] RESOLVED: [4x] InstanceDown: Project gitlab-runners instance runner-1033 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:22:32] RESOLVED: [5x] InstanceDown: Project cloudinfra instance cloudinfra-db03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:23:28] RESOLVED: InstanceDown: Project cvn instance cvn-app10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:23:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [08:23:28] RESOLVED: InstanceDown: Project metricsinfra instance metricsinfra-thanos-fe-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:23:35] RESOLVED: TargetDown: Job mariadb is unreachable in project cloudinfra instance cloudinfra-db04 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [08:23:58] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:24:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1007. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:33:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cloudgw1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:44:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:45:56] FIRING: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:50:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[08:58:58] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:07:14] Toolforge (Toolforge iteration 21): 2026-07-11 Toolforge tools not responding, proxy issue - https://phabricator.wikimedia.org/T399281 (fnegri) NEW
[09:07:21] Toolforge (Toolforge iteration 21): 2026-07-11 Toolforge tools not responding, proxy issue - https://phabricator.wikimedia.org/T399281#10994561 (fnegri) Open→In progress p:Triage→High
[09:09:39] Toolforge (Toolforge iteration 21): 2026-07-11 Toolforge tools not responding, proxy issue - https://phabricator.wikimedia.org/T399281#10994565 (fnegri)
[09:10:34] Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding, proxy issue - https://phabricator.wikimedia.org/T399281#10994572 (Hamishcn)
[09:15:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:18:53] Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding, proxy issue - https://phabricator.wikimedia.org/T399281#10994596 (fnegri) There was some Cloud VPS issue just before we started seeing problem with Toolforge, with many hosts reported down: {F63887066}
[09:19:28] Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding, proxy issue - https://phabricator.wikimedia.org/T399281#10994600 (fnegri)
[09:20:15] Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding, proxy issue - https://phabricator.wikimedia.org/T399281#10994601 (fnegri)
[09:20:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:21:05] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[09:23:22] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding, proxy issue - https://phabricator.wikimedia.org/T399281#10994608 (fnegri)
[09:25:07] !log fnegri@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-77, tools-k8s-worker-nfs-68, tools-k8s-worker-nfs-37
[09:25:56] RESOLVED: [2x] SystemdUnitDown: The service unit 
designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [09:26:05] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [09:28:02] RECOVERY - SSH on cloudcephosd1036 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:30:56] FIRING: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [09:35:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:39:29] PROBLEM - SSH on cloudcephosd1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:40:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 1 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[09:40:16] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1 slow ops - https://phabricator.wikimedia.org/T399284 (phaultfinder) NEW
[09:41:40] !log fnegri@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-77, tools-k8s-worker-nfs-68, tools-k8s-worker-nfs-37
[09:42:19] RECOVERY - SSH on cloudcephosd1035 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:45:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[09:45:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:46:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[09:46:58] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10994719 (fnegri)
[09:50:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:58:38] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10994751 (fnegri)
[10:00:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[10:05:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[10:15:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[10:20:13] PROBLEM - SSH on cloudcephosd1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:20:32] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10994779 (fnegri) Ceph OSD latencies show noticeable increases, so I suspect Ceph was the root cause of the outage: {F63892689}
[10:20:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[10:21:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 847 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[10:21:19] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 847 slow ops - https://phabricator.wikimedia.org/T399287 (phaultfinder) NEW
[10:22:03] FIRING: ProbeDown: Service tools-proxy-9:443 has failed probes (http_toolforge_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#tools-proxy-9:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:23:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[10:24:09] RECOVERY - SSH on cloudcephosd1036 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:24:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[10:26:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1386 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[10:27:44] RESOLVED: [3x] ProbeDown: Service api.svc.toolforge.org:443 has failed probes (http_api_svc_toolforge_org_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:27:44] PROBLEM - SSH on cloudcephosd1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:28:03] RECOVERY - SSH on cloudcephosd1036 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:28:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 5134 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[10:28:17] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 5134 slow ops - https://phabricator.wikimedia.org/T399288 (phaultfinder) NEW
[10:28:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[10:29:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[10:30:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:31:05] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [10:31:58] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10994812 (10fnegri) Full output of `ceph health detail`: ` fnegri@cloudcephmon1004:~$ sudo ceph health detail HEALTH_WARN noout flag(s) set; Sl... [10:33:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1272 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [10:35:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:36:10] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10994835 (10fnegri) There was another spike in Ceph OSD latency, with many Cloud VPS vms reporting down. Things look back to normal now. {F6389... 
[10:45:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:50:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:51:23] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10994886 (10fnegri) [11:00:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:03:06] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10994943 (10cmooney) Hmm not sure what might be the explanation for this. Checking right now from cloudcephmon1004 we seem to have normal ping... [11:04:32] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10994944 (10cmooney) I also don't see drops reported between these hosts in the exported //node_ping_latency// stats: https://grafana.wikimedia... [11:05:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:11:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [11:15:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:16:50] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10994993 (10Curb_Safe_Charmer) Still getting a 504 when trying to access https://refill.toolforge.org/ [11:20:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:26:17] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995011 (10Andrew) I suspect that this was caused by bad Ceph behavior on bookworm; many alerts correspond to hosts that I upgraded yesterday.... 
[11:26:55] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995012 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037.eqiad.wmnet with... [11:30:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:31:28] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995031 (10Andrew) At least three affected nodes show issues with the swap partition, md1: ` [16172.851790] perf: interrupt took too long (251... [11:35:26] PROBLEM - SSH on cloudcephosd1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:35:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:36:58] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [11:39:28] FIRING: [2x] InstanceDown: Project tools instance tools-puppetdb-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:40:28] FIRING: InstanceDown: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:40:28] FIRING: InstanceDown: Project toolsbeta instance toolsbeta-puppetdb-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:40:28] FIRING: InstanceDown: Project gitlab-runners instance runner-1035 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:40:36] FIRING: ToolsNFSDown: No tools nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNFSDown [11:41:01] FIRING: CloudinfraMariaDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCloudinfraMariaDBWritableState [11:41:16] RECOVERY - SSH on cloudcephosd1008 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:41:28] FIRING: InstanceDown: Project cvn instance cvn-apache11 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:41:28] FIRING: InstanceDown: Project cloudinfra instance ntp-03 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:41:32] FIRING: WidespreadInstanceDown: Widespread instances down in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [11:41:58] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [11:44:28] RESOLVED: [10x] InstanceDown: Project tools instance tools-k8s-haproxy-5 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:45:28] RESOLVED: [3x] InstanceDown: Project toolsbeta instance toolsbeta-cumin-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:45:28] RESOLVED: [2x] InstanceDown: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:45:32] RESOLVED: InstanceDown: Project gitlab-runners instance runner-1035 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:45:36] RESOLVED: ToolsNFSDown: No tools nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNFSDown [11:45:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:46:01] RESOLVED: CloudinfraMariaDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCloudinfraMariaDBWritableState [11:46:28] RESOLVED: InstanceDown: Project cvn instance cvn-apache11 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:46:28] RESOLVED: [2x] InstanceDown: Project cloudinfra instance cloudinfra-db03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:46:32] RESOLVED: WidespreadInstanceDown: Widespread instances down in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [11:46:58] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [11:50:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:59:09] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995086 (10cmooney) >>! In T399281#10995047, @Andrew wrote: > cloudcephosd1036 has lost ssh connectivity several times during the night, only o... [12:00:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:05:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:13:30] PROBLEM - SSH on cloudcephosd1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:14:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 367 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [12:15:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:16:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [12:17:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [12:17:37] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 367 slow ops - https://phabricator.wikimedia.org/T399299 (10phaultfinder) 03NEW [12:17:49] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995112 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1037.eqiad.wmnet with OS b... [12:18:27] FIRING: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown [12:18:28] FIRING: [4x] InstanceDown: Project toolsbeta instance toolsbeta-nfs-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:18:28] FIRING: [4x] InstanceDown: Project tools instance tools-db-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:18:28] FIRING: [3x] InstanceDown: Project cloudinfra instance cloudinfra-idp-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:18:35] FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [12:18:39] FIRING: TargetDown: Job toolsdb-mariadb is unreachable in project tools instance tools-db-6 - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [12:20:20] RECOVERY - SSH on cloudcephosd1035 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:20:28] FIRING: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [12:20:42] FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown [12:20:51] (03Restored) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [labs/tools/wikinity] - 10https://gerrit.wikimedia.org/r/954748 (owner: 10Hashar) [12:20:55] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037.eqiad.wmnet with... [12:20:56] (03PS3) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [labs/tools/wikinity] - 10https://gerrit.wikimedia.org/r/954748 [12:20:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:21:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [12:22:07] (03CR) 10CI reject: [V:04-1] Jenkins job validation (DO NOT SUBMIT) [labs/tools/wikinity] - 10https://gerrit.wikimedia.org/r/954748 (owner: 10Hashar) [12:22:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [12:23:27] RESOLVED: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown [12:23:28] RESOLVED: [4x] InstanceDown: Project cloudinfra instance cloudinfra-db04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:23:28] RESOLVED: [8x] InstanceDown: Project tools instance tools-db-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:23:32] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [12:23:35] RESOLVED: [4x] InstanceDown: Project toolsbeta instance toolsbeta-nfs-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:23:39] RESOLVED: TargetDown: Job toolsdb-mariadb is unreachable in project tools instance tools-db-6 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [12:24:06] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [labs/tools/map-of-monuments] - 10https://gerrit.wikimedia.org/r/954749 (owner: 10Hashar) [12:24:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 2061 slow ops - 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [12:25:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [12:25:42] RESOLVED: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown [12:26:31] FIRING: ToolsToolsDBReplicationMissing: ToolsDB replication is not running on tools-db-4 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing [12:26:31] FIRING: ToolsToolsDBReplicationError: ToolsDB replication is broken on tools-db-6 (errno 1595) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError [12:28:28] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-21 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [12:30:26] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995162 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1049.eq... 
[12:30:28] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995163 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eq... [12:30:31] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eq... [12:30:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:35:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:43:28] RESOLVED: [3x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-21 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [12:44:24] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995212 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1037.eqiad.wmnet with OS b... [12:45:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:50:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:56:25] (03CR) 10Hashar: "recheck after having added pkg-config I2d8328d6c59ec94efb6e3d5f881b7f35d2a26da6" [labs/tools/wikinity] - 10https://gerrit.wikimedia.org/r/954748 (owner: 10Hashar) [13:00:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:05:19] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [labs/tools/wikinity] - 10https://gerrit.wikimedia.org/r/954748 (owner: 10Hashar) [13:05:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:10:30] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995338 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.... 
[13:11:31] RESOLVED: ToolsToolsDBReplicationMissing: ToolsDB replication is not running on tools-db-4 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing [13:11:44] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host cloudcephosd1037.eqiad.wmnet with... [13:12:15] (03CR) 10Jforrester: [C:03+2] build: Updating mediawiki/mediawiki-phan-config to 0.16.0 [labs/tools/coverme] - 10https://gerrit.wikimedia.org/r/1167983 (owner: 10Libraryupgrader) [13:14:16] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995347 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1049.eqiad.... [13:15:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:16:31] RESOLVED: ToolsToolsDBReplicationError: ToolsDB replication is broken on tools-db-6 (errno 1595) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError [13:20:41] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995381 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.... [13:20:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:26:50] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995418 (10Lucas_Werkmeister_WMDE) Could the Ceph issues affect the Beta cluster as well (T399303)? I don’t know enough about Ceph to understan... [13:29:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-54 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [13:30:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:33:11] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995431 (10Andrew) >>! In T399281#10995418, @Lucas_Werkmeister_WMDE wrote: > Could the Ceph issues affect the Beta cluster as well (T399303)? I... [13:35:00] FIRING: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [13:35:15] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21): 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995438 (10Lucas_Werkmeister_WMDE) [13:35:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:39:05] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-54 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [13:41:48] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host cloudcephosd103... [13:45:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:48:32] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995466 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host cloudcephos... [13:48:45] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995467 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host cloudcephosd103... 
[13:50:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:55:25] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281#10995475 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host cloudcephos... [14:00:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:05:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:10:00] FIRING: [2x] OpenstackAPIResponse: Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [14:11:28] FIRING: [3x] InstanceDown: Project tools instance tools-k8s-worker-nfs-13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:12:28] FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [14:12:28] FIRING: [5x] InstanceDown: Project cloudinfra instance cloudinfra-idp-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:12:28] FIRING: InstanceDown: Project extdist instance extdist-06 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:13:01] FIRING: CloudinfraMariaDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCloudinfraMariaDBWritableState [14:13:28] FIRING: InstanceDown: Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:15:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:16:28] FIRING: [9x] InstanceDown: Project tools instance tools-db-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:17:28] RESOLVED: InstanceDown: Project extdist instance extdist-06 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:17:28] FIRING: [8x] InstanceDown: Project cloudinfra instance cloudinfra-db04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:17:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [14:17:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [14:18:01] RESOLVED: CloudinfraMariaDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCloudinfraMariaDBWritableState [14:18:28] RESOLVED: InstanceDown: Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:19:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 987 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [14:19:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-test-k8s-worker-12 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:19:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-61 on project tools - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:20:31] FIRING: ToolsToolsDBReplicationMissing: ToolsDB replication is not running on tools-db-4 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing [14:20:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:21:28] FIRING: [13x] InstanceDown: Project tools instance tools-db-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:24:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1414 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [14:24:28] FIRING: InstanceDown: Project gitlab-runners instance runner-1038 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:24:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-61 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:25:00] FIRING: [3x] OpenstackAPIResponse: Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [14:25:01] FIRING: CloudinfraMariaDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCloudinfraMariaDBWritableState [14:25:28] FIRING: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [14:25:58] FIRING: [2x] InstanceDown: Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:26:28] FIRING: [20x] InstanceDown: Project tools instance tools-bastion-13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:26:28] FIRING: InstanceDown: Project cvn instance cvn-app10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:27:28] FIRING: [7x] InstanceDown: Project cloudinfra instance cloudinfra-db04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:29:28] FIRING: InstanceDown: Project metricsinfra instance metricsinfra-controller-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:30:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:30:58] FIRING: [5x] InstanceDown: Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:31:28] FIRING: InstanceDown: Project extdist instance extdist-06 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:31:31] FIRING: ToolsNFSDown: No tools nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNFSDown [14:31:43] FIRING: [33x] InstanceDown: Project tools instance tools-bastion-13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:32:07] FIRING: HarborComponentDown: A Harbor component is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [14:32:28] FIRING: WidespreadInstanceDown: Widespread instances down in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [14:32:43] FIRING: [10x] InstanceDown: Project cloudinfra instance cloudinfra-db04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:33:02] FIRING: [11x] InstanceDown: Project cloudinfra instance cloudinfra-db03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:33:02] FIRING: [35x] InstanceDown: Project tools instance tools-bastion-13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:33:48] FIRING: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:34:28] RESOLVED: InstanceDown: Project gitlab-runners instance runner-1038 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:34:28] FIRING: [2x] InstanceDown: Project metricsinfra instance metricsinfra-controller-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:35:43] RESOLVED: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [14:35:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:35:58] FIRING: [4x] InstanceDown: Project toolsbeta instance toolsbeta-harbor-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:36:28] RESOLVED: InstanceDown: Project extdist instance extdist-06 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:36:28] RESOLVED: InstanceDown: Project cvn instance cvn-app10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:36:31] RESOLVED: ToolsNFSDown: No tools nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNFSDown [14:36:43] FIRING: [32x] InstanceDown: Project tools instance tools-bastion-13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:37:07] RESOLVED: HarborComponentDown: A Harbor component is down. 
#page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [14:37:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [14:37:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [14:37:43] FIRING: [10x] InstanceDown: Project cloudinfra instance cloudinfra-db03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:38:48] RESOLVED: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:39:28] FIRING: TargetDown: Job mariadb is unreachable in project cloudinfra instance cloudinfra-db03 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [14:40:58] FIRING: [6x] InstanceDown: Project toolsbeta instance toolsbeta-harbor-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:41:28] FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [14:41:28] FIRING: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [14:41:43] FIRING: [36x] InstanceDown: Project tools instance tools-bastion-13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:42:43] FIRING: [11x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:45:00] FIRING: [4x] 
OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [14:45:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:47:43] FIRING: [11x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:49:43] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-3 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:50:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:51:43] FIRING: [37x] InstanceDown: Project tools instance tools-bastion-13 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:52:43] FIRING: [12x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:54:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-cumin-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:55:58] FIRING: [5x] InstanceDown: Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:56:43] FIRING: [21x] InstanceDown: Project tools instance tools-elastic-4 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:58:02] FIRING: [4x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-3 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:59:28] FIRING: [7x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-cumin-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:59:43] FIRING: [6x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-3 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:00:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:01:43] FIRING: [22x] InstanceDown: Project tools instance tools-elastic-4 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:03:02] FIRING: [20x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:04:28] FIRING: [8x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:04:43] FIRING: [21x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:05:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:06:43] FIRING: [23x] InstanceDown: Project tools instance tools-elastic-4 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:07:43] FIRING: [7x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:08:02] FIRING: [31x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:08:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [15:09:28] FIRING: [12x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:09:29] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0) [15:09:43] FIRING: [35x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:11:43] FIRING: [24x] InstanceDown: Project tools instance tools-elastic-4 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:13:02] FIRING: [41x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:13:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds [15:13:17] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.upgrade_osds (exit_code=99) [15:13:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds [15:13:34] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.upgrade_osds (exit_code=99) 
[15:14:28] FIRING: [13x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:14:43] FIRING: [42x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:15:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:15:58] RESOLVED: [5x] InstanceDown: Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:16:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:16:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:16:43] FIRING: [24x] InstanceDown: Project tools instance tools-elastic-4 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:17:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [15:17:31] FIRING: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-6 is lagging behind the primary, the current lag is 1h 3m 14s - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh [15:17:43] RESOLVED: [7x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:17:55] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [15:18:02] RESOLVED: CloudinfraMariaDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCloudinfraMariaDBWritableState [15:18:02] FIRING: [47x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:19:28] RESOLVED: InstanceDown: Project metricsinfra instance metricsinfra-grafana-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:28] FIRING: [15x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:19:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance project-proxy-puppetserver-1 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:19:32] FIRING: PuppetAgentNoResources: No Puppet resources found on instance quarry-bastion on project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:19:36] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance cvn-app10 on project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:19:39] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance bastion on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:19:43] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance runner-1032 on project gitlab-runners - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:19:47] FIRING: PuppetAgentNoResources: No Puppet resources found on instance metricsinfra-puppetserver-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:19:51] RESOLVED: TargetDown: Job mariadb is unreachable in project cloudinfra instance cloudinfra-db03 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [15:19:55] FIRING: [52x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:20:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance etcd-discovery-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:20:31] RESOLVED: ToolsToolsDBReplicationMissing: ToolsDB replication is not running on tools-db-4 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing [15:20:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:22:31] RESOLVED: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-6 is lagging behind the primary, the current lag is 1h 2m 33s - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh [15:23:02] FIRING: [70x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:23:02] RESOLVED: [24x] InstanceDown: Project tools instance tools-elastic-4 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:23:31] FIRING: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-6 is lagging behind the primary, the current lag is 1h 9m 19s - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh [15:24:28] FIRING: [18x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:24:28] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance cvn-app10 on project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:24:28] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance k8s-nfs on project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:24:32] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance metricsinfra-puppetserver-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:24:36] FIRING: [7x] 
PuppetAgentNoResources: No Puppet resources found on instance runner-1031 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:24:43] FIRING: [70x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:24:59] FIRING: JobsEmailerNoEmails: No emails sent in the last hour - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerNoEmails - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerNoEmails [15:25:00] FIRING: [6x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [15:25:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance etcd-discovery-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:28:09] PROBLEM - SSH on cloudcephosd1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:28:09] FIRING: [70x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:28:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 1992 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [15:28:16] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1992 slow ops - https://phabricator.wikimedia.org/T399315 (10phaultfinder) 03NEW [15:28:19] RECOVERY - SSH on cloudcephosd1035 is OK: 
SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:29:28] FIRING: [18x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:29:28] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance maps-proxy-5 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:29:28] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance k8s-nfs on project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:29:32] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance bastion on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:29:36] FIRING: [8x] PuppetAgentNoResources: No Puppet resources found on instance runner-1031 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:29:43] FIRING: [70x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:29:58] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance cvn-app10 on project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:30:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:33:02] FIRING: [70x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:33:31] RESOLVED: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-6 is lagging behind the primary, the current lag is 1h 17m 17s - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh [15:34:28] FIRING: [18x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:34:28] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance maps-proxy-5 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:34:28] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance k8s-nfs on project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:34:32] FIRING: [7x] PuppetAgentNoResources: No Puppet resources found on instance runner-1031 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:34:36] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance bastion on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:35:00] RESOLVED: JobsEmailerNoEmails: No emails sent in the last hour - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerNoEmails - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerNoEmails [15:35:56] FIRING: [2x] SystemdUnitDown: The service unit 
designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:38:02] FIRING: [53x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:38:10] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [15:39:28] FIRING: [17x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-cumin-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:39:28] RESOLVED: [3x] PuppetAgentNoResources: No Puppet resources found on instance k8s-nfs on project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:39:28] FIRING: [7x] PuppetAgentNoResources: No Puppet resources found on instance runner-1031 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:39:43] FIRING: [49x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:40:41] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 10Toolforge (Toolforge iteration 21): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#10995893 (10fnegri) [15:43:02] FIRING: [40x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:44:28] FIRING: [12x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-cumin-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:44:28] RESOLVED: [3x] PuppetAgentNoResources: No Puppet resources found on instance maps-proxy-5 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:44:28] RESOLVED: [5x] PuppetAgentNoResources: No Puppet resources found on instance runner-1031 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:44:43] FIRING: [40x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:44:58] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance cvn-nfs-1 on project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:45:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:48:02] FIRING: [39x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:49:43] FIRING: [39x] PuppetAgentNoResources: No Puppet resources found on instance abogott-nstesting on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:50:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:58:02] RESOLVED: [14x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-control-7 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:59:28] RESOLVED: [6x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-proxy-7 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[16:00:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:05:44] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Toolforge (Toolforge iteration 21): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#10996027 (Andrew)
[16:05:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:07:23] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Toolforge (Toolforge iteration 21): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#10996030 (Andrew)
[16:10:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for service: project,designate
[16:11:02] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99) on deployment eqiad1 for service: project,designate
[16:15:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:17:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[16:20:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:22:26] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Toolforge (Toolforge iteration 21): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#10996078 (fnegri)
[16:22:57] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Toolforge (Toolforge iteration 21): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#10996084 (fnegri) a:fnegri→Andrew
[16:27:04] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-106 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[16:32:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-106 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[16:35:30] RESOLVED: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[16:37:48] PROBLEM - SSH on cloudcephosd1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:37:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[16:37:48] FIRING: CephSlowOps: Ceph cluster in eqiad has 1574 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[16:37:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[16:37:48] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1574 slow ops - https://phabricator.wikimedia.org/T399328 (phaultfinder) NEW
[16:41:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-36 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:45:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1007.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:46:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-36 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:04:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:14:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:15:05] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-55
[17:19:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:21:04] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-55
[17:28:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[17:29:30] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[17:32:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 4 slow ops -
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[17:47:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[17:57:05] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[18:14:15] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166497 (owner: Jacob4code)
[18:15:06] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1163075 (owner: Jacob4code)
[18:16:30] (CR) Eugene233: [C:+1] Search results for actors include description [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166497 (owner: Jacob4code)
[18:17:32] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166450 (owner: Jacob4code)
[18:18:30] (CR) Eugene233: [C:+1] Cannot read properties of undefined fixed Detail about the collaboration WIP [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1163075 (owner: Jacob4code)
[18:19:07] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[18:19:07] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[18:19:15] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166450 (owner: Jacob4code)
[18:19:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[18:19:25] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[18:20:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[18:20:32] (CR) Eugene233: [C:+2] Cannot read properties of undefined fixed Detail about the collaboration WIP [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1163075 (owner: Jacob4code)
[18:21:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[18:21:20] (Merged) jenkins-bot: Cannot read properties of undefined fixed Detail about the collaboration WIP [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1163075 (owner: Jacob4code)
[18:22:42] (CR) Eugene233: "Looks good. Just a few recommendations." [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166450 (owner: Jacob4code)
[18:23:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[18:24:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[18:25:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[18:25:56] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[21:12:06] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Toolforge (Toolforge iteration 21): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#10996648 (bd808)
[21:18:56] (CR) BryanDavis: "Cause of T399216. The `hieradata/common/profile/*` files are not loaded by any codepath for a Cloud VPS instance as far as I can tell."
[labs/private] - https://gerrit.wikimedia.org/r/1155221 (https://phabricator.wikimedia.org/T397841) (owner: Kamila Součková)
[22:52:10] (CR) Jacob4code: [C:+1] Search results for actors include description [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166497 (owner: Jacob4code)
[22:54:08] (CR) Eugene233: [C:+2] Search results for actors include description [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166497 (owner: Jacob4code)
[22:54:14] (CR) CI reject: [V:-1] Search results for actors include description [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166497 (owner: Jacob4code)
[22:54:45] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166497 (owner: Jacob4code)
[23:08:23] (CR) Essa237: [C:+1] "please try to resolve the conflicts you have and add a commit message that describes what you are working on" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166497 (owner: Jacob4code)
[23:10:53] (CR) Essa237: "please try to resolve your add do requested changes" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166450 (owner: Jacob4code)
[23:18:34] (PS3) Jacob4code: Updated README Elaborate steps to run tool on local setup [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166450
[23:20:33] (CR) Jacob4code: "All resolved" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166450 (owner: Jacob4code)
[23:53:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-42 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses