[00:06:56] FIRING: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:29:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[01:01:56] RESOLVED: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:24:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:54:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[08:14:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[08:19:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[08:20:57] (approved) dcaro: [typing] use native types where possible [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/50 (owner: raymond-ndibe)
[08:24:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[08:29:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[08:34:06] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11012137 (dcaro) Looking into the ceph crash list, they were marked as acked for some reason (that's why I...
[08:38:41] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11012152 (dcaro) For timing reference, all that happened at ~15:09:21 onwards UTC, the ids: ` root@cloudce...
[08:39:29] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11012153 (dcaro) And all in just one node: ` root@cloudcephosd1006:~# for osd in $(ceph crash ls | grep 20...
[08:48:45] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11012287 (dcaro) Looking at the logs of that node, it had just come up, and the osds failed right away: `...
[08:49:25] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11012301 (dcaro) It had just been reimaged: https://sal.toolforge.org/log/eSEG-pcB8tZ8Ohr0Wi4D
[08:59:25] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11012330 (dcaro) The logs also show errors connecting to the cluster trying to create more crash reports:...
[09:29:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[11:46:16] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11012928 (dcaro) We don't have any more system logs from the incident (the reimages deleted all traces), t...
[12:03:43] Toolforge (Quota-requests): Request increased build quota for toc Toolforge tool - https://phabricator.wikimedia.org/T398780#11012964 (dcaro) Open→Resolved p:Triage→High @Kanashimi awesome :), I'll close this then, if you see that you still need more resources, we can have another look an...
[12:05:19] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11012979 (dcaro) There's some interesting logs, for example, the mon notices some slow pings already at 07...
[12:26:43] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/42
[12:26:44] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/5
[12:28:13] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [labs/tools/weapon-of-mass-description] - https://gerrit.wikimedia.org/r/1170325 (owner: L10n-bot)
[12:28:15] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [labs/tools/commons-mass-description] - https://gerrit.wikimedia.org/r/1170323 (owner: L10n-bot)
[12:28:30] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "auditlogging" project Buster deprecation - https://phabricator.wikimedia.org/T367522#11013041 (Andrew) Yep, you can delete it. We have only one Buster VM still running in the cluster and (I hope) it won't be around for long.
[13:45:56] FIRING: SystemdUnitDown: The service unit prometheus-node-pinger.service is in failed status on host cloudcephosd1010. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1010 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:49:00] (update) dcaro: Draft: dcaro test [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182
[13:50:56] RESOLVED: [5x] SystemdUnitDown: The service unit prometheus-node-pinger.service is in failed status on host cloudcephosd1010. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:53:11] FIRING: [2x] ProjectProxyMainProxyCertificateExpiry: Certificate for proxy on proxy-5 is about to expire (9d 23h 29m 52s to expiration) - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyCertificateExpiry
[14:14:11] VPS-project-Wikistats: Add zghwiktionary to wikistats - https://phabricator.wikimedia.org/T399790#11013419 (Dzahn) once T399684 is resolved
[14:14:22] VPS-project-Wikistats: Add zghwiktionary to wikistats - https://phabricator.wikimedia.org/T399790#11013421 (Dzahn) Open→Stalled
[14:31:49] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820#11013574 (fnegri) > One mon and one OSD are on bookworm (for science), all others are running Bullseye. Look...
[14:45:16] cloud-services-team, Cloud-VPS: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858 (fnegri) NEW
[14:45:32] cloud-services-team, Cloud-VPS: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11013681 (fnegri)
[14:45:36] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11013682 (fnegri)
[14:47:55] cloud-services-team (FY2025/26-Q1), Cloud-VPS: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11013707 (fnegri) p:Triage→High
[14:57:58] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11013757 (dcaro) Sorry for the spam, trying to dump info, I'll try to summarize later. So ceph crashes t...
[15:15:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[15:15:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[15:20:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[15:20:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[15:34:03] cloud-services-team (FY2025/26-Q1), Cloud-VPS: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11013954 (fnegri)
[15:35:56] cloud-services-team (FY2025/26-Q1), Cloud-VPS: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11013958 (fnegri) cloudcephosd1006 was reimaged again to bookworm on 2025-07-16 at 18:21 UTC. The [disk utilization](https://grafana.wikimedia.org/goto/zedLHo8Ng?orgId=1) g...
[15:39:36] cloud-services-team (FY2025/26-Q1), Cloud-VPS: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11013968 (fnegri) Same thing for [node_procs_running](https://grafana.wikimedia.org/goto/Doi_Ho8Ng?orgId=1): {F65060728}
[15:39:44] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#11013970 (cmooney) Resolved→Open @Jclark-ctr as discussed in our call on Tuesday we will be connecting the second SFP port...
[15:50:37] cloud-services-team (FY2025/26-Q1), Cloud-VPS: Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11014080 (fnegri) [[ https://grafana.wikimedia.org/goto/2Y62Do8Ng?orgId=1 | Memory utilization on cloudcephosd1006 ]] is increasing and it looks like it might crash the ser...
[16:05:40] cloud-services-team, Toolforge: Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870 (fnegri) NEW
[16:07:37] cloud-services-team, Toolforge: Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11014174 (fnegri)
[16:07:41] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11014175 (fnegri)
[16:12:01] cloud-services-team, Toolforge, Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11014187 (fnegri)
[16:12:10] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11014188 (fnegri)
[16:12:34] cloud-services-team, Toolforge, SRE-OnFire, Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11014189 (fnegri)
[16:12:40] cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE-OnFire, Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11014190 (fnegri)
[16:26:36] (update) dcaro: Draft: dcaro test [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182
[16:35:27] (update) dcaro: Draft: dcaro test [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182
[16:41:10] (update) dcaro: Draft: dcaro test [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182
[16:44:32] (update) dcaro: runtime: do the diff at the core.models.Job level [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182
[16:53:16] (open) dcaro: cli: only send fields that are set [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/112
[16:58:25] cloud-services-team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#11014409 (cmooney) Regarding the jumbo-frame complication on the plan to move to one link we are arranging to connect a second 25G on each of...
[17:47:07] Tool-archive-externa-links, Wikidata-Gadgets: [Code] Script utilisateur ArchiveExternaLinks avec Internet Archive - https://phabricator.wikimedia.org/T399885 (paulwiki) NEW
[17:59:43] Tool-archive-externa-links, Wikidata-Gadgets: [Code] Script utilisateur ArchiveExternaLinks avec Internet Archive - https://phabricator.wikimedia.org/T399885#11014677 (paulwiki)
[18:02:00] Tool-archive-externa-links, Wikidata-Gadgets: [Code] Script utilisateur ArchiveExternaLinks avec Internet Archive - https://phabricator.wikimedia.org/T399885#11014689 (paulwiki)
[18:07:22] Tool-archive-externa-links, Wikidata-Gadgets: [Documentation] Licence du script utilisateur ArchiveExternaLinks - https://phabricator.wikimedia.org/T399886 (paulwiki) NEW
[18:15:09] Tool-archive-externa-links, Wikidata-Gadgets: Création de tableau de bord - https://phabricator.wikimedia.org/T399889 (paulwiki) NEW
[19:28:04] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/5 (owner: l10n-bot)
[19:28:09] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/5 (owner: l10n-bot)
[20:23:38] Tool-gitlab-content: Support CORS in gitlab-content tool - https://phabricator.wikimedia.org/T397571#11015188 (Iniquity) @bd808 As it turns out, the task already exists :) I also encountered this today. {F65076579}
[20:28:58] Tool-documentation: Curate tool documentation tasks and resources for upcoming Hackathons - https://phabricator.wikimedia.org/T391611#11015191 (TBurmeister) In progress→Resolved I updated or created the following resources that will be useful to newcomers who want to work on tech docs: https://ww...
[21:24:49] Tool-gitlab-content: Support CORS in gitlab-content tool - https://phabricator.wikimedia.org/T397571#11015325 (bd808) p:Triage→Medium I really thought that the Toolforge front proxy already did this for all tools, but I was apparently confusing the Toolforge proxy with the tools-static proxy.
[21:38:00] cloud-services-team, Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#11015358 (bd808) https://developer.hashicorp.com/terraform/plugin/sdkv2/best-practices/detecting-drift might be helpful when wo...
[21:38:11] cloud-services-team, Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#11015360 (bd808)
[23:49:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-21 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess