[00:06:56] FIRING: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:01:56] RESOLVED: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:22:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[03:27:00] RESOLVED: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[06:05:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[06:08:49] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 22), Epic: [KR] WE6.3 Introduce
a sustainability scoring system for the Toolforge platform - https://phabricator.wikimedia.org/T368600#11003145 (komla) WIP for the wiki page is [[ https://meta.wikimedia.org/wiki/User:SSapaty_(WMF...
[06:32:30] (PS3) Samwilson: Add accesskeys to main JSON [labs/tools/extjsonuploader] - https://gerrit.wikimedia.org/r/813359 (https://phabricator.wikimedia.org/T305674)
[07:35:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[07:47:29] Tool-extjsonuploader, Patch-For-Review: Extract and display accesskey information - https://phabricator.wikimedia.org/T305674#11003287 (Samwilson)
[09:26:52] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Moderator-Tools-Team: Swift container endpoints are unavailable - https://phabricator.wikimedia.org/T399481#11003638 (fnegri) a:fnegri
[09:48:07] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Patch-For-Review: nf_conntrack_max is not set at boot in cloudvirts - https://phabricator.wikimedia.org/T399212#11003744 (fnegri) Open→In progress p:Triage→Medium a:fnegri
[10:56:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:27:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) -
https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:30:24] cloud-services-team, Toolforge (Toolforge iteration 22), Epic: [jobs-api,webservice] Run webservices via the jobs framework - https://phabricator.wikimedia.org/T348755#11004121 (Raymond_Ndibe) In progress→Stalled
[11:36:44] Cloud-VPS (Project-requests), Patch-For-Review: Request creation of voterlists VPS project - https://phabricator.wikimedia.org/T399418#11004146 (Raymond_Ndibe) putting on hold for now until this concern is addressed
[11:37:00] Cloud-VPS (Project-requests), Patch-For-Review: Request creation of voterlists VPS project - https://phabricator.wikimedia.org/T399418#11004147 (Raymond_Ndibe) Open→Stalled
[11:47:56] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:09:39] (close) raymond-ndibe: projects: added project voterlists [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/257 (https://phabricator.wikimedia.org/T399418) (owner: group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49)
[12:11:04] !log raymond-ndibe@cloudcumin1001 voterlists END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project voterlists in eqiad1 (T399418)
[12:11:06] raymond-ndibe@cloudcumin1001: Unknown project "voterlists"
[12:11:06] T399418: Request creation of voterlists VPS project - https://phabricator.wikimedia.org/T399418
[12:26:05] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at
least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:31:16] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Moderator-Tools-Team: Swift container endpoints are unavailable - https://phabricator.wikimedia.org/T399481#11004228 (fnegri) @Kgraessle I created new Application Credentials and your code seems to work fine (at least until connection, I have not tr...
[12:53:37] (PS2) Jacob4code: Search results for actors include description [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1168284
[12:55:35] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Moderator-Tools-Team: Swift container endpoints are unavailable - https://phabricator.wikimedia.org/T399481#11004301 (Kgraessle) @fnegri thanks for the update, I was able to see records being uploaded last night, so I'm good to close this out. I thi...
[12:55:37] Toolforge (Quota-requests): Request increased build quota for toc Toolforge tool - https://phabricator.wikimedia.org/T398780#11004302 (Raymond_Ndibe) >>! In T398780#11002620, @Kanashimi wrote: > @Raymond_Ndibe Thank you for the response. Although I didn't specify the number of CPUs, I checked the quota. It s...
[12:58:03] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Moderator-Tools-Team: Swift container endpoints are unavailable - https://phabricator.wikimedia.org/T399481#11004306 (fnegri) Open→Resolved No problem, thanks for confirming it's working again. I'll mark the task as Resolved.
[13:03:04] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1168284 (owner: Jacob4code)
[13:14:51] cloud-services-team, Cloud-VPS, Patch-For-Review: cloudgw: add cloud-private subnet support - https://phabricator.wikimedia.org/T338334#11004360 (fnegri)
[13:14:54] cloud-services-team, Cloud-VPS: [wmcs-backup] Backup snapshots of deleted volumes are never cleaned up - https://phabricator.wikimedia.org/T358774#11004362 (fnegri)
[13:17:51] cloud-services-team, Cloud-VPS, Cumin, Infrastructure-Foundations, Patch-For-Review: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453#11004380 (fnegri) Stalled→Open I clearly failed at making any progress on this task, so...
[13:24:40] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479#11004420 (fnegri) Stalled→Open No one is currently working on this task, so I'll move it back to the backlog but keep it as "High" priority. @SD0001...
[13:24:42] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479#11004423 (fnegri) a:fnegri→None
[13:30:11] cloud-services-team, Cloud-VPS, Upstream: codfw1dev has seen neutron metadata agents down since epoxy upgrade - https://phabricator.wikimedia.org/T395255#11004445 (Andrew) Zigo has built us some new Epoxy neutron packages ( 2:26.0.0-9~bpo12+1 ) which include the upstream fix for this. These packages...
[13:31:37] cloud-services-team, artificial-intelligence: Supporting AI, LLM, and data models on WMCS - https://phabricator.wikimedia.org/T336905#11004450 (Raymond_Ndibe) I found this repo with a list of llms and their licenses https://github.com/eugeneyan/open-llms/blob/main/README.md I don't believe we should wait...
[13:31:59] cloud-services-team, Toolforge: Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479#11004452 (fnegri)
[13:32:16] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 22), Goal, Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#11004454 (fnegri)
[13:33:02] Cloud Services Proposals, cloud-services-team, Data-Services, Data-Persistence, Data-Platform-SRE (2025.07.05 - 2025.07.25): Decision request - Who runs wikireplicas cookbooks - https://phabricator.wikimedia.org/T382607#11004455 (fnegri)
[13:33:33] cloud-services-team, Cloud-VPS, Toolforge, Observability-Alerting, and 2 others: Move WMCS off of Icinga and introduce alertmanager - https://phabricator.wikimedia.org/T328502#11004457 (fnegri)
[13:34:27] cloud-services-team, Cloud-VPS: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338#11004460 (fnegri)
[13:36:51] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS: 2024-09-10: hardware error on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374467#11004471 (fnegri) Open→Resolved This is up & running, I'll mark the task as Resolved.
[13:38:56] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Patch-For-Review: nf_conntrack_max is not set at boot in cloudvirts - https://phabricator.wikimedia.org/T399212#11004481 (fnegri)
[13:38:57] cloud-services-team (FY2025/26-Q1): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#11004482 (fnegri)
[13:40:27] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 22), Epic: [KR] WE6.3 Introduce a sustainability scoring system for the Toolforge platform - https://phabricator.wikimedia.org/T368600#11004496 (fnegri)
[13:41:10] cloud-services-team, Data-Services: [wikireplicas] Create views for new wiki tlwikisource - https://phabricator.wikimedia.org/T388657#11004500 (fnegri)
[13:42:22] cloud-services-team, Toolforge: [toolsdb] Migrate quickstatements db to Trove - https://phabricator.wikimedia.org/T369177#11004501 (fnegri)
[13:42:33] cloud-services-team, Toolforge: [toolsdb] Migrate mixnmatch db to Trove - https://phabricator.wikimedia.org/T350862#11004504 (fnegri)
[13:43:07] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820#11004514 (fnegri) Open→In progress
[13:43:42] cloud-services-team, Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, Epic: [dbaas,toolsdb] Add support for management of toolsdb databases within toolforge - https://phabricator.wikimedia.org/T384591#11004520 (fnegri)
[13:43:50] cloud-services-team (FY2025/26-Q1), Data-Services: [wikireplicas] Gather usage stats - https://phabricator.wikimedia.org/T381587#11004525 (fnegri)
[13:44:01] cloud-services-team, Data-Services, Data-Persistence: [wikireplicas] Route alerts to WMCS team - https://phabricator.wikimedia.org/T381589#11004527 (fnegri)
[13:44:12] Toolforge (Quota-requests): Request increased build quota for toc Toolforge
tool - https://phabricator.wikimedia.org/T398780#11004530 (Raymond_Ndibe) @dcaro I looked at the underlying `resourcequota` for this tool and yes, it seems like tool is already using up the allowed `cpu requests` (see `spec.hard.requ...
[13:44:14] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Toolforge: If the inactive clouddumps host goes down, it causes a ripple effect on Cloud VPS and Toolforge - https://phabricator.wikimedia.org/T391369#11004531 (fnegri)
[13:44:53] Toolforge (Quota-requests): Request increased build quota for toc Toolforge tool - https://phabricator.wikimedia.org/T398780#11004532 (Raymond_Ndibe) a:Raymond_Ndibe
[13:45:16] cloud-services-team (FY2025/26-Q1), Data-Services: [wikireplicas] add proper dry-run/diff mode to maintain-views - https://phabricator.wikimedia.org/T351637#11004534 (fnegri)
[13:45:29] cloud-services-team (FY2025/26-Q1), Data-Services, Patch-For-Review: [wikireplicas] Refactor maintenance scripts to allow local testing - https://phabricator.wikimedia.org/T395266#11004535 (fnegri)
[13:45:36] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11004537 (fnegri)
[13:53:10] FIRING: [2x] ProjectProxyMainProxyCertificateExpiry: Certificate for proxy on proxy-5 is about to expire (11d 23h 29m 52s to expiration) - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyCertificateExpiry
[14:02:26] cloud-services-team (FY2024/2025-Q3-Q4), Horizon: Horizon proxy tab Edit buttons not working - https://phabricator.wikimedia.org/T397272#11004572 (dcaro) In progress→Resolved This is fixed now \o/ (thanks @Andrew !)
[14:02:51] Cloud Services Proposals, cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 22), Cloud-Services-Origin-Team, and 3 others: [builds-api,components-api,webservice,jobs-api] Make Toolforge a proper platform as a service with push-to-d... - https://phabricator.wikimedia.org/T194332#11004577
[14:03:37] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [ceph] export number of bad sectors per-disk - https://phabricator.wikimedia.org/T348716#11004582 (dcaro)
[14:04:32] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned: [metricsinfra] alerts do not get propagated to prod alertmanager - https://phabricator.wikimedia.org/T384200#11004584 (dcaro) In progress→Resolved I think this is fixed a...
[14:05:39] cloud-services-team (FY2025/26-Q1), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, Patch-For-Review: [promethus,haproxy] Move to haproxy internal metrics from haproxy_exporter - https://phabricator.wikimedia.org/T343885#11004587 (dcaro)
[14:28:55] cloud-services-team, Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, Epic: [dbaas] Add DB as a service capabilities to toolforge - https://phabricator.wikimedia.org/T384586#11004659 (dcaro) a:dcaro→None
[14:29:11] cloud-services-team (FY2025/26-Q1), Cloud-VPS (Debian Buster Deprecation), Toolforge (Toolforge iteration 22), Epic, Goal: [infra] Toolforge: migrate to Debian Bullseye or later - https://phabricator.wikimedia.org/T311897#11004666 (dcaro)
[14:29:33] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [cloudceph] Improve downtime when a switch goes down - https://phabricator.wikimedia.org/T375204#11004668 (dcaro)
[14:42:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be
having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:41:26] cloud-services-team, Ceph, Data-Platform-SRE: Proposed improvement: Manage CephX users via exported/collected Puppet resources - https://phabricator.wikimedia.org/T399594 (BTullis) NEW
[16:02:54] cloud-services-team, Cloud-VPS, Continuous-Integration-Infrastructure (Zuul upgrade): http://169.254.169.254/openstack/latest/user_data semi-regularly unavaliable during Magnum Kubernetes cluster builds - https://phabricator.wikimedia.org/T399596#11005203 (bd808) @Andrew has helpfully cleared this is...
[16:26:30] cloud-services-team, Cloud-VPS, Ceph, Data-Platform-SRE: Proposed improvement: Manage CephX users via exported/collected Puppet resources - https://phabricator.wikimedia.org/T399594#11005392 (fnegri)
[17:14:26] Toolforge (Quota-requests): Request increased build quota for toc Toolforge tool - https://phabricator.wikimedia.org/T398780#11005708 (dcaro) I see yes, the limit reached is of the requests, not the actual cpu used. The current usage is way lower than the requested CPUs, I think that this can be sorted out...
[17:56:24] !log bd808@cloudcumin1001 c26d9d326bdf464fa1025939ded7e5a2 START - Cookbook wmcs.openstack.cloudvirt.vm_console
[17:56:24] bd808@cloudcumin1001: Unknown project "c26d9d326bdf464fa1025939ded7e5a2"
[18:02:05] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:06:55] !log bd808@cloudcumin1001 c26d9d326bdf464fa1025939ded7e5a2 END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[18:06:56] bd808@cloudcumin1001: Unknown project "c26d9d326bdf464fa1025939ded7e5a2"
[18:07:19] !log bd808@cloudcumin1001 c26d9d326bdf464fa1025939ded7e5a2 START - Cookbook wmcs.openstack.cloudvirt.vm_console (T399596)
[18:07:19] bd808@cloudcumin1001: Unknown project "c26d9d326bdf464fa1025939ded7e5a2"
[18:07:20] T399596: http://169.254.169.254/openstack/latest/user_data semi-regularly unavaliable during Magnum Kubernetes cluster builds - https://phabricator.wikimedia.org/T399596
[18:07:30] !log bd808@cloudcumin1001 c26d9d326bdf464fa1025939ded7e5a2 END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) (T399596)
[18:07:30] bd808@cloudcumin1001: Unknown project "c26d9d326bdf464fa1025939ded7e5a2"
[18:08:21] !log bd808@cloudcumin1001 zuul START - Cookbook wmcs.openstack.cloudvirt.vm_console
[18:08:23] !log bd808@cloudcumin1001 zuul END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[18:26:12] cloud-services-team, Cloud-VPS, Continuous-Integration-Infrastructure (Zuul upgrade): http://169.254.169.254/openstack/latest/user_data semi-regularly unavaliable during Magnum Kubernetes cluster builds - https://phabricator.wikimedia.org/T399596#11006029 (bd808)
I asked @andrew if leaving the broken...
[19:56:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:41:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:52:51] cloud-services-team, Cloud-VPS, Continuous-Integration-Infrastructure (Zuul upgrade): http://169.254.169.254/openstack/latest/user_data semi-regularly unavaliable during Magnum Kubernetes cluster builds - https://phabricator.wikimedia.org/T399596#11006675 (bd808)
[21:56:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[22:04:25] (PS1) Jacob4code: elimininate shared productions duplicates [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1169783
[23:51:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[23:51:54] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)