[00:01:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[00:06:56] FIRING: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:27:28] 10Toolforge (Quota-requests): Request increased build quota for toc Toolforge tool - https://phabricator.wikimedia.org/T398780#11007417 (10Kanashimi) I try `cpu: 0.25` and it works. Thank you.
[01:01:56] RESOLVED: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:09:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:59:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[05:00:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[05:05:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[05:20:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[06:01:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[06:21:04] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-71 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[06:30:34] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[06:35:34] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[08:24:54] 10cloud-services-team (FY2025/26-Q1): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#11008159 (10fnegri)
[08:26:20] 10cloud-services-team (FY2025/26-Q1): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#11008166 (10fnegri) > Try sending an email in Gmail and verify that the account shows up as "deactivated" This method does not work anymore. I used to see some people with a gray "deactivated" icon...
[08:26:34] 10cloud-services-team (FY2025/26-Q1): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#11008177 (10fnegri) 05In progress→03Resolved
[09:44:04] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1073.eqiad.wmnet}' (T399212)
[09:44:11] T399212: nf_conntrack_max is not set at boot in cloudvirts - https://phabricator.wikimedia.org/T399212
[09:46:46] PROBLEM - Host cloudvirt1073 is DOWN: PING CRITICAL - Packet loss = 100%
[09:47:50] !log fnegri@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1073.eqiad.wmnet}' (T399212)
[09:47:58] RECOVERY - Host cloudvirt1073 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[09:49:00] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: nf_conntrack_max is not set at boot in cloudvirts - https://phabricator.wikimedia.org/T399212#11008413 (10fnegri) Merged the patch above and tested that it works by rebooting cloudvirt1073. Before reboot: ` fnegri@cloudvirt1073:~$ sudo cat /proc/sys/net/nf_c...
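(For context on the T399212 fix discussed above: a minimal sketch of the runtime-plus-persistent sysctl approach, with a placeholder value and drop-in file name; the actual limit and the merged Puppet patch are in the task.)
  # check the current value, as done in the task comment
  $ cat /proc/sys/net/netfilter/nf_conntrack_max
  # raise it immediately, without a reboot (placeholder value)
  $ sudo sysctl -w net.netfilter.nf_conntrack_max=1048576
  # persist it across reboots via a sysctl.d drop-in, then reload all sysctl config
  $ echo 'net.netfilter.nf_conntrack_max = 1048576' | sudo tee /etc/sysctl.d/99-nf-conntrack.conf
  $ sudo sysctl --system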
[09:49:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1073 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[09:52:14] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: nf_conntrack_max is not set at boot in cloudvirts - https://phabricator.wikimedia.org/T399212#11008428 (10fnegri) 05In progress→03Resolved I applied the setting on all other cloudvirts without reboot, by running: ` sudo cumin 'cloudvirt*' 'sysctl --sy...
[09:58:04] 06cloud-services-team, 10Cloud-VPS: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#11008444 (10fnegri) 05In progress→03Resolved The alert stopped firing after I updated the nf_conntrack setting for cloudvirt1067 in T399212#10992507 In {T399212} I...
[09:58:15] 06cloud-services-team, 10Cloud-VPS: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#11008446 (10fnegri)
[10:06:45] 10Toolforge (Toolforge iteration 22): [lima-kilo] foxtrot ldap docker image is using buster and fails to build - https://phabricator.wikimedia.org/T399701 (10dcaro) 03NEW
[10:09:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1073 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[10:20:59] 06cloud-services-team, 10Cloud-VPS: [wmcs-cookbooks] cloudvirt.safe_reboot triggers NeutronAgentDown alert - https://phabricator.wikimedia.org/T399705 (10fnegri) 03NEW
[11:51:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[13:29:44] (03open) 10dcaro: container: use bitnami/openldap [repos/cloud/toolforge/foxtrot-ldap] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/9
[13:36:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[13:43:21] (03update) 10dcaro: container: use bitnami/openldap [repos/cloud/toolforge/foxtrot-ldap] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/9
[13:44:06] 06cloud-services-team, 10Cloud-VPS: Add k8s_admin, k8s_developer, and k8s_viewer roles expected by default Magnum config for Kubernetes auth using Keystone auth - https://phabricator.wikimedia.org/T399488#11009297 (10Andrew) 05Open→03Resolved a:03Andrew ` openstack role list | grep k8s | 70215d932207...
[13:47:56] 06cloud-services-team, 10Toolforge, 07Documentation: Compile the frequently used webpage design snippets for Tools authors - https://phabricator.wikimedia.org/T202949#11009308 (10TBurmeister)
[13:53:11] FIRING: [2x] ProjectProxyMainProxyCertificateExpiry: Certificate for proxy on proxy-5 is about to expire (10d 23h 29m 52s to expiration) - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyCertificateExpiry
[13:53:37] 06cloud-services-team, 10Striker: Make it possible to maintain Toolforge tools via an easy-to-use web interface instead of a command-line one - https://phabricator.wikimedia.org/T332480#11009328 (10TBurmeister)
[13:56:45] (03update) 10dcaro: container: use bitnami/openldap [repos/cloud/toolforge/foxtrot-ldap] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/9
[14:04:13] (03open) 10raymond-ndibe: [typing] use native types where possible [repos/cloud/toolforge/maintain-harbor] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/50
[14:05:19] (03CR) 10Essa237: [C:03+1] eliminate shared productions duplicates [labs/tools/WdTmCollab] - 10https://gerrit.wikimedia.org/r/1169783 (owner: 10Jacob4code)
[14:05:22] (03update) 10dcaro: container: use bitnami/openldap [repos/cloud/toolforge/foxtrot-ldap] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/9
[14:06:07] (03update) 10dcaro: container: use bitnami/openldap [repos/cloud/toolforge/foxtrot-ldap] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/9
[14:06:46] (03update) 10dcaro: container: use bitnami/openldap [repos/cloud/toolforge/foxtrot-ldap] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/9
[14:25:12] (03update) 10dcaro: container: use bitnami/openldap [repos/cloud/toolforge/foxtrot-ldap] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/9
[15:00:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:03:45] (03open) 10dcaro: foxtrot_ldap: fix bug when accounts already exist [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/257
[15:20:04] (03update) 10dcaro: container: use bitnami/openldap [repos/cloud/toolforge/foxtrot-ldap] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/9
[15:20:52] (03update) 10dcaro: foxtrot_ldap: fix bug when accounts already exist [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/257
[15:34:02] 06cloud-services-team, 10Cloud-VPS, 10Continuous-Integration-Infrastructure (Zuul upgrade): Cloud VPS project member (admin role) unable to grant k8s_admin, k8s_developer, k8s_viewer via `openstack role add` - https://phabricator.wikimedia.org/T399731 (10bd808) 03NEW
[15:45:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:59:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[15:59:34] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[16:02:48] (03approved) 10dcaro: container: use bitnami/openldap [repos/cloud/toolforge/foxtrot-ldap] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/9
[16:02:50] (03merge) 10dcaro: container: use bitnami/openldap [repos/cloud/toolforge/foxtrot-ldap] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/9
[16:09:28] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11009980 (10fnegri)
[16:10:12] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1992 slow ops - https://phabricator.wikimedia.org/T399315#11009989 (10fnegri)
[16:10:20] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11009990 (10fnegri)
[16:10:21] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1992 slow ops - https://phabricator.wikimedia.org/T399315#11009991 (10fnegri) 05Open→03Resolved a:03fnegri
[16:10:30] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 987 slow ops - https://phabricator.wikimedia.org/T399309#11009995 (10fnegri)
[16:10:44] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11009996 (10fnegri)
[16:10:46] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 987 slow ops - https://phabricator.wikimedia.org/T399309#11009997 (10fnegri) 05Open→03Resolved a:03fnegri
[16:10:54] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 367 slow ops - https://phabricator.wikimedia.org/T399299#11010000 (10fnegri) 05Open→03Resolved
[16:11:01] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 367 slow ops - https://phabricator.wikimedia.org/T399299#11010001 (10fnegri)
[16:11:09] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11010002 (10fnegri)
[16:11:14] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 5134 slow ops - https://phabricator.wikimedia.org/T399288#11010003 (10fnegri)
[16:11:22] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11010004 (10fnegri)
[16:11:24] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 5134 slow ops - https://phabricator.wikimedia.org/T399288#11010005 (10fnegri) 05Open→03Resolved a:03fnegri
[16:11:44] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 847 slow ops - https://phabricator.wikimedia.org/T399287#11010008 (10fnegri)
[16:11:51] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11010009 (10fnegri)
[16:11:52] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 847 slow ops - https://phabricator.wikimedia.org/T399287#11010010 (10fnegri) 05Open→03Resolved a:03fnegri
[16:11:55] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1 slow ops - https://phabricator.wikimedia.org/T399284#11010013 (10fnegri)
[16:12:02] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1678 slow ops - https://phabricator.wikimedia.org/T399267#11010015 (10fnegri)
[16:12:04] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11010014 (10fnegri)
[16:12:08] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 908 slow ops - https://phabricator.wikimedia.org/T399262#11010017 (10fnegri)
[16:12:10] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11010016 (10fnegri)
[16:12:15] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11010018 (10fnegri)
[16:12:17] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1451 slow ops - https://phabricator.wikimedia.org/T399260#11010019 (10fnegri)
[16:12:24] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11010020 (10fnegri)
[16:12:26] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 30 slow ops - https://phabricator.wikimedia.org/T399255#11010021 (10fnegri)
[16:12:33] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11010022 (10fnegri)
[16:19:40] 06cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T399360#11010060 (10fnegri) 05Open→03Resolved a:03fnegri cloudcephosd1013 had a hard drive failure, see {T399366}. cloudcephosd1036 had a single error message logged on 2025-07-11 15:56 UTC during the outage ({T399281}). Unfortuna...
[16:19:53] 06cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T399360#11010066 (10fnegri)
[16:20:02] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 10Toolforge (Toolforge iteration 22): 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11010067 (10fnegri)
[16:20:25] 06cloud-services-team: ProbeDown - https://phabricator.wikimedia.org/T399189#11010068 (10fnegri) 05Open→03Resolved a:03fnegri
[16:22:38] 06cloud-services-team: ProbeDown - https://phabricator.wikimedia.org/T399189#11010078 (10fnegri) 05Resolved→03Open IPv6 was unavailable for about 1 hour. Reopening as maybe it's worth a quick investigation. {F64824760}
[16:22:53] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 30 slow ops - https://phabricator.wikimedia.org/T399255#11010086 (10fnegri) 05Open→03Resolved a:03fnegri
[16:22:58] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1451 slow ops - https://phabricator.wikimedia.org/T399260#11010090 (10fnegri) 05Open→03Resolved a:03fnegri
[16:23:01] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 908 slow ops - https://phabricator.wikimedia.org/T399262#11010093 (10fnegri) 05Open→03Resolved a:03fnegri
[16:23:07] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1678 slow ops - https://phabricator.wikimedia.org/T399267#11010096 (10fnegri) 05Open→03Resolved a:03fnegri
[16:23:13] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 1 slow ops - https://phabricator.wikimedia.org/T399284#11010099 (10fnegri) 05Open→03Resolved a:03fnegri
[16:23:18] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 847 slow ops - https://phabricator.wikimedia.org/T399287#11010102 (10fnegri) 05Resolved→03Open
[16:23:32] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has 847 slow ops - https://phabricator.wikimedia.org/T399287#11010105 (10fnegri) 05Open→03Resolved
[16:29:26] 06cloud-services-team: ProbeDown - https://phabricator.wikimedia.org/T399189#11010133 (10fnegri) Actually only 5 minutes, the previous graph had a reporting artifact due to using `avg_over_time[1h]`: {F64825930}
[16:32:00] 06cloud-services-team: ProbeDown - https://phabricator.wikimedia.org/T399189#11010151 (10fnegri) 05Open→03Resolved The probe failed 3 times on IPv6 and one time over IPv4 over the past week. {F64826057} I'm resolving for now, if it happens again a new task will be opened and we can investigate more.
[16:41:20] 06cloud-services-team, 10Cloud-VPS, 10Continuous-Integration-Infrastructure (Zuul upgrade): Cloud VPS project member (admin role) unable to grant k8s_admin, k8s_developer, k8s_viewer via `openstack role add` - https://phabricator.wikimedia.org/T399731#11010170 (10bd808) Poking around in Puppet to see the Ope...
[17:08:09] (03update) 10dcaro: foxtrot_ldap: fix bug when accounts already exist [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/257
[17:17:24] 10Toolforge (Toolforge iteration 22): [lima-kilo] foxtrot ldap docker image is using buster and fails to build - https://phabricator.wikimedia.org/T399701#11010293 (10dcaro) 05Open→03Resolved p:05Triage→03High a:03dcaro Merged the foxtrot-ldap upgrade code, tested with lima-kilo and it's working, I'...
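(A quick illustration of the `avg_over_time[1h]` reporting artifact mentioned in the 16:29:26 comment on T399189; the metric name, label, and Prometheus host below are assumptions based on a standard blackbox-exporter setup, not taken from the task.)
  # A probe that is down for 5 of the trailing 60 minutes averages out to ~0.92,
  # so a 1h-averaged panel smears a short outage across a full hour:
  #   avg_over_time(probe_success[1h]) = 55/60 ≈ 0.92
  # Querying the raw series instead shows the real ~5-minute gap:
  $ curl -sG 'https://prometheus.example.org/api/v1/query' \
      --data-urlencode 'query=probe_success{module="http_admin_toolforge_org_ip4"}'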
[17:43:05] 10Cloud-VPS (Quota-requests), 06Moderator-Tools-Team, 10Wikilink-Tool: Request to increase Object Storage capacity - Wikilink project - https://phabricator.wikimedia.org/T399746 (10Scardenasmolinar) 03NEW
[17:46:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[17:48:07] (03update) 10dcaro: foxtrot_ldap: fix bug when accounts already exist [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/257
[17:49:05] (03update) 10dcaro: foxtrot_ldap: fix bug when accounts already exist [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/257
[18:25:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:30:56] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:32:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[18:32:03] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[18:33:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[18:34:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:38:00] PROBLEM - Host cloudcephosd1006 is DOWN: PING CRITICAL - Packet loss = 100%
[18:38:30] RECOVERY - Host cloudcephosd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[18:44:11] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[18:46:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:51:18] 06cloud-services-team, 10Cloud-VPS, 10Continuous-Integration-Infrastructure (Zuul upgrade): Cloud VPS project member (formerly 'projectadmin') unable to grant k8s_admin, k8s_developer, k8s_viewer via `openstack role add` - https://phabricator.wikimedia.org/T399731#11010683 (10Andrew)
[18:56:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:01:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:22:47] (03open) 10lucaswerkmeister: openapi: Allow lowercase ASCII letters too [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/59 (https://phabricator.wikimedia.org/T374780)
[19:23:03] 06cloud-services-team, 10Tool-quickcategories, 10Toolforge, 13Patch-For-Review: Relax restrictions on toolforge envvar names - https://phabricator.wikimedia.org/T374780#11010762 (10LucasWerkmeister) ^ Let’s try the low-hanging fruit (already backed by T374780#10162297) first then, I guess.
[19:23:22] (03update) 10lucaswerkmeister: openapi: Allow lowercase ASCII letters too [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/59 (https://phabricator.wikimedia.org/T374780)
[19:25:52] (03update) 10lucaswerkmeister: openapi: Allow lowercase ASCII letters too [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/59 (https://phabricator.wikimedia.org/T374780)
[19:42:57] 10Tool-archive-externa-links, 10Datasets-Archiving, 10Pywikibot, 10Wikidata: Creation of the "ArchivingBot" for Automatic URL Archiving on Wikidata - https://phabricator.wikimedia.org/T389599#11010872 (10paulwiki)
[20:03:32] 10Cloud-VPS (Debian Buster Deprecation): Cloud VPS "auditlogging" project Buster deprecation - https://phabricator.wikimedia.org/T367522#11010895 (10Southparkfan) >>! In T367522#10999101, @Aklapper wrote: > @Andrew / @Southparkfan: Can this ticket be resolved by now, or is there more to do? The VM still exists....
[20:20:57] 06cloud-services-team, 10Cloud-VPS, 10Continuous-Integration-Infrastructure (Zuul upgrade): Cloud VPS project member (formerly 'projectadmin') unable to grant k8s_admin, k8s_developer, k8s_viewer via `openstack role add` - https://phabricator.wikimedia.org/T399731#11010935 (10bd808)
[20:20:58] 06cloud-services-team, 10Cloud-VPS: Review our handling of keystone 'member' role (previously known as 'projectadmin') - https://phabricator.wikimedia.org/T396016#11010936 (10bd808)
[20:34:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[20:39:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[21:44:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[21:49:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[22:13:22] 06cloud-services-team, 10Cloud-VPS, 10Continuous-Integration-Infrastructure (Zuul upgrade): Cloud VPS project member (formerly 'projectadmin') unable to grant k8s_admin, k8s_developer, k8s_viewer via `openstack role add` - https://phabricator.wikimedia.org/T399731#11011261 (10Andrew) I have confirmed that th...
[22:27:02] 06cloud-services-team, 10Cloud-VPS, 10Continuous-Integration-Infrastructure (Zuul upgrade): Cloud VPS project member (formerly 'projectadmin') unable to grant k8s_admin, k8s_developer, k8s_viewer via `openstack role add` - https://phabricator.wikimedia.org/T399731#11011323 (10bd808) >>! In T399731#11011261,...
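(For reference on T399731 above: a sketch of the kind of grant being attempted, using placeholder project and user names; the roles themselves were created in T399488, and per the task title the `role add` step fails when run by a regular project member rather than a cloud admin.)
  # list the k8s_* roles created in T399488
  $ openstack role list | grep k8s
  # attempt to grant one of them to a user within a project (placeholder names)
  $ openstack role add --project my-project --user some-user k8s_admin
  # verify the assignment
  $ openstack role assignment list --project my-project --user some-user --names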