[00:03:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-1 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesse
[00:05:55] FIRING: MaxConntrack: Max conntrack at 84.34% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:48:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-1 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesse
[00:55:56] RESOLVED: MaxConntrack: Max conntrack at 82.82% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[03:13:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:19:19] FIRING: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[03:24:19] RESOLVED: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[05:22:50] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10973902 (ayounsi) There is currently only one switch per rack, so I suggest we only use one uplink for now, and revisit it the day we have more.
[06:08:06] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[06:46:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance runner-1033 in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[07:01:31] FIRING: PuppetStaleCertificates: Found non-revoked Puppet certificates for 3 deleted instances on gitlab-runners-puppetserver-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[07:11:28] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance runner-1033 in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[07:24:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudgw1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:28:06] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:29:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cloudgw1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:57:26] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166054 (https://phabricator.wikimedia.org/T390397) (owner: Bovimacoco)
[08:15:33] cloud-services-team, Toolforge: [toolforge-cli-gen] review the https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-gen-cli client as potential consolidation - https://phabricator.wikimedia.org/T398651#10974128 (Addshore) @dcaro want to schedule a call to walk through it all in more detail?
[08:19:14] cloud-services-team, Toolforge: [toolforge-cli-gen] review the https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-gen-cli client as potential consolidation - https://phabricator.wikimedia.org/T398651#10974130 (dcaro) >>! In T398651#10974128, @Addshore wrote: > @dcaro want to schedule a call to...
[08:28:06] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[09:17:41] (open) dcaro: Draft: DONOTMERGE: always auth as tf-test [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/72
[09:21:37] (update) dcaro: Draft: DONOTMERGE: always auth as tf-test [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/72
[09:21:45] (PS1) Essa237: Refined the landing page [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166348
[09:23:21] (update) dcaro: Draft: DONOTMERGE: always auth as tf-test [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/72
[09:25:02] (Abandoned) Essa237: [Fix] added a landing page [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157586 (owner: Essa237)
[09:28:15] wikitech.wikimedia.org, SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686 (Tobi_WMDE_SW) NEW
[09:29:57] (open) taavi: logs: Move multi-pod fix from jobs-api to here [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/82 (https://phabricator.wikimedia.org/T398647)
[09:33:17] wikitech.wikimedia.org, SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974387 (Clement_Goubert)
[09:33:29] wikitech.wikimedia.org, SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974388 (Clement_Goubert)
[09:33:34] (update) dcaro: Draft: DONOTMERGE: always auth as tf-test [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/72
[09:35:50] (open) taavi: Draft: Use logging multi-pod fix moved to toolforge-weld [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/179 (https://phabricator.wikimedia.org/T398647)
[09:37:00] wikitech.wikimedia.org, SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974404 (Clement_Goubert) Open→In progress p:Triage→Medium @Tobi_WMDE_SW Can you or @sowmya.guru fill out the first part of the...
[09:39:55] (update) taavi: Draft: Use logging multi-pod fix moved to toolforge-weld [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/179 (https://phabricator.wikimedia.org/T398647)
[09:40:55] (update) dcaro: Draft: DONOTMERGE: always auth as tf-test [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/72
[09:44:42] (update) taavi: logs: Move multi-pod fix from jobs-api to here [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/82 (https://phabricator.wikimedia.org/T398647)
[10:02:24] cloud-services-team, Toolforge (Toolforge iteration 21), Patch-For-Review: Move Kubernetes log source multi-pod handling from jobs-api to toolforge-weld - https://phabricator.wikimedia.org/T398647#10974500 (taavi)
[10:33:06] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-54 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[10:42:31] (PS1) NkwadaNora: [fix]: created a pyproject.toml file at the root of the project, this tells tox to skip trying to build a Python distribution and just run your npm lint commands insteadsince the project is javascript, typeScript and not a python project [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166379
[10:51:33] (Abandoned) NkwadaNora: rearrange the location of some files [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1152117 (owner: NkwadaNora)
[10:53:06] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-54 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[10:54:01] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166379 (owner: NkwadaNora)
[10:55:29] FIRING: NfsAlmostFull: The NFS drive is over 85% capacity (currently 87.13%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull
[11:02:39] (CR) NkwadaNora: [C:+1] [fix]: created a pyproject.toml file at the root of the project, this tells tox to skip trying to build a Python distribution and just run y [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166379 (owner: NkwadaNora)
[11:05:22] (CR) Eugene233: [C:+2] [fix]: created a pyproject.toml file at the root of the project, this tells tox to skip trying to build a Python distribution and just run y [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166379 (owner: NkwadaNora)
[11:28:06] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-54 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[12:31:40] (open) taavi: Query logs from Loki [repos/cloud/toolforge/jobs-api] (taavi/logging) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/180 (https://phabricator.wikimedia.org/T398645)
[12:36:01] (update) taavi: Query logs from Loki [repos/cloud/toolforge/jobs-api] (taavi/logging) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/180 (https://phabricator.wikimedia.org/T398645)
[12:38:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:42:39] cloud-services-team, Cloud-VPS (Project-requests): Request creation of wikidata-deleted VPS project - https://phabricator.wikimedia.org/T398254#10975074 (taavi) a:taavi
[12:42:46] !log taavi@cloudcumin1001 wikidata-deleted START - Cookbook wmcs.vps.create_project for project wikidata-deleted in eqiad1 (T398254)
[12:42:47] taavi@cloudcumin1001: Unknown project "wikidata-deleted"
[12:42:48] T398254: Request creation of wikidata-deleted VPS project - https://phabricator.wikimedia.org/T398254
[12:43:25] (open) group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49: projects: added project wikidata-deleted [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/256 (https://phabricator.wikimedia.org/T398254)
[12:44:08] (update) taavi: Query logs from Loki [repos/cloud/toolforge/jobs-api] (taavi/logging) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/180 (https://phabricator.wikimedia.org/T398645)
[12:44:37] (merge) taavi: projects: added project wikidata-deleted [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/256 (https://phabricator.wikimedia.org/T398254) (owner: group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49)
[12:47:06] !log taavi@cloudcumin1001 wikidata-deleted END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project wikidata-deleted in eqiad1 (T398254)
[12:47:07] taavi@cloudcumin1001: Unknown project "wikidata-deleted"
[12:49:44] !log taavi@cloudcumin1001 wikidata-deleted START - Cookbook wmcs.vps.create_project for project wikidata-deleted in eqiad1 (T398254)
[12:49:48] T398254: Request creation of wikidata-deleted VPS project - https://phabricator.wikimedia.org/T398254
[12:51:18] cloud-services-team, Cloud-VPS: Cloud VPS project creation cookbook times out really often - https://phabricator.wikimedia.org/T398712 (taavi) NEW
[12:51:18] !log taavi@cloudcumin1001 wikidata-deleted END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for project wikidata-deleted in eqiad1 (T398254)
[12:53:08] cloud-services-team, Cloud-VPS (Project-requests), Patch-For-Review: Request creation of wikidata-deleted VPS project - https://phabricator.wikimedia.org/T398254#10975101 (taavi) Open→Resolved This project has been created. @bovlb: please make sure that you are subscribed to [[ https://li...
[12:54:07] FIRING: HarborComponentDown: A Harbor component is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[12:59:47] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.005 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[13:01:25] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 35.375 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[13:03:06] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[13:13:06] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[13:18:06] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[13:23:06] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[13:24:49] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-36
[13:26:07] cloud-services-team, Toolforge: toolsbeta harbor disk full - https://phabricator.wikimedia.org/T398715 (taavi) NEW p:Triage→High
[13:28:06] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[13:30:39] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-36
[13:33:06] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[13:50:45] (approved) marostegui: Update ES switchover script [toolforge-repos/switchmaster] - https://gitlab.wikimedia.org/toolforge-repos/switchmaster/-/merge_requests/11 (https://phabricator.wikimedia.org/T397628) (owner: ladsgroup)
[14:08:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-24 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[14:38:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[14:44:37] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-12, tools-k8s-worker-nfs-24
[14:48:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:56:18] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-12, tools-k8s-worker-nfs-24
[14:58:04] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:03:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:08:04] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:33:05] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-24 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[16:13:27] cloud-services-team, Toolforge: toolsbeta harbor disk full - https://phabricator.wikimedia.org/T398715#10975666 (Raymond_Ndibe) a:Raymond_Ndibe
[16:38:36] Tools: [versions] Link to MediaWiki release notes for deployed versions - https://phabricator.wikimedia.org/T398725 (bd808) NEW
[17:03:47] (merge) ladsgroup: Update ES switchover script [toolforge-repos/switchmaster] - https://gitlab.wikimedia.org/toolforge-repos/switchmaster/-/merge_requests/11 (https://phabricator.wikimedia.org/T397628)
[17:39:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-57 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:23:21] cloud-services-team, Toolforge: toolsbeta harbor disk full - https://phabricator.wikimedia.org/T398715#10975938 (Raymond_Ndibe) Managed to get it down to 50% by manually cleaning up some images and running garbage collection (took some fiddling to get gc to run because gc needs redis and redis was down b...
[18:24:00] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team: Add the 'other assignee' field to the Phabricator test instance - https://phabricator.wikimedia.org/T398732 (A_smart_kitten) NEW
[18:24:10] cloud-services-team, Toolforge (Toolforge iteration 21): toolsbeta harbor disk full - https://phabricator.wikimedia.org/T398715#10975962 (Raymond_Ndibe)
[18:24:24] cloud-services-team, Toolforge (Toolforge iteration 21): toolsbeta harbor disk full - https://phabricator.wikimedia.org/T398715#10975965 (Raymond_Ndibe) Open→In progress
[18:49:02] (PS1) Krinkle: IPInfo: Improve and simplify getAsInfo implementation further [labs/tools/guc] - https://gerrit.wikimedia.org/r/1166433
[18:50:31] (CR) Krinkle: [C:+2] IPInfo: Improve and simplify getAsInfo implementation further [labs/tools/guc] - https://gerrit.wikimedia.org/r/1166433 (owner: Krinkle)
[18:51:27] (Merged) jenkins-bot: IPInfo: Improve and simplify getAsInfo implementation further [labs/tools/guc] - https://gerrit.wikimedia.org/r/1166433 (owner: Krinkle)
[19:49:04] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-47 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[23:15:15] (PS1) Jacob4code: Updated README Elaborate steps to run tool on local setup [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166450
[23:16:41] (CR) Jacob4code: "Hi can you please review this ?" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166450 (owner: Jacob4code)