[00:05:55] FIRING: MaxConntrack: Max conntrack at 95.11% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:06:04] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050 (phaultfinder) NEW
[00:55:55] RESOLVED: MaxConntrack: Max conntrack at 93.26% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[01:12:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1073.eqiad.wmnet' (T394333)
[01:12:42] T394333: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333
[01:26:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1073.eqiad.wmnet' (T394333)
[01:26:07] T394333: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333
[01:26:46] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10986891 (Andrew) >>! In T394333#10986464, @Jclark-ctr wrote: > @dcaro @Andrew @cmooney @ayounsi I need some assistance. I need to open a block of 4x...
[04:33:58] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[04:38:58] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[07:58:19] cloud-services-team, Toolforge (Toolforge iteration 21), Patch-For-Review: toolsbeta harbor disk full - https://phabricator.wikimedia.org/T398715#10987287 (dcaro) The retention rules are disabled in toolsbeta for some reason :/, let's re-enable unless someone was testing something specific: {F6360692...
[08:00:47] (merge) dcaro: build: enable ci builds [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/1 (https://phabricator.wikimedia.org/T398202)
[08:07:24] (open) dcaro: d/changelog: bump to 1.49.1 [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/2
[08:10:43] (open) dcaro: toolforge_deploy_mr: support deploying arch-specific packages [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/254
[08:11:19] (update) dcaro: toolforge_deploy_mr: support deploying arch-specific packages [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/254
[08:11:26] (update) dcaro: toolforge_deploy_mr: support deploying arch-specific packages [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/254
[08:21:52] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component misctools-cli
[08:22:30] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component misctools-cli
[08:27:09] (open) dcaro: functional-tests: add misctools-cli test definition [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/874
[08:27:23] (update) dcaro: functional-tests: add misctools-cli test definition [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/874
[08:27:37] (update) dcaro: functional-tests: add misctools-cli test definition [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/874
[08:35:46] (PS7) David Caro: toolforge.component.deploy: support multiarch packages [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166830 (https://phabricator.wikimedia.org/T398016)
[08:46:28] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component misctools-cli
[08:46:57] !log dcaro@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component misctools-cli
[08:51:12] (approved) taavi: tool-config: export the config schema [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/98 (https://phabricator.wikimedia.org/T397724) (owner: dcaro)
[09:44:16] cloud-services-team, Toolforge: [toolsdb] `information_schema.views` takes long time to query - https://phabricator.wikimedia.org/T398808#10987497 (fnegri) Open→Resolved a:fnegri This is no longer happening, there was probably some type of heavy load on the database that is no longer ther...
[10:02:11] Toolforge (Toolforge iteration 21), Patch-For-Review: [lima-kilo,misctools] no arm64 version for mac-os based installations - https://phabricator.wikimedia.org/T398016#10987529 (dcaro) For the record, I used this to add the extra arch to the aptly published repos without falling into the removal of packa...
[10:07:21] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component misctools-cli
[10:07:53] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component misctools-cli
[10:09:25] (approved) dcaro: functional-tests: add misctools-cli test definition [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/874
[10:09:28] (merge) dcaro: functional-tests: add misctools-cli test definition [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/874
[10:09:54] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component misctools-cli
[10:13:36] (CR) David Caro: [C:+2] "This has been moved now, packages are built from the new repository" [labs/toollabs] - https://gerrit.wikimedia.org/r/1165027 (https://phabricator.wikimedia.org/T398202) (owner: David Caro)
[10:13:47] (CR) David Caro: [V:+2 C:+2] Move to gitlab [labs/toollabs] - https://gerrit.wikimedia.org/r/1165027 (https://phabricator.wikimedia.org/T398202) (owner: David Caro)
[10:13:51] (CR) CI reject: [V:-1] Move to gitlab [labs/toollabs] - https://gerrit.wikimedia.org/r/1165027 (https://phabricator.wikimedia.org/T398202) (owner: David Caro)
[10:14:02] (CR) CI reject: [V:-1] Move to gitlab [labs/toollabs] - https://gerrit.wikimedia.org/r/1165027 (https://phabricator.wikimedia.org/T398202) (owner: David Caro)
[10:15:04] (CR) David Caro: [V:+1 C:+2] Move to gitlab [labs/toollabs] - https://gerrit.wikimedia.org/r/1165027 (https://phabricator.wikimedia.org/T398202) (owner: David Caro)
[10:15:09] (CR) David Caro: [V:+2 C:+2] Move to gitlab [labs/toollabs] - https://gerrit.wikimedia.org/r/1165027 (https://phabricator.wikimedia.org/T398202) (owner: David Caro)
[10:17:17] cloud-services-team, Toolforge, GitLab (Project Migration), Patch-For-Review: Migrate misctools package to GitLab - https://phabricator.wikimedia.org/T398202#10987556 (A_smart_kitten)
[10:18:39] Toolforge (Toolforge iteration 21), Patch-For-Review: [lima-kilo,misctools] no arm64 version for mac-os based installations - https://phabricator.wikimedia.org/T398016#10987559 (taavi)
[10:18:42] cloud-services-team, Toolforge, GitLab (Project Migration), Patch-For-Review: Migrate misctools package to GitLab - https://phabricator.wikimedia.org/T398202#10987560 (taavi)
[10:18:47] Toolforge (Toolforge iteration 21), Patch-For-Review: [lima-kilo,misctools] no arm64 version for mac-os based installations - https://phabricator.wikimedia.org/T398016#10987561 (taavi)
[10:18:49] cloud-services-team, Toolforge, GitLab (Project Migration), Patch-For-Review: Migrate misctools package to GitLab - https://phabricator.wikimedia.org/T398202#10987562 (taavi)
[10:20:06] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component misctools-cli
[10:22:07] (CR) David Caro: "Tested now with https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/2#note_151979" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166830 (https://phabricator.wikimedia.org/T398016) (owner: David Caro)
[10:22:55] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component misctools-cli
[10:33:40] (update) dcaro: Draft: dcaro test [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182
[10:34:36] cloud-services-team (FY2024/2025-Q3-Q4): Create WMCS offboarding checklist - https://phabricator.wikimedia.org/T398972#10987613 (fnegri)
[10:34:36] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component misctools-cli
[10:41:27] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068 (fnegri) NEW
[10:42:39] (approved) dcaro: d/changelog: bump to 1.49.1 [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/2
[10:42:41] (merge) dcaro: d/changelog: bump to 1.49.1 [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/2
[10:43:27] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10987635 (fnegri)
[10:45:15] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10987640 (fnegri)
[10:45:28] Toolforge (Toolforge iteration 21), Patch-For-Review: [lima-kilo,misctools] no arm64 version for mac-os based installations - https://phabricator.wikimedia.org/T398016#10987641 (dcaro) The new package has been released, @Raymond_Ndibe when you have time, can you rebuild your lima-kilo and check if it wor...
[10:45:35] cloud-services-team (FY2024/2025-Q3-Q4): Create WMCS offboarding checklist - https://phabricator.wikimedia.org/T398972#10987643 (fnegri) I'm going to test the checklist with {T399068}
[10:51:50] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10987658 (fnegri)
[10:58:07] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10987674 (fnegri)
[10:59:02] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10987677 (fnegri)
[10:59:05] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10987678 (fnegri) Open→In progress
[10:59:43] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10987679 (fnegri)
[11:04:16] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10987685 (fnegri)
[11:04:46] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10987686 (fnegri)
[11:05:11] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987688 (cmooney) >>! In T394333#10964951, @Andrew wrote: > That should be possible as long as I can get support with refactoring...
[11:32:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[11:44:18] FIRING: KernelErrors: Server cloudvirt1073 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudvirt1073 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[11:44:23] cloud-services-team: KernelErrors Server cloudvirt1073 logged kernel errors - https://phabricator.wikimedia.org/T399073 (phaultfinder) NEW
[11:51:04] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987796 (cmooney) 1050 and 1051 are now connected and ports up too. ` cmooney@cloudsw1-f4-eqiad> show interfaces descriptions | ma...
[12:01:50] (PS5) SomeRandomDeveloper: Replace divs with semantic elements [labs/tools/guc] - https://gerrit.wikimedia.org/r/1091830 (https://phabricator.wikimedia.org/T227631)
[12:05:35] cloud-services-team: KernelErrors Server cloudvirt1073 logged kernel errors - https://phabricator.wikimedia.org/T399073#10987819 (taavi) ` [1115043.105712] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Down [1115051.950548] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 10000 Mbps (NRZ) full duplex, Flow...
[12:17:11] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987865 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad....
[12:17:22] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987866 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1049.eqiad....
[12:17:28] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[12:43:50] Tool-global-search, Data-Platform-SRE: Global Search displays most search results twice - https://phabricator.wikimedia.org/T391175#10987933 (Esanders)
[12:49:10] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987962 (Jclark-ctr) @elukey i am having issues with 2 servers both fail to reimage after switching to 25g dac . cloudcephosd...
[12:55:25] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987969 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmne...
[12:55:35] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987971 (Jclark-ctr)
[12:56:13] (open) dcaro: package: add toolforge- prefix like the rest of clis [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/3
[12:58:37] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987976 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1049.eqiad.wmne...
[12:58:49] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987977 (Jclark-ctr)
[12:59:49] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad....
[13:00:31] (update) dcaro: package: add toolforge- prefix like the rest of clis [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/3
[13:01:20] (update) raymond-ndibe: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112)
[13:01:23] (update) dcaro: package: add toolforge- prefix like the rest of clis [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/3
[13:01:36] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987985 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad....
[13:03:06] (update) dcaro: package: add toolforge- prefix like the rest of clis [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/3
[13:03:50] (approved) taavi: package: add toolforge- prefix like the rest of clis [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/3 (owner: dcaro)
[13:09:52] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10988004 (fnegri)
[13:10:24] cloud-services-team, Toolforge (Toolforge iteration 21), Patch-For-Review: toolsbeta harbor disk full - https://phabricator.wikimedia.org/T398715#10988018 (Raymond_Ndibe) >>! In T398715#10987287, @dcaro wrote: > The retention rules are disabled in toolsbeta for some reason :/, let's re-enable unless...
[13:13:59] (update) dcaro: package: add toolforge- prefix like the rest of clis [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/3
[13:17:22] cloud-services-team, Toolforge (Toolforge iteration 21), Patch-For-Review: toolsbeta harbor disk full - https://phabricator.wikimedia.org/T398715#10988043 (Raymond_Ndibe) >>! In T398715#10987287, @dcaro wrote: > The retention rules are disabled in toolsbeta for some reason :/, let's re-enable unless...
[13:19:26] (open) dcaro: packaging: change name to match the rest of clis [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/110
[13:20:22] Toolforge (Toolforge iteration 21): [clis] standardize the package names - https://phabricator.wikimedia.org/T399080 (dcaro) NEW
[13:20:23] Toolforge (Toolforge iteration 21): [clis] standardize the package names - https://phabricator.wikimedia.org/T399080#10988077 (dcaro) Open→In progress
[13:20:33] Toolforge (Toolforge iteration 21): [clis] standardize the package names - https://phabricator.wikimedia.org/T399080#10988078 (dcaro) p:Triage→Medium
[13:20:37] (update) dcaro: package: add toolforge- prefix like the rest of clis [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/3
[13:20:51] (update) dcaro: packaging: change name to match the rest of clis [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/110
[13:21:30] cloud-services-team (FY2024/2025-Q3-Q4): Create WMCS offboarding checklist - https://phabricator.wikimedia.org/T398972#10988081 (fnegri) I updated the template a bit, @taavi made some edits as well. I think the current version of https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Offboarding_templa...
[13:22:01] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10988082 (fnegri) All done except "Remove from cloud mailing list", I'm not an admin of that list.
[13:23:50] Cloud Services Proposals, cloud-services-team, Toolforge: Decision request - Reuse toolforge user tools central logging for toolforge infrastructure logging - https://phabricator.wikimedia.org/T398285#10988085 (taavi) p:Triage→Medium
[13:32:02] (open) raymond-ndibe: [maintain-harbor] reduce toolforge project quota [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/875 (https://phabricator.wikimedia.org/T398715)
[13:32:29] Tool-global-search, Data-Platform-SRE: Global Search displays most search results twice - https://phabricator.wikimedia.org/T391175#10988129 (EBernhardson) Most plausibly the global-search side is picking up indices that are not live on the production side. IIRC this runs queries against the `*` index,...
[13:34:11] (CR) David Caro: [C:+1] "LGTM, all nits (feel free to ignore)" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166955 (owner: Andrew Bogott)
[13:38:50] cloud-services-team: KernelErrors Server cloudvirt1073 logged kernel errors - https://phabricator.wikimedia.org/T399073#10988158 (taavi) Open→Resolved Related to {T394333} presumably.
[13:41:51] (update) dcaro: packaging: change name to match the rest of clis [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/110
[13:45:58] (update) dcaro: packaging: change name to match the rest of clis [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/110
[13:48:23] (CR) Andrew Bogott: Add 'reactivate' cookbook (4 comments) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166955 (owner: Andrew Bogott)
[13:48:53] cloud-services-team, Infrastructure-Foundations, SRE-tools: sre.hosts.decommission often leaves dangling things in netbox - https://phabricator.wikimedia.org/T398052#10988206 (taavi) →Duplicate dup:T398412
[13:49:24] (merge) dcaro: package: add toolforge- prefix like the rest of clis [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/3
[13:51:05] (PS50) Andrew Bogott: Add 'reactivate' cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166955
[13:51:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[13:51:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[13:52:48] (PS8) David Caro: toolforge.component.deploy: support multiarch packages [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166830 (https://phabricator.wikimedia.org/T398016)
[13:53:08] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10988254 (fnegri) This is again `diffscan02` (172.16.3.44), similar to this issue from one year ago: {T355222}. ` root@cloudvirt1067:~# conntrack -L |grep 172.16.3.44 |wc -l conntrack...
[13:53:51] (open) dcaro: d/changelog: bump to 1.49.2 [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/4
[13:54:23] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988272 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmne...
[13:54:27] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988273 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmne...
[13:56:50] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988290 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad....
[14:00:15] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component misctools-cli
[14:00:42] (CR) Andrew Bogott: Add 'reactivate' cookbook (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166955 (owner: Andrew Bogott)
[14:01:26] (update) raymond-ndibe: runtimes.k8s.images: use config for image refresh interval [repos/cloud/toolforge/jobs-api] (refresh_image_config_data) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/165 (owner: dcaro)
[14:02:34] (update) raymond-ndibe: runtimes.k8s.images: use config for image refresh interval [repos/cloud/toolforge/jobs-api] (refresh_image_config_data) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/165 (owner: dcaro)
[14:02:55] cloud-services-team, Striker, Phabricator: Striker dev environment needs a new Phabricator base image - https://phabricator.wikimedia.org/T340080#10988334 (joanna_borun) p:Triage→Low
[14:03:12] cloud-services-team, Toolforge: [builds-cli] Show "image_name" in build details - https://phabricator.wikimedia.org/T397863#10988335 (dcaro) p:Triage→Low
[14:03:53] cloud-services-team, Toolforge: [components-cli] Invalid YAML file error should not encourage reporting the issue to admins - https://phabricator.wikimedia.org/T398425#10988342 (dcaro) p:Triage→Medium
[14:04:28] cloud-services-team, Toolforge: [toolforge-cli-gen] review the https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-gen-cli client as potential consolidation - https://phabricator.wikimedia.org/T398651#10988346 (dcaro) p:Triage→High We had a chat and decided that we might want to wait unt...
[14:04:37] cloud-services-team, Cloud-VPS, Documentation: [tofu-cloudvps] Document using `cloudvps_puppet_project` to manage project-wide and instance specific puppet classes and hiera settings - https://phabricator.wikimedia.org/T397994#10988348 (taavi) p:Triage→Medium
[14:05:07] cloud-services-team, Cloud-VPS: Add check for cloud-wide root keys to the offboarding script - https://phabricator.wikimedia.org/T398214#10988349 (joanna_borun) p:Triage→Medium
[14:05:16] cloud-services-team: Improve WMCS offboarding process - https://phabricator.wikimedia.org/T398215#10988350 (joanna_borun) p:Triage→Medium
[14:05:55] cloud-services-team, Cloud-VPS, Toolforge, GitLab (Auth & Access): Sync WMCS GitLab group membership from LDAP - https://phabricator.wikimedia.org/T398217#10988351 (joanna_borun) p:Triage→Medium
[14:06:21] cloud-services-team, Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#10988353 (joanna_borun) p:Triage→Medium
[14:07:46] cloud-services-team, Cloud-VPS: Cloud VPS project creation cookbook times out really often - https://phabricator.wikimedia.org/T398712#10988361 (joanna_borun) p:Triage→Medium
[14:07:47] cloud-services-team, Cloud-VPS: Cloud VPS project creation cookbook times out really often - https://phabricator.wikimedia.org/T398712#10988362 (Andrew) a:Andrew
[14:10:05] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component misctools-cli
[14:11:24] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component misctools-cli
[14:11:55] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10988380 (fnegri) p:Triage→High
[14:16:08] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10988393 (fnegri) p:Triage→High
[14:21:22] cloud-services-team, Cloud-VPS, Toolforge, GitLab (Auth & Access): Sync WMCS GitLab group membership from LDAP - https://phabricator.wikimedia.org/T398217#10988440 (Jelto) >>! In T398217#10959948, @thcipriani wrote: > FWIW, there are utilities that run in systemd (managed by puppet) to manage lda...
[14:22:21] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component misctools-cli
[14:23:07] (approved) dcaro: d/changelog: bump to 1.49.2 [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/4
[14:23:13] (merge) dcaro: d/changelog: bump to 1.49.2 [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/4
[14:27:33] (open) dcaro: toolforge_get_versions: add misctools-cli [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/876
[14:29:50] (update) dcaro: toolforge_get_versions: add misctools-cli [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/876
[14:34:19] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988500 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad....
[14:37:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[14:37:37] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[14:39:39] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[14:42:28] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10988522 (fnegri) Maybe there are other VMs that are also creating more connections than usual. Top ones (hat tip @dcaro for the Bash one-liner!): ` root@cloudvirt1067:~# conntrack -L...
[14:43:14] andrew@cloudcumin1001 reactivate (PID 2042522) is awaiting input
[14:43:21] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988523 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmne...
[14:43:55] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[14:51:19] (approved) raymond-ndibe: toolforge_deploy_mr: support deploying arch-specific packages [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/254 (owner: dcaro)
[14:51:20] (update) raymond-ndibe: toolforge_deploy_mr: support deploying arch-specific packages [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/254 (owner: dcaro)
[14:51:28] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10988589 (fnegri) It looks like the pattern is the same as the previous days, but last night there was an //additional// 50K connections that pushed the total just above the alert thres...
[14:51:39] (update) raymond-ndibe: toolforge_deploy_mr: support deploying arch-specific packages [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/254 (owner: dcaro) [14:52:33] (approved) raymond-ndibe: toolforge_get_versions: add misctools-cli [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/876 (owner: dcaro) [14:52:37] (update) raymond-ndibe: toolforge_get_versions: add misctools-cli [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/876 (owner: dcaro) [15:03:42] (open) dcaro: use new jobs cli name [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/255 [15:04:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_mons (T306820) [15:04:08] T306820: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820 [15:05:30] (update) dcaro: jobs-cli: use the new package name [repos/cloud/toolforge/lima-kilo] (download_arch_specific_packages_if_needed) - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/255 [15:05:55] FIRING: MaxConntrack: Max conntrack at 80.44% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [15:06:10] (update) dcaro: jobs-cli: use the new package name [repos/cloud/toolforge/lima-kilo] (download_arch_specific_packages_if_needed) - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/255 [15:07:47] PROBLEM - Host cloudcephmon1004 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:15] RECOVERY - Host cloudcephmon1004 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:10:09] FIRING: 
CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [15:12:56] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988676 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmne... [15:13:47] (PS51) Andrew Bogott: Add 'reactivate' cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166955 [15:13:48] (PS1) Andrew Bogott: inventory: update osd_drives_count to 8 for codfw [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1167654 [15:14:45] (CR) Andrew Bogott: Add 'reactivate' cookbook (3 comments) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166955 (owner: Andrew Bogott) [15:20:56] RESOLVED: MaxConntrack: Max conntrack at 80.96% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [15:27:36] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10988720 (fnegri) Actually the traffic increase (about 50k additional connections) seems to match the sum of the values for 172.20.4.21 (cloudvirt1067.private) and 172.20.3.20 (cloudvir...
[15:28:40] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.upgrade_mons (exit_code=0) [15:28:42] cloud-services-team, Cloud-VPS, Toolforge, GitLab (Auth & Access): Sync WMCS GitLab group membership from LDAP - https://phabricator.wikimedia.org/T398217#10988721 (bd808) >>! In T398217#10988440, @Jelto wrote: > Yes that's right, we have a script and a systemd timer to sync users from ldap to Gi... [15:29:12] (CR) Andrew Bogott: [C:+2] Add 'reactivate' cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166955 (owner: Andrew Bogott) [15:29:22] (CR) Andrew Bogott: [C:+2] inventory: update osd_drives_count to 8 for codfw [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1167654 (owner: Andrew Bogott) [15:31:27] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10988752 (fnegri) > This is traffic flowing between those two hosts on port 4789. This is probably encapsulated VxLAN traffic: https://en.wikipedia.org/wiki/Virtual_Extensible_LAN [15:33:12] (Merged) jenkins-bot: Add 'reactivate' cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166955 (owner: Andrew Bogott) [15:33:13] (Merged) jenkins-bot: inventory: update osd_drives_count to 8 for codfw [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1167654 (owner: Andrew Bogott) [15:36:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [15:39:05] (CR) Andrew Bogott: [C:+2] wmcs_libs/ceph.py: support changes in ok-to-stop output in Pacific [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166929 (owner: Andrew Bogott) [15:41:09] 
FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [15:42:56] (Merged) jenkins-bot: wmcs_libs/ceph.py: support changes in ok-to-stop output in Pacific [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166929 (owner: Andrew Bogott) [15:49:08] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10988908 (fnegri) Open→In progress a:fnegri Whatever that is, 50k connections is only 10% of the limit, so we should focus on what's causing the remaining 90%. I think it's... [15:53:19] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10988921 (fnegri) [15:53:39] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10988924 (fnegri) In progress→Resolved [16:16:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [16:26:59] cloud-services-team, Toolforge (Toolforge iteration 21), Patch-For-Review: toolsbeta harbor disk full - https://phabricator.wikimedia.org/T398715#10989055 (dcaro) We had a chat and decided to start with just expanding the volume to 100G, if that's not enough we'll review the policies :) [16:29:28] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - 
https://phabricator.wikimedia.org/T399050#10989063 (dcaro) ``might be something we can put in prometheus xd `` [16:36:31] (merge) dcaro: toolforge_deploy_mr: support deploying arch-specific packages [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/254 [16:36:33] (update) dcaro: jobs-cli: use the new package name [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/255 [16:36:55] (update) dcaro: [start-devenv.sh] enable use of range syntax for ansible tags [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/251 (https://phabricator.wikimedia.org/T398306) (owner: raymond-ndibe) [16:37:01] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989102 (elukey) Full stack trace: ` 2025-07-09 12:36:49,306 jclark 138654 [INFO] Completed command 'puppet lookup --render-as s...
[16:42:41] (approved) dcaro: [start-devenv.sh] enable use of range syntax for ansible tags [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/251 (https://phabricator.wikimedia.org/T398306) (owner: raymond-ndibe) [16:43:23] (update) dcaro: build: Upgrade Poetry dependencies [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/106 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:43:29] (update) dcaro: build: Upgrade Poetry dependencies [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/106 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:52:17] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api [16:52:22] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component components-api [16:52:35] (approved) dcaro: build: Upgrade Poetry dependencies [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/106 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:52:37] (merge) dcaro: build: Upgrade Poetry dependencies [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/106 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:55:02] (update) dcaro: [maintain-harbor] reduce toolforge project quota [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/875 (https://phabricator.wikimedia.org/T398715) (owner: raymond-ndibe) [16:55:19] (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.134-20250709165247-8f70091f 
[repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/877 [16:55:23] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.134-20250709165247-8f70091f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/877 [16:56:03] (approved) dcaro: logging: alloy: Allow running on the entire cluster [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/836 (https://phabricator.wikimedia.org/T386480) (owner: taavi) [16:57:01] (update) dcaro: logging: loki: Add second Loki instance for infrastructure logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/834 (https://phabricator.wikimedia.org/T386480) (owner: taavi) [16:57:06] (update) dcaro: logging: alloy: Add routing for infrastructure logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/835 (https://phabricator.wikimedia.org/T386480) (owner: taavi) [16:57:10] (update) dcaro: logging: alloy: Allow running on the entire cluster [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/836 (https://phabricator.wikimedia.org/T386480) (owner: taavi) [16:58:42] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api [16:59:49] (approved) dcaro: toolforge_get_versions: add misctools-cli [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/876 [16:59:51] (merge) dcaro: toolforge_get_versions: add misctools-cli [repos/cloud/toolforge/toolforge-deploy] - 
https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/876 [17:01:00] cloud-services-team, Data-Services, Projects-Cleanup: Archive the operations/debs/bdsync repository - https://phabricator.wikimedia.org/T377882#10989222 (hashar) [17:01:21] cloud-services-team, Data-Services, Projects-Cleanup: Archive the operations/debs/bdsync repository - https://phabricator.wikimedia.org/T377882#10989225 (hashar) Open→Resolved a:Jdforrester-WMF [17:02:46] (close) dcaro: Draft: basic_system: add lima-kilo-boot.service [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/148 (owner: aborrero) [17:03:27] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [17:03:30] (close) dcaro: helpers: add toolforge_redeploy_components.sh [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/166 (owner: aborrero) [17:04:56] Tool-lists, Projects-Cleanup: Archive Gerrit repository "labs/tools/lists" - https://phabricator.wikimedia.org/T371095#10989238 (hashar) Open→Resolved a:hashar Done, and I have deleted the GitHub mirror [17:05:19] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api [17:08:14] (CR) David Caro: toolforge.component.deploy: support multiarch packages (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166830 (https://phabricator.wikimedia.org/T398016) (owner: David Caro) [17:10:16] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [17:10:33] (update) dcaro: packaging: change name to match the rest of clis [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/110 [17:14:35] 
(approved) dcaro: components-api: bump to 0.0.134-20250709165247-8f70091f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/877 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:14:38] (update) dcaro: components-api: bump to 0.0.134-20250709165247-8f70091f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/877 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:15:10] (merge) dcaro: components-api: bump to 0.0.134-20250709165247-8f70091f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/877 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:21:48] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on cloudnet2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [17:56:39] Cloud-VPS (Project-requests): Request creation of lemmy VPS project - https://phabricator.wikimedia.org/T396948#10989383 (Aklapper) Stalled→Declined Unfortunately closing this Phabricator task as no further information has been provided. @Gryllida: After you have provided the information asked...
[18:01:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds (T306820) [18:01:08] T306820: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820 [18:04:27] PROBLEM - Host cloudcephosd1004 is DOWN: PING CRITICAL - Packet loss = 100% [18:04:43] (open) ttaylor: Add mobile responsiveness [toolforge-repos/listen-to-wiki-changes] - https://gitlab.wikimedia.org/toolforge-repos/listen-to-wiki-changes/-/merge_requests/2 [18:05:25] RECOVERY - Host cloudcephosd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [18:05:27] (merge) ttaylor: Add mobile responsiveness [toolforge-repos/listen-to-wiki-changes] - https://gitlab.wikimedia.org/toolforge-repos/listen-to-wiki-changes/-/merge_requests/2 [18:08:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [18:11:01] PROBLEM - Host cloudcephosd1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:11:37] RECOVERY - Host cloudcephosd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [18:16:57] PROBLEM - Host cloudcephosd1006 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:27] RECOVERY - Host cloudcephosd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:21:59] PROBLEM - Host cloudcephosd1007 is DOWN: PING CRITICAL - Packet loss = 100% [18:24:05] RECOVERY - Host cloudcephosd1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:29:01] PROBLEM - Host cloudcephosd1008 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:29] RECOVERY - Host cloudcephosd1008 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [18:35:31] PROBLEM - Host cloudcephosd1009 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:15] RECOVERY - Host cloudcephosd1009 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[18:42:07] PROBLEM - Host cloudcephosd1010 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:45] RECOVERY - Host cloudcephosd1010 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [18:47:11] PROBLEM - Host cloudcephosd1011 is DOWN: PING CRITICAL - Packet loss = 100% [18:48:32] (update) raymond-ndibe: packaging: change name to match the rest of clis [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/110 (owner: dcaro) [18:48:33] (approved) raymond-ndibe: packaging: change name to match the rest of clis [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/110 (owner: dcaro) [18:49:29] RECOVERY - Host cloudcephosd1011 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [18:51:18] (update) raymond-ndibe: deploy: allow retrieving a deploy with a token [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/109 (owner: dcaro) [18:54:11] PROBLEM - Host cloudcephosd1012 is DOWN: PING CRITICAL - Packet loss = 100% [18:55:25] RECOVERY - Host cloudcephosd1012 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:00:25] PROBLEM - Host cloudcephosd1013 is DOWN: PING CRITICAL - Packet loss = 100% [19:01:55] RECOVERY - Host cloudcephosd1013 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [19:05:57] PROBLEM - Host cloudcephosd1014 is DOWN: PING CRITICAL - Packet loss = 100% [19:13:29] (update) raymond-ndibe: deploy: allow retrieving a deploy with a token [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/109 (owner: dcaro) [19:13:30] (approved) raymond-ndibe: deploy: allow retrieving a deploy with a token [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/109 (owner: dcaro) [19:15:31] (approved) raymond-ndibe: 
toolconfig: make config_version explicitly nullable [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/108 (owner: dcaro) [19:15:32] (update) raymond-ndibe: toolconfig: make config_version explicitly nullable [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/108 (owner: dcaro) [19:15:34] (update) raymond-ndibe: toolconfig: make config_version explicitly nullable [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/108 (owner: dcaro) [19:51:24] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989741 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.... [19:51:33] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989742 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.... [19:51:55] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989743 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmne... [19:53:15] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989744 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad....
[20:05:43] Tool-global-search, Data-Platform-SRE: Global Search displays most search results twice - https://phabricator.wikimedia.org/T391175#10989779 (EBernhardson) merge request: https://github.com/wikimedia/tools-global-search/pull/118 [20:06:08] cloud-services-team, Cloud-VPS: Prevent creation of VMs on the old ipv4 network - https://phabricator.wikimedia.org/T399127 (Andrew) NEW [20:06:49] Tool-global-search, Data-Platform-SRE: Global Search displays most search results twice - https://phabricator.wikimedia.org/T391175#10989799 (EBernhardson) @MusikAnimal [20:07:59] cloud-services-team, Cloud-VPS: Prevent creation of VMs on the old ipv4 network - https://phabricator.wikimedia.org/T399127#10989805 (Andrew) Neutron (wisely) forbids changing a network to unshared when it is already used by multiple tenants. Disabling the network seems to not really change anything oth... [20:13:20] RECOVERY - Host cloudcephosd1014 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:19:14] PROBLEM - Host cloudcephosd1015 is DOWN: PING CRITICAL - Packet loss = 100% [20:20:42] RECOVERY - Host cloudcephosd1015 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [20:24:46] PROBLEM - Host cloudcephosd1016 is DOWN: PING CRITICAL - Packet loss = 100% [20:26:16] RECOVERY - Host cloudcephosd1016 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:29:57] PROBLEM - Host cloudcephosd1017 is DOWN: PING CRITICAL - Packet loss = 100% [20:31:16] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989851 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmne...
[20:31:41] RECOVERY - Host cloudcephosd1017 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:35:01] PROBLEM - Host cloudcephosd1018 is DOWN: PING CRITICAL - Packet loss = 100% [20:36:15] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989890 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmne... [20:36:39] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989891 (Jclark-ctr) [20:36:55] RECOVERY - Host cloudcephosd1018 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:37:46] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989902 (Jclark-ctr) Open→Resolved [20:41:39] PROBLEM - Host cloudcephosd1019 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:07] RECOVERY - Host cloudcephosd1019 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:47:13] PROBLEM - Host cloudcephosd1020 is DOWN: PING CRITICAL - Packet loss = 100% [20:48:41] RECOVERY - Host cloudcephosd1020 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:52:49] PROBLEM - Host cloudcephosd1021 is DOWN: PING CRITICAL - Packet loss = 100% [20:54:05] RECOVERY - Host cloudcephosd1021 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [20:57:58] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:58:29] PROBLEM - Host cloudcephosd1022 is DOWN: PING CRITICAL - Packet loss 
= 100% [20:59:57] RECOVERY - Host cloudcephosd1022 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [21:02:58] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:04:11] PROBLEM - Host cloudcephosd1023 is DOWN: PING CRITICAL - Packet loss = 100% [21:04:57] RECOVERY - Host cloudcephosd1023 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [21:09:19] PROBLEM - Host cloudcephosd1024 is DOWN: PING CRITICAL - Packet loss = 100% [21:11:10] RECOVERY - Host cloudcephosd1024 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [21:15:32] PROBLEM - Host cloudcephosd1026 is DOWN: PING CRITICAL - Packet loss = 100% [21:17:04] RECOVERY - Host cloudcephosd1026 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [21:21:28] PROBLEM - SSH on cloudcephosd1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:22:18] RECOVERY - SSH on cloudcephosd1027 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:27:28] PROBLEM - Host cloudcephosd1028 is DOWN: PING CRITICAL - Packet loss = 100% [21:28:17] RECOVERY - Host cloudcephosd1028 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [21:33:15] PROBLEM - Host cloudcephosd1029 is DOWN: PING CRITICAL - Packet loss = 100% [21:33:59] RECOVERY - Host cloudcephosd1029 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [21:38:17] PROBLEM - Host cloudcephosd1030 is DOWN: PING CRITICAL - Packet loss = 100% [21:40:45] RECOVERY - Host cloudcephosd1030 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [21:44:05] cloud-services-team, Cloud-VPS: Prevent creation of VMs on the old ipv4 network - https://phabricator.wikimedia.org/T399127#10990055 
(bd808) >>! In T396936#10945434, @bd808 wrote: >>>! In T396936#10937326, @taavi wrote: >> If Magnum doesn't support dual-stack clusters then I think I consider that a bug t... [21:45:35] PROBLEM - Host cloudcephosd1031 is DOWN: PING CRITICAL - Packet loss = 100% [21:45:59] RECOVERY - Host cloudcephosd1031 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [21:50:31] PROBLEM - Host cloudcephosd1032 is DOWN: PING CRITICAL - Packet loss = 100% [21:51:59] RECOVERY - Host cloudcephosd1032 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [21:56:59] PROBLEM - Host cloudcephosd1033 is DOWN: PING CRITICAL - Packet loss = 100% [21:58:27] RECOVERY - Host cloudcephosd1033 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [22:02:39] PROBLEM - Host cloudcephosd1034 is DOWN: PING CRITICAL - Packet loss = 100% [22:04:29] RECOVERY - Host cloudcephosd1034 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [22:08:29] PROBLEM - Host cloudcephosd1035 is DOWN: PING CRITICAL - Packet loss = 100% [22:12:29] RECOVERY - Host cloudcephosd1035 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [22:17:01] PROBLEM - Host cloudcephosd1036 is DOWN: PING CRITICAL - Packet loss = 100% [22:19:43] RECOVERY - Host cloudcephosd1036 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [22:23:35] PROBLEM - Host cloudcephosd1037 is DOWN: PING CRITICAL - Packet loss = 100% [22:27:45] RECOVERY - Host cloudcephosd1037 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [22:32:31] PROBLEM - Host cloudcephosd1038 is DOWN: PING CRITICAL - Packet loss = 100% [22:34:59] RECOVERY - Host cloudcephosd1038 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [22:38:53] PROBLEM - Host cloudcephosd1039 is DOWN: PING CRITICAL - Packet loss = 100% [22:43:19] RECOVERY - Host cloudcephosd1039 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [22:48:12] PROBLEM - Host cloudcephosd1040 is DOWN: PING CRITICAL - Packet loss = 100% [22:50:41] RECOVERY - Host cloudcephosd1040 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [22:55:51] PROBLEM - 
Host cloudcephosd1041 is DOWN: PING CRITICAL - Packet loss = 100% [22:58:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [22:59:19] RECOVERY - Host cloudcephosd1041 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [23:05:24] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.upgrade_osds (exit_code=99) [23:08:05] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses