[00:05:55] FIRING: MaxConntrack: Max conntrack at 82.35% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:09:48] (update) raymond-ndibe: Draft: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[00:11:29] (update) raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[00:13:09] (update) raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[00:55:55] RESOLVED: MaxConntrack: Max conntrack at 80.33% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[04:50:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[04:54:52] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[05:32:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[05:32:12] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T399144 (phaultfinder) NEW
[06:12:59] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10990367 (elukey) @Jclark-ctr IIUC it was a temporary failure right?
[08:12:06] cloud-services-team, Cloud-VPS: Prevent creation of VMs on the old ipv4 network - https://phabricator.wikimedia.org/T399127#10990858 (dcaro) >>! In T399127#10990055, @bd808 wrote: >>>! In T396936#10945434, @bd808 wrote: >>>>! In T396936#10937326, @taavi wrote: >>> If Magnum doesn't support dual-stack clu...
[08:16:45] (merge) dcaro: packaging: change name to match the rest of clis [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/110
[08:27:38] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10990902 (fnegri) It's indeed `diffscan02` (`172.16.3.44`) causing the spike around 00:00 UTC: ` Wed Jul 9 11:57:44 PM UTC 2025 conntrack v1.4.7 (conntrack-tools): 13424 flow entries...
[08:56:28] (open) arthurtaylor: Add deprecation notice [toolforge-repos/wcsg-workboard-log] - https://gitlab.wikimedia.org/toolforge-repos/wcsg-workboard-log/-/merge_requests/1
[08:56:40] (merge) arthurtaylor: Add deprecation notice [toolforge-repos/wcsg-workboard-log] - https://gitlab.wikimedia.org/toolforge-repos/wcsg-workboard-log/-/merge_requests/1
[09:11:30] (open) dcaro: d/changelog: bump to 16.1.15 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/111 (https://phabricator.wikimedia.org/T399080)
[09:26:53] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[09:26:58] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli
[09:30:30] (PS1) David Caro: toolforge.component.deploy: rename jobs-cli [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1167829
[09:30:50] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[09:31:14] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli
[09:32:15] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[10:00:29] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10991125 (fnegri) The alert started firing daily on 2025-06-27 because the limit changed: {F63744349}
[10:13:09] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10991147 (fnegri) Resolved→In progress
[10:15:19] cloud-services-team (FY2024/2025-Q3-Q4): WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068#10991168 (fnegri)
[10:17:06] cloud-services-team (FY2024/2025-Q3-Q4): Create WMCS offboarding checklist - https://phabricator.wikimedia.org/T398972#10991175 (fnegri) In progress→Resolved Marking as Resolved, feel free to edit the template directly if you find something is missing.
[10:22:59] cloud-services-team, Toolforge, MediaWiki-Action-API: APIhighlimits doesn't work on my bot (with bot password) since July 15, 2020 - https://phabricator.wikimedia.org/T258057#10991207 (Aklapper)
[10:37:00] RESOLVED: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[10:42:31] (open) dcaro: toolforge_get_version: use the new jobs-cli package name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/878
[10:44:53] (approved) dcaro: toolforge_get_version: use the new jobs-cli package name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/878
[10:44:59] (merge) dcaro: toolforge_get_version: use the new jobs-cli package name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/878
[10:45:00] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[10:45:26] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli
[11:25:03] PROBLEM - Host cloudcephosd1007 is DOWN: PING CRITICAL - Packet loss = 100%
[11:28:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[11:30:05] RECOVERY - Host cloudcephosd1007 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[11:31:44] cloud-services-team, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#10991485 (cmooney) Stalled→Resolved a:cmooney I am going to close this one (please ping me if that is hasty!) as I've o...
[11:32:28] FIRING: InstanceDown: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:44:01] (approved) dcaro: jobs-cli: use the new package name [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/255
[11:44:26] (merge) dcaro: jobs-cli: use the new package name [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/255
[11:45:44] (open) dcaro: ansible: use the new misctools package name [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/256
[11:46:24] (approved) dcaro: ansible: use the new misctools package name [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/256
[11:49:50] !log dcaro@hephaestus cloudinfra START - Cookbook wmcs.openstack.cloudvirt.vm_console
[11:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL
[11:51:00] !log dcaro@hephaestus cloudinfra END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[11:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL
[11:51:04] !log dcaro@hephaestus cloudinfra START - Cookbook wmcs.openstack.cloudvirt.vm_console
[11:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL
[11:52:02] !log dcaro@hephaestus cloudinfra END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[11:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL
[11:52:28] RESOLVED: InstanceDown: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:52:39] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10991577 (cmooney) I created the below task to continue the discussion of how we set up the interfaces for these hosts, and cop...
[12:12:40] (approved) dcaro: d/changelog: bump to 16.1.15 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/111 (https://phabricator.wikimedia.org/T399080)
[12:12:44] (merge) dcaro: d/changelog: bump to 16.1.15 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/111 (https://phabricator.wikimedia.org/T399080)
[12:15:57] (CR) David Caro: [C:+2] "Used to deploy the latest toolforge-misctools-cli and jobs-cli, merging" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166830 (https://phabricator.wikimedia.org/T398016) (owner: David Caro)
[12:16:20] (CR) David Caro: [C:+2] "Used to deploy jobs-cli, needed now, merging" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1167829 (owner: David Caro)
[12:17:14] (merge) dcaro: ansible: use the new misctools package name [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/256
[12:20:46] (Merged) jenkins-bot: toolforge.component.deploy: support multiarch packages [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1166830 (https://phabricator.wikimedia.org/T398016) (owner: David Caro)
[12:21:21] (Merged) jenkins-bot: toolforge.component.deploy: rename jobs-cli [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1167829 (owner: David Caro)
[12:25:13] Toolforge (Toolforge iteration 21), Patch-For-Review: [lima-kilo,misctools] no arm64 version for mac-os based installations - https://phabricator.wikimedia.org/T398016#10991647 (dcaro) Open→In progress
[12:25:16] Toolforge (Toolforge iteration 21), Patch-For-Review: [lima-kilo,misctools] no arm64 version for mac-os based installations - https://phabricator.wikimedia.org/T398016#10991650 (dcaro) In progress→Resolved
[12:27:12] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/41
[12:27:12] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/4
[12:34:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[12:35:29] Toolforge (Toolforge iteration 21): [clis] standardize the package names - https://phabricator.wikimedia.org/T399080#10991666 (dcaro)
[12:37:59] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[12:42:13] PROBLEM - Host cloudcephosd1007 is DOWN: PING CRITICAL - Packet loss = 100%
[12:42:41] RECOVERY - Host cloudcephosd1007 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[12:42:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[12:47:28] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[12:58:35] FIRING: [2x] ProbeDown: Service virt.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_eqiad1_wikimediacloud_org_from_codfw_ip6) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:58:43] cloud-services-team: ProbeDown - https://phabricator.wikimedia.org/T399189 (phaultfinder) NEW
[12:58:53] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:00:54] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[13:01:50] FIRING: ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolsbeta-test-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:02:03] Tools, Pywikibot, Pywikibot-Scripts: Implement a webservice at toolforge.org based on create_isbn_edition script - https://phabricator.wikimedia.org/T379488#10991840 (Xqt) Open→Declined See T398140
[13:02:56] FIRING: SystemdUnitDown: The service unit disable-tool.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:03:10] FIRING: ProjectProxyMainProxyDown: Proxy service address is unreachable - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MainProxyDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyDown
[13:03:11] PROBLEM - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 324 bytes in 60.006 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[13:03:35] RESOLVED: [3x] ProbeDown: Service virt.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_eqiad1_wikimediacloud_org_from_codfw_ip6) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:03:39] RECOVERY - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 26.026 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[13:03:53] FIRING: [3x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:06:49] FIRING: [4x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:07:56] RESOLVED: SystemdUnitDown: The service unit disable-tool.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:08:09] !log dcaro@hephaestus project-proxy START - Cookbook wmcs.openstack.cloudvirt.vm_console
[13:08:10] RESOLVED: ProjectProxyMainProxyDown: Proxy service address is unreachable - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MainProxyDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyDown
[13:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Project-proxy/SAL
[13:08:14] !log dcaro@hephaestus project-proxy END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[13:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Project-proxy/SAL
[13:08:53] RESOLVED: [3x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:11:50] RESOLVED: [4x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:13:31] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10991863 (fnegri) The value for nf_conntrack_max was increased in {T355222} and again in {T373816}. The current value set in `modules/profile/manifests/openstack/base/nova/compute/servi...
[13:33:29] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10991952 (fnegri) Grafana shows the value changed on all cloudvirts, but not at the same time. I think the value failed to be reapplied after the latest reboot of cloudvirts: ` root@cl...
[13:43:00] (PS1) David Caro: tools-webservice: remove as it moved to gitlab [labs/codesearch] - https://gerrit.wikimedia.org/r/1167865
[13:44:44] (merge) dcaro: deploy: allow retrieving a deploy with a token [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/109
[13:45:10] (update) dcaro: tool-config: export the config schema [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/98 (https://phabricator.wikimedia.org/T397724)
[13:47:14] (merge) dcaro: tool-config: export the config schema [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/98 (https://phabricator.wikimedia.org/T397724)
[13:47:48] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.135-20250710134503-c7e0923f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/879 (https://phabricator.wikimedia.org/T398485)
[13:50:06] (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.136-20250710134726-f76face5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/879 (https://phabricator.wikimedia.org/T397724 https://phabricator.wikimedia.org/T398485)
[13:50:09] (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.136-20250710134726-f76face5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/879 (https://phabricator.wikimedia.org/T397724 https://phabricator.wikimedia.org/T398485)
[13:57:10] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10992160 (Jclark-ctr) >>! In T394333#10990367, @elukey wrote: > @Jclark-ctr IIUC it was a temporary failure right? yes that wa...
[13:59:28] FIRING: PuppetSyncFailure: Failed to update Puppet repository /srv/git/operations/puppet on instance tools-puppetserver-01 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[14:00:32] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10992170 (fnegri) The last change to the value was actually in {T387179} with patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124821 setting the value to `33554432`, which i...
[14:01:04] (update) dcaro: toolconfig: make config_version explicitly nullable [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/108
[14:02:57] (merge) dcaro: toolconfig: make config_version explicitly nullable [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/108
[14:05:11] (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.137-20250710140310-6d0932f6 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/879 (https://phabricator.wikimedia.org/T397724 https://phabricator.wikimedia.org/T398485)
[14:05:14] (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.137-20250710140310-6d0932f6 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/879 (https://phabricator.wikimedia.org/T397724 https://phabricator.wikimedia.org/T398485)
[14:06:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:24:28] RESOLVED: PuppetSyncFailure: Failed to update Puppet repository /srv/git/operations/puppet on instance tools-puppetserver-01 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[14:29:01] cloud-services-team, Toolforge: toolsbeta paging - https://phabricator.wikimedia.org/T396038#10992349 (dcaro) Was opened due to {T398715} (adding for context)
[14:31:52] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10992376 (fnegri) `sysctl --system` does fix the issue: ` root@cloudvirt1067:~# cat /proc/sys/net/nf_conntrack_max 524288 root@cloudvirt1067:~# sysctl --system root@cloudvirt1067:~#...
[14:54:02] cloud-services-team, Cloud-VPS: nf_conntrack_max is not set at boot in cloudvirts - https://phabricator.wikimedia.org/T399212 (fnegri) NEW
[14:55:06] cloud-services-team: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#10992502 (fnegri) I created a subtask {T399212} to address the root cause of the alert described by this task.
[14:57:23] cloud-services-team, Cloud-VPS: nf_conntrack_max is not set at boot in cloudvirts - https://phabricator.wikimedia.org/T399212#10992507 (fnegri) I'm not sure what is loading the `nf_conntrack` module, because the module is loaded eventually, and I can apply the values with `sysctl --system`: ` root@cloud...
[15:00:47] cloud-services-team, Cloud-VPS: nf_conntrack_max is not set at boot in cloudvirts - https://phabricator.wikimedia.org/T399212#10992512 (fnegri) Maybe it's loaded by openvswitch: ` root@cloudvirt1067:~# lsmod |grep nf_conntrack nf_conntrack_netlink 57344 0 nfnetlink 20480 3 nfnetlink_ct...
[15:03:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[15:03:47] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.reactivate (exit_code=97)
[15:03:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[15:05:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[15:07:26] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[15:07:28] (approved) dcaro: components-api: bump to 0.0.137-20250710140310-6d0932f6 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/879 (https://phabricator.wikimedia.org/T397724 https://phabricator.wikimedia.org/T398485) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[15:07:34] (merge) dcaro: components-api: bump to 0.0.137-20250710140310-6d0932f6 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/879 (https://phabricator.wikimedia.org/T397724 https://phabricator.wikimedia.org/T398485) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[15:11:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[16:36:35] (Abandoned) Wandji collins: Merge branch 'main' of https://gerrit.wikimedia.org/r/labs/tools/wdaudiolex-be [labs/tools/wdaudiolex-be] - https://gerrit.wikimedia.org/r/1132142 (owner: UnknownStrange)
[16:40:30] (Abandoned) Wandji collins: Remove duplicate routes copy file [labs/tools/wdaudiolex-be] - https://gerrit.wikimedia.org/r/1132112 (https://phabricator.wikimedia.org/T386326) (owner: Juniorbesong)
[16:43:48] (CR) Eugene233: "Change seems to be abandoned. Is this still relevant? Then it could be restored." [labs/tools/wdaudiolex-be] - https://gerrit.wikimedia.org/r/1132112 (https://phabricator.wikimedia.org/T386326) (owner: Juniorbesong)
[16:45:57] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820#10992899 (Andrew) ceph eqiad11 is now running 16.2.15 on all nodes. One mon and one OSD are on bookworm...
[17:04:19] cloud-services-team, Toolforge (Toolforge iteration 21): [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485#10992948 (dcaro) This should be already available: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Deploy_yo...
[17:04:42] cloud-services-team, Cloud-VPS, Continuous-Integration-Infrastructure (Zuul upgrade): ZuulDevOpsBot user can create but not delete a cluster template - https://phabricator.wikimedia.org/T396932#10992949 (bd808) Open→Invalid The tofu automation using ZuulDevOpsBot's credentials was able to...
[17:54:16] cloud-services-team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#10993075 (cmooney) We may need to hold off on this for now. The requirement for jumbo frames poses a difficulty for the plan as the parent i...
[18:45:15] (CR) Wandji collins: "I am doing a cleanup; this patch appeared multiple times, and the change is outdated." [labs/tools/wdaudiolex-be] - https://gerrit.wikimedia.org/r/1132112 (https://phabricator.wikimedia.org/T386326) (owner: Juniorbesong)
[18:56:24] (Abandoned) Wandji collins: T388196 BE - Route to add best match audio file to lexeme [labs/tools/wdaudiolex-be] - https://gerrit.wikimedia.org/r/1132178 (owner: UnknownStrange)
[19:02:36] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10993245 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eq...
[19:10:29] cloud-services-team, DC-Ops, ops-eqiad, SRE: "SSD firmware fetch from DELL website not yet implemented" - https://phabricator.wikimedia.org/T399234 (Andrew) NEW
[19:34:46] Cloud-VPS (Project-requests): Request creation of Clipi VPS project - https://phabricator.wikimedia.org/T399237 (IhsaanKhan) NEW
[19:41:12] Cloud-VPS (Project-requests): Request creation of Clipi VPS project - https://phabricator.wikimedia.org/T399237#10993382 (JJMC89) a:IhsaanKhan→None
[19:46:21] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10993386 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad....
[19:50:08] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/41 (owner: l10n-bot)
[19:50:10] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/41 (owner: l10n-bot)
[19:56:30] cloud-services-team, Toolforge: Missing bash completion for `become` - https://phabricator.wikimedia.org/T399238 (LucasWerkmeister) NEW
[19:58:24] cloud-services-team, Toolforge: Missing bash completion for `become` - https://phabricator.wikimedia.org/T399238#10993408 (LucasWerkmeister) CCing @dcaro based on [the latest commits in that package](https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/commits/main) (and because Taavi, the...
[19:58:41] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/4 (owner: l10n-bot)
[19:58:44] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/4 (owner: l10n-bot)
[20:04:41] cloud-services-team, Toolforge: Missing bash completion for `become` - https://phabricator.wikimedia.org/T399238#10993422 (bd808) I wonder if https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/commit/0e495dbffcfe7c14cc9b92b6a5cf937dbf795e0a needed to update more `debian/*` files when rena...
[20:09:35] cloud-services-team, Toolforge: Missing bash completion for `become` - https://phabricator.wikimedia.org/T399238#10993440 (LucasWerkmeister) Maybe… I don’t know how bash-completion files are installed in Debian packages, but they vanished from the installed files in 1.49.2: `lang=shell-session lucaswerk...
[20:40:29] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993548 (10RobH) [20:44:53] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993561 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1035.eqia... [21:28:58] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:29:16] 06cloud-services-team, 10Toolforge: Missing bash completion for `become` - https://phabricator.wikimedia.org/T399238#10993703 (10bd808) My theory would be that `debian/misctools.bash-completion` and `debian/misctools.lintian-overrides` need to be renamed to `debian/toolforge-misctools-cli.bash-completion` and... [21:31:48] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: "SSD firmware fetch from DELL website not yet implemented" - https://phabricator.wikimedia.org/T399234#10993706 (10RobH) 05Open→03Resolved a:03RobH IRC Update: The file it was looking for didn't exist on the cumin1003 host, but does on cumin20... [21:32:43] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1035.eqiad.wm... 
[21:38:58] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:47:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [21:48:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0) [22:16:55] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993757 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1036.eqia... [22:54:39] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [22:55:19] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99) [22:59:45] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [23:00:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0) [23:03:16] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993799 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1036.eqiad.wm... [23:59:24] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037.eqia...