[09:08:21] Emperor: o/ morning :) I have been testing the apus S3 API via the docker registry, and I have a question about replication (especially after the last experience with Swift). So far I have pushed objects to apus-fe eqiad, and s3cmd ls shows the same data whether I use the apus-fe eqiad or codfw endpoint in its config.
[09:08:44] Is there some automatic replication?
[09:11:56] elukey: yes.
[09:12:01] [more details coming]
[09:14:50] elukey: Ceph itself handles replication between the two clusters (it's asynchronous, as the alternative would be a single cluster that spanned both DCs, but that would mean that write latency etc. would be dependent on the inter-DC link). See https://docs.ceph.com/en/reef/radosgw/multisite/#diagram-replication-of-object-data-between-zones for upstream details, and https://grafana.wikimedia.org/goto/DkFg0XVDR?orgId=1 for the dashboard.
[09:14:50] Unhelpfully, that doesn't include any metric of how "caught up" things are.
[09:16:19] If you want to check at any particular time, run 'sudo cephadm shell -- radosgw-admin sync status' on one of the controller nodes (cf https://wikitech.wikimedia.org/wiki/Ceph/Cephadm#Interacting_with_the_cluster )
[09:21:12] Emperor: thanks a lot for the explanation! I'll look for more data; as you mentioned, the big bummer is that we don't have a good metric to use
[09:21:30] but overall it seems better than the current status
[09:22:53] IIUC we can now fail over between DCs for the docker registry endpoint, maybe not go as far as being active/active
[09:23:11] elukey: Squid (the next release) includes some more headers that you can check ( https://ceph.io/en/news/blog/2025/rgw-multisite-sync-status/ ), but I'm afraid I have no timetable for moving off Reef just now
[09:23:44] elukey: yes, anything you upload to apus should be available in both DCs (modulo replication delay)
[09:24:44] super, for the moment we are good, we are still experimenting. We may get to move the MediaWiki images to apus this month, but the whole registry will take time. We can chat about it in Lisbon if you have time!
[09:24:56] and also involve management for prioritization
[09:26:13] the whole point of Lisbon is to have time to chat about stuff like this :)
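A minimal sketch of the replication check discussed above, assuming an s3cmd configuration with apus credentials; the endpoint hostnames and bucket name are illustrative placeholders, and the sync-status command is the one quoted at 09:16:19:

    # List the same bucket through both DC endpoints; with multisite
    # replication working, the listings should match (modulo delay).
    # Hostnames below are placeholders, not the real apus-fe service names.
    s3cmd --host=apus-fe.eqiad.example --host-bucket=apus-fe.eqiad.example ls s3://docker-registry/
    s3cmd --host=apus-fe.codfw.example --host-bucket=apus-fe.codfw.example ls s3://docker-registry/

    # On one of the controller nodes, ask RGW how caught up the zones are:
    sudo cephadm shell -- radosgw-admin sync status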
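Once the cluster is on Squid, the per-object check hinted at above should also become possible; this sketch assumes Ceph exposes the AWS-style ReplicationStatus on HEAD, as the linked blog post suggests, and that awscli is configured for apus (bucket, key, and endpoint are placeholders):

    # HEAD a replicated object and inspect its replication status;
    # expect a value like PENDING, COMPLETED, or REPLICA in the output.
    aws --endpoint-url https://apus-fe.codfw.example \
        s3api head-object --bucket docker-registry --key test-object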
[10:30:27] Emperor: totally unrelated - Jesse found the issue with the md-uuids mismatching in UEFI hosts https://gerrit.wikimedia.org/r/c/operations/puppet/+/1225021
[10:54:05] FIRING: MySQLReplicaNotUsingGTID: MySQL replica db1195:9104 not using GTID - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/d/0fec1d02-1b0b-44c0-84b0-64894f3ba682/mariadb-gtid - https://alerts.wikimedia.org/?q=alertname%3DMySQLReplicaNotUsingGTID
[10:54:14] federico3: ^ nice
[10:54:17] Going to enable it back
[10:54:57] \o/
[10:55:18] Waiting for the recovery now
[10:59:05] RESOLVED: MySQLReplicaNotUsingGTID: MySQL replica db1195:9104 not using GTID - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/d/0fec1d02-1b0b-44c0-84b0-64894f3ba682/mariadb-gtid - https://alerts.wikimedia.org/?q=alertname%3DMySQLReplicaNotUsingGTID
[10:59:23] Nice! ^
[11:03:37] good work
[11:31:52] elukey: ah, cool :)
[14:56:38] federico3: when I said provisioning, I meant hw provisioning, not db
[14:57:10] procurement and installation would be more precise terms, I guess
[15:20:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:35:25] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
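For SystemdUnitFailed alerts like the swift_rclone_sync one above, a quick triage sketch on the affected host, using only standard systemd tooling:

    # On ms-be2069: see why the unit failed and read its recent log.
    systemctl status swift_rclone_sync.service
    journalctl -u swift_rclone_sync.service --since "1 hour ago"
    # For a transient failure, the next timer run (or a manual restart)
    # clears the alert:
    sudo systemctl restart swift_rclone_sync.service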
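And, returning to the MySQLReplicaNotUsingGTID alert earlier in the log: a minimal sketch of re-enabling GTID on a MariaDB replica such as db1195, assuming plain MariaDB replication syntax rather than any WMF-specific tooling:

    # Switch the replica back to GTID-based replication:
    sudo mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=slave_pos; START SLAVE;"
    # Verify: Using_Gtid should now read Slave_Pos.
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep Using_Gtid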