[00:29:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:29:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:50:26] there are active alerts on replication on db2239 and db1150
[06:51:06] those are replication sources running their backups
[06:51:34] thanks
[06:51:46] you can see that by checking their port: it is different from 3306, and they also have notifications disabled
[06:51:51] So that's a backup source
[06:52:38] maybe there could be a way to make the warning not show up on alertmanager as well?
[06:54:03] I don't know enough about alertmanager, maybe o11y knows a way
[07:10:03] the reason they warn is because puppet lacks config for ignoring lag up to some time
[07:10:32] the check needs an option for max lag, but there is not one
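(Editor's note: a minimal sketch of the kind of lag check discussed above, assuming pymysql and placeholder hostnames, ports and credentials rather than the real production check. It reads the lag from an instance listening on a non-3306 port, as backup sources do, and only warns above a configurable max lag, which is the option the current check is said to be missing.)

#!/usr/bin/env python3
# Sketch only: not the production Icinga/Puppet check. Host, port, user and
# password below are placeholders.
import pymysql.cursors

MAX_LAG_SECONDS = 3600  # hypothetical "max lag" option the current check lacks


def replication_lag(host, port, user, password):
    """Return Seconds_Behind_Master, or None if replication is not running."""
    conn = pymysql.connect(host=host, port=port, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()
    return status.get("Seconds_Behind_Master") if status else None


if __name__ == "__main__":
    # A backup source typically listens on a port other than 3306.
    lag = replication_lag("db2239.example.org", 3316, "monitor", "secret")
    if lag is None:
        print("replication not configured or not running")
    elif lag > MAX_LAG_SECONDS:
        print(f"WARNING: lag {lag}s exceeds {MAX_LAG_SECONDS}s")
    else:
        print(f"OK: lag {lag}s (expected while backups run)")

(Presumably such a threshold would be configured through Puppet alongside the existing check, per the 07:10 comments; the sketch only illustrates the idea.)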
[07:48:29] I am going to switch x3 codfw
[08:06:02] Now I am going to switch s6 codfw
[08:11:50] marostegui: I've left one es* host to be updated if you are curious to test the new script, otherwise I can just run it so we are done with the es* sections
[08:12:11] federico3: go for it, I can always run it for the next release :)
[08:12:28] ok
[08:29:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:08] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:18:32] PROBLEM - MariaDB sustained replica lag on x1 on db2196 is CRITICAL: 21.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2196&var-port=9104
[10:19:32] RECOVERY - MariaDB sustained replica lag on x1 on db2196 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2196&var-port=9104
[10:21:26] jynus: I'm a bit confused about https://phabricator.wikimedia.org/T393990#11008053 - the dbstore1009:3350 instance is in the "staging" section and appears to have no cumin2024 user at all; also, the root user does not have cumin1002's IP address in the grants
[10:23:44] We can probably forget about that instance on that host
[10:23:51] We never use it and we do not "own" it
[11:53:14] also I have no access to db1208:3352 and db1208:3351
[12:05:12] yeah those are from analytics
[12:16:50] if I could try to push back, maybe they are from analytics (what's that department?) but I am the one in charge of backing them up
[12:17:16] So sometimes I need access to check issues and restore them
[12:22:52] but if that still is not enough, please consider removing the grants from production.sql https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/mariadb/grants/production.sql.erb$41 and removing the hosts from zarcillo for consistency (which will remove their monitoring)
[12:26:30] as otherwise the puppet role and the grants will not match
[12:28:20] I do think if there's access from cumin1002 there should be access from cumin1003
[12:28:51] clearly that was missed for cumin1002 too
[12:29:01] as it was an old bug, IMHO
[12:29:53] but it doesn't make sense to me to have access to dbstore1005:s1 and not dbstore1005:staging
[12:30:04] independently of who takes care of the data
[12:30:24] (I take care of them partially, for the backups and restores)
[12:30:36] yeah
[12:31:14] of course, the account, I don't care if it is root or anything else
[12:31:23] I just need the access
[18:31:33] Hi All, I’m adjusting my calendar to start my workday two hours earlier on Mondays and Tuesdays to increase my overlap with the EU time zone. Would it be possible to move up the Data Persistence team meeting by half an hour? (especially urandom: )
[18:33:02] kavitha: if it's not too early for you, then it shouldn't be for me šŸ˜€
[18:33:44] Cool, thank you so much for being flexible. I will propose the change :-)
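(Editor's note: returning to the grants thread from 10:21–12:31, a minimal sketch of how one might audit which cumin accounts an instance actually has and what grants they hold, assuming pymysql and placeholder hostnames and credentials. The authoritative grant definitions remain in production.sql.erb in operations/puppet and the host inventory in zarcillo; this only inspects what a given instance currently contains.)

#!/usr/bin/env python3
# Sketch only: placeholder host, port and credentials.
import pymysql.cursors


def audit_cumin_grants(host, port, user, password):
    conn = pymysql.connect(host=host, port=port, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            # Which cumin* accounts exist, and from which hosts may they connect?
            cur.execute("SELECT User, Host FROM mysql.user WHERE User LIKE %s",
                        ("cumin%",))
            accounts = cur.fetchall()
            if not accounts:
                print(f"{host}:{port}: no cumin accounts found")
                return
            for acct in accounts:
                acct_user, acct_host = acct["User"], acct["Host"]
                # pymysql quotes the parameters, producing e.g.
                # SHOW GRANTS FOR 'cumin2024'@'10.x.x.x'
                cur.execute("SHOW GRANTS FOR %s@%s", (acct_user, acct_host))
                print(f"{host}:{port}: {acct_user}@{acct_host}")
                for row in cur.fetchall():
                    # Each row of SHOW GRANTS is a single GRANT statement.
                    print("   ", next(iter(row.values())))
    finally:
        conn.close()


if __name__ == "__main__":
    # Placeholder instance; the "staging" section on dbstore1009 is on port 3350.
    audit_cumin_grants("dbstore1009.example.org", 3350, "root", "secret")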