[01:54:25] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[05:54:25] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:41:28] volans|off: wrt https://phabricator.wikimedia.org/T390860, I have an initial PoC patch (https://gerrit.wikimedia.org/r/1167299) that I'd like to test before proceeding on the patch further
[06:42:07] volans|off: what's the best way to do that? basically I want to be able to create an ElasticsearchCluster object and call one of the methods i've modified
[08:16:05] ryankemper: I had left few suggestions in that patch last week, did you see them?
[08:20:37] it can be simplified a lot taking advantage of the existing featuers
[08:20:41] *features in spicerack
[08:29:23] once cleaned up and ready for testing you can do it with https://etherpad.wikimedia.org/p/volans-tmp3
[09:54:25] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:47:17] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973#11008647 (Volans) Open→Resolved
[13:54:25] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:00:14] topranks, XioNoX - I noticed
[14:00:14] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:00:30] is it WIP or to be fixed
[14:00:30] ?
[14:01:24] elukey: thanks! Yep it needs to be fixed, the cookbook will do it afaik I’ll take a look in a moment
[14:01:38] ack!
[14:13:08] nice to see that alerting works as expected :)
[14:13:36] yep - and the cookbook also working fine, nice work you two!
[14:13:38] I'm wondering if we could add the network scope so it shows up in https://alerts.wikimedia.org/?q=scope%3Dnetwork
[14:13:55] yeah, or team=netops
[14:15:43] I prefer not to have that :) and not do anything more narrow than I/F for the team
[14:18:55] topranks: do you have to run it for all?
[14:19:20] volans: I haven't implemented the `all` option yet
[14:19:29] I know but we could right now :D
[14:19:34] instead of the spam :-P
[14:19:34] the cookbook only takes one host as parameter
[14:22:13] it's not that bad, the cookbook doesn't take any input so a few "for i in {1..8}; do..." in bash got the rows done
[14:22:53] but yeah apologies for the spam, should be done now
[14:23:47] volans: also a candidate for the auto-remediation :) AM could call the cookbook automatically
[14:24:35] if we had the 'all' we could run it on a cron maybe, first thing it does it check there is a cert, it will only generate and install a new one if there is none there already or expiry is less than 28 days
[14:25:20] auto-remidiation also a good candidate yep
[14:26:05] elukey: ^^ take note ;)
[14:30:24] while we're on the topic: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1161337 for when all the cumin hosts are on bookworm
[17:29:55] FIRING: MaxConntrack: Max conntrack at 84.93% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:39:55] RESOLVED: MaxConntrack: Max conntrack at 84.48% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:53:02] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010414 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5332ad34-f45a-4d5c-9180-ace7ebb578e8) set by cmooney@cumin1003 for 0:15:00 on 1 ho...
[17:54:25] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:06:19] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010441 (cmooney) The replacement optic module arrived on site in the past hour and we have replaced it now. I have un-drained the Arelion backhaul circuit...
[18:08:44] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010458 (cmooney) iperf test is also clean: ` cmooney@cp5017:~$ iperf -s -i1 -u -w512k ------------------------------------------------------------ Server l...
[18:12:21] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010479 (cmooney) p:High→Medium
[20:34:48] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010962 (ssingh) Things were stable for a few hours even after @cmooney made the fix above but starting ~20:00 UTC, we had a page for text-https in eqsin an...
[21:54:25] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[22:04:28] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11011234 (cmooney) I've updated the ticket with Arelion to advise we have been able to replace the optic, and despite the apparat improvement at first we sti...