[01:06:36] Traffic, Infrastructure-Foundations, SRE, SRE-tools: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944#11403767 (ssingh)
[09:06:48] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11404346 (fgiunchedi)
[09:09:23] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11404351 (fgiunchedi) The logical side on the host side is done. Next up is deleting the interfaces from netbox for the hosts and unplug network cables. I'll file subtasks
[09:11:01] netops, Infrastructure-Foundations: Remove extra netbox interfaces for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989 (fgiunchedi) NEW
[10:05:30] hello folks! I started a conversation with Eric about adding an LVS endpoint in front of Cassandra clusters, to facilitate a lot of things (node discovery to send a simple query, seeding during bootstrap, etc..) - https://phabricator.wikimedia.org/T410075 - Please let me know when you have a moment if there is anything that I didn't think of, or if some ideas would need to be tweaked
[10:09:49] netops, Infrastructure-Foundations: Remove extra netbox interfaces for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11404665 (cmooney) @fgiunchedi have all the physical interfaces been removed on site? Typically I would ask DC-Ops to remove in Netbox when they remove...
[10:54:47] <_joe_> elukey: wouldn't an externalservice work as well?
[10:55:03] <_joe_> given we're mostly connecting to cassandra from k8s
[10:55:50] <_joe_> and given AIUI the point is to make it easy to reach "any cassandra node"
[10:57:35] _joe_ didn't think about it, could be a good alternative. The only thing that it may be missing is that the "seeding" list of cassandra instances that we hardcode in every cassandra config may benefit from an lvs endpoint IIUC, and that pooling/depooling a node may be quicker via confctl rather than a k8s deployment (that would be the way for the external service that I can think of, if you have another idea lemme know)
[10:57:57] <_joe_> yeah I didn't think of that
[10:58:13] <_joe_> but still, any client already active will have the old list of seeds right?
[10:58:24] <_joe_> not sure if that's a problem at runtime for cassandra clients
[10:58:40] <_joe_> it's a good point
[10:59:19] so the seed list IIUC is only needed for a cassandra instance to reach out to some live instance in the cluster that it knows it can reply with the needed info, at least this is what Eric mentioned in the task.
[10:59:36] <_joe_> right
[10:59:51] <_joe_> so yeah that's probably the best solution
[11:00:30] okok thanks for the brainbounce :)
[11:07:52] netops, Infrastructure-Foundations: Remove extra netbox interfaces for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11404877 (cmooney) Open→Resolved a:cmooney > and run homer what we can do if the second port being in an "up" state on the switch is a...
[11:45:43] FIRING: [4x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[11:48:21] netops, Infrastructure-Foundations: Remove extra netbox interfaces for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11405013 (cmooney) Resolved→Open
[11:50:43] FIRING: [9x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[11:51:44] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, Thumbor: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11405026 (Ladsgroup) I picked a random path that was hit and looked the IP and basically looked at the previous and after requests at th...
[12:00:43] RESOLVED: [9x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[12:13:29] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, Thumbor: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11405067 (Ladsgroup) Okay, I checked several more cases and they all seems to be coming from rest endpoint for page summary. For example...
[12:18:54] netops, Infrastructure-Foundations, SRE: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606#11405075 (cmooney)
[12:40:59] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, Thumbor: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11405153 (Ladsgroup) I was kinda sure it was Popups and lo and behold, it's Popups: https://gerrit.wikimedia.org/g/mediawiki/extensions/...
[13:08:00] netops, Infrastructure-Foundations: Remove extra netbox interfaces for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11405226 (fgiunchedi) >>! In T410989#11404665, @cmooney wrote: > @fgiunchedi have all the cables been removed on site? > > Typically I would ask DC-Ops...
[13:18:01] netops, Infrastructure-Foundations: Remove extra netbox interfaces for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11405248 (cmooney) >>! In T410989#11405226, @fgiunchedi wrote: > I see, thank you I was not aware of the procedure and it makes sense! Yeah the main th...
[13:18:31] netops, Infrastructure-Foundations: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11405249 (cmooney)
[13:24:03] netops, Infrastructure-Foundations: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11405268 (fgiunchedi) >>! In T410989#11405248, @cmooney wrote: > Thinking it through what is probably best: > > # We disable the switch interfaces te...
[13:43:23] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11405332 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest1006.eqiad.wm...
[13:49:42] netops, Infrastructure-Foundations, SRE: Eqiad row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11405342 (cmooney)
[15:04:20] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11405805 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet...
[15:15:00] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11405866 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest1006.eqiad.wm...
[15:52:57] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11406114 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet...
[16:01:43] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp7001 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=magru&var-instance=cp7001 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[16:03:39] Traffic, Essential-Work, Experimentation Lab (Experiment Platform Sprint 16), Patch-For-Review: Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11406158 (Milimetric)
[16:16:43] RESOLVED: [2x] HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp7001 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=magru&var-instance=cp7001 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[16:29:09] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 2 others: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11406314 (MatthewVernon) Great find, thank you!
[16:41:17] Traffic, Data-Persistence, MediaViewer, Page-Previews, and 3 others: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11406425 (MatthewVernon)
[16:48:28] Traffic, Data-Persistence, MediaViewer, Page-Previews, and 3 others: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11406440 (Ladsgroup) {T411013} for longer term solution
[17:51:44] Traffic, Infrastructure-Foundations, SRE, SRE-tools: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944#11406729 (Vgutierrez) From SREBatchRunnerBase `__reboot_action()`: `lang=python puppet = self._spicerack.puppet(hosts) reboot_time = da...
[17:55:58] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11406753 (Papaul)
[18:18:28] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11406867 (Jclark-ctr) Day 10 Update: - 7 host Moved, 11 Remaining - 300 host at start of migration - John worked with Ben directly to migrate the (4) Data P...
[18:59:50] Traffic, SRE: Revisit the 1GB cache size limit for ATS - https://phabricator.wikimedia.org/T411043 (ssingh) NEW
[19:00:30] Traffic, SRE: Revisit the 1GB cache size limit for ATS - https://phabricator.wikimedia.org/T411043#11407032 (ssingh)
[19:30:22] netops, Infrastructure-Foundations, SRE: Eqiad row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11407124 (cmooney)
[20:29:22] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 3 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11407342 (cmooney) @BCornwall thanks for the gerrit reviews! Could you have a look at t...
[20:29:38] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 3 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11407343 (cmooney) @BCornwall thanks for the gerrit reviews! Could you have a look at t...
[20:38:51] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11407364 (RobH)
[20:46:19] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11407373 (RobH) New host count: 7 host Moved, 11 Remaining - 308 host at start of migration (counting the 8 John audited and filed a task for)
[21:23:12] Traffic: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792#11407529 (Krinkle) By default MediaWiki core actually has a rate limit for this, but it seems we neglected to port this to Thumbor. It is based on `wfThumbIsStandard` (via wgThumbLimits, wgImageLimit, and...
[22:01:33] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054 (cmooney) NEW p:Triage→Medium
[22:01:43] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11407632 (cmooney)
[22:03:18] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11407640 (cmooney)
[22:09:56] netops, Infrastructure-Foundations, SRE: Eqiad row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11407654 (cmooney)
[22:50:18] Traffic, Hiddenparma, Patch-For-Review: Introduce known-client identity objects and integrate with requestctl - https://phabricator.wikimedia.org/T403220#11407725 (Scott_French)
[22:52:27] Traffic, Hiddenparma, Patch-For-Review: Introduce known-client identity objects and integrate with requestctl - https://phabricator.wikimedia.org/T403220#11407729 (Scott_French)
[22:53:47] Traffic: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792#11407730 (Ladsgroup) I don't know the details of ForeignAPIFile but the API endpoints actually return the correct standardized thumb urls. See for example: https://commons.wikimedia.org/w/api.php?action=q...
[22:54:17] Traffic, Hiddenparma, Patch-For-Review: Introduce known-client identity objects and integrate with requestctl - https://phabricator.wikimedia.org/T403220#11407731 (Scott_French) For later reference, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203187 is the varnishtest setup used to validate...