[06:02:31] akosiaris: hey [06:02:38] legoktm: o/ [06:03:42] so https://phabricator.wikimedia.org/T255250 has some patches on it, should we resurrect those? [06:05:44] yeah, https://gerrit.wikimedia.org/r/c/operations/puppet/+/614894 looks correct to me [06:06:14] another interesting thing is that in a recent discussion with Luca we established that we can have a small outage for ORES [06:06:35] so, https://phabricator.wikimedia.org/T255250#6321852 doesn't apply anymore [06:06:55] btw, the list of the services that live in that cluster is here: https://wikitech.wikimedia.org/wiki/Redis [06:07:54] the ores queue one is memory only, so on the switch we will lose that anyway (and that's fine and by design). the ores cache, it would be nice if it has been replicated (which we aim to do), but not the end of the world if it isn't [06:08:32] * legoktm nods [06:09:56] so merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/614894 and forcing a puppet run will automatically start replication? [06:10:00] the other 3, we want to have them fully replicated before we switch. docker-registry uses redis also just as a blob cache, but we haven't ever tested how it will behave without it [06:10:31] yes, it will. That's a good first step [06:10:59] let me refresh https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/614901/ for changeprop [06:11:01] how long do you think replication will take? [06:12:38] the most demanding one is ORES as it's close to 16GB IIRC. [06:12:47] so.. something like 3-4mins [06:14:57] ack, running puppet now [06:18:37] oh damn. I just found out we have another user [06:18:42] api-gateway [06:18:54] it uses the exact same instance as changeprop [06:19:06] let's document that [06:19:09] :| [06:21:03] do I need to check every redis instance for e.g. "master_last_io_seconds_ago:0" to ensure replication finished? or if I just check ORES is that good enough?
[06:22:16] master_sync_in_progress:0 [06:22:20] that's what we want [06:22:43] for i in {6378..6382}; do echo "info replication" | redis-cli -p $i -a ; done [06:22:52] that's a quick hack for seeing the status [06:23:20] and only 6378 is still syncing, which is expected I guess [06:24:15] I see both as finished? [06:26:27] yup! [06:26:38] so... like 3-4 mins after all [06:27:01] ok, https://wikitech.wikimedia.org/w/index.php?title=Redis&diff=1909617&oldid=1893689 updated [06:27:19] also is https://gerrit.wikimedia.org/r/c/operations/puppet/+/615163/ useless now that scb is gone? [06:29:10] indeed. Abandon it [06:29:28] should those scb.yaml files be deleted? they're still in puppet [06:33:56] really? sigh [06:34:51] let me add it to https://gerrit.wikimedia.org/r/c/operations/puppet/+/676906 then [06:35:11] based on https://codesearch.wmcloud.org/search/?q=rdb200%5B34%5D&i=nope&files=&excludeFiles=&repos= it looks like this doesn't affect docker-registry or netbox? [06:36:03] do we effectively have to coordinate the api-gateway/changeprop/jobqueue deploys and the puppet one at the same time? [06:38:15] ok, wait, now I am confused a bit [06:38:26] so, rdb200[35] have the same role [06:38:59] and respectively 4678 now [06:40:05] ah, 5 and 6 are for some reason tracked under https://phabricator.wikimedia.org/T266721 [06:41:09] :ooo I didn't realize we had those too [06:41:37] yeah, I had to refer to the SvcOps OKRs doc to figure it out [06:41:51] we have dropped the ball on following up with the proper task after the racking ones [06:42:28] there's also (rdb1011|rdb1012)\.eqiad\.wmnet [06:42:36] (still role(insetup)) [06:43:50] yup, the eqiad counterparts. supposedly a followup for https://phabricator.wikimedia.org/T266724 [06:44:27] replacements for 5 and 6 [06:45:28] I just found https://gerrit.wikimedia.org/r/c/operations/puppet/+/670850 [06:45:33] these are all stretch still [06:45:50] * akosiaris cries [06:45:51] including 78...should we reimage them first? 
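The quick status check above (piping `info replication` into redis-cli for each port and eyeballing `master_sync_in_progress`) can be sketched as a small parser. The INFO field names are Redis's standard replication fields; the sample text and hostname below are illustrative, not captured from the actual hosts.

```python
# Sketch: decide whether a replica has finished its initial sync, based on
# the text of "INFO replication". In production you'd feed it the output of
# `redis-cli -p <port> info replication`; here we parse a canned sample.

def parse_info(text):
    """Parse redis INFO output ("key:value" lines) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers like "# Replication"
        key, _, value = line.partition(":")
        info[key] = value
    return info

def sync_finished(info):
    """A replica is caught up when the link is up and no bulk sync is running."""
    return (info.get("master_link_status") == "up"
            and info.get("master_sync_in_progress") == "0")

sample = """# Replication
role:slave
master_host:rdb2003.codfw.wmnet
master_link_status:up
master_sync_in_progress:0
master_last_io_seconds_ago:0
"""

print(sync_finished(parse_info(sample)))  # True once the initial sync is done
```

The same logic extends naturally to looping over the port range from the shell one-liner (6378..6382) and flagging any instance where `sync_finished` is false.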
[06:47:24] just checked with cumin, all rdb* is stretch [06:47:25] ok, wait a moment, regroup moment [06:48:09] https://etherpad.wikimedia.org/p/rdbmess-2021 [06:53:16] * akosiaris is recomposing the picture now in his mind [06:54:34] I did a quick search and didn't find anything either way on if redis v5 can replicate from redis v3 [06:57:23] ok, I guess that's 1 actionable then. To test it [06:59:02] is there any reason we would want to leave these on stretch? [06:59:55] aside from replication breaking and not working ? [07:00:00] I guess no [07:00:37] I guess the priority of getting these racked and working is higher than getting off stretch [07:01:23] but it seems like we have to do the same amount of transition work when we want to reimage that we might as well do now (if it works) [07:01:53] both true [07:01:55] we can reimage rdb2007 and if replication works then it works but if not we ... move ahead with stretch for now? [07:02:51] let me see if we can avoid a senseless reimaging, spinning up 2 redis instances in docker should be faster [07:04:47] ok :D [07:05:08] let me work on puppet patches for the other servers that we need anyways [07:05:29] ok, thanks [07:05:39] I 'll finish up the deployment-charts patches as well [07:05:46] I am halfway through anyway [07:10:24] 10serviceops, 10SRE: Replace rdb2005, rdb2006 with rdb2009, rdb2010 - https://phabricator.wikimedia.org/T281216 (10Legoktm) p:05Triage→03High [07:15:30] our base layer is looking really solid at this point, if 3.2 can replicate against 6.0, you can also skip Buster and start with Bullseye right away [07:16:22] "go big or go home" [07:16:36] there's a few remaining glitches, but all expected to get fixed in puppet in the next days or fixes in Debian which will migrate in the next days [07:18:06] rotfl [07:18:18] ok 5 replicates from 3 [07:18:28] redis5_1 | 1:S 27 Apr 07:17:14.018 * MASTER <-> SLAVE sync: Flushing old data [07:18:28] redis5_1 | 1:S 27 Apr 07:17:14.018 * MASTER <-> SLAVE sync:
Loading DB in memory [07:18:29] redis5_1 | 1:S 27 Apr 07:17:14.019 * MASTER <-> SLAVE sync: Finished with success [07:18:43] moritz is tempting me right now [07:19:42] I'm sure it'll be fine, I'm also migrating the LDAP replicas to bullseye currently [07:19:56] 10serviceops, 10SRE, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 1 day, 0:00:00 2 host(s) and their services with reason: for zookeeper migration ` conf[2002-2003].codfw.wmnet ` [07:20:14] "I'm sure it'll be fine" -- Moritz, 2021-04-27 [07:20:25] Debian spent a lot of effort on automated tests via autopkgtests, this is beginning to show in a smoother release process with fewer surprises [07:20:52] ^.^ I'm down to try bullseye though [07:21:44] 10serviceops, 10SRE, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 2:00:00 3 host(s) and their services with reason: for zookeeper migration ` conf[2004-2006].codfw.wmnet ` [07:21:50] 10serviceops, 10SRE: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10Legoktm) p:05Triage→03High [07:22:04] 10serviceops, 10decommission-hardware: decommission rdb100[56].eqiad.wmnet - https://phabricator.wikimedia.org/T273139 (10Legoktm) [07:22:06] 10serviceops, 10SRE: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10Legoktm) [07:22:09] 10serviceops, 10decommission-hardware: decommission rdb100[56].eqiad.wmnet - https://phabricator.wikimedia.org/T273139 (10Legoktm) 05Open→03Stalled [07:24:39] legoktm: it does replicate [07:24:54] at least I did try a single key and it did just fine [07:25:30] redis_version:6.0.11 [07:25:45] redis6_1 | 1:S 27 Apr 2021 07:24:17.961 * Loading RDB produced by version 3.2.6 [07:25:45] redis6_1 | 1:S 27 Apr 2021 07:24:17.961 * RDB age 0
seconds [07:25:45] redis6_1 | 1:S 27 Apr 2021 07:24:17.961 * RDB memory usage when created 0.77 Mb [07:25:45] redis6_1 | 1:S 27 Apr 2021 07:24:17.961 * MASTER <-> REPLICA sync: Finished with success [07:25:54] it even knew it was replicating from 3.2!!! [07:26:04] antirez rules as always [07:26:06] :oo [07:26:42] well, there is 1 thing that will be good if we go straight to bullseye [07:26:50] we will care about this again in 6 years from now [07:27:20] and I really don't see a reason for buster if bullseye is an option [07:27:26] both are equally untested [07:27:58] yeah, and if there's actually a bug somewhere still in 6/bullseye, it's also far easier to get it fixed [07:28:02] both upstream and in Debian [07:28:35] legoktm: I think it's like half past midnight over there [07:29:01] this is my most productive hour :p [07:29:04] lol [07:29:13] yeah, I can remember those days. They were fun [07:30:35] let me make one last check (that the config as is works on a redis 6) and maybe we can try to reimage rdb200{7,8} to bullseye [07:31:27] sounds good :D [07:37:18] I am impressed [07:37:23] it worked without a hitch [07:37:31] we are good to go on a reimage I think [07:39:34] moritzm: dare I ask why redis is marked with an epoch of 5 in bullseye? 5:6.0.11-1 that is [07:40:35] the stretch one also uses an epoch, doesn't it? [07:41:45] yeah but it's 3. 3:3.2.6-3+deb9u3. Buster is 5, that is 5:5.0.3-4+deb10u3 [07:41:46] yeah, was introduced sometime in 2009 [07:41:49] stretch had an epoch of 3, buster was at 5 [07:41:57] 5:6.0.11 is ... confusing ? [07:42:02] to battle a different pre-release version scheme [07:42:07] they probably stopped bumping it once the policy requirement came in requiring epoch bumps to go through debian-devel [07:42:21] yeah, the maintainer bumped the epoch as well a few times to reduce confusion: [07:42:37] redis (3:3.2.3-2): [07:42:45] Bump epoch as the "2" prefix makes it look like we are shipping version 2.x [07:42:46] of Redis itself.
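The throwaway Docker test above (a redis 3 master with a newer-version replica pointed at it) can be reproduced with a compose file along these lines. The image tags, service names, and `--slaveof` wiring are illustrative assumptions, not the exact setup that was used:

```yaml
# Minimal sketch of the cross-version replication test: a redis 3.2 master
# and a redis 6 replica pointed at it. Run with `docker-compose up`, then
# `docker-compose exec redis6 redis-cli info replication` to watch the sync.
version: "3"
services:
  redis3:
    image: redis:3.2
  redis6:
    image: redis:6.0
    # "slaveof" (the pre-5.0 spelling) still works; 5.0+ also accepts "replicaof"
    command: ["redis-server", "--slaveof", "redis3", "6379"]
```

Setting a key on `redis3` and reading it back from `redis6` confirms the replication path end to end, which matches the single-key check mentioned in the log.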
[07:43:23] but now they are required to be rubberstamped before bumping epoch so "no, thanks" or what? [07:43:49] maybe it was just forgotten for redis 6, not sure [07:43:51] you have to justify bumping the epoch because they break various tooling [07:43:52] actually, never had to work around this... can they dump the epoch? [07:44:03] or they are now married to it for life? [07:44:04] no, once you picked an epoch it sticks [07:44:09] your only hope is that upstream goes proprietary and you switch to a free fork :-) [07:44:10] unless you rename the package or smth [07:44:18] lol [07:44:41] the mediawiki package has a 1: epoch, but it's good that MW will be 1.x forever [07:44:54] yeah, I 've heard that before [07:45:06] linux would be a 2.x.y forever [07:45:30] that aged well [07:46:39] so, interestingly rdb2005 (and thus 6 as well, as it's its slave), is also using redis db1 (and not just 0) [07:46:56] for a total of 24 keys, but who's counting? [07:49:18] 10serviceops, 10Patch-For-Review: Replace rdb200[34] with rdb200[78] - https://phabricator.wikimedia.org/T255250 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eqiad.wmnet for hosts: ` ['rdb2007.codfw.wmnet', 'rdb2008.codfw.wmnet'] ` The log can be found in `/var/log/wmf-au... [07:50:31] akosiaris: what's it used for? [07:51:36] legoktm: netbox reportedly [07:51:57] I 'll dump keys to make sure, but this db0 vs db1 is ... weird [07:55:33] but netbox servers are also running a redis on localhost:6380 ? [07:55:37] * akosiaris puzzled [07:55:41] better call volans! [07:55:52] akosiaris: what's up? [07:55:58] lol [07:56:07] talking about instant gratification ;) [07:56:18] anyway TL;DR is redis + netbox [07:56:50] we have in https://wikitech.wikimedia.org/wiki/Redis netbox as a user of the redis misc cluster port 6381 [07:57:16] netbox used to use the cluster redis, then was moved to its local redis instance, it's just for caching, no need to have replication or clusterization.
The cleanup of the existing data in the main redis cluster was supposed to happen as part of the task that migrated it to the local one. [07:57:21] I gather it never happened :/ [07:57:40] ok, I was about to post 4 different followup questions, you just answered them all [07:57:42] thanks! [07:57:58] TL;DR you can nuke it all [07:58:10] nuking it as instructed! thanks! [07:58:20] dorry about that [07:58:26] sorry about that [07:59:42] no worries, we have dropped the ball on that as well [08:01:20] * volans trying but failing to find the related phab task [08:08:56] 10serviceops, 10Patch-For-Review: Replace rdb200[34] with rdb200[78] - https://phabricator.wikimedia.org/T255250 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2007.codfw.wmnet', 'rdb2008.codfw.wmnet'] ` Of which those **FAILED**: ` ['rdb2007.codfw.wmnet', 'rdb2008.codfw.wmnet'] ` [08:09:28] 08:06:40 | rdb2007.codfw.wmnet | Unable to run wmf-auto-reimage-host: Failed to puppet_first_run [08:09:39] same with rdb2008 [08:14:06] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, unsupported: facts['os']['release'['major'] (bullseye/sid) is not a number (file: /etc/puppet/modules/debian/manifests/init.pp, line: 23, column: 9) on node rdb2007.codfw.wmnet [08:14:08] moritzm: ^^ [08:14:31] legoktm: is that verbatim? missing square bracket [08:14:44] also, why are you still awake... go to bed! [08:15:00] yes, verbatim [08:15:08] 00:29:01 this is my most productive hour :p [08:15:29] legoktm: that's one of the few glitches I mentioned earlier, that has been changed in base-files 11.1, which will migrate to bullseye in three days [08:15:51] volans: I tried and I failed already at that. 
Can't blame him, I said the same thing back then [08:15:55] in the mean time you can edit /etc/debian_version and change "bullseye/sid" to "11.0" [08:16:08] then the rest of Puppet will run just fine [08:16:12] ack [08:16:21] fail("unsupported: facts['os']['release'['major'] (${facts['os']['release']['major']}) is not a number") <-- I'll submit a patch in a moment for that patch [08:16:27] for that typo* [08:17:01] but "bullseye/sid" is not a number :-) [08:17:07] or what do you mean? [08:17:25] the message says facts['os']['release'['major'], missing a ] after release [08:18:28] ah, that one [08:18:35] puppet is running now, thanks [08:19:55] 10serviceops, 10decommission-hardware: decommission rdb200[34].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (10akosiaris) [08:19:56] 10serviceops, 10Patch-For-Review: Replace rdb200[34] with rdb200[78] - https://phabricator.wikimedia.org/T255250 (10akosiaris) [08:20:09] 10serviceops, 10decommission-hardware: decommission rdb200[34].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (10akosiaris) 05Open→03Stalled Stalling until T255250 is completed. [08:23:25] Error: /Stage[main]/Redis/File[/etc/redis/redis.conf]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/redis/redis-bullseye.conf [08:24:26] source => "puppet:///modules/redis/redis-${::lsbdistcodename}.conf", [08:25:37] what a great trap [08:25:45] redis-buster, redis-stretch, redis-jessie are all the same [08:25:50] the only different one? redis-trusty [08:31:29] rotfl [08:31:44] I think we can get rid of it [08:32:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/682901/1 [08:33:27] I added redis-bullseye for now just since the cleanup is a bit more involved and should get reviewed [08:33:34] ok [08:36:37] both rdb200[78] should be replicating now [08:36:53] legoktm: I am gonna add rdb20[09|10] as well to the mix. 
Since some of the clients we are switching over use the 2 pairs for failover we probably want to do both now [08:36:55] \o/ [08:37:53] ok, I put up patches for all of those already [08:38:14] moritzm: any reason we can't import the fixed base-files with ~wmf1 instead of waiting for it to migrate? [08:41:42] akosiaris: is it okay if I reimage everything as bullseye now or should we just do 7-10 and make sure it works? [08:45:04] legoktm: by everything meaning the 2 rdb10* servers as well? [08:45:09] I think it's ok [08:45:10] yes [08:45:15] yeah, go for it [08:46:06] * legoktm does [08:47:10] legoktm: didn't really feel like it was worth the hassle, the fix migrates in three days and the workaround is easy to apply [08:47:49] it means that wmf-auto-reimage doesn't magically work :( [08:48:56] 10serviceops, 10SRE: Put rdb20[09|10] into service - https://phabricator.wikimedia.org/T281225 (10akosiaris) [08:49:02] it still works? you only need to apply the workaround via install_console out of band [08:49:07] 10serviceops, 10SRE, 10Patch-For-Review: Replace rdb2005, rdb2006 with rdb2009, rdb2010 - https://phabricator.wikimedia.org/T281216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eqiad.wmnet for hosts: ` ['rdb2009.codfw.wmnet', 'rdb2010.codfw.wmnet'] ` The log can be foun... [08:49:32] if the reimage fails at the first puppet run and you apply a manual fix you're skipping other steps the reimage does after puppet [08:49:52] just FYI [08:50:03] oh, what other steps should I be running? [08:50:47] 10serviceops, 10SRE, 10Patch-For-Review: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eqiad.wmnet for hosts: ` ['rdb1011.eqiad.wmnet', 'rdb1012.eqiad.wmnet'] ` The log can be foun...
[08:51:38] Run the Netbox script to update the device with its interfaces and related IPs [08:51:38] Unmasks the masked systemd units [08:51:51] (from https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reimage) [08:52:49] that Netbox link is broken [08:53:42] doh [08:53:43] checking [08:53:45] also a reboot [08:55:00] ah that would explain why I got only an IPv4 and not an IPv6 for rdb2007 and rdb2008 [08:55:09] legoktm: fixed link [08:55:23] whereas I got an IPv6 for 09 and 10 [08:56:46] volans: do I need to do anything manually for the unmasking part? [08:57:29] just did the netbox part [08:57:44] and then I guess I can reboot 07 and 08 [08:58:37] I'm probably going to sleep once the reimages finish [08:58:50] legoktm: finally! :P [08:58:59] I 'll take over and switch the apps [08:59:09] I 've got patches for those already [08:59:53] legoktm: if you didn't pass the masking option to start with no [09:01:03] ack [09:02:49] oh it seems it's a party here [09:04:00] Amir1: and it includes ORES !!! [09:04:14] 10serviceops, 10SRE, 10Patch-For-Review: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb1011.eqiad.wmnet', 'rdb1012.eqiad.wmnet'] ` Of which those **FAILED**: ` ['rdb1011.eqiad.w... [09:04:29] I'm literally making popcorn now [09:04:39] rotfl [09:05:22] 10serviceops, 10SRE, 10Patch-For-Review: Replace rdb2005, rdb2006 with rdb2009, rdb2010 - https://phabricator.wikimedia.org/T281216 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2009.codfw.wmnet', 'rdb2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['rdb2009.codfw.w...
[09:19:59] Here you are, fresh out of microwave [09:20:03] https://usercontent.irccloud-cdn.com/file/CGMahHHX/image.png [09:20:18] rotfl [09:20:18] :D :D :D [09:24:20] oh gosh [09:25:01] akosiaris: all hosts reimaged and rebooted, handing it over to you now :) [09:25:09] great, thanks! [09:25:12] go to sleep now ;-) [09:25:57] told him already, he doesn't listen :( [09:26:48] good night everyone :) [09:27:28] good night [09:46:32] <_joe_> good night... [09:58:12] registries looking cool, moving to ores [11:33:10] Kunal slept and we had this incident. Coincidence? I Think Not! [11:33:17] * Amir1 puts on his tinfoil hat [11:42:53] 10serviceops, 10SRE, 10Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (10jbond) [11:43:16] _joe_: re changeprop can you check that this task ^^^ makes sense and that the right tags are present [11:50:45] 10serviceops, 10ChangeProp, 10SRE, 10Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (10hnowlan) [11:52:16] looks like hugh did that for ya [11:55:45] thanks hnowlan :) [12:00:41] 10serviceops, 10ChangeProp, 10SRE, 10Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (10Joe) To be clear, the idea came out of the fact that during read-only time we had a lot of jobs failing, but given w...
[12:00:58] <_joe_> yeah I added a comment upon reading the code [12:11:37] thanks _joe_ [13:07:08] 10serviceops, 10SRE, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 2:00:00 3 host(s) and their services with reason: for zookeeper migration ` conf[2004-2006].codfw.wmnet ` [13:21:24] 10serviceops, 10ChangeProp, 10SRE, 10Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (10Pchelolo) yeah, that's correct. We can increase the additional delay if needed. Also, this particular additional del... [13:21:30] 10serviceops, 10SRE: Ubtade grafana link for mediawiki-error-rate-$cluster check - https://phabricator.wikimedia.org/T281261 (10jbond) [13:22:28] 10serviceops, 10SRE: Ubtade grafana link for mediawiki-error-rate-$cluster check - https://phabricator.wikimedia.org/T281261 (10jbond) @jijiki perhaps? [13:23:04] 10serviceops, 10SRE: Update grafana link for mediawiki-error-rate-$cluster in icinga check - https://phabricator.wikimedia.org/T281261 (10jbond) [15:08:31] <_joe_> akosiaris / jayme can you confirm ipv6 addresses should never make it to kubernetes egress rules? [15:21:45] _joe_: I don't think there is a rule to not have ipv6 there [16:27:43] 10serviceops, 10Scap, 10Release-Engineering-Team (Radar): Deploy Scap version 3.17.1-1 - https://phabricator.wikimedia.org/T279695 (10jijiki) @dancy thank you very much, everything worked well! I have installed scap 3.17.1-1 on our api canaries, I will proceed with a full rollout this week. [16:43:01] akosiaris: awesome :D I can take care of the decom [16:49:02] akosiaris: I still see references to rdb2003 and rdb2005 in deployment-charts? 
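The leftover rdb2003/rdb2005 references in deployment-charts, and _joe_'s earlier question about IPv6 in egress rules, concern entries of roughly this shape. This is a generic Kubernetes NetworkPolicy sketch, not the actual chart contents; the app label, CIDR, and rule name are illustrative:

```yaml
# Generic sketch of an egress rule of the kind a chart would generate to let
# a service reach a redis host; names and the CIDR are illustrative only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-redis-egress
spec:
  podSelector:
    matchLabels:
      app: changeprop
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.192.0.1/32   # e.g. the redis host's IPv4 (illustrative)
      ports:
        - protocol: TCP
          port: 6379
```

When a host is replaced, rules like this keep naming the old host's IP until the charts are cleaned up, which is why stale references linger harmlessly until a cleanup phase.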
[17:55:41] 10serviceops, 10DBA, 10Phabricator, 10User-brennen: Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution" - https://phabricator.wikimedia.org/T279013 (10mmodell) [17:56:24] 10serviceops, 10DBA, 10Phabricator, 10User-brennen: Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution" - https://phabricator.wikimedia.org/T279013 (10mmodell) I've marked this as blocked by {T171498} because that sounds like the righ... [17:56:50] 10serviceops, 10DBA, 10Phabricator, 10User-brennen: Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution" - https://phabricator.wikimedia.org/T279013 (10mmodell) p:05Triage→03Low [18:57:05] <_joe_> legoktm: AFAICT it's all networkpolicies, so they were left there until the cleanup phase [19:03:13] oh, missed that, thanks [20:54:44] 10serviceops, 10decommission-hardware, 10Patch-For-Review: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by legoktm@cumin1001 for hosts: `rdb[2003-2004].codfw.wmnet` - rdb2003.codfw.wmnet (**PASS**) - Downt... [21:02:21] 10serviceops, 10Continuous-Integration-Infrastructure, 10Regression: Apache on doc1001 does not see updated PHP files for hours/days after deployment - https://phabricator.wikimedia.org/T275468 (10hashar) [21:07:05] 10serviceops, 10decommission-hardware, 10Patch-For-Review: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by legoktm@cumin1001 for hosts: `rdb[2005-2006].codfw.wmnet` - rdb2005.codfw.wmnet (**PASS**) - Downt... 
[21:27:23] 10serviceops, 10WMF-JobQueue, 10Patch-For-Review, 10Platform Team Workboards (Clinic Duty Team), and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Pchelolo) ok, a bit more time passed and we've lost 24 jobs in 7 day... [21:49:20] some reports from early testing of bullseye: [21:49:46] tlsproxy::envoy fails to handle the certificate chain.. because: [21:49:52] failed: /usr/bin/env: ‘python’: No such file or directory [21:50:19] on host people1003 with apache and envoy and nothing much else [21:50:43] also, if you want to see something on bullseye really quick, you can use people1003 now [21:51:25] do you know what's trying to call python2? [21:58:35] 10serviceops, 10decommission-hardware, 10Patch-For-Review: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (10Legoktm) [21:59:28] 10serviceops, 10DC-Ops, 10decommission-hardware, 10ops-codfw: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (10Legoktm) This is ready for #DC-ops now. [22:08:16] 10serviceops, 10DC-Ops, 10decommission-hardware, 10ops-codfw: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (10Papaul) p:05Medium→03Low a:03Papaul [22:10:26] legoktm: tracked it down to /usr/local/sbin/x509-bundle [22:10:34] hrmm [22:10:52] used by sslcert::chainedcert [22:13:36] looks straightforward to port [22:13:57] https://gerrit.wikimedia.org/r/c/operations/puppet/+/670978 [22:14:13] woo:) [22:14:41] oh, it's been up for a while, nice find
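For context on the `/usr/bin/env: 'python': No such file or directory` failure above: bullseye ships no unversioned `/usr/bin/python`, so any `#!/usr/bin/env python` shebang has to become `python3` (plus fixing any py2-isms in the body). Below is a hedged python3 sketch of the kind of work a chain-bundling helper like x509-bundle does — it is illustrative, not the actual script's code; the `bundle` function and the sample PEM fragments are invented for the example:

```python
#!/usr/bin/env python3
# Illustrative python3 sketch in the spirit of x509-bundle (NOT its actual
# code): concatenate a certificate and its CA chain into one PEM bundle,
# making sure each fragment ends with exactly one newline so adjacent PEM
# blocks don't run together.

def bundle(parts):
    """Join PEM fragments, guaranteeing a newline between blocks."""
    out = []
    for part in parts:
        # Normalize trailing whitespace: strip any newlines, then add one back.
        part = part.rstrip("\n") + "\n"
        out.append(part)
    return "".join(out)

# Sample fragments: one without a trailing newline, one with.
cert = "-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----"
chain = "-----BEGIN CERTIFICATE-----\nBBBB\n-----END CERTIFICATE-----\n"

print(bundle([cert, chain]))
```

The normalization step is the part that matters for a bundler: without it, a cert file missing its final newline would glue its `END` marker to the next block's `BEGIN` marker and break parsers.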