[08:35:09] serviceops, MediaWiki-Authentication-and-authorization: sessionstore certificates will expire soon - https://phabricator.wikimedia.org/T274564 (akosiaris)
[09:14:05] serviceops, MediaWiki-General, SRE, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe) In order to catch calls to mediawiki that are not monitoring and go to port 8...
[10:54:14] serviceops, MediaWiki-General, SRE, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (akosiaris)
[11:50:20] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review, Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (akosiaris) >>! In T273521#6812274, @JMeybohm wrote: > I think we could make kubernetes...
[12:59:29] serviceops, SRE, envoy, Service-Architecture: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections. - https://phabricator.wikimedia.org/T266855 (Joe) Open→Resolved
[12:59:54] serviceops, MW-on-K8s, SRE, Patch-For-Review: Create a yaml structure for defining apache virtualhosts for mediawiki, that can be used both in puppet and in helm charts. - https://phabricator.wikimedia.org/T272305 (Joe) Open→Resolved
[12:59:57] serviceops, MW-on-K8s, SRE: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (Joe)
[14:37:12] serviceops, Wikidata, Wikidata-Query-Service: Limit query parallelism from Flink based WDQS updater to Wikidata - https://phabricator.wikimedia.org/T275133 (Gehel)
[14:37:42] serviceops, Wikidata, Wikidata-Query-Service: Limit query parallelism from Flink based WDQS updater to Wikidata - https://phabricator.wikimedia.org/T275133 (Gehel)
[17:44:21] so.. about mwmaint and buster... should I simply reimage mwmaint2001 to buster, since that is not the active DC, and then we can test if the role (and jobs) work?
[17:44:46] or we could go the other route and create a new mwmaint on buster
[17:48:59] <_joe_> I'd go with the former
[17:50:32] alright, if it's not considered too risky in case we have to switch on just that day unexpectedly.. but even then it's still easy enough to just reimage it back
[17:53:31] yeah, practically speaking we probably can't do a surprise DC switchover faster than that anyway
[17:53:47] ok, ack!
[18:05:17] in the long run I'd love for us to maintain a state of, like, readiness to do an unexpected DC switchover within X hours at any time -- but that's extremely aspirational :)
[18:06:59] yea, so I found my own comments about making puppet accept more than 1 mwmaint in a single DC ..
[18:07:30] but we don't really need it to just get the upgrade done now
[18:20:25] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review, Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (dancy) >>! In T273521#6840723, @akosiaris wrote: > One extra thing to always keep in m...
[18:22:04] rzl: I guess in the above you mean in the sense of "an unexpected but unnecessary one, without breaking too many less-important things?" I mean, I think we do expect to be able to handle sudden loss of a core DC in reasonable time, given that some 1-DC things will be borked and that's ok, right?
[18:22:50] bblack: yeah -- I think the big question is, how long will it take us to do the switch, and is that *always* true or just mostly true
[18:23:06] right
[18:23:06] e.g. the work mutante is doing -- if we were really careful, we would never do anything that makes codfw unavailable to take a handoff
[18:23:23] but, I don't think we're otherwise at a point where it makes sense to maintain that stance
[18:23:45] I don't think we're that far off from it, one-off temporary risks like this aside?
[18:24:07] mostly-true I guess :)
[18:24:28] the other big thing is the DB prework I think -- currently it's a lot of work for the DBAs to get into a switch-ready configuration
[18:24:39] some of that is stuff we could skip in a hurry, but not all of it iiuc
[18:25:09] but that's mostly to maintain sanity so that we can switch back later easily, etc... if we assumed one DC was lost and we didn't care about recovery time to get it back, it shouldn't need much, I would guess (but I don't know)
[18:25:54] nod
[18:26:18] I guess maybe I've falled into the trap of looking at our read-only time window, and not thinking about whether some longer-running things outside of that window are still necessary even if the other DC is dead-dead.
[18:26:23] serviceops, SRE: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn)
[18:26:25] s/falled/fallen/
[18:26:38] serviceops, SRE: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) renaming this ticket to cover both mwmaint* servers and not be just for eqiad alone
[18:26:49] I think that's probably true, and it's also probably true that in an emergency we could get Most Things[tm] to codfw before Too Long[tm] -- we just don't have a doctrine of making sure that's always true to a particular standard, is all
[18:27:03] yeah
[18:27:14] someday we should, but until we do, we don't really have to constrain our other work by it
[18:28:05] this sounds like a job for SLOs tracking routine failover testing! :)
[18:28:23] yup
[18:33:53] most things minus the community (and Foundation) tools that run in Cloud VPS... but that's something that folks are talking about working on again, which is nice
[18:34:27] yeah, "All Of Cloud And All Of Analytics" is the canonical big gap
[18:34:46] would love to not have that problem
[18:51:05] my crystal ball says that somehow or other, we'll eventually land on having different conceptual groupings of what fails over and how. Some label will apply to most things for codfw:eqiad, maybe some other things are considered ok to be non-redundant, and maybe some other things are redundant but use codfw:elsewhere or eqiad:elsewhere, etc.
[18:52:28] sounds right
[18:52:43] the big win there would be making those decisions intentionally, I think
[18:52:49] yes
[18:53:08] then we can say it's all working as intended because we've clearly defined that we didn't intend to cover the hard parts :)
[18:53:23] haha
[18:53:32] and to catch it early when we are about to introduce new SPOFs.. like "I see this VM request asks for just one DC, are you SUUUUURE??"
[18:54:08] mutante: yeah, or better yet this k8s deployment :) where we can express all of rack-level, row-level, and DC-level redundancy in the same ways
[18:54:52] and some of this is tied into SLO definitions too -- there's only so good your SLO can be with a single machine, and similarly there's only so good your SLO can be with a single DC
[18:56:16] we might need 2 different error page templates: "this is currently down but still inside SLO, see you soon" and "this is down and already breaking SLO, auto-generated ticket is linked here"
[19:12:17] so.. I was about to reboot mwmaint2001 and remembered to check home dirs again. found a couple of users with gigabytes of data.. was about to start pinging them.. then I looked at one example, joe's home, and saw what was taking that space: it's "home-terbium", from the previous server, and I created that years ago for the same reason.. lol, compounding "interest"
[19:23:54] as opposed to mwdebug, mwmaint homes ARE in Bacula (just checked). so.. I will nuke "home-terbium" now
[19:32:35] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review, Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (Legoktm) >>! In T273521#6841527, @dancy wrote: >>>! In T273521#6840723, @akosiaris wro...
[22:16:37] <_joe_> sorry I missed this conversation, I just want to comment that we can switch over to codfw and finish reimaging mwmaint2001 in the meantime
[22:17:06] <_joe_> it will be a downtime of ~1 hour for some rather important cronjobs, but nothing will really be broken in a tragic way
[22:17:30] <_joe_> I think we'd have our hands full with dns changes and weathering the initial storm on very cold databases
[22:18:04] I am finding even more old data, like the homes from terbium and mwmaint1001 in the root of mwmaint1002 and mwmaint2001, stuff like that
[22:19:18] bacula console still knows about old hosts like that, tin as well. but then when you look closely it might not have a full backup anymore, just the knowledge that this client once existed
[22:20:39] so it can be a bit deceiving if you just glance at the bacula client list (maybe; I'm asking a bit whether this is working as expected)
[23:16:42] serviceops, SRE: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mwmaint2001.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202102182316_dzahn_17848_mwm...
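Editor's note on the 18:54:08 exchange about expressing rack-, row-, and DC-level redundancy "in the same ways" in Kubernetes: below is a minimal, hypothetical sketch of that idea using standard `topologySpreadConstraints`. It is not taken from Wikimedia's charts; the deployment name, the image, and the row label `example.org/row` are placeholders, and only `kubernetes.io/hostname` and `topology.kubernetes.io/zone` are standard well-known labels.

```yaml
# Sketch only: three spread constraints, one per failure domain, all written
# with the same mechanism -- which is the point made in the chat above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mediawiki-example            # hypothetical name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: mediawiki-example
  template:
    metadata:
      labels:
        app: mediawiki-example
    spec:
      topologySpreadConstraints:
        # Spread across individual hosts (standard well-known label).
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: mediawiki-example
        # Spread across rows/racks -- assumes nodes carry a row label like this.
        - maxSkew: 1
          topologyKey: example.org/row
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: mediawiki-example
        # Spread across zones/DCs (standard label), if the cluster spans more than one.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: mediawiki-example
      containers:
        - name: mediawiki
          image: docker-registry.example/mediawiki:latest   # placeholder image
```

In practice the label names and constraint values would come from whatever node labels and chart values the cluster actually defines; the sketch only illustrates that host, row, and DC spreading share one declaration format, unlike the mix of puppet roles and per-DC VM requests discussed earlier in the log.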