[08:57:59] hi oncallers! I just repooled kartotherian codfw to test the new stack before the k8s upgrade
[08:58:35] until now we were in the situation where maps eqiad was the only DC serving traffic, which would have caused some trouble when upgrading k8s later on
[08:58:46] so I pooled codfw to make sure that it works as expected
[09:00:33] caveat: due to timing constraints, tegola's tile cache is still not completely refreshed, so users may see stale tiles during the next few hours (e.g. an extra geoshape, or info not yet reflecting the latest data)
[09:00:53] cc: claime
[09:02:45] the next step will be to depool eqiad before the k8s upgrade, so we'll see if the new stack correctly handles all the traffic etc.
[09:30:51] yo sorry, I had pc issues
[09:31:01] elukey: sgtm
[09:31:06] just got in
[09:31:30] elukey: do you want to send an update on the thread about the maps status?
[09:33:27] claime: not yet, since I just discovered that the cache warm-up didn't work, sigh
[09:33:33] I need to figure out what's wrong
[09:33:40] anyway, we'll surely serve some stale tiles
[09:33:45] but we can't do otherwise
[09:35:36] elukey: I think that's fine honestly
[09:35:41] Compared to being completely off, that is
[09:39:45] I agree
[09:42:09] ok, fixed the cache issue, now it is getting warmed up a little, but of course all the past hours of work have to be redone from scratch, sigh
[09:44:11] * elukey re-inserts 90M events in the kafka queue
[09:44:48] (later, the current ones are still being inserted)
[09:46:29] thanks for the help elukey <3
[09:46:39] claime: so yes, you can write in the status update that we are going to serve some stale maps tiles while the cache is refreshed over the next hours. There are 90M elements to refresh and it takes a long time, and we couldn't do it before the k8s upgrade's deadline :(
[09:47:00] elukey: it's ok, I'll send the update rn
[09:47:04] the last test that I want to do is to push all traffic to codfw
[09:47:12] to make sure that we'll be ok
[09:47:16] doing it now
[09:48:21] all right, moment of truth, all traffic served by codfw
[09:52:25] * claime crosses fingers
[09:56:50] so far all good
[09:57:10] \o/
[09:59:57] claime: yep, confirmed, we are good :)
[10:00:17] elukey: awesome, thank you *so much*
[10:42:45] elukey: once the maintenance is done, what do you want to do with maps?
[10:43:03] Leave it there, switch back to eqiad?
[10:44:58] claime: better to switch back to eqiad, so I have time to properly warm up codfw's tile cache and test
[10:45:31] elukey: ok, will you want to do it, or just leave me the commands and dashboards and we'll do it?
[10:47:31] claime: I can take care of it!
[10:50:06] elukey: tyvm
[10:50:09] <3
[10:55:28] We're starting, coordination happens here
[10:55:50] 👀
[10:56:02] \i/
[10:56:06] Running charlie
[11:04:03] scap locked
[11:04:09] depooling toolhub
[11:09:49] To check later: why is toolhub not failing going through drmrs?
[11:10:00] Moving on
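The pool/depool steps above (repooling kartotherian codfw, depooling toolhub, later swift and thumbor) are normally driven through conftool's DNS discovery objects. A minimal sketch of what that can look like is below; the exact object names are assumptions, and the current state should always be checked first:

    # Inspect the current pooled state for a service's discovery record
    confctl --object-type discovery select 'dnsdisc=kartotherian' get

    # Pool codfw, then depool eqiad; the DNS TTL still has to expire before
    # traffic fully moves, hence the "sleeping 5 minutes for TTL" steps below
    confctl --object-type discovery select 'dnsdisc=kartotherian,name=codfw' set/pooled=true
    confctl --object-type discovery select 'dnsdisc=kartotherian,name=eqiad' set/pooled=false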
[11:12:24] Emperor: we need to move thumbor from eqiad to codfw, do we need to move swift as well?
[11:15:00] claime: how is thumbor talking to swift these days? its credentials will only work in the "right" DC, but I thought it'd been fixed to talk to the relevant DC-specific record
[11:15:18] hnowlan: do you know? ^
[11:15:48] ok yeah, I think so Emperor
[11:16:04] it's got values-{codfw,eqiad}.yaml with different swift service proxy records
[11:16:39] that was my recollection, so the answer to your question ought to be 'no' :)
[11:16:53] awesome, proceeding
[11:19:20] Switching thumbor over to codfw: codfw pooled, sleeping 5 minutes for TTL, then depooling eqiad
[11:20:25] we haven't changed thumbor to speak to the right DC yet
[11:20:36] hnowlan: huh? the values aren't used?
[11:21:17] I stopped, we are pooled in both datacentres
[11:21:58] I should see errors from codfw if we're not talking to the right backend, right?
[11:21:59] claime: sorry, ambiguous wording - it's coupled with swift
[11:22:22] we haven't changed thumbor to talk to discovery records or similar, because as Emperor said the credentials are per-DC
[11:22:31] hnowlan: yeah, but that means it IS talking to its right backend
[11:22:36] yes
[11:22:40] so I can fail over without moving swift
[11:23:40] I think you will need to depool swift
[11:23:48] in eqiad
[11:24:30] because the URL redirection in swift only knows about the DC-local thumbor instance afaik. verifying that
[11:25:51] Oh, yes, that's right, swift->thumbor only talks to the DC-local thumbor
[11:26:13] there's something strange then
[11:26:19] swift is pooled in both datacentres
[11:26:23] rewrite.py is configured via 'thumborhost' in proxy-server.conf
[11:26:25] thumbor is depooled from codfw
[11:26:40] or am I looking at the wrong swift record?
[11:27:02] claime: yeah, but you didn't stop answering to thumbor.svc.codfw.wmnet, did you?
[11:27:24] Just arranged that thumbor.discovery is sent to eqiad
[11:27:32] ooooh
[11:27:36] oh
[11:27:42] right
[11:28:13] So we had an outage the last time one DC's thumbor was entirely turned off, but that's not the same as just depooling from the global discovery record
[11:28:26] yeah right right
[11:28:31] so depooling swift in eqiad then?
[11:28:38] swift.discovery.wmnet
[11:29:05] yes, I think so
[11:29:24] ok, depooling swift in eqiad then
[11:29:28] do it. And give me my t-shirt if there's 🔥
[11:29:32] :)
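Two quick checks behind the reasoning above: that thumbor is pinned to the DC-local swift via per-DC values files, and where the discovery records currently resolve. This is only a sketch under assumptions — the helmfile.d/services/<service>/values-<dc>.yaml layout and the resolver in use are not confirmed in the log:

    # From a deployment-charts checkout: confirm thumbor's swift endpoint differs per DC
    diff helmfile.d/services/thumbor/values-eqiad.yaml \
         helmfile.d/services/thumbor/values-codfw.yaml

    # Check what the discovery records currently resolve to
    dig +short thumbor.discovery.wmnet
    dig +short swift.discovery.wmnet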
[11:32:52] loads on eqiad swift are way down
[11:33:12] so it looks okay
[11:33:14] hnowlan: what are you looking at?
[11:33:37] https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&refresh=1m&from=now-1h&to=now&timezone=utc&var-quantile=0.75&var-engine=$__all
[11:34:10] it's not going down as fast as I'd expect, but I guess that's long-running renders and queues
[11:34:34] oh dear, thumbor codfw is looking very hot. I might need to bump replicas asap
[11:34:50] this is probably an unprecedented load for a single DC because of the number of scrapers
[11:34:53] go
[11:35:04] tell me if I need to repool swift eqiad in the meantime
[11:36:05] bumped, but need to see if more is required
[11:36:14] ack
[11:37:58] queues hopefully evening out, just need another minute or two
[11:39:17] yep
[11:40:41] I think we're safe to proceed, I'll keep an eye on it and adjust as needed
[11:40:59] Thanks hnowlan <3 proceeding
[11:41:02] Depooling thumbor in eqiad
[11:41:59] Waiting for TTL to expire before wiping the cluster
[11:47:53] Ok, preparing to wipe the cluster
[11:48:20] Wiping cluster
[11:54:45] Actually wiping cluster now
[11:56:21] holding for upload errors, in case we need to emergency repool thumbor and swift
[12:31:23] Repooled thumbor and swift in eqiad for relief
[12:52:50] Retrying depool of swift and thumbor
[13:06:11] ok, no more requests hitting swift eqiad or thumbor eqiad
[13:06:14] everything looks fine
[13:06:25] We have a little less than two hours to complete the upgrade
[13:06:33] All aboard to wipe the cluster?
[13:07:18] jelto: jayme ^
[13:07:30] go :)
[13:07:39] go 💣
[13:07:49] cluster wiping
[13:10:34] Merging puppet and deployment-charts changes
[13:55:14] admin namespace deployments done, deploying istio
[14:01:42] repooling toolhub
[14:02:55] Running charlie to redeploy all services
[14:04:50] Redeploying kartotherian and thumbor in priority
[14:06:30] elukey: kartotherian redeployed, you should be g2g
[14:11:32] all right, will check in a sec!
[14:20:59] repooled kartotherian in eqiad
[14:22:48] elukey: awesome, thank you
[14:22:49] ahhhh no wait, tegola wasn't deployed!
[14:23:00] depooled again, sigh :(
[14:23:07] aaaaah sorry sorry
[14:23:12] kartotherian depends on tegola, I assumed it was deployed, my bad
[14:23:20] deploying now
[14:25:51] elukey: tegola deployed
[14:34:49] claime: thanks! This time I repooled after some basic checks :D
[14:46:47] I double-checked admin_ng in the dse clusters, and besides the undeployed flink/spark changes there is no diff, so we should be good there from our side, see T405703#11233555
[14:46:48] T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703
[14:49:54] kartotherian is back to running only in eqiad
[14:50:29] Great, thanks for your help elukey
[14:52:50] dcausse: redeploying all services is taking a little longer than expected, I'll ping you again when it's done
[14:53:11] claime: no worries, thanks!
[15:00:19] claime: I think there are resource issues
[15:00:19] 0/243 nodes are available: 1 Insufficient cpu. preemption: 0/243 nodes are available: 1 No preemption victims found for incoming pod, 242 Preemption is not helpful for scheduling.
[15:00:32] is thumbor still scaled to more replicas?
[15:00:36] yeah, I have a deployment stuck
[15:00:44] let me check
[15:01:18] yeah, something weird's happening, we have 4.5k CPU available though
[15:01:42] But we should have deployed mw-mcrouter first because it's a DaemonSet
[15:02:24] maybe thumbor is using more resources because of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1192873 ?
[15:02:35] thumbor is massively overprovisioned now
[15:02:53] maybe we have to scale it down again to deploy mcrouter (and the rest)?
[15:04:18] scaling it down
[15:04:37] I don't think that's the issue though, but I'll look at something
[15:05:12] done
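The "0/243 nodes are available: 1 Insufficient cpu" message above is a scheduler event attached to a Pending pod. A generic way to surface that kind of failure, sketched with placeholder names:

    # List pods the scheduler hasn't been able to place yet
    kubectl get pods -A --field-selector=status.phase=Pending

    # Inspect one of them; the Events section shows the FailedScheduling reason
    kubectl -n <namespace> describe pod <pending-pod>

    # Or look at recent scheduling failures cluster-wide
    kubectl get events -A --field-selector=reason=FailedScheduling --sort-by=.lastTimestamp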
[15:06:37] "you require more thumbor. Mine more thumbor"
[15:10:02] mining thumbs can be pretty harmful though
[15:10:21] The reason we're stuck is that some hosts have too many pods overall to allow the mcrouter daemonset pod to be deployed there
[15:10:32] So I'm going through them, deleting one pod, and it goes through
[15:11:35] jelto: can you add to the task that mw-mcrouter needs to be deployed first?
[15:11:54] oh okay, thank you! is this still a left-over from the smaller /18 subnets?
[15:11:54] yep, one sec
[15:12:26] jelto: nope, just that k8s packs some nodes full, and then we can't add the must-run-on-every-node pods
[15:13:28] ack, I updated the task
[15:13:58] tyvm
[15:20:00] 16 or so services left
[15:20:15] Next time we should deploy some of the stuff first (thumbor, toolhub, mw-mcrouter)
[15:20:18] Then parallelize the rest
[15:20:27] sequential is way too long
[15:20:34] yep +1
[15:21:18] I jinxed us saying we were gonna finish on time
[15:34:49] DONE
[15:34:51] Finally
[15:35:00] dcausse: you can go ahead and restart flink
[15:35:05] zotero was the last one? 🎉
[15:35:08] claime: nice, thanks!
[15:35:08] yes
[15:35:14] nice
[15:37:59] claime: from deploy2002 I get "WARNING: version difference between client (1.23) and server (1.31) exceeds the supported minor version skew of +/-1" with kubectl version. should I just ignore it, or am I missing some initialisation steps in my env?
[15:38:16] dcausse: lemme check
[15:39:21] dcausse: what namespace?
[15:39:34] after kube_env cirrus-streaming-updater eqiad
[15:40:03] well, same for rdf-streaming-updater
[15:40:08] hmm, that works for me
[15:40:40] just had to logout/login apparently...
[15:40:53] yeah, probably just update-alternatives shenanigans
[15:41:05] ack, thanks!
[15:43:15] thanks for running the upgrade claime, that was quite an exercise
[15:43:28] yeah, I'm fried lol
[15:44:18] yeah, well done - sorry about the extra excitement
[15:44:34] me too, I'll head off, tomorrow at 5am vops expects me
[15:44:45] jelto: good luck! o>
[15:45:02] thx :)
[15:48:37] jelto: hopefully I won't speak to you until regular working hours tomorrow :)
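On the kubectl skew warning above: the client on the deploy host was still 1.23 against the new 1.31 API server, which is outside kubectl's supported +/-1 minor version skew. A quick way to confirm which binary a shell picks up after re-login, assuming kubectl is managed via alternatives on these hosts as hinted in the log:

    # Show client and server versions; skew beyond +/-1 minor triggers the warning
    kubectl version

    # See which kubectl the shell resolves to, and which alternatives exist
    command -v kubectl
    update-alternatives --list kubectl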