[06:34:03] serviceops, Operations: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Krinkle)
[06:34:27] serviceops, Operations: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Krinkle)
[06:35:10] serviceops, Operations: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Krinkle)
[06:53:01] serviceops, Operations: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Joe) Open→Declined Not sure what this task rationale is. Debian buster has node10, https://packages.debian.org/buster/nodejs and will provide security updates until at least 202...
[06:57:02] serviceops, Operations: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Krinkle) I don't think Debian provides security support for the 1,446,739 packages on npmjs.org. It won't be long before our production services or CI tooling will no longer function on a su...
[07:22:26] serviceops, Operations, Growth-Team (Current Sprint), Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (elukey) Adding some thoughts about mc1036, in my opinion it is really flying with the new config :) With the extra +20G...
[08:27:31] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (Jgiannelos) >>! In T266373#6616778, @akosiaris wrote: >>>! In T266373#6613038, @Jgiannelos wrote: >> @a...
[08:33:25] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (Jgiannelos) >>! In T266373#6617586, @akosiaris wrote: >> Interestingly, proton returns transfer-encodin...
[08:56:20] akosiaris: Hey! Thanks for the work on T266373. For future reference, is there any documentation or even a couple of example `curl` runs on how to reach different proxy levels directly (eg Varnish/ATS) for debugging purposes? Also one of the reasons I only sampled a few articles is rate limiting. How did you work around that?
[09:06:41] serviceops, Operations: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (hashar) Declined→Open The rationale is that developers are adopting newer versions on a different timeline than the Debian releases. Either we are ahead (the case for NodeJS) and/or...
[09:29:52] serviceops, Operations: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (hashar) Reply from the package uploader: > indeed, all other things being equal, nodejs 12.x will be in debian 11. > (unless a developer starts working full time on transitioning nodejs 14...
[09:44:15] serviceops, Operations, Packaging: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Peachey88)
[10:59:22] nemo-yiannis: the various endpoints for talking to internal services in their dc-agnostic way are in the format <service>.discovery.wmnet (see https://wikitech.wikimedia.org/wiki/DNS/Discovery if you care about the gritty details). The list of services to feed the above format is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml. e.g. restbase is
[10:59:22] an internal service, so restbase.discovery.wmnet. So, most of the time you can just curl https://<service>.discovery.wmnet:<port>. The port part you can get again from the service.yaml file above.
[10:59:22] Public facing services are of the form <service>.wikimedia.org. Those are rather few compared to the internal ones and they all tend to resolve to text-lb.<site>.wikimedia.org (the public facing edge caches) these days. So for those, docs about edge caches are at https://wikitech.wikimedia.org/wiki/Caching_overview, alongside diagrams to help understand the usual request flow.
[10:59:22] I would though suggest to treat the edge caches as a whole and not try to drill down into the individual components as I did in that task, unless you are willing to spend considerable time on them. The architecture there is changing as the traffic team is meeting a variety of issues with varnish and ats.
[10:59:59] Cool, thanks for the details.
[11:00:30] nemo-yiannis: as far as the rate limiting goes, I did not work around it, at least as long as I was testing from my PC. I just waited. But internally (e.g. from a bastion host), rate limits don't apply IIRC, so be careful :-)
[11:05:37] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (akosiaris) >>! In T266373#6623109, @Jgiannelos wrote: >>>! In T266373#6616778, @akosiaris wrote: >>>>!...
[12:06:57] serviceops, Operations, Kubernetes: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (JMeybohm)
[14:53:46] serviceops, Operations, CommRel-Specialists-Support (Oct-Dec-2020), User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (Trizek-WMF) I've been blocked by a last minute change made on translation, which required me to manually change date formats in translat...
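A quick sketch of the recipe described above. restbase is only used as an example, and the 7443 port is a placeholder; the real port for any given service has to be looked up in hieradata/common/service.yaml:

    # Resolve the dc-agnostic discovery record; it points at whichever DC is currently active
    dig +short restbase.discovery.wmnet

    # Talk to the service directly, using the port listed for it in service.yaml (7443 is illustrative)
    curl -sv "https://restbase.discovery.wmnet:7443/" -o /dev/null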
[15:02:10] puppet is failing on scb since poolcounter::client::python tries to install python3-poolcounter, which isn't available on jessie
[15:03:13] or rather it runs per se, but the installation of the package is failing
[15:11:23] _joe_: ^
[15:30:13] rzl: moritzm: i have created a CR (https://gerrit.wikimedia.org/r/641194) which prevents this from being installed on jessie systems, there may be a better fix but this should at least stop the noise
[15:35:55] <_joe_> I'll fix it tomorrow, it's really not a huge issue, basically we need to just set the hiera value for the poolcounter backends to null
[15:36:07] <_joe_> I'm still pretty sick sorry
[15:39:51] _joe_: oh, sorry to hear -- I'll stamp jbond42's change then, feel free not to worry about it until you're ready
[15:40:07] serviceops, Discovery-Search, Maps, Product-Infrastructure-Team-Backlog: [OSM] Backport imposm3 to the debian channel - https://phabricator.wikimedia.org/T238753 (MSantos)
[15:40:08] <_joe_> rzl: that change is incorrect :)
[15:40:18] haha then I will not do that thing
[15:40:31] <_joe_> or better, it wouldn't fix the real issue, just not make puppet complain
[15:48:00] thx _jo.e_, i have put a note on the change basically saying this ^^ and not to merge unless things become a real issue before a correct change is implemented
[15:48:25] sure, sounds good -- it won't be a real issue, we aren't actually relying on that poolcounter functionality for anything yet
[15:48:50] I was just going to merge it in order to unblock puppet, but if we don't care about that, cool :)
[15:49:39] im fine, but like i said feel free to merge if it becomes an issue
[15:49:51] moritzm: _joe_
[15:49:52] 👍
[15:50:04] so our plan for ICU is
[15:50:19] a) enable the component on all mw clusters
[15:51:09] b) in batches per cluster (2 for jobrunners, and 3 for parsoid/api/app) to depool and run apt-get install + restart php-fpm
[15:51:47] <_joe_> ideally you first do so on the canaries, wait a couple hours, go on with the rest if nothing fishy happens
[15:51:52] yep
[15:51:55] and then c) run the update script first on mediawikiwiki and then in parallel instance per shard
[15:51:58] <_joe_> and in the meanwhile, you also upgrade mwmaint
[15:53:06] plan sounds good to me!
[15:55:31] after we upgrade the api and app canaries, how long do you think we should wait ?
[16:02:56] I don't think we need more than an hour or so? if there's an issue I think we should see it relatively quickly, and if it's so rare that it doesn't arise on 10 of our 150 prod servers within an hour, it doesn't sound like a show stopper
[16:03:31] I don't remember how long we waited the last time, but probably not much longer than that
[16:08:30] serviceops, Discovery-Search, Maps, Product-Infrastructure-Team-Backlog: [OSM] Backport imposm3 to the debian channel - https://phabricator.wikimedia.org/T238753 (MSantos) a: sdkim→hnowlan
[16:09:07] oh it's on? nice
[16:09:22] my very rudimentary testing in deployment-prep turned up nothing, as expected
[16:10:01] I've got some long-running php scripts in flight on the snapshot hosts, but there's nothing to be done about that
[16:10:04] yeah much appreciate your updates on the task
[16:10:23] ack -- anything we can do to make your life easier there?
[16:10:33] don't shoot them? :-D
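A rough per-host sketch of step (b) of that plan, assuming the usual depool/pool wrapper scripts on the appservers; the package list and the php-fpm unit name below are illustrative, not the exact ones used:

    depool                          # take the host out of rotation
    apt-get update
    apt-get install -y libicu63     # plus whatever rebuilt php packages the component pulls in
    systemctl restart php7.2-fpm    # unit name depends on the PHP version deployed
    pool                            # put it back in service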
[16:10:45] I mean at this point things are in memory so who cares right
[16:10:55] cool okay
[16:11:08] these are all cli, should be fine
[16:11:10] no immediate plans to touch the snapshot hosts, I just wanted to make sure you weren't going to get hosed by collation changes or whatever
[16:11:21] extremely doubtful
[16:11:30] we are in the "get text for each revision" phase
[16:13:59] serviceops, Beta-Cluster-Infrastructure, DBA, Operations, Patch-For-Review: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) Note that I ran a little dumps test on a non-latin1 wiki in deployment-prep (ruwiki to be precise) and the results look...
[16:14:08] forgot to in fact note that I did minimal testing on the task. there.
[16:14:42] what time is this kicking off then?
[16:14:57] any minute now :D
[16:15:06] effie and I are on a call coordinating
[16:18:30] 👍
[16:49:27] upgraded to icu63 on canary API servers in both DCs, letting it bake for an hour or so
[16:50:17] ok
[16:50:35] I guess we'll all be in a meeting for most of that hour, heh
[16:51:12] serviceops, Operations, CommRel-Specialists-Support (Oct-Dec-2020), User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (RLazarus) We've started upgrading the canary appservers to ICU 63, so the window of category sorting disruption has officially started.
[16:54:13] https://debmonitor.wikimedia.org/packages/libicu63 agrees
[16:54:29] and also sampled pending updates on mw1276, which looks fine as well
[17:59:11] no noise from the canaries, going ahead with the rest of the api servers shortly
[18:08:45] serviceops, Operations, CommRel-Specialists-Support (Oct-Dec-2020), User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (RLazarus)
[18:12:03] merging harmless changes that were backed up in gerrit but nothing related to appservers. keeping an eye on channels that have user reports like #wikipedia and -tech
[18:13:25] thanks :) I doubt anybody is going to notice or care though
[18:27:54] serviceops, Growth-Team, Operations, Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (kostajh)
[18:58:14] mutante: can we cont here
[18:58:20] -operations is too noisy
[18:58:30] the status is that on only 2 api servers
[18:58:38] we have this issue?
[18:59:06] (I'm trying mutante's idea now, keeping the currently-installed pool.d/www.conf and then rerunning puppet)
[18:59:23] that will work probably yes
[19:00:39] .log mw2255 - is pooled and puppet works on next run, after it removed php 7.2 config files
[19:00:46] running puppet on that one host made it remove a bunch of php 7.2 config files and tideways extension
[19:00:53] and after that the puppet run is happy again. and the host was and is pooled
[19:01:15] on the other one mw2313 something else is going on, with errors installing packages
[19:01:50] hmm, it looks like rerunning puppet on 2284 didn't touch that config file
[19:01:51] but it is also not pooled
[19:02:16] mutante: effie is in the middle of a cumin run upgrading libicu on those hosts so you should expect stuff is moving around there
[19:02:22] probably best to leave it :)
[19:02:27] ok, so we can rule out it was puppet fighting with the package i guess
[19:03:01] ok, it was a reaction to the "widespread puppet failures" alert
[19:03:28] it just triggers over a certain threshold. just wanted to see if it's all of mw* or not
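For spot-checking a single host the way mw1276 was sampled above, something like the following would do; the php package name is only an example, not the exact list in use:

    dpkg -l libicu63 php7.2-intl | grep '^ii'          # confirm what is installed right now
    apt list --upgradable 2>/dev/null | grep -i icu    # see whether anything ICU-related is still pending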
[19:03:41] it's not, just 2 special cases, so we can ignore that
[19:03:42] oh yes please mutante, I am running on appservers on codfw and eqiad
[19:03:53] * volans wonders if debdeploy could have been used here, he doesn't recall if you can hook a pre/post command to depool/repool
[19:04:16] volans: I would say "maybe next time" but we've sworn there won't be a next time
[19:04:25] so, a good hypothetical suggestion for another universe
[19:04:31] lol
[19:04:42] we swore that last time too, but this time we've double-sworn it
[19:04:42] that applies to any package upgrade
[19:04:44] debdeploy isn't great here since it touches 20 different source packages
[19:05:09] but we can simply pass the same hooks to the cumin call to make it retain the local conffiles if that helps
[19:05:24] I'm still not convinced that was the right move actually
[19:05:27] I am ignoring and logged out of any mw* hosts that reported puppet issues. We can focus on just the DPKG alert
[19:05:42] if puppet had replaced the file after the package upgrade, then definitely, but it looks like it didn't
[19:05:59] so my guess is that file is untouched by puppet but someone had adjusted it manually on that host for whatever reason
[19:06:13] maybe bumping the requestlog limit as part of an outage investigation
[19:06:15] sounds like it was being used as a test host
[19:06:21] yeah
[19:06:24] yea, sounds likely
[19:06:38] or sorry, request_slowlog, you know what I mean
[19:07:16] so, exposing my apt-get ignorance: how do I go back and reinstall that package saying "do actually overwrite this file"?
[19:07:23] affected hosts: mw2284, mw2313, so yes Effie, just 2 hosts
[19:08:15] ah, --force-confask?
[19:08:15] rzl: ehm... dpkg-reconfigure ?
[19:08:25] what do you mean with overwrite this file? to the puppetised local version or what's shipped as default in the deb?
[19:08:25] oh
[19:08:59] moritzm: to the default deb version -- I'd like to get what I would have gotten if I'd said "install the package maintainer's version" at the original prompt
[19:09:00] cumin foo* 'export DEBIAN_FRONTEND=noninteractive; apt-get install PACKAGE_LIST -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold"'
[19:09:12] is what keeps the puppetised versions
[19:09:17] let me find the inverse
[19:09:18] apparently it's not puppetized but diff between old hacked version and new package
[19:10:11] on mw2284 the issue is resolved 50% now
[19:10:21] there is only one line left that is
[19:10:25] rc apt-listchanges
[19:10:34] if that was gone the DPKG alert would recover
[19:10:44] mutante: it's not resolved, the other package is the one we're talking about :)
[19:10:59] are we talking about https://phabricator.wikimedia.org/P13267 here?
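The "rc apt-listchanges" line above is dpkg's removed-but-conffiles-remaining state; a quick way to list and clear such leftovers looks roughly like this (apt-listchanges is just the example from this conversation):

    dpkg -l | awk '/^rc/ {print $2}'    # packages removed but with conffiles left behind
    apt-get purge apt-listchanges       # purging drops the leftover conffiles, so the "rc" entry goes away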
[19:11:16] rzl: well, all the php packages are "ii" now, earlier they were not
[19:11:22] mutante: one sec please
[19:11:42] moritzm: that paste is the diff in the config file we're talking about, between mw2215 (not upgraded yet) and mw2284 (the host at issue)
[19:11:57] moritzm: I have another paste I'll find in a sec, which is the diff presented by the installer, hang on
[19:12:08] https://phabricator.wikimedia.org/P13266
[19:12:49] moritzm: of the two diffs in P13267, we think the first is a red herring, just different values for different CPUs -- the request_slowlog diff is what we think somebody just edited by hand
[19:13:01] (I think I half-remember the incident where we did that actually)
[19:13:16] so, we don't mind blowing it away, and we're happy with whatever version of the file apt-get would install for us
[19:13:56] mutante: the reason I don't care about the dpkg alert is that the alert is trying to bring your attention to a problem that we're actively working on
[19:14:04] mutante: we know about the problem, we're working on it, the alert itself is unimportant
[19:15:02] mutante: we can downtime it for the duration of the upgrade if you want to clear it, but otherwise you can expect it to continue recovering and unrecovering as we work on this, and that's fine
[19:16:14] so looking at modules/php/manifests/fpm/pool.pp pm.max_children is in fact dependent on CPU count
[19:16:17] rzl: alright
[19:16:39] and slowlog_timeout is fixed to 15 in the puppetised version
[19:17:06] moritzm: ah thanks, I failed to find that file for some reason
[19:17:16] so maybe the 5 in https://phabricator.wikimedia.org/P13267 is some debug leftover
[19:17:31] ahaha so the host I happened to grab at random was the test host
[19:17:34] bad luck
[19:17:49] okay, sorry for the red herring
[19:18:05] so in that case, I don't have a good theory anymore why this popped up on 2284 in the first place
[19:18:09] for multiple of the conffiles involved here we don't want to use the default files shipped in the deb, and using -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" will ensure our puppetised versions are not overwritten
[19:18:11] given it didn't on the other hosts we've upgraded so far
[19:18:44] or alternatively, the deb gets deployed and puppet re-run, but -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" shortcuts it
[19:19:18] okay yeah agreed -- after I couldn't find it in the puppet repo, I assumed we wanted this file at the default (and we were putting our config somewhere else) but that was wrong
[19:20:45] checking with cumin, we have six hosts in codfw which use 5 and 145 which use 15 :-)
[19:21:21] mw2215, mw2216, mw2244, mw2245, mw2271, mw2272
[19:21:46] canaries
[19:22:14] yep, all of them have in common they have the canary service in conftool
[19:22:24] oh good catch
[19:22:44] aw man, the random host I grabbed was a canary? I deserved that then
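For reference, the two directions discussed here look roughly like this; "some-package" is a placeholder, and per the dpkg documentation --force-confnew is the inverse of --force-confold (on a same-version reinstall it may additionally need --force-confmiss or --force-confask to actually replace a modified conffile):

    # keep the locally modified / puppetised conffiles (what the batch upgrade used)
    apt-get install -y some-package -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold"

    # take the package maintainer's version instead, overwriting local edits
    apt-get install -y --reinstall some-package -o Dpkg::Options::="--force-confnew"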
[19:23:09] okay in that case it's safe to ignore, the canaries are already upgraded so if we were going to have problems we would have already had them
[19:23:20] I'll go ahead and add those flags to the cumin run, thanks moritzm
[19:23:24] so, this is in fact a Hiera variable:
[19:23:44] profile::mediawiki::php::slowlog_limit
[19:23:54] which is set to 5 for the canaries and 15 as a default
[19:24:48] heh, and some "RLazarus" character +1ed it https://gerrit.wikimedia.org/r/c/570256
[19:25:00] I knew something about it sounded familiar 🤦
[19:26:10] so, I still don't know why the "Package distributor has shipped an updated version" prompt popped on 2284 and not on any other host so far, but I guess it doesn't really matter
[19:26:59] yeah, earlier I got confused about the additional default in the php::fpm::pool define, the request_slowlog_timeout=15, I missed that it gets overridden by the $config hash
[19:27:40] so yeah, this seems all plausible and that RLazarus character nailed it back in February :-)
[19:33:15] okay, proceeding with the rest of the apiservers with `-o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold"`
[19:33:39] thanks both for the help <3
[19:35:41] ack, let me know if there's other conffile mysteries to dig into :-)
[20:18:17] looks like the deployment_server is currently in codfw. have you heard of plans to switch it back to eqiad as well?
[20:18:46] oh.. nevermind, i am on the new server 1002
[20:19:09] so we still want to replace 1001 with 1002 just because of old hardware, not because of OS version
[20:19:18] and I had just put that on hold
[20:23:47] rzl: so leftovers are
[20:23:58] mwmaint*, deploy* and snapshot*
[20:24:26] nod
[20:25:03] we'll coordinate with a.pergos about snapshot*, I don't think it has to happen immediately
[20:25:21] in fact immediately is fine
[20:25:31] might as well just get them over with
[20:25:36] oh okay
[20:25:59] there's only one php thing running now, it shouldn't be impacted, if it is it's nbd though
[20:26:03] literally one file
[20:26:22] I will do the dumps
[20:46:55] apiservers are finished
[21:15:47] hmm still not done
[21:16:23] I was thinking of winding down for the evening but maybe I'll stick around a little longer in that case
[21:18:00] apergos: if you are around, can you +/-1 ?
[21:18:07] where is it
[21:18:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/641255
[21:20:36] {{done}}
[21:21:11] the install failed on mw2279 (jobrunner), looking
[21:24:28] worked on a retry 🤷 continuing
[21:28:15] * apergos waits impatiently for the upgrade
[21:28:44] that annoying arrhythmic tapping sound you hear? just me drumming my fingers as I wait :-P
[21:35:59] mw2250 failed to *pool* because it has weight 0 here https://www.irccloud.com/pastebin/JN8KKIP0/
[21:36:13] I guess I can just pool it manually with confctl and the right selectors, but that's weird
[21:36:51] apergos: I am running on 1007
[21:37:09] that is your test server right?
[21:37:27] rzl: it's just the canary. but normally they have 1 instead of 0
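A sketch of the manual confctl fix being described; the select/set syntax follows the conftool documentation, but treat the exact invocation below as an assumption rather than a recipe:

    # give the videoscaler entry for mw2250 a weight and pool it
    confctl select 'name=mw2250.codfw.wmnet,service=videoscaler' set/weight=1:pooled=yes

    # check the result
    confctl select 'name=mw2250.codfw.wmnet' get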
[21:37:34] mutante: right
[21:37:44] and it's 1 in jobrunner and 0 in videoscaler, for whatever reason
[21:37:52] it doesn't actually matter which one is the test right now
[21:37:53] it might be my fault
[21:37:55] looking
[21:38:06] they are all running manual catchup things except one box maybe
[21:38:06] that could be when we had the issue with the videoscalers
[21:38:29] just run it somewhere :-D
[21:38:43] just finished in 1007
[21:38:46] wow look at all those spankin' new packages
[21:38:52] haha
[21:39:00] ok moving on to the others
[21:39:28] rzl: mw2250 once had broken hardware. Probably I failed to set the canary service to weight 1 https://phabricator.wikimedia.org/T226948
[21:39:39] ah okay
[21:39:45] mutante: want me to just set it to 1 and repool?
[21:39:57] after running the reimage cookbook
[21:39:59] rzl: yes
[21:40:09] please
[21:41:00] done
[21:41:52] thanks. error rate went up just now fwiw
[21:42:16] and right after I said it..it's over
[21:43:55] hmmm and they look like a spike of "Could not enqueue jobs from stream *" errors
[21:44:07] so it probably *is* my fault but I'm not sure how
[21:44:10] agree it's over though
[21:45:45] showed up in both eqiad and codfw which is kind of surprising to me
[21:50:03] were those errors on kibana ?
[21:51:15] yeah
[21:51:51] <_joe_> uh wait
[21:51:59] <_joe_> you didn't upgrade the mwmaints?
[21:52:09] <_joe_> so, how are you running the scripts?
[21:52:53] <_joe_> because you need to run them from a server with an upgraded ICU version
[21:55:36] we will update the mw* now, but we have not started running the scripts
[21:55:52] upgrades took quite a long time
[21:59:13] jobrunners done
[21:59:38] I never ran those in deployment-prep either btw
[21:59:48] $sometime it might be nice to do that just in case
[22:00:32] ding. midnight
[22:00:47] how long should I stick around now that the snaps are doe?
[22:00:48] done
[22:16:40] oh sorry apergos
[22:16:45] yes they are done
[22:16:55] I saw that they are!
[22:17:03] any point in watching them at all?
[22:17:14] or should I just wander off? :-D
[22:22:52] nah I think it is fine
[22:31:27] ok... gone! hope the rest of your stint is a short one too, and very boring :-)
[22:42:30] sigh it is not, thank you ariel
[23:31:20] serviceops, Beta-Cluster-Infrastructure, DBA, Operations, Patch-For-Review: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki) We upgraded to ICU 63: appservers, api, parsoid, jobrunners, mwmaint, and snapshot. What is left is deploy*. We are runnin...
[23:32:24] serviceops, Beta-Cluster-Infrastructure, DBA, Operations, Patch-For-Review: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki)
[23:37:01] serviceops, Operations, CommRel-Specialists-Support (Oct-Dec-2020), User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (RLazarus) All appservers are now running ICU 63, and the collation update script is running. Earlier today should have been the moment o...