[06:12:33] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (elukey) Halfway ping just to remember that a month is left before the certs expire :)
[08:51:38] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review: Use Envoy for making GET requests to lang.wikipedia.org/api.php - https://phabricator.wikimedia.org/T276217 (akosiaris) >>! In T276217#6952784, @kostajh wrote: > @akosiaris @JMeybohm @hnowlan, @Tgr pointed out that our applicat...
[08:53:56] serviceops, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (akosiaris)
[08:54:52] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review: Use Envoy for making GET requests to lang.wikipedia.org/api.php - https://phabricator.wikimedia.org/T276217 (kostajh) >>! In T276217#6954930, @akosiaris wrote: >>>! In T276217#6952784, @kostajh wrote: >> @akosiaris @JMeybohm @h...
[08:57:42] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (jijiki) yeap, thank you!
[09:03:14] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review: Use Envoy for making GET requests to lang.wikipedia.org/api.php - https://phabricator.wikimedia.org/T276217 (akosiaris) >>! In T276217#6954940, @kostajh wrote: >>>! In T276217#6954930, @akosiaris wrote: >>>>! In T276217#6952784...
[09:27:54] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review: Use Envoy for making GET requests to lang.wikipedia.org/api.php - https://phabricator.wikimedia.org/T276217 (kostajh) >>! In T276217#6954967, @akosiaris wrote: >>>! In T276217#6954940, @kostajh wrote: >>>>! In T276217#6954930,...
[09:51:28] serviceops, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (akosiaris) Hi!
TL;DR: Aside from snapshot and dumpsdata that @ArielGlenn is better equipped to answer for, maybe we can do scandium and rdb* but the rest are eith...
[10:00:31] serviceops, Product-Infrastructure-Team-Backlog, Push-Notification-Service: Allow `push-notifications` service to accept production environment flag for APNS requests - https://phabricator.wikimedia.org/T274456 (MSantos)
[11:45:36] hello folks
[11:46:23] yesterday I created a couple of patches as possible ideas to allow the ml-serve clusters to cause less pain to all of you :D
[11:46:57] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/675558 for the namespaces in deployment-charts (applicable also to the rest of the shared config in theory)
[11:47:23] https://gerrit.wikimedia.org/r/c/operations/puppet/+/675566 to split the kubernetes users credentials
[11:47:57] I am not particularly proud of the latter but it should give us some flexibility as interim solution
[11:48:12] if all is completely horrible lemme know and I'll re-work it
[11:49:49] serviceops, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (ArielGlenn) Having been duly poked, may I ask which services you are looking at, both for dumpsdata* and snapshot*, that listen only on IPv4? Then we can talk abou...
[11:50:27] serviceops, Dumps-Generation, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (ArielGlenn)
[11:57:37] serviceops, Product-Infrastructure-Team-Backlog, Push-Notification-Service: Allow `push-notifications` service to accept production environment flag for APNS requests - https://phabricator.wikimedia.org/T274456 (jijiki) >>! In T274456#6929627, @Dmantena wrote: >> I am afraid we are unable to provide...
[13:44:39] sorry I'm going to be late for the meeting today -- my calendar's a little snarled with conflicts but I want to catch the last 15m of the onfire meeting, so I'll join the serviceops meeting 15m late
[13:45:31] serviceops, Dumps-Generation, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (hnowlan) > In T271142#6955077, @akosiaris wrote: >> * restbase - major services only seem to listen on ipv4 > > In fact, the major service (...
[14:22:26] serviceops, MW-on-K8s, Patch-For-Review: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (akosiaris) I have applied the patch manually on mw2305 and wtp1032. I haven't seen any difference in the various dashboards for those hosts. Respecti...
[14:45:17] serviceops, MW-on-K8s, Patch-For-Review: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (akosiaris) mw1412 and mw1412 have puppet disabled and the changes live as of a few mins ago.
[15:04:01] moritzm (or anyone really), any idea when someone might be able to provide input on feasibility of https://phabricator.wikimedia.org/T277064 ?
[15:11:41] hey, I wanted to give you a heads up - we are seeing lots of idle connections increasing on almost all databases since 14UTC
[15:12:06] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1617095520118&to=1617117120118&var-site=eqiad&var-group=core&var-shard=All&var-role=All&viewPanel=10
[15:13:03] databases will survive, there is a way to kill those after seconds of being idle, but may disrupt application server queries
[15:13:30] not sure here or a more mediawiki-focus channels is the best place to report this
[15:14:32] I think it started at 14:05, but I don't see any related deploy?
[15:18:09] serviceops, Dumps-Generation, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (crusnov) >>! In T271142#6955077, @akosiaris wrote: > Hi! > > TL;DR: Aside from snapshot and dumpsdata that @ArielGlenn is better equipped to...
[15:19:44] serviceops, Dumps-Generation, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (crusnov)
[15:20:36] serviceops, Dumps-Generation, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (crusnov)
[15:21:05] mmmm, 99% of those seem to be jobrunners, so back to infra as potential cause
[15:24:04] serviceops, Dumps-Generation, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (crusnov) >>! In T271142#6956040, @ArielGlenn wrote: > Having been duly poked, may I ask which services you are looking at, both for dumpsdata...
[15:29:16] yeah, I am almost sure the jobqueue is getting overloaded: https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=15&orgId=1&from=1617096543946&to=1617118143946&var-dc=eqiad%20prometheus%2Fk8s
[15:39:36] * akosiaris around
[15:39:48] jynus: reading back now, we've been in a team meeting
[15:40:36] things happen at the worst moment, sorry :-(
[15:41:29] so CPU utilization has increased significantly since ~12:30 today
[15:41:36] https://grafana.wikimedia.org/d/000000607/cluster-overview?viewPanel=87&orgId=1&var-site=eqiad&var-cluster=jobrunner&var-instance=All&var-datasource=thanos&from=now-6h&to=now
[15:42:17] looks wikibase related
[15:42:19] doesn't it
[15:42:26] video to me
[15:42:54] serviceopsen: see backscroll in -ops too if you haven't, about video uploads
[15:45:04] ack, and urbanecm paused a script that was uploading
[15:45:09] yeah it's video related.
I count 361 ffmpeg related processes on a single mw host
[15:45:45] The way I detected this was by seeing too many idle connections to databases that we started to kill automatically
[15:45:46] ok, I will look at recent uploads on commons
[15:46:06] I am happy for once we have nice monitoring on my side of the stack!
[15:47:39] I wonder if the uploads are serial but not waiting for the various encodings to happen before the next upload
[15:47:44] that could stack them up pretty fast
[15:48:11] I don't see anything of use in recent changes or upload
[15:48:20] effie, it is upload server side
[15:48:26] not sure if they get regular user logs
[15:48:43] someone would have to look at the script used for the server side stuff
[15:49:06] I would expect to find at least a video
[15:49:08] it's hard to wait for things in the job queue but being able to schedule in a wait of some reasonable length of time, that would be nice
[15:50:25] akosiaris: any information about the video ?
[15:50:35] or anything of use anyway
[15:51:47] ok I scrolled up to -ops, damn it is hard to read
[15:55:38] akosiaris, mutante, rzl, legoktm can we please sum up where we are
[15:55:46] half of the info is in the mess of -ops
[15:55:49] and half here
[15:56:28] can we just use -ops please?
[15:56:53] ok
[15:56:58] legoktm: it is too noisy I think, -sre is better and people who can help are there as well
[15:57:06] that is my opinion anyway
[15:57:55] let's not add a new channel right now :) if we're already in two places, adding a third won't help
[15:58:19] -ops is the right call for now IMO, that's where people are already talking
[15:58:36] I agree a summary would be a good idea though, let's do that over in -ops
[16:25:15] serviceops, Maps, Packaging: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (MSantos) @MoritzMuehlenhoff you can assume that maps will be in buster by the time we need PostGIS v3.1.
@hnowlan has already started the needed work ({T269582}) and we are very clos...
[16:39:57] thanks to everyone that helped
[16:50:40] serviceops, SRE, Parsoid (Tracking): Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (jijiki) @ssastry do we still need parsoid JS running in the parsoid servers? This is a good opportunity to clean this up. I am running into this issue T245757#6953720 when I tried to r...
[16:53:03] serviceops, SRE, Parsoid (Tracking): Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ssastry) No.
[16:57:02] rzl, legoktm, mutante I am off, a few mw* servers are struggling
[16:57:26] ack, enjoy the rest of your day :)
[16:57:28] I think we should consider switching those 6 mw12* servers with 6 mw13* servers since they are newer
[16:58:16] there are a few alerts here and there on icinga to keep an eye on :/
[16:58:20] thank you legoktm
[16:58:36] I can do the swap in a few minutes
[17:00:12] we can wait a bit more for https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=15&orgId=1&from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fk8s to look more normal
[17:00:58] for instance the cdn_purge job got a bit worse over the last 15'
[17:01:08] which is a bit odd
[17:02:01] legoktm: we could start with a couple and see how the queues are doing, what do you think?
[17:04:11] yeah, I can flip 3 over?
[17:04:36] mw[1300-1302] from job runner -> videoscaler?
[17:04:53] and mw[1293-1295] the other way
[17:06:00] yeah,
[17:06:08] ttyl:)
[17:12:39] serviceops, Product-Infrastructure-Team-Backlog, Push-Notification-Service: Allow `push-notifications` service to accept production environment flag for APNS requests - https://phabricator.wikimedia.org/T274456 (Dmantena) >> 1. Developer makes HTTP request to https://deployment with parameters of env...
[17:17:04] serviceops, Dumps-Generation, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (ArielGlenn) >>! In T271142#6957026, @crusnov wrote: >>>! In T271142#6956040, @ArielGlenn wrote: >> Having been duly poked, may I ask which se...
[17:25:00] moved 3 videoscalers to mw13XX
[17:58:41] serviceops, Dumps-Generation, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (crusnov) >>! In T271142#6957437, @ArielGlenn wrote: >>>! In T271142#6957026, @crusnov wrote: >>>>! In T271142#6956040, @ArielGlenn wrote: >>>...
[18:15:26] serviceops, DC-Ops, SRE, ops-codfw: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (Papaul)
[18:19:49] serviceops, Dumps-Generation, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (crusnov) >>! In T271142#6957794, @crusnov wrote: > Sounds good, if you'd like to ping me on IRC when you want to do this project we can do at...
[18:41:22] serviceops, Code-Health-Objective, MW-1.36-notes (1.36.0-wmf.36; 2021-03-23), Performance-Team (Radar), and 3 others: Determine multi-dc strategy for CentralAuth - https://phabricator.wikimedia.org/T267270 (Krinkle)
[18:56:32] serviceops, SRE, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn) >>! In T245757#6953720, @jijiki wrote: > I have reimaged parse2001 as a test, and it appears...
[18:57:33] serviceops, DC-Ops, SRE, ops-codfw: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (Papaul)
[19:54:15] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review: Use Envoy for making GET requests to lang.wikipedia.org/api.php - https://phabricator.wikimedia.org/T276217 (Tgr) >>! In T276217#6954930, @akosiaris wrote: > Going through the edge caches (as those are the ones responsible for...
[20:24:37] serviceops, Release Pipeline (Blubber), Release-Engineering-Team-TODO: Add k8s credentials for Blubberoid continuous deployment - https://phabricator.wikimedia.org/T217147 (thcipriani) Open→Declined declined as no longer matching reality in 2021-03-30
[23:17:09] serviceops, DC-Ops, SRE, ops-codfw: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (Papaul)