[08:12:11] is it me or did some strange traffic pattern change happen around 8 UTC on the app side? https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&from=1603870208225&to=1603872653861
[08:12:41] 415 dropped to almost 0 and latency dropped too
[08:49:44] Would it be possible to have a 5-15 min voice/video chat with someone from ops about setting up helm charts for the Wikispeech services?
[08:50:11] I'm in lockdown, so any time would work :D
[09:11:24] serviceops, Analytics, Analytics-Kanban, Patch-For-Review, User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (jijiki) @Millimetric, after discussing with @ema, traffic feels that those requests should be visible in turnilo (eg webrequests_sample...
[09:16:26] serviceops, Analytics, Analytics-Kanban, Patch-For-Review, User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (ema) >>! In T263683#6584033, @jijiki wrote: > @Millimetric, after discussing with @ema, traffic feels that those requests should be vis...
[09:22:09] <_joe_> kalle: frankly, I think there should be a discussion at the management layer about budgeting/support for wikispeech. And when/if there is a commitment to supporting the service in production by all interested parties, we would probably want to have a chat about architecture and conventions for services running in production.
[09:22:28] <_joe_> that's all well before we can think of a helm chart
[09:22:51] <_joe_> so, please, let's get management to figure out the first part
[09:30:20] _joe_: afaik the project is moving towards security review and betacluster.
[09:31:08] And then from betacluster there will be a discussion on whether or not it should go live.
[09:31:34] So the management decision would take place after betacluster?
[09:33:36] <_joe_> I would suppose before? I'm just unaware of what the support model for this software will be
[09:34:49] But I don't know, I'm just a WMSE developer that has been told to get the services ready for k8s.
[09:35:10] <_joe_> heh :)
[09:35:23] <_joe_> I get it, and sorry I don't want to put the burden on you
[09:35:38] <_joe_> but I feel like there is a broken communication line here
[09:36:32] <_joe_> kalle: I'm just worried we'll install this thing without clear support from within the WMF. And in the end our team (serviceops) will be constantly oncall for this service, with no clear line of support for it
[09:37:02] <_joe_> but maybe everything has been done in that direction, and I'm just unaware. I'll speak up my reporting chain to get clarity
[09:39:11] <_joe_> but even then, it's unfortunate we haven't been involved before there was a finished product. There is a checklist of stuff we should go through.
[09:41:09] Thanks
[09:41:59] <_joe_> so I would say in the meantime, https://wikitech.wikimedia.org/wiki/Services/FirstDeployment#New_service_request
[09:42:36] For clarification, we're not looking to install anything yet. We are just trying to be ready with everything when we get to the potential point of deploying services.
[09:43:13] <_joe_> ack :)
[09:43:26] <_joe_> kalle: the sooner the better, though, I agree
[09:44:12] <_joe_> kalle: let me talk up the chain and I'll get back to you, but in the meantime, have you seen the tutorial for deploying new software on the deployment pipeline?
[09:45:02] <_joe_> https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Tutorial
[09:47:54] _joe_: We're already building docker images using Blubber and LibPipeline.
[09:48:53] <_joe_> kalle: ack, that's a good first step
[09:49:52] The helm chart docs mainly speak about how to do things, not so much about why. We have a couple of funky services that might need slightly different helm configs, so we'd love to speak with someone about this. Perhaps we're just good to go with standard stuff. Perhaps not.
[10:07:24] <_joe_> ack
[12:47:51] serviceops, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki)
[13:53:33] serviceops, Prod-Kubernetes, Kubernetes: Add helmfile validation for the helmfile.d/admin part - https://phabricator.wikimedia.org/T266670 (JMeybohm) p:Triage→Medium
[14:07:30] serviceops, Prod-Kubernetes, observability, Kubernetes, Patch-For-Review: Store Kubernetes events for more than one hour - https://phabricator.wikimedia.org/T262675 (JMeybohm) Open→Resolved Eventrouter is deployed to all clusters now.
[14:39:08] serviceops, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki)
[15:04:06] serviceops, Operations, ops-codfw, User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (Papaul) p:Triage→Medium
[15:17:22] I have a question about helm charts. Is it preferable to keep various config files for your service in the helm chart or in the project directory?
[15:21:26] <_joe_> maryum: good question!
[15:21:35] <_joe_> the answer is of course "it depends"
[15:21:40] Depends if you need to change them (and how frequently) I would say. You can easily keep them in your project's git and bake them into the docker image
[15:21:51] what he said :-)
[15:21:56] and what about generating config files in the helm chart?
[15:22:25] <_joe_> maryum: that's what we usually do when stuff might be configured in 10 different ways
[15:22:32] <_joe_> so let me give you two examples
[15:22:56] <_joe_> we use envoy as our TLS terminator. We deploy the same docker image with 20 different configurations
[15:23:08] <_joe_> that configuration *needs* to be templated with helm
[15:23:39] <_joe_> for other stuff, it might be enough to expose the few things that might change between dev/prod/staging via environment variables
[15:23:51] <_joe_> and inject those into the configuration
[15:24:39] we won't really have multiple config files for different envs, but our config might change a lot
[15:25:02] <_joe_> you mean over time?
[15:25:14] where is the envoy helm chart?
[15:25:34] <_joe_> it's one of the containers included in all charts
[15:26:22] <_joe_> https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/common_templates/0.2/_tls_helpers.tpl#164
[15:26:51] we're not sure how much the config would change over time. I'm guessing a lot in the beginning and then not as much at the end
[15:27:29] <_joe_> maryum: so my approach would be: for now, keep a vanilla config in the image to use e.g. in CI
[15:27:54] <_joe_> then define a config map in helm to overwrite the config file
[15:28:26] <_joe_> and within it, you can template out the stuff you think might change the most so you can just change the helm values
[15:29:01] <_joe_> I hope I'm not confusing you too much, this is all pretty convoluted and I'm still confused about how to manage these layers of configuration sanely :/
[15:29:52] no, that makes sense. we were wondering where exactly the best home for config files is
[15:30:01] and it seems that answer is all places
[15:31:07] while we're here, I'm having a bunch of permissions issues with the production variant in my blubberfile: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/635074
[15:31:45] I can't figure out how to let the production variant execute the jar with our code because it's a different user in that variant
[15:32:00] so I have all of these extra steps in the production variant that should be in the build step
[15:32:29] <_joe_> so I'm no blubber expert, but this is a common problem
[15:32:46] <_joe_> usually in docker you raise and drop privileges during build to get around this problem
[15:33:02] <_joe_> I'm not sure if we can express that in blubber though
[15:33:06] right but that's not allowed in blubber, exactly
[15:33:31] <_joe_> heh then you might need to either use the same user or at least the same UID
[15:34:26] serviceops, Product-Infrastructure-Team-Backlog, Push-Notification-Service, Patch-For-Review: Push notification service should make deletion requests to MediaWiki for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (LGoto)
[15:34:59] how can I specify a user? like in the commands in the builder section?
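The layering _joe_ describes at 15:27-15:28 (a vanilla config baked into the image, overridden per environment by values injected at deploy time) can be illustrated with a minimal Python sketch. This is not Helm template code and not taken from the deployment-charts repository; the setting names, hosts, and the `merge_config` helper are all hypothetical and only show the "defaults plus overrides" idea.

```python
from copy import deepcopy


def merge_config(base, override):
    """Recursively merge `override` into a copy of `base`; override wins."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the default.
            merged[key] = value
    return merged


# Hypothetical vanilla config shipped inside the image (used as-is in CI).
IMAGE_DEFAULTS = {
    "log_level": "debug",
    "backend": {"host": "localhost", "port": 8080},
}

# Hypothetical per-environment overrides, standing in for helm values.
PRODUCTION_VALUES = {
    "log_level": "info",
    "backend": {"host": "backend.example.org"},
}

effective = merge_config(IMAGE_DEFAULTS, PRODUCTION_VALUES)
print(effective)
```

Only the settings expected to change between environments appear in the override layer; everything else falls through from the image defaults, which is the trade-off discussed above.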
[15:39:33] serviceops, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services): Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (jijiki) Stalled→Open p:Triage→Medium
[15:39:37] serviceops, Release-Engineering-Team, PHP 7.2 support, Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (jijiki)
[15:39:47] serviceops, Operations, Product-Infrastructure-Team-Backlog, Proton, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (LGoto) a:Jgiannelos
[15:46:34] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (LGoto) a:Jgiannelos
[15:50:19] <_joe_> maryum: I think that's better asked with someone more familiar with blubber than I am
[15:50:39] ah okay who are the blubber experts?
[15:50:41] <_joe_> #-releng is probably the best place to ask
[15:52:35] _joe_: for a minute there I thought you were talking about me
[15:52:48] <_joe_> ahahahah
[15:53:03] <_joe_> hey I'm trying to help maryum, not teasing you
[15:53:26] thanks!
[15:53:33] <_joe_> (the backstory is effie hates the name blubber)
[16:21:33] serviceops, DC-Ops, Operations, Platform Engineering: Rename wtp* servers to parse* (Parsoid PHP servers) - https://phabricator.wikimedia.org/T245888 (Dzahn) The codfw part of this is done meanwhile. There are only parse2* but no wtp2*. (T247441 and others) The eqiad part though is still left t...
[17:17:32] serviceops, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services): Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (jijiki)
[17:17:35] serviceops, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki)
[17:28:33] serviceops, Operations, Platform Engineering, User-jijiki: Upgrade MediaWiki's Redis cluster to Debian Buster - https://phabricator.wikimedia.org/T265643 (jijiki)
[17:44:26] serviceops, Operations, Platform Engineering, Wikidata, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (jijiki)
[17:44:33] serviceops, Operations, Platform Engineering, Wikidata, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (jijiki)
[17:54:54] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry)
[17:55:25] serviceops, Operations, Platform Engineering, Wikidata, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (aaron) Regarding RedisLockManager (it only needs 2 of the 3 host to be reachable). If one of them is depooled or refuses connection...
[18:59:12] serviceops, Operations, Platform Engineering, Wikidata, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (jijiki) @aaron If you have any insights regarding the Redis Lock Manager and file upload, it would be much appreciated (+ T265643)
[19:05:57] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` scandium.eqi...
[19:05:59] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['scandium.eqiad.wmnet'] ` Of which those **FAILED**: ` ['sc...
[19:07:34] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` scandium.eqi...
[19:42:48] serviceops, Wikidata, Wikidata Query Builder, Wikidata Query UI, User-Addshore: Host static sites on kubernetes - https://phabricator.wikimedia.org/T264710 (Addshore) Sounds like a fine solution from our side for now. I'll let #serviceops do with this ticket as they wish (keep it or close it)...
[19:56:18] thanks for all the chats re static sites :)
[20:03:40] serviceops, Operations, Platform Engineering, Wikidata, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (jijiki)
[20:04:12] serviceops, Operations, Platform Engineering, Wikidata, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (jijiki) @aaron thank you! I updated the task description
[21:46:36] serviceops, Operations, conftool, Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (RLazarus)
[21:46:44] serviceops, Operations, conftool, Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (RLazarus) p:Triage→Medium
[21:50:53] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['scandium.eqiad.wmnet'] ` Of which those **FAILED**: ` ['sc...
[22:20:34] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['scandium.eqiad.wmnet'] ` and were **ALL** successful.
[22:30:05] serviceops, Performance-Team, Developer Productivity, User-jijiki: Evaluate use of Gerrit dashboard for code review - https://phabricator.wikimedia.org/T263494 (Krinkle) p:Triage→Medium
[22:58:58] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (Dzahn) scandium has been reimaged (stretch like before) and after merging https://gerrit.wikimedia.org/r/634383 It...
[23:58:39] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` scandium.eqi...