[06:36:16] serviceops, DC-Ops, Platform Engineering, SRE, Patch-For-Review: Rename wtp* servers to parse* (Parsoid PHP servers) - https://phabricator.wikimedia.org/T245888 (Aklapper)
[06:36:19] serviceops, SRE, Parsoid (Tracking): Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (Aklapper)
[07:43:13] legoktm: re: buster kernel and slow PUTs, can't say for sure tbh
[08:48:47] serviceops, SRE, Traffic: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (JMeybohm) Open→Resolved Closing this as cache is disabled now.
[08:58:36] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (elukey)
[09:22:55] serviceops, SRE: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (fgiunchedi) Renamed task to be jobrunner+buster specific and looping in #serviceops
[09:26:24] serviceops, SRE: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Joe) I can't imagine a single valid reason for a distro upgrade meaning that data transfer would slow down so much. My suggestion is we re-image one jobrunner to stretch and we check...
[09:33:08] serviceops, SRE: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (fgiunchedi) I'll also note that the behavior is generally quite rare compared to the number of PUTs from jobrunners, e.g. I haven't been able to reproduce using the swift python clien...
[09:53:17] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (Joe) @RLazarus in https://phabricator.wikimedia.org/T248093#6076630 you mentioned committing a script for automating cert renewal, and I see it indeed. Renewing the certs should amount to just runn...
[09:54:44] serviceops, SRE: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Urbanecm) >>! In T275752#6864889, @fgiunchedi wrote: > Looking back a few days, e.g. Feb 4-5th, the list of hosts that take > 80s is still eqiad jobrunners, and suspiciously all have...
[11:08:13] serviceops, SRE, User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (jijiki)
[11:17:54] serviceops, SRE: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (fgiunchedi) >>! In T275752#6869341, @Urbanecm wrote: >>>! In T275752#6864889, @fgiunchedi wrote: >> Looking back a few days, e.g. Feb 4-5th, the list of hosts that take > 80s is still...
[11:47:19] serviceops, MW-on-K8s, SRE, observability: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Joe) a:Joe At the meeting we decided it's ok to let apache log to kafka as a main method of collection. We will therefore, at least in a first iteration: * Log to /...
[11:49:51] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (jijiki) I am aiming to at least test TLS on memcached T271967, hoping to roll it out next month. If this works out, we will not be needing mcrouter certs. We have 60 days ahead of us, I think it ca...
[11:57:43] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (JMeybohm) p:Triage→Medium
[12:03:36] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (Joe) >>! In T276029#6870062, @jijiki wrote: > I am aiming to at least test TLS on memcached T271967, hoping to roll it out next month. If this works out, we will not be needing mcrouter certs. We h...
[12:13:10] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (jijiki) >>! In T276029#6870186, @Joe wrote: >>>! In T276029#6870062, @jijiki wrote: >> I am aiming to at least test TLS on memcached T271967, hoping to roll it out next month. If this works out, we...
[13:59:27] serviceops, MW-on-K8s, SRE, observability: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (Joe)
[14:05:18] serviceops, MW-on-K8s: Create MediaWiki httpd base image - https://phabricator.wikimedia.org/T276097 (Joe) p:Triage→Medium
[14:09:01] serviceops, MW-on-K8s, SRE, observability: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (akosiaris) = Modify mtail to be able to consume logs from kafka = In this idea, we'd be able to just consume a kafka topic directly from mtail...
[14:58:04] serviceops, DC-Ops, Platform Engineering, SRE, Patch-For-Review: Rename wtp* servers to parse* (Parsoid PHP servers) - https://phabricator.wikimedia.org/T245888 (jijiki) Since we will be moving mediawiki to k8s relatively soon, I am not sure if it is worth the hassle at this point. My opinion...
[15:44:31] serviceops, MW-on-K8s, SRE, observability: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (colewhite) Another possible solution is to extract the metrics via a sum aggregation query with prometheus-es-exporter. It's pretty easy to se...
[16:12:27] serviceops, SRE: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm)
[16:12:32] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Legoktm)
[17:07:01] serviceops, Analytics, Cassandra, ContentTranslation, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (JMeybohm)
[17:07:12] serviceops, Analytics, Cassandra, ContentTranslation, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (JMeybohm)
[17:18:20] serviceops, Analytics-Radar, Cassandra, ContentTranslation, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (fdans)
[17:19:12] serviceops, SRE, Patch-For-Review: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1307.eqiad.wmnet ` The log can be found in `/var/log/wm...
[17:43:12] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review, Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (dduvall) >>! In T273521#6854437, @Legoktm wrote: > The restricted/ namespace is now li...
[17:43:30] serviceops, Continuous-Integration-Infrastructure, SRE: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (hashar) I would rather have cherry picked people that knows about docker-pkg / CI. But I guess it is fi...
[17:44:22] serviceops, SRE, Patch-For-Review: move mwmaint2002 into production, replace mwmaint2001 - https://phabricator.wikimedia.org/T275905 (Dzahn) p:Triage→High
[17:45:51] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review, Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (dduvall) In other words, override the value of [[ https://gerrit.wikimedia.org/r/plugi...
[17:50:04] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review, Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (Legoktm) >>! In T273521#6871522, @dduvall wrote: > PipelineLib currently calls out to...
[17:56:23] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (RLazarus) >>! In T276029#6869338, @Joe wrote: > @RLazarus in https://phabricator.wikimedia.org/T248093#6076630 you mentioned committing a script for automating cert renewal, and I see it indeed. Re...
[18:22:13] serviceops, SRE: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1307.eqiad.wmnet'] ` and were **ALL** successful.
[18:25:28] serviceops, SRE: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Dzahn) >>! In T275752#6869162, @Joe wrote: > My suggestion is we re-image one jobrunner to stretch and we check if that changes things dramatically. @Legoktm / @Dzahn can you take car...
[18:26:01] mw1307 is back on stretch (jobrunner) as requested
[19:01:05] serviceops, Performance-Team, Platform Engineering, Wikimedia-Rdbms, and 4 others: Determine and implement multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (Krinkle) >>! In T254634#6820865, @aaron wrote: > The only thing that currently updates the replication...
[20:24:01] serviceops, Patch-For-Review, Performance-Team (Radar), Sustainability (Incident Followup), User-jijiki: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (Krinkle)
[20:32:15] serviceops, Performance-Team, Developer Productivity, User-jijiki: Evaluate Gerrit code review dashboard for SRE ServiceOps - https://phabricator.wikimedia.org/T263494 (Krinkle)
[20:46:48] serviceops, Gerrit, Release-Engineering-Team: Gerrit crashed due to out of Heap - https://phabricator.wikimedia.org/T225166 (hashar) The issue was a memory leak I have found in Gerrit (T263008). It has been addressed in Gerrit 3.2.7 which we deployed in February 2020.
[21:01:16] serviceops, GitLab (Initialization), Release-Engineering-Team-TODO, User-brennen: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (thcipriani)
[21:02:40] serviceops, GitLab (Initialization), Release-Engineering-Team-TODO, User-brennen: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (thcipriani) Are there other subtasks (IP allocation or anything) that need separate subtasks?
[21:31:52] serviceops, Performance-Team, SRE, Patch-For-Review, User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (Krinkle) a:aaron→Krinkle Next steps: 1. Update mediawiki/WANObjectCache to implement a new config option that...
[22:00:13] serviceops, SRE, Patch-For-Review: Migrate onhost memcached to use a unix socket - https://phabricator.wikimedia.org/T273115 (jijiki)
[22:00:17] serviceops, Performance-Team, SRE, Patch-For-Review, User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (jijiki)
[22:00:20] serviceops, SRE, Patch-For-Review, User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (jijiki)
[22:41:07] serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963 (Dzahn)
[23:44:46] serviceops, DNS, Traffic, GitLab (Initialization), and 2 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Reedy)
[23:45:54] serviceops, DNS, SRE, Traffic, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Dzahn)