[05:51:13] <_joe_> I still see nothing has been done for the mcrouter certificates
[05:51:46] <_joe_> given we're 15 days from doomsday, I'll raise the task to UBN! I guess.
[06:00:20] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (Joe) p:Medium→High I don't realistically see it possible to switch memcached to TLS in the remaining time before we need to renew the certificates, hence raising priority. It will be raised...
[07:37:38] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2002.codfw.wmnet', 'parse2003.codfw.wmnet', 'parse200...
[07:41:18] serviceops, SRE: High load on jobrunners (12 Apr 2021) - https://phabricator.wikimedia.org/T279893 (jijiki) Open→Resolved Closing this task, no further issues were observed
[08:21:33] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2002.codfw.wmnet', 'parse2003.codfw.wmnet', 'parse2004.codfw.wmnet'] ` and were **ALL** successful.
[08:25:02] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (jijiki) >>! In T274925#6981855, @Jclark-ctr wrote: > @jijiki Racking these host i only have 2 available spots in D4 will any of the ones in t...
[08:45:59] jayme: was wondering if you could double check the value of a secret environment variable for the linkrecommendation service
[08:46:32] kostajh: sure thing
[08:47:01] jayme: thanks. We're seeing that the adminlinkrecommendation user can't connect to the mwaddlink database
[08:47:29] kostajh: that's the one running the cronjob, right?
[08:47:35] jayme: could you double check that the environment variable for DB_PASSWORD matches what's in /home/jayme/mwaddlink ?
[08:47:36] yes
[08:47:59] hmm..but that used to work in the past or am I wrong?
[08:50:17] jayme: yes, it's been working until a few days ago. We recently made some changes to the mysql connection code in the application, so that's my first suspicion, but the linkrecommendation user has no problem connecting, and I can't reproduce the issue locally on host or via docker-compose
[08:51:17] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2005.codfw.wmnet', 'parse2006.codfw.wmnet', 'parse200...
[08:55:51] jayme: I'll try deploying an earlier image without that code change to see what happens.
[08:57:06] kostajh: I don't have that file anymore, but the secret still contains what is in private puppet repo and that did not change since 2020-12-09
[09:29:30] <_joe_> can I suggest verifying the credentials are good by using the mysql client?
[09:35:06] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2005.codfw.wmnet', 'parse2006.codfw.wmnet', 'parse2007.codfw.wmnet'] ` and were **ALL** successful.
[09:42:40] _joe_: yeah, marostegui has done that.
[09:43:25] <_joe_> ok then, you should be able to see the secret when you fetch the k8s deployment data from the api
[09:43:35] <_joe_> unless I'm missing something jayme
[09:48:26] It's an actual Kubernetes secret object, so the deploy user is not able to read that
[10:27:36] serviceops, SRE, Wikimedia-Incident: High load on jobrunners (12 Apr 2021) - https://phabricator.wikimedia.org/T279893 (jijiki)
[11:08:58] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2008.codfw.wmnet', 'parse2009.codfw.wmnet', 'parse201...
[11:51:39] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2008.codfw.wmnet', 'parse2009.codfw.wmnet', 'parse2010.codfw.wmnet'] ` and were **ALL** successful.
[13:01:53] serviceops, Performance-Team, SRE, Traffic: Decide on details of progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (fgiunchedi) p:Triage→Medium
[13:36:42] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2011.codfw.wmnet', 'parse2012.codfw.wmnet', 'parse201...
[13:38:41] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1026.eqiad.wmnet', 'wtp1027.eqiad.wmnet'] ` The log can...
[13:53:16] serviceops, SRE: Jenkins fails onCI puppet with: EnvironmentError: 404 Client Error: Not Found for url: https://pypi.org/simple/pkg-resources/ - https://phabricator.wikimedia.org/T279307 (fgiunchedi) p:Triage→Medium
[13:53:35] serviceops, Maps, Packaging, SRE: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (fgiunchedi) p:Triage→Medium
[14:09:49] serviceops, Add-Link, Data-Persistence (Consultation), Growth-Team (Current Sprint): Determine why service responses are slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (kostajh) p:Medium→Low
[14:19:56] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2011.codfw.wmnet', 'parse2012.codfw.wmnet', 'parse2013.codfw.wmnet'] ` and were **ALL** successful.
[14:52:01] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1026.eqiad.wmnet', 'wtp1027.eqiad.wmnet'] ` and were **ALL** successful.
[15:02:15] <_joe_> jayme, legoktm, mutante ping :)
[15:03:29] trying to convince my computer to produce some audio output...sorry. Be there in a second
[15:35:05] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1028.eqiad.wmnet', 'wtp1029.eqiad.wmnet'] ` The log can...
[15:39:38] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2014.codfw.wmnet', 'parse2015.codfw.wmnet', 'parse201...
[16:07:15] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (jijiki) @RLazarus do you mind running the script one last time? I hope to get TLS working this quarter, but sadly I didn't manage to do it towards the end of Q3 as I originally planned.
[16:23:52] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2014.codfw.wmnet', 'parse2015.codfw.wmnet', 'parse2016.codfw.wmnet'] ` and were **ALL** successful.
[16:40:56] serviceops, Add-Link, Data-Persistence (Consultation), Growth-Team (Current Sprint): Determine why service responses are slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (akosiaris) >>! In T279411#6985523, @kostajh wrote: > > So we can see that the time spent in quer...
[16:49:09] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1028.eqiad.wmnet', 'wtp1029.eqiad.wmnet'] ` and were **ALL** successful.
[17:29:20] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2017.codfw.wmnet', 'parse2018.codfw.wmnet', 'parse201...
[17:29:56] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1030.eqiad.wmnet', 'wtp1031.eqiad.wmnet'] ` The log can...
[18:15:33] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2017.codfw.wmnet', 'parse2018.codfw.wmnet', 'parse2019.codfw.wmnet'] ` and were **ALL** successful.
[18:45:15] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1030.eqiad.wmnet', 'wtp1031.eqiad.wmnet'] ` and were **ALL** successful.
[18:48:59] serviceops, Performance-Team, SRE, Traffic: Decide on details of progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (Krinkle) a:Krinkle
[18:56:52] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2020.codfw.wmnet'] ` The log can be found in `/var/lo...
[18:57:20] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1032.eqiad.wmnet', 'wtp1033.eqiad.wmnet'] ` The log can...
[19:05:12] \o could someone here stop & remove a container in a pod for the linkrecommendation service please?
[19:24:28] basically, we have some silly code running in a cronjob that is wasting a lot of resources unnecessarily, I'm deploying the fix now but need someone to stop & remove the running container
[19:37:05] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2020.codfw.wmnet'] ` and were **ALL** successful.
[20:10:30] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1032.eqiad.wmnet', 'wtp1033.eqiad.wmnet'] ` and were **ALL** successful.
[20:15:12] serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (kostajh)
[20:46:51] kostajh: I .. just did that
[20:46:58] and deleted that pod
[20:47:15] I hope it was right
[20:47:23] let me log
[20:49:54] serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (Dzahn) ` [kubemaster1001:~] $ sudo kubectl get pods -n linkrecommendation NAME READY S...
[20:52:08] serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (Dzahn) Open→Resolved a:Dzahn I deleted the pod as requested ^. Hope that was correct as I had not done it before in production.
[20:58:21] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts...
[21:04:04] serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (kostajh) >>! In T280076#6996312, @Dzahn wrote: > I deleted the pod as requested ^. Hope that was correct as I had not done it before in production...
[21:05:38] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts...
[21:32:20] serviceops, SRE, WMF-JobQueue, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2394.codfw.wmnet'] ` and were **ALL** successful.
[21:40:38] serviceops, SRE, WMF-JobQueue, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2395.codfw.wmnet'] ` and were **ALL** successful.
[22:07:26] serviceops, SRE, WMF-JobQueue, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) mw2394 and mw2395 have been reimaged as jobrunners/videoscalers and then I pooled them into the jobrunner cluster b...
[22:08:54] serviceops, SRE, WMF-JobQueue, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) @Legoktm Would you say this is resolved (for) now?
[22:10:06] https://config-master.wikimedia.org/pybal/codfw/jobrunner (mw2394 and mw2395, the ones with weight 15, are the dedicated jobrunners for codfw now)
[22:10:39] setting them to 15 while other jobrunners are 10 was spontaneous
[23:49:25] serviceops, SRE, WMF-Annual-Report: Update annual.wikimedia.org redirect to point to 2020 Annual Report - https://phabricator.wikimedia.org/T279571 (Dzahn) Open→Resolved @spatton I'll claim this is resolved. Cheers
[23:53:05] serviceops, Scap, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (Yak Shaving 🐃🪒 ): Define a mediawiki "version" - https://phabricator.wikimedia.org/T218412 (thcipriani)
[23:53:37] serviceops, SRE, WMF-JobQueue, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Legoktm) I think we're missing 2 codfw servers that are *only* videoscalers?