[00:00:03] RECOVERY - Long running screen/tmux on centrallog1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [00:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210302T0000). [00:00:05] Jdlrobson: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:26] i can deploy today [00:00:35] Jdlrobson: hi! [00:02:31] hey Urbanecm [00:02:36] thank you! [00:02:37] hi :) [00:02:42] first deploy from a new deployment host! [00:03:14] 💪 [00:03:48] ehm...ad https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/667688, did we previously set a wrong config variable? are we sure whatever is there should be used? [00:04:15] (03PS3) 10Urbanecm: Enable og tags on non-wikidata wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667007 (https://phabricator.wikimedia.org/T157145) (owner: 10Jdlrobson) [00:04:22] (03CR) 10Urbanecm: [C: 03+2] Enable og tags on non-wikidata wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667007 (https://phabricator.wikimedia.org/T157145) (owner: 10Jdlrobson) [00:04:48] Urbanecm: yeh :/ [00:04:58] weird no one noticed [00:05:03] wgVectorMaxWidthOptionsNamespaces does not exist but wgVectorMaxWidthOptions does [00:05:14] the values are basically the same as the defaults [00:05:18] (03Merged) 10jenkins-bot: Enable og tags on non-wikidata wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667007 (https://phabricator.wikimedia.org/T157145) (owner: 10Jdlrobson) [00:05:19] Urbanecm: I did, and I like that nothing happened a lot [00:05:24] when testing wikisource we realized that hadn't applied properly [00:05:27] i see [00:05:34] okay, makes sense 
[00:05:42] mutante: i wouldn't say so [00:05:56] /srv/mediawiki-staging is terribly out of date there [00:05:56] i've just been slow at following up :/ [00:06:05] mutante: any idea why? [00:06:24] Urbanecm: I went through that question with multiple people [00:06:26] incl releng [00:07:11] and what did they say? [00:08:01] that 2 versions are enough [00:08:14] and scap handles it [00:08:25] that it's not needed to add rsync to puppet like for /srv/patches [00:08:33] I mean, i just ran git fetch at deploy1002 [00:08:39] and it fetched 13 commits [00:08:56] instead of just one, that i merged [00:09:13] and what normally runs that? [00:09:19] humans [00:09:32] deployers are supposed to first run git fetch, and then make sure only what they merged was fetched [00:09:36] so the summary is, humans run it and it works to run it [00:09:43] yes [00:09:52] but unexpected commits were fetched [00:10:14] should the previous deployer have done that? [00:10:39] ack [00:11:19] anyway, I'm checking whether all the fetched commits are also deployed [00:11:52] !log deploy2002 - ran 'git etch' in /srv/mediawiki-staging [00:11:55] fetch [00:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:57] okay, so i know what's happening now... [00:13:36] /srv/mediawiki-staging was out of date when deploy1002 was marked active. With twentyafterfour's full scap, all commits merged to mediawiki config were undeployed accidentally...
[00:13:41] (all recent commits) [00:15:00] Jdlrobson: sorry, reviewing what to do with the other commits [00:16:21] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [00:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:58] no problem [00:17:28] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REDEPLOYING: 92f65972f4277624f74369af08563a8ca6254bda: rowiki: Update help panel links (T275130) (duration: 00m 59s) [00:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:36] T275130: Deploy Growth features on Romanian Wikipedia - https://phabricator.wikimedia.org/T275130 [00:18:15] !log urbanecm@deploy1002 sync-file aborted: 2a8ece1c92d9d1434b2b5433f3a042a279d9756e: GrowthExperiments: set GELinkRecommendationsUseEventGate (duration: 00m 05s) [00:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:21] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REDEPLOYING: 2a8ece1c92d9d1434b2b5433f3a042a279d9756e: GrowthExperiments: set GELinkRecommendationsUseEventGate (duration: 00m 57s) [00:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:47] 10SRE, 10DC-Ops, 10Platform Engineering, 10serviceops, 10Patch-For-Review: Rename wtp* servers to parse* (Parsoid PHP servers) - https://phabricator.wikimedia.org/T245888 (10Dzahn) 05Open→03Declined [00:19:50] 10SRE, 10serviceops, 10Parsoid (Tracking): Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) [00:21:16] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: 1edcbb53b2f18105d132c839cfe12cccb97031b3: vector: Stage 2 of WVUI search treatment A/B test (T249297) (duration: 00m 56s) [00:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:22] T249297: Deploy the new Vue.js search experience - https://phabricator.wikimedia.org/T249297 
[00:21:54] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:53] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: Revert: vector: Stage 2 of WVUI search treatment A/B test (T249297) (duration: 00m 56s) [00:22:57] Urbanecm: thanks for the re-deploys [00:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:45] mutante: np. It's not good we managed to undeploy many recent config commits somehow :(. [00:24:06] do we have a deployment host failover procedure I could help to review? [00:25:33] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10Papaul) [00:25:49] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: e991806eb9dc5ec018ebc59832d02e8a6563ba0a: Simplify deployment of Growth team features (1/3; T276091) (duration: 00m 56s) [00:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:57] T276091: [config] Make it easier to deploy features in shadow mode - https://phabricator.wikimedia.org/T276091 [00:26:45] !log urbanecm@deploy1002 sync-file aborted: REDEPLOY: de0f74126eddafb5375b853d543b377e78544caa: Simplify deployment of Growth team features (2/3; T276091) (duration: 00m 25s) [00:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:00] Urbanecm: No, but it likely was the last time ever it happened. [00:27:16] mutante: interesting. Why, if I may ask?
[00:27:51] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: REDEPLOY: de0f74126eddafb5375b853d543b377e78544caa: Simplify deployment of Growth team features (2/3; T276091) (duration: 00m 57s) [00:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:29] Urbanecm: because now we are on buster and newer hardware so it takes a while until one of them is EOL and by the time that happens stuff is supposed to be on kubernetes [00:28:54] i see [00:29:11] I'm wondering what would MW on k8s deployment look like [00:29:23] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: 599b7390c840388d97dc4cdbf1796451d4024c22: Simplify deployment of Growth team features (3/3; T276091) (duration: 00m 56s) [00:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:48] Urbanecm: afaik CI will build docker images and then they get pushed to the docker registry [00:30:44] https://phabricator.wikimedia.org/T265183 [00:30:55] thanks [00:31:07] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: 21cb6f5b32920c33611f26a0f3c97247f6f496f8: Revert "Revert "vector: Stage 2 of WVUI search treatment A/B test"" (T249297) (duration: 00m 56s) [00:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:14] T249297: Deploy the new Vue.js search experience - https://phabricator.wikimedia.org/T249297 [00:33:38] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: Config: [[gerrit:666842|EventLoggingSchemas: Bump HomepageVisit version (T275615)]] (duration: 00m 56s) [00:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:45] T275615: 'impact_module_state' is a required property - https://phabricator.wikimedia.org/T275615 [00:35:22] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: d53834e9460ea6321e50401cda9e53d9f74c545e: Enable 
Growth features on hrwiki in stealth mode (1/3; T275684) (duration: 00m 55s) [00:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:29] T275684: Deploy Growth features on Croatian Wikipedia - https://phabricator.wikimedia.org/T275684 [00:35:56] Jdlrobson: this is the last patch that needed a resync, will soon pull yours to mwdebug [00:36:10] no problemo [00:36:34] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: REDEPLOY: d53834e9460ea6321e50401cda9e53d9f74c545e: Enable Growth features on hrwiki in stealth mode (2/3; T275684) (duration: 00m 56s) [00:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:33] (03PS1) 10Cwhite: icinga: add grafana dashboard alert for client errors [puppet] - 10https://gerrit.wikimedia.org/r/667737 (https://phabricator.wikimedia.org/T264665) [00:37:56] !log urbanecm@deploy1002 Synchronized wmf-config/config/hrwiki.yaml: REDEPLOY: d53834e9460ea6321e50401cda9e53d9f74c545e: Enable Growth features on hrwiki in stealth mode (3/3; T275684) (duration: 00m 56s) [00:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:54] Jdlrobson: `Enable og tags on non-wikidata wikis ` is on mwdebug1001, please test [00:40:25] (03PS3) 10Urbanecm: Fixes max-width configuration for new Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667688 (https://phabricator.wikimedia.org/T260091) (owner: 10Jdlrobson) [00:40:31] (03CR) 10Urbanecm: [C: 03+2] Fixes max-width configuration for new Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667688 (https://phabricator.wikimedia.org/T260091) (owner: 10Jdlrobson) [00:40:50] Urbanecm: on it [00:40:55] thank you [00:41:27] (03Merged) 10jenkins-bot: Fixes max-width configuration for new Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667688 (https://phabricator.wikimedia.org/T260091) (owner: 10Jdlrobson) [00:41:48] Urbanecm: that one LGTM [00:41:51] thanks 
[00:42:30] syncing [00:43:13] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Papaul) [00:43:25] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 6cc8521310d6e952fc7d0b23579021b650828764: Enable og tags on non-wikidata wikis (T157145) (duration: 00m 56s) [00:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:32] T157145: Twitter cards don't work for any projects besides Wikidata - https://phabricator.wikimedia.org/T157145 [00:43:50] Jdlrobson: `Fixes max-width configuration for new Vector ` is on mwdebug1001, please test [00:44:23] .. [00:45:00] Urbanecm: that one also looks good [00:45:01] finally working! [00:45:05] thanks, syncing! [00:46:26] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 61647cd191b9bc5e2d8672fa4813b57d958f1a68: Fixes max-width configuration for new Vector (T260091) (duration: 00m 56s) [00:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:33] T260091: Expand the list of pages where max-width does not and does apply - https://phabricator.wikimedia.org/T260091 [00:47:08] (03PS4) 10Urbanecm: Separate Wikivoyage wordmark and icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667703 (https://phabricator.wikimedia.org/T261033) (owner: 10Jdlrobson) [00:47:19] (03CR) 10Urbanecm: [C: 03+2] Separate Wikivoyage wordmark and icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667703 (https://phabricator.wikimedia.org/T261033) (owner: 10Jdlrobson) [00:49:18] (03Merged) 10jenkins-bot: Separate Wikivoyage wordmark and icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667703 (https://phabricator.wikimedia.org/T261033) (owner: 10Jdlrobson) [00:49:52] Jdlrobson: pulled to mwdebug1001 as well :) [00:50:44] checking..
[00:50:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:29] LGTM Urbanecm [00:51:35] thanks, syncing [00:51:51] Did i forget to add https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/667704 to the deployment list? [00:51:51] (03PS2) 10Urbanecm: Update the Persian Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667704 (https://phabricator.wikimedia.org/T261033) (owner: 10Jdlrobson) [00:51:56] (03CR) 10Urbanecm: [C: 03+2] Update the Persian Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667704 (https://phabricator.wikimedia.org/T261033) (owner: 10Jdlrobson) [00:52:14] ah gotcha [00:52:49] you did not, I'm performing them in order [00:52:59] (03Merged) 10jenkins-bot: Update the Persian Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667704 (https://phabricator.wikimedia.org/T261033) (owner: 10Jdlrobson) [00:53:23] !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/: 97ebf7539f7f16d4908f80ea4b8eea5c4b997ecb: Separate Wikivoyage wordmark and icon (T261033; T273477) (duration: 00m 56s) [00:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:32] T273477: Update Wikivoyage logos (Vector modern & mobile) - https://phabricator.wikimedia.org/T273477 [00:53:33] T261033: Update wordmark and tagline for Persian (FA) Wikipedia - https://phabricator.wikimedia.org/T261033 [00:54:56] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 97ebf7539f7f16d4908f80ea4b8eea5c4b997ecb: Separate Wikivoyage wordmark and icon (T261033; T273477) (duration: 00m 56s) [00:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:23] Jdlrobson: persian wikipedia logos pulled to mwdebug1001 [00:56:24] Urbanecm: LGTM! [00:56:24] thanks!
[00:56:28] thanks, syncing [00:57:03] (03CR) 10Jdlrobson: [C: 03+1] icinga: add grafana dashboard alert for client errors [puppet] - 10https://gerrit.wikimedia.org/r/667737 (https://phabricator.wikimedia.org/T264665) (owner: 10Cwhite) [00:58:12] !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/: 0f08e8bbe1f74220a7a01a7606b67e0f75734a53: Update the Persian Wikipedia logos (T261033; 1/2) (duration: 00m 56s) [00:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:38] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0f08e8bbe1f74220a7a01a7606b67e0f75734a53: Update the Persian Wikipedia logos (T261033; 2/2) (duration: 00m 56s) [00:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:45] T261033: Update wordmark and tagline for Persian (FA) Wikipedia - https://phabricator.wikimedia.org/T261033 [00:59:51] Jdlrobson: should be all done. Anything else? [01:00:05] thanks Urbanecm that's everything from me [01:00:08] cool [01:00:11] thanks a bunch [01:00:12] just in time then :) [01:00:14] these are all big changes :) [01:00:14] happy to help [01:00:30] (we can now share on twitter! ... now time to write a user notice..) [01:01:29] hehe [01:13:05] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [01:25:40] Hi, SRE! We have been locked out of the Analytics MaxMind account. The email that the password reset goes to is noc@wikimedia.org. After checking with ITS, they said TechOps might have access? [01:30:50] janna_WMF: (non-SRE comment) yup, noc@ gets to the SREs. I would personally recommend creating a Phabricator ticket, explaining what you need. 
Or ask SREs from analytics 🙂 [01:40:33] (03CR) 10Jeena Huneidi: [C: 03+1] pipeline: Initial multiversion pipeline configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: 10Dduvall) [02:02:02] (03CR) 10Jforrester: "C-1." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667704 (https://phabricator.wikimedia.org/T261033) (owner: 10Jdlrobson) [02:07:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.33 [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/667743 [02:11:26] (03CR) 10Jdlrobson: "Urg. Is this repo capable of running npm test? It would be useful to run an svg checker here and automate this using svgo. Looks like many" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667704 (https://phabricator.wikimedia.org/T261033) (owner: 10Jdlrobson) [02:16:29] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667704 (https://phabricator.wikimedia.org/T261033) (owner: 10Jdlrobson) [02:20:36] (03PS9) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) [02:20:38] (03CR) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [02:21:18] (03CR) 10jerkins-bot: [V: 04-1] install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [02:28:59] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (mwmaint2002), Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [02:32:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:35:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:38:43] (03PS2) 10DannyS712: Branch commit for wmf/1.36.0-wmf.33 [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/667743 (https://phabricator.wikimedia.org/T274937) (owner: 10TrainBranchBot) [02:40:04] is the code for train branch bot public? [02:51:14] Should be [02:51:27] do you know where? [02:52:14] (03PS1) 10Aaron Schulz: Set $wgChronologyProtectorStash to mcrouter for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667744 [02:52:18] (03PS1) 10Aaron Schulz: Set $wgChronologyProtectorStash to redis for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667745 [02:52:20] (03PS1) 10Aaron Schulz: Set $wgParserCacheType to "mcrouter" on beta for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667746 [02:53:05] I think it's https://github.com/wikimedia/mediawiki-tools-release/blob/3541b03880de0d4904610d638afc43d39358e061/make-release/mwrelease/branch.py [02:53:30] There's already functionality for adding a task to the commit summary [02:53:41] I just guess whatever/wherever is running it hasn't been setup for it [02:54:19] thats what I was looking for, thanks [03:01:47] (03PS1) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) [03:16:49] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:19:11] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 9.345 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:32:47] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [04:37:47] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [05:35:09] (03PS1) 10Marostegui: instances.yaml: Add db2152 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/667756 (https://phabricator.wikimedia.org/T275633) [05:35:49] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2152 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/667756 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [05:38:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2152 into s8 as vslow - T275633', diff saved to https://phabricator.wikimedia.org/P14551 and previous config saved to /var/cache/conftool/dbconfig/20210302-053814-marostegui.json [05:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:23] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [05:39:21] (03PS1) 10Marostegui: db2152: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/667758 (https://phabricator.wikimedia.org/T275633) [05:39:55] (03CR) 10Marostegui: [C: 03+2] db2152: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/667758 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [05:40:31] !log apply gerrit:667757 on mwdebug1002 to test T259360 [05:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:38] T259360: Cognate doesn't properly create interwiki links for Shawiya Wiktionary (shy.wiktionary.org) - https://phabricator.wikimedia.org/T259360 [05:56:11] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (10Joe) >>! 
In T275731#6871524, @hashar wrote: > I would rather have cherry picked people that knows about... [06:07:05] (03PS1) 10Marostegui: install_server: Reimage db1162 [puppet] - 10https://gerrit.wikimedia.org/r/667759 (https://phabricator.wikimedia.org/T258361) [06:07:07] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:07:09] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [06:07:23] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:07:34] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) Once db1162 is back (T275309) I will reimage it and repopulate it. [06:07:55] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1162 [puppet] - 10https://gerrit.wikimedia.org/r/667759 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [06:09:47] (03CR) 10Marostegui: "kormat could you take care of this? Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) (owner: 10Dzahn) [06:11:29] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:35] PROBLEM - Host sretest1001 is DOWN: PING CRITICAL - Packet loss = 100% [06:44:46] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] ReferenceTooltips and other gadget names for ReferencePreviews (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: 10Thiemo Kreuz (WMDE)) [06:45:05] 10SRE, 10DNS, 10Traffic, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10JMeybohm) p:05Triage→03Medium [06:45:52] 10SRE, 10Abstract Wikipedia, 10DNS, 10Traffic: Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10JMeybohm) p:05Triage→03Medium [06:58:02] 10SRE: Malformed membership for ops user , has additional group(s): {'contint-admins', 'contint-docker'} - https://phabricator.wikimedia.org/T276165 (10JMeybohm) p:05Triage→03Medium Linking to T275731 [07:03:56] (03CR) 10JMeybohm: [C: 03+1] Support ANALYTICS_BASE_URL (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667561 (owner: 10Alexandros Kosiaris) [07:11:52] (03PS3) 1020after4: topic: Update Phatality dsh targets for kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/667700 (https://phabricator.wikimedia.org/T272655) [07:15:19] 10SRE: wmf-utils has an outdated script to update known hosts files - https://phabricator.wikimedia.org/T275806 (10Joe) This task has definitely nothing to do with serviceops. 
[07:24:45] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1120.eqiad.wmnet with reason: REIMAGE [07:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1120.eqiad.wmnet with reason: REIMAGE [07:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:53] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1121.eqiad.wmnet with reason: REIMAGE [07:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:10] !log Pooled `elastic106[0,4]` (Noticed I never re-pooled these hosts after resolving an incident last week) [07:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1121.eqiad.wmnet with reason: REIMAGE [07:28:58] (03CR) 10Lars Wirzenius: [C: 03+2] Branch commit for wmf/1.36.0-wmf.33 [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/667743 (https://phabricator.wikimedia.org/T274937) (owner: 10TrainBranchBot) [07:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:42] 10SRE, 10ops-eqiad, 10User-ArielGlenn: Interface errors on asw2-b-eqiad:ge-8/0/6 (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (10ArielGlenn) Note that T273713 also relates to this error. Dumpsdata1001 looks a lot better to me now. Feb 1 and 2: {F34131209} March 1 and 2: {F34131212} Dumpsata... [07:31:32] 10SRE, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10netops: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (10ArielGlenn) I wonder if this should be merged with T273714. 
[07:54:09] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1122.eqiad.wmnet with reason: REIMAGE [07:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:19] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435 [07:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:25] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435 [07:55:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667120 (owner: 10Jbond) [07:56:25] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1123.eqiad.wmnet with reason: REIMAGE [07:56:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1122.eqiad.wmnet with reason: REIMAGE [07:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:28] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.33 [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/667743 (https://phabricator.wikimedia.org/T274937) (owner: 10TrainBranchBot) [07:58:25] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1124.eqiad.wmnet with reason: REIMAGE [07:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1123.eqiad.wmnet with reason: REIMAGE [07:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:26] !log 1.36.0-wmf.33 was branched at 800e1f8cea169fc9c6e72ac1dc197591a06701bd for T274937 [07:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:33] T274937: 1.36.0-wmf.33 deployment blockers - https://phabricator.wikimedia.org/T274937 [08:00:42] 
!log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1124.eqiad.wmnet with reason: REIMAGE [08:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667645 (https://phabricator.wikimedia.org/T248858) (owner: 10Hnowlan) [08:03:46] (03CR) 10Filippo Giunchedi: [C: 03+1] topic: Update Phatality dsh targets for kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/667700 (https://phabricator.wikimedia.org/T272655) (owner: 1020after4) [08:05:24] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:35] (03CR) 1020after4: "Note this should be completely safe to merge. It's currently unused and incorrect data in prod this just gets the dsh list to match reali" [puppet] - 10https://gerrit.wikimedia.org/r/667700 (https://phabricator.wikimedia.org/T272655) (owner: 1020after4) [08:09:25] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (10fgiunchedi) My two cents: keeping mtail (or similar, in the envoy case) processing on-host would be ideal I think: it'll be simpler to scale (i... [08:11:59] (03CR) 10Filippo Giunchedi: "LGTM, is 'admins' needed as contact group tho?" 
[puppet] - 10https://gerrit.wikimedia.org/r/667737 (https://phabricator.wikimedia.org/T264665) (owner: 10Cwhite) [08:13:19] (03CR) 10Filippo Giunchedi: [C: 03+1] (WIP) mediawiki::alerts: add per cluster error/fatals rate alert [puppet] - 10https://gerrit.wikimedia.org/r/666719 (https://phabricator.wikimedia.org/T262078) (owner: 10Effie Mouzeli) [08:14:32] twentyafterfour: I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/667700 [08:14:37] (03CR) 10Filippo Giunchedi: [C: 03+2] topic: Update Phatality dsh targets for kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/667700 (https://phabricator.wikimedia.org/T272655) (owner: 1020after4) [08:16:27] also TIL conftool has its own clusters, independent of other clusters in puppet [08:23:41] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1125.eqiad.wmnet with reason: REIMAGE [08:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:46] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1125.eqiad.wmnet with reason: REIMAGE [08:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:57] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1126.eqiad.wmnet with reason: REIMAGE [08:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:57] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1127.eqiad.wmnet with reason: REIMAGE [08:28:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1126.eqiad.wmnet with reason: REIMAGE [08:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
an-worker1127.eqiad.wmnet with reason: REIMAGE [08:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:06] (03CR) 10Kosta Harlan: [C: 03+2] Support ANALYTICS_BASE_URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/667561 (owner: 10Alexandros Kosiaris) [08:31:53] (03Merged) 10jenkins-bot: Support ANALYTICS_BASE_URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/667561 (owner: 10Alexandros Kosiaris) [08:33:38] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:01] (03PS1) 10Vgutierrez: ATS: Enable parent proxies for ats-tls on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/667805 (https://phabricator.wikimedia.org/T274888) [08:35:55] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28314/console" [puppet] - 10https://gerrit.wikimedia.org/r/667805 (https://phabricator.wikimedia.org/T274888) (owner: 10Vgutierrez) [08:36:06] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:39:45] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[08:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:52] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:49:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:50:40] 10SRE: /var/run/elasticsearch not created after reboot - https://phabricator.wikimedia.org/T276198 (10MoritzMuehlenhoff) [08:51:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:52:27] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1128.eqiad.wmnet with reason: REIMAGE [08:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:55] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] ATS: Enable parent proxies for ats-tls on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/667805 (https://phabricator.wikimedia.org/T274888) (owner: 10Vgutierrez) [08:53:59] !log rolling restart of ats-tls on ulsfo [08:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1128.eqiad.wmnet with reason: REIMAGE [08:54:36] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1130.eqiad.wmnet with reason: REIMAGE [08:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:37] !log elukey@cumin1001 
START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1131.eqiad.wmnet with reason: REIMAGE [08:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:43] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1130.eqiad.wmnet with reason: REIMAGE [08:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:39] I know a lot of spam, apologies :D [08:58:21] 10SRE, 10Abstract Wikipedia, 10DNS, 10Traffic: Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10AnotherEditor144) This will probably be in phase lambda (establishing this is a crucial part of the launch). [08:58:24] elukey: yesterday it was #wikimedia-operations-kormat-fights-pcc [08:58:39] 19 patchsets: https://gerrit.wikimedia.org/r/c/operations/puppet/+/667547/ [08:58:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1131.eqiad.wmnet with reason: REIMAGE [08:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:17] kormat: ahahhaha yes I am familiar with the feeling :D [09:01:04] 10SRE, 10Abstract Wikipedia, 10DNS, 10Traffic: Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10AnotherEditor144) I will move it in a few days if there are no objections. [09:04:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:06:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:09:00] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:09:42] (03PS1) 10Lars Wirzenius: testwikis wikis to 1.36.0-wmf.33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667826 [09:09:44] (03CR) 10Lars Wirzenius: [C: 03+2] testwikis wikis to 1.36.0-wmf.33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667826 (owner: 10Lars Wirzenius) [09:10:29] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667826 (owner: 10Lars Wirzenius) [09:10:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:12:31] !log liw@deploy1002 Started scap: testwikis wikis to 1.36.0-wmf.33 [09:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:17] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Replace sata cables for cloudvirt1024 - https://phabricator.wikimedia.org/T275215 (10dcaro) Hi, can you confirm the hard drive that you replaced? Did you replace just the cables? (that's ok, but as you mention a hard drive I'm not sure). Thanks! [09:15:22] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:19:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:21:41] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10MoritzMuehlenhoff) I think the big question is whether it's realistic to upstream support for group-> user flag mappings to the RemoteUserBackend. We have had the case of Grafana were upstream reserv... [09:26:54] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: All failures: 1 (mwmaint2002), Stale: 1 (gerrit1001), Fresh: 102 jobs Jcrespo full backups causing delays - The acknowledgement expires at: 2021-03-02 12:26:22. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:29:46] (03PS2) 10Elukey: role::druid::public::worker: tune cache settings [puppet] - 10https://gerrit.wikimedia.org/r/666598 (https://phabricator.wikimedia.org/T270173) [09:32:00] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10Volans) Given the current situation with Netbox upstream [1] don't expect any new feature accepted/merged within a short timeframe. [1] https://github.com/netbox-community/netbox/discussions/5853 [09:33:42] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1119.eqiad.wmnet [09:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:11] (03CR) 10Arturo Borrero Gonzalez: "great work! some comments inline." 
(036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [09:36:21] RECOVERY - dhclient process on sretest1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [09:36:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1119.eqiad.wmnet [09:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:38] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1120-1123].eqiad.wmnet [09:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1120-1123].eqiad.wmnet [09:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:51] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28315/console" [puppet] - 10https://gerrit.wikimedia.org/r/666598 (https://phabricator.wikimedia.org/T270173) (owner: 10Elukey) [09:41:15] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1124-1128].eqiad.wmnet [09:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:45] (03PS1) 10Vgutierrez: ATS: Enable parent proxies for ats-tls on codfw [puppet] - 10https://gerrit.wikimedia.org/r/667829 [09:43:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1124-1128].eqiad.wmnet [09:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:17] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28316/console" [puppet] - 
10https://gerrit.wikimedia.org/r/667829 (owner: 10Vgutierrez) [09:45:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:46:54] !log liw@deploy1002 Finished scap: testwikis wikis to 1.36.0-wmf.33 (duration: 36m 20s) [09:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:24] 10SRE, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar): CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10ovasileva) @BBlack - From our side, we are ready to start scheduling these. How does the following sound: -... [09:48:25] 10SRE: Malformed membership for ops user , has additional group(s): {'contint-admins', 'contint-docker'} - https://phabricator.wikimedia.org/T276165 (10jbond) @MoritzMuehlenhoff fixed this in https://gerrit.wikimedia.org/r/c/operations/puppet/+/667551 [09:48:37] 10SRE: Malformed membership for ops user , has additional group(s): {'contint-admins', 'contint-docker'} - https://phabricator.wikimedia.org/T276165 (10jbond) 05Open→03Resolved a:03jbond [09:50:06] (03CR) 10Jbond: [C: 03+2] sudo: add validate_cmd for sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/667119 (owner: 10Jbond) [09:50:27] (03CR) 10Arturo Borrero Gonzalez: "great work! some early review, comments inline." 
(034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [09:52:02] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1130-1131].eqiad.wmnet [09:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:07] (03CR) 10David Caro: wmcs.vps: add cookbook to create an instance of a prefix (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [09:52:53] 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by volans on cumin1001.eqiad.wmnet for hosts: ` sretest1002.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202103020952_volans_22946_sretest1002_eqiad_w... [09:53:43] 10SRE, 10MW-on-K8s, 10serviceops: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10Joe) a:03Joe [09:54:07] 10SRE, 10vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (10akosiaris) 05Open→03Resolved a:03akosiaris Resolving per last comment. Feel free to reopen though! [09:54:13] 10SRE, 10vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (10akosiaris) Resolving per last comment. Feel free to reopen though! 
[09:54:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1130-1131].eqiad.wmnet [09:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:32] 10SRE, 10MW-on-K8s, 10serviceops: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10Joe) [09:55:34] 10SRE, 10Analytics-Clusters, 10vm-requests: Eq: new Druid test VM for analytics - https://phabricator.wikimedia.org/T266771 (10akosiaris) 05Open→03Resolved This seem to be done by the move in the workboard from Backlog to Done. Feel free to reopen though! [09:55:54] 10SRE, 10vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (10akosiaris) 05Open→03Resolved a:03akosiaris [09:58:06] 10SRE, 10vm-requests: EQIAD and CODFW : 5of VMs requested for kubernetes master - https://phabricator.wikimedia.org/T276204 (10akosiaris) [09:59:52] 10SRE, 10vm-requests: EQIAD and CODFW : 5of VMs requested for kubernetes master - https://phabricator.wikimedia.org/T276204 (10akosiaris) I 'll create the VMs, I am filling this for paperwork and visibility reasons. Those VMs will replace the following VMs: * argon * chlorine * acrux * acrab * neon [10:03:31] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10jbond) Thanks for the analysis this all looks good to me, one note > re: RemoteUserBackend requires we rewrite the request headers to the headers its expecting I think we can also do this in [[ htt... 
[10:03:48] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [10:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:32] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::druid::public::worker: tune cache settings [puppet] - 10https://gerrit.wikimedia.org/r/666598 (https://phabricator.wikimedia.org/T270173) (owner: 10Elukey) [10:05:49] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [10:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:41] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [10:08:03] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 13.55 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:08:47] vgutierrez: o/ I am confused by this, shouldn't it page? [10:09:58] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1119.eqiad.wmnet with reason: REIMAGE [10:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:25] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:10:53] weird [10:11:01] 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['sretest1002.eqiad.wmnet'] ` and were **ALL** successful. [10:11:46] ahh no ok there was a jump before [10:12:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1119.eqiad.wmnet with reason: REIMAGE [10:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:39] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [10:16:43] (03CR) 10Jbond: install_server/dhcp: dhcpd.conf include mechanism support machinery (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [10:17:12] elukey: everything looking good:) [10:18:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:18:55] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1119.eqiad.wmnet [10:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:17] 10SRE, 10Wikimedia-Mailing-lists: Request for creation: Art+Feminism Wikimedians Mailing List - https://phabricator.wikimedia.org/T275552 (10jbond) 05Open→03Resolved @Masssly i have reset the password please re-open this task if you don't receive the password reset email [10:21:07] !log 
elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1119.eqiad.wmnet [10:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:51] (03PS4) 10Effie Mouzeli: profile::templates::services_proxy: switch to ::1 when listen_ipv6 is true [puppet] - 10https://gerrit.wikimedia.org/r/667713 (https://phabricator.wikimedia.org/T255568) [10:25:38] (03CR) 10Hnowlan: [C: 03+2] prometheus::postgres_exporter: Load additional rules on stretch [puppet] - 10https://gerrit.wikimedia.org/r/667645 (https://phabricator.wikimedia.org/T248858) (owner: 10Hnowlan) [10:29:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:29:52] (03PS1) 10Phuedx: vector: Stage 3 of WVUI search treatment A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667836 (https://phabricator.wikimedia.org/T249297) [10:29:58] !log upgrade memcached on mc2024, mc1028 [10:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:02] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1028.eqiad.wmnet [10:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:02] RECOVERY - configured eth on sretest1002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [10:33:50] (03PS1) 10DCausse: Bump version to rebuild the plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/667837 [10:37:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1028.eqiad.wmnet [10:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:36] (03CR) 10Gehel: [C: 03+2] Bump version to rebuild the plugin [software/elasticsearch/plugins] - 
10https://gerrit.wikimedia.org/r/667837 (owner: 10DCausse) [10:42:41] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10LSobanski) @Papaul Thanks. I'm guessing https://phabricator.wikimedia.org/maniphest/task/edit/form/66/ would be the best place for this. Unfortunately I cannot edit the form. Wo... [10:51:00] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10jcrespo) When processing this, please don't worry about os installation, I will take care of this. Please setup network (and the parameters only you can get on puppet, such as the mac) an... [10:53:59] (03PS4) 10David Caro: wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) [10:54:01] (03PS2) 10David Caro: wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) [10:54:18] (03CR) 10David Caro: wmcs.toolforge.etcd: Added cookbook to depool and remove a node (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [10:55:28] (03CR) 10Volans: [C: 04-1] puppetdb microservice: refactor prior to expand it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [10:55:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:58:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:00:25] 10SRE: /var/run/elasticsearch not created after reboot - https://phabricator.wikimedia.org/T276198 (10MoritzMuehlenhoff) We tracked it down to an Elasticsearch restart removing /var/run/elasticsearch, this is the log by auditd (which was installed in deployment-prep, which shows the same error): ` type=PROCTIT... [11:01:44] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [11:02:08] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [11:02:13] (03CR) 10Hnowlan: [C: 03+2] api-gateway: generic discovery service config option, add linkrecommendation [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [11:02:40] (03PS1) 10MSantos: maps: execute notify-tilerator with tilerator user [puppet] - 10https://gerrit.wikimedia.org/r/667842 [11:03:39] (03Merged) 10jenkins-bot: api-gateway: generic discovery service config option, add linkrecommendation [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [11:07:38] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:55] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [11:11:55] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[11:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:18] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubemaster2001.codfw.wmnet [11:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:30] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubemaster2002.codfw.wmnet [11:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:02] (03PS2) 10MSantos: maps: execute notify-tilerator with tilerator user [puppet] - 10https://gerrit.wikimedia.org/r/667842 [11:24:02] (03CR) 10MSantos: [C: 04-1] "This seems to have a lot of moving parts that will break production. Probably better to abandon." [puppet] - 10https://gerrit.wikimedia.org/r/667842 (owner: 10MSantos) [11:26:15] (03Abandoned) 10MSantos: maps: execute notify-tilerator with tilerator user [puppet] - 10https://gerrit.wikimedia.org/r/667842 (owner: 10MSantos) [11:28:22] (03PS1) 10Hnowlan: api-gateway: allow access to linkrecommendation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/667844 (https://phabricator.wikimedia.org/T269581) [11:33:09] (03CR) 10ArielGlenn: [C: 03+1] api-gateway: allow access to linkrecommendation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/667844 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [11:33:40] (03CR) 10Hnowlan: [C: 03+2] api-gateway: allow access to linkrecommendation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/667844 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [11:34:25] (03Merged) 10jenkins-bot: api-gateway: allow access to linkrecommendation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/667844 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [11:35:09] (03PS1) 10MSantos: maps: fix expiry tile list mask to work with imposm3 
[puppet] - 10https://gerrit.wikimedia.org/r/667847 [11:40:02] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet [11:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:38] 10SRE, 10Traffic: Sudden surge of requests to https://wikipedia.org/ from Telus customers - https://phabricator.wikimedia.org/T276213 (10Joe) p:05Triage→03Low [11:47:28] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster2001.codfw.wmnet [11:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:15] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd2001.codfw.wmnet [11:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:07] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@937deb5]: (no justification provided) [11:53:10] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@937deb5]: (no justification provided) (duration: 00m 03s) [11:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:26] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2001.codfw.wmnet [11:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:28] 10SRE, 10Traffic: Sudden surge of requests to https://wikipedia.org/ from Telus customers - https://phabricator.wikimedia.org/T276213 (10hashar) Over 30 days there were some other related spikes (use a fine grained 5 minutes aggregate instead of daily ones) https://w.wiki/33TA Apparently hitting same URLs.... 
[11:59:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1084 to clone db1164 T258361', diff saved to https://phabricator.wikimedia.org/P14554 and previous config saved to /var/cache/conftool/dbconfig/20210302-115959-marostegui.json [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European mid-day backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210302T1200). [12:00:04] kart_ and phuedx: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:06] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [12:00:18] \o [12:00:25] I can deploy today, unless someone else wants to :) [12:00:28] o/ [12:00:31] Hullo [12:01:03] * kart_ is here. [12:01:16] Urbanecm: Please go ahead. First one does not need any testing. 
[12:01:38] (03PS2) 10Urbanecm: Remove test2wiki from wgContentTranslationAsBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667424 (owner: 10KartikMistry) [12:01:45] (03CR) 10Urbanecm: [C: 03+2] Remove test2wiki from wgContentTranslationAsBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667424 (owner: 10KartikMistry) [12:02:23] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd2002.codfw.wmnet [12:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:42] (03Merged) 10jenkins-bot: Remove test2wiki from wgContentTranslationAsBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667424 (owner: 10KartikMistry) [12:04:31] syncing [12:04:32] (03PS3) 10Volans: puppetdb microservice: refactor prior to expand it [puppet] - 10https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) [12:04:34] (03PS4) 10Volans: puppetdb microservice: add support for cumin [puppet] - 10https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) [12:04:55] (03CR) 10Volans: puppetdb microservice: refactor prior to expand it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [12:05:35] (03PS4) 10Urbanecm: Enable SectionTranslation in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666567 (https://phabricator.wikimedia.org/T275596) (owner: 10KartikMistry) [12:05:40] (03CR) 10Urbanecm: [C: 03+2] Enable SectionTranslation in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666567 (https://phabricator.wikimedia.org/T275596) (owner: 10KartikMistry) [12:06:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: af89965e80a77e92e78e3948e0678460decd7718: Remove test2wiki from wgContentTranslationAsBetaFeature (duration: 01m 38s) [12:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:06:44] (03CR) 10jerkins-bot: [V: 04-1] puppetdb microservice: add support for cumin [puppet] - 10https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [12:07:00] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:03] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2002.codfw.wmnet [12:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:11] (03Merged) 10jenkins-bot: Enable SectionTranslation in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666567 (https://phabricator.wikimedia.org/T275596) (owner: 10KartikMistry) [12:08:31] kart_: ^^ is pulled to mwdebug1001, please test [12:08:38] (03PS2) 10Urbanecm: vector: Stage 3 of WVUI search treatment A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667836 (https://phabricator.wikimedia.org/T249297) (owner: 10Phuedx) [12:08:44] (03CR) 10Urbanecm: [C: 03+2] vector: Stage 3 of WVUI search treatment A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667836 (https://phabricator.wikimedia.org/T249297) (owner: 10Phuedx) [12:09:02] Urbanecm: This will require testing [12:09:16] phuedx: ack, I will ping you once it's on mwdebug [12:09:20] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:10:27] (03CR) 10Volans: "I've tested the latest PS against the existing service and I found no diff in the responses for the following requests:" [puppet] - 10https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [12:10:36] * volans looking at the uncommitted changes [12:11:25] (03PS5) 10Volans: puppetdb microservice: add support for cumin [puppet] - 
10https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) [12:11:28] (03Merged) 10jenkins-bot: vector: Stage 3 of WVUI search treatment A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667836 (https://phabricator.wikimedia.org/T249297) (owner: 10Phuedx) [12:11:51] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@8d3d81c]: (no justification provided) [12:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:57] Urbanecm: few minutes more to test.. [12:12:04] kart_: ack, take your time [12:12:07] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@8d3d81c]: (no justification provided) (duration: 00m 15s) [12:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:16] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd2003.codfw.wmnet [12:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:36] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2003.codfw.wmnet [12:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:03] jayme: the uncommitted dns changes is related to your work? kubemaster2001/2 changes [12:14:28] phuedx: your changes are pulled to mwdebug1002, please test. [12:14:38] volans: uhm, no...I'm just rebooting [12:14:49] Urbanecm: On it [12:15:32] (03CR) 10Muehlenhoff: profile::zuul::server: Remove support for jessie (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/667102 (owner: 10Muehlenhoff) [12:15:54] Urbanecm: mwdebug1001 for me, just to confirm, right? [12:15:59] kart_: yes. 
[12:16:04] (03PS2) 10Muehlenhoff: profile::zuul::server: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/667102 [12:16:16] I pulled phuedx's change to 1002 just to allow both of you to test independently :) [12:16:19] volans: I'll have a look [12:16:35] jayme: mmmh weird https://netbox.wikimedia.org/search/?q=kubemaster [12:16:55] (03PS4) 10Hnowlan: maps: fix imposm3 cache dir [puppet] - 10https://gerrit.wikimedia.org/r/667598 (owner: 10MSantos) [12:17:01] the kubemaster have IPs but I don't see the devices... [12:17:20] I have lunch ready in 2 [12:17:22] volans: ah, that's probably akosiaris for https://phabricator.wikimedia.org/T276204 [12:18:29] he'll probably add all of them and then run the sync [12:19:27] Urbanecm: all OK. Please deploy. [12:19:28] mmmh the makevm cookbook does all that [12:19:35] thank you, syncing kart_ [12:19:54] ah it was started earlier [12:20:09] jayme: I bet is the cookbook waiting for the confirmation from akosiaris to proceed ;) [12:21:03] sorry for the ping [12:21:07] Urbanecm: Testing is done. It looks good to us. Deploy away! [12:21:14] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5674d2ab64c2833e15ac8a90696fcde529e58dca: Enable SectionTranslation in testwiki (T275596) (duration: 01m 09s) [12:21:17] don't worry! :) [12:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:23] T275596: Enable SectionTranslation in testwiki - https://phabricator.wikimedia.org/T275596 [12:21:28] kart_: yours is live [12:21:29] phuedx: syncing [12:22:14] Urbanecm: thanks!! 
[12:22:35] any time [12:23:35] (03CR) 10Hnowlan: [C: 03+2] maps: fix expiry tile list mask to work with imposm3 [puppet] - 10https://gerrit.wikimedia.org/r/667847 (owner: 10MSantos) [12:23:42] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 29952b404b3fe9c235da86df0ffb86b725845473: vector: Stage 3 of WVUI search treatment A/B test (T249297) (duration: 01m 08s) [12:23:49] phuedx: live [12:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:51] T249297: Deploy the new Vue.js search experience - https://phabricator.wikimedia.org/T249297 [12:23:51] anything else? [12:27:37] (03PS1) 10Marostegui: mariadb: Productionize db1164 [puppet] - 10https://gerrit.wikimedia.org/r/667851 (https://phabricator.wikimedia.org/T258361) [12:27:39] (03CR) 10Hnowlan: [C: 03+2] maps: fix imposm3 cache dir [puppet] - 10https://gerrit.wikimedia.org/r/667598 (owner: 10MSantos) [12:28:23] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw [12:28:28] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1164 [puppet] - 10https://gerrit.wikimedia.org/r/667851 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [12:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:45] Thanks, Urbanecm! 
[12:29:49] np [12:32:01] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [12:32:01] !log jayme@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) [12:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:59] (03CR) 10Effie Mouzeli: "I did some testing, I think that we can try this on a debug and an api server" [puppet] - 10https://gerrit.wikimedia.org/r/667713 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [12:37:22] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:38:20] RECOVERY - Check no envoy runtime configuration is left persistent on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [12:39:01] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1012.eqiad.wmnet [12:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:05] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10jbond) @Sergey.Trofimovsky.SF please see the comment below >> Email address: sergey.trofimovsky@speedandfunction.com > The email address regis... 
[12:42:31] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10jbond) [12:43:40] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host kubemaster2001.codfw.wmnet [12:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:56] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 23624 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:44:59] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1012.eqiad.wmnet [12:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:02] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubemaster2001.codfw.wmnet [12:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:14] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10jbond) >>! In T275677#6864295, @jbond wrote: > @OlyKalinichenkoSpeedAndFunction are you also able to confirm L3 status as per: > >>>! In T275677#... [12:46:28] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10jbond) [12:47:58] (03PS1) 10JMeybohm: sre.discovery.service-route: Fix service name generation [cookbooks] - 10https://gerrit.wikimedia.org/r/667852 [12:51:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/667120 (owner: 10Jbond) [12:51:50] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (10jbond) @Eugene.chernov > Preferred shell username: ichernov This differs from the shell name you used when registering on wikitech, can you please... [12:52:11] (03CR) 10Jbond: [C: 03+2] sudo: add validate_cmd for sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/667120 (owner: 10Jbond) [12:53:21] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [12:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:40] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubemaster2002.codfw.wmnet [12:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210302T1300) [13:00:20] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10wkandek) Yes, we can do all servers in A3 at once. [13:00:32] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/667852 (owner: 10JMeybohm) [13:08:40] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubemaster1001.eqiad.wmnet [13:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:07] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[13:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [13:13:39] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [13:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:30] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubemaster2001.codfw.wmnet [13:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [13:18:28] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (10Eugene.chernov) @jbond Yes, sorry ‘eugene-chernov’ is the right one [13:20:08] (03CR) 10Muehlenhoff: [C: 03+2] envoyproxy: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/666920 (owner: 10Muehlenhoff) [13:20:37] 10SRE, 10SRE-Access-Requests: wikidata.org delegated Full Google Search Console access for abaso@wikimedia.org - https://phabricator.wikimedia.org/T275240 (10jbond) > IIRC the Webmaster Central URL at https://www.google.com/webmasters/verification/details?hl=en&domain=wikidata.org ought to make it possible del... 
[13:21:56] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:24:36] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubemaster1001.eqiad.wmnet [13:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:47] (03PS1) 10Alexandros Kosiaris: Clean old kubetcd200[123] DHCP entries [puppet] - 10https://gerrit.wikimedia.org/r/667857 [13:24:49] (03PS1) 10Alexandros Kosiaris: Introduce kubemaster[12]00[12], kubestagemaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/667858 (https://phabricator.wikimedia.org/T276204) [13:25:39] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubemaster1002.eqiad.wmnet [13:25:46] (03CR) 10jerkins-bot: [V: 04-1] Introduce kubemaster[12]00[12], kubestagemaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/667858 (https://phabricator.wikimedia.org/T276204) (owner: 10Alexandros Kosiaris) [13:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM. 
Let's roll it out" [puppet] - 10https://gerrit.wikimedia.org/r/667713 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [13:30:29] 10SRE, 10Analytics, 10Machine-Learning-Team: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10elukey) [13:30:47] (03PS1) 10Jbond: admin: add eugene-chernov [puppet] - 10https://gerrit.wikimedia.org/r/667859 (https://phabricator.wikimedia.org/T275679) [13:31:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (10jbond) [13:32:27] 10SRE, 10Analytics, 10Machine-Learning-Team: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10elukey) @fkaelin we discussed this during our grooming session and we decided to pause the efforts for Kubeflow until we'll know that this is the technology/stack that we'll use. We'll know f... [13:33:20] (03CR) 10Muehlenhoff: [C: 03+2] uwsgi: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/666938 (owner: 10Muehlenhoff) [13:33:41] 10SRE, 10Analytics, 10Machine-Learning-Team: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10elukey) p:05Medium→03Triage [13:36:03] (03PS3) 10Jgiannelos: Deploy tegola on kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) [13:36:44] (03PS1) 10Elukey: Add an-worker11[19,20-28,30,31] to Analytics Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/667860 (https://phabricator.wikimedia.org/T274795) [13:38:05] 10SRE, 10Analytics, 10Machine-Learning-Team: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10Joseph) Hi @fkaelin, I think you tagged wrong Joseph. [13:39:31] 10SRE, 10Abstract Wikipedia, 10DNS, 10Traffic: Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Aklapper) @AnotherEditor: Objection. Such actions are up to developers. Thanks. 
[13:40:58] (03CR) 10Elukey: [C: 03+2] Add an-worker11[19,20-28,30,31] to Analytics Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/667860 (https://phabricator.wikimedia.org/T274795) (owner: 10Elukey) [13:41:35] (03PS1) 10Muehlenhoff: profile::tlsproxy::envoy: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/667861 [13:42:58] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubestagemaster1001.eqiad.wmnet [13:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:18] (03CR) 10jerkins-bot: [V: 04-1] profile::tlsproxy::envoy: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/667861 (owner: 10Muehlenhoff) [13:43:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [13:44:42] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubemaster1002.eqiad.wmnet [13:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [13:53:30] (03CR) 10Volans: "much better! 
Couple of nits inline" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [13:56:23] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster1001.eqiad.wmnet [13:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:45] !log installing bind9 security updates on stretch (client-side tools/libs only) [13:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:09] (03CR) 10David Caro: [C: 03+1] "👍" [cookbooks] - 10https://gerrit.wikimedia.org/r/667852 (owner: 10JMeybohm) [13:59:31] 10SRE, 10Analytics, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech-focus: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10fgiunchedi) random-ish update re: checkpoint storage after a chat with @Zbyszko: the current situation... [14:00:04] liw and longma: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210302T1400). 
[14:00:17] (03PS1) 10Lars Wirzenius: group0 wikis to 1.36.0-wmf.33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667866 [14:00:19] (03CR) 10Lars Wirzenius: [C: 03+2] group0 wikis to 1.36.0-wmf.33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667866 (owner: 10Lars Wirzenius) [14:02:42] (03CR) 10Gehel: "see minor comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [14:03:03] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667866 (owner: 10Lars Wirzenius) [14:03:58] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1164 is now replicating in s1 (running 10.4.18) Will start pooling after 24h [14:04:28] (03CR) 10Alexandros Kosiaris: [C: 03+1] "From the kubernetes (profile::docker::engine puppet class) point of view this is fine." [puppet] - 10https://gerrit.wikimedia.org/r/667628 (owner: 10Muehlenhoff) [14:04:29] !log liw@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.33 [14:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:51] (03CR) 10Gehel: "minor comments inline." 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [14:06:04] (03PS2) 10Alexandros Kosiaris: Clean old kubetcd200[123] DHCP entries [puppet] - 10https://gerrit.wikimedia.org/r/667857 [14:06:06] (03PS2) 10Alexandros Kosiaris: Introduce kubemaster[12]00[12], kubestagemaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/667858 (https://phabricator.wikimedia.org/T276204) [14:06:08] (03PS1) 10Alexandros Kosiaris: staging-eqiad: Apply role/hiera to new master [puppet] - 10https://gerrit.wikimedia.org/r/667867 [14:07:57] !log dropping database bacula from m1 (with replication) T274809 [14:08:03] ^ marostegui [14:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:04] T274809: Drop unused database "bacula" from m1 - https://phabricator.wikimedia.org/T274809 [14:08:07] \o/ [14:09:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] Clean old kubetcd200[123] DHCP entries [puppet] - 10https://gerrit.wikimedia.org/r/667857 (owner: 10Alexandros Kosiaris) [14:11:48] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:12:03] (03PS3) 10Alexandros Kosiaris: Introduce kubemaster[12]00[12], kubestagemaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/667858 (https://phabricator.wikimedia.org/T276204) [14:12:05] (03PS2) 10Alexandros Kosiaris: staging-eqiad: Apply role/hiera to new master [puppet] - 10https://gerrit.wikimedia.org/r/667867 [14:12:52] (03PS1) 10Kosta Harlan: [WIP] Use Envoy for requests to MediaWiki API [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) [14:14:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] Introduce kubemaster[12]00[12], kubestagemaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/667858 (https://phabricator.wikimedia.org/T276204) (owner: 10Alexandros 
Kosiaris) [14:16:04] (03PS1) 10Jcrespo: Bacula: remove grants for bacula db [puppet] - 10https://gerrit.wikimedia.org/r/667870 (https://phabricator.wikimedia.org/T274809) [14:17:34] (03CR) 10Marostegui: [C: 03+1] Bacula: remove grants for bacula db [puppet] - 10https://gerrit.wikimedia.org/r/667870 (https://phabricator.wikimedia.org/T274809) (owner: 10Jcrespo) [14:18:39] (03PS8) 10Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - 10https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929) [14:19:19] (03CR) 10Wolfgang Kandek: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667859 (https://phabricator.wikimedia.org/T275679) (owner: 10Jbond) [14:20:09] (03CR) 10Marostegui: [C: 03+1] mariadb-backups: Document logical backups grants throughout production dbs [puppet] - 10https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929) (owner: 10Jcrespo) [14:20:20] (03CR) 10Jcrespo: [C: 03+2] Bacula: remove grants for bacula db [puppet] - 10https://gerrit.wikimedia.org/r/667870 (https://phabricator.wikimedia.org/T274809) (owner: 10Jcrespo) [14:22:19] 10SRE, 10Traffic: Sudden surge of requests to https://wikipedia.org/ from Telus customers - https://phabricator.wikimedia.org/T276213 (10CDanis) Did this cause any actual issue? 
[14:25:02] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Allow access to thorium [deployment-charts] - 10https://gerrit.wikimedia.org/r/667872 [14:25:17] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Document logical backups grants throughout production dbs [puppet] - 10https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929) (owner: 10Jcrespo) [14:29:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1084 (re)pooling @ 5%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14557 and previous config saved to /var/cache/conftool/dbconfig/20210302-142911-root.json [14:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:35] !log dropping db grants for bacula from m1 T274809 [14:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:41] T274809: Drop unused database "bacula" from m1 - https://phabricator.wikimedia.org/T274809 [14:30:34] (03CR) 10Alexandros Kosiaris: trafficserver: add director for gitlab to gitlab1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667731 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [14:30:40] I will also try to use orchestrator instead of tendril for reference [14:30:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] gitlab: open port 80 for traffic from caching servers [puppet] - 10https://gerrit.wikimedia.org/r/667733 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [14:30:49] wrong channel [14:31:46] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 97 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:33:57] (03PS3) 10Hashar: Add rename-project plugin @ a880148 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/630548 
(https://phabricator.wikimedia.org/T201953) [14:34:45] (03CR) 10Hashar: "I should push that eventually but renaming a repository would need some written run book I guess. Maybe I should rebuild it again." [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/630548 (https://phabricator.wikimedia.org/T201953) (owner: 10Hashar) [14:37:32] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:38:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Allow access to thorium [deployment-charts] - 10https://gerrit.wikimedia.org/r/667872 (owner: 10Alexandros Kosiaris) [14:38:43] (03Merged) 10jenkins-bot: linkrecommendation: Allow access to thorium [deployment-charts] - 10https://gerrit.wikimedia.org/r/667872 (owner: 10Alexandros Kosiaris) [14:41:24] 10SRE, 10DNS, 10Traffic, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10wkandek) https://gerrit.wikimedia.org/r/c/operations/puppet/+/667731 [14:42:59] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:11] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:18] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[14:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:55] akosiaris: o/ - one qs - does https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/667872 mean that a kubernetes service uses thorium as backend? (if so I wasn't aware) [14:44:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1084 (re)pooling @ 10%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14558 and previous config saved to /var/cache/conftool/dbconfig/20210302-144415-root.json [14:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:55] elukey: yes, linkrecommendation. It pulls https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/ [14:45:17] it puts that into the mysql db and then serves API requests based on it [14:45:26] * elukey cries in a corner [14:45:32] thanks :) [14:45:35] yw [14:49:02] heads up about T276224 which will be fixed shortly [14:49:02] T276224: DBQuery Error: testwiki.growthexperiments_link_recommendations' doesn't exist - https://phabricator.wikimedia.org/T276224 [14:49:57] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) [14:52:34] PROBLEM - Check systemd state on dbprov1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:50] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul) @LSobanski Please reach out to Rob. 
Thanks [14:52:53] (03CR) 10Thcipriani: trafficserver: add director for gitlab to gitlab1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667731 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [14:53:34] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [14:53:34] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Validate htpasswd() salt is only 8 characters [puppet] - 10https://gerrit.wikimedia.org/r/666787 (owner: 10Legoktm) [14:55:17] tgr_: nice to hear! Should I merge 667869: HomepageHooks: Block search data hook if link recommendations are off | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/667869 ? [14:55:49] just did [14:56:22] cool :) [14:56:27] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] ATS: Enable parent proxies for ats-tls on codfw [puppet] - 10https://gerrit.wikimedia.org/r/667829 (owner: 10Vgutierrez) [14:56:29] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10thcipriani) Updated the ticket with details about AuthorizedKeysCommand, and removed the proposed solution (here... 
[14:57:44] !log rolling restart of ats-tls on codfw [14:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1084 (re)pooling @ 25%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14559 and previous config saved to /var/cache/conftool/dbconfig/20210302-145918-root.json [14:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:46] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:00:47] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:41] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10wkandek) A second sshd daemon for port 29418 concentrates all configuration for gitlab/ssh access in a separate... [15:06:50] (03PS1) 10Gergő Tisza: HomepageHooks: Block search data hook if link recommendations are off [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/667811 (https://phabricator.wikimedia.org/T276224) [15:08:01] (03CR) 10Bstorm: wmcs.toolforge: add cookbook to create an instance of a prefix (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [15:08:15] liw: longma: if the train is still ongoing, would you be able to throw https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/667811 on top of it? 
[15:08:49] (03PS2) 10David Caro: doc: Introduce a code reviewing guideline [software/spicerack] - 10https://gerrit.wikimedia.org/r/666601 [15:09:07] (03CR) 10David Caro: "Added some of the suggestions." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/666601 (owner: 10David Caro) [15:10:31] tgr, train has been at group0 for an hour, but alas I'm not able to add patches to it - if you want to do a backport yourself, that'd be OK, there's plenty of train window left [15:11:00] thanks, will do it then [15:13:36] (03CR) 10JMeybohm: [C: 03+2] sre.discovery.service-route: Fix service name generation [cookbooks] - 10https://gerrit.wikimedia.org/r/667852 (owner: 10JMeybohm) [15:13:45] (03CR) 10Gergő Tisza: [C: 03+2] "emergency backport" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/667811 (https://phabricator.wikimedia.org/T276224) (owner: 10Gergő Tisza) [15:14:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1084 (re)pooling @ 50%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14560 and previous config saved to /var/cache/conftool/dbconfig/20210302-151422-root.json [15:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (10jbond) 05Open→03Resolved a:03jbond This access has been enabled please allow up to 15 minutes for the change to fully pro... [15:17:41] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/667628 (owner: 10Muehlenhoff) [15:19:40] 10SRE, 10DBA: Puppetize grants for mysql hosts that are the source of recovery (dbstore, passive misc) - https://phabricator.wikimedia.org/T111929 (10jcrespo) 05Open→03Resolved a:03jcrespo This is technically done with the merging of the previous patch. 
This is not a great solution, but it is "a" solutio... [15:22:45] RECOVERY - Check systemd state on dbprov1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:21] (03Merged) 10jenkins-bot: HomepageHooks: Block search data hook if link recommendations are off [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/667811 (https://phabricator.wikimedia.org/T276224) (owner: 10Gergő Tisza) [15:27:28] (03CR) 10David Caro: wmcs.toolforge: add cookbook to create an instance of a prefix (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [15:29:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1084 (re)pooling @ 75%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14561 and previous config saved to /var/cache/conftool/dbconfig/20210302-152925-root.json [15:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:07] (03CR) 10Bstorm: wmcs.toolforge: add cookbook to create an instance of a prefix (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [15:33:10] (03CR) 10Cwhite: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/667737 (https://phabricator.wikimedia.org/T264665) (owner: 10Cwhite) [15:35:22] !log tgr@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/: Backport: [[gerrit:667811|HomepageHooks: Block search data hook if link recommendations are off (T276224)]] (duration: 01m 13s) [15:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:33] T276224: DBQuery Error: testwiki.growthexperiments_link_recommendations' doesn't exist - https://phabricator.wikimedia.org/T276224 [15:36:15] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:37:54] liw: done, thanks [15:38:57] 10SRE: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10dcausse) [15:39:17] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10dcausse) [15:40:33] (03PS2) 10Jbond: profile::tlsproxy::envoy: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/667861 (owner: 10Muehlenhoff) [15:44:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1084 (re)pooling @ 85%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14562 and previous config saved to /var/cache/conftool/dbconfig/20210302-154429-root.json [15:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:49:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:52:43] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host chartmuseum2001.codfw.wmnet [15:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:57] (03PS1) 10Jbond: rake_modules: drop jessie from default spec tests [puppet] - 10https://gerrit.wikimedia.org/r/667881 [15:54:14] (03CR) 10Jbond: [C: 03+2] rake_modules: drop jessie from default spec tests [puppet] - 10https://gerrit.wikimedia.org/r/667881 (owner: 10Jbond) [15:55:06] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum2001.codfw.wmnet [15:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:09] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw [15:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:28] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad [15:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1084 (re)pooling @ 100%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14563 and previous config saved to /var/cache/conftool/dbconfig/20210302-155932-root.json [15:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:55] 10SRE, 10ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (10elukey) [16:07:59] 10SRE, 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10elukey) [16:08:18] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref 
HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:08:29] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) 05Open→03Resolved Thanks a lot, will follow up in a new task! [16:10:38] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [16:10:39] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [16:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:48] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:12:29] (03CR) 10Volans: wmcs.toolforge: add cookbook to create an instance of a prefix (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [16:13:59] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [16:13:59] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[16:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:03] (03CR) 10Arturo Borrero Gonzalez: wmcs.toolforge: add cookbook to create an instance of a prefix (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [16:16:31] (03CR) 10David Caro: wmcs.toolforge: add cookbook to create an instance of a prefix (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [16:17:59] (03CR) 10David Caro: wmcs.toolforge: add cookbook to create an instance of a prefix (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [16:19:09] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/28319/" [puppet] - 10https://gerrit.wikimedia.org/r/667861 (owner: 10Muehlenhoff) [16:19:18] 10ops-eqiad, 10Analytics: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10elukey) [16:19:37] 10ops-eqiad, 10Analytics: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10elukey) [16:20:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:21:23] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10Papaul) @jcrespo why do you have to do the easy part? :) [16:22:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:27:21] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10MoritzMuehlenhoff) >>! In T276148#6874923, @wkandek wrote: > A second sshd daemon for port 29418 concentrates al... [16:30:03] (03PS1) 10Giuseppe Lavagetto: Allow changing the IP of the fcgi server [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667885 [16:30:05] (03PS1) 10Giuseppe Lavagetto: Add httpd image for MediaWiki [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667886 [16:30:22] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add php 7.3 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664884 (owner: 10Giuseppe Lavagetto) [16:33:25] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host chartmuseum1001.eqiad.wmnet [16:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:35] (03PS1) 10David Caro: Use the correct package name/path everywhere [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/667887 [16:35:38] (03CR) 10David Caro: "Maybe the solution would be to use the same name for the package and the repo, this only makes the scripts/docs use the same name as the p" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/667887 (owner: 10David Caro) [16:37:10] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum1001.eqiad.wmnet [16:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:32] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad [16:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:35] 10SRE, 10Traffic: Sudden surge of 
requests to https://wikipedia.org/ from Telus customers - https://phabricator.wikimedia.org/T276213 (10Joe) >>! In T276213#6874754, @CDanis wrote: > Did this cause any actual issue? No actual issue, hence the low priority. I anyhow thought it would be to have a task opened wi... [16:42:40] (03CR) 10Ahmon Dancy: [C: 04-1] "minor nits" (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: 10Dduvall) [16:43:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::templates::services_proxy: switch to ::1 when listen_ipv6 is true [puppet] - 10https://gerrit.wikimedia.org/r/667713 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [16:45:23] (03CR) 10Muehlenhoff: Use the correct package name/path everywhere (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/667887 (owner: 10David Caro) [16:46:40] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10MoritzMuehlenhoff) >>! In T271058#6863725, @aborrero wrote: > I don't see errors anymore on cloudnet1004 after the kernel up... [16:48:03] (03CR) 10David Caro: Use the correct package name/path everywhere (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/667887 (owner: 10David Caro) [16:48:21] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "A cursory check of mediawiki-config says this shouldn't cause issues." 
[puppet] - 10https://gerrit.wikimedia.org/r/667714 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [16:50:31] (03CR) 10Muehlenhoff: Use the correct package name/path everywhere (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/667887 (owner: 10David Caro) [16:54:03] (03PS2) 10David Caro: Use the correct package name/path everywhere [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/667887 [16:54:05] (03CR) 10David Caro: Use the correct package name/path everywhere (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/667887 (owner: 10David Caro) [16:54:59] (03CR) 10Jdlrobson: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/667737 (https://phabricator.wikimedia.org/T264665) (owner: 10Cwhite) [16:56:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:58:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:00:04] jbond42 and cdanis: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210302T1700). [17:00:04] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:18] o/ [17:05:02] (03PS1) 10Elukey: bigtop::sqoop: apply upstream patch/override to sqoop bin [puppet] - 10https://gerrit.wikimedia.org/r/667894 (https://phabricator.wikimedia.org/T274866) [17:05:09] tgr_, they are in meetings at the moment [17:05:15] maybe I can help you [17:05:29] thanks! 
[17:05:48] (03CR) 10Elukey: [C: 03+2] bigtop::sqoop: apply upstream patch/override to sqoop bin [puppet] - 10https://gerrit.wikimedia.org/r/667894 (https://phabricator.wikimedia.org/T274866) (owner: 10Elukey) [17:05:55] the patch should be a noop in production; the maintenance script aborts immediately [17:06:04] it's mainly needed for testing in beta [17:06:13] I am doing a sanity check [17:06:22] as I don't normally check maintenance scripts [17:06:31] (03CR) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [17:07:05] (where it also won't work at the moment, due to T275975. But I'd rather set it up now and check after that gets fixed.) [17:07:06] T275975: Search broken on beta cluster wikis - https://phabricator.wikimedia.org/T275975 [17:08:42] cf https://github.com/wikimedia/mediawiki-extensions-GrowthExperiments/blob/master/maintenance/refreshLinkRecommendations.php#L114 and https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L25644 [17:09:06] (03PS5) 10Effie Mouzeli: hieradata: enable ipv6 on envoy services proxy on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/667714 (https://phabricator.wikimedia.org/T255568) [17:11:42] (03PS5) 10Svantje Lilienthal: ReferenceTooltips and other gadget names for ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: 10Thiemo Kreuz (WMDE)) [17:12:31] tgr_, I checked that too, but I was mostly checking puppet syntax, etc. [17:13:21] legoktm: Interesting thing happened. Submodule update done on Friday for ContentTranslation seems reverted with wmf.33 update. 
[17:13:46] (03CR) 10Jcrespo: [C: 03+2] Add GrowthExperiments maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/655865 (https://phabricator.wikimedia.org/T261408) (owner: 10Gergő Tisza) [17:14:18] (03PS6) 10Svantje Lilienthal: ReferenceTooltips and other gadget names for ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: 10Thiemo Kreuz (WMDE)) [17:14:19] liw: Do you know if there is any changes done with wmf.32, legoktm updated extensions/ContentTranslation and it seems reverted. [17:14:27] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Legoktm) I remember having to help people in `#mediawiki` whose network blocked very high ports like 29418. Mayb... [17:14:51] o.O [17:15:34] legoktm: What can be possible reason? [17:15:41] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+log/refs/heads/wmf/1.36.0-wmf.32/extensions/ContentTranslation looks fine to me? [17:16:21] tgr_, Notice: Applied catalog in 56.25 seconds [17:16:30] let me double check the systemd timer [17:16:35] thanks jynus! [17:16:36] (03PS1) 10Mholloway: Fix timestamp format for migrated events [extensions/EventLogging] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/667813 (https://phabricator.wikimedia.org/T276235) [17:16:44] legoktm: See Special:Version of bnwiki for example. [17:16:50] (03PS1) 10Mholloway: Fix timestamp format for migrated events [extensions/EventLogging] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667814 (https://phabricator.wikimedia.org/T276235) [17:16:52] legoktm: it points to old code. 
[17:17:32] :/ [17:17:49] on deploy1002 the submodule isn't bumped [17:17:58] tgr_, mediawiki_job_growthexperiments-refreshLinkRecommendations.timer [17:18:05] sorry, wrong link [17:18:11] https://phabricator.wikimedia.org/P14565 [17:18:32] Submodule extensions/ContentTranslation e6b1a7cd0d..cd5cd3c9d2 (rewind): [17:18:33] < CX3 Build 0.1.0+20210223 [17:18:33] Submodule extensions/Graph 9d5cf348f5..64e3de6a17 (rewind): [17:18:33] < Do not log graph errors to WMF servers [17:18:58] legoktm: oh, so because of server switch? [17:19:08] that's what I'm guessing... [17:19:28] only thing I can think of that would explain it [17:20:28] syncing [17:20:36] Thanks!! [17:21:26] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10brennen) > I remember having to help people in #mediawiki whose network blocked very high ports like 29418. Mayb... [17:21:33] !log legoktm@deploy1002 Synchronized php-1.36.0-wmf.32/extensions/ContentTranslation/: Re-apply: CX3 Build 0.1.0+20210223 (duration: 01m 10s) [17:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:20] legoktm: Things are back to normal it seems.. 
[17:26:50] Looks like our patch is victim of every worst possible scenario :P [17:31:22] 10SRE, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar), 10Turkish-Sites: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10FriedrickMILBarbarossa) [17:31:45] 10SRE, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar), and 2 others: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10FriedrickMILBarbarossa) [17:32:02] 10SRE, 10Desktop Improvements, 10Traffic, 10Bengali-Sites, and 3 others: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10FriedrickMILBarbarossa) [17:32:28] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: debian-glue jobs ignored error messages about libeatmydata.so in LD_PRELOAD - https://phabricator.wikimedia.org/T240430 (10LarsWirzenius) From other contexts, I *think* this needs to be fixed in Debian's own tooling. The impact is th... [17:32:31] tgr_, log looks fine, so going away [17:32:40] thanks again! 
[17:34:47] (03PS1) 10Jbond: P:profile::client::httpd: add param override the attribute delimiter [puppet] - 10https://gerrit.wikimedia.org/r/667898 [17:36:00] (03PS3) 10Effie Mouzeli: mediawiki::alerts: add per cluster error/fatals rate alert [puppet] - 10https://gerrit.wikimedia.org/r/666719 (https://phabricator.wikimedia.org/T262078) [17:37:29] (03PS2) 10Jbond: P:profile::client::httpd: add param override the attribute delimiter [puppet] - 10https://gerrit.wikimedia.org/r/667898 [17:38:31] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO: debian-glue jobs ignored error messages about libeatmydata.so in LD_PRELOAD - https://phabricator.wikimedia.org/T240430 (10thcipriani) [17:38:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28321/console" [puppet] - 10https://gerrit.wikimedia.org/r/667898 (owner: 10Jbond) [17:39:19] !log legoktm@deploy1002 Synchronized php-1.36.0-wmf.32/extensions/Graph/: Do not log graph errors to WMF servers (duration: 01m 08s) [17:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:40] (03CR) 10Zppix: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/667815 (https://phabricator.wikimedia.org/T276241) (owner: 10Zppix) [17:41:11] 10SRE, 10Desktop Improvements, 10Traffic, 10Wiktionary-fr, and 2 others: CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (10FriedrickMILBarbarossa) [17:43:56] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@869a29b]: ores_bulk_ingest: Increase drafttopic error_threshold to 1 per 500 [17:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:23] (03CR) 10Jbond: [V: 03+1] "tested localy on puppetboard and all works fine" [puppet] - 10https://gerrit.wikimedia.org/r/667898 (owner: 10Jbond) [17:46:34] (03PS1) 10Jbond: O:netmon: update delimiter to use ':' [puppet] - 10https://gerrit.wikimedia.org/r/667899 [17:46:52] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@869a29b]: ores_bulk_ingest: Increase drafttopic error_threshold to 1 per 500 (duration: 02m 55s) [17:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:14] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Sergey.Trofimovsky.SF) Allow me to stress that the SSH port for Gitlab is a long term choice. Whatever is decide... 
[17:50:08] (03CR) 10Mholloway: [C: 03+2] Fix timestamp format for migrated events [extensions/EventLogging] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/667813 (https://phabricator.wikimedia.org/T276235) (owner: 10Mholloway) [17:50:26] (03CR) 10jerkins-bot: [V: 04-1] Fix timestamp format for migrated events [extensions/EventLogging] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667814 (https://phabricator.wikimedia.org/T276235) (owner: 10Mholloway) [17:50:28] (03PS1) 10Phamhi: wikireplica: depool clouddbb1014 [puppet] - 10https://gerrit.wikimedia.org/r/667901 (https://phabricator.wikimedia.org/T273281) [17:50:32] (03PS2) 10Jbond: O:netmon: update delimiter to use ':' [puppet] - 10https://gerrit.wikimedia.org/r/667899 [17:51:04] (03PS3) 10Hnowlan: postgres: use remote script on replica to resync [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) [17:51:06] (03CR) 10Hnowlan: postgres: use remote script on replica to resync (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [17:51:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28323/console" [puppet] - 10https://gerrit.wikimedia.org/r/667899 (owner: 10Jbond) [17:52:22] (03CR) 10Mholloway: "recheck" [extensions/EventLogging] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667814 (https://phabricator.wikimedia.org/T276235) (owner: 10Mholloway) [17:56:29] (03PS1) 10Jbond: P:idp::client::httpd::site: update default delimiter [puppet] - 10https://gerrit.wikimedia.org/r/667902 [17:57:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:58:01] (03CR) 10Jbond: [V: 03+1] "to the best of my 
knowledge only librenms has logic to parse and split headers (see follow up patches)." [puppet] - 10https://gerrit.wikimedia.org/r/667898 (owner: 10Jbond) [17:58:35] (03CR) 10Zppix: [C: 03+1] "Worked when tested on one of cloud vps projects i use." [puppet] - 10https://gerrit.wikimedia.org/r/667815 (https://phabricator.wikimedia.org/T276241) (owner: 10Zppix) [17:59:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/667861 (owner: 10Muehlenhoff) [18:00:04] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210302T1800). [18:00:06] 10SRE, 10Desktop Improvements, 10Traffic, 10Bengali-Sites, and 3 others: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10BBlack) @ovasileva Yes, that plan seems reasonable! Just to be sure we're on the same page on details an... [18:00:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:02:10] (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/667887 (owner: 10David Caro) [18:04:13] (03PS3) 10Hnowlan: postgres: add script for automatic resyncing [puppet] - 10https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) [18:04:34] (03CR) 10Hnowlan: postgres: add script for automatic resyncing (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [18:07:26] (03PS1) 10Vgutierrez: ATS: Enable parent proxies for ats-tls on esams [puppet] - 10https://gerrit.wikimedia.org/r/667907 [18:11:03] (03CR) 10Dduvall: [C: 03+2] Extend wmfSwiftConfig placeholder keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667242 (owner: 10Ahmon Dancy) [18:11:20] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28325/console" [puppet] - 10https://gerrit.wikimedia.org/r/667907 (owner: 10Vgutierrez) [18:11:47] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] ATS: Enable parent proxies for ats-tls on esams [puppet] - 10https://gerrit.wikimedia.org/r/667907 (owner: 10Vgutierrez) [18:11:50] (03Merged) 10jenkins-bot: Extend wmfSwiftConfig placeholder keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667242 (owner: 10Ahmon Dancy) [18:12:36] !log rolling restart of ats-tls on esams [18:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:11] jouncebot: now [18:13:11] For the next 0 hour(s) and 46 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210302T1800) [18:15:49] (03Merged) 10jenkins-bot: Fix timestamp format for migrated events [extensions/EventLogging] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/667813 
(https://phabricator.wikimedia.org/T276235) (owner: 10Mholloway) [18:15:58] (03CR) 10Jbond: "puppet stuff looks good to me, some questions inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [18:17:46] (03PS1) 10Ottomata: Bump pyarrow to 3.0.0 and include toree [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/667908 [18:17:48] (03PS1) 10Ottomata: Fix bug in conda-deactivate-stacked that would cause infinite loop [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/667909 (https://phabricator.wikimedia.org/T224658) [18:19:57] chrisalbon: do you have anything going out in this deploy window and would you object to me syncing a noop change to mediawiki-config? [18:20:36] (03CR) 10Mholloway: [C: 03+2] Fix timestamp format for migrated events [extensions/EventLogging] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667814 (https://phabricator.wikimedia.org/T276235) (owner: 10Mholloway) [18:21:27] !log mholloway-shell@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/EventLogging: Fix timestamp format for migrated events (T276235) (duration: 01m 09s) [18:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:34] T276235: [MEP] [BUG] Timestamp format changed in migrated server-side EventLogging schemas - https://phabricator.wikimedia.org/T276235 [18:21:57] (03PS5) 10Ahmon Dancy: wmf-config/CommonSettings.php: Add MW_MAINTENANCE_OFFLINE handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) [18:22:00] ah, looks like mholloway is deploying so i'll wait [18:23:38] marxarelli: ~30 mins until the wmf.32 patch clears CI, so if you want to get something in in the meantime, by all means go ahead! [18:23:55] okey dokey. thanks! [18:24:05] it's a tiny noop change, so should be quick [18:26:32] PROBLEM - configured eth on sretest1002 is CRITICAL: eno2 reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:27:47] ops, this is me, fixing [18:28:42] !log dduvall@deploy1002 Synchronized private/readme.php: Config: [[gerrit:667242|Extend wmfSwiftConfig placeholder keys]] (duration: 01m 09s) [18:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:47] 10SRE, 10ops-eqiad, 10Analytics: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10wiki_willy) a:03Cmjohnson [18:33:06] (03PS1) 10Herron: assign mwlog2002 role::logging::mediawiki::udp2log [puppet] - 10https://gerrit.wikimedia.org/r/667911 (https://phabricator.wikimedia.org/T224565) [18:33:08] (03PS1) 10Herron: assign mwlog1002 role::logging::mediawiki::udp2log [puppet] - 10https://gerrit.wikimedia.org/r/667912 (https://phabricator.wikimedia.org/T224565) [18:33:59] 10SRE, 10observability, 10Patch-For-Review: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (10herron) a:03herron [18:36:47] (03CR) 10Cwhite: [C: 03+1] assign mwlog2002 role::logging::mediawiki::udp2log [puppet] - 10https://gerrit.wikimedia.org/r/667911 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [18:36:58] (03PS1) 10Ottomata: Symlink conda-(de)activate-stacked scripts into user env instead of cp [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/667913 (https://phabricator.wikimedia.org/T224658) [18:37:16] (03CR) 10Ottomata: [C: 03+2] Bump pyarrow to 3.0.0 and include toree [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/667908 (owner: 10Ottomata) [18:37:18] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Bump pyarrow to 3.0.0 and include toree [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/667908 (owner: 10Ottomata) [18:40:50] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [18:40:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:52] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [18:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:03] (03PS1) 10Phamhi: wikireplica: depool clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/667915 (https://phabricator.wikimedia.org/T273281) [18:48:57] (03Merged) 10jenkins-bot: Fix timestamp format for migrated events [extensions/EventLogging] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667814 (https://phabricator.wikimedia.org/T276235) (owner: 10Mholloway) [18:49:37] (03CR) 10Bstorm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667901 (https://phabricator.wikimedia.org/T273281) (owner: 10Phamhi) [18:50:34] (03CR) 10Phamhi: [C: 03+2] wikireplica: depool clouddbb1014 [puppet] - 10https://gerrit.wikimedia.org/r/667901 (https://phabricator.wikimedia.org/T273281) (owner: 10Phamhi) [18:52:02] (03PS2) 10Ottomata: Fix bug in conda-deactivate-stacked that would cause infinite loop [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/667909 (https://phabricator.wikimedia.org/T224658) [18:53:08] !log mholloway-shell@deploy1002 Synchronized php-1.36.0-wmf.32/extensions/EventLogging: Fix timestamp format for migrated events (T276235) (duration: 01m 10s) [18:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:15] T276235: [MEP] [BUG] Timestamp format changed in migrated server-side EventLogging schemas - https://phabricator.wikimedia.org/T276235 [18:56:46] RECOVERY - configured eth on sretest1002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:57:14] (03PS1) 10Herron: elk: send icinga events to a separate partition/index [puppet] - 10https://gerrit.wikimedia.org/r/667917 [18:59:58] !log apply merge.policy.deletes_pct_allowed=20 to production-search-codfw 
commonswiki_file to encourage merging away deleted docs from T271493 [19:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210302T1900). [19:00:04] Zabe: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:05] T271493: Implement 50kb limit on file text indexing for to reduce increasing commonswiki_file on-disk size - https://phabricator.wikimedia.org/T271493 [19:10:19] here [19:13:55] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [19:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:56] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [19:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:18] (03PS1) 10Ahmon Dancy: scap-master-sync: Don't exclude CDB files [puppet] - 10https://gerrit.wikimedia.org/r/667919 (https://phabricator.wikimedia.org/T275826) [19:18:06] (03PS1) 10Andrew Bogott: nova compute: make live_migration_uri dc-specific [puppet] - 10https://gerrit.wikimedia.org/r/667920 (https://phabricator.wikimedia.org/T265965) [19:22:53] Zabe: I'm sorry no one claimed the window yet. If you're still around, I can get the patch out for you. 
[19:23:21] Urbanecm: that would be nice [19:23:27] cool [19:24:02] (03PS2) 10Urbanecm: Set local timezone for trwikivoyage to UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667320 (https://phabricator.wikimedia.org/T275598) (owner: 10Zabe) [19:24:06] (03CR) 10Urbanecm: [C: 03+2] Set local timezone for trwikivoyage to UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667320 (https://phabricator.wikimedia.org/T275598) (owner: 10Zabe) [19:25:55] (03CR) 10Urbanecm: [C: 04-1] "Noting an issue in-line." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076) (owner: 10Zabe) [19:26:01] (03Merged) 10jenkins-bot: Set local timezone for trwikivoyage to UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667320 (https://phabricator.wikimedia.org/T275598) (owner: 10Zabe) [19:27:05] (03PS1) 10Ahmon Dancy: wikiversions-dev.json: php-1.35.0-wmf.30 [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/667923 [19:28:15] (03PS2) 10Ahmon Dancy: wikiversions-dev.json: php-1.36.0-wmf.30 [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/667923 [19:28:18] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f6fa5b3d2fa315518ce5d0995a653e038da05f24: Set local timezone for trwikivoyage to UTC (T275598) (duration: 01m 09s) [19:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:26] T275598: Set local timezone for Turkish Wikivoyage to UTC - https://phabricator.wikimedia.org/T275598 [19:28:35] Zabe: synced the first patch (trwikivoyage timezone). I'm uncomfortable with syncing the other patch through. Interface admins were introduced to only let people who _really_ need to edit sitewide CSS/JS to do so. I'm concerned this would turn the group into a general "technician" group, which would decrease the security benefits gained from it. 
I will post a note on the Phabricator task, and ping a few people to comment on [19:28:35] that. [19:28:44] (03CR) 10Ahmon Dancy: [C: 03+2] wikiversions-dev.json: php-1.36.0-wmf.30 [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/667923 (owner: 10Ahmon Dancy) [19:29:09] ok thx [19:29:37] (03Merged) 10jenkins-bot: wikiversions-dev.json: php-1.36.0-wmf.30 [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/667923 (owner: 10Ahmon Dancy) [19:29:46] (03CR) 10CRusnov: [C: 03+1] "LGTM, this seems like a good solution!" [puppet] - 10https://gerrit.wikimedia.org/r/667898 (owner: 10Jbond) [19:30:02] Zabe: anything else I can do for you? [19:30:11] (03CR) 10Dduvall: [C: 03+1] "I have only a minor reservation about the name MW_MAINTENANCE_OFFLINE, that it may be ambiguous whether it is forcing an offline mode or f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) (owner: 10Ahmon Dancy) [19:32:21] Urbanecm: Could you take a look at this one? https://gerrit.wikimedia.org/r/667348 [19:32:45] looking [19:33:08] (03PS3) 10Urbanecm: Enable babel categorize on thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667348 (https://phabricator.wikimedia.org/T275283) (owner: 10Zabe) [19:33:11] (03CR) 10Urbanecm: [C: 03+2] Enable babel categorize on thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667348 (https://phabricator.wikimedia.org/T275283) (owner: 10Zabe) [19:34:07] (03CR) 10jerkins-bot: [V: 04-1] Enable babel categorize on thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667348 (https://phabricator.wikimedia.org/T275283) (owner: 10Zabe) [19:36:36] what? [19:36:52] ah, right, tab indenting [19:36:57] Zabe: can you fix that patch to indent by tabs? 
[19:37:10] yes [19:39:43] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [19:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:57] (03PS4) 10Zabe: Enable babel categorize on thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667348 (https://phabricator.wikimedia.org/T275283) [19:40:33] (03CR) 10Urbanecm: [C: 03+2] Enable babel categorize on thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667348 (https://phabricator.wikimedia.org/T275283) (owner: 10Zabe) [19:40:38] thanks, let's give it a second try [19:41:37] (03Merged) 10jenkins-bot: Enable babel categorize on thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667348 (https://phabricator.wikimedia.org/T275283) (owner: 10Zabe) [19:42:49] syncing [19:42:55] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:53] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c48e40ac43445d6b038919133b951d6aaea960b7: Enable babel categorize on thwikisource (T275283) (duration: 01m 09s) [19:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:00] T275283: Enable babel categorize on thwikisource - https://phabricator.wikimedia.org/T275283 [19:44:30] Zabe: should be done. Please note pages might need to be purged for the categories to appear. [19:44:32] anything else? 
[19:44:55] no, thanks for your work [19:45:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:45:48] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [19:46:30] cool :) [19:49:22] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/667927 [19:49:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:50:40] (03CR) 10Gergő Tisza: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/667927 (owner: 10Kosta Harlan) [19:51:38] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/667927 (owner: 10Kosta Harlan) [19:53:13] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[19:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:02] codfw mgmt is going down for 5 minutes for maintenance thank you [19:56:16] |log codfw mgmt is going down for 5 minutes for maintenance thank you [19:56:26] !log codfw mgmt is going down for 5 minutes for maintenance thank you [19:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:23] (03PS10) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) [19:57:25] (03CR) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [19:58:13] (03CR) 10CDanis: [C: 03+2] Fix up some HTML validation errors. [software/klaxon] - 10https://gerrit.wikimedia.org/r/662754 (owner: 10CDanis) [19:59:35] (03CR) 10Cwhite: [C: 03+2] icinga: add grafana dashboard alert for client errors [puppet] - 10https://gerrit.wikimedia.org/r/667737 (https://phabricator.wikimedia.org/T264665) (owner: 10Cwhite) [20:00:04] liw and longma: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210302T2000). [20:00:20] (03CR) 10Legoktm: [C: 04-1] "The new files in multiversion/bin/* should have a `.php` extension so CI will lint them and run PHPCS on them." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: 10Dduvall) [20:00:22] (03PS11) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) [20:00:24] (03CR) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [20:00:36] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [20:00:36] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [20:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:55] (03Merged) 10jenkins-bot: Fix up some HTML validation errors. [software/klaxon] - 10https://gerrit.wikimedia.org/r/662754 (owner: 10CDanis) [20:03:56] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [20:03:56] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[20:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:00] 10SRE, 10Analytics: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10Ottomata) p:05Triage→03Low [20:13:16] (03PS2) 10Aaron Schulz: Set $wgChronologyProtectorStash for beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667744 [20:16:06] (03Abandoned) 10Aaron Schulz: Set $wgChronologyProtectorStash for beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667744 (owner: 10Aaron Schulz) [20:16:15] (03PS2) 10Aaron Schulz: Set $wgChronologyProtectorStash to redis for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667745 [20:17:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:17:57] (03PS2) 10Andrew Bogott: nova compute: make live_migration_uri dc-specific [puppet] - 10https://gerrit.wikimedia.org/r/667920 (https://phabricator.wikimedia.org/T265965) [20:18:00] (03PS1) 10Andrew Bogott: wmcs-drain-hypervisor.py: Don't fail or lock up on VMs that aren't in state ACTIVE [puppet] - 10https://gerrit.wikimedia.org/r/667928 (https://phabricator.wikimedia.org/T276208) [20:18:24] (03PS3) 10Aaron Schulz: Set $wgChronologyProtectorStash to redis for production/labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667745 [20:18:46] (03PS2) 10Aaron Schulz: Set $wgParserCacheType to "mcrouter" on beta for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667746 [20:18:51] (03CR) 10jerkins-bot: [V: 04-1] wmcs-drain-hypervisor.py: Don't fail or lock up on VMs that aren't in state ACTIVE [puppet] - 10https://gerrit.wikimedia.org/r/667928 
(https://phabricator.wikimedia.org/T276208) (owner: 10Andrew Bogott) [20:19:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:19:54] (03PS2) 10Andrew Bogott: wmcs-drain-hypervisor.py: Don't fail or lock up on VMs that aren't in state ACTIVE [puppet] - 10https://gerrit.wikimedia.org/r/667928 (https://phabricator.wikimedia.org/T276208) [20:20:02] (03CR) 10Andrew Bogott: [C: 03+2] nova compute: make live_migration_uri dc-specific [puppet] - 10https://gerrit.wikimedia.org/r/667920 (https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott) [20:20:42] (03CR) 10jerkins-bot: [V: 04-1] wmcs-drain-hypervisor.py: Don't fail or lock up on VMs that aren't in state ACTIVE [puppet] - 10https://gerrit.wikimedia.org/r/667928 (https://phabricator.wikimedia.org/T276208) (owner: 10Andrew Bogott) [20:21:48] (03PS3) 10Andrew Bogott: wmcs-drain-hypervisor.py: Better handling of VMS not in state ACTIVE [puppet] - 10https://gerrit.wikimedia.org/r/667928 (https://phabricator.wikimedia.org/T276208) [20:26:54] (03PS1) 10Phamhi: Revert "wikireplica: depool clouddbb1014" [puppet] - 10https://gerrit.wikimedia.org/r/667818 [20:29:56] (03PS1) 10Milimetric: analytics/refine: bump refinery-source version to 0.1.2 [puppet] - 10https://gerrit.wikimedia.org/r/667930 [20:32:45] (03PS4) 10Aaron Schulz: Set $wgChronologyProtectorStash to redis for production/labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667745 [20:33:30] (03PS3) 10Aaron Schulz: Set $wgParserCacheType to "mcrouter" on beta for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667746 [20:35:10] (03CR) 10Krinkle: [C: 03+1] "Prod uses 'wgParserCacheType' => [ 'default' => 'mysql-multiwrite', ] which underneath uses 'mcrouter-with-onhost-tier'. 
I don't know if t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667746 (owner: 10Aaron Schulz) [20:36:03] (03CR) 10Krinkle: [C: 03+2] Set $wgChronologyProtectorStash to redis for production/labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667745 (owner: 10Aaron Schulz) [20:36:17] (03PS4) 10Krinkle: Set $wgParserCacheType to "mcrouter" on beta for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667746 (owner: 10Aaron Schulz) [20:36:20] (03CR) 10Krinkle: [C: 03+2] Set $wgParserCacheType to "mcrouter" on beta for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667746 (owner: 10Aaron Schulz) [20:36:55] (03Merged) 10jenkins-bot: Set $wgChronologyProtectorStash to redis for production/labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667745 (owner: 10Aaron Schulz) [20:37:07] (03Merged) 10jenkins-bot: Set $wgParserCacheType to "mcrouter" on beta for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667746 (owner: 10Aaron Schulz) [20:37:51] * Krinkle using mwdebug1002 for testing [20:38:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:39:27] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [20:39:47] (03CR) 10Phamhi: [C: 03+2] Revert "wikireplica: depool clouddbb1014" [puppet] - 10https://gerrit.wikimedia.org/r/667818 (owner: 10Phamhi) [20:39:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:43:28] (03PS1) 10ArielGlenn: allow linkrecommendation service access to m2-master [deployment-charts] - 10https://gerrit.wikimedia.org/r/667934 (https://phabricator.wikimedia.org/T276268) [20:47:56] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I7f387bf19e5f prep wgChronologyProtectorStash ahead of wmf.33 roll out to ensure cross-wiki consistency (duration: 01m 18s) [20:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:38] (03PS3) 10Krinkle: InitialiseSettings: Remove wmg/wg indirection for BotPasswords (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666778 [20:48:40] (03PS3) 10Krinkle: CommonSettings: Remove wmg/wg indirection for BotPasswords (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666779 [20:48:42] (03PS1) 10Krinkle: InitialiseSettings: Remove wmg/wg indirection for BotPasswords (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667935 [20:51:00] (03CR) 10Bstorm: [C: 03+1] wikireplica: depool clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/667915 (https://phabricator.wikimedia.org/T273281) (owner: 10Phamhi) [21:01:44] (03CR) 10Ottomata: kafka: Disable alert for absolute max lag value and under-replicated partitions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667724 (https://phabricator.wikimedia.org/T273702) (owner: 10Razzi) [21:12:39] 10SRE, 10DBA, 10Performance-Team, 10Sustainability (MediaWiki-MultiDC): Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (10Krinkle) [21:13:37] 10SRE, 10DBA, 10Performance-Team (Radar), 10Sustainability (MediaWiki-MultiDC): Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (10Krinkle) [21:17:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: 
job={netbox_device_statistics,pdu_sentry4} site={codfw,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:20:54] (03CR) 10BryanDavis: [C: 03+1] "We purposefully exclude CDB file syncing from the deploy server->mediawiki server actions taken by scap, but this change seems reasonable " [puppet] - 10https://gerrit.wikimedia.org/r/667919 (https://phabricator.wikimedia.org/T275826) (owner: 10Ahmon Dancy) [21:21:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:27:54] (03CR) 10Dzahn: [C: 03+2] gitlab: open port 80 for traffic from caching servers [puppet] - 10https://gerrit.wikimedia.org/r/667733 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [21:27:59] (03PS2) 10Dzahn: gitlab: open port 80 for traffic from caching servers [puppet] - 10https://gerrit.wikimedia.org/r/667733 (https://phabricator.wikimedia.org/T276144) [21:28:52] (03CR) 10Razzi: "@Bstorm @Marostegui Sqoop has run for March so we're ready to start the reimaging of the node once this is reviewed! 
Please let me know ho" [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [21:29:00] (03PS8) 10Razzi: Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) [21:30:04] (03PS3) 10Cwhite: profile: add gerrit log duplication and ecs mutations [puppet] - 10https://gerrit.wikimedia.org/r/663876 (https://phabricator.wikimedia.org/T234565) [21:31:03] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet has been disabled for longer than 86400 seconds, message: jiji - jiji, last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:37:37] (03CR) 10Cwhite: [C: 03+2] profile: add gerrit log duplication and ecs mutations [puppet] - 10https://gerrit.wikimedia.org/r/663876 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:37:59] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: forward external traffic to gitlab VMs (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) I merged and deployed [[ https://gerrit.wikimedia.org/r/667733 | 667733 ]]. This allows all... [21:40:51] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: forward external traffic to gitlab VMs (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) Now this is either resolved (if we go unencrypted behind caching) or we can do the same thing... 
[21:41:34] (03PS3) 10Dzahn: site: remove mwmaint2001.codfw.mwnet [puppet] - 10https://gerrit.wikimedia.org/r/667293 (https://phabricator.wikimedia.org/T275928) [21:48:05] (03CR) 10Dzahn: [C: 03+2] site: remove mwmaint2001.codfw.mwnet [puppet] - 10https://gerrit.wikimedia.org/r/667293 (https://phabricator.wikimedia.org/T275928) (owner: 10Dzahn) [21:51:00] !log copied docker-registry package from stretch-wikimedia to buster-wikimedia (T272550) [21:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:08] T272550: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 [21:51:15] 10SRE, 10serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10Legoktm) a:03Legoktm [21:51:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mwmaint2001.codfw.wmnet with reason: decom [21:51:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mwmaint2001.codfw.wmnet with reason: decom [21:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:24] (03CR) 10Dzahn: [C: 03+2] DHCP: remove mwmaint2001.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/667281 (https://phabricator.wikimedia.org/T275928) (owner: 10Dzahn) [21:52:30] (03PS2) 10Dzahn: DHCP: remove mwmaint2001.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/667281 (https://phabricator.wikimedia.org/T275928) [21:52:35] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) 05Open→03Resolved [21:53:05] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [21:56:10] (03PS3) 10Dzahn: remove mwmaint2001 from 
maintenance hosts and scap groups [puppet] - 10https://gerrit.wikimedia.org/r/667282 [21:56:44] (03CR) 10jerkins-bot: [V: 04-1] remove mwmaint2001 from maintenance hosts and scap groups [puppet] - 10https://gerrit.wikimedia.org/r/667282 (owner: 10Dzahn) [21:57:59] !log mforns@deploy1002 Started deploy [analytics/refinery@af99602]: Regular analytics weekly train [analytics/refinery@COMMIT_HASH] [21:58:00] !log mforns@deploy1002 deploy aborted: Regular analytics weekly train [analytics/refinery@COMMIT_HASH] (duration: 00m 01s) [21:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:00] !log mforns@deploy1002 Started deploy [analytics/refinery@af99602]: Regular analytics weekly train [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7] [21:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:48] (03PS1) 10Gergő Tisza: [beta] Disable captchas while they are completely broken [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667946 (https://phabricator.wikimedia.org/T276176) [22:04:22] I'll deploy a beta-only config change [22:07:50] (03PS4) 10Dzahn: remove mwmaint2001 from maintenance hosts and scap groups [puppet] - 10https://gerrit.wikimedia.org/r/667282 (https://phabricator.wikimedia.org/T667278) [22:12:09] !log mforns@deploy1002 Finished deploy [analytics/refinery@af99602]: Regular analytics weekly train [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7] (duration: 13m 09s) [22:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:13] !log mforns@deploy1002 Started deploy [analytics/refinery@af99602] (thin): Regular analytics weekly train THIN [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7] [22:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:20] !log mforns@deploy1002 Finished deploy 
[analytics/refinery@af99602] (thin): Regular analytics weekly train THIN [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7] (duration: 00m 07s) [22:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:07] !log mforns@deploy1002 Started deploy [analytics/refinery@af99602] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7] [22:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:46] (03CR) 10Dzahn: [C: 03+2] remove mwmaint2001 from maintenance hosts and scap groups [puppet] - 10https://gerrit.wikimedia.org/r/667282 (https://phabricator.wikimedia.org/T667278) (owner: 10Dzahn) [22:18:18] (03CR) 10Hashar: [C: 03+1] "Looks good. Thank you for the cleanup." [puppet] - 10https://gerrit.wikimedia.org/r/667102 (owner: 10Muehlenhoff) [22:21:54] (03PS1) 10Dzahn: site: add mwmaint2001 with insetup role, decom in progress [puppet] - 10https://gerrit.wikimedia.org/r/667949 (https://phabricator.wikimedia.org/T667278) [22:22:02] (03CR) 10Hashar: [C: 03+1] "Hmm. Yes correct, at least for the CI part. I don't know about WMCS and their k8s cluster though. 
So I guess +1 😊 I will be happy to dr" [puppet] - 10https://gerrit.wikimedia.org/r/667628 (owner: 10Muehlenhoff) [22:22:27] (03CR) 10jerkins-bot: [V: 04-1] site: add mwmaint2001 with insetup role, decom in progress [puppet] - 10https://gerrit.wikimedia.org/r/667949 (https://phabricator.wikimedia.org/T667278) (owner: 10Dzahn) [22:23:37] !log mforns@deploy1002 Finished deploy [analytics/refinery@af99602] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7] (duration: 07m 30s) [22:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:23] (03PS2) 10Dzahn: site: add mwmaint2001 with insetup role, decom in progress [puppet] - 10https://gerrit.wikimedia.org/r/667949 (https://phabricator.wikimedia.org/T667278) [22:30:43] (03CR) 10Dzahn: [C: 03+2] site: add mwmaint2001 with insetup role, decom in progress [puppet] - 10https://gerrit.wikimedia.org/r/667949 (https://phabricator.wikimedia.org/T667278) (owner: 10Dzahn) [22:34:19] !log mforns@deploy1002 Started deploy [analytics/refinery@3bd0858]: Regular analytics weekly train- forgot bump up [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d] [22:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:36] (03CR) 10Gergő Tisza: [C: 03+2] [beta] Disable captchas while they are completely broken [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667946 (https://phabricator.wikimedia.org/T276176) (owner: 10Gergő Tisza) [22:35:15] * Jdlrobson is about to run an alerting test [22:36:05] (03Abandoned) 10Dzahn: mediawiki::maintenance: sync home dir from mwmaint2001 to mwmaint2002 [puppet] - 10https://gerrit.wikimedia.org/r/667241 (https://phabricator.wikimedia.org/T275905) (owner: 10Dzahn) [22:36:42] (03Merged) 10jenkins-bot: [beta] Disable captchas while they are completely broken [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667946 (https://phabricator.wikimedia.org/T276176) (owner: 
10Gergő Tisza) [22:36:45] Jdlrobson: who is going to be alerted? [22:39:56] mutante: it should just be in this irc channel and releng but currently no bueno [22:41:18] (03CR) 10Dzahn: [C: 03+1] "I like this because it should also prevent the issue that directory size on the passive deployment server keeps building up and it would h" [puppet] - 10https://gerrit.wikimedia.org/r/667919 (https://phabricator.wikimedia.org/T275826) (owner: 10Ahmon Dancy) [22:42:08] Jdlrobson: an edit on a wiki is supposed to make an IRC bot talk? [22:43:30] (03PS3) 10Razzi: kafka: Disable alert for absolute max lag value [puppet] - 10https://gerrit.wikimedia.org/r/667724 (https://phabricator.wikimedia.org/T273702) [22:43:38] (03CR) 10Volans: [C: 03+1] "LGTM, but I think there's a permission to fix in the executable. All the rest is totally optional and up to you." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [22:44:33] 10SRE, 10serviceops, 10Patch-For-Review: move mwmaint2002 into production, replace mwmaint2001 - https://phabricator.wikimedia.org/T275905 (10Dzahn) [22:44:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10Dzahn) [22:44:48] (03CR) 10Ottomata: [C: 03+1] kafka: Disable alert for absolute max lag value [puppet] - 10https://gerrit.wikimedia.org/r/667724 (https://phabricator.wikimedia.org/T273702) (owner: 10Razzi) [22:47:50] (03PS1) 10Dzahn: tcpircbot: remove mwmaint2001, replace with mwmaint2002 [puppet] - 10https://gerrit.wikimedia.org/r/667953 (https://phabricator.wikimedia.org/T667278) [22:50:36] (03CR) 10Razzi: [C: 04-1] "@Ottomata and I looked over this together, found some room for improvement! 
It'll be great to standardize our setup with this changeset so" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/667689 (https://phabricator.wikimedia.org/T272313) (owner: 10Ottomata) [22:51:09] (03CR) 10Dzahn: [C: 03+2] tcpircbot: remove mwmaint2001, replace with mwmaint2002 [puppet] - 10https://gerrit.wikimedia.org/r/667953 (https://phabricator.wikimedia.org/T667278) (owner: 10Dzahn) [22:51:20] (03PS1) 10Dzahn: mediawiki::maintenance: reverse rsync direction for home dirs [puppet] - 10https://gerrit.wikimedia.org/r/667955 (https://phabricator.wikimedia.org/T667278) [22:53:00] !log mforns@deploy1002 Finished deploy [analytics/refinery@3bd0858]: Regular analytics weekly train- forgot bump up [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d] (duration: 18m 41s) [22:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:31] !log mforns@deploy1002 Started deploy [analytics/refinery@3bd0858] (thin): Regular analytics weekly train THIN- forgot bnump up [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d] [22:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:37] !log mforns@deploy1002 Finished deploy [analytics/refinery@3bd0858] (thin): Regular analytics weekly train THIN- forgot bnump up [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d] (duration: 00m 06s) [22:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:45] (03CR) 10Dzahn: [C: 03+2] mediawiki::maintenance: reverse rsync direction for home dirs [puppet] - 10https://gerrit.wikimedia.org/r/667955 (https://phabricator.wikimedia.org/T667278) (owner: 10Dzahn) [22:56:28] that's what I get for not compiling this one time and calling it trivial. 
one extra dot breaks it [22:57:29] (03PS1) 10Dzahn: mediawiki::maintenance: remove superfluous dot in maint hostname [puppet] - 10https://gerrit.wikimedia.org/r/667956 [22:58:36] !log mforns@deploy1002 Started deploy [analytics/refinery@3bd0858] (hadoop-test): Regular analytics weekly train TEST- forgot version bump [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d] [22:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:54] (03CR) 10Dzahn: [C: 03+2] mediawiki::maintenance: remove superfluous dot in maint hostname [puppet] - 10https://gerrit.wikimedia.org/r/667956 (owner: 10Dzahn) [22:59:20] (03CR) 1020after4: [C: 03+1] scap-master-sync: Don't exclude CDB files [puppet] - 10https://gerrit.wikimedia.org/r/667919 (https://phabricator.wikimedia.org/T275826) (owner: 10Ahmon Dancy) [23:00:33] (03CR) 10Dzahn: [C: 03+2] scap-master-sync: Don't exclude CDB files [puppet] - 10https://gerrit.wikimedia.org/r/667919 (https://phabricator.wikimedia.org/T275826) (owner: 10Ahmon Dancy) [23:03:33] !log mforns@deploy1002 Finished deploy [analytics/refinery@3bd0858] (hadoop-test): Regular analytics weekly train TEST- forgot version bump [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d] (duration: 04m 56s) [23:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:26] PROBLEM - reading-web-client-errors grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [23:09:46] !log restart weged prometheus-wmf-elasticsearch-exporter-9200 on elastic2042 [23:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:45] (03PS12) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) [23:10:47] (03CR) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [23:11:28] (03CR) 10jerkins-bot: [V: 04-1] install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [23:11:56] !log mwmaint2002 - rsyncing home dirs from mwmaint1002 (T275905) [23:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:03] T275905: move mwmaint2002 into production, replace mwmaint2001 - https://phabricator.wikimedia.org/T275905 [23:13:39] (03PS13) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) [23:14:34] (03CR) 10Cwhite: elk: send icinga events to a separate partition/index (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667917 (owner: 10Herron) [23:17:40] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Apple Business Manager: verify ownership of wikimedia.org - https://phabricator.wikimedia.org/T274592 (10bcampbell) Thanks all. Is the patch live yet? 
[23:19:28] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@4d0f053]: deploy phatality 7.10 [23:19:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:36] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@4d0f053]: deploy phatality 7.10 (duration: 01m 01s) [23:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:22:07] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@4d0f053]: deploy phatality 7.10 [23:22:12] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@4d0f053]: deploy phatality 7.10 (duration: 00m 05s) [23:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:02] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@4d0f053]: trying again: deploy phatality 7.10 [23:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:39] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@4d0f053]: trying again: deploy phatality 7.10 (duration: 00m 37s) [23:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:11] (03PS1) 10Dzahn: smokeping: replace mwmaint2001 with cumin2001 as D5 target [puppet] - 10https://gerrit.wikimedia.org/r/667957 (https://phabricator.wikimedia.org/T275905) [23:33:08] (03PS1) 10Dzahn: site: remove mwmaint2001 
[puppet] - 10https://gerrit.wikimedia.org/r/667958 (https://phabricator.wikimedia.org/T275928) [23:34:02] (03PS1) 1020after4: fix sudoers rule for phatality (`kibana-plugin install`) [puppet] - 10https://gerrit.wikimedia.org/r/667959 (https://phabricator.wikimedia.org/T272655) [23:34:12] (03PS2) 10Dzahn: tcpircbot: remove deploy2001 [puppet] - 10https://gerrit.wikimedia.org/r/667042 (https://phabricator.wikimedia.org/T275832) [23:35:05] (03CR) 10Cwhite: [C: 03+2] fix sudoers rule for phatality (`kibana-plugin install`) [puppet] - 10https://gerrit.wikimedia.org/r/667959 (https://phabricator.wikimedia.org/T272655) (owner: 1020after4) [23:35:59] (03CR) 10Dzahn: [C: 03+2] tcpircbot: remove deploy2001 [puppet] - 10https://gerrit.wikimedia.org/r/667042 (https://phabricator.wikimedia.org/T275832) (owner: 10Dzahn) [23:37:08] (03PS4) 10Dzahn: tcpircbot: remove deploy1001 from allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/635108 (https://phabricator.wikimedia.org/T275831) [23:37:29] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10Sergey.Trofimovsky.SF) [23:38:27] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@4d0f053]: sudoer rules fixed, trying again: deploy phatality [23:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:34] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@4d0f053]: sudoer rules fixed, trying again: deploy phatality (duration: 00m 06s) [23:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:07] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10Sergey.Trofimovsky.SF) @jbond It's an outcome of me trying to separate personal and S&F accounts here, sorry about that. I updated the ticket wi... 
[23:40:41] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10Sergey.Trofimovsky.SF) [23:42:00] !log restart kibana to finalize phatality 7.10 deployment [23:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:26] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [23:45:24] 10SRE, 10Readers-Community-Engagement, 10Epic, 10Services (watching), 10User-mobrovac: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871 (10Aklapper) [23:45:43] (03PS5) 10Dzahn: tcpircbot: allow deploy1002/2002, do not allow deploy1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/635108 (https://phabricator.wikimedia.org/T275831) [23:47:34] (03PS1) 1020after4: fix sudoers rule for phatality (`kibana-plugin install`) [puppet] - 10https://gerrit.wikimedia.org/r/667964 (https://phabricator.wikimedia.org/T272655) [23:48:07] (03PS2) 1020after4: add sudo rule to restart kibana [puppet] - 10https://gerrit.wikimedia.org/r/667964 (https://phabricator.wikimedia.org/T272655) [23:49:52] (03CR) 10Dzahn: [C: 03+2] tcpircbot: allow deploy1002/2002, do not allow deploy1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/635108 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [23:51:00] jouncebot: next [23:51:00] In 0 hour(s) and 8 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210303T0000) [23:51:19] (03CR) 10Cwhite: [C: 03+2] add sudo rule to restart kibana [puppet] - 10https://gerrit.wikimedia.org/r/667964 (https://phabricator.wikimedia.org/T272655) (owner: 1020after4) [23:52:59] !log mwmaint2002 - find /home -nouser -delete [23:53:04] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:11] (03Abandoned) 10Ppchelko: Enable helmfile recreatePods for changeprop installations [deployment-charts] - 10https://gerrit.wikimedia.org/r/666225 (owner: 10Ppchelko) [23:54:30] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [23:54:36] (03PS2) 10Urbanecm: Enable Growth features in idwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667619 (https://phabricator.wikimedia.org/T259024) [23:56:10] (03CR) 10Urbanecm: [C: 03+2] Enable Growth features in idwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667619 (https://phabricator.wikimedia.org/T259024) (owner: 10Urbanecm) [23:56:40] (03PS3) 10Dzahn: common/scap/DHCP: remove deploy1001 from scap hosts and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/635111 (https://phabricator.wikimedia.org/T275831) [23:57:11] (03Merged) 10jenkins-bot: Enable Growth features in idwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667619 (https://phabricator.wikimedia.org/T259024) (owner: 10Urbanecm) [23:57:50] jouncebot: now [23:57:50] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [23:58:00] ehee [23:58:11] jouncebot: next [23:58:11] In 0 hour(s) and 1 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210303T0000) [23:58:13] mutante: sorry, i started a bit early :) [23:58:17] is that an issue? [23:58:31] i can wait for a bit [23:58:37] Urbanecm: nah, not really, i just won't merge this right now, but not important [23:58:51] nah, thank you, just continue [23:59:21] okay, thanks