[00:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210105T0000). [00:00:05] No GERRIT patches in the queue for this window AFAICS. [00:04:48] (03CR) 10CRusnov: "The support module (ib3_auth.py) seems to pass Python3 tox and I eyeballed it for any potential encoding or library issues and it looks OK" [puppet] - 10https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:06:30] (03PS2) 10CRusnov: ircecho: port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) [00:08:03] (03CR) 10CRusnov: "Inline note request for reviewers." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:11:32] (03CR) 10Bstorm: [C: 03+1] "Looks good! A couple nits." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/651166 (https://phabricator.wikimedia.org/T267195) (owner: 10David Caro) [00:27:05] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/654339 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:27:40] 10Operations, 10Performance-Team, 10SRE-swift-storage, 10Patch-For-Review: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (10aaron) At fi... [00:29:24] (03CR) 10CRusnov: labstore/files/logcleanup.py: Port to Python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654339 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:38:14] (03CR) 10Bstorm: "I swear I started reviewing this, but it's too late in the day for me to think it through just to be sure. I like a lot of the refactors." 
[puppet] - 10https://gerrit.wikimedia.org/r/651507 (https://phabricator.wikimedia.org/T267195) (owner: 10David Caro) [01:19:44] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:20:50] ^ that means a maintenance job failed to run. this one: mediawiki_job_purge_abusefilteripdata [01:21:15] is it because of a file being missing by any chance? [01:21:17] Let's try to get the service name into the Icinga output [01:21:39] https://gerrit.wikimedia.org/r/c/operations/puppet/+/653539 [01:22:05] ah! thank you. let's see [01:22:24] I'm guessing it's something related to that in some way [01:22:25] legoktm: ^^ [01:22:43] (code=exited, status=2) [01:23:05] uhoh [01:23:15] (03CR) 10Dzahn: "@mwmaint1002:~# sudo systemctl status mediawiki_job_purge_abusefilteripdata" [puppet] - 10https://gerrit.wikimedia.org/r/653539 (owner: 10Daimona Eaytoy) [01:23:45] legoktm: should I try to run that manually? [01:24:03] that is.. start the systemd timer [01:24:28] I was going to, to see more output [01:24:28] I just logged in to do that [01:24:35] ok, letting you do it [01:25:52] > The MediaWiki script file "/srv/mediawiki/php-1.36.0-wmf.22/extensions/AbuseFilter/maintenance/PurgeOldLogIPData.php" does not exist. 
[01:25:53] and on [01:26:11] I think the -f extensions/AbuseFilter/maintenance/purgeOldLogIPData.php is failing because it's not absolute [01:27:06] and if that fails, it falls back to the second path which doesn't exist yet and then the whole unit fails [01:27:54] /srv/mediawiki/php-1.36.0-wmf.22/extensions/AbuseFilter/maintenance DOES exist and there is purgeOldLogIPData.php [01:28:04] legoktm: just P vs p [01:28:05] i think [01:28:08] yeah [01:28:17] this week's train is going to rename the file from p to P [01:28:29] so this patch was supposed to properly handle both but it didn't work [01:28:33] aha, seemed like a simple typo [01:28:39] ok [01:29:32] could the fix be a symlink from one to the other? [01:30:06] That's what we did for CirrusSearch for a few weeks [01:30:06] I was going to suggest that, but it'll break on case-insensitive filesystems [01:30:54] Renaming it foo to Foo doesn't even play nicely on them [01:31:26] we could also just cherry-pick the script rename now and get it over with [01:31:49] (since we/I already managed to break it) [01:32:08] or make that an absolute path you said earlier? [01:32:32] cherry-picking seems sane as well [01:33:36] I'm not exactly sure if making the path absolute will work properly since at some point both paths will exist and multiversion will pick one or the other, complaining on half during the period we have two branches deployed [01:33:48] I didn't really fully think it through when merging the patch [01:34:06] (03CR) 10Jforrester: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/653539 (owner: 10Daimona Eaytoy) [01:34:31] How bad (or not) if you just revert it for now? [01:34:55] it's still 1.36.0-wmf.22 everywhere, so nothing will change [01:34:57] Is something broken without that change? 
[01:35:15] when .25 starts rolling out, then the issues start :) [01:36:01] let me revert the puppet patch now, and then tomorrow morning before the train goes out I can coordinate renaming both places at the same time [01:36:14] also no idea at all how bad it is if that job does not run.. or after how long it starts to be an issue [01:36:29] legoktm: sounds good [01:36:47] (03PS1) 10Legoktm: Revert "mediawiki: Temporarily add alternative path for AbuseFilter script" [puppet] - 10https://gerrit.wikimedia.org/r/654001 [01:37:01] (03PS2) 10Legoktm: Revert "mediawiki: Temporarily add alternative path for AbuseFilter script" [puppet] - 10https://gerrit.wikimedia.org/r/654001 [01:37:08] (03CR) 10Legoktm: [C: 03+2] Revert "mediawiki: Temporarily add alternative path for AbuseFilter script" [puppet] - 10https://gerrit.wikimedia.org/r/654001 (owner: 10Legoktm) [01:37:42] (03CR) 10Dzahn: [C: 03+1] "script not running, caused a monitoring alert" [puppet] - 10https://gerrit.wikimedia.org/r/654001 (owner: 10Legoktm) [01:38:01] !log [wdqs deploy] Pre-deploy tests are all passing, proceeding with deploy shortly [01:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:26] In the meantime, we could just drop AbuseFilterLogIPMaxAge down [01:38:36] It's set to 3 months, and purges the private data after that [01:38:47] Set it to 2.75 months for a week, then revert that ;) [01:39:47] legoktm: suggesting to run it and then systemctl reset-failed and Icinga should recover [01:39:59] doing a puppet run on mwmaint1002 right now and then I'll do that :) [01:40:07] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@0432f8c]: 0.3.57 [01:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:36] Probably better with 2 or 2.5 months to be safe, and not so "clever" maths [01:41:17] !log [wdqs deploy] Canary `wdqs1003` passing all tests following deploy, proceeding to rest of fleet [01:41:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:22] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:41:41] cool [01:45:10] I started the unit manually for today's purge [01:47:10] :) and the part that we noticed is just because they moved from cron to systemd timers and we had the generic Icinga check for failed units already. in the past we would not have seen it like that [01:47:58] will see if we can get the name of the unit in that status line [01:48:51] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@0432f8c]: 0.3.57 (duration: 08m 44s) [01:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:14] Yeah, no case-sensitive symlinks please; it broke the CirrusSearch repo for months. [01:49:44] legoktm: We can add a wmf.22 patch now to add a symlink there? [01:49:50] (branch-only fix) [01:50:06] I'm more or less doing that [01:50:15] well I'm just renaming the file in wmf.22 and I can deploy both at the same time [01:50:29] !log [wdqs deploy] Restarted `wdqs-updater` across the whole fleet simultaneously: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [01:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:41] Oh, hmm, I suppose that'll work if you also manually add it to extension.json's loader? 
[01:50:51] !log [wdqs deploy] Restarted categories across all wdqs test instances: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [01:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:45] !log [wdqs deploy] Restarting `wdqs-categories` across non-test wdqs nodes one at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [01:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:53:49] yeah well [01:53:57] after this I need to fix AF's extension.json, it's overcomplicated [01:54:04] It's simpler now. [01:54:08] PSR-4 for everything. [01:54:26] Or do you mean in a different manner? [01:54:34] oh good [01:54:40] no that's what I meant [01:54:44] (That's why we're renaming everything.) [01:54:53] Yeah, pull master and marvel. :-) [01:54:54] just it could have been done before namespacing everything [01:55:17] Well, yeah, the last few files were just dumped into a top-level AF namespace and await actual refactoring. [01:55:36] Want a train-blocker task for the rename? 
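Editor's note on the "P vs p" fallback discussed between 01:26 and 01:33: the sketch below is one possible way to choose between two candidate script paths that differ only in case. It is not the logic used by the puppet patch or by multiversion, which the log does not show. The helper names exact_case_exists and pick_script are hypothetical, and the approach assumes that comparing against the directory listing is acceptable, since a plain existence check cannot tell purgeOldLogIPData.php from PurgeOldLogIPData.php on a case-insensitive filesystem.

```python
import os
import tempfile


def exact_case_exists(path):
    """Return True only if the last path component exists with this exact spelling.

    Hypothetical helper: os.path.exists() would also succeed for a name that
    differs only in case on a case-insensitive filesystem, so the name is
    compared against the directory listing instead.
    """
    directory, name = os.path.split(path)
    try:
        return name in os.listdir(directory or ".")
    except OSError:
        # Missing or unreadable parent directory: treat as "does not exist".
        return False


def pick_script(root, candidates):
    """Return the first candidate (relative to root) present with exact case, else None."""
    for relative in candidates:
        absolute = os.path.join(root, relative)
        if exact_case_exists(absolute):
            return absolute
    return None


if __name__ == "__main__":
    # Illustration only: a throwaway directory that holds the pre-rename file name.
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "maintenance"))
    open(os.path.join(root, "maintenance", "purgeOldLogIPData.php"), "w").close()
    print(pick_script(root, [
        os.path.join("maintenance", "PurgeOldLogIPData.php"),  # name after the rename
        os.path.join("maintenance", "purgeOldLogIPData.php"),  # name before the rename
    ]))
```

Because candidates are joined onto an absolute root before checking, the result does not depend on the working directory, which was one of the suspected causes raised at 01:26. The sketch only picks a path within one branch directory; it does not address the case raised at 01:33 where two deployed branches each contain a different spelling.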
[01:59:15] (03PS3) 10Legoktm: mediawiki: Update path for AbuseFilter script (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/654296 [02:00:11] legoktm: T271182 [02:00:11] T271182: Fix AbuseFilter maintenance script rename issue before wmf.25 rolls out too far - https://phabricator.wikimedia.org/T271182 [02:00:15] thanks [02:07:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.25 [core] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/654350 [02:07:36] (03PS2) 10Jforrester: Branch commit for wmf/1.36.0-wmf.25 [core] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/654350 (https://phabricator.wikimedia.org/T267418) (owner: 10TrainBranchBot) [02:08:03] (03CR) 10Jforrester: [C: 03+1] Branch commit for wmf/1.36.0-wmf.25 [core] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/654350 (https://phabricator.wikimedia.org/T267418) (owner: 10TrainBranchBot) [02:10:45] it's really coming after me [02:11:40] (03PS1) 10Dzahn: ATS: re-add config for parsoid-rt-tests.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/654351 (https://phabricator.wikimedia.org/T266509) [02:12:30] (03PS4) 10Legoktm: mediawiki: Update path for AbuseFilter script (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/654296 (https://phabricator.wikimedia.org/T271182) [02:17:04] (03PS3) 10Legoktm: Rename maintenance/purgeOldLogIPData.php script [extensions/AbuseFilter] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/654000 (https://phabricator.wikimedia.org/T271182) [02:17:12] (03CR) 10Legoktm: [C: 03+2] Rename maintenance/purgeOldLogIPData.php script [extensions/AbuseFilter] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/654000 (https://phabricator.wikimedia.org/T271182) (owner: 10Legoktm) [02:18:59] (03CR) 10Jforrester: [C: 03+1] Rename maintenance/purgeOldLogIPData.php script [extensions/AbuseFilter] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/654000 (https://phabricator.wikimedia.org/T271182) (owner: 10Legoktm) [02:20:21] !log [wdqs 
deploy] Deploy completed without issue [02:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:50] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2055 - https://phabricator.wikimedia.org/T271055 (10Papaul) In reference to your Hewlett Packard Enterprise Support Case Number 5352620710, the following Customer Self Repair Part has been shipped: Part/s shipped: 872772-001 Part descript... [02:41:30] (03PS1) 10Andrew Bogott: Nova: another attempt at getting vendordata properly injected [puppet] - 10https://gerrit.wikimedia.org/r/654362 (https://phabricator.wikimedia.org/T271056) [02:41:36] RECOVERY - dump of analytics_meta in eqiad on alert1001 is OK: Last dump for analytics_meta at eqiad (db1108.eqiad.wmnet:3352) taken on 2021-01-05 02:24:24 (1 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [02:41:56] (03CR) 10jerkins-bot: [V: 04-1] Nova: another attempt at getting vendordata properly injected [puppet] - 10https://gerrit.wikimedia.org/r/654362 (https://phabricator.wikimedia.org/T271056) (owner: 10Andrew Bogott) [02:44:05] (03PS2) 10Andrew Bogott: Nova: another attempt at getting vendordata properly injected [puppet] - 10https://gerrit.wikimedia.org/r/654362 (https://phabricator.wikimedia.org/T271056) [02:49:32] (03Merged) 10jenkins-bot: Rename maintenance/purgeOldLogIPData.php script [extensions/AbuseFilter] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/654000 (https://phabricator.wikimedia.org/T271182) (owner: 10Legoktm) [02:55:04] !log legoktm@deploy1001 Synchronized php-1.36.0-wmf.22/extensions/AbuseFilter/: Rename maintenance/purgeOldLogIPData.php script (T271182) (duration: 00m 59s) [02:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:08] T271182: Fix AbuseFilter maintenance script rename issue before wmf.25 rolls out too far - https://phabricator.wikimedia.org/T271182 [02:55:52] (03CR) 10Legoktm: [C: 03+2] mediawiki: Update path for 
AbuseFilter script (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/654296 (https://phabricator.wikimedia.org/T271182) (owner: 10Legoktm) [02:57:56] legoktm: Are you going to manually trigger it? I guess it'll be a no-op (or will break)? [02:58:53] yeah, I kicked it manually just to verify [03:00:13] Cool. [03:01:04] it's at l wikis and no issues so far [03:01:40] Excellent. Thanks for all your help. Sorry it was a burden. [03:01:49] no worries :) [03:02:20] * legoktm -> phone [03:10:03] (03PS3) 10Andrew Bogott: Nova: another attempt at getting vendordata properly injected [puppet] - 10https://gerrit.wikimedia.org/r/654362 (https://phabricator.wikimedia.org/T271056) [03:11:11] (03CR) 10Andrew Bogott: [C: 03+2] Nova: another attempt at getting vendordata properly injected [puppet] - 10https://gerrit.wikimedia.org/r/654362 (https://phabricator.wikimedia.org/T271056) (owner: 10Andrew Bogott) [05:12:51] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:44:17] 10Operations, 10Gerrit, 10Phabricator, 10Release-Engineering-Team, and 2 others: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (10Tgr) Isn't this a problem with the extension? Content scripts are subject to the document's CSP but the main extension code isn't,... [06:14:09] 10Operations, 10ops-codfw, 10DBA: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084 (10Marostegui) 05Open→03Resolved Data was checked and came back up clean. Closing this - thanks for getting on this so fast Papaul! 
[06:33:38] (03PS1) 10Marostegui: Revert "db2140: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/654002 [06:34:16] (03CR) 10Marostegui: [C: 03+2] Revert "db2140: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/654002 (owner: 10Marostegui) [06:37:20] (03PS1) 10Marostegui: mariadb: Productionize db1155 as sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/654370 (https://phabricator.wikimedia.org/T268742) [06:38:16] (03PS2) 10Marostegui: mariadb: Productionize db1155 as sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/654370 (https://phabricator.wikimedia.org/T268742) [06:39:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1155 as sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/654370 (https://phabricator.wikimedia.org/T268742) (owner: 10Marostegui) [06:40:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 to clone db1155:3312 T268742 ', diff saved to https://phabricator.wikimedia.org/P13647 and previous config saved to /var/cache/conftool/dbconfig/20210105-064026-marostegui.json [06:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:32] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 [06:41:11] (03PS1) 10Marostegui: db1074: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/654371 (https://phabricator.wikimedia.org/T268742) [06:41:23] !log Stop MySQL on db1074 - this will generate lag on s2 on labs [06:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:48] (03CR) 10Marostegui: [C: 03+2] db1074: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/654371 (https://phabricator.wikimedia.org/T268742) (owner: 10Marostegui) [06:46:01] PROBLEM - MariaDB Replica IO: s2 on db1125 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1074.eqiad.wmnet:3306 - retry-time: 60 
maximum-retries: 86400 message: Cant connect to MySQL server on db1074.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:46:46] ^ me [06:51:55] (03PS1) 10Marostegui: mariadb: Add db1155 to redact sanitarium and check_private_data [puppet] - 10https://gerrit.wikimedia.org/r/654372 (https://phabricator.wikimedia.org/T268742) [06:53:42] (03CR) 10Marostegui: [C: 03+2] mariadb: Add db1155 to redact sanitarium and check_private_data [puppet] - 10https://gerrit.wikimedia.org/r/654372 (https://phabricator.wikimedia.org/T268742) (owner: 10Marostegui) [07:14:41] !log execute 'apt-get clean' on an-airflow1001 to recover disk space (root partition almost saturated) [07:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:34] I also removed unused kernels from an-airflow1001, freed up another 1G [07:23:59] (03CR) 10Muehlenhoff: [C: 04-1] "Staff leaving the foundation can sign up for a volunteer NDA and retain some/partial access and Chelsy did this back then. It might be tha" [puppet] - 10https://gerrit.wikimedia.org/r/654307 (https://phabricator.wikimedia.org/T271161) (owner: 10Ryan Kemper) [07:31:47] RECOVERY - Disk space on an-airflow1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-airflow1001&var-datasource=eqiad+prometheus/ops [07:32:49] 10Operations, 10SRE-Access-Requests: convert Maya Kampurath to full-time employee - https://phabricator.wikimedia.org/T271169 (10Joe) p:05Triage→03Medium a:03MoritzMuehlenhoff AIUI this was already done by @MoritzMuehlenhoff yesterday in https://gerrit.wikimedia.org/r/c/operations/puppet/+/654205. I'll... 
[07:35:32] 10Operations, 10SRE-Access-Requests: convert Maya Kampurath to full-time employee - https://phabricator.wikimedia.org/T271169 (10MoritzMuehlenhoff) 05Open→03Resolved Ack, everything that needed to be done, was already done yesterday. @dzahn: Please check the latest state of puppet.git before sending out ma... [07:52:03] (03CR) 10Elukey: "We are getting closer, I added some suggestions and new ideas, let me know what you think!" (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [07:52:18] 10Operations, 10SRE-tools, 10Traffic, 10IPv6, 10User-crusnov: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10Joe) p:05Triage→03Low I will let the traffic folks answer as well, but first of all I think you should clarify a bit better the wording of... [08:01:53] (03CR) 10Muehlenhoff: apt: Create a script to detect manually installed packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654257 (owner: 10Jbond) [08:04:31] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Sadly it's not possible to rename a service this way without causing disruptions." 
[puppet] - 10https://gerrit.wikimedia.org/r/654294 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [08:04:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add dependency on wmflib [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/654200 (owner: 10Giuseppe Lavagetto) [08:06:59] (03Merged) 10jenkins-bot: Add dependency on wmflib [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/654200 (owner: 10Giuseppe Lavagetto) [08:07:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Thanks, that bug's been around since day one, and somehow neither me nor anyone else took the time to fix it :)" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/645138 (owner: 10Ahmon Dancy) [08:09:17] (03Merged) 10jenkins-bot: Fix Step 0 reporting for update operation [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/645138 (owner: 10Ahmon Dancy) [08:10:11] (03CR) 10Giuseppe Lavagetto: [C: 03+1] httpd: drop the ServerAdmin line completely [puppet] - 10https://gerrit.wikimedia.org/r/651649 (https://phabricator.wikimedia.org/T251005) (owner: 10Dzahn) [08:11:34] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27349/console" [puppet] - 10https://gerrit.wikimedia.org/r/654192 (https://phabricator.wikimedia.org/T191018) (owner: 10Elukey) [08:13:47] RECOVERY - exim queue on mx1001 is OK: OK: Less than 2000 mails in exim queue. 
https://wikitech.wikimedia.org/wiki/Exim [08:15:47] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:10] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27350/console" [puppet] - 10https://gerrit.wikimedia.org/r/654192 (https://phabricator.wikimedia.org/T191018) (owner: 10Elukey) [08:18:10] (03PS4) 10David Caro: wmcs.backup: ignore all dumps backups except dumps-0 [puppet] - 10https://gerrit.wikimedia.org/r/654196 (https://phabricator.wikimedia.org/T267195) [08:18:47] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.472e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [08:18:58] (03PS5) 10David Caro: wmcs.backup: ignore all dumps backups except dumps-0 [puppet] - 10https://gerrit.wikimedia.org/r/654196 (https://phabricator.wikimedia.org/T267195) [08:19:00] (03CR) 10JMeybohm: [C: 03+2] admin_ng Update/Fix PodSecurityPolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/649629 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [08:20:38] (03Merged) 10jenkins-bot: admin_ng Update/Fix PodSecurityPolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/649629 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [08:26:00] (03PS2) 10JMeybohm: docker_registry_ha: Add "Vary: Accept" to response [puppet] - 10https://gerrit.wikimedia.org/r/650153 (https://phabricator.wikimedia.org/T256762) [08:26:39] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.02925 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [08:27:14] !log Restarted CI Jenkins on contint2001 [08:27:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:26] !log Restart db2127 T271106 [08:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:30] T271106: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106 [08:32:09] !log reboot sretest1001 to test some new PXE rescue settings [08:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:05] PROBLEM - Host sretest1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:38:08] this is me of course --^ [08:39:43] RECOVERY - Host sretest1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [08:43:12] (03PS7) 10David Caro: wmcs.backup: Add a images summary command [puppet] - 10https://gerrit.wikimedia.org/r/651166 (https://phabricator.wikimedia.org/T267195) [08:43:14] (03PS5) 10David Caro: wmcs.backup: Add a method to create a vm backup [puppet] - 10https://gerrit.wikimedia.org/r/651507 (https://phabricator.wikimedia.org/T267195) [08:43:16] (03PS6) 10David Caro: wmcs.backup: Remove all dangling snapshots [puppet] - 10https://gerrit.wikimedia.org/r/651537 (https://phabricator.wikimedia.org/T267195) [08:43:18] (03PS5) 10David Caro: wmcs.backup: Add a way to remove old backups and snapshots [puppet] - 10https://gerrit.wikimedia.org/r/651550 (https://phabricator.wikimedia.org/T267195) [08:43:20] (03PS5) 10David Caro: wmcs.backup: Add command to backup all assigned vms [puppet] - 10https://gerrit.wikimedia.org/r/651761 (https://phabricator.wikimedia.org/T267195) [08:43:22] (03PS5) 10David Caro: wmcs.backup: add a command to remove non-handled backups [puppet] - 10https://gerrit.wikimedia.org/r/651776 (https://phabricator.wikimedia.org/T267195) [08:43:24] (03PS3) 10David Caro: wmcs.backup: Add a command to create the next backup [puppet] - 10https://gerrit.wikimedia.org/r/654220 (https://phabricator.wikimedia.org/T267195) [08:43:26] (03PS3) 10David Caro: wmcs.backup: Add host to the rbd snapshot name [puppet] - 
10https://gerrit.wikimedia.org/r/654221 (https://phabricator.wikimedia.org/T267195) [08:43:28] (03PS3) 10David Caro: wmcs.backup: Add backup_image command [puppet] - 10https://gerrit.wikimedia.org/r/654266 (https://phabricator.wikimedia.org/T270478) [08:43:30] (03PS3) 10David Caro: wmcs.backup: blacked all files [puppet] - 10https://gerrit.wikimedia.org/r/654267 [08:48:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2140 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P13652 and previous config saved to /var/cache/conftool/dbconfig/20210105-084807-marostegui.json [08:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:29] (03CR) 10jerkins-bot: [V: 04-1] wmcs.backup: Add host to the rbd snapshot name [puppet] - 10https://gerrit.wikimedia.org/r/654221 (https://phabricator.wikimedia.org/T267195) (owner: 10David Caro) [08:48:32] (03CR) 10jerkins-bot: [V: 04-1] wmcs.backup: Add backup_image command [puppet] - 10https://gerrit.wikimedia.org/r/654266 (https://phabricator.wikimedia.org/T270478) (owner: 10David Caro) [08:56:03] !log installing flac security updates [08:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:17] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think Janis had some plans on the long run, but I support the idea for now. The -1 is for an implementation detail, see below." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/650215 (https://phabricator.wikimedia.org/T179696) (owner: 10Ahmon Dancy) [09:11:51] (03CR) 10David Caro: wmcs.backup: Add a images summary command (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/651166 (https://phabricator.wikimedia.org/T267195) (owner: 10David Caro) [09:14:51] (03PS4) 10JMeybohm: k8s_infrastructure_users: Amend to support groups, avoid uid conflicts [puppet] - 10https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) (owner: 10Alexandros Kosiaris) [09:21:29] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) 6.0.1-1wm1 has now been working fine for a day on the beta cluster, upgrading cp3054. [09:21:35] !log cp3054: upgrade varnish to 6.0.1-1wm1 T264398 [09:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:43] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 [09:21:50] (03CR) 10Elukey: [V: 03+1] "Tested this today applying the buster-installer patch on apt1001 (that is the place where the /srv/tftpboot configs are picked up) and for" [puppet] - 10https://gerrit.wikimedia.org/r/654192 (https://phabricator.wikimedia.org/T191018) (owner: 10Elukey) [09:23:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.backup: ignore all dumps backups except dumps-0 [puppet] - 10https://gerrit.wikimedia.org/r/654196 (https://phabricator.wikimedia.org/T267195) (owner: 10David Caro) [09:28:10] (03CR) 10JMeybohm: [C: 03+1] "> Patch Set 3: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/650215 (https://phabricator.wikimedia.org/T179696) (owner: 10Ahmon Dancy) [09:28:56] (03CR) 10Joal: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/654187 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey) [09:29:19] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I would go with the other patch, that removes serveradmin completely, it's really better to simplify stuff." [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [09:29:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: add prometheus script to collect ceph usage network metrics by nova [puppet] - 10https://gerrit.wikimedia.org/r/654211 (https://phabricator.wikimedia.org/T271096) (owner: 10Arturo Borrero Gonzalez) [09:30:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Can I also ask you to check the apache configurations for the httpd production docker images, and submit a similar patch there too?" [puppet] - 10https://gerrit.wikimedia.org/r/651649 (https://phabricator.wikimedia.org/T251005) (owner: 10Dzahn) [09:31:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "This will probably cause some dependent images to fail to build, but we can fix that once we've rebuilt this." 
[puppet] - 10https://gerrit.wikimedia.org/r/610050 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [09:31:59] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27352/console" [puppet] - 10https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) (owner: 10Alexandros Kosiaris) [09:34:11] (03Abandoned) 10Giuseppe Lavagetto: cache-text: add throttling for calls to ORES from the OKAPI [puppet] - 10https://gerrit.wikimedia.org/r/631385 (https://phabricator.wikimedia.org/T263910) (owner: 10Giuseppe Lavagetto) [09:39:24] (03PS1) 10Muehlenhoff: Add library hint for flac [puppet] - 10https://gerrit.wikimedia.org/r/654408 [09:39:33] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [09:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:54] (03CR) 10David Caro: [C: 03+2] wmcs.backup: ignore all dumps backups except dumps-0 [puppet] - 10https://gerrit.wikimedia.org/r/654196 (https://phabricator.wikimedia.org/T267195) (owner: 10David Caro) [09:45:14] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 4 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10awight) >>! In T263910#6620877, @Ladsgroup wrote: > Guess when changes got merged: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=1... 
[09:46:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:47] RECOVERY - MariaDB Replica IO: s2 on db1125 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:26] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for flac [puppet] - 10https://gerrit.wikimedia.org/r/654408 (owner: 10Muehlenhoff) [10:02:29] !log stopping stray cpjobqueue processes on scb hosts [10:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:43] (03CR) 10Jbond: [C: 03+1] Enable base::service_auto_restart for Apache/Nginx on debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/654261 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:03:09] (03CR) 10JMeybohm: [C: 04-1] "envoy is buster based already. Would need a change to Dockerfile.template (back to {{ seed_image }})" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683 (owner: 10Giuseppe Lavagetto) [10:03:35] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:06:05] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:43] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:25] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:27] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:35] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:50] ^ me, fixing [10:10:13] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:25] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:43] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:43] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:13] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:28] (03PS1) 10Marostegui: Revert "db1074: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/654003 [10:11:31] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:04] (03CR) 10Marostegui: [C: 03+2] Revert "db1074: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/654003 (owner: 10Marostegui) [10:12:29] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:36] (03CR) 10Jbond: dnsdist: allow custom headers in the HTTP response and enable HSTS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654275 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [10:12:37] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:39] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:49] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:25] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:37] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:55] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:57] RECOVERY - Check systemd state on 
scb2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/654277 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [10:14:27] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:48] 10Operations, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10ayounsi) [10:15:19] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [10:15:29] (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/652575 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [10:17:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 25%: After cloning db1155:3312', diff saved to https://phabricator.wikimedia.org/P13653 and previous config saved to /var/cache/conftool/dbconfig/20210105-101735-root.json [10:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:31] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [10:23:40] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for Apache/Nginx on debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/654261 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:25:07] (03PS1) 10David Caro: wmcs.backup.instances: fix indentation [puppet] - 10https://gerrit.wikimedia.org/r/654409 [10:26:26] !log swift codfw-prod: more weight to ms-be20[58-61] - T269337 [10:26:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.backup.instances: fix indentation [puppet] - 10https://gerrit.wikimedia.org/r/654409 (owner: 10David Caro) [10:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:30] T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337 [10:32:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 50%: After cloning db1155:3312', diff saved to https://phabricator.wikimedia.org/P13654 and previous config saved to /var/cache/conftool/dbconfig/20210105-103239-root.json [10:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:28] (03CR) 10David Caro: [C: 03+2] wmcs.backup.instances: fix indentation [puppet] - 10https://gerrit.wikimedia.org/r/654409 (owner: 10David Caro) [10:37:18] (03PS3) 10Filippo Giunchedi: scap: enable logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/574485 (https://phabricator.wikimedia.org/T227080) [10:45:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/654277 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [10:47:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 75%: After cloning db1155:3312', diff saved to https://phabricator.wikimedia.org/P13655 and previous config saved to /var/cache/conftool/dbconfig/20210105-104742-root.json [10:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:48:43] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Mailman password reminder mail (and other texts) has broken encoding in Czech - https://phabricator.wikimedia.org/T271123 (10Joe) >>! In T271123#6719830, @Ladsgroup wrote: > oh boy. My suggestion is that for sake of uniformity and ease of maintenance, we shoul... [10:49:45] !log jmm@cumin2001 START - Cookbook sre.dns.netbox [10:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:30] (03CR) 10Muehlenhoff: "Filed https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=979317 for the change in d-i (but the custom one off dumps are perfectly fine in t" [puppet] - 10https://gerrit.wikimedia.org/r/654257 (owner: 10Jbond) [10:54:49] (03PS2) 10Giuseppe Lavagetto: Switch the base image to buster from stretch. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683 [10:56:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 100%: After cloning db1155:3312', diff saved to https://phabricator.wikimedia.org/P13656 and previous config saved to /var/cache/conftool/dbconfig/20210105-110246-root.json [11:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:17] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you for taking this on!" 
[puppet] - 10https://gerrit.wikimedia.org/r/654192 (https://phabricator.wikimedia.org/T191018) (owner: 10Elukey) [11:11:22] 10Puppet: puppet new facts for php_version and python_version - https://phabricator.wikimedia.org/T271196 (10Aklapper) [11:11:56] (03CR) 10Jbond: [C: 04-1] "minor issue with srings.letters vs string.ascii_letters" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:15:11] (03CR) 10Jbond: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/654216 (https://phabricator.wikimedia.org/T271099) (owner: 10Jbond) [11:23:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/654339 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:31:24] (03PS6) 10Jbond: apt: Create a script to detect manually installed packages [puppet] - 10https://gerrit.wikimedia.org/r/654257 [11:32:55] (03PS7) 10Jbond: apt: Create a script to detect manually installed packages [puppet] - 10https://gerrit.wikimedia.org/r/654257 [11:34:10] (03CR) 10Jbond: "> Patch Set 5:" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654257 (owner: 10Jbond) [11:39:23] (03CR) 10Elukey: [C: 03+2] admin: remove members of 'reseachers' already in other posix groups [puppet] - 10https://gerrit.wikimedia.org/r/654277 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [11:47:28] (03CR) 10Kormat: [C: 03+1] "The approach looks good :)" [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [11:47:39] (03CR) 10JMeybohm: [C: 03+1] Switch the base image to buster from stretch. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683 (owner: 10Giuseppe Lavagetto) [11:50:05] (03PS1) 10Joal: Bump AQS druid backend datasource to 2020-12 [puppet] - 10https://gerrit.wikimedia.org/r/654413 [11:50:16] elukey: --^& [11:51:15] (03CR) 10Marostegui: "Question: I assume there would be no limitation on one proxy having both "replica_type" right?" [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [11:51:17] joal: should we leave this to Andrew/Razzi later on? [11:51:28] sure elukey :) [11:51:42] elukey: I added razzi as reviewer [11:51:48] super I was about to say that [11:51:50] thanks :) [11:52:36] (03CR) 10Elukey: "Razzi/Andrew - I updated the docs, please use the cookbook :D" [puppet] - 10https://gerrit.wikimedia.org/r/654413 (owner: 10Joal) [11:52:57] (03CR) 10Kormat: [C: 03+1] "> Patch Set 34:" [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [11:53:50] !log installing lxml security updates for buster [11:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:51] (03CR) 10Marostegui: "> Patch Set 34:" [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [11:56:49] (03CR) 10Hashar: [C: 03+2] Branch commit for wmf/1.36.0-wmf.25 [core] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/654350 (https://phabricator.wikimedia.org/T267418) (owner: 10TrainBranchBot) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210105T1200). Please do the needful. [12:00:04] No GERRIT patches in the queue for this window AFAICS. 
[12:01:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/654192 (https://phabricator.wikimedia.org/T191018) (owner: 10Elukey) [12:01:40] !log Restart db2121 T271106 [12:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:45] T271106: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106 [12:12:16] !log installing p11-kit security updates on buster [12:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:18] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on malmok.wikimedia.org with reason: rebooting for kernel update [12:13:18] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on malmok.wikimedia.org with reason: rebooting for kernel update [12:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:52] (03PS1) 10Muehlenhoff: Add library hint for p11-kit [puppet] - 10https://gerrit.wikimedia.org/r/654414 [12:19:50] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for p11-kit [puppet] - 10https://gerrit.wikimedia.org/r/654414 (owner: 10Muehlenhoff) [12:21:15] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27353/console" [puppet] - 10https://gerrit.wikimedia.org/r/654275 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:21:24] (03PS2) 10Ssingh: dnsdist: allow custom headers in the HTTP response and enable HSTS [puppet] - 10https://gerrit.wikimedia.org/r/654275 (https://phabricator.wikimedia.org/T252132) [12:22:46] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27354/console" [puppet] - 10https://gerrit.wikimedia.org/r/654275 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) 
[12:23:59] (03CR) 10Ssingh: [V: 03+1] "Thanks for the review! Yeah, it makes sense to be consistent. I guess I took the reading of the RFC example literally :)" [puppet] - 10https://gerrit.wikimedia.org/r/654275 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:29:11] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.25 [core] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/654350 (https://phabricator.wikimedia.org/T267418) (owner: 10TrainBranchBot) [12:29:41] !log jmm@cumin2001 START - Cookbook sre.dns.netbox [12:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:56] (03PS1) 10Arturo Borrero Gonzalez: cloud: mail: smarthost: drop support for letsencrypt::cert::integrated [puppet] - 10https://gerrit.wikimedia.org/r/654415 (https://phabricator.wikimedia.org/T260834) [12:42:29] 10Operations, 10Traffic: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 (10ema) The plot thickens. I now have more questions than I do have answers, but here's the story so far. ats-be on cp3052 is essentially never calling `mmap`(once in 10 seconds). ` 11:50:59 ema@c... 
[12:43:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:31] (03PS2) 10Arturo Borrero Gonzalez: cloud: mail: smarthost: drop support for letsencrypt::cert::integrated [puppet] - 10https://gerrit.wikimedia.org/r/654415 (https://phabricator.wikimedia.org/T260834) [12:47:31] (03CR) 10Elukey: [V: 03+1 C: 03+2] install_server: add a "rescue" label [puppet] - 10https://gerrit.wikimedia.org/r/654192 (https://phabricator.wikimedia.org/T191018) (owner: 10Elukey) [12:48:59] !log add PXE d-i rescue bootable image config for jessie/stretch/buster to tftp [12:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:48] 10Operations, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10jbond) [12:56:49] 10Operations, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10jbond) [12:58:34] 10Operations, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10jbond) [13:01:54] !log installing lxml security updates for stretch [13:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:03:34] 10Operations, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10jbond) as ganeti[3001-3003].esams.wmnet and ganeti[4001-4003].ulsfo.wmnet allready have AAAA records configured I'm assuming it should be safe to add them... 
[13:04:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:06:08] 10Operations, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10MoritzMuehlenhoff) >>! In T271136#6722307, @jbond wrote: > as ganeti[3001-3003].esams.wmnet and ganeti[4001-4003].ulsfo.wmnet allready have AAAA records co... [13:06:12] 10Operations, 10Patch-For-Review: Provide an option menu when booting via PXE - https://phabricator.wikimedia.org/T191018 (10elukey) Next steps: * add a simpler menu (like https://wiki.syslinux.org/wiki/index.php?title=Menu) * figure out the various options that we need to add * add documentation about how to... [13:07:45] (03CR) 10Jbond: "lgtm comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654275 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:17:08] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:22:16] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) @Cmjohnson happy 2021 :) When you have a moment could you please unrack the one host added to B4 and the two to C2? Then I think we could... [13:29:33] !log installing xen security updates on buster [13:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:03] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi >>! In T156955#6720937, @Dzahn wrote: > Since this ticket has been created we now have a fairly small subset of standard rec... 
[13:35:43] (03CR) 10Ssingh: [V: 03+1] "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654275 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:35:57] (03PS3) 10Ssingh: dnsdist: allow custom headers in the HTTP response and enable HSTS [puppet] - 10https://gerrit.wikimedia.org/r/654275 (https://phabricator.wikimedia.org/T252132) [13:37:20] (03CR) 10Ssingh: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/27355/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/654275 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:40:38] !log installing python-apt security updates on buster/stretch [13:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:50] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:56] 10Operations, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10jbond) [13:46:54] 10Operations, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10jbond) > One notable difference is that eqiad/codfw are still on Stretch. It'll probably work fine, but it's a bit of an unknown and let's better make the s... 
[13:50:00] (03PS3) 10Arturo Borrero Gonzalez: cloud: mail: smarthost: drop support for letsencrypt::cert::integrated [puppet] - 10https://gerrit.wikimedia.org/r/654415 (https://phabricator.wikimedia.org/T260834) [13:51:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: mail: smarthost: drop support for letsencrypt::cert::integrated [puppet] - 10https://gerrit.wikimedia.org/r/654415 (https://phabricator.wikimedia.org/T260834) (owner: 10Arturo Borrero Gonzalez) [13:52:42] 10Operations, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10jbond) > note: I created a [[ https://tickets.puppetlabs.com/browse/FACT-2843 | bug against facter4 ]] which is related FYI this has been resolved however i need to create an additi... [13:55:20] (03PS1) 10Jbond: (WIP) ccreat ocsp helper script [puppet] - 10https://gerrit.wikimedia.org/r/654418 [13:55:57] (03CR) 10jerkins-bot: [V: 04-1] (WIP) ccreat ocsp helper script [puppet] - 10https://gerrit.wikimedia.org/r/654418 (owner: 10Jbond) [13:58:06] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:59] (03PS1) 10David Caro: cloud.encapi: enable ssl nginx vhost [puppet] - 10https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) [13:59:10] (03CR) 10Filippo Giunchedi: "I checked two labs hosts that IIRC use the lvm module by way of profile::swift::storage::labs but the compiler itself looks like it failed" [puppet] - 10https://gerrit.wikimedia.org/r/654216 (https://phabricator.wikimedia.org/T271099) (owner: 10Jbond) [14:00:47] (03CR) 10David Caro: cloud.encapi: enable ssl nginx vhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: 10David Caro) [14:10:35] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on Thanos frontends [puppet] - 
10https://gerrit.wikimedia.org/r/654422 (https://phabricator.wikimedia.org/T135991) [14:11:14] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 4480 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [14:11:43] (03PS1) 10Jbond: icinga: check_ripe_atlas fix python3 porting [puppet] - 10https://gerrit.wikimedia.org/r/654423 [14:12:48] herron, arturo: the above exim alert seems to be referred to wmflabs.org emails [14:14:18] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/654423 (owner: 10Jbond) [14:14:41] (03CR) 10Jbond: [C: 03+2] icinga: check_ripe_atlas fix python3 porting [puppet] - 10https://gerrit.wikimedia.org/r/654423 (owner: 10Jbond) [14:15:37] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10nshahquinn-wmf) >1. What approval do I need to transfer this data to her? >1. How can I transfer this data to he... 
[14:17:09] 10Operations, 10Inuka-Team, 10SRE-Access-Requests, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10nshahquinn-wmf) [14:18:20] volans: ack thanks [14:25:22] 10Operations, 10Inuka-Team, 10SRE-Access-Requests, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10nshahquinn-wmf) p:05Triage→03Low [14:29:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, one nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654257 (owner: 10Jbond) [14:30:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/654339 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [14:33:31] (03PS1) 10Ottomata: Remove overrides from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654425 (https://phabricator.wikimedia.org/T268517) [14:34:15] (03CR) 10jerkins-bot: [V: 04-1] Remove overrides from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654425 (https://phabricator.wikimedia.org/T268517) (owner: 10Ottomata) [14:34:38] (03PS1) 10Elukey: sre.hadoop.change-distro-from-cdh: use confirm_on_failure() [cookbooks] - 10https://gerrit.wikimedia.org/r/654426 [14:35:23] (03PS2) 10Ottomata: Remove overrides from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654425 (https://phabricator.wikimedia.org/T268517) [14:40:20] (03PS3) 10Ottomata: Remove overrides from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654425 (https://phabricator.wikimedia.org/T268517) [14:46:13] (03PS4) 10Ottomata: Remove overrides from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654425 (https://phabricator.wikimedia.org/T268517) [14:48:40] (03CR) 10Elukey: [C: 03+1] tests: fix deprecated pytest argument 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/651803 (owner: 10Volans) [14:52:41] (03CR) 10Ottomata: [C: 03+2] Remove overrides from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654425 (https://phabricator.wikimedia.org/T268517) (owner: 10Ottomata) [14:55:16] (03CR) 10Elukey: [C: 03+1] dnsdisc: improve test coverage [software/spicerack] - 10https://gerrit.wikimedia.org/r/651804 (owner: 10Volans) [14:56:11] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on Netmon [puppet] - 10https://gerrit.wikimedia.org/r/654429 (https://phabricator.wikimedia.org/T135991) [14:58:14] (03CR) 10Gehel: [C: 03+2] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/651802 (owner: 10Volans) [14:58:31] herron volans yes I'm struggling with expiring TLS certs [14:59:16] (03PS4) 10Ottomata: Add new service eventstreams-internal [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) [14:59:45] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove overrides from wgEventLoggingSchemas (duration: 00m 57s) [14:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:55] (03CR) 10Elukey: [C: 03+1] Use newly migrated code from wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/651805 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [15:00:58] 10Operations, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) [15:01:52] (03CR) 10Volans: [C: 03+2] tests: fix deprecated pytest argument [software/spicerack] - 10https://gerrit.wikimedia.org/r/651803 (owner: 10Volans) [15:02:06] (03CR) 10Volans: [C: 03+2] dnsdisc: improve test coverage [software/spicerack] - 10https://gerrit.wikimedia.org/r/651804 (owner: 10Volans) [15:03:36] (03CR) 
10Volans: [C: 03+2] Use newly migrated code from wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/651805 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [15:04:28] (03CR) 10Bstorm: "> Patch Set 34:" [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [15:06:01] (03PS1) 10Volans: CHANGELOG: fix typo [software/pywmflib] - 10https://gerrit.wikimedia.org/r/654430 [15:07:03] (03CR) 10Volans: [C: 03+2] "trivial typo" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/654430 (owner: 10Volans) [15:07:30] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [15:07:51] (03Merged) 10jenkins-bot: tests: fix deprecated pytest argument [software/spicerack] - 10https://gerrit.wikimedia.org/r/651803 (owner: 10Volans) [15:08:50] (03Merged) 10jenkins-bot: dnsdisc: improve test coverage [software/spicerack] - 10https://gerrit.wikimedia.org/r/651804 (owner: 10Volans) [15:09:45] (03Merged) 10jenkins-bot: CHANGELOG: fix typo [software/pywmflib] - 10https://gerrit.wikimedia.org/r/654430 (owner: 10Volans) [15:11:34] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 4 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) I'm just happy I am not resetting it every few hours. One of our goals was to put it in a stable mode and we've done it! [15:11:57] 10Operations, 10Inuka-Team, 10SRE-Access-Requests, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10Joe) hi @nshahquinn-wmf! I think the best people to answer this question are the #analytics folk... 
[15:13:08] 10Operations, 10Inuka-Team, 10SRE-Access-Requests, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10Ottomata) @nshahquinn-wmf probably the simplest thing to do would be to get her analytics-privat... [15:15:03] (03PS1) 10Effie Mouzeli: hiera: upgrade mc1025, mc2025 to buster [puppet] - 10https://gerrit.wikimedia.org/r/654432 (https://phabricator.wikimedia.org/T213089) [15:17:04] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: upgrade mc1025, mc2025 to buster [puppet] - 10https://gerrit.wikimedia.org/r/654432 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [15:17:52] (03PS1) 10Jbond: customscripts/interface_automation: fix loop control [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654434 (https://phabricator.wikimedia.org/T265904) [15:17:54] (03PS1) 10Jbond: interface_automation: update is_primary logic. [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) [15:18:38] (03CR) 10jerkins-bot: [V: 04-1] interface_automation: update is_primary logic. [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [15:19:15] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1025.eqiad.wmnet ` The log can be... [15:19:21] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2025.codfw.wmnet ` The log can be... 
[15:19:30] (03PS2) 10Jbond: interface_automation: update is_primary logic. [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) [15:19:50] (03Abandoned) 10Jbond: customscripts/interface_automation: fix loop control [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654434 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [15:20:17] (03CR) 10jerkins-bot: [V: 04-1] interface_automation: update is_primary logic. [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [15:22:01] (03PS1) 10Herron: kibana7: add kibana7 conftool entries [puppet] - 10https://gerrit.wikimedia.org/r/654436 (https://phabricator.wikimedia.org/T234854) [15:22:03] (03PS1) 10Herron: kibana7: repoint (rename) kibana-next services to kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/654437 (https://phabricator.wikimedia.org/T234854) [15:22:05] (03PS1) 10Herron: kibana7: remove kibana-next conftool entries [puppet] - 10https://gerrit.wikimedia.org/r/654438 (https://phabricator.wikimedia.org/T234854) [15:23:37] (03PS3) 10Jbond: interface_automation: update is_primary logic. [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) [15:29:34] (03CR) 10Herron: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/654294 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:29:44] (03PS1) 10Jbond: customscripts/interface_automation: skipp slaac addresses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) [15:31:34] (03CR) 10Jbond: interface_automation: update is_primary logic. 
(031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [15:33:05] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1025.eqiad.wmnet with reason: REIMAGE [15:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:07] (03CR) 10Jbond: [C: 03+1] "> I had initially thought of something like this:" [puppet] - 10https://gerrit.wikimedia.org/r/654275 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:35:12] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1025.eqiad.wmnet with reason: REIMAGE [15:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:25] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2025.codfw.wmnet with reason: REIMAGE [15:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:10] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable base::service_auto_restart for Apache on Netmon [puppet] - 10https://gerrit.wikimedia.org/r/654429 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:36:17] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable base::service_auto_restart for Apache on Thanos frontends [puppet] - 10https://gerrit.wikimedia.org/r/654422 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:37:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2025.codfw.wmnet with reason: REIMAGE [15:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:19] !log upgraded wmflib to 0.0.6 on all hosts where it's installed - T257905 [15:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:23] T257905: Spin off common Spicerack modules into a standalone Python library importable anywhere - https://phabricator.wikimedia.org/T257905 
[15:41:45] (03PS1) 10Muehlenhoff: Add some more Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/654440 [15:46:29] (03CR) 10Elukey: [C: 03+1] interactive: migrate from spicerack to wmflib [cookbooks] - 10https://gerrit.wikimedia.org/r/651765 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [15:50:14] (03PS4) 10Ahmon Dancy: Redirect top level URL to https://dockerregistry.toolforge.org/ [puppet] - 10https://gerrit.wikimedia.org/r/650215 (https://phabricator.wikimedia.org/T179696) [15:50:27] !log merging puppetlabs-lvm update [15:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:32] godog: fyi ^^ [15:50:46] jbond42: *ack* thanks for the heads up [15:51:12] (03CR) 10Ahmon Dancy: "Thanks for the review joe and JMeybohm!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/650215 (https://phabricator.wikimedia.org/T179696) (owner: 10Ahmon Dancy) [15:52:41] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10Jgreen) 05Open→03Resolved a:03Jgreen [15:52:49] (03CR) 10Volans: [C: 03+2] tests: fix deprecated pytest argument [cookbooks] - 10https://gerrit.wikimedia.org/r/651767 (owner: 10Volans) [15:53:13] (03PS3) 10Jbond: pupetlabs-lvm: update lvm module with latest upstream [puppet] - 10https://gerrit.wikimedia.org/r/654216 (https://phabricator.wikimedia.org/T271099) [15:53:49] (03CR) 10Jbond: [C: 03+2] pupetlabs-lvm: update lvm module with latest upstream [puppet] - 10https://gerrit.wikimedia.org/r/654216 (https://phabricator.wikimedia.org/T271099) (owner: 10Jbond) [15:54:49] (03PS1) 10Jbond: Revert "pupetlabs-lvm: update lvm module with latest upstream" [puppet] - 10https://gerrit.wikimedia.org/r/654446 [15:55:58] (03Merged) 10jenkins-bot: tests: fix deprecated pytest argument [cookbooks] - 10https://gerrit.wikimedia.org/r/651767 (owner: 10Volans) 
[16:00:14] (03CR) 10Volans: [C: 03+2] interactive: migrate from spicerack to wmflib [cookbooks] - 10https://gerrit.wikimedia.org/r/651765 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [16:02:10] (03Merged) 10jenkins-bot: interactive: migrate from spicerack to wmflib [cookbooks] - 10https://gerrit.wikimedia.org/r/651765 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [16:07:53] (03PS5) 10Andrew Bogott: profile::mail::smarthost: switch to acme-chief certs [puppet] - 10https://gerrit.wikimedia.org/r/654295 (https://phabricator.wikimedia.org/T260834) [16:08:10] (03PS2) 10Elukey: sre.hadoop.change-distro-from-cdh: use confirm_on_failure() [cookbooks] - 10https://gerrit.wikimedia.org/r/654426 [16:08:41] 10Puppet, 10Patch-For-Review: puppetlabs-lvm: upgrade the lvm module to match the puppe;tlabs upstream module - https://phabricator.wikimedia.org/T271099 (10jbond) 05Open→03Resolved a:03jbond [16:10:02] (03CR) 10Razzi: [C: 03+1] Bump AQS druid backend datasource to 2020-12 [puppet] - 10https://gerrit.wikimedia.org/r/654413 (owner: 10Joal) [16:10:08] (03CR) 10Andrew Bogott: [C: 03+2] profile::mail::smarthost: switch to acme-chief certs [puppet] - 10https://gerrit.wikimedia.org/r/654295 (https://phabricator.wikimedia.org/T260834) (owner: 10Andrew Bogott) [16:10:20] (03PS8) 10Jbond: apt: Create a script to detect manually installed packages [puppet] - 10https://gerrit.wikimedia.org/r/654257 [16:10:22] (03CR) 10Volans: [C: 03+1] "Looks sane to me, worth testing it. I'm not sure all commands are ok to retry, but the human operator will decide." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/654426 (owner: 10Elukey) [16:10:45] (03CR) 10Jbond: apt: Create a script to detect manually installed packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654257 (owner: 10Jbond) [16:14:54] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1025.eqiad.wmnet'] ` and were **ALL** successful. [16:15:26] (03CR) 10Jayprakash12345: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654005 (https://phabricator.wikimedia.org/T270864) (owner: 10Jayprakash12345) [16:17:47] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2025.codfw.wmnet'] ` and were **ALL** successful. [16:18:28] (03PS5) 10Jbond: sretest: test install_apt_audit_installed script [puppet] - 10https://gerrit.wikimedia.org/r/654279 [16:18:49] (03PS3) 10Jbond: profile::base: add parameter to install apt audit script [puppet] - 10https://gerrit.wikimedia.org/r/654278 [16:18:59] (03PS6) 10Jbond: sretest: test install_apt_audit_installed script [puppet] - 10https://gerrit.wikimedia.org/r/654279 [16:19:37] (03CR) 10Jbond: [C: 03+2] apt: Create a script to detect manually installed packages [puppet] - 10https://gerrit.wikimedia.org/r/654257 (owner: 10Jbond) [16:19:45] 10Operations, 10conftool, 10serviceops, 10Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (10Joe) I think we have a better way to avoid this. Basically we want to stop running scripts once we get into the readonly phase. So we could modify the wra... 
[16:21:17] (03CR) 10Jbond: [C: 03+2] profile::base: add parameter to install apt audit script [puppet] - 10https://gerrit.wikimedia.org/r/654278 (owner: 10Jbond) [16:21:24] (03CR) 10Jbond: [C: 03+2] sretest: test install_apt_audit_installed script [puppet] - 10https://gerrit.wikimedia.org/r/654279 (owner: 10Jbond) [16:25:19] (03PS1) 10Jbond: apt: audit_installed move to sbin [puppet] - 10https://gerrit.wikimedia.org/r/654444 [16:25:20] 10Operations, 10Inuka-Team, 10SRE-Access-Requests, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10sbassett) >>! In T271202#6722550, @Ottomata wrote: > probably the simplest thing to do would be... [16:29:14] (03CR) 10Jbond: [C: 03+2] apt: audit_installed move to sbin [puppet] - 10https://gerrit.wikimedia.org/r/654444 (owner: 10Jbond) [16:30:26] (03CR) 10Bstorm: "> Patch Set 34:" [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:31:51] (03CR) 10Ayounsi: [C: 03+1] Enable base::service_auto_restart for Apache on Netmon [puppet] - 10https://gerrit.wikimedia.org/r/654429 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:35:08] (03PS1) 10Jbond: apt_install_audit: add the apt-audit-installed script by default [puppet] - 10https://gerrit.wikimedia.org/r/654445 [16:36:40] (03CR) 10Ssingh: [C: 03+2] dnsdist: allow custom headers in the HTTP response and enable HSTS [puppet] - 10https://gerrit.wikimedia.org/r/654275 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:37:36] 10Operations, 10Inuka-Team, 10SRE-Access-Requests, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10JFishback_WMF) +1 from me [16:37:58] (03CR) 10Bstorm: "> Patch Set 34:" [puppet] - 10https://gerrit.wikimedia.org/r/627379 
(https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:43:29] (03PS3) 10CRusnov: ircecho: port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) [16:43:31] (03CR) 10CRusnov: ircecho: port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:44:22] (03CR) 10CRusnov: "> Patch Set 2: Code-Review-1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:44:48] (03CR) 10CRusnov: ircecho: port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:49:25] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [16:50:02] (03PS1) 10MSantos: mobileapps: bump to 2021-01-04-165358-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/654467 [16:52:37] (03CR) 10Dzahn: "also https://stackoverflow.com/questions/21798272/benefits-of-serveradmin-directive-in-apache2" [puppet] - 10https://gerrit.wikimedia.org/r/651649 (https://phabricator.wikimedia.org/T251005) (owner: 10Dzahn) [16:57:43] 10Operations, 10Inuka-Team, 10SRE-Access-Requests, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10nshahquinn-wmf) >>! In T271202#6722550, @Ottomata wrote: > @nshahquinn-wmf probably the simplest... [17:00:04] jbond42 and cdanis: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210105T1700). 
[17:00:47] !log capture packets on pfw3-eqiad:reth0.1134 - T263833 [17:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:11] (03CR) 10Elukey: [C: 03+2] sre.hadoop.change-distro-from-cdh: use confirm_on_failure() [cookbooks] - 10https://gerrit.wikimedia.org/r/654426 (owner: 10Elukey) [17:07:12] 10Operations, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10jbond) I also created [[ https://tickets.puppetlabs.com/browse/FACT-2907 | FACT-2907 ]] to request adding binding flags [17:10:49] !log 1.36.0-wmf.25 was branched at 083fd09afcd204cfef177e11d7a5e4fd1217acfc for T267418 [17:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:53] T267418: 1.36.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T267418 [17:18:54] 10Operations, 10Inuka-Team, 10SRE-Access-Requests, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10sbassett) >>! In T271202#6722852, @nshahquinn-wmf wrote: > As the [production access guide](http... 
[17:19:43] (03PS1) 10Elukey: Add logstash101[1-3] ipv4 records to the kafka term in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/654469 [17:20:51] (03PS2) 10Elukey: Add logstash101[1-3] ipv4 records to the kafka term in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/654469 [17:20:54] (03CR) 10Bstorm: [C: 03+1] wmcs.backup: Add a images summary command (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/651166 (https://phabricator.wikimedia.org/T267195) (owner: 10David Caro) [17:21:45] (03CR) 10Ayounsi: Add logstash101[1-3] ipv4 records to the kafka term in analytics-in4 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/654469 (owner: 10Elukey) [17:22:41] XioNoX: I wasn't fast enough :D [17:22:49] elukey: ;) [17:27:37] (03CR) 10Ayounsi: [C: 03+1] Add logstash101[1-3] ipv4 records to the kafka term in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/654469 (owner: 10Elukey) [17:38:19] 10Operations, 10conftool, 10serviceops, 10Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (10RLazarus) >>! In T266717#6722705, @Joe wrote: > I think we have a better way to avoid this. Basically we want to stop running scripts once we get into the... [17:57:22] 10Operations, 10netops: Upgrade Fastnetmon to 1.1.9 - https://phabricator.wikimedia.org/T271228 (10ayounsi) p:05Triage→03Low [18:00:04] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210105T1800). 
[18:03:03] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-01-04-165358-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/654467 (owner: 10MSantos) [18:04:29] (03Merged) 10jenkins-bot: mobileapps: bump to 2021-01-04-165358-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/654467 (owner: 10MSantos) [18:07:10] 10Operations, 10Traffic: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 (10Ladsgroup) My useless contribution to the plot >>! In T265625#6722294, @ema wrote: > Leaving the cp3052 mystery aside for the time being, I've noticed that there's a difference between text and... [18:08:59] !log mbsantos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [18:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:43] (03CR) 10Elukey: [C: 03+2] Add logstash101[1-3] ipv4 records to the kafka term in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/654469 (owner: 10Elukey) [18:13:43] !log run homer on cr1/cr2-eqiad to update the analytics-in4 filter (https://gerrit.wikimedia.org/r/c/operations/homer/public/+/654469) [18:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:20] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654472 [18:18:23] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654472 (owner: 10Jeena Huneidi) [18:18:26] !log mbsantos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[18:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:54] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654472 (owner: 10Jeena Huneidi) [18:21:28] !log mbsantos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:50] !log jhuneidi@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.25 refs T267418 [18:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:54] T267418: 1.36.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T267418 [18:25:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/654445 (owner: 10Jbond) [18:39:17] 10Operations, 10Inuka-Team, 10SRE-Access-Requests, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10JFishback_WMF) @nshahquinn-wmf your question raises another - where are you looking to send the... 
[18:41:19] (03PS4) 10Cwhite: profile: add priority to logstash filter filenames [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) [18:42:30] (03PS5) 10Cwhite: profile: add priority to logstash filter filenames [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) [18:48:22] (03PS13) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) [18:50:44] (03PS8) 10Cwhite: profile: update netdev to output ECS-formatted logs [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) [18:51:55] 10Operations, 10Traffic, 10netops, 10User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10CDanis) >>! In T270391#6719023, @ayounsi wrote: > A downside, for example with Google is that it will most likely include crawlers IPs I'm als... 
[18:53:39] (03PS6) 10Cwhite: profile: add priority to logstash filter filenames [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) [18:54:27] (03PS4) 10Alex Paskulin: Disable CentralNotice on API portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [18:54:44] (03CR) 10Alex Paskulin: [C: 03+1] Disable CentralNotice on API portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [18:59:04] !log jhuneidi@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.25 refs T267418 (duration: 39m 07s) [18:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:08] T267418: 1.36.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T267418 [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210105T1900) [19:01:40] (03PS14) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) [19:02:07] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10aaron) >>! In T264604#6681125, @jijiki wrote: > @Krinkle @aaron do you think we are ready to move this forward?... 
[19:16:09] !log mwdebug1003 - editing apache2 defaults conf and dropping ServerAdmin address.restarting [19:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:32] !log deploying refinery for weekly train [19:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:12] !log razzi@deploy1001 Started deploy [analytics/refinery@56fb3ff]: Regular analytics weekly train [analytics/refinery@6ce68c950fc339dc3748cf50e6925cd1031287c4] [19:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:16] (03CR) 10Dzahn: [C: 03+2] "thanks! confirmed on mwbdebug1003 it's optional ...and merging" [puppet] - 10https://gerrit.wikimedia.org/r/651649 (https://phabricator.wikimedia.org/T251005) (owner: 10Dzahn) [19:19:25] (03PS2) 10Dzahn: httpd: drop the ServerAdmin line completely [puppet] - 10https://gerrit.wikimedia.org/r/651649 (https://phabricator.wikimedia.org/T251005) [19:22:15] (03CR) 10Dzahn: [C: 04-2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/651649 has been merged instead" [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [19:23:14] (03Abandoned) 10Dzahn: httpd: make it possible to configure server admin email address [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [19:28:49] !log razzi@deploy1001 Finished deploy [analytics/refinery@56fb3ff]: Regular analytics weekly train [analytics/refinery@6ce68c950fc339dc3748cf50e6925cd1031287c4] (duration: 09m 37s) [19:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:54] (03PS1) 10Dzahn: drop the ServerAdmin line [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654482 (https://phabricator.wikimedia.org/T251005) [19:29:21] !log razzi@deploy1001 Started deploy [analytics/refinery@56fb3ff] (thin): Regular analytics weekly train THIN 
[analytics/refinery@6ce68c950fc339dc3748cf50e6925cd1031287c4] [19:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:30] !log razzi@deploy1001 Finished deploy [analytics/refinery@56fb3ff] (thin): Regular analytics weekly train THIN [analytics/refinery@6ce68c950fc339dc3748cf50e6925cd1031287c4] (duration: 00m 08s) [19:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:40] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/651649 (https://phabricator.wikimedia.org/T251005) (owner: 10Dzahn) [19:32:12] (03PS4) 10Dzahn: swap: hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/650635 (https://phabricator.wikimedia.org/T209953) [19:33:46] (03PS1) 10Dwisehaupt: Shift fundraising read dns handle to primary for upgrade [dns] - 10https://gerrit.wikimedia.org/r/654483 (https://phabricator.wikimedia.org/T254198) [19:34:39] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27357/" [puppet] - 10https://gerrit.wikimedia.org/r/650635 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:36:44] (03CR) 10Dzahn: "noop on stat1004" [puppet] - 10https://gerrit.wikimedia.org/r/650635 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:36:51] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 3684 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:39:59] (03CR) 10Jgreen: [C: 03+2] Shift fundraising read dns handle to primary for upgrade [dns] - 10https://gerrit.wikimedia.org/r/654483 (https://phabricator.wikimedia.org/T254198) (owner: 10Dwisehaupt) [19:40:28] mw1362 seems to cause the most errors - something is special with that [19:41:51] !log mw1362 - restarted apache2 [19:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:43] 
and the error rate is going down after that [19:43:24] it seems it was all localized to that one server (26,480 vs 390 errors for rank 2) [19:46:42] there were 'proxy_fcgi:errors' with "Partial results are valid..." and restarting apache made them go away. it's an API server. the most common error was from /srv/mediawiki/php-1.36.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Connection/Strategy/StrategyFactory.php: Can't create strategy instance by given argument [19:47:28] !log depooled mw1362 [19:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:39] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:57:41] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 827 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:00:04] longma and hashar: May I have your attention please! Mediawiki train - American+European Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210105T2000) [20:01:16] nice large spike of whatever :\ [20:01:44] oh I thought that was over [20:01:52] oh a new one [20:03:20] (03CR) 10Razzi: [C: 03+2] Bump refine jar version to refinery-job 0.0.143 [puppet] - 10https://gerrit.wikimedia.org/r/654308 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [20:04:08] !log mw1344 - restarted apache2 - it was showing the same "partial results" error as mw1362 - no other appservers are showing up in logstash, but these were #1 and #2 source of errors [20:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:02] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Mailman password reminder mail (and other texts) has broken encoding in Czech - https://phabricator.wikimedia.org/T271123 (10Mormegil) > Those files are provided by the debian package, and are not considered convfiguration files. This would mean that every tim... [20:05:05] mw1362 is now gone from the "Top Hosts" list. previously it was like 26k errors vs 1 [20:05:20] mw1344 is now the new #1 but going down. others seem fine [20:05:25] it does look like two separate spikes, I wonder if this is traffic-driven [20:07:15] I'll wait to deploy the train until the errors drop a bit more [20:07:41] the most common error changed to a cirrussearch related one [20:08:09] huh, so two unrelated causes maybe? [20:08:23] oh no, I see, you mean most common after the other one cleared [20:08:42] yes, the _new_ most common one is: CirrusSearch/includes/Searcher.php: Call to undefined method CirrusSearch\Searcher::getQueryCacheStatsKey() [20:08:42] okay yeah that's good news :) [20:08:49] but the one we had before that..is gone [20:09:36] also there was analytics deploy but unlikely to be related? [20:10:29] is the cirrussearch error also new? 
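[editor's aside] The "Top Hosts" comparison used above (one or two app servers producing almost all of the errors) can be approximated offline. This is only a minimal sketch, assuming the error records have already been exported as dicts with a `host` field. `rank_hosts`, the record shape, the `ratio` threshold, the sample counts, and every host name other than mw1362 and mw1344 are hypothetical; this is not the query the Logstash dashboard actually runs.

```python
from collections import Counter


def rank_hosts(records, ratio=10.0):
    """Rank hosts by error count and flag likely outliers.

    Returns (ranking, outliers): `ranking` is a list of (host, count)
    pairs sorted by descending count; `outliers` lists the hosts whose
    count is at least `ratio` times the median per-host count.
    """
    counts = Counter(r["host"] for r in records)
    ranking = counts.most_common()
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]
    outliers = [host for host, n in ranking if n >= ratio * median]
    return ranking, outliers


# Hypothetical sample: two noisy hosts and three quiet ones.
sample = (
    [{"host": "mw1362"}] * 50
    + [{"host": "mw1344"}] * 20
    + [{"host": "mw-a"}, {"host": "mw-b"}, {"host": "mw-c"}]
)
ranking, outliers = rank_hosts(sample)
print(ranking[:2])
print(outliers)
```

A median-based threshold is used here because a handful of very noisy hosts would drag a mean upward and hide themselves; whether a factor of 10 is a sensible cut-off is a judgment call for the operator.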
[20:10:39] if so it and the elastica might have some underlying searchy problem as their root cause [20:11:36] 10Operations, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Cory Massaro - https://phabricator.wikimedia.org/T271245 (10Jdforrester-WMF) [20:12:47] 10Operations, 10Abstract Wikipedia, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Cory Massaro - https://phabricator.wikimedia.org/T271245 (10Jdforrester-WMF) [20:13:25] let's try to put mw1362 back into the pool and see if it shows up in the list again or not [20:13:29] the log graph is at some plateau of error since 19:55 [20:13:35] na hold [20:13:43] logstash should give us more info [20:14:03] that's what I am quoting when talking about the error message that is gone now [20:14:35] there are still 1k errors per minute :] [20:15:17] [X-TItgpAIDgAAEXFEUoAAAAU] /w/api.php Error from line 599 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/CirrusSearch/includes/Searcher.php: Call to undefined method CirrusSearch\Searcher::getQueryCacheStatsKey() [20:15:34] yes, that's the new top one [20:15:34] on mw1344 [20:15:37] as above [20:16:12] there were 2 affected servers, mw1362 and mw1344, the first one was depooled, the second one was not [20:16:21] I am betting $0.02 it is an opcache corruption [20:16:23] ;D [20:16:33] the stats were like 26k, 8k, 1, 1, 1, 1 [20:16:47] and before there was a global apache reload [20:17:44] that's why I wanted to see if mw1362 is ok [20:18:06] the one for mw1362 is slightly different: [{exception_id}] {exception_url} Elastica\Exception\InvalidException from line 44 of /srv/mediawiki/php-1.36.0-wmf.22/vendor/ruflin/elastica/lib/Elastica/Connection/Strategy/StrategyFactory.php: Can't create strategy instance by given argument [20:18:59] yes, that was the previous error. 
never happened on servers besides mw1362 [20:19:16] the undefined method one that is ongoing on mw1344 doesn't make any sense code wise [20:19:25] so i am betting on the php opcache being corrupted somehow [20:19:30] maybe as a result of the apache reload, [20:19:31] ? [20:19:37] exactly [20:19:50] that is a fun case [20:19:55] that's what made me just restart apache2 and that made the error rate go down [20:20:37] mutante: does restarting apache2 also restart php-fpm? [20:20:50] !log mw1344 - /usr/local/sbin/restart-php7.2-fpm [20:20:52] rzl: ^ [20:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:59] rzl: no [20:21:13] but on 1362 it worked nevertheless [20:21:17] then if restarting apache2 affected it, it's not opcache corruption [20:21:23] on 1344 it did not [20:21:30] is that known that a reload / restart of apache can lead to opcache corruption in the untouched php-fpm process? [20:21:35] 2 servers, 2 error messages, both bad luck [20:23:05] what I meant is that for more than a year we had opcache corruption triggered after a deployment of mediawiki code [20:23:15] but in this case, it triggered without any deployment just by reloading apache [20:23:19] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 10 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:23:23] which is new to me [20:23:25] grafana looks promising [20:23:44] hashar: yeah, that's why I'm skeptical this is opcache corruption [20:24:15] Call to undefined method CirrusSearch\Searcher::getQueryCacheStatsKey() [20:24:34] 1362 - restart apache fixed it, 1344 - restart php-fpm fixed it. 
[20:24:38] that one is entirely wrong code wise it definitely exists [20:25:22] a seemingly-impossible error message is one of the conditions that is necessary but not sufficient to demonstrate cache corruption [20:25:37] in this case, as far as we know, there was nothing that could have corrupted the cache [20:25:45] that doesn't make it impossible, but it's extremely unlikely without more information [20:26:07] it's a super attractive thing to blame for a lot of weird circumstances but this is probably not it [20:27:05] in particular, if we think the errors on 1362 and 1344 were caused by the same thing -- which seems likely, given they fired at the same time -- then that thing was not cache corruption [20:27:23] that's ruled out because restarting apache on 1362 couldn't have resolved a cache corruption issue, right? [20:27:30] logstash looking all clean again [20:27:31] fwiw [20:27:44] yeah...should I go ahead and deploy then? [20:27:48] errors look gone to me [20:28:02] is 1362 depooled still? [20:28:10] I would like to put that back in the pool then. [20:28:11] ok? [20:28:15] mutante: sgtm [20:28:32] longma: let's wait for another couple minutes of quiet after mutante repools, if you don't mind, and then go ahead with the train [20:28:37] sure [20:28:49] !log repooled mw1362 [20:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:03] i used the "pool" command that's why it does not log automatically. done [20:29:32] rzl: yeah I don't quite get it, though both got Apache reloaded [20:29:47] that does not explain why the 2nd one started misbehaving way later [20:30:08] My feeling is that it is just a matter of scale. If you reload 500 apaches you'll get one or two. 
[20:30:32] They would be reloaded at different times because puppet is randomized
[20:30:44] I am trying to imagine mechanisms via which php-fpm could 'know' that apache2 has been restarted, and I am struggling to come up with one
[20:30:48] yeah, that's just the puppet splay
[20:30:53] it's just opening a UNIX domain socket
[20:30:54] and they were still both showing up close to each other.. as #1 and #2
[20:31:00] other servers never were in the list
[20:31:39] looks like another spike
[20:32:57] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 1836 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:33:23] my problem now is that in about 20 min I have to be at my dentist appointment
[20:33:58] well, we're outside the 30-minute window, so that definitely isn't caused by the same puppet change rolling out
[20:34:53] it's mw1362 again
[20:35:06] then it's a bad server and back because I repooled it
[20:36:53] I'll remove it again then.
[20:36:58] mutante: no, hang on
[20:37:03] Error from line 110 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/lib/includes/Formatters/OutputFormatValueFormatterFactory.php: Class 'Wikibase\Lib\Formatters\EormatterLabelDescriptionLookupFactory' not found
[20:37:04] ok, hanging on.
[20:37:17] that one *does* smell extremely cache-corrupty, and I have no idea why
[20:37:18] I need to either cancel my appointment or leave ... hmmm
[20:37:25] mutante: go ahead, we got this
[20:37:39] rzl: thank you, ok. I'm setting back then.
[20:37:39] I'm going to try restarting php-fpm on that machine, to see if it clears this up
[20:37:46] sounds good
[20:37:51] it was not done yet
[20:38:02] only apache on that one
[20:38:06] if it does, I guess we'll have learned something new and baffling about how apache and php-fpm interact
[20:38:12] and if it doesn't, we'll have ruled out the opcache conclusively
[20:38:57] !log rzl@mw1362:~$ sudo -i /usr/local/sbin/restart-php7.2-fpm
[20:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:41] longma: please continue holding off, if that's okay :) if this takes much longer we'll just leave 1362 depooled and then you'll be able to go ahead
[20:39:49] will do
[20:39:55] much obliged
[20:40:38] Operations, observability, CAS-SSO, Performance-Team (Radar): Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (Krinkle)
[20:40:55] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 42 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:41:38] well! that's extremely spooky
[20:42:15] lol
[20:42:30] how to bring down wikipedia: systemctl reload apache2
[20:42:39] so that reproduces? :-\
[20:42:50] "reproduces" is a strong word :P
[20:43:10] mutante: I say: do head to your appointment! :)
[20:43:21] it does seem like the cache may have been corrupted on 1362 and 1344, but we don't know if the apache2 reload is why yet
[20:43:33] or the opcache was already corrupted
[20:43:41] Operations, observability, CAS-SSO, Performance-Team (Radar): Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (Krinkle) I'm having the same issue. I had to clear everything and then follow these very specific steps: 1. Don't o...
[20:43:44] and apache reload just triggered the code to run from the now corrupted cache
[20:43:49] for example, yeah
[20:44:04] but hmm no doesn't make much sense either or we would have servers running outdated code potentially
[20:44:06] then
[20:44:11] !log deploy aqs (analytics query service) as part of analytics train
[20:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:30] I don't get how apache could lead to that :\
[20:44:52] nothing conclusively points to the apache change yet, afaik
[20:45:11] The top Logstash errors are:
[20:45:16] - /Elastica/Connection/Strategy/StrategyFactory.php: Can't create strategy instance by given argument
[20:45:18] we made a recent puppet change to the apache config, but that doesn't mean it triggered the problem here, it's just one of the things we were looking at
[20:45:21] - Call to undefined method CirrusSearch\Searcher::getQueryCacheStatsKey()
[20:45:23] Krinkle: we know, we've been readin ghtem :)
[20:45:24] but thank you
[20:45:27] That seems classic opcache
[20:45:28] *reading them, excuse me
[20:45:39] specific to mw1362 and mw1344
[20:45:42] hashar: thanks, yes, i'm out for now
[20:46:00] mutante: thanks for the investigation!
[20:46:24] Krinkle: please see above :)
[20:46:36] longma: all yours, proceed when ready
[20:46:55] Krinkle: the "new" thing is that this time that got triggered by simply reloading Apache
[20:47:14] thanks rzl & mutant.e
[20:47:21] proceeding with the train deployment now
[20:47:25] guess that can be captured on the related opcache phabricator task or whatever Gdoc we might have tracking it
[20:47:30] again, we do not know whether reloading apache triggered the issue
[20:47:33] :)
[20:48:03] two things that happened right before the problem started, and are unlikely to have caused it, are that m.utante submitted an apache config change, and I turned my kettle on to make some tea
[20:48:05] (PS1) Jeena Huneidi: group0 wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - https://gerrit.wikimedia.org/r/654493
[20:48:07] (CR) Jeena Huneidi: [C: +2] group0 wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - https://gerrit.wikimedia.org/r/654493 (owner: Jeena Huneidi)
[20:48:13] rzl: does apache restart also effectively clear fpm? or would that have continued as-is?
[20:48:14] !log razzi@deploy1001 Started deploy [analytics/aqs/deploy@5d05f83]: Configure http request timeout and caching for T268809
[20:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:18] T268809: AQS pageview default caching is one day - https://phabricator.wikimedia.org/T268809
[20:48:22] neither of those has any known causal chain to the opcache
[20:48:28] Krinkle: it does not, please read scrollback
[20:48:53] (Merged) jenkins-bot: group0 wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - https://gerrit.wikimedia.org/r/654493 (owner: Jeena Huneidi)
[20:49:02] it just seems super suspicious
[20:50:43] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.25 refs T267418
[20:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:48] T267418: 1.36.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T267418
[20:53:03] !log razzi@deploy1001 Finished deploy [analytics/aqs/deploy@5d05f83]: Configure http request timeout and caching for T268809 (duration: 04m 48s)
[20:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:52] the logs are pretty quiet so I'll declare train done for today
[20:58:09] (CR) Razzi: [C: +2] Bump AQS druid backend datasource to 2020-12 [puppet] - https://gerrit.wikimedia.org/r/654413 (owner: Joal)
[20:58:13] the causal chain seems the same as usual, which is that upon code change, there is (naturally) concurrent traffic finding it absent in opcache and the mythical race condition corrupts it on fill. so it would be deploy related, with possibly a delay from when the new code is first used by traffic.
[20:58:31] rzl: grafana graphs for mw1362 suggest there was an opcache reset at 19:32
[20:58:46] which presumably happened unattended
[20:59:43] Krinkle: that code change has always been php code, though, right? I'm still trying to see how changing apache config could trigger it
[21:00:04] rzl: train deploy happened 29min earlier
[21:00:12] also seen in the same graph
[21:01:12] longma: congratulations :]
[21:01:35] Krinkle: ahh okay, *that* I'll buy
[21:02:06] !log razzi@cumin1001 START - Cookbook sre.aqs.roll-restart
[21:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:32] thanks hashar :)
[21:04:05] and from SAL it looks like that earlier deploy was testwikis, so I can believe a long delay before any particular codepath is hit
[21:07:03] I thought the errors came from wmf.22
[21:07:40] (CR) CDanis: [C: +1] "Thank you! This patch is correct, and a puppet-merge followed by a run-puppet-agent on cumin hosts is all that is needed to deploy." [puppet] - https://gerrit.wikimedia.org/r/654045 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[21:08:11] ah you're right
[21:08:14] CirrusSearch\Searcher::getQueryCacheStatsKey() had 1k errors per minute, definitely coming from wmf.22 / live prod traffic
[21:09:09] and \EormatterLabelDescriptionLookupFactory is rather typical, it should start with a F so the opcache magically made F to E (an off-by-one error)
[21:09:14] https://grafana-rw.wikimedia.org/d/GuHySj3mz/php7-transition?viewPanel=5&orgId=1&refresh=30s&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-server=mw1362
[21:09:44] https://grafana-rw.wikimedia.org/d/GuHySj3mz/php7-transition?viewPanel=5&orgId=1&refresh=30s&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-server=mw1344
[21:10:48] hashar: can you update https://phabricator.wikimedia.org/T245183 for the record?
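Aside, not part of the channel log: the 21:09:09 message reads the bad class name as a single-character substitution (F became E). A minimal Python sketch of that comparison is below. The expected name is an assumption, reconstructed by restoring the leading "F" as suggested in the channel, and `char_diffs` is a hypothetical helper written for this note, not anything from MediaWiki or PHP.

```python
def char_diffs(expected: str, observed: str):
    """List (index, expected_char, observed_char, code_point_delta) per mismatch."""
    if len(expected) != len(observed):
        raise ValueError("names differ in length; not a plain character substitution")
    return [
        (i, e, o, ord(o) - ord(e))
        for i, (e, o) in enumerate(zip(expected, observed))
        if e != o
    ]


# The observed name is copied from the logged error message.
observed = "EormatterLabelDescriptionLookupFactory"
# Assumption: the intended name is the same string with a leading "F".
expected = "F" + observed[1:]

diffs = char_diffs(expected, observed)
print(diffs)  # [(0, 'F', 'E', -1)]
```

A single mismatch with a delta of -1 matches the "off by one" reading; it says nothing about what caused the substitution.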
[21:12:40] !log razzi@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0)
[21:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:39] Krinkle: ideally yeah, but I have no idea what happened really ;D
[21:14:14] hashar: I mean just mention the one-off character change, and timestamp / server names of when it happened.
[21:14:15] will add some of the traces / log we have already
[21:14:22] well
[21:14:38] it happens on a weekly basis, do we need to capture every single reference of the issue?
[21:14:47] we're not going to investigate it because we already know everything we want to know, focus is on removing use of opcache revalidation, which is already being worked on.
[21:14:56] ahhhh
[21:15:15] if it happens weekly, then no, but afaik people are saying that the current cronjob is preventing all issues
[21:15:26] ok ok
[21:15:27] which we now know is not true, so that's worth recognising there
[21:16:09] copy pasting stuff
[21:16:34] (CR) Bstorm: [C: +1] labstore/files/logcleanup.py: Port to Python3 [puppet] - https://gerrit.wikimedia.org/r/654339 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[21:16:51] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (Jclark-ctr)
[21:21:16] (CR) Bstorm: [C: +2] "I'm going to move ahead with this one based on the one with puppet disabled on the tools and paws haproxies to be sure." [puppet] - https://gerrit.wikimedia.org/r/651301 (owner: Bstorm)
[21:40:10] longma: fyi ubn from the train rollout https://phabricator.wikimedia.org/T271259
[21:41:21] rzl: mutante: Krinkle: longma: wrote a bit about the opcache issue at https://phabricator.wikimedia.org/T245183#6724069 ;)
[21:41:24] nothing fancy really
[21:41:28] Operations, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (hashar) That happened during 1.36.0-wmf.25 promotion to testwiki. We then had three servers showing all the symptoms of suffering from an opcac...
[21:41:41] thanks
[21:41:53] I also found out we mortals have a sudo rule to restart php fpm :]
[21:42:06] thanks p858snake
[21:42:52] rzl: and the puppet-run crontab does not match the start of the issue on each server. Then I don't have access to the apache/puppet logs to confirm. But that seems to rule out Apache reload as the source of the issue
[21:43:23] I will roll back to testwikis
[21:43:25] I am going off it is late here
[21:43:41] Operations, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (RhinosF1) Do we need to update the title / create a 2021 task?
[21:44:07] Operations, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Krinkle)
[21:44:12] (PS3) CRusnov: labstore/files/logcleanup.py: Port to Python3 [puppet] - https://gerrit.wikimedia.org/r/654339 (https://phabricator.wikimedia.org/T247364)
[21:44:46] longma: I am subscribed to the blocker task now. Will catch up tomorrow if you need me to explicitly do something tomorrow morning just ping me on the task and I will obey :]
[21:45:08] s/obey/follow up/
[21:45:12] thanks hashar! good night
[21:45:12] I've just poked ladsgroup as he broke it
[21:47:21] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (Jclark-ctr) @RobH @wiki_willy Attached 3 photos Unfortunately large cooling fins have fitment issues in dell case. Any future gpu can not extend past backplane bracket...
[21:48:20] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.36.0-wmf.22"
[21:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:25] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (Jclark-ctr) {F33986382} {F33986383}
[21:50:01] (PS1) Jeena Huneidi: Revert "group0 wikis to 1.36.0-wmf.25 refs T267418" [mediawiki-config] - https://gerrit.wikimedia.org/r/654508
[21:50:03] (CR) Jeena Huneidi: [C: +2] Revert "group0 wikis to 1.36.0-wmf.25 refs T267418" [mediawiki-config] - https://gerrit.wikimedia.org/r/654508 (owner: Jeena Huneidi)
[21:50:27] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (Jclark-ctr) {F33986386}
[21:50:37] (CR) CRusnov: [C: +2] labstore/files/logcleanup.py: Port to Python3 [puppet] - https://gerrit.wikimedia.org/r/654339 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[21:51:03] (CR) CRusnov: [C: +2] labstore/files/logcleanup.py: Port to Python3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/654339 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[21:51:21] (Merged) jenkins-bot: Revert "group0 wikis to 1.36.0-wmf.25 refs T267418" [mediawiki-config] - https://gerrit.wikimedia.org/r/654508 (owner: Jeena Huneidi)
[21:53:54] Reedy: thanks for sorting the trace
[21:56:45] Reedy: am I right in thinking https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/653792 was probably the cause
[21:56:54] Or one of the other namespace related patches?
[21:57:21] Somewhere in that tree, yeah
[21:57:24] I've put a patch up
[21:57:31] Just saw
[21:57:55] That fix looks simple enough to at least unblock the train
[21:58:04] Just needs deploying I guess
[21:58:15] Yeah, need to check CI doesn't complain
[21:58:27] As autoloadnamespace and autoloadclasses can get upset sometime
[21:59:23] Ah ok
[21:59:33] CI nearly always complains to me
[21:59:35] And it does complain like I thought
[21:59:36] haha
[22:00:02] I self taught too much which means me and CI nearly always disagree on style
[22:05:57] (PS1) Reedy: Explicitly Autoload old aliased classes [extensions/AbuseFilter] (wmf/1.36.0-wmf.25) - https://gerrit.wikimedia.org/r/654447 (https://phabricator.wikimedia.org/T271266)
[22:06:21] (CR) Reedy: [C: +2] Explicitly Autoload old aliased classes [extensions/AbuseFilter] (wmf/1.36.0-wmf.25) - https://gerrit.wikimedia.org/r/654447 (https://phabricator.wikimedia.org/T271266) (owner: Reedy)
[22:06:43] (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/654515 (owner: CRusnov)
[22:17:06] (PS1) Esanders: Disable DiscussionTools' newtopictool [mediawiki-config] - https://gerrit.wikimedia.org/r/654520
[22:24:11] (CR) jerkins-bot: [V: -1] Explicitly Autoload old aliased classes [extensions/AbuseFilter] (wmf/1.36.0-wmf.25) - https://gerrit.wikimedia.org/r/654447 (https://phabricator.wikimedia.org/T271266) (owner: Reedy)
[22:24:17] bloody browser tests
[22:24:33] (PS1) Ladsgroup: hive: Migrate hiera() to lookup() and setting datatype in metastore [puppet] - https://gerrit.wikimedia.org/r/654521 (https://phabricator.wikimedia.org/T209953)
[22:24:52] (CR) Reedy: [V: +2 C: +2] "Stupid echo browser test failure filed as T271281" [extensions/AbuseFilter] (wmf/1.36.0-wmf.25) - https://gerrit.wikimedia.org/r/654447 (https://phabricator.wikimedia.org/T271266) (owner: Reedy)
[22:26:50] !log reedy@deploy1001 Synchronized php-1.36.0-wmf.25/extensions/AbuseFilter/extension.json: T271266 (duration: 01m 04s)
[22:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:56] T271266: "Caught exception of type Error" when creating new flow threads - https://phabricator.wikimedia.org/T271266
[22:27:03] (PS1) Ladsgroup: Check for the index name while it's being renamed [core] (wmf/1.36.0-wmf.25) - https://gerrit.wikimedia.org/r/654448 (https://phabricator.wikimedia.org/T271259)
[22:27:13] (CR) Ladsgroup: [C: +2] Check for the index name while it's being renamed [core] (wmf/1.36.0-wmf.25) - https://gerrit.wikimedia.org/r/654448 (https://phabricator.wikimedia.org/T271259) (owner: Ladsgroup)
[22:31:51] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27359/" [puppet] - https://gerrit.wikimedia.org/r/654521 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[22:39:56] (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/654449 (owner: DannyS712)
[22:42:32] Hey all - looking to deploy a security patch for T270988. Will involve a temporary config change that'll go to .25 and mwdebug and then get reverted (https://gerrit.wikimedia.org/r/654449). attn: Reedy
[22:43:02] (PS3) DannyS712: Revoke `tboverride` from testwiki template editors [mediawiki-config] - https://gerrit.wikimedia.org/r/654449
[22:43:45] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:49:51] on sodium it's the update-ubuntu-mirror unit
[22:53:15] ...and not going to deploy the security patch T270988 rn after all
[22:53:24] sbassett: Just check with Amir1 first...
[22:53:28] heh, or not :)
[22:53:59] Reedy: yeah, hoping T271259 is fixed by tomorrow, will try again.
[22:53:59] T271259: Database query error when viewing non-existent pages on mediawiki.org - https://phabricator.wikimedia.org/T271259
[22:54:45] Operations, MW-on-K8s, Shellbox, serviceops, and 3 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Legoktm)
[22:57:44] (CR) Cicalese: [C: +2] Labs: remove labs-specific wmgUseMediaModeration [mediawiki-config] - https://gerrit.wikimedia.org/r/654325 (owner: Ppchelko)
[22:57:47] (CR) Cicalese: [C: +2] Labs: Remove now unused wgParserCacheUseJson [mediawiki-config] - https://gerrit.wikimedia.org/r/654315 (owner: Ppchelko)
[22:58:51] (Merged) jenkins-bot: Labs: remove labs-specific wmgUseMediaModeration [mediawiki-config] - https://gerrit.wikimedia.org/r/654325 (owner: Ppchelko)
[22:58:58] (Merged) jenkins-bot: Labs: Remove now unused wgParserCacheUseJson [mediawiki-config] - https://gerrit.wikimedia.org/r/654315 (owner: Ppchelko)
[22:59:38] (CR) SBassett: [C: -2] "Holding" [mediawiki-config] - https://gerrit.wikimedia.org/r/654449 (owner: DannyS712)
[23:04:37] sbassett: I'll get it deployed now
[23:04:45] (my patch I mean)
[23:05:41] (Merged) jenkins-bot: Check for the index name while it's being renamed [core] (wmf/1.36.0-wmf.25) - https://gerrit.wikimedia.org/r/654448 (https://phabricator.wikimedia.org/T271259) (owner: Ladsgroup)
[23:13:38] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.25/includes/logging/LogPager.php: [[gerrit:654507|Check for the index name while it's being renamed]] (duration: 01m 06s)
[23:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:32] sbassett: it's done now.
[23:46:25] (PS2) Dave Pifke: Remove Excimer single-shot profiling [mediawiki-config] - https://gerrit.wikimedia.org/r/651267 (owner: Ori.livneh)
[23:46:27] (PS3) Dave Pifke: [WIP] profiler: remove MongoDB client [mediawiki-config] - https://gerrit.wikimedia.org/r/621095 (https://phabricator.wikimedia.org/T180761)
[23:49:27] (CR) Dzahn: [C: +2] "Even though this has "mariadb" in the name it is only applied on maintenance hosts and compiler shows complete noop as expected." [puppet] - https://gerrit.wikimedia.org/r/650637 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[23:53:12] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: hw troubleshooting: Illegal opcode error on boot for frdb1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T271284 (Dwisehaupt) p:Triage→High
[23:53:30] (CR) Bstorm: [C: +1] "I did not test it, and I admit to some ignorance to potential consequences. That said, everything looks sound (and I'm getting used to the" [puppet] - https://gerrit.wikimedia.org/r/651507 (https://phabricator.wikimedia.org/T267195) (owner: David Caro)
[23:55:20] (CR) Bstorm: [C: +1] wmcs.backup: Remove all dangling snapshots [puppet] - https://gerrit.wikimedia.org/r/651537 (https://phabricator.wikimedia.org/T267195) (owner: David Caro)