[00:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200117T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:03:47] PROBLEM - Check systemd state on urldownloader2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:57] PROBLEM - Check systemd state on urldownloader1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:14] p858snake: yea, pretty sure it will not be what it says on the ticket "they control the IP and create certs but it's in our domain"
[00:04:17] PROBLEM - Check systemd state on urldownloader2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:23] PROBLEM - Check systemd state on urldownloader1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:28] mutante: I commented on the codesearch task, the main problem right now is that puppet tries to start everything all at once...in case you have any ideas on how to stagger the startup
[00:13:55] Operations, Security-Team, Traffic, CRM (Jan-Mar-2020): Domain / Subdomain for Wikimania Scholarship Public Form on CRM - https://phabricator.wikimedia.org/T243032 (JFishback_WMF) Has #wmf-legal reviewed this yet?
[00:14:03] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 31675520 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:14:18] Operations, Privacy Engineering, Security-Team, Traffic, CRM (Jan-Mar-2020): Domain / Subdomain for Wikimania Scholarship Public Form on CRM - https://phabricator.wikimedia.org/T243032 (JFishback_WMF)
[00:15:51] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 65456 and 88 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:19:35] legoktm: just 2 quick thoughts for now. first is in systemd unit files there is "After=" and it could be used to start one service after another service instead of all after "network.target". other thought is that systemd itself also supports templates to start multiple services from the same unit file with just some small things changing. "By appending the @ symbol to the unit file name,
[00:19:41] it becomes a template unit file and can be called multiple times."
[00:20:53] also.. one unit can have stuff like "Requires=worker@1.service worker@2.service worker@3.service"
[00:27:41] (CR) Dzahn: [C: +2] gerrit: rename gerrit-test to gerrit1002 in Hiera [puppet] - https://gerrit.wikimedia.org/r/565404 (owner: Dzahn)
[00:29:51] Puppet, VPS-project-codesearch: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (Legoktm) Hm, some puppet thing transitively installed ferm (good I guess), but that means we need a config file to open up port 3002 that hound_proxy uses.
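A minimal sketch of the systemd suggestions made at 00:19 to 00:20 above, not the configuration that was actually deployed for codesearch. The unit name worker@, the binary path, and workers.target are hypothetical; only After=, Requires=, the @ template naming, and the %i specifier are standard systemd features.

```ini
# /etc/systemd/system/worker@.service  (hypothetical template unit)
# "%i" expands to the text after "@", e.g. "2" for worker@2.service.
[Unit]
Description=Worker instance %i
After=network.target

[Service]
# Hypothetical binary and flag, for illustration only.
ExecStart=/usr/local/bin/worker --instance %i

# /etc/systemd/system/worker@2.service.d/order.conf  (drop-in for one instance)
# After= only orders startup; it does not pull worker@1 in by itself.
[Unit]
After=worker@1.service

# /etc/systemd/system/workers.target  (hypothetical grouping unit)
# Requires= pulls all three instances in when the target is started.
[Unit]
Description=All worker instances
Requires=worker@1.service worker@2.service worker@3.service
```

With the default Type=simple, systemd treats a service as started as soon as its process is spawned, so After= alone may not space out heavy initialization; a readiness-aware type such as Type=notify would be needed for a real stagger.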
[00:30:34] Operations, ops-codfw, Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (Eevans)
[00:33:35] !log bootstrapping restbase2022-a — T243000
[00:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:38] T243000: Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000
[00:34:27] RECOVERY - cassandra-a SSL 10.192.32.191:7001 on restbase2022 is OK: SSL OK - Certificate restbase2022-a valid until 2022-01-15 15:53:04 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[00:38:15] fun.. so the urldownloader alerts are because logrotate failed
[00:38:26] and logrotate fails because squid3:7 duplicate log entry for /var/log/squid/access.log
[00:38:34] and decides to not even start because of that?!
[00:40:15] and that is because "squid" vs "squid3" in buster.. so we had to rename it
[00:43:59] RECOVERY - Check systemd state on urldownloader1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:25] RECOVERY - Check systemd state on urldownloader1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:33] !log urldownloaders - rm /etc/logrotate.d/squid3 ; systemctl start logrotate (this fixes failed logrotate because of squid3 vs squid file = duplicate entry, but puppet will recreate it)
[00:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:29] RECOVERY - Check systemd state on urldownloader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:01] RECOVERY - Check systemd state on urldownloader2002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:54:10] Operations, Patch-For-Review: Migrate URL downloaders to Buster - https://phabricator.wikimedia.org/T224551 (Dzahn) Today these alerts happened: ` 19:03 <+icinga-wm> PROBLEM - Check systemd state on urldownloader2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units fail...
[01:06:59] Operations, ops-codfw, DBA, Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (Papaul) Last update from Dell below. Nothing we don't know about already. I have not found the ability to disable a single port on the...
[01:31:23] Operations, Beta-Cluster-Infrastructure: python3.4 broken on deployment-logstash2 - https://phabricator.wikimedia.org/T243048 (Reedy)
[01:35:05] Operations, Beta-Cluster-Infrastructure: python3.4 broken on deployment-logstash2 - https://phabricator.wikimedia.org/T243048 (Krenair) we should probably stop trying to fix problems with this instance and aim to shut it down, have people fix logstash03 instead?
[01:35:42] Operations, Beta-Cluster-Infrastructure: python3.4 broken on deployment-logstash2 - https://phabricator.wikimedia.org/T243048 (Krenair) (IIRC f-strings are python 3.6 so I'm not sure how this has shown up in a python3.4 directory)
[01:40:36] (PS2) Jforrester: mediawiki::php: Don't install gd any more, ZeroBanner is gone [puppet] - https://gerrit.wikimedia.org/r/526255 (https://phabricator.wikimedia.org/T227734)
[01:43:21] Operations, ops-codfw, DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (Papaul)
[01:47:33] Operations, Beta-Cluster-Infrastructure: deployment-logstash03: UDP listener died EADDRINUSE - https://phabricator.wikimedia.org/T241481 (Reedy)
[01:47:55] Operations, Beta-Cluster-Infrastructure: deployment-logstash03: UDP listener died EADDRINUSE - https://phabricator.wikimedia.org/T241481 (Reedy)
[01:53:09] Operations, Beta-Cluster-Infrastructure: python3.4 broken on deployment-logstash2 - https://phabricator.wikimedia.org/T243048 (Reedy)
[01:57:49] (PS1) Alex Monk: CloudVPS: codfw1dev: Fix default SSH rule to use correct range [puppet] - https://gerrit.wikimedia.org/r/565431 (https://phabricator.wikimedia.org/T229441)
[02:07:42] (PS2) Alex Monk: CloudVPS: codfw1dev: Fix default SSH rule to use correct range [puppet] - https://gerrit.wikimedia.org/r/565431 (https://phabricator.wikimedia.org/T229441)
[02:14:04] Operations, ops-codfw, DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (Papaul)
[02:21:06] (PS1) Catrope: Enable WelcomeSurvey on ukwiki, huwiki, hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565438 (https://phabricator.wikimedia.org/T238295)
[02:21:24] (CR) Catrope: [C: -2] "Blocked on privacy statement from legal, see task" [mediawiki-config] - https://gerrit.wikimedia.org/r/565438
(https://phabricator.wikimedia.org/T238295) (owner: Catrope)
[02:23:21] (PS2) Andrew Bogott: wmcs-dns-floating-ip-updater.py: Partial refactor [puppet] - https://gerrit.wikimedia.org/r/565284 (https://phabricator.wikimedia.org/T238766)
[02:23:24] (PS2) Andrew Bogott: wmcs-dns-floating-ip-updater.py: further refactor [puppet] - https://gerrit.wikimedia.org/r/565285 (https://phabricator.wikimedia.org/T238766)
[02:23:26] (PS2) Andrew Bogott: wmcs-dns-floating-ip-updater.py: add a main() function [puppet] - https://gerrit.wikimedia.org/r/565286 (https://phabricator.wikimedia.org/T238766)
[02:23:27] (PS2) Andrew Bogott: wmcs-dns-floating-ip-updater.py: catch all exceptions [puppet] - https://gerrit.wikimedia.org/r/565287 (https://phabricator.wikimedia.org/T238766)
[02:23:30] (PS4) Andrew Bogott: wmcs-dns-floating-ip-updater.py: retry if we encounter an exception [puppet] - https://gerrit.wikimedia.org/r/565044 (https://phabricator.wikimedia.org/T238766)
[02:23:44] (PS1) Catrope: Enable UnderstandingFirstDay on ukwiki, huwiki, hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565439 (https://phabricator.wikimedia.org/T238294)
[02:27:38] (PS2) Catrope: Enable WelcomeSurvey on ukwiki, huwiki, hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565438 (https://phabricator.wikimedia.org/T238295)
[02:29:12] (PS3) Catrope: Enable WelcomeSurvey on ukwiki, huwiki, hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565438 (https://phabricator.wikimedia.org/T238295)
[02:32:38] RECOVERY - cassandra-a CQL 10.192.32.191:9042 on restbase2022 is OK: TCP OK - 0.036 second response time on 10.192.32.191 port 9042 https://phabricator.wikimedia.org/T93886
[02:41:37] (CR) Andrew Bogott: [C: +2] wmcs-dns-floating-ip-updater.py: Partial refactor [puppet] - https://gerrit.wikimedia.org/r/565284 (https://phabricator.wikimedia.org/T238766) (owner: Andrew Bogott)
[02:42:16] (CR) Andrew Bogott: [C: +2]
wmcs-dns-floating-ip-updater.py: add a main() function [puppet] - https://gerrit.wikimedia.org/r/565286 (https://phabricator.wikimedia.org/T238766) (owner: Andrew Bogott)
[02:42:27] (CR) Andrew Bogott: [C: +2] wmcs-dns-floating-ip-updater.py: further refactor [puppet] - https://gerrit.wikimedia.org/r/565285 (https://phabricator.wikimedia.org/T238766) (owner: Andrew Bogott)
[02:42:41] (CR) Andrew Bogott: [C: +2] wmcs-dns-floating-ip-updater.py: catch all exceptions [puppet] - https://gerrit.wikimedia.org/r/565287 (https://phabricator.wikimedia.org/T238766) (owner: Andrew Bogott)
[02:43:23] (CR) Andrew Bogott: [C: +2] wmcs-dns-floating-ip-updater.py: retry if we encounter an exception [puppet] - https://gerrit.wikimedia.org/r/565044 (https://phabricator.wikimedia.org/T238766) (owner: Andrew Bogott)
[02:45:18] !log bootstrapping restbase2022-b — T243000
[02:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:21] T243000: Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000
[02:45:34] Operations, ops-codfw, Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (Eevans)
[02:58:27] (PS1) Catrope: GrowthExperiments: Enable help panel on ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565442 (https://phabricator.wikimedia.org/T238319)
[03:09:07] (PS1) Catrope: GrowthExperiments: Enable help panel on huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565443 (https://phabricator.wikimedia.org/T238319)
[03:26:49] (PS1) Catrope: GrowthExperiments: Enable help panel on hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565447 (https://phabricator.wikimedia.org/T238319)
[03:28:56] Operations, ops-codfw, DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet -
https://phabricator.wikimedia.org/T241336 (Papaul)
[03:29:22] Operations, ops-codfw, DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (Papaul) a:Papaul→Marostegui @Marostegui all yours
[03:32:14] (PS1) Catrope: GrowthExperiments: Enable homepage on ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565448 (https://phabricator.wikimedia.org/T238320)
[03:34:10] (PS1) Catrope: GrowthExperiments: Enable homepage on huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565449 (https://phabricator.wikimedia.org/T238320)
[03:36:01] (PS1) Catrope: GrowthExperiments: Enable homepage on hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565450 (https://phabricator.wikimedia.org/T238320)
[03:45:40] (PS1) Legoktm: codesearch: Open up port 3002 [puppet] - https://gerrit.wikimedia.org/r/565451 (https://phabricator.wikimedia.org/T242319)
[04:25:34] (PS1) Andrew Bogott: mwopenstackclients: add python3 version [puppet] - https://gerrit.wikimedia.org/r/565456 (https://phabricator.wikimedia.org/T229920)
[04:25:39] (PS1) Andrew Bogott: mwopenstackclients: use keystoneauth1 sessions [puppet] - https://gerrit.wikimedia.org/r/565457
[04:25:41] (PS1) Andrew Bogott: wmcs-dns-floating-ip-updater.py: move to python3 [puppet] - https://gerrit.wikimedia.org/r/565458 (https://phabricator.wikimedia.org/T229920)
[04:26:16] (CR) jerkins-bot: [V: -1] mwopenstackclients: add python3 version [puppet] - https://gerrit.wikimedia.org/r/565456 (https://phabricator.wikimedia.org/T229920) (owner: Andrew Bogott)
[04:26:37] (CR) jerkins-bot: [V: -1] mwopenstackclients: use keystoneauth1 sessions [puppet] - https://gerrit.wikimedia.org/r/565457 (owner: Andrew Bogott)
[04:26:56] (CR) jerkins-bot: [V: -1] wmcs-dns-floating-ip-updater.py: move to python3 [puppet] - https://gerrit.wikimedia.org/r/565458
(https://phabricator.wikimedia.org/T229920) (owner: Andrew Bogott)
[04:35:22] (PS2) Andrew Bogott: mwopenstackclients: add python3 version [puppet] - https://gerrit.wikimedia.org/r/565456 (https://phabricator.wikimedia.org/T229920)
[04:35:24] (PS2) Andrew Bogott: mwopenstackclients: use keystoneauth1 sessions [puppet] - https://gerrit.wikimedia.org/r/565457
[04:35:26] (PS2) Andrew Bogott: wmcs-dns-floating-ip-updater.py: move to python3 [puppet] - https://gerrit.wikimedia.org/r/565458 (https://phabricator.wikimedia.org/T229920)
[04:40:45] (PS4) BryanDavis: toolforge: Monitor local crontabs with Prometheus [puppet] - https://gerrit.wikimedia.org/r/561412 (https://phabricator.wikimedia.org/T210993)
[04:42:00] (CR) Andrew Bogott: "The 'all production hosts' run only produced one failure, for orespoolcounter1004. That failure can't be reproduced, though, here's a rec" [puppet] - https://gerrit.wikimedia.org/r/564662 (owner: Andrew Bogott)
[04:57:43] (CR) BryanDavis: toolforge: Monitor local crontabs with Prometheus (1 comment) [puppet] - https://gerrit.wikimedia.org/r/561412 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[05:39:05] (PS5) BryanDavis: Report error messages on stderr [software/tools-webservice] - https://gerrit.wikimedia.org/r/496565
[05:39:07] (PS5) BryanDavis: Remove lighttpd-precise handling [software/tools-webservice] - https://gerrit.wikimedia.org/r/496566
[05:39:09] (PS5) BryanDavis: Improve support for extra_args [software/tools-webservice] - https://gerrit.wikimedia.org/r/496567
[05:39:11] (PS3) BryanDavis: Rename internal "toollabs" package to "toolforge" [software/tools-webservice] - https://gerrit.wikimedia.org/r/563605
[05:39:13] (PS11) BryanDavis: Make Kubernetes the default backend and warn when guessing [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: Nehajha)
[05:39:15] (PS5)
BryanDavis: kubernetes: Set php7.3 as the default type [software/tools-webservice] - https://gerrit.wikimedia.org/r/496564
[05:39:52] (CR) jerkins-bot: [V: -1] Rename internal "toollabs" package to "toolforge" [software/tools-webservice] - https://gerrit.wikimedia.org/r/563605 (owner: BryanDavis)
[05:39:57] (CR) jerkins-bot: [V: -1] Make Kubernetes the default backend and warn when guessing [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: Nehajha)
[05:40:09] (CR) jerkins-bot: [V: -1] kubernetes: Set php7.3 as the default type [software/tools-webservice] - https://gerrit.wikimedia.org/r/496564 (owner: BryanDavis)
[05:53:40] (PS4) BryanDavis: Rename internal "toollabs" package to "toolforge" [software/tools-webservice] - https://gerrit.wikimedia.org/r/563605
[05:53:42] (PS12) BryanDavis: Make Kubernetes the default backend and warn when guessing [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: Nehajha)
[05:53:45] (PS6) BryanDavis: kubernetes: Set php7.3 as the default type [software/tools-webservice] - https://gerrit.wikimedia.org/r/496564
[05:58:31] PROBLEM - snapshot of s7 in codfw on db1115 is CRITICAL: snapshot for s7 at codfw taken more than 4 days ago: Most recent backup 2020-01-13 05:37:41 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[06:02:03] (PS1) Marostegui: db1081: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/565461
[06:02:59] (CR) Marostegui: [C: +2] db1081: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/565461 (owner: Marostegui)
[06:03:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1081', diff saved to https://phabricator.wikimedia.org/P10188 and previous config saved to /var/cache/conftool/dbconfig/20200117-060259-marostegui.json
[06:03:02] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:42] Operations, ops-codfw, DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (Marostegui) Thank you Papaul! Memory and disk space looks good ` [06:06:33] marostegui@cumin1001:~$ sudo cumin 'es202*.codfw.wmnet' 'free -g ; df -hT /s...
[06:10:26] Operations, ops-codfw, DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (Marostegui) Open→Resolved a:Marostegui→Papaul
[06:15:19] Operations, ops-eqiad, DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (Marostegui)
[06:15:22] Operations, ops-codfw, DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (Marostegui)
[06:16:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1081', diff saved to https://phabricator.wikimedia.org/P10189 and previous config saved to /var/cache/conftool/dbconfig/20200117-061602-marostegui.json
[06:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:51] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:22:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:28:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1081', diff saved to https://phabricator.wikimedia.org/P10190 and previous config saved to /var/cache/conftool/dbconfig/20200117-062838-marostegui.json
[06:28:40] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:51] (PS1) Marostegui: db1125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/565463
[06:34:53] (CR) Legoktm: Consistently capitalize MediaWiki properly (1 comment) [puppet] - https://gerrit.wikimedia.org/r/565166 (owner: Legoktm)
[06:35:20] !log Compress db1125:3314 tables - this will create lag on s4 labs hosts
[06:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:40] (CR) jerkins-bot: [V: -1] db1125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/565463 (owner: Marostegui)
[06:36:16] (PS2) Legoktm: Consistently capitalize MediaWiki properly [puppet] - https://gerrit.wikimedia.org/r/565166
[06:36:18] (PS1) Legoktm: Add "Mediawiki" (incorrectly capitalized) to typos file [puppet] - https://gerrit.wikimedia.org/r/565464
[06:36:55] (PS2) Marostegui: db1125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/565463
[06:37:51] (CR) jerkins-bot: [V: -1] db1125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/565463 (owner: Marostegui)
[06:38:46] (PS3) Marostegui: db1125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/565463
[06:41:38] (CR) Marostegui: [V: +2 C: +2] db1125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/565463 (owner: Marostegui)
[06:45:54] Operations, Domains, Privacy Engineering, Security-Team, and 2 others: Domain / Subdomain for Wikimania Scholarship Public Form on CRM - https://phabricator.wikimedia.org/T243032 (soworu)
[06:58:16] (PS1) Marostegui: mariadb: Move es2020 from spare to core [puppet] - https://gerrit.wikimedia.org/r/565465 (https://phabricator.wikimedia.org/T243052)
[07:00:35] (CR) Marostegui: [C: +2] mariadb: Move es2020 from spare to core [puppet] - https://gerrit.wikimedia.org/r/565465 (https://phabricator.wikimedia.org/T243052) (owner:
Marostegui)
[07:03:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2012', diff saved to https://phabricator.wikimedia.org/P10191 and previous config saved to /var/cache/conftool/dbconfig/20200117-070320-marostegui.json
[07:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool es2012', diff saved to https://phabricator.wikimedia.org/P10192 and previous config saved to /var/cache/conftool/dbconfig/20200117-070516-marostegui.json
[07:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2019', diff saved to https://phabricator.wikimedia.org/P10193 and previous config saved to /var/cache/conftool/dbconfig/20200117-070636-marostegui.json
[07:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:04] !log Stop and upgrade db1082
[07:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:18] Operations, serviceops, Kubernetes: New Deployment charts should allow exposing services via TLS - https://phabricator.wikimedia.org/T236008 (Joe) p:Triage→Normal
[07:25:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P10197 and previous config saved to /var/cache/conftool/dbconfig/20200117-072544-marostegui.json
[07:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:56] Operations, Domains, Privacy Engineering, Security-Team, and 2 others: Domain / Subdomain for Wikimania Scholarship Public Form on CRM - https://phabricator.wikimedia.org/T243032 (Qgil) a:mark→None
[07:39:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P10198 and previous config saved to
/var/cache/conftool/dbconfig/20200117-073917-marostegui.json
[07:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1081', diff saved to https://phabricator.wikimedia.org/P10199 and previous config saved to /var/cache/conftool/dbconfig/20200117-073954-marostegui.json
[07:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:51] (CR) Legoktm: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/20406/" [puppet] - https://gerrit.wikimedia.org/r/565451 (https://phabricator.wikimedia.org/T242319) (owner: Legoktm)
[07:46:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P10200 and previous config saved to /var/cache/conftool/dbconfig/20200117-074626-marostegui.json
[07:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:54] ACKNOWLEDGEMENT - snapshot of s7 in codfw on db1115 is CRITICAL: snapshot for s7 at codfw taken more than 4 days ago: Most recent backup 2020-01-13 05:37:41 Marostegui lets wait for the next iteration https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[07:51:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1082', diff saved to https://phabricator.wikimedia.org/P10201 and previous config saved to /var/cache/conftool/dbconfig/20200117-075125-marostegui.json
[07:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:59] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:53:59] Operations, Domains, Privacy Engineering, Security-Team, and 2 others: Domain / Subdomain for Wikimania Scholarship Public Form on CRM - https://phabricator.wikimedia.org/T243032 (soworu) >>!
In T243032#5811583, @JFishback_WMF wrote: > Has #wmf-legal reviewed this yet? Legal has been duly notifi...
[07:55:06] Operations, ops-codfw, DBA, Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (MoritzMuehlenhoff) Thanks Papaul! I think we don't need to pursue the "let's disable the unused port" option further, the current solution...
[08:04:02] (PS1) Muehlenhoff: Fix role name [puppet] - https://gerrit.wikimedia.org/r/565505
[08:05:59] (CR) Muehlenhoff: [C: +2] Fix role name [puppet] - https://gerrit.wikimedia.org/r/565505 (owner: Muehlenhoff)
[08:10:22] (CR) Alexandros Kosiaris: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/565166 (owner: Legoktm)
[08:10:37] Operations, Beta-Cluster-Infrastructure: python3.4 broken on deployment-logstash2 - https://phabricator.wikimedia.org/T243048 (MoritzMuehlenhoff) p:Triage→Normal
[08:10:44] Operations, Beta-Cluster-Infrastructure: deployment-logstash03: UDP listener died EADDRINUSE - https://phabricator.wikimedia.org/T241481 (MoritzMuehlenhoff) p:Triage→Normal
[08:14:14] (PS1) Legoktm: codesearch: Work around bootstrapping problems [puppet] - https://gerrit.wikimedia.org/r/565508 (https://phabricator.wikimedia.org/T242319)
[08:14:46] (PS1) Elukey: profile::hadoop::backup::namenode: reduce the fsimage retention to 20 days [puppet] - https://gerrit.wikimedia.org/r/565509
[08:16:48] (CR) Legoktm: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/20407/codesearch6.codesearch.eqiad.wmflabs/" [puppet] - https://gerrit.wikimedia.org/r/565508 (https://phabricator.wikimedia.org/T242319) (owner: Legoktm)
[08:17:32] (CR) Elukey: [C: +2] profile::hadoop::backup::namenode: reduce the fsimage retention to 20 days [puppet] - https://gerrit.wikimedia.org/r/565509 (owner: Elukey)
[08:17:38] (CR) Legoktm: "I also manually tested the wait.py system on
codesearch4 to make sure it actually worked." [puppet] - https://gerrit.wikimedia.org/r/565508 (https://phabricator.wikimedia.org/T242319) (owner: Legoktm)
[08:20:24] Puppet, VPS-project-codesearch, Patch-For-Review: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (Legoktm) From IRC: `lang=irc 16:19:35 legoktm: just 2 quick thoughts for now. first is in systemd unit files there is "After=" and it could be used to start one service after...
[08:30:50] (PS1) Marostegui: es2020: Set role master for es4 [puppet] - https://gerrit.wikimedia.org/r/565510 (https://phabricator.wikimedia.org/T243052)
[08:31:31] (CR) Marostegui: [C: +2] es2020: Set role master for es4 [puppet] - https://gerrit.wikimedia.org/r/565510 (https://phabricator.wikimedia.org/T243052) (owner: Marostegui)
[08:36:08] Operations, Wikimedia-Apache-configuration, serviceops: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (Joe) p:Triage→Normal
[08:37:44] ACKNOWLEDGEMENT - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 4 days ago: Most recent backup 2020-01-13 08:24:39 Marostegui waiting for the next iteration https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[08:39:32] Operations, ops-eqiad, DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (Marostegui) When adding the MAC addresses to the DHCP file, make sure to add the following line: ` option pxelinux.pathprefix "http://apt.wikimedia.org/t...
[08:50:56] (CR) ArielGlenn: [C: +1] "This looks ok to me, though I don't have access to an instance with dumps mounted to be able to double-check things."
[puppet] - https://gerrit.wikimedia.org/r/565405 (https://phabricator.wikimedia.org/T242798) (owner: Bstorm)
[08:52:57] (PS2) Muehlenhoff: Add Jennifer Wang to analytics-privatedata-users and researchers [puppet] - https://gerrit.wikimedia.org/r/565304 (https://phabricator.wikimedia.org/T242807)
[08:56:03] (CR) Muehlenhoff: [C: +2] Add Jennifer Wang to analytics-privatedata-users and researchers [puppet] - https://gerrit.wikimedia.org/r/565304 (https://phabricator.wikimedia.org/T242807) (owner: Muehlenhoff)
[08:56:14] !log
[08:56:17] !log
[08:57:36] !log tools.stashbot Starting deploy of Parsoid
[08:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2103', diff saved to https://phabricator.wikimedia.org/P10202 and previous config saved to /var/cache/conftool/dbconfig/20200117-085808-marostegui.json
[08:58:10] !log tools.stashbot Ended deploy of Parsoid
[08:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:02] !log Installing ATS
[09:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:41] !log Finished installing ATS
[09:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:24] (CR) Arturo Borrero Gonzalez: "LGTM.
My only request would be to add {default_value => '0.0.0.0/0'} to the new lookup() calls to ease running this code inside CloudVPS V" [puppet] - https://gerrit.wikimedia.org/r/565431 (https://phabricator.wikimedia.org/T229441) (owner: Alex Monk)
[09:03:32] (CR) Arturo Borrero Gonzalez: "> Patch Set 5:" [software/tools-webservice] - https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: Arturo Borrero Gonzalez)
[09:06:17] (CR) Alexandros Kosiaris: [C: +2] "Fleet wide PCC at https://puppet-compiler.wmflabs.org/compiler1003/334/, seems pretty ok. Merging" [puppet] - https://gerrit.wikimedia.org/r/565166 (owner: Legoktm)
[09:06:24] \o/
[09:06:33] (CR) Alexandros Kosiaris: [C: +2] Add "Mediawiki" (incorrectly capitalized) to typos file [puppet] - https://gerrit.wikimedia.org/r/565464 (owner: Legoktm)
[09:06:43] (PS1) Muehlenhoff: Annotate Kerberos access for Jennifer and Kai [puppet] - https://gerrit.wikimedia.org/r/565512
[09:06:46] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: Monitor local crontabs with Prometheus [puppet] - https://gerrit.wikimedia.org/r/561412 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[09:07:14] <_joe_> uhm
[09:07:20] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users, researchers & wmf for jennifer wang (jwang) - https://phabricator.wikimedia.org/T242807 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @jwang Your access is now enabled, let me kno...
[09:08:53] shall I remove the bogus log entries from the wiki?
[09:09:45] apergos: I was about to do that [09:09:49] to avoid confusions [09:09:55] ok then [09:10:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] mwopenstackclients: use keystoneauth1 sessions [puppet] - 10https://gerrit.wikimedia.org/r/565457 (owner: 10Andrew Bogott) [09:10:51] done [09:11:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] mwopenstackclients: add python3 version [puppet] - 10https://gerrit.wikimedia.org/r/565456 (https://phabricator.wikimedia.org/T229920) (owner: 10Andrew Bogott) [09:11:34] 👍 [09:13:18] (03PS1) 10Joal: Convert labstore stats fetcher left to hdfs-rsync [puppet] - 10https://gerrit.wikimedia.org/r/565513 [09:13:26] elukey: --^ [09:14:00] (03PS2) 10Legoktm: codesearch: Work around bootstrapping problems [puppet] - 10https://gerrit.wikimedia.org/r/565508 (https://phabricator.wikimedia.org/T242319) [09:15:21] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10fgiunchedi) a:05fgiunchedi→03Eevans [09:16:12] (03CR) 10Arturo Borrero Gonzalez: wmflib: Introduce a more usable data structure to describe services. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/558620 (owner: 10Giuseppe Lavagetto) [09:16:41] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10fgiunchedi) [09:17:18] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10fgiunchedi) 05Open→03Resolved All done, service is being implemented in T243000 [09:19:19] (03CR) 10Elukey: [C: 03+2] Convert labstore stats fetcher left to hdfs-rsync [puppet] - 10https://gerrit.wikimedia.org/r/565513 (owner: 10Joal) [09:25:29] apergos,bstorm_ - all analytics data fetches from labstores are using hdfs now! [09:25:48] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:25:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:50] no more fuse, so hopefully more stable [09:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:53] 10Operations, 10WMDE-Analytics-Engineering, 10Graphite, 10User-Addshore: Regularly & Automatically backup WMDE metrics stored in graphite - https://phabricator.wikimedia.org/T125408 (10fgiunchedi) >>! In T125408#5811525, @Addshore wrote: > @fgiunchedi Any idea if there is any sort of regular / scheduled ba... [09:30:08] elukey: great news! [09:30:36] I don't want to jinx it so I won't say that it seems to have been more stable recently :-) [09:31:42] apergos: let's see! The mediawiki history dumps are now available, it was not possible before, we are really super happy [09:31:51] (kudos to joal for all the work!) 
[09:32:06] :-) :-) [09:37:53] 10Operations, 10Patch-For-Review: Migrate URL downloaders to Buster - https://phabricator.wikimedia.org/T224551 (10MoritzMuehlenhoff) Good catch! I'll review the difference between the Logrotate config shipped in the Debian config and our Puppet one, maybe we can simply stick with the Debian default entirely. [09:47:20] 10Operations, 10Beta-Cluster-Infrastructure: python3.4 broken on deployment-logstash2 - https://phabricator.wikimedia.org/T243048 (10faidon) I've seen this issue before, and if I recall correctly, it was an issue with the Python 3.4 backport. I think the [[ https://people.debian.org/~paravoid/python-all/ | lat... [09:51:33] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: fix 0.58 entry with typo in Arturo's name [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565521 [09:52:09] (03PS1) 10Muehlenhoff: Fix the name of the logrotate config file for Squid 4 [puppet] - 10https://gerrit.wikimedia.org/r/565522 (https://phabricator.wikimedia.org/T224551) [09:53:04] (03CR) 10jerkins-bot: [V: 04-1] Fix the name of the logrotate config file for Squid 4 [puppet] - 10https://gerrit.wikimedia.org/r/565522 (https://phabricator.wikimedia.org/T224551) (owner: 10Muehlenhoff) [09:54:23] (03PS2) 10Muehlenhoff: Fix the name of the logrotate config file for Squid 4 [puppet] - 10https://gerrit.wikimedia.org/r/565522 (https://phabricator.wikimedia.org/T224551) [09:54:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/changelog: fix 0.58 entry with typo in Arturo's name [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565521 (owner: 10Arturo Borrero Gonzalez) [09:56:25] 10Operations, 10MediaWiki-extensions-CodeReview: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10Legoktm) [09:56:38] (03PS1) 10Elukey: Enable Kerberos for Analytics Refine Spark jobs on the Hadoop coords [puppet] - 10https://gerrit.wikimedia.org/r/565523 [10:07:24] 
10Operations, 10observability: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10fgiunchedi) [10:08:16] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/20411/" [puppet] - 10https://gerrit.wikimedia.org/r/565522 (https://phabricator.wikimedia.org/T224551) (owner: 10Muehlenhoff) [10:09:16] (03CR) 10Muehlenhoff: [C: 03+2] Fix the name of the logrotate config file for Squid 4 [puppet] - 10https://gerrit.wikimedia.org/r/565522 (https://phabricator.wikimedia.org/T224551) (owner: 10Muehlenhoff) [10:09:33] (03CR) 10Muehlenhoff: [C: 03+2] Annotate Kerberos access for Jennifer and Kai [puppet] - 10https://gerrit.wikimedia.org/r/565512 (owner: 10Muehlenhoff) [10:10:14] (03PS3) 10Muehlenhoff: Fix the name of the logrotate config file for Squid 4 [puppet] - 10https://gerrit.wikimedia.org/r/565522 (https://phabricator.wikimedia.org/T224551) [10:10:52] (03PS2) 10Elukey: Enable Kerberos for Analytics Refine Spark jobs on the Hadoop coords [puppet] - 10https://gerrit.wikimedia.org/r/565523 [10:11:31] moritzm: --^ lol [10:12:27] kerberos-run-command has been fixing my mess for all this time [10:12:52] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/20412/" [puppet] - 10https://gerrit.wikimedia.org/r/565523 (owner: 10Elukey) [10:14:45] oh, wow :-) [10:32:47] !log installing remaining OpenSSL 1.0.2 updates [10:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:05] !log restarting apache on puppetboard* to pick up SSL updates [10:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:12] !log restarting apache on miscweb* to pick up SSL updates [10:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:34] (03PS5) 10Filippo Giunchedi: varnish: use syslog for varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/563977 (https://phabricator.wikimedia.org/T227108) 
[10:56:08] 10Operations, 10netops, 10cloud-services-team (Kanban): asw-b-codfw: fixes for openstack - https://phabricator.wikimedia.org/T243002 (10ayounsi) `lang=diff ayounsi@asw-b-codfw# show | compare [edit interfaces] interface-range vlan-private1-a-codfw { ... } + interface-range cloud-net-trunk { + me... [10:58:35] 10Operations, 10netops, 10cloud-services-team (Kanban): asw-b-codfw: fixes for openstack - https://phabricator.wikimedia.org/T243002 (10aborrero) >>! In T243002#5812422, @ayounsi wrote: > `lang=diff > ayounsi@asw-b-codfw# show | compare > [edit interfaces] > interface-range vlan-private1-a-codfw { ... }... [11:01:28] !log delete vlan cloud-instances1-b-eqiad from asw2-b-eqiad - T240670 [11:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:31] T240670: WMCS: cleanup network allocations - https://phabricator.wikimedia.org/T240670 [11:03:24] 10Operations, 10netops, 10cloud-services-team (Kanban): asw-b-codfw: fixes for openstack - https://phabricator.wikimedia.org/T243002 (10ayounsi) 05Open→03Resolved Synced up on IRC, change pushed. 
[11:03:54] 10Operations, 10netops, 10observability: Provision plaintext syslog collectors in esams/ulsfo/eqsin - https://phabricator.wikimedia.org/T243065 (10fgiunchedi) [11:04:09] !log Running homer to remove decom cloud vlans in eqiad/codfw - T240670 [11:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:06] !log restarting apache on matomo1001 to pick up SSL updates [11:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:38] (03PS1) 10Arturo Borrero Gonzalez: prometheus: node_local_crontab: fix sudo permissions [puppet] - 10https://gerrit.wikimedia.org/r/565530 (https://phabricator.wikimedia.org/T210993) [11:08:36] (03PS2) 10Arturo Borrero Gonzalez: prometheus: node_local_crontab: fix sudo permissions [puppet] - 10https://gerrit.wikimedia.org/r/565530 (https://phabricator.wikimedia.org/T210993) [11:10:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] prometheus: node_local_crontab: fix sudo permissions [puppet] - 10https://gerrit.wikimedia.org/r/565530 (https://phabricator.wikimedia.org/T210993) (owner: 10Arturo Borrero Gonzalez) [11:12:41] We have an issue of a banner image that is 800+ kb on the English Wikipedia (that could be something like 35 kb) https://phabricator.wikimedia.org/T243062 - what's the right team/who should I reach out to? [11:13:20] phedenskog: I suspect you should contact a CN admin [11:13:33] I'm guessing it wasn't added by a WMF staff member [11:13:44] !log restart nginx on analytics tool hosts to pick up openssl updates [11:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:04] (03PS1) 10Muehlenhoff: Update various Cumin aliases for recent role changes/additions [puppet] - 10https://gerrit.wikimedia.org/r/565531 [11:14:23] Though... Why is the image uploaded to donate wiki for a WLE thing? [11:14:56] https://donate.wikimedia.org/w/index.php?title=File:WLE_banner_2010.png [11:15:22] Let's blame Seddon. 
CC'd him on the task [11:15:25] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 62 probes of 512 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:18:22] (03CR) 10Muehlenhoff: [C: 03+2] Update various Cumin aliases for recent role changes/additions [puppet] - 10https://gerrit.wikimedia.org/r/565531 (owner: 10Muehlenhoff) [11:21:14] (03CR) 10Elukey: [C: 03+1] Switch component-pyall to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563474 (owner: 10Muehlenhoff) [11:21:27] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 29 probes of 512 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:22:18] Reedy: thanks [11:23:46] (03PS1) 10Muehlenhoff: Fix syntax for one Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/565532 [11:25:28] 10Operations, 10Beta-Cluster-Infrastructure: python3.4 broken on deployment-logstash2 - https://phabricator.wikimedia.org/T243048 (10Reedy) 05Open→03Resolved a:03Reedy >>! In T243048#5812215, @faidon wrote: > All that said, it's 2020, and spending time for Python 3.4 (and on stretch) is Just Wrong™ IMHO.... [11:31:36] (03CR) 10Muehlenhoff: [C: 03+2] Fix syntax for one Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/565532 (owner: 10Muehlenhoff) [11:34:23] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10Joe) [11:37:48] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10Joe) We should complete this work before we perform the MediaWiki switchover. 
[11:39:09] (03PS1) 10Vgutierrez: cumin: Provide aliases for ncredir@{esams,ulsfo,eqsin} [puppet] - 10https://gerrit.wikimedia.org/r/565538 (https://phabricator.wikimedia.org/T242321) [11:39:24] (03CR) 10Vgutierrez: [C: 03+2] smokeping: Serve traffic directly and using TLS [puppet] - 10https://gerrit.wikimedia.org/r/564046 (https://phabricator.wikimedia.org/T238900) (owner: 10Vgutierrez) [11:42:19] (03CR) 10Muehlenhoff: [C: 03+1] cumin: Provide aliases for ncredir@{esams,ulsfo,eqsin} [puppet] - 10https://gerrit.wikimedia.org/r/565538 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [11:44:09] (03CR) 10Vgutierrez: [C: 03+2] cumin: Provide aliases for ncredir@{esams,ulsfo,eqsin} [puppet] - 10https://gerrit.wikimedia.org/r/565538 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [11:44:18] (03CR) 10Ayounsi: [C: 03+1] Serve smokeping.wm.o directly from netmon1002 [dns] - 10https://gerrit.wikimedia.org/r/564045 (https://phabricator.wikimedia.org/T238900) (owner: 10Vgutierrez) [11:45:00] (03CR) 10Vgutierrez: [C: 03+2] Serve smokeping.wm.o directly from netmon1002 [dns] - 10https://gerrit.wikimedia.org/r/564045 (https://phabricator.wikimedia.org/T238900) (owner: 10Vgutierrez) [11:45:04] (03PS2) 10Vgutierrez: Serve smokeping.wm.o directly from netmon1002 [dns] - 10https://gerrit.wikimedia.org/r/564045 (https://phabricator.wikimedia.org/T238900) [11:48:39] !log upgrading PHP 7.2 on netmon* (also apache restart for SSL update) [11:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:03] 10Operations, 10netops: mr1-esams RMA (2020 edition) - https://phabricator.wikimedia.org/T242097 (10ayounsi) JTAC recommends to upgrade to the current Junos recommended, 18.2R3-S2.9. I copied it over and validated it: ` ayounsi@mr1-esams> request system software validate /var/tmp/junos-srxsme-18.2R3-S2.9.tgz... 
[11:59:40] (03PS4) 10Muehlenhoff: librenms: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/562226 [12:00:01] (03PS1) 10Joal: Clean labstore fetcher after hdfs-rsync move [puppet] - 10https://gerrit.wikimedia.org/r/565545 [12:04:07] (03CR) 10Muehlenhoff: [C: 03+2] librenms: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/562226 (owner: 10Muehlenhoff) [12:07:58] (03CR) 10Alexandros Kosiaris: "Tbh, I am not even sure this is going to be used in the end. It's deployed in production but is currently failing. There's https://phabric" [deployment-charts] - 10https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910) (owner: 10Jeena Huneidi) [12:12:12] (03CR) 10Mvolz: "I actually don't have the permissions necessary to self-merge to this repository- is that a right I should maybe get?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/565261 (owner: 10Mvolz) [12:12:25] 10Operations, 10netops, 10observability: Provision plaintext syslog collectors in esams/ulsfo/eqsin - https://phabricator.wikimedia.org/T243065 (10MoritzMuehlenhoff) p:05Triage→03Normal [12:12:44] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10faidon) What is the status of this? [12:13:26] 10Operations, 10observability: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10MoritzMuehlenhoff) p:05Triage→03Normal [12:13:41] 10Operations, 10MediaWiki-extensions-CodeReview: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10MoritzMuehlenhoff) p:05Triage→03Normal [12:14:53] 10Operations, 10MediaWiki-extensions-CodeReview: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10MoritzMuehlenhoff) We have role::webserver_misc_static (bromine/vega) for this. 
[12:16:10] (03PS1) 10Pikne: Add wordmark for etwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565549 (https://phabricator.wikimedia.org/T230379) [12:16:17] (03PS1) 10Vgutierrez: Release 2.0.91-2wm [software/varnish/libvmod-tbf] (debian) - 10https://gerrit.wikimedia.org/r/565550 (https://phabricator.wikimedia.org/T242093) [12:17:55] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "RBAC within etcd in the v2 store slows down etcd up to 10x, we can't afford that." [puppet] - 10https://gerrit.wikimedia.org/r/561818 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [12:18:34] (03CR) 10Giuseppe Lavagetto: [C: 03+1] etcd: enable ssl validation [puppet] - 10https://gerrit.wikimedia.org/r/561817 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [12:19:34] (03CR) 10Giuseppe Lavagetto: "As I explained, this is not feasible at the moment." [puppet] - 10https://gerrit.wikimedia.org/r/561819 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [12:20:18] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting +2 rights for Mvolz for operations/deployment-charts - https://phabricator.wikimedia.org/T243070 (10Mvolz) [12:21:03] 10Operations, 10Citoid, 10SRE-Access-Requests: Revoke access Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T242427 (10Mvolz) [12:21:06] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10Mvolz) [12:39:15] (03PS2) 10Vgutierrez: Release 2.0.91-2wm [software/varnish/libvmod-tbf] (debian) - 10https://gerrit.wikimedia.org/r/565550 (https://phabricator.wikimedia.org/T242093) [12:42:35] RECOVERY - Memory correctable errors -EDAC- on mw1239 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [12:48:29] PROBLEM - 
Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.157e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [12:49:52] Hi, I have a UBN which might require a Friday deployment: T242437 [12:49:52] T242437: Unable to view some articles due to fatal LogicException: Cannot roll back missing named ref "" (from ReferenceStack.php) - https://phabricator.wikimedia.org/T242437 [12:50:28] It might be crashing the RefreshLinksJob for zhwiki, I'm not sure if I'm reading the logs correctly. [12:50:32] (03PS1) 10Arturo Borrero Gonzalez: dynamicproxy: urlproxy: introduce support for domain-based routing [puppet] - 10https://gerrit.wikimedia.org/r/565556 (https://phabricator.wikimedia.org/T234617) [12:50:41] *it = the bug documented in that task [12:51:42] (03PS3) 10Vgutierrez: Release 2.0.91-2wm [software/varnish/libvmod-tbf] (debian) - 10https://gerrit.wikimedia.org/r/565550 (https://phabricator.wikimedia.org/T242093) [12:52:56] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10serviceops-radar, and 2 others: Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10WDoranWMF) To whomever needs the comment, as Hugh Nowlan's manager I approve his being approved for shell access, provided it is approved by t... [12:53:43] awight: I don't mind editing a user page with my staff account if you need it to [12:53:46] I don't understand the last comment on the task about being unable to edit the page because it's [12:53:49] ie for a technical reason [12:53:50] in the user namespace [12:54:14] does zh wiki have restrictions about that? [12:54:18] apergos: I think it's an AbuseFilter configuration, but I'm prevented from editing. [12:54:23] Looks like it... 
[12:54:29] I see [12:54:35] +1 for Reedy to try it then [12:54:35] (03PS4) 10Vgutierrez: Release 2.0.91-2wm [software/varnish/libvmod-tbf] (debian) - 10https://gerrit.wikimedia.org/r/565550 (https://phabricator.wikimedia.org/T242093) [12:54:41] Tell me what needs changing :) [12:54:54] Reedy: okay thank you, that's probably the nicer way to go. I'm slightly worried that it could happen again over the weekend, but this at least will unblock the job. [12:55:08] I'm not against a friday deploy if you want to try and fix it [12:55:15] But if we can also fix the crap onwiki... [12:55:33] Reedy: This empty ref should be removed: {{cite web|url=|title=|publisher=|author=|date=|accessdate=2018-}} [12:55:44] know which are all the pages that could trigger it, and which ones someone might edit over the weekend... not sure how to figure that [12:55:45] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [12:55:50] !log Upgrade netmon* to php 7.2.26 and restart - T241222 [12:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:52] Here's my crappy, machine translation of a reasonable commit message: 刪除空引用,導致錯誤。 [12:55:53] T241222: Update Wikimedia production to PHP 7.2.26 - https://phabricator.wikimedia.org/T241222 [12:56:14] put the reason in parens in en too [12:56:23] there will be en-3,4 people there [12:56:55] apergos: Exactly, it's totally random whether a page with this specific edge case expires from the cache and is reparsed. [12:57:15] apergos: Good idea about the zh/en. 刪除空引用,導致錯誤。(Remove empty reference which was causing an error.) 
[12:57:53] Reedy: Successfully saving the page is enough to demonstrate that the problem is solved, BTW. [12:58:40] The publish button was disabled... Had to preview then save [12:58:40] https://zh.wikipedia.org/w/index.php?title=User%3AAtana_goodwin%2FWork14&type=revision&diff=57736642&oldid=49425248 [12:59:44] now to make the decision: have someone lurk and watch logstash over the weekend in case we get more of these? [13:00:09] or friday deploy and babysit? [13:01:52] My initial preference was to monitor logs myself, but I'm starting to think the deployment might be less disruptive overall. OTOH, it's a really low error rate and number of pages affected. [13:02:14] I wouldn't say the change is particularly concerning either [13:02:19] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Cite/+/565517/3/src/Cite.php [13:02:20] how many pages since the patch went live? [13:02:27] It's not like it's changing loads of lines [13:03:27] apergos: Maybe 5 or so. The most common impact has been that page consistently failing to render, but the zhwiki RefreshLinksJob blockage is more concerning. The same thing could happen with any wiki. [13:03:43] meh [13:03:43] And we know users do occasionally do stupid stuff [13:04:11] Reedy: This is the other half of the fix, https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Cite/+/564558/ [13:04:46] That doesn't concern me either [13:04:53] Most of the changes concerned are tests :) [13:05:07] kk, I'll post a minimal backport in case we decide to deploy. [13:07:29] (03PS2) 10Arturo Borrero Gonzalez: dynamicproxy: urlproxy: introduce support for domain-based routing [puppet] - 10https://gerrit.wikimedia.org/r/565556 (https://phabricator.wikimedia.org/T234617) [13:10:02] Here's the backport, I'm waiting for tests now. 
https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Cite/+/565562/ [13:10:44] The only annoying thing is potentially running full scap [13:10:50] Or whether we just ignore the possibly broken message :P [13:11:42] (03PS3) 10Arturo Borrero Gonzalez: dynamicproxy: urlproxy: introduce support for domain-based routing [puppet] - 10https://gerrit.wikimedia.org/r/565556 (https://phabricator.wikimedia.org/T234617) [13:11:48] Reedy: Good call, let me just kludge the message by reusing another for the weekend. [13:13:40] 10Operations, 10ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10Jgreen) >>! In T237582#5802949, @Cmjohnson wrote: > @Jgreen I have the batteries...when can we schedule to do this? Long shot, but we have a planned 1-hour maintenance window on 9AM PST on... [13:17:10] (03PS1) 10Muehlenhoff: profile::java::analytics: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/565567 [13:17:23] Okay, it's mis-reusing an existing error. [13:19:02] Wait for jerkins then and see what's what [13:19:25] Reedy: I still have the car keys, happy to do the deployment myself... 
[13:19:35] Sure :) [13:19:49] I was more meaning just waiting for that to happen before deciding how to proceed ;) [13:20:30] Reedy: sure, and thank you for clarifying :-) [13:23:47] (03CR) 10Volans: "post-merge optional comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/565538 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [13:24:07] ho hum watching the grass grow [13:28:31] (03PS2) 10Muehlenhoff: jenkins: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563478 [13:32:42] (03PS4) 10Arturo Borrero Gonzalez: dynamicproxy: urlproxy: introduce support for domain-based routing [puppet] - 10https://gerrit.wikimedia.org/r/565556 (https://phabricator.wikimedia.org/T234617) [13:33:29] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) [13:34:34] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/20415/" [puppet] - 10https://gerrit.wikimedia.org/r/563478 (owner: 10Muehlenhoff) [13:35:25] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:44] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:50] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:35:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:18] jenkins is done [13:38:42] !log masking squid3 on old URL downloaders T224551 [13:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:45] T224551: Migrate URL downloaders to Buster - https://phabricator.wikimedia.org/T224551 [13:40:40] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) [13:43:33] Reedy: Here's a graph of the error frequency, https://logstash.wikimedia.org/goto/90f943b7b94cbb92349838dc4b63d2c4 . I'm interested in your perspective about whether it makes sense to backport. [13:44:19] I wonder how many of the errors are from us browsing to the page... :D [13:44:37] Over the last 7 days, it increases to nearly 500 [13:44:48] With some on other wikis.. [13:45:28] I think I've fixed 4 pages, fwiw. [13:45:43] Timo has fixed some apparently [13:45:43] https://uk.wikipedia.org/w/index.php?title=Tinder&action=history [13:46:17] And apparently Tinder on ukwiki gets quite a few hits [13:46:45] Whew, that could have contributed to social unrest! [13:46:48] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) [13:47:11] You've made the backport, seems worthy to just deploy it [13:48:36] Reedy: That's sane. Going for it. 
[13:54:09] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) [13:59:12] (03PS2) 10Filippo Giunchedi: WIP: remove json_lines tcp [puppet] - 10https://gerrit.wikimedia.org/r/564866 [13:59:14] (03PS1) 10Filippo Giunchedi: hieradata: turn down logstash tcp json_lines endpoint [puppet] - 10https://gerrit.wikimedia.org/r/565573 (https://phabricator.wikimedia.org/T213899) [14:00:57] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) [14:02:30] Merged, here goes. [14:03:09] !log beginning Friday deployment for UBN, T242437 [14:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:15] T242437: Unable to view some articles due to fatal LogicException: Cannot roll back missing named ref "" (from ReferenceStack.php) - https://phabricator.wikimedia.org/T242437 [14:05:23] (03PS1) 10Giuseppe Lavagetto: deployment-prep: do not use conftool [puppet] - 10https://gerrit.wikimedia.org/r/565574 [14:06:23] mwdebug is happy. 
[14:06:57] \o/ [14:08:59] (03PS1) 10Arturo Borrero Gonzalez: kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) [14:09:12] !log awight@deploy1001 Synchronized php-1.35.0-wmf.15/extensions/Cite: UBN backport: [[gerrit:565562|Fix for nested #tag:references and empty name (T242437)]] (duration: 00m 57s) [14:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:15] T242437: Unable to view some articles due to fatal LogicException: Cannot roll back missing named ref "" (from ReferenceStack.php) - https://phabricator.wikimedia.org/T242437 [14:09:36] (03CR) 10jerkins-bot: [V: 04-1] kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [14:10:15] (03PS2) 10Elukey: Clean labstore fetcher after hdfs-rsync move [puppet] - 10https://gerrit.wikimedia.org/r/565545 (owner: 10Joal) [14:12:05] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 29422184 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:12:20] (03CR) 10Elukey: [C: 03+2] "Fixed a typo in a comment, pcc looks good https://puppet-compiler.wmflabs.org/compiler1003/20419/labstore1006.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/565545 (owner: 10Joal) [14:13:41] Reedy: apergos: Looks like the error has dropped to zero after deployment, and nothing else unusual is happening. 
Thanks for the technical and emotional support :-) [14:13:57] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 370608 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:14:00] I'm looking at it in logstash, so far so good [14:14:19] let's hope we don't see anything else crop up ('shouldn't' but you know how it is) [14:14:43] I'll keep a "new-errors" window open for the evening... [14:15:51] sounds good, better than keeping a logstash window open all weekend [14:16:22] :-D I see you've done this before [14:17:42] well I've definitely done the 'watch something I fixed on Friday, over the weekend' thing before :-D [14:19:32] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients: add python3 version [puppet] - 10https://gerrit.wikimedia.org/r/565456 (https://phabricator.wikimedia.org/T229920) (owner: 10Andrew Bogott) [14:20:38] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Papaul) 05Open→03Resolved a:03Papaul Thanks @MoritzMuehlenhoff. 
resolving this task [14:20:41] 10Operations, 10ops-codfw, 10DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Papaul) [14:21:57] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Papaul) a:05Papaul→03Jgreen @Jgreen no good update from Dell [14:27:59] (03PS6) 10Filippo Giunchedi: varnish: use journald for varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/563977 (https://phabricator.wikimedia.org/T227108) [14:31:54] !log bootstrapping restbase2022-c — T243000 [14:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:57] T243000: Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 [14:32:52] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Eevans) [14:43:32] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Jgreen) >>! In T242265#5812899, @Papaul wrote: > @Jgreen no good update from Dell Thanks for looking into it. No problem, I was able to use the fix the SRE team... [14:46:54] (03PS1) 10Vgutierrez: varnishkafka (1.0.13-2) buster-wikimedia; urgency=medium [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/565581 (https://phabricator.wikimedia.org/T242093) [14:47:49] (03CR) 10jerkins-bot: [V: 04-1] varnishkafka (1.0.13-2) buster-wikimedia; urgency=medium [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/565581 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [14:50:57] (03CR) 10Ema: [C: 03+1] "Can we also remove the logstash related settings from profile::cache::base at this point? 
After a quick glance it looks like the units mod" [puppet] - 10https://gerrit.wikimedia.org/r/563977 (https://phabricator.wikimedia.org/T227108) (owner: 10Filippo Giunchedi) [14:52:36] (03CR) 10Elukey: varnishkafka (1.0.13-2) buster-wikimedia; urgency=medium (031 comment) [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/565581 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [15:07:21] 10Operations, 10Research, 10Traffic, 10Patch-For-Review: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10BBlack) [15:07:57] (03PS1) 10Ema: cache: move varnish::instance to profile [puppet] - 10https://gerrit.wikimedia.org/r/565602 (https://phabricator.wikimedia.org/T241239) [15:08:07] 10Operations, 10Research, 10Traffic, 10Patch-For-Review: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10BBlack) Most of this has been configured now, the remaining slightly difficult bit is configuring an alternate SNI cert for the domain on our new a... 
[15:09:01] (03CR) 10jerkins-bot: [V: 04-1] cache: move varnish::instance to profile [puppet] - 10https://gerrit.wikimedia.org/r/565602 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [15:16:46] 10Operations, 10netops: Upgrade routers - https://phabricator.wikimedia.org/T243080 (10ayounsi) p:05Triage→03Low [15:18:29] (03PS2) 10Ema: cache: move varnish::instance to profile [puppet] - 10https://gerrit.wikimedia.org/r/565602 (https://phabricator.wikimedia.org/T241239) [15:20:31] (03CR) 10jerkins-bot: [V: 04-1] cache: move varnish::instance to profile [puppet] - 10https://gerrit.wikimedia.org/r/565602 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [15:21:20] (03CR) 10Filippo Giunchedi: "> Patch Set 6: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/563977 (https://phabricator.wikimedia.org/T227108) (owner: 10Filippo Giunchedi) [15:21:55] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:52] 10Operations, 10Citoid, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Requesting +2 rights for Mvolz for operations/deployment-charts - https://phabricator.wikimedia.org/T243070 (10MarcoAurelio) [15:23:55] 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10Daimona) Belatedly, I confirm that I can join the channel. Thanks! [15:24:05] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 55 probes of 513 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [15:25:13] (03PS1) 10Gehel: maps: increase warning limit on postgres replication check [puppet] - 10https://gerrit.wikimedia.org/r/565612 [15:25:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks great! 
(To the extent partman recipes can look great :-)" [puppet] - 10https://gerrit.wikimedia.org/r/564959 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:25:56] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Jgreen) [15:26:26] (03CR) 10jerkins-bot: [V: 04-1] maps: increase warning limit on postgres replication check [puppet] - 10https://gerrit.wikimedia.org/r/565612 (owner: 10Gehel) [15:26:28] (03CR) 10Bstorm: Report error messages on stderr (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496565 (owner: 10BryanDavis) [15:28:36] (03PS2) 10Gehel: maps: increase warning limit on postgres replication check [puppet] - 10https://gerrit.wikimedia.org/r/565612 [15:29:54] (03PS1) 10Cparle: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565614 (https://phabricator.wikimedia.org/T241242) [15:31:35] (03CR) 10DCausse: maps: increase warning limit on postgres replication check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/565612 (owner: 10Gehel) [15:32:59] (03CR) 10Gehel: maps: increase warning limit on postgres replication check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/565612 (owner: 10Gehel) [15:33:05] (03CR) 10DCausse: [C: 03+1] maps: increase warning limit on postgres replication check [puppet] - 10https://gerrit.wikimedia.org/r/565612 (owner: 10Gehel) [15:33:07] (03CR) 10Gehel: [C: 03+2] maps: increase warning limit on postgres replication check [puppet] - 10https://gerrit.wikimedia.org/r/565612 (owner: 10Gehel) [15:33:51] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.14 ms [15:35:39] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 38 probes of 513 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [15:37:25] (03PS2) 10Cparle: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565614 (https://phabricator.wikimedia.org/T241242) [15:37:37] (03CR) 10Muehlenhoff: [C: 03+1] install_server: introduce raid0 standard partman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564959 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:41:39] (03PS1) 10Cparle: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565615 (https://phabricator.wikimedia.org/T241072) [15:42:55] (03CR) 10jerkins-bot: [V: 04-1] Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565615 (https://phabricator.wikimedia.org/T241072) (owner: 10Cparle) [15:43:56] 10Operations, 10DC-Ops, 10decommission: decommission alnitak.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242990 (10Jgreen) [15:44:11] 10Operations, 10DC-Ops, 10decommission: decommission alnitak.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242990 (10Jgreen) a:03Papaul [15:45:28] (03PS3) 10Ema: cache: move varnish::instance to profile [puppet] - 10https://gerrit.wikimedia.org/r/565602 (https://phabricator.wikimedia.org/T241239) [15:47:22] (03CR) 10jerkins-bot: [V: 04-1] cache: move varnish::instance to profile [puppet] - 10https://gerrit.wikimedia.org/r/565602 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [15:47:56] (03PS2) 10Cparle: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565615 (https://phabricator.wikimedia.org/T241072) [15:48:19] (03PS2) 10Vgutierrez: varnishkafka (1.0.13-2) buster-wikimedia; urgency=medium [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/565581 (https://phabricator.wikimedia.org/T242093) 
[15:50:31] 10Operations, 10DBA: Make Percona 8 report the basics on tendril - https://phabricator.wikimedia.org/T243081 (10Marostegui) [15:50:37] 10Operations, 10DBA: Make Percona 8 report the basics on tendril - https://phabricator.wikimedia.org/T243081 (10Marostegui) 05Open→03Resolved [15:50:43] 10Operations, 10DBA, 10MediaWiki-General: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (10Marostegui) [15:51:43] (03PS2) 10Pikne: Add wordmark for etwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565549 (https://phabricator.wikimedia.org/T230379) [15:55:19] 10Operations, 10Domains, 10Privacy Engineering, 10Security-Team, and 2 others: Domain / Subdomain for Wikimania Scholarship Public Form on CRM - https://phabricator.wikimedia.org/T243032 (10JFishback_WMF) @soworu Did #wmf-legal or #security-team review the underlying vendor agreement or system? This is the... [16:00:29] (03PS1) 10Muehlenhoff: elasticsearch::curator: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/565617 [16:03:46] (03PS1) 10Arturo Borrero Gonzalez: .gitignore: ignore nano lock file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565619 [16:05:34] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Jgreen) 05Open→03Resolved a:03Jgreen Closing this task because as it was defined it has been completed and we decided to stay the course throug... 
[16:06:46] 10Operations, 10netops: Upgrade routers - https://phabricator.wikimedia.org/T243080 (10ayounsi) [16:11:32] 10Operations, 10Citoid, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Requesting +2 rights for Mvolz for operations/deployment-charts - https://phabricator.wikimedia.org/T243070 (10akosiaris) 05Open→03Resolved a:03akosiaris Done in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deplo... [16:11:35] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10akosiaris) [16:11:52] (03PS1) 10Vgutierrez: Explicitly include stdint.h [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/565620 [16:15:30] (03PS2) 10Arturo Borrero Gonzalez: kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) [16:15:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] .gitignore: ignore nano lock file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565619 (owner: 10Arturo Borrero Gonzalez) [16:15:37] (03CR) 10Elukey: [V: 03+2 C: 03+2] "Let's also backport this to the varnish51 branch!" 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/565620 (owner: 10Vgutierrez) [16:16:38] (03PS1) 10Vgutierrez: ATS: Deploy wikiworkshop TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/565625 (https://phabricator.wikimedia.org/T242374) [16:17:25] (03CR) 10jerkins-bot: [V: 04-1] ATS: Deploy wikiworkshop TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/565625 (https://phabricator.wikimedia.org/T242374) (owner: 10Vgutierrez) [16:18:32] (03PS2) 10Vgutierrez: ATS: Deploy wikiworkshop TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/565625 (https://phabricator.wikimedia.org/T242374) [16:18:49] (03CR) 10jerkins-bot: [V: 04-1] kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [16:20:26] (03PS3) 10Arturo Borrero Gonzalez: kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) [16:21:38] (03CR) 10jerkins-bot: [V: 04-1] kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [16:23:14] (03PS4) 10Arturo Borrero Gonzalez: kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) [16:23:26] 10Operations, 10observability, 10User-fgiunchedi, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10bd808) [16:25:36] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@938d253]: Move weekly elasticsearch transfer to airflow [16:25:38] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:45] (03PS3) 10Vgutierrez: ATS: Deploy wikiworkshop TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/565625 (https://phabricator.wikimedia.org/T242374) [16:25:50] (03CR) 10Alexandros Kosiaris: "I know it's a WIP but here's a first round of comments" (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [16:25:58] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@938d253]: Move weekly elasticsearch transfer to airflow (duration: 00m 21s) [16:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:26] (03PS4) 10Ema: cache: move varnish::instance into profile [puppet] - 10https://gerrit.wikimedia.org/r/565602 (https://phabricator.wikimedia.org/T241239) [16:27:57] (03CR) 10jerkins-bot: [V: 04-1] ATS: Deploy wikiworkshop TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/565625 (https://phabricator.wikimedia.org/T242374) (owner: 10Vgutierrez) [16:28:30] (03CR) 10jerkins-bot: [V: 04-1] cache: move varnish::instance into profile [puppet] - 10https://gerrit.wikimedia.org/r/565602 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [16:28:37] (03PS4) 10Vgutierrez: ATS: Deploy wikiworkshop TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/565625 (https://phabricator.wikimedia.org/T242374) [16:30:20] jerkins is obviously biased against traffic team members [16:30:21] :( [16:30:36] (03CR) 10jerkins-bot: [V: 04-1] ATS: Deploy wikiworkshop TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/565625 (https://phabricator.wikimedia.org/T242374) (owner: 10Vgutierrez) [16:31:18] vgutierrez: He doesn't want you deploying on fridays [16:31:21] * Reedy hides [16:31:36] TBH I don't want to deploy that today [16:31:48] heh [16:32:19] (03PS1) 10Marostegui: tendril-host-add.sh: Add comment about MySQL 5.7+ [software/tendril] 
- 10https://gerrit.wikimedia.org/r/565627 (https://phabricator.wikimedia.org/T24308) [16:32:22] (03PS5) 10Vgutierrez: ATS: Deploy wikiworkshop TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/565625 (https://phabricator.wikimedia.org/T242374) [16:33:12] !log Stop replication on db1107 [16:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:18] ok jerkins you know what? I'm out of here. I hope you're happy now! [16:33:23] * ema leaves [16:33:27] lol [16:36:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "nice! Here's a first round of comments." (039 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [16:37:25] (03PS6) 10Vgutierrez: ATS: Deploy wikiworkshop TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/565625 (https://phabricator.wikimedia.org/T242374) [16:39:14] (03CR) 10Marostegui: [V: 03+2 C: 03+2] tendril-host-add.sh: Add comment about MySQL 5.7+ [software/tendril] - 10https://gerrit.wikimedia.org/r/565627 (https://phabricator.wikimedia.org/T24308) (owner: 10Marostegui) [16:41:22] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) So we have a google sheet with all of the power cords, and they do NOT match up. Example: cp3054 is plugged in to tower 1 on port 2 and into tower 2 on port 7. I don't like that none of t... 
[16:41:51] (03PS1) 10Vgutierrez: Explicitly include stdint.h [software/varnish/varnishkafka] (varnishv51) - 10https://gerrit.wikimedia.org/r/565628 [16:42:57] (03CR) 10Vgutierrez: "pcc looks happy and shows the expected DIFF on a text node, and a NOOP on an upload node: https://puppet-compiler.wmflabs.org/compiler1002" [puppet] - 10https://gerrit.wikimedia.org/r/565625 (https://phabricator.wikimedia.org/T242374) (owner: 10Vgutierrez) [16:51:16] (03CR) 10Elukey: [V: 03+2 C: 03+2] Explicitly include stdint.h [software/varnish/varnishkafka] (varnishv51) - 10https://gerrit.wikimedia.org/r/565628 (owner: 10Vgutierrez) [16:57:43] (03CR) 10BBlack: [C: 03+1] ATS: Deploy wikiworkshop TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/565625 (https://phabricator.wikimedia.org/T242374) (owner: 10Vgutierrez) [17:07:27] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission alnitak.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242990 (10Papaul) [17:07:57] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission frdb2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242983 (10Papaul) [17:09:47] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10faidon) Could we import into Netbox now, and then change & document the setup at our convenience? It feels like documenting the existing situation and changing it are orthogonal to each other - an... 
[17:13:00] (03PS1) 10EBernhardson: airflow: Remove spurious "<" from systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/565629 [17:13:34] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Eevans) [17:13:39] (03CR) 10jerkins-bot: [V: 04-1] airflow: Remove spurious "<" from systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/565629 (owner: 10EBernhardson) [17:14:09] (03CR) 10EBernhardson: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/565629 (owner: 10EBernhardson) [17:14:11] (03CR) 10CDanis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/565629 (owner: 10EBernhardson) [17:14:59] !log bootstrapping restbase2023-a — T243000 [17:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:03] T243000: Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 [17:15:06] 10Operations, 10ops-esams: esams: normalize the power outlet assignments - https://phabricator.wikimedia.org/T243088 (10RobH) p:05Triage→03Normal [17:17:22] 10Operations, 10ops-esams: esams: normalize the power outlet assignments - https://phabricator.wikimedia.org/T243088 (10RobH) a:03wiki_willy I'm not sure if we want to have someone from SRE go handle this, or if I should just dispatch directions to ESAMS/Iron Mountain remote hands to do so? This is pretty s... [17:17:49] (03CR) 10CDanis: [C: 03+2] airflow: Remove spurious "<" from systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/565629 (owner: 10EBernhardson) [17:18:15] cdanis: thanks! [17:18:32] np! puppet-merging now [17:20:13] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) >>! In T184066#5813230, @faidon wrote: > Could we import into Netbox now, and then change & document the setup at our convenience? 
It feels like documenting the existing situation and changi... [17:20:56] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) Agreed that power cord swapping has a risk, and the task T243088 outlines that we want #traffic and/or #dc-ops around when this work is done! [17:26:44] awight: Thanks, I can see that the old rev now renders as well :) https://uk.wikipedia.org/w/index.php?title=Tinder&oldid=26116722 [17:28:03] btw, I wonder how common it is/was to have articles with an empty string as name. Perhaps as mistake, but still, I wonder if that could be supported? It is a valid array key after all. Especially if it worked prior to the refactor. [17:34:24] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) Update: I'm going to clean up and import what we have into netbox as part of the PDU setup task T184066; once imported I'll resolve T184066. Then T243088 will be set to lowest priority for... [17:34:27] 10Operations, 10ops-esams: esams: normalize the power outlet assignments - https://phabricator.wikimedia.org/T243088 (10RobH) Update: I'm going to clean up and import what we have into netbox as part of the PDU setup task T184066; once imported I'll resolve T184066. Then T243088 will be set to lowest priority... 
[17:44:17] (03PS1) 10EBernhardson: airflow: Provide runtime directory for skein [puppet] - 10https://gerrit.wikimedia.org/r/565646 [18:10:06] (03PS2) 10Giuseppe Lavagetto: conftool::safe_service_restarts: better support for non lvs servers [puppet] - 10https://gerrit.wikimedia.org/r/565574 [18:18:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [18:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:25] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [18:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:32] 10Operations, 10Gerrit, 10vm-requests, 10Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `gerrit-test.wikimedia.org` - gerrit-test.wikimedia.org (**FAIL**) - Down... [18:25:36] (03CR) 10Dzahn: [C: 03+2] codesearch: Open up port 3002 [puppet] - 10https://gerrit.wikimedia.org/r/565451 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [18:30:34] (03CR) 10Dzahn: [C: 03+2] codesearch: Work around bootstrapping problems [puppet] - 10https://gerrit.wikimedia.org/r/565508 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [18:30:40] (03PS3) 10Dzahn: codesearch: Work around bootstrapping problems [puppet] - 10https://gerrit.wikimedia.org/r/565508 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [18:42:25] 10Operations, 10ops-esams: esams: normalize the power outlet assignments - https://phabricator.wikimedia.org/T243088 (10wiki_willy) a:05wiki_willy→03RobH Assigning back over to @RobH after our conversation via IRC [18:44:33] (03CR) 10Dzahn: [C: 03+1] jenkins: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563478 (owner: 10Muehlenhoff) [18:45:05] (03CR) 10Dzahn: [C: 03+2] jenkins: Switch to apt::package_from_component [puppet] - 
10https://gerrit.wikimedia.org/r/563478 (owner: 10Muehlenhoff) [18:51:25] (03CR) 10Dzahn: "ran puppet on contint[12]001 and releases[12]001. puppet removes /etc/apt/sources.list.d/jenkins-thirdparty-ci.list and adds /etc/apt/sou" [puppet] - 10https://gerrit.wikimedia.org/r/563478 (owner: 10Muehlenhoff) [18:55:18] (03CR) 10Dzahn: "saper: re: tabs and vim. check out https://wikitech.wikimedia.org/wiki/Puppet_coding#tab_character_found_on_line_.." [puppet] - 10https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [19:07:45] (03PS1) 10Herron: assign kafka-main200[45] to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/565678 [19:13:59] RECOVERY - cassandra-a CQL 10.192.48.142:9042 on restbase2023 is OK: TCP OK - 0.036 second response time on 10.192.48.142 port 9042 https://phabricator.wikimedia.org/T93886 [19:14:31] !log gerrit - switching operations/debs/hhvm to READONLY mode and adding ARCHIVED to description (T237038) [19:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:35] T237038: Archive operations/debs/hhvm repository - https://phabricator.wikimedia.org/T237038 [19:20:49] (03PS1) 10Jforrester: Archive repo [debs/hhvm] - 10https://gerrit.wikimedia.org/r/565684 (https://phabricator.wikimedia.org/T237038) [19:22:51] (03CR) 10Dzahn: [C: 03+2] Archive repo [debs/hhvm] - 10https://gerrit.wikimedia.org/r/565684 (https://phabricator.wikimedia.org/T237038) (owner: 10Jforrester) [19:23:03] (03CR) 10Herron: [C: 03+2] assign kafka-main200[45] to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/565678 (owner: 10Herron) [19:43:10] (03PS1) 10Dzahn: admin::ori: remove hhvm related commands [puppet] - 10https://gerrit.wikimedia.org/r/565687 (https://phabricator.wikimedia.org/T235142) [19:49:48] (03PS1) 10Dzahn: deployment-prep: remove install_hhvm from Hiera for mediawiki-parsoid10 [puppet] - 10https://gerrit.wikimedia.org/r/565688 
(https://phabricator.wikimedia.org/T235142) [19:50:27] (03CR) 10Jforrester: [C: 03+1] admin::ori: remove hhvm related commands [puppet] - 10https://gerrit.wikimedia.org/r/565687 (https://phabricator.wikimedia.org/T235142) (owner: 10Dzahn) [19:50:35] (03CR) 10Jforrester: [C: 03+1] deployment-prep: remove install_hhvm from Hiera for mediawiki-parsoid10 [puppet] - 10https://gerrit.wikimedia.org/r/565688 (https://phabricator.wikimedia.org/T235142) (owner: 10Dzahn) [19:51:13] Note I just forwarded an email to ops-private@ regarding "eBGP-v4 with WIKIMEDIA-FOUNDATION over EQUINIX ASHBURN is down". Noting here in case it is urgent. [19:53:20] XioNoX: ^ [19:56:14] quiddity: thanks, they are supposed to use noc@ or maint-announce@ [19:58:03] (03PS2) 10Dzahn: define 2 API appservers per row in codfw as canary API appservers [puppet] - 10https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606) [19:58:09] (03CR) 10Dzahn: define 2 API appservers per row in codfw as canary API appservers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [20:01:57] !log reset bgp peerings with gfiber on cr2-eqiad [20:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:17] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/20433/deployment-mediawiki-parsoid10.deployment-prep.eqiad.wmflabs/" [puppet] - 10https://gerrit.wikimedia.org/r/565688 (https://phabricator.wikimedia.org/T235142) (owner: 10Dzahn) [20:05:17] handled + replied [20:06:19] (03PS2) 10Dzahn: deployment-prep: remove install_hhvm from Hiera for mediawiki-parsoid10 [puppet] - 10https://gerrit.wikimedia.org/r/565688 (https://phabricator.wikimedia.org/T235142) [20:07:10] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Eevans) [20:07:21] !log 
bootstrapping restbase2023-b — T243000 [20:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:24] T243000: Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 [20:13:56] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/565399 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [20:15:19] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 77944344 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:17:05] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 46512 and 33 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:21:20] (03PS3) 10Dzahn: add IPs for gerrit1002 in row C [dns] - 10https://gerrit.wikimedia.org/r/565399 (https://phabricator.wikimedia.org/T239151) [20:24:04] (03CR) 10Paladox: [C: 03+1] add IPs for gerrit1002 in row C [dns] - 10https://gerrit.wikimedia.org/r/565399 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [20:27:47] quiddity, bblack thanks! 
[20:29:03] (03PS4) 10Dzahn: add IPs for gerrit1002 in row C [dns] - 10https://gerrit.wikimedia.org/r/565399 (https://phabricator.wikimedia.org/T239151) [20:30:41] (03CR) 10BBlack: [C: 03+1] add IPs for gerrit1002 in row C [dns] - 10https://gerrit.wikimedia.org/r/565399 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [20:31:26] (03CR) 10Dzahn: [C: 03+2] add IPs for gerrit1002 in row C [dns] - 10https://gerrit.wikimedia.org/r/565399 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [20:31:31] (03PS5) 10Dzahn: add IPs for gerrit1002 in row C [dns] - 10https://gerrit.wikimedia.org/r/565399 (https://phabricator.wikimedia.org/T239151) [20:35:38] 10Operations, 10Gerrit, 10vm-requests, 10Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) IP situation fixed! server: gerrit1002.wikimedia.org has address 208.80.154.75 gerrit1002.wikimedia.org has IPv6 address 2620:0:861:3:208:80:154:75 service:... [20:40:55] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [20:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:43] 10Operations, 10Gerrit, 10vm-requests, 10Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) recreating VM as gerrit1002 so that we can use gerrit-test as service name: Creating new VM named gerrit1002.wikimedia.org in eqiad with row=C vcpu=1 memory=16... 
[20:59:36] (03CR) 10Ori.livneh: [C: 03+1] "Thanks for looking after this, Daniel" [puppet] - 10https://gerrit.wikimedia.org/r/565687 (https://phabricator.wikimedia.org/T235142) (owner: 10Dzahn) [21:01:08] (03CR) 10Dzahn: [C: 03+2] admin::ori: remove hhvm related commands [puppet] - 10https://gerrit.wikimedia.org/r/565687 (https://phabricator.wikimedia.org/T235142) (owner: 10Dzahn) [21:03:07] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Request for LDAP access to the WMF group for Rudolph Ampofo - https://phabricator.wikimedia.org/T243103 (10Jdforrester-WMF) [21:03:10] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Request for LDAP access to the WMF group for Rudolph Ampofo - https://phabricator.wikimedia.org/T243103 (10Reedy) > engagement and impact data (readership, editorial, quality data, etc) It sounds like you're requesting access to at least some of t... [21:06:56] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Request for LDAP access to the WMF group for Rudolph Ampofo - https://phabricator.wikimedia.org/T243103 (10Dzahn) Welcome @rudolph-san! Do you already have a user on https://wikitech.wikimedia.org ? Please register there if you haven't already a... [21:13:07] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Request for LDAP access to the WMF group for Rudolph Ampofo - https://phabricator.wikimedia.org/T243103 (10rudolph-san) Hi @Dzahn I already have a user on https://wikitech.wikimedia.org. The username I picked was Rampofo. 
Thanks
[21:17:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[21:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:52] (PS1) Dzahn: admin: add Rudolph Ampofo to ldap_only_admins (wmf) [puppet] - https://gerrit.wikimedia.org/r/565694 (https://phabricator.wikimedia.org/T243103)
[21:36:09] Operations, LDAP-Access-Requests, SRE-Access-Requests, Patch-For-Review: Request for LDAP access to the WMF group for Rudolph Ampofo - https://phabricator.wikimedia.org/T243103 (Iflorez) Hi all, I am a data analyst working on GLOW. Here's the [[ https://phabricator.wikimedia.org/project/view/428...
[21:44:50] Operations, LDAP-Access-Requests, SRE-Access-Requests, Patch-For-Review: Request for LDAP access to the WMF group for Rudolph Ampofo - https://phabricator.wikimedia.org/T243103 (rudolph-san)
[21:50:26] (PS1) Eevans: Configure remainder of testwikis group for kask-transition [mediawiki-config] - https://gerrit.wikimedia.org/r/565696 (https://phabricator.wikimedia.org/T243106)
[21:56:17] !log bootstrapping restbase2023-c — T243000
[21:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:20] T243000: Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000
[21:56:43] Operations, ops-codfw, Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (Eevans)
[22:02:53] Operations, ops-codfw, Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (Eevans)
[23:10:41] (CR) BryanDavis: dynamicproxy: urlproxy: introduce support for domain-based routing (2 comments) [puppet] - https://gerrit.wikimedia.org/r/565556 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[23:15:56] (CR) BryanDavis: kubernetes: add support for domain-based routing in the new kubernetes cluster (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[23:25:01] (CR) Bstorm: kubernetes: add support for domain-based routing in the new kubernetes cluster (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[23:33:45] (PS2) Cwhite: when configured to relay statsd traffic, send the raw []byte recieved toward the configured statsd endpoint [debs/prometheus-statsd-exporter] - https://gerrit.wikimedia.org/r/554544 (https://phabricator.wikimedia.org/T239833)
[23:35:18] (PS3) Cwhite: when configured to relay statsd traffic, send the raw []byte recieved toward the configured statsd endpoint [debs/prometheus-statsd-exporter] - https://gerrit.wikimedia.org/r/554544 (https://phabricator.wikimedia.org/T239833)
[23:40:57] Operations, LDAP-Access-Requests, SRE-Access-Requests, Patch-For-Review: Request for LDAP access to the WMF group for Rudolph Ampofo - https://phabricator.wikimedia.org/T243103 (Aklapper) >>! In T243103#5813834, @Iflorez wrote: > We were told that for Superset access he needed to create a ticket...
[23:41:44] (CR) Bstorm: kubernetes: add support for domain-based routing in the new kubernetes cluster (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[23:42:46] (PS1) Dzahn: phabricator: make deploy_user configurable and change it in cloud [puppet] - https://gerrit.wikimedia.org/r/565701
[23:43:51] Operations, observability, Patch-For-Review: StatsD Exporter drops relayed metrics - https://phabricator.wikimedia.org/T239833 (colewhite)
[23:44:20] (CR) Paladox: phabricator: make deploy_user configurable and change it in cloud (1 comment) [puppet] - https://gerrit.wikimedia.org/r/565701 (owner: Dzahn)
[23:45:59] (CR) Dzahn: phabricator: make deploy_user configurable and change it in cloud (1 comment) [puppet] - https://gerrit.wikimedia.org/r/565701 (owner: Dzahn)
[23:46:21] (PS2) Dzahn: phabricator: make deploy_user configurable and change it in cloud [puppet] - https://gerrit.wikimedia.org/r/565701
[23:50:05] (CR) Bstorm: "The change I'm proposing is:" [software/tools-webservice] - https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[23:51:56] Operations, ops-codfw, Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (Eevans)
[23:54:13] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/20435/phab1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/565701 (owner: Dzahn)
[23:55:44] (CR) Dzahn: [C: +2] phabricator: make deploy_user configurable and change it in cloud [puppet] - https://gerrit.wikimedia.org/r/565701 (owner: Dzahn)