[00:00:04] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200514T0000).
[00:05:32] !log updating phabricator. 1 patch + new translations. Expect only brief downtime.
[00:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:37] !log phabricator update appears to be stable.
[00:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:30] Operations, serviceops, Performance-Team (Radar): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Krinkle) a:aaron→None
[00:11:01] Operations, serviceops, Performance-Team (Radar), Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Krinkle)
[00:12:56] Operations, serviceops, Patch-For-Review, Performance-Team (Radar): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (Krinkle)
[00:16:53] Operations, serviceops, Patch-For-Review, Performance-Team (Radar): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (Krinkle) >>! In T244340#5853430, @jijiki wrote: > The idea is obviously sensible. I do have some c...
[00:30:25] (CR) Krinkle: [C: -1] "This updates the label, not the link :)" [puppet] - https://gerrit.wikimedia.org/r/596280 (https://phabricator.wikimedia.org/T243056) (owner: Legoktm)
[00:31:22] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[00:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:21] PROBLEM - WDQS high update lag on wdqs2006 is CRITICAL: 6635 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[00:37:05] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[00:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:25] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:54:49] !log change password for "Python eggs"
[00:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:22] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[01:29:38] Operations, Cloud-VPS, DNS, Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (Krenair)
[01:30:05] Operations, Cloud-VPS, DNS, Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (Krenair)
[01:32:55] !log depooled wdqs2006 while waiting for lag to recover
[01:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:39] RECOVERY - WDQS high update lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 899.6 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:55:04] !log wdqs2006 was repooled after successful test queries
[02:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:57:01] !log wdqs1004 was repooled after successful test queries
[02:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:57:50] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[02:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:02] !log wdqs1005 has been de-pooled pending wdqs data xfer
[02:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:21] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[02:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:47] Operations, Analytics, Analytics-Kanban, LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (Nuria) @Dzahn is there an additional step we do to verify employment?
[04:00:56] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:02:15] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[04:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:02:17] (PS1) Andrew Bogott: Add profile::wmcs::proxy::static [puppet] - https://gerrit.wikimedia.org/r/596328 (https://phabricator.wikimedia.org/T233995)
[04:03:17] (CR) jerkins-bot: [V: -1] Add profile::wmcs::proxy::static [puppet] - https://gerrit.wikimedia.org/r/596328 (https://phabricator.wikimedia.org/T233995) (owner: Andrew Bogott)
[04:04:44] Operations, Thumbor, Wikimedia-SVG-rendering, Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987 (AntiCompositeNumber)
[04:04:47] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:04:57] Operations, Thumbor, Wikimedia-SVG-rendering, Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987 (AntiCompositeNumber)
[04:04:59] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:06:40] Operations, Thumbor, Wikimedia-SVG-rendering, Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987 (AntiCompositeNumber)
[04:06:46] Operations, Commons, SRE-swift-storage, User-notice: Update rsvg on the image scalers to 2.40.16 (to solve several SVG rendering issues) - https://phabricator.wikimedia.org/T112421 (AntiCompositeNumber)
[04:07:02] (PS2) Andrew Bogott: Add profile::wmcs::proxy::static [puppet] - https://gerrit.wikimedia.org/r/596328 (https://phabricator.wikimedia.org/T233995)
[04:08:05] (CR) jerkins-bot: [V: -1] Add profile::wmcs::proxy::static [puppet] - https://gerrit.wikimedia.org/r/596328 (https://phabricator.wikimedia.org/T233995) (owner: Andrew Bogott)
[04:08:13] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:08:25] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:09:48] (PS3) Andrew Bogott: Add profile::wmcs::proxy::static [puppet] - https://gerrit.wikimedia.org/r/596328 (https://phabricator.wikimedia.org/T233995)
[04:10:35] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:10:42] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:13:17] (PS4) Andrew Bogott: Add profile::wmcs::proxy::static [puppet] - https://gerrit.wikimedia.org/r/596328 (https://phabricator.wikimedia.org/T233995)
[04:15:18] (CR) Andrew Bogott: [C: +2] Add profile::wmcs::proxy::static [puppet] - https://gerrit.wikimedia.org/r/596328 (https://phabricator.wikimedia.org/T233995) (owner: Andrew Bogott)
[04:17:42] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:17:54] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:22:21] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:29:23] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:31:13] Operations, Thumbor, Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (AntiCompositeNumber)
[04:31:16] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:31:23] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:31:25] Operations, Thumbor, Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (AntiCompositeNumber)
[04:35:25] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:35:32] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:38:36] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:38:45] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:43:59] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (people1002), Fresh: 95 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:46:39] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[04:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:47:57] Operations, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (AntiCompositeNumber)
[04:47:59] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:48:13] Operations, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (AntiCompositeNumber)
[04:48:15] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[04:49:20] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber) >>! In T193352#5984551, @Jdforrester-WMF wrote: > AFAICT this entire task stack is the wrong way around? We can't do this upgrade until T216815 is done, and all o...
[05:42:57] (CR) Elukey: [C: +2] Bump AQS druid snapshot to 2020_04 [puppet] - https://gerrit.wikimedia.org/r/596263 (owner: Joal)
[05:44:51] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber)
[05:57:29] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:57:49] Operations, Core Platform Team, MediaWiki-General, serviceops, Sustainability (Incident Prevention): Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (tstarling) There is Http::createMultiClient(), which respects $wgHTTPConne...
[06:05:21] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:06:39] this is me --^
[06:07:13] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:49] Operations, MediaWiki-Cache, Performance-Team (Radar), User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (elukey) Open→Resolved a:elukey This can be closed in my opinion, we have already wor...
[06:09:52] Operations, serviceops, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (elukey)
[06:12:08] Operations, serviceops, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) In a separate task, I mentioned the following: > on every mcXXXX we have ~25GB of free RAM (not even used by page cache) that we currently don't use....
[06:21:22] (CR) Elukey: [C: +2] cassandra: use openjdk-8 on Buster [puppet] - https://gerrit.wikimedia.org/r/596223 (owner: Elukey)
[06:29:05] PROBLEM - ores uWSGI web app on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:29:13] PROBLEM - Check size of conntrack table on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:29:21] PROBLEM - puppet last run on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:29:35] PROBLEM - Check systemd state on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:47] PROBLEM - Check size of conntrack table on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:30:05] PROBLEM - Check systemd state on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:05] PROBLEM - Check whether ferm is active by checking the default input chain on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:30:37] PROBLEM - ores uWSGI web app on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:31:19] PROBLEM - puppet last run on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:33:35] !log Pooled wdqs2005 following successful test queries
[06:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:55] PROBLEM - DPKG on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[06:35:15] PROBLEM - dhclient process on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[06:35:46] (CR) Filippo Giunchedi: [C: -1] "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/596239 (owner: Herron)
[06:36:05] PROBLEM - MD RAID on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:36:41] PROBLEM - dhclient process on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[06:36:47] PROBLEM - MD RAID on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:39:09] PROBLEM - Disk space on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1003&var-datasource=eqiad+prometheus/ops
[06:39:43] (PS1) Elukey: profile::kafka::broker: use openjdk-8 for Buster [puppet] - https://gerrit.wikimedia.org/r/596384
[06:40:23] RECOVERY - Check size of conntrack table on ores1002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:40:45] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:41:03] RECOVERY - puppet last run on ores1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:41:15] (PS2) Elukey: profile::kafka::broker: use openjdk-8 for Buster [puppet] - https://gerrit.wikimedia.org/r/596384
[06:43:11] PROBLEM - ores_workers_running on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES
[06:45:41] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 96 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:46:57] RECOVERY - MD RAID on ores1002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:48:53] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/22514/" [puppet] - https://gerrit.wikimedia.org/r/596384 (owner: Elukey)
[06:49:49] PROBLEM - IPMI Sensor Status on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[06:52:09] RECOVERY - Check size of conntrack table on ores1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:52:27] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:31] RECOVERY - ores_workers_running on ores1003 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[06:54:47] RECOVERY - puppet last run on ores1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:58:29] RECOVERY - MD RAID on ores1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:59:59] RECOVERY - Disk space on ores1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1003&var-datasource=eqiad+prometheus/ops
[07:00:57] RECOVERY - Check whether ferm is active by checking the default input chain on ores1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:01:55] (CR) Filippo Giunchedi: [C: +2] swift: clean up 'storage_policies' from swift::params [puppet] - https://gerrit.wikimedia.org/r/596155 (https://phabricator.wikimedia.org/T252537) (owner: Filippo Giunchedi)
[07:02:12] (CR) Filippo Giunchedi: [C: +2] swift: add ability to toggle WMF-specific filters [puppet] - https://gerrit.wikimedia.org/r/596177 (https://phabricator.wikimedia.org/T252537) (owner: Filippo Giunchedi)
[07:02:26] (CR) Filippo Giunchedi: [C: +2] swift: move out of swift::params::accounts [puppet] - https://gerrit.wikimedia.org/r/596217 (https://phabricator.wikimedia.org/T252537) (owner: Filippo Giunchedi)
[07:02:39] (PS4) Filippo Giunchedi: swift: move out of swift::params::accounts [puppet] - https://gerrit.wikimedia.org/r/596217 (https://phabricator.wikimedia.org/T252537)
[07:03:04] !log installing apt security updates
[07:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:47] RECOVERY - DPKG on ores1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:06:07] RECOVERY - dhclient process on ores1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[07:07:35] RECOVERY - dhclient process on ores1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[07:13:27] Operations, Analytics, vm-requests: Create a VM for matomo1002 (eqiad) - https://phabricator.wikimedia.org/T252742 (elukey)
[07:15:35] (PS2) Muehlenhoff: Add a Spicerack cook book to reboot hosts [cookbooks] - https://gerrit.wikimedia.org/r/596187
[07:17:53] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:20:39] RECOVERY - IPMI Sensor Status on ores1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[07:23:29] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:44] (PS1) Ayounsi: Cumin: add network devices support [puppet] - https://gerrit.wikimedia.org/r/596389
[07:29:31] (PS1) Filippo Giunchedi: profile: move analytics::cluster::secrets to profile::swift::accounts [puppet] - https://gerrit.wikimedia.org/r/596390 (https://phabricator.wikimedia.org/T252537)
[07:29:33] (PS1) Filippo Giunchedi: profile: move thumbor to profile::swift::accounts [puppet] - https://gerrit.wikimedia.org/r/596391 (https://phabricator.wikimedia.org/T252537)
[07:29:35] (PS1) Filippo Giunchedi: profile: move docker registry to profile::swift::accounts [puppet] - https://gerrit.wikimedia.org/r/596392 (https://phabricator.wikimedia.org/T252537)
[07:29:37] (PS1) Filippo Giunchedi: swift: remove accounts from swift::params [puppet] - https://gerrit.wikimedia.org/r/596393 (https://phabricator.wikimedia.org/T252537)
[07:32:03] PROBLEM - DPKG on logstash2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:34:05] PROBLEM - DPKG on db2104 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:34:52] ^ dpkg errors are me, will recover in a bit
[07:34:54] Operations, serviceops: Test effects of forcing numa locality for php-fpm - https://phabricator.wikimedia.org/T252743 (Joe)
[07:35:35] (CR) Dzahn: [C: +2] "i tested if mysql connection from scandium works with the same permissions and it does" [puppet] - https://gerrit.wikimedia.org/r/596293 (owner: Subramanya Sastry)
[07:45:19] (CR) Dzahn: "I had already renamed all the files from the dump to have the "r" prefix and uploaded them. So if you just wanted to save that work it's n" [puppet] - https://gerrit.wikimedia.org/r/596280 (https://phabricator.wikimedia.org/T243056) (owner: Legoktm)
[07:47:19] (CR) Dzahn: [C: +1] "the sudo::user { 'nagios_mailman_queue': can also move inside the "if". not a big deal though either way." [puppet] - https://gerrit.wikimedia.org/r/596259 (https://phabricator.wikimedia.org/T252615) (owner: Herron)
[07:47:21] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1003/22515/cumin1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/596389 (owner: Ayounsi)
[07:48:44] (PS2) JMeybohm: zotero: Add TLS termination [deployment-charts] - https://gerrit.wikimedia.org/r/595900 (https://phabricator.wikimedia.org/T235411)
[07:48:51] (CR) jerkins-bot: [V: -1] zotero: Add TLS termination [deployment-charts] - https://gerrit.wikimedia.org/r/595900 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm)
[07:50:07] PROBLEM - DPKG on elastic1066 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:51:43] PROBLEM - DPKG on wdqs1009 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:52:02] (PS6) JMeybohm: raw: Add apiVersion (helm lint), remove appVersion [deployment-charts] - https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428)
[07:53:45] RECOVERY - DPKG on logstash2005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:53:45] RECOVERY - DPKG on db2104 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:54:59] (PS7) JMeybohm: raw: Add apiVersion (helm lint) [deployment-charts] - https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428)
[07:55:41] PROBLEM - DPKG on an-worker1089 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:55:43] PROBLEM - DPKG on matomo1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:56:27] Operations, serviceops: Sandbox/limit child processes within a container runtime - https://phabricator.wikimedia.org/T252745 (Joe)
[07:57:06] (CR) Giuseppe Lavagetto: [C: +1] raw: Add apiVersion (helm lint) [deployment-charts] - https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428) (owner: JMeybohm)
[07:57:27] PROBLEM - DPKG on elastic2032 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:58:46] (CR) JMeybohm: [C: +2] raw: Add apiVersion (helm lint) [deployment-charts] - https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428) (owner: JMeybohm)
[07:59:08] (Merged) jenkins-bot: raw: Add apiVersion (helm lint) [deployment-charts] - https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428) (owner: JMeybohm)
[07:59:56] (PS2) Ema: ATS: cap TTL for cacheable 404 responses to 10 minutes [puppet] - https://gerrit.wikimedia.org/r/595877 (https://phabricator.wikimedia.org/T251537)
[08:00:16] !log upgrade ats to version 8.0.7-1wm7 on cp5011
[08:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:36] (PS3) JMeybohm: zotero: Add TLS termination [deployment-charts] - https://gerrit.wikimedia.org/r/595900 (https://phabricator.wikimedia.org/T235411)
[08:01:59] PROBLEM - DPKG on elastic1057 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:02:13] Operations: Generate ssh_known_hosts for network devices - https://phabricator.wikimedia.org/T252747 (ayounsi) p:Triage→Medium
[08:03:38] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/22521/" [puppet] - https://gerrit.wikimedia.org/r/596390 (https://phabricator.wikimedia.org/T252537) (owner: Filippo Giunchedi)
[08:05:20] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/22522/" [puppet] - https://gerrit.wikimedia.org/r/596391 (https://phabricator.wikimedia.org/T252537) (owner: Filippo Giunchedi)
[08:06:24] (CR) JMeybohm: [C: +2] zotero: Add TLS termination [deployment-charts] - https://gerrit.wikimedia.org/r/595900 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm)
[08:06:44] (Merged) jenkins-bot: zotero: Add TLS termination [deployment-charts] - https://gerrit.wikimedia.org/r/595900 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm)
[08:07:17] (PS3) Dzahn: static-codereview: do not allow directory listing for subdirs [puppet] - https://gerrit.wikimedia.org/r/596240 (https://phabricator.wikimedia.org/T243056)
[08:07:25] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/22523/" [puppet] - https://gerrit.wikimedia.org/r/596392 (https://phabricator.wikimedia.org/T252537) (owner: Filippo Giunchedi)
[08:09:31] PROBLEM - DPKG on graphite1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:10:26] (PS3) Ema: ATS: cap TTL for cacheable 404 responses to 10 minutes [puppet] - https://gerrit.wikimedia.org/r/595877 (https://phabricator.wikimedia.org/T251537)
[08:11:09] (CR) Filippo Giunchedi: "PCC (noops effectively) https://puppet-compiler.wmflabs.org/compiler1003/22524/" [puppet] - https://gerrit.wikimedia.org/r/596393 (https://phabricator.wikimedia.org/T252537) (owner: Filippo Giunchedi)
[08:12:15] Operations, serviceops: Test effects of forcing numa locality for php-fpm - https://phabricator.wikimedia.org/T252743 (Joe) p:Triage→Medium
[08:16:21] (PS1) Filippo Giunchedi: hieradata: remove obsolete esams swift [puppet] - https://gerrit.wikimedia.org/r/596394 (https://phabricator.wikimedia.org/T252537)
[08:20:03] !log jayme@deploy2001 helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[08:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:25] (CR) Giuseppe Lavagetto: [C: +1] profile: move docker registry to profile::swift::accounts [puppet] - https://gerrit.wikimedia.org/r/596392 (https://phabricator.wikimedia.org/T252537) (owner: Filippo Giunchedi)
[08:20:57] RECOVERY - DPKG on elastic1066 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:22:35] RECOVERY - DPKG on wdqs1009 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:26:33] RECOVERY - DPKG on an-worker1089 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:26:35] RECOVERY - DPKG on matomo1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:28:19] RECOVERY - DPKG on elastic2032 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:30:36] (CR) Gilles: "Yes, I think that it's fair to expect that a completely fresh NewFiles page with the default amount of thumbnails would always work, but o" [puppet] - https://gerrit.wikimedia.org/r/596149 (https://phabricator.wikimedia.org/T252426) (owner: Gilles)
[08:31:13] (CR) Gilles: [C: +1] profile: move thumbor to profile::swift::accounts [puppet] - https://gerrit.wikimedia.org/r/596391 (https://phabricator.wikimedia.org/T252537) (owner: Filippo Giunchedi)
[08:32:05] !log installing Java security updates on Hadoop/AQS/Druid
[08:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:49] RECOVERY - DPKG on elastic1057 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:35:11] Operations, DC-Ops, SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (Dzahn) @Jclark-ctr Could you paste your SSH config on a pastebin? (https://phabricator.wikimedia.org/paste/ f.e.)
[08:39:40] !log imported helm 2.16.7-1 to main for jessie-wikimedia
[08:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:18] Operations, Thumbor: Reverse proxy supporting XFF-based per-IP concurrency limit and request queueing - https://phabricator.wikimedia.org/T252749 (Gilles)
[08:40:23] RECOVERY - DPKG on graphite1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:40:48] Operations, Thumbor: Reverse proxy supporting XFF-based per-IP concurrency limit and request queueing - https://phabricator.wikimedia.org/T252749 (Gilles)
[08:41:45] Operations, Analytics, Analytics-Kanban, LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (Dzahn) @Nuria Yea, for now we can still check on the corporate LDAP (OIT) servers (though they might be shut down in th...
[08:42:08] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Dzahn) a:03Dzahn [08:43:06] (03PS1) 10Elukey: aptrepo: add matomo component [puppet] - 10https://gerrit.wikimedia.org/r/596396 (https://phabricator.wikimedia.org/T252741) [08:43:45] !log updated helm: 2.12.2-1 -> 2.16.7-1 on deploy[1,2]001 and contint1001. 2.12.2-4 -> 2.16.7-1 on contint2001 [08:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:05] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10Dzahn) Searching for a user on the ldap-corp servers is currently the way SREs can confirm whether a user is actually an... [08:49:56] (03PS1) 10Dzahn: admins: add Segun Oworu to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/596397 (https://phabricator.wikimedia.org/T252703) [08:51:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. Make sure to add the repo key after merging:" [puppet] - 10https://gerrit.wikimedia.org/r/596396 (https://phabricator.wikimedia.org/T252741) (owner: 10Elukey) [08:52:40] (03CR) 10Jbond: "Looking at the task it seems like the user will need shell access? Also still awaiting manager approval" [puppet] - 10https://gerrit.wikimedia.org/r/596298 (https://phabricator.wikimedia.org/T251523) (owner: 10Cwhite) [08:53:28] (03CR) 10Elukey: [C: 03+2] aptrepo: add matomo component [puppet] - 10https://gerrit.wikimedia.org/r/596396 (https://phabricator.wikimedia.org/T252741) (owner: 10Elukey) [08:54:11] godog: merged your changes in the labs-private repo [08:55:53] elukey: thanks! 
[08:56:04] I keep forgetting private-but-public wants merge now [08:56:25] !log installing Java security updates on Presto [08:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:10] !log imported gpg key 1FD752571FE36FF23F78F91B81E2E78B66FED89E in apt1001 (Matomo public debian repo) [08:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:40] (03CR) 10Jbond: "lgtm minor nit" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (owner: 10Muehlenhoff) [09:00:43] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:02:39] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 64 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:08:33] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 49 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:10:55] (03PS1) 10RhinosF1: Add 2 sites to copy upload domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596398 [09:11:33] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10jbond) Hi @HMarcus Thanks for configuring this however the application associated with this service account will also... 
[09:11:52] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: move thumbor to profile::swift::accounts [puppet] - 10https://gerrit.wikimedia.org/r/596391 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [09:12:01] (03PS2) 10Filippo Giunchedi: profile: move thumbor to profile::swift::accounts [puppet] - 10https://gerrit.wikimedia.org/r/596391 (https://phabricator.wikimedia.org/T252537) [09:14:46] (03PS1) 10Elukey: aptrepo: allow the matomo componet to use checkupdate/update [puppet] - 10https://gerrit.wikimedia.org/r/596399 (https://phabricator.wikimedia.org/T252741) [09:15:13] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10jbond) >>! In T244792#6136133, @Dzahn wrote: > Searching for a user on the ldap-corp servers is currently the way SREs c... [09:15:53] (03PS2) 10RhinosF1: Add 2 sites to copy upload domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596398 [09:16:14] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: move docker registry to profile::swift::accounts [puppet] - 10https://gerrit.wikimedia.org/r/596392 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [09:16:22] (03PS2) 10Filippo Giunchedi: profile: move docker registry to profile::swift::accounts [puppet] - 10https://gerrit.wikimedia.org/r/596392 (https://phabricator.wikimedia.org/T252537) [09:16:35] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: remove obsolete esams swift [puppet] - 10https://gerrit.wikimedia.org/r/596394 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [09:17:55] (03PS3) 10RhinosF1: Add 2 sites to copy upload domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596398 (https://phabricator.wikimedia.org/T252600) [09:18:14] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP 
- https://phabricator.wikimedia.org/T244792 (10MoritzMuehlenhoff) >>! In T244792#6136133, @Dzahn wrote: > Will there be something to replace that and will all SREs be... [09:18:27] (03PS1) 10Seddon: Undeploy from mediawikiwiki. Labs graph fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596400 (https://phabricator.wikimedia.org/T242855) [09:19:07] (03PS2) 10Filippo Giunchedi: hieradata: remove obsolete esams swift [puppet] - 10https://gerrit.wikimedia.org/r/596394 (https://phabricator.wikimedia.org/T252537) [09:22:27] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'zotero' for release 'production' . [09:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:46] 10Operations, 10Traffic, 10Patch-For-Review: ATS: Add the ability to check if origin server responses can be cached and their lifetime to the Lua plugin - https://phabricator.wikimedia.org/T251537 (10ema) >>! In T251537#6132201, @ema wrote: > https://github.com/apache/trafficserver/pull/6767 That's been mer... [09:27:43] (03CR) 10Elukey: [C: 03+2] aptrepo: allow the matomo componet to use checkupdate/update [puppet] - 10https://gerrit.wikimedia.org/r/596399 (https://phabricator.wikimedia.org/T252741) (owner: 10Elukey) [09:27:49] (03CR) 10Muehlenhoff: [C: 03+2] Remove late-install d-i hack for Puppet 5 / Facter 3 [puppet] - 10https://gerrit.wikimedia.org/r/594977 (owner: 10Muehlenhoff) [09:27:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: add wmcs-k8s-node-upgrade.py script [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [09:28:15] elukey: shall I puppet-merge your matomo patch along? [09:28:19] yep thanks! 
[09:28:52] done [09:29:00] !log upload matomo-3.13.3 to thirdparty/matomo on stretch|buster-wikimedia [09:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:44] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'zotero' for release 'production' . [09:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:08] (03CR) 10Jcrespo: [C: 03+2] "This is good, merging, but I have some comments that I hope you can address in further patches." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [09:35:42] (03CR) 10Elukey: [C: 03+1] profile: move analytics::cluster::secrets to profile::swift::accounts [puppet] - 10https://gerrit.wikimedia.org/r/596390 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [09:35:45] (03PS1) 10Elukey: matomo: switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/596404 (https://phabricator.wikimedia.org/T252741) [09:36:19] (03CR) 10Elukey: [C: 03+2] matomo: switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/596404 (https://phabricator.wikimedia.org/T252741) (owner: 10Elukey) [09:43:32] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: move analytics::cluster::secrets to profile::swift::accounts [puppet] - 10https://gerrit.wikimedia.org/r/596390 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [09:43:55] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: remove accounts from swift::params [puppet] - 10https://gerrit.wikimedia.org/r/596393 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [09:51:49] (03Abandoned) 10Arturo Borrero Gonzalez: wmnet: cleanup unused labsdb1002 entries [dns] - 10https://gerrit.wikimedia.org/r/548257 (https://phabricator.wikimedia.org/T146455) (owner: 10Arturo Borrero Gonzalez) [09:56:18] !log upgrade matomo on matomo1001 to 3.13.3 
(latest upstream) - T252741 [09:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:26] T252741: Upgrade matomo to the latest upstream - https://phabricator.wikimedia.org/T252741 [09:58:11] (03PS2) 10Ema: vcl: move exp admission policy settings to hiera [puppet] - 10https://gerrit.wikimedia.org/r/594969 (https://phabricator.wikimedia.org/T144187) [09:58:13] (03PS1) 10Ema: vcl: test 'exp' admission policy on cp2028 [puppet] - 10https://gerrit.wikimedia.org/r/596408 (https://phabricator.wikimedia.org/T144187) [09:58:17] !log remove matomo 3.11 from the main component of stretch-wikimedia [09:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:13] (03PS3) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 [10:07:19] (03CR) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (owner: 10Muehlenhoff) [10:10:55] (03PS3) 10Muehlenhoff: Add debian/ directory to the build overlay (WIP) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) [10:12:31] (03PS3) 10Ema: vcl: move exp admission policy settings to hiera [puppet] - 10https://gerrit.wikimedia.org/r/594969 (https://phabricator.wikimedia.org/T144187) [10:12:59] (03PS2) 10Ema: vcl: test 'exp' admission policy on cp2028 [puppet] - 10https://gerrit.wikimedia.org/r/596408 (https://phabricator.wikimedia.org/T144187) [10:14:00] !log fdans@deploy1001 Started deploy [analytics/refinery@6f13979]: Regular analytics weekly train [10:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:28] (03PS3) 10Ema: vcl: test 'exp' admission policy on cp2028 [puppet] - 10https://gerrit.wikimedia.org/r/596408 (https://phabricator.wikimedia.org/T144187) [10:22:38] (03CR) 10Volans: "Thanks for the patch! 
Some comment inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/596389 (owner: 10Ayounsi) [10:30:35] (03CR) 10Ayounsi: "Thanks for the quick review!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/596389 (owner: 10Ayounsi) [10:30:59] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 78, down: 11, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:31:14] !log fdans@deploy1001 Finished deploy [analytics/refinery@6f13979]: Regular analytics weekly train (duration: 17m 14s) [10:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:51] (03PS1) 10Seddon: Adding import to test wikis from mediawikiwiki to permit easier importing of test cases. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596411 (https://phabricator.wikimedia.org/T242855) [10:34:03] (03CR) 10Jcrespo: "I am a bit doubtful about 2 things: API and implementation." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [10:38:30] (03PS3) 10Hnowlan: Add tool and configuration for generating beta configuration from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) [10:39:37] (03CR) 10Dzahn: [C: 03+2] maintenance: also rsync codereview files to codfw miscweb [puppet] - 10https://gerrit.wikimedia.org/r/596235 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [10:39:48] (03PS2) 10Dzahn: maintenance: also rsync codereview files to codfw miscweb [puppet] - 10https://gerrit.wikimedia.org/r/596235 (https://phabricator.wikimedia.org/T243056) [10:43:32] 10Operations, 10Puppet, 10User-jbond: admin: create schema validation for admin.yaml - https://phabricator.wikimedia.org/T250259 (10jbond) 05Open→03Resolved [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. 
Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200514T1100). [11:00:05] matthiasmullie: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:15] o/ [11:00:44] I'll go [11:03:49] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10Dzahn) >>! In T244792#6136271, @MoritzMuehlenhoff wrote: > These are very rare, so it should be fine if a few people in... [11:04:35] !log upgrade ats to version 8.0.7-1wm7 on cp3064 [11:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:02] (03PS1) 10Dzahn: maintenance: fix hostname of miscweb server in codfw [puppet] - 10https://gerrit.wikimedia.org/r/596414 [11:05:46] !log mlitn@deploy1001 Synchronized php-1.35.0-wmf.32/extensions/WikibaseMediaInfo/: [MediaInfo] Enable media search for all users by default (duration: 01m 12s) [11:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:10] no last-minute additions to swat? 
[11:07:42] !log EU swat done [11:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:18] (03CR) 10Dzahn: [C: 03+2] maintenance: fix hostname of miscweb server in codfw [puppet] - 10https://gerrit.wikimedia.org/r/596414 (owner: 10Dzahn) [11:15:56] (03PS1) 10Dzahn: maintenance: another fix in host names for rsyncing from miscweb [puppet] - 10https://gerrit.wikimedia.org/r/596415 [11:16:57] (03CR) 10jerkins-bot: [V: 04-1] maintenance: another fix in host names for rsyncing from miscweb [puppet] - 10https://gerrit.wikimedia.org/r/596415 (owner: 10Dzahn) [11:18:48] (03CR) 10Jbond: [C: 03+1] Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (owner: 10Muehlenhoff) [11:19:57] (03PS2) 10Dzahn: maintenance: another fix in host names for rsyncing from miscweb [puppet] - 10https://gerrit.wikimedia.org/r/596415 [11:21:37] (03CR) 10Dzahn: [C: 03+2] maintenance: another fix in host names for rsyncing from miscweb [puppet] - 10https://gerrit.wikimedia.org/r/596415 (owner: 10Dzahn) [11:23:21] 10Operations, 10Commons, 10MediaWiki-File-management, 10Parsoid, and 8 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214 (10Aklapper) [11:28:06] matthiasmullie: Thank you! [11:28:35] (03CR) 10Privacybatm: "Thank you so much for merging it :) I will address those comments in the next patch." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [11:34:45] (03PS2) 10Dzahn: static-codereview: Fix links on index.html [puppet] - 10https://gerrit.wikimedia.org/r/596280 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [11:36:34] (03CR) 10Dzahn: [C: 03+2] "HTML is also valid per W3C validator :)" [puppet] - 10https://gerrit.wikimedia.org/r/596280 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [11:39:31] (03PS4) 10Dzahn: static-codereview: do not allow directory listing for subdirs [puppet] - 10https://gerrit.wikimedia.org/r/596240 (https://phabricator.wikimedia.org/T243056) [11:39:52] (03CR) 10Dzahn: [C: 03+2] static-codereview: do not allow directory listing for subdirs [puppet] - 10https://gerrit.wikimedia.org/r/596240 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [11:44:31] PROBLEM - Disk space on miscweb1002 is CRITICAL: DISK CRITICAL - free space: / 92 MB (0% inode=59%): /tmp 92 MB (0% inode=59%): /var/tmp 92 MB (0% inode=59%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=miscweb1002&var-datasource=eqiad+prometheus/ops [11:47:46] !log changed iosched on pc1010 to `none` as a test T252761 [11:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:49] T252761: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200514T1200) [12:01:46] 10Operations, 10Wikimedia-Mailing-lists: Update description for wikitech-l - https://phabricator.wikimedia.org/T252763 (10Reedy) [12:07:28] 10Operations, 10User-MoritzMuehlenhoff: prometheus-intel-microcode not in line with what's actually loaded by the kernel - https://phabricator.wikimedia.org/T252676 
(10MoritzMuehlenhoff) [12:09:13] 10Operations, 10Puppet, 10Cloud-Services, 10Traffic, and 5 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10MoritzMuehlenhoff) [12:10:01] (03PS4) 10Hnowlan: Add tool and configuration for generating beta configuration from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) [12:17:37] (03PS2) 10Dzahn: static-codereview: activate Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/596193 (https://phabricator.wikimedia.org/T243056) [12:20:24] (03CR) 10Dzahn: [C: 03+2] static-codereview: activate Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/596193 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [12:20:47] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff No longer relevant [12:21:44] 10Operations, 10Cassandra, 10Maps, 10Services (watching), 10User-Eevans: Remove Cassandra 2.2.6 packages from jessie-wikimedia/thirdparty apt repo - https://phabricator.wikimedia.org/T191627 (10MoritzMuehlenhoff) 05Open→03Declined We're not running cassandra on jessie any more and the jessie-wikimedi... 
[12:22:18] !log reverted iosched on pc1010 to `mq-deadline` T252761 [12:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:23] T252761: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 [12:23:05] (03PS1) 10RhinosF1: Rename NS_SPECIAL on jv to Naraguna [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596423 [12:23:32] (03Abandoned) 10RhinosF1: Rename NS_SPECIAL on jv to Naraguna [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596423 (owner: 10RhinosF1) [12:24:01] silly me picked the wrong repo [12:25:13] ACKNOWLEDGEMENT - Disk space on miscweb1002 is CRITICAL: DISK CRITICAL - free space: / 92 MB (0% inode=59%): /tmp 92 MB (0% inode=59%): /var/tmp 92 MB (0% inode=59%): daniel_zahn fixed https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=miscweb1002&var-datasource=eqiad+prometheus/ops [12:34:17] 10Operations, 10MediaWiki-extensions-CodeReview, 10Patch-For-Review: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10Dzahn) >>! In T243056#6130671, @Legoktm wrote: >>>! In T243056#6111775, @Dzahn wrote: >> - Should we add Bacula bac... [12:42:03] (03CR) 10Privacybatm: "> We should not modify this behaviour and just error our if explicitly a port is required. 
For example, we could default to port 0 (or -1)" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [12:43:07] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [12:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:47] (03PS1) 10Elukey: role::archiva: move to profile::java::analytics [puppet] - 10https://gerrit.wikimedia.org/r/596425 (https://phabricator.wikimedia.org/T252767) [12:45:11] (03CR) 10MarkTraceur: [C: 03+1] "FYI patch looks good, will allow better testing of graph extension. Expected deployment during morning SWAT today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596411 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [12:45:52] (03CR) 10MarkTraceur: [C: 03+1] "FYI patch looks good, part of our slow rollout and testing of graph extension changes in furtherance of Graphoid sunset. Expected deployme" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596400 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [12:46:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [12:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:14] !log rolling upgrade ats to version 8.0.7-1wm7 [12:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:21] 10Operations, 10Wikimedia-Mailing-lists: Update description for wikitech-l - https://phabricator.wikimedia.org/T252763 (10Aklapper) 05Open→03Resolved a:03Aklapper I went for `For developers discussing technical aspects and organization of Wikimedia projects` [12:57:50] 10Operations, 10Wikimedia-Mailing-lists: Update description for wikitech-l - https://phabricator.wikimedia.org/T252763 (10Reedy) Works for me. 
Definitely better than "Wikimedia developers", which while true, definitely isn't the whole story :) [13:00:04] hashar and twentyafterfour: #bothumor I � Unicode. All rise for Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200514T1300). [13:02:43] (03PS1) 10Ema: cp3051: lower large_objects_cutoff to 288K [puppet] - 10https://gerrit.wikimedia.org/r/596430 (https://phabricator.wikimedia.org/T249809) [13:04:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/22528/cp1075.eqiad.wmnet/index.html shows this is a noop until we add kafka configuration" [puppet] - 10https://gerrit.wikimedia.org/r/595502 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [13:06:57] (03CR) 10Ema: [C: 03+2] cp3051: lower large_objects_cutoff to 288K [puppet] - 10https://gerrit.wikimedia.org/r/596430 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [13:07:24] (03CR) 10Ottomata: [C: 03+1] profile::kafka::broker: use openjdk-8 for Buster [puppet] - 10https://gerrit.wikimedia.org/r/596384 (owner: 10Elukey) [13:08:26] (03CR) 10Vgutierrez: [C: 03+1] "ATS 8.0.7-1wm7 is now fleet wide available, LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/595877 (https://phabricator.wikimedia.org/T251537) (owner: 10Ema) [13:08:29] (03PS1) 10RhinosF1: Revert "[vecwiki] Update project logo with temporary 20k branding" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596431 (https://phabricator.wikimedia.org/T252770) [13:08:51] (03CR) 10Vgutierrez: [C: 03+2] trafficserver_exporter: Track throttled active connections [puppet] - 10https://gerrit.wikimedia.org/r/596238 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [13:10:51] (03CR) 10Ottomata: [C: 03+1] "A nit, but +1 either way" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596425 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [13:13:28] (03CR) 10Elukey: role::archiva: move to profile::java::analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596425 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [13:17:24] (03PS4) 10Giuseppe Lavagetto: cache::text: enable reading purges from kafka on cp2027 [puppet] - 10https://gerrit.wikimedia.org/r/595905 (https://phabricator.wikimedia.org/T133821) [13:28:02] (03PS1) 10Giuseppe Lavagetto: wip: fix purged kafka config [puppet] - 10https://gerrit.wikimedia.org/r/596437 [13:31:43] (03PS2) 10Giuseppe Lavagetto: wip: fix purged kafka config [puppet] - 10https://gerrit.wikimedia.org/r/596437 [13:31:51] <_joe_> !log updating purged to 0.11 in eqiad,eqsin,esams [13:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:45] (03PS3) 10Giuseppe Lavagetto: profile::cache::purge: correct access to kafka configuration [puppet] - 10https://gerrit.wikimedia.org/r/596437 [13:36:48] (03PS5) 10Giuseppe Lavagetto: cache::text: enable reading purges from kafka on cp2027 [puppet] - 10https://gerrit.wikimedia.org/r/595905 (https://phabricator.wikimedia.org/T133821) [13:41:13] (03PS4) 10Giuseppe Lavagetto: profile::cache::purge: correct access to kafka configuration [puppet] - 
10https://gerrit.wikimedia.org/r/596437 [13:41:16] (03PS6) 10Giuseppe Lavagetto: cache::text: enable reading purges from kafka on cp2027 [puppet] - 10https://gerrit.wikimedia.org/r/595905 (https://phabricator.wikimedia.org/T133821) [13:45:29] (03CR) 10CDanis: [C: 03+1] hieradata: remove obsolete esams swift [puppet] - 10https://gerrit.wikimedia.org/r/596394 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [13:46:53] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10CDanis) Yes, and please, also the output of: `ssh -v bast1002.wikimedia.org` and `ssh -v cumin1001.eqiad.wmnet` We only use SSH public keys, and we do not set passw... [13:48:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::cache::purge: correct access to kafka configuration [puppet] - 10https://gerrit.wikimedia.org/r/596437 (owner: 10Giuseppe Lavagetto) [13:51:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cache::text: enable reading purges from kafka on cp2027 [puppet] - 10https://gerrit.wikimedia.org/r/595905 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [13:58:33] PROBLEM - purged service on cp2027 is CRITICAL: CRITICAL - Expecting active but unit purged is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:58:46] ema: ^^ ? [13:58:59] _joe_: ^ [13:59:01] mmm maybe _joe_ :) [13:59:04] yeah [13:59:08] <_joe_> yes [13:59:13] <_joe_> it's running though [13:59:25] <_joe_> manually :P [13:59:51] PROBLEM - Check systemd state on cp2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:00] <_joe_> I'll ack those alerts, or just fix the problem temporarily for cp2027 [14:03:43] 10Operations: Generate ssh_known_hosts for network devices - https://phabricator.wikimedia.org/T252747 (10CDanis) If you wanted the script to be invoked on SRE's own machines, it could first ssh to a bastion and then invoke ssh-keyscan from there. [14:04:20] (03PS1) 10Giuseppe Lavagetto: purged: hotfix disable dynamicuser when running with kafka support [puppet] - 10https://gerrit.wikimedia.org/r/596440 [14:07:49] (03PS2) 10Giuseppe Lavagetto: purged: hotfix disable dynamicuser when running with kafka support [puppet] - 10https://gerrit.wikimedia.org/r/596440 [14:08:17] (03CR) 10Giuseppe Lavagetto: [C: 03+2] purged: hotfix disable dynamicuser when running with kafka support [puppet] - 10https://gerrit.wikimedia.org/r/596440 (owner: 10Giuseppe Lavagetto) [14:10:02] anyone know what the "Cron /usr/local/bin/prometheus-amd-rocm-stats --outfile /var/lib/prometheus/node.d/rocm.prom" mails every minute are about? [14:10:34] oh, looks like maybe https://phabricator.wikimedia.org/T247082 [14:10:36] elukey: ^ ? [14:11:05] RECOVERY - Check systemd state on cp2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:39] RECOVERY - purged service on cp2027 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:14:13] 10Operations, 10Analytics: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10elukey) [14:14:43] rzl: ah! checking, it shouldn't spam in theory [14:15:16] FileNotFoundError: [Errno 2] No such file or directory: '/opt/rocm/bin/rocm-smi': '/opt/rocm/bin/rocm-smi' [14:15:23] this is weird [14:15:52] I was playing with it earlier on, maybe a PEBCAK occurred [14:24:37] elukey: /opt/rocm-3.3.0/ [14:24:45] missing a symlink to the one without version? 
[14:25:57] mutante: yeah I know but I am not sure why it disappeared all of a sudden [14:26:02] (03PS1) 10Ayounsi: Add sre.network.prepare-upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 [14:26:18] I used the tool this morning, and it matches with timing, but in the history I don't see any rm or similar [14:26:37] also, the /opt/rocm/bin/rocm-smi link is not in the db [14:26:40] *deb anymore [14:27:02] so either the prev version was held open by something and I freed it, or I am not sure [14:27:21] anyway, I am going to file a puppet patch [14:27:44] (03CR) 10jerkins-bot: [V: 04-1] Add sre.network.prepare-upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 (owner: 10Ayounsi) [14:27:51] (03CR) 10Nuria: [C: 03+1] admins: add Segun Oworu to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/596397 (https://phabricator.wikimedia.org/T252703) (owner: 10Dzahn) [14:29:26] (03PS1) 10Vgutierrez: ATS: Stop handling KA and WS on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/596445 [14:32:21] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/22537/" [puppet] - 10https://gerrit.wikimedia.org/r/596445 (owner: 10Vgutierrez) [14:34:15] (03PS1) 10Volans: tests: relax bandit dependency [software/cumin] - 10https://gerrit.wikimedia.org/r/596448 [14:34:17] (03PS1) 10Volans: tests: fix new pylint reported issues [software/cumin] - 10https://gerrit.wikimedia.org/r/596449 [14:34:19] (03PS1) 10Volans: setup.py: make it Debian Buster compatible [software/cumin] - 10https://gerrit.wikimedia.org/r/596450 [14:34:21] (03PS1) 10Volans: Drop support for Python 3.5 and 3.6 [software/cumin] - 10https://gerrit.wikimedia.org/r/596451 [14:34:23] (03PS1) 10Volans: test: improve integration tests [software/cumin] - 10https://gerrit.wikimedia.org/r/596452 [14:34:25] (03PS1) 10Volans: doc: fix and improved documentation [software/cumin] - 10https://gerrit.wikimedia.org/r/596453 [14:34:27] (03PS1) 
10Volans: doc: split HTML and manpage generation [software/cumin] - 10https://gerrit.wikimedia.org/r/596454 [14:34:56] (03CR) 10Dzahn: [C: 03+2] admins: add Segun Oworu to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/596397 (https://phabricator.wikimedia.org/T252703) (owner: 10Dzahn) [14:37:52] (03CR) 10Volans: "> Patch Set 1:" (032 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/595717 (owner: 10CRusnov) [14:38:31] 10Operations, 10SRE-tools: Generate ssh_known_hosts for network devices - https://phabricator.wikimedia.org/T252747 (10Volans) [14:39:27] !log kuai kuai is https://twitter.com/Arlieth/status/1257714333133357056 | https://en.wikipedia.org/wiki/Kuai_Kuai_culture [14:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:36] arg, wrong [14:39:38] !kuai kuai is https://twitter.com/Arlieth/status/1257714333133357056 | https://en.wikipedia.org/wiki/Kuai_Kuai_culture [14:40:04] ahahaha I was going to say, the SAL already? this is rolling out faster than I expected [14:40:27] no, i wanted to add the fact.. 
where is that bot :) [14:40:41] wm-bot: help [14:40:52] (03PS1) 10Elukey: Add rocm-smi path as parameter to the Prometheus AMD GPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/596457 (https://phabricator.wikimedia.org/T247082) [14:41:04] there is an infobot-like bot :) [14:42:41] (03CR) 10Elukey: [C: 03+2] Add rocm-smi path as parameter to the Prometheus AMD GPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/596457 (https://phabricator.wikimedia.org/T247082) (owner: 10Elukey) [14:44:44] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Remove North Korea from data quality traffic entropy reports - https://phabricator.wikimedia.org/T251546 (10Nuria) 05Open→03Resolved [14:45:19] (03PS2) 10Ayounsi: Add sre.network.prepare-upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 [14:46:24] (03PS2) 10Volans: doc: fix and improved documentation [software/cumin] - 10https://gerrit.wikimedia.org/r/596453 [14:46:26] (03PS2) 10Volans: doc: split HTML and manpage generation [software/cumin] - 10https://gerrit.wikimedia.org/r/596454 [14:47:05] mutante: just !kuai is should work [14:47:08] (03Abandoned) 10Cwhite: admin add Segun Oworu to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/596298 (https://phabricator.wikimedia.org/T251523) (owner: 10Cwhite) [14:47:13] rzl, mutante I think I fixed the spam, thanks! [14:47:13] !kuai is test [14:47:14] Sorry, you are not authorized to perform this [14:47:18] I trust: petan|w.*wikimedia/Petrb (2admin), .*@wikimedia/.* (2trusted), .*@mediawiki/.* (2trusted), .*@mediawiki/Catrope (2admin), .*@wikimedia/RobH (2admin), .*@wikimedia/Ryan-lane (2admin), petan!.*@wikimedia/Petrb (2admin), .*@wikimedia/Krinkle (2admin), .*@wikipedia/.* (2trusted), [14:47:18] @trusted [14:47:21] elukey: thank you! 
[14:47:54] RhinosF1: thanks, yea that's what i remembered in the second attempt, but it did not say anything as usual
[14:48:06] has to be one word, I guess
[14:48:11] mutante: you did !kuai kuai is
[14:48:12] elukey: cool, thanks
[14:48:22] Just one !kuai should work
[14:48:42] RhinosF1: ah, because it parses for a single word followed by "is"? right
[14:48:53] Yep
[14:49:13] !kuai is kuai kuai - https://twitter.com/Arlieth/status/1257714333133357056 | https://en.wikipedia.org/wiki/Kuai_Kuai_culture
[14:49:13] Key was added
[14:49:16] there we go :)
[14:49:22] Operations, SRE-Access-Requests, Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251523 (colewhite) Open→Declined It appears this was handled by T252703. Per T251122, SSH access was deemed not necessary.
[14:49:44] :)
[14:51:46] (PS1) Ppchelko: Changeprop: properly template root_event for resource-purge [deployment-charts] - https://gerrit.wikimedia.org/r/596459
[14:51:57] Operations, Analytics, Analytics-Kanban, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (Dzahn) @soworu You have been added to the "wmf" group. You should now be able to login.
[14:52:04] (PS1) Ottomata: Factor out java 8 installation into java_8 class [puppet] - https://gerrit.wikimedia.org/r/596460 (https://phabricator.wikimedia.org/T252767)
[14:55:38] Operations, Analytics, Analytics-Kanban, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (Dzahn) @elukey @Ottomata Please sync soworu's user for Hue access (needed per https://wikitech.wi...
[14:55:59] (03PS2) 10Ppchelko: Changeprop: properly template root_event for resource-purge [deployment-charts] - 10https://gerrit.wikimedia.org/r/596459 [14:57:16] (03PS2) 10Vgutierrez: ATS: Stop handling KA and WS on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/596445 [14:57:41] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10Jclark-ctr) updated ticket with HP. Scheduled main board replacement. [14:58:49] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Ottomata) For Hue the user must also have shell access and be added to the `analytics-privatedata... [14:59:17] (03CR) 10Ottomata: "no-op, just adding the class to the catalog" [puppet] - 10https://gerrit.wikimedia.org/r/596460 (https://phabricator.wikimedia.org/T252767) (owner: 10Ottomata) [15:02:39] (03CR) 10Elukey: [C: 03+1] Factor out java 8 installation into java_8 class [puppet] - 10https://gerrit.wikimedia.org/r/596460 (https://phabricator.wikimedia.org/T252767) (owner: 10Ottomata) [15:02:44] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596384 (owner: 10Elukey) [15:05:59] (03CR) 10Elukey: "Andrew created https://gerrit.wikimedia.org/r/#/c/596460/, so we should be able to re-use the same code in multiple places." [puppet] - 10https://gerrit.wikimedia.org/r/596384 (owner: 10Elukey) [15:09:13] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) 05Open→03Stalled a:05mmodell→03None I am currently unable to drive this forward as all the change... 
[15:20:22] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/595990 (https://phabricator.wikimedia.org/T252129) (owner: 10Cwhite) [15:20:25] (03PS1) 10Privacybatm: Firewall.py: Store target_host as an instance property of Firewall object [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596464 (https://phabricator.wikimedia.org/T252172) [15:24:11] RECOVERY - Host furud.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.77 ms [15:24:35] andrewbogott: can you add some project tags to https://phabricator.wikimedia.org/T252784 [15:24:44] yes, good idea :) [15:24:53] I'm getting better, I only forget ~30% of the time [15:24:53] 10Operations, 10ops-codfw, 10Analytics: furud mgmt interface is down - https://phabricator.wikimedia.org/T252616 (10Papaul) 05Open→03Resolved It was a loosed cable . ` PING 10.193.1.42 (10.193.1.42) 56(84) bytes of data. 64 bytes from 10.193.1.42: icmp_seq=1 ttl=62 time=2.08 ms 64 bytes from 10.193.1.42... [15:25:03] andrewbogott: np [15:25:16] !log disable asw2-d1-eqiad:et-1/1/0 - T251663 [15:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:19] T251663: D1<->D8 VC link failure - https://phabricator.wikimedia.org/T251663 [15:26:01] (03PS2) 10Ema: 5.1.3-1wm15: add 0037-force-discard.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/596237 (https://phabricator.wikimedia.org/T236754) [15:27:02] thanks papaul for furud's mgmt! 
[15:27:20] (CR) Cwhite: [C: +2] admin: add Daniele Rama to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/595990 (https://phabricator.wikimedia.org/T252129) (owner: Cwhite)
[15:27:20] so the 9 thin cloudvirts shipped, dell just contacted for inbound shipment ticket for mach1, will arrive tomorrow most likely
[15:27:33] (PS3) Cwhite: admin: add Daniele Rama to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/595990 (https://phabricator.wikimedia.org/T252129)
[15:27:41] i just put in a ticket for delivery to cage for those hosts
[15:28:30] Operations, Analytics, Analytics-Kanban, LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (Dzahn) Oh, thanks for pointing that out @Ottomata In this case i am not sure if that is really needed because the re...
[15:35:22] Operations, SRE-Access-Requests, Patch-For-Review: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (colewhite) Open→Resolved Access is provisioned. For Kerberos setup, an email should arrive with the necessary details at the email...
[15:36:35] (03PS4) 10Muehlenhoff: Add debian/ directory to the build overlay (WIP) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) [15:37:12] (03PS1) 10Andrew Bogott: Convert cloudvirt1004 and cloudvirt1006 to ceph-backed cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/596466 (https://phabricator.wikimedia.org/T252784) [15:38:33] (03CR) 10Andrew Bogott: [C: 03+2] Convert cloudvirt1004 and cloudvirt1006 to ceph-backed cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/596466 (https://phabricator.wikimedia.org/T252784) (owner: 10Andrew Bogott) [15:38:56] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Nuria) >@soworu You will have superset and turnilo This is all you need to access all data to be clear. [15:39:53] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Dzahn) 05Open→03Resolved Alright, thanks Nuria for clarification. I will claim it's resolved then. If any issues f... [15:40:27] (03CR) 10Volans: "Few comments/suggestions inline" (038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 (owner: 10Ayounsi) [15:40:49] (03PS1) 10Andrew Bogott: cloudvirt1004/1006: fix role name [puppet] - 10https://gerrit.wikimedia.org/r/596467 (https://phabricator.wikimedia.org/T252784) [15:41:29] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1004/1006: fix role name [puppet] - 10https://gerrit.wikimedia.org/r/596467 (https://phabricator.wikimedia.org/T252784) (owner: 10Andrew Bogott) [15:41:56] (03CR) 10Jcrespo: "Sorry, with "copy", I actually meant "move". 
You can remove the ones under wmfmariadbpy- we only need a copy in the repo, in the place you" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596464 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [15:42:53] (03PS1) 10Ema: vcl_reference_leak: instrument VCL_Poll [puppet] - 10https://gerrit.wikimedia.org/r/596469 (https://phabricator.wikimedia.org/T236754) [15:43:45] (03CR) 10Privacybatm: "Oh, Okay :D" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596464 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [15:45:41] (03PS2) 10Privacybatm: Firewall.py: Store target_host as an instance property of Firewall object [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596464 (https://phabricator.wikimedia.org/T252172) [15:45:54] (03CR) 10Ema: [C: 03+2] vcl_reference_leak: instrument VCL_Poll [puppet] - 10https://gerrit.wikimedia.org/r/596469 (https://phabricator.wikimedia.org/T236754) (owner: 10Ema) [15:46:03] (03CR) 10Jcrespo: "The other change looks good- when you amend the patch, I will test it and merge." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596464 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [15:46:16] (03CR) 10jerkins-bot: [V: 04-1] 5.1.3-1wm15: add 0037-force-discard.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/596237 (https://phabricator.wikimedia.org/T236754) (owner: 10Ema) [15:47:26] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10Daniram3) Yes, the email has arrived. Thank you @colewhite for helping out! [15:52:47] 10Operations, 10ops-codfw, 10serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. 
- https://phabricator.wikimedia.org/T252185 (10Papaul) [15:53:32] (03CR) 10Hnowlan: [C: 03+2] Changeprop: properly template root_event for resource-purge [deployment-charts] - 10https://gerrit.wikimedia.org/r/596459 (owner: 10Ppchelko) [15:53:51] (03Merged) 10jenkins-bot: Changeprop: properly template root_event for resource-purge [deployment-charts] - 10https://gerrit.wikimedia.org/r/596459 (owner: 10Ppchelko) [15:56:03] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [15:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:28] (03PS1) 10Cwhite: profile: add mailman outbound queue monitoring [puppet] - 10https://gerrit.wikimedia.org/r/596471 (https://phabricator.wikimedia.org/T236505) [15:57:14] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:21] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200514T1600). Please do the needful. [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:00:51] (03PS4) 10Zoranzoki21: Initial config for awawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) [16:01:29] (03PS5) 10Zoranzoki21: Initial config for awawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) [16:01:37] (03PS1) 10Ssingh: cescout: harden the Postgres installation [puppet] - 10https://gerrit.wikimedia.org/r/596472 (https://phabricator.wikimedia.org/T247273) [16:02:13] (03PS6) 10Zoranzoki21: Initial config for awawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) [16:02:15] (03PS1) 10Subramanya Sastry: Bump rt-test clients from 8 to 12 [puppet] - 10https://gerrit.wikimedia.org/r/596473 [16:02:54] (03CR) 10Cwhite: [C: 03+1] Factor out java 8 installation into java_8 class [puppet] - 10https://gerrit.wikimedia.org/r/596460 (https://phabricator.wikimedia.org/T252767) (owner: 10Ottomata) [16:03:03] (03PS7) 10Zoranzoki21: Initial config for awawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) [16:03:12] (03CR) 10Ottomata: [C: 03+2] Factor out java 8 installation into java_8 class [puppet] - 10https://gerrit.wikimedia.org/r/596460 (https://phabricator.wikimedia.org/T252767) (owner: 10Ottomata) [16:05:00] (03CR) 10Zoranzoki21: "Patch rebased finally. Let me know if is something left to do in this patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) (owner: 10Zoranzoki21) [16:06:52] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/22539/cescout1001.eqiad.wmnet/index.html seems to indicate that the desired change work b" [puppet] - 10https://gerrit.wikimedia.org/r/596472 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [16:09:00] (03PS1) 10Zoranzoki21: [thwikisource] Set ProofReadPage separator on '' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596476 (https://phabricator.wikimedia.org/T252610) [16:09:29] (03CR) 10CDanis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/596471 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [16:20:27] (03PS1) 10Zoranzoki21: Enabe subpages in Page namespace on napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) [16:22:07] (03PS2) 10Zoranzoki21: Enable subpages in Page namespace on napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) [16:24:23] (03PS1) 10Bstorm: paws: Introduce the role skeleton for the paws servers [puppet] - 10https://gerrit.wikimedia.org/r/596478 (https://phabricator.wikimedia.org/T188912) [16:30:29] (03PS1) 10Giuseppe Lavagetto: purged: remove authn from the tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/596479 [16:31:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] purged: remove authn from the tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/596479 (owner: 10Giuseppe Lavagetto) [16:33:48] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10ayounsi) p:05Triage→03High [16:36:26] !log asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 0 port 48 member 2 - T252797 [16:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:30] T252797: 
asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 [16:42:38] !log request virtual-chassis vc-port delete pic-slot 1 port 2 member 1 - T252797 [16:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:41] T252797: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 [16:48:34] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10ayounsi) I disabled the mentioned link on the fpc2 side (so we don't risk fully losing access to fpc1) first. Then on the fpc1 side to check if the alert was caused by this DAC. Unfortunately it loo... [16:49:58] !log request virtual-chassis vc-port set pic-slot 1 port 2 member 1 - T252797 [16:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:04] T252797: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 [16:51:19] !log asw2-d-eqiad> request virtual-chassis vc-port set pic-slot 0 port 48 member 2 - T252797 [16:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Bump rt-test clients from 8 to 12 [puppet] - 10https://gerrit.wikimedia.org/r/596473 (owner: 10Subramanya Sastry) [16:54:18] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/596471 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [16:55:23] !log asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 3 member 1 - T252797 [16:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:27] T252797: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 [17:00:04] halfak and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200514T1700). 
[17:00:28] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10ayounsi) `pic-slot 1 port 3 member 1` was a leftover port configured as VC port, but without any cable connected to it. Errors are still happening. [17:02:59] !log asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 1 member 1 - T252797 [17:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:02] T252797: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 [17:07:09] 10Operations, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10Miriam) Thank you so much @colewhite and all! [17:12:07] (03PS1) 10Arturo Borrero Gonzalez: apt-upgrade: give support to understand dist/component [puppet] - 10https://gerrit.wikimedia.org/r/596483 (https://phabricator.wikimedia.org/T250867) [17:12:51] (03CR) 10jerkins-bot: [V: 04-1] apt-upgrade: give support to understand dist/component [puppet] - 10https://gerrit.wikimedia.org/r/596483 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [17:13:21] (03PS2) 10Arturo Borrero Gonzalez: apt-upgrade: give support to understand dist/component [puppet] - 10https://gerrit.wikimedia.org/r/596483 (https://phabricator.wikimedia.org/T250867) [17:14:26] (03CR) 10jerkins-bot: [V: 04-1] apt-upgrade: give support to understand dist/component [puppet] - 10https://gerrit.wikimedia.org/r/596483 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [17:14:34] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10ayounsi) Disabled the last link, and the errors are still showing up, so I'm confused on where the issue is coming from. 
[17:20:37] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10ayounsi) From T218059#5075466 it probably due to the link disabled in T251663 acting up. @Jclark-ctr, please unplug fpc1:1/0 (and remove/store the optics) from both sides, fpc8:1/0 (link should be d... [17:24:53] (03CR) 10BryanDavis: apt-upgrade: give support to understand dist/component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596483 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [17:26:50] (03CR) 10Herron: [C: 03+1] "Thanks for this, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/596471 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [17:27:15] PROBLEM - Host dumpsdata1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:25] PROBLEM - Host kafka-jumbo1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:35] PROBLEM - Host mw1357 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:37] PROBLEM - Host wdqs1008 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:39] PROBLEM - Host es1018 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:41] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:41] PROBLEM - Host snapshot1009 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:41] uh oh [17:27:53] PROBLEM - Host mw1350 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:55] PROBLEM - Host mw1361 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:01] PROBLEM - Host mw1356 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:09] PROBLEM - Host mw1351 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:09] PROBLEM - Host centrallog1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:13] PROBLEM - Host mw1360 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:15] PROBLEM - Host mw1359 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:16] this is all rack D1 [17:28:17] D1: Initial commit - https://phabricator.wikimedia.org/D1 [17:28:21] PROBLEM - Host logstash1012 is 
DOWN: PING CRITICAL - Packet loss = 100% [17:28:33] yikes [17:28:38] XioNoX: ^ [17:28:45] RECOVERY - Host mw1361 is UP: PING WARNING - Packet loss = 77%, RTA = 0.96 ms [17:28:45] RECOVERY - Host dumpsdata1002 is UP: PING WARNING - Packet loss = 90%, RTA = 0.53 ms [17:28:45] RECOVERY - Host logstash1012 is UP: PING WARNING - Packet loss = 33%, RTA = 0.75 ms [17:28:45] RECOVERY - Host mw1360 is UP: PING WARNING - Packet loss = 33%, RTA = 0.29 ms [17:28:45] RECOVERY - Host es1018 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [17:28:46] RECOVERY - Host kafka-jumbo1006 is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [17:28:46] RECOVERY - Host mw1359 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [17:28:47] RECOVERY - Host mw1356 is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [17:28:47] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [17:28:49] RECOVERY - Host wdqs1008 is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [17:28:51] RECOVERY - Host mw1350 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [17:28:51] RECOVERY - Host mw1357 is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [17:28:51] RECOVERY - Host centrallog1001 is UP: PING OK - Packet loss = 0%, RTA = 1.36 ms [17:28:58] \o/ [17:29:11] PROBLEM - Host restbase-dev1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:29:37] PROBLEM - Host mw1361 is DOWN: PING CRITICAL - Packet loss = 100% [17:29:44] yes, d1 is acting up [17:29:55] RECOVERY - Host snapshot1009 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [17:29:55] RECOVERY - Host mw1361 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:30:01] RECOVERY - Host mw1351 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:30:03] RECOVERY - Host restbase-dev1006 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:30:13] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:30:15] PROBLEM - Apache HTTP on mw1349 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:30:17] PROBLEM - PHP7 rendering on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:30:21] PROBLEM - Nginx local proxy to apache on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:30:25] PROBLEM - Nginx local proxy to apache on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:30:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:30:33] PROBLEM - PHP7 rendering on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:31:03] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5824 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:31:14] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops, 10Sustainability (Incident Prevention): Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10BPirkle) @tstarling , that sounds good to me. There will be no technical... [17:31:33] context is https://phabricator.wikimedia.org/T252797#6137832 basically unplugged a disabled interface, so not sure how it can cause that... 
[17:31:38] it's plugged back in [17:31:55] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:31:59] RECOVERY - Apache HTTP on mw1349 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:32:01] RECOVERY - PHP7 rendering on mw1360 is OK: HTTP OK: HTTP/1.1 200 OK - 75916 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:32:03] RECOVERY - Nginx local proxy to apache on mw1355 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:32:09] RECOVERY - Nginx local proxy to apache on mw1354 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:32:17] RECOVERY - PHP7 rendering on mw1354 is OK: HTTP OK: HTTP/1.1 200 OK - 75916 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:32:55] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 261 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:32:59] PROBLEM - Check systemd state on mw1358 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:11] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:35:31] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen [17:35:52] rzl: is something still off? [17:35:53] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 12.53 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:36:06] or it's things catching up? [17:36:42] XioNoX: not sure, that mw1358 alert is for ferm -- I haven't tried restarting it, should I? 
[17:36:46] I'm seeing logging ES and kafka brokers catching back up fwiw
[17:37:17] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[17:37:23] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[17:37:48] https://phabricator.wikimedia.org/P11196
[17:38:03] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:38:08] yeah I guess that just wants a kick, I'll restart it
[17:38:15] rzl: yep
[17:38:33] afaik everything is back to the state it was before
[17:38:35] RECOVERY - Check systemd state on mw1358 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:35] done and looks good, alert should clear
[17:38:59] I thiiink we're all set? checking icinga alerts
[17:39:28] wish I could get icinga.wm.o/alerts?site=eqiad&rack=D1 or whatever
[17:39:29] D1: Initial commit - https://phabricator.wikimedia.org/D1
[17:40:07] rzl: we've talked about adding some labels in prometheus or something
[17:40:16] but only vague notions
[17:40:51] i'm surprised we didn't page
[17:40:52] having a look at the kafka-jumbo mirror maker
[17:41:29] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.1958 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[17:43:05] rzl: not per rack but per parent switch (so row): https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hostgroup=asw2-d-eqiad&style=overview
[17:43:18] oh sweet, thanks
[17:43:27] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[17:44:11] interestingly we served more 50x at the *end* of the incident rather than the start https://logstash.wikimedia.org/goto/860b6373aec52d1e4d7a8d786dda20b5
[17:44:47] phew, 1 procs named java is (arguably) much better than 0
[17:45:01] @hashar are you here?
[17:46:41] ( thanks herron ) [17:47:54] we should break ToR switches more often ;) [17:48:41] yeah not a bad exercise tbh [17:49:52] (03PS3) 10Ottomata: profile::kafka::broker: use openjdk-8 for Buster [puppet] - 10https://gerrit.wikimedia.org/r/596384 (owner: 10Elukey) [17:52:04] yeah, I'd rather not :) [17:52:37] (03CR) 10Herron: [C: 03+1] "LGTM -- not tested but am support of using openjdk-8 in buster to ease upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/596384 (owner: 10Elukey) [17:52:38] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/22540/kafka-main1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/596384 (owner: 10Elukey) [17:52:43] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [17:53:09] (03CR) 10Ottomata: [C: 03+2] profile::kafka::broker: use openjdk-8 for Buster [puppet] - 10https://gerrit.wikimedia.org/r/596384 (owner: 10Elukey) [17:54:13] herron: did you fix ^^ somehow before or was it just intermittent? [17:54:16] mirror marker [17:54:19] maker* [17:54:38] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10ayounsi) Unplugging that link caused fpc1 to lose connectivity to the remaining of the VC, while it's neither a VCP, nor enabled. > asw2-d-eqiad fpc1 PFEMAN: Shutting down in 5 seconds, PFEMAN Resync... [17:54:44] I bounced the service and it cleared [17:54:55] hm k [17:55:35] Failed to load SSL keystore /etc/kafka/mirror/ssl/kafka_mirror_maker.keystore.jks of type JKS [17:55:36] ??? 
[17:56:10] Keystore was tampered with, or password was incorrect hmmmm [17:57:02] modified ~2 weeks ago [17:57:46] hmmmmm [17:57:50] i haven't changed anything [17:58:43] it is the same file on kafka-jumbo1001 [17:58:50] i'm worried that they all changed somehow in cergen?! [17:58:54] and puppet made new ones?? [17:58:55] hmmm [18:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200514T1800). [18:00:04] Zoranzoki21 and Seddon: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:28] Hi de hi :d [18:00:30] :D [18:00:43] no change in puppet private for that keystore since oct 2019 [18:01:10] hm it is the same file [18:02:22] wonder if the passphrase in secrets changed? [18:03:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [18:04:44] the password in kafka mirror configs looks wrong... [18:06:05] Seddon: hey, I can do the SWAT in a few minutes, just need to get to my laptop :) [18:06:23] Urbanecm: sure :) I'm about, no rush [18:06:37] hmm, i see elukey added "Add mirror maker keystore password for Kafka Jumbo" on April 29 [18:06:59] (unless ottomata or herron would appreciate no deploys happening?) [18:07:05] nono proceed [18:07:08] Hello, I'm here [18:07:10] Ack [18:07:12] things are fine just confusing, deploys won't break kafka mirror [18:07:33] Zoranzoki21: I will start the SWAT in a minute, need to get to my laptop.
[18:08:22] Urbanecm: Ok, no problem, I'm not in hurry :) [18:11:09] Can you check my patch for config related to awawiki later? [18:11:22] Seddon: could you clarify https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/596400's commit message, as in "what is being undeployed"? [18:11:42] thigns are fine just confusing, deploys won't break kafka mirror [18:11:46] oops [18:12:08] (03PS2) 10Seddon: Undeploy graphoid from mediawikiwiki. Labs graph fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596400 (https://phabricator.wikimedia.org/T242855) [18:12:20] Urbanecm: better? [18:12:26] yes, thanks [18:13:35] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596411 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [18:14:25] (03Merged) 10jenkins-bot: Adding import to test wikis from mediawikiwiki to permit easier importing of test cases. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596411 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [18:15:00] deploying the import patch [18:15:59] Zoranzoki21: I saw your message, sure, after the deployments :) [18:16:16] Urbanecm: I mean it, thanks! 
[18:16:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: f03a45c: Adding import to test wikis from mediawikiwiki (T242855) (duration: 01m 07s) [18:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:35] T242855: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 [18:16:55] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [18:17:09] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596400 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [18:18:16] (03PS3) 10Urbanecm: Undeploy graphoid from mediawikiwiki. Labs graph fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596400 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [18:18:20] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596400 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [18:18:47] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [18:19:03] (03Merged) 10jenkins-bot: Undeploy graphoid from mediawikiwiki. Labs graph fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596400 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [18:19:38] Seddon: pulled the undeploy patch onto mwdebug1001. Could you test it works as expected and let me know? [18:19:47] Will do. Am ready this time :) [18:20:08] nice! 
[18:20:59] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:21:37] Zoranzoki21: T252755 doesn't have any link to community discussion [18:21:40] T252755: Add subpages in ns Page for nap.source - https://phabricator.wikimedia.org/T252755 [18:22:13] Urbanecm: Maybe user forgot to put it in description, will check. Meanwhile, you can deploy another patch [18:22:22] I'm looking on it rn :) [18:23:51] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596476 (https://phabricator.wikimedia.org/T252610) (owner: 10Zoranzoki21) [18:23:57] (03PS2) 10Urbanecm: [thwikisource] Set ProofReadPage separator on '' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596476 (https://phabricator.wikimedia.org/T252610) (owner: 10Zoranzoki21) [18:24:02] (03CR) 10Urbanecm: [thwikisource] Set ProofReadPage separator on '' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596476 (https://phabricator.wikimedia.org/T252610) (owner: 10Zoranzoki21) [18:24:06] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596476 (https://phabricator.wikimedia.org/T252610) (owner: 10Zoranzoki21) [18:24:22] Urbanecm: The undeply change worked. Some other issues I need to deal with but thanks for that :) [18:24:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/596472 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [18:24:38] Urbanecm: I can't find any discussion for T252755 on napwikisource, we won't deploy it [18:24:50] (03Merged) 10jenkins-bot: [thwikisource] Set ProofReadPage separator on '' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596476 (https://phabricator.wikimedia.org/T252610) (owner: 10Zoranzoki21) [18:24:51] Zoranzoki21: ack, please update the requestor on the task [18:24:56] Seddon: thanks, syncing [18:25:45] Urbanecm: Done [18:26:13] Seddon: fyi: the beta part of your patch should be auto-deployed to beta soon, if not there already [18:26:30] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 4b8399c: Undeploy graphoid from mediawikiwiki (T242855) (duration: 01m 05s) [18:26:30] 10Operations, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10Miriam) 05Resolved→03Open Re-opening this temporarily for a quick follow-up! @elukey could you add @Daniram3 to the LDAP-group so that he can access the note... [18:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:34] T242855: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 [18:26:41] Seddon: both done now :-). [18:27:10] Zoranzoki21: proofread patch is ready at mwdebug1001 [18:28:02] Urbanecm: Ok, testing... [18:29:33] I need help... For what I should look? [18:31:34] let me see... 
[18:33:41] I'd say it works :-) [18:34:18] syncing [18:34:44] Ok, thanks Urbanecm [18:35:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 15adbbc: [thwikisource] Set ProofReadPage separator to an empty string (T252610) (duration: 01m 06s) [18:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:27] T252610: Set ProofreadPage page separator on thwikisource - https://phabricator.wikimedia.org/T252610 [18:35:29] done [18:35:42] Urbanecm: ty [18:35:46] hth [18:35:56] (03CR) 10Urbanecm: [C: 04-2] "no sign of being consulted with community, to prevent accidental merges" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: 10Zoranzoki21) [18:36:03] @Urbanecm confirm that the import change is working on both test and test2 wiki [18:36:18] nice, thanks Seddon ! [18:36:21] !log Morning SWAT done [18:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:52] (03PS1) 10Subramanya Sastry: Bump rt-test clients from 12 to 20 [puppet] - 10https://gerrit.wikimedia.org/r/596496 [18:50:15] 10Operations, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10colewhite) I am unable to find anything in wikitech that speaks of an ldap group that allows access to the notebooks. @Miriam, do you mean shell access to the `... [18:50:36] no train today, it is blocked on a spam of databsae INSERT https://phabricator.wikimedia.org/T247028 [18:50:42] I am crafting a mail [18:52:15] (03PS1) 10Cyberpower678: Update exec role for Cyberbot [puppet] - 10https://gerrit.wikimedia.org/r/596497 [18:52:17] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/596497 (owner: 10Cyberpower678) [18:58:30] 10Operations, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10Miriam) Hi @colewhite yes! this is for SWAP access, see this task: https://phabricator.wikimedia.org/T199736 [18:58:57] 10Operations, 10SRE-tools: Create cookbook to reboot hosts - https://phabricator.wikimedia.org/T252807 (10Volans) [19:00:04] hashar and twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200514T1900). [19:00:19] (03CR) 10Volans: [C: 04-1] "I've took the liberty to open https://phabricator.wikimedia.org/T252807 to better discuss where we want to go with this. It's ok to start " (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (owner: 10Muehlenhoff) [19:03:43] (03CR) 10BryanDavis: [C: 03+1] Update exec role for Cyberbot [puppet] - 10https://gerrit.wikimedia.org/r/596497 (owner: 10Cyberpower678) [19:09:31] (03CR) 10Bstorm: [C: 03+2] Update exec role for Cyberbot [puppet] - 10https://gerrit.wikimedia.org/r/596497 (owner: 10Cyberpower678) [19:11:02] 10Operations, 10SRE-tools: Create cookbook to reboot hosts - https://phabricator.wikimedia.org/T252807 (10jcrespo) Quick stupid idea - 1) Insert hook after downtime for custom code. 2) Have a configured way to tell which hosts load which class in the hierarchy, be it an abort "this host should never be reboote... 
[19:13:47] 10Operations, 10Graphoid, 10serviceops, 10Core Platform Team (Icebox), 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26): Add WDQS and REST API to CORS whitelist - https://phabricator.wikimedia.org/T252810 (10Jseddon) [19:35:02] 10Operations, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10Nuria) @colewhite it is no doubt confusing, you need to be in wmf ldap group to use the wb ui to notebooks [19:39:23] 10Operations, 10SDC General, 10Structured Data Engineering, 10Structured-Data-Backlog, and 3 others: Create CQS puppet configs by applying query_service module - https://phabricator.wikimedia.org/T237089 (10EBernhardson) [19:44:14] MatmaRex: Jdlrobson: hi, for the Vector side bar being broken, I guess we can deploy it right now? [19:44:37] the cherry pick is https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/Vector/+/596503/ [19:44:42] sure. i'm around [19:45:18] ok +2 ing that one [19:47:51] (03PS2) 10Herron: lists: don't monitor mailman procs on standby_host [puppet] - 10https://gerrit.wikimedia.org/r/596259 (https://phabricator.wikimedia.org/T252615) [19:48:21] (03PS8) 10Zoranzoki21: Initial config for awawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) [19:49:21] !log Depooled wqds1006 in preparation for impending wdqs data xfer [19:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:37] (03CR) 10Herron: [C: 03+2] "> the sudo::user { 'nagios_mailman_queue': can also move inside the" [puppet] - 10https://gerrit.wikimedia.org/r/596259 (https://phabricator.wikimedia.org/T252615) (owner: 10Herron) [19:50:39] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:10] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [19:51:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:19] (03PS1) 10ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - 10https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) [19:59:50] MatmaRex: I am deploying the Vector fix ;) [20:00:36] okay [20:01:12] it is on mwdebug1001 [20:01:26] then I am afraid the sidebar is cached somehow [20:01:50] it is cached in all sorts of funny ways [20:02:02] I tried purging it ;) [20:02:08] no luck [20:02:45] I mean on https://wikitech.wikimedia.org/wiki/MediaWiki:Sidebar [20:02:54] and pointing at mwdebug1001 [20:03:09] (03CR) 10Ssingh: [C: 03+2] "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/596472 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [20:04:31] hashar: the code is in Skin.php buildSidebar() [20:04:41] it uses $wgSidebarCacheExpiry [20:04:49] which is… a day? [20:05:00] nice [20:05:21] but maybe an edit will get it purged? [20:05:57] i would hope so, maybe the 'checkKeys' thing does that [20:07:14] hashar: are you trying that? 
i don't have the permissions on wikitech [20:07:20] I don't either [20:07:23] but we can try on mw.org [20:07:42] hashar: officewiki was affected [20:07:54] oh, i can't edit there either, never mind [20:08:01] then, I haven't looked at the sidebar editing in ages [20:09:40] ah found a tiny fix to do [20:09:58] https://www.mediawiki.org/w/index.php?title=MediaWiki:Sidebar&diff=3852433&oldid=3338826 [20:10:32] so that in the side bar "code docs" come before "code repository" [20:10:42] but I can't tell whether that fixes the plain text issue [20:10:47] hashar: mw.org isn't affected by the bug though (since https://www.mediawiki.org/w/index.php?title=MediaWiki:MediaWiki.org&action=history was created) [20:10:53] ah [20:10:56] i guess we could delete that page, and see if it's still fixed then [20:11:26] sure [20:11:46] deleting [20:11:50] my first delete in ages [20:11:55] fun [20:12:02] that brings back the bug [20:12:09] so the fix does not work :/ [20:12:18] 10Operations, 10Mail: lists1001 alerting on mailman processes - https://phabricator.wikimedia.org/T252615 (10herron) 05Open→03Resolved a:03herron This has been fixed. Thx for logging the task @Marostegui! [20:12:21] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Duplicate "moderator request(s) waiting" emails sent to list admins - https://phabricator.wikimedia.org/T250032 (10herron) [20:12:27] oh no [20:12:29] hashar: i see the bug on the prod servers, but it's fixed on mwdebug [20:12:30] it does [20:12:35] :]]]]]]]]] [20:12:41] syncing thank you very much MatmaRex ! [20:12:46] 10Operations, 10Wikimedia-Mailing-lists: Mailing-list sending notifications for inexistent spam messages - https://phabricator.wikimedia.org/T251816 (10Teles) Writing just to confirm that I didn’t receive any more mails. Thank you!
[20:13:33] thanks [20:13:38] 10Operations, 10Graphoid, 10serviceops, 10Core Platform Team (Icebox), 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26): Add WDQS and REST API to CORS whitelist - https://phabricator.wikimedia.org/T252810 (10Jseddon) 05Open→03Declined p:05Triage→03Lowest [20:13:42] 10Operations, 10Graphoid, 10serviceops, 10Core Platform Team (Icebox), 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jseddon) [20:14:31] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Duplicate "moderator request(s) waiting" emails sent to list admins - https://phabricator.wikimedia.org/T250032 (10herron) 05Open→03Resolved a:03herron Long term fix has been put in place. Resolving! [20:14:34] MatmaRex: thank you so much for pointing at the dummy workaround page :) [20:14:51] anytime. thanks for deploying [20:15:02] (03CR) 10Cwhite: [C: 03+2] profile: add mailman outbound queue monitoring [puppet] - 10https://gerrit.wikimedia.org/r/596471 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [20:15:11] (03PS2) 10Cwhite: profile: add mailman outbound queue monitoring [puppet] - 10https://gerrit.wikimedia.org/r/596471 (https://phabricator.wikimedia.org/T236505) [20:15:11] then if it is all cached, I am not sure how to purge it on all wikis [20:15:25] !log hashar@deploy1001 Synchronized php-1.35.0-wmf.32/skins/Vector/includes/VectorTemplate.php: Allow plain text labels in side bar - T252727 (duration: 01m 06s) [20:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:30] T252727: Regression: Plain text sidebar section stopped working in Vector - https://phabricator.wikimedia.org/T252727 [20:15:32] hashar: wmf.32 was only deployed on group0, right? 
[20:15:44] group0 and group1 [20:16:06] I have pushed group1 yesterday [20:16:14] oh, bleh [20:16:47] hmm wikitech.wikimedia.org sidebar is now fixed [20:16:52] !log moved codereview.tar.gz and with_r.tar.gz from miscweb1002 to cumin1001 to free space [20:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:54] so maybe it is purged asynchronously somehow [20:17:01] i mean, worst case is that we wait 24 hours? [20:17:16] or something magically fix it somehow [20:17:19] i think also any edit to localisation messages might clear it, if you need to do it manually [20:18:37] might be [20:18:47] I will just claim it to be fixed ;] [20:19:21] (03PS1) 10Cwhite: profile: check_prometheus expects int [puppet] - 10https://gerrit.wikimedia.org/r/596507 (https://phabricator.wikimedia.org/T236505) [20:19:30] and it was definitely worth a blocking task [20:19:57] (03CR) 10Herron: "> > Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/596239 (owner: 10Herron) [20:20:31] (03Abandoned) 10Herron: prometheus::class_config: use FQDN by default [puppet] - 10https://gerrit.wikimedia.org/r/596239 (owner: 10Herron) [20:20:38] (03CR) 10Cwhite: [C: 03+2] profile: check_prometheus expects int [puppet] - 10https://gerrit.wikimedia.org/r/596507 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [20:20:51] (03PS2) 10Cwhite: profile: check_prometheus expects int [puppet] - 10https://gerrit.wikimedia.org/r/596507 (https://phabricator.wikimedia.org/T236505) [20:38:38] 10Operations, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10colewhite) It looks to me like membership to the `nda` ldap group will also fulfill this requirement. Since there is an NDA on file, it's probably fine to use t... 
[20:49:28] I am off, see you tomorrow [20:52:24] (03PS1) 10Papaul: DNS: Add mgmt and production for thanos-be200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/596510 [21:04:45] (03PS3) 10Krinkle: Enable $wgResourceLoaderUseObjectCacheForDeps for testwiki/test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591388 (https://phabricator.wikimedia.org/T113916) (owner: 10Aaron Schulz) [21:04:49] (03PS4) 10Krinkle: Enable $wgResourceLoaderUseObjectCacheForDeps for testwiki/test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591388 (https://phabricator.wikimedia.org/T113916) (owner: 10Aaron Schulz) [21:04:56] (03CR) 10Krinkle: [C: 03+1] Enable $wgResourceLoaderUseObjectCacheForDeps for testwiki/test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591388 (https://phabricator.wikimedia.org/T113916) (owner: 10Aaron Schulz) [21:08:36] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:37] (03CR) 10Urbanecm: [C: 04-1] "otherwise, LGTM" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) (owner: 10Zoranzoki21) [21:21:19] (03PS9) 10Zoranzoki21: Initial config for awawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) [21:22:12] (03PS1) 10Cwhite: profile: add anchor to mailman monitoring section [puppet] - 10https://gerrit.wikimedia.org/r/596517 (https://phabricator.wikimedia.org/T236505) [21:22:50] (03CR) 10jerkins-bot: [V: 04-1] profile: add anchor to mailman monitoring section [puppet] - 10https://gerrit.wikimedia.org/r/596517 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [21:23:02] (03CR) 10Urbanecm: [C: 04-1] Initial config for awawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) (owner: 10Zoranzoki21) 
[21:24:11] (03PS10) 10Zoranzoki21: Initial config for awawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) [21:27:48] (03PS2) 10Cwhite: profile: add anchor to mailman monitoring section [puppet] - 10https://gerrit.wikimedia.org/r/596517 (https://phabricator.wikimedia.org/T236505) [21:42:38] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:53] !log depooled wdqs2006 while lag recovers [21:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:42] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) (owner: 10Zoranzoki21) [21:49:58] (03PS4) 10RhinosF1: Add 2 sites to copy upload domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596398 (https://phabricator.wikimedia.org/T252600) [21:52:59] (03PS4) 10RhinosF1: Create Gapura (Portal) namespace for jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595469 (https://phabricator.wikimedia.org/T252343) [22:07:35] 10Operations, 10MediaWiki-extensions-CodeReview: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10Jdforrester-WMF) Presumably this is now done and the put-the-redirect-in is T205361? 
[22:45:59] (03CR) 10Zoranzoki21: Enable subpages in Page namespace on napwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: 10Zoranzoki21) [22:46:29] (03PS3) 10Zoranzoki21: Enable subpages in Page namespace on napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) [22:47:36] (03CR) 10Zoranzoki21: Enable subpages in Page namespace on napwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: 10Zoranzoki21) [22:49:15] * RhinosF1 here for swat [22:55:33] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops, 10Sustainability (Incident Prevention): Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10tstarling) >>! In T245170#6137869, @BPirkle wrote: > @tstarling , that sou... [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200514T2300). [23:00:04] RhinosF1: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:11] o/ [23:00:36] I can do the SWAT [23:01:01] RoanKattouw: upload-by-url one is a straight sync, the other two need debug [23:01:06] (03CR) 10Catrope: [C: 03+2] Add 2 sites to copy upload domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596398 (https://phabricator.wikimedia.org/T252600) (owner: 10RhinosF1) [23:01:16] Sounds good, doing the upload domains one first for that reason [23:01:21] :) [23:01:56] (03Merged) 10jenkins-bot: Add 2 sites to copy upload domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596398 (https://phabricator.wikimedia.org/T252600) (owner: 10RhinosF1) [23:09:55] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add *.ub.uni-heidelberg.de and hq.eso.org to $wgCopyUploadDomains (T252600, T252726) (duration: 01m 07s) [23:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:59] T252600: Add *.ub.uni-heidelberg.de/helios/digi/digilit.html to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T252600 [23:09:59] T252726: Add hq.eso.org to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T252726 [23:10:19] RoanKattouw: not bothered which next [23:10:50] (03PS2) 10Catrope: Revert "[vecwiki] Update project logo with temporary 20k branding" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596431 (https://phabricator.wikimedia.org/T252770) (owner: 10RhinosF1) [23:10:55] (03CR) 10Catrope: [C: 03+2] Revert "[vecwiki] Update project logo with temporary 20k branding" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596431 (https://phabricator.wikimedia.org/T252770) (owner: 10RhinosF1) [23:11:09] (03PS5) 10Catrope: Create Gapura (Portal) namespace for jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595469 (https://phabricator.wikimedia.org/T252343) (owner: 10RhinosF1) [23:11:22] (03CR) 10Catrope: [C: 03+2] Create Gapura (Portal) namespace for jvwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/595469 (https://phabricator.wikimedia.org/T252343) (owner: 10RhinosF1) [23:11:36] (03Merged) 10jenkins-bot: Revert "[vecwiki] Update project logo with temporary 20k branding" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596431 (https://phabricator.wikimedia.org/T252770) (owner: 10RhinosF1) [23:11:41] They're non-overlapping files, so let's do both [23:11:47] perfect [23:11:59] let me know which debug when they're on [23:12:06] (03Merged) 10jenkins-bot: Create Gapura (Portal) namespace for jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595469 (https://phabricator.wikimedia.org/T252343) (owner: 10RhinosF1) [23:12:32] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:17:30] RoanKattouw: Portal namespace one is good on mwdebug1002 [23:17:43] and so is the logo change [23:17:48] both good to sync [23:18:23] Yay thanks [23:18:31] I was just about to tell you that they were on mwdebug1002 but you beat me to it [23:18:49] RoanKattouw: namespace one will want namespaceDupes.php running and logos are normally purged in varnish. I know you by now! [23:19:36] I had remembered the Varnish purge but had forgotten about namespaceDupes, thanks for reminding me or I wouldn't have remembered [23:19:45] no problem!
[23:20:15] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Create Gapura (Portal) namespace on jvwiki (T252343) (duration: 01m 06s) [23:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:19] T252343: Creating namespaces "portal" and "portal talk" on Javanese Wikipedia - https://phabricator.wikimedia.org/T252343 [23:21:50] is there a techy person available to help me w/ something [23:21:58] yes [23:23:25] !log Ran namespaceDupes.php for T252343 [23:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:55] !log catrope@deploy1001 Synchronized static/images/project-logos/: Revert temporary 20k logo for vecwiki (T252770) (duration: 01m 06s) [23:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:58] T252770: Restore the normal VEC Wikipedia logo - https://phabricator.wikimedia.org/T252770 [23:26:16] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops, 10Sustainability (Incident Prevention): Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10BPirkle) That all sounds like a win to me. [23:29:05] OK and purges done [23:29:38] RoanKattouw: perfect, thanks as always! [23:32:16] 10Operations, 10Traffic, 10Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (10dpifke) The dashboard was broken such that it would not load even load the settings page. It seemed to hang indefinitely; I left it open in the backgrou... [23:32:29] SWAT done afaik, I’m off now. [23:37:42] 10Operations, 10Traffic, 10Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (10dpifke) Regarding the label pollution, I added a regex to $dc which excludes values containing numerals (and thus hostnames). This fixes the drop-downs... 
[23:40:50] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production for thanos-be200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/596510 (owner: 10Papaul) [23:42:12] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install thanos-be200[1-4] - https://phabricator.wikimedia.org/T251634 (10Papaul) [23:47:36] 10Operations, 10ops-codfw: Degraded RAID on db2137 - https://phabricator.wikimedia.org/T252688 (10Papaul) 05Open→03Declined The error went away after the disk finished the rebuild process [23:48:04] 10Operations, 10ops-codfw: Degraded RAID on db2138 - https://phabricator.wikimedia.org/T252687 (10Papaul) 05Open→03Declined The error went away after the disk finished the rebuild process [23:50:39] 10Operations, 10ops-codfw: ps1-a7-codfw - monitoring alerts - https://phabricator.wikimedia.org/T251987 (10Papaul) 05Open→03Resolved Resolving this since all alerts are green now in Icinga. [23:55:33] 10Operations, 10Traffic, 10Performance-Team (Radar), 10Sustainability (Incident Prevention): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (10Krinkle) [23:59:57] (03PS1) 10Alex Monk: cloud: Whitelist testlabs-dns-manager for access from cloud subnets [puppet] - 10https://gerrit.wikimedia.org/r/596528 (https://phabricator.wikimedia.org/T252732)