[00:15:39] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200) [00:17:50] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [00:29:17] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team, 10Patch-For-Review: test.wp is using test2.wp's message cache - https://phabricator.wikimedia.org/T197450#4290095 (10Jdforrester-WMF) >>! In T197450#4291900, @Legoktm wrote: > OK, so this is fixed, but some of the core messages are missing -... [02:33:54] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.8) (duration: 14m 12s) [02:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:29] PROBLEM - HHVM rendering on mw2252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:49:29] RECOVERY - HHVM rendering on mw2252 is OK: HTTP OK: HTTP/1.1 200 OK - 75169 bytes in 0.293 second response time [02:51:41] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.999) (duration: 07m 40s) [02:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:09] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200) [03:03:10] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy [03:26:29] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 880.60 seconds [03:35:10] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 249.14 seconds [04:21:52] (03CR) 1020after4: [C: 031] Phab: Allow aklapper to purge user caches 
[puppet] - 10https://gerrit.wikimedia.org/r/441012 (owner: 10Aklapper) [04:39:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [04:39:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [05:12:23] 10Operations, 10Wikimedia-Mailing-lists: Request for a mailing list for VVIT WikiConnect - https://phabricator.wikimedia.org/T191702#4304342 (10Dzahn) 05Open>03Resolved password reset per https://wikitech.wikimedia.org/wiki/Mailman#TLDR ``` [fermium:~] $ sudo /var/lib/mailman/bin/change_pw -l vvitwikicon... [05:14:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [05:14:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 [05:47:01] 10Operations, 10LDAP: Certificate Renewal for corp.wikimedia.org - https://phabricator.wikimedia.org/T197840#4304367 (10Aklapper) [05:50:20] 10Operations, 10Wikimedia-Mailing-lists: New mail list for Signpost team - https://phabricator.wikimedia.org/T197732#4304368 (10Aklapper) Their (slightly obfuscated) email address will also appear under https://lists.wikimedia.org/mailman/listinfo/wikipedia-en-signpost-priv#admins [06:32:10] PROBLEM - puppet last run on labvirt1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/run-puppet-agent] [06:32:39] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh] [06:34:49] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [06:34:59] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [06:38:15] <_joe_> uhm more issues? [06:45:22] (03PS30) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [06:45:24] (03PS1) 10EBernhardson: [WIP] Rework elasticsearch ferm for multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/441337 [06:45:26] (03PS1) 10EBernhardson: [WIP] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 [06:46:49] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [06:47:01] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Rework elasticsearch ferm for multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/441337 (owner: 10EBernhardson) [06:47:04] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (owner: 10EBernhardson) [06:57:30] RECOVERY - puppet last run on labvirt1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:50] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:11:11] (03PS2) 10EBernhardson: [WIP] Rework elasticsearch ferm for multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/441337 [07:11:13] (03PS2) 10EBernhardson: [WIP] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 [07:12:41] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Split instance define out of elasticsearch 
class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (owner: 10EBernhardson) [07:22:29] (03PS31) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [07:23:37] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [07:27:22] (03PS3) 10EBernhardson: [WIP] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 [07:28:35] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (owner: 10EBernhardson) [07:35:52] (03PS4) 10EBernhardson: [WIP] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 [07:35:54] (03PS32) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [07:36:57] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (owner: 10EBernhardson) [08:10:39] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404) [08:11:39] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [08:13:41] 10Operations, 10Patch-For-Review: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4304450 (10Dzahn) a:05Dzahn>03None Unassigning this ticket from me temporarily while i'm on vacation. I will take it back once i return but also want to make clear it'... 
[08:14:31] (03PS1) 10Mobrovac: Proton: Increase the monitoring time out to 15 seconds [puppet] - 10https://gerrit.wikimedia.org/r/441345 (https://phabricator.wikimedia.org/T186748) [08:15:06] 10Operations, 10Patch-For-Review: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4304453 (10Dzahn) status: wikidata related crons are moved, other mw crons are still to be moved (by switching [08:16:29] (03PS1) 10Dzahn: switch mw_maintenance server to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/441346 (https://phabricator.wikimedia.org/T192092) [08:17:17] 10Operations, 10Patch-For-Review: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4304455 (10Dzahn) other pending changes, mostly to decom terbium once switch is complete: https://gerrit.wikimedia.org/r/#/q/topic:terbium+(status:open) [08:19:26] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/compiler02/11549/" [puppet] - 10https://gerrit.wikimedia.org/r/441345 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [08:24:36] (03PS1) 10Vgutierrez: vcl: Bump AES128-SHA pageview replacement to 4% [puppet] - 10https://gerrit.wikimedia.org/r/441347 (https://phabricator.wikimedia.org/T192555) [08:34:58] (03CR) 10DCausse: Add cirrussearch settings for wikibase (2/3) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [08:35:07] (03CR) 10DCausse: Add cirrussearch settings for wikibase (1/3) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [08:35:23] (03PS20) 10DCausse: Add cirrussearch settings for wikibase (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [08:35:35] (03PS5) 10DCausse: Add cirrussearch settings for wikibase (2/3) [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717) [08:35:37] (03PS5) 10DCausse: Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717) [08:36:42] (03PS2) 10Mobrovac: Proton: Increase the monitoring time out to 10 seconds [puppet] - 10https://gerrit.wikimedia.org/r/441345 (https://phabricator.wikimedia.org/T186748) [08:39:07] (03CR) 10Giuseppe Lavagetto: [C: 032] "I consider this a temporary measure. Will discuss on ticket the merit of the issue." [puppet] - 10https://gerrit.wikimedia.org/r/441345 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [08:46:06] (03PS1) 10Giuseppe Lavagetto: dhcpd: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/441348 [08:47:18] (03CR) 10Giuseppe Lavagetto: [C: 032] dhcpd: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/441348 (owner: 10Giuseppe Lavagetto) [08:49:19] RECOVERY - Check systemd state on install2002 is OK: OK - running: The system is fully operational [08:50:00] RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational [08:51:05] <_joe_> that looks better :D [08:53:00] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:53:10] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:06:49] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404) [09:07:50] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [09:11:30] <_joe_> hashar: around? 
[09:22:19] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received [09:24:20] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [09:41:43] 10Operations, 10Performance-Team, 10Availability (MediaWiki-MultiDC), 10Services (blocked): Consider REST with SSL (HyperSwitch/Cassandra) for session storage - https://phabricator.wikimedia.org/T134811#4304623 (10Imarlier) @tstarling It's our (Perf team's) impression that we wouldn't be the ones taking th... [10:17:29] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (e [10:18:30] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy [10:24:43] (03CR) 10MarcoAurelio: "As 'WIP' this cannot be merged by anyone :)" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/440644 (https://phabricator.wikimedia.org/T197503) (owner: 10Krinkle) [10:41:38] 10Operations, 10monitoring, 10Performance-Team (Radar): Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105#4304754 (10Imarlier) [10:43:30] (03PS3) 10Giuseppe Lavagetto: Proton: Increase the monitoring time out to 10 seconds [puppet] - 10https://gerrit.wikimedia.org/r/441345 
(https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [10:44:04] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Proton: Increase the monitoring time out to 10 seconds [puppet] - 10https://gerrit.wikimedia.org/r/441345 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [11:08:24] !log Refreshed operations-puppet-tests-docker jenkins job to a new Docker container build that includes isc-dhcp-server | https://gerrit.wikimedia.org/r/c/integration/config/+/441367 [11:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:11] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/440822 (https://phabricator.wikimedia.org/T180183) (owner: 10Giuseppe Lavagetto) [11:13:29] <_joe_> hashar: thanks, it works \o/ [11:16:39] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [11:16:40] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [11:23:25] (03PS1) 10Elukey: profile::prometheus::alerts: tune druid alarms [puppet] - 10https://gerrit.wikimedia.org/r/441376 [11:24:34] (03CR) 10Elukey: [C: 032] profile::prometheus::alerts: tune druid alarms [puppet] - 10https://gerrit.wikimedia.org/r/441376 (owner: 10Elukey) [11:25:04] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4304912 (10mobrovac) [11:25:04] sorry breaking the no-merge policy to avoid a lot of false alerts for analytics [11:25:15] (we are testing Druid datasources) [11:28:25] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar), 10Services (blocked): Consider REST with SSL (HyperSwitch/Cassandra) for session storage - https://phabricator.wikimedia.org/T134811#4304930 (10Krinkle) [11:46:36] 10Operations, 10Proton, 
10SRE-Access-Requests: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4304977 (10mobrovac) [11:46:55] 10Operations, 10Proton, 10SRE-Access-Requests: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4304987 (10mobrovac) [11:47:00] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4304988 (10mobrovac) [11:48:10] 10Operations, 10Proton, 10SRE-Access-Requests: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4304977 (10mobrovac) @pmiazga, @Niedzielski and @phuedx, in order for this access to be granted, it needs to be approved by your respective managers. Pl... [11:51:27] (03PS1) 10Mobrovac: Add niedzielski, pmiazga and phuedx to deploy-service [puppet] - 10https://gerrit.wikimedia.org/r/441379 (https://phabricator.wikimedia.org/T197857) [11:59:34] (03PS1) 10Dzahn: mw_maintenace: remove temp change for wikidata crons [puppet] - 10https://gerrit.wikimedia.org/r/441381 (https://phabricator.wikimedia.org/T192092) [12:02:59] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4305040 (10Dzahn) a:05Dzahn>03None Meanwhile we have wmf4727 and it is in site.pp and using the phabricator puppet... [12:04:31] 10Operations, 10Proton, 10SRE-Access-Requests, 10Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4305042 (10Niedzielski) @mobrovac, @phuedx is my manager so good to go {icon thumbs-up} Thank you! 
[12:05:54] 10Operations, 10Proton, 10SRE-Access-Requests, 10Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4305048 (10mobrovac) >>! In T197857#4305042, @Niedzielski wrote: > @mobrovac, @phuedx is my manager so good to go {icon thumbs-up}... [12:09:23] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4305050 (10Dzahn) a:05Dzahn>03None for current status please see T190568#4305040 [12:10:31] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915#4305054 (10Dzahn) a:05Dzahn>03None This can be done after T192092 is resolved. I am temporarily unassigning it from me while i'm on vacation for a couple weeks. If others get to it that would... [12:11:10] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4305058 (10Paladox) Is the repos being resynced to phab1002? [12:11:16] 10Operations, 10Proton, 10SRE-Access-Requests, 10Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4305059 (10phuedx) I approve this request for @Niedzielski and @pmiazga. @dr0ptp4kt will have to approve this request for me. [12:12:51] 10Operations, 10ops-esams, 10Patch-For-Review: install/designate other machine as esams bastion - https://phabricator.wikimedia.org/T184936#4305061 (10Dzahn) a:05Dzahn>03None merged and open pending changes in gerrit: https://gerrit.wikimedia.org/r/#/q/topic:bast3003+(status:open+OR+status:merged) tem... 
[12:14:30] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4305064 (10Paladox) Path to be rsync is /srv/repos [12:16:46] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4305066 (10Dzahn) The code is there to allow a user to do it. But it's not auto-syncing in the background. It needs a... [12:21:00] 10Operations, 10Wikimedia-Mailing-lists: New mail list for Signpost team - https://phabricator.wikimedia.org/T197732#4301121 (10Kudpung) Email of Editor-in-Chief: cs@edubkk.org [12:24:19] (03PS1) 10Paladox: phabricator: Add new var phabricator_server_new [puppet] - 10https://gerrit.wikimedia.org/r/441384 [12:28:03] (03PS2) 10Paladox: phabricator: Add new var phabricator_server_new [puppet] - 10https://gerrit.wikimedia.org/r/441384 [12:28:25] (03CR) 10Paladox: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/441384 (owner: 10Paladox) [12:28:52] (03PS3) 10Paladox: phabricator: Add new var phabricator_server_new [puppet] - 10https://gerrit.wikimedia.org/r/441384 (https://phabricator.wikimedia.org/T190568) [12:36:35] 10Operations, 10Proton, 10Services (doing): Increase the CPU count for proton[12]00[12] - https://phabricator.wikimedia.org/T197862#4305108 (10mobrovac) p:05Triage>03High [12:37:13] (03PS4) 10Paladox: phabricator: Add new var phabricator_server_new [puppet] - 10https://gerrit.wikimedia.org/r/441384 (https://phabricator.wikimedia.org/T190568) [12:37:26] 10Operations, 10Proton, 10Services (doing): Increase the CPU count for proton[12]00[12] - https://phabricator.wikimedia.org/T197862#4305119 (10mobrovac) [12:37:29] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4305120 (10mobrovac) [12:49:05] !log remove labvirt1019 canary to start debug of network [12:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:51] (03PS5) 10Paladox: phabricator: Add new var phabricator_server_new [puppet] - 10https://gerrit.wikimedia.org/r/441384 (https://phabricator.wikimedia.org/T190568) [13:01:04] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/11552/" [puppet] - 10https://gerrit.wikimedia.org/r/441384 (https://phabricator.wikimedia.org/T190568) (owner: 10Paladox) [13:14:17] 10Operations, 10Proton, 10Services (doing): Increase the CPU count for proton[12]00[12] - https://phabricator.wikimedia.org/T197862#4305152 (10mobrovac) [13:17:03] (03PS6) 10Paladox: phabricator: Add new var phabricator_server_new [puppet] - 10https://gerrit.wikimedia.org/r/441384 (https://phabricator.wikimedia.org/T190568) [13:17:40] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Add new var phabricator_server_new [puppet] - 10https://gerrit.wikimedia.org/r/441384 
(https://phabricator.wikimedia.org/T190568) (owner: 10Paladox) [13:18:26] (03PS7) 10Paladox: phabricator: Add new var phabricator_server_new [puppet] - 10https://gerrit.wikimedia.org/r/441384 (https://phabricator.wikimedia.org/T190568) [13:19:51] (03PS8) 10Paladox: phabricator: Add new var phabricator_server_new [puppet] - 10https://gerrit.wikimedia.org/r/441384 (https://phabricator.wikimedia.org/T190568) [13:26:17] (03CR) 10Vgutierrez: [C: 032] vcl: Bump AES128-SHA pageview replacement to 4% [puppet] - 10https://gerrit.wikimedia.org/r/441347 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [13:26:45] (03PS2) 10Vgutierrez: vcl: Bump AES128-SHA pageview replacement to 4% [puppet] - 10https://gerrit.wikimedia.org/r/441347 (https://phabricator.wikimedia.org/T192555) [13:28:33] !log Bump AES128-SHA pageview replacement to 4% [13:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:19] 10Operations, 10Proton, 10SRE-Access-Requests, 10Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4305265 (10dr0ptp4kt) Approved. [14:10:34] (03PS1) 10Paladox: Gerrit: Set cache for groups [puppet] - 10https://gerrit.wikimedia.org/r/441391 [14:12:40] (03PS2) 10Paladox: Gerrit: Set cache for groups [puppet] - 10https://gerrit.wikimedia.org/r/441391 [14:38:19] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar), 10Services (designing): Consider REST with SSL (HyperSwitch/Cassandra) for session storage - https://phabricator.wikimedia.org/T134811#4305379 (10mobrovac) >>! In T134811#4304623, @Imarlier wrote: > @tstarling It's our (Perf team... 
[14:43:30] (03CR) 10Hashar: "Here are the caches from a 'gerrit show-caches' output, I have removed a few entries that do not seem interesting at all" [puppet] - 10https://gerrit.wikimedia.org/r/441391 (owner: 10Paladox) [14:45:20] (03PS1) 10Giuseppe Lavagetto: [WIP] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) [14:46:40] (03PS1) 10Paladox: Gerrit: Increase changeid_project and ldap_usernames caches [puppet] - 10https://gerrit.wikimedia.org/r/441397 [14:47:06] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [14:47:19] (03PS2) 10Paladox: Gerrit: Increase changeid_project and ldap_usernames caches [puppet] - 10https://gerrit.wikimedia.org/r/441397 [14:47:46] (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/441397 (owner: 10Paladox) [14:48:52] (03CR) 10Paladox: "> Here are the caches from a 'gerrit show-caches' output, I have" [puppet] - 10https://gerrit.wikimedia.org/r/441391 (owner: 10Paladox) [14:53:58] (03PS3) 10Paladox: Gerrit: Increase changeid_project and ldap_usernames caches [puppet] - 10https://gerrit.wikimedia.org/r/441397 [15:06:51] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4305423 (10ayounsi) >>! In T194964#4298470, @Bstorm wrote: > The bad, for some reason, even though eth1 shows up ok as up, the VM on there... 
[15:09:19] 10Operations, 10monitoring: come up with a suggestion how to structure wiki pages for Icinga reaction play books - https://phabricator.wikimedia.org/T197873#4305428 (10Dzahn) [15:46:58] 10Operations, 10Proton, 10SRE-Access-Requests, 10Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4305603 (10pmiazga) @phuedx thanks for your approval. [15:58:27] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:58:47] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [16:03:52] (03PS5) 10EBernhardson: [WIP] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 [16:03:54] (03PS33) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [16:04:03] <_joe_> a spike of 5xx on cache-misc in esams [16:05:07] (03PS6) 10EBernhardson: [WIP] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 [16:05:09] (03PS34) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [16:11:26] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:11:46] RECOVERY - Misc HTTP 5xx reqs/min on 
graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [16:18:26] (03PS1) 10RobH: adding kharlan@wikimedia.org to ldap users section [puppet] - 10https://gerrit.wikimedia.org/r/441414 (https://phabricator.wikimedia.org/T197886) [16:24:56] (03CR) 10RobH: [C: 031] "we're on a code freeze week due to SRE offsite. This can be merged on Monday 2018-06-25, when the code freeze is over." [puppet] - 10https://gerrit.wikimedia.org/r/441414 (https://phabricator.wikimedia.org/T197886) (owner: 10RobH) [16:25:48] (03CR) 10RobH: [C: 032] adding kharlan@wikimedia.org to ldap users section [puppet] - 10https://gerrit.wikimedia.org/r/441414 (https://phabricator.wikimedia.org/T197886) (owner: 10RobH) [16:40:39] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707#4305786 (10Ottomata) p:05High>03Triage [16:40:42] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707#4300251 (10Ottomata) p:05Triage>03High [16:46:00] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707#4305808 (10elukey) a:05elukey>03None [16:46:59] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707#4300251 (10elukey) I had a chat with Chris today and the 2T disk should do just fine. Removing myself as assignee to let DC-Ops handling the hw swap. Thanks! [16:50:52] (03CR) 10Eevans: [V: 032 C: 032] "Great; Thanks!" 
[software/cassandra-twcs] - 10https://gerrit.wikimedia.org/r/441235 (https://phabricator.wikimedia.org/T162814) (owner: 10Thcipriani) [16:54:42] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4305819 (10Bstorm) @Cmjohnson Could you do me a favor and cancel out of the blasted lifecycle controller setup view on the... [16:55:21] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#4305821 (10Jgreen) [16:57:08] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#3973216 (10Jgreen) OS install is done, puppet is enabled, and it has been added to the FR deploy tools. We need to test smashpig functionality on PHP7 and finally do the NAT switch and enable monitor... [16:59:22] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#4305835 (10Jgreen) [16:59:45] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#3973216 (10Jgreen) 05Open>03Resolved a:03Jgreen [17:02:15] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476#4305848 (10Jgreen) a:05Jgreen>03None [17:17:19] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476#4305907 (10cwdent) a:03cwdent [17:24:37] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200) [17:25:47] RECOVERY - proton 
endpoints health on proton2002 is OK: All endpoints are healthy [17:30:15] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4305955 (10Bstorm) Nevermind, I found a way that worked (HTML5 remote console in the GUI) [17:30:51] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4305956 (10Krenair) Replacing it myself [17:32:37] (03PS7) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 [17:32:39] (03PS4) 10EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 [17:32:41] (03PS3) 10EBernhardson: [WIP] Rework elasticsearch ferm for multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/441337 [17:32:43] (03PS7) 10EBernhardson: [WIP] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 [17:32:45] (03PS35) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [17:38:01] ebernhardson, hey [17:38:06] Krenair: hey [17:38:15] ebernhardson, was fakelogstash.search.eqiad.wmflabs yours? [17:38:20] Krenair: yes [17:38:29] i was trying to test some puppet stuff [17:38:29] ebernhardson, ok. did you mean to have it request a cert from deployment-puppetmaster03? 
[17:38:44] Krenair: no, in my puppet testing i accidentally re-parented it to a different puppetmaster
[17:38:54] heh ok
[17:39:01] (i was trying to make it test a deployment-logstashN.deployment-prep.eqiad.wmflabs role)
[17:39:20] I found it didn't exist anymore, so I removed the cert request
[17:40:18] Need a sanity check - should I chunk out the refreshLinks script (https://www.mediawiki.org/wiki/Manual:RefreshLinks.php) if I'm running it on a small (~10k pages) wiki on terbium?
[17:41:01] Niharika: while i haven't run that script, in general 10k is pretty small and probably fine
[17:41:27] ebernhardson: Gotcha. Thanks!
[17:43:46] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 304 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[17:48:56] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 9 probes of 304 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[17:50:51] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4305991 (Bstorm) Got it to attempt PXE on the interface that is actually plugged in, however, dhcp failed.
[18:14:48] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4306018 (Bstorm) Current status: labstore1009 appears to not be plugged in on any port. labstore1008 is plugged in on...
[18:15:27] PROBLEM - puppet last run on ores1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:25:23] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4306051 (Bstorm) It looks to me like it is not?
[18:34:54] (PS2) C. Scott Ananian: Enable testing LanguageConverter in sandboxes on deploymentwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/438079 (https://phabricator.wikimedia.org/T143628)
[18:37:09] Operations, Proton, SRE-Access-Requests, Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4304977 (RobH) Just reviewing this as clinic duty this week, and this seems to be a deploy service, but doesn't list sudo rights...
[18:39:30] Operations, Analytics, SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4306060 (RobH) Please note that next Monday's SRE team meeting has been canceled, as the SRE off-site is occurring this week. If this access needs to be approved before Monday, July...
[18:39:44] Operations, Analytics, SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4306062 (RobH)
[18:40:56] RECOVERY - puppet last run on ores1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:42:06] (PS1) C. Scott Ananian: Fix en-rtl in Special:SiteMatrix in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/441422 (https://phabricator.wikimedia.org/T195675)
[18:42:32] (PS2) C. Scott Ananian: Fix en-rtl in Special:SiteMatrix in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/441422 (https://phabricator.wikimedia.org/T195675)
[19:07:57] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4306122 (Krenair) Alright, should be back to roughly where we were 2 weeks ago now. npm package is still failing,...
[19:10:33] Operations, Proton, SRE-Access-Requests, Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4304977 (Krenair) I think that group is just trusted by keyholder or something?
[19:26:53] (CR) Smalyshev: [C: 1] Add cirrussearch settings for wikibase (2/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717) (owner: DCausse)
[19:27:51] (CR) Smalyshev: [C: 1] Add cirrussearch settings for wikibase (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: DCausse)
[19:33:47] Operations, Research, Research-collaborations, Research-management, SRE-Access-Requests: Remove shell access to analytics-privatedata-users for DYNKM - https://phabricator.wikimedia.org/T197895#4306165 (Capt_Swing)
[19:34:19] Operations, Ops-Access-Reviews, Research, Research-collaborations, and 3 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4024896 (Capt_Swing)
[19:34:22] Operations, Research, Research-collaborations, Research-management, SRE-Access-Requests: Remove shell access to analytics-privatedata-users for DYNKM - https://phabricator.wikimedia.org/T197895#4306177 (Capt_Swing)
[19:59:29] (PS4) Paladox: phabricator: Make phd.taskmasters configurable with hiera [puppet] - https://gerrit.wikimedia.org/r/439645
[20:08:12] (PS1) RobH: remove oliver's access to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/441434 (https://phabricator.wikimedia.org/T197895)
[20:09:18] Operations, Research, Research-collaborations, Research-management, and 2 others: Remove shell access to analytics-privatedata-users for DYNKM - https://phabricator.wikimedia.org/T197895#4306258 (RobH) p:Triage>Normal
[20:10:07] Operations, Research, Research-collaborations, Research-management, and 2 others: Remove shell access to analytics-privatedata-users for ironholds - https://phabricator.wikimedia.org/T197895#4306165 (RobH)
[20:14:36] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[20:15:22] ^ looking
[20:29:46] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[20:32:14] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4306351 (Bstorm) To be clear, looks to me like labvirt1020 is not connected to 10G Ethernet. Labvirt1019 is working perfectly on both i...
[20:43:50] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4306411 (Cmjohnson) I am pretty sure it’s not connected to 10G. I will take care of next week when I get back from the off site.
[20:54:35] (PS1) ArielGlenn: move iohandler code for compression/decompression out to a separate file [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/441484
[20:54:38] (PS1) ArielGlenn: use iohandlers for recompressxml input and output [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/441485
[20:55:09] not really here! pay no attention to the commits from the person behind the curtain
[21:28:02] Operations, ops-eqiad, DC-Ops, cloud-services-team: labvirt1019 IPMI alert - https://phabricator.wikimedia.org/T196751#4306464 (Bstorm) This appears to have gone away?
[21:36:54] (PS1) Thcipriani: Beta: Add scap repository for dumps/dumps [puppet] - https://gerrit.wikimedia.org/r/441491
[21:38:25] !log banned elastic1036 from search cluster, waited for all load to shift away, and unbanned
[21:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:33] (CR) Alex Monk: [C: 1] "matches prod" [puppet] - https://gerrit.wikimedia.org/r/441491 (owner: Thcipriani)
[21:44:30] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4306521 (Krenair) @thcipriani's https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441491/ will fix broken stu...
[21:52:33] Operations, Proton, SRE-Access-Requests, Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4306556 (mobrovac) The users in the `deploy-service` group can `sudo service (start|stop|restart) *` on the target nodes, so it...
[21:53:16] (CR) Alex Monk: [C: 1] "and works in beta" [puppet] - https://gerrit.wikimedia.org/r/441491 (owner: Thcipriani)
[22:47:27] Operations, Proton, SRE-Access-Requests, Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4306642 (RobH) Thanks for feedback, duly noted and set in the proper column for SRE meeting review approval.
[22:51:16] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[22:52:17] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api