[00:07:05] serviceops: mc1024 broke - replace it or remove it from configs - https://phabricator.wikimedia.org/T272078 (aaron) >>! In T272078#6763020, @jijiki wrote: > @Krinkle @aaron the gutter pool sets a max TTL of 600s to any key with a TTL over 600s, do you think it is fine to keep the gutter-pool substitute the m...
[00:08:06] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2363.codfw.wmnet'] ` an...
[00:09:01] serviceops, SRE: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Legoktm) Is this a problem with icinga? ` legoktm@mwdebug1003:~$ /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50 OK: opcache is healthy ` Doesn't seem like a permissions issue ei...
[00:09:06] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2365.codfw.wmnet'] ` an...
[00:10:08] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2367.codfw.wmnet'] ` an...
[00:10:40] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2369.codfw.wmnet'] ` an...
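A note on the gutter-pool behaviour quoted in T272078 above (any key with a TTL over 600s gets a TTL of 600s): the rule amounts to a simple cap. The following is a minimal sketch of that arithmetic only; the function name `clamp_ttl` and the variable `GUTTER_MAX_TTL` are made up for illustration and are not taken from the actual gutter-pool configuration.

```shell
#!/bin/sh
# Hypothetical illustration of the TTL cap described in the log:
# TTLs above the maximum are reduced to the maximum, others pass through.
GUTTER_MAX_TTL=600   # seconds, as quoted in the log

clamp_ttl() {
    ttl=$1
    if [ "$ttl" -gt "$GUTTER_MAX_TTL" ]; then
        echo "$GUTTER_MAX_TTL"
    else
        echo "$ttl"
    fi
}

clamp_ttl 86400   # a one-day TTL is capped
clamp_ttl 300     # a five-minute TTL is left alone
```

How special values such as a TTL of 0 are treated is not stated in the log, so the sketch does not try to model them.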
[00:22:10] serviceops, SRE: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Dzahn) This always happens after reimaging a server and then disappears after it's been running for a while.
[00:27:33] serviceops, SRE: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Dzahn) It's not consistent. For example mw2226 is OK but mw2224 and mw2225 have the alert but all 3 are buster and have been reimaged on the same day, 8 days ago.
[00:27:47] serviceops, SRE: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Legoktm) mwdebug1003 was one of the first servers to be reimaged and it's still critical after over a month though
[01:05:41] legoktm: it really is a monitoring issue. I tracked down the exact NRPE command and ran it from the icinga box and it behaves differently for 2 host names
[01:05:49] will keep debugging it after a break
[01:06:04] locally the check is OK ... so weird
[01:08:36] disabling puppet on one host and enabling NRPE debug logs
[01:10:23] it's only getting weirder. on a host where it works i can enable debug logging and see the command it runs .. on the host where it fails i do the same and debug logs show ... nothing
[01:15:58] serviceops, SRE: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Dzahn) >>! In T270517#6763884, @Legoktm wrote: > Is this a problem with icinga? Yes! And it's really weird. I tracked down the NRPE command that is run from Icinga and it behaves different on...
[01:17:45] mutante: is it possible other checks are similarly broken but because they weren't crit earlier we haven't noticed?
[01:18:36] serviceops, Icinga, SRE, observability: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Legoktm)
[01:19:07] legoktm: possible? yea
[01:19:23] it's almost like it does not even talk to the right host?
[01:19:40] debug logging is on but nothing shows up
[01:19:46] it does on the host where stuff works
[01:21:39] if it was an NRPE issue you would expect a different error, not the output of the check command
[01:21:58] but running the check command locally gets a different result and the script is the same
[01:26:46] tcpdump shows it does somehow talk to NRPE, but NRPE debug logs are silent ..for this check.. and talk about other checks
[01:39:20] finally found a difference.. the nagios-nrpe-server version is NOT the same !
[01:40:19] NRPE 3.2.1-2 is broken. and 3.0.1-3 is not or so
[01:51:36] 3.0.1-3 is the stretch version though?
[01:57:00] yes, and the actual issue is that i marked mw2226 as DONE but .. it is not. somehow
[01:57:15] I will continue on this and should be the buster NRPE version
[01:57:20] but for now dinner
[01:59:16] o/ I'm going afk soon too
[02:02:30] serviceops, Icinga, SRE, observability: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Dzahn) upon further investigation I realized mw2226 is actually still stretch and I made a mistake to mark it as DONE in the etherpad for appserver upgrades... som...
[02:36:34] it's also broken when using nagios-nrpe-server 4.0.3-1~bpo10+1
[02:38:12] serviceops, Icinga, SRE, observability: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Dzahn) I tried installing nagios-nrpe-server 4.0.3-1~bpo10+1 over 3.2.1-2 but that did not fix the issue either.
[02:58:09] I found it.. using full path to php7adm fixes it...
[02:58:19] it's different on buster
[03:03:55] serviceops, Icinga, SRE, observability: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Dzahn) I found the issue. Changing line 28 in /usr/local/lib/nagios/plugins/nrpe_check_opcache to: ` OUT=$(/usr/local/bin/php7adm /opcache-info | jq . 2>&1) ` f...
[03:09:12] serviceops, Icinga, SRE, observability: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Dzahn) a:Dzahn
[03:30:41] serviceops, Icinga, SRE, observability, Patch-For-Review: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Dzahn) ` 03:28 <+icinga-wm> RECOVERY - PHP opcache health on mwdebug1003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Ap...
[03:38:52] serviceops, Icinga, SRE, observability, Patch-For-Review: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (Dzahn) Open→Resolved
[03:38:56] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[09:07:03] serviceops, SRE: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (JMeybohm) I don't see anything interesting in the 2.7.1 release (https://github.com/docker/distribution/releases/tag/v2.7.1, https://metadata.ftp-master.debian.org/changelogs//main/d/docker-regi...
[09:10:57] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: kubestage200* change on every puppet run - https://phabricator.wikimedia.org/T271702 (JMeybohm) Open→Resolved a:JMeybohm With that merged, this is fixed now. Thanks @jbond !
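On the opcache check fix in T270517 above (calling php7adm by its full path, /usr/local/bin/php7adm, inside the plugin): the log does not say why the bare name stopped working on buster. One plausible explanation, which is an assumption and not confirmed anywhere in the log, is that the check runs with a narrower PATH under the NRPE daemon than in an interactive shell. The sketch below reproduces that general effect with a made-up helper called `fakeadm`; it does not touch NRPE or php7adm.

```shell
#!/bin/sh
# Hypothetical illustration: a helper that lives outside a restricted PATH
# is not found by bare name, but runs fine when called by absolute path.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Stand-in for a site-local admin tool (the name is invented for this sketch).
cat > "$workdir/fakeadm" <<'EOF'
#!/bin/sh
echo '{"healthy": true}'
EOF
chmod +x "$workdir/fakeadm"

# Run a command line under an emptied environment with a minimal PATH,
# roughly what a daemon-spawned check might see.
run_restricted() {
    env -i PATH=/usr/bin:/bin sh -c "$1" 2>/dev/null
}

# Bare name: lookup fails because $workdir is not on the restricted PATH.
if run_restricted "fakeadm"; then
    bare_result="found"
else
    bare_result="not found"
fi

# Absolute path: works regardless of what PATH contains.
full_result=$(run_restricted "$workdir/fakeadm")

echo "bare name:     $bare_result"
echo "absolute path: $full_result"
```

If PATH is indeed the cause, hard-coding absolute paths in monitoring plugins keeps them independent of whatever environment the daemon happens to provide, which would match the one-line change described in the task.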
[09:10:58] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (JMeybohm)
[13:48:12] serviceops, MediaWiki-Containers, SRE, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (Joe) {meme, src="antoine-approve", below="{{done\}\}"}
[15:22:12] serviceops, Icinga, SRE, observability: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (RLazarus) Nice find! Thanks for tracking this down.
[16:12:12] serviceops, MediaWiki-Containers, SRE, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (dancy) Thanks Legoktm. Small feature request: Can you add "last updated at " text to the top right corner of the page?
[16:26:02] <_joe_> oh good idea
[16:26:37] serviceops, MediaWiki-Containers, SRE, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (Joe) >>! In T179696#6765834, @dancy wrote: > Thanks Legoktm. Small feature request: Can you add "last updated at > " text to the top righ...
[17:16:27] serviceops, MW-on-K8s, Release Pipeline, Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Request volume for Docker images and container filesystems on releases machines - https://phabricator.wikimedia.org/T272092 (dduvall) >>! In T272092#676...
[18:09:05] serviceops, MW-on-K8s, Release Pipeline, Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Request volume for Docker images and container filesystems on releases machines - https://phabricator.wikimedia.org/T272092 (Dzahn) unfortunately T27255...
[18:09:45] serviceops, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (elukey) a:Ottomata→elukey
[18:14:19] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:15:14] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:15:58] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:19:49] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:26:10] serviceops, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (Ottomata) Oo we'll also want eventstreams-internal.svc.* LVS set up too.
[18:55:50] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2371.codfw.wmnet'] ` an...
[18:56:45] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2373.codfw.wmnet'] ` an...
[18:58:23] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2375.codfw.wmnet'] ` an...
[19:37:33] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2226.codfw.wmnet'] ` an...
[19:47:20] serviceops, MediaWiki-Containers, SRE, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (Legoktm) >>! In T179696#6765834, @dancy wrote: > Thanks Legoktm. Small feature request: Can you add "last updated at > " text to the top...
[20:20:09] serviceops, Traffic: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (CDanis)
[21:01:19] serviceops, SRE, Traffic: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (Dzahn) `hieradata/role/common/cache/text.yaml` has: ` helm-charts.wikimedia.org: caching: 'normal' ` That should confirm that it is indeed the 24...
[21:10:00] serviceops, SRE, Traffic: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (CDanis) >>! In T272633#6766881, @Dzahn wrote: > An easy way to do this would be to just switch 'normal' to 'pass' here. Then there would be no caching at all. We...
[21:48:07] serviceops, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (Ottomata) @elukey it works! I realized that since this service is not prox...
[23:33:09] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:34:02] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:34:53] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:35:33] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:51:15] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2374.codfw.wmnet'] ` Of...
[23:56:09] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2374.codfw.wmnet'] ` Of...