[03:07:37] serviceops, MediaWiki-extensions-OAuth, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), cloud-services-team (Kanban): Frequent "Nonce already used" errors in scripts and tools - https://phabricator.wikimedia.org/T272319 (AntiCompositeNumber) There have been no errors since the...
[05:21:31] serviceops, Analytics-Radar, Cassandra, ContentTranslation, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (KartikMistry)
[05:26:47] serviceops, MediaWiki-extensions-OAuth, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), cloud-services-team (Kanban): Frequent "Nonce already used" errors in scripts and tools - https://phabricator.wikimedia.org/T272319 (Vort) @AntiCompositeNumber, my bot is also having problem...
[07:20:24] serviceops, MediaWiki-extensions-OAuth, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), cloud-services-team (Kanban): Frequent "Nonce already used" errors in scripts and tools - https://phabricator.wikimedia.org/T272319 (Joe) @Vort @AntiCompositeNumber this is related to T27641...
[08:51:11] serviceops, SRE, observability, Parsoid (Tracking), User-jijiki: Create per cluster error rate alerts on Mediawiki servers - https://phabricator.wikimedia.org/T262078 (jijiki) Open→Resolved a:jijiki
[12:13:24] just FYI jakob_WMDE and I are looking at deploying a newer termbox image to our service about now
[12:30:50] do these timings seem close to sensible?
https://grafana-rw.wikimedia.org/d/3mPeF6yGk/hnowlan-rest-api-timings
[14:07:11] <_joe_> hnowlan: you probably want to separate by cluster
[14:07:25] <_joe_> as I suspect parsoid drags the performance of the rest api down
[14:07:38] <_joe_> other than that, yes it looks sensible
[14:23:22] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1001 for hosts: `neon.eqiad.wmnet` - neon.eqiad.wmnet (**P...
[14:24:11] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (JMeybohm) ` Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox Generating the DNS records from Netbox dat...
[14:41:25] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (JMeybohm) Ran `sre.dns.netbox` again and it completed just fine. Mission accomplished then. Missing points are docs on how to switch...
[14:47:58] _joe_: good idea, makes a big difference https://grafana-rw.wikimedia.org/d/3mPeF6yGk/hnowlan-mediawiki-endpoint-timings
[14:49:44] <_joe_> hnowlan: heh parsoid is basically doing all the most expensive work we ever do - parsing the doomed scriptures
[14:49:55] <_joe_> (wikitext)
[14:52:07] heh
[15:08:42] serviceops, SRE, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (akosiaris)
[15:56:20] hi! is it the case that k8s pods can now no longer directly contact https://api-rw.discovery.wmnet/w/index.php or the read-only equivalent?
[15:57:48] we're trying to test a new termbox service image before putting it live and finding that neither our staging nor our testing service seems to be currently working (probably for different reasons)
[16:17:01] tarrow: I'll take a look in a couple of minutes (in a meeting)
[16:17:55] Thanks!
[16:19:19] serviceops, SRE, Patch-For-Review: move mwmaint2002 into production, replace mwmaint2001 - https://phabricator.wikimedia.org/T275905 (Papaul)
[16:23:54] Does anyone have a good rule of thumb about which things to check in a /healthz route? I'm looking at https://github.com/KristianOellegaard/django-health-check as a way to build one for Toolhub and kind of wondering how far to take the checks.
[16:25:20] I guess what I'm mostly wondering about is thrashing the k8s containers if there is a network or other problem connecting from the k8s pod out to support services (db, redis, elasticsearch)
[16:27:54] bd808: liveness checks should never test dependencies. for readiness the situation is more nuanced
[16:31:56] agreed. My take always was that if you test something outside of your container it's wrong (as that should have its own health check)
[16:35:25] +1
[16:40:41] <_joe_> liveness should be: is the container running the software?
[16:40:46] cdanis: so... what should a liveness check actually check then? Just that the container's expected process is running?
[16:40:48] <_joe_> more or less
[16:41:10] <_joe_> bd808: that the uwsgi server is up and you're able to respond to a simple request with no dependencies
[16:41:32] is running and is itself reachable by the thing checking for liveness, yeah
[16:41:44] <_joe_> basically call a basic banner page
[16:41:58] <_joe_> we use the OpenAPI spec in most cases IIRC
[16:42:14] <_joe_> it's more complex for multi-container pods
[16:42:58] as cdanis says there should *also* be a readiness check, which depends on a lot more -- a pod might be "live but not ready" including while it's in the process of starting up, for example
[16:44:14] <_joe_> bd808: for php applications, I'm calling the fpm status page as liveness, and the internal monitoring endpoint for readiness, for instance
[16:44:23] serviceops, Maps, Product-Infrastructure-Team-Backlog, SRE, and 3 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (jijiki)
[16:44:58] <_joe_> the fpm status page would be working even if - say - we still haven't regenerated some cache php files
[16:45:27] <_joe_> rzl: I would still keep other services' availability out of the readiness probe
[16:46:16] <_joe_> like - I would avoid making readiness depend on db connections, unless we have clients smart enough to fail-open when the service is unavailable
[16:46:21] <_joe_> and we don't have that :)
[16:47:01] <_joe_> else, it can be a way to remove pressure from the system until it has recovered, and that's good (if no pod is ready, the traffic will be rejected without putting more pressure on the db that is failing)
[16:50:33] ok. I'll probably come back to group think about readiness at some point, but I'll write up that /healthz should just be a static response that touches as few external dependencies as possible
[16:51:22] bd808: yeah, basically liveness should just demonstrate that your process isn't deadlocked or unresponsive.
if it depends on external things -- especially if you have transitive chains of that -- thrashing lots of pods with restarts becomes a real concern, amongst other things
[16:52:49] serviceops, envoy, observability, Patch-For-Review, User-fgiunchedi: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (jijiki) {F34136873} I think we can call it success and roll it out to app and api next week (as we have more visibility there), and then on j...
[17:15:09] serviceops, SRE, User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki)
[17:15:58] serviceops, SRE, User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki) I want to >>! In T271967#6877704, @Joe wrote: > Can I ask how do we intend to perform the transition from non-tls to tls in detail? I see a series of pitfalls with our...
[17:18:17] is it expected that the service proxy would change x-client-ip on a request?
[17:47:21] serviceops, MW-on-K8s, Release-Engineering-Team: Progressive rollout of MediaWiki deployment on Kubernetes - https://phabricator.wikimedia.org/T276487 (jijiki)
[17:56:01] hnowlan: can you be more specific?
[17:56:13] because I will work on this soonish
[17:56:17] ah wait
[17:56:24] service proxy, not tls proxy
[17:56:32] please do tell
[17:57:14] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-restbase-2021.02.24?id=H0tT1XcBsCn0xdb8Djv in this message, x-client-ip is set to the IP of the restbase instance receiving the request
[17:58:10] which triggers the restbase ratelimiter before we'd expect it to because, assuming this is happening for all services, lots of messages are coming from this client ip
[17:58:23] I would have expected x-client-ip to come from varnish and nowhere else
[17:59:11] I had half an answer but I got tangled up between x-client-ip and x-forwarded-for
[17:59:41] so I appreciate the question and I hope someone else knows how to answer it so I can understand this as well as I thought I did :P
[18:00:27] if you'd like to follow along, there is a frankenissue here that should probably be at least 3 bugs https://phabricator.wikimedia.org/T276485
[18:01:19] hnowlan: does restbase happen to be calling back into the traffic layer to make these requests?
[18:05:37] or, if restbase passes x-client-ip when it makes a request via services proxy
[18:06:14] I am not on a computer but we can look into it later or tomorrow
[18:13:35] cdanis: I'm not certain - in theory all of the service proxy endpoints are on localhost
[18:13:49] by that you mean reaching varnish?
[18:20:28] serviceops, Code-Health-Objective, Performance-Team (Radar), Platform Team Initiatives (Session Management Service (CDP2)), and 2 others: Determine multi-dc strategy for CentralAuth - https://phabricator.wikimedia.org/T267270 (BPirkle) >>! In T267270#6808265, @Krinkle wrote: >> I think this asses...
[18:31:07] serviceops, Parsoid-Tests, SRE, Parsoid (Tracking), Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry)
[18:35:25] hnowlan: yeah
[18:35:46] there was also some diving into the formatting of the IPv4 addresses of the hosts as a pseudo-ipv6 on another bug
[18:36:14] https://phabricator.wikimedia.org/T255568
[18:36:22] not convinced it's related; not convinced it isn't :)
[18:37:56] Definitely-not-related-but-adjacent but we found out that restbase itself can't really talk ipv6 https://phabricator.wikimedia.org/T276323
[18:45:08] yeah that was another unfortunate discovery
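
Note on the liveness/readiness thread (16:23 to 16:51): the advice given there can be sketched roughly as below. This is only an illustration under stated assumptions, not Toolhub's actual code and not how django-health-check works. The `/readyz` path, the `STATE` flag, and the `probe` helper are hypothetical names invented for the example; only `/healthz` comes from the discussion. It uses the standard-library `wsgiref` rather than uwsgi or Django so it stays self-contained.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical flag: the application flips this to True once its own
# local startup work is done. It deliberately does not reflect db,
# redis, or elasticsearch availability.
STATE = {"ready": False}


def app(environ, start_response):
    """Minimal WSGI app exposing a liveness and a readiness endpoint."""
    path = environ.get("PATH_INFO", "")
    if path == "/healthz":
        # Liveness: a static response with no dependencies. It only shows
        # that the process is up and able to answer a request.
        status, body = "200 OK", b"ok\n"
    elif path == "/readyz":
        # Readiness: "live but not ready" while the pod is still starting.
        if STATE["ready"]:
            status, body = "200 OK", b"ready\n"
        else:
            status, body = "503 Service Unavailable", b"starting\n"
    else:
        status, body = "404 Not Found", b"not found\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def probe(path):
    """Call the app in-process, the way a probe would over HTTP."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Calling `probe("/healthz")` succeeds regardless of `STATE`, while `probe("/readyz")` reports 503 until `STATE["ready"]` is set. Whether readiness should ever consult external services is the part the channel left open, so the sketch keeps it purely local.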