[05:21:03] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10Marostegui) >>! In T187984#6494499, @jcrespo wrote: > db1077 should now be available to be put back on test-* section, I don't think it is... [07:44:49] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10awight) Possibly related to {T181632}. In the past, Redis was a single point of failure and if Celery could not conne... [07:49:31] <_joe_> hnowlan: I think calls to restbase from changeprop are failing since Thursday [08:07:55] <_joe_> uh nevermind, it was never deployed, just merged [09:08:31] What makes you say that _joe_? [09:09:04] <_joe_> hnowlan: https://gerrit.wikimedia.org/r/c/operations/puppet/+/630544 :) [09:09:14] <_joe_> I saw you merged a change to call restbase via TLS [09:09:19] <_joe_> but restbase needed a new cert [09:10:34] ah, heh [09:32:57] 10serviceops, 10Operations, 10Parsing-Team, 10TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10daniel) With {T263583} coming up, perhaps we should use a special ParserCache instance for old revisions,... [09:33:41] 10serviceops, 10Operations, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (10ArielGlenn) p:05Triage→03Medium [09:35:12] 10serviceops, 10Operations, 10Traffic: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (10ArielGlenn) p:05Triage→03Medium [09:46:23] 10serviceops, 10Operations, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10ArielGlenn) p:05Triage→03Medium [09:50:04] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10ArielGlenn) p:05Triage→03High [09:54:10] 10serviceops, 10Operations, 10observability, 10Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. - https://phabricator.wikimedia.org/T263728 (10ArielGlenn) p:05Triage→03High [09:54:13] 10serviceops, 10Operations, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (10ArielGlenn) p:05Medium→03High [10:00:01] <_joe_> akosiaris: any idea why mobileapps would call wikifeeds? [10:01:27] _joe_: directly? I don't think it would [10:01:39] via restbase? maybe some summary endpoint? [10:01:55] <_joe_> akosiaris: I am unsure, maybe I'm reading something wrong [10:02:02] there is an endpoint that ends up in restbase splitting off into 4 different calls.. it could be one of them [10:02:36] well, what are you reading?
[10:02:58] <_joe_> akosiaris: I found it as follows: netstat -tunap | fgrep 8889 on a kubernetes worker [10:03:33] <_joe_> and all I see are connections from 10.2.1.14 to 10.192.64.180:8889 [10:03:49] that's just the readiness probe [10:03:51] <_joe_> context is - there are still calls to wikifeeds via http, and they really should not [10:03:57] and for some reason the kubelet is using that IP [10:04:10] aka, those are locally generated [10:04:28] <_joe_> uhm but I do see connections on the LVS [10:04:38] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) db1077 is back into test-s4 role, although without any data. [10:04:39] <_joe_> so yeah, you're probably right [10:04:44] <_joe_> and I need to dig more [10:05:07] btw, at some point we need to experiment with not having the LVS ips on the kubernetes nodes [10:05:19] it isn't strictly needed, we are just doing it for consistency's sake [10:05:38] and it's causing that ^ kind of confusion [10:07:33] <_joe_> ok, the connections are gone now, nevermind [10:07:46] <_joe_> akosiaris: why "not needed"? [10:08:13] <_joe_> it's needed indeed, else the networking stack of the node will reject the request I think [10:08:18] because iptables rules take priority and DNAT the traffic way before it even reaches the node's network stack [10:09:08] <_joe_> so you say iptables happens before a packet is even analyzed by the host's network stack to reject or accept it? I've never looked into that [10:09:16] https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilter-packet-flow.svg [10:09:36] looking at the "routing decision" box [10:09:51] the one in the network layer in the input path [10:10:09] dnat happens before that and it's going to change the destination ip [10:10:11] <_joe_> yes I was following [10:10:23] so that routing decision is going to switch from 10.x.x.x to the pod ip [10:10:35] <_joe_> it looks like it yes [10:10:38] and it will never go up the stack and no local process needs to handle it [10:11:03] <_joe_> ok, that's good. We should just try it in eqiad now on a node to be sure :P [10:11:17] I remember it working just fine during a service deployment, despite me having forgotten that [10:11:37] I figured it out quickly and fixed it, but I kept it in the back of my mind for an experiment at some point [10:11:49] not sure if something would break, I think not, but we need to double check [10:12:47] for example, it might hamper us switching kube-proxy to ipvs mode (which I don't even know if we need) [10:14:10] akosiaris: regarding LVS ip not on k8s nodes. E.ffie and I did a "test" on that with push-not and LVS does not work without it right now [10:14:52] <_joe_> I always assumed accepting or rejecting the packet based on IP happened before iptables [10:14:57] hmm, interesting. What wasn't working? [10:15:10] my "test" said the inverse, which is why I am wondering [10:15:33] akosiaris: connecting to the service via the LVS ip did not work [10:15:45] as soon as we added the IP to the nodes, it did [10:16:57] hmm, the inverse of my experience. Weird. I wonder what changed. We probably should try to structure a test during the next deployment and figure it out [10:17:12] It looks strange in the first place though, because monitoring is fine. But that's not using the LVS IP but the node IPs directly [10:17:56] that sounds like pybal for some reason did not pool the nodes...
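A quick way to confirm the point above, that connections like 10.2.1.14 -> 10.192.64.180:8889 are generated on the worker itself (kubelet readiness probes) rather than by a remote client, is to check whether the "client" address is one of the node's own configured IPs. A minimal sketch to run on the kubernetes worker; the port and addresses are the ones quoted above, the rest is generic:

    # List connections on the wikifeeds port (8889) together with the owning process.
    ss -tnp '( sport = :8889 or dport = :8889 )'
    # If the "client" address of those connections is configured on the node itself
    # (LVS service IPs are usually bound on the loopback interface), the traffic is
    # locally generated, e.g. kubelet readiness probes, and not a remote caller.
    ip -br addr show | grep -F 10.2.1.14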
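The claim that the LVS service IP does not need to be configured on the node can also be checked directly on a worker: with kube-proxy in iptables mode, the DNAT to a pod IP happens in PREROUTING, before the routing decision that would otherwise drop a destination address the host does not own. A rough sketch, assuming the standard kube-proxy KUBE-SERVICES chain and reusing the wikifeeds service IP from above:

    # Service traffic is handed to kube-proxy's NAT chains in PREROUTING...
    iptables -t nat -L PREROUTING -n -v | grep KUBE-SERVICES
    # ...where the service IP:port is DNATed to a pod IP before the kernel decides
    # whether the packet is for local delivery, so the original destination never
    # needs to match an address configured on the node.
    iptables -t nat -L KUBE-SERVICES -n | grep 10.2.1.14
    # conntrack shows the rewritten flows: original dst = service IP, reply src = pod IP.
    conntrack -L -p tcp --orig-dst 10.2.1.14 | head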
[10:18:35] akosiaris: IIRC ipvsadm showed them correctly [10:18:37] <_joe_> let's remember to figure this out next time [10:18:57] yup. not urgent, but it would be a nice test to run it in a structured way next time [10:19:25] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 (10Joe) [10:19:39] I may just be misremembering ofc. It's been some time since then [10:19:53] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [10:21:58] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [10:59:40] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) >>! In T263910#6497445, @awight wrote: > Possibly related to {T181632}. In the past, Redis was a single po... [12:04:56] 10serviceops, 10Operations, 10Traffic, 10conftool: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (10ArielGlenn) [12:10:32] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10ArielGlenn) [12:19:05] hnowlan: maybe it's already too late, but the TLS port of api-gateway should be in the range of 4000 to 5000 https://wikitech.wikimedia.org/w/index.php?title=Service_ports&type=revision&diff=1874891&oldid=1868504 [12:20:17] if it's too late then that's another exception to the rule I guess :) [12:35:39] 10serviceops, 10Operations, 10Traffic, 10Performance-Team (Radar), 10Sustainability: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10ArielGlenn) [12:45:10] hi serviceops friends, I was hoping for some help evaluating the impact of k8s CPU throttling on a service (eventgate-logging-external) [12:46:56] I'm looking at this because I noticed a reasonable number of 503 service unavailables served by the Traffic layer with 60s ttfb (so a timeout): https://logstash.wikimedia.org/goto/690a3ab8d848cfac232216aaf7fa85b8 [12:47:02] https://grafana.wikimedia.org/d/tn6gBadMz/jayme-container_cpu_cfs_throttled_seconds_total?orgId=1&var-dc=thanos&var-site=codfw&var-prometheus=k8s&var-sum_by=container_name&var-service=eventgate-logging-external&var-ignore_container_regex=%7CPOD%7C.*metrics-exporter.*&from=now-30d&to=now [12:47:30] I'm not sure how to evaluate the severity of 300ms of throttled time on the tls proxy, though [12:49:41] looking [12:57:36] cdanis: It's a problem I would say. I would suggest doubling the cpu requests and limits for the sidecar and see how that behaves [12:57:43] ok! [12:58:01] can I ask what you usually look at to determine whether it's a problem or not? ratio of throttled time to total time...? [12:59:18] cdanis: do you know what has changed on 2020-09-15 that could have increased the load/throttling that much?
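To put numbers like the 300ms of throttled time into context, the check jayme describes (how close usage is to the configured CPU request/limit) can be expressed against the cAdvisor metrics behind that dashboard. A sketch only: the metric names are the standard container_cpu_* / container_spec_cpu_* ones, but the Prometheus endpoint and the container_name selector are placeholders, not the real production values:

    # PROM and the container_name value are placeholders.
    PROM=http://prometheus.example:9090
    SEL='{container_name=~".*tls-proxy.*"}'
    # Fraction of CFS periods in which the container was throttled.
    curl -sG "$PROM/api/v1/query" --data-urlencode \
      "query=rate(container_cpu_cfs_throttled_periods_total${SEL}[5m]) / rate(container_cpu_cfs_periods_total${SEL}[5m])"
    # How close actual usage is to the CPU limit (quota/period = cores allowed).
    curl -sG "$PROM/api/v1/query" --data-urlencode \
      "query=rate(container_cpu_usage_seconds_total${SEL}[5m]) / (container_spec_cpu_quota${SEL} / container_spec_cpu_period${SEL})"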
[12:59:27] we increased the traffic directed at it a lot :) [12:59:33] well, "a lot"; it's still only serving something like 40 rps [12:59:41] but yes, the load did increase then [12:59:46] ah, that's an okay reason :) [13:01:13] In this case I was just looking at how close the cpu usage is to the configured cpu request [13:02:03] do you have a few minutes for some very dumb questions about editing charts? [13:03:59] sure, shoot [13:05:41] so, charts/eventgate/templates/_tls_helpers.tpl seems to let you override the default limits from values, if you specify a tls.resources there [13:06:29] in eventgate-logging-external's case that means editing helmfile.d/services/{staging,codfw,eqiad}/eventgate-logging-external/values{,-canary}.yaml ? [13:09:37] exactly! [13:09:53] great [13:09:59] as it is not yet migrated to the new format [13:10:02] do I also understand correctly that eventgate uses the older-- right [13:10:14] in the newer format, is there some sort of inheritance or something where I wouldn't have to edit six files? :) [13:11:04] yep :-) would be one file then [13:16:29] Also, we will be upgrading the kubernetes clusters to kernel 4.19 next Q, so we should have fewer issues with "overthrottling" (also kind of hiding the issue of too-low resource configs) [13:18:47] ah actually, unless I'm missing something I don't think I need to edit the canary files [13:18:52] so only three :) [13:19:50] oh, yeah. You are right. The canary release includes the values.yaml [13:27:10] is the lack of tls.telemetry: true the reason why we don't have data from envoy in https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?viewPanel=71&orgId=1&refresh=1m&from=now-7d&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-logging-external&var-site=codfw&var-ops_datasource=codfw%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=$__all [13:27:11] ? [13:33:21] jayme: not strictly fixed but would require a bit of care to change I'm afraid [13:35:25] hnowlan: I was afraid it's already kind of fixed. Don't know if _joe_ has any strong feelings about that rule as he has probably set it up...(I guess it's only so humans can easily distinguish between HTTP/HTTPS) [13:36:05] cdanis: yea. Feel free to enable telemetry if you like, though [13:36:17] ok cool. for now I'll do the cpu change first [13:36:33] +1 [13:41:38] <_joe_> what rule? [13:42:40] _joe_: "new services should reserve TLS ports in the 4000-5000 port range." from https://wikitech.wikimedia.org/wiki/Service_ports [13:43:05] <_joe_> that's pretty tight unless there are historical reasons for the difference [13:44:31] so you think hnowlan should invest in changing it for api-gateway (bolded the rule now btw) [13:49:47] 4 memcached mc10xx nodes are going down in ~20 mins for the d4 rack ToR update, FYI [13:53:41] network down, not power AFAIK [13:54:11] currently we don't have any critical alerting for the api gateway and it's technically not public yet, so changing it isn't a huge pain. If the preference is to change it now then I'm happy to do it [13:54:32] akosiaris: do you know what request would need to be crafted against citoid for it to call zotero as a consequence? [13:55:22] hmm, I think I can figure that out [13:56:17] hnowlan: I would say change it then if it's not too much pain. Sorry that I've not realized it earlier :-| [13:57:19] akosiaris: looking at citoid code you mean? I can probably do that as well.
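To make the chart edit concrete: per the exchange above, the override goes into each of the three helmfile.d/services/{staging,codfw,eqiad}/eventgate-logging-external/values.yaml files, under the tls key that _tls_helpers.tpl reads. A sketch of the stanza only; the key layout is assumed from the conversation rather than checked against the chart, and the 500m figure is the value discussed later in the log:

    # The override for each values.yaml, printed here just to show the shape
    # (key names assumed from the _tls_helpers.tpl discussion above).
    printf '%s\n' \
      'tls:' \
      '  resources:' \
      '    requests:' \
      '      cpu: 500m' \
      '    limits:' \
      '      cpu: 500m'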
Just wanted to know if you maybe know off the top of your head [14:01:51] jayme: no need to read the code [14:02:03] citoid exposes an augmented OpenAPI spec [14:02:14] deploy1001$ curl https://citoid.svc.eqiad.wmnet:4003/?spec | jq '.paths | {"group":.["/api"]} | .group.get | {"group": .["x-amples"]}' [14:02:32] you can also just run service-checker at a pod IP [14:02:46] it should parse that and issue the correct request [14:03:43] e.g. [14:03:47] deploy1001:~$ /usr/bin/service-checker-swagger 10.64.64.191 https://citoid.discovery.wmnet:4003 [14:03:47] All endpoints are healthy [14:03:56] that exercised zotero and it worked [14:04:12] now that I think about it ... [14:04:42] ah, cool. So asking citoid for the imperator should contact zotero as well [14:05:10] I am not sure it errors out when it can't reach zotero. logs it and sends metrics for it probably but not sure it returns a 500 [14:05:24] it might just not augment the results with the data zotero would return [14:05:34] so you might want to look at logs as well [14:05:54] btw, I did not follow up on that. Why did services_proxy not work on Friday? what was the behavior? [14:07:27] that's what I'm trying to figure out currently. I was getting only 503 from service proxy (https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=1600964400000&to=1600968600000&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=citoid&var-destination=zotero) [14:09:10] ah, also present in https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&from=1600961880340&to=1600970107080 [14:10:06] ah, yeah [14:10:32] but I'm unable to catch that in staging [14:11:17] staging citoid is still running with tls zotero configured, but I don't get proper telemetry either [14:12:22] but now at least the 5xx appear in the citoid dashboard itself (whereas still not in envoy telemetry) [14:13:12] that's something [14:13:28] (that was grafana just needing a reload it seems, please pass by) [14:14:51] zotero logs btw are close to useless [14:15:18] every request ends up in like 40 lines of logs which is 20 actual lines + 20 blank lines [14:15:24] and that's if we are lucky [14:15:41] but we can enable them again in a pod. IIRC they are if-guarded behind an ENV VAR [14:15:49] I can't seem to find any logs .. ah [14:16:03] yeah, we had to disable them because they killed logstash [14:16:10] AND they were useless [14:16:14] :) [14:16:29] DEBUG_LEVEL: 0 [14:16:48] change that in values.yaml and you should get them in kubectl logs [14:16:48] I'm not sure this is zotero's fault at all. It works fine for me when using curl via https [14:17:13] zotero not to blame.. that would be a first [14:17:18] * akosiaris joking [14:17:20] :) [14:17:41] so, what fails is the path from citoid -> zotero ? [14:18:08] so, something between citoid envoy sidecar and zotero envoy sidecar? [14:18:35] yep [14:18:39] I think I know why you don't get telemetry btw [14:18:55] just not enough requests [14:19:04] it's there now https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=now-15m&to=now&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s-staging&var-app=citoid&var-destination=zotero [14:19:21] (running service checker in a loop against staging currently) [14:19:29] ah, ok.
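For reference, the "running service checker in a loop" approach used above to generate enough citoid -> zotero traffic for telemetry can be reproduced from the commands already quoted; a small sketch (the pod IP is the example one from above, the 5 second interval is arbitrary):

    # Pull the x-amples (example requests) that service-checker replays, straight
    # from citoid's augmented OpenAPI spec.
    curl -s 'https://citoid.svc.eqiad.wmnet:4003/?spec' | jq '.paths["/api"].get["x-amples"]'
    # Replay them continuously against one staging pod to generate enough
    # citoid -> zotero traffic for the envoy telemetry to show up.
    POD_IP=10.64.64.191   # example pod IP taken from the discussion above
    while true; do
        /usr/bin/service-checker-swagger "$POD_IP" https://citoid.discovery.wmnet:4003
        sleep 5
    done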
Note however that zotero's envoy isn't being scraped [14:19:45] the deployment is lacking a prometheus.io/scrape: "true" I think [14:19:49] * akosiaris doublechecking [14:20:05] telemetry has its own annotation akosiaris [14:20:44] envoyproxy.io/port: 9361 [14:20:45] envoyproxy.io/scrape: true [14:20:46] indeed [14:20:55] nice! I had forgotten about that [14:21:56] 10serviceops, 10Operations, 10observability: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10ema) >>! In T148976#6488108, @BBlack wrote: > This was mostly about cache nodes back when those had ipsec, I think. The remaining case that uses ipse... [14:40:17] maybe you're busy jayme but does https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/630597 look right to you? [14:42:17] cdanis: 500m is not exactly 2x 200m :) but I'm fine with that [14:42:29] jayme: yeah I wanted to allow some extra headroom [14:42:36] thanks! [14:42:38] +1ed [15:08:46] 10serviceops, 10Operations, 10observability, 10Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. - https://phabricator.wikimedia.org/T263728 (10colewhite) a:03colewhite [15:13:17] 10serviceops, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10Pchelolo) [15:22:27] 10serviceops, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10Pchelolo) According to https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1 we have 3 Tib of disc available on parser cache nodes. Acc... [17:13:50] 10serviceops, 10User-jijiki: Create a structured testing environment for applications running on kubernetes - https://phabricator.wikimedia.org/T264025 (10jijiki) [17:15:25] 10serviceops, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10Pchelolo) Some more RESTBase utilization data is available on T258414 [18:35:42] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Halfak) I'm not familiar with this problem. Anything change with the deployment recently? Did any overload errors ha... [18:40:27] Would it be expected that the puppet role/profile docker::registry is not used by anything in prod or cloud? [18:53:11] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10Krinkle) a:05Krinkle→03dpifke [22:05:06] mutante: It is used in Toolforge. ::role::wmcs::toolforge::docker::registry -> profile::toolforge::docker::registry -> ::docker::registry [22:06:09] I think it was replaced in prod with role::docker_registry_ha::registry [22:07:28] bd808: ah, I see that in openstack-browser, ack, thanks. and HA registry sounds right
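Since the envoyproxy.io/scrape and envoyproxy.io/port annotations mentioned above are what the scrape config keys on, whether a deployment's envoy sidecar is actually scrapeable can be checked straight from the pods. A small sketch; the namespace names are guesses based on the service names in the discussion:

    # Check which telemetry annotations the pods carry; without
    # envoyproxy.io/scrape: "true" the sidecar's metrics port (9361) is not scraped.
    kubectl -n zotero get pods -o yaml | grep -E 'envoyproxy\.io/(scrape|port)'
    kubectl -n citoid get pods -o yaml | grep -E 'envoyproxy\.io/(scrape|port)'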