[05:21:03] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10Marostegui) >>! In T187984#6494499, @jcrespo wrote: > db1077 should now be available to be put back on test-* section, I don't think it is... [07:44:49] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10awight) Possibly related to {T181632}. In the past, Redis was a single point of failure and if Celery could not conne... [07:49:31] <_joe_> hnowlan: I think calls to restbase from changeprop are failing since Thursday [08:07:55] <_joe_> uh nevermind, it was never deployed, just merged [09:08:31] What makes you say that _joe_? [09:09:04] <_joe_> hnowlan: https://gerrit.wikimedia.org/r/c/operations/puppet/+/630544 :) [09:09:14] <_joe_> I saw you merged a change to call restbase via TLS [09:09:19] <_joe_> but restbase needed a new cert [09:10:34] ah, heh [09:32:57] 10serviceops, 10Operations, 10Parsing-Team, 10TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10daniel) With {T263583} coming up, perhaps we should use a special ParserCache instance for old revisions,... [09:33:41] 10serviceops, 10Operations, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (10ArielGlenn) p:05Triage→03Medium [09:35:12] 10serviceops, 10Operations, 10Traffic: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (10ArielGlenn) p:05Triage→03Medium [09:46:23] 10serviceops, 10Operations, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10ArielGlenn) p:05Triage→03Medium [09:50:04] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10ArielGlenn) p:05Triage→03High [09:54:10] 10serviceops, 10Operations, 10observability, 10Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. - https://phabricator.wikimedia.org/T263728 (10ArielGlenn) p:05Triage→03High [09:54:13] 10serviceops, 10Operations, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (10ArielGlenn) p:05Medium→03High [10:00:01] <_joe_> akosiaris: any idea why mobileapps would call wikifeeds? [10:01:27] _joe_: directly? I don't think it would [10:01:39] via restbase? maybe some summary endpoint? [10:01:55] <_joe_> akosiaris: I am unsure, maybe I'm reading something wrong [10:02:02] there is an endpoint that ends up in restbase splitting off into 4 different calls.. it could be one of them [10:02:36] well, what are you reading?
[10:02:58] <_joe_> akosiaris: I found it as follows: netstat -tunap | fgrep 8889 on a kubernetes worker [10:03:33] <_joe_> and all I see are connections from 10.2.1.14 to 10.192.64.180:8889 [10:03:49] that's just the readiness probe [10:03:51] <_joe_> context is - there are still calls to wikifeeds via http, and they really should not [10:03:57] and for some reason the kubelet is using that IP [10:04:10] aka, those are locally generated [10:04:28] <_joe_> uhm but I do see connections on the LVS [10:04:38] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) db1077 is back into test-s4 role, although without any data. [10:04:39] <_joe_> so yeah, you're probably right [10:04:44] <_joe_> and I need to dig more [10:05:07] btw, at some point we need to experiment with not having the LVS ips on the kubernetes nodes [10:05:19] it isn't strictly needed, we are just doing it for consistency's sake [10:05:38] and it's causing that ^ kind of confusion [10:07:33] <_joe_> ok, the connections are gone now, nevermind [10:07:46] <_joe_> akosiaris: why "not needed"? [10:08:13] <_joe_> it's needed indeed, else the networking stack of the node will reject the request I think [10:08:18] because iptables rules take priority and DNAT the traffic way before it even reaches the node's network stack [10:09:08] <_joe_> so you say iptables happens before a packet is even analyzed by the host's network stack to reject or accept it? I've never looked into that [10:09:16] https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilter-packet-flow.svg [10:09:36] looking at the "routing decision" box [10:09:51] the one in the network layer in the input path [10:10:09] dnat happens before that and it's going to change the destination ip [10:10:11] <_joe_> yes I was following [10:10:23] so that routing decision is going to switch from 10.x.x.x to the pod ip [10:10:35] <_joe_> it looks like it yes [10:10:38] and it will never go up the stack and no local process needs to handle it [10:11:03] <_joe_> ok, that's good. We should just try it in eqiad now on a node to be sure :P [10:11:17] I remember it working just fine during a service deployment, despite me having forgotten that [10:11:37] I figured it out quickly and fixed it, but I kept it in the back of my mind for an experiment at some point [10:11:49] not sure if something would break, I think not, but we need to double check [10:12:47] for example, it might hamper us switching kube-proxy to ipvs mode (which I don't even know if we need) [10:14:10] akosiaris: regarding LVS ip not on k8s nodes. E.ffie and I did a "test" on that with push-not and LVS does not work without it right now [10:14:52] <_joe_> I always assumed accepting or rejecting the packet based on IP happened before iptables [10:14:57] hmm, interesting. What wasn't working? [10:15:10] my "test" said the inverse, which is why I am wondering [10:15:33] akosiaris: connecting to the service via the LVS ip did not work [10:15:45] as soon as we added the IP to the nodes, it did [10:16:57] hmm, the inverse of my experience. Weird. I wonder what changed. We probably should try to structure a test during the next deployment and figure it out [10:17:12] It looks strange in the first place though, because monitoring is fine. But that's not using the LVS IP but the node IPs directly [10:17:56] that sounds like pybal for some reason did not pool the nodes...
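A quick way to confirm the point above, that connections like 10.2.1.14 -> 10.192.64.180:8889 are generated on the worker itself (kubelet readiness probes) rather than by a remote client, is to check whether the "client" address is one of the node's own configured IPs. A minimal sketch to run on the kubernetes worker; the port and addresses are the ones quoted above, the rest is generic:

    # List connections on the wikifeeds port (8889) together with the owning process.
    ss -tnp '( sport = :8889 or dport = :8889 )'
    # If the "client" address of those connections is configured on the node itself
    # (LVS service IPs are usually bound on the loopback interface), the traffic is
    # locally generated, e.g. kubelet readiness probes, and not a remote caller.
    ip -br addr show | grep -F 10.2.1.14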
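The claim that the LVS service IP does not need to be configured on the node can also be checked directly on a worker: with kube-proxy in iptables mode, the DNAT to a pod IP happens in PREROUTING, before the routing decision that would otherwise drop a destination address the host does not own. A rough sketch, assuming the standard kube-proxy KUBE-SERVICES chain and reusing the wikifeeds service IP from above:

    # Service traffic is handed to kube-proxy's NAT chains in PREROUTING...
    iptables -t nat -L PREROUTING -n -v | grep KUBE-SERVICES
    # ...where the service IP:port is DNATed to a pod IP before the kernel decides
    # whether the packet is for local delivery, so the original destination never
    # needs to match an address configured on the node.
    iptables -t nat -L KUBE-SERVICES -n | grep 10.2.1.14
    # conntrack shows the rewritten flows: original dst = service IP, reply src = pod IP.
    conntrack -L -p tcp --orig-dst 10.2.1.14 | head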
[10:18:35] akosiaris: IIRC ipvsadm showed them correctly [10:18:37] <_joe_> let's remember to figure this out next time [10:18:57] yup. not urgent, but it would be a nice test to run it in a structured way next time [10:19:25] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 (10Joe) [10:19:39] I may just be misremembering ofc. It's been some time since then [10:19:53] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [10:21:58] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [10:59:40] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) >>! In T263910#6497445, @awight wrote: > Possibly related to {T181632}. In the past, Redis was a single po... [12:04:56] 10serviceops, 10Operations, 10Traffic, 10conftool: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (10ArielGlenn) [12:10:32] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10ArielGlenn) [12:19:05] hnowlan: maybe it's already too late, but the TLS port of api-gateway should be in the range of 4000 to 5000 https://wikitech.wikimedia.org/w/index.php?title=Service_ports&type=revision&diff=1874891&oldid=1868504 [12:20:17] if it's too late then that's another exception to the rule I guess :) [12:35:39] 10serviceops, 10Operations, 10Traffic, 10Performance-Team (Radar), 10Sustainability: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10ArielGlenn) [12:45:10] hi serviceops friends, I was hoping for some help evaluating the impact of k8s CPU throttling on a service (eventgate-logging-external) [12:46:56] I'm looking at this because I noticed a reasonable number of 503 service unavailables served by the Traffic layer with 60s ttfb (so a timeout): https://logstash.wikimedia.org/goto/690a3ab8d848cfac232216aaf7fa85b8 [12:47:02] https://grafana.wikimedia.org/d/tn6gBadMz/jayme-container_cpu_cfs_throttled_seconds_total?orgId=1&var-dc=thanos&var-site=codfw&var-prometheus=k8s&var-sum_by=container_name&var-service=eventgate-logging-external&var-ignore_container_regex=%7CPOD%7C.*metrics-exporter.*&from=now-30d&to=now [12:47:30] I'm not sure how to evaluate the severity of 300ms of throttled time on the tls proxy, though [12:49:41] looking [12:57:36] cdanis: It's a problem I would say. I would suggest doubling the cpu requests and limits for the sidecar and see how that behaves [12:57:43] ok! [12:58:01] can I ask what you usually look at to determine whether it's a problem or not? ratio of throttled time to total time...? [12:59:18] cdanis: do you know what has changed on 2020-09-15 that could have increased the load/throttling that much?
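To put numbers like the 300ms of throttled time into context, the check jayme describes (how close usage is to the configured CPU request/limit) can be expressed against the cAdvisor metrics behind that dashboard. A sketch only: the metric names are the standard container_cpu_* / container_spec_cpu_* ones, but the Prometheus endpoint and the container_name selector are placeholders, not the real production values:

    # PROM and the container_name value are placeholders.
    PROM=http://prometheus.example:9090
    SEL='{container_name=~".*tls-proxy.*"}'
    # Fraction of CFS periods in which the container was throttled.
    curl -sG "$PROM/api/v1/query" --data-urlencode \
      "query=rate(container_cpu_cfs_throttled_periods_total${SEL}[5m]) / rate(container_cpu_cfs_periods_total${SEL}[5m])"
    # How close actual usage is to the CPU limit (quota/period = cores allowed).
    curl -sG "$PROM/api/v1/query" --data-urlencode \
      "query=rate(container_cpu_usage_seconds_total${SEL}[5m]) / (container_spec_cpu_quota${SEL} / container_spec_cpu_period${SEL})"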
[12:59:27] we increased the traffic directed at it a lot :) [12:59:33] well, "a lot"; it's still only serving something like 40 rps [12:59:41] but yes, the load did increase then [12:59:46] ah, that's an okay reason :) [13:01:13] In this case I was just looking at how close the cpu usage is to the configured cpu request [13:02:03] do you have a few minutes for some very dumb questions about editing charts? [13:03:59] sure, shoot [13:05:41] so, charts/eventgate/templates/_tls_helpers.tpl seems to let you override the default limits from values, if you specify a tls.resources there [13:06:29] in eventgate-logging-external's case that means editing helmfile.d/services/{staging,codfw,eqiad}/eventgate-logging-external/values{,-canary}.yaml ? [13:09:37] exactly! [13:09:53] great [13:09:59] as it is not yet migrated to the new format [13:10:02] do I also understand correctly that eventgate uses the older-- right [13:10:14] in the newer format, is there some sort of inheritance or something where I wouldn't have to edit six files? :) [13:11:04] yep :-) would be one file then [13:16:29] Also, we will be upgrading the kubernetes clusters to kernel 4.19 next Q, so we should have fewer issues with "overthrottling" (also kind of hiding the issue of too-low resource configs) [13:18:47] ah actually, unless I'm missing something I don't think I need to edit the canary files [13:18:52] so only three :) [13:19:50] oh, yeah. You are right. The canary release includes the values.yaml [13:27:10] is the lack of tls.telemetry: true the reason why we don't have data from envoy in https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?viewPanel=71&orgId=1&refresh=1m&from=now-7d&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-logging-external&var-site=codfw&var-ops_datasource=codfw%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=$__all [13:27:11] ? [13:33:21] jayme: not strictly fixed but would require a bit of care to change I'm afraid [13:35:25] hnowlan: I was afraid it's already kind of fixed. Don't know if _joe_ has any strong feelings about that rule as he has probably set it up...(I guess it's only so humans can easily distinguish between HTTP/HTTPS) [13:36:05] cdanis: yea. Feel free to enable telemetry if you like, though [13:36:17] ok cool. for now I'll do the cpu change first [13:36:33] +1 [13:41:38] <_joe_> what rule? [13:42:40] _joe_: "new services should reserve TLS ports in the 4000-5000 port range." from https://wikitech.wikimedia.org/wiki/Service_ports [13:43:05] <_joe_> that's pretty tight unless there are historical reasons for the difference [13:44:31] so you think hnowlan should invest in changing it for api-gateway (bolded the rule now btw) [13:49:47] 4 memcached mc10xx nodes are going down in ~20 mins for the d4 rack ToR update, FYI [13:53:41] network down, not power AFAIK [13:54:11] currently we don't have any critical alerting for the api gateway and it's technically not public yet, so changing it isn't a huge pain. If the preference is to change it now then I'm happy to do it [13:54:32] akosiaris: do you know what request would need to be crafted against citoid for it to call zotero as a consequence? [13:55:22] hmm, I think I can figure that out [13:56:17] hnowlan: I would say change it then if it's not too much pain. Sorry that I've not realized it earlier :-| [13:57:19] akosiaris: looking at citoid code you mean? I can probably do that as well.
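To make the chart edit concrete: per the exchange above, the override goes into each of the three helmfile.d/services/{staging,codfw,eqiad}/eventgate-logging-external/values.yaml files, under the tls key that _tls_helpers.tpl reads. A sketch of the stanza only; the key layout is assumed from the conversation rather than checked against the chart, and the 500m figure is the value discussed later in the log:

    # The override for each values.yaml, printed here just to show the shape
    # (key names assumed from the _tls_helpers.tpl discussion above).
    printf '%s\n' \
      'tls:' \
      '  resources:' \
      '    requests:' \
      '      cpu: 500m' \
      '    limits:' \
      '      cpu: 500m'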
Just wanted to know if you maybe know off the top of your head [14:01:51] jayme: no need to read the code [14:02:03] citoid exposes an augmented OpenAPI spec [14:02:14] deploy1001$ curl https://citoid.svc.eqiad.wmnet:4003/?spec | jq '.paths | {"group":.["/api"]} | .group.get | {"group": .["x-amples"]}' [14:02:32] you can also just run service-checker at a pod IP [14:02:46] it should parse that and issue the correct request [14:03:43] e.g. [14:03:47] deploy1001:~$ /usr/bin/service-checker-swagger 10.64.64.191 https://citoid.discovery.wmnet:4003 [14:03:47] All endpoints are healthy [14:03:56] that exercised zotero and it worked [14:04:12] now that I think about it ... [14:04:42] ah, cool. So asking citoid for the imperator should contact zotero as well [14:05:10] I am not sure it errors out when it can't reach zotero. logs it and sends metrics for it probably but not sure it returns a 500 [14:05:24] it might just not augment the results with the data zotero would return [14:05:34] so you might want to look at logs as well [14:05:54] btw, I did not follow up on that. Why did services_proxy not work on Friday? what was the behavior? [14:07:27] that's what I'm trying to figure out currently. I was getting only 503 from service proxy (https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=1600964400000&to=1600968600000&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=citoid&var-destination=zotero) [14:09:10] ah, also present in https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&from=1600961880340&to=1600970107080 [14:10:06] ah, yeah [14:10:32] but I'm unable to catch that in staging [14:11:17] staging citoid is still running with tls zotero configured, but I don't get proper telemetry either [14:12:22] but now at least the 5xx appear in the citoid dashboard itself (whereas still not in envoy telemetry) [14:13:12] that's something [14:13:28] (that was grafana just needing a reload it seems, please pass by) [14:14:51] zotero logs btw are close to useless [14:15:18] every request ends up in like 40 lines of logs which is 20 actual lines + 20 blank lines [14:15:24] and that's if we are lucky [14:15:41] but we can enable them again in a pod. IIRC they are if-guarded behind an ENV VAR [14:15:49] I can't seem to find any logs .. ah [14:16:03] yeah, we had to disable them because they killed logstash [14:16:10] AND they were useless [14:16:14] :) [14:16:29] DEBUG_LEVEL: 0 [14:16:48] change that in values.yaml and you should get them in kubectl logs [14:16:48] I'm not sure this is zotero's fault at all. It works fine for me when using curl via https [14:17:13] zotero not to blame.. that would be a first [14:17:18] * akosiaris joking [14:17:20] :) [14:17:41] so, what fails is the path from citoid -> zotero ? [14:18:08] so, something between citoid envoy sidecar and zotero envoy sidecar? [14:18:35] yep [14:18:39] I think I know why you don't get telemetry btw [14:18:55] just not enough requests [14:19:04] it's there now https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=now-15m&to=now&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s-staging&var-app=citoid&var-destination=zotero [14:19:21] (running service checker in a loop against staging currently) [14:19:29] ah, ok.
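For reference, the "running service checker in a loop" approach used above to generate enough citoid -> zotero traffic for telemetry can be reproduced from the commands already quoted; a small sketch (the pod IP is the example one from above, the 5 second interval is arbitrary):

    # Pull the x-amples (example requests) that service-checker replays, straight
    # from citoid's augmented OpenAPI spec.
    curl -s 'https://citoid.svc.eqiad.wmnet:4003/?spec' | jq '.paths["/api"].get["x-amples"]'
    # Replay them continuously against one staging pod to generate enough
    # citoid -> zotero traffic for the envoy telemetry to show up.
    POD_IP=10.64.64.191   # example pod IP taken from the discussion above
    while true; do
        /usr/bin/service-checker-swagger "$POD_IP" https://citoid.discovery.wmnet:4003
        sleep 5
    done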
Note however that zotero's envoy isn't being scraped [14:19:45] the deployment is lacking a prometheus.io/scrape: "true" I think [14:19:49] * akosiaris doublechecking [14:20:05] telemetry has its own annotation akosiaris [14:20:44] envoyproxy.io/port: 9361 [14:20:45] envoyproxy.io/scrape: true [14:20:46] indeed [14:20:55] nice! I had forgotten about that [14:21:56] 10serviceops, 10Operations, 10observability: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10ema) >>! In T148976#6488108, @BBlack wrote: > This was mostly about cache nodes back when those had ipsec, I think. The remaining case that uses ipse... [14:40:17] maybe you're busy jayme but does https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/630597 look right to you? [14:42:17] cdanis: 500m is not exactly 2x 200m :) but I'm fine with that [14:42:29] jayme: yeah I wanted to allow some extra headroom [14:42:36] thanks! [14:42:38] +1ed [15:08:46] 10serviceops, 10Operations, 10observability, 10Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. - https://phabricator.wikimedia.org/T263728 (10colewhite) a:03colewhite [15:13:17] 10serviceops, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10Pchelolo) [15:22:27] 10serviceops, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10Pchelolo) According to https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1 we have 3 Tib of disc available on parser cache nodes. Acc... [17:13:50] 10serviceops, 10User-jijiki: Create a structured testing environment for applications running on kubernetes - https://phabricator.wikimedia.org/T264025 (10jijiki) [17:15:25] 10serviceops, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10Pchelolo) Some more RESTBase utilization data is available on T258414 [18:35:42] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Halfak) I'm not familiar with this problem. Anything change with the deployment recently? Did any overload errors ha... [18:40:27] Would it be expected that the puppet role/profile docker::registry is not used by anything in prod or cloud? [18:53:11] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10Krinkle) a:05Krinkle→03dpifke [22:05:06] mutante: It is used in Toolforge. ::role::wmcs::toolforge::docker::registry -> profile::toolforge::docker::registry -> ::docker::registry [22:06:09] I think it was replaced in prod with role::docker_registry_ha::registry [22:07:28] bd808: ah, I see that in openstack-browser, ack, thanks. and HA registry sounds right
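Since the envoyproxy.io/scrape and envoyproxy.io/port annotations mentioned above are what the scrape config keys on, whether a deployment's envoy sidecar is actually scrapeable can be checked straight from the pods. A small sketch; the namespace names are guesses based on the service names in the discussion:

    # Check which telemetry annotations the pods carry; without
    # envoyproxy.io/scrape: "true" the sidecar's metrics port (9361) is not scraped.
    kubectl -n zotero get pods -o yaml | grep -E 'envoyproxy\.io/(scrape|port)'
    kubectl -n citoid get pods -o yaml | grep -E 'envoyproxy\.io/(scrape|port)'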