[07:02:41] so interestingly since the move to kafka-only purges we stopped having any kind of purged workers backlog it seems
[07:02:46] https://grafana.wikimedia.org/d/RvscY1CZk/purged?panelId=11&fullscreen&orgId=1&from=now-24h&to=now&var-datasource=esams%20prometheus%2Fops&var-cluster=cache_text&var-instance=cp3050
[07:03:17] if you look at other cache_text instances, the same is true for them too
[07:04:33] what the backlog really means is: how many messages fully read from htcp/kafka have not been sent as an HTTP PURGE request to backend/frontend yet
[07:08:50] possibly due to the fact that reading from kafka is less bursty than the multicast HTCP messages?
[07:10:25] well it's not true that there's 0 backlog, there is some but much less than before (switch from 24h view to 1h for instance to see them)
[07:12:07] eg: max backlog in the last 24h was 673367 on cp3050 backend, in the last 3h it is 848
[07:59:57] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Move helm chart repository out of git - https://phabricator.wikimedia.org/T253843 (10JMeybohm) a:03JMeybohm
[08:56:49] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa)
[09:03:43] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa)
[09:14:10] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa) Started to fail cca 4 hours ago, tests passed before. Fails currently only on testwiki and enwpbet...
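For reference, the 24h-vs-3h comparison above can be reproduced with a single query against the Prometheus HTTP API. A minimal sketch, assuming a hypothetical metric name `purged_backlog` and a placeholder Prometheus URL; the real exporter metric and the esams datasource may be named differently:

```python
#!/usr/bin/env python3
"""Compare the recent purged backlog on cp3050 against the last 24h."""
import requests

PROMETHEUS = "http://prometheus.example.org/ops"  # placeholder URL, not the real datasource
INSTANCE = "cp3050"

def max_backlog(window: str) -> float:
    """Return the maximum backlog seen on INSTANCE over the given window."""
    # `purged_backlog` is an assumed metric name for the purged worker backlog gauge.
    query = f'max(max_over_time(purged_backlog{{instance=~"{INSTANCE}.*"}}[{window}]))'
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    # e.g. 673367 over 24h vs 848 over 3h, per the numbers quoted above
    for window in ("24h", "3h"):
        print(f"max backlog over {window}: {max_backlog(window):.0f}")
```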
[09:18:57] during the PHP update I found a number of hosts which were depooled: mw1318,mw2139,mw2145,mw2147,mw2221,mw2219,mw2250,mw2350
[09:19:24] these are all fully green in Icinga, so I suppose someone forgot to repool them after tests/maintenance
[09:19:34] unless anyone objects I'll repool all of these later
[09:21:45] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa)
[09:29:54] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa)
[09:48:16] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa)
[10:44:54] <_joe_> ema: maybe reading from kafka is slightly slower
[10:48:37] _joe_: right, plus there's some sort of buffering being done by the producer I'm sure
[10:49:02] <_joe_> but the purge lag seems to be very small in general
[10:49:30] <_joe_> we could use burrow metrics to see what the offset lag of all consumers is, maybe; I have to look into it
[10:50:13] <_joe_> but if on average the most recent purge you've seen was generated 50 ms ago, and they're consumed in FIFO fashion, I wouldn't worry too much
[10:51:04] <_joe_> the worst lag can be found in eqsin, where it averages over 500 ms
[10:51:17] <_joe_> still well within the db maxlag limit
[13:03:16] 10serviceops, 10Operations, 10RESTBase, 10RESTBase-architecture, 10Service-Architecture: Use the service proxy in restbase - https://phabricator.wikimedia.org/T255133 (10Joe)
[13:04:49] 10serviceops, 10Operations, 10RESTBase, 10RESTBase-architecture, 10Service-Architecture: Use the service proxy in restbase - https://phabricator.wikimedia.org/T255133 (10Joe) we will have to add some more refinement to the service proxy - specifically we don't need to install all of the remote cluster ha...
[13:10:38] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 (10JMeybohm)
[14:05:30] _joe_: as briefly discussed yesterday -- https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604716/
[14:57:22] also, _joe_ & Pchelolo: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604743/
[14:58:06] next step is setting profile::cache::purge::kafka_cluster_name to 'main-deployment-prep' in deployment-prep's project puppet config on horizon and hope it works as I think it does
[15:23:24] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 (10JMeybohm) All clusters clean from `envoy-tls-local-proxy` image!
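As an alternative to the burrow metrics mentioned at [10:49:30], a consumer group's offset lag can also be read directly from the brokers. A minimal sketch with confluent-kafka, in which the broker, group, and topic names are placeholders rather than the real purged configuration:

```python
#!/usr/bin/env python3
"""Spot-check a consumer group's offset lag per partition."""
from confluent_kafka import Consumer, TopicPartition

BROKERS = "kafka-main1001.eqiad.wmnet:9092"  # placeholder broker
GROUP = "purged-cp3050"                      # placeholder consumer group
TOPIC = "eqiad.resource-purge"               # placeholder topic

consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": GROUP,            # reuse the group so committed() returns its offsets
    "enable.auto.commit": False,  # read-only: never commit anything ourselves
})

# Enumerate the topic's partitions, then compare the group's committed offset
# against the broker's high watermark for each one.
metadata = consumer.list_topics(TOPIC, timeout=10)
partitions = [TopicPartition(TOPIC, p) for p in metadata.topics[TOPIC].partitions]

for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # no committed offset yet
    print(f"partition {tp.partition}: lag = {high - committed} messages")

consumer.close()
```

This gives roughly the per-partition numbers burrow would report, just as a one-off snapshot without the history.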
[16:21:13] Pchelolo: I've cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604743/ on deployment-puppetmaster04, now purged on deployment-cache-text06.deployment-prep.eqiad.wmflabs uses the proper kafka brokers
[16:22:35] well, almost: the port seems to be wrong (9093 vs 9092)
[16:28:35] ah, that's due to a minor issue in the purged profile, fixing
[16:30:31] yeah, here's the problem: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/manifests/cache/purge.pp#22
[16:30:43] $kafka_tls defaults to false, not undef
[16:41:16] fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604790/
[16:43:20] ok, now we're looking good:
[16:43:20] + "bootstrap.servers": "deployment-kafka-main-1.deployment-prep.eqiad.wmflabs:9092,deployment-kafka-main-2.deployment-prep.eqiad.wmflabs:9092",
[17:01:55] Pchelolo: I don't see restbase purges coming in twice in deployment-prep, not sure if known: https://phabricator.wikimedia.org/T254844#6216088
[17:03:27] but other than that we're close to victory!
[17:06:37] ema: I'm not sure what config hnowlan has deployed for changeprop in beta
[17:06:57] hnowlan: can you please tell us whether in beta we use the kafka purge rule or the htcp purge rule?
[17:08:10] <_joe_> you can see that with kafkacat
[17:08:15] <_joe_> they have different tags
[17:08:20] <_joe_> the purge messages
[17:08:35] <_joe_> (I'm back, sorry for having been afk earlier)
[17:14:35] _joe_: np! so https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604743/ is a noop in production and currently works fine on deployment-prep (cherry-picked there)
[17:15:03] now looking at the purge tags
[17:16:26] so not all messages have tags apparently
[17:16:42] https://phabricator.wikimedia.org/P11473
[17:17:30] I do see restbase purges there though, so perhaps the reason for not having double RB purges is that they're not coming in via multicast?
[17:17:39] <_joe_> i think so yes
[17:18:00] <_joe_> the ones coming from rb might not have a tag
[17:18:21] <_joe_> and indeed, looking at your paste
[17:18:30] <_joe_> yes, rb is only going via kafka
[17:18:45] excellent, updating the ticket with these findings
[17:21:53] I think we're good to go now, aren't we
[17:22:54] <_joe_> yup
[17:24:27] _joe_: ah, I've seen your comment now about using hiera everywhere. I agree, amending
[17:27:32] lol it's 19:30. Amending tomorrow! o/
[17:27:36] sorry, I've been in a meeting
[17:27:55] ok, too late, it's 19:30 apparently :)
[17:27:59] have a nice evening ema
[17:28:12] for the note - restbase purges are not duplicated
[17:28:24] there's either htcp or kafka - that's because of how change-prop works
[17:28:52] Pchelolo: deployment-prep is now receiving all expected kafka purges just fine, there's a minor puppet-style improvement possible but nothing very interesting
[17:29:48] see T254844 for the details. Now I'm really afk! :)
[17:31:19] have a nice evening
[17:35:08] <_joe_> Pchelolo: well done to you too, sir. I'm very happy this was completed before my vacation
[17:37:07] this whole project was a pleasure
[17:37:24] I'll fix the loose ends soon
[18:44:59] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 (10Mholloway)
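The tag check done with kafkacat around [17:15:03] can also be scripted. A rough sketch only: the purge events are assumed to be JSON with optional `meta.uri` and `tags` fields, and the topic and consumer group names are placeholders; only the broker string is taken from the diff quoted at [16:43:20].

```python
#!/usr/bin/env python3
"""Print which purge messages carry tags and which do not."""
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "deployment-kafka-main-1.deployment-prep.eqiad.wmflabs:9092",
    "group.id": "purge-tag-check",   # throwaway group, placeholder name
    "auto.offset.reset": "latest",
})
consumer.subscribe(["eqiad.resource-purge"])  # placeholder topic name

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # RB-originated purges were the ones observed without a tag in P11473.
        uri = event.get("meta", {}).get("uri", "<no uri>")
        print(f"{uri} tags={event.get('tags')}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
```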
[19:31:28] 10serviceops, 10Operations, 10wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (10bd808) >>! In T237889#6053911, @Joe wrote: > Then I'd definitely go with the idea of installing wikitech on a subset of appservers, at least at first. @joe would you als...
[20:00:00] 10serviceops, 10Performance-Team, 10Patch-For-Review: Set up monitoring for Autonomous Systems report - https://phabricator.wikimedia.org/T255189 (10Gilles)
[20:02:25] 10serviceops, 10Performance-Team, 10Patch-For-Review: Set up monitoring for Autonomous Systems report - https://phabricator.wikimedia.org/T255189 (10Gilles) a:05Gilles→03None
[20:02:49] 10serviceops, 10Performance-Team, 10Patch-For-Review: Set up monitoring for Autonomous Systems report - https://phabricator.wikimedia.org/T255189 (10Gilles) I need someone from SRE to review the Puppet patch. Thanks!
[20:46:17] akosiaris: _joe_: during the window of the incident, kubernetes1001 (notably didn't run kask/sessionstore at the time) was having lots of trouble talking to the *anycast recdns IP* https://phabricator.wikimedia.org/P11474
[20:46:49] this was making most k8s-related things fail, as it turns out it wants to resolve kubemaster.svc.eqiad.wmnet an awful lot
[20:47:28] can't we have a local resolver over there?
[20:47:59] but why did it have trouble anyways, hrm
[20:48:08] yeah that's the real question
[20:48:09] augh I'm not here, I'm gone anyways, sorry... nerd-swiped
[20:48:34] will read backscroll tomorrow
[20:52:52] not just k8s having issues with it
[20:52:53] /home/cdanis/Pictures/Screenshot_20200611_165015.png
[20:52:55] err
[20:52:58] Jun 11 18:37:54 kubernetes1001 rsyslogd: omkafka: kafka error message: -193,'Local: Host resolution failure','ssl://logstash1010.eqiad.wmnet:9093/1004: Failed to resolve 'logstash1010.eqiad.wmnet:9093': Temporary failure in name resolution (after 6005ms in state DOWN)' [v8.1901.0 try https://www.rsyslog.com/e/2422 ]
[20:54:55] cdanis: any resolution?
[20:55:06] or only specific hostnames?
[20:55:11] not sure yet
[20:56:23] a fair bit of 'use of closed network connection' as well https://phabricator.wikimedia.org/P11475
[21:03:00] sockets were so broken on the machine we're missing node_exporter data for some of the interval https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=kubernetes1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&from=1591898526140&to=1591903566140
[21:03:25] but conntrack usage was at least as high as 20%
[21:03:49] er, more like 17 but whatever, that's 8x baseline
[21:08:48] interesting to see this as well: Jun 11 18:34:38 kubernetes1001 dockerd[842]: time="2020-06-11T18:34:38.203467102Z" level=info msg="Container 365c9c623df10020638aa04acf5dfa5fa9ca772b66e7972b60e3312728ad482d failed to exit within 30 seconds of signal 15 - using the force"
[21:08:54] sadly that container is no longer in docker ps -a
[21:10:25] Jun 11 18:38:48 kubernetes1001 puppet-agent[4874]: (/File[/var/lib/puppet/facts.d]) Failed to generate additional resources using 'eval_generate': Failed to open TCP connection to puppet:8140 (getaddrinfo: Temporary failure in name resolution)
[21:10:53] also similar failures resolving mirrors.wikimedia.org, webproxy.eqiad.wmnet, apt.wikimedia.org as part of the apt update from puppet agent
[21:11:18] I mean, I'm in favor of picking up more artifact ramp.
[21:11:20] err
[21:11:22] Jun 11 18:40:10 kubernetes1001 rsyslogd: unexpected GnuTLS error -53 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.1901.0 try https://www.rsyslog.com/e/2078 ]
[21:11:56] everything points to something becoming seriously broken in the networking stack of this machine
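To confirm whether the name-resolution failures above are specific to the anycast recursor, a small probe run from the affected host would help. A sketch under assumptions: it requires dnspython >= 2.0, and the recursor address below is a placeholder for whatever resolv.conf on kubernetes1001 actually points at; the hostnames are the ones that appear in the logs quoted above.

```python
#!/usr/bin/env python3
"""Probe the recursive DNS the way kubernetes1001 would, one query per name."""
import time
import dns.exception
import dns.resolver

RECDNS = "10.3.0.1"  # placeholder: substitute the real anycast recdns IP
NAMES = ["kubemaster.svc.eqiad.wmnet", "logstash1010.eqiad.wmnet", "mirrors.wikimedia.org"]

resolver = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
resolver.nameservers = [RECDNS]
resolver.lifetime = 2.0  # fail fast so hangs show up as timeouts

for name in NAMES:
    start = time.monotonic()
    try:
        answer = resolver.resolve(name, "A")
        elapsed = (time.monotonic() - start) * 1000
        print(f"{name}: {[r.address for r in answer]} in {elapsed:.1f} ms")
    except (dns.exception.Timeout, dns.resolver.NXDOMAIN, dns.resolver.NoNameservers) as exc:
        print(f"{name}: FAILED ({exc.__class__.__name__})")
```

Running this in a loop during the incident window, against both the anycast IP and a specific recursor, would distinguish a broken local network stack from a problem on the resolver side.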