[07:02:41] so interestingly since the move to kafka-only purges we stopped having any kind of purged workers backlog it seems
[07:02:46] https://grafana.wikimedia.org/d/RvscY1CZk/purged?panelId=11&fullscreen&orgId=1&from=now-24h&to=now&var-datasource=esams%20prometheus%2Fops&var-cluster=cache_text&var-instance=cp3050
[07:03:17] if you look at other cache_text instances, the same is true for them too
[07:04:33] what the backlog really means is: how many messages fully read from htcp/kafka have not been sent as an HTTP PURGE request to backend/frontend yet
[07:08:50] possibly due to the fact that reading from kafka is less bursty than the multicast HTCP messages?
[07:10:25] well it's not true that there's 0 backlog, there is some but much less than before (switch from 24h view to 1h for instance to see them)
[07:12:07] eg: max backlog in the last 24h was 673367 on cp3050 backend, in the last 3h it is 848
[07:59:57] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Move helm chart repository out of git - https://phabricator.wikimedia.org/T253843 (10JMeybohm) a:03JMeybohm
[08:56:49] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa)
[09:03:43] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa)
[09:14:10] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa) Started to fail cca 4 hours ago, tests passed before. Fails currently only on testwiki and enwpbet...
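For reference, the 24h-vs-3h comparison above can be reproduced with a single query against the Prometheus HTTP API. A minimal sketch, assuming a hypothetical metric name `purged_backlog` and a placeholder Prometheus URL; the real exporter metric and the esams datasource may be named differently:

```python
#!/usr/bin/env python3
"""Compare the recent purged backlog on cp3050 against the last 24h."""
import requests

PROMETHEUS = "http://prometheus.example.org/ops"  # placeholder URL, not the real datasource
INSTANCE = "cp3050"

def max_backlog(window: str) -> float:
    """Return the maximum backlog seen on INSTANCE over the given window."""
    # `purged_backlog` is an assumed metric name for the purged worker backlog gauge.
    query = f'max(max_over_time(purged_backlog{{instance=~"{INSTANCE}.*"}}[{window}]))'
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    # e.g. 673367 over 24h vs 848 over 3h, per the numbers quoted above
    for window in ("24h", "3h"):
        print(f"max backlog over {window}: {max_backlog(window):.0f}")
```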
[09:18:57] during the PHP update I found a number of hosts which were depooled: mw1318,mw2139,mw2145,mw2147,mw2221,mw2219,mw2250,mw2350
[09:19:24] these are all fully green in Icinga, so I suppose someone forgot to repool them after tests/maintenance
[09:19:34] unless anyone objects I'll repool all of these later
[09:21:45] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa)
[09:29:54] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa)
[09:48:16] 10serviceops, 10Growth-Team, 10MediaWiki-Configuration, 10Pywikibot, 10Pywikibot-tests: Intermittent internal API error MWUnknownContentModelException - https://phabricator.wikimedia.org/T255105 (10Dvorapa)
[10:44:54] <_joe_> ema: maybe reading from kafka is slightly slower
[10:48:37] _joe_: right, plus there's some sort of buffering being done by the producer I'm sure
[10:49:02] <_joe_> but the purge lag seems to be very small in general
[10:49:30] <_joe_> we could use burrow metrics to see what the offset lag of all consumers is, maybe; I have to look into it
[10:50:13] <_joe_> but if on average the most recent purge you've seen was generated 50 ms ago, and they're consumed in FIFO fashion, I wouldn't worry too much
[10:51:04] <_joe_> the worst lag can be found in eqsin, where it averages over 500 ms
[10:51:17] <_joe_> still well within the db maxlag limit
[13:03:16] 10serviceops, 10Operations, 10RESTBase, 10RESTBase-architecture, 10Service-Architecture: Use the service proxy in restbase - https://phabricator.wikimedia.org/T255133 (10Joe)
[13:04:49] 10serviceops, 10Operations, 10RESTBase, 10RESTBase-architecture, 10Service-Architecture: Use the service proxy in restbase - https://phabricator.wikimedia.org/T255133 (10Joe) we will have to add some more refinement to the service proxy - specifically we don't need to install all of the remote cluster ha...
[13:10:38] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 (10JMeybohm)
[14:05:30] _joe_: as briefly discussed yesterday -- https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604716/
[14:57:22] also, _joe_ & Pchelolo: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604743/
[14:58:06] next step is setting profile::cache::purge::kafka_cluster_name to 'main-deployment-prep' in deployment-prep's project puppet config on horizon and hope it works as I think it does
[15:23:24] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 (10JMeybohm) All clusters clean from `envoy-tls-local-proxy` image!
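As an alternative to the burrow metrics mentioned at [10:49:30], a consumer group's offset lag can also be read directly from the brokers. A minimal sketch with confluent-kafka, in which the broker, group, and topic names are placeholders rather than the real purged configuration:

```python
#!/usr/bin/env python3
"""Spot-check a consumer group's offset lag per partition."""
from confluent_kafka import Consumer, TopicPartition

BROKERS = "kafka-main1001.eqiad.wmnet:9092"  # placeholder broker
GROUP = "purged-cp3050"                      # placeholder consumer group
TOPIC = "eqiad.resource-purge"               # placeholder topic

consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": GROUP,            # reuse the group so committed() returns its offsets
    "enable.auto.commit": False,  # read-only: never commit anything ourselves
})

# Enumerate the topic's partitions, then compare the group's committed offset
# against the broker's high watermark for each one.
metadata = consumer.list_topics(TOPIC, timeout=10)
partitions = [TopicPartition(TOPIC, p) for p in metadata.topics[TOPIC].partitions]

for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # no committed offset yet
    print(f"partition {tp.partition}: lag = {high - committed} messages")

consumer.close()
```

This gives roughly the per-partition numbers burrow would report, just as a one-off snapshot without the history.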
[16:21:13] Pchelolo: I've cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604743/ on deployment-puppetmaster04, now purged on deployment-cache-text06.deployment-prep.eqiad.wmflabs uses the proper kafka brokers
[16:22:35] well, almost: the port seems to be wrong (9093 vs 9092)
[16:28:35] ah, that's due to a minor issue in the purged profile, fixing
[16:30:31] yeah, here's the problem: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/manifests/cache/purge.pp#22
[16:30:43] $kafka_tls defaults to false, not undef
[16:41:16] fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604790/
[16:43:20] ok, now we're looking good:
[16:43:20] + "bootstrap.servers": "deployment-kafka-main-1.deployment-prep.eqiad.wmflabs:9092,deployment-kafka-main-2.deployment-prep.eqiad.wmflabs:9092",
[17:01:55] Pchelolo: I don't see restbase purges coming in twice in deployment-prep, not sure if known: https://phabricator.wikimedia.org/T254844#6216088
[17:03:27] but other than that we're close to victory!
[17:06:37] ema: I'm not sure what config hnowlan has deployed for changeprop in beta
[17:06:57] hnowlan: can you please tell us whether in beta we use the kafka purge rule or the htcp purge rule?
[17:08:10] <_joe_> you can see that with kafkacat
[17:08:15] <_joe_> they have different tags
[17:08:20] <_joe_> the purge messages
[17:08:35] <_joe_> (I'm back, sorry for having been afk earlier)
[17:14:35] _joe_: np! so https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604743/ is a noop in production and currently works fine on deployment-prep (cherry-picked there)
[17:15:03] now looking at the purge tags
[17:16:26] so not all messages have tags apparently
[17:16:42] https://phabricator.wikimedia.org/P11473
[17:17:30] I do see restbase purges there though, so perhaps the reason for not having double RB purges is that they're not coming in via multicast?
[17:17:39] <_joe_> i think so yes
[17:18:00] <_joe_> the ones coming from rb might not have a tag
[17:18:21] <_joe_> and indeed, looking at your paste
[17:18:30] <_joe_> yes, rb is only going via kafka
[17:18:45] excellent, updating the ticket with these findings
[17:21:53] I think we're good to go now, aren't we
[17:22:54] <_joe_> yup
[17:24:27] _joe_: ah, I've seen your comment now about using hiera everywhere. I agree, amending
[17:27:32] lol it's 19:30. Amending tomorrow! o/
[17:27:36] sorry, I've been in a meeting
[17:27:55] ok, too late, it's 19:30 apparently :)
[17:27:59] have a nice evening ema
[17:28:12] for the note - restbase purges are not duplicated
[17:28:24] there's either htcp or kafka - that's because of how change-prop works
[17:28:52] Pchelolo: deployment-prep is now receiving all expected kafka purges just fine, there's a minor puppet-style improvement possible but nothing very interesting
[17:29:48] see T254844 for the details. Now I'm really afk! :)
[17:31:19] have a nice evening
[17:35:08] <_joe_> Pchelolo: well done to you too, sir. I'm very happy this was completed before my vacation
[17:37:07] this whole project was a pleasure
[17:37:24] I'll fix the loose ends soon
[18:44:59] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 (10Mholloway)
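The tag check done with kafkacat around [17:15:03] can also be scripted. A rough sketch only: the purge events are assumed to be JSON with optional `meta.uri` and `tags` fields, and the topic and consumer group names are placeholders; only the broker string is taken from the diff quoted at [16:43:20].

```python
#!/usr/bin/env python3
"""Print which purge messages carry tags and which do not."""
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "deployment-kafka-main-1.deployment-prep.eqiad.wmflabs:9092",
    "group.id": "purge-tag-check",   # throwaway group, placeholder name
    "auto.offset.reset": "latest",
})
consumer.subscribe(["eqiad.resource-purge"])  # placeholder topic name

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # RB-originated purges were the ones observed without a tag in P11473.
        uri = event.get("meta", {}).get("uri", "<no uri>")
        print(f"{uri} tags={event.get('tags')}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
```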
[19:31:28] 10serviceops, 10Operations, 10wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (10bd808) >>! In T237889#6053911, @Joe wrote: > Then I'd definitely go with the idea of installing wikitech on a subset of appservers, at least at first. @joe would you als...
[20:00:00] 10serviceops, 10Performance-Team, 10Patch-For-Review: Set up monitoring for Autonomous Systems report - https://phabricator.wikimedia.org/T255189 (10Gilles)
[20:02:25] 10serviceops, 10Performance-Team, 10Patch-For-Review: Set up monitoring for Autonomous Systems report - https://phabricator.wikimedia.org/T255189 (10Gilles) a:05Gilles→03None
[20:02:49] 10serviceops, 10Performance-Team, 10Patch-For-Review: Set up monitoring for Autonomous Systems report - https://phabricator.wikimedia.org/T255189 (10Gilles) I need someone from SRE to review the Puppet patch. Thanks!
[20:46:17] akosiaris: _joe_: during the window of the incident, kubernetes1001 (notably didn't run kask/sessionstore at the time) was having lots of trouble talking to the *anycast recdns IP* https://phabricator.wikimedia.org/P11474
[20:46:49] this was making most k8s-related things fail, as it turns out it wants to resolve kubemaster.svc.eqiad.wmnet an awful lot
[20:47:28] can't we have a local resolver over there?
[20:47:59] but why did it have trouble anyways, hrm
[20:48:08] yeah that's the real question
[20:48:09] augh I'm not here, I'm gone anyways, sorry... nerd-swiped
[20:48:34] will read backscroll tomorrow
[20:52:52] not just k8s having issues with it
[20:52:53] /home/cdanis/Pictures/Screenshot_20200611_165015.png
[20:52:55] err
[20:52:58] Jun 11 18:37:54 kubernetes1001 rsyslogd: omkafka: kafka error message: -193,'Local: Host resolution failure','ssl://logstash1010.eqiad.wmnet:9093/1004: Failed to resolve 'logstash1010.eqiad.wmnet:9093': Temporary failure in name resolution (after 6005ms in state DOWN)' [v8.1901.0 try https://www.rsyslog.com/e/2422 ]
[20:54:55] cdanis: any resolution?
[20:55:06] or only specific hostnames?
[20:55:11] not sure yet
[20:56:23] a fair bit of 'use of closed network connection' as well https://phabricator.wikimedia.org/P11475
[21:03:00] sockets were so broken on the machine we're missing node_exporter data for some of the interval https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=kubernetes1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&from=1591898526140&to=1591903566140
[21:03:25] but conntrack usage was at least as high as 20%
[21:03:49] er, more like 17 but whatever, that's 8x baseline
[21:08:48] interesting to see this as well: Jun 11 18:34:38 kubernetes1001 dockerd[842]: time="2020-06-11T18:34:38.203467102Z" level=info msg="Container 365c9c623df10020638aa04acf5dfa5fa9ca772b66e7972b60e3312728ad482d failed to exit within 30 seconds of signal 15 - using the force"
[21:08:54] sadly that container is no longer in docker ps -a
[21:10:25] Jun 11 18:38:48 kubernetes1001 puppet-agent[4874]: (/File[/var/lib/puppet/facts.d]) Failed to generate additional resources using 'eval_generate': Failed to open TCP connection to puppet:8140 (getaddrinfo: Temporary failure in name resolution)
[21:10:53] also similar failures resolving mirrors.wikimedia.org, webproxy.eqiad.wmnet, apt.wikimedia.org as part of the apt update from puppet agent
[21:11:18] I mean, I'm in favor of picking up more artifact ramp.
[21:11:20] err
[21:11:22] Jun 11 18:40:10 kubernetes1001 rsyslogd: unexpected GnuTLS error -53 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.1901.0 try https://www.rsyslog.com/e/2078 ]
[21:11:56] everything points to something becoming seriously broken in the networking stack of this machine
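To confirm whether the name-resolution failures above are specific to the anycast recursor, a small probe run from the affected host would help. A sketch under assumptions: it requires dnspython >= 2.0, and the recursor address below is a placeholder for whatever resolv.conf on kubernetes1001 actually points at; the hostnames are the ones that appear in the logs quoted above.

```python
#!/usr/bin/env python3
"""Probe the recursive DNS the way kubernetes1001 would, one query per name."""
import time
import dns.exception
import dns.resolver

RECDNS = "10.3.0.1"  # placeholder: substitute the real anycast recdns IP
NAMES = ["kubemaster.svc.eqiad.wmnet", "logstash1010.eqiad.wmnet", "mirrors.wikimedia.org"]

resolver = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
resolver.nameservers = [RECDNS]
resolver.lifetime = 2.0  # fail fast so hangs show up as timeouts

for name in NAMES:
    start = time.monotonic()
    try:
        answer = resolver.resolve(name, "A")
        elapsed = (time.monotonic() - start) * 1000
        print(f"{name}: {[r.address for r in answer]} in {elapsed:.1f} ms")
    except (dns.exception.Timeout, dns.resolver.NXDOMAIN, dns.resolver.NoNameservers) as exc:
        print(f"{name}: FAILED ({exc.__class__.__name__})")
```

Running this in a loop during the incident window, against both the anycast IP and a specific recursor, would distinguish a broken local network stack from a problem on the resolver side.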