[00:47:03] serviceops, Product-Infrastructure-Team-Backlog, Recommendation-API, Release-Engineering-Team, and 2 others: Migrate recommendation-api to kubernetes - https://phabricator.wikimedia.org/T241230 (bmansurov) @akosiaris Thanks for reviewing my patchsets. I was wondering if you've seen my last commen...
[06:39:56] serviceops, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (Marostegui) @jijiki reading the task it is not clear to me what's needed from us (#dba). Is it a heads up that you'll be running `updateCollation.php` against the wikis listed on T264991#...
[07:21:14] serviceops, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki) @Marostegui yes, it is a headsup for your radar, thank you!
[07:22:22] serviceops, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (Marostegui) Thank you! What's the expected impact of `updateCollation.php`?
[08:09:46] serviceops, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff) One other sanity check for the rollout (in particular when the whole server batch gets upgraded on the 16th); ` cumin foo* 'php -r "var_dump(IntlChar::getUnicodeVersi...
[08:36:34] serviceops, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff)
[08:46:19] _joe_: we have updated to icu63 on mwdebug2002
[08:46:38] is there something specific we should watch out for?
[08:46:44] given this is ro
[08:47:46] <_joe_> effie: not that I know of specifically
[08:47:57] <_joe_> let me check one thing though
[08:48:44] ah what's that
[08:49:16] <_joe_> I was checking https://phabricator.wikimedia.org/T219279
[08:50:02] oh dear, that task
[08:50:30] <_joe_> just add https://en.wikipedia.org/w/index.php?title=%C7%85&redirect=no to your test suite
[08:51:15] <_joe_> and check it doesn't redirect in the "wikipedia does not have an article with this exact name" page
[08:51:21] serviceops, MediaWiki-General, Operations, MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe) Gentle nudge, this really needs to be completed. @WDoranWMF...
[08:52:08] that looks fine
[08:52:12] sweet
[08:52:15] <_joe_> yep
[08:52:30] <_joe_> the unicode table internal to php hasn't changed, so I didn't expect problems
[08:52:34] <_joe_> but better to check :)
[08:53:04] <_joe_> it is possible we might need to update some lookup tables in luasandbox at some point though. But I don't know who to ask now
[08:57:36] <_joe_> jayme: I answered you on the httpd image review. I opted to switch to build args to avoid allowing incorrect setups at runtime
[08:58:07] <_joe_> basically *every* single one of those variables, if changed, might mean you're left with a broken running httpd
[08:58:31] <_joe_> given the amount of possible footguns, I went the "convention over configuration" way
[08:58:48] <_joe_> there are also a couple other issues, but that's my main reasoning
[09:00:47] _joe_: okay. But "Listen is one of the directives where it would not work" is wrong
[09:01:18] <_joe_> is it?
IIRC you cannot use env variables in contexts where "If" could not be used
[09:01:41] <_joe_> jayme: if you've already tested that, we can just switch APACHE_RUN_PORT to be an env var
[09:01:54] <_joe_> after all that's the only value that can be safely changed at runtime
[09:02:00] I actually built the example I gave in the comment yesterday and it worked
[09:02:27] <_joe_> ok, then I would just move APACHE_RUN_PORT to an env var
[09:02:46] <_joe_> for the reasons above, the rest of the args really need to only be changed at build time
[09:04:00] yeah, fine with that
[09:45:18] I just got "clearence", so I'll be out the next days as well
[09:45:30] *clearance
[09:55:44] <_joe_> jayme: very happy for you!
[10:13:21] thanks <3
[10:23:22] <_joe_> jayme/akosiaris: I was taking a look at the fact we happen to have connection resets on our k8s clusters, and it led me to look at our conntrack settings. Specifically I see
[10:23:41] <_joe_> - the conntrack table is very large, but the number of buckets is not
[10:24:11] <_joe_> you should have nf_conntrack_max = 4* nf_conntrack_buckets
[10:24:38] <_joe_> we have that at 32
[10:24:57] <_joe_> also we have nf_conntrack_tcp_be_liberal set to 0
[10:25:28] <_joe_> https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/ suggests we should change those
[10:26:17] <_joe_> I would propose we try to up the buckets on one node, and we also set tcp_be_liberal
[10:26:24] <_joe_> to 1, I mean
[10:26:31] jayme: I told you, you looked fine
[10:26:46] <_joe_> and see if we still get a lot of logs for INVALID packets
[10:50:10] _joe_: you mean we have buckets at 32?
[10:50:29] <_joe_> jayme: conntrack_max = 32* conntrack_buckets
[10:50:35] ah
[10:54:03] serviceops, Beta-Cluster-Infrastructure, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) I will be updating to icu63 in deployment-prep, with Moritz looking on.
This will likely happen later today, and I'll post updates about the prog...
[10:54:04] but where does the statement "nf_conntrack_max = 4* nf_conntrack_buckets" come from?
[11:00:36] <_joe_> I remember reading it in the kernel docs, lemme find it
[11:00:52] <_joe_> that's at least the default ratio on my boxes :)
[11:04:57] <_joe_> https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.txt
[11:04:59] https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.txt
[11:05:09] nf_conntrack_buckets is autocalculated
[11:05:11] <_joe_> ok "the default value is nf_conntrack_buckets value * 4."
[11:05:11] well done :)
[11:05:26] <_joe_> lol
[11:05:34] <_joe_> akosiaris: it can be overridden IIRC
[11:05:42] and it's a hash table, it's 65536 buckets
[11:06:11] <_joe_> akosiaris: we have half of that in production on kubernetes1001
[11:06:16] <_joe_> probably older kernels
[11:06:48] <_joe_> anyways that blog post suggests the issue is solved by allowing nf_conntrack_tcp_be_liberal = 1
[11:07:59] yeah, but do we have the same issue as them?
[11:08:45] ah, we probably will piggyback on https://github.com/kubernetes/kubernetes/pull/74840
[11:08:52] is the PDF download thing the background here?
[11:10:43] JFTR, we have existing Puppet code in the base firewall classes which changes the sysctl which eventually controls the conntrack bucket size
[11:11:21] there's also an older Phab task around this which should have numbers of the additional memory usage for bumping the table; if we change it we can most certainly simply do it fleet-wide
[11:11:33] https://phabricator.wikimedia.org/T105307
[11:12:30] or not...not a lot of numbers there :)
[11:13:02] here they are https://gerrit.wikimedia.org/r/c/operations/puppet/+/237389
[11:13:25] biab
[11:13:47] https://github.com/wikimedia/puppet/commit/51223efe4a has the slab sizes
[11:26:10] <_joe_> jayme: no it's that I keep seeing in logstash a ton of packets discarded as INVALID
[11:26:24] <_joe_> we did look into it a few months back with akosiaris
[11:27:11] <_joe_> and it appeared when we introduced envoy
[11:27:30] <_joe_> which establishes long-lasting TCP connections
[11:27:41] <_joe_> so yes, the patch above is probably going to help
[11:31:35] <_joe_> that could also explain some of the intermittent pybal failures we see
[11:40:19] yeah, let's just wait for the 1.16 upgrade, it should solve this.
[11:52:28] <_joe_> how soon will that happen? :)
[11:53:29] soon enough :P
[11:53:47] on a more serious note, I can't really offer you an ETA, the way things are ...
[11:53:59] <_joe_> sure :/
[12:49:11] _joe_: I don't get why nf_conntrack_tcp_be_liberal=1 fixes anything here. IIUC that would just have the packets no longer marked as INVALID by conntrack, but wouldn't they still be forwarded to the client (without proper SNAT)?
[12:55:28] did you know about https://relnotes.k8s.io/?markdown=action required akosiaris? That's cool
[12:57:56] wow, I had no idea
[12:57:58] that's nice!
[12:58:20] that's awesome actually, thanks!
[12:58:22] * akosiaris bookmarking
[13:20:36] <_joe_> jayme: the difference is that a packet that doesn't get routed will be retransmitted and will not result in a RST being sent back
[13:21:24] <_joe_> yeah that's *really* nice
[13:21:27] <_joe_> also a bit sad
[13:23:42] okay. So it means if conntrack does not mark the packet as INVALID it will not be routed at all. Did not know that
[13:24:52] <_joe_> that's my understanding, but i need to re-read how snat works
[13:25:50] <_joe_> but even if the packet gets forwarded without SNAT, what will happen? I guess it would be discarded by the receiving tcp stack
[13:26:59] Yeah. That was my understanding of what happens in the current state. receiving tcp stack says "huh?", discards the packet and sends RST
[13:28:12] The consequence for a packet not being marked as INVALID was unclear to me as it sounds like it will be processed as if it was "valid" ... so no difference in the rest of the procedure
[13:28:27] (meaning: receiving tcp stack says "huh?", discards the packet and sends RST)
[13:29:04] <_joe_> jayme: so, no, sorry, I went back to TFA: packets marked as INVALID don't get DNATTed
[13:29:08] <_joe_> that's the problem
[13:30:15] <_joe_> so not marking them as INVALID because conntrack cannot keep track of them causes the RST to be sent, while if the DNAT rules were applied, it would be receivable
[13:30:27] <_joe_> and btw we have an additional problem on top of that, which is
[13:33:35] <_joe_> we drop a lot of those packets
[13:34:33] <_joe_> Nov 10 13:31:12 kubernetes1001 ulogd[26914]: [fw-in-drop] IN=eno1 OUT= MAC=18:66:da:56:64:a1:f4:e9:d4:cf:36:60:08:00 SRC=10.64.0.241 DST=10.2.2.45 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=0 DF PROTO=TCP SPT=35094 DPT=4492 SEQ=3866775629 ACK=0 WINDOW=0 RST URGP=0 MARK=0
[13:35:38] wait a second...that still does not make sense to me.
[13:35:56] <_joe_> uhm no wait, these are external-to-k8s RSTs
[13:36:39] <_joe_> this is a parsoid server communicating with eventgate-main
[13:37:21] current state is: conntrack marks a packet as INVALID (because of whatever, does not matter) -> packet is forwarded without (S|D)NAT -> receiving side does not know about the connection -> client side sends RST
[13:37:29] is that correct so far?
[13:37:34] <_joe_> yes
[13:38:03] <_joe_> so if you stop marking them invalid, (S|D)NAT rules will be applied
[13:38:30] <_joe_> it's probably a better choice to drop packets in state INVALID instead
[13:39:11] <_joe_> which I don't remember us doing, also I'm scared to run iptables on a k8s host. That will make me want to uninstall k8s now :P
[13:39:24] but how? There is a reason they are marked as INVALID ... and that's mostly because conntrack does not know about them (table full, whatever). So what (S|D)NAT rules should it apply then?
[13:39:56] MARK=1337
[13:41:23] <_joe_> jayme: conntrack is not needed per se to apply DNAT rules IIRC
[13:42:11] <_joe_> but you're making me doubt my memory
[13:42:21] i would assume it's needed for dynamic nat entries, not for static ones
[13:42:49] in other words, perhaps it can do without where it can do a 1:1 mapping based purely on the information of the current packet, not earlier ones
[13:43:42] <_joe_> mark: exactly, and we do have only static rules
[13:43:50] <_joe_> set dynamically by kube-proxy
[13:44:13] <_joe_> but I might miss something
[13:44:33] ah, okay.
So that's the part I'm missing then
[13:45:09] thanks :)
[13:45:39] <_joe_> jayme: I'm not 100% sure, I will look at the iptables rules kube-proxy sets up
[13:45:59] <_joe_> it's also true we don't fill up the conntrack table on those servers
[14:27:23] serviceops, Release-Engineering-Team-TODO, Patch-For-Review, Release-Engineering-Team (Deployment services): Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (MoritzMuehlenhoff) Before we start reimaging, let's also merge https://gerrit.wikimed...
[14:30:26] serviceops, Beta-Cluster-Infrastructure, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (Reedy) >>! In T264991#6615421, @Marostegui wrote: > Thank you! What's the expected impact of `updateCollation.php`? Many many categorylinks rows being up...
[14:31:53] serviceops, Operations, Release-Engineering-Team-TODO, Patch-For-Review, Release-Engineering-Team (Deployment services): Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (MoritzMuehlenhoff)
[14:34:35] serviceops, Beta-Cluster-Infrastructure, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki) >>! In T264991#6615421, @Marostegui wrote: > Thank you! What's the expected impact of `updateCollation.php`? We will run on one wiki and see how...
[14:40:47] serviceops, Beta-Cluster-Infrastructure, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (Marostegui) Thank you @Reedy!
[15:28:33] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Build calico 3.16 - https://phabricator.wikimedia.org/T266893 (JMeybohm) >>! In T266893#6605404, @JMeybohm wrote: > > When deployed as cluster addon, we can bypass all this and have mandatory components of our stack deployed/reconceiled...
[15:38:14] akosiaris: did you know there is a kind of official calico helm chart :-o
[15:38:34] https://github.com/projectcalico/calico/tree/master/_includes/charts/calico
[15:40:42] it's missing a values.yaml for whatever reason, though
[15:41:32] serviceops, Beta-Cluster-Infrastructure, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) Upgrade plan on deployment-prep: [] add profile::mediawiki::php::icu63: true to hiera for deployment-prep project prefix; this will only have...
[15:42:23] jayme: yeah, but it relies on their images and it's a bit of a chicken-and-egg problem to apply it
[15:42:31] serviceops, Beta-Cluster-Infrastructure, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff) Sounds good!
[15:43:47] akosiaris: but just because of the images?
[15:44:15] or because stuff needs to talk to each other and that is not going to work until stuff talked to each other :)
[15:44:57] that latter part
[15:46:23] from calico the hard way it seems pretty straightforward...
[15:46:44] ?
[15:47:58] rolling the components out via helm
[15:48:35] ah..but then there is tiller. That's the problem I suppose
[15:48:35] <_joe_> maybe with helm 3?
[15:48:41] <_joe_> yeah
[15:48:45] well done again :P
[15:48:52] yeah, tiller being able to talk to the API
[15:49:01] they ofc just support kubectl apply
[15:49:08] hmm...dammit
[15:51:20] <_joe_> yeah you can do helm template | kubectl apply --lol -f -
[15:51:30] hrhr
[15:53:17] and currently we also don't have calico-node in containers. That's why it looks a bit easier currently (in helmfile.d/admin)
[15:57:22] but we have the same catch-22 now for the calico-policy-controller, right? That's why we hacked this allow all policy that gets overridden once the controller is started...
[15:58:22] but doing that in helm2 would probably freak it out
[15:59:18] serviceops, Operations: upgrade mwmaint1002 to buster - https://phabricator.wikimedia.org/T267607 (jijiki) p:Triage→Medium
[15:59:31] what if we...upgrade helmfile.d/admin to helm3 ...
[15:59:34] * jayme runs
[15:59:54] serviceops, Operations, Platform Engineering, Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster - https://phabricator.wikimedia.org/T267581 (jijiki) p:Triage→Medium
[16:19:19] serviceops, Beta-Cluster-Infrastructure, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) Note that I'm running cumin 'O{project:deployment-prep name:^deployment-mediawiki-[0-9]+$ } or O{project:deployment-prep name:^deployment...
[16:20:05] serviceops, Operations, Prod-Kubernetes, Kubernetes: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (JMeybohm)
[16:20:20] serviceops, Operations, Prod-Kubernetes, Kubernetes: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (JMeybohm) p:Triage→High
[16:22:58] serviceops, Operations, Prod-Kubernetes, Kubernetes: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (JMeybohm) To solve the catch-22 we could deploy the to-be calico helm chart via helm3. Which would require us to invest into helm3 integration earlier than we hoped fo...
[16:40:30] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (sdkim) a:Jgiannelos→None Given we, Product Infra, are not finding issues at our service level i...
[17:06:54] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (akosiaris) >>! In T266373#6613038, @Jgiannelos wrote: > @akosiaris More from debugging on this issue: >...
[17:12:05] serviceops, Beta-Cluster-Infrastructure, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) Can't proceed at the moment, puppet sync to deployment-prep has been broken since Nov 6. Log excerpts from the earliest error: ` 2020-11-06T20...
[17:54:35] serviceops, Beta-Cluster-Infrastructure, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) This is apparently from T267439 After some discussion with jbond and dancy in irc, I am going to revert that and hope I'm not making the varn...
[18:03:27] serviceops, Beta-Cluster-Infrastructure, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) Puppet sync back to working. Back on track to continue with the update in deployment-prep.
[18:33:55] serviceops, MediaWiki-General, Operations, MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (holger.knust) Script execution failed with holger@mwmaint1002:~...
[18:38:37] serviceops, Beta-Cluster-Infrastructure, DBA, Operations: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) Update in deployment-prep is now complete, assuming I did not miss any hosts. ` root@deployment-cumin:~# cumin 'O{project:deployment-prep...
[19:16:08] serviceops, MediaWiki-General, Operations, MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (jcrespo) pc2010 seems to be lagging behind. This is a non-issue f...
[19:19:55] serviceops, MediaWiki-General, Operations, MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (ArielGlenn) From looking at the code, it seems like the user list...
[20:44:47] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (akosiaris) I've also run the same tests against `restbase.svc.eqiad.wmnet` in P13257 and I have the fo...
[20:48:31] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (akosiaris) >>! In T266373#6616625, @sdkim wrote: > Given we, Product Infra, are not finding issues at o...
[20:54:25] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (akosiaris) > Interestingly, proton returns transfer-encoding: chunked responses, that don't have a Cont...
[20:55:27] serviceops, Operations, Product-Infrastructure-Team-Backlog, Proton, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (Urbanecm)
[20:55:38] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (Urbanecm)
[21:22:16] serviceops, Platform Engineering, observability, Developer Productivity: Set ENV SERVERGROUP for jobrunner MW web requests - https://phabricator.wikimedia.org/T266515 (AMooney) p:Triage→Medium We think after task T246371 is complete, this will be resolved. Requesting to decline.
[23:30:08] serviceops, Platform Engineering, observability, Developer Productivity: Set ENV SERVERGROUP for jobrunner MW web requests - https://phabricator.wikimedia.org/T266515 (Krinkle)
[23:30:37] serviceops, Platform Engineering, observability, Developer Productivity: Set ENV SERVERGROUP for jobrunner MW web requests - https://phabricator.wikimedia.org/T266515 (Krinkle) Marking as sub task accordingly. We can confirm afterwards or re-prioritize then.