[05:12:07] serviceops, Operations, vm-requests: Site: 1 VM request for Analytics test cluster - https://phabricator.wikimedia.org/T266064 (Marostegui) p:Triage→Medium
[08:11:04] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (Joe) Regarding the apache httpd container, I am approaching layering as follows: - one base image, which uses the a...
[08:11:51] <_joe_> everyone: I'd really love to get feedback on https://phabricator.wikimedia.org/T265324#6567095
[08:30:59] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (JMeybohm) If I got this right you are proposing to put apache and php-fpm in the same container, correct (talking ab...
[08:36:00] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (ema) >>! In T265324#6567095, @Joe wrote: > - one base image, which uses the apache2-bin debian package and just modi...
[08:48:01] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (Joe) >>! In T265324#6567154, @JMeybohm wrote: > If I got this right you are proposing to put apache and php-fpm in t...
[08:53:54] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (dcausse) * timestamp: 2020-10-20T18:10:00 to 2020-10-20T21:15:00 * host: mw2252 * message: `[2b171d8b-48ec-480d-b7a4-187dd3af259c] /w/api.php?t...
[08:54:21] serviceops, Prod-Kubernetes, Kubernetes: Test deployment-charts for kubernetes 1.19 compatibility - https://phabricator.wikimedia.org/T266032 (JMeybohm) While it looks like kubeval basically works and we can easily replace kubeyaml with it, it has the disadvantage of relying on https://kubernetesjson...
[08:56:54] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (JMeybohm) >>! In T265324#6567213, @Joe wrote: Oh, well. Sorry then. I guess I just misread the "one image configured...
[08:59:24] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Pablo-WMDE) > must be an instance of **V**ikibase\Search\ [...] Is this some sort of copying glitch?
[09:00:40] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (Joe) >>! In T265324#6567163, @ema wrote: >>>! In T265324#6567095, @Joe wrote: >> - one base image, which uses the ap...
[09:02:59] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (dcausse) >>! In T245183#6567295, @Pablo-WMDE wrote: >> must be an instance of **V**ikibase\Search\ [...] > > Is this some sort of copying glit...
[09:06:41] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Pablo-WMDE) >>! In T245183#6567307, @dcausse wrote: >>>! In T245183#6567295, @Pablo-WMDE wrote: >>> must be an instance of **V**ikibase\Search\...
[10:46:23] serviceops, Operations, vm-requests, Patch-For-Review: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (Marostegui) Open→Resolved a:Marostegui Finally this VM is up and running. ` [10:43:41] marostegui@dborch1001:~$ uptime 10:43:4...
[10:59:26] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Joe) >>! In T245183#6567261, @dcausse wrote: > * timestamp: 2020-10-20T18:10:00 to 2020-10-20T21:15:00 > * host: mw2252 Please note: this was...
[11:18:32] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (jijiki) >>! In T265324#6567095, @Joe wrote: > Regarding the apache httpd container, I am approaching layering as fol...
[11:26:43] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (Joe) >>! In T265324#6567694, @jijiki wrote: >>>! In T265324#6567095, @Joe wrote: >> Regarding the apache httpd conta...
[11:27:53] <_joe_> jfc I keep finding stuff that can't be very obviously translated.
[11:30:34] <_joe_> like: right now we send back Server: mw1234.eqiad.wmnet
[11:30:45] <_joe_> how to translate that to k8s?
[11:31:24] <_joe_> the pod name I guess?
[11:33:44] possibly
[11:37:43] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (RhinosF1)
[11:45:47] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (jijiki) >>! In T265324#6567723, @Joe wrote: >>>! In T265324#6567694, @jijiki wrote: >> How are we planning to solve...
[11:48:22] <_joe_> effie: it's not 5-6 steps
[11:48:29] <_joe_> it's 1 more step than before.
[11:49:31] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (Joe) >>! In T265324#6567761, @jijiki wrote: > Overall, I think we may need to take one step back and consider if an...
[11:52:43] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (Joe) Also I want to clarify: we can reduce the pain as much as possible, but for the duration of the transition phas...
[12:24:49] serviceops, Prod-Kubernetes, Kubernetes: Test deployment-charts for kubernetes 1.19 compatibility - https://phabricator.wikimedia.org/T266032 (JMeybohm) I fiddled with this a bit and it is possible to use a local version/checkout of the schema which we can also generate ourselves with something lik...
[12:26:16] akosiaris: ^ did some testing with kubeval...if you have a second to take a look
[12:26:47] have trouble finding a subtle error case to verify that that is caught, though :)
[12:27:46] jayme: check out a version of the repo, I'd say from before introducing kubeyaml perhaps? There were a ton of errors fixed due to it
[12:27:57] but yeah, I've seen it already. Looks promising
[12:28:42] right...there was something broken. I remember. Will look that up.
[12:30:00] I don't know why upstream does not seem to update the schema repo very often, but it looks as if my script does okay (although it takes some time because of no threading and all). But if we commit that stuff somewhere (or simply provide an HTTP "mirror") we're probably fine
[13:00:21] serviceops, Operations, Release-Engineering-Team-TODO, Continuous-Integration-Config, and 2 others: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (Daimona) Memo (for myself or whoever is interested): xdebug 3 should supposedl...
[13:27:19] serviceops, Operations, vm-requests: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (Nintendofan885)
[13:31:51] _joe_: I was thinking
[13:32:27] that we could do a 2 step migration to k8s, and I am thinking out loud here, I have not gone through the details
[13:33:16] the apache part looks very messy, and despite the number of steps being +1 from before, they are more complex than before
[13:33:23] one k8s deployment is expected for sure
[13:37:17] where I am going with this is, before we start implementing complex solutions on the apache part, we could consider leaving X number of servers running envoy+apache
[13:37:33] and focus on the php-fpm + mediawiki on k8s
[13:39:09] and leave the apache part on k8s, later
[13:39:42] there are issues to be addressed with this idea, eg we will again need to split our apaches in clusters
[13:40:59] we will have to do network connections to php-fpm as opposed to sockets, and I do not know if we have tested that and how it will impact us
[13:41:58] and other issues that can be raised through discussion, but
[13:43:10] it might simplify our work, for now
[13:49:20] <_joe_> that would just introduce a 15-50 ms latency addendum
[13:49:32] <_joe_> and btw we'll have this problem at every layer
[13:49:45] <_joe_> need a change to php-fpm? puppet run + k8s deployment
[13:49:50] <_joe_> there is no way around it
[13:50:00] <_joe_> and yeah, I thoroughly hate this idea
[13:50:15] <_joe_> it's not just tcp connections, it's /remote/ tcp connections
[13:52:09] hangon, if we are to change php-fpm, from what I read, it felt more than just a k8s deployment
[13:52:11] yes?
[13:52:13] <_joe_> the added complexity of deploying stuff during the transition will be a constant. If we're not ready to pay that price for a transition period, we should just give up already.
[13:52:18] <_joe_> no.
[13:52:28] ah, where would the puppet part be
[13:52:33] on a php-fpm change
[13:52:56] <_joe_> I assume settings in k8s will be different, quite radically, from puppet
[13:53:04] <_joe_> so they will be separated configurations
[13:53:49] alright, so our php-fpm settings for whatever we are running in k8s, will reside in which repo ?
[13:54:20] <_joe_> the helm chart?
[13:54:45] cool, so
[13:55:12] we will have php settings for stuff we will not serve on k8s, say small php site
[13:55:20] like the ones we have on mwmaint
[13:55:48] <_joe_> wait
[13:55:54] (I am trying to make sure I have a proper understanding)
[13:55:54] <_joe_> that has nothing to do with mediawiki
[13:56:09] <_joe_> let's focus on the wikis please
[13:56:30] <_joe_> a small php site will not be part of this, and will have its own helm chart in case
[13:56:58] I am trying to understand: you said earlier that changing php-fpm settings would require something more than a push on the helm repo
[13:57:02] and a deploy on k8s
[13:57:13] I may have misunderstood something
[13:57:26] <_joe_> yeah I was saying
[13:57:31] <_joe_> | need a change to php-fpm? puppet run + k8s deployment
[13:57:34] ^ that one
[13:57:44] <_joe_> say we need to add a new extension to php-fpm
[13:57:53] <_joe_> or change some config in all instances
[13:58:01] ok let's pause here
[13:58:18] <_joe_> and we're still in the middle of the transition
[13:58:19] extensions would live in the image where php-fpm lives, right ?
[13:58:33] <_joe_> yes, but say we need to enable one extension
[13:58:35] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (CDanis)
[13:58:52] <_joe_> or say we need to add a configuration to curl-php to only use TLS 1.2 or above
[13:59:15] <_joe_> that's pure configuration
[13:59:17] ok, why would that need a puppet run ?
[13:59:19] <_joe_> one line in the ini file
[13:59:30] <_joe_> during the transition phase?
[13:59:55] <_joe_> MW will be running both in k8s and on the old appservers
[14:00:07] <_joe_> so it will need changes to be kept in sync between puppet and k8s
[14:00:11] ok that is what you mean as a transition phase
[14:00:16] <_joe_> yes
[14:00:23] <_joe_> what else would I mean?
[14:00:34] <_joe_> why would you need puppet otherwise?
[14:01:01] haha ok so
[14:01:06] what I am saying is that
[14:02:06] separate the php-fpm to k8s transition
[14:02:12] from the apache to k8s transition
[14:02:42] and when we have our php-fpm stuff in order generally
[14:03:44] <_joe_> that would make the transition longer, more painful (we would need to maintain a new category of servers, with their own LVS, just for apache, then another lvs just for php-fpm
[14:03:57] <_joe_> ), more brittle, and our systems to perform significantly worse
[14:04:23] <_joe_> all in order to save a "helmfile apply" run
[14:04:41] <_joe_> would also mean we'll have two transitions to manage
[14:04:53] <_joe_> sorry, I really hate it.
[14:05:07] <_joe_> let's hear from others too though, maybe it's me :)
[14:05:55] the part I disagree with, is a puppet change that will be consumed by helm etc, I do not disagree with having a help apply obviously
[14:06:02] helm, dammit
[14:06:48] <_joe_> why a puppet change (a hieradata change) should not be consumed? that reduces duplication and the risk of configurational drifts
[14:07:18] <_joe_> we already do it with envoy listeners
[14:07:43] <_joe_> we use the same list (in hiera) for deploying new changes to envoy on k8s and to envoy on the appservers
[14:08:59] <_joe_> akosiaris: jayme: rzl: we could use another opinion :)
[14:09:12] <_joe_> (and wolfgang and daniel once they wake up)
[14:10:20] I think this is where my visibility becomes limited, I partially understand what we have done in this part
[14:14:15] I'm not completely sure how we do this kind of apache config changes currently, so I don't feel very informed. If we do generate the config for appservers from hiera/puppet, we can generate the exact same config to be included via helm. The extra step to be taken care of is a "helmfile apply" after/together with what I guess is a "scap run" (?) for appservers, right?
[14:19:08] jayme: don't let scap fool you, it is not doing anything in this case
[14:19:32] but you will have a chance to get to know it down the road, have no fear :D
[14:19:36] So it's just puppet creating apache config on appservers?
[14:19:44] and reloading them
[14:19:56] that is one part
[14:20:26] I will let you discuss it with joe, I will read afterwards
[14:21:38] don't steal my move :P
[14:22:17] <_joe_> jayme: yes, right now apache changes are completely managed by puppet
[14:23:23] <_joe_> my idea was to make it a yaml structure and use the same from helm and puppet, so you change the yaml in hiera => it changes on the deployment servers => helmfile apply will deploy it on k8s
[14:23:49] Okay. So the apache config changes are even completely separated from the release train?
[14:24:13] <_joe_> yes
[14:24:48] and they are currently not available as a yaml / hiera structure?
[14:25:03] no
[14:25:13] (I can't help myself)
[14:25:16] <_joe_> no, that would need a small bit of work in the puppet tree
[14:25:21] <_joe_> but not much tbh
[14:26:08] okay. So is your (effie) concern about the fact that we would need to refactor that puppet part away from how it is now to yaml / hiera?
[14:26:37] <_joe_> no her concern is that while now deploying an apache change is puppet patch + orchestrated puppet run
[14:26:53] <_joe_> it will become puppet patch + run + helmfile deploy
[14:27:18] <_joe_> or whatever we'll use to deploy mw on k8s, which will probably be a wrapper around helmfile
[14:27:56] ok walk me through something
[14:28:03] gimme a second
[14:28:15] my concern is not the additional helm apply per se ofc
[14:28:34] ah.. :) go on please
[14:28:47] _joe_: if I were to test an apache change, I would use the old way
[14:29:35] so I would enable puppet on say mwdebug
[14:29:43] <_joe_> effie: how to do test production deployments on k8s will be quite clearly a problem we will resolve next quarter, but it will end up being easier, and by a lot
[14:30:04] let's assume we don't test the change in k8s
[14:30:04] <_joe_> effie: we will add a mwdebug-like system, we can't do without
[14:30:09] <_joe_> why?
[14:30:35] <_joe_> I think mwdebugs will be the first things to be moved
[14:30:46] I am trying to understand here :p I don't want to test the change in k8s, I want to deploy it in a theoretical scenario
[14:30:54] <_joe_> ok
[14:31:08] <_joe_> sorry, go on, explain what your concern is
[14:31:43] I want you to walk me once more through what will happen after I have tested my apache change on mwdebug
[14:32:19] and assume we are not testing on k8s, we will directly deploy it
[14:33:37] <_joe_> so you already merged the puppet change, right?
[14:34:16] <_joe_> again, assuming you're just either a) adding or removing a wiki or b) changing configs for a wiki
[14:35:03] <_joe_> so you have disabled puppet on the mw* servers
[14:35:06] <_joe_> merged the change
[14:35:07] (yes I have merged my change, and I am changing a config)
[14:35:10] <_joe_> enabled on mwdebug
[14:35:13] yes
[14:35:15] <_joe_> tested it there
[14:35:21] <_joe_> puppet enable everywhere
[14:35:33] <_joe_> then ssh to a deploy server
[14:35:43] <_joe_> and run whatever tool we use to deploy the mw env on k8s
[14:36:35] so I am not pushing anything to the helm chart repo
[14:39:41] ok I had misunderstood the steps
[14:41:05] hmm...mystery solved then?
[14:42:34] for the time being yes
[14:42:46] cool! :-)
[14:44:44] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (jijiki) >>! In T265324#6567779, @Joe wrote: > Also I want to clarify: we can reduce the pain as much as possible, bu...
[14:45:14] I have some concerns about how much puppet effort would we need for the transition period
[14:47:57] * jayme needs to rush out for a quick errand, back in ~20min
[14:50:04] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (dancy) Noting for the record in all cases where there's a bogus changed character in a string, the bad character is always one less than what i...
[14:50:58] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (dancy)
[15:01:39] * akosiaris glad you worked it out :-)
[15:07:58] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (LarsWirzenius)
[15:10:24] serviceops, Operations, Scap, Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Make a way to build Scap .deb in Docker - https://phabricator.wikimedia.org/T265501 (LarsWirzenius) While this won't block me starting the release process of the next Scap release, I would like to get thi...
[15:35:53] serviceops, Operations, Scap, Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Make a way to build Scap .deb in Docker - https://phabricator.wikimedia.org/T265501 (jijiki) @LarsWirzenius after discussing it, we decided that for the time being we can't adopt this solution, given that...
[16:10:13] serviceops, Release Pipeline, Release-Engineering-Team, Release-Engineering-Team-TODO: Provide the official production base images for Wikimedia use - https://phabricator.wikimedia.org/T238774 (Jdforrester-WMF)
[16:10:16] serviceops, MW-on-K8s, Operations, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (Jdforrester-WMF)
[16:19:46] <_joe_> effie: does the banner on mwdebu1001 still make sense?
[16:19:56] <_joe_> I understood those tests were over
[16:20:37] yep, let me remove it
[16:20:43] sorry
[16:20:48] <_joe_> np!
[16:21:14] done
[16:30:55] serviceops, DBA, MediaWiki-Parser, Parsoid, Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (WDoranWMF) a:WDoranWMF
[16:43:49] serviceops, Performance-Team, Patch-For-Review, Sustainability (Incident Followup), User-jijiki: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (jijiki) Yesterday we had opcache corruptions on 2 servers, mw2252 && I don't know about other times, bu...
[16:55:50] serviceops, Performance-Team, Patch-For-Review, Sustainability (Incident Followup), User-jijiki: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (mmodell) >>! In T253673#6566529, @tstarling wrote: > > My idea is to tag shared memory with a pkey. The...
[17:31:30] serviceops, DC-Ops, Operations, ops-eqiad: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 (wiki_willy)
[17:33:17] serviceops, DC-Ops, Operations, ops-eqiad: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 (wiki_willy)
[19:00:44] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Mholloway) Open→Resolved a:Pchelolo
[22:11:18] serviceops, Wikifeeds, Kubernetes: wikifeeds-production-tls-proxy regularly exceeding its k8s CPU reservation - https://phabricator.wikimedia.org/T266194 (CDanis)
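
A minimal sketch of the idea _joe_ describes at 14:23:23 (one structure in hiera, consumed by both puppet and helm, so the appservers and k8s cannot drift apart). Everything named here is hypothetical and not taken from the puppet tree or deployment-charts: the SITES list stands in for the hiera YAML, the hostname and docroot are made up, and render_apache_vhost / render_helm_values are illustrative helpers only. The values document is emitted as JSON, which YAML parsers generally accept.

```python
import json

# Hypothetical shared structure; in the discussion this would be YAML in hiera.
SITES = [
    {
        "name": "example-wiki",
        "server_name": "wiki.example.org",
        "docroot": "/srv/example/docroot",
    },
]


def render_apache_vhost(site: dict) -> str:
    """Render a minimal VirtualHost stanza, as puppet might on an appserver."""
    return "\n".join(
        [
            "<VirtualHost *:80>",
            f"    ServerName {site['server_name']}",
            f"    DocumentRoot {site['docroot']}",
            "</VirtualHost>",
        ]
    )


def render_helm_values(sites: list) -> str:
    """Serialize the same list as a values document a chart could consume."""
    return json.dumps({"apache": {"sites": sites}}, indent=2, sort_keys=True)


def in_sync(sites: list) -> bool:
    """Check that both consumers end up with identical apache config."""
    from_values = json.loads(render_helm_values(sites))["apache"]["sites"]
    on_appservers = [render_apache_vhost(s) for s in sites]
    on_k8s = [render_apache_vhost(s) for s in from_values]
    return on_appservers == on_k8s


if __name__ == "__main__":
    print(render_apache_vhost(SITES[0]))
    print(render_helm_values(SITES))
    print("in sync:", in_sync(SITES))
```

With a layout like this, the change itself is made once in the shared data; the extra cost during the transition is the additional deploy step discussed above (puppet patch + run + helmfile deploy), not a second copy of the config.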