[00:00:01] serviceops, SRE, WMF-JobQueue, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) I'm a bit confused now. I thought that was the question we talked about in today's meeting.
[00:16:01] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Legoktm) Open→Resolved uh, that's right, my bad >.< I submitted a documentation patch just...
[00:24:45] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) documentation patch +1, confirmed that's how it is now. Thanks. I could also be wrong, if w...
[08:06:06] can I get a +1 for
[08:06:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/676580
[08:06:13] serviceops, Add-Link, Data-Persistence (Consultation), Growth-Team (Current Sprint), Patch-For-Review: Determine why service responses are slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (kostajh) >>! In T279411#6995317, @akosiaris wrote: >>>! In T279411#6985523,...
[08:17:18] serviceops, Add-Link, Data-Persistence (Consultation), Growth-Team (Current Sprint): Determine why service responses are slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (kostajh) >>! In T279411#6997206, @gerritbot wrote: > Change 679240 **merged** by jenkins-bot: > %%...
[08:22:59] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (akosiaris) I 've gone a bit overboard and created https://gerrit.wikimedia.org/r/679258 that uses YA...
[08:24:03] serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (kostajh) Resolved→Open Hmm, I think I spoke too soon: ` kubectl describe pod/linkrecommendation-production-load-datasets-1618311600-mccln...
[08:39:10] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (jijiki) >>! In T279100#6997273, @akosiaris wrote: > I 've gone a bit overboard and created https://g...
[09:03:19] jayme or akosiaris: if you're able to help out with T280076, I'd appreciate it as it would help us complete our dataset updates. As it is, it's going to take a long time due to the bug in the app code that's running in the cron job
[09:04:38] kostajh: so the problem is your jobs are created with an only image?
[09:05:34] jayme: with an old image, it should be using 2021-04-13-190913-production
[09:05:50] yeah, *old - sorry :)
[09:06:10] yesterday m.utante stopped the running cronjob that was using an image from 2021-04-08, but the new cronjob container is using 2021-04-08 instead of 2021-04-13-190913-production
[09:09:09] makes sense. That pod is created by the job linkrecommendation-production-load-datasets-1618311600 which has the old image set
[09:22:55] serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (JMeybohm) What I think what happened here is: * The CronJob (with the old image set in spec) created a Job "linkrecommendation-production-load-dat...
[09:23:20] kostajh: should be fine now
[09:24:03] jayme: thanks for the update. I was wondering what happened
[09:24:13] <_joe_> jayme: I was reading through https://github.com/kubernetes/kubernetes/issues/50345 and I was thinking I'm still ok to use subpaths if I also restart the pod
[09:25:06] so just to be clear, the code in the old job went into a forever loop ?
[09:25:53] akosiaris: in the old pod, you mean?
[09:27:01] _joe_: ouch...I totally forgot about that one. That's really evil. But yes, as long as the pod is re-scheduled on change on whatever is projected into a subPath you will be find
[09:27:03] *fine
[09:27:59] the bad thing is that that's a really thorny detail to know and respect in further development of the chart
[09:29:18] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (akosiaris) >>! In T279100#6997312, @jijiki wrote: >>>! In T279100#6997273, @akosiaris wrote: >> I 'v...
[09:30:04] jayme: yup, in the old pod. I gathered from the 9h runtime it was in a forever loop.
[09:30:26] jayme: thanks very much
[09:30:34] _joe_: oh, bind mounting files? Ouch, that can be painful.
[09:30:52] <_joe_> akosiaris: yeah apache vhosts
[09:31:33] akosiaris: it would have completed eventually, there was a silly programming error (https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/678918) that just was going to cause the dataset import process to take much, much longer than it should have
[09:31:57] akosiaris: kind of, I guess. According to what kostajh wrote in the task I get it's not forever but "for long" :D
[09:37:26] kostajh: I am guessing "much, much" being in the order of days?
[09:39:07] akosiaris: yeah, I think so
[09:41:07] ok, makes sense. I think I got the puzzle pieces together now in my head. Thanks!
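The stale-image situation discussed above (a Job created by the CronJob before the image was bumped keeps spawning pods with the old image) can be checked with a small script. This is only a sketch: it assumes the input is shaped like the output of `kubectl get jobs -o json`, and the registry host, the `2021-04-08` tag and the second job name are made up for illustration.

```python
from typing import Dict, List


def jobs_with_stale_image(job_list: Dict, expected_tag: str) -> List[str]:
    """Return the names of Jobs where any container is not on expected_tag."""
    stale = []
    for job in job_list.get("items", []):
        containers = job["spec"]["template"]["spec"]["containers"]
        images = [container["image"] for container in containers]
        # A Job keeps the image it was created with, even if the CronJob changes later.
        if any(not image.endswith(":" + expected_tag) for image in images):
            stale.append(job["metadata"]["name"])
    return stale


# Hypothetical sample data; registry host, old tag and second job name are assumptions.
sample = {
    "items": [
        {
            "metadata": {"name": "linkrecommendation-production-load-datasets-1618311600"},
            "spec": {"template": {"spec": {"containers": [
                {"image": "registry.example.invalid/linkrecommendation:2021-04-08"}
            ]}}},
        },
        {
            "metadata": {"name": "example-newer-job"},
            "spec": {"template": {"spec": {"containers": [
                {"image": "registry.example.invalid/linkrecommendation:2021-04-13-190913-production"}
            ]}}},
        },
    ]
}

print(jobs_with_stale_image(sample, "2021-04-13-190913-production"))
```

Any Job this reports would have to be deleted by hand so that the next CronJob run creates a fresh Job from the updated spec.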
[09:52:45] serviceops, Add-Link, Data-Persistence (Consultation), Growth-Team (Current Sprint): Determine why service responses are slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (akosiaris) >>! In T279411#6997232, @kostajh wrote: >> As a side note: Is is ok if we rename the "...
[10:14:33] <_joe_> akosiaris, jayme let's see if you two find a better solution. I want the readiness probe for apache to request the status page, but that poses a problem: I need to authorize the IP that will make the probe to access server-status
[10:15:09] <_joe_> so right now I think that's messy, and I'm thinking of writing a script to add to the image and execute to check the status page locally
[10:15:24] <_joe_> but if you have a better idea, I'm open to it :)
[10:16:51] _joe_: would a vhost that just returns 200 be enough to serve as readiness probe or does it need to be server-status?
[10:17:13] <_joe_> I preferred to use server-status, but meh :)
[10:17:59] _joe_: can you evaluate environment variables to get the allowed IP for server-status?
[10:19:08] <_joe_> jayme: possibly
[10:19:15] <_joe_> not 100% sure
[10:19:26] <_joe_> and it's going to be messy anyways
[10:21:42] If I'm not misled you could then just allow the nodes IP for server-status
[10:22:00] alternatively you could send auth headers I guess
[10:22:29] with *Probe.httpGet.httpHeaders
[10:25:51] hmm so, I've seen readiness probes come from multiple IPs in the past
[10:25:54] including LVS IPs
[10:26:07] as in, the LVS IPs that are present on the kubernetes node
[10:26:23] ah, dammit
[10:27:03] that might have changed and the kubelet might now have some way of even enforcing which source IP will be put in the request
[10:27:40] but the point is, don't assume the health checks are going to be from the node IP
[10:27:56] heck don't assume they are going to be over IPv4 and not IPv6 even
[10:29:55] <_joe_> yes. I think it's better if I just add a check health script somewhere :)
[10:30:08] ah, that has its own problems
[10:30:40] so using Exec, depending on what you do, might end up being slower than the http check
[10:31:06] and I mean even up to seconds. We've seen a pattern like that with eventgate
[10:31:32] where the readiness probe produces an event and guess what, it's slow. We figured it out when looking at docker operational latencies
[10:31:48] so now our alerts filter out that specific operation IIRC
[10:31:49] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1034.eqiad.wmnet', 'wtp1035.eqiad.wmnet', 'wtp1036.eqia...
[10:32:31] <_joe_> that makes me think we should also avoid testing fcgi readiness via an exec
[10:32:52] <_joe_> and just be ok with php being restarted the one time that apache is actually unavailable and php is
[10:54:16] fcgi has no reason for a readiness probe anyway
[10:54:40] it's not being pooled into traffic directly, it's apache that is
[10:55:16] sure, if the fcgi container isn't Ready, you probably don't want to send traffic to the entire pod anyway
[10:55:48] but you may be able to expose that via the apache container readiness probe too
[10:55:58] <_joe_> yes, that already works
[10:56:14] in that case, just don't have a readiness probe for fcgi. No point
[10:56:55] <_joe_> yeah it's the liveness probe that should be able to find what's wrong instead
[10:57:18] <_joe_> e.g. if php-fpm is down or unresponsive
[10:58:53] yup
[11:09:26] serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (kostajh) Open→Resolved thank you @JMeybohm!
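The `httpGet.httpHeaders` option mentioned in the readiness probe discussion could look roughly like the sketch below, which builds the probe as a plain dict and prints it as JSON. The port `8080`, the header name `X-Probe-Token`, its value and the timing values are assumptions for illustration only; Apache would also need its own rule to accept that header for server-status, which is not shown here.

```python
import json
from typing import Dict, Optional


def http_readiness_probe(path: str, port: int,
                         headers: Optional[Dict[str, str]] = None) -> Dict:
    """Build a Kubernetes HTTP probe spec, optionally with extra request headers."""
    http_get = {"path": path, "port": port}
    if headers:
        # Kubernetes expects a list of {name, value} objects under httpHeaders.
        http_get["httpHeaders"] = [
            {"name": name, "value": value} for name, value in headers.items()
        ]
    # periodSeconds/timeoutSeconds are illustrative values, not tuned ones.
    return {"httpGet": http_get, "periodSeconds": 10, "timeoutSeconds": 1}


# Hypothetical port and header; authorising by header avoids guessing the source IP.
probe = http_readiness_probe("/server-status", 8080, {"X-Probe-Token": "changeme"})
print(json.dumps({"readinessProbe": probe}, indent=2))
```

Compared with an exec probe, this keeps the check inside the kubelet's HTTP client, which avoids the exec latency described in the discussion.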
[11:11:05] serviceops, Performance-Team, SRE, MW-1.37-notes (1.37.0-wmf.1; 2021-04-13), and 2 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (jijiki)
[11:11:54] serviceops, SRE, Patch-For-Review: Migrate onhost memcached to use a unix socket - https://phabricator.wikimedia.org/T273115 (jijiki) Open→Resolved a:jijiki
[11:29:58] <_joe_> https://people.wikimedia.org/~oblivian/mw-on-k8s-shared-socket.png
[11:33:53] serviceops, Prod-Kubernetes, SRE, Kubernetes: Migrate default nework policies (default-network-policy-conf.yaml) to GlobalNetworkPolicies - https://phabricator.wikimedia.org/T280125 (JMeybohm)
[11:34:18] serviceops, Prod-Kubernetes, SRE, Kubernetes: Migrate default nework policies (default-network-policy-conf.yaml) to GlobalNetworkPolicies - https://phabricator.wikimedia.org/T280125 (JMeybohm) p:Triage→Low
[11:40:01] serviceops, Prod-Kubernetes, SRE, Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (JMeybohm) This is not exactly looking great on the staging clusters as we can see heavy throttling. The current assumption is that this is caused by the very...
[11:47:52] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1034.eqiad.wmnet', 'wtp1035.eqiad.wmnet', 'wtp1036.eqiad.wmnet'] ` and were **ALL** successful.
[12:02:42] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1037.eqiad.wmnet', 'wtp1038.eqiad.wmnet', 'wtp1039.eqia...
[13:19:13] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1038.eqiad.wmnet', 'wtp1037.eqiad.wmnet', 'wtp1039.eqiad.wmnet'] ` and were **ALL** successful.
[13:48:50] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (RLazarus) a:RLazarus
[14:07:53] <_joe_> Error: Deployment.apps "mediawiki-test" is invalid: [spec.template.spec.containers[5].livenessProbe.tcpSocket.port: Invalid value: "mcrouter-metrics": must be no more than 15 characters
[14:07:59] <_joe_> quality software (TM)
[14:08:12] serviceops, Maps, Packaging, Product-Infrastructure-Team-Backlog, SRE: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (MSantos)
[14:13:20] <_joe_> jayme: OTOH, mcrouter can detect a change in a config file from the configmap, I just made an error and it crashed mcrouter
[14:14:00] serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (RLazarus) Open→Resolved Done -- just re-enabled puppet, so they'll get picked up over the next 30m.
[14:14:50] _joe_: just call it "m12t"
[14:15:05] <_joe_> rzl: the -metrics part is important
[14:15:22] oh oops I was shortening mediawiki-test because I'm good at reading
[14:15:33] m11trics then
[14:35:49] serviceops, MW-on-K8s, SRE, Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (Joe) ` joe@wotan:~/Sandbox/mw-on-k8s$ kubectl get pods NAME READY STATUS RESTARTS AGE mediawiki-test-6fb67b5f8b-...
[14:39:55] _joe_: ah, nice :)
[14:41:52] <_joe_> jayme: the php-fpm image is just running a file that calls phpinfo()
[14:41:59] <_joe_> in my testing env
[14:53:27] yeah, I know. Fair enough
[15:39:57] serviceops, MW-on-K8s: Create MediaWiki httpd base image - https://phabricator.wikimedia.org/T276097 (Joe) Open→Resolved
[15:40:01] serviceops, MW-on-K8s, SRE, Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (Joe)
[18:39:48] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) >>! In T279100#6997273, @akosiaris wrote: > We seem to only have 1 dedicated videoscaler in c...
[18:41:14] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) Alex, your patch looks good but i can also see Effie's point. hmm...
[19:00:42] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Legoktm) >>! In T279100#6997472, @akosiaris wrote: >>>! In T279100#6997312, @jijiki wrote: >> I thin...
[19:00:42] what does "BW" on the doc mean?
[19:01:46] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) I like Lego's summary.
[19:03:19] bandwidth? man I don't remember at all
[19:07:29] wkandek: ^ ?
[19:40:31] serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (jeena) Some "errors" restarting php-fpm and depooling services popped up while running the train tod...
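Regarding the 14:07 error about the port name `mcrouter-metrics`: Kubernetes treats a string port as a named port, and port names are limited to 15 characters, lowercase alphanumerics and hyphens, with at least one letter. A minimal pre-flight check might look like the sketch below; the function name and the shortened name `mcrouter-stats` are made up for illustration.

```python
import re
from typing import List

MAX_PORT_NAME_LENGTH = 15  # limit quoted in the error message above


def port_name_problems(name: str) -> List[str]:
    """Return a list of reasons why name is not a valid Kubernetes port name."""
    problems = []
    if len(name) > MAX_PORT_NAME_LENGTH:
        problems.append(f"{len(name)} characters, limit is {MAX_PORT_NAME_LENGTH}")
    if not re.fullmatch(r"[a-z0-9-]*", name):
        problems.append("only lowercase letters, digits and '-' are allowed")
    if not re.search(r"[a-z]", name):
        problems.append("must contain at least one letter")
    if "--" in name:
        problems.append("must not contain consecutive hyphens")
    if name.startswith("-") or name.endswith("-"):
        problems.append("must not start or end with a hyphen")
    return problems


print(port_name_problems("mcrouter-metrics"))  # the name from the error: 16 characters
print(port_name_problems("mcrouter-stats"))    # hypothetical shorter alternative
```

Running a check like this over chart values before `helm install` would surface the problem earlier than the API server does.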
[20:47:21] serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (Dzahn) After mw train was deployed we get some Icinga alerts which caused worry among deployers: ` 20:15 <+icinga-wm> PROBLEM - Ensure local MW versions match ex...
[20:58:21] serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (Dzahn) @JMeybohm appreciate the detailed explanation including commands. TIL
[23:37:51] so, regarding new hardware for appservers in eqiad: this is what we will have to work with so far:
[23:37:55] "A3 8 spots, B3 10 spots, in all of row C i only have C3 with 3 spots, D8 15 spots"
[23:38:19] that's where we can put new servers to then decom old servers at the same time, one by one
[23:38:29] while trying to keep it balanced (enough) somehow
[23:39:01] it will start next week, currently rails are being installed
[23:39:56] hoping it is possible to keep it balanced in the end without having to move stuff around in another step
[23:40:07] with that C limitation there
[23:48:38] serviceops: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn) Talked with Jclark and this is the physical situation: ` right now A3 8 spots, B3 10 spots, in all of row C i only have C3 with 3 spots, D8 15 spots. ` so we have to...
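The rack numbers quoted above can be added up to see why old servers have to be decommissioned as new ones go in. This is a back-of-the-envelope sketch that assumes one spot holds exactly one server and takes the figure of 43 new appservers from the T279309 title.

```python
# Free spots per rack, as quoted in the channel.
free_spots = {"A3": 8, "B3": 10, "C3": 3, "D8": 15}
NEW_SERVERS = 43  # from the T279309 task title

total_free = sum(free_spots.values())
shortfall = max(0, NEW_SERVERS - total_free)

# Share of the currently free capacity per row (first letter of the rack name).
row_share = {rack[0]: spots / total_free for rack, spots in free_spots.items()}

print(f"free spots now: {total_free}, new servers: {NEW_SERVERS}, shortfall: {shortfall}")
for row, share in row_share.items():
    print(f"row {row}: {share:.0%} of free spots")
```

Under those assumptions the free spots cover fewer servers than are arriving, and row C holds only a small fraction of them, which matches the balancing concern raised in the discussion.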