[00:00:01] <wikibugs>	 serviceops, SRE, WMF-JobQueue, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) I'm a bit confused now. I thought that was the question we talked about in today's meeting.
[00:16:01] <wikibugs>	 serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Legoktm) Open→Resolved uh, that's right, my bad >.<  I submitted a documentation patch just...
[00:24:45] <wikibugs>	 serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) documentation patch +1, confirmed that's how it is now. Thanks.   I could also be wrong, if w...
[08:06:06] <effie>	 can I get a +1 for 
[08:06:07] <effie>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/676580
[08:06:13] <wikibugs>	 serviceops, Add-Link, Data-Persistence (Consultation), Growth-Team (Current Sprint), Patch-For-Review: Determine why service responses are slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (kostajh) >>! In T279411#6995317, @akosiaris wrote: >>>! In T279411#6985523,...
[08:17:18] <wikibugs>	 serviceops, Add-Link, Data-Persistence (Consultation), Growth-Team (Current Sprint): Determine why service responses are slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (kostajh) >>! In T279411#6997206, @gerritbot wrote: > Change 679240 **merged** by jenkins-bot: > %%...
[08:22:59] <wikibugs>	 serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (akosiaris) I've gone a bit overboard and created https://gerrit.wikimedia.org/r/679258 that uses YA...
[08:24:03] <wikibugs>	 serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (kostajh) Resolved→Open Hmm, I think I spoke too soon:  ` kubectl describe pod/linkrecommendation-production-load-datasets-1618311600-mccln...
[08:39:10] <wikibugs>	 serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (jijiki) >>! In T279100#6997273, @akosiaris wrote: > I've gone a bit overboard and created https://g...
[09:03:19] <kostajh>	 jayme or akosiaris: if you're able to help out with T280076, I'd appreciate it as it would help us complete our dataset updates. As it is, it's going to take a long time due to the bug in the app code that's running in the cron job
[09:04:38] <jayme>	 kostajh: so the problem is your jobs are created with an only image?
[09:05:34] <kostajh>	 jayme: with an old image, it should be using 2021-04-13-190913-production
[09:05:50] <jayme>	 yeah, *old - sorry :)
[09:06:10] <kostajh>	 yesterday m.utante stopped the running cronjob that was using an image from 2021-04-08, but the new cronjob container is using 2021-04-08 instead of 2021-04-13-190913-production
[09:09:09] <jayme>	 makes sense. That pod is created by the job linkrecommendation-production-load-datasets-1618311600 which has the old image set
[09:22:55] <wikibugs>	 serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (JMeybohm) What I think happened here is: * The CronJob (with the old image set in spec) created a Job "linkrecommendation-production-load-dat...
[09:23:20] <jayme>	 kostajh: should be fine now
[09:24:03] <akosiaris>	 jayme: thanks for the update. I was wondering what happened
[09:24:13] <_joe_>	 jayme: I was reading through https://github.com/kubernetes/kubernetes/issues/50345 and I was thinking I'm still ok to use subpaths if I also restart the pod
[09:25:06] <akosiaris>	 so just to be clear, the code in the old job went into a forever loop ?
[09:25:53] <jayme>	 akosiaris: in the old pod, you mean?
[09:27:01] <jayme>	 _joe_: ouch...I totally forgot about that one. That's really evil. But yes, as long as the pod is rescheduled whenever whatever is projected into a subPath changes, you will be find
[09:27:03] <jayme>	 *fine
[09:27:59] <jayme>	 the bad thing is that that's a really thorny detail to know and respect in further development of the chart
[09:29:18] <wikibugs>	 serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (akosiaris) >>! In T279100#6997312, @jijiki wrote: >>>! In T279100#6997273, @akosiaris wrote: >> I 'v...
[09:30:04] <akosiaris>	 jayme: yup, in the old pod. I gathered from the 9h runtime it was in a forever loop.
[09:30:26] <kostajh>	 jayme: thanks very much
[09:30:34] <akosiaris>	 _joe_: oh, bind mounting files? Ouch, that can be painful.
[09:30:52] <_joe_>	 akosiaris: yeah apache vhosts
[09:31:33] <kostajh>	 akosiaris: it would have completed eventually, there was a silly programming error (https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/678918) that just was going to cause the dataset import process to take much, much longer than it should have
[09:31:57] <jayme>	 akosiaris: kind of, I guess. According to what kostajh wrote in the task I get it's not forever but "for long" :D
[09:37:26] <akosiaris>	 kostajh: I am guessing "much, much" being in the order of days? 
[09:39:07] <kostajh>	 akosiaris: yeah, I think so
[09:41:07] <akosiaris>	 ok, makes sense. I think I got the puzzle pieces together now in my head. Thanks!
[09:52:45] <wikibugs>	 serviceops, Add-Link, Data-Persistence (Consultation), Growth-Team (Current Sprint): Determine why service responses are slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (akosiaris) >>! In T279411#6997232, @kostajh wrote:  >> As a side note: Is it ok if we rename the "...
[10:14:33] <_joe_>	 akosiaris, jayme  let's see if you two find a better solution. I want the readiness probe for apache to request the status page, but that poses a problem: I need to authorize the IP that will make the probe to access server-status
[10:15:09] <_joe_>	 so right now I think that's messy, and I'm thinking of writing a script to add to the image and execute to check the status page locally
[10:15:24] <_joe_>	 but if you have a better idea, I'm open to it :)
[10:16:51] <jayme>	 _joe_: would a vhost that just returns 200 be enough to serve as readiness probe or does it need to be server-status?
[10:17:13] <_joe_>	 I preferred to use server-status, but meh :)
[10:17:59] <jayme>	 _joe_: can you evaluate environment variables to get the allowed IP for server-status?
[10:19:08] <_joe_>	 jayme: possibly
[10:19:15] <_joe_>	 not 100% sure
[10:19:26] <_joe_>	 and it's going to be messy anyways
[10:21:42] <jayme>	 If I'm not misled you could then just allow the node's IP for server-status
[10:22:00] <jayme>	 alternatively you could send auth headers I guess
[10:22:29] <jayme>	 with *Probe.httpGet.httpHeaders
[10:25:51] <akosiaris>	 hmm so, I've seen readiness probes come from multiple IPs in the past
[10:25:54] <akosiaris>	 including LVS IPs
[10:26:07] <akosiaris>	 as in, the LVS IPs that are present on the kubernetes node
[10:26:23] <jayme>	 ah, dammit
[10:27:03] <akosiaris>	 that might have changed and the kubelet might now have some way of even enforcing which source IP will be put in the request
[10:27:40] <akosiaris>	 but the point is, don't assume the health checks are going to be from the node IP
[10:27:56] <akosiaris>	 heck don't assume they are going to be over IPv4 and not IPv6 even
[10:29:55] <_joe_>	 yes. I think it's better if I just add a health check script somewhere :)
[10:30:08] <akosiaris>	 ah, that has its own problems
[10:30:40] <akosiaris>	 so using Exec, depending on what you do, might end up being slower than the http check
[10:31:06] <akosiaris>	 and I mean even up to seconds. We've seen a pattern like that with eventgate
[10:31:32] <akosiaris>	 where the readiness probe produces an event and guess what, it's slow. We figured it out when looking at docker operational latencies
[10:31:48] <akosiaris>	 so now our alerts filter out that specific operation IIRC
[10:31:49] <wikibugs>	 serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1034.eqiad.wmnet', 'wtp1035.eqiad.wmnet', 'wtp1036.eqia...
[10:32:31] <_joe_>	 that makes me think we should also avoid testing fcgi readiness via an exec
[10:32:52] <_joe_>	 and just be ok with php being restarted the one time that apache is actually unavailable and php is
[10:54:16] <akosiaris>	 fcgi has no reason for a readiness probe anyway
[10:54:40] <akosiaris>	 it's not being pooled into traffic directly, it's apache that is
[10:55:16] <akosiaris>	 sure, if the fcgi container isn't Ready, you probably don't want to send traffic to the entire pod anyway
[10:55:48] <akosiaris>	 but you may be able to expose that via the apache container readiness probe too
[10:55:58] <_joe_>	 yes, that already works
[10:56:14] <akosiaris>	 in that case, just don't have a readiness probe for fcgi. No point
[10:56:55] <_joe_>	 yeah it's the liveness probe that should be able to find what's wrong instead
[10:57:18] <_joe_>	 e.g. if php-fpm is down or unresponsive
[10:58:53] <akosiaris>	 yup
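A minimal sketch of the local status check _joe_ describes at 10:15, assuming Python (the log does not say what the script would be written in). The URL, the 1-second timeout, and the `BusyWorkers` marker are illustrative assumptions, and `fetch_status`, `is_ready`, and `main` are hypothetical names. akosiaris's caveat still applies: an exec probe spawns a process per check, so it may be slower than an httpGet probe.

```python
import http.client
import urllib.request

# Assumption: mod_status is reachable on localhost inside the apache container.
STATUS_URL = "http://127.0.0.1/server-status?auto"


def fetch_status(url=STATUS_URL, timeout=1.0):
    """Fetch the status page body; raises on connection or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def is_ready(fetch=fetch_status):
    """Return True if the status page is reachable and looks like mod_status output."""
    try:
        body = fetch()
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
    # Assumption: the machine-readable status output contains a BusyWorkers line.
    return "BusyWorkers" in body


def main():
    """Exit code for an exec probe: 0 means ready, non-zero means not ready."""
    return 0 if is_ready() else 1
```

An exec probe would run the script and use the value of `main()` as the process exit status. Because the request originates inside the container, server-status only needs to allow localhost, which sidesteps the question of which source IP the kubelet uses.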
[11:09:26] <wikibugs>	 serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (kostajh) Open→Resolved thank you @JMeybohm!
[11:11:05] <wikibugs>	 serviceops, Performance-Team, SRE, MW-1.37-notes (1.37.0-wmf.1; 2021-04-13), and 2 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (jijiki)
[11:11:54] <wikibugs>	 serviceops, SRE, Patch-For-Review: Migrate onhost memcached to use a unix socket - https://phabricator.wikimedia.org/T273115 (jijiki) Open→Resolved a:jijiki
[11:29:58] <_joe_>	 https://people.wikimedia.org/~oblivian/mw-on-k8s-shared-socket.png
[11:33:53] <wikibugs>	 serviceops, Prod-Kubernetes, SRE, Kubernetes: Migrate default network policies (default-network-policy-conf.yaml) to GlobalNetworkPolicies - https://phabricator.wikimedia.org/T280125 (JMeybohm)
[11:34:18] <wikibugs>	 serviceops, Prod-Kubernetes, SRE, Kubernetes: Migrate default network policies (default-network-policy-conf.yaml) to GlobalNetworkPolicies - https://phabricator.wikimedia.org/T280125 (JMeybohm) p:Triage→Low
[11:40:01] <wikibugs>	 serviceops, Prod-Kubernetes, SRE, Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (JMeybohm) This is not exactly looking great on the staging clusters as we can see heavy throttling. The current assumption is that this is caused by the very...
[11:47:52] <wikibugs>	 serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1034.eqiad.wmnet', 'wtp1035.eqiad.wmnet', 'wtp1036.eqiad.wmnet'] `  and were **ALL** successful.
[12:02:42] <wikibugs>	 serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1037.eqiad.wmnet', 'wtp1038.eqiad.wmnet', 'wtp1039.eqia...
[13:19:13] <wikibugs>	 serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1038.eqiad.wmnet', 'wtp1037.eqiad.wmnet', 'wtp1039.eqiad.wmnet'] `  and were **ALL** successful.
[13:48:50] <wikibugs>	 serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (RLazarus) a:RLazarus
[14:07:53] <_joe_>	 Error: Deployment.apps "mediawiki-test" is invalid: [spec.template.spec.containers[5].livenessProbe.tcpSocket.port: Invalid value: "mcrouter-metrics": must be no more than 15 characters
[14:07:59] <_joe_>	 quality software (TM)
[14:08:12] <wikibugs>	 serviceops, Maps, Packaging, Product-Infrastructure-Team-Backlog, SRE: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (MSantos)
[14:13:20] <_joe_>	 jayme: OTOH, mcrouter can detect a change in a config file from the configmap, I just made an error and it crashed mcrouter
[14:14:00] <wikibugs>	 serviceops, SRE: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (RLazarus) Open→Resolved Done -- just re-enabled puppet, so they'll get picked up over the next 30m.
[14:14:50] <rzl>	 _joe_: just call it "m12t"
[14:15:05] <_joe_>	 rzl: the -metrics part is important
[14:15:22] <rzl>	 oh oops I was shortening mediawiki-test because I'm good at reading
[14:15:33] <rzl>	 m11trics then
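The error _joe_ pasted at 14:07 comes from Kubernetes validating named ports as IANA service names: at most 15 characters, lowercase alphanumerics and hyphens only, at least one letter, no leading or trailing hyphen, and no adjacent hyphens. `mcrouter-metrics` is 16 characters long. A minimal Python sketch of that check follows; `port_name_errors` is a hypothetical helper rather than a Kubernetes API, and only the first message is taken from the log.

```python
import re

# Lowercase alphanumerics and '-', starting and ending with an alphanumeric.
_PORT_NAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def port_name_errors(name):
    """Return a list of reasons why `name` is not a valid named port."""
    errors = []
    if len(name) > 15:
        errors.append("must be no more than 15 characters")
    if not _PORT_NAME_RE.fullmatch(name):
        errors.append("must be lowercase alphanumerics or '-', "
                      "and start and end with an alphanumeric")
    if "--" in name:
        errors.append("must not contain consecutive hyphens")
    if not re.search(r"[a-z]", name):
        errors.append("must contain at least one letter")
    return errors


# The name from the log is one character too long.
print(port_name_errors("mcrouter-metrics"))
# A hypothetical shorter name that keeps a metrics-like suffix.
print(port_name_errors("mcrouter-stats"))
```

Running a check like this in chart linting would catch the problem before the Deployment is submitted.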
[14:35:49] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (Joe) ` joe@wotan:~/Sandbox/mw-on-k8s$ kubectl get pods NAME                              READY   STATUS    RESTARTS   AGE mediawiki-test-6fb67b5f8b-...
[14:39:55] <jayme>	 _joe_: ah, nice :)
[14:41:52] <_joe_>	 jayme: the php-fpm image is just running a file that calls phpinfo()
[14:41:59] <_joe_>	 in my testing env
[14:53:27] <jayme>	 yeah, I know. Fair enough
[15:39:57] <wikibugs>	 serviceops, MW-on-K8s: Create MediaWiki httpd base image - https://phabricator.wikimedia.org/T276097 (Joe) Open→Resolved
[15:40:01] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (Joe)
[18:39:48] <wikibugs>	 serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) >>! In T279100#6997273, @akosiaris wrote: > We seem to only have 1 dedicated videoscaler in c...
[18:41:14] <wikibugs>	 serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) Alex, your patch looks good but I can also see Effie's point. hmm...
[19:00:42] <wikibugs>	 serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Legoktm) >>! In T279100#6997472, @akosiaris wrote: >>>! In T279100#6997312, @jijiki wrote: >> I thin...
[19:00:42] <legoktm>	 what does "BW" on the doc mean?
[19:01:46] <wikibugs>	 serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) I like Lego's summary.
[19:03:19] <rzl>	 bandwidth? man I don't remember at all
[19:07:29] <legoktm>	 wkandek: ^ ?
[19:40:31] <wikibugs>	 serviceops, SRE, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (jeena) Some "errors" restarting php-fpm and depooling services popped up while running the train tod...
[20:47:21] <wikibugs>	 serviceops, SRE, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (Dzahn) After mw train was deployed we get some Icinga alerts which caused worry among deployers:   ` 20:15 <+icinga-wm> PROBLEM - Ensure local MW versions match ex...
[20:58:21] <wikibugs>	 serviceops, Add-Link, Growth-Team: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 (Dzahn) @JMeybohm appreciate the detailed explanation including commands. TIL
[23:37:51] <mutante>	 so, regarding new hardware for appservers in eqiad: this is what we will have to work with so far:
[23:37:55] <mutante>	 "A3 8 spots,  B3 10 spots, in all of row C i only have C3 with 3 spots,   D8 15 spots"
[23:38:19] <mutante>	 that's where we can put new servers to then decom old servers at the same time, one by one
[23:38:29] <mutante>	 while trying to keep it balanced (enough) somehow
[23:39:01] <mutante>	 it will start next week, currently rails are being installed
[23:39:56] <mutante>	 hoping it is possible to keep it balanced in the end without having to move stuff around in another step
[23:40:07] <mutante>	 with that C limitation there
[23:48:38] <wikibugs>	 serviceops: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn) Talked with Jclark and this is the physical situation:   ` <jclark-ctr> right now A3 8 spots,  B3 10 spots, in all of row C i only have C3 with 3 spots,   D8 15 spots.   `  so we have to...