[00:06:09] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token, dependency issue in kubeconfig.pp - https://phabricator.wikimedia.org/T273681 (Dzahn) after this last merge everything seems good...
[00:07:04] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token, dependency issue in kubeconfig.pp - https://phabricator.wikimedia.org/T273681 (Dzahn) Open→Resolved a:Dzahn
[11:00:04] serviceops, Cloud-VPS, cloud-services-team (Kanban): Get Service Operations team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273740 (aborrero)
[11:00:22] serviceops, Cloud-VPS, cloud-services-team (Kanban): Get Service Operations team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273740 (aborrero) p:Triage→Medium
[11:06:30] serviceops, SRE, Performance-Team (Radar), Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (Legoktm) I generated some [[http://www.brendangregg.c...
[14:46:47] serviceops, MW-on-K8s, SRE, observability: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Joe)
[15:18:26] hello people
[15:18:43] I am planning to add the eventstreams-internal VIP if there is a merciful soul to assist
[15:21:16] o/
[15:23:53] thanks :)
[15:25:44] jayme: so one of the missing bits that Alex pointed out in the code reviews was
[15:25:47] https://gerrit.wikimedia.org/r/c/operations/dns/+/661386
[15:25:53] that I hope is what is needed
[15:27:04] plus there is the other change that you pointed out earlier on
[15:27:53] that should probably go after the dns change
[15:27:59] (I assume)
[15:30:48] yeah, so DNS needs to go first
[15:31:59] well, after the discovery change
[15:32:46] so https://gerrit.wikimedia.org/r/c/operations/puppet/+/661067/, then https://gerrit.wikimedia.org/r/c/operations/dns/+/661386 and then https://gerrit.wikimedia.org/r/c/operations/puppet/+/661071/ and https://gerrit.wikimedia.org/r/c/operations/puppet/+/661072/ together
[15:38:23] super, so just merged the conftool change, and I assume I need to run puppet on conf*
[15:39:58] not really
[15:40:21] serviceops, Prod-Kubernetes, Kubernetes: Implement switching of staging clusters - https://phabricator.wikimedia.org/T269835 (JMeybohm) >> When switching from staging-eqiad to staging-codfw (and vice versa) we would need to: >> * Ensure all services currently deployed on staging-eqiad are deployed to...
[15:41:19] elukey: I don't remember off the top of my head, reading
[15:42:04] I was doing it as well
[15:42:07] elukey: puppet run on authdns servers is needed
[15:42:22] as of https://wikitech.wikimedia.org/wiki/DNS/Discovery#Add_a_service_to_production
[15:43:21] the conftool::master profile is on puppetmasters and lvs-es afaics
[15:43:59] ah no I think only pybal test
[15:44:32] I am running puppet on the puppetmaster1001
[15:45:21] mmmm not really needed as well
[15:45:48] will do authdns1001 then
[15:45:58] I'm a bit confused :) - but for adding the service to the discovery object you really just need a puppet run on A:dns-auth
[15:47:04] ah nice on puppetmaster1001 I see /etc/conftool/data/discovery/services.yaml
[15:47:11] that contains eventstreams-internal
[15:50:01] jayme: I am running puppet on A:dns-auth, but on authdns1001 didn't really change much
[15:51:07] the guide suggests also to modify hieradata/common/services.yaml, that we do in https://gerrit.wikimedia.org/r/c/operations/puppet/+/661071/2
[15:51:11] so maybe later on?
[15:53:24] hm. I thought it might need to know about the geoip!disc- stuff you will add next
[15:54:28] But docs did not change since I followed them last time, so I guess it's fine :)
[15:54:41] "DNS/Discovery" docs I mean
[15:55:43] (still running puppet on the auth dns servers, -b 1 is very conservative but I always fear the DNS servers :D)
[15:59:39] confctl is aware of the entries, so you should be good to continue with the steps in https://wikitech.wikimedia.org/wiki/DNS/Discovery#Add_a_service_to_production when puppet has done its thing
[16:02:23] I confirm a no-op across all authdns nodes
[16:03:46] okay. I think that's okay then (as confctl knows about it)
[16:03:57] jayme: maybe we can follow directly https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service since it mentions steps for dns-disc
[16:04:01] at the bottom
[16:04:34] oh, yeah - we do :D
[16:04:36] ok then I proceed with https://gerrit.wikimedia.org/r/c/operations/dns/+/661386
[16:05:42] so, we definitely want to follow https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service for all this. And we especially just followed https://wikitech.wikimedia.org/wiki/LVS#etcd_data_for_DNS_Discovery
[16:05:58] which links to https://wikitech.wikimedia.org/wiki/DNS/Discovery#Add_a_service_to_production
[16:06:36] yeah, go ahead with DNS merge (step 2 in the section linked above)
[16:07:32] all right now
[16:07:33] elukey@puppetmaster1001:~$ sudo -i confctl --object-type discovery select 'dnsdisc=eventstreams-internal' get
[16:07:36] {"eqiad": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=eventstreams-internal"}
[16:07:39] {"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=eventstreams-internal"}
[16:07:45] so I'd pool both of them, and then run authdns-update
[16:08:19] elukey: ack
[16:08:44] done, running authdns
[16:09:19] * jayme https://media.giphy.com/media/jUwpNzg9IcyrK/giphy.gif
[16:10:09] ok first issue
[16:10:10] error: plugin_geoip: Invalid resource name 'disc-eventstreams-internal' detected from zonefile lookup
[16:10:13] ufff
[16:10:41] lemme check if I made an error
[16:12:04] checking as well...
[16:13:30] it doesn't seem to be a typo
[16:14:47] the rest of the error is
[16:14:48] error: Name 'eventstreams-internal.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-eventstreams-internal'
[16:14:51] fatal: Initial load of zone data failed
[16:16:30] jayme: since I am stupid, there is a warning in the file that says to change hieradata/common/discovery.yaml first
[16:17:03] that doesn't exist though
[16:17:26] yeah...I think that is part of the common/service.yaml structure now
[16:17:57] yes sigh
[16:18:29] so it seems to me as if the order of things is no longer correct in the docs as we can't apply the DNS change without having the service_catalog entry in place
[16:19:44] jayme: I think that the dns change may need to be only related to the svc zone
[16:19:48] not dns discovery
[16:20:46] jayme: so if you are ok I'd partially revert my dns change to remove the discovery bits
[16:20:57] run authdns-update
[16:21:06] and then merge the common/service.yaml patch
[16:21:19] after it, we'll be able to add dns-disc
[16:21:23] this is my impression
[16:21:47] <_joe_> so yes
[16:22:00] <_joe_> discovery records need to be added after puppet has set up lvs
[16:22:02] yeah. Tbh I would have to look up older patch sets to be sure but it sounds about right
[16:22:11] <_joe_> the order in the docs is still valid
[16:22:45] indeed. It just does not clearly differentiate between DNS and discovery records
[16:22:55] _joe_ I'll add documentation for n00bs like me
[16:23:24] <_joe_> elukey: I thought it was clearly explained below that you need to add the discovery records only once the service is on lvs
[16:23:47] and oblivious people like me - sorry elukey. I should have known better!
[16:24:01] <_joe_> so
[16:24:13] <_joe_> once you move the service to "production" state
[16:24:24] <_joe_> the geo resource will be installed by puppet
[16:24:28] <_joe_> see modules/profile/manifests/dns/auth/discovery.pp
[16:24:31] _joe_ to be completely honest, it is definitely not clear at each step if you are not used to adding a VIP often
[16:24:42] this is why I am adding some warnings, so people will know
[16:24:48] <_joe_> elukey: yes please do
[16:24:50] (it actually is explained properly further below at https://wikitech.wikimedia.org/wiki/LVS#Add_the_dns_discovery_record)
[16:25:02] <_joe_> elukey: you can blame ottomata if the docs aren't clear, he beta tested them
[16:25:57] I think a note in the "DNS changes" section not to add discovery right away would prevent this situation in the future
[16:26:11] exactly yes, I'll add it
[16:27:51] all right the dns auth servers like me again
[16:28:24] hey, now that I've finally caught up on it -- nice work on T273312 !
[16:28:25] and I see the svc records via dns
[16:29:06] cool. Sorry for sending you down the wrong path
[16:30:53] https://wikitech.wikimedia.org/wiki/LVS#DNS_changes_(svc_zone_only)
[16:31:04] jayme: nono it is a good learning experience, thanks for the patience :)
[16:31:45] eheh, sure. But I should have learned them already :D
[16:32:13] wikitech change looks good!
[16:32:19] ok now we should be at the services.yaml step
[16:32:20] https://gerrit.wikimedia.org/r/c/operations/puppet/+/661071
[16:32:57] ack
[16:38:25] I think that one of the steps missing in the config is adding what to check/do right after each step
[16:39:06] for example, the service.yaml file seems to lead to a no-op, but Alex in the code review told me to merge it with the kubernetes worker yaml change
[16:39:21] I am going to add this info, but then after these two steps, should I run puppet somewhere?
[16:39:40] (I know I can check the puppet repo etc.. but it would be great to have an indication in there)
[16:42:41] the kubernetes worker yaml change is actually the next step on the wiki page, that is why Alex said you can merge them together
[16:42:52] there is no real need to do so, though
[16:43:24] ah ok ok.. well I added a note about it, it seems good to pack the two together if we can
[16:44:34] jayme: now it is the turn of pybal I guess
[16:44:42] going to check with the traffic team
[16:45:48] right. After the kubernetes worker yaml change, the "LVS dance" starts
[16:46:13] ah I need to change the service state into lvs_setup
[16:46:28] not yet. Later
[16:47:12] "To add the configuration to PyBal and add the LVS endpoint on the load-balancers, you just need to change the state of your service to lvs_setup. Once puppet has run on the LVS servers, you will have to restart PyBal for your changes to take effect."
[16:47:58] I think it makes sense, otherwise the above merges would trip a change to the lvs-es via puppet
[16:47:58] oh, yeah. Sorry. It's in this step already. You are totally right
[16:48:01] no?
[16:48:01] okok
[16:48:02] :)
[16:50:12] jayme: so Valentin is about to go in meetings, and I have meetings as well in a few, I think that I can restart tomorrow morning with him and you. what do you think?
[16:50:18] I'll comment it in the task
[16:50:22] it should be safe to stop now
[16:50:41] yeah, it's safe to hold here and fine for me to continue tomorrow
[16:50:52] perfect, thanks a lot for the support :)
[16:51:38] yw! Thanks for working your way through it
[16:53:55] serviceops, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (elukey) Next step is https://wikitech.wikimedia.org/wiki/LVS#Configure_the_...
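Editor's note: the confctl output pasted at 16:07 prints one JSON object per line, keyed by datacenter, with a "tags" entry alongside it. A minimal sketch of checking the pooled state from that output before running authdns-update, assuming only the format visible in the paste; the helper name `unpooled_datacenters` is hypothetical and is not part of conftool:

```python
import json
from typing import List


def unpooled_datacenters(confctl_output: str) -> List[str]:
    """Return datacenter names whose discovery record is not pooled.

    Assumes one JSON object per line, as in the confctl paste in the log:
    a datacenter key mapping to a record, plus a "tags" key.
    """
    unpooled = []
    for line in confctl_output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key, value in record.items():
            if key == "tags":
                continue  # metadata, not a datacenter entry
            if not value.get("pooled", False):
                unpooled.append(key)
    return unpooled


# The two lines pasted in the channel at 16:07:36 and 16:07:39.
sample = """\
{"eqiad": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=eventstreams-internal"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=eventstreams-internal"}
"""

print(unpooled_datacenters(sample))  # ['eqiad', 'codfw']
```

This only reads output that was already fetched; it does not pool anything and does not call confctl itself.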
[18:25:44] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:27:35] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:28:47] serviceops, Prod-Kubernetes, Patch-For-Review, User-fsero: Set up PodSecurityPolicies in clusters - https://phabricator.wikimedia.org/T228967 (dduvall) Change to blubber(oid) (https://gerrit.wikimedia.org/r/c/blubber/+/660771) has been deployed.
[18:39:44] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[18:41:10] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:14:10] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2282.codfw.wmnet'] ` an...
[19:17:28] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2281.codfw.wmnet'] ` an...
[19:29:38] serviceops: New Service Request - Calculator Service - https://phabricator.wikimedia.org/T273807 (wkandek)
[19:57:27] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1334.eqiad.wmnet'] ` an...
[21:06:08] serviceops, SRE, decommission-hardware, ops-eqiad: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (Cmjohnson) Open→Resolved removed from rack
[21:11:06] \o/
[21:11:12] thanks mutante for that
[21:11:32] and cmjohnson who is not in here
[23:25:57] oh, it's already gone. nice
[23:28:04] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:29:34] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:31:07] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:34:07] serviceops, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:38:34] serviceops, SRE, decommission-hardware, ops-eqiad: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (Dzahn) How about the unchecked boxes like wiping and updating netbox? I still see it here: https://netbox.wikimedia.org/dcim/devices/1444/
[23:39:09] well, it's still in netbox