[00:05:21] <wikibugs>	 serviceops, Operations, Growth-Team (Current Sprint), Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (MMiller_WMF) Okay, we can talk about this for next week's plan.
[05:01:48] <wikibugs>	 serviceops, DBA, SRE-tools, conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (Marostegui) p:Triage→High
[06:02:10] <wikibugs>	 serviceops, DBA, SRE-tools, conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (Joe) restarting all confds before switching DC seems overkill and frankly useless. We should...
[06:44:42] <_joe_>	 hnowlan: ping for when you're around :)
[06:46:49] <_joe_>	 elukey: I have a nice "starter task" for one of your new hires, too!
[06:53:43] <elukey>	 _joe_ they are already working on some tasks (mostly related to analytics) but we can add tasks to the backlog too
[06:53:52] <elukey>	 what do you have in mind?
[06:54:13] <_joe_>	 elukey: we need someone to convert the eventgate helmfile directories to the new format
[06:54:35] <_joe_>	 it's going to be either otto or someone else in your "team" :P
[06:54:53] <_joe_>	 well I think we (me and janis) already converted two of them
[06:56:24] <elukey>	 Andrew will probably pick this up, even if I should also know how it is done (totally ignorant about the k8s part sadly)
[06:57:01] <_joe_>	 ohhh so we can make this a workshop for you and the others
[07:00:19] <elukey>	 Andrew introduced helm in the context of eventgate to me a couple of times, but I do k8s things so rarely that my limited LRU brain doesn't keep anything useful :D
[07:00:28] <elukey>	 a workshop would indeed be great
[07:31:21] <effie>	 elukey: I suffer from the same disease
[07:32:59] <apergos>	 I have LRU with data corruption...
[07:39:45] <_joe_>	 tsk people, upgrade to a persistent plan
[07:39:58] <_joe_>	 although tbh remembering too much is a curse as well
[08:45:33] <wikibugs>	 serviceops, MediaWiki-General, Operations, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe)
[09:22:21] <hnowlan>	 _joe_: yo
[09:23:44] <_joe_>	 hnowlan: I would ask you to take care of converting the cpjobqueue/changeprop/api-gateway helmfiles to the new format
[09:23:59] <_joe_>	 it's mostly just running the script and deduplicating
[09:25:14] <hnowlan>	 _joe_: sounds good, might have time for it today. is there a doc or anything? 
[09:25:54] <_joe_>	 yes, look at the README in the helmfile.d/services directory
[09:27:47] <hnowlan>	 cool
[09:29:04] <_joe_>	 and then, either me or jayme can review
[10:06:02] <wikibugs_>	 serviceops, DBA, SRE-tools, conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (jcrespo)
[10:09:35] <wikibugs>	 serviceops, Patch-For-Review: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (JMeybohm) @akosiaris and me put together a more precise step by step plan about what needs to be done for the outstanding eqiad migration.  Current p...
[11:23:55] <hnowlan>	 is the helmfile_convert_diff.sh script designed to be run exclusively from the deploy servers? 
[11:24:08] <_joe_>	 no
[11:24:14] <_joe_>	 exclusively on your computer
[11:24:18] <hnowlan>	 ack 
[11:24:23] <_joe_>	 but you need to have helmfile and helm installed
[11:24:38] <_joe_>	 you can find a way to make it run in the helm-linter container though
[11:26:12] <_joe_>	 something like docker run --rm -v $PWD:/src:rw --entrypoint /src/helmfile.d/services/helmfile_convert_diff.sh docker-registry.wikimedia.org/releng/helm-linter:0.2.6 <service>
[11:26:53] <_joe_>	 you need to add a --user somewhere :P
[11:28:20] <_joe_>	 meh it doesn't work, it needs git 
[12:19:50] <wikibugs>	 serviceops, Scap, Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): Deploy Scap version 3.15.0-1 - https://phabricator.wikimedia.org/T261234 (LarsWirzenius)
[12:48:21] <wikibugs>	 serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 4 others: Drop PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (Reedy)
[13:07:55] <wikibugs>	 serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 4 others: Drop PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (Reedy) Open→Resolved a:Jdforrester-WMF
[13:35:18] <wikibugs>	 serviceops, MediaWiki-General, Operations, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm)
[13:42:54] <hnowlan>	 I have an apache config change that I'd like to deploy but I'd appreciate a bit of handholding. Anyone have a few minutes free? It's a pretty simple change. 
[13:45:11] <Pchelolo>	 Question, dear service ops. So I need to create a new table in the database for a merged patch. I've been told that I just go and do 'sql.php --wiki=metawiki extensions/blabla/schema.sql' - is that really it? just like that?
[13:46:13] <cdanis>	 hnowlan: I'm not an expert here but do you know of httpbb? https://wikitech.wikimedia.org/wiki/httpbb
[13:47:20] <hnowlan>	 cdanis: yep! I've written a test for the change https://gerrit.wikimedia.org/r/c/operations/puppet/+/599751/4/modules/profile/files/httpbb/appserver/test_wikimania_wikimedia.yaml 
[13:47:27] <cdanis>	 ah nice
[13:47:51] <mark>	 Pchelolo: may want to ask #wikimedia-databases as well, serviceops isn't often involved in schema changes
[13:48:06] <Pchelolo>	 good point mark
[13:48:09] <Pchelolo>	 thank you
[13:50:12] <effie>	 hnowlan: can I help ?
[13:58:11] <jayme>	 hnowlan: I would, but I fear I don't have any experience in this particular type of handholding 
[13:59:05] <hnowlan>	 effie: please! I'll dm - just a few questions 
[13:59:15] <effie>	 sure
[14:11:43] <_joe_>	 Pchelolo: I have a question for you
[14:12:15] <Pchelolo>	 _joe_: what's up. in a meeting so might disappear
[14:12:17] <_joe_>	 Pchelolo: I moved restbase-async to eqiad, so restbase should only generate purges there
[14:12:32] <_joe_>	 why do I still see restbase purges in codfw?
[14:12:41] <Pchelolo>	 not necessarily, some of the purges are generated by restbase-sync
[14:12:49] <_joe_>	 oh?
[14:12:57] <_joe_>	 ok I see
[14:13:24] <Pchelolo>	 restbase in primary DC is used for things we want updated quicker in main DC, like main parsoid HTML from an actual edit
[14:13:48] <Pchelolo>	 restbase in secondary DC is for things updated cause of template propagation, cause nobody cares about them
[14:15:17] <_joe_>	 ok I see :D
[14:15:46] <_joe_>	 also I guess stuff like mathoid
[14:17:09] <Pchelolo>	 mathoid doesn't purge anything afaik
[14:17:19] <Pchelolo>	 it's write-only, renders never change
[14:22:50] <_joe_>	 so why do I see purges for it?
[14:23:08] <_joe_>	 I mean obviously it's restbase doing it
[14:33:03] <wikibugs>	 serviceops, Release-Engineering-Team, PHP 7.2 support: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (Jdforrester-WMF)
[14:33:21] <wikibugs>	 serviceops, Release-Engineering-Team, PHP 7.2 support: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (Jdforrester-WMF)
[14:33:23] <wikibugs>	 serviceops, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services): upgrade MediaWiki appservers to Debian 10 (buster) - https://phabricator.wikimedia.org/T245757 (Jdforrester-WMF)
[14:37:04] <_joe_>	 Pchelolo: restbase produces directly to kafka, correct?
[14:42:23] <ottomata>	 Pchelolo: is there a way we can differentiate if a resource-purge message comes from restbase to kafka?
[14:42:40] <ottomata>	 i think maybe restbase hasn't correctly noticed the fact that there are now 3 partitions in that topic
[14:42:40] <_joe_>	 ottomata: yes, the tag attribute
[14:42:47] <ottomata>	 and is still only producing to partition 0
[14:42:58] <ottomata>	 hm _joe_  i don't see any tag in these messages
[14:43:10] <Pchelolo>	 no, restbase does not produce directly, it calls eventgate
[14:43:47] <_joe_>	  kafkacat -b kafka-main1001.eqiad.wmnet:9092 -t 'eqiad.resource_change' -C | jq .tags
[14:44:00] <_joe_>	 Pchelolo: uhm are you sure?
[14:44:15] <_joe_>	 I think purges are generated directly
[14:44:21] <_joe_>	 which would explain something I see
[14:44:32] <Pchelolo>	 _joe_: 100%. RB has no dependency on kafka lib
[14:44:40] <ottomata>	 oh _joe_  i was looking at resource-purge
[14:44:45] <_joe_>	 Pchelolo: also, what topic does restbase produce to?
[14:44:55] <_joe_>	 $dc.resource-purge?
[14:44:58] <Pchelolo>	 resource_change 
[14:45:05] <ottomata>	 we only added partitions for purge atm
[14:45:43] <Pchelolo>	 so, restbase goes to eventgate and produces to resource_change, and change-prop listens to resource_change from restbase and puts it to resource_purge
[14:46:03] <ottomata>	 hm and cp produces directly to kafka?
[14:46:14] <Pchelolo>	 https://github.com/wikimedia/restbase/blob/master/sys/events.js#L29-L51
[14:46:21] <Pchelolo>	 CP - yes, directly to kafka
[14:46:37] <ottomata>	 ok, i think maybe CP has not noticed the new partitions
[14:46:48] <Pchelolo>	 oh YES
[14:46:57] <Pchelolo>	 https://github.com/wikimedia/change-propagation/pull/351
[14:47:01] <Pchelolo>	 that fixes it
[14:47:08] <Pchelolo>	 but I didn't deploy that yet
[14:47:28] <ottomata>	 ahhhhh
[14:47:30] <Pchelolo>	 ouch..
[14:47:31] <ottomata>	 ok
[14:47:32] <ottomata>	 phew
[14:48:08] <Pchelolo>	 it's not crazy urgent no? I have a meeting, will deploy this in an hour or so
[14:50:11] <_joe_>	 not crazy urgent anymore, no
[14:50:24] <_joe_>	 I can wait for that to happen before I restore the old world order :P
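Note on the partition issue discussed above: a minimal Python sketch of why a producer that has not noticed new partitions keeps sending everything to partition 0. This is not change-propagation's code and not Kafka's actual partitioner; the `crc32` hash, the example keys, and the function names are all assumptions made for illustration only.

```python
import zlib

def pick_partition(key: bytes, known_partitions: int) -> int:
    """Map a message key onto one of the partitions the producer knows about.

    crc32 is only a stand-in hash here (an assumption), not Kafka's partitioner.
    """
    return zlib.crc32(key) % known_partitions

def spread(keys, known_partitions):
    """Count how many messages land on each partition."""
    counts = [0] * known_partitions
    for key in keys:
        counts[pick_partition(key, known_partitions)] += 1
    return counts

# Hypothetical purge keys, one per URL.
keys = [f"https://example.org/page/{i}".encode() for i in range(3000)]

# Producer with stale metadata: it still believes the topic has 1 partition,
# so every message goes to partition 0.
stale = spread(keys, 1)

# Producer that sees all 3 partitions: messages are split across them.
fresh = spread(keys, 3)

print("stale metadata:", stale)
print("fresh metadata:", fresh)
```

With a single known partition the modulo is always 0, which matches the symptom ottomata describes: the topic has 3 partitions but only partition 0 receives traffic until the producer learns the new partition count.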
[15:00:13] <Pchelolo>	 ottomata: deployment-charts build is failing cause of eventlogging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/623808
[15:00:33] <Pchelolo>	 s/eventlogging/eventgate-logging-external
[15:01:21] <_joe_>	 Pchelolo: that's a sad race condition, just recheck
[15:01:38] <Pchelolo>	 mmm..okiey
[15:01:43] <_joe_>	 just sent it
[15:02:07] <_joe_>	 jayme: it seems we have other failures... we need to just run helmfile for the stuff that changed, and do it serially :/
[15:03:13] <_joe_>	 Pchelolo: second time it worked
[15:03:17] <_joe_>	 :D
[15:04:15] <ottomata>	 oooook?
[15:05:26] <Pchelolo>	 yeah
[15:06:36] <Pchelolo>	 _joe_: and for change-prop I just deploy the old way cause the dir structure didn't change yet and ignore your email?
[15:06:54] <_joe_>	 correct
[15:08:11] <_joe_>	 ottomata: I'll move back all purge traffic to codfw alone once Pchelolo has deployed his changes
[15:10:01] <wikibugs>	 serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 5 others: Drop PHP 7.2 support in MediaWiki 1.35; require PHP 7.3.19 - https://phabricator.wikimedia.org/T257879 (Aklapper)
[15:13:26] <ottomata>	 _joe_: k
[15:15:31] <_joe_>	 ok, depooling eventgate-main in eqiad
[15:31:22] <jayme>	 _joe_: damn...@helmfile - I'll take another look
[15:37:48] <wikibugs>	 serviceops, Operations, Performance-Team, Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (Joe) It seems the actions taken to solve T261846 have solved this issue as well. Let's keep an eye on it but it...
[15:49:52] <_joe_>	 ottomata: since I moved back purges to codfw
[15:49:56] <_joe_>	 all of them I mean
[15:50:02] <_joe_>	 we're having trouble again
[15:50:26] <_joe_>	 ottomata: see https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?viewPanel=6&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=eventgate-main
[15:51:31] <_joe_>	 rzl: can you pick up from me on this? I'm exhausted, but I think for now we should just run with eventgate-main a/a and restbase-async in eqiad
[15:51:59] <_joe_>	 I'm going to restore that situation, there is clearly something more going on with kafka
[15:52:02] <rzl>	 sure, I'm catching up
[15:52:23] <rzl>	 I don't have a whole lot of depth on kafka but I can take a look
[15:52:55] <_joe_>	 me neither :)
[15:53:17] <_joe_>	 there are people like ottomata who know it more, you just need to keep an eye on the effects on applications
[15:53:38] <_joe_>	 do you agree with going back to a known-good situation, even if it means a couple things will run from eqiad?
[15:53:51] <rzl>	 yeah definitely
[15:54:48] <_joe_>	 ok
[15:54:50] <_joe_>	 doing so
[15:56:25] <_joe_>	 done, I'm mostly off now
[15:57:16] <rzl>	 ack, have a good evening
[16:31:35] <wikibugs>	 serviceops, Prod-Kubernetes, Release Pipeline, Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (JMeybohm)
[16:31:37] <wikibugs>	 serviceops, Patch-For-Review: Sporadic issues on helm dependency build in CI - https://phabricator.wikimedia.org/T261313 (JMeybohm) Resolved→Open Reusing this but it looks like a different problem to me: https://integration.wikimedia.org/ci/job/helm-lint/2339/console  Again no error on chartmuseu...
[16:44:39] <ottomata>	 sorry meetings etc can look again in around 1.5hrs!
[16:54:21] <wikibugs>	 serviceops, ChangeProp, Patch-For-Review, Platform Team Workboards (Clinic Duty Team): Partition the transclusions topic in ChangeProp - https://phabricator.wikimedia.org/T157649 (Pchelolo) Open→Resolved The support for partitioning was added to change-prop and used for purges. We don't r...
[18:28:45] <ottomata>	 rzl:  i'm going to add partitions to the resource_change topics as well
[18:29:27] <ottomata>	 there may indeed be something wrong with kafka-main2003, but it is also still serving more volume than the other brokers
[18:29:32] <ottomata>	 and i think this is why
[18:29:51] <ottomata>	 Pchelolo:  did you deploy the cp fix for producer partition?
[18:29:56] <Pchelolo>	 yup
[18:30:00] <ottomata>	 ok
[18:30:15] <ottomata>	 FYI the graph I want to see balanced is
[18:30:15] <ottomata>	 https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&from=1598466608448&to=1599071408449&viewPanel=19
[18:30:26] <ottomata>	 2003 is also the leader for both eqiad and codfw resource_change
[18:30:30] <ottomata>	 and it only has one partition
[19:25:22] <wikibugs>	 serviceops, DC-Ops, Operations, ops-eqiad, Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (Cmjohnson)
[19:26:09] <wikibugs>	 serviceops, DC-Ops, Operations, ops-eqiad, Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (Cmjohnson) Open→Resolved removed from rack, switch port and script update
[20:25:45] <wikibugs>	 serviceops, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (Cmjohnson)
[20:33:24] <wikibugs>	 serviceops, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1017.eqi...
[21:19:15] <wikibugs>	 serviceops, Release-Engineering-Team, PHP 7.2 support, Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (Reedy) Open→Stalled Stalled as blocked on {T245757} which is also stalled
[21:35:49] <wikibugs>	 serviceops, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1017.eqiad.wmnet'] `  and were **ALL** successful.