[00:05:21] serviceops, Operations, Growth-Team (Current Sprint), Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (MMiller_WMF) Okay, we can talk about this for next week's plan.
[05:01:48] serviceops, DBA, SRE-tools, conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (Marostegui) p:Triage→High
[06:02:10] serviceops, DBA, SRE-tools, conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (Joe) restarting all confds before switching DC seems overkill and frankly useless. We should...
[06:44:42] <_joe_> hnowlan: ping for when you're around :)
[06:46:49] <_joe_> elukey: I have a nice "starter task" for one of your new hires, too!
[06:53:43] _joe_ they are already working on some tasks (mostly related to analytics) but we can add tasks to the backlog too
[06:53:52] what do you have in mind?
[06:54:13] <_joe_> elukey: we need someone to convert the eventgate helmfile directories to the new format
[06:54:35] <_joe_> it's going to be either otto or someone else in your "team" :P
[06:54:53] <_joe_> well I think we (me and janis) already converted two of them
[06:56:24] Andrew will probably pick this up, even if I should also know how it is done (totally ignorant about the k8s part sadly)
[06:57:01] <_joe_> ohhh so we can make this a workshop for you and the others
[07:00:19] Andrew introduced helm in the context of eventgate to me a couple of times, but I do k8s things so rarely that my limited LRU brain doesn't keep anything useful :D
[07:00:28] a workshop would indeed be great
[07:31:21] elukey: I suffer from the same disease
[07:32:59] I have LRU with data corruption...
[07:39:45] <_joe_> tsk people, upgrade to a persistent plan
[07:39:58] <_joe_> although tbh remembering too much is a curse as well
[08:45:33] serviceops, MediaWiki-General, Operations, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe)
[09:22:21] _joe_: yo
[09:23:44] <_joe_> hnowlan: I would ask you to take care of converting the cpjobqueue/changeprop/api-gateway helmfiles to the new format
[09:23:59] <_joe_> it's mostly just running the script and deduplicating
[09:25:14] _joe_: sounds good, might have time for it today. is there a doc or anything?
[09:25:54] <_joe_> yes, look at the README in the helmfile.d/services directory
[09:27:47] cool
[09:29:04] <_joe_> and then, either me or jayme can review
[10:06:02] serviceops, DBA, SRE-tools, conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (jcrespo)
[10:09:35] serviceops, Patch-For-Review: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (JMeybohm) @akosiaris and me put together a more precise step by step plan about what needs to be done for the outstanding eqiad migration. Current p...
[11:23:55] is the helmfile_convert_diff.sh script designed to be run exclusively from the deploy servers?
[11:24:08] <_joe_> no
[11:24:14] <_joe_> exclusively on your computer
[11:24:18] ack
[11:24:23] <_joe_> but you need to have helmfile and helm installed
[11:24:38] <_joe_> you can find a way to make it run in the helm-linter container though
[11:26:12] <_joe_> something like docker run --rm -v $PWD:/src:rw --entrypoint /src/helmfile.d/services/helmfile_convert_diff.sh docker-registry.wikimedia.org/releng/helm-linter:0.2.6
[11:26:53] <_joe_> you need to add a --user somewhere :P
[11:28:20] <_joe_> meh it doesn't work, it needs git
[12:19:50] serviceops, Scap, Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): Deploy Scap version 3.15.0-1 - https://phabricator.wikimedia.org/T261234 (LarsWirzenius)
[12:48:21] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 4 others: Drop PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (Reedy)
[13:07:55] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 4 others: Drop PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (Reedy) Open→Resolved a:Jdforrester-WMF
[13:35:18] serviceops, MediaWiki-General, Operations, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm)
[13:42:54] I have an apache config change that I'd like to deploy but I'd appreciate a bit of handholding. Anyone have a few minutes free? It's a pretty simple change.
[13:45:11] Question, dear service ops. So I need to create a new table in the database for a merged patch. I've been told that I just go and do 'sql.php --wiki=metawiki extensions/blabla/schema.sql' - is that really it? just like that?
[13:46:13] hnowlan: I'm not an expert here but do you know of httpbb? https://wikitech.wikimedia.org/wiki/httpbb
[13:47:20] cdanis: yep! I've written a test for the change https://gerrit.wikimedia.org/r/c/operations/puppet/+/599751/4/modules/profile/files/httpbb/appserver/test_wikimania_wikimedia.yaml
[13:47:27] ah nice
[13:47:51] Pchelolo: may want to ask #wikimedia-databases as well, serviceops isn't often involved in schema changes
[13:48:06] good point mark
[13:48:09] thank you
[13:50:12] hnowlan: can I help ?
[13:58:11] hnowlan: I would, but I fear I don't have any experience in this particular type of handholding
[13:59:05] effie: please! I'll dm - just a few questions
[13:59:15] sure
[14:11:43] <_joe_> Pchelolo: I have a question for you
[14:12:15] _joe_: what's up. in a meeting so might disappear
[14:12:17] <_joe_> Pchelolo: I moved restbase-async to eqiad, so restbase should only generate purges there
[14:12:32] <_joe_> why do I still see restbase purges in codfw?
[14:12:41] not necessarily, some of the purges are generated by restbase-sync
[14:12:49] <_joe_> oh?
[14:12:57] <_joe_> ok I see
[14:13:24] restbase in primary DC is used for things we want updated quicker in main DC, like main parsoid HTML from an actual edit
[14:13:48] restbase in secondary DC is for things updated cause of template propagation, cause nobody cares about them
[14:15:17] <_joe_> ok I see :D
[14:15:46] <_joe_> also I guess stuff like mathoid
[14:17:09] mathoid doesn't purge anything afaik
[14:17:19] it's write-only, renders never change
[14:22:50] <_joe_> so why do I see purges for it?
[14:23:08] <_joe_> I mean obviously it's restbase doing it
[14:33:03] serviceops, Release-Engineering-Team, PHP 7.2 support: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (Jdforrester-WMF)
[14:33:21] serviceops, Release-Engineering-Team, PHP 7.2 support: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (Jdforrester-WMF)
[14:33:23] serviceops, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services): upgrade MediaWiki appservers to Debian 10 (buster) - https://phabricator.wikimedia.org/T245757 (Jdforrester-WMF)
[14:37:04] <_joe_> Pchelolo: restbase produced directly to kafka, correct?
[14:42:23] Pchelolo: is there a way we can differentiate if a resource-purge messages comes from restbase to kafka?
[14:42:40] i think maybe restbase hasn't correctly noticed the fact that there are now 3 partitions in that topic
[14:42:40] <_joe_> ottomata: yes, the tag attribute
[14:42:47] and is still only producing to partition 0
[14:42:58] hm _joe_ i don't see any tag in these messages
[14:43:10] no, restbase does not produce directly, it calls eventgate
[14:43:47] <_joe_> kafkacat -b kafka-main1001.eqiad.wmnet:9092 -t 'eqiad.resource_change' -C | jq .tags
[14:44:00] <_joe_> Pchelolo: uhm are you sure?
[14:44:15] <_joe_> I think purges are generated directly
[14:44:21] <_joe_> which would explain something I see
[14:44:32] _joe_: 100%. RB has no dependency on kafka lib
[14:44:40] oh _joe_ i was looking at resource-purge
[14:44:45] <_joe_> Pchelolo: also, what topic does restbase produce to?
[14:44:55] <_joe_> $dc.resource-purge?
[14:44:58] resource_change
[14:45:05] we only added partitions for purge atm
[14:45:43] so, restbase goes to eventgate for and produces to resource_change, and change-prop listens to resource_change from restbase and puts it to resource_purge
[14:46:03] hm and cp produces directly to kafka?
[14:46:14] https://github.com/wikimedia/restbase/blob/master/sys/events.js#L29-L51
[14:46:21] CP - yes, directly to kafka
[14:46:37] ok, i think maybe CP has not noticed the new partitions
[14:46:48] oh YES
[14:46:57] https://github.com/wikimedia/change-propagation/pull/351
[14:47:01] that fixes it
[14:47:08] but I didn't deploy that yet
[14:47:28] ahhhhh
[14:47:30] ouch..
[14:47:31] ok
[14:47:32] phew
[14:48:08] it's not crazy urgent no? I have a meeting, will deploy this in an hour or so
[14:50:11] <_joe_> not crazy urgent anymore, no
[14:50:24] <_joe_> I can wait for that to happen before I restore the old world order :P
[15:00:13] ottomata: deployment-charts build is failing cause of eventlogging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/623808
[15:00:33] s/eventlogging/eventgate-logging-external
[15:01:21] <_joe_> Pchelolo: that's a sad race condition, just recheck
[15:01:38] mmm..okiey
[15:01:43] <_joe_> just sent it
[15:02:07] <_joe_> jayme: it seems we have other failures... we need to just run helmfile for the stuff that changed, and do it serially :/
[15:03:13] <_joe_> Pchelolo: second time it worked
[15:03:17] <_joe_> :D
[15:04:15] oooook?
[15:05:26] yeah
[15:06:36] _joe_: and for change-prop I just deploy the old way cause the dir structure didn't change yet and ignore your email?
[15:06:54] <_joe_> correct
[15:08:11] <_joe_> ottomata: I'll move back all purge traffic to codfw alone once Pchelolo has deployed his changes
[15:10:01] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 5 others: Drop PHP 7.2 support in MediaWiki 1.35; require PHP 7.3.19 - https://phabricator.wikimedia.org/T257879 (Aklapper)
[15:13:26] _joe_: k
[15:15:31] <_joe_> ok, depooling eventgate-main in eqiad
[15:31:22] _joe_: damn...@helmfile - I'll take another look
[15:37:48] serviceops, Operations, Performance-Team, Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (Joe) It seems the actions taken to solve T261846 have solved this issue as well. Let's keep an eye on it but it...
[15:49:52] <_joe_> ottomata: since I moved back purges to codfw
[15:49:56] <_joe_> all of them I mean
[15:50:02] <_joe_> we're having trouble again
[15:50:26] <_joe_> ottomata: see https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?viewPanel=6&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=eventgate-main
[15:51:31] <_joe_> rzl: can you pick up from me on this? I'm exhausted, but I think for now we should just run with eventgate-main a/a and restbase-async in eqiad
[15:51:59] <_joe_> I'm going to restore that situation, there is clearly something more going on with kafka
[15:52:02] sure, I'm catching up
[15:52:23] I don't have a whole lot of depth on kafka but I can take a look
[15:52:55] <_joe_> me neither :)
[15:53:17] <_joe_> there are people like ottomata who know it more, you just need to keep an eye on the effects on applications
[15:53:38] <_joe_> do you agree with going back to a known-good situation, even if it means a couple things will run from eqiad?
[15:53:51] yeah definitely
[15:54:48] <_joe_> ok
[15:54:50] <_joe_> doing so
[15:56:25] <_joe_> done, I'm mostly off now
[15:57:16] ack, have a good evening
[16:31:35] serviceops, Prod-Kubernetes, Release Pipeline, Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (JMeybohm)
[16:31:37] serviceops, Patch-For-Review: Sporadic issues on helm dependency build in CI - https://phabricator.wikimedia.org/T261313 (JMeybohm) Resolved→Open Reusing this but it looks like a different problem to me: https://integration.wikimedia.org/ci/job/helm-lint/2339/console Again no error on chartmuseu...
[16:44:39] sorry meetings etc can look again in around 1.5hrs!
[16:54:21] serviceops, ChangeProp, Patch-For-Review, Platform Team Workboards (Clinic Duty Team): Partition the transclusions topic in ChangeProp - https://phabricator.wikimedia.org/T157649 (Pchelolo) Open→Resolved The support for partitioning was added to change-prop and used for purges. We don't r...
[18:28:45] rzl: i'm going to add partitions to the resource_change topics as well
[18:29:27] there may indeed be something wrong with kafka-main2003, but it is also still serving more volume than the other brokers
[18:29:32] and i think this is why
[18:29:51] Pchelolo: did you deploy the cp fix for producer partition?
[18:29:56] yup
[18:30:00] ok
[18:30:15] FYI the graph I want to see balanced is
[18:30:15] https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&from=1598466608448&to=1599071408449&viewPanel=19
[18:30:26] 2003 is also the leader for both eqiad and codfw resource_change
[18:30:30] and it only has one partition
[19:25:22] serviceops, DC-Ops, Operations, ops-eqiad, Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (Cmjohnson)
[19:26:09] serviceops, DC-Ops, Operations, ops-eqiad, Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (Cmjohnson) Open→Resolved removed from rack, switch port and script update
[20:25:45] serviceops, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (Cmjohnson)
[20:33:24] serviceops, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1017.eqi...
[21:19:15] serviceops, Release-Engineering-Team, PHP 7.2 support, Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (Reedy) Open→Stalled Stalled as blocked on {T245757} which is also stalled
[21:35:49] serviceops, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1017.eqiad.wmnet'] ` and were **ALL** successful.
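
Editor's note on the partition discussion (14:42 to 14:47 and 18:28 to 18:30): the log suggests a producer that had not noticed a topic's new partitions kept sending everything to partition 0, which would leave one broker serving most of the volume. The following is a minimal sketch of that effect only. It is not the change-propagation code or Kafka's real partitioner; the CRC32 hash, the example keys and the function names are all assumptions made up for illustration. Only the partition count of three comes from the log.

```python
import zlib
from collections import Counter


def pick_partition(key: str, known_partitions: int) -> int:
    """Hypothetical partitioner: stable hash of the key modulo the
    number of partitions the producer believes the topic has."""
    return zlib.crc32(key.encode("utf-8")) % known_partitions


def load_per_partition(keys, known_partitions: int, actual_partitions: int):
    """Count messages per partition, padded to the topic's real size."""
    counts = Counter(pick_partition(k, known_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(actual_partitions)]


# Made-up message keys; real purge events would carry their own keys.
keys = [f"https://example.org/page/{i}" for i in range(3000)]

# Producer with a stale view (one partition) of a three-partition topic.
stale = load_per_partition(keys, known_partitions=1, actual_partitions=3)

# Producer that has picked up all three partitions.
fresh = load_per_partition(keys, known_partitions=3, actual_partitions=3)

print("stale view:", stale)
print("fresh view:", fresh)
```

With the stale view every message lands on partition 0, so whichever broker leads that partition takes all the traffic; with the refreshed view the same keys spread over all three partitions.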