[06:19:45] serviceops, Operations: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards - https://phabricator.wikimedia.org/T251378 (elukey) >>! In T251378#6111482, @Joe wrote: > @elukey let's schedule this test for 6:00Z on monday, May 11th? +1
[10:11:28] serviceops, Operations, Traffic, Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Dzahn) >>! In T251726#6109023, @Vgutierrez wrote: > the icinga check on cp hosts currently warns 30 days before and goes critical 15 days before cert...
[10:20:25] serviceops, ChangeProp, Operations, Release Pipeline, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (hnowlan) a:holger.knust→hnowlan
[10:38:56] serviceops, Operations, Kubernetes, Patch-For-Review: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm) a:Joe→JMeybohm
[11:29:26] serviceops, Operations, ops-eqiad, Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (Cmjohnson)
[12:32:14] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (Cmjohnson)
[13:55:39] serviceops, Operations, Traffic, Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Vgutierrez) if every LE certificate checked by that icinga check it's issued by acme-chief then yes, it's good
[14:24:43] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1007.eqiad.wmnet ` The log can be fou...
[14:25:28] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1008.eqiad.wmnet ` The log can be fou...
[14:26:14] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1009.eqiad.wmnet ` The log can be fou...
[14:35:32] these days I am trying out https://github.com/kubernetes-sigs/kind. It's pretty awesome if you already have a linux desktop/laptop with docker. You can get a set of kubernetes clusters pretty quickly. A few things might be a little weird, like nodeports essentially being exposed on the container IP and not the host IP, but for all the simple usages I've come up with so far, it's pretty awesome
[14:35:56] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1010.eqiad.wmnet ` The log can be fou...
[14:40:02] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1011.eqiad.wmnet ` The log can be fou...
[14:42:09] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1009.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kubernetes1009.eqiad.wmnet'] `
[14:43:22] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1009.eqiad.wmnet ` The log can be fou...
[14:47:52] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1008.eqiad.wmnet'] ` and were **ALL** successful.
[14:56:05] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1010.eqiad.wmnet'] ` and were **ALL** successful.
[15:00:13] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1011.eqiad.wmnet'] ` and were **ALL** successful.
[15:02:25] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1009.eqiad.wmnet'] ` and were **ALL** successful.
[15:02:54] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1012.eqiad.wmnet ` The log can be fou...
[15:03:11] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1013.eqiad.wmnet ` The log can be fou...
[15:03:35] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1014.eqiad.wmnet ` The log can be fou...
[15:13:39] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (Cmjohnson)
[15:21:59] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1012.eqiad.wmnet'] ` and were **ALL** successful.
[15:26:08] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1013.eqiad.wmnet'] ` and were **ALL** successful.
[15:32:01] _joe_: hnowlan just finished deploying a new rule to k8s changeprop
[15:32:10] so now we have a resource-purge topic
[15:32:18] which contains all purges and only purges
[15:32:45] both eqiad. and codfw.
[15:32:46] <_joe_> Pchelolo: great, next week I hope to deploy the new version of purged
[15:32:54] <_joe_> so we can consume that
[15:33:10] <_joe_> I did write the last CR that I think was needed today
[15:33:13] _joe_: we are thinking that today we switch old changeprop on scb to consume from the topic
[15:33:29] and transform to htcp
[15:33:39] <_joe_> yeah hnowlan told me that was your plan, and I think it's ok for now as a bridge
[15:33:51] <_joe_> btw, today I was looking at messages with kafkacat
[15:34:08] <_joe_> and we are sending purges for transclusions that happened on march 28th :D
[15:34:12] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1014.eqiad.wmnet ` The log can be fou...
[15:34:16] <_joe_> those will all be dropped by purged now
[15:34:37] <_joe_> we will only purge events that *originated* less than 2x the cache TTL ago
[15:35:12] mm.. I am going to file a task about followups with the purge events, will add that as a part of it
[15:35:31] we were dropping really old stuff, but maybe not all of it
[15:36:09] right now the event is a bit bulky - it has dt, id, request_id.. but neither of those are actually related to anything
[15:36:13] I'd need to change that
[15:36:35] <_joe_> Pchelolo: wait, dt isn't linked to when the event originated?
[15:37:03] <_joe_> also it's ok to keep the other fields, I can parse 110k messages/sec with one purged on my computer
[15:37:04] _joe_: right now it is when the 'purge' originated, not when the original thing that caused the purge originated
[15:37:19] <_joe_> Pchelolo: that's in the root_event dt I think?
[15:37:28] <_joe_> at least for transcludes
[15:37:30] that's for transclusions
[15:37:33] <_joe_> yes
[15:37:50] <_joe_> those are purge requests as well, right?
[15:37:54] yeah
[15:38:06] <_joe_> Pchelolo: so my logic right now is
[15:38:11] <_joe_> lemme show you the code
[15:39:44] <_joe_> https://gerrit.wikimedia.org/r/c/operations/software/purged/+/594147/4/event.go#99
[15:39:56] <_joe_> so if we have root_event, we use the dt of it
[15:40:03] <_joe_> else we use the dt of meta
[15:41:03] ok. gotcha. We'll need some code changes to make it more 'real' dt, I'll file a ticket. I will make the timestamps and req_id better before you are ready to get purged going
[15:42:02] this is exciting. we've been talking about that for years and now it's finally going! I'm very excited.
[15:57:07] serviceops, ChangeProp, Core Platform Team Workboards (Clinic Duty Team): Improve resource-purge request_id and dt propagation - https://phabricator.wikimedia.org/T252127 (Pchelolo)
[16:00:13] <_joe_> Pchelolo: the only thing I want to be sure about is
[16:00:31] <_joe_> the dt I see now is equal to or newer than the dt of the event that caused the purge request
[16:01:20] gotcha.
[16:01:59] <_joe_> Pchelolo: that is the case now, right?
[16:02:11] <_joe_> the dt is when the thing changed in restbase, AIUI
[16:02:25] <_joe_> and that's exactly what it should be.
[16:02:33] right at this moment 'dt' is when the purge event was created.
[16:02:40] <_joe_> so that's ok
[16:03:08] so that still needs fixing. right now you'd not be able to reject any purges by timestamp
[16:03:09] <_joe_> it should be the time at which the content changed in restbase
[16:03:14] yup.
[16:03:49] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1014.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kubernetes1014.eqiad.wmnet'] `
[16:03:54] serviceops, ChangeProp, Core Platform Team Workboards (Clinic Duty Team): Improve resource-purge request_id and dt propagation - https://phabricator.wikimedia.org/T252127 (Pchelolo)
[16:03:55] updated the task
[16:04:15] <_joe_> Pchelolo: did you have the time to look at MediaWiki too?
[16:04:24] not yet unfortunately
[16:04:39] we have a workshop now that eats up a bunch of time
[16:04:47] <_joe_> that's ok, I might have some time tomorrow
[16:05:03] <_joe_> friday is the perfect day for a dive into mediawiki :)
[16:07:17] serviceops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (Cmjohnson) All but 1014 have been imaged, I think I have a bad network cable for 1014. I have scheduled a quick trip to the data center this afternoon to t...