[13:02:55] godog: q for ya about dashboarding
[13:03:01] i'm working on https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate-analytics-otto0?refresh=1m&orgId=1&from=now-1h&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_producer_type=All&var-kafka_broker=All&var-kafka_topic=All
[13:03:07] at the bottom i've got kafka stuff
[13:03:14] do you think I should leave those there
[13:03:15] or
[13:03:26] should I incorporate them into the above golden signal panels?
[13:03:35] i think each graph would fit into one of the above panels somehow
[13:03:38] e.g. rtt into latency
[13:03:53] tx data rate into traffic, etc.
[13:08:41] ottomata: I think stats in the top panels should include just a few key metrics, if there are kafka metrics that fit the description they should be there yeah
[13:08:58] the bottom rows IMHO can contain more detail/drilldown
[13:10:22] hm ok, so maybe i'll pick a single relevant kafka one for each panel and see how that goes, and leave the rest at the bottom
[13:10:23] anke
[13:10:24] danke
[13:12:39] np!
[13:19:37] akosiaris: should I be removing old chart version packages/indexes we won't be using?
[13:27:46] ottomata: I see no reason yet.
Plus we can always roll back to old versions
[13:27:57] But if it becomes too much of a hassle, maybe yes
[13:29:15] k, just have 14 versions of eventgate-analytics now :)
[14:14:55] serviceops, ChangeProp, Release Pipeline, Patch-For-Review, and 2 others: Migrate changeprop to kubernetes - https://phabricator.wikimedia.org/T213193 (Pchelolo)
[14:14:59] serviceops, ChangeProp, Release Pipeline, Core Platform Team Kanban (Done with CPT), Services (done): Make change-prop tests independent of Kafka and Redis - https://phabricator.wikimedia.org/T218396 (Pchelolo) Open→Resolved The PR has been merged, resolving
[14:16:13] serviceops, ChangeProp, Release Pipeline, Patch-For-Review, and 2 others: Migrate changeprop to kubernetes - https://phabricator.wikimedia.org/T213193 (Pchelolo)
[14:16:17] serviceops, ChangeProp, Release Pipeline, Core Platform Team Kanban (Done with CPT), Services (done): Make change-prop tests independent of Kafka and Redis - https://phabricator.wikimedia.org/T218396 (Pchelolo) Resolved→Open Oh, no, not resolving yet. Next step - mock redis.
[14:31:59] godog: did something change with prometheus servers?
[14:33:25] ottomata: how recently? as of this morning there's only one prometheus server serving traffic in eqiad as opposed to two normally
[14:33:30] I'm migrating prometheus1003 to prometheus 2
[14:35:49] i've just noticed that there was a new ssh key, and that my dashboards don't have any data
[14:36:08] prometheus1003 seems to work fine via its web ui
[14:36:26] but the eqiad/prometheus/k8s* don't have any of my kafka data.
[14:36:30] they do have the service runner stuff...
[14:36:31] yeah please use 1004 if you are accessing prometheus directly
[14:36:35] ok
[14:38:41] oook and now i have data...
[14:38:56] well sort of...in staging i do
[14:39:04] maybe there is some lag in grafana datasource failover?
[14:39:31] AH, no i don't... are some prometheus queries failing?
[14:40:19] yeah there are stricter timeouts in place now because of slow queries
[14:42:01] https://phabricator.wikimedia.org/T217715 that is
[14:43:05] it seems a little inconsistent tho
[14:47:37] weird.
[19:26:20] serviceops, Operations, Core Platform Team (Session Management Service (CDP2)), Core Platform Team Kanban (Doing), and 4 others: Session storage Cassandra cluster configuration - https://phabricator.wikimedia.org/T215883 (Eevans)
[20:32:11] serviceops, Analytics, Analytics-Kanban, EventBus, and 3 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (Ottomata) a:Ottomata