[07:23:55] serviceops, Analytics, ChangeProp, Community-Tech, and 6 others: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (Joe)
[07:31:42] serviceops, Analytics, ChangeProp, Community-Tech, and 6 others: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (Joe) I'm a bit conflicted about this, and let me clarify why: in all the use-cases referenced above the use...
[10:54:36] <_joe_> my mac crashed
[10:59:53] serviceops, ChangeProp, Release Pipeline, Patch-For-Review, and 2 others: Migrate changeprop to kubernetes - https://phabricator.wikimedia.org/T213193 (Pchelolo)
[10:59:57] serviceops, ChangeProp, Release Pipeline, Core Platform Team Kanban (Done with CPT), Services (done): Make change-prop tests independent of Kafka and Redis - https://phabricator.wikimedia.org/T218396 (Pchelolo) Open→Resolved Now it's ready - CP tests are independent of both Kafka and...
[12:03:21] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (jijiki) @Papaul then the issue is somewhere else: ` [Thu Mar 21 11:02:31 2019] mce: [Hardware Error]: Machine check events logged [Thu Mar 21 11:02:31 2019] EDAC sbridge...
[12:05:21] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (MoritzMuehlenhoff) It could be simply a broken CPU? If we have such a CPU type in a decom host, we could loot it from there.
[12:58:44] serviceops, Analytics, ChangeProp, Community-Tech, and 6 others: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (Pchelolo) There is already an ability to execute jobs after a delay or at more-or-less specific time, but it'...
[13:56:13] godog: o/
[13:56:14] qq
[13:56:28] the librdkafka stats are emitted internally in eventgate once every 30 seconds
[13:56:38] and i suppose prometheus is configured to scrape every 60 seconds
[13:56:52] so i think e.g. the windows are calculated every 30 seconds (not sure about that)
[13:57:08] for some other things. like message rates
[13:57:22] i'm doing a sum(rate ... ))
[13:57:38] with rate [5m]
[13:57:53] seems like it works, but i think the metrics are pretty slow to update. maybe it's ok?
[13:58:00] i guess i'm just looking for best practice here
[13:58:09] if you have any thoughts.
[14:03:41] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (CDanis) The memory address is the same in all of these error reports. That suggests to me that one of the DIMMs has a 'stuck' bit and that it is unlikely to be a CPU issu...
[14:21:59] ottomata: hi! not sure what you mean by metrics are slow to update, like the query is slow?
[14:22:07] does a parsoid::testing server need to match the PHP version on the prod parsoid servers (7.0) (assumed that) or the one on appservers (7.2) per https://phabricator.wikimedia.org/T216102#4954452 ?
[14:22:17] my assumption was that i match wtp*
[14:22:28] no sorry
[14:22:55] actually yes.
[14:23:02] i have two questions, but let me ask the second first
[14:23:08] because i can't provide an example
[14:23:08] !
[14:23:17] godog: when you go here
[14:23:17] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?refresh=1m&orgId=1&from=1553174590868&to=1553178190868&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All
[14:23:30] (and switch to k8s-staging if not already selected)
[14:23:39] do you see anything under kafka messages rate by topic?
[14:23:42] or just No Data points?
[14:24:49] i see no data
[14:24:50] and yet
[14:24:54] http://localhost:8000/k8s-staging/graph?g0.range_input=1h&g0.expr=sum(rate(eventgate_rdkafka_producer_topic_partition_txmsgs%5B5m%5D))+by+(producer_type%2C+topic)&g0.tab=0
[14:25:01] has what I expect
[14:25:18] i can usually get grafana to display the kafka data IF I click around a lot and change things,
[14:25:23] switch time frame to last 24 hours
[14:25:29] flip between datasources
[14:25:34] then eventually it will show something
[14:27:20] ah ok, sorry I don't have time to dig into that right now ottomata
[14:27:23] the query inspector shows that it is executing correctly
[14:27:27] but just has no results
[14:28:17] ok godog
[14:28:50] i can verify that
[14:28:53] curl 'localhost:8000/k8s-staging/api/v1/query_range?query=sum(rate(eventgate_rdkafka_producer_topic_partition_txmsgs%7Bservice%3D%22eventgate-analytics%22%2Cproducer_type%3D~%22%22%2Ctopic%3D~%22staging_test_event%22%7D%5B5m%5D))%20by%20(producer_type%2Ctopic)&start=1553174820&end=1553178435&step=15'
[14:28:56] doesn't return any data
[14:28:58] v strange
[14:38:39] akosiaris: yt?
[14:38:43] ?
[14:38:51] i think i can reproduce this eventgate/k8s/kafka problem in staging.
[14:39:02] which problem?
[14:39:03] but i'm not really sure how to troubleshoot it
[14:39:34] * akosiaris reading backlog
[14:39:36] well right now, what happens when a new pod is spawned
[14:39:47] akosiaris: backlog won't help unless in -services :)
[14:39:56] so
[14:39:59] new pod gets spawned
[14:40:07] and the service connects to kafka and says it is ready
[14:40:19] but, the first time I try to POST to it
[14:40:24] it blocks, and does not produce to Kafka for a LONG time.
[14:40:29] > 60 seconds
[14:40:33] or more
[14:40:37] before anything else
[14:40:42] the http client eventually times out
[14:40:44] do we continue this here or in #-services?
[14:40:48] and EVENTUALLY the messages are produced
[14:40:49] oh
[14:41:00] the same people are in both!
[14:41:04] i guess services?
[14:42:47] ok
[15:44:16] serviceops, Operations, Wikidata, Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (Addshore) >>! In T212189#5020187, @akosiaris wrote: > > Thanks for the understanding. We are drafting next quarter goals this week, I 'll...
[15:47:54] serviceops, Operations, Wikidata, Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (akosiaris) >>! In T212189#5044451, @Addshore wrote: >>>! In T212189#5020187, @akosiaris wrote: >> >> Thanks for the understanding. We are...
[16:07:47] serviceops, Analytics, ChangeProp, Community-Tech, and 6 others: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (Mooeypoo) >>! In T218812#5042813, @Joe wrote: > @aezell is there something I'm missing? Wouldn't a scheduled...
[16:20:03] serviceops, Analytics, ChangeProp, Community-Tech, and 6 others: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (aezell) >>! In T218812#5042813, @Joe wrote: > - TTL-based expiry of records I agree. This is the "work aroun...
[22:59:39] serviceops, Analytics, Analytics-Kanban, EventBus, and 3 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (Ottomata) I don't know much more, but I have a lot more data! Here is a staging pod with trace logging enabled reproducin...
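
Note on the 14:28:53 query_range curl: a minimal sketch, using only the Python standard library, that URL-decodes the request from the log so the PromQL can be read directly. The helper names (decode_params, empty_regex_matchers) are made up for illustration and are not part of any tool mentioned above. Decoding shows a producer_type=~"" matcher; since Prometheus regex matchers are fully anchored, that selects only series whose producer_type label is empty or absent. That might be worth checking as a reason for the empty result, though nothing in the log confirms it.

```python
import re
from urllib.parse import parse_qs

# URL copied verbatim from the 14:28:53 curl command in the log.
LOGGED_URL = (
    "localhost:8000/k8s-staging/api/v1/query_range"
    "?query=sum(rate(eventgate_rdkafka_producer_topic_partition_txmsgs"
    "%7Bservice%3D%22eventgate-analytics%22%2Cproducer_type%3D~%22%22"
    "%2Ctopic%3D~%22staging_test_event%22%7D%5B5m%5D))"
    "%20by%20(producer_type%2Ctopic)"
    "&start=1553174820&end=1553178435&step=15"
)


def decode_params(url):
    """Return the URL's query-string parameters, percent-decoded."""
    # Hypothetical helper: split off everything after the first '?'.
    _, _, query_string = url.partition("?")
    return {key: values[0] for key, values in parse_qs(query_string).items()}


def empty_regex_matchers(promql):
    """List label names that are matched against an empty regex (label=~"")."""
    # Hypothetical helper: a plain text scan, not a real PromQL parser.
    return re.findall(r'(\w+)=~""', promql)


params = decode_params(LOGGED_URL)
promql = params["query"]
window_seconds = int(params["end"]) - int(params["start"])

print(promql)
print("empty regex matchers:", empty_regex_matchers(promql))
print("range covered (s):", window_seconds, "step (s):", params["step"])
```

Running the same expression with the producer_type matcher removed, or set to =~".*", would be one way to test whether that matcher is what filters out every series.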