[05:07:05] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (Marostegui) This server crashed again: ` ------------------------------------------------------------------------------- Record: 2 Date/Time: 04/13/2019 12:33:55 Source: system Severity: Crit...
[06:20:29] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (Joe) I'm setting it to inactive until we know how the request to Dell goes. @Cmjohnson let us know when you know more.
[06:23:07] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (Joe) On second thought, this is an API server, of which we have just a few right now. I'll avoid depooling it if not strictly necessary.
[14:23:53] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (Cmjohnson) @joe Good news is I have already ordered the DIMM from the previous failure and it's on-site. I can do this tomorrow afternoon (my time) if you can depool it then.
[14:29:24] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (CDanis) @Cmjohnson I'm on US East time and can handle the depool. Give me a ping when you're ready.
[14:30:28] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (jijiki) @cdanis tx!
[14:55:31] i'm looking at capex for next year
[14:55:38] looks like we have about 160 mediawiki servers to refresh
[14:59:06] akosiaris: this worker heartbeat pause shouldn't be affected by other pods, correct?
[14:59:33] ?
[14:59:41] just want to rule that out: any CPU work for the requested resources for the eventgate pod would only be stuff inside the pod, right?
[15:00:12] i mean, some load on some other service or pod on the same host node wouldn't affect this one?
[15:00:33] oh it would, but only if the host is under heavy load
[15:00:40] aye
[15:00:53] which, judging from grafana, hasn't happened
[15:01:11] yah, and this does seem to happen under no real load
[15:01:20] https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=87&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&var-instance=All
[15:01:23] but i can pretty consistently reproduce it under heavy load
[15:01:31] so, in strace when it happens
[15:02:03] strace seems to itself even get blocked... it doesn't output anything for a while. it eventually does, with correct timestamps from calls that happened seconds ago
[15:02:09] but the output is delayed
[15:02:38] there are big time gaps between the syscalls when this happens
[15:02:48] so something is blocking execution somehow
[15:02:50] but i can't tell what
[15:03:03] sounds like something being stuck in busy CPU loops
[15:03:37] when you say reproduce, is it live traffic? or artificial, like generated by some tool you have/wrote?
[15:04:50] ab
[15:04:52] artificial
[15:04:57] can repro in staging
[15:05:10] ah, that's interesting. So some specific request is causing this
[15:05:22] well, maybe?
[15:05:46] the requests are the same every time. it's especially strange that the worker misses heartbeats even when not under load
[15:05:53] https://phabricator.wikimedia.org/T220661#5111534
[15:06:53] the only requests going at that time are the readinessProbe ones
[15:07:41] hm, i'm going to turn kafka debug logging back on, maybe there will be something interesting there.
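For context on the heartbeats discussed above, here is a minimal worker-side sketch, not service-runner's actual code: the interval value and function names are assumptions. It shows why a blocked Node.js event loop delays heartbeats, matching the strace gaps described in the conversation, since the heartbeat is just a periodic IPC message sent from a timer callback.

```js
// worker-heartbeat-sketch.js -- hypothetical worker-side heartbeat, modelled on
// the strace output later in this log (write(3, "{\"type\":\"heartbeat\"}\n")).
// The 7.5s interval is an assumption for illustration only.

const HEARTBEAT_INTERVAL_MS = 7500;

function startHeartbeat() {
    // process.send() only exists when this process was forked by cluster/child_process.
    if (!process.send) {
        return;
    }
    setInterval(() => {
        // If the event loop is blocked (busy CPU loop, long synchronous call),
        // this callback simply does not run, so the master sees no heartbeat
        // even though the process is alive.
        process.send({ type: 'heartbeat' });
    }, HEARTBEAT_INTERVAL_MS);
}

startHeartbeat();
```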
[15:08:07] heh, a worker that does not respond to the master even under 0 load?
[15:08:16] right
[15:08:23] that sounds super fishy
[15:08:26] yeah
[15:08:34] i haven't caught an unloaded strace yet
[15:08:55] a worker kill under no load in strace*
[15:13:32] ottomata: re https://logstash.wikimedia.org/goto/19cdf6fea0ad77d51ed9e19343d35ef2, what should I be looking at? The timestamps are from 2 distinct periods (14:27 and 06:49) and I can see the master being shut down, nothing about workers failing to send a heartbeat
[15:13:40] or do we just infer that this is what happened?
[15:13:57] and that "master shutting down, killing workers" is just the livenessProbe?
[15:14:30] akosiaris: i see "worker stopped sending heartbeats, killing" at 06:49
[15:14:45] * akosiaris pebkac
[15:14:51] 06:49:20
[15:14:51] I missed that
[15:18:20] ottomata: sudo KUBECONFIG=/etc/kubernetes/admin-staging.config kubectl get events -n eventgate-analytics
[15:18:32] 3m27s Normal Killing Pod Killing container with id docker://staging-metrics-exporter:Need to kill Pod
[15:18:32] 3m26s Warning Unhealthy Pod Readiness probe failed:
[15:18:32] 3m26s Normal Killing Pod Killing container with id docker://eventgate-analytics-staging:Need to kill Pod
[15:20:25] that could be related to the "with nobody messing with it" part
[15:21:03] hm, i'm messing with it right now :p
[15:26:27] <_joe_> mark: we also need to buy more mw servers, given we're importing the load of parsoid
[15:29:07] i know, still 1 min left
[15:29:08] well
[15:29:10] but im alone on the VC
[15:29:11] wtp refresh is also in there
[15:29:15] so we can probably fold those right in
[15:29:15] :)
[15:31:50] <_joe_> akosiaris: meeting!
[17:33:28] Hello, glad to find you all!! :)))
[17:35:45] if you need to decom a host: the other day i used the latest method, which is even newer than wmf-decom-script; it's now "sre.hosts.decommission". added an example at https://wikitech.wikimedia.org/wiki/Decom_script
[17:36:03] thanks to v.olans
[18:08:37] Pchelolo: i'm seeing a SIGKILL shortly after the missed heartbeat
[18:08:51] ottomata: how shortly?
[18:09:19] in the most recent run, 15 seconds later
[18:09:26] 18:05:44.857489 write(3, "{\"type\":\"heartbeat\"}\n
[18:09:35] ottomata: https://github.com/wikimedia/service-runner/blob/master/lib/master.js#L138-L153
[18:09:40] then [2019-04-15T18:05:59.183Z] ERROR: ... worker stopped sending heartbeats, killing.
[18:09:40] that's the relevant code
[18:09:46] that kills things
[18:09:52] [2019-04-15T18:05:59.204Z] ERROR: worker died, restarting
[18:09:55] I think we could add more logging in there
[18:10:04] 18:05:59.087524 +++ killed by SIGKILL +++
[18:10:28] it 'should' be a rare thing that heartbeats are missed, I think much more logging could go into that piece of code
[18:11:42] oh, it does still send a SIGKILL
[18:11:43] on disconnect.
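A rough master-side sketch of the behaviour described above: if a worker has not sent a heartbeat within a timeout, the master logs and kills it, then reforks. The real logic lives in service-runner's lib/master.js linked at 18:09:35; the timeout and check-interval values, the Map bookkeeping, and the console logging here are assumptions for illustration.

```js
// master-heartbeat-check-sketch.js -- hypothetical approximation of the
// "worker stopped sending heartbeats, killing" / "worker died, restarting"
// flow seen in the log excerpts above. Not the actual service-runner code.

const cluster = require('cluster');

const HEARTBEAT_TIMEOUT_MS = 15000; // assumed value
const CHECK_INTERVAL_MS = 5000;     // assumed value
const lastBeat = new Map();         // worker.id -> timestamp of last heartbeat

if (cluster.isMaster) {
    // Record each heartbeat IPC message sent by a worker.
    cluster.on('message', (worker, msg) => {
        if (msg && msg.type === 'heartbeat') {
            lastBeat.set(worker.id, Date.now());
        }
    });

    // Replace workers that exit, as the master does in the logs.
    cluster.on('exit', (worker) => {
        console.error('worker died, restarting');
        lastBeat.delete(worker.id);
        cluster.fork();
    });

    // Periodically check how long ago each worker was last heard from.
    setInterval(() => {
        const now = Date.now();
        for (const id of Object.keys(cluster.workers)) {
            const worker = cluster.workers[id];
            const seen = lastBeat.get(worker.id) || now;
            if (now - seen > HEARTBEAT_TIMEOUT_MS) {
                console.error('worker stopped sending heartbeats, killing.');
                worker.process.kill('SIGKILL');
            }
        }
    }, CHECK_INTERVAL_MS);

    cluster.fork();
}
```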
[18:12:04] and then 1 minute later if disconnect never happens
[18:12:08] aye ok
[18:12:20] thought it would do something different than sigkill for a graceful shutdown
[18:12:32] ok
[18:12:37] graceful shutdown is calling 'disconnect'
[18:13:19] when we call it, the master stops assigning any new connections to the worker; the worker is supposed to finish up its existing requests, do its cleanup work, and then emit 'disconnect'
[18:13:49] technically, if everything is going well, that 'kill' upon 'disconnect' should not be required
[18:14:18] but we are not great at closing all the timers and everything else that could keep the process alive, so we sigkill just to make sure
[18:14:46] ok, well at least in this case the worker is responding to the shutdown from the master
[18:39:30] Pchelolo: why do we set a large UV_THREADPOOL_SIZE for eventstreams and change-prop?
[18:46:05] ottomata: each consumer uses up 1 UV thread for message dispatching
[18:46:33] ok, so probably not relevant for a producer?
[18:46:57] ottomata: a producer does use up 1 thread too
[18:47:17] but in eventgate we have 2 producer instances, right?
[18:47:23] by default there's 4 threads
[18:47:36] so we should be good
[18:48:02] it's good to have some free threads, cause DNS resolution for example uses UV threads too
[18:48:39] aye, and these are only the threads that libuv uses, right? not the librdkafka ones?
[18:48:49] i believe librdkafka uses a bunch for broker communication
[18:49:19] ottomata: librdkafka spins up its own threads
[18:49:40] libuv threads are used for dispatching async tasks
[18:50:00] aye
[18:50:31] for the producer I believe that's actually not really used at all
[18:51:22] yup, not used.
[18:52:00] for the consumer it uses up a UV thread because you need to block for 'consume'
[18:52:04] aye
[20:00:53] Pchelolo: is there a way to know how many requests a master is waiting to hand off to the worker?
[20:20:22] ottomata: not that I know of
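A small sketch of the thread-budget arithmetic above, assuming (per the conversation) that each blocking Kafka consumer holds one libuv thread, a producer effectively holds none, DNS lookups also draw from the same pool, and the default pool size is 4. The consumer count and headroom numbers are made up for illustration; UV_THREADPOOL_SIZE must be set in the environment before the process starts, e.g. in the container spec.

```js
// uv-threadpool-size-sketch.js -- hypothetical sizing helper for the libuv
// threadpool of a node-rdkafka based service. Numbers are illustrative only.

const NUM_BLOCKING_CONSUMERS = 8; // hypothetical: one blocking consume() per consumer instance
const PRODUCER_THREADS = 0;       // per the chat, producers don't really hold a UV thread
const HEADROOM = 4;               // spare threads for dns.lookup(), fs, crypto, etc.

const suggested = NUM_BLOCKING_CONSUMERS + PRODUCER_THREADS + HEADROOM;

// A consumer-heavy service (eventstreams, change-prop) would be launched with
// something like: UV_THREADPOOL_SIZE=12 node server.js
console.log(`current UV_THREADPOOL_SIZE: ${process.env.UV_THREADPOOL_SIZE || '4 (default)'}`);
console.log(`suggested UV_THREADPOOL_SIZE for a consumer-heavy service: ${suggested}`);

// A producer-only service like eventgate typically gets by on the default of 4.
```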