streaming: frequent timeout in end-to-end tests #6617
Labels: component/streaming, priority/critical, type/bug
Recently we have encountered frequent timeouts in end-to-end tests on CI. Here are some samples from the main workflow. I'll investigate it with async-stack-trace.

Possibly related:
Investigation
I've run all e2e tests locally for hours (😄) and finally reproduced the hang. Here's the async stack trace: async-stack-trace.txt
Env: `risedev ci-start ci-3cn-3fe`, ci-release, parallel e2e
The stuck query is `SELECT 1 AS v1 FROM m16 AS mm16 JOIN m16 ON mm16.v1 = m16.v1`, which is in `mv_on_mv.slt`. The barrier of epoch `3433057878016000` cannot be collected, according to the RPC traces. Searching the async stack trace for `3433057878016000`, we find that most actors from `Actor 505048` to `Actor 505059` (4*3=12 parallelisms of one fragment) have a root span of `Epoch 3433057878016000`, which means the barrier has been collected on them. But two actors behave strangely: they are still on the `<initial>` epoch. 🤔

What's more, the stack traces of these two actors are stale: captured 379.501681106s ago. The stack trace is reported periodically from the same tokio task as the actor (with `futures::join`), since it's thread-local. Normally, we only miss the reporting period and get a stale trace when the thread is busy with some CPU-intensive work. That shouldn't be the case here: we're stuck, and the CPU tends to be idle.
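As a rough sketch of this reporting scheme (hypothetical names, not RisingWave's actual implementation): the actor future and a periodic reporter are joined into one tokio task, so the trace stored in the thread-local slot only gets refreshed while that task is actually being polled.

```rust
use std::cell::RefCell;
use std::time::{Duration, Instant};

// Hypothetical thread-local slot holding the last captured trace and its timestamp.
thread_local! {
    static LAST_TRACE: RefCell<Option<(String, Instant)>> = RefCell::new(None);
}

// Stub: in reality this would walk the instrumented span tree of the actor.
fn capture_trace() -> String {
    "<spans of the current actor>".to_string()
}

async fn run_actor_with_reporter(actor: impl std::future::Future<Output = ()>) {
    let reporter = async {
        let mut ticker = tokio::time::interval(Duration::from_secs(1));
        loop {
            ticker.tick().await;
            // This only runs when the surrounding task is polled. If the worker
            // thread is blocked in the OS (e.g. on a mutex), neither the actor
            // nor this reporter makes progress, and the last report goes stale --
            // exactly the "captured ...s ago" symptom above.
            LAST_TRACE.with(|t| *t.borrow_mut() = Some((capture_trace(), Instant::now())));
        }
    };
    // Both futures live in the same task; the whole task is cancelled externally
    // when the actor is torn down.
    futures::join!(actor, reporter);
}
```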
So why can't the tokio task be scheduled? It's likely because the worker thread itself has been parked at the OS level and is not running, so tokio cannot schedule the async tasks on it either.
We then search the traces for the pattern `captured 3.+\ds ago`. We can find 8 actors, and all of them are scheduled to the second compute node at `:5688`. So there must be some problem in this process.

We dump the sync (native) stack trace of that process with `gdb -batch -ex 'thread apply all bt' -p 3786162`: sync-stack-trace.txt. It shows that most threads are parked actively by the tokio runtime, as there's no task to schedule.
However, some threads are blocked on the mutex of `risingwave_common::cache::LruCache`. This is abnormal, as the mutex guard cannot be held across an `await` point, and we assume the critical section is lightweight. Digging deeper, we find a different stack trace: the `Drop` of `CacheableEntry` acquires the lock of its shard, so dropping it inside the scope of a lock guard will lead to a deadlock. fix(cache): do not drop cache-entry inside lock #6315 tried to fix that by deallocating (outside the lock) the entries returned due to a send error, i.e., when the oneshot channel receiver has been dropped:

risingwave/src/common/src/cache.rs, lines 684 to 704 at c2c3795
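To make the deadlock concrete, here is a minimal, self-contained sketch (hypothetical types, not the actual cache code) of the pattern described above: an entry whose `Drop` re-acquires the shard lock, dropped while that lock is already held.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for the cache shard and its entry handle.
struct Shard {
    entries: Vec<u64>,
}

struct Entry {
    shard: Arc<Mutex<Shard>>,
    key: u64,
}

impl Drop for Entry {
    // Like `CacheableEntry`, dropping the handle needs the shard lock
    // (e.g. to release a reference or put the key back onto the LRU list).
    fn drop(&mut self) {
        let mut shard = self.shard.lock().unwrap();
        shard.entries.retain(|k| *k != self.key);
    }
}

fn main() {
    let shard = Arc::new(Mutex::new(Shard { entries: vec![1] }));
    let entry = Entry { shard: shard.clone(), key: 1 };

    // The shard lock is held for what is assumed to be a short critical section...
    let _guard = shard.lock().unwrap();
    // ...but dropping `entry` here runs `Entry::drop`, which tries to lock the
    // same non-reentrant mutex again: the thread blocks on itself forever
    // (std's mutex may also panic), matching the blocked threads in the sync
    // stack trace.
    drop(entry);
}
```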
However, this is not the only case: if the receiver is dropped right after we check whether it's dropped, then `send` won't return `Err`, while the `inner` wrapping the value will be dropped before returning `Ok`.
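A sketch of where that race window sits, assuming the entry is handed back over a `tokio::sync::oneshot` channel while the shard lock is held (illustrative names, not the actual cache code):

```rust
use tokio::sync::oneshot;

// Stand-in for `CacheableEntry`; assume its `Drop` needs the shard lock,
// as in the deadlock sketch above.
struct CacheableEntryLike;

/// Called while the shard lock is held. Returns the entry whenever it could not
/// be sent, so the caller can drop it only *after* releasing the lock (the idea
/// behind #6315).
fn send_entry(
    tx: oneshot::Sender<CacheableEntryLike>,
    entry: CacheableEntryLike,
) -> Option<CacheableEntryLike> {
    if tx.is_closed() {
        // Receiver already dropped: hand the entry back for deferred deallocation.
        return Some(entry);
    }
    // Race window: the receiver may be dropped between `is_closed()` and `send()`.
    // As described above, `send` can then still return `Ok(())`, while the channel's
    // inner state -- together with the entry just moved into it -- is dropped inside
    // `send` itself, i.e. while the shard lock is still held, hitting the entry's
    // `Drop` and thus the same deadlock.
    match tx.send(entry) {
        Ok(()) => None,
        // The ordinary failure case: we get the entry back and defer its drop too.
        Err(entry) => Some(entry),
    }
}
```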
Considerations
This explains why it's hard to reproduce:
Some thoughts:
- Acquiring a lock in `Drop` sounds dangerous.
- `moka` might be a better choice.
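For illustration only, a minimal sketch (hypothetical types, not the actual fix in #6315) of the "do not drop cache-entry inside lock" direction: collect the handles to deallocate while holding the lock, and drop them only after the guard is released.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical shard and handle types; assume `EntryHandle::drop` needs the
// shard lock, as in the deadlock sketch above.
struct EntryHandle;
struct Shard {
    handles: HashMap<u64, EntryHandle>,
}

fn evict(shard: &Arc<Mutex<Shard>>, keys: &[u64]) {
    // Collect everything that has to be deallocated while the lock is held...
    let mut pending = Vec::new();
    {
        let mut guard = shard.lock().unwrap();
        for key in keys {
            if let Some(handle) = guard.handles.remove(key) {
                pending.push(handle);
            }
        }
    } // shard lock released here
    // ...and drop the handles only afterwards, so their `Drop` can re-acquire
    // the shard lock without deadlocking.
    drop(pending);
}
```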