We've been using `jl_record_backtrace` extensively in production to print backtraces when one of our servers hits a state of degraded performance or deadlock.
We're afraid, however, that there could be an ABA-like bug in `jl_record_backtrace` which may cause us to miss a task whose backtrace we're attempting to record.
Here is the code for reference:
```c
JL_DLLEXPORT size_t jl_record_backtrace(jl_task_t *t, jl_bt_element_t *bt_data, size_t max_bt_size) JL_NOTSAFEPOINT
{
    jl_task_t *ct = jl_current_task;
    jl_ptls_t ptls = ct->ptls;
    if (t == ct) {
        return rec_backtrace(bt_data, max_bt_size, 0);
    }
    bt_context_t *context = NULL;
    bt_context_t c;
    int16_t old = -1;
    while (!jl_atomic_cmpswap(&t->tid, &old, ptls->tid) && old != ptls->tid) {
        int lockret = jl_lock_stackwalk();
        // if this task is already running somewhere, we need to stop the thread it is running on and query its state
        if (!jl_thread_suspend_and_get_state(old, 1, &c)) {
            jl_unlock_stackwalk(lockret);
            if (jl_atomic_load_relaxed(&t->tid) != old)
                continue;
            return 0;
        }
        jl_unlock_stackwalk(lockret);
        if (jl_atomic_load_relaxed(&t->tid) == old) {
            jl_ptls_t ptls2 = jl_atomic_load_relaxed(&jl_all_tls_states)[old];
            if (ptls2->previous_task == t || // we might print the wrong stack here, since we can't know whether we executed the swapcontext yet or not, but it at least avoids trying to access the state inside uc_mcontext which might not be set yet
                (ptls2->previous_task == NULL && jl_atomic_load_relaxed(&ptls2->current_task) == t)) { // this case should be always accurate
                // use the thread context for the unwind state
                context = &c;
            }
            break;
        }
        // got the wrong thread stopped, try again
        jl_thread_resume(old);
    }
    if (context == NULL && (!t->ctx.copy_stack && t->ctx.started && t->ctx.ctx != NULL)) {
        // need to read the context from the task stored state
        jl_jmp_buf *mctx = &t->ctx.ctx->uc_mcontext;
#if defined(_OS_WINDOWS_)
        memset(&c, 0, sizeof(c));
        if (jl_simulate_longjmp(*mctx, &c))
            context = &c;
#elif defined(JL_HAVE_UNW_CONTEXT)
        context = t->ctx.ctx;
#elif defined(JL_HAVE_UCONTEXT)
        context = jl_to_bt_context(t->ctx.ctx);
#elif defined(JL_HAVE_ASM)
        memset(&c, 0, sizeof(c));
        if (jl_simulate_longjmp(*mctx, &c))
            context = &c;
#else
#pragma message("jl_record_backtrace not defined for unknown task system")
#endif
    }
    size_t bt_size = 0;
    if (context)
        bt_size = rec_backtrace_ctx(bt_data, max_bt_size, context, t->gcstack);
    if (old == -1)
        jl_atomic_store_relaxed(&t->tid, old);
    else if (old != ptls->tid)
        jl_thread_resume(old);
    return bt_size;
}
```
The case which I think may be pathological is the following:
1. `t` is initially scheduled on thread 1. The compare-and-swap at the top of the `while` loop fails, and `old` is set to 1.
2. We stop thread 1 in `jl_thread_suspend_and_get_state`, but suppose task `t` was faster than us and got rescheduled on thread 2.
3. The check `t->tid == 1` fails, and we resume thread 1.
4. Task `t` is again faster than us and quickly migrates from thread 2 back to thread 1.
5. When we reach the top of the `while` loop again, we have `t->tid == 1` and `old == 1`. The compare-and-swap succeeds even though the task is still running on a thread, so we leave the `while` loop early.
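
The interleaving above can be replayed deterministically on a single thread. The following is only a minimal sketch using C11 `<stdatomic.h>` rather than Julia's `jl_atomic_cmpswap`; `task_tid` and `replay_aba` are hypothetical names standing in for `t->tid` and the loop in `jl_record_backtrace`, and the "migrations" are plain stores that imitate what the scheduler might do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

// Hypothetical stand-in for t->tid; -1 means "not running on any thread".
static _Atomic int16_t task_tid;

// Replays the five steps on one thread.
// Returns 1 if the second compare-and-swap succeeds (the ABA outcome), 0 otherwise.
static int replay_aba(int16_t self_tid)
{
    int16_t old = -1;
    atomic_store(&task_tid, 1);             // step 1: the task runs on thread 1

    // First attempt: expected -1, actual 1. The CAS fails and writes 1 into `old`.
    int ok = atomic_compare_exchange_strong(&task_tid, &old, self_tid);
    assert(!ok && old == 1);

    atomic_store(&task_tid, 2);             // step 2: task moves to thread 2 while we suspend thread 1
    assert(atomic_load(&task_tid) != old);  // step 3: the re-check fails, so thread 1 is resumed
    atomic_store(&task_tid, 1);             // step 4: task moves back to thread 1

    // Step 5: `old` still holds 1, so the CAS succeeds although no thread is stopped.
    ok = atomic_compare_exchange_strong(&task_tid, &old, self_tid);
    return ok;
}
```

Calling `replay_aba(0)` in this model returns 1 and leaves `task_tid` at 0: the caller believes it owns the task while, in the real runtime, the task could still be executing on thread 1.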
Is the case I'm describing here indeed pathological? Are there any invariants that I could be missing that will make this scenario impossible to occur?
Yes, good catch. There appears to be a missing re-assignment of `old = -1;` at the end of that loop. In the ABA case, this means we accidentally acquire the lock on the thread despite not having stopped the thread. In the counter-case, we try to run through this logic with `old == -1` on the next iteration, and that isn't valid either (`jl_thread_suspend_and_get_state` should return failure and the loop will abort too early).
There was a missing re-assignment of old = -1; at the end of that loop
which means in the ABA case, we accidentally actually acquire the lock
on the thread despite not actually having stopped the thread; or in the
counter-case, we try to run through this logic with old==-1 on the next
iteration, and that isn't valid either (jl_thread_suspend_and_get_state
should return failure and the loop will abort too early).
Fix #56046
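
To illustrate why resetting `old` matters, here is a rough single-threaded model of the retry loop, again using C11 atomics instead of the Julia runtime. The names `task_tid`, `walk`, and the `outcome` enum are hypothetical, the thread suspend/resume calls are reduced to comments, and the task's migrations are scripted stores. It models only the main retry path, not the early `continue`/`return 0` branch, and is not the actual patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

// Hypothetical stand-in for t->tid.
static _Atomic int16_t task_tid;

enum outcome {
    STOPPED_RIGHT_THREAD, // loop left via the re-check: the suspended thread owns the task
    CLAIMED_RUNNING_TASK  // loop left via a CAS that matched a stale `old`
};

// reset_old != 0 models the proposed fix (old = -1 at the end of each iteration).
static enum outcome walk(int reset_old, int16_t self_tid)
{
    int16_t old = -1;
    int iteration = 0;
    atomic_store(&task_tid, 1); // the task starts on thread 1
    while (!atomic_compare_exchange_strong(&task_tid, &old, self_tid) && old != self_tid) {
        // (thread `old` would be suspended here)
        if (iteration == 0)
            atomic_store(&task_tid, 2); // scripted: task migrates away before the re-check
        if (atomic_load(&task_tid) == old)
            return STOPPED_RIGHT_THREAD;
        // wrong thread stopped: (resume it) and try again
        if (iteration == 0)
            atomic_store(&task_tid, 1); // scripted: task migrates back before the retry
        iteration++;
        if (reset_old)
            old = -1; // forces the next CAS to fail and re-read the current tid
    }
    // In this script the task is never idle, so a successful CAS means a stale match.
    return CLAIMED_RUNNING_TASK;
}
```

In this model, `walk(0, 0)` ends with `CLAIMED_RUNNING_TASK`, while `walk(1, 0)` fails the second compare-and-swap, re-reads the tid, and ends with `STOPPED_RIGHT_THREAD`.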
Thanks in advance.
CC: @vtjnash.