
fix: Improve DB deadlock detection/logging #5137

Merged · 6 commits · Sep 13, 2024

Conversation

@jbencin (Contributor) commented Sep 4, 2024

Description

This PR makes two changes to tx_busy_handler(), the callback that rusqlite invokes while a thread is waiting on a DB lock:

  • Improve logging. Previously, all logs in tx_busy_handler() were at debug level; now we emit warn/error-level messages if we have been waiting too long
  • After a certain amount of time has passed (about a minute), assume a deadlock has occurred: stop trying to open the DB, log the failure, and return an error

Note that this PR doesn't make deadlocks any less likely to occur; it just makes them easier to find and generates an error when one is detected, which may be recoverable or may crash the application, depending on how the caller handles the error.
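For illustration, a minimal sketch of the escalation described above (the thresholds, sleep interval, and log macros are assumptions for this sketch, not the exact code in the PR):

use std::{thread, time::Duration};

// Illustrative thresholds: ~30s before warning, ~60s before giving up,
// given the 100ms sleep per retry below.
const WARN_AFTER_RETRIES: i32 = 300;
const ABORT_AFTER_RETRIES: i32 = 600;

/// Busy handler registered with rusqlite via `Connection::busy_handler`.
/// `run_count` is how many times SQLite has already invoked this handler for
/// the current lock attempt. Returning `false` tells SQLite to stop retrying,
/// which surfaces an SQLITE_BUSY error to the caller.
pub fn tx_busy_handler(run_count: i32) -> bool {
    if run_count > ABORT_AFTER_RETRIES {
        log::error!("Probable deadlock: waited too long for DB lock ({run_count} retries); giving up");
        return false;
    }
    if run_count > WARN_AFTER_RETRIES {
        log::warn!("Still waiting for DB lock after {run_count} retries");
    } else {
        log::debug!("Waiting for DB lock (retry {run_count})");
    }
    thread::sleep(Duration::from_millis(100));
    true
}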

Applicable issues

Additional info (benefits, drawbacks, caveats)

I could have added precise timing using Instant::now(), but that would require wrapping tx_busy_handler() in a closure, which creates lifetime issues because it's a callback. This could be addressed by adding a boxed callback to the DBTx type, but that would require changing a lot of code, which I didn't think was worth it.

Checklist

  • Test coverage for new or modified code paths
  • Changelog is updated
  • Required documentation changes (e.g., docs/rpc/openapi.yaml and rpc-endpoints.md for v2 endpoints, event-dispatcher.md for new events)
  • New clarity functions have corresponding PR in clarity-benchmarking repo
  • New integration test(s) added to bitcoin-tests.yml

@obycode (Contributor) left a comment

Good idea @jbencin. I just had one request for a little simplification.

obycode previously approved these changes Sep 4, 2024

@obycode (Contributor) left a comment

👍

@jbencin force-pushed the fix/improve-deadlock-handling branch from cb78652 to 0bec5b2 on September 4, 2024 19:19
@jbencin marked this pull request as ready for review September 4, 2024 19:20
@jbencin requested a review from a team as a code owner September 4, 2024 19:20
obycode previously approved these changes Sep 4, 2024

@obycode (Contributor) left a comment

👍

@jbencin (Contributor, Author) commented Sep 5, 2024

There was also a version of tx_busy_handler() in stacks-common that I had to update.

@jbencin requested a review from obycode September 5, 2024 13:15
@jbencin force-pushed the fix/improve-deadlock-handling branch 2 times, most recently from 032def0 to dbfd7ea on September 5, 2024 13:26
jferrant previously approved these changes Sep 5, 2024
@jbencin (Contributor, Author) commented Sep 6, 2024

Rebased and fixed a conflict; please re-approve.

@jcnelson (Member) left a comment

Putting this as a NACK for now until we can get the below addressed.

Per the sprint call today (9 Sep), there are basically two ways under consideration for discovering deadlocks:

  • Crashing the node and printing a caller thread stack trace if the caller of tx_begin_immediate() fails to begin a transaction within a specified amount of time.

  • Having tx_begin_immediate() return an error after a specified amount of time.

I don't think either approach is desirable, because neither tells us what we really want to know: which thread is holding the transaction at the time the caller tries (and fails) to run tx_begin_immediate()? To do this, we'd need to somehow get the lock-holder to print a stack trace.

The approach I'd recommend is trying to find a way to crash the node, and print all threads' stack traces. I don't think returning an error is the right thing here, because the caller may livelock instead -- it could continuously try and fail to begin the transaction, which makes finding the root cause from the logs harder (since the livelocked thread would keep the node running for longer than when the deadlock condition arose).

@jbencin (Contributor, Author) commented Sep 9, 2024

which thread is holding the transaction at the time the caller tries (and fails) to run tx_begin_immediate()? To do this, we'd need to somehow get the lock-holder to print a stack trace

With this PR, it almost certainly would. The first process holding the lock for an abnormally long period of time is almost certainly doing so because it is trying to acquire another lock. That's how deadlocks generally occur (it's also possible that it got stuck in an infinite loop). So both processes should be trying and failing to acquire locks, and so both should print a stack trace to warn/error logs.

@jcnelson (Member) commented Sep 9, 2024

the first process holding the lock for an abnormally long period of time is almost certainly doing so because it is trying to acquire another lock. That's how deadlocks generally occur (it's also possible that it got stuck in an infinite loop). So both processes should be trying and failing to acquire locks, and so both should print a stack trace to warn/error logs

As mentioned in the call, it's entirely possible that the thread acquires one lock and then spins while trying to acquire the second lock (or just spins in general, without regard to any other locks), which would not result in its stack trace being printed.

@jbencin (Contributor, Author) commented Sep 9, 2024

it's entirely possible that the thread acquires one lock and then spins

I see what you mean now: if you don't set the busy handler, rusqlite defaults to returning an SQLITE_BUSY error instead. I grepped the code and we don't check for SQLITE_BUSY, so it seems unlikely that we spin on this anywhere, but I guess it's possible we spin on all errors somewhere, and I don't know how I'd find and fix that (I guess I'd have to check all the rusqlite functions which can acquire a lock).

How about I reduce the scope of this PR to just upgrading the logging to warn! and error!?

Also let me know what you think about using a re-entrant mutex as described in #5112

@jcnelson (Member) commented Sep 10, 2024

Gah, it appears Github ate my original comment. I don't know if it hit your inboxes or not. Lemme try again.


There are several places in the code where multiple transactions have to be initiated in order to carry out a task. For example, processing a Stacks block requires holding open a transaction to both the Stacks chain state as well as the sortition DB. The problem is that different threads may acquire transactions in conflicting orders, leading to livelocks and/or deadlocks, depending on how a failure to acquire one of the requisite locks is handled.

Basically, the way I'd like to get this fixed involves two tasks:

  • We enumerate all of the legal sequences of transactions that may be required, and require threads to use one of these sequences. We can enforce correct lock order at compile-time by providing a layer of indirection between threads and databases such that threads initiate multi-DB transactions via a permitted sequence.

  • To facilitate building the above and catching errors, we provide a way to track and publish the backtraces of every caller that tries to open a transaction. Then, if a deadlock or livelock occurs, we can easily determine who the responsible threads are, and which code bodies were involved.

Basically, we need to provide a singleton "database manager" that can ensure that whenever a thread acquires a particular set of transactions, they do so in the right order, and do so atomically (so, all transactions begin, or all abort). The database manager system would log a backtrace of each caller whenever it tries to begin a set of transactions, so if it gets stuck, we can figure out who is blocking it.

How do we build this in a way that minimizes the refactoring burden? I think we'd treat the singleton database manager state as a global within the code, and provide a static API for initiating these transaction sequences and recording (to the global) the callers' backtraces. Specifically, we'd need the following:

pub enum TransactionSequence { /* ... */ }

This would be the set of valid transaction sequences, which we convince ourselves cannot conflict with one another. Fortunately, there aren't that many of them.

pub struct TransactionSet {
   pub(crate) txs: HashMap<TransactionSequence, DBTx>,
   /* ... */
}

This would represent a set of open transactions. It would have a Drop implementation that rolled back each DBTx<'_> in ::txs.

impl TransactionSet {
   pub(crate) fn commit(self) -> Result<(), (Self, Error)> {
      /* ... */
   }
   
   pub(crate) fn rollback(self) -> Result<(), (Self, Error)> {
      /* ... */
   }
}

The commit() function would commit and consume each inner DBTx<'_>. On failure, it would return the relevant Error, as well as itself, so the caller can try to commit again until all transactions succeed, or the caller decides to panic.

The rollback() function would similarly roll back the transactions in a restartable way. The Drop implementation for TransactionSet would probably just call .rollback().unwrap() since there's likely no way to recover from a partially-successful rollback without the caller's context.

pub mod DBManager {
   /* This is a sqlite DB connection; see below */
   static State: std::sync::Mutex<Option<DBConn>> = std::sync::Mutex::new(None);
   
   pub fn tx_begin_multi(sequence: TransactionSequence) -> Result<TransactionSet, Error> {
      /* ... */
   }
}

The DBManager acts as our global singleton. The function tx_begin_multi() takes one of the valid transaction sequence types as input, and instantiates a TransactionSet if and only if it was able to open transactions on all of the underlying databases in the right order. Otherwise, no transactions are opened, and an Error is returned.

The DBManager would write a backtrace of each caller who calls tx_begin_multi() to State (instantiating it lazily if need be). The TransactionSet::commit() and TransactionSet::rollback() functions would also manipulate State to drop the caller's record. So, the contents of State would represent the backtraces of every process that has acquired a TransactionSet, which we can compare against backtraces from callers who are trying and failing to begin transactions.
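A rough sketch of that record-keeping step, assuming a hypothetical table and helper name (the exact schema would be part of the State DB design):

use std::backtrace::Backtrace;

/// Hypothetical helper: record the calling thread and its backtrace in the
/// `State` connection whenever a transaction set is opened.
fn record_caller(state: &rusqlite::Connection, sequence_name: &str) -> rusqlite::Result<()> {
    let bt = Backtrace::force_capture();
    state.execute(
        "INSERT OR REPLACE INTO open_tx_sets (thread, sequence, backtrace) VALUES (?1, ?2, ?3)",
        rusqlite::params![
            format!("{:?}", std::thread::current().id()),
            sequence_name,
            format!("{bt}"),
        ],
    )?;
    Ok(())
}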

We would remove all direct calls to tx_begin_immediate() in the code, and instead have them call DBManager::tx_begin_multi().

I've opted to use a Sqlite DB to implement State, for three reasons:

  • In debug mode, we can persist caller backtraces to disk. This way, even if the node freezes, or if the node is running in an environment where ptrace(2) is unavailable (e.g. most containers), we can still diagnose deadlocks
  • In production, State can instead point to an in-memory DB so we won't incur additional I/O penalties for opening transactions.
  • If more performance is desired, it's easy to simply disable the use of State

This approach addresses @obycode's original intent by capturing transaction acquisition orders within the type system, and it helps us find regressions and deadlocks more readily than only logging failures to open transactions (since more often than not, we also want to know which process is preventing the caller from opening them).

@kantai (Member) commented Sep 10, 2024

If, for the moment, we're just interested in improving the debugging process here, we could do something like the following:

  1. Add a static, thread-safe (i.e., locked) hashmap that maps from db file name(-ish) to thread name.
  2. In db.rs::tx_begin_immediate_sqlite, before returning Ok(tx), do something like:
use std::ops::Deref; // needed for `tx.deref()` below

pub fn tx_begin_immediate_sqlite<'a>(conn: &'a mut Connection) -> Result<DBTx<'a>, sqlite_error> {
    conn.busy_handler(Some(tx_busy_handler))?;
    let tx = Transaction::new(conn, TransactionBehavior::Immediate)?;
    let mut locks_table = LOCKS_TABLE.lock().unwrap();
    // the debug format for `Connection` includes the path, so deref the Transaction to a Connection
    locks_table.insert(format!("{:?}", tx.deref()), format!("{:?}", std::thread::current().name()));
    Ok(tx)
}

And then when the busy handler reaches enough wait time, dump the stack and the locks table. That would at least tell us what thread last held the lock in question. This is kind of hacky (it doesn't clear entries from the table when transactions close, and it relies on debug formats), but it would require very little refactoring (no refactoring, really: the only lines changed are the ones above, plus an initializer for the static table, sketched below), and should make the debugging process much easier.
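For completeness, the initializer for that static table could look something like this (a sketch assuming the lazy_static crate; the exact key/value types are illustrative):

use std::collections::HashMap;
use std::sync::Mutex;

use lazy_static::lazy_static;

lazy_static! {
    /// Maps the debug-formatted `Connection` (which includes the DB path) to the
    /// name of the thread that most recently began an immediate transaction on it.
    static ref LOCKS_TABLE: Mutex<HashMap<String, String>> = Mutex::new(HashMap::new());
}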

@jbencin (Contributor, Author) commented Sep 10, 2024

@kantai I'm not sure how useful this is if the entries are never cleared from the hash table, but that can be easily addressed with a couple changes:

  • Index the HashMap by thread id and/or thread name, so we'll only have the most recent stack trace for each thread
  • Implement drop() for DBTx so that it clears the entry when it's dropped

@kantai (Member) commented Sep 10, 2024

@kantai I'm not sure how useful this is if the entries are never cleared from the hash table, but that can be easily addressed with a couple changes:

While it would make the data structure somewhat clearer if entries are cleared, the most recent holder of a db's lock is sufficient to debug a deadlock. If Thread A is waiting on DB 1's lock, presumably whichever thread was the last to obtain DB 1's lock is still holding it (otherwise, Thread A would be able to get the lock).

  • Index the HashMap by thread id and/or thread name, so we'll only have the most recent stack trace for each thread

Threads can hold multiple transactions (indeed, that's the problem), so we'd want the table to map from dbs to threads.

  • Implement drop() for DBTx so that it clears the entry when it's dropped

While that could be a relatively straightforward refactor, I don't think it's necessary for the reasoning above: the last holder of each lock is really all the information we would need to debug deadlocks. This refactor isn't just a matter of implementing drop() for DBTx: DBTx isn't a struct, it's just a type alias for a foreign type, so this would actually require creating a new struct, probably implementing Deref, etc. to simplify the refactor, and then making sure that everywhere in the codebase that currently expects a rusqlite::Transaction is ready to handle the new type. This would ultimately end up being somewhat similar to the larger refactor Jude proposed above.

@jbencin (Contributor, Author) commented Sep 11, 2024

@kantai okay, after looking into the debug format for tx.deref(), this makes sense to me. I will add something like this. I'd like to get some quick logging improvements in that we can use now, because @jcnelson's suggestion is going to take some time to implement.

@jbencin (Contributor, Author) commented Sep 11, 2024

@jcnelson I appreciate the detailed reply. I like this idea, but I think it would take a while to implement and test adequately, so I'll try to get some simple logging changes merged in the next couple of days, so we can be sure to have something for the Nakamoto release.

A couple questions/comments on this proposal:

  • I think for this to work, we'd need a check in tx_begin_multi() that panics if the calling process already holds any locks. This would force threads to acquire all locks at once; otherwise, it could still be used the way tx_begin_immediate() is now.
  • Are there other locks (besides DB) in the application that could potentially lead to a deadlock?
  • How exactly would TransactionSequence work? The only way I'm aware of to guarantee deadlocks can't happen is to force locks to always be acquired in the same order. For example, if you have locks A, B, and C, a thread can lock A, or B, or A->C, but not B->A. So would this enum just establish a locking order, and error/panic if that order is violated? (See the sketch after this list.)
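A hypothetical sketch of what such a locking-order enum could look like, with example database names:

/// Hypothetical sketch: each variant names one permitted acquisition order,
/// so a conflicting order (e.g. chain state before sortition) cannot be expressed.
pub enum TransactionSequence {
    SortitionOnly,
    ChainStateOnly,
    SortitionThenChainState,
}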

@jbencin force-pushed the fix/improve-deadlock-handling branch from 502d039 to 021ab58 on September 11, 2024 16:32
@jbencin
Copy link
Contributor Author

jbencin commented Sep 11, 2024

@jcnelson I've taken @kantai's suggestion and updated tx_busy_handler() to print out the HashMap of all locks held and panic when a deadlock is detected. Let me know if this is acceptable or what changes I should make (return an error instead of panicking? let the deadlock continue and just log an error?).
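A rough sketch of the shape of that change, assuming the LOCKS_TABLE static from the snippet above (the helper name, threshold handling, and messages are illustrative):

/// Illustrative deadlock branch for the busy handler: dump the last-known
/// holder of each DB lock, then panic so the stuck thread aborts (its own
/// backtrace is printed when RUST_BACKTRACE is enabled).
fn report_deadlock_and_panic(run_count: i32) -> ! {
    let locks_table = LOCKS_TABLE.lock().unwrap();
    for (db, thread_name) in locks_table.iter() {
        log::error!("DB lock on {db} was last acquired by thread {thread_name}");
    }
    panic!("Probable deadlock: still waiting for DB lock after {run_count} retries");
}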

@jbencin force-pushed the fix/improve-deadlock-handling branch from 32a6ccf to 292cd89 on September 11, 2024 16:59
@jbencin added this pull request to the merge queue Sep 13, 2024
Merged via the queue into stacks-network:develop with commit 75d7e27 Sep 13, 2024
1 check passed
@blockstack-devops (Contributor) commented:

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@stacks-network locked as resolved and limited conversation to collaborators Oct 27, 2024