
fix: Improve DB deadlock detection/logging #5137

Merged · 6 commits · Sep 13, 2024

Conversation

@jbencin (Contributor) commented Sep 4, 2024

Description

This PR makes two changes to tx_busy_handler(), the callback that rusqlite invokes while a thread is waiting on a DB lock:

  • Improve logging. Previously, all logs in tx_busy_handler() were at debug level; now we emit warn/error-level messages if we have been waiting too long
  • After a certain amount of time has passed (about a minute), assume a deadlock has occurred: stop trying to open the DB, log the failure, and return an error

Note that this PR doesn't make deadlocks any less likely to occur; it just makes them easier to find and generates an error when one is detected, which may be recoverable or may crash the application, depending on how the caller handles the error.
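For illustration, a minimal sketch of the escalation described above (the thresholds, sleep interval, and log macros are assumptions for this sketch, not the exact code in the PR):

use std::{thread, time::Duration};

// Illustrative thresholds: ~30s before warning, ~60s before giving up,
// given the 100ms sleep per retry below.
const WARN_AFTER_RETRIES: i32 = 300;
const ABORT_AFTER_RETRIES: i32 = 600;

/// Busy handler registered with rusqlite via `Connection::busy_handler`.
/// `run_count` is how many times SQLite has already invoked this handler for
/// the current lock attempt. Returning `false` tells SQLite to stop retrying,
/// which surfaces an SQLITE_BUSY error to the caller.
pub fn tx_busy_handler(run_count: i32) -> bool {
    if run_count > ABORT_AFTER_RETRIES {
        log::error!("Probable deadlock: waited too long for DB lock ({run_count} retries); giving up");
        return false;
    }
    if run_count > WARN_AFTER_RETRIES {
        log::warn!("Still waiting for DB lock after {run_count} retries");
    } else {
        log::debug!("Waiting for DB lock (retry {run_count})");
    }
    thread::sleep(Duration::from_millis(100));
    true
}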

Applicable issues

Additional info (benefits, drawbacks, caveats)

I could have added precise timing using Instant::now(), but that would require wrapping tx_busy_handler() in a closure, which creates lifetime issues because it's a callback. This could be addressed by adding a boxed callback to the DBTx type, but that would require changing a lot of code, which I didn't think was worth it.

Checklist

  • Test coverage for new or modified code paths
  • Changelog is updated
  • Required documentation changes (e.g., docs/rpc/openapi.yaml and rpc-endpoints.md for v2 endpoints, event-dispatcher.md for new events)
  • New clarity functions have corresponding PR in clarity-benchmarking repo
  • New integration test(s) added to bitcoin-tests.yml

@obycode (Contributor) left a comment

Good idea @jbencin. I just had one request for a little simplification.

obycode previously approved these changes Sep 4, 2024

@obycode (Contributor) left a comment

👍

@jbencin force-pushed the fix/improve-deadlock-handling branch from cb78652 to 0bec5b2 on September 4, 2024 19:19
@jbencin marked this pull request as ready for review September 4, 2024 19:20
@jbencin requested a review from a team as a code owner September 4, 2024 19:20
obycode previously approved these changes Sep 4, 2024

@obycode (Contributor) left a comment

👍

@jbencin (Contributor, Author) commented Sep 5, 2024

There was also a version of tx_busy_handler() in stacks-common that I had to update.

@jbencin requested a review from obycode September 5, 2024 13:15
@jbencin force-pushed the fix/improve-deadlock-handling branch 2 times, most recently from 032def0 to dbfd7ea on September 5, 2024 13:26
jferrant previously approved these changes Sep 5, 2024
@jbencin (Contributor, Author) commented Sep 6, 2024

Rebased and fixed a conflict; please re-approve.

@jcnelson (Member) left a comment

Putting this as a NACK for now until we can get the below addressed.

Per the sprint call today (9 Sep), there are basically two ways under consideration for discovering deadlocks:

  • Crashing the node and printing a caller thread stack trace if the caller of tx_begin_immediate() fails to begin a transaction within a specified amount of time.

  • Having tx_begin_immediate() return an error after a specified amount of time.

I don't think either approach is desirable, because neither tells us what we really want to know: which thread is holding the transaction at the time the caller tries (and fails) to run tx_begin_immediate()? To do this, we'd need to somehow get the lock-holder to print a stack trace.

The approach I'd recommend is trying to find a way to crash the node, and print all threads' stack traces. I don't think returning an error is the right thing here, because the caller may livelock instead -- it could continuously try and fail to begin the transaction, which makes finding the root cause from the logs harder (since the livelocked thread would keep the node running for longer than when the deadlock condition arose).

@jbencin (Contributor, Author) commented Sep 9, 2024

which thread is holding the transaction at the time the caller tries (and fails) to run tx_begin_immediate()? To do this, we'd need to somehow get the lock-holder to print a stack trace

With this PR, it almost certainly would. The first process holding the lock for an abnormally long period of time is almost certainly doing so because it is trying to acquire another lock. That's how deadlocks generally occur (it's also possible that it got stuck in an infinite loop). So both processes should be trying and failing to acquire locks, and so both should print a stack trace to warn/error logs.

@jcnelson (Member) commented Sep 9, 2024

the first process holding the lock for an abnormally long period of time is almost certainly doing so because it is trying to acquire another lock. That's how deadlocks generally occur (it's also possible that it got stuck in an infinite loop). So both processes should be trying and failing to acquire locks, and so both should print a stack trace to warn/error logs

As mentioned in the call, it's entirely possible that the thread acquires one lock and then spins while trying to acquire the second lock (or just spins in general, without regard to any other locks), which would not result in its stack trace being printed.

@jbencin (Contributor, Author) commented Sep 9, 2024

it's entirely possible that the thread acquires one lock and then spins

I see what you mean now: if you don't set the busy handler, rusqlite defaults to returning an SQLITE_BUSY error instead. I grepped the code and we don't check for SQLITE_BUSY, so it seems unlikely that we spin on this anywhere, but I guess it's possible we spin on all errors somewhere, and I don't know how I'd find and fix that (I guess I'd have to check all the rusqlite functions which can acquire a lock).

How about I reduce the scope of this PR to just upgrading the logging to warn! and error!?

Also let me know what you think about using a re-entrant mutex as described in #5112

@jcnelson (Member) commented Sep 10, 2024

Gah, it appears Github ate my original comment. I don't know if it hit your inboxes or not. Lemme try again.


There are several places in the code where multiple transactions have to be initiated in order to carry out a task. For example, processing a Stacks block requires holding open a transaction to both the Stacks chain state as well as the sortition DB. The problem is that different threads may acquire transactions in conflicting orders, leading to livelocks and/or deadlocks, depending on how a failure to acquire one of the requisite locks is handled.

Basically, the way I'd like to get this fixed involves two tasks:

  • We enumerate all of the legal sequences of transactions that may be required, and require threads to use one of these sequences. We can enforce correct lock order at compile-time by providing a layer of indirection between threads and databases such that threads initiate multi-DB transactions via a permitted sequence.

  • To facilitate building the above and catching errors, we provide a way to track and publish the backtraces of every caller that tries to open a transaction. Then, if a deadlock or livelock occurs, we can easily determine who the responsible threads are, and which code bodies were involved.

Basically, we need to provide a singleton "database manager" that can ensure that whenever a thread acquires a particular set of transactions, they do so in the right order, and do so atomically (so, all transactions begin, or all abort). The database manager system would log a backtrace of each caller whenever it tries to begin a set of transactions, so if it gets stuck, we can figure out who is blocking it.

How do we build this in a way that minimizes the refactoring burden? I think we'd treat the singleton database manager state as a global within the code, and provide a static API for initiating these transaction sequences and recording (to the global) the callers' backtraces. Specifically, we'd need the following:

pub enum TransactionSequence { /* ... */ }

This would be the set of valid transaction sequences, which we convince ourselves cannot conflict with one another. Fortunately, there aren't that many of them.

pub struct TransactionSet {
   pub(crate) txs: HashMap<TransactionSequence, DBTx>,
   /* ... */
}

This would represent a set of open transactions. It would have a Drop implementation that rolled back each DBTx<'_> in ::txs.

impl TransactionSet {
   pub(crate) fn commit(self) -> Result<(), (Self, Error)> {
      /* ... */
   }
   
   pub(crate) fn rollback(self) -> Result<(), (Self, Error)> {
      /* ... */
   }
}

The commit() function would commit and consume each inner DBTx<'_>. On failure, it would return the relevant Error, as well as itself, so the caller can try to commit again until all transactions succeed, or the caller decides to panic.

The rollback() function would similarly roll back the transactions in a restartable way. The Drop implementation for TransactionSet would probably just call .rollback().unwrap() since there's likely no way to recover from a partially-successful rollback without the caller's context.

pub mod DBManager {
   /* This is a sqlite DB connection; see below */
   static State: std::sync::Mutex<Option<DBConn>> = std::sync::Mutex::new(None);
   
   pub fn tx_begin_multi(sequence: TransactionSequence) -> Result<TransactionSet, Error> {
      /* ... */
   }
}

The DBManager acts as our global singleton. The function tx_begin_multi() takes one of the valid transaction sequence types as input, and instantiates a TransactionSet if and only if it was able to open transactions on all of the underlying databases in the right order. Otherwise, no transactions are opened, and an Error is returned.

The DBManager would write a backtrace of each caller who calls tx_begin_multi() to State (instantiating it lazily if need be). The TransactionSet::commit() and TransactionSet::rollback() functions would also manipulate State to drop the caller's record. So, the contents of State would represent the backtraces of every process that has acquired a TransactionSet, which we can compare against backtraces from callers who are trying and failing to begin transactions.
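A rough sketch of that record-keeping step, assuming a hypothetical table and helper name (the exact schema would be part of the State DB design):

use std::backtrace::Backtrace;

/// Hypothetical helper: record the calling thread and its backtrace in the
/// `State` connection whenever a transaction set is opened.
fn record_caller(state: &rusqlite::Connection, sequence_name: &str) -> rusqlite::Result<()> {
    let bt = Backtrace::force_capture();
    state.execute(
        "INSERT OR REPLACE INTO open_tx_sets (thread, sequence, backtrace) VALUES (?1, ?2, ?3)",
        rusqlite::params![
            format!("{:?}", std::thread::current().id()),
            sequence_name,
            format!("{bt}"),
        ],
    )?;
    Ok(())
}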

We would remove all direct calls to tx_begin_immediate() in the code, and instead have them call DBManager::tx_begin_multi().

I've opted to use a Sqlite DB to implement State, for three reasons:

  • In debug mode, we can persist caller backtraces to disk. This way, even if the node freezes, or if the node is running in an environment where ptrace(2) is unavailable (e.g. most containers), we can still diagnose deadlocks
  • In production, State can instead point to an in-memory DB so we won't incur additional I/O penalties for opening transactions.
  • If more performance is desired, it's easy to simply disable the use of State

This approach addresses @obycode's original intent by capturing transaction acquisition orders within the type system, and it helps us find regressions and deadlocks more readily than only logging failures to open transactions (since more often than not, we also want to know which process is preventing the caller from opening them).

@kantai (Member) commented Sep 10, 2024

If, for the moment, we're just interested in improving the debugging process here, we could do something like the following:

  1. Add a static, thread-safe (i.e., locked) hashmap that maps from db file name(-ish) to thread name.
  2. In db.rs::tx_begin_immediate_sqlite, before returning Ok(tx), do something like:
use std::ops::Deref; // needed for `tx.deref()` below

pub fn tx_begin_immediate_sqlite<'a>(conn: &'a mut Connection) -> Result<DBTx<'a>, sqlite_error> {
    conn.busy_handler(Some(tx_busy_handler))?;
    let tx = Transaction::new(conn, TransactionBehavior::Immediate)?;
    let mut locks_table = LOCKS_TABLE.lock().unwrap();
    // the debug format for `Connection` includes the path, so deref the Transaction to a Connection
    locks_table.insert(format!("{:?}", tx.deref()), format!("{:?}", std::thread::current().name()));
    Ok(tx)
}

And then when the busy handler reaches enough wait time, dump the stack and the locks table. That would at least tell us what thread last held the lock in question. This is kind of hacky (it doesn't clear entries from the table when transactions close, and it relies on debug formats), but it would require very little refactoring (no refactoring, really: the only lines changed are the ones above, plus an initializer for the static table, sketched below), and should make the debugging process much easier.
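For completeness, the initializer for that static table could look something like this (a sketch assuming the lazy_static crate; the exact key/value types are illustrative):

use std::collections::HashMap;
use std::sync::Mutex;

use lazy_static::lazy_static;

lazy_static! {
    /// Maps the debug-formatted `Connection` (which includes the DB path) to the
    /// name of the thread that most recently began an immediate transaction on it.
    static ref LOCKS_TABLE: Mutex<HashMap<String, String>> = Mutex::new(HashMap::new());
}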

@jbencin (Contributor, Author) commented Sep 10, 2024

@kantai I'm not sure how useful this is if the entries are never cleared from the hash table, but that can be easily addressed with a couple changes:

  • Index the HashMap by thread id and/or thread name, so we'll only have the most recent stack trace for each thread
  • Implement drop() for DBTx so that it clears the entry when it's dropped

@kantai (Member) commented Sep 10, 2024

@kantai I'm not sure how useful this is if the entries are never cleared from the hash table, but that can be easily addressed with a couple changes:

While it would make the data structure somewhat clearer if entries are cleared, the most recent holder of a db's lock is sufficient to debug a deadlock. If Thread A is waiting on DB 1's lock, presumably whichever thread was the last to obtain DB 1's lock is still holding it (otherwise, Thread A would be able to get the lock).

  • Index the HashMap by thread id and/or thread name, so we'll only have the most recent stack trace for each thread

Threads can hold multiple transactions (indeed, that's the problem), so we'd want the table to map from dbs to threads.

  • Implement drop() for DBTx so that it clears the entry when it's dropped

While that could be a relatively straightforward refactor, I don't think it's necessary for the reasoning above: the last holder of each lock is really all the information we would need to debug deadlocks. This refactor isn't just a matter of implementing drop() for DBTx: DBTx isn't a struct, it's just a type alias for a foreign type, so this would actually require creating a new struct, probably implementing Deref, etc. to simplify the refactor, and then making sure that everywhere in the codebase that currently expects a rusqlite::Transaction is ready to handle the new type. This would ultimately end up being somewhat similar to the larger refactor Jude proposed above.

@jbencin (Contributor, Author) commented Sep 11, 2024

@kantai okay, after looking into the debug format for tx.deref(), this makes sense to me. I will add something like this. I'd like to get some quick logging improvements in that we can use now, because @jcnelson's suggestion is going to take some time to implement.

@jbencin (Contributor, Author) commented Sep 11, 2024

@jcnelson I appreciate the detailed reply. I like this idea, but I think it would take a while to implement and test adequately, so I'll try to get some simple logging changes merged in the next couple of days, so we can be sure to have something for the Nakamoto release.

A couple questions/comments on this proposal:

  • I think for this to work, we'd need a check in tx_begin_multi() that panics if the calling process already holds any locks. This would force threads to acquire all locks at once; otherwise, it could still be used the way tx_begin_immediate() is now.
  • Are there other locks (besides DB) in the application that could potentially lead to a deadlock?
  • How exactly would TransactionSequence work? The only way I'm aware of to guarantee deadlocks can't happen is to force locks to always be acquired in the same order. For example, if you have locks A, B, and C, a thread can lock A, or B, or A->C, but not B->A. So would this enum just establish a locking order, and error/panic if that order is violated? (See the sketch after this list.)
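A hypothetical sketch of what such a locking-order enum could look like, with example database names:

/// Hypothetical sketch: each variant names one permitted acquisition order,
/// so a conflicting order (e.g. chain state before sortition) cannot be expressed.
pub enum TransactionSequence {
    SortitionOnly,
    ChainStateOnly,
    SortitionThenChainState,
}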

@jbencin force-pushed the fix/improve-deadlock-handling branch from 502d039 to 021ab58 on September 11, 2024 16:32
@jbencin
Copy link
Contributor Author

jbencin commented Sep 11, 2024

@jcnelson I've taken @kantai's suggestion and updated tx_busy_handler() to print out the HashMap of all locks held and panic when a deadlock is detected. Let me know if this is acceptable or what changes I should make (return an error instead of panicking? let the deadlock continue and just log an error?).
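A rough sketch of the shape of that change, assuming the LOCKS_TABLE static from the snippet above (the helper name, threshold handling, and messages are illustrative):

/// Illustrative deadlock branch for the busy handler: dump the last-known
/// holder of each DB lock, then panic so the stuck thread aborts (its own
/// backtrace is printed when RUST_BACKTRACE is enabled).
fn report_deadlock_and_panic(run_count: i32) -> ! {
    let locks_table = LOCKS_TABLE.lock().unwrap();
    for (db, thread_name) in locks_table.iter() {
        log::error!("DB lock on {db} was last acquired by thread {thread_name}");
    }
    panic!("Probable deadlock: still waiting for DB lock after {run_count} retries");
}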

@jbencin force-pushed the fix/improve-deadlock-handling branch from 32a6ccf to 292cd89 on September 11, 2024 16:59
@jbencin added this pull request to the merge queue Sep 13, 2024
Merged via the queue into stacks-network:develop with commit 75d7e27 Sep 13, 2024
1 check passed
@blockstack-devops (Contributor) commented:

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@stacks-network locked as resolved and limited conversation to collaborators Oct 27, 2024