Refactor locking; add more debug locking #60

Merged Jun 8, 2020 (1 commit)

@tony-iqlusion tony-iqlusion commented Jun 8, 2020

This is an attempt to help address #37 (tmkms freeze).

Based on `strace` logging it appears at least one of the instances of
this bug occurred during a lock acquisition happening immediately after
persisting the chain state. The system call sequence looked something
like this:

```
close(12)   = 0
rmdir("/.../.atomicwrite.InysUcmuRax7") = 0
futex(0x..., FUTEX_WAIT_PRIVATE, 2, NULL
```

Unfortunately this isn't a whole lot to go on, but it suggests the
process hangs while trying to acquire a lock immediately after
persisting the consensus state to disk.

This commit does a couple of things to try to narrow down what is
happening:

1. Ensures that an exclusive lock on the chain state isn't held while
   the signing operation is being performed (i.e. while communicating
   with the HSM). If we were able to update the consensus state, that
   means the signing operation is authorized, and we no longer need to
   hold the lock. In the event the signing operation fails, the
   validator will miss the block in question, but with no risk of
   double-signing.
2. Adds a significant amount of additional debug logging, particularly
   around things like lock acquisition and writing to disk. While this
   commit is unlikely to fix #37 in and of itself, the additional
   debug logging should be helpful in isolating the problem.
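
The lock scoping in item 1, with logging in the spirit of item 2, can be sketched roughly as follows. This is not the tmkms code: `ConsensusState`, `SignError`, `sign_request`, and the `sign_with_hsm` closure are hypothetical names, `std::sync::Mutex` stands in for whatever lock actually guards the chain state, and `eprintln!` stands in for a real logging macro.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the persisted consensus state.
/// Derived ordering compares `height` first, then `round`.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ConsensusState {
    height: u64,
    round: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum SignError {
    /// The request does not advance the state, so signing it could double-sign.
    StateNotAdvanced,
    /// The (simulated) HSM call failed: the block is missed, nothing is signed.
    HsmFailure,
}

/// Advance the state under the exclusive lock, release the lock, and only
/// then perform the slow signing call.
fn sign_request(
    state: &Mutex<ConsensusState>,
    requested: ConsensusState,
    sign_with_hsm: impl FnOnce(ConsensusState) -> Result<Vec<u8>, SignError>,
) -> Result<Vec<u8>, SignError> {
    {
        eprintln!("debug: acquiring chain state lock");
        let mut guard = state.lock().expect("chain state lock poisoned");
        eprintln!("debug: acquired chain state lock");

        if requested <= *guard {
            // The guard is dropped on this early return as well.
            return Err(SignError::StateNotAdvanced);
        }

        // A real implementation would also persist the new state to disk here.
        *guard = requested;
        eprintln!("debug: releasing chain state lock");
    } // `guard` goes out of scope here, so the lock is released before signing.

    // Updating the state is what authorizes the signature, so the lock is no
    // longer needed while talking to the HSM.
    sign_with_hsm(requested)
}
```

If `sign_with_hsm` fails, the state has already advanced, so a retry for the same height and round is rejected with `StateNotAdvanced`: the block is missed, but the same state is never signed twice.
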
@codecov-commenter

Codecov Report

Merging #60 into develop will decrease coverage by 0.36%.
The diff coverage is 1.63%.

@@             Coverage Diff             @@
##           develop      #60      +/-   ##
===========================================
- Coverage    28.85%   28.48%   -0.37%     
===========================================
  Files           50       50              
  Lines         1837     1864      +27     
===========================================
+ Hits           530      531       +1     
- Misses        1307     1333      +26     

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/session.rs | 0.00% <0.00%> (ø) | |
| src/chain/state.rs | 41.22% <20.00%> (-0.98%) | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 45a7370...f9e5de0.

@tony-iqlusion tony-iqlusion merged commit 9d0661e into develop Jun 8, 2020
@tony-iqlusion tony-iqlusion deleted the refactor-locking-and-add-more-debugging branch June 8, 2020 20:28
@tony-iqlusion tony-iqlusion mentioned this pull request Jul 2, 2020
Development

Successfully merging this pull request may close these issues.

tmkms freeze