tmkms freeze #37

Closed
mattharrop opened this issue Apr 22, 2020 · 54 comments · Fixed by #60
Labels
bug Something isn't working

Comments

@mattharrop

I'm having an issue with silent failures. tmkms will stop signing, with no errors logged. It just stops signing. The last log entry is the last signature. Restarting tmkms resolves the problem.

This is v0.7.2, on Ubuntu. I have this issue on two instances, with 2 and 3 keys in use on a YubiHSM respectively. The problem recurs irregularly, roughly every few weeks.

At this point I have very little to go on, and no useful information to share.

Is there anything I can do to produce useful debugging information next time the failure happens?

@tony-iqlusion
Member

There have been reports of similar issues in the past (e.g. tendermint/tmkms#389); however, I have never been able to reproduce them.

Can you see if you can reproduce the issue while running tmkms under strace?

@mattharrop
Author

mattharrop commented Apr 23, 2020

I've started strace watching the tmkms process. It generates enormous log files! I'll watch to avoid running out of disk space, and keep it running until the next halt. I am not aware of any way to trigger the failure, so it may take a while.

@tarcieri
Collaborator

@mattharrop I've just released v0.7.3, which includes a number of bugfixes. I'm not sure any of them will help this issue per se, but there were some related to YubiHSM device (re-)enumeration which could potentially be the source of this problem.

@mattharrop
Author

And just now, after three weeks of running under strace, I finally have a freeze. I have 28G of logs. How many lines or how much time do you want?

@tarcieri
Collaborator

I only need a little bit of recent context, maybe the last 10,000 lines or so?

@greg-szabo

Hi! We just ran into the same issue. It's the second time for us; the first time, I think we upgraded tmkms.

@mattharrop which version did you run for the strace?

@tony-iqlusion
Member

We managed to isolate a probable location where the problem is occurring. I'll add some more debug logging around this area then cut a v0.7.4 release.

tarcieri pushed a commit that referenced this issue Jun 8, 2020
This is an attempt to help address #37.

Based on `strace` logging it appears at least one of the instances of
this bug occurred during a lock acquisition happening immediately after
persisting the chain state. The system call sequence looked something
like this:

```
close(12)   = 0
rmdir("/.../.atomicwrite.InysUcmuRax7") = 0
futex(0x..., FUTEX_WAIT_PRIVATE, 2, NULL
```

Unfortunately this isn't a whole lot to go on, but it makes it appear as
if it's hanging while trying to acquire a lock immediately after
persisting the consensus state to disk.

This commit does a couple things to try to narrow down what is
happening:

1. Ensures that an exclusive lock to the chain state isn't held while
   the signing operation is being performed (i.e. while communicating
   with the HSM). If we were able to update the consensus state, that
   means the signing operation is authorized, and we no longer need to
   hold the lock. In the event the signing operation fails, the
   validator will miss the block in question, but with no risk of
   double-signing.
2. Adds a significant amount of additional debug logging, particularly
   around things like lock acquisition and writing to disk. While this
   commit is unlikely to fix #37 in and of itself, the additional
   debug logging should be helpful in isolating the problem.
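
For illustration only, here is a minimal, self-contained Rust sketch of the lock-scoping idea described in that commit message. All type and function names (`ChainState`, `persist`, `hsm_sign`, `sign_block`) are placeholders and not the actual tmkms code:

```rust
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct ChainState {
    height: u64,
}

// Stand-in for the atomic write of the chain state file.
fn persist(state: &ChainState) {
    println!("persisted height {}", state.height);
}

// Stand-in for the (potentially slow or failing) YubiHSM signing call.
fn hsm_sign(height: u64) -> Vec<u8> {
    height.to_le_bytes().to_vec()
}

fn sign_block(chain_state: &Arc<Mutex<ChainState>>, height: u64) -> Option<Vec<u8>> {
    {
        let mut state = chain_state.lock().unwrap();
        if height <= state.height {
            return None; // refuse to regress: this is what prevents double-signing
        }
        state.height = height;
        persist(&state);
    } // exclusive lock released here, before any HSM I/O

    // If signing fails past this point the validator merely misses the block;
    // the persisted state already advanced, so a double-sign is impossible.
    Some(hsm_sign(height))
}

fn main() {
    let state = Arc::new(Mutex::new(ChainState::default()));
    assert!(sign_block(&state, 1).is_some());
    assert!(sign_block(&state, 1).is_none()); // same height again: refused
}
```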
@tony-iqlusion tony-iqlusion reopened this Jun 8, 2020
@tony-iqlusion
Member

Looks like #60 triggered auto-close; it (likely) doesn't actually address the issue, but it should help debug it.

@tony-iqlusion
Member

I've released v0.8.0-alpha1 which corrects a few minor things that might be going amiss but also adds considerably more debug logging:

https://forum.cosmos.network/t/ann-tendermint-kms-v0-8-tendermint-v0-33-support/3704

If you can reproduce this issue while running in verbose mode (i.e. with the -v command line argument) it'd be very helpful in nailing this down.

We're presently running this release in production.

@tony-iqlusion tony-iqlusion added the bug Something isn't working label Jul 2, 2020
@gaia

gaia commented Jul 6, 2020

Just FYI, we also experienced the same issue once, a couple of weeks ago, while running 0.7.2. It didn't occur again and as of 4 days ago we are on 0.8.0.

@tony-iqlusion
Member

We haven't had any reports of this re-occurring since the v0.8 release, so I'm going to go ahead and close this. Please let me know if that changes.

tarcieri pushed a commit that referenced this issue Oct 22, 2020
These loglines were for helping debug #37. Now that it's closed, they
can be removed as they are quite noisy.
@tony-iqlusion tony-iqlusion reopened this Nov 30, 2020
@gaia

gaia commented Dec 20, 2020

Again last night :(

Dec 20 02:08:30 hsm tmkms[25469]: Dec 20 02:08:30.390  INFO tmkms::session: [cosmoshub-3@tcp://a.b.c.e:26658] signed PreCommit:4676990D86 at h/r/s 4493122/0/2 (126 ms)
Dec 20 02:08:31 hsm tmkms[25469]: Dec 20 02:08:31.636  INFO tmkms::session: [columbus-4@tcp://a.b.c.d:26658] signed PreVote:1423FF2F98 at h/r/s 1045895/0/1 (127 ms)
Dec 20 02:17:24 hsm systemd[1]: Stopping TMKMS...
Dec 20 02:17:24 hsm systemd[1]: Stopped TMKMS.
Dec 20 02:17:24 hsm systemd[1]: Started TMKMS.
Dec 20 02:17:24 hsm tmkms[7905]: Dec 20 02:17:24.920  INFO tmkms::commands::start: tmkms 0.9.0 starting up...

@tony-iqlusion
Member

For anyone encountering this issue in production (we have never encountered it before, so that makes debugging that much harder), it would be extremely helpful if someone would attach to the process with something like gdb or lldb and obtain a backtrace (e.g. with the bt command).

That said, Tendermint v0.35 will include a gRPC-based privval connection between TMKMS and a validator, which will completely change the networking stack on both the TMKMS and Tendermint sides. Perhaps that will indirectly fix the issue, or if not, it will at least help isolate the possible places it's occurring.

@activenodes

The same for me after a few weeks of uptime:

8:09:55 tmkms[372518]: tmkms::session: [xxx@tcp://xx.0.4.11:26658] signed PreCommit:xxx at h/r/s 1023094/0/2 (126 ms)
8:09:57 tmkms[372518]: tmkms::session: [yyy@tcp://xx.0.3.11:26658] signed PreVote:yyy at h/r/s 1245586/0/1 (125 ms)
8:09:57 tmkms[372518]: tmkms::session: [yyy@tcp://xx.0.3.11:26658] signed PreCommit:yyy at h/r/s 1245586/0/2 (125 ms)
8:09:57 tmkms[372518]: tmkms::session: [zzz@tcp://xx.0.2.11:26658] signed PreVote:xxx at h/r/s 3964819/0/1 (156 ms)
8:11:02 systemd[1]: Stopping Cosmos tmkms...
8:11:02 systemd[1]: tmkms.service: Succeeded.
8:11:02 systemd[1]: Stopped Cosmos tmkms.
8:11:02 systemd[1]: Started Cosmos tmkms.
8:11:02 tmkms[458276]: tmkms::commands::start: tmkms 0.10.0 starting up...
8:11:02 tmkms[458276]: yubihsm::connector::http::server: yubihsm::http-server[127.0.0.1:12345]: listening for connections
8:11:02 kernel: [3222633.441759] usb 1-1: reset full-speed USB device number 2 using xhci_hcd
8:11:02 kernel: [3222633.717759] usb 1-1: reset full-speed USB device number 2 using xhci_hcd
8:11:02 tmkms[458276]: tmkms::keyring: [keyring:yubihsm] added consensus Ed25519 key: ...

The server is dedicated to TMKMS

# free -h

              total        used        free      shared  buff/cache   available
Mem:          7.7Gi       219Mi       287Mi       2.0Mi       7.2Gi       7.2Gi
Swap:         4.0Gi       1.0Mi       4.0Gi

$ tmkms version

tmkms 0.10.0

Right now I have automated restarting the process after 60 seconds without log output.
I could try to automate a `strace --attach=PID` (for 30 seconds) before restarting. Would that help?

@tony-iqlusion
Member

If anyone is experiencing something like this, especially frequently, it'd be very helpful to get a traffic capture with a tool like tcpdump.

There are a number of things that could potentially cause this which are external to TMKMS, including it never receiving an incoming request. If you can use a tool like tcpdump to rule that out and show a case where TMKMS is actually receiving incoming network packets but never processing them, then that would be a TMKMS bug.

Based on some other anecdotal reports from people who were experiencing issues like these, they were ultimately a network problem.

@activenodes

During the issue I was able to verify that the network connections worked (the destination IPs are all different, all reachable), and a simple restart of the TMKMS service fixed the problem.
I will try to capture traffic to see if there is incoming (and outgoing) traffic to/from the tmkms service.

@tony-iqlusion
Member

If you can build TMKMS in debug mode (i.e. omit the --release when doing cargo build) and attach with gdb or lldb and run bt to get a backtrace of its current state when the problem occurs, that would be helpful too.

@gaia

gaia commented Jun 7, 2021

Same here, a simple restart fixes it. I've set tmkms to restart every two weeks regardless; it seems the issue happens around every 3 weeks or so.

Also, IMHO it is unlikely to be a network problem for us: the TMKMS VM is on the same physical host as the validator; they are VMs under the same hypervisor.

@AC-Validator which method are you using to detect logging stopped?

@activenodes

yubihsm-connector is a separate Golang executable distributed with the YubiHSM2 SDK:
It would be useful to help narrow down if the problem is resulting from the USB stack or not.

Mmm... It is quite intrusive for a production setup...
Can we do something in the case of "higher message volume"?
At this time it is also easy to reproduce the issue (every 3.5 days)

@activenodes

OK, that makes sense. We run a max of 2 validators per YubiHSM.

Analyzing the response latencies, it seems the YubiHSM2 has no particular problems (it also depends on the block time of each chain). So I prefer active/passive clusters with this number of connections (~10).

@activenodes

@tony-iqlusion are you ready to let me do some tests?

@tony-iqlusion
Member

Not quite yet... there's still some more work on the yubihsm crate I want to do, then I'll cut another release and update tmkms

tarcieri pushed a commit to iqlusioninc/yubihsm.rs that referenced this issue Nov 30, 2021
Context: iqlusioninc/tmkms#37

To summarize, the `yubihsm::Client` type has internal locking for a
`Session` with the YubiHSM. There is actually very little lock-related
code associated with this scheme, but there appears to be a subtle
reentrancy bug in pretty much the only function where it could possibly
exist: the `yubihsm::Client::send_command` method.

Based on discussions on iqlusioninc/tmkms#37 the bug definitely seemed
to be tied to rekeying the session after a preset number of messages had
been sent. It appears there was a reentrancy bug where an RAII guard was
still holding the `Mutex` for the session state, which prevented a
subsequent attempt to resend the message from acquiring it.

The solution is to explicitly drop the stale session guard prior to
trying to establish a new session.
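
To make the suspected failure mode concrete, here is a minimal Rust sketch of the kind of same-thread relock described in that commit message. `Client`, `Session`, `send_command`, and `REKEY_AFTER` are simplified stand-ins, not the real yubihsm.rs types:

```rust
use std::sync::Mutex;

struct Session {
    messages_sent: u32,
}

struct Client {
    session: Mutex<Option<Session>>,
}

const REKEY_AFTER: u32 = 3;

impl Client {
    fn open_session(&self) {
        // Re-acquires the same Mutex; if the caller still holds a guard on it,
        // this call never returns (the freeze observed in tmkms).
        *self.session.lock().unwrap() = Some(Session { messages_sent: 0 });
    }

    fn send_command(&self, _cmd: &[u8]) {
        let mut guard = self.session.lock().unwrap();
        let needs_rekey = match guard.as_mut() {
            Some(session) => {
                session.messages_sent += 1;
                session.messages_sent >= REKEY_AFTER
            }
            None => true,
        };

        if needs_rekey {
            // Buggy version: calling `self.open_session()` here while `guard`
            // is still alive re-locks the Mutex on the same thread and hangs.
            drop(guard); // fixed version: release the stale session guard first
            self.open_session();
        }
    }
}

fn main() {
    let client = Client {
        session: Mutex::new(None),
    };
    for i in 0..10u8 {
        client.send_command(&[i]); // completes only because of the explicit drop
    }
    println!("sent 10 commands without hanging");
}
```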
@tony-iqlusion
Member

I think I have a good guess as to what the core problem is here, and a prospective fix if my guess is right:

iqlusioninc/yubihsm.rs#273

I suspect it's a mutex reentrancy bug, where the same thread blocks itself attempting to acquire an already-held mutex. It occurs in the code which handles message volume limits and rekeying which seemed to be implicated based on the above comments.

If anyone is interested in extremely-alpha testing this, I can make a git branch you can build from. But that will be a little bit of work.

@gaia

gaia commented Nov 30, 2021

I think I have a good guess as to what the core problem is here, and a prospective fix if my guess is right:

iqlusioninc/yubihsm.rs#273

I suspect it's a mutex reentrancy bug, where the same thread blocks itself attempting to acquire an already-held mutex. It occurs in the code which handles message volume limits and rekeying which seemed to be implicated based on the above comments.

If anyone is interested in extremely-alpha testing this, I can make a git branch you can build from. But that will be a little bit of work.

I'm curious why your setup never experienced this.

@tony-iqlusion
Member

I've released TMKMS v0.11.0-pre.2 which includes the prospective fix for this issue. We're now running it in production.

I'm curious why your setup never experienced this.

We just experienced it for the first time recently. As to why we didn't experience it before I'm uncertain, but it's possible my guess as to the cause is wrong.

@activenodes

I've released TMKMS v0.11.0-pre.2 which includes the prospective fix for this issue. We're now running it in production.

Happy to try it!
I will be AFK over the weekend. I have to work out exactly when the next crash is due (about 3.5 days) before putting it into production, but definitely next week.

Thanks Tony!

@activenodes

@tony-iqlusion we are in production, could you give some guidance on how to do this safely?

$ git fetch
$ git checkout v0.11.0-pre.2
$ cargo build --features=yubihsm-server,softsign

   [..]
   Compiling bip32 v0.2.2
   Compiling yubihsm v0.40.0-pre
error: edition 2021 is unstable and only available with -Z unstable-options.

error: could not compile `yubihsm`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed

@tony-iqlusion
Member

The Minimum Supported Rust Version is now 1.56. You need to upgrade your rustc version

@activenodes

The Minimum Supported Rust Version is now 1.56. You need to upgrade your rustc version

Done, thanks.

The reported version is still v0.10.1. Is that right?

$ git fetch
$ git checkout v0.11.0-pre.2
$ cargo build --features=yubihsm-server,softsign
$ cargo install tmkms --features=yubihsm-server,softsign

   [..]
   Replacing /home/tmkms/.cargo/bin/tmkms
    Replaced package `tmkms v0.10.1` with `tmkms v0.10.1` (executable `tmkms`)


$ tmkms
tmkms 0.10.1

@tony-iqlusion
Member

No, using cargo install that way will install the latest version from crates.io.

You need to do:

cargo install --path . --features=yubihsm-server,softsign

@activenodes

A quick update:
Active: active (running) since Thu 2021-12-09 21:26:45 CET; 6 days ago
Usually the crash would have happened by now.
We'll keep waiting, hoping not to have to restart the service.
In the meantime... thanks ;)

@tony-iqlusion
Member

Going to go ahead and close this.

If anyone experiences another freeze with tmkms v0.11.0-pre.2 or subsequent releases, please leave a comment and I'll reopen this issue.

@tony-iqlusion
Member

FYI, we've released v0.11.0 containing a fix for this issue: #479

flihp added a commit to flihp/offline-keystore that referenced this issue Apr 2, 2024
The `Client` type previously had a `Drop` implementation that closed the
session if the Client had an open one. This seems to have caused
problems in other downstream projects and was subsequently removed:
iqlusioninc/tmkms#37
iqlusioninc/yubihsm.rs#265

This is still a WIP as I'm hoping to find a way to close the Client's
session w/o having to modify the yubihsm.rs code.
flihp added a commit to oxidecomputer/offline-keystore that referenced this issue Apr 2, 2024
The `Client` type previously had a `Drop` implementation that closed the
session if the Client had an open one. This seems to have caused
problems in other downstream projects and was subsequently removed:
iqlusioninc/tmkms#37
iqlusioninc/yubihsm.rs#265

The replacement was to provide a `session()` function that returns an
Arc / MutexGuard wrapped reference to the optional session. This isn't
useful for us here because we don't and AFAIK can't take ownership of
the session, which we need because the `Session::close` function consumes
the session (it can't be reopened). Our solution requires an upstream
change to the `Client` type adding a `close_session` function that just
closes the session if one is open.
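
As an illustration of the upstream change that commit asks for, here is a hypothetical Rust sketch of a consuming session close behind a `Mutex<Option<...>>`. This is just the shape of the idea, not the actual yubihsm.rs API:

```rust
use std::sync::{Arc, Mutex};

struct Session;

impl Session {
    // Consuming close, mirroring the constraint described above:
    // once closed, the session cannot be reused.
    fn close(self) {}
}

struct Client {
    session: Arc<Mutex<Option<Session>>>,
}

impl Client {
    // Hypothetical `close_session`: take ownership of the session out of the
    // Option so the consuming `close` can be called, leaving `None` behind.
    fn close_session(&self) {
        if let Some(session) = self.session.lock().unwrap().take() {
            session.close();
        }
    }
}

fn main() {
    let client = Client {
        session: Arc::new(Mutex::new(Some(Session))),
    };
    client.close_session(); // closes the open session
    client.close_session(); // no-op when no session is open
}
```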