tmkms freeze #37

Closed
mattharrop opened this issue Apr 22, 2020 · 54 comments · Fixed by #60
Labels
bug Something isn't working

Comments

@mattharrop

I'm having an issue with silent failures. tmkms will stop signing, with no errors logged. It just stops signing. The last log entry is the last signature. Restarting tmkms resolves the problem.

This is v0.7.2, on Ubuntu. I have this issue on two instances, with 2 and 3 keys in use on a YubiHSM respectively. The problem recurs irregularly, roughly every few weeks.

At this point I have very little to go on, and no useful information to share.

Is there anything I can do to produce useful debugging information next time the failure happens?

@tony-iqlusion
Member

There have been reports of similar issues in the past (e.g. tendermint/tmkms#389); however, I have never been able to reproduce them.

Can you see if you can reproduce the issue while running tmkms under strace?

@mattharrop
Author

mattharrop commented Apr 23, 2020

I've started strace watching the tmkms process. It generates enormous log files! I'll watch to avoid running out of disk space, and keep it running until the next halt. I am not aware of any way to trigger the failure, so it may take a while.

@tarcieri
Collaborator

@mattharrop I've just released v0.7.3, which includes a number of bugfixes. I'm not sure any of them will help this issue per se, but there were some related to YubiHSM device (re-)enumeration which could potentially be the source of this problem.

@mattharrop
Author

And just now, after three weeks of running under strace, I finally have a freeze. I have 28G of logs. How many lines or how much time do you want?

@tarcieri
Collaborator

I only need a little bit of recent context, maybe the last 10,000 lines or so?

@greg-szabo

Hi! We just ran into the same issue. It's the second time for us; the first time, I think we upgraded tmkms.

@mattharrop which version did you run for the strace?

@tony-iqlusion
Member

We managed to isolate a probable location where the problem is occurring. I'll add some more debug logging around this area then cut a v0.7.4 release.

tarcieri pushed a commit that referenced this issue Jun 8, 2020
This is an attempt to help address #37.

Based on `strace` logging it appears at least one of the instances of
this bug occurred during a lock acquisition happening immediately after
persisting the chain state. The system call sequence looked something
like this:

```
close(12)   = 0
rmdir("/.../.atomicwrite.InysUcmuRax7") = 0
futex(0x..., FUTEX_WAIT_PRIVATE, 2, NULL
```

Unfortunately this isn't a whole lot to go on, but it makes it appear as
if it's hanging while trying to acquire a lock immediately after
persisting the consensus state to disk.

This commit does a couple things to try to narrow down what is
happening:

1. Ensures that an exclusive lock to the chain state isn't held while
   the signing operation is being performed (i.e. while communicating
   with the HSM). If we were able to update the consensus state, that
   means the signing operation is authorized, and we no longer need to
   hold the lock. In the event the signing operation fails, the
   validator will miss the block in question, but with no risk of
   double-signing.
2. Adds a significant amount of additional debug logging, particularly
   around things like lock acquisition and writing to disk. While this
   commit is unlikely to fix #37 in and of itself, the additional
   debug logging should be helpful in isolating the problem.
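
For illustration only, here is a minimal, self-contained Rust sketch of the lock-scoping idea described in that commit message. All type and function names (`ChainState`, `persist`, `hsm_sign`, `sign_block`) are placeholders and not the actual tmkms code:

```rust
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct ChainState {
    height: u64,
}

// Stand-in for the atomic write of the chain state file.
fn persist(state: &ChainState) {
    println!("persisted height {}", state.height);
}

// Stand-in for the (potentially slow or failing) YubiHSM signing call.
fn hsm_sign(height: u64) -> Vec<u8> {
    height.to_le_bytes().to_vec()
}

fn sign_block(chain_state: &Arc<Mutex<ChainState>>, height: u64) -> Option<Vec<u8>> {
    {
        let mut state = chain_state.lock().unwrap();
        if height <= state.height {
            return None; // refuse to regress: this is what prevents double-signing
        }
        state.height = height;
        persist(&state);
    } // exclusive lock released here, before any HSM I/O

    // If signing fails past this point the validator merely misses the block;
    // the persisted state already advanced, so a double-sign is impossible.
    Some(hsm_sign(height))
}

fn main() {
    let state = Arc::new(Mutex::new(ChainState::default()));
    assert!(sign_block(&state, 1).is_some());
    assert!(sign_block(&state, 1).is_none()); // same height again: refused
}
```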
@tony-iqlusion tony-iqlusion reopened this Jun 8, 2020
@tony-iqlusion
Member

Looks like #60 triggered auto-close; it (likely) doesn't actually address the issue, but it should help debug it.

@tony-iqlusion
Member

I've released v0.8.0-alpha1 which corrects a few minor things that might be going amiss but also adds considerably more debug logging:

https://forum.cosmos.network/t/ann-tendermint-kms-v0-8-tendermint-v0-33-support/3704

If you can reproduce this issue while running in verbose mode (i.e. with the -v command line argument) it'd be very helpful in nailing this down.

We're presently running this release in production.

@tony-iqlusion tony-iqlusion added the bug Something isn't working label Jul 2, 2020
@gaia

gaia commented Jul 6, 2020

Just FYI, we also experienced the same issue once, a couple of weeks ago, while running 0.7.2. It didn't occur again and as of 4 days ago we are on 0.8.0.

@tony-iqlusion
Member

We haven't had any reports of this re-occurring since the v0.8 release, so I'm going to go ahead and close this. Please let me know if that changes.

tarcieri pushed a commit that referenced this issue Oct 22, 2020
These loglines were for helping debug #37. Now that it's closed, they
can be removed as they are quite noisy.
@tony-iqlusion tony-iqlusion reopened this Nov 30, 2020
@gaia

gaia commented Dec 20, 2020

Again last night :(

Dec 20 02:08:30 hsm tmkms[25469]: Dec 20 02:08:30.390  INFO tmkms::session: [cosmoshub-3@tcp://a.b.c.e:26658] signed PreCommit:4676990D86 at h/r/s 4493122/0/2 (126 ms)
Dec 20 02:08:31 hsm tmkms[25469]: Dec 20 02:08:31.636  INFO tmkms::session: [columbus-4@tcp://a.b.c.d:26658] signed PreVote:1423FF2F98 at h/r/s 1045895/0/1 (127 ms)
Dec 20 02:17:24 hsm systemd[1]: Stopping TMKMS...
Dec 20 02:17:24 hsm systemd[1]: Stopped TMKMS.
Dec 20 02:17:24 hsm systemd[1]: Started TMKMS.
Dec 20 02:17:24 hsm tmkms[7905]: Dec 20 02:17:24.920  INFO tmkms::commands::start: tmkms 0.9.0 starting up...

@tony-iqlusion
Member

For anyone encountering this issue in production (we have never encountered it before, so that makes debugging that much harder), it would be extremely helpful if someone would attach to the process with something like gdb or lldb and obtain a backtrace (e.g. with the bt command).

That said, Tendermint v0.35 will include a gRPC-based privval connection between TMKMS and a validator, which will completely change the networking stack on both the TMKMS and Tendermint sides. Perhaps that will indirectly fix the issue, or if not, it will at least help isolate the possible places it's occurring.

@activenodes

The same for me after a few weeks of uptime:

8:09:55 tmkms[372518]: tmkms::session: [xxx@tcp://xx.0.4.11:26658] signed PreCommit:xxx at h/r/s 1023094/0/2 (126 ms)
8:09:57 tmkms[372518]: tmkms::session: [yyy@tcp://xx.0.3.11:26658] signed PreVote:yyy at h/r/s 1245586/0/1 (125 ms)
8:09:57 tmkms[372518]: tmkms::session: [yyy@tcp://xx.0.3.11:26658] signed PreCommit:yyy at h/r/s 1245586/0/2 (125 ms)
8:09:57 tmkms[372518]: tmkms::session: [zzz@tcp://xx.0.2.11:26658] signed PreVote:xxx at h/r/s 3964819/0/1 (156 ms)
8:11:02 systemd[1]: Stopping Cosmos tmkms...
8:11:02 systemd[1]: tmkms.service: Succeeded.
8:11:02 systemd[1]: Stopped Cosmos tmkms.
8:11:02 systemd[1]: Started Cosmos tmkms.
8:11:02 tmkms[458276]: tmkms::commands::start: tmkms 0.10.0 starting up...
8:11:02 tmkms[458276]: yubihsm::connector::http::server: yubihsm::http-server[127.0.0.1:12345]: listening for connections
8:11:02 kernel: [3222633.441759] usb 1-1: reset full-speed USB device number 2 using xhci_hcd
8:11:02 kernel: [3222633.717759] usb 1-1: reset full-speed USB device number 2 using xhci_hcd
8:11:02 tmkms[458276]: tmkms::keyring: [keyring:yubihsm] added consensus Ed25519 key: ...

The server is dedicated to TMKMS

# free -h

              total        used        free      shared  buff/cache   available
Mem:          7.7Gi       219Mi       287Mi       2.0Mi       7.2Gi       7.2Gi
Swap:         4.0Gi       1.0Mi       4.0Gi

$ tmkms version

tmkms 0.10.0

Right now I have automated restarting the process after 60 seconds without log output.
I could try to automate a `strace --attach=PID` (for 30 seconds) before restarting. Would that help?

@tony-iqlusion
Member

If anyone is experiencing something like this, especially frequently, it'd be very helpful to get a traffic capture with a tool like tcpdump.

There are a number of things that could potentially cause this which are external to TMKMS, including it never receiving an incoming request. If you can use a tool like tcpdump to rule that out and show a case where TMKMS is actually receiving incoming network packets but never processing them, then that would be a TMKMS bug.

Based on some other anecdotal reports from people who were experiencing issues like these, they were ultimately a network problem.

@activenodes

During the issue I was able to verify that the network connections worked (the destination IPs are all different, all reachable), and a simple restart of the TMKMS service fixed the problem.
I will try to capture traffic to see if there is incoming (and outgoing) traffic to/from the tmkms service.

@tony-iqlusion
Member

If you can build TMKMS in debug mode (i.e. omit the --release when doing cargo build) and attach with gdb or lldb and run bt to get a backtrace of its current state when the problem occurs, that would be helpful too.

@gaia

gaia commented Jun 7, 2021

Same here, a simple restart fixes it. I've set tmkms to restart every two weeks regardless; it seems the issue happens around every 3 weeks or so.

Also, IMHO it is unlikely to be a network problem for us: the TMKMS VM is on the same physical host as the validator; they are VMs under the same hypervisor.

@AC-Validator which method are you using to detect logging stopped?

@activenodes

yubihsm-connector is a separate Golang executable distributed with the YubiHSM2 SDK:
It would be useful to help narrow down if the problem is resulting from the USB stack or not.

Mmm... It is quite intrusive for a production setup...
Can we do something in the case of "higher message volume"?
At this time it is also easy to reproduce the issue (every 3.5 days)

@activenodes

OK, that makes sense. We run a max of 2 validators per YubiHSM.

Analyzing the response latencies, it seems the YubiHSM2 has no particular problems (it also depends on the block time of each chain). So I prefer active/passive clusters with this number of connections (~10).

@activenodes

@tony-iqlusion are you ready to let me do some tests?

@tony-iqlusion
Member

Not quite yet... there's still some more work on the yubihsm crate I want to do, then I'll cut another release and update tmkms

tarcieri pushed a commit to iqlusioninc/yubihsm.rs that referenced this issue Nov 30, 2021
Context: iqlusioninc/tmkms#37

To summarize, the `yubihsm::Client` type has internal locking for a
`Session` with the YubiHSM. There is actually very little lock-related
code associated with this scheme, but there appears to be a subtle
reentrancy bug in pretty much the only function where it could possibly
exist: the `yubihsm::Client::send_command` method.

Based on discussions on iqlusioninc/tmkms#37 the bug definitely seemed
to be tied to rekeying the session after a preset number of messages had
been sent. It appears there was a reentrancy bug where an RAII guard was
still holding the `Mutex` for the session state, which prevented a
subsequent attempt to resend the message from acquiring it.

The solution is to explicitly drop the stale session guard prior to
trying to establish a new session.
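
To make the suspected failure mode concrete, here is a minimal Rust sketch of the kind of same-thread relock described in that commit message. `Client`, `Session`, `send_command`, and `REKEY_AFTER` are simplified stand-ins, not the real yubihsm.rs types:

```rust
use std::sync::Mutex;

struct Session {
    messages_sent: u32,
}

struct Client {
    session: Mutex<Option<Session>>,
}

const REKEY_AFTER: u32 = 3;

impl Client {
    fn open_session(&self) {
        // Re-acquires the same Mutex; if the caller still holds a guard on it,
        // this call never returns (the freeze observed in tmkms).
        *self.session.lock().unwrap() = Some(Session { messages_sent: 0 });
    }

    fn send_command(&self, _cmd: &[u8]) {
        let mut guard = self.session.lock().unwrap();
        let needs_rekey = match guard.as_mut() {
            Some(session) => {
                session.messages_sent += 1;
                session.messages_sent >= REKEY_AFTER
            }
            None => true,
        };

        if needs_rekey {
            // Buggy version: calling `self.open_session()` here while `guard`
            // is still alive re-locks the Mutex on the same thread and hangs.
            drop(guard); // fixed version: release the stale session guard first
            self.open_session();
        }
    }
}

fn main() {
    let client = Client {
        session: Mutex::new(None),
    };
    for i in 0..10u8 {
        client.send_command(&[i]); // completes only because of the explicit drop
    }
    println!("sent 10 commands without hanging");
}
```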
@tony-iqlusion
Member

I think I have a good guess as to what the core problem is here, and a prospective fix if my guess is right:

iqlusioninc/yubihsm.rs#273

I suspect it's a mutex reentrancy bug, where the same thread blocks itself attempting to acquire an already-held mutex. It occurs in the code which handles message volume limits and rekeying which seemed to be implicated based on the above comments.

If anyone is interested in extremely-alpha testing this, I can make a git branch you can build from. But that will be a little bit of work.

@gaia

gaia commented Nov 30, 2021

I think I have a good guess as to what the core problem is here, and a prospective fix if my guess is right:

iqlusioninc/yubihsm.rs#273

I suspect it's a mutex reentrancy bug, where the same thread blocks itself attempting to acquire an already-held mutex. It occurs in the code which handles message volume limits and rekeying which seemed to be implicated based on the above comments.

If anyone is interested in extremely-alpha testing this, I can make a git branch you can build from. But that will be a little bit of work.

I'm curious why your setup never experienced this.

@tony-iqlusion
Member

I've released TMKMS v0.11.0-pre.2 which includes the prospective fix for this issue. We're now running it in production.

I'm curious why your setup never experienced this.

We just experienced it for the first time recently. As to why we didn't experience it before I'm uncertain, but it's possible my guess as to the cause is wrong.

@activenodes

I've released TMKMS v0.11.0-pre.2 which includes the prospective fix for this issue. We're now running it in production.

Happy to try it!
I will be AFK over the weekend. I have to work out exactly when the next crash is due (about 3.5 days) before putting it into production, but definitely next week.

Thanks Tony!

@activenodes

@tony-iqlusion we are in production, could you give some guidance on how to do this safely?

$ git fetch
$ git checkout v0.11.0-pre.2
$ cargo build --features=yubihsm-server,softsign

   [..]
   Compiling bip32 v0.2.2
   Compiling yubihsm v0.40.0-pre
error: edition 2021 is unstable and only available with -Z unstable-options.

error: could not compile `yubihsm`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed

@tony-iqlusion
Member

The Minimum Supported Rust Version is now 1.56. You need to upgrade your rustc version

@activenodes

The Minimum Supported Rust Version is now 1.56. You need to upgrade your rustc version

Done, thanks.

The reported version is still v0.10.1. Is that right?

$ git fetch
$ git checkout v0.11.0-pre.2
$ cargo build --features=yubihsm-server,softsign
$ cargo install tmkms --features=yubihsm-server,softsign

   [..]
   Replacing /home/tmkms/.cargo/bin/tmkms
    Replaced package `tmkms v0.10.1` with `tmkms v0.10.1` (executable `tmkms`)


$ tmkms
tmkms 0.10.1

@tony-iqlusion
Member

No, using cargo install that way will install the latest version from crates.io.

You need to do:

cargo install --path . --features=yubihsm-server,softsign

@activenodes

A quick update:
Active: active (running) since Thu 2021-12-09 21:26:45 CET; 6 days ago
Usually the crash would have happened by now.
We'll keep waiting, hoping not to have to restart the service.
In the meantime... thanks ;)

@tony-iqlusion
Member

Going to go ahead and close this.

If anyone experiences another freeze with tmkms v0.11.0-pre.2 or subsequent releases, please leave a comment and I'll reopen this issue.

@tony-iqlusion
Member

FYI, we've released v0.11.0 containing a fix for this issue: #479

flihp added a commit to flihp/offline-keystore that referenced this issue Apr 2, 2024
The `Client` type previously had a `Drop` implementation that closed the
session if the Client had an open one. This seems to have caused
problems in other downstream projects and was subsequently removed:
iqlusioninc/tmkms#37
iqlusioninc/yubihsm.rs#265

This is still a WIP as I'm hoping to find a way to close the Client's
session w/o having to modify the yubihsm.rs code.
flihp added a commit to oxidecomputer/offline-keystore that referenced this issue Apr 2, 2024
The `Client` type previously had a `Drop` implementation that closed the
session if the Client had an open one. This seems to have caused
problems in other downstream projects and was subsequently removed:
iqlusioninc/tmkms#37
iqlusioninc/yubihsm.rs#265

The replacement was to provide a `session()` function that returns an
Arc / MutexGuard wrapped reference to the optional session. This isn't
useful for us here because we don't and AFAIK can't take ownership of
the session, which we need because the `Session::close` function consumes
the session (it can't be reopened). Our solution requires an upstream
change to the `Client` type adding a `close_session` function that just
closes the session if one is open.
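
As an illustration of the upstream change that commit asks for, here is a hypothetical Rust sketch of a consuming session close behind a `Mutex<Option<...>>`. This is just the shape of the idea, not the actual yubihsm.rs API:

```rust
use std::sync::{Arc, Mutex};

struct Session;

impl Session {
    // Consuming close, mirroring the constraint described above:
    // once closed, the session cannot be reused.
    fn close(self) {}
}

struct Client {
    session: Arc<Mutex<Option<Session>>>,
}

impl Client {
    // Hypothetical `close_session`: take ownership of the session out of the
    // Option so the consuming `close` can be called, leaving `None` behind.
    fn close_session(&self) {
        if let Some(session) = self.session.lock().unwrap().take() {
            session.close();
        }
    }
}

fn main() {
    let client = Client {
        session: Arc::new(Mutex::new(Some(Session))),
    };
    client.close_session(); // closes the open session
    client.close_session(); // no-op when no session is open
}
```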