
feat: stacks signer able to save multiple dkg shares and load it where appropriate #4704

Conversation

@netrome (Contributor) commented Apr 22, 2024

Description

This PR extends the signer persistence to allow for storing shares for multiple concurrent DKG rounds, identified by the aggregate public key for that round. On startup, the signer looks up the approved aggregate key in the contract and loads the corresponding shares.

To ensure the Nakamoto sign coordinator can pick up loaded DKG shares, signers broadcast DkgResults messages containing their polynomial commitments once the shares are loaded. The sign coordinator, which previously assumed that any DkgResults message contained the complete set of polynomial commitments, has been extended to also accept individual polynomial commitments in these messages, which it aggregates across all signers to form the complete set.
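As a rough illustration of this aggregation behaviour, the sketch below shows how a coordinator could collect per-signer polynomial commitments from individual DkgResults messages until the full set is available. All type and method names here are hypothetical stand-ins, not the PR's actual implementation (the real code works with the wsts types and the Nakamoto sign coordinator's own message handling).

use std::collections::BTreeMap;

/// Hypothetical stand-in for a signer's polynomial commitment.
#[derive(Clone, Debug)]
struct PolyCommitment(Vec<u8>);

/// Collects per-signer commitments from individual DkgResults messages
/// until every expected signer has contributed.
struct CommitmentAggregator {
    expected_signers: usize,
    commitments: BTreeMap<u32, PolyCommitment>, // keyed by signer id
}

impl CommitmentAggregator {
    fn new(expected_signers: usize) -> Self {
        Self {
            expected_signers,
            commitments: BTreeMap::new(),
        }
    }

    /// Record the commitments carried by one DkgResults message. A message
    /// may carry the full set (the old assumption) or a single signer's
    /// commitment (the new, per-signer broadcast).
    fn ingest(&mut self, batch: Vec<(u32, PolyCommitment)>) {
        for (signer_id, commitment) in batch {
            self.commitments.entry(signer_id).or_insert(commitment);
        }
    }

    /// Once all expected signers have reported, return the complete set.
    fn complete_set(&self) -> Option<Vec<PolyCommitment>> {
        (self.commitments.len() == self.expected_signers)
            .then(|| self.commitments.values().cloned().collect())
    }
}

fn main() {
    let mut agg = CommitmentAggregator::new(2);
    agg.ingest(vec![(0, PolyCommitment(vec![1]))]);
    assert!(agg.complete_set().is_none());
    agg.ingest(vec![(1, PolyCommitment(vec![2]))]);
    assert_eq!(agg.complete_set().map(|set| set.len()), Some(2));
}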

Applicable issues

Additional info

To keep the size of the state stored in StackerDB bounded, the implementation holds at most two sets of shares concurrently, which is enough to handle the scenario described in #4654.
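For illustration only, here is a minimal sketch of a two-slot store keyed by aggregate public key; BoundedShareStore and its methods are hypothetical names rather than the PR's actual persistence code.

use std::collections::VecDeque;

/// Keeps DKG share state for at most two aggregate keys, evicting the
/// oldest entry when a new one is inserted, mirroring the two-set bound
/// described above.
struct BoundedShareStore<K: PartialEq, S> {
    slots: VecDeque<(K, S)>,
}

impl<K: PartialEq, S> BoundedShareStore<K, S> {
    const MAX_SLOTS: usize = 2;

    fn new() -> Self {
        Self {
            slots: VecDeque::new(),
        }
    }

    /// Insert (or replace) the state for `key`, evicting the oldest entry
    /// if the bound would otherwise be exceeded.
    fn insert(&mut self, key: K, state: S) {
        self.slots.retain(|(k, _)| *k != key);
        while self.slots.len() >= Self::MAX_SLOTS {
            self.slots.pop_front();
        }
        self.slots.push_back((key, state));
    }

    /// Look up the saved state for a given aggregate key, if present.
    fn get(&self, key: &K) -> Option<&S> {
        self.slots.iter().find(|(k, _)| k == key).map(|(_, s)| s)
    }
}

Evicting the oldest slot is one way to honour the two-set bound while still keeping shares for both the currently approved key and a newer DKG round in flight.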

@netrome netrome self-assigned this Apr 22, 2024
@netrome netrome force-pushed the 4654-stacks-signer-a-signer-must-be-able-to-save-multiple-dkg-aggregate-key-round-states-and-load-it-where-appropriate branch 3 times, most recently from 0ff200c to 70f7d69 Compare April 25, 2024 19:28
@netrome netrome force-pushed the 4654-stacks-signer-a-signer-must-be-able-to-save-multiple-dkg-aggregate-key-round-states-and-load-it-where-appropriate branch from 70f7d69 to f433116 Compare April 29, 2024 14:23
@netrome netrome marked this pull request as ready for review April 29, 2024 14:23
@hstove (Contributor) previously approved these changes Apr 29, 2024 and left a comment


Everything LGTM! There are a few instances where it would be helpful to pluralize function names that now return states instead of one state.

@netrome (Contributor, Author) commented Apr 29, 2024

Everything LGTM! There are a few instances where it would be helpful to pluralize function names that now return states instead of one state.

Good point. I've updated get_signer as you noted, and also load_encrypted_signer_state, which should be pluralized.

@netrome netrome requested review from jferrant and hstove April 30, 2024 13:58
@kantai (Member) left a comment


Just have a few questions

@jcnelson (Member) left a comment


The structure of this code is giving me a lot of pause. It seems that there are several places where the signer state can just be willy-nilly reset from saved state, but without any consideration to how it will affect the execution of the signer at that point in the reload. I fear this will lead to hard-to-reproduce bugs and state corruption, because it is hard to see in advance what the downstream consequences can be when changing this state mid-way through a main loop pass.

Instead, can this code be factored to make a state reload happen only at the top-level of the main loop? A reload should cancel any in-progress activities and force the signer to restart whatever it was doing anew.

@netrome (Contributor, Author) commented May 3, 2024

The structure of this code is giving me a lot of pause. It seems that there are several places where the signer state can just be willy-nilly reset from saved state, but without any consideration to how it will affect the execution of the signer at that point in the reload. I fear this will lead to hard-to-reproduce bugs and state corruption, because it is hard to see in advance what the downstream consequences can be when changing this state mid-way through a main loop pass.

Instead, can this code be factored to make a state reload happen only at the top-level of the main loop? A reload should cancel any in-progress activities and force the signer to restart whatever it was doing anew.

I'm not too happy about the current structure either. While I have attempted to be careful with how and when we load state in the signer, I do share your concern. We've already seen plenty of hard-to-reproduce bugs in the signer so far, and I definitely don't want to add to them, especially since the intention of this change is to make the signer more robust against bugs caused by network delays and timing issues.

I don't fully understand right now how exactly we'd be able to make the state reloads happen only at the top level of the main loop (assuming you mean within the SignerRunLoop::run_one_pass function and not the provided top-level SignerRunLoop::main_loop function), but I will look into it with fresh eyes on Monday and see what I can do.

@netrome (Contributor, Author) commented May 6, 2024

I've now updated the code so that state reloads only happen in signer::update_approved_aggregate_key, which is only called from signer::refresh_dkg, which in turn is only called from the main loop if no approved aggregate key is set. This eliminates some unnecessary boilerplate and allows us to instantiate the signer without doing yet another state reload after the signer has been constructed.

This feels much cleaner than the previous solution. Thank you for bringing this up @jcnelson. Let me know if you have further opinions on how to structure this. For example, if you want state reloads to be explicit in the main loop I can do some further refactors to achieve this - but I think the solution we have now fits better with the current structure of the signer.
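For readers following along, a minimal sketch of this call chain is shown below. Only the names run_one_pass, refresh_dkg, update_approved_aggregate_key, and load_saved_state_for_aggregate_key come from the PR; the struct layout and helper bodies are illustrative placeholders.

struct Signer {
    approved_aggregate_public_key: Option<[u8; 33]>,
}

impl Signer {
    /// Called from the main loop only while no approved aggregate key is set.
    fn refresh_dkg(&mut self) {
        self.update_approved_aggregate_key();
        // ... proceed with DKG-related handling ...
    }

    /// The single place where saved state may be reloaded.
    fn update_approved_aggregate_key(&mut self) {
        if let Some(key) = self.query_approved_aggregate_key_from_contract() {
            self.approved_aggregate_public_key = Some(key);
            self.load_saved_state_for_aggregate_key(key);
        }
    }

    /// Placeholder: would query the voting contract for the approved key.
    fn query_approved_aggregate_key_from_contract(&self) -> Option<[u8; 33]> {
        None
    }

    /// Placeholder: would look up and decrypt the shares persisted for `key`.
    fn load_saved_state_for_aggregate_key(&mut self, _key: [u8; 33]) {}
}

/// One pass of the main loop: refresh DKG only while no key is approved yet.
fn run_one_pass(signer: &mut Signer) {
    if signer.approved_aggregate_public_key.is_none() {
        signer.refresh_dkg();
    }
    // ... handle incoming events, signing rounds, etc. ...
}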

@netrome netrome requested review from kantai and jcnelson May 7, 2024 09:09
Comment on lines 1456 to 1461
match self.load_saved_state_for_aggregate_key(approved_aggregate_key) {
Ok(()) => self.send_dkg_results(&approved_aggregate_key),
Err(e) => warn!(
"{self}: Failed to load saved state for key {approved_aggregate_key}: {e}"
),
}
A repository Member commented:

I'm still a little confused by the logic here. update_approved_aggregate_key is invoked to refresh the signer's view of the approved aggregate key. Why should that trigger a load from saved state? Shouldn't the signer's state already be loaded into memory at this point? Why doesn't the signer just load state into memory during process startup (as I think was the general idea behind Jude's suggestion)?

The general logic seems like it should be:

On boot:
    self.refresh_approved_aggregate_key();
    if self.approved_aggregate_public_key.is_some():
         self.signer_state = self.load_saved_states()[self.aggregate_key];
         if self.signer_state.pending_dkg_results.is_some():
            send_dkg_results();
            self.signer_state.pending_dkg_results.clear();
            self.overwrite_saved_states()

On refresh aggregate public key:
     if self.approved_aggregate_public_key is changed:
         if self.signer_state.pending_dkg_results.is_some():
            send_dkg_results();
            self.signer_state.pending_dkg_results.clear();
            self.overwrite_saved_states()

That way, the signer sends a DkgResults message after the aggregate key is approved if it is running at the time, or, if it wakes up after the aggregate key has been approved, it sends the message then.

A repository Member added:

This avoids excessively reloading the signer states and it avoids resending the DkgResults message potentially every time a state reload occurs.

@netrome (Contributor, Author) replied:

Hmm yeah I think we could do it that way. Two thoughts on the logic:

  1. On boot, the signer won't have any pending DKG results, so we don't need to send DKG results at that point.
  2. State writes now happen as soon as DKG is finished (i.e., a DkgEnd(_) message is received), which I think makes sense because you want to persist the shares as soon as you have confirmation that you were able to create an aggregate key.

With these, I think we can simplify the logic to:

On boot:
    self.refresh_approved_aggregate_key();
    if self.approved_aggregate_public_key.is_some():
         self.signer_state = self.load_saved_states()[self.aggregate_key];

On refresh aggregate public key:
     if self.approved_aggregate_public_key is changed:
         if self.signer_state.pending_dkg_results.is_some():
            send_dkg_results();
            self.signer_state.pending_dkg_results.clear();

On DkgEnd received:
      self.save_signer_state()

What do you think about this?

@netrome (Contributor, Author) replied:

Why should that trigger a load from saved state? Shouldn't the signer's state already be loaded into memory at this point?

It probably should be, but the rationale behind the current logic is to ensure that the saved state is always consistent with the approved aggregate key. If, for any reason, we observe a change in the approved aggregate key, a natural reaction is to reload the state associated with the changed aggregate key.

A repository Member replied:

I think it's possible to have pending DKG results on boot, because the signer may have gone offline between DkgEnd and the key being approved by the network. I think we should still persist these states after pending_dkg_results is cleared as well, so my suggested logic is about as simplified as it can get. I do agree that we should also persist on DkgEnd.

@netrome (Contributor, Author) replied:

Actually, a problem with the proposed logic is that if you boot the signer before an approved aggregate key is set, you would never load any persisted state. If you are unlucky with the timing, your signer might boot just before the last vote for an aggregate key arrives and sets that key in the contract.

From that perspective, I think it is safer to load the saved state when you observe a change in the aggregate key.

To prevent excessive state reloads, I could add a condition to only load the saved state when the approved aggregate key differs from the one already present in the state.

@netrome (Contributor, Author) commented May 7, 2024:

So I'd then propose considering something that looks like this:

On boot:
     Start without any approved_aggregate_key or loaded signer state

On refresh aggregate public key:
     if self.approved_aggregate_public_key is changed:

         if the new approved_aggregate_public_key is not the same as the one in our signer state:
             self.load_saved_state(approved_aggregate_key)

         if self.signer_state.pending_dkg_results.is_some():
            send_dkg_results();
            self.signer_state.pending_dkg_results.clear();

On DkgEnd received:
      self.save_signer_state()
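A rough, self-contained Rust rendering of this proposal is sketched below; SignerSketch, its fields, and the helper bodies are illustrative stand-ins rather than the actual stacks-signer types.

type AggregateKey = [u8; 33];

struct SignerSketch {
    approved_aggregate_public_key: Option<AggregateKey>,
    /// Which aggregate key the in-memory state belongs to.
    state_key: Option<AggregateKey>,
    pending_dkg_results: Option<Vec<u8>>,
}

impl SignerSketch {
    /// "On refresh aggregate public key" from the pseudocode above.
    fn on_refresh_approved_aggregate_key(&mut self, new_key: Option<AggregateKey>) {
        if new_key == self.approved_aggregate_public_key {
            return; // key unchanged: no reload, no resend
        }
        self.approved_aggregate_public_key = new_key;

        if let Some(key) = new_key {
            // Reload only when the in-memory state is for a different key.
            if self.state_key != Some(key) {
                self.load_saved_state(key);
            }
            if self.pending_dkg_results.is_some() {
                self.send_dkg_results(&key);
                self.pending_dkg_results = None;
            }
        }
    }

    /// "On DkgEnd received" from the pseudocode above (assuming the results
    /// become pending here, which is an illustrative simplification).
    fn on_dkg_end(&mut self, results: Vec<u8>) {
        self.pending_dkg_results = Some(results);
        self.save_signer_state();
    }

    fn load_saved_state(&mut self, key: AggregateKey) {
        // Placeholder: look up the persisted shares keyed by `key`.
        self.state_key = Some(key);
    }

    fn send_dkg_results(&self, _key: &AggregateKey) {
        // Placeholder: broadcast the DkgResults message.
    }

    fn save_signer_state(&self) {
        // Placeholder: persist the current shares to StackerDB.
    }
}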

@netrome (Contributor, Author) added:

This would avoid excessive state reloads while ensuring that the signer doesn't miss loading the state for a particular aggregate key.

A repository Member replied:

The way we've addressed this in the past (albeit in very different contexts) is to have two structs: a "preload" struct and a "loaded" struct, where the latter is instantiated from the "preload" struct plus some additional state. You'd have an UninitializedSigner struct, which represents a Signer that does not yet have its state loaded. Its impl would have a conversion method like into_signer(state), which fully instantiates the Signer once the loaded state becomes available.

Then, you'd make it impossible to accidentally have a partially-instantiated or corrupted Signer used somewhere. Given the importance of the signer to chain liveness, I think it's worth the effort to make invalid Signer states unrepresentable here.
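A minimal sketch of that two-struct pattern follows; UninitializedSigner and into_signer come from the suggestion above, while LoadedState and InitializedSigner are illustrative names.

/// Stand-in for the decrypted, per-aggregate-key signer state.
struct LoadedState {
    shares: Vec<u8>,
}

/// A signer whose persisted state has not been loaded yet; it exposes no
/// signing methods, so a partially-instantiated signer is unrepresentable.
struct UninitializedSigner {
    signer_id: u32,
}

/// A fully-instantiated signer, built from an UninitializedSigner plus
/// the loaded state.
struct InitializedSigner {
    signer_id: u32,
    state: LoadedState,
}

impl UninitializedSigner {
    /// Consume the preload struct plus the loaded state to build a usable signer.
    fn into_signer(self, state: LoadedState) -> InitializedSigner {
        InitializedSigner {
            signer_id: self.signer_id,
            state,
        }
    }
}

impl InitializedSigner {
    /// Signing is only reachable once the state has been loaded.
    fn sign(&self, _message: &[u8]) {
        // Placeholder: would produce a signature share using self.state.
    }
}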

@netrome (Contributor, Author) replied:

I agree that it makes a lot of sense to differentiate between an UninitializedSigner and a Signer instead of having the current State field within the signer. It's not clear to me, though, what the relation is between an initialized/uninitialized signer and which DKG shares the signer should have loaded. As far as I know, a Signer may still be initialized without having a known approved aggregate key or any DKG shares.

@netrome (Contributor, Author) commented May 14, 2024

Closing this PR since we are going in a different direction with the signer.

@netrome netrome closed this May 14, 2024
@blockstack-devops (Contributor) commented:

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@stacks-network stacks-network locked as resolved and limited conversation to collaborators Oct 31, 2024