Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add optional cluster key rotation #1

Open
costela opened this issue Mar 28, 2019 · 11 comments
Open

add optional cluster key rotation #1

costela opened this issue Mar 28, 2019 · 11 comments
Assignees
Labels
enhancement New feature or request security Impacts operational or mesh security
Milestone

Comments

@costela
Copy link
Owner

costela commented Mar 28, 2019

It would be nice to have the memberlist key be rotated automatically.
This would increase security a bit, at the cost of making ad-hoc joining less practical.
We'd also need a way to get the current key.

@costela costela added the enhancement New feature or request label Mar 28, 2019
@costela costela self-assigned this Mar 28, 2019
@costela costela added the security Impacts operational or mesh security label Jul 14, 2019
@costela costela added this to the v0.3.0 milestone Jul 14, 2019
@ProfFan
Copy link

ProfFan commented Sep 29, 2019

Can we make this key like an OTP? i.e. depend on timestamp

@costela
Copy link
Owner Author

costela commented Sep 30, 2019

Hey @ProfFan, thanks for the suggestion!
I'm reluctant to go this route because I'm not sure how much added security we would get from it. Under "normal" 2FA usage, TOTP is a nice solution because it adds a layer of security and avoids over-the-shoulder secret leaks in its presumably most common use-case: cellphone as TOTP token.
In our case it would not add a layer of security because getting access to the TOTP key would be the same as getting access to the current static key, both in difficulty and effects. There's less of a "over-the-shoulder" security issue and there's no "first-factor".

Am I missing something?

@ProfFan
Copy link

ProfFan commented Oct 1, 2019

Thank you for the reply! One question, is the limiting factor here memberlist? I didn't find a way to "plugin-in" other methods for authentication there.

@costela
Copy link
Owner Author

costela commented Oct 1, 2019

Yes, memberlist has a single way of implementing authorization: using shared-keys. Note that it does not provide authentication: since there is a single key, there is no notion of "identity".

However, this does not mean we cannot extend this scheme in a few ways. It would be possible, for instance, to generate the key from a fixed shared part + a TOTP part.
I just cannot imagine a use-case that would actually profit from this. But maybe I'm just lacking imagination. What's your concrete use-case for it?

@ProfFan
Copy link

ProfFan commented Oct 1, 2019

My take is that in a cloud-computing context we do not actually control the servers (as they are hosted as VMs by cloud providers) so anything we "put" on these are "compromised" to the providers. Same idea applies for IoT devices where they can be accessed by third parties inspecting the device physically. Thus a key unique to the device or VM helps that we can safely revoke the key and the rest of the system stays safe.

@costela
Copy link
Owner Author

costela commented Oct 1, 2019

Ok, I see where you're coming from.
But for this threat model, TOTP is no help at all. There's an inherent compromise here: either the node knows enough to keep the PSK up-to-date (the node must have access to the TOTP key), or you must step in for each re-join or new node, in which case you don't really need TOTP at all and just basic rotation would suffice.

Another possibility would be a CA-like setup, which is an order of magnitude more complex than PSKs and probably out-of-scope for wesher.

@ProfFan
Copy link

ProfFan commented Oct 1, 2019

Agree on your view :) Rotation is enough in this sense.

Thank you for all the effort again!

@kaiyou
Copy link
Contributor

kaiyou commented May 15, 2020

A couple important issues that need be addressed for rotation to work properly (I am probably stating the obvious, but let's do it anyway):

  • specify when a new key is to be generated, is it after a number of memberlist packets, after a given time?
  • specify how we generate a new key, is it at random, is it derived from the previous (probably not)?
  • specify who is responsible for generating the new key, is the responsibility shared, how do we determine who is responsible?
  • specify how the new key is propagated, using memberlist maybe, yet how do we make sure all members got it, do we even know that we know all members?
  • specify when the new key is to be used, and how do we handle transition, can memberlist temporary support two different keys for instance during rotation?

My current take on the matter is that the problem is very hard. Most systems that had to handle key rotation ended up using a central key infrastructure and authentication certificates instead of a shared key, thus shifting the issue to certificate renewal, which is already solved by X.509.

Some systems remain that use psk at scale for authentication, mostly physical access control systems to my knowledge, e.g. the entire Mifare family. They usually handle key rotation by supporting two different keys during rotation time and drop support for the old key when all users have rotated. However, these systems have the huge advantage of separating concerns: the central management unit is responsible for generating and propating the keys while users simply get and store the new key provided to them.

One way to implement it could be to rotate keys at a deterministic time (for instance after a specific memberlist message id) in a deterministic way (for instance derive from the old key and something that is not static, for intsance the hash of the given memberlist message, or hash of the state at that time). However, I fear that given the specifics of gossip protocols we fail at finding a proper deterministic time (some nodes might not receive messages in the same order for instance) or even a deterministic derivation algorithm (some nodes may node view the same state at the same time, gossip only provides eventual consistency).

So, the other way is to elect one or multiple nodes (which ones? based on what criteria?) responsible for generating the key, and maybe reach consensus about the new key, then propagate that new key (possibly using memberlist directly). Then we can use existing primitives in memberlist to add the key as supported for decryption (AddKey), switch to the new key when we feel it is properly propagated (no idea what clues we can use for this), then switch key (UseKey) and drop the old one (RemoveKey).

This still sounds hard and uncertain 😄

@kaiyou
Copy link
Contributor

kaiyou commented May 15, 2020

Here is a concrete idea for implementation. The requirement is that a single node is responsible for triggering a rotation. That responsibility could be static (a configuration flag) or dynamic (find a deterministic feature that makes the node responsible).

  • Every key has an identifier, which is an incremented integer, starting at 0, 0 being the key provided in settings.
  • In its metadata, every node lists the set of key identifiers that it supports for decrypting messages, plus the key identifier that it uses for encrypting messages (the latter for information and debugging mostly, as I fear we are facing a load of bugs with this).
  • The responsible node also exposes in its meta a map of keys, providing the key for every propagated key id.
  • The responsible node will decide when key rotation is to be triggered, possibly based on a timer. Upon rotation, it will increment the key identifier counter, generate a new key and add it to its propagated key map.
  • Every node is then responsible for adding that key to its list of keys and the key identifier to its meta as a supported decryption key.
  • Every node is responsible for using the latest key for encryption and expose this in its meta when every other node has added the corresponding key id to the list of supported decryption key.
  • Every node is also responsible for removing support for decrypting with old keys and stop exposing their ids in meta when all other nodes have announed newer keys as encryption key.
  • The node responsible for generating and propagating keys is responsible for removing key material from its meta when every other node has announced that it was supported as decryption key.

Here is a list of problems that I think will arise:

  • the list of nodes is moving, the cluster might be missing a member for a couple seconds, that will not be aware of key rotation and be kicked out of the cluster
  • some nodes might be slow at receiving new keys or might have issues gossiping out acknowledgement, hence forcing all other nodes to keep supporting old keys (and making the scheme less secure that before)
  • this does not provide any additional security against a rogue node or someone that has the initial key and a network capture of gossip messages (it does not provide pfs), but it might lead to a false sense of security on the matter

In conclusion, I am really wondering if this is worth the pain. Maybe we could stick with a simpler way of changing keys once in a while: updating configurations and restarting wesher. For that specific matter, a couple other changes could improve reliability:

  • make it possible to not take wireguard down when restarting wesher,
  • make it possible to wait for a bit for state to stabilize before reconfiguring wireguard,
  • support accepting multiple keys for memberlist.

Then the matter of renewing a key can be delegated to configuration management tools:

  • run ansible (substitute with your favourite) to add key to the set of supported keys to all members
  • wait for all nodes to get the configuration
  • run again to change the primary (first) key
  • wait for all nodes to get the configuration
  • run one last time to remove the secondary key

@costela
Copy link
Owner Author

costela commented May 15, 2020

Good insights @kaiyou !

Maybe we should first try to enable manual key rotation. This means a lot of the issues you mention don't apply. I was probably being too ambitious with the "automatic" part of the original issue 😉

Giving an operator the possibility to easily see the state of his cluster ("are all nodes I expect to be there really there?") and giving him a command like wesher push-new-key XXX would already be a clear improvement, even if we don't get all the corner cases perfectly solved.

@kaiyou
Copy link
Contributor

kaiyou commented May 15, 2020

That last one is probably a lot simpler to implement. We would still need some ipc, possibly a control socket for wesher or some inotify. But it sounds a lot more reasonable.

My last edit on the previous comment goes in this direction mostly. Your push-new-key subcommand would do about the same, maybe a reload subcommand, simply pushing new configuration to a running instance, would also go a long way for other issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request security Impacts operational or mesh security
Projects
None yet
Development

No branches or pull requests

3 participants