Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable un-attended bootstrapping for pk agent start and pk bootstrap with PK_RECOVERY_CODE and PK_PASSWORD #202

Closed
21 tasks done
CMCDragonkai opened this issue Jul 5, 2021 · 42 comments · Fixed by #283
Closed
21 tasks done
Assignees
Labels
design Requires design enhancement New feature or request epic Big issue with multiple subissues r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management

Comments

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Jul 5, 2021

Block 1 from #194 (comment)

Aims to provide support for a Polykey instance to run unattended (i.e. without user prompt) when both the PK_RECOVERY_CODE and PK_PASSWORD variables are set.

Requirements of this design

  1. User has to supply a root password in order to encrypt a root key that is automatically generated
  2. If the user loses or forgets the root password, there is no way to decrypt the root key, there is no "Forget Password" functionality for decentralised application
  3. Provider an alternative way of recovering the root key
  4. The alternative is to store a BIP39 seed phrase that is used to generate the root key. If the user can supply a BIP39 seed phrase, this can be used to deterministically generate the root key, which doesn't require decrypting the root key and thus unlocking the keynode
  5. In an extreme case, the root key ciphertext file on disk is corrupted, and so decrypting it produces an incorrect root key, the seed phrase represents a backup of the root key
  6. Users must be notified to securely backup/store their root key, offer users to print it out as a "paper key" (like https://www.jabberwocky.com/software/paperkey/, we can even do funny things like QR codes)

Additional context

Note that the core function to generate a root key from a BIP39 seed is already available in our keys/utils.ts, see the function: generateDeterministicKeyPair. However it is not currently used by KeyManager, and not well tested.

Screenshot from Figma GUI design:

image

Specification

  1. Get the KeyManager to use the generateDeterministicKeyPair
  2. Make sure that this function generateDeterministicKeyPair is well tested
  3. Users are presented with a 24 word phrase to store and remember (prompt the user to store/print it)
  4. During the startup of the PK agent, the user has to supply the root password OR the seed phrase
  5. During session unlock, the user can supply the root password OR the seed phrase
  6. When a seed phrase is supplied, the root key is regenerated from the seed phrase, the root key ciphertext is not touched
  7. We must have a root key acceptance condition to know whether the root key has legitmately opened up a vault
  8. If it has, we can save/rewrite the root key as ciphertext over the ciphertext in the node state, which will resolve any root key ciphertext corruption

Regarding the root key acceptance, consider this idea:

  • Create a signed-message
  • Encrypt the message and save the message
  • If you can later decrypt the message, and check the the signature was signed by the root key and the message is what you expect
  • Then the root key is correct!
  • This message will need to be stored in the DB (as this is what is decrypted by the root key)

CLI command variables

Variables for CLI commands can be retrieved from:

  • file parameters
  • environment variables
  • src/config.ts
  • user prompt in STDIN

We require functions to retrieve the variables from each of these locations (created in src/bin/utils).

Variable precedence should be as follows: file parameters > environment variables/src/config.ts > default (STDIN prompt). We can use a sequence of coalescing operations to emulate this: for example, to retrieve the password, password = getFromParams() ?? getFromEnv ?? getFromStdin(). getFromStdin() here is expected to be the default, where we assume that a value will always be provided. If a default cannot exist for a particular variable, then throw an exception if none of the sources of variables can produce one.

See point 1 at #202 (comment) for further information.

Bootstrap process

Some components of the bootstrapping process need to be simplified:

Sub-Issues & Sub-PRs created

#283

checklist

  • Generate the keypair from the BIP menmonic
    • Generate a BIP mnemonic
    • generate keypair from menmonic
  • update keyManager to use new key generation method.
  • add a check that the mnemonic matches the nodeId for the node. if so we update the encrypted rootkey with the new password.
  • fix KeyManager to do the recovery process when a recovery code is provided with a password when node state exists.
  • tests
    • generating a keypair.
    • mnemonic deterministically recreates keypair.
    • can tell if mnemonic matches keypair/nodeId
    • can create new node from scratch, recreates nodeId and keypair.
    • can recover a keynode.
  • env variables implementation
    • bin/utils functions for variable retrieval
      • ~~getFromParams(): get variable from file parameter. Use str.trim() to remove trailing newlines/whitespace from the file (see 2 at Enable un-attended bootstrapping for pk agent start and pk bootstrap with PK_RECOVERY_CODE and PK_PASSWORD #202 (comment) for further information) ~~
      • getFromEnv(): get variable from environment variable
      • getFromConfig(): get variable from src/config.ts
      • getFromStdin(): get variable from user prompt
      • changed how this is handled, there are binUtils.getPassword and binUtils.getRecoveryCode. options for hosts and ports was created in src/bin/options.ts with their own defaults and env.
  • bootstrap process
    • use ENV variables for password and recovery code.
    • simplify bootstrapPolykeyState
    • tests for:
      • concurrent execution of 2x pk bootstrap
      • concurrent execution of 2x pk agent start
      • concurrent execution of pk bootstrap + pk agent start
    • simplify checkKeynodeState
  • implement --fresh flag: forces new keynode state, even if it already exists
    • pk agent start
    • pk bootstrap
@CMCDragonkai CMCDragonkai added enhancement New feature or request design Requires design labels Jul 5, 2021
@CMCDragonkai CMCDragonkai changed the title Using BIP39 to recover the root password Using BIP39 to recover the root key as opposed to the root password Jul 5, 2021
@CMCDragonkai CMCDragonkai changed the title Using BIP39 to recover the root key as opposed to the root password Using BIP39 seed to recover the root key as opposed to the root password Jul 5, 2021
@CMCDragonkai
Copy link
Member Author

The BIP39 spec is here: https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki

The implementation of the function is currently:

async function generateDeterministicKeyPair(
  bits: number,
  seed: string,
): Promise<KeyPair> {
  const prng = random.createInstance();
  prng.seedFileSync = (needed: number) => {
    // using bip39 seed generation parameters
    // no passphrase is considered here
    return pkcs5.pbkdf2(seed, 'mnemonic', 2048, needed, md.sha512.create());
  };
  const generateKeyPair = promisify(pki.rsa.generateKeyPair).bind(pki.rsa);
  return await generateKeyPair({ bits, prng });
}

The seed is actually called the "mnemonic phrase". This is meant to be generated by a bip39 library: https://www.npmjs.com/package/bip39

The "mnemonic phrase" is converted to a binary seed using PBKDF2. This binary seed is what is used to generate the RSA root keypair. As long as the same seed can be generated from the same mnemonic phrase, then the same keypair will come out.

Notice that the salt is currently just mnemonic.

In the BIP39 spec, the user can specify a passphrase to protect the the mnemonic seed. To do so, they must provide a "passphrase" that is appended to the end of 'mnemonic' + passphrase used for the salt. This usecase doesn't make sense for Polykey.

We are already randomly generating the root keypair. Switching over to BIP39 simply means we use:

  1. bip39 library to randomly generate a mnemonic phrase of 12/24 words
  2. generateDeterministicKeyPair to generate the root key pair
  3. The root keypair is saved to the node state, while requesting the user a password to symmetrically encrypt it, this is using keyPairToPemEncrypted, which uses a special encryptRsaPrivateKey function from pki, this has its own spec.
  4. Thus there's no need for a separate passphrase for adding to the salt of bip39.
  5. Finally you have a root password that can decrypt the ciphered root keypair, and if you lose the root keypair or forget your root password, you can recover the plaintext root keypair by using generateDeterministicKeyPair again with your 12/24 word mnemonic phrase.

@CMCDragonkai
Copy link
Member Author

Some options here for the pk bootstrap command:

  • Should PK_PASSWORD which is used to supply password for authentication of PK cli commands be used by pk bootstrap to also automatically use as the password without prompting?
  • If the PK_RECOVERY_CODE is supplied, this forces a deterministic creation of the root key. Then the root key can be protected by PK_PASSWORD if supplied, otherwise there is a prompt for PK_PASSWORD

The semantics of these environment variables would be specific to the pk bootstrap command, and therefore should be centralised there.

To ensure un-attended pk bootstrap and pk agent start all parameters must be supplied either by env variables or by parameters. If supplied with parameters, then the parameters takes over from the env variables.

@tegefaulkes
Copy link
Contributor

SHould the Bip recovery function separate from the KeyManager or a static method? We don't really need to start the KeyManager to do the recovery but we might need use the DB depending how I handle checking if the Mnemonic is for the node.

@CMCDragonkai
Copy link
Member Author

I believe that the deterministic key generation should be done all the time.

That means when calling pk bootstrap, if there's no PK_RECOVERY_CODE available, that means you will be generating one.

When pk bootstrap command finishes, it should have:

  1. generated random recovery code
  2. used recovery code to generate root key
  3. prompted you for the root password to encrypt the root key
  4. present the user the recovery code on the STDOUT

This means recovery code is always generated.

@CMCDragonkai CMCDragonkai changed the title Using BIP39 seed to recover the root key as opposed to the root password Enable un-attended pk agent start and pk bootstrap with PK_RECOVERY_CODE and PK_PASSWORD Nov 1, 2021
@CMCDragonkai
Copy link
Member Author

I've renamed this issue to better represent the intention. The usage of BIP39 is an implementation detail of the PK_RECOVERY_CODE, and we have to make sure this works for both pk agent start and pk bootstrap.

Recent discussions with @tegefaulkes indicates that pk agent start automatically bootstraps as well. So pk bootstrap is only used when you want to bootstrap without starting an agent. But if you are starting an agent you can bootstrap automatically.

That means pk agent start will not bootstrap when it detects the directory is already occupied. But we must deal with the case where the directory is occupied by not a proper node state, any errors during the start will result in failure to start. A fresh pk agent start would delete the directory, but but this must be explicit like we do in the code with the fresh boolean.

Same behaviour then with pk bootstrap too.

@CMCDragonkai
Copy link
Member Author

BIP recovery code can be in keys/utils.ts, in fact I think the impl is already there. There is no state for the recovery code. The recovery code is just an temporary step before generating the root key. Remember you must be able to use bip39 to randomly generate the recovery code. Unless the user were to provide a recovery code to use deterministically. You must verify that it is a valid bip39 recovery code though. See the bip39 libraries to see how to do this.

@CMCDragonkai CMCDragonkai changed the title Enable un-attended pk agent start and pk bootstrap with PK_RECOVERY_CODE and PK_PASSWORD Enable un-attended bootstrapping for pk agent start and pk bootstrap with PK_RECOVERY_CODE and PK_PASSWORD Nov 1, 2021
@CMCDragonkai
Copy link
Member Author

Separated duties regarding port and host bindings to #286. This issue is about "un-attended bootstrapping" of the node state and keys specifically.

@tegefaulkes
Copy link
Contributor

I just tested in we can check that a mnemonic matches generates the correct keypair for the node.
It can be done in two ways.

  1. verify the signature of the nodes Id. this will directly confirm that the public key is correct.
  2. decrypt a message containing the nodeId. this will directly confirm that the private key is correct.

both methods check the generated nodeId matches as well. using method 2 is simpler as we use the proceedure.

  1. generate new rootKeypair from mnemonic provided.
  2. read verification message from somewhere. stored on disk or in db?
  3. decrypt message with new private key.
  4. check decrypted message matches the new NodeId.

After confirming that the keypair generated from the mnemonic is correct for the node we can 'recover' they node by writing a new password with writeRootKeyPair.

@CMCDragonkai
Copy link
Member Author

I think there's 2 usecases of the PK_RECOVERY_CODE atm. The first is to deterministically generate a root keypair during bootstrapping. The second case is to actually recover the root keypair when the root key password is not available.

You're talking about the second case right? That means the PK_RECOVERY_CODE is used to generate a new root key, and the question is how do we know if this key pair is the correct one.

You have a root certificate that you can use to check if your root key pair's public key matches the expected node id. The node id information should be available in the root certificate, and perhaps available in other files too. I remember we're supposed to have a lock file for singleton agent instances.

@joshuakarp can you incorporate this into the spec.

@tegefaulkes
Copy link
Contributor

Yes, I'm working on the 2nd right now.

@joshuakarp
Copy link
Contributor

joshuakarp commented Nov 2, 2021

So as a summation of this:

  • PK_RECOVERY_CODE: the environment variable used to both
    1. Deterministically generate a root keypair during unattended bootstrapping
    2. Recover a root keypair (e.g. a lost root key password)
    • presumably, we'll need another command to invoke a root keypair recovery (e.g. pk agent recover <nodepath> <24 word phrase> <new password>)
  • PK_PASSWORD: the environment variable used for "unlocking" the root keypair such that we authenticate the execution of PK commands

From discussions with @tegefaulkes, it seems like both these environment variables will need to be available during:

  1. pk agent start
  2. pk bootstrap
  3. The start of a session

Does this mean that we need to read the environment variables in multiple places in src/bin?

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Nov 2, 2021 via email

@joshuakarp
Copy link
Contributor

joshuakarp commented Nov 2, 2021

Initial draft of me trying to map out the bootstrap process:

image

http://www.plantuml.com/plantuml/uml/TLJVIzn047xFNp7uOWvAy4r8a2fS2cBLO0HRIf6zxEWjINQMsSdPKFg_DzdxOSxpNevZlX_pVREGIo-AfMkRAFo9ejJLi20IBwnS5aNc1U92UVz6hE75D0XEiMfHaC4nhOn6_NK6iaGBSOz-5W6aRYku04Orw9ZQ8CuYR3n2R6da1U-Rv-AgB9_k7-RWE-SQS-Ytx-YOAlG-GRlwTJgHd_xE1kxK_rYIyd5-VHV8LfkNQFqRfRlQ1JLzZNqfIZ2UROxmdQewzCS_nel_c76aNBsY5IYkvw0gCIlyf8SicTuWPoOEx8VEMcDViu1w8JUji4kuc-ooLakiwXliBMdSppbE2YLBq8QX8YlAM0_rcXe6Q0_ZfpVvpxlxUVwS_xWk7yiisi-QZvs7d67L28Qeqw3ft31MGjm3javw80PDwEsfh2-MlMTdL8AR8uShn4cIV87jMurDd68nOiZ2oT3oLbdrYWn1EGnFuE_MvXqzD4azIkSUY6HW5Sfml8SEmu5kFkOSa7fDXmiOqY4hmQr14Bwn4PpQRlYJxHENHogmpMt5lHFKHxXFeoKK-Qq78M0YARf5UCROTOl6L3DGLWDlbmQ0pKe9Zlf_WyLLqm394XccfztTRDnIei2t4b0RWLRv6k8pEZi_3mdSH9HOqkYJy4c8ZwFaFW7ruYY2NfZwW8oKw1AjxjhcFm00

Discussed this with Roger, and there's some things that will need to change in this process.

  1. Reading from parameter, environment variable, config (between pk bootstrap and bootstrapPolykeyState in diagram)

    • We should remove the nested if conditions when attempting to get a required variable for a CLI command's execution (e.g. the password)
    • This can be replaced by a sequence of coalescing operations: password = getFromParams() ?? getFromEnv ?? getFromStdin()
    • This will set password as the first string it can successfully acquire from these functions
    • Compared to 'monads' in Haskell: if op1 fails, do op2. If op2 fails, do op3.
    • Importantly, we need to make the assumption that there's a final 'default' function that's always guaranteed to provide a value if all other retrievals fail (e.g. getFromStdin, or whatever suits for the variable). Alternatively, if nothing is provided from any parameters, then we should throw an exception from the CLI command itself.
    • For example:
     function getFromParams() {
       return 'params';
     }
     
     function getFromEnv() {
       return undefined;
     }
     
     function getFromConfig() {
       return 'config';
     }
     
     function main() {
       const password = getFromParams() ?? getFromEnv() ?? getFromConfig();
       console.log(password); // 'params'
       // if getFromParams() returned undefined, password = 'config'
     }
    • getFromParams(), getFromEnv(), getFromStdin(), getFromConfig, etc, (or some other better naming convention) should be utility functions defined in bin/utils, and should all return undefined if there's nothing to retrieve from that specific area
    • These should be used for all relevant CLI commands
  2. --password vs --password-file/--recovery-code vs --recovery-code-file flag

    • There's a persistent log stored for CLI commands at the terminal level
    • Thus, in Polykey, we mitigate the chance of a footgun for the user, and only supply the --password-file/--recovery-code-file flag, such that they don't provide their password/recovery code and leave it in a stored plaintext form on the terminal command log
    • It will be important to document the fact that the user can still override this with process redirection at the terminal: pk bootstrap --password-file=<(echo 'abc')
    • A quick note about reading from files too:
      • Note that newlines/whitespace can be appended to the end of a created file
      • We can resolve this simply by doing str.trim() after reading the file
      • This should be applied to all file reads (e.g. --password-file, --recovery-code-file)
      • Note that whitespace is still considered to be a valid character in a password/recovery code, so it's important to restrict where whitespace is stripped
  3. Lockfile (between bootstrapPolykeyState and checkKeynodeState in diagram)

    • There was some discussion of whether we need to the lockfile at all
    • Currently we use both proper-lockfile and fd-lock, and these were aiming to be unified (see Unify File Locking Across Domains #251)
    • The usage of the lockfile in bootstrapPolykeyState can be simplified. Firstly, we must ensure that the lock is acquired and held across the entire lifetime of a pk agent process (i.e. the agent process locks the file at all times).
    • Then, in bootstrapPolykeyState, perform the following process instead:
      1. If the lock file exists:
      2. Attempt to acquire the lock
      3. If we cannot acquire the lock, then something is running. If this is the case, we can simply abort the process. No need to await until the lock is released.
    • i.e. PID checking is not necessary
    • it should be possible to just use fd-lock for this
    • will need to ensure that the lock is created according to specific OS flags, such that there's no truncation etc. Something like fs.promises.open() - see src/sessions/Session.ts for this
    • Consuequently, will need the following tests:
      • concurrent execution of 2x pk bootstrap
      • concurrent execution of 2x pk agent start
      • concurrent execution of pk bootstrap + pk agent start
  4. checkKeynodeState simplification

    • this process can be simplified too
    • if there's any file except the lockfile, then it should throw an error. No need to be fine-grained about the approach and check the expected directories (as it's doing at the moment) such that it can throw a specific error
    • The current way with how it checks the schema can be reserved for some other command (e.g. a status command on the directory structure)
  5. Force bootstrap process: pk bootstrap --force

    • The bootstrap process should offer an optional --force flag to force a fresh bootstrap
    • i.e. completely refresh the state
    • If the agent is already running, it should fail. Otherwise, it should always succeed. Password does not need to be supplied, as it should be seen as a complete wipe (i.e. lost password, etc)

@CMCDragonkai
Copy link
Member Author

Some changes to the lockfile mechanism will occur once I merge in the CLI auto retry. There's some stuff that's important regarding the creation of the the status file. Remember the agent process is meant to hold the lock while it's alive, so pk bootstrap and pk agent start both have to try and acquire the lock when doing what they are doing.

@joshuakarp
Copy link
Contributor

@tegefaulkes I've updated #202 (comment) with comments from discussion with Roger about some other things that should change with the bootstrap process. Please have a read.

@joshuakarp
Copy link
Contributor

Based on #202 (comment) and #202 (comment), this is how I foresee usage of PK_PASSWORD and PK_RECOVERY_CODE:

  1. pk bootstrap: perform a regular bootstrap
    • PK_PASSWORD needs to be supplied (either from file parameter, env variable, or input prompt) such that the generated root keypair is protected with provided password.
    • Fails if agent is already running. If not running, also fails if there exists any state at nodePath (besides a lockfile)
  2. pk agent start: perform a regular start
    • PK_PASSWORD needs to be supplied (either from file parameter, env variable, or input prompt). If the keypair exists, checks if password is correct. Otherwise, generates a new keypair and sets this password to protect the generate keypair.
  3. pk bootstrap/agent start --recovery-code-file='./file' (or from env variable): deterministically generate a root keypair during bootstrapping (use case 1 from Enable un-attended bootstrapping for pk agent start and pk bootstrap with PK_RECOVERY_CODE and PK_PASSWORD #202 (comment))
    • PK_PASSWORD needs to be supplied (either from file parameter, env variable, or input prompt) such that the deterministically generated root keypair is protected with provided password.
    • PK_RECOVERY_CODE supplied
    • Note this is not the recovery function.
    • As such, fails if agent is already running. If not running, also fails if there exists any state at nodePath (besides a lockfile)
  4. Perhaps a separate pk agent recover that performs the recovery operation? It's not technically a bootstrap process, so it would make sense for this to be separate.
    • PK_PASSWORD is supplied (either from file parameter, env variable, or input prompt) but note that this would be a new password to protect the root keypair.
    • PK_RECOVERY_CODE also supplied, to perform the recovery. Fails if incorrect code.
    • Fails if agent is already running. There will probably be state at nodePath, but this is fine - we're effectively recovering the keynode state.

Does this make sense from your perspective @CMCDragonkai?


A question of my own about this:

  • For 3 where we perform the deterministic keypair generation, if PK_RECOVERY_CODE is supplied as an environment variable, and we just do pk bootstrap, is the expectation that this will automatically retrieve the environment variable and use it? Does this then mean that PK_RECOVERY_CODE would need to be removed from the environment variables if we don't intend to do the deterministic keypair generation? i.e. without providing the --recovery-code-file flag, there's no indicator on command execution that we intend to do a deterministic keypair generation.

@tegefaulkes
Copy link
Contributor

The agent recovery CLI command needs to be made but doesn't quite fit this scope. I think it should be made a separate issue. I've already made the recovery function recoverKeyManager in KeyManager. so only the bin command for it needs to be implemented for this.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Nov 3, 2021

The --fresh should only ignore existing state, but not an existing lock.

pk bootstrap --fresh # bootstrap a new node state

# this we can do later
pk recover --recovery-code-file=./rcf --password-file=./pf # just recovers the key (but doesn't start the agent)

# FOR: empty node state

# deterministic node state generation
pk agent start --recovery-code-file=./rcf

# generate key pair, and then set the password
pk agent start --recovery-code-file=./rcf --password-file=./pf

# FOR: non-empty node state

# generate a keypair deterministically, check if the public key is correct, and if so, prompt for password (this is the new password)
pk agent start --recovery-code-file=./rcf

# recover the key pair, if the public key is correct, set the password
pk agent start --recovery-code-file=./rcf --password-file=./pf 

# the --fresh option ignores existing state (but not any existing lock)
pk agent start --recovery-code-file=./rcf --password-file=./pf --fresh

@tegefaulkes
Copy link
Contributor

Ok, so just to confirm, the expected behaviour is that we want to update the password when starting a node with a recovery key, password and existing state?

@CMCDragonkai
Copy link
Member Author

Yes.

@tegefaulkes
Copy link
Contributor

For simplify bootstrapPolykeyState do we really only want to check if an agent is already running? Don't we also want to check if the directory is clean for creating a keynode in? currently we check that non-keynode files and directories exist before creating the keynode state. do we still want to check for that?

I can simplify checkKeynodeState but I think at minimum we want to check for an existing keynode or running agent, other files, and directory is clear.

Lastly, is this something that really needs to be changed at this stage?

@joshuakarp
Copy link
Contributor

That was brought up by @CMCDragonkai. For reference, this is currently the function:

async function checkKeynodeState(nodePath: string): Promise<KeynodeState> {
  try {
    const files = await fs.promises.readdir(nodePath);
    //Checking if directory structure matches keynode structure. Possibly check the private and public key and the level db for keys)
    if (
      files.includes('keys') &&
      files.includes('db') &&
      files.includes('versionFile')
    ) {
      const keysPath = path.join(nodePath, 'keys');
      const keysFiles = await fs.promises.readdir(keysPath);
      if (
        !keysFiles.includes('db.key') ||
        !keysFiles.includes('root_certs') ||
        !keysFiles.includes('root.crt') ||
        !keysFiles.includes('root.key') ||
        !keysFiles.includes('root.pub') ||
        !keysFiles.includes('vault.key')
      ) {
        return 'MALFORMED_KEYNODE';
      }
      return 'KEYNODE_EXISTS'; // Should be a good initilized keynode.
    } else {
      if (files.length !== 0) {
        return 'OTHER_EXISTS'; // Bad structure, either malformed or not a keynode.
      } else {
        return 'EMPTY_DIRECTORY'; // Directy exists, but is empty, can make a keynode.
      }
    }
  } catch (e) {
    if (e.code === 'ENOENT') {
      return 'NO_DIRECTORY'; // The directory does not exist, we can create a bootstrap a keynode.
    } else {
      throw e;
    }
  }
}

@CMCDragonkai can likely weigh in here. For just ensuring that we can deploy, it's not completely necessary to change this when just considering deployment. As far as I can tell, it's just a minor optimisation by simplifying the process.

@CMCDragonkai
Copy link
Member Author

When checking whether an agent is running or not, it's not necessary to check the PID at all. Only that whether we can acquire the lock or not. If you cannot acquire the lock, you just fail the command entirely. This is why I realised that fd-lock can be used and thus unify our locking #251.

@tegefaulkes
Copy link
Contributor

This issue is mostly complete. Everything is done besides the bootstrap task. if its not strictly required then I suggest that we move it to a new issue for later.

@tegefaulkes
Copy link
Contributor

According to this comment #286 (comment) we can handle ENV parsing and defaults with commander. However in the case here we want password file passed via a flag while the password itself passed via an ENV variable. so the solution in the comment is not quite applicable here.

I can update the the --password-file and --recovery-code-file options to streamline and give them a ENV variable. but I'm not sure I can get commander to parse PK_PASSWORD for us without creating a corresponding --password flag.

@joshuakarp
Copy link
Contributor

Is it really necessary to use commander in this case then, given that we don't have the consistency between --password-file and the actual password field stored in PK_PASSWORD? Could we not do something like:

const passwordFilePath = new commander.Option(
  '-pf, --password-file <path>',
  'Password Filepath'
);
// now just use the coalescing:
const password = getFromParams(passwordFilePath) ?? getFromEnv('PK_PASSWORD') ?? getFromStdin();

@CMCDragonkai
Copy link
Member Author

This issue is mostly complete. Everything is done besides the bootstrap task. if its not strictly required then I suggest that we move it to a new issue for later.

Shouldn't the bootstrap share similar behaviour to what agent start is doing?

@CMCDragonkai
Copy link
Member Author

Is it really necessary to use commander in this case then, given that we don't have the consistency between --password-file and the actual password field stored in PK_PASSWORD? Could we not do something like:

const passwordFilePath = new commander.Option(
  '-pf, --password-file <path>',
  'Password Filepath'
);
// now just use the coalescing:
const password = getFromParams(passwordFilePath) ?? getFromEnv('PK_PASSWORD') ?? getFromStdin();

Oh yea, that's true. The PK_PASSWORD is not the same as --password-file. So therefore the .env should only be used for options that are the same.

@tegefaulkes
Copy link
Contributor

Ok, then there is not much to change for that. I can update the --password-file option to parse the filepath and read the file for us and give it it's own ENV variable. Same for the --recovery-code-file.

@CMCDragonkai
Copy link
Member Author

There's no PK_PASSWORD_FILE env variable only PK_PASSWORD which should the raw password.

@joshuakarp
Copy link
Contributor

Simplifying bootstrapPolykeyState - refactoring Lockfile

The Lockfile class should be refactored, such that it copies the structure from Session.ts. We can instead refer to this new class as a Status class.

Currently, the bootstrapping process uses checkAgentRunning. This shouldn't be necessary anymore. We just need to:

  1. Create the lock file if it doesn't exist
  2. Attempt to acquire it

There's no need to acquire the PID anymore, for both pk agent start and pk bootstrap. It's sufficient to simply acquire the lock on the file descriptor. If we do acquire it, we can proceed. If we can't, then we have to cancel the command. Even with the fresh flag, if an agent process is still running, we fail this process. The lifetime of the agent process should hold the lock. Only after stopping the agent process do we release the lock.

Exceptions

  • ErrorStatusLocked: status file already locked - should return the data of the status file

A basic prototype of this, using similar structure from Session.ts:

import lock from 'fd-lock';

@CreateDestroyStartStop
class Status {

  static async createStatus({ 
    // same as Session.ts:
    // logger
    // statusPath
  }

  public async start() {
    // Unlike Session.ts, we need to make sure we lock the file here!
    // So just call writeStatus() here
  }
  public async stop() {
    // Release the lock here - the only place it should be released
  }
  public async destroy() {
    // destroy the entire file
  }
  
  public async readStatus() {
    // read necessary JSON from the file
    // PID, node ID, etc
  }

  public async writeStatus() {
    // basically identical to Session.writeToken
    // on fs.promises.open, we need the following flags though:
    // - fs.constants.O_RDWR (open for both read and write, instead of write-only)
    // - fs.constants.O_CREAT (create if doesn't exist)

    // Don't need to wait on sleep until we acquire lock - simply exit.
    if (!lock(status.fd)) {
      // send the contents of the file back to STDERR through exception contents, so can see what's holding it
      throw StatusLocked(...);
    }

  }
}

Tests

  • all the tests from tests/sessions/Session.test.ts are expected to be adjusted to be used for this
  • also add tests for checking the expected locking when calling:
    • concurrent execution of 2x pk bootstrap
    • concurrent execution of 2x pk agent start
    • concurrent execution of pk bootstrap + pk agent start

@tegefaulkes
Copy link
Contributor

Shouldn't we be holding the lock just before destroying the file? I can see a situation here were we call stop(), something else gets the lock and then we call destroy() removing the file.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Nov 4, 2021

If you stop, it means you're stopping the Status. The semantics of that means the status is not held anymore.

So the real question is whether Status should be an encapsulated dependency of PolykeyAgent?

If so, then that means the lifetime of PolykeyAgent matches the lifetime of Status.

Now for pk agent start both are going to want to acquire the status to start PolykeyAgent.

However they may want to start the status before they are starting PolykeyAgent. Unless they rely on the pkAgent.start() to fail and tell them an exception for not being allowed to start the agent in the case of the lock being already held.

But if you are doing pk bootstrap, you're not intending to start the PolykeyAgent at all. In that case you're using the Status directly there in the bootstrap action handler.


If you do need for some reason to use the Status external to the PolykeyAgent and then want to pass it onto the PolykeyAgent, then you can DI it. In that case, it overrides the default in PolykeyAgent.

And remember, the CreateDestroyStartStop will automatically make repeated start a noop. This means it is "idempotent".

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Nov 4, 2021

I was thinking that Status might need to be CreateDestroy instead of CreateDestroyStartStop.

The reason is because you want to ensure that the status file is destroyed when it's not live.

As in there's no reason to leave the file around. When the agent is no longer running. It's not stateful pass process lifecycle like Session.

Note that when reading the status, you may get no information. That might mean it's not "ready" yet, which means it's just waiting to launch the server. This is because the status file is created first, and then filled with information once the ports are all bound to.

That means your read function should "wait" on actual data.

Another problem, the intermediate time between creating a new file and locking it:

So imagine:

fd = open() // assume that the status didn't exist so this creates file
// AT THIS POINT it is possible another process acquires the lock
lock(fd) // locks the fd

If somehow another process acquired the lock, then you'll have to fail here.

For other process, the the status file will be empty at this point. This could mean you have to fail at this point cause it was in the middle of another process launching.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Nov 4, 2021

A prototype pseudocode:

@CreateDestroy
class Status {

static async createStatus(statusPath) {
  // create if not exist
  const fd = open(statusPath, this.fs.constants.O_RDWR | this.fs.constants.O_CREAT);
  // at this point it may be locked by another process
  const locked = lock(fd);
  if (!locked) {
    throw new ErrorStatusLocked();
  }
  fd.write({ type: 'starting' });
  const status = new Status(statsuPath, statusFd);
  return status;
}

constructor(statusPath, statusFd) {
  this.statusPath = statusPath;
  // assume this is locked already
  this.statusFd = statusFd;
}

async destroy () {
  unlock(this.statusFd);
  this.fs.rm(this.statusPath);
}

// remember this may not be "ready"
// because the PolykeyAgent has to call `writeStatus` to write the actual status information
async readStatus () {

}

async writeStatus () {

}

}

Now all agents will do: this.status = status ?? await Status.createStatus().

If it fails, then an exception is thrown there.

If there are 2 processes both doing this, then only 1 winner. Whichever gets to lock it. The other fails, and throws that the status is already locked.

@CMCDragonkai
Copy link
Member Author

Possible situations:

  • pk bootstrap & pk agent start & wait - simultaneous start and bootstrap, only 1 winner
  • pk agent start & pk agent start & wait - simultaneous start, only 1 winner

@CMCDragonkai
Copy link
Member Author

One of the problem is that Clients need to read the status information. And right now readStatus would only work if the Status object is created. And that's not going to work for clients.

I don't think the PolykeyClient should be holding a Status instance. It should just receive parameters to connect from DI directly.

Reading the status might need to be a utility function instead. It shouldn't be trying to acquire a lock cause we don't have readlocks atm. It might work as long as the status is well formed, and if it is not well formed, then it should be discarded and converted to be undefined.

Anyway, the reading of the status and writing of the status is interdependent on the locking of the status. Need a design that can handle all of these cases.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Nov 4, 2021

Ok better prototype:

import type { FileSystem } from './src/types';

import { CustomError } from 'ts-custom-error';
import Logger from '@matrixai/logger';
import { StartStop, ready } from '@matrixai/async-init/dist/StartStop';
import lock from 'fd-lock';

// pseudocode
class ErrorStatus extends CustomError { }

class ErrorStatusLocked extends ErrorStatus { }

class ErrorStatusParse extends ErrorStatus { }

/**
 * When starting, you have to "wait".
 * When stopping, there's no point in connecting.
 */
type StatusInfo = {
  type: 'starting';
} | {
  type: 'stopping';
} | {
  type: 'running';
  clientHost: string;
  clientPort: number;
};

interface Status extends StartStop {}
@StartStop()
class Status {

  public readonly statusPath: string;
  protected fs: FileSystem;
  protected logger: Logger;
  protected statusFd: number;

  constructor ({
    statusPath,
    fs = require('fd'),
    logger = new Logger(Status.name),
  }: {
    statusPath: string;
    fs: FileSystem;
    logger: Logger;
  }) {
    this.logger = logger;
    this.statusPath = statusPath;
    this.fs = fs;
  }

  public async start() {
    this.logger.info('Starting Status');
    const statusFile = await this.fs.promises.open(
      this.statusPath,
      this.fs.constants.O_RDWR | this.fs.constants.O_CREAT
    );
    if (!lock(statusFile.fd)) {
      throw new ErrorStatusLocked;
    }
    this.statusFd = statusFile.fd;
    this.logger.info('Started Status');
  }

  public async stop() {
    this.logger.info('Stopping Status');
    lock.unlock(this.statusFd);
    await this.fs.promises.rm(this.statusPath,
      { force: true }
    );
    this.logger.info('Stopped Status');
  }

  /**
   * Can read status without locking
   */
  public async readStatus(): Promise<StatusInfo | undefined> {
    let statusData;
    try {
      statusData = await this.fs.promises.readFile(this.statusPath, 'utf-8');
    } catch (e) {
      if (e.code === 'ENOENT') {
        return;
      }
      // do we expect other filesystem errors?
      // like permission errors?
      // io errors?
      throw e;
    }
    if (statusData === '') {
      return;
    }
    let statusInfo;
    try {
      statusInfo = JSON.parse(statusData);
    } catch(e) {
      throw new ErrorStatusParse;
    }
    // TODO: PARSE the statusInfo, use json schema
    return statusInfo;
  }

  /**
   * Can only write status when it is locked
   */
  @ready()
  public async writeStatus(statusInfo: StatusInfo): Promise<void> {
    // do your write
  }
}

So now it's possible:

const status = new Status(...);
await status.readStatus();

Or

const status = new Status(...);
await status.start();
await status.writeStatus(...);
await status.stop();

We may want to have exceptions for dealing with reads and writes. Exceptions could be more general like ErrorStatusRead and ErrorStatusWrite. We do something like this in keys domain already. The writeStatus() call would occur during the start method of PolykeyAgent. As in when we start, we do status.start() first, but then once all other things are started, we use status.writeStatus().

@CMCDragonkai
Copy link
Member Author

There are 4 places where Status would be used:

  1. PolykeyAgent - this has to construct, and start in the beginning in createPolykeyAgent, propagate in start, write the various status infos during start via writeStatus, and then propagate stop.
  2. PolykeyClient - this just constructs and performs readStatus, has to deal with the 3 different status possibilities, starting, stopping, running, sleep-loop can be used for waiting from starting to running. Make sure that the clientHost and clientPort is reconciled with the --client-host and --client-port.
  3. pk bootstrap will try to acquire the lock independently and perform bootstrapping.
  4. pk agent start can call PolykeyAgent directly and rely on its usage of Status to know whether it can start or not.

@CMCDragonkai
Copy link
Member Author

@tegefaulkes just to confirm the recovery code you're using is a 12 or 24 word recovery key?

Is it 12 or 24?

If you haven't chosen yet, I suggest parameterising it, and then defaulting it to 24. You can use a literal type: a: 12 | 24.

@tegefaulkes
Copy link
Contributor

It's 24 and easy to change.

@CMCDragonkai
Copy link
Member Author

The 15s time of pbkdf2 means we should move towards webcrypto and webassembly asap. But we will do this after teh testnet deployment. This is tracked in #270.

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management label Jul 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design Requires design enhancement New feature or request epic Big issue with multiple subissues r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management
Development

Successfully merging a pull request may close this issue.

3 participants