
Vault Fails to Start with Large Number of Tokens #3772

Closed · miketonks opened this issue Jan 11, 2018 · 5 comments

@miketonks

Environment:

  • Vault Version: Vault v0.9.1 ('87b6919dea55da61d7cd444b2442cabb8ede8ab1')
  • Operating System/Architecture: Ubuntu 16.04

Note that we are running inside Docker; both the host OS and the Docker base image are Ubuntu 16.04.

This also happens with Vault v0.8.3.

Vault Config File:

backend "etcd" {
  address = "http://192.168.42.24:4001"
  sync = "false"
  path = "vault"
  redirect_addr = "https://services1.vault.dev:7668"
  cluster_addr = "https://services1.vault.dev:7670"
  ha_enabled = "false"
}

listener "tcp" {
  address = "0.0.0.0:7667"
  cluster_address = "192.168.42.24:7669"
  tls_disable = 1
}

listener "tcp" {
  address = "0.0.0.0:7668"
  cluster_address = "192.168.42.24:7670"
  tls_cert_file = "/usr/local/****/ssl/vault.pem"
  tls_key_file = "/usr/local/****/ssl/vault.key"
}

Startup Log Output:

Running /docker-entrypoint.sh ...
==> Configuring backend address = http://192.168.42.24:4001
==> Configuring backend redirect_address = https://services1.vault.dev:7668
==> Configuring backend cluster_address = https://services1.vault.dev:7670
==> Configuring backend sync = false
==> Configuring backend ha_enabled = false
==> Configuring http.address = 0.0.0.0:7667
==> Configuring http.cluster_address = 192.168.42.24:7669
==> Configuring https.address = 0.0.0.0:7668
==> Configuring https.cluster_address = 192.168.42.24:7670
==> Vault server configuration:

                     Cgo: disabled
         Cluster Address: https://services1.vault.dev:7670
              Listener 1: tcp (addr: "0.0.0.0:7667", cluster address: "192.168.42.24:7669", tls: "disabled")
              Listener 2: tcp (addr: "0.0.0.0:7668", cluster address: "192.168.42.24:7670", tls: "enabled")
               Log Level: 
                   Mlock: supported: true, enabled: true
        Redirect Address: https://services1.vault.dev:7668
                 Storage: etcd (HA disabled)
                 Version: Vault v0.9.1
             Version Sha: 87b6919dea55da61d7cd444b2442cabb8ede8ab1

==> Vault server started! Log data will stream in below:

2018/01/11 14:01:38.905403 [INFO ] core: vault is unsealed
2018/01/11 14:01:38.906109 [INFO ] core: post-unseal setup starting
2018/01/11 14:01:38.906608 [INFO ] core: loaded wrapping token key
2018/01/11 14:01:38.906643 [INFO ] core: successfully setup plugin catalog: plugin-directory=
2018/01/11 14:01:38.913976 [INFO ] core: successfully mounted backend: type=kv path=secret/
2018/01/11 14:01:38.914289 [INFO ] core: successfully mounted backend: type=system path=sys/
2018/01/11 14:01:38.914600 [INFO ] core: successfully mounted backend: type=pki path=pki/
2018/01/11 14:01:38.914809 [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2018/01/11 14:01:38.915133 [INFO ] core: successfully mounted backend: type=identity path=identity/
2018/01/11 14:01:38 [ERR] Error flushing to statsd! Err: write udp 127.0.0.1:39430->127.0.0.1:8125: write: connection refused
2018/01/11 14:01:38.920552 [INFO ] expiration: restoring leases
2018/01/11 14:01:38.920701 [INFO ] rollback: starting rollback manager
2018/01/11 14:01:38.924968 [INFO ] identity: entities restored
2018/01/11 14:01:38.925358 [INFO ] identity: groups restored
2018/01/11 14:01:38.925661 [INFO ] core: post-unseal setup complete
2018/01/11 14:01:38.954988 [ERROR] expiration: error restoring leases: error=failed to scan for leases: list failed at path '': rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5582564 vs. 4194304)
2018/01/11 14:01:38.955006 [ERROR] expiration: shutting down
2018/01/11 14:01:38.955014 [INFO ] core: pre-seal teardown starting
2018/01/11 14:01:38.955038 [INFO ] core: cluster listeners not running
2018/01/11 14:01:38.955087 [INFO ] rollback: stopping rollback manager
2018/01/11 14:01:38.955139 [INFO ] core: pre-seal teardown complete
2018/01/11 14:01:38.955182 [INFO ] core: vault is sealed

Expected Behavior:

After creating 8000 tokens, Vault should restart normally.

Actual Behavior:

After creating 8000 tokens, Vault cannot be restarted: it unseals, fails to restore leases, and immediately seals itself again (see the log above).

Steps to Reproduce:

Create 8000 tokens:

VAULT_ADDR=https://node1.vault.dev:7668
VAULT_TOKEN=$(cat vault_root_token)
for i in {1..8000}; do
  curl -sXPOST "$VAULT_ADDR/v1/auth/token/create" -H "X-Vault-Token: $VAULT_TOKEN"
done

Then restart the Vault server.

Important Factoids:

A large number of tokens (roughly 6000–8000 or more) with an etcd backend causes Vault to fail to start, because the lease listing returned by etcd exceeds gRPC's default 4 MiB receive limit.

This does not prevent Vault from working once it is running and unsealed: the token count can grow past the limit with no visible effect while Vault stays up. But after a restart, Vault will not stay unsealed unless:

a) the 4 MiB limit is increased by editing the vendored Vault dependencies (specifically vendor/google.golang.org/grpc/clientconn.go L96) and rebuilding, or

b) etcd is wiped and/or the tokens are deleted from etcd.

The token counts we're seeing may be caused by something we're doing wrong in how we use Vault, but even so I'd hope for graceful degradation rather than a hard stop once the listing crosses 4 MiB. Perhaps the lease scan should be paginated, or something similar?
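
To make the pagination idea concrete, here is a rough sketch of a paged key scan against etcd3 using clientv3 range options. The lease prefix and page size are illustrative assumptions, not Vault's actual storage layout:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://192.168.42.24:4001"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	const pageSize = 1000      // illustrative; keeps each response well under 4 MiB
	key := "vault/sys/expire/" // hypothetical lease prefix
	end := clientv3.GetPrefixRangeEnd(key)
	for {
		resp, err := cli.Get(context.Background(), key,
			clientv3.WithRange(end),      // stay within the original prefix
			clientv3.WithLimit(pageSize), // cap the size of each gRPC response
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend))
		if err != nil {
			log.Fatal(err)
		}
		for _, kv := range resp.Kvs {
			fmt.Println(string(kv.Key))
		}
		if !resp.More || len(resp.Kvs) == 0 {
			break
		}
		// resume just past the last key returned on this page
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}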

The critical error is:

[ERROR] expiration: error restoring leases: error=failed to scan for leases: list failed at path '': rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5582564 vs. 4194304)

References:

n/a

@jefferai (Member)

We can't paginate in general, because pagination is not supported by all storage backends; it would have to be something particular to the etcd backend. Ping @xiang90!

@benpaxton-hf (Contributor)

Options for this on the etcd client side were added in etcd-io/etcd#9047. Vault's etcd backend code needs to use them to increase the max receive size, possibly based on config, environment, etc.?
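
For reference, a minimal sketch of how a caller could use the new client-side option once the vendored client is new enough to expose it (endpoint reused from the report; treating math.MaxInt32 as "unlimited" is an assumption, not settled behavior):

package main

import (
	"log"
	"math"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// MaxCallRecvMsgSize was added to clientv3.Config in etcd-io/etcd#9047;
	// older vendored clients will not have this field.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:          []string{"http://192.168.42.24:4001"}, // backend address from the report
		DialTimeout:        5 * time.Second,
		MaxCallRecvMsgSize: math.MaxInt32, // effectively "unlimited"
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	// The backend would then hand cli to its Get/Put/List paths as before.
}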

@xiang90 (Contributor) commented Jan 15, 2018

@benpaxton-hf

Yeah, we should bump the etcd client and set the response size to unlimited by default. We could potentially add an option to change the size.

Would you like to help with that?

@benpaxton-hf (Contributor)

I don't know how to bump the etcd client version that's used (how does Vault's vendoring work?), but I think the following should work if the client is updated (it also assumes strconv is imported in etcd3.go):

diff --git a/physical/etcd/etcd3.go b/physical/etcd/etcd3.go
index 04944e59..03af89dc 100644
--- a/physical/etcd/etcd3.go
+++ b/physical/etcd/etcd3.go
@@ -108,6 +108,15 @@ func newEtcd3Backend(conf map[string]string, logger log.Logger) (physical.Backen
                cfg.Password = password
        }
 
+       if maxReceive, ok := conf["max_receive_size"]; ok {
+               // grpc converts this to uint32 internally, so parse as that to avoid passing invalid values
+               val, err := strconv.ParseUint(maxReceive, 10, 32)
+               if err != nil {
+                       return nil, fmt.Errorf("value [%v] of 'max_receive_size' could not be understood", maxReceive)
+               }
+               cfg.MaxCallRecvMsgSize = int(val)
+       }
+
        etcd, err := clientv3.New(cfg)
        if err != nil {
                return nil, err
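
If a change along these lines is merged, the etcd stanza from the original report could then opt in directly. The max_receive_size key below is the one proposed in the diff above, i.e. hypothetical until the patch lands:

backend "etcd" {
  address          = "http://192.168.42.24:4001"
  path             = "vault"
  max_receive_size = "8388608"  # 8 MiB; the failing response above was 5582564 bytes
}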

@jefferai (Member)

Vault uses govendor.
