
Decouple credentials fetch from actual requests #203

Closed
stszap opened this issue May 29, 2019 · 9 comments

@stszap
Contributor

stszap commented May 29, 2019

Is your feature request related to a problem? Please describe.

First of all, thanks for the big update in 0.16, it made our setup much simpler. Our issue is that the timeout for the credentials fetcher is now fixed (we use Auth0 with the jwt authenticator). Unfortunately, requests to jwks.json are very slow (1-1.5s from the countries where our developers reside) and occasionally we get a 403 from oathkeeper, but the next request succeeds.

The actual scenario is (I believe):

  1. oathkeeper starts with empty caches and no open connections
  2. a client makes a request to an endpoint with the jwt authenticator
  3. oathkeeper makes a request to the JWKS URL (synchronously with the client's request)
  4. 1s passes and the context timeout fires; the client gets {"error":{"code":403,"status":"Forbidden","reason":"An internal server error occurred, please contact the system administrator","message":"Access credentials are not sufficient to access this resource"}}
  5. oathkeeper actually finishes the request, caches the key set, and even keeps the TCP connection to the JWKS server open
  6. the client makes another request
  7. oathkeeper gets the key from the cache and proxies the request without an error
  8. 30s pass and the client makes another request
  9. the cache for the JWKS has expired and oathkeeper makes another request, but this time the connection is still open, so there is no TCP handshake and the request takes less than 1s
  10. the client's request gets proxied without an error, but with some added latency
  11. if there are no requests for a while, oathkeeper closes the TCP connection and the whole process repeats

Describe the solution you'd like

An ideal solution would be some kind of background function that fetches the JWKS periodically, so that actual client requests always use the cache (even if it is stale); a rough sketch follows the list below. The benefits would be:

  1. no added latency, because no additional synchronous requests are made
  2. higher error tolerance, because even if the JWKS server is unavailable the cache is still used, and when the server comes back up the cache is updated automatically
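
A minimal sketch of that idea in Go (illustration only, not oathkeeper code; the cache type and the fetch callback are made up for this example): a goroutine refreshes a shared key-set cache on a ticker, and request handlers only ever read the cache, even when the last refresh failed.

package jwksrefresh

import (
    "log"
    "sync"
    "time"
)

// keyCache holds the last successfully fetched key set. Request handlers only
// ever read from it; only the background refresher talks to the remote JWKS URL.
type keyCache struct {
    mu   sync.RWMutex
    keys []byte // raw jwks.json payload; a parsed key set in a real implementation
}

func (c *keyCache) get() []byte {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.keys
}

func (c *keyCache) set(keys []byte) {
    c.mu.Lock()
    c.keys = keys
    c.mu.Unlock()
}

// refreshLoop fetches the JWKS on every tick and keeps the previous value on
// failure, so clients always see the last known-good (possibly stale) keys.
func refreshLoop(c *keyCache, fetch func() ([]byte, error), interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        keys, err := fetch()
        if err != nil {
            log.Printf("jwks refresh failed, keeping stale keys: %v", err)
            continue
        }
        c.set(keys)
    }
}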

Describe alternatives you've considered

I understand that the ideal solution might be difficult to implement, so I see two alternatives:

  • make the timeout configurable
  • use something outside of oathkeeper (e.g. set up an external process that saves the JWKS to the file system and use "file://" in oathkeeper)

Additional context

If I understand correctly, the 1s limit is set here:

r.credentialsFetcher = credentials.NewFetcherDefault(r.Logger(), time.Second, time.Second*30)
and then used here:
ctx, cancel := context.WithTimeout(ctx, s.cancelAfter)
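
For the "make the timeout configurable" alternative, the change would roughly amount to feeding those two hardcoded durations from configuration instead. This is only a sketch; the option names in the comments are hypothetical, not existing oathkeeper settings:

// Hypothetical: both durations come from configuration instead of being hardcoded.
fetchTimeout := 2 * time.Second // e.g. a new "jwks_fetch_timeout" option
cacheTTL := 5 * time.Minute     // e.g. a new "jwks_cache_ttl" option
r.credentialsFetcher = credentials.NewFetcherDefault(r.Logger(), fetchTimeout, cacheTTL)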

@aeneasr
Member

aeneasr commented May 30, 2019

Thank you for the detailed issue description! Your analysis is on point. The logic was implemented this way for a couple of reasons:

  1. We did not want to preemptively fetch keys because they may be specified (later on) in access rules themselves, which makes it somewhat difficult/clunky to have an event loop that checks every 30s or so.
  2. We did not want slow-running requests to backends to clog up the system. Instead, we thought it would be best if the request is denied (maybe 403 isn't the best idea - maybe 503 with Retry-After?) but Oathkeeper still tries to resolve the key in the background so that, if it eventually resolves, the key is available in the cache and subsequent requests pass.

What's obviously less than ideal is that the cache has such a short TTL. I would expect that even if a cache item's EOL is reached, the cached item would be preferred over having no key at all (assuming the request to the remote fails with a timeout).

In general I think that it is a sensible request to make the fetching of credentials configurable wrt timeouts. Every system has different tolerances regarding latency and if - for example - you decide that a 2s latency is tolerable, then that should be supported by the software. The same would apply the other way around where the latency must be much smaller (e.g. < 50ms). It also makes sense to configure the TTL of the cache somewhere.

As a workaround for now, yes, the file-based approach is probably the best and is easy to implement. Simply fetch the keys before running docker build, or use an alpine image where you run curl https://path/to/keys.json, and then use that locally available file to pull the keys from in oathkeeper (with file://).

In any case, thank you for the write up and making the context so clear. I think that we'll find a suitable solution to this issue quickly!

@swapnilgawade16

Just wanted to check if there was any progress on this issue?

@aeneasr
Member

aeneasr commented Jan 3, 2020

Not from our side but we appreciate - as always - contributions in this area :)

@stszap
Contributor Author

stszap commented Sep 12, 2020

In the end, our company decided that it would be easier to just implement the file:// hack. We use k8s and run oathkeeper (version 0.16.0) in the same pod as the application. Here is the relevant YAML (it is not functional by itself, to keep it small); I hope it can help someone else:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: application
  labels:
    k8s-app: application
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: application
  template:
    metadata:
      labels:
        k8s-app: application
    spec:
      volumes:
      - name: oathkeeper-jwks
        emptyDir: {}
      containers:
      ...
      - name: oathkeeper-jwks-updater
        image: <docker image with curl> # we use our own image with curl, but it doesn't really matter
        workingDir: /etc/oathkeeper/jwks/
        command: ["sh"]
        args: ["-c", "while :; do curl -f \"$JWKS_URL\" -o jwks.json; sleep 30; done"] //-f should prevent curl from overwriting the file if there was an error during the request
        volumeMounts:
        - name: oathkeeper-jwks
          mountPath: /etc/oathkeeper/jwks/
        env:
        - name: JWKS_URL
          value: "<your jwks url>"
      - name: oathkeeper
        image: oryd/oathkeeper:v0.16.0
        args: ["serve"]
        volumeMounts:
        - name: oathkeeper-jwks
          mountPath: /etc/oathkeeper/jwks
        env:
        - name: AUTHENTICATORS_JWT_ENABLED
          value: "true"
        - name: AUTHENTICATORS_JWT_SCOPE_STRATEGY
          value: "wildcard"
        - name: AUTHENTICATORS_JWT_JWKS_URLS
          value: "file://etc/oathkeeper/jwks/jwks.json"
        ...

Here we use another "oathkeeper-jwks-updater" container just to update the jwks file in a shared "emptyDir" volume. This way the file is updated every 30s and is also cached by oathkeeper for another 30s, so it can take from 30s to 60s for an update to take effect. We have been using this solution in prod for more than a week and haven't found any problems with it.

@aeneasr
Member

aeneasr commented Sep 16, 2020

Awesome, thank you for sharing your knowledge!

@eroznik
Contributor

eroznik commented May 3, 2021

@aeneasr do you have any preferences on what might be done with regard to this? I am thinking about picking up this issue. A few suggestions (a rough sketch of the first one follows the list):

  • async reload on demand
    • each request would validate the JWKS, same as now
      • if the keys are expired: a new goroutine is spawned to reload the JWK, and the current request uses the old/stale JWKS
      • if no JWK is available (e.g. cold start), the JWK fetch would block until the keys are fetched (current behavior)
  • configurable TTL
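
A rough sketch of the "async reload on demand" idea (illustration only, not oathkeeper's actual fetcher; the types and the fetch callback are made up): a cold start blocks, an expired cache is served stale while a single goroutine refreshes it in the background, and a fresh cache is returned immediately.

package jwksreload

import (
    "fmt"
    "sync"
    "time"
)

// cachedKeys is a stand-in for a parsed JWKS plus its expiry.
type cachedKeys struct {
    keys      []byte
    expiresAt time.Time
}

type onDemandFetcher struct {
    mu         sync.Mutex
    cached     *cachedKeys
    refreshing bool
    fetch      func() ([]byte, error) // performs the actual HTTP call to the JWKS URL
    ttl        time.Duration
}

// Keys blocks on a cold start (current behavior), serves stale keys while one
// background goroutine refreshes an expired cache, and returns a fresh cache
// immediately.
func (f *onDemandFetcher) Keys() ([]byte, error) {
    f.mu.Lock()
    defer f.mu.Unlock()

    if f.cached == nil {
        // Cold start: nothing to fall back to, so fetch synchronously.
        keys, err := f.fetch()
        if err != nil {
            return nil, fmt.Errorf("no keys available: %w", err)
        }
        f.cached = &cachedKeys{keys: keys, expiresAt: time.Now().Add(f.ttl)}
        return keys, nil
    }

    if time.Now().After(f.cached.expiresAt) && !f.refreshing {
        // Expired: kick off one background refresh and serve the stale keys for now.
        f.refreshing = true
        go func() {
            keys, err := f.fetch()
            f.mu.Lock()
            defer f.mu.Unlock()
            f.refreshing = false
            if err == nil {
                f.cached = &cachedKeys{keys: keys, expiresAt: time.Now().Add(f.ttl)}
            }
        }()
    }
    return f.cached.keys, nil
}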

@aeneasr
Member

aeneasr commented May 8, 2021

SGTM - I think we can also fetch the JWK and if it is fast enough use the new keys, if it's too slow use the old ones?
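
That could roughly look like this (only a sketch, with made-up names): run the fetch in a goroutine and wait a bounded amount of time for it; if it is too slow, fall back to the old keys, while the fetch still completes in the background and updates the cache for subsequent requests.

package jwksdeadline

import "time"

// keysWithDeadline returns freshly fetched keys if the fetch finishes within
// wait, and the previously cached keys otherwise. The fetch keeps running in
// the background either way, so a slow response still lands in the cache (via
// store) for the next request.
func keysWithDeadline(old []byte, fetch func() ([]byte, error), store func([]byte), wait time.Duration) []byte {
    result := make(chan []byte, 1) // buffered so the goroutine never blocks after a timeout

    go func() {
        keys, err := fetch()
        if err != nil {
            return
        }
        store(keys) // update the shared cache regardless of whether anyone is still waiting
        result <- keys
    }()

    select {
    case keys := <-result:
        return keys // fetch was fast enough: use the new keys
    case <-time.After(wait):
        return old // too slow: fall back to the old ones
    }
}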

@eroznik
Contributor

eroznik commented May 8, 2021

That is certainly worth considering. I'll prepare a PR with the approach @aeneasr recommended, using a timer & goroutine.. we'll see how that goes :)

@aeneasr
Member

aeneasr commented May 11, 2021

Wohoo, nice! :)
