
Decouple credentials fetch from actual requests #203

Closed
stszap opened this issue May 29, 2019 · 9 comments

@stszap
Contributor

stszap commented May 29, 2019

Is your feature request related to a problem? Please describe.

First of all, thanks for the big update in 0.16, it made our setup much simpler. Our issue is that the timeout for the credentials fetcher is now fixed (we use Auth0 with the jwt authenticator). Unfortunately, requests to jwks.json are very slow (1-1.5s from the countries where our developers reside) and occasionally we get a 403 from oathkeeper, but the next request succeeds.

The actual scenario is (I believe):

  1. oathkeeper starts with empty caches and no open connections
  2. a client makes a request to an endpoint with the jwt authenticator
  3. oathkeeper makes a request to the JWKS URL (synchronously with the client's request)
  4. 1s passes and the context timeout fires; the client gets {"error":{"code":403,"status":"Forbidden","reason":"An internal server error occurred, please contact the system administrator","message":"Access credentials are not sufficient to access this resource"}}
  5. oathkeeper actually finishes the request, caches the key set, and even keeps the TCP connection to the JWKS server open
  6. the client makes another request
  7. oathkeeper gets the key from the cache and proxies the request without an error
  8. 30s pass and the client makes another request
  9. the cache for the JWKS has expired and oathkeeper makes another request, but this time the connection is still open, so there is no TCP handshake and the request takes less than 1s
  10. the client's request gets proxied without an error, but with some added latency
  11. if there are no requests for a while, oathkeeper closes the TCP connection and the whole process repeats

Describe the solution you'd like

An ideal solution would be some kind of background function that fetches the JWKS periodically, so that actual client requests always use the cache (even if it is stale); a rough sketch follows the list below. The benefits would be:

  1. no added latency, because no additional synchronous requests are made
  2. higher error tolerance, because even if the JWKS server is unavailable the cache is still used, and when the server comes back up the cache is updated automatically
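
A minimal sketch of that idea in Go (illustration only, not oathkeeper code; the cache type and the fetch callback are made up for this example): a goroutine refreshes a shared key-set cache on a ticker, and request handlers only ever read the cache, even when the last refresh failed.

package jwksrefresh

import (
    "log"
    "sync"
    "time"
)

// keyCache holds the last successfully fetched key set. Request handlers only
// ever read from it; only the background refresher talks to the remote JWKS URL.
type keyCache struct {
    mu   sync.RWMutex
    keys []byte // raw jwks.json payload; a parsed key set in a real implementation
}

func (c *keyCache) get() []byte {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.keys
}

func (c *keyCache) set(keys []byte) {
    c.mu.Lock()
    c.keys = keys
    c.mu.Unlock()
}

// refreshLoop fetches the JWKS on every tick and keeps the previous value on
// failure, so clients always see the last known-good (possibly stale) keys.
func refreshLoop(c *keyCache, fetch func() ([]byte, error), interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        keys, err := fetch()
        if err != nil {
            log.Printf("jwks refresh failed, keeping stale keys: %v", err)
            continue
        }
        c.set(keys)
    }
}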

Describe alternatives you've considered

I understand that the ideal solution might be difficult to implement, so I see two alternatives:

  • make the timeout configurable
  • use something outside of oathkeeper (e.g. set up an external process that saves the JWKS to the file system and use "file://" in oathkeeper)

Additional context

If I understand correctly, the 1s limit is set here:

r.credentialsFetcher = credentials.NewFetcherDefault(r.Logger(), time.Second, time.Second*30)
and then used here:
ctx, cancel := context.WithTimeout(ctx, s.cancelAfter)
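
For the "make the timeout configurable" alternative, the change would roughly amount to feeding those two hardcoded durations from configuration instead. This is only a sketch; the option names in the comments are hypothetical, not existing oathkeeper settings:

// Hypothetical: both durations come from configuration instead of being hardcoded.
fetchTimeout := 2 * time.Second // e.g. a new "jwks_fetch_timeout" option
cacheTTL := 5 * time.Minute     // e.g. a new "jwks_cache_ttl" option
r.credentialsFetcher = credentials.NewFetcherDefault(r.Logger(), fetchTimeout, cacheTTL)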

@aeneasr
Member

aeneasr commented May 30, 2019

Thank you for the detailed issue description! Your analysis is on point. The logic was implemented this way for a couple of reasons:

  1. We did not want to preemptively fetch keys because they may be specified (later on) in access rules themselves, which makes it somewhat difficult/clunky to have an event loop that checks every 30s or so.
  2. We did not want slow-running requests to backends to clog up the system. Instead, we thought it would be best if the request is denied (maybe 403 isn't the best idea - maybe 503 with Retry-After?) but Oathkeeper still tries to resolve the key in the background so that, if it eventually resolves, the key is available in the cache and subsequent requests pass.

What's obviously less than ideal is that the cache has such a short TTL. I would expect that even if a cache item's EOL is reached, the cached item would be preferred over having no key at all (assuming the request to the remote fails with a timeout).

In general I think that it is a sensible request to make the fetching of credentials configurable wrt timeouts. Every system has different tolerances regarding latency and if - for example - you decide that a 2s latency is tolerable, then that should be supported by the software. The same would apply the other way around where the latency must be much smaller (e.g. < 50ms). It also makes sense to configure the TTL of the cache somewhere.

As a workaround for now, yes, the file-based approach is probably the best and is easy to implement. Simply fetch the keys before running docker build, or use an alpine image where you run curl https://path/to/keys.json, and then use that locally available file to pull the keys from in oathkeeper (with file://).

In any case, thank you for the write up and making the context so clear. I think that we'll find a suitable solution to this issue quickly!

@swapnilgawade16

Just wanted to check if there was any progress on this issue?

@aeneasr
Member

aeneasr commented Jan 3, 2020

Not from our side but we appreciate - as always - contributions in this area :)

@stszap
Contributor Author

stszap commented Sep 12, 2020

In the end, our company decided that it would be easier to just implement the file:// hack. We use k8s and run oathkeeper (version 0.16.0) in the same pod as the application. Here is the relevant YAML (it is not functional by itself, to keep it small); I hope it can help someone else:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: application
  labels:
    k8s-app: application
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: application
  template:
    metadata:
      labels:
        k8s-app: application
    spec:
      volumes:
      - name: oathkeeper-jwks
        emptyDir: {}
      containers:
      ...
      - name: oathkeeper-jwks-updater
        image: <docker image with curl> # we use our own image with curl, but it doesn't really matter
        workingDir: /etc/oathkeeper/jwks/
        command: ["sh"]
        args: ["-c", "while :; do curl -f \"$JWKS_URL\" -o jwks.json; sleep 30; done"] //-f should prevent curl from overwriting the file if there was an error during the request
        volumeMounts:
        - name: oathkeeper-jwks
          mountPath: /etc/oathkeeper/jwks/
        env:
        - name: JWKS_URL
          value: "<your jwks url>"
      - name: oathkeeper
        image: oryd/oathkeeper:v0.16.0
        args: ["serve"]
        volumeMounts:
        - name: oathkeeper-jwks
          mountPath: /etc/oathkeeper/jwks
        env:
        - name: AUTHENTICATORS_JWT_ENABLED
          value: "true"
        - name: AUTHENTICATORS_JWT_SCOPE_STRATEGY
          value: "wildcard"
        - name: AUTHENTICATORS_JWT_JWKS_URLS
          value: "file://etc/oathkeeper/jwks/jwks.json"
        ...

Here we use another "oathkeeper-jwks-updater" container just to update the jwks file in a shared "emptyDir" volume. This way the file is updated every 30s and is also cached by oathkeeper for another 30s, so it can take from 30s to 60s for an update to take effect. We have been using this solution in prod for more than a week and haven't found any problems with it.

@aeneasr
Member

aeneasr commented Sep 16, 2020

Awesome, thank you for sharing your knowledge!

@eroznik
Contributor

eroznik commented May 3, 2021

@aeneasr do you have any preferences on what might be done with regard to this? I am thinking about picking up this issue. A few suggestions (a rough sketch of the first one follows the list):

  • async reload on demand
    • each request would validate the JWKS, same as now
      • if the keys are expired: a new goroutine is spawned to reload the JWK, and the current request uses the old/stale JWKS
      • if no JWK is available (e.g. cold start), the JWK fetch would block until the keys are fetched (current behavior)
  • configurable TTL
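
A rough sketch of the "async reload on demand" idea (illustration only, not oathkeeper's actual fetcher; the types and the fetch callback are made up): a cold start blocks, an expired cache is served stale while a single goroutine refreshes it in the background, and a fresh cache is returned immediately.

package jwksreload

import (
    "fmt"
    "sync"
    "time"
)

// cachedKeys is a stand-in for a parsed JWKS plus its expiry.
type cachedKeys struct {
    keys      []byte
    expiresAt time.Time
}

type onDemandFetcher struct {
    mu         sync.Mutex
    cached     *cachedKeys
    refreshing bool
    fetch      func() ([]byte, error) // performs the actual HTTP call to the JWKS URL
    ttl        time.Duration
}

// Keys blocks on a cold start (current behavior), serves stale keys while one
// background goroutine refreshes an expired cache, and returns a fresh cache
// immediately.
func (f *onDemandFetcher) Keys() ([]byte, error) {
    f.mu.Lock()
    defer f.mu.Unlock()

    if f.cached == nil {
        // Cold start: nothing to fall back to, so fetch synchronously.
        keys, err := f.fetch()
        if err != nil {
            return nil, fmt.Errorf("no keys available: %w", err)
        }
        f.cached = &cachedKeys{keys: keys, expiresAt: time.Now().Add(f.ttl)}
        return keys, nil
    }

    if time.Now().After(f.cached.expiresAt) && !f.refreshing {
        // Expired: kick off one background refresh and serve the stale keys for now.
        f.refreshing = true
        go func() {
            keys, err := f.fetch()
            f.mu.Lock()
            defer f.mu.Unlock()
            f.refreshing = false
            if err == nil {
                f.cached = &cachedKeys{keys: keys, expiresAt: time.Now().Add(f.ttl)}
            }
        }()
    }
    return f.cached.keys, nil
}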

@aeneasr
Member

aeneasr commented May 8, 2021

SGTM - I think we can also fetch the JWK and if it is fast enough use the new keys, if it's too slow use the old ones?
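
That could roughly look like this (only a sketch, with made-up names): run the fetch in a goroutine and wait a bounded amount of time for it; if it is too slow, fall back to the old keys, while the fetch still completes in the background and updates the cache for subsequent requests.

package jwksdeadline

import "time"

// keysWithDeadline returns freshly fetched keys if the fetch finishes within
// wait, and the previously cached keys otherwise. The fetch keeps running in
// the background either way, so a slow response still lands in the cache (via
// store) for the next request.
func keysWithDeadline(old []byte, fetch func() ([]byte, error), store func([]byte), wait time.Duration) []byte {
    result := make(chan []byte, 1) // buffered so the goroutine never blocks after a timeout

    go func() {
        keys, err := fetch()
        if err != nil {
            return
        }
        store(keys) // update the shared cache regardless of whether anyone is still waiting
        result <- keys
    }()

    select {
    case keys := <-result:
        return keys // fetch was fast enough: use the new keys
    case <-time.After(wait):
        return old // too slow: fall back to the old ones
    }
}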

@eroznik
Contributor

eroznik commented May 8, 2021

That is certainly worth considering. I'll prepare a PR with the approach @aeneasr recommended, using a timer & goroutine.. we'll see how that goes :)

@aeneasr
Member

aeneasr commented May 11, 2021

Wohoo, nice! :)
