Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locking behavior too strict in OpenIdConnectCachingSecurityTokenProvider, struggles with enough concurrent requests #3078

Open
mderriey opened this issue Oct 10, 2024 · 0 comments

Comments

@mderriey
Copy link

Microsoft.Identity.Web Library

Microsoft.Identity.Web.OWIN

Microsoft.Identity.Web version

3.2.2

Web app

Not Applicable

Web API

Not Applicable

Token cache serialization

Not Applicable

Description

TL;DR: under enough concurrency, the locking behaviour of OpenIdConnectCachingSecurityTokenProvider creates a deadlock-like situation.

Note

I've seen this class in enough Microsoft repositories to suspect that you don't really own it per-se, but this repo seems to be the go-to when it comes to OIDC auth across all stacks, and also the most recent commit for this file, so you're the "lucky" ones to get this report.

We've experienced cases where our app grinds to a halt when a certain concurrency threshold is reached.
Snapshot analysis pointed to OpenIdConnectCachingSecurityTokenProvider, more specifically the Issuer and SecurityKeys properties.

Looking at the implementation, and reading the ReaderWriterLockSlim documentation, I think I can see how this happens.

The a-ha moment for me was when reading the following section from the documentation linked above (I left the upgradeable mode away as it's not using it)

If a ReaderWriterLockSlim does not allow recursion, a thread that tries to enter the lock can block for several reasons:

  • A thread that tries to enter read mode blocks if there are threads waiting to enter write mode or if there is a single thread in write mode.
  • A thread that tries to enter write mode blocks if there is a thread in any of the three modes.

Looking at the implementation, we can see that accessing Issuer or SecurityKeys will:

  1. Call RetrieveMetadata which enters "write" mode to get the configuration from the underlying ConfigurationManager instances, then
  2. Enter "read" mode to return the value of the private field.

Here's a mental model of what happens for a single request:

  1. The Issuer property is accessed:
    1. This calls RetrieveMetadata, which enters "write" mode and fetches the configuration.
    2. It then enters read mode to return the value.
  2. The SecurityKeys properties is accessed:
    1. This calls RetrieveMetadata, which enters "write" mode and fetches the configuration.
    2. It then enters read mode to return the value.

Extrapolating this to many concurrent requests, here's what I think happens:

  1. The first request comes in, and its thread T1 enters "write" mode.
  2. Other requests come in, and their threads block trying to enter "write" mode because of T1.
  3. T1 exits "write" mode, and cannot enter "read" mode because other threads are waiting to enter "write" mode.
  4. Another thread T2 enters "write" mode.
  5. More requests arrive, try to enter "write" mode, but can't, and get added to the "write" mode queue.

This means that concurrent requests are dependent on all other concurrent requests to have entered then exited write mode to be able to enter read mode and continue the processing of the request.
The bigger the "write" mode queue gets, the higher the chance of new requests arriving and making the problem worse.

The only way out of this situation is for no new requests to arrive, so the existing concurrent requests can enter and exit "write" mode one after the other, after which they will all be able to enter "read" mode.

However, the fact that we call RetrieveMetadata twice in a single request (.Issuer and .SecurityKeys, both accessed by JwtFormat) doesn't make it easy to get out of this situation.
Even more so when ConfigurationManager caches the OIDC configuration for 12 hours by default, so the vast majority of these calls will be no-ops when it comes to the actual config used to validate tokens.

Another aggravating factor is that we use EnterReadMode and EnterWriteMode, which will wait for as long as needed until they can enter the relevant mode. Note that TryEnterReadMode and TryEnterWriteMode exist, with timeout parameters, and their return value indicate whether entering the mode was successful.


Here are assumptions about Entra ID / B2C:

  1. For a given tenant, the issuer never changes (I'm aware this is metadata endpoint-dependent, but the class can only target one endpoint).
  2. Signing keys will be advertised in the metadata document long before they're used for signing so all consumers have the time to know about them.

Using these assumptions, I think we can loosen the locking behaviour:

  1. Try to get a write lock "synchronously", otherwise use the information already available.
  2. Do not use read locks, and use the information available.

Any thoughts?

Reproduction steps

  1. Set up an OWIN project with Microsoft.Identity.Web.
  2. Fire enough concurrent requests to trigger the blocking behaviour (200 was the tipping point on my local machine)

Error message

No response

Id Web logs

No response

Relevant code snippets

N/A

Regression

No response

Expected behavior

No locking that brings the app to a halt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants