
File lock implementation #13

Merged: 10 commits, May 16, 2023

Conversation

chlowell
Collaborator

Really just a thin wrapper around flock. It's internal because applications don't need to use it directly. (They use the Cache type instead, to be added in future PR.)

To show how this fits into the bigger picture, here's how I've organized the packages:

extensions
 ├── accessor: types for accessing storage
 │   └── file: plaintext file storage
 ├── cache: for use with MSAL's WithCache() option
 └── internal
     ├── flock: copy of gofrs/flock
     └── lock: this PR; file lock implementation used by cache

Closes #5

@chlowell chlowell marked this pull request as ready for review April 25, 2023 21:14
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License. See LICENSE in the project root for license information.

package lock
Member

I had written a good test for the file lock here: https://github.com/AzureAD/microsoft-authentication-extensions-for-dotnet/blob/main/tests/Microsoft.Identity.Client.Extensions.Msal.UnitTests/CrossPlatLockTests.cs#L55

It creates a bunch of processes, each of them making changes to a file (which is locked), and then checks that the file contents are not trash.

That said, this lock impl is just a wrapper over this flock thing (which is a private repo!), so I'm not sure it's worth having it here.

@chlowell
Collaborator Author

I converted this to a draft because I found a race on Linux that allows two processes to hold the lock simultaneously:

A opens the file
A locks the file
B opens the file
A unlocks the file
A removes the file
B locks the (removed) file
  - this works, I guess because the inode remains until all FDs are closed; B's is open
C opens the file, creating it because no file with the name exists (new inode)
C locks the file

Now B and C hold the lock so far as this library is concerned and both may write the cache simultaneously. The first solution that comes to mind is, don't remove the file after unlocking it.

@bgavrilMS, it looks like MSAL.NET and Python have the same race, is it a known one? The worst thing that could happen is two processes corrupting the cache with overlapping writes, but I imagine that would be extremely unlikely in real usage.
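The sequence above can even be replayed within a single process on Linux, because B and C end up locking two different inodes that share one name. Here's a minimal sketch (the filename is made up, and `syscall.Flock` is the same primitive flock wraps; Linux/macOS only):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// demonstrateRace replays the A/B/C sequence in one process: B and C
// both succeed in taking an "exclusive" flock at the same time because
// they hold two different inodes reachable via the same path.
func demonstrateRace() (bLocked, cLocked bool) {
	const name = "demo.lock"
	defer os.Remove(name)

	a, err := os.OpenFile(name, os.O_CREATE|os.O_RDWR, 0o600) // A opens
	if err != nil {
		panic(err)
	}
	defer a.Close()
	syscall.Flock(int(a.Fd()), syscall.LOCK_EX) // A locks

	b, err := os.OpenFile(name, os.O_RDWR, 0o600) // B opens the same inode
	if err != nil {
		panic(err)
	}
	defer b.Close()

	syscall.Flock(int(a.Fd()), syscall.LOCK_UN) // A unlocks
	os.Remove(name)                             // A removes the name; the inode survives via b's fd

	// B locks the unlinked inode
	bLocked = syscall.Flock(int(b.Fd()), syscall.LOCK_EX|syscall.LOCK_NB) == nil

	c, err := os.OpenFile(name, os.O_CREATE|os.O_RDWR, 0o600) // C creates a new inode
	if err != nil {
		panic(err)
	}
	defer c.Close()
	cLocked = syscall.Flock(int(c.Fd()), syscall.LOCK_EX|syscall.LOCK_NB) == nil
	return
}

func main() {
	b, c := demonstrateRace()
	// On Linux this reports both as true: two "exclusive" locks held at once.
	fmt.Printf("B holds lock: %v, C holds lock: %v\n", b, c)
}
```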

@chlowell chlowell marked this pull request as draft April 28, 2023 00:41
@rayluo

rayluo commented Apr 28, 2023

> ... I found a race on Linux that allows two processes to hold the lock simultaneously:
>
> A opens the file
> A locks the file
> B opens the file
> A unlocks the file
> A removes the file
> B locks the (removed) file
>   - this works, I guess because the inode remains until all FDs are closed; B's is open
> C opens the file, creating it because no file with the name exists (new inode)
> C locks the file
>
> Now B and C hold the lock so far as this library is concerned and both may write the cache simultaneously. The first solution that comes to mind is, don't remove the file after unlocking it.
>
> @bgavrilMS, it looks like MSAL.NET and Python have the same race, is it a known one? The worst thing that could happen is two processes corrupting the cache with overlapping writes, but I imagine that would be extremely unlikely in real usage.

Thanks for the convincing description, Charles! You confirmed my long time suspicion.

Our existing approach has been "create a lock file, lock it, and then delete it" since day 1, and MSAL .NET, Python, and Java have spent significant effort keeping the implementations in sync. But:

(1) The "locking on a file handle" operation should probably only be used on a stable file which will not be constantly removed and recreated.
(2) Alternatively, we can abandon the "locking on a file handle" completely, and use "exclusive file creation (and later removing it)". At one point, a teammate on MSAL Java experimented with this approach and reported better performance.

Let's revisit this and choose either (1) or (2), but not both. I would prefer (2), because it can even remove our dependency on a file locking library.

@bgavrilMS
Copy link
Member

@rayluo - do you have customer feedback related to performance? This is a public client scenario: at most 5-6 apps performing very fast operations on the file, which is at most ~200 KB (one entry is about 7 KB, and the biggest token cache I know of was reported by a consultant with accounts in some 30 companies), but generally the file is <20 KB.

The file locking mechanism has been working well for years, and I'm very hesitant to make any changes to it. Testing and fine tuning is extremely expensive. We can reconsider if a P1 bug occurs.

@chlowell - the solution to the race is to lock the file first, then to read / write / delete it.

@chlowell
Copy link
Collaborator Author

chlowell commented Apr 28, 2023

> (1) The "locking on a file handle" operation should probably only be used on a stable file which will not be constantly removed and recreated.

Right. The current implementation is reliable under contention only if the filename consistently maps to the same underlying object; it doesn't on Linux when the file is recreated.

> (2) Alternatively, we can abandon the "locking on a file handle" completely, and use "exclusive file creation (and later removing it)".

That solves this race; however, it makes the file's presence the locked state, so failing to remove the file would cause deadlock.

> the solution to the race is to lock the file first, then to read / write / delete it.

The lock only prevents other processes from applying another lock on the same file. It doesn't matter whether A unlocks or removes first because A's lock doesn't prevent B opening the file.

> The file locking mechanism has been working well for years, and I'm very hesitant to make any changes to it. Testing and fine tuning is extremely expensive. We can reconsider if a P1 bug occurs.

I think it's reasonable to treat this as a known issue or out of scope. It does make the lock unreliable on Linux under even modest contention, but if that were common in real usage, I imagine someone would have noticed a perf impact from all the cache misses, and I think it's fair to say this library is intended only for low/no-contention scenarios like helping CLI tools avoid authenticating every time they run.

@rayluo
Copy link

rayluo commented Apr 28, 2023

> The file locking mechanism has been working well for years, and I'm very hesitant to make any changes to it. Testing and fine tuning is extremely expensive. We can reconsider if a P1 bug occurs.

Totally agreed. Heck, even writing comments in this issue turns out to be a mentally demanding task. :-) Locking is hard.

For the sake of completeness, when/if we eventually make that move, the change might be smaller than we think, because MSAL Java EX has already been doing "atomic file creation + file locks", and MSAL Python followed (not sure about MSAL .NET). It's just that, for backward compatibility of behavior, we chose to still proceed with normal file locking after exclusive file creation fails, and that middle ground does not fully realize the benefits of either approach.

By the way, driven by a different initiative (which is to make our SDK SAW-ready by reducing 3rd-party library dependencies), I have experimented with removing the file lock and relying solely on exclusive file creation. It passed a local stress test.

> > (2) Alternatively, we can abandon the "locking on a file handle" completely, and use "exclusive file creation (and later removing it)".
>
> That solves this race; however, it makes the file's presence the locked state, so failing to remove the file would cause deadlock.

Good point. Back then we contemplated writing a timestamp into the lock file and then the next lock seeker would remove a stale lock file. That approach was not implemented/needed in the current middle ground. But thanks for reminding me that I would need to do that in my SAW experiment above.

> @rayluo - do you have customer feedback related to performance?

Not much. Over the years, there have been roughly 8 issues reported to Azure CLI. Most of them were various errors during lock acquisition, which could arguably mean the lock mechanism was doing its job. Only one of them was a JSON decode error, which, now that I think about it, could mean the lock acquisition succeeded but a different process still raced to mess up the file content.

> It does make the lock unreliable on Linux under even modest contention

Coincidentally, the latest issue on the list above (https://github.com/Azure/azure-cli/issues/26109) was on Linux. The exclusive file creation failed, but at least those failures were easily detectable; the subsequent file lock acquisition, however, somehow got stuck. That customer had to revert to an ADAL-powered old version of Azure CLI. :-(

@chlowell chlowell marked this pull request as ready for review April 28, 2023 22:45
return &Lock{f: flock.New(p), retryDelay: retryDelay}, nil
}

// Lock acquires the file lock on behalf of the process. The behavior of concurrent


Looks like there's some brittleness on certain platforms. Do you think it would be worth it to add some kind of internal counter (or whatever) and return an error (or potentially panic()) if the invariant isn't met (e.g. concurrent calls to Lock())?

Collaborator Author


It wouldn't be worth a panic because two goroutines holding the lock simultaneously manifests as a cache miss. If the goroutines write the cache sequentially, one overwrites the other's data i.e., one token is lost. If their timing is extremely unlucky and the storage backend allows it, they could make overlapping writes that corrupt the whole cache, losing all its tokens. That's non-fatal because the application can reauthenticate. I suppose an application could lose many tokens, exceed the STS's rate limit with a deluge of token requests and get throttled, which would be tough to recover from, but that requires the alignment of many evil stars.

I think we're okay here. Customers will use this type indirectly through Cache (#14), which synchronizes goroutines. If applications use only one Cache per storage location (we'll document that they should), that eliminates the risk. Off the top of my head I can't think of a good way to catch multiple instances accessing the same storage, but trying doesn't seem worth the effort to me. Anyway, that's my reasoning, please let me know if I'm neglecting something.

@chlowell chlowell merged commit 2cab7c3 into dev May 16, 2023
@chlowell chlowell deleted the chlowell/lock branch May 16, 2023 23:43