File lock implementation #13
Conversation
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License. See LICENSE in the project root for license information.

package lock
I had written a good test for this kind of file lock here: https://github.com/AzureAD/microsoft-authentication-extensions-for-dotnet/blob/main/tests/Microsoft.Identity.Client.Extensions.Msal.UnitTests/CrossPlatLockTests.cs#L55
It creates a bunch of processes, each of them making changes to a file (which is locked), and then checks that the file contents are not trash.
That said, this lock impl is just a wrapper over this flock thing (which is a private repo!), so I'm not sure it's worth having it here.
I converted this to a draft because I found a race on Linux that allows two processes to hold the lock simultaneously:

1. A creates the lock file and acquires the lock.
2. B opens the lock file and waits for the lock.
3. A unlocks and removes the lock file; B's open handle still refers to the removed file's inode.
4. C creates a new lock file at the same path and acquires a lock on it.
5. B acquires a lock on the old, unlinked inode.

Now B and C hold the lock so far as this library is concerned, and both may write the cache simultaneously. The first solution that comes to mind is: don't remove the file after unlocking it. @bgavrilMS, it looks like MSAL.NET and Python have the same race; is it a known one? The worst thing that could happen is two processes corrupting the cache with overlapping writes, but I imagine that would be extremely unlikely in real usage.
Thanks for the convincing description, Charles! You confirmed my long-time suspicion. Our existing approach has been "create a lock file, lock it, and then delete it" since day 1, and MSAL .NET, Python and Java even spent significant effort to keep the implementations in sync. But (1) the "locking on a file handle" operation should probably only be used on a stable file which will not be constantly removed and recreated. Let's revisit this and choose either (1) or (2), but not both. I would prefer (2), because it can even remove our dependency on a file locking library.
@rayluo - do you have customer feedback related to performance? This is a public client; you can at most have 5-6 apps performing very fast operations on the file, which is at most ~200 KB (one entry is about 7 KB, and the biggest token cache I know of was reported by a consultant with accounts in some 30 companies), but generally the file is <20 KB. The file locking mechanism has been working well for years, and I'm very hesitant to make any changes to it. Testing and fine-tuning is extremely expensive. We can reconsider if a P1 bug occurs. @chlowell - the solution to the race is to lock the file first, then read / write / delete it.
Right. The current implementation is reliable under contention only if the filename consistently maps to the same underlying object; it doesn't on Linux when the file is recreated.
That solves this race; however, it makes the file's presence the locked state, so failing to remove the file would cause deadlock.
The lock only prevents other processes from applying another lock on the same file. It doesn't matter whether A unlocks or removes first because A's lock doesn't prevent B opening the file.
I think it's reasonable to treat this as a known issue or out of scope. It does make the lock unreliable on Linux under even modest contention, but if that were common in real usage, I imagine someone would have noticed a perf impact from all the cache misses, and I think it's fair to say this library is intended only for low/no-contention scenarios like helping CLI tools avoid authenticating every time they run.
Totally agreed. Heck, even writing comments in this issue turns out to be a mentally demanding task. :-) Locking is hard. For the sake of completeness, when/if we eventually make that move, the change might be smaller than we think, because MSAL Java EX has already been doing "atomic exclusive file creation + file locks", and MSAL Python followed (not sure about MSAL .NET). It is just that, for backward compatibility in terms of behavior, we chose to still proceed with normal file locking after exclusive file creation fails, and that middle ground does not utilize the benefits of either approach to the fullest.

By the way, driven by a different initiative (which is to make our SDK SAW-ready by reducing 3rd-party library dependencies), I have experimented with removing the file lock and relying solely on exclusive file creation. It passed a local stress test.
Good point. Back then we contemplated writing a timestamp into the lock file and then the next lock seeker would remove a stale lock file. That approach was not implemented/needed in the current middle ground. But thanks for reminding me that I would need to do that in my SAW experiment above.
Not much. Over the years, there have been roughly 8 issues reported to Azure CLI. Most of them were various errors during lock acquisition, which could arguably mean the lock mechanism was doing its job. Only one of them was a JSON decode error which - now that I think about it - could mean the lock acquisition succeeded but a different process still raced to mess up the file content.
Coincidentally, the latest issue on the list above (
return &Lock{f: flock.New(p), retryDelay: retryDelay}, nil
}

// Lock acquires the file lock on behalf of the process. The behavior of concurrent |
Looks like there's some brittleness on certain platforms. Do you think it would be worth it to add some kind of internal counter (or whatever) and return an error (or potentially `panic()`) if the invariant isn't met (e.g. concurrent calls to `Lock()`)?
It wouldn't be worth a panic because two goroutines holding the lock simultaneously manifests as a cache miss. If the goroutines write the cache sequentially, one overwrites the other's data i.e., one token is lost. If their timing is extremely unlucky and the storage backend allows it, they could make overlapping writes that corrupt the whole cache, losing all its tokens. That's non-fatal because the application can reauthenticate. I suppose an application could lose many tokens, exceed the STS's rate limit with a deluge of token requests and get throttled, which would be tough to recover from, but that requires the alignment of many evil stars.
I think we're okay here. Customers will use this type indirectly through `Cache` (#14), which synchronizes goroutines. If applications use only one `Cache` per storage location (we'll document that they should), that eliminates the risk. Off the top of my head I can't think of a good way to catch multiple instances accessing the same storage, but trying doesn't seem worth the effort to me. Anyway, that's my reasoning, please let me know if I'm neglecting something.
Really just a thin wrapper around `flock`. It's internal because applications don't need to use it directly. (They use the `Cache` type instead, to be added in a future PR.)

To show how this fits into the bigger picture, here's how I've organized the packages:
Closes #5