File lock implementation #13
Conversation
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License. See LICENSE in the project root for license information.

package lock
I had written a good test for this kind of file lock here: https://github.com/AzureAD/microsoft-authentication-extensions-for-dotnet/blob/main/tests/Microsoft.Identity.Client.Extensions.Msal.UnitTests/CrossPlatLockTests.cs#L55
It creates a bunch of processes, each of them making changes to a file (which is locked), and then checks that the file contents are not trash.
That said, this lock impl is just a wrapper over this flock thing (which is a private repo!), so I'm not sure it's worth having it here.
I converted this to a draft because I found a race on Linux that allows two processes to hold the lock simultaneously:

1. A creates the lock file and acquires the lock.
2. B opens the lock file and waits for the lock.
3. A unlocks and removes the lock file; B's open handle still refers to the removed file's inode.
4. C creates a new lock file at the same path and acquires a lock on it.
5. B acquires a lock on the old, unlinked inode.

Now B and C hold the lock so far as this library is concerned, and both may write the cache simultaneously. The first solution that comes to mind is: don't remove the file after unlocking it. @bgavrilMS, it looks like MSAL.NET and Python have the same race; is it a known one? The worst thing that could happen is two processes corrupting the cache with overlapping writes, but I imagine that would be extremely unlikely in real usage.
Thanks for the convincing description, Charles! You confirmed my long-time suspicion. Our existing approach has been "create a lock file, lock it, and then delete it" since day 1, and MSAL .NET, Python and Java even spent significant effort to keep the implementations in sync. But (1) the "locking on a file handle" operation should probably only be used on a stable file which will not be constantly removed and recreated. Let's revisit this and choose either (1) or (2), but not both. I would prefer (2), because it can even remove our dependency on a file locking library.
@rayluo - do you have customer feedback related to performance? This is a public client; you can at most have 5-6 apps performing very fast operations on the file, which is at most ~200 KB (one entry is about 7 KB, and the biggest token cache I know of was reported by a consultant with accounts in some 30 companies), but generally the file is <20 KB. The file locking mechanism has been working well for years, and I'm very hesitant to make any changes to it. Testing and fine-tuning is extremely expensive. We can reconsider if a P1 bug occurs. @chlowell - the solution to the race is to lock the file first, then read / write / delete it.
Right. The current implementation is reliable under contention only if the filename consistently maps to the same underlying object; it doesn't on Linux when the file is recreated.
That solves this race; however, it makes the file's presence the locked state, so failing to remove the file would cause deadlock.
The lock only prevents other processes from applying another lock on the same file. It doesn't matter whether A unlocks or removes first because A's lock doesn't prevent B opening the file.
I think it's reasonable to treat this as a known issue or out of scope. It does make the lock unreliable on Linux under even modest contention, but if that were common in real usage, I imagine someone would have noticed a perf impact from all the cache misses, and I think it's fair to say this library is intended only for low/no-contention scenarios like helping CLI tools avoid authenticating every time they run.
Totally agreed. Heck, even writing comments in this issue turns out to be a mentally demanding task. :-) Locking is hard. For the sake of completeness, when/if we eventually make that move, the change might be smaller than we think, because MSAL Java EX has already been doing "atomic exclusive file creation + file locks", and MSAL Python followed (not sure about MSAL .NET). It is just that, for backward compatibility in terms of behavior, we chose to still proceed with normal file locking after exclusive file creation fails, and that middle ground does not utilize the benefits of either approach to the fullest.

By the way, driven by a different initiative (which is to make our SDK SAW-ready by reducing 3rd-party library dependencies), I have experimented with removing the file lock and relying solely on exclusive file creation. It passed a local stress test.
Good point. Back then we contemplated writing a timestamp into the lock file and then the next lock seeker would remove a stale lock file. That approach was not implemented/needed in the current middle ground. But thanks for reminding me that I would need to do that in my SAW experiment above.
Not much. Over the years, there have been roughly 8 issues reported to Azure CLI. Most of them were various errors during lock acquisition, which could arguably mean the lock mechanism was doing its job. Only one of them was a JSON decode error which - now that I think about it - could mean the lock acquisition succeeded but a different process still raced to mess up the file content.
Coincidentally, the latest issue on the list above (
return &Lock{f: flock.New(p), retryDelay: retryDelay}, nil
}

// Lock acquires the file lock on behalf of the process. The behavior of concurrent |
Looks like there's some brittleness on certain platforms. Do you think it would be worth it to add some kind of internal counter (or whatever) and return an error (or potentially `panic()`) if the invariant isn't met (e.g. concurrent calls to `Lock()`)?
It wouldn't be worth a panic because two goroutines holding the lock simultaneously manifests as a cache miss. If the goroutines write the cache sequentially, one overwrites the other's data i.e., one token is lost. If their timing is extremely unlucky and the storage backend allows it, they could make overlapping writes that corrupt the whole cache, losing all its tokens. That's non-fatal because the application can reauthenticate. I suppose an application could lose many tokens, exceed the STS's rate limit with a deluge of token requests and get throttled, which would be tough to recover from, but that requires the alignment of many evil stars.
I think we're okay here. Customers will use this type indirectly through `Cache` (#14), which synchronizes goroutines. If applications use only one `Cache` per storage location (we'll document that they should), that eliminates the risk. Off the top of my head I can't think of a good way to catch multiple instances accessing the same storage, but trying doesn't seem worth the effort to me. Anyway, that's my reasoning, please let me know if I'm neglecting something.
Really just a thin wrapper around `flock`. It's internal because applications don't need to use it directly. (They use the `Cache` type instead, to be added in a future PR.)

To show how this fits into the bigger picture, here's how I've organized the packages:
Closes #5