
Race condition with the shared AST provider #1918

Closed
0dinD opened this issue Oct 21, 2021 · 6 comments · Fixed by #2709

0dinD commented Oct 21, 2021

In my debugging efforts on redhat-developer/vscode-java#2176, I've found what I believe to be the root cause of that issue: a race condition with the shared AST provider.

Here's what happens when you get desynced semantic tokens in the client (time increases with each row):

| Thread A | Thread B | Thread C |
| --- | --- | --- |
| Starts handling the first semantic tokens request | | |
| Calls CoreASTProvider#getAST | | |
| Calls CoreASTProvider#createAST because there was no cached AST | | |
| | Starts handling a document change request | |
| | Calls CoreASTProvider#disposeAST | |
| | Schedules a publish diagnostics job | |
| | Done handling the document change request | |
| Calls CoreASTProvider#cache with the newly created (but now outdated) AST | | |
| Returns from CoreASTProvider#getAST | | |
| Calls documentMonitor#checkChanged, which aborts the request (and the client will send a new one) | | |
| | | Starts handling the second semantic tokens request |
| | | Calls CoreASTProvider#getAST |
| | | Immediately receives the cached (but outdated!) AST |
| | | Collects semantic tokens using the outdated AST |
| | | Done handling the semantic tokens request (the client receives desynced tokens) |
| | Starts the publish diagnostics job | |
| | Calls CoreASTProvider#disposeAST | |
| | Calls CoreASTProvider#getAST, which creates a new (correct) AST | |
| | Uses said AST to compute diagnostics | |
| | Done publishing diagnostics | |

Note: the document version doesn't change during the second semantic tokens request on Thread C, and the buffer contents are correct. What causes the desynced tokens is that the AST being used is outdated.

This is a race condition between the second semantic tokens request and the publish diagnostics job: if the publish diagnostics job calls CoreASTProvider#disposeAST before the second semantic tokens request calls CoreASTProvider#getAST, the cached AST is refreshed in time and the tokens don't become desynced. That's why the issue is so hard to reproduce consistently: the window in which the second semantic tokens request "wins the race" over the publish diagnostics job is very narrow.
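To make the racing call sites concrete, here's a minimal sketch of the two code paths involved (the class and method names are mine, for illustration only; the real code lives in the semantic tokens handler and the diagnostics job):

```java
import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.jdt.core.ITypeRoot;
import org.eclipse.jdt.core.dom.CompilationUnit;
import org.eclipse.jdt.core.manipulation.CoreASTProvider;

class RaceSketch {
    // Thread C: the second semantic tokens request. Nothing prevents this
    // from running just after Thread A has cached the outdated AST.
    CompilationUnit getAstForRequest(ITypeRoot unit, IProgressMonitor monitor) {
        return CoreASTProvider.getInstance().getAST(unit, CoreASTProvider.WAIT_YES, monitor);
    }

    // Publish diagnostics job: evicts the cached (outdated) AST, but only
    // helps if it runs before getAstForRequest above.
    void runPublishDiagnosticsJob() {
        CoreASTProvider.getInstance().disposeAST();
    }
}
```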


With all this newfound knowledge, I managed to create a reproducible scenario that works every time:

  1. Comment out this line and this line to make sure no diagnostics are published (simulating that they are slow).
  2. Attach to JDT.LS with a debugger.
  3. Put a breakpoint just after the AST has been created, but before it has been cached.
  4. Insert a newline somewhere in the middle of the document opened by the debuggee.
  5. Keep the thread handling the semantic tokens request (Thread A) paused using the breakpoint.
  6. Disable the breakpoint (but don't unpause the thread).
  7. Insert another newline below the first one.
  8. Thread B will run, and dispose the old AST. The diagnostics job won't run yet (or really, at all, since we disabled it).
  9. Unpause Thread A, causing it to cache the outdated AST.
  10. Thread C will run, giving desynced tokens because it's using the outdated AST.

See it in action here:

reproduce-race-condition.mp4

One solution to this problem is to wait for both DOCUMENT_LIFE_CYCLE_JOBS and PUBLISH_DIAGNOSTICS_JOBS in the SemanticTokensHandler, to avoid the race condition. However, in many cases this will just slow the semantic tokens request down since there isn't always a race.

Another solution is to always call CoreASTProvider#disposeAST in the SemanticTokensHandler, to make sure the AST being used never comes from the cache. But this will also slow the request down in the cases where there is a perfectly good cached AST.


Neither of these workarounds really addresses what I think is the real race condition: the one between CoreASTProvider#getAST and CoreASTProvider#disposeAST. If two threads call these methods at roughly the same time, the methods will race, and CoreASTProvider#disposeAST might not actually succeed in disposing the AST. This is of course assuming that CoreASTProvider#getAST was called with something other than WAIT_NO.

I'm not familiar enough with upstream Eclipse libraries to know whether or not this should be considered a bug (and in that case, does it affect the Eclipse IDE?), or if it's an implementation error in JDT.LS.

Since CoreASTProvider#disposeAST is a synchronized method, I did manage to work around the issue by wrapping much of the code in SemanticTokensHandler in a synchronized (CoreASTProvider.getInstance()) {...} block.
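Roughly like this (a sketch only; collectTokens is a hypothetical stand-in for the actual token collection code):

```java
import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.jdt.core.ITypeRoot;
import org.eclipse.jdt.core.dom.CompilationUnit;
import org.eclipse.jdt.core.manipulation.CoreASTProvider;

class SynchronizedWorkaroundSketch {
    void computeTokens(ITypeRoot unit, IProgressMonitor monitor) {
        // disposeAST is synchronized on the CoreASTProvider instance, so
        // holding the same monitor here keeps a concurrent disposeAST from
        // running between getAST and the token collection.
        synchronized (CoreASTProvider.getInstance()) {
            CompilationUnit ast = CoreASTProvider.getInstance()
                    .getAST(unit, CoreASTProvider.WAIT_YES, monitor);
            collectTokens(ast);
        }
    }

    private void collectTokens(CompilationUnit ast) {
        // hypothetical stand-in for the actual token collection
    }
}
```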

But that only solves the race condition for the SemanticTokensHandler. What about all the other places where we might call CoreASTProvider#getAST? I feel like this issue with the shared AST provider might affect other parts of the code, see for example redhat-developer/vscode-java#2173 which looks somewhat similar (but also affects formatting).


So, to summarize: I know of fixes/workarounds for redhat-developer/vscode-java#2176, but I'm not sure which is the best one. It's also possible that there are better ways of solving the problem, or that the race condition should be fixed upstream. If this issue doesn't affect the Eclipse IDE, I wonder how they've worked around it?

I'd appreciate any help from those who have more experience working with the upstream libraries.


0dinD commented Oct 22, 2021

Actually, I think I may have found a more "correct" way to work around it (this is how I'm guessing it's done in the Eclipse IDE). What we should probably do is make sure that whenever we call CoreASTProvider#getAST, we are running an atomic workspace operation using org.eclipse.core.resources.IWorkspace#run(ICoreRunnable action, ISchedulingRule rule, int flags, IProgressMonitor monitor). From the Javadoc of said method:

> The supplied scheduling rule is used to determine whether this operation can be run simultaneously with workspace changes in other threads. If the scheduling rule conflicts with another workspace change that is currently running, the calling thread will be blocked until that change completes.

This sounds similar to what we're already doing when calling JobHelpers.waitForJobs(DocumentLifeCycleHandler.DOCUMENT_LIFE_CYCLE_JOBS, monitor);. Indeed, making that call blocks until the document lifecycle job completes, so we don't run them in parallel. However, that doesn't stop us from running into race conditions, because the document lifecycle job can still start in the middle of a semantic tokens request (just after we're done waiting for the previous lifecycle jobs).

Do note that the BaseDocumentLifeCycleHandler already uses these atomic scheduling rules; it's the other requests (such as semantic tokens) that currently don't.


So it seems like we should find places where we call CoreASTProvider#getAST with the WAIT_YES or WAIT_ACTIVE_ONLY flag and replace usages of JobHelpers.waitForJobs(DocumentLifeCycleHandler.DOCUMENT_LIFE_CYCLE_JOBS, monitor); with a call to org.eclipse.core.resources.IWorkspace#run(ICoreRunnable action, ISchedulingRule rule, int flags, IProgressMonitor monitor). I tried this with the SemanticTokensHandler, and it fixed the issue.
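For illustration, here's a sketch of what that could look like (only the Eclipse APIs are real; the class and helper names are mine, and I'm assuming the compilation unit's scheduling rule is the right one to use):

```java
import org.eclipse.core.resources.IWorkspace;
import org.eclipse.core.resources.ResourcesPlugin;
import org.eclipse.core.runtime.CoreException;
import org.eclipse.core.runtime.ICoreRunnable;
import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.jdt.core.ITypeRoot;
import org.eclipse.jdt.core.dom.CompilationUnit;
import org.eclipse.jdt.core.manipulation.CoreASTProvider;

class AtomicWorkspaceOperationSketch {
    void computeTokens(ITypeRoot unit, IProgressMonitor monitor) throws CoreException {
        IWorkspace workspace = ResourcesPlugin.getWorkspace();
        // The scheduling rule makes this block mutually exclusive with any
        // conflicting workspace change, e.g. a document lifecycle job
        // touching the same compilation unit.
        workspace.run((ICoreRunnable) progress -> {
            CompilationUnit ast = CoreASTProvider.getInstance()
                    .getAST(unit, CoreASTProvider.WAIT_YES, progress);
            collectTokens(ast);
        }, unit.getSchedulingRule(), IWorkspace.AVOID_UPDATE, monitor);
    }

    private void collectTokens(CompilationUnit ast) {
        // hypothetical stand-in for the actual token collection
    }
}
```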

Does this make sense to you?


0dinD commented Oct 22, 2021

I guess another workaround would be to always call CoreASTProvider#getAST after disposing the old AST when we get the document change request. If we rely on using JobHelpers.waitForJobs(DocumentLifeCycleHandler.DOCUMENT_LIFE_CYCLE_JOBS, monitor);, I guess we may also be able to defer that to the document lifecycle job (performValidation). But I'm not sure if that's the right approach, just brainstorming ideas.


jdneo commented Dec 8, 2021

Using IWorkspace#run() to wrap the semantic token calculation looks good to me. As long as the scheduling rule is set to that URI, the semantic token calculation should be mutually exclusive with the document lifecycle jobs for the file at that URI.

@0dinD Did you observe any perf change when applying such changes?


0dinD commented Dec 13, 2021

I didn't thoroughly test performance, but I don't remember there being any noticeable impact.

As far as I understand, using IWorkspace#run() instead of JobHelpers.waitForJobs should be fairly equivalent; the main difference is that the BaseDocumentLifeCycleHandler will also wait for the SemanticTokensHandler, in order to avoid the situation where the AST is disposed in the middle of a request. Waiting for the SemanticTokensHandler shouldn't be a problem though; from what I can tell it's quite fast if you don't need to JobHelpers.waitForJobs.

Overall I think it's a pretty good workaround and it should fix the desync bug for semantic tokens. But I do wonder if other parts of the codebase could be affected, because it seems like using the SharedASTProvider is a pretty common pattern. Ideally there shouldn't be a race condition in the SharedASTProvider at all, but I don't know if it's possible to fix easily and I don't really have time to investigate it right now.


I can have a look at using IWorkspace#run() again and submit a PR when I have time, maybe by the end of this week or early next week.

Update:
I had to prioritize other things, but I just need to write a few tests and then I'll submit a PR, hopefully in a few days.


0dinD commented Jun 19, 2023

I have now had some time to investigate this in more detail again. First of all, I don't think my previous suggestion to use IWorkspace#run is a good idea anymore, since it relies on the scheduling rule of didChange to achieve mutual exclusion, but that scheduling rule was removed in #2641 for performance reasons. Also, it was always an imperfect workaround that didn't really solve the underlying synchronization issue in CoreASTProvider: eclipse-jdt/eclipse.jdt.core#1151.

Investigating CoreASTProvider usage in JDT LS vs JDT UI

After concluding that it would probably be better to investigate the root cause rather than working around it, I had a look at the JDT UI code in an attempt to understand why this synchronization issue doesn't seem to affect JDT UI. I'm still not 100% sure how CoreASTProvider is used in JDT UI, but I've been able to identify some major differences between how it's used in JDT UI and JDT LS.

In JDT LS, we call CoreASTProvider#disposeAST when the text document changes, and we call CoreASTProvider#getAST (with waitFlag == CoreASTProvider.WAIT_YES, to trigger AST creation after the dispose) when we need the AST to service an LSP request. These two calls may race in different threads, which means that we run into trouble since the two methods are not synchronized (eclipse-jdt/eclipse.jdt.core#1151).

In JDT UI, CoreASTProvider#disposeAST is never called directly when the text document changes; instead, the AST is disposed indirectly via calls to CoreASTProvider#aboutToBeReconciled (which calls cache(null, javaElement);) from org.eclipse.jdt.internal.ui.text.java.JavaReconcilingStrategy. JDT UI does call CoreASTProvider#getAST when the text document changes, but unlike JDT LS, it does not pass waitFlag == CoreASTProvider.WAIT_YES, which means that those calls to getAST do not trigger AST creation.

So how does JDT UI create the new AST after calling CoreASTProvider#disposeAST, if not via CoreASTProvider#getAST (like JDT LS)? Well, it seems like this is once again the job of org.eclipse.jdt.internal.ui.text.java.JavaReconcilingStrategy, which not only calls disposeAST, but also calls CoreASTProvider#reconciled (after having created the AST itself).
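If I've read it correctly, that reconcile flow boils down to something like the following sketch (a simplification based on my reading, not the actual JavaReconcilingStrategy code; createAST stands in for however the reconciler builds the new AST):

```java
import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.jdt.core.ITypeRoot;
import org.eclipse.jdt.core.dom.CompilationUnit;
import org.eclipse.jdt.core.manipulation.CoreASTProvider;

class ReconcileFlowSketch {
    void reconcile(ITypeRoot javaElement, IProgressMonitor monitor) {
        CoreASTProvider provider = CoreASTProvider.getInstance();
        provider.aboutToBeReconciled(javaElement); // internally does cache(null, javaElement)
        CompilationUnit ast = createAST(javaElement, monitor); // the reconciler builds the AST itself
        provider.reconciled(ast, javaElement, monitor); // hands the fresh AST back to the provider
    }

    private CompilationUnit createAST(ITypeRoot javaElement, IProgressMonitor monitor) {
        return null; // hypothetical stand-in for the actual AST creation
    }
}
```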


So in conclusion, JDT UI uses a very different strategy for CoreASTProvider, compared to JDT LS. This is probably the reason why the synchronization problem (eclipse-jdt/eclipse.jdt.core#1151) doesn't seem to affect JDT UI. Maybe org.eclipse.jdt.internal.ui.text.java.JavaReconcilingStrategy handles synchronization of CoreASTProvider in a safer way than what is done in JDT LS.

With the above in mind, what can be done to solve this issue in JDT LS?

I see a few different alternatives:

Option A

Investigate whether or not JDT LS should be using org.eclipse.jdt.internal.ui.text.java.JavaReconcilingStrategy (or something similar). I still don't know much about how JavaReconcilingStrategy works though (or whether it is a good solution to this issue), and I don't really feel like I have time to investigate it in more detail.

Option B

Investigate whether or not we can fix eclipse-jdt/eclipse.jdt.core#1151 in JDT Core.

Option C

Call CoreASTProvider#clearReconciliation after disposeAST. This resets CoreASTProvider#fReconcilingJavaElement to null, which means that this code path is triggered (which prevents getAST from caching the outdated AST): https://github.com/eclipse-jdt/eclipse.jdt.ui/blob/5e02b306fbe22247843b1e09fcf4212b48f2d7f6/org.eclipse.jdt.core.manipulation/common/org/eclipse/jdt/core/manipulation/CoreASTProvider.java#L328-L338
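In other words, the fix is roughly this (a sketch; the method name is illustrative, standing in for wherever JDT LS currently calls disposeAST on document changes):

```java
import org.eclipse.jdt.core.manipulation.CoreASTProvider;

class OptionCSketch {
    void onDocumentChanged() {
        CoreASTProvider provider = CoreASTProvider.getInstance();
        provider.disposeAST();
        // Clearing the reconciliation state resets fReconcilingJavaElement,
        // so a concurrent getAST call won't cache an AST that was built
        // against the old buffer contents.
        provider.clearReconciliation();
    }
}
```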

I will submit a PR to implement this approach, as it seems like the best option to me (it completely fixes this issue and the implementation is very simple). Of course, I welcome any help in exploring and investigating the other options; I just don't feel like I have the time to do that (and I don't have that much experience with the inner workings of JDT UI).

I have done some more testing using the steps to reproduce the race condition 100% of the time (which I shared in my initial comment), and after adding the clearReconciliation call after disposeAST, I can no longer reproduce this issue.

Update: Using clearReconciliation does not (on its own) solve the race condition in 100% of the cases; see #2714 (comment)


0dinD commented Jul 3, 2023

> I feel like this issue with the shared AST provider might affect other parts of the code

Indeed, pretty much anything that uses CoreASTProvider could be affected by this issue. DocumentHighlightHandler is one example: I managed to catch a screenshot where both the semantic tokens and the document highlight are using an outdated AST:

ast-desync-CoreASTProvider
