Reuse the workspace to better take advantage of caches when indexing #69268
Conversation
This means all writes immediately flush to the file so you can watch the log file in real time to get a sense of progress. Although this is slightly slower, we don't remotely write enough to the log file for it to matter.
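As a hedged sketch (not the PR's actual code; file name and message are illustrative), `StreamWriter.AutoFlush` is the standard way to get this behavior:

```csharp
// AutoFlush makes every write hit the underlying stream immediately,
// so the log can be tailed live while indexing runs.
using var stream = File.Open("lsif-log.txt", FileMode.Create, FileAccess.Write, FileShare.Read);
await using var logFile = new StreamWriter(stream) { AutoFlush = true };
await logFile.WriteLineAsync("Load of the binlog starting...");
// Without AutoFlush, this line could sit in the writer's buffer until Dispose;
// with it, `tail -f lsif-log.txt` shows progress in real time.
```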
When we were processing a binlog, we would create a workspace per project. This wasn't ideal: as it turns out, it meant we wouldn't reuse caches for metadata references and documentation comments from one indexing job to the next. Fixes dotnet#68750
@@ -264,11 +266,10 @@ private static async Task LocateAndRegisterMSBuild(TextWriter logFile, Directory
await logFile.WriteLineAsync($"Load of the binlog complete; {msbuildInvocations.Length} invocations were found.");

var lsifGenerator = Generator.CreateAndWriteCapabilitiesVertex(lsifWriter, logFile);
var workspace = new AdhocWorkspace(await Composition.CreateHostServicesAsync());

foreach (var msbuildInvocation in msbuildInvocations)
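The shape of the change can be sketched as follows (names mirror the diff; the loop body is elided, as the actual load-and-index logic lives elsewhere in the tool):

```csharp
// One workspace is created up front and reused for every MSBuild invocation,
// so Roslyn's caches (metadata references, documentation comments) survive
// from one indexing job to the next instead of dying with a per-project workspace.
var workspace = new AdhocWorkspace(await Composition.CreateHostServicesAsync());

foreach (var msbuildInvocation in msbuildInvocations)
{
    // load the invocation's project into `workspace` and index it (elided)
}
```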
I'm curious if there's any benefit in parallelizing this step at all?
My general assumption was that projects tend to have enough documents to saturate the cores. We could do this in parallel, but it'd also mean additional memory usage, because we'd be holding multiple projects at once versus just the one we're analyzing.
In FindRefs we found that processing a single project at a time (ideally in topological order) and then processing in parallel within the project produced the best results.
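A hedged sketch of that strategy (`Parallel.ForEachAsync` is .NET 6+; `projectsInTopologicalOrder` is a hypothetical name, not an API in this codebase):

```csharp
// Process one project at a time to bound memory, but fan out across its
// documents to keep the cores busy.
foreach (var project in projectsInTopologicalOrder)
{
    await Parallel.ForEachAsync(project.Documents, async (document, cancellationToken) =>
    {
        var semanticModel = await document.GetSemanticModelAsync(cancellationToken);
        // index `document` using `semanticModel` (elided)
    });
}
```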
FWIW I downloaded this new version and ingested the VS Editor binlog and it looks like it's averaging just one or two CPU cores of activity at a time.
I didn't drill much deeper than peeking at Task Manager but I thought this was interesting as I'd assume this is mostly CPU bound work?
It occasionally bursts higher, but I wonder if this points to a bottleneck, like that lock around STDOUT, that could be improved.
For example, the same file will be in multiple projects, with different semantics. So you'd need to know which project you were looking through to make the meaning of a particular file be correct.
> Does the consumer side matter though?

It impacts server costs, read duration, scalability (how big of a workspace we can handle before we run out of memory on that machine), and reliability (the probability of an OOM exception).

> I would not have expected consumption to be costly since everything is computed up front.

LSIF is a graph, so a consumer has to hold much of it in memory at once to be able to reason about it. It can also only really be traversed by a single thread at a time, so machines consuming large LSIFs have a single core pinned at 100% while the rest sit mostly idle, with much of their RAM in use, and they can't really process any other LSIFs simultaneously unless those are guaranteed small enough to fit in the remaining RAM.

There are ways to work around this by allocating and tracking RAM dedicated to each in-progress LSIF ingestion, but that is more complicated, and it would be preferable to have more, smaller LSIFs.

With small, predictable sizes, the level of concurrency is more consistent, and OOMs from consuming large or unusual workspaces will de facto never happen.
> you'd need to know which project you were looking through to make the meaning of a particular file be correct

I think this already works today. AFAIK the LSIF generator creates a separate document node for each project that contains the document. This data is encoded via the `contains` edge, which indicates that the document is contained by the specified project.
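For illustration (ids and uris are made up), the line-delimited LSIF JSON for a file shared by two projects looks roughly like:

```json
{ "id": 10, "type": "vertex", "label": "project", "kind": "csharp" }
{ "id": 11, "type": "vertex", "label": "document", "uri": "file:///src/Shared.cs", "languageId": "csharp" }
{ "id": 12, "type": "edge", "label": "contains", "outV": 10, "inVs": [11] }
{ "id": 20, "type": "vertex", "label": "project", "kind": "csharp" }
{ "id": 21, "type": "vertex", "label": "document", "uri": "file:///src/Shared.cs", "languageId": "csharp" }
{ "id": 22, "type": "edge", "label": "contains", "outV": 20, "inVs": [21] }
```

Each project gets its own document vertex for the same uri, so results attached to vertex 11 versus vertex 21 can carry different, per-project semantics.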
Why do you have to process with a single thread? It's all immutable data, so I would expect parallel processing to be trivial. After all, that's what we do on the production side. It seems odd that consumption would be harder.
Furthermore, I would expect deep parallelization in servers due to processing a queue of lsif files. When I worked at G, that's how things worked. It wasn't like you spun up a server to process one piece of data. You had servers dequeuing what was essentially an unbounded queue of work. So they were always saturated.
Finally, if lsif isn't a good format for processing, then we should create one that is. My presumption on using lsif as the intermediary format was that it was actually good for this. If not, we can absolutely come up with formats that are tuned for low resource usage, and trivial algorithms on both construction and ingestion. :-)
> Finally, if lsif isn't a good format for processing, then we should create one that is. My presumption on using lsif as the intermediary format was that it was actually good for this. If not, we can absolutely come up with formats that are tuned for low resource usage, and trivial algorithms on both construction and ingestion. :-)
That's the gist of the issue I linked. It describes incremental changes to make it more efficient to process.
Put another way. This should be like game development. We should think about our formats first, optimizing for performance. We should not start everything with bad structures (like graphs or json) and then have to contort to make that decent :-)
In the end, the product we are shipping is not "lsif support", it's "rich code experiences in the web". We should start with that and move backwards. If that indicates lsif is a poor choice because of design and cogs, we can make something else.
💡 Bonus/misc performance idea: I noticed here a task is being started for each document (in a project or solution?). I'd imagine this could be hundreds or thousands of tasks. If so, this means we could have several hundred documents' worth of context that all needs to be held in memory at once, since each pending task holds on to its document's state. This would seem to dramatically increase the amount of memory used without improving throughput. Would it be more memory efficient to instead use a .NET Channel or Microsoft.VisualStudio.Threading.AsyncQueue + Environment.ProcessorCount threads to saturate CPU?
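A hedged sketch of that suggested bounded pipeline, using `System.Threading.Channels` (`IndexDocumentAsync` and `documents` are hypothetical names, not APIs from this codebase):

```csharp
// Bounded channel: the producer awaits once `capacity` items are queued, so
// only roughly ProcessorCount documents' worth of state is alive at any time.
var channel = Channel.CreateBounded<Document>(Environment.ProcessorCount * 2);

var workers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(async _ =>
    {
        await foreach (var document in channel.Reader.ReadAllAsync())
            await IndexDocumentAsync(document); // hypothetical per-document work
    })
    .ToArray();

foreach (var document in documents)
    await channel.Writer.WriteAsync(document); // applies backpressure when full

channel.Writer.Complete();
await Task.WhenAll(workers);
```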
If this is running .NET Core then document parallelization tends to effectively saturate all cores with linear speedup. If this is netfx, then it tends to be awful.
Thanks for the clarification! Didn't realize there were existing optimizations in place.