
Reuse the workspace to better take advantage of caches when indexing #69268

Conversation

@jasonmalinowski (Member) commented Jul 28, 2023

When we were processing a binlog, we would create a workspace per project. This wasn't ideal: it meant we wouldn't reuse caches for metadata references and documentation comments from one indexing job to the next.

Fixes #68750

This means all writes immediately flush to the file so you can watch
the log file in real time to get a sense of progress. Although this is
slightly slower, we don't remotely write enough to the log file for it
to matter.
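The flush-on-every-write behavior described here maps to `StreamWriter.AutoFlush`. A minimal sketch (not the actual PR code; the file name and log line are illustrative):

```csharp
using System.IO;

class AutoFlushLogSketch
{
    static void Main()
    {
        // AutoFlush = true pushes every write straight through to the file,
        // so the log can be tailed in real time to watch progress.
        // Slightly slower per write, but log volume is far too low to matter.
        using var logFile = new StreamWriter("lsif-generator.log") { AutoFlush = true };
        logFile.WriteLine("Load of the binlog complete.");
    }
}
```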
@jasonmalinowski jasonmalinowski requested a review from a team as a code owner July 28, 2023 01:31
@jasonmalinowski jasonmalinowski self-assigned this Jul 28, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added Area-IDE untriaged Issues and PRs which have not yet been triaged by a lead labels Jul 28, 2023
@jasonmalinowski jasonmalinowski force-pushed the allow-for-cache-reuse-when-indexing-binlogs branch from 2f09cef to 19d29d6 Compare July 28, 2023 01:34
src/Features/Lsif/Generator/Program.cs (two resolved review threads)
@@ -264,11 +266,10 @@ private static async Task LocateAndRegisterMSBuild(TextWriter logFile, Directory
await logFile.WriteLineAsync($"Load of the binlog complete; {msbuildInvocations.Length} invocations were found.");

var lsifGenerator = Generator.CreateAndWriteCapabilitiesVertex(lsifWriter, logFile);
var workspace = new AdhocWorkspace(await Composition.CreateHostServicesAsync());

foreach (var msbuildInvocation in msbuildInvocations)
Member:
I'm curious if there's any benefit in parallelizing this step at all?

@jasonmalinowski (Member, Author) Jul 28, 2023:

My general assumption was that projects tend to have enough documents to saturate the CPU on their own. We could process projects in parallel, but that would also mean additional memory usage, since we'd be holding multiple projects at once rather than just the one we're analyzing.

Member:

In FindRefs we found that processing a single project at a time (ideally in topological order), and then processing in parallel within the project, produced the best results.
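A self-contained sketch of that strategy, using a hypothetical three-project dependency graph (Kahn's algorithm for the topological order; `Parallel.ForEach` stands in for the per-document work — this is not the FindRefs implementation itself):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class TopoParallelSketch
{
    // Kahn's algorithm: order projects so each one's dependencies come first.
    public static List<string> TopologicalOrder(Dictionary<string, string[]> dependsOn)
    {
        var indegree = dependsOn.ToDictionary(kv => kv.Key, kv => kv.Value.Length);
        var dependents = new Dictionary<string, List<string>>();
        foreach (var (project, deps) in dependsOn)
            foreach (var dep in deps)
            {
                if (!dependents.TryGetValue(dep, out var list))
                    dependents[dep] = list = new List<string>();
                list.Add(project);
            }

        var ready = new Queue<string>(indegree.Where(kv => kv.Value == 0).Select(kv => kv.Key));
        var order = new List<string>();
        while (ready.Count > 0)
        {
            var project = ready.Dequeue();
            order.Add(project);
            if (dependents.TryGetValue(project, out var next))
                foreach (var dependent in next)
                    if (--indegree[dependent] == 0)
                        ready.Enqueue(dependent);
        }
        return order;
    }

    static void Main()
    {
        var graph = new Dictionary<string, string[]>
        {
            ["Core"] = Array.Empty<string>(),
            ["Compiler"] = new[] { "Core" },
            ["IDE"] = new[] { "Core", "Compiler" },
        };

        // One project at a time, in dependency order...
        foreach (var project in TopologicalOrder(graph))
        {
            var documents = Enumerable.Range(0, 8).Select(i => $"{project}/Doc{i}.cs");
            // ...but parallel across documents within the project.
            Parallel.ForEach(documents, doc => { /* index one document */ });
            Console.WriteLine($"Indexed {project}");
        }
    }
}
```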

Member:

FWIW I downloaded this new version and ingested the VS Editor binlog and it looks like it's averaging just one or two CPU cores of activity at a time.

I didn't drill much deeper than peeking at Task Manager but I thought this was interesting as I'd assume this is mostly CPU bound work?

Member:

It occasionally bursts higher, but I wonder if this points to a bottleneck, like the lock around STDOUT, that could be improved.

Member:

For example, the same file will be in multiple projects, with different semantics. So you'd need to know which project you were looking through to make the meaning of a particular file be correct.

Member:

Does the consumer side matter though?

It impacts server costs, read duration, scalability (how big of a workspace we can handle before we run out of memory on that machine), reliability (probability of an OOM exception).

I would not have expected consumption to be costly since everything is computed up front.

LSIF is a graph, so one has to hold much of it in memory at once to be able to reason about it. It can also only really be traversed by a single thread at a time, so large LSIFs mean the machines consuming them have a single core pinned at 100%, the rest mostly idle, and much of their RAM in use; they can't really process any other LSIFs simultaneously unless those are all small enough to fit in the remaining RAM.

There are ways to work around this by allocating and tracking RAM dedicated to each in-progress LSIF ingestion, but that is more complicated, and it would be preferable to have more, smaller LSIFs.

With small, predictable sizes, the level of concurrency is more consistent, and OOMs while consuming large or unusual workspaces will de facto never happen.

you'd need to know which project you were looking through to make the meaning of a particular file be correct

I think this already works today. AFAIK the LSIF generator creates a separate document node for each project that contains the document. This data is encoded via the contains edge, which indicates that the document is contained by that specified project.
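For illustration, an LSIF dump is line-delimited JSON roughly along these lines (ids and URIs are made up): the same file gets a separate `document` vertex under each project, and the `contains` edge records which project owns which vertex.

```json
{ "id": 1, "type": "vertex", "label": "project", "kind": "csharp" }
{ "id": 2, "type": "vertex", "label": "document", "uri": "file:///src/Shared.cs", "languageId": "csharp" }
{ "id": 3, "type": "edge", "label": "contains", "outV": 1, "inVs": [2] }
{ "id": 4, "type": "vertex", "label": "project", "kind": "csharp" }
{ "id": 5, "type": "vertex", "label": "document", "uri": "file:///src/Shared.cs", "languageId": "csharp" }
{ "id": 6, "type": "edge", "label": "contains", "outV": 4, "inVs": [5] }
```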

Member:

Why do you have to process with a single thread? It's all immutable data, so I would expect parallel processing to be trivial; after all, that's what we do on the production side. It seems odd that consumption would be harder.

Furthermore, I would expect deep parallelization in servers due to processing a queue of lsif files. When I worked at G, that's how things worked. It wasn't like you spun up a server to process one piece of data. You had servers dequeuing what was essentially an unbounded queue of work. So they were always saturated.

Finally, if lsif isn't a good format for processing, then we should create one that is. My presumption on using lsif as the intermediary format was that it was actually good for this. If not, we can absolutely come up with formats that are tuned for low resource usage, and trivial algorithms on both construction and ingestion. :-)

Member:

Finally, if lsif isn't a good format for processing, then we should create one that is. My presumption on using lsif as the intermediary format was that it was actually good for this. If not, we can absolutely come up with formats that are tuned for low resource usage, and trivial algorithms on both construction and ingestion. :-)

That's the gist of the issue I linked. It describes incremental changes to make it more efficient to process.

Member:

Put another way. This should be like game development. We should think about our formats first, optimizing for performance. We should not start everything with bad structures (like graphs or json) and then have to contort to make that decent :-)

In the end, the product we are shipping is not "lsif support", it's "rich code experiences in the web". We should start with that and move backwards. If that indicates lsif is a poor choice because of design and cogs, we can make something else.

@gundermanc (Member):

💡 Bonus/misc performance idea: I noticed here a task is being started for each document (in a project or solution?).

I'd imagine this could be hundreds or thousands of tasks. If so, this means we could have several hundred documents worth of context that all needs to be held in memory at a time, since each await can cause us to yield and start another task.

This would seem to dramatically increase the amount of memory used without improving throughput.

Would it be more memory efficient to instead use a .NET Channel or Microsoft.VisualStudio.Threading.AsyncQueue + Environment.ProcessorCount threads to saturate CPU?
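The suggestion above can be sketched with a bounded `System.Threading.Channels` channel and one consumer per core, so only a fixed number of documents are in flight at once (a sketch of the pattern, not the generator's actual code; the document names are placeholders):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class BoundedIndexingSketch
{
    public static async Task<int> IndexAsync(IEnumerable<string> documents)
    {
        // Bounded channel: the producer blocks once the buffer fills, so we
        // never hold hundreds of documents' worth of context at a time.
        var channel = Channel.CreateBounded<string>(Environment.ProcessorCount * 2);
        var processed = new ConcurrentBag<string>();

        // One consumer per core, instead of one task per document.
        var workers = Enumerable.Range(0, Environment.ProcessorCount)
            .Select(_ => Task.Run(async () =>
            {
                await foreach (var doc in channel.Reader.ReadAllAsync())
                    processed.Add(doc); // stand-in for per-document indexing work
            }))
            .ToArray();

        foreach (var doc in documents)
            await channel.Writer.WriteAsync(doc); // waits when the buffer is full
        channel.Writer.Complete();
        await Task.WhenAll(workers);
        return processed.Count;
    }

    static async Task Main()
    {
        var docs = Enumerable.Range(0, 100).Select(i => $"Doc{i}.cs");
        Console.WriteLine(await IndexAsync(docs)); // 100
    }
}
```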

@CyrusNajmabadi (Member):

If this is running .net core then document parallelization tends to effectively saturate all cores with linear speedup. If this is netfx, then it tends to be awful

@gundermanc (Member):

If this is running .net core then document parallelization tends to effectively saturate all cores with linear speedup. If this is netfx, then it tends to be awful

Thanks for the clarification! Didn't realize there were existing optimizations in place.

@jasonmalinowski jasonmalinowski merged commit 4d5d8f5 into dotnet:main Jul 28, 2023
22 of 24 checks passed
@jasonmalinowski jasonmalinowski deleted the allow-for-cache-reuse-when-indexing-binlogs branch July 28, 2023 20:31
@@ -264,11 +266,10 @@ private static async Task LocateAndRegisterMSBuild(TextWriter logFile, Directory
await logFile.WriteLineAsync($"Load of the binlog complete; {msbuildInvocations.Length} invocations were found.");

var lsifGenerator = Generator.CreateAndWriteCapabilitiesVertex(lsifWriter, logFile);
var workspace = new AdhocWorkspace(await Composition.CreateHostServicesAsync());
Member:

⚠️ Workspace needs to be disposed
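The likely shape of that fix is a `using` declaration so the workspace is disposed when the method exits. A self-contained sketch with a stand-in type (the real `AdhocWorkspace` implements `IDisposable` via `Workspace`, but needs the Roslyn packages):

```csharp
using System;

class WorkspaceStandIn : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}

class DisposalSketch
{
    static void Main()
    {
        WorkspaceStandIn captured;
        {
            // `using` declaration: Dispose runs when the enclosing scope exits,
            // mirroring `using var workspace = new AdhocWorkspace(...);`
            using var workspace = new WorkspaceStandIn();
            captured = workspace;
        }
        Console.WriteLine(captured.Disposed); // True
    }
}
```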

Labels
Area-IDE untriaged Issues and PRs which have not yet been triaged by a lead

Successfully merging this pull request may close these issues.

LSIF indexing performance degraded
4 participants