Reuse the workspace to better take advantage of caches when indexing #69268
Conversation
This means all writes immediately flush to the file so you can watch the log file in real time to get a sense of progress. Although this is slightly slower, we don't remotely write enough to the log file for it to matter.
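As a hedged sketch (not the PR's actual code; file name and message are illustrative), `StreamWriter.AutoFlush` is the standard way to get this behavior:

```csharp
// AutoFlush makes every write hit the underlying stream immediately,
// so the log can be tailed live while indexing runs.
using var stream = File.Open("lsif-log.txt", FileMode.Create, FileAccess.Write, FileShare.Read);
await using var logFile = new StreamWriter(stream) { AutoFlush = true };
await logFile.WriteLineAsync("Load of the binlog starting...");
// Without AutoFlush, this line could sit in the writer's buffer until Dispose;
// with it, `tail -f lsif-log.txt` shows progress in real time.
```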
When we were processing a binlog, we would create a workspace per project. This wasn't ideal: as it turns out, it meant we wouldn't reuse caches for metadata references and documentation comments from one indexing job to the next. Fixes dotnet#68750
@@ -264,11 +266,10 @@ private static async Task LocateAndRegisterMSBuild(TextWriter logFile, Directory
await logFile.WriteLineAsync($"Load of the binlog complete; {msbuildInvocations.Length} invocations were found.");

var lsifGenerator = Generator.CreateAndWriteCapabilitiesVertex(lsifWriter, logFile);
var workspace = new AdhocWorkspace(await Composition.CreateHostServicesAsync());

foreach (var msbuildInvocation in msbuildInvocations)
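The shape of the change can be sketched as follows (names mirror the diff; the loop body is elided, as the actual load-and-index logic lives elsewhere in the tool):

```csharp
// One workspace is created up front and reused for every MSBuild invocation,
// so Roslyn's caches (metadata references, documentation comments) survive
// from one indexing job to the next instead of dying with a per-project workspace.
var workspace = new AdhocWorkspace(await Composition.CreateHostServicesAsync());

foreach (var msbuildInvocation in msbuildInvocations)
{
    // load the invocation's project into `workspace` and index it (elided)
}
```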
I'm curious if there's any benefit in parallelizing this step at all?
My general assumption was that projects tend to have enough documents to saturate the cores. We could do this in parallel, but it'd also mean additional memory usage, because we'd be holding multiple projects at once versus just the one we're analyzing.
In FindRefs we found that processing a single project at a time (ideally in topological order) and then processing in parallel within the project produced the best results.
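A hedged sketch of that strategy (`Parallel.ForEachAsync` is .NET 6+; `projectsInTopologicalOrder` is a hypothetical name, not an API in this codebase):

```csharp
// Process one project at a time to bound memory, but fan out across its
// documents to keep the cores busy.
foreach (var project in projectsInTopologicalOrder)
{
    await Parallel.ForEachAsync(project.Documents, async (document, cancellationToken) =>
    {
        var semanticModel = await document.GetSemanticModelAsync(cancellationToken);
        // index `document` using `semanticModel` (elided)
    });
}
```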
FWIW I downloaded this new version and ingested the VS Editor binlog and it looks like it's averaging just one or two CPU cores of activity at a time.
I didn't drill much deeper than peeking at Task Manager but I thought this was interesting as I'd assume this is mostly CPU bound work?
It occasionally bursts higher, but I wonder if this points to a bottleneck, like that lock around STDOUT, that could be improved.
For example, the same file will be in multiple projects, with different semantics. So you'd need to know which project you were looking through to make the meaning of a particular file be correct.
> Does the consumer side matter though?

It impacts server costs, read duration, scalability (how big of a workspace we can handle before we run out of memory on that machine), and reliability (the probability of an OOM exception).

> I would not have expected consumption to be costly since everything is computed up front.

LSIF is a graph, so a consumer has to hold much of it in memory at once to be able to reason about it. It can also only really be traversed by a single thread at a time, so machines consuming large LSIFs have a single core pinned at 100% while the rest sit mostly idle, with much of their RAM in use, and they can't really process any other LSIFs simultaneously unless those are guaranteed small enough to fit in the remaining RAM.

There are ways to work around this by allocating and tracking RAM dedicated to each in-progress LSIF ingestion, but that is more complicated, and it would be preferable to have more, smaller LSIFs.

With small, predictable sizes, the level of concurrency is more consistent, and OOMs from consuming large or unusual workspaces will de facto never happen.
> you'd need to know which project you were looking through to make the meaning of a particular file be correct

I think this already works today. AFAIK the LSIF generator creates a separate document node for each project that contains the document. This data is encoded via the `contains` edge, which indicates that the document is contained by the specified project.
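For illustration (ids and uris are made up), the line-delimited LSIF JSON for a file shared by two projects looks roughly like:

```json
{ "id": 10, "type": "vertex", "label": "project", "kind": "csharp" }
{ "id": 11, "type": "vertex", "label": "document", "uri": "file:///src/Shared.cs", "languageId": "csharp" }
{ "id": 12, "type": "edge", "label": "contains", "outV": 10, "inVs": [11] }
{ "id": 20, "type": "vertex", "label": "project", "kind": "csharp" }
{ "id": 21, "type": "vertex", "label": "document", "uri": "file:///src/Shared.cs", "languageId": "csharp" }
{ "id": 22, "type": "edge", "label": "contains", "outV": 20, "inVs": [21] }
```

Each project gets its own document vertex for the same uri, so results attached to vertex 11 versus vertex 21 can carry different, per-project semantics.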
Why do you have to process with a single thread? It's all immutable data, so I would expect parallel processing to be trivial. After all, that's what we do on the production side. It seems odd that consumption would be harder.
Furthermore, I would expect deep parallelization in servers due to processing a queue of lsif files. When I worked at G, that's how things worked. It wasn't like you spun up a server to process one piece of data. You had servers dequeuing what was essentially an unbounded queue of work. So they were always saturated.
Finally, if lsif isn't a good format for processing, then we should create one that is. My presumption on using lsif as the intermediary format was that it was actually good for this. If not, we can absolutely come up with formats that are tuned for low resource usage, and trivial algorithms on both construction and ingestion. :-)
> Finally, if lsif isn't a good format for processing, then we should create one that is. My presumption on using lsif as the intermediary format was that it was actually good for this. If not, we can absolutely come up with formats that are tuned for low resource usage, and trivial algorithms on both construction and ingestion. :-)
That's the gist of the issue I linked. It describes incremental changes to make it more efficient to process.
Put another way. This should be like game development. We should think about our formats first, optimizing for performance. We should not start everything with bad structures (like graphs or json) and then have to contort to make that decent :-)
In the end, the product we are shipping is not "lsif support", it's "rich code experiences in the web". We should start with that and move backwards. If that indicates lsif is a poor choice because of design and cogs, we can make something else.
💡 Bonus/misc performance idea: I noticed here a task is being started for each document (in a project or solution?). I'd imagine this could be hundreds or thousands of tasks. If so, this means we could have several hundred documents' worth of context that all needs to be held in memory at once, since each pending task holds on to its document's state. This would seem to dramatically increase the amount of memory used without improving throughput. Would it be more memory efficient to instead use a .NET Channel or Microsoft.VisualStudio.Threading.AsyncQueue + Environment.ProcessorCount threads to saturate CPU?
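A hedged sketch of that suggested bounded pipeline, using `System.Threading.Channels` (`IndexDocumentAsync` and `documents` are hypothetical names, not APIs from this codebase):

```csharp
// Bounded channel: the producer awaits once `capacity` items are queued, so
// only roughly ProcessorCount documents' worth of state is alive at any time.
var channel = Channel.CreateBounded<Document>(Environment.ProcessorCount * 2);

var workers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(async _ =>
    {
        await foreach (var document in channel.Reader.ReadAllAsync())
            await IndexDocumentAsync(document); // hypothetical per-document work
    })
    .ToArray();

foreach (var document in documents)
    await channel.Writer.WriteAsync(document); // applies backpressure when full

channel.Writer.Complete();
await Task.WhenAll(workers);
```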
If this is running .NET Core then document parallelization tends to effectively saturate all cores with linear speedup. If this is netfx, then it tends to be awful.
Thanks for the clarification! Didn't realize there were existing optimizations in place.