Reuse the workspace to better take advantage of caches when indexing #69268

Merged
13 changes: 9 additions & 4 deletions src/Features/Lsif/Generator/Program.cs
@@ -86,14 +86,16 @@ private static async Task GenerateAsync(
outputWriter = outputFile;
}

- using var logFile = log != null ? new StreamWriter(log) : TextWriter.Null;
+ using var logFile = log != null ? new StreamWriter(log) { AutoFlush = true } : TextWriter.Null;
ILsifJsonWriter lsifWriter = outputFormat switch
{
LsifFormat.Json => new JsonModeLsifJsonWriter(outputWriter),
LsifFormat.Line => new LineModeLsifJsonWriter(outputWriter),
_ => throw new NotImplementedException()
};

+ var totalExecutionTime = Stopwatch.StartNew();

try
{
// Exactly one of "solution", or "project" or "compilerInvocation" should be specified
@@ -141,7 +143,7 @@ private static async Task GenerateAsync(
}

(lsifWriter as IDisposable)?.Dispose();
- await logFile.WriteLineAsync("Generation complete.");
+ await logFile.WriteLineAsync($"Generation complete. Total execution time: {totalExecutionTime.Elapsed.ToDisplayString()}");
}

private static async Task LocateAndRegisterMSBuild(TextWriter logFile, DirectoryInfo? sourceDirectory)
@@ -264,11 +266,10 @@ private static async Task GenerateFromBinaryLogAsync(
await logFile.WriteLineAsync($"Load of the binlog complete; {msbuildInvocations.Length} invocations were found.");

var lsifGenerator = Generator.CreateAndWriteCapabilitiesVertex(lsifWriter, logFile);
- var workspace = new AdhocWorkspace(await Composition.CreateHostServicesAsync());
Member: ⚠️ Workspace needs to be disposed
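The reviewer's point can be illustrated with a minimal sketch (this is not the PR's actual code): AdhocWorkspace derives from Workspace, which implements IDisposable, so a `using` declaration guarantees cleanup even if generation throws partway through.

```csharp
// Minimal sketch, not the PR's actual code: Workspace implements
// IDisposable, so a `using` declaration guarantees the workspace is
// cleaned up even if generation throws partway through.
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.Host;

internal static class DisposalSketch
{
    public static void Run(HostServices hostServices)
    {
        using var workspace = new AdhocWorkspace(hostServices);
        // ... add projects and run generation here ...
    } // workspace.Dispose() runs here, releasing workspace-level state
}
```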


foreach (var msbuildInvocation in msbuildInvocations)
Member: I'm curious if there's any benefit in parallelizing this step at all?

jasonmalinowski (Member, Author) commented on Jul 28, 2023:

My general assumption was that projects tend to have enough documents to saturate the cores. We could do this in parallel, but it would also mean additional memory usage, because we would then be holding multiple projects at once rather than just the one we're analyzing.

Member: In FindRefs we found that processing a single project at a time (ideally in topological order) and then processing in parallel within the project produced the best results.
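That strategy could be sketched roughly like this (an illustration only, not code from this PR; ProcessDocumentAsync is a hypothetical per-document worker):

```csharp
// Rough sketch of the strategy described above: walk projects one at a
// time in topological order (bounding memory to a single project), and
// fan out across the documents within each project.
foreach (var projectId in solution.GetProjectDependencyGraph().GetTopologicallySortedProjects())
{
    var project = solution.GetProject(projectId)!;

    // Parallelize within the project; only this project's trees and
    // semantic models need to be alive at once.
    await Parallel.ForEachAsync(project.Documents, async (document, cancellationToken) =>
    {
        await ProcessDocumentAsync(document, cancellationToken);
    });
}
```

The topological order matters because a project's compilation reuses the compilations of the projects it references, so processing dependencies first keeps those caches warm.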

Member: FWIW I downloaded this new version and ingested the VS Editor binlog, and it looks like it's averaging just one or two CPU cores of activity at a time.

I didn't drill much deeper than peeking at Task Manager, but I thought this was interesting, as I'd assume this is mostly CPU-bound work?

Member: It occasionally bursts higher, but I wonder if this points to a bottleneck, like that lock around STDOUT, that could be improved.
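One common way to relieve a hot lock around a shared writer — shown here only as a hypothetical sketch, not something this PR does — is to funnel output through a single-reader channel, so worker threads enqueue lines without ever contending on the stream itself:

```csharp
using System.Threading.Channels;

// Producers on any thread enqueue without blocking on the stream.
var lines = Channel.CreateUnbounded<string>(
    new UnboundedChannelOptions { SingleReader = true });

void Emit(string jsonLine) => lines.Writer.TryWrite(jsonLine);

// A single consumer owns the TextWriter, so no lock is needed around it.
async Task DrainAsync(TextWriter output, CancellationToken cancellationToken)
{
    await foreach (var line in lines.Reader.ReadAllAsync(cancellationToken))
        await output.WriteLineAsync(line);
}
```

Whether this would actually help depends on profiling; the observation above is only that the low core utilization is consistent with serialization on a shared resource.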

Member: For example, the same file will be in multiple projects, with different semantics. So you'd need to know which project you were looking through to make the meaning of a particular file be correct.

Member:

> Does the consumer side matter though?

It impacts server costs, read duration, scalability (how big of a workspace we can handle before we run out of memory on that machine), and reliability (the probability of an OOM exception).

> I would not have expected consumption to be costly since everything is computed up front.

LSIF is a graph, so one has to hold much of it in memory at a time to be able to reason about it. It can also only really be traversed by a single thread at a time, so large LSIFs mean machines consuming LSIF have a single core pinned at 100%, the rest mostly idle, and much of their RAM utilized, and they can't really process any other LSIFs simultaneously unless you can be sure those are all small enough to fit in the remaining RAM.

There are ways to work around this by allocating and tracking RAM dedicated to each in-progress LSIF ingestion, but this is more complicated, and it'd be preferable to have more, smaller LSIFs. With small, predictable sizes, the level of concurrency is more consistent, and OOMs when consuming large or unusual workspaces will de facto never happen.

> you'd need to know which project you were looking through to make the meaning of a particular file be correct

I think this already works today. AFAIK the LSIF generator creates a separate document node for each project that contains the document. This data is encoded via the "contains" edge, which indicates that the document is contained by the specified project.
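For illustration, a minimal contains relationship in line-mode LSIF might look like this (the ids and URI are made up; the vertex and edge labels follow the LSIF specification):

```json
{ "id": 1, "type": "vertex", "label": "project", "kind": "csharp" }
{ "id": 2, "type": "vertex", "label": "document", "uri": "file:///src/Shared.cs", "languageId": "csharp" }
{ "id": 3, "type": "edge", "label": "contains", "outV": 1, "inVs": [2] }
```

Per the comment above, a file linked into a second project would get its own document vertex reachable through that project's contains edge, which is what lets a consumer keep per-project semantics straight.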

Member:

Why do you have to process with a single thread? It's all immutable data. I would expect parallel processing to be trivial. After all, that's what we can do on the production side. It seems odd that consumption would have it harder.

Furthermore, I would expect deep parallelization in servers due to processing a queue of lsif files. When I worked at G, that's how things worked. It wasn't like you spun up a server to process one piece of data; you had servers dequeuing what was essentially an unbounded queue of work, so they were always saturated.

Finally, if lsif isn't a good format for processing, then we should create one that is. My presumption in using lsif as the intermediary format was that it was actually good for this. If not, we can absolutely come up with formats that are tuned for low resource usage and trivial algorithms on both construction and ingestion. :-)

Member:

> Finally, if lsif isn't a good format for processing, then we should create one that is. My presumption on using lsif as the intermediary format was that it was actually good for this. If not, we can absolutely come up with formats that are tuned for low resource usage, and trivial algorithms on both construction and ingestion. :-)

That's the gist of the issue I linked. It describes incremental changes to make it more efficient to process.

Member:

Put another way: this should be like game development. We should think about our formats first, optimizing for performance. We should not start everything with bad structures (like graphs or JSON) and then have to contort to make that decent. :-)

In the end, the product we are shipping is not "lsif support", it's "rich code experiences in the web". We should start with that and work backwards. If that indicates lsif is a poor choice because of design and COGS, we can make something else.

{
+ var workspace = new AdhocWorkspace(await Composition.CreateHostServicesAsync());

var projectInfo = CommandLineProject.CreateProjectInfo(
Path.GetFileNameWithoutExtension(msbuildInvocation.ProjectFilePath),
msbuildInvocation.Language == Microsoft.Build.Logging.StructuredLogger.CompilerInvocation.CSharp ? LanguageNames.CSharp : LanguageNames.VisualBasic,
@@ -284,6 +285,10 @@ private static async Task GenerateFromBinaryLogAsync(
var generationStopwatch = Stopwatch.StartNew();
await lsifGenerator.GenerateForProjectAsync(project, GeneratorOptions.Default, cancellationToken);
await logFile.WriteLineAsync($"Generation for {project.FilePath} completed in {generationStopwatch.Elapsed.ToDisplayString()}.");

+ // Remove the project from the workspace; we reuse the same workspace object to ensure that some workspace-level services (like the IMetadataService
+ // or IDocumentationProviderService) are kept around, allowing their caches to be reused.
+ workspace.OnProjectRemoved(project.Id);
}
}
}