Chunk, scope, or shard LSIF for better processing throughput #1755

gundermanc opened this issue Jun 23, 2023 · 23 comments
Labels: feature-request, lsif
@gundermanc (Member) commented Jun 23, 2023

I work on a tool that parses and interprets LSIF content. Recently we've been looking for ways to improve throughput when parsing large quantities of LSIF and have encountered a couple of bottlenecks that seem like they could be most effectively resolved with spec changes.

Lots of Stuff in Memory

Because LSIF is a graph, we pretty much have to process the entire output, often holding much of it in memory. This can severely limit the scalability of tools that consume LSIF output from large workspaces.

I have evaluated using begin and end $events as signals to unload items that are completed from memory. Using this mechanism, we can seemingly drop any of the following from memory after the owning document ends:

  • range vertices
  • contains edges
  • next edges

Other items, like resultSets, monikers, and item edges, can only be dropped if they are local to a particular document. For example, a "unique": "document" property indicates that the item belongs to the document and can be unloaded when the document ends. Some generators, like Roslyn, however, do not use such narrow scoping, so every resultSet must be treated as potentially global (cc: @jasonmalinowski).

Together, ranges, contains, and next edges make up 60% of the lines in a typical LSIF file. Dropping these at an end event yields a solid reduction in memory, but the memory required to process an LSIF dump still grows linearly with the size of the workspace, albeit at a slower rate.

For example: it's not unusual to see workspaces that generate 7-12GB of LSIF. Processing these, we end up with very long-running parses and CPU time to spare, but absolutely no memory to process another LSIF in parallel.
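To make the unloading concrete, here is a minimal C# sketch of releasing document-scoped vertices and edges when the owning document's end $event arrives. This is not the actual tool's implementation; the class and member names are illustrative assumptions.

```csharp
using System.Collections.Generic;

class LsifIngester
{
    // Per-document buckets of ids for range vertices, contains edges, and next edges.
    private readonly Dictionary<long, List<long>> _documentScopedIds = new();

    public void OnDocumentBegin(long documentId) =>
        _documentScopedIds[documentId] = new List<long>();

    // Called for ranges, contains edges, and next edges attributed to an open document.
    public void OnDocumentScopedElement(long documentId, long elementId) =>
        _documentScopedIds[documentId].Add(elementId);

    // Once the document's end $event is seen, everything attributed to it can be
    // released. Result sets and monikers generally cannot, unless they are known
    // to be document-unique.
    public void OnDocumentEnd(long documentId) =>
        _documentScopedIds.Remove(documentId);
}
```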

No Parallel Traversal

In its current form, LSIF is very hard to traverse and process in parallel. This is related to its graph-like nature, in that you have to have most of the nodes and edges available to be sure your parse is semantically correct.

Attempts to ingest in parallel are typically defeated by the inability to guarantee that a thread parsing a particular group of nodes and edges, or a particular document, isn't missing relevant context because it sees only a subsequence of vertices and edges. There are ways to solve this, but most that I've tried run into lock contention or other tradeoffs that negate the benefits.

The Goal

The goal of this discussion is to make incremental changes to the LSIF spec and the surrounding guidance to better optimize it for ingestion throughput. cc: @dbaeumer

The ideal approach for performance

The ideal from a performance standpoint is for language service information to be represented in uniformly sized chunks (perhaps one per document or per project), perhaps in separate LSIF graphs altogether. Each chunk should be independent of the others and able to be processed in any order. Linkages between the documents could be expressed via monikers, hashes, IDs, or another foreign-key relationship, which can be resolved when querying the backing file/datastore/etc.

JSON is also quite expensive to parse, largely due to memory allocation: in C# it accounts for approximately 20% of the LSIF parser's thread time and the majority of GC-triggering allocations, making alternate encodings enticing from an optimization perspective.
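As a side note on the allocation cost, one common C# mitigation (an assumption about approach, not necessarily what the tool does today) is to scan each UTF-8 line with Utf8JsonReader and compare tokens with ValueTextEquals, which avoids materializing a string per property:

```csharp
using System;
using System.Text.Json;

static class LineClassifier
{
    // Returns true if this LSIF line is a vertex with label "range".
    // ValueTextEquals compares against the raw UTF-8 token without allocating a string.
    public static bool IsRangeVertex(ReadOnlySpan<byte> utf8Line)
    {
        var reader = new Utf8JsonReader(utf8Line);
        bool isVertex = false, isRange = false;
        while (reader.Read())
        {
            // Only look at top-level properties of the line's root object.
            if (reader.TokenType != JsonTokenType.PropertyName || reader.CurrentDepth != 1)
                continue;

            if (reader.ValueTextEquals("type"))
            {
                reader.Read();
                isVertex = reader.ValueTextEquals("vertex");
            }
            else if (reader.ValueTextEquals("label"))
            {
                reader.Read();
                isRange = reader.ValueTextEquals("range");
            }
        }
        return isVertex && isRange;
    }
}
```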

The biggest appeal of this approach is that it mirrors that of traditional build systems with separate compile and link stages: files/projects are compiled, and the results are linked by mangled name or ID, enabling effective utilization of the machine resources at pretty much any scale of project.

I base this ideal on the observation that while the data is conceptually a graph, it frequently ends up in a database, store, or alternate encoding as flat records with foreign-key relationships.

The incremental step

The ideal performance approach is a radical departure from the existing model that may be difficult for existing partners to adapt to and would break compatibility.

An incremental alternative is to maintain the existing graph format for LSIF, but 1) make targeted changes to enable generators to chunk their output and 2) update the spec's language to include a best-practices section that describes these optimizations in greater detail.

Begin and end $events are almost exactly what is needed to solve this problem. Their one drawback is that they can be interleaved/overlapping. If they were instead nested, like brackets in code, a trivial, scope-aware LSIF parser could use a stack to keep track of when things go in and out of scope. A savvy implementation could even execute peer scopes in parallel.
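A sketch of what a consumer could do if scopes were guaranteed to nest. The beginScope/endScope naming below follows the proposal in this issue and is an assumption, not part of the current spec:

```csharp
using System.Collections.Generic;

enum ScopeKind { Global, Project, Document }

class ScopeStack
{
    private readonly Stack<(ScopeKind Kind, long Id)> _scopes = new();

    // A "beginScope" event simply pushes; no bookkeeping of interleaved documents is needed.
    public void BeginScope(ScopeKind kind, long id) => _scopes.Push((kind, id));

    // Anything emitted between begin and end belongs to whatever is on top of the stack.
    public (ScopeKind Kind, long Id) Current =>
        _scopes.Count > 0 ? _scopes.Peek() : (ScopeKind.Global, 0);

    // On "endScope", everything attributed to the closing scope can be released or
    // handed to a worker for parallel post-processing.
    public (ScopeKind Kind, long Id) EndScope() => _scopes.Pop();
}
```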

Specifics

$events are updated to either disallow overlapping/interleaving (a breaking change), make non-overlapping output an optional capability in the capabilities vertex (#1753, a non-breaking change), or add new event names like "beginScope" and "endScope" with different semantics.

To ensure proper behavior after a scope terminates, some rules apply:

  • ResultSets spanning multiple documents in a single project via reference edges must not be contained in a document partition. They must be contained in the enclosing project partition, if any.
  • ResultSets spanning multiple projects via reference edges must not be contained in a project partition. They go in the global scope.
  • Where possible, LSIF generators that work on large projects should decouple reference graphs at some level (perhaps the project level) and link them instead via a common moniker.
@dbaeumer (Member) commented Jun 26, 2023

A couple of observations from implementing an indexer for TypeScript and from ingesting dumps into SQLite DBs.

The memory needed to ingest a dump can only be kept constant if you use the IDs already in the dump to identify non-local symbols. This id is a number right now. Using it makes the dump useful by itself, however you can never relate symbols across dumps. If you want to do so, ingesting the dump has basically the same problem as creating it: you need to keep a unique ID around for non-local symbols to be able to identify them, otherwise you can't relate them later on. So the memory size is linear in the number of non-local symbols.

I also see the desire to have the graph better partitioned upfront; however, asking the LSIF tool to do so puts more memory pressure on that tool, and from my experience implementing one, memory is already a big issue in these tools since they need to keep a lot of compiler information around (AST nodes, bindings, symbols, ...). So one driving factor of the LSIF format was to allow these tools to free memory as soon as possible.

Having non-overlapping documents in the output is actually not possible if the language producing the graph supports cyclic dependencies between files (which TS and C# do). So for those kinds of languages there will always be cases where more than one document is open.

When it comes to projects, having them non-overlapping in the output is possible for TypeScript, for example, since TypeScript doesn't support cyclic project dependencies. It is implemented that way in the lsif tsc tool.

LSIF tsc even supports generating independent outputs per project and using monikers later on to relate them. E.g. if there is a workspace with three projects you can parse each project independently, create a dump, and later on link the symbols using monikers. However, measurements have shown that creating three dumps is more time-consuming than creating one dump since you pay the initial parsing cost n times. Doing this IMO only makes sense if you can do an impact analysis at the project level and only parse those projects that need to be recomputed. In addition it would be helpful to parallelize those. However, TypeScript has no support for such features and I haven't spent the time so far to implement them.

What we can definitely look into for TypeScript is the following:

  • make sure that we output documents in a way best suited for post-processing, for example by finishing a document as soon as possible so that it can be post-processed quickly and its data can be freed.
  • make sure that we output the files according to their dependency graph so that result sets are not dangling for too long.

We can also change the output format to make things easier. What comes to my mind is the following:

  • annotate result sets with their visibility (e.g. whether they are document local). Currently this is encoded in the moniker, but that complicates things for scheme-unique monikers.
  • we can emit a combination of a graph together with back pointers (e.g. have a document pointer on the range, and a project pointer on the document). This will save quite a few edges, but it will make the dump harder to understand, so we need to get a feeling for whether this is worthwhile. For 1:N edges we would still always need to emit an edge or something comparable. (A rough sketch of such a back-pointer shape follows.)
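To illustrate the back-pointer idea, here is a hypothetical C# shape for such vertices; the type and property names are illustrative assumptions, not spec:

```csharp
// A range carries its owning document id and the document carries its project id,
// replacing most 1:1 "contains" edges.
record Position(int Line, int Character);

record RangeVertex(long Id, Position Start, Position End, long DocumentId);

record DocumentVertex(long Id, string Uri, long ProjectId);

// 1:N relationships (e.g. item edges from a referenceResult to many ranges)
// would still need explicit edges or an inline id array.
record ItemEdge(long Id, long OutV, long[] InVs, long DocumentId);
```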

These are only a first set of thoughts and I am open to any changes that make dumps easier for clients to post-process. But I also think that there might not be one single solution, since this IMO also depends on how the information is finally stored in the DB. And we shouldn't give up on the goal that these dumps remain relatively easy for the compiler tool itself to produce.

@gundermanc (Member Author) commented Jun 26, 2023

Thank you for fielding my questions!

Is it possible that several of the constraints that you have identified here are specific to the TypeScript ecosystem and/or the TS compiler or optimize for tradeoffs unique to these?

e.g.:

So the memory size is linear to the amount of NON local symbols.

While this is true, my expectation would be that processing more, smaller dumps, instead of one big one, would lead to lower process memory utilization overall, even if the dump on disk ends up bigger. For C# at least, this reduction in process memory can mean fewer GCs and fewer cache misses, and it enables us to parallelize processing, making ingestion faster.

There are also ways to optimize further for size, e.g., by inlining the moniker as a property in the ResultSet, when there is just one, instead of writing out an edge and vertex for it, but IDK if that's strictly necessary, unless we find that IO is a limiting factor.

however asking the LSIF tool to do so puts more memory pressure on that tool and from my experience implementing one, memory is already a big issue in these tools since they need to keep a lot compiler information around (AST node, bindings, symbols, ).

This in particular sounds like it might be specific to TypeScript-like ecosystems.

When building your code in C#, there's typically one call to the compiler, csc.exe, per project built. This happens in parallel, with multiple instances of the compiler running at any one point in time, building separate projects. This means that a single instance of the compiler holds in memory only the context relevant to that one project at a time, and binds its dependencies via a unique type identifier (constructed from the fully qualified type name, assembly, public key token, version, etc.) and metadata from DLLs built by previously compiled projects.

This type ID becomes the LSIF 'moniker' for public types, but is also known for all internal types as well. My understanding is that under the hood the C# LSIF tool does something very similar, albeit all in the same process. These compilations happen with some degree of parallelism at the project level, and then are multiplexed into a single LSIF output stream.

We may save on the need to embed internal monikers by making one large dump, but we do so at the cost of multiple projects having to be held in memory to draw the edges between them, and lock contention where the individual project compilation outputs are multiplexed into the single output stream and result ID generator.

With a clearer pattern for partitioning dumps, we will be able to both produce LSIF one C# project at a time as well as consume it in a similar fashion.

By coincidence, Roslyn already exhibits the behavior I spec-ed above, purely to avoid the lock contention around fine-grained multiplexing of projects into a single stream. It's apparently faster to do it in big chunks. @jasonmalinowski mentioned it might be even faster if we could avoid this contention altogether by writing to multiple streams.

However measurements have shown that creating three dumps is more time consuming than creating one dump since you pay the initial parsing cost n times.

This also seems like an ecosystem-specific concern. I imagine that any ecosystem that imports libraries via direct-includes may encounter this issue.

I would imagine this spec change as something that the LSIF generator can choose to opt into, if it's beneficial. Enterprise C# and C++ repos commonly have thousands of projects and millions of lines of code. Holding all of this in memory doesn't scale. I'd imagine this is less common in TS projects, which are likely smaller or more modular. TS not supporting this feature does not negate its benefit for C# and C++ as far as I can tell and I'm ok with TS LSIF dumps being a little bit chunkier if that ends up being the better performance tradeoff.

annotate result sets with their visibility

This seems like a net positive overall but it doesn't negate the need for some sort of chunking, I think. Without the ability to parallelize ingestion we end up needing to allocate a lot of compute, which sits mostly idle, save for one thread, and with no memory to spare.

@gundermanc (Member Author):

Having non overlapping documents in the output is actually not possible if the language producing the graph does support cyclic dependencies between files (which TS and C# does). So there will always be the case that more than one document might be open for those kind of languages.

The partitioning I mention doesn't necessarily have to happen at the document level. Doing it per project or arbitrary group of files would be just as useful, as long as each partition can be ingested in parallel.

Using C# as an example, the files can have circular dependencies, but projects typically can't without special build magic.

@dbaeumer (Member):

It looks like I misunderstood your request.

The LSIF specification fully supports chunking the output per project and stitching the chunks together again using monikers. It is one of the main ideas to support having references across repositories. So you can use the same approach inside a repository (e.g. LSIF is supposed to work if every project sits in its own repository, assuming that there are no cyclic dependencies between projects; if there are, the cycle needs to sit in its own repository right now).

What I was trying to say was that I tried this with TypeScript (e.g. creating a dump per project) and, due to the limitations with TypeScript, I didn't see a huge advantage. This of course might be completely different with C# since it sounds like you have support to parallelize the build and to do some sort of change impact analysis.

What I think we should add to the spec is some information about cyclic dependencies (between documents and between projects) to allow tools that parse an output containing n projects to do some optimizations.

@gundermanc (Member Author):

The LSIF specification fully supports chunking the output per project and stitching them together again using monikers. It is one of the main ideas to support having references across repositories.

Are you referring to the implicit ability to link references from separate LSIF dumps? I think this is what I want conceptually but I'm looking for a way to make it a more explicit recommendation in the spec for ecosystems that can support it so it doesn't end up something specific to my use case.

I suggested the 'partition' notion above largely because it facilitates this sort of chunking in the existing single-LSIF-stream model we use. I tested it out, chunking by project, and processing projects in parallel, and found it makes processing ~33% faster in my case using Roslyn as the generator and enables us to hold significantly less in memory.

Multiple independent streams or units to parse would help parallelize even further by eliminating contention around stuff 'outside' any partitions but I want to make sure whatever optimizations we adopt for our uses end up part of the official spec.

What I think we should add to the spec is some information about cyclic dependencies (between documents and between projects) to allow tools that parse a output containing n projects to do some optimizations.

What did you have in mind?

@dbaeumer (Member) commented Jul 5, 2023

Are you referring to the implicit ability link references from separate LSIF dumps? I think this is what I want conceptually but I'm looking for a way to make it a more explicit recommendation in the spec for ecosystems that can support it so it doesn't end up something specific to my use case.

Actually I tried to make this explicit here as a concept, however I never forced a tool to support this.

https://microsoft.github.io/language-server-protocol/specifications/lsif/0.6.0/specification/#multiProjects

A tool can still produce a dump without any monikers. The dump would then still be usable to reason about the repository / project itself, but not across repositories or projects. I would like to keep this ability, however I see the need that ingestion tools would like to know this upfront. We could add some additional metadata at the beginning of the dump to signal this.

There is one additional thing I want to point out that helps better understand the difference between one dump for n projects versus n dumps, one per project. Consider a system with three projects L, P1 and P2, where P1 and P2 both depend on L, and L exports a function foo.

In the case of one dump for the three projects there is exactly one result set and moniker in the dump for symbol foo and all three projects reference this vertex. So the symbol foo needs to be kept in memory until the end of the dump is reached, since you don't know whether another project references it (there could be a project P3 that is part of the dump but does not use foo). We could ask the tool to emit some information like 'end of use of foo', however computing this puts additional requirements and implementation effort on the emitting tool (if reasonably possible at all; in the above case the emitting tool needs to scan all projects to make that statement).

In the case of n dumps, one per project, each dump has a result set and moniker for symbol foo. In project L the symbol is marked as exported and in P1 and P2 as imported. This helps stitch these symbol declarations and references together again. But it comes at the price of larger dumps and additional computation to match the symbols.
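A rough sketch of that matching step, with illustrative types and an assumed (scheme, identifier) join key: the exported moniker for foo in L is grouped with the imported monikers in P1 and P2.

```csharp
using System.Collections.Generic;
using System.Linq;

// Kind would be "export" in L and "import" in P1/P2; Dump identifies which per-project dump it came from.
record Moniker(string Scheme, string Identifier, string Kind, string Dump);

static class MonikerLinker
{
    // Group exports and imports that share the same scheme + identifier,
    // e.g. L exports a symbol that P1 and P2 import under the same identifier.
    public static Dictionary<(string Scheme, string Identifier), List<Moniker>> Link(IEnumerable<Moniker> monikers) =>
        monikers
            .GroupBy(m => (m.Scheme, m.Identifier))
            .ToDictionary(g => g.Key, g => g.ToList());
}
```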

I am open to recommendations, but I couldn't come up with an easy-to-implement partitioning model for the one-dump-for-n-projects case that allows freeing memory for symbols visible outside of a document.

@gundermanc (Member Author):

Actually I tried to make this explicit here as a concept

This does seem to imply an existing generator could emit multiple distinct LSIF outputs and we can join them all together later...

...what isn't clear in the existing spec is:

  • What is best practice for an LSIF generator with regard to perf + scalability, e.g. when should an LSIF generator prefer outputting a single large dump vs. smaller dumps.
  • The examples of LSIF generators I am familiar with all output to STDOUT. With multiple projects being built as separate dumps, how do we represent the boundaries between each dump in STDOUT? Can/should the LSIF spec extend to behaviors of the tool too, like how multiple LSIF dumps should be laid out on disk?

never forced a tool to support this.

I think it's fine if it's an opt-in thing. For now my first priority is Roslyn, though we'd ideally settle on a way to do what I'm trying to do that works for any language that deems it a perf benefit.

In the case of n dump per project each dump has a result set and moniker for symbol foo

My understanding from @jasonmalinowski is that this is already the case in Roslyn, even when LSIF is generated as a single large dump.

This is largely due to how compilation works in .NET: dotnet projects are compiled independently and in parallel and then stitched together by joining on fully qualified type references. If Roslyn de-dupes ResultSets, it has to keep every symbol it sees in a project in memory until the end so it can be de-duped vs. just flushing when the project is done.

But it comes at the price of larger dumps and additional computation to match the symbols.

For producers and consumers that support parallelism and do not need to re-parse files (as TypeScript does), perhaps the added size is still a net benefit?

In the case of my tool, we have a 14-core machine doing the work. Parsing a single small LSIF dump can easily consume GBs of memory but only makes effective use of a single core for most of the processing.

Peak memory usage ends up being the limiting factor preventing us from processing more LSIFs in parallel, but smaller LSIFs or some sort of scoping/partitioning would enable us to parallelize across all of those cores and lead to higher processing throughput, I think, even if the LSIF is a bit larger.

If size is thought to be an issue, an alternate encoding may help offset the added cost. Anecdotally, garbage collection from JSON deserialization ends up being a significant portion of the wall clock time of my tool (~10-20 percent) so there is some opportunity here where perf is concerned.

I am open for recommendations but I couldn't come up with an easy to implement partitioning model for the one dump per n project case that allows to free memory for symbols visible outside of a document.

I suspect what works well is largely going to depend on the compilation model of the language. In C# for example, compilation happens project by project in parallel. Similar models exist in compile and link languages, like C and C++. I'd imagine any of these could be made to do per-project indexes, at the cost of having some potential duplicates between projects.

💡 As a thought experiment: what if you took the existing TypeScript generator, with its overlapping begin and end events, and changed it slightly so that each project's vertices and edges are emitted to a different file stream instead of a single stream?

The processing of the projects and documents is interleaved as it is today, but in the end, you end up with n project LSIFs, albeit with inter-file relationships in the 'global scope'. This 'partitioned LSIF' is parallelizable at the project level, even if there is some global knowledge that is shared between projects.
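A minimal sketch of that experiment, assuming the generator can tag each emitted line with its owning project; the writer-per-project layout and file naming below are assumptions, not existing tool behavior.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class PartitionedEmitter : IDisposable
{
    private readonly Dictionary<long, StreamWriter> _writers = new();
    private readonly string _outputDirectory;

    public PartitionedEmitter(string outputDirectory) => _outputDirectory = outputDirectory;

    // Route each project's vertices and edges to its own stream while keeping
    // today's interleaved emission order.
    public void EmitLine(long projectId, string lsifJsonLine)
    {
        if (!_writers.TryGetValue(projectId, out var writer))
        {
            writer = new StreamWriter(Path.Combine(_outputDirectory, $"project-{projectId}.lsif"));
            _writers[projectId] = writer;
        }

        writer.WriteLine(lsifJsonLine);
    }

    public void Dispose()
    {
        foreach (var writer in _writers.Values)
            writer.Dispose();
    }
}
```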

@gundermanc (Member Author) commented Jul 6, 2023

If it's of interest, I found that I got a 30% improvement in throughput by ingesting a 600 file C# project in parallel vs. on a single thread. Here's a prototype implementation using something like the beginScope proposal above: https://devdiv.visualstudio.com/DevDiv/_git/VSCloudKernel/pullrequest/480813.

Note that this is still one big LSIF file and there is more to gain by splitting into separate files so that the cost to deserialize and GC is split across more threads.

@gundermanc (Member Author) commented Jul 7, 2023

While we iterate on potential improvements, I've updated my tool to match the spec's behavior. Ingesting that same 93MB LSIF file with 600 documents, the total unreleased memory at the end goes from 80MB in the 'global' scope to 161MB in C# files and peak memory goes from ~250-300MB up to ~450MB.

Much of this is monikers and result sets that aren't released due to C#'s usage of the scheme uniqueness and inability to otherwise tell when they go out of scope.

cc: @jasonmalinowski

@dbaeumer (Member):

I was thinking about the dump-per-project idea a little more and I am actually not a big fan of asking tools to do that. The main reasons are:

  • increases complexity
  • things can't be piped anymore

I would rather add additional data to the front of the dump that tells a processing tool:

  • whether project dumps are self contained and separated in the dump output
  • whether there are cyclic dependencies between projects / documents

Then a first tool can basically act like a tee and either write things to files or pipe them to other processes. To avoid this tee-tool producing a lot of garbage, we could implement it in Rust to get constant memory consumption, since we can reuse the same buffer for reading lines and dispatching them.

@jasonmalinowski (Member):

Catching up on a few bits here:

Some generators, like Roslyn, however, do not use such narrow scoping, and so every resultSet must be treated as potentially global (cc: @jasonmalinowski).

Ah, interesting, I never thought of the resultset uniqueness as being helpful for scoping, but maybe this is my misunderstanding of the intent there. We're just marking all symbols as unique because the moniker is unique, in the sense that a local named 'foo' in one document and a local named 'foo' in another document will still have unique monikers -- it means it's not up to the consumer to decide these are actually different symbols. Rather the moniker kind has things like 'local' which more directly implies scope.

If uniqueness is really supposed to mean "when I can drop the memory" sure I can change it, but then I don't understand what uniqueness really means anymore. I just thought it was more for avoiding name collisions across tools or different runs.

@jasonmalinowski (Member):

I was thinking about the dump per project a little more and I am actually not a big fan of asking tools to do that.

I agree there's some nice simplicity about the single output and I similarly don't want to abandon that as a simple example, but it does mean we can't scale well on larger machines. In the Roslyn case, even if project B depends on A, we can still be doing lots of indexing of those in parallel -- for example, once we've parsed project A's source file and built our symbol table, we can be computing references / semantic tokens / go to definition for all the contents of A while we're starting to process B.

increases complexity

Right now the single output actually creates complexity if the goal is to be efficient: the Roslyn indexer wants to be as multi-threaded as possible, but still has to synchronize on the single output lock in the end. The code in the end isn't hard, but it's code that just wouldn't have been written in the first place if we did have multiple project outputs.

@dbaeumer (Member):

@jasonmalinowski can you provide me with more details on how this works in Roslyn? Do you have one process with n threads, each basically parsing one project? Or do you have n processes, one per project?

Uniqueness is not about dropping memory. It is about how unique a moniker is. For example, two repositories can both have a project p with a document d with a function f. A typical moniker for such a function will look like p/d/f. So the moniker is only unique inside the project since the same moniker will be generated for the other repository.

Monikers are used to relate n symbols to see whether they ultimately point to the same symbol for cross-repository searches. In the above case, a search for references to p/d/f in repository one should not list p/d/f in repository two. Could you explain how you ensure a unique moniker for two repositories, assuming that the second is a clone of the first?

@gundermanc (Member Author) commented Jul 11, 2023

I never thought of the resultset uniqueness as being helpful for scoping, but maybe this is my misunderstanding of the intent there.

Adding to this, this was not obvious to me either, until explained.

increases complexity

FWIW, there are tradeoffs either way: while it is more complicated in terms of lifecycle (where is the LSIF written, what is the order of consumption, etc.), for systems that support parallel consumption it becomes much easier to optimize.

In my scenario, for example, we end up zipping the LSIF, so the consuming process has to unzip the LSIF stream, deserialize each line, and then interpret the line. There are costs at each step that add up, resulting in impact to overall throughput.

When LSIF can only be processed on a single thread, we have to micro-optimize these costs, but divided across n reasonably small files, we can skip the micro-optimization and instead fan out across the machine's cores. With the current model this often isn't possible because parsing the LSIF frequently requires most (or much) of the machine's available RAM, so we have to do a small number at a time or risk running out of memory.

The tee file might be a reasonable compromise, but it does require a bit more thought on the part of the consumer (and potentially the producer, if it's multi-threaded) to avoid the perf bottlenecks of multiplexing streams and reading from a single file at multiple locations.

dropping memory

With regards to dropping memory, with that PR in my previous comment, we are hitting OutOfMemoryExceptions much more frequently. I dug in last night and found I was able to drop quite a few more things from memory. In the end, what remained was the monikers, and a large number of duplicate strings from the various JSON constructs.

  • Manually managing the lifetime of the context we build is error-prone: if I free too much, we fail to parse the LSIF file; if I don't free enough, we OOM. In the best-case scenario, we consume GBs of RAM parsing a GB LSIF file. Something akin to the nested scopes proposal I gave would dramatically simplify the rules for consumers of the format.

  • The duplicate strings problem is endemic to .NET: lots of apps have this problem and they have to explicitly de-dupe strings. It's more of a problem in this case though due to the size. With smaller inputs, we could probably ignore this problem.

@jasonmalinowski (Member):

Are you having one process with n threads each basically parsing one project?

@dbaeumer: this. The Roslyn model allows for us to analyze multiple source files in parallel -- once the top-level symbol structure is created (i.e. the names of classes and methods) then the actual binding of individual methods is fully parallelizable.

Could you explain how you ensure a unique moniker for two repositories assuming that the second is a clone of the first?

In the .NET world at least, our monikers are all prefixed by the assembly (i.e. DLL) name, which we consider effectively unique since that's what actually also defines uniqueness at runtime. In the clone case, yes, it's not clear which is the "original" definition for something like go to definition, but that's not a problem that the indexer can solve; that's an organizational problem! That's probably quite different than what TypeScript (or say even C++) can do, since the model is just very different.

@dbaeumer (Member):

Actually there is a misunderstanding of how monikers / uniqueness should work which might be caused by bad naming on my end. Here is what I had in mind when designing them:

There are currently five uniqueness levels (document, project, workspace, scheme, global) and they express in which context the moniker can be related to other symbols. For example, a moniker with uniqueness document can only be related to other symbols with the same moniker in the same document. So these monikers, when stored in a DB, only need to be queryable inside the same document. Project and workspace have the same semantics. So a moniker with a uniqueness value of workspace (e.g. repository) should not be matched with a moniker from another workspace (e.g. repository). So the first three levels can be treated as local to the repository for which the dump was created, since the intention is not to relate them with monikers in other repositories. Additionally, the first three levels will typically be emitted by the tool that understands the source code.

Monikers with uniqueness scheme are unique within that scheme. Schemes are usually introduced by a packaging system; for C# such a system would be nuget, for TS/JS it is npm. Those monikers can be related across repositories. They are typically not generated by the tool that understands the source code but by a tool that understands packaging files, publishing, and version numbers. These monikers need special handling in a DB since they need to be stored globally to relate symbols across repositories.

Another idea here is that monikers with uniqueness scheme don't replace an existing moniker but annotate it (using a next link). At least in TypeScript, a project can have 500 exported symbols but only 10 are accessible when the module is published to npm, and due to the main property in the package.json file they can even have a different name / path. So the 500 monikers of the project can be stored local to the project dump whereas the 10 npm monikers need to be stored globally.

The uniqueness level global is for languages that use URIs to refer to libraries, like deno. But again, these monikers are annotations on existing language-specific monikers to relate symbols across repositories.

Regarding

The Roslyn model allows for us to analyze multiple source files in parallel -- once the top-level symbol structure is created (i.e. the names of classes and methods) then the actual binding of individual methods is fully parallelizable.

@jasonmalinowski Still not clear what that means if you have n projects. I interpret this as n processes where each process has m threads. Is this correct?

@jasonmalinowski (Member):

@jasonmalinowski Still not clear what that means if you have n projects. I interpret this as n processes where each process has m threads. Is this correct?

We can process multiple projects in a single process, and in many cases do.

@jasonmalinowski (Member) commented Jul 12, 2023

So, thinking about the moniker scheme more as @dbaeumer has described it: I could move to document/project scopes for monikers for private members and things like locals. Even though we don't expect the actual string of the moniker to collide with other projects in meaningful ways, we can still say that, since it still doesn't need to have any global scope because nothing else could reference it. So if that helps for database optimization, great.

For members that can be seen from other projects though (including internal/protected) we'd still have to go with a scheme scope since we don't really have a meaningful separation otherwise. There's no "export" concept like in TypeScript.

@dbaeumer (Member):

The idea of the scheme scope is not for exported members. It is for members that are visible to other projects. For C# the idea is as follows, assuming that project P1 produces a library (assembly):

  • all members marked as internal and their children should NOT have a scheme moniker since they can't be seen outside the assembly.
  • private members and all their children should NOT have a scheme moniker either.
  • public members and their public / protected (including protected internal) children SHOULD have a scheme moniker (a short illustration follows this list).
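A small C# illustration of those rules for an assembly produced by P1. The comments reflect my reading of the guidance above, not emitted behavior of any existing indexer.

```csharp
public class Widget                      // public type: scheme moniker
{
    public void Render() { }             // public member of a public type: scheme moniker
    protected internal void Resize() { } // visible to derived types outside the assembly: scheme moniker
    internal void Recycle() { }          // internal: no scheme moniker
    private int _state;                  // private: no scheme moniker
}

internal class WidgetCache               // internal type: no scheme moniker for it or its children
{
    public void Clear() { }
}
```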

The uniqueness property tells how unique the moniker is when being linked against another moniker. Relating symbols in LSIF is comparable to what a linker does when linking code. A linker, for example, sees a different set of symbols when linking obj files together compared to when linking obj files into a library. You can think of the uniqueness property as selecting different link tables that are produced for different linking scenarios (e.g. scheme and global for cross-repository linking, project and workspace for linking inside the same repository, and document for linking inside the same document).

The idea is not to denote how unique the moniker is in terms of its value. Even in TypeScript I could compute a scheme or even global unique moniker for private symbols by simply factoring the repository URL and the hash of the document into the moniker.

If someone has a better idea for naming I am absolutely open to changing it. Maybe scope would have been a better name (I used it in the early days but there was confusion as well since people were thinking about scoping of variables).

Why is there a uniqueness level scheme? The reason was to allow every language to introduce its own moniker scheme to avoid matches across languages (e.g. a moniker from TypeScript coincidentally matching a moniker from C#).

What does that mean for DB storage:

  • only scheme and global monikers need to go into global storage / table
  • all other monikers can be kept local to the repository that was indexed. These monikers shouldn't be considered when doing cross-repository searches. (A tiny sketch of this routing follows.)
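A tiny sketch of that storage routing, with illustrative enum and method names:

```csharp
enum MonikerUniqueness { Document, Project, Workspace, Scheme, Global }

static class MonikerStorage
{
    // Only scheme and global monikers go into the cross-repository storage/table;
    // everything else stays local to the indexed repository.
    public static bool BelongsInGlobalTable(MonikerUniqueness uniqueness) =>
        uniqueness is MonikerUniqueness.Scheme or MonikerUniqueness.Global;
}
```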

@jasonmalinowski (Member):

If someone has a better idea for naming I am absolutely open to change it.

I think the name is fine (or at least, I don't have a better one). I think we're generally on the same page now.

@dbaeumer does make one good point though, so question for @gundermanc: right now the C# LSIF is generating monikers for things that strictly cannot escape a project. It looks like we are skipping them for locals and a few other easy things, but private members at least don't necessarily need monikers either. Does that cause any problems if we omit monikers for more things?

@gundermanc (Member Author) commented Jul 14, 2023

but private members at least don't necessarily need monikers either. Does that cause any problems if we omit monikers for more things?

I think that's ok... the one scenario that may be potentially impacted is searching references across projects.

Consider this:

```csharp
private string foo;

private class Bar { }
```

In this case I think we do need an edge from the range at foo to the System.String moniker but we're ok without the Bar moniker for that class definition as long as references to it are indicated via reference edges.

I think the rule is that public types should always have a moniker/edge, even when the usage is in an internal, private, or local member/reference.

❓ Do we also need monikers for internal types, since references can be cross-repo via [InternalsVisibleTo]?

If we can make this work, we can probably bring down memory usage during ingestion significantly! @jasonmalinowski, should I file an issue on Roslyn?

@dbaeumer (Member):

Regarding the example:

```csharp
private string foo;

private class Bar { }
```

IMO there should be a result set for string and foo. The result set for string should have a scheme moniker and the range of string should point to it. The result set for foo should be document local and shouldn't have a scheme moniker.

@dbaeumer (Member):

Regarding [InternalsVisibleTo]: I think it depends on where the assembly that gets access to the internals lives. If it is in the same repository, I would still not generate scheme monikers. If it is outside, you have to.
