
📎 Multi-file support #3307

Open
1 of 5 tasks
ematipico opened this issue Jun 28, 2024 · 7 comments
Labels
A-Analyzer Area: analyzer A-CLI Area: CLI A-Core Area: core A-Editors Area: editors S-Feature Status: new feature to implement

Comments

ematipico (Member) commented Jun 28, 2024

Preface

We want to implement a feature that allows the various features of Biome to tap into information regarding other files. At the moment, features like lint rules don't have access to information that belongs to other files. Some examples are exported bindings, exported types, types of modules, etc.

An entity we could call Project (the name is not set in stone) should provide this kind of information.

Design

Below, I will list some design aspects of the feature, which we can discuss. Don't take everything as set in stone, but absorb it and then debate it. I will do my best to outline the current architecture as accurately as I can.

First, the architecture should follow this principle:

  • Build for generic clients. Don’t assume that a terminal will only consume output using ANSI codes. Use abstractions that could be generalized for viewing in an IDE, browser, or other environments.

Actors at play

At the moment, we have the following actors at play:

  • The FileSystem trait. A generic trait meant to crawl the file system and evaluate whether certain paths (directories, files, symbolic links, etc.) should be checked. Two types implement this trait: OsFileSystem and MemoryFileSystem. The FileSystem doesn't know which files should be handled; the logic that checks whether those paths can be handled is implemented elsewhere, via another trait called TraversalContext. As you can see, the responsibilities are scattered. OsFileSystem, however, has an interner for paths, but it isn't related to the paths checked.
  • The WorkspaceServer can be seen as our query compiler. The end consumer (e.g. CLI or LSP) has control over the compiler and tells it what to do. The WorkspaceServer has some characteristics:
    • It contains the content of each file.
    • It contains the CST of each file.
    • It doesn't know which files should be checked; the end consumer does.
    • The concept of ProjectData is already in use. It knows that a client might have different projects with different settings:
```rust
pub struct WorkspaceSettings {
    /// The data of the projects
    data: WorkspaceData<ProjectData>,
    /// The ID of the current project.
    current_project: ProjectKey,
}
```

The FileSystem and the WorkspaceServer rarely interact with each other. The only moment they do is for a single operation: fs.open_file -> workspace.open_file

Possible data flow

I envision the Project to be part of the WorkspaceServer. If so, Project must be Send, Sync and RefUnwindSafe. This means that we need to choose data that meets those criteria.

Fortunately, a module graph can be built with petgraph, which implements these traits.
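petgraph's graphs are indeed `Send + Sync` when their node and edge weights are. As a dependency-free illustration of the same idea, here is a minimal sketch (hypothetical names, not Biome's actual API) of a module graph built on std collections, which satisfies the same thread-safety bounds and supports a simple cycle check:

```rust
use std::collections::{HashMap, HashSet};
use std::path::PathBuf;

/// Minimal module graph: each file maps to the files it imports.
/// HashMap<PathBuf, Vec<PathBuf>> is Send + Sync + RefUnwindSafe,
/// so a graph like this could live inside the WorkspaceServer.
#[derive(Default)]
struct ModuleGraph {
    edges: HashMap<PathBuf, Vec<PathBuf>>,
}

impl ModuleGraph {
    fn add_import(&mut self, from: impl Into<PathBuf>, to: impl Into<PathBuf>) {
        self.edges.entry(from.into()).or_default().push(to.into());
    }

    /// DFS-based cycle detection, the kind of import analysis a
    /// dependency graph makes convenient.
    fn has_cycle(&self) -> bool {
        let mut visiting = HashSet::new();
        let mut done = HashSet::new();
        self.edges.keys().any(|n| self.dfs(n, &mut visiting, &mut done))
    }

    fn dfs(
        &self,
        node: &PathBuf,
        visiting: &mut HashSet<PathBuf>,
        done: &mut HashSet<PathBuf>,
    ) -> bool {
        if done.contains(node) {
            return false;
        }
        // Re-entering a node we are still visiting means a cycle.
        if !visiting.insert(node.clone()) {
            return true;
        }
        let cycle = self.edges.get(node).map_or(false, |deps| {
            deps.iter().any(|d| self.dfs(d, visiting, done))
        });
        visiting.remove(node);
        done.insert(node.clone());
        cycle
    }
}

fn main() {
    let mut graph = ModuleGraph::default();
    graph.add_import("a.ts", "b.ts");
    graph.add_import("b.ts", "c.ts");
    assert!(!graph.has_cycle());
    graph.add_import("c.ts", "a.ts");
    assert!(graph.has_cycle());
}
```

A real implementation would use petgraph's `DiGraph` and `algo::is_cyclic_directed` instead of hand-rolling the DFS; the point is only that the data structure meets the required trait bounds.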

A Project shouldn't duplicate the information that is already present in the WorkspaceServer, e.g. file content or CST.

What we should strive for

Here are some ideas that we should consider while implementing the architecture:

  • try to compute data only when requested;
  • don't overdo it, try to compute only the data requested;
  • think about queries, where a consumer asks for data, and the provider delivers them (and ideally caches them);
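The query idea in the last bullet could be sketched as a provider that computes data on first request and caches the result. All names here are hypothetical, not Biome's actual API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// A provider that computes exported bindings on demand and memoizes
/// the result. The Mutex keeps it Sync, so it could be shared by a
/// workspace across threads.
struct ExportsProvider {
    cache: Mutex<HashMap<String, Vec<String>>>,
}

impl ExportsProvider {
    fn new() -> Self {
        Self { cache: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached data if present; otherwise runs `compute`
    /// once and stores the result.
    fn exports_of(&self, path: &str, compute: impl FnOnce() -> Vec<String>) -> Vec<String> {
        let mut cache = self.cache.lock().unwrap();
        cache.entry(path.to_string()).or_insert_with(compute).clone()
    }
}

fn main() {
    let provider = ExportsProvider::new();
    // The first query computes; the second is served from the cache.
    let first = provider.exports_of("lib.ts", || vec!["foo".into(), "bar".into()]);
    let second = provider.exports_of("lib.ts", || unreachable!("cached, never recomputed"));
    assert_eq!(first, second);
}
```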

CLI

From a CLI perspective, a way to build a Project and its graph is by doing two passes of the file system.

One pass to collect all files, and a second pass to check the files. After the first pass, it would be possible to build the Project we require.

The second pass is more challenging because, between the first pass and the second pass, we don't have an entity that will tell the WorkspaceServer which files should be checked. Should we create one? Should we extend FileSystem to tell us? Let's discuss it.
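The two passes could be sketched like this, using an in-memory map as a stand-in for the file system (all names and the toy `import` syntax are hypothetical):

```rust
use std::collections::HashMap;

/// Stand-in for the file system: path -> content.
type Files = HashMap<&'static str, &'static str>;

/// Pass 1: visit every file once and extract its (unresolved) imports.
fn collect(files: &Files) -> HashMap<&'static str, Vec<String>> {
    files
        .iter()
        .map(|(path, content)| {
            let imports = content
                .lines()
                .filter_map(|l| l.strip_prefix("import "))
                .map(str::to_string)
                .collect();
            (*path, imports)
        })
        .collect()
}

/// Pass 2: check each file, now with project-wide data available.
fn check(project: &HashMap<&'static str, Vec<String>>) -> Vec<String> {
    let mut diagnostics = Vec::new();
    for (path, imports) in project {
        for import in imports {
            if !project.contains_key(import.as_str()) {
                diagnostics.push(format!("{path}: unresolved import {import}"));
            }
        }
    }
    diagnostics
}

fn main() {
    let mut files = Files::new();
    files.insert("a.ts", "import b.ts");
    files.insert("b.ts", "");
    let project = collect(&files);
    assert!(check(&project).is_empty());

    files.insert("c.ts", "import missing.ts");
    let project = collect(&files);
    assert_eq!(check(&project).len(), 1);
}
```

The open question from the text remains: in a real traversal, some entity still has to decide which paths pass 2 should feed to the WorkspaceServer.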

LSP

From the LSP perspective, it's easier to initialise a new Project, but it will be more challenging to update it. Every time a file is updated/created/deleted, the graph should be updated.
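Assuming the module graph is stored as an adjacency map from each file to the files it imports, the LSP-side updates could be sketched like this (hypothetical names):

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

type Graph = HashMap<PathBuf, Vec<PathBuf>>;

/// On didChange/didCreate: replace the outgoing edges of the changed
/// file with its freshly extracted imports; the rest of the graph is
/// untouched.
fn update_file(graph: &mut Graph, path: PathBuf, new_imports: Vec<PathBuf>) {
    graph.insert(path, new_imports);
}

/// On didDelete: drop the node and any edges pointing at it.
fn remove_file(graph: &mut Graph, path: &PathBuf) {
    graph.remove(path);
    for imports in graph.values_mut() {
        imports.retain(|p| p != path);
    }
}

fn main() {
    let mut graph = Graph::new();
    update_file(&mut graph, "a.ts".into(), vec!["b.ts".into()]);
    update_file(&mut graph, "b.ts".into(), vec![]);
    remove_file(&mut graph, &PathBuf::from("b.ts"));
    assert!(graph[Path::new("a.ts")].is_empty());
}
```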

References

Here, Jeroen Engels, creator of elm-review, explains how he does multi-file analysis inside the linter: https://jfmengels.net/multi-file-analysis/ .
However, let's take into consideration that we have different constraints and requirements.

Tasks

I haven't talked about dependencies. Dependencies will be part of the graph, but they can be added later on, as we need a manifest to collect them (which we already have).

  • Create a type Project inside biome_project
  • Create and model the project graph
  • Break down the CLI traversal into two passes. It doesn't require the Project to be broken down.
  • Create and model an API to update the graph based on file paths and imports
  • Create and model APIs to CRUD the module graph

(Feel free to add more tasks if you think they are needed)

@ematipico ematipico added S-Feature Status: new feature to implement A-CLI Area: CLI A-Editors Area: editors A-Analyzer Area: analyzer A-Core Area: core labels Jun 28, 2024
arendjr (Contributor) commented Jun 29, 2024

Thanks for this outline!

In general I agree with the two-pass approach. I think indeed there should be one pass that collects the data we need and one that does analysis with the data available over all files. As you mentioned, between the passes there may indeed be a step that detects and defines dependencies as well: the creation of the Project data structure.

I saw one part that made me cautious:

  • try to compute data only when requested;
  • don't overdo it, try to compute only the data requested;
  • think about queries, where a consumer asks for data, and the provider delivers them (and ideally caches them);

I am somewhat afraid this might be approaching the direction from the wrong angle. Rather than querying from the place where the data is needed and caching results, I think it’s better to build (most of) the data from the ground up during the first phase so that it’s already available when the second phase starts. This does mean we may compute some information that is never queried, but I see some convincing reasons we may want to take this approach anyway:

  • It is much easier to parallelize the work. Workers during phase one will be able to do this per file without complex synchronization overhead. Workers during phase two have it available to them without needing to wait for one another.
  • Because it requires less synchronization it will also be easier for us to build and have fewer bugs.
  • I think this approach also better aligns with the direction we want to take for type inference, meaning we may reuse some of the infrastructure.

And finally, if the goal is to avoid doing unnecessary work, I think it’s best to split the types of data we gather into various services, and only activate the services that have one or more rules registered for them. But if we make data gathering itself too fine-grained, I think we’ll make things too complex for ourselves, and I’m skeptical the work we’d save would outweigh the benefits of better parallelization.

Unrelated question: since the data structure is called Project (for now), do you think this provides a good opportunity to improve our monorepo support as well?

arendjr (Contributor) commented Jun 29, 2024

So I think a little bit more concretely I would suggest something in the following direction:

Initialization step: Config Resolution

This is very much like today: we load the config files. Since we decide here which rules should be enabled, we’ll know which services to activate for the next step.

Phase 1: Traversal and Discovery

This step performs the filesystem traversal and parses all the supported files. It also extracts file-specific data for enabled services.

Examples of services could be:

  • Extract project structure (package.json, deno.json)
  • Extract list of imported files/packages this file depends on (these would still be unresolved at this point)
  • Extract type information
  • Extract installed dependencies (package.json, deno.json)
  • Extract import maps (tsconfig.json, deno.json)
  • Extract GraphQL snippets

I included project structure in this list because I think if we do this right, it might help us fix our monorepo support. This probably means we’ll treat package.json as a JSON file in the project like any other, but we’ll have a service that recognizes it as a project root, extracts its dependencies, etc. It won’t care yet whether it’s a nested project, since that’ll be up to the next step to resolve.
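One possible shape for such per-file extraction services, as a hedged sketch (the trait and all names are illustrative, not Biome's API):

```rust
/// Hypothetical per-file extraction service: runs once per file during
/// phase 1 and returns whatever data it extracted.
trait Service {
    fn extract(&self, path: &str, content: &str) -> Vec<String>;
}

/// Recognizes package.json files as project roots.
struct ProjectStructureService;

impl Service for ProjectStructureService {
    fn extract(&self, path: &str, _content: &str) -> Vec<String> {
        if path.ends_with("package.json") {
            vec![format!("project root: {path}")]
        } else {
            Vec::new()
        }
    }
}

fn main() {
    // Only services with one or more registered rules get activated.
    let services: Vec<Box<dyn Service>> = vec![Box::new(ProjectStructureService)];
    let mut extracted = Vec::new();
    for service in &services {
        extracted.extend(service.extract("packages/app/package.json", "{}"));
    }
    assert_eq!(extracted, vec!["project root: packages/app/package.json"]);
}
```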

Intermediary step: Project Resolution

At this point we look at everything we’ve gathered and build a dependency tree. We create one or more Projects and define the relations between them.

This step is probably hard to parallelize, since graph manipulation is difficult to multithread, which is why I think we should have most data available already.

File resolution also happens here, and I think we can be clever here by not doing any additional filesystem I/O. We’ve already done a traversal, so we know the file paths and their contents.

Phase 1b: Traversal and Discovery of Dependencies

Some services, such as type inference, may want to extract information from external packages. We couldn’t do that during phase 1, since we hadn’t done any file resolution yet. But now that we know and have identified what is missing we can repeat phase 1 for these resources. The main difference is that it won’t be a file system traversal, but traversal of the dependency graph for missing data. It may take file system access, or even network access, to fully resolve everything though.

Phase 2: Analysis

Now we have well-defined projects, their dependencies, full file resolutions and a bunch of extracted metadata. It’s time to run our rules :)

This can be fully parallelized again since the data in the services is fixed at this point.
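A sketch of why fixed service data makes phase 2 easy to parallelize, using std scoped threads (all names hypothetical): workers read the immutable data freely, and only result collection needs a lock.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;

/// Phase 2 sketch: one worker per file; the service data extracted in
/// phase 1 is read-only, so no synchronization is needed to read it.
fn analyze_in_parallel(files: &[&str], services: &HashMap<&str, Vec<&str>>) -> Vec<String> {
    let diagnostics = Mutex::new(Vec::new());
    let diagnostics_ref = &diagnostics;
    thread::scope(|scope| {
        for &file in files {
            scope.spawn(move || {
                let exports = &services[file];
                diagnostics_ref
                    .lock()
                    .unwrap()
                    .push(format!("{file}: {} export(s)", exports.len()));
            });
        }
    });
    let mut results = diagnostics.into_inner().unwrap();
    results.sort();
    results
}

fn main() {
    let mut services = HashMap::new();
    services.insert("a.ts", vec!["foo"]);
    services.insert("b.ts", vec!["bar", "baz"]);
    let results = analyze_in_parallel(&["a.ts", "b.ts"], &services);
    assert_eq!(results, vec!["a.ts: 1 export(s)", "b.ts: 2 export(s)"]);
}
```

In practice a work-stealing pool (e.g. rayon, which Biome already uses for traversal) would replace the hand-spawned threads, but the borrow structure is the same: shared `&` access to services, exclusive access only for output.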

arendjr (Contributor) commented Jun 29, 2024

One more thing I’d like to add is that I think an approach like this could scale very well to our LSP use case too, since every time a file is changed, we just re-run the phase 1 discovery on it after which we only may need to update the Project data structure(s) to get back into a fully consistent, up-to-date state again.

ematipico (Member, Author) commented

There's also one more thing that I would like to consider as part of this refactor/feature: salsa.

If you don't know salsa, it is an on-demand computation and memoisation framework.

I have been looking at salsa for a looooong time, and we've had plans to try to introduce it inside Biome. Ruff is also evaluating it. There's a big question mark, though: at the moment, they are rewriting it, and they will land the 3.0 version. However, development is very slow, and we currently don't know when they will finish it. That said, I know that Micha from Ruff is currently helping with development, and the team had a meeting about shipping v3 and more. So I am hopeful.

Their website lists the documentation of the upcoming v3, and the old website is now gone.

Another word of advice: salsa isn't a library; it is a framework, which means our code architecture may have to change to accommodate it. Some months ago, I tried for fun to include the old salsa, and I am certain that things will have to change; however, we can incrementally add stuff to the database.

The documentation is a bit lacking, especially when it comes to how to use salsa inside a query compiler or IDE. However, I know that they plan to expand the docs in that area.

Conaclos (Member) commented Jul 2, 2024

From a CLI perspective, a way to build a Project and its graph is by doing two passes of the file system.
One pass to collect all files. A second pass to check the files.
The second pass is more challenging because, between the first pass and the second pass, we don't have an entity that will tell the WorkspaceServer which files should be checked.

One thing I am afraid of is unnecessary traversing of directories (including symbolic links).
In the past we had some performance issues with that.

We could take a more granular approach, where we load the files in the currently traversed directory and decide - based on .gitignore, and global ignore/include - which subdirectories to traverse.

This has one drawback: a handled file may import a file in an ignored directory. However, even by traversing the whole file hierarchy, we still have this problem, because a handled file can import an unhandled file outside the traversed hierarchy.
We could use on-demand traversal for these cases. I think we could even give up on this, because if the user has explicitly asked to (globally) ignore a file or directory, it is surely for a good reason. The user might be confused by getting parsing errors for an ignored file.

Note that we need to clarify one thing: if a user globally includes directories, we should only process them. However, if a user doesn't use include and only uses ignore, then technically we could process files outside the traversed hierarchy. Is that what we want?

Using this approach will require us to slightly refine or change the semantics of our ignore/include:

  • Global ignore/include and .gitignore control which directories we traverse and which
    files we process (parse, extract types, build the dependency graph).
  • linter/formatter ignore/include don't prevent the traversal of a directory or
    the processing (parsing, type extraction, ...) of a file.
    They only prevent linter/formatter execution.
  • Paths listed as CLI arguments could behave like linter/formatter ignore/include.
    However, I am unsure where the traversal of the file hierarchy should start if
    we apply such semantics. Maybe the current directory where the command is executed?
    Or the directory where the Biome config file appears (this could be a bad idea if a project has no config file)?
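The directory-level decision in the first bullet could be sketched like this, grossly simplified (plain path-prefix matching instead of real glob/.gitignore semantics; names hypothetical):

```rust
/// Decide per directory whether to descend into it, based on global
/// include/ignore lists. Simplified: prefix matching stands in for
/// glob and .gitignore matching.
fn should_traverse(dir: &str, includes: &[&str], ignores: &[&str]) -> bool {
    if ignores.iter().any(|ig| dir.starts_with(ig)) {
        return false;
    }
    // With no explicit include list, everything not ignored is traversed.
    // Otherwise, descend only into included directories or their ancestors.
    includes.is_empty()
        || includes
            .iter()
            .any(|inc| inc.starts_with(dir) || dir.starts_with(inc))
}

fn main() {
    let ignores = ["node_modules"];
    let includes = ["src"];
    assert!(should_traverse("src", &includes, &ignores));
    assert!(should_traverse("src/components", &includes, &ignores));
    assert!(!should_traverse("node_modules", &includes, &ignores));
    assert!(!should_traverse("dist", &includes, &ignores));
}
```

The point of the sketch is that the pruning decision happens per directory during traversal, so ignored subtrees are never even entered.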

Fortunately, a module graph can be built with petgraph, which implements these traits.

It is not clear to me why a module graph is needed for multi-file analysis.
We just need a way to resolve imported files and a way to extract data for those files.
Of course, the module dependency graph will be really useful for import analysis (cycle detection, ...).

Phase 1: Traversal and Discovery
Phase 2: Analysis

If we keep our linter query-based system, most of the current linter rules could be executed in the first phase because they visit the AST. Is there any reason to defer them until after the dependencies have been resolved?

The main difference is that it won’t be a file system traversal, but traversal of the dependency graph for missing data. It may take file system access, or even network access, to fully resolve everything though.

arendjr (Contributor) commented Jul 2, 2024

Another word of advice: salsa isn't a library; it is a framework, which means our code architecture may have to change to accommodate it. Some months ago, I also tried for fun to include the old salsa, and I am certain that things will have to change, however we can incrementally add stuff to the database.

I think salsa is a pretty impressive architecture, but I am indeed slightly wary of the amount of changes it would require for it to fit into Biome's architecture.

Another thing to consider: As our plugin efforts progress, we may want to give plugins the ability to extract information from files similar to how Rust services can. This might be possible with salsa too, but I'm not sure if it will become harder to achieve.

We could use on-demand traversal for these cases. I think we could even give up on this, because if the user has explicitly asked to (globally) ignore a file or directory, it is surely for a good reason. The user might be confused by getting parsing errors for an ignored file.

(emphasis mine)
This sounds dangerous. I think ignore instructions from the user should be respected as far as analysis is concerned, but not necessarily as far as extraction/discovery is concerned. Example: node_modules is usually ignored, but you'd still want to extract type information from it.

Note that we need to clarify one thing: if a user globally includes directories, we should only process them. However, if a user doesn't use include and only uses ignore, then technically we could process files outside the traversed hierarchy. Is that what we want?

I think indeed processing files outside the traversed hierarchy is what we want, and we may even need to extract information from dependencies. I would say this is true even if the user only specified a limited set of includes. In my suggestion, this is effectively what Phase 1b is about, which indeed would have the ability to traverse files/directories outside the initial traversal.

It is not clear to me why a module graph is needed for multi-file analysis.
We just need a way to resolve imported files and a way to extract data for those files.
Of course, the module dependency graph will be really useful for import analysis (cycle detection, ...).

I think it's more a convenience than a necessity. Indeed as soon as you extract imports and are able to resolve them, you already have the graph implicitly. Putting it in a data structure makes access more convenient. Cycle detection is indeed a good example of that :)

I wouldn't say it necessarily needs to be a petgraph, btw. As long as we track files in our workspace, a simple list of the (resolved) paths of the other workspace files each file depends on might also be enough.

If we keep our linter query-based system, most of the current linter rules could be executed in the first phase because they visit the AST. Is there any reason to defer them until after the dependencies have been resolved?

You're right, they could be. The main reason I figured it might be better to do all analysis in phase 2 was simplicity: If all analysis rules run in the same phase, there might be less bookkeeping involved.

Maybe your suggestion could actually improve performance by giving the workers more to do while we also perform a lot of filesystem I/O, so I wouldn't rule out that option either.

arendjr (Contributor) commented Jul 24, 2024

We were discussing this on Discord, but adding here for visibility: I think it could be a good idea to build an initial version without dependency graph and use it to power the GraphQL analyzer. Might be a good PoC for a first iteration.
