Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Pylint's cyclic-import rule (PLR0401) #3880

Closed
wants to merge 85 commits into from

Conversation

chanman3388
Copy link
Contributor

@chanman3388 chanman3388 commented Apr 5, 2023

Implement https://pylint.pycqa.org/en/v2.13.9/messages/refactor/cyclic-import.html
Issue #970

The principle here is we utilise the ImportMap to traverse the modules which may or may not import each other. When a module has all its imported modules checked, it is counted as 'fully visited'.
When considering a traversed path, each module imported along the way is kept in a stack, a cycle is detected when the module we're currently inspecting is already contained within this stack. The appropriate part of the stack is copied into a new Vec and pushed onto a Vec which contains all the cycles currently detected from the current module.

By the nature of the implementation, cycles of other modules traversed are detected, but these won't be reported immediately, instead stored in a HashMap. When the cyclic import rule is called, a check of whether we have already the cycles for this module is done to avoid duplicate work.

@chanman3388 chanman3388 marked this pull request as ready for review April 22, 2023 02:50
Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive.

Can we add some integration tests that test the generated diagnostics?

How does that work in ruff-lsp when the user has unsaved changes in multiple files?

Comment on lines 52 to 54
let mut fully_visited: FxHashSet<Arc<str>> = FxHashSet::default();
let mut cycles: FxHashSet<Vec<Arc<str>>> = FxHashSet::default();
let mut stack: Vec<Arc<str>> = vec![name.clone()];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using strings in hash sets is expensive because hasing requires iterating over the whole string (and it may be necessary to do so more than once if the hash map needs to grow). Could we instead have a single map that stores a name to id mapping, where id is a zero type wrapper around e.g. a u32 type? It would allow us to only hash the string once rather than once per map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea, I'll look into it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've done this, but I think it's a bit ugly imho, at least I think it's kinda ugly.
At least on the noddy example I provided above, it doesn't appear to provide any gains, but it is a small example.

Copy link
Member

@MichaReiser MichaReiser Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least on the noddy example I provided above,

What example is this?

You would probably want to hide this behind a custom data structure that takes care of performing the str -> id lookups. So that it appears to users the same as when they to the HashMap.get(str) lookup directly.

Copy link
Contributor Author

@chanman3388 chanman3388 Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this file tree

grand
├── __init__.py
├── a.py (from .parent.child import a; from .parent import a; from grand.parent.child import b)
└── parent
    ├── __init__.py
    ├── a.py (from .child import a, b; from .. import a)
    └── child
        ├── __init__.py
        ├── a.py (from .. import a; from ... import a)
        ├── b.py (from . import a)
        └── c.py --empty--

This one.
Ok, I'll give .get implementation a go.

Copy link
Contributor Author

@chanman3388 chanman3388 Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I decided to do some noddy 'benchmarking'. I cloned the cpython repo, checked out the tag v.3.11.3 and ran ruff on it with --select=E,F,PLR0401 on it with varying commits from this branch using cargo build --release, by my estimation this is about 1M lines of python. I have an AMD 5900X. Off the current tip (f819262), it takes about 15-17s. Without performing a name to id mapping, that is to say leaving all the module names as Arc<str>, c878a78, it seems to reasonably consistently take 15s.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrt ruff-lsp...I believe I'm using my own build of ruff with vscode...but I don't think it picks up on the cyclic imports. Does it just lint individual files at a time?


impl CyclicImportChecker<'_> {
fn has_cycles(&self, name: &Arc<str>) -> VisitedAndCycles {
let mut fully_visited: FxHashSet<Arc<str>> = FxHashSet::default();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for using Arc vs eg. Box?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still reasonably new to this so do correct me if I'm wrong. The way I've done it I believe that we should have the same str potentially owned by multiple owners, though this might also not always be the case and I may indeed have a different Arc<str> which does contain the same value as another. For a Box<T> only one thing can own T. Cloning a Box<T> requires cloning the contents of the Box which isn't what I want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! Using refcounting here does make sense to avoid allocating the same module name many times. Would it be possible to use an Rc instead of Arc or is this data structure shared across thread boundaries?

Copy link
Contributor Author

@chanman3388 chanman3388 Apr 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the ImportMap is created in a way that is 'threaded' due to using par_iter which is where the first use of Arc<str> comes in. With the changes for referring to module names by id instead, because I create a name to id and id to name mapping...these just clone the Arc so unless I deliberately construct new Rcs from these Arcs I don't think I can just use an Rc even after this point. So I don't think I can trivially use Rc in favour of Arc.
It's awkward, because I don't think we actually share these across thread boundaries, but the compiler doesn't like it.

Comment on lines 62 to 64
stack: &mut Vec<Arc<str>>,
cycles: &mut FxHashSet<Vec<Arc<str>>>,
fully_visited: &mut FxHashSet<Arc<str>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It could be useful to put all these related data structures into a single struct that you can than pass around (and can expose semantically meaningful methods)

crates/ruff/src/rules/pylint/rules/cyclic_import.rs Outdated Show resolved Hide resolved
crates/ruff_cli/src/commands/run.rs Outdated Show resolved Hide resolved
crates/ruff_python_ast/src/imports.rs Outdated Show resolved Hide resolved
.map(|diagnostic| {
Message::from_diagnostic(
diagnostic,
SourceFileBuilder::new(&path.to_string_lossy()).finish(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test if ruff correctly renders the diagnostic using --show-source. You may need to set the source code if this is not the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The answer is that it doesn't render it correctly so I'll need to sort this out.

Copy link
Contributor Author

@chanman3388 chanman3388 Apr 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got this working, but I'm not sure I've done this in the expected fashion. If I want to use ruff::linter::diagnostics_to_messages I have to re-tokenize the source file which seems inefficient. I would need to do that to populate the noqa flags correctly, should I do that? For now I haven't as it appears simpler no to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to have a closer look and provide you with an answer by Friday or early next week (I'm travelling today/tomorrow)

crates/ruff/src/rules/pylint/rules/cyclic_import.rs Outdated Show resolved Hide resolved
@chanman3388
Copy link
Contributor Author

chanman3388 commented Apr 23, 2023

Impressive.

Can we add some integration tests that test the generated diagnostics?

How does that work in ruff-lsp when the user has unsaved changes in multiple files?

If I understand you correctly, I think at the moment we can't do the former as the integration tests only work on individual files, whereas to check I get the correct cycles I need to lint a package.

Short answer for the latter is, I don't know, I'll find that out.

Thanks for the thorough review!

@chanman3388
Copy link
Contributor Author

Impressive.
Can we add some integration tests that test the generated diagnostics?
How does that work in ruff-lsp when the user has unsaved changes in multiple files?

If I understand you correctly, I think at the moment we can't do the former as the integration tests only work on individual files, whereas to check I get the correct cycles I need to lint a package.

Short answer for the latter is, I don't know, I'll find that out.

Thanks for the thorough review!

So at the moment at least I don't know how to test ruff-lsp with my own version of ruff. Is there someone that could help with that?

@chanman3388
Copy link
Contributor Author

I've got another branch that uses par_iter instead and places CyclicImportHelper in an Arc<RwLock<T>. At least from my benchmarking it's ever-so-slightly faster.

@chanman3388
Copy link
Contributor Author

I've got another branch that uses par_iter instead and places CyclicImportHelper in an Arc<RwLock<T>. At least from my benchmarking it's ever-so-slightly faster.

Scratch that, seems I was trying something else.

@MichaReiser
Copy link
Member

MichaReiser commented May 1, 2023

This is a very impressive PR for an extremely valuable lint rule. Thanks for keeping working on it and rebasing it.

I have taken a closer look today and what it would mean. I have a few technical concerns for which I don't have any immediate answers right now and I recommend solving before merging. We can make some other improvements to the data structures to improve performance and reduce memory consumption.

Open Questions

LSP and unsaved changes

I haven't verified this but I fear that ruff will start reporting false positives (or negatives) if users have multiple open unsaved documents in their editor. Ruff's LSP passes the content of the active file in the editor per stdin instead of reading it from disk because there are likely unsaved changes.

The problem is, users can have multiple open documents with unsaved changes. We would need to pass all of them via stdin to ensure ruff operates on the latest content rather than on the stale content on disk. A scenario where I could see this be especially problematic is if the user has a cyclic import, makes changes two both files that fix the issue, but ruff continues to report it because it uses stale content from disk.

Properly supporting this use case most likely require significant changes to the LSP. We may want to decide that we're okay with this limitation. I rather prefer not to disable the rule in editors only because one of ruffs value propositions is that it works the same in editors and the CLI.

Caching

Ruff caches all diagnostics per file to avoid repeating the same analysis for unchanged files. I don't know how to cache the project-wide analysis yet and/or how cache invalidation should work. But I think that's something that we potentially can defer to later.

Related to this: We need to have the source code to emit diagnostics. It's unclear to me how to best retrieve the source code. We may need to first refactor our code to extract the logic that reads the source for a file.

Architecture

I would like to build an architecture that supports other use-cases like checking if a module only imports valid symbols from a module. Ideally, most of the computation happens per-file to use the CPUs' multi-core architecture best.

The way I think about this, and it is to a large extent implemented this way in this PR, is that we need to extract additional metadata in the per-file lint phase, cache it, aggregate that information, and then run additional checks on the aggregated data.

Lint Phase

Extract the name of the imported modules (and future, exported symbols). We can do this by computing a set of all imports in the lint phase where ModuleImport is defined as

struct ModuleImports {
	imports: FxHashSet<ModuleImport>
}

struct ModuleImport {
	/// The range of the import statement, used for diagnostics
	range: TextRange,
	/// Or can we use a `TextRange` slicing into the source to avoid allocations?
	name: String,
}

We need to persist the ModuleImports when writing to the cache in the cache.rs and restore them when reading from the cache. This is similar to what you have today. My main recommendation is to keep this as slim as possible to make it cheap to cache.

Orchestration (CLI)

Build up an aggregated view. Ideally, as little as possible of that code lives in the CLI. The CLI should only be calling into the corresponding code living in ruff. For example, we could create a Modules or Project struct in ruff that aggregates the data:

struct Modules {
	index: hashbrowns::HashMap<ModuleId, ()> // Use the raw_entry_mut method to manually compute the hash key to avoid storing the name multiple time. https://docs.rs/hashbrown/latest/hashbrown/struct.HashMap.html#method.raw_entry
	modules: Vec<Module> // the index defines the `ModuleId`. Use `self.modules.len()` to compute the next id.
}

impl Modules {
  pub fn add_module(name: String, imports: ModuleImports) {...}
}

#[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)]
struct ModuleId(u32);

struct Module {
	name: String,
	imports: Vec<ResolvedImport>,
	source: String, // (or lazy load?)
}

struct ResolvedImport {
	id: ModuleId,
	range: TextRange
}

We can avoid the need for locking on Modules by only using it from a single thread. Ideally, we could start building it while the analysis of other modules is still running rather than having to wait until all analysis ended (use a separate thread with a Channel?)

Project-wide analysis

Create a new lint_project(modules: &Modules) or similar method in ruff that does the cross-file linting.

What's unclear is how to retrieve the source text for cached files without loading every single file.

Next Steps

I think we should try to answer if we want to solve the LSP and caching issues today and we can then decide what other technical improvements we want to make (again, your PR is already super close and some of my suggestions may not work, you will know better). @charliermarsh what's your take on the LSP and caching problematic?

@chanman3388
Copy link
Contributor Author

chanman3388 commented May 1, 2023

With regards to say, checking that we import valid symbols, I think in the per-file lint phase, we could construct an ExportMap of sorts. I suppose the main issue is that such a map is going to require more memory, but it would allow for checking if a member in a module does exist, at least for Ruff, it would be nice to have if we could at least do this on first-party packages.
I've not checked the memory usage, the version of perf I had to compile to work with WSL2 doesn't support memory events. I did play around with some other implementations of cyclic import detection, and it does seem that if you don't store the cycles and only consider cycles applicable to the current module that it is slightly faster than the approach in this PR as you can do it in a parallel fashion. Would this be of interest? It feels inefficient, but it is faster, and it would allow for the running of other rules much like we do now, but using the ImportMap and the ExportMap concept that I mentioned.

@chanman3388
Copy link
Contributor Author

chanman3388 commented May 4, 2023

The PR with checking only the cycles relevant to the module being checked, if it is of any interest.

@charliermarsh
Copy link
Member

I'm going to close this for now in the interest of bookkeeping active PRs. I'm bummed that we haven't been able to merge it (and sorry for all the work you put in, only to lead to this disappointing outcome), but it starts to move us into multi-file analysis and so we need to solve some other problems before we'd be comfortable shipping it (namely, around caching and LSP support, as Micha described above in more detail). When the time comes, we can always reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants