Implement Pylint's `cyclic-import` rule (`PLR0401`) #3880

chanman3388 · 2023-04-05T00:26:07Z

Implement https://pylint.pycqa.org/en/v2.13.9/messages/refactor/cyclic-import.html
Issue #970

The principle here is we utilise the ImportMap to traverse the modules which may or may not import each other. When a module has all its imported modules checked, it is counted as 'fully visited'.
When considering a traversed path, each module imported along the way is kept in a stack, a cycle is detected when the module we're currently inspecting is already contained within this stack. The appropriate part of the stack is copied into a new Vec and pushed onto a Vec which contains all the cycles currently detected from the current module.

By the nature of the implementation, cycles of other modules traversed are detected, but these won't be reported immediately, instead stored in a HashMap. When the cyclic import rule is called, a check of whether we have already the cycles for this module is done to avoid duplicate work.

…ollected

…ort_imports

…ashmap of imports

MichaReiser

Impressive.

Can we add some integration tests that test the generated diagnostics?

How does that work in ruff-lsp when the user has unsaved changes in multiple files?

MichaReiser · 2023-04-23T03:57:57Z

crates/ruff/src/rules/pylint/rules/cyclic_import.rs

+        let mut fully_visited: FxHashSet<Arc<str>> = FxHashSet::default();
+        let mut cycles: FxHashSet<Vec<Arc<str>>> = FxHashSet::default();
+        let mut stack: Vec<Arc<str>> = vec![name.clone()];


Using strings in hash sets is expensive because hasing requires iterating over the whole string (and it may be necessary to do so more than once if the hash map needs to grow). Could we instead have a single map that stores a name to id mapping, where id is a zero type wrapper around e.g. a u32 type? It would allow us to only hash the string once rather than once per map.

This is a good idea, I'll look into it!

Ok, I've done this, but I think it's a bit ugly imho, at least I think it's kinda ugly.
At least on the noddy example I provided above, it doesn't appear to provide any gains, but it is a small example.

At least on the noddy example I provided above,

What example is this?

You would probably want to hide this behind a custom data structure that takes care of performing the str -> id lookups. So that it appears to users the same as when they to the HashMap.get(str) lookup directly.

Consider this file tree

grand ├── __init__.py ├── a.py (from .parent.child import a; from .parent import a; from grand.parent.child import b) └── parent ├── __init__.py ├── a.py (from .child import a, b; from .. import a) └── child ├── __init__.py ├── a.py (from .. import a; from ... import a) ├── b.py (from . import a) └── c.py --empty--

This one.
Ok, I'll give .get implementation a go.

So I decided to do some noddy 'benchmarking'. I cloned the cpython repo, checked out the tag v.3.11.3 and ran ruff on it with --select=E,F,PLR0401 on it with varying commits from this branch using cargo build --release, by my estimation this is about 1M lines of python. I have an AMD 5900X. Off the current tip (f819262), it takes about 15-17s. Without performing a name to id mapping, that is to say leaving all the module names as Arc<str>, c878a78, it seems to reasonably consistently take 15s.

Wrt ruff-lsp...I believe I'm using my own build of ruff with vscode...but I don't think it picks up on the cyclic imports. Does it just lint individual files at a time?

MichaReiser · 2023-04-23T03:58:15Z

crates/ruff/src/rules/pylint/rules/cyclic_import.rs

+
+impl CyclicImportChecker<'_> {
+    fn has_cycles(&self, name: &Arc<str>) -> VisitedAndCycles {
+        let mut fully_visited: FxHashSet<Arc<str>> = FxHashSet::default();


What's the reason for using Arc vs eg. Box?

I'm still reasonably new to this so do correct me if I'm wrong. The way I've done it I believe that we should have the same str potentially owned by multiple owners, though this might also not always be the case and I may indeed have a different Arc<str> which does contain the same value as another. For a Box<T> only one thing can own T. Cloning a Box<T> requires cloning the contents of the Box which isn't what I want.

That's a good point! Using refcounting here does make sense to avoid allocating the same module name many times. Would it be possible to use an Rc instead of Arc or is this data structure shared across thread boundaries?

So the ImportMap is created in a way that is 'threaded' due to using par_iter which is where the first use of Arc<str> comes in. With the changes for referring to module names by id instead, because I create a name to id and id to name mapping...these just clone the Arc so unless I deliberately construct new Rcs from these Arcs I don't think I can just use an Rc even after this point. So I don't think I can trivially use Rc in favour of Arc.
It's awkward, because I don't think we actually share these across thread boundaries, but the compiler doesn't like it.

MichaReiser · 2023-04-23T03:59:08Z

crates/ruff/src/rules/pylint/rules/cyclic_import.rs

+        stack: &mut Vec<Arc<str>>,
+        cycles: &mut FxHashSet<Vec<Arc<str>>>,
+        fully_visited: &mut FxHashSet<Arc<str>>,


Nit: It could be useful to put all these related data structures into a single struct that you can than pass around (and can expose semantically meaningful methods)

crates/ruff/src/rules/pylint/rules/cyclic_import.rs

crates/ruff_cli/src/commands/run.rs

crates/ruff_python_ast/src/imports.rs

MichaReiser · 2023-04-23T04:09:58Z

crates/ruff_cli/src/commands/run.rs

+                        .map(|diagnostic| {
+                            Message::from_diagnostic(
+                                diagnostic,
+                                SourceFileBuilder::new(&path.to_string_lossy()).finish(),


Can you test if ruff correctly renders the diagnostic using --show-source. You may need to set the source code if this is not the case.

The answer is that it doesn't render it correctly so I'll need to sort this out.

I've got this working, but I'm not sure I've done this in the expected fashion. If I want to use ruff::linter::diagnostics_to_messages I have to re-tokenize the source file which seems inefficient. I would need to do that to populate the noqa flags correctly, should I do that? For now I haven't as it appears simpler no to.

I plan to have a closer look and provide you with an answer by Friday or early next week (I'm travelling today/tomorrow)

crates/ruff/src/rules/pylint/rules/cyclic_import.rs

chanman3388 · 2023-04-23T05:13:11Z

Impressive.

Can we add some integration tests that test the generated diagnostics?

How does that work in ruff-lsp when the user has unsaved changes in multiple files?

If I understand you correctly, I think at the moment we can't do the former as the integration tests only work on individual files, whereas to check I get the correct cycles I need to lint a package.

Short answer for the latter is, I don't know, I'll find that out.

Thanks for the thorough review!

chanman3388 · 2023-04-24T00:37:51Z

Impressive.
Can we add some integration tests that test the generated diagnostics?
How does that work in ruff-lsp when the user has unsaved changes in multiple files?

If I understand you correctly, I think at the moment we can't do the former as the integration tests only work on individual files, whereas to check I get the correct cycles I need to lint a package.

Short answer for the latter is, I don't know, I'll find that out.

Thanks for the thorough review!

So at the moment at least I don't know how to test ruff-lsp with my own version of ruff. Is there someone that could help with that?

chanman3388 · 2023-04-30T21:25:14Z

I've got another branch that uses par_iter instead and places CyclicImportHelper in an Arc<RwLock<T>. At least from my benchmarking it's ever-so-slightly faster.

chanman3388 · 2023-04-30T21:37:59Z

I've got another branch that uses par_iter instead and places CyclicImportHelper in an Arc<RwLock<T>. At least from my benchmarking it's ever-so-slightly faster.

Scratch that, seems I was trying something else.

MichaReiser · 2023-05-01T15:21:15Z

This is a very impressive PR for an extremely valuable lint rule. Thanks for keeping working on it and rebasing it.

I have taken a closer look today and what it would mean. I have a few technical concerns for which I don't have any immediate answers right now and I recommend solving before merging. We can make some other improvements to the data structures to improve performance and reduce memory consumption.

Open Questions

LSP and unsaved changes

I haven't verified this but I fear that ruff will start reporting false positives (or negatives) if users have multiple open unsaved documents in their editor. Ruff's LSP passes the content of the active file in the editor per stdin instead of reading it from disk because there are likely unsaved changes.

The problem is, users can have multiple open documents with unsaved changes. We would need to pass all of them via stdin to ensure ruff operates on the latest content rather than on the stale content on disk. A scenario where I could see this be especially problematic is if the user has a cyclic import, makes changes two both files that fix the issue, but ruff continues to report it because it uses stale content from disk.

Properly supporting this use case most likely require significant changes to the LSP. We may want to decide that we're okay with this limitation. I rather prefer not to disable the rule in editors only because one of ruffs value propositions is that it works the same in editors and the CLI.

Caching

Ruff caches all diagnostics per file to avoid repeating the same analysis for unchanged files. I don't know how to cache the project-wide analysis yet and/or how cache invalidation should work. But I think that's something that we potentially can defer to later.

Related to this: We need to have the source code to emit diagnostics. It's unclear to me how to best retrieve the source code. We may need to first refactor our code to extract the logic that reads the source for a file.

Architecture

I would like to build an architecture that supports other use-cases like checking if a module only imports valid symbols from a module. Ideally, most of the computation happens per-file to use the CPUs' multi-core architecture best.

The way I think about this, and it is to a large extent implemented this way in this PR, is that we need to extract additional metadata in the per-file lint phase, cache it, aggregate that information, and then run additional checks on the aggregated data.

Lint Phase

Extract the name of the imported modules (and future, exported symbols). We can do this by computing a set of all imports in the lint phase where ModuleImport is defined as

struct ModuleImports {
	imports: FxHashSet<ModuleImport>
}

struct ModuleImport {
	/// The range of the import statement, used for diagnostics
	range: TextRange,
	/// Or can we use a `TextRange` slicing into the source to avoid allocations?
	name: String,
}

We need to persist the ModuleImports when writing to the cache in the cache.rs and restore them when reading from the cache. This is similar to what you have today. My main recommendation is to keep this as slim as possible to make it cheap to cache.

Orchestration (CLI)

Build up an aggregated view. Ideally, as little as possible of that code lives in the CLI. The CLI should only be calling into the corresponding code living in ruff. For example, we could create a Modules or Project struct in ruff that aggregates the data:

struct Modules {
	index: hashbrowns::HashMap<ModuleId, ()> // Use the raw_entry_mut method to manually compute the hash key to avoid storing the name multiple time. https://docs.rs/hashbrown/latest/hashbrown/struct.HashMap.html#method.raw_entry
	modules: Vec<Module> // the index defines the `ModuleId`. Use `self.modules.len()` to compute the next id.
}

impl Modules {
  pub fn add_module(name: String, imports: ModuleImports) {...}
}

#[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)]
struct ModuleId(u32);

struct Module {
	name: String,
	imports: Vec<ResolvedImport>,
	source: String, // (or lazy load?)
}

struct ResolvedImport {
	id: ModuleId,
	range: TextRange
}

We can avoid the need for locking on Modules by only using it from a single thread. Ideally, we could start building it while the analysis of other modules is still running rather than having to wait until all analysis ended (use a separate thread with a Channel?)

Project-wide analysis

Create a new lint_project(modules: &Modules) or similar method in ruff that does the cross-file linting.

What's unclear is how to retrieve the source text for cached files without loading every single file.

Next Steps

I think we should try to answer if we want to solve the LSP and caching issues today and we can then decide what other technical improvements we want to make (again, your PR is already super close and some of my suggestions may not work, you will know better). @charliermarsh what's your take on the LSP and caching problematic?

chanman3388 · 2023-05-01T23:37:02Z

With regards to say, checking that we import valid symbols, I think in the per-file lint phase, we could construct an ExportMap of sorts. I suppose the main issue is that such a map is going to require more memory, but it would allow for checking if a member in a module does exist, at least for Ruff, it would be nice to have if we could at least do this on first-party packages.
I've not checked the memory usage, the version of perf I had to compile to work with WSL2 doesn't support memory events. I did play around with some other implementations of cyclic import detection, and it does seem that if you don't store the cycles and only consider cycles applicable to the current module that it is slightly faster than the approach in this PR as you can do it in a parallel fashion. Would this be of interest? It feels inefficient, but it is faster, and it would allow for the running of other rules much like we do now, but using the ImportMap and the ExportMap concept that I mentioned.

chanman3388 · 2023-05-04T22:26:39Z

The PR with checking only the cycles relevant to the module being checked, if it is of any interest.

charliermarsh · 2023-10-20T18:04:07Z

I'm going to close this for now in the interest of bookkeeping active PRs. I'm bummed that we haven't been able to merge it (and sorry for all the work you put in, only to lead to this disappointing outcome), but it starts to move us into multi-file analysis and so we need to solve some other problems before we'd be comfortable shipping it (namely, around caching and LSP support, as Micha described above in more detail). When the time comes, we can always reopen.

chanman3388 and others added 30 commits February 26, 2023 01:20

WIP

19b7755

return the list of imports per module

87fac94

first go at propagating imports out to where diagnostic results are c…

e76cbd9

…ollected

remove some unecessary clones

19b9d9c

some comments

4ec1f54

pull from main and refactor slightly

fd65a38

remove some lifetime annotations

c2181ca

Merge branch 'report_imports' of github.com:chanman3388/ruff into rep…

f4acf2b

…ort_imports

fix tests and remove some print statements

2344bd8

rename LoggingLevel::from_str to from_python_logger_method_str

d2bbb41

fix lib_wasm

6809a03

fix conflicts after merging main

c7310cd

fix conflicts

2bd8b67

associate path with module imports

bac9fcd

add new type to avoid clippy message about complex type

c02cc50

cargo fmt

1a2caf0

use new type to avoid complex type message

9fb6c52

realise that actually we don't want a vec of hashmaps, but just one h…

b67889e

…ashmap of imports

Merge branch 'main' into report_imports

5652512

Merge branch 'main' into report_imports

2dd5feb

the easy comments

2c41cee

clippy and fmt changes

99650fb

implement struct

ce24b4f

Add comment explaining macro usage and correct import

ad87a74

Make DiagnosticsAndImports private as it doesn't need to be public

80c7fad

resolve conflicts after merging main

a114fa9

remove an erroneously included comment

e00142d

add resolving of relative imports

57ea484

fix case where package is None

a255d58

fix case where level of a relative import is 0

1d7dc5d

change some unittest names

c878a78

chanman3388 marked this pull request as ready for review April 22, 2023 02:50

MichaReiser reviewed Apr 23, 2023

View reviewed changes

chanman3388 added 4 commits April 23, 2023 18:26

address some of the quicker review comments

27ce49c

use ids to represent imports when detecting cycles instead

079cb0c

linting

be5bb5d

amend comment and fix typo

ed347e2

chanman3388 added 6 commits April 26, 2023 22:12

use vec instead of FxHashMap for id_to_module

f819262

use &str instead of Arc<str>

7a7a1a0

merge main

a621483

merge main

bf35ae3

merge main

41efef3

comment out debug! calls

fdd5fd2

chanman3388 added 7 commits May 4, 2023 23:32

merge main

9cfb746

fix after merging

456f43b

merge main

8ba0b63

fix

e0110e2

Merge branch 'main' into PLR0401

bbc5c23

Merge branch 'main' into PLR0401

597ebba

merge main

6695702

chanman3388 mentioned this pull request Jun 16, 2023

PLR0401 Cyclic import implementation #2914

Open

charliermarsh closed this Oct 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Pylint's `cyclic-import` rule (`PLR0401`) #3880

Implement Pylint's `cyclic-import` rule (`PLR0401`) #3880

chanman3388 commented Apr 5, 2023 •

edited

Loading

MichaReiser left a comment

MichaReiser Apr 23, 2023

chanman3388 Apr 23, 2023

chanman3388 Apr 24, 2023

MichaReiser Apr 26, 2023 •

edited

Loading

chanman3388 Apr 26, 2023 •

edited

Loading

chanman3388 Apr 26, 2023 •

edited

Loading

chanman3388 Apr 29, 2023

MichaReiser Apr 23, 2023

chanman3388 Apr 23, 2023

MichaReiser Apr 23, 2023

chanman3388 Apr 23, 2023 •

edited

Loading

MichaReiser Apr 23, 2023

MichaReiser Apr 23, 2023

chanman3388 Apr 23, 2023

chanman3388 Apr 23, 2023 •

edited

Loading

MichaReiser Apr 26, 2023

chanman3388 commented Apr 23, 2023 •

edited

Loading

chanman3388 commented Apr 24, 2023

chanman3388 commented Apr 30, 2023

chanman3388 commented Apr 30, 2023

MichaReiser commented May 1, 2023 •

edited

Loading

chanman3388 commented May 1, 2023 •

edited

Loading

chanman3388 commented May 4, 2023 •

edited

Loading

charliermarsh commented Oct 20, 2023

Implement Pylint's cyclic-import rule (PLR0401) #3880

Implement Pylint's cyclic-import rule (PLR0401) #3880

Conversation

chanman3388 commented Apr 5, 2023 • edited Loading

MichaReiser left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaReiser Apr 26, 2023 • edited Loading

Choose a reason for hiding this comment

chanman3388 Apr 26, 2023 • edited Loading

Choose a reason for hiding this comment

chanman3388 Apr 26, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chanman3388 Apr 23, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chanman3388 Apr 23, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chanman3388 commented Apr 23, 2023 • edited Loading

chanman3388 commented Apr 24, 2023

chanman3388 commented Apr 30, 2023

chanman3388 commented Apr 30, 2023

MichaReiser commented May 1, 2023 • edited Loading

Open Questions

LSP and unsaved changes

Caching

Architecture

Lint Phase

Orchestration (CLI)

Project-wide analysis

Next Steps

chanman3388 commented May 1, 2023 • edited Loading

chanman3388 commented May 4, 2023 • edited Loading

charliermarsh commented Oct 20, 2023

Implement Pylint's `cyclic-import` rule (`PLR0401`) #3880

Implement Pylint's `cyclic-import` rule (`PLR0401`) #3880

chanman3388 commented Apr 5, 2023 •

edited

Loading

MichaReiser Apr 26, 2023 •

edited

Loading

chanman3388 Apr 26, 2023 •

edited

Loading

chanman3388 Apr 26, 2023 •

edited

Loading

chanman3388 Apr 23, 2023 •

edited

Loading

chanman3388 Apr 23, 2023 •

edited

Loading

chanman3388 commented Apr 23, 2023 •

edited

Loading

MichaReiser commented May 1, 2023 •

edited

Loading

chanman3388 commented May 1, 2023 •

edited

Loading

chanman3388 commented May 4, 2023 •

edited

Loading