-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Pylint's cyclic-import
rule (PLR0401
)
#3880
Conversation
…ashmap of imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive.
Can we add some integration tests that test the generated diagnostics?
How does that work in ruff-lsp when the user has unsaved changes in multiple files?
let mut fully_visited: FxHashSet<Arc<str>> = FxHashSet::default(); | ||
let mut cycles: FxHashSet<Vec<Arc<str>>> = FxHashSet::default(); | ||
let mut stack: Vec<Arc<str>> = vec![name.clone()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using strings
in hash sets is expensive because hasing requires iterating over the whole string (and it may be necessary to do so more than once if the hash map needs to grow). Could we instead have a single map that stores a name
to id
mapping, where id
is a zero type wrapper around e.g. a u32
type? It would allow us to only hash the string once rather than once per map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea, I'll look into it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've done this, but I think it's a bit ugly imho, at least I think it's kinda ugly.
At least on the noddy example I provided above, it doesn't appear to provide any gains, but it is a small example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least on the noddy example I provided above,
What example is this?
You would probably want to hide this behind a custom data structure that takes care of performing the str -> id
lookups. So that it appears to users the same as when they to the HashMap.get(str)
lookup directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider this file tree
grand ├── __init__.py ├── a.py (from .parent.child import a; from .parent import a; from grand.parent.child import b) └── parent ├── __init__.py ├── a.py (from .child import a, b; from .. import a) └── child ├── __init__.py ├── a.py (from .. import a; from ... import a) ├── b.py (from . import a) └── c.py --empty--
This one.
Ok, I'll give .get
implementation a go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I decided to do some noddy 'benchmarking'. I cloned the cpython repo, checked out the tag v.3.11.3 and ran ruff on it with --select=E,F,PLR0401 on it with varying commits from this branch using cargo build --release
, by my estimation this is about 1M lines of python. I have an AMD 5900X. Off the current tip (f819262), it takes about 15-17s. Without performing a name
to id
mapping, that is to say leaving all the module names as Arc<str>
, c878a78, it seems to reasonably consistently take 15s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrt ruff-lsp...I believe I'm using my own build of ruff with vscode...but I don't think it picks up on the cyclic imports. Does it just lint individual files at a time?
|
||
impl CyclicImportChecker<'_> { | ||
fn has_cycles(&self, name: &Arc<str>) -> VisitedAndCycles { | ||
let mut fully_visited: FxHashSet<Arc<str>> = FxHashSet::default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for using Arc
vs eg. Box
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still reasonably new to this so do correct me if I'm wrong. The way I've done it I believe that we should have the same str
potentially owned by multiple owners, though this might also not always be the case and I may indeed have a different Arc<str>
which does contain the same value as another. For a Box<T>
only one thing can own T
. Cloning a Box<T>
requires cloning the contents of the Box
which isn't what I want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point! Using refcounting here does make sense to avoid allocating the same module name many times. Would it be possible to use an Rc
instead of Arc
or is this data structure shared across thread boundaries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the ImportMap
is created in a way that is 'threaded' due to using par_iter
which is where the first use of Arc<str>
comes in. With the changes for referring to module names by id instead, because I create a name to id and id to name mapping...these just clone the Arc
so unless I deliberately construct new Rc
s from these Arc
s I don't think I can just use an Rc
even after this point. So I don't think I can trivially use Rc
in favour of Arc
.
It's awkward, because I don't think we actually share these across thread boundaries, but the compiler doesn't like it.
stack: &mut Vec<Arc<str>>, | ||
cycles: &mut FxHashSet<Vec<Arc<str>>>, | ||
fully_visited: &mut FxHashSet<Arc<str>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It could be useful to put all these related data structures into a single struct that you can than pass around (and can expose semantically meaningful methods)
crates/ruff_cli/src/commands/run.rs
Outdated
.map(|diagnostic| { | ||
Message::from_diagnostic( | ||
diagnostic, | ||
SourceFileBuilder::new(&path.to_string_lossy()).finish(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test if ruff correctly renders the diagnostic using --show-source
. You may need to set the source code if this is not the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The answer is that it doesn't render it correctly so I'll need to sort this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got this working, but I'm not sure I've done this in the expected fashion. If I want to use ruff::linter::diagnostics_to_messages
I have to re-tokenize the source file which seems inefficient. I would need to do that to populate the noqa
flags correctly, should I do that? For now I haven't as it appears simpler no to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to have a closer look and provide you with an answer by Friday or early next week (I'm travelling today/tomorrow)
If I understand you correctly, I think at the moment we can't do the former as the integration tests only work on individual files, whereas to check I get the correct cycles I need to lint a package. Short answer for the latter is, I don't know, I'll find that out. Thanks for the thorough review! |
So at the moment at least I don't know how to test ruff-lsp with my own version of ruff. Is there someone that could help with that? |
I've got another branch that uses |
Scratch that, seems I was trying something else. |
This is a very impressive PR for an extremely valuable lint rule. Thanks for keeping working on it and rebasing it. I have taken a closer look today and what it would mean. I have a few technical concerns for which I don't have any immediate answers right now and I recommend solving before merging. We can make some other improvements to the data structures to improve performance and reduce memory consumption. Open QuestionsLSP and unsaved changesI haven't verified this but I fear that ruff will start reporting false positives (or negatives) if users have multiple open unsaved documents in their editor. Ruff's LSP passes the content of the active file in the editor per stdin instead of reading it from disk because there are likely unsaved changes. The problem is, users can have multiple open documents with unsaved changes. We would need to pass all of them via stdin to ensure ruff operates on the latest content rather than on the stale content on disk. A scenario where I could see this be especially problematic is if the user has a cyclic import, makes changes two both files that fix the issue, but ruff continues to report it because it uses stale content from disk. Properly supporting this use case most likely require significant changes to the LSP. We may want to decide that we're okay with this limitation. I rather prefer not to disable the rule in editors only because one of ruffs value propositions is that it works the same in editors and the CLI. CachingRuff caches all diagnostics per file to avoid repeating the same analysis for unchanged files. I don't know how to cache the project-wide analysis yet and/or how cache invalidation should work. But I think that's something that we potentially can defer to later. Related to this: We need to have the source code to emit diagnostics. It's unclear to me how to best retrieve the source code. We may need to first refactor our code to extract the logic that reads the source for a file. ArchitectureI would like to build an architecture that supports other use-cases like checking if a module only imports valid symbols from a module. Ideally, most of the computation happens per-file to use the CPUs' multi-core architecture best. The way I think about this, and it is to a large extent implemented this way in this PR, is that we need to extract additional metadata in the per-file lint phase, cache it, aggregate that information, and then run additional checks on the aggregated data. Lint PhaseExtract the name of the imported modules (and future, exported symbols). We can do this by computing a set of all imports in the lint phase where struct ModuleImports {
imports: FxHashSet<ModuleImport>
}
struct ModuleImport {
/// The range of the import statement, used for diagnostics
range: TextRange,
/// Or can we use a `TextRange` slicing into the source to avoid allocations?
name: String,
} We need to persist the Orchestration (CLI)Build up an aggregated view. Ideally, as little as possible of that code lives in the CLI. The CLI should only be calling into the corresponding code living in ruff. For example, we could create a struct Modules {
index: hashbrowns::HashMap<ModuleId, ()> // Use the raw_entry_mut method to manually compute the hash key to avoid storing the name multiple time. https://docs.rs/hashbrown/latest/hashbrown/struct.HashMap.html#method.raw_entry
modules: Vec<Module> // the index defines the `ModuleId`. Use `self.modules.len()` to compute the next id.
}
impl Modules {
pub fn add_module(name: String, imports: ModuleImports) {...}
}
#[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)]
struct ModuleId(u32);
struct Module {
name: String,
imports: Vec<ResolvedImport>,
source: String, // (or lazy load?)
}
struct ResolvedImport {
id: ModuleId,
range: TextRange
} We can avoid the need for locking on Project-wide analysisCreate a new What's unclear is how to retrieve the source text for cached files without loading every single file. Next StepsI think we should try to answer if we want to solve the LSP and caching issues today and we can then decide what other technical improvements we want to make (again, your PR is already super close and some of my suggestions may not work, you will know better). @charliermarsh what's your take on the LSP and caching problematic? |
With regards to say, checking that we import valid symbols, I think in the per-file lint phase, we could construct an |
The PR with checking only the cycles relevant to the module being checked, if it is of any interest. |
I'm going to close this for now in the interest of bookkeeping active PRs. I'm bummed that we haven't been able to merge it (and sorry for all the work you put in, only to lead to this disappointing outcome), but it starts to move us into multi-file analysis and so we need to solve some other problems before we'd be comfortable shipping it (namely, around caching and LSP support, as Micha described above in more detail). When the time comes, we can always reopen. |
Implement https://pylint.pycqa.org/en/v2.13.9/messages/refactor/cyclic-import.html
Issue #970
The principle here is we utilise the
ImportMap
to traverse the modules which may or may not import each other. When a module has all its imported modules checked, it is counted as 'fully visited'.When considering a traversed path, each module imported along the way is kept in a stack, a cycle is detected when the module we're currently inspecting is already contained within this stack. The appropriate part of the stack is copied into a new
Vec
and pushed onto aVec
which contains all the cycles currently detected from the current module.By the nature of the implementation, cycles of other modules traversed are detected, but these won't be reported immediately, instead stored in a
HashMap
. When the cyclic import rule is called, a check of whether we have already the cycles for this module is done to avoid duplicate work.