-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
--type=auto module type detection algorithm #5
Conversation
type-auto-algorithm.md
Outdated
|
||
- A reference to `arguments` in the top scope. | ||
|
||
- A reference to `return` in the top scope. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this ain't a reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
- A
return
statement in the top scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to be very specific:
- A
return
statement in the top VariableEnvironment of the source text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming this applied to returns within things like block statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So really it's a return
statement in the top function scope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a function scope necessarily. So no, not a function scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not? Doesn't the CommonJS wrapper create a function scope that encompasses the entire file? So a return
anywhere in there would exit that wrapper function scope, unless the return
were inside a child function scope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that is circular, you must treat it as CJS for you to check if it is CJS in that case. Nothing is preventing Module (or god forbid Script) from adding early return
s except going through committee. Same for await
. You should state the heuristics based upon the overlap, which would just be the VariableEnvironment; which is the Module environment for Modules, the Global environment for Script, and the Function environment for CJS. Only stating it on the function environment doesn't make sense since this is just a heuristic, we cannot assert that it is a function environment in order to check if it is a function environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really that common to use return
in the top scope (or whatever I should call it) without the CommonJS globals? I'm wondering if we need to bother with the checks beyond looking for use of those globals (including checking that the globals aren't defined by user code).
If we limit the CommonJS check to just the globals, then the explanation for this feature is straightforward: import/export is run as ESM, use of CommonJS globals is run as CommonJS, anything else errors. The simplicity of that might be worth a few false negatives (my "anything else" group) if these other checks are unlikely to grow our haul of detected CommonJS significantly.
@ljharb we're currently only allowing this algo to succeed if |
@devsnek i'm not sure what that's in reference to. |
@ljharb none of those variables could exist as globals |
@devsnek yes, but that doesn't mean that a reference to them proves it's ESM - it just means that such code couldn't complete. |
@ljharb I’m asking for collaboration, not just criticism, if you don’t mind. I appreciate you pointing out issues in this document, but if you don’t mind can you please couple those points with suggestions for how the document can resolve those issues? And if you can’t think of any solutions, please tell us what you considered and why none of your ideas worked. |
@GeoffreyBooth i apologize that it's come off as criticism; my suggestion is "remove any heuristic that depends on identifiers". Only syntax can reliably disambiguate. |
So with regards to detecting CommonJS globals, yes, the following could be ESM: require('./file.js'); And of course it would immediately throw, since we’re defining this file as an initial entry point and so we therefore know that So maybe my language at the top about being able to “unambiguously” detect CommonJS is too sweeping, and needs to be dialed back a bit. I think users would still prefer we look for references to the CommonJS globals as our primary mode of detecting CommonJS, rather than excluding them because they aren’t absolutely positively 100% proof of CommonJS. Or looked at another way, I would say that references to CommonJS globals is proof beyond a reasonable doubt, and that’s good enough for the purposes of this algorithm. Could-run-as-ESM code could be written that could reference such globals and not throw, via code like if (require) {
// . . .
} but likewise, I think this is safe to assume that this code should be run in CommonJS mode. There’s a big difference between code that references the CommonJS globals and code that’s truly ambiguous. If you look at the syntax/keyword differences in the list, I think they’re too obscure for us to use as CommonJS signifiers. Too many obviously CommonJS files would lack such signifiers of Script mode, for example any CommonJS file that has |
I'm not a fan of trying to detect CJS either. The approach by @zenparsing here might be promising. It tries to compile (not run) the entry module as CJS and if it errors it tries as ESM. |
That's also not sufficient; there exists modules that can parse and run successfully in both CJS and ESM, but have different semantics; it's not safe to evaluate these as either without an explicit signal. |
@GeoffreyBooth i don't think it would be a good idea for node to ever "guess" - if it's not certain, it should throw, and relying on identifiers is not a certainty. If someone has a file that |
@ljharb I can appreciate how that seems like it might be an issue. Can you provide a concrete example of how that might be a problem in practice? |
@zenparsing i can come up with a contrived example that wouldn't be very convincing, but either way I think node needs to draw a line based on what's possible, not merely based on what's likely. The parsing goals are ambiguous, and as such, it is simply not possible in every case to disambiguate them without an additional signal. |
I think that’s the argument for why |
We're not userland - we don't have the luxury of being fuzzy/loose. "Likely" doesn't matter; node itself should only parse files as X when it knows them to be X, for certain. |
FWIW I figure I should probably mention the approach I take with auto detection in the |
I think a reference to a CommonJS global, where that global isn’t defined in the file and the file is an initial entry point, is certain that any real-world code is CommonJS. Otherwise the code is written to try to trick the detector; I just don’t see what realistic useful program would exist otherwise. @ljharb I understand you may disagree, but unless we have a better algorithm I consider this one to be good enough, even for Node core, for an opt-in feature. If I’m outvoted down the line then so be it. We have a corpus we can test against. My research repo has 941 projects from NPM that define |
Using a corpus of projects that are highly likely to be ESM-only is one thing; I'd be interested to see the majority of npm, which is highly likely to be CJS-only. |
Speaking from the other side of the aisle (ie FrontEnd), the majority of the existing libs are IIFEs with AMD/UMD/CJS tacked on. Unfortunately, if you're determining type by matching on a 'require' statement it's going encounter a lot of false positives. At least, until the ecosystem has some time to transition. |
I can also do: global.require = () => {}; // or `global['re' + 'quire']`
require(); which would match that heuristic as CJS, since there's no declaration, but could easily be both ESM or CJS. |
The proposal includes checking for Again, I'm looking for collaboration not criticism. If all you have to offer are problems and not at least your thought process for a solution, please keep your objection to yourself. I'm not looking to defend this document from a firing squad, I'm asking for help improving it. |
My example does not contain a declaration for that variable. @GeoffreyBooth these are suggestions to improve it - removing things that aren't viable helps improve the document, and is absolutely collaboration, not criticism. |
@ljharb a suggestion for improving it would be to clarify that I should've written an assignment for the global variable. That's what I'm asking for. |
Right, but I'm saying that relying on identifiers is not viable, so my suggestion continues to be to remove all such heuristics. |
Unless you have an alternative that still works, specifier detection is the best heuristic I know of. Removing it without a viable replacement makes this feature worthless. So you can suggest another solution, otherwise your point is noted. |
To clarify further, |
I don't consider any of the examples that would foil detection to be realistic. I think it's worth adding a section to the document listing the edge cases that would be false positives and negatives, to show that those edge cases aren't likely to appear in real user code. As an opt-in feature, I don't think it needs perfect accuracy. So if you'd like to contribute you can write such a section, and help catch as many edge cases as possible. We need to catch common realistic cases like AMD, which I think is doable. If there's a big class of realistic code that's a false positive I'll reconsider, but for now I think a feature that works the vast majority of the time is worth building and acceptable as an opt-in feature. |
Scratch my earlier comment. I think it's safe to say, multi-mode modules that support CJS should always be detected as CJS. Since, 'module.exports' and 'export default' don't work together in the same file anyway. Besides, it's not difficult to convert an IIFE so it can be imported as a default ESM. Converting existing libs that heavily rely entirely on CJS are another story. Those will take much longer to update. |
…ommonJS false positives, that shouldn’t happen in real-world code
I’ve updated the document per the comments above. Feedback is welcome, but if your feedback is to point out a problem with the document I respectfully request that you suggest an improvement to the document to address your objection. I added a paragraph to discuss @ljharb’s point about references to the CommonJS globals being allowable ESM. I still think detecting the globals, including checking that they’re not getting defined, is a good enough heuristic for CommonJS until someone can show me some practical real-world code where that fails. I’m thinking it should exclude as ambiguous most frontend cases like Browserify or AMD/UMD, since those patterns either define global I also removed the detection for
|
Thanks @ljharb, I made the corrections you suggested. That’s the kind of feedback I was looking for 👍 With regards to I think if someone is trying to deceive the algorithm they’re going to succeed, like in your earlier The purpose of checking for redefining the CommonJS globals is to filter out legitimate use cases like UMD or Browserify, though even they probably aren’t defining |
What i mean is, if it’s not syntactically ESM, and it’s thus a CJS candidate, I’m suggesting using “use of eval” as a disqualifier, to eliminate some edge cases that could make “CJS” the incorrect goal. |
What edge cases does it disqualify? I can’t think of any other than someone intentionally trying to trick the detector, which I don’t think we should be worrying about since they have so many other ways to do so besides I also like the simplicity of defining autodetected CommonJS files as files that use the CommonJS globals, full stop, without needing to explain a bunch of extra items we’re detecting; unless those extra detections add something significant. |
@jdalton and I were discussing this and he suggested making the algorithm even simpler:
And that’s it. Ambiguous files would run as CommonJS, just like how
I’m not saying I’m entirely convinced that this is a better way to go, but he makes a strong argument. This algorithm entirely removes the worries about the detector being fooled or gamed; either the source contains |
Do you mean vice versa? I think this simpler algorithm works pretty well for extensionless files; i do not think it should work for package.json-less .js files - we should not silently override an explicit signal, and “.js” is an explicit signal that it’s not ESM (lacking additional explicit signals, like a package.json) |
Yes, thanks. I don't think a file being |
How many of those were written to be node entry points, without babel, or a flagged experimental node feature, in mind? |
And how many of those ESM entry points intended for use with Babel might also be intended to be evaluated directly in runtimes that support ESM? There’s no way for the author to tell us. We simply can’t assume anything about a loose I’m not really interested in debating when to disallow this feature. There’s an argument to be made that As for this algorithm itself, as opposed to what it can apply to, does anyone have any other notes? |
I’m also realizing, if (/^(?:im|ex)port\s/m.test(sourceCode)) esmDetected(); This saves us from having to work with any parsers, which feels like a win. @SMotaal is there a better regex than this one? Is there any problem with using regex for this? |
you can't use regex to search for anything in js because (as one example) I can start any line in js with a string, and regex can't parse a js string. |
@devsnek I think we visited this one before in #203 and while my demo is eagerly begging for a serious refactor, it proves that RegExp based "semi-contextual" parsing is a viable thing that like any other will have it's own limitations but can also comes with unique (ie different, no claim if it is better or worst) optimization potential. I have since found few cases where I continued to improve the safety from a sense of curiosity in slowly extending this when it seems viably suited. For instance, thanks to @guybedford's comment in es-modules-shim, I found and fixed a failure where expressions like I am not plugging this in, I continue my work because I personally see potential which qualify at least to merit not claiming that it is not possible ❤️ please.
|
@SMotaal you can use regex for individual matches when building tokens (the tokenizer holds all the context that regex doesn't have) but you can't use regex over a whole source text, as Geoffrey asks above |
Okay, forget I asked about regex, that’s an implementation detail. Anyone have a rough idea on how this could be implemented? Would it be parsed with Acorn or is there some way to use the V8 parser? @zenparsing? |
@GeoffreyBooth i would want to use a full parser. i think it's important for our method for getting the kind of module something to also validate that the source text is valid. we shouldn't return a valid output for an invalid input. |
feb7792
to
497ab46
Compare
Yes, that’s what my last comment said? Are you saying that Acorn or the V8 parser aren’t full parsers? Is there something else you would use? I updated the PR to match our discussions, I think this is ready to merge in now. |
Absolutely, a So the tokenizer mechanics here are the missing contextual component of what is normally isolated context-free regex patterns. In all cases parsing scans every character, in some methods parsing will incur "additional" code execution costs for every character, but ultimately every lexeme (token of meaningful characters) should incur code execution costs regardless of method for contextual validity to hold. So prior to template strings, when all masquerading literals where either flat or homogeneously structured, you could do a As for invalid code, @devsnek you put this very nicely, any operator used to "translate" — which is a term I saw used in Edit: Just for clarity, I am symbolically considering translation in the context of |
@GeoffreyBooth my conclusion from going over the final update here is that RegExp or AST is a detail for the following parsing scenario (roughly):
Some of the considerations here are:
|
Per nodejs/modules#263 (comment)
Feedback invited! @bmeck @devsnek @guybedford @jdalton @jkrems @ljharb @SMotaal
I’ve invited the folks that have been involved in discussions related to
--type=auto
or this proposal in general. Once it seems okay to you all, I’ll open an issue on the main modules repo to invite broader feedback.