Detect filetype from shebang line #1001
dc44953 to a7c52af
helix-core/src/syntax.rs
Outdated
if std::io::Read::read(&mut file, &mut buf[..]).is_ok() {
    if let Ok(str) = str::from_utf8(&buf) {
        static SHEBANG_REGEX: Lazy<Regex> = Lazy::new(|| {
            Regex::new(r"#!/[^\s]*/(env\s)*([_a-zA-Z0-9-]+)").unwrap()
I took a pass at cleaning up the shebang regex. Here's an alternative to yours that captures only the interpreter part, and allows more exotic interpreter names (like "php7.4").
^#!\s*/\S*/(?:env\s+)?(\S+)
I wrote a bunch of tests over here so you can see how it breaks down and modify if needed:
https://regex101.com/r/fsxGMO/2
Thanks for the regex cleanup!
I made some minor changes to also support lines like "#! python".
One thing I'm not sure of is whether we really want to match minor versions of interpreters. Is there a real-world use case for this?
Matching them makes it harder to cut off the .exe/.cmd/... suffix of interpreter names, and it would also require listing minor versions in the shebangs entry of languages.toml.
What I have now (excluding minor versions):
^#!\s*(?:\S*/(?:env\s+)?)?([^\s\.]+)
@ath3 good point about supporting things like php7.4. My thought was that sometimes you might have more than one version of php installed on a server (like with ondrej's Ubuntu repos) that name them this way, and you may want to invoke one specifically. But for our case, we just want to discover the name of the language, not the version, so your modification makes sense. It does still erroneously catch php7 out of my example, though. Do we want to exclude digits from the capture group? That would also clean up the python3 case.
Also, for Windows, would we need to worry about supporting \ as a directory separator? If so, just modify the / in your version to [/\\].
> Also, for windows, would we need to worry about supporting \ as a directory separator?
I don't think Windows supports shebangs, not natively anyway (Cygwin can use them, I think, with / like in Unix).
I searched and saw that some people use Windows-style paths in shebang lines, which is probably not the most correct way, but we can use a regex that handles those cases too.
I pushed the new regex, hopefully this works for all the variants.
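To make the variants discussed in this thread concrete, here is a minimal std-only sketch of the intended behavior. It is a hand-written approximation, not the regex from the PR and not Helix code: the names base_name and interpreter_from_shebang are hypothetical, only ASCII digits are treated as version digits, and env is skipped whenever another token follows it.

```rust
/// Return the part of `token` after the last `/` or `\`.
fn base_name(token: &str) -> &str {
    // `rsplit` always yields at least one item, so the fallback is never used.
    token
        .rsplit(|c: char| c == '/' || c == '\\')
        .next()
        .unwrap_or(token)
}

/// Extract a language-like interpreter name from a shebang line.
/// Hypothetical helper that approximates the thread's intent.
fn interpreter_from_shebang(first_line: &str) -> Option<&str> {
    // The line must start with "#!"; whitespace after it is allowed.
    let rest = first_line.strip_prefix("#!")?;
    let mut tokens = rest.split_whitespace();
    let mut name = base_name(tokens.next()?);

    // `env` only launches the real interpreter, which is the next token.
    if name == "env" {
        if let Some(next) = tokens.next() {
            name = base_name(next);
        }
    }

    // Cut at the first '.' or digit so that "php7.4", "python3" and
    // "python.exe" reduce to "php", "python" and "python".
    let end = name
        .find(|c: char| c == '.' || c.is_ascii_digit())
        .unwrap_or(name.len());
    let name = &name[..end];

    if name.is_empty() {
        None
    } else {
        Some(name)
    }
}
```

With this sketch, "#!/usr/bin/env python3", "#! python" and "#!C:\Python39\python.exe" all reduce to python, and a line without a leading "#!" yields None.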
e057337 to cbd3016
helix-core/src/syntax.rs
Outdated
// If we have not found the configuration_id, see if we can get it from a shebang line
if configuration_id.is_none() {
    if let Ok(mut file) = File::open(path) {
        let mut buf = [0; 100];
        if std::io::Read::read(&mut file, &mut buf[..]).is_ok() {
            if let Ok(str) = str::from_utf8(&buf) {
                static SHEBANG_REGEX: Lazy<Regex> = Lazy::new(|| {
                    Regex::new(r"^#!\s*(?:\S*[/\\](?:env\s+)?)?([^\s\.\d]+)").unwrap()
                });
                configuration_id = SHEBANG_REGEX
                    .captures(str)
                    .and_then(|cap| cap.get(1))
                    .and_then(|cap| self.language_config_ids_by_shebang.get(cap.as_str()))
            }
        }
    }
};

// TODO: content_regex handling conflict resolution
configuration_id.and_then(|&id| self.language_configs.get(id).cloned())
I'd rather see this as a separate method, then called via language_config_for_file_name(..).or_else(|| shebang(..))
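A rough sketch of the shape suggested here, with file-name detection first and the shebang lookup as a fallback chained through or_else. Everything in it is a simplified assumption for illustration: the Loader struct, its fields, the method names and sample_loader are hypothetical stand-ins, not the actual types in helix-core/src/syntax.rs, and the shebang parsing is reduced to taking the last path component of the first token.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Hypothetical, simplified stand-in for the language loader.
struct Loader {
    configs: Vec<String>,
    ids_by_extension: HashMap<String, usize>,
    ids_by_shebang: HashMap<String, usize>,
}

impl Loader {
    /// Look the language up by file extension.
    fn config_for_file_name(&self, path: &Path) -> Option<&str> {
        let ext = path.extension()?.to_str()?;
        let id = *self.ids_by_extension.get(ext)?;
        self.configs.get(id).map(String::as_str)
    }

    /// Look the language up by the interpreter named in a shebang line.
    fn config_for_shebang(&self, first_line: &str) -> Option<&str> {
        let token = first_line.strip_prefix("#!")?.split_whitespace().next()?;
        let name = token.rsplit('/').next()?;
        let id = *self.ids_by_shebang.get(name)?;
        self.configs.get(id).map(String::as_str)
    }

    /// File name first; the shebang is only consulted when that fails.
    fn detect(&self, path: &Path, first_line: &str) -> Option<&str> {
        self.config_for_file_name(path)
            .or_else(|| self.config_for_shebang(first_line))
    }
}

/// Build a tiny loader with one language, for demonstration only.
fn sample_loader() -> Loader {
    let mut ids_by_extension = HashMap::new();
    ids_by_extension.insert("sh".to_string(), 0);
    let mut ids_by_shebang = HashMap::new();
    ids_by_shebang.insert("sh".to_string(), 0);
    ids_by_shebang.insert("bash".to_string(), 0);
    Loader {
        configs: vec!["bash".to_string()],
        ids_by_extension,
        ids_by_shebang,
    }
}
```

Because or_else takes a closure, the shebang lookup is skipped entirely when the file name already identifies the language.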
helix-core/src/syntax.rs
Outdated
.and_then(|cap| cap.get(1))
.and_then(|cap| self.language_config_ids_by_shebang.get(cap.as_str()))
- .and_then(|cap| cap.get(1))
- .and_then(|cap| self.language_config_ids_by_shebang.get(cap.as_str()))
+ .and_then(|cap| self.language_config_ids_by_shebang.get(&cap[1]))
cbd3016 to 357dbab
helix-core/src/syntax.rs
Outdated
// Read the first 128 bytes of the file. If it's a shebang line, try to find the language
let file = File::open(path).ok()?;
let mut buf = String::with_capacity(128);
Read::read_to_string(&mut Read::take(file, 128), &mut buf).ok()?;
Why not just file.take(128).read_to_string(&mut buf)? It's a trait.
I was trying to avoid unwraps that can cause a panic; that's why I used if let in those places.
Is it OK to have those here?
Yeah, no unwraps. It's just that Read::read(buf).ok() is equivalent to buf.read().ok(), since Read is a trait.
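To illustrate the point about trait methods, the sketch below reads a limited prefix twice, once with fully qualified calls and once with method-call syntax. It assumes an in-memory Cursor in place of the File used in the PR, and the function names head_qualified and head_method are made up for the example. Neither version unwraps.

```rust
use std::io::{Cursor, Read};

/// Fully qualified form, as in the earlier revision of the diff.
fn head_qualified(data: &[u8], limit: u64) -> Option<String> {
    let mut buf = String::new();
    Read::read_to_string(&mut Read::take(Cursor::new(data), limit), &mut buf).ok()?;
    Some(buf)
}

/// Method-call form; it resolves to the same trait methods
/// because `Read` is in scope.
fn head_method(data: &[u8], limit: u64) -> Option<String> {
    let mut buf = String::new();
    Cursor::new(data).take(limit).read_to_string(&mut buf).ok()?;
    Some(buf)
}
```

Both functions return the same value for the same input; the second is only a different spelling of the first.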
b658f90 to fb64735
helix-core/src/syntax.rs
Outdated
// Read the first 128 bytes of the file. If it's a shebang line, try to find the language
let file = File::open(path).ok()?;
let mut buf = String::with_capacity(128);
file.take(128).read_to_string(&mut buf).ok();
- file.take(128).read_to_string(&mut buf).ok();
+ file.take(128).read_to_string(&mut buf).ok()?;
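The difference the trailing ? makes can be seen in a small sketch. It assumes any Read source in place of the File from the PR, and the function names are hypothetical. Without the ?, a failed read is silently discarded and the function carries on with an empty buffer; with it, the function returns None early.

```rust
use std::io::Read;

/// `.ok();` turns the error into `None` and then throws it away,
/// so a failed read still falls through to `Some(buf)`.
fn read_ignoring_errors(src: impl Read) -> Option<String> {
    let mut buf = String::with_capacity(128);
    src.take(128).read_to_string(&mut buf).ok();
    Some(buf)
}

/// `.ok()?;` returns `None` from the function as soon as the read fails.
fn read_propagating_errors(src: impl Read) -> Option<String> {
    let mut buf = String::with_capacity(128);
    src.take(128).read_to_string(&mut buf).ok()?;
    Some(buf)
}
```

A byte slice that is not valid UTF-8 makes read_to_string fail, which is an easy way to exercise both paths without touching the file system.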
fb64735 to a1c2fa0
So it turns out that reading from disk is unnecessary, since the file is already read by the time detection runs. I'll fix it after merge.
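As a sketch of what detection without a second disk read could look like, the helper below takes text that is already in memory and returns its first line only when that line is a shebang. It assumes the contents are available as a plain &str, which may not match how the editor stores documents, and the name shebang_line is hypothetical.

```rust
/// Return the first line of already-loaded text if it is a shebang line.
/// Assumes the document is available as a `&str` (an illustration only).
fn shebang_line(text: &str) -> Option<&str> {
    // `lines()` yields nothing for empty text, so `?` returns `None` there.
    let first = text.lines().next()?;
    if first.starts_with("#!") {
        Some(first)
    } else {
        None
    }
}
```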
Most of the shell and Perl scripts I have to work on have no extensions.
With this change I'm adding shebang support for detecting file types.
There is also a new config entry in languages.toml: "shebangs".