Avoid using display on paths. #828
Adds a helper trait `ToUtf8` which converts the path to UTF-8 if possible, and if not, creates a pretty debugging representation of the path. We can then change `display()` to `to_utf8()?` and be completely correct in all cases. On POSIX systems, this doesn't matter since paths must be UTF-8 anyway, but on Windows some paths can be WTF-8 (see rust-lang/rust#12056). Either way, this can avoid creating a copy in some cases, is always correct, and is more idiomatic about what we're doing.
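For readers who want a concrete picture, a helper along these lines could look roughly like the sketch below. This is not the code from the PR: the error type `NonUtf8Error` and the exact signature of `to_utf8` are assumptions made for illustration, and the real implementation may differ.

```rust
use std::ffi::OsStr;
use std::fmt;
use std::path::Path;

/// Hypothetical error type: carries the debug representation of the
/// offending path so the message shows which bytes were invalid.
#[derive(Debug)]
pub struct NonUtf8Error(String);

impl fmt::Display for NonUtf8Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "path is not valid UTF-8: {}", self.0)
    }
}

impl std::error::Error for NonUtf8Error {}

/// Sketch of a `ToUtf8`-style helper trait (signature is an assumption).
pub trait ToUtf8 {
    fn to_utf8(&self) -> Result<&str, NonUtf8Error>;
}

impl ToUtf8 for OsStr {
    fn to_utf8(&self) -> Result<&str, NonUtf8Error> {
        // `to_str` borrows instead of allocating, and returns `None`
        // rather than substituting replacement characters.
        self.to_str()
            .ok_or_else(|| NonUtf8Error(format!("{:?}", self)))
    }
}

impl ToUtf8 for Path {
    fn to_utf8(&self) -> Result<&str, NonUtf8Error> {
        self.as_os_str().to_utf8()
    }
}
```

With something like this in scope, a call site that previously formatted `path.display()` can use `path.to_utf8()?` and get a borrowed `&str` or an error that names the path.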
I'll be adding the other

Say I create a file with an ISO-8859-15 encoded name:

    >>> with open("Überraschung".encode("ISO-8859-15"), "w") as f:
    ...     f.write("Hello")
    ...
    5
    >>> import os; os.listdir(os.getcwd())
    ['\udcdcberraschung']

We can confirm this is correct in Bash by checking the output in UTF-8 (only works if the terminal emulator also expects UTF-8 encoded output, which it should):

    $ LANG=en_us.UTF-8 ls
    ''$'\334''berraschung'

So now I generate my Rust code:

    use std::{env, fs, io};

    pub fn main() -> io::Result<()> {
        let cwd = env::current_dir()?;
        for entry in fs::read_dir(cwd)? {
            let path = entry?.path();
            println!("debug={path:?}");
            println!("display={}", path.display());
        }
        Ok(())
    }

And this outputs:
Or, our output isn't exactly correct: we shouldn't be using
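The gap between the strict and the lossy conversion can be sketched in a few lines. This is a Unix-only illustration, not code from the PR; the function names `render` and `latin1_example` are hypothetical. It builds the same ISO-8859-15 byte sequence as the file name above directly from bytes, so no file needs to exist.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: build an OsStr from raw bytes
use std::path::Path;

/// Returns the strict conversion (`None` if the path is not UTF-8) next to
/// the lossy rendering produced by `Path::display`.
pub fn render(path: &Path) -> (Option<&str>, String) {
    (path.to_str(), path.display().to_string())
}

/// "Überraschung" encoded as ISO-8859-15: the leading `Ü` is the single
/// byte 0xDC, which is not valid UTF-8 on its own.
pub fn latin1_example() -> &'static Path {
    Path::new(OsStr::from_bytes(b"\xDCberraschung"))
}
```

For `latin1_example()`, the strict conversion yields `None`, while the lossy rendering silently replaces the invalid byte with U+FFFD. A string like that no longer names the file on disk, which is why it is unsuitable for anything passed on as a path.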
does it make sense to add
I have no direct testing, but considering
Update: that would require us to have a non-Unicode locale on the host (extremely unlikely) and to change the locale on the image to a non-Unicode one as well. Both are very unlikely, especially for new development, so let's just formalize the UTF-8 assumptions and fix them if the need ever occurs.
Yes, I think so. It is incorrect for our uses, even if it's a good method for more general cases.
Since we use our mount paths, which will be UTF-8 inside the Linux container, it makes sense to ensure they're valid UTF-8 prior so they can be valid paths for the locale in the container.

I've left the following functions alone:
- `MountFinder::new`, since `OsStr` is cheaper for path comparisons than `str` is.
- `cargo_metadata_with_args`, since the manifest path is on the host, and therefore this is always valid.
- `pretty_path`, since it handles both valid Unicode paths and debug representations of non-Unicode paths.
bors r=Emilgardis
828: Avoid using display on paths. r=Emilgardis a=Alexhuszagh

Adds a helper trait `ToUtf8` which converts the path to UTF-8 if possible, and if not, creates a pretty debugging representation of the path. We can then change `display()` to `to_utf8()?` and be completely correct in all cases.

On POSIX systems, this doesn't matter since paths must be UTF-8 anyway, but on Windows some paths can be WTF-8 (see rust-lang/rust#12056). Either way, this can avoid creating a copy in some cases, is always correct, and is more idiomatic about what we're doing. We might not be able to handle non-UTF-8 paths currently (like ISO-8859-1 paths, which are historically very common).

So, this doesn't change ergonomics: the resulting code is just as compact, and correct. It also requires fewer copies in most cases, which should be a good thing. But most importantly, if we're mounting data we can't silently fail or produce incorrect data. If something was lossily generated, we probably shouldn't try to mount with a replacement character, and we should also print more information than just that there was an invalid Unicode character.

Co-authored-by: Alex Huszagh <[email protected]>
Just an additional comment in case this ever needs to be discussed again, as unlikely as that is: we need both the host and the container locale to be the same, I believe, since we need to provide bind mount mappings from the host to the container that are identical. Code like this (I believe) cannot exist if the path is not valid Unicode, since our container expects this to be a path that can be represented in UTF-8: Lines 173 to 186 in 2937032
However, nothing here stops you from providing non-UTF-8 filenames or directories as part of your crate data or external mounts to

If any of these assumptions are wrong, this PR has made it easier to patch assumptions about UTF-8 paths in the future.
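To illustrate the point about identical host and container paths, here is a rough sketch of what validating a path before building a bind mount mapping might look like. It is not the code referenced above: `bind_mount_arg` is a hypothetical helper, and the `host:container` string form is assumed to be the one accepted by the container engine's volume option.

```rust
use std::path::Path;

/// Hypothetical helper: builds a `host:container` mapping that uses the
/// same path on both sides, and refuses paths that are not valid UTF-8
/// instead of mounting a lossily converted name.
pub fn bind_mount_arg(host_path: &Path) -> Result<String, String> {
    match host_path.to_str() {
        // Same path on the host and in the container.
        Some(p) => Ok(format!("{p}:{p}")),
        // Report the debug form so the invalid bytes are visible.
        None => Err(format!(
            "refusing to mount non-UTF-8 path: {:?}",
            host_path
        )),
    }
}
```

Failing early here means a mount never points at a path containing a replacement character, which would name a different (and most likely nonexistent) location.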
bors r-
This isn't a bug, but it does lead to
Canceled.
Previously, it would produce a doubled separator such as `//root`, since it added a `/` for the root and another one before the first component after it. Also added unit tests to ensure `as_posix` works as expected with drive letters and normal paths.
bors r=Emilgardis
Build succeeded:
Prior to cross-rs#828, we added a '/' separator before each component in our makeshift `as_posix`, so the path would always become absolute, even though the actual path was relative. After that change, `as_posix` handled such a path as a relative path, so `project` and not `/project` was used as the working directory for the cross command, which caused commands to fail.
831: Fixes regression in #828 with absolute path. r=Emilgardis a=Alexhuszagh

Prior to #828, we added a '/' separator before each component in our makeshift `as_posix`, so the path would always become absolute, even though the actual path was relative. After that change, `as_posix` handled such a path as a relative path, so `project` and not `/project` was used as the working directory for the cross command, which caused commands to fail.

Co-authored-by: Alex Huszagh <[email protected]>
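The two separator problems discussed above (a doubled `/` after the root, and relative paths turning absolute) can be illustrated with a small component-based conversion. This is only a sketch under assumptions, not the project's `as_posix`: the `Option<String>` return type and the `push_segment` helper are hypothetical, and Windows drive prefixes are deliberately left unmapped because their container-side mapping is not described here.

```rust
use std::path::{Component, Path};

/// Sketch: joins path components with `/`, keeping relative paths relative
/// and emitting exactly one separator after the root.
/// Returns `None` for non-UTF-8 components or unhandled prefixes.
pub fn as_posix(path: &Path) -> Option<String> {
    let mut out = String::new();
    for component in path.components() {
        match component {
            // Drive letters and other Windows prefixes: not handled in
            // this sketch.
            Component::Prefix(_) => return None,
            // The root contributes the only leading separator.
            Component::RootDir => out.push('/'),
            Component::CurDir => push_segment(&mut out, "."),
            Component::ParentDir => push_segment(&mut out, ".."),
            Component::Normal(part) => push_segment(&mut out, part.to_str()?),
        }
    }
    Some(out)
}

/// Adds a separator only between segments: never before the first segment
/// and never right after an existing `/`.
fn push_segment(out: &mut String, segment: &str) {
    if !out.is_empty() && !out.ends_with('/') {
        out.push('/');
    }
    out.push_str(segment);
}
```

Because the separator is inserted between segments rather than before each one, `project/src` stays relative and `/root/project` does not pick up a second leading slash.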