Avoid using display on paths. #828

Alexhuszagh · 2022-06-21T07:27:15Z

Adds a helper trait `ToUtf8` which converts the path to UTF-8 if possible, and if not, creates a pretty debugging representation of the path. We can then change `display()` to `to_utf8()?` and be completely correct in all cases.

On POSIX systems, this rarely matters since paths are UTF-8 in practice on most modern systems (even though they may contain arbitrary bytes), but on Windows some paths can be WTF-8 (see rust-lang/rust#12056). Either way, this can avoid creating a copy in some cases, is always correct, and is more idiomatic about what we're doing. We might not be able to handle non-UTF-8 paths currently (like ISO-8859-1 paths, which are historically very common).

So, this doesn't change ergonomics: the resulting code is just as compact, and it is correct. It also requires fewer copies in most cases, which should be a good thing. But most importantly, if we're mounting data, we can't silently fail or produce incorrect data. If something was lossily generated, we probably shouldn't try to mount with a replacement character, and we should also print more information than just that there was an invalid Unicode character.
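To make the idea concrete, here is a minimal sketch of what a `ToUtf8`-style helper could look like. The trait name comes from the description above; the `String` error type and the error message are assumptions made for illustration, not the code in this PR.

```rust
use std::path::Path;

/// Sketch of a `ToUtf8`-style helper. The `String` error type is an
/// assumption for this example, not the error type used by the PR.
pub trait ToUtf8 {
    fn to_utf8(&self) -> Result<&str, String>;
}

impl ToUtf8 for Path {
    fn to_utf8(&self) -> Result<&str, String> {
        // `Path::to_str` borrows (no copy) and returns `None` when the path
        // is not valid UTF-8. In that case we report the `Debug`
        // representation, which escapes the offending bytes.
        self.to_str()
            .ok_or_else(|| format!("unable to convert path to UTF-8: {self:?}"))
    }
}
```

With something like this in scope, a call site that previously formatted `path.display()` could use `path.to_utf8()?` and propagate the error instead of emitting a replacement character.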

Alexhuszagh · 2022-06-21T07:29:38Z

I'll be adding the other as_posix helper in this PR, but a quick explanation on a Linux system:

Say I create a file with an ISO-8859-15 filename, using Python to make sure the name is encoded that way:

>>> with open("Überraschung".encode("ISO-8859-15"), "w") as f:
...     f.write("Hello")
... 
5
>>> import os; os.listdir(os.getcwd())
['\udcdcberraschung']

We can confirm this is correct in Bash by checking the output in UTF-8 (only works if the terminal emulator also expects UTF-8 encoded output, which it should):

$ LANG=en_us.UTF-8 ls
''$'\334''berraschung

So now I generate my Rust code:

use std::{fs, io, env};

pub fn main() -> io::Result<()> {
    let cwd = env::current_dir()?;
    for entry in fs::read_dir(cwd)? {
        let path = entry?.path();
        println!("debug={path:?}");
        println!("display={}", path.display());
    }

    Ok(())
}

And this outputs:

./main 
debug="/home/ahuszagh/Desktop/tmp/\xDCberraschung"
display=/home/ahuszagh/Desktop/tmp/�berraschung

In other words, our output isn't exactly correct: we shouldn't be using display for either error messages or mounting. At least in the debug output, the \xDC escape is printed, so we should be able to guess the encoding of the file we failed to mount.
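The same difference can be reproduced without touching the filesystem. The sketch below is Unix-only (it relies on `std::os::unix::ffi::OsStrExt`), and the helper names `latin_name` and `describe` are made up for this example.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: build an `OsStr` from raw bytes
use std::path::{Path, PathBuf};

/// Builds the non-UTF-8 file name from the example above
/// (0xDC is `Ü` in ISO-8859-15). Hypothetical helper.
fn latin_name() -> PathBuf {
    PathBuf::from(OsStr::from_bytes(b"\xDCberraschung"))
}

/// Returns the lossy `display()` text, the strict `to_str()` result,
/// and the `Debug` text for a path. Hypothetical helper.
fn describe(path: &Path) -> (String, Option<String>, String) {
    (
        // Lossy: invalid bytes are replaced with U+FFFD.
        path.display().to_string(),
        // Strict: `None` if the path is not valid UTF-8.
        path.to_str().map(str::to_owned),
        // Debug: invalid bytes are shown as escapes.
        format!("{path:?}"),
    )
}
```

Calling `describe(&latin_name())` gives a `display` string containing the replacement character, `None` for the strict conversion, and a debug string that still contains the `\xDC` escape.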

Emilgardis · 2022-06-21T07:37:29Z

does it make sense to add Path::display to clippy lint disallowed_methods

Alexhuszagh · 2022-06-21T07:37:30Z

I haven't tested this directly, but considering as_os_str would be in the encoding of the host locale, and our container should (I believe) always be in UTF-8, we might also need to look over the code in docker_cwd. Let me test a bit and convert this to a draft.

Update: that would require us to have a non-Unicode locale on the host (extremely unlikely) and to change the locale on the image to a non-Unicode one as well. Both are very unlikely, especially for new development, so let's just formalize UTF-8 assumptions and fix them if the need ever occurs.

Alexhuszagh · 2022-06-21T07:38:36Z

> does it make sense to add Path::display to clippy lint disallowed_methods

Yes I think so. It is incorrect for our uses, even if it's a good method for more general cases.

Since we use our mount paths, which will be UTF-8 inside the Linux container, it makes sense to ensure they're valid UTF-8 prior so they can be valid paths for the locale in the container. I've left the following functions alone:

- `MountFinder::new`, since `OsStr` is cheaper for path comparisons than `str` is.
- `cargo_metadata_with_args`, since the manifest path is on the host, and therefore this is always valid.
- `pretty_path`, since it handles both valid Unicode paths and debug representations of non-Unicode paths.

Alexhuszagh · 2022-06-21T08:18:26Z

bors r=Emilgardis

828: Avoid using display on paths. r=Emilgardis a=Alexhuszagh

Co-authored-by: Alex Huszagh <[email protected]>

Alexhuszagh · 2022-06-21T08:33:32Z

Just an additional comment in case this ever needs to be discussed again, as unlikely as that is: we need both the host and the container locale to be the same, I believe, since we need to provide bind mount mappings from the host to the container that are identical. Code like this (I believe) cannot exist if the path is not valid Unicode, since our container expects this to be a path that can be represented in UTF-8:

cross/src/docker/shared.rs

Lines 173 to 186 in 2937032

    
pub(crate) fn mount(
    docker: &mut Command,
    val: &Path,
    prefix: &str,
    verbose: bool,
) -> Result<PathBuf> {
    let host_path = file::canonicalize(val)?;
    let mount_path = canonicalize_mount_path(&host_path, verbose)?;
    docker.args(&[
        "-v",
        &format!("{}:{prefix}{}", host_path.display(), mount_path.display()),
    ]);
    Ok(mount_path)
}

However, nothing here stops you from providing non-UTF-8 filenames or directories as part of your crate data or external mounts to cross; they just cannot be the mount points themselves. For example, if I have /path/to/dir/{{ISO-8859-15}}, I can provide a mount to /path/to/dir, and this should work fine, allowing me to access my files with non-UTF-8 filenames without issue (common if you need to process data extracted from a legacy source onto a modern system).
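As a rough illustration of the strict alternative to `display()` when building the bind-mount argument, the following sketch formats the `host:prefix+mount` string and returns an error if either path is not valid UTF-8. The `utf8` and `mount_arg` helpers and the `String` error type are assumptions made for this example; they are not functions from cross.

```rust
use std::path::Path;

/// Hypothetical helper: strict, borrowing conversion to `&str`.
fn utf8(path: &Path) -> Result<&str, String> {
    path.to_str()
        .ok_or_else(|| format!("path is not valid UTF-8: {path:?}"))
}

/// Hypothetical helper: builds the value passed after `-v`, refusing to
/// lossily convert a mount point that is not valid UTF-8.
fn mount_arg(host_path: &Path, prefix: &str, mount_path: &Path) -> Result<String, String> {
    Ok(format!("{}:{prefix}{}", utf8(host_path)?, utf8(mount_path)?))
}
```

A mount point such as `/path/to/dir` converts cleanly even if files underneath it have non-UTF-8 names, since only the mount point itself is formatted.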

If any of these assumptions are wrong, this PR has made it easier to patch assumptions about UTF-8 paths in the future.

Alexhuszagh · 2022-06-21T08:44:39Z

bors r-

This isn't a bug, but it does lead to / being duplicated for the path immediately after root, so /foo/bar becomes //foo/bar. I'll add tests too for this.

bors · 2022-06-21T08:44:40Z

Canceled.

Previously, it would double the leading `/` of a rooted path (producing `//root`), since it added a `/` for the root and an additional one before the first component after it. Also added unit tests to ensure `as_posix` works as expected with drive letters and normal paths.
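One way to avoid the doubled separator is to add a `/` only between components and never directly after the root. The sketch below is an assumption about how such a conversion could be written with `Path::components`; it is not the `as_posix` from this PR, and it deliberately does not handle Windows drive prefixes.

```rust
use std::path::{Component, Path};

/// Sketch of an `as_posix`-style conversion (hypothetical implementation).
/// Returns `None` for non-UTF-8 components or Windows prefixes.
fn as_posix(path: &Path) -> Option<String> {
    let mut out = String::new();
    for component in path.components() {
        match component {
            // Drive letters and other prefixes are out of scope for this sketch.
            Component::Prefix(_) => return None,
            // The root contributes exactly one `/`.
            Component::RootDir => out.push('/'),
            Component::CurDir => push_segment(&mut out, "."),
            Component::ParentDir => push_segment(&mut out, ".."),
            Component::Normal(name) => push_segment(&mut out, name.to_str()?),
        }
    }
    Some(out)
}

/// Appends a segment, inserting a separator only when one is needed.
fn push_segment(out: &mut String, segment: &str) {
    if !out.is_empty() && !out.ends_with('/') {
        out.push('/');
    }
    out.push_str(segment);
}
```

Written this way, a rooted path keeps a single leading `/`, and a relative path such as `project/src` stays relative rather than gaining a leading separator.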

Alexhuszagh · 2022-06-21T09:44:37Z

bors r=Emilgardis

bors · 2022-06-21T10:28:36Z

Build succeeded:

conclusion

Prior to cross-rs#828, our makeshift `as_posix` added a '/' separator before each component, so the path would always become absolute, even though the actual path was relative. After that change, `as_posix` handled such a path as relative, so `project` and not `/project` was used as the working directory for the cross command, which caused commands to fail.

831: Fixes regression in #828 with absolute path. r=Emilgardis a=Alexhuszagh

Co-authored-by: Alex Huszagh <[email protected]>

Alexhuszagh added the no-changelog label (a valid PR without changelog) Jun 21, 2022

Alexhuszagh requested a review from a team as a code owner June 21, 2022 07:27

Add PathExt trait to convert path to POSIX path.

44eda56

Alexhuszagh marked this pull request as draft June 21, 2022 07:40

Alexhuszagh marked this pull request as ready for review June 21, 2022 08:03

Alexhuszagh removed the no-changelog label (a valid PR without changelog) Jun 21, 2022

Emilgardis approved these changes Jun 21, 2022

Add deny lint for Path::display.

0741969

Alexhuszagh force-pushed the utf8_path branch from 2d5de94 to 0741969 Compare June 21, 2022 08:14

Fix minor formatting issue with as_posix.

8e51321

Emilgardis approved these changes Jun 21, 2022

bors bot merged commit 0cb97dd into cross-rs:main Jun 21, 2022

Alexhuszagh mentioned this pull request Jun 21, 2022

Add comprehensive support for remote docker. #785

Merged

Alexhuszagh mentioned this pull request Jun 21, 2022

Fixes regression in #828 with absolute path. #831

Merged

Alexhuszagh deleted the utf8_path branch June 23, 2022 23:25

Emilgardis added this to the v0.2.2 milestone Jun 24, 2022

Emilgardis mentioned this pull request Jul 3, 2022

wiki check toml #907

Merged

Alexhuszagh added the no-ci-targets label (PRs that do not affect any cross-compilation targets) Nov 6, 2022
