Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix handling invalid(?) unicode in filenames. #163

Merged
merged 4 commits into from
Jul 28, 2024

Conversation

wxsBSD
Copy link
Contributor

@wxsBSD wxsBSD commented Jul 24, 2024

When running yr scan against some random files I have laying around I noticed a crash. I think it is because there is invalid (at least I think it is, I'm so bad at understanding unicode) unicode in the filename of one of the files contained in 84523ddad722e205e2d52eedfb682026928b63f919a7bf1ce6f1ad4180d0f507 (available on VT).

This changes it so the string is using the debug formatter which handles the invalid utf-8 better? This seems to work fine on my machine.

When running yr scan against some random files I have laying around I noticed a
crash. I think it is because there is invalid (at least I think it is, I'm so
bad at understanding unicode) unicode in the filename of one of the files
contained in 84523ddad722e205e2d52eedfb682026928b63f919a7bf1ce6f1ad4180d0f507
(available on VT).

This changes it so the string is using the debug formatter which handles the
invalid utf-8 better? This seems to work fine on my machine.
@plusvic
Copy link
Member

plusvic commented Jul 26, 2024

I'm unable to reproduce this issue in MacOS or Linux. I tried downloading and unzipping the file 84523ddad722e205e2d52eedfb682026928b63f919a7bf1ce6f1ad4180d0f507 and then scanning the directory, but it didn't fail.

Notice however that file.display().to_string() and format!("{:?}", file.display().to_string()) doesn't produce exactly the same result, the latter produces a file path that is enclosed in double quotes.

Can you try using let path = file.to_string_lossy().to_string(); and see what happens?

@plusvic
Copy link
Member

plusvic commented Jul 26, 2024

BTW: the problem here is that the superconsole crate refuses to print any string that contains Unicode characters that are considered spaces but are not the ASCII space character (ASCII 32).

These are the characters considered whitespaces: https://doc.rust-lang.org/reference/whitespace.html

A more robust approach is writing our own function that replaces those characters with the ASCII 32 character. It would be very straightforward, every character c where c.is_whitespace() returns true, is replaced.

@wxsBSD
Copy link
Contributor Author

wxsBSD commented Jul 26, 2024

I implemented a replace_whitespace() which works for me.

What I don't understand is that this is only when outputting the files being scanned, not when outputting a match. I would have expected to also alter the code around https://github.com/VirusTotal/yara-x/blob/main/cli/src/commands/scan.rs#L450 too, but for some reason that does not crash when outputting a match on the same file for me.

@plusvic
Copy link
Member

plusvic commented Jul 26, 2024

A more efficient implementation would be:

fn replace_whitespace(path: &PathBuf) -> Cow<str> {
    let mut s = path.to_string_lossy();
    for (i, c) in s.char_indices() {
        if c != ' ' && c.is_whitespace() {
            s.to_mut().replace_range(i..i, " ");
        }
    }
    s
}

With this implementation, replace_whitespace receives a reference to a PathBuf directly, so you don't need to call neither .display() nor .to_string(), which will create a copy of the path regardless of whether they have non-ASCII characters or not. Notice that this function returns a Cow<str> instead of a String, Cow<str> is a string that is initially a reference to some other string, but a copy of that string is done when you need to mutate the original string for some reason.

&PathBuf is converted to a String using .to_string_lossy(), which also returns a Cow<str> and won't create a copy unless the path contains some non-UTF8 character (in most cases this copy is not necessary).

Also, s.to_mut() creates a copy of the original path only when a non-ASCII whitespace is found. As a result, most of the paths, which are valid UTF-8 characters that don't contain any non-ASCII whitespaces, won't be copied at all. A reference to the original PathBuf will be returned instead.

This requires adapting truncate_with_ellipsis to receive a Cow<str>, which is left as an exercise to you.

@wxsBSD
Copy link
Contributor Author

wxsBSD commented Jul 26, 2024

fn replace_whitespace(path: &PathBuf) -> Cow<str> {
    let mut s = path.to_string_lossy();
    for (i, c) in s.char_indices() {
        if c != ' ' && c.is_whitespace() {
            s.to_mut().replace_range(i..i, " ");
        }
    }
    s
}

This mutates the string as you're iterating over it, which the type checker is unhappy with:

error[E0502]: cannot borrow `s` as mutable because it is also borrowed as immutable
   --> cli/src/commands/scan.rs:537:13
    |
535 |     for (i, c) in s.char_indices() {
    |                   ----------------
    |                   |
    |                   immutable borrow occurs here
    |                   immutable borrow later used here
536 |         if c != ' ' && c.is_whitespace() {
537 |             s.to_mut().replace_range(i..i, " ");
    |             ^^^^^^^^^^ mutable borrow occurs here

@plusvic
Copy link
Member

plusvic commented Jul 26, 2024

for (i, c) in s.char_indices() {
if c != ' ' && c.is_whitespace() {
s.to_mut().replace_range(i..i, " ");
}
}

Oh, that's right. Then something like this should work...

fn replace_whitespace(path: &PathBuf) -> Cow<str> {
    let mut s = path.to_string_lossy();
    if s.chars().any(|c| c != ' ' && c.is_whitespace()) {
        let mut r = String::with_capacity(s.len());
        for c in s.chars() {
            if c.is_whitespace() {
                r.push(' ')
            } else {
                r.push(c)
            }
        }
        s = Cow::Owned(r);
    }
    s
}

The allocation for the new string will happen only in the case of strings containing a strange whitespace.

Use suggestion from Victor on how to make this only allocate a new string when
necessary. If there are no non-ascii whitespace characters we don't do an
allocation, and if there is we should only allocate once.
@plusvic plusvic merged commit 7b10500 into VirusTotal:main Jul 28, 2024
15 checks passed
@wxsBSD wxsBSD deleted the crash branch July 28, 2024 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants