fix: Fix handling invalid(?) unicode in filenames. #163

wxsBSD · 2024-07-24T20:49:04Z

When running yr scan against some random files I have laying around I noticed a crash. I think it is because there is invalid (at least I think it is, I'm so bad at understanding unicode) unicode in the filename of one of the files contained in 84523ddad722e205e2d52eedfb682026928b63f919a7bf1ce6f1ad4180d0f507 (available on VT).

This changes it so the string is using the debug formatter which handles the invalid utf-8 better? This seems to work fine on my machine.

When running yr scan against some random files I have laying around I noticed a crash. I think it is because there is invalid (at least I think it is, I'm so bad at understanding unicode) unicode in the filename of one of the files contained in 84523ddad722e205e2d52eedfb682026928b63f919a7bf1ce6f1ad4180d0f507 (available on VT). This changes it so the string is using the debug formatter which handles the invalid utf-8 better? This seems to work fine on my machine.

plusvic · 2024-07-26T08:42:16Z

I'm unable to reproduce this issue in MacOS or Linux. I tried downloading and unzipping the file 84523ddad722e205e2d52eedfb682026928b63f919a7bf1ce6f1ad4180d0f507 and then scanning the directory, but it didn't fail.

Notice however that file.display().to_string() and format!("{:?}", file.display().to_string()) doesn't produce exactly the same result, the latter produces a file path that is enclosed in double quotes.

Can you try using let path = file.to_string_lossy().to_string(); and see what happens?

plusvic · 2024-07-26T08:59:02Z

BTW: the problem here is that the superconsole crate refuses to print any string that contains Unicode characters that are considered spaces but are not the ASCII space character (ASCII 32).

These are the characters considered whitespaces: https://doc.rust-lang.org/reference/whitespace.html

A more robust approach is writing our own function that replaces those characters with the ASCII 32 character. It would be very straightforward, every character c where c.is_whitespace() returns true, is replaced.

wxsBSD · 2024-07-26T14:17:58Z

I implemented a replace_whitespace() which works for me.

What I don't understand is that this is only when outputting the files being scanned, not when outputting a match. I would have expected to also alter the code around https://github.com/VirusTotal/yara-x/blob/main/cli/src/commands/scan.rs#L450 too, but for some reason that does not crash when outputting a match on the same file for me.

plusvic · 2024-07-26T14:46:23Z

A more efficient implementation would be:

fn replace_whitespace(path: &PathBuf) -> Cow<str> {
    let mut s = path.to_string_lossy();
    for (i, c) in s.char_indices() {
        if c != ' ' && c.is_whitespace() {
            s.to_mut().replace_range(i..i, " ");
        }
    }
    s
}

With this implementation, replace_whitespace receives a reference to a PathBuf directly, so you don't need to call neither .display() nor .to_string(), which will create a copy of the path regardless of whether they have non-ASCII characters or not. Notice that this function returns a Cow<str> instead of a String, Cow<str> is a string that is initially a reference to some other string, but a copy of that string is done when you need to mutate the original string for some reason.

&PathBuf is converted to a String using .to_string_lossy(), which also returns a Cow<str> and won't create a copy unless the path contains some non-UTF8 character (in most cases this copy is not necessary).

Also, s.to_mut() creates a copy of the original path only when a non-ASCII whitespace is found. As a result, most of the paths, which are valid UTF-8 characters that don't contain any non-ASCII whitespaces, won't be copied at all. A reference to the original PathBuf will be returned instead.

This requires adapting truncate_with_ellipsis to receive a Cow<str>, which is left as an exercise to you.

wxsBSD · 2024-07-26T15:54:23Z

fn replace_whitespace(path: &PathBuf) -> Cow<str> {
    let mut s = path.to_string_lossy();
    for (i, c) in s.char_indices() {
        if c != ' ' && c.is_whitespace() {
            s.to_mut().replace_range(i..i, " ");
        }
    }
    s
}

This mutates the string as you're iterating over it, which the type checker is unhappy with:

error[E0502]: cannot borrow `s` as mutable because it is also borrowed as immutable
   --> cli/src/commands/scan.rs:537:13
    |
535 |     for (i, c) in s.char_indices() {
    |                   ----------------
    |                   |
    |                   immutable borrow occurs here
    |                   immutable borrow later used here
536 |         if c != ' ' && c.is_whitespace() {
537 |             s.to_mut().replace_range(i..i, " ");
    |             ^^^^^^^^^^ mutable borrow occurs here

plusvic · 2024-07-26T21:03:14Z

for (i, c) in s.char_indices() {
if c != ' ' && c.is_whitespace() {
s.to_mut().replace_range(i..i, " ");
}
}

Oh, that's right. Then something like this should work...

fn replace_whitespace(path: &PathBuf) -> Cow<str> {
    let mut s = path.to_string_lossy();
    if s.chars().any(|c| c != ' ' && c.is_whitespace()) {
        let mut r = String::with_capacity(s.len());
        for c in s.chars() {
            if c.is_whitespace() {
                r.push(' ')
            } else {
                r.push(c)
            }
        }
        s = Cow::Owned(r);
    }
    s
}

The allocation for the new string will happen only in the case of strings containing a strange whitespace.

Use suggestion from Victor on how to make this only allocate a new string when necessary. If there are no non-ascii whitespace characters we don't do an allocation, and if there is we should only allocate once.

Implement replace_whitespace()

de1c33a

wxsBSD added 2 commits July 26, 2024 17:06

Improve replace_whitespace()

d12002a

Use suggestion from Victor on how to make this only allocate a new string when necessary. If there are no non-ascii whitespace characters we don't do an allocation, and if there is we should only allocate once.

Use &Path and not &PathBuf in replace_whitespace().

580429b

plusvic merged commit 7b10500 into VirusTotal:main Jul 28, 2024
15 checks passed

wxsBSD deleted the crash branch July 28, 2024 23:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Fix handling invalid(?) unicode in filenames. #163

fix: Fix handling invalid(?) unicode in filenames. #163

wxsBSD commented Jul 24, 2024

plusvic commented Jul 26, 2024 •

edited

Loading

plusvic commented Jul 26, 2024

wxsBSD commented Jul 26, 2024

plusvic commented Jul 26, 2024 •

edited

Loading

wxsBSD commented Jul 26, 2024

plusvic commented Jul 26, 2024

fix: Fix handling invalid(?) unicode in filenames. #163

fix: Fix handling invalid(?) unicode in filenames. #163

Conversation

wxsBSD commented Jul 24, 2024

plusvic commented Jul 26, 2024 • edited Loading

plusvic commented Jul 26, 2024

wxsBSD commented Jul 26, 2024

plusvic commented Jul 26, 2024 • edited Loading

wxsBSD commented Jul 26, 2024

plusvic commented Jul 26, 2024

plusvic commented Jul 26, 2024 •

edited

Loading

plusvic commented Jul 26, 2024 •

edited

Loading