Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return char instead of u8 from InvalidCharError::invalid_char #100

Open
Kixunil opened this issue Aug 19, 2024 · 13 comments
Open

Return char instead of u8 from InvalidCharError::invalid_char #100

Kixunil opened this issue Aug 19, 2024 · 13 comments
Labels
1.0 Issues and PRs required or helping to stabilize the API enhancement New feature or request

Comments

@Kixunil
Copy link
Collaborator

Kixunil commented Aug 19, 2024

If the string contains a multi-byte UTF-8 char then the error returns confusing value unsuitable for the user.

This change is kinda breaking but we can phase it slowly: add invalid_char_human(&self) -> char method (please invent a better name!), implement char -> u8 conversion inside invalid_char by encoding the first byte of the char and deprecate it.

I suspect this may need a bunch of changes to be able to get the entire char though.

@Kixunil Kixunil added enhancement New feature or request 1.0 Issues and PRs required or helping to stabilize the API labels Aug 19, 2024
@apoelstra
Copy link
Member

We had a similar bug in rust-bech32 in our new Hrp::from_display API. Fortunately we caught it before releasing.

Agreed with your strategy. Ideas for the new method name

  • invalid
  • invalid_character
  • character
  • char_ (lol)

@Kixunil
Copy link
Collaborator Author

Kixunil commented Aug 19, 2024

character sounds least bad to me.

@tcharding
Copy link
Member

I had a play with this which led me to the view that

  • Adding checks for utf-8 slows down parsing (because without some complex trickery we need to parse the input string twice)
  • Giving a nice error message for utf-8 adds either complexity to the lib or a perf hit or both
  • Parsing utf-8 (non ascii) characters is an edge case not the normal case
  • We should not have a perf hit or unnecessary complexity for an edge case

I hacked up a validate_hex_string function but it is so trivial and undiscoverable that users will likely just get stuck debugging then write their own when they think of it.

/// Validates that input `string` contains valid hex characters.
pub fn validate_hex_string(s: &str) -> Result<(), NonHexDigitError> {
    for c in s.chars() {
        if !c.is_ascii_hexdigit() {
            return Err(NonHexDigitError { invalid: c});
        }
    }
    Ok(())
}

Perhaps a middle ground would be to add this to crate docs under some sort of # Debugging section?

@apoelstra
Copy link
Member

UTF-8 is designed so that as soon as you hit a non-ASCII byte you know that it's the start of a non-ASCII character. So we shouldn't need to use the slow chars iterator or anything. Just, when we hit a non-ASCII character then we need to maybe do some slow stuff to extract the full character that we've run into. (Taking s[pos..].chars().next().unwrap() would work for example, if we know the byte position of the bad character. And knowing this won't slow us down at all.)

@tcharding
Copy link
Member

That sounds feasible, I'll have a play with it. Thanks

@Kixunil
Copy link
Collaborator Author

Kixunil commented Sep 4, 2024

Another possible approach that doesn't require explicitly tracking the position and relies on iterators only:

// We're not using `.bytes()` because it doesn't have the `as_slice` method.
let mut bytes = s.as_bytes().iter();
// clone() is actually a copy and it allows peeking behavior without losing access to the `as_slice` method on the iterator.
while let Some(byte) = bytes.clone().next() {
    match decode_hex_digit(byte) {
        Some(value) => { /* do something with the value here */ },
        None => {
            // SAFETY: all bytes up to this one were ACII which implies valid UTF-8 and the input was valid UTF-8 to begin with so splitting the string here is sound. We're also making sure we split it at correct position by cloning the iterator.
            let remaining_str = unsafe { core::str::from_utf8_unchecked(bytes.as_slice()) };
            return Err(InvalidChar { c: remaining_str.chars().next().unwrap(), pos: s.len() - remaining_str.len() });
        },
    }
    bytes.next()
}

@apoelstra
Copy link
Member

👀 can you really call .next() directly on a byteslice like that?

@Kixunil
Copy link
Collaborator Author

Kixunil commented Sep 4, 2024

LOl, I forgot to write .iter(). :D

@apoelstra
Copy link
Member

Ah! The .iter() is on bytes. Gotcha. Yep, that would work. I don't feel strongly in favor of either solution.

@Kixunil
Copy link
Collaborator Author

Kixunil commented Sep 5, 2024

FTR I was also considering having enum InvalidChar { Utf8(char), Other(u8) } to allow parsing raw &[u8] slices without prior UTF-8 validation (e.g. when parsing from stdin). But I think it's better to have separate types for those cases.

@apoelstra
Copy link
Member

But I think it's better to have separate types for those cases.

What do you mean by this?

@Kixunil
Copy link
Collaborator Author

Kixunil commented Sep 6, 2024

Basically if we ever add functions that take &[u8] we would make them return separate error types to indicate that the bytes may not be UTF-8.

@apoelstra
Copy link
Member

Ah. yeah, that's probably best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.0 Issues and PRs required or helping to stabilize the API enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants