Skip to content
This repository has been archived by the owner on Jul 22, 2019. It is now read-only.

No support for non-UTF8 parsing #5

Open
jonhoo opened this issue Nov 3, 2017 · 6 comments
Open

No support for non-UTF8 parsing #5

jonhoo opened this issue Nov 3, 2017 · 6 comments

Comments

@jonhoo
Copy link

jonhoo commented Nov 3, 2017

It is (unfortunately) not entirely uncommon for IMAP servers to not support UTF-8 (see mattnenterprise/rust-imap#54 for more), but imap-proto currently returns &str all over the place, which necessarily implies that it only works on UTF-8 response. I think the right thing to do here (although it makes the types a bit less pleasant) is to parameterize all the types by a T: TryFrom<&[u8]>. That way, users can choose whether to parse into str, or simply keep things as &[u8] (or do some other decoding). Unfortunately, TryFrom is still nightly-only, and hasn't even landed yet for strings (rust-lang/rust#44916), so in the meantime, you'd probably just want to add a trait like this:

pub trait FromByteResponse<'a> {
    fn from_bytes(&'a [u8]) -> Self;
}

impl<'a> FromByteResponse<'a> for &'a str {
    fn from_bytes(b: &'a [u8]) -> Self {
        use std;
        std::str::from_utf8(b).unwrap()
    }
}

impl<'a> FromByteResponse<'a> for &'a [u8] {
    fn from_bytes(b: &'a [u8]) -> Self {
        b
    }
}

And then make the various types use this trait. For example, for MailboxDatum:

pub enum MailboxDatum<T: for<'a> FromByteResponse<'a>> {
    Exists(u32),
    Flags(Vec<T>),
    List(Vec<T>, T, T),
    Recent(u32),
}
@djc
Copy link
Owner

djc commented Nov 4, 2017

So I'm inclined to say that the ergonomics of this are maybe up to users of this crate to handle. Then this crate can focus on providing a correct protocol implementation (using bytes wherever needed), and you can for example have rust-imap offer a nicer API on top of this somehow.

This is kind of predicated on the notion that non-UTF-8 data will really only be a problem for message contents. Have you actually seen this be a problem for header data?

It also relies a little bit on my perception that using the message parsing parts of IMAP should mostly be avoided in favor of handling message parsing in the client application... Some parts of IMAP really seem like a layering violation that way.

@sanmai-NL
Copy link
Contributor

@djc:

  1. Yes, headers can be non-UTF-8 as well. See e.g. https://tools.ietf.org/html/rfc5738#section-8.

  2. Where shall we draw the boundary between this crate and the mailparse crate for message parsing?

@jonhoo
Copy link
Author

jonhoo commented Nov 4, 2017

Yeah, it's the non-UTF-8 headers that are the issue (in part because not all servers support UTF-8 up-conversion). We could comb through the specs to find which fields are safe to always parse as unicode strings, and which aren't, but it seems better (at least to me) to just have the user choose how to interpret text-like &[u8]s using a trait like the one outlined in the first comment.

For what it's worth, I completely agree with you that if headers were all valid unicode, then this library should just use str everywhere with the exception of message contents. Unfortunately, as @sanmai-NL points out, that doesn't seem to be a realistic assumption :( See also further discussion in mattnenterprise/rust-imap#11.

@sanmai-NL
Copy link
Contributor

@jonhoo: In my mind, we can just use bytes everywhere and add the UTF-8 API later. I think correctness and completeness of implementation is more important in this early stage than convenience. I'd like us to starting filing issues for missing implementations and bugs, and I'll starting working on them!

@djc
Copy link
Owner

djc commented Nov 11, 2017

@sanmai-NL RFC 5738 is an extension, so that in my understanding a client should not have to support it.

As for the boundary between this crate and something like mailparse (or my own email-parser crate), I'm not sure how you see that? Especially focusing on the client implementation, according to my understanding an IMAP client is not required to do any email parsing; the server will need to do some of it to be able to return ENVELOPE structures, for example.

jonhoo added a commit to jonhoo/imap-proto that referenced this issue Nov 19, 2017
This fixes djc#5 as proposed in that issue. Specifically, it introduces a
trait, `FromByteResponse`, and provides a generic `parse_response`
function that parses input as raw byte strings, and then maps them using
some implementation of that trait. This in turns allows users to choose
how they want to parse byte sequences in an ergonomic way.

Under the hood, this is *slightly* less efficient that it could be.
Specifically, it parses everything as `&[u8]` first, and then maps
everything using calls to `FromByteResponse`. This works fine, but will
cause unnecessary re-allocation of vectors inside a bunch of structs.
The way to work around this is to have `nom` directly use a generic
return value in all its parsers, but this unfortunately doesn't seem to
be supported at the time of writing.
jonhoo added a commit to jonhoo/imap-proto that referenced this issue Nov 19, 2017
This fixes djc#5 as proposed in that issue. Specifically, it introduces a
trait, `FromByteResponse`, and provides a generic `parse_response`
function that parses input as raw byte strings, and then maps them using
some implementation of that trait. This in turns allows users to choose
how they want to parse byte sequences in an ergonomic way.

Under the hood, this is *slightly* less efficient that it could be.
Specifically, it parses everything as `&[u8]` first, and then maps
everything using calls to `FromByteResponse`. This works fine, but will
cause unnecessary re-allocation of vectors inside a bunch of structs.
The way to work around this is to have `nom` directly use a generic
return value in all its parsers, but this unfortunately doesn't seem to
be supported at the time of writing.
@jonhoo
Copy link
Author

jonhoo commented Nov 21, 2017

Interested readers should also see PR #12, and the reason for its closing.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants