Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

serdect: is it actually constant-time? #1111

Closed
fjarri opened this issue Jun 20, 2023 · 40 comments
Closed

serdect: is it actually constant-time? #1111

fjarri opened this issue Jun 20, 2023 · 40 comments

Comments

@fjarri
Copy link
Contributor

fjarri commented Jun 20, 2023

The binary serializer uses serializer.serialize_tuple() and serialize_element() which, in some formats at least, makes it data-dependent. E.g. MessagePack prepends every element greater than 127 with 0xCC.

Also, this contradicts the documentation claim:

When using a binary format, the data is serialized as-is into binary.

What was the reason behind not using serialize_bytes()? Seems like it would provide better constant-time guarantees?

@tarcieri
Copy link
Member

The README describes it as follows:

This crate provides "best effort" constant-time helper methods for reducing the amount of timing variability involved in serializing/deserializing data when using serde, Rust's standard serialization framework.

Really the goal is to be less noisy than e.g. a JSON serialization of bytes as an integer array. PRs accepted to improve the wording in the README.

What was the reason behind not using serialize_bytes()? Seems like it would provide better constant-time guarantees?

The slice serializer just uses the Serialize::serialize impl on [u8]. I'm not familiar with how that differs from serialize_bytes, but if you think there's a compelling case I'd suggest opening a PR. We test with a number of different serde format impls so perhaps you can point out specific deficiencies with a particular one.

serialize_tuple is used by the array serializer, which avoids length prefixing the data because it's known a priori by the type system. That implementation avoids a problematic deficiency with serde which prevents it from working with const generics: serde-rs/serde#1937

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

The slice serializer just uses the Serialize::serialize impl on [u8]. I'm not familiar with how that differs from serialize_bytes

It seems that it is different, since serialize_bytes serializes bytes verbatim in MessagePack, while serialize doesn't.

serialize_tuple is used by the array serializer, which avoids length prefixing the data because it's known a priori by the type system.

That's true, but then binary formats instead prefix separate elements, which kind of defeats the point.

@tarcieri
Copy link
Member

Can you point to a specific implementation, and ideally, write some test cases which capture the issue?

@tarcieri
Copy link
Member

cc @daxpedda

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

Can you point to a specific implementation, and ideally, write some test cases which capture the issue?

Let me make an MRE.

Also, speaking of serialize_tuple(), I would argue that while it would be nice to have an array serialization with no length specifier, it is secondary to the intent of serdect which is to a) strive for constant-timeness, and b) serialize bytes unchanged.

@tarcieri
Copy link
Member

Again, we provide both options. Use the slice serializer if you're okay with a length prefix.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

They are not completely interchangeable. slice::deserialize_hex_or_bin() doesn't fail if the buffer is larger than what's being deserialized, which is not what I want if I'm actually deserializing an array.

@tarcieri
Copy link
Member

I am loathe to change it without an actual empirical demonstration of timing variability or other performance issue. Can you put one together?

@tarcieri
Copy link
Member

Also note there is a long, long history of anecdotal reports like this with the various serde serializers across the @RustCrypto project, which is why we put this crate together and why it tests against various format implementations, so I would really like to see running code as the motivation for any changes.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

I am loathe to change it without an actual empirical demonstration of timing variability or other performance issue. Can you put one together?

On it now. But if the length of the serialized bytestring is data-dependent, doesn't timing variability naturally follow? Not to mention that it's quite unexpected to get a variable-sized serialized data when you thought you were serializing a constant-sized array.

@tarcieri
Copy link
Member

If the length varies, then the timing will as well, yes.

However, that would seem like a sharp edge and counterclaim to your argument that using a length-prefixed serialization would be better than one guaranteed fixed-length by the type system, if anything.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

use serde::{Serialize, Serializer};

#[derive(Serialize)]
struct A(
    #[serde(serialize_with = "serdect::slice::serialize_hex_lower_or_bin")]
    [u8; 8]
    );

#[derive(Serialize)]
struct B(
    #[serde(serialize_with = "serdect::array::serialize_hex_lower_or_bin")]
    [u8; 8]
    );

struct C([u8; 8]);

impl Serialize for C {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        serializer.serialize_bytes(&self.0)
    }
}


fn main() {
    let a = A([1, 2, 254, 255, 3, 4, 252, 253]);

    let b = rmp_serde::encode::to_vec(&a);
    println!("messagepack + slice: {:?}", b);

    let b = bincode::serialize(&a).unwrap();
    println!("bincode + slice: {:?}", b);

    let a = B([1, 2, 254, 255, 3, 4, 252, 253]);

    let b = rmp_serde::encode::to_vec(&a);
    println!("messagepack + array: {:?}", b);

    let b = bincode::serialize(&a).unwrap();
    println!("bincode + array: {:?}", b);

    let a = C([1, 2, 254, 255, 3, 4, 252, 253]);

    let b = rmp_serde::encode::to_vec(&a);
    println!("messagepack + serialize_bytes: {:?}", b);

    let b = bincode::serialize(&a).unwrap();
    println!("bincode + serialize_bytes: {:?}", b);
}

Output:

messagepack + slice: Ok([152, 1, 2, 204, 254, 204, 255, 3, 4, 204, 252, 204, 253])
bincode + slice: [8, 0, 0, 0, 0, 0, 0, 0, 1, 2, 254, 255, 3, 4, 252, 253]
messagepack + array: Ok([152, 1, 2, 204, 254, 204, 255, 3, 4, 204, 252, 204, 253])
bincode + array: [1, 2, 254, 255, 3, 4, 252, 253]
messagepack + serialize_bytes: Ok([196, 8, 1, 2, 254, 255, 3, 4, 252, 253])
bincode + serialize_bytes: [8, 0, 0, 0, 0, 0, 0, 0, 1, 2, 254, 255, 3, 4, 252, 253]

So Bincode behaves well, MessagePack doesn't.

However, that would seem like a sharp edge and counterclaim to your argument that using a length-prefixed serialization would be better than one guaranteed fixed-length by the type system, if anything.

The type system guarantees that the format will receive a constant-sized tuple, but not what it will do with it. Of course, giving it a byte slice instead does not guarantee it either, but it seems to me that it is a better "best effort" for constant-timeness.

@tarcieri
Copy link
Member

I'm not sure I see anything actionable there?

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

The "204" elements are not in the actual data and are added before any byte with the value above 127, making the serialization length (and time) data-dependent.

Also it shows the difference between serialize() (used in serdect::slice) and serialize_bytes

@tarcieri
Copy link
Member

What are you suggesting as a mitigation?

(Honestly, this seems like an argument against MessagePack, not serdect)

@tarcieri
Copy link
Member

tarcieri commented Jun 20, 2023

It sounds like moving to serialize_bytes might be a good idea, but has other potential tradeoffs. Can you open a separate issue specific to that?

It also sounds like it might be a difficult breaking change, depending on how various format implementations handle it.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

(Honestly, this seems like an argument against MessagePack, not serdect)

There may be other formats that do the same. My argument is that serialize_bytes() provides better constant-time guarantees than serialize() or serialize_tuple(), across the whole existing zoo of formats.

Yes, it will be certainly a breaking ABI change, there's no way around it.

@tarcieri
Copy link
Member

There may be other formats that do the same

We can only provide guarantees for the formats we test against, so for any format you want to have guarantees, it needs to be added to our test suite.

Obviously we can't provide guarantees for things we haven't tested against.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

If one variant works (in the sense of providing constant-time guarantees) in Bincode and CBOR (currently tested ones), and the other works in Bincode, CBOR, and MessagePack, I would argue that the latter variant is better. MessagePack is about 70 times more popular than CBOR according to crates.io download data.

@tarcieri
Copy link
Member

Can you open a separate issue for a potential switch to serialize_bytes()?

Also I think we can close this issue as "asked and answered". If you feel the existing documentation is insufficient, please open a PR.

@daxpedda
Copy link
Contributor

Still waiting for the new issue, but I'm not exactly following what the issue is here.

serialize_bytes() is for variable length data, which ours is not. serialize_tuple() seems to be the right call to me.
I think this is a deficiency in MessagePack, or it's implementation, I know nothing about MessagePack.

Also this doesn't seem like a constant-time issue to me, that different sizes have different timings is kind of unavoidable. I don't think this is what serdect is trying to solve. What it does intend to address is different data having different timings.
Please correct me if I'm wrong here @tarcieri.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

I think this is a deficiency in MessagePack

Deficiency or not, it's a de-facto standard used by a lot of people.

that different sizes have different timings is kind of unavoidable

Same sizes, different data.

@tarcieri
Copy link
Member

tarcieri commented Jun 20, 2023

We deal with both types of data. serialize_bytes() seems good for slices, less so for arrays since it adds an unnecessary length prefix.

Whether or not we need to add a redundant length prefix to every other format for encoded arrays as a workaround for how MessagePack handles tuples is a rather ugly and highly debatable tradeoff.

I think we should perhaps get a bit more diligent about our list of supported format implementations as it seems we should be worrying about formats that do wacky things like serialize binary data in variable-time. It'd be good to get a table added to the README.md at least.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

it seems we should be worrying about formats that do wacky things like serialize binary data in variable-time

To be fair, with the tuple approach, we are serializing separate u8s, not binary data. Which is probably funneled into some "unsized integer" internal representation, and that's how MessagePack sees it.

@daxpedda
Copy link
Contributor

daxpedda commented Jun 20, 2023

I think this is a deficiency in MessagePack

Deficiency or not, it's a de-facto standard used by a lot of people.

So MessagePack seems to support fixed sized arrays and tuples.
I'm not sure what exactly rmp_serde does, but MessagePack should support the right thing.

that different sizes have different timings is kind of unavoidable

Same sizes, different data.

Are you saying the output length is different depending on data if the size is the same?

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

Are you saying the output length is different depending on data if the size is the same?

Yes, see the posted example. Every byte over 127 gets prefixed with 0xCC.

@tarcieri
Copy link
Member

I'm not sure what exactly rmp_serde does, but MessagePack should support the right thing.

Curious if there are other Rust (or otherwise) implementations of MessagePack which handle this better

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

So MessagePack seems to support fixed sized arrays and tuples.

It's not about that. The problem is that serializing u8 is funneled into serializing u64, which is then packed. It seems though that rmp has a write_u8 method, so I'll see if that helps.

@daxpedda
Copy link
Contributor

Yes, see the posted example. Every byte over 127 gets prefixed with 0xCC.

Apologies, I think I finally understand the problem now, I should have been reading more carefully.

Yeah, that's definitely an issue. So as you are saying, rmp should use write_u8 when it can to avoid packing in this case. Let's hope this is an issue in the implementation, because I think it would be bad to have to switch to serialize_bytes() for fixed size arrays.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

I'm looking into it now, and it doesn't seem like the format supports fixed-size u8 values. write_u8 is exactly the one that adds the 0xCC prefix (which is a marker of a 1-byte integer, see the spec).

@daxpedda
Copy link
Contributor

daxpedda commented Jun 20, 2023

But shouldn't it add this prefix for every u8 then, regardless of it's value?

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

Nope, see the first line:

positive fixint | 0xxxxxxx | 0x00 - 0x7f

A single byte starting from the bit 0 is an u8 under 128.

@daxpedda
Copy link
Contributor

Ah I see ... that's so weird ...

Apologies if you already explained that, but why is serialize_bytes() not doing that?

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

@daxpedda
Copy link
Contributor

I guess the issue is that serde doesn't really have proper support for fixed sized arrays, tuple is just a "bad" workaround ... which specifically MessagePack implementers can't really do anything about.

@tarcieri I think I'm in favor of changing this now, simply because serde doesn't properly support it. WDYT?

@tarcieri
Copy link
Member

As I said, I think it might make sense for slices. Arrays are more debatable.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 20, 2023

Personally, while the array thing is inconvenient, I can live without it, but I would like to use serialize_bytes() at least for slices. I'm working on a PR now, having some issues with CBOR.

@daxpedda
Copy link
Contributor

As I said, I think it might make sense for slices. Arrays are more debatable.

Yeah, I'm in favor of both. Though I understand that there is a tradeoff here.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 21, 2023

By the way, apparently CBOR uses packing as well. I replaced 0x0F with 0xFF in the EXAMPLE_BYTES in the tests, and both array and slice tests started to fail. It represents 0xFF as [24, 255]. serialize_bytes fixes this.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 21, 2023

Created PR #1112

tarcieri pushed a commit that referenced this issue Oct 31, 2023
See #1111 for the discussion leading to this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants