Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantics of deserialisation #870

Closed
rossberg opened this issue Nov 14, 2019 · 20 comments
Closed

Semantics of deserialisation #870

rossberg opened this issue Nov 14, 2019 · 20 comments
Labels
idl Candid or serialisation P3 low priority, resolve when there is time

Comments

@rossberg
Copy link
Contributor

rossberg commented Nov 14, 2019

The design doc currently defines deserialisation of IDL values by an elaboration function that is only defined when subtyping holds. However, that is not how we implement it, because deserialisation actually proceeds by dynamically traversing types and value. In particular, it avoids descending into portions of the types for which no corresponding value is present, such as unused cases of a variant or the element type of empty options or vectors.

The consequence is that more values deserialise than actually should according to the spec. In particular, it is not always discovered when the annotated type is not actually a subtype of the target type. For example,

deserialise((v : T), U)

will succeed if v is an empty vector, but T and U are incompatible vector types, such as vec Text vs vec Nat. Similarly for options or variants.

We want to specify the precise (necessary and sufficient) precondition under which deserialisation succeeds. One candidate condition might be

exists T',  v : T'  /\  T' <: T  /\  T' <: U

i.e., the value has a "more principle" type T' that is a subtype of both annotated and target type. In the previous example, that might be vec empty. Effectively, this works around the requirement to always produce a principal type.

However, this is a somewhat brittle condition. For example, with some of the possible implementation choices discussed on #832 (branching on the truth of a subtype statement) it would break.

Unclear what to specify here, or whether we should actually enforce the subtype check (undesirable).

@nomeata
Copy link
Collaborator

nomeata commented Nov 14, 2019

I assume that its not ok to chicken out and say “if the argument is not a subtype of the expected type, the only guarantee is that the result is a member of the expected type”, i.e. partial correctness + safety afterwards, else you wouldn't raise this point, right?

@rossberg
Copy link
Contributor Author

rossberg commented Nov 14, 2019

You mean if they aren't subtypes it could be any type-correct value? So deserialise(([] : [Nat]), [Text]) could be an out of thin air value like ["You", "stink!"]?

@nomeata
Copy link
Collaborator

nomeata commented Nov 14, 2019

Yes… it is one way out. It’s not like the caller could not have put ["You", "stink!"] in there…

@rossberg
Copy link
Contributor Author

But this may get them banned without being guilty.

Why is this laxer condition better than the one above?

@nomeata
Copy link
Collaborator

nomeata commented Nov 14, 2019

But this may get them banned without being guilty.

What do you mean?

Why is this laxer condition better than the one above?

Easier to implement? But maybe it is what we do already. Probably it is. I didn’t notice you slightly changed it over what we discussed on the other PR. So maybe it is ok.

@rossberg
Copy link
Contributor Author

Did I change it? Not consciously.

I mean that it's like a man-in-the-middle attack, were the man is undefined behaviour.

@nomeata
Copy link
Collaborator

nomeata commented Nov 14, 2019

I wonder what is the threat model: We only ever get into that situation if someone sends a value of a wrong type. How much pity do we need to have for that case? Isn’t it just a case of garbage in - garbabe out? It would be the same under the “old“ scheme without the type description, so why does it matter now?

@rossberg
Copy link
Contributor Author

Well, the type mismatch could be accidental. There is no concrete threat model, but it still seems rather fishy. Violating the idea of typing that way.

@nomeata
Copy link
Collaborator

nomeata commented Nov 14, 2019

But you agree that we are trying to fix a problem that we also had before we had the type section, and that would have been unfixable back then, and it didn’t bother us then?

@rossberg
Copy link
Contributor Author

I am not entirely sure, to be honest. Before, we had a weaker format, it was basically untyped and did not promise much. Now it's typed and you may think that you can rely on that. But you can't. In a way, that is worse.

@nomeata
Copy link
Collaborator

nomeata commented Nov 15, 2019

If we find an elegant characterization of what deserialization does on badly typed input, I am not opposed, but given that it didn’t bother us before I would hesitsate to go out of our way (significant complication or overhead during deserialization) to fix it.

And extra work will have to be done. For example, the current code would happily skip over a a malformed text value (invalid utf8) when parsed at type any. The above criterion would require the decoder to do the utf8 validity check even then.

The decoder doesn't do that yet, but for any type with a statically known size (say, opt null), it might not look at the bytes when skipping the value. But not all one-byte sequences are valid encodings of a value of type opt null! For example 0x02 is not. But there is no v : t' with t' <: t that encodes as 0x02, so the above criterion would again force us to do extra work.

@rossberg
Copy link
Contributor Author

For example, the current code would happily skip over a a malformed text value (invalid utf8) when parsed at type any.

The decoder doesn't do that yet, but for any type with a statically known size (say, opt null), it might not look at the bytes when skipping the value.

What for? It doesn't sound like a relevant cost to check wf of a singleton values.

The case of an unused Text field is a bit more interesting, but even there I would argue that it is not worth skipping over it. It doesn't simplify the implementation notably (you have the decoder anyway), and the cost also probably never matters in practice. So I'd argue against premature optimisation of this sort.

@nomeata
Copy link
Collaborator

nomeata commented Nov 15, 2019

I don't think it’s premature optimization, because I am not complicating the code in order to make it faster, I am just resiting against extra complexitiy for an unclear use case. But point taken that these two examples are not compelling enough.

So the goal of the above criterion (as opposed to the simple “check it’s a subtype” is that we don't want to do needless work checking types of values that we can ignore (because there is no such value), but we do want to do the work of check the bytes of the values that are there but which are ignored. Seems a bit arbitrary, but I can live with that (and add the necessary checks in the skip_any function).

We of course would still accept ignored garbabe in future types, where we cannot know the structure. So it stays a best-effort approach.

So does that decoding algorithm satisfy the above criterion? … probably …

So I guess I am on board here.

(I just hope that this is all not just motivated by wanting to get the negative subtyping check in the propsal #832 acceptable :-))

@rossberg
Copy link
Contributor Author

The primary goal here is to have a sufficiently clean correctness condition for deserialisation, since I find UB troublesome.

But future type are another good point, hadn't considered those. So yeah, in it's best effort for types that are not understood by the deserialiser and the spec it operates against.

I'd very much like to get rid of the subtype check in #832. Can't say I have a good idea, though. :(

@chenyan-dfinity
Copy link
Contributor

chenyan-dfinity commented Nov 15, 2019

The decoder doesn't do that yet, but for any type with a statically known size (say, opt null), it might not look at the bytes when skipping the value. But not all one-byte sequences are valid encodings of a value of type opt null! For example 0x02 is not. But there is no v : t' with t' <: t that encodes as 0x02, so the above criterion would again force us to do extra work.

The Rust implementation does check types while skipping the value. The serde framework enforces me to do this. I think this is a good example of UB, where one implementation does the accept the invalid bytes, while the other doesn't. If we plan to implement the IDL in multiple languages, it's good to eliminate UB. But I suspect this condition won't rule out all UBs...

@nomeata
Copy link
Collaborator

nomeata commented Nov 16, 2019

The Rust implementation does check types while skipping the value.

Does it also check values? I.e. check that the a text value is valid UTF8?

@chenyan-dfinity
Copy link
Contributor

Yes, it will call the corresponding deserialize function as if it were not skipped and then throw away the result: https://github.com/dfinity-lab/sdk/blob/master/lib/serde_idl/src/de.rs#L278

@ghost ghost added this to the Post-Launch Priorities milestone Nov 19, 2019
@nomeata
Copy link
Collaborator

nomeata commented Nov 26, 2019

Even if we validate stuff like UTF8, what do we do for function references? There we can't do any kind of checks, unless we do a full, proper subtype check – and then we could just do it for the whole message and only accept “real subtypes” of the expected type.

@rossberg rossberg added the idl Candid or serialisation label Apr 23, 2020
@rossberg rossberg changed the title [idl] Semantics of deserialisation Semantics of deserialisation Apr 29, 2020
@rossberg rossberg added the P3 low priority, resolve when there is time label Apr 29, 2020
@nomeata
Copy link
Collaborator

nomeata commented Sep 1, 2020

The discussion spilled a bit into #1830 (comment), good questions there.

@nomeata
Copy link
Collaborator

nomeata commented Nov 23, 2020

I believe we can close this, the Candid spec is now pretty clear on how deserialization should work.

@nomeata nomeata closed this as completed Nov 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
idl Candid or serialisation P3 low priority, resolve when there is time
Projects
None yet
Development

No branches or pull requests

3 participants