Semantics of deserialisation #870

rossberg · 2019-11-14T14:42:52Z

The design doc currently defines deserialisation of IDL values by an elaboration function that is only defined when subtyping holds. However, that is not how we implement it, because deserialisation actually proceeds by dynamically traversing the types and the value together. In particular, it avoids descending into portions of the types for which no corresponding value is present, such as unused cases of a variant or the element type of empty options or vectors.

The consequence is that more values deserialise than actually should according to the spec. In particular, it is not always discovered when the annotated type is not actually a subtype of the target type. For example,

deserialise((v : T), U)

will succeed if v is an empty vector, but T and U are incompatible vector types, such as vec Text vs vec Nat. Similarly for options or variants.
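A minimal sketch of the effect described above, using a toy model rather than the real IDL decoder. The type encoding (`"nat"`, `"text"`, `("vec", T)`, `("opt", T)`), the use of Python lists and `None` as values, and the function name `deserialise` are all assumptions made for illustration.

```python
# Toy types: "nat", "text", ("vec", T), ("opt", T).
# Hypothetical model for illustration only; not the IDL wire format.

def deserialise(value, annotated, target):
    """Decode by walking the value and both types together.

    Type components are only compared where a value is actually present.
    """
    if isinstance(target, tuple) and target[0] == "vec":
        if not (isinstance(annotated, tuple) and annotated[0] == "vec"):
            raise TypeError("expected a vector")
        # Element types are inspected once per element: zero times for [].
        return [deserialise(x, annotated[1], target[1]) for x in value]
    if isinstance(target, tuple) and target[0] == "opt":
        if not (isinstance(annotated, tuple) and annotated[0] == "opt"):
            raise TypeError("expected an option")
        if value is None:
            # Empty option: the payload types are never looked at.
            return None
        return deserialise(value, annotated[1], target[1])
    # Primitive types: in this toy, subtyping is just equality.
    if annotated != target:
        raise TypeError(f"{annotated} is not a subtype of {target}")
    return value
```

In this sketch `deserialise([], ("vec", "text"), ("vec", "nat"))` returns `[]` even though the two vector types are incompatible, while the same call with `["a"]` raises `TypeError`.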

We want to specify the precise (necessary and sufficient) precondition under which deserialisation succeeds. One candidate condition might be

exists T',  v : T'  /\  T' <: T  /\  T' <: U

i.e., the value has a "more principal" type T' that is a subtype of both the annotated and the target type. In the previous example, that might be vec empty. Effectively, this works around the requirement to always produce a principal type.
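As a rough illustration of this candidate condition, here is a sketch over a toy type language with only `empty`, `nat`, `text` and `vec`. The helper names `subtype`, `principal` and `may_deserialise` are hypothetical, and the sketch assumes that `v : T'` holds exactly when the least type of `v` is a subtype of `T'`, so checking the least type suffices.

```python
# Toy types: "empty", "nat", "text", ("vec", T). Illustration only.

def subtype(a, b):
    """a <: b in the toy language: empty is bottom, vec is covariant."""
    if a == "empty" or a == b:
        return True
    if isinstance(a, tuple) and isinstance(b, tuple):  # both are vec types
        return subtype(a[1], b[1])
    return False

def principal(v):
    """Least type of a value, or None if the value is ill-typed."""
    if isinstance(v, int):
        return "nat"  # assumption: every int stands for a nat
    if isinstance(v, str):
        return "text"
    if isinstance(v, list):
        elem = "empty"  # [] gets type vec empty
        for x in v:
            t = principal(x)
            if t is None:
                return None
            if subtype(elem, t):
                elem = t          # widen to the larger element type
            elif not subtype(t, elem):
                return None       # e.g. a list mixing nat and text
        return ("vec", elem)
    return None

def may_deserialise(v, annotated, target):
    """exists T'. v : T' /\\ T' <: annotated /\\ T' <: target"""
    t = principal(v)
    return t is not None and subtype(t, annotated) and subtype(t, target)
```

Under these assumptions `may_deserialise([], ("vec", "text"), ("vec", "nat"))` holds, because `[]` has type `vec empty`, whereas `may_deserialise(["a"], ("vec", "text"), ("vec", "nat"))` does not.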

However, this is a somewhat brittle condition. For example, with some of the possible implementation choices discussed on #832 (branching on the truth of a subtype statement) it would break.

Unclear what to specify here, or whether we should actually enforce the subtype check (undesirable).


nomeata · 2019-11-14T16:34:32Z

I assume that it's not ok to chicken out and say “if the argument is not a subtype of the expected type, the only guarantee is that the result is a member of the expected type”, i.e. partial correctness + safety afterwards, else you wouldn't raise this point, right?

rossberg · 2019-11-14T16:41:43Z

You mean if they aren't subtypes it could be any type-correct value? So deserialise(([] : [Nat]), [Text]) could be an out of thin air value like ["You", "stink!"]?

nomeata · 2019-11-14T16:49:03Z

Yes… it is one way out. It’s not like the caller could not have put ["You", "stink!"] in there…

rossberg · 2019-11-14T16:56:49Z

But this may get them banned without being guilty.

Why is this laxer condition better than the one above?

nomeata · 2019-11-14T17:03:17Z

> But this may get them banned without being guilty.

What do you mean?

> Why is this laxer condition better than the one above?

Easier to implement? But maybe it is what we do already. Probably it is. I didn’t notice you slightly changed it over what we discussed on the other PR. So maybe it is ok.

rossberg · 2019-11-14T17:17:30Z

Did I change it? Not consciously.

I mean that it's like a man-in-the-middle attack, where the man is undefined behaviour.

nomeata · 2019-11-14T17:29:43Z

I wonder what the threat model is: we only ever get into that situation if someone sends a value of a wrong type. How much pity do we need to have for that case? Isn’t it just a case of garbage in, garbage out? It would be the same under the “old” scheme without the type description, so why does it matter now?

rossberg · 2019-11-14T17:46:09Z

Well, the type mismatch could be accidental. There is no concrete threat model, but it still seems rather fishy. Violating the idea of typing that way.

nomeata · 2019-11-14T18:30:14Z

But you agree that we are trying to fix a problem that we also had before we had the type section, and that would have been unfixable back then, and it didn’t bother us then?

rossberg · 2019-11-15T07:17:32Z

I am not entirely sure, to be honest. Before, we had a weaker format, it was basically untyped and did not promise much. Now it's typed and you may think that you can rely on that. But you can't. In a way, that is worse.

nomeata · 2019-11-15T08:19:40Z

If we find an elegant characterization of what deserialization does on badly typed input, I am not opposed, but given that it didn’t bother us before I would hesitate to go out of our way (significant complication or overhead during deserialization) to fix it.

And extra work will have to be done. For example, the current code would happily skip over a malformed text value (invalid utf8) when parsed at type any. The above criterion would require the decoder to do the utf8 validity check even then.

The decoder doesn't do that yet, but for any type with a statically known size (say, opt null), it might not look at the bytes when skipping the value. But not all one-byte sequences are valid encodings of a value of type opt null! For example 0x02 is not. But there is no v : t' with t' <: t that encodes as 0x02, so the above criterion would again force us to do extra work.
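To make the two examples concrete, here is a sketch of a skipping routine with and without the extra checks. It assumes that `opt null` occupies a single tag byte (`0x00` or `0x01`, with `null` itself taking no bytes) and that a text value is a LEB128 length followed by UTF-8 bytes. The function names are hypothetical and do not correspond to the actual decoder.

```python
def read_leb128(buf, pos):
    """Read an unsigned LEB128 number; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("unexpected end of input")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7

def skip_opt_null(buf, pos, validate=True):
    """Skip a value of type `opt null`, which is always one byte long."""
    if pos >= len(buf):
        raise ValueError("unexpected end of input")
    if validate and buf[pos] not in (0x00, 0x01):
        raise ValueError(f"invalid opt tag 0x{buf[pos]:02x}")
    return pos + 1

def skip_text(buf, pos, validate=True):
    """Skip a text value: a LEB128 byte length followed by the bytes."""
    length, pos = read_leb128(buf, pos)
    end = pos + length
    if end > len(buf):
        raise ValueError("unexpected end of input")
    if validate:
        try:
            buf[pos:end].decode("utf-8")  # strict decoding, result discarded
        except UnicodeDecodeError as err:
            raise ValueError("invalid UTF-8 in skipped text") from err
    return end
```

With `validate=False` both functions only advance the position, so `skip_opt_null(b"\x02", 0, validate=False)` and `skip_text(b"\x02\xff\xfe", 0, validate=False)` succeed on malformed input. With `validate=True` both calls raise `ValueError`, which is the extra work the criterion would require.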

rossberg · 2019-11-15T09:18:30Z

> For example, the current code would happily skip over a malformed text value (invalid utf8) when parsed at type any.

> The decoder doesn't do that yet, but for any type with a statically known size (say, opt null), it might not look at the bytes when skipping the value.

What for? It doesn't sound like a relevant cost to check the well-formedness of a singleton value.

The case of an unused Text field is a bit more interesting, but even there I would argue that it is not worth skipping over it. It doesn't simplify the implementation notably (you have the decoder anyway), and the cost also probably never matters in practice. So I'd argue against premature optimisation of this sort.

nomeata · 2019-11-15T09:31:00Z

I don't think it’s premature optimization, because I am not complicating the code in order to make it faster, I am just resisting extra complexity for an unclear use case. But point taken that these two examples are not compelling enough.

So the goal of the above criterion (as opposed to the simple “check it’s a subtype”) is that we don't want to do needless work checking types of values that we can ignore (because there is no such value), but we do want to do the work of checking the bytes of the values that are there but which are ignored. Seems a bit arbitrary, but I can live with that (and add the necessary checks in the skip_any function).

We of course would still accept ignored garbage in future types, where we cannot know the structure. So it stays a best-effort approach.

So does that decoding algorithm satisfy the above criterion? … probably …

So I guess I am on board here.

(I just hope that this is all not just motivated by wanting to make the negative subtyping check in proposal #832 acceptable :-))

rossberg · 2019-11-15T09:43:26Z

The primary goal here is to have a sufficiently clean correctness condition for deserialisation, since I find UB troublesome.

But future types are another good point, hadn't considered those. So yeah, it's best effort for types that are not understood by the deserialiser and the spec it operates against.

I'd very much like to get rid of the subtype check in #832. Can't say I have a good idea, though. :(

chenyan-dfinity · 2019-11-15T18:27:12Z

> The decoder doesn't do that yet, but for any type with a statically known size (say, opt null), it might not look at the bytes when skipping the value. But not all one-byte sequences are valid encodings of a value of type opt null! For example 0x02 is not. But there is no v : t' with t' <: t that encodes as 0x02, so the above criterion would again force us to do extra work.

The Rust implementation does check types while skipping the value. The serde framework forces me to do this. I think this is a good example of UB, where one implementation accepts the invalid bytes while the other doesn't. If we plan to implement the IDL in multiple languages, it's good to eliminate UB. But I suspect this condition won't rule out all UBs...

nomeata · 2019-11-16T12:04:21Z

> The Rust implementation does check types while skipping the value.

Does it also check values? I.e. check that a text value is valid UTF8?

chenyan-dfinity · 2019-11-17T04:31:36Z

Yes, it will call the corresponding deserialize function as if it were not skipped and then throw away the result: https://github.com/dfinity-lab/sdk/blob/master/lib/serde_idl/src/de.rs#L278

nomeata · 2019-11-26T16:02:04Z

Even if we validate stuff like UTF8, what do we do for function references? There we can't do any kind of checks, unless we do a full, proper subtype check – and then we could just do it for the whole message and only accept “real subtypes” of the expected type.

nomeata · 2020-09-01T08:07:30Z

The discussion spilled a bit into #1830 (comment), good questions there.

nomeata · 2020-11-23T11:32:44Z

I believe we can close this, the Candid spec is now pretty clear on how deserialization should work.

ghost added this to the Post-Launch Priorities milestone Nov 19, 2019

nomeata mentioned this issue Nov 25, 2019

Fix IDL subtyping #832

Closed

rossberg added the idl Candid or serialisation label Apr 23, 2020

rossberg changed the title ~~[idl] Semantics of deserialisation~~ Semantics of deserialisation Apr 29, 2020

rossberg added the P3 low priority, resolve when there is time label Apr 29, 2020

nomeata mentioned this issue Aug 14, 2020

Run Candid spec tests against Motoko #1830

Merged

nomeata mentioned this issue Sep 1, 2020

more candid test data dfinity/candid#83

Merged

nomeata mentioned this issue Sep 21, 2020

[IDL] Optimistic subtyping #1959

Closed

nomeata closed this as completed Nov 23, 2020
