
DagCBOR: tighten up spec, additional strictness requirements #236

Merged (1 commit) on Mar 31, 2020

Conversation

@rvagg (Member) commented Jan 23, 2020

Ref: #227

It's already pretty close to what we've been talking about. This is mostly just clarification, and some additional fluffing with a mind to publishing it on specs.ipld.io too.


1. Use no tags other than the CID tag (`42`). A valid DagCBOR encoder must not encode using any additional tags and a valid DagCBOR decoder must reject objects containing additional tags as invalid.
2. Use the canonical CBOR encoding defined by the suggestions in [section 3.9 of the CBOR specification]. A valid DagCBOR decoder should reject objects not following these restrictions as invalid. Specifically:
* Integers must be as small as possible.
Contributor:

@rvagg this is somewhat ambiguous. Do you mean "Integer encoding must be as short as possible", just like in the next item?

@jonnycrunch:

Tag 98 for COSE signing may be helpful. Before merging, can we discuss? #227 (comment)


Also, see the updated draft of RFC 7049: https://tools.ietf.org/html/draft-ietf-cbor-7049bis-12#section-4.2.2 regarding "Additional Deterministic Encoding Considerations" for CBOR.

@rvagg (Member, Author):

@jonnycrunch re COSE: that seems like a larger concern and perhaps even touches on the CID spec. Would you mind deferring it to a separate issue where we can all wrap our heads around it? Either open a fresh issue just for it, making the case and giving us the resources we need to grok it fully, or even open a PR against this doc if you have a firm proposal for how it would work and how encoders & decoders would need to interact with it.

@rvagg (Member, Author):

re "Additional Deterministic Encoding Considerations", I did kind of ignore that but probably shouldn't have, the main area it touches on is floats, I don't know what our current CBOR encoders are doing but we should probably opt for "Encode all values as the smallest of 16-, 32-, or 64-bit floating point that accurately represents the value, even for integral values", that's consistent with the int rule. I just don't know whether that's going to work well across languages, we might end up with inconsistencies here and need additional floating point rules. The alternative is to put all floats into 64-bit.

Another thing to consider is how to handle very large integers, or whether they can be handled. We can probably hand-wave that for now like we do in the data model but it'll likely come up again in the future.

@warpfork (Contributor) commented Jan 24, 2020:

IIRC the issue with "smallest encoding that accurately represents the value" for floats is that it's kind of an ambiguous -- or, if not ambiguous, at least highly detailed, shall we say -- minefield to say what "accurately represents" means and what will round-trip reliably with floats. Ints? Clear. Floats? Ngh.

The Go implementations currently make all floats 64-bit, fwiw. I wouldn't be opposed to trying to improve this, but I would be terrified, personally, if that makes sense.

@warpfork (Contributor):

Ah, I just saw the link to borc in your #236 (comment).

I'd be fine taking patches to change the Go libraries to do the same, if we're sure that logic is correct and we want to stand with it. (I just don't feel qualified to actually review that bitshuffling logic, myself.)

@jonnycrunch:

Perhaps it could just be worded: the default encoding for MIME type application/ipld+dagcbor ignores all tags other than tag 42. Unfortunately, this gets into the MIME type discussion. But I could work with that if I can generate an application/ipld+cbor which is the full CBOR spec with all tags. To be clear: application/ipld+dagcbor only supports tag 42, whereas application/ipld+cbor is all tags.


see: #102

@rvagg (Member, Author):

A major part of the exercise here is, as stated in the changes:

> DagCBOR requires that there exist a single way of encoding any given object, and that encoded forms contain no superfluous data that may be ignored or lost in a round-trip decode/encode.

If we ignore rather than disallow, then we leave significant space for arbitrary data to be included in DagCBOR blocks, and there's no "just one way to encode" guarantee.

This is only for IPLD DagCBOR (multicodec 0x71); what an IPLD CBOR (multicodec 0x51) does is another matter. That can be the flexible space, while DagCBOR is what you use if you want these strictness guarantees. Maybe that's development we should hasten; it seems like low-hanging fruit to make space for much more interesting data importing and linking.

@ribasushi (Contributor) left a comment:

Need to clarify...

@ribasushi (Contributor) left a comment:

Changing my review to "Approved", but we should review the COSE stuff now or later...

@rvagg (Member, Author) commented Jan 24, 2020

Pushed an update to clarify the "small integer encoding" rule and added one for floats, but I'm much more uncertain about that one. @warpfork I wouldn't mind your thoughts on #236 (comment): whether we go for "shortest possible maintaining accuracy" or "everything in a single size (64-bit)".

@rvagg (Member, Author) commented Jan 24, 2020

Quick survey: it looks like we have a mismatch already between the two languages:

@vmx (Member) commented Jan 24, 2020

> whether we go for "shortest possible maintaining accuracy" or "everything in a single size (64-bit)".

I would go for "shortest possible maintaining accuracy". As the IPLD data model only specifies a Float kind, it makes sense to use the smallest representation. Else it would be more like a de-facto Float64 (for the DagCBOR encoding, which is currently the one we care about most). A fixed size would also be easier to express with some annotation, so I'd make that a possible Schema opt-in (though I'm not really sure we want that).

* The expression of lengths in major types 2 through 5 must be as short as possible.
* Floating point values must be encoded as the smallest of 16-, 32-, or 64-bit floating point that accurately represents the value, even for integral values.
* The keys in every map must be sorted lowest value to highest. Sorting is performed on the bytes of the representation of the keys.
- If two keys have different lengths, the shorter one sorts earlier;
- If two keys have the same length, the one with the lower value in (byte-wise) lexical order sorts earlier.
@warpfork (Contributor):

I'd just like to briefly note, somewhere, as an aside, that I'm not at all fond of this sorting rule. It's the RFC, and it's what dag-cbor certainly does do in prod already, so yeah, we should spec this and this PR is good.

But I'm not fond of it.

https://play.golang.org/p/f0kkdXKppKF

The length rule makes for very unnatural-looking outcomes.

@rvagg (Member, Author):

Yeah, it's an odd one; I'd rather just rely on straight byte ordering regardless of length.

@warpfork (Contributor) commented Jan 27, 2020:

Agree with your comment in #236 (comment) -- we should still specify this. To no small extent, we're clarifying descriptions of stuff already shipped and in use that has generated data we expect to persist, and it definitely does this.

Just noting my laments. And if we make another codec specification in the future, this sorting is something I would very much flag as worth reconsidering.

A useless comment, for this PR, perhaps, since our hands are tied. But maybe also good contextual headspace to have for the bigger discussions about sorting orders, because it's another piece of kindling for the idea that not all codecs will agree on sorting orders.

@rvagg (Member, Author) commented Jan 27, 2020

So, the float thing: to figure out what size to use, an encoder first needs to encode a float at 16 bits, then decode and compare to the original; if it's different, it moves to 32-bit and does the same, and finally lands on 64-bit if that's still not the same.

In theory, I think we're dealing with "proper" IEEE-754 in both JavaScript and Go, and likely other platforms interacting with IPLD, so the calculations should work out the same in all of them. My faith in JavaScript's numbers is limited, though; there's always something surprising to find when you dig deep enough.

We need to lock something in here (I think... correct me if you see a reason where we could be lax on floats but not on everything else?), and when we do we have to change either refmt or borc to conform. Along with that we really should develop a solid test suite for floats, with edge cases testing the boundaries to make sure we're right in our assumptions.

I don't feel strongly either way tbh; it's a trade-off between wasted space and extra encoding time for every float.

@rvagg (Member, Author) commented Jan 27, 2020

Re map key sorting:

borc does length-then-byte-order sorting of map keys: https://github.com/dignifiedquire/borc/blob/b6bae8b0bcde7c3976b0f0f0957208095c392a36/src/encoder.js#L351-L355 & https://github.com/dignifiedquire/borc/blob/b6bae8b0bcde7c3976b0f0f0957208095c392a36/src/utils.js#L139-L152

refmt does the same thing, I believe: https://github.com/polydawn/refmt/blob/474b602855ed928fc5bb3448bd0dae4a6fe78d96/obj/marshalMapWildcard.go#L142-L148

So they're both following the suggestions in the CBOR spec, so we may as well specify that here even though it is ugly.

@warpfork (Contributor) commented Jan 27, 2020

On the floats thing: call me a paranoid old man, but I... don't trust anyone's float implementations. I would go so far as to say I'm not even convinced IEEE-754 was a good idea.

Floating point calculations have been known to vary on CPUs because of bugs in silicon that went undetected until too long after shipping (or possibly, and what may even be worse if true: were detected but filed as unremarkable and "WONTFIX" at manufacture time) -- and (wow, I did more research) because of arguably-defensible logic about some chips providing more precise numbers of extra bit size in intermediate calculations ("Extended Precision" -- apparently this was even recommended but not specified by IEEE-754; cool cool cool).

Fun anecdata: such divergent behavior in silicon ended up so widely deployed that for some years malware began to use it to detect whether or not it was on a virtualized host: real hosts would do floating point math "wrong"; virtual machines would do it right. This worked on such a wide deployment of real hosts for so long that it was actually a viable strategy for malware to opt out of working on hosts that did fp math right, as a way of captured-study avoidance.

In Java, we can also see clear evidence of this bugaboo in the existence of the strictfp keyword built into the language itself, because Java's earliest designers and builders found the floating point reliability of CPUs to be a threat to the "write once, run anywhere" mantra. (They eventually relented and made the default loosey-goosey, for speed reasons, but then added the strictfp keyword as an opt-in.)

I bring up these anecdotes and examples because I find it's entirely too possible to go "ohhh, pfffff -- sure floats are sketchy in theory, but no one ever notices that in practice..." Yes. Yes we do notice it in practice. It's very real, and it's absolutely not a thing relegated to distant forgettable history. If only.

Now, I think, to the best of my understanding and reading of this, that comparison is the one operation that should be unequivocally safe from all this. But honestly? I'm not sure.

Floats terrify me.

@warpfork (Contributor):

> we really should develop a solid test suite for floats, with edge cases testing the boundaries to make sure we're right in our assumptions.

> it's a trade-off between wasted space and extra encoding time for every float.

Yeah, we'd definitely need fixtures o'plenty for this.

It's also going to be extra decoding time for every float, I'm afraid. Per the same logic about strictness elsewhere, we should probably be rejecting floats encoded with too much precision.

Although I'll readily admit the idea of "rejecting floats encoded with too much precision" sounds like it's going to be an absolute usability nightmare in practice. If there does turn out to be the slightest disagreement about precision and rounding in code in the wild, the bugs will be beyond bizarre: only some libraries on only some physical machines on only some data samples may suddenly have show-stopping issues. Blegh; the lengths I'd go to in order to avoid this are substantial.

This is also something that's probably going to make reuse of existing serialization libraries in various languages harder. I'd bet a shiny nickel we'll more or less have to fork the CBOR decoder in many cases in order to express strictness on this -- look at the line numbers we're looking at already in borc and refmt; they're not at the shallow end of the libraries! Maybe this is just something we swallow and accept, but uffdah.

I... I don't know what to do with this.

There's so much "between a rock and a hard place" with floats that I actually want to tentatively float (heh) the idea that floats don't even belong in the Data Model, because they wreck so many things and are so dangedly impossible to make practical guarantees about. Or if we keep them there, we put incredibly loud warnings around them: convergence of hashes of objects containing floats is generally weakly guaranteed if at all, etc; essentially, leaving floats in so that they're there for rapid prototyping, but beyond that, the answer to any float-related problems becomes "don't use them" and we try to get people to move towards more intentional stances around numbers instead.

@rvagg (Member, Author) commented Jan 27, 2020

I did already raise the possibility of an advanced layout to represent floats using alternate mechanisms; that was only 1/2 a joke.

I share your concern with floats, to the degree that unless we're going to just say "all in 64-bits", I'm not entirely comfortable codifying "smallest possible" here until we get some solid tests in place to confirm that we can achieve what we think we can.

Next steps: I'll try and tighten this all up but may push the floats thing to "unresolved pending coding & testing work".

The CBOR recommendation for "smallest possible" on floats does make me scratch my head a bit given the length-first key ordering recommendation. That seems like a performance optimisation, because length comparison is so cheap. But it's the opposite with floats: this nasty encode+decode+compare+repeat work is expensive. Maybe it's a reflection of how common they think each data type will be in CBOR?

Also not touched on in the float discussion is JavaScript's nasty whole-number problem for round-trips: a 1234.0 will become a 1234. We can probably deal with that when schemas get involved, but without metadata carried along with a decoded object we can't deal with it.

@ribasushi (Contributor):

> A 1234.0 will become a 1234

@rvagg @warpfork Have we considered not implementing floats at all on the wire, and instead having a struct of 2 integers? Similar to: https://github.com/ipfs/specs/blame/3816b05747e/UNIXFS.md#L99-L113

Yes it is wordy and convoluted. But there is no way to screw it up on the wire.

@rvagg (Member, Author) commented Jan 28, 2020

@ribasushi yep, considered, and as @warpfork said, maybe even remove them from the Data Model! A bit extreme, but maybe we'll keep on banging our heads against this one enough that when we (inevitably) build our own coding format to evolve past CBOR, we ditch native floats entirely.

And also, an advanced layout could do this transparently (struct of two ints, a string, some other binary encoding) so maybe we just work on that and encourage its use for much win.

@ribasushi (Contributor):

@rvagg so then... can't the initial DM spec just say "floats currently unsupported"?

@rvagg (Member, Author) commented Jan 28, 2020

@ribasushi I have a feeling that the majority of programmers will look at that and think we're not taking this seriously. Maybe people who have experienced significant float pain and have worked around it in novel ways will see through it, but mostly people are going to say "well, this is not usable for real-world data then!".

A large part of my background is in animal genetics, dealing with gobs of data that is primarily floating point, where increased accuracy exhibits marginal returns. Most of the test data I could come up with is float-heavy, and without a primitive or an existing tool I can pick off the shelf to encode it, I'm probably not going to bother!

@vmx do floats make much of an appearance in GIS data, or do they deal with that pain in other ways already?

My inclination right now is to:

  1. figure out the limits we're dealing with across languages with this CBOR question
  2. litter our specs (data model & codecs) with disclaimers about floats and what they can do to stability and determinism
  3. work on tools to give users a way out -- perhaps an ADL, perhaps just a spec for a codec-independent float representation format that has a little bit of extra user-work (perhaps just simple JS & Go libraries that take two ints and give you what you need).
  4. consider a future possibility of removing floats from the data model, or at least deprecating their direct use

@ribasushi (Contributor):

@rvagg what you are saying re

> majority of programmers will look at that and think we're not taking this seriously

is unfortunately somewhat true :/ More specifically: folks who think they need floats far outweigh the folks who know they need floats and can't help it any other way.

However:

> consider a future possibility of removing floats from the data model, or at least deprecating their direct use

^^ this is unlikely to happen due to the nature of protocols in general, and our context of "promised immutable" in particular. There needs to be a clear understanding and acceptance that anything that is put in any of the PL specs is effectively there forever. Better options can appear in the future and be very strongly encouraged, but removal is not really possible.

Making this tradeoff (we ship something suboptimal now, knowing it will stick) is perfectly fine, and I'd encourage this myself. What worries me is that a decision is made under a false premise of "we can remove X later".

@rvagg (Member, Author) commented Jan 28, 2020

I put together a JavaScript and Go DagCBOR side-by-side to look at edge cases and test differences: https://github.com/rvagg/ipld-dagcbor-floats

Needs more experimentation; I'd love to test whether there are cases around the edges that push the limits of the IEEE-754 implementations. So far the differences are only in the shortest-encoded vs everything-64-bit choice.

@Stebalien (Contributor) left a comment:

Nice! This is all very cleanly and clearly defined.

@rvagg (Member, Author) commented Jan 28, 2020

@vmx it would be good if you could consider how difficult these strictness rules are going to be within Rust / serde. It looks like you have quite a bit of control over types and can make some of these decisions during encoding, but the defaults have a way to go to get to strictness: https://github.com/vmx/rust-ipld/blob/master/dag-cbor/src/lib.rs

With the following strictness rules, how hard will it be with Serde to (a) encode as the right type and (b) reject encoded data that doesn't meet the strictness requirement:

  • ints as small as they can be. It looks like you have the option to choose to do this outside of serde_cbor but it makes you responsible for checking boundaries. It's going to mean that a round-trip of an i32 might come out as an i8 if it's a small number, but that's the case for reading arbitrary data anyway.
  • length ints as small as they can be. This might be a problem without upstreaming changes; serialize_str() seems to always use a 64-bit int for the size, and serialize_bytes() redirects to the same.
  • map key ordering - from what I can see you get to control that by feeding it a data structure that has the ordering you want?
  • floats as their smallest representation - I believe the comparisons in here might be doing this already: https://github.com/pyfisch/cbor/blob/d012830653340b13d349d59838304d9c59a368ca/src/ser.rs#L312-L343 - it's using an internal Rust comparison of type fittedness, and I'm not sure if that's going to yield the same result as doing a full encode+decode to check whether it fits, but it's closer to borc than refmt, I think.

If we have the opportunity to make it less painful for Rust while we're making these decisions then we should at least consider it before committing.

@vmx (Member) commented Jan 28, 2020

> @vmx it would be good if you could consider how difficult these strictness rules are going to be within Rust / serde.

I haven't had a look, but all those things should be possible. I'd also like to mention that I'd like to move away from Serde anyway, so this shouldn't really be a blocker on the decisions about floats.

@vmx (Member) commented Jan 28, 2020

> @ribasushi I have a feeling that the majority of programmers will look at that and think we're not taking this seriously. Maybe people who have experienced significant float pain and have worked around it in novel ways will see through it, but mostly people are going to say "well, this is not usable for real-world data then!".

I share this concern. Even knowing the issues with floats, I probably also wouldn't look into a system for encoding data that doesn't easily support floats. I would think of it as an "esoteric, theoretically nice system no one will use". I would argue that people who don't know much about floats implicitly think of IEEE-754 floats. They "just work" in every language, and if you hit issues, it's the same issues everywhere. I personally rarely ever had issues with floats in the real world. IEEE-754 floats are the standard. Numbers in JSON are also very broken, yet we don't encode them as strings in DagJSON (probably to keep things practical); we should also take a practical approach for the IPLD Data Model.

Another point is that I think IPLD should also be useful without schemas. Without floats, it's not useful. Though we can (and should) discuss a decimal type on the schema level, just as we do with non-string keys for maps.

> @vmx do floats make much of an appearance in GIS data, or do they deal with that pain in other ways already?

In the geospatial world everything is floats. The most widely used format is GeoJSON, which doesn't go into detail about floats; I haven't seen any libraries for GeoJSON that do anything special with them. A very old de-facto standard is Shapefile, where floating point numbers are stored as strings (I didn't know that before). I should also mention GeoPackage, which is basically just a SQLite file, hence using IEEE-754.

@rvagg (Member, Author) commented Jan 29, 2020

I've split out the floating point issue into a separate section under "Strictness" titled "Floating Point Encoding (Unresolved)"; we'll figure that one out a bit later and not block this with it. I've also added an "Implementations" section at the bottom with notes on where they're at. Let me know if you think this doesn't belong; I'm not attached to it.

Ready for final reviews (please).

@warpfork (Contributor):

> > majority of programmers will look at that and think we're not taking this seriously
>
> is unfortunately somewhat true :/ More specifically: folks who think they need floats far outweigh the folks who know they need floats and can't help it any other way.

Yup. This.

This is why they're in there. Both for some practical bridging reasons and for some *cough* practical cultural evolutionary truths.

It's just a question of how many and how loud of warnings we can put around them.


> consider a future possibility of removing floats from the data model, or at least deprecating their direct use

> ^^ this is unlikely to happen due to the nature of protocols in general, and our context of "promised immutable" in particular.

I think an "eternally deprecated" (but never removed) approach is basically the tone to aim for.

(Just as an aside, because this word has come up in a couple of issues lately, and I'd like to clear some air around it -- the definition of deprecate is "discourage... without completely removing it or prohibiting its use" (from Wikipedia) or "to express disapproval of" (from dictionary.com; see particularly the "usage note" section). The word "deprecate" does not denote removal, and it certainly doesn't imply a timeline; it may carry that connotation sometimes, but without More Words stating something about e.g. "slated for removal", it really shouldn't be assumed.)


> also, an advanced layout could do this transparently (struct of two ints, a string, some other binary encoding)

Oooooooh. I'd... actually never thought of that! That's an adventurous idea and I'd love to see how it shakes out.

I was gonna say this is one where I'd be happy to see us just put a big block of docs with a worked example (e.g. doing the struct and all), but then leave people to it. But making an advanced layout for it is also a very interesting idea I'd kinda like to see.


Wholly on board with pushing this floats stuff to a separate section and marking it pending research. 👍

@warpfork (Contributor) left a comment:

Super good. Ship it.
