WIP Add CAR spec #51

Merged · 1 commit · May 12, 2018
Conversation

@Stebalien (Contributor):

This is a WIP. I'm posting it now in case anyone has any early comments.
@Stebalien (author):

@kevina this is a draft of that CAR spec I promised.
@whyrusleeping this is, approximately, the CAR spec we discussed (although I've backed off on a few things we thought were "solved" so we can discuss them).

> Offsets are relative; offsets for missing children use a sentinel value.
Reviewer:

So how would you deal with a leaf node that is referenced from multiple branches, e.g. root -> a -> c, root -> b -> c?

I assume that you would store c only once and just calculate the relative offset for both a and b? In that case a and b would just have offset arrays, but not value arrays, and c would be stored at the root level? Would c then be stored before a and b?

@Stebalien (author):

> I assume that you would store c only once and just calculate the relative offset for both a and b? In that case a and b would just have offset arrays, but not value arrays, and c would be stored at the root level?

Yes. I'll try to make this diagram a bit less confusing (and add the example below).

> Would c then be stored before a and b?

That's where the topological sort comes in. We'd store this as:

```
      root.Cid()
root: root.Length()
      root.Bytes()
      offset(a)
      offset(b)
a:    a.Length()
      a.Bytes()
      offset(c)
b:    b.Length()
      b.Bytes()
      offset(c) // 0
c:    c.Length()
      c.Bytes()
```

Reviewer:

I am not sure I see the benefit of this. This could force certain children to be very far from their first parent and will likely not stream well. It would seem far simpler to just store children in a breadth-first manner and allow negative offsets to keep the children close to their (first) parent. A negative value can then be taken as a clear indicator of a duplicate, and with some luck the node may already be in the cache, so it will not be necessary to seek in order to get its contents.

Reviewer:

Well, breadth-first may not be the best method to keep children close to their parents; in fact, there probably is no best method to do so. Nevertheless, I still think it would be better to allow negative offsets so that children can be close to their first parent when possible.

@Stebalien (author):

> This could force certain children to be very far from their first parent and will likely not stream well.

We can't both stream (write) and support the offset index (unless we keep the index in memory and append it at the end, but that would hurt reading significantly).

> children close to their (first) parent

Regardless of what we do, we can't get this property (assuming a reasonable branching factor). For example, with a branching factor of 2, some children will already be ~1024 nodes away from their parents once we hit a depth of 10.

However, this would ensure that siblings are (usually) next to each other. On the other hand, we could probably get a similar property with a topological sort if we use the right algorithm (although we'd always end up grouping nodes with the deepest siblings).

> A negative value can then be taken as a clear indicator of a duplicate, and with some luck the node may already be in the cache, so it will not be necessary to seek in order to get its contents.

We'd have to have read in the entire CAR for this to be the case. It would be nice to be able to skip over portions of the CAR that we don't care about.


Basically, if we can't optimize for streaming while writing, we might as well optimize for streaming while reading. With a topologically sorted DAG, we can traverse a DAG in one pass through the file (sketched in code below) by:

  1. Looking at a node.
  2. Picking the child we want.
  3. Looking up the offset of that child.
  4. Seeking (forward) to the offset of that child.
  5. Recursing.

This is a useful property for media that works best with sequential access patterns.
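
To make that concrete, here is a rough Go sketch of the forward-only read path. The entry layout (varint length, block bytes, child-count varint, then one offset varint per child) and the choice of the entry's end as the base for relative offsets are assumptions for illustration, not anything the spec has settled on:

```go
package carsketch

import "encoding/binary"

type entry struct {
	block   []byte
	offsets []uint64 // forward distances from the end of this entry
	size    uint64   // total encoded size of this entry
}

// parseEntry decodes one entry at buf[pos] under the hypothetical
// layout: length varint, block bytes, child count, child offsets.
func parseEntry(buf []byte, pos uint64) entry {
	start := pos
	blen, n := binary.Uvarint(buf[pos:])
	pos += uint64(n)
	block := buf[pos : pos+blen]
	pos += blen
	nchild, n := binary.Uvarint(buf[pos:])
	pos += uint64(n)
	offsets := make([]uint64, nchild)
	for i := range offsets {
		offsets[i], n = binary.Uvarint(buf[pos:])
		pos += uint64(n)
	}
	return entry{block, offsets, pos - start}
}

// walk follows a single root-to-leaf path; pickChild chooses which
// child to descend into (or false to stop). Because the entries are
// topologically sorted, pos only ever moves forward, which is exactly
// the sequential access pattern described above.
func walk(buf []byte, pickChild func(entry) (int, bool)) entry {
	var pos uint64
	for {
		e := parseEntry(buf, pos)
		i, ok := pickChild(e)
		if !ok {
			return e
		}
		pos += e.size + e.offsets[i]
	}
}
```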

@Stebalien (author):

> Nevertheless, I still think it would be better to allow negative offsets so that children can be close to their first parent when possible.

Personally, I don't really see any benefit of putting them close to the first parent rather than the last. I don't see order as having anything to do with "importance". Is there any reason to do this?

> 1. We *don't* want to duplicate the data.
> 2. We need to support inline blocks with children.
>
> ## Topological sort
Reviewer:

I think the purpose of the topological sort, and exactly how it is going to sort, needs to be fleshed out a bit more.

@Stebalien (author):

Totally agree. Basically, it allows us to traverse a DAG in one pass. This + offsets makes traversing a DAG on, e.g., tape really fast.

Reviewer:

So the lower nodes are stored after the higher nodes? Then the offset calculation will be tricky. I don't see how that can work with the varints. With int64 you could just backfill once you know the offsets.

If you store the leaf nodes first, then the higher nodes, and then the root, you always know the offsets when you write a node. But then the access pattern is backwards.

@Stebalien (author):

> So the lower nodes are stored after the higher nodes?

Yes.

> With int64 you could just backfill once you know the offsets.

We (@whyrusleeping and I) planned on using uint64s at first. However:

  1. The best topological sort algorithm I could find (basically, just a DFS) actually does work backwards (see the sketch after this list).
  2. If we want to provide an index, we can't make the CAR in one pass anyway (although, if we don't do a topological sort, we could dump the data in one pass and then write the index afterwards).

My current plan is to do one pass to determine the structure of the CAR and a second pass to write it. Unfortunately, this will require a significant amount of scratch space.
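
For reference, a minimal Go sketch of that DFS: the standard algorithm emits each node after its children (post-order), i.e. it builds the sequence back-to-front, which is the "works backwards" behavior mentioned above. The `Node` interface here is a stand-in for illustration, not a real go-ipld type:

```go
package carsketch

// Node is a hypothetical stand-in for an IPLD node.
type Node interface {
	ID() string
	Children() []Node
}

// TopoSort orders nodes so every parent precedes its children,
// visiting shared children only once.
func TopoSort(root Node) []Node {
	var order []Node
	seen := map[string]bool{}
	var visit func(Node)
	visit = func(n Node) {
		if seen[n.ID()] {
			return
		}
		seen[n.ID()] = true
		for _, c := range n.Children() {
			visit(c)
		}
		order = append(order, n) // post-order: a node follows its children
	}
	visit(root)
	// The DFS produced the order backwards; reverse it so parents
	// come first in the final layout.
	for i, j := 0, len(order)-1; i < j; i, j = i+1, j-1 {
		order[i], order[j] = order[j], order[i]
	}
	return order
}
```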

Reviewer:

So it would be something like

  1. do the sorting so that leaves are last, create a sequence of nodes
  2. go backwards through the seq and determine serialised size for each node (to do this you need the offsets)
  3. go forwards through the seq and write to disk

Sounds good. Backfilling the offsets in the case of int64 offsets would also need to be done in a clever way if you want linear access patterns. The alternative would be to write the leaves first and calculate the offsets in one pass; that would be a single pass for writing, but a backwards access pattern for reading.
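
With fixed-width (int64) fields that circularity goes away, because every entry's size is known before any offset is. A hypothetical Go sketch of the resulting forward pass; the field layout (8-byte length, block bytes, one 8-byte offset slot per child) is an assumption for illustration:

```go
package carsketch

// node is a minimal stand-in for a block: its raw bytes plus the
// indices (into the topologically sorted slice) of its children.
type node struct {
	bytes    []byte
	children []int
}

const fieldSize = 8 // fixed-width (int64) length and offset fields

// layout computes each entry's absolute position, then the relative
// parent-to-child offsets to backfill, in linear passes.
func layout(nodes []node) (pos []int64, rel [][]int64) {
	pos = make([]int64, len(nodes))
	var cur int64
	for i, n := range nodes {
		pos[i] = cur
		// length field + block bytes + one offset slot per child
		cur += fieldSize + int64(len(n.bytes)) + fieldSize*int64(len(n.children))
	}
	rel = make([][]int64, len(nodes))
	for i, n := range nodes {
		rel[i] = make([]int64, len(n.children))
		for j, c := range n.children {
			rel[i][j] = pos[c] - pos[i] // positive (forward) if topo-sorted
		}
	}
	return pos, rel
}
```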

Reviewer:

Camlistore, with care taken for data archeology/recoverability, stores data and metadata (folder structures and references) the same way: as content-addressed blobs.

I understand CAR as IPFS's version of tar: a way to archive IPFS-held data in a (tape-)streaming-friendly format. Maybe it could become a .torrent competitor. But maybe that functionality can be achieved in a simpler way, by just pushing blobs to/from storage automatically, where the automation is built from a combination of Bloom filters and inverted Bloom filters over the content hashes. This is currently being implemented for Bitcoin block propagation, based on fresh insights, and surely it can be recycled for IPFS as well.

Imagine this upgrade: an HTTP/2 server can stream with interleaving so that several "files" and "directory structures" can be handled at once.

Which supports this use case: you click a web link to your "junior year reception party" video, and your HTTP/2 server suggests you might also want the gag/for-fun subtitles or voice-over that other people who streamed that video also used. Your client, say VLC, makes it a single keypress to say yes to any of the suggested extras.


> Currently, I'm leaning toward varints as this will make storing lots of small blocks significantly more efficient.

Reviewer:

I would agree with this (varints).

@Stebalien (author):

Note: there are significant downsides:

  1. Can't leave them blank and fill them in (or change them after the fact).
  2. Slower/harder to parse.

Also, any suggestions on sentinel values?

@kevina (Nov 8, 2017):

> Also, any suggestions on sentinel values?

Not yet; something like this is best determined once the details are worked out. If we are using COBL we could just use the NULL value.

@Stebalien (author):

What is COBL? I was planning on using base128 varints (the one protobuf uses) which has no "empty" values. We could use 0 and say that 1 means the next byte, I guess.
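
For context, Go's standard library implements exactly this base128 varint, so a quick illustration is easy. The "reserve 0 and shift real offsets by one" convention in the final comment is only the suggestion above, nothing decided:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Protobuf-style base128 varint: 7 value bits per byte, with the
	// MSB set on every byte except the last, so small values cost one byte.
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, 300)
	fmt.Printf("300 -> % x (%d bytes)\n", buf[:n], n) // ac 02 (2 bytes)

	v, _ := binary.Uvarint(buf[:n])
	fmt.Println(v) // 300

	// Every bit pattern is a valid value, so a sentinel has to be a
	// reserved value: e.g. reserve 0 for "missing child" and store
	// real offsets shifted by one.
}
```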

@kevina:

Sorry, I meant to say CBOR.

@Stebalien (author):

Ah, you mean structure this as a CBOR object? I hadn't considered that. My plan was to make a custom file format. Unfortunately, CBOR isn't very seekable. We could also go with some other existing format, but I would like something very simple and compact.

@Stebalien (author):

Actually, one argument for uint64 is that we'd be able to skip directly to the correct offset in the jump table without iterating through it. However, given that we already have to parse the IPLD object, that's probably a non-issue. Also, there are some fancy bit-twiddling algorithms that can make this very fast by counting bytes with zero MSBs (I just need to remember to implement it...).
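
This presumably refers to the trick of loading eight bytes at once and locating the first byte whose continuation bit (MSB) is clear. A guess at that algorithm in Go, not anything from the spec:

```go
package carsketch

import (
	"encoding/binary"
	"math/bits"
)

// varintLen returns the encoded length of the varint at the start of b
// (which must hold at least 8 bytes) without a byte-by-byte loop, by
// finding the first byte whose MSB is clear.
func varintLen(b []byte) int {
	w := binary.LittleEndian.Uint64(b)
	// Invert and mask: each byte with a clear continuation bit now has
	// its top bit set; the lowest such bit marks the final byte.
	m := ^w & 0x8080808080808080
	if m == 0 {
		return -1 // 9- or 10-byte varint: take a slow-path loop instead
	}
	return bits.TrailingZeros64(m)/8 + 1
}
```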


> We only bother including the root CID because all the other CIDs are embedded in the objects themselves. This saves space and *forces* parsers to actually traverse the DAG (hopefully validating it).
Reviewer:

I am not sure I see the benefit of "forcing a parser to traverse the DAG".

@Stebalien (author):

Good question. For one, it ensures that the CAR is actually one giant DAG rooted at that CID. However, that may not even be worth mentioning (space is, IMO, sufficient).
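
As an illustration of what that validation could look like, a hypothetical Go sketch that re-hashes every block while walking from the root. The `Block` type is invented for the example, and plain sha256 stands in for real CID/multihash handling:

```go
package carsketch

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// Block is a hypothetical decoded CAR entry: the digest the archive
// claims for it, its raw bytes, and its resolved children.
type Block struct {
	Digest   []byte // sha256 of Raw, standing in for a CID
	Raw      []byte
	Children []*Block
}

// Verify walks the DAG from the root, re-hashing every block. With
// only the root CID in the header, this traversal is the natural
// place to validate the whole archive.
func Verify(root *Block) error {
	seen := map[*Block]bool{}
	var walk func(*Block) error
	walk = func(b *Block) error {
		if seen[b] {
			return nil
		}
		seen[b] = true
		if sum := sha256.Sum256(b.Raw); !bytes.Equal(sum[:], b.Digest) {
			return fmt.Errorf("block does not match its digest")
		}
		for _, c := range b.Children {
			if err := walk(c); err != nil {
				return err
			}
		}
		return nil
	}
	return walk(root)
}
```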

@kevina commented Nov 12, 2017:

I have been thinking about this for a while and I think we should keep the initial spec as flexible as possible for future expansion. In particular:

  1. We should use CBOR for the encoding because it encodes the types as part of the format, which allows for additional flexibility. This also solves the problem of how to handle missing children, as we can just use a CBOR "null" or "undefined" value.
  2. There is no best way to order the nodes, so the ordering should not be dictated by the standard, except that parents should come before their children (except possibly in the case of duplicates). For reproducibility we can define standard orders and encode the order as part of the header for the archive.
  3. If CBOR is used we do not have to decide on the varint issue; we can leave the size of the ints up to the implementation. If, for a particular topological sort, it is better to use fixed-size ints so they can be filled in after the fact, implementations are free to do so. Other implementations that don't require this can use mixed CBOR int types to minimize space.
  4. Unless there is a significant downside, I think we should allow negative offsets, to provide maximum flexibility in the ordering of nodes within the archive in the case of duplicates.
  5. The standard should encourage avoiding duplicates but should not require it. For example, there may be an advantage to keeping all the nodes for a file together to minimize seeking when extracting that file; if some of the file's blocks are duplicates of blocks stored elsewhere, a large amount of seeking may otherwise be required.

@daviddias added the status/deferred label (conscious decision to pause or backlog) on Mar 19, 2018.
@daviddias (Member):

I'm merging this one so that we can give this folder a restructure and clean-up.

@daviddias merged commit eea173a into ipld:master on May 12, 2018.
@Stebalien (author):

FYI, we'll probably need to rework that spec from scratch if we actually want to implement this. After writing it, I realized that it ignored complex questions like, e.g., compression.

@whyrusleeping mentioned this pull request on Aug 11, 2018.
prataprc pushed a commit to iprs-dev/ipld-specs referencing this pull request on Oct 13, 2020.