WIP Add CAR spec #51

Merged · 1 commit · May 12, 2018
Conversation

@Stebalien (Contributor):

This is a WIP. I'm posting it now in case anyone has any early comments.
@Stebalien (author):

@kevina this is a draft of that CAR spec I promised.
@whyrusleeping this is, approximately, the CAR spec we discussed (although I've backed off on a few things we thought were "solved" so we can discuss them).

> Offsets are relative; offsets for missing children use a sentinel value.
Reviewer:

So how would you deal with a leaf node that is referenced from multiple branches, e.g. root -> a -> c, root -> b -> c?

I assume that you would store c only once and just calculate the relative offset for both a and b? In that case a and b would just have offset arrays, but not value arrays, and c would be stored at the root level? Would c then be stored before a and b?

@Stebalien (author):

> I assume that you would store c only once and just calculate the relative offset for both a and b? In that case a and b would just have offset arrays, but not value arrays, and c would be stored at the root level?

Yes. I'll try to make this diagram a bit less confusing (and add the example below).

> Would c then be stored before a and b?

That's where the topological sort comes in. We'd store this as:

```
      root.Cid()
root: root.Length()
      root.Bytes()
      offset(a)
      offset(b)
a:    a.Length()
      a.Bytes()
      offset(c)
b:    b.Length()
      b.Bytes()
      offset(c) // 0
c:    c.Length()
      c.Bytes()
```

Reviewer:

I am not sure I see the benefit of this. This could force certain children to be very far from their first parent and will likely not stream well. It would seem far simpler to just store children in a breadth-first manner and allow negative offsets to keep the children close to their (first) parent. A negative value can then be taken as a clear indicator of a duplicate, and with some luck the node may already be in the cache, so it will not be necessary to seek in order to get its contents.

Reviewer:

Well, breadth-first may not be the best method to keep children close to their parents; in fact, there probably is no best method to do so. Nevertheless, I still think it would be better to allow negative offsets so that children can be close to their first parent when possible.

@Stebalien (author):

> This could force certain children to be very far from their first parent and will likely not stream well.

We can't both stream (write) and support the offset index (unless we keep the index in memory and append it at the end, but that would hurt reading significantly).

> children close to their (first) parent

Regardless of what we do, we can't get this property (assuming a reasonable branching factor). For example, with a branching factor of 2, some children will already be ~1024 nodes away from their parents once we hit a depth of 10.

However, this would ensure that siblings are (usually) next to each other. On the other hand, we could probably get a similar property with a topological sort if we use the right algorithm (although we'd always end up grouping nodes with the deepest siblings).

> A negative value can then be taken as a clear indicator of a duplicate, and with some luck the node may already be in the cache, so it will not be necessary to seek in order to get its contents.

We'd have to have read in the entire CAR for this to be the case. It would be nice to be able to skip over portions of the CAR that we don't care about.


Basically, if we can't optimize for streaming while writing, we might as well optimize for streaming while reading. With a topologically sorted DAG, we can traverse a DAG in one pass through the file (sketched in code below) by:

  1. Looking at a node.
  2. Picking the child we want.
  3. Looking up the offset of that child.
  4. Seeking (forward) to the offset of that child.
  5. Recursing.

This is a useful property for media that works best with sequential access patterns.
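
To make that concrete, here is a rough Go sketch of the forward-only read path. The entry layout (varint length, block bytes, child-count varint, then one offset varint per child) and the choice of the entry's end as the base for relative offsets are assumptions for illustration, not anything the spec has settled on:

```go
package carsketch

import "encoding/binary"

type entry struct {
	block   []byte
	offsets []uint64 // forward distances from the end of this entry
	size    uint64   // total encoded size of this entry
}

// parseEntry decodes one entry at buf[pos] under the hypothetical
// layout: length varint, block bytes, child count, child offsets.
func parseEntry(buf []byte, pos uint64) entry {
	start := pos
	blen, n := binary.Uvarint(buf[pos:])
	pos += uint64(n)
	block := buf[pos : pos+blen]
	pos += blen
	nchild, n := binary.Uvarint(buf[pos:])
	pos += uint64(n)
	offsets := make([]uint64, nchild)
	for i := range offsets {
		offsets[i], n = binary.Uvarint(buf[pos:])
		pos += uint64(n)
	}
	return entry{block, offsets, pos - start}
}

// walk follows a single root-to-leaf path; pickChild chooses which
// child to descend into (or false to stop). Because the entries are
// topologically sorted, pos only ever moves forward, which is exactly
// the sequential access pattern described above.
func walk(buf []byte, pickChild func(entry) (int, bool)) entry {
	var pos uint64
	for {
		e := parseEntry(buf, pos)
		i, ok := pickChild(e)
		if !ok {
			return e
		}
		pos += e.size + e.offsets[i]
	}
}
```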

@Stebalien (author):

> Nevertheless, I still think it would be better to allow negative offsets so that children can be close to their first parent when possible.

Personally, I don't really see any benefit of putting them close to the first parent rather than the last. I don't see order as having anything to do with "importance". Is there any reason to do this?

> 1. We *don't* want to duplicate the data.
> 2. We need to support inline blocks with children.
>
> ## Topological sort
Reviewer:

I think the purpose of the topological sort, and exactly how it is going to sort, needs to be fleshed out a bit more.

@Stebalien (author):

Totally agree. Basically, it allows us to traverse a DAG in one pass. This + offsets makes traversing a DAG on, e.g., tape really fast.

Reviewer:

So the lower nodes are stored after the higher nodes? Then the offset calculation will be tricky. I don't see how that can work with the varints. With int64 you could just backfill once you know the offsets.

If you store the leaf nodes first, then the higher nodes, and then the root, you always know the offsets when you write a node. But then the access pattern is backwards.

@Stebalien (author):

> So the lower nodes are stored after the higher nodes?

Yes.

> With int64 you could just backfill once you know the offsets.

We (@whyrusleeping and I) planned on using uint64s at first. However:

  1. The best topological sort algorithm I could find (basically, just a DFS) actually does work backwards (see the sketch after this list).
  2. If we want to provide an index, we can't make the CAR in one pass anyway (although, if we don't do a topological sort, we could dump the data in one pass and then write the index afterwards).

My current plan is to do one pass to determine the structure of the CAR and a second pass to write it. Unfortunately, this will require a significant amount of scratch space.
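
For reference, a minimal Go sketch of that DFS: the standard algorithm emits each node after its children (post-order), i.e. it builds the sequence back-to-front, which is the "works backwards" behavior mentioned above. The `Node` interface here is a stand-in for illustration, not a real go-ipld type:

```go
package carsketch

// Node is a hypothetical stand-in for an IPLD node.
type Node interface {
	ID() string
	Children() []Node
}

// TopoSort orders nodes so every parent precedes its children,
// visiting shared children only once.
func TopoSort(root Node) []Node {
	var order []Node
	seen := map[string]bool{}
	var visit func(Node)
	visit = func(n Node) {
		if seen[n.ID()] {
			return
		}
		seen[n.ID()] = true
		for _, c := range n.Children() {
			visit(c)
		}
		order = append(order, n) // post-order: a node follows its children
	}
	visit(root)
	// The DFS produced the order backwards; reverse it so parents
	// come first in the final layout.
	for i, j := 0, len(order)-1; i < j; i, j = i+1, j-1 {
		order[i], order[j] = order[j], order[i]
	}
	return order
}
```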

Reviewer:

So it would be something like

  1. do the sorting so that leaves are last, create a sequence of nodes
  2. go backwards through the seq and determine serialised size for each node (to do this you need the offsets)
  3. go forwards through the seq and write to disk

Sounds good. Backfilling the offsets in the case of int64 offsets would also need to be done in a clever way if you want linear access patterns. The alternative would be to write the leaves first and calculate the offsets in one pass; that would be a single pass for writing, but a backwards access pattern for reading.
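
With fixed-width (int64) fields that circularity goes away, because every entry's size is known before any offset is. A hypothetical Go sketch of the resulting forward pass; the field layout (8-byte length, block bytes, one 8-byte offset slot per child) is an assumption for illustration:

```go
package carsketch

// node is a minimal stand-in for a block: its raw bytes plus the
// indices (into the topologically sorted slice) of its children.
type node struct {
	bytes    []byte
	children []int
}

const fieldSize = 8 // fixed-width (int64) length and offset fields

// layout computes each entry's absolute position, then the relative
// parent-to-child offsets to backfill, in linear passes.
func layout(nodes []node) (pos []int64, rel [][]int64) {
	pos = make([]int64, len(nodes))
	var cur int64
	for i, n := range nodes {
		pos[i] = cur
		// length field + block bytes + one offset slot per child
		cur += fieldSize + int64(len(n.bytes)) + fieldSize*int64(len(n.children))
	}
	rel = make([][]int64, len(nodes))
	for i, n := range nodes {
		rel[i] = make([]int64, len(n.children))
		for j, c := range n.children {
			rel[i][j] = pos[c] - pos[i] // positive (forward) if topo-sorted
		}
	}
	return pos, rel
}
```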

Reviewer:

Camlistore, with care taken for data archeology/recoverability, stores data and metadata (folder structures and references) the same way: as content-addressed blobs.

I understand CAR as IPFS's version of tar: a way to archive IPFS-held data in a (tape-)streaming-friendly format. Maybe it could become a .torrent competitor. But maybe that functionality can be achieved in a simpler way, by just pushing blobs to/from storage automatically, where the automation is built from a combination of Bloom filters and inverted Bloom filters over the content hashes. This is currently being implemented for Bitcoin block propagation, based on fresh insights, and surely it can be recycled for IPFS as well.

Imagine this upgrade: an HTTP/2 server can stream with interleaving so that several "files" and "directory structures" can be handled at once.

Which supports this use case: you click a web link to your "junior year reception party" video, and your HTTP/2 server suggests you might also want the gag/for-fun subtitles or voice-over that other people who streamed that video also used. Your client, say VLC, makes it a single keypress to say yes to any of the suggested extras.


> Currently, I'm leaning toward varints as this will make storing lots of small blocks significantly more efficient.

Reviewer:

I would agree with this (varints).

@Stebalien (author):

Note: there are significant downsides:

  1. Can't leave them blank and fill them in (or change them after the fact).
  2. Slower/harder to parse.

Also, any suggestions on sentinel values?

@kevina (Nov 8, 2017):

> Also, any suggestions on sentinel values?

Not yet; something like this is best determined once the details are worked out. If we are using COBL we could just use the NULL value.

@Stebalien (author):

What is COBL? I was planning on using base128 varints (the one protobuf uses) which has no "empty" values. We could use 0 and say that 1 means the next byte, I guess.
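
For context, Go's standard library implements exactly this base128 varint, so a quick illustration is easy. The "reserve 0 and shift real offsets by one" convention in the final comment is only the suggestion above, nothing decided:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Protobuf-style base128 varint: 7 value bits per byte, with the
	// MSB set on every byte except the last, so small values cost one byte.
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, 300)
	fmt.Printf("300 -> % x (%d bytes)\n", buf[:n], n) // ac 02 (2 bytes)

	v, _ := binary.Uvarint(buf[:n])
	fmt.Println(v) // 300

	// Every bit pattern is a valid value, so a sentinel has to be a
	// reserved value: e.g. reserve 0 for "missing child" and store
	// real offsets shifted by one.
}
```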

@kevina:

Sorry, I meant to say CBOR.

@Stebalien (author):

Ah, you mean structure this as a CBOR object? I hadn't considered that. My plan was to make a custom file format. Unfortunately, CBOR isn't very seekable. We could also go with some other existing format, but I would like something very simple and compact.

@Stebalien (author):

Actually, one argument for uint64 is that we'd be able to skip directly to the correct offset in the jump table without iterating through it. However, given that we already have to parse the IPLD object, that's probably a non-issue. Also, there are some fancy bit-twiddling algorithms that can make this very fast by counting bytes with zero MSBs (I just need to remember to implement it...).
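
This presumably refers to the trick of loading eight bytes at once and locating the first byte whose continuation bit (MSB) is clear. A guess at that algorithm in Go, not anything from the spec:

```go
package carsketch

import (
	"encoding/binary"
	"math/bits"
)

// varintLen returns the encoded length of the varint at the start of b
// (which must hold at least 8 bytes) without a byte-by-byte loop, by
// finding the first byte whose MSB is clear.
func varintLen(b []byte) int {
	w := binary.LittleEndian.Uint64(b)
	// Invert and mask: each byte with a clear continuation bit now has
	// its top bit set; the lowest such bit marks the final byte.
	m := ^w & 0x8080808080808080
	if m == 0 {
		return -1 // 9- or 10-byte varint: take a slow-path loop instead
	}
	return bits.TrailingZeros64(m)/8 + 1
}
```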


> We only bother including the root CID because all the other CIDs are embedded in the objects themselves. This saves space and *forces* parsers to actually traverse the DAG (hopefully validating it).
Reviewer:

I am not sure I see the benefit of "forcing a parser to traverse the DAG".

@Stebalien (author):

Good question. For one, it ensures that the CAR is actually one giant DAG rooted at that CID. However, that may not even be worth mentioning (space is, IMO, sufficient).
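
As an illustration of what that validation could look like, a hypothetical Go sketch that re-hashes every block while walking from the root. The `Block` type is invented for the example, and plain sha256 stands in for real CID/multihash handling:

```go
package carsketch

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// Block is a hypothetical decoded CAR entry: the digest the archive
// claims for it, its raw bytes, and its resolved children.
type Block struct {
	Digest   []byte // sha256 of Raw, standing in for a CID
	Raw      []byte
	Children []*Block
}

// Verify walks the DAG from the root, re-hashing every block. With
// only the root CID in the header, this traversal is the natural
// place to validate the whole archive.
func Verify(root *Block) error {
	seen := map[*Block]bool{}
	var walk func(*Block) error
	walk = func(b *Block) error {
		if seen[b] {
			return nil
		}
		seen[b] = true
		if sum := sha256.Sum256(b.Raw); !bytes.Equal(sum[:], b.Digest) {
			return fmt.Errorf("block does not match its digest")
		}
		for _, c := range b.Children {
			if err := walk(c); err != nil {
				return err
			}
		}
		return nil
	}
	return walk(root)
}
```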

@kevina commented Nov 12, 2017:

I have been thinking about this for a while and I think we should keep the initial spec as flexible as possible for future expansion. In particular:

  1. We should use CBOR for the encoding because it encodes the types as part of the format, which allows for additional flexibility. This also solves the problem of how to handle missing children, as we can just use a CBOR "null" or "undefined" value.
  2. There is no best way to order the nodes, so the ordering should not be dictated by the standard, except that parents should come before their children (except possibly in the case of duplicates). For reproducibility we can define standard orders and encode the order as part of the header for the archive.
  3. If CBOR is used we do not have to decide on the varint issue; we can leave the size of the ints up to the implementation. If, for a particular topological sort, it is better to use fixed-size ints so they can be filled in after the fact, implementations are free to do so. Other implementations that don't require this can use mixed CBOR int types to minimize space.
  4. Unless there is a significant downside, I think we should allow negative offsets, to provide maximum flexibility in the ordering of nodes within the archive in the case of duplicates.
  5. The standard should encourage avoiding duplicates but should not require it. For example, there may be an advantage to keeping all the nodes for a file together to minimize seeking when extracting that file; if some of the file's blocks are duplicates of blocks stored elsewhere, a large amount of seeking may otherwise be required.

@daviddias added the status/deferred label (conscious decision to pause or backlog) on Mar 19, 2018.
@daviddias (Member):

I'm merging this one so that we can give this folder a restructure and clean-up.

@daviddias merged commit eea173a into ipld:master on May 12, 2018.
@Stebalien (author):

FYI, we'll probably need to rework that spec from scratch if we actually want to implement this. After writing it, I realized that it ignored complex questions like, e.g., compression.

@whyrusleeping mentioned this pull request on Aug 11, 2018.
prataprc pushed a commit to iprs-dev/ipld-specs referencing this pull request on Oct 13, 2020.