DAG Import/Export #7011

ribasushi · 2020-03-18T15:57:24Z

Implements #6870.

ribasushi · 2020-03-20T19:50:23Z

@Stebalien @rvagg @warpfork: this is the final implementation of the import/export functionality. The only thing still missing is the sharness tests, which are coming some time Saturday; I am still not entirely happy with my test cases.

Sorry it took forever, there were a lot of... dusty corners to connect properly.

I specifically want someone to check my concurrency work ( around the progress meters ). I have not worked with such code a whole lot, and while there was a lot to crib from existing progress meters in other commands, nothing was quite close to what I needed.

I will convert this into a proper PR once I push the sharness bits.

willscott

🏎 💨

core/commands/dag/dag.go

Co-Authored-By: Will <[email protected]>

rvagg · 2020-03-24T03:14:53Z

core/commands/dag/dag.go

+// It is not guaranteed that a root in a header is actually present in the same ( or any )
+// .car file. This is the case in version 1, and ideally in further versions too
+// Accumulate any root CID seen in a header, and supplement its actual node if/when encountered
+// We will attempt a pin *only* at the end in case all car files were well formed


attempt a pin only at the end in case all car files were well formed

I can't quite see this in the code. It looks like it's done on a per-root basis, not per-car.

Correct - each individual root is opportunistically pinned, to maximize the benefit of what we just ingested being pinned.

Better change the comment then. It seems misleading to talk about the "car files" being "well formed", since you may find one root and pin it, and not find another and not pin that one.
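For illustration, the per-root behaviour discussed here could be sketched roughly as below. This is not the code from the PR: CIDs are plain strings, and `rootTracker`, `pinFromStore` and `errNotFound` are hypothetical names invented for the example. The point is only that each root is pinned (or fails to pin) on its own, regardless of which .car file announced it.

```go
package main

import (
	"errors"
	"sort"
)

// errNotFound is a hypothetical error for a CID missing from the local store.
var errNotFound = errors.New("block not found")

// rootTracker accumulates roots announced in CAR headers and records
// whether a block with that CID was later seen in any CAR body.
type rootTracker struct {
	seen map[string]bool
}

func newRootTracker() *rootTracker {
	return &rootTracker{seen: make(map[string]bool)}
}

// AddHeaderRoot registers a root listed in a header without marking it seen.
func (rt *rootTracker) AddHeaderRoot(c string) {
	if _, ok := rt.seen[c]; !ok {
		rt.seen[c] = false
	}
}

// ObserveBlock marks a root as present if the block's CID matches one.
func (rt *rootTracker) ObserveBlock(c string) {
	if _, ok := rt.seen[c]; ok {
		rt.seen[c] = true
	}
}

// PinEach attempts one pin per root, independently of the others, and
// returns the roots whose pin failed (sorted for stable output).
func (rt *rootTracker) PinEach(pin func(c string) error) (failed []string) {
	for c := range rt.seen {
		if err := pin(c); err != nil {
			failed = append(failed, c)
		}
	}
	sort.Strings(failed)
	return failed
}

// pinFromStore returns a stand-in pin function that succeeds only when the
// CID is already in the local store, whether or not it arrived via a CAR.
func pinFromStore(store map[string]bool) func(string) error {
	return func(c string) error {
		if !store[c] {
			return errNotFound
		}
		return nil
	}
}
```

With three header roots where only one appears in a CAR body and a second is already in the local store, `PinEach` would report just the third as failed. That is the "one root pinned, another not" outcome the comment wording should reflect.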

rvagg · 2020-03-24T03:16:53Z

core/commands/dag/dag.go

+				//
+				// if err := api.Pin().Add(req.Context, rp, options.Pin.Recursive(true)); err != nil {
+
+				ret := RootMeta{Cid: &c, PresentInCar: seen}


What if seen is false but the CID already exists in the blockstore? I think this would mean that it gets pinned regardless of the "malformed" CAR. Should the next block of if/else's be contained in an if seen?

This is correct/by design: https://github.com/ipfs/go-ipfs/blob/03fe9bde87462aa11775bf6e8361a56fd967f89d/test/sharness/t0054-dag-car-import-export.sh#L83-L86

This goes back to the question of "malformed" then: are you sure you want to have a loose edge like this? It might be better to be a little more strict about the contents of a car ("you don't have the root you say you do, bad luck"). Instead you are opening up edges that could become important features for some in the future when we decide to be more strict with a future CAR format, or could lead to unexpected behaviour for others.

rvagg · 2020-03-24T03:27:42Z

core/commands/dag/dag.go

+			return sess.GetBlock(req.Context, c)
+		}
+		sb := gipselectorbuilder.NewSelectorSpecBuilder(gipfree.NodeBuilder())
+		car := gocar.NewSelectiveCar(


Do we care about the case where you don't have a complete graph available? I believe that the * selector is going to result in an error if it reaches an edge that sess.GetBlock() can't return for a CID. We just can't currently disambiguate the error type from ipld-prime to know whether that may be acceptable (e.g. if a --allow-incomplete type flag was introduced), but we could probably change it to get one if that's needed. (https://github.com/ipld/go-ipld-prime/blob/eb71617f4aeb7e73f36276f14e76e47fd90241ef/traversal/walk.go#L119)

@rvagg on export we want to either output the complete graph ( potentially grabbing it from the network ), or to fail. I.e. at present we do not at all have --allow-incomplete in spirit, so that the .car files produced by go-ipfs are exclusively "well formed and deterministic" ( deterministic as dictated by go-ipld-prime ).

The import of course is a complete free-for-all ( also by design ) as can be seen here:
https://github.com/ipfs/go-ipfs/blob/03fe9bde87462aa11775bf6e8361a56fd967f89d/test/sharness/t0054-dag-car-import-export.sh#L41-L55
based on the datasets: https://github.com/ipfs/go-ipfs/blob/03fe9bde87462aa11775bf6e8361a56fd967f89d/test/sharness/t0054-dag-car-import-export-data/README.md

Let me know if this is potentially problematic...

Comment above, it's not necessarily problematic, it just opens up API feature space that may be difficult to claw back in the future if we decide to be more strict. My personal inclination with such things these days is to be explicit and strict and open up feature space only in response to requests and demonstrated need (it also gives you a much more explicit space to construct tests around). Not my call here, so just a word of caution.
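To make the "complete graph or fail" export rule concrete, here is a minimal sketch under heavy assumptions: a toy DAG with string CIDs and a `get` callback standing in for something like sess.GetBlock. The names `node` and `walkComplete` are hypothetical, and the real traversal goes through go-ipld-prime selectors rather than anything like this.

```go
package main

import "fmt"

// node is a toy DAG node: just the CIDs (as strings) it links to.
type node struct{ links []string }

// walkComplete visits every node reachable from root in depth-first order.
// It returns an error as soon as a linked block cannot be fetched, so the
// caller never receives a partial graph.
func walkComplete(root string, get func(string) (node, bool)) ([]string, error) {
	var order []string
	visited := make(map[string]bool)

	var visit func(c string) error
	visit = func(c string) error {
		if visited[c] {
			return nil // already emitted, skip duplicates
		}
		visited[c] = true
		n, ok := get(c)
		if !ok {
			return fmt.Errorf("block %s not available, refusing to export a partial graph", c)
		}
		order = append(order, c)
		for _, l := range n.links {
			if err := visit(l); err != nil {
				return err
			}
		}
		return nil
	}

	if err := visit(root); err != nil {
		return nil, err
	}
	return order, nil
}
```

An --allow-incomplete style mode, if it were ever wanted, would need the missing-block case to be distinguishable from other errors, which is the limitation with the ipld-prime error type noted above.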

- Stop printing during Run: it does not execute on local tty
- Use tabs for easier parsing
- Ensure countstats do not appear in json events

It is not possible to stream out "binary" and emit events at the same time. Thus switch from listing how many objects we exported to how many *bytes* are being spat out. Move everything to PostRun to keep emitting on Stderr of the calling process.

With a relatively large number of objects, things get into a livelock ( spinning fans, no progress ). Specifically, after importing https://ipfs.io/ipfs/QmbnqojkPFjZtkCtzzCEmUbAVoKEkJzdY5u7Z7anR6vAte the following locks up:
ipfs dag export bafy2bzaced4ueelaegfs5fqu4tzsh6ywbbpfk3cxppupmxfdhbpbhzawfw5oy

Failcases to come as a separate PR

We do not lose any blocks ( we are able to re-export ), so that's great. But the fact that `repo gc` comes back with non-existent Qm CIDs... ugh

Stebalien · 2020-03-25T23:13:22Z

go.mod

@@ -54,6 +54,7 @@ require (
 	github.com/ipfs/go-unixfs v0.2.4
 	github.com/ipfs/go-verifcid v0.0.1
 	github.com/ipfs/interface-go-ipfs-core v0.2.6
+	github.com/ipld/go-car v0.0.5-0.20200316204026-3e2cf7af0fab


Go ahead and cut a CAR release when ready.

Stebalien · 2020-03-25T23:18:55Z

core/commands/dag/dag.go

+					roots = ret.roots
+
+					progressTicker.Stop()
+					if err := res.Emit(&CarImportOutput{ObjectCounts: &counts}); err != nil {


This isn't safe. Emit can do anything it wants with the object, including send it to another goroutine.

The reason &counts is a pointer ( here and your next comment ) is that this is the only way to activate omitempty higher up: https://github.com/ipfs/go-ipfs/pull/7011/files#diff-15172926c4147422f472139313467179R73

Want to make sure you agree with the rationale before I add a bunch of mutexes.

That's fine although I'm not sure why omitempty is important in this case. My point is that we should be copying this object safely instead of sharing it.
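A small, self-contained sketch of the two points in this exchange, using hypothetical type names (`counts`, `byValue`, `byPointer`) rather than the PR's own structs: with encoding/json, `omitempty` never drops a struct value but does drop a nil pointer, and a pointer to a copy gives the same JSON shape without sharing the live counters with whatever Emit does.

```go
package main

import "encoding/json"

// counts is a stand-in for the PR's counters struct.
type counts struct {
	Blocks int `json:"blocks"`
}

// byValue holds the struct directly: omitempty has no effect on struct
// values, so the field is always serialized.
type byValue struct {
	Counts counts `json:"counts,omitempty"`
}

// byPointer holds a pointer: a nil pointer counts as empty and is omitted.
type byPointer struct {
	Counts *counts `json:"counts,omitempty"`
}

// snapshotPtr returns a pointer to a private copy of c. The parameter is
// passed by value, so later changes to the caller's variable are not
// visible through the returned pointer.
func snapshotPtr(c counts) *counts {
	return &c
}

// encode marshals v and returns the JSON text (or the error text).
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}
```

Note that `snapshotPtr` only avoids sharing; if another goroutine may be writing the counters while the copy is taken, the copy itself still has to happen under whatever lock guards them.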

Stebalien · 2020-03-25T23:20:12Z

core/commands/dag/dag.go

+
+		retCh := make(chan importResult, 1)
+		var counts ObjectCounts
+		go importWorker(req, &res, &api, &counts, retCh)


This is definitely not safe. We can't update counts from a different thread without a lock.
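One possible shape for a fix, sketched with hypothetical names (`objectCounts`, `safeCounts`, `countConcurrently`) and not taken from the PR: keep the counters behind a mutex, let the worker goroutine update them only through a method, and hand out copies to anything that wants to read or emit them.

```go
package main

import "sync"

// objectCounts is a stand-in for the counters shared with the import worker.
type objectCounts struct {
	Blocks uint64
	Bytes  uint64
}

// safeCounts guards the counters with a mutex so that a worker goroutine
// can update them while another goroutine reads them.
type safeCounts struct {
	mu sync.Mutex
	c  objectCounts
}

// Add records one block of the given size.
func (s *safeCounts) Add(blockSize uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.c.Blocks++
	s.c.Bytes += blockSize
}

// Snapshot returns a copy of the counters taken under the lock.
func (s *safeCounts) Snapshot() objectCounts {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.c
}

// countConcurrently starts several goroutines that each record perWorker
// blocks, waits for them, and returns the final snapshot.
func countConcurrently(workers, perWorker int, blockSize uint64) objectCounts {
	var sc safeCounts
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				sc.Add(blockSize)
			}
		}()
	}
	wg.Wait()
	return sc.Snapshot()
}
```

Running a variant without the mutex under `go test -race` is a quick way to see the data race being pointed out here.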

Stebalien · 2020-03-25T23:36:21Z

core/commands/dag/dag.go

+					for {
+						len, readErr := r.Read(buf)
+						if len > 0 {
+							if err := re.Emit(bytes.NewBuffer(buf[:len])); err != nil {


This may happen to work (should add a check to prevent it) when not running the daemon, but simply won't work when actually running the daemon. When running the daemon, we can only return:

- Repeated results of the same type (encoded as json). To do this, we have to specify the type.
- Streaming bytes (no type specified).

We can't return both.

I wasn't sure about both not being possible, so opted for the loop. Will simplify.

Sorry, I misread this the first time. We can repeatedly stream new byte readers.

However, we really shouldn't read 4MiB, write 4MiB, etc. Ideally, we'd print out a rate and amount downloaded.

We do not read/write 4MiB. This is the max we can read from the pipe at once, but other than that we just block and pull through whatever slice is placed on the pipe.

https://github.com/ipfs/go-ipfs/pull/7037/files#diff-15172926c4147422f472139313467179R418-R421
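The read behaviour described in the last reply can be checked in isolation. The sketch below assumes the pipe in question behaves like Go's `io.Pipe` (the PR's actual plumbing may differ), and `readSizes` is a hypothetical helper: the buffer size is only an upper bound, and each Read returns whatever slice the writer placed on the pipe.

```go
package main

import "io"

// readSizes writes each chunk to an io.Pipe and records how many bytes
// every Read call on the other end returns when given a buffer of
// bufSize bytes.
func readSizes(chunks [][]byte, bufSize int) []int {
	pr, pw := io.Pipe()
	go func() {
		defer pw.Close() // signals io.EOF to the reader
		for _, c := range chunks {
			if _, err := pw.Write(c); err != nil {
				return
			}
		}
	}()

	var sizes []int
	buf := make([]byte, bufSize)
	for {
		n, err := pr.Read(buf)
		if n > 0 {
			sizes = append(sizes, n)
		}
		if err != nil {
			return sizes // io.EOF once the writer closes
		}
	}
}
```

With a large buffer, writes of 3 and 5 bytes come back as reads of 3 and 5 bytes; with a 2-byte buffer, the 5-byte write is drained in reads of 2, 2 and 1.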

Stebalien · 2020-03-25T23:37:05Z

Returning progress on export seems insanely complicated and not particularly useful unless we already know the number of blocks to expect.

Additionally, the current method simply won't work with the current commands library. We can't mix data blocks and objects.

Stebalien · 2020-03-25T23:37:34Z

Let's just drop progress on get. Progress on add is useful and should be pretty simple to implement.

ribasushi · 2020-03-26T00:34:54Z

Returning progress on export seems insanely complicated and not particularly useful unless we already know the number of blocks to expect.

All I am doing is copying bytes from one pipe to another ( with a complication-conditional that I now realise is not needed ). export can mean going to the network. Without a progress of any sort it is not clear whether the dag is just humongous, or we are stuck on a particular spot of the DAG forever. Having this progress was also what allowed quick diagnosis of go-ipld-prime being broken. The entire pattern was copied over from: https://github.com/ipfs/go-ipfs/blob/master/core/commands/cat.go#L82-L103

Additionally, the current method simply won't work with the current commands library. We can't mix data blocks and objects.

This code works 100%, I just elected not to add tests for it. I could do that if you are unconvinced :)

Let's just drop progress on get.

I can do that, but having long-running network operations without any way to tell "is this thing alive?" is pretty bad UX. Please let me know this is really what you want to do.

Progress on add is useful and should be pretty simple to implement.

Progress on add ( import ) is already in this PR... what am I missing?

Stebalien · 2020-03-26T01:02:48Z

core/commands/dag/dag.go

+
+		if silent || encType != "text" {
+			// force-disable progress unless on non-silent CLI
+			req.Options[progressOptionName] = false


If I ask for progress but set the encoding type to json, I'd expect json progress updates. I wouldn't mess with this in pre-run.

Stebalien · 2020-03-26T01:05:26Z

core/commands/dag/dag.go

+			// enable progress implicitly if a TTY
+			errStat, _ := os.Stderr.Stat()
+			if 0 != (errStat.Mode() & os.ModeCharDevice) {
+				req.Options[progressOptionName] = true


I'd just set the progress option default to true in the options. That way, the client can override it.

They can override it ( slightly rewritten ): https://github.com/ipfs/go-ipfs/pull/7037/files#diff-15172926c4147422f472139313467179R363-R373

Stebalien · 2020-03-26T01:13:31Z

Ok, I've been mixing up parts of import/export. I'll need to re-read this but it looks more reasonable on a second pass.

However, I'd still like to punt some of this logic till a future version. This is very tricky code with subtle issues and addressing them all in one PR will be frustrating for all of us.

This code works 100%, I just elected not to add tests for it. I could do that if you are unconvinced.

You're right, I was mixing up import/export and https://github.com/ipfs/go-ipfs/pull/7011/files#r398257738 (I thought this case was reachable).

rvagg · 2020-03-26T01:16:19Z

core/commands/dag/dag.go

+		c, err := cid.Decode(req.Arguments[0])
+		if err != nil {
+			return fmt.Errorf(
+				"unable to parse selector (currently only bare CIDs are supported): %s",


What's the format of this argument when proper selectors are allowed? I assume it'll allow parsing of @creationix's new string-form selectors at some point, but aren't we still going to need to specify a root separate to the selector? Is that going to need some extra syntax to squish CID+selector together or will this need to be 2 arguments, one for CID and one for selector? And in that case, should this argument just be renamed from "selector" to "root" for now with the expected addition of a selector argument to come later?

See also https://github.com/ipld/go-car/blob/3e2cf7af0fab7d59ac0eb2ebc3dcad1c06fa68d0/selectivecar.go#L21-L24 where these two things are distinct.

👍 When selector support is added, I'd advocate for a tuple here, yes.

Selectors are a declarative document which, when interpreted, navigates from one starting node, to a series of reached nodes (being all those visited) and matched nodes (being those which the selector... selects).

Starting a selector at a particular CID is a common operation -- great; it composes cleanly with the above.

Appearances of func foo(CID, Selector) would look totally reasonable and idiomatic to me.

Addressed as follows: https://github.com/ipfs/go-ipfs/pull/7036/files#diff-15172926c4147422f472139313467179R254-R263

The error text: https://github.com/ipfs/go-ipfs/pull/7036/files#diff-15172926c4147422f472139313467179R269

This should alleviate the duality ( @warpfork , @rvagg please confirm )

rvagg · 2020-03-26T11:41:51Z

Been pondering the import semantics all day and want to continue on from my inline comment earlier: I think the first version of this should be fairly exclusive and not attempt to handle too much novelty. Rejecting zero-root CARs is probably a good idea for now (since it's essentially baked in go CARv1) and being more picky about how to handle the root-not-in-body case is probably a good idea too - it's badly formed. Don't even bother trying to pin that root.

There are two main reasons for this line of thinking:

- Don't make it easy for the user to give you bad data or bad tooling will proliferate and persist.
- Don't open your API too wide because it's much much harder to restrict an expansive API than to expand a restrictive one. Expand the functionality options in response to real, justified use-cases. Being too expansive is also going to make a more restrictive CARv2 more difficult, it'll end up being a more heavily breaking API when introduced here.

The two areas I can see justified for not making import a strict mirror of export in terms of its exclusiveness are: well ordered, structured and complete DAG ("deterministic", too hard, but we will want to make that reasonable to check in CARv2) and >1 roots (although export could easily be expanded here to accept more than one root already).
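The stricter import proposed here would amount to a pre-pin check along the following lines. This is only a sketch of the proposal, with string CIDs and hypothetical names (`validateStrict`, `errNoRoots`); it is not something the PR implements.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoRoots is a hypothetical error for a header that lists no roots.
var errNoRoots = errors.New("car header lists no roots")

// validateStrict applies the two rules proposed above: reject a CAR with
// zero roots, and reject a CAR whose header names a root that does not
// appear among the blocks in its body. body is the set of CIDs seen while
// reading the CAR.
func validateStrict(roots []string, body map[string]bool) error {
	if len(roots) == 0 {
		return errNoRoots
	}
	for _, r := range roots {
		if !body[r] {
			return fmt.Errorf("root %s is listed in the header but missing from the body", r)
		}
	}
	return nil
}
```

Under these rules a root that is absent from the CAR body is never pinned, even if the same CID happens to exist in the local blockstore already.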

ribasushi · 2020-03-26T16:08:20Z

I have addressed all export pieces in a separate PR to make things simpler to review.
The Import part coming up in another 2 PRs...

ribasushi · 2020-03-26T18:06:18Z

New PRs:
export functionality only: #7036
export with cli progress: #7037
import functionality only: #7038

Regarding @rvagg's comment earlier: I still feel that dag import should be nothing more than a more convenient dag put. This does mean that the streams it accepts would be .car files that are not valid anywhere else but... so what?

If the group feels this is not the direction we should pursue: I will add further strictness checks as requested, and will in parallel write something else to provide the "more convenient dag put" I need for other purposes :)

ribasushi · 2020-03-28T03:19:22Z

@Stebalien the sharness test is a fluke ( pubsub stuff failed ). In any case this PR is now too different from the "broken up" ones, so I am closing it and killing the branch, as all suggestions here got incorporated ( aside from the discussion on "is import ok as-is", which we will do in the import PR #7038 ).

ribasushi requested a review from Stebalien March 18, 2020 15:57

ribasushi force-pushed the feat/carfile-support branch 2 times, most recently from c6f9e3b to f5d5715 Compare March 19, 2020 21:26

Stebalien reviewed Mar 19, 2020

ribasushi force-pushed the feat/carfile-support branch from f5d5715 to ac54fc3 Compare March 20, 2020 19:43

Dag import/export for final review: full sharness battery coming Sat.

aa50194

ribasushi force-pushed the feat/carfile-support branch from ac54fc3 to aa50194 Compare March 20, 2020 19:48

Stebalien marked this pull request as ready for review March 20, 2020 22:16

Stebalien requested a review from willscott March 20, 2020 22:19

Stebalien changed the title ~~Complete export support for pre-review~~ DAG Import/Export Mar 20, 2020

willscott reviewed Mar 21, 2020

ribasushi and others added 6 commits March 21, 2020 05:25

Update core/commands/dag/dag.go

1997ac9

Co-Authored-By: Will <[email protected]>

Errors are not serialized to JSON - use strings

55d1413

Do not listen to nonexistent timers when progress disabled

fccbd8f

Only allow progress on the CLI

1a2cf9f

Stop variable/type shadowing

0fe70a6

Clarify iterator file closures

7b48943

ribasushi commented Mar 21, 2020

Stebalien reviewed Mar 21, 2020

switch channel to context

ae2d4f7

ribasushi mentioned this pull request Mar 23, 2020

Research/quantify performance envelopes of multiple CDC algorighms ipfs/specs#227

Open

24 tasks

rvagg reviewed Mar 24, 2020

ribasushi added 5 commits March 24, 2020 07:55

Properly exit with an error if some pins fail

aa85f2e

Streamline progress/silence/tty defaulting

de52029

Fix importer progress display:

67642d8

Fix exporter progress display

3d9ed1a

ribasushi added 3 commits March 25, 2020 23:09

Initial 'happy path' sharness test

752073b

Augment tests to show pinning/gc troubles

0dae3a2

Merge remote-tracking branch 'origin/master' into feat/carfile-support

03fe9bd

ribasushi requested a review from Stebalien March 25, 2020 22:27

Forgotten backed out piece from go-ipld-prime codepath

e8540c6

ribasushi mentioned this pull request Mar 25, 2020

Present codebase does not allow opening root-less .car files ipld/go-car#26

Open

Stebalien reviewed Mar 25, 2020

Stebalien reviewed Mar 26, 2020

Stebalien reviewed Mar 26, 2020

rvagg reviewed Mar 26, 2020

rvagg mentioned this pull request Mar 26, 2020

(WIP) feat: dag import and export to and from CAR files ipfs/js-ipfs#2953

Closed

ribasushi closed this Mar 28, 2020

ribasushi deleted the feat/carfile-support branch March 28, 2020 03:20
