
Refactor async loading for simplicity and correctness #356

Merged Feb 18, 2022 (5 commits)

Conversation

hannahhoward (Collaborator):

Goals

fix #339 -- this is essentially an implementation of the algorithm outlined there

In the process, it lays the foundation for much more robust handling of different async loading scenarios. It also enables "fire request only when needed" behavior

Implementation

The first concept to understand is the TraversalRecord and the Verifier. The TraversalRecord is a compact representation of an ordered history of links loaded in a traversal, along with their paths. It's based on the pathNode implementation in ipld/go-ipld-prime#354 (hattip @willscott). The Verifier is a class for replaying a traversal history against a series of responses from a remote. It also handles links the local had that the remote is missing. Note that the Verifier currently has to run without changes being made to the TraversalRecord.
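The compact record described here can be pictured as a small tree of path segments, replayed depth-first. The following is a minimal, self-contained Go sketch of that idea -- all names are hypothetical; the real implementation works over CIDs and lives in the traversalrecord package:

```go
package main

import "fmt"

// recordNode is a hypothetical stand-in for a traversal-record entry: one
// path segment, an optional link loaded exactly at this path, and children
// in traversal order.
type recordNode struct {
	segment  string
	link     string // "" if no link was loaded exactly at this path
	children []*recordNode
}

// record appends a load at the given path, creating intermediate nodes as
// needed. Because traversal is depth-first, new loads extend the last child.
func (n *recordNode) record(path []string, link string) {
	if len(path) == 0 {
		n.link = link
		return
	}
	if len(n.children) > 0 && n.children[len(n.children)-1].segment == path[0] {
		n.children[len(n.children)-1].record(path[1:], link)
		return
	}
	child := &recordNode{segment: path[0]}
	n.children = append(n.children, child)
	child.record(path[1:], link)
}

// links replays the record depth-first, yielding links in traversal order --
// this is the ordered history a Verifier would compare remote responses against.
func (n *recordNode) links(out []string) []string {
	if n.link != "" {
		out = append(out, n.link)
	}
	for _, c := range n.children {
		out = c.links(out)
	}
	return out
}

func main() {
	root := &recordNode{}
	root.record(nil, "bafyRoot")
	root.record([]string{"a"}, "bafyA")
	root.record([]string{"a", "b"}, "bafyAB")
	fmt.Println(root.links(nil)) // [bafyRoot bafyA bafyAB]
}
```

The key property is that replaying the tree depth-first reproduces the exact load order, which is what makes verification against a stream of remote responses possible.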

The second concept is the ReconciledLoader and its associated remote queue of metadata and blocks.

The loading process is as follows:

  • when we request a link, we wait for new items from the remote
    • if we're offline, we return immediately
    • if we're online, we wait until we have new items to read
      • when we first go online, we initiate a "verification" against previously loaded data -- this comes from the traversal record
      • this means when we first go online, we actually don't have any "new" remote items until the verification finishes. If the verification fails, we pass an error in response to the load request -- this is a bit odd, but keep in mind the traverser that calls BlockReadOpener will simply throw this in the graphsync error channel and close the request. I really didn't want to deal with having the channel inside the reconciled loader and getting drawn into the request closing process.
  • if we have new remote items, and we're not loading from a path we know the remote is missing, we attempt to load the link from the first remote item, erroring if the link we're loading and the remote do not match
  • if we're on a path the remote is missing, we simply go right to loading locally
  • if the remote sent the block, it's directly in the remote item queue and we save and return it. This queue itself is the unverified blockstore
  • if the remote said it had the block but it was a duplicate, we load locally
  • if the remote said it's missing the block, we record the new missing path, then load locally
  • Once we load, we record the load into the traversal record, with one exception: if we're offline, we actually hold until the next load before putting the link in the traversal record. This enables a "retry" to be run when a local load fails but then the node goes online
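The branching above can be condensed into a single decision function. This is purely illustrative -- the names are invented, and the real ReconciledLoader returns blocks and errors rather than strings:

```go
package main

import "fmt"

// loadSource is a hypothetical condensation of the decision flow: where does
// the next block come from, given the current request state?
func loadSource(online, pathKnownMissing, haveRemoteItem bool) string {
	switch {
	case !online:
		// offline: no waiting, go straight to the local store
		return "local"
	case pathKnownMissing:
		// the remote told us it's missing this path: load locally
		return "local"
	case !haveRemoteItem:
		// online but nothing queued yet (e.g. verification still running)
		return "wait for remote items"
	default:
		// the remote item queue doubles as the unverified blockstore
		return "remote item queue"
	}
}

func main() {
	fmt.Println(loadSource(false, false, false)) // local
	fmt.Println(loadSource(true, true, true))    // local
	fmt.Println(loadSource(true, false, true))   // remote item queue
}
```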

Ingesting data is now very straightforward -- we divide up responses + blocks by request ID and then queue responses into the respective queues for the ReconciledLoader. Coordination is simply handled via mutexes and sync.Cond for signalling.
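The mutex + sync.Cond pattern mentioned here can be sketched as follows. This is a minimal stand-in, not the actual go-graphsync queue types -- items are strings rather than responses/blocks:

```go
package main

import (
	"fmt"
	"sync"
)

// remoteQueue sketches a per-request queue of remote items: a mutex guards
// the slice, and a sync.Cond wakes the consumer when new items arrive.
type remoteQueue struct {
	mu     sync.Mutex
	cond   *sync.Cond
	items  []string // stands in for queued responses/blocks
	closed bool     // set when the request goes offline or terminates
}

func newRemoteQueue() *remoteQueue {
	q := &remoteQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// ingest queues new items from the network side and signals the consumer.
func (q *remoteQueue) ingest(items ...string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, items...)
	q.cond.Signal()
}

// close marks the queue done and wakes any waiter so it can return.
func (q *remoteQueue) close() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.closed = true
	q.cond.Broadcast()
}

// next blocks until an item is available or the queue is closed.
func (q *remoteQueue) next() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 && !q.closed {
		q.cond.Wait() // releases mu while waiting, reacquires on wake
	}
	if len(q.items) == 0 {
		return "", false
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item, true
}

func main() {
	q := newRemoteQueue()
	go q.ingest("block-1")
	item, ok := q.next()
	fmt.Println(item, ok) // block-1 true
}
```

The for-loop around `cond.Wait()` matters: condition variables can wake spuriously, so the predicate must be rechecked under the lock.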

For Discussion

  • TODOs (feel free to rank optional or required in PR review):
    • the Verifier is a bit imperative and finicky, especially in terms of not being able to handle more data coming into the TraversalRecord while verifying -- I took a couple shots at this and then gave up
      • this means for the time being we don't let the local traversal run ahead when online -- I need a verifier you can continually verify against as data gets added -- I'm pretty sure it's doable just... haven't done it yet
    • There are some commented out tests in the RequestManager -- this is because now they're basically not relevant with the new "only request once you need it mode" -- however, there's some question of whether we need to make this mode optional either as a config option or as a per-request option (we don't currently have a framework for that -- we probably need it). Firing requests only when you need it is a big change -- I kind of love the efficiency -- but it leaves open questions about what happens with stuff like go-data-transfer (what if you never got any graphsync request in response to a Push cause the other side already had all your data? eek!).
    • Updating the documentation -- this is probably required -- probably I'll copy a lot of this PR description.

// for a request and does NOT cause a request to fail completely
type RemoteMissingBlockErr struct {
Member:

why the rename? it's still for remotes and we have other Remote* errors like the one you introduced below

hannahhoward (Collaborator, Author) -- Feb 11, 2022:

technically it's now just a "missing block" -- missing both locally & remotely -- maybe a rename that breaks compatibility isn't worth it just to be super accurate.

@@ -109,6 +109,11 @@ func TestMakeRequestToNetwork(t *testing.T) {
require.True(t, found)
require.Equal(t, td.extensionData, returnedData, "Failed to encode extension")

builder := gsmsg.NewBuilder()
builder.AddResponseCode(receivedRequest.ID(), graphsync.RequestRejected)
Member:

why do we need this now? is it because this test never needed to actually make a response and the previous version had all the blocks locally; but this new one puts the blocks on the remote but then tells the client to get lost?

So we're simply testing the ability to make a network connection and have the most basic interaction here?

hannahhoward (Collaborator, Author):

yea this test is from very far back in the early graphsync development days -- when I was just seeing if the protocol was generating the network traffic I expected. So previously, yes, it loaded absolutely everything locally, but we just wanted to make sure a request got made.

Maybe the right move now is to eliminate it, since really, at this point, I'm pretty sure we can send network requests :P

hannahhoward (Collaborator, Author):

eliminated

}

// SetRemoteState records whether or not the request is online
func (rl *ReconciledLoader) SetRemoteState(online bool) {
Member:

Could we rename this so it's more natural with a boolean? (or use an enum for the values, but there's really only two). Maybe SetRemoteOnline()? Reading SetRemoteState(false) is a bit mysterious.

hannahhoward (Collaborator, Author):

Yea I agree.

// if we've only loaded locally so far and hit a missing block
// initiate remote request and retry the load operation from remote
if _, ok := result.Err.(graphsync.MissingBlockErr); ok && !requestSent {
requestSent = true
Member:

Should this be a state that's available on the loader rather than having to figure it out here? It seems a little risky having to make inferences like this from the outside. It seems to me that there's a new loader made for each request in the requestmanager, and then this assumption is implicit in the executor.

It's also a bit weird that we're asking the loader for a block, and then using the result that the loader gives us to then tell the loader what its own state is. It seems that we're doing state management here so we can trigger the RetryLastOfflineLoad(). But the flow does raise questions about who's responsible for what. The loader clearly isn't just a dumb container, yet it's not responsible for its own state management.

Member:

Oh, looking at requestTask() it looks like it could be a reused loader, so we have local state (to the executor) and loader state going on here. So maybe it's already "online" as far as it's concerned, from a previous use, but then we go through this local state management to figure out it's online, then we tell it it's online (again) and do the RetryLastOfflineLoad() dance.

Is there risk involved in this dual state management and reuse?

hannahhoward (Collaborator, Author):

yea, I mean, maybe? honestly this is a boundary I thought about a lot and couldn't decide upon.

Honestly, I thought about having the loader just make the request itself cause yea, then state in only one place... I dunno, at that point it's just doing a whole whole lot and it makes this already large PR even bigger.

I'm not sure where to draw the line, and my gut feeling is: let's ship it, prove it works, and then refactor more as needed.

Member:

fair enough, I'll buy that

willscott (Collaborator) left a comment:

This seems reasonable. Docs are probably the only one of the 3 bullets that I'd push to have in this PR rather than a follow up

// RemoteIncorrectResponseError indicates that the remote peer sent a response
// to a traversal that did not correspond with the expected next link
// in the selector traversal based on verification of data up to this point
type RemoteIncorrectResponseError struct {
Collaborator:

the path where this link mismatch occurred might be useful context as well

hannahhoward (Collaborator, Author):

done

CleanupRequest(p peer.ID, requestID graphsync.RequestID)
// PersistenceOptions is an interface for getting loaders by name
type PersistenceOptions interface {
GetLinkSystem(name string) (ipld.LinkSystem, bool)
Collaborator:

need docs on why there are many named link systems here

hannahhoward (Collaborator, Author):

this interface is unrelated to the current PR -- it just got moved around a bit -- can we defer to a separate ticket? This got large

// ReconciledLoader is an interface that can be used to load blocks from a local store or a remote request
type ReconciledLoader interface {
SetRemoteState(online bool)
RetryLastOfflineLoad() types.AsyncLoadResult
Collaborator:

we should document thread safety and semantics here - i assume that the intention is this loader can only be used by one client.

Collaborator:

I feel like something is off with this interface as well:

  • should this just be a `retryLastLoad()`?
  • why does this reconciled loader need to know that the state is online/offline? when it asks the linksystem to load when offline, the linksystem can immediately return an error -- in the same way it would if the loader was online, but there was a network failure?

hannahhoward (Collaborator, Author):

the thread safety is relatively well documented in reconciledloader.go itself. This is the joy of go's whole "put the interface next to where it's used" -- you expect docs in one place that are elsewhere -- maybe I should put a pointer here.

Re: RetryLastOfflineLoad -- it's named as such cause I thought this was really the only useful case -- trying a load again when you go online. But I agree, a global retry would be more useful and clear. In my mind the operations are as follows:

  • offline load -> retry online -- this time wait for remote data and try again
  • online load -> retry online -- just try loading from local store (if you were online and loaded successfully, it got saved locally, the remote isn't sending any more data on this particular load, so the only possibility is more local data saved from another source)
  • online load -> retry offline -- just load local (and previous remote data was saved)
  • offline load -> retry offline -- just load local

A future version could theoretically make an actual request to the network to tell the remote to try the block again, but that's not something I think we tackle now at all.
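The four combinations listed above collapse to a single rule: only an offline load retried while online needs to wait on the remote queue; everything else just loads locally. A hedged sketch (function name hypothetical, returns strings purely for illustration):

```go
package main

import "fmt"

// retryAction sketches the retry dispatch described above: the only case
// that involves the remote is "loaded offline, now retrying online".
func retryAction(lastLoadWasOnline, nowOnline bool) string {
	if !lastLoadWasOnline && nowOnline {
		// offline load -> retry online: wait for remote data and try again
		return "wait for remote data, then retry"
	}
	// all other combinations: the remote has nothing more to offer for this
	// load, so just try the local store again
	return "load from local store"
}

func main() {
	fmt.Println(retryAction(false, true)) // wait for remote data, then retry
	fmt.Println(retryAction(true, true))  // load from local store
}
```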

hannahhoward (Collaborator, Author):

done!


// verify it matches the expected next load
if !head.link.Equals(link.(cidlink.Link).Cid) {
return nil, graphsync.RemoteIncorrectResponseError{
Member:

we have zero tolerance for out-of-order blocks right? so this error is final and breaks the entire transfer?

hannahhoward (Collaborator, Author):

yes and yes. Remotes had best not fuck up.

hannahhoward (Collaborator, Author):

I should probably remove swearing in a Github comment and yet I am not feeling compelled to :P

// update path tracking
rl.pathTracker.recordRemoteLoadAttempt(lctx.LinkPath, head.action)

// if block == nil, we have no remote block to load
Member:

this comment could do with more extrapolation, there's some stuff related to states and the duplicate block logic in the metadata ingester

hannahhoward (Collaborator, Author):

fair, I'll work on it based on what I wrote above.

func (v *Verifier) appendUntilLink() {
if v.tip().link == nil && len(v.tip().children) > 0 {
v.stack = append(v.stack, v.tip().children[0])
v.appendUntilLink()
Member:

why does this recurse rather than iterate?

hannahhoward (Collaborator, Author):

cause I used to be an Elixir/Erlang programmer? Totally fair. I'll fix.
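For reference, the recursion in appendUntilLink translates directly into a loop. A self-contained sketch with simplified stand-in types (the real Verifier works over CIDs and path segments, and its stack elements are the traversal-record nodes):

```go
package main

import "fmt"

// vnode is a hypothetical stand-in for a traversal-record node: either it
// carries a link, or it has children to descend into.
type vnode struct {
	link     string
	children []*vnode
}

type verifier struct {
	stack []*vnode
}

func (v *verifier) tip() *vnode { return v.stack[len(v.stack)-1] }

// appendUntilLink, iteratively: keep pushing the first child onto the stack
// while the current tip carries no link but does have children.
func (v *verifier) appendUntilLink() {
	for v.tip().link == "" && len(v.tip().children) > 0 {
		v.stack = append(v.stack, v.tip().children[0])
	}
}

func main() {
	leaf := &vnode{link: "bafyLeaf"}
	mid := &vnode{children: []*vnode{leaf}}
	root := &vnode{children: []*vnode{mid}}
	v := &verifier{stack: []*vnode{root}}
	v.appendUntilLink()
	fmt.Println(len(v.stack), v.tip().link) // 3 bafyLeaf
}
```

Since Go does not guarantee tail-call elimination, the loop form also avoids growing the call stack on deep records.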

// Verifier allows you to verify a series of link loads matches a previous traversal
// order when those loads are successful
// At any point it can reconstruct the current path.
type Verifier struct {
Member:

most of this seems necessary because of ipld-prime's insistence on pretending that block boundaries don't exist, so it has very poor support for the distinction that links make within a series of nodes; is that fair?

hannahhoward (Collaborator, Author):

I mean yea sorta. Technically, we could just re-run the selector against the results, but that seems like A LOT of overkill.

I get why ipld-prime wants to treat data as a giant abstract structured data source in the sky, but yes block boundaries are a thing and it can be tricky to deal with.

hannahhoward (Collaborator, Author):

@willscott + @rvagg I believe I have resolved all PR comments -- hopefully we are good to merge?

That said, this merits another re-review cause boy "update the documentation" turned into "wow, the architecture docs are really out of date and need a bunch of updates"

@@ -69,130 +66,6 @@ var protocolsForTest = map[string]struct {
"(v1.0 -> v1.0)": {[]protocol.ID{gsnet.ProtocolGraphsync_1_0_0}, []protocol.ID{gsnet.ProtocolGraphsync_1_0_0}},
}

func TestMakeRequestToNetwork(t *testing.T) {
Member:

my guess is that these two are removed because they were, as you were saying recently, early bootstrapping tests that made sense before the whole thing was built, now they are a bit too basic to be bothered with?

hannahhoward (Collaborator, Author):

yep

trace.WithAttributes(attribute.String("cid", link.String())))
defer span.End()

log.Debugw("verified block", "request_id", rl.requestID, "total_queued_bytes", buffered)
Member:

would these be interesting attributes in the verifyBlock trace above too?

Comment on lines +782 to +783
TODO: these are no longer used as the old responses are consumed upon restart, to minimize
network utilization -- does this make sense? Maybe we should throw out these responses while paused?
Member:

oh, this is interesting .. and I have no idea what the answer to this might be; are you comfortable enough with the new behaviour to commit to it for now?

ReconciledLoader -> LocalStorage : Load Block
LocalStorage --> ReconciledLoader : Block or missing
else
loop until block loaded, missing, or error
Member:

gets a bit dense in here; to the point where I wonder about the utility of the detail, but I think this looks right

rvagg (Member) left a comment:

if you're happy with the state of this, then fine by me, let's give it a go

willscott (Collaborator) left a comment:

Agree that the 'request-execution' state diagram is pretty hard to make sense of in one go. i wonder if there are smaller parts that make sense as their own broken out diagrams. That seems pretty nit-picky though, so happy to see this go in as is.

@hannahhoward hannahhoward merged commit f6a08b5 into rvagg/uuid-rebasing Feb 18, 2022
hannahhoward added a commit that referenced this pull request Feb 18, 2022
…ng (#332)

* feat(net): initial dag-cbor protocol support

also added first roundtrip benchmark

* feat(requestid): use uuids for requestids

Ref: #278
Closes: #279
Closes: #281

* fix(requestmanager): make collect test requests with uuids sortable

* fix(requestid): print requestids as string uuids in logs

* fix(requestid): use string as base type for RequestId

* chore(requestid): wrap requestid string in a struct

* feat(libp2p): add v1.0.0 network compatibility

* chore(net): resolve most cbor + uuid merge problems

* feat(net): to/from ipld bindnode types, more cbor protoc improvements

* feat(net): introduce 2.0.0 protocol for dag-cbor

* fix(net): more bindnode dag-cbor protocol fixes

Not quite working yet, still need some upstream fixes and no extensions work
has been attempted yet.

* chore(metadata): convert metadata to bindnode

* chore(net,extensions): wire up IPLD extensions, expose as Node instead of []byte

* Extensions now working with new dag-cbor network protocol
* dag-cbor network protocol still not default, most tests are still exercising
  the existing v1 protocol
* Metadata now using bindnode instead of cbor-gen
* []byte for deferred extensions decoding is now replaced with datamodel.Node
  everywhere. Internal extensions now using some form of go-ipld-prime
	decode to convert them to local types (metadata using bindnode, others using
	direct inspection).
* V1 protocol also using dag-cbor decode of extensions data and exporting the
  bytes - this may be a breaking change for existing extensions - need to check
	whether this should be done differently. Maybe a try-decode and if it fails
	export a wrapped Bytes Node?

* fix(src): fix imports

* fix(mod): clean up go.mod

* fix(net): refactor message version format code to separate packages

* feat(net): activate v2 network as default

* fix(src): build error

* chore: remove GraphSyncMessage#Loggable

Ref: #332 (comment)

* chore: remove intermediate v1.1 pb protocol message type

v1.1.0 was introduced to start the transition to UUID RequestIDs. That
change has since been combined with the switch to DAG-CBOR messaging format
for a v2.0.0 protocol. Thus, this interim v1.1.0 format is no longer needed
and has not been used at all in a released version of go-graphsync.

Fixes: filecoin-project/lightning-planning#14

* fix: clarify comments re dag-cbor extension data

As per discussion in #338, we are going
to be erroring on extension data that is not properly dag-cbor encoded from now
on

* feat: new LinkMetadata iface, integrate metadata into Response type (#342)

* feat(metadata): new LinkMetadata iface, integrate metadata into Response type

* LinkMetadata wrapper around existing metadata type to allow for easier
  backward-compat upgrade path
* integrate metadata directly into GraphSyncResponse type, moving it from an
  optional extension
* still deal with metadata as an extension for now—further work for v2 protocol
  will move it into the core message schema

Ref: #335

* feat(metadata): move metadata to core protocol, only use extension in v1 proto

* fix(metadata): bindnode expects Go enum strings to be at the type level

* fix(metadata): minor fixes, tidy up naming

* fix(metadata): make gofmt and staticcheck happy

* fix(metadata): docs and minor tweaks after review

Co-authored-by: Daniel Martí <[email protected]>

* fix: avoid double-encode for extension size estimation

Closes: filecoin-project/lightning-planning#15

* feat(requesttype): introduce RequestType enum to replace cancel&update bools (#352)

Closes: #345

* fix(metadata): extend round-trip tests to byte representation (#350)

* feat!(messagev2): tweak dag-cbor message schema (#354)

* feat!(messagev2): tweak dag-cbor message schema

For:

1. Efficiency: compacting the noisy structures into tuples representations and
   making top-level components of a message optional.
2. Migrations: providing a secondary mechanism to lean on for versioning if we
   want a gentler upgrade path than libp2p protocol versioning.

Closes: #351

* fix(messagev2): adjust schema per feedback

* feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID (#355)

* feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID

Closes: #349

* fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID

* fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID

when using error type T, use *T with As, rather than **T

* fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID

* fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID

Co-authored-by: Daniel Martí <[email protected]>

* feat: SendUpdates() API to send only extension data to via existing request

* fix(responsemanager): send update while completing

If request has finished selector traversal but is still sending blocks,
I think it should be possible to send updates. As a side effect, this
fixes our race.

Logically, this makes sense, cause our external indicator that we're
done (completed response listener) has not been called.

* fix(requestmanager): revert change to pointer type

* Refactor async loading for simplicity and correctness (#356)

* feat(reconciledloader): first working version of reconciled loader

* feat(traversalrecorder): add better recorder for traversals

* feat(reconciledloader): pipe reconciled loader through code

style(lint): fix static checks

* Update requestmanager/reconciledloader/injest.go

Co-authored-by: Rod Vagg <[email protected]>

* feat(reconciledloader): respond to PR comments

Co-authored-by: Rod Vagg <[email protected]>

* fix(requestmanager): update test for rebase

Co-authored-by: Daniel Martí <[email protected]>
Co-authored-by: hannahhoward <[email protected]>
@mvdan mvdan deleted the feat/refactor-asyncloader branch March 7, 2022 11:57
rvagg added a commit that referenced this pull request Feb 27, 2023
Was missed in #356

This doesn't break anything obvious because the loader isn't used from the
traverser's linksystem, and the traverser sets up defaults where the provided
linksystem is missing them. But, we don't get certain things, like
KnownReifiers, which are kind of important.
hannahhoward pushed a commit that referenced this pull request Feb 28, 2023