
sealing: Fix RecoverDealIDs loop with changed PieceCID #7117

Merged: 1 commit merged into master from fix/recoverdealids-loop on Aug 21, 2021

Conversation

@magik6k (Contributor) commented Aug 18, 2021

(note: 95% of the diff is testing; the bugfix is the few lines changed at the bottom of states_failed.go)

I'm not exactly sure what has to happen to make the FSM put the wrong piece into a sector, but this happened on my miner after it crashed with OOM while processing a whole bunch of deals, so it's probably a rather unlikely edge-case.

One likely explanation is that a call to AddPiece didn't add the correct data (e.g. the reader stream dropped early), which got the wrong PieceCID into SectorInfo.Pieces[].Piece.PieceCID. If that's the case (which I pretty strongly suspect it is), we can only do one of two things:

  1. Drop the piece from the sector (but that's rather complicated, requires new storage methods, and will require non-trivial shuffling of data around in sectors to fix padding if the removed piece was not at the end of the sector)

  2. Remove the sector

  • Given that we'll detect this before even attempting PC1 (we call checkPieces as the first step), we don't lose much work (only the work of putting deals into the sector)
  • Deals which fail because of being put into that sector should auto-retry adding to a new sector, so effectively we end up doing option 1 by reassembling the sector from scratch (see the sketch below)
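For illustration, here is a minimal, self-contained sketch of the decision the fix makes in states_failed.go. All names below are simplified stand-ins for the real lotus types and events (cid.Cid, SectorInfo, CurrentDealInfoManager, the SectorRemove FSM event), not the actual implementation:

```go
package sealing

import "fmt"

// Cid stands in for cid.Cid; Piece and SectorInfo stand in for the real
// lotus sealing types.
type Cid string

type Piece struct {
	PieceCID Cid
	DealID   uint64 // 0 means a padding piece in this simplified model
}

type SectorInfo struct {
	SectorNumber uint64
	Pieces       []Piece
}

// DealInfoLookup stands in for CurrentDealInfoManager.GetCurrentDealInfo,
// which returns the PieceCID recorded in the on-chain deal proposal.
type DealInfoLookup func(dealID uint64) (Cid, error)

// recoverDealIDs mirrors the decision in states_failed.go: if any piece's
// locally recorded PieceCID doesn't match the on-chain deal proposal, the
// sector can't be fixed in place, so it is removed (option 2 above) instead
// of looping in RecoverDealIDs forever.
func recoverDealIDs(sector SectorInfo, lookup DealInfoLookup) (removeSector bool, err error) {
	mismatched := 0
	for _, p := range sector.Pieces {
		if p.DealID == 0 {
			continue // padding piece, nothing to verify
		}
		onChain, err := lookup(p.DealID)
		if err != nil {
			return false, fmt.Errorf("looking up deal %d: %w", p.DealID, err)
		}
		if onChain != p.PieceCID {
			mismatched++
		}
	}
	if mismatched > 0 {
		// Before this fix the FSM would loop here; now we give up on the
		// sector and let its deals be re-added to a fresh one.
		return true, fmt.Errorf("sector %d: %d piece(s) with mismatched PieceCID, removing sector",
			sector.SectorNumber, mismatched)
	}
	return false, nil
}
```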

@magik6k requested a review from a team as a code owner on August 18, 2021 11:01
@neondragon commented Aug 18, 2021

"rather unlikely edge-case" -- I have 7 sectors and 63 deals stuck with RecoverDealIDs on f019551 (1.11.1-m1.3.5+mainnet+git.7be207bc5.dirty+api1.2.0).

@jennijuju (Member) commented:

I'd consider getting this into a lotus v1.11.2-rc if possible.

@jennijuju added this to the v1.11.2 milestone on Aug 18, 2021
@raulk (Member) left a comment:

I'm not super familiar with this part of the codebase. I think @aarshkshah1992 can provide a better review here.

  • If what we want to do is remove the sector if we find a piece CID mismatch, then this looks good.

  • Would the deals originally packed into that sector be reassigned to a new sector automatically? If so, should we write a test for that?

@@ -114,7 +116,7 @@ type Sealing struct {
 commiter *CommitBatcher

 getConfig GetSealingConfigFunc
-dealInfo *CurrentDealInfoManager
+DealInfo *CurrentDealInfoManager
A Member commented on this diff:

ubernit: move public fields to the top.

@aarshkshah1992 (Contributor) commented:

@jennijuju @magik6k This should close #7103.

@aarshkshah1992 (Contributor) commented Aug 19, 2021

@magik6k

> Deals which fail because of being put into that sector should auto-retry adding to a new sector, so effectively we end up doing option 1 by reassembling the sector from scratch

I see we are completely dropping the sector with the bad pieces. But where is the code to do the above, i.e. put all the deals from the dropped sector into a new sector?

@raulk (Member) commented Aug 19, 2021

@aarshkshah1992, from the PR description:

> Deals which fail because of being put into that sector should auto-retry adding to a new sector, so effectively we end up doing option 1 by reassembling the sector from scratch

^^ @magik6k do we have a test for that behaviour? Or can we write one, since that's the most useful outcome that folks are going to expect?

@magik6k closed this Aug 19, 2021
@magik6k reopened this Aug 19, 2021
@magik6k (Contributor, Author) commented Aug 19, 2021

(misclick on the close button)

> I see we are completely dropping the sector with the bad pieces. But where is the code to do the above, i.e. put all the deals from the dropped sector into a new sector?

My assumption is that we do the retrying in markets.

> do we have a test for that behaviour? Or can we write one, since that's the most useful outcome that folks are going to expect?

No, I'm assuming that markets are going to retry putting deals into a sector when they see that sealing failed (I'm not 100% sure if this is the current behavior, but I'm pretty sure it was at some point).

@aarshkshah1992 (Contributor) commented:

@magik6k Discussed this offline, but putting it here for the GitHub record.

There is no such code in Markets that detects when a deal is dropped from a sector and re-attempts adding it to another sector. In fact, I am not even sure there's a mechanism by which Markets can detect that.

@magik6k requested a review from a team on August 20, 2021 14:02
@arajasek (Contributor) left a comment:
SGWM, but what actually happens to these deals now?

@raulk (Member) left a comment:

I think this is just a first step, but it's not a complete solution as stated in comments above.

IIUC, by the point we get here, markets has already handed off the deal to the sealing subsystem, and responsibility has transferred.

AFAIK, there is no notification back to markets that a deal has been excluded from a sector a posteriori, or that the sector has been deleted altogether (as in this PR).

Even if there were, all that markets could do is call SectorAddPieceToAny again.

In the case we're addressing immediately, it seems like the piece transfer was interrupted, and the miner node ended up with a partial piece which, of course, doesn't hash to the expected PieceCID.

In other words, the handoff failed but nobody noticed until later. IMO we need to focus on making sure that failures on handoff are detected immediately.

The thing that complicates this is the io.Reader JSON-RPC encoder: it is incapable of knowing whether the transfer was sent in full or not!

Possible solution:

  1. Allow the caller of the JSON-RPC client to pass special readers with a Len() int64 method.
  2. Encoder: type-assert the io.Reader; if it has a Len() int64 method, call it to obtain the length of the transfer and set it in the Content-Length header of the POST/PUT.
  3. Decoder: get the content length and assert that io.Copy copies that many bytes. If not, declare the transfer failed, which should fail the original JSON-RPC invocation (thus the client, i.e. markets, would notice and retry immediately). A sketch follows below.
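For concreteness, here is a minimal Go sketch of that proposal. The LenReader interface and the send/receive helpers are hypothetical illustrations of the three steps above, not the actual go-jsonrpc code:

```go
package rpcutil

import (
	"fmt"
	"io"
	"net/http"
)

// LenReader is the optional interface the caller would implement (step 1).
type LenReader interface {
	io.Reader
	Len() int64
}

// sendBody sketches the encoder side (step 2): if the reader knows its
// length, advertise it so the receiving end can verify the transfer.
func sendBody(req *http.Request, r io.Reader) {
	req.Body = io.NopCloser(r)
	if lr, ok := r.(LenReader); ok {
		// Setting ContentLength makes net/http emit a Content-Length header.
		req.ContentLength = lr.Len()
	}
}

// receiveBody sketches the decoder side (step 3): copy the body and fail
// the call if fewer bytes arrived than the sender declared.
func receiveBody(w io.Writer, r *http.Request) error {
	n, err := io.Copy(w, r.Body)
	if err != nil {
		return err
	}
	// ContentLength is -1 when no length was declared; in that case a
	// partial transfer is still undetectable, which is the status quo.
	if r.ContentLength >= 0 && n != r.ContentLength {
		return fmt.Errorf("partial transfer: got %d of %d bytes", n, r.ContentLength)
	}
	return nil
}
```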

@magik6k merged commit 6c3acb8 into master on Aug 21, 2021
@magik6k deleted the fix/recoverdealids-loop branch on August 21, 2021 00:20
Labels: P2 (Should be resolved)

6 participants