
sealing: Fix RecoverDealIDs loop with changed PieceCID #7117

Merged: 1 commit merged into master from fix/recoverdealids-loop on Aug 21, 2021

Conversation

@magik6k (Contributor) commented Aug 18, 2021

(note: 95% of the diff is testing; the bugfix is the few lines changed at the bottom of states_failed.go)

I'm not exactly sure what has to happen to make the FSM put the wrong piece into a sector, but this happened on my miner after it crashed with OOM while processing a whole bunch of deals, so it's probably a rather unlikely edge-case.

One likely explanation is that a call to AddPiece didn't add the correct data (e.g. the reader stream dropped early), which got the wrong PieceCID into SectorInfo.Pieces[].Piece.PieceCID. If that's the case (which I pretty strongly suspect it is), we can only do one of two things:

  1. Drop the piece from the sector (but that's rather complicated, requires new storage methods, and will require non-trivial shuffling of data around in sectors to fix padding if the removed piece was not at the end of the sector)

  2. Remove the sector

  • Given that we'll detect this before even attempting PC1 (we call checkPieces as the first step), we don't lose much work (only the work of putting deals into the sector)
  • Deals which fail because of being put into that sector should auto-retry adding to a new sector, so effectively we end up doing option 1 by reassembling the sector from scratch (see the sketch below)
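For illustration, here is a minimal, self-contained sketch of the decision the fix makes in states_failed.go. All names below are simplified stand-ins for the real lotus types and events (cid.Cid, SectorInfo, CurrentDealInfoManager, the SectorRemove FSM event), not the actual implementation:

```go
package sealing

import "fmt"

// Cid stands in for cid.Cid; Piece and SectorInfo stand in for the real
// lotus sealing types.
type Cid string

type Piece struct {
	PieceCID Cid
	DealID   uint64 // 0 means a padding piece in this simplified model
}

type SectorInfo struct {
	SectorNumber uint64
	Pieces       []Piece
}

// DealInfoLookup stands in for CurrentDealInfoManager.GetCurrentDealInfo,
// which returns the PieceCID recorded in the on-chain deal proposal.
type DealInfoLookup func(dealID uint64) (Cid, error)

// recoverDealIDs mirrors the decision in states_failed.go: if any piece's
// locally recorded PieceCID doesn't match the on-chain deal proposal, the
// sector can't be fixed in place, so it is removed (option 2 above) instead
// of looping in RecoverDealIDs forever.
func recoverDealIDs(sector SectorInfo, lookup DealInfoLookup) (removeSector bool, err error) {
	mismatched := 0
	for _, p := range sector.Pieces {
		if p.DealID == 0 {
			continue // padding piece, nothing to verify
		}
		onChain, err := lookup(p.DealID)
		if err != nil {
			return false, fmt.Errorf("looking up deal %d: %w", p.DealID, err)
		}
		if onChain != p.PieceCID {
			mismatched++
		}
	}
	if mismatched > 0 {
		// Before this fix the FSM would loop here; now we give up on the
		// sector and let its deals be re-added to a fresh one.
		return true, fmt.Errorf("sector %d: %d piece(s) with mismatched PieceCID, removing sector",
			sector.SectorNumber, mismatched)
	}
	return false, nil
}
```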

@magik6k requested a review from a team as a code owner on August 18, 2021 11:01
@neondragon commented Aug 18, 2021

"rather unlikely edge-case" -- I have 7 sectors and 63 deals stuck with RecoverDealIDs on f019551 (1.11.1-m1.3.5+mainnet+git.7be207bc5.dirty+api1.2.0).

@jennijuju (Member) commented:

I'd consider getting this into a lotus v1.11.2-rc if possible.

@jennijuju added this to the v1.11.2 milestone on Aug 18, 2021
@raulk (Member) left a comment:

I'm not super familiar with this part of the codebase. I think @aarshkshah1992 can provide a better review here.

  • If what we want to do is remove the sector if we find a piece CID mismatch, then this looks good.

  • Would the deals originally packed into that sector be reassigned to a new sector automatically? If so, should we write a test for that?

@@ -114,7 +116,7 @@ type Sealing struct {
 commiter *CommitBatcher

 getConfig GetSealingConfigFunc
-dealInfo *CurrentDealInfoManager
+DealInfo *CurrentDealInfoManager
A Member commented on this diff:

ubernit: move public fields to the top.

@aarshkshah1992 (Contributor) commented:

@jennijuju @magik6k This should close #7103.

@aarshkshah1992 (Contributor) commented Aug 19, 2021

@magik6k

> Deals which fail because of being put into that sector should auto-retry adding to a new sector, so effectively we end up doing option 1 by reassembling the sector from scratch

I see we are completely dropping the sector with the bad pieces. But where is the code to do the above, i.e. put all the deals from the dropped sector into a new sector?

@raulk (Member) commented Aug 19, 2021

@aarshkshah1992, from the PR description:

> Deals which fail because of being put into that sector should auto-retry adding to a new sector, so effectively we end up doing option 1 by reassembling the sector from scratch

^^ @magik6k do we have a test for that behaviour? Or can we write one, since that's the most useful outcome that folks are going to expect?

@magik6k closed this Aug 19, 2021
@magik6k reopened this Aug 19, 2021
@magik6k (Contributor, Author) commented Aug 19, 2021

(misclick on the close button)

> I see we are completely dropping the sector with the bad pieces. But where is the code to do the above, i.e. put all the deals from the dropped sector into a new sector?

My assumption is that we do the retrying in markets.

> do we have a test for that behaviour? Or can we write one, since that's the most useful outcome that folks are going to expect?

No, I'm assuming that markets are going to retry putting deals into a sector when they see that sealing failed (I'm not 100% sure if this is the current behavior, but I'm pretty sure it was at some point).

@aarshkshah1992 (Contributor) commented:

@magik6k Discussed this offline, but putting it here for the GitHub record.

There is no such code in Markets that detects when a deal is dropped from a sector and re-attempts adding it to another sector. In fact, I am not even sure there's a mechanism by which Markets can detect that.

@magik6k requested a review from a team on August 20, 2021 14:02
@arajasek (Contributor) left a comment:
SGWM, but what actually happens to these deals now?

@raulk (Member) left a comment:

I think this is just a first step, but it's not a complete solution as stated in comments above.

IIUC, by the point we get here, markets has already handed off the deal to the sealing subsystem, and responsibility has transferred.

AFAIK, there is no notification back to markets that a deal has been excluded from a sector a posteriori, or that the sector has been deleted altogether (as in this PR).

Even if there were, all that markets could do is call SectorAddPieceToAny again.

In the case we're addressing immediately, it seems like the piece transfer was interrupted, and the miner node ended up with a partial piece which, of course, doesn't hash to the expected PieceCID.

In other words, the handoff failed but nobody noticed until later. IMO we need to focus on making sure that failures on handoff are detected immediately.

The thing that complicates this is the io.Reader JSON-RPC encoder: it is incapable of knowing whether the transfer was sent in full or not!

Possible solution:

  1. Allow the caller of the JSON-RPC client to pass special readers with a Len() int64 method.
  2. Encoder: type-assert the io.Reader; if it has a Len() int64 method, call it to obtain the length of the transfer and set it in the Content-Length header of the POST/PUT.
  3. Decoder: get the content length and assert that io.Copy copies that many bytes. If not, declare the transfer failed, which should fail the original JSON-RPC invocation (thus the client, i.e. markets, would notice and retry immediately). A sketch follows below.
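For concreteness, here is a minimal Go sketch of that proposal. The LenReader interface and the send/receive helpers are hypothetical illustrations of the three steps above, not the actual go-jsonrpc code:

```go
package rpcutil

import (
	"fmt"
	"io"
	"net/http"
)

// LenReader is the optional interface the caller would implement (step 1).
type LenReader interface {
	io.Reader
	Len() int64
}

// sendBody sketches the encoder side (step 2): if the reader knows its
// length, advertise it so the receiving end can verify the transfer.
func sendBody(req *http.Request, r io.Reader) {
	req.Body = io.NopCloser(r)
	if lr, ok := r.(LenReader); ok {
		// Setting ContentLength makes net/http emit a Content-Length header.
		req.ContentLength = lr.Len()
	}
}

// receiveBody sketches the decoder side (step 3): copy the body and fail
// the call if fewer bytes arrived than the sender declared.
func receiveBody(w io.Writer, r *http.Request) error {
	n, err := io.Copy(w, r.Body)
	if err != nil {
		return err
	}
	// ContentLength is -1 when no length was declared; in that case a
	// partial transfer is still undetectable, which is the status quo.
	if r.ContentLength >= 0 && n != r.ContentLength {
		return fmt.Errorf("partial transfer: got %d of %d bytes", n, r.ContentLength)
	}
	return nil
}
```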

@magik6k merged commit 6c3acb8 into master on Aug 21, 2021
@magik6k deleted the fix/recoverdealids-loop branch on August 21, 2021 00:20
Labels: P2 (Should be resolved)

6 participants