Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow reset for workflow with corrupted history when possible #661

Merged
merged 1 commit into from
Aug 11, 2020

Conversation

vitarb
Copy link
Contributor

@vitarb vitarb commented Aug 6, 2020

This PR addresses #585 allowing resets on workflows with corrupted history given that reset point is located before the place where history corruption has occurred. If corruption occurs after the reset point then all valid batches will be carried out to the next workflow.

Reset behavior for workflows with normal history should remain unchanged.

@CLAassistant
Copy link

CLAassistant commented Aug 6, 2020

CLA assistant check
All committers have signed the CLA.

@vitarb vitarb requested review from mastermanu and removed request for alexshtin August 6, 2020 17:45
@@ -1073,6 +1073,8 @@ type (
MinEventID int64
// Get the history nodes upto MaxEventID. Exclusive.
MaxEventID int64
// If set, identifies event ID at which reset has to be performed.
ResetFromID int64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should introduce this concept of ResetFromID as part of ReadHistoryBranchRequest. This API is very generic and used from a lot of different places, and introducing ResetFromID as the parameter just makes it harder to use.
The decision to ignore events after reset point if history is corrupted is very specific to reset use case and should live in the reset code path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's a bit too specific, let's discuss alternatives offline.

@@ -408,83 +402,100 @@ func (m *historyV2ManagerImpl) readRawHistoryBranch(
func (m *historyV2ManagerImpl) readHistoryBranch(
byBatch bool,
request *ReadHistoryBranchRequest,
) ([]*historypb.HistoryEvent, []*historypb.History, []byte, int, int64, error) {
) ([]*historypb.HistoryEvent, []*historypb.History, []byte, int, int64, int64, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really need to convert the return params to a struct.

Copy link
Contributor Author

@vitarb vitarb Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an internal pass-through method that's used just in two places here, I'd probably not bother packing/unpacking it into the struct especially given that result goes directly into the struct and is not assigned to intermediate variables. Let me know if you feel strongly about this using a struct, in which case I will do it.

if batchErr != nil {
break
} else {
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we special casing empty event case and continue to load more batches?
I think we should treat empty event exactly in the same manner as other errors and not proceed further. This might change the behavior for existing semantics. Imagine the following situation:

  1. Read 10 batches successfully
  2. Read empty events batch
  3. Read 11th batch successfully

Previously the following sequence of events will result into this API returning an error back to the caller on step 2. Now it will return a successful response.

Copy link
Contributor Author

@vitarb vitarb Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just maintains previous behavior where we had a few cases for ignoring batches.
Specifically:

             if firstEvent.GetVersion() < token.LastEventVersion {
			// version decrease means the this batch are all stale events, we should skip
			logger.Info("Stale event batch with smaller version", tag.FirstEventVersion(firstEvent.GetVersion()), tag.TokenLastEventVersion(token.LastEventVersion))
			continue
		}
		if firstEvent.GetEventId() <= token.LastEventID {
			// we could see it because first batch of next page has a smaller txn_id
			logger.Info("Stale event batch with eventID", tag.WorkflowFirstEventID(firstEvent.GetEventId()), tag.TokenLastEventID(token.LastEventID))
			continue
		}

In the refactored logic these would return nil events and no error, which is handled exactly the same by continuing the loop.
Do you want to change the behavior in some way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the case which will now behave differently.

Code Before your changes

events, err := m.historySerializer.DeserializeBatchEvents(batch)
		if err != nil {
			return nil, nil, nil, 0, 0, err
		}
		if len(events) == 0 {
			logger.Error("Empty events in a batch")
			return nil, nil, nil, 0, 0, serviceerror.NewInternal(fmt.Sprintf("corrupted history event batch, empty events"))
		}

Based on the previous logic function would return an error immediately when it runs into an empty batch. Now with the new refactored implementation it will skip the empty batch and continue processing further events which will result into a successful response in the scenario I provided above.

This function is already quite tricky and I don't believe the refactored implementation buys us anything. I would rather leave the implementation as is. Considering we already moved the validation part to caller, all we need to do is return the events even in the case of failure and let the caller decide what to do with partial payload.

This is very core functionality and any slight change in semantics might result into very tricky bugs. I would not try to refactor this logic without understanding impact on all callers of this function.

Copy link
Contributor Author

@vitarb vitarb Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can revert the refactoring, however for the sake of correctness let's try to complete this argument here.
Iteration will stop with the new code too if the batch is empty as validation would return (nil, err) (line 455) and not (nil, nil), which in turn will result in breaking the loop (line 423) and returning an error to the caller (line 498).
So the only difference would be that previous code would return nil, nil, ..., err but now we'll return all previously collected events AND the error, which is exactly what we want here.
Although I agree that this refactoring doesn't serve any practical purpose after we moved reset-specific logic out of historyStore, I do find this version of code more readable as now functions have more clear responsibilities and both fit into one screen.

events, err := m.validateEventBatch(batch, token, request.MinEventID-1, logger)
if events == nil {
// There are three scenarios that we consider:
// 1. Error happens during reset and we've already passed the reset point. In this case we will skip current and all further batches returning events that we've collected so far.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] wrap comment as line is too long (at least in github it goes off the screen and I have to move to the left with my mouse)

// 1. Error happens during reset and we've already passed the reset point. In this case we will skip current and all further batches returning events that we've collected so far.
// 2. Error happens during non-reset flow or in the reset flow but before the reset point. In this case we return an error up the stack.
// 3. No error but events are nil. This means that batch needs to be ignored as out-of-sync.
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] can we do the if err != nil check right after line 424? That way it reduces overall indentation and it guards against a (admittedly unlikely) future scenario where error is not nil and somehow events is not nil. Then after the if err != nil block, you can have the if events == nil { continue; } block

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(ok I see you are translating empty to nil in the helper method, but I'd still use a length check for consistency if thats okay)


token.LastEventVersion = firstEvent.GetVersion()
token.LastEventID = lastEvent.GetEventId()
token.LastEventVersion = events[0].GetVersion()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated to your change, but very interesting that LastEventVersion is the "first event's" version. Naming here just seems off

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they've meant "last batch's first event" :/

tag.Counter(eventCount))
return nil, nil, nil, 0, 0, serviceerror.NewInternal(fmt.Sprintf("corrupted history event batch, eventID is not continuous"))
events, err := m.validateEventBatch(batch, token, request.MinEventID-1, logger)
if events == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do a len() = 0 check instead? Or is empty slice and nil slice different in this case? https://medium.com/@habibridho/golang-nil-vs-empty-slice-87fd51c0a4d

Copy link
Contributor Author

@vitarb vitarb Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would work too, but nil check is completely appropriate in this case as during validation we do:

	if len(events) == 0 {
		logger.Error("Empty events in a batch")
		return nil, serviceerror.NewInternal(fmt.Sprintf("corrupted history event batch, empty events"))
	}

So there is no way you'll be dealing with an actual empty slice here.

tag.LastEventVersion(lastEvent.GetVersion()), tag.WorkflowNextEventID(lastEvent.GetEventId()),
tag.Counter(eventCount))
// If the batch is corrupted and we have enough events to perform a reset then we'll just proceed with events that we've collected so far.
return nil, serviceerror.NewInternal(fmt.Sprintf("corrupted history event batch, wrong version and IDs"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] I wonder why we even need the fmt.Sprintf call here and other places

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, i wonder if its worth returning in the error the specific information. For instance:

versionMatch := firstEvent.GetVersion() != ...
idsContiguous := firstEvent.GetEventId()

if (!versionMatch || !idsContiguous) {
// ... (error message can print true/false details for both)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be some rudimentary left-over. Previously this code likely printed some attributes to the error, but now it doesn't make any sense. Let me clean that up.

return nil, nil
}
if firstEvent.GetEventId() <= token.LastEventID {
// we could see it because first batch of next page has a smaller txn_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't fully understand this, but we don't have to worry about overlap, right?

example: LastEventID is 10
batch contains 9,10,11,12

In this case, 11 and 12 are "beyond" the last event ID, but we are dropping anyway? I don't fully understand what's happening here though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the condition but don't understand circumstances under which it might occur, that's why I left it the untouched.

@@ -636,6 +636,7 @@ func (w *workflowResetorImpl) replayHistoryEvents(
MinEventID: common.FirstEventID,
// NOTE: read through history to the end so that we can keep the received signals
MaxEventID: prevMutableState.GetNextEventID(),
ResetFromID: workflowTaskFinishEventID,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated to your change, but looks like the file and struct were badly misspelled

// 3. No error but events are nil. This means that batch needs to be ignored as out-of-sync.
if err != nil {
if request.ResetFromID > 0 && request.ResetFromID <= token.LastEventID {
break // Case 1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do anything with the token fields in this case 1 (like set them to 0/nil or something) so that the nextPageToken is nil or w/e

@@ -1073,6 +1073,8 @@ type (
MinEventID int64
// Get the history nodes upto MaxEventID. Exclusive.
MaxEventID int64
// If set, identifies event ID at which reset has to be performed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, this is a "read" request, but a field called ResetFromID implies there is a "side-effect" as a result of executing the request, which seems weird. I wonder if there is better naming we can have here

// 1. nil, nil - batch should be completely ignored.
// 2. nil, error - batch is invalid.
// 3. events, nil - non-empty slice of events for a valid batch.
func (m *historyV2ManagerImpl) validateEventBatch(batch *serialization.DataBlob, token *historyV2PagingToken, defaultLastEventID int64, logger log.Logger) ([]*historypb.HistoryEvent, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is "defaultLastEventID" still accurate in this case? It looks like that value is fixed across all iterations that invoke this helper function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should be accurate and behavior of this part of the code didn't change. It's purpose is to identify if we are processing the first batch and if the first batch is in the middle. Check out the comment below for details.

}

return historyEvents, historyEventBatches, nextPageToken, dataSize, lastFirstEventID, nil
return historyEvents, historyEventBatches, nextPageToken, dataSize, lastFirstEventID, token.LastEventID, batchProcessingError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean to return "nextPageToken" if batchProcessingError is not-nil? Can this lead to some weird scenario from the caller perspective?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right, we should return nil token in case of an error, otherwise further iteration doesn't make any sense. Let me change that.

@@ -530,7 +531,8 @@ func (w *workflowResetorImpl) replayReceivedSignals(
for {
var readResp *persistence.ReadHistoryBranchByBatchResponse
readResp, err := w.eng.historyV2Mgr.ReadHistoryBranchByBatch(readReq)
if err != nil {
// Fail if we don't have enough events to perform the reset, otherwise continue with what we've got.
if err != nil && (readResp == nil || readResp.LastEventID < workflowTaskFinishEventID) {
Copy link
Member

@mastermanu mastermanu Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trying to figure out when readResp would actually be nil here. It seems we always initialize the struct and return a pointer to it, right? This is just an extra safety check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just a safety check.

@vitarb vitarb merged commit 04058a5 into temporalio:master Aug 11, 2020
@wxing1292
Copy link
Contributor

@samarabbas

history/workflowResetor.go is 2DC only
history/workflowResetter.go is nDC only

i would suggest first remove all 2DC related code since one of the purposes of this folk is to start clean

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants