Allow reset for workflow with corrupted history when possible #661

vitarb · 2020-08-06T05:04:26Z

This PR addresses #585 by allowing resets on workflows with corrupted history, provided that the reset point is located before the place where the corruption occurred. If corruption occurs after the reset point, all valid batches will be carried over to the next workflow.

Reset behavior for workflows with normal history should remain unchanged.

CLAassistant · 2020-08-06T05:04:31Z

All committers have signed the CLA.

samarabbas · 2020-08-06T20:59:59Z

common/persistence/dataInterfaces.go

@@ -1073,6 +1073,8 @@ type (
 		MinEventID int64
 		// Get the history nodes upto MaxEventID.  Exclusive.
 		MaxEventID int64
+		// If set, identifies event ID at which reset has to be performed.
+		ResetFromID int64


I don't think we should introduce this concept of ResetFromID as part of ReadHistoryBranchRequest. This API is very generic and used from a lot of different places, and introducing ResetFromID as the parameter just makes it harder to use.
The decision to ignore events after the reset point if history is corrupted is very specific to the reset use case and should live in the reset code path.

I agree that it's a bit too specific, let's discuss alternatives offline.

samarabbas · 2020-08-07T00:15:04Z

common/persistence/historyStore.go

@@ -408,83 +402,100 @@ func (m *historyV2ManagerImpl) readRawHistoryBranch(
 func (m *historyV2ManagerImpl) readHistoryBranch(
 	byBatch bool,
 	request *ReadHistoryBranchRequest,
-) ([]*historypb.HistoryEvent, []*historypb.History, []byte, int, int64, error) {
+) ([]*historypb.HistoryEvent, []*historypb.History, []byte, int, int64, int64, error) {


We really need to convert the return params to a struct.

This is an internal pass-through method that's used in just two places here. I'd probably not bother packing/unpacking it into a struct, especially given that the result goes directly into a struct and is not assigned to intermediate variables. Let me know if you feel strongly about using a struct, in which case I will do it.
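
For illustration, a minimal sketch of what packing the return values into a struct could look like. The type name `readHistoryBranchResult`, its field names, and the simplified element types are assumptions made for this example, not code from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// readHistoryBranchResult is a hypothetical struct grouping the values that
// readHistoryBranch returns positionally. Event and batch types are
// simplified to strings to keep the sketch self-contained.
type readHistoryBranchResult struct {
	Events           []string
	Batches          [][]string
	NextPageToken    []byte
	Size             int
	LastFirstEventID int64
	LastEventID      int64
}

// readHistoryBranch stands in for the real method. With named fields, adding
// one more value does not touch every return statement and call site.
func readHistoryBranch(fail bool) (*readHistoryBranchResult, error) {
	res := &readHistoryBranchResult{
		Events:           []string{"e1", "e2"},
		Batches:          [][]string{{"e1", "e2"}},
		Size:             2,
		LastFirstEventID: 1,
		LastEventID:      2,
	}
	if fail {
		// A partial result can still travel alongside the error.
		return res, errors.New("corrupted history event batch")
	}
	return res, nil
}

func main() {
	res, err := readHistoryBranch(true)
	fmt.Println(len(res.Events), res.LastEventID, err)
}
```

The trade-off raised in the reply still applies: for a helper with two call sites, the struct mostly buys readability of the signature.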

samarabbas · 2020-08-07T00:37:22Z

common/persistence/historyStore.go

+			if batchErr != nil {
+				break
+			} else {
+				continue


Why are we special-casing the empty events case and continuing to load more batches?
I think we should treat an empty events batch in exactly the same manner as other errors and not proceed further. This might change the existing semantics. Imagine the following situation:

1. Read 10 batches successfully
2. Read empty events batch
3. Read 11th batch successfully

Previously, this sequence of events would result in this API returning an error to the caller at step 2. Now it will return a successful response.

This just maintains previous behavior where we had a few cases for ignoring batches.
Specifically:

	if firstEvent.GetVersion() < token.LastEventVersion {
		// version decrease means the this batch are all stale events, we should skip
		logger.Info("Stale event batch with smaller version", tag.FirstEventVersion(firstEvent.GetVersion()), tag.TokenLastEventVersion(token.LastEventVersion))
		continue
	}
	if firstEvent.GetEventId() <= token.LastEventID {
		// we could see it because first batch of next page has a smaller txn_id
		logger.Info("Stale event batch with eventID", tag.WorkflowFirstEventID(firstEvent.GetEventId()), tag.TokenLastEventID(token.LastEventID))
		continue
	}

In the refactored logic these would return nil events and no error, which is handled exactly the same by continuing the loop.
Do you want to change the behavior in some way?

Here is the case which will now behave differently.

Code before your changes:

	events, err := m.historySerializer.DeserializeBatchEvents(batch)
	if err != nil {
		return nil, nil, nil, 0, 0, err
	}
	if len(events) == 0 {
		logger.Error("Empty events in a batch")
		return nil, nil, nil, 0, 0, serviceerror.NewInternal(fmt.Sprintf("corrupted history event batch, empty events"))
	}

Based on the previous logic, the function would return an error immediately when it ran into an empty batch. With the new refactored implementation it will skip the empty batch and continue processing further events, which will result in a successful response in the scenario I provided above.

This function is already quite tricky and I don't believe the refactored implementation buys us anything. I would rather leave the implementation as is. Considering we already moved the validation part to the caller, all we need to do is return the events even in the case of failure and let the caller decide what to do with the partial payload.

This is very core functionality and any slight change in semantics might result in very tricky bugs. I would not try to refactor this logic without understanding the impact on all callers of this function.

We can revert the refactoring; however, for the sake of correctness let's try to complete this argument here.
With the new code, iteration will also stop if the batch is empty, as validation would return (nil, err) (line 455) and not (nil, nil), which in turn will result in breaking the loop (line 423) and returning an error to the caller (line 498).
So the only difference would be that the previous code would return nil, nil, ..., err, whereas now we'll return all previously collected events AND the error, which is exactly what we want here.
Although I agree that this refactoring doesn't serve any practical purpose after we moved the reset-specific logic out of historyStore, I do find this version of the code more readable, as the functions now have clearer responsibilities and both fit on one screen.
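
To make the argument above concrete, here is a simplified sketch of the loop semantics being described: a validation helper with three outcomes, and a reader that stops at the first invalid batch but still hands back what it has collected. The `batch` type, `validate`, and `readAll` are hypothetical stand-ins; the real code works on serialized blobs and a paging token.

```go
package main

import (
	"errors"
	"fmt"
)

// batch is a simplified stand-in for a deserialized event batch.
type batch struct {
	stale  bool
	events []int64
}

// validate mirrors the three outcomes discussed in the thread:
// (nil, nil) skip the batch, (nil, err) invalid batch, (events, nil) valid.
func validate(b batch) ([]int64, error) {
	if len(b.events) == 0 {
		return nil, errors.New("corrupted history event batch, empty events")
	}
	if b.stale {
		return nil, nil
	}
	return b.events, nil
}

// readAll collects events until the first invalid batch, then returns the
// events gathered so far together with the error.
func readAll(batches []batch) ([]int64, error) {
	var collected []int64
	for _, b := range batches {
		events, err := validate(b)
		if err != nil {
			return collected, err
		}
		if len(events) == 0 {
			continue // stale batch, ignore it
		}
		collected = append(collected, events...)
	}
	return collected, nil
}

func main() {
	events, err := readAll([]batch{
		{events: []int64{1, 2}},
		{stale: true, events: []int64{1, 2}},
		{events: nil}, // corrupted: empty batch
		{events: []int64{3, 4}},
	})
	fmt.Println(events, err)
}
```

In this sketch the caller receives events 1 and 2 plus the error; the batch after the empty one is never read, so the empty batch is not silently skipped.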

mastermanu · 2020-08-06T20:09:41Z

common/persistence/historyStore.go

+		events, err := m.validateEventBatch(batch, token, request.MinEventID-1, logger)
+		if events == nil {
+			// There are three scenarios that we consider:
+			// 1. Error happens during reset and we've already passed the reset point. In this case we will skip current and all further batches returning events that we've collected so far.


[Nit] wrap comment as line is too long (at least in github it goes off the screen and I have to move to the left with my mouse)

mastermanu · 2020-08-06T20:13:59Z

common/persistence/historyStore.go

+			// 1. Error happens during reset and we've already passed the reset point. In this case we will skip current and all further batches returning events that we've collected so far.
+			// 2. Error happens during non-reset flow or in the reset flow but before the reset point. In this case we return an error up the stack.
+			// 3. No error but events are nil. This means that batch needs to be ignored as out-of-sync.
+			if err != nil {


[Nit] can we do the if err != nil check right after line 424? That way it reduces overall indentation and guards against an (admittedly unlikely) future scenario where the error is not nil and somehow events is not nil either. Then, after the if err != nil block, you can have the if events == nil { continue; } block

(ok I see you are translating empty to nil in the helper method, but I'd still use a length check for consistency if that's okay)

mastermanu · 2020-08-06T20:15:35Z

common/persistence/historyStore.go

-
-		token.LastEventVersion = firstEvent.GetVersion()
-		token.LastEventID = lastEvent.GetEventId()
+		token.LastEventVersion = events[0].GetVersion()


unrelated to your change, but very interesting that LastEventVersion is the "first event's" version. Naming here just seems off

I think they meant "last batch's first event" :/

mastermanu · 2020-08-06T20:16:27Z

common/persistence/historyStore.go

-					tag.Counter(eventCount))
-				return nil, nil, nil, 0, 0, serviceerror.NewInternal(fmt.Sprintf("corrupted history event batch, eventID is not continuous"))
+		events, err := m.validateEventBatch(batch, token, request.MinEventID-1, logger)
+		if events == nil {


can we do a len() == 0 check instead? Or are an empty slice and a nil slice different in this case? https://medium.com/@habibridho/golang-nil-vs-empty-slice-87fd51c0a4d

That would work too, but nil check is completely appropriate in this case as during validation we do:

	if len(events) == 0 {
		logger.Error("Empty events in a batch")
		return nil, serviceerror.NewInternal(fmt.Sprintf("corrupted history event batch, empty events"))
	}

So there is no way you'll be dealing with an actual empty slice here.
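
The distinction behind this question can be shown with a short sketch. In Go, a nil slice and an empty non-nil slice both have length zero, but only the nil slice compares equal to nil, so `len(events) == 0` covers both cases while `events == nil` covers one. The helper name `classify` is made up for this example.

```go
package main

import "fmt"

// classify reports how the two checks from the discussion treat a slice.
func classify(events []int64) (isNil bool, isEmpty bool) {
	return events == nil, len(events) == 0
}

func main() {
	var nilSlice []int64
	emptySlice := []int64{}

	isNil, isEmpty := classify(nilSlice)
	fmt.Println("nil slice:  ", isNil, isEmpty)

	isNil, isEmpty = classify(emptySlice)
	fmt.Println("empty slice:", isNil, isEmpty)

	isNil, isEmpty = classify([]int64{1})
	fmt.Println("one element:", isNil, isEmpty)
}
```

Because the helper in this PR converts the empty case into `(nil, err)`, both checks behave the same here; the length check is simply the more defensive of the two.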

mastermanu · 2020-08-06T20:24:57Z

common/persistence/historyStore.go

+			tag.LastEventVersion(lastEvent.GetVersion()), tag.WorkflowNextEventID(lastEvent.GetEventId()),
+			tag.Counter(eventCount))
+		// If the batch is corrupted and we have enough events to perform a reset then we'll just proceed with events that we've collected so far.
+		return nil, serviceerror.NewInternal(fmt.Sprintf("corrupted history event batch, wrong version and IDs"))


[Nit] I wonder why we even need the fmt.Sprintf call here and other places

also, I wonder if it's worth returning the specific information in the error. For instance:

versionMatch := firstEvent.GetVersion() != ...
idsContiguous := firstEvent.GetEventId()

if (!versionMatch || !idsContiguous) {
// ... (error message can print true/false details for both)
}

It must be a vestigial leftover. Previously this code likely printed some attributes in the error, but now it doesn't make any sense. Let me clean that up.
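
A sketch of the suggestion above, reporting which check failed. The function name `checkBatchBounds` and its parameters are hypothetical, and the two conditions (first and last event share a version; the event IDs span exactly `count` consecutive values) are an assumption about what the surrounding validation verifies, since the diff excerpt does not show them. `fmt.Errorf` is used instead of `serviceerror.NewInternal` only to keep the example self-contained.

```go
package main

import "fmt"

// checkBatchBounds returns an error naming the failed checks, or nil.
// The conditions are assumed for illustration, not copied from the PR.
func checkBatchBounds(firstVersion, lastVersion, firstID, lastID int64, count int) error {
	versionMatch := firstVersion == lastVersion
	idsContiguous := firstID+int64(count-1) == lastID
	if !versionMatch || !idsContiguous {
		return fmt.Errorf(
			"corrupted history event batch: versionMatch=%t idsContiguous=%t",
			versionMatch, idsContiguous,
		)
	}
	return nil
}

func main() {
	fmt.Println(checkBatchBounds(1, 1, 5, 7, 3)) // consistent batch
	fmt.Println(checkBatchBounds(1, 2, 5, 9, 3)) // both checks fail
}
```

This also shows where a formatting call earns its place: once the message carries arguments, `fmt.Sprintf` or `fmt.Errorf` is needed; for a constant string it is not.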

mastermanu · 2020-08-06T20:46:53Z

common/persistence/historyStore.go

+		return nil, nil
+	}
+	if firstEvent.GetEventId() <= token.LastEventID {
+		// we could see it because first batch of next page has a smaller txn_id


don't fully understand this, but we don't have to worry about overlap, right?

example: LastEventID is 10
batch contains 9,10,11,12

In this case, 11 and 12 are "beyond" the last event ID, but we are dropping anyway? I don't fully understand what's happening here though

I think I understand the condition but don't understand the circumstances under which it might occur, which is why I left it untouched.
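
The example in the question can be traced with a small sketch of the condition as it appears in the diff. `pagingToken` and `isStaleByEventID` are simplified, hypothetical stand-ins; only the comparison itself is taken from the code under review.

```go
package main

import "fmt"

// pagingToken keeps only the field relevant to this check.
type pagingToken struct {
	LastEventID int64
}

// isStaleByEventID applies the check from the diff: only the first event ID
// of the batch is compared with the token, so a batch is kept or dropped as
// a whole.
func isStaleByEventID(batchEventIDs []int64, token pagingToken) bool {
	return len(batchEventIDs) > 0 && batchEventIDs[0] <= token.LastEventID
}

func main() {
	token := pagingToken{LastEventID: 10}
	fmt.Println(isStaleByEventID([]int64{9, 10, 11, 12}, token))
	fmt.Println(isStaleByEventID([]int64{11, 12}, token))
}
```

Under this check the batch 9..12 is dropped entirely, including events 11 and 12. Whether such an overlapping batch can occur in practice is not settled in this thread.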

mastermanu · 2020-08-06T20:50:52Z

service/history/workflowResetor.go

@@ -636,6 +636,7 @@ func (w *workflowResetorImpl) replayHistoryEvents(
 		MinEventID:  common.FirstEventID,
 		// NOTE: read through history to the end so that we can keep the received signals
 		MaxEventID:    prevMutableState.GetNextEventID(),
+		ResetFromID:   workflowTaskFinishEventID,


unrelated to your change, but looks like the file and struct were badly misspelled

mastermanu · 2020-08-06T20:58:20Z

common/persistence/historyStore.go

+			// 3. No error but events are nil. This means that batch needs to be ignored as out-of-sync.
+			if err != nil {
+				if request.ResetFromID > 0 && request.ResetFromID <= token.LastEventID {
+					break // Case 1.


do we need to do anything with the token fields in case 1 (like setting them to 0/nil or something) so that the nextPageToken is nil or whatever?

mastermanu · 2020-08-06T21:02:12Z

common/persistence/dataInterfaces.go

@@ -1073,6 +1073,8 @@ type (
 		MinEventID int64
 		// Get the history nodes upto MaxEventID.  Exclusive.
 		MaxEventID int64
+		// If set, identifies event ID at which reset has to be performed.


in general, this is a "read" request, but a field called ResetFromID implies there is a "side-effect" as a result of executing the request, which seems weird. I wonder if there is better naming we can have here

mastermanu · 2020-08-06T21:04:20Z

common/persistence/historyStore.go

+// 1. nil, nil - batch should be completely ignored.
+// 2. nil, error - batch is invalid.
+// 3. events, nil - non-empty slice of events for a valid batch.
+func (m *historyV2ManagerImpl) validateEventBatch(batch *serialization.DataBlob, token *historyV2PagingToken, defaultLastEventID int64, logger log.Logger) ([]*historypb.HistoryEvent, error) {


is "defaultLastEventID" still accurate in this case? It looks like that value is fixed across all iterations that invoke this helper function

Yes, it should be accurate, and the behavior of this part of the code didn't change. Its purpose is to identify whether we are processing the first batch and whether the first batch is in the middle. Check out the comment below for details.

common/persistence/historyStore.go

mastermanu · 2020-08-07T21:32:15Z

common/persistence/historyStore.go

 	}

-	return historyEvents, historyEventBatches, nextPageToken, dataSize, lastFirstEventID, nil
+	return historyEvents, historyEventBatches, nextPageToken, dataSize, lastFirstEventID, token.LastEventID, batchProcessingError


What does it mean to return "nextPageToken" if batchProcessingError is non-nil? Can this lead to some weird scenario from the caller's perspective?

I think you are right: we should return a nil token in case of an error, otherwise further iteration doesn't make any sense. Let me change that.
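
A sketch of the caller-side concern: a typical pagination loop keeps going while the token is non-empty, so a page that carries both an error and a token leaves the caller unsure whether to continue. `page`, `readPage`, and `readAllPages` are hypothetical names; the fix described above corresponds to clearing the token whenever an error is set.

```go
package main

import (
	"errors"
	"fmt"
)

// page is a simplified response: some events plus a continuation token.
type page struct {
	events    []int64
	nextToken []byte
}

// readPage simulates a two-page history whose second page is corrupted.
func readPage(token []byte) (page, error) {
	if len(token) == 0 {
		return page{events: []int64{1, 2}, nextToken: []byte("p2")}, nil
	}
	// On error the partial events are returned, but the token is cleared so
	// that callers do not attempt to read past the corruption.
	return page{events: []int64{3}, nextToken: nil}, errors.New("corrupted history event batch")
}

// readAllPages loops until the token is empty or an error is reported.
func readAllPages() ([]int64, error) {
	var all []int64
	var token []byte
	for {
		p, err := readPage(token)
		all = append(all, p.events...)
		if err != nil {
			return all, err
		}
		if len(p.nextToken) == 0 {
			return all, nil
		}
		token = p.nextToken
	}
}

func main() {
	events, err := readAllPages()
	fmt.Println(events, err)
}
```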

mastermanu · 2020-08-07T21:35:32Z

service/history/workflowResetor.go

@@ -530,7 +531,8 @@ func (w *workflowResetorImpl) replayReceivedSignals(
 		for {
 			var readResp *persistence.ReadHistoryBranchByBatchResponse
 			readResp, err := w.eng.historyV2Mgr.ReadHistoryBranchByBatch(readReq)
-			if err != nil {
+			// Fail if we don't have enough events to perform the reset, otherwise continue with what we've got.
+			if err != nil && (readResp == nil || readResp.LastEventID < workflowTaskFinishEventID) {


trying to figure out when readResp would actually be nil here. It seems we always initialize the struct and return a pointer to it, right? This is just an extra safety check?

Yes, just a safety check.

wxing1292 · 2020-08-11T07:27:57Z

@samarabbas

history/workflowResetor.go is 2DC only
history/workflowResetter.go is nDC only

I would suggest first removing all 2DC-related code, since one of the purposes of this fork is to start clean.

vitarb requested review from samarabbas and alexshtin August 6, 2020 05:04

vitarb requested review from mastermanu and removed request for alexshtin August 6, 2020 17:45

samarabbas reviewed Aug 6, 2020

samarabbas requested changes Aug 7, 2020

mastermanu reviewed Aug 7, 2020

mastermanu reviewed Aug 7, 2020

mastermanu approved these changes Aug 7, 2020

samarabbas approved these changes Aug 10, 2020

Allow reset for workflow with corrupted history when possible

2c43e38

vitarb force-pushed the reset-with-corrupted-hist branch from 42d8ee2 to 2c43e38 Compare August 10, 2020 23:00

vitarb merged commit 04058a5 into temporalio:master Aug 11, 2020

vitarb mentioned this pull request Aug 11, 2020

Unable to reset Workflow with hole in history events #585

Closed
