[STRMHELP-315] Rollback on Failed Job Monitoring 🐛 #291

Merged 4 commits on May 17, 2023


sethsaperstein-lyft commented May 15, 2023

overview

The job monitoring PR introduced a bug: when job monitoring fails because of a timeout or a failed vertex, the application reaches the DeployFailed state instead of attempting to roll back. This PR simplifies the job submission and job monitoring logic, and it makes the job attempt a rollback in those cases.

additional info

Errors returned by a state in the state machine are added to the status as the last error. The shouldRollback check at the beginning of these states tests whether that last error is retryable and, if it is not, moves the application to rolling back. The change is therefore to return an error when monitoring finds a failed vertex or a vertex timeout.
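The flow described above can be sketched roughly as follows. This is a minimal illustration, not the operator's code: `App`, `Phase`, `step`, `retryableError`, and the handler functions are hypothetical names, and only the idea of recording the last error and checking it in `shouldRollback` comes from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase is a hypothetical stand-in for the application's state machine phase.
type Phase string

const (
	PhaseSubmittingJob Phase = "SubmittingJob"
	PhaseRollingBack   Phase = "RollingBack"
	PhaseRunning       Phase = "Running"
)

// retryableError marks failures that are worth retrying (assumed type).
type retryableError struct{ msg string }

func (e *retryableError) Error() string { return e.msg }

// App holds only the fields this sketch needs.
type App struct {
	Phase     Phase
	LastError error
}

// shouldRollback reports whether the recorded last error rules out a retry.
func shouldRollback(app *App) bool {
	if app.LastError == nil {
		return false
	}
	var r *retryableError
	return !errors.As(app.LastError, &r)
}

// step runs one pass of a state: check for rollback first, then run the
// handler and record any error it returns as the last error.
func step(app *App, handler func() error) {
	if shouldRollback(app) {
		app.Phase = PhaseRollingBack
		return
	}
	if err := handler(); err != nil {
		app.LastError = err
		return
	}
	app.LastError = nil
	app.Phase = PhaseRunning
}

// Example handlers (hypothetical).
func failedVertexHandler() error { return errors.New("vertex failed") }
func transientHandler() error    { return &retryableError{msg: "temporary API error"} }
func succeedingHandler() error   { return nil }

func main() {
	app := &App{Phase: PhaseSubmittingJob}
	step(app, failedVertexHandler) // records the non-retryable error
	step(app, failedVertexHandler) // next pass sees it and rolls back
	fmt.Println(app.Phase)
}
```

Under these assumptions, the rollback happens on the pass after the error is returned, which is why returning an error from monitoring (rather than setting a failed state directly) is enough to trigger it.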

@sethsaperstein-lyft
Contributor Author

/PTAL @maghamravi @anandswaminathan

}
return updateJobAndReturn(ctx, job, s, allVerticesRunning, app, hash)
logger.Info(ctx, "Monitoring job vertices with timeout ", flinkJobVertexTimeout)
jobStarted, err := monitorJobStart(job, flinkJobVertexTimeout)
Contributor

nit: it would be good to rename the method to monitorJobSubmission and jobStarted to status

Contributor Author

I agree with monitorJobSubmission. Cleaner.

I don't fully understand why jobStarted should be renamed to status. I believe jobStarted describes more intuitively what monitorJobStart actually returns: if not all vertices are running, jobStarted is false; if all vertices are running, jobStarted is true; if any vertex has failed, it returns an error.

Unless you're suggesting that status should be a string rather than a bool and correspond to something like "NOT_STARTED", "STARTED". Can you clarify status and why it should be status?
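For reference, the return contract described in this comment can be sketched as below. This is only an illustration under stated assumptions: `Vertex`, `JobOverview`, the status constants, and the `elapsed` parameter are simplified, hypothetical stand-ins and not the actual client types or the signature used in the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Vertex status values assumed for this sketch.
const (
	StatusRunning   = "RUNNING"
	StatusFailed    = "FAILED"
	StatusScheduled = "SCHEDULED"
)

// Vertex and JobOverview are simplified, hypothetical stand-ins.
type Vertex struct {
	Name   string
	Status string
}

type JobOverview struct {
	Vertices []Vertex
}

var (
	errVertexFailed  = errors.New("vertex failed")
	errVertexTimeout = errors.New("vertex did not start before the timeout")
)

// monitorJobStart returns true when every vertex is running, false when some
// are still starting, and an error when a vertex has failed or when vertices
// are still not running after the timeout has elapsed.
func monitorJobStart(job *JobOverview, elapsed, timeout time.Duration) (bool, error) {
	if len(job.Vertices) == 0 {
		return false, nil // nothing scheduled yet (assumption)
	}
	allRunning := true
	for _, v := range job.Vertices {
		switch v.Status {
		case StatusRunning:
			// this vertex is up; check the next one
		case StatusFailed:
			return false, fmt.Errorf("vertex %q: %w", v.Name, errVertexFailed)
		default:
			allRunning = false
			if elapsed > timeout {
				return false, fmt.Errorf("vertex %q: %w", v.Name, errVertexTimeout)
			}
		}
	}
	return allRunning, nil
}

func main() {
	job := &JobOverview{Vertices: []Vertex{
		{Name: "source", Status: StatusRunning},
		{Name: "sink", Status: StatusScheduled},
	}}
	jobStarted, err := monitorJobStart(job, 30*time.Second, 3*time.Minute)
	fmt.Println(jobStarted, err)
}
```

With a contract like this, the caller only has to distinguish "keep waiting" (false, nil) from "started" (true, nil), and any error is left for the state machine to record.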

Contributor

There are only two hard things in Computer Science: cache invalidation and naming things :)
My rationale for renaming jobStarted to status was primarily due to my other suggestion to rename the method to monitorJobSubmission. Given that the method was returning a boolean, status felt more natural. Okay to keep it as jobStarted.

// wait until all vertices have been scheduled and running
hasFailure := false
failedVertexIndex := -1
func monitorJobStart(job *client.FlinkJobOverview, timeout config2.Duration) (bool, error) {
Contributor

Nice to see this method being succinct now!

maghamravi previously approved these changes May 16, 2023
sethsaperstein-lyft merged commit bea4e54 into master May 17, 2023
sethsaperstein-lyft deleted the STRMHELP-315_monitor_state_fix branch May 17, 2023 21:09