Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[STRMCMP-1658] Enable Deploys to Fallback without State #287

Merged
merged 26 commits into from
May 5, 2023

Conversation

sethsaperstein-lyft
Copy link
Contributor

@sethsaperstein-lyft sethsaperstein-lyft commented Apr 17, 2023

overview

Add CRD parameter fallbackWithoutState to be enabled to allow deploys to occur without a savepoint or checkpoint in the case that the savepoint fails and there are no usable externalized checkpoints.

In the case that the savepoint and checkpoints are failing, this has a similar effect as savepointDisabled. The benefit of this configuration option is that it only exhibits this behavior when there is no other path forward and thus acts as a fallback without potential intervention. This is disabled by default.

The state graph does not change, but rather, the recovering state that attempts to find an externalized checkpoint forwards to submittingJob in the case that this is enabled and there is no externalized checkpoint

@sethsaperstein-lyft
Copy link
Contributor Author

/PTAL @premsantosh @leoluoInSea @maghamravi

@sethsaperstein-lyft
Copy link
Contributor Author

additional info

The original scope of this change was to handle 2 cases:

  1. Savepoints and checkpoints are failing and we would like to start without state
  2. Submitting job is failing and would otherwise succeed if not using a savepoint (ex: job graph change)

The original solution involved an extra state in the state graph for submitting without state. This state would be reached in the above two cases on upgrade as such:

  1. savepointing -> recovering -> submittingJobWithoutState
  2. savepointing -> submittingJob -> submittingJobWithoutState

After further review, it appears that case 2 can be solved by enabling the option allowNonRestoredState. This leaves only the first case to be handled in which the savepoint fails and no externalized checkpoints are available. In this case, we do not need a different state and instead can reuse the state submittingJob which will proceed without an empty savepointPath. This solves the issue while simplifying the solution. Consulted @maghamravi who agrees.

@@ -9,7 +9,7 @@ ENV PATH=$FLINK_HOME/bin:$HADOOP_HOME/bin:$MAVEN_HOME/bin:$PATH
COPY . /code

# Configure Flink version
ENV FLINK_VERSION=1.11.6 \
ENV FLINK_VERSION=1.8.1 \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returning this back to 1.8.1 which corresponds to the image in dockerhub. Attempted to update this in the integration test PR but am reverting for now as not to deal with further configuration options that are non-trivial when dealing with 8GB memory total for github actions

minikube image load lyft/operator-test-app:b1b3cb8e8f98bd41f44f9c89f8462ce255e0d13f.1
minikube image load lyft/operator-test-app:b1b3cb8e8f98bd41f44f9c89f8462ce255e0d13f.2

cd integ/operator-test-app
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building the test app image as part of the integration test as opposed to relying on the remote image whose contents may not match the integ/operator-test-app contents

Copy link
Contributor

@anandswaminathan anandswaminathan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are introducing a regression here when fallbackWithoutState is not set.

You are not failing deploy when there is an error

I see that you have handled the case below. Looks good

Copy link
Contributor

@anandswaminathan anandswaminathan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic looks good. Wondering if the duplication of lines can be avoided. Something like

failDeploy := false

if err !=nil  {
  failDeploy = true

} else path =="" {
  ...
  failDeploy = true
}

if failDeploy {
  if FallbackWithoutState {
  failDeploy = false
}
}

if failDeploy {
  s.deplpyFailed
}

submitjob

err = s.Util.ExecuteCommand("minikube", "ssh", "touch /tmp/checkpoints/fail && chmod 0644 /tmp/checkpoints/fail")
c.Assert(err, IsNil)

// wait a bit for it to start failing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment incorrect because on line 115 you assert error is nil?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! You can not verifying that the app is failing is it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines have moved around since but if you are referring to s.Util.GetFlinkApplication this gets the k8s FlinkApplication CR and the err corresponds to if there were errors retrieving the CR object as opposed to anything related to the status of the actual job.

Ln 135 s.Util.FlinkAPIGet(newApp, endpoint) gets the actual flink job and corresponding status to show that the job itself is healthy

@sethsaperstein-lyft
Copy link
Contributor Author

The logic looks good. Wondering if the duplication of lines can be avoided. Something like

failDeploy := false

if err !=nil  {
  failDeploy = true

} else path =="" {
  ...
  failDeploy = true
}

if failDeploy {
  if FallbackWithoutState {
  failDeploy = false
}
}

if failDeploy {
  s.deplpyFailed
}

submitjob

agreed. refactored

@sethsaperstein-lyft sethsaperstein-lyft requested review from anandswaminathan and premsantosh and removed request for leoluoInSea May 4, 2023 18:38
@sethsaperstein-lyft sethsaperstein-lyft merged commit db5a5de into master May 5, 2023
@sethsaperstein-lyft sethsaperstein-lyft deleted the STRMCMP-1658_fallback_without_state branch May 5, 2023 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants