
Monitor Job Vertices State on Deploy #284

Merged: 24 commits from monitor into master, Apr 13, 2023

Conversation

@leoluoInSea (Contributor) commented Mar 31, 2023

Context

Currently, flinkk8soperator only monitors the Flink job-level overview status; if the job status is RUNNING, it considers the deployment to have succeeded.

However, we have encountered scenarios where some vertices are in bad states and the job keeps crashing and restarting. This change therefore also checks the status of all vertices.

Design doc

Implementation

  1. Use a feature flag to turn the feature ON/OFF. The initial default value is false. After we have enough confidence, we will flip the flag or remove it entirely. As discussed with Anand, TL;DR: this change is relatively small, and I will do a handful of manual tests in staging after merging.
  2. Add logic to fail fast if any vertex status is FAILED (see the sketch after this list).
  3. Add several unit tests.
  4. I tried to add an integration test that inserts a , into programArgs, on the assumption that it would break argument parsing. However, the test Flink app is currently pinned to a specific old image; I tried to build a new Flink image, but the new image exceeded the memory limit.
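
A minimal sketch of the vertex-level check described in items 1 and 2 above, written against simplified, assumed types rather than the operator's actual API:

```go
// Illustrative sketch only; the Vertex type and status values are assumptions
// that mirror Flink's REST API, not the flinkk8soperator client package.
package main

import "fmt"

type Vertex struct {
	Name      string
	StartTime int64  // epoch millis; > 0 once the vertex has been scheduled
	Status    string // e.g. "RUNNING", "FAILED"
}

// checkVertices returns whether every vertex is running and, if any vertex has
// reached the terminal FAILED state, its index (otherwise -1) so the caller
// can fail the deployment fast.
func checkVertices(vertices []Vertex) (allRunning bool, failedIndex int) {
	allRunning = true
	failedIndex = -1
	for i, v := range vertices {
		if v.Status == "FAILED" {
			return false, i
		}
		allRunning = allRunning && v.StartTime > 0 && v.Status == "RUNNING"
	}
	return allRunning, failedIndex
}

func main() {
	vs := []Vertex{{Name: "Source", StartTime: 1680713677111, Status: "RUNNING"}}
	fmt.Println(checkVertices(vs)) // true -1
}
```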

Jira link

https://jira.lyft.net/browse/STRMCMP-1640

jobFinalizer = "job.finalizers.flink.k8s.io"
statusChanged = true
statusUnchanged = false
jobVertexStateTimeoutInMinute = 3

Contributor:

Can this be part of the config so that it can be set through configmap?

Contributor Author:

Yes, it could be. But even if we set it through the configmap, we still need a default value in code in case the key is not set in the configMap. If flinkk8soperator users later have a use case for overriding the default value, we can consider adding it then.

Contributor:

Correct. You can add the default value as well.

Contributor:

+1 to adding this as a configuration for the user. Any arbitrary wait should be configurable, even just for our own sake of finding the optimal time that ensures a safe deploy while keeping deploy times down.

Additionally, I like the "duration" syntax used in the configmap linked above, as it is not restricted to minutes and is therefore more flexible.
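
As an illustration of the configurable wait discussed in this thread, here is a small Go sketch that reads a "duration"-style value with a code-level default; the config-key handling is an assumption, not the operator's existing config code:

```go
// Hedged sketch: the raw string would come from the operator's configmap;
// the default mirrors the 3-minute constant in the diff above.
package main

import (
	"fmt"
	"time"
)

const defaultJobVertexStateTimeout = 3 * time.Minute

// vertexStateTimeout parses a duration value such as "90s" or "5m", falling
// back to the default when the key is unset or invalid.
func vertexStateTimeout(raw string) time.Duration {
	if raw == "" {
		return defaultJobVertexStateTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return defaultJobVertexStateTimeout
	}
	return d
}

func main() {
	fmt.Println(vertexStateTimeout(""))    // 3m0s (default)
	fmt.Println(vertexStateTimeout("90s")) // 1m30s
}
```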

if v1beta1.IsBlueGreenDeploymentMode(app.Status.DeploymentMode) && app.Status.DeployHash != "" {
s.updateApplicationPhase(app, v1beta1.FlinkApplicationDualRunning)
// wait until all vertices have been scheduled and running
jobStartTimeSec := job.StartTime / 1000

Contributor:

Is the job start time reliable? I believe it will get reset when the JobManager restarts or the job fails and restarts.

Contributor Author:

Hmm, good point. Let me verify it tomorrow.

Contributor Author (@leoluoInSea, Apr 5, 2023):

I have verified that the start time will not reset after job restart.

{"jobs":[{"jid":"70ec861c2ddf20072a1d2e3f1aff3bd7","name":"nmodes","state":"RESTARTING","start-time":1680713677111,"end-time":-1,"duration":1235989,"last-modification":1680714880571,"tasks":{"total":192,"created":0,"scheduled":0,"deploying":0,"running":0,"finished":0,"canceling":0,"canceled":191,"failed":1,"reconciling":0,"initializing":0}}]}

The job is in the RESTARTING state, but start-time is Wednesday, April 5, 2023 9:54:37.111 AM GMT-07:00, which is the time I triggered the deployment.
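
For illustration, a small sketch of how a timeout check could be derived from this millisecond start-time; the helper name and wiring are assumptions, not the PR's exact code:

```go
// The start-time below is taken from the REST response quoted above.
package main

import (
	"fmt"
	"time"
)

// vertexStateTimedOut reports whether more than `timeout` has elapsed since
// the job's start-time (epoch milliseconds, as returned by the Flink REST API).
func vertexStateTimedOut(jobStartTimeMillis int64, timeout time.Duration, now time.Time) bool {
	start := time.UnixMilli(jobStartTimeMillis)
	return now.Sub(start) > timeout
}

func main() {
	fmt.Println(vertexStateTimedOut(1680713677111, 3*time.Minute, time.Now())) // true for any recent "now"
}
```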

@leoluoInSea (Contributor Author):

ptal @premsantosh @maghamravi #streaming-compute-dev-prs

@sethsaperstein-lyft (Contributor):

nit: Can you make the PR title a bit more descriptive?

for _, v := range job.Vertices {
allVerticesStarted = allVerticesStarted && (v.StartTime > 0)
allVerticesRunning := true
if app.Spec.BetaFeaturesEnabled {

Contributor:

Are we documenting what features are in beta?

Contributor (@anandswaminathan, Apr 10, 2023):

This is not a standard operator pattern.

Primarily because updating this flag will cause applications to restart (if the hash calculation includes this flag).

Contributor Author:

@sethsaperstein-lyft I will add more to the crd.md file to document this.

Contributor Author:

@anandswaminathan Sorry, I didn't follow why updating the flag would cause the application to restart. Can you elaborate? My understanding is that this field is set in the application jsonnet file and the value is passed in when a new deployment/update is made, but the field can't be changed at runtime.

Contributor (@anandswaminathan, Apr 12, 2023):

@leoluoInSea

So here is what you are doing: you are looking for a way to have some applications be exposed to this feature.

This is not commonly done in a CRD spec and operator, because it is not something you want to keep for the long run. There is no direct correlation between Flink and "BetaFeaturesEnabled". You are now setting a goal that all the apps will have "BetaFeaturesEnabled" set to true. Say you remove the flag some day later (to make the behavior the default): all the apps that have it set will need to revert, and we are looking at updating hundreds of apps (depending on scale).

If you are worried about the reliability of the change, maybe pass a version as a label (if we don't already do so) and gate the logic on the label (if version == xx). The other option is to have the flag in a label (but that needs a small change here). That way, if the code gets removed or changed, the apps will not need to be updated.

Contributor Author:

Just for the record, Anand and I had a discussion, and our conclusion is that this is not a breaking change; I will do several rounds of testing in staging after merging, definitely including an example failure scenario with nmodes.

For future reference, if we need a feature flag, the recommended way (as Anand pointed out above) is to check a label instead of adding a new field to the CRD. From the app jsonnet file, just add the label when the feature should be turned on. A sketch of that pattern follows.
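
A minimal sketch of that label-based gating, with a hypothetical label key (not the operator's actual key):

```go
// The label key below is an assumption for illustration; in practice it would
// be set from the app's jsonnet file on the FlinkApplication resource.
package main

import "fmt"

const vertexMonitoringLabel = "flink.k8s.io/monitor-vertices" // hypothetical

// vertexMonitoringEnabled reads the toggle from the application's labels, so
// removing the feature later does not require updating the CRD spec.
func vertexMonitoringEnabled(labels map[string]string) bool {
	return labels[vertexMonitoringLabel] == "true"
}

func main() {
	labels := map[string]string{vertexMonitoringLabel: "true"}
	fmt.Println(vertexMonitoringEnabled(labels)) // true
}
```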

if app.Spec.BetaFeaturesEnabled {
// wait until all vertices have been scheduled and running
logger.Info(ctx, "Beta features flag is enabled.")
jobStartTimeSec := job.StartTime / 1000

Contributor:

Can we move the logic below into a separate function to make this more readable and testable?

Contributor:

I should have clarified that I meant all the new logic of the running vertex state check, not just the start time calculation

Contributor Author:

After removing the feature flag if condition, I think it's readable now.

Contributor:

The func is >100 lines, so I think it's somewhat difficult to read. Nit, but for maintainability.

@leoluoInSea changed the title from "monitor" to "Safe Deploys Part 1: Monitor job vertices state instead of job overview state" on Apr 10, 2023
s.flinkController.LogEvent(ctx, app, corev1.EventTypeWarning, "JobRunningFailed",
fmt.Sprintf(
"Vertex %d with name [%s] state is Failed", failedVertexIndex, job.Vertices[failedVertexIndex].Name))
return s.deployFailed(app)

Contributor:

This is new behavior. Is it intentional? More than monitoring, we are updating the state of the deployment here.

Contributor Author:

This change is not just monitoring; it also updates the state of the deployment when conditions are met.
This specific fail-fast part is new, and it follows the general philosophy that if something fails, it should fail fast.

I double-checked the Flink docs on the vertex state transition graph: FAILING can still retry, while FAILED is the terminal bad state.

So I will update the logic to fail the deployment fast if any vertex state is FAILED.

Contributor:

@anandswaminathan In retrospect, this behavior did not follow the original design, as we would like to move to rolling back instead. New PR: #291


@sethsaperstein-lyft changed the title from "Safe Deploys Part 1: Monitor job vertices state instead of job overview state" to "Monitor Job Vertices State on Deploy" on Apr 11, 2023
@leoluoInSea (Contributor Author):

ptal @anandswaminathan

@anandswaminathan (Contributor) left a comment:

Please get a +1 from @sethsaperstein-lyft

@leoluoInSea merged commit d5e6f18 into master on Apr 13, 2023
@leoluoInSea deleted the monitor branch on April 13, 2023 at 17:29
hasFailure = true
break
}
allVerticesRunning = allVerticesRunning && (v.StartTime > 0) && v.Status == client.Running

Contributor:

If the first vertex is failing, won't this return allVerticesRunning as true when it should be false?

Contributor Author:

If the vertex is failing, hasFailure is set to true, and the if block at line 768 then returns early to fail the deployment.

Contributor:

I see. So it works because of how the current caller is written, but the method itself is still buggy: if it is called from somewhere else that does not handle this case like the caller above, it will return an incorrect result.

Contributor Author:

Agreed. Initially this part was in the main method, so I didn't set a value for allVerticesRunning in the failure scenario. I will provide the fix later.
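
A minimal sketch of the fix described here, using simplified types: the helper clears allVerticesRunning itself when a failed vertex is found, so its result is correct even for callers that do not check hasFailure first:

```go
// Simplified stand-in for the snippet under review; "FAILED"/"RUNNING" strings
// replace the client package's status constants.
package main

import "fmt"

type Vertex struct {
	Name      string
	StartTime int64
	Status    string
}

func getVertexStates(vertices []Vertex) (allVerticesRunning bool, hasFailure bool) {
	allVerticesRunning = true
	for _, v := range vertices {
		if v.Status == "FAILED" {
			hasFailure = true
			allVerticesRunning = false // the fix: don't rely on the caller's early return
			break
		}
		allVerticesRunning = allVerticesRunning && v.StartTime > 0 && v.Status == "RUNNING"
	}
	return allVerticesRunning, hasFailure
}

func main() {
	fmt.Println(getVertexStates([]Vertex{{Name: "Sink", Status: "FAILED"}})) // false true
}
```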

@anandswaminathan (Contributor) commented May 16, 2023 via email

sethsaperstein-lyft added a commit that referenced this pull request May 17, 2023
## overview
In the [job monitoring PR](#284) we introduced a bug: when job monitoring fails due to a timeout or a failed vertex, the DeployFailed state is reached instead of attempting to roll back. This change simplifies the logic of job submission and job monitoring, and results in the job attempting to roll back.

## additional info
Errors returned by a state in the state machine are added to the status as the last error. The shouldRollback check at the beginning of these states determines whether the error is retryable and moves to rolling back if it is not. Thus, the change is to return an error if monitoring results in a failed vertex or a vertex timeout.
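
A sketch of that behavior, with illustrative names rather than the operator's exact API: the monitoring step surfaces failures as errors so the state machine's shouldRollback check can move the application to rolling back:

```go
// Hedged sketch; the error values and monitor function are assumptions.
package main

import (
	"errors"
	"fmt"
)

var (
	errVertexFailed  = errors.New("a job vertex is in the FAILED state")
	errVertexTimeout = errors.New("job vertices did not reach RUNNING before the timeout")
)

// monitorVertices returns an error on a failed vertex or a vertex timeout; the
// caller records it as the application's last error, and shouldRollback then
// decides whether to transition to rolling back instead of DeployFailed.
func monitorVertices(hasFailure, timedOut bool) error {
	if hasFailure {
		return errVertexFailed
	}
	if timedOut {
		return errVertexTimeout
	}
	return nil
}

func main() {
	if err := monitorVertices(true, false); err != nil {
		fmt.Println("monitoring failed, expect rollback:", err)
	}
}
```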