TestTaskRunTimeout is flakey : fix timeouts #731

Closed · vdemeester opened this issue Apr 5, 2019 · 4 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)
@vdemeester (Member) commented Apr 5, 2019

vdemeester added the kind/bug label on Apr 5, 2019
vdemeester added a commit to vdemeester/tektoncd-pipeline that referenced this issue Apr 5, 2019
Keeping track on progress fixing it in tektoncd#731

Signed-off-by: Vincent Demeester <[email protected]>
@bobcatfish (Collaborator) commented:

@vdemeester just let me know if you want a break from the investigation; I'd be happy to take it on for a while 💃

@vdemeester (Member, Author) commented:

/assign @bobcatfish

The more the merrier (especially as I'm running out of time this week)

@bobcatfish (Collaborator) commented:

tl;dr: The timeout handler and the reconciler disagree about when a Run has timed out.

Okay I have a theory about what's going on. In #771 I ran the timeout test 10 times (and p.s. this seems to reproduce the issue locally as well), and in https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/771/pull-tekton-pipeline-integration-tests/1118981945099816962/ we have 2/10 of those timing out.

I also added a lot of (terrible) logs so I could see what was going on and I noticed some interesting things:

  • The failed Runs were considered timed out in the timeout handler (i.e. case <-time.After(timeout) occurred) for all 10 Runs (and one of them twice haha)
  • But for run-giraffe-0 and run-giraffe-6, the subsequent reconcile did NOT consider them timed out

For example for run-giraffe-0, this is from the timeout handler:

I0418 21:17:16.127] {"level":"info","logger":"controller.taskrun-controller","caller":"reconciler/timeout_handler.go:203","msg":"TIMEOUT timed out for run-giraffe-0","knative.dev/controller":"taskrun-controller"}                    
I0418 21:17:16.127] {"level":"info","logger":"controller.taskrun-controller","caller":"reconciler/timeout_handler.go:200","msg":"TIMEOUT finished for run-giraffe-0","knative.dev/controller":"taskrun-controller"}

This would then trigger the reconcile loop, and the logs from that look like:

I0418 21:17:16.128] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:248","msg":"TIMEOUT about to check if timeout for run-giraffe-0","knative.dev/controller":"taskrun-controller"}               
I0418 21:17:16.128] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:531","msg":"TIMEOUT checking timeout for run-giraffe-0, starttime 2019-04-18 21:07:13 +0000 UTC","knative.dev/controller":"taskrun-controller"}
I0418 21:17:16.128] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:537","msg":"TIMEOUT check if nil run-giraffe-0","knative.dev/controller":"taskrun-controller"}                                
I0418 21:17:16.128] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:541","msg":"TIMEOUT not nil run-giraffe-0 timeout 30s runtime 29.98450907s","knative.dev/controller":"taskrun-controller"}

As far as the reconcile loop is concerned, the timeout is 30s but the runtime is 29.98450907s - right on the edge!

And the same thing for run-giraffe-6:

I0418 21:17:16.033] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:541","msg":"TIMEOUT not nil run-giraffe-6 timeout 30s runtime 29.843298838s","knative.dev/controller":"taskrun-controller"}

So the theory is that the timeout handler and the reconcile loop are slightly out of sync in their logic for deciding when a timeout has occurred: `<-time.After(timeout)` fires, but the reconciler's own check comes to a slightly different answer.

Anyway I don't know what the fix is haha - at a glance they SEEM to be doing the same thing. But the key imo is in making sure that exactly the same logic is being applied in both places OR in moving the logic to handle the timeout case into the timeout handler itself (instead of triggering an entire reconcile).
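
To illustrate, here's a tiny standalone sketch (a toy, not the actual controller code) of how a handler built on `<-time.After(timeout)` and a reconciler that recomputes elapsed time from a stored start time can disagree when their start references drift by just a few milliseconds:

```go
// Toy reproduction of the drift (hypothetical, not the Tekton code).
package main

import (
	"fmt"
	"time"
)

func main() {
	timeout := 30 * time.Millisecond // shortened so the example runs quickly

	// The handler records its start time first...
	handlerStart := time.Now()
	// ...and the reconciler's reference is recorded a few milliseconds later,
	// so the two drift slightly.
	time.Sleep(5 * time.Millisecond)
	reconcilerStart := time.Now()

	// Timeout handler: effectively `case <-time.After(timeout)` measured from
	// the handler's start time; it fires and triggers a reconcile.
	time.Sleep(time.Until(handlerStart.Add(timeout)))
	fmt.Println("handler: timed out, triggering reconcile")

	// Reconciler: re-derives "timed out" from its own start time, so it can
	// still see runtime < timeout and let the run keep going.
	if runtime := time.Since(reconcilerStart); runtime >= timeout {
		fmt.Println("reconciler: agrees the run timed out")
	} else {
		fmt.Printf("reconciler: not timed out (runtime %v < timeout %v)\n", runtime, timeout)
	}
}
```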

Whew! (maybe 🤞🤞 🤞 )


We need to:

  • Figure out a fix for this (haha)
  • Add more logs to track how timeouts are being applied (for any future debugging)

@bobcatfish (Collaborator) commented:

Can see similar stuff happening in https://storage.googleapis.com/tekton-prow/pr-logs/pull/tektoncd_pipeline/771/pull-tekton-pipeline-integration-tests/1118995157169999873/build-log.txt as well (with "improved" logs):

I0418 22:09:34.153] {"level":"info","logger":"controller.taskrun-controller","caller":"controller/controller.go:288","msg":"Reconcile succeeded. Time taken: 47.002µs.","knative.dev/controller":"taskrun-controller","knative.dev/key":"arendelle-vs2m2/run-giraffe-0"}
I0418 22:09:34.153] {"level":"info","logger":"controller.taskrun-controller","caller":"reconciler/timeout_handler.go:203","msg":"TIMEOUT HANDLER timed out for run-giraffe-5","knative.dev/controller":"taskrun-controller"}
I0418 22:09:34.153] {"level":"info","logger":"controller.taskrun-controller","caller":"reconciler/timeout_handler.go:200","msg":"TIMEOUT HANDLER finished for run-giraffe-5","knative.dev/controller":"taskrun-controller"}
I0418 22:09:34.154] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:248","msg":"TIMEOUT RECONCILER about to check if timeout for run-giraffe-5","knative.dev/controller":"taskrun-controller"}
I0418 22:09:34.154] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:531","msg":"TIMEOUT RECONCILER checking timeout for run-giraffe-5, starttime 2019-04-18 21:59:31 +0000 UTC","knative.dev/controller":"taskrun-controller"}
I0418 22:09:34.154] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:537","msg":"TIMEOUT RECONCILER check if nil run-giraffe-5","knative.dev/controller":"taskrun-controller"}
I0418 22:09:34.154] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:541","msg":"TIMEOUT RECONCILER not nil run-giraffe-5 timeout 30s runtime 29.943341324s","knative.dev/controller":"taskrun-controller"}

timeout 30s runtime 29.943341324s DUN DUN DUUUUN

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Apr 19, 2019
I foolishly tried to use the example at the top of the test README when
I was trying to run the timeout tests (for the tektoncd#731 investigation) and I
was sad when it failed b/c the default timeout was too short. This is
the timeout we are using when running CI anyway!
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Apr 22, 2019
I originally tried to completely remove the PipelineRun timeout handler
- but it turns out that actually both controllers _do_ check for
timeouts, they just do it differently. I hope in a follow up PR to make
them more similar, but in the meantime, I refactored the timeout handler
logic such that the handler itself is the same for both kinds of runs,
and added the same `HasStarted` function for TaskRuns that PipelineRuns
have.

Also:
- Added more logs (which was how I figured out the problem with tektoncd#731 in
the first place!)
- Followed some linting guidelines, fixing typos, adding keyed fields
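
For context, here's a minimal sketch of the kind of `HasStarted` helper the commit refers to (hypothetical, simplified types rather than the actual Tekton code): a run counts as started once its status carries a non-zero start time, which is what lets a single shared timeout handler treat TaskRuns and PipelineRuns uniformly.

```go
// Sketch of a shared HasStarted check (hypothetical, simplified types).
package main

import (
	"fmt"
	"time"
)

type RunStatus struct {
	StartTime *time.Time
}

// HasStarted reports whether the run has a recorded, non-zero start time.
func (s *RunStatus) HasStarted() bool {
	return s.StartTime != nil && !s.StartTime.IsZero()
}

func main() {
	var s RunStatus
	fmt.Println(s.HasStarted()) // false: no start time recorded yet

	now := time.Now()
	s.StartTime = &now
	fmt.Println(s.HasStarted()) // true
}
```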
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Apr 22, 2019
Our TaskRun timeout end to end test has been intermittently failing
against PRs. After adding a lot of (terrible) log messages, I found out
that the reason TaskRuns weren't timing out was b/c the go routine
checking for a timeout was considering them timed out several
milliseconds before the reconcile loop would.

So what would happen was:

1. The timeout handler decided the run timed out, and triggered a
   reconcile
2. The reconcile checked if the run timed out, decided the run had a few
   more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts,
it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`,
but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_
`StartTime` instead - which is what the Reconcile loop will use from
that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the
pod: the TaskRun will be considered started as soon as the reconciler
starts to act on it, which is probably the functionality the user would
expect anyway (e.g. if the pod was delayed in being started, this delay
should be subtracted from the timeout, and in tektoncd#734 we are looking to be
more tolerant of pods not immediately scheduling anyway).

Fixes tektoncd#731

Co-authored-by: Nader Ziada <[email protected]>
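
A minimal sketch of the fix the commit describes (hypothetical, simplified types rather than the actual Tekton implementation): record the start time once, when the reconciler first acts on the TaskRun, and never overwrite it with the pod's start time, so the timeout goroutine and the reconcile loop measure elapsed time from the same instant.

```go
// Sketch of a start time that is set once and never moved (hypothetical).
package main

import (
	"fmt"
	"time"
)

type TaskRunStatus struct {
	StartTime *time.Time
}

// initializeStartTime records the start time the first time the reconciler
// acts on the run; later calls (e.g. when the pod actually starts) are
// no-ops, so every timeout check measures from the same instant.
func initializeStartTime(status *TaskRunStatus, now time.Time) {
	if status.StartTime == nil || status.StartTime.IsZero() {
		status.StartTime = &now
	}
}

// hasTimedOut is the check both code paths can share once they agree on the
// start time.
func hasTimedOut(status *TaskRunStatus, timeout time.Duration, now time.Time) bool {
	if status.StartTime == nil || status.StartTime.IsZero() {
		return false
	}
	return now.Sub(*status.StartTime) >= timeout
}

func main() {
	status := &TaskRunStatus{}
	start := time.Now()
	initializeStartTime(status, start)

	// The pod starting later must NOT move the start time forward; the guard
	// above makes this second call a no-op.
	initializeStartTime(status, start.Add(20*time.Millisecond))

	// Handler and reconciler now agree: at start+30s the run has timed out.
	fmt.Println(hasTimedOut(status, 30*time.Second, start.Add(30*time.Second))) // true
}
```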
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Sep 13, 2019
Logging in the timeout handler was added as part of tektoncd#731 cuz it helped
us debug when the timeout handler didn't work as expected. Unfortunately
it looks like the logger we're using can't be used in multiple go
routines (uber-go/zap#99 may be related).
Removing this logging to fix tektoncd#1124, hopefully can find a safe way to add
logging back in tektoncd#1307.
nikhil-thomas pushed a commit to nikhil-thomas/pipeline that referenced this issue Aug 24, 2021
Minor fix: escape 'envsubst' using 'DOLLAR'