TestTaskRunTimeout is flakey : fix timeouts #731

Closed · vdemeester opened this issue Apr 5, 2019 · 4 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)
@vdemeester (Member) commented Apr 5, 2019

vdemeester added the kind/bug label on Apr 5, 2019
vdemeester added a commit to vdemeester/tektoncd-pipeline that referenced this issue Apr 5, 2019
Keeping track on progress fixing it in tektoncd#731

Signed-off-by: Vincent Demeester <[email protected]>
@bobcatfish (Collaborator) commented:

@vdemeester just let me know if you want a break from the investigation; I'd be happy to take it on for a while 💃

@vdemeester (Member, Author) commented:

/assign @bobcatfish

The more the merrier (especially as I'm running out of time this week)

@bobcatfish (Collaborator) commented:

tl;dr: The timeout handler and the reconciler disagree about when a Run has timed out.

Okay I have a theory about what's going on. In #771 I ran the timeout test 10 times (and p.s. this seems to reproduce the issue locally as well), and in https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/771/pull-tekton-pipeline-integration-tests/1118981945099816962/ we have 2/10 of those timing out.

I also added a lot of (terrible) logs so I could see what was going on and I noticed some interesting things:

  • The failed Runs were considered timed out in the timeout handler (i.e. case <-time.After(timeout) occurred) for all 10 Runs (and one of them twice haha)
  • But for run-giraffe-0 and run-giraffe-6, the subsequent reconcile did NOT consider them timed out

For example for run-giraffe-0, this is from the timeout handler:

I0418 21:17:16.127] {"level":"info","logger":"controller.taskrun-controller","caller":"reconciler/timeout_handler.go:203","msg":"TIMEOUT timed out for run-giraffe-0","knative.dev/controller":"taskrun-controller"}                    
I0418 21:17:16.127] {"level":"info","logger":"controller.taskrun-controller","caller":"reconciler/timeout_handler.go:200","msg":"TIMEOUT finished for run-giraffe-0","knative.dev/controller":"taskrun-controller"}

This would then trigger the reconcile loop, and the logs from that look like:

I0418 21:17:16.128] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:248","msg":"TIMEOUT about to check if timeout for run-giraffe-0","knative.dev/controller":"taskrun-controller"}               
I0418 21:17:16.128] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:531","msg":"TIMEOUT checking timeout for run-giraffe-0, starttime 2019-04-18 21:07:13 +0000 UTC","knative.dev/controller":"taskrun-controller"}
I0418 21:17:16.128] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:537","msg":"TIMEOUT check if nil run-giraffe-0","knative.dev/controller":"taskrun-controller"}                                
I0418 21:17:16.128] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:541","msg":"TIMEOUT not nil run-giraffe-0 timeout 30s runtime 29.98450907s","knative.dev/controller":"taskrun-controller"}

As far as the reconcile loop is concerned, the timeout is 30s but the runtime is 29.98450907s - right on the edge!

And the same thing for run-giraffe-6:

I0418 21:17:16.033] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:541","msg":"TIMEOUT not nil run-giraffe-6 timeout 30s runtime 29.843298838s","knative.dev/controller":"taskrun-controller"}

So the theory is that the timeout handler and the reconcile loop are slightly out of sync in their logic for deciding when a timeout has occurred: `<-time.After(timeout)` fires, but the reconciler's own check comes to a slightly different answer.

Anyway I don't know what the fix is haha - at a glance they SEEM to be doing the same thing. But the key imo is in making sure that exactly the same logic is being applied in both places OR in moving the logic to handle the timeout case into the timeout handler itself (instead of triggering an entire reconcile).
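
To illustrate, here's a tiny standalone sketch (a toy, not the actual controller code) of how a handler built on `<-time.After(timeout)` and a reconciler that recomputes elapsed time from a stored start time can disagree when their start references drift by just a few milliseconds:

```go
// Toy reproduction of the drift (hypothetical, not the Tekton code).
package main

import (
	"fmt"
	"time"
)

func main() {
	timeout := 30 * time.Millisecond // shortened so the example runs quickly

	// The handler records its start time first...
	handlerStart := time.Now()
	// ...and the reconciler's reference is recorded a few milliseconds later,
	// so the two drift slightly.
	time.Sleep(5 * time.Millisecond)
	reconcilerStart := time.Now()

	// Timeout handler: effectively `case <-time.After(timeout)` measured from
	// the handler's start time; it fires and triggers a reconcile.
	time.Sleep(time.Until(handlerStart.Add(timeout)))
	fmt.Println("handler: timed out, triggering reconcile")

	// Reconciler: re-derives "timed out" from its own start time, so it can
	// still see runtime < timeout and let the run keep going.
	if runtime := time.Since(reconcilerStart); runtime >= timeout {
		fmt.Println("reconciler: agrees the run timed out")
	} else {
		fmt.Printf("reconciler: not timed out (runtime %v < timeout %v)\n", runtime, timeout)
	}
}
```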

Whew! (maybe 🤞🤞 🤞 )


We need to:

  • Figure out a fix for this (haha)
  • Add more logs to track how timeouts are being applied (for any future debugging)

@bobcatfish (Collaborator) commented:

Can see similar stuff happening in https://storage.googleapis.com/tekton-prow/pr-logs/pull/tektoncd_pipeline/771/pull-tekton-pipeline-integration-tests/1118995157169999873/build-log.txt as well (with "improved" logs):

I0418 22:09:34.153] {"level":"info","logger":"controller.taskrun-controller","caller":"controller/controller.go:288","msg":"Reconcile succeeded. Time taken: 47.002µs.","knative.dev/controller":"taskrun-controller","knative.dev/key":"arendelle-vs2m2/run-giraffe-0"}
I0418 22:09:34.153] {"level":"info","logger":"controller.taskrun-controller","caller":"reconciler/timeout_handler.go:203","msg":"TIMEOUT HANDLER timed out for run-giraffe-5","knative.dev/controller":"taskrun-controller"}
I0418 22:09:34.153] {"level":"info","logger":"controller.taskrun-controller","caller":"reconciler/timeout_handler.go:200","msg":"TIMEOUT HANDLER finished for run-giraffe-5","knative.dev/controller":"taskrun-controller"}
I0418 22:09:34.154] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:248","msg":"TIMEOUT RECONCILER about to check if timeout for run-giraffe-5","knative.dev/controller":"taskrun-controller"}
I0418 22:09:34.154] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:531","msg":"TIMEOUT RECONCILER checking timeout for run-giraffe-5, starttime 2019-04-18 21:59:31 +0000 UTC","knative.dev/controller":"taskrun-controller"}
I0418 22:09:34.154] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:537","msg":"TIMEOUT RECONCILER check if nil run-giraffe-5","knative.dev/controller":"taskrun-controller"}
I0418 22:09:34.154] {"level":"info","logger":"controller.taskrun-controller","caller":"taskrun/taskrun.go:541","msg":"TIMEOUT RECONCILER not nil run-giraffe-5 timeout 30s runtime 29.943341324s","knative.dev/controller":"taskrun-controller"}

timeout 30s runtime 29.943341324s DUN DUN DUUUUN

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Apr 19, 2019
I foolishly tried to use the example at the top of the test README when
I was trying to run the timeout tests (for the tektoncd#731 investigation) and I
was sad when it failed b/c the default timeout was too short. This is
the timeout we are using when running CI anyway!
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Apr 22, 2019
I originally tried to completely remove the PipelineRun timeout handler
- but it turns out that actually both controllers _do_ check for
timeouts, they just do it differently. I hope in a follow up PR to make
them more similar, but in the meantime, I refactored the timeout handler
logic such that the handler itself is the same for both kinds of runs,
and added the same `HasStarted` function for TaskRuns that PipelineRuns
have.

Also:
- Added more logs (which was how I figured out the problem with tektoncd#731 in
the first place!)
- Followed some linting guidelines, fixing typos, adding keyed fields
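
For context, here's a minimal sketch of the kind of `HasStarted` helper the commit refers to (hypothetical, simplified types rather than the actual Tekton code): a run counts as started once its status carries a non-zero start time, which is what lets a single shared timeout handler treat TaskRuns and PipelineRuns uniformly.

```go
// Sketch of a shared HasStarted check (hypothetical, simplified types).
package main

import (
	"fmt"
	"time"
)

type RunStatus struct {
	StartTime *time.Time
}

// HasStarted reports whether the run has a recorded, non-zero start time.
func (s *RunStatus) HasStarted() bool {
	return s.StartTime != nil && !s.StartTime.IsZero()
}

func main() {
	var s RunStatus
	fmt.Println(s.HasStarted()) // false: no start time recorded yet

	now := time.Now()
	s.StartTime = &now
	fmt.Println(s.HasStarted()) // true
}
```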
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Apr 22, 2019
Our TaskRun timeout end to end test has been intermittently failing
against PRs. After adding a lot of (terrible) log messages, I found out
that the reason TaskRuns weren't timing out was b/c the go routine
checking for a timeout was considering them timed out several
milliseconds before the reconcile loop would.

So what would happen was:

1. The timeout handler decided the run timed out, and triggered a
   reconcile
2. The reconcile checked if the run timed out, decided the run had a few
   more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts,
it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`,
but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_
`StartTime` instead - which is what the Reconcile loop will use from
that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the
pod: the TaskRun will be considered started as soon as the reconciler
starts to act on it, which is probably the functionality the user would
expect anyway (e.g. if the pod was delayed in being started, this delay
should be subtracted from the timeout, and in tektoncd#734 we are looking to be
more tolerant of pods not immediately scheduling anyway).

Fixes tektoncd#731

Co-authored-by: Nader Ziada <[email protected]>
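
A minimal sketch of the fix the commit describes (hypothetical, simplified types rather than the actual Tekton implementation): record the start time once, when the reconciler first acts on the TaskRun, and never overwrite it with the pod's start time, so the timeout goroutine and the reconcile loop measure elapsed time from the same instant.

```go
// Sketch of a start time that is set once and never moved (hypothetical).
package main

import (
	"fmt"
	"time"
)

type TaskRunStatus struct {
	StartTime *time.Time
}

// initializeStartTime records the start time the first time the reconciler
// acts on the run; later calls (e.g. when the pod actually starts) are
// no-ops, so every timeout check measures from the same instant.
func initializeStartTime(status *TaskRunStatus, now time.Time) {
	if status.StartTime == nil || status.StartTime.IsZero() {
		status.StartTime = &now
	}
}

// hasTimedOut is the check both code paths can share once they agree on the
// start time.
func hasTimedOut(status *TaskRunStatus, timeout time.Duration, now time.Time) bool {
	if status.StartTime == nil || status.StartTime.IsZero() {
		return false
	}
	return now.Sub(*status.StartTime) >= timeout
}

func main() {
	status := &TaskRunStatus{}
	start := time.Now()
	initializeStartTime(status, start)

	// The pod starting later must NOT move the start time forward; the guard
	// above makes this second call a no-op.
	initializeStartTime(status, start.Add(20*time.Millisecond))

	// Handler and reconciler now agree: at start+30s the run has timed out.
	fmt.Println(hasTimedOut(status, 30*time.Second, start.Add(30*time.Second))) // true
}
```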
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Sep 13, 2019
Logging in the timeout handler was added as part of tektoncd#731 cuz it helped
us debug when the timeout handler didn't work as expected. Unfortunately
it looks like the logger we're using can't be used in multiple go
routines (uber-go/zap#99 may be related).
Removing this logging to fix tektoncd#1124, hopefully can find a safe way to add
logging back in tektoncd#1307.
nikhil-thomas pushed a commit to nikhil-thomas/pipeline that referenced this issue Aug 24, 2021
Minor fix: escape 'envsubst' using 'DOLLAR'