Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Service only reports Ready=True when rollout is complete #2430

Closed
cooperneil opened this issue Nov 8, 2018 · 12 comments
Closed

Service only reports Ready=True when rollout is complete #2430

cooperneil opened this issue Nov 8, 2018 · 12 comments
Assignees
Labels
area/API API objects and controllers kind/bug Categorizes issue or PR as related to a bug. kind/spec Discussion of how a feature should be exposed to customers.
Milestone

Comments

@cooperneil
Copy link

cc @mattmoor @dgerd @sixolet @steren @shefaliv @mikehelmick @evankanderson

tl;dr this issue describes a scenario where the Service reports Ready=True before rollout is complete, and proposes a change (possibly with a new condition) that allows reporting Readiness more correctly.

Problem Statement

Consider this runLatest Service scenario - what should Service's Ready condition = in this case?

kind: Service
metadata:
  name: myservice
  generation: 2
...
spec:
  runLatest:
...
status:
  conditions:
  - type: Ready
    status: ?
  - type: ConfigurationsReady
    status: True
  - type: RoutesReady
    status: True
  observedGeneration: 2
  latestCreatedRevisionName: myservice.2
  latestReadyRevisionName: myservice.2  # Configuration is True based on new revision
  traffic:
   - revisionName: myservice.1  # Route is True based on the previous revision
      percent: 100

In this case, by the Knative conditions convention, the Service should be Ready=True because ConfigurationsReady = True and RoutesReady = True.

But in reality, what has happened is that the rollout is still in progress. The Configuration has reconciled and successfully created the new myservice.2, and becomes true, but the Route has not yet been notified of the updated revision and migrated traffic to it. It is still on the old revision. So the client's intent in updating the service is only partially complete.

The client is prematurely notified that their deployment is finished, but in reality traffic has not migrated yet.

Proposal(s)

To address this, a suggestion is to compare the status.latestReadyRevisionName (which is propagated from the Configuration) and the status.traffic.revisionName (which is propagated from the Route) to determine overall readiness.

One possibility is that Service's Ready condition is no longer simply the AND of RoutesReady & ConfigurationsReady terminal conditions. Instead Service's Ready is based on RoutesReady = True AND ConfigurationsReady = True AND the latestReadyRevisionName = traffic.revisionName.

Doing so could lead to conditions like this:

..
status:
  conditions:
  - type: Ready
    status: Unknown
    message: "Traffic migration is not complete"
  - type: ConfigurationsReady
    status: True
  - type: RoutesReady
    status: True
...

Another alternative is to leave the Ready condition as a conventional rollup of terminal conditions, and introduce a third terminal condition named something like RolloutComplete. RolloutComplete would encode the latestReadyRevisionName = traffic.revisionName test. This is recomputed with updates from the Configuration and Route. This would lead to conditions like:

..
status:
  conditions:
  - type: Ready
    status: Unknown
    message: "Traffic migration is not complete"
  - type: RolloutComplete
    status: Unknown
    message: "Traffic migration is not complete"
  - type: ConfigurationsReady
    status: True
  - type: RoutesReady
    status: True
...

Note that in either of these suggestions, the logic would be slightly different in Release mode, where the traffic in the spec of the Service (i.e. named revisions and percentages) is compared with the traffic in status.

@cooperneil cooperneil changed the title Service is only reports Ready=True when rollout is complete Service only reports Ready=True when rollout is complete Nov 8, 2018
@mattmoor
Copy link
Member

mattmoor commented Nov 8, 2018

I added this to the future topics for the WG. Are you up for leading a discussion next week @cooperneil ?

@mattmoor mattmoor added area/API API objects and controllers kind/spec Discussion of how a feature should be exposed to customers. labels Nov 9, 2018
@cooperneil
Copy link
Author

Missed the WG discussion on Wednesday, but heard there was discussion of if the logic should be in the Route rather than Service. I think it needs to at least be in the Service, due to the race condition of a Service being notified of a Configuration update before the Route being notified.

In this scenario, if the logic was only in the Route, I think it would have a stale positive indicating its traffic was current, and therefore the Service would prematurely report ready.

@dgerd
Copy link

dgerd commented Nov 16, 2018 via email

@cooperneil
Copy link
Author

Oh, got it. So in that 3rd option, the RoutesReady and ConfigurationsReady remain terminal conditions, and the conditions would look like:

..
status:
  conditions:
  - type: Ready
    status: Unknown
    message: "Traffic migration is not complete"
  - type: ConfigurationsReady
    status: True
  - type: RoutesReady
    status: Unknown
    message: "Traffic migration is not complete"
...

I think I actually prefer that to the other 2 alternatives. Thoughts?

@mattmoor
Copy link
Member

/kind bug
/assign @dgerd

So I believe that this is more or less the root cause of a flake I am seeing in our conformance testing (e.g.).

This logic here only waits for the Service's LatestReadyRevisionName to reflect the latest rollout, and then it immediately checks that (not waits for) the Route looks a particular way.

This leads to the intermittent error in CI:

service_test.go:196: the Route test-run-latest-service-pckfurpa was not updated to route traffic to the Revision test-run-latest-service-pckfurpa-00002: Expected traffic revision name to be test-run-latest-service-pckfurpa-00002 but actually is test-run-latest-service-pckfurpa-00001

cc @adrcunha

@knative-prow-robot knative-prow-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 10, 2018
@cooperneil
Copy link
Author

So of the 3 options outlined, what do we think the right approach is? I'm happy with either options 2 or 3, i.e.

  • option 2) add a new terminal condition like "RolloutComplete" that encapsulates the new readiness logic
  • option 3) pack the new readiness logic into the RoutesReady condition, so that RoutesReady is more than a straight propagation of the underlying Route's readiness.

@dgerd
Copy link

dgerd commented Dec 10, 2018

I was originally more in favor of Option 2 as I liked the properties that:

  • service."RouteReady" == route."Ready" and
  • service."Ready" == The && of all other terminal conditions

However, I am coming around to Option 3. I think we can accurately capture the state through message and reason fields so that it is clear that the Route is ready, but transitioning to a new revision. Meanwhile keeping the terminal conditions to just the Service and dependent objects reduces the number of conditions to keep track of both in code and in UI. If for some reason we end up unhappy with Option 3 we can always add a new condition later (although we would have to be careful changing the new RouteReady behavior if clients end up relying on it).

@dgerd
Copy link

dgerd commented Dec 20, 2018

/milestone Serving 0.4

@dgerd
Copy link

dgerd commented Jan 3, 2019

/assign vagababov

@dgerd
Copy link

dgerd commented Jan 3, 2019

/unassign

@mattmoor
Copy link
Member

mattmoor commented Jan 6, 2019

tl;dr I believe that this is related to some intermittent flakes that we see.

I believe @vagababov was able to reproduce this with sleeps in the Route controller, but here's an example from our continuous CI that I think may be related to this: https://gubernator.knative.dev/build/knative-prow/logs/ci-knative-serving-continuous/1081339923912462336

@vagababov
Copy link
Contributor

vagababov commented Jan 6, 2019 via email

vagababov added a commit to vagababov/serving that referenced this issue Jan 12, 2019
TL;DR: knative#2430
- This is the first change to ensure that we mark service as ready only
  when all the subresources have successfully reconciled.
- In this change runLatest is covered. The service will become ready
  only when config.LatestReadyRevision is the one served by the route.
  When they mismatch the service will transition into the `Unknown`
  state until route finishes reconciliation.
- Unit tests are updated and extended for this case
- Integration tests are hardened to make sure the service transitions
  into ready state before verifying request/responses.
knative-prow-robot pushed a commit that referenced this issue Jan 17, 2019
* Implement verification for route readiness for the runLatest.

TL;DR: #2430
- This is the first change to ensure that we mark service as ready only
  when all the subresources have successfully reconciled.
- In this change runLatest is covered. The service will become ready
  only when config.LatestReadyRevision is the one served by the route.
  When they mismatch the service will transition into the `Unknown`
  state until route finishes reconciliation.
- Unit tests are updated and extended for this case
- Integration tests are hardened to make sure the service transitions
  into ready state before verifying request/responses.

* Remove the outdated method.

* Address review comments.

* Address the comments.

- improve the functional helper name
- removed the confusing comment
- added a test that validates the proper behaviour when gen 2 config
  fails, but gen 1 is happy throughout.

* Comment typo fix.

* Add a comment to clarify what is going on.

* Fix the unit test after the merge, since the interface changed.

* Address review comments

- reason is a single word
- move the validation from callback to the postprocessing.

* Update pkg/apis/serving/v1alpha1/service_types_test.go

Co-Authored-By: vagababov <[email protected]>

* Address comments.
vagababov added a commit to vagababov/serving that referenced this issue Jan 19, 2019
When we update the service in the release mode, but the route is not yet
updated to the new version, we should mark the service to be in `Unknown` ready state.
See: knative#2430
knative-prow-robot pushed a commit that referenced this issue Jan 23, 2019
* add interface

* Implement release mode checking for the service not yet being ready.

When we update the service in the release mode, but the route is not yet
updated to the new version, we should mark the service to be in `Unknown` ready state.
See: #2430

* remove debug logging

* Bring back the accidentaly deleted comment.

* Apply suggestions from the code review.

* Replace cmp with kmp package.

* Update pkg/reconciler/v1alpha1/service/service.go

* Remove the check for type of service

Since reconcile() is only invoked, when we're not in manual mode,
hence the type check is useless here.
Also move the DeepCopy() one word right, saving us a few microseconds.

* diff -> eq
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/API API objects and controllers kind/bug Categorizes issue or PR as related to a bug. kind/spec Discussion of how a feature should be exposed to customers.
Projects
None yet
Development

No branches or pull requests

5 participants