Service only reports Ready=True when rollout is complete #2430
Comments
I added this to the future topics for the WG. Are you up for leading a discussion next week @cooperneil?
Missed the WG discussion on Wednesday, but heard there was discussion of whether the logic should be in the Route rather than the Service. I think it needs to at least be in the Service, due to the race condition of a Service being notified of a Configuration update before the Route is notified. In this scenario, if the logic were only in the Route, I think it would have a stale positive indicating its traffic was current, and therefore the Service would prematurely report ready.
I believe the discussion was whether the Service should expose the logic as part of the “RouteReady” condition on the Service rather than adding a new condition to the Service. It would add additional logic to route readiness in the service controller to go to Unknown with a reason, instead of straight propagation from the Route's condition, so it would not have the race described. Sorry for the confusion.
Oh, got it. So in that 3rd option, the RoutesReady and ConfigurationsReady remain terminal conditions, and the conditions would look like:
status:
  conditions:
  - type: Ready
    status: Unknown
    message: "Traffic migration is not complete"
  - type: ConfigurationsReady
    status: True
  - type: RoutesReady
    status: Unknown
    message: "Traffic migration is not complete"
I think I actually prefer that to the other two alternatives. Thoughts?
/kind bug
So I believe that this is more or less the root cause of a flake I am seeing in our conformance testing. The test logic only waits for the Service to report ready, which leads to the intermittent error in CI.
cc @adrcunha
So of the three options outlined, what do we think the right approach is? I'm happy with either option 2 or 3.
I was originally more in favor of Option 2, as I liked the properties that:
However, I am coming around to Option 3. I think we can accurately capture the state.
/milestone Serving 0.4
/assign vagababov
/unassign
tl;dr I believe that this is related to some intermittent flakes that we see. I believe @vagababov was able to reproduce this with sleeps in the Route controller, but here's an example from our continuous CI that I think may be related to this: https://gubernator.knative.dev/build/knative-prow/logs/ci-knative-serving-continuous/1081339923912462336
Yes, I was. I'll try to produce the simple approach (extend the route.ready check to include a revision match) in the coming week.
TL;DR: knative#2430
- This is the first change to ensure that we mark the service as ready only when all the subresources have successfully reconciled.
- In this change runLatest is covered. The service will become ready only when config.LatestReadyRevision is the one served by the route. When they mismatch, the service will transition into the `Unknown` state until the route finishes reconciliation.
- Unit tests are updated and extended for this case.
- Integration tests are hardened to make sure the service transitions into the ready state before verifying request/responses.
* Implement verification for route readiness for the runLatest.
  TL;DR: #2430
  - This is the first change to ensure that we mark the service as ready only when all the subresources have successfully reconciled.
  - In this change runLatest is covered. The service will become ready only when config.LatestReadyRevision is the one served by the route. When they mismatch, the service will transition into the `Unknown` state until the route finishes reconciliation.
  - Unit tests are updated and extended for this case.
  - Integration tests are hardened to make sure the service transitions into the ready state before verifying request/responses.
* Remove the outdated method.
* Address review comments.
* Address the comments.
  - improve the functional helper name
  - removed the confusing comment
  - added a test that validates the proper behaviour when gen 2 config fails, but gen 1 is happy throughout.
* Comment typo fix.
* Add a comment to clarify what is going on.
* Fix the unit test after the merge, since the interface changed.
* Address review comments.
  - reason is a single word
  - move the validation from callback to the postprocessing.
* Update pkg/apis/serving/v1alpha1/service_types_test.go
  Co-Authored-By: vagababov <[email protected]>
* Address comments.
When we update the service in release mode, but the route is not yet updated to the new version, we should mark the service as being in the `Unknown` ready state. See: knative#2430
* add interface
* Implement release mode checking for the service not yet being ready.
  When we update the service in release mode, but the route is not yet updated to the new version, we should mark the service as being in the `Unknown` ready state. See: #2430
* remove debug logging
* Bring back the accidentally deleted comment.
* Apply suggestions from the code review.
* Replace cmp with kmp package.
* Update pkg/reconciler/v1alpha1/service/service.go
* Remove the check for type of service.
  Since reconcile() is only invoked when we're not in manual mode, the type check is useless here. Also move the DeepCopy() one word right, saving us a few microseconds.
* diff -> eq
cc @mattmoor @dgerd @sixolet @steren @shefaliv @mikehelmick @evankanderson
tl;dr this issue describes a scenario where the Service reports Ready=True before rollout is complete, and proposes a change (possibly with a new condition) that allows reporting Readiness more correctly.
Problem Statement
Consider this runLatest Service scenario: what should the Service's Ready condition be in this case?
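For illustration, a rough sketch of what the Service status might look like mid-rollout (the revision names and exact field values here are hypothetical, following the description below):

status:
  latestCreatedRevisionName: myservice.2
  latestReadyRevisionName: myservice.2
  traffic:
  - revisionName: myservice.1
    percent: 100
  conditions:
  - type: ConfigurationsReady
    status: True
  - type: RoutesReady
    status: True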
In this case, by the Knative conditions convention, the Service should be Ready=True because ConfigurationsReady = True and RoutesReady = True.
But in reality, the rollout is still in progress. The Configuration has reconciled and successfully created the new revision myservice.2, so its readiness becomes True, but the Route has not yet been notified of the updated revision and has not migrated traffic to it. It is still pointing at the old revision. So the client's intent in updating the service is only partially fulfilled.
The client is prematurely notified that their deployment is finished, but in reality traffic has not migrated yet.
Proposal(s)
To address this, a suggestion is to compare the status.latestReadyRevisionName (which is propagated from the Configuration) and the status.traffic.revisionName (which is propagated from the Route) to determine overall readiness.
One possibility is that the Service's Ready condition is no longer simply the AND of the RoutesReady & ConfigurationsReady terminal conditions. Instead, the Service's Ready is based on RoutesReady = True AND ConfigurationsReady = True AND latestReadyRevisionName = traffic.revisionName.
Doing so could lead to conditions like this:
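Roughly, the conditions under this option might look something like the following (the message text here is only illustrative):

status:
  conditions:
  - type: Ready
    status: Unknown
    message: "Traffic migration is not complete"
  - type: ConfigurationsReady
    status: True
  - type: RoutesReady
    status: True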
Another alternative is to leave the Ready condition as a conventional rollup of terminal conditions, and introduce a third terminal condition named something like RolloutComplete. RolloutComplete would encode the latestReadyRevisionName = traffic.revisionName test. This is recomputed with updates from the Configuration and Route. This would lead to conditions like:
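Roughly, with a RolloutComplete condition in the rollup, the conditions might look something like this (again, the message text is only illustrative):

status:
  conditions:
  - type: Ready
    status: Unknown
    message: "Traffic migration is not complete"
  - type: ConfigurationsReady
    status: True
  - type: RoutesReady
    status: True
  - type: RolloutComplete
    status: Unknown
    message: "Traffic migration is not complete"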
Note that in either of these suggestions, the logic would be slightly different in Release mode, where the traffic in the spec of the Service (i.e. named revisions and percentages) is compared with the traffic in status.
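For illustration, a Release-mode mismatch might look roughly like this (the spec field names and revision names are assumed for the sketch, based on the v1alpha1 release mode):

spec:
  release:
    revisions:
    - myservice.1
    - myservice.2
    rolloutPercent: 10
status:
  traffic:
  - revisionName: myservice.1
    percent: 100

Here the spec asks for a 90/10 split across the two named revisions, but status.traffic still sends 100% of traffic to the first, so the rollout is not yet complete.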