Enable automated canary deployments using Services #2721

Closed
dicarlo2 opened this issue Dec 14, 2018 · 15 comments
Labels: area/API (API objects and controllers), kind/feature (Well-understood/specified features, ready for coding)

Comments

@dicarlo2 commented Dec 14, 2018

Currently, using the /spec/release key in a Service allows for simple blue/green deployments, but for automated canaries (e.g. using Spinnaker) we'd like to have 3 groups:

  1. the new version (the canary),
  2. the old version, with a traffic split equal to group 1 (a baseline for comparison),
  3. the old version, receiving the remainder of the traffic.

E.g., initially we might do 5% to 1, 5% to 2, and 90% to 3, and then slowly increase 1 and 2. We could compare the metrics from 1 against 2 to determine release health. This seems like a pretty standard rollout procedure, so I think it makes sense to incorporate it directly into the higher-level Service object.
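For illustration, a minimal sketch of what that three-way split could look like as Knative Route traffic targets. This uses the current serving.knative.dev/v1 shape, not the API as it existed when this issue was opened; the app and revision names are hypothetical, assuming the old version is revision my-app-00001 and the new one is my-app-00002:

```yaml
apiVersion: serving.knative.dev/v1
kind: Route
metadata:
  name: my-app
spec:
  traffic:
    - revisionName: my-app-00002   # group 1: the new version (canary)
      percent: 5
      tag: canary
    - revisionName: my-app-00001   # group 2: old version, share equal to the canary (baseline)
      percent: 5
      tag: baseline
    - revisionName: my-app-00001   # group 3: old version, remainder of the traffic
      percent: 90
      tag: current
```

Rolling out would then mean repeatedly raising the canary and baseline percentages in lockstep while lowering the remainder, comparing canary metrics against baseline at each step.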

@knative-prow-robot added the area/API and kind/feature labels Dec 14, 2018
@dicarlo2 changed the title from "Enable canary deployments using Services" to "Enable automated canary deployments using Services" Dec 14, 2018
@mattmoor (Member) commented Feb 7, 2019

This is covered well by Route, but not Service (today).

Let's consider this as part of the "Service v1" conversation.

@mattmoor (Member) commented

The surface available at HEAD supports this in Service, which now simply inlines ConfigurationSpec and RouteSpec. The knative/docs repo at HEAD reflects this, and this will all be in the 0.6 release.
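For illustration, a hedged sketch of that inlined shape in a v1 Service, where spec.template carries the ConfigurationSpec and spec.traffic carries the RouteSpec (the service name, image, and revision name are hypothetical):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-app
spec:
  template:                        # inlined ConfigurationSpec
    spec:
      containers:
        - image: gcr.io/my-project/my-app:v2
  traffic:                         # inlined RouteSpec
    - latestRevision: true         # the canary tracks the newest revision
      percent: 5
      tag: canary
    - revisionName: my-app-00001   # the rest stays pinned to the old revision
      percent: 95
      tag: current
```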

@polothy commented Apr 26, 2019

Sorry, having trouble following your comment. Is there a link to the doc/sample you mentioned?

@jonnylangefeld commented

I'm also interested in this functionality and don't think this use case is covered today. All tutorials/docs only show how you can manually change the traffic splitting, but there's no built-in automation that slowly changes traffic routing from 10% to 20% to 30% all the way to 100% to migrate traffic.

@markusthoemmes (Contributor) commented

@jonnylangefeld check out https://knative.dev/docs/developer/serving/rolling-out-latest-revision/! It's experimental-ish, so feedback would be very appreciated!
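For reference, the mechanism described in that doc is driven by an annotation on the Service. A minimal sketch, assuming a hypothetical service and image (the 380s value is just an example duration):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-app
  annotations:
    # Spread the traffic shift from the previous revision to the new
    # one over 380 seconds instead of moving 100% at once.
    serving.knative.dev/rollout-duration: "380s"
spec:
  template:
    spec:
      containers:
        - image: gcr.io/my-project/my-app:v2
```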

@jonnylangefeld commented

Thanks @markusthoemmes for the pointer! Looks promising. Is the traffic shifting only time based? What if the new revision throws errors? Would the time based traffic shifting continue?

@markusthoemmes (Contributor) commented

Yes, right now the amount of traffic shifted is only time-based, as its main motivation at the time was to smooth over weird autoscaling ripple effects when a traffic shift is done very abruptly.
Reacting to actual failures in the new revision and similar signals was out of scope for this specific piece. IIRC the line of thought at the time was that such a rollout strategy would require a higher-level abstraction and much more control over which signals the rollout would actually use.

@jonnylangefeld commented

Okay, I see! Thanks for the summary.
I will take a look at Flagger, which takes metrics into account and works on custom resources. So technically it should be able to manipulate the traffic percentages on the Service.serving.knative.dev/v1 resource.

@markusthoemmes (Contributor) commented

Interesting! Please let us know how the experiment went, this sounds like a great integration use-case!

@jonnylangefeld commented

I did take a look at Flagger, and while it's pretty cool, it unfortunately doesn't work with Service.serving.knative.dev/v1 as expected. Instead it watches Deployments, and whenever something changes (like a version tag), it runs an automatic traffic migration where it slowly shifts the traffic to the new version and rolls back if metrics show too many errors or too high latency for the new version.
Just like Knative, it uses Istio (or other traffic managers) to shift the traffic via VirtualService.
I think Knative would also really benefit from an automatic traffic shift to canaries.
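For context, the VirtualService-level split that Flagger adjusts looks roughly like the sketch below. This is a hedged Istio illustration, not Flagger's exact output; the host and destination service names are hypothetical, and Flagger generates and rebalances the weights itself during a rollout:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app.example.com
  http:
    - route:
        - destination:
            host: my-app-primary   # current stable version
          weight: 90
        - destination:
            host: my-app-canary    # new version under evaluation
          weight: 10
```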

@markusthoemmes (Contributor) commented

Sounds like it'd theoretically be possible to teach Flagger to talk the Knative API as well, though 🤔. Looking at the top of https://docs.flagger.app/, it looks like we'd have to teach it the Knative Service/Knative Route API for it to be able to do the switcheroo. Since it already seems to implement the APIs of various other projects, it'd be interesting to go ask if they'd be game for a Knative implementation as well.

@markusthoemmes (Contributor) commented

fluxcd/flagger#903 actually considers this.

@politician commented Oct 20, 2021

Integrating with Flagger to deploy Knative functions would definitely be interesting, as it would mean one tool for both functions and container-based services.
It also seems like this approach would be 100% compatible with the GitOps way of configuring a cluster.

Also, to add to the conversation: I have not tried Gloo, but it seems like it could also be a way to perform canary deployments, and it has a first-class integration with Knative.
Flagger works with Gloo, by the way.

@dprotaso (Member) commented Nov 3, 2021

> I have not tried Gloo, but it seems like it could also be a way to perform canary deployments, and it has a first-class integration with Knative.

Gloo's CLI installs a Knative version that's two years old - I'm not sure what kind of testing they perform, but I'm guessing the integration has rotted. I'd open an issue on the Gloo GitHub to bump Knative if you're interested in working with it long term.

Is there anything actionable for us to do in Knative serving? From reading the thread this stood out:

> What if the new revision throws errors? Would the time based traffic shifting continue?

It seems like we should potentially document/clarify/fix the semantics of the rollout duration when the next revision fails. But I'm tempted to make a separate issue and close this out.

So far it seems this issue was opened just to track the Flagger integration. But I'd rather folks comment upstream to signal that you want Knative support.

@dprotaso (Member) commented

Going to close this out, as we have the following related issues:

  1. Determine rollout duration semantics when next revision fails #12349
  2. Knative support fluxcd/flagger#903
