Fix marking activator endpoints populated into sks condition accurately #15279

Closed



@yenniechen yenniechen commented May 30, 2024

Fixes #15278

  • Mark activator endpoints populated in the SKS condition accurately
    While reconciling an SKS, the reconciler reconciles the public Endpoints and, as its last step, marks the "ActivatorEndpointsPopulated" condition. If there are no backends, or if the SKS is in proxy mode, the activator is considered to back this revision. In fact, if no activator is found, the mode is forced to "Serve" and the activator is not put in the path. Would it be better to exclude this scenario when marking the "ActivatorEndpointsPopulated" condition?
	// Otherwise check how long SKS was in proxy mode.
	// Compute the difference between time we've been proxying with the timeout.
	// If it's positive, that's the time we need to sleep, if negative -- we
	// can scale to zero.
	pf := sks.Status.ProxyFor()
	to := cfgAS.ScaleToZeroGracePeriod - pf
	if to <= 0 {
		logger.Info("Fast path scaling to 0, in proxy mode for: ", pf)
		return desiredScale, true
	}

The condition of type "ActivatorEndpointsPopulated" in the SKS status is used to determine whether the grace period for scaling to 0 has elapsed. If there are no activator endpoints, would it be better not to scale down to 0?
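
The proposed behavior can be sketched as follows. This is a minimal illustration, not the knative/serving reconciler: the `Mode` type, `resolveMode`, and its parameters are hypothetical names chosen for the example. It assumes that proxy mode is wanted when the SKS spec asks for it or when the revision has no ready backends, and that a missing activator forces serve mode, as described above.

```go
package main

// Mode is a hypothetical stand-in for the SKS operation mode.
type Mode string

const (
	ModeServe Mode = "Serve"
	ModeProxy Mode = "Proxy"
)

// resolveMode returns the effective mode and whether the activator is really
// in the data path. All parameters are assumptions made for this sketch:
// specMode is the mode requested in the SKS spec, readyBackends is the number
// of ready revision endpoints, and readyActivators is the number of ready
// activator endpoints.
func resolveMode(specMode Mode, readyBackends, readyActivators int) (Mode, bool) {
	wantProxy := specMode == ModeProxy || readyBackends == 0
	if wantProxy && readyActivators > 0 {
		// The activator has endpoints to publish, so it backs the revision.
		return ModeProxy, true
	}
	// Either serve mode was requested and backends exist, or no activator
	// is available and the mode is forced to serve.
	return ModeServe, false
}
```

Under this sketch, the "ActivatorEndpointsPopulated" condition would be marked True only when the second return value is true, so `resolveMode(ModeProxy, 0, 0)` yields serve mode with the activator out of the path.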

Release Note

NONE


linux-foundation-easycla bot commented May 30, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@knative-prow knative-prow bot requested review from izabelacg and skonto May 30, 2024 02:48

knative-prow bot commented May 30, 2024

Welcome @yingyueshi! It looks like this is your first PR to knative/serving 🎉


knative-prow bot commented May 30, 2024

Hi @yingyueshi. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow knative-prow bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 30, 2024
@yenniechen yenniechen changed the title from "Fix marking activator endpoints populated into sks condition accurately (##15278)" to "Fix marking activator endpoints populated into sks condition accurately" May 30, 2024
@dprotaso
Member

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 30, 2024
@skonto
Contributor

skonto commented May 31, 2024

So that basically means we didn't update the mode to serve and used only the initial SKS mode (proxy), hence the problem. The unit test fails because we never checked that the SKS status field is False:

Extra status update for networking.internal.knative.dev/v1alpha1, Kind=ServerlessService/force/serve (-extra, +prevState):
...
          &v1alpha1.ServerlessService{
          	TypeMeta:   {},
          			Conditions: v1.Conditions{
          				{
          					Type:     "ActivatorEndpointsPopulated",
        - 					Status:   "False",
        + 					Status:   "True",
          					Severity: "Info",
          					... // 1 ignored and 2 identical fields
          				},


knative-prow bot commented May 31, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yenniechen
Once this PR has been reviewed and has the lgtm label, please assign dsimansk for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yenniechen yenniechen requested a review from dprotaso May 31, 2024 13:32
@skonto
Contributor

skonto commented May 31, 2024

I think there is another case not covered:

 {
		Name: "keep in serve mode, no endpoints",
		Key:  "force/proxy",
		Objects: []runtime.Object{
			SKS("force", "proxy", markHappy, WithPubService, WithPrivateService, WithDeployRef("bar")),
			deploy("force", "bar"),
			svcpub("force", "proxy"),
			svcpriv("force", "proxy"),
			endpointspub("force", "proxy", withFilteredPorts(networking.BackendHTTPPort)),
			endpointspriv("force", "proxy" /* revision has no endpoints, force proxy mode */),
			activatorEndpoints(),
		},

		WantStatusUpdates: []clientgotesting.UpdateActionImpl{{
			Object: SKS("force", "proxy", WithPubService, WithPrivateService, WithDeployRef("bar"),
				// Changes from above.
				markNoEndpoints),
		}},
	},

This will pass by setting:

    logger.go:146: 2024-05-31T16:22:16.718+0300	DEBUG	serverlessservice/reconciler.go:333	Updating status with:   v1alpha1.ServerlessServiceStatus{
          	Status: v1.Status{
          		ObservedGeneration: 0,
          		Conditions: v1.Conditions{
          			{
          				Type:               "ActivatorEndpointsPopulated",
        - 				Status:             "False",
        + 				Status:             "True",

I see we accept that condition in one similar test, "OnCreate-no-activator-eps-proxy", where the activator does not have endpoints. Is there any reason for that? What does it mean when we don't have any endpoints?

@knative-prow knative-prow bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 31, 2024
@yenniechen
Author

yenniechen commented May 31, 2024

So that basically means we didn't update the mode to serve and used only the initial SKS mode (proxy), hence the problem. The unit test fails because we never checked that the SKS status field is False:

Extra status update for networking.internal.knative.dev/v1alpha1, Kind=ServerlessService/force/serve (-extra, +prevState):
...
          &v1alpha1.ServerlessService{
          	TypeMeta:   {},
          			Conditions: v1.Conditions{
          				{
          					Type:     "ActivatorEndpointsPopulated",
        - 					Status:   "False",
        + 					Status:   "True",
          					Severity: "Info",
          					... // 1 ignored and 2 identical fields
          				},

You are right. The condition above (type "ActivatorEndpointsPopulated") is used to determine whether the grace period for scaling to 0 has elapsed. If there are no endpoints, would it be better not to scale down to 0?

	// Otherwise check how long SKS was in proxy mode.
	// Compute the difference between time we've been proxying with the timeout.
	// If it's positive, that's the time we need to sleep, if negative -- we
	// can scale to zero.
	pf := sks.Status.ProxyFor()
	to := cfgAS.ScaleToZeroGracePeriod - pf
	if to <= 0 {
		logger.Info("Fast path scaling to 0, in proxy mode for: ", pf)
		return desiredScale, true
	}

// ProxyFor returns how long it has been since Activator was moved
// to the request path.
func (sss *ServerlessServiceStatus) ProxyFor() time.Duration {
	cond := sss.GetCondition(ActivatorEndpointsPopulated)
	if cond == nil || cond.Status != corev1.ConditionTrue {
		return 0
	}
	return time.Since(cond.LastTransitionTime.Inner.Time)
}

If there are no activator endpoints but there are revision endpoints, would it be better not to scale to zero while there is no traffic? If we scale to zero at that point, we may drop some traffic, because neither activator endpoints nor revision endpoints are added to the public Service.

If there are neither activator endpoints nor revision endpoints, there are zero endpoints and nothing is left to scale to zero. Whether or not the activator is marked as populated seems to have no effect. Would it be better to mark the condition based simply on whether the activator is really in the path?

In addition, I searched the code globally and found that this SKS condition (type "ActivatorEndpointsPopulated") is only used to compute the grace period for scaling to zero.


codecov bot commented Jun 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.78%. Comparing base (c2d0af1) to head (8c77066).
Report is 334 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15279      +/-   ##
==========================================
+ Coverage   84.11%   84.78%   +0.66%     
==========================================
  Files         213      218       +5     
  Lines       16783    13480    -3303     
==========================================
- Hits        14117    11429    -2688     
+ Misses       2315     1685     -630     
- Partials      351      366      +15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@skonto
Contributor

skonto commented Jun 3, 2024

For the case where TBC != -1, there is logic that will not allow the scale-down:

		if resolveTBC(ctx, pa) != -1 {
			// if TBC is -1 activator is guaranteed to already be in the path.
			// Otherwise, probe to make sure Activator is in path.
			r, err = ks.activatorProbe(pa, ks.transport)
			logger.Infof("Probing activator = %v, err = %v", r, err)
		}

If r is false, ProxyFor() is never called. For the case where TBC = -1, though, I think we will scale down to zero while the activator might not be around and endpoints are populated (without this PR). I think we should not scale down if the activator is not in the path (when desiredScale is 0), right? @dprotaso wdyth?

@yenniechen
Author

yenniechen commented Jun 4, 2024

I am also wondering about the answer.
For the case of TBC = -1 and no ready activator endpoints, with this PR the scale-to-0 handling will be requeued until the activator becomes ready. Once the activator has ready endpoints, the SKS controller will observe the change and mark the activator endpoints as populated on the SKS; then the revision pods will be scaled to 0.
How should this be designed?

@skonto
Contributor

skonto commented Jun 14, 2024

@dprotaso wdyth?

@skonto
Contributor

skonto commented Jun 19, 2024

@dprotaso gentle ping.

@dprotaso
Member

I think we should not scale down if the activator is not in the path (when desiredScale is 0), right? @dprotaso wdyth?

Yeah I agree

@dprotaso
Member

I think we should update the PR so that we are only marking ActivatorEndpointsPopulated when we are actually wiring the activator in the data path with ready endpoints.

@dprotaso
Member

Also thanks for the investigation everyone - it helped me be more informed on the path to take here.

@dprotaso
Member

dprotaso commented Jul 9, 2024

hey @yenniechen just following up

@yenniechen
Author

hey @yenniechen just following up

So do you have any suggestions? I thought this PR would solve this problem.

@yenniechen
Author

@dprotaso


This Pull Request is stale because it has been open for 90 days with
no activity. It will automatically close after 30 more days of
inactivity. Reopen with /reopen. Mark as fresh by adding the
comment /remove-lifecycle stale.

@github-actions github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 14, 2024
@yenniechen yenniechen closed this Oct 21, 2024
Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
3 participants