Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delta SDS incorrectly sends an error response for an empty response #32832

Closed
howardjohn opened this issue Mar 11, 2024 · 18 comments
Closed

Delta SDS incorrectly sends an error response for an empty response #32832

howardjohn opened this issue Mar 11, 2024 · 18 comments
Labels
area/sds SDS related bug stale stalebot believes this issue/PR has not been touched recently

Comments

@howardjohn
Copy link
Contributor

Title: Delta SDS incorrectly sends an error response for an empty response

Description:
Control plan sends resources:[], removed:[some resource]. Envoy rejects with 2024-03-11T23:35:13.641010Z warning envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138 gRPC config for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret rejected: Missing SDS resources for kubernetes://sds-credential in onConfigUpdate() thread=25

This comes from

absl::Status SdsApi::validateUpdateSize(int num_resources) {
which only considers SotW

Repro steps:
I don't have a trivial Envoy SDS setup for this in pure envoy, but its pretty simple in Istio. Just create something like

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: echo
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
    - "*"
    tls:
      credentialName: sds-credential
      mode: SIMPLE

With istio latest (which uses delta xds now)

@nezdolik
Copy link
Member

cc @adisuissa @htuch

@adisuissa
Copy link
Contributor

This seems to be a dup of #24373.

@keithmattix
Copy link
Contributor

I can try to take this if it's not claimed

@htuch htuch closed this as not planned Won't fix, can't repro, duplicate, stale Mar 13, 2024
@htuch
Copy link
Member

htuch commented Mar 13, 2024

@keithmattix I'll close this as dupe and assign you #24373

@keithmattix
Copy link
Contributor

@howardjohn I think the Istio repro in the original description is expected. If the secret is not supplied by the SDS server, then the client should indeed error, no?

@howardjohn
Copy link
Contributor Author

a NACK from lack of resource is not expected. it should stay warming.

but even if we had previously sent it, any removal is just entirely rejected

@keithmattix
Copy link
Contributor

but even if we had previously sent it, any removal is just entirely rejected

Right, I've confirmed this and am iterating on the linked PR; I'm looking to verify the repro behavior with Istio

a NACK from lack of resource is not expected. it should stay warming.

Ah I see; from the spec (emphasis mine):

Resources names of resources that have be deleted and to be removed from the xDS Client. Removed resources for missing resources can be ignored.

So you're saying the desired behavior is:

  1. if cluster is still warming and secret is removed, the SDS subscription should ignore the removal (NOT NACK)
  2. If cluster is already ready (and serving traffic), the SDS subscription should also ignore that removal because the owning listener/cluster should be updated as well

I agree with 1 but on 2, how can the SDS subscription guarantee that the cluster/listener are being updated? What if the server is saying "this secret no longer exists" (e.g. a cert rotation failed); what should that do to the owning cluster/listener? Does the SDS subscription need to do anything at all (I think this is current behavior)?

@howardjohn
Copy link
Contributor Author

howardjohn commented Mar 20, 2024 via email

@keithmattix
Copy link
Contributor

keithmattix commented Mar 20, 2024

I'm familiar with the ecds issue; I'm just not sure which is the more "correct" approach from a protocol perspective. What you're suggesting would mean any removal that isn't LDS or CDS essentially has no effect. From a pure protocol perspective, I would think that an xDS server should be able to indicate that dependent resources are missing/removed and have that mean something

@howardjohn
Copy link
Contributor Author

@keithmattix IMO there are 2 usable implementations:

  1. There is update and remove semantics (as today), and remove of dependent resources is ignored
  2. There is update, remove, and "requested thing not present" semantics. Remove dependant resource is respected

In (2), we would just never send Remove for dependent types. But if we just do that today, then we cannot tell Envoy a resource it requested is not available, which means it will be warming instead of just failing which is unexpected.

@keithmattix
Copy link
Contributor

keithmattix commented Mar 25, 2024

Ah I see; unless we xDS adopted (2)'s semantics, we wouldn't be able to express "not available". Fair point; I ended up coding up "delete ignored" in #32961, so I think this will solve Istio's issue

@kyessenov
Copy link
Contributor

Ignoring delete can be dangerous in case of compromised secrets. Does delete trigger an eventual release of the secret? Or will it potentially be used indefinitely after the deletion?

@keithmattix
Copy link
Contributor

Wouldn't that require an update instead of a removal though? Of the cluster is not supposed to use a secret in the interim period before a new secret is available, modify the cluster right?

@kyessenov
Copy link
Contributor

kyessenov commented Mar 25, 2024

@keithmattix I think it depends. I am mostly dis-agreeing with the base assumption that traffic must not be disrupted at all costs. Sometimes, you do want to break traffic to mitigate an ongoing attack, or for compliance reasons.

@keithmattix
Copy link
Contributor

Sure, but IIUC, not ignoring removal leaves control planes in a tough spot because they can't signal to the data plane that a requested secret is not/available without breaking traffic. If the CP wants to break traffic, it can just remove the secret from the cluster right?

@htuch
Copy link
Member

htuch commented Mar 29, 2024

My take here is that semantics today are as pointed out above, we don't support removal of singleton subscriptions (SDS, EDS, etc.) independent of the referencing resource (cluster, listener). To @kyessenov point about dangers, we can either rely on the control plane to enforce the correct behavior during secret removal or make a change in semantics as suggested by @howardjohn. It seems reasoable to rely on control plane updating cluster/listener in these cases? Reopening as this is not resolved.

@htuch htuch reopened this Mar 29, 2024
htuch pushed a commit that referenced this issue Mar 29, 2024
Fixes #24373 and #32832

Risk Level: Low
Testing: Manual testing with the Istio scenario described in #32832. Investigating how to add a unit test
Release Notes: Delta SDS removals will no longer result in a "Missing SDS resources" error message

Signed-off-by: Keith Mattix II <[email protected]>
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Apr 28, 2024
alyssawilk pushed a commit to alyssawilk/envoy that referenced this issue Apr 29, 2024
Fixes envoyproxy#24373 and envoyproxy#32832

Risk Level: Low
Testing: Manual testing with the Istio scenario described in envoyproxy#32832. Investigating how to add a unit test
Release Notes: Delta SDS removals will no longer result in a "Missing SDS resources" error message

Signed-off-by: Keith Mattix II <[email protected]>
Copy link

github-actions bot commented May 5, 2024

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/sds SDS related bug stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

No branches or pull requests

6 participants