Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in delta xds server #458

Merged
merged 4 commits into from
Jul 20, 2021

Conversation

s-matyukevich
Copy link
Contributor

Here is the description of the request flow that illustrates the bug

  1. Envoy sends a DeltaServiceDiscoveryRequest and subscribes for an existing resource.
  2. here we check that there is no registered delta watch yet and we call cache.CreateDeltaWatch
  3. CreateDeltaWatch function checks that resource already exists and responds immediately. delayedResponse is false so the new watch isn't created.
  4. Envoy receives the response and responds with another DeltaServiceDiscoveryRequest with the right nonce set.
  5. This time, when the server receives the request there will be an object in the deltaWatches map here So the if statement below won't be executed, and cache.CreateDeltaWatch won't be called for the second time.
  6. We end up in a situation when there is no watcher registered in the cache so envoy won't receive any updates anymore.

You can check this by simply running integration tests. Before this fix integration tests don't fail, but from the output, you can see that no updates are being sent to the client after the test runner printsupdate snapshot message.

P.S: I think this function always puts a single object into watches.deltaMuxedResponses channel (because the cache always closes the watcher after sending a single response) so maybe, instead terminating this function by writing into terminate channels, we can simply get rid of the for loop and terminate the function after the first response is forwarded to the deltaMuxedResponses channel?

Signed-off-by: Sergey Matyukevich <[email protected]>
@alecholmez alecholmez self-assigned this Jul 2, 2021
@alecholmez
Copy link
Contributor

alecholmez commented Jul 2, 2021

@s-matyukevich thanks for opening this bug fix! This looks good, you can ignore the nonce set in delta right now because it's not really used in envoy. But this definitely addresses the issues opened earlier with delta failing to respond.

With regards to the termination channel, which for loop are you referring to? That termination go routine doesn't currently have one. Also note in this branch we now do this:

go func() {
resp, more := <-watch.responses
if more {
watches.deltaMuxedResponses <- resp
}
}()

Since we linearize the responses coming through

@s-matyukevich
Copy link
Contributor Author

s-matyukevich commented Jul 2, 2021

Thanks for the quick review @alecholmez

With regards to the termination channel, which for loop are you referring to? That termination go routine doesn't currently have one. Also note in this branch we now do this:

The way how you handle this in the branch is exactly what I meant. Please disregard my comment about the for loop.

@s-matyukevich
Copy link
Contributor Author

Another problem that I see is that neither unit tests nor integration tests capture this. In our own control plane, we use "integration style" unit tests: we run the control plane server using google.golang.org/grpc/test/bufconn then in the test we create and send different DeltaDiscoveryRequests to the server and verify the responses. I personally find those tests very useful because they allow testing the whole control plane (server + cache) but at the same time they are implemented as normal unit tests, so I can debug them + it is much easier to add new tests and assertions than in the normal integration tests, which rely on envoy and docker.

I can copy come of our test to this repo if you like. They actually captured this bug.

@alecholmez
Copy link
Contributor

@s-matyukevich can you fix the conflicts? I'd like to get this in

Signed-off-by: Sergey Matyukevich <[email protected]>
@s-matyukevich
Copy link
Contributor Author

@s-matyukevich can you fix the conflicts? I'd like to get this in

Sure, done!

Copy link
Contributor

@alecholmez alecholmez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll approve once we get that channel close in

pkg/server/delta/v3/server.go Outdated Show resolved Hide resolved
@alecholmez
Copy link
Contributor

This looks great thanks for working this!

@alecholmez alecholmez merged commit 888a7f7 into envoyproxy:main Jul 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants