Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(xds): accelerate universal dp XDS generation #11180

Merged
merged 3 commits into from
Sep 18, 2024
Merged

Conversation

Icarus9913
Copy link
Contributor

@Icarus9913 Icarus9913 commented Aug 23, 2024

In the XDS callback chain, the dataplane sync tracker callback relies on the dataplane object to generate the xds configurations. However, the dataplane object creation happens in dataplane liftcycle callback which is latter than the sync tracker callback. Consequetenly, just revert these 2 order and create the dataplane object before the sync call.

Reference issue: #11135

Checklist prior to review

  • Link to relevant issue as well as docs and UI issues --
  • This will not break child repos: it doesn't hardcode values (.e.g "kumahq" as a image registry) and it will work on Windows, system specific functions like syscall.Mkfifo have equivalent implementation on the other OS --
  • Tests (Unit test, E2E tests, manual test on universal and k8s) --
    • Don't forget ci/ labels to run additional/fewer tests
  • Do you need to update UPGRADE.md? --
  • Does it need to be backported according to the backporting policy? (this GH action will add "backport" label based on these file globs, if you want to prevent it from adding the "backport" label use no-backport-autolabel label) --

@Icarus9913 Icarus9913 added the ci/run-full-matrix PR: Runs all possible e2e test combination (expensive use carefully) label Aug 23, 2024
@Icarus9913 Icarus9913 requested a review from a team as a code owner August 23, 2024 06:47
@Icarus9913 Icarus9913 requested review from bartsmykla and lukidzi and removed request for a team August 23, 2024 06:47
@Icarus9913 Icarus9913 linked an issue Aug 23, 2024 that may be closed by this pull request
Copy link
Contributor

@lukidzi lukidzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakubdyszkiewicz Do you know if that order was important? I think it was visible after this fix: #11094

@jakubdyszkiewicz
Copy link
Contributor

It makes sense, but it does not fully fix the problem.

We retrieve DPP for which we generate XDS from the MeshContext to avoid additional queries. MeshContext is cached (by default for 1s). This means that between creation of DPP and running XDS watchdog you need to invalidate the cache to pick up new DPP. This may not be the case very often.
If you were testing this on a single DPP you always had invalid cache, because no DPPs connected = no cached MeshContext.

@Icarus9913
Copy link
Contributor Author

It makes sense, but it does not fully fix the problem.

We retrieve DPP for which we generate XDS from the MeshContext to avoid additional queries. MeshContext is cached (by default for 1s). This means that between creation of DPP and running XDS watchdog you need to invalidate the cache to pick up new DPP. This may not be the case very often. If you were testing this on a single DPP you always had invalid cache, because no DPPs connected = no cached MeshContext.

Yeah, you're right.
Once the Dataplane resource was created, the xds Sync would try to retrieve this Dataplane object by ReadOnlyResourceManager which uses the cache to get the MeshContext first.

func (c *cachedManager) List(ctx context.Context, list model.ResourceList, fs ...store.ListOptionsFunc) error {
tenantID, err := c.tenants.GetID(ctx)
if err != nil {
return err
}
opts := store.NewListOptions(fs...)
if !opts.IsCacheable() {
return fmt.Errorf("filter functions are not allowed for cached store")
}
cacheKey := fmt.Sprintf("LIST:%s:%s:%s", list.GetItemType(), opts.HashCode(), tenantID)
obj, found := c.cache.Get(cacheKey)
if !found {
// There might be a situation when cache just expired and there are many concurrent goroutines here.
// We should only let one fill the cache and let the rest of them wait for it. Otherwise we will be repeating expensive work.
mutex := c.mutexFor(cacheKey)
mutex.Lock()
obj, found = c.cache.Get(cacheKey)
if !found {
// After many goroutines are unlocked one by one, only one should execute this branch, the rest should retrieve object from the cache
c.metrics.WithLabelValues("list", string(list.GetItemType()), "miss").Inc()
if err := c.delegate.List(ctx, list, fs...); err != nil {
mutex.Unlock()
return err
}
c.cache.SetDefault(cacheKey, list.GetItems())

The default cache expiration is 1s:

expirationTime: 1s # ENV: KUMA_STORE_CACHE_EXPIRATION_TIME

So, this PR doesn't fix the issue totally.

@Icarus9913
Copy link
Contributor Author

Icarus9913 commented Sep 9, 2024

Maybe we should set an interval between dataplane_lifecycle and goroutine dataplane_sync_tracker.

err := manager.Upsert(ctx, d.resManager, core_model.MetaToResourceKey(md.Resource.GetMeta()), proxyResource(md.GetProxyType()), func(existing core_model.Resource) error {
if err := d.validateUpsert(ctx, md.Resource); err != nil {
return errors.Wrap(err, "you are trying to override existing proxy to which you don't have an access.")
}
return existing.SetSpec(md.Resource.GetSpec())

Since only the universal DP created in the Strean flow:

if md.Resource == nil {
return nil
}

So I guess maybe we could sleep Cache ExpirationTime after we created the Universal DP object.
What do you think ? @jakubdyszkiewicz

@jakubdyszkiewicz
Copy link
Contributor

Yeah, I think sleeping after creating Universal DP for cache expiration time makes sense.

Btw. it's important to do this only when we create the resource in lifecycle. On Kube, Dataplane object is created by the controller, not by the callback

In the XDS callback chain, the dataplane sync tracker callback relies on the dataplane object to generate the xds configurations.
However, the dataplane object creation happens in dataplane liftcycle callback which is latter than the sync tracker callback.
Consequetenly, just revert these 2 order and create the dataplane object before the sync call.

Reference issue: #11135

Signed-off-by: Icarus Wu <[email protected]>
…e the MeshContext syncs the latest data

Signed-off-by: Icarus Wu <[email protected]>
@Icarus9913 Icarus9913 enabled auto-merge (squash) September 18, 2024 02:12
@Icarus9913 Icarus9913 merged commit 8ae46de into master Sep 18, 2024
35 checks passed
@Icarus9913 Icarus9913 deleted the fix/wk/xds branch September 18, 2024 03:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/run-full-matrix PR: Runs all possible e2e test combination (expensive use carefully)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The universal DP is delayed to generate the XDS configuration
3 participants