
Try to avoid etcd.Get as part of Delete operation #89828

Merged
merged 2 commits into from
Dec 17, 2020

Conversation

wojtek-t
Member

@wojtek-t wojtek-t commented Apr 3, 2020

This was done for GuaranteedUpdate earlier in #35415, but that logic later needed multiple (direct or indirect) bug fixes:
#40664 : Allow values to be wrapped prior to serialization in etcd
#47703 : Do not persist SelfLink into etcd storage
#48394 : GuaranteedUpdate must write if stored data is not canonical
#43152 : etcd3 store: retry with live object on conflict if there was a suggestion
#54780 : partial fix crd patch failing
#58375 : Recheck if transformed data is stale when doing live lookup during update
#77619 : In GuaranteedUpdate, retry on any error if we are working with cached data
#78713 : Set expected in-memory version when decoding unstructured objects from etcd
#82303 : In GuaranteedUpdate, retry on a precondition check failure if we are working with cached data

This PR tries to address that up front with extensive tests based on all the issues listed above.

Depending on the run and scale, we've seen between a 10% and 60% reduction in 99th-percentile latency for Delete API calls.

Release note: NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 3, 2020
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 3, 2020
@wojtek-t wojtek-t force-pushed the suggestions_for_delete branch 2 times, most recently from b8ed2cb to 3ef6954 Compare April 3, 2020 16:30
@wojtek-t wojtek-t force-pushed the suggestions_for_delete branch 4 times, most recently from 107abf0 to f265b1b Compare April 3, 2020 18:51
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 3, 2020
@wojtek-t
Member Author

wojtek-t commented Apr 3, 2020

/retest

@wojtek-t wojtek-t added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 3, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Apr 3, 2020
@wojtek-t wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 3, 2020
@wojtek-t wojtek-t changed the title from "[WIP] Try to avoid etcd.Get as part of Delete operation" to "Try to avoid etcd.Get as part of Delete operation" Apr 3, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2020
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 3, 2020
@wojtek-t wojtek-t force-pushed the suggestions_for_delete branch 2 times, most recently from ef03de5 to dc551cc Compare November 3, 2020 13:03
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 3, 2020
@k8s-ci-robot
Contributor

k8s-ci-robot commented Nov 21, 2020

@wojtek-t: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-node-e2e-containerd | f265b1b | link | /test pull-kubernetes-node-e2e-containerd |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wojtek-t
Member Author

Neither of the errors could have been caused by problems with delete.
/retest

@wojtek-t wojtek-t changed the title from "[WIP] Try to avoid etcd.Get as part of Delete operation" to "Try to avoid etcd.Get as part of Delete operation" Nov 23, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 23, 2020
@wojtek-t wojtek-t added this to the v1.21 milestone Nov 23, 2020
@wojtek-t wojtek-t removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Nov 23, 2020
@wojtek-t
Member Author

@liggitt - I have deeply analyzed everything that happened with GuaranteedUpdate and added tests to ensure up front that the issues it caused are tested explicitly (for those where it made sense). I think it's ready for a review pass; PTAL once you are out of your 1.20 work.

staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go Outdated
staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go Outdated
@@ -227,8 +269,13 @@ func (s *store) conditionalDelete(ctx context.Context, key string, out runtime.O
return err
Member

Add a unit test to make sure the right thing happens if the suggestion is stale and the key no longer exists (was already deleted). I think we would get a NotFound here and return, but please verify that.

Member Author

Added "TestDeleteWithSuggestionOfDeletedObject" test.

That said, it doesn't exercise this path. What happens is that the transaction fails, so we get into the

if !txnResp.Succeeded {

branch, and this is handled in getState (getResp contains an empty KV field, which is handled by returning a "not-found" error).

Member

txnResp.Responses[0] exists for a NotFound response?

Member Author

Yes, it exists, and as a GetResponse it contains an empty set of KV pairs.
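For readers following this thread, the flow described above can be sketched roughly as follows. This is a minimal in-memory simulation, not the actual etcd3 store code: `fakeStore`, `txnResult`, `deleteIfUnchanged`, `stateFromKVs`, and this version of `conditionalDelete` are all hypothetical names made up for illustration, and the real transaction is only approximated by a compare on the stored revision with a read in the failure branch.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the storage layer's "not-found" error.
var errNotFound = errors.New("key not found")

// kv is a hypothetical stand-in for a stored key-value pair.
type kv struct {
	value       string
	modRevision int64
}

// txnResult mimics the shape discussed in the thread: succeeded reports
// whether the compare passed; kvs holds the read performed in the failure
// branch and is empty when the key no longer exists.
type txnResult struct {
	succeeded bool
	kvs       []kv
}

// fakeStore is an in-memory stand-in for etcd; gets counts standalone reads.
type fakeStore struct {
	data map[string]kv
	rev  int64
	gets int
}

func newFakeStore() *fakeStore { return &fakeStore{data: map[string]kv{}} }

func (s *fakeStore) put(key, value string) int64 {
	s.rev++
	s.data[key] = kv{value: value, modRevision: s.rev}
	return s.rev
}

// get is a standalone read, the round trip the suggestion tries to avoid.
func (s *fakeStore) get(key string) []kv {
	s.gets++
	if cur, ok := s.data[key]; ok {
		return []kv{cur}
	}
	return nil
}

// deleteIfUnchanged deletes key only if its revision still equals rev;
// otherwise it returns the current contents (possibly none) instead.
func (s *fakeStore) deleteIfUnchanged(key string, rev int64) txnResult {
	cur, ok := s.data[key]
	if ok && cur.modRevision == rev {
		delete(s.data, key)
		s.rev++
		return txnResult{succeeded: true}
	}
	if ok {
		return txnResult{kvs: []kv{cur}}
	}
	return txnResult{}
}

// stateFromKVs treats an empty result as "not found", which is the role
// getState plays in the discussion above.
func stateFromKVs(kvs []kv) (kv, error) {
	if len(kvs) == 0 {
		return kv{}, errNotFound
	}
	return kvs[0], nil
}

// conditionalDelete starts from a cached suggestion when one is given and
// falls back to the data returned by the failed transaction if it was stale.
func conditionalDelete(s *fakeStore, key string, suggestion *kv) (kv, error) {
	var state kv
	var err error
	if suggestion != nil {
		state = *suggestion
	} else if state, err = stateFromKVs(s.get(key)); err != nil {
		return kv{}, err
	}
	for {
		resp := s.deleteIfUnchanged(key, state.modRevision)
		if resp.succeeded {
			return state, nil
		}
		// Stale state: refresh from the failure branch and retry,
		// or return not-found if the key is already gone.
		if state, err = stateFromKVs(resp.kvs); err != nil {
			return kv{}, err
		}
	}
}

func main() {
	s := newFakeStore()
	cached := kv{value: "v1", modRevision: s.put("/pods/a", "v1")}

	_, err := conditionalDelete(s, "/pods/a", &cached) // fresh suggestion
	fmt.Println("first delete:", err, "standalone gets:", s.gets)

	_, err = conditionalDelete(s, "/pods/a", &cached) // suggestion of a deleted object
	fmt.Println("second delete:", err, "standalone gets:", s.gets)
}
```

In this sketch the second call never reaches a separate not-found check after a read: the compare fails, the failure branch returns no key-value pairs, and that empty result is what gets mapped to the not-found error.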

Member Author

@wojtek-t wojtek-t left a comment

@liggitt - comments addressed PTAL


@wojtek-t
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 16, 2020
@liggitt
Member

liggitt commented Dec 16, 2020

OK, I'm satisfied this is functionally correct. I still didn't see the performance numbers demonstrating the benefit was worth the additional complexity:

> Do you have evidence that the performance gains are significant? this makes the code quite a bit more complex, and it's not as obvious that deleting from cache is as much of a performance gain as updating from cache.

> Sure - that's a good question. All debugging shows it will help a lot, but we need to run scale tests to provide numbers. Holding before we have them.

Did I miss where those were provided?

@wojtek-t
Member Author

> Did I miss where those were provided?

Sorry, I forgot to add that in the meantime. I have now added the following to the PR description:

> Depending on the run and scale, we've seen between a 10% and 60% reduction in 99th-percentile latency for Delete API calls.

@liggitt - PTAL

@liggitt
Member

liggitt commented Dec 17, 2020

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 17, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

7 participants