3.5.5: client auth failures on new member first startup #14571

veshij · 2022-10-11T07:51:39Z

What happened?

After cluster upgrade from 3.5.4 to 3.5.5 we are seeing auth issues on new nodes after first startup.
Cluster runs in auth-enable mode.

New node joins the cluster, fetches the snapshot, but fails all client requests with:

{"level":"error","ts":"2022-10-11T06:54:28.530Z","caller":"auth/range_perm_cache.go:114","msg":"user doesn't exist","user-name":"XXX","stacktrace":"go.etcd.io/etcd/server/v3/auth.(*authStore).isRangeOpPermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/range_perm_cache.go:114\ngo.etcd.io/etcd/server/v3/auth.(*authStore).isOpPermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/store.go:870\ngo.etcd.io/etcd/server/v3/auth.(*authStore).IsRangePermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/store.go:882\ngo.etcd.io/etcd/server/v3/etcdserver.checkTxnAuth\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:155\ngo.etcd.io/etcd/server/v3/etcdserver.(*authApplierV3).Txn\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:166\ngo.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply.go:177\ngo.etcd.io/etcd/server/v3/etcdserver.(*authApplierV3).Apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:61\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntryNormal\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:2241\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:2143\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1384\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1199\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1122\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\texternal/io_etcd_go_etcd_pkg_v3/schedule/schedule.go:157"}

etcd restart fixes the issue and no requests fail after that.

What did you expect to happen?

New node joins cluster and starts serving requests.

How can we reproduce it (as minimally and precisely as possible)?

WIP

Anything else we need to know?

might be related to
https://github.com/etcd-io/etcd/pull/14227/files ?

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.5
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.18.6
Go OS/Arch: linux/amd64

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
+------------------+---------+----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME      |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+----------------+---------------------------+---------------------------+------------+
| 3dd26ba725e108d1 | started |   dfw5a-rh6-2a | https://10.80.109.66:5100 | https://10.80.109.66:5000 |      false |
| 4438c22e9bd12975 | started |   dfw5a-rg3-5a |  https://10.80.98.60:5100 |  https://10.80.98.60:5000 |      false |
| 6b9cffc673ba60e1 | started | dfw5a-rh11-23b | https://10.80.122.23:5100 | https://10.80.122.23:5000 |      false |
| ca3b971d3167b0cf | started | dfw5a-rh11-21a | https://10.80.122.28:5100 | https://10.80.122.28:5000 |      false |
| cbfad084fd48d0cf | started |  dfw5a-rh11-8a | https://10.80.122.54:5100 | https://10.80.122.54:5000 |      false |
+------------------+---------+----------------+---------------------------+---------------------------+------------+
$ etcdctl --endpoints=<member list> endpoint status -w table
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|     ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.80.109.66:5000 | 3dd26ba725e108d1 |   3.5.5 |  1.5 GB |     false |      false |     11952 |  851807436 |          851807436 |        |
| 10.80.122.23:5000 | 6b9cffc673ba60e1 |   3.5.5 |  1.7 GB |     false |      false |     11952 |  851807436 |          851807436 |        |
| 10.80.122.28:5000 | ca3b971d3167b0cf |   3.5.5 |   12 GB |     false |      false |     11952 |  851807437 |          851807437 |        |
| 10.80.122.54:5000 | cbfad084fd48d0cf |   3.5.5 |  2.8 GB |      true |      false |     11952 |  851807437 |          851807437 |        |
|  10.80.98.60:5000 | 4438c22e9bd12975 |   3.5.5 |  602 MB |     false |      false |     11952 |  851807437 |          851807437 |        |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

No response

The text was updated successfully, but these errors were encountered:

veshij · 2022-10-11T19:04:21Z

Is it possible missing authStore.refreshRangePermCache() call in authStore.Recover() call triggers the issue on first startup?

veshij · 2022-10-11T20:55:24Z

diff --git a/server/auth/store.go b/server/auth/store.go
index b436939b2..e745fb98e 100644
--- a/server/auth/store.go
+++ b/server/auth/store.go
@@ -377,6 +377,7 @@ func (as *authStore) Recover(be backend.Backend) {
        as.enabled = enabled
        if enabled {
                as.tokenProvider.enable()
+               as.refreshRangePermCache(tx)
        }
        as.enabledMu.Unlock()
 }

seems like fixed an issue.

During the troubleshooting I found that this issue lead to data inconsistency in the cluster.
Is it possible new nodes with empty auth cache fail to apply entries from raft log due to auth failures but still advance the position in the log?

euroelessar · 2022-10-11T21:11:53Z

@mitake @ahrtr @serathius Please have a look, it looks like 3.5.5 is prone to data inconsistencies in newly added members if auth is enabled. It is a regression compared to 3.5.4.

ahrtr · 2022-10-11T21:37:50Z

@veshij Thanks for raising this issue. Good catch! Your proposed change looks good to me.

Previously we fixed a similar issue and fixed it in 14358 , but missed this case.

Please feel free to deliver a PR for this.

Please note that this issue can be easily workaround. FYI. #14355 (comment) #14355 (comment)

veshij · 2022-10-11T21:39:00Z

Thank you, will cut a PR a bit later today.

yishuT · 2022-10-12T00:59:57Z

@ahrtr Can we add more tests for auth enabled mode? I looked at couple integration tests, but seems none of them has auth enabled?e.g. https://github.com/etcd-io/etcd/blob/main/tests/integration/cluster_test.go

ahrtr · 2022-10-12T03:01:45Z

Please see #14574 (comment)

The auth related test cases should be included in v3_auth_test.go. cc @mitake

mitake · 2022-10-20T14:18:07Z

@euroelessar @yishuT @veshij Sorry for the issue... and thanks for fixing it. Let me check the PR.
Currently auth related tests are independent from others. I guess it's valuable to think about auth enabled version of other integration test cases for detecting regression like this. Changing the test code structure will be quite costly though. How do you think @ahrtr ?

ahrtr · 2022-10-28T07:04:16Z

Note that this is an important issue although it's simple fix, because new added etcd member may run into data inconsistency issue if auth is enabled.

Note that network partition might also lead to this issue. If users setup & enable auth during network partition, when the isolated member rejoins the cluster, it may recover from a snapshot, and so run into this issue as well. Usually I don't think users will do this operation (setup & enable auth during network partition), so it should be low possibility to run into this issue due to network partition.

Please refer to my summary in https://github.com/ahrtr/etcd-issues/tree/master/issues/14571

ahrtr · 2022-10-29T05:29:47Z

Next steps:

Backport [test] Support dynamically adding & starting new member #14589 to 3.5 and probably 3.4. @biosvs Please let me know if you have bandwith to do this.
Backport the e2e test case in etcdserver: call refreshRangePermCache on Recover() in AuthStore. #14574 to 3.5 and probably 3.4. @veshij Please let me know if you have bandwidth to do this. Of course, you need to do it after above task is done.

veshij · 2022-10-29T05:37:21Z

Yep, I’ll backport once test cases are backported Пт, 28 окт. 2022 г. в 22:30, Benjamin Wang ***@***.***>:

Next steps: 1. Backport #14589 <#14589> to 3.5 and probably 3.4. @biosvs <https://github.com/biosvs> Please let me know if you have bandwith to do this. 2. Backport the e2e test case in #14574 <#14574> to 3.5 and probably 3.4. @veshij <https://github.com/veshij> Please let me know if you have bandwidth to do this. Of course, you need to do it after above task is done. — Reply to this email directly, view it on GitHub <#14571 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABOURUNPFQKHL23G2CXL3A3WFSY5PANCNFSM6AAAAAARCABEBY> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- Oleg Guba

ahrtr · 2022-10-30T06:54:15Z

Just added an e2e test for 3.5. FYI. #14656

serathius · 2022-10-30T07:28:40Z

FYI me and @chaochn47 are working on adding linearizability test case for authorization. They should allows us to test other scenarios where injected failure could cause a authorization issue. Following known data inconsistency should also allow us to focus our work on features that had issues in past.

ahrtr · 2022-10-30T09:04:28Z

FYI me and @chaochn47 are working on adding linearizability test case for authorization. They should allows us to test other scenarios where injected failure could cause a authorization issue. Following known data inconsistency should also allow us to focus our work on features that had issues in past.

Looks good to me.

All related PRs are merged. I think it's ready to release 3.4.22 and 3.5.6. I will release 3.4.22 sometime in the following week. Please add your comments or concerns in #14651 . @mitake @veshij @serathius @spzala @ptabor

ahrtr · 2022-11-02T01:13:34Z

FYI. v3.4.22 is just released, which includes the fix to this issue. https://github.com/etcd-io/etcd/releases/tag/v3.4.22

@serathius I suggest you to release 3.5.6 as well.

veshij added the type/bug label Oct 11, 2022

veshij changed the title ~~3.5.5.: client auth failures on first startup~~ 3.5.5: client auth failures on first startup Oct 11, 2022

ahrtr added the area/auth label Oct 11, 2022

veshij mentioned this issue Oct 11, 2022

etcdserver: call refreshRangePermCache on Recover() in AuthStore. #14574

Merged

ahrtr assigned veshij Oct 13, 2022

serathius mentioned this issue Oct 27, 2022

Introduce etcd linearizability tests #14045

Closed

33 tasks

ahrtr closed this as completed in #14574 Oct 27, 2022

ahrtr changed the title ~~3.5.5: client auth failures on first startup~~ 3.5.5: client auth failures on new member first startup Oct 28, 2022

ahrtr added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 28, 2022

This was referenced Oct 29, 2022

server: add unit test to cover the case of refreshing auth info when etcd recovers from a snapshot #14647

Merged

server/auth: invalidate range permission cache during recovering from… #13920

Closed

ahrtr mentioned this issue Oct 30, 2022

test: added e2e test case for issue 14571: etcd doesn't load auth info when recovering from a snapshot #14656

Merged

ahrtr mentioned this issue Oct 30, 2022

Plans for v3.4.22 release #14651

Closed

4 tasks

serathius mentioned this issue Nov 14, 2022

Release v3.5.6 #14750

Closed

22 tasks

chaochn47 mentioned this issue Jan 6, 2023

tests linearizability: reproduce and prevent 14571 #14819

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

3.5.5: client auth failures on new member first startup #14571

3.5.5: client auth failures on new member first startup #14571

veshij commented Oct 11, 2022

paste your configuration here

veshij commented Oct 11, 2022

veshij commented Oct 11, 2022

euroelessar commented Oct 11, 2022

ahrtr commented Oct 11, 2022

veshij commented Oct 11, 2022

yishuT commented Oct 12, 2022

ahrtr commented Oct 12, 2022

mitake commented Oct 20, 2022 •

edited

Loading

ahrtr commented Oct 28, 2022 •

edited

Loading

ahrtr commented Oct 29, 2022

veshij commented Oct 29, 2022 via email

ahrtr commented Oct 30, 2022

serathius commented Oct 30, 2022

ahrtr commented Oct 30, 2022

ahrtr commented Nov 2, 2022

3.5.5: client auth failures on new member first startup #14571

3.5.5: client auth failures on new member first startup #14571

Comments

veshij commented Oct 11, 2022

What happened?

What did you expect to happen?

How can we reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

Etcd version (please run commands below)

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

Relevant log output

veshij commented Oct 11, 2022

veshij commented Oct 11, 2022

euroelessar commented Oct 11, 2022

ahrtr commented Oct 11, 2022

veshij commented Oct 11, 2022

yishuT commented Oct 12, 2022

ahrtr commented Oct 12, 2022

mitake commented Oct 20, 2022 • edited Loading

ahrtr commented Oct 28, 2022 • edited Loading

ahrtr commented Oct 29, 2022

veshij commented Oct 29, 2022 via email

ahrtr commented Oct 30, 2022

serathius commented Oct 30, 2022

ahrtr commented Oct 30, 2022

ahrtr commented Nov 2, 2022

mitake commented Oct 20, 2022 •

edited

Loading

ahrtr commented Oct 28, 2022 •

edited

Loading