
3.5.5: client auth failures on new member first startup #14571

Closed · Tracked by #14750
veshij opened this issue Oct 11, 2022 · 15 comments · Fixed by #14574
Labels: area/auth, type/bug, priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.)

Comments

veshij (Contributor) commented Oct 11, 2022

What happened?

After upgrading a cluster from 3.5.4 to 3.5.5, we are seeing auth issues on new nodes after their first startup.
The cluster runs with auth enabled.

A new node joins the cluster and fetches the snapshot, but fails all client requests with:

{"level":"error","ts":"2022-10-11T06:54:28.530Z","caller":"auth/range_perm_cache.go:114","msg":"user doesn't exist","user-name":"XXX","stacktrace":"go.etcd.io/etcd/server/v3/auth.(*authStore).isRangeOpPermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/range_perm_cache.go:114\ngo.etcd.io/etcd/server/v3/auth.(*authStore).isOpPermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/store.go:870\ngo.etcd.io/etcd/server/v3/auth.(*authStore).IsRangePermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/store.go:882\ngo.etcd.io/etcd/server/v3/etcdserver.checkTxnAuth\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:155\ngo.etcd.io/etcd/server/v3/etcdserver.(*authApplierV3).Txn\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:166\ngo.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply.go:177\ngo.etcd.io/etcd/server/v3/etcdserver.(*authApplierV3).Apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:61\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntryNormal\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:2241\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:2143\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1384\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1199\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1122\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\texternal/io_etcd_go_etcd_pkg_v3/schedule/schedule.go:157"}

Restarting etcd fixes the issue; no requests fail after that.

What did you expect to happen?

The new node joins the cluster and starts serving requests.

How can we reproduce it (as minimally and precisely as possible)?

WIP

Anything else we need to know?

Might be related to https://github.com/etcd-io/etcd/pull/14227/files?

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.5
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.18.6
Go OS/Arch: linux/amd64

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run the commands below; feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
+------------------+---------+----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME      |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+----------------+---------------------------+---------------------------+------------+
| 3dd26ba725e108d1 | started |   dfw5a-rh6-2a | https://10.80.109.66:5100 | https://10.80.109.66:5000 |      false |
| 4438c22e9bd12975 | started |   dfw5a-rg3-5a |  https://10.80.98.60:5100 |  https://10.80.98.60:5000 |      false |
| 6b9cffc673ba60e1 | started | dfw5a-rh11-23b | https://10.80.122.23:5100 | https://10.80.122.23:5000 |      false |
| ca3b971d3167b0cf | started | dfw5a-rh11-21a | https://10.80.122.28:5100 | https://10.80.122.28:5000 |      false |
| cbfad084fd48d0cf | started |  dfw5a-rh11-8a | https://10.80.122.54:5100 | https://10.80.122.54:5000 |      false |
+------------------+---------+----------------+---------------------------+---------------------------+------------+
$ etcdctl --endpoints=<member list> endpoint status -w table
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|     ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.80.109.66:5000 | 3dd26ba725e108d1 |   3.5.5 |  1.5 GB |     false |      false |     11952 |  851807436 |          851807436 |        |
| 10.80.122.23:5000 | 6b9cffc673ba60e1 |   3.5.5 |  1.7 GB |     false |      false |     11952 |  851807436 |          851807436 |        |
| 10.80.122.28:5000 | ca3b971d3167b0cf |   3.5.5 |   12 GB |     false |      false |     11952 |  851807437 |          851807437 |        |
| 10.80.122.54:5000 | cbfad084fd48d0cf |   3.5.5 |  2.8 GB |      true |      false |     11952 |  851807437 |          851807437 |        |
|  10.80.98.60:5000 | 4438c22e9bd12975 |   3.5.5 |  602 MB |     false |      false |     11952 |  851807437 |          851807437 |        |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

No response

veshij changed the title from "3.5.5.: client auth failures on first startup" to "3.5.5: client auth failures on first startup" on Oct 11, 2022
veshij (Contributor, Author) commented Oct 11, 2022

Is it possible that a missing authStore.refreshRangePermCache() call in authStore.Recover() triggers the issue on first startup?
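A minimal sketch of the suspected failure mode, using simplified, assumed names rather than the real etcd types: the permission check consults only an in-memory cache, so if Recover() restores the backend state without repopulating that cache, every request is denied even though the user exists in the store.

```go
// Hypothetical illustration, not the real etcd code: field and method
// names are simplified assumptions based on the stack trace above.
package main

import "fmt"

type authStore struct {
	users          map[string]bool // persisted in the backend, survives Recover
	rangePermCache map[string]bool // in-memory only, must be rebuilt explicitly
}

// Recover restores auth state from the backend. If it skips rebuilding
// rangePermCache, the cache stays empty after a snapshot restore.
func (as *authStore) Recover(refreshCache bool) {
	as.users["XXX"] = true // the user exists in the restored backend
	if refreshCache {
		as.rangePermCache["XXX"] = true // what refreshRangePermCache would do
	}
}

// isRangeOpPermitted consults only the cache, so an empty cache denies
// every request with "user doesn't exist".
func (as *authStore) isRangeOpPermitted(user string) bool {
	return as.rangePermCache[user]
}

func main() {
	as := &authStore{users: map[string]bool{}, rangePermCache: map[string]bool{}}
	as.Recover(false)
	fmt.Println(as.isRangeOpPermitted("XXX")) // false until etcd restarts
}
```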

veshij (Contributor, Author) commented Oct 11, 2022

diff --git a/server/auth/store.go b/server/auth/store.go
index b436939b2..e745fb98e 100644
--- a/server/auth/store.go
+++ b/server/auth/store.go
@@ -377,6 +377,7 @@ func (as *authStore) Recover(be backend.Backend) {
        as.enabled = enabled
        if enabled {
                as.tokenProvider.enable()
+               as.refreshRangePermCache(tx)
        }
        as.enabledMu.Unlock()
 }

This seems to fix the issue.

During troubleshooting I found that this issue leads to data inconsistency in the cluster.
Is it possible that new nodes with an empty auth cache fail to apply entries from the raft log due to auth failures, but still advance their position in the log?
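A sketch of that divergence scenario, again with assumed names (an illustration, not the real etcd apply loop): the applied index advances even when the auth check rejects an entry, so a member with an empty cache silently drops writes that the rest of the cluster applies.

```go
// Simplified model of two members applying the same raft entry.
package main

import "fmt"

type entry struct{ user, key, val string }

type member struct {
	applied int               // applied index, advances unconditionally
	kv      map[string]string // the member's key space
	cache   map[string]bool   // range-permission cache
}

func (m *member) apply(idx int, e entry) {
	defer func() { m.applied = idx }() // index advances even on auth failure
	if !m.cache[e.user] {
		fmt.Printf("member denies entry %d: user doesn't exist\n", idx)
		return // write skipped: this is the divergence
	}
	m.kv[e.key] = e.val
}

func main() {
	healthy := &member{kv: map[string]string{}, cache: map[string]bool{"alice": true}}
	fresh := &member{kv: map[string]string{}, cache: map[string]bool{}} // empty after Recover

	e := entry{user: "alice", key: "k", val: "v"}
	healthy.apply(1, e)
	fresh.apply(1, e)

	fmt.Println(healthy.kv, healthy.applied) // map[k:v] 1
	fmt.Println(fresh.kv, fresh.applied)     // map[] 1 -> same index, different data
}
```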

euroelessar commented:

@mitake @ahrtr @serathius Please have a look; it looks like 3.5.5 is prone to data inconsistencies in newly added members when auth is enabled. This is a regression from 3.5.4.

ahrtr (Member) commented Oct 11, 2022

@veshij Thanks for raising this issue. Good catch! Your proposed change looks good to me.

We previously fixed a similar issue in #14358, but missed this case.

Please feel free to deliver a PR for this.

Please note that this issue can easily be worked around. FYI: #14355 (comment) #14355 (comment)

veshij (Contributor, Author) commented Oct 11, 2022

Thank you, will cut a PR a bit later today.

yishuT (Contributor) commented Oct 12, 2022

@ahrtr Can we add more tests for auth-enabled mode? I looked at a couple of integration tests, but it seems none of them has auth enabled, e.g. https://github.com/etcd-io/etcd/blob/main/tests/integration/cluster_test.go
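A rough sketch of what such a test could look like, using only the public clientv3 API; the member-addition step is a hypothetical placeholder, since it depends on the integration test framework rather than anything shown in this thread:

```go
package auth_test

import (
	"context"
	"testing"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func TestNewMemberServesAuthedRequests(t *testing.T) {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		t.Fatal(err)
	}
	defer cli.Close()

	// Enable auth with a root user, mirroring an auth-enabled cluster.
	ctx := context.Background()
	if _, err := cli.RoleAdd(ctx, "root"); err != nil {
		t.Fatal(err)
	}
	if _, err := cli.UserAdd(ctx, "root", "pwd"); err != nil {
		t.Fatal(err)
	}
	if _, err := cli.UserGrantRole(ctx, "root", "root"); err != nil {
		t.Fatal(err)
	}
	if _, err := cli.AuthEnable(ctx); err != nil {
		t.Fatal(err)
	}

	// addMember is hypothetical: start a new member that recovers from a
	// snapshot, then verify an authenticated request succeeds on it
	// without requiring a restart.
	// newMemberEndpoint := addMember(t)
	// authedCli, _ := clientv3.New(clientv3.Config{
	// 	Endpoints: []string{newMemberEndpoint},
	// 	Username:  "root", Password: "pwd",
	// })
	// if _, err := authedCli.Get(ctx, "foo"); err != nil {
	// 	t.Fatalf("request on new member failed: %v", err)
	// }
}
```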

ahrtr (Member) commented Oct 12, 2022

Please see #14574 (comment)

The auth-related test cases should be included in v3_auth_test.go. cc @mitake

mitake (Contributor) commented Oct 20, 2022

@euroelessar @yishuT @veshij Sorry for the issue... and thanks for fixing it. Let me check the PR.
Currently the auth-related tests are independent from the others. I think it would be valuable to consider auth-enabled versions of the other integration test cases to catch regressions like this, though restructuring the test code would be quite costly. What do you think, @ahrtr?

ahrtr changed the title from "3.5.5: client auth failures on first startup" to "3.5.5: client auth failures on new member first startup" on Oct 28, 2022
ahrtr added the priority/important-soon label on Oct 28, 2022
ahrtr (Member) commented Oct 28, 2022

Note that although the fix is simple, this is an important issue, because a newly added etcd member may run into data inconsistency if auth is enabled.

Note that a network partition might also lead to this issue. If users set up and enable auth during a network partition, then when the isolated member rejoins the cluster it may recover from a snapshot and run into this issue as well. I don't expect users to set up and enable auth during a network partition, so the probability of hitting this issue that way should be low.

Please refer to my summary in https://github.com/ahrtr/etcd-issues/tree/master/issues/14571

ahrtr (Member) commented Oct 29, 2022

Next steps:

  1. Backport [test] Support dynamically adding & starting new member #14589 to 3.5 and probably 3.4. @biosvs Please let me know if you have bandwidth to do this.
  2. Backport the e2e test case in etcdserver: call refreshRangePermCache on Recover() in AuthStore. #14574 to 3.5 and probably 3.4. @veshij Please let me know if you have bandwidth to do this. Of course, this needs to be done after the task above is finished.

veshij (Contributor, Author) commented Oct 29, 2022 via email

ahrtr (Member) commented Oct 30, 2022

Just added an e2e test for 3.5. FYI. #14656

serathius (Member) commented:

FYI, @chaochn47 and I are working on adding a linearizability test case for authorization. It should allow us to test other scenarios where an injected failure could cause an authorization issue. Following up on known data inconsistencies should also help us focus our work on features that have had issues in the past.

ahrtr mentioned this issue on Oct 30, 2022
ahrtr (Member) commented Oct 30, 2022

> FYI, @chaochn47 and I are working on adding a linearizability test case for authorization. It should allow us to test other scenarios where an injected failure could cause an authorization issue. Following up on known data inconsistencies should also help us focus our work on features that have had issues in the past.

Looks good to me.

All related PRs are merged, so I think we're ready to release 3.4.22 and 3.5.6. I will release 3.4.22 sometime in the coming week. Please add your comments or concerns in #14651. @mitake @veshij @serathius @spzala @ptabor

ahrtr (Member) commented Nov 2, 2022

FYI, v3.4.22 has just been released and includes the fix for this issue: https://github.com/etcd-io/etcd/releases/tag/v3.4.22

@serathius I suggest you release 3.5.6 as well.
