Enable static consistent hash ring #18183
Actually, if it's etcd, it has already been using a static consistent hash ring.
@lucyge2022 This PR makes us use a dynamic consistent hash ring by default (whether we use ETCD or master). In this PR, I change the logic to get
@jja725 Wondering whether the hash ring change impacts the distributedMv-related logic?
@yyongycy This doesn't affect cp/mv, but it probably affects distributed load; I have to take a closer look.
Actually, we do want to affect distributed load; that is the reason for this change. If a worker is temporarily offline, a static consistent hash ring lets the user avoid writing data to other worker nodes, both when executing the load command and when reading a file cached on the offline worker.
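To make the distributed-load argument concrete, here is a minimal sketch of a consistent hash ring with virtual nodes. It is not Alluxio's implementation: the MD5-derived 64-bit positions, the `TreeMap` ring, the `RingRemapDemo`/`HashRing` names, the worker names and the `/data/file-<i>` paths are all assumptions for illustration. It compares a static ring built from every registered worker with a dynamic ring built from live workers only, and counts how many paths change owner when one worker goes offline.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RingRemapDemo {
  /** Minimal consistent hash ring with virtual nodes (illustrative, not Alluxio's). */
  static final class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    HashRing(List<String> workers, int virtualNodesPerWorker) {
      for (String worker : workers) {
        for (int i = 0; i < virtualNodesPerWorker; i++) {
          ring.put(hash(worker + "#" + i), worker);
        }
      }
    }

    /** Owner of the first virtual node at or after the path's position, wrapping around. */
    String lookup(String path) {
      Map.Entry<Long, String> entry = ring.ceilingEntry(hash(path));
      return (entry != null ? entry : ring.firstEntry()).getValue();
    }
  }

  /** Assumed hash: the first 8 bytes of MD5, packed into a long. */
  static long hash(String s) {
    try {
      byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
      long h = 0;
      for (int i = 0; i < 8; i++) {
        h = (h << 8) | (d[i] & 0xFF);
      }
      return h;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Returns {paths whose owner differs between the rings, paths the static ring gives the offline worker}. */
  static int[] compare(List<String> allWorkers, String offlineWorker, int vnodes, int numPaths) {
    List<String> liveWorkers = new ArrayList<>(allWorkers);
    liveWorkers.remove(offlineWorker);
    HashRing staticRing = new HashRing(allWorkers, vnodes);   // every registered worker
    HashRing dynamicRing = new HashRing(liveWorkers, vnodes); // live workers only
    int remapped = 0;
    int ownedByOffline = 0;
    for (int i = 0; i < numPaths; i++) {
      String path = "/data/file-" + i;
      String staticOwner = staticRing.lookup(path);
      if (staticOwner.equals(offlineWorker)) {
        ownedByOffline++;
      }
      if (!staticOwner.equals(dynamicRing.lookup(path))) {
        remapped++;
      }
    }
    return new int[] {remapped, ownedByOffline};
  }

  public static void main(String[] args) {
    int[] r = compare(List.of("worker-0", "worker-1", "worker-2"), "worker-2", 100, 1000);
    System.out.println("remapped=" + r[0] + ", ownedByOffline=" + r[1]);
  }
}
```

The two counts come out equal: with the dynamic ring, exactly the paths owned by the offline worker are remapped to live workers (so a load or read would cache them on another node), while paths owned by live workers keep their owner. The static ring keeps mapping those paths to the offline worker, which is what lets the client avoid writing them elsewhere.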
Then we can update DefaultWorkerProvider.getWorkerInfos |
Overall LGTM; it doesn't look like a big change. Master-based worker registration would eventually become etcd-based worker registration.
I just need some data on how long it takes to build the consistent hash ring with 400K virtual nodes (100 physical nodes).
The remaining optimizations can be done later.
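One rough way to gather the requested build-time data is a micro-benchmark like the sketch below. It assumes 4,000 virtual nodes per worker (400K / 100 physical nodes), MD5-derived positions, a `TreeMap` as the ring and worker names of the form `worker-<i>`; Alluxio's actual ring may hash and store entries differently, so the numbers would only be indicative. A careful measurement would use a harness such as JMH.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeMap;

final class RingBuildTiming {
  /** Builds a ring of workers * vnodesPerWorker entries (hypothetical layout, not Alluxio's). */
  static TreeMap<Long, String> buildRing(int workers, int vnodesPerWorker) {
    MessageDigest md5;
    try {
      md5 = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
    TreeMap<Long, String> ring = new TreeMap<>();
    for (int w = 0; w < workers; w++) {
      String worker = "worker-" + w;
      for (int v = 0; v < vnodesPerWorker; v++) {
        // digest() resets the MessageDigest, so one instance is reused for every vnode.
        byte[] d = md5.digest((worker + "#" + v).getBytes(StandardCharsets.UTF_8));
        long pos = 0;
        for (int i = 0; i < 8; i++) {
          pos = (pos << 8) | (d[i] & 0xFF);
        }
        ring.put(pos, worker);
      }
    }
    return ring;
  }

  public static void main(String[] args) {
    // Repeat a few times: the first run also pays class-loading and JIT warm-up costs.
    for (int run = 0; run < 3; run++) {
      long start = System.nanoTime();
      TreeMap<Long, String> ring = buildRing(100, 4_000);
      long millis = (System.nanoTime() - start) / 1_000_000;
      System.out.println("run " + run + ": " + ring.size() + " vnodes in " + millis + " ms");
    }
  }
}
```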
dora/core/client/fs/src/main/java/alluxio/client/block/BlockWorkerInfo.java
dora/core/client/fs/src/main/java/alluxio/client/file/FileSystemContext.java
I updated
BTW, please do test it.
LGTM, based on the tests done.
My understanding is that "master" mode is only kept for backward compatibility and will not be used for membership in the future.
Yes, it is only kept for backward compatibility. But some features still depend on master, for example
dora/core/client/fs/src/main/java/alluxio/client/block/RetryHandlingBlockMasterClient.java
LGTM, but I'm still a little fuzzy on the exact behavior/exception when we lose a worker that is still running a task, or when we send a request to a lost worker. Do you think we can add some comments on those behaviors?
The request will throw an exception after a timeout. If we want to add a comment for this case, we need to identify the exact exception type.
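The timeout behavior can be sketched with plain `java.util.concurrent`. The worker and UFS calls below (`readFromWorker`, `readFromUfs`) are hypothetical stand-ins, and `TimeoutException` is simply what `CompletableFuture.get(long, TimeUnit)` throws; the exact exception type raised by Alluxio's client still has to be identified. Falling back to the UFS mirrors the read behavior the PR describes for a down worker.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutFallbackDemo {
  /** Hypothetical stand-in for an RPC to a worker that may be offline. */
  static CompletableFuture<String> readFromWorker(boolean workerAlive, String path) {
    if (workerAlive) {
      return CompletableFuture.completedFuture("worker:" + path);
    }
    // A lost worker never answers, so this future never completes.
    return new CompletableFuture<>();
  }

  /** Hypothetical stand-in for reading the file directly from the UFS. */
  static String readFromUfs(String path) {
    return "ufs:" + path;
  }

  static String read(boolean workerAlive, String path, long timeoutMillis) {
    try {
      return readFromWorker(workerAlive, path).get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException | ExecutionException e) {
      // The worker request timed out or failed: fall back to the UFS.
      return readFromUfs(path);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status for callers
      throw new IllegalStateException("interrupted while waiting for worker", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(read(true, "/a", 100));  // served by the worker
    System.out.println(read(false, "/a", 100)); // times out, served from the UFS
  }
}
```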
alluxio-bot, merge this please
By default, we build a dynamic consistent hash ring from the live worker list obtained from the master or ETCD. Sometimes we want a static consistent hash ring instead, to make sure we won't write data to other worker nodes when a worker node is temporarily offline (especially when the other worker nodes are running out of disk space).
This PR allows us to build a static consistent hash ring by setting
alluxio.user.dynamic.consistent.hash.ring.enabled=false
In this case, the client will read from the UFS if the worker where the specified file is located is down.
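A sketch of how a client might choose the ring members from this property and fall back to the UFS. Only the property name comes from this PR; reading it from `java.util.Properties`, the default of `true` (matching the dynamic-by-default behavior described above), the `RingModeDemo` helper names and the ring details (MD5 positions, 100 virtual nodes per worker) are assumptions. The ring is rebuilt on every call for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.TreeMap;

final class RingModeDemo {
  // Property name from this PR; how it is read here is an assumption.
  static final String DYNAMIC_RING_KEY = "alluxio.user.dynamic.consistent.hash.ring.enabled";
  static final String UFS = "UFS";

  /** Ring members: all registered workers (static) or only live ones (dynamic). */
  static List<String> ringMembers(Properties conf, List<String> registered, Set<String> live) {
    boolean dynamic = Boolean.parseBoolean(conf.getProperty(DYNAMIC_RING_KEY, "true")); // assumed default
    if (!dynamic) {
      return registered; // static ring keeps temporarily offline workers
    }
    List<String> members = new ArrayList<>();
    for (String worker : registered) {
      if (live.contains(worker)) {
        members.add(worker);
      }
    }
    return members;
  }

  /** Consistent-hash owner of a path; assumes at least one member. */
  static String owner(List<String> members, String path, int vnodesPerWorker) {
    TreeMap<Long, String> ring = new TreeMap<>();
    for (String worker : members) {
      for (int i = 0; i < vnodesPerWorker; i++) {
        ring.put(hash(worker + "#" + i), worker);
      }
    }
    Map.Entry<Long, String> entry = ring.ceilingEntry(hash(path));
    return (entry != null ? entry : ring.firstEntry()).getValue();
  }

  /** Where a read is served from: the owning worker, or the UFS if that worker is down. */
  static String readSource(Properties conf, List<String> registered, Set<String> live, String path) {
    String owner = owner(ringMembers(conf, registered, live), path, 100);
    return live.contains(owner) ? owner : UFS;
  }

  /** Assumed hash: the first 8 bytes of MD5, packed into a long. */
  static long hash(String s) {
    try {
      byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
      long h = 0;
      for (int i = 0; i < 8; i++) {
        h = (h << 8) | (d[i] & 0xFF);
      }
      return h;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    List<String> registered = List.of("worker-0", "worker-1", "worker-2");
    Set<String> live = Set.of("worker-0", "worker-1"); // worker-2 is temporarily offline
    Properties staticConf = new Properties();
    staticConf.setProperty(DYNAMIC_RING_KEY, "false");
    for (int i = 0; i < 5; i++) {
      String path = "/data/file-" + i;
      System.out.println(path + " static=" + readSource(staticConf, registered, live, path)
          + " dynamic=" + readSource(new Properties(), registered, live, path));
    }
  }
}
```

With the property set to `false`, a path owned by the offline worker is read from the UFS rather than cached on another worker; with the default, the same path is remapped to a live worker. Paths owned by live workers resolve to the same worker in both modes.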