Enable static consistent hash ring #18183

JiamingMai · 2023-09-20T18:26:57Z

By default, we build a dynamic consistent hash ring from the live worker list that comes from the master or ETCD. Sometimes we want to build a static consistent hash ring so that data is not written to other worker nodes when a worker node is temporarily offline (especially when the other worker nodes are running out of disk space).

This PR allows us to build a static consistent hash ring by setting alluxio.user.dynamic.consistent.hash.ring.enabled=false. In this case, the client reads from UFS if the worker where the specified file is located is down.
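To make the difference concrete, here is a minimal sketch of the two ring modes. It is not the Alluxio implementation: the class name, the vnode count, the FNV-1a hash, and the "UFS" return value are all assumptions chosen for illustration. It only shows the idea described above, that a static ring keeps a file mapped to its offline worker (so the read falls back to UFS), while a dynamic ring remaps the file to a live worker.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch only; none of these names come from the Alluxio code base.
class HashRingSketch {
  static final int VNODES_PER_WORKER = 100; // assumed value for the demo

  // FNV-1a 64-bit hash, used here only because it is simple and deterministic.
  static long hash(String s) {
    long h = 0xcbf29ce484222325L;
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      h ^= (b & 0xff);
      h *= 0x100000001b3L;
    }
    return h;
  }

  // Places VNODES_PER_WORKER virtual nodes per worker on the ring.
  static TreeMap<Long, String> buildRing(List<String> workers) {
    TreeMap<Long, String> ring = new TreeMap<>();
    for (String worker : workers) {
      for (int i = 0; i < VNODES_PER_WORKER; i++) {
        ring.put(hash(worker + "#" + i), worker);
      }
    }
    return ring;
  }

  // Picks the first vnode at or after the path's hash, wrapping around to the start.
  static String pickWorker(TreeMap<Long, String> ring, String path) {
    Map.Entry<Long, String> entry = ring.ceilingEntry(hash(path));
    return (entry != null ? entry : ring.firstEntry()).getValue();
  }

  // Returns the worker that would serve the read, or "UFS" when the owner is offline.
  static String readSource(List<String> allWorkers, Set<String> liveWorkers,
      boolean dynamicRing, String path) {
    List<String> members = new ArrayList<>();
    for (String worker : allWorkers) {
      // A dynamic ring contains live workers only; a static ring contains every worker.
      if (!dynamicRing || liveWorkers.contains(worker)) {
        members.add(worker);
      }
    }
    if (members.isEmpty()) {
      return "UFS";
    }
    String owner = pickWorker(buildRing(members), path);
    return liveWorkers.contains(owner) ? owner : "UFS";
  }
}
```

With three workers and one of them offline, `readSource(all, live, false, path)` returns "UFS" for any path owned by the offline worker, whereas `readSource(all, live, true, path)` returns one of the two live workers for the same path.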

lucyge2022 · 2023-09-20T23:17:04Z

Actually, if it's etcd, it has already been using a static consistent hash ring.

JiamingMai · 2023-09-21T02:07:57Z

@lucyge2022 This PR makes us use the dynamic consistent hash ring by default (whether we use ETCD or the master). In this PR, I change the logic to call getLiveMembers instead of getAllMembers on ETCD by default. This lets users choose the behavior they want.
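The selection described here can be sketched as follows. The `Membership` interface is a hypothetical stand-in: only the method names getLiveMembers and getAllMembers appear in this thread, and their real signatures and return types are not shown here, so the `List<String>` types below are assumptions.

```java
import java.util.List;

// Illustrative sketch only; the interface and return types are assumptions.
class MembershipSelectionSketch {
  // Hypothetical stand-in for the ETCD-backed or master-backed membership source.
  interface Membership {
    List<String> getAllMembers();
    List<String> getLiveMembers();
  }

  // Dynamic ring (the default): live workers only. Static ring: every registered worker.
  static List<String> ringMembers(Membership membership, boolean dynamicRingEnabled) {
    return dynamicRingEnabled ? membership.getLiveMembers() : membership.getAllMembers();
  }
}
```

Here `dynamicRingEnabled` would correspond to the value of alluxio.user.dynamic.consistent.hash.ring.enabled.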

common/transport/src/main/proto/grpc/block_master.proto

yyongycy · 2023-09-21T02:14:56Z

@jja725 I'm wondering whether the hash ring change impacts the distributedMv-related logic?

jja725 · 2023-09-21T02:17:03Z

@yyongycy This doesn't affect cp/mv, but it probably affects distributed load. I have to take a closer look.

JiamingMai · 2023-09-21T02:23:55Z

Actually, we do want to affect distributed load; that is why we want this change. If a worker is temporarily offline, a static consistent hash ring lets the user avoid writing data to other worker nodes when executing the load command and when reading a file cached on the offline worker.

jja725 · 2023-09-21T02:28:28Z

Then we can update DefaultWorkerProvider.getWorkerInfos

yyongycy

Overall LGTM; it doesn't look like a big change. Master-based worker registration would eventually become etcd-based worker registration.

We just need some data on how long it takes to build the consistent hash ring with 400K vnodes (100 physical nodes).
The remaining optimizations can be done later.
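One rough way to collect that number is sketched below. It times a TreeMap-based ring with 100 physical nodes and 4,000 vnodes each (400K in total). The ring layout, the hash function, and the class name are assumptions for illustration and do not mirror the actual Alluxio ring, so the result is only an order-of-magnitude indication; a single System.nanoTime measurement also includes JIT warm-up, and a harness such as JMH would give more reliable figures.

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;

// Illustrative timing sketch only; not the Alluxio hash ring implementation.
class RingBuildTimingSketch {
  // FNV-1a 64-bit hash, used here only because it is simple and deterministic.
  static long hash(String s) {
    long h = 0xcbf29ce484222325L;
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      h ^= (b & 0xff);
      h *= 0x100000001b3L;
    }
    return h;
  }

  // Maps each vnode position on the ring to the index of its physical node.
  static TreeMap<Long, Integer> build(int physicalNodes, int vnodesPerNode) {
    TreeMap<Long, Integer> ring = new TreeMap<>();
    for (int node = 0; node < physicalNodes; node++) {
      for (int v = 0; v < vnodesPerNode; v++) {
        ring.put(hash("worker-" + node + "#" + v), node);
      }
    }
    return ring;
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    TreeMap<Long, Integer> ring = build(100, 4000);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("Built " + ring.size() + " vnodes in " + elapsedMs + " ms");
  }
}
```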

dora/core/client/fs/src/main/java/alluxio/client/block/BlockWorkerInfo.java

dora/core/client/fs/src/main/java/alluxio/client/file/FileSystemContext.java

JiamingMai · 2023-09-21T07:22:14Z

I updated DefaultWorkerProvider.getWorkerInfos. Please take a look when you have time. @jja725

yyongycy · 2023-09-21T09:14:21Z

btw, please do test it.

yyongycy

lgtm based on test done

apc999 · 2023-09-21T20:22:25Z

My understanding is that "master" mode is kept only for backward compatibility and will not be used for membership in the future.
Correct?

JiamingMai · 2023-09-22T02:27:45Z

Yes, it is kept only for backward compatibility. But some features still depend on the master; for example, the load command requires the master, which hosts the Scheduler for distributed load.

dora/core/client/fs/src/main/java/alluxio/client/block/RetryHandlingBlockMasterClient.java

jja725

LGTM, but I'm still a little fuzzy on the exact behavior/exception when we lose a worker that is still running a task, or when we send a request to a lost worker. Do you think we can add some comments on those behaviors?

JiamingMai · 2023-09-24T15:35:45Z

The request will throw an exception after the timeout. To add a comment for this case, we first need to identify the exact exception type.

JiamingMai · 2023-09-24T16:15:17Z

alluxio-bot, merge this please

JiamingMai added the type-feature label (This issue is a feature request) Sep 20, 2023

JiamingMai self-assigned this Sep 20, 2023

JiamingMai requested a review from yyongycy September 20, 2023 18:39

JiamingMai force-pushed the return-full-worker-lists-with-active-label branch from c5d9377 to a9a80a8 Compare September 20, 2023 18:58

yyongycy reviewed Sep 21, 2023

common/transport/src/main/proto/grpc/block_master.proto

JiamingMai force-pushed the return-full-worker-lists-with-active-label branch from 27c41bb to 2e5c249 Compare September 21, 2023 02:17

yyongycy reviewed Sep 21, 2023

dora/core/client/fs/src/main/java/alluxio/client/block/BlockWorkerInfo.java

yyongycy reviewed Sep 21, 2023

dora/core/client/fs/src/main/java/alluxio/client/file/FileSystemContext.java (outdated)

yyongycy reviewed Sep 21, 2023

dora/core/client/fs/src/main/java/alluxio/client/file/FileSystemContext.java (outdated)

JiamingMai force-pushed the return-full-worker-lists-with-active-label branch 2 times, most recently from 27558f9 to f1c347c Compare September 21, 2023 08:13

yyongycy approved these changes Sep 21, 2023

apc999 reviewed Sep 22, 2023

dora/core/client/fs/src/main/java/alluxio/client/block/RetryHandlingBlockMasterClient.java (outdated)

lucyge2022 approved these changes Sep 22, 2023

jja725 self-requested a review September 22, 2023 20:49

jja725 approved these changes Sep 22, 2023

enable static consistent hash ring

5fe21eb

JiamingMai force-pushed the return-full-worker-lists-with-active-label branch from 5f4b521 to 5fe21eb Compare September 24, 2023 15:49

alluxio-bot merged commit 189ca24 into Alluxio:main Sep 24, 2023
12 checks passed
