Support fractional resource scheduling #258

pang-wu · 2022-07-25T07:54:20Z

Support fractional CPU and GPU resource scheduling. This PR achieves three goals:

Support scheduling multiple executors on one vCPU, with each executor having 1 Spark core (for parallelism, similar to how Spark CPU requests work on Kubernetes, but without a limit), using the spark.ray.actor.resource.cpu config.

For more details, please refer to this RFC.
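
To illustrate the intent, here is a minimal sketch of how a fractional CPU request might be expressed and how many executors it would allow per node. Only the config key `spark.ray.actor.resource.cpu` comes from this PR; the helper names `executor_configs` and `max_executors_on_node` are hypothetical, and passing the value as a string is an assumption.

```python
from fractions import Fraction
import math

# Config key introduced by this PR.
CPU_KEY = "spark.ray.actor.resource.cpu"


def executor_configs(cpu_per_executor):
    """Build a Spark config dict requesting a fractional CPU per executor actor.

    Hypothetical helper: the value is stored as a string, as Spark configs
    usually are (an assumption, not confirmed by the PR).
    """
    if cpu_per_executor <= 0:
        raise ValueError("cpu_per_executor must be positive")
    return {CPU_KEY: str(cpu_per_executor)}


def max_executors_on_node(node_cpus, cpu_per_executor):
    """Upper bound on executor actors that fit on a node by CPU alone.

    Uses exact fractions so that values such as 0.1 do not suffer from
    floating-point rounding. Memory and other resources are ignored.
    """
    per_actor = Fraction(str(cpu_per_executor))
    return math.floor(Fraction(str(node_cpus)) / per_actor)


configs = executor_configs(0.5)
print(configs)                        # {'spark.ray.actor.resource.cpu': '0.5'}
print(max_executors_on_node(1, 0.5))  # 2 executors share one vCPU
```

A dict like `configs` would presumably be passed along with the other Spark settings when the session is created; each executor still reports 1 Spark core, so the fraction only affects how Ray places the actors.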

pang-wu · 2022-07-25T08:36:29Z

@carsonwang & team, please kindly let me know if you want a call to discuss this proposal.

carsonwang · 2022-07-27T09:25:05Z

Thanks @pang-wu for the work! How will the GPU config be used, given that Spark is not actually aware of the GPU resource?

pang-wu · 2022-07-27T15:02:25Z

@carsonwang To my understanding, GPU-based actor scheduling/allocation is done by Ray, and Spark's executor runs inside the actor. Whether the code inside Spark actually uses the GPU is up to the user. But we also want to solve the other side of the problem: if a cluster has GPUs, Spark can still launch executors on the worker nodes for CPU-only tasks using this config. Right now developers have to use a mixed-node cluster to run Spark jobs if they also want to run GPU workloads in the same cluster. In most of our use cases the Spark processing job is small, so the current approach adds setup complexity.
As for how the workload will actually use the GPU, users can put GPU-aware code inside Spark functions. This should be somewhat similar to how Spark handles custom resource scheduling with GPUs (correct me if I am wrong)?

carsonwang · 2022-08-01T03:32:33Z

LGTM

Support fractional resource scheduling

e03628f

pang-wu mentioned this pull request Jul 25, 2022

[RFC] Fractional resource scheduling (CPU) #259

Closed

Fix java and scala code styling.

ccbe86b

pang-wu added 5 commits July 25, 2022 19:07

Fix tests.

835b272

Use marker to skip tests

fda2029

Refactor

3f50238

Use mock clusters.

0456cd7

Use mock cluster based on doc here: https://docs.ray.io/en/latest/ray-core/examples/testing-tips.html#tip-4-create-a-mini-cluster-with-ray-cluster-utils-cluster

try to fix test by running the custom resource test separately.

6125fc8

Remove GPU resource config.

81ac79a

GPU auto scaling is a bug on Ray side. For more details, please see [this issue](ray-project/ray#20476).

carsonwang merged commit 240242d into oap-project:master Aug 1, 2022
