Make Scan tasks use a spread scheduling strategy #1940

jaychia · 2024-02-21T19:02:37Z

This should be a pretty simple fix:

When scheduling any task that involves a read, we should spread the tasks across the cluster (similar to what we do for reduces).

This PR forces a `SPREAD` scheduling strategy for scan tasks when using the Ray runner. This should result in better load balancing of read tasks across the Ray cluster, yielding:

- better utilization of the aggregate network bandwidth of the cluster,
- better memory stability due to a more even post-read object distribution,
- better performance of downstream parallel compute operations due to a more even distribution of data over the compute bandwidth of the cluster.

Closes #1940
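To illustrate the load-balancing argument, here is a toy sketch in plain Python. It is not the runner's code and does not model Ray's actual scheduler: the node names, the `slots_per_node` parameter, and both placement functions are hypothetical, and round-robin is only a rough stand-in for a spread policy. In Ray itself, the relevant knob is the `scheduling_strategy="SPREAD"` task option.

```python
from collections import Counter


def place_packed(num_tasks, nodes, slots_per_node):
    """Hypothetical packing policy: fill one node's slots before using the next."""
    return [nodes[(i // slots_per_node) % len(nodes)] for i in range(num_tasks)]


def place_spread(num_tasks, nodes):
    """Hypothetical spread policy: round-robin tasks across all nodes."""
    return [nodes[i % len(nodes)] for i in range(num_tasks)]


# Assumed cluster: four nodes, each with enough free slots to take every task.
nodes = ["node-a", "node-b", "node-c", "node-d"]
num_scan_tasks = 8

packed = Counter(place_packed(num_scan_tasks, nodes, slots_per_node=8))
spread = Counter(place_spread(num_scan_tasks, nodes))

# Packing lands every read (and its output) on one node;
# spreading gives each node an equal share of reads.
print("packed:", dict(packed))
print("spread:", dict(spread))
```

In this model the packed placement puts all eight reads on `node-a`, so one node's network link and memory absorb the whole scan, while the spread placement gives each of the four nodes two reads.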

jaychia added the p0 Priority 0 - to be addressed immediately label Feb 21, 2024

jaychia assigned clarkzinzow Feb 26, 2024

clarkzinzow mentioned this issue Feb 26, 2024

[PERF] Spread scan tasks over Ray cluster. #1950

Merged

clarkzinzow closed this as completed in #1950 Feb 26, 2024
