perf: support multi-threading shuffler #1474

Merged
merged 3 commits into from
Oct 27, 2023

Conversation

eddyxu
Contributor

@eddyxu eddyxu commented Oct 27, 2023

No description provided.

@wjones127 wjones127 changed the title chore: support multi-threading shuffler feat: support multi-threading shuffler Oct 27, 2023
@wjones127 wjones127 changed the title feat: support multi-threading shuffler perf: support multi-threading shuffler Oct 27, 2023
Contributor

@wjones127 wjones127 left a comment

Looks good. I renamed the PR to use the perf prefix, since it is a performance optimization.

In the future, it would be nice to include some benchmark results in the PR description or a comment. That will help us when we look back through PRs to see where improvements came from between versions.


/// We need to keep the temp_dir with Shuffler because ObjectStore crate does not
/// work with a NamedTempFile.
temp_dir: Arc<TempDir>,

- writer: FileWriter,
+ writer: Arc<Mutex<FileWriter>>,
Contributor

Ah I forgot it's one file for all the partitions. I wonder if that's a bottleneck.
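
To make the concern concrete, here is a minimal sketch of the shared-writer pattern using only the standard library. `MockWriter` and `shuffle_with_shared_writer` are hypothetical names invented for this illustration; the real `FileWriter` is async and its API is not shown in this thread, so treat this as an assumption-laden model rather than the PR's code. The point is that the per-partition work runs in parallel, but every write has to take the same lock.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the real `FileWriter`; it only records what was written.
#[derive(Default)]
struct MockWriter {
    batches: Vec<(usize, Vec<u32>)>,
}

impl MockWriter {
    fn write(&mut self, partition: usize, rows: Vec<u32>) {
        self.batches.push((partition, rows));
    }
}

/// Spawn one worker per partition; all workers share one writer behind a mutex.
fn shuffle_with_shared_writer(num_partitions: usize) -> Vec<(usize, Vec<u32>)> {
    let writer = Arc::new(Mutex::new(MockWriter::default()));

    let handles: Vec<_> = (0..num_partitions)
        .map(|p| {
            let writer = Arc::clone(&writer);
            thread::spawn(move || {
                // The "read" side runs in parallel, outside the lock.
                let rows: Vec<u32> = (0..4).map(|i| (p * 10 + i) as u32).collect();
                // Writes are serialized: only one thread holds the lock at a time.
                writer.lock().unwrap().write(p, rows);
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // Take the recorded batches out of the writer; arrival order is not deterministic.
    let batches = std::mem::take(&mut writer.lock().unwrap().batches);
    batches
}
```

Calling `shuffle_with_shared_writer(4)` returns four batches, one per partition, in whatever order the threads acquired the lock. If the write is slow relative to the read, the workers would queue up on the mutex, which is the possible bottleneck raised here.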

Contributor

One idea: you could move the FileWriter to a background task and push batches to it using a buffered channel. That would make sure the writer is always busy and let the read tasks keep moving. Might get a little complex to handle the batch numbering, but not insurmountable.
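
A rough sketch of that idea, again with the standard library only: `std::sync::mpsc::sync_channel` stands in for whatever buffered async channel the real code would use, and `Batch` and `shuffle_with_writer_task` are hypothetical names for this example. A single background thread owns the writer and assigns batch numbers in arrival order, which is one possible way to handle the numbering mentioned above.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical batch type: a partition id plus its rows.
type Batch = (usize, Vec<u32>);

/// Readers push batches into a bounded channel; one background thread does all writes.
/// Returns the written batches, each tagged with the batch id the writer assigned.
fn shuffle_with_writer_task(num_partitions: usize, capacity: usize) -> Vec<(usize, Batch)> {
    let (tx, rx) = sync_channel::<Batch>(capacity);

    // The writer thread is the sole owner of the output, so no mutex is needed.
    let writer = thread::spawn(move || {
        let mut written = Vec::new();
        // Batch ids are assigned in arrival order; the loop ends once every sender is dropped.
        for (batch_id, batch) in rx.iter().enumerate() {
            written.push((batch_id, batch));
        }
        written
    });

    let readers: Vec<_> = (0..num_partitions)
        .map(|p| {
            let tx = tx.clone();
            thread::spawn(move || {
                let rows: Vec<u32> = (0..4).map(|i| (p * 10 + i) as u32).collect();
                // `send` blocks only when the buffer is full, which applies backpressure.
                tx.send((p, rows)).unwrap();
            })
        })
        .collect();

    // Drop the original sender so the channel closes when the readers finish.
    drop(tx);
    for reader in readers {
        reader.join().unwrap();
    }
    writer.join().unwrap()
}
```

With `shuffle_with_writer_task(4, 2)`, the batch ids come out as 0 through 3 in order, while the partition each id maps to depends on thread scheduling. A caller that needs to find a partition's batches later would have to keep that id-to-partition mapping, which is the extra bookkeeping the comment anticipates.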

Contributor Author

That's a good point. I can do that in follow-up PRs.

@eddyxu eddyxu merged commit 2d5de9e into main Oct 27, 2023
17 checks passed
@eddyxu eddyxu deleted the lei/mt_shuffler branch October 27, 2023 21:04
2 participants