perf: support multi-threading shuffler #1474
Looks good. I renamed the PR to have the perf prefix, since it is a performance optimization.
In the future, it would be nice to include some benchmark results in the PR description or in a comment. That will help us when we look back through PRs to see where improvements came from between versions.
/// We need to keep the temp_dir with Shuffler because ObjectStore crate does not
/// work with a NamedTempFile.
temp_dir: Arc<TempDir>,

- writer: FileWriter,
+ writer: Arc<Mutex<FileWriter>>,
Ah I forgot it's one file for all the partitions. I wonder if that's a bottleneck.
One idea: you could move the FileWriter to a background task and push batches to it using a buffered channel. That would make sure the writer is always busy and let the read tasks keep moving. Might get a little complex to handle the batch numbering, but not insurmountable.
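For illustration, here is a minimal sketch of that idea. It is not the code from this PR: it uses only the standard library (a thread plus a bounded `sync_channel`) where the real code would more likely use an async task and an async channel, and `Batch`, `spawn_writer`, and `run_demo` are hypothetical names. A generic `Write` sink stands in for FileWriter. The single writer assigns batch numbers in arrival order and records them per partition, which is one possible way to handle the numbering concern.

```rust
use std::collections::BTreeMap;
use std::io::{self, Write};
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a record batch tagged with its partition.
struct Batch {
    partition: u32,
    payload: Vec<u8>,
}

/// What the writer hands back when it finishes: the sink itself and,
/// for each partition, the batch numbers the writer assigned to it.
type WriterOutput<W> = (W, BTreeMap<u32, Vec<usize>>);

/// Spawns one writer thread that owns `sink` (assumed to play the role of
/// the FileWriter). Producers push batches through a bounded channel, so
/// they block only when `capacity` batches are already queued.
fn spawn_writer<W: Write + Send + 'static>(
    mut sink: W,
    capacity: usize,
) -> (SyncSender<Batch>, JoinHandle<io::Result<WriterOutput<W>>>) {
    let (tx, rx) = sync_channel::<Batch>(capacity);
    let handle = thread::spawn(move || -> io::Result<WriterOutput<W>> {
        let mut index: BTreeMap<u32, Vec<usize>> = BTreeMap::new();
        // The loop ends once every sender has been dropped.
        for (batch_id, batch) in rx.into_iter().enumerate() {
            sink.write_all(&batch.payload)?;
            // Only the writer knows the final order, so it does the numbering.
            index.entry(batch.partition).or_default().push(batch_id);
        }
        sink.flush()?;
        Ok((sink, index))
    });
    (tx, handle)
}

/// Three producer threads each send five 8-byte batches for their partition.
/// Returns the number of bytes written and the per-partition batch numbers.
fn run_demo() -> (usize, BTreeMap<u32, Vec<usize>>) {
    let (tx, handle) = spawn_writer(Vec::new(), 4);
    let producers: Vec<_> = (0..3u32)
        .map(|partition| {
            let tx = tx.clone();
            thread::spawn(move || {
                for _ in 0..5 {
                    let batch = Batch { partition, payload: vec![partition as u8; 8] };
                    tx.send(batch).expect("writer thread should still be running");
                }
            })
        })
        .collect();
    drop(tx); // Drop the original sender so the writer can observe shutdown.
    for producer in producers {
        producer.join().expect("producer thread panicked");
    }
    let (bytes, index) = handle
        .join()
        .expect("writer thread panicked")
        .expect("write failed");
    (bytes.len(), index)
}
```

Because a single thread owns the sink, no `Mutex` is needed around it, and the channel capacity bounds how much data can sit in memory while waiting to be written. The trade-off is that batch numbers are only known after the writer has processed a batch, so readers have to get them back from the writer instead of computing them up front.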
That's a good point. I can do follow up PRs.
No description provided.