feat: allow externally generated shuffle buffer and multiple of them #1801

Merged: chebbyChefNEQ merged 2 commits into main from rmeng/shuffle-only-index on Jan 10, 2024

chebbyChefNEQ
Contributor

This PR does a few things to the IVF shuffler

  • add a set_unsorted_buffers method. This method allows the caller to set a list of buffers different from the unsorted.lance buffer used by default.
    • This is helpful for using externally calculated buffers (say, buffers from multiple GPUs).
    • This also removes the limit on the number of rows that can be shuffled. Previously, we could shuffle at most 2^32-1 rows because we used a single shuffle buffer, which is limited to 2^32 rows.
  • add logic to handle multiple shuffle buffers; by default we still use only one buffer
  • add a bunch of tests for the shuffler

I will wire up python in the next PR to keep this smaller and more reviewable.
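
To illustrate why several buffers lift the row cap described above, here is a minimal sketch, assuming each buffer's row count fits in a u32 while the global row index is a u64. The helpers locate_row and total_rows are hypothetical names for illustration and are not part of the shuffler.

```rust
/// Hypothetical helper, not part of Lance: total rows across all buffers.
/// Summing as u64 lets the total exceed what one u32-indexed buffer can hold.
fn total_rows(buffer_rows: &[u32]) -> u64 {
    buffer_rows.iter().map(|&rows| u64::from(rows)).sum()
}

/// Hypothetical helper, not part of Lance: find which buffer holds a global
/// row index, and the row's offset inside that buffer.
fn locate_row(buffer_rows: &[u32], global_row: u64) -> Option<(usize, u32)> {
    let mut remaining = global_row;
    for (idx, &rows) in buffer_rows.iter().enumerate() {
        if remaining < u64::from(rows) {
            // `remaining` is smaller than a u32 value here, so the cast cannot truncate.
            return Some((idx, remaining as u32));
        }
        remaining -= u64::from(rows);
    }
    // The index lies past the end of the last buffer.
    None
}
```

With two buffers of 3 and 2 rows, global row 3 maps to buffer 1 at offset 0, and row 5 is out of range.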

/// # Safety
///
/// user must ensure the buffers are valid.
pub unsafe fn set_unsorted_buffers(&mut self, unsorted_buffers: Vec<String>) {
Contributor

Why is it an unsafe method?

Btw, can we just hold a reference to &[String]?

Contributor Author

This is meant to signal to the user that passing invalid buffers here can cause undefined behavior.

Contributor Author

changed it to &[impl Into<String>]
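
A minimal sketch of what a setter taking &[impl Into<String>] could look like. The Shuffler struct here is a hypothetical stand-in that models only the buffer list, and the extra Clone bound is an assumption: Into::into consumes its value, so elements borrowed from a slice have to be cloned before conversion.

```rust
/// Hypothetical stand-in for the shuffler; only the buffer list is modelled.
struct Shuffler {
    unsorted_buffers: Vec<String>,
}

impl Shuffler {
    /// Starts with the single default buffer named in the PR description.
    fn new() -> Self {
        Self {
            unsorted_buffers: vec!["unsorted.lance".to_string()],
        }
    }

    /// # Safety
    ///
    /// The caller must ensure every name refers to a valid buffer.
    unsafe fn set_unsorted_buffers(&mut self, unsorted_buffers: &[impl Into<String> + Clone]) {
        // Clone each borrowed element, then convert it into an owned String.
        self.unsorted_buffers = unsorted_buffers
            .iter()
            .cloned()
            .map(|name| name.into())
            .collect();
    }
}
```

Under these assumptions the caller can pass string literals or owned Strings, for example `unsafe { shuffler.set_unsorted_buffers(&["gpu0.lance", "gpu1.lance"]) }`, where the file names are made up.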

.expect("part id should exist");

let mut stream = stream::iter(start..end)
.map(|i| reader.read_batch(i as i32, ReadBatchParams::RangeFull, &lance_schema))
Contributor

You can just do reader.read_batch(i as i32, .., &lance_schema), I think.

Contributor Author

🤯
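
One way a bare `..` can be accepted in that position is for the parameter to take anything convertible into the params type. The sketch below shows that pattern with a hypothetical ReadParams enum and describe function; it is not Lance's actual definition of ReadBatchParams, and whether read_batch is declared this way is an assumption.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a read-params enum; not Lance's actual type.
#[derive(Debug, PartialEq)]
enum ReadParams {
    Range(Range<usize>),
    RangeFull,
}

/// `..` has type `std::ops::RangeFull`, so this impl lets callers pass it directly.
impl From<std::ops::RangeFull> for ReadParams {
    fn from(_: std::ops::RangeFull) -> Self {
        ReadParams::RangeFull
    }
}

/// `a..b` has type `Range<usize>` and maps to the bounded variant.
impl From<Range<usize>> for ReadParams {
    fn from(range: Range<usize>) -> Self {
        ReadParams::Range(range)
    }
}

/// Hypothetical function: accepts any value convertible into `ReadParams`.
fn describe(params: impl Into<ReadParams>) -> ReadParams {
    params.into()
}
```

With these impls, `describe(..)`, `describe(2..5)` and `describe(ReadParams::RangeFull)` all type-check, which is the ergonomic gain the review comment points at.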

.map(|i| reader.read_batch(i as i32, ReadBatchParams::RangeFull, &lance_schema))
.buffered(16);

while let Some(batch) = stream.next().await {
Contributor

Use for_each instead?

Contributor Author

chebbyChefNEQ, Jan 10, 2024

We need to update partition_sizes, and closures can't do that easily because Stream::for_each captures multiple FnMut.

.expect("part id should exist");

let mut stream = stream::iter(start..end)
.map(|i| reader.read_batch(i as i32, ReadBatchParams::RangeFull, &lance_schema))
Contributor

Also, you might consider putting read_batch into a tokio task. Otherwise progress on these IO calls will be blocked by the CPU-bound work happening in your while let loop below.
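
The overlap this comment asks for can be sketched without external crates by swapping the tokio task for a std thread and a bounded channel. This is an analogous technique, not the tokio approach itself: read_batch below is a hypothetical stand-in for the IO call, and the channel capacity of 16 only loosely echoes the .buffered(16) in the diff.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for an IO-bound batch read.
fn read_batch(i: u32) -> Vec<u32> {
    vec![i; 4]
}

/// Reads batches on a background thread while the caller does CPU-bound work.
fn sum_with_prefetch(start: u32, end: u32) -> u64 {
    // Bounded queue: the producer blocks once 16 batches are waiting.
    let (tx, rx) = sync_channel::<Vec<u32>>(16);

    let producer = thread::spawn(move || {
        for i in start..end {
            // Stop early if the consumer has hung up.
            if tx.send(read_batch(i)).is_err() {
                break;
            }
        }
    });

    let mut total = 0u64;
    // Iteration ends once the producer finishes and drops `tx`.
    for batch in rx {
        // Stand-in for the CPU-bound work done per batch.
        total += batch.iter().map(|&v| u64::from(v)).sum::<u64>();
    }

    producer.join().expect("producer thread panicked");
    total
}
```

Because the consumer loop owns `total` directly, mutable state such as a partition-size table can be updated in place while reads continue in the background.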

Comment on lines 240 to 241
pq_codes.values()[i * num_sub_vectors..(i + 1) * num_sub_vectors]
.iter(),
Contributor

Could be wrong, but I think you can make this a slice and it can do a contiguous mem copy instead of running through the iterator.

Suggested change
- pq_codes.values()[i * num_sub_vectors..(i + 1) * num_sub_vectors]
-     .iter(),
+ &pq_codes.values()[i * num_sub_vectors..(i + 1) * num_sub_vectors],
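
The two forms in this suggestion can be compared on a plain Vec<u8>, used here as a stand-in for whatever builder the real code appends to (an assumption). Both functions append the same row of codes; whether the slice form is actually faster depends on the destination type, which the reviewer also hedges on.

```rust
/// Appends row `i` of the PQ codes by walking an iterator over the elements.
fn append_via_iter(dst: &mut Vec<u8>, codes: &[u8], i: usize, num_sub_vectors: usize) {
    dst.extend(codes[i * num_sub_vectors..(i + 1) * num_sub_vectors].iter());
}

/// Appends the same row by handing over the contiguous slice in one call.
fn append_via_slice(dst: &mut Vec<u8>, codes: &[u8], i: usize, num_sub_vectors: usize) {
    dst.extend_from_slice(&codes[i * num_sub_vectors..(i + 1) * num_sub_vectors]);
}
```

For 12 codes with num_sub_vectors = 4, row 1 is the elements at indices 4 through 7 in either version.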

Contributor

eddyxu commented Jan 10, 2024

Btw, delete shuffle_dataset_v1?

chebbyChefNEQ force-pushed the rmeng/shuffle-only-index branch 2 times, most recently from 70ebbe0 to 87af1c7, January 10, 2024 01:03
chebbyChefNEQ merged commit ba671ae into main, Jan 10, 2024
16 of 17 checks passed
chebbyChefNEQ deleted the rmeng/shuffle-only-index branch, January 10, 2024 01:58
3 participants