feat: build ivf partition using disk based shuffler #1312
eddyxu
commented
Sep 22, 2023
- Reduce the memory consumption to run IVF_PQ
- Building block for distributed IVF_PQ indexing
- Better multi-thread support during IVF assignment
8486ce2 to edb8e9c
7ee145f to 63ffe33
9df4538 to 72d7c5c
Some thoughts from a partial review.
rust/lance-linalg/src/kmeans.rs
Outdated
@@ -190,64 +185,38 @@ pub struct KMeanMembership {
impl KMeanMembership {
    /// Reconstruct a KMeans model from the membership.
    async fn to_kmeans(&self) -> Result<KMeans> {
        let time = std::time::Instant::now();
Maybe `#[instrument]` the method instead?
Done
rust/lance/src/index/vector/ivf.rs
Outdated
    .iter()
    .zip(centroid.values().iter())
    .map(|(v, c)| *v - *c)
    .collect::<Vec<_>>() // How to avoid one memory allocation here?
`flat_map` should accept a function that returns an iterator. Can you just remove the `collect`? Maybe you need to do `.copied()` to convert from `Iterator<Item = &f32>` to `Iterator<Item = f32>`?
It complains that it `returns a value referencing data owned by the current function` (`centroid`), though.
OK, fixed in another way: using `for_each` to append to a pre-allocated vector.
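The fix described in this exchange can be sketched roughly as follows. This is a minimal, std-only illustration and not the PR's actual code: the function name `compute_residuals` and the flat row-major `&[f32]` layout are assumptions made for the example. Each row's residual is appended straight into one buffer sized up front, so no temporary `Vec` is collected per row, and the centroid is a slice borrowed from the caller's data rather than a value owned inside the closure.

```rust
/// Hypothetical sketch: subtract each row's assigned centroid and append the
/// result to a single pre-allocated buffer.
/// `vectors` and `centroids` are row-major with `dimension` values per row;
/// `assignments[i]` is the centroid index for row `i`.
fn compute_residuals(
    vectors: &[f32],
    centroids: &[f32],
    assignments: &[u32],
    dimension: usize,
) -> Vec<f32> {
    // One allocation for the whole output.
    let mut residuals = Vec::with_capacity(vectors.len());
    vectors
        .chunks_exact(dimension)
        .zip(assignments.iter())
        .for_each(|(vector, &cluster_id)| {
            let start = cluster_id as usize * dimension;
            // Borrowed from the caller's slice, so nothing local is returned.
            let centroid = &centroids[start..start + dimension];
            residuals.extend(vector.iter().zip(centroid.iter()).map(|(v, c)| v - c));
        });
    residuals
}
```

Because the closure only pushes into `residuals` and returns nothing, the borrow checker never sees an iterator that outlives the data it refers to.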
@@ -649,6 +635,7 @@ pub async fn build_ivf_pq_index(
    #[cfg(not(feature = "opq"))]
    let transforms: Vec<Box<dyn Transformer>> = vec![];

    let start = std::time::Instant::now();
Can we use a span?
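The concern here is the manual `Instant::now()` / `elapsed()` pair scattered through the function. As a rough std-only illustration of scoping the measurement to one unit of work, here is a hypothetical `timed` helper. It is not a span and not part of any tracing library; a real span would additionally carry structured fields and nest under parent spans.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not from `tracing` or Lance): run `work`, report how
/// long it took under `label`, and return both the result and the elapsed time.
fn timed<T>(label: &str, work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let output = work();
    let elapsed = start.elapsed();
    // Goes to stderr so it does not mix with program output.
    eprintln!("{label}: {:.3}s", elapsed.as_secs_f32());
    (output, elapsed)
}
```

The point of the sketch is only that the start and end of the measurement are tied to one closure, which is the property a span or `#[instrument]` would provide without hand-written timing code.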
rust/lance-linalg/src/kmeans.rs
Outdated
for i in 0..dimension {
    new_centroids[*c as usize * dimension + i] += v[i];
}
This might be a bit faster (especially for large `dimension`), as it can avoid some bounds checking.
- for i in 0..dimension {
-     new_centroids[*c as usize * dimension + i] += v[i];
- }
+ for (old, new) in new_centroids[(*c * dimension)..((*c + 1) * dimension)]
+     .iter_mut()
+     .zip(v)
+ {
+     *old += new;
+ }
Done
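For reference, a self-contained sketch of the slice-and-zip accumulation suggested here. The function name, the `u64` counts, and the flat `&[f32]` inputs are assumptions for the example. Since the original code writes `*c as usize`, the cluster id is presumably not a `usize`, so the sketch casts once before computing the slice range.

```rust
/// Hypothetical sketch: sum the vectors assigned to each cluster.
/// Returns (per-cluster sums in row-major order, per-cluster counts).
fn accumulate_centroids(
    vectors: &[f32],
    cluster_ids: &[u32],
    k: usize,
    dimension: usize,
) -> (Vec<f32>, Vec<u64>) {
    let mut sums = vec![0.0f32; k * dimension];
    let mut counts = vec![0u64; k];
    for (vector, &cluster_id) in vectors.chunks_exact(dimension).zip(cluster_ids) {
        let cluster = cluster_id as usize;
        counts[cluster] += 1;
        let start = cluster * dimension;
        // The range is bounds-checked once when the slice is taken; the zipped
        // loop then needs no per-element index check.
        for (sum, value) in sums[start..start + dimension].iter_mut().zip(vector) {
            *sum += *value;
        }
    }
    (sums, counts)
}
```

Whether this is measurably faster depends on what the optimizer already does with the indexed loop, so it is worth benchmarking rather than assuming.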
.and_then(|batch_and_range| async move {
    // Split batch into per-partition batches
    let (batch, range) = batch_and_range;
    Ok(stream::iter(range).map(move |part_id| {
        let predictions = BooleanArray::from_unary(
            batch
                .column_by_name(PARTITION_ID_COLUMN)
                .unwrap()
                .as_primitive::<UInt32Type>(),
            |pid| pid == part_id,
        );
        let parted_batch =
            filter_record_batch(&batch, &predictions)?.drop_column(PARTITION_ID_COLUMN)?;
        Ok::<(u32, RecordBatch), Error>((part_id, parted_batch))
    }))
})
This is CPU bound, so I think we need to put this into `tokio::task::spawn_blocking()` to get any concurrency.
I changed this piece of code.
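Apart from the concurrency point, the quoted closure builds a boolean mask over the whole batch once per partition, so the work grows with rows times partitions. One possible alternative, sketched here with plain vectors instead of Arrow arrays, is to group row indices by partition id in a single pass and then select each group. This is an assumption about a reasonable approach, not a description of what the PR ended up doing, and `group_rows_by_partition` is a hypothetical name.

```rust
/// Hypothetical sketch: collect the row indices belonging to each partition
/// in one pass over the partition-id column.
/// Panics if an id is >= `num_partitions`.
fn group_rows_by_partition(partition_ids: &[u32], num_partitions: u32) -> Vec<Vec<u32>> {
    let mut rows: Vec<Vec<u32>> = vec![Vec::new(); num_partitions as usize];
    for (row, &partition_id) in partition_ids.iter().enumerate() {
        rows[partition_id as usize].push(row as u32);
    }
    rows
}
```

Each inner vector could then be turned into an index array and passed to a `take`-style selection to materialize that partition's rows.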
#[cfg(test)]
mod tests {}
Is this going to be filled in? Or removed?
rust/lance-linalg/src/kmeans.rs
Outdated
    .values()
    .chunks_exact(dimension)
    .zip(self.cluster_ids.iter())
    .for_each(|(v, c)| {
nit: single-letter variable names make this harder to read. (I had to cross-reference different parts of the file to figure out what they meant.)
- .for_each(|(v, c)| {
+ .for_each(|(vector, cluster_id)| {
Done
rust/lance-linalg/src/kmeans.rs
Outdated
.for_each(|(v, c)| {
    cluster_cnts[*c as usize] += 1;
    for i in 0..dimension {
        new_centroids[*c as usize * dimension + i] += v[i];
This feels like it could be an overflow hazard. I'm guessing we haven't hit it because of sampling.
Changed to use iterator
rust/lance/src/index/vector/ivf.rs
Outdated
        0..num_partitions,
    )
    .await?;
    println!("Building partitions: {}s", start.elapsed().as_secs_f32());
s/building/built
fixed
e4dd67f to c36e76d
rust/lance-linalg/src/kmeans.rs
Outdated
    warn!("KMeans: cluster {} is empty", i);
} else {
    for j in 0..dimension {
        new_centroids[i * dimension + j] /= cluster_cnts[i] as f32;
Is there any possibility that `cluster_cnts[i]` is zero here? Maybe if K is larger than the number of data points?
Yeah. At the moment the training algorithm does not do a good job of making sure no cluster is empty.
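The empty-cluster guard in the quoted hunk matters because `f32` division by zero does not panic in Rust: it silently produces NaN or infinity, which would then poison later distance computations. A minimal sketch of the averaging step with that guard follows. `finalize_centroids` and its return value are hypothetical; the sketch only reports empty clusters and does not attempt to re-seed them.

```rust
/// Hypothetical sketch: turn per-cluster sums into means in place.
/// Clusters with zero members are left untouched and their indices returned,
/// so the caller can warn about them or re-seed them.
fn finalize_centroids(sums: &mut [f32], counts: &[u64], dimension: usize) -> Vec<usize> {
    let mut empty_clusters = Vec::new();
    for (cluster, (centroid, &count)) in
        sums.chunks_exact_mut(dimension).zip(counts).enumerate()
    {
        if count == 0 {
            // Dividing here would yield NaN (0.0 / 0.0), so skip instead.
            empty_clusters.push(cluster);
            continue;
        }
        let divisor = count as f32;
        centroid.iter_mut().for_each(|value| *value /= divisor);
    }
    empty_clusters
}
```

An empty cluster is possible whenever K exceeds the number of distinct training points, and it can also occur during ordinary iterations if every point migrates away from a centroid.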
5777884 to e6fb3b3
e6fb3b3 to 91dd61d
953f8c1 to efb6167
parent 8775776
author Lei Xu 1695332018 -0700
committer Lei Xu 1695926481 -0700
Ivf::compute_partitions: use instrument; avoid memory copy during residual computation
b468322 to 27c863a
let indices = UInt32Array::from(row_ids.clone());
// Use `take` to select rows.
let str_arr = take(&struct_arr, &indices, None)?;
let parted_batch: RecordBatch = str_arr.as_struct().into();
Don't we have that new `take()` method now?
You are right. Fixed.