
feat: benchmark and improve L2 partition compute #1453

Merged: 14 commits into main from lei/bench_partition on Oct 23, 2023

Conversation

@eddyxu (Contributor) commented on Oct 22, 2023:

Improve compute_partitions(centroids, vectors, L2) by 2x (6.9s -> 3.5s)

sums = _mm256_fmadd_ps(sub, sub, sums);
sums = _mm256_fmadd_ps(s2, s2, sums);
sums = _mm256_fmadd_ps(s3, s3, sums);
Contributor commented:

You can use four accumulators for this (sums1..sums4) and only add the four together at the end. Might give better throughput.

@eddyxu (Author) replied on Oct 23, 2023:

Done. Delivered another 10%
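
For reference, a minimal sketch of the multi-accumulator pattern being discussed (assumed shape only; the function name l2_squared_avx2 and the 32-element stride are illustrative, not the exact code merged here): four independent FMA accumulators break the single loop-carried dependency chain and are reduced once after the loop.

use std::arch::x86_64::*;

/// Sketch only: squared L2 distance with four independent AVX2/FMA accumulators.
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn l2_squared_avx2(from: &[f32], to: &[f32]) -> f32 {
    let len = from.len() / 32 * 32;
    let mut sums1 = _mm256_setzero_ps();
    let mut sums2 = _mm256_setzero_ps();
    let mut sums3 = _mm256_setzero_ps();
    let mut sums4 = _mm256_setzero_ps();
    for i in (0..len).step_by(32) {
        // Each accumulator owns one 8-lane chunk of the 32-element stride,
        // so the four fmadds in an iteration do not depend on each other.
        let d1 = _mm256_sub_ps(
            _mm256_loadu_ps(from.as_ptr().add(i)),
            _mm256_loadu_ps(to.as_ptr().add(i)),
        );
        let d2 = _mm256_sub_ps(
            _mm256_loadu_ps(from.as_ptr().add(i + 8)),
            _mm256_loadu_ps(to.as_ptr().add(i + 8)),
        );
        let d3 = _mm256_sub_ps(
            _mm256_loadu_ps(from.as_ptr().add(i + 16)),
            _mm256_loadu_ps(to.as_ptr().add(i + 16)),
        );
        let d4 = _mm256_sub_ps(
            _mm256_loadu_ps(from.as_ptr().add(i + 24)),
            _mm256_loadu_ps(to.as_ptr().add(i + 24)),
        );
        sums1 = _mm256_fmadd_ps(d1, d1, sums1);
        sums2 = _mm256_fmadd_ps(d2, d2, sums2);
        sums3 = _mm256_fmadd_ps(d3, d3, sums3);
        sums4 = _mm256_fmadd_ps(d4, d4, sums4);
    }
    // Reduce the four accumulators once, then sum the 8 lanes and the scalar tail.
    let sums = _mm256_add_ps(_mm256_add_ps(sums1, sums2), _mm256_add_ps(sums3, sums4));
    let mut buf = [0.0f32; 8];
    _mm256_storeu_ps(buf.as_mut_ptr(), sums);
    let mut total: f32 = buf.iter().sum();
    for i in len..from.len() {
        let d = from[i] - to[i];
        total += d * d;
    }
    total
}

Compared to a single accumulator, this trades a few extra registers for a shorter dependency chain, which is consistent with the roughly 10% win reported above.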

}

/// Fast partition computation for L2 distance.
fn compute_partitions_l2(centroids: &[f32], data: &[f32], dim: usize) -> Vec<u32> {
Contributor commented:

nit: compute_partitions_l2_f32 or compute_partitions_l2<f32>

@eddyxu (Author) replied on Oct 22, 2023:

OK, let me make it _f32 first to get things going, because f16/bf16 can probably increase STRIPE / TILE by 2x.
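
The "2x" is just element width (my arithmetic, not spelled out in the thread): a stripe of 128 f32 values takes 128 * 4 = 512 bytes, and the same 512 bytes hold 512 / 2 = 256 f16 or bf16 values, so STRIPE_SIZE or TILE_SIZE could double without growing the L1 footprint.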

Comment on lines +550 to +551
const STRIPE_SIZE: usize = 128;
const TILE_SIZE: usize = 16;
Contributor commented:

nit: we may want to benchmark these and use different numbers on different CPUs.

Contributor commented:

One idea is to use something like L1 cache size * factor.

@eddyxu (Author) replied:

This is single-CPU code, so the tile sizes target the 32 KB / 64 KB L1 cache available per core.

We expect the higher level (i.e., kmeans) to handle distributing multiple batches across the centroid computation.
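
For intuition, a rough sketch of the cache-budget idea above (the names, the halving factor, and the assumption that a tile keeps TILE_SIZE data stripes plus TILE_SIZE centroid stripes resident are mine, not stated in the PR):

const L1D_BYTES: usize = 32 * 1024; // common per-core L1 data cache size
const F32_BYTES: usize = std::mem::size_of::<f32>();

/// Pick a stripe length (in f32 elements) so the tile's working set fits in
/// roughly half of L1d, leaving headroom for the distance accumulators.
fn stripe_size_for(tile_size: usize) -> usize {
    let budget = L1D_BYTES / 2;
    let stripes_in_flight = 2 * tile_size; // data stripes + centroid stripes
    let raw = budget / (stripes_in_flight * F32_BYTES);
    (raw / 32) * 32 // multiple of 32 so the AVX2 loop has no tail inside a stripe
}

fn main() {
    // tile_size = 16 yields 128, matching the constants in this PR:
    // 32 stripes x 128 f32 x 4 bytes = 16 KB, half of a 32 KB L1d.
    println!("{}", stripe_size_for(16));
}

Detecting the actual per-core L1d size at runtime instead of assuming 32 KB would be the natural next step the reviewers are hinting at.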

@eddyxu changed the title from "feat: benchmark and improve l2 partition compute" to "feat: benchmark and improve L2 partition compute" on Oct 23, 2023.
@westonpace (Contributor) left a review comment:

This looks good. I won't promise to have stepped through the math in detail, but the idea of a tiling approach seems very sound to me.

rust/lance-linalg/benches/compute_partition.rs (outdated review thread, resolved)
let len = from.len() / 8 * 8;
let mut sums = _mm256_setzero_ps();
for i in (0..len).step_by(8) {
let len = from.len() / 32 * 32;
Contributor commented:

Do we have any unit tests comparing this with a naive approach so we know we are calculating the right thing here?
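
A check along these lines would pin the SIMD path to a scalar reference (hypothetical sketch; argmin_naive and the test data are mine, and it assumes compute_partitions_l2 from this PR is in scope):

fn l2_squared_naive(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn argmin_naive(centroids: &[f32], data: &[f32], dim: usize) -> Vec<u32> {
    data.chunks(dim)
        .map(|row| {
            centroids
                .chunks(dim)
                .enumerate()
                .min_by(|(_, a), (_, b)| {
                    l2_squared_naive(row, a)
                        .partial_cmp(&l2_squared_naive(row, b))
                        .unwrap()
                })
                .map(|(idx, _)| idx as u32)
                .unwrap()
        })
        .collect()
}

#[test]
fn test_compute_partitions_matches_naive() {
    let dim = 75; // deliberately not a multiple of 8 or 32 to hit the scalar tail
    let centroids: Vec<f32> = (0..32 * dim).map(|i| (i % 97) as f32 * 0.5).collect();
    let data: Vec<f32> = (0..100 * dim).map(|i| (i % 89) as f32 * 0.25).collect();
    // In practice one may want to compare distances with a tolerance rather than
    // exact indices, since SIMD and scalar summation order can differ in the last ulp.
    assert_eq!(
        compute_partitions_l2(&centroids, &data, dim),
        argmin_naive(&centroids, &data, dim)
    );
}

Running it over a few different dims that are not multiples of 8 or 32 would also exercise the scalar tail path.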


// Get a slice of `data[di][s..s+STRIP_SIZE]`.
let cent_slice = get_slice(centroids, ci, s, dim, slice_len);
let dist = data_slice.l2(cent_slice);
dists[di * TILE_SIZE + (ci - centroid_start)] += dist;
Contributor commented:

It's very likely I'm reading this incorrectly but it looks like you are calculating:

sqrt(diff_sq(0) + ... + diff_sq(N))

by breaking it into pieces:

sqrt(diff_sq(0) + ... + diff_sq(a)) + sqrt(diff_sq(a+1) + ... + diff_sq(N))

But doesn't this require sqrt(a + b) = sqrt(a) + sqrt(b) which isn't true?

@eddyxu (Author) replied:

results[0] += l2_scalar(&from[len..], &to[len..]);

Our L2 implementation does not calculate sqrt()
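
To spell out why that answers the concern (my paraphrase, not part of the thread): the kernel accumulates squared L2 distance, so splitting the reduction across stripes only relies on addition being associative,

diff_sq(0) + ... + diff_sq(N) = (diff_sq(0) + ... + diff_sq(a)) + (diff_sq(a+1) + ... + diff_sq(N)),

and no sqrt(a + b) = sqrt(a) + sqrt(b) identity is ever needed. For partition assignment, the argmin over squared distances equals the argmin over true distances, so the square root can be skipped entirely.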

@eddyxu merged commit 4df9d33 into main on Oct 23, 2023. 16 checks passed.
@eddyxu deleted the lei/bench_partition branch on October 23, 2023 at 15:57.