
feat: support miniblock with binary data #3099

Merged Nov 8, 2024 (20 commits)

Conversation


@broccoliSpicy broccoliSpicy commented Nov 6, 2024

This PR enables miniblock encoding with binary data type.

@github-actions github-actions bot added the enhancement New feature or request label Nov 6, 2024

github-actions bot commented Nov 6, 2024

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@broccoliSpicy broccoliSpicy reopened this Nov 6, 2024
@@ -293,6 +293,55 @@ impl FixedWidthDataBlock {
}
}

pub struct VariableWidthDataBlockBuilder1 {
Contributor Author:

For variable-width data with u64 offsets, I will do:

pub struct VariableWidthDataBlockBuilder2 {
    offsets: Vec<u64>,
    bytes: Vec<u8>,
}

it will be added in a follow up PR.
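A minimal sketch of how a builder shaped like this appends values (the struct mirrors the one above with u32 offsets; `new` and `append` are assumed helpers for illustration, not necessarily the PR's actual API):

```rust
// Sketch: a variable-width builder keeps a running offsets vector
// (starting at 0) and a flat byte buffer; offsets[i]..offsets[i+1]
// delimits value i.
pub struct VariableWidthDataBlockBuilder {
    offsets: Vec<u32>,
    bytes: Vec<u8>,
}

impl VariableWidthDataBlockBuilder {
    pub fn new() -> Self {
        Self { offsets: vec![0], bytes: Vec::new() }
    }

    pub fn append(&mut self, value: &[u8]) {
        self.bytes.extend_from_slice(value);
        self.offsets.push(self.bytes.len() as u32);
    }
}
```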

Contributor:

Let's use better names please. Even if it's just VariableWidthDataBlockBuilder and LargeVariableWidthDataBlockBuilder.

Contributor Author:

haha, sure! VariableWidthDataBlockBuilder and LargeVariableWidthDataBlockBuilder are way better than ...1, ...2.

Contributor Author:

fixed.

@broccoliSpicy broccoliSpicy changed the title feat: Miniblock with binary data feat: support miniblock with binary data Nov 6, 2024
@@ -133,6 +133,12 @@ impl ProtobufUtils {
}
}

pub fn binary_miniblock() -> ArrayEncoding {
    ArrayEncoding {
        array_encoding: Some(ArrayEncodingEnum::BinaryMiniblock(BinaryMiniBlock {})),
Contributor Author:

Not important to this PR, but I defined the protobuf message as

message BinaryMiniBlock {
}

yet here it is generated as BinaryMiniblock (B -> b); I don't know the reason.

Contributor:

It is because you have BinaryMiniBlock binary_miniblock = 15; rather than BinaryMiniBlock binary_mini_block = 15;. The generated variant name comes from the field name, not the message name.
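The mapping can be illustrated on the oneof declaration (a sketch; the field number is taken from the comment above, and the rule shown is how prost converts field names to UpperCamelCase variant names):

```proto
// Sketch: in a prost-generated oneof, the Rust variant name is derived
// from the *field* name, not the message name.
oneof array_encoding {
  // "binary_miniblock"  -> ArrayEncodingEnum::BinaryMiniblock
  // "binary_mini_block" -> ArrayEncodingEnum::BinaryMiniBlock
  BinaryMiniBlock binary_miniblock = 15;
}
```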

Contributor Author:

oh! gotcha, thanks!

Contributor Author:

fixed.

@codecov-commenter

codecov-commenter commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 93.77432% with 16 lines in your changes missing coverage. Please review.

Project coverage is 77.13%. Comparing base (5c19fe5) to head (0163c41).

Files with missing lines Patch % Lines
...st/lance-encoding/src/encodings/physical/binary.rs 94.51% 9 Missing ⚠️
rust/lance-encoding/src/data.rs 96.15% 2 Missing ⚠️
rust/lance-encoding/src/decoder.rs 75.00% 0 Missing and 2 partials ⚠️
rust/lance-encoding/src/encoder.rs 92.00% 2 Missing ⚠️
rust/lance-encoding/src/encodings/physical.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3099      +/-   ##
==========================================
+ Coverage   77.08%   77.13%   +0.05%     
==========================================
  Files         240      240              
  Lines       80412    80652     +240     
  Branches    80412    80652     +240     
==========================================
+ Hits        61987    62213     +226     
- Misses      15263    15275      +12     
- Partials     3162     3164       +2     
Flag Coverage Δ
unittests 77.13% <93.77%> (+0.05%) ⬆️


@westonpace westonpace (Contributor) left a comment:

This is great. We will need test cases though. For fixed width data we just extended the 2.0 tests to run on 2.1 as well:

    #[rstest]
    #[test_log::test(tokio::test)]
    async fn test_value_primitive(
        #[values(LanceFileVersion::V2_0, LanceFileVersion::V2_1)] version: LanceFileVersion,
    ) {
        for data_type in PRIMITIVE_TYPES {
            log::info!("Testing encoding for {:?}", data_type);
            let field = Field::new("", data_type.clone(), false);
            check_round_trip_encoding_random(field, version).await;
        }
    }

You can probably do the same thing using the tests in physical/binary.rs.

Also, I think we have a problem with alignment. I have an idea I'll put together real quick.


Comment on lines 325 to 329
for i in selection.start..selection.end {
    let this_value_len = offsets[i as usize + 1] - offsets[i as usize];
    self.offsets.push(previous_len as u32 + this_value_len);
    previous_len += this_value_len as usize;
}
Contributor:

I think this is inefficient (we don't want offsets[] indexing in a loop, because each access is bounds-checked), but we can optimize later if you want.
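One way to avoid indexing `offsets[]` on each iteration is to walk adjacent offset pairs taken from a single bounds-checked subslice; a sketch of the same loop (the function name and signature are assumed for illustration):

```rust
// Sketch: push cumulative output offsets for a selected range without
// per-iteration `offsets[idx]` indexing. `windows(2)` yields adjacent
// (start, end) pairs from one bounds-checked subslice.
fn push_selected_offsets(
    offsets: &[u32],       // source offsets, len = num_values + 1
    start: usize,          // selection.start
    end: usize,            // selection.end (exclusive)
    out: &mut Vec<u32>,    // builder's offsets
    mut previous_len: u32, // bytes already in the builder
) {
    for pair in offsets[start..=end].windows(2) {
        let this_value_len = pair[1] - pair[0];
        previous_len += this_value_len;
        out.push(previous_len);
    }
}
```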

Contributor Author:

thanks for the suggestion, fixed.

Comment on lines +748 to +756
DataType::Binary | DataType::Utf8 => {
    let column_info = column_infos.expect_next()?;
    let scheduler = Box::new(StructuralPrimitiveFieldScheduler::try_new(
        column_info.as_ref(),
        self.decompressor_strategy.as_ref(),
    )?);
    column_infos.next_top_level();
    Ok(scheduler)
}
Contributor:

Instead of doing this can we add Binary and Utf8 to Self::is_primitive?

Contributor Author:

It's a good idea. I am a bit worried that doing this may cause issues in Lance versions prior to 2.1; I will verify this.

Contributor Author:

fixed.

Contributor Author:

It turns out it does cause trouble with backward compatibility; for example, the test in python/tests/test_backwards_compatibility.py fails. I kept CoreDecompressorStrategy::is_primitive as is and kept this logic.

Comment on lines 799 to 803
let max_len = max_len
    .as_any()
    .downcast_ref::<PrimitiveArray<UInt64Type>>()
    .unwrap();
if max_len.value(0) < 128 {
Contributor:

Why verify max_len again here? It's not harmful, just curious.

Contributor Author:

It was just me being cautious, and subconsciously fearing buggy code. I think it's better to remove it.

Contributor Author:

removed.

bytes_start_offset: usize,
// every chunk is padded to 8 bytes.
// we need to interpret every chunk as &[u32], so it needs to be padded to at least 4 bytes;
// 8 bytes is a more conservative choice.
Contributor:

I think there is some confusion between padding and alignment here. You want to be able to take the offsets of a chunk and interpret them as &[u32]. In order for this to work, the slice needs to be aligned to (not padded to) 4 bytes (and it is quite concretely 4; let's not make it 8). This means the pointer to the start of the slice needs to be divisible by 4.

Padding the buffer to make its length divisible by 4 does not guarantee this, except possibly accidentally.

In fact, getting this alignment correct is kind of tricky, and is going to take some tweaking of the miniblock encoder itself 😦.

Let me throw together a PR. We can probably change our "pad to 6" logic to just be "pad to 8" and then pad the start of the data buffer appropriately so that it lands on an 8-byte-aligned pointer. This will ensure that each chunk starts at an 8-byte-aligned pointer.
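The padding-vs-alignment distinction can be demonstrated with a small check (a sketch; `is_aligned_for_u32` is a hypothetical helper, not part of the codebase): a slice whose length is a multiple of 4 can still start at an unaligned address.

```rust
// Sketch: length padding does not imply pointer alignment. To view a
// byte slice as &[u32], the *address* of its first byte must be a
// multiple of align_of::<u32>() (4); the length being a multiple of 4
// is a separate, independent requirement.
fn is_aligned_for_u32(bytes: &[u8]) -> bool {
    (bytes.as_ptr() as usize) % std::mem::align_of::<u32>() == 0
}
```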

Contributor Author:

I did a copy of the entire chunk in decompress (I hope we can eliminate it after your PR).

Because I did a copy, I think I can safely reinterpret the chunk as &[u32] as long as its length in bytes is a multiple of 4.

Comment on lines +596 to +604
// `this_chunk_offsets` are offsets that point into this chunk's bytes
let this_chunk_offsets = offsets
    [chunk.chunk_start_offset_in_orig_idx..chunk.chunk_last_offset_in_orig_idx + 1]
    .iter()
    .map(|offset| {
        offset - offsets[chunk.chunk_start_offset_in_orig_idx]
            + chunk.bytes_start_offset as u32
    })
    .collect::<Vec<_>>();
Contributor:

It appears you are copying n+1 offsets and normalizing them so they always start at 0. If you're going to normalize, there is no need to copy the 0.

Alternatively, you could skip the normalization, and copy the n+1 values. The decompressor would then need to subtract the first value from everything.

Not sure which approach is better, and this approach is fine for now, but it is technically doing more work than needed.
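The "no need to copy the 0" variant might look like this (a sketch with assumed names; the decompressor would then treat the leading 0 as implicit):

```rust
// Sketch: normalize a chunk's offsets so they start at 0, skipping the
// first entry, since after normalization it is always 0 and can be
// reconstructed implicitly by the decompressor.
fn normalized_chunk_offsets(offsets: &[u32], first_idx: usize, last_idx: usize) -> Vec<u32> {
    let base = offsets[first_idx];
    offsets[first_idx + 1..=last_idx]
        .iter()
        .map(|o| o - base)
        .collect()
}
```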

Contributor Author:

It appears you are copying n+1 offsets and normalizing them so they always start at 0.

They normalize to chunk.bytes_start_offset.
I can skip the normalization here, but then I also need to store chunk.bytes_start_offset in the output buffer so the decompressor knows it.

fn decompress(&self, data: LanceBuffer, num_values: u64) -> Result<DataBlock> {
    assert!(data.len() >= 8);
    let data = data.to_vec();
    let offsets: &[u32] = cast_slice(&data);
Contributor:

It might be good to use try_cast_slice instead of cast_slice.
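try_cast_slice (from the bytemuck crate) returns an error instead of panicking when the bytes are misaligned or the length is not a multiple of 4. The check it performs can be sketched with the standard library's `align_to` (a hand-rolled stand-in for illustration, not bytemuck's actual implementation):

```rust
// Sketch: a checked &[u8] -> &[u32] reinterpretation. `align_to`
// splits the slice into an unaligned head, an aligned middle, and a
// leftover tail; the cast is valid only when head and tail are empty.
fn checked_cast_u32(data: &[u8]) -> Option<&[u32]> {
    // Safety: u32 has no invalid bit patterns, so reinterpreting
    // initialized bytes as u32 values is sound.
    let (head, body, tail) = unsafe { data.align_to::<u32>() };
    if head.is_empty() && tail.is_empty() {
        Some(body)
    } else {
        None
    }
}
```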


@broccoliSpicy (Contributor Author):

This is great. We will need test cases though. For fixed width data we just extended the 2.0 tests to run on 2.1 as well:

    #[rstest]
    #[test_log::test(tokio::test)]
    async fn test_value_primitive(
        #[values(LanceFileVersion::V2_0, LanceFileVersion::V2_1)] version: LanceFileVersion,
    ) {
        for data_type in PRIMITIVE_TYPES {
            log::info!("Testing encoding for {:?}", data_type);
            let field = Field::new("", data_type.clone(), false);
            check_round_trip_encoding_random(field, version).await;
        }
    }

You can probably do the same thing using the tests in physical/binary.rs.

Actually, check_round_trip_encoding_random and the other test functions in rust/lance-encoding/src/testing.rs have many problems dealing with PrimitiveStructuralEncoder right now; I will see how to integrate them.

@broccoliSpicy (Contributor Author):

After merging PR #3101, I tried to get rid of


and a misalignment panic still happens when executing

let offsets: &[u32] = try_cast_slice(&data)

I will deal with it in a separate PR.

@@ -1429,6 +1444,7 @@ pub mod tests {
assert!(!is_dict_encoding_applicable(vec![Some("a"), Some("a")], 3));
}

/*
Contributor:

Can we remove this?

@@ -579,6 +579,7 @@ fn rows_in_buffer(
bits_in_buffer / bits_per_value
}

/*
Contributor:

I think I fixed this (well, more worked around it). Can you uncomment these tests and verify whether they still fail?
