
feat: support miniblock with binary data #3099

Merged Nov 8, 2024 (20 commits)

Conversation


@broccoliSpicy broccoliSpicy commented Nov 6, 2024

This PR enables miniblock encoding with binary data type.

@github-actions github-actions bot added the enhancement New feature or request label Nov 6, 2024

github-actions bot commented Nov 6, 2024

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@broccoliSpicy broccoliSpicy reopened this Nov 6, 2024
@@ -293,6 +293,55 @@ impl FixedWidthDataBlock {
}
}

pub struct VariableWidthDataBlockBuilder1 {
Contributor Author:

For variable-width data with u64 offsets, I will do:

pub struct VariableWidthDataBlockBuilder2 {
    offsets: Vec<u64>,
    bytes: Vec<u8>,
}

it will be added in a follow up PR.
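A minimal sketch of how a builder shaped like this appends values (the struct mirrors the one above with u32 offsets; `new` and `append` are assumed helpers for illustration, not necessarily the PR's actual API):

```rust
// Sketch: a variable-width builder keeps a running offsets vector
// (starting at 0) and a flat byte buffer; offsets[i]..offsets[i+1]
// delimits value i.
pub struct VariableWidthDataBlockBuilder {
    offsets: Vec<u32>,
    bytes: Vec<u8>,
}

impl VariableWidthDataBlockBuilder {
    pub fn new() -> Self {
        Self { offsets: vec![0], bytes: Vec::new() }
    }

    pub fn append(&mut self, value: &[u8]) {
        self.bytes.extend_from_slice(value);
        self.offsets.push(self.bytes.len() as u32);
    }
}
```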

Contributor:

Let's use better names please. Even if it's just VariableWidthDataBlockBuilder and LargeVariableWidthDataBlockBuilder.

Contributor Author:

haha, sure! VariableWidthDataBlockBuilder and LargeVariableWidthDataBlockBuilder are way better than ...1, ...2.

Contributor Author:

fixed.

@broccoliSpicy broccoliSpicy changed the title feat: Miniblock with binary data feat: support miniblock with binary data Nov 6, 2024
@@ -133,6 +133,12 @@ impl ProtobufUtils {
}
}

pub fn binary_miniblock() -> ArrayEncoding {
    ArrayEncoding {
        array_encoding: Some(ArrayEncodingEnum::BinaryMiniblock(BinaryMiniBlock {})),
Contributor Author:

Not important to this PR, but I defined the protobuf message as

message BinaryMiniBlock {
}

yet here it is generated as BinaryMiniblock (B -> b); I don't know the reason.

Contributor:

It is because you have BinaryMiniBlock binary_miniblock = 15; rather than BinaryMiniBlock binary_mini_block = 15;. The generated variant name comes from the field name, not the message name.
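The mapping can be illustrated on the oneof declaration (a sketch; the field number is taken from the comment above, and the rule shown is how prost converts field names to UpperCamelCase variant names):

```proto
// Sketch: in a prost-generated oneof, the Rust variant name is derived
// from the *field* name, not the message name.
oneof array_encoding {
  // "binary_miniblock"  -> ArrayEncodingEnum::BinaryMiniblock
  // "binary_mini_block" -> ArrayEncodingEnum::BinaryMiniBlock
  BinaryMiniBlock binary_miniblock = 15;
}
```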

Contributor Author:

oh! gotcha, thanks!

Contributor Author:

fixed.

@codecov-commenter

codecov-commenter commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 93.77432% with 16 lines in your changes missing coverage. Please review.

Project coverage is 77.13%. Comparing base (5c19fe5) to head (0163c41).

Files with missing lines Patch % Lines
...st/lance-encoding/src/encodings/physical/binary.rs 94.51% 9 Missing ⚠️
rust/lance-encoding/src/data.rs 96.15% 2 Missing ⚠️
rust/lance-encoding/src/decoder.rs 75.00% 0 Missing and 2 partials ⚠️
rust/lance-encoding/src/encoder.rs 92.00% 2 Missing ⚠️
rust/lance-encoding/src/encodings/physical.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3099      +/-   ##
==========================================
+ Coverage   77.08%   77.13%   +0.05%     
==========================================
  Files         240      240              
  Lines       80412    80652     +240     
  Branches    80412    80652     +240     
==========================================
+ Hits        61987    62213     +226     
- Misses      15263    15275      +12     
- Partials     3162     3164       +2     
Flag Coverage Δ
unittests 77.13% <93.77%> (+0.05%) ⬆️


@westonpace westonpace (Contributor) left a comment:

This is great. We will need test cases though. For fixed width data we just extended the 2.0 tests to run on 2.1 as well:

    #[rstest]
    #[test_log::test(tokio::test)]
    async fn test_value_primitive(
        #[values(LanceFileVersion::V2_0, LanceFileVersion::V2_1)] version: LanceFileVersion,
    ) {
        for data_type in PRIMITIVE_TYPES {
            log::info!("Testing encoding for {:?}", data_type);
            let field = Field::new("", data_type.clone(), false);
            check_round_trip_encoding_random(field, version).await;
        }
    }

You can probably do the same thing using the tests in physical/binary.rs.

Also, I think we have a problem with alignment. I have an idea I'll put together real quick.


Comment on lines 325 to 329
for i in selection.start..selection.end {
    let this_value_len = offsets[i as usize + 1] - offsets[i as usize];
    self.offsets.push(previous_len as u32 + this_value_len);
    previous_len += this_value_len as usize;
}
Contributor:

I think this is inefficient (we don't want offsets[] indexing in a loop, because each access is bounds-checked), but we can optimize later if you want.
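One way to avoid indexing `offsets[]` on each iteration is to walk adjacent offset pairs taken from a single bounds-checked subslice; a sketch of the same loop (the function name and signature are assumed for illustration):

```rust
// Sketch: push cumulative output offsets for a selected range without
// per-iteration `offsets[idx]` indexing. `windows(2)` yields adjacent
// (start, end) pairs from one bounds-checked subslice.
fn push_selected_offsets(
    offsets: &[u32],       // source offsets, len = num_values + 1
    start: usize,          // selection.start
    end: usize,            // selection.end (exclusive)
    out: &mut Vec<u32>,    // builder's offsets
    mut previous_len: u32, // bytes already in the builder
) {
    for pair in offsets[start..=end].windows(2) {
        let this_value_len = pair[1] - pair[0];
        previous_len += this_value_len;
        out.push(previous_len);
    }
}
```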

Contributor Author:

thanks for the suggestion, fixed.

Comment on lines +748 to +756
DataType::Binary | DataType::Utf8 => {
    let column_info = column_infos.expect_next()?;
    let scheduler = Box::new(StructuralPrimitiveFieldScheduler::try_new(
        column_info.as_ref(),
        self.decompressor_strategy.as_ref(),
    )?);
    column_infos.next_top_level();
    Ok(scheduler)
}
Contributor:

Instead of doing this can we add Binary and Utf8 to Self::is_primitive?

Contributor Author:

It's a good idea. I am a bit worried that doing this may cause issues in Lance versions prior to 2.1; I will verify this.

Contributor Author:

fixed.

Contributor Author:

It turns out it does cause trouble with backward compatibility; for example, the test in python/tests/test_backwards_compatibility.py fails. I kept CoreDecompressorStrategy::is_primitive as is and kept this logic.

Comment on lines 799 to 803
let max_len = max_len
    .as_any()
    .downcast_ref::<PrimitiveArray<UInt64Type>>()
    .unwrap();
if max_len.value(0) < 128 {
Contributor:

Why verify max_len again here? It's not harmful, just curious.

Contributor Author:

It was just me being cautious, and subconsciously fearing buggy code. I think it's better to remove it.

Contributor Author:

removed.

bytes_start_offset: usize,
// every chunk is padded to 8 bytes.
// we need to interpret every chunk as &[u32], so it needs to be padded to at least 4 bytes;
// 8 bytes is a more conservative choice.
Contributor:

I think there is some confusion between padding and alignment here. You want to be able to take the offsets of a chunk and interpret them as &[u32]. In order for this to work, the slice needs to be aligned to (not padded to) 4 bytes (and it is quite concretely 4; let's not make it 8). This means the pointer to the start of the slice needs to be divisible by 4.

Padding the buffer to make its length divisible by 4 does not guarantee this, except possibly accidentally.

In fact, getting this alignment correct is kind of tricky, and is going to take some tweaking of the miniblock encoder itself 😦.

Let me throw together a PR. We can probably change our "pad to 6" logic to just be "pad to 8" and then pad the start of the data buffer appropriately so that it lands on an 8-byte-aligned pointer. This will ensure that each chunk starts at an 8-byte-aligned pointer.
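The padding-vs-alignment distinction can be demonstrated with a small check (a sketch; `is_aligned_for_u32` is a hypothetical helper, not part of the codebase): a slice whose length is a multiple of 4 can still start at an unaligned address.

```rust
// Sketch: length padding does not imply pointer alignment. To view a
// byte slice as &[u32], the *address* of its first byte must be a
// multiple of align_of::<u32>() (4); the length being a multiple of 4
// is a separate, independent requirement.
fn is_aligned_for_u32(bytes: &[u8]) -> bool {
    (bytes.as_ptr() as usize) % std::mem::align_of::<u32>() == 0
}
```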

Contributor Author:

I did a copy of the entire chunk in decompress (I hope we can eliminate it after your PR).

Because I did a copy, I think I can safely reinterpret the chunk as &[u32] as long as its length in bytes is a multiple of 4.

Comment on lines +596 to +604
// `this_chunk_offsets` are offsets that point into this chunk's bytes
let this_chunk_offsets = offsets
    [chunk.chunk_start_offset_in_orig_idx..chunk.chunk_last_offset_in_orig_idx + 1]
    .iter()
    .map(|offset| {
        offset - offsets[chunk.chunk_start_offset_in_orig_idx]
            + chunk.bytes_start_offset as u32
    })
    .collect::<Vec<_>>();
Contributor:

It appears you are copying n+1 offsets and normalizing them so they always start at 0. If you're going to normalize, there is no need to copy the 0.

Alternatively, you could skip the normalization, and copy the n+1 values. The decompressor would then need to subtract the first value from everything.

Not sure which approach is better, and this approach is fine for now, but it is technically doing more work than needed.
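The "no need to copy the 0" variant might look like this (a sketch with assumed names; the decompressor would then treat the leading 0 as implicit):

```rust
// Sketch: normalize a chunk's offsets so they start at 0, skipping the
// first entry, since after normalization it is always 0 and can be
// reconstructed implicitly by the decompressor.
fn normalized_chunk_offsets(offsets: &[u32], first_idx: usize, last_idx: usize) -> Vec<u32> {
    let base = offsets[first_idx];
    offsets[first_idx + 1..=last_idx]
        .iter()
        .map(|o| o - base)
        .collect()
}
```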

Contributor Author:

It appears you are copying n+1 offsets and normalizing them so they always start at 0.

They normalize to chunk.bytes_start_offset.
I can skip the normalization here, but then I also need to store chunk.bytes_start_offset in the output buffer so the decompressor knows it.

fn decompress(&self, data: LanceBuffer, num_values: u64) -> Result<DataBlock> {
    assert!(data.len() >= 8);
    let data = data.to_vec();
    let offsets: &[u32] = cast_slice(&data);
Contributor:

It might be good to use try_cast_slice instead of cast_slice.
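try_cast_slice (from the bytemuck crate) returns an error instead of panicking when the bytes are misaligned or the length is not a multiple of 4. The check it performs can be sketched with the standard library's `align_to` (a hand-rolled stand-in for illustration, not bytemuck's actual implementation):

```rust
// Sketch: a checked &[u8] -> &[u32] reinterpretation. `align_to`
// splits the slice into an unaligned head, an aligned middle, and a
// leftover tail; the cast is valid only when head and tail are empty.
fn checked_cast_u32(data: &[u8]) -> Option<&[u32]> {
    // Safety: u32 has no invalid bit patterns, so reinterpreting
    // initialized bytes as u32 values is sound.
    let (head, body, tail) = unsafe { data.align_to::<u32>() };
    if head.is_empty() && tail.is_empty() {
        Some(body)
    } else {
        None
    }
}
```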


@broccoliSpicy (Contributor Author):

This is great. We will need test cases though. For fixed width data we just extended the 2.0 tests to run on 2.1 as well:

    #[rstest]
    #[test_log::test(tokio::test)]
    async fn test_value_primitive(
        #[values(LanceFileVersion::V2_0, LanceFileVersion::V2_1)] version: LanceFileVersion,
    ) {
        for data_type in PRIMITIVE_TYPES {
            log::info!("Testing encoding for {:?}", data_type);
            let field = Field::new("", data_type.clone(), false);
            check_round_trip_encoding_random(field, version).await;
        }
    }

You can probably do the same thing using the tests in physical/binary.rs.

Actually, check_round_trip_encoding_random and the other test functions in rust/lance-encoding/src/testing.rs have many problems dealing with PrimitiveStructuralEncoder right now; I will see how to integrate them.

@broccoliSpicy (Contributor Author):

After merging PR #3101, I tried to get rid of


and a misalignment panic still happens when executing

let offsets: &[u32] = try_cast_slice(&data)

I will deal with it in a separate PR.

@@ -1429,6 +1444,7 @@ pub mod tests {
assert!(!is_dict_encoding_applicable(vec![Some("a"), Some("a")], 3));
}

/*
Contributor:

Can we remove this?

@@ -579,6 +579,7 @@ fn rows_in_buffer(
bits_in_buffer / bits_per_value
}

/*
Contributor:

I think I fixed this (well, more worked around it). Can you uncomment these tests and verify whether they still fail?
