feat: implement ALP-RD compression #947
Conversation
Some Q's:
Separately, just an observation: ALP from the paper recommends having one pair of exponents per vector, rather than one for the entire array like we do now. EDIT: answered.
Force-pushed from e293a50 to c51ecc4
// dict-encode the left-parts, keeping track of exceptions
for (idx, left) in left_parts.iter_mut().enumerate() {
    // TODO: revisit if we need to change the branch order for perf.
    if let Some(code) = self.codes.iter().position(|v| *v == *left) {
I had originally used a HashMap for this, like the C++ code does, but it turns out that doing a linear search on a small fixed-size array is considerably faster (~5x) than doing HashMap lookups.
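A minimal sketch of why that wins (names and types are illustrative, not the PR's actual code): with at most 8 dictionary entries, `iter().position()` is a handful of comparisons that stay in registers and branch-predict well, while a `HashMap` lookup pays the hashing cost on every probe.

```rust
// Illustrative sketch, assuming a dictionary capped at 8 left-part codes.
const MAX_DICT_SIZE: usize = 8;

struct LeftPartsDict {
    codes: Vec<u16>, // at most MAX_DICT_SIZE entries
}

impl LeftPartsDict {
    /// Returns the dictionary code for `left`, or None if it is an exception.
    fn encode(&self, left: u16) -> Option<u16> {
        debug_assert!(self.codes.len() <= MAX_DICT_SIZE);
        // Linear scan: at most 8 comparisons, no hashing, no pointer chasing.
        self.codes.iter().position(|v| *v == left).map(|i| i as u16)
    }
}
```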
I implemented a small test using the POI Kaggle dataset referenced from the paper, and I was able to replicate the compression ratio results.
We can see that our bits-per-value are roughly 55 for
This nets us an overall compression ratio of ~12.5
@@ -89,7 +89,7 @@ pub async fn rewrite_parquet_as_vortex<W: VortexWrite>(
     Ok(())
 }

-pub fn read_parquet_to_vortex(parquet_path: &Path) -> VortexResult<ChunkedArray> {
+pub fn read_parquet_to_vortex<P: AsRef<Path>>(parquet_path: P) -> VortexResult<ChunkedArray> {
ergonomics
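To illustrate the ergonomics win (the file name below is made up), callers can now pass anything path-like instead of constructing a `&Path` first:

```rust
use std::path::{Path, PathBuf};

// Stub with the new signature, standing in for the real function.
fn read_parquet_to_vortex<P: AsRef<Path>>(parquet_path: P) {
    let _path: &Path = parquet_path.as_ref();
}

fn main() {
    read_parquet_to_vortex("data/example.parquet");                // &str
    read_parquet_to_vortex(String::from("data/example.parquet"));  // String
    read_parquet_to_vortex(PathBuf::from("data/example.parquet")); // PathBuf
}
```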
@@ -19,7 +26,14 @@ impl Display for Exponents {
     }
 }

-pub trait ALPFloat: Float + Display + 'static {
+mod private {
in theory this was previously extensible, but we probably want to constrain it
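For readers unfamiliar with the pattern, a minimal sketch of sealing a trait through a private module (the real `ALPFloat` also carries `Float + Display + 'static` bounds and associated items):

```rust
mod private {
    // Downstream crates cannot name `private::Sealed`, so they
    // cannot implement any trait that requires it as a supertrait.
    pub trait Sealed {}
    impl Sealed for f32 {}
    impl Sealed for f64 {}
}

// Only f32 and f64 can ever implement ALPFloat now.
pub trait ALPFloat: private::Sealed {}

impl ALPFloat for f32 {}
impl ALPFloat for f64 {}
```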
Poor f16, not considered in the paper. Anyway, this is the right thing to do.
Ah, I too forgot about f16. I suppose we probably want special compressors for things like bf16.
Yes, this is just a comment for future readers. The paper only talked about the common float types.
encodings/alp/src/alp_rd/mod.rs (outdated)
    }
}

// Only applies for F64.
Old comment; replace with a real doc comment.
One note: don't bother with the accessors; we decided that we likely need to change them.
encodings/alp/Cargo.toml (outdated)
@@ -17,6 +17,8 @@ readme = { workspace = true }
 workspace = true

 [dependencies]
+fastlanes = { workspace = true }
Unused
some small changes
🥳
Fixes #10: Add ALP-RD compression.
Currently our only floating-point compression algorithm is standard ALP, which targets floats/doubles that were originally decimal, and thus have some natural integer they can round to when you undo the exponent.
For science/math datasets, there are a lot of "real doubles", i.e. floating-point numbers that use most or all of their available precision. These do not compress with standard ALP. The ALP paper's authors had a solution for this called "ALP for 'Real' Doubles" (ALP-RD), which is implemented in this PR.
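A rough sketch of the distinction, assuming the paper's (e, f) exponent formulation (this is not the crate's actual API): a decimal-origin double survives the multiply-round-divide round trip exactly, while a full-precision double does not.

```rust
// Sketch of the standard ALP round-trip check (illustrative only).
fn alp_roundtrips(v: f64, e: i32, f: i32) -> bool {
    let encoded = (v * 10f64.powi(e) / 10f64.powi(f)).round();
    let decoded = encoded * 10f64.powi(f) / 10f64.powi(e);
    decoded == v
}

fn main() {
    // A decimal-origin value round-trips exactly...
    assert!(alp_roundtrips(1.23, 2, 0));
    // ...but a "real double" using full precision becomes an exception.
    assert!(!alp_roundtrips(std::f64::consts::PI, 2, 0));
}
```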
Basics
The key insight of ALP-RD is that even dense floating-point numbers within a column often share their front bits (exponent plus the first few bits of the mantissa). We try to find the best cut-point within the leftmost 16 bits.
There are generally a small number of unique values for the leftmost bits, so you can build a dictionary of fixed size (here we use the choice of 8 from the C++ implementation), which naturally bit-packs down to 3 bits. If you compress perfectly without exceptions, you can store 49 bits/value, ~23% compression. In practice the amount varies. In the comments below you can see a test with the POI dataset referenced in the ALP paper, and we replicate their results of 55 and 56 bits/value respectively.
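A hedged sketch of the scheme described above (the function name, exception layout, and constants are illustrative, not the PR's actual code):

```rust
// Split each f64 at a cut point, dict-encode the shared left parts, and
// record dictionary misses as (position, raw left bits) exceptions.
fn split_rd(
    values: &[f64],
    right_bits: u32, // e.g. 48..=63, leaving a <=16-bit left part
    dict: &[u16],    // up to 8 entries, so codes pack into 3 bits
) -> (Vec<u16>, Vec<u64>, Vec<(usize, u16)>) {
    let mut left_codes = Vec::with_capacity(values.len());
    let mut right_parts = Vec::with_capacity(values.len());
    let mut exceptions = Vec::new();
    for (idx, v) in values.iter().enumerate() {
        let bits = v.to_bits();
        let left = (bits >> right_bits) as u16;
        right_parts.push(bits & ((1u64 << right_bits) - 1));
        match dict.iter().position(|d| *d == left) {
            Some(code) => left_codes.push(code as u16),
            None => {
                exceptions.push((idx, left));
                left_codes.push(0); // placeholder, patched on decode
            }
        }
    }
    (left_codes, right_parts, exceptions)
}
```

On decode, the dictionary and exception list restore each left part, which is recombined with its stored right part via shift-or.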
List of changes
- Restructured the `vortex-alp` crate. I created two top-level modules, `alp` and `alp_rd`, and moved the previous implementation into the `alp` module.
- New `ALPRDArray` in the `alp_rd` module. It supports both f32 and f64, and all major compute functions are implemented (save for `MaybeCompareFn` and the Accessors; I will file an issue to implement these in a FLUP if alright, this PR is already quite large).
- New `ALPRDCompressor`, wired in as the CompressorRef everywhere I could find `ALPCompressor`.