Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Browse files
Browse the repository at this point in the history
### Rationale for this change This encoding is defined by the [Parquet spec](https://github.com/apache/parquet-format/blob/master/Encodings.md#byte-stream-split-byte_stream_split--9) but does not currently have a Go implementation. ### What changes are included in this PR? Implement BYTE_STREAM_SPLIT encoder/decoder for: - FIXED_LEN_BYTE_ARRAY - FLOAT - DOUBLE - INT32 - INT64 ### Are these changes tested? Yes. See unit tests, file read conformance tests, and benchmarks. **Benchmark results on my machine** ``` ➜ go git:(impl-pq-bytestreamsplit) go test ./parquet/internal/encoding -run=^$ -bench=BenchmarkByteStreamSplit -benchmem goos: darwin goarch: arm64 pkg: github.com/apache/arrow/go/v17/parquet/internal/encoding BenchmarkByteStreamSplitEncodingInt32/len_1024-14 502117 2005 ns/op 2043.37 MB/s 5267 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt32/len_2048-14 328921 3718 ns/op 2203.54 MB/s 9879 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt32/len_4096-14 169642 7083 ns/op 2313.14 MB/s 18852 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt32/len_8192-14 82503 14094 ns/op 2324.99 MB/s 41425 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt32/len_16384-14 45006 26841 ns/op 2441.68 MB/s 74286 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt32/len_32768-14 23433 51233 ns/op 2558.33 MB/s 140093 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt32/len_65536-14 12019 99001 ns/op 2647.90 MB/s 271417 B/op 3 allocs/op BenchmarkByteStreamSplitDecodingInt32/len_1024-14 996573 1199 ns/op 3417.00 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32/len_2048-14 503200 2380 ns/op 3442.18 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32/len_4096-14 252038 4748 ns/op 3450.90 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32/len_8192-14 122419 9793 ns/op 3346.08 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32/len_16384-14 63321 19040 ns/op 3442.00 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32/len_32768-14 31051 38677 ns/op 3388.89 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32/len_65536-14 15792 77931 ns/op 3363.80 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32Batched/len_1024-14 981043 1221 ns/op 3354.53 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32Batched/len_2048-14 492319 2424 ns/op 3379.34 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32Batched/len_4096-14 248062 4850 ns/op 3378.20 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32Batched/len_8192-14 123064 9903 ns/op 3308.87 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32Batched/len_16384-14 61845 19567 ns/op 3349.29 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32Batched/len_32768-14 30568 39456 ns/op 3321.96 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt32Batched/len_65536-14 15172 78762 ns/op 3328.30 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitEncodingInt64/len_1024-14 319006 3690 ns/op 2220.13 MB/s 9880 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt64/len_2048-14 161006 7132 ns/op 2297.30 MB/s 18853 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt64/len_4096-14 85783 13925 ns/op 2353.12 MB/s 41421 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt64/len_8192-14 45015 26943 ns/op 2432.43 MB/s 74312 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt64/len_16384-14 20352 59259 ns/op 2211.84 MB/s 139940 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt64/len_32768-14 10000 111143 ns/op 2358.61 MB/s 271642 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingInt64/len_65536-14 5529 212652 ns/op 2465.47 MB/s 534805 B/op 3 allocs/op BenchmarkByteStreamSplitDecodingInt64/len_1024-14 528987 2355 ns/op 3478.32 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt64/len_2048-14 262707 4701 ns/op 3485.08 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt64/len_4096-14 129212 9313 ns/op 3518.63 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt64/len_8192-14 53746 23315 ns/op 2810.90 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt64/len_16384-14 28782 41054 ns/op 3192.65 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt64/len_32768-14 14803 80157 ns/op 3270.39 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingInt64/len_65536-14 7484 164111 ns/op 3194.72 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_1024-14 291716 4107 ns/op 997.43 MB/s 5276 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_2048-14 148888 7975 ns/op 1027.18 MB/s 9914 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_4096-14 76587 15677 ns/op 1045.11 MB/s 18955 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_8192-14 39758 30277 ns/op 1082.26 MB/s 41752 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_16384-14 20306 59506 ns/op 1101.33 MB/s 74937 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_32768-14 10000 116043 ns/op 1129.52 MB/s 141290 B/op 3 allocs/op BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_65536-14 4770 236887 ns/op 1106.62 MB/s 277583 B/op 3 allocs/op BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_1024-14 601875 1723 ns/op 2376.70 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_2048-14 363206 3422 ns/op 2394.18 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_4096-14 173041 6906 ns/op 2372.45 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_8192-14 81810 14307 ns/op 2290.40 MB/s 0 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_16384-14 40518 29101 ns/op 2252.04 MB/s 1 B/op 0 allocs/op BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_32768-14 21338 56678 ns/op 2312.58 MB/s 6 B/op 1 allocs/op BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_65536-14 10000 111433 ns/op 2352.49 MB/s 26 B/op 6 allocs/op PASS ok github.com/apache/arrow/go/v17/parquet/internal/encoding 69.109s ``` ### Are there any user-facing changes? New ByteStreamSplit encoding option available. Godoc updated to reflect this. * GitHub Issue: #41640 Authored-by: Joel Lubinitsky <[email protected]> Signed-off-by: Matt Topol <[email protected]>
- Loading branch information