GH-40154: [C++][Parquet] Separate encoders and decoder #43972

Merged Sep 5, 2024 (1 commit)


pitrou
Member

@pitrou pitrou commented Sep 5, 2024

Rationale for this change

encoding.cc is quite large nowadays: around 4000 lines of code, which makes code navigation cumbersome. It combines the functionality of encoders and decoders, even though those use distinct infrastructures and do not share any code.

Other areas of Parquet tend to separate the reading and writing facilities: for example, column_reader.cc vs. column_writer.cc.

What changes are included in this PR?

The main change is to move encoders to encoder.cc, decoders to decoder.cc, and remove encoding.cc.

A smaller improvement removes the inclusion of arrow/util/spaced.h from encoding.h by moving the TypedDecoder<T>::DecodeSpaced implementation into decoder.cc.
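
To illustrate the general technique (this is a hypothetical single-file sketch, not the actual Arrow code): a template member function can be declared in the header and defined in the source file, provided the source file explicitly instantiates the template for every supported type. The header then no longer needs the helper's include. The names `SketchDecoder`, `SpacedExpandSketch` and `GetBit` are invented for this example, and the spacing helper is a simplified stand-in for whatever arrow/util/spaced.h provides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// ---- "header" part: what a hypothetical sketch_decoder.h would contain ----
// DecodeSpaced is only declared here, so the header does not need to
// include the spacing helper.
template <typename T>
class SketchDecoder {
 public:
  explicit SketchDecoder(std::vector<T> values) : values_(std::move(values)) {}

  // Copies up to num_values densely packed values into out.
  int Decode(T* out, int num_values);

  // Fills num_values slots, leaving a default value where valid_bits is 0.
  int DecodeSpaced(T* out, int num_values, int null_count,
                   const uint8_t* valid_bits);

 private:
  std::vector<T> values_;
  std::size_t pos_ = 0;
};

// ---- "source" part: what a hypothetical sketch_decoder.cc would contain ----
namespace {

// Reads bit i of a little-endian validity bitmap.
inline bool GetBit(const uint8_t* bits, int i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Simplified stand-in for a spacing helper: moves the dense prefix of
// `buffer` into the valid slots in place, walking backwards.
template <typename T>
void SpacedExpandSketch(T* buffer, int num_values, int null_count,
                        const uint8_t* valid_bits) {
  int src = num_values - null_count - 1;
  for (int dst = num_values - 1; dst >= 0; --dst) {
    if (GetBit(valid_bits, dst)) {
      assert(src >= 0);  // bitmap must agree with null_count
      buffer[dst] = buffer[src--];
    } else {
      buffer[dst] = T{};
    }
  }
}

}  // namespace

template <typename T>
int SketchDecoder<T>::Decode(T* out, int num_values) {
  int n = 0;
  while (n < num_values && pos_ < values_.size()) out[n++] = values_[pos_++];
  return n;
}

template <typename T>
int SketchDecoder<T>::DecodeSpaced(T* out, int num_values, int null_count,
                                   const uint8_t* valid_bits) {
  assert(null_count >= 0 && null_count <= num_values);
  const int to_read = num_values - null_count;
  const int read = Decode(out, to_read);
  if (read != to_read) return read;  // not enough data: report what was read
  SpacedExpandSketch(out, num_values, null_count, valid_bits);
  return num_values;
}

// Explicit instantiations: needed because other translation units only see
// the declarations above, not the template bodies.
template class SketchDecoder<int32_t>;
template class SketchDecoder<double>;
```

The trade-off in this sketch is that the set of usable element types is fixed by the explicit instantiations in the source file, which is acceptable when the supported types form a closed list.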

Note the massive code shuffle may obscure the git history quite a bit. git log -C doesn't seem able to track earlier versions of the encoder and decoder code, but git blame -C is.

Are these changes tested?

By existing tests.

Are there any user-facing changes?

No.

@pitrou
Member Author

pitrou commented Sep 5, 2024

@mapleFU @felipecrv @wgtmac This splits encoders and decoders into two separate files, but keeps a single encoding.h header. Does that sound ok?

@pitrou
Member Author

pitrou commented Sep 5, 2024

I've checked that git blame -C is able to trace back through the history for encoder.cc and decoder.cc (though git log -C isn't, for some reason).

@wgtmac
Member

wgtmac commented Sep 5, 2024

I just took a glimpse of it. It looks great!

@mapleFU
Member

mapleFU commented Sep 5, 2024

LGTM

@pitrou pitrou marked this pull request as ready for review September 5, 2024 14:37
@pitrou pitrou requested a review from wgtmac as a code owner September 5, 2024 14:37
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Sep 5, 2024
@pitrou
Member Author

pitrou commented Sep 5, 2024

@github-actions crossbow submit -g cpp


github-actions bot commented Sep 5, 2024

Revision: be17259

Submitted crossbow builds: ursacomputing/crossbow @ actions-d1fc910e2b

| Task | Status |
|------|--------|
| test-alpine-linux-cpp | GitHub Actions |
| test-build-cpp-fuzz | GitHub Actions |
| test-conda-cpp | GitHub Actions |
| test-conda-cpp-valgrind | GitHub Actions |
| test-cuda-cpp | GitHub Actions |
| test-debian-12-cpp-amd64 | GitHub Actions |
| test-debian-12-cpp-i386 | GitHub Actions |
| test-fedora-39-cpp | GitHub Actions |
| test-ubuntu-20.04-cpp | GitHub Actions |
| test-ubuntu-20.04-cpp-bundled | GitHub Actions |
| test-ubuntu-20.04-cpp-minimal-with-formats | GitHub Actions |
| test-ubuntu-20.04-cpp-thread-sanitizer | GitHub Actions |
| test-ubuntu-22.04-cpp | GitHub Actions |
| test-ubuntu-22.04-cpp-20 | GitHub Actions |
| test-ubuntu-22.04-cpp-emscripten | GitHub Actions |
| test-ubuntu-22.04-cpp-no-threading | GitHub Actions |
| test-ubuntu-24.04-cpp | GitHub Actions |
| test-ubuntu-24.04-cpp-gcc-13-bundled | GitHub Actions |
| test-ubuntu-24.04-cpp-gcc-14 | GitHub Actions |

@pitrou pitrou merged commit c2123b8 into apache:main Sep 5, 2024
38 of 39 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Sep 5, 2024
@pitrou pitrou deleted the gh40154-pq-encoding branch September 5, 2024 16:11

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit c2123b8.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Sep 6, 2024

* GitHub Issue: apache#40154

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024