[Go][Parquet] Trouble using the C++ reader to read a Parquet file written with the Go writer #38503
Comments
Sorry for the late reply. Does running the case in https://github.com/tschaub/parquet-issue-38503 reproduce the problem?
This is weird 😅 It means the Go writer generated level == 2 when max-level == 1. Let me check out the reason.
Updated: I think the C++ reader checks each level against max-def-level, which is 1 here, and it gets 2, so it reports the error. The problem is that the Go writer has a bug here. I'll find it and fix it.
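To illustrate the kind of range check described in the comment above, here is a minimal Go sketch. It is not the C++ reader's actual code; the function name `checkLevels` and the exact error wording are assumptions made for illustration. It only shows why a definition level of 2 is rejected when the column's maximum level is 1.

```go
package main

import "fmt"

// checkLevels is a hypothetical helper (not part of Arrow or Parquet).
// It reports an error if any level falls outside [0, maxLevel].
func checkLevels(levels []int16, maxLevel int16) error {
	if len(levels) == 0 {
		return nil
	}
	lo, hi := levels[0], levels[0]
	for _, l := range levels[1:] {
		if l < lo {
			lo = l
		}
		if l > hi {
			hi = l
		}
	}
	if lo < 0 || hi > maxLevel {
		return fmt.Errorf("malformed levels. min: %d max: %d out of range. max level: %d", lo, hi, maxLevel)
	}
	return nil
}

func main() {
	// A single level of 2 with a max level of 1, as in the reported error.
	fmt.Println(checkLevels([]int16{2}, 1))
	// Levels that stay within range pass the check.
	fmt.Println(checkLevels([]int16{0, 1}, 1))
}
```

A level can only exceed the maximum if the levels were computed for a different column than the one they were written to, which is consistent with the leaf-index confusion discussed later in this thread.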
After rethinking the implementation, I found that this is an implementation issue rather than a bug.
The code above is a bit dangerous; the real code should be:

Also, the file currently generated in your example is a bad file.
…#38581)

### Rationale for this change

Currently, `ArrowColumnWriter` does not appear to have a bug, but its usage is confusing. For nested types, `ArrowColumnWriter` should take the following layout into account:

```
/// 0 foo.bar
///       foo.bar.baz   0
///       foo.bar.baz2  1
///   foo.qux           2
/// 1 foo2              3
/// 2 foo3              4
```

The left column is the column index in the root of the `arrow::Schema`. Parquet itself stores only leaf nodes, so the Parquet column id is listed on the right. The final argument of `ArrowColumnWriter` is the leaf index in Parquet, so the writer should use `leafIdx`. A `LeafCount` API is also needed to get the leaf count.

### What changes are included in this PR?

Style enhancements for `LeafCount`, `leafIdx`, and the usage of `ArrowColumnWriter`.

### Are these changes tested?

no

### Are there any user-facing changes?

no

* Closes: #38503

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
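The mapping from top-level column index to Parquet leaf index described in this commit message can be sketched in plain Go. This is a simplified illustration, not the `pqarrow` implementation: the `Field` type and the functions `leafCount` and `firstLeafIndexes` are hypothetical stand-ins that do not depend on the Arrow packages.

```go
package main

import "fmt"

// Field is a hypothetical, simplified stand-in for an Arrow field.
// A field with no children is a leaf, which Parquet stores as a column.
type Field struct {
	Name     string
	Children []Field
}

// leafCount returns the number of leaf nodes under f (1 if f is a leaf).
func leafCount(f Field) int {
	if len(f.Children) == 0 {
		return 1
	}
	n := 0
	for _, c := range f.Children {
		n += leafCount(c)
	}
	return n
}

// firstLeafIndexes maps each top-level field index to the index of its
// first Parquet leaf column.
func firstLeafIndexes(fields []Field) []int {
	out := make([]int, len(fields))
	leafIdx := 0
	for i, f := range fields {
		out[i] = leafIdx
		leafIdx += leafCount(f)
	}
	return out
}

func main() {
	// The schema from the commit message: foo{bar{baz, baz2}, qux}, foo2, foo3.
	schema := []Field{
		{Name: "foo", Children: []Field{
			{Name: "bar", Children: []Field{{Name: "baz"}, {Name: "baz2"}}},
			{Name: "qux"},
		}},
		{Name: "foo2"},
		{Name: "foo3"},
	}
	fmt.Println(firstLeafIndexes(schema)) // prints [0 3 4]
}
```

In this sketch, passing the top-level index 1 for `foo2` where the leaf index 3 is expected would address the wrong Parquet column, which is the kind of mix-up the commit message warns about.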
This makes it so the Arrow column writer is not exported from the `pqarrow` package. This follows up on comments from #38581. * Closes: #38503 Authored-by: Tim Schaub <[email protected]> Signed-off-by: Matt Topol <[email protected]>
Describe the bug, including details regarding any error messages, version, and platform.
Version: 7ef517e31ec3
OS: macOS 13 arm64
I'm uncertain if this is user error, an issue with the Go packages, or an issue with the C++ reader. I've put together a test that demonstrates the issue here: https://github.com/tschaub/parquet-issue-38503
I'm trying to use the `pqarrow` package to read an input Parquet file, transform some of the data, and write an output Parquet file. In the linked test case, there is no transformation step. So the test uses a `pqarrow.FileReader`, gets a `pqarrow.RowGroupReader` for each row group, reads each column as an `arrow.Chunked`, and uses a `pqarrow.ArrowColumnWriter` to write out the same.

When I try to use the C++ `parquet-reader` to read in the output file, I see the following error:

# parquet-reader output.parquet > /dev/null
Parquet error: Malformed levels. min: 2 max: 2 out of range. Max Level: 1

This same test passes for other Parquet files. I originally encountered the problem with one of the Overture Maps Parquet files, and the linked test case is based on a subset of that data using only two columns and a single row.
Summarizing
input.parquet
output.parquet
Component(s)
Go, Parquet