Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Go][Parquet] Trouble using the C++ reader to read a Parquet file written with the Go writer #38503

Closed
tschaub opened this issue Oct 28, 2023 · 5 comments · Fixed by #38581 or #38727
Closed

Comments

@tschaub
Copy link
Contributor

tschaub commented Oct 28, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Version: 7ef517e31ec3
OS: macOS 13 arm64

I'm uncertain if this is user error, an issue with the Go packages, or an issue with the C++ reader. I've put together a test that demonstrates the issue here: https://github.com/tschaub/parquet-issue-38503

I'm trying to use the pqarrow package to read an input Parquet file, transform some of the data, and write an output Parquet file. In the linked test case, there is no transformation step. So the test uses a pqarrow.FileReader, gets a pqarrow.RowGroupReader for each row group, reads each column as an arrow.Chunked, and uses a pqarrow.ArrowColumnWriter to write out the same.

When I try to use the C++ parquet-reader to read in the output file, I see the following error:

# parquet-reader output.parquet > /dev/null
Parquet error: Malformed levels. min: 2 max: 2 out of range.  Max Level: 1

This same test passes for other Parquet files. I originally encountered the problem with one of the Overture Maps Parquet files, and the linked test case is based on a subset of that data using only two columns and a single row.

Summarizing

file C++ reader Go reader
input.parquet
output.parquet

Component(s)

Go, Parquet

@mapleFU
Copy link
Member

mapleFU commented Nov 4, 2023

Sorry for late reply. Run case in https://github.com/tschaub/parquet-issue-38503 can generate the case?

@mapleFU
Copy link
Member

mapleFU commented Nov 4, 2023

# parquet-reader output.parquet > /dev/null
Parquet error: Malformed levels. min: 2 max: 2 out of range.  Max Level: 1

This is weird 😅 This means go generate level == 2 when max-level == 1. Let me checkout the reason.

@mapleFU
Copy link
Member

mapleFU commented Nov 4, 2023

Updated: I think C++ reader checks max-def-level, and it's 1. (and get 2). So it report the error.

The problem is that go writer has a bug here. I'll find out and fix it.

@mapleFU
Copy link
Member

mapleFU commented Nov 4, 2023

After rethink the impl, I found it's the impl's problem rather than a bug.

	ctx := context.Background()
	numFields := len(arrowReader.Manifest.Fields)
	numRowGroups := fileReader.NumRowGroups()
	for rowGroupIndex := 0; rowGroupIndex < numRowGroups; rowGroupIndex += 1 {
		rowGroupReader := arrowReader.RowGroup(rowGroupIndex)
		rowGroupWriter := fileWriter.AppendRowGroup()
		for fieldNum := 0; fieldNum < numFields; fieldNum += 1 {
			arr, err := rowGroupReader.Column(fieldNum).Read(ctx)
			fmt.Println("Read field: ", fieldNum)
			if err != nil {
				return nil, err
			}
			colWriter, err := pqarrow.NewArrowColumnWriter(arr, 0, int64(arr.Len()), arrowReader.Manifest, rowGroupWriter, fieldNum)
			if err != nil {
				return nil, err
			}
			if err := colWriter.Write(ctx); err != nil {
				return nil, err
			}
		}
	}
	if err := fileWriter.Close(); err != nil {
		return nil, err
	}
colWriter, err := pqarrow.NewArrowColumnWriter(arr, 0, int64(arr.Len()), arrowReader.Manifest, rowGroupWriter, fieldNum)

The code above is abit dangerous, the real code should be:

	numRowGroups := fileReader.NumRowGroups()
	for rowGroupIndex := 0; rowGroupIndex < numRowGroups; rowGroupIndex += 1 {
		rowGroupReader := arrowReader.RowGroup(rowGroupIndex)
		rowGroupWriter := fileWriter.AppendRowGroup()
		colIdx := 0
		for fieldNum := 0; fieldNum < numFields; fieldNum += 1 {
			arr, err := rowGroupReader.Column(fieldNum).Read(ctx)
			fmt.Println("Read field: ", fieldNum)
			for _, v := range arr.Chunks() {
				fmt.Println("CHUNK:" , v)
			}
			if err != nil {
				return nil, err
			}
			colWriter, err := pqarrow.NewArrowColumnWriter(arr, 0, int64(arr.Len()), arrowReader.Manifest, rowGroupWriter, colIdx)
			if err != nil {
				return nil, err
			}
			if err := colWriter.Write(ctx); err != nil {
				return nil, err
			}
			colIdx += colWriter.LeafCount()
		}
	}

colWriter.LeafCount() cannot be get now, maybe we need using other way.

@mapleFU
Copy link
Member

mapleFU commented Nov 4, 2023

And currently, the generated file is a bad file here...(in your example)

zeroshade pushed a commit that referenced this issue Nov 15, 2023
…#38581)

### Rationale for this change

Currently, `ArrowColumnWriter` seems not having bug. But the usage is confusing. For nested type,  `ArrowColumnWriter` should considering the logic below:

```
  /// 0 foo.bar
  ///       foo.bar.baz           0
  ///       foo.bar.baz2          1
  ///   foo.qux                   2
  /// 1 foo2                      3
  /// 2 foo3                      4
```

The left column is the column in root of `arrow::Schema`, the parquet itself only stores Leaf node,
so, the column id for parquet is list at right.

In the `ArrowColumnWriter`, the final argument is the LeafIdx in parquet, so, writer should considering
using `leafIdx`. Also, it need a `LeafCount` API for getting the leaf-count here.

### What changes are included in this PR?

Style enhancement for `LeafCount`, `leafIdx` and usage for `ArrowColumnWriter`

### Are these changes tested?

no

### Are there any user-facing changes?

no

* Closes: #38503

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
@zeroshade zeroshade added this to the 15.0.0 milestone Nov 15, 2023
zeroshade pushed a commit that referenced this issue Nov 15, 2023
This makes it so the Arrow column writer is not exported from the `pqarrow` package.  This follows up on comments from #38581.
* Closes: #38503

Authored-by: Tim Schaub <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…Writer (apache#38581)

### Rationale for this change

Currently, `ArrowColumnWriter` seems not having bug. But the usage is confusing. For nested type,  `ArrowColumnWriter` should considering the logic below:

```
  /// 0 foo.bar
  ///       foo.bar.baz           0
  ///       foo.bar.baz2          1
  ///   foo.qux                   2
  /// 1 foo2                      3
  /// 2 foo3                      4
```

The left column is the column in root of `arrow::Schema`, the parquet itself only stores Leaf node,
so, the column id for parquet is list at right.

In the `ArrowColumnWriter`, the final argument is the LeafIdx in parquet, so, writer should considering
using `leafIdx`. Also, it need a `LeafCount` API for getting the leaf-count here.

### What changes are included in this PR?

Style enhancement for `LeafCount`, `leafIdx` and usage for `ArrowColumnWriter`

### Are these changes tested?

no

### Are there any user-facing changes?

no

* Closes: apache#38503

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…pache#38727)

This makes it so the Arrow column writer is not exported from the `pqarrow` package.  This follows up on comments from apache#38581.
* Closes: apache#38503

Authored-by: Tim Schaub <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment