Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPC not respecting not preserving dict ID #6443

Closed
brancz opened this issue Sep 23, 2024 · 6 comments · Fixed by #6444
Closed

IPC not respecting not preserving dict ID #6443

brancz opened this issue Sep 23, 2024 · 6 comments · Fixed by #6444
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate bug next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate

Comments

@brancz
Copy link
Contributor

brancz commented Sep 23, 2024

Describe the bug

When setting with_preserve_dict_id(false) on IpcWriteOptions of a StreamWriter, and then write a record with multiple dicts whose Fields in the Schema have dict_id: 0, then the last dict's dictionary is actually used for all occurrences.

Best case this causes data to be incorrect, worst case, it causes a panic (which is what led me down this path because my first dictionary had more entries than the second and it caused an out of bounds panic).

To Reproduce

https://gist.github.com/brancz/067bfe6c9f9dfa7a7db82da1757e0edc results in

assertion `left == right` failed
  left: RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "b"
, data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [DictionaryArray {keys: PrimitiveArray<Int32>
[
  0,
  1,
] values: StringArray
[
  "e",
  "f",
]}
, DictionaryArray {keys: PrimitiveArray<Int32>
[
  0,
  1,
] values: StringArray
[
  "e",
  "f",
]}
], row_count: 2 }
 right: RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "b"
, data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [DictionaryArray {keys: PrimitiveArray<Int32>
[
  0,
  1,
] values: StringArray
[
  "c",
  "d",
]}
, DictionaryArray {keys: PrimitiveArray<Int32>
[
  0,
  1,
] values: StringArray
[
  "e",
  "f",
]}
], row_count: 2 }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Expected behavior

Dicts are assigned correctly when the Schema's Field's dict_id is requested to not be preserved.

@alamb @tustvold

@tustvold
Copy link
Contributor

Tagging @thinkharderdev who implemented this functionality in #5971

@brancz
Copy link
Contributor Author

brancz commented Sep 23, 2024

This confused me at first as well, but #5971 actually only solved it for IPC via flight, because it pre-processes the schema. This is about IPC without flight.

I opened #6444 which intentionally doesn't change the behavior introduced in #5971, but I think long term it would be better to consolidate at the lower level that I implemented in #6444, because that will mean that we'll actually be able to remove the dict_id field from the Schema, as with the approach in #6444 the assigning the dict ID is done entirely separately from the Schema, and only through the dictionary tracker.

@alamb alamb added the parquet Changes to the parquet crate label Oct 2, 2024
@alamb
Copy link
Contributor

alamb commented Oct 2, 2024

label_issue.py automatically added labels {'parquet'} from #6444

@alamb alamb added the arrow Changes to the arrow crate label Oct 2, 2024
@alamb
Copy link
Contributor

alamb commented Oct 2, 2024

label_issue.py automatically added labels {'arrow'} from #6444

@alamb alamb added the arrow-flight Changes to the arrow-flight crate label Oct 2, 2024
@alamb
Copy link
Contributor

alamb commented Oct 2, 2024

label_issue.py automatically added labels {'arrow-flight'} from #6444

@alamb alamb added the next-major-release the PR has API changes and it waiting on the next major version label Oct 2, 2024
@alamb
Copy link
Contributor

alamb commented Oct 2, 2024

label_issue.py automatically added labels {'next-major-release'} from #6444

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate bug next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate
Projects
None yet
3 participants