Skip to content

Commit

Permalink
arrow-ipc: Read dictionary IDs from dictionary tracker in correct order
Browse files Browse the repository at this point in the history
When dictionary IDs are not preserved, then they are assigned depth
first, however, when reading them from the dictionary tracker to write
the IPC bytes, they were previously read from the dictionary tracker in
the order that the schema is traversed (first come first serve), which
caused an incorrect order of dictionaries serialized in IPC.
  • Loading branch information
brancz committed Sep 24, 2024
1 parent d21569e commit 200d1f5
Showing 1 changed file with 10 additions and 7 deletions.
17 changes: 10 additions & 7 deletions arrow-ipc/src/writer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -395,13 +395,6 @@ impl IpcDataGenerator {
) -> Result<(), ArrowError> {
match column.data_type() {
DataType::Dictionary(_key_type, _value_type) => {
let dict_id = dict_id_seq
.next()
.or_else(|| field.dict_id())
.ok_or_else(|| {
ArrowError::IpcError(format!("no dict id for field {}", field.name()))
})?;

let dict_data = column.to_data();
let dict_values = &dict_data.child_data()[0];

Expand All @@ -415,6 +408,16 @@ impl IpcDataGenerator {
dict_id_seq,
)?;

// It's importnat to only take the dict_id at this point, because the dict ID
// sequence is assigned depth-first, so we need to first encode children and have
// them take their assigned dict IDs before we take the dict ID for this field.
let dict_id = dict_id_seq
.next()
.or_else(|| field.dict_id())
.ok_or_else(|| {
ArrowError::IpcError(format!("no dict id for field {}", field.name()))
})?;

let emit = dictionary_tracker.insert(dict_id, column)?;

if emit {
Expand Down

0 comments on commit 200d1f5

Please sign in to comment.