
[C++][Dataset] Preserve order when writing dataset #26818

Open · asfimport opened this issue Dec 11, 2020 · 6 comments

Comments

@asfimport (Collaborator) commented Dec 11, 2020

Currently, when writing a dataset, e.g. from a table consisting of a set of record batches, there is no guarantee that the row order is preserved when reading the dataset.

Small code example:

In [1]: import pyarrow as pa, pyarrow.dataset as ds

In [2]: table = pa.table({"a": range(10)})

In [3]: table.to_pandas()
Out[3]: 
   a
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9

In [4]: batches = table.to_batches(max_chunksize=2)

In [5]: ds.write_dataset(batches, "test_dataset_order", format="parquet")

In [6]: ds.dataset("test_dataset_order").to_table().to_pandas()
Out[6]: 
   a
0  4
1  5
2  8
3  9
4  6
5  7
6  2
7  3
8  0
9  1

Although this might seem normal in the SQL world, typical dataframe users (R, pandas/dask, etc.) will expect the row order to be preserved.
Some applications might also rely on this; e.g., with dask you can have a sorted index column ("divisions" between the partitions) that would get lost this way (note: the dask parquet writer itself doesn't use pyarrow.dataset.write_dataset, so it isn't impacted by this).

Some discussion about this started in #8305 (ARROW-9782), which changed the writer to write all fragments to a single file instead of one file per fragment.

I am not fully sure what the best way to solve this is, but IMO at least having the option to preserve the order would be good.
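In the meantime, a possible workaround is to persist an explicit row index and restore the order after reading. A minimal sketch (assuming a pyarrow version that has Table.sort_by and Table.drop_columns; "row_idx" is just an illustrative column name):

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": range(10)})
# Carry an explicit row index through the write so the order can be recovered.
indexed = table.append_column("row_idx", pa.array(range(len(table))))

ds.write_dataset(indexed.to_batches(max_chunksize=2), "test_dataset_order",
                 format="parquet")

# Restore the original order on read, then drop the helper column.
restored = (ds.dataset("test_dataset_order").to_table()
            .sort_by("row_idx")
            .drop_columns(["row_idx"]))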

cc @bkietz

Reporter: Joris Van den Bossche / @jorisvandenbossche
Watchers: Rok Mihevc / @rok

Note: This issue was originally created as ARROW-10883. Please see the migration documentation for further details.

@asfimport

Weston Pace / @westonpace:
I deleted the link to ARROW-12873 because I don't know that "batch index" needs to rely on that arbitrary metadata mechanism (and, given that many nodes will need to manipulate it, I don't think it is arbitrary metadata).

@hu6360567 (Contributor) commented Apr 10, 2024

Hi @westonpace,
Are there any updates on this?
The current "FileSystemDataset::Write" is implemented as a sequenced plan of scan, filter, project, and write.
In the plan referenced in #32991, a batch_index has been added to the scanner and is used by "ordered_sink" to reorder exec batches.
Should we consider implementing an "ordered" node that functions like "ordered_sink" but without sinking? Such a node could be injected anywhere between scan and project (a rough sketch of plan-level ordering follows below).
I believe an "ordered" node would be a more effective way to directly order the output of the "scan" node, providing a more flexible planning approach.
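For illustration only, a minimal sketch of ordering inside a plan using the existing "order_by" node from the pyarrow.acero bindings (this is not the proposed "ordered" node; the "row_idx" column is a hypothetical stand-in for the batch index):

import pyarrow as pa
import pyarrow.acero as acero

# Batches arrive out of order; an ordering step re-sequences rows by index.
table = pa.table({"row_idx": [3, 0, 2, 1], "a": [30, 0, 20, 10]})

decl = acero.Declaration.from_sequence([
    acero.Declaration("table_source", acero.TableSourceNodeOptions(table)),
    acero.Declaration("order_by",
                      acero.OrderByNodeOptions([("row_idx", "ascending")])),
])
print(decl.to_table())  # rows come back sorted by row_idx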

@jerryqhyu

Could someone please solve this issue? This is clearly a bug in Arrow, and it should at least have an option to preserve order.

@douglas-raillard-arm

Just got burnt by the same issue while trying to re-encode a Parquet file with a different row-group size. Even if there is no plan to fix this, it might be a good idea to add a warning to the documentation, which currently says nothing about order not being preserved:
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#
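For the re-encoding use case specifically, a workaround sketch that keeps row order by streaming batches through a single pyarrow.parquet writer ("in.parquet" and "out.parquet" are placeholder paths):

import pyarrow.parquet as pq

src = pq.ParquetFile("in.parquet")
with pq.ParquetWriter("out.parquet", src.schema_arrow) as writer:
    for batch in src.iter_batches(batch_size=64 * 1024):
        # Each written batch becomes one row group of (roughly) the new size.
        writer.write_batch(batch)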

@u3Izx9ql7vW4

Related: #39030

@EnricoMi

Should we consider implementing an "ordered" node ...

Looks like we do not need to introduce a new node, because the "write" node can already sequence exec batches. For this to work, all we need is to tell the "scan" node to emit batches together with an index (ImplicitOrdering), and the "write" node will then sequence the batches by default.

Here is the fix: #44470
