[C++][Dataset] Preserve order when writing dataset #26818
Comments
Weston Pace / @westonpace:
Hi @westonpace,
Could someone please solve this issue? This is clearly a bug in Arrow, and it should at least have an option to preserve order.
Just got burnt by the same issue as I was trying to re-encode a parquet file with a different row group size. Even if there is no plan to fix this issue, it might be a good idea to add a warning to the documentation, which currently mentions nothing about order not being preserved.
Related: #39030
…s, because (as of now) it does not preserve ordering on a filesystem write. apache/arrow#26818 apache/arrow#39030
Looks like we do not need to introduce a new node because the … Here is the fix: #44470
Currently, when writing a dataset, e.g. from a table consisting of a set of record batches, there is no guarantee that the row order is preserved when reading the dataset.
Small code example:
Although this might seem normal in the SQL world, typical dataframe users (R, pandas/dask, etc.) will expect the row order to be preserved.
Some applications might also rely on this, e.g. with dask you can have a sorted index column ("divisions" between the partitions) that would get lost this way (note that the dask parquet writer itself doesn't use pyarrow.dataset.write_dataset, so it isn't impacted by this).
Some discussion about this started in #8305 (ARROW-9782), which changed the writer to write all fragments to a single file instead of a file per fragment.
I am not fully sure what the best way to solve this is, but IMO at least having the option to preserve the order would be good.
cc @bkietz
Reporter: Joris Van den Bossche / @jorisvandenbossche
Watchers: Rok Mihevc / @rok
Related issues:
Note: This issue was originally created as ARROW-10883. Please see the migration documentation for further details.