GH-37857: [Python][Dataset] Expose file size to python dataset #37868
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the openness of the Apache Arrow project. Then could you also rename the pull request title in the recommended issue-based format? (In the case of PARQUET issues on JIRA the title also supports the JIRA key format.)
Would you mind change …
So this …
Yes, that's the idea. In Delta Lake, and I believe in Iceberg as well, the objects are not changed after they are written, so the size that those formats store in their metadata should always be correct.
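To make the idea concrete, here is a minimal sketch of feeding sizes recorded in table-format metadata into make_fragment so the reader never has to stat the objects. The bucket, paths, sizes and region are made up; only the file_size keyword comes from this PR.

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Hypothetical path -> size mapping, e.g. read from a Delta Lake or Iceberg
# manifest, where object sizes are recorded at write time and never change.
known_sizes = {
    "my-bucket/table/part-0.parquet": 4237841,
    "my-bucket/table/part-1.parquet": 3991230,
}

filesystem = fs.S3FileSystem(region="us-east-1")
file_format = ds.ParquetFileFormat()

# file_size lets the scanner skip the HEAD request it would otherwise
# issue to discover each object's length.
fragments = [
    file_format.make_fragment(path, filesystem=filesystem, file_size=size)
    for path, size in known_sizes.items()
]
```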
That's a good idea, but I guess when we create an InputStream in the object store, the stream internally will also issue a HEAD request ( https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L1232 ), so I wonder if these are all duplicates...
When the source has a file size it will be passed here: arrow/cpp/src/arrow/filesystem/s3fs.cc, line 1224 (ebc2368).
But that request you linked is also problematic with regards to threading. I wonder if it would be difficult to refactor it to use …
Can we use this part of the interface: 4f8504f? By the way, …
Which part do you mean? In the current PR, FileSource is constructed with FileInfos, and FileInfos get constructed with a size: https://github.com/apache/arrow/pull/37868/files#diff-82a2d9a8424dfb436344c0802dc36275cd0fb87e2e9eb6399745fba43799f7e5R63
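For illustration, the Python-level mirror of that point: a pyarrow.fs.FileInfo can carry a size, so anything built from it does not need to stat the object again. The path and size below are made up.

```python
import pyarrow.fs as fs

# A FileInfo that already knows the object's size; the dataset machinery can
# build its file source from this instead of issuing another HEAD request.
info = fs.FileInfo(
    "my-bucket/table/part-0.parquet",
    type=fs.FileType.File,
    size=4237841,
)
print(info.path, info.size)
```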
Yes, agreed it's not nice. I couldn't get the extension to compile when defaulting to size=None, got some errors about …
Ok, I mean can we use the one with …
Hmm, I wonder why this breaks 🤔 It works on my machine... https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/48125445#L3684
Noticed that the code was always passing the size to FileSource even if it was the default -1... Surprisingly, tests passed, but it probably worked due to some internal validation in FileSource/FileInfo. Fixed now. But if someone has an idea of how to better handle the optional int value, I'm happy to change it.
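One common way to handle this is a sentinel mapping at the binding layer; a rough sketch in plain Python (the real code is Cython, and the helper name here is made up):

```python
def _file_size_to_c(file_size):
    # The C++ side uses -1 to mean "size unknown", so only forward a real
    # value when the caller actually supplied one.
    if file_size is None:
        return -1
    file_size = int(file_size)
    if file_size < 0:
        raise ValueError(f"file_size must be non-negative, got {file_size}")
    return file_size
```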
Can you expand on this?
There is a …
Let's get CI passing, and then this change seems ok to me. @jorisvandenbossche any thoughts on the Python API changes (though they are pretty minor, just adding a new optional parameter to make_fragment)?
Might be interesting to @wjones127 as well.
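For reference, a small sketch of the call-site difference (the local path and filesystem are just for illustration); only the keyword is new, existing calls are unchanged:

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

filesystem = fs.LocalFileSystem()
path = "/tmp/example.parquet"  # assumed to exist for the sketch
size_in_bytes = filesystem.get_file_info(path).size

fmt = ds.ParquetFileFormat()

# Existing call: the size is discovered via the filesystem when needed.
frag_a = fmt.make_fragment(path, filesystem=filesystem)

# With this PR: pass the size up front if it is already known.
frag_b = fmt.make_fragment(path, filesystem=filesystem, file_size=size_in_bytes)
```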
I was running some experiments looking at the debug logs, and it seems that these HEAD requests always get executed on the main (?) thread. From the code it also seems that way; it's not async. So when a dataset consists of multiple fragments, the file reads effectively start in sequence, and the latency from each HEAD adds up. EDIT: I believe it's originating from here: arrow/cpp/src/arrow/dataset/file_parquet.cc, line 482 (e038498).
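One way to reproduce that observation, assuming pyarrow's S3 debug logging is available in the build (the bucket and region below are placeholders):

```python
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem, S3LogLevel, initialize_s3

# Enable verbose AWS SDK logging before any S3 use so the HEAD/GET requests
# issued while opening fragments show up in the log output.
initialize_s3(S3LogLevel.Debug)

s3 = S3FileSystem(region="us-east-1")
dataset = ds.dataset("my-bucket/my-table/", format="parquet", filesystem=s3)

# Without known file sizes, each fragment open triggers a HEAD request; per the
# observation above, those happen sequentially before the actual reads start.
table = dataset.to_table()
```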
Ping @pitrou
Thank you for contributing, @eeroel!
Thanks everyone for the help, and @mapleFU in particular for pushing this forward!
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 2ea7f79. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.
Should this fail with bad file_size values? I tried changing the file_size to -10000 and it still succeeded for me. I am not sure if GCS uses this information though (I am using fsspec for the filesystem, which uses the gcsfs library). I think I am using it wrong, because I cannot get it to fail in general, regardless of whether the size I input is the real size of the file. I didn't see a fragment.size attribute or anything to check other than fragment.metadata.serialized_size, which is different.

```python
fragment = file_format.make_fragment(path, filesystem=dataset.filesystem, partition_expression=expression, file_size=-10000)
print(fragment)
```

```
<pyarrow.dataset.ParquetFileFragment path=bucket/discard/year=2023/part-0.parquet partition=[year=2023]>
<pyarrow.dataset.ParquetFileFragment path=bucket/discard/year=2024/part-0.parquet partition=[year=2024]>

# fresh dataset written with version 15.0
<pyarrow._parquet.FileMetaData object at 0x7f725816ecf0>
  created_by: parquet-cpp-arrow version 15.0.0
  num_columns: 56
  num_rows: 21420
  num_row_groups: 5
  format_version: 2.6
  serialized_size: 34164
```
Did you also try to create a dataset with those fragments and read it? There's no validation when the fragments are constructed, but it should fail when the Parquet reader tries to start reading the file, here: arrow/cpp/src/parquet/file_reader.cc, line 476 (21ffd82).

It would make sense to handle zero and negative sizes on the Python side, though... Regarding fsspec, the file size information will only get used by Arrow's internal file system implementations, and I believe currently it's only used for S3.
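A possible shape for that Python-side check, as an illustrative wrapper rather than the actual pyarrow code:

```python
import pyarrow.dataset as ds

def make_fragment_checked(fmt: ds.ParquetFileFormat, path, filesystem, file_size=None, **kwargs):
    # Reject obviously bad sizes up front instead of letting the C++ reader
    # fail later (or, on filesystems that ignore the hint, silently succeed).
    if file_size is not None and file_size <= 0:
        raise ValueError(f"file_size must be a positive number of bytes, got {file_size}")
    return fmt.make_fragment(path, filesystem=filesystem, file_size=file_size, **kwargs)
```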
I guess if the file size is invalid, parsing the Parquet footer might fail. For other formats the behavior might be undefined.
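To illustrate why the footer parse is where a wrong size shows up, here is a simplified sketch of the Parquet tail layout (not the actual reader code):

```python
import struct

def read_footer_length(raw: bytes, reported_size: int) -> int:
    # A Parquet file ends with: <footer thrift> <4-byte footer length, little-endian> b"PAR1".
    # The reader seeks relative to the reported file size, so a wrong size makes it
    # read these 8 bytes from the wrong place and the magic check fails.
    tail = raw[max(reported_size - 8, 0):reported_size]
    if len(tail) != 8 or tail[4:] != b"PAR1":
        raise ValueError("Parquet magic bytes not found; is the file size correct?")
    return struct.unpack("<I", tail[:4])[0]
```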
Here are some more details:

```python
import pyarrow as pa
import pyarrow.dataset as ds
from icecream import ic

# fs is a gcsfs.GCSFileSystem instance and gcs_path points at the dataset (set up earlier)
ic(pa.__version__)
ic(fs)

dataset = ds.dataset(gcs_path, filesystem=fs)
ic(dataset.filesystem)
table = dataset.to_table()
ic(table.num_rows)

file_format = ds.ParquetFileFormat()
paths = dataset.files

original_fragments = [frag for frag in dataset.get_fragments()]
original_dataset = ds.FileSystemDataset(
    original_fragments, format=file_format, schema=table.schema, filesystem=dataset.filesystem
)
assert dataset.to_table().equals(original_dataset.to_table())
ic(dataset.to_table().equals(original_dataset.to_table()))

wrong_fragments = []
for path in dataset.files:
    fake_size = 55555555555
    actual_size = dataset.filesystem.get_file_info(path).size
    ic(actual_size, fake_size)
    fragment = file_format.make_fragment(path, filesystem=dataset.filesystem, file_size=fake_size)
    wrong_fragments.append(fragment)

test = ds.FileSystemDataset(
    wrong_fragments, format=file_format, schema=table.schema, filesystem=dataset.filesystem
)
assert dataset.to_table().equals(test.to_table())
ic(dataset.to_table().equals(test.to_table()))
```

```
ic| pa.__version__: '15.0.0'
ic| fs: <gcsfs.core.GCSFileSystem object at 0x7f13ea37dd90>
ic| dataset.filesystem: <pyarrow._fs.PyFileSystem object at 0x7f13b5b2d130>
ic| table.num_rows: 23491
ic| dataset.to_table().equals(original_dataset.to_table()): True
ic| actual_size: 4237841, fake_size: 55555555555
ic| dataset.to_table().equals(test.to_table()): True
```

I was using dataset.filesystem, which is a pyarrow filesystem (generated from the gcsfs filesystem). In that example, I made the file size some random large number and you can see it does not match the real size, yet the table constructs and still matches the original table. I thought that would fail, unless I am doing something wrong. I have pyarrow 15 installed though.
Do you also get this result if you set a negative file_size?
Yes, I changed fake_size to -9999 and reran it, and it still worked. But since I am not using S3 (I only have access to GCS and ADLS Gen2), perhaps it is just ignored entirely.

```
ic| pa.__version__: '15.0.0'
ic| fs: <gcsfs.core.GCSFileSystem object at 0x7f55e8b88a10>
ic| dataset.filesystem: <pyarrow._fs.PyFileSystem object at 0x7f562f099230>
ic| table.num_rows: 23491
ic| dataset.to_table().equals(original_dataset.to_table()): True
ic| actual_size: 4237841, fake_size: -9999
ic| dataset.to_table().equals(test.to_table()): True
```

My original plan was to take a look at deltalake (the delta-rs library), which already uses this:

```python
if not filesystem or pyarrow.__version__ >= "15.0":
    file_sizes = self.get_add_actions().to_pydict()
    file_sizes = {
        x: y for x, y in zip(file_sizes["path"], file_sizes["size_bytes"])
    }
format = ParquetFileFormat(
    read_options=parquet_read_options,
    default_fragment_scan_options=ParquetFragmentScanOptions(pre_buffer=True),
)
fragments = []
for file, part_expression in self._table.dataset_partitions(
    self.schema().to_pyarrow(), partitions
):
    if pyarrow.__version__ >= "15.0":
        fragment = format.make_fragment(
            file,
            filesystem=filesystem,
            partition_expression=part_expression,
            file_size=file_sizes[file],
        )
    else:
        fragment = format.make_fragment(
            file,
            filesystem=filesystem,
            partition_expression=part_expression,
        )
    fragments.append(fragment)
```
OK, thanks for confirming. I think it's good to check this; I didn't add a test case for negative values in this PR. Could be a bug, or it's possible there's some validation somewhere along the chain that silently ignores the value.

Yep, that was the motivation for this PR! I actually implemented this in deltalake for the filesystem implementation, so if you can use that then the optimization should already be applied: https://github.com/delta-io/delta-rs/blob/1b6c830aae4553d2a079a2bf42b024863fcbbb40/python/deltalake/table.py#L1035 To use this with the Arrow GCS implementation, I think the OpenFile method should be updated to check the size from FileInfo.
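For anyone who just wants the optimization, a hedged usage sketch via the deltalake (delta-rs) package, which already wires the sizes from the Delta log into pyarrow (the table URI is a placeholder):

```python
from deltalake import DeltaTable

dt = DeltaTable("s3://my-bucket/my-delta-table")  # placeholder table URI
# to_pyarrow_dataset() builds fragments with file_size taken from the Delta log,
# so the reader does not need extra HEAD requests to learn object sizes.
dataset = dt.to_pyarrow_dataset()
table = dataset.to_table()
```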
GH-37857: [Python][Dataset] Expose file size to python dataset (apache#37868)

* Closes: apache#37857

Lead-authored-by: Eero Lihavainen <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change

Allow passing known file sizes to `make_fragment`, to avoid potential network requests.

### What changes are included in this PR?

### Are these changes tested?

Yes, tests with S3 that the file size gets used.

### Are there any user-facing changes?

Yes, new function arguments.
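As a usage illustration of the new arguments (bucket, key, size and region are made up), a fragment built with a known size and read back through a FileSystemDataset:

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

s3 = fs.S3FileSystem(region="us-east-1")
fmt = ds.ParquetFileFormat()

fragment = fmt.make_fragment(
    "my-bucket/data/part-0.parquet",
    filesystem=s3,
    file_size=4237841,  # known size in bytes; avoids a HEAD request on S3
)

dataset = ds.FileSystemDataset(
    [fragment], format=fmt, schema=fragment.physical_schema, filesystem=s3
)
table = dataset.to_table()
```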