GH-37857: [Python][Dataset] Expose file size to python dataset #37868
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the openness of the Apache Arrow project. Then could you also rename the pull request title in the recommended issue-based format? (In the case of PARQUET issues on JIRA the title also supports the JIRA key format.)
Would you mind change …
So this …
Yes, that's the idea. In Delta Lake, and I believe in Iceberg as well, the objects are not changed after they are written, so the size that those formats store in their metadata should always be correct.
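To make the idea concrete, here is a minimal sketch of feeding sizes recorded in table-format metadata into make_fragment so the reader never has to stat the objects. The bucket, paths, sizes and region are made up; only the file_size keyword comes from this PR.

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Hypothetical path -> size mapping, e.g. read from a Delta Lake or Iceberg
# manifest, where object sizes are recorded at write time and never change.
known_sizes = {
    "my-bucket/table/part-0.parquet": 4237841,
    "my-bucket/table/part-1.parquet": 3991230,
}

filesystem = fs.S3FileSystem(region="us-east-1")
file_format = ds.ParquetFileFormat()

# file_size lets the scanner skip the HEAD request it would otherwise
# issue to discover each object's length.
fragments = [
    file_format.make_fragment(path, filesystem=filesystem, file_size=size)
    for path, size in known_sizes.items()
]
```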
That's a good idea, but I guess when we create an InputStream in the object store, the stream internally will also issue a HEAD request ( https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L1232 ), so I wonder if these are all duplicates...
When the source has a file size it will be passed here: arrow/cpp/src/arrow/filesystem/s3fs.cc, line 1224 (ebc2368).
But that request you linked is also problematic with regards to threading. I wonder if it would be difficult to refactor it to use …
Can we use this part of the interface: 4f8504f? By the way, …
Which part do you mean? In the current PR, FileSource is constructed with FileInfos, and FileInfos get constructed with a size: https://github.com/apache/arrow/pull/37868/files#diff-82a2d9a8424dfb436344c0802dc36275cd0fb87e2e9eb6399745fba43799f7e5R63
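For illustration, the Python-level mirror of that point: a pyarrow.fs.FileInfo can carry a size, so anything built from it does not need to stat the object again. The path and size below are made up.

```python
import pyarrow.fs as fs

# A FileInfo that already knows the object's size; the dataset machinery can
# build its file source from this instead of issuing another HEAD request.
info = fs.FileInfo(
    "my-bucket/table/part-0.parquet",
    type=fs.FileType.File,
    size=4237841,
)
print(info.path, info.size)
```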
Yes, agreed it's not nice. I couldn't get the extension to compile when defaulting to size=None, got some errors about …
Ok, I mean can we use the one with …
Hmm, I wonder why this breaks 🤔 It works on my machine... https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/48125445#L3684
Noticed that the code was always passing the size to FileSource even if it was the default -1... Surprisingly, tests passed, but it probably worked due to some internal validation in FileSource/FileInfo. Fixed now. But if someone has an idea of how to better handle the optional int value, I'm happy to change it.
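One common way to handle this is a sentinel mapping at the binding layer; a rough sketch in plain Python (the real code is Cython, and the helper name here is made up):

```python
def _file_size_to_c(file_size):
    # The C++ side uses -1 to mean "size unknown", so only forward a real
    # value when the caller actually supplied one.
    if file_size is None:
        return -1
    file_size = int(file_size)
    if file_size < 0:
        raise ValueError(f"file_size must be non-negative, got {file_size}")
    return file_size
```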
Can you expand on this?
There is a …
Let's get CI passing, and then this change seems ok to me. @jorisvandenbossche any thoughts on the Python API changes (though they are pretty minor, just adding a new optional parameter to make_fragment)?
Might be interesting to @wjones127 as well.
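For reference, a small sketch of the call-site difference (the local path and filesystem are just for illustration); only the keyword is new, existing calls are unchanged:

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

filesystem = fs.LocalFileSystem()
path = "/tmp/example.parquet"  # assumed to exist for the sketch
size_in_bytes = filesystem.get_file_info(path).size

fmt = ds.ParquetFileFormat()

# Existing call: the size is discovered via the filesystem when needed.
frag_a = fmt.make_fragment(path, filesystem=filesystem)

# With this PR: pass the size up front if it is already known.
frag_b = fmt.make_fragment(path, filesystem=filesystem, file_size=size_in_bytes)
```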
I was running some experiments looking at the debug logs, and it seems that these HEAD requests always get executed on the main (?) thread. From the code it also seems that way; it's not async. So when a dataset consists of multiple fragments, the file reads effectively start in sequence, and the latency from each HEAD adds up. EDIT: I believe it's originating from here: arrow/cpp/src/arrow/dataset/file_parquet.cc, line 482 (e038498).
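One way to reproduce that observation, assuming pyarrow's S3 debug logging is available in the build (the bucket and region below are placeholders):

```python
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem, S3LogLevel, initialize_s3

# Enable verbose AWS SDK logging before any S3 use so the HEAD/GET requests
# issued while opening fragments show up in the log output.
initialize_s3(S3LogLevel.Debug)

s3 = S3FileSystem(region="us-east-1")
dataset = ds.dataset("my-bucket/my-table/", format="parquet", filesystem=s3)

# Without known file sizes, each fragment open triggers a HEAD request; per the
# observation above, those happen sequentially before the actual reads start.
table = dataset.to_table()
```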
Ping @pitrou
Thank you for contributing, @eeroel!
Thanks everyone for the help, and @mapleFU in particular for pushing this forward!
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 2ea7f79. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.
Should this fail with bad file_size values? I tried changing the file_size to -10000 and it still succeeded for me. I am not sure if GCS uses this information though (I am using fsspec for the filesystem, which uses the gcsfs library). I think I am using it wrong, because I cannot get it to fail in general, regardless of whether the size I input is the real size of the file. I didn't see a fragment.size attribute or anything to check other than fragment.metadata.serialized_size, which is different.

```python
fragment = file_format.make_fragment(path, filesystem=dataset.filesystem, partition_expression=expression, file_size=-10000)
print(fragment)
```

```
<pyarrow.dataset.ParquetFileFragment path=bucket/discard/year=2023/part-0.parquet partition=[year=2023]>
<pyarrow.dataset.ParquetFileFragment path=bucket/discard/year=2024/part-0.parquet partition=[year=2024]>

# fresh dataset written with version 15.0
<pyarrow._parquet.FileMetaData object at 0x7f725816ecf0>
  created_by: parquet-cpp-arrow version 15.0.0
  num_columns: 56
  num_rows: 21420
  num_row_groups: 5
  format_version: 2.6
  serialized_size: 34164
```
Did you also try to create a dataset with those fragments and read it? There's no validation when the fragments are constructed, but it should fail when the Parquet reader tries to start reading the file, here: arrow/cpp/src/parquet/file_reader.cc, line 476 (21ffd82).

It would make sense to handle zero and negative sizes on the Python side, though... Regarding fsspec, the file size information will only get used by Arrow's internal file system implementations, and I believe currently it's only used for S3.
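A possible shape for that Python-side check, as an illustrative wrapper rather than the actual pyarrow code:

```python
import pyarrow.dataset as ds

def make_fragment_checked(fmt: ds.ParquetFileFormat, path, filesystem, file_size=None, **kwargs):
    # Reject obviously bad sizes up front instead of letting the C++ reader
    # fail later (or, on filesystems that ignore the hint, silently succeed).
    if file_size is not None and file_size <= 0:
        raise ValueError(f"file_size must be a positive number of bytes, got {file_size}")
    return fmt.make_fragment(path, filesystem=filesystem, file_size=file_size, **kwargs)
```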
I guess if the file size is invalid, parsing the Parquet footer might fail. For other formats the behavior might be undefined.
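To illustrate why the footer parse is where a wrong size shows up, here is a simplified sketch of the Parquet tail layout (not the actual reader code):

```python
import struct

def read_footer_length(raw: bytes, reported_size: int) -> int:
    # A Parquet file ends with: <footer thrift> <4-byte footer length, little-endian> b"PAR1".
    # The reader seeks relative to the reported file size, so a wrong size makes it
    # read these 8 bytes from the wrong place and the magic check fails.
    tail = raw[max(reported_size - 8, 0):reported_size]
    if len(tail) != 8 or tail[4:] != b"PAR1":
        raise ValueError("Parquet magic bytes not found; is the file size correct?")
    return struct.unpack("<I", tail[:4])[0]
```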
Here are some more details:

```python
import pyarrow as pa
import pyarrow.dataset as ds
from icecream import ic

# fs is a gcsfs.GCSFileSystem instance and gcs_path points at the dataset (set up earlier)
ic(pa.__version__)
ic(fs)

dataset = ds.dataset(gcs_path, filesystem=fs)
ic(dataset.filesystem)
table = dataset.to_table()
ic(table.num_rows)

file_format = ds.ParquetFileFormat()
paths = dataset.files

original_fragments = [frag for frag in dataset.get_fragments()]
original_dataset = ds.FileSystemDataset(
    original_fragments, format=file_format, schema=table.schema, filesystem=dataset.filesystem
)
assert dataset.to_table().equals(original_dataset.to_table())
ic(dataset.to_table().equals(original_dataset.to_table()))

wrong_fragments = []
for path in dataset.files:
    fake_size = 55555555555
    actual_size = dataset.filesystem.get_file_info(path).size
    ic(actual_size, fake_size)
    fragment = file_format.make_fragment(path, filesystem=dataset.filesystem, file_size=fake_size)
    wrong_fragments.append(fragment)

test = ds.FileSystemDataset(
    wrong_fragments, format=file_format, schema=table.schema, filesystem=dataset.filesystem
)
assert dataset.to_table().equals(test.to_table())
ic(dataset.to_table().equals(test.to_table()))
```

```
ic| pa.__version__: '15.0.0'
ic| fs: <gcsfs.core.GCSFileSystem object at 0x7f13ea37dd90>
ic| dataset.filesystem: <pyarrow._fs.PyFileSystem object at 0x7f13b5b2d130>
ic| table.num_rows: 23491
ic| dataset.to_table().equals(original_dataset.to_table()): True
ic| actual_size: 4237841, fake_size: 55555555555
ic| dataset.to_table().equals(test.to_table()): True
```

I was using dataset.filesystem, which is a pyarrow filesystem (generated from the gcsfs filesystem). In that example, I made the file size some random large number and you can see it does not match the real size, yet the table constructs and still matches the original table. I thought that would fail, unless I am doing something wrong. I have pyarrow 15 installed though.
Do you also get this result if you set a negative file_size?
Yes, I changed fake_size to -9999 and reran it, and it still worked. But since I am not using S3 (I only have access to GCS and ADLS Gen2), perhaps it is just ignored entirely.

```
ic| pa.__version__: '15.0.0'
ic| fs: <gcsfs.core.GCSFileSystem object at 0x7f55e8b88a10>
ic| dataset.filesystem: <pyarrow._fs.PyFileSystem object at 0x7f562f099230>
ic| table.num_rows: 23491
ic| dataset.to_table().equals(original_dataset.to_table()): True
ic| actual_size: 4237841, fake_size: -9999
ic| dataset.to_table().equals(test.to_table()): True
```

My original plan was to take a look at deltalake (the delta-rs library), which already uses this:

```python
if not filesystem or pyarrow.__version__ >= "15.0":
    file_sizes = self.get_add_actions().to_pydict()
    file_sizes = {
        x: y for x, y in zip(file_sizes["path"], file_sizes["size_bytes"])
    }
format = ParquetFileFormat(
    read_options=parquet_read_options,
    default_fragment_scan_options=ParquetFragmentScanOptions(pre_buffer=True),
)
fragments = []
for file, part_expression in self._table.dataset_partitions(
    self.schema().to_pyarrow(), partitions
):
    if pyarrow.__version__ >= "15.0":
        fragment = format.make_fragment(
            file,
            filesystem=filesystem,
            partition_expression=part_expression,
            file_size=file_sizes[file],
        )
    else:
        fragment = format.make_fragment(
            file,
            filesystem=filesystem,
            partition_expression=part_expression,
        )
    fragments.append(fragment)
```
OK, thanks for confirming. I think it's good to check this; I didn't add a test case for negative values in this PR. Could be a bug, or it's possible there's some validation somewhere along the chain that silently ignores the value.

Yep, that was the motivation for this PR! I actually implemented this in deltalake for the filesystem implementation, so if you can use that then the optimization should already be applied: https://github.com/delta-io/delta-rs/blob/1b6c830aae4553d2a079a2bf42b024863fcbbb40/python/deltalake/table.py#L1035 To use this with the Arrow GCS implementation, I think the OpenFile method should be updated to check the size from FileInfo.
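For anyone who just wants the optimization, a hedged usage sketch via the deltalake (delta-rs) package, which already wires the sizes from the Delta log into pyarrow (the table URI is a placeholder):

```python
from deltalake import DeltaTable

dt = DeltaTable("s3://my-bucket/my-delta-table")  # placeholder table URI
# to_pyarrow_dataset() builds fragments with file_size taken from the Delta log,
# so the reader does not need extra HEAD requests to learn object sizes.
dataset = dt.to_pyarrow_dataset()
table = dataset.to_table()
```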
GH-37857: [Python][Dataset] Expose file size to python dataset (apache#37868)

* Closes: apache#37857

Lead-authored-by: Eero Lihavainen <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change

Allow passing known file sizes to `make_fragment`, to avoid potential network requests.

### What changes are included in this PR?

### Are these changes tested?

Yes, tests with S3 that the file size gets used.

### Are there any user-facing changes?

Yes, new function arguments.
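As a usage illustration of the new arguments (bucket, key, size and region are made up), a fragment built with a known size and read back through a FileSystemDataset:

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

s3 = fs.S3FileSystem(region="us-east-1")
fmt = ds.ParquetFileFormat()

fragment = fmt.make_fragment(
    "my-bucket/data/part-0.parquet",
    filesystem=s3,
    file_size=4237841,  # known size in bytes; avoids a HEAD request on S3
)

dataset = ds.FileSystemDataset(
    [fragment], format=fmt, schema=fragment.physical_schema, filesystem=s3
)
table = dataset.to_table()
```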