
[Parquet][Python] Potential regression in Parquet parallel reading #38591

Closed

eeroel opened this issue Nov 5, 2023 · 15 comments · Fixed by #38621

@eeroel
Contributor

eeroel commented Nov 5, 2023

Describe the enhancement requested

UPDATE: on closer look, this is looking more like a bug. What happens:

When calling to_table() on a FileSystemDataset in Python using pyarrow.fs.S3FileSystem:

  • Using 02de3c1, one HEAD request and two GET requests are made for each file, and the requests are made concurrently.
  • With current main, there are two HEAD requests and three GET requests for each file. Moreover, the first HEAD request is made from the main thread, so the downloads are started sequentially. I would expect to see only one HEAD request; I'm not sure if the three GETs are expected due to some change.

Here's an example using 02de3c1, reading a FileSystemDataset with fragment_readahead = 100 and I/O concurrency set to 100. The Y-axis represents files, the X-axis is time in seconds, and each point is the relative start time of a request (HEAD or GET):

[Screenshot 2023-11-05 at 18 43 54: request start times per file on 02de3c1]

With the current main fc8c6b7 it seems that the first request for each file is made from the same thread (blue), and notably there are five requests per file.

[Screenshot 2023-11-05 at 18 43 44: request start times per file on current main]
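
For reference, a rough sketch of the kind of read described above (the bucket, prefix, and region are placeholders, and the exact scan setup may differ):

import pyarrow
import pyarrow.dataset as ds
import pyarrow.fs

pyarrow.set_io_thread_count(100)  # "I/O concurrency set to 100"
fs = pyarrow.fs.S3FileSystem(region="us-east-2")  # placeholder region
dataset = ds.dataset("some-bucket/some-prefix", format="parquet", filesystem=fs)
table = dataset.scanner(fragment_readahead=100).to_table()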

See the comment below for a reproducible example.

I'm running on macOS 14.1.

Component(s)

Parquet

@eeroel
Contributor Author

eeroel commented Nov 5, 2023

As a side note, I was a bit disappointed with the total download times I achieved with maximum parallelism using 02de3c1 on my system (MacBook to S3). The requests start with nice concurrency, but a few of the GET requests end up taking quite long to finish, so fragment_readahead > 4 doesn't speed up the total completion time at all. Not sure if this is related to my network, or if it's something to optimize in Arrow.

@eeroel
Contributor Author

eeroel commented Nov 5, 2023

I noticed that there is an extra HEAD request coming from somewhere, so I think what I wrote below is unlikely to be related.

Just a hunch (I haven't followed the whole code path), but could it be that, e.g., from here https://github.com/apache/arrow/blob/fc8c6b7dc8287c672b62c62f3a2bd724b3835063/cpp/src/arrow/dataset/file_parquet.cc#L606C72-L606C72 or from EnsureCompleteMetadata, the sync version of GetReader gets called instead of GetReaderAsync? Introduced in 76c4a6e.

EDIT: or maybe from here, as I think this function gets called when invoking the Python Dataset.to_table:

Result<EnumeratedRecordBatchGenerator> AsyncScanner::ScanBatchesUnorderedAsync(
    ...
    FragmentsToBatches(std::move(fragment_gen), scan_options));
@eeroel eeroel changed the title Potential regression in Parquet parallel reading [Parquet] Potential regression in Parquet parallel reading Nov 6, 2023
@eeroel eeroel changed the title [Parquet] Potential regression in Parquet parallel reading [Parquet][Python] Potential regression in Parquet parallel reading Nov 6, 2023
@eeroel
Contributor Author

eeroel commented Nov 7, 2023

Here's a reproducible example that uses parquet.read_table instead of a FileSystemDataset:

import pyarrow._s3fs
# Enable S3 trace logging so every request is written to the log.
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)

from pyarrow.dataset import (
    ParquetFileFormat,
    ParquetFragmentScanOptions
)

import pyarrow
import pyarrow.fs
import pyarrow.parquet

fs = pyarrow.fs.S3FileSystem(region="us-east-2")
# Scope the filesystem to the public ursa-labs-taxi-data bucket.
fs = pyarrow.fs.SubTreeFileSystem("ursa-labs-taxi-data", fs)

pyarrow.parquet.read_table("2019/01/data.parquet", filesystem=fs)

Not only does this make two HEAD requests, it makes four of them in total (the first two while getting the schema?):

cat /tmp/foo2.log | grep HeaderOut
[DEBUG] 2023-11-07 06:30:22.913 CURL [0x1e4d11ec0] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.059 CURL [0x1e4d11ec0] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.190 CURL [0x1e4d11ec0] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.613 CURL [0x1e4d11ec0] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.904 CURL [0x1e4d11ec0] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:24.028 CURL [0x17017f000] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:24.158 CURL [0x17020b000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:24.552 CURL [0x170297000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:26.542 CURL [0x17017f000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:29.142 CURL [0x17020b000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
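
To tally the request types from a trace log like this, here's a quick sketch (assuming the (HeaderOut) line format shown above):

import re
from collections import Counter

counts = Counter()
with open("/tmp/foo2.log") as f:
    for line in f:
        match = re.search(r"\(HeaderOut\) (HEAD|GET) ", line)
        if match:
            counts[match.group(1)] += 1
print(counts)  # for the log above: Counter({'GET': 6, 'HEAD': 4})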

@mapleFU
Member

mapleFU commented Nov 7, 2023

> Using 02de3c1, one HEAD request and two GET requests are made for each file, and the requests are made concurrently.

Since this was merged recently, is the same API being used? I was a bit confused because this part of the code hasn't changed much recently...

@eeroel
Contributor Author

eeroel commented Nov 7, 2023

> > Using 02de3c1, one HEAD request and two GET requests are made for each file, and the requests are made concurrently.
>
> Since this was merged recently, is the same API being used? I was a bit confused because this part of the code hasn't changed much recently...

I'm doing a bisect at the moment; I'll update here when I'm done.

@mapleFU
Member

mapleFU commented Nov 7, 2023

#36765

The patch related to that issue changes the prefetch default mode from Default to LazyDefault, but it probably doesn't increase the request count...
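
For context, this prefetching is the Parquet pre-buffering, which can be toggled from Python via the fragment scan options. A sketch (bucket, prefix, and region are hypothetical; per the comment above, the Default-to-LazyDefault change concerns the caching mode used by this pre-buffering):

import pyarrow.fs
import pyarrow.dataset as ds
from pyarrow.dataset import ParquetFileFormat, ParquetFragmentScanOptions

# pre_buffer prefetches and coalesces column-chunk byte ranges up front.
scan_opts = ParquetFragmentScanOptions(pre_buffer=True)
fmt = ParquetFileFormat(default_fragment_scan_options=scan_opts)
fs = pyarrow.fs.S3FileSystem(region="us-east-2")
dataset = ds.dataset("some-bucket/some-prefix", format=fmt, filesystem=fs)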

@eeroel
Contributor Author

eeroel commented Nov 7, 2023

> #36765
>
> The patch related to that issue changes the prefetch default mode from Default to LazyDefault, but it probably doesn't increase the request count...

This is the commit after which there is an extra HEAD request:
0793432

@eeroel
Contributor Author

eeroel commented Nov 7, 2023

This is probably where the extra requests come from. I wonder if these source.Open() and parquet::ParquetFileReader::OpenAsync calls are necessary:

ARROW_ASSIGN_OR_RAISE(auto input, source.Open());

@mapleFU
Member

mapleFU commented Nov 7, 2023

Very nice catch! It seems that this commit ignored your change and re-opens the file. This was probably written before your change was checked in 😅 and after your change and the rebase, it didn't get updated.

@eeroel
Contributor Author

eeroel commented Nov 7, 2023

> Very nice catch! It seems that this commit ignored your change and re-opens the file. This was probably written before your change was checked in 😅 and after your change and the rebase, it didn't get updated.

Yeah, that was my reading as well. I can try fixing it later.

@mapleFU
Member

mapleFU commented Nov 7, 2023

Yeah. Maybe we can add a test that counts I/O ops to prevent a regression later... This is so awkward :-(

@eeroel
Contributor Author

eeroel commented Nov 7, 2023

> Yeah. Maybe we can add a test that counts I/O ops to prevent a regression later... This is so awkward :-(

Do you have an idea of how to do it? I just tested by setting the S3 filesystem logging to trace level and filtering on the command line, but that's not really elegant, and the problem is not related to S3 per se (though I'm not sure whether it will manifest with all file system implementations?).

@mapleFU
Member

mapleFU commented Nov 7, 2023

I think we can provide a MockInputStream (with ReadAsync and Read counting) and hardcode an expected I/O count here. Any change that alters the I/O count would then get flagged.

Also cc @pitrou for any more ideas...

But you can do a quick fix for this issue.
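
For illustration, a rough Python sketch of the counting idea, wrapping an fsspec filesystem and counting open() calls (hypothetical, not the actual Arrow test; a C++ MockInputStream would count reads as well as opens):

import fsspec
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs
import pyarrow.parquet as pq

# Write a small sample file to read back.
pq.write_table(pa.table({"x": list(range(1000))}), "/tmp/counting_demo.parquet")

# Wrap a local fsspec filesystem and count open() calls on the instance.
fs = fsspec.filesystem("file")
open_calls = {"count": 0}
original_open = fs.open

def counting_open(path, *args, **kwargs):
    open_calls["count"] += 1
    return original_open(path, *args, **kwargs)

fs.open = counting_open

pa_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(fs))
dataset = ds.dataset("/tmp/counting_demo.parquet", format="parquet", filesystem=pa_fs)
dataset.to_table()
print("open() calls:", open_calls["count"])  # a regression shows up as a higher count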

mapleFU pushed a commit that referenced this issue Nov 8, 2023
…ormat::GetReaderAsync` (#38621)

### Rationale for this change
There were duplicate method calls causing extra I/O operations, apparently unintentional from 0793432.

### What changes are included in this PR?
Remove the extra method calls.

### Are these changes tested?

### Are there any user-facing changes?

* Closes: #38591

Authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: mwish <[email protected]>
@mapleFU mapleFU added this to the 15.0.0 milestone Nov 8, 2023
@mapleFU
Member

mapleFU commented Nov 8, 2023

Thanks @eeroel

I've merged this patch now; maybe I'll try to add a counting test tonight

@eeroel
Contributor Author

eeroel commented Nov 8, 2023

> Thanks @eeroel
>
> I've merged this patch now; maybe I'll try to add a counting test tonight

Thank you!

JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Nov 9, 2023
…tFileFormat::GetReaderAsync` (apache#38621)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…tFileFormat::GetReaderAsync` (apache#38621)

@raulcd raulcd modified the milestones: 15.0.0, 14.0.2 Nov 28, 2023
raulcd pushed a commit that referenced this issue Nov 28, 2023
…ormat::GetReaderAsync` (#38621)

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…tFileFormat::GetReaderAsync` (apache#38621)
