[Parquet][Python] Potential regression in Parquet parallel reading #38591
Comments
As a side note, I was a bit disappointed with the total download times I achieved with maximum parallelism using 02de3c1 on my system (MacBook to S3). The requests start with nice concurrency, but a few of the GET requests end up taking quite long to finish, so fragment_readahead > 4 doesn't speed up the total completion time at all. Not sure if this is related to my network, or if it's something to optimize in Arrow.
I noticed that there is an extra HEAD request coming from somewhere so I think what I wrote below is unlikely to be related
Here's a reproducible example that doesn't use FileSystemDataset. It makes not just two HEAD requests, but four in total (the first two while getting the schema?).
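A minimal sketch of such a direct read that bypasses FileSystemDataset might look like the following; the bucket, key, and region are placeholders, not values from the original example:

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq

# Placeholder bucket/key/region; substitute your own S3 location.
s3 = fs.S3FileSystem(region="us-east-1")
table = pq.read_table("my-bucket/path/to/file.parquet", filesystem=s3)
print(table.num_rows)
```

Running something like this with S3 request logging enabled makes it straightforward to count the HEAD and GET requests issued per file.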
Since this was merged recently, is that same API being used? I was a bit confused because this part of the code hasn't changed recently...
I'm doing a bisect at the moment; I'll update here when I'm done.
The issue-related patch changes the Prefetch default mode from
This is probably where the extra requests come from; I wonder if these (arrow/cpp/src/arrow/dataset/file_parquet.cc, line 507 in 0793432)
Very nice catch! It seems that this commit ignored your change and re-created the file; this might have been done before your change was checked in 😅 And after your change and rebase, this didn't get modified.
Yeah, that was my reading as well; I can try fixing it later.
Yeah. Maybe we can add a test that counts I/O ops to prevent regressions later... This is so awkward :-(
Do you have an idea how to do it? I currently just tested by setting S3fs logging to trace level and filtering on the command line. But it's not really elegant, and the problem is not related to S3 per se (though I'm not sure if it will manifest with all file system implementations?)
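For reference, that trace-level logging can be turned on from Python before the first S3 filesystem is created; the region below is a placeholder:

```python
import pyarrow.fs as fs

# Enable AWS SDK trace logging for pyarrow's S3 filesystem, then filter the
# log output for the request types of interest (HEAD/GET).
fs.initialize_s3(log_level=fs.S3LogLevel.Trace)
s3 = fs.S3FileSystem(region="us-east-1")
```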
I think we can provide a MockInputStream (with ReadAsync and Read counting) and hardcode an I/O count here. Any change that alters the I/O count would then show up here. Also cc @pitrou for any more ideas... But you can do a quick fix for this issue.
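The proposal above is for a C++-side MockInputStream; as a rough illustration of the same idea (count reads and assert a hardcoded budget), here is a Python sketch using a hypothetical read-counting wrapper, with a made-up budget value:

```python
import io
import pyarrow.parquet as pq

class CountingReader(io.FileIO):
    """Hypothetical wrapper that counts read() calls made by the Parquet reader."""

    def __init__(self, path):
        super().__init__(path, "rb")
        self.read_count = 0

    def read(self, size=-1):
        self.read_count += 1
        return super().read(size)

reader = CountingReader("example.parquet")  # placeholder file
table = pq.read_table(reader)

# Hardcode an expected I/O budget so an accidental extra read fails the test.
MAX_EXPECTED_READS = 10  # hypothetical value; calibrate against the current reader
assert reader.read_count <= MAX_EXPECTED_READS
```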
…ormat::GetReaderAsync` (#38621)

### Rationale for this change
There were duplicate method calls causing extra I/O operations, apparently unintentional from 0793432.

### What changes are included in this PR?
Remove the extra method calls.

### Are these changes tested?

### Are there any user-facing changes?
* Closes: #38591

Authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: mwish <[email protected]>
Thanks @eeroel, I've merged this patch now; maybe I'll try to add some counting test tonight.
Thank you! |
Describe the enhancement requested
UPDATE: this is looking more like a bug on closer look. What happens:
When calling `to_table()` on a FileSystemDataset in Python using `pyarrow.fs.S3FileSystem` on `main`, there are two HEAD requests and three GET requests for each file. Also, the first HEAD request is made from the main thread, so the downloads are started sequentially. I would expect to see only one HEAD request; not sure if the three GETs are expected due to some change.

Here's an example using 02de3c1, reading a FileSystemDataset with `fragment_readahead = 100` and I/O concurrency set to 100. The Y-axis represents files, the X-axis is time in seconds, and each point is the relative start time of a request (HEAD or GET).

With the current `main` (fc8c6b7) it seems that the first request for each file is made from the same thread (blue), and notably there are five requests per file.

See comment below for a reproducible example.
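For context, a sketch of the kind of read described above; the bucket, prefix, and region are placeholders, and I/O concurrency is raised via `pyarrow.set_io_thread_count`:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Placeholders: the bucket/prefix and region are not from the original report.
pa.set_io_thread_count(100)
s3 = fs.S3FileSystem(region="us-east-1")
dataset = ds.dataset("my-bucket/prefix/", format="parquet", filesystem=s3)

# Read all fragments with a large readahead so many files are fetched in parallel.
table = dataset.to_table(fragment_readahead=100)
```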
I'm running on macOS 14.1.
Component(s)
Parquet