
test_parquet.py::test_download_throughput[pandas] peak memory randomly bounces #339

crusaderky opened this issue Sep 16, 2022 · 8 comments


@crusaderky (Contributor)

In test_parquet.py::test_download_throughput, both runtime and average memory usage are extremely stable.
Screenshots from 0.1.0:

[two screenshots from 0.1.0: runtime and average memory usage plots, both flat across runs]

Peak memory usage, however, randomly bounces between ~3.8 GiB and ~6.6 GiB.
Much like in #315, the two sets of values are extremely stable internally:

[screenshot: peak memory usage plot, bouncing between two internally stable bands]

CC @fjetter @hendrikmakait

@crusaderky crusaderky changed the title test_parquet.py::test_download_throughput peak memory randomly bounces test_parquet.py::test_download_throughput[pandas] peak memory randomly bounces Sep 16, 2022
@crusaderky (Contributor, Author) commented Sep 16, 2022

This could be explained by a peak that is shorter than the heartbeat interval, so the sampler never catches it.

@fjetter (Member) commented Sep 16, 2022

I suggest disabling memory measurements for this test. I doubt this is useful information.

@ntabris (Member) commented Sep 16, 2022

I ran test_parquet.py::test_download_throughput[pandas] a few times while collecting high-res hardware metrics.

[screenshot: high-resolution hardware metrics showing one worker doing most of the work]

Is it intentional and/or known that only one worker is doing most of the work (as measured by CPU/memory utilization over most of the test run)?

(I suppose there's also a slight chance that I'm doing something wrong?)

@fjetter (Member) commented Sep 16, 2022

> Is it intentional and/or known that only one worker is doing most of the work (as measured by CPU/memory utilization over most of the test run)?

Yes, this test is extremely simple: it dispatches a single task that loads parquet data. This is why I don't think our memory measurement makes a lot of sense. Assuming that pyarrow (or whatever parquet reader we use) is not misbehaving, there is no reason to expect any bump in memory when reading a single file.

If we care about the memory measurement, we should probably add a small sleep in the task to ensure that our system monitor catches the spike and a heartbeat goes through, but I'm not sure we actually care about this.
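
A minimal sketch of that idea, assuming the test dispatches a single task roughly like the one below (the function name and path are hypothetical):

```python
import time

import pandas as pd


def download(path: str) -> int:
    # Load the parquet file, then hold the data long enough for the
    # worker's system monitor (sampling every ~500 ms) to observe the peak.
    df = pd.read_parquet(path)
    time.sleep(1)
    return len(df)


# dispatched as a single task on the cluster, e.g.:
# client.submit(download, "s3://…/data.parquet").result()
```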

@ian-r-rose I think you've been involved in the parquet development. Any strong opinions?

@fjetter (Member) commented Sep 16, 2022

FWIW, our system monitor samples at 500 ms intervals; I suspect the parquet read is just faster than that.
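
A toy repro of the sampling problem (not the actual SystemMonitor code; a psutil polling thread stands in for it):

```python
import threading

import psutil

samples = []
stop = threading.Event()


def monitor(interval: float = 0.5) -> None:
    # Stand-in for distributed's SystemMonitor: poll RSS on a fixed
    # interval. A spike that rises and falls between two polls is never
    # recorded, so the reported peak depends on where the samples land.
    proc = psutil.Process()
    while not stop.is_set():
        samples.append(proc.memory_info().rss)
        stop.wait(interval)


t = threading.Thread(target=monitor)
t.start()
spike = bytearray(2 * 1024**3)  # ~2 GiB allocation...
del spike  # ...released well within one sampling interval
stop.set()
t.join()
print(f"observed peak: {max(samples) / 2**30:.2f} GiB")  # usually misses the spike
```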

@ian-r-rose (Contributor)

> I suggest disabling memory measurements for this test. I doubt this is useful information.

This is possible, though it's interesting that the peak is also basically identical to that of the dask flavor of the test. Today, pandas' read_parquet and dask's read_parquet go through slightly different code paths in the pyarrow code base, and dask's path is a bit slower, with worse memory characteristics. So I think there actually is structure here, and if we were to update dask's usage of pyarrow we might see it shift toward the pandas-like behavior.
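
For reference, the two flavors of the test differ roughly like this (the path is a placeholder, not the benchmark's real dataset):

```python
import dask.dataframe as dd
import pandas as pd

path = "s3://…/data.parquet"  # placeholder

# "pandas" flavor: one eager read through pandas/pyarrow
df_pandas = pd.read_parquet(path)

# "dask" flavor: the same file through dask's reader, which currently
# takes a slightly different (slower, more memory-hungry) pyarrow path
df_dask = dd.read_parquet(path).compute()
```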

@ian-r-rose (Contributor)

> Assuming that pyarrow (or whatever parquet reader we use) is not misbehaving, there is no reason to expect any bump in memory when reading a single file.

I'm not so sure this is a good assumption; indeed, this is part of the reason for having this test! Very small changes in the parquet IO code can have significant effects on download throughput, and it may be possible to bring these lines down with a bit of work.

@fjetter (Member) commented Oct 3, 2022

@ian-r-rose I understand your concern about the different memory characteristics and how we should be sensitive to this. I am strongly in favor of having such a test, but I am not convinced that this setup is suited to measuring it. The call is too fast and our monitoring doesn't pick it up. There is also no reason to involve a Coiled cluster in this measurement.

For these situations it is likely better to have a separate benchmark setup that just runs the workload on a single machine, using different mechanisms to measure memory.
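
As a sketch of what such a setup could use instead of the heartbeat-based monitor: on a single machine, ru_maxrss gives the process's true high-water mark no matter how brief the spike is (the file name is hypothetical):

```python
import resource
import sys

import pandas as pd


def peak_rss_mib() -> float:
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 1024 if sys.platform.startswith("linux") else rss / 1024**2


baseline = peak_rss_mib()
df = pd.read_parquet("data.parquet")  # hypothetical local copy of the dataset
print(f"peak RSS grew from {baseline:.0f} to {peak_rss_mib():.0f} MiB")
```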
