
test_parquet.py::test_download_throughput[pandas] peak memory randomly bounces #339

crusaderky opened this issue Sep 16, 2022 · 8 comments


@crusaderky (Contributor)

In test_parquet.py::test_download_throughput, both runtime and average memory usage are extremely stable.
Screenshots from 0.1.0:

[two screenshots from 0.1.0: runtime and average memory usage plots, both flat across runs]

Peak memory usage, however, randomly bounces between ~3.8 GiB and ~6.6 GiB.
Much like in #315, the two sets of values are extremely stable internally:

[screenshot: peak memory usage plot, bouncing between two internally stable bands]

CC @fjetter @hendrikmakait

@crusaderky crusaderky changed the title test_parquet.py::test_download_throughput peak memory randomly bounces test_parquet.py::test_download_throughput[pandas] peak memory randomly bounces Sep 16, 2022
@crusaderky (Contributor, Author) commented Sep 16, 2022

This could be explained by a peak that is shorter than the heartbeat interval, so the sampler never catches it.

@fjetter (Member) commented Sep 16, 2022

I suggest disabling memory measurements for this test. I doubt this is useful information.

@ntabris (Member) commented Sep 16, 2022

I ran test_parquet.py::test_download_throughput[pandas] a few times while collecting high-res hardware metrics.

[screenshot: high-resolution hardware metrics showing one worker doing most of the work]

Is it intentional and/or known that only one worker is doing most of the work (as measured by CPU/memory utilization over most of the test run)?

(I suppose there's also a slight chance that I'm doing something wrong?)

@fjetter (Member) commented Sep 16, 2022

> Is it intentional and/or known that only one worker is doing most of the work (as measured by CPU/memory utilization over most of the test run)?

Yes, this test is extremely simple: it dispatches a single task that loads parquet data. This is why I don't think our memory measurement makes a lot of sense. Assuming that pyarrow (or whatever parquet reader we use) is not misbehaving, there is no reason to expect any bump in memory when reading a single file.

If we care about the memory measurement, we should probably add a small sleep in the task to ensure that our system monitor catches the spike and a heartbeat goes through, but I'm not sure we actually care about this.
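
A minimal sketch of that idea, assuming the test dispatches a single task roughly like the one below (the function name and path are hypothetical):

```python
import time

import pandas as pd


def download(path: str) -> int:
    # Load the parquet file, then hold the data long enough for the
    # worker's system monitor (sampling every ~500 ms) to observe the peak.
    df = pd.read_parquet(path)
    time.sleep(1)
    return len(df)


# dispatched as a single task on the cluster, e.g.:
# client.submit(download, "s3://…/data.parquet").result()
```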

@ian-r-rose I think you've been involved in the parquet development. Any strong opinions?

@fjetter (Member) commented Sep 16, 2022

FWIW, our system monitor samples at 500 ms intervals; I suspect the parquet read is just faster than that.
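
A toy repro of the sampling problem (not the actual SystemMonitor code; a psutil polling thread stands in for it):

```python
import threading

import psutil

samples = []
stop = threading.Event()


def monitor(interval: float = 0.5) -> None:
    # Stand-in for distributed's SystemMonitor: poll RSS on a fixed
    # interval. A spike that rises and falls between two polls is never
    # recorded, so the reported peak depends on where the samples land.
    proc = psutil.Process()
    while not stop.is_set():
        samples.append(proc.memory_info().rss)
        stop.wait(interval)


t = threading.Thread(target=monitor)
t.start()
spike = bytearray(2 * 1024**3)  # ~2 GiB allocation...
del spike  # ...released well within one sampling interval
stop.set()
t.join()
print(f"observed peak: {max(samples) / 2**30:.2f} GiB")  # usually misses the spike
```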

@ian-r-rose (Contributor)

> I suggest disabling memory measurements for this test. I doubt this is useful information.

This is possible, though it's interesting that the peak is also basically identical to that of the dask flavor of the test. Today, pandas' read_parquet and dask's read_parquet go through slightly different code paths in the pyarrow code base, and dask's path is a bit slower, with worse memory characteristics. So I think there actually is structure here, and if we were to update dask's usage of pyarrow we might see it shift toward the pandas-like behavior.
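
For reference, the two flavors of the test differ roughly like this (the path is a placeholder, not the benchmark's real dataset):

```python
import dask.dataframe as dd
import pandas as pd

path = "s3://…/data.parquet"  # placeholder

# "pandas" flavor: one eager read through pandas/pyarrow
df_pandas = pd.read_parquet(path)

# "dask" flavor: the same file through dask's reader, which currently
# takes a slightly different (slower, more memory-hungry) pyarrow path
df_dask = dd.read_parquet(path).compute()
```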

@ian-r-rose (Contributor)

> Assuming that pyarrow (or whatever parquet reader we use) is not misbehaving, there is no reason to expect any bump in memory when reading a single file.

I'm not so sure this is a good assumption; indeed, this is part of the reason for having this test! Very small changes in the parquet IO code can have significant effects on download throughput, and it may be possible to bring these lines down with a bit of work.

@fjetter (Member) commented Oct 3, 2022

@ian-r-rose I understand your concern about the different memory characteristics and how we should be sensitive to this. I am strongly in favor of having such a test, but I am not convinced that this setup is suited to measuring it. The call is too fast and our monitoring doesn't pick it up. There is also no reason to involve a Coiled cluster in this measurement.

For these situations it is likely better to have a separate benchmark setup that just runs the workload on a single machine, using different mechanisms to measure memory.
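
As a sketch of what such a setup could use instead of the heartbeat-based monitor: on a single machine, ru_maxrss gives the process's true high-water mark no matter how brief the spike is (the file name is hypothetical):

```python
import resource
import sys

import pandas as pd


def peak_rss_mib() -> float:
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 1024 if sys.platform.startswith("linux") else rss / 1024**2


baseline = peak_rss_mib()
df = pd.read_parquet("data.parquet")  # hypothetical local copy of the dataset
print(f"peak RSS grew from {baseline:.0f} to {peak_rss_mib():.0f} MiB")
```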
