test_parquet.py::test_download_throughput[pandas] peak memory randomly bounces
#339
This could be explained by a peak that is shorter than the heartbeat.
I suggest disabling memory measurements for this test; I doubt this is useful information.
I ran the test. Is it intentional and/or known that only one worker is doing most of the work (measured by CPU/memory utilization over the largest period of the test run)? (I suppose there's also a slight chance that I'm doing something wrong?)
Yes, this test is extremely simple. It dispatches a single task that loads parquet data, which is why I don't think our memory measurement makes a lot of sense. Assuming that pyarrow (or whatever parquet reader we have) is not misbehaving, there is no reason to expect any bump in memory when reading a single file. If we do care about the memory measurement, we should probably add a small sleep in the task to ensure that our system monitor measures the spike and a heartbeat goes through, but I'm not sure we actually care about this. @ian-r-rose I think you've been involved in the parquet development. Any strong opinions?
FWIW, our system monitor samples at 500ms intervals; I suspect the parquet read is just faster than that.
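The sampling argument above can be demonstrated in isolation. The sketch below is hypothetical (the names `spiky_task`, `run`, and the simulated `usage` variable are stand-ins, not anything from the actual test suite): a monitor thread polling every 500ms, like the heartbeat-based system monitor, simply never observes a spike that fits between two samples, while padding the task with a sleep longer than the interval makes the peak show up reliably.

```python
import threading
import time

# Simulated process memory usage in MiB; a stand-in for RSS so the
# example stays self-contained (no psutil or real allocations needed).
usage = 0


def spiky_task(hold_seconds):
    """Hypothetical task that spikes to ~6.6 GiB, holds it for
    hold_seconds, then settles at a ~3.8 GiB baseline."""
    global usage
    usage = 6_600
    time.sleep(hold_seconds)
    usage = 3_800


def run(hold_seconds, interval=0.5):
    """Run spiky_task while a monitor thread polls `usage` every
    `interval` seconds (mimicking a 500ms heartbeat) and return the
    peak value the monitor actually observed."""
    global usage
    usage = 0
    stop = threading.Event()
    observed = {}

    def monitor():
        peak = 0
        while not stop.is_set():
            peak = max(peak, usage)
            stop.wait(interval)  # returns early once stop is set
        observed["peak"] = peak

    t = threading.Thread(target=monitor)
    t.start()
    time.sleep(interval / 2)  # offset the spike from the sample times
    spiky_task(hold_seconds)
    stop.set()
    t.join()
    return observed["peak"]


# A spike shorter than the sampling interval is invisible, while a
# spike padded past the interval is reliably captured:
short_peak = run(hold_seconds=0.05)  # fits entirely between two samples
long_peak = run(hold_seconds=1.0)    # spans at least one sample
assert short_peak < 6_600
assert long_peak == 6_600
```

This also suggests why the reported peaks are bimodal rather than noisy: the monitor either catches the spike or it doesn't, with nothing in between.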
This is possible, though it's interesting that the peak is also basically identical to the …
I'm not so sure this is a good assumption, and indeed, this is part of the reason for having this test! Very small changes in the parquet IO code can have significant effects on the download throughput, and it may be possible to bring these lines down with a bit of work.
@ian-r-rose I understand your concern about the different memory characteristics and that we should be sensitive to them. I am strongly in favor of having such a test, but I am not convinced that this setup is suited to measuring this. The call is too fast and our monitoring doesn't pick it up. There is also no reason to involve a Coiled cluster in this measurement at all. For these situations it is likely better to have a separate benchmark setup that just runs the code on a single machine, using different mechanisms to measure memory.
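One such single-machine mechanism (my suggestion, not something the thread settles on) is the stdlib's tracemalloc, which records every Python-level allocation and therefore cannot miss a short-lived spike the way a 500ms poller can. The helper name `measure_peak` is hypothetical. Caveat: tracemalloc only sees allocations made through Python's allocator, so buffers allocated directly by C extensions such as Arrow may require a different tool (e.g. resource.getrusage or an allocation profiler).

```python
import tracemalloc


def measure_peak(func, *args, **kwargs):
    """Run func and return (result, peak bytes allocated on the
    Python heap while it ran). Unlike interval-based sampling, this
    traces allocations as they happen, so transient peaks are kept."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Example: a function whose 8 MiB buffer is freed before it returns,
# i.e. a transient peak a slow sampler could easily miss.
def transient():
    buf = bytearray(8 * 1024 * 1024)
    return len(buf)


size, peak = measure_peak(transient)
assert size == 8 * 1024 * 1024
assert peak >= 8 * 1024 * 1024
```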
In test_parquet.py::test_download_throughput, both runtime and average memory usage are extremely stable. (Screenshots from 0.1.0.)
Peak memory usage, however, randomly bounces between ~3.8 GiB and ~6.6 GiB.
Much like in #315, the two sets of values are extremely stable internally:
CC @fjetter @hendrikmakait