
Performance Shift(s): bbe75acd #4845

Closed
trexfeathers opened this issue Jul 1, 2022 · 12 comments · Fixed by #4881

Labels: Bot (A bot generated issue/pull-request), Type: Performance
Milestone: v3.3.0
Assignee: trexfeathers

Comments

@trexfeathers commented Jul 1, 2022

Raising on behalf of the GHA workflow, which was having troubles at the time this originally came up!


Benchmark comparison has identified performance shifts at commit bbe75acd.

Please review the report below and take corrective/congratulatory action as appropriate 🙂

Performance shift report
       before           after         ratio
     [0efd78d9]       [bbe75acd]

+       20.203125      54.70703125     2.71  experimental.ugrid.regions_combine.CombineRegionsComputeRealData.track_addedmem_compute_data(500)
trexfeathers added the Type: Performance and Bot (A bot generated issue/pull-request) labels on Jul 1, 2022
trexfeathers added this to the v3.3.0 milestone on Jul 1, 2022
trexfeathers self-assigned this on Jul 1, 2022
@trexfeathers

Note to self: this memory shift is within one Dask chunk size; run/write a scalability benchmark to work out whether it's a problem for larger Cubes.

@trexfeathers

Bad news! The regression still exists at scale. Will need to investigate the Dask and NumPy change logs.

nox --session="benchmarks(custom)" -- run 0efd78d9^..bbe75acd --attribute rounds=1 --show-stderr --bench=sperf.combine_regions.ComputeRealData.track_addedmem_compute_data


nox --session="benchmarks(custom)" -- compare 0efd78d9 bbe75acd

              5.0              5.0     1.00  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(100)
+     32.17578125     175.20703125     5.45  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1000)
+        225.4375      599.8046875     2.66  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1668)
+             5.0        6.5546875     1.31  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(200)
+      5.86328125      18.01953125     3.07  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(300)
+       20.359375      54.48046875     2.68  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(500)


nox --session="benchmarks(custom)" -- publish '0efd78d9^..bbe75acd'

[image: benchmark comparison plots from the publish step]

@trexfeathers commented Jul 22, 2022

@pp-mo commented Jul 22, 2022

I guess the key observation here is
225.4375 599.8046875 2.66 sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1668)

However, the dask chunksize is 200Mb, and in my previous experience I think it typically uses about 3 * chunks even when chunked operation is working correctly.
The expected full data size of a re-combined cube in this case "ought" to be (6 * 1668 * 1668 * 4 bytes) (for float not double dtype),
so that is only ~64Mb per data array.
So I think this isn't quite a smoking gun, yet.
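
For reference, a quick check of that size estimate (illustrative Python only; the 4-byte figure assumes float32 data, as stated above):

n_panels = 6                              # a cubesphere has 6 panels
n_edge = 1668                             # the C1668 case
size_bytes = n_panels * n_edge ** 2 * 4   # 4 bytes per float32 value
print(size_bytes / 2 ** 20)               # ~63.7 MiB, i.e. the "~64Mb" quoted above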

It could well be useful (again, in my experience!) to test with a lower chunksize; it is usually then pretty clear whether it is generally blowing memory or not.
This way is handy:
with dask.config.set({'array.chunk-size': '10MiB'}): ... or whatever; note that sizes can be given as strings.
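
A minimal sketch of that approach, assuming a hypothetical build_lazy_cube() standing in for whatever constructs the lazy result; the setting must be in force while the lazy arrays are created, since chunk sizes are chosen at that point:

import dask

# The relevant config key is "array.chunk-size"; sizes can be given as strings.
with dask.config.set({"array.chunk-size": "10MiB"}):
    cube = build_lazy_cube()   # hypothetical helper creating the lazy cube
    cube.data                  # realise the data under the smaller chunks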

@trexfeathers

It could well be useful (again, in my experience!) to test with a lower chunksize

@pp-mo what difference would we be anticipating?

@trexfeathers

The Dask changes seem very unlikely to be the cause, as they are the kind of things you would expect in a bugfix release. Unless anyone else thinks differently?

@pp-mo commented Jul 22, 2022

It could well be useful (again, in my experience!) to test with a lower chunksize

@pp-mo what difference would we be anticipating?

Well, hopefully, that the total memory cost would reduce, as a multiple of the chunksize.
I.e. ultimately we expect it to be the minimum of (N * chunks) and (M * data-array-size),
so that for sufficient datasize (and small enough chunksize), the memory reaches a plateau.

Within the "N" factor is also how many workers it can run in parallel (we expect threads in this case).
So you can either stop it doing that (go to synchronous, or restrict n-workers), or rely on knowledge of the platform.
Another practical learning point : if you allow parallel operation at all, I wouldn't use less than 4 workers. It doesn't seem to cope well with that.
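
For reference, a minimal sketch of both options, using ordinary Dask config settings and the same hypothetical lazy cube as in the sketch above:

import dask

# Option 1: run synchronously, removing the worker-count factor entirely.
with dask.config.set(scheduler="synchronous"):
    cube.data

# Option 2: keep the threaded scheduler but cap the worker count
# (per the advice above, not fewer than 4 if running in parallel at all).
with dask.config.set(scheduler="threads", num_workers=4):
    cube.data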

@trexfeathers

Changing the chunk size to 10MiB considerably reduced the memory used for commit 0efd78d, but made no difference to bbe75ac. The bbe75ac results are also suspiciously close to the estimated full sizes of those meshes, suggesting the entire thing is being realised. Now to work out why...

@pp-mo commented Jul 28, 2022

Updates + observations..

  1. Unfortunately, I was basically wrong about chunk sizes, since the test uses a cube of dims "(1, n_mesh_faces)", and the mesh dim cannot be chunked in the combine_regions calculations, which is like a regrid in that respect.

  2. testing with a param (cubesphere-edge-number) of 1000, we get 1M-points per face
    --> 6M points per cube --> 48Mbytes (since data is float64)
    which is all in one dask chunk (see previous)

  3. I created a really simple test to run outside of ASV, like this:

# Import paths assumed from the Iris benchmark-suite layout.
from benchmarks import TrackAddedMemoryAllocation
from benchmarks.sperf.combine_regions import ComputeRealData

tst = ComputeRealData()
param = 1000  # cubesphere edge length, i.e. the C1000 case
tst.params = [param]
# tst.setup_cache()   # only needed first time
tst.setup(param)
with TrackAddedMemoryAllocation() as mb:
    tst.recombined_cube.data  # realise the lazy combined data
print('Cube data size = ', tst.recombined_cube.core_data().nbytes * 1.0e-6)
print('Operation measured Mb = ', mb.addedmem_mb())

When initially run, this claimed that no memory at all was used (!)
I think that is because the setup runs in the same process as the to-be-measured calculation, whereas (also I think) ASV runs each test in its own fork, hence the RSS-based measure then works.
(Actually I'm increasingly concerned that this memory measurement method is not great, and a tracemalloc approach would be a better choice anyway - see the sketch after the list below.)

So I replaced the measurement with one based on tracemalloc, and got these results:
BEFORE the lockfile change

Cube data size =  48.0
Operation measured Mb =  97.321608543396

AFTER the lockfile change

Cube data size =  48.0
Operation measured Mb =  228.9274206161499
  4. took a copy of the lockfile env + upgraded numpy from 1.22.4 to 1.23.0
    Results are as the latest above, i.e. it is the numpy change that was significant (and not dask)

  5. note that the memory numbers above

  • BEFORE ~= data-size * 2
  • AFTER ~= data-size * 5
    The latter is interesting because it is NOT * 7 (there are 7 regions combined),
    so the problem is not "just" new space used up by each individual region-combine op.
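
For illustration, a minimal sketch of such a tracemalloc-based measure, usable as a drop-in for TrackAddedMemoryAllocation in the snippet above (this is an assumption about the shape of the change, not the implementation actually used):

import tracemalloc

class TracemallocAddedMemory:
    # Measures peak Python-traced allocation growth within the block, in MB.
    def __enter__(self):
        tracemalloc.start()
        self._start, _ = tracemalloc.get_traced_memory()
        return self

    def __exit__(self, *exc):
        _, peak = tracemalloc.get_traced_memory()
        self._added_mb = (peak - self._start) / 1e6
        tracemalloc.stop()
        return False

    def addedmem_mb(self):
        return self._added_mb

# Usage, mirroring the snippet above:
# with TracemallocAddedMemory() as mb:
#     tst.recombined_cube.data
# print('Operation measured Mb = ', mb.addedmem_mb())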

@trexfeathers

Conclusion - won't fix

This regression is a moderate hindrance at worst, and is therefore only worth a limited amount of investigation.

More detail: we continue to discover new possible avenues of investigation, without any promising end in sight. The worst expected memory demand is <700MiB - a C1668 cubesphere horizontal slice (which must be a single chunk) - chunking will prevent larger sizes. @pp-mo may wish to add to this.

I'm writing a cautionary What's New entry to notify users of the increased memory demand.

@pp-mo commented Jul 28, 2022

worst expected is <700MiB

A bit more detail:
From investigating effects of chunking and different numpy versions, we found that the total memory cost of the operation is something like "2 * main-data-size + (2 or 3?) * combined-region-data-sizes".

The region data can be chunked, but the main data chunks must have a full mesh dimension, or indexing errors will occur (which needs fixing in the code - see below).

We've now shown that smaller chunks can reduce the cost due to region-data handling (but not the main data).
The conclusion is that it might effectively be copying/storing all region data, perhaps x2-3, but this is only about the same as the main data.
In particular, it is not costing "n-regions * full-data-size".
From that it seems that the additional cost comes from management of the region data, and not the central map_blocks operation -- since in that case it would scale with n-regions.

Hence, we believe the total cost is limited to about 3 * full-data-array-size.
That is certainly larger than before, but not prohibitive.
As this operation already can't chunk over the mesh dimension, we can accept costs of this magnitude
(though we also want to be quite sure that we can still chunk the operation within outer dimensions - see below).
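
As a rough cross-check of those numbers, some back-of-envelope arithmetic (assuming float64 data, the x3 cost model above, and the ~x5 factor measured at C1000; illustrative only, not a measured result):

n_edge = 1668                                # the C1668 worst case
full_size_mb = 6 * n_edge ** 2 * 8 / 1e6     # 6 panels, 8 bytes per value: ~134 MB
print(3 * full_size_mb)                      # ~400 MB under the x3 cost model
print(5 * full_size_mb)                      # ~668 MB under the x5 factor seen at C1000

Both figures sit below the "<700MiB" worst case quoted above.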

@pp-mo commented Jul 28, 2022

TODO in follow-on fixes/changes: see #4882 and #4883
