apply_neighborhood/regrid: rdd size/partition inflation #191

jdries · 2023-08-02T07:40:06Z

It seems that regridding from a large tile size (e.g. 1024 or 512) to a small size (e.g. 64) results in unexpectedly large RDD/partition sizes.

My theory is that the 'crop' method used in Regrid:
https://github.com/locationtech/geotrellis/blob/d65d6a22eb70efd96caa5c6f5f660b2b936b2763/spark/src/main/scala/geotrellis/spark/regrid/Regrid.scala#L122
is a lazy crop, which keeps a reference to the original array instead of copying the smaller chunk of data out of the larger one. So when Spark serializes the RDD, each cropped tile also carries the full larger array backing it, inflating the data a lot.
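To illustrate the suspected mechanism, here is a minimal sketch in plain Python rather than GeoTrellis/Scala. The `LazyCrop` class and its `force` method are hypothetical stand-ins, not the GeoTrellis API, and `pickle` stands in for Spark's serializer. It only shows how a crop that holds a reference to its source serializes the whole source, while a forced copy serializes just the window.

```python
import pickle
from array import array


class LazyCrop:
    """Hypothetical lazy crop: a window onto a larger, row-major tile."""

    def __init__(self, source, source_cols, col_min, row_min, width, height):
        self.source = source            # reference to the FULL backing array
        self.source_cols = source_cols  # number of columns in the source tile
        self.col_min = col_min
        self.row_min = row_min
        self.width = width
        self.height = height

    def get(self, col, row):
        # Cell lookups are redirected into the full source array.
        index = (self.row_min + row) * self.source_cols + self.col_min + col
        return self.source[index]

    def force(self):
        """Copy only the cropped window into a new, compact array."""
        out = array("f")
        for row in range(self.height):
            start = (self.row_min + row) * self.source_cols + self.col_min
            out.extend(self.source[start:start + self.width])
        return out


SOURCE_SIZE = 512  # large tile size, as in the report
CROP_SIZE = 64     # small target tile size

# One 512x512 float32 tile whose cell value equals its flat index.
source = array("f", range(SOURCE_SIZE * SOURCE_SIZE))

lazy = LazyCrop(source, SOURCE_SIZE, col_min=64, row_min=128,
                width=CROP_SIZE, height=CROP_SIZE)
forced = lazy.force()

# Serializing the lazy crop drags the whole source tile along.
lazy_bytes = len(pickle.dumps(lazy))
forced_bytes = len(pickle.dumps(forced))
print(f"lazy crop: {lazy_bytes} bytes, forced crop: {forced_bytes} bytes")
```

In this sketch the lazy crop serializes to roughly the size of the full 512x512 tile, while the forced copy is close to the size of the 64x64 window. If every small tile cut from one large tile carried its own copy of the source like this, the serialized data would grow by about the ratio of the two tile areas.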


jdries · 2023-08-03T14:01:01Z

Committed a fix for this one; I would still like to open a PR in geotrellis.

jdries · 2023-08-07T15:10:34Z

Closing this one, a PR has been opened!

jdries self-assigned this Aug 3, 2023

jdries added a commit that referenced this issue Aug 3, 2023

avoid memory inflation in apply_neighborhood

120f9b1


jdries mentioned this issue Aug 7, 2023

Regrid: force crop to avoid going out of memory locationtech/geotrellis#3517

Closed

2 tasks

jdries closed this as completed Aug 7, 2023
