Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle weight files more elegantly #11

Open
JiaweiZhuang opened this issue Jan 1, 2018 · 21 comments
Open

Handle weight files more elegantly #11

JiaweiZhuang opened this issue Jan 1, 2018 · 21 comments

Comments

@JiaweiZhuang
Copy link
Owner

(continuing #2 and #9)

In the just-released v0.1.1, the default behavior is to write weights to disk, and there is a regridder.clean_weight_file() method to optionally remove it (demo).

This exactly matches how ESMPy works (directly dump weights to disk, not returning a numpy array) but feels a bit odd. A more intuitive workflow is only getting in-memory weights by default, and have a regridder.save() method to optionally save it to disk (like the model.save() method in Keras).

With current ESMPy (up to 7.1.0dev41) I need to do very weird things to implement this new design:

  1. Use ESMPy to write weights to disk
  2. Read it back to memory
  3. Then automatically delete the weight file by default (what a waste if you actually need it)
  4. Finally, have a method to optionally re-write the weight file to disk. (duplicates step1)

This incurs a lot of unnecessary disk I/O. If ESMPy has the plan to return in-memory weights I will definitely use the new design. Otherwise I'd rather keep the current approach.

@bekozi Any plans on that?

@JiaweiZhuang
Copy link
Owner Author

JiaweiZhuang commented Jan 1, 2018

Also the weight file needs to have more complete metadata, not just numbers (can be done outside of ESMPy)

@jhamman
Copy link

jhamman commented Mar 16, 2018

@JiaweiZhuang / @bekozi - has there been any update on this issue? As I understand it, we need a way to have ESMPy return in memory weights. Has that issue been raised upstream?

@JiaweiZhuang
Copy link
Owner Author

I raised this issue a while ago; not sure if there's any progress

also @rokuingh

@bekozi
Copy link

bekozi commented Jun 21, 2018

Talked with @rokuingh today, and we have a plan for moving this forward. I hope it won't be longer than a few weeks.

@JiaweiZhuang
Copy link
Owner Author

That's great news @bekozi ! Is it planned for some patch version of 7.1.0 or 8.0.0?

@bekozi
Copy link

bekozi commented Jun 22, 2018

Good question. It will be in 8 but rolled out in a beta snapshot much sooner...

@bekozi
Copy link

bekozi commented Aug 27, 2018

We have factors in the Python interface now and are working on testing. There are still some compiler-related things to address.

Before charging forward, I want to get some input on the interface. The factors themselves are allocated in Fortran, so we need to make sure their deallocation is handled smartly. With bigger grids, we want to minimize copies too. Nothing is fixed now, so open to any suggestions.

regrid = Regrid(..., factors=True)  # factors is False by default
factor_list, factor_index_list = regrid.get_factors(deep_copy=True)  # deep_copy is False by default
# Preferred deallocation method for factors but __del__ will also do this.
regrid.destroy()

I am thinking it makes sense to add an option for zero-based factor indexing. Maybe a start_index option to Regrid that defaults to 1? It needs to happen in the Regrid call so we can adjust the indexing in ESMF and not in the Python layer - again to avoid copies.

@JiaweiZhuang
Copy link
Owner Author

JiaweiZhuang commented Aug 27, 2018

Glad to see progress on this important issue! I have several comments & questions:

  1. What's in factor_list and factor_index_list? I guess factor_list is an 1D numpy array of weights (S in the offline weights file). Is factor_index_list a 2D array containing col and row, or a list of two 1D arrays? I guess it is a list so you need deep_copy, not just copy?

I would prefer just returning 3 numpy arrays and use copy :

row, col, S = regrid.get_factors(copy=True)

Those arrays will then construct a Scipy sparse matrix by scipy.sparse.coo_matrix(S, (row, col), copy=False) without copying.

  1. Zero-based indexing: Currently in xESMF I simply subtract 1 from default grid indices. Will the start_index implementation be more computationally efficient? If not, this correction can be done in Python level, in-place, so not extra copies will be made.

My general feeling is that direct ESMF/ESMPy calls should only deal with the very core functionalities that must be done in Fortran-level. Then we rarely, rarely need to update ESMPy's API. Once a new ESMPy function like get_factors() is implemented, it can remain stable for many versions. Additional small fixes can be quickly done in the Python level, which is much easier to debug and update.

  1. Naming: would regrid.get_weights() be more intuitive than regrid.get_factors()?

  2. For the sake of consistency, the Field.get_area() method should probably need a copy option as well. It can indeed by done by Field.get_area().copy(), but an copy option is pretty common in numpy/scipy functions.

@JiaweiZhuang
Copy link
Owner Author

JiaweiZhuang commented Aug 27, 2018

With bigger grids, we want to minimize copies too.

A zero-copying workflow would be:

row, col, S = regrid.get_weights(copy=False)
row -= 1  # modify the Fortran array in-place. Any potential issue?
col -= 1
weights = scipy.sparse.coo_matrix(S, (row, col), copy=False)
regridder = xesmf.Regridder()
regridder.weights = weights  # will be done in class __init__ instead
# do not call regrid.destroy()

In this case, the xesmf.Regridder directly uses the internal Fortran arrays in the ESMF.Regrid object. regrid.destroy() releases the Fortran array, which will destroy the xESMF regridder as well. Is there any potential danger with this approach, even if explicit destroy() is prohibited in xESMF user API? Will other ESMF error mess up the Fortran array? If so, the default behavior should be copy=True followed by destroy(), so the numpy array for weights is decoupled from ESMF; while copy=False without destroy() is only an option to minimize copying. The eventual memory use of the two cases are actually the same; the former uses more peak memory during copying.

@bekozi
Copy link

bekozi commented Aug 27, 2018

Thanks for the input! All good stuff.

Is factor_index_list a 2D array containing col and row, or a list of two 1D arrays? I guess it is a list so you need deep_copy, not just copy?

Out of habit, I prefer to be explicit between shallow and deep copies in Python even when working with NumPy arrays. However, we don't have a policy for naming these in ESMPy. I agree copy is cleaner...need to do some thinking on this...

The function will return a np.float64 ndarray with shape (nfactors,) for the factor list and an np.int32 ndarray with shape (2, nfactors) for the factor index list. The first column in the factor index list is the source index with the second column being the destination index. We'll make sure the docs are very clear about this. My head spins sometimes when trying to map Fortran onto Python and vice versa.

With ESMPy, we have tried to stay close to how ESMF implements things under the hood. Hence, the factor and factor index list will take the same form as the regrid store call in ESMF. Mostly this is an attempt to reduce duplication with the docs.

I would prefer just returning 3 numpy arrays and use copy. Those arrays will then construct a Scipy sparse matrix by scipy.sparse.coo_matrix(S, (row, col), copy=False) without copying.

We could add a convenience function or options that would do this. We may want to expose the ESMF sparse matrix utility through Python at some point. In that case, we would want the factor array forms to be compatible with ESMF.

I thought at one point to consider a get_factors_dict which would return a dictionary with keys (row, col, S) mapping to their associated values (numpy arrays). What do you think? This function would also take a copy parameter. We could definitely name this one get_weights_dict as I expect it could be the preferred entry point for xESMF.

Will the start_index implementation be more computationally efficient?

I don't really know the answer to this. My sense is that it is not worth implementing as in-place NumPy stuff is pretty fast, but it was a question worth posing at least.

My general feeling is that direct ESMF/ESMPy calls should only deal with the very core functionalities that must be done in Fortran-level. Then we rarely, rarely need to update ESMPy's API.

This. 😄 Summed up the ESMPy's design mentality. It doesn't feel Pythonic most of the time!

Naming: would regrid.get_weights() be more intuitive than regrid.get_factors()

It would. It's mostly an attempt to establish consistency with the named factor arguments to ESMF. I don't feel strongly about it though. In my mind, we should consider a get_weights method that just passes through to get_factors...

For the sake of consistency, the Field.get_area() method should probably need a copy option as well.

Good point. The copy flags to these functions are meant to be additional indicators to users that they need to be conscious of ESMF memory.

Is there any potential danger with this approach, even if explicit destroy() is prohibited in xESMF user API? Will other ESMF error mess up the Fortran array?

The safest bet is to deep copy everything as you know. Code that does not explicitly call destroy but creates numerous ESMF.Regrid(..., factors=True) objects in the same scope can get some serious memory accumulations. One option could be to introduce a context manager for ESMF.Regrid to ensure destroys are appropriately called. I hope to answer this better with additional testing.

I'm not quite sure what you mean by other ESMF errors. I think once these arrays are created they should be stable independent of other ESMF usage (in ESMF these arrays are not attached to anything). The same goes for the deallocation. I'm probably not understanding your second question.

@JiaweiZhuang
Copy link
Owner Author

JiaweiZhuang commented Aug 28, 2018

The function will return a np.float64 ndarray with shape (nfactors,) for the factor list and an np.int32 ndarray with shape (2, nfactors) for the factor index list.

We could add a convenience function or options that would do this.

(2, nfactors) looks good to me. No need to change it or add additional wrappers. The ESMPy API can be as lean as possible.

A minor concern is that factor_index_list probably has Fortran-ordering as other ESMPy output, so row = factor_index_list[0,:] and col = factor_index_list[1,:] are not contiguous in memory. This slows down the sparse MM by ~40%. See this demo. It can be fixed by copying and rearranging the memory layout, anyway.

It would. It's mostly an attempt to establish consistency with the named factor arguments to ESMF. I don't feel strongly about it though. In my mind, we should consider a get_weights method that just passes through to get_factors...

It makes perfect sense for ESMPy to stay close to ESMF. I am OK with the current name. Multiple aliases to the same method will probably confuse pure-ESMPy users.

I thought at one point to consider a get_factors_dict which would return a dictionary with keys (row, col, S)

Looks good me if you don't mind adding & maintaining one more function😀

The safest bet is to deep copy everything as you know. Code that does not explicitly call destroy but creates numerous ESMF.Regrid(..., factors=True) objects in the same scope can get some serious memory accumulations.

This is exactly what I am concerning about. I am not super familiar with ESMPy's memory management, but from what you said I guess repeated calls to regrid = ESMF.Regrid(...) without regrid.destroy() will lead to lots of memory leak. Even though the old ESMF.Regrid object is not referenced by any variable identifiers, the memory will not be freed (in contrast to the automatic garbage collection for normal numpy arrays).

If true, with the zero-copying implementation mentioned above, there will be memory leak if a user repeatedly calls regridder=xesmf.Regridder(...). With copying and explicit destroy(), the regridder will only hold normally-created numpy arrays, and when theregridder variable identifier is overwritten (i.e. points to a new regridder object), the old memory should be automatically garbage-collected.

I'm not quite sure what you mean by other ESMF errors.

I guess I more mean "mis-uses" (like the above case) than "errors"...

In summary:

  • Current API and naming conventions are all fine to me, as I understand that many of them come from ESMF conventions.
  • I don't particularly need any convenience wrappers (a stable and simple API is more important to me, not matter how it looks). But if you want to add some wrappers for pure-ESMPy users, I am happy to take advantage of them.
  • xESMF should make copies of the weights and then explicitly destroy the ESMF object, to safely manage memory. Directly referencing the Fortran array might be an (unrecommended) option for extremely large grids.

@JiaweiZhuang
Copy link
Owner Author

@bekozi My mistake -- field.get_area() is to compute the area and add to field.data, not directly returning the area array. So it behaves differently from get_factors() and there is no way to add copy option...

@bekozi
Copy link

bekozi commented Aug 29, 2018

This slows down the sparse MM by ~40%. See this demo. It can be fixed by copying and rearranging the memory layout, anyway.

This is a downer. I think what we'll need is a follow on issue where we adapt the regrid store interface in ESMF to return vectors for the index lists directly.

Looks good me if you don't mind adding & maintaining one more function.

Don't mind.

Even though the old ESMF.Regrid object is not referenced by any variable identifiers, the memory will not be freed (in contrast to the automatic garbage collection for normal numpy arrays).

It's hard to guarantee how the Python garbage collection will run. As usual, SO provides a good overview. This is why we introduced the destroy method. The memory will be cleaned up eventually, it may just not happen when you really need it. destroy will be called via __del__ / the destructor on the ESMF.Regrid object when gc runs.

In summary

Thanks for this.

My mistake -- field.get_area() is to compute the area and add to field.data, not directly returning the area array.

You are correct, and you are forgiven. 😄 Not like I caught it anyway.

I'm really hoping to ship these features soon.

@bekozi
Copy link

bekozi commented Sep 18, 2018

Quick update. We hit an impasse with compilers that we are working through. This was not unexpected. The implementation is working with the standard GNU compiler stack but is unstable on others (i.e. Discover standard modules). The rub with this feature is that the weight arrays are dynamically allocated in Fortran complicating pointer management. @rokuingh and I are working through how best to move the arrays up to Python using Fortran's iso_c_binding module. We used a temporary solution to allow Python development to move forward.

You can see the ESMPy method for retrieving the weight dictionary on SF if you like. The good news is that deep copying out of this method will result in contiguous memory from what I can tell.

@JiaweiZhuang
Copy link
Owner Author

JiaweiZhuang commented Sep 18, 2018

Thanks very much for the work! I think all xESMF users use conda (with GNU compiler) and won't be affected by this compiler issue.

@bekozi
Copy link

bekozi commented Nov 14, 2018

@JiaweiZhuang, the in-memory weights branch is ready for testing. A Dockerfile for building ESMF from that branch is available here.

For reference, here are links to the docstrings for the new interfaces. ESMF doesn't build docs for branches...

In our testing, the factor arrays contain zero elements with no spatial overlap (disjoint geometries) and unmapped action is ignore. Again, sorry for the delay on this. Hopefully we won't encounter too many issues during your implementation.

JiaweiZhuang added a commit to JiaweiZhuang/Docker_ESMPy that referenced this issue Nov 14, 2018
@JiaweiZhuang
Copy link
Owner Author

@bekozi Aha, finally!!

I can use the Dockerfile without problems. Will let you know if I hit any issues.

@JiaweiZhuang
Copy link
Owner Author

@bekozi One minor documentation thing: create_rh is not documented in ESMF.Regrid's docstring.

Simply using the one in ESMF docs should be fine:

create_rh
Specifies whether or not to create a routehandle, or just write weights to file. If not specified, defaults to ESMF_TRUE.

I assume that it should be set to False if the Regrid call is only for weight retrieval, either offline (filename = '...') or online (factors=True) .

@bekozi
Copy link

bekozi commented Nov 16, 2018

One minor documentation thing: create_rh is not documented in ESMF.Regrid's docstring.

Noted! Thanks.

I assume that it should be set to False if the Regrid call is only for weight retrieval, either offline (filename = '...') or online (factors=True) .

Yes, it makes sense to use create_rh=False for best performance when returning factors. This should be noted in the documentation. I'll update the test loop too. Edit: It looks like this is already tested appropriately.

@JiaweiZhuang JiaweiZhuang pinned this issue Apr 25, 2019
@JiaweiZhuang
Copy link
Owner Author

JiaweiZhuang commented Aug 6, 2019

I already have a mostly-working code for in-memory weight retrieval, but it relies on ESMF 8.0.0 that is not publicly released yet. So far I've been building the development branch of ESMF at https://github.com/JiaweiZhuang/Docker_ESMPy. Will push my new code when ESMPy/ESMF 8.0.0 becomes conda-installable.

This is of top-priority as the current offline weight approach is quite awkward and hard to test.

@JiaweiZhuang
Copy link
Owner Author

JiaweiZhuang commented Oct 16, 2019

ESMF 8.0.0 is released today and supports online weight retrieval (see release notes):

The following capabilities were added to the ESMF Python interface (ESMPy):
Added an in-memory weight generation option to the Regrid class, allowing re-use of weight vectors without writing them to netCDF. The weight arrays can be returned as NumPy objects or Python dictionary of weight vectors. This allows retrieval of the weights by source and destination key/value pairs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants