
Collapsing cubes with weights kwarg very inefficient (full realization of data, twice) #47

Closed
valeriupredoi opened this issue Feb 26, 2019 · 13 comments
Assignees
Labels
preprocessor Related to the preprocessor

Comments

@valeriupredoi
Contributor

Two use cases from running a recipe and profiling it via the debug and resource files:

First instance example
-----------------------
2019-02-26 14:03:40.811397      7250.1  7576.5  108     2.7     0       0.606   0.006
2019-02-26 14:03:41.897243      7251.2  7577.7  108     3.8     0       0.606   0.006
2019-02-26 14:03:42.984943      7252.3  7578.9  108     5.8     0       0.606   0.006
2019-02-26 14:03:44.071390      7253.4  7580.0  108     6.4     0       0.606   0.006
2019-02-26 14:03:45.156149      7254.5  7581.2  109     7.9     0       0.606   0.006
2019-02-26 14:03:46.243115      7255.5  7582.4  108     9.4     0       0.606   0.006
2019-02-26 14:03:47.341183      7256.6  7583.6  109     11.1    1       0.606   0.006
2019-02-26 14:03:48.465922      7257.8  7584.8  111     11.7    1       0.606   0.006
2019-02-26 14:03:49.567116      7258.9  7586.0  110     13.2    1       0.606   0.006
2019-02-26 14:03:50.656928      7260.0  7587.1  96      0.6     0       0.606   0.007

2019-02-26 14:03:41,277 UTC [16489] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step average_region
2019-02-26 14:03:41,280 UTC [16489] DEBUG   esmvaltool.preprocessor:187 Running area_average(precipitation_flux / (kg m-2 s-1)   (time: 1740; latitude: 360; longitude: 720)
2019-02-26 14:03:41,284 UTC [16489] INFO    esmvaltool.preprocessor._area_pp:179 Calculated grid area:(1740, 360, 720)
2019-02-26 14:03:50,494 UTC [16489] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step cmor_check_data


Second instance example
-----------------------
2019-02-26 14:36:54.558816      9243.9  9778.3  108     12.4    1       0.708   0.007
2019-02-26 14:36:55.667920      9245.0  9779.5  111     13.5    1       0.708   0.007
2019-02-26 14:36:58.053553      9247.4  9781.9  100     0.5     0       0.708   0.007

2019-02-26 14:17:33,671 UTC [16489] INFO    esmvaltool.preprocessor._area_pp:179 Calculated grid area:(1740, 360, 720)
2019-02-26 14:36:57,901 UTC [16489] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step cmor_check_data

I am assigning this to myself since, if I can't optimize it, I can talk to Bill next week.

@valeriupredoi valeriupredoi self-assigned this Feb 26, 2019
@valeriupredoi
Contributor Author

OK - I found something very useful: here is a little function:

import iris
import iris.analysis.cartography

def simple_area(cube, coord1, coord2):
    grid_areas = iris.analysis.cartography.area_weights(cube)
    print('LAZY', cube.has_lazy_data())
    result = cube.collapsed([coord1, coord2],
                            iris.analysis.MEAN,
                            weights=grid_areas)
    print('LAZY', result.has_lazy_data())
    return result

If you run this on a 500 MB file with weights, it peaks at max mem = 1.3 GB; if you don't apply the weights, it is blindingly fast with max mem = 20 MB. Note that with weights the result cube will NOT have lazy data. @bjlittle, why is it that using weights as a keyword arg for collapsed gets the data realized and, as such, performance drops like a rock?
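For reference, a dask-only sketch (synthetic array, not iris) of the lazy vs realized distinction at play here:

```python
import numpy as np
import dask.array as da

# synthetic stand-in for the cube's data payload
data = da.random.random((1000, 360, 720), chunks=(100, 360, 720))

lazy_mean = data.mean(axis=(1, 2))  # still lazy: only a task graph so far
realized = np.asarray(data[:2])     # routing through numpy computes the chunks

print(type(lazy_mean).__name__, type(realized).__name__)
```

Anything that forces the weights path through plain numpy inside collapsed triggers the second, eager behaviour for the whole array.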

@valeriupredoi
Contributor Author

argh! I see now - in cube.py, when passing weights as a kwarg, the aggregation is no longer lazy 😢

@valeriupredoi
Contributor Author

OK - good resources:
SciTools/iris#3280
SciTools/iris#2418

@ledm how important are those weights in the area-average calculations? i.e. do they alter the results by a margin significant enough to justify their use?

@valeriupredoi
Contributor Author

ah nevermind! Lack of area weights may introduce biases of ~10 degrees on average for temperature (just tested with/without weights)
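A quick numpy sanity check (idealized zonal profile, not the actual recipe data) of how much the weights can matter:

```python
import numpy as np

# idealized temperature that peaks at the equator and falls off to the poles
lat = np.linspace(-89.75, 89.75, 360)
temp = 30.0 * np.cos(np.deg2rad(lat))

unweighted = temp.mean()
weighted = np.average(temp, weights=np.cos(np.deg2rad(lat)))

print(round(unweighted, 2), round(weighted, 2))  # the two means differ by several degrees
```

The unweighted mean over-counts the polar rows, so even this toy profile shifts by a few degrees; for real fields with stronger gradients the bias can be larger.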

@valeriupredoi valeriupredoi changed the title Area operations are very inefficient Collapsing cubes with weights kwarg very inefficient (full realization of data, twice) Feb 26, 2019
@BenMGeo

BenMGeo commented Mar 11, 2019

I'm currently trying to resolve this in c3s_511:

A code snippet that works around the weighting memory problem at the cost of increased CPU time:

def __apply_fun2cube__(self, cube, dims=None, function=None,
                       incl_weights=True, **kwargs):
    """Apply an aggregation function to a cube slice by slice (memory saving)."""
    self.__logger__.info("====================================================")
    self.__logger__.info("Running apply_fun2cube")
    self.__logger__.info(dims)
    self.__logger__.info(kwargs)
    self.__logger__.info("still lazy? %s", cube.has_lazy_data())

    try:
        if "time" not in dims:
            latlon_list = []
            for latlon in cube.slices(["latitude", "longitude"]):
                # drop auxiliary time coordinates that would block merging
                latlon.remove_coord("day_of_month")
                latlon.remove_coord("day_of_year")
                latlon.remove_coord("month_number")
                latlon.remove_coord("year")

                if incl_weights and "latitude" in dims:
                    # apply the precomputed area weights slice by slice
                    latlon = latlon * self.map_area
                    latlon.standard_name = cube.standard_name
                    latlon.long_name = cube.long_name

                latlon = latlon.collapsed(dims, function, **kwargs)
                latlon_list.append(latlon)

            new_cube = iris.cube.CubeList(latlon_list).merge_cube()
            self.__logger__.info("going path one")
        else:
            new_cube = cube.collapsed(dims, function, **kwargs)
            self.__logger__.info("going path two")
    except Exception:
        # collapsing over scalar coordinates fails; fall back to a plain collapse
        new_cube = cube.collapsed(dims, function, **kwargs)
        self.__logger__.info("going path three")

    self.__logger__.info("still lazy? %s", cube.has_lazy_data())
    self.__logger__.info("====================================================")
    return new_cube

You can call the function like:

new_cube = self.__apply_fun2cube__(cube, dims=["latitude"], function=iris.analysis.MEAN)

As you can see, I'm still debugging and trying to catch everything that kills the function. This way it stays lazy for one coord (it is supposed to work with all the usual dimensions of a cube in ESMValTool). In my tests you can call it twice if latitude goes first (this is why I need the except: it does not work on scalar coordinates!).

@BenMGeo

BenMGeo commented Mar 11, 2019

Plus: it does not work for std_dev, and I could not yet find out why.

@valeriupredoi
Contributor Author

hi @BenMGeo good stuff, man! There is an open SciTools/iris GitHub issue, SciTools/iris#3129, that @bjlittle and I talked about last week; you may want to comment on it so you and the iris guys can work together. We want a solution straight in iris rather than something in ESMValTool 🍺

@BenMGeo

BenMGeo commented Mar 13, 2019

It works now for STD_DEV, but of course the weighted standard deviation is not the standard deviation of the weighted values (doh).
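A tiny numpy illustration of that pitfall (made-up numbers, just to show the two quantities disagree):

```python
import numpy as np

x = np.array([1.0, 2.0, 10.0])
w = np.array([1.0, 1.0, 2.0])

# proper weighted standard deviation
wmean = np.average(x, weights=w)
wstd = np.sqrt(np.average((x - wmean) ** 2, weights=w))

# naive: the std of the weighted values x * w
naive = np.std(x * w)

print(round(wstd, 3), round(naive, 3))  # clearly different numbers
```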

I can provide you with a clean function of the above for having a weighted mean, though.

I do not see any other solution than writing actual iris aggregators ourselves, with lazy functions backed by xarray or dask arrays depending on the iris version we use.

I'll have to deep-dive into writing custom aggregators and into the xarray and dask packages to provide these aggregators for our code. Once I get there, I'll propose them to iris.
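As a sketch of what the lazy part of such an aggregator could look like (my assumption of the shape of a lazy weighted-mean function, not iris's actual code):

```python
import dask.array as da

def lazy_weighted_mean(data, axis=None, weights=None):
    """A lazy weighted mean of the kind a custom iris Aggregator could delegate to."""
    if weights is None:
        return da.mean(data, axis=axis)
    weights = da.asarray(weights)
    return da.sum(data * weights, axis=axis) / da.sum(weights, axis=axis)

x = da.from_array([[1.0, 2.0], [3.0, 4.0]], chunks=2)
w = da.from_array([[1.0, 1.0], [1.0, 3.0]], chunks=2)
result = lazy_weighted_mean(x, axis=0, weights=w).compute()
print(result)  # [2.  3.5]
```

The whole expression is a dask graph until compute(), so the weights never force realization of the input.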

@valeriupredoi
Contributor Author

great stuff! 🍺 I suggest getting in touch with the iris guys via the associated SciTools issue and present them with the solution, mention @bjlittle and myslef please so we can keep track of the implementation, it's best to have it straight in iris than in esmvaltool - but yeah, cool stuff so far!

@valeriupredoi
Contributor Author

@bouweandela good stuff - but have you tested the implementation of SciTools/iris#3299 within ESMValTool?

@mattiarighi mattiarighi transferred this issue from ESMValGroup/ESMValTool Jun 11, 2019
@mattiarighi mattiarighi added the preprocessor Related to the preprocessor label Jun 11, 2019
@valeriupredoi
Contributor Author

valeriupredoi commented Feb 20, 2020

we can now pass lazy weights to collapse operations, e.g.:

import dask.array as da
def simple_area(cube, coord1, coord2):
    grid_areas = iris.analysis.cartography.area_weights(cube)
    grid_areas = da.array(grid_areas)
    result = cube.collapsed([coord1, coord2],
                            iris.analysis.MEAN,
                            weights=grid_areas)
    return result

and both the input data and the output data will be LAZY (iris 2.4), but the execution time of the collapse operation with vs without weights still differs by a factor of 50, and memory use increases by 2-3x.
@bjlittle -> any news on the iris front? 🍺
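Under the hood the weighted collapse is essentially this dask graph (standalone sketch with synthetic data, scaled down from the real 1740x360x720 field):

```python
import numpy as np
import dask.array as da

# synthetic stand-in for the (time, lat, lon) field
data = da.random.random((120, 36, 72), chunks=(30, 36, 72))
lat = np.linspace(-87.5, 87.5, 36)
w = np.cos(np.deg2rad(lat))[None, :, None]
weights = da.from_array(np.broadcast_to(w, data.shape).copy(), chunks=(30, 36, 72))

# the whole pipeline stays lazy until compute()
weighted_mean = (data * weights).sum(axis=(1, 2)) / weights.sum(axis=(1, 2))
series = weighted_mean.compute()
```

If this plain-dask version is fast while the iris collapse is 50x slower, the overhead is somewhere in how iris threads the weights through the aggregator rather than in dask itself.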

@bouweandela
Member

@valeriupredoi Is this still an issue?

@schlunma
Contributor

I have extensively tested this in SciTools/iris#5341, so this should not be an issue anymore. Please re-open if necessary.

5 participants