
problem with aggregated_by and mdtol and masks with iris2.1 and numpy1.15 #3190

Closed
mcguirepatr opened this issue Oct 5, 2018 · 8 comments · Fixed by #4246

Comments

@mcguirepatr commented Oct 5, 2018

When I run the commands below in Python 2.7 with Iris 2.1 and numpy 1.15, I get the error messages in the traceback that follows.

Python commands:

import iris
cube = iris.load_cube('example.nc')
# Check to see if the data is there:
cube.data
cube = cube.aggregated_by(["year", "day_of_year"], iris.analysis.MEAN, mdtol=0.03)

The 20 MB example.nc NetCDF file is available here:
https://drive.google.com/open?id=1hZNWXs7wOtBq2dXWM1zbAJnVMjy_DXc7

Traceback (most recent call last):
  File "test10.py", line 10, in <module>
    cube = cube.aggregated_by(["year", "day_of_year"], iris.analysis.MEAN, mdtol=0.03)
  File "/Users/pmcguire/fluxnet1/iris/cube.py", line 3414, in aggregated_by
    **kwargs)
  File "/Users/pmcguire/fluxnet1/iris/analysis/__init__.py", line 524, in aggregate
    result.mask = result.mask | mask_update
  File "/Users/pmcguire/.local/lib/python2.7/site-packages/numpy/ma/core.py", line 6339, in __setattr__
    "attributes of {!r} are not writeable".format(self))
AttributeError: attributes of masked are not writeable

I think I have tracked this down.

I could get my script (test_fluxnet_evaluation.py) to work by commenting out the three lines marked below in the _Aggregator.aggregate method of iris/analysis/__init__.py (near line 514) in a copy of Iris 2.1.
For some reason the assignment to result.mask below failed even though ma.isMaskedArray(result) is True.

    result = self.call_func(data, axis=axis, **kwargs)
    if (mdtol is not None and ma.isMaskedArray(data)):
        fraction_not_missing = data.count(axis=axis) / data.shape[axis]
        mask_update = 1 - mdtol > fraction_not_missing
        # if ma.isMaskedArray(result):
        #     result.mask = result.mask | mask_update
        # else:
        result = ma.array(result, mask=mask_update)

In Iris 1.13, this problem does not exist.
I can reproduce the problem if I upgrade from Iris 1.13 to Iris 2.2.0dev0, including all dependencies.

When I then downgrade only Iris 2.2 -> Iris 1.13, keeping the updated dependencies, the bug persists.
But when I downgrade both Iris 2.2 -> Iris 1.13 and numpy 1.15 -> numpy 1.12.1, keeping the other updated dependencies, the bug goes away.
So I think it's something in numpy.

When it works with Iris 1.13 and numpy 1.12.1, the aggregate function (with iris.analysis.MEAN) tries to overwrite the (non-array) mask value for the day, and it can do this because the mask is writeable. When it fails with Iris 2.1 and numpy 1.15, it is because that mask is no longer writeable.

A quick perusal of the numpy 1.15 release notes (https://docs.scipy.org/doc/numpy/release.html) suggests that the mask behaviour was changed:
"np.ma.masked is no longer writeable:
Attempts to mutate the masked constant now error, as the underlying arrays are marked readonly. In the past, it was possible to get away with...".
I am not sure whether this is the change that is affecting us.
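
A minimal sketch reproducing the read-only behaviour outside Iris (this assumes numpy >= 1.15; the rebuild on the last line mirrors the uncommented line in the snippet above):

    import numpy.ma as ma

    # Reducing a fully masked slice collapses to the masked *constant*,
    # which is itself an instance of MaskedArray:
    result = ma.MaskedArray([3.0], mask=[True]).mean()
    print(result is ma.masked)       # True
    print(ma.isMaskedArray(result))  # True

    # On numpy >= 1.15 the constant's attributes are read-only, so the
    # in-place mask update attempted by Iris fails:
    try:
        result.mask = True
    except AttributeError as err:
        print(err)  # attributes of masked are not writeable

    # Rebuilding the array instead of mutating the mask still works:
    result = ma.array(result, mask=True)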

Can someone fix this problem in Iris? Should I just maintain a separate Iris branch (with those three lines commented out) for my own purposes until this is fixed? Or does anybody have a better solution? I am not sure whether I should downgrade numpy.

Thanks,
Patrick McGuire

Department of Meteorology and Department of Geography & Environmental Science
University of Reading
Personal Web Page: http://www.personal.reading.ac.uk/~gn916173

@DPeterK (Member) commented Oct 5, 2018

Hi @mcguirepatr - thanks for raising this issue! As it happens, I was just looking at your post on Google Groups; apologies that no one has responded to it there yet.

I've successfully reproduced the issue you report here, and I think you're spot on in pointing at NumPy as the cause, so we'll dig in and see if we can get a fix in place. In fact, your note about removing the mask-update lines is very helpful, as it points towards the likely origins of the problem. There are two:

  • The NetCDF representation of data has recently changed such that all files are loaded with masked data by default (and we're thinking about what to do about this), even if no masked points are present. This can cause problems in some cases, where the mask can be False (just a single value). I think this is what is happening here - but I need to check that.
  • The line that reads if ma.isMaskedArray(result): should read if ma.is_masked(result):, as that provides a more general check that can deal with the presence of a mask even if no masked points are set (that is, it can handle situations like the one outlined in the previous bullet point); see the sketch below.
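
A quick illustration of the difference between the two checks (plain numpy behaviour, not the actual Iris patch):

    import numpy.ma as ma

    arr = ma.array([1.0, 2.0])    # a masked array with no masked points
    print(ma.isMaskedArray(arr))  # True  -- checks the type only
    print(ma.is_masked(arr))      # False -- no masked values are set

    arr[0] = ma.masked
    print(ma.is_masked(arr))      # True  -- now a masked value is present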

By the way, you don't need to call cube.data, as per the third line in your sample code. Doing so loads all data points off disk and into memory, which has performance and memory-usage impacts that grow with the size of the data. Most operations in Iris are fully functional (and automatically parallelised) without needing to load data in this way, and those that aren't will automatically load the data when it's needed.
(In connection to this point, I tried running your sample code without loading the data, and I noticed that Iris did not use the lazy aggregator in this case; this should probably also be investigated.)
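
For example (a sketch assuming the example.nc file from the original report):

    import iris

    cube = iris.load_cube('example.nc')
    print(cube.has_lazy_data())  # True -- nothing has been read into memory yet

    # The aggregation realises whatever data it needs by itself:
    result = cube.aggregated_by(['year', 'day_of_year'], iris.analysis.MEAN)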

P.S. I edited your initial post as there was some odd styling going on.

@DPeterK (Member) commented Oct 5, 2018

"Iris did not use the lazy aggregator in this case"

This is explained by #3129.

@mcguirepatr (Author) commented

Thanks for your help with this, Mr. Killick!
Did you figure out how to fix it?
Patrick McGuire

@darkblue-b commented Oct 24, 2018

As supporting info, I confirm that the given example runs correctly on OSGeoLive 12.0:

scitools-iris==2.1.0; Cartopy==0.14.2; numpy==1.13.3; cf-units==2.0.2; matplotlib==2.1.1; nc-time-axis==1.1.0; jupyter-core==4.4.0; jupyter-client==5.2.2; netCDF4==1.3.1; Cython==0.26.1; future==0.15.2; pyke==1.1.1; six==1.10.0

Ubuntu 18.04; Python 2.7; iris_3190_bug_example.nc (21 MB)

@kwilliams-mo (Contributor) commented Oct 29, 2018

I'm also having this problem on the same system as Patrick. I think it happens when one or more of the elements are masked after applying the aggregating function but before applying mdtol.

Is there a work-around?

Thanks!

Example 1:

import numpy as np
import iris
import iris.cube
cube = iris.cube.Cube(np.ma.MaskedArray([3]))

cube.add_dim_coord(iris.coords.DimCoord([1], long_name='foo'), (0,))

cube.aggregated_by('foo', iris.analysis.MEAN, mdtol=0.5) # no error

cube.data[0] = np.ma.masked

cube.aggregated_by('foo', iris.analysis.MEAN) # no error
cube.aggregated_by('foo', iris.analysis.MEAN, mdtol=0.5) # ValueError: assignment destination is read-only

Example 2:

import numpy as np
import iris
import iris.cube

cube = iris.cube.Cube(np.arange(4) * 10.0)
cube.data = np.ma.asanyarray(cube.data)

dim_coord = iris.coords.DimCoord(np.arange(4), long_name='sample')
aux_coord = iris.coords.AuxCoord([2000, 2000, 2001, 2001], long_name='year')
cube.add_dim_coord(dim_coord, (0,))
cube.add_aux_coord(aux_coord, (0,))

cube.aggregated_by('year', iris.analysis.MEAN, mdtol=0.03) # no error

cube.data[0] = np.ma.masked
cube.aggregated_by('year', iris.analysis.MEAN, mdtol=0.03) # no error

cube.data[1] = np.ma.masked
cube.aggregated_by('year', iris.analysis.MEAN, mdtol=0.03) # ValueError: assignment destination is read-only

@kwilliams-mo (Contributor) commented Oct 29, 2018

This is what I've come up with so far as a work-around for Patrick's original example, but it's pretty convoluted - there must be a better way! :)

import numpy as np

import iris

def wrap_aggregated_by_with_mdtol(cube, coord_names, aggregator, mdtol=None):
    '''
    Hack to get around https://github.com/SciTools/iris/issues/3190

    args:
     * cube: cube to be aggregated
     * coord_names: should be either ['year', 'day_of_year'], ['year', 'month']
       or ['year', 'month_number']
     * aggregator: Iris aggregator object, e.g. iris.analysis.MEAN
     * mdtol: tolerance of missing data to pass on to the Iris aggregated_by function
    '''
    allowed_coord_names = (['year', 'day_of_year'], ['year', 'month'],
                           ['year', 'month_number'])

    if coord_names not in allowed_coord_names:
        raise ValueError('coord_names is not one of the allowed combinations: '
                         + str(allowed_coord_names))

    try:
        agg_cube = cube.aggregated_by(coord_names, aggregator, mdtol=mdtol)

    except (AttributeError, ValueError):

        def make_str_arr(in_cube, coord_names):
            '''
            Makes a list of the form
            ['DOY-YYYY', 'DOY-YYYY', ...] e.g. ['001-2000', '002-2000', ...]
            or
            ['MM-YYYY', 'MM-YYYY', ...] e.g. ['01-2000', '02-2000', ...]
            '''
            n = len(in_cube.coord('year').points)

            if coord_names[0] == 'year':
                str_c0 = [format(in_cube.coord(coord_names[0]).points[i], '03')
                          for i in range(n)]
            else:
                raise Exception('have not implemented this')

            if coord_names[1] == 'day_of_year':
                str_c1 = [format(in_cube.coord(coord_names[1]).points[i], '03')
                          for i in range(n)]
            elif coord_names[1] == 'month_number':
                str_c1 = [format(in_cube.coord(coord_names[1]).points[i], '02')
                          for i in range(n)]
            elif coord_names[1] == 'month':
                month_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
                str_c1 = [format(month_list.index(in_cube.coord(coord_names[1]).points[i]) + 1, '02')
                          for i in range(n)]
            else:
                raise Exception('have not implemented this')

            return [str_c1[i] + '-' + str_c0[i] for i in range(n)]

        # Do the aggregation without mdtol.
        agg_cube_no_mdtol_used = cube.aggregated_by(coord_names, aggregator)

        # Create a cube to store the final result (after applying the correct
        # mdtol), but fill it with NaNs for now.
        agg_cube = agg_cube_no_mdtol_used.copy()
        agg_cube *= np.nan

        # Get rid of masked days/months (i.e. days/months with no valid data).
        agg_cube_no_mdtol_used = agg_cube_no_mdtol_used[~agg_cube_no_mdtol_used.data.mask]

        # Get the list of days/months which have at least one data point.
        str_arr_after_agg = list(set(make_str_arr(agg_cube_no_mdtol_used, coord_names)))

        # Get the list of strings for each time step in the original cube.
        str_arr_before_agg = make_str_arr(cube, coord_names)

        # has_data indicates, for each timestep in the original cube, whether
        # it is part of a day/month with at least some data.
        has_data = [s in str_arr_after_agg for s in str_arr_before_agg]

        # Take out timesteps from days/months with no data.
        new_cube = cube[has_data]

        # Try the aggregation with mdtol again.
        agg_cube_not_yet_filled = new_cube.aggregated_by(coord_names, aggregator, mdtol=mdtol)

        # Put the valid data into agg_cube.
        j = 0
        n_valid = agg_cube_not_yet_filled.shape[0]
        for i in range(agg_cube.shape[0]):
            if j >= n_valid:
                break  # all valid aggregated values have been copied across

            c0_i = agg_cube.coord(coord_names[0]).points[i]
            c1_i = agg_cube.coord(coord_names[1]).points[i]

            c0_j = agg_cube_not_yet_filled.coord(coord_names[0]).points[j]
            c1_j = agg_cube_not_yet_filled.coord(coord_names[1]).points[j]

            if (c0_i == c0_j) and (c1_i == c1_j):
                agg_cube.data[i] = agg_cube_not_yet_filled.data[j]
                j += 1

    return agg_cube


def main():
    cube = iris.load_cube('~/example.nc')
    agg_cube = wrap_aggregated_by_with_mdtol(cube, ['year', 'day_of_year'],
                                             iris.analysis.MEAN, mdtol=0.03)
    print(agg_cube)
    print(agg_cube.data)


if __name__ == '__main__':
    main()

@rcomer (Member) commented Sep 28, 2020

I've reproduced the errors from @mcguirepatr's and @kwilliams-mo's examples with the latest master branch (almost v3). However, I note that the problem goes away if the cube's input data is lazy (lazy aggregated_by was introduced in Iris v2.4). So one way to fix this would be to implement #2418.
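
For reference, a sketch of that workaround (assuming Iris >= 2.4 and the example.nc file from the original report): keep the data lazy and the failing mask update is never reached.

    import iris

    cube = iris.load_cube('example.nc')  # no cube.data call, so the data stays lazy
    assert cube.has_lazy_data()
    agg = cube.aggregated_by(['year', 'day_of_year'], iris.analysis.MEAN, mdtol=0.03)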

@rcomer (Member) commented Mar 24, 2022

This is now fixed, ready for Iris v3.3 (due out in the summer). Sorry it took so long, and thanks for your efforts in digging into the problem, @mcguirepatr and @kwilliams-mo!
