Support nan-ops for object-typed arrays #1883

Merged
merged 22 commits into pydata:master on Feb 15, 2018
Conversation

fujiisoup
Member

@fujiisoup fujiisoup commented Feb 2, 2018

I am working on adding aggregation ops for object-typed arrays, which may make #1837 cleaner.
I added some tests, but they may not be sufficient.
Are there any other cases that should be considered, e.g. [True, 3.0, np.nan]?
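The fill-and-count approach used throughout this PR can be sketched in plain NumPy (the helper name `nansum_object` here is a hypothetical stand-in, not xarray's actual implementation):

```python
import numpy as np

def nansum_object(values, axis=None):
    """Sketch of a nan-skipping sum for object-dtype arrays:
    mask out NaN entries, replace them with the identity (0), then sum."""
    values = np.asarray(values, dtype=object)
    # NaN is the only value for which v != v, so this marks valid entries
    valid = np.array(
        [not (isinstance(v, float) and v != v) for v in values.ravel()],
        dtype=bool,
    ).reshape(values.shape)
    filled = np.where(valid, values, 0)
    return filled.sum(axis=axis)

mixed = np.array([True, 3.0, np.nan], dtype=object)
print(nansum_object(mixed))  # True + 3.0 == 4.0
```

For the mixed example from the question, `True` participates in the sum as 1, so the NaN-skipping result is 4.0.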

Member Author

@fujiisoup fujiisoup left a comment

I think this is ready for review, but some API decisions are still needed.

data = -1 if valid_count == 0 else int(data)
return np.array(data) # return 0d-array
# convert all nan part axis to nan
return where_method(data, valid_count != 0, -1)
Member Author

@fujiisoup fujiisoup Feb 7, 2018

In numpy, nanargmin raises a ValueError if it encounters an all-NaN slice/axis.
Should we follow this?
Considering our getitem_with_mask method, I think it would be consistent to return -1 in such a case,
but it could sometimes be confusing.

Edit:
Also, this function is currently called only for object-type arrays.
If we adopt the above API, we may also need to update the numeric case.
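For reference, this is numpy's current behavior for an all-NaN slice:

```python
import numpy as np

# numpy raises rather than returning a sentinel index for an all-NaN input
message = None
try:
    np.nanargmin(np.array([np.nan, np.nan]))
except ValueError as err:
    message = str(err)

print(message)  # All-NaN slice encountered
```

Returning -1 instead would match `getitem_with_mask`, but silently diverges from this numpy behavior.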

@fujiisoup fujiisoup changed the title [WIP] Support nan-ops for object-typed arrays Support nan-ops for object-typed arrays Feb 7, 2018
@fujiisoup fujiisoup mentioned this pull request Feb 7, 2018

.. ipython:: python

da = xray.DataArray(np.array([True, False, np.nan], dtype=object), dims='x')
Collaborator

xr / xarray?

""" In-house nanmean. The ddof argument will be used in the _nanvar method """
valid_count = count(value, axis=axis)
value = fillna(value, 0.0)
# TODO numpy's mean does not support object-type array, so we assume float
Member

Why not use sum in this function and simply divide by valid_count - ddof instead of rescaling the mean?
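The suggestion can be sketched as follows (the name `nanmean_ddof_sketch` is hypothetical; xarray's real helper differs):

```python
import numpy as np

def nanmean_ddof_sketch(value, ddof=0, axis=None):
    """Sketch of the suggested approach: fill NaN with 0, take a plain
    sum, and divide by the number of valid entries minus ddof."""
    value = np.asarray(value, dtype=float)
    valid_count = np.sum(~np.isnan(value), axis=axis)
    total = np.sum(np.where(np.isnan(value), 0.0, value), axis=axis)
    return total / (valid_count - ddof)

print(nanmean_ddof_sketch([1.0, np.nan, 3.0]))  # 2.0
```

Dividing the raw sum by `valid_count - ddof` avoids first computing a mean and then rescaling it.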

Member

You could potentially copy at least part of the implementation from NumPy's own mean:
https://github.com/numpy/numpy/blob/e06d3614182e7b97d5e0d90291642027d147744b/numpy/core/_methods.py#L53

Member Author

Why not use sum in this function and simply divide by valid_count - ddof instead of rescaling the mean?

I feel pretty dumb now. Updated.

@@ -171,6 +171,79 @@ def _ignore_warnings_if(condition):
yield


def _nansum(value, axis=None, **kwargs):
Member

Can we give these functions more explicit names, like _nansum_object?

Member Author

Done

filled_value = fillna(value, fill_value)
data = _dask_or_eager_func(func)(filled_value, axis=axis, **kwargs)
if not hasattr(data, 'dtype'): # scalar case
data = np.nan if valid_count == 0 else data
Member

Instead of passing in the fill_value, let's figure it out from dtypes.maybe_promote().



_nan_funcs = {'sum': _nansum,
'min': partial(_nan_minmax, 'min', np.inf),
Member

I am concerned that this will break on arrays of strings. On Python 2, the code probably works (but gives an incorrect result), but on Python 3 np.inf > 'abc' raises a TypeError.

Given that these are only used for object arrays, maybe we should use special objects for this instead, e.g.,

@functools.total_ordering
class AlwaysLessThan(object):
    def __lt__(self, other):
        return True
    def __eq__(self, other):
        return isinstance(other, type(self))

We should probably also add some unit tests for object arrays of strings/NaN (probably do this first!). Currently these raise an error, but I think this code could fix them:

>>> xr.DataArray(np.array([np.nan, 'foo'], dtype=object)).min()
TypeError: '<=' not supported between instances of 'float' and 'str'
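A counterpart sentinel that sorts above everything could serve as the fill value for min. This is a sketch mirroring the class above; the name `AlwaysGreaterThan` and its use as a module-level singleton are assumptions:

```python
import functools

@functools.total_ordering
class AlwaysGreaterThan(object):
    """Sentinel that compares greater than any other object, so it can
    stand in for NaN when taking the min of an object array."""
    def __gt__(self, other):
        return True
    def __eq__(self, other):
        return isinstance(other, type(self))

INF = AlwaysGreaterThan()

# No str/float comparison is ever attempted against the sentinel
print(min(INF, 'foo'))          # foo
print(min('abc', INF, 'zzz'))   # abc
```

With NaNs replaced by `INF`, `min` over a mixed object array never compares a float against a string, which sidesteps the Python 3 TypeError shown above.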

kwargs_mean.pop('keepdims', None)
value_mean = _nanmean_ddof(ddof=0, value=value, axis=axis, keepdims=True,
**kwargs_mean)
squared = _dask_or_eager_func('square')(value.astype(value_mean.dtype) -
Member

You could potentially just use the operator ** 2 instead of the dask_or_eager_func here.

Member Author

Done

dtype = kwargs.get('dtype', None)
value = fillna(value, 0)
# As dtype inference is impossible for object dtype, we assume float
dtype = kwargs.pop('dtype', None)
if dtype is None and value.dtype.kind == 'O':
dtype = value.dtype if value.dtype.kind in ['cf'] else float
Member Author

Is there a good workaround to infer the output dtype of the object-typed array?
We need to pass this to dask for the next division but dtype=object is not allowed.
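The kind-based fallback in the snippet above behaves like this (a small standalone demonstration, not xarray code):

```python
import numpy as np

obj = np.array([True, 3.0, np.nan], dtype=object)
# 'O' (object) is neither complex ('c') nor float ('f'), so fall back to float
dtype = obj.dtype if obj.dtype.kind in ['cf'] else float
print(dtype)  # <class 'float'>

flt = np.array([1.0, np.nan])
# float arrays keep their own dtype
dtype2 = flt.dtype if flt.dtype.kind in ['cf'] else float
print(dtype2)  # float64
```

The float fallback is needed because dask requires a concrete output dtype for the division, and `dtype=object` is not accepted there.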

Member

Is this fixed by your dask PR dask/dask#3137?

If so, can we maybe say this requires using the latest dask release?



@pytest.mark.parametrize('dim_num', [1, 2])
@pytest.mark.parametrize('dtype', [float, int, np.float32, np.bool_, str])
Member Author

Added a test for str-type

@@ -40,7 +64,7 @@ def maybe_promote(dtype):
return np.dtype(dtype), fill_value


def get_fill_value(dtype):
def get_fill_value(dtype, fill_value_typ=None):
Member

Can we make separate functions for this, maybe get_pos_infinity and get_neg_infinity? It feels a little strange to put it all in one function, and with separate functions you can avoid the need to validate the fill_value_typ argument.
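The suggested split might look like the following sketch (function bodies are assumptions; the actual xarray helpers also handle integer and object dtypes specially):

```python
import numpy as np

def get_pos_infinity(dtype):
    """Sketch of one half of the suggested split: no fill_value_typ
    string argument to validate."""
    if issubclass(dtype.type, np.complexfloating):
        return complex(np.inf, np.inf)
    return np.inf

def get_neg_infinity(dtype):
    """The other direction, symmetric to get_pos_infinity."""
    if issubclass(dtype.type, np.complexfloating):
        return complex(-np.inf, -np.inf)
    return -np.inf

print(get_pos_infinity(np.dtype('f8')))   # inf
print(get_neg_infinity(np.dtype('c16')))  # (-inf-infj)
```

With two functions, each call site states its intent directly and there is no `'+inf'`/`'-inf'` string to mistype.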

if isinstance(value, dask_array_type):
data = data.astype(int)
if not hasattr(data, 'dtype'): # scalar case
# TODO should we raise ValueError if all-nan slice encountered?
Member

For consistency with nanargmin(), we probably should still raise ValueError('All-NaN slice encountered') for now. -1 would make sense, but it would need to be documented. NaN could also make sense, but would not be so useful since floats are not valid indexers.

dtype = kwargs.get('dtype', None)
value = fillna(value, 0)
# As dtype inference is impossible for object dtype, we assume float
dtype = kwargs.pop('dtype', None)
if dtype is None and value.dtype.kind == 'O':
dtype = value.dtype if value.dtype.kind in ['cf'] else float
Member

Is this fixed by your dask PR dask/dask#3137?

If so, can we maybe say this requires using the latest dask release?

-------
fill_value : positive infinity value corresponding to this dtype.
"""
if np.issubdtype(dtype, np.floating):
Member

I think we want:

  • issubclass(dtype.type, (np.floating, np.integer)) -> np.inf
  • issubclass(dtype.type, np.complexfloating) -> np.inf + 1j * np.inf

Using np.inf for integer types should be faster, since it doesn't require comparing everything as objects. And I think we need np.inf + 1j * np.inf to match numpy's sort order for complex values.

It's better to use issubclass with dtype.type because np.issubdtype has some weird (deprecated) fallback rules: https://github.com/numpy/numpy/blob/v1.14.0/numpy/core/numerictypes.py#L699-L758
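The explicit checks suggested here look like this against the numpy scalar type hierarchy (a small demonstration, not xarray code):

```python
import numpy as np

# dtype.type gives the numpy scalar class, which can be checked
# explicitly with issubclass instead of np.issubdtype's fallback rules
for code, is_real in [('f8', True), ('i4', True), ('c16', False), ('O', False)]:
    dtype = np.dtype(code)
    assert issubclass(dtype.type, (np.floating, np.integer)) == is_real

# complex dtypes are matched separately
assert issubclass(np.dtype('c16').type, np.complexfloating)
print('ok')
```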


def _nan_minmax_object(func, fill_value_typ, value, axis=None, **kwargs):
""" In house nanmin and nanmax for object array """
if fill_value_typ == '+inf':
Member

Nit: Instead of passing a separate string, we might just pass the function to make the fill value directly (dtypes.get_pos_infinity or dtypes.get_neg_infinity).

That would let us drop these conditionals and error prone string matching.

Member

@shoyer shoyer left a comment

Looks good to me. Feel free to merge...

@fujiisoup fujiisoup merged commit b6a0d60 into pydata:master Feb 15, 2018
@fujiisoup fujiisoup deleted the nanops branch February 15, 2018 22:03
Successfully merging this pull request may close these issues.

aggregation ops for object-dtype are missing