
WIP: Aggregation methods for object-type array #3137

Merged: 11 commits into dask:master on Feb 6, 2018

Conversation

fujiisoup (Contributor)

  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API

This fixes #3133.
The current assert_eq does not seem able to handle

  • arrays containing nan
  • scalars

so I added an assert_nan_eq function in test_reductions.py, but there may be a better way to do this (a sketch follows below).

I appreciate any suggestions.
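
For reference, a minimal sketch of what such an assert_nan_eq helper could look like. This is only an illustration: the name comes from the description above, the actual code added to test_reductions.py may differ, and it assumes numeric inputs whose NaN positions can be detected via a float cast.

import numpy as np

from dask.array.utils import assert_eq


def assert_nan_eq(a, b):
    # Compare two computed results, treating NaN positions as equal.
    # Cast to float so np.isnan also works on object-dtype arrays.
    a = np.asarray(a).copy()
    b = np.asarray(b).copy()
    nanidx = np.isnan(a.astype(float))
    assert np.isnan(b.astype(float))[nanidx].all()
    # Zero out the NaN positions so the remaining values can be checked
    # with the ordinary assert_eq machinery.
    a[nanidx] = 0.0
    b[nanidx] = 0.0
    assert_eq(a, b)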

assert np.isnan(b[nanidx].astype(float)).all()
a[nanidx] = 0.0
b[nanidx] = 0.0
assert_eq(a, b)
Member

If possible I'm somewhat in favor of including nan handling logic directly into assert_eq. We've tried to centralize all testing logic into that one function in the past for the sake of simplicity. I can't think of a case where I would prefer to have non-nan-handling behavior. We could make this a keyword argument if desired.

Member

This isn't needed, adding equal_nan=True to assert_eq already works.
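
For illustration, assuming assert_eq forwards keyword arguments such as equal_nan through to its allclose check (which is what this comment indicates), a NaN-producing comparison can be written directly; a minimal sketch:

import numpy as np
import dask.array as da
from dask.array.utils import assert_eq

x = np.array([1.0, np.nan, 3.0])
d = da.from_array(x, chunks=2)

# Without equal_nan=True the closeness check fails on the NaN entry;
# with it, matching NaN positions compare as equal.
assert_eq(d, x, equal_nan=True)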

@jcrist jcrist (Member) left a comment

Besides some comments on how the tests are implemented, this looks fine to me.

I see you have some additional commits dragged in (I assume from a merge with master?). If you can, it'd be nice to rebase on master to see if those can be removed. No worries if this proves too difficult.


        if _not_empty(a):
            assert a.dtype == a_original.dtype
        if check_shape:
            assert_eq_shape(a_original.shape, a.shape, check_nan=False)
    else:
        adt = getattr(a, 'dtype', None)
        if not hasattr(a, 'dtype'):
            a = np.array(a)
Member

The only time that things should pass through without a dtype attribute (which is present on all numpy scalars) is if it's an object dtype. Given that, I'd switch the logic to:

if not hasattr(a, 'dtype'):
    a = np.array(a, dtype='O')

You'll also want to move this above the adt = getattr(a, 'dtype', None), as we'll probably want assert_eq(np.array(1, dtype='O'), 1) to pass.
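
Put together, the suggested shape of this part of assert_eq would be roughly as follows (a sketch based on the comment above, with the surrounding code omitted):

# Coerce dtype-less results (e.g. plain Python scalars) to object arrays
# first, then read the dtype, so that assert_eq(np.array(1, dtype='O'), 1)
# can pass.
if not hasattr(a, 'dtype'):
    a = np.array(a, dtype='O')
adt = a.dtype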


-    if str(adt) != str(bdt):
+    if check_dtype and str(adt) != str(bdt):
Member

I'd remove this flag entirely, in favor of better handling of objects in assert_eq. If a result doesn't have a dtype attribute, it's implicitly an object dtype. We should always be checking the dtype of the results.
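
In that scheme, anything without a dtype attribute would be normalized to an object array on either side, and the dtype comparison would always run with no opt-out flag. A rough, hypothetical sketch:

if not hasattr(a, 'dtype'):
    a = np.array(a, dtype='O')
if not hasattr(b, 'dtype'):
    b = np.array(b, dtype='O')
# Always compare dtypes; an object result must be matched by an object result.
assert str(a.dtype) == str(b.dtype), 'dtype mismatch: %s vs %s' % (a.dtype, b.dtype)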

        return np.allclose(a, b, **kwargs)
    except TypeError:  # explicitly cast if implicit cast is not possible
        return np.allclose(a.astype(float), b.astype(float), **kwargs)
Member

Instead of handling things this way, I'd leave the allclose backported implementation as is (equal_nan was added in numpy 1.10), and define a new function that adds additional support for allclose on object dtypes. Something like:

def myallclose(a, b, equal_nan=False, **kwargs):
    if a.dtype != 'O':
        return allclose(a, b, equal_nan=equal_nan, **kwargs)
    if equal_nan:
        return (a.shape == b.shape and
                all(np.isnan(b) if np.isnan(a) else a == b
                    for (a, b) in zip(a.flat, b.flat)))
    return (a == b).all()

A benefit of this is that it works on dtypes that can't be cast to floats. Since our results are usually small, this implementation should be fine.

Could even redefine/reimport the existing allclose as _allclose, then define your new function as allclose.

Doing it this way is nice because it more cleanly demarcates what functionality is being backported to older versions of numpy (adding equal_nan support for numpy < 1.10) and what we're adding for our own tests (support for object dtypes).
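
As a usage sketch, relying on the myallclose definition from the snippet above and assuming numeric object arrays:

import numpy as np

a = np.array([1.0, np.nan, 3], dtype=object)
b = np.array([1.0, np.nan, 3], dtype=object)

# np.allclose raises TypeError on object-dtype input (which is what the
# try/except float cast above works around); the element-wise comparison
# in myallclose handles it directly.
assert myallclose(a, b, equal_nan=True)   # NaN positions treated as equal
assert not myallclose(a, b)               # plain ==: nan != nan, so False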

pass

c = a == b
print(a)
print(b)
Member

Leftover from debugging?

@jcrist jcrist (Member) commented Feb 6, 2018

Looks good to me. Thanks @fujiisoup! Merging.

@jcrist jcrist merged commit 83c3129 into dask:master Feb 6, 2018
Successfully merging this pull request may close these issues.

sum requires dtype argument for object array.