Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dim-expression support in Dataset.select #3920

Merged
merged 6 commits into from
Sep 3, 2019
Merged

Conversation

jonmmease
Copy link
Collaborator

This PR adds a new selection_expr argument to the Dataset.select method. This argument may be set to a holoviews.dim expression as a way to select rows from a dataset using symbolic expressions.

For example:

import holoviews as hv
import pandas as pd
from holoviews import dim
import dask.dataframe as dd

df = pd.DataFrame({
    'a': [1, 1, 3, 3, 2, 2, 0, 0],
    'b': [10, 20, 30, 40, 10, 20, 30, 40],
    'c': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'd': [-1, -2, -3, -4, -5, -6, -7, -8]
})

ds = hv.Dataset(df)

print(ds.data)
   a   b  c  d
0  1  10  A -1
1  1  20  A -2
2  3  30  B -3
3  3  40  B -4
4  2  10  C -5
5  2  20  C -6
6  0  30  D -7
7  0  40  D -8
print(ds.select(selection_expr=(dim('b') >= 20) & (dim('a') % 2 == 0)).data)
   a   b  c  d
5  2  20  C -6
6  0  30  D -7
7  0  40  D -8

As long as the expressions are compatible with Dask DataFrames, this also works for Datasets backed by Dask dataframes.

ddf = dd.from_pandas(df, npartitions=2)
dds = ds.clone(data=ddf)
print(dds.select(selection_expr=(dim('b') >= 30)).data)
Dask DataFrame Structure:
                   a      b       c      d
npartitions=2                             
0              int64  int64  object  int64
4                ...    ...     ...    ...
7                ...    ...     ...    ...
Dask Name: getitem, 8 tasks
print(dds.select(selection_expr=(dim('b') >= 30)).data.compute())
   a   b  c  d
2  3  30  B -3
3  3  40  B -4
6  0  30  D -7
7  0  40  D -8

Not all expressions currently work with Dask data structures. This is something I'd like to improve, but I think that should be in a separate PR.

@jlstevens @philippjfr

Adds new `compute` and `keep_index` kwargs to the `values` method of all data interfaces.

For dask data structures, compute controls whether value is evaluated before being returned.

For pandas/dask Series, keep_index controls whether the values are returned as a series (keep_index=True) or array (keep_index=False).
Makes it possible to filter a dataset using a pandas or dask series with index that matches the index of the element's `data` data frame.
This may be set to a dim predicate express indicating which rows should kept.

squash into 433640
Copy link
Member

@jbednar jbednar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! selection_expr is quite wordy, though. expr, maybe?

In any case, is there a strong reason dim objects can't simply be accepted the same as any other selection specifier, as the first argument? That would seem much more natural and expected, to me. Are there cases where that would somehow be ambiguous?

@jonmmease
Copy link
Collaborator Author

The current (without this PR) API signature for select looks like this:

def select(self, selection_specs=None, **selection):
    ...

So right now, selection are only specified as kwargs, and the first positional argument is treated as a selection specification that I believe configures the nested application of selections.

I went with selection_expr since there was already selection_specs and selection_mask in the API. One reason to err on the side of being wordy here is that this name lives in the same "namespace" as the dimensions names being selected. So right now, if there is a dimensions named selection_specs it's not possible to reference it in a standard kwarg-based selection.

@jbednar
Copy link
Member

jbednar commented Aug 21, 2019

Ok, that's a good reason for having a longer name, but isn't selection_expr just another kind of selection_spec, which would let it be used as a positional argument?

@philippjfr
Copy link
Member

philippjfr commented Aug 21, 2019

No selection_spec specifies which object the selection applies to, selection_expr applies to the actual selected objects.

@jbednar
Copy link
Member

jbednar commented Aug 21, 2019

Ah, ok; I had to go back and look at examples to see what was up here; I was only remembering part of how to use .select(). As I see it, ds.select(b=20) works fine already due to kwargs processing, but we would also want to do things like ds.select(b=>20), which wouldn't be properly evaluated by Python. That's what dim is for, so can't we get ds.select(dim('b') >= 20) to work, without requiring the awkward and verbose ds.select(selection_expr=(dim('b') >= 20))?

To make that work, it seems like all we'd need to do is to make selection_expr be the first argument in the signature, rather than selection_specs. That would break compatibility for anyone who had used selection_specs as a positional argument, but maybe (a) that's rare (I hope!) and (b) could we detect someone trying to pass a selection_spec to selection_expr and raise an error? Having this work smoothly and cleanly seems important to me for making it easily usable in practice...

@jonmmease
Copy link
Collaborator Author

I'd be happier having selection_expr as the first argument for the usability reasons you mentioned.

One thing that we could do would be to make selection_expr the first argument, but send the value along the selection_spec codepath if the selection_expr object isn't a dim expression. Though this is probably too much magic.

@philippjfr @jlstevens Do you have strong feelings about the compatibility/convenience trade-off here?

@jbednar
Copy link
Member

jbednar commented Aug 21, 2019

I think that's too much magic to happen silently, but it seems reasonable if accompanied by a warning message saying that people need to use the selection_specs keyword argument instead.

Add error message if first arg is not a dim expression, referring users
to the selection_specs keyword argument.
@jonmmease
Copy link
Collaborator Author

Alright, in 6c1937a I moved selection_expr to the first positional arg and raise an exception referring users to selection_specs if the argument is not dim expression.

The first example above may now be written as:

ds.select((dim('b') >= 20) & (dim('a') % 2 == 0))

@philippjfr
Copy link
Member

Do you have strong feelings about the compatibility/convenience trade-off here?

I think usage of selection_specs is pretty rare tbh, selections already only apply if they match any dimensions. I think there are one or two places in the plotting code that use it. I wonder if we could check if the first argument is a dim expression and if not issue a warning about specs no longer being the first argument. I'd also want to rename selection_specs -> specs (aliasing for now maybe?) and then simply call selection_expr expr.

@jbednar
Copy link
Member

jbednar commented Aug 22, 2019

I'd also want to rename selection_specs -> specs (aliasing for now maybe?) and then simply call selection_expr expr.

But with the simpler names, won't it conflict with possible dimension names as Jon mentioned? Seems likely there will be data sources with columns named expr and possibly specs; selection_specs and selection_expr seem much less likely.

@philippjfr
Copy link
Member

Oh right, missed that and yes that's a good point.

@jlstevens
Copy link
Contributor

jlstevens commented Aug 30, 2019

Reading the above before looking at the code, I am happy to see that all the suggestions I was thinking of were proposed and accepted as I was thinking them up. selection_expr is indeed ugly but it also needs to avoid clashes with dimension names so making it the first positional argument makes sense to me.

I agree that usage of selection_specs is pretty rare and I am happy with any reasonable way of either warning about potential mistakes or detecting the type of the argument and processing it accordingly. Now I'm going to look through the diff but so far it all looks good!

@@ -477,7 +477,7 @@ def initialize_unbounded(obj, dimensions, key):
"""
select = dict(zip([d.name for d in dimensions], key))
try:
obj.select([DynamicMap], **select)
obj.select(selection_specs=[DynamicMap], **select)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I count only four times where selection_specs had to be specified as a keyword instead of by position! If that is how often it was used that positional argument in our own codebase, I am pretty certain users barely used it (if at all).

@jlstevens
Copy link
Contributor

Looks good to me!

@jbednar
Copy link
Member

jbednar commented Aug 31, 2019

It looks like the appveyor failures are independent of this PR? Seems like they need to be addressed in any case. Otherwise this seems mergeable.

@jonmmease
Copy link
Collaborator Author

Thanks for the feedback all. Merging!

@jonmmease jonmmease merged commit 996e7c4 into master Sep 3, 2019
@philippjfr philippjfr deleted the expression_selection branch September 20, 2019 10:40
Copy link

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants