Dataset aliases #1075

philippjfr · 2017-01-20T03:53:18Z

Addresses #1001. Needs more checking and a lot of unit tests.

jlstevens · 2017-01-20T10:59:08Z

Looks good though as you say, needs a lot more testing.

The unit tests on travis suggest something went wrong in groupby:

File "/home/travis/build/ioam/holoviews/holoviews/core/data/pandas.py", line 124, in groupby
    group_by = [d.alias for d in dimensions]
AttributeError: 'str' object has no attribute 'alias'

I suspect dimensions is expected to be a list of dimension objects and not strings. The corresponding change is here. It seems to be a line added instead of a line replaced (like the rest of them) so I'm hoping this is just a mistake.

Once the unit tests pass, we'll get the notebooks passing before adding more unit tests...

philippjfr · 2017-01-22T15:25:00Z

Ready for initial review, need to add to the dimensions method docstring. @jlstevens You can also start working on some aliasing documentation, a standalone tutorial will be helpful but a short example in the Columnar_Data tutorial would also help.

jlstevens · 2017-01-23T00:06:13Z

By far the most important implications of this PR are regarding the changes toDimension. This is something I discussed with @philippjfr at length and we came to the following conclusions:

Having a way to always access the 'short name' is the right thing to do. If we were to redesign HoloViews from scratch, we might have kept name as always referring to the short name and added description (or similar) for the long name. It is much too late to make such a change - it would be a big disruption for our users.
This PR shows that the idea of .alias is sound. In addition, these changes can be made to work without disrupting the rest of the code.
The problem is the name as alias does not sound like a fundamental core attribute of Dimension. It is feels off to see core code working with dimension referencing alias everywhere instead of name, especially as the 'alias system' has almost been a hidden feature till now. To my mind, alias sound optional, and not a fundamental concept in the way name is.

As is often the case it comes down to naming. I quite liked the idea of label until Philipp pointed that label is already used in at least two different ways already. The other alternatives were identifier (bit too long) and id which is short but clashes with the built-in name and sounds like a number (and we already have an internal concept of anid). We didn't particularly like ident as a shortened version of identifier. Another suggestion we considered was varname.

We decided that the current best suggestion is key (hence Philipp's latest commit) due to these advantages:

Sounds like an fundamental core concept: a dimension key sounds like a more important thing than a dimension alias or even a dimension name.
It seems compatible with the existing notion of keys both in terms of methods operating on dimensions (e.g __getattr__) and in terms of how HoloMap keys are expressed by kdims.
Even shorter than name.

One possible disadvantage is that all dimensions will have key whether or not they are key or value dimensions. That said, dimensions already have an optional values attribute and we didn't worry that this would cause confusion with value dimensions.

jlstevens · 2017-01-23T00:08:54Z

holoviews/core/dimension.py

@@ -677,7 +677,7 @@ def dimensions(self, selection='all', label=False):
        else:
            raise KeyError("Invalid selection %r, valid selections include"
                           "'all', 'value' and 'key' dimensions" % repr(selection))
-        return [(dim.name if label == 'long' else dim.alias)
+        return [(dim.name if label == 'long' else dim.key)


Now that we have label='key' I am wondering if the alternative isn't label='long' but label='name'?

jlstevens · 2017-01-23T00:47:02Z

Note that if we make key a parameter, we'll have the following ways of declaring a dimension:

1 - Dimension('Spectrum'): Both name and key are 'Spectrum'. Most of HoloViews currently use this type of declaration:

If key a parameter, we now get option 4 for specifying a name of 'Frequence Spectrum' and a key of 'Spectrum':

2 - Dimension(('Spectrum', 'Frequency spectrum')) : Tuple format.
3 - Dimension( hv.util.Aliases(Spectrum='Frequency spectrum').Spectrum) : Using the helper.
4 - Dimension('Frequency Spectrum', key='Spectrum') : If key is a parameter.

I don't think having multiple approaches is necessarily an issue: the key (if specified) is always the short version of the name. Compared to option 1, option 4 simply makes it easier to use a longer name by explicitly specifying a short key.

Option 2. is probably the most problematic as you need to remember the order of the tuple is (<short>, <long>) which is why I would probably always recommend option 3 over option 2. Having option 3. is still useful as it lets you define your aliases once instead of having to specify key everywhere (see the example in #312).

This leads me to the following recommendation:

1 - Dimension('Spectrum'): Use this if you are happy for the short name to match the long name.
2 - Dimension( al.Spectrum) : Use the Aliases object to allow easy repeated declaration of key.
3 - Dimension('Frequency Spectrum', key='Spectrum') : Explicit declaration of the shorter key.

Of course if we don't want to support option 3. in this revised list, we can leave key as an attribute. We might want to do this to encourage the use of the Aliases helper - if you don't keep the associations between names and keys consistent, things are likely to get messy.

jlstevens · 2017-01-23T01:12:32Z

One annoyance is that the order between options 2 and 4 is switched:

2 - Dimension(('Spectrum', 'Frequency spectrum')) : Tuple format.
4 - Dimension('Frequency Spectrum', key='Spectrum') : If key is a parameter.

I now wish I had reversed the tuple so that these would be equivalent:

2 - Dimension(('Frequency spectrum', 'Spectrum')) : Tuple format.
4 - Dimension('Frequency Spectrum', key='Spectrum') : If key is a parameter.

I can still switch the tuple ordering around but it would break backwards compatibility. At least if I made a matching reversal in hv.util.Aliases, then option 3. above would keep working.

I am tempted by this change as it would look more consistent and I never really recommended the tuple format precisely because you have to always keep those declarations consistent (and remember the right order).

One other approach would be to have Aliases not return tuples but something similar but distinguishable (named tuples?). Then only named tuples would be allowed - support for regular tuples could be restored through a backwards compatibility switch. This would help encourage use of the Aliases helper function while also making things more explicit for anyone wanting to use their own named tuples:

As before using Aliases (unchanged, recommended):

Dimension( hv.util.Aliases(Spectrum='Frequency spectrum').Spectrum)

The pointless way (shown to define the named tuple):

from collections import namedtuple
alias = namedtuple('alias', ['name','key']) # Might as well be in HoloViews!
Dimension(alias('Frequency spectrum', 'Spectrum')) # Explicit and consistently ordered.

Alternatively to do a single alias without using Aliases (which is still recommended!):

Dimension(hv.util.alias('Frequency spectrum', 'Spectrum'))
Dimension(hv.util.alias(name='Frequency spectrum', key='Spectrum')) # Equivalent

I hope you can follow what I'm proposing - I think I like it!

jlstevens · 2017-01-23T01:19:59Z

As one last note, you can disable implicit named tuples (mapping based on order) by defining hv.util.alias as follows:

def alias(**kwargs):
   if set(kwargs.keys()) != set(['name', 'key']):
      raise KeyError('Alias requires a name and a key')
   nt = namedtuple('alias', ['name','key'])
   return nt(name=kwargs['name'], key=kwargs['key'])

This disables the top version of option 3. above so you would have to be explicit and use:

Dimension(hv.util.alias(name='Frequency spectrum', key='Spectrum'))

philippjfr · 2017-01-23T01:39:47Z

I think it's perfectly natural to declare the name of the data in your datasource first and the label second so I'm hesitant about making the proposed change. The real issue as you point out above that ideally we'd have declared a name and a description or label. Aliases were designed to give long names short aliases for easier access, however imo the more common use case is for the user to want to give their shortened variable name a longer more readable label.

Overall I feel explaining that you supply an alias for the name of your data in the datasource as a tuple is much more intuitive than having to grasp aliases and the difference between key and name, particularly because the actual distinction is inconsequential because the API is agnostic about which one you use.

In a simple example I feel this:

df = pd.DataFrame({'x': range(10), 'y': np.random.rand(10)})
hv.Curve(df, kdims=[('x', 'X-Label')], vdims=[('y', 'Y-label')])

is much clearer than this:

df = pd.DataFrame({'x': range(10), 'y': np.random.rand(10)})
hv.Curve(df, kdims=[hv.util.alias(name='X-label', key='x')], vdims=[hv.util.alias(name='Y-label', key='y')])

If you really do think this is unclear I will investigate whether we can't have .name and .label/description instead. It may not be as bad as I thought because the API generally doesn't care and the axis labels are easily fixed.

jlstevens · 2017-01-23T02:47:38Z

The recommended thing would be:

al = hv.util.Aliases(x='X-label', y='Y-label')
df = pd.DataFrame({'x': range(10), 'y': np.random.rand(10)})
hv.Curve(df, kdims=[al.x], vdims=[al.y])

I only showed the underlying named-tuple approach to illustrate what would be happening behind the scenes.

philippjfr · 2017-01-23T15:23:55Z

Opened a separate branch to check whether we can't just switch them. Overall it doesn't seem too bad: https://travis-ci.org/ioam/holoviews/builds/194476686 so perhaps it's doable with a bit more work.

jbednar · 2017-01-23T16:09:03Z

imo the more common use case is for the user to want to give their shortened variable name a longer more readable label.

I agree -- people usually want long/complex/typeset names for the viewable labels in figures, and rarely want long/complex/encoded names as Python attributes or Pandas column names. So the curent aliases support seems backwards, and if we're ever to have a chance to fix that, it seems like now is the time.

jlstevens · 2017-01-23T16:27:43Z

I'm also in agreement. My main concerns are:

Breaking the existing alias system - probably not too bad as it certainly hasn't been used in the official documentation yet.
How bad it is to switch around - hopefully Philipp's initial reaction is correct and it won't be too bad!
Usual naming issues - as we discussed label is used in a fair few places already and description is a bit long...

Just to make sure we are on the same page - the idea would be that name is always the short version which means it is up to the plotting code to use the longer, less programmer friendly but more descriptive string. Is that right?

philippjfr · 2017-01-23T16:36:36Z

Breaking the existing alias system - probably not too bad as it certainly hasn't been used in the official documentation yet.

Hopefully we can maintain compatibility here, but there's a definitely a few issues to figure out here. Currently the dimension_sanitizer is used both to support actual aliases but also for sanitizing dimension names containing spaces.

Usual naming issues - as we discussed label is used in a fair few places already and description is a bit long...

Agreed, not hugely attached to either, but I don't think description is quite right because it's mostly used as a readable label while description sounds like something longer. @jbednar Any other suggestions?

jbednar · 2017-01-23T16:41:00Z

Seems like a label to me.

jlstevens · 2017-01-23T16:53:01Z

Seems like a label to me.

I like label as long as it doesn't get confused with group.label (i.e the label on dimensioned objects) the label option when getting dimensions.

jlstevens · 2017-01-26T12:45:13Z

I've discussed this again with Philipp and I think this is might come together quite nicely! In particular, I think it would be good to use util.Aliases to define all the other dimension parameters. The result would be a system that can be used instead of dimension presets (another mostly undocumented mechanism).

You could use the following to define the dimension name/label:

al = hv.util.Aliases(x='X-label', y='Y-label')

Or dictionaries for the kwargs e.g:

al = hv.util.Aliases(x=dict(label='X-label', range=(0,1))

Now when you refer to al.x, you get a Dimension object you can use. This can work alongside the tuple format and can be made backwards compatible with how util.Aliases worked before if we allow dimensions to be passed into the Dimension constructor (resulting in an equivalent definition).

All this would need to be documented in a new aliases tutorial (that should probably discuss redim as well). I think the old mechanisms should still be compatible but maybe we should add deprecation warnings so there aren't too many ways of doing the same thing.

One of the goals of the new tutorial would be to lay out the recommended approach for different cases e.g library authors, users trying to define things once quickly and users who expect to be reusing dimension definitions over and over again.

philippjfr · 2017-01-27T01:16:18Z

Closing in favor of #1083, turns out it was fairly easy to switch the name and the alias.

philippjfr added 2 commits January 20, 2017 03:40

Add Dimension.alias and Dimensioned.dimensions(label='alias')

03cc9fc

Access data using Dimension.alias in Dataset interfaces

5dd21d2

philippjfr added the status: WIP label Jan 20, 2017

Fixes for handling dimensions on Dataset

d2c5861

philippjfr force-pushed the dataset_aliases branch from f65f3b5 to 3970188 Compare January 20, 2017 12:38

Further fixes ensuring interfaces use aliases

acbbfc9

philippjfr force-pushed the dataset_aliases branch from 3970188 to acbbfc9 Compare January 20, 2017 12:41

philippjfr added 9 commits January 20, 2017 13:21

Implement Dimension.__setstate__ for pickle compatibility

1a36687

Improved Dimension alias lookup

cdc3a8b

Further fixes for aliases on Dataset

f4f0cb5

Fixed redim with aliases

22df99c

Add support for alias in Dataset.add_dimension

25be8a0

Added various alias unit tests

0fc82bf

Fixed aliases on Dataset aggregate

f969c93

Unit tests for Dataset aliases

bf1d8a5

Add Dataset aliasing unit tests

0c3aa69

philippjfr mentioned this pull request Jan 22, 2017

Implement Image interface #994

Merged

Renamed Dimension.alias to key

cb559bb

jlstevens reviewed Jan 23, 2017

View reviewed changes

philippjfr mentioned this pull request Jan 26, 2017

Add Dimension.label replacing global alias state #1083

Merged

philippjfr closed this Jan 27, 2017

philippjfr deleted the dataset_aliases branch February 10, 2017 01:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset aliases #1075

Dataset aliases #1075

philippjfr commented Jan 20, 2017

jlstevens commented Jan 20, 2017 •

edited

Loading

philippjfr commented Jan 22, 2017

jlstevens commented Jan 23, 2017 •

edited

Loading

jlstevens Jan 23, 2017

jlstevens commented Jan 23, 2017 •

edited

Loading

jlstevens commented Jan 23, 2017 •

edited

Loading

jlstevens commented Jan 23, 2017

philippjfr commented Jan 23, 2017

jlstevens commented Jan 23, 2017

philippjfr commented Jan 23, 2017 •

edited

Loading

jbednar commented Jan 23, 2017

jlstevens commented Jan 23, 2017 •

edited

Loading

philippjfr commented Jan 23, 2017

jbednar commented Jan 23, 2017

jlstevens commented Jan 23, 2017

jlstevens commented Jan 26, 2017

philippjfr commented Jan 27, 2017

Dataset aliases #1075

Dataset aliases #1075

Conversation

philippjfr commented Jan 20, 2017

jlstevens commented Jan 20, 2017 • edited Loading

philippjfr commented Jan 22, 2017

jlstevens commented Jan 23, 2017 • edited Loading

jlstevens Jan 23, 2017

Choose a reason for hiding this comment

jlstevens commented Jan 23, 2017 • edited Loading

jlstevens commented Jan 23, 2017 • edited Loading

jlstevens commented Jan 23, 2017

philippjfr commented Jan 23, 2017

jlstevens commented Jan 23, 2017

philippjfr commented Jan 23, 2017 • edited Loading

jbednar commented Jan 23, 2017

jlstevens commented Jan 23, 2017 • edited Loading

philippjfr commented Jan 23, 2017

jbednar commented Jan 23, 2017

jlstevens commented Jan 23, 2017

jlstevens commented Jan 26, 2017

philippjfr commented Jan 27, 2017

jlstevens commented Jan 20, 2017 •

edited

Loading

jlstevens commented Jan 23, 2017 •

edited

Loading

jlstevens commented Jan 23, 2017 •

edited

Loading

jlstevens commented Jan 23, 2017 •

edited

Loading

philippjfr commented Jan 23, 2017 •

edited

Loading

jlstevens commented Jan 23, 2017 •

edited

Loading