Add .iloc and .ndloc integer indexing methods for Datasets #1435

philippjfr · 2017-05-14T20:09:16Z

Adds a tabular integer indexing interface for all data interfaces, allowing slicing and indexing by row and column indices. In the case of gridded ndarray data this operates on the flattened arrays. The main issue is that dask does not support integer indexing, so iloc actually has to evaluate the graph and load a whole column at a time into memory. We should probably warn about this.

Valid signatures:

# Get first row
ds.iloc[0]

# Get first 10 rows
ds.iloc[:10]

# Get the rows at indices 1, 3 and 5
ds.iloc[[1, 3, 5]]

# Get every 3rd row
ds.iloc[::3]

# Get first column
ds.iloc[:, 0]

# Get the columns at indices 0 and 3
ds.iloc[:, [0, 3]]

# Get the columns at indices 1 to 3
ds.iloc[:, 1:4]

And any combination of these.

Support for row, column indexing on all interfaces
Support ndarray like indexing on gridded interfaces
Unit tests
Better docstrings
Documentation

This also allows for the following optimizations:

Improve the decimate operation implementation
Improve Tabular indexing and implement truncating of bokeh table output
Vectorized Image.sample using sheet2matrixidx (fixing #1450, "Use sheet2matrixidx for Image snapping")
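
As a rough illustration of why integer indexing helps the decimate operation, the sketch below picks random row positions first and then indexes with them, which is the role iloc would play. The `decimate` function, its `seed` parameter and the choice to keep the sampled rows in their original order are assumptions made for this example, not the actual HoloViews operation.

```python
import random


def decimate(rows, max_samples, seed=42):
    """Return at most `max_samples` rows, sampled without replacement."""
    if len(rows) <= max_samples:
        return list(rows)
    rng = random.Random(seed)
    # Choose integer positions, then index by position.
    positions = sorted(rng.sample(range(len(rows)), max_samples))
    return [rows[i] for i in positions]


points = [(i, i * i) for i in range(1000)]
sample = decimate(points, max_samples=50)
small = decimate(points[:10], max_samples=50)  # already small enough, returned as-is
```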

jlstevens · 2017-05-14T21:39:41Z

Still looking through this PR but my initial impression is that it looks good.

Here are my first two questions:

Why not call TabularIndex something like ILoc? That seems all it is used for right now.
How do the signatures that you listed compare to pandas iloc indexing?

philippjfr · 2017-05-14T21:41:26Z

> Why not call TabularIndex something like ILoc? That seems all it is used for right now.

Could do I suppose, not sure that's a better name though.

> How do the signatures that you listed compare to pandas iloc indexing?

They match, although I realized I still need to add boolean array indexing support (or at least add tests for it).
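
For readers unfamiliar with boolean array indexing, here is a minimal sketch of the idea in plain Python: a mask with one entry per row keeps the rows where the entry is true. The `select_rows` helper and the length check are assumptions for illustration and say nothing about how this PR handles masks.

```python
def select_rows(rows, mask):
    """Keep the rows whose corresponding mask entry is truthy."""
    if len(mask) != len(rows):
        raise IndexError("boolean mask length must match the number of rows")
    return [row for row, keep in zip(rows, mask) if keep]


rows = [(0, 0.5), (1, 1.5), (2, 2.5), (3, 3.5)]
mask = [value > 1.0 for _, value in rows]
selected = select_rows(rows, mask)
```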

jlstevens · 2017-05-14T21:59:49Z

I think naming it the same as the method (but capitalized) makes it instantly clear what the class is for. I agree that TabularIndex would be a better name if you are planning a use for it outside of the iloc method.

We already have a class redim(object) implementing the redim method and in my PR there is class periodic(object) implementing the periodic method. To be consistent, this should be class iloc(object) unless you see another use for it.
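
A small sketch of the naming convention being proposed: a lowercase class named after the method it implements, exposed on the dataset under the same name. `ToyDataset` is hypothetical and only row selection is modelled; how the real Dataset attaches its accessor is not shown in this thread.

```python
class iloc(object):
    """Accessor class named after the method it implements."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __getitem__(self, rows):
        data = self.dataset.rows
        # An integer selects one row; a slice selects a run of rows.
        selected = [data[rows]] if isinstance(rows, int) else data[rows]
        return ToyDataset(selected)


class ToyDataset(object):
    """Hypothetical dataset holding a list of row tuples."""

    def __init__(self, rows):
        self.rows = list(rows)

    @property
    def iloc(self):
        # Inside the method body, `iloc` resolves to the module-level class,
        # not to this property, because class scope is skipped in lookups.
        return iloc(self)


ds = ToyDataset([(0, "a"), (1, "b"), (2, "c")])
tail = ds.iloc[1:]
head = ds.iloc[0]
```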

philippjfr · 2017-06-18T11:28:24Z

I've now added a section to our Indexing and Selecting Data User Guide, which will be committed to the big docs PR.

philippjfr · 2017-06-18T13:30:21Z

I've now also added an .ndloc interface for gridded datasets. It works by applying integer indexing to the canonical value dimension orientations. Since our Image ndarray interface is flipped along the y-axis, the indexing behavior there is probably unintuitive, but it has to be consistent.

Here's an example of creating an Image using the Image ndarray, xarray and gridded dictionary interfaces and applying ndloc:

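import numpy as np
import holoviews as hv

arr = np.random.randn(5, 10)
ds = hv.Dataset({'x': range(10), 'y': range(5), 'z': arr}, kdims=['x', 'y'], vdims=['z'],
                datatype=['grid'])
dict_img = hv.Image(ds, label='Gridded Dictionary')
xr_img = dict_img.clone(datatype=['xarray'], label='XArray')
# The Image ndarray interface is flipped along the y-axis, so flip the array to match the grid
arr_img = hv.Image(arr[::-1], bounds=dict_img.bounds, label='NdArray+Bounds')

# Each row of the layout shows an Image next to the same ndloc selection from it
(dict_img + dict_img.ndloc[1:3, 0:5] +
 xr_img + xr_img.ndloc[1:3, 0:5] +
 arr_img + arr_img.ndloc[1:3, 0:5]).cols(2)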
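
The consistency point can be sketched without HoloViews. The example below assumes the canonical orientation puts the lowest y value in row 0, while an image array stores the highest y value in row 0; under that assumption, indexing the image by canonical position means flipping it first. The function names are hypothetical and the real ndloc may differ in detail.

```python
def ndloc_canonical(grid, rows, cols):
    """Index a grid whose row 0 corresponds to the lowest y value."""
    return [row[cols] for row in grid[rows]]


def ndloc_image(image_array, rows, cols):
    """Index an image array stored top row first, using canonical positions."""
    canonical = image_array[::-1]  # undo the y-axis flip before indexing
    return ndloc_canonical(canonical, rows, cols)


grid = [[10 * y + x for x in range(10)] for y in range(5)]
image_array = grid[::-1]  # same data, stored with the highest y first

from_grid = ndloc_canonical(grid, slice(1, 3), slice(0, 5))
from_image = ndloc_image(image_array, slice(1, 3), slice(0, 5))
```

Both calls select the same values, even though the second and third canonical rows sit near the bottom of the stored image array.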

philippjfr · 2017-06-18T16:31:19Z

With my latest commit, Image.sample now uses sheet2matrixidx and ndloc to efficiently sample the underlying array in continuous coordinates. This addresses issues with floating point precision on sampling and is considerably more efficient (addressing #1450). All existing sampling unit tests pass. I'll have to add some more direct unit tests and docstrings for ndloc though.
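
To illustrate the general idea of sampling by converting continuous coordinates to array indices, here is a simplified sketch. `coords_to_index` is a hypothetical stand-in, not the real sheet2matrixidx; it assumes bounds given as (left, bottom, right, top) and an array whose row 0 is the top of the image.

```python
import math


def coords_to_index(x, y, bounds, shape):
    """Map a continuous (x, y) coordinate to a (row, col) array index."""
    left, bottom, right, top = bounds
    nrows, ncols = shape
    xdensity = ncols / (right - left)  # cells per unit along x
    ydensity = nrows / (top - bottom)  # cells per unit along y
    col = int(math.floor((x - left) * xdensity))
    # Rows count downwards from the top edge of the bounds.
    row = int(math.floor((top - y) * ydensity))
    return row, col


def sample(array, coords, bounds):
    """Look up the array value under each (x, y) coordinate."""
    shape = (len(array), len(array[0]))
    indices = [coords_to_index(x, y, bounds, shape) for x, y in coords]
    return [array[r][c] for r, c in indices]


array = [[1, 2],
         [3, 4]]
bounds = (-0.5, -0.5, 0.5, 0.5)
values = sample(array, [(-0.25, 0.25), (0.25, -0.25)], bounds)
```

Each lookup is a direct index rather than a search for the closest coordinate, which is where the efficiency gain comes from. Coordinates exactly on the right or bottom edge would fall outside the array in this sketch and would need clamping.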

philippjfr · 2017-06-19T00:00:20Z

@jlstevens Ready for review.

philippjfr · 2017-06-19T00:22:25Z

doc/Tutorials/Introduction.ipynb

@@ -550,7 +550,7 @@
   "source": [
    "print(rgb_parrot)\n",
    "print(rgb_parrot[0,0])\n",
-    "print(rgb_parrot[0,0][0])"
+    "print(rgb_parrot[0,0].iloc[0, 0])"


This was left over from when we returned tuples when indexing RGBs, so I ended up updating it. I suppose rgb_parrot[0, 0, 'R'] would have been clearer, but we're probably throwing this notebook out, right?

jlstevens · 2017-06-19T00:44:54Z

holoviews/core/data/__init__.py

+        Allow selection by integer index, slice and list of integer
+        indices and boolean arrays, e.g.:
+
+        Examples:


Bit redundant after 'e.g.:' (which is sufficient)

jlstevens · 2017-06-19T00:47:38Z

holoviews/core/data/dask.py

+        """
+        Dask does not support iloc, therefore iloc will execute
+        the call graph and lose the laziness of the operation.
+        """


I wonder if there could be optional performance warnings we could issue when laziness is lost. Not for this PR though.
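
One way such an optional warning could look, sketched with the standard `warnings` module. `PerformanceWarning`, `WARN_ON_COMPUTE` and the modelling of a lazy column as a zero-argument callable are all assumptions made for this example; nothing like this exists in the PR.

```python
import warnings


class PerformanceWarning(UserWarning):
    """Hypothetical category for operations that lose laziness."""


WARN_ON_COMPUTE = True  # hypothetical opt-in switch


def materialize(lazy_column):
    """Evaluate a lazy column, modelled here as a zero-argument callable."""
    if WARN_ON_COMPUTE:
        warnings.warn(
            "integer indexing forces the lazy column to be computed",
            PerformanceWarning,
            stacklevel=2,
        )
    return lazy_column()


def iloc_lazy(lazy_column, index):
    """Integer-index a lazy column by loading it into memory first."""
    return materialize(lazy_column)[index]


# Record the warning instead of printing it, to show that it fires once.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = iloc_lazy(lambda: [10, 20, 30], 1)
```

Using a dedicated warning category would let users silence or escalate these messages with the usual warning filters.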

jlstevens · 2017-06-19T00:50:45Z

Other than a very minor comment about the docstrings, this PR looks good: the API is nice and clean and thanks for adding all those unit tests. I don't think people will need integer indexing often but there are certainly times when you need it and having this API will really help.

Happy to merge when the docstring is updated. The tests are passing except for one transient failure in the Python 2 PR build.

jlstevens · 2017-06-19T01:18:45Z

Thanks! Travis/conda is having issues at the moment but the tests were passing before the docstring changes so I'll just merge.

philippjfr added the "tag: component: data" and "type: feature" (A major new feature) labels May 14, 2017

philippjfr changed the title ~~Add iloc integer indexing interface for Datasets~~ Add .iloc integer indexing interface for Datasets May 14, 2017

philippjfr changed the title ~~Add .iloc integer indexing interface for Datasets~~ Add .iloc integer indexing method for Datasets May 14, 2017

philippjfr force-pushed the iloc_indexing branch 3 times, most recently from 76db7c4 to fd350c3 Compare May 14, 2017 21:25

philippjfr force-pushed the iloc_indexing branch from fd350c3 to 9efc3d9 Compare June 18, 2017 11:26

philippjfr changed the title ~~Add .iloc integer indexing method for Datasets~~ Add .iloc and .ndloc integer indexing method for Datasets Jun 18, 2017

philippjfr changed the title ~~Add .iloc and .ndloc integer indexing method for Datasets~~ Add .iloc and .ndloc integer indexing methods for Datasets Jun 18, 2017

philippjfr force-pushed the iloc_indexing branch from b91554c to af4353d Compare June 18, 2017 13:34

philippjfr added 13 commits June 18, 2017 23:57

Small fix for auto-indexing

3b9b4ea

Added iloc tabular indexing interface

c2bc41b

Small docstring improvements

48d7a06

Updated Point selection example to use .iloc

ddff785

Renamed TabularIndex object to iloc

0fcf161

Added ndloc indexing interface

c93995c

Implemented Image indexing using ndloc

0d2a9c1

Implemented Image.sample on top of ndloc interface

969c06c

Fixed bug in Dataset unit test setup

2b72785

Fixed closest bug in Image.sample

9638fb8

Fixed Image.sample y-coord index

0da749b

Minor fixes for sampling

4af6c8c

Added Image sampling test

3f5ba05

philippjfr added 5 commits June 18, 2017 23:57

Small fix for ndloc

788cea8

Vectorized Image.sample

bdf2ad3

Added ndloc unit tests

6571609

Simplified decimate operation using iloc

8607c20

Use iloc in Tabular.pprint_cell

922accd

philippjfr force-pushed the iloc_indexing branch from 7729e63 to 922accd Compare June 18, 2017 22:58

Improved iloc and ndloc docstrings

5a1d930

philippjfr requested a review from jlstevens June 19, 2017 00:00

philippjfr commented Jun 19, 2017


jlstevens reviewed Jun 19, 2017


Small docstring fixes for iloc and ndloc

3ecff25

jlstevens merged commit 3804863 into master Jun 19, 2017

philippjfr mentioned this pull request Jun 23, 2017

Columns row/column based indexing API #541

Closed

philippjfr deleted the iloc_indexing branch June 25, 2017 15:07
