Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add .iloc and .ndloc integer indexing methods for Datasets #1435

Merged
merged 20 commits into from
Jun 19, 2017

Conversation

philippjfr
Copy link
Member

@philippjfr philippjfr commented May 14, 2017

Adds a tabular integer indexing interface for all data interfaces, allowing slicing and indexing by row and column indices. In the case of gridded ndarray data this operates on the flattened arrays. The main issue is that dask does not support integer indexing so iloc actually has to evaluate the graph and load a whole column at a time into memory, we should probably warn about this.

Valid signatures:

# Get first row
ds.iloc[0]

# Get first 10 rows
ds.iloc[:10]

# Get 1st, 3rd and 5th row
ds.iloc[[1, 3, 5]]

# Get every 3rd row
ds.iloc[::3]

# Get first column
ds.iloc[:, 0]

# Get 1st and 3rd column
ds.iloc[:, [0, 3]]

# Get columns 1 to 3
ds.iloc[:, 1:4]

And any combination of these.

  • Support for row, column indexing on all interfaces
  • Support ndarray like indexing on gridded interfaces
  • Unit tests
  • Better docstrings
  • Documentation

This also allows for the following optimizations:

@philippjfr philippjfr changed the title Add iloc integer indexing interface for Datasets Add .iloc integer indexing interface for Datasets May 14, 2017
@philippjfr philippjfr changed the title Add .iloc integer indexing interface for Datasets Add .iloc integer indexing method for Datasets May 14, 2017
@philippjfr philippjfr force-pushed the iloc_indexing branch 3 times, most recently from 76db7c4 to fd350c3 Compare May 14, 2017 21:25
@jlstevens
Copy link
Contributor

Still looking through this PR but my initial impression is that it looks good.

Here are my first two questions:

  1. Why not call TabularIndex something like ILoc? That seems all it is used for right now.
  2. How do the signatures that you listed compare to pandas iloc indexing?

@philippjfr
Copy link
Member Author

philippjfr commented May 14, 2017

Why not call TabularIndex something like ILoc? That seems all it is used for right now.

Could do I suppose, not sure that's a better name though.

How do the signatures that you listed compare to pandas iloc indexing?

They match although I realized I still need to add boolean array indexing support (or at least add tests for it).

@jlstevens
Copy link
Contributor

I think naming it the same as the method (but capitalized) makes it instantly clear what the class is for. I agree that TabularIndex would be a better name if you are planning a use for it outside of the iloc method.

We already have a class redim(object) implementing the redim method and in my PR there is class periodic(object) implementing the periodic method. To be consistent, this should be class iloc(object) unless you see another use for it.

@philippjfr
Copy link
Member Author

I've now added a section to our Indexing and Selecting Data User Guide, which will be committed to the big docs PR.

@philippjfr philippjfr changed the title Add .iloc integer indexing method for Datasets Add .iloc and .ndloc integer indexing method for Datasets Jun 18, 2017
@philippjfr philippjfr changed the title Add .iloc and .ndloc integer indexing method for Datasets Add .iloc and .ndloc integer indexing methods for Datasets Jun 18, 2017
@philippjfr
Copy link
Member Author

philippjfr commented Jun 18, 2017

I've now also added an .ndloc interface for gridded datasets. It works by applying integer indexing to the canonical value dimension orientations. Since our Image ndarray interface is flipped along the y-axis the indexing behavior there is probably unintuitive but it has to be consistent.

Here's an example of creating an Image using the Image ndarray, xarray and gridded dictionary interfaces and applying ndloc:

arr = np.random.randn(5, 10)
ds = hv.Dataset({'x': range(10), 'y': range(5), 'z': arr}, kdims=['x', 'y'], vdims=['z'],
               datatype=['grid'])
dict_img = hv.Image(ds, label='Gridded Dictionary')
xr_img = img.clone(datatype=['xarray'], label='XArray')
arr_img = hv.Image(arr[::-1], bounds=img.bounds, label='NdArray+Bounds')

(dict_img + dict_img.ndloc[1:3, 0:5] +
 xr_img + xr_img.ndloc[1:3, 0:5] +
 arr_img + arr_img.ndloc[1:3, 0:5]).cols(2)

image

@philippjfr
Copy link
Member Author

philippjfr commented Jun 18, 2017

With my latest commit Image.sample now uses sheet2matrixidx and ndloc to efficiently sample the underlying array in continuous coordinates, this addresses issues with floating point precision on sampling and is considerably more efficient (addressing #1450), all existing sampling unit tests pass. I'll have to add some more direct unit tests and docstrings for ndloc though.

@philippjfr
Copy link
Member Author

@jlstevens Ready for review.

@@ -550,7 +550,7 @@
"source": [
"print(rgb_parrot)\n",
"print(rgb_parrot[0,0])\n",
"print(rgb_parrot[0,0][0])"
"print(rgb_parrot[0,0].iloc[0, 0])"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was left over from when we returned tuples when indexing RGBs, so I ended up updating it, suppose rgb_parrot[0, 0, 'R'] would have been clearer but we're probably throwing this notebook out right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

Allow selection by integer index, slice and list of integer
indices and boolean arrays, e.g.:

Examples:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit redundant after 'e.g:' (which is sufficient)

"""
Dask does not support iloc, therefore iloc will execute
the call graph and lose the laziness of the operation.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there could be optional performance warnings we could issue when laziness is lost. Not for this PR though.

@jlstevens
Copy link
Contributor

Other than a very minor comment about the docstrings, this PR looks good: the API is nice and clean and thanks for adding all those unit tests. I don't think people will need integer indexing often but there are certainly times when you need it and having this API will really help.

Happy to merge when the docstring is updated. The tests are passing except for one transient in the Python2 pr build.

@jlstevens
Copy link
Contributor

Thanks! Travis/conda is having issues at the moment but the tests were passing before the docstring changes so I'll just merge.

@jlstevens jlstevens merged commit 3804863 into master Jun 19, 2017
@philippjfr philippjfr deleted the iloc_indexing branch June 25, 2017 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants