Add pandas ExtensionArray for storing homogeneous ragged arrays #687

Merged
merged 48 commits into from
Mar 1, 2019
Changes from 2 commits
fc148de
Fix for pandas 0.24.0rc1
jonmmease Jan 12, 2019
864a235
Initial RaggedArray implementation
jonmmease Jan 13, 2019
440e207
Add the extension test suite provided by pandas and fix tests.
jonmmease Jan 13, 2019
2f18587
Import register_extension_dtype from pandas public location
jonmmease Jan 13, 2019
5f46b8e
Fix copy/paste error
jonmmease Jan 14, 2019
a6b3c27
KeyError -> IndexError
jonmmease Jan 14, 2019
fbc5065
Document, validate, and test fast-path RaggedArray construction
jonmmease Jan 14, 2019
527e9d6
Support indexing RaggedArray with a list
jonmmease Jan 14, 2019
8d1c34b
Create single RaggedDtype() instance per RaggedArray
jonmmease Jan 14, 2019
dad6cc2
Allow astype() to cast RaggedArray to other extension array types
jonmmease Jan 14, 2019
fff0c3e
Allow RaggedArray constructor to accept a RaggedArray to copy
jonmmease Jan 14, 2019
478b655
Remove mask property and consider missing to be equivalent to empty
jonmmease Jan 14, 2019
9d84b3c
More test fixes for `[]` being null
jonmmease Jan 15, 2019
d71f866
Update datashader/datatypes.py
jbednar Jan 15, 2019
4cd7b4c
Add RaggedElement wrapper class for internal pandas operations
jonmmease Jan 16, 2019
16aff67
Override fillna is RaggedArray and enable test
jonmmease Jan 17, 2019
5772ade
Add vectorized equality operators
jonmmease Jan 17, 2019
939405b
pass start_indices and flat_array arrays as args to _validate_ragged_…
jonmmease Jan 17, 2019
7f355d2
Add copy arg to RaggedArray constructor
jonmmease Jan 17, 2019
9e44946
+=
jonmmease Jan 17, 2019
a52728a
Fix missing return
jonmmease Jan 17, 2019
75f914d
Parameterize RaggedDtype by element type
jonmmease Jan 17, 2019
32f4a3c
Remove tuple conversions in RaggedElement
jonmmease Jan 17, 2019
27403a7
Designate _RaggedElement as an internal class
jonmmease Jan 17, 2019
e93c24d
numba jit utility functions
jonmmease Jan 18, 2019
3fda786
Don't auto-import RaggedArray unless pandas is at least version 0.24.0
jonmmease Jan 18, 2019
04453ce
wrap _compute_*_bounds static methods with compute_*_bounds methods
jonmmease Jan 20, 2019
642a858
Small refactor to remove the need for a specialized _PolygonLike glyp…
jonmmease Jan 20, 2019
97bccf5
Refactor to extract required_columns glyph method
jonmmease Jan 20, 2019
2860511
Initial cvs.lines and LinesXY glyph
jonmmease Jan 20, 2019
d7cf092
WIP of LinesRagged type
jonmmease Jan 20, 2019
e781a0f
Merge branch 'master' into enh_ragged
jonmmease Feb 7, 2019
ea08fd1
Remove unused canvas.lines method
jonmmease Feb 7, 2019
1b02b0d
Add RaggedArray line aggregation support for pandas
jonmmease Feb 8, 2019
2314311
Dask ragged array support
jonmmease Feb 8, 2019
2078aad
Merge branch 'master' into enh_ragged
jonmmease Feb 8, 2019
f4a40eb
flake8
jonmmease Feb 8, 2019
59b0b3a
Add validation for LinesAxis1Ragged
jonmmease Feb 8, 2019
c48429e
Exception handling on import for pandas < 0.24
jonmmease Feb 8, 2019
cdecd85
Add pandas >=0.24.1 as testing dependency so that we can test RaggedA…
jonmmease Feb 8, 2019
7c8b953
absolute import
jonmmease Feb 8, 2019
c846f0c
specify that int lists should cast to int64 numpy arrays
jonmmease Feb 9, 2019
4145fb9
Merge branch 'master' into enh_ragged
jonmmease Feb 23, 2019
cad7d0a
Remove parameterized args from skipped tests
jonmmease Feb 24, 2019
89d1d51
Add Dask optimized bounds calculations for ragged list glyph
jonmmease Feb 24, 2019
92eaab2
Apply suggestions from code review
jbednar Feb 28, 2019
1538909
Refer to parent docstrings rather than duplicate
jonmmease Feb 28, 2019
c42f0df
Remove docstring references
jonmmease Mar 1, 2019
3 changes: 3 additions & 0 deletions datashader/__init__.py
@@ -15,6 +15,9 @@
except ImportError:
pass

# Make ragged pandas extension array available
from . import datatypes

# make pyct's example/data commands available if possible
from functools import partial
try:
361 changes: 361 additions & 0 deletions datashader/datatypes.py
@@ -0,0 +1,361 @@
import numpy as np
from pandas.api.extensions import ExtensionDtype, ExtensionArray
from pandas.api.extensions import register_extension_dtype
from numbers import Integral


@register_extension_dtype
class RaggedDtype(ExtensionDtype):
name = 'ragged'
type = np.ndarray
base = np.dtype('O')

@classmethod
def construct_array_type(cls):
return RaggedArray

@classmethod
def construct_from_string(cls, string):
if string == cls.name:
return cls()
else:
raise TypeError("Cannot construct a '{}' from '{}'"
.format(cls, string))


class RaggedArray(ExtensionArray):
def __init__(self, data, dtype=None):
"""
Construct a RaggedArray

Parameters
----------
data
List or numpy array of lists or numpy arrays
dtype: np.dtype or str or None (default None)
Datatype to use to store underlying values from data.
If None (the default) then dtype will be determined using the
numpy.result_type function
"""
if (isinstance(data, dict) and
all(k in data for k in
['mask', 'start_indices', 'flat_array'])):

self._mask = data['mask']
self._start_indices = data['start_indices']
self._flat_array = data['flat_array']
else:
# Compute lengths
index_len = len(data)
buffer_len = sum(len(datum)
if datum is not None
else 0 for datum in data)

# Compute necessary precision of start_indices array
for nbits in [8, 16, 32, 64]:
start_indices_dtype = 'uint' + str(nbits)
max_supported = np.iinfo(start_indices_dtype).max
if buffer_len <= max_supported:
break

# infer dtype if not provided
if dtype is None:
dtype = np.result_type(*[np.atleast_1d(v)
for v in data
if v is not None])

# Initialize representation arrays
self._mask = np.zeros(index_len, dtype='bool')
self._start_indices = np.zeros(index_len, dtype=start_indices_dtype)
self._flat_array = np.zeros(buffer_len, dtype=dtype)

# Populate arrays
next_start_ind = 0
for i, array_el in enumerate(data):
# Check for null values
isnull = array_el is None

# Compute element length
n = len(array_el) if not isnull else 0

# Update mask
self._mask[i] = isnull

# Update start indices
self._start_indices[i] = next_start_ind

# Update flat array
self._flat_array[next_start_ind:next_start_ind+n] = array_el

# increment next start index
next_start_ind += n

# This is a workaround (hack?) to keep pandas._libs.lib.infer_dtype from
# raising a "cannot infer type" ValueError when calling:
# >>> pd.Series([[0, 1], [1, 2, 3]], dtype='ragged')
self._values = self._flat_array
Review comment (Collaborator, Author): Hack to work around "ValueError: cannot infer type for <class 'NoneType'>" in pandas._libs.lib.infer_dtype
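Note on the constructor above: the flat representation it builds can be sketched with plain Python lists standing in for the NumPy arrays. This is only an illustration under that assumption; `flatten_ragged` is a hypothetical helper and is not part of this PR.

```python
def flatten_ragged(rows):
    """Return (mask, start_indices, flat) for a list of lists.

    Hypothetical helper: a None row is recorded as missing and
    contributes no values to the flat buffer.
    """
    mask, start_indices, flat = [], [], []
    for row in rows:
        mask.append(row is None)
        # Each element begins where the flat buffer currently ends.
        start_indices.append(len(flat))
        if row is not None:
            flat.extend(row)
    return mask, start_indices, flat


mask, start_indices, flat = flatten_ragged([[0, 1], None, [1, 2, 3]])
print(mask)           # [False, True, False]
print(start_indices)  # [0, 2, 2]
print(flat)           # [0, 1, 1, 2, 3]
```

A missing row and an empty row produce the same start index as the row that follows them, which is why the mask is needed to tell the two apart in this version of the code.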


@property
def flat_array(self):
"""
numpy array containing concatenation of all nested arrays

Returns
-------
np.ndarray
"""
return self._flat_array

@property
def mask(self):
"""
boolean numpy array the same length as the ragged array where values
of True indicate missing values.

Returns
-------
np.ndarray
"""
return self._mask

@property
def start_indices(self):
"""
integer numpy array the same length as the ragged array where values
represent the index into flat_array where the corresponding ragged
array element begins.

Returns
-------
np.ndarray
"""
return self._start_indices

def __len__(self):
"""
Length of this array

Returns
-------
length : int
"""
return len(self._start_indices)

def __getitem__(self, item):
"""
Parameters
----------
item : int, slice, or ndarray
* int: The position in 'self' to get.

* slice: A slice object, where 'start', 'stop', and 'step' are
integers or None

* ndarray: A 1-d boolean NumPy ndarray the same length as 'self'
"""
if isinstance(item, Integral):
if item < -len(self) or item >= len(self):
raise IndexError(item)
elif self.mask[item]:
return None
else:
# Convert negative item index
if item < 0:
item = len(self) + item

slice_start = self.start_indices[item]
slice_end = (self.start_indices[item+1]
if item + 1 <= len(self) - 1
else len(self.flat_array))

return self.flat_array[slice_start:slice_end]

elif isinstance(item, slice):
data = []
selected_indices = np.arange(len(self))[item]

for selected_index in selected_indices:
data.append(self[selected_index])

return RaggedArray(data, dtype=self.flat_array.dtype)

elif isinstance(item, np.ndarray) and item.dtype == 'bool':
data = []

for i, m in enumerate(item):
if m:
data.append(self[i])

return RaggedArray(data, dtype=self.flat_array.dtype)
else:
raise IndexError(item)
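Note on the integer branch of `__getitem__` above: the lookup reduces to slicing the flat buffer between two consecutive start indices, after converting a negative index by adding the array length. A minimal sketch, assuming plain lists in place of the NumPy arrays and ignoring the mask; `get_element` is a hypothetical name.

```python
def get_element(start_indices, flat, item):
    """Return the ragged element at position `item` (hypothetical helper)."""
    n = len(start_indices)
    if item < -n or item >= n:
        raise IndexError(item)
    if item < 0:
        # Convert a negative index using the array length.
        item += n
    start = start_indices[item]
    # The last element runs to the end of the flat buffer.
    end = start_indices[item + 1] if item + 1 < n else len(flat)
    return flat[start:end]


starts, flat = [0, 2, 2], [0, 1, 1, 2, 3]
print(get_element(starts, flat, 0))   # [0, 1]
print(get_element(starts, flat, 1))   # []
print(get_element(starts, flat, -1))  # [1, 2, 3]
```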

@classmethod
def _from_sequence(cls, scalars, dtype=None, copy=False):
"""
Construct a new RaggedArray from a sequence of scalars.

Parameters
----------
scalars : Sequence
Each element will be an instance of the scalar type for this
array, ``cls.dtype.type``.
dtype : dtype, optional
Construct for this particular dtype. This should be a Dtype
compatible with the ExtensionArray.
copy : boolean, default False
If True, copy the underlying data.

Returns
-------
RaggedArray
"""
return RaggedArray(scalars)

@classmethod
def _from_factorized(cls, values, original):
"""
Reconstruct an ExtensionArray after factorization.

Parameters
----------
values : ndarray
An integer ndarray with the factorized values.
original : RaggedArray
The original ExtensionArray that factorize was called on.

See Also
--------
pandas.factorize
ExtensionArray.factorize
"""
return RaggedArray(values, dtype=original.flat_array.dtype)

def _values_for_factorize(self):
# Here we return a list of the ragged elements converted into tuples.
# This is very inefficient, but the elements of this list must be
# hashable, and we must be able to reconstruct a new Ragged Array
# from these elements.
#
# Perhaps we could replace these tuples with a class that provides a
# read-only view of an ndarray slice and provides a hash function.
return [tuple(self[i]) if not self.mask[i] else None
for i in range(len(self))], None
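Note on `_values_for_factorize` above: the reason the elements must be hashable can be seen in a small dictionary-based factorization. This is a rough sketch and not how pandas implements it; `factorize_ragged` is a hypothetical name, and the use of -1 as the code for missing values follows my understanding of the `pandas.factorize` convention.

```python
def factorize_ragged(elements):
    """Return (codes, uniques) for a list of lists (hypothetical helper).

    Lists are converted to tuples so they can be used as dict keys.
    Missing elements (None) receive the code -1.
    """
    codes, uniques, seen = [], [], {}
    for el in elements:
        if el is None:
            codes.append(-1)
            continue
        key = tuple(el)  # lists are unhashable, tuples are not
        if key not in seen:
            seen[key] = len(uniques)
            uniques.append(key)
        codes.append(seen[key])
    return codes, uniques


codes, uniques = factorize_ragged([[1, 2], [3], [1, 2], None])
print(codes)    # [0, 1, 0, -1]
print(uniques)  # [(1, 2), (3,)]
```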

def isna(self):
"""
A 1-D array indicating if each value is missing.

Returns
-------
na_values : np.ndarray
boolean ndarray the same length as the ragged array where values
of True represent missing/NA values.
"""
return self.mask

def take(self, indices, allow_fill=False, fill_value=None):
"""
Take elements from an array.

Parameters
----------
indices : sequence of integers
Indices to be taken.
allow_fill : bool, default False
How to handle negative values in `indices`.

* False: negative values in `indices` indicate positional indices
from the right (the default). This is similar to
:func:`numpy.take`.

* True: negative values in `indices` indicate
missing values. These values are set to `fill_value`. Any
other negative values raise a ``ValueError``.

fill_value : any, default None
Fill value to use for NA-indices when `allow_fill` is True.

Returns
-------
RaggedArray

Raises
------
IndexError
When the indices are out of bounds for the array.
"""
if allow_fill:
sequence = [self[i] if i >= 0 else fill_value
for i in indices]
else:
sequence = [self[i] for i in indices]

return RaggedArray(sequence, dtype=self.flat_array.dtype)
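Note on `take` above: the docstring says that with `allow_fill=True` negative values other than the missing-value marker raise a `ValueError`, while the implementation treats every negative index as a fill. A sketch of the documented behaviour over a plain list follows. It assumes the marker is -1, which is my understanding of the pandas `ExtensionArray.take` contract; `take_with_fill` is a hypothetical name.

```python
def take_with_fill(values, indices, allow_fill=False, fill_value=None):
    """Take elements from a list following the documented contract (sketch)."""
    n = len(values)
    result = []
    for i in indices:
        if allow_fill and i < 0:
            if i != -1:
                # Assumption: only -1 is a valid missing-value marker.
                raise ValueError("Invalid index {} with allow_fill=True".format(i))
            result.append(fill_value)
        else:
            if i < -n or i >= n:
                raise IndexError(i)
            result.append(values[i])
    return result


values = [[0, 1], [], [1, 2, 3]]
print(take_with_fill(values, [0, -1]))                   # [[0, 1], [1, 2, 3]]
print(take_with_fill(values, [2, -1], allow_fill=True))  # [[1, 2, 3], None]
```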

def copy(self, deep=False):
"""
Return a copy of the array.

Parameters
----------
deep : bool, default False
Also copy the underlying data backing this array.

Returns
-------
RaggedArray
"""
data = dict(
mask=self.mask,
flat_array=self.flat_array,
start_indices=self.start_indices)

if deep:
# Copy underlying numpy arrays
data = {k: v.copy() for k, v in data.items()}

return RaggedArray(data)

@classmethod
def _concat_same_type(cls, to_concat):
"""
Concatenate multiple RaggedArray instances

Parameters
----------
to_concat : list of RaggedArray

Returns
-------
RaggedArray
"""
# concat masks
mask = np.hstack([ra.mask for ra in to_concat])

# concat flat_arrays
flat_array = np.hstack([ra.flat_array for ra in to_concat])

# offset and concat start_indices
offsets = np.hstack([
[0],
np.cumsum([len(ra.flat_array) for ra in to_concat[:-1]])])

start_indices = np.hstack([ra.start_indices + offset
for offset, ra in zip(offsets, to_concat)])

return RaggedArray(dict(
mask=mask, flat_array=flat_array, start_indices=start_indices))
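Note on `_concat_same_type` above: each array's start indices have to be shifted by the total length of the flat buffers that precede it. A minimal sketch of that offset arithmetic, assuming plain lists and `itertools.accumulate` in place of `np.cumsum`; `concat_ragged` is a hypothetical name.

```python
from itertools import accumulate


def concat_ragged(parts):
    """Concatenate (start_indices, flat) pairs (hypothetical helper)."""
    lengths = [len(flat) for _, flat in parts]
    # Offset of each part = combined flat length of all earlier parts.
    offsets = [0] + list(accumulate(lengths[:-1]))
    start_indices = [s + offset
                     for offset, (starts, _) in zip(offsets, parts)
                     for s in starts]
    flat = [v for _, part_flat in parts for v in part_flat]
    return start_indices, flat


a = ([0, 2], [0, 1, 2])     # elements [0, 1] and [2]
b = ([0, 1], [7, 8, 9])     # elements [7] and [8, 9]
print(concat_ragged([a, b]))  # ([0, 2, 3, 4], [0, 1, 2, 7, 8, 9])
```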

@property
def dtype(self):
return RaggedDtype()

@property
def nbytes(self):
"""
The number of bytes needed to store this object in memory.
"""
return (self._flat_array.nbytes +
self._start_indices.nbytes +
self._mask.nbytes)
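Note on memory use: the constructor in this diff keeps `start_indices` small by choosing the narrowest unsigned integer width that can hold the flat buffer length. The selection can be sketched without NumPy by computing each width's maximum directly; `smallest_uint_bits` is a hypothetical name.

```python
def smallest_uint_bits(buffer_len):
    """Return the narrowest of 8/16/32/64 bits that can store buffer_len."""
    for nbits in (8, 16, 32, 64):
        max_supported = 2 ** nbits - 1  # largest unsigned value for this width
        if buffer_len <= max_supported:
            return nbits
    raise OverflowError("buffer length does not fit in 64 bits")


print(smallest_uint_bits(255))    # 8
print(smallest_uint_bits(256))    # 16
print(smallest_uint_bits(70000))  # 32
```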