API: Add string extension type (pandas-dev#27949)
TomAugspurger authored and proost committed Dec 19, 2019
1 parent 6fe2510 commit c4e1ecd
Showing 20 changed files with 908 additions and 76 deletions.
4 changes: 4 additions & 0 deletions ci/code_checks.sh
@@ -266,6 +266,10 @@ if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then
-k"-from_arrays -from_breaks -from_intervals -from_tuples -set_closed -to_tuples -interval_range"
RET=$(($RET + $?)) ; echo $MSG "DONE"

MSG='Doctests arrays/string_.py' ; echo $MSG
pytest -q --doctest-modules pandas/core/arrays/string_.py
RET=$(($RET + $?)) ; echo $MSG "DONE"

fi

### DOCSTRINGS ###
19 changes: 16 additions & 3 deletions doc/source/getting_started/basics.rst
@@ -986,7 +986,7 @@ not noted for a particular column will be ``NaN``:
tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})
.. _basics.aggregation.mixed_dtypes:
.. _basics.aggregation.mixed_string:

Mixed dtypes
++++++++++++
@@ -1704,14 +1704,21 @@ built-in string methods. For example:

.. ipython:: python
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
dtype="string")
s.str.lower()
Powerful pattern-matching methods are provided as well, but note that
pattern-matching generally uses `regular expressions
<https://docs.python.org/3/library/re.html>`__ by default (and in some cases
always uses them).

.. note::

Prior to pandas 1.0, string methods were only available on ``object`` -dtype
``Series``. Pandas 1.0 added the :class:`StringDtype` which is dedicated
to strings. See :ref:`text.types` for more.

Please see :ref:`Vectorized String Methods <text.string_methods>` for a complete
description.
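
As a rough illustration of the note above (a sketch, not part of this commit), the snippet below applies a ``.str`` method to the same data stored with the ``"string"`` alias. It assumes pandas 1.0 or later; the variable name ``lowered`` is illustrative only.

```python
import numpy as np
import pandas as pd

# Same values as the docs example, stored with the dedicated string dtype.
s = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"],
    dtype="string",
)

# Vectorized string methods are reached through the .str accessor.
lowered = s.str.lower()

# The result keeps the string dtype, and the missing entry stays missing.
print(lowered.dtype)
print(lowered.isna().sum())
```

The missing value is carried through the operation, so no special handling is needed before calling ``.str`` methods.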

@@ -1925,9 +1932,15 @@ period (time spans) :class:`PeriodDtype` :class:`Period` :class:`arrays.
sparse :class:`SparseDtype` (none) :class:`arrays.SparseArray` :ref:`sparse`
intervals :class:`IntervalDtype` :class:`Interval` :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
nullable integer :class:`Int64Dtype`, ... (none) :class:`arrays.IntegerArray` :ref:`integer_na`
Strings :class:`StringDtype` :class:`str` :class:`arrays.StringArray` :ref:`text`
=================== ========================= ================== ============================= =============================

Pandas uses the ``object`` dtype for storing strings.
Pandas has two ways to store strings.

1. ``object`` dtype, which can hold any Python object, including strings.
2. :class:`StringDtype`, which is dedicated to strings.

Generally, we recommend using :class:`StringDtype`. See :ref:`text.types` for more.
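
The two storage options can be compared side by side. This is a minimal sketch (not part of this commit), assuming pandas 1.0 or later; the names ``words``, ``as_object`` and ``as_string`` are hypothetical.

```python
import pandas as pd

words = ["apple", "banana", None]

# Option 1: object dtype, which can hold any Python object.
as_object = pd.Series(words, dtype="object")

# Option 2: the dedicated string dtype, requested via its "string" alias.
as_string = pd.Series(words, dtype="string")

print(as_object.dtype)
print(as_string.dtype)

# Both report the same missing-value mask, but only the second
# dtype states that the column is meant to hold text.
print(as_object.isna().tolist())
print(as_string.isna().tolist())
```

Because the dtype itself records the intent, code that inspects ``as_string.dtype`` can tell a text column apart from a column of arbitrary objects.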

Finally, arbitrary objects may be stored using the ``object`` dtype, but should
be avoided to the extent possible (for performance and interoperability with
26 changes: 25 additions & 1 deletion doc/source/reference/arrays.rst
@@ -24,6 +24,7 @@ Intervals :class:`IntervalDtype` :class:`Interval` :ref:`api.array
Nullable Integer :class:`Int64Dtype`, ... (none) :ref:`api.arrays.integer_na`
Categorical :class:`CategoricalDtype` (none) :ref:`api.arrays.categorical`
Sparse :class:`SparseDtype` (none) :ref:`api.arrays.sparse`
Strings :class:`StringDtype` :class:`str` :ref:`api.arrays.string`
=================== ========================= ================== =============================

Pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`).
@@ -460,6 +461,29 @@ and methods if the :class:`Series` contains sparse values. See
:ref:`api.series.sparse` for more.


.. _api.arrays.string:

Text data
---------

When working with text data, where each valid element is a string or missing,
we recommend using :class:`StringDtype` (with the alias ``"string"``).

.. autosummary::
:toctree: api/
:template: autosummary/class_without_autosummary.rst

arrays.StringArray

.. autosummary::
:toctree: api/
:template: autosummary/class_without_autosummary.rst

StringDtype

The ``Series.str`` accessor is available for ``Series`` backed by a :class:`arrays.StringArray`.
See :ref:`api.series.str` for more.
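
A short sketch of how these pieces fit together (not part of this commit; assumes pandas 1.0 or later, and the variable names are illustrative): an extension array is built with ``pd.array``, wrapped in a ``Series``, and then used through ``Series.str``.

```python
import pandas as pd

# Build a string extension array directly; None becomes a missing value.
arr = pd.array(["x", "y", None], dtype="string")

# The alias and the dtype class should be interchangeable when constructing.
ser_alias = pd.Series(arr)
ser_class = pd.Series(["x", "y", None], dtype=pd.StringDtype())

# The .str accessor works on a Series backed by the string array.
upper = ser_alias.str.upper()
print(upper)
```

Passing ``pd.StringDtype()`` instead of the ``"string"`` alias is mainly a matter of style; the alias is shorter, while the class is easier to find in the API reference.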


.. Dtype attributes which are manually listed in their docstrings: including
.. it here to make sure a docstring page is built for them
@@ -471,4 +495,4 @@ and methods if the :class:`Series` contains sparse values. See
DatetimeTZDtype.unit
DatetimeTZDtype.tz
PeriodDtype.freq
IntervalDtype.subtype
IntervalDtype.subtype