API: Add string extension type #27949

Merged
merged 59 commits into from
Oct 5, 2019
c24b5b6
API: Add string extension type
TomAugspurger Jul 31, 2019
3ecb5cc
test fixups
TomAugspurger Aug 16, 2019
59a7d39
string dtype
TomAugspurger Aug 16, 2019
7c07070
35 compat
TomAugspurger Aug 16, 2019
9e1a73b
doc
TomAugspurger Aug 16, 2019
16ccad8
fixups
TomAugspurger Aug 16, 2019
1027463
doc
TomAugspurger Aug 16, 2019
aafb53b
doc
TomAugspurger Aug 19, 2019
9cdfe2f
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Aug 19, 2019
ab49169
fix doc warnings
TomAugspurger Aug 19, 2019
978fb55
fixup docstrings
TomAugspurger Aug 19, 2019
aebc688
fixup docstrings
TomAugspurger Aug 19, 2019
d90d0ad
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 9, 2019
41dc0f9
lint
TomAugspurger Sep 9, 2019
b783559
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 16, 2019
13cdddd
typing
TomAugspurger Sep 16, 2019
78c2eaa
removed double assert
TomAugspurger Sep 18, 2019
726d0af
experimental
TomAugspurger Sep 19, 2019
69d24e5
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 19, 2019
9cd9945
failing
TomAugspurger Sep 19, 2019
070fb76
xfails
TomAugspurger Sep 19, 2019
2b90639
Handle non-ndarray in add
TomAugspurger Sep 19, 2019
381c889
fixup
TomAugspurger Sep 19, 2019
bf82aad
fixup
TomAugspurger Sep 19, 2019
79bd87a
note
TomAugspurger Sep 19, 2019
2af8c81
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 23, 2019
fd24274
spacing
TomAugspurger Sep 23, 2019
0635ede
warning note
TomAugspurger Sep 23, 2019
d3311ee
update doc
TomAugspurger Sep 23, 2019
dce9258
doc updates
TomAugspurger Sep 23, 2019
0524f7e
update ctor
TomAugspurger Sep 23, 2019
292a8f3
clean up wrapping
TomAugspurger Sep 23, 2019
2c88e3b
clarify
TomAugspurger Sep 23, 2019
1b8c83a
reduce sum
TomAugspurger Sep 23, 2019
f1dad2a
skip reduce sum
TomAugspurger Sep 23, 2019
be95ecb
rename
TomAugspurger Sep 23, 2019
903ea2f
move
TomAugspurger Sep 23, 2019
0e1f479
missed
TomAugspurger Sep 23, 2019
c168ecf
missed
TomAugspurger Sep 23, 2019
d06ba73
fixup rename
TomAugspurger Sep 24, 2019
3ba27c3
fixup
TomAugspurger Sep 24, 2019
fe8ee77
doctest
TomAugspurger Sep 24, 2019
d9f63aa
updates
TomAugspurger Sep 24, 2019
d3c49e2
fixups
TomAugspurger Sep 24, 2019
dcb84f9
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 24, 2019
43b51cd
length check
TomAugspurger Sep 24, 2019
4fd2d11
unimplement sum
TomAugspurger Sep 24, 2019
713f807
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 26, 2019
777b295
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 30, 2019
8714a53
fixup
TomAugspurger Sep 30, 2019
41f234c
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Oct 1, 2019
dc9ef3c
rename
TomAugspurger Oct 1, 2019
9419af2
rename
TomAugspurger Oct 1, 2019
462b29d
doc updates
TomAugspurger Oct 1, 2019
0391563
fixups
TomAugspurger Oct 1, 2019
129fe29
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Oct 3, 2019
6aebd8c
move and perf
TomAugspurger Oct 4, 2019
2ee5e30
test is_string_dtype
TomAugspurger Oct 4, 2019
7e92cde
helper
TomAugspurger Oct 4, 2019
4 changes: 4 additions & 0 deletions ci/code_checks.sh
@@ -266,6 +266,10 @@ if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then
-k"-from_arrays -from_breaks -from_intervals -from_tuples -set_closed -to_tuples -interval_range"
RET=$(($RET + $?)) ; echo $MSG "DONE"

MSG='Doctests arrays/string_.py' ; echo $MSG
pytest -q --doctest-modules pandas/core/arrays/string_.py
RET=$(($RET + $?)) ; echo $MSG "DONE"

fi

### DOCSTRINGS ###
19 changes: 16 additions & 3 deletions doc/source/getting_started/basics.rst
@@ -986,7 +986,7 @@ not noted for a particular column will be ``NaN``:

tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})

.. _basics.aggregation.mixed_dtypes:
.. _basics.aggregation.mixed_string:

Mixed dtypes
++++++++++++
@@ -1704,14 +1704,21 @@ built-in string methods. For example:

.. ipython:: python

   s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
   s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
                 dtype="string")
   s.str.lower()

Powerful pattern-matching methods are provided as well, but note that
pattern-matching generally uses `regular expressions
<https://docs.python.org/3/library/re.html>`__ by default (and in some cases
always uses them).

.. note::

   Prior to pandas 1.0, string methods were only available on ``object``-dtype
   ``Series``. Pandas 1.0 added the :class:`StringDtype`, which is dedicated
   to strings. See :ref:`text.types` for more.

Please see :ref:`Vectorized String Methods <text.string_methods>` for a complete
description.
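
As a rough illustration of the regex-by-default behavior noted above, the sketch below runs ``Series.str.contains`` twice on a small ``"string"``-dtype ``Series``. The sample values are made up for illustration, and the example assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical sample data; None becomes a missing value.
s = pd.Series(["A", "Aaba", "a.a", None, "dog"], dtype="string")

# Default: the pattern is a regular expression, so "." matches any character.
as_regex = s.str.contains("a.a")

# regex=False: the pattern is matched as a literal substring.
as_literal = s.str.contains("a.a", regex=False)

print(as_regex)
print(as_literal)
```

Here ``"Aaba"`` matches only in the regex case (via ``"aba"``), while ``"a.a"`` matches in both.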

@@ -1925,9 +1932,15 @@ period (time spans) :class:`PeriodDtype` :class:`Period` :class:`arrays.
sparse              :class:`SparseDtype`      (none)             :class:`arrays.SparseArray`   :ref:`sparse`
intervals           :class:`IntervalDtype`    :class:`Interval`  :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
nullable integer    :class:`Int64Dtype`, ...  (none)             :class:`arrays.IntegerArray`  :ref:`integer_na`
Strings             :class:`StringDtype`      :class:`str`       :class:`arrays.StringArray`   :ref:`text`
=================== ========================= ================== ============================= =============================

Pandas uses the ``object`` dtype for storing strings.
Pandas has two ways to store strings.

1. ``object`` dtype, which can hold any Python object, including strings.
2. :class:`StringDtype`, which is dedicated to strings.

Generally, we recommend using :class:`StringDtype`. See :ref:`text.types` for more.
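
The two storage options listed above can be compared with a minimal sketch like the following. It assumes pandas 1.0 or later; the variable names and sample values are illustrative only.

```python
import pandas as pd

values = ["a", "b", None]

# object dtype: a generic container that can hold any Python object.
obj = pd.Series(values, dtype=object)

# StringDtype (alias "string"): dedicated to text data.
txt = pd.Series(values, dtype="string")

print(obj.dtype)
print(txt.dtype)
print(txt.isna().tolist())
```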

Finally, arbitrary objects may be stored using the ``object`` dtype, but should
be avoided to the extent possible (for performance and interoperability with
26 changes: 25 additions & 1 deletion doc/source/reference/arrays.rst
@@ -24,6 +24,7 @@ Intervals :class:`IntervalDtype` :class:`Interval` :ref:`api.array
Nullable Integer    :class:`Int64Dtype`, ...  (none)             :ref:`api.arrays.integer_na`
Categorical         :class:`CategoricalDtype` (none)             :ref:`api.arrays.categorical`
Sparse              :class:`SparseDtype`      (none)             :ref:`api.arrays.sparse`
Strings             :class:`StringDtype`      :class:`str`       :ref:`api.arrays.string`
=================== ========================= ================== =============================

Pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`).
@@ -460,6 +461,29 @@ and methods if the :class:`Series` contains sparse values. See
:ref:`api.series.sparse` for more.


.. _api.arrays.string:

Text data
---------

When working with text data, where each valid element is a string or missing,
we recommend using :class:`StringDtype` (with the alias ``"string"``).

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.StringArray

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   StringDtype

The ``Series.str`` accessor is available for ``Series`` backed by a :class:`arrays.StringArray`.
See :ref:`api.series.str` for more.
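
A short sketch of the accessor on string-backed data, assuming pandas 1.0 or later: ``pd.array`` with ``dtype="string"`` builds the extension array, and wrapping it in a ``Series`` exposes ``.str``. The sample values are hypothetical.

```python
import pandas as pd

# Build a string extension array; None is stored as a missing value.
arr = pd.array(["pandas", None, "Text"], dtype="string")

# Wrapping the array in a Series makes the .str accessor available.
ser = pd.Series(arr)
upper = ser.str.upper()

print(upper)
```

Missing entries stay missing in the result rather than raising an error.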


.. Dtype attributes which are manually listed in their docstrings: including
.. it here to make sure a docstring page is built for them
@@ -471,4 +495,4 @@ and methods if the :class:`Series` contains sparse values. See
DatetimeTZDtype.unit
DatetimeTZDtype.tz
PeriodDtype.freq
IntervalDtype.subtype
IntervalDtype.subtype