ENH: 2D support for MaskedArray #38992

Merged (57 commits) on Oct 16, 2021


jbrockmendel
Member

This doesn't in any way use the 2D support, but opens up the option of incrementally fleshing out the tests.

Contributor

@jreback jreback left a comment

looks fine to me.

pandas/core/arrays/_mixins.py
@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 6, 2021
@jreback jreback added this to the 1.3 milestone Jan 6, 2021
@@ -80,6 +80,8 @@ class BaseMaskedArray(OpsMixin, ExtensionArray):

    # The value used to fill '_data' to avoid upcasting
    _internal_fill_value: Scalar
    _data: np.ndarray
    _mask: np.ndarray[Any, bool]
Member

np.ndarray in numpy 1.20 is not generic so although mypy is happy with type parameters on Any, this will raise errors when we transition to numpy types.

Member Author

removed the [Any, bool] part of this. is there an approximate calendar for the transition to numpy types?

Contributor

I think these are in 1.20 (the release is blocked on an arrow update atm)

Member

> is there an approximate calendar for the transition to numpy types?

I updated to 1.20.0rc2 locally and there were no changes to the mypy errors. (I've not yet checked the commit history to see if there were any changes.)

Once 1.20 is released and we pick it up on CI, we will see the errors in #36092. (I'll merge master and make a start on updating now.)

I assume that we will pin numpy in CI while we discuss how to sort out the errors (once we know what the status is with the released numpy).
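
For context on the typing question in this thread: NumPy 1.21 and later ship `numpy.typing.NDArray`, which is one way a boolean mask could be annotated. The following is a minimal sketch, assuming NumPy >= 1.21; `MaskedPair` is a hypothetical stand-in for illustration and is not pandas' `BaseMaskedArray`.

```python
from __future__ import annotations

import numpy as np
import numpy.typing as npt


class MaskedPair:
    """Hypothetical container pairing data with a boolean mask (not pandas code)."""

    _data: np.ndarray
    # NDArray[np.bool_] documents the element type without subscripting np.ndarray.
    _mask: npt.NDArray[np.bool_]

    def __init__(self, data: np.ndarray, mask: npt.NDArray[np.bool_]) -> None:
        if data.shape != mask.shape:
            raise ValueError("data and mask must have the same shape")
        self._data = data
        self._mask = mask
```

Because of the `from __future__ import annotations` line, the annotations are not evaluated at runtime, so the sketch only affects what a type checker sees.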

Member

@jorisvandenbossche jorisvandenbossche left a comment

Such a significant architectural change shouldn't be merged without prior discussion

@jreback
Contributor

jreback commented Jan 6, 2021

> Such a significant architectural change shouldn't be merged without prior discussion

sure, what are your concerns

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 8, 2021

ExtensionArrays have been 1D from the start (for around 3 years now), and the same goes for MaskedArray specifically. So if Brock proposes to change that, I think it is first of all up to the proposer to come up with arguments for changing this.

There has been discussion before about 1D vs 2D extension arrays, for sure (although I don't think any of that discussion resulted in a clear decision to merge a PR like this). But specifically for MaskedArray, we didn't have any discussion about this, AFAIK. I think that at least requires some discussion about whether we want this or not?

The masked arrays have been explicitly designed to be 1D, of course also because currently ExtensionArrays in general are 1D, but in addition, there are several ideas for future improvements (#30435), exploring bitmask instead of boolean mask (#31293), optionally using arrow under the hood like we are doing for the string array, more efficient zero-copy conversion with arrow, nested data types (eg #35176). Those are all not impossible with 2D arrays, but IMO will be much easier with 1D arrays, and thus requires some consideration.

It's also not fully clear to me where this PR is heading. At the moment it is adding quite some complexity for something that is not yet used. What's the plan for how to actually use it in pandas? Do we want to give the masked arrays 2D capabilities (so this ability can be used for certain operations), but keep storing them as 1D in DataFrames? Or do we want to change the ExtensionBlock to a consolidated 2D Block? But only for masked arrays, or for all ExtensionArrays? What about arrays that cannot easily be 2D (eg nested array)? What's the idea for externally defined ExtensionArrays? ...

@jbrockmendel
Member Author

The reason to do this is roughly the same reason why we're moving forward with ArrayManager: so that we can see if actually using this is something we want to do longer-term.

@jbrockmendel
Member Author

gentle ping, plenty more tests where these came from

@jreback
Contributor

jreback commented Jan 20, 2021

ok I am +1 on merging this. I agree with @jbrockmendel's reasoning here. We don't really know where we are going to ultimately go, e.g. ArrayManager or simplified BlockManager. We need more support & performance testing to see. Sure I'd like to see a unified approach, but we have advocates for both and would rather not inhibit experimentation.

cc @jorisvandenbossche

@jorisvandenbossche
Member

> The reason to do this is roughly the same reason why we're moving forward with ArrayManager: so that we can see if actually using this is something we want to do longer-term.

For the simplified non-consolidating BlockManager, I started with a description of arguments for it, we had an extensive discussion about it, with several people expressing their interest for it, and with the main question mark being performance. At which point we need a proof of concept to test things.

As far as I know, we have had no such discussion about 2D masked arrays.

@jbrockmendel
Member Author

> As far as I know, we have had no such discussion about 2D masked arrays.

We've had the same discussion about 2D EAs repeatedly.

@jreback
Contributor

jreback commented Jan 21, 2021

@jorisvandenbossche do you have actual concrete objections to merging this? We are allowing ArrayManager on an experimental basis, I don't see how this is any different.

@jorisvandenbossche
Member

I think my longer comment above (#38992 (comment)) already includes some concrete concerns. Reformulating them:

  • We have had many discussions about 2D ExtensionArrays, yes (mostly in 2019, see eg EA: support basic 2D operations #27142 and linked PRs and the mailing list discussion at https://mail.python.org/pipermail/pandas-dev/2019-June/000983.html), but AFAIK those discussions have not yet led to a consensus or compromise in favor of 2D EAs (if I recall correctly, the use of 2D arrays for datetime ops was discussed then as a compromise).
    Fully supporting 2D ExtensionArrays is a big change, that requires a more detailed proposal and discussion IMO. And we already have the existing consolidating BlockManager to know how internals with 2D arrays would work (and to compare the ArrayManager with).
  • IMO this PR is missing context on how we would actually use this in pandas. I think we should at least have some idea about that before merging this (I asked several questions above (eg do we want to make ExtensionBlock 2D? What does this mean for other EAs? ...), to which no response has been given)
  • The POC for the ArrayManager is mostly independent from the existing code (eg the BlockManager didn't become any more complex due to merging it), while this is profoundly changing the existing MaskedArrays, making it more difficult to further improve them as 1D arrays (there is still a lot of work to make them fully feature-complete to start with, and I mentioned several possible additional enhancements above)

(having a call about this might help resolve some of those discussion points?)

@jbrockmendel
Member Author

I'm tired of repeating myself. At the sprint in 2019 we (including Wes) agreed to move forward with 2D EA support for experimentation. Since then the only thing we've learned is that you are more willing to repeat the same arguments over and over again than I am, and everyone else makes the entirely reasonable decision to tune it out.

> there is still a lot of work to make them fully feature-complete to start with

I would dearly like to see that, which I see as part and parcel of the fix-many-xfails effort that I brought up on last week's call. But I don't see any effort towards making them happen, or any reason why they are mutually exclusive with this experimentation.

@jbrockmendel
Member Author

gentle ping; this would simplify a corner case in #33036

@jreback
Contributor

jreback commented Aug 4, 2021

ok tests are failing.

i suppose it makes sense to support both 1d and 2d kernels on things. it does lead to some code duplication, but performance can be great if we don't need to operate column-by-column all the time. However the codebase is mostly 1d currently, with some efforts to add 2d kernels.

does this concur with your thinking?
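
To illustrate the 1D-versus-2D kernel trade-off raised in this comment, here is a minimal NumPy sketch. Both function names are hypothetical and neither is a pandas internal; the sketch only contrasts calling a 1D routine once per column with a single call over the whole 2D block.

```python
import numpy as np


def masked_sum_by_column(values, mask):
    """Apply a 1D masked sum once per column (hypothetical helper)."""
    return np.array(
        [values[~mask[:, k], k].sum() for k in range(values.shape[1])]
    )


def masked_sum_2d(values, mask):
    """Single 2D kernel: masked entries contribute zero to the column sums."""
    return np.where(mask, 0, values).sum(axis=0)
```

Both return the same column sums; the 2D version avoids the Python-level loop over columns, at the cost of maintaining a second code path.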

@jbrockmendel
Member Author

> ok tests are failing.

Fixed

> [...] does this concur with your thinking?

Was this part of the comment supposed to go in #42841? Will answer there.

@simonjayhawkins
Member

@jbrockmendel needs rebase

@jreback jreback added this to the 1.4 milestone Oct 6, 2021
pandas/core/arrays/boolean.py
@@ -115,6 +117,9 @@ class BaseMaskedArray(OpsMixin, ExtensionArray):

    # The value used to fill '_data' to avoid upcasting
    _internal_fill_value: Scalar
    _data: np.ndarray
Contributor

add a comment about these

Contributor

@jreback jreback left a comment

lgtm. great to see how to proceed.

    skipna: bool = True,
    axis: Optional[int] = None,
):
    return _minmax(np.max, values=values, mask=mask, skipna=skipna, axis=axis)
Contributor

are there doc strings here? if so can you update (can be followup as well)
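
As a rough illustration of what an `axis`-aware masked reduction with a docstring might look like, here is a minimal sketch. `masked_max` is a hypothetical helper written for this example; it is not the pandas `_minmax` implementation, and returning a `(result, result_mask)` pair is an assumption made for clarity.

```python
import numpy as np


def masked_max(values, mask, skipna=True, axis=None):
    """
    Maximum of ``values`` where ``mask`` marks missing entries (hypothetical).

    Returns ``(result, result_mask)``; ``result_mask`` is True where the
    reduced result itself should be treated as missing.
    """
    if skipna:
        # Replace missing slots with the lowest representable value so
        # they can never win the comparison.
        if values.dtype.kind in "iu":
            lowest = np.iinfo(values.dtype).min
        else:
            lowest = -np.inf
        filled = np.where(mask, lowest, values)
        # The result is missing only if every entry along the axis was missing.
        result_mask = mask.all(axis=axis)
    else:
        filled = values
        # Without skipna, a single missing entry makes the result missing.
        result_mask = mask.any(axis=axis)
    return filled.max(axis=axis), result_mask
```

With `axis=None` the function reduces over the whole array, mirroring the NumPy convention.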

@@ -194,9 +199,23 @@ def test_reductions_2d_axis0(self, data, method, request):
        if method in ["sum", "prod"] and data.dtype.kind in ["i", "u"]:
            # FIXME: kludge
            if data.dtype.kind == "i":
                dtype = pd.Int64Dtype()
                if is_platform_windows() or not IS64:
Contributor

oh yeah followup with this to make nicer

@jreback jreback merged commit 4d9b6f7 into pandas-dev:master Oct 16, 2021
@jreback
Contributor

jreback commented Oct 16, 2021

certainly fine for testing things out. thanks @jbrockmendel

a couple of followups

@jbrockmendel jbrockmendel deleted the enh-masked-2d branch October 18, 2021 15:02
        continue
    fill_count += 1
    values[j, i] = val
    mask[j, i] = False
Member

mask should be previous mask... see pad_inplace #39953

import numpy as np
import pandas as pd

dtype = pd.Int64Dtype()

data_missing = pd.array([pd.NA, 1], dtype=dtype)

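# Build a (4, 2) masked array whose first two rows are all-NA.
arr = data_missing.repeat(4).reshape(4, 2)

# Actual: forward-fill on the 2D array (printed first below).
result = arr.fillna(method="pad")
print(result)

# Expected: forward-fill in 1D, then expand to the same shape (printed second below).
expected = data_missing.fillna(method="pad").repeat(4).reshape(4, 2)
print(expected)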
<IntegerArray>
[
[<NA>, <NA>],
[1, 1],
[1, 1],
[1, 1]
]
Shape: (4, 2), dtype: Int64
<IntegerArray>
[
[<NA>, <NA>],
[<NA>, <NA>],
[1, 1],
[1, 1]
]
Shape: (4, 2), dtype: Int64

Member

so fixing this on my numba branch results in...

@numba.njit
def _pad_2d_inplace(values, mask, limit=None):
    if values.shape[1]:
        K, N = values.shape
        if limit is None:
            for j in range(K):
                val, prev_mask = values[j, 0], mask[j, 0]
                for i in range(N):
                    if mask[j, i]:
                        values[j, i], mask[j, i] = val, prev_mask
                    else:
                        val, prev_mask = values[j, i], mask[j, i]
        else:
            for j in range(K):
                fill_count = 0
                val, prev_mask = values[j, 0], mask[j, 0]
                for i in range(N):
                    if mask[j, i]:
                        if fill_count >= limit:
                            continue
                        fill_count += 1
                        values[j, i], mask[j, i] = val, prev_mask
                    else:
                        fill_count = 0
                        val, prev_mask = values[j, i], mask[j, i]

I have some duplication here, but there is a perf improvement for the common case of no limit. The duplication can probably be mitigated by reshaping a 1d array and removing the 1d version, pad_inplace.

Member

I might look into a variant using 2 loops: the first to find the first non-missing value, and the second to fill without tracking the previous mask. Then we could just do mask[j, i] = False.
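
The two-loop idea described above could look roughly like the following. This is a minimal pure-Python sketch on nested lists so it runs without numba or Cython; `pad_rows_inplace` is a hypothetical name, and the code is not taken from the branch under discussion.

```python
def pad_rows_inplace(values, mask, limit=None):
    """Forward-fill each row in place; mask[j][i] is True where missing."""
    for j in range(len(values)):
        n = len(values[j])
        # Loop 1: skip leading missing entries, which have nothing to copy from.
        start = 0
        while start < n and mask[j][start]:
            start += 1
        if start == n:
            continue  # the whole row is missing; leave it untouched
        # Loop 2: from here on every missing entry has a valid predecessor,
        # so the filled slot's mask can simply be set to False.
        val = values[j][start]
        fill_count = 0
        for i in range(start + 1, n):
            if mask[j][i]:
                if limit is not None and fill_count >= limit:
                    continue
                fill_count += 1
                values[j][i] = val
                mask[j][i] = False
            else:
                fill_count = 0
                val = values[j][i]
```

Because leading missing entries are skipped up front, they stay masked, which is the behaviour the `expected` output in the reproducer above calls for.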

Member

One more thing while in the neighborhood, I think for i in range(N) should probably be for i in range(1, N)?

@@ -656,10 +656,11 @@ def pad_2d_inplace(numeric_object_t[:, :] values, const uint8_t[:, :] mask, limi
         val = values[j, 0]
         for i in range(N):
             if mask[j, i]:
-                if fill_count >= lim:
+                if fill_count >= lim or i == 0:

@jbrockmendel
Member Author

@simonjayhawkins addressing your post-merge comments has been on my todo list for a while, looks likely to fall off. Do they merit their own Issue?

Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
4 participants