BUG: apply() fails on some value types #34529

Closed
simondsmart opened this issue Jun 2, 2020 · 9 comments · Fixed by #34812
Labels
Apply Apply, Aggregate, Transform, Map Regression Functionality that used to work in a prior pandas version

@simondsmart

We have some existing code that manipulates data that is decoded into numpy arrays (by a C powered backend). This code has stopped working.

I've tried to strip it down to a reduced case:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([b'abcd', b'efgh']), columns=['col'])
df.apply(lambda x: x.astype('object'))

This fails with an error inside an internal function of apply:

ValueError                                Traceback (most recent call last)
<ipython-input-88-a5fa9cabd101> in <module>
----> 1 df.apply(lambda x: x.astype('object'))

~/local/pkg/miniconda3/envs/odc/lib/python3.8/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   6876             kwds=kwds,
   6877         )
-> 6878         return op.get_result()
   6879 
   6880     def applymap(self, func) -> "DataFrame":

~/local/pkg/miniconda3/envs/odc/lib/python3.8/site-packages/pandas/core/apply.py in get_result(self)
    184             return self.apply_raw()
    185 
--> 186         return self.apply_standard()
    187 
    188     def apply_empty_result(self):

~/local/pkg/miniconda3/envs/odc/lib/python3.8/site-packages/pandas/core/apply.py in apply_standard(self)
    293 
    294             try:
--> 295                 result = libreduction.compute_reduction(
    296                     values, self.f, axis=self.axis, dummy=dummy, labels=labels
    297                 )

pandas/_libs/reduction.pyx in pandas._libs.reduction.compute_reduction()

pandas/_libs/reduction.pyx in pandas._libs.reduction.Reducer.__init__()

pandas/_libs/reduction.pyx in pandas._libs.reduction.Reducer._check_dummy()

ValueError: Dummy array must be same dtype

If we use a different dtype, then it works.

df = pd.DataFrame(np.array(['abcd', 'efgh']), columns=['col'])
df.apply(lambda x: x.astype('object'))
print(df)

which gives the expected result

    col
0  abcd
1  efgh
@Veronur
Contributor

Veronur commented Jun 3, 2020

Hello, I would like to look into this issue!

@jorisvandenbossche jorisvandenbossche added Apply Apply, Aggregate, Transform, Map Regression Functionality that used to work in a prior pandas version labels Jun 3, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Jun 3, 2020
@jorisvandenbossche
Member

@simondsmart thanks for the report. That's indeed an error that should not be seen by the user (and in 0.25 it was working).

Now, although it should not raise an error, it's also not fully clear to me what you are trying to achieve. The column in the dataframe already has an object dtype, so doing apply(lambda x: x.astype('object')) should basically be a no-op.

@Veronur always welcome to take a look!

@simondsmart
Author

@jorisvandenbossche the code came from a rather different context. We have a library that does decoding of a rather esoteric data type (ODB2, the pyodc library). It has an optional ability to offload to a (separate) C++ library that does the decoding much faster - but requires that we set up arrays with rather strict memory layout requirements to decode into.

We then have to do a bit of ... coercion ... to get back to something appropriate in python land.

The bug report was rather aggressively simplified to the simplest case I could make to trigger the same error. So it looks like something rather daft.

Many thanks for your help!

@Veronur
Contributor

Veronur commented Jun 6, 2020

So, I did some checking and this is what I have found so far: the problem shown above happens inside the libreduction.compute_reduction function. I checked its arguments and found that the dtype of the dummy variable is object, just as it is for normal strings. The problem seems to be that the arr variable has dtype |S# (# being the number of chars in the byte string).
I think that making those two equal would solve the problem, so I would like suggestions on how to do that.
Thanks

@Veronur
Contributor

Veronur commented Jun 8, 2020

OK, so what I found out is this: if, in the example you gave, you use df.apply(lambda x: x.astype('|S')) instead of df.apply(lambda x: x.astype('object')), the command will run. That is because pandas currently interprets strings as object itself, but for byte strings it gets an array interface (https://numpy.org/devdocs/reference/arrays.interface.html#arrays-interface), which is why the dtypes were different. If this is unexpected behaviour, I'd like some guidelines on how to make a fix for it, and if this is supposed to happen, I'd be happy to write some tests for it!
By the way, it works from object to |S# but not from |S# to object.
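
For reference, the two conversions mentioned here both work at the plain NumPy level. The sketch below uses only NumPy, so it does not exercise DataFrame.apply and says nothing about which pandas versions fail; the variable names are illustrative.

```python
import numpy as np

# Start from an object array holding Python bytes.
as_object = np.array([b"abcd", b"efgh"], dtype=object)

# object -> |S#: NumPy picks the item width from the longest element.
as_bytes = as_object.astype("S")
print(as_bytes.dtype)

# |S# -> object: each element becomes a Python bytes object again.
round_trip = as_bytes.astype(object)
print(round_trip.dtype)
```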

@TomAugspurger
Contributor

Thanks for looking into this @Veronur. I'm also not sure of the best way to handle it, but it'd be nice to fix the regression for the 1.1 release (in a few weeks) as long as we don't give up on other behaviors.

@Veronur
Contributor

Veronur commented Jun 12, 2020

Alright, I have done some more digging and I am getting close to the source of the problem. So far I have found that the dtype is changed from |S# to object during Series generation for the dummy array, in the generic.NDFrame.__init__(self, data) function call. There the values array is generated with object dtype instead of |S# as expected.

@Veronur
Contributor

Veronur commented Jun 15, 2020

Alright, I found the problem and made a fix for it. The problem was in the pandas/core/dtypes/cast.py file. It was introduced by the change made for issue #21083, but since Python 3, "U" types and "S" types are different things.
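
The "U" versus "S" distinction can be seen in NumPy's dtype kinds. This is only an illustration of the two kinds, not the code in cast.py; the explicit ASCII decode is an assumption about how one might convert between them.

```python
import numpy as np

# Python 3 str values produce a unicode ("U") dtype.
text = np.array(["abcd", "efgh"])

# Python 3 bytes values produce a byte-string ("S") dtype.
raw = np.array([b"abcd", b"efgh"])

print(text.dtype.kind, raw.dtype.kind)

# Going from "S" to "U" requires decoding the bytes with some encoding.
decoded = np.char.decode(raw, "ascii")
print(decoded.dtype.kind)
```

Code that treats both kinds as "string" has to handle them separately, since one holds text and the other holds raw bytes.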

@Veronur
Contributor

Veronur commented Jun 16, 2020

The pull request is now having issues with some tests that use the None type, because that's what issue #21083 fixed. I will need some ideas on how to manage those.

@jreback jreback changed the title apply() fails on some value types BUG: apply() fails on some value types Jun 16, 2020
Veronur pushed a commit to Veronur/pandas that referenced this issue Jun 16, 2020
correction and test for issue-pandas-dev#34529
Veronur pushed a commit to Veronur/pandas that referenced this issue Jun 18, 2020
correction and test for issue-pandas-dev#34529

made the formating changes

fixing tests on issue-pandas-dev#34529

add whats new entry on issue-pandas-dev#34539

add whats new entry correction issue-pandas-dev#34539

whats new correction issue-pandas-dev#34539
Veronur added a commit to Veronur/pandas that referenced this issue Jun 18, 2020