-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-2611: [Python] Fix Python 2 integer serialization #2055
ARROW-2611: [Python] Fix Python 2 integer serialization #2055
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM
*result = PyInt_FromSsize_t(checked_cast<const Int64Array&>(arr).Value(index)); | ||
return Status::OK(); | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's slightly clearer to do
#if PY_MAJOR_VERSION < 3
// handle Py2 case
#else
// handle Py3 case
#endif
instead of
#if PY_MAJOR_VERSION < 3
// handle Py2 case
#endif
// handle Py3 case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is only a Python 2 case here, the "Python 3" case needs to be handled in both Python 2 and Python 3 (in Python 2 there is both int and long).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, actually, now that I think about it, we don't need the #if
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's needed to get it to compile, since in Python 3 the function PyInt_FromSsize_t doesn't exist. Arrow has some common code to handle differences like this but I didn't find anything that covers this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR should simply touch the deserialization path without creating a distinct serialization format for int
(as opposed to long
).
@@ -481,7 +490,7 @@ Status Append(PyObject* context, PyObject* elem, SequenceBuilder* builder, | |||
} | |||
#if PY_MAJOR_VERSION < 3 | |||
} else if (PyInt_Check(elem)) { | |||
RETURN_NOT_OK(builder->AppendInt64(static_cast<int64_t>(PyInt_AS_LONG(elem)))); | |||
RETURN_NOT_OK(builder->AppendPy2Int64(static_cast<int64_t>(PyInt_AS_LONG(elem)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this it the right approach. int
and long
are pretty much interchangeable in Python 2. So only the deserialization needs to be changed and attempt to build an int
before falling back on long
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An issue with this approach is that a serialized long
with a small numeric value will be deserialized as an int
which could cause issues for applications relying on type checking.
For example:
In [9]: value = 0L
In [10]: type(value)
Out[10]: long
In [11]: serialized = pa.serialize(value)
In [12]: deserialized = serialized.deserialize()
In [13]: type(deserialized)
Out[13]: int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accepting only long
and refusing int
sounds like a weird thing to do, though. Can you point an API that works this way?
(there are rare APIs that do the reverse - accept only int
and reject long
- but usually it's because they were written in C and the programmer couldn't care more)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't know one off the top of my head -- I agree that it's bad practice. I do think that being consistent with types is important from a correctness point of view.
Pickle, for reference, does not deserialize a small long
as an int
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. Let's go with this patch, then. Hopefully we'll ditch Python 2 compatibility soon ;-)
Great, thanks for the discussion and @pschafhalter for getting this to work, I'll merge it then :) [Retriggering the build first] |
e2adf74
to
7b96b67
Compare
Codecov Report
@@ Coverage Diff @@
## master #2055 +/- ##
==========================================
+ Coverage 86.28% 86.28% +<.01%
==========================================
Files 242 242
Lines 41034 41041 +7
==========================================
+ Hits 35405 35412 +7
Misses 5629 5629
Continue to review full report at Codecov.
|
Looking at this late. This fix seems odd to me but I haven't looked closely enough at the nuances to judge for certain. Would it be possible to have only a single int64 array in the serialized representation and then convert back to either PyInt or PyLong depending on the size of the value? In Python 2 it seems like |
I think the discussion with @pitrou touched on this. The main issue with converting to PyInt or PyLong depending on the size of the value is that if I serialize a |
Fixes an issue where serialization turns integers into longs in Python 2. ```python In [1]: import pyarrow as pa In [2]: value = 1 In [3]: type(value) Out[3]: int In [4]: serialized = pa.serialize(value) In [5]: deserialized = serialized.deserialize() In [6]: type(deserialized) Out[6]: long ``` Author: Peter Schafhalter <[email protected]> Closes apache#2055 from pschafhalter/fix-python2-int-serialization and squashes the following commits: 7b96b67 <Peter Schafhalter> Fix bug with Python 3 C++ API 5d8ff26 <Peter Schafhalter> Add type checking in assert_equal d5e5e5d <Peter Schafhalter> Fix python2 integer serialization bug
Fixes an issue where serialization turns integers into longs in Python 2.