Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-2611: [Python] Fix Python 2 integer serialization #2055

Closed

Conversation

pschafhalter
Copy link
Contributor

Fixes an issue where serialization turns integers into longs in Python 2.

In [1]: import pyarrow as pa

In [2]: value = 1

In [3]: type(value)
Out[3]: int

In [4]: serialized = pa.serialize(value)

In [5]: deserialized = serialized.deserialize()

In [6]: type(deserialized)
Out[6]: long

Copy link
Contributor

@pcmoritz pcmoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

*result = PyInt_FromSsize_t(checked_cast<const Int64Array&>(arr).Value(index));
return Status::OK();
}
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's slightly clearer to do

#if PY_MAJOR_VERSION < 3
  // handle Py2 case
#else
  // handle Py3 case
#endif

instead of

#if PY_MAJOR_VERSION < 3
  // handle Py2 case
#endif
// handle Py3 case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only a Python 2 case here, the "Python 3" case needs to be handled in both Python 2 and Python 3 (in Python 2 there is both int and long).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, actually, now that I think about it, we don't need the #if, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's needed to get it to compile, since in Python 3 the function PyInt_FromSsize_t doesn't exist. Arrow has some common code to handle differences like this but I didn't find anything that covers this function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sounds good.

pitrou
pitrou previously requested changes May 17, 2018
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR should simply touch the deserialization path without creating a distinct serialization format for int (as opposed to long).

@@ -481,7 +490,7 @@ Status Append(PyObject* context, PyObject* elem, SequenceBuilder* builder,
}
#if PY_MAJOR_VERSION < 3
} else if (PyInt_Check(elem)) {
RETURN_NOT_OK(builder->AppendInt64(static_cast<int64_t>(PyInt_AS_LONG(elem))));
RETURN_NOT_OK(builder->AppendPy2Int64(static_cast<int64_t>(PyInt_AS_LONG(elem))));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this it the right approach. int and long are pretty much interchangeable in Python 2. So only the deserialization needs to be changed and attempt to build an int before falling back on long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An issue with this approach is that a serialized long with a small numeric value will be deserialized as an int which could cause issues for applications relying on type checking.

For example:

In [9]: value = 0L

In [10]: type(value)
Out[10]: long

In [11]: serialized = pa.serialize(value)

In [12]: deserialized = serialized.deserialize()

In [13]: type(deserialized)
Out[13]: int

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepting only long and refusing int sounds like a weird thing to do, though. Can you point an API that works this way?
(there are rare APIs that do the reverse - accept only int and reject long - but usually it's because they were written in C and the programmer couldn't care more)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't know one off the top of my head -- I agree that it's bad practice. I do think that being consistent with types is important from a correctness point of view.
Pickle, for reference, does not deserialize a small long as an int.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. Let's go with this patch, then. Hopefully we'll ditch Python 2 compatibility soon ;-)

@pitrou pitrou dismissed their stale review May 17, 2018 21:10

PR is good enough after all

@pcmoritz
Copy link
Contributor

pcmoritz commented May 17, 2018

Great, thanks for the discussion and @pschafhalter for getting this to work, I'll merge it then :)

[Retriggering the build first]

@pcmoritz pcmoritz force-pushed the fix-python2-int-serialization branch from e2adf74 to 7b96b67 Compare May 17, 2018 22:10
@codecov-io
Copy link

Codecov Report

Merging #2055 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2055      +/-   ##
==========================================
+ Coverage   86.28%   86.28%   +<.01%     
==========================================
  Files         242      242              
  Lines       41034    41041       +7     
==========================================
+ Hits        35405    35412       +7     
  Misses       5629     5629
Impacted Files Coverage Δ
cpp/src/arrow/python/arrow_to_python.cc 91.71% <100%> (+0.2%) ⬆️
python/pyarrow/tests/test_serialization.py 90.33% <100%> (ø) ⬆️
cpp/src/arrow/python/python_to_arrow.cc 88.29% <100%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd54041...7b96b67. Read the comment docs.

@pcmoritz pcmoritz changed the title [Python] Fix Python 2 integer serialization ARROW-2611: [Python] Fix Python 2 integer serialization May 18, 2018
@pcmoritz pcmoritz closed this in 3d41978 May 18, 2018
@wesm
Copy link
Member

wesm commented May 18, 2018

Looking at this late. This fix seems odd to me but I haven't looked closely enough at the nuances to judge for certain.

Would it be possible to have only a single int64 array in the serialized representation and then convert back to either PyInt or PyLong depending on the size of the value? In Python 2 it seems like int would always be a safe type if the value fits. If this sounds reasonable perhaps we can open a JIRA to investigate further another time

@pschafhalter
Copy link
Contributor Author

I think the discussion with @pitrou touched on this.

The main issue with converting to PyInt or PyLong depending on the size of the value is that if I serialize a long, I expect to receive a long on deserialization. In Python 2, small values may have type long, e.g. type(0L) == long, but choosing PyInt or PyLong depending on the size of the value would deserialize 0L as an int.

@pschafhalter pschafhalter deleted the fix-python2-int-serialization branch May 18, 2018 23:03
fjetter pushed a commit to fjetter/arrow that referenced this pull request May 25, 2018
Fixes an issue where serialization turns integers into longs in Python 2.

```python
In [1]: import pyarrow as pa

In [2]: value = 1

In [3]: type(value)
Out[3]: int

In [4]: serialized = pa.serialize(value)

In [5]: deserialized = serialized.deserialize()

In [6]: type(deserialized)
Out[6]: long
```

Author: Peter Schafhalter <[email protected]>

Closes apache#2055 from pschafhalter/fix-python2-int-serialization and squashes the following commits:

7b96b67 <Peter Schafhalter> Fix bug with Python 3 C++ API
5d8ff26 <Peter Schafhalter> Add type checking in assert_equal
d5e5e5d <Peter Schafhalter> Fix python2 integer serialization bug
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants