[Python] Segfault in to_pandas() on batch from IPC stream in specific edge cases #41469
Comments
Thanks for the report!
Could you try creating an IPC file from the pyspark dataframe? (I don't know if pyspark provides the functionality for that.) Or can you convert the pyspark dataframe to pyarrow first (not going through pandas), and then save it? And something else you could try: does it still reproduce after a roundtrip to Parquet?
A quick attempt at reproducing this in pyarrow, which, as you also observed, doesn't crash:

```python
import pyarrow as pa

typ = pa.struct([
    pa.field("id", pa.int64()),
    pa.field("value", pa.list_(pa.list_(pa.list_(pa.list_(pa.list_(pa.float64()))))))
])
arr = pa.array([{"id": 0, "value": None}], typ)
arr.to_pandas()
```
Thanks for the suggestions. PySpark doesn't really support this, but I can hack it to make it do so.

No, after a roundtrip to Parquet the problem no longer occurs. To test, I modified this section of pyspark: https://github.com/apache/spark/blob/v3.5.1/python/pyspark/sql/pandas/serializers.py#L323-L324

Approximately normal:

```python
for batch in batches:
    pyarrow_table = pa.Table.from_batches([batch])
    yield [self.arrow_to_pandas(c) for c in pyarrow_table.itercolumns()]
```

With round trip in Parquet:

```python
for batch in batches:
    import tempfile
    import pyarrow.parquet

    with tempfile.TemporaryFile() as tempdir:
        pyarrow_table = pa.Table.from_batches([batch])
        pyarrow.parquet.write_table(pyarrow_table, tempdir)
        read_back_pyarrow_table = pyarrow.parquet.read_table(tempdir)
        yield [self.arrow_to_pandas(c) for c in read_back_pyarrow_table.itercolumns()]
```
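For what it's worth, the same normalization could presumably be done without touching disk. Below is a hedged sketch (not from the thread) that round-trips through an in-memory Parquet buffer using pyarrow's `BufferOutputStream` and `BufferReader` instead of a temporary file:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def parquet_roundtrip(batch: pa.RecordBatch) -> pa.Table:
    # Write the batch to an in-memory Parquet buffer and read it back.
    # Reading rebuilds the buffers, which appears to avoid the malformed
    # offsets coming out of the IPC stream discussed in this issue.
    table = pa.Table.from_batches([batch])
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)
    return pq.read_table(pa.BufferReader(sink.getvalue()))
```

This could then stand in for the temporary-file version in the modified serializer above.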
I managed to hack something in (my hack modifies https://github.com/apache/spark/blob/v3.5.1/python/pyspark/sql/pandas/serializers.py#L108-L113), and now it can be reproduced with just the attached file:

arrow_stream.txt
Wait... there is something else weird going on. This smaller reproducer doesn't always work, depending on the Python environment.
So it turns out the bug is also only reproducible when numpy is imported before pyarrow:

```python
import numpy as np
import pyarrow as pa

with open("/tmp/arrow_stream", "rb") as read_file:
    with pa.ipc.open_stream(read_file) as reader:
        schema = reader.schema
        for batch in reader:
            batch.to_pandas()

print("SUCCESS")
```

where the file is as in my previous comment: arrow_stream.txt
I managed to attach a debugger, so I can see a bit about why it's segfaulting. Ultimately the segfault is on [...]. For an IPC file created from Python [...]

Python code to create a similar IPC file:

```python
import pyarrow as pa

schema = pa.schema(
    [pa.field("value", pa.list_(pa.list_(pa.list_(pa.list_(pa.list_(pa.list_(pa.float64())))))))]
)
pyarrow_table = pa.Table.from_arrays([pa.array([None])], schema=schema)

with open("/tmp/arrow_stream2", "wb") as write_file:
    with pa.ipc.new_stream(write_file, schema) as writer:
        for batch in pyarrow_table.to_batches():
            writer.write_batch(batch)
```
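For comparison, a quick check (a sketch, assuming the `/tmp/arrow_stream2` file written above) that a stream produced by pyarrow itself reads back and converts cleanly:

```python
import pyarrow as pa

# Read the pyarrow-written stream back. Unlike the Spark-generated stream,
# this one is expected to carry valid offsets buffers on its empty child
# arrays, so validation and conversion should succeed.
with open("/tmp/arrow_stream2", "rb") as read_file:
    with pa.ipc.open_stream(read_file) as reader:
        for batch in reader:
            batch.validate(full=True)
            print(batch.to_pandas())
```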
Thanks, I can now reproduce it as well! I think you are right in the observation that this seems to be a problem with the data generated on the Java/Spark side (although it is still strange that it segfaults or not depending on whether numpy is imported first).

When reading your IPC stream file without converting to pandas and inspecting the data, we can see that it is indeed invalid:

```python
import pyarrow as pa

with pa.ipc.open_stream("../Downloads/arrow_stream.txt") as reader:
    batch = reader.read_next_batch()

arr = batch["value"]

>>> arr
<pyarrow.lib.ListArray object at 0x7fe352a8d1e0>
[
  null,
  null,
  null
]

# Validating / inspecting the parent array
>>> arr.validate(full=True)
>>> arr.offsets
<pyarrow.lib.Int32Array object at 0x7fe29daa0460>
[
  0,
  0,
  0,
  0
]
>>> arr.values
<pyarrow.lib.ListArray object at 0x7fe29d84c0a0>
[]

# Validating / inspecting the first child array
>>> arr.values.validate(full=True)
>>> arr.values.offsets
<pyarrow.lib.Int32Array object at 0x7fe29f3ef880>
<Invalid array: Buffer #1 too small in array of type int32 and length 1: expected at least 4 byte(s), got 0>
>>> arr.values.values
<pyarrow.lib.ListArray object at 0x7fe29f238760>
[]
```

So the offsets of the child array are missing. This child array has a length of 0, but, following the format, its offsets buffer still needs to have a length of 1.
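To illustrate that format requirement, here is a small check (a sketch using plain pyarrow, not from the thread) showing that even a length-0 list array built by pyarrow still carries a single offset entry:

```python
import pyarrow as pa

# An empty list-of-float64 array: length 0, but the offsets buffer still
# contains one int32 entry (the initial 0), as the columnar format requires.
empty_lists = pa.array([], type=pa.list_(pa.float64()))
print(len(empty_lists))          # 0
print(empty_lists.offsets)       # an Int32Array containing just 0
print(len(empty_lists.offsets))  # 1
```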
A different way to inspect the data, using nanoarrow:

```python
>>> import nanoarrow as na
>>> na.array(arr).inspect()
<ArrowArray list<element: list<element: list<element: list<element: l>
- length: 3
- offset: 0
- null_count: 3
- buffers[2]:
  - validity <bool[1 b] 00000000>
  - data_offset <int32[16 b] 0 0 0 0>
- dictionary: NULL
- children[1]:
  'element': <ArrowArray list<element: list<element: list<element: list<elemen>
  - length: 0
  - offset: 0
  - null_count: 0
  - buffers[2]:
    - validity <bool[0 b] >
    - data_offset <int32[0 b] >
  - dictionary: NULL
  - children[1]:
    'element': <ArrowArray list<element: list<element: list<element: double>>
    - length: 0
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data_offset <int32[0 b] >
    - dictionary: NULL
    - children[1]:
      'element': <ArrowArray list<element: list<element: double>>>
      - length: 0
      - offset: 0
      - null_count: 0
      - buffers[2]:
        - validity <bool[0 b] >
        - data_offset <int32[0 b] >
      - dictionary: NULL
      - children[1]:
        'element': <ArrowArray list<element: double>>
        - length: 0
        - offset: 0
        - null_count: 0
        - buffers[2]:
          - validity <bool[0 b] >
          - data_offset <int32[0 b] >
        - dictionary: NULL
        - children[1]:
          'element': <ArrowArray double>
          - length: 0
          - offset: 0
          - null_count: 0
          - buffers[2]:
            - validity <bool[0 b] >
            - data <double[0 b] >
          - dictionary: NULL
          - children[0]:
```
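The same missing buffer can also be spotted from the pyarrow side without triggering the conversion, for example by looking at the raw buffer sizes of the child array. This is a sketch, assuming `arr` as read from the stream in the earlier inspection:

```python
# Array.buffers() lists this array's own buffers first (validity bitmap,
# then the int32 offsets for a list array), followed by its children's
# buffers. For the length-0 child, the offsets buffer should still hold at
# least 4 bytes (one int32); in the bad stream it shows up as empty.
child = arr.values
for i, buf in enumerate(child.buffers()):
    print(i, buf.size if buf is not None else "missing")
```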
Related issue: #31396
And a more recent issue: #40038, with a fix in Arrow Java 16.0 for this (at least for strings; not sure if the lists were automatically fixed as well). Although the discussion was only about the C Data Interface, so for IPC they might still do the same as before (I didn't check the PR in detail).
I did try building a custom [...]. I will probably try to create a minimal reproducer for the bad IPC file with Java Arrow and without Spark.
We are having some discussion about this on Zulip chat, and the conclusion might be that the C++ library is generally forgiving about this and accepts it as input.
I suspect that the call to [...] in arrow/python/pyarrow/src/arrow/python/arrow_to_pandas.cc (lines 766 to 777 at b754d5a) is what goes wrong; probably we can just bail before that if [...].

I can't seem to reproduce the crash here, although perhaps it's my environment:

```python
import urllib.request

import numpy as np
import pyarrow as pa

with urllib.request.urlopen("https://github.com/apache/arrow/files/15181244/arrow_stream2.txt") as f:
    with open("/tmp/arrow_stream", "wb") as fout:
        fout.write(f.read())

with open("/tmp/arrow_stream", "rb") as read_file:
    with pa.ipc.open_stream(read_file) as reader:
        schema = reader.schema
        for batch in reader:
            batch.to_pandas()

print("SUCCESS")
#> SUCCESS
```

An experiment at creating something similar to see where some errors occur:
```python
import io

import numpy
import nanoarrow as na
from nanoarrow import ipc  # requires about-to-be-released nanoarrow for writer
import pyarrow as pa
from pyarrow import ipc as pa_ipc

bad_arr = na.c_array_from_buffers(
    na.list_(na.list_(na.float64())),
    3,
    [na.c_buffer([False, False, False], na.bool_()), na.c_buffer([0, 0, 0, 0], na.int32())],
    null_count=3,
    children=[
        na.c_array_from_buffers(
            na.list_(na.float64()),
            0,
            [None, None],
            children=[na.c_array([], na.float64())]
        )
    ]
)

# This will error
# pa.array(bad_arr)
# ArrowInvalid: ArrowArrayStruct contains null data pointer for a buffer with non-zero computed size

bad_batch = na.c_array_from_buffers(
    na.struct({"col": bad_arr.schema}),
    3,
    [],
    null_count=0,
    children=[bad_arr]
)

buf = io.BytesIO()
with ipc.StreamWriter.from_writable(buf) as writer:
    writer.write_stream(bad_batch)

with pa_ipc.open_stream(buf.getvalue()) as reader:
    batch = reader.read_next_batch()

bad_pyarr = batch.column(0)
bad_pyarr.values.offsets
#> <pyarrow.lib.Int32Array object at 0x10d463580>
#> <Invalid array: Buffer #1 too small in array of type int32 and length 1: expected at least 4 byte(s), got 0>
```
I just tried it and I also couldn't reproduce. So I tried creating a new stream in the same way as I did last time, and with the new stream I can reproduce again.

```python
import urllib.request

import numpy as np  # For some reason this needs to be imported before pyarrow to reproduce the segfault
import pyarrow as pa

with urllib.request.urlopen("https://github.com/user-attachments/files/17258296/arrow_stream_2024_10_04.txt") as f:
    with open("/tmp/arrow_stream", "wb") as fout:
        fout.write(f.read())

with open("/tmp/arrow_stream", "rb") as read_file:
    with pa.ipc.open_stream(read_file) as reader:
        schema = reader.schema
        print(schema)
        for batch in reader:
            print(batch.to_pandas())

print("SUCCESS")
```

This was a fresh Python venv using Python 3.10.14, then installing the latest versions of [...]
Describe the bug, including details regarding any error messages, version, and platform.
So far I've only been able to reproduce this case with pyspark, but I think the bug is probably on the Arrow side. The problem was introduced with #15210, and reverting this change still fixes the problem on the 16.0.0 release.

Reproduce

The smallest reproducer I've found is the following:

reproduce_pyspark.py.txt (it has a .txt extension because GitHub doesn't let me upload .py)

Versions:
Error is:
full_stdout.txt
A few things I've noticed: [...] to_pandas on it.

Component(s)
C++, Python, Java