feat: Support Arrow PyCapsule Interface for export #786

Merged (8 commits) on Aug 14, 2024

Conversation

MarcoGorelli
Member

@MarcoGorelli MarcoGorelli commented Aug 13, 2024

closes #784

@kylebarron fancy taking a look to see if this is what needs doing / if I've understood the assignment?

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

  • Related issue #
  • Closes #

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

@github-actions github-actions bot added the enhancement New feature or request label Aug 13, 2024

@kylebarron kylebarron left a comment

Yeah I think that makes sense! There's not much you can do if the underlying dataframe doesn't support it. If you had some way to access an arrow table from the underlying dataframe you could do something more, but I think this is good

@MarcoGorelli
Member Author

cool, thanks!

> If you had some way to access an arrow table from the underlying dataframe

we do have narwhals.DataFrame.to_arrow, which returns a pyarrow table - is that what you meant? If so, what else would you suggest adding?

@kylebarron

In that case I'd suggest

def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
    try:
        return self._compliant_frame._native_frame.__arrow_c_stream__(
            requested_schema=requested_schema
        )
    except AttributeError:
        pyarrow_table = self.to_arrow()
        return pyarrow_table.__arrow_c_stream__(requested_schema=requested_schema)

So in the first case, we can use the source's implementation of the pycapsule interface and pyarrow doesn't need to be installed, while in the second case we can still ensure the method never raises an exception.
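
A minimal, dependency-free sketch of the fallback described above. `FrameWrapper`, `NativeWithCapsule`, `NativeWithoutCapsule`, and `ArrowTableStandIn` are hypothetical stand-ins (not Narwhals or PyArrow classes), and the returned strings stand in for real PyCapsule objects.

```python
class NativeWithCapsule:
    """Hypothetical native frame that implements the PyCapsule interface."""

    def __arrow_c_stream__(self, requested_schema=None):
        return "capsule-from-native"


class NativeWithoutCapsule:
    """Hypothetical native frame with no PyCapsule support."""


class ArrowTableStandIn:
    """Stand-in for the pyarrow table that `to_arrow` would return."""

    def __arrow_c_stream__(self, requested_schema=None):
        return "capsule-from-to_arrow"


class FrameWrapper:
    """Hypothetical wrapper playing the role of the Narwhals DataFrame."""

    def __init__(self, native):
        self._native = native

    def to_arrow(self):
        return ArrowTableStandIn()

    def __arrow_c_stream__(self, requested_schema=None):
        # Prefer the native implementation when it exists.
        native_export = getattr(self._native, "__arrow_c_stream__", None)
        if native_export is not None:
            return native_export(requested_schema=requested_schema)
        # Otherwise fall back to converting through an Arrow table.
        return self.to_arrow().__arrow_c_stream__(requested_schema=requested_schema)
```

This sketch looks the method up with `getattr` rather than catching `AttributeError`, so an `AttributeError` raised inside the native implementation is not mistaken for a missing method. That is a design choice of the sketch, not something stated in the thread.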

@MarcoGorelli
Member Author

thanks! though I think the user could call this themselves? e.g.

df: nw.DataFrame
try:
    result = df.__arrow_c_stream__(requested_schema=requested_schema)
except AttributeError:
    result = df.to_arrow().__arrow_c_stream__(requested_schema=requested_schema)

?
I think that'd be more explicit; I'm not totally sure we should be calling to_arrow on behalf of the user. By 'user', I mean the library developer using Narwhals: I'd be inclined to leave it up to them whether or not to fall back to PyArrow.

@kylebarron

Well, one of my primary arguments for the pycapsule interface is that it allows an ecosystem of data producers and consumers to interoperate without any knowledge of the other, solely by looking for an __arrow_c_stream__ dunder method. Calling .to_arrow() would indeed be more explicit, but it would require library consumers to know about narwhals, which I'd argue in the general case is not true. E.g. pyarrow.table() only knows to check for __arrow_c_stream__.
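
To illustrate the consumer side of that argument, here is a small sketch in which the consumer relies only on the presence of the dunder method. `import_arrow_stream` and `AnyProducer` are hypothetical names, and the returned string stands in for a real PyCapsule.

```python
def import_arrow_stream(obj, requested_schema=None):
    """Hypothetical consumer entry point: relies only on the dunder method."""
    export = getattr(obj, "__arrow_c_stream__", None)
    if export is None:
        msg = f"{type(obj).__name__} does not implement __arrow_c_stream__"
        raise TypeError(msg)
    return export(requested_schema=requested_schema)


class AnyProducer:
    """Stand-in producer; the consumer never needs to know its type."""

    def __arrow_c_stream__(self, requested_schema=None):
        return "capsule"


capsule = import_arrow_stream(AnyProducer())
```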

@MarcoGorelli
Member Author

I was thinking more of the Vegafusion case - I think it's better for them if they choose to explicitly call to_arrow. Otherwise they might be calling __arrow_c_stream__ all over the place, thinking it's cheap, whereas it would've been better to do a single to_arrow upfront 😇

I think any library developer using Narwhals (e.g. vegafusion) would and should know about Narwhals, whereas lower-level libraries like PyArrow shouldn't:

  • PyArrow can just check for __arrow_c_stream__ (as it currently does)
  • Vegafusion can choose whether and when to call to_arrow before accessing __arrow_c_stream__

@MarcoGorelli
Member Author

In any case, as they often say, "in open source, 'no' is temporary but 'yes' is forever" - doubly so with our stable API policy 😆

So, as the current implementation looks good to you, I'd say - let's start with that, we can always loosen it later if necessary

Thanks for your review and input, much appreciated 🙏!

@kylebarron

> Otherwise they might be calling __arrow_c_stream__ all over the place, thinking it's cheap, whereas it would've been better to do a single to_arrow upfront

As a general note, consumers can't do this because consumers don't have a way to know whether the source is a table that already exists in memory or whether it's a stream that can only be called once. E.g. a pyarrow RecordBatchReader is a stream and you can only call __arrow_c_stream__ once. So for a consumer like vegafusion, it would be important for it to import all the data once and then operate as it needs to on it.
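
A dependency-free sketch of why a consumer should import the data once. `OneShotStream` and `materialize` are hypothetical names; the iterator stands in for a PyCapsule, and raising `RuntimeError` on a second export is an assumption of this sketch rather than documented PyArrow behavior.

```python
class OneShotStream:
    """Stand-in for a stream source that can be exported only once."""

    def __init__(self, batches):
        self._batches = iter(batches)
        self._consumed = False

    def __arrow_c_stream__(self, requested_schema=None):
        if self._consumed:
            raise RuntimeError("stream was already consumed")
        self._consumed = True
        return self._batches  # stand-in for a PyCapsule


def materialize(source):
    """Hypothetical consumer helper: export once, keep the data in memory."""
    return list(source.__arrow_c_stream__())


stream = OneShotStream([[1, 2], [3]])
batches = materialize(stream)  # import everything up front
total_rows = sum(len(batch) for batch in batches)  # reuse freely afterwards
```

Because the consumer cannot tell a one-shot stream from an in-memory table, it calls `__arrow_c_stream__` exactly once and works on the materialized copy from then on.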

@MarcoGorelli
Member Author

Ah nice, thanks for explaining!

In that case, I'm leaning more towards your suggestion - it would also let us support this for pandas versions prior to 2.2 that can still be converted to a pyarrow table

    raise ModuleNotFoundError(msg) from exc
if parse_version(pa.__version__) < (14, 0):  # pragma: no cover
    msg = f"PyArrow>=14.0.0 is required for `__arrow_c_stream__` for object of type {type(native_series)}"
    raise ModuleNotFoundError(msg)
ca = pa.chunked_array([self.to_arrow()])

I think this might require pyarrow 15

Member Author

in pandas the requirement is PyArrow 14+ (I also just ran the tests with pyarrow 13 and 14 - the former fails, the latter passes)

Member Author

@MarcoGorelli MarcoGorelli Aug 13, 2024

ah sorry, that's for DataFrame. looks like it's even PyArrow 16+ for chunkedarray?

It was added to pa.chunked_array in a later release, yes. I think it was here: apache/arrow#40818
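
A rough sketch of the kind of version gate being discussed. This `parse_version` is a hypothetical stand-in written for illustration (the helper used in the diff may behave differently), and the `(16, 0)` minimum is taken from the guess in this thread, not from a verified changelog entry.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Hypothetical helper: '16.1.0' -> (16, 1, 0); stops at non-numeric parts."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def supports_chunked_array_export(pyarrow_version: str) -> bool:
    """Assumes the (16, 0) minimum discussed in this thread."""
    return parse_version(pyarrow_version) >= (16, 0)
```

Comparing tuples avoids the pitfalls of comparing version strings directly, where "9.0" would sort after "16.0".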

@FBruzzesi
Member

FBruzzesi commented Aug 13, 2024

Wow I am learning a bunch from this PR 🙌🏼

@MarcoGorelli probably worth adding these methods to the API docs as well 😇?!

@MarcoGorelli
Member Author

thanks Kyle for your help!

cool, let's ship this 🚒

Successfully merging this pull request may close these issues.

[Enh]: Pass through the Arrow PyCapsule Interface
3 participants