PDEP-15: Reject PDEP-10 #58623

lithomas1 · 2024-05-07T21:56:25Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Just something quick and dirty I threw up that we should talk about tomorrow at the dev call.

lithomas1 · 2024-05-07T23:04:22Z

/preview

github-actions · 2024-05-07T23:05:48Z

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/58623/

lithomas1 · 2024-05-07T23:12:38Z

Ugh, looks like the bullets aren't rendering correctly.

WillAyd · 2024-05-07T23:17:35Z

Did I miss an official vote on rejecting this? I am not sure yet that I would want to reject, and am still leaning towards keeping in spite of some negative feedback

lithomas1 · 2024-05-07T23:21:51Z

Nope, just opening since I said I would in the discussion issue.

We'll still need a formal vote - I'm just kicking off the discussion here.

WillAyd · 2024-05-07T23:18:40Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+
+2) Many of the benefits presented in this PDEP can be materialized even with payrrow as an optional dependency.
+
+   For example, as detailed in PDEP-14, it is possible to create a new string data type with the same semantics


PDEP 14 does not change performance or memory savings if you do not have pyarrow installed

added a note in parentheses at the end of that sentence.

Did you push this up? I don't see anything in parentheses.

The way I am interpreting this now is "we don't need/care for pyarrow strings because we have always had a string data type using Python strings" - is that correct?

I updated the PDEP-15 text, and forgot to remove the PDEP-10 changes.

I've removed the PDEP-10 changes now.

WillAyd · 2024-05-07T23:22:43Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+The primary reasons for rejecting this PDEP are twofold:
+
+1) Requiring pyarrow as a dependency causes installation problems.
+   - Pyarrow does not fit or has a hard time fitting in space-constrained environments 


I think what we could learn from this process is what caused this to change our minds? These issues were discussed leading up to the acceptance of PDEP-10.

The way this is written I think reads more as "we discovered this after the fact" instead of "we decided that X amount of negative feedback on these points was enough to revert". I think there is some value to future PDEPs to set expectations around the latter

lithomas1 · 2024-05-08T15:29:27Z

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

xref #57073 (comment) for context

simonjayhawkins

Thanks @lithomas1 for making the updates needed to formally reject PDEP-10.

web/pandas/pdeps/0010-required-pyarrow-dependency.md

Dr-Irv · 2024-05-08T19:03:46Z

As discussed in dev meeting on 5/8/24, suggestion is to do a new PDEP that reverts PDEP-10, and keeps any parts we want to keep.

simonjayhawkins · 2024-05-08T19:08:35Z

I am not sure yet that I would want to reject, and am still leaning towards keeping in spite of some negative feedback

I'm now leaning towards approving the rejection. My approval of the original PDEP was based solely on improvements to default inference for other dtypes. Despite some recent comments about this, no discussion/clarification has followed on this topic. I'd need to see some positive evidence that the original PDEP-10 authors still intend to support delivering the promised enhancements in this area. Now that the implications of using pd.NA as a default has been discussed in more depth, I suspect that any improved inference would need a couple of dtype variants.

lithomas1 · 2024-05-08T21:27:44Z

As discussed in dev meeting on 5/8/24, suggestion is to do a new PDEP that reverts PDEP-10, and keeps any parts we want to keep.

Yep, I'm planning on updating this current PR to do that, so if anyone has any objections or whatever, we can still discuss here.

WillAyd · 2024-05-18T14:29:34Z

Minor note - do we need to rename this PR? Right now PDEP-10 shows twice on the website

lithomas1 · 2024-05-18T21:16:17Z

Yeah, I'll probably change the name to PDEP-15 once I get around to moving this to a separate PDEP (probably tomorrow).

I was travelling the past week, so didn't really have time then.

WillAyd · 2024-05-20T12:41:57Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+The primary reasons for rejecting this PDEP are twofold:
+
+1) Requiring pyarrow as a dependency causes installation problems.
+   - Pyarrow does not fit or has a hard time fitting in space-constrained environments 


Within the context of recent conversation I don't think this comment about AWS is true. AWS distributes an official pandas image for lambda which already includes pyarrow, pandas, and NumPy. This is all required by their own "AWS SDK on pandas" library.

The issue more finely scoped I think is that the default wheel installation via pip into a lambda image exceeds the 256 MB limit. Either using the official AWS provided image or using miniconda should not exceed the space limits

WillAyd · 2024-05-20T12:46:35Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+
+2) Many of the benefits presented in this PDEP can be materialized even with payrrow as an optional dependency.
+
+   For example, as detailed in PDEP-14, it is possible to create a new string data type with the same semantics


Did you push this up? I don't see anything in parentheses.

The way I am interpreting this now is "we don't need/care for pyarrow strings because we have always had a string data type using Python strings" - is that correct?

Dr-Irv

My main comment is that PDEP-10 should be minimally modified, and that PDEP-15 has all the discussion about why we did the rejection.

web/pandas/pdeps/0010-required-pyarrow-dependency.md

web/pandas/pdeps/0015-do-not-require-pyarrow.md

WillAyd · 2024-05-20T19:15:46Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+   While both of these reasons are mentioned in the drawbacks section of this PDEP, at the time of the writing
+of the PDEP, we underestimated the impact this would have on users, and also downstream developers.
+
+2) Many of the benefits presented in this PDEP can be materialized even with payrrow as an optional dependency.


I personally don't find this point very convincing. Saying Many of the benefits but then following it up with one bullet point seems to miss the mark - what are the other many benefits that we don't need pyarrow for? Without pyarrow users are forgoing:

High performance string operations

Direct string creation from I/O routines (i.e. no intermediate copies)

Zero copy data exchange through Arrow C Data Interface

Performant, memory efficient, and consistent NA handling

On the larger roadmap of pandas this moves us away from tighter Arrow integration, which means we move further away from Arrow compute algorithms / joins and the larger ecosystem of tools that includes streaming, query optimizers, planners, data engines, etc...

I think this argument in its current form is saying "we don't need a car because we have a horse and buggy"

I personally don't find this point very convincing. Saying Many of the benefits but then following it up with one bullet point seems to miss the mark - what are the other many benefits that we don't need pyarrow for? Without pyarrow users are forgoing:

High performance string operations

Direct string creation from I/O routines (i.e. no intermediate copies)

Zero copy data exchange through Arrow C Data Interface

Performant, memory efficient, and consistent NA handling

On the larger roadmap of pandas this moves us away from tighter Arrow integration, which means we move further away from Arrow compute algorithms / joins and the larger ecosystem of tools that includes streaming, query optimizers, planners, data engines, etc...

I think this argument in its current form is saying "we don't need a car because we have a horse and buggy"

In PDEP-10, there are 3 benefits listed

pyarrow strings (possible to provide users this benefit without making pyarrow required)

Nested datatypes (can't have this without arrow, but this is a bit niche)

Interopability (the alternative is the dataframe interchange protocol, which is more widely adopted at the moment. Not sure about the zero-copy stuff for that, though. I think it also might be possible to implement Arrow C Data interface support without taking on a hard dep on pyarrow)

Also, the primary beneficiary of this is other dataframe libraries (as opposed to us).

So, IMO, this argument is accurate, in that most of the benefits in PDEP-10 can be made possible (for those user that have pyarrow installed) without making pyarrow required.

The future benefits of Arrow are very compelling, but decisions on making a dependency required should be based on immediate and not future benefits. Like I said before, it is easy to reconsider this decision in a years time if those future benefits are materialize.

If you think points 1 and 3 are possible without pyarrow then the alternatives for that should be laid out in this PDEP, at least at a super high level. I'm assuming point 1 refers to the nanoarrow POC I was sharing; point 3 requires reimplementing the conversions that pyarrow already has. (I personally don't think building either of those from scratch is a good long term solution but it can at least be discussed)

For point 2 how do you know those are niche applications? Its easy to dismiss things that don't exist today as not worthwhile, but I get the feeling that there could be plenty of use cases for the aggregate types, since they have a natural fit with many of the Python containers.

On interoperability the long term prospects for the dataframe interchange protocol seem dubious, and we have even discussed moving that out of pandas (see #56732).

Also, the primary beneficiary of this is other dataframe libraries (as opposed to us).

The Arrow interchange protocol can be used by any library that needs to work with Arrow data - there is no limit to it being used by other dataframe libraries. It provides a standardized API so that third parties don't need to hack into our internals, which is a direct benefit for us. It also works in two directions - we can be a consumer just as much as a producer.

Nested datatypes (can't have this without arrow, but this is a bit niche)

Also wanted to point out that arrow has a decimal128 and decimal256 type which is especially useful for financial calculations where floating point inaccuracies cannot be tolerated, and the arrow decimal types are an extremely significant improvement over using object.

Sure, will update and add a note in the PDEP when I get time again.

Co-authored-by: Irv Lustig <[email protected]>

Dr-Irv

I'm fine with this PDEP, although I'm unsure whether using language as "we, the core team", should appear in a PDEP.

github-actions · 2024-06-22T00:06:07Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Dr-Irv · 2024-06-24T14:21:04Z

@lithomas1 I think you should update the PDEP based on comments from @WillAyd so we can move this forward and have a vote.

phofl · 2024-06-24T15:35:46Z

My vote on this will heavily depend if we can come to an agreement on PDEP 14

lithomas1 · 2024-06-25T00:28:34Z

Thanks for the reminder, this slipped off my radar.

My vote on this will heavily depend if we can come to an agreement on PDEP 14

Yeah, ideally, we'd vote on these at the same time.
I haven't been involved in maintenance for a while, so no idea how far along that one is.

…ct-pdep10

lithomas1 · 2024-06-26T04:21:41Z

Marking as draft as although I've pushed up another commit, I haven't fully finished and refined my thoughts yet.

I also need to fix the formatting issues in this PDEP.

Dr-Irv · 2024-07-24T18:19:25Z

@lithomas1 now that PDEP-14 is accepted, we don't need to require pyarrow for pandas 3.0, so we should resolve this PDEP and get a vote going.

lithomas1 · 2024-08-27T13:32:35Z

I was out for the summer, but I think this PDEP should be updated such that all comments were addressed.

Dr-Irv

Thanks for the changes.

@pandas-dev/pandas-core Please provide any additional comments so we can move this to a vote. If no additional comments are requesting any changes within 15 days, we'll open up the voting period.

WillAyd · 2024-08-27T21:47:39Z

web/pandas/pdeps/0015-do-not-require-pyarrow.md

+
+1) Requiring pyarrow as a dependency causes installation problems.
+
+   - Pyarrow does not fit or has a hard time fitting in space-constrained environments


I think this really exaggerates the problem on AWS. AWS has long distributed its own AWS SDK for pandas library (formerly called awswrangler) which uses pyarrow to better integrate with many of its services (ex: pyarrow is used for high performance data exports to/from AWS Redshift, rather than using a traditional ODBC driver)

The issue here is really just scoped to users that don't want to use the AWS Lambda Managed Layer, but instead want to build their environment from scratch, assumedly without the AWS SDK for pandas. Even then, it may not be a current issue given the drastic reductions in the binary size of pyarrow through both ~~pip and~~ conda

I think it's only conda that's been drastically reduced - install pyarrow and pandas in a fresh venv and it already hits 295 MB. And the wheel size on PyPI is still ~40MB, so we'd be noticeably increasing the load on PyPI by making PyArrow required

My current stance is: if the desire was to make PyArrow dtypes the default for all dtypes, then ok, maybe that'd be justified. But probably not if it's just for the sake of strings, for which I think that

recommending pip install pandas[pyarrow] in all instructions

auto-inferring pyarrow strings if pyarrow is installed

should be enough

Sounds good to me.

I'll add this to the PDEP.

And the wheel size on PyPI is still ~40MB, so we'd be noticeably increasing the load on PyPI by making PyArrow required

Out of curiosity - is this something that the PyPi maintainers have mentioned as a problem?

If so, the plot twist is then that we should really be pushing AWS users towards using the pre-provided image, rather than building their own from scratch. I believe that would forgo hitting PyPi altogether, and even if it doesn't, it is still smaller than what people are building themselves (see #54466 (comment))

If so, the plot twist is then that we should really be pushing AWS users towards using the pre-provided image, rather than building their own from scratch. I believe that would forgo hitting PyPi altogether, and even if it doesn't, it is still smaller than what people are building themselves (see #54466 (comment))

Wouldn't we then be asking our users to be dependent on the AWS people updating their pre-provided image whenever we created a new release of pandas or the arrow team did a new release of pyarrow ?

Yea I think that is generally how lambda works, even with the overall Python version. It's not quite an "anything goes" type of execution environment

There's this post from 2021, in which they write about the monthly PyPI bill being almost 2 million US dollars

https://dustingram.com/articles/2021/04/14/powering-the-python-package-index-in-2021/

There's this post from 2021, in which they write about the monthly PyPI bill being almost 2 million US dollars

https://dustingram.com/articles/2021/04/14/powering-the-python-package-index-in-2021/

I'm not too worried about this - PyPI's bill is subsidized anyways. Of course, it is nice to reduce our load on PyPI, but I don't think we are close to being the worse offenders here (those would probably be something like tensorflow and pytorch), and it's important to keep in mind that a lot of people have pyarrow installed for whatever reason anyways.

I would mostly only be concerned about an increase in size in our own pandas package (since PyPI does limit the total size of all packages uploaded by a project, and raising the limit is a manual and annoying process)

I've made another go at clarifying point1, incorporating the feedback here.

PTAL @WillAyd @MarcoGorelli

Thanks for the feedback, but I still disagree with calling the AWS-provided layer a "workaround" - essentially it is the canonical approach to solve an issue that has existed for years with pandas and lambda functions.

A quick google search for something how "how to run pandas on aws lambda" yields a slew of conflicting results on how to get this to work. If the AWS-provided layer is a workaround, then what are we calling the proper approach?

https://stackoverflow.com/questions/36054976/pandas-aws-lambda
https://stackoverflow.com/questions/53824556/how-to-install-numpy-and-pandas-for-aws-lambdas
https://medium.com/swlh/how-to-add-python-pandas-layer-to-aws-lambda-bab5ea7ced4f
https://medium.com/@johnnymao/how-to-use-pandas-in-your-aws-lambda-function-c3ce29f6f189
https://medium.com/@shimo164/lambda-layer-to-use-numpy-and-pandas-in-aws-lambda-function-8a0e040faa18
https://www.youtube.com/watch?v=1UDEp90S9h8

web/pandas/pdeps/0015-do-not-require-pyarrow.md

WillAyd · 2024-08-27T21:59:45Z

web/pandas/pdeps/0015-do-not-require-pyarrow.md

+       that support Arrow in a zero-copy manner.
+
+     - Support for the Arrow C Data interface in pandas and other libraries in the ecosystem is still very new, though,
+       (support in pandas itself was only added as of pandas 2.2), and the dataframe interchange protocol, which allows


I don't know that this is a true statement about the dataframe interchange protocol being better supported in other libraries. Very recently, the Arrow C Data interface has been adopted by a ton of new libraries.

@MarcoGorelli is probably the most knowledgeable on this overall

yup - pandas also requires pyarrow in some cases for the interchange protocol anyway

pandas/pandas/core/interchange/from_dataframe.py

Lines 461 to 469 in 56ea76a

assert length is not None, "`length` must be specified for a bit-mask buffer."

pa = import_optional_dependency("pyarrow")

arr = pa.BooleanArray.from_buffers(

pa.bool_(),

length,

[None, pa.foreign_buffer(buffer.ptr, length)],

offset=offset,

)

return np.asarray(arr)

Besides, the interchange protocol doesn't support nested data types nor dates, and the progress on the project seems to have completely stalled. I'd suggest not mentioning it at all if it's OK

Cool, if I understand corrrectly, though, the main benefit is to other dataframe libraries, as of right now.
(doing a quick search: it seems like lots of downstream libraries are using this as a mechanism to gain support for polars and co. in addition to pre-existing pandas support)

Are there any libraries that didn't support pandas before, but now support pandas due to the interchange protocol?
This could be something good to highlight.

Are there any libraries that didn't support pandas before, but now support pandas due to the interchange protocol?

nope, there's been no such case

…ct-pdep10

WillAyd · 2024-09-03T16:04:33Z

web/pandas/pdeps/0015-do-not-require-pyarrow.md

+     arrays to those dtypes by default, without forcing a pyarrow requirement on users,
+     as there is no Python/numpy equivalent for these dtypes).
+
+   - Interoperability


I think @MarcoGorelli point was to remove the Interoperability section entirely, but if that's not true then I don't understand the point this is trying to make in its current form.

The beneficiary of the Arrow C Data interface is not just other dataframe libraries - a decent listing can be found here:

apache/arrow#39195 (comment)

For a direct benefit to pandas, it helps with I/O to boost performance, ensure proper data types, and reduce the amount of code burden. We have already seen this benefit with ADBC drivers, and from the link above, it looks like there is some near-term potential for it to help Excel I/O via fastexcel

Sure, will take it out the next go.

PDEP-10: Change status to rejected

98eb85a

lithomas1 added the PDEP pandas enhancement proposal label May 7, 2024

WillAyd reviewed May 7, 2024

View reviewed changes

lithomas1 marked this pull request as ready for review May 8, 2024 15:09

lithomas1 requested a review from datapythonista as a code owner May 8, 2024 15:09

simonjayhawkins reviewed May 8, 2024

View reviewed changes

web/pandas/pdeps/0010-required-pyarrow-dependency.md Outdated Show resolved Hide resolved

lithomas1 changed the title ~~PDEP-10: Change status to rejected~~ PDEP-15: Change status to rejected May 19, 2024

Split out into new pdep

5e451db

WillAyd reviewed May 20, 2024

View reviewed changes

Dr-Irv changed the title ~~PDEP-15: Change status to rejected~~ PDEP-15: Change status of PDEP-10 to rejected May 20, 2024

Dr-Irv requested changes May 20, 2024

View reviewed changes

WillAyd reviewed May 20, 2024

View reviewed changes

lithomas1 force-pushed the reject-pdep10 branch from a5905d4 to 46a0cea Compare May 21, 2024 05:28

remove pdep-10 changes

2af5632

lithomas1 force-pushed the reject-pdep10 branch from 46a0cea to 2af5632 Compare May 21, 2024 05:29

lithomas1 and others added 2 commits May 20, 2024 22:30

Apply suggestions from code review

6e4efe5

Co-authored-by: Irv Lustig <[email protected]>

Apply suggestions from code review

45754bf

Co-authored-by: Irv Lustig <[email protected]>

Dr-Irv reviewed May 21, 2024

View reviewed changes

github-actions bot added the Stale label Jun 22, 2024

lithomas1 removed the Stale label Jun 26, 2024

lithomas1 added 2 commits June 25, 2024 21:18

update a little

7833637

Merge branch 'reject-pdep10' of github.com:lithomas1/pandas into reje…

1ccca56

…ct-pdep10

lithomas1 marked this pull request as draft June 26, 2024 04:21

lithomas1 added 2 commits July 28, 2024 12:04

minor update

1b3bdee

Merge branch 'main' of github.com:pandas-dev/pandas into reject-pdep10

e52e2e7

lithomas1 changed the title ~~PDEP-15: Change status of PDEP-10 to rejected~~ PDEP-15: Reject PDEP-10 Aug 27, 2024

Merge branch 'main' into reject-pdep10

fef0c92

Dr-Irv approved these changes Aug 27, 2024

View reviewed changes

WillAyd requested changes Aug 27, 2024

View reviewed changes

lithomas1 marked this pull request as ready for review August 29, 2024 11:16

lithomas1 added 2 commits August 29, 2024 23:20

small updates

e5de753

Merge branch 'reject-pdep10' of github.com:lithomas1/pandas into reje…

c159851

…ct-pdep10

lithomas1 requested review from WillAyd and MarcoGorelli September 2, 2024 18:47

WillAyd reviewed Sep 3, 2024

View reviewed changes


		2) Many of the benefits presented in this PDEP can be materialized even with payrrow as an optional dependency.

		For example, as detailed in PDEP-14, it is possible to create a new string data type with the same semantics


		1) Requiring pyarrow as a dependency causes installation problems.

		- Pyarrow does not fit or has a hard time fitting in space-constrained environments

	assert length is not None, "`length` must be specified for a bit-mask buffer."
	pa = import_optional_dependency("pyarrow")
	arr = pa.BooleanArray.from_buffers(
	pa.bool_(),
	length,
	[None, pa.foreign_buffer(buffer.ptr, length)],
	offset=offset,
	)
	return np.asarray(arr)

PDEP-15: Reject PDEP-10 #58623

Are you sure you want to change the base?

PDEP-15: Reject PDEP-10 #58623

Conversation

lithomas1 commented May 7, 2024

lithomas1 commented May 7, 2024

github-actions bot commented May 7, 2024

lithomas1 commented May 7, 2024

WillAyd commented May 7, 2024

lithomas1 commented May 7, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lithomas1 commented May 8, 2024

simonjayhawkins left a comment

Choose a reason for hiding this comment

Dr-Irv commented May 8, 2024

simonjayhawkins commented May 8, 2024

lithomas1 commented May 8, 2024

WillAyd commented May 18, 2024

lithomas1 commented May 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dr-Irv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dr-Irv left a comment

Choose a reason for hiding this comment

github-actions bot commented Jun 22, 2024

Dr-Irv commented Jun 24, 2024

phofl commented Jun 24, 2024

lithomas1 commented Jun 25, 2024

lithomas1 commented Jun 26, 2024

Dr-Irv commented Jul 24, 2024

lithomas1 commented Aug 27, 2024

Dr-Irv left a comment

Choose a reason for hiding this comment

WillAyd Aug 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MarcoGorelli Aug 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MarcoGorelli Aug 29, 2024 • edited Loading

Choose a reason for hiding this comment

WillAyd Sep 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WillAyd Aug 27, 2024 •

edited

Loading

MarcoGorelli Aug 28, 2024 •

edited

Loading

MarcoGorelli Aug 29, 2024 •

edited

Loading

WillAyd Sep 3, 2024 •

edited

Loading