
How to make a future dataframe API available? #79

Closed
rgommers opened this issue Jun 30, 2022 · 1 comment · Fixed by #156

rgommers commented Jun 30, 2022

This question got asked recently by @mmccarty (and others have brought it up before), so it's worth taking a stab at an answer. Note that this is slightly speculative, given that we only have fragments of a dataframe API rather than a mostly complete syntax + semantics.

A future API, or individual design elements of it, will certainly have (a) new API surface, and (b) backwards-incompatible changes compared to what dataframe libraries already implement. So how should it be made available?

Options include:

  1. In a separate namespace, à la .array_api in NumPy/CuPy,
  2. In a separate retrievable-only namespace, à la __array_namespace__ (see the sketch after this list),
  3. Behind an environment variable (NumPy has done this a couple of times, for example with __array_function__ and more recently with dtype casting rules changes),
  4. With a context manager,
  5. With a from __future__ import new_behavior style import (i.e., enabling new features on a per-module basis),
  6. As an external package, which may for example monkeypatch internals (added for completeness, not preferred).
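To make option 2 concrete: a minimal sketch of a retrievable-only namespace, mirroring how __array_namespace__ works for arrays. The method name __dataframe_namespace__ and its contents are illustrative assumptions here, not anything that has been specified:

 import types

 # Standard-compliant functions would live in this namespace; the lambda is
 # a stand-in for a real implementation.
 _STANDARD_NAMESPACE = types.SimpleNamespace(
     concat=lambda dataframes: dataframes[0],  # placeholder
 )

 class DataFrame:
     def __dataframe_namespace__(self, *, api_version=None):
         # The namespace is only reachable through this method, so the
         # library's top-level namespace gains no new API surface.
         return _STANDARD_NAMESPACE

Consuming code would then retrieve the namespace from the object it was handed (ns = df.__dataframe_namespace__()) rather than importing anything library-specific.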

One important difference between arrays and dataframes is that for the former we only have to think about functions, while for the latter we're dealing with methods on the main dataframe objects. Hiding/unhiding methods is a little trickier of course: it can be done based on an environment variable read at import time, but it's more awkward with a context manager.
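As a rough sketch of that environment-variable approach (the flag name DFLIB_API_STANDARD and the method body are made up for illustration), method exposure can be decided once, at import time:

 import os

 # Read once, when the module is imported.
 _API_STANDARD = os.environ.get("DFLIB_API_STANDARD", "0") == "1"

 class DataFrame:
     def __init__(self, data):
         self._data = list(data)

 if _API_STANDARD:
     # Attach the standard-compliant method only when the flag is set,
     # so the extra API surface stays hidden by default.
     def _unique(self):
         return sorted(set(self._data))
     DataFrame.unique = _unique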

For behavior it's kind of the opposite: likely not all code will work with new behavior, so granular control helps, and a context manager is probably better.

Experiences with a separate namespace for the array API standard

The short summary of this is:

  • there's a problem where we now have two array objects, and supporting both in a code base is cumbersome and requires bi-directional conversions.
  • a summary of this problem, and of the approaches taken in scikit-learn and SciPy to work around it, is given in Array API standard and Numpy compatibility array-api#400,
  • in NumPy the preferred longer-term solution is to make the main numpy namespace converge to the array API standard; this takes time because of backwards compatibility constraints, but it will avoid the "double namespace" problem and have multiple other benefits, for example solving long-standing issues that Numba, CuPy, etc. are running into.

Therefore, using a separate namespace to implement dataframe API standard features/compatibility should likely not be the preferred solution.

Using a context manager

Pandas already has a context manager, namely pandas.option_context. This is used for existing options, see pd.describe_option(). While most options relate to display, styling and I/O, some control behavior that is quite substantial and similar in kind to what we'd expect to see in a dataframe API standard. Examples:

  • mode.chained_assignment (raise, warn, or ignore)
  • mode.data_manager ("block" or "array")
  • mode.use_inf_as_null (bool)

It could be used similarly to currently available options, one option per feature:

 with pd.option_context('mode.casting_rules', 'api-standard'):
     do_stuff()

Or there could be a single option to switch to "API-compliant mode":

 with pd.option_context('mode.api_standard', True):
     do_stuff()

Or both of those together.

Question: do other dataframe libraries have a similar context manager?

Using a from __future__ import

It looks like it's possible to implement features with a from __future__ import itself, via import hooks (see Reference 3 below). That way the spelling would be uniform across libraries, which is nice. Alternatively, a from dflib.__future__ import X is easier (no import hooks); however, it runs into the problem also described in Ref. 3: it is not desirable to propagate options to nested scopes:

 from pandas.__future__ import api_standard_unique

 # should use the `unique` behavior described in the API standard
 df.unique()

 from other_lib import do_stuff

 # should NOT use the `unique` behavior described in the API standard,
 # because that other library is likely not prepared for that.
 do_stuff(df)

Now of course this scope propagation is also what a context manager does. However, the point of a from __future__ import, and of jumping through the hoops required to make it work (which is more esoteric than a context manager), is to gain a switch that is local to the Python module in which it is used.
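One way to get such a module-local switch without import hooks is for the library function to check the calling module's globals for an imported sentinel. A minimal sketch, using assumed names (api_standard_unique, the placeholder behaviors) and the CPython-specific sys._getframe:

 import sys

 # Sentinel that a hypothetical dflib.__future__ module would export.
 api_standard_unique = object()

 def _legacy_unique(values):
     return list(dict.fromkeys(values))  # old behavior (placeholder)

 def _standard_unique(values):
     return sorted(set(values))  # standard behavior (placeholder)

 def unique(values):
     # Pick behavior per calling module: only modules that imported the
     # sentinel into their globals get the new semantics, so the switch
     # does not propagate into other libraries' code.
     caller_globals = sys._getframe(1).f_globals
     if caller_globals.get("api_standard_unique") is api_standard_unique:
         return _standard_unique(values)
     return _legacy_unique(values)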

Comparing a context manager and a from __future__ import

For new functions, methods and objects, the two are pretty much equivalent, since they will only be used on purpose (the scope propagation issue above is irrelevant).

For changes to existing functions or methods, both will work too. The module-local behavior of a from __future__ import is probably preferred, because code that's imported from another library that happens to use the same functionality under the hood may not expect the different result/behavior.

For behavior changes there's an issue with the from __future__ import. The import hooks will rely on AST transforms, so there must be some syntax to trigger on. With something that's very implicit, like casting rules, there is no such syntax. So it seems like there will be no good way to toggle that behavior on a module-scope level.

My current impression

  • A separate namespace is not desired, and a separate dataframe object is really not desired,
  • An environment variable is easy to implement but pretty coarse - given the fairly extensive backwards-compatibility issues that are likely, it's probably not good enough,
  • A context manager is nicest for behavior, and fine for new methods/functions,
  • A from __future__ import xxx is perhaps best for adopting changes to existing functions or methods; it has a configurable level of granularity and is explicit, so it should be more robust there than a context manager.

References

  1. somewhat related discussion on dataframe namespaces: Dataframe namespaces #23
  2. How to expose API to downstream libraries? array-api#16
  3. https://stackoverflow.com/questions/29905278/using-future-style-imports-for-module-specific-features-in-python (by @shoyer)

rgommers commented Jul 4, 2022

Here's a summary of some of the feedback/discussion on this in a call last week:

There is a trade-off between what is easier for end users vs. for dataframe-consuming libraries vs. for dataframe implementers:

  • the from __future__ import solution is better for dataframe-consuming libraries, because they can switch gradually inside their own code base; during that switch they do not need to support two "modes of operation"; and the switch is decoupled from anything happening elsewhere (like in one of their dependencies). It is also easier for testing (no non-local state that controls behavior),
    • for a concrete example, see the change from integer division to true (float) division in the py2->py3 transition, enabled per module with from __future__ import division (see the snippet after this list),
  • the context manager (or a global setting) may be better for end users, because it gives them more control,
  • it should not matter much for dataframe implementers, because they'd likely need to have two implementations in parallel for quite a while anyway,
  • for a JIT compiler like Bodo, the context manager may be preferable (@ehariri may want to think about this a bit more). It looks like there needs to be some recognizable syntax or state to inspect, so an import hook could pose a challenge.
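For reference, the division precedent mentioned above: in Python 2 the future import changed the meaning of / for the importing module only.

 # Python 2, shown for the historical precedent:
 from __future__ import division

 print(3 / 2)   # 1.5 with the future import; 1 without it
 print(3 // 2)  # 1; floor division stays available via //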

We should have a better idea of how this will all work once we actually see how extensive the differences are between, e.g., pandas and a standardized dataframe object.

@shwina says that he expects cuDF to go with a separate dataframe object, because it will be hard to (for example) support a missing/optional index in the Cython implementation of the current dataframe class.

@vnlitvinov says that for Modin he'd probably prefer future imports.

Other points made:

  • Dataframes are a little different from arrays: consuming libraries are less important and there are far fewer of them; code written by end users matters more.
  • The pandas context manager seems to not be used much - at least there aren't many issues or feature requests indicating that it's used a lot.
  • A context manager can be made thread-safe, but that's work. It would be necessary though, to play well with for example multiprocessing or joblib (see the sketch after this list).
  • When talking about a migration path: DeprecationWarnings or FutureWarnings only make sense if there's a way for the author of the code that's causing them to update their code to make the warning go away.
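As a sketch of what a thread-safe (and asyncio-safe) option switch could look like, using contextvars from the standard library; the option name api_standard_mode is assumed for illustration:

 import contextvars
 from contextlib import contextmanager

 # Each thread (and each asyncio task) sees its own value, so enabling the
 # mode in one thread does not leak into others.
 _api_standard = contextvars.ContextVar("api_standard", default=False)

 @contextmanager
 def api_standard_mode(enabled=True):
     token = _api_standard.set(enabled)
     try:
         yield
     finally:
         _api_standard.reset(token)

 # Library internals would consult _api_standard.get() to pick behavior.

Note that this covers threads and async tasks; with multiprocessing, worker processes start fresh and would need the option set again explicitly.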
