
How to make a future dataframe API available? #79

Closed
rgommers opened this issue Jun 30, 2022 · 1 comment · Fixed by #156

rgommers commented Jun 30, 2022

This question got asked recently by @mmccarty (and others have brought it up before), so it's worth taking a stab at an answer. Note that this is slightly speculative, given that we only have fragments of a dataframe API rather than a mostly complete syntax + semantics.

A future API, or individual design elements of it, will certainly have (a) new API surface, and (b) backwards-incompatible changes compared to what dataframe libraries already implement. So how should it be made available?

Options include:

  1. In a separate namespace, à la .array_api in NumPy/CuPy,
  2. In a separate retrievable-only namespace, à la __array_namespace__ (see the sketch after this list),
  3. Behind an environment variable (NumPy has done this a couple of times, for example with __array_function__ and more recently with dtype casting rules changes),
  4. With a context manager,
  5. With a from __future__ import new_behavior style import (i.e., enabling new features on a per-module basis),
  6. As an external package, which may for example monkeypatch internals (added for completeness, not preferred).
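To make option 2 concrete: a minimal sketch of a retrievable-only namespace, mirroring how __array_namespace__ works for arrays. The method name __dataframe_namespace__ and its contents are illustrative assumptions here, not anything that has been specified:

 import types

 # Standard-compliant functions would live in this namespace; the lambda is
 # a stand-in for a real implementation.
 _STANDARD_NAMESPACE = types.SimpleNamespace(
     concat=lambda dataframes: dataframes[0],  # placeholder
 )

 class DataFrame:
     def __dataframe_namespace__(self, *, api_version=None):
         # The namespace is only reachable through this method, so the
         # library's top-level namespace gains no new API surface.
         return _STANDARD_NAMESPACE

Consuming code would then retrieve the namespace from the object it was handed (ns = df.__dataframe_namespace__()) rather than importing anything library-specific.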

One important difference between arrays and dataframes is that for the former we only have to think about functions, while for the latter we're dealing with methods on the main dataframe objects. Hiding/unhiding methods is a little trickier of course: it can be done based on an environment variable read at import time, but it's more awkward with a context manager.
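As a rough sketch of that environment-variable approach (the flag name DFLIB_API_STANDARD and the method body are made up for illustration), method exposure can be decided once, at import time:

 import os

 # Read once, when the module is imported.
 _API_STANDARD = os.environ.get("DFLIB_API_STANDARD", "0") == "1"

 class DataFrame:
     def __init__(self, data):
         self._data = list(data)

 if _API_STANDARD:
     # Attach the standard-compliant method only when the flag is set,
     # so the extra API surface stays hidden by default.
     def _unique(self):
         return sorted(set(self._data))
     DataFrame.unique = _unique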

For behavior it's kind of the opposite: likely not all code will work with new behavior, so granular control helps, and a context manager is probably better.

Experiences with a separate namespace for the array API standard

The short summary of this is:

  • there's a problem where we now have two array objects, and supporting both in a code base is cumbersome and requires bi-directional conversions.
  • a summary of this problem, and of the approaches taken in scikit-learn and SciPy to work around it, is given in Array API standard and Numpy compatibility array-api#400,
  • in NumPy the preferred longer-term solution is to make the main numpy namespace converge to the array API standard; this takes time because of backwards compatibility constraints, but it will avoid the "double namespace" problem and have multiple other benefits, for example solving long-standing issues that Numba, CuPy, etc. are running into.

Therefore, using a separate namespace to implement dataframe API standard features/compatibility should likely not be the preferred solution.

Using a context manager

Pandas already has a context manager, namely pandas.option_context. This is used for existing options, see pd.describe_option(). While most options relate to display, styling and I/O, some control behavior that is quite substantial and similar in kind to what we'd expect to see in a dataframe API standard. Examples:

  • mode.chained_assignment (raise, warn, or ignore)
  • mode.data_manager ("block" or "array")
  • mode.use_inf_as_null (bool)

It could be used similarly to currently available options, one option per feature:

 with pd.option_context('mode.casting_rules', 'api-standard'):
     do_stuff()

Or there could be a single option to switch to "API-compliant mode":

 with pd.option_context('mode.api_standard', True):
     do_stuff()

Or both of those together.

Question: do other dataframe libraries have a similar context manager?

Using a from __future__ import

It looks like it's possible to implement features with a from __future__ import itself, via import hooks (see Reference 3 below). That way the spelling would be uniform across libraries, which is nice. Alternatively, a from dflib.__future__ import X is easier (no import hooks); however, it runs into the problem also described in Ref. 3: it is not desirable to propagate options to nested scopes:

 from pandas.__future__ import api_standard_unique

 # should use the `unique` behavior described in the API standard
 df.unique()

 from other_lib import do_stuff

 # should NOT use the `unique` behavior described in the API standard,
 # because that other library is likely not prepared for that.
 do_stuff(df)

Now of course this scope propagation is also what a context manager does. However, the point of a from __future__ import, and of jumping through the hoops required to make it work (which is more esoteric than a context manager), is to gain a switch that is local to the Python module in which it is used.
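One way to get such a module-local switch without import hooks is for the library function to check the calling module's globals for an imported sentinel. A minimal sketch, using assumed names (api_standard_unique, the placeholder behaviors) and the CPython-specific sys._getframe:

 import sys

 # Sentinel that a hypothetical dflib.__future__ module would export.
 api_standard_unique = object()

 def _legacy_unique(values):
     return list(dict.fromkeys(values))  # old behavior (placeholder)

 def _standard_unique(values):
     return sorted(set(values))  # standard behavior (placeholder)

 def unique(values):
     # Pick behavior per calling module: only modules that imported the
     # sentinel into their globals get the new semantics, so the switch
     # does not propagate into other libraries' code.
     caller_globals = sys._getframe(1).f_globals
     if caller_globals.get("api_standard_unique") is api_standard_unique:
         return _standard_unique(values)
     return _legacy_unique(values)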

Comparing a context manager and a from __future__ import

For new functions, methods and objects, the two are pretty much equivalent, since they will only be used on purpose (the scope propagation issue above is irrelevant).

For changes to existing functions or methods, both will work too. The module-local behavior of a from __future__ import is probably preferred, because code that's imported from another library that happens to use the same functionality under the hood may not expect the different result/behavior.

For behavior changes there's an issue with the from __future__ import. The import hooks will rely on AST transforms, so there must be some syntax to trigger on. With something that's very implicit, like casting rules, there is no such syntax. So it seems like there will be no good way to toggle that behavior on a module-scope level.

My current impression

  • A separate namespace is not desired, and a separate dataframe object is really not desired,
  • An environment variable is easy to implement but pretty coarse - given the fairly extensive backwards-compatibility issues that are likely, it's probably not good enough,
  • A context manager is nicest for behavior, and fine for new methods/functions,
  • A from __future__ import xxx is perhaps best for adopting changes to existing functions or methods; it has a configurable level of granularity and is explicit, so it should be more robust there than a context manager.

References

  1. somewhat related discussion on dataframe namespaces: Dataframe namespaces #23
  2. How to expose API to downstream libraries? array-api#16
  3. https://stackoverflow.com/questions/29905278/using-future-style-imports-for-module-specific-features-in-python (by @shoyer)

rgommers commented Jul 4, 2022

Here's a summary of some of the feedback/discussion on this in a call last week:

There is a trade-off between what is easier for end users vs. for dataframe-consuming libraries vs. for dataframe implementers:

  • the from __future__ import solution is better for dataframe-consuming libraries, because they can switch gradually inside their own code base; during that switch they do not need to support two "modes of operation"; and the switch is decoupled from anything happening elsewhere (like in one of their dependencies). It is also easier for testing (no non-local state that controls behavior),
    • for a concrete example, see the change from integer division to true (float) division in the py2->py3 transition, enabled per module with from __future__ import division (see the snippet after this list),
  • the context manager (or a global setting) may be better for end users, because it gives them more control,
  • it should not matter much for dataframe implementers, because they'd likely need to have two implementations in parallel for quite a while anyway,
  • for a JIT compiler like Bodo, the context manager may be preferable (@ehariri may want to think about this a bit more). It looks like there needs to be some recognizable syntax or state to inspect, so an import hook could pose a challenge.
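For reference, the division precedent mentioned above: in Python 2 the future import changed the meaning of / for the importing module only.

 # Python 2, shown for the historical precedent:
 from __future__ import division

 print(3 / 2)   # 1.5 with the future import; 1 without it
 print(3 // 2)  # 1; floor division stays available via //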

We should have a better idea of how this will all work once we actually see how extensive the differences are between, e.g., pandas and a standardized dataframe object.

@shwina says that he expects cuDF to go with a separate dataframe object, because it will be hard to (for example) support a missing/optional index in the Cython implementation of the current dataframe class.

@vnlitvinov says that for Modin he'd probably prefer future imports.

Other points made:

  • Dataframes are a little different from arrays: consuming libraries are less important and there are far fewer of them; code written by end users matters more.
  • The pandas context manager seems to not be used much - at least there aren't many issues or feature requests indicating that it's used a lot.
  • A context manager can be made thread-safe, but that's work. It would be necessary though, to play well with for example multiprocessing or joblib (see the sketch after this list).
  • When talking about a migration path: DeprecationWarnings or FutureWarnings only make sense if there's a way for the author of the code that's causing them to update their code to make the warning go away.
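As a sketch of what a thread-safe (and asyncio-safe) option switch could look like, using contextvars from the standard library; the option name api_standard_mode is assumed for illustration:

 import contextvars
 from contextlib import contextmanager

 # Each thread (and each asyncio task) sees its own value, so enabling the
 # mode in one thread does not leak into others.
 _api_standard = contextvars.ContextVar("api_standard", default=False)

 @contextmanager
 def api_standard_mode(enabled=True):
     token = _api_standard.set(enabled)
     try:
         yield
     finally:
         _api_standard.reset(token)

 # Library internals would consult _api_standard.get() to pick behavior.

Note that this covers threads and async tasks; with multiprocessing, worker processes start fresh and would need the option set again explicitly.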
