feat: make some (most) dependencies optional? #7844

Closed
1 task done
MarcoGorelli opened this issue Dec 24, 2023 · 10 comments · Fixed by #7878
Labels
feature Features or general enhancements

Comments

@MarcoGorelli

Is your feature request related to a problem?

I'd like to use the ibis API to work with a pandas dataframe

Describe the solution you'd like

To be able to do just that, without having to install:

  • google cloud storage
  • sqlalchemy

and so many more

I may have misunderstood (in which case, apologies), but are these really required dependencies?

What version of ibis are you running?

7.2.0

What backend(s) are you using, if any?

duckdb

Code of Conduct

  • I agree to follow this project's Code of Conduct
@MarcoGorelli MarcoGorelli added the feature Features or general enhancements label Dec 24, 2023
@cpcloud
Member

cpcloud commented Dec 24, 2023

Hi @MarcoGorelli 👋🏻!

How did you install Ibis?

Most of Ibis's dependencies are optional.

The following dependencies are listed as required:

atpublic = ">=2.3,<5"  # __all__ handling
bidict = ">=0.22.1,<1" # `.cache()` implementation
filelock = ">=3.7.0,<4"  # this should definitely be optional
multipledispatch = ">=0.6,<2"  # required for various internals
numpy = ">=1,<2" # required for `execute()`
pandas = ">=1.2.5,<3" # required for `execute()`
parsy = ">=2,<3" # required for type parsing, might be possible to make optional
pins = { version = ">=0.8.3,<1", extras = ["gcs"] } # see note below
pyarrow = ">=2,<15" # required for `to_pyarrow()`
pyarrow-hotfix = ">=0.4,<1" # security patch for pyarrow
python-dateutil = ">=2.8.2,<3" # timezones
pytz = ">=2022.7" # timezones, yes there are two libraries for timezone handling because users can provide both
rich = ">=12.4.4,<14" # interactive mode table formatting
sqlglot = ">=18.12.0,<21" # general structured sql manipulation and pretty printing
toolz = ">=0.11,<1" # various python data structure utilities

# tons more below, all related to backends or optional features

Notes

sqlalchemy

This shouldn't be a required dependency I believe, so we can look into making that optional. In 9.0.0, it's likely that we will remove sqlalchemy from the codebase altogether, so if it's a lot of effort to make it optional we'll probably wait until that refactor is in a release. Hopefully that is okay!

pins

This one is a bit trickier.

We use GCS to store our example data. pins is the library that manages pulling down that data for us.

The tricky bit is the UX here: we want pip install ibis-framework[some-backend] to include the ability to work with example data out of the box without users having to type pip install ibis-framework[some-backend,examples].

I'm not sure how we can make the installation of Ibis friendly to end users regarding examples, while avoiding the google cloud storage dependency (a transitive dependency of Ibis via pins via its gcs extra).

One option is to simply make examples an extra, but we think that has the potential downside of users not being able to try ibis out right out of the box with real data.

Open to suggestions!

@cpcloud
Member

cpcloud commented Dec 24, 2023

I've opened #7845 to move filelock to the test dependency group, which should prevent its installation when using pip install ibis-framework[...]

@cpcloud
Member

cpcloud commented Dec 24, 2023

There are a few pieces of information that suggest sqlalchemy is not a required dependency:

  1. sqlalchemy isn't listed as required in pyproject.toml
  2. After running import ibis and inspecting sys.modules, I don't see any loaded modules with "sqlalchemy" in their names, even after importing several of our non-SQL backends:
>>> import sys
>>> sum('sqlalchemy' in k for k in sys.modules.keys())
0
>>> import ibis
>>> sum('sqlalchemy' in k for k in sys.modules.keys())
0
>>> import ibis.backends.pandas
>>> sum('sqlalchemy' in k for k in sys.modules.keys())
0
>>> import ibis.backends.dask
>>> sum('sqlalchemy' in k for k in sys.modules.keys())
0
>>> import ibis.backends.polars
>>> sum('sqlalchemy' in k for k in sys.modules.keys())
0
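
The same check can be wrapped in a small helper that runs in a fresh interpreter, so modules already imported in the current session don't skew the count. This is only a sketch: `count_loaded_modules` is a hypothetical name, not part of Ibis, and it assumes the last line printed by the child process is the count.

```python
import subprocess
import sys


def count_loaded_modules(package: str, needle: str) -> int:
    """Import `package` in a fresh interpreter and count the entries in
    sys.modules whose name contains `needle`."""
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({package!r})\n"
        f"print(sum({needle!r} in name for name in sys.modules))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    # Use the last line in case the imported package prints on import.
    return int(result.stdout.strip().splitlines()[-1])


if __name__ == "__main__":
    # "json" stands in for "ibis" so the sketch runs without Ibis installed.
    print(count_loaded_modules("json", "sqlalchemy"))
```

Replacing "json" with "ibis" or "ibis.backends.pandas" would repeat the session above, one clean process per import.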

How are you determining that sqlalchemy is a required dependency?

@MarcoGorelli
Author

thanks for looking into this!

How did you install Ibis?

I just ran pip install ibis-framework[duckdb] in a fresh virtual environment and noticed a huge number of dependencies being pulled in

@cpcloud
Member

cpcloud commented Dec 24, 2023

For DuckDB, sqlalchemy is currently required because we're using duckdb-engine (a sqlalchemy dialect).

We're working on removing sqlalchemy entirely, but that's at least 2 major releases away (ibis 9.x).

Any chance you can show pip list after running pip install ibis-framework[duckdb] in a fresh virtualenv?

@MarcoGorelli
Author

ah thanks!

have just got on a train (w/o reliable wifi) so I can try again next week when back

@MarcoGorelli
Author

hi again - hope you had a nice xmas if you celebrated

here's the requested output:

$ pip list
Package                  Version
------------------------ ------------
aiohttp                  3.9.1
aiosignal                1.3.1
appdirs                  1.4.4
atpublic                 4.0
attrs                    23.1.0
bidict                   0.22.1
cachetools               5.3.2
certifi                  2023.11.17
charset-normalizer       3.3.2
decorator                5.1.1
duckdb                   0.9.2
duckdb_engine            0.10.0
filelock                 3.13.1
frozenlist               1.4.1
fsspec                   2023.6.0
gcsfs                    2023.6.0
google-api-core          2.15.0
google-auth              2.25.2
google-auth-oauthlib     1.2.0
google-cloud-core        2.4.1
google-cloud-storage     2.14.0
google-crc32c            1.5.0
google-resumable-media   2.7.0
googleapis-common-protos 1.62.0
greenlet                 3.0.3
humanize                 4.9.0
ibis-framework           7.2.0
idna                     3.6
importlib-metadata       7.0.1
importlib-resources      6.1.1
Jinja2                   3.1.2
joblib                   1.3.2
markdown-it-py           3.0.0
MarkupSafe               2.1.3
mdurl                    0.1.2
multidict                6.0.4
multipledispatch         1.0.0
numpy                    1.26.2
oauthlib                 3.2.2
pandas                   2.1.4
parsy                    2.1
pins                     0.8.3
pip                      23.2.1
protobuf                 4.25.1
pyarrow                  14.0.2
pyarrow-hotfix           0.6
pyasn1                   0.5.1
pyasn1-modules           0.3.0
Pygments                 2.17.2
python-dateutil          2.8.2
pytz                     2023.3.post1
PyYAML                   6.0.1
requests                 2.31.0
requests-oauthlib        1.3.1
rich                     13.7.0
rsa                      4.9
setuptools               65.5.0
six                      1.16.0
SQLAlchemy               2.0.24
sqlalchemy-views         0.3.2
sqlglot                  20.4.0
toolz                    0.12.0
typing_extensions        4.9.0
tzdata                   2023.3
urllib3                  2.1.0
xxhash                   3.4.1
yarl                     1.9.4
zipp                     3.17.0
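
The list above mixes direct and transitive dependencies. One way to see what a distribution itself declares is to read its metadata with the standard library's `importlib.metadata.requires` and separate unconditional requirements from those gated behind an extra. This is a rough sketch: `split_requirements` and `declared_requirements` are hypothetical helpers, and the regex is a heuristic rather than a full environment-marker parser (the `packaging` library would be the robust choice).

```python
import re
from importlib import metadata

# Heuristic: a requirement counts as optional when its marker names an extra.
_EXTRA_MARKER = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def split_requirements(requirements):
    """Split requirement strings into (required, optional_by_extra)."""
    required = []
    optional = {}
    for req in requirements:
        spec, _, marker = req.partition(";")
        extras = _EXTRA_MARKER.findall(marker)
        if extras:
            for extra in extras:
                optional.setdefault(extra, []).append(spec.strip())
        else:
            # No extra in the marker: installed for every user.
            required.append(req.strip())
    return required, optional


def declared_requirements(dist_name):
    """Return split requirements for an installed distribution,
    or None if the distribution is not installed."""
    try:
        reqs = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return None
    return split_requirements(reqs)
```

Calling `declared_requirements("ibis-framework")` in the environment above should show which packages are unconditional and which belong to the duckdb extra.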

I'm asking about this because I'm curious to see if ibis could already be used as a lightweight compatibility layer between dataframe libraries (like https://github.com/data-apis/dataframe-api-compat aims to be)

Currently I don't think it's feasible: I don't think a dataframe-consuming library (say, scikit-learn, skrub, hvplot, seaborn, ...there are a few 😄) would be willing to take on so many extra dependencies in exchange for cross-dataframe compatibility. If I understand correctly, ibis is aimed at end-users, rather than at library developers?

Just wondering, though, if it could be possible to write, say,

penguins.aggregate(
    by="species",
    total_bill_depth=penguins.bill_depth_mm.sum(),
    avg_bill_length=penguins.bill_length_mm.mean(),
)

and have it dispatch to the underlying library natively, but only requiring an extra lightweight dependency
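
To make the idea concrete, here is a rough sketch of dispatching on the library that defines an object's type without importing any dataframe library up front. Everything in it (`register`, `dispatch`, the row-count handlers) is hypothetical and illustrates the pattern only; it is not how ibis or dataframe-api-compat work.

```python
from typing import Any, Callable, Dict

# Registry mapping a top-level module name (e.g. "pandas") to a handler.
_HANDLERS: Dict[str, Callable[[Any], Any]] = {}


def register(module_name: str):
    """Decorator registering a handler for objects defined in `module_name`."""
    def decorator(func):
        _HANDLERS[module_name] = func
        return func
    return decorator


def dispatch(obj):
    """Call the handler registered for the library that defines type(obj)."""
    top_level = type(obj).__module__.split(".")[0]
    try:
        handler = _HANDLERS[top_level]
    except KeyError:
        raise TypeError(
            f"no handler registered for objects from {top_level!r}"
        ) from None
    return handler(obj)


@register("pandas")
def _row_count_pandas(df):
    # Only called with pandas objects, so pandas is never imported here.
    return len(df)


@register("polars")
def _row_count_polars(df):
    return df.height
```

Because the lookup uses `type(obj).__module__`, the only import cost is whatever the caller has already paid by creating the dataframe.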

@PeterJCLaw

pins

The tricky bit is the UX here: we want pip install ibis-framework[some-backend] to include the ability to work with example data out of the box without users having to type pip install ibis-framework[some-backend,examples].

I'm not sure how we can make the installation of Ibis friendly to end users regarding examples, while avoiding the google cloud storage dependency (a transitive dependency of Ibis via pins via its gcs extra).

One option is to simply make examples an extra, but we think that has the potential downside of users not being able to try ibis out right out of the box with real data.

I'm 👍 on introducing an examples extra here. From the perspective of installing and using ibis in production settings, it's highly desirable to be able to install the minimal dependencies for a package. There are a number of reasons for this, not least being security considerations -- the more packages present in an environment, the longer users typically have to wait for all the relevant packages to support each other's latest versions in order to be able to upgrade.
As a current example -- the latest released version of pins requires an older version of fsspec, which in turn holds users back to an old version of transformers which has potential vulnerabilities. While that chain isn't ibis' fault directly, the forced presence of the pins package for all ibis users contributes to the problem.

Could you elaborate on why you feel that pip install ibis-framework[some-backend,examples] is significantly more arduous than pip install ibis-framework[some-backend], especially given that pip install ibis-framework might be seen as the more natural thing for anyone even slightly familiar with pip/PyPI and yet pip install ibis-framework[some-backend] is essentially required for ibis to work?

Another idea might be to have separate packages -- ibis-core perhaps being the core functionality and ibis-framework extending that by including the examples (plus the non-optional dependencies to make them work). (And/or spell this by keeping ibis-framework as-is and adding ibis-examples).

In all these spellings the user needs to refer to some form of the documentation (even if just the README) to be able to install the right things, but each of them gives different use-cases more control over transitive dependencies which aren't always required.

Another idea might be to have the examples include something like this:

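try:
    import pins  # only needed to pull down the example data
except ImportError:
    # raise SystemExit directly: the exit() helper only exists when the site module is loaded
    raise SystemExit("Run `pip install ibis-framework[examples]` to use the examples")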

Kinda ugly in the code, but stays somewhat user-friendly if people aren't used to diagnosing ImportErrors.
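
A variation on the guard above, sketched as a reusable helper that raises an informative ImportError at the point of use instead of exiting the process. `require_optional` is a hypothetical name, and the extra name passed to it is whatever the project ends up defining.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import and return `module_name`, or raise an ImportError that names
    the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f"Install it with `pip install 'ibis-framework[{extra}]'`."
        ) from exc
```

The examples code could then call `pins = require_optional("pins", "examples")` inside the function that downloads data, so the hint only appears when examples are actually requested.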

@cpcloud
Member

cpcloud commented Jan 2, 2024

I think it's reasonable to have a way of installing the minimum necessary dependencies needed to run Ibis code on a given backend. It is quite undesirable and annoying to be dependency-constrained by packages you'll never use.

The shortest path to that seems to be having an examples extra.

Separating packages might be possible, but considering how much more effort that is than making examples an extra, it doesn't seem worth it.

I'll put up a PR to make an examples extra and we can continue the discussion there.

@PeterJCLaw

Thanks!
