From ff8cfd4494cd8a3d9d977b5604cad56a0a3aa12d Mon Sep 17 00:00:00 2001 From: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com> Date: Tue, 17 Sep 2024 13:36:06 +0100 Subject: [PATCH 1/2] restructure docs --- docs/api-reference/index.md | 46 +++++++++++++++---------------------- docs/extending.md | 46 ------------------------------------- docs/installation.md | 46 +++++++++++++++++++++++++++++++++++++ docs/levels.md | 43 ---------------------------------- docs/overhead.md | 20 ---------------- docs/quick_start.md | 45 ------------------------------------ docs/related.md | 13 ----------- docs/roadmap.md | 11 --------- mkdocs.yml | 22 ++++++++---------- 9 files changed, 74 insertions(+), 218 deletions(-) delete mode 100644 docs/extending.md delete mode 100644 docs/levels.md delete mode 100644 docs/overhead.md delete mode 100644 docs/quick_start.md delete mode 100644 docs/related.md delete mode 100644 docs/roadmap.md diff --git a/docs/api-reference/index.md b/docs/api-reference/index.md index 0c2c81fa3..b4cbc78fc 100644 --- a/docs/api-reference/index.md +++ b/docs/api-reference/index.md @@ -1,30 +1,20 @@ # API Reference -Anything documented in the API reference is intended to work consistently among -supported backends. - -For example: -```python -import narwhals as nw - -df.with_columns( - a_mean=nw.col("a").mean(), - a_std=nw.col("a").std(), -) -``` -is supported, as `DataFrame.with_columns`, `narwhals.col`, `Expr.mean`, and `Expr.std` are -all documented in the API reference. - -However, -```python -import narwhals as nw - -df.with_columns( - a_ewm_mean=nw.col("a").ewm_mean(alpha=0.7), -) -``` -is not - `Expr.ewm_mean` only appears in the Polars API reference, but not in the Narwhals -one. - -In general, you should expect any fundamental dataframe operation to be supported - if -one that you need is not, please do open a feature request! 
+- [Top-level functions](narwhals.md) +- [narwhals.DataFrame](dataframe.md) +- [narwhals.Expr](expr.md) +- [narwhals.Expr.cat](expr_cat.md) +- [narwhals.Expr.dt](expr_dt.md) +- [narwhals.Expr.name](expr_name.md) +- [narwhals.Expr.str](expr_str.md) +- [narwhals.GroupBy](group_by.md) +- [narwhals.LazyFrame](lazyframe.md) +- [narwhals.Schema](schema.md) +- [narwhals.Series](series.md) +- [narwhals.Series.cat](series_cat.md) +- [narwhals.Series.dt](series_dt.md) +- [narwhals.Series.str](series_str.md) +- [narwhals.dependencies](dependencies.md) +- [narwhals.dtypes](dtypes.md) +- [narwhals.selectors](selectors.md) +- [narwhals.typing](typing.md) diff --git a/docs/extending.md b/docs/extending.md deleted file mode 100644 index 1a750431f..000000000 --- a/docs/extending.md +++ /dev/null @@ -1,46 +0,0 @@ -# List of supported libraries (and how to add yours!) - -Currently, Narwhals supports the following libraries as inputs: - -- pandas -- Polars -- cuDF -- Modin -- PyArrow - -If you want your own library to be recognised too, you're welcome open a PR (with tests)! -Alternatively, if you can't do that (for example, if you library is closed-source), see -the next section for what else you can do. - -To check which methods are supported for which backend in depth, please refer to the -[API completeness page](api-completeness/index.md). - -## Extending Narwhals - -We love open source, but we're not "open source absolutists". If you're unable to open -source you library, then this is how you can make your library compatible with Narwhals. - -Make sure that, in addition to the public Narwhals API, you also define: - - - `DataFrame.__narwhals_dataframe__`: return an object which implements public methods - from `Narwhals.DataFrame` - - `DataFrame.__narwhals_namespace__`: return an object which implements public top-level - functions from `narwhals` (e.g. `narwhals.col`, `narwhals.concat`, ...) 
- - `DataFrame.__native_namespace__`: return a native namespace object which must have a - `from_dict` method - - `LazyFrame.__narwhals_lazyframe__`: return an object which implements public methods - from `Narwhals.LazyFrame` - - `LazyFrame.__narwhals_namespace__`: return an object which implements public top-level - functions from `narwhals` (e.g. `narwhals.col`, `narwhals.concat`, ...) - - `LazyFrame.__native_namespace__`: return a native namespace object which must have a - `from_dict` method - - `Series.__narwhals_series__`: return an object which implements public methods - from `Narwhals.Series` - - If your library doesn't distinguish between lazy and eager, then it's OK for your dataframe - object to implement both `__narwhals_dataframe__` and `__narwhals_lazyframe__`. In fact, - that's currently what `narwhals._pandas_like.dataframe.PandasLikeDataFrame` does. So, if you're stuck, - take a look at the source code to see how it's done! - -Note that the "extension" mechanism is still experimental. If anything is not clear, or -doesn't work, please do raise an issue or contact us on Discord (see the link on the README). diff --git a/docs/installation.md b/docs/installation.md index 617606817..3bb37b494 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -14,3 +14,49 @@ Then, if you start the Python REPL and see the following: '1.8.1' ``` then installation worked correctly! + +# Quick start + +## Prerequisites + +Please start by following the [installation instructions](installation.md). 
+ +To follow along with the examples which follow, please install the following (though note that +they are not required dependencies - Narwhals only ever uses what the user passes in): + +- [pandas](https://pandas.pydata.org/docs/getting_started/install.html) +- [Polars](https://pola-rs.github.io/polars/user-guide/installation/) + +## Simple example + +Create a Python file `t.py` with the following content: + +```python exec="1" source="above" session="quickstart" result="python" +from __future__ import annotations + +import pandas as pd +import polars as pl +import narwhals as nw +from narwhals.typing import IntoFrame + + +def my_function(df_native: IntoFrame) -> list[str]: + df = nw.from_native(df_native) + column_names = df.columns + return column_names + + +df_pandas = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) +df_polars = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) + +print("pandas output") +print(my_function(df_pandas)) +print("Polars output") +print(my_function(df_polars)) +``` + +If you run `python t.py` then your output should look like the above. This is the simplest possible example of a dataframe-agnostic +function - as we'll soon see, we can do much more advanced things. +Let's learn about what you just did, and what Narwhals can do for you! + +Note: these examples are only using pandas and Polars. Please see the following to find the [supported libraries](extending.md). diff --git a/docs/levels.md b/docs/levels.md deleted file mode 100644 index 743334663..000000000 --- a/docs/levels.md +++ /dev/null @@ -1,43 +0,0 @@ -# Levels - -Narwhals comes with two levels of support: "full" and "interchange". - -Libraries for which we have full support can benefit from the whole -[Narwhals API](https://narwhals-dev.github.io/narwhals/api-reference/). 
- -For example: - -```python exec="1" source="above" -import narwhals as nw -from narwhals.typing import FrameT - - -@nw.narwhalify -def func(df: FrameT) -> FrameT: - return df.group_by("a").agg( - b_mean=nw.col("b").mean(), - b_std=nw.col("b").std(), - ) -``` -will work for any of pandas, Polars, cuDF, Modin, and PyArrow. - -However, sometimes you don't need to do complex operations on dataframes - all you need -is to inspect the schema a bit before making other decisions, such as which columns to -select or whether to convert to another library. For that purpose, we also provide "interchange" -level of support. If a library implements the -[Dataframe Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/), then -a call such as - -```python exec="1" source="above" -from typing import Any - -import narwhals as nw -from narwhals.schema import Schema - - -def func(df: Any) -> Schema: - df = nw.from_native(df, eager_or_interchange_only=True) - return df.schema -``` -is also supported, meaning that, in addition to the libraries mentioned above, you can -also pass Ibis, Vaex, PyArrow, and any other library which implements the protocol. diff --git a/docs/overhead.md b/docs/overhead.md deleted file mode 100644 index 1477f6fa6..000000000 --- a/docs/overhead.md +++ /dev/null @@ -1,20 +0,0 @@ -# Overhead - -Narwhals converts Polars syntax to non-Polars dataframes. - -So, what's the overhead of running pandas vs pandas via Narwhals? - -Based on experiments we've done, the answer is: it's negligible. 
Here -are timings from the TPC-H queries, comparing running pandas directly -vs running pandas via Narwhals: - -![Comparison of pandas vs "pandas via Narwhals" timings on TPC-H queries showing neglibile overhead](https://github.com/narwhals-dev/narwhals/assets/33491632/71029c26-4121-43bb-90fb-5ac1c16ab8a2) - -[Here](https://www.kaggle.com/code/marcogorelli/narwhals-tpc-h-results-s-2)'s the code to -reproduce the plot above, check the input -sources for notebooks which run each individual query, along with -the data sources. - -On some runs, the Narwhals code makes things marginally faster, on others -marginally slower. The overall picture is clear: with Narwhals, you -can support both Polars and pandas APIs with little to no impact on either. diff --git a/docs/quick_start.md b/docs/quick_start.md deleted file mode 100644 index f3ff8c05a..000000000 --- a/docs/quick_start.md +++ /dev/null @@ -1,45 +0,0 @@ -# Quick start - -## Prerequisites - -Please start by following the [installation instructions](installation.md). 
- -To follow along with the examples which follow, please install the following (though note that -they are not required dependencies - Narwhals only ever uses what the user passes in): - -- [pandas](https://pandas.pydata.org/docs/getting_started/install.html) -- [Polars](https://pola-rs.github.io/polars/user-guide/installation/) - -## Simple example - -Create a Python file `t.py` with the following content: - -```python exec="1" source="above" session="quickstart" result="python" -from __future__ import annotations - -import pandas as pd -import polars as pl -import narwhals as nw -from narwhals.typing import IntoFrame - - -def my_function(df_native: IntoFrame) -> list[str]: - df = nw.from_native(df_native) - column_names = df.columns - return column_names - - -df_pandas = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) -df_polars = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) - -print("pandas output") -print(my_function(df_pandas)) -print("Polars output") -print(my_function(df_polars)) -``` - -If you run `python t.py` then your output should look like the above. This is the simplest possible example of a dataframe-agnostic -function - as we'll soon see, we can do much more advanced things. -Let's learn about what you just did, and what Narwhals can do for you! - -Note: these examples are only using pandas and Polars. Please see the following to find the [supported libriaries](extending.md). diff --git a/docs/related.md b/docs/related.md deleted file mode 100644 index 38b0522d8..000000000 --- a/docs/related.md +++ /dev/null @@ -1,13 +0,0 @@ -# Related projects - -## Dataframe Interchange Protocol - -Standardised way of interchanging data between libraries, see -[here](https://data-apis.org/dataframe-protocol/latest/index.html). - -Narwhals builds upon it by providing one level of support to libraries which implement it - -this includes Ibis and Vaex. See [levels](levels.md) for details. 
- -## Array API - -Array counterpart to the DataFrame API, see [here](https://data-apis.org/array-api/2022.12/index.html). diff --git a/docs/roadmap.md b/docs/roadmap.md deleted file mode 100644 index 87b224bf9..000000000 --- a/docs/roadmap.md +++ /dev/null @@ -1,11 +0,0 @@ -# Roadmap - -Priorities, as of August 2024, are: - -- Works towards supporting projects which have shown interest in Narwhals. -- Implement when/then/otherwise so that Narwhals is API-complete enough to complete all the TPC-H queries. -- Make Dask support complete-enough, at least to the point that it can execute TPC-H queries. -- Improve support for cuDF, which we can't currently test in CI (unless NVIDIA helps us out :wink:) but - which we can and do test manually in Kaggle notebooks. -- Add extra docs and tutorials to make the project more accessible and easy to get started with. -- Look into extra backends, such as DuckDB and Ibis. diff --git a/mkdocs.yml b/mkdocs.yml index 8b635f78d..5af943bc2 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -5,28 +5,26 @@ watch: nav: - Home: index.md - Why: why.md - - Installation: installation.md - - Quick start: quick_start.md + - Installation and quick start: installation.md - Tutorial: - basics/dataframe.md - basics/column.md - basics/complete_example.md - - Pandas-like concepts: + - Coming from pandas: - other/pandas_index.md - other/user_warning.md - other/column_names.md - - levels.md - - overhead.md + - other/overhead.md - backcompat.md - - extending.md - how_it_works.md - - Roadmap: roadmap.md - - Related projects: related.md - - API Completeness: + - Roadmap and related projects: roadmap.md + - Supported libraries: - api-completeness/index.md - - api-completeness/dataframe.md - - api-completeness/expr.md - - api-completeness/series.md + - Supported DataFrame methods: api-completeness/dataframe.md + - Supported Expr methods: api-completeness/expr.md + - Supported Series methods: api-completeness/series.md + - api-completeness/levels.md + - 
api-completeness/extending.md - API Reference: - api-reference/narwhals.md - api-reference/dataframe.md From 3625feba7f3f966521270938df0d4a1897e2e90f Mon Sep 17 00:00:00 2001 From: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com> Date: Tue, 17 Sep 2024 16:43:12 +0100 Subject: [PATCH 2/2] missing pages --- .../column_names.md | 0 docs/coming_from_pandas/extending.md | 46 +++++++++++++++++++ docs/coming_from_pandas/overhead.md | 20 ++++++++ .../pandas_index.md | 0 .../user_warning.md | 0 docs/installation.md | 10 ++-- docs/roadmap_and_related.md | 25 ++++++++++ mkdocs.yml | 8 ++-- 8 files changed, 101 insertions(+), 8 deletions(-) rename docs/{other => coming_from_pandas}/column_names.md (100%) create mode 100644 docs/coming_from_pandas/extending.md create mode 100644 docs/coming_from_pandas/overhead.md rename docs/{other => coming_from_pandas}/pandas_index.md (100%) rename docs/{other => coming_from_pandas}/user_warning.md (100%) create mode 100644 docs/roadmap_and_related.md diff --git a/docs/other/column_names.md b/docs/coming_from_pandas/column_names.md similarity index 100% rename from docs/other/column_names.md rename to docs/coming_from_pandas/column_names.md diff --git a/docs/coming_from_pandas/extending.md b/docs/coming_from_pandas/extending.md new file mode 100644 index 000000000..1a750431f --- /dev/null +++ b/docs/coming_from_pandas/extending.md @@ -0,0 +1,46 @@ +# List of supported libraries (and how to add yours!) + +Currently, Narwhals supports the following libraries as inputs: + +- pandas +- Polars +- cuDF +- Modin +- PyArrow + +If you want your own library to be recognised too, you're welcome to open a PR (with tests)! +Alternatively, if you can't do that (for example, if your library is closed-source), see +the next section for what else you can do. + +To check which methods are supported for which backend in depth, please refer to the +[API completeness page](api-completeness/index.md). 
+ +## Extending Narwhals + +We love open source, but we're not "open source absolutists". If you're unable to open +source your library, then this is how you can make your library compatible with Narwhals. + +Make sure that, in addition to the public Narwhals API, you also define: + + - `DataFrame.__narwhals_dataframe__`: return an object which implements public methods + from `Narwhals.DataFrame` + - `DataFrame.__narwhals_namespace__`: return an object which implements public top-level + functions from `narwhals` (e.g. `narwhals.col`, `narwhals.concat`, ...) + - `DataFrame.__native_namespace__`: return a native namespace object which must have a + `from_dict` method + - `LazyFrame.__narwhals_lazyframe__`: return an object which implements public methods + from `Narwhals.LazyFrame` + - `LazyFrame.__narwhals_namespace__`: return an object which implements public top-level + functions from `narwhals` (e.g. `narwhals.col`, `narwhals.concat`, ...) + - `LazyFrame.__native_namespace__`: return a native namespace object which must have a + `from_dict` method + - `Series.__narwhals_series__`: return an object which implements public methods + from `Narwhals.Series` + + If your library doesn't distinguish between lazy and eager, then it's OK for your dataframe + object to implement both `__narwhals_dataframe__` and `__narwhals_lazyframe__`. In fact, + that's currently what `narwhals._pandas_like.dataframe.PandasLikeDataFrame` does. So, if you're stuck, + take a look at the source code to see how it's done! + +Note that the "extension" mechanism is still experimental. If anything is not clear, or +doesn't work, please do raise an issue or contact us on Discord (see the link on the README). diff --git a/docs/coming_from_pandas/overhead.md b/docs/coming_from_pandas/overhead.md new file mode 100644 index 000000000..1477f6fa6 --- /dev/null +++ b/docs/coming_from_pandas/overhead.md @@ -0,0 +1,20 @@ +# Overhead + +Narwhals converts Polars syntax to non-Polars dataframes. 
+ +So, what's the overhead of running pandas vs pandas via Narwhals? + +Based on experiments we've done, the answer is: it's negligible. Here +are timings from the TPC-H queries, comparing running pandas directly +vs running pandas via Narwhals: + +![Comparison of pandas vs "pandas via Narwhals" timings on TPC-H queries showing negligible overhead](https://github.com/narwhals-dev/narwhals/assets/33491632/71029c26-4121-43bb-90fb-5ac1c16ab8a2) + +[Here](https://www.kaggle.com/code/marcogorelli/narwhals-tpc-h-results-s-2)'s the code to +reproduce the plot above, check the input +sources for notebooks which run each individual query, along with +the data sources. + +On some runs, the Narwhals code makes things marginally faster, on others +marginally slower. The overall picture is clear: with Narwhals, you +can support both Polars and pandas APIs with little to no impact on either. diff --git a/docs/other/pandas_index.md b/docs/coming_from_pandas/pandas_index.md similarity index 100% rename from docs/other/pandas_index.md rename to docs/coming_from_pandas/pandas_index.md diff --git a/docs/other/user_warning.md b/docs/coming_from_pandas/user_warning.md similarity index 100% rename from docs/other/user_warning.md rename to docs/coming_from_pandas/user_warning.md diff --git a/docs/installation.md b/docs/installation.md index 3bb37b494..6dc2bfa9d 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -1,4 +1,6 @@ -# Installation +# Installation and quick start + +## Installation First, make sure you have [created and activated](https://docs.python.org/3/library/venv.html) a Python3.8+ virtual environment. @@ -15,9 +17,9 @@ Then, if you start the Python REPL and see the following: ``` then installation worked correctly! -# Quick start +## Quick start -## Prerequisites +### Prerequisites Please start by following the [installation instructions](installation.md). 
@@ -27,7 +29,7 @@ they are not required dependencies - Narwhals only ever uses what the user passe - [pandas](https://pandas.pydata.org/docs/getting_started/install.html) - [Polars](https://pola-rs.github.io/polars/user-guide/installation/) -## Simple example +### Simple example Create a Python file `t.py` with the following content: diff --git a/docs/roadmap_and_related.md b/docs/roadmap_and_related.md new file mode 100644 index 000000000..ad05b0533 --- /dev/null +++ b/docs/roadmap_and_related.md @@ -0,0 +1,25 @@ +# Roadmap and related projects + +## Roadmap + +Priorities, as of September 2024, are: + +- Works towards supporting projects which have shown interest in Narwhals. +- Add extra docs and tutorials to make the project more accessible and easy to get started with. +- Improve support for cuDF, which we can't currently test in CI (unless NVIDIA helps us out :wink:) but + which we can and do test manually in Kaggle notebooks. +- Define a lazy-only layer of support which can include DuckDB, Ibis, and PySpark. + +## Related projects + +### Dataframe Interchange Protocol + +Standardised way of interchanging data between libraries, see +[here](https://data-apis.org/dataframe-protocol/latest/index.html). + +Narwhals builds upon it by providing one level of support to libraries which implement it - +this includes Ibis and Vaex. See [levels](levels.md) for details. + +### Array API + +Array counterpart to the DataFrame API, see [here](https://data-apis.org/array-api/2022.12/index.html). 
diff --git a/mkdocs.yml b/mkdocs.yml index 5af943bc2..8a0577886 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -11,10 +11,10 @@ nav: - basics/column.md - basics/complete_example.md - Coming from pandas: - - other/pandas_index.md - - other/user_warning.md - - other/column_names.md - - other/overhead.md + - coming_from_pandas/pandas_index.md + - coming_from_pandas/user_warning.md + - coming_from_pandas/column_names.md + - coming_from_pandas/overhead.md - backcompat.md - how_it_works.md - Roadmap and related projects: roadmap.md