
feat: some improves of as_polars_df #896

Merged (3 commits) — Mar 5, 2024
Conversation

@eitsupi eitsupi commented Mar 5, 2024

The performance of the conversion from nanoarrow_array_stream is as follows: slower than as_arrow_table but faster than as_tibble, possibly due to the String type conversion. (And I am surprised that as_tibble is as fast as it is.)

library(adbcdrivermanager)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(tibble)
library(polars)

polars_info()
#> Polars R package version : 0.15.0.9000
#> Rust Polars crate version: 0.38.1
#>
#> Thread pool size: 16
#>
#> Features:
#> default                    TRUE
#> full_features              TRUE
#> disable_limit_max_threads  TRUE
#> nightly                    TRUE
#> sql                        TRUE
#> rpolars_debug_print       FALSE
#>
#> Code completion: deactivated

db <- adbc_database_init(adbcsqlite::adbcsqlite())
con <- adbc_connection_init(db)

flights <- nycflights13::flights
flights$time_hour <- NULL
flights |>
  write_adbc(con, "flights")

query <- "SELECT * from flights"

bench::mark(
  polars_df_1 = {
    con |>
      read_adbc(query) |>
      as_polars_df()
  },
  arrow_table = {
    con |>
      read_adbc(query) |>
      as_arrow_table()
  },
  tibble = {
    con |>
      read_adbc(query) |>
      as_tibble()
  },
  polars_df_2 = {
    con |>
      read_adbc(query) |>
      as_polars_df()
  },
  check = FALSE,
  min_iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars_df_1    1.62s    1.78s    0.116     4.94MB   0.162
#> 2 arrow_table    1.25s    1.39s    0.722     1.62MB   0
#> 3 tibble          3.3s    5.39s    0.0938    46.5MB   0.0751
#> 4 polars_df_2    1.58s     3.4s    0.223   149.91KB   0.134

Created on 2024-03-05 with reprex v2.0.2


@etiennebacher etiennebacher left a comment


Thanks, do you want to bump NEWS to add a note on improved performance?

Comment on lines +205 to +206
#' Should match the number of columns in `x` and correspond to each column in `x` by position.
#' If a column in `x` does not match the name or type at the same position, it will be renamed/recast.
Collaborator

Wow, I didn't notice that it would automatically rename/recast, but that seems like really bad behavior:

library(polars)
df = data.frame(a = 1:3, b = 4:6)
as_polars_df(df, schema = list(b = pl$String, y = pl$Int32))
#> shape: (3, 2)
#> ┌─────┬─────┐
#> │ b   ┆ y   │
#> │ --- ┆ --- │
#> │ str ┆ i32 │
#> ╞═════╪═════╡
#> │ 1   ┆ 4   │
#> │ 2   ┆ 5   │
#> │ 3   ┆ 6   │
#> └─────┴─────┘

This doesn't work in py-polars:

test = pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
    },
    schema={"b": pl.String, "y": pl.Int32},
)
ValueError: the given column-schema names do not match the data dictionary

pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
    },
    schema_overrides={"b": pl.String, "y": pl.Int32},
)

shape: (3, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ null │
│ 2   ┆ null │
│ 3   ┆ null │
└─────┴──────┘

This wasn't introduced in this PR, so I don't think it's a blocker, but it should definitely be fixed.

@eitsupi eitsupi Mar 5, 2024


The schema argument comes from Python's polars.from_arrow, and I believe it works the same way, given that it copies the complete logic from there (I also think this is bad behavior, and it could be removed at some point).


@eitsupi eitsupi Mar 5, 2024


Like these:

>>> import polars as pl
>>> import pyarrow as pa
>>> data = pa.table({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> pl.from_arrow(data, schema=[("b", pl.String), ("y", pl.Int32)])
shape: (3, 2)
┌─────┬─────┐
│ b   ┆ y   │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘
>>> pl.from_arrow(data, schema={"b": pl.String, "y": pl.Int32})
shape: (3, 2)
┌─────┬─────┐
│ b   ┆ y   │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘

Collaborator

Hum, so I guess they need to harmonize the behavior upstream between pl.DataFrame and pl.from_arrow then?

Collaborator Author

It is quite surprising that this behavior exists, especially when the schema is specified as a dict.

In general, the order of entries in a dict is not guaranteed, so processing based on position is not a good idea.

Collaborator Author

Hum, so I guess they need to harmonize the behavior upstream between pl.DataFrame and pl.from_arrow then?

Not sure about that.
Perhaps schema is intended for Series rather than DataFrame inputs (e.g., a pyarrow.Array has no name). Since the column names are guaranteed to already exist in the arrow::Table passed to as_polars_df, we can simply remove the schema argument here and use only schema_override.

Collaborator Author

In my opinion, there is no user demand for changing column names in as_polars_df(), so simply removing the schema argument and leaving only schema_override would be sufficient.

For type changes, the schema argument is also harder to use than schema_override, since all columns must be specified.



Successfully merging this pull request may close these issues.

- as_polars_df() for nanoarrow_array_stream seems slow
- Rewrite as_polars_df.nanoarrow_array
2 participants