The `schema` argument of `as_polars_df()` is needed? #897

eitsupi · 2024-03-05T13:49:35Z

In my opinion, there is no user demand to change column names in as_polars_df().
So simply removing the schema argument and leaving only schema_override would be sufficient.

In terms of type change, the schema argument is more difficult to use than schema_override in that all columns must be specified.

Originally posted by @eitsupi in #896 (comment)

The text was updated successfully, but these errors were encountered:

eitsupi · 2024-03-11T11:45:23Z

@etiennebacher What do you think about this? I was thinking of deprecating it, but I feel like we could remove it since probably very few people use this.

etiennebacher · 2024-03-11T11:57:29Z

IIUC the point of as_polars_df() is simply to have a function that is easier to find than pl$DataFrame() and that allows piping. If that's the case I don't think we should remove functionalities that are in pl$DataFrame().

eitsupi · 2024-03-11T12:04:38Z

IIUC the point of as_polars_df() is simply to have a function that is easier to find than pl$DataFrame()

I do not understand your position. This is like data.frame() and as.data.frame(), I think the purpose of the function is different.
(And, pl$DataFrame() can be used after |>.)

I don't think we should remove functionalities that are in pl$DataFrame().

I mistakenly thought that you were the one who suggested the removal of this. #896 (comment)
Are you suggesting that this does not need to be removed?

etiennebacher · 2024-03-11T12:24:43Z

I mistakenly thought that you were the one who suggested the removal of this. #896 (comment)

The problem there is that the behavior of schema when column names in schema do not match column names in the data is different between pl$DataFrame() and as_polars_df():

in pl$DataFrame() it errors:

library(polars)
options(polars.do_not_repeat_call = TRUE)

pl$DataFrame(a = 1:3, b = 4:6, schema = list(b = pl$String, y = pl$Int32))
#> Error: Execution halted with the following contexts
#>    0: In R: in $DataFrame():
#>    1: Some columns in `schema` are not in the DataFrame.

This is consistent with py-polars:

import polars as pl

pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
    },
    schema={"b": pl.String, "y": pl.Int32},
)

ValueError: the given column-schema names do not match the data dictionary

in as_polars_df() it automatically renames by position:

library(polars)
options(polars.do_not_repeat_call = TRUE)

as_polars_df(data.frame(a = 1:3, b = 4:6), schema = list(b = pl$String, y = pl$Int32))
#> shape: (3, 2)
#> ┌─────┬─────┐
#> │ b   ┆ y   │
#> │ --- ┆ --- │
#> │ str ┆ i32 │
#> ╞═════╪═════╡
#> │ 1   ┆ 4   │
#> │ 2   ┆ 5   │
#> │ 3   ┆ 6   │
#> └─────┴─────┘

IMO this is not a good behavior and it should rather throw an error like in pl$DataFrame().

So supposing this behavior is fixed, I don't think there's a reason to remove schema from as_polars_df().

etiennebacher · 2024-03-11T12:31:19Z

By the way, the argument schema in pl$DataFrame() should also be fixed as it is equivalent to schema_overrides in py-polars for now (i.e also accepts partial schema).

This should error since not all column names are specified in schema:

library(polars)
options(polars.do_not_repeat_call = TRUE)

pl$DataFrame(a = 1:3, b = 4:6, schema = list(b = pl$String))
#> shape: (3, 2)
#> ┌─────┬─────┐
#> │ a   ┆ b   │
#> │ --- ┆ --- │
#> │ i32 ┆ str │
#> ╞═════╪═════╡
#> │ 1   ┆ 4   │
#> │ 2   ┆ 5   │
#> │ 3   ┆ 6   │
#> └─────┴─────┘

import polars as pl

pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
    },
    schema={"b": pl.String}
)

ValueError: the given column-schema names do not match the data dictionary

eitsupi · 2024-03-11T12:31:38Z

This is consistent with py-polars:

I feel that is inconsistent with the behavior described in the documentation.

It says:

If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.

https://github.com/pola-rs/polars/blob/4bc67a0d0f6c9a113fd6b231d0d9638e58407156/py-polars/polars/dataframe/frame.py#L214-L216

In other words, from_arrow is doing the right thing here, and the current DataFrame behavior may be a bug.

etiennebacher · 2024-03-11T12:37:39Z

I think this is already tracked here: pola-rs/polars#14386

eitsupi · 2024-03-11T12:42:15Z

Good point.

But in any case I have no idea when to use the schema argument.
Is there any reason why you can't just use schema_overrides?

etiennebacher · 2024-03-11T12:47:32Z

In my mind schema was a stricter version of schema_overrides where you needed to specify the type for all columns. But that was before you showed me this part:

If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.

which I think doesn't make sense. I think the current state is quite confusing, so that's also why I think we should wait for them to clarify (hopefully quite quickly).

eitsupi · 2024-03-11T12:55:41Z

That makes sense.

If the current behavior of polars.from_arrow() in Python is as intended, I vote to remove the schema argument from as_polars_df().

etiennebacher · 2024-03-11T13:02:10Z

If the current behavior of polars.from_arrow() in Python is as intended, I vote to remove the schema argument from as_polars_df().

If so, yes. If the expected behavior is to only specify the column types, without renaming, then I don't think it should be removed.

eitsupi added the enhancement New feature or request label Mar 5, 2024

eitsupi modified the milestones: 0.16, 0.15 Mar 8, 2024

eitsupi removed this from the 0.16 milestone Mar 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The `schema` argument of `as_polars_df()` is needed? #897

The `schema` argument of `as_polars_df()` is needed? #897

eitsupi commented Mar 5, 2024

eitsupi commented Mar 11, 2024

etiennebacher commented Mar 11, 2024

eitsupi commented Mar 11, 2024 •

edited

Loading

etiennebacher commented Mar 11, 2024

etiennebacher commented Mar 11, 2024

eitsupi commented Mar 11, 2024

etiennebacher commented Mar 11, 2024

eitsupi commented Mar 11, 2024

etiennebacher commented Mar 11, 2024

eitsupi commented Mar 11, 2024

etiennebacher commented Mar 11, 2024

The schema argument of as_polars_df() is needed? #897

The schema argument of as_polars_df() is needed? #897

Comments

eitsupi commented Mar 5, 2024

eitsupi commented Mar 11, 2024

etiennebacher commented Mar 11, 2024

eitsupi commented Mar 11, 2024 • edited Loading

etiennebacher commented Mar 11, 2024

etiennebacher commented Mar 11, 2024

eitsupi commented Mar 11, 2024

etiennebacher commented Mar 11, 2024

eitsupi commented Mar 11, 2024

etiennebacher commented Mar 11, 2024

eitsupi commented Mar 11, 2024

etiennebacher commented Mar 11, 2024

The `schema` argument of `as_polars_df()` is needed? #897

The `schema` argument of `as_polars_df()` is needed? #897

eitsupi commented Mar 11, 2024 •

edited

Loading