Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The schema argument of as_polars_df() is needed? #897

Open
eitsupi opened this issue Mar 5, 2024 · 11 comments
Open

The schema argument of as_polars_df() is needed? #897

eitsupi opened this issue Mar 5, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@eitsupi
Copy link
Collaborator

eitsupi commented Mar 5, 2024

In my opinion, there is no user demand to change column names in as_polars_df().
So simply removing the schema argument and leaving only schema_override would be sufficient.

In terms of type change, the schema argument is more difficult to use than schema_override in that all columns must be specified.

Originally posted by @eitsupi in #896 (comment)

@eitsupi eitsupi added the enhancement New feature or request label Mar 5, 2024
@eitsupi eitsupi modified the milestones: 0.16, 0.15 Mar 8, 2024
@eitsupi
Copy link
Collaborator Author

eitsupi commented Mar 11, 2024

@etiennebacher What do you think about this? I was thinking of deprecating it, but I feel like we could remove it since probably very few people use this.

@etiennebacher
Copy link
Collaborator

IIUC the point of as_polars_df() is simply to have a function that is easier to find than pl$DataFrame() and that allows piping. If that's the case I don't think we should remove functionalities that are in pl$DataFrame().

@eitsupi
Copy link
Collaborator Author

eitsupi commented Mar 11, 2024

IIUC the point of as_polars_df() is simply to have a function that is easier to find than pl$DataFrame()

I do not understand your position. This is like data.frame() and as.data.frame(), I think the purpose of the function is different.
(And, pl$DataFrame() can be used after |>.)

I don't think we should remove functionalities that are in pl$DataFrame().

I mistakenly thought that you were the one who suggested the removal of this. #896 (comment)
Are you suggesting that this does not need to be removed?

@etiennebacher
Copy link
Collaborator

I mistakenly thought that you were the one who suggested the removal of this. #896 (comment)

The problem there is that the behavior of schema when column names in schema do not match column names in the data is different between pl$DataFrame() and as_polars_df():

  • in pl$DataFrame() it errors:
library(polars)
options(polars.do_not_repeat_call = TRUE)

pl$DataFrame(a = 1:3, b = 4:6, schema = list(b = pl$String, y = pl$Int32))
#> Error: Execution halted with the following contexts
#>    0: In R: in $DataFrame():
#>    1: Some columns in `schema` are not in the DataFrame.

This is consistent with py-polars:

import polars as pl

pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
    },
    schema={"b": pl.String, "y": pl.Int32},
)

ValueError: the given column-schema names do not match the data dictionary
  • in as_polars_df() it automatically renames by position:
library(polars)
options(polars.do_not_repeat_call = TRUE)

as_polars_df(data.frame(a = 1:3, b = 4:6), schema = list(b = pl$String, y = pl$Int32))
#> shape: (3, 2)
#> ┌─────┬─────┐
#> │ b   ┆ y   │
#> │ --- ┆ --- │
#> │ str ┆ i32 │
#> ╞═════╪═════╡
#> │ 1   ┆ 4   │
#> │ 2   ┆ 5   │
#> │ 3   ┆ 6   │
#> └─────┴─────┘

IMO this is not a good behavior and it should rather throw an error like in pl$DataFrame().

So supposing this behavior is fixed, I don't think there's a reason to remove schema from as_polars_df().

@etiennebacher
Copy link
Collaborator

By the way, the argument schema in pl$DataFrame() should also be fixed as it is equivalent to schema_overrides in py-polars for now (i.e also accepts partial schema).

This should error since not all column names are specified in schema:

library(polars)
options(polars.do_not_repeat_call = TRUE)

pl$DataFrame(a = 1:3, b = 4:6, schema = list(b = pl$String))
#> shape: (3, 2)
#> ┌─────┬─────┐
#> │ a   ┆ b   │
#> │ --- ┆ --- │
#> │ i32 ┆ str │
#> ╞═════╪═════╡
#> │ 1   ┆ 4   │
#> │ 2   ┆ 5   │
#> │ 3   ┆ 6   │
#> └─────┴─────┘
import polars as pl

pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
    },
    schema={"b": pl.String}
)

ValueError: the given column-schema names do not match the data dictionary

@eitsupi
Copy link
Collaborator Author

eitsupi commented Mar 11, 2024

This is consistent with py-polars:

I feel that is inconsistent with the behavior described in the documentation.

It says:

If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.

https://github.com/pola-rs/polars/blob/4bc67a0d0f6c9a113fd6b231d0d9638e58407156/py-polars/polars/dataframe/frame.py#L214-L216

In other words, from_arrow is doing the right thing here, and the current DataFrame behavior may be a bug.

@eitsupi eitsupi removed this from the 0.16 milestone Mar 11, 2024
@etiennebacher
Copy link
Collaborator

I think this is already tracked here: pola-rs/polars#14386

@eitsupi
Copy link
Collaborator Author

eitsupi commented Mar 11, 2024

Good point.

But in any case I have no idea when to use the schema argument.
Is there any reason why you can't just use schema_overrides?

@etiennebacher
Copy link
Collaborator

In my mind schema was a stricter version of schema_overrides where you needed to specify the type for all columns. But that was before you showed me this part:

If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.

which I think doesn't make sense. I think the current state is quite confusing, so that's also why I think we should wait for them to clarify (hopefully quite quickly).

@eitsupi
Copy link
Collaborator Author

eitsupi commented Mar 11, 2024

That makes sense.

If the current behavior of polars.from_arrow() in Python is as intended, I vote to remove the schema argument from as_polars_df().

@etiennebacher
Copy link
Collaborator

If the current behavior of polars.from_arrow() in Python is as intended, I vote to remove the schema argument from as_polars_df().

If so, yes. If the expected behavior is to only specify the column types, without renaming, then I don't think it should be removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants