
feat: Enable parse of 0/1 as bool in csv #18504

Draft
wants to merge 2 commits into
base: main


mcrumiller
Contributor

@github-actions bot added the enhancement, python, and rust labels Sep 1, 2024

codecov bot commented Sep 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.86%. Comparing base (4dc90a9) to head (97c5484).

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18504      +/-   ##
==========================================
- Coverage   79.87%   79.86%   -0.02%     
==========================================
  Files        1501     1501              
  Lines      202032   202032              
  Branches     2868     2868              
==========================================
- Hits       161370   161349      -21     
- Misses      40115    40136      +21     
  Partials      547      547              


@ritchie46
Member

This adds extra branches to the csv parser, making it slower for the default cases. This should be added in the casting logic instead.

@mcrumiller mcrumiller marked this pull request as draft September 2, 2024 17:07
@mcrumiller
Contributor Author

> This adds extra branches to the csv parser, making it slower for the default cases. This should be added in the casting logic instead.

@ritchie46 I'm not sure how to deal with this. If the user supplies the schema via schema override then the field is specified as boolean, and the boolean parser gets used. We could ignore booleans in the schema override and make sure we cast back later, but that feels messy. Did you have another solution in mind?

@jakob-keller
Contributor

jakob-keller commented Sep 3, 2024

Would it be possible to make this even more flexible? I am dealing with boolean data in CSVs that is coded as Y and N.

Maybe the dtype could be specified as pl.Boolean(false="N", true="Y"), if that makes any sense at all.

@mcrumiller
Contributor Author

@jakob-keller can you not just do something like:

import polars as pl

pl.scan_csv("file.csv").with_columns(
    pl.col("bool_col").replace_strict({"Y": True, "N": False})
)

@jakob-keller
Contributor

> @jakob-keller can you not just do something like:
>
> pl.scan_csv("file.csv").with_columns(
>     pl.col("bool_col").replace_strict({"Y": True, "N": False})
> )

Sure, that's what I am doing right now. And the same goes for 0/1, I guess.
