Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: concat_str #1128

Merged
merged 13 commits into from
Oct 6, 2024
Merged

feat: concat_str #1128

merged 13 commits into from
Oct 6, 2024

Conversation

FBruzzesi
Copy link
Member

@FBruzzesi FBruzzesi commented Oct 3, 2024

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

This was a bit harder/more complex than expected to get right. The idea being that polars can switch between:

  • ignore_nulls=False (default), which means that a single null in the row will be propagated to the final result.
  • ignore_nulls=True, when this happens, it is not as simple as replacing the null with an empty string, since otherwise the separator will get repeated (I am using a reduce). Also the separator "attached" to the same column and row needs to be nullified to empty string.

All this comes out of the box with polars>=0.20.6 and pyarrow (not sure about min version, CI will tell). On the other hand for pandas, dask and polars<0.20.6, a fair bit of manipulation had to be done, especially for ignore_nulls=True.

The gist of it is using a null_mask, a list of series corresponding to the null values of the original series, and use it to map both the null values and the separators to an empty string.

It has also been a long day, so maybe tomorrow I can take another look with some fresh eyes

@github-actions github-actions bot added the enhancement New feature or request label Oct 3, 2024
@FBruzzesi
Copy link
Member Author

FBruzzesi commented Oct 4, 2024

@MarcoGorelli I can't reproduce the CI failure even with the same pinned versions (screenshot attached)

Can reproduce by lowering pandas version πŸ€”

Fixed via the following change:

+ series = (s.astype(str) for _expr in parsed_exprs for s in _expr._call(df))
- series = (s for _expr in parsed_exprs for s in _expr.cast(self._dtypes.String()._call(df))

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice one @FBruzzesi, great use of pl.fold!

@MarcoGorelli MarcoGorelli merged commit 8f2e834 into main Oct 6, 2024
26 checks passed
@FBruzzesi FBruzzesi deleted the feat/concat-str branch October 6, 2024 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request high priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enh]: Add nw.concat_str to horizontally concatenate string columns
2 participants