Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): add head/tail under string namespace #10339

Closed
wants to merge 7 commits into from

Conversation

mcrumiller
Copy link
Contributor

@mcrumiller mcrumiller commented Aug 7, 2023

Resolves #10337 and #10349

There is some discussion on the naming of the functions.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Aug 7, 2023
@avimallu
Copy link
Contributor

avimallu commented Aug 7, 2023

Following the discussion in the linked issue:

I'd much prefer head and tail that always keep a given amount of Unicode codepoints (which corresponds to bytes for ASCII) from respectively the start/end of the string.

Do you think the docstrings should specify what exactly a "character" is?

@mcrumiller
Copy link
Contributor Author

@avimallu perhaps, yes. We don't current for slice either. From what I gather, .str.slice() returns code points (characters), which isn't referenced in the docstring, but probably should be.

@avimallu
Copy link
Contributor

avimallu commented Aug 7, 2023

Okay @mcrumiller. The average person (ala me) isn't probably familiar with codepoints, but it is an important distinction to be aware of. Maybe to avoid confusion for the unfamiliar folks, while simultaneously providing enough info to the ones looking for it:

Returns the first/last n characters (strictly, UTF8 code points) of a UTF8 string.

Replace UTF8 code points with what is technically accurate?

@mcrumiller mcrumiller requested a review from orlp as a code owner August 7, 2023 18:11

Returns
-------
Expr
Expression of data type :class:`Utf8`.

Notes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added this note to both the str and expr docstrings for slice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to be explicit in the definition as per @orlp 's comment here, but I wonder if we should remove the "non-surrogate" part, since not many people will recognize what that means.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without it it's technically not correct. You can also call it a Unicode Scalar Value like the Rust docs if you prefer.

@orlp
Copy link
Collaborator

orlp commented Aug 8, 2023

This should still support negative arguments as we discussed.

@mcrumiller
Copy link
Contributor Author

@orlp I'm working on. I'm still pretty new to rust and haven't really internalized most of the concepts. pyarrow2's string slicing only operates on a fixed input length so I have to unpack the pyarrow array and apply this to the elements inside, and I'm still trying to figure out how to do that. I may post a non-working commit and ask for some assistance.

@orlp
Copy link
Collaborator

orlp commented Aug 8, 2023

@mcrumiller I understand, I commented more in case someone else wanted to review/merge in the current state.

@mcrumiller mcrumiller changed the title feat(python): add head/tail under string namespace WIP: feat(python): add head/tail under string namespace Aug 8, 2023
@stinodego
Copy link
Member

GitHub has a "draft" feature you can use while your work is in progress. I went ahead and clicked the button for you 😸

GitHub docs 1
GitHub docs 2

@stinodego stinodego marked this pull request as draft August 9, 2023 14:18
@stinodego stinodego changed the title WIP: feat(python): add head/tail under string namespace feat(python): add head/tail under string namespace Aug 9, 2023
@mcrumiller
Copy link
Contributor Author

That is a useful feature, thank you @stinodego !

@mcrumiller
Copy link
Contributor Author

I'm going to close and start from scratch rather than trying to revive this ancient PR.

@mcrumiller mcrumiller closed this Feb 9, 2024
@mcrumiller mcrumiller deleted the str-head-tail branch February 9, 2024 23:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add .str.head and .str.tail
4 participants