Define "character" in str.slice docstring #10349

Closed
mcrumiller opened this issue Aug 7, 2023 · 2 comments · Fixed by #14395
Labels
enhancement New feature or an improvement of an existing feature


@mcrumiller

Problem description

From #10339, the docstring for str.slice currently is:

        Create subslices of the string values of a Utf8 Series.

        Parameters
        ----------
        offset
            Start index. Negative indexing is supported.
        length
            Length of the slice. If set to ``None`` (default), the slice is taken to the
            end of the string.

It's ambiguous whether offset and length are counted in bytes or in characters/code points. Take the following string:

>>> import polars as pl

>>> # the following is Google's translation of "polar bear" into Arabic. It is 21 bytes with 11 code points.
>>> s = 'الدب القطبي'
>>> [len(c.encode()) for c in s]
[2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2]

Polars slices by code point, not by byte: a slice of length 1 returns the whole first code point, which is 2 bytes long.

>>> s2 = pl.Series([s])
>>> len(s2.str.slice(0, 1)[0].encode())
2

We should probably specify this. I think I'll go by @avimallu's recommendation.
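
To make the byte vs. code point distinction concrete without depending on Polars, here is a minimal sketch in plain Python using the same string. It only illustrates the two possible readings of "length 1"; it is not how Polars implements str.slice, and the variable names are made up for the example.

```python
# Same string as above: 11 code points, 21 bytes in UTF-8.
s = "الدب القطبي"
utf8 = s.encode("utf-8")

n_code_points = len(s)   # Python str length counts code points
n_bytes = len(utf8)      # bytes length counts bytes

# Reading 1: "length 1" means one code point (the behavior observed in Polars).
first_code_point = s[:1]

# Reading 2: "length 1" means one byte. This cuts a 2-byte sequence in half,
# so the result is not valid UTF-8 on its own.
first_byte = utf8[:1]
first_byte_decoded = first_byte.decode("utf-8", errors="replace")

print(n_code_points, n_bytes)
print(len(first_code_point.encode("utf-8")))
print(repr(first_byte_decoded))
```

Under the byte reading, a slice could split a multi-byte sequence, which is one reason the docstring should say which unit is meant.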

@orlp
orlp commented Aug 7, 2023

For reference, this is what Rust's str::chars method has as its documentation:

As a string slice consists of valid UTF-8, we can iterate through a string slice by char. This method returns such an iterator.

It’s important to remember that char represents a Unicode Scalar Value, and might not match your idea of what a ‘character’ is. Iteration over grapheme clusters may be what you actually want. This functionality is not provided by Rust’s standard library, check crates.io instead.
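
A small Python sketch of the point in that quote, that a code point is not always what a reader would call a "character". It uses only the standard library; the example string is an assumption chosen for illustration, and Python's str indexing stands in for any code point based slice.

```python
import unicodedata

# "é" written as two code points: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
# It renders as a single user-perceived character (one grapheme cluster).
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Slicing one code point from each form gives different results.
decomposed_first = decomposed[:1]   # just 'e', the accent is dropped
composed_first = composed[:1]       # 'é' as a single code point

print(len(decomposed), len(composed))
print(repr(decomposed_first), repr(composed_first))
```

So a docstring that says "character" may want to spell out that it means code points (Unicode scalar values), not grapheme clusters.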

@mcrumiller

@ritchie46 I ended up just doing small implementations since I wasn't sure where aliasing should stop; maybe the compiler can detect some shortcut this way. Looks like everything is working.
