Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add {floor,ceil}_char_boundary methods to str #86497

Merged
merged 1 commit into from
Feb 8, 2022

Conversation

clarfonthey
Copy link
Contributor

@clarfonthey clarfonthey commented Jun 20, 2021

This is technically already used internally by the standard library in the form of truncate_to_char_boundary.

Essentially these are two building blocks to allow for approximate string truncation, where you want to cut off the string at "approximately" a given length in bytes but don't know exactly where the character boundaries lie. It's also a good candidate for the standard library as it can easily be done naively, but would be difficult to properly optimise. Although the existing code that's done in error messages is done naively, this code will explicitly only check a window of 4 bytes since we know that a boundary must lie in that range, and because it will make it possible to vectorise.

Although this method doesn't take into account graphemes or other properties, this would still be a required building block for splitting that takes those into account. For example, if you wanted to split at a grapheme boundary, you could take your approximate splitting point and then determine the graphemes immediately following and preceeding the split. If you then notice that these two graphemes could be merged, you can decide to either include the whole grapheme or exclude it depending on whether you decide splitting should shrink or expand the string.

This takes the most conservative approach and just offers the raw indices to the user, and they can decide how to use them. That way, the methods are as useful as possible despite having as few methods as possible.

(Note: I'll add some tests and a tracking issue if it's decided that this is worth including.)

@rust-highfive
Copy link
Collaborator

r? @Mark-Simulacrum

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 20, 2021
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@clarfonthey
Copy link
Contributor Author

(I know how to fix the error but my desktop is in literal pieces at the moment, so, I'll fix it this weekend.)

@scottmcm
Copy link
Member

The functionality of these seems like a good thing to provide 👍

For the bikeshed, mixing "nearest" and "min/max" in the name is a bit confusing to me. Maybe char_boundary_begin and char_boundary_end or something? (And thus is_char_boundary == (char_boundary_begin == char_boundary_end).)

Or maybe, with subtly different semantics char_{begin|end}? So that s.char_begin(i) .. s.char_end(i) is the Range<usize> representing the USV that includes s.as_bytes()[i]?

That raises the question of endpoints, I guess. What's .nearest_char_boundary_min(usize::MAX) supposed to do?

@clarfonthey
Copy link
Contributor Author

That raises the question of endpoints, I guess. What's .nearest_char_boundary_min(usize::MAX) supposed to do?

Forgetting the names (and I agree, it's hard to come up with something reasonable) the current behaviour here is to panic.

The one which is upper-bounded will currently not panic, which I think is good for truncating behaviour -- if you want your string to be under, say, 20 bytes, and it's empty, that's fine.

Lower-bounded ones must panic if the index is above len, since they can't go any lower.

If we were to use the concept of bounding the results to a specific character, then our issue is that we treat past the end as an "end of string" character, and this isn't C.

My initial terms for the methods were "largest char boundary not exceeding" and "smallest char boundary not below," and I figured those names sucked, but that's the intent that I wanted to preserve in the methods themselves.

library/core/src/str/mod.rs Outdated Show resolved Hide resolved
library/core/src/str/mod.rs Outdated Show resolved Hide resolved
@Mark-Simulacrum
Copy link
Member

Mark-Simulacrum commented Jun 28, 2021

r? @dtolnay as T-libs-api to evaluate the addition of unstable API surface area. Implementations look OK to me modulo comments but I didn't evaluate the correctness of the search criteria (i.e., as casts and equality to -0x40).

@clarfonthey
Copy link
Contributor Author

clarfonthey commented Jun 28, 2021

Note that the -0x40 was taken directly from is_char_boundary. I could factor out that bit into a separate private method (like byte_is_char_boundary) to avoid potential drift.

(Edit: I did that)

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@clarfonthey
Copy link
Contributor Author

clarfonthey commented Jun 30, 2021

That test passed for me locally, and I'm not 100% sure on how to fix it. I also know that u8::is_utf8_char_boundary should probably be inlined. Mostly want to wait for a more affirmative "yes, I like this API ± name bikeshedding" before I put more work into it.

@clarfonthey clarfonthey force-pushed the nearest_char_boundary branch 2 times, most recently from a814a25 to 8b6d128 Compare July 1, 2021 21:58
@rust-log-analyzer

This comment has been minimized.

@clarfonthey
Copy link
Contributor Author

Rebased, would be nice if someone could take a look before I have to rebase again. 😉

@@ -213,7 +209,81 @@ impl str {
None => index == self.len(),

// This is bit magic equivalent to: b < 128 || b >= 192

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should probably move to is_utf8_char_boundary.

Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These method names and signatures look good to me. I was skeptical about the previous iterations but these are nice. Thanks for all the discussion above.

Please open a tracking issue and update the unstable attributes in the PR.

This is going to require more extensive test coverage prior to landing. See if you can find where the unit tests are for the other methods in this impl block.

library/core/src/str/mod.rs Outdated Show resolved Hide resolved
library/core/src/str/mod.rs Outdated Show resolved Hide resolved
@dtolnay dtolnay added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 1, 2021
@crlf0710
Copy link
Member

@clarfonthey Ping from triage, would you mind addressing the comments above? Thanks!

@clarfonthey
Copy link
Contributor Author

Hey, sorry that I took so long getting to this. Will add some more tests and see what I can do about fixing the implementation.

@clarfonthey
Copy link
Contributor Author

Okay, this is finally ready for a proper review. I added a (hopefully) thorough set of tests and fixed the issue with ceil_char_boundary, which also did result in the code being more readable (IMHO) anyway.

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jan 29, 2022
@bors
Copy link
Contributor

bors commented Feb 6, 2022

☔ The latest upstream changes (presumably #90414) made this pull request unmergeable. Please resolve the merge conflicts.

@scottmcm
Copy link
Member

scottmcm commented Feb 6, 2022

This is looking pretty good to me, code-wise, now.

Can you please add the tracking issue (as mentioned in #86497 (review))? With that I think it'll be good to go.

@clarfonthey
Copy link
Contributor Author

Tracking issue made, also fixed the test being called ceil_char_bonudary instead of ceil_char_boundary 😳

Should be ready for final CR

@rust-log-analyzer

This comment has been minimized.

@clarfonthey
Copy link
Contributor Author

(I had accidentally committed the wrong version of a submodule when rebasing, so, hopefully it doesn't fail this time.)

@scottmcm
Copy link
Member

scottmcm commented Feb 7, 2022

Thanks for your persistence here! These look good to go for nightly.

@bors r+

If anyone has further thoughts, please use the tracking issue #93743 or a new bug cc'ing it.

@bors
Copy link
Contributor

bors commented Feb 7, 2022

📌 Commit edd318c has been approved by scottmcm

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 7, 2022
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 8, 2022
…askrgr

Rollup of 9 pull requests

Successful merges:

 - rust-lang#86497 (Add {floor,ceil}_char_boundary methods to str)
 - rust-lang#92695 (Add `#[no_coverage]` tests for nested functions)
 - rust-lang#93521 (Fix hover effects in sidebar)
 - rust-lang#93568 (Include all contents of first line of scraped item in Rustdoc)
 - rust-lang#93569 (rustdoc: correct unclosed HTML tags as generics)
 - rust-lang#93672 (update comment wrt const param defaults)
 - rust-lang#93715 (Fix horizontal trim for block doc comments)
 - rust-lang#93721 (rustdoc: Special-case macro lookups less)
 - rust-lang#93728 (Add in ValuePair::Term)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 1f841fc into rust-lang:master Feb 8, 2022
@rustbot rustbot added this to the 1.60.0 milestone Feb 8, 2022
@clarfonthey clarfonthey deleted the nearest_char_boundary branch February 10, 2022 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.