Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update stdlib string functions to work on Unicode extended grapheme clusters #1200

Merged
merged 10 commits into from
Mar 29, 2023

Conversation

matthew-healy
Copy link
Contributor

@matthew-healy matthew-healy commented Mar 25, 2023

This PR updates the standard library string module to function predominantly on Unicode extended grapheme clusters.

In particular:

  • string.split, string.characters, string.replace, string.substring and string.contains all avoid ever breaking up extended grapheme clusters,
  • string.codepoint and string.from_codepoint have been removed,
  • a lot of string functionality has been pulled out of eval/operation.rs and moved into a new term/string.rs module.

What's left?

This PR doesn't touch any of the functions which take a regex argument. I'll make a follow-up PR when I get some time to work on it.

Review tips

I'd recommend reviewing c72f490 and e197dab individually, but then skipping to 60132c0, as the commits in-between mostly reimplement the string operations in-place, whereas that commit refactors everything into the string module. It's likely simpler to just review the (heavily-commented) string module than each of the intermediate states.

Please let me know if you have any questions!

@github-actions github-actions bot temporarily deployed to pull request March 25, 2023 00:45 Inactive
@github-actions github-actions bot temporarily deployed to pull request March 25, 2023 16:16 Inactive
@matthew-healy matthew-healy marked this pull request as ready for review March 27, 2023 12:14
src/term/string.rs Outdated Show resolved Hide resolved
/// containing a single Unicode extended grapheme cluster.
///
/// This method has `O(self.len())` time complexity.
pub fn characters(&self) -> Array {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to have this method be an exact alias of grapheme_clusters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My justification was that outside of the string module - and specifically within operation.rs - we don't necessarily want to have to think about what an extended grapheme cluster is. However, inside string.rs we likely do want to know that we're breaking a string into grapheme clusters, e.g. in the separator.is_empty() clause of split. But I could also be overthinking it a little. 🤷

I guess I could at least mark grapheme_clusters as #[inline(always)].

src/term/string.rs Outdated Show resolved Hide resolved
// We need to know the length of the grapheme cluster iterator,
// and we have to walk the graphemes to do that, so we might
// as well build a Vec to make slicing out the substring easier.
let graphemes: Vec<_> = self.graphemes(true).collect();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to know the length in advance? Can't we just consume the iterator and raise an error if we are running out of elements before end? For example use skip(start).take(end-start), collecting, and checking after the fact that the size of the slice is as large as expected. Because the current approach means always allocating a vec of the size of the whole string even if we only want 4 characters out of 1000, which doesn't sound optimal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would never have thought to do this! Great suggestion.

Two small caveats to this though. One is that we've replaced the allocation with an extra iteration of the length of the substring at the end, because we need to walk its grapheme clusters to figure out whether it was actually the length we wanted. The other is that we need to walk the entire length of the string in order to produce error messages.

I think optimising for the happy path, and for shorter substring lengths, makes sense though, so I've implemented your suggestion.

src/term/string.rs Outdated Show resolved Hide resolved
src/term/string.rs Show resolved Hide resolved
src/term/string.rs Show resolved Hide resolved
src/term/string.rs Outdated Show resolved Hide resolved
src/term/string.rs Show resolved Hide resolved
Matthew Healy added 10 commits March 28, 2023 21:36
This commit splits the monolithic stdlib string test file into
multiple smaller files, roughly split according to "domain" behaviour.

It also adds some new test cases in places where the coverage seemed
a little thin.
This commit re-implements `string.split` so that extended grapheme
clusters are never broken up, even if the separator is included
within.
These functions break the extended grapheme cluster abstraction,
and don't have equivalents in the Nix language, so we've decided
to remove them.
This updates the implementation of `string.contains` to avoid returning
true if the string we're searching for only exists as part of a larger
extended grapheme cluster.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants