-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update stdlib string functions to work on Unicode extended grapheme clusters #1200
Conversation
3ddd8de
to
b761aad
Compare
/// containing a single Unicode extended grapheme cluster. | ||
/// | ||
/// This method has `O(self.len())` time complexity. | ||
pub fn characters(&self) -> Array { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason to have this method be an exact alias of grapheme_clusters
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My justification was that outside of the string
module - and specifically within operation.rs
- we don't necessarily want to have to think about what an extended grapheme cluster is. However, inside string.rs
we likely do want to know that we're breaking a string into grapheme clusters, e.g. in the separator.is_empty()
clause of split
. But I could also be overthinking it a little. 🤷
I guess I could at least mark grapheme_clusters
as #[inline(always)]
.
src/term/string.rs
Outdated
// We need to know the length of the grapheme cluster iterator, | ||
// and we have to walk the graphemes to do that, so we might | ||
// as well build a Vec to make slicing out the substring easier. | ||
let graphemes: Vec<_> = self.graphemes(true).collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to know the length in advance? Can't we just consume the iterator and raise an error if we are running out of elements before end
? For example use skip(start).take(end-start)
, collecting, and checking after the fact that the size of the slice is as large as expected. Because the current approach means always allocating a vec of the size of the whole string even if we only want 4 characters out of 1000, which doesn't sound optimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would never have thought to do this! Great suggestion.
Two small caveats to this though. One is that we've replaced the allocation with an extra iteration of the length of the substring at the end, because we need to walk its grapheme clusters to figure out whether it was actually the length we wanted. The other is that we need to walk the entire length of the string in order to produce error messages.
I think optimising for the happy path, and for shorter substring lengths, makes sense though, so I've implemented your suggestion.
This commit splits the monolithic stdlib string test file into multiple smaller files, roughly split according to "domain" behaviour. It also adds some new test cases in places where the coverage seemed a little thin.
This commit re-implements `string.split` so that extended grapheme clusters are never broken up, even if the separator is included within.
These functions break the extended grapheme cluster abstraction, and don't have equivalents in the Nix language, so we've decided to remove them.
This updates the implementation of `string.contains` to avoid returning true if the string we're searching for only exists as part of a larger extended grapheme cluster.
b761aad
to
81d231c
Compare
This PR updates the standard library
string
module to function predominantly on Unicode extended grapheme clusters.In particular:
string.split
,string.characters
,string.replace
,string.substring
andstring.contains
all avoid ever breaking up extended grapheme clusters,string.codepoint
andstring.from_codepoint
have been removed,eval/operation.rs
and moved into a newterm/string.rs
module.What's left?
This PR doesn't touch any of the functions which take a regex argument. I'll make a follow-up PR when I get some time to work on it.
Review tips
I'd recommend reviewing c72f490 and e197dab individually, but then skipping to 60132c0, as the commits in-between mostly reimplement the string operations in-place, whereas that commit refactors everything into the
string
module. It's likely simpler to just review the (heavily-commented)string
module than each of the intermediate states.Please let me know if you have any questions!