Update stdlib string functions to work on Unicode extended grapheme clusters #1200

matthew-healy · 2023-03-25T00:42:42Z

This PR updates the standard library string module to function predominantly on Unicode extended grapheme clusters.

In particular:

string.split, string.characters, string.replace, string.substring and string.contains all avoid ever breaking up extended grapheme clusters,
string.codepoint and string.from_codepoint have been removed,
a lot of string functionality has been pulled out of eval/operation.rs and moved into a new term/string.rs module.
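To illustrate why code-point-level operations can break the abstraction, here is a minimal std-only sketch. The helper names `scalar_count` and `first_scalar` are hypothetical and are not part of this PR. It shows that a single user-perceived character such as "é", written as `e` followed by the combining acute accent U+0301, consists of two Unicode scalar values, so any operation that works per `char` can cut it in half.

```rust
/// Hypothetical helper: counts Unicode scalar values (Rust `char`s) in a string.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// Hypothetical helper: returns the first scalar value, which may be only
/// part of the first extended grapheme cluster.
fn first_scalar(s: &str) -> Option<char> {
    s.chars().next()
}
```

For the string `"e\u{301}"`, `scalar_count` returns 2 and `first_scalar` returns a bare `'e'` with the accent dropped. Iterating with `graphemes(true)` from the `unicode-segmentation` crate, as the diff in this PR does, would instead yield the whole cluster as one item.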

What's left?

This PR doesn't touch any of the functions which take a regex argument. I'll make a follow-up PR when I get some time to work on it.

Review tips

I'd recommend reviewing c72f490 and e197dab individually, but then skipping to 60132c0, as the commits in between mostly reimplement the string operations in place, whereas that commit refactors everything into the string module. It's likely simpler to just review the (heavily commented) string module than each of the intermediate states.

Please let me know if you have any questions!

yannham · 2023-03-27T15:52:24Z

src/term/string.rs

+    /// containing a single Unicode extended grapheme cluster.
+    ///
+    /// This method has `O(self.len())` time complexity.
+    pub fn characters(&self) -> Array {


Is there any reason to have this method be an exact alias of grapheme_clusters?

matthew-healy · 2023-03-28

My justification was that outside of the string module, and specifically within operation.rs, we don't necessarily want to have to think about what an extended grapheme cluster is. However, inside string.rs we likely do want to know that we're breaking a string into grapheme clusters, e.g. in the separator.is_empty() clause of split. But I could also be overthinking it a little. 🤷

I guess I could at least mark grapheme_clusters as #[inline(always)].

yannham · 2023-03-27T15:57:15Z

src/term/string.rs

+        // We need to know the length of the grapheme cluster iterator,
+        // and we have to walk the graphemes to do that, so we might
+        // as well build a Vec to make slicing out the substring easier.
+        let graphemes: Vec<_> = self.graphemes(true).collect();


Do we really need to know the length in advance? Can't we just consume the iterator and raise an error if we are running out of elements before end? For example use skip(start).take(end-start), collecting, and checking after the fact that the size of the slice is as large as expected. Because the current approach means always allocating a vec of the size of the whole string even if we only want 4 characters out of 1000, which doesn't sound optimal.

matthew-healy · 2023-03-28

I would never have thought to do this! Great suggestion.

Two small caveats to this though. One is that we've replaced the allocation with an extra iteration of the length of the substring at the end, because we need to walk its grapheme clusters to figure out whether it was actually the length we wanted. The other is that we need to walk the entire length of the string in order to produce error messages.

I think optimising for the happy path, and for shorter substring lengths, makes sense though, so I've implemented your suggestion.
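For reference, the `skip`/`take` approach discussed above could look roughly like the sketch below. This is not the code merged in this PR: `slice_clusters` is a hypothetical helper, and it takes any iterator of string slices so that it can be tried without the `unicode-segmentation` crate. In real use the iterator would be something like `s.graphemes(true)`.

```rust
/// Hypothetical helper: collects the clusters in the half-open range
/// `start..end` without first materialising every cluster of the input.
fn slice_clusters<'a, I>(clusters: I, start: usize, end: usize) -> Result<String, String>
where
    I: Iterator<Item = &'a str>,
{
    if start > end {
        return Err(format!("start ({}) is greater than end ({})", start, end));
    }
    let wanted = end - start;
    // Only `wanted` items are ever stored, however long the input is.
    let taken: Vec<&str> = clusters.skip(start).take(wanted).collect();
    // Check after the fact that the input was long enough.
    if taken.len() < wanted {
        return Err(format!(
            "expected {} clusters starting at index {}, found only {}",
            wanted,
            start,
            taken.len()
        ));
    }
    Ok(taken.concat())
}
```

One limitation of this sketch: when `start == end`, it returns an empty string even if `start` is past the end of the input, because zero items are requested. A real implementation would need to decide whether that case should be an error.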

matthew-healy force-pushed the feature/extd-grapheme-clusters branch from 3ddd8de to b761aad Compare March 25, 2023 16:13

matthew-healy marked this pull request as ready for review March 27, 2023 12:14

matthew-healy requested review from yannham, ebresafegaga, vkleen and dpl0a March 27, 2023 12:16

yannham approved these changes Mar 27, 2023

vkleen approved these changes Mar 28, 2023

Matthew Healy added 10 commits March 28, 2023 21:36

Delete dead code from strings.ncl

d611bc7

Split stdlib_string.ncl into multiple test files

cdae7b3

This commit splits the monolithic stdlib string test file into multiple smaller files, roughly split according to "domain" behaviour. It also adds some new test cases in places where the coverage seemed a little thin.

Maintain extended grapheme cluster abstraction in string.split

965faad

This commit re-implements `string.split` so that extended grapheme clusters are never broken up, even if the separator is included within.
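As a rough illustration of the behaviour described in this commit message, the sketch below splits a sequence of already-segmented clusters on a separator that is itself a sequence of clusters. It is an assumption about one way to get the property, not the implementation in this commit: `split_clusters` is a hypothetical name, and the inputs are plain slices so the example needs only the standard library.

```rust
/// Hypothetical helper: splits `clusters` wherever the whole `sep` sequence
/// occurs, comparing cluster by cluster so no cluster is ever cut apart.
fn split_clusters(clusters: &[&str], sep: &[&str]) -> Vec<String> {
    // With an empty separator, return each cluster on its own.
    if sep.is_empty() {
        return clusters.iter().map(|c| c.to_string()).collect();
    }
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut i = 0;
    while i < clusters.len() {
        if clusters[i..].starts_with(sep) {
            // Separator found on a cluster boundary: close the current part.
            parts.push(std::mem::take(&mut current));
            i += sep.len();
        } else {
            current.push_str(clusters[i]);
            i += 1;
        }
    }
    parts.push(current);
    parts
}
```

With this shape, a separator of `"e"` does not split the input at the cluster `"e\u{301}"`, because the two are compared as whole clusters and are not equal.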

Return extended grapheme clusters from string.characters

363588a

Remove string.codepoint and string.from_codepoint

04bbea8

These functions break the extended grapheme cluster abstraction, and don't have equivalents in the Nix language, so we've decided to remove them.

Use grapheme cluster indices in string.substring

d18c785

Ensure string.replace doesn't separate extended grapheme clusters

337efaa

string.contains should check entire grapheme clusters

8516ccd

This updates the implementation of `string.contains` to avoid returning true if the string we're searching for only exists as part of a larger extended grapheme cluster.
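The check described here can be sketched as a search over whole clusters. This is a simplified illustration under the assumption that both strings have already been segmented into extended grapheme clusters; `contains_clusters` is a hypothetical name and the commit's actual implementation may differ.

```rust
/// Hypothetical helper: returns true only if `needle` occurs in `haystack`
/// as a contiguous run of whole clusters.
fn contains_clusters(haystack: &[&str], needle: &[&str]) -> bool {
    // An empty needle is treated as contained in every string;
    // this also avoids calling `windows(0)`, which would panic.
    if needle.is_empty() {
        return true;
    }
    haystack
        .windows(needle.len())
        .any(|window| window == needle)
}
```

Searching `["a", "e\u{301}", "b"]` for `["e"]` returns false, since the bare `e` only exists inside the larger cluster, whereas a byte-level search such as `str::contains` would report a match.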

Refactor Nickel string implementation to term::string module

2b74bf4

Update stdlib documentation

81d231c

matthew-healy force-pushed the feature/extd-grapheme-clusters branch from b761aad to 81d231c Compare March 28, 2023 20:36

matthew-healy merged commit 828d972 into master Mar 29, 2023

matthew-healy deleted the feature/extd-grapheme-clusters branch March 29, 2023 08:30

matthew-healy mentioned this pull request Mar 29, 2023

stdlib string functions should work on extended grapheme clusters #1007

Closed

matthew-healy mentioned this pull request Apr 14, 2023

Avoid breaking up grapheme clusters in stdlib regex functions #1252

Merged
