Optimized isascii and length #22

oxinabox · 2020-10-30T16:00:53Z

This PR needs to be rebased on top of #30 once that is merged and the conflicts are fixed.

oxinabox · 2020-10-30T16:01:42Z

src/base.jl

+        return Char(shifted % UInt32 & 0xffff), i+3
+    else  # 4 byte character
+        return Char(shifted % UInt32), i+3
+    end


I am not sure I have this right.

Constructing a Char from a UInt is incredibly inefficient, because the code point has to be re-encoded into Char's internal representation.
Since the string is already presumed to be UTF-8 encoded in this case, you could look at the indexed byte, then shift and mask so that you get the 1-4 bytes in a UInt32, and reinterpret that as a Char instead, which should be a lot more efficient.

Yep, it is wrong.
To do this right, we need to look at 4-byte chunks.
But it is seriously fiddly.

Do you want me to write this function?

Yes, that would be useful.

@inline _get_word(s::ShortString{T}, i::Int) where {T} =
    (s.size_content >> (8*(sizeof(T) - i - 3))) % UInt32

@inline function Base.iterate(s::ShortString, i::Int=1)
    0 < i <= ncodeunits(s) || return nothing
    # Load the 4 bytes starting at code unit i, most significant byte first
    chr = _get_word(s, i)
    # Keep only the bytes that belong to this character, based on the lead byte
    chr < 0x8000_0000 ? (reinterpret(Char, chr & 0xFF00_0000), i + 1) :
    chr < 0xe000_0000 ? (reinterpret(Char, chr & 0xFFFF_0000), i + 2) :
    chr < 0xf000_0000 ? (reinterpret(Char, chr & 0xFFFF_FF00), i + 3) :
                        (reinterpret(Char, chr), i + 4)
end

Try this

I think that's probably about as fast as possible for this, since it keeps the bytes in the same big-endian order that both ShortString and Char use, and does nothing more than a shift and a mask.
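To make the shift-and-mask idea easier to try in isolation, here is a minimal self-contained sketch. The names `ToyShort`, `toy_ncodeunits`, `toy_word`, `toy_iterate`, and `toy_chars` are hypothetical, and the layout (content bytes packed from the most significant byte of a `UInt128`, length in the lowest byte) is an assumption made for illustration, not necessarily the package's actual struct. It assumes valid UTF-8 input.

```julia
# Hypothetical toy type: up to 15 content bytes packed from the top of a UInt128,
# with the byte length stored in the lowest byte (an assumed layout).
struct ToyShort
    bits::UInt128
end

function ToyShort(s::String)
    n = ncodeunits(s)
    n <= 15 || throw(ArgumentError("string too long for ToyShort"))
    bits = UInt128(n)
    for (k, b) in enumerate(codeunits(s))
        bits |= UInt128(b) << (8 * (16 - k))   # byte 1 lands in the top byte
    end
    return ToyShort(bits)
end

toy_ncodeunits(t::ToyShort) = Int(t.bits % UInt8)

# The 4 bytes starting at code unit i, most significant first.
# A negative shift count in Julia shifts left, which covers the last few positions.
toy_word(t::ToyShort, i::Int) = (t.bits >> (8 * (16 - i - 3))) % UInt32

function toy_iterate(t::ToyShort, i::Int=1)
    0 < i <= toy_ncodeunits(t) || return nothing
    w = toy_word(t, i)
    # Mask off the bytes that belong to the following characters
    w < 0x8000_0000 ? (reinterpret(Char, w & 0xff00_0000), i + 1) :
    w < 0xe000_0000 ? (reinterpret(Char, w & 0xffff_0000), i + 2) :
    w < 0xf000_0000 ? (reinterpret(Char, w & 0xffff_ff00), i + 3) :
                      (reinterpret(Char, w), i + 4)
end

# Collect all characters by driving toy_iterate by hand
function toy_chars(t::ToyShort)
    out = Char[]
    st = toy_iterate(t)
    while st !== nothing
        c, i = st
        push!(out, c)
        st = toy_iterate(t, i)
    end
    return out
end
```

Round-tripping a string with 1-, 2-, 3-, and 4-byte characters, such as `String(toy_chars(ToyShort("aé€😀")))`, is a quick way to check each branch.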

I don't know how reinterpret works here for 4-byte characters:
https://discourse.julialang.org/t/how-does-char-get-stored/49366
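On the reinterpret question, a short sketch may help: a Julia `Char` is a 32-bit value holding the character's UTF-8 bytes left-aligned, most significant byte first, so a 4-byte character simply fills the whole word. The helper name `word_of` is hypothetical.

```julia
# Hypothetical helper: view a Char's raw 32-bit representation
word_of(c::Char) = reinterpret(UInt32, c)

# UTF-8 bytes sit at the top of the word, padded with zero bytes below
@assert word_of('a')  == 0x6100_0000   # 1 byte:  61
@assert word_of('é')  == 0xc3a9_0000   # 2 bytes: c3 a9
@assert word_of('€')  == 0xe282_ac00   # 3 bytes: e2 82 ac
@assert word_of('😀') == 0xf09f_9880   # 4 bytes: f0 9f 98 80

# reinterpret goes straight back, with no re-encoding step
@assert reinterpret(Char, 0xf09f_9880) == '😀'

# Char(u) instead starts from the code point and has to encode it
@assert Char(0x0001f600) == '😀'
@assert codepoint('😀') == 0x0001f600
```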

I split out just the fast iterate code above and put it into a separate PR (#30). I hope you don't mind, @oxinabox!
I'm just trying to get things moving along as quickly as possible 😄

ScottPJones · 2020-10-30T18:43:05Z

src/base.jl

+        return Char(shifted % UInt32 & 0xffff), i+3
+    else  # 4 byte character
+        return Char(shifted % UInt32), i+3
+    end


Constructing a Char from a UInt is incredibly inefficient, because the code point has to be re-encoded into Char's internal representation.
Since the string is already presumed to be UTF-8 encoded in this case, you could look at the indexed byte, then shift and mask so that you get the 1-4 bytes in a UInt32, and reinterpret that as a Char instead, which should be a lot more efficient.

ScottPJones · 2020-10-30T20:27:17Z

src/base.jl


-Base.collect(s::ShortString) = collect(String(s))
+function ==(s::ShortString{S}, b::AbstractString) where S
+    ncodeunits(b) == ncodeunits(s) || return false


This check is incorrect: the number of code units is dependent on how it is encoded.
I think it would be useful for ShortString to have another parameter, which is the type of string that is being encoded, such as String, ASCIIStr, UTF8Str, Latin1Str, etc. (or even the old LegacyString types).
Maybe some traits could be used or defined so that these could be compared in the most efficient fashion.
(As an example, an emoji takes 4 code units in String or UTF8Str, 2 code units in UTF16Str, and 1 code unit in UTF32Str.)
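The emoji example can be checked with Base Julia alone. This is only a sketch: it uses `transcode` and `collect` to count code units rather than the UTF16Str or UTF32Str types mentioned above, and the helper names are hypothetical.

```julia
# Hypothetical helpers: count code units of the same text in three encodings
utf8_units(s::String)  = ncodeunits(s)                  # String is UTF-8
utf16_units(s::String) = length(transcode(UInt16, s))   # UTF-16 code units
utf32_units(s::String) = length(collect(s))             # one code unit per code point

emoji = "😀"
@assert utf8_units(emoji)  == 4
@assert utf16_units(emoji) == 2   # surrogate pair
@assert utf32_units(emoji) == 1

# Same text, different code unit counts: comparing ncodeunits across
# differently encoded strings would reject strings that are actually equal.
@assert utf8_units("é") == 2
@assert utf16_units("é") == 1
```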

True, this actually shouldn't be in this PR. It got caught up from #16.

> number of code units is dependent on how it is encoded.

This is not something I understand. Is there anything I can read up on?

Some encoding schemes are byte-oriented, such as UTF-8, but UCS-2 and UTF-16 use 16-bit code units, and UTF-32 uses 32-bit code units.

UTF-16 is actually a lot more efficient than UTF-8 for storing non-Western-European languages. Almost all of the common Chinese, Korean, and Japanese characters take only 2 bytes in UTF-16, instead of 3 bytes in UTF-8, so you'd be able to store 7 characters in a UInt128, instead of just 5.

The largest size in ShortStrings would store up to 63 UTF-16 characters (but only up to 42 of the same Asian characters using UTF-8).
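The arithmetic behind those counts can be sketched as follows, assuming one byte of the backing integer is reserved for the length and the rest holds content. The function name `capacity` is hypothetical.

```julia
# Hypothetical helper: how many characters of a fixed byte width fit,
# assuming 1 byte of the backing integer is reserved for the length.
capacity(total_bytes::Int, bytes_per_char::Int) = div(total_bytes - 1, bytes_per_char)

# UInt128 is 16 bytes, leaving 15 for content
@assert capacity(sizeof(UInt128), 2) == 7    # UTF-16, 2 bytes per character
@assert capacity(sizeof(UInt128), 3) == 5    # UTF-8, 3 bytes per character

# A 1024-bit backing integer is 128 bytes, leaving 127 for content
@assert capacity(128, 2) == 63
@assert capacity(128, 3) == 42
```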

ScottPJones · 2020-10-31T18:18:47Z

If you move your package to the JuliaString org (not the other, Johnny-come-lately one!), you will be able to still work on it as you wish, as well as Lyndon and Rafael Fourquet and myself, and handle updating it, reviewing PRs, etc.

ScottPJones · 2020-11-01T21:41:11Z

I have a question: why are the lengths 30, 62, and 126, instead of 31, 63, and 127 (and, if there were a UInt2048, 255)?
The maximum length (in bytes) for any of those fits in just one byte (2 nibbles).
Also, since characters are always encoded in one or more whole bytes, the length could be expressed simply in bytes, unless you want to use an extra nibble to store other information about the short string (which might be useful).
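A quick check of that claim, as a sketch only: if a single byte is reserved for the length, the remaining content size for each backing width still fits in a `UInt8`. The name `max_content_bytes` is hypothetical, and the widths are given in bytes because Base Julia has no integer types wider than `UInt128`.

```julia
# Hypothetical helper: content bytes left when 1 byte is reserved for the length
max_content_bytes(total_bytes::Int) = total_bytes - 1

# Backing widths of 256, 512, 1024, and 2048 bits, expressed in bytes
widths = (32, 64, 128, 256)
limits = map(max_content_bytes, widths)

@assert limits == (31, 63, 127, 255)
# Every limit is representable in a single length byte
@assert all(l -> l <= typemax(UInt8), limits)
```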

oxinabox · 2020-11-02T12:30:34Z

Good question. Out of scope for this PR, open an issue?

oxinabox changed the title ~~WIP: Fix hash (and iterate)~~ WIP: Do iterate directly Oct 30, 2020

oxinabox added 6 commits October 31, 2020 18:41

WIP

11315c9

fix last

2918aa8

fix iteration

763455e

Detect if isascii and use that to compute length

f3a2d2b

Don't change Hashing

ce28af0

add a test showing that is screwing up iteration

657c345

oxinabox force-pushed the ox/iteratehash branch from a6681b7 to 657c345 Compare October 31, 2020 18:43

xiaodaigh mentioned this pull request Nov 2, 2020

I have a question - why are the lengths 30, 62, 126, instead of 31, 63, and 127 (and if there were a UInt2048, 255)? #29

Closed

oxinabox changed the title ~~WIP: Do iterate directly~~ Optimized isascii and length Nov 2, 2020

Merge branch 'master' into ox/iteratehash

5d9f970

ScottPJones merged commit 14bf784 into JuliaString:master Nov 3, 2020
