-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized isascii and length #22
Conversation
return Char(shifted % UInt32 & 0xffff), i+3 | ||
else # 4 byte character | ||
return Char(shifted % UInt32), i+3 | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure i have this right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Char is incredibly inefficient, going from a UInt
to a Char
.
Since in this case, the string is already presumed to be UTF8 encoded, then you could look at the indexed byte, and then shift and mask so that you get the 1-4 bytes in a UInt32
, and reinterpret it to Char
instead, which should be a lot more efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep it is wrong.
to do this right Need to look at 4 byte chunks.
But it is seriously fiddly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want me to write this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@inline _get_word(s::ShortString{T}, i::Int) where {T} =
(x.size_content >> (8*(sizeof(T) - i - 3)))%UInt32
@inline function Base.iterate(s::ShortString, i::Int=1)
0 < i <= ncodeunits(s) || return nothing
chr = _get_word(s, i)
chr < 0x8000_0000 ? (reinterpret(Char, chr & 0xFF00_0000), i + 1) :
chr < 0xe000_0000 ? (reinterpret(Char, chr & 0xFFFF_0000), i + 2) :
chr < 0xf000_0000 ? (reinterpret(Char, chr & 0xFFFF_FF00), i + 3) :
(reinterpret(Char, chr), i + 4)
end
Try this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's probably about as fast as possible for this, since it keeps the bytes in the same big endian order that both ShortString and Char use, and just does nothing more than a shift and mask
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how reinterpret
works here for 4 byte characters
https://discourse.julialang.org/t/how-does-char-get-stored/49366
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return Char(shifted % UInt32 & 0xffff), i+3 | ||
else # 4 byte character | ||
return Char(shifted % UInt32), i+3 | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Char is incredibly inefficient, going from a UInt
to a Char
.
Since in this case, the string is already presumed to be UTF8 encoded, then you could look at the indexed byte, and then shift and mask so that you get the 1-4 bytes in a UInt32
, and reinterpret it to Char
instead, which should be a lot more efficient.
|
||
Base.collect(s::ShortString) = collect(String(s)) | ||
function ==(s::ShortString{S}, b::AbstractString) where S | ||
ncodeunits(b) == ncodeunits(s) || return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is incorrect, the number of code units is dependent on how it is encoded.
I think it would be useful for ShortString to have another parameter, which is the type of string that is being encoded, such as String
, or ASCIIStr
, UTF8Str
, Latin1Str
, etc. (or even the old LegacyString
types)
Maybe some traits could be used / defined so that these could be compared in the most efficient fashion.
(As an example, an Emoji takes 4 codeunits in String or UTF8Str, 2 codeunits with UTF16Str, and 1 codeunit with UTF32Str)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, this actually shouldn't be in this PR. it got caught up from #16
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
number of code units is dependent on how it is encoded.
this is not something I understand. Is there anything I can read up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some encoding schemes are byte oriented, such as UTF-8, but UCS2 and UTF-16 use 16-bit codeunits, and UTF-32 uses 32-bit codeunits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UTF-16 is actually a lot more efficient for storing non Western European languages than UTF-8. Most all of the common Chinese, Korean, and Japanese characters only take 2 bytes, instead of 3 bytes in UTF-8, so you'd be able to store 7 characters in a UInt128, instead of just 5.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The largest size in ShortStrings would store up to 63 UTF-16 characters (but only up to 42 of the same Asian characters using UTF-8)
If you move your package to the JuliaString org (not the other, Johnny-come-lately one!), you will be able to still work on it as you wish, as well as Lyndon and Rafael Fourquet and myself, and handle updating it, reviewing PRs, etc. |
a6681b7
to
657c345
Compare
I have a question - why are the lengths 30, 62, 126, instead of 31, 63, and 127 (and if there were a UInt2048, 255)? |
Good question. Out of scope for this PR, open an issue? |
This PR needs to be rebased on top of #30 once that is merged and the conflicts fixed