Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized isascii and length #22

Merged
merged 7 commits into from
Nov 3, 2020

Conversation

oxinabox
Copy link
Member

@oxinabox oxinabox commented Oct 30, 2020

This PR needs to be rebased on top of #30 once that is merged and the conflicts fixed

return Char(shifted % UInt32 & 0xffff), i+3
else # 4 byte character
return Char(shifted % UInt32), i+3
end
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure i have this right

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Char is incredibly inefficient, going from a UInt to a Char.
Since in this case, the string is already presumed to be UTF8 encoded, then you could look at the indexed byte, and then shift and mask so that you get the 1-4 bytes in a UInt32, and reinterpret it to Char instead, which should be a lot more efficient.

Copy link
Member Author

@oxinabox oxinabox Oct 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep it is wrong.
to do this right Need to look at 4 byte chunks.
But it is seriously fiddly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want me to write this function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that would be useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inline _get_word(s::ShortString{T}, i::Int) where {T} =
           (x.size_content >> (8*(sizeof(T) - i - 3)))%UInt32

@inline function Base.iterate(s::ShortString, i::Int=1)
       0 < i <= ncodeunits(s) || return nothing
       chr = _get_word(s, i)
       chr < 0x8000_0000 ? (reinterpret(Char, chr & 0xFF00_0000), i + 1) :
           chr < 0xe000_0000 ? (reinterpret(Char, chr & 0xFFFF_0000), i + 2) :
           chr < 0xf000_0000 ? (reinterpret(Char, chr & 0xFFFF_FF00), i + 3) :
           (reinterpret(Char, chr), i + 4)
end

Try this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's probably about as fast as possible for this, since it keeps the bytes in the same big endian order that both ShortString and Char use, and just does nothing more than a shift and mask

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how reinterpret works here for 4 byte characters
https://discourse.julialang.org/t/how-does-char-get-stored/49366

Copy link
Member

@ScottPJones ScottPJones Nov 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split out just the fast iterate code above, and put it into a separate PR (#30), I hope you don't mind, @oxinabox!
I'm just trying to get things moving along as quickly as possible 😄

@oxinabox oxinabox changed the title WIP: Fix hash (and iterate) WIP: Do iterate directly Oct 30, 2020
return Char(shifted % UInt32 & 0xffff), i+3
else # 4 byte character
return Char(shifted % UInt32), i+3
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Char is incredibly inefficient, going from a UInt to a Char.
Since in this case, the string is already presumed to be UTF8 encoded, then you could look at the indexed byte, and then shift and mask so that you get the 1-4 bytes in a UInt32, and reinterpret it to Char instead, which should be a lot more efficient.

src/base.jl Outdated Show resolved Hide resolved

Base.collect(s::ShortString) = collect(String(s))
function ==(s::ShortString{S}, b::AbstractString) where S
ncodeunits(b) == ncodeunits(s) || return false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is incorrect, the number of code units is dependent on how it is encoded.
I think it would be useful for ShortString to have another parameter, which is the type of string that is being encoded, such as String, or ASCIIStr, UTF8Str, Latin1Str, etc. (or even the old LegacyString types)
Maybe some traits could be used / defined so that these could be compared in the most efficient fashion.
(As an example, an Emoji takes 4 codeunits in String or UTF8Str, 2 codeunits with UTF16Str, and 1 codeunit with UTF32Str)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, this actually shouldn't be in this PR. it got caught up from #16

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

number of code units is dependent on how it is encoded.

this is not something I understand. Is there anything I can read up?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some encoding schemes are byte oriented, such as UTF-8, but UCS2 and UTF-16 use 16-bit codeunits, and UTF-32 uses 32-bit codeunits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UTF-16 is actually a lot more efficient for storing non Western European languages than UTF-8. Most all of the common Chinese, Korean, and Japanese characters only take 2 bytes, instead of 3 bytes in UTF-8, so you'd be able to store 7 characters in a UInt128, instead of just 5.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The largest size in ShortStrings would store up to 63 UTF-16 characters (but only up to 42 of the same Asian characters using UTF-8)

src/base.jl Outdated Show resolved Hide resolved
@ScottPJones
Copy link
Member

If you move your package to the JuliaString org (not the other, Johnny-come-lately one!), you will be able to still work on it as you wish, as well as Lyndon and Rafael Fourquet and myself, and handle updating it, reviewing PRs, etc.

@ScottPJones
Copy link
Member

I have a question - why are the lengths 30, 62, 126, instead of 31, 63, and 127 (and if there were a UInt2048, 255)?
The max size length (in bytes) for any of those is just one byte (2 nibbles).
Also, since characters are only encoded in 1 or more bytes, the length could be just expressed as bytes, unless you want to use an extra nibble to store other information about the short string (which might be useful).

@oxinabox
Copy link
Member Author

oxinabox commented Nov 2, 2020

Good question. Out of scope for this PR, open an issue?

@oxinabox oxinabox changed the title WIP: Do iterate directly Optimized isascii and length Nov 2, 2020
@ScottPJones ScottPJones merged commit 14bf784 into JuliaString:master Nov 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants