Use Base.hash #192
Conversation
Codecov Report
@@ Coverage Diff @@
## master #192 +/- ##
==========================================
+ Coverage 82.80% 82.93% +0.12%
==========================================
Files 4 4
Lines 826 832 +6
==========================================
+ Hits 684 690 +6
Misses 142 142
Any idea where all the allocations are from?
Line 853 in bed7f32 looks kinda crazy to me. Why would we need to copy the whole lexer?
Because interpolation resets everything? Anyway, the remaining allocs were due to me reading the whole file into a string while profiling; without that, I get
Hmm, well not everything? You are still reading characters into the same char buffer, so that one seems unnecessary to copy, or?
Right, but crucially we can't reuse
RawToken seems like what you would use if you want the best performance, so I think it makes sense to optimize for that.
src/lexer.jl
    (h & (UInt64(0x3ff) << (64 - 10))) > 0 && return UInt64(0xff)
    UInt64(h) << 5 + UInt8(c - 'a' + 1)
end
@inline simple_hash(c::Char, h::UInt64) = hash(c, h)
simple_hash currently has the property that it's a perfect hash on the subset of keywords, so identifiers can't collide with keywords. But I guess using Base.hash won't necessarily have this property, and #189 could reoccur?

It may be better just to optimize simple_hash a bit and see if all the branches can be removed. Tracking the length of the identifier (in bytes, not chars?) as well as this hash would probably make the 0xff special cases unnecessary.
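To illustrate the length-tracking idea, here is a hypothetical sketch (the names are made up for illustration, not code from this PR):

# Hash each character without the overflow branch, then fold the
# identifier's byte length into the final key. No Julia keyword is
# longer than 10 bytes, so longer identifiers can never alias a
# keyword and the 0xff sentinel becomes unnecessary.
@inline hash_step(h::UInt64, c::Char) = h << 5 + (UInt64(c) & 0x1f)

@inline finish_kw_hash(h::UInt64, nbytes::Int) =
    nbytes > 10 ? typemax(UInt64) : h ⊻ (UInt64(nbytes) << 54)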
That's a fair point. I pushed a commit that should implement a correct perfect hash which is only barely slower than Base.hash. Would be great if you (or someone else) could also try benchmarking this because timings are somewhat unstable on my machine.
> Would be great if you (or someone else) could also try benchmarking this because timings are somewhat unstable on my machine.
I tried this, but I also found quite a lot of variation (~5%) between runs (on a Linux laptop with Julia 1.6). As far as I could tell, the changes in this PR were performance-neutral relative to that measurement noise.

I also tried a table-based hash like the following, which seemed significantly (~2x) faster in a microbenchmark. But it does cost some memory and didn't change the overall performance beyond the noise.
const _keyword_hash_charvals = let
    x = fill(0x1f, 256)
    x[Int.('a':'z')] .= 1:26
    x
end

@inline function keyword_hash(c::Char, h::UInt64)
    b = Base.first_utf8_byte(c)
    # Saturate top 5 bits to indicate overflow at >= 12 chars.
    (h & 0xf800000000000000) | (h << 5) | _keyword_hash_charvals[b]
end
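For reference, a microbenchmark along these lines could compare the two variants (a sketch assuming BenchmarkTools and the keyword_hash definition above; not the exact harness used in this thread):

using BenchmarkTools

# Fold a representative keyword-length identifier through both hashes.
word = "function"
@btime foldl((h, c) -> keyword_hash(c, h), $word; init = UInt64(0))
@btime foldl((h, c) -> hash(c, h), $word; init = UInt64(0))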
Co-authored-by: Chris Foster <[email protected]>
@@ -1021,39 +1004,42 @@ function lex_cmd(l::Lexer, doemit=true)
    end
end

const MAX_KW_LENGTH = 10
function lex_identifier(l::Lexer{IO_t,T}, c) where {IO_t,T}
    if T == Token
        readon(l)
    end
It's not strictly part of this PR, but I just noticed the T == Token check here. Why do we have this here when readon(l) already contains the logic for Token vs RawToken?
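For illustration, the pattern being referred to is dispatch on the lexer's token type parameter, roughly like the following hypothetical methods (not necessarily Tokenize's actual definitions):

# Hypothetical sketch: if readon is already a no-op for RawToken lexers
# via dispatch, the call-site `T == Token` guard is redundant.
readon(l::Lexer{IO_t,Token}) where {IO_t} = record_chars!(l)   # hypothetical helper
readon(l::Lexer{IO_t,RawToken}) where {IO_t} = nothing         # no-op for raw tokens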
@pfitzseb, is this ready to go?
I think so, but it doesn't actually do much :P
The benchmark in the first post looks pretty good, or?
FWIW I based my copy/fork in
This improves performance a bit when tokenizing Base and reduces code complexity: