Use Base.hash #192

Merged: 6 commits from sp/better-hash into master on Feb 7, 2022
Conversation

@pfitzseb (Member)

This slightly improves performance when tokenizing Base and reduces code complexity:

Tokenize on master [$!+] 
λ julia --project=. test/profile.jl
First run took 0.51364849 seconds with 8.73827 MB allocated
Tokenized 1556 files in 0.8023270789999989 seconds with 923.379736 MB allocated

Tokenize on sp/better-hash [$] 
λ julia --project=. test/profile.jl      
First run took 0.485027563 seconds with 4.93732 MB allocated
Tokenized 1556 files in 0.6423206330000001 seconds with 923.379736 MB allocated
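
For anyone wanting to reproduce these numbers, a harness along these lines does the job. This is a hypothetical sketch, not the actual test/profile.jl; the location of Base's sources and the directory walk are assumptions:

using Tokenize

# Hypothetical profiling harness (not the actual test/profile.jl):
# tokenize every .jl file under a directory and report time and
# allocations via @timed.
function tokenize_all(dir)
    nfiles = 0
    for (root, _, files) in walkdir(dir)
        for f in files
            endswith(f, ".jl") || continue
            nfiles += 1
            src = read(joinpath(root, f), String)
            for _ in Tokenize.tokenize(src)
                # consume the token stream
            end
        end
    end
    return nfiles
end

# Assumed location of Base's sources relative to the running julia.
base_dir = joinpath(Sys.BINDIR, "..", "share", "julia", "base")
stats = @timed tokenize_all(base_dir)
println("Tokenized $(stats.value) files in $(stats.time) seconds ",
        "with $(stats.bytes / 1e6) MB allocated")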

@codecov (bot) commented Nov 17, 2021

Codecov Report

Merging #192 (65f8b74) into master (860bdc6) will increase coverage by 0.12%.
The diff coverage is 96.42%.


@@            Coverage Diff             @@
##           master     #192      +/-   ##
==========================================
+ Coverage   82.80%   82.93%   +0.12%     
==========================================
  Files           4        4              
  Lines         826      832       +6     
==========================================
+ Hits          684      690       +6     
  Misses        142      142              
Impacted Files    Coverage Δ
src/lexer.jl      93.80% <96.42%> (+0.05%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@KristofferC (Member)

Any idea where all the allocations are from?

@pfitzseb (Member, Author) commented Nov 17, 2021

--track-allocation attributes 3.7 MB to

l2 = copy(l)

and 1 MB to

tokenize(x, ::Type{RawToken}) = Lexer(x, RawToken)
Still looking for the rest...
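
For reference, the measurement workflow looks roughly like this; the Coverage.jl step is an assumption about how the per-line .mem files get aggregated, not something stated in this thread:

λ julia --project=. --track-allocation=user test/profile.jl

# Afterwards, in a fresh session, aggregate the per-line .mem files
# that the run wrote next to the sources (assumes Coverage.jl):
using Coverage
mallocs = analyze_malloc("src")                # per-line allocation records
sort!(mallocs, by = m -> m.bytes, rev = true)  # largest allocators first
foreach(println, first(mallocs, 5))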

@KristofferC (Member)


> l2 = copy(l)

looks kinda crazy to me. Why would we need to copy the whole lexer?

@pfitzseb (Member, Author)


Because interpolation resets everything?

Anyways, the remaining allocs were due to me reading the whole file into a string while profiling; without that, I get

λ julia --project=. test/profile.jl
First run took 0.561283966 seconds with 73.583024 MB allocated
Tokenized 1556 files in 3.632539304999996 seconds with 2.368488 MB allocated

@pfitzseb (Member, Author) commented Nov 17, 2021

[screenshot: profiler flame graph]
That's what the profile looks like. So yes, allocating the new IOBuffer() in copy looks expensive, but that's mostly because the GC runs then.

@KristofferC (Member)


> Because interpolation resets everything?

Hmm, well, not everything? You are still reading characters into the same char buffer, so that one seems unnecessary to copy, no?

@pfitzseb (Member, Author)


Right, but crucially we can't reuse l.charstore. The latest commit removes the copy and is faster in the RawToken case, but slightly worse for normal Tokens. I'll leave the decision on what implementation to use up to you :)

@KristofferC (Member)


RawToken seems like what you would use if you want the best performance, so I think it makes sense to optimize for that.

src/lexer.jl (outdated)
(h & (UInt64(0x3ff) << (64 - 10))) > 0 && return UInt64(0xff)
UInt64(h) << 5 + UInt8(c - 'a' + 1)
end
@inline simple_hash(c::Char, h::UInt64) = hash(c, h)
(Member)

simple_hash currently has the property that it's a perfect hash on the subset of keywords, so identifiers can't collide with keywords.

But I guess using Base.hash won't necessarily have this property and #189 could reoccur?

It may be better just to optimize simple_hash a bit and see if all the branches can be removed. Tracking the length of the identifier (in bytes not chars?) as well as this hash would probably make the 0xff special cases unnecessary.
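
For illustration, the perfect-hash claim can be checked offline by folding the hash over every keyword and asserting the results are distinct. A minimal sketch, assuming the pre-change simple_hash(c::Char, h::UInt64) method shown above is in scope; the keyword list here is illustrative, not Tokenize's actual table:

# Illustrative keyword list (abbreviated, not Tokenize's actual table).
const KWS = ["begin", "while", "if", "for", "try", "return", "break",
             "continue", "function", "macro", "quote", "let", "local",
             "global", "const", "do", "struct", "module", "baremodule",
             "using", "import", "export", "end", "else", "elseif",
             "catch", "finally", "true", "false"]

# Fold the per-character hash over a whole identifier.
kw_hash(s) = foldl((h, c) -> simple_hash(c, h), s; init = UInt64(0))

# Perfect on the keyword subset: no two keywords share a hash value.
@assert allunique(kw_hash.(KWS))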

@pfitzseb (Member, Author) commented Nov 22, 2021

That's a fair point. I pushed a commit that should implement a correct perfect hash which is only barely slower than Base.hash. Would be great if you (or someone else) could also try benchmarking this because timings are somewhat unstable on my machine.

@c42f (Member) left a comment

> Would be great if you (or someone else) could also try benchmarking this because timings are somewhat unstable on my machine.

I tried this, but I also found quite a lot (~5%) of variation between runs (on a Linux laptop with Julia 1.6). As far as I could tell, the changes in this PR were performance-neutral relative to that measurement noise.

I also tried a table-based hash like the following, which seemed significantly (~2x) faster in a microbenchmark. But it does cost some memory, and overall performance didn't change beyond the noise.

const _keyword_hash_charvals = let
    x = fill(0x1f, 256)
    x[Int.('a':'z')] .= 1:26
    x
end

@inline function keyword_hash(c::Char, h::UInt64)
    b = Base.first_utf8_byte(c)
    # Saturate top 5 bits to indicate overflow at >= 12 chars.
    (h & 0xf800000000000000) | (h << 5) | _keyword_hash_charvals[b]
end
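
A microbenchmark of the kind mentioned could look like the following with BenchmarkTools; this is a sketch, not the benchmark actually run, and timings will vary by machine:

using BenchmarkTools

# Hypothetical microbenchmark of the per-character hash functions.
@btime keyword_hash(c, h) setup = (c = 'f'; h = UInt64(0x1234))
@btime simple_hash(c, h)  setup = (c = 'f'; h = UInt64(0x1234))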

Co-authored-by: Chris Foster <[email protected]>
@@ -1021,39 +1004,42 @@ function lex_cmd(l::Lexer, doemit=true)
end
end

const MAX_KW_LENGTH = 10
function lex_identifier(l::Lexer{IO_t,T}, c) where {IO_t,T}
if T == Token
readon(l)
end
(Member)

It's not strictly part of this PR but I just noticed the T == Token check here. Why do we have this here when readon(l) already contains the logic for Token vs RawToken?

@KristofferC (Member)


@pfitzseb, is this ready to go?

@pfitzseb (Member, Author) commented Feb 4, 2022

I think so, but it doesn't actually do much :P

@KristofferC (Member)


The benchmark in the first post looks pretty good, no?

@c42f (Member) commented Feb 7, 2022

FWIW I based my copy/fork in JuliaSyntax.Tokenize off this branch. I'd be happy to see it merged.

KristofferC merged commit c7732fc into master on Feb 7, 2022.
KristofferC deleted the sp/better-hash branch on February 7, 2022 at 09:27.