Use Base.hash #192

Merged: 6 commits from sp/better-hash into master on Feb 7, 2022
Conversation

@pfitzseb (Member)

This slightly improves performance when tokenizing Base and reduces code complexity:

Tokenize on master [$!+] 
λ julia --project=. test/profile.jl
First run took 0.51364849 seconds with 8.73827 MB allocated
Tokenized 1556 files in 0.8023270789999989 seconds with 923.379736 MB allocated

Tokenize on sp/better-hash [$] 
λ julia --project=. test/profile.jl      
First run took 0.485027563 seconds with 4.93732 MB allocated
Tokenized 1556 files in 0.6423206330000001 seconds with 923.379736 MB allocated
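
For anyone wanting to reproduce these numbers, a harness along these lines does the job. This is a hypothetical sketch, not the actual test/profile.jl; the location of Base's sources and the directory walk are assumptions:

using Tokenize

# Hypothetical profiling harness (not the actual test/profile.jl):
# tokenize every .jl file under a directory and report time and
# allocations via @timed.
function tokenize_all(dir)
    nfiles = 0
    for (root, _, files) in walkdir(dir)
        for f in files
            endswith(f, ".jl") || continue
            nfiles += 1
            src = read(joinpath(root, f), String)
            for _ in Tokenize.tokenize(src)
                # consume the token stream
            end
        end
    end
    return nfiles
end

# Assumed location of Base's sources relative to the running julia.
base_dir = joinpath(Sys.BINDIR, "..", "share", "julia", "base")
stats = @timed tokenize_all(base_dir)
println("Tokenized $(stats.value) files in $(stats.time) seconds ",
        "with $(stats.bytes / 1e6) MB allocated")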

@codecov (bot) commented Nov 17, 2021

Codecov Report

Merging #192 (65f8b74) into master (860bdc6) will increase coverage by 0.12%.
The diff coverage is 96.42%.


@@            Coverage Diff             @@
##           master     #192      +/-   ##
==========================================
+ Coverage   82.80%   82.93%   +0.12%     
==========================================
  Files           4        4              
  Lines         826      832       +6     
==========================================
+ Hits          684      690       +6     
  Misses        142      142              
Impacted Files    Coverage Δ
src/lexer.jl      93.80% <96.42%> (+0.05%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@KristofferC (Member)

Any idea where all the allocations are from?

@pfitzseb (Member, Author) commented Nov 17, 2021

--track-allocation attributes 3.7 MB to

l2 = copy(l)

and 1 MB to

tokenize(x, ::Type{RawToken}) = Lexer(x, RawToken)
Still looking for the rest...
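
For reference, the measurement workflow looks roughly like this; the Coverage.jl step is an assumption about how the per-line .mem files get aggregated, not something stated in this thread:

λ julia --project=. --track-allocation=user test/profile.jl

# Afterwards, in a fresh session, aggregate the per-line .mem files
# that the run wrote next to the sources (assumes Coverage.jl):
using Coverage
mallocs = analyze_malloc("src")                # per-line allocation records
sort!(mallocs, by = m -> m.bytes, rev = true)  # largest allocators first
foreach(println, first(mallocs, 5))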

@KristofferC (Member)


> l2 = copy(l)

looks kinda crazy to me. Why would we need to copy the whole lexer?

@pfitzseb (Member, Author)


Because interpolation resets everything?

Anyways, the remaining allocs were due to me reading the whole file into a string while profiling; without that, I get

λ julia --project=. test/profile.jl
First run took 0.561283966 seconds with 73.583024 MB allocated
Tokenized 1556 files in 3.632539304999996 seconds with 2.368488 MB allocated

@pfitzseb (Member, Author) commented Nov 17, 2021

[screenshot: profiler flame graph]
That's what the profile looks like. So yes, allocating the new IOBuffer() in copy looks expensive, but that's mostly because the GC runs then.

@KristofferC (Member)


> Because interpolation resets everything?

Hmm, well, not everything? You are still reading characters into the same char buffer, so that one seems unnecessary to copy, no?

@pfitzseb (Member, Author)


Right, but crucially we can't reuse l.charstore. The latest commit removes the copy and is faster in the RawToken case, but slightly worse for normal Tokens. I'll leave the decision on what implementation to use up to you :)

@KristofferC (Member)


RawToken seems like what you would use if you want the best performance, so I think it makes sense to optimize for that.

src/lexer.jl (outdated)
(h & (UInt64(0x3ff) << (64 - 10))) > 0 && return UInt64(0xff)
UInt64(h) << 5 + UInt8(c - 'a' + 1)
end
@inline simple_hash(c::Char, h::UInt64) = hash(c, h)
(Member)

simple_hash currently has the property that it's a perfect hash on the subset of keywords, so identifiers can't collide with keywords.

But I guess using Base.hash won't necessarily have this property and #189 could reoccur?

It may be better just to optimize simple_hash a bit and see if all the branches can be removed. Tracking the length of the identifier (in bytes not chars?) as well as this hash would probably make the 0xff special cases unnecessary.
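
For illustration, the perfect-hash claim can be checked offline by folding the hash over every keyword and asserting the results are distinct. A minimal sketch, assuming the pre-change simple_hash(c::Char, h::UInt64) method shown above is in scope; the keyword list here is illustrative, not Tokenize's actual table:

# Illustrative keyword list (abbreviated, not Tokenize's actual table).
const KWS = ["begin", "while", "if", "for", "try", "return", "break",
             "continue", "function", "macro", "quote", "let", "local",
             "global", "const", "do", "struct", "module", "baremodule",
             "using", "import", "export", "end", "else", "elseif",
             "catch", "finally", "true", "false"]

# Fold the per-character hash over a whole identifier.
kw_hash(s) = foldl((h, c) -> simple_hash(c, h), s; init = UInt64(0))

# Perfect on the keyword subset: no two keywords share a hash value.
@assert allunique(kw_hash.(KWS))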

@pfitzseb (Member, Author) commented Nov 22, 2021

That's a fair point. I pushed a commit that should implement a correct perfect hash which is only barely slower than Base.hash. Would be great if you (or someone else) could also try benchmarking this because timings are somewhat unstable on my machine.

@c42f (Member) left a comment

> Would be great if you (or someone else) could also try benchmarking this because timings are somewhat unstable on my machine.

I tried this, but I also found quite a lot (~5%) of variation between runs (on a Linux laptop with Julia 1.6). As far as I could tell, the changes in this PR were performance-neutral relative to that measurement noise.

I also tried a table-based hash like the following, which seemed significantly (~2x) faster in a microbenchmark. But it does cost some memory, and overall performance didn't change beyond the noise.

const _keyword_hash_charvals = let
    x = fill(0x1f, 256)
    x[Int.('a':'z')] .= 1:26
    x
end

@inline function keyword_hash(c::Char, h::UInt64)
    b = Base.first_utf8_byte(c)
    # Saturate top 5 bits to indicate overflow at >= 12 chars.
    (h & 0xf800000000000000) | (h << 5) | _keyword_hash_charvals[b]
end
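
A microbenchmark of the kind mentioned could look like the following with BenchmarkTools; this is a sketch, not the benchmark actually run, and timings will vary by machine:

using BenchmarkTools

# Hypothetical microbenchmark of the per-character hash functions.
@btime keyword_hash(c, h) setup = (c = 'f'; h = UInt64(0x1234))
@btime simple_hash(c, h)  setup = (c = 'f'; h = UInt64(0x1234))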

Co-authored-by: Chris Foster <[email protected]>
@@ -1021,39 +1004,42 @@ function lex_cmd(l::Lexer, doemit=true)
end
end

const MAX_KW_LENGTH = 10
function lex_identifier(l::Lexer{IO_t,T}, c) where {IO_t,T}
if T == Token
readon(l)
end
(Member)

It's not strictly part of this PR but I just noticed the T == Token check here. Why do we have this here when readon(l) already contains the logic for Token vs RawToken?

@KristofferC (Member)


@pfitzseb, is this ready to go?

@pfitzseb (Member, Author) commented Feb 4, 2022

I think so, but it doesn't actually do much :P

@KristofferC (Member)


The benchmark in the first post looks pretty good, no?

@c42f (Member) commented Feb 7, 2022

FWIW I based my copy/fork in JuliaSyntax.Tokenize off this branch. I'd be happy to see it merged.

KristofferC merged commit c7732fc into master on Feb 7, 2022.
KristofferC deleted the sp/better-hash branch on February 7, 2022 at 09:27.