
Improve performance of Dict{K,V} (~5%) by storing elements in pairs::Vector{Pair{K,V}} #44332

Closed (wanted to merge 24 commits)

Conversation


@petvana petvana commented Feb 24, 2022

Updated on March 28, 2022: when testing on multiple sizes, the difference is less significant, and in some cases zero.

I have noticed that Dict performance can be improved by storing keys and values together in a single vector of pairs. It can provide up to about 5% performance improvement for large dictionaries because it limits the number of random accesses to the memory. This PR is a kind of proof-of-concept. There is no change to the algorithm. Do you think it's worth it?

Although the PR is considered non-breaking, it may break code that relies on the internal representation of Dict.

Master:

mutable struct Dict{K,V} <: AbstractDict{K,V}
    # Metadata: empty => 0x00, removed => 0x7f, full => 0b1[7 most significant hash bits]
    slots::Vector{UInt8}
    keys::Array{K,1}
    vals::Array{V,1}
   ...

PR:

mutable struct Dict{K,V} <: AbstractDict{K,V}
    # Metadata: empty => 0x00, removed => 0x7f, full => 0b1[7 most significant hash bits]
    slots::Vector{UInt8}
    pairs::Vector{Pair{K,V}} # stored pairs (key::K => value::V)
   ...
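The intended cache effect can be sketched with a stripped-down lookup (hypothetical `MiniDict` and `lookup` names, not code from this PR): with a single `pairs` vector, a successful probe reaches both the key and the value through one indexed load, typically on the same cache line, instead of two independent loads from `keys` and `vals`.

```julia
struct MiniDict{K,V}
    slots::Vector{UInt8}
    pairs::Vector{Pair{K,V}}
end

function lookup(d::MiniDict, key)
    i = mod(hash(key), length(d.slots)) + 1
    # The real Dict probes further slots on collision; this sketch checks only one.
    d.slots[i] == 0x00 && return nothing
    p = d.pairs[i]  # a single load fetches both key and value
    return isequal(p.first, key) ? p.second : nothing
end
```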

I've measured the total elapsed time and total allocated memory over large dictionaries with various sizes.

sizes = (1:10) * 1_000_000

Master:

| # | Type | SET | GET | GET!empty | GET!full | ITERATE | MEM_GB |
|--:|------|----:|----:|----------:|---------:|--------:|-------:|
| 1 | Dict{Int64, Int64} | 3.967 | 2.069 | 0.692 | 0.526 | 0.059 | 15.408 |
| 2 | Dict{Int64, Int8} | 3.605 | 2.149 | 0.639 | 0.516 | 0.06 | 12.418 |
| 3 | Dict{Any, Int64} | 15.64 | 2.96 | 2.337 | 0.728 | 0.049 | 23.826 |
| 4 | Dict{Int64, Any} | 5.452 | 7.522 | 1.396 | 1.275 | 0.896 | 21.95 |
| 5 | Dict{Any, Any} | 22.452 | 7.894 | 6.56 | 1.367 | 0.949 | 30.379 |
| 6 | Dict{String, Int64} | 12.48 | 4.445 | 2.301 | 1.14 | 0.058 | 20.308 |

Total time 112.185 s

PR:

| # | Type | SET | GET | GET!empty | GET!full | ITERATE | MEM_GB |
|--:|------|----:|----:|----------:|---------:|--------:|-------:|
| 1 | Dict{Int64, Int64} | 3.531 | 1.823 | 0.618 | 0.499 | 0.054 | 15.409 |
| 2 | Dict{Int64, Int8} | 3.405 | 1.842 | 0.613 | 0.525 | 0.054 | 15.402 |
| 3 | Dict{Any, Int64} | 15.866 | 2.935 | 2.39 | 0.612 | 0.063 | 23.827 |
| 4 | Dict{Int64, Any} | 5.71 | 7.009 | 1.594 | 1.133 | 0.938 | 21.95 |
| 5 | Dict{Any, Any} | 24.558 | 6.218 | 6.831 | 1.261 | 0.916 | 30.379 |
| 6 | Dict{String, Int64} | 12.524 | 3.858 | 2.008 | 1.069 | 0.065 | 20.308 |

Total time 110.522 s
Testing code
module TestDict

using Printf
using Random
#using DataStructures
using DataFrames

const sizes = (1:10) * 1_000_000

function test_set(dict, x)
    xn = length(x)
    #sizehint!(dict, xn)
    for i in 1:xn
        dict[x[i]] = i
    end
end
test_get(dict, x) = sum(dict[x[i]] for i = 1:length(x))
test_get!(dict, x) = sum(get!(dict, x[i], i) for i = 1:length(x))
test_iterate(dict) = sum(v for v = values(dict))

function test_set(dict::Dict{K,Int8}, x) where K
    xn = length(x)
    #sizehint!(dict, xn)
    for i in 1:xn
        dict[x[i]] = i>100
    end
end
test_get!(dict::Dict{K,Int8}, x) where K = sum(get!(dict, x[i], i>100) for i = 1:length(x))
test_iterate(dict::Dict{K,Int8}) where K = sum(v for v = values(dict))

for D in [Dict]
    df = DataFrame()
    for (A,B) in [(Int, Int), (Int, Int8), (Any, Int), (Int, Any), (Any, Any), (String, Int)]
        time_set = time_get = time_get!_empty = time_get!_full = time_iterate = 0.0
        mem = @allocated for n in sizes
            Random.seed!(42)
            if A == String
                keys = [randstring() for i = 1:n]
            else
                keys = rand(A == Any ? Int : A, n)
            end
            keys = unique(keys)
            if B == Int8
                correct_sum = sum([1:length(keys)...] .> 100)
            else
                correct_sum = sum([1:length(keys)...])
            end

            #precompile
            dict = D{A, B}()
            test_get!(dict, keys[1:100])
            test_set(dict, keys[1:200])
            test_get(dict, keys[1:200])
            test_iterate(dict)

            dict = D{A, B}()
            time_set += @elapsed test_set(dict, keys)
            time_get += @elapsed getsum = test_get(dict, keys)
            @assert getsum == correct_sum

            dict = D{A, B}()
            time_get!_empty += @elapsed getsum = test_get!(dict, keys)  # += (was =) so times accumulate across sizes
            @assert getsum == correct_sum
            time_get!_full += @elapsed getsum = test_get!(dict, keys)
            @assert getsum == correct_sum

            test_iterate(dict)  # warm-up run before timing
            time_iterate += @elapsed getsum = test_iterate(dict)
            @assert getsum == correct_sum
        end

        new_data = ( 
            Type = D{A,B}, 
            SET = time_set, 
            GET = time_get, 
            GET!empty = time_get!_empty,
            GET!full = time_get!_full,
            ITERATE = time_iterate,
            MEM_GB = mem / 1024.0^3
        )
        println(new_data)
        push!(df, new_data)
    end
    total = sum(sum(x) for x in eachcol(df[:,2:end-1]))
    df[:,2:end] = round.(df[:,2:end]; digits = 3)
    show(stdout, MIME("text/plain"), df)
    println("\n")
    show(stdout, MIME("text/html"), df; eltypes = false, summary = false)
    println("\n")
    @printf "Total time %.3f s\n\n\n" total
end

end

@oscardssmith
Member

This looks great! I think #38145 is probably the direction we want to move in the longer term, but free performance improvements are always great!

@KristofferC
Sponsor Member

KristofferC commented Feb 24, 2022

What's the actual benchmark that you ran? Did you measure iterating over values and keys for example?


@JeffBezanson
Sponsor Member

This is probably a good idea, since we can now store Pairs in-line in all cases (before they might have been heap-allocated, which would surely make performance worse!).

@petvana
Member Author

petvana commented Feb 24, 2022

The remaining test failure concerns precompilation only.

However, I've found a very significant performance drop when using Any as the type for storing Int64 keys or values; the produced code is less optimized.
(Solved by using explicit types for the pairs; results updated.)

Master

typeof(dict) = Dict{Int64, Any}
SET 1.331 s, GET 1.040 s, GET! 1.153 s, ITER. KEYS 0.080 s, ITER. VALS 0.525 s
typeof(dict) = Dict{Any, Int64}
SET 2.706 s, GET 1.306 s, GET! 5.691 s, ITER. KEYS 0.536 s, ITER. VALS 0.080 s
typeof(dict) = Dict{Any, Any}
SET 2.557 s, GET 1.922 s, GET! 7.591 s, ITER. KEYS 0.533 s, ITER. VALS 0.558 s
Total time 27.608 s

PR

typeof(dict) = Dict{Int64, Any}
SET 0.976 s, GET 0.927 s, GET! 1.897 s, ITER. KEYS 0.086 s, ITER. VALS 0.533 s
typeof(dict) = Dict{Any, Int64}
SET 1.837 s, GET 2.029 s, GET! 5.541 s, ITER. KEYS 0.551 s, ITER. VALS 0.089 s
typeof(dict) = Dict{Any, Any}
SET 2.941 s, GET 1.988 s, GET! 6.506 s, ITER. KEYS 0.591 s, ITER. VALS 0.577 s
Total time 27.070 s

@JeffBezanson
Sponsor Member

Very interesting; we should look into that.

@petvana
Member Author

petvana commented Feb 24, 2022

The problem seems to be caused by extra allocations due to the @nospecialize macro in julia/base/dict.jl, lines 386 to 387 (at 8306858):

function setindex!(h::Dict{K,Any}, v, key::K) where K
    @nospecialize v

@vtjnash
Sponsor Member

vtjnash commented Feb 24, 2022

Seems like someone accidentally wrote k=>v later in that function, which had the wrong type

@petvana
Member Author

petvana commented Feb 25, 2022

Seems like someone accidentally wrote k=>v later in that function, which had the wrong type

Thank you, fixed. I naively assumed this would be optimized out automatically.
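The pitfall can be illustrated with a small hypothetical snippet (not the actual dict.jl code): when the value type is Any, the literal `k => v` first builds a Pair specialized on `typeof(v)`, which then has to be converted when stored into a `Vector{Pair{K,Any}}`, costing an extra allocation; constructing the Pair with the explicit type avoids that.

```julia
v = 1
typeof(2 => v)               # Pair{Int64, Int64} -- specialized on typeof(v)
typeof(Pair{Int,Any}(2, v))  # Pair{Int64, Any}  -- matches the vector eltype directly

pairs = Vector{Pair{Int,Any}}(undef, 1)
pairs[1] = Pair{Int,Any}(2, v)  # stored without an intermediate conversion
```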

@petvana petvana changed the title WIP: Improve performance of 'Dict' by about 5% Improve performance of Dict{K,V} (~5%) by storing elements in pairs::Vector{Pair{K,V}} Feb 25, 2022
@petvana petvana marked this pull request as ready for review February 25, 2022 17:42
@simeonschaub
Member

Can you post the updated benchmarks with concrete as well as abstract types?

@petvana
Member Author

petvana commented Feb 25, 2022

Can you post the updated benchmarks with concrete as well as abstract types?

All results should be updated now.

@simeonschaub simeonschaub (Member) left a comment
This is a great PR! I have a few comments and I think it would be good to have another review from someone more familiar with dict internals than me, but overall I think this is a very nice improvement.

 function getindex(h::Dict{K,V}, key) where V where K
     index = ht_keyindex(h, key)
-    @inbounds return (index < 0) ? throw(KeyError(key)) : h.vals[index]::V
+    @inbounds return (index < 0) ? throw(KeyError(key)) : h.pairs[index].second::V
I wonder why all these type annotations here were added in the first place. They probably don't hurt, but I also don't see why they'd be needed.

@vtjnash
Sponsor Member

vtjnash commented Mar 8, 2022

@petvana
Member Author

petvana commented Mar 11, 2022

I'm closing this in favor of #44513.

@petvana petvana closed this Mar 11, 2022
@petvana
Member Author

petvana commented Mar 24, 2022

Reopening to run CI. I'll update benchmarks once I have some time.

@petvana petvana reopened this Mar 24, 2022
@petvana petvana marked this pull request as draft March 24, 2022 19:27
@petvana
Member Author

petvana commented Mar 28, 2022

I've updated the evaluation. The speed-up is still about 5% for concrete types, but none for abstract types. Furthermore, Julia currently doesn't support packed struct layouts (for Pair or any other struct) the way C does, so Dict{Int64, Int8} uses more memory. I'm therefore not sure the PR is worth it at this stage.
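The padding overhead behind that memory regression can be checked directly (a small sketch; the sizes assume a 64-bit platform):

```julia
# A Pair{Int64,Int8} is padded to 16 bytes for Int64 alignment, while the
# old layout with separate keys/vals vectors stores 8 + 1 = 9 bytes per entry.
@assert isbitstype(Pair{Int64,Int8})
@assert sizeof(Pair{Int64,Int8}) == 16
@assert sizeof(Int64) + sizeof(Int8) == 9
```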

@oscardssmith
Member

One thing that might be worth trying is storing 8 keys followed by 8 values; that would at least fix the alignment issue.

@petvana
Member Author

petvana commented Nov 23, 2022

I've updated the PR against master, since I found it beneficial for small Sets and Dicts: it removes one allocation (the separate keys and vals vectors become a single pairs vector).

julia> @btime Set(x) setup=(x=rand()); # PR
  62.848 ns (3 allocations: 336 bytes)

julia> @btime Set(x) setup=(x=rand()); # master
  87.330 ns (4 allocations: 400 bytes)



julia> @btime Dict(x => x) setup=(x=rand()); # PR
  64.738 ns (3 allocations: 480 bytes)

julia> @btime Dict(x => x) setup=(x=rand()); # master
  90.961 ns (4 allocations: 544 bytes)

julia> @btime Base.ImmutableDict(x => x) setup=(x=rand()); # only as a ground truth
  10.300 ns (2 allocations: 64 bytes)

@petvana
Member Author

petvana commented Nov 24, 2022

@nanosoldier runtests(ALL, vs = ":master")

@nanosoldier
Collaborator

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.


@KristofferC
Sponsor Member

(you can leave out the vs argument if running against master)

@nanosoldier
Collaborator

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@petvana
Member Author

petvana commented Nov 28, 2022

@nanosoldier runtests(["ZChop", "MusicManipulations", "Orthography", "SerialDependence", "TextAnalysis", "Fairness", "ClimateSatellite", "SparseArrayKit", "Hecke", "GeoLearning", "WeakValueDicts", "Discreet", "SUNRepresentations", "EqualitySampler", "Tries", "EMIRT", "Hygese", "DeepDish", "PopGenCore", "LicenseGrabber", "ScrapeSEC", "BangBang", "EmbeddingsAnalysis", "DetectionTheory", "ChemometricsData", "MetidaNCA", "AlgebraicRelations", "EBayes", "HDF5Utils", "Mex", "TMLE", "CategoricalDistributions", "MLJFlux", "EPOCHInput", "LITS", "FastaLoader", "ClimateTasks", "FatDatasets", "MLJLinearModels", "SparseMatrixDicts", "TrueSkillThroughTime", "AsterReader", "MLJTestInterface", "Pidfile", "OneRule", "ClassicalCiphers", "StatsBase", "EvoTrees", "TextClassification", "GigaSOM", "MLJ", "Graph500", "Config", "MosimoBase", "FunctionalBallDropping", "DelayDiffEq", "CorrectMatch", "CachedFunctions", "StringDistances", "Bobby", "PreprocessMD", "HDF5Logger", "ConformalPrediction", "MLJEnsembles", "SDFResults", "DevIL", "CitableParserBuilder", "GeoStatsBase", "MLLabelUtils", "PoreMatMod", "LanguageFinder", "CitableCorpusAnalysis", "MLJTuning", "MLJModels", "MLJAbstractGPsGlue", "MLJClusteringInterface", "Nonconvex", "AIBECS", "LightPropagation", "MajoranaReps", "FreqTables", "TopologyPreprocessing", "PolaronMobility", "CSVReader", "Evolutionary", "SortedIteratorProducts", "PkgDependency", "GridUtilities", "QXTns", "MLJTestIntegration", "BinomialSynapses", "LazyGroupBy"])

@nanosoldier
Collaborator

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@petvana
Member Author

petvana commented Nov 29, 2022

One thing that might be worth trying is storing 8 keys followed by 8 values; that would at least fix the alignment issue.

@oscardssmith Nice idea, but I'm not sure how to implement that in pure Julia. The closest approach I can imagine is using NTuple, but tuples are immutable; other types would lead to unnecessary allocations.

julia> v = Pair{NTuple{8,Int64}, NTuple{8,Int8}}[]
Pair{NTuple{8, Int64}, NTuple{8, Int8}}[]

julia> resize!(v, 4)
4-element Vector{Pair{NTuple{8, Int64}, NTuple{8, Int8}}}:
 (0, 0, 0, 0, 0, 0, 0, 0) => (0, 0, 0, 0, 0, 0, 0, 0)
 (0, 0, 0, 0, 0, 0, 0, 0) => (0, 0, 0, 0, 0, 0, 0, 0)
 (0, 0, 0, 0, 0, 0, 0, 0) => (0, 0, 0, 0, 0, 0, 0, 0)
 (0, 0, 0, 0, 0, 0, 0, 0) => (0, 0, 0, 0, 0, 0, 0, 0)

julia> isbitstype(Pair{NTuple{8,Int64}, NTuple{8,Int8}})
true
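One possible workaround for the immutability (a sketch only; whether it beats the padded-Pair layout is untested) is to rebuild the whole 8-wide block when writing a single entry, using `Base.setindex`, which returns a copy of a tuple with one element replaced:

```julia
v = Pair{NTuple{8,Int64}, NTuple{8,Int8}}[]
resize!(v, 4)

blk = v[1]
newkeys = Base.setindex(blk.first, 42, 3)  # copy of the key block with the 3rd key set to 42
v[1] = Pair(newkeys, blk.second)           # overwrite the whole block in place
@assert v[1].first[3] == 42
```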

@oscardssmith
Member

Good point. @JeffBezanson, this is another great example of why we should have a simple buffer type.

@fingolfin
Contributor

This PR has been approved but was never merged; it has various conflicts now. Also apparently the performance benefit in the current version is minimal to non-existent.

Thus I think it is OK to close this. Feel free to re-open should I be mistaken, or just submit a new PR with a pointer to this one (I believe this will increase its chance of being "seen" by reviewers).
