
precompile interacts badly with Const-specialization #38983

Closed
timholy opened this issue Dec 24, 2020 · 15 comments · Fixed by #42125
Labels: compiler:latency (Compiler latency), compiler:precompilation (Precompilation of modules)

Comments


timholy commented Dec 24, 2020

(I am not 100% certain this is the right title for this issue, but it seems likely to be the cause.)

TL;DR
In reference to #30488 (comment), this illustrates at least one (and probably more than one) problem with precompilation. I think the most fundamental issue is that our backedges only record MethodInstances and not optimizations due to constant-propagation/specialization; when you go to use the code, Julia seems to decide that it needs to re-optimize the code and thus runs a fresh round of inference on it. A potential solution is to incorporate Const-data into our backedges, and cache all specializations. (There are potential cache-blowup issues to worry about, of course.) An alternative is to just use the partially-specialized code and not run inference again.

It's worth noting that the tiny function in this demo accounts for the majority of remaining (aka, unfixed) inference time in JuliaImages/ImageFiltering.jl#201, despite being outweighed by at least 100x in terms of the total amount of code that needs to be inferred to run imfilter. Another example I've encountered occurs in JuliaInterpreter.step_expr!, where I suspect istoplevel::Bool suffers from the same problem due to well-intentioned but very unhelpful Const-specialization. (I'm finding this issue to be a bigger problem with LoopVectorization, where it crops up for dozens of methods and is the main source of currently-unfixable "inference triggers.") Overall, this is a sign of just how well precompile works in many circumstances. The failures illustrated here are serious but specific.

Demo:

module PCConst

# Adapted from ImageFiltering
defaultlen(σ) = 4*ceil(Int,σ)+1

function gaussian(σ::Real, l::Int=defaultlen(σ))
    isodd(l) || throw(ArgumentError("length must be odd"))
    w = l>>1
    g = iszero(σ) ? [exp(σ/(2*oneunit(σ)^2))] : [exp(-x^2/(2*σ^2)) for x=-w:w]
    return g/sum(g)
end

getgaussian(x::AbstractVector) = gaussian(length(x)/10)

# An alternative implementation that, depending on what precompile directives you issue,
# has different inference triggers for the same outcome.
function hiddengaussian(@nospecialize(x::Vector))
    len = length(x)
    σ = len/10
    l = defaultlen(σ)
    isodd(l) || throw(ArgumentError("length must be odd"))
    y = exp(σ)    # placing this outside the generator allows the backedge
    w = l>>1
    g = iszero(σ) ? [exp(σ/(2*oneunit(σ)^2))] : [exp(-x^2/(2*σ^2)) for x=-w:w]
    return g/sum(g)
end

gethiddengaussian(x::AbstractVector) = hiddengaussian(x)

if ccall(:jl_generating_output, Cint, ()) == 1
    @assert precompile(getgaussian, (Vector{Int},))
    @assert precompile(gethiddengaussian, (Vector{Int},))
    # @assert precompile(hiddengaussian, (Vector,))   # uncommenting this makes hiddengaussian behave like gaussian
end

end # module

Session (requires the master branch of SnoopCompile):

julia> using PCConst
[ Info: Precompiling PCConst [6318dccd-da5d-4af4-8660-b22ceef7e1b0]

julia> using MethodAnalysis

julia> # First let's check backedges and existing MethodInstances:
       mis = methodinstances(PCConst.gaussian)
2-element Vector{Core.MethodInstance}:
 MethodInstance for gaussian(::Float64)
 MethodInstance for gaussian(::Float64, ::Int64)

julia> mi = mis[1]
MethodInstance for gaussian(::Float64)

julia> mi.backedges
1-element Vector{Any}:
 MethodInstance for getgaussian(::Vector{Int64})

julia> mi = mis[2]
MethodInstance for gaussian(::Float64, ::Int64)

julia> mi.backedges
1-element Vector{Any}:
 MethodInstance for gaussian(::Float64)

julia> mi = methodinstance(exp, (Float64,))   # what about calls to Base code?
MethodInstance for exp(::Float64)

julia> mi.backedges                           # this appears to have the necessary backedges...but in this case that will prove illusory
4-element Vector{Any}:
 MethodInstance for hiddengaussian(::Vector{Int64})
 MethodInstance for gaussian(::Float64, ::Int64)
 MethodInstance for (::PCConst.var"#1#2"{Float64})(::Int64)
 MethodInstance for (::PCConst.var"#3#4"{Float64})(::Int64)

julia> using SnoopCompile

julia> tinf = @snoopi_deep PCConst.getgaussian([1,2,3,4,5,6])
Core.Compiler.Timings.Timing(InferenceFrameInfo for Core.Compiler.Timings.ROOT()) with 3 children

julia> using AbstractTrees

julia> print_tree(tinf)
59.496429ms: InferenceFrameInfo for Core.Compiler.Timings.ROOT()
├─ 0.883058ms: InferenceFrameInfo for collect(::Base.Generator{UnitRange{Int64}, PCConst.var"#1#2"{Float64}})
│  ├─ 0.077179ms: InferenceFrameInfo for getproperty(::Base.Generator{UnitRange{Int64}, PCConst.var"#1#2"{Float64}}, ::Symbol)
│  ├─ 0.03477ms: InferenceFrameInfo for getproperty(::Base.Generator{UnitRange{Int64}, PCConst.var"#1#2"{Float64}}, iter::Symbol)
│  ├─ 0.028224ms: InferenceFrameInfo for getproperty(::Base.Generator{UnitRange{Int64}, PCConst.var"#1#2"{Float64}}, f::Symbol)
│  ├─ 0.027447ms: InferenceFrameInfo for getproperty(Core.Compiler::Module, return_type::Symbol)
│  ├─ 0.276433ms: InferenceFrameInfo for first(::Base.Generator{UnitRange{Int64}, PCConst.var"#1#2"{Float64}})
│  │  ├─ 0.393247ms: InferenceFrameInfo for iterate(::Base.Generator{UnitRange{Int64}, PCConst.var"#1#2"{Float64}})
│  │  │  ├─ 0.032031ms: InferenceFrameInfo for getindex(::Tuple{Int64, Int64}, 1::Int64)
│  │  │  └─ 0.027917ms: InferenceFrameInfo for getindex(::Tuple{Int64, Int64}, 2::Int64)
│  │  └─ 0.039028ms: InferenceFrameInfo for getindex(::Tuple{Float64, Int64}, 1::Int64)
│  ├─ 0.05782ms: InferenceFrameInfo for Base.indexed_iterate(::Tuple{Float64, Int64}, 1::Int64)
│  │  └─ 0.051712ms: InferenceFrameInfo for Base.indexed_iterate(::Tuple{Float64, Int64}, 1::Int64, 1::Int64)
│  │     └─ 0.034622ms: InferenceFrameInfo for +(1::Int64, 1::Int64)
│  ├─ 0.05079ms: InferenceFrameInfo for Base.indexed_iterate(::Tuple{Float64, Int64}, 2::Int64, 2::Int64)
│  │  └─ 0.025358ms: InferenceFrameInfo for +(2::Int64, 1::Int64)
│  └─ 0.223613ms: InferenceFrameInfo for Base.collect_to_with_first!(::Vector{Float64}, ::Float64, ::Base.Generator{UnitRange{Int64}, PCConst.var"#1#2"{Float64}}, ::Int64)
│     ├─ 0.09384ms: InferenceFrameInfo for LinearIndices(::Vector{Float64})
│     │  └─ 0.05502ms: InferenceFrameInfo for LinearIndices(::Tuple{Base.OneTo{Int64}})
│     └─ 0.430213ms: InferenceFrameInfo for Base.collect_to!(::Vector{Float64}, ::Base.Generator{UnitRange{Int64}, PCConst.var"#1#2"{Float64}}, ::Int64, ::Int64)
│        └─ 0.306434ms: InferenceFrameInfo for iterate(::Base.Generator{UnitRange{Int64}, PCConst.var"#1#2"{Float64}}, ::Int64)
├─ 1.373042ms: InferenceFrameInfo for exp(::Float64)
│  ├─ 0.053939ms: InferenceFrameInfo for Base.Math.LogBo256INV(Val{:ℯ}()::Val{:ℯ}, ::Type{Float64})
│  ├─ 0.020152ms: InferenceFrameInfo for Base.Math.MAGIC_ROUND_CONST(::Type{Float64})
│  ├─ 0.039228ms: InferenceFrameInfo for muladd(::Float64, 369.3299304675746::Float64, 6.755399441055744e15::Float64)
│  ├─ 0.018735ms: InferenceFrameInfo for Base.Math.LogBo256U(Val{:ℯ}()::Val{:ℯ}, ::Type{Float64})
│  ├─ 0.029325ms: InferenceFrameInfo for muladd(::Float64, -0.002707606173999011::Float64, ::Float64)
│  ├─ 0.029262ms: InferenceFrameInfo for Base.Math.LogBo256L(Val{:ℯ}()::Val{:ℯ}, ::Type{Float64})
│  ├─ 0.095741ms: InferenceFrameInfo for muladd(::Float64, -6.327543041662719e-14::Float64, ::Float64)
│  ├─ 0.199199ms: InferenceFrameInfo for >>(::Int32, ::Int64)
│  ├─ 0.134542ms: InferenceFrameInfo for &(::Int32, 255::Int64)
│  │  ├─ 0.026995ms: InferenceFrameInfo for rem(255::Int64, ::Type{Int64})
│  │  └─ 0.028472ms: InferenceFrameInfo for &(::Int64, 255::Int64)
│  ├─ 0.251593ms: InferenceFrameInfo for Base.Math.table_unpack(::Int64)
│  │  ├─ 0.12633ms: InferenceFrameInfo for getindex(::NTuple{256, UInt64}, ::Int64)
│  │  ├─ 0.118617ms: InferenceFrameInfo for getindex((0x0000000000000000, 0xaac00b1afa5abcbe, 0x9b60163da9fb3335, 0xab502168143b0280, 0xadc02c9a3e778060, 0x656037d42e11bbcc, 0xa7a04315e86e7f84, 0x84c04e5f72f654b1, 0x8d7059b0d3158574, 0xa510650a0e3c1f88, 0xa8d0706b29ddf6dd, 0x83207bd42b72a836, 0x6180874518759bc8, 0xa4b092bdf66607df, 0x91409e3ecac6f383, 0x85d0a9c79b1f3919, 0x98a0b5586cf9890f, 0x94f0c0f145e46c85, 0x9010cc922b7247f7, 0xa210d83b23395deb, 0x4030e3ec32d3d1a2, 0xa5b0efa55fdfa9c4, 0xae40fb66affed31a, 0x8d41073028d7233e, 0xa4911301d0125b50, 0xa1a11edbab5e2ab5, 0xaf712abdc06c31cb, 0xae8136a814f204aa, 0xa661429aaea92ddf, 0xa9114e95934f312d, 0x82415a98c8a58e51, 0x58f166a45471c3c2, 0xab9172b83c7d517a, 0x70917ed48695bbc0, 0xa7718af9388c8de9, 0x94a1972658375d2f, 0x8e51a35beb6fcb75, 0x97b1af99f8138a1c, 0xa351bbe084045cd3, 0x9001c82f95281c6b, 0x9e01d4873168b9aa, 0xa481e0e75eb44026, 0xa711ed5022fcd91c, 0xa201f9c18438ce4c, 0x8dc2063b88628cd6, 0x935212be3578a819, 0x82a21f49917ddc96, 0x8d322bdda27912d1, 0x99b2387a6e756238, 0x8ac2451ffb82140a, 0x8ac251ce4fb2a63f, 0x93e25e85711ece75, 0x82b26b4565e27cdd, 0x9e02780e341ddf29, 0xa2d284dfe1f56380, 0xab4291ba7591bb6f, 0x86129e9df51fdee1, 0xa352ab8a66d10f12, 0xafb2b87fd0dad98f, 0xa572c57e39771b2e, 0x9002d285a6e4030b, 0x9d12df961f641589, 0x71c2ecafa93e2f56, 0xaea2f9d24abd886a, 0x86f306fe0a31b715, 0x89531432edeeb2fd, 0x8a932170fc4cd831, 0xa1d32eb83ba8ea31, 0x93233c08b26416ff, 0xab23496266e3fa2c, 0xa92356c55f929ff0, 0xa8f36431a2de883a, 0xa4e371a7373aa9ca, 0xa3037f26231e7549, 0xa0b38cae6d05d865, 0xa3239a401b7140ee, 0xad43a7db34e59ff6, 0x9543b57fbfec6cf4, 0xa083c32dc313a8e4, 0x7fe3d0e544ede173, 0x8ad3dea64c123422, 0xa943ec70df1c5174, 0xa413fa4504ac801b, 0x8bd40822c367a024, 0xaf04160a21f72e29, 0xa3d423fb27094689, 0xab8431f5d950a896, 0x88843ffa3f84b9d4, 0x48944e086061892d, 0xae745c2042a7d231, 0x9c946a41ed1d0057, 0xa1e4786d668b3236, 0x73c486a2b5c13cd0, 0xab1494e1e192aed1, 0x99c4a32af0d7d3de, 0xabb4b17dea6db7d6, 0x7d44bfdad5362a27, 0x9054ce41b817c114, 0x98e4dcb299fddd0d, 0xa564eb2d81d8abfe, 0xa5a4f9b2769d2ca6, 0x7a2508417f4531ee, 0xa82516daa2cf6641, 0xac65257de83f4eee, 0xabe5342b569d4f81, 0x879542e2f4f6ad27, 0xa8a551a4ca5d920e, 0xa7856070dde910d1, 0x99b56f4736b527da, 0xa7a57e27dbe2c4ce, 0x82958d12d497c7fd, 0xa4059c0827ff07cb, 0x9635ab07dd485429, 0xa245ba11fba87a02, 0x3c45c9268a5946b7, 0xa195d84590998b92, 0x9ba5e76f15ad2148, 0xa985f6a320dceb70, 0xa60605e1b976dc08, 0x9e46152ae6cdf6f4, 0xa636247eb03a5584, 0x984633dd1d1929fd, 0xa8e6434634ccc31f, 0xa28652b9febc8fb6, 0xa226623882552224, 0xa85671c1c70833f5, 0x60368155d44ca973, 0x880690f4b19e9538, 0xa216a09e667f3bcc, 0x7a36b052fa75173e, 0xada6c012750bdabe, 0x9c76cfdcddd47645, 0xae46dfb23c651a2e, 0xa7a6ef9298593ae4, 0xa9f6ff7df9519483, 0x59d70f7466f42e87, 0xaba71f75e8ec5f73, 0xa6f72f8286ead089, 0xa7a73f9a48a58173, 0x90474fbd35d7cbfd, 0xa7e75feb564267c8, 0x9b777024b1ab6e09, 0x986780694fde5d3f, 0x934790b938ac1cf6, 0xaaf7a11473eb0186, 0xa207b17b0976cfda, 0x9f17c1ed0130c132, 0x91b7d26a62ff86f0, 0x7057e2f336cf4e62, 0xabe7f3878491c490, 0xa6c80427543e1a11, 0x946814d2add106d9, 0xa1582589994cce12, 0x9998364c1eb941f7, 0xa9c8471a4623c7ac, 0xaf2857f4179f5b20, 0xa01868d99b4492ec, 0x85d879cad931a436, 0x99988ac7d98a6699, 0x9d589bd0a478580f, 0x96e8ace5422aa0db, 0x9ec8be05bad61778, 0xade8cf3216b5448b, 0xa478e06a5e0866d8, 0x85c8f1ae99157736, 0x959902fed0282c8a, 0xa119145b0b91ffc5, 0xab2925c353aa2fe1, 0xae893737b0cdc5e4, 0xa88948b82b5f98e4, 0xad395a44cbc8520e, 0xaf296bdd9a7670b2, 0xa1797d829fde4e4f, 0x7ca98f33e47a22a2, 0xa749a0f170ca07b9, 
0xa119b2bb4d53fe0c, 0x7c79c49182a3f090, 0xa579d674194bb8d4, 0x7829e86319e32323, 0xaad9fa5e8d07f29d, 0xa65a0c667b5de564, 0x9c6a1e7aed8eb8bb, 0x963a309bec4a2d33, 0xa2aa42c980460ad7, 0xa16a5503b23e255c, 0x650a674a8af46052, 0x9bca799e1330b358, 0xa58a8bfe53c12e58, 0x90fa9e6b5579fdbf, 0x889ab0e521356eba, 0xa81ac36bbfd3f379, 0x97ead5ff3a3c2774, 0x97aae89f995ad3ad, 0xa5aafb4ce622f2fe, 0xa21b0e07298db665, 0x94db20ce6c9a8952, 0xaedb33a2b84f15fa, 0xac1b468415b749b0, 0xa1cb59728de55939, 0x92ab6c6e29f1c52a, 0xad5b7f76f2fb5e46, 0xa24b928cf22749e3, 0xa08ba5b030a10649, 0xafcbb8e0b79a6f1e, 0x823bcc1e904bc1d2, 0xafcbdf69c3f3a206, 0xa08bf2c25bd71e08, 0xa89c06286141b33c, 0x811c199bdd85529c, 0xa48c2d1cd9fa652b, 0x9b4c40ab5fffd07a, 0x912c544778fafb22, 0x928c67f12e57d14b, 0xa86c7ba88988c932, 0x71ac8f6d9406e7b5, 0xaa0ca3405751c4da, 0x750cb720dcef9069, 0xac5ccb0f2e6d1674, 0xa88cdf0b555dc3f9, 0xa2fcf3155b5bab73, 0xa1ad072d4a07897b, 0x955d1b532b08c968, 0xa15d2f87080d89f1, 0x93dd43c8eacaa1d6, 0x82ed5818dcfba487, 0x5fed6c76e862e6d3, 0xa77d80e316c98397, 0x9a0d955d71ff6075, 0x9c2da9e603db3285, 0xa24dbe7cd63a8314, 0x92ddd321f301b460, 0xa1ade7d5641c0657, 0xa72dfc97337b9b5e, 0xadae11676b197d16, 0xa42e264614f5a128, 0xa30e3b333b16ee11, 0x839e502ee78b3ff6, 0xaa7e653924676d75, 0x92de7a51fbc74c83, 0xa77e8f7977cdb73f, 0xa0bea4afa2a490d9, 0x948eb9f4867cca6e, 0xa1becf482d8e67f0, 0x91cee4aaa2188510, 0x9dcefa1bee615a27, 0xa66f0f9c1cb64129, 0x93af252b376bba97, 0xacdf3ac948dd7273, 0x99df50765b6e4540, 0x9faf6632798844f8, 0xa12f7bfdad9cbe13, 0xaeef91d802243c88, 0x874fa7c1819e90d8, 0xacdfbdba3692d513, 0x62efd3c22b8f71f1, 0x74afe9d96b2a23d9)::NTuple{256, UInt64}, ::Int64)
│  │  ├─ 0.026041ms: InferenceFrameInfo for &(::UInt64, 4503599627370495::UInt64)
│  │  ├─ 0.025355ms: InferenceFrameInfo for |(4607182418800017408::UInt64, ::UInt64)
│  │  └─ 0.024029ms: InferenceFrameInfo for |(4323455642275676160::UInt64, ::UInt64)
│  ├─ 0.052238ms: InferenceFrameInfo for Base.indexed_iterate(::Tuple{Float64, Float64}, 1::Int64)
│  │  └─ 0.05335ms: InferenceFrameInfo for Base.indexed_iterate(::Tuple{Float64, Float64}, 1::Int64, 1::Int64)
│  │     └─ 0.0296ms: InferenceFrameInfo for +(1::Int64, 1::Int64)
│  ├─ 0.045782ms: InferenceFrameInfo for Base.indexed_iterate(::Tuple{Float64, Float64}, 2::Int64, 2::Int64)
│  │  └─ 0.025904ms: InferenceFrameInfo for +(2::Int64, 1::Int64)
│  ├─ 0.389975ms: InferenceFrameInfo for Base.Math.expm1b_kernel(Val{:ℯ}()::Val{:ℯ}, ::Float64)
│  │  └─ 0.174329ms: InferenceFrameInfo for evalpoly(::Float64, (0.9999999999999912, 0.4999999999999997, 0.1666666857598779, 0.04166666857598777)::NTuple{4, Float64})
│  │     ├─ 0.029071ms: InferenceFrameInfo for getindex((0.9999999999999912, 0.4999999999999997, 0.1666666857598779, 0.04166666857598777)::NTuple{4, Float64}, 4::Int64)
│  │     ├─ 0.023579ms: InferenceFrameInfo for getindex((0.9999999999999912, 0.4999999999999997, 0.1666666857598779, 0.04166666857598777)::NTuple{4, Float64}, 3::Int64)
│  │     ├─ 0.028283ms: InferenceFrameInfo for muladd(::Float64, 0.04166666857598777::Float64, 0.1666666857598779::Float64)
│  │     ├─ 0.02418ms: InferenceFrameInfo for getindex((0.9999999999999912, 0.4999999999999997, 0.1666666857598779, 0.04166666857598777)::NTuple{4, Float64}, 2::Int64)
│  │     ├─ 0.025685ms: InferenceFrameInfo for muladd(::Float64, ::Float64, 0.4999999999999997::Float64)
│  │     ├─ 0.023515ms: InferenceFrameInfo for getindex((0.9999999999999912, 0.4999999999999997, 0.1666666857598779, 0.04166666857598777)::NTuple{4, Float64}, 1::Int64)
│  │     └─ 0.025202ms: InferenceFrameInfo for muladd(::Float64, ::Float64, 0.9999999999999912::Float64)
│  ├─ 0.020752ms: InferenceFrameInfo for Base.Math.SUBNORM_EXP(Val{:ℯ}()::Val{:ℯ}, ::Type{Float64})
│  ├─ 0.016344ms: InferenceFrameInfo for Base.Math.MAX_EXP(Val{:ℯ}()::Val{:ℯ}, ::Type{Float64})
│  ├─ 0.016751ms: InferenceFrameInfo for Base.Math.MIN_EXP(Val{:ℯ}()::Val{:ℯ}, ::Type{Float64})
│  ├─ 0.034984ms: InferenceFrameInfo for UInt64(53::Int64)
│  │  └─ 0.042172ms: InferenceFrameInfo for Core.toUInt64(53::Int64)
│  │     └─ 0.047479ms: InferenceFrameInfo for Core.check_top_bit(::Type{UInt64}, 53::Int64)
│  │        └─ 0.049407ms: InferenceFrameInfo for Core.is_top_bit_set(53::Int64)
│  │           
│  │           
│  ├─ 0.144234ms: InferenceFrameInfo for +(::Int32, ::UInt64)
│  │  ├─ 0.066589ms: InferenceFrameInfo for Base.promote_typeof(::Int32, ::UInt64)
│  │  │  └─ 0.076001ms: InferenceFrameInfo for promote_type(::Type{Int32}, ::Type{UInt64})
│  │  │     ├─ 0.020399ms: InferenceFrameInfo for promote_rule(::Type{Int32}, ::Type{UInt64})
│  │  │     │  
│  │  │     │  
│  │  │     ├─ 0.017439ms: InferenceFrameInfo for promote_rule(::Type{UInt64}, ::Type{Int32})
│  │  │     │  
│  │  │     │  
│  │  │     └─ 0.030622ms: InferenceFrameInfo for Base.promote_result(::Type{Int32}, ::Type{UInt64}, ::Type{Union{}}, ::Type{UInt64})
│  │  │        
│  │  │        
│  │  └─ 0.019185ms: InferenceFrameInfo for Base.not_sametype(::Tuple{Int32, UInt64}, ::Tuple{UInt64, UInt64})
│  ├─ 0.029075ms: InferenceFrameInfo for Val{-53}()
│  ├─ 0.080985ms: InferenceFrameInfo for Base.literal_pow(^::typeof(^), ::Float64, Val{-53}()::Val{-53})
│  │  └─ 0.271619ms: InferenceFrameInfo for ^(::Float64, -53::Int64)
│  │     ├─ 16.741692ms: InferenceFrameInfo for ==(-53::Int64, -1::Int64)
│  │     ├─ 0.041581ms: InferenceFrameInfo for ==(-53::Int64, 0::Int64)
│  │     ├─ 0.024597ms: InferenceFrameInfo for ==(-53::Int64, 1::Int64)
│  │     ├─ 0.024418ms: InferenceFrameInfo for ==(-53::Int64, 2::Int64)
│  │     ├─ 0.022065ms: InferenceFrameInfo for ==(-53::Int64, 3::Int64)
│  │     ├─ 0.027362ms: InferenceFrameInfo for Float64(-53::Int64)
│  │     ├─ 0.037626ms: InferenceFrameInfo for Base.cconvert(::Type{Float64}, -53.0::Float64)
│  │     │  └─ 0.017457ms: InferenceFrameInfo for convert(::Type{Float64}, -53.0::Float64)
│  │     │     
│  │     │     
│  │     └─ 0.017038ms: InferenceFrameInfo for Base.unsafe_convert(::Type{Float64}, -53.0::Float64)
│  └─ 0.064417ms: InferenceFrameInfo for Base.literal_pow(^::typeof(^), 2.0::Float64, Val{-53}()::Val{-53})
│     └─ 0.145424ms: InferenceFrameInfo for ^(2.0::Float64, -53::Int64)
│        ├─ 0.033322ms: InferenceFrameInfo for Base.cconvert(::Type{Float64}, 2.0::Float64)
│        │  └─ 0.019421ms: InferenceFrameInfo for convert(::Type{Float64}, 2.0::Float64)
│        │     
│        │     
│        └─ 0.017224ms: InferenceFrameInfo for Base.unsafe_convert(::Type{Float64}, 2.0::Float64)
└─ 0.481137ms: InferenceFrameInfo for /(::Vector{Float64}, ::Float64)
   └─ 0.601574ms: InferenceFrameInfo for Base.Broadcast.broadcast_preserving_zero_d(::typeof(/), ::Vector{Float64}, ::Float64)
      ├─ 0.206691ms: InferenceFrameInfo for Base.Broadcast.broadcasted(::typeof(/), ::Vector{Float64}, ::Float64)
      │  ├─ 0.045067ms: InferenceFrameInfo for Base.Broadcast.broadcastable(::Vector{Float64})
      │  ├─ 0.019835ms: InferenceFrameInfo for Base.Broadcast.broadcastable(::Float64)
      │  ├─ 0.04857ms: InferenceFrameInfo for Base.Broadcast.combine_styles(::Vector{Float64}, ::Float64)
      │  │  ├─ 0.051803ms: InferenceFrameInfo for Base.Broadcast.combine_styles(::Vector{Float64})
      │  │  │  
      │  │  │  
      │  │  └─ 0.039839ms: InferenceFrameInfo for Base.Broadcast.combine_styles(::Float64)
      │  │     
      │  │     
      │  └─ 0.210104ms: InferenceFrameInfo for Base.Broadcast.broadcasted(::Base.Broadcast.DefaultArrayStyle{1}, ::typeof(/), ::Vector{Float64}, ::Float64)
      │     ├─ 0.098777ms: InferenceFrameInfo for (Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Axes, F, Args} where Args<:Tuple where F where Axes)(/::typeof(/), ::Tuple{Vector{Float64}, Float64})
      │     │  
      │     │  
      │     └─ 0.167252ms: InferenceFrameInfo for (Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Axes, F, Args} where Args<:Tuple where F where Axes)(/::typeof(/), ::Tuple{Vector{Float64}, Float64})
      │        
      │        
      ├─ 0.535055ms: InferenceFrameInfo for Base.Broadcast.materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(/), Tuple{Vector{Float64}, Float64}})
      │  ├─ 0.344224ms: InferenceFrameInfo for Base.Broadcast.instantiate(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(/), Tuple{Vector{Float64}, Float64}})
      │  │  ├─ 0.045665ms: InferenceFrameInfo for getproperty(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(/), Tuple{Vector{Float64}, Float64}}, ::Symbol)
      │  │  │  
      │  │  │  
      │  │  ├─ 0.025575ms: InferenceFrameInfo for getproperty(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(/), Tuple{Vector{Float64}, Float64}}, axes::Symbol)
      │  │  │  
      │  │  │  
      │  │  ├─ 0.022817ms: InferenceFrameInfo for getproperty(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(/), Tuple{Vector{Float64}, Float64}}, args::Symbol)
      │  │  │  
      │  │  │  
      │  │  ├─ 0.106135ms: InferenceFrameInfo for Base.Broadcast.combine_axes(::Vector{Float64}, ::Float64)
      │  │  │  
      │  │  │  
      │  │  ├─ 0.024495ms: InferenceFrameInfo for getproperty(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(/), Tuple{Vector{Float64}, Float64}}, f::Symbol)
      │  │  │  
      │  │  │  
      │  │  ├─ 0.095981ms: InferenceFrameInfo for (Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Axes, F, Args} where Args<:Tuple where F where Axes)(/::typeof(/), ::Tuple{Vector{Float64}, Float64}, ::Tuple{Base.OneTo{Int64}})
      │  │  │  
      │  │  │  
      │  │  └─ 0.082441ms: InferenceFrameInfo for (Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Axes, F, Args} where Args<:Tuple where F where Axes)(/::typeof(/), ::Tuple{Vector{Float64}, Float64}, ::Tuple{Base.OneTo{Int64}})
      │  │     
      │  │     
      │  └─ 0.576416ms: InferenceFrameInfo for copy(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(/), Tuple{Vector{Float64}, Float64}})
      │     ├─ 0.046581ms: InferenceFrameInfo for getproperty(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(/), Tuple{Vector{Float64}, Float64}}, ::Symbol)
      │     │  
      │     │  
      │     ├─ 0.025953ms: InferenceFrameInfo for getproperty(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(/), Tuple{Vector{Float64}, Float64}}, f::Symbol)
      │     │  
      │     │  
      │     ├─ 0.029579ms: InferenceFrameInfo for getproperty(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(/), Tuple{Vector{Float64}, Float64}}, args::Symbol)
      │     │  
      │     │  
      │     ├─ 0.054823ms: InferenceFrameInfo for Base.Broadcast.combine_eltypes(/::typeof(/), ::Tuple{Vector{Float64}, Float64})
      │     │  
      │     │  
      │     ├─ 0.104857ms: InferenceFrameInfo for similar(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(/), Tuple{Vector{Float64}, Float64}}, ::Type{Float64})
      │     │  
      │     │  
      │     └─ 0.642781ms: InferenceFrameInfo for copyto!(::Vector{Float64}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(/), Tuple{Vector{Float64}, Float64}})
      │        
      │        
      ├─ 0.084705ms: InferenceFrameInfo for axes(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(/), Tuple{Vector{Float64}, Float64}})
      │  └─ 0.100609ms: InferenceFrameInfo for Base.Broadcast._axes(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(/), Tuple{Vector{Float64}, Float64}}, nothing::Nothing)
      └─ 0.025072ms: InferenceFrameInfo for ==(1::Int64, 0::Int64)

julia> itrigs = inference_triggers(tinf)    # just to emphasize the calls that required a fresh (unlinked) run of inference
3-element Vector{InferenceTrigger}:
 Inference triggered to call MethodInstance for collect(::Base.Generator{UnitRange{Int64}, PCConst.var"#1#2"{Float64}}) from gaussian (/tmp/PCConst/src/PCConst.jl:9) with specialization MethodInstance for gaussian(::Float64, ::Int64)
 Inference triggered to call MethodInstance for exp(::Float64) from #1 (./none:0) inlined into MethodInstance for collect(::Base.Generator{UnitRange{Int64}, PCConst.var"#1#2"{Float64}}) (./array.jl:664)
 Inference triggered to call MethodInstance for /(::Vector{Float64}, ::Float64) from gaussian (/tmp/PCConst/src/PCConst.jl:10) with specialization MethodInstance for gaussian(::Float64, ::Int64)

Generally I would have expected no inference at all, given that we already had the MethodInstance. Note that in the tree output, you can see the Const-specialization from sequences like foo(1::Int64) which indicate that foo was Const-specialized with an argument of value 1::Int64.

It's worth starting a fresh session and doing the same analysis for gethiddengaussian, to see that it has different entry points into inference. But if you directly precompile hiddengaussian (uncommenting that line), you get back to something almost identical to getgaussian.
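For anyone repeating that comparison, a minimal fresh-session script (using only the commands already shown above) would be something like:

julia> using PCConst, SnoopCompile, AbstractTrees

julia> tinf = @snoopi_deep PCConst.gethiddengaussian([1,2,3,4,5,6]);

julia> print_tree(tinf)             # compare against the getgaussian tree above

julia> inference_triggers(tinf)     # and compare the fresh-inference entry points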


timholy commented Jan 5, 2021

Along the lines of this issue and #38080, the new SnoopCompile 2.2.0 has a not-yet-documented "profile-guided deoptimization" toolset. Here's a demo running on Franklin:

[plot: profile-guided deoptimization scatter of inference time vs. runtime "self" time, one dot per Method]

Each dot is a Method. You can see the inference time (including for the MethodInstance's callees) vs the amount of "self" time (not including callees) it spent at runtime. Some of the ones with more inference time are very cheap to infer once but are inferred a whole bunch of times for different constants (e.g., getproperty, the yellow dot in the plot). But even the top-most left dot has been inferred for 38 different type+const combinations, even though there are only 6 MethodInstances. I haven't broken out how much constprop accounts for that, but it's pretty interesting to see.


timholy commented Jan 9, 2021

Here's a concrete example of the impact of this issue: the kwfunc for this method in DataFrames ends up taking a whopping 23s of inference time while running the DataFrames tests (the total test suite is ~700s, so inference on just that one method is ~3% of the total time). It gets inferred 1778 separate times. Of the first 4 arguments of the kwfunc,

(::DataFrames.var"#_combine_prepare##kw")(::Any, ::typeof(DataFrames._combine_prepare), gd::GroupedDataFrame, cs)

the number of distinct slottypes (including Const, PartialStruct, etc) is:

 705
   1
   1
 573

So the Boolean kwargs account for a ton of re-inference, and despite the @nospecialize annotation on cs it ends up triggering a heck of a lot of inference. It's worth pointing out that

julia> m.nospecialize
8

where m is this method, so the @nospecialize does propagate to argument 4 of the kwfunc, but the constant-prop on the NamedTuple seems to force re-inference.
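(For anyone else decoding that field: as I understand it, nospecialize is a per-argument bitmask, so the value 8 corresponds to bit 3, i.e. the fourth argument slot:)

julia> 8 == 1 << 3        # bit 3 (0-based) is set...
true

julia> bitstring(Int8(8)) # ...i.e. the fourth argument slot, counting from 1
"00001000"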

A small sample of the diversity of the NamedTuple is

julia> arg1
Base.IdSet{Any} with 705 elements:
  Core.PartialStruct(NamedTuple{(:keepkeys, :ungroup, :copycols, :keeprows, :renamecols), NTuple{5, Bool}}, Any[Core.Const(true), Bool, Core.Const(true), Core.Const(false), Core.Const(true)])
  Core.PartialStruct(NamedTuple{(:copycols, :keepkeys, :ungroup, :keeprows, :renamecols), NTuple{5, Bool}}, Any[Bool, Core.Const(true), Core.Const(true), Core.Const(true), Bool])
  Core.PartialStruct(NamedTuple{(:keepkeys, :ungroup, :copycols, :keeprows, :renamecols), NTuple{5, Bool}}, Any[Bool, Bool, Core.Const(true), Core.Const(false), Bool])
  Core.PartialStruct(NamedTuple{(:keepkeys, :ungroup, :copycols, :keeprows, :renamecols), NTuple{5, Bool}}, Any[Core.Const(true), Bool, Core.Const(true), Core.Const(false), Core.Const(true)])
  ⋮ 

and that of the cs argument

julia> arg4
Base.IdSet{Any} with 573 elements:
  Tuple{Pair{Vector{Symbol}, Pair{Main.TestGrouping.var"#367#376"{Int64, Int64, UnitRange{Int64}}, Symbol}}}
  Tuple{Pair{Vector{Int64}, Pair{Main.TestGrouping.var"#198#275", Symbol}}}
  Core.PartialStruct(Tuple{Colon, Vararg{Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex}, N} where N}, Any[Core
  Tuple{Pair{Symbol, Symbol}, Pair{Symbol, ByRow{typeof(sin)}}}
  Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#510#522", Symbol}}, Pair{Symbol, Pair{Main.TestGrouping.var"#511#523", Symbol}}}
  Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#573#587"{typeof(sum)}, Symbol}}}
  Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#581#595"{ComposedFunction{typeof(prod), typeof(skipmissing)}}, Symbol}}}
  Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#278#293"{typeof(maximum)}, Symbol}}}
  Core.Const((Main.TestGrouping.var"#633#673"(),))
   

Here's the sorted set (essentially the cumulative distribution) of inference times for each of these runs:

[plot: sorted (cumulative-distribution) view of the per-run inference times]

so you can see there are quite a few that take non-negligible time.


timholy commented Jan 26, 2021

Now that #38080 is merged, I'd like to propose that we turn off const-prop on keyword arguments by default; anyone who wants it can annotate with @aggressive_constprop. Julia packages are chock full of methods like f(args...; allow_failure=false) where it's completely useless to const-specialize on allow_failure. (See the DataFrames example above for a real-world cost.)
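For concreteness, here is a sketch of what opting out could look like with the @constprop :none spelling that eventually landed (see the commits at the end of this thread). The function itself is made up, and whether the annotation reaches the kw body method is exactly the caveat noted in those commits:

# hypothetical example, assuming a Julia that has the Base.@constprop macro
Base.@constprop :none function clamp_positive(xs; allow_failure::Bool=false)
    # the Bool only selects an error path; const-specializing on it just multiplies
    # the number of times this body gets inferred
    any(x -> x < 0, xs) && !allow_failure && error("negative input not allowed")
    return map(x -> max(x, zero(x)), xs)
end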

martinholters commented:

Just investigated a case where precompilation was not as effective as I hoped it would be and reduced it until I realized it was this issue. Here is the MWE:

module Foo

bar(x::Int) = hcat(rand())

foo(x) = bar(1)
#foo(x) = bar(x)   # this precompiles just fine

@assert precompile(foo, (Int,))

end # module

Then, in a fresh session:

julia> first(methods(hcat, Tuple{Number})).specializations # none yet
svec()

julia> using Foo

julia> first(methods(Foo.foo)).specializations
svec(MethodInstance for Foo.foo(::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)

julia> first(methods(Foo.bar)).specializations
svec(MethodInstance for Foo.bar(::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)

julia> first(methods(hcat, Tuple{Number})).specializations # contains hcat(::Float64) from Foo
svec(MethodInstance for hcat(::Float64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)

julia> using SnoopCompileCore

julia> tinf = @snoopi_deep Foo.foo(1);

julia> using SnoopCompile, AbstractTrees

julia> print_tree(tinf, maxdepth=3)
InferenceTimingNode: 0.018412/0.044012 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000297/0.025600 on Foo.foo(::Int64) with 1 direct children
   └─ InferenceTimingNode: 0.000358/0.025303 on Foo.bar(1::Int64) with 2 direct children
      ├─ InferenceTimingNode: 0.000404/0.003161 on rand() with 1 direct children
      │  
      │  
      └─ InferenceTimingNode: 0.000777/0.021783 on hcat(::Float64) with 13 direct children
         

So precompilation has generated the expected MethodInstances, but they get inferred again. Now I understand why bar(1) might be inferred if only bar(::Int) is stored in the precompile file. (Is it?) But note that hcat(::Float64) also gets inferred again, even though it was never part of the const-propagation itself. That looks completely unnecessary.


timholy commented Aug 9, 2021

if only bar(::Int) is stored in the precompile file. (Is it?)

Funny you should ask. I got tired of not really understanding this well myself, so I am partway through writing PrecompileCacheUtils.jl, with the aim of providing a non-instantiating mirror of the content in the *.ji file (essentially like src/dump.c's jl_deserialize* pipeline but non-instantiating). But it's not working yet.

We could add a macro @no_constprop (sort of the opposite of #38080), but I am hesitant to suggest that too strongly since every annotation we add makes Julia more complicated for the developer. I'd much rather find a systematic solution to this problem, if possible.

Meanwhile, to get a sense for how common this issue is, I'm contemplating special coloration in the flamegraphs to highlight constprop (maybe orange just like we use for gc in performance profiling). Thoughts?

martinholters commented:

I'm still trying to get a handle on what the problem is, exactly. To systematically go through different cases, I slightly changed the above module Foo to contain

foo() = bar(1)
foo(x) = bar(x)

and then added precompiles for either of them or both, and also ran either or both of them under @snoopi_deep. I was looking at whether hcat(::Float64) had to be (re-)inferred. (To summarize again, foo() involves constant propagation, foo(::Int) does not, but either way, hcat is never called with a constant.)
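(For concreteness, the test module looked roughly like this, with the two precompile statements toggled per row of the table below:)

module Foo

bar(x::Int) = hcat(rand())

foo() = bar(1)     # constant-propagates 1 into bar
foo(x) = bar(x)    # no constant propagation

# toggled between runs:
@assert precompile(foo, ())
@assert precompile(foo, (Int,))

end # module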

Here is what I got (❌ means hcat(::Float64) was inferred):

                              | run foo() | run foo(1) | run foo(); foo(1) | run foo(1); foo()
no precompile                 | ❌        | ❌         | ❌ ✔️             | ❌ ✔️
precompile foo()              | ❌        | ✔️         | ❌ ✔️             | ✔️ ❌
precompile foo(::Int)         | ❌        | ✔️         | ❌ ✔️             | ✔️ ❌
precompile foo(), foo(::Int)  | ❌        | ✔️         | ❌ ✔️             | ✔️ ❌
precompile foo(::Int), foo()  | ❌        | ✔️         | ❌ ✔️             | ✔️ ❌

(In the last two columns, the first result is of course redundant with the respective column before.)
There are some aspects that make sense:

  • Without any precompiles, the first run of foo always causes inference of hcat(::Float64). The second run doesn't; it uses the result obtained in the first run.
  • It makes no difference what method of foo is precompiled. Reasonable, as they both involve inference of hcat(::Float64).
  • If any foo method is precompiled, foo(1) does not need to infer hcat(::Float64), but can apparently use the result from precompilation.

However, calling foo() always requires inference of hcat(::Float64), unless it was already inferred by a call to foo(1) in the running session. (The only ✔️ for foo() is in the top-right table cell.)

So if I may venture a generalization, it looks like this: when inference is within a call chain that involves constant propagation, it cannot use inference results from precompile files, only those produced in the running session (and probably those from the sysimg?). Does that match your observations?


timholy commented Aug 10, 2021

Very interesting! One thing to check my understanding: when you say "was inferred", you mean "had previously been inferred so that it didn't need to infer it again when I ran it," right? Moreover, in all cases you're loading the module from a cache file (using Foo) and not just defining it in the REPL. If I define Foo in the REPL, any precompile directive is enough to ensure that hcat(::Float64) does not need re-inference.

Your conclusion makes sense, and in particular suggests that this is not a limitation of our cache files but instead a constraint in the current running session.


timholy commented Aug 10, 2021

One other thing we don't do is cache hcat(::Float64)'s inference results. This appears to be a consequence of inlining, but presumably those results are available prior to the inlining decision, and I wonder if it would be useful to cache them in case they are handy on their own.

I also wonder if #41328 with bar(x::Int) = @noinline hcat(rand()) would "fix" this. This isn't the way you want to fix it, but it would be a proof-of-principle.

martinholters commented:

One thing to check my understanding, when you say "was inferred" you mean "had previously been inferred so that it didn't need to infer it again when I ran it," right?

You mean in "(❌ means hcat(::Float64) was inferred)"? Then no, it means that inference of hcat(::Float64) happened when running (as reported by @snoopi_deep). It looks (to me) as if it is always inferred during precompilation, but the result is not always re-used, so ❌ means the result was not re-used (though it should be available).

Moreover, in all cases you're loading the module from a cache file (using Foo) and not just defining it in the REPL.

Correct.

If I define Foo it in the REPL, any precompile directive is enough to ensure that hcat(::Float64) does not need re-inference.

So definition in the REPL with precompilation works just as well as running without precompilation.

You conclusion makes sense, and in particular suggests that this is not a limitation of our cache files but instead a constraint we might need on the current running session.

So to avoid any misunderstanding, this is what I see, if I interpret my results correctly:

                                                       | hcat(::Float64) was inferred in running session | hcat(::Float64) inference results come from cache file
inference of new method without any const-propagation | ✔️ inference result re-used                      | ✔️ inference result re-used
inference of new method with some const-propagation   | ✔️ inference result re-used                      | ❌ hcat(::Float64) is re-inferred

Where I assume that if hcat(::Float64) shows up in @snoopi_deep output, it was re-inferred; otherwise the cached inference result is used.

One other thing we don't do is cache hcat(::Float64)'s inference results.

Ah. So it's actually the reuse of bar's inference results that makes the difference...? Then the presence of const-propagation at that point causing a difference makes much more sense.

This appears to be a consequence of inlining, but presumably those results are available prior to the inlining decision, and I wonder if it would be useful to cache them in case they are handy on their own.

If our analysis is correct up to here, then ... yes, that sounds useful.

I also wonder if #41328 with bar(x::Int) = @noinline hcat(rand()) would "fix" this. This isn't the way you want to fix it, but it would be a proof-of-principle.

Right, but I should be able to conduct the same experiment with

@noinline bar() = hcat(rand())
bar(x::Int) = bar()

right? Will do and report back...

martinholters commented:

Right, but I should be able to conduct the same experiment with ...

Hm, no, inlining there doesn't matter, but the extra indirection helps. My new module:

module Foo

# as before
bar(x::Int) = hcat(rand())

# with extra indirection
@inline bar() = hcat(rand())
bar(x::Float64) = bar()

foo1() = bar(1) # previously known as foo
foo2() = bar(1.0) # with extra indirection

@assert precompile(foo1, ())
@assert precompile(foo2, ())

end # module

(Again defined in a package and loaded from a precompile file).
Then:

julia> print_tree(@snoopi_deep(Foo.foo1()), maxdepth=3)
InferenceTimingNode: 0.018147/0.045257 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000343/0.027110 on Foo.foo1() with 1 direct children
   └─ InferenceTimingNode: 0.000385/0.026768 on Foo.bar(1::Int64) with 2 direct children
      ├─ InferenceTimingNode: 0.000289/0.002451 on rand() with 1 direct children
      │  
      │  
      └─ InferenceTimingNode: 0.000925/0.023932 on hcat(::Float64) with 13 direct children
         

This was to be expected given everything discussed above. But now in a fresh session:

julia> print_tree(@snoopi_deep(Foo.foo2()), maxdepth=3)
InferenceTimingNode: 0.019031/0.020039 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000323/0.001008 on Foo.foo2() with 2 direct children
   ├─ InferenceTimingNode: 0.000412/0.000412 on Foo.bar(::Float64) with 0 direct children
   └─ InferenceTimingNode: 0.000273/0.000273 on Foo.bar(1.0::Float64) with 0 direct children

(Replacing the @inline with @noinline or leaving it off completely makes no difference.)

Not sure what to make of it.

martinholters commented:

Trying to drill a little deeper, I've extended the above Foo module with two more functions

foo3() = bar()
foo4() = hcat(rand())

and their respective precompiles. Those give

julia> print_tree(@snoopi_deep(Foo.foo3()), maxdepth=3)
InferenceTimingNode: 0.019327/0.019715 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000388/0.000388 on Foo.foo3() with 0 direct children

and

julia> print_tree(@snoopi_deep(Foo.foo4()), maxdepth=3)
InferenceTimingNode: 0.018091/0.018091 on Core.Compiler.Timings.ROOT() with 0 direct children

(each in a fresh session). So foo4 actually is the only one that does not need any re-inference. Trying to understand why, I did (in a fresh session again, of course):

julia> visit(Foo) do item
           if item isa Core.CodeInstance
               println(item.def, ": ", repr(item.max_world))
           end
           return item isa Union{Module,Method,Function,Core.MethodInstance,Core.CodeInstance}
       end
MethodInstance for Foo.bar(): 0xffffffffffffffff
MethodInstance for Foo.bar(::Int64): 0xffffffffffffffff
MethodInstance for Foo.bar(::Float64): 0x0000000000000000
MethodInstance for Foo.foo1(): 0x0000000000000000
MethodInstance for Foo.foo2(): 0x0000000000000000
MethodInstance for Foo.foo3(): 0x0000000000000000
MethodInstance for Foo.foo4(): 0xffffffffffffffff

Comparing with the inference trees, this makes sense: Inference can stop going deeper once it reaches a matching method/specialization where there is a CodeInstance with max_world not zeroed out.

But why are some of the max_worlds set to zero?


martinholters commented Aug 11, 2021

Ah, I think I've found something. It doesn't have anything to do with const-propagation directly, so sorry for hijacking this issue, but it should probably be sorted out before worrying about the interaction of precompilation and const-propagation. Here is what I believe is the cause of all the invalidated CodeInstances (those with max_world == 0). When deserializing the module, we first initialize max_world = 0 everywhere; then, at the end, we go through the edges to external methods, check their validity, and if they are still valid, insert the needed backedges and mark the caller as valid by setting max_world to 0xff...f. But we only do this for methods which have edges to external methods. For those which have only module-internal callees, we completely skip this step, which is fine as far as inserting missing backedges goes - there aren't any - but we fail to mark them valid.

So I've tried reversing the logic, starting with all CodeInstances valid and then marking the invalid ones as such. That made all the woes I had observed above go away; nothing needs to be re-inferred. For reference, this is the patch I've used:

diff --git a/src/dump.c b/src/dump.c
index 49fa6efa43..167acbf70a 100644
--- a/src/dump.c
+++ b/src/dump.c
@@ -1616,8 +1616,10 @@ static jl_value_t *jl_deserialize_value_code_instance(jl_serializer_state *s, jl
         codeinst->precompile = 1;
     codeinst->next = (jl_code_instance_t*)jl_deserialize_value(s, (jl_value_t**)&codeinst->next);
     jl_gc_wb(codeinst, codeinst->next);
-    if (validate)
+    if (validate) {
         codeinst->min_world = jl_world_counter;
+        codeinst->max_world = ~(size_t)0;
+    }
     return (jl_value_t*)codeinst;
 }
 
@@ -2042,15 +2044,15 @@ static void jl_insert_backedges(jl_array_t *list, jl_array_t *targets)
                     jl_method_table_add_backedge(mt, callee, (jl_value_t*)caller);
                 }
             }
-            // then enable it
+        }
+        else {
+            // invalid, so disable it
             jl_code_instance_t *codeinst = caller->cache;
             while (codeinst) {
                 if (codeinst->min_world > 0)
-                    codeinst->max_world = ~(size_t)0;
+                    codeinst->max_world = 0;
                 codeinst = jl_atomic_load_relaxed(&codeinst->next);
             }
-        }
-        else {
             if (_jl_debug_method_invalidation) {
                 jl_array_ptr_1d_push(_jl_debug_method_invalidation, (jl_value_t*)caller);
                 loctag = jl_cstr_to_string("insert_backedges");

Don't know whether this is 100% sound (it might be too optimistic), but what we have at the moment is definitely too pessimistic. @vtjnash?

martinholters commented:

Don't know whether this is 100% sound

Of course it's not, as it is missing transitive invalidations. By which I mean: if internal_func1() only calls internal_func2(), which in turn calls external_func(), then at the moment we always mark internal_func1() invalid, no matter whether external_func() was actually changed (thereby invalidating internal_func2()). That's too pessimistic. With the above patch, however, we never mark internal_func1() invalid, even if external_func() was changed, which is clearly too optimistic.

But even though we thus invalidate too little with this patch, it seems to have a near-zero effect latency-wise on real-world packages. Probably a large fraction of functions without external dependencies are trivial wrappers (e.g. ones providing default args) which don't add any relevant amount of time to inference, so invalidating them unnecessarily isn't too bad. But for the kind of analysis we are doing here, it would still be nice to get rid of these, I guess. (Also, the fact that @snoopr doesn't report them further hinders analysis.)


timholy commented Aug 12, 2021

It seems we could solve this by maintaining an htable mapping MethodInstance => handled::Bool, and after the invalidation-linked pass go through the entire table and look for not-yet-handled instances. Then we mark them all as valid.
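In Julia-flavored pseudocode (the real change would live in src/dump.c, and handled is the hypothetical MethodInstance => Bool table filled during the existing backedge pass), the idea is roughly:

# sketch only: find every deserialized CodeInstance that the backedge pass never touched;
# these are the ones the proposal would mark valid (max_world = typemax)
function second_chance_candidates(handled)
    candidates = Core.CodeInstance[]
    for (mi, was_handled) in handled
        was_handled && continue              # already validated or invalidated by the edge pass
        isdefined(mi, :cache) || continue    # nothing cached for this MethodInstance
        ci = mi.cache
        while ci isa Core.CodeInstance       # walk the CodeInstance list, like the C loop does
            if ci.min_world > 0 && ci.max_world == 0
                push!(candidates, ci)        # in C this would be codeinst->max_world = ~(size_t)0
            end
            ci = isdefined(ci, :next) ? ci.next : nothing
        end
    end
    return candidates
end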

martinholters commented:

Let's take this to #41872 and then come back to the interaction with const-specialization here eventually.

timholy added a commit that referenced this issue Aug 22, 2021
In #38983 and #41872, it was discovered that only CodeInstances with
external backedges got validated after deserialization.
This stores new CodeInstances in a list, and if they have neither been
validated nor invalidated, it validates them.

Closes #41872
timholy added a commit that referenced this issue Aug 23, 2021
In #38983 and #41872, it was discovered that only CodeInstances with
external backedges got validated after deserialization.
This stores new CodeInstances in a list, and if they have neither been
validated nor invalidated, it validates them.

Closes #41872
timholy added a commit that referenced this issue Aug 23, 2021
In #38983 and #41872, it was discovered that only CodeInstances with
external backedges get validated after deserialization.
This adds a "second chance" for each CodeInstance: it validates any that
have neither been validated nor invalidated by the end of deserialization.

Closes #41872
timholy added a commit that referenced this issue Sep 5, 2021
Our heuristics for constant propagation are imperfect (and probably
never will be perfect), and I've now seen many examples of
methods that no developer would ask to have const-propped get
that treatment. In some cases the cost for latency/precompilation
is very large. This renames `@aggressive_constprop` to `@constprop`
and allows two settings, `:aggressive` and `:none`.

Closes #38983 (presuming this will propagate to kw body methods)
timholy added a commit that referenced this issue Sep 10, 2021
Our heuristics for constant propagation are imperfect (and probably
never will be perfect), and I've now seen many examples of
methods that no developer would ask to have const-propped get
that treatment. In some cases the cost for latency/precompilation
is very large. This renames `@aggressive_constprop` to `@constprop`
and allows two settings, `:aggressive` and `:none`.

Closes #38983

Co-authored-by: Shuhei Kadowaki <[email protected]>
Co-authored-by: Martin Holters <[email protected]>
simeonschaub pushed a commit that referenced this issue Sep 16, 2021
Our heuristics for constant propagation are imperfect (and probably
never will be perfect), and I've now seen many examples of
methods that no developer would ask to have const-propped get
that treatment. In some cases the cost for latency/precompilation
is very large. This renames `@aggressive_constprop` to `@constprop`
and allows two settings, `:aggressive` and `:none`.

Closes #38983

Co-authored-by: Shuhei Kadowaki <[email protected]>
Co-authored-by: Martin Holters <[email protected]>
LilithHafner pushed a commit to LilithHafner/julia that referenced this issue Feb 22, 2022
In JuliaLang#38983 and JuliaLang#41872, it was discovered that only CodeInstances with
external backedges get validated after deserialization.
This adds a "second chance" for each CodeInstance: it validates any that
have neither been validated nor invalidated by the end of deserialization.

Closes JuliaLang#41872
LilithHafner pushed a commit to LilithHafner/julia that referenced this issue Feb 22, 2022
Our heuristics for constant propagation are imperfect (and probably
never will be perfect), and I've now seen many examples of
methods that no developer would ask to have const-propped get
that treatment. In some cases the cost for latency/precompilation
is very large. This renames `@aggressive_constprop` to `@constprop`
and allows two settings, `:aggressive` and `:none`.

Closes JuliaLang#38983

Co-authored-by: Shuhei Kadowaki <[email protected]>
Co-authored-by: Martin Holters <[email protected]>
LilithHafner pushed a commit to LilithHafner/julia that referenced this issue Mar 8, 2022
In JuliaLang#38983 and JuliaLang#41872, it was discovered that only CodeInstances with
external backedges get validated after deserialization.
This adds a "second chance" for each CodeInstance: it validates any that
have neither been validated nor invalidated by the end of deserialization.

Closes JuliaLang#41872
LilithHafner pushed a commit to LilithHafner/julia that referenced this issue Mar 8, 2022
Our heuristics for constant propagation are imperfect (and probably
never will be perfect), and I've now seen many examples of
methods that no developer would ask to have const-propped get
that treatment. In some cases the cost for latency/precompilation
is very large. This renames `@aggressive_constprop` to `@constprop`
and allows two settings, `:aggressive` and `:none`.

Closes JuliaLang#38983

Co-authored-by: Shuhei Kadowaki <[email protected]>
Co-authored-by: Martin Holters <[email protected]>