Speeding up "time to first result"? #746
> As @thofma pointed out: it may well be that in the end, we can't do anything about this other than building a custom baseimage.

Quite true, but perhaps
@fingolfin @fieker There is a PR over at julia, and someone (don't want to ping them) used Hecke as a benchmark: P.S.: Caching the compiled stuff will yield
Amazing how much stuff we have.
People are looking into reducing the bloat. From what I heard, usually most of those libraries consist of serialised types in the data section; the machine code is only a fraction of the whole library.
Just tried JuliaLang/julia#47184 and with it, I get:
which is already better (and in the end, who cares). It's really amazing how much faster it makes things. Indeed:
So instead of 5.5 seconds for the first call, it's now 0.06 seconds. Fantastic! Unfortunately, we don't benefit as much in Oscar, due to CxxWrap: if I first load
I.e. CxxWrap invalidates tons of stuff. I've reported this before at the CxxWrap repository, but zero reaction so far. I think we'll have to handle this ourselves if we want to see it improved... AFAICT most (?) of the overhead comes from the TONS of automagic

Another hypothetical "solution" would be to have AbstractAlgebra and GAP.jl depend on CxxWrap and also load it -- of course nobody wants that, I am just saying it would solve the issue, because then all those invalidations introduced by CxxWrap would have happened at precompile time for all (?) our packages (I verified that it does).
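Spelled out, that hypothetical workaround would amount to nothing more than the following in each affected package (purely illustrative; as said, nobody actually wants this):

```julia
module AbstractAlgebra

# Load CxxWrap eagerly, so all invalidations it triggers happen once,
# during precompilation of this package, rather than at `using` time
# in every downstream session.
using CxxWrap

# ... actual package code ...

end # module
```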
You can force your module to always use -O1 (or even -O0 and/or --compile=min) if that's helpful (sketched below). It helps with TTFX, but would you be worried about runtime speed with lower optimization after that? Is the speed too low on the default setting, or, as far as you know, on that non-default one? I'm not sure about this package, but is it mostly used interactively? And would even 2x slower after first use be OK (not saying that will happen, just asking for your tolerance for slowdown)? The PR for it is trivial (see how it's done in e.g. Plots.jl, or ask me, maybe I'll do a PR; you may, though, need to add it for some or all of your dependencies in addition or instead). It's great to see 92x faster first use because of the new PR in Base! Still:
As you describe, the order of `using` seems to affect/cause invalidations. I don't know for sure about the invalidations; I just have a feeling it has a lot to do with inlining of code, which is on by default at -O2. You can check just that with `julia -O2 --inline=no`. I don't know what the different optimization levels, e.g. -O1, mean exactly (I don't think it's even documented; maybe in LLVM, since the level might just be passed through to it). At that level, or at least at -O0, I'm pretty sure inlining is off -- possibly fully off at the lowest level, otherwise the inlining threshold is lowered. Compilation need not be slow at all (it isn't in fully non-optimizing compilers), and with no inlining I have a hard time seeing a reason for invalidations. However, the best optimization implies "whole program optimization", i.e. inlining, and that means compilation is always going to be slow unless the resulting machine code is stored. We are getting there now in Julia, for packages and the sysimage, but I think individually, and that rules out inlining across packages (similar to e.g. Python packages written in C, which seem plenty fast enough). I'm not sure if the new package caching forbids such inlining, or, more likely, in some cases still has to reoptimize/invalidate because of the inlining your default optimization level allows.
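For reference, a minimal sketch of the per-module optimization setting referred to above -- the `Base.Experimental.@optlevel` mechanism that Plots.jl uses (the module name here is just a placeholder):

```julia
# Hypothetical package module; the @optlevel annotation is the point.
module MyPackage

# Compile all code in this module at -O1, trading some runtime
# performance for faster first compilation (available since Julia 1.5).
Base.Experimental.@optlevel 1

greet() = println("hello from MyPackage")

end # module
```

The command-line counterparts discussed above are `julia -O1`, `julia -O0 --compile=min`, and `julia -O2 --inline=no` (to test the effect of inlining alone).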
I am working on various fixes for CxxWrap which considerably improve the situation for me. See this comment for an overview. As a datapoint, `using Hecke; using CxxWrap; @time @eval maximal_order(quadratic_field(-3)[1])` gives the following timings (in seconds) and allocations, when run with various Julia and CxxWrap pull requests, and with the latest AbstractAlgebra DEV (which has some invalidation fixes). Note that timings fluctuate quite a bit, and I did just two measurements each (the first one triggered precompilation and "warmed the caches"; I then noted down the second timing). So only take the relative order of magnitude into account. The number of allocations, of course, is stable, and thus perhaps a better proxy for the actual impact.
So JuliaInterop/CxxWrap.jl#338 has a major impact across the board, but only together with JuliaInterop/CxxWrap.jl#335 does it achieve its full potential. JuliaInterop/CxxWrap.jl#337 seems not that important (though I think this depends on what exactly you measure -- it definitely does get rid of invalidations). Interestingly (and annoyingly), with Julia 1.8.4 it seems better to leave out JuliaInterop/CxxWrap.jl#335... I have no idea why (but I also haven't made any effort to find out yet).
I'm a bit confused: in JuliaLang/julia#44527 (comment) I got a time on the order of 0.02 seconds without having to do anything; what changed since then?
You got those numbers without the
Oh, cool 😄 So your work is basically to make sure that also loading CxxWrap doesn't defeat all the gains from native code caching?
@giordano indeed. I've long had my sights on CxxWrap as a likely culprit for slowdowns / invalidations, see JuliaInterop/CxxWrap.jl#278 -- but past attempts had mixed to no effect on timings (as my table above illustrates ;-) ). Now with native code caching it suddenly becomes possible to benefit, yay!
Just had a look at this again. Sadly, it seems the situation got worse again. With Julia 1.10.3, Hecke master and CxxWrap 0.15.1:
So I inspected the output of this:

```julia
using Hecke
using SnoopCompileCore
invalidations = @snoopr begin using CxxWrap end
using SnoopCompile
trees = invalidation_trees(invalidations)
```

which reports thousands of invalidations. I disabled the top method causing these and repeated the process once more. Afterwards the timings were fine again. For reference, here are the changes I made to CxxWrap to avoid the invalidations in this specific example (not claiming I covered all of them):

```diff
diff --git a/src/CxxWrap.jl b/src/CxxWrap.jl
index 421bffe..d902fd4 100644
--- a/src/CxxWrap.jl
+++ b/src/CxxWrap.jl
@@ -118,7 +118,6 @@ Base.AbstractFloat(x::CxxNumber) = Base.AbstractFloat(to_julia_int(x))
 # Convenience constructors
 (::Type{T})(x::Number) where {T<:CxxNumber} = reinterpret(T, convert(julia_int_type(T), x))
 (::Type{T})(x::Rational) where {T<:CxxNumber} = reinterpret(T, convert(julia_int_type(T), x))
-(::Type{T1})(x::T2) where {T1<:Number, T2<:CxxNumber} = T1(reinterpret(julia_int_type(T2), x))::T1
 (::Type{T1})(x::T2) where {T1<:CxxNumber, T2<:CxxNumber} = T1(reinterpret(julia_int_type(T2), x))::T1
 Base.Bool(x::T) where {T<:CxxNumber} = Bool(reinterpret(julia_int_type(T), x))::Bool
 (::Type{T})(x::T) where {T<:CxxNumber} = x
@@ -137,7 +136,6 @@ end
 # Enum type interface
 abstract type CppEnum <: Integer end
-(::Type{T})(x::CppEnum) where {T <: Integer} = T(reinterpret(Int32, x))::T
 (::Type{T})(x::Integer) where {T <: CppEnum} = reinterpret(T, Int32(x))
 (::Type{T})(x::T) where {T <: CppEnum} = x
 import Base: +,
```

Of course I am also not saying that making this change to CxxWrap is a solution (it probably would break functionality), nor even that CxxWrap is at fault (I dunno who is -- CxxWrap, Hecke, the Julia stdlib, ...)
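As an aside, for anyone reproducing the analysis above: given the `trees` object, one can drill into the worst offender roughly like this (a sketch assuming SnoopCompile's documented API; `ascend` needs Cthulhu loaded):

```julia
using SnoopCompile, Cthulhu

# invalidation_trees sorts its result, so the method responsible for
# the most invalidations comes last.
method_invalidations = trees[end]

# Pick an invalidated backedge and interactively walk up its callers
# to see which code actually depends on it.
root = method_invalidations.backedges[1]
ascend(root)
```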
For the record, here are the top invalidations:
Of particular note:
I then tried to use JET to find out more:
which suggests a type stability issue in Hecke (for this very first instance): in the call to

The calling function itself looks like this:
but unfortunately it doesn't have the proper type for

Tracing further up, the type instability begins inside a call to

```julia
fl, x = is_principal_with_data(q * prod(__P[i]^Int(c.coeff[i]) for i in 1:length(__P)))
```

I'll dig some more, but it seems we are taxing JET (?) quite a bit here; it is slow as molasses.
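(The exact JET invocation isn't shown above; for anyone wanting to reproduce this kind of analysis, a typical way to hunt for such type instabilities is sketched here:)

```julia
using JET, Hecke

# @report_opt reports runtime dispatch and other optimization
# failures encountered anywhere within this call.
@report_opt maximal_order(quadratic_field(-3)[1])
```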
On my M1 Mac with Julia 1.8-rc1:
This is not a simple matter of caching, as it is also fast with different arguments, e.g.
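(The concrete example wasn't captured above; the kind of experiment meant is sketched below, with illustrative argument values:)

```julia
using Hecke

# First call: pays the full compilation cost.
@time maximal_order(quadratic_field(-3)[1])

# A call with a *different* argument is nevertheless fast, so the cost
# above was method compilation, not caching of results.
@time maximal_order(quadratic_field(-5)[1])
```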
Most of this time seems to be spent in compiling. Indeed, starting Julia with `-O1`, things are much faster.

I've used `@snoopl` from SnoopCompile to get some insights into this. So I did

The resulting data files can be loaded separately:
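(The actual commands weren't captured above; a sketch of the standard `@snoopl` workflow, assuming SnoopCompile's documented API:)

```julia
using SnoopCompileCore

# Record LLVM compilation timings into two data files.
@snoopl "func_names.csv" "llvm_timings.yaml" begin
    using Hecke
    maximal_order(quadratic_field(-3)[1])
end

# Load the recorded data (this also works in a fresh session).
using SnoopCompile
times, info = SnoopCompile.read_snoopl("func_names.csv", "llvm_timings.yaml")
```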
So yeah, we spend a lot of time in compilation. It compiles 999 method instances, it seems.
Unfortunately there don't seem to be many functions to help analyze this further...
Attached is the list of method instances, sorted alphabetically; perhaps this already gives some ideas...
methods.txt