Performance opportunities #206

Open · 2 of 4 tasks
timholy opened this issue Mar 22, 2019 · 13 comments
Comments

@timholy commented Mar 22, 2019

In my initial performance work, the big 3 improvements were local method tables (framecode.methodtables), reusing framedata (the junk mechanism), and getting rid of runtime dispatch internally within the interpreter (by manual union-splitting, i.e., if isa(x, ConcreteType) ...).
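
A minimal, self-contained sketch of that union-splitting pattern (toy types and functions for illustration; this is not the interpreter's actual code):

# When `x` can be one of a few concrete types, branching on `isa` lets the
# compiler emit a direct call in each branch; only the fallback branch pays
# for runtime dispatch.
process(x::Int)     = x + 1
process(x::Float64) = 2x
process(x)          = x                # generic fallback

function process_split(x)
    if isa(x, Int)
        return process(x::Int)         # statically dispatched
    elseif isa(x, Float64)
        return process(x::Float64)     # statically dispatched
    else
        return process(x)              # dynamic dispatch only here
    end
end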

Some possible future improvements:

  • Recent changes diving into Core._apply, Core._apply_latest, and invoke hurt performance. Perhaps some of that might be clawed back by moving more of this work into optimize!?
  • ccalls are slow. I just pushed a teh/compiled_ccall branch that contains an old attempt to create one compiled function per ccall. I was most interested in seeing whether it would circumvent MWE of char crash #28 (it didn't seem to), but I don't think I looked at its runtime implications (especially on a ccall-heavy workload). I suspect that branch is pretty close to usable, if anyone wants to pick it up and run with it.
  • specifically look for calls to iterate and do something special: (1) determine whether iterate could possibly hit a breakpoint or error (if not, then you're allowed to play tricks); (2) see if you can determine whether the loop will be long; and if so, (3) pass the iterator and frame to a special run_loop function (see the first sketch after this list). That would have to compile for each iterator type, so it is worth doing only if the loop is long. But you would save the time spent diving in and out of iterate, and that could be quite helpful.
  • apply similar treatment to arrayref. Even our builtin handling of arrayref could be improved by treating small dimensionalities separately (e.g., if nargs == 1 ... elseif nargs == 2 ... elseif nargs == 3 ... else <varargs version> end); see the second sketch after this list.
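
First sketch: a hedged illustration of the run_loop idea (run_loop is hypothetical, not an existing JuliaInterpreter function). Driving the iteration protocol in compiled code makes each iterate call direct and specialized; the interpreter would be re-entered only for the loop body, modeled here as a callback:

function run_loop(body!, iter)
    y = iterate(iter)          # compiled, specialized on typeof(iter)
    while y !== nothing
        item, state = y
        body!(item)            # stand-in for interpreting the loop body
        y = iterate(iter, state)
    end
    return nothing
end

# e.g. run_loop(x -> println(x), 1:3)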

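Second sketch: the small-dimension split for arrayref, again just an assumed shape (do_arrayref is hypothetical, and indexing stands in for the builtin arrayref handling):

function do_arrayref(A::Array, args...)
    nargs = length(args)
    if nargs == 1                      # linear indexing: the common case
        return A[args[1]::Int]
    elseif nargs == 2
        return A[args[1]::Int, args[2]::Int]
    elseif nargs == 3
        return A[args[1]::Int, args[2]::Int, args[3]::Int]
    else
        return A[args...]              # varargs fallback for higher dimensions
    end
end
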
Obviously spending some time with ProfileView will be useful. I did a ton of that at the beginning and fixed almost everything that I noticed. But I haven't done it in a while, and we may have regressed in places as we expanded our functionality.

@KristofferC commented

What is your opinion about pushing things to compiled_methods that are very commonly used but unlikely to be of interest for breakpoints?

For example, iterate(a::Array) (with Core._apply now being stepped through) makes us have to do

[1] promote_typeof(x) at promotion.jl:263
[2] promote_typeof(x, xs) at promotion.jl:264
[3] -(a, b) at int.jl:797
[4] iterate(A, i) at array.jl:705
[5] iterate(A) at array.jl:705

Is it a slippery slope to start doing this?
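
For concreteness, that would look something like this (the iterate signature here is just an example):

using JuliaInterpreter

m = which(iterate, Tuple{Vector{Float64}})   # the Method for iterate(A::Array)
push!(JuliaInterpreter.compiled_methods, m)  # run it compiled, not interpreted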

@pfitzseb commented

We could temporarily pop them from compiled_methods when someone tries to set a breakpoint in one of them. I think it does make sense to add these kinds of methods to compiled_methods by default.

@timholy commented Mar 28, 2019

I worry about that slope. Where do we stop? Will packages suddenly fill with @require JuliaInterpreter begin push!(...) end blocks?

@timholy commented Mar 28, 2019

I do propose special handling of iterate above, one that would effectively inline the calls to iterate for all containers, not just ::Array.

Also, if we start using compiled frames, one thought: I think there are only two classes of slow-to-interpret methods: ones that iterate and ones that recurse. If we can detect those specifically then we might save ourselves compile time.
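
A rough sketch of how the "ones that iterate" class might be detected (has_loop is a hypothetical helper; it assumes lowered control flow appears as Core.GotoNode / Expr(:gotoifnot, ...), as in the Julia versions current when this was written):

function has_loop(m::Method)
    ci = Base.uncompressed_ast(m)      # CodeInfo for the method
    for (i, stmt) in enumerate(ci.code)
        # a jump to an earlier statement is a back-edge, i.e. a loop
        if stmt isa Core.GotoNode && stmt.label <= i
            return true
        elseif stmt isa Expr && stmt.head === :gotoifnot && stmt.args[2] <= i
            return true
        end
    end
    return false
end

Detecting the recursing class would additionally require checking whether any call in ci.code can resolve back to m itself, which is harder because of dispatch.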

@KristofferC commented

Here is a fun one:

julia> using JuliaInterpreter, Plots

julia> @time @interpret plot(rand(5))
  5.703063 seconds (39.36 M allocations: 2.545 GiB, 6.56% gc time)

vs

julia> using JuliaInterpreter, CSV, Plots # note the CSV load

julia> @time @interpret plot(rand(5))
 16.160595 seconds (51.27 M allocations: 2.868 GiB, 3.46% gc time)

@timholy commented Mar 28, 2019

Is it all method-table searches? (Basically, whichtt or related?)

@KristofferC commented

I haven't profiled yet, just observed it.

@KristofferC commented Mar 29, 2019

I think this is the problem:

julia> using BenchmarkTools

julia> c = Some{Any}("foo")
Some("foo")

julia> @code_warntype convert(Some{Any}, c)
Body::Some{Any}
1 ─ %1 = (Base.getfield)(x, :value)::Any
│   %2 = %new(Some{Any}, %1)::Some{Any}
└──      return %2

julia> @btime convert(Some{Any}, $c)
  4.299 ns (1 allocation: 16 bytes)
Some("foo")

julia> using CSV

julia> @code_warntype convert(Some{Any}, c)
Body::Some{Any}
1 ─ %1 = $(Expr(:static_parameter, 1))::Core.Compiler.Const(Any, false)
│   %2 = (Base.getfield)(x, :value)::Any
│   %3 = (Base.convert)(%1, %2)::Any
│   %4 = (Some{Any})(%3)::Some{Any}
└──      return %4

julia> @btime convert(Some{Any}, $c)
  923.077 ns (1 allocation: 16 bytes)
Some("foo")

@KristofferC commented

With JuliaLang/julia#31536 we are at

julia>  @time @interpret plot(rand(5))
  2.930957 seconds (13.54 M allocations: 481.296 MiB, 2.31% gc time)

for the non-CSV case and

julia>  @time @interpret plot(rand(5))
  5.765768 seconds (15.13 M allocations: 507.931 MiB, 1.52% gc time)

with CSV. So much better, but still significantly slower.

@KristofferC commented Mar 30, 2019

Changing junk from a Set back to a Vector, I get

julia> @time (p = @interpret(plot(rand(5))); @interpret(display(p)))
  9.845403 seconds (64.34 M allocations: 2.011 GiB, 3.19% gc time)

vs master

julia> @time (p = @interpret(plot(rand(5))); @interpret(display(p)))
 12.480055 seconds (66.37 M allocations: 2.219 GiB, 2.56% gc time)

We changed this when we found the bug where executing the same snippet twice managed to get two of the same frame into the Vector. However, perhaps we can be smarter here than just using a full-blown IdSet. We don't end up with that many frames in junk, so we should be able to exploit that. Also, can the same Frame ever end up in junk twice if we empty it when we create the "first" frame from toplevel?

@timholy commented Mar 30, 2019

👍 In theory it seems possible to design the logic such that we never put the same frame in twice. In that case we can just use a Vector without any kind of checking. For development purposes, we might want to start with checking and an assertion, and then strip that out when we find it's no longer necessary.
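
A sketch of that development-phase check (recycle! and the Vector-typed junk here are assumptions for illustration, not the package's actual definitions):

const junk = Any[]                     # recycled frames awaiting reuse

function recycle!(frame)
    # development-time guard: catch double-recycling, delete once trusted
    @assert all(x -> x !== frame, junk) "frame recycled twice"
    push!(junk, frame)
    return nothing
end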
