Compiled frames: a sketch #204

Open · timholy wants to merge 1 commit into master

Conversation

@timholy (Member) commented Mar 21, 2019

This is a skeleton illustrating how I think we should implement compiled frames. Like all untested sketches, this could of course run into serious roadblocks.

The code here is heavily commented, so that may be sufficient. (I should say that I started out intending to test this on a method inc1(x) = x + 1, but never quite got that far; that may explain some of the elements of the code.) But let me explain some of the overall strategy here. The idea is that for any method, we create an instrumented variant: in addition to doing its regular duty, every time it computes something, store intermediate results in a FrameData that gets passed in as an extra argument. Basically, the idea is that foo(x, y) becomes foo#instrumented!(x, y, #framedata#). In the instrumented variant, assignments to slots and "used" ssavalues (in the sense of framedata.used, where framedata is the framedata for foo itself) will need an extra statement inserted that performs the same assignment to the #framedata# argument.
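To make that concrete, here is a hand-written illustration of the transformation for the inc1 example mentioned above. FakeFrameData and the slot/ssavalue indices are placeholders chosen for illustration, not JuliaInterpreter's actual FrameData layout:

    # Hand-written sketch of what the instrumented variant of inc1(x) = x + 1 could
    # look like. FakeFrameData is a stand-in for the real FrameData; only the shape
    # of the rewrite matters here.
    struct FakeFrameData
        locals::Vector{Any}       # mirrored slot assignments
        ssavalues::Vector{Any}    # mirrored "used" ssavalues
    end

    inc1(x) = x + 1

    function inc1_instrumented!(x, framedata::FakeFrameData)
        framedata.locals[2] = x          # slot 2 (`x`) mirrored into the frame
        ssa1 = x + 1
        framedata.ssavalues[1] = ssa1    # ssavalue %1 mirrored into the frame
        return ssa1                      # regular duty: same return value as inc1
    end

    fd = FakeFrameData(Vector{Any}(undef, 2), Vector{Any}(undef, 1))
    inc1_instrumented!(3, fd)            # returns 4; fd now holds the intermediates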

Now, if we were to compile this, we'd probably get a fairly respectable result on that particular method. But what to do about all those calls it makes? If we do nothing, it would be like running the frame in Compiled() mode, OK for certain things but very limiting.

Here is where the real fun begins. The idea is to intercept inference, and modify the invokes to call instrumented variants of the normal methods. A potentially huge win here is that if we do this after running normal inference (including the optimizer), then we get inlining for free. I am anticipating that, with compiled frames, framedata allocation will become our single biggest expense; we can presumably avoid most of that by inlining all the "simple stuff."
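A rough sketch of that interception step, operating on the optimized CodeInfo after normal inference has run. Here instrumented_variant is a hypothetical lookup from a MethodInstance to its instrumented twin, and details such as threading the callee's framedata argument through are glossed over:

    # Hedged sketch: walk optimized IR and redirect :invoke statements to instrumented
    # variants. Calls that got inlined have already disappeared by this point, so only
    # the surviving :invoke sites would ever need a framedata of their own.
    function redirect_invokes!(ci::Core.CodeInfo, instrumented_variant)
        for stmt in ci.code
            if stmt isa Expr && stmt.head === :invoke
                mi = stmt.args[1]                        # the Core.MethodInstance being invoked
                stmt.args[2] = instrumented_variant(mi)  # swap the callee for its instrumented twin
            end
        end
        return ci
    end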

Of course that means we'll be blind to what goes on inside the inlined methods. Optionally, I suppose we could turn off the optimizer. But I think a better way to handle that would be on the UI side: all we'd need to do is copy the framedata that executes that call, and create a normal (slow) interpreted frame with the same data and then start executing that in normal interpreted mode. This essentially lets us "snip out" the little piece of the computation when we need it, but get the (probably huge) benefit of inlining for 99% of the execution time.

While I barely know anything about Cassette (something I should remedy some day), I suspect that there are similarities between what I'm proposing here and stuff Cassette is presumably already good at. However, were I to take a guess, I'd say the inlining tricks I'm proposing are something that would be difficult to do via Cassette. Here, by running inference on what is close to the normal method body (with genuinely-normal callees) and then modifying it, we should get something that's very close to the "normal" inlining decisions.

Why am I posting this as a sketch, rather than just doing it? The reality is that because I've prioritized getting the debugger out above most of my other duties, in my "real job" I have many fires burning. Worse (here I'm being a pessimist), I suspect that what serious coding time I can muster over the next few weeks will likely be eaten up by solving problems I've inadvertently created by rewriting the Revise stack. So I'm afraid I'm going to be limited in terms of how much "difficult" development I can do here. Of course I'm happy to offer what help I can, but it may be in an advisory capacity for many things. But I thought I'd throw this out there to see if it helps get things going.

This ventures into some scary territory, so it might be nice to get feedback from folks who know the compiler far better than I do: CC @Keno, @JeffBezanson, @vtjnash.

@timholy (Member Author) commented Mar 21, 2019

I should also add that if someone wants to pick this up, I'll do what I can to provide support.

@timholy (Member Author) commented Mar 21, 2019

One problem: JuliaLang/julia#31429

@KristofferC (Member) commented Mar 22, 2019

With regards to performance there are two ways forward as I see it.

  • Just micro-optimizing the interpreter.
    • Might be able to fix the most egregious performance cases, but it will always be a lot slower than compiled code.
    • However, it might be fast enough (cf. Python)? The upside is perfect debug information.
    • Can be done incrementally.
  • Doing something like what is proposed here: tagging along a context, Cassette-style, and then letting the optimizer loose on the newly created method. This is of course great from a performance point of view, but there are quite a few drawbacks:
    • Complexity of implementation. The interpreter is almost purely in Julia and is decently understandable without needing to know too much about the internals of Julia.
    • Optimizations generally lead to worse debug info; this could probably be remedied with optimization settings.
    • Worse compilation time: we now have to compile more code than when running the code normally.
    • This starts to look a lot like Cassette, which raises the question of what we are going to do here that is fundamentally different from Cassette. Could we use Cassette here in combination with the interpreter?

I think it is important if we go for the second approach, the one suggested in this PR, that we are sure we will be able to retain enough debug information to keep the excellent experience that the current debugger has (when it is not too slow of course).

Sorry for the lack of "meat" in this post; I'm just jotting down some initial thoughts I had. I'm also echoing how nice it would be to have some comment here from the compiler team on what they think is a good way forward.

@KristofferC (Member) commented:

For example, the overdub example in https://jrevels.github.io/Cassette.jl/latest/overdub.html looks pretty much exactly like the foo#instrumented!(x, y, #framedata#) suggestion.

@timholy (Member Author) commented Mar 22, 2019

We should definitely apply what micro-optimizations we can. #206 might be a good landing place for ideas.

Initially, my goal was to make this a platform for also working on the compiler latency problem. I am a little less optimistic about that now than I was, but obviously it would still be nice to keep. Some of the iterate analysis in #206 might help decide whether to run a frame in compiled vs. interpreted mode, and we might even be able to compile "just the loop", which would be an interesting optimization.

W.r.t. Cassette and related ideas, I do think there would be something lost, but because we still have the full interpreter I am not certain it would be limiting. See my point in the OP about copying the state of the caller stack and then re-running in full-interpreter mode---you may not get everything you want when you first execute the frame, but if you can easily do it again in more detail a second time, then everything should be OK.

> For example, the overdub example in https://jrevels.github.io/Cassette.jl/latest/overdub.html looks pretty much exactly like the foo#instrumented!(x, y, #framedata#) suggestion.

Not quite. The key point I was trying to make is this: ultimately (once we've done every performance optimization we can think of), framedata creation is going to be the bottleneck; the only way to get performance that's even close to compiled code will be to do much less of it. IIUC (and I may not), Cassette will instrument every call. That means a frame gets created for everything (as it does here now). If instead we run the optimizer first and it does inlining, then we instrument only the non-inlined calls. For a tight loop that should have huge performance implications: instead of creating a framedata for each call to iterate, each getindex, and each setindex!, you run all those operations in the parent frame. (E.g., 1 framedata vs 3n framedatas, where n is the number of iterations.)
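For a concrete picture of the count, consider a simple loop; which callees end up inlined is an assumption here, but it is the typical outcome for such tiny methods:

    # Illustration of the framedata-count argument. If the optimizer runs first, the
    # tiny callees below are (typically) inlined, so only scale! itself needs a
    # framedata; instrumenting before optimization needs one framedata per call instead.
    function scale!(A, c)
        for i in eachindex(A)   # iterate: one framedata per call if not inlined
            A[i] = c * A[i]     # getindex and setindex!: likewise
        end
        return A                # after inlining: a single framedata, for scale! itself
    end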

So I think this strategy might easily get us more than an order of magnitude better performance than what you could hope for from Cassette.

@vchuravy (Member) commented:

> Cassette will instrument every call. That means a frame gets created for everything (as it does here now).

Yes and no. Cassette would create a new frame per call, but that might be inlined, and then LLVM optimizations kick in. So the question becomes more: can we reuse the frames for repeated calls in a loop?

OTOH, instead of the callee creating the frame, let the caller create the frame and reuse it when looping. But now I am purely speculating; I haven't looked at the interpreter internals enough.

@timholy (Member Author) commented Mar 22, 2019

> but that might be inlined

Frame creation is really expensive, so inline_worthy (at least with default parameters) will always return false for any wrapper-instrumented call. You want to make the decision to instrument after you know how expensive the non-instrumented call is.

EDIT: I like the idea of having the parent reuse the frame. We do a bit of that now via the recycle mechanism, but it's not as well optimized as literally reusing the same frame in the same spot each time would be. However, without inference you can't even guarantee that the same method will be called at each line, so it's not entirely obvious you can do much better than we do now.
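A minimal sketch of what caller-side, per-call-site reuse could look like. CachedFrame and the helper names are hypothetical, not existing JuliaInterpreter API, and the framecode check is exactly the "same method at each line" caveat:

    # Hypothetical sketch of caller-side frame reuse, keyed on the call site.
    mutable struct CachedFrame
        framecode::Any            # which method's framecode this frame belongs to
        locals::Vector{Any}       # storage that gets wiped and reused
    end

    const framecache = Dict{Tuple{UInt,Int},CachedFrame}()  # (parent framecode id, stmt idx) => frame

    function frame_for_call!(parent_framecode, stmtidx::Int, callee_framecode, nlocals::Int)
        key = (objectid(parent_framecode), stmtidx)
        fr = get(framecache, key, nothing)
        if fr === nothing || fr.framecode !== callee_framecode
            # first visit, or a different method got called here this time: allocate fresh
            fr = framecache[key] = CachedFrame(callee_framecode, Vector{Any}(undef, nlocals))
        else
            fill!(fr.locals, nothing)   # cheap in-place reset instead of a new allocation
        end
        return fr
    end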

@KristofferC (Member) commented:

If the "only" difference is the order between instrumentation and optimization, couldn't Cassette be made to instrument optimized code? On the surface, the purpose of Cassette seems very similar to what we want to do here: rewrite the IR and pass along a context.

Tagging @jrevels in case he is interested in the discussion.

@KristofferC (Member) commented:

What's the thought about breakpoints? Would we insert a shouldbreak(#framedata#, stmtidx) && debugger_hook(#framedata#) between all statements?

@timholy (Member Author) commented Mar 25, 2019

The framecode, not the framedata, holds the breakpoints. So I think this is something that would require either recompilation or a specialized design (we could pass in both #framedata# and #breakpoints#, for example).

There seems to be a tradeoff here: runtime performance or compile-time performance? I'd probably first try passing in #breakpoints# and putting in line-by-line checks as you say. But if that turns out to be a huge performance bottleneck, then it would be worth considering whether one should just insert the ones for which isassigned(breakpoints, idx) is true. That would require recompilation anytime someone inserted a breakpoint at a new location.
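A minimal sketch of the first option, reusing the inc1 illustration from the top post; shouldbreak and debugger_hook are just the placeholder names from this thread, not existing API:

    # Sketch of option 1: pass a #breakpoints# container alongside #framedata# and
    # emit a cheap check before every statement. Placeholder names throughout.
    shouldbreak(breakpoints::BitVector, idx::Int) = breakpoints[idx]
    debugger_hook(framedata) = @info "hit a breakpoint" framedata   # stand-in for dropping into the UI

    function inc1_breakable!(x, framedata, breakpoints::BitVector)
        shouldbreak(breakpoints, 1) && debugger_hook(framedata)
        ssa1 = x + 1
        shouldbreak(breakpoints, 2) && debugger_hook(framedata)
        return ssa1
    end

    inc1_breakable!(3, nothing, falses(2))   # no breakpoints set: runs straight through, returns 4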

Also inlining would make it much harder to insert breakpoints. Ouch. It might still be possible using the LineInfoNodes, though statement-by-statement correspondence with the lowered code could be lost.

@KristofferC (Member) commented:

Recompiling to insert breakpoints kinda seems like a non-starter. Sure, there will be a performance hit to check the break condition but it should be predicted correctly and it is not like we are trying to keep things SIMDing while debugging.

A project I want to try out, just to get a feeling for the performance, is to "rewrite" the interpreter using Cassette, while keeping most of the data structures that we have here. So we would turn something like

1 ─ %1 = (Base.float)(x)
│   %2 = (Base.float)(y)
│   %3 = %1 / %2
└──      return %3

into (something very roughly like):

    insert_locals!(ctx, x=x, y=y)
    ctx.pc = 1
    should_break(ctx, 1) && return hook(ctx)
    %1 = (Base.float)(x)
    insert_ssa!(ctx, 1, %1)
    ctx.pc = 2
    should_break(ctx, 2) && hook!(ctx)
    %2 = Cassette.overdub(ctx, Base.float, y)
    insert_ssa!(ctx, 2, %2)
    ctx.pc = 3
    should_break(ctx, 3) && hook!(ctx)
    %3 = Cassette.overdub(ctx, /, %1, %2)
    insert_ssa!(ctx, 3, %3)
    ctx.pc = 4
    should_break(ctx, 4) && hook!(ctx)
    return %3

where ctx is (similar to) our Frame, and hook! would allow us to return to e.g. the debugger interface (which can modify ctx and thereby change where the next should_break will happen, perhaps depending on the input command n or nc etc.). The advantage of this, from what I can see, is that we move a lot of the Expr matching on the lowered code to compile time. While it will not execute at close to native speed, I am just curious what the typical slowdowns are.
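For concreteness, a toy version of what ctx and those helpers could look like; all names here are hypothetical, and in practice ctx would carry the same information as our Frame:

    # Toy sketch of the ctx used above, with hypothetical field and function names.
    mutable struct DebugCtx
        pc::Int
        locals::Dict{Symbol,Any}
        ssavalues::Vector{Any}
        breakpoints::BitVector
    end

    function insert_locals!(ctx::DebugCtx; kwargs...)
        for (name, val) in kwargs
            ctx.locals[name] = val
        end
    end

    insert_ssa!(ctx::DebugCtx, i::Int, val) = (ctx.ssavalues[i] = val; val)
    should_break(ctx::DebugCtx, i::Int) = i <= length(ctx.breakpoints) && ctx.breakpoints[i]
    hook!(ctx::DebugCtx) = @info "would hand control to the debugger UI here" ctx.pc   # placeholder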

@vchuravy (Member) commented:

@KristofferC let me know if I can help with the Cassette side of things to try this out! Cassette has a metadata mechanism that one could use to make these things feasible.

@timholy (Member Author) commented Mar 25, 2019

Seems like a worthy experiment. It would also be worth seeing if that's essentially what MagneticReadHead does. If so, one possibility would be to merge the two projects? CC @oxinabox.

@timholy (Member Author) commented Mar 25, 2019

Here's an altered version of my proposal in the top post, with the goal of losing nothing in terms of usability:

  1. When you start on a frame (and know the argument types), call inference with optimize=false. The goal here is to discover which calls have definite argtypes (and thus you know in advance which method will be called).
  2. For the called methods, run Core.Compiler.inline_worthy on them.
  3. For those deemed worth inlining, do the inlining in the lowered code. (That's basically just splicing their AST in place.) It's quite possible that Meta.partially_inline! might be useful here; I haven't looked carefully. Except, of course, we also have to insert the breakpoints from the method we're inlining.
  4. Keep a record of all places that a particular method got inlined (backedges); we'd need to update all of them anytime someone adjusted the breakpoints. (A minimal bookkeeping sketch follows below.)
  5. Compile the appropriately-modified frame as discussed above.

This gets us the benefits of reduced framedata-creation while not losing anything, I think, in terms of usability.
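The backedge record in step 4 is mostly bookkeeping; here is a minimal sketch with placeholder types. The host is whatever identifies the instrumented method that received the splice, and nothing here is existing JuliaInterpreter API:

    # Sketch of the step-4 bookkeeping: remember every place a method's lowered code
    # was spliced in, so a breakpoint change in that method can invalidate or update
    # all of its splice sites.
    const inline_backedges = Dict{Method,Vector{Tuple{Any,UnitRange{Int}}}}()

    function record_inline!(m::Method, host, stmtrange::UnitRange{Int})
        sites = get!(Vector{Tuple{Any,UnitRange{Int}}}, inline_backedges, m)
        push!(sites, (host, stmtrange))   # host = the instrumented framecode that received the splice
        return sites
    end

    function on_breakpoints_changed!(m::Method)
        for (host, stmtrange) in get(inline_backedges, m, Tuple{Any,UnitRange{Int}}[])
            # re-instrument / recompile `host` over `stmtrange` here
        end
    end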

@oxinabox (Contributor) commented:

(I will return and comment on this later; in general I am down to help with any Cassette-related things you need, but I have yet to look at this plan.)

@oxinabox (Contributor) commented:

Cool cool cool.

Some comments, fairly scattered. I'm sure some readers are aware of much of this already, e.g. which bits are like MagneticReadHead, but for those who aren't, I will mention it.

> But let me explain some of the overall strategy here. The idea is that for any method, we create an instrumented variant: in addition to doing its regular duty, every time it computes something, store intermediate results in a FrameData that gets passed in as an extra argument. Basically, the idea is that foo(x, y) becomes foo#instrumented!(x, y, #framedata#). In the instrumented variant, assignments to slots and "used" ssavalues (in the sense of framedata.used, where framedata is the framedata for foo itself) will need an extra statement inserted that performs the same assignment to the #framedata# argument.

Yes, this is indeed what MagneticReadHead.jl does. (It doesn't store every SSA, just every slot assignment, but the same principle applies.)


On running after the optimiser

For reference, Cassette runs before typing/specialization (@code_lowered, not @code_typed), which is well before the optimizer; see JuliaLabs/Cassette.jl#67.

I don't know much about the optimizer's output, but I worry that if we run after the optimizer, the IR that is available will be basically devoid of useful information. Apparently, all knowledge of slotnames is gone by then. Some discussion at oxinabox/MagneticReadHead.jl#14 (comment).

On using ccall(:jl_method_def, ...)

Both Cassette and Zygote (cc: @MikeInnes) use the trick of returning a CodeInfo from a @generated function as the mechanism for screwing around with IR-level code. As I read it, by doing ccall(:jl_method_def, ...) the middleman is cut out. I am not sure what that is going to change. I feel like we have a bit of a handle on what kinds of bugs one runs into when doing it the Cassette way, e.g. JuliaLabs/Cassette.jl#6 and FluxML/Zygote.jl#22.

I'm not sure, but I think ccall(:jl_method_def, ...) is more uncharted territory.


Breakpoints

> What's the thought about breakpoints? Would we insert a shouldbreak(#framedata#, stmtidx) && debugger_hook(#framedata#) between all statements?

This is what MagneticReadHead does.

> A project I want to try out, just to get a feeling for the performance, is to "rewrite" the interpreter using Cassette, while keeping most of the data structures that we have here.

Sounds fun and worthwhile; I would be interested in being involved / helping out.

> it would be worth considering whether one should just insert the ones for which isassigned(breakpoints, idx) is true. That would require recompilation anytime someone inserted a breakpoint at a new location.

This is problematic, because while stopped at a breakpoint in compiled code, someone might add a breakpoint somewhere in a method that will be called from the current code. For example, if foo() calls bar() and, while at a breakpoint at the start of foo, they add a breakpoint in bar, then the compiled code for foo is still going to have a reference to the previously compiled version of bar, which has not been instrumented.

I feel like this is going to lead to world-age/#265-style issues. Perhaps one might work around that by replacing all calls in an instrumented function with invokelatest?


> For those deemed worth inlining, do the inlining in the lowered code

Manually inlining, huh. Hmmm.


For reference, MagneticReadHead is only 800 lines of code, so I wouldn't worry about using Cassette introducing complexity.

@MikeInnes (Contributor) commented:

Doing this in a Cassette-like way (generated function + reflection + return CodeInfo) is the right way to go. Cassette itself doesn't let you work on typed IR, but it's easy to just write out the generator yourself and grab that IR. That's pretty much equivalent to the current jl_method_def approach but with the advantages that it's gradually getting real support in the compiler (e.g. redefining f will re-run the generator, avoiding Lyndon's 265-like issues), you don't have to explicitly run it for every method, it fully handles dynamic semantics, and so on.

I like the sound of Tim's altered proposal, for which a simple generated function is pretty much all you need, but RE the original proposal of re-using the base optimiser:

Working on typed IR works fine, with the caveat that you have to return a CodeInfo, which means converting phis back to slots and allowing type inference to run again on the optimised code. Odd, but technically very easy, and it gets the job done. Allowing inlining also compromises your ability to intercept specific calls, though you likely don't need this (that may change as these kinds of optimiser plugins become officially supported).

There is the issue that Base's IR does not currently preserve much debug info; that's probably the only place where you need fixes in Base here. But this is not a research problem, all compilers do it, and it's clearly necessary long-term for any serious debugging effort; it just needs some straightforward hacking on the SSA data structures.

@timholy (Member Author) commented Mar 28, 2019

I don't think there's any advantage in doing it via a generated function; jl_method_def does all the good things you cite (like handling cache-invalidation for 265), and it's a bit more direct. The generated approach ends up calling eval which then invokes the C-interpreter which then calls jl_method_def. But since we already have all the pieces (because we're working with frames), it's likely to be easier to call jl_method_def directly than it would be to set up the generated function (not that it would be hard, of course).

Anyway, it sounds like we have several plans. I suspect the "altered" version is the way to go and will give excellent results. I think inlining will be a must if we're ever to get anything resembling normal performance.

@Keno (Contributor) commented Mar 28, 2019

I'll have more to chime in here later, but just a note that the yakc mechanism (JuliaLang/julia#31253) covers the case of just wanting to run some CodeInfo as a one-off.

@timholy (Member Author) commented Mar 29, 2019

Just a brief update. With some of the really lovely changes @KristofferC has made recently, we're now competitive with Julia's standard execution mode on tasks like plot(rand(5)) for the first call: JIT time is about tied with the overhead of JuliaInterpreter. Moreover, virtually all of the cost is frame creation. If we could beat that down (dramatically), then we'd be entering territory where we can run certain classes of code more quickly in interpreted mode than in compiled code.

So, another thought: rather than compiling frames, what about a "loop compiler"? To me it seems likely that the vast majority of slow-to-interpret code will fall in one of two categories, either having loops or using deep recursion (e.g., fibonacci(20)). Loops can be easily detected by backwards-pointing gotos. What about the following:

  1. As you execute a frame, look for a backward-pointing goto; when you find one, trigger a special mode (a minimal detection sketch follows just after this list).
  2. Analyze the basic blocks that comprise the loop. In the simplest case, we're looking for constant variables (slots that never get assigned within the loop). This is essentially a cheap form of inference, because (1) we've already run through the loop once and know which methods got called, and (2) methods called on static objects won't change.
  3. Perform what inlining we can on just the loop body, to avoid the cost of frame creation.
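Step 1 is cheap to check for: in lowered code, a loop shows up as a goto whose target precedes it. A minimal sketch; it only looks at unconditional Core.GotoNode branches, and conditional branches (Expr(:gotoifnot, ...) or Core.GotoIfNot, depending on the Julia version) would need the same treatment:

    # Minimal sketch of backward-goto detection on lowered code (step 1 above).
    function has_backward_goto(ci::Core.CodeInfo)
        for (i, stmt) in enumerate(ci.code)
            if stmt isa Core.GotoNode && stmt.label < i
                return true
            end
        end
        return false
    end

    function sumto(n)
        s = 0
        for i in 1:n
            s += i
        end
        return s
    end

    has_backward_goto(first(code_lowered(sumto, (Int,))))  # true: the loop's back-edge is a backward goto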

The likely sticking point is that not enough will be constant: for example, in a for loop the iterator won't change, but the state variable will. This is not yet well-enough fleshed out for me to know what to do about that, but one option would be to do a variant of inlining that is effectively union-splitting:

if isa(state, TypeWeSawTheFirstTime)
    <inlined code goes here>
else
    <generic call goes here>
end

Given the recent improvements from @KristofferC's elegant contributions, I am beginning to become optimistic that we might get more benefit from doing this than from creating a compiled variant of what we do now.

@KristofferC (Member) commented Mar 29, 2019

> Moreover, virtually all of the cost is frame creation.

How have you determined this? Shouldn't the recycle mechanics handle this quite well? For interpreting plot(rand(5)), frames are created from scratch 39 times and from a recycled frame 1624154 times.

@timholy (Member Author) commented Mar 29, 2019

It's almost all the recycling (maybe I should have said "frame setup" rather than "frame creation"). I just did @profile (p = @interpret(plot(rand(5))); @interpret(display(p))) and checked with ProfileView. Near the top of the vast majority (it would be good to quantify) of the flames is a call to get_call_framecode, prepare_framedata, or the operations of recycle itself (e.g., rehash!). The only other thing that comes close is maybe_evaluate_builtin. We do have some cases where dynamic dispatch is still being used, so this may overestimate the impact of frame setup, but I don't think by a lot.

@KristofferC (Member) commented Mar 31, 2019

I don't see how manual inlining will have a big effect without reducing debugging information. If we want to provide the same information as we do now to someone debugging, we have to keep track of pretty much the same things as we do now with the FrameData anyway.

Also, any time we run something compiled, we are going to have trouble with break_on (unless we prove it can't throw, for which I know there is some machinery in Compiler).

@timholy (Member Author) commented Mar 31, 2019

I've put a rough test up here: https://gist.github.com/timholy/369c3fbf5d64ee09c3f9692e2db6c489
It's far from perfect---it's not fully reduced to a bunch of builtins and intrinsics, and prepare_frame_caller still accounts for 40% of the runtime---but it gives some idea. It's not as much of a help as I had hoped, but it does drop the runtime by a factor of about 3. It's pretty much the first time that I've seen steps of the interpreter other than frame preparation show up at the top level of the profile.

So complete inlining might give us something like a factor of 6. Not as much as I was hoping, but not bad either.

@timholy (Member Author) commented Apr 1, 2019

For the record, here are some performance numbers. For these, I simulated complete inlining in the summer_inlined benchmark by replacing the call to promote_type with a couple of typeasserts (essentially mimicking what "correct" @pure handling would do) and got rid of the not_same_type call. As a consequence there are no recursive calls other than the eltype in the first couple of lines, and it only runs once. For all of these, A = rand(10^5), and the reported times are per iteration.

| Command                     | summer    | summer_inlined |
|-----------------------------|-----------|----------------|
| JuliaInterpreter.@interpret | 17.1 μs   | 3.6 μs         |
| MagneticReadHead.@iron      | 346.8 μs  | 23.0 μs        |
| Native execution            | 0.0045 μs | 0.0045 μs      |

Inlining yielded a 15x advantage for MagneticReadHead but only a 5x advantage for JuliaInterpreter. For JuliaInterpreter, here's an accounting of the cost:

  • 40% for the call to maybe_evaluate_builtin
  • 15% for do_assignment!
  • 16% for lookup_var and expansion of @lookup
  • ~10% for the try/catch in step_expr!

The remaining 19% is fairly widely scattered.

@KristofferC (Member) commented:

Looks good! And as long as we can map everything in the inlined function back to the original one, we should be fine with debug info. This mapping doesn't have to be very fast to retrieve, since we only need it when showing information to the user. I guess that is the plan?

@oxinabox (Contributor) commented Apr 1, 2019

This is incredibly useful information for MagneticReadHead development. It also suggests, more generally, that Cassette should be doing some more inlining, if we can work out how to generalize this. Generalizing it is very desirable; it came up in the ML call that Zygote really needs it to make higher-order derivatives work without generating a bunch of pointless code.

The other useful datapoint you have here is that a naive Cassette implementation (which is what MagneticReadHead is) is not going to, in and of itself, give you performance improvements.

@timholy (Member Author) commented Apr 1, 2019

I added a row for "native execution" above. Another useful data point: I created a "minimally instrumented" version of the form used by JuliaInterpreter. This is frankly overly optimistic; normally we'd also instrument the return from iterate etc. With that caveat:

function summer_instrumented!(A, framedata)
    s = zero(eltype(A))
    framedata.locals[2] = Some{Any}(s)      # mirror the assignment to slot 2 (`s`)
    framedata.last_reference[:s] = 2
    for a in A
        framedata.locals[5] = Some{Any}(a)  # mirror the assignment to slot 5 (`a`)
        framedata.last_reference[:a] = 5
        s += a
        framedata.locals[2] = Some{Any}(s)
        framedata.last_reference[:s] = 2
    end
    return s                                # regular duty: same return value as summer
end

This has a time per iteration, when running in Julia's native mode, of 0.16μs. So as soon as you add any instrumentation, there's a 35x performance hit. On the fully-inlined version, JuliaInterpreter is already within 20x of this, and within 100x on the non-inlined version.

So while we have a ways to go, we're closer to optimal than I thought---I'd be really impressed if we can gain more than 10x compared to where we are now no matter how hard we're willing to work. Really, the only way to recover true compiled-code performance is to interact with native call stacks, aka Gallium.
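For anyone wanting to reproduce that per-iteration figure, a mock framedata is enough to run the snippet above. The field types here are guesses for illustration, not JuliaInterpreter's actual FrameData, and BenchmarkTools is assumed to be available:

    using BenchmarkTools

    # Mock stand-in so summer_instrumented! above can be run; not the real FrameData.
    struct MockFrameData
        locals::Vector{Any}
        last_reference::Dict{Symbol,Int}
    end

    A = rand(10^5)
    fd = MockFrameData(Vector{Any}(undef, 5), Dict{Symbol,Int}())
    t = @belapsed summer_instrumented!($A, $fd)
    t / length(A)   # total time divided by the iteration count gives the per-iteration figure quoted above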

@timholy (Member Author) commented Apr 1, 2019

> I guess that is the plan?

I'm not certain what the plan is 😄. I'm a bit greedy, and given the effort involved in implementing inlining, 5x seems like less gain than I was hoping for. I think we should continue to think about it.

It is possible to build fast interpreters: https://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-1-an-interpreter/ gets down to about 30 nanoseconds per iteration, though I'm unsure of the relevance for a language as complex as Julia. I find myself contemplating modes in which we "record" the actual actions taken (using, e.g., integer tokens so everything is inferrable) and then run through that recording when handling loops. That's a little too vague to be called "a plan" but perhaps it conveys the flavor of what I'm currently thinking of. And yes, whatever we do, we need to leave clear breadcrumbs so that one can map back to original source code.

@oxinabox (Contributor) commented Jul 14, 2019

For reference, here is a measure of the performance of MRH these days. I think getting compiled frames into Debugger.jl may be worth it. Let's plan to talk at JuliaCon.

[benchmark plot produced by the script below: execution time of summer vs. array size, for Debugger, MRH, and native execution]

using Debugger, MagneticReadHead, Plots

function summer(A)
    s = zero(eltype(A))
    for a in A
        s += a
    end
    return s
end

# Warmup
summer(ones(3))
Debugger.@run summer(ones(3))
MagneticReadHead.@run summer(ones(3))


naitive_time = Float64[]
debugger_time = Float64[]
mrh_time = Float64[]

lens = 2 .^ (0:16)

let
    for len in lens
        data = rand(len)
        push!(naitive_time, @elapsed summer(data))
        push!(debugger_time, @elapsed(Debugger.@run summer(data)))
        push!(mrh_time, @elapsed(MagneticReadHead.@run summer(data)))
    end
end

plot(lens, [debugger_time, mrh_time, naitive_time], 
    label=["Debugger", "MRH", "Native"],
    title = "`summer` Benchmark",
    xlabel = "Size of Array",
    ylabel = "Execution Time (seconds)",
    linewidth = log2.(log2.(lens)),
    legend=:topleft,
    dpi=400,
)

savefig("benchmark.png")

@KristofferC (Member) commented Jul 14, 2019

It is interesting and would be good to have a discussion but I just want to point out some drawbacks:

Using, e.g.,

using SparseArrays: sprand

g() = sprand(5, 5, 0.5) + sprand(5, 5, 0.5)

f() = rand(5, 5) + rand(5, 5)

I get

julia> using MagneticReadHead

julia> @time @run g()
208.436596 seconds (191.57 M allocations: 7.703 GiB, 2.74% gc time)

julia> @time @run f()
 21.894716 seconds (42.04 M allocations: 1.778 GiB, 5.27% gc time)

vs

julia> using Debugger

julia> @time @run g()
  5.388586 seconds (10.10 M allocations: 495.592 MiB, 3.60% gc time)

julia> @time @run f()
  0.114333 seconds (206.46 k allocations: 9.683 MiB, 4.99% gc time)

so you pay a hefty compilation price even if you are just debugging trivial functions.

In order to do any real comparison I feel we also need stuff like oxinabox/MagneticReadHead.jl#56 fixed to see how things scale to real code.

@oxinabox (Contributor) commented:

Absolutely, I agree. MRH is no magic bullet. And compile time really matters for debugging, particularly since it is very common to edit code in between every use of the debugger, so it doesn't matter how fast it is on the second run.

Of course, the unchanged library code will still be cached, so it will be faster the second time. I suspect what we should be thinking about is something like a precompiled image for Base + stdlibs, with debugging instrumentation compiled in; then, when those are hit, use compiled mode, and otherwise use interpreted mode. But that is not as easy as it sounds.
