
Inline callsite of invoke when possible (invoke improvement No. 1) #10964

Closed
yuyichao wants to merge 1 commit from the inline-invoke1 branch

Conversation

yuyichao
Contributor

This is a refactor of my previous attempt to improve the performance of invoke (#9642).

As I mentioned on the mailing list, the previous PR needed to be refactored, so I decided to start with something relatively self-contained and (hopefully) decoupled from other parts. That should also make it easier to review.

The patch only deals with the new signature of invoke (i.e. the second argument is a tuple type rather than a tuple of types), mainly because that is much easier to handle. It does nothing for the old signature.
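For context, a minimal sketch of the two call forms (the old form is shown for contrast only; the names here are illustrative, not from the patch):

@noinline f(a, b) = a + b

# old signature: a tuple of types -- untouched by this patch
invoke(f, (Any, Any), 1, 2)

# new signature: a tuple type -- the case this patch can inline
invoke(f, Tuple{Any, Any}, 1, 2)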

Thanks to @vtjnash for the very helpful suggestions.

Tested with the following script.

@noinline f1(a, b) = a + b

g1_1() = f1(1, 2)
g1_2() = invoke(f1, Tuple{Any, Any}, 1, 2)

@assert g1_1() == g1_2()

@code_llvm g1_1()
@code_llvm g1_2()

Output on current master (note that g1_2 sets up a GC frame, constructs the tuple type at run time via @jl_f_instantiate_type, and goes through the generic @jl_f_invoke):

define i64 @julia_g1_1_43908() {
top:
  %0 = call i64 @julia_f1_43896(i64 1, i64 2)
  ret i64 %0
}

define i64 @julia_g1_2_43911() {
top:
  %0 = alloca [6 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [6 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [6 x %jl_value_t*]* %0, i64 0, i64 2
  %2 = bitcast [6 x %jl_value_t*]* %0 to i64*
  store i64 8, i64* %2, align 8
  %3 = getelementptr [6 x %jl_value_t*]* %0, i64 0, i64 1
  %4 = bitcast %jl_value_t** %3 to %jl_value_t***
  %5 = load %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t** %5, %jl_value_t*** %4, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %1, align 8
  %6 = getelementptr [6 x %jl_value_t*]* %0, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %7 = getelementptr [6 x %jl_value_t*]* %0, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %7, align 8
  %8 = getelementptr [6 x %jl_value_t*]* %0, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = load %jl_value_t** inttoptr (i64 140525719411568 to %jl_value_t**), align 16
  store %jl_value_t* %9, %jl_value_t** %1, align 8
  %10 = load %jl_value_t** inttoptr (i64 140525686087088 to %jl_value_t**), align 16
  store %jl_value_t* %10, %jl_value_t** %6, align 8
  %11 = load %jl_value_t** inttoptr (i64 140525686091216 to %jl_value_t**), align 16
  store %jl_value_t* %11, %jl_value_t** %7, align 8
  %12 = load %jl_value_t** inttoptr (i64 140525686091216 to %jl_value_t**), align 16
  store %jl_value_t* %12, %jl_value_t** %8, align 8
  %13 = call %jl_value_t* @jl_f_instantiate_type(%jl_value_t* null, %jl_value_t** %6, i32 3)
  store %jl_value_t* %13, %jl_value_t** %6, align 8
  store %jl_value_t* inttoptr (i64 140525686161872 to %jl_value_t*), %jl_value_t** %7, align 8
  store %jl_value_t* inttoptr (i64 140525686161968 to %jl_value_t*), %jl_value_t** %8, align 8
  %14 = call %jl_value_t* @jl_f_invoke(%jl_value_t* null, %jl_value_t** %1, i32 4)
  %15 = bitcast %jl_value_t* %14 to i64*
  %16 = load i64* %15, align 8
  %17 = load %jl_value_t*** %4, align 8
  store %jl_value_t** %17, %jl_value_t*** @jl_pgcstack, align 8
  ret i64 %16
}

With this patch (both g1_1 and g1_2 now compile down to a direct call to the f1 specialization):

define i64 @julia_g1_1_43911() {
top:
  %0 = call i64 @julia_f1_43899(i64 1, i64 2)
  ret i64 %0
}

define i64 @julia_g1_2_43914() {
top:
  %0 = call i64 @julia_f1_43899(i64 1, i64 2)
  ret i64 %0
}

@yuyichao mentioned this pull request on Apr 23, 2015
@yuyichao
Contributor Author

A slightly more complete test (the @code_llvm output of the master version is omitted since it's way too long):

@noinline f1(a, b) = a + b

g1_1() = f1(1, 2)
g1_2() = invoke(f1, Tuple{Any, Any}, 1, 2)

@inline t1() = Tuple{Any, Any}
@noinline t2() = Tuple{Any, Any}

g1_3() = invoke(f1, t1(), 1, 2)
g1_4() = invoke(f1, t2(), 1, 2)

@assert g1_1() == g1_2() == g1_3() == g1_4()

@code_llvm g1_1()
@code_llvm g1_2()
@code_llvm g1_3()
@code_llvm g1_4()

function timing(f, args...)
    println(f, args)   # print the call being timed
    f(args...)         # warm up: force compilation before timing
    gc()               # start from a clean heap
    @time for i in 1:10000000
        f(args...)
    end
    gc()
end

timing(g1_1)
timing(g1_2)
timing(g1_3)
timing(g1_4)
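(The harness forwards any extra arguments to the timed function, so it could equally time f1 directly -- a hypothetical call, not part of the original test:

timing(f1, 1, 2)
)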

Timings on master (@code_llvm output omitted):

g1_1()
elapsed time: 0.308610973 seconds (0 bytes allocated)
g1_2()
elapsed time: 1.89693006 seconds (1449 MB allocated, 1.36% gc time in 66 pauses with 0 full sweep)
g1_3()
elapsed time: 1.892066647 seconds (1449 MB allocated, 1.41% gc time in 66 pauses with 0 full sweep)
g1_4()
elapsed time: 1.883954768 seconds (1449 MB allocated, 1.41% gc time in 66 pauses with 0 full sweep)

With this PR (@code_llvm output included):

define i64 @julia_g1_1_43914() {
top:
  %0 = call i64 @julia_f1_43903(i64 1, i64 2)
  ret i64 %0
}

define i64 @julia_g1_2_43917() {
top:
  %0 = call i64 @julia_f1_43903(i64 1, i64 2)
  ret i64 %0
}

define i64 @julia_g1_3_43918() {
top:
  %0 = call i64 @julia_f1_43903(i64 1, i64 2)
  ret i64 %0
}

define i64 @julia_g1_4_43919() {
top:
  %0 = call %jl_value_t* @julia_t2_43907()
  %1 = call i64 @julia_f1_43903(i64 1, i64 2)
  ret i64 %1
}
g1_1()
elapsed time: 0.319919904 seconds (0 bytes allocated)
g1_2()
elapsed time: 0.372261468 seconds (0 bytes allocated)
g1_3()
elapsed time: 0.323118056 seconds (0 bytes allocated)
g1_4()
elapsed time: 1.10516864 seconds (1449 MB allocated, 2.79% gc time in 66 pauses with 0 full sweep)
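The g1_4 numbers show the limits of the optimization: with @inline t1() the tuple type constant-folds away entirely, while with @noinline t2() the invoke is still resolved (hence the direct call to f1 in the IR above) but the opaque call to t2() survives and accounts for the remaining time and allocation. A distilled sketch of the pattern, with a hypothetical helper gettype() mirroring t2():

@noinline f(a, b) = a + b

fast() = invoke(f, Tuple{Any, Any}, 1, 2)   # literal type: direct call to f

@noinline gettype() = Tuple{Any, Any}
slower() = invoke(f, gettype(), 1, 2)       # invoke still resolved, but the
                                            # call to gettype() and its
                                            # per-iteration cost remain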

@yuyichao
Contributor Author

I don't know what's up with the error on Travis CI (which seems to pop up randomly on almost all commits), and is AppVeyor having some connection issue?

@pao
Member

pao commented Apr 24, 2015

The 32-bit error on Travis is indeed tupocalypse fallout and it's being worked on.

@tkelman
Contributor

tkelman commented Apr 24, 2015

I restarted the builds, though https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.4069/job/wwdnwkpxha87m2me looked like a timeout when trying to execute a trivial command with win64 Julia without using sys.dll. We've had subtle issues there before; we'll see if it happens again.

@yuyichao
Contributor Author

Was the issue repeatable before?
I asked whether it's a connection issue because both builds 1 and 2 stop responding at ~11 minutes, at different places.

@tkelman
Contributor

tkelman commented Apr 24, 2015

Was the issue repeatable before?

Yes, and that's why we added a new test for it.

This second link https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.4066/job/o3777iciq624y3jd looks like something intermittent that we have seen before (possibly #7942), where the freeze occurs during one of the first few linalg tests. The first link (build 4069) froze during the newly added test for executing a trivial command without sys.dll present. If you didn't change any substantial code between the two builds, maybe build 4069 was a fluke.

@yuyichao
Contributor Author

@tkelman Well, I rebased onto the latest master between the two builds I referred to. I haven't changed anything since the second one, and it seems that build has already passed the point where it got stuck before.

@tkelman
Contributor

tkelman commented Apr 24, 2015

I'm fine with calling it a probably-#7942-related fluke, then.

@@ -1730,6 +1730,78 @@ DLLEXPORT jl_value_t *jl_gf_invoke_lookup(jl_function_t *gf, jl_datatype_t *type
return (jl_value_t*)m;
}

jl_function_t*
jl_gf_invoke_get_specialization(jl_function_t *gf, jl_tupletype_t *types,
Sponsor Member

There seems to be a lot of code duplication between this method and jl_gf_invoke. Can you refactor the two so that the common bits are shared?

Contributor Author

I've thought about just converting the argument to jl_gf_invoke to a tuple type and using this function to get the specialized function. I didn't bother too much (not that I like this kind of code duplication either, though) because IMHO this is basically the same duplication as between jl_method_table_assoc_exact and jl_method_table_assoc_exact_by_type.

Another way I've thought about is to just use a (more C-friendly) pointer to an array of types, but again that would more or less mean rewriting/merging jl_method_table_assoc_exact and jl_method_table_assoc_exact_by_type, and I hoped to keep this patch as small as possible.

Which one do you prefer?

Contributor Author

Done.

@yuyichao
Contributor Author

@vtjnash So I fixed two of the three issues you brought up.

For the third one, it seems that emit_call makes a lot of assumptions about what emit_known_call might do, and I feel like I would have to pass back too much information (nargs/args, f) without gaining much.

I've tried replacing the emit_nthptr_recast with just mfunc->linfo->functionObject and got the same segfault as before.

@yuyichao
Contributor Author

And somehow the x86_64 test froze again... :(

@yuyichao force-pushed the inline-invoke1 branch 6 times, most recently from c8c79a9 to cb78497, on April 28, 2015
@yuyichao changed the title from "Inline callsite of invoke when possible" to "Inline callsite of invoke when possible (invoke improvement No. 1)" on May 6, 2015
@yuyichao
Contributor Author

I think I've finally understood how this code works (thanks to #11439) and why it was crashing before.

As @carnaval guessed, it was related to GC: as I noted in commit 8dce033#diff-6d4d21428a67320600faf5a1a9f3a16aR2437, the function object can be collected while it is being passed to the jlcall function. This shouldn't be an issue now, not even with #11439, since a non-wrapper jlcall specialization never accesses the function object (first) argument. I'm still not entirely happy that there's a useless and possibly invalid pointer in the generated code, though...

And I have changed it to use f->linfo->functionObject instead of emit_nthptr_recast, as @vtjnash suggested; I don't really understand why it didn't work before...

@yuyichao
Contributor Author

Replaced by #18444

@yuyichao closed this on Sep 11, 2016
@yuyichao deleted the inline-invoke1 branch on September 11, 2016
Labels: performance — Must go faster
4 participants