-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ConsList Performance Slowdown #188
Comments
I tried in master to implement |
I'll have to understand how the |
I refactored all tailrec methods with while loops in order to be sure they were not the cause of the performance drop. I benchmarked
However, I found out a case where performance drop by a factor of 10, here is a link to the code: https://github.com/julien-truffaut/brique/blob/bbb28176cd8310a2ccf2c01a8939f210604898bb/core/src/main/scala/brique/ConsList.scala#L154-L168
|
I run the entire benchmark with and without miniboxing, you can find the results here: https://docs.google.com/spreadsheets/d/1tTPKvkjqKVxXm6BeO0yLm1cFbTtkLGKrLg6LWrVyDZc/edit?usp=sharing |
Very strange, this is what I'm getting: With miniboxing: [info] Benchmark (size) Mode Cnt Score Error Units
[info] ConsListBench.appendConsList 10 avgt 15 0.198 ± 0.013 us/op
[info] ConsListBench.appendConsList 100 avgt 15 1.350 ± 0.054 us/op
[info] ConsListBench.appendConsList 1000 avgt 15 13.756 ± 0.661 us/op
[info] ConsListBench.appendConsList 10000 avgt 15 111.328 ± 4.386 us/op
[info] ConsListBench.appendConsList 100000 avgt 15 1345.934 ± 34.768 us/op
[info] ConsListBench.appendConsList 1000000 avgt 15 20270.947 ± 1249.934 us/op
[info] ConsListBench.appendList 10 avgt 15 0.156 ± 0.009 us/op
[info] ConsListBench.appendList 100 avgt 15 1.251 ± 0.060 us/op
[info] ConsListBench.appendList 1000 avgt 15 11.972 ± 0.197 us/op
[info] ConsListBench.appendList 10000 avgt 15 244.344 ± 6.984 us/op
[info] ConsListBench.appendList 100000 avgt 15 2464.826 ± 171.291 us/op
[info] ConsListBench.appendList 1000000 avgt 15 11816.217 ± 1596.650 us/op without miniboxing: [info] # Run complete. Total time: 00:06:20
[info]
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ConsListBench.appendConsList 10 avgt 15 0.090 ± 0.002 us/op
[info] ConsListBench.appendConsList 100 avgt 15 0.832 ± 0.031 us/op
[info] ConsListBench.appendConsList 1000 avgt 15 8.634 ± 0.510 us/op
[info] ConsListBench.appendConsList 10000 avgt 15 81.850 ± 5.412 us/op
[info] ConsListBench.appendConsList 100000 avgt 15 1031.133 ± 33.813 us/op
[info] ConsListBench.appendConsList 1000000 avgt 15 15298.762 ± 1205.139 us/op
[info] ConsListBench.appendList 10 avgt 15 0.143 ± 0.006 us/op
[info] ConsListBench.appendList 100 avgt 15 1.246 ± 0.044 us/op
[info] ConsListBench.appendList 1000 avgt 15 11.854 ± 0.792 us/op
[info] ConsListBench.appendList 10000 avgt 15 236.025 ± 6.038 us/op
[info] ConsListBench.appendList 100000 avgt 15 2481.521 ± 165.411 us/op
[info] ConsListBench.appendList 1000000 avgt 15 11137.584 ± 1358.864 us/op |
@julien-truffaut can you tell a bit more about your setup? Are you running JVM in client mode? |
@DarkDimius I am not sure, how can I check? I am running the benchmarks on my mac book pro |
I think I see the problem. Running with warn] /mnt/data1/Work/Workspace/dev/brique/core/src/main/scala/brique/ConsList.scala:272:
The class scala.Tuple2 would benefit from miniboxing type parameter T1, since it is
instantiated by miniboxed type parameter A of method unapply.
[warn] final case class Cons[@miniboxed A](head: A, tail: ConsList[A]) extends ConsList[A]
[warn] ^ Translated to English: In every match/case such as I moved #155 to milestone 0.4, so I fix it by the end of March. But at least we have a lead :) |
An alternative approach, which we used in the linked lists, was to expose the final def foldLeft[@miniboxed B](b: B)(f: (B, A) => B): B = {
var acc = b
var l = this
while(true){
l match {
case Cons(h, t) =>
acc = f(acc, h)
l = t
case CNil() => return acc
}
}
acc
} to: final def foldLeft[@miniboxed B](b: B)(f: (B, A) => B): B = {
var acc = b
var l = this
while(true){
l match {
case cons: Cons =>
acc = f(acc, cons.head)
l = cons.tail
case CNil() => return acc
}
}
acc
} Notice the two cases: case Cons(h, t) =>
acc = f(acc, h)
l = t and case cons: Cons =>
acc = f(acc, cons.head)
l = cons.tail |
ah it makes sense, thank you @VladUreche |
Seems even with this, miniboxing is still slowed down: [info] Benchmark (size) Mode Cnt Score Error Units
[info] ConsListBench.appendConsList 10 avgt 15 0.179 ± 0.006 us/op
[info] ConsListBench.appendConsList 100 avgt 15 1.147 ± 0.021 us/op
[info] ConsListBench.appendConsList 1000 avgt 15 11.635 ± 0.171 us/op
[info] ConsListBench.appendConsList 10000 avgt 15 102.367 ± 1.427 us/op
[info] ConsListBench.appendConsList 100000 avgt 15 1238.197 ± 63.866 us/op
[info] ConsListBench.appendConsList 1000000 avgt 15 19298.500 ± 1295.293 us/op
[info] ConsListBench.appendList 10 avgt 15 0.154 ± 0.017 us/op
[info] ConsListBench.appendList 100 avgt 15 1.212 ± 0.065 us/op
[info] ConsListBench.appendList 1000 avgt 15 11.223 ± 0.074 us/op
[info] ConsListBench.appendList 10000 avgt 15 229.220 ± 1.908 us/op
[info] ConsListBench.appendList 100000 avgt 15 2269.490 ± 161.668 us/op
[info] ConsListBench.appendList 1000000 avgt 15 9624.083 ± 1374.385 us/op The commits are: VladUreche/brique@a13de7b + VladUreche/brique@1f7389c I enabled the |
I guess the only solution is to take one of the benchmarks and see the bytecode, since I can't tell where the slowdown is coming from... |
I am not familiar with byte code, could you show me how to output the byte code of a particular method? |
If you do |
@julien-truffaut, if you have some time, the best thing you could do would be to take one of the benchmarks (one that exhibits the slowdown) and minimize it to a single file, eliminating all external dependencies -- this way, it's going to be much easier to see the bytecode. |
I created the following minimal example: here are the results of the benchmarks with miniboxing:
and here is the byte code generated by
I am not sure what is this line corresponding to:
|
Thanks a lot @julien-truffaut! I will look into this by Saturday evening (SF time). |
@VladUreche did you get some time to look into this? |
@julien-truffaut no, sorry, I've been working full-time on the ScalaDays presentation. Let me have a quick look now. |
@julien-truffaut, I was able to isolate the problem down to a single file:
|
And yes, I can see the 10-20% slowdown. Looking into it now :) |
I looked at the bytecode and the JIT compilation process and they're identical across the generic and miniboxed versions of the code. The problem seems to be related to GC cycles: the miniboxed version of the code triggers more GC cycles, which in turn cause the 10-20% performance drop. I tried reasoning about the object size on the heap for miniboxed and generic Right now, I would leave this pending since I don't see an obvious direction. You should get a much better performance when you start mapping over the TList, since there you'll have a clear benefit. @julien-truffaut, WDYT? |
Since there was too much text, here's the link: https://gist.github.com/VladUreche/9faa054d78c00fd66f33 |
Also the |
@VladUreche thanks for looking into that. A few points, I would not rely on benchmarking with loops and |
@julien-truffaut, I have a proof that the slowdown is indeed caused by the different layout in miniboxed classes and its interaction with the GC. Here's the proof: We can easily create a list using the miniboxed object-based versions of
Then, running the benchmark on the generic version of the code (with Now, I think you could use the same technique to debug slowdowns in the other benchmarks: you can change the input to generate the object miniboxed versions of the class in https://github.com/julien-truffaut/brique/blob/miniboxing/bench/src/main/scala/brique/bench/input/TListInput.scala: @Setup
def setup(): Unit =
tList = (range(size).foldRight[TList[Any]](TNil())(TCons(_, _))).asInstanceOf[TList[Int]] I will leave the bug open until I have some time to look into all the slowdowns -- some may be legitimate, so I really want to understand the problem. |
Sure, that makes sense. I wanted a self-contained (literally zero-dependency) benchmark to run in the debug build of the JVM.
Can you please run with the Any version as well for the input (described in the previous message)? I would be curious to see how much of the slowdown can be attributed to the in-memory class layout.
Yes, that's exactly why I left the bug open. Now I'm a bit pressed to write a paper (deadline is on the 26th of March, in 4 days from now) but as soon as I'm done with that I will look into the other slowdowns as well. One more point: Keep in mind that Scala collections do all kinds of tricks to improve performance, for example the |
@VladUreche no problem take your time, I just wanted to be sure that we agreed on the importance of this issue. Indeed scala list uses some tricks but we can outperform them almost everywhere (except |
Based on the numbers you gathered, I prepared a version of the spreadsheet comparing generic and miniboxed times: https://docs.google.com/spreadsheets/d/1tTPKvkjqKVxXm6BeO0yLm1cFbTtkLGKrLg6LWrVyDZc/edit?usp=sharing
Given the red areas, I think the main priority is to check And yes, I agree it's important to have a good idea of what is happening -- for some slowdowns we may not be able to do much, since they're not under our control, but at the very least I must provide a proper explanation of each slowdown and why it occurs. Btw, I'm really looking forward to seeing the numbers gathered with |
One more thing that came to mind: could you also prepare macro-benchmarks of the |
I will try to prepare some banchmarks this week or over the weekend. |
Thanks a lot @julien-truffaut! |
sorry pretty busy atm, I will have to postpone this. |
@julien-truffaut, no worries. I will have more time now, so I can isolate the benchmarks myself. |
Benchmark with miniboxing
without miniboxing
How to reproduce:
clone brique
run benchmark in branch master (non miniboxing) and branch miniboxing (with miniboxing)
The text was updated successfully, but these errors were encountered: