
Chunk - Optimise Flatten implementation. #3190

Merged — 2 commits merged into typelevel:main from the light_flatten branch on Aug 29, 2023

Conversation

@diesalbla (Contributor)

The current flatten performs one Chunk.Queue insertion per source chunk, which means it creates a new Chunk.Queue object and a new Queue object for each one. Instead, we can avoid those allocations by using a Queue builder.
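To make the two patterns concrete, here is a minimal sketch that uses the standard library's immutable Queue as an analogue for Chunk.Queue; this is illustration only, not the fs2 code itself:

```scala
import scala.collection.immutable.Queue

object FlattenSketch {
  // Per-element insertion: every `enqueue` allocates a fresh Queue
  // wrapper around the previous one.
  def viaInsertion[A](elems: List[A]): Queue[A] =
    elems.foldLeft(Queue.empty[A])((q, e) => q.enqueue(e))

  // Builder-based: elements accumulate in a mutable builder, and the
  // resulting Queue is allocated once at the end.
  def viaBuilder[A](elems: List[A]): Queue[A] = {
    val b = Queue.newBuilder[A]
    elems.foreach(b += _)
    b.result()
  }
}
```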

@armanbilge (Member)

Can we make a similar optimization here?

def apply[O](chunks: Chunk[O]*): Queue[O] =
chunks.foldLeft(empty[O])(_ :+ _)
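For illustration, a builder-based rewrite of this apply might look like the sketch below; the Chunk stand-in is an assumption, and the return type is simplified to the underlying immutable queue rather than the real Chunk.Queue wrapper:

```scala
import scala.collection.immutable.{Queue => SQueue}

// Stand-in for fs2's Chunk; the real signatures differ.
trait Chunk[+O]

object ApplySketch {
  // One builder pass instead of one `:+` (and its wrapper
  // allocations) per argument.
  def apply[O](chunks: Chunk[O]*): SQueue[Chunk[O]] = {
    val b = SQueue.newBuilder[Chunk[O]]
    chunks.foreach(b += _)
    b.result()
  }
}
```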

@mpilquist (Member)

I'm not convinced the gains here are worth it. Can we benchmark it?

@diesalbla (Contributor, Author)

> I'm not convinced the gains here are worth it. Can we benchmark it?

Sure. I hope the code is not too "dirty" for the optimisation it gains, which saves at least two object allocations per chunk. Is the worry that this may not be a frequent enough case?

@mpilquist (Member)

Yeah, I'm concerned we're inlining implementations here without proof that the allocations matter. I don't want to put too much faith into things like escape analysis, but I also want to ensure that any code duplication / inlining is justified by benchmarks showing the improvements actually matter in real use cases.

@diesalbla force-pushed the light_flatten branch 4 times, most recently from cc9201e to 91244e4 on Apr 2, 2023
@diesalbla (Contributor, Author)

As an update, I have moved the "inlined" implementation to a def build factory method within the Queue companion object, so the constructor can be kept private as it is.
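A hedged sketch of the shape described here follows; the class name, field layout, and the Chunk stand-in are assumptions rather than the actual fs2 source:

```scala
import scala.collection.immutable.{Queue => SQueue}

// Stand-in for fs2's Chunk; only `size` matters for this sketch.
trait Chunk[+O] { def size: Int }

// Assumed shape: the real fs2 Chunk.Queue differs in detail.
final class ChunkQueue[+O] private (
    val chunks: SQueue[Chunk[O]],
    val size: Int
)

object ChunkQueue {
  // Accumulate every chunk into one queue builder and construct the
  // wrapper exactly once, keeping the primary constructor private.
  def build[O](sources: Iterable[Chunk[O]]): ChunkQueue[O] = {
    val b = SQueue.newBuilder[Chunk[O]]
    var total = 0
    sources.foreach { c => b += c; total += c.size }
    new ChunkQueue(b.result(), total)
  }
}
```

Because build lives in the companion object, it can call the private constructor; code outside the companion cannot.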

@diesalbla (Contributor, Author) commented Apr 3, 2023

I have pushed a file with a benchmark for flatten, parameterised by the number of chunks to flatten and the size of each chunk. Running it with the command

Jmh/run -i 3 -wi 3 -f1 -t1 .*ChunkFlatten.* -prof gc -jvmArgs -XX:MaxInlineLevel=18
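The benchmark file itself is not reproduced in this thread; a minimal sketch of what such a ChunkFlatten benchmark could look like is below. The parameter names and the use of plain Vectors as a stand-in for fs2 chunks are assumptions, not the file from this PR:

```scala
import org.openjdk.jmh.annotations._

@State(Scope.Thread)
class ChunkFlatten {
  @Param(Array("1", "2", "10", "50"))
  var chunkSize: Int = _

  @Param(Array("8", "32", "128"))
  var numberChunks: Int = _

  var chunks: Vector[Vector[Int]] = _

  @Setup
  def setup(): Unit =
    chunks = Vector.fill(numberChunks)(Vector.range(0, chunkSize))

  @Benchmark
  def flatten: Vector[Int] = chunks.flatten
}
```

The -prof gc option in the command above is what reports the gc.alloc.rate.norm figures quoted below.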

Focusing on the gc.alloc.rate.norm of each case (the two parameter columns are followed by the mode, iteration count, score ± error, and unit), the run produces the following results for main:

:·gc.alloc.rate.norm            1               8  thrpt    3       688.000 ±       0.001    B/op
:·gc.alloc.rate.norm            1              32  thrpt    3      2608.000 ±       0.001    B/op
:·gc.alloc.rate.norm            1             128  thrpt    3     10288.000 ±       0.001    B/op
:·gc.alloc.rate.norm            2               8  thrpt    3       688.000 ±       0.001    B/op
:·gc.alloc.rate.norm            2              32  thrpt    3      2608.000 ±       0.004    B/op
:·gc.alloc.rate.norm            2             128  thrpt    3     10288.000 ±       0.005    B/op
:·gc.alloc.rate.norm           10               8  thrpt    3       688.000 ±       0.001    B/op
:·gc.alloc.rate.norm           10              32  thrpt    3      2560.000 ±       0.002    B/op
:·gc.alloc.rate.norm           10             128  thrpt    3     10288.001 ±       0.016    B/op
:·gc.alloc.rate.norm           50               8  thrpt    3       688.000 ±       0.001    B/op
:·gc.alloc.rate.norm           50              32  thrpt    3      2608.000 ±       0.001    B/op
:·gc.alloc.rate.norm           50             128  thrpt    3     10288.001 ±       0.016    B/op

And these are the results with the changes in this PR:

:·gc.alloc.rate.norm            1               8  thrpt    3       360.000 ±       0.001    B/op
:·gc.alloc.rate.norm            1              32  thrpt    3       936.000 ±       0.001    B/op
:·gc.alloc.rate.norm            1             128  thrpt    3      3240.001 ±       0.012    B/op
:·gc.alloc.rate.norm            2               8  thrpt    3       192.000 ±       0.001    B/op
:·gc.alloc.rate.norm            2              32  thrpt    3       936.000 ±       0.003    B/op
:·gc.alloc.rate.norm            2             128  thrpt    3      3240.000 ±       0.014    B/op
:·gc.alloc.rate.norm           10               8  thrpt    3       192.000 ±       0.001    B/op
:·gc.alloc.rate.norm           10              32  thrpt    3       936.000 ±       0.002    B/op
:·gc.alloc.rate.norm           10             128  thrpt    3      3240.000 ±       0.009    B/op
:·gc.alloc.rate.norm           50               8  thrpt    3       360.000 ±       0.001    B/op
:·gc.alloc.rate.norm           50              32  thrpt    3       936.000 ±       0.003    B/op
:·gc.alloc.rate.norm           50             128  thrpt    3      3240.000 ±       0.001    B/op

The first parameter is the size of the chunk, and the second one is the number of chunks. In all cases the normalised allocation rate is reduced by between roughly half and two thirds: for example, from 688 to 360 B/op in the smallest case, and from 10288 to 3240 B/op in the largest.

@diesalbla (Contributor, Author)

@mpilquist Let me know if there are any further benchmarks that can be run.

@mpilquist (Member)

Benchmark looks good, though I'm still a bit concerned about chasing allocation rate improvements like this. What type of application would benefit from this change?

@diesalbla (Contributor, Author)

> Benchmark looks good, though I'm still a bit concerned about chasing allocation rate improvements like this. What type of application would benefit from this change?

Flattening chunks could be useful in the context of unconsN, unconsMin, or groupWithin, although it is not currently used there. However, I do not have any measurements (memory profiling) from a production system that would show what the other major allocation hotspots are. Any actual measurements from the field would be helpful.

@diesalbla (Contributor, Author)

@mpilquist What is your verdict on this PR? Merge? Close? Is it a worthwhile improvement, or would you rather seek a stronger optimisation?

@mpilquist (Member)

I'm good with merging. Can we inline Chunk.build into the else block inside of flatten?

@diesalbla (Contributor, Author) commented Jul 30, 2023

> I'm good with merging. Can we inline Chunk.build into the else block inside of flatten?

No, because the constructor of Chunk.Queue, which the Queue.build method uses, is private to that companion object.

@mpilquist merged commit 77c1909 into typelevel:main on Aug 29, 2023. 14 checks passed.