
Chunk - Optimise Flatten implementation. #3190

Merged — 2 commits merged into typelevel:main from the light_flatten branch on Aug 29, 2023

Conversation

@diesalbla (Contributor)

The current flatten performs one Chunk.Queue insertion per source chunk, which means it creates a new Chunk.Queue object and a new Queue object for each one. Instead, we can avoid those allocations by using a Queue builder.
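To make the two patterns concrete, here is a minimal sketch that uses the standard library's immutable Queue as an analogue for Chunk.Queue; this is illustration only, not the fs2 code itself:

```scala
import scala.collection.immutable.Queue

object FlattenSketch {
  // Per-element insertion: every `enqueue` allocates a fresh Queue
  // wrapper around the previous one.
  def viaInsertion[A](elems: List[A]): Queue[A] =
    elems.foldLeft(Queue.empty[A])((q, e) => q.enqueue(e))

  // Builder-based: elements accumulate in a mutable builder, and the
  // resulting Queue is allocated once at the end.
  def viaBuilder[A](elems: List[A]): Queue[A] = {
    val b = Queue.newBuilder[A]
    elems.foreach(b += _)
    b.result()
  }
}
```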

@armanbilge (Member)

Can we make a similar optimization here?

def apply[O](chunks: Chunk[O]*): Queue[O] =
chunks.foldLeft(empty[O])(_ :+ _)
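For illustration, a builder-based rewrite of this apply might look like the sketch below; the Chunk stand-in is an assumption, and the return type is simplified to the underlying immutable queue rather than the real Chunk.Queue wrapper:

```scala
import scala.collection.immutable.{Queue => SQueue}

// Stand-in for fs2's Chunk; the real signatures differ.
trait Chunk[+O]

object ApplySketch {
  // One builder pass instead of one `:+` (and its wrapper
  // allocations) per argument.
  def apply[O](chunks: Chunk[O]*): SQueue[Chunk[O]] = {
    val b = SQueue.newBuilder[Chunk[O]]
    chunks.foreach(b += _)
    b.result()
  }
}
```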

@mpilquist (Member)

I'm not convinced the gains here are worth it. Can we benchmark it?

@diesalbla (Contributor, Author)

> I'm not convinced the gains here are worth it. Can we benchmark it?

Sure. I hope the code is not too "dirty" for the optimisation it gains, which saves at least two object allocations per chunk. Is the worry that this may not be a frequent enough case?

@mpilquist (Member)

Yeah, I'm concerned we're inlining implementations here without proof that the allocations matter. I don't want to put too much faith into things like escape analysis, but I also want to ensure that any code duplication / inlining is justified by benchmarks showing the improvements actually matter in real use cases.

@diesalbla force-pushed the light_flatten branch 4 times, most recently from cc9201e to 91244e4 on Apr 2, 2023
@diesalbla (Contributor, Author)

As an update, I have moved the "inlined" implementation to a def build factory method within the Queue companion object, so the constructor can be kept private as it is.
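A hedged sketch of the shape described here follows; the class name, field layout, and the Chunk stand-in are assumptions rather than the actual fs2 source:

```scala
import scala.collection.immutable.{Queue => SQueue}

// Stand-in for fs2's Chunk; only `size` matters for this sketch.
trait Chunk[+O] { def size: Int }

// Assumed shape: the real fs2 Chunk.Queue differs in detail.
final class ChunkQueue[+O] private (
    val chunks: SQueue[Chunk[O]],
    val size: Int
)

object ChunkQueue {
  // Accumulate every chunk into one queue builder and construct the
  // wrapper exactly once, keeping the primary constructor private.
  def build[O](sources: Iterable[Chunk[O]]): ChunkQueue[O] = {
    val b = SQueue.newBuilder[Chunk[O]]
    var total = 0
    sources.foreach { c => b += c; total += c.size }
    new ChunkQueue(b.result(), total)
  }
}
```

Because build lives in the companion object, it can call the private constructor; code outside the companion cannot.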

@diesalbla (Contributor, Author) commented Apr 3, 2023

I have pushed a file with a benchmark for flatten, parameterised by the number of chunks to flatten and the size of each chunk. Running it with the command

Jmh/run -i 3 -wi 3 -f1 -t1 .*ChunkFlatten.* -prof gc -jvmArgs -XX:MaxInlineLevel=18
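The benchmark file itself is not reproduced in this thread; a minimal sketch of what such a ChunkFlatten benchmark could look like is below. The parameter names and the use of plain Vectors as a stand-in for fs2 chunks are assumptions, not the file from this PR:

```scala
import org.openjdk.jmh.annotations._

@State(Scope.Thread)
class ChunkFlatten {
  @Param(Array("1", "2", "10", "50"))
  var chunkSize: Int = _

  @Param(Array("8", "32", "128"))
  var numberChunks: Int = _

  var chunks: Vector[Vector[Int]] = _

  @Setup
  def setup(): Unit =
    chunks = Vector.fill(numberChunks)(Vector.range(0, chunkSize))

  @Benchmark
  def flatten: Vector[Int] = chunks.flatten
}
```

The -prof gc option in the command above is what reports the gc.alloc.rate.norm figures quoted below.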

Focusing on the gc.alloc.rate.norm of each case (the two parameter columns are followed by the mode, iteration count, score ± error, and unit), the run produces the following results for main:

:·gc.alloc.rate.norm            1               8  thrpt    3       688.000 ±       0.001    B/op
:·gc.alloc.rate.norm            1              32  thrpt    3      2608.000 ±       0.001    B/op
:·gc.alloc.rate.norm            1             128  thrpt    3     10288.000 ±       0.001    B/op
:·gc.alloc.rate.norm            2               8  thrpt    3       688.000 ±       0.001    B/op
:·gc.alloc.rate.norm            2              32  thrpt    3      2608.000 ±       0.004    B/op
:·gc.alloc.rate.norm            2             128  thrpt    3     10288.000 ±       0.005    B/op
:·gc.alloc.rate.norm           10               8  thrpt    3       688.000 ±       0.001    B/op
:·gc.alloc.rate.norm           10              32  thrpt    3      2560.000 ±       0.002    B/op
:·gc.alloc.rate.norm           10             128  thrpt    3     10288.001 ±       0.016    B/op
:·gc.alloc.rate.norm           50               8  thrpt    3       688.000 ±       0.001    B/op
:·gc.alloc.rate.norm           50              32  thrpt    3      2608.000 ±       0.001    B/op
:·gc.alloc.rate.norm           50             128  thrpt    3     10288.001 ±       0.016    B/op

And these are the results with the changes in this PR:

:·gc.alloc.rate.norm            1               8  thrpt    3       360.000 ±       0.001    B/op
:·gc.alloc.rate.norm            1              32  thrpt    3       936.000 ±       0.001    B/op
:·gc.alloc.rate.norm            1             128  thrpt    3      3240.001 ±       0.012    B/op
:·gc.alloc.rate.norm            2               8  thrpt    3       192.000 ±       0.001    B/op
:·gc.alloc.rate.norm            2              32  thrpt    3       936.000 ±       0.003    B/op
:·gc.alloc.rate.norm            2             128  thrpt    3      3240.000 ±       0.014    B/op
:·gc.alloc.rate.norm           10               8  thrpt    3       192.000 ±       0.001    B/op
:·gc.alloc.rate.norm           10              32  thrpt    3       936.000 ±       0.002    B/op
:·gc.alloc.rate.norm           10             128  thrpt    3      3240.000 ±       0.009    B/op
:·gc.alloc.rate.norm           50               8  thrpt    3       360.000 ±       0.001    B/op
:·gc.alloc.rate.norm           50              32  thrpt    3       936.000 ±       0.003    B/op
:·gc.alloc.rate.norm           50             128  thrpt    3      3240.000 ±       0.001    B/op

The first parameter is the size of the chunk, and the second one is the number of chunks. In all cases the normalised allocation rate is reduced by between roughly half and two thirds: for example, from 688 to 360 B/op in the smallest case, and from 10288 to 3240 B/op in the largest.

@diesalbla (Contributor, Author)

@mpilquist Let me know if there are any further benchmarks that can be run.

@mpilquist (Member)

Benchmark looks good, though I'm still a bit concerned about chasing allocation rate improvements like this. What type of application would benefit from this change?

@diesalbla (Contributor, Author)

> Benchmark looks good, though I'm still a bit concerned about chasing allocation rate improvements like this. What type of application would benefit from this change?

Flattening chunks could be useful in the context of unconsN, unconsMin, or groupWithin, although it is not currently used there. However, I do not have any measurements (memory profiling) from a production system that would show what the other major allocation hotspots are. Any actual measurements from the field would be helpful.

@diesalbla (Contributor, Author)

@mpilquist What is your verdict on this PR? Merge? Close? Is it a worthwhile improvement, or would you rather seek a stronger optimisation?

@mpilquist (Member)

I'm good with merging. Can we inline Chunk.build into the else block inside of flatten?

@diesalbla (Contributor, Author) commented Jul 30, 2023

> I'm good with merging. Can we inline Chunk.build into the else block inside of flatten?

No, because the constructor of Chunk.Queue, which the Queue.build method uses, is private to that companion object.

@mpilquist merged commit 77c1909 into typelevel:main on Aug 29, 2023. 14 checks passed.