Redefine `EventNetwork` to reduce allocations #238

ocharles · 2022-01-09T12:32:38Z

The goal of this change is to eliminate an allocation that occurs when firing the event network via fromAddHandler. If we inspect the current STG, we see:

Reactive.Banana.Internal.Combinators.$wfromAddHandler =
    \r [w_sydr w1_syds w2_sydt w3_sydu]
        case Reactive.Banana.Prim.IO.$wnewInput w2_sydt w3_sydu of {
        (#,#) ipv_sydw ipv1_sydx ->
        case ipv1_sydx of {
        (,) p_sydz fire_sydA ->
        let {
          sat_sydH :: Control.Event.Handler.Handler a_sx2k =
              \r [x_sydB]
                  case w1_syds of {
                  Reactive.Banana.Internal.Combinators.EventNetwork ds_sydD _ _ ->
                  let {
                    sat_sydG :: Reactive.Banana.Prim.Types.Step =
                        \u [] fire_sydA x_sydB;
                  } in  ds_sydD sat_sydG;
                  };
        } in

sat_sydH is the function that is registered with the AddHandler to fire the network. Looking at it in more detail, after pattern matching on an EventNetwork we allocate a Step (sat_sydG) to pass to ds_sydD, which is the runStep function.

What we'd really like is to inline runStep, but because it's a field of EventNetwork, this is very difficult. Fortunately, runStep can be lifted out to the top level, with its free variables moved into EventNetwork. This breaks a bit of encapsulation (though we could recover it with a module).
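To make the idea concrete, here is a minimal sketch of the two shapes. It is a simplified model, not the actual reactive-banana definitions: `State`, `Step`, `ClosureNetwork`, `DataNetwork`, `mkHandler` and every field name are hypothetical stand-ins. In the first shape the network is a record of closures, so the handler can only call an unknown function stored in a field. In the second, the record holds just the former free variables and `runStep` is an ordinary top-level function that GHC can see at the call site.

```haskell
import Control.Monad (when)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Hypothetical stand-ins: the real network state and Step are richer.
type State = Int
type Step  = State -> IO State

-- Before: a record of closures. The step runner is an opaque field,
-- so callers cannot inline it.
data ClosureNetwork = ClosureNetwork
  { cnRunStep :: Step -> IO ()
  , cnActuate :: IO ()
  , cnPause   :: IO ()
  }

newClosureNetwork :: State -> IO ClosureNetwork
newClosureNetwork s0 = do
  actuated <- newIORef False
  state    <- newIORef s0
  pure ClosureNetwork
    { cnRunStep = \step -> do
        on <- readIORef actuated
        when on (readIORef state >>= step >>= writeIORef state)
    , cnActuate = writeIORef actuated True
    , cnPause   = writeIORef actuated False
    }

-- After: the record stores only what the closures used to capture.
data DataNetwork = DataNetwork
  { dnActuated :: IORef Bool
  , dnState    :: IORef State
  }

newDataNetwork :: State -> IO DataNetwork
newDataNetwork s0 = DataNetwork <$> newIORef False <*> newIORef s0

-- The step runner is now a known top-level function.
runStep :: DataNetwork -> Step -> IO ()
runStep (DataNetwork actuated state) step = do
  on <- readIORef actuated
  when on (readIORef state >>= step >>= writeIORef state)

actuate, pause :: DataNetwork -> IO ()
actuate net = writeIORef (dnActuated net) True
pause   net = writeIORef (dnActuated net) False

-- A handler in the spirit of sat_sydH: fire one occurrence into the network.
mkHandler :: DataNetwork -> (a -> Step) -> a -> IO ()
mkHandler net fire x = runStep net (fire x)

main :: IO ()
main = do
  net <- newDataNetwork 0
  let handler = mkHandler net (\x s -> pure (s + x))
  handler 5                          -- ignored: the network is not actuated yet
  actuate net
  handler 5
  handler 7
  readIORef (dnState net) >>= print  -- prints 12
```

The sketch only shows the change in shape. Whether GHC actually inlines `runStep` and drops the intermediate `Step` allocation depends on the real code and optimisation settings, which is what the STG and the benchmarks are for.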

The impact of this change is noticeable. I first ran the benchmarks with cabal run benchmarks -- --stdev 1 --csv baseline.csv. Next, I applied the changes in this PR, and re-ran cabal run benchmarks -- --stdev 1 --baseline baseline.csv. The results are:

  netsize = 1
    duration = 1:   OK (0.94s)
      6.93 μs ± 133 ns
    duration = 2:   OK (1.78s)
      13.4 μs ± 213 ns,  3% faster than baseline
    duration = 4:   OK (7.10s)
      27.1 μs ± 210 ns,  1% faster than baseline
    duration = 8:   OK (7.08s)
      54.2 μs ± 613 ns,  1% faster than baseline
    duration = 16:  OK (1.77s)
      108  μs ± 1.8 μs
    duration = 32:  OK (0.87s)
      213  μs ± 3.5 μs,  4% faster than baseline
    duration = 64:  OK (3.54s)
      433  μs ± 5.5 μs
    duration = 128: OK (1.77s)
      865  μs ± 7.9 μs,  1% faster than baseline
  netsize = 2
    duration = 1:   OK (0.95s)
      7.15 μs ± 115 ns,  3% faster than baseline
    duration = 2:   OK (1.86s)
      14.1 μs ± 121 ns,  4% faster than baseline
    duration = 4:   OK (0.94s)
      28.3 μs ± 526 ns,  2% faster than baseline
    duration = 8:   OK (1.85s)
      55.9 μs ± 492 ns,  3% faster than baseline
    duration = 16:  OK (0.93s)
      113  μs ± 1.4 μs,  2% faster than baseline
    duration = 32:  OK (0.94s)
      226  μs ± 2.6 μs
    duration = 64:  OK (0.93s)
      450  μs ± 7.8 μs,  3% faster than baseline
    duration = 128: OK (0.93s)
      902  μs ±  15 μs,  1% faster than baseline
  netsize = 4
    duration = 1:   OK (3.86s)
      7.35 μs ±  57 ns,  1% faster than baseline
    duration = 2:   OK (1.91s)
      14.7 μs ± 275 ns
    duration = 4:   OK (0.96s)
      29.1 μs ± 386 ns
    duration = 8:   OK (0.96s)
      57.6 μs ± 940 ns,  2% faster than baseline
    duration = 16:  OK (3.79s)
      116  μs ± 983 ns,  1% faster than baseline
    duration = 32:  OK (3.77s)
      230  μs ± 2.2 μs,  2% faster than baseline
    duration = 64:  OK (1.88s)
      459  μs ± 6.2 μs
    duration = 128: OK (3.79s)
      922  μs ± 4.8 μs,  2% faster than baseline
  netsize = 8
    duration = 1:   OK (1.00s)
      7.72 μs ± 117 ns
    duration = 2:   OK (0.98s)
      15.0 μs ± 251 ns,  2% faster than baseline
    duration = 4:   OK (3.88s)
      29.5 μs ± 170 ns,  4% faster than baseline
    duration = 8:   OK (0.96s)
      58.1 μs ± 1.0 μs,  4% faster than baseline
    duration = 16:  OK (0.97s)
      117  μs ± 1.6 μs,  2% faster than baseline
    duration = 32:  OK (1.90s)
      231  μs ± 1.5 μs,  1% faster than baseline
    duration = 64:  OK (0.97s)
      468  μs ± 6.3 μs,  1% faster than baseline
    duration = 128: OK (0.96s)
      934  μs ±  14 μs,  1% faster than baseline
  netsize = 16
    duration = 1:   OK (2.16s)
      8.18 μs ± 130 ns
    duration = 2:   OK (2.05s)
      15.6 μs ± 123 ns,  2% faster than baseline
    duration = 4:   OK (4.02s)
      30.5 μs ± 247 ns,  4% faster than baseline
    duration = 8:   OK (1.98s)
      60.3 μs ± 942 ns,  2% faster than baseline
    duration = 16:  OK (1.96s)
      119  μs ± 1.6 μs,  3% faster than baseline
    duration = 32:  OK (0.97s)
      234  μs ± 4.3 μs,  4% faster than baseline
    duration = 64:  OK (0.97s)
      467  μs ± 8.4 μs,  5% faster than baseline
    duration = 128: OK (3.91s)
      953  μs ±  13 μs,  3% faster than baseline
  netsize = 32
    duration = 1:   OK (2.41s)
      9.22 μs ±  59 ns,  1% faster than baseline
    duration = 2:   OK (8.93s)
      17.0 μs ±  59 ns,  1% faster than baseline
    duration = 4:   OK (1.07s)
      32.1 μs ± 411 ns,  4% faster than baseline
    duration = 8:   OK (1.03s)
      62.9 μs ± 1.0 μs,  3% faster than baseline
    duration = 16:  OK (1.03s)
      125  μs ± 2.0 μs,  1% faster than baseline
    duration = 32:  OK (1.01s)
      243  μs ± 4.6 μs,  3% faster than baseline
    duration = 64:  OK (4.06s)
      493  μs ± 8.2 μs,  2% faster than baseline
    duration = 128: OK (8.09s)
      983  μs ±  12 μs,  2% faster than baseline
  netsize = 64
    duration = 1:   OK (1.50s)
      11.3 μs ± 175 ns,  2% faster than baseline
    duration = 2:   OK (1.26s)
      19.2 μs ± 265 ns,  4% faster than baseline
    duration = 4:   OK (1.19s)
      35.9 μs ± 676 ns
    duration = 8:   OK (2.19s)
      66.4 μs ± 751 ns,  3% faster than baseline
    duration = 16:  OK (2.18s)
      133  μs ± 1.1 μs,  2% faster than baseline
    duration = 32:  OK (4.30s)
      261  μs ± 1.3 μs
    duration = 64:  OK (1.08s)
      521  μs ± 9.3 μs
    duration = 128: OK (4.22s)
      1.03 ms ±  13 μs,  2% faster than baseline
  netsize = 128
    duration = 1:   OK (8.43s)
      16.1 μs ± 108 ns,  1% faster than baseline
    duration = 2:   OK (1.64s)
      25.1 μs ± 192 ns,  1% faster than baseline
    duration = 4:   OK (1.37s)
      41.4 μs ± 542 ns,  4% faster than baseline
    duration = 8:   OK (5.02s)
      76.6 μs ± 918 ns,  1% faster than baseline
    duration = 16:  OK (0.60s)
      144  μs ± 2.9 μs,  3% faster than baseline
    duration = 32:  OK (1.14s)
      277  μs ± 3.1 μs,  4% faster than baseline
    duration = 64:  OK (2.46s)
      598  μs ± 4.6 μs,  4% slower than baseline
    duration = 128: OK (0.61s)
      1.20 ms ±  22 μs,  4% slower than baseline
  Boring:           OK (14.40s)
    232  ms ± 1.5 ms, 29% faster than baseline

The most striking difference is "Boring". This is a "no-op" benchmark, which is now about 30% faster (!) and also apparently allocates 0 bytes, though I don't entirely believe that figure; I think it's measurement error.

HeinrichApfelmus

Nothing wrong with a bit of defunctionalization for the sake of performance! 😊

To make sure that the small performance improvements are real: What happens if you use the new code as baseline and benchmark the old code against this baseline? Would it also report 1%-4% improvements due to statistical noise?
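The arithmetic behind this check can be sketched as follows, assuming the reported percentages are relative differences in mean time against the baseline (the helper names `relChange`, `reverseOf` and `percent`, and the input values, are made up for illustration). If the new code takes a fraction p less time than the old, the old code takes p / (1 - p) more time than the new. For small p the two numbers nearly coincide, so a real 1%-4% speed-up should reappear as a consistent 1%-4% slowdown once the roles are swapped, whereas pure noise would not keep a consistent sign.

```haskell
-- Signed relative change of a timing against its baseline:
-- negative means faster, positive means slower.
relChange :: Double -> Double -> Double
relChange baseline t = (t - baseline) / baseline

-- If B is a fraction p faster than A, this is how much slower A
-- looks once B is used as the baseline: 1 / (1 - p) - 1 = p / (1 - p).
reverseOf :: Double -> Double
reverseOf p = p / (1 - p)

-- Round a fraction to a whole percentage.
percent :: Double -> Int
percent x = round (100 * x)

main :: IO ()
main = do
  print (percent (relChange 100 96))  -- prints -4 (4% faster)
  print (percent (relChange 96 100))  -- prints 4  (4% slower, roles swapped)
  print (percent (reverseOf 0.04))    -- prints 4
  print (percent (reverseOf 0.5))     -- prints 100
```

The last line shows that the symmetry only holds for small differences: something 50% faster corresponds to a 100% slowdown in the other direction.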

reactive-banana/src/Reactive/Banana/Internal/Combinators.hs


ocharles · 2022-01-09T14:58:51Z

I switched machines (to my more powerful desktop) so the numbers don't quite match, but here's what I get if I use this PR as the baseline, and switch back to master:

  netsize = 1
    duration = 1:   OK (5.65s)
      5.36 μs ±  10 ns,  4% slower than baseline
    duration = 2:   OK (1.41s)
      10.7 μs ± 200 ns,  2% slower than baseline
    duration = 4:   OK (1.41s)
      21.4 μs ± 405 ns,  5% slower than baseline
    duration = 8:   OK (2.78s)
      42.3 μs ± 476 ns,  3% slower than baseline
    duration = 16:  OK (5.57s)
      85.3 μs ± 1.4 μs,  7% slower than baseline
    duration = 32:  OK (0.72s)
      175  μs ± 2.7 μs, 11% slower than baseline
    duration = 64:  OK (1.41s)
      344  μs ± 3.6 μs,  7% slower than baseline
    duration = 128: OK (2.87s)
      701  μs ± 8.2 μs,  7% slower than baseline
  netsize = 2
    duration = 1:   OK (6.09s)
      5.79 μs ±  94 ns,  8% slower than baseline
    duration = 2:   OK (1.54s)
      11.7 μs ± 157 ns, 10% slower than baseline
    duration = 4:   OK (24.27s)
      23.0 μs ± 319 ns,  8% slower than baseline
    duration = 8:   OK (0.73s)
      45.1 μs ± 828 ns,  6% slower than baseline
    duration = 16:  OK (6.11s)
      92.7 μs ± 1.3 μs,  9% slower than baseline
    duration = 32:  OK (3.01s)
      183  μs ± 2.2 μs,  7% slower than baseline
    duration = 64:  OK (0.76s)
      369  μs ± 6.1 μs,  9% slower than baseline
    duration = 128: OK (5.97s)
      727  μs ± 5.4 μs,  7% slower than baseline
  netsize = 4
    duration = 1:   OK (3.07s)
      5.83 μs ±  82 ns,  6% slower than baseline
    duration = 2:   OK (49.34s)
      11.8 μs ± 174 ns,  7% slower than baseline
    duration = 4:   OK (0.77s)
      23.5 μs ± 388 ns,  6% slower than baseline
    duration = 8:   OK (1.52s)
      46.1 μs ± 734 ns,  6% slower than baseline
    duration = 16:  OK (1.53s)
      93.0 μs ± 1.4 μs,  7% slower than baseline
    duration = 32:  OK (1.51s)
      184  μs ± 1.9 μs,  6% slower than baseline
    duration = 64:  OK (0.76s)
      370  μs ± 6.2 μs,  6% slower than baseline
    duration = 128: OK (1.51s)
      737  μs ± 7.0 μs,  7% slower than baseline
  netsize = 8
    duration = 1:   OK (3.23s)
      6.14 μs ±  97 ns,  7% slower than baseline
    duration = 2:   OK (1.57s)
      11.9 μs ± 176 ns,  5% slower than baseline
    duration = 4:   OK (3.08s)
      23.5 μs ±  83 ns,  6% slower than baseline
    duration = 8:   OK (0.79s)
      47.7 μs ± 899 ns, 11% slower than baseline
    duration = 16:  OK (3.08s)
      93.6 μs ± 383 ns,  6% slower than baseline
    duration = 32:  OK (6.09s)
      185  μs ± 594 ns,  6% slower than baseline
    duration = 64:  OK (1.50s)
      367  μs ± 4.7 μs,  6% slower than baseline
    duration = 128: OK (1.53s)
      744  μs ± 9.5 μs,  6% slower than baseline
  netsize = 16
    duration = 1:   OK (0.84s)
      6.33 μs ±  88 ns,  3% slower than baseline
    duration = 2:   OK (0.81s)
      12.2 μs ± 210 ns,  3% slower than baseline
    duration = 4:   OK (0.80s)
      24.2 μs ± 432 ns,  5% slower than baseline
    duration = 8:   OK (3.14s)
      47.8 μs ± 391 ns,  5% slower than baseline
    duration = 16:  OK (3.14s)
      95.6 μs ± 511 ns,  8% slower than baseline
    duration = 32:  OK (0.77s)
      190  μs ± 3.3 μs,  6% slower than baseline
    duration = 64:  OK (12.63s)
      385  μs ± 4.7 μs,  7% slower than baseline
    duration = 128: OK (0.79s)
      765  μs ±  12 μs,  7% slower than baseline
  netsize = 32
    duration = 1:   OK (0.93s)
      7.06 μs ±  93 ns,  4% slower than baseline
    duration = 2:   OK (1.74s)
      13.1 μs ± 146 ns,  5% slower than baseline
    duration = 4:   OK (0.83s)
      25.1 μs ± 330 ns,  3% slower than baseline
    duration = 8:   OK (0.81s)
      49.2 μs ± 947 ns,  6% slower than baseline
    duration = 16:  OK (1.64s)
      100  μs ± 1.8 μs,  8% slower than baseline
    duration = 32:  OK (3.23s)
      196  μs ± 1.2 μs,  7% slower than baseline
    duration = 64:  OK (1.59s)
      386  μs ± 4.3 μs,  7% slower than baseline
    duration = 128: OK (3.20s)
      781  μs ± 8.9 μs,  5% slower than baseline
  netsize = 64
    duration = 1:   OK (1.09s)
      8.24 μs ±  90 ns,  3% slower than baseline
    duration = 2:   OK (1.93s)
      14.6 μs ± 263 ns,  6% slower than baseline
    duration = 4:   OK (0.89s)
      27.4 μs ± 534 ns,  8% slower than baseline
    duration = 8:   OK (1.71s)
      52.2 μs ± 345 ns, 11% slower than baseline
    duration = 16:  OK (1.66s)
      101  μs ± 944 ns,  6% slower than baseline
    duration = 32:  OK (0.83s)
      201  μs ± 3.2 μs,  8% slower than baseline
    duration = 64:  OK (3.31s)
      401  μs ± 1.5 μs, 10% slower than baseline
    duration = 128: OK (3.28s)
      797  μs ±  11 μs,  9% slower than baseline
  netsize = 128
    duration = 1:   OK (11.59s)
      11.0 μs ±  87 ns,  5% slower than baseline
    duration = 2:   OK (4.59s)
      17.4 μs ± 257 ns,  6% slower than baseline
    duration = 4:   OK (0.99s)
      30.0 μs ± 404 ns,  7% slower than baseline
    duration = 8:   OK (1.82s)
      55.3 μs ± 507 ns,  8% slower than baseline
    duration = 16:  OK (1.77s)
      108  μs ± 1.1 μs, 10% slower than baseline
    duration = 32:  OK (0.87s)
      212  μs ± 4.2 μs, 10% slower than baseline
    duration = 64:  OK (27.20s)
      413  μs ± 3.8 μs,  8% slower than baseline
    duration = 128: OK (1.68s)
      819  μs ±  13 μs,  7% slower than baseline
  Boring:           OK (24.86s)
    194  ms ± 3.6 ms,  6% slower than baseline

ocharles · 2022-01-09T15:00:01Z

One more note: I also ran the benchmarks on this PR once to record a baseline, and then again against that baseline. The second run shows no significant changes (that is, it matches the first run), so we're not just observing measurement error.

HeinrichApfelmus · 2022-01-09T16:01:49Z

> Here's what I get if I use this PR as the baseline, and switch back to master:

Awesome, thanks for the diligence! 😊

Redefine EventNetwork to reduce allocations

c9a3432

ocharles requested review from HeinrichApfelmus and mitchellwrosen January 9, 2022 12:32

ocharles changed the title ~~Redefine EventNetwork to reduce allocations~~ Redefine `EventNetwork` to reduce allocations Jan 9, 2022

HeinrichApfelmus approved these changes Jan 9, 2022

Update reactive-banana/src/Reactive/Banana/Internal/Combinators.hs

b0b0308

Co-authored-by: Heinrich Apfelmus <[email protected]>

ocharles merged commit f2a9def into master Jan 9, 2022

ocharles deleted the redefine-EventNetwork branch January 9, 2022 15:00
