Redefine EventNetwork to reduce allocations #238

Merged
merged 2 commits into from
Jan 9, 2022

Collaborator

@ocharles ocharles commented Jan 9, 2022

The goal of this change is to eliminate an allocation that occurs when firing the event network via fromAddHandler. If we inspect the current STG, we see:

Reactive.Banana.Internal.Combinators.$wfromAddHandler =
    \r [w_sydr w1_syds w2_sydt w3_sydu]
        case Reactive.Banana.Prim.IO.$wnewInput w2_sydt w3_sydu of {
        (#,#) ipv_sydw ipv1_sydx ->
        case ipv1_sydx of {
        (,) p_sydz fire_sydA ->
        let {
          sat_sydH :: Control.Event.Handler.Handler a_sx2k =
              \r [x_sydB]
                  case w1_syds of {
                  Reactive.Banana.Internal.Combinators.EventNetwork ds_sydD _ _ ->
                  let {
                    sat_sydG :: Reactive.Banana.Prim.Types.Step =
                        \u [] fire_sydA x_sydB;
                  } in  ds_sydD sat_sydG;
                  };
        } in

sat_sydH is the function that is registered with the AddHandler to fire the network. Looking at it in more detail, we see that after pattern matching on an EventNetwork, we allocate a Step (the thunk sat_sydG) to pass to ds_sydD, which is the runStep function.

What we'd really like is to inline runStep, but because it is a field of EventNetwork, this is very difficult. Fortunately, runStep can be lifted out to the top level, with its free variables moved into EventNetwork. This breaks a bit of encapsulation (though we could recover it with a module).
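
To illustrate the idea, here is a minimal sketch of the transformation. It is not the library's actual code: Network, Step, EventNetworkOld, newOld, peekOld, newEventNetwork and networkState are hypothetical, simplified stand-ins, and the real EventNetwork carries more state than a single IORef.

```haskell
module Main where

import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Hypothetical stand-ins: the real network state and Step type are richer.
type Network = Int
type Step = Network -> IO Network

-- Before: runStep is a closure stored in the record. Callers only see an
-- unknown function value, so GHC cannot inline it at the call site.
data EventNetworkOld = EventNetworkOld
  { runStepOld :: Step -> IO ()
  , peekOld    :: IO Network
  }

newOld :: IO EventNetworkOld
newOld = do
  ref <- newIORef 0
  pure EventNetworkOld
    { runStepOld = \step -> readIORef ref >>= step >>= writeIORef ref
    , peekOld    = readIORef ref
    }

-- After: the record holds only the former free variables (here, the IORef),
-- and runStep is an ordinary top-level function that GHC is free to inline.
newtype EventNetwork = EventNetwork { networkState :: IORef Network }

newEventNetwork :: IO EventNetwork
newEventNetwork = EventNetwork <$> newIORef 0

runStep :: EventNetwork -> Step -> IO ()
runStep (EventNetwork ref) step = readIORef ref >>= step >>= writeIORef ref
{-# INLINE runStep #-}

main :: IO ()
main = do
  old <- newOld
  runStepOld old (pure . (+ 1))
  peekOld old >>= print

  new <- newEventNetwork
  runStep new (pure . (+ 1))
  readIORef (networkState new) >>= print
```

Both variants behave the same; the difference is only that the second exposes the record's contents (the encapsulation cost mentioned above) in exchange for a known, inlinable runStep.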

The impact of this change is noticeable. I first ran the benchmarks on the existing code with cabal run benchmarks -- --stdev 1 --csv baseline.csv. Next, I applied the changes in this PR and re-ran with cabal run benchmarks -- --stdev 1 --baseline baseline.csv. The results are:

  netsize = 1
    duration = 1:   OK (0.94s)
      6.93 μs ± 133 ns
    duration = 2:   OK (1.78s)
      13.4 μs ± 213 ns,  3% faster than baseline
    duration = 4:   OK (7.10s)
      27.1 μs ± 210 ns,  1% faster than baseline
    duration = 8:   OK (7.08s)
      54.2 μs ± 613 ns,  1% faster than baseline
    duration = 16:  OK (1.77s)
      108  μs ± 1.8 μs
    duration = 32:  OK (0.87s)
      213  μs ± 3.5 μs,  4% faster than baseline
    duration = 64:  OK (3.54s)
      433  μs ± 5.5 μs
    duration = 128: OK (1.77s)
      865  μs ± 7.9 μs,  1% faster than baseline
  netsize = 2
    duration = 1:   OK (0.95s)
      7.15 μs ± 115 ns,  3% faster than baseline
    duration = 2:   OK (1.86s)
      14.1 μs ± 121 ns,  4% faster than baseline
    duration = 4:   OK (0.94s)
      28.3 μs ± 526 ns,  2% faster than baseline
    duration = 8:   OK (1.85s)
      55.9 μs ± 492 ns,  3% faster than baseline
    duration = 16:  OK (0.93s)
      113  μs ± 1.4 μs,  2% faster than baseline
    duration = 32:  OK (0.94s)
      226  μs ± 2.6 μs
    duration = 64:  OK (0.93s)
      450  μs ± 7.8 μs,  3% faster than baseline
    duration = 128: OK (0.93s)
      902  μs ±  15 μs,  1% faster than baseline
  netsize = 4
    duration = 1:   OK (3.86s)
      7.35 μs ±  57 ns,  1% faster than baseline
    duration = 2:   OK (1.91s)
      14.7 μs ± 275 ns
    duration = 4:   OK (0.96s)
      29.1 μs ± 386 ns
    duration = 8:   OK (0.96s)
      57.6 μs ± 940 ns,  2% faster than baseline
    duration = 16:  OK (3.79s)
      116  μs ± 983 ns,  1% faster than baseline
    duration = 32:  OK (3.77s)
      230  μs ± 2.2 μs,  2% faster than baseline
    duration = 64:  OK (1.88s)
      459  μs ± 6.2 μs
    duration = 128: OK (3.79s)
      922  μs ± 4.8 μs,  2% faster than baseline
  netsize = 8
    duration = 1:   OK (1.00s)
      7.72 μs ± 117 ns
    duration = 2:   OK (0.98s)
      15.0 μs ± 251 ns,  2% faster than baseline
    duration = 4:   OK (3.88s)
      29.5 μs ± 170 ns,  4% faster than baseline
    duration = 8:   OK (0.96s)
      58.1 μs ± 1.0 μs,  4% faster than baseline
    duration = 16:  OK (0.97s)
      117  μs ± 1.6 μs,  2% faster than baseline
    duration = 32:  OK (1.90s)
      231  μs ± 1.5 μs,  1% faster than baseline
    duration = 64:  OK (0.97s)
      468  μs ± 6.3 μs,  1% faster than baseline
    duration = 128: OK (0.96s)
      934  μs ±  14 μs,  1% faster than baseline
  netsize = 16
    duration = 1:   OK (2.16s)
      8.18 μs ± 130 ns
    duration = 2:   OK (2.05s)
      15.6 μs ± 123 ns,  2% faster than baseline
    duration = 4:   OK (4.02s)
      30.5 μs ± 247 ns,  4% faster than baseline
    duration = 8:   OK (1.98s)
      60.3 μs ± 942 ns,  2% faster than baseline
    duration = 16:  OK (1.96s)
      119  μs ± 1.6 μs,  3% faster than baseline
    duration = 32:  OK (0.97s)
      234  μs ± 4.3 μs,  4% faster than baseline
    duration = 64:  OK (0.97s)
      467  μs ± 8.4 μs,  5% faster than baseline
    duration = 128: OK (3.91s)
      953  μs ±  13 μs,  3% faster than baseline
  netsize = 32
    duration = 1:   OK (2.41s)
      9.22 μs ±  59 ns,  1% faster than baseline
    duration = 2:   OK (8.93s)
      17.0 μs ±  59 ns,  1% faster than baseline
    duration = 4:   OK (1.07s)
      32.1 μs ± 411 ns,  4% faster than baseline
    duration = 8:   OK (1.03s)
      62.9 μs ± 1.0 μs,  3% faster than baseline
    duration = 16:  OK (1.03s)
      125  μs ± 2.0 μs,  1% faster than baseline
    duration = 32:  OK (1.01s)
      243  μs ± 4.6 μs,  3% faster than baseline
    duration = 64:  OK (4.06s)
      493  μs ± 8.2 μs,  2% faster than baseline
    duration = 128: OK (8.09s)
      983  μs ±  12 μs,  2% faster than baseline
  netsize = 64
    duration = 1:   OK (1.50s)
      11.3 μs ± 175 ns,  2% faster than baseline
    duration = 2:   OK (1.26s)
      19.2 μs ± 265 ns,  4% faster than baseline
    duration = 4:   OK (1.19s)
      35.9 μs ± 676 ns
    duration = 8:   OK (2.19s)
      66.4 μs ± 751 ns,  3% faster than baseline
    duration = 16:  OK (2.18s)
      133  μs ± 1.1 μs,  2% faster than baseline
    duration = 32:  OK (4.30s)
      261  μs ± 1.3 μs
    duration = 64:  OK (1.08s)
      521  μs ± 9.3 μs
    duration = 128: OK (4.22s)
      1.03 ms ±  13 μs,  2% faster than baseline
  netsize = 128
    duration = 1:   OK (8.43s)
      16.1 μs ± 108 ns,  1% faster than baseline
    duration = 2:   OK (1.64s)
      25.1 μs ± 192 ns,  1% faster than baseline
    duration = 4:   OK (1.37s)
      41.4 μs ± 542 ns,  4% faster than baseline
    duration = 8:   OK (5.02s)
      76.6 μs ± 918 ns,  1% faster than baseline
    duration = 16:  OK (0.60s)
      144  μs ± 2.9 μs,  3% faster than baseline
    duration = 32:  OK (1.14s)
      277  μs ± 3.1 μs,  4% faster than baseline
    duration = 64:  OK (2.46s)
      598  μs ± 4.6 μs,  4% slower than baseline
    duration = 128: OK (0.61s)
      1.20 ms ±  22 μs,  4% slower than baseline
  Boring:           OK (14.40s)
    232  ms ± 1.5 ms, 29% faster than baseline

The most striking difference is "Boring". This is a "no-op" benchmark, which is now roughly 30% faster (!) and also apparently allocates 0 bytes (though I don't entirely believe this; I think it's measurement error).

Owner

@HeinrichApfelmus HeinrichApfelmus left a comment

Nothing wrong with a bit of defunctionalization for the sake of performance! 😊

To make sure that the small performance improvements are real: What happens if you use the new code as baseline and benchmark the old code against this baseline? Would it also report 1%-4% improvements due to statistical noise?
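
As a rough aid for this kind of round-trip check, the sketch below computes what the reverse comparison should look like if a reported speed-up is real. It assumes the reported percentage is the relative change in mean time against the baseline; relativeChange and expectedReverse are hypothetical helper names, and the numbers in main are illustrative, not taken from the benchmarks.

```haskell
module Main where

-- Relative change of `new` against `baseline`, in percent.
-- Negative means faster than baseline, positive means slower.
relativeChange :: Double -> Double -> Double
relativeChange baseline new = (new / baseline - 1) * 100

-- If the new code is p% faster than the old code, this is how much slower
-- (in percent) the old code should appear when the new code is the baseline.
expectedReverse :: Double -> Double
expectedReverse pctFaster = (1 / (1 - pctFaster / 100) - 1) * 100

main :: IO ()
main = do
  -- Illustrative numbers only: a run at 96 time units against a baseline of 100.
  print (relativeChange 100 96)
  -- A genuine 4% speed-up should show up as slightly more than 4% slower in reverse.
  print (expectedReverse 4)
```

If the improvement were pure noise, swapping the baseline would be expected to produce a similar scatter of small "faster" results rather than a consistent "slower" result of matching size.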

@ocharles
Collaborator Author

ocharles commented Jan 9, 2022

I switched machines (to my more powerful desktop) so the numbers don't quite match, but here's what I get if I use this PR as the baseline, and switch back to master:

  netsize = 1
    duration = 1:   OK (5.65s)
      5.36 μs ±  10 ns,  4% slower than baseline
    duration = 2:   OK (1.41s)
      10.7 μs ± 200 ns,  2% slower than baseline
    duration = 4:   OK (1.41s)
      21.4 μs ± 405 ns,  5% slower than baseline
    duration = 8:   OK (2.78s)
      42.3 μs ± 476 ns,  3% slower than baseline
    duration = 16:  OK (5.57s)
      85.3 μs ± 1.4 μs,  7% slower than baseline
    duration = 32:  OK (0.72s)
      175  μs ± 2.7 μs, 11% slower than baseline
    duration = 64:  OK (1.41s)
      344  μs ± 3.6 μs,  7% slower than baseline
    duration = 128: OK (2.87s)
      701  μs ± 8.2 μs,  7% slower than baseline
  netsize = 2
    duration = 1:   OK (6.09s)
      5.79 μs ±  94 ns,  8% slower than baseline
    duration = 2:   OK (1.54s)
      11.7 μs ± 157 ns, 10% slower than baseline
    duration = 4:   OK (24.27s)
      23.0 μs ± 319 ns,  8% slower than baseline
    duration = 8:   OK (0.73s)
      45.1 μs ± 828 ns,  6% slower than baseline
    duration = 16:  OK (6.11s)
      92.7 μs ± 1.3 μs,  9% slower than baseline
    duration = 32:  OK (3.01s)
      183  μs ± 2.2 μs,  7% slower than baseline
    duration = 64:  OK (0.76s)
      369  μs ± 6.1 μs,  9% slower than baseline
    duration = 128: OK (5.97s)
      727  μs ± 5.4 μs,  7% slower than baseline
  netsize = 4
    duration = 1:   OK (3.07s)
      5.83 μs ±  82 ns,  6% slower than baseline
    duration = 2:   OK (49.34s)
      11.8 μs ± 174 ns,  7% slower than baseline
    duration = 4:   OK (0.77s)
      23.5 μs ± 388 ns,  6% slower than baseline
    duration = 8:   OK (1.52s)
      46.1 μs ± 734 ns,  6% slower than baseline
    duration = 16:  OK (1.53s)
      93.0 μs ± 1.4 μs,  7% slower than baseline
    duration = 32:  OK (1.51s)
      184  μs ± 1.9 μs,  6% slower than baseline
    duration = 64:  OK (0.76s)
      370  μs ± 6.2 μs,  6% slower than baseline
    duration = 128: OK (1.51s)
      737  μs ± 7.0 μs,  7% slower than baseline
  netsize = 8
    duration = 1:   OK (3.23s)
      6.14 μs ±  97 ns,  7% slower than baseline
    duration = 2:   OK (1.57s)
      11.9 μs ± 176 ns,  5% slower than baseline
    duration = 4:   OK (3.08s)
      23.5 μs ±  83 ns,  6% slower than baseline
    duration = 8:   OK (0.79s)
      47.7 μs ± 899 ns, 11% slower than baseline
    duration = 16:  OK (3.08s)
      93.6 μs ± 383 ns,  6% slower than baseline
    duration = 32:  OK (6.09s)
      185  μs ± 594 ns,  6% slower than baseline
    duration = 64:  OK (1.50s)
      367  μs ± 4.7 μs,  6% slower than baseline
    duration = 128: OK (1.53s)
      744  μs ± 9.5 μs,  6% slower than baseline
  netsize = 16
    duration = 1:   OK (0.84s)
      6.33 μs ±  88 ns,  3% slower than baseline
    duration = 2:   OK (0.81s)
      12.2 μs ± 210 ns,  3% slower than baseline
    duration = 4:   OK (0.80s)
      24.2 μs ± 432 ns,  5% slower than baseline
    duration = 8:   OK (3.14s)
      47.8 μs ± 391 ns,  5% slower than baseline
    duration = 16:  OK (3.14s)
      95.6 μs ± 511 ns,  8% slower than baseline
    duration = 32:  OK (0.77s)
      190  μs ± 3.3 μs,  6% slower than baseline
    duration = 64:  OK (12.63s)
      385  μs ± 4.7 μs,  7% slower than baseline
    duration = 128: OK (0.79s)
      765  μs ±  12 μs,  7% slower than baseline
  netsize = 32
    duration = 1:   OK (0.93s)
      7.06 μs ±  93 ns,  4% slower than baseline
    duration = 2:   OK (1.74s)
      13.1 μs ± 146 ns,  5% slower than baseline
    duration = 4:   OK (0.83s)
      25.1 μs ± 330 ns,  3% slower than baseline
    duration = 8:   OK (0.81s)
      49.2 μs ± 947 ns,  6% slower than baseline
    duration = 16:  OK (1.64s)
      100  μs ± 1.8 μs,  8% slower than baseline
    duration = 32:  OK (3.23s)
      196  μs ± 1.2 μs,  7% slower than baseline
    duration = 64:  OK (1.59s)
      386  μs ± 4.3 μs,  7% slower than baseline
    duration = 128: OK (3.20s)
      781  μs ± 8.9 μs,  5% slower than baseline
  netsize = 64
    duration = 1:   OK (1.09s)
      8.24 μs ±  90 ns,  3% slower than baseline
    duration = 2:   OK (1.93s)
      14.6 μs ± 263 ns,  6% slower than baseline
    duration = 4:   OK (0.89s)
      27.4 μs ± 534 ns,  8% slower than baseline
    duration = 8:   OK (1.71s)
      52.2 μs ± 345 ns, 11% slower than baseline
    duration = 16:  OK (1.66s)
      101  μs ± 944 ns,  6% slower than baseline
    duration = 32:  OK (0.83s)
      201  μs ± 3.2 μs,  8% slower than baseline
    duration = 64:  OK (3.31s)
      401  μs ± 1.5 μs, 10% slower than baseline
    duration = 128: OK (3.28s)
      797  μs ±  11 μs,  9% slower than baseline
  netsize = 128
    duration = 1:   OK (11.59s)
      11.0 μs ±  87 ns,  5% slower than baseline
    duration = 2:   OK (4.59s)
      17.4 μs ± 257 ns,  6% slower than baseline
    duration = 4:   OK (0.99s)
      30.0 μs ± 404 ns,  7% slower than baseline
    duration = 8:   OK (1.82s)
      55.3 μs ± 507 ns,  8% slower than baseline
    duration = 16:  OK (1.77s)
      108  μs ± 1.1 μs, 10% slower than baseline
    duration = 32:  OK (0.87s)
      212  μs ± 4.2 μs, 10% slower than baseline
    duration = 64:  OK (27.20s)
      413  μs ± 3.8 μs,  8% slower than baseline
    duration = 128: OK (1.68s)
      819  μs ±  13 μs,  7% slower than baseline
  Boring:           OK (24.86s)
    194  ms ± 3.6 ms,  6% slower than baseline

@ocharles
Collaborator Author

ocharles commented Jan 9, 2022

Also, I want to add that I ran the benchmarks on this PR once to get a baseline, and then again against that baseline. The second run shows no significant changes (that is, it matches the previous run), so we're not just observing measurement error.

@ocharles ocharles merged commit f2a9def into master Jan 9, 2022
@ocharles ocharles deleted the redefine-EventNetwork branch January 9, 2022 15:00
@HeinrichApfelmus
Owner

> Here's what I get if I use this PR as the baseline, and switch back to master:

Awesome, thanks for the diligence! 😊
