inclusive_scan.html

<!doctype html>
<html>
<meta charset='utf-8'>

<title>inclusive scan results</title>
<script src='https://cdn.plot.ly/plotly-latest.min.js'></script>
<script type='text/javascript' src='scripts/bench.js'></script>

</html>

<body>
  <h1> inclusive_scan </h1>

  My implementation is based on <a
    href="https://stackoverflow.com/a/19496697/5021064">StackOverflow answer by
    Z boson.</a> <br />

  So far only have an inplace version - replace elements with running sum -
  didn't support writing to a different array. <br />
  Only tried aligned loads and stores <br />
  The best unrolling seems to be `2` for shorts and '1' for chars/ints. <br />
  The main routine that does the work is <a
    href="https://github.com/DenisYaroshevskiy/unsq_eve/blob/6f1acf7dd2d0b609c08f5f54966d6f7dd644da51/src/eve_extra/scan_wide.h#L50">`eve_extra::inclusive_scan_wide`</a>
  that computes (as the name suggests) <br />
  `inclusive_scan` for one wide register. <br /> <br />

  Our theoretical win comes from us doing log additions from our size. <br />
  So for:<br />
  * 32 chars - we do 5 additions instead of 31 => cound be a 6 times speed up.
  <br />
  * 16 shorts we do 4 additions intead of 15 => could be a 4 times speed up.
  <br>
  * 8 ints we do 3 additions as oppose to 7 of scalar => could be a 2 times
  speed up.<br />

  <h2> 40 bytes summary </h2>

  For 40 bytes we don't win for anything except for chars.<br />
  Originally I implemented `store(ignore)` with `maskmoveu` that was really bad,
  <br />
  but then I used memcpy for chars and shorts and it got better, see the whole
  story on <a href="https://stackoverflow.com/a/62235276/5021064">
    StackOverflow</a>
  <br />

  I tried to solve this with different iteration patterns but was not
  successful: <a
    href="https://stackoverflow.com/a/62492369/5021064">StackOveflow</a><br />

  <h3> 40 bytes, data</h3>

  <div id='40_bytes_data'></div>
  <script>dynamicEntryPoint('40_bytes_data', {
      name: 'inplace transform',
      size: 40,
      algorithm: 'selection',
      type: 'x',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['eve::algo::inclusive_scan_inplace',
      'std::inclusive_scan']);
  </script>

  <h2> 1000 bytes summary </h2>

  5 times win for chars, almost 3 times win for shors and 2 times for
  ints.<br />
  So not quite 6, 4 and 2 but not too off.
  <br />

  <h3> 1000 bytes, data</h3>

  <div id='1000_bytes_data'></div>
  <script>dynamicEntryPoint('1000_bytes_data', {
      name: 'inplace transform',
      size: 1000,
      algorithm: 'selection',
      type: 'x',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['eve::algo::inclusive_scan_inplace',
      'std::inclusive_scan']);
  </script>

  <h2> 10'000 bytes summary </h2>

  Again - 5, 3, 2 times win.

  <h3> 10'000 bytes, data</h3>

  <div id='10000_bytes_data'></div>
  <script>dynamicEntryPoint('10000_bytes_data', {
      name: 'inplace transform',
      size: 10000,
      algorithm: 'selection',
      type: 'x',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['eve::algo::inclusive_scan_inplace',
      'std::inclusive_scan']);
  </script>

  <h2> Total benchmark </h2>
  <div id='total_benchmark'></div>
  <script>dynamicEntryPoint('total_benchmark', {
      name: 'inplace transform',
      size: 'x',
      algorithm: 'selection',
      type: 'selection',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['inclusive_scan']);
  </script>

</body>