-
Notifications
You must be signed in to change notification settings - Fork 1
/
inclusive_scan.html
123 lines (102 loc) · 3.58 KB
/
inclusive_scan.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
<!doctype html>
<html>
<meta charset='utf-8'>
<title>inclusive scan results</title>
<script src='https://cdn.plot.ly/plotly-latest.min.js'></script>
<script type='text/javascript' src='scripts/bench.js'></script>
</html>
<body>
<h1> inclusive_scan </h1>
My implementation is based on <a
href="https://stackoverflow.com/a/19496697/5021064">StackOverflow answer by
Z boson.</a> <br />
So far only have an inplace version - replace elements with running sum -
didn't support writing to a different array. <br />
Only tried aligned loads and stores <br />
The best unrolling seems to be `2` for shorts and '1' for chars/ints. <br />
The main routine that does the work is <a
href="https://github.com/DenisYaroshevskiy/unsq_eve/blob/6f1acf7dd2d0b609c08f5f54966d6f7dd644da51/src/eve_extra/scan_wide.h#L50">`eve_extra::inclusive_scan_wide`</a>
that computes (as the name suggests) <br />
`inclusive_scan` for one wide register. <br /> <br />
Our theoretical win comes from us doing log additions from our size. <br />
So for:<br />
* 32 chars - we do 5 additions instead of 31 => cound be a 6 times speed up.
<br />
* 16 shorts we do 4 additions intead of 15 => could be a 4 times speed up.
<br>
* 8 ints we do 3 additions as oppose to 7 of scalar => could be a 2 times
speed up.<br />
<h2> 40 bytes summary </h2>
For 40 bytes we don't win for anything except for chars.<br />
Originally I implemented `store(ignore)` with `maskmoveu` that was really bad,
<br />
but then I used memcpy for chars and shorts and it got better, see the whole
story on <a href="https://stackoverflow.com/a/62235276/5021064">
StackOverflow</a>
<br />
I tried to solve this with different iteration patterns but was not
successful: <a
href="https://stackoverflow.com/a/62492369/5021064">StackOveflow</a><br />
<h3> 40 bytes, data</h3>
<div id='40_bytes_data'></div>
<script>dynamicEntryPoint('40_bytes_data', {
name: 'inplace transform',
size: 40,
algorithm: 'selection',
type: 'x',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['eve::algo::inclusive_scan_inplace',
'std::inclusive_scan']);
</script>
<h2> 1000 bytes summary </h2>
5 times win for chars, almost 3 times win for shors and 2 times for
ints.<br />
So not quite 6, 4 and 2 but not too off.
<br />
<h3> 1000 bytes, data</h3>
<div id='1000_bytes_data'></div>
<script>dynamicEntryPoint('1000_bytes_data', {
name: 'inplace transform',
size: 1000,
algorithm: 'selection',
type: 'x',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['eve::algo::inclusive_scan_inplace',
'std::inclusive_scan']);
</script>
<h2> 10'000 bytes summary </h2>
Again - 5, 3, 2 times win.
<h3> 10'000 bytes, data</h3>
<div id='10000_bytes_data'></div>
<script>dynamicEntryPoint('10000_bytes_data', {
name: 'inplace transform',
size: 10000,
algorithm: 'selection',
type: 'x',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['eve::algo::inclusive_scan_inplace',
'std::inclusive_scan']);
</script>
<h2> Total benchmark </h2>
<div id='total_benchmark'></div>
<script>dynamicEntryPoint('total_benchmark', {
name: 'inplace transform',
size: 'x',
algorithm: 'selection',
type: 'selection',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['inclusive_scan']);
</script>
</body>