Update README.md
Keith-Cancel committed Mar 27, 2023
1 parent c1453f7 commit 2d3ef80
Showing 1 changed file with 12 additions and 10 deletions.
# K-HASHV 🔨
A single-header hash function with both vectorized and scalar versions. The function is quite fast when vectorized, averaging **~10.2 GB/s** on a 7-year-old Xeon E3-1230 v5. The header contains explicit intrinsics for x86_64, a version that uses GCC's portable vector built-ins, and, as a last fallback, a scalar version for portability. The results of the function should be the same regardless of endianness.

Additionally, it passes all of the SMHasher hash function quality tests: https://github.com/rurban/smhasher.

Moreover, it is quite easy to choose a new hash function at runtime by simply using a new seed, as shown below (a sketch; the `khashv_prep_seed64`/`khashv_hash64` names follow the `khashv.h` header, see the API section for details):
```C
khashv_seed seed;
khashv_prep_seed64(&seed, 0x12345678ULL);             /* derive the internal seed state from a 64-bit value */
uint64_t hash = khashv_hash64(&seed, data, data_len); /* hash the buffer with the newly seeded function */
```

When testing on 1.25 GB and 512 KB of random data I get the following on average:
<table>
<thead><tr><th>Processor</th><th>1.25 GB Time</th><th>1.25 GB Speed</th> <th>512 KB Time</th><th>512 KB Speed</th><th>OS</th><th>Compiler</th><th>Type</th></tr></thead>
<tbody>
<tr> <td>Xeon E3-1230 v5</td> <td>0.1226 s</td> <td>10.1987 GB/s</td> <td>45.3515 us</td> <td>10.7666 GB/s</td><td>Linux</td><td>GCC 12.2.1</td><td><strong>Vectorized</strong></td></tr>
<tr> <td>Xeon E3-1230 v5</td> <td>1.1803 s</td> <td>1.0495 GB/s</td> <td>462.9862 us</td> <td>1.0546 GB/s</td><td>Linux</td><td>GCC 12.2.1</td><td><strong>Scalar</strong></td></tr>
<tr> <td>Xeon E3-1230 v5</td> <td>0.1388 s</td> <td>9.0061 GB/s</td> <td>52.8114 us</td> <td>9.2457 GB/s</td><td>Linux</td><td>Clang 15.0.7</td><td><strong>Vectorized</strong></td></tr>
<tr> <td>Ryzen 9 7900</td> <td>0.1182 s</td> <td>10.5742 GB/s</td> <td>44.4734 us</td> <td>10.9792 GB/s</td><td>Linux</td><td>GCC 12.2.1</td><td><strong>Vectorized</strong></td></tr>
<tr> <td>Ryzen 9 7900</td> <td>0.7890 s</td> <td>1.5843 GB/s</td> <td>307.4712 us</td> <td>1.5881 GB/s</td><td>Linux</td><td>GCC 12.2.1</td><td><strong>Scalar</strong></td></tr>
</tbody>
</table>

The scalar version is much slower, at a tad over ~1 GB/s on my system, when compiling `test_speed.c` with GCC using `-O3`.
On Windows, Microsoft's compiler does not seem to generate code from the intrinsics that is as performant, but the MinGW-w64 GCC compiler generates pretty comparable numbers, at least for me.

I definitely want to add other machines to this table. If you are curious how it performs on your machine, compile `test_speed.c` with `-O3 -lm -march=native`, and with `-O3 -lm -march=native -D KHASHV_SCALAR` for the scalar version.
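For reference, the two builds above can be done roughly like this (a sketch assuming GCC on Linux in the repository root; the output binary names are arbitrary):

```shell
# Vectorized build; note that libraries such as -lm go after the source file.
gcc -O3 -march=native test_speed.c -lm -o test_speed_vec
# Scalar build, forced via the KHASHV_SCALAR define.
gcc -O3 -march=native -D KHASHV_SCALAR test_speed.c -lm -o test_speed_scalar
./test_speed_vec
./test_speed_scalar
```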

# API

When thinking about ways to improve the code and the hash function, these are the first few things that come to mind for me.
1. A faster mixing function (e.g. `khashv_mix_words_<type>`) is probably the first thing that could be improved. If it could be made shorter/faster, it would reduce latency for smaller inputs. Any ideas or feedback on this would be appreciated.

2. The next thing would be to try to get both Clang and MSVC to output code that runs as fast as GCC's, or as close as possible. Looking at the generated assembly, they both seem to do some silly things compared to GCC and lose some performance. Microsoft's compiler is the worst of the two, and probably the fastest fix for me to implement would be to write some assembly code. However, it would then no longer be a single-header hash function, since MSVC does not support inline assembly for 64-bit builds and would thus require a separate file.

3. Then probably consider using intrinsics for some other systems, like ARM NEON. For now there is the scalar code, plus the code written with GCC's vector built-ins, which will generate vectorized code for the other architectures GCC supports.

4. Choose better values for S1 and S2, the constants used to substitute bytes. The current values were found randomly by checking a small set of criteria: mainly treating each bit of S1 and S2 as a column, then XOR-ing them, effectively creating a boolean function of an 8-bit input, and making sure the whole mapping sends each input to a unique value. There are likely better values that could be chosen, and better criteria that consider all the bits at once. However, the search space is huge: effectively 2^(2\*8\*16) possible permutations for S1 and S2. Still, the current values do seem to work well in my testing. Another constant that could be looked at is the new shuffle constant in v2 that randomly permutes the bytes; it is quite likely there exists a better constant for this as well.

5. Maybe write some assembly versions to get around some of the compiler differences. Also, maybe a Rust version.

# Copyright and License

