# [performance] performance tricks to improve std/json, parsers, lookups #188
Best thing would be if we manage to map …

---
If you look at Kostya's benchmarks, the fastest JSON parser seems to be the D one (the mleise/fast library, per the edit below), and it seems to be 2x faster than simdjson: https://github.com/kostya/benchmarks#json
[EDIT]

```
git clone --depth 1 https://github.com/mleise/fast.git
gdc -o json_d_gdc_fast -O3 -frelease test_fast.d fast/source/fast/cstring.d fast/source/fast/buffer.d fast/source/fast/json.d fast/source/fast/parsing.d fast/source/fast/intmath.d fast/source/fast/internal/sysdef.di fast/source/fast/internal/helpers.d fast/source/fast/unicode.d fast/source/fast/internal/unicode_tables.d fast/source/std/simd.d
```

As you can see, it does use SIMD, and perhaps other tricks; we need to look into it, in particular: https://github.com/mleise/fast/blob/master/source/fast/json.d
Note: packedjson improves things a bit, but not much:

```nim
import pkg/packedjson

let file = "/tmp/1.json"
let jobj = parseFile(file)
let coordinates = jobj["coordinates"]
let len = float(coordinates.len)
var x = 0.0
var y = 0.0
var z = 0.0
for coord in coordinates:
  x += coord["x"].getFloat
  y += coord["y"].getFloat
  z += coord["z"].getFloat
echo x / len
echo y / len
echo z / len
```

On my machine, this goes from 0:02.63 down to 0:01.59, i.e. a 1.53x speedup. We need to gprof/profile this; the gap is not good for Nim. More importantly, the optimization opportunities (e.g. SIMD) are likely to generalize and benefit other areas of Nim beyond JSON parsing.

---
Wojciech Mula and Daniel Lemire are the same people who wrote the JSON parser, btw.

---
Something that might also be of interest is the (unfortunately compiler-specific) function multi-versioning that both GCC and Clang provide.

---
but do we need function multi-versioning when we have …

---
The mechanism behind function multi-versioning happens at runtime, when the system loader is loading an executable and its libraries. This allows using a function implementation optimized for a particular instruction set extension, while still providing a fallback for older processors that don't have that extension. In contrast, the …

There are ways to do this in a more agnostic fashion; however, they either involve creating support DLLs (compiling multiple DLLs from one set of code, each targeting a particular set of extensions) or creating object files and linking in some form of runtime dispatch.
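Purely as illustration of what GCC/Clang multi-versioning could look like from Nim, here is a hypothetical sketch that injects `__attribute__((target_clones(...)))` through codegenDecl; whether this coexists cleanly with Nim's own codegen attributes is untested, and `multiversion`/`sumBytes` are made-up names:

```nim
when defined(gcc) or defined(clang):
  # The loader's IFUNC resolver picks the best clone for the host CPU.
  {.pragma: multiversion,
    codegenDecl: "__attribute__((target_clones(\"avx2\",\"sse4.2\",\"default\"))) $# $#$#".}
else:
  {.pragma: multiversion.}

proc sumBytes(s: string): int {.multiversion.} =
  for c in s:
    result += ord(c)

echo sumBytes("abc")
```

---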
Note: MSVC allows mixing SIMD in binaries and does not require specific flags, so it causes no issue. Currently, to allow multiple SIMD paths in the same GCC or Clang executable, we have the following options.

### Each SIMD target in separate files

The SIMD flag can be passed as a … In the past we had to use an undocumented feature of …

### Multiple SIMD targets in the same file

This one is a bit more tricky at the moment, until nim-lang/Nim#10682 is solved. We would be able to do that:

```nim
when defined(gcc) or defined(clang):
  {.pragma: avx2, codegenDecl: "__attribute__((__target__(\"avx2\"))) $# $#$#".}
else:
  {.pragma: avx2.}

proc bar_generic() =
  echo 1

proc bar_avx2() {.avx2.} =
  echo 1
```

This can already be done, but it overwrites all the N_LIBPRIVATE, N_INLINE that Nim adds, so you need some macro magic to check whether the proc is exported or inline, in order to rebuild the proper codegenDecl.

### Dispatch

Unfortunately, Nim cannot rely on GCC function attribute dispatch AFAIK: it requires the functions to have the same name in C. AFAIK it's possible via {.exportc.}, but Nim complains (to be confirmed, it's been a while). Alternatively, the functions can have a wrapper that checks the SIMD support at runtime before calling the proper one. This can be done via: … (a sketch of such a wrapper follows at the end of this comment)

All 3 dispatching techniques require the same amount of code duplication.

### SIMD

The SIMD PR here, nim-lang/Nim#11816, is well done, and I'm replacing my wrapper of Facebook's cpuinfo (which is the best lightweight CPU feature detection package, outside of NUMA or HyperThreading detection needs) with it.
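A minimal sketch of the wrapper-plus-cached-pointer dispatch idea, in plain Nim. `cpuHasAvx2` is a stub standing in for a real feature query (e.g. a cpuinfo wrapper or the CPU detection from nim-lang/Nim#11816), and `foo`/`fooAvx2` are made-up names:

```nim
proc cpuHasAvx2(): bool = false  # stub; replace with a real CPUID-based check

proc fooGeneric(n: int): int = n + 1
proc fooAvx2(n: int): int = n + 1  # would be declared with the `avx2` pragma

var fooImpl: proc (n: int): int {.nimcall.}

proc foo(n: int): int =
  # Resolve the best implementation once and cache the proc pointer;
  # subsequent calls skip the feature check entirely.
  if fooImpl.isNil:
    fooImpl = if cpuHasAvx2(): fooAvx2 else: fooGeneric
  fooImpl(n)

echo foo(41)  # 42, via whichever implementation was selected
```

---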
For what it's worth, at least for that kostya benchmark, SIMD is not so critical. See this PR and a test program. I get within 1.1x of D's fast and within 1.4x of the Lemire optimized SIMD on demand (>700 MiB/s on my machine). (A PGO build is very necessary, a 1.5x speedup, as is often the case with Nim-generated C.) The executive summary is just that the current …

---
I have made a faster JSON serializer and deserializer (in pure Nim, no SIMD): https://github.com/treeform/jsony. Why is it faster? Currently, the Nim standard module first parses or serializes JSON into JsonNodes and then turns the JsonNodes into your objects with the to() macro. This is slower and creates unnecessary work for the garbage collector. My library skips the JsonNodes and creates the objects you want directly. Another speedup comes from not using StringStream: Stream has a function dispatch overhead because it has to be able to switch between StringStream and FileStream at runtime. Jsony skips that overhead and just writes directly to memory buffers. (A small sketch of the difference follows.)
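A minimal sketch of the two code paths, assuming jsony's `fromJson` API as described in its README (the std/json path is the stdlib one):

```nim
import std/json

type Coord = object
  x, y, z: float

let s = """{"x":1.0,"y":2.0,"z":3.0}"""

# std/json: build a JsonNode tree first, then convert it with the `to` macro.
let viaTree = s.parseJson().to(Coord)
echo viaTree.x

# jsony: decode straight into the object, no intermediate tree or GC churn.
# import jsony
# let direct = s.fromJson(Coord)
```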
Parse speed. …

Serialize speed. …

---
Here is the bench on my machine with nim-json-serialization: …

However, at the moment it brings in Chronos and BearSSL as dependencies :/ status-im/nim-json-serialization#25

---
@mratsim Thanks very much for pointing …

Parse speed. …

Serialize speed. …

Lessons learned: …

I hope this is useful for anyone who wants to implement fast parsers.

---
@treeform, I hope I'm not too forward, but I'd love to see you contributing to nim-json-serialization. I'm sure you'll be able to deliver similar optimisations there as well, because I haven't even tried to optimise the performance yet. nim-json-serialization and jsony both share the key architectural trait that makes parsing fast: you skip the dynamic …

I've also taken a quick glance at the jsony source code, and I can recognize some early mistakes that I went through myself while developing nim-json-serialization. For example, if you handle the built-in types with …

---
A lot of these lessons I would consider "tribal knowledge"; I think it would help the whole community if they made their way into a more widely shared document, perhaps an article on nim-lang.org. @treeform, I would encourage you to do a write-up, but I understand if you don't have the time, so I created an issue on the website repo for this: nim-lang/website#251. If anyone is interested in doing a write-up, please say so in the issue :)

---
There was a good article recently about unintuitive decisions Microsoft made with respect to string comparison algorithms in Windows. Maybe we should see if we can steal something from that when reimplementing strutils.

---
A few good reusable techniques are at play here: samuell/gccontent-benchmark#22

---
nim-lang/Nim#18183 shows how a trivial code change can give a 20x performance speedup (in jsonutils deserialization) by using a template instead of a proc, which allows lazy parameter evaluation in cases like this:

```nim
checkJson ok, $(json.len, num, numMatched, $T, json)
# The 2nd param `msg` is only evaluated on failure if checkJson is a template,
# but not if it's a proc.
```
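A self-contained illustration of the template-vs-proc difference (not the actual jsonutils code; `checkJson`/`validate` here are made up for the example): a template substitutes the `msg` argument into its body, so it is only evaluated on the failure path, whereas a proc would evaluate it eagerly on every call:

```nim
template checkJson(ok: bool, msg: string) =
  if not ok:
    raise newException(ValueError, msg)

proc validate(n: int) =
  # The expensive string construction only runs when the check fails.
  checkJson n >= 0, "negative input: " & $n

validate(3)  # cheap: the message is never built
```

---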
As mentioned by @Araq, we could port/adapt/get inspired by https://github.com/lemire/simdjson to improve performance in several places, e.g. std/json or other parsers; here's the talk: https://www.youtube.com/watch?v=wlvKAT7SZIQ&list=PLSXYOVE4fQ0-wJlM5CrS1e8XP2ZzvpYCe&index=20&t=0s ; highly recommended.

It has a number of distinct useful ideas to improve performance that could be used (not just for JSON parsing) as reusable components:

- avoiding branching (to avoid branch misprediction) using clever tricks; see the first sketch after this list
- using SIMD, AVX-512, etc.
- adding performance regression tests (we don't do that IIRC in testament)
- UTF-8 validation, e.g. `_mm256_subs_epu8(current_bytes, 244)` has no branches and 1 instruction checks 32 bytes at once; see the second sketch after this list
- dynamic runtime dispatch (with caching of which function pointer to use) to pick the most optimized function for a given processor/architecture, without having to recompile
- caveats regarding how to evaluate performance where branch prediction is involved: when using random data with a fixed seed (even with, say, 2000 iterations), the CPU might "learn" to do branch prediction perfectly, but that won't reflect performance on real data where there is no fixed pattern
- ARM NEON and x64 processors have instructions to look up 16-byte tables in a vectorized manner (16 values at a time): `pshufb`, `tbl` => could that be used to optimize [EDIT] `set[int]`, `set[char]` operations?
- fast number parsing (number parsing is expensive: 10 branch misses per FP number) => https://youtu.be/wlvKAT7SZIQ?list=PLSXYOVE4fQ0-wJlM5CrS1e8XP2ZzvpYCe&t=2368
- etc...
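First, a hedged illustration of the branch-avoidance idea in plain Nim: the data-dependent `if` becomes arithmetic, so there is nothing for the CPU to mispredict (whether the C compiler keeps it branch-free would need checking; this is the technique, not simdjson's code, and `countDigits` is a made-up example):

```nim
proc countDigits(s: string): int =
  for c in s:
    # The bool comparison converts to 0 or 1; we always add,
    # instead of branching on a hard-to-predict condition.
    result += ord(c >= '0' and c <= '9')

doAssert countDigits("a1b22c") == 3
```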
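Second, a hedged sketch of the UTF-8 validation bullet, binding the AVX2 intrinsics via importc (assumes x86-64, the C backend, and compiling with -mavx2; the nimsimd package offers maintained bindings instead). Bytes 0xF5..0xFF can never occur in valid UTF-8, and one saturating subtract exposes them across 32 bytes without a per-byte branch:

```nim
{.passC: "-mavx2".}

type M256i {.importc: "__m256i", header: "immintrin.h", bycopy.} = object

proc mm256_loadu_si256(p: pointer): M256i
  {.importc: "_mm256_loadu_si256", header: "immintrin.h".}
proc mm256_set1_epi8(a: uint8): M256i
  {.importc: "_mm256_set1_epi8", header: "immintrin.h".}
proc mm256_subs_epu8(a, b: M256i): M256i
  {.importc: "_mm256_subs_epu8", header: "immintrin.h".}
proc mm256_testz_si256(a, b: M256i): int32
  {.importc: "_mm256_testz_si256", header: "immintrin.h".}

proc hasByteAbove0xF4(buf: openArray[byte]): bool =
  ## True if any byte of a 32-byte chunk is > 0xF4 (never valid in UTF-8).
  assert buf.len >= 32
  let diff = mm256_subs_epu8(mm256_loadu_si256(unsafeAddr buf[0]),
                             mm256_set1_epi8(0xF4'u8))
  # testz returns 1 iff diff is all zero, i.e. every byte was <= 0xF4.
  mm256_testz_si256(diff, diff) == 0
```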