
[performance] performance tricks to improve std/json, parsers, lookups #188

Open · timotheecour opened this issue Jan 31, 2020 · 18 comments

@timotheecour
Member

timotheecour commented Jan 31, 2020

as mentioned by @Araq, we could port/adapt/take inspiration from https://github.com/lemire/simdjson to improve performance in several places, eg std/json or other parsers; here's the talk: https://www.youtube.com/watch?v=wlvKAT7SZIQ&list=PLSXYOVE4fQ0-wJlM5CrS1e8XP2ZzvpYCe&index=20&t=0s ; highly recommended

it has a number of distinct useful ideas to improve performance that could be used (not just for json parsing), as reusable components:

  • avoiding branching (to avoid branch misprediction) using clever tricks (see the small sketch after this list)

  • using SIMD, AVX-512 etc

  • adding performance regression tests (we don't do that IIRC in testament)

  • UTF8 validation eg: _mm256_subs_epu8(current_bytes, 244) has no branches, 1 instruction checks 32 bytes at once

  • dynamic runtime dispatch (with caching of which function pointer to use) to pick the most optimized function for a given processor/architecture, without having to recompile

  • caveats regarding how to evaluate performance where branch prediction is involved: when using random data with a fixed seed (even with, say, 2000 iterations), the CPU might "learn" to do branch prediction perfectly, but that won't reflect performance on real data where there is no fixed pattern

  • ARM neon and x64 processors have instructions to lookup 16 byte tables in a vectorized manner (16 values at a time): pshufb, tbl
    => could that be used to optimize set[int] [EDIT] set[char] operations?

  • fast number parsing (number parsing is expensive: 10 branch misses per FP number)
    => https://youtu.be/wlvKAT7SZIQ?list=PLSXYOVE4fQ0-wJlM5CrS1e8XP2ZzvpYCe&t=2368

  • etc...
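To make the branch-avoidance point concrete, here is a tiny Nim sketch (an illustration, not code from the talk): replacing a data-dependent branch with arithmetic gives the CPU nothing to mispredict.

proc countQuotesBranchy(s: string): int =
  for c in s:
    if c == '"': inc result      # mispredicts on irregular real-world data

proc countQuotesBranchless(s: string): int =
  for c in s:
    result += ord(c == '"')      # bool converts to 0/1: no conditional jump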

links

@Araq
Member

Araq commented Jan 31, 2020

Best thing would be if we managed to map set[char] and loops over them to SIMD instructions.

@mratsim
Collaborator

mratsim commented Feb 4, 2020

If you look at Kostya's benchmarks, the fastest JSON parser seems to be in the D standard library and seems to be 2x faster than simdjson: https://github.com/kostya/benchmarks#json

| Language | Time, s | Memory, MiB | Energy, J |
| --- | --- | --- | --- |
| GDC fast | 0.14 ± 00.00 | 231.46 ± 00.12 | 3.37 ± 00.18 |
| Rust Serde Custom | 0.19 ± 00.00 | 108.54 ± 00.07 | 5.14 ± 00.06 |
| Rust Serde Typed | 0.20 ± 00.00 | 120.33 ± 00.13 | 5.74 ± 00.73 |
| C++ simdjson | 0.31 ± 00.00 | 496.82 ± 00.92 | 8.29 ± 00.46 |
| C++ gason | 0.32 ± 00.00 | 313.18 ± 00.09 | 8.28 ± 00.42 |
| C++ RapidJSON | 0.36 ± 00.00 | 345.17 ± 00.09 | 11.32 ± 00.29 |
| Java | 0.49 ± 00.01 | 569.25 ± 24.45 | 14.19 ± 00.74 |
| Scala | 0.57 ± 00.00 | 461.18 ± 04.73 | 19.27 ± 00.86 |
| Julia JSON3 | 0.83 ± 00.01 | 834.65 ± 00.68 | 22.72 ± 01.15 |
| C++ RapidJSON SAX | 0.89 ± 00.00 | 110.19 ± 00.04 | 23.35 ± 00.63 |
| Go jsoniter | 0.98 ± 00.01 | 225.83 ± 00.13 | 26.63 ± 00.76 |
| Node.js | 0.98 ± 00.00 | 293.57 ± 03.03 | 37.35 ± 01.44 |
| Rust Serde Untyped | 1.04 ± 00.08 | 916.54 ± 00.06 | 27.79 ± 01.44 |
| PyPy | 1.07 ± 00.00 | 405.69 ± 00.23 | 29.33 ± 00.85 |
| Perl Cpanel::JSON::XS | 1.29 ± 00.02 | 517.07 ± 00.03 | 35.33 ± 00.53 |
| Crystal Schema | 1.43 ± 00.01 | 157.33 ± 00.12 | 39.11 ± 01.87 |
| Crystal Pull | 1.44 ± 00.02 | 128.49 ± 00.03 | 40.26 ± 02.61 |
| Clojure | 1.60 ± 00.04 | 1542.98 ± 20.03 | 64.27 ± 01.99 |
| PHP | 1.65 ± 00.02 | 803.43 ± 00.08 | 42.46 ± 01.94 |
| Crystal | 1.74 ± 00.01 | 503.61 ± 00.03 | 59.80 ± 02.46 |
| CPython UltraJSON | 2.06 ± 00.01 | 689.01 ± 01.60 | 66.12 ± 00.57 |
| Go | 2.08 ± 00.01 | 400.68 ± 00.14 | 58.15 ± 01.63 |
| V GCC | 2.08 ± 00.01 | 591.95 ± 00.05 | 54.50 ± 00.48 |
| V Clang | 2.08 ± 00.01 | 592.46 ± 00.02 | 54.88 ± 00.66 |
| Python | 2.33 ± 00.01 | 519.47 ± 00.04 | 63.84 ± 02.25 |
| C++ json-c | 2.70 ± 00.02 | 1756.22 ± 00.06 | 70.98 ± 01.54 |
| Nim GCC | 2.99 ± 00.01 | 908.77 ± 00.05 | 78.37 ± 00.66 |
| Nim Clang | 3.06 ± 00.01 | 909.29 ± 00.04 | 80.44 ± 01.25 |

@timotheecour
Member Author

timotheecour commented Feb 4, 2020

[EDIT]

git clone --depth 1 https://github.com/mleise/fast.git

gdc -o json_d_gdc_fast -O3 -frelease test_fast.d fast/source/fast/cstring.d fast/source/fast/buffer.d fast/source/fast/json.d fast/source/fast/parsing.d fast/source/fast/intmath.d fast/source/fast/internal/sysdef.di fast/source/fast/internal/helpers.d fast/source/fast/unicode.d fast/source/fast/internal/unicode_tables.d fast/source/std/simd.d

as you can see, it does use simd, and perhaps other tricks; we need to look into it, in particular: https://github.com/mleise/fast/blob/master/source/fast/json.d

  • I tried ldc + fast (no easy access to gdc on osx) and on my machine (with ldmd2 -O5 -release -inline -boundscheck=off) it gave 0.27s, ie a factor 6.77 speedup over Nim using std/json, but I expect gdc to be even faster according to the benchmark

note:

packedjson improves things a bit, but not much
(and using parseFile instead of readFile + parseJson helps a tiny bit more, though that isn't even used in the D version)

  import pkg/packedjson
  let file = "/tmp/1.json"
  let jobj = parseFile(file)
  let coordinates = jobj["coordinates"]
  let len = float(coordinates.len)

  var x = 0.0
  var y = 0.0
  var z = 0.0

  for coord in coordinates:
    x += coord["x"].getFloat
    y += coord["y"].getFloat
    z += coord["z"].getFloat

  echo x / len
  echo y / len
  echo z / len

on my machine, this goes from 0:02.63 down to 0:01.59, ie a 1.53x speedup

We need to profile this (eg with gprof); the gap is not good for Nim.

more importantly, the optimization opportunities (eg simd) are likely to generalize and benefit other areas of Nim beyond json parsing

@mratsim
Collaborator

mratsim commented Feb 8, 2020

https://core.ac.uk/download/pdf/150449188.pdf

Wojciech Muła and Daniel Lemire are the same people who wrote the json parser, btw

@Varriount

Varriount commented Feb 10, 2020

Something that might also be of interest is the (unfortunately compiler specific) function multi-versioning that both GCC and Clang provide.

@timotheecour
Member Author

but do we need function multi-versioning when we have when?

@Varriount

The mechanism behind function multi-versioning happens at runtime, when the system loader is loading an executable and its libraries.

This allows using a function implementation optimized for a particular instruction set extension, while still providing a fallback for older processors that don't have the instruction set extension.

In contrast, the when mechanism works at compile time. You can use it to compile functions optimized for a particular instruction set extension; however, executing those functions on architectures that do not support that extension will crash the program (see the sketch below).

There are ways to do this in a more agnostic fashion, however they either involve creating support dlls (compiling multiple dlls from one set of code, each targeting a particular set of extensions) or creating object files and linking in some form of runtime dispatch.
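A minimal sketch of the compile-time when approach described above (the useAvx2 define and proc names are hypothetical):

# selection happens at compile time: each binary contains exactly one path
when defined(useAvx2):
  proc parseChunk(s: string) =
    discard   # AVX2-optimized body; crashes on CPUs without AVX2
else:
  proc parseChunk(s: string) =
    discard   # portable fallback body

Building with nim c -d:useAvx2 selects the optimized path, but the resulting binary is committed to it.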

@mratsim
Collaborator

mratsim commented Mar 2, 2020

Note: MSVC allows mixing SIMD in binaries and does not require specific flags, so it causes no issues.

Currently, to allow multiple SIMD paths in the same GCC or Clang executable, we have the following options.

Each SIMD target in separate files

The SIMD flag can be passed as a {.localPassC: "-mavx".} pragma, for example (see the sketch below).
This is what I use in Weave: https://github.com/mratsim/weave/blob/v0.3.0/benchmarks/matmul_gemm_blas/gemm_pure_nim/common/gemm_ukernel_avx2.nim#L6

In the past we had to use an undocumented feature of nim.cfg to specify per-file flags, but it caused deployment issues in Laser/Arraymancer.
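A minimal sketch of the per-file approach (module and proc names are just for illustration; the pragma is the one mentioned above):

# simd_avx.nim: everything in this module is compiled with -mavx
{.localPassC: "-mavx".}

proc ukernelAvx*() =
  discard   # AVX-only code goes here; other modules keep their default flags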

Multiple SIMD target in the same file

This one is a bit trickier at the moment, until nim-lang/Nim#10682 is solved.

We would then be able to do this:

when defined(gcc) or defined(clang):
  {.pragma: avx2, codegenDecl: """__attribute__((__target__("avx2"))) $# $#$#""".}
else:
  {.pragma: avx2.}

proc bar_generic() =
  echo 1

proc bar_avx2() {.avx2.} =
  echo 1

This can already be done, but it overwrites the N_LIB_PRIVATE and N_INLINE attributes that Nim adds, so you need some macro magic to check whether the proc is exported or inline and rebuild the proper codegenDecl.

Dispatch

Unfortunately Nim cannot rely on GCC function-attribute dispatch AFAIK, since it requires the functions to have the same name in C. AFAIK that's possible via {.exportc.} but Nim complains (to be confirmed, it's been a while).

Alternatively, the functions can have a wrapper that checks the SIMD features supported at runtime before calling the proper one. This can be done via:

  • a dispatch table (similar to the low-level implementation of async dispatch); caveat: it prevents inlining and guarantees a cache miss due to the dereference.
  • a switch dispatch, which should be very easily predictable by the processor since it takes the same branch every single time (see the sketch after this list).
  • "pipeline-level" dispatch, meaning that within a processing pipeline the implementation is selected by a static enum on function entry (say json parsing and json exporting). Example: https://github.com/mratsim/weave/blob/v0.3.0/benchmarks/matmul_gemm_blas/gemm_pure_nim/gemm_weave.nim#L217-L254 with
    type CPUFeatureX86* = enum
      x86_Generic,
      x86_SSE,
      x86_SSE2,
      x86_SSE4_1,
      x86_AVX,
      x86_AVX_FMA,
      x86_AVX2,

All 3 dispatching techniques require the same amount of code duplication.
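A minimal sketch of the switch-dispatch option (feature detection is stubbed out; a real implementation would query cpuid, eg via the cpuinfo wrapper mentioned below):

type CpuFeatureX86 = enum
  x86_Generic, x86_AVX2

proc detectCpuFeature(): CpuFeatureX86 =
  x86_Generic   # placeholder: real code would inspect cpuid

let cpuFeature = detectCpuFeature()   # computed once at startup

proc parseChunkGeneric(s: string) = discard   # portable fallback
proc parseChunkAvx2(s: string) = discard      # would carry the avx2 pragma

proc parseChunk(s: string) {.inline.} =
  case cpuFeature   # takes the same branch every call: well predicted
  of x86_AVX2: parseChunkAvx2(s)
  else: parseChunkGeneric(s)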

SIMD

The SIMD PR here (nim-lang/Nim#11816) is well done, and I'm replacing my wrapper of Facebook's cpuinfo with it (cpuinfo being the best lightweight CPU feature detection package, short of NUMA or hyperthreading detection needs).

@c-blake

c-blake commented Nov 20, 2020

For what it's worth, at least for that kostya benchmark, SIMD is not so critical. See this pr and a test program. I get within 1.1x of D's fast and within 1.4x of the Lemire optimized SIMD on demand (>700 MiB/s on my machine). (A PGO build is very necessary: a 1.5x speed-up, as is often the case with Nim-generated C.)

The executive summary is that the current parsejson just does the same work in triplicate. While that last 1.4x might be nice someday, de-triplicating should probably be the initial focus.

@treeform

treeform commented Jan 7, 2021

I have made a faster json serializer and deserializer (in pure Nim, no SIMD): https://github.com/treeform/jsony. Why is it faster?

Currently the Nim standard module first parses or serializes json into JsonNodes and then turns the JsonNodes into your objects with the to() macro. This is slower and creates unnecessary work for the garbage collector. My library skips the JsonNodes and creates the objects you want directly.

Another speed-up comes from not using StringStream. Stream has function dispatch overhead because it has to be able to switch between StringStream and FileStream at runtime. Jsony skips that overhead and writes directly to memory buffers.
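To make the two paths concrete, here is a rough sketch; the std/json calls are the real API, while jsony's fromJson is shown as an assumption based on its README:

import std/json   # stdlib path: string -> JsonNode tree -> object

type Coord = object
  x, y, z: float

let data = """{"x":1.0,"y":2.0,"z":3.0}"""

# std/json: builds an intermediate JsonNode tree, then converts with to()
echo data.parseJson().to(Coord)

# jsony (assumed API): parses straight into the object, no intermediate tree
# import jsony
# echo data.fromJson(Coord)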

Parse speed.

name ............................... min time      avg time    std dv  times
treeform/jsony .................... 10.121 ms     10.809 ms    ±1.208   x100
planetis-m/eminim ................. 12.006 ms     12.574 ms    ±1.156   x100
nim std/json ...................... 23.282 ms     34.679 ms    ±7.512   x100

Serialize speed.

name ............................... min time      avg time    std dv  times
treeform/jsony ..................... 4.047 ms      4.172 ms    ±0.225   x100
planetis-m/eminim .................. 7.173 ms      7.324 ms    ±0.253   x100
disruptek/jason ................... 10.220 ms     11.155 ms    ±0.689   x100
nim std/json ...................... 11.526 ms     15.181 ms    ±0.857   x100

@mratsim
Collaborator

mratsim commented Jan 7, 2021

Here is the bench on my machine with nim-json-serialization:

[benchmark results image]

However, at the moment it brings in Chronos and BearSSL as dependencies :/ status-im/nim-json-serialization#25

@treeform

@mratsim Thanks very much for pointing out the json_serialization library. I studied how the library goes this fast and implemented the things I learned in mine:

Parse speed.

name ............................... min time      avg time    std dv  times
treeform/jsony ..................... 6.724 ms     12.047 ms    ±3.694   x100
status-im/nim-json-serialization ... 7.119 ms     14.276 ms    ±2.033   x100
nim std/json ...................... 24.141 ms     38.741 ms    ±5.417   x100
planetis-m/eminim ................. 10.974 ms     18.355 ms    ±3.994   x100

Serialize speed.

name ............................... min time      avg time    std dv  times
treeform/jsony ..................... 1.531 ms      2.779 ms    ±0.091   x100
status-im/nim-json-serialization ... 2.043 ms      3.448 ms    ±0.746   x100
planetis-m/eminim .................. 5.951 ms      9.305 ms    ±3.210   x100
disruptek/jason ................... 10.312 ms     13.471 ms    ±3.107   x100
nim std/json ...................... 12.551 ms     19.419 ms    ±4.039   x100

Lessons learned:

I hope this is useful for anyone who wants to implement fast parsers.

@zah
Member

zah commented Jan 10, 2021

@treeform, I hope I'm not being too forward, but I'd love to see you contributing to nim-json-serialization. I'm sure you'll be able to deliver similar optimisations there as well, because I haven't even tried to optimise the performance yet.

nim-json-serialization and jsony share the key architectural trait that makes parsing fast: you skip the dynamic JsonNodes and instead directly populate a strongly typed object. I'd say that nim-json-serialization has some additional advantages:

  1. It's part of the nim-serialization family of libraries conforming to the same standard API. We also have libraries for TOML, ProtoBuf, SSZ and other formats that can all be tested with a single generic test suite defined in nim-serialization.

  2. Since we've used the library for a long time in many projects now, it has had to learn to handle many tricky aspects of Nim, such as case objects, inheritance, when statements in the record fields, etc. What's even better is that this knowledge is embedded in the parent nim-serialization library, and as we improve it, it improves all of the formats together. The common serialization API allows many advanced features such as skipped or renamed fields and a lot of options when defining custom serializers.

  3. nim-json-serialization builds on top of nim-faststreams instead of working with strings only. This makes the library significantly more versatile and you'll get the same stellar performance when working with files, sockets and other input sources.

I've also taken a quick glance at the jsony source code, and I can recognize some early mistakes that I went through myself while developing nim-json-serialization. For example, if you handle the built-in types with when statements instead of overloads, your users will face significantly less noisy error messages when dealing with overload visibility problems (see the sketch below).
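A minimal sketch of the when-instead-of-overloads idea (writeJson is an illustrative name, not nim-json-serialization's actual API):

proc writeJson[T](s: var string; v: T) =
  # one generic entry point; `when` picks a branch at compile time,
  # so users see a single symbol instead of a forest of overloads
  when T is SomeInteger:
    s.add $v
  elif T is bool:
    s.add(if v: "true" else: "false")
  elif T is string:
    s.add '"'; s.add v; s.add '"'   # real code would escape v
  else:
    s.add $v                        # fallback: plain stringification

var buf = ""
buf.writeJson(42)
buf.writeJson("hi")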

@dom96
Contributor

dom96 commented Jan 10, 2021

Lessons learned:

A lot of these lessons are what I would consider "tribal knowledge". I think it would help the whole community if they made their way into a more widely shared document, perhaps an article on nim-lang.org.

@treeform I would encourage you to do a write-up, but I understand if you don't have the time, so I created an issue on the website repo for this: nim-lang/website#251. If anyone is interested in doing a write-up, please say so in the issue :)

@disruptek

There was a good article recently about unintuitive decisions Microsoft made with respect to string comparison algos in Windows. Maybe we should see if we can steal something from that when reimplementing strutils.

@timotheecour
Member Author

a few good reusable techniques at play here: samuell/gccontent-benchmark#22

  • memfile.open instead of open
  • zero-copy lines iterator
  • replace case c of ... by array[low(char)..high(char), T] where possible (sketched after this list)
  • bit packing
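A minimal sketch of the case-to-array technique (counting G/C bases, matching the theme of the linked benchmark; names are illustrative):

const isGC: array[char, bool] = block:
  var t: array[char, bool]
  for c in {'G', 'C', 'g', 'c'}: t[c] = true
  t

proc countGC(s: string): int =
  for c in s:
    result += ord(isGC[c])   # table lookup + add, no per-character branch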

@timotheecour
Member Author

timotheecour commented Jun 5, 2021

nim-lang/Nim#18183 shows how a trivial code change can give a 20x performance speedup (in jsonutils deserialization) by using a template instead of a proc, which allows lazy param evaluation in cases like this:

checkJson ok, $(json.len, num, numMatched, $T, json)
  # The 2nd param `msg` is only evaluated on failure if checkJson is a template, but not if it's a proc
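A minimal sketch of why the template is lazy while the proc is not (simplified signatures, not the actual jsonutils code):

proc checkJsonProc(ok: bool; msg: string) =
  if not ok: echo "error: ", msg   # msg was already built by the caller

template checkJsonTmpl(ok: bool; msg: string) =
  if not ok: echo "error: ", msg   # the msg expression is expanded here,
                                   # so the expensive $ runs only on failure

let ok = true
checkJsonTmpl ok, $(1, 2.0, "expensive")   # string never built when ok is true
checkJsonProc ok, $(1, 2.0, "expensive")   # string always built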
