Improve and update the documentation #388

Merged
merged 2 commits into from
Sep 11, 2024
2 changes: 1 addition & 1 deletion Cargo.toml
Expand Up @@ -79,7 +79,7 @@ known-key = ["once_cell", "ahash"]
# use 8 number at once parsing strategy
swar-number-parsing = []

# Uses a approximate float parsing algorithm that is faster
# Uses an approximate float parsing algorithm that is faster
# but does not guarantee round trips for the edges
approx-number-parsing = []

193 changes: 146 additions & 47 deletions README.md
Expand Up @@ -10,89 +10,186 @@
[Code Coverage]: https://coveralls.io/repos/github/simd-lite/simd-json/badge.svg?branch=main
[coveralls]: https://coveralls.io/github/simd-lite/simd-json?branch=main

**Rust port of extremely fast [simdjson](https://github.com/lemire/simdjson) JSON parser with [serde][1] compatibility.**
**Rust port of extremely fast [simdjson](https://github.com/lemire/simdjson) JSON parser with [Serde][serde] compatibility.**

---

## readme (for real!)
simd-json is a Rust port of the [simdjson c++ library](https://simdjson.org/).
It follows most of the design closely with a few exceptions to make it better
fit into the Rust ecosystem.

### simdjson version
## Goals

**Currently tracking version 0.2.x of simdjson upstream (work in progress, feedback is welcome!).**
The goal of the Rust port of simdjson is not to create a one-to-one
copy, but to integrate the principles of the C++ library into
a Rust library that plays well with the Rust ecosystem. As such
we provide both compatibility with Serde as well as parsing to a
DOM to manipulate data.

### CPU target
## Performance

To be able to take advantage of `simd-json` your system needs to be SIMD-capable. On `x86`, it will select the best SIMD feature set (`avx2` or `sse4.2`) during runtime. If `simd-json` is compiled with SIMD support, it will disable runtime detection.
As a rule of thumb this library tries to get as close as possible
to the performance of the C++ implementation (currently tracking 0.2.x, work in progress).
However, in some design decisions—such as parsing to a DOM or a tape—ergonomics is prioritized over
performance. In other places Rust makes it harder to achieve the same level of performance.

`simd-json` supports AVX2, SSE4.2, NEON, and simd128 (wasm) natively. It also includes an unoptimized fallback implementation using native rust for other platforms; however, this is a last resort measure and nothing we'd recommend relying on.
To take advantage of this library your system needs to support SIMD instructions. On `x86`, it will
select the best available supported instruction set (`avx2` or `sse4.2`) when the `runtime-detection` feature
is enabled (default). On `aarch64` this library uses the `NEON` instruction set. On `wasm` this library uses
the `simd128` instruction set when available. When no supported SIMD instructions are found, this library will use a
fallback implementation, but this is significantly slower.

### Performance characteristics
### Allocator
For best performance, we highly suggest using [mimalloc](https://crates.io/crates/mimalloc) or [jemalloc](https://crates.io/crates/jemalloc) instead of the system allocator used by
default. Another recent allocator that works well (but we have yet to test it in production) is [snmalloc](https://github.com/microsoft/snmalloc).

- CPU native cpu compilation results in the best performance.
- CPU detection for AVX and SSE4.2 is the second fastest (on x86_* only).
- Portable std::simd is the next fast implementation when compiled with a native CPU target.
- std::simd or the rust native implementation is the least performant.

### allocator
## Safety

For best performance, we highly suggest using [mimalloc](https://crates.io/crates/mimalloc) or [jemalloc](https://crates.io/crates/jemalloc) instead of the system allocator used by default. Another recent allocator that works well (but we have yet to test it in production) is [snmalloc](https://github.com/microsoft/snmalloc).
`simd-json` uses **a lot** of unsafe code.

### `runtime-detection`
There are a few reasons for this:

This feature allows selecting the optimal algorithm based on available features during runtime; it has no effect on non-x86 or x86_64 platforms. When neither `AVX2` nor `SSE4.2` is supported, it will fall back to a native rust implementation.
* SIMD intrinsics are inherently unsafe. These uses of unsafe are inescapable in a library such as `simd-json`.
* We work around some performance bottlenecks imposed by safe Rust. These are avoidable, but at a performance cost.
This is a more considered path in `simd-json`.

Note that an application compiled with `runtime-detection` will not run as fast as an application compiled for a specific CPU. The reason is that rust can't optimise as far as the instruction set when it uses the generic instruction set, and non-simd parts of the code won't be optimised for the given instruction set either.

### `portable`
`simd-json` goes through extra scrutiny for unsafe code. These steps are:

**Currently disabled**
* Unit tests - to test 'the obvious' cases, edge cases, and regression cases
* Structural constructive property based testing - We generate random valid JSON objects to exercise the full `simd-json`
codebase stochastically. Floats are currently excluded since slightly different parsing algorithms lead to slightly
different results here. In short "is simd-json correct".
* Data-oriented property-based testing of string-like data - to assert that sequences of legal printable characters
don't panic or crash the parser (they may, and often do, produce errors, since they are not valid JSON!)
* Destructive Property based testing - make sure that no illegal byte sequences crash the parser in any way
* Fuzzing - fuzz based on upstream & jsonorg simd pass/fail cases
* Miri testing for UB

An implementation of the algorithm using `std::simd` and up to 512 byte wide registers, currently disabled due to dependencies and is highly experimental.
This doesn't ensure complete safety, nor is it a bulletproof guarantee, but it does go a long way
to assert that the library is of high production quality and fit for purpose for practical industrial applications.

### `serde_impl`
## Features
Various features can be enabled or disabled to tweak various parts of this library. Any features not mentioned here are
for internal configuration and testing.

`simd-json` is compatible with serde and `serde-json`. The Value types provided implement serializers and deserializers. In addition to that `simd-json` implements the `Deserializer` trait for the parser so it can deserialize anything that implements the serde `Deserialize` trait. Note, that serde provides both a `Deserializer` and a `Deserialize` trait.
### `runtime-detection` (default)

That said the serde support is contained in the `serde_impl` feature which is part of the default feature set of `simd-json`, but it can be disabled.
This feature allows selecting the optimal algorithm based on available features during runtime. It has no effect on
non-`x86` platforms. When neither `AVX2` nor `SSE4.2` is supported, it will fall back to a native Rust implementation.

Disabling this feature (with `default-features = false`) **and** setting `RUSTFLAGS="-C target-cpu=native"` will result
in better performance but the resulting binary will not be portable across `x86` processors.
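
As a rough illustration of what runtime detection means on `x86`, the sketch below uses the standard library's `is_x86_feature_detected!` macro to pick a backend. The `Backend` enum and `select_backend` function are hypothetical names made up for this example; they are not part of the `simd-json` API, and the crate's actual dispatch logic may differ.

```rust
/// Hypothetical backend tag, used only for this illustration.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx2,
    Sse42,
    Neon,
    Fallback,
}

/// Pick the best instruction set the current CPU reports at runtime.
fn select_backend() -> Backend {
    // The detection macro only exists on x86 targets, so gate it with cfg.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
        if std::arch::is_x86_feature_detected!("sse4.2") {
            return Backend::Sse42;
        }
    }
    // Assumption for this sketch: every aarch64 target provides NEON.
    if cfg!(target_arch = "aarch64") {
        Backend::Neon
    } else {
        Backend::Fallback
    }
}
```

Compiling with `-C target-cpu=native` moves this decision to compile time, which is why the resulting binary is no longer portable.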

### `serde_impl` (default)

Enable [Serde](https://serde.rs) support. This consists of implementing `serde::Serializer` and `serde::Deserializer`,
allowing types that implement `serde::Serialize`/`serde::Deserialize` to be constructed/serialized to
`BorrowedValue`/`OwnedValue`.
In addition, this provides the same convenience functions that [`serde_json`](https://docs.rs/serde_json/latest/serde_json/) provides.

Disabling this feature (with `default-features = false`) will remove `serde` and `serde_json` from the dependencies.

### `swar-number-parsing` (default)
Enables a parsing method that will parse 8 digits at a time for floats. This is a common pattern but comes with a slight
performance hit if most of the floats have fewer than 8 digits.
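
To give a feel for the "8 digits at a time" idea, here is a minimal sketch of the SWAR (SIMD within a register) trick. It is not copied from `simd-json`'s source: `parse_eight_digits` is a hypothetical helper, and it assumes the caller has already verified that all eight bytes are ASCII digits.

```rust
/// Convert exactly eight ASCII digits into their numeric value using
/// three multiply-and-shift steps instead of eight separate digit steps.
/// Assumption: every byte in `chunk` is in b'0'..=b'9'.
fn parse_eight_digits(chunk: [u8; 8]) -> u32 {
    // Little-endian load: the first (most significant) digit lands in the lowest byte.
    let mut val = u64::from_le_bytes(chunk);
    // Strip the ASCII offset, then merge neighbouring digits into 2-digit values
    // (2561 = 10 * 2^8 + 1).
    val = (val & 0x0F0F_0F0F_0F0F_0F0F).wrapping_mul(2561) >> 8;
    // Merge neighbouring 2-digit values into 4-digit values
    // (6_553_601 = 100 * 2^16 + 1).
    val = (val & 0x00FF_00FF_00FF_00FF).wrapping_mul(6_553_601) >> 16;
    // Merge the two 4-digit values into the final 8-digit number
    // (42_949_672_960_001 = 10_000 * 2^32 + 1).
    ((val & 0x0000_FFFF_0000_FFFF).wrapping_mul(42_949_672_960_001) >> 32) as u32
}
```

The multiplications intentionally wrap: only the low 64 bits of each product are needed.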

### `known-key`

The `known-key` feature changes the hash mechanism for the DOM representation of the underlying JSON object, from `ahash` to `fxhash`. The `ahash` hasher is faster at hashing and provides protection against DOS attacks by forcing multiple keys into a single hashing bucket. The `fxhash` hasher, on the other hand allows for repeatable hashing results, which in turn allows memoizing hashes for well known keys and saving time on lookups. In workloads that are heavy on accessing some well-known keys, this can be a performance advantage.
The `known-key` feature changes the hash mechanism for the DOM representation of the underlying JSON object from
`ahash` to `fxhash`. The `ahash` hasher is faster at hashing and provides protection against DoS attacks that work by forcing
multiple keys into a single hashing bucket. The `fxhash` hasher allows for repeatable hashing results,
which in turn allows memoizing hashes for well known keys and saving time on lookups. In workloads that are heavy on
accessing some well-known keys, this can be a performance advantage.

The `known-key` feature is optional and disabled by default and should be explicitly configured.
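
Memoizing a hash only works when the hasher is deterministic. The sketch below illustrates that property with a small hand-written FNV-1a hasher standing in for `fxhash` (an assumption made to keep the example dependency-free). `Fnv1a` and `hash_key` are hypothetical names and not part of the `simd-json` API.

```rust
use std::hash::{BuildHasher, BuildHasherDefault, Hasher};

/// Tiny deterministic hasher (FNV-1a), a stand-in for a fixed-seed hasher.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Hash a key with a freshly built hasher, as a map would on every lookup.
fn hash_key(key: &str) -> u64 {
    BuildHasherDefault::<Fnv1a>::default().hash_one(key)
}
```

Because `hash_key` returns the same value on every call, the hash of a well-known key can be computed once and reused. A randomly seeded hasher cannot offer this across map instances.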

### `value-no-dup-keys`


**This flag has no effect on simd-json itself but purely affects the `Value` structs.**

The `value-no-dup-keys` feature flag toggles stricter behavior for objects when deserializing into a `Value`. When enabled, the Value deserializer will remove duplicate keys in a JSON object and only keep the last one. If not set duplicate keys are considered undefined behavior and Value will not make guarantees on it's behavior.
The `value-no-dup-keys` feature flag enables stricter behavior for objects when deserializing into a `Value`. When
enabled, the Value deserializer will remove duplicate keys in a JSON object and only keep the last one. If not set,
duplicate keys are considered undefined behavior and Value will not make guarantees on its behavior.
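
The "keep the last one" behavior can be pictured with a plain `HashMap`. This is only an analogy built on the standard library, with a hypothetical `collect_last_wins` helper; it does not show how `Value` is implemented.

```rust
use std::collections::HashMap;

/// Build a map from key/value pairs in document order. A later pair
/// with the same key overwrites the earlier one.
fn collect_last_wins<'a>(pairs: &[(&'a str, i64)]) -> HashMap<&'a str, i64> {
    let mut object = HashMap::new();
    for &(key, value) in pairs {
        // `insert` replaces any value previously stored under `key`.
        object.insert(key, value);
    }
    object
}
```

Feeding it the pairs of `{"a": 1, "b": 2, "a": 3}` yields two entries, with `"a"` mapped to `3`.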

### `big-int-as-float`

The `big-int-as-float` feature flag treats very large integers that won't fit into u64 as f64 floats. This prevents parsing errors if the JSON you are parsing contains very large numbers. Keep in mind that f64 loses some precision when representing very large numbers.
The `big-int-as-float` feature flag treats very large integers that won't fit into u64 as f64 floats. This prevents
parsing errors if the JSON you are parsing contains very large integers. Keep in mind that f64 loses some precision when
representing very large numbers.
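
The trade-off can be seen with the standard library alone. The sketch below is an illustration under stated assumptions, not the crate's parser: the `Number` enum and `parse_number` helper are hypothetical, and the input is assumed to be a plain run of decimal digits.

```rust
/// Hypothetical number representation for this example.
#[derive(Debug, PartialEq)]
enum Number {
    Unsigned(u64),
    Float(f64),
}

/// Try `u64` first; if the digits do not fit, fall back to `f64`.
fn parse_number(digits: &str) -> Option<Number> {
    match digits.parse::<u64>() {
        Ok(n) => Some(Number::Unsigned(n)),
        // Out of range for u64: accept the value as a float instead of failing.
        Err(_) => digits.parse::<f64>().ok().map(Number::Float),
    }
}
```

With this fallback, `18446744073709551616` (2^64) and `18446744073709551617` (2^64 + 1) both parse to the same `f64`, which is the precision loss mentioned above.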

## safety
### `128bit`

`simd-json` uses **a lot** of unsafe code.
Add support for parsing and serializing 128-bit integers. This feature is disabled by default because such large numbers
are rare in the wild and adding the support incurs a performance penalty.

There are a few reasons for this:
### `beef`

* SIMD intrinsics are inherently unsafe. These uses of unsafe are inescapable in a library such as `simd-json`.
* We work around some performance bottlenecks imposed by safe rust. These are avoidable, but at a performance cost. This is a more considered path in `simd-json`.
**Enabling this feature can break dependencies in your dependency tree that are using `simd-json`.**

Replace [`std::borrow::Cow`](https://doc.rust-lang.org/std/borrow/enum.Cow.html) with
[`beef::lean::Cow`][beef]. This feature is disabled by default, because
it is a breaking change in the API.
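
For readers unfamiliar with `Cow`, here is a short standard-library sketch of the clone-on-write pattern it provides. `unescape_slashes` is a hypothetical helper written for this example and is not part of `simd-json`.

```rust
use std::borrow::Cow;

/// Return the input unchanged (borrowed) when nothing needs rewriting,
/// and allocate a new `String` only when an escaped slash is present.
fn unescape_slashes(input: &str) -> Cow<'_, str> {
    if input.contains("\\/") {
        Cow::Owned(input.replace("\\/", "/"))
    } else {
        Cow::Borrowed(input)
    }
}
```

Code written against `std::borrow::Cow` like this will not accept a different `Cow` type, which is why switching the feature on can break other crates in the dependency tree.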

`simd-json` goes through extra scrutiny for unsafe code. These steps are:
### `portable`

* Unit tests - to test 'the obvious' cases, edge cases, and regression cases
* Structural constructive property based testing - We generate random valid JSON objects to exercise the full `simd-json` codebase stochastically. Floats are currently excluded since slightly different parsing algorithms lead to slightly different results here. In short "is simd-json correct".
* Data-oriented property-based testing of string-like data - to assert that sequences of legal printable characters don't panic or crash the parser (they might and often error so - they are not valid JSON!)
* Destructive Property based testing - make sure that no illegal byte sequences crash the parser in any way
* Fuzzing - fuzz based on upstream & jsonorg simd pass/fail cases
* Miri testing for UB
**Currently disabled**

This doesn't ensure complete safety nor is at a bulletproof guarantee, but it does go a long way
to assert that the library is of high production quality and fit for purpose for practical industrial applications.
A highly experimental implementation of the algorithm using `std::simd` and up to 512 byte wide registers.


## Usage

simd-json offers three main entry points for usage:

### Values API

The values API is a set of optimized DOM objects that allow parsing
JSON data that has no known or fixed structure. `simd-json`
has two versions of this:

**Borrowed Values**

```rust
use simd_json;
let mut d = br#"{"some": ["key", "value", 2]}"#.to_vec();
let v: simd_json::BorrowedValue = simd_json::to_borrowed_value(&mut d).unwrap();
```

**Owned Values**

```rust
use simd_json;
let mut d = br#"{"some": ["key", "value", 2]}"#.to_vec();
let v: simd_json::OwnedValue = simd_json::to_owned_value(&mut d).unwrap();
```

### Serde Compatible API

```rust ignore
use simd_json;
use serde_json::Value;

let mut d = br#"{"some": ["key", "value", 2]}"#.to_vec();
let v: Value = simd_json::serde::from_slice(&mut d).unwrap();
```

### Tape API

```rust
use simd_json;

let mut d = br#"{"the_answer": 42}"#.to_vec();
let tape = simd_json::to_tape(&mut d).unwrap();
let value = tape.as_value();
// try_get treats value like an object, returns Ok(Some(_)) because the key is found
assert!(value.try_get("the_answer").unwrap().unwrap() == 42);
// returns Ok(None) because the key is not found but value is an object
assert!(value.try_get("does_not_exist").unwrap() == None);
// try_get_idx treats value like an array, returns Err(_) because value is not an array
assert!(value.try_get_idx(0).is_err());
```

## Other interesting things

Expand All @@ -102,17 +199,19 @@ There are also bindings for upstream `simdjson` available [here](https://github.

simd-json itself is licensed under either of

* Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
* [Apache License, Version 2.0, (LICENSE-APACHE)](http://www.apache.org/licenses/LICENSE-2.0)
* [MIT license (LICENSE-MIT)](http://opensource.org/licenses/MIT)

at your option.

However it ports a lot of code from [simdjson](https://github.com/lemire/simdjson) so their work and copyright on that should be respected along side.
However, it ports a lot of code from [simdjson](https://github.com/lemire/simdjson) so their work and copyright on that should also be respected.

The [serde][1] integration is based on their example and `serde-json` so again, their copyright should as well be respected.
The [Serde][serde] integration is based on `serde-json`, so their copyright should be respected as well.

[1]: https://serde.rs
[serde]: https://serde.rs
[beef]: https://docs.rs/beef/latest/beef/lean/type.Cow.html

### All Thanks To Our Contributors:
<a href="https://github.com/simd-lite/simd-json/graphs/contributors">
<img src="https://contrib.rocks/image?repo=simd-lite/simd-json" />
<img alt="GitHub profile pictures of all contributors to simd-json" src="https://contrib.rocks/image?repo=simd-lite/simd-json" />
</a>
8 changes: 6 additions & 2 deletions src/cow.rs
@@ -1,5 +1,9 @@
//! Reexport of Cow

//! Re-export of Cow
//!
//! If feature `beef` is enabled, this will re-export [`beef::lean::Cow`][beef].
//! Otherwise, it will re-export [`std::borrow::Cow`].
//!
//! [beef]: https://docs.rs/beef/latest/beef/lean/type.Cow.html
#[cfg(not(feature = "beef"))]
pub use std::borrow::Cow;

106 changes: 2 additions & 104 deletions src/lib.rs
Expand Up @@ -10,109 +10,7 @@
missing_docs
)]
#![allow(clippy::module_name_repetitions, renamed_and_removed_lints)]

//! simd-json is a rust port of the simdjson c++ library. It follows
//! most of the design closely with a few exceptions to make it better
//! fit into the rust ecosystem.
//!
//! Note: On `x86` it will select the best SIMD featureset
//! (`avx2`, or `sse4.2`) during runtime. If `simd-json` is compiled
//! with SIMD support, it will disable runtime detection.
//!
//! ## Goals
//!
//! the goal of the rust port of simdjson is not to create a one to
//! one copy, but to integrate the principles of the c++ library into
//! a rust library that plays well with the rust ecosystem. As such
//! we provide both compatibility with serde as well as parsing to a
//! dom to manipulate data.
//!
//! ## Performance
//!
//! As a rule of thumb this library tries to get as close as possible
//! to the performance of the c++ implementation, but some of the
//! design decisions - such as parsing to a dom or a tape, weigh
//! ergonomics over performance. In other places Rust makes it harder
//! to achieve the same level of performance.
//!
//! ## Safety
//!
//! this library uses unsafe all over the place, and while it leverages
//! quite a few test cases along with property based testing, please use
//! this library with caution.
//!
//!
//! ## Features
//!
//! simd-json.rs comes with a number of features that can be toggled,
//! the following features are intended for 'user' selection. Additional
//! features in the `Cargo.toml` exist to work around cargo limitations.
//!
//! ### `swar-number-parsing` (default)
//!
//! Enables a parsing method that will parse 8 digits at a time for
//! floats - this is a common pattern but comes as a slight perf hit
//! if all the floats have less then 8 digits.
//!
//! ### `serde_impl` (default)
//!
//! Compatibility with [serde](https://serde.rs/). This allows to use
//! [simd-json.rs](https://simd-json.rs) to deserialize serde objects
//! as well as serde compatibility of the different Value types.
//! This can be disabled if serde is not used alongside simd-json.
//!
//! ### `128bit`
//!
//! Support for signed and unsigned 128 bit integer. This feature
//! is disabled by default as 128 bit integers are rare in the wild
//! and parsing them comes as a performance penalty due to extra logic
//! and a changed memory layout.
//!
//! ### `known-key`
//!
//! The known-key feature changes hasher for the objects, from ahash
//! to fxhash, ahash is faster at hashing and provides protection
//! against DOS attacks by forcing multiple keys into a single hashing
//! bucket. fxhash on the other hand allows for repeatable hashing
//! results, that allows memorizing hashes for well know keys and saving
//! time on lookups. In workloads that are heavy at accessing some well
//! known keys this can be a performance advantage.
//!
//! ## Usage
//!
//! simd-json offers two main entry points for usage:
//!
//! ### Values API
//!
//! The values API is a set of optimized DOM objects that allow parsed
//! json to JSON data that has no known variable structure. simd-lite
//! has two versions of this:
//!
//! **Borrowed Values**
//!
//! ```
//! use simd_json;
//! let mut d = br#"{"some": ["key", "value", 2]}"#.to_vec();
//! let v: simd_json::BorrowedValue = simd_json::to_borrowed_value(&mut d).unwrap();
//! ```
//!
//! **Owned Values**
//!
//! ```
//! use simd_json;
//! let mut d = br#"{"some": ["key", "value", 2]}"#.to_vec();
//! let v: simd_json::OwnedValue = simd_json::to_owned_value(&mut d).unwrap();
//! ```
//!
//! ### Serde Compatible API
//!
//! ```ignore
//! use simd_json;
//! use serde_json::Value;
//!
//! let mut d = br#"{"some": ["key", "value", 2]}"#.to_vec();
//! let v: Value = simd_json::serde::from_slice(&mut d).unwrap();
//! ```
#![doc = include_str!(concat!(env!("CARGO_MANIFEST_DIR"), "/README.md"))]

#[cfg(feature = "serde_impl")]
extern crate serde as serde_ext;
Expand Down Expand Up @@ -144,7 +42,7 @@ use stage2::StackState;

mod impls;

/// Reexport of Cow
/// Re-export of Cow
pub mod cow;

/// The maximum padding size required by any SIMD implementation