From 5fe142481ae867c1595ed2e3441e4c1238d32a0f Mon Sep 17 00:00:00 2001 From: Qianqian Fang Date: Sun, 15 May 2022 03:09:06 -0400 Subject: [PATCH] Updated BJData documentation, #3464 (#3493) * update bjdata.md for #3464 * Minor edit * Fix URL typo * Add info on demoting ND array to a 1-D optimized array when singleton dimension --- .../docs/features/binary_formats/bjdata.md | 207 +++++++++++++++++- 1 file changed, 198 insertions(+), 9 deletions(-) diff --git a/docs/mkdocs/docs/features/binary_formats/bjdata.md b/docs/mkdocs/docs/features/binary_formats/bjdata.md index 34d3d4dd8c..fae55658fb 100644 --- a/docs/mkdocs/docs/features/binary_formats/bjdata.md +++ b/docs/mkdocs/docs/features/binary_formats/bjdata.md @@ -1,20 +1,209 @@ # BJData -The Binary JData (BJData) Specification defines an efficient serialization protocol for unambiguously storing complex -and strongly-typed binary data found in diverse applications. The BJData specification is the binary counterpart to the -JSON format, both of which are used to serialize complex data structures supported by the -[JData specification](https://openjdata.org). The BJData spec is derived and extended from the -[Universal Binary JSON(UBJSON)](https://ubjson.org) specification (Draft 12). It adds supports for N-dimensional packed -arrays and extended binary data types. +The [BJData format](https://neurojson.org) was derived from and improved upon +[Universal Binary JSON(UBJSON)](https://ubjson.org) specification (Draft 12). +Specifically, it introduces an optimized array container for efficient storage +of N-dimensional packed arrays (**ND-arrays**); it also adds 4 new type markers - +`[u] - uint16`, `[m] - uint32`, `[M] - uint64` and `[h] - float16` - to +unambigiously map common binary numeric types; furthermore, it uses little-endian +(LE) to store all numerics instead of big-endian (BE) as in UBJSON to avoid +unnecessary conversions on commonly available platforms. + +Compared to other binary-JSON-like formats such as MessagePack and CBOR, both BJData and +UBJSON demonstrate a rare combination of being both binary and **quasi-human-readable**. This +is because all semantic elements in BJData and UBJSON, including the data-type markers +and name/string types are directly human-readable. Data stored in the BJData/UBJSON format +are not only compact in size, fast to read/write, but also can be directly searched +or read using simple processing. !!! abstract "References" - - [BJData Specification](https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md) + - [BJData Specification](https://neurojson.org/bjdata/draft2) ## Serialization -TODO +The library uses the following mapping from JSON values types to BJData types according to the BJData specification: + +| JSON value type | value/range | BJData type | marker | +|-----------------|-------------------------------------------|----------------|--------| +| null | `null` | null | `Z` | +| boolean | `true` | true | `T` | +| boolean | `false` | false | `F` | +| number_integer | -9223372036854775808..-2147483649 | int64 | `L` | +| number_integer | -2147483648..-32769 | int32 | `l` | +| number_integer | -32768..-129 | int16 | `I` | +| number_integer | -128..127 | int8 | `i` | +| number_integer | 128..255 | uint8 | `U` | +| number_integer | 256..32767 | int16 | `I` | +| number_integer | 32768..65535 | uint16 | `u` | +| number_integer | 65536..2147483647 | int32 | `l` | +| number_integer | 2147483648..4294967295 | uint32 | `m` | +| number_integer | 4294967296..9223372036854775807 | int64 | `L` | +| number_integer | 9223372036854775808..18446744073709551615 | uint64 | `M` | +| number_unsigned | 0..127 | int8 | `i` | +| number_unsigned | 128..255 | uint8 | `U` | +| number_unsigned | 256..32767 | int16 | `I` | +| number_unsigned | 32768..65535 | uint16 | `u` | +| number_unsigned | 65536..2147483647 | int32 | `l` | +| number_unsigned | 2147483648..4294967295 | uint32 | `m` | +| number_unsigned | 4294967296..9223372036854775807 | int64 | `L` | +| number_unsigned | 9223372036854775808..18446744073709551615 | uint64 | `M` | +| number_float | *any value* | float64 | `D` | +| string | *with shortest length indicator* | string | `S` | +| array | *see notes on optimized format/ND-array* | array | `[` | +| object | *see notes on optimized format* | map | `{` | + +!!! success "Complete mapping" + + The mapping is **complete** in the sense that any JSON value type can be converted to a BJData value. + + Any BJData output created by `to_bjdata` can be successfully parsed by `from_bjdata`. + +!!! warning "Size constraints" + + The following values can **not** be converted to a BJData value: + + - strings with more than 18446744073709551615 bytes (theoretical) + +!!! info "Unused BJData markers" + + The following markers are not used in the conversion: + + - `Z`: no-op values are not created. + - `C`: single-byte strings are serialized with `S` markers. + +!!! info "NaN/infinity handling" + + If NaN or Infinity are stored inside a JSON number, they are + serialized properly. This behavior differs from the `dump()` + function which serializes NaN or Infinity to `null`. + + +!!! info "Endianness" + + A breaking difference between BJData and UBJSON is the endianness + of numerical values. In BJData, all numerical data types (integers + `UiuImlML` and floating-point values `hdD`) are stored in the little-endian (LE) + byte order as opposed to big-endian as used by UBJSON. To adopt LE + to store numeric records avoids unnecessary byte swapping on most modern + computers where LE is used as the default byte order. + +!!! info "Optimized formats" + + The optimized formats for containers are supported: Parameter + `use_size` adds size information to the beginning of a container and + removes the closing marker. Parameter `use_type` further checks + whether all elements of a container have the same type and adds the + type marker to the beginning of the container. The `use_type` + parameter must only be used together with `use_size = true`. + + Note that `use_size = true` alone may result in larger representations - + the benefit of this parameter is that the receiving side is + immediately informed on the number of elements of the container. + +!!! info "ND-array optimized format" + + BJData extends UBJSON's optimized array **size** marker to support + ND-array of uniform numerical data types (referred to as the *packed array*). + For example, 2-D `uint8` integer array `[[1,2],[3,4],[5,6]]` that can be stored + as nested optimized array in UBJSON `[ [$U#i2 1 2 [$U#i2 3 4 [$U#i2 5 6 ]`, + can be further compressed in BJData and stored as `[$U#[$i#i2 2 3 1 2 3 4 5 6` + or `[$U#[i2 i3] 1 2 3 4 5 6`. + + In order to maintain the type and dimension information of an ND-array, + when this library parses a BJData ND-array via `from_bjdata`, it converts the + data into a JSON object, following the **annotated array format** as defined in the + [JData specification (Draft 3)](https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#annotated-storage-of-n-d-arrays). + For example, the above 2-D `uint8` array can be parsed and accessed as + + ```json + { + "_ArrayType_": "uint8", + "_ArraySize_": [2,3], + "_ArrayData_": [1,2,3,4,5,6] + } + ``` + + In the reversed direction, when `to_bjdata` detects a JSON object in the + above form, it automatically converts such object into a BJData ND-array + to generate compact output. The only exception is that when the 1-D dimensional + vector stored in `"_ArraySize_"` contains a single integer, or two integers with + one being 1, a regular 1-D optimized array is generated. + + The current version of this library has not yet supported automatic + recognition and conversion from a nested JSON array input to a BJData ND-array. + +!!! info "Restrictions in optimized data types for arrays and objects" + + Due to diminished space saving, hampered readability, and increased + security risks, in BJData, the allowed data types following the `$` marker + in an optimized array and object container are restricted to + **non-zero-fixed-length** data types. Therefore, the valid optimized + type markers can only be one of `UiuImlMLhdDC`. This also means other + variable (`[{SH`) or zero-length types (`TFN`) can not be used in an + optimized array or object in BJData. + +!!! info "Binary values" + + If the JSON data contains the binary type, the value stored is a list + of integers, as suggested by the BJData documentation. In particular, + this means that serialization and the deserialization of a JSON + containing binary values into BJData and back will result in a + different JSON object. + + +??? example + + ```cpp + --8<-- "examples/to_bjdata.cpp" + ``` + + Output: + + ```c + --8<-- "examples/to_bjdata.output" + ``` ## Deserialization -TODO +The library maps BJData types to JSON value types as follows: + +| BJData type | JSON value type | marker | +|-------------|-----------------------------------------|--------| +| no-op | *no value, next value is read* | `N` | +| null | `null` | `Z` | +| false | `false` | `F` | +| true | `true` | `T` | +| float16 | number_float | `h` | +| float32 | number_float | `d` | +| float64 | number_float | `D` | +| uint8 | number_unsigned | `U` | +| int8 | number_integer | `i` | +| uint16 | number_unsigned | `u` | +| int16 | number_integer | `I` | +| uint32 | number_unsigned | `m` | +| int32 | number_integer | `l` | +| uint64 | number_unsigned | `M` | +| int64 | number_integer | `L` | +| string | string | `S` | +| char | string | `C` | +| array | array (optimized values are supported) | `[` | +| ND-array | object (in JData annotated array format)|`[$.#[.`| +| object | object (optimized values are supported) | `{` | + +!!! success "Complete mapping" + + The mapping is **complete** in the sense that any BJData value can be converted to a JSON value. + + +??? example + + ```cpp + --8<-- "examples/from_bjdata.cpp" + ``` + + Output: + + ```json + --8<-- "examples/from_bjdata.output" + ```