RFC Semi-structured Data Types #4320 (merged Mar 4, 2022, 2 commits)
File: docs/dev/rfcs/query/0007-semi-structured-data-types.md (137 additions)

# Semi-structured data types design

## Summary

Semi-structured data types are used to represent schemaless data formats, such as JSON, XML, and so on.
In order to be compatible with [Snowflake's SQL syntax](https://docs.snowflake.com/en/sql-reference/data-types-semistructured.html), we support the following three semi-structured data types:

- `Variant`: A tagged universal type, which can store values of any other type, including `Object` and `Array`.
- `Object`: Used to represent collections of key-value pairs, where the key is a non-empty string, and the value is a value of `Variant` type.
- `Array`: Used to represent dense or sparse arrays of arbitrary size, where the index is a non-negative integer (up to 2^31-1), and values are `Variant` types.

Since `Object` and `Array` can be regarded as special cases of `Variant`, the following introduction mainly uses `Variant` as the example.

> **Reviewer:** Can you give an example of how users will use this feature, like presenting a real SQL? Take https://github.com/datafuselabs/opendal/blob/main/docs/rfcs/0000-example.md#guide-level-explanation for a look.
>
> **Author:** I have added some examples, PTAL @Xuanwo

### Examples

The following example shows how to create a table with `VARIANT`, `ARRAY` and `OBJECT` data types, insert and query some test data.

```sql
CREATE TABLE test_semi_structured (
  var variant,
  arr array,
  obj object
);

INSERT INTO test_semi_structured (var, arr, obj)
SELECT 1, array_construct(1, 2, 3)
, parse_json(' { "key1": "value1", "key2": "value2" } ');

INSERT INTO test_semi_structured (var, arr, obj)
SELECT to_variant('abc')
, array_construct('a', 'b', 'c')
, parse_json(' { "key1": [1, 2, 3], "key2": ["a", "b", "c"] } ');


SELECT * FROM test_semi_structured;

+-------+-------------------+----------------------------------------------------+
| var | arr | obj |
+-------+-------------------+----------------------------------------------------+
| 1 | [ 1, 2, 3 ] | { "key1": "value1", "key2": "value2" } |
| "abc" | [ "a", "b", "c" ] | { "key1": [ 1, 2, 3 ], "key2": [ "a", "b", "c" ] } |
+-------+-------------------+----------------------------------------------------+
```

## Design Details

### Data storage format

In order to store `Variant` values in a schema-aware `parquet` file, we need to convert the original raw values. We have the following two choices:

#### Store data in one column as JSON or binary JSON-like format

JSON (JavaScript Object Notation) is the most common semi-structured format that can represent arbitrarily complex hierarchical values. It is very suitable for representing this kind of semi-structured data. Data of type `Variant` can be encoded in JSON format and stored as a raw string value.
The main disadvantage of the JSON format is that each access requires expensive parsing of the raw string, so there are several optimized binary JSON-like formats to improve parsing speed and single key access.
For example, MongoDB and PostgreSQL use [BSON](https://bsonspec.org/) and [jsonb](https://www.postgresql.org/docs/14/datatype-json.html) respectively to store data in JSON format.
[UBJSON](https://ubjson.org/) is also a compatible binary JSON format specification; it provides universal compatibility and is as easy to use as JSON while being faster and more efficient.
All of these binary JSON formats have better parsing performance; the only problem is that they lack good Rust implementations.
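To make the single-key-access advantage concrete, here is a toy sketch (not BSON, jsonb, or UBJSON — just an illustration of the shared idea): length-prefixed entries let a reader skip past values it does not care about, instead of parsing the entire document as JSON text requires.

```rust
// Toy binary object encoding: for each entry, store key_len (u8), the key
// bytes, then an i64 value in little-endian. The length prefix is what lets
// a reader jump over entries without decoding them.
fn encode(entries: &[(&str, i64)]) -> Vec<u8> {
    let mut buf = Vec::new();
    for (k, v) in entries {
        buf.push(k.len() as u8);
        buf.extend_from_slice(k.as_bytes());
        buf.extend_from_slice(&v.to_le_bytes());
    }
    buf
}

// Look up one key by skipping over the other entries, without decoding
// their values — the property that makes binary formats fast for
// single-key access.
fn get(buf: &[u8], key: &str) -> Option<i64> {
    let mut pos = 0;
    while pos < buf.len() {
        let klen = buf[pos] as usize;
        let k = &buf[pos + 1..pos + 1 + klen];
        let vstart = pos + 1 + klen;
        if k == key.as_bytes() {
            let mut bytes = [0u8; 8];
            bytes.copy_from_slice(&buf[vstart..vstart + 8]);
            return Some(i64::from_le_bytes(bytes));
        }
        pos = vstart + 8; // jump straight past the value
    }
    None
}
```

Real formats additionally encode type tags, nesting, and (in jsonb's case) sorted keys for binary search, but the skip-without-parse mechanism is the same.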
> **Reviewer:** For further storage layout optimization, I think this paper also provides some insights: https://arxiv.org/pdf/2111.11517.pdf
>
> **Author:** OK, I'll look into this paper.

#### Store each unique key of data in sub-columns

Although the JSON format can represent arbitrary data, in practice JSON data is usually generated by machines, so we can predict its schema and structure.
Based on this observation, we can extract and flatten each unique key in the JSON data into multiple independent virtual sub-columns.

For example, suppose we have a column named `tweet` and store the following JSON data:

```json
{"id":1, "date": "1/11", "type": "story", "score": 3, "desc": 2, "title": "...", "url": "..."}
{"id":2, "date": "1/12", "type": "poll", "score": 5, "desc": 2, "title": "..."}
{"id":3, "date": "1/13", "type": "pollop", "score": 6, "poll": 2, "title": "..."}
{"id":4, "date": "1/14", "type": "story", "score": 1, "desc": 1, "title": "...", "url": "..."}
{"id":5, "date": "1/15", "type": "comment", "parent": 4, "text": "..."}
{"id":6, "date": "1/16", "type": "comment", "parent": 1, "text": "..."}
{"id":7, "date": "1/17", "type": "pollop", "score": 3, "poll": 2, "title": "..."}
{"id":8, "date": "1/18", "type": "comment", "parent": 1, "text": "..."}
```

This column can be split into 10 virtual sub-columns: `tweet.id`, `tweet.date`, `tweet.type`, `tweet.score`, `tweet.desc`, `tweet.title`, `tweet.url`, `tweet.parent`, `tweet.text`, `tweet.poll`.
The data type of each sub-column can also be automatically deduced from its values, so we can automatically create those sub-columns and insert the corresponding values.

The main advantage of this storage format is that it does not need to parse the raw JSON string when querying the data, which can greatly speed up the query processing.
The disadvantage is that additional processing is required when inserting data, and the schema may differ from row to row; in scenarios with large differences, many sub-column values will be NULL.
In order to have good performance and balance in various scenarios, we can refer to the optimization algorithms introduced in the paper [JSON Tiles](https://db.in.tum.de/people/sites/durner/papers/json-tiles-sigmod21.pdf).
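The flattening step described above can be sketched as follows. This is a simplified, self-contained illustration: rows are assumed to be already parsed into key/value pairs, and all values are `i64` for brevity, whereas the real design deduces a type per sub-column.

```rust
use std::collections::BTreeMap;

// Flatten parsed rows into one vector per unique key. Rows that lack a
// key get None, which corresponds to the NULLs mentioned above.
fn flatten(rows: &[Vec<(&str, i64)>]) -> BTreeMap<String, Vec<Option<i64>>> {
    let mut cols: BTreeMap<String, Vec<Option<i64>>> = BTreeMap::new();
    // First pass: register every unique key as a sub-column.
    for row in rows {
        for (k, _) in row {
            cols.entry(k.to_string()).or_default();
        }
    }
    // Second pass: append each row's value (or None) to every sub-column,
    // so all sub-columns stay aligned by row index.
    for row in rows {
        for (key, col) in cols.iter_mut() {
            let v = row
                .iter()
                .find(|(k, _)| *k == key.as_str())
                .map(|(_, v)| *v);
            col.push(v);
        }
    }
    cols
}
```

On the `tweet` example above, this yields 10 aligned sub-columns, with `None` wherever a row (e.g. a `comment`) lacks a key such as `score`.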

From the perspective of performance, a better solution is to store data in a binary JSON-like format and extract some frequently queried unique keys as sub-columns.
However, in order to simplify development, we use the JSON format in the first version.
Binary JSON-like format and separately stored sub-columns will be adopted in a future optimized version.
> **Reviewer:** Does it mean that for the first version, those columns will not be stored in a "column-oriented" way? How about the Dremel-style encoding of nested data structures (which parquet supports out of the box)?
>
> **Reviewer:** That's pretty complicated, and it seems to only work for fixed-schema columns.
>
> **Author:** Parquet nested data types must define a schema, which is not suitable for storing such arbitrary data.

### Data Types

Add three new values `Variant`, `VariantArray`, and `VariantObject` to the `TypeID` enumeration to support these three semi-structured data types.
Since we already have a value named `Array`, we name the semi-structured `Array` type `VariantArray` to distinguish it.
Define the corresponding structures for these types and implement the trait `DataType`.
The `PhysicalTypeID` corresponding to these types is `String`; the JSON value will be converted to a raw string for storage.

```rust
pub enum TypeID {
    // ...
    Variant,
    VariantArray,
    VariantObject,
}

pub struct VariantType {}

pub struct VariantArrayType {}

pub struct VariantObjectType {}
```
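The "JSON value converted to a raw string" step can be illustrated with a self-contained sketch. The `Variant` enum below is a simplified stand-in (the real implementation stores `serde_json::Value`); only three cases and no string escaping are handled, to keep the toy short.

```rust
// Simplified stand-in for a variant value. Storing it under the String
// physical type just means serializing the value to its JSON text.
#[derive(Debug, PartialEq)]
enum Variant {
    Int(i64),
    Str(String),
    Array(Vec<Variant>),
}

impl Variant {
    fn to_json_string(&self) -> String {
        match self {
            Variant::Int(n) => n.to_string(),
            // Toy only: real code must escape quotes, backslashes, etc.
            Variant::Str(s) => format!("\"{}\"", s),
            Variant::Array(items) => {
                let parts: Vec<String> =
                    items.iter().map(|v| v.to_json_string()).collect();
                format!("[{}]", parts.join(","))
            }
        }
    }
}
```

Reading a variant back is the reverse: parse the stored raw string into the in-memory value, which is exactly the per-access parsing cost discussed in the storage-format section.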

### Object Column

Currently `Column` is only implemented for fundamental types; custom structs or enumerations like `serde_json::Value` don't have a suitable `Column` implementation for storage.
Define `ObjectColumn` and `MutableObjectColumn` as generic structures to store custom data types, implementing the traits `Column` and `MutableColumn` respectively.
`ObjectType` can be any custom struct or enumeration; we can define `JsonColumn` by specifying the type parameter as `serde_json::Value`.
All `variant` data will be automatically converted to `serde_json::Value` to generate a `JsonColumn`.
Other custom data types, such as a bitmap column, can easily be supported in the future.

```rust
#[derive(Clone)]
pub struct ObjectColumn<T: ObjectType> {
    values: Vec<T>,
}

#[derive(Debug)]
pub struct MutableObjectColumn<T: ObjectType> {
    data_type: DataTypePtr,
    pub(crate) values: Vec<T>,
}

type JsonColumn = ObjectColumn<serde_json::Value>;
```

> **Reviewer:** `ObjectType` can extend from arrow2's `NativeType`, so values can be `Arc<Bytes<T>>`, which will have zero cost to take a slice.
>
> **Author:** OK.
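To make the relationship between the mutable builder and the frozen column concrete, here is a simplified, runnable model of the pair. It is not the actual implementation: the `ObjectType` bound and `data_type` field are dropped, and `String` stands in for `serde_json::Value`.

```rust
// Simplified model of the immutable column and its mutable builder.
#[derive(Clone, Debug)]
struct ObjectColumn<T> {
    values: Vec<T>,
}

#[derive(Debug, Default)]
struct MutableObjectColumn<T> {
    values: Vec<T>,
}

impl<T> MutableObjectColumn<T> {
    // Append one value while the column is being built.
    fn push(&mut self, v: T) {
        self.values.push(v);
    }
    // Freeze the builder into an immutable column.
    fn finish(self) -> ObjectColumn<T> {
        ObjectColumn { values: self.values }
    }
}

impl<T> ObjectColumn<T> {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn get(&self, i: usize) -> Option<&T> {
        self.values.get(i)
    }
}
```

Usage follows the build-then-freeze pattern: push raw JSON strings (standing in for parsed variant values) into the builder, then `finish()` to obtain the read-only column.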

## TODO

- Use better storage formats to improve query performance