RFC Semi-structured Data Types #4320 (merged Mar 4, 2022, 2 commits)
File: docs/dev/rfcs/query/0007-semi-structured-data-types.md (137 additions)

# Semi-structured data types design

## Summary

Semi-structured data types are used to represent schemaless data formats, such as JSON, XML, and so on.
In order to be compatible with [Snowflake's SQL syntax](https://docs.snowflake.com/en/sql-reference/data-types-semistructured.html), we support the following three semi-structured data types:

- `Variant`: A tagged universal type, which can store values of any other type, including `Object` and `Array`.
- `Object`: Used to represent collections of key-value pairs, where the key is a non-empty string, and the value is a value of `Variant` type.
- `Array`: Used to represent dense or sparse arrays of arbitrary size, where the index is a non-negative integer (up to 2^31-1), and values are `Variant` types.

Since `Object` and `Array` can be regarded as special cases of `Variant`, the following introduction mainly uses `Variant` as the example.

> **Reviewer:** Can you give an example of how users will use this feature, like presenting a real SQL? Take https://github.com/datafuselabs/opendal/blob/main/docs/rfcs/0000-example.md#guide-level-explanation for a look.
>
> **Author:** I have added some examples, PTAL @Xuanwo

### Examples

The following example shows how to create a table with `VARIANT`, `ARRAY` and `OBJECT` data types, insert and query some test data.

```sql
CREATE TABLE test_semi_structured (
  var variant,
  arr array,
  obj object
);

INSERT INTO test_semi_structured (var, arr, obj)
SELECT 1, array_construct(1, 2, 3)
, parse_json(' { "key1": "value1", "key2": "value2" } ');

INSERT INTO test_semi_structured (var, arr, obj)
SELECT to_variant('abc')
, array_construct('a', 'b', 'c')
, parse_json(' { "key1": [1, 2, 3], "key2": ["a", "b", "c"] } ');


SELECT * FROM test_semi_structured;

+-------+-------------------+----------------------------------------------------+
| var | arr | obj |
+-------+-------------------+----------------------------------------------------+
| 1 | [ 1, 2, 3 ] | { "key1": "value1", "key2": "value2" } |
| "abc" | [ "a", "b", "c" ] | { "key1": [ 1, 2, 3 ], "key2": [ "a", "b", "c" ] } |
+-------+-------------------+----------------------------------------------------+
```

## Design Details

### Data storage format

In order to store `Variant` values in a schema-aware `parquet` file, we need to convert the original raw values. We have the following two choices:

#### Store data in one column as JSON or binary JSON-like format

JSON (JavaScript Object Notation) is the most common semi-structured format that can represent arbitrarily complex hierarchical values. It is very suitable for representing this kind of semi-structured data. Data of type `Variant` can be encoded in JSON format and stored as a raw string value.
The main disadvantage of the JSON format is that each access requires expensive parsing of the raw string, so there are several optimized binary JSON-like formats to improve parsing speed and single key access.
For example, MongoDB and PostgreSQL use [BSON](https://bsonspec.org/) and [jsonb](https://www.postgresql.org/docs/14/datatype-json.html) respectively to store data in JSON format.
[UBJSON](https://ubjson.org/) is also a compatible binary JSON format specification; it provides universal compatibility and is as easy to use as JSON while being faster and more efficient.
All of these binary JSON formats have better parsing performance; the only problem is that they lack good Rust implementations.
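To make the single-key-access advantage concrete, here is a toy sketch (not BSON, jsonb, or UBJSON — just an illustration of the shared idea): length-prefixed entries let a reader skip past values it does not care about, instead of parsing the entire document as JSON text requires.

```rust
// Toy binary object encoding: for each entry, store key_len (u8), the key
// bytes, then an i64 value in little-endian. The length prefix is what lets
// a reader jump over entries without decoding them.
fn encode(entries: &[(&str, i64)]) -> Vec<u8> {
    let mut buf = Vec::new();
    for (k, v) in entries {
        buf.push(k.len() as u8);
        buf.extend_from_slice(k.as_bytes());
        buf.extend_from_slice(&v.to_le_bytes());
    }
    buf
}

// Look up one key by skipping over the other entries, without decoding
// their values — the property that makes binary formats fast for
// single-key access.
fn get(buf: &[u8], key: &str) -> Option<i64> {
    let mut pos = 0;
    while pos < buf.len() {
        let klen = buf[pos] as usize;
        let k = &buf[pos + 1..pos + 1 + klen];
        let vstart = pos + 1 + klen;
        if k == key.as_bytes() {
            let mut bytes = [0u8; 8];
            bytes.copy_from_slice(&buf[vstart..vstart + 8]);
            return Some(i64::from_le_bytes(bytes));
        }
        pos = vstart + 8; // jump straight past the value
    }
    None
}
```

Real formats additionally encode type tags, nesting, and (in jsonb's case) sorted keys for binary search, but the skip-without-parse mechanism is the same.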
> **Reviewer:** For further storage layout optimization, I think this paper also provides some insights: https://arxiv.org/pdf/2111.11517.pdf
>
> **Author:** OK, I'll look into this paper.

#### Store each unique key of data in sub-columns

Although the JSON format can represent arbitrary data, in practice JSON data is usually generated by machines, so we can predict its schema and structure.
Based on this observation, we can extract and flatten each unique key in the JSON data into multiple independent virtual sub-columns.

For example, suppose we have a column named `tweet` and store the following JSON data:

```json
{"id":1, "date": "1/11", "type": "story", "score": 3, "desc": 2, "title": "...", "url": "..."}
{"id":2, "date": "1/12", "type": "poll", "score": 5, "desc": 2, "title": "..."}
{"id":3, "date": "1/13", "type": "pollop", "score": 6, "poll": 2, "title": "..."}
{"id":4, "date": "1/14", "type": "story", "score": 1, "desc": 1, "title": "...", "url": "..."}
{"id":5, "date": "1/15", "type": "comment", "parent": 4, "text": "..."}
{"id":6, "date": "1/16", "type": "comment", "parent": 1, "text": "..."}
{"id":7, "date": "1/17", "type": "pollop", "score": 3, "poll": 2, "title": "..."}
{"id":8, "date": "1/18", "type": "comment", "parent": 1, "text": "..."}
```

This column can be split into 10 virtual sub-columns: `tweet.id`, `tweet.date`, `tweet.type`, `tweet.score`, `tweet.desc`, `tweet.title`, `tweet.url`, `tweet.parent`, `tweet.text`, `tweet.poll`.
The data type of each sub-column can also be automatically deduced from its values, so we can automatically create those sub-columns and insert the corresponding values.

The main advantage of this storage format is that it does not need to parse the raw JSON string when querying the data, which can greatly speed up the query processing.
The disadvantage is that additional processing is required when inserting data, and the schema may differ from row to row; in scenarios with large differences, many sub-column values will be NULL.
In order to have good performance and balance in various scenarios, we can refer to the optimization algorithms introduced in the paper [JSON Tiles](https://db.in.tum.de/people/sites/durner/papers/json-tiles-sigmod21.pdf).
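The flattening step described above can be sketched as follows. This is a simplified, self-contained illustration: rows are assumed to be already parsed into key/value pairs, and all values are `i64` for brevity, whereas the real design deduces a type per sub-column.

```rust
use std::collections::BTreeMap;

// Flatten parsed rows into one vector per unique key. Rows that lack a
// key get None, which corresponds to the NULLs mentioned above.
fn flatten(rows: &[Vec<(&str, i64)>]) -> BTreeMap<String, Vec<Option<i64>>> {
    let mut cols: BTreeMap<String, Vec<Option<i64>>> = BTreeMap::new();
    // First pass: register every unique key as a sub-column.
    for row in rows {
        for (k, _) in row {
            cols.entry(k.to_string()).or_default();
        }
    }
    // Second pass: append each row's value (or None) to every sub-column,
    // so all sub-columns stay aligned by row index.
    for row in rows {
        for (key, col) in cols.iter_mut() {
            let v = row
                .iter()
                .find(|(k, _)| *k == key.as_str())
                .map(|(_, v)| *v);
            col.push(v);
        }
    }
    cols
}
```

On the `tweet` example above, this yields 10 aligned sub-columns, with `None` wherever a row (e.g. a `comment`) lacks a key such as `score`.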

From the perspective of performance, a better solution is to store data in a binary JSON-like format and extract some frequently queried unique keys as sub-columns.
However, in order to simplify development, we use the JSON format in the first version.
Binary JSON-like format and separately stored sub-columns will be adopted in a future optimized version.
> **Reviewer:** Does it mean that for the first version, those columns will not be stored in a "column-oriented" way? How about the Dremel-style encoding of nested data structures (which parquet supports out of the box)?
>
> **Reviewer:** That's pretty complicated, and it seems to only work for fixed-schema columns.
>
> **Author:** Parquet nested data types must define a schema, which is not suitable for storing such arbitrary data.

### Data Types

Add three new values `Variant`, `VariantArray`, and `VariantObject` to the `TypeID` enumeration to support these three semi-structured data types.
Since we already have a value named `Array`, we name the semi-structured `Array` type `VariantArray` to distinguish it.
Define the corresponding structures for these types and implement the trait `DataType`.
The `PhysicalTypeID` corresponding to these types is `String`; the JSON value will be converted to a raw string for storage.

```rust
pub enum TypeID {
    // ...
    Variant,
    VariantArray,
    VariantObject,
}

pub struct VariantType {}

pub struct VariantArrayType {}

pub struct VariantObjectType {}
```
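The "JSON value converted to a raw string" step can be illustrated with a self-contained sketch. The `Variant` enum below is a simplified stand-in (the real implementation stores `serde_json::Value`); only three cases and no string escaping are handled, to keep the toy short.

```rust
// Simplified stand-in for a variant value. Storing it under the String
// physical type just means serializing the value to its JSON text.
#[derive(Debug, PartialEq)]
enum Variant {
    Int(i64),
    Str(String),
    Array(Vec<Variant>),
}

impl Variant {
    fn to_json_string(&self) -> String {
        match self {
            Variant::Int(n) => n.to_string(),
            // Toy only: real code must escape quotes, backslashes, etc.
            Variant::Str(s) => format!("\"{}\"", s),
            Variant::Array(items) => {
                let parts: Vec<String> =
                    items.iter().map(|v| v.to_json_string()).collect();
                format!("[{}]", parts.join(","))
            }
        }
    }
}
```

Reading a variant back is the reverse: parse the stored raw string into the in-memory value, which is exactly the per-access parsing cost discussed in the storage-format section.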

### Object Column

Currently `Column` is only implemented for fundamental types; custom structs or enumerations like `serde_json::Value` don't have a suitable `Column` implementation for storage.
Define `ObjectColumn` and `MutableObjectColumn` as generic structures to store custom data types, implementing the traits `Column` and `MutableColumn` respectively.
`ObjectType` can be any custom struct or enumeration; we can define `JsonColumn` by specifying the type parameter as `serde_json::Value`.
All `variant` data will be automatically converted to `serde_json::Value` to generate a `JsonColumn`.
Other custom data types, such as a bitmap column, can easily be supported in the future.

```rust
#[derive(Clone)]
pub struct ObjectColumn<T: ObjectType> {
    values: Vec<T>,
}

#[derive(Debug)]
pub struct MutableObjectColumn<T: ObjectType> {
    data_type: DataTypePtr,
    pub(crate) values: Vec<T>,
}

type JsonColumn = ObjectColumn<serde_json::Value>;
```

> **Reviewer:** `ObjectType` can extend from arrow2's `NativeType`, so values can be `Arc<Bytes<T>>`, which will have zero cost to take a slice.
>
> **Author:** OK.
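To make the relationship between the mutable builder and the frozen column concrete, here is a simplified, runnable model of the pair. It is not the actual implementation: the `ObjectType` bound and `data_type` field are dropped, and `String` stands in for `serde_json::Value`.

```rust
// Simplified model of the immutable column and its mutable builder.
#[derive(Clone, Debug)]
struct ObjectColumn<T> {
    values: Vec<T>,
}

#[derive(Debug, Default)]
struct MutableObjectColumn<T> {
    values: Vec<T>,
}

impl<T> MutableObjectColumn<T> {
    // Append one value while the column is being built.
    fn push(&mut self, v: T) {
        self.values.push(v);
    }
    // Freeze the builder into an immutable column.
    fn finish(self) -> ObjectColumn<T> {
        ObjectColumn { values: self.values }
    }
}

impl<T> ObjectColumn<T> {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn get(&self, i: usize) -> Option<&T> {
        self.values.get(i)
    }
}
```

Usage follows the build-then-freeze pattern: push raw JSON strings (standing in for parsed variant values) into the builder, then `finish()` to obtain the read-only column.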

## TODO

- Use better storage formats to improve query performance