RFC Semi-structured Data Types #4320
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,137 @@ | ||
# Semi-structured data types design | ||
|
||
## Summary | ||
|
||
Semi-structured data types are used to represent schemaless data formats, such as JSON, XML, and so on. | ||
In order to be compatible with [Snowflake's SQL syntax](https://docs.snowflake.com/en/sql-reference/data-types-semistructured.html), we support the following three semi-structured data types: | ||
|
||
- `Variant`: A tagged universal type, which can store values of any other type, including `Object` and `Array`. | ||
- `Object`: Used to represent collections of key-value pairs, where the key is a non-empty string, and the value is a value of `Variant` type. | ||
- `Array`: Used to represent dense or sparse arrays of arbitrary size, where the index is a non-negative integer (up to 2^31-1), and values are `Variant` types. | ||
|
||
Since `Object` and `Array` can be regarded as a type of `Variant`, the following introduction mainly takes `Variant` as an example. | ||
|
||
### Examples | ||
|
||
The following example shows how to create a table with the `VARIANT`, `ARRAY`, and `OBJECT` data types, then insert and query some test data. | ||
|
||
```sql | ||
CREATE TABLE test_semi_structured ( | ||
var variant, | ||
arr array, | ||
obj object | ||
); | ||
|
||
INSERT INTO test_semi_structured (var, arr, obj) | ||
SELECT 1, array_construct(1, 2, 3) | ||
, parse_json(' { "key1": "value1", "key2": "value2" } '); | ||
|
||
INSERT INTO test_semi_structured (var, arr, obj) | ||
SELECT to_variant('abc') | ||
, array_construct('a', 'b', 'c') | ||
, parse_json(' { "key1": [1, 2, 3], "key2": ["a", "b", "c"] } '); | ||
|
||
|
||
SELECT * FROM test_semi_structured; | ||
|
||
+-------+-------------------+----------------------------------------------------+ | ||
| var | arr | obj | | ||
+-------+-------------------+----------------------------------------------------+ | ||
| 1 | [ 1, 2, 3 ] | { "key1": "value1", "key2": "value2" } | | ||
| "abc" | [ "a", "b", "c" ] | { "key1": [ 1, 2, 3 ], "key2": [ "a", "b", "c" ] } | | ||
+-------+-------------------+----------------------------------------------------+ | ||
``` | ||
|
||
## Design Details | ||
|
||
### Data storage format | ||
|
||
In order to store the `Variant` type values in the `parquet` format file with schema, we need to do some conversion on the original raw value. We have the following two choices: | ||
|
||
#### Store data in one column as JSON or binary JSON-like format | ||
|
||
JSON (JavaScript Object Notation) is the most common semi-structured format that can represent arbitrarily complex hierarchical values. It is very suitable for representing this kind of semi-structured data. Data of type `Variant` can be encoded in JSON format and stored as a raw string value. | ||
The main disadvantage of the JSON format is that each access requires expensive parsing of the raw string, so there are several optimized binary JSON-like formats to improve parsing speed and single key access. | ||
For example, MongoDB and PostgreSQL use [BSON](https://bsonspec.org/) and [jsonb](https://www.postgresql.org/docs/14/datatype-json.html) respectively to store data in JSON format. | ||
[UBJSON](https://ubjson.org/) is another binary JSON format specification; it aims to provide universal compatibility and the same ease of use as JSON while being faster and more efficient. | ||
All of these binary JSON formats perform better than plain JSON; the only problem is that they lack good Rust implementation libraries. | ||
Review comment: for further storage layout optimization, I think this paper also provides some insights: https://arxiv.org/pdf/2111.11517.pdf
Reply: ok, I'll look into this paper |
||
|
||
#### Store each unique key of data in sub-columns | ||
|
||
Although the JSON format can represent arbitrary data, in practice JSON data is usually generated by machines, so we can predict its schema and structure. | ||
Based on this property, we can extract and flatten each unique key in the JSON data into multiple independent virtual sub-columns. | ||
|
||
For example, suppose we have a column named `tweet` and store the following JSON data: | ||
|
||
```json | ||
{"id":1, "date": "1/11", "type": "story", "score": 3, "desc": 2, "title": "...", "url": "..."} | ||
{"id":2, "date": "1/12", "type": "poll", "score": 5, "desc": 2, "title": "..."} | ||
{"id":3, "date": "1/13", "type": "pollop", "score": 6, "poll": 2, "title": "..."} | ||
{"id":4, "date": "1/14", "type": "story", "score": 1, "desc": 1, "title": "...", "url": "..."} | ||
{"id":5, "date": "1/15", "type": "comment", "parent": 4, "text": "..."} | ||
{"id":6, "date": "1/16", "type": "comment", "parent": 1, "text": "..."} | ||
{"id":7, "date": "1/17", "type": "pollop", "score": 3, "poll": 2, "title": "..."} | ||
{"id":8, "date": "1/18", "type": "comment", "parent": 1, "text": "..."} | ||
``` | ||
|
||
This column can be split into 10 virtual sub-columns: `tweet.id`, `tweet.date`, `tweet.type`, `tweet.score`, `tweet.desc`, `tweet.title`, `tweet.url`, `tweet.parent`, `tweet.text`, `tweet.poll`. | ||
The data type of each sub-column can be inferred automatically from its values, so we can create those sub-columns automatically and insert the corresponding values. | ||
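
The flattening step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation: `Scalar`, `Row`, and `split_into_sub_columns` are hypothetical names, rows are modelled as flat key-value lists rather than parsed JSON, and a key that is missing from a row is stored as `None` (NULL).

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a JSON scalar; a real implementation would
/// likely use a full JSON value type instead.
#[derive(Debug, Clone, PartialEq)]
pub enum Scalar {
    Int(i64),
    Str(String),
}

/// One row of the source column, modelled as a flat list of (key, value) pairs.
pub type Row = Vec<(&'static str, Scalar)>;

/// Flatten rows into one sub-column per unique key, named `<prefix>.<key>`.
/// Every sub-column ends up with one entry per row; missing keys become `None`.
pub fn split_into_sub_columns(
    prefix: &str,
    rows: &[Row],
) -> BTreeMap<String, Vec<Option<Scalar>>> {
    let mut columns: BTreeMap<String, Vec<Option<Scalar>>> = BTreeMap::new();
    for (row_idx, row) in rows.iter().enumerate() {
        for (key, value) in row {
            let column = columns
                .entry(format!("{}.{}", prefix, key))
                // A key first seen in this row is NULL in all earlier rows.
                .or_insert_with(|| vec![None; row_idx]);
            if column.len() > row_idx {
                // Duplicate key within one row: the last value wins.
                column[row_idx] = Some(value.clone());
            } else {
                column.push(Some(value.clone()));
            }
        }
        // Pad every sub-column that this row did not touch.
        for column in columns.values_mut() {
            if column.len() <= row_idx {
                column.push(None);
            }
        }
    }
    columns
}

fn main() {
    let rows: Vec<Row> = vec![
        vec![
            ("id", Scalar::Int(1)),
            ("type", Scalar::Str("story".to_string())),
            ("url", Scalar::Str("...".to_string())),
        ],
        vec![
            ("id", Scalar::Int(5)),
            ("type", Scalar::Str("comment".to_string())),
            ("parent", Scalar::Int(4)),
        ],
    ];
    for (name, values) in split_into_sub_columns("tweet", &rows) {
        println!("{}: {:?}", name, values);
    }
}
```

Padding with `None` keeps all sub-columns the same length, which is also where the sparsity cost of this layout comes from when row schemas differ.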
|
||
The main advantage of this storage format is that it does not need to parse the raw JSON string when querying the data, which can greatly speed up the query processing. | ||
The disadvantage is that additional processing is required when inserting data, and the schema of the data is not exactly the same in every row. When the schemas differ widely between rows, many sub-column values will be NULL. | ||
In order to have good performance and balance in various scenarios, we can refer to the optimization algorithms introduced in the paper [JSON Tiles](https://db.in.tum.de/people/sites/durner/papers/json-tiles-sigmod21.pdf). | ||
|
||
From the perspective of performance, a better solution is to store data in binary JSON-like format and extract some frequently queried unique keys as sub-columns. | ||
However, in order to simplify development, we use the JSON format in the first version. | ||
Binary JSON-like format and separately stored sub-columns will be adopted in a future optimized version. | ||
Review comment: Does it mean that for the first version, those cols will not be stored in a "column-oriented" way? How about the Dremel-style encoding of nested data structures (which is supported by parquet out of the box)?
Reply: That's pretty complicated, and it seems to only work for fixed-schema columns.
Reply: parquet nested data types must define a schema, so they are not suitable for storing such arbitrary data. |
||
|
||
### Data Types | ||
|
||
Add three new values, `Variant`, `VariantArray`, and `VariantObject`, to the enumeration `TypeID` to support these three semi-structured data types. | ||
Since `TypeID` already has a value called `Array`, we name the semi-structured `Array` type `VariantArray` to distinguish it. | ||
Define the corresponding structures for these types and implement the trait `DataType` for them. | ||
The `PhysicalTypeID` corresponding to these types is `String`; the JSON value will be converted to a raw string for storage. | ||
|
||
```rust | ||
pub enum TypeID { | ||
    // ... existing variants | ||
    Variant, | ||
    VariantArray, | ||
    VariantObject, | ||
} | ||
|
||
pub struct VariantType {} | ||
|
||
pub struct VariantArrayType {} | ||
|
||
pub struct VariantObjectType {} | ||
|
||
``` | ||
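
To illustrate the mapping described above, the following sketch shows the three new `TypeID` values resolving to the `String` physical type. It is an assumption-laden illustration only: the `String` and `Array` variants stand in for the existing ones, and the methods `to_physical_type` and `is_semi_structured` are hypothetical names, not the project's actual API.

```rust
/// Logical type identifiers. `String` and `Array` stand in for the
/// existing variants; the last three are the new semi-structured types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeID {
    String,
    Array,
    Variant,
    VariantArray,
    VariantObject,
}

/// Physical storage types (reduced to two variants for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalTypeID {
    String,
    Array,
}

impl TypeID {
    /// Hypothetical helper: all semi-structured types are stored as raw
    /// JSON strings, so they share the `String` physical type.
    pub fn to_physical_type(self) -> PhysicalTypeID {
        match self {
            TypeID::String
            | TypeID::Variant
            | TypeID::VariantArray
            | TypeID::VariantObject => PhysicalTypeID::String,
            TypeID::Array => PhysicalTypeID::Array,
        }
    }

    /// Hypothetical helper: true for the three new semi-structured types.
    pub fn is_semi_structured(self) -> bool {
        matches!(
            self,
            TypeID::Variant | TypeID::VariantArray | TypeID::VariantObject
        )
    }
}

fn main() {
    for id in [TypeID::Array, TypeID::Variant, TypeID::VariantArray] {
        println!("{:?} -> {:?}", id, id.to_physical_type());
    }
}
```

Keeping `VariantArray` distinct from `Array` means the two can map to different physical types, which is the reason for the separate name.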
|
||
### Object Column | ||
|
||
Currently, `Column` is only implemented for fundamental types; custom structs or enumerations such as `serde_json::Value` have no suitable `Column` implementation to store them. | ||
Define `ObjectColumn` and `MutableObjectColumn` as generic structures that store custom data types, and implement the traits `Column` and `MutableColumn` for them respectively. | ||
`ObjectType` can be any custom struct or enumeration; we define `JsonColumn` by specifying the type parameter as `serde_json::Value`. | ||
All `Variant` data will be automatically cast to `serde_json::Value` and stored in a `JsonColumn`. | ||
Other custom data types, such as `BitmapColumn`, can be supported easily in the future. | ||
|
||
```rust | ||
#[derive(Clone)] | ||
pub struct ObjectColumn<T: ObjectType> { | ||
// Review comment: "ObjectType can extend from arrow2's [...] So values can be [...]" | ||
// Reply: "ok" |
||
values: Vec<T>, | ||
} | ||
|
||
#[derive(Debug)] | ||
pub struct MutableObjectColumn<T: ObjectType> { | ||
data_type: DataTypePtr, | ||
pub(crate) values: Vec<T>, | ||
} | ||
|
||
type JsonColumn = ObjectColumn<serde_json::Value>; | ||
|
||
``` | ||
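
A self-contained sketch of how the generic column pair might fit together is shown below. It makes several simplifying assumptions that are not part of the RFC: the `Column` trait is reduced to `len`/`is_empty`, `MutableObjectColumn` omits its `data_type` field, the builder methods `append_value`, `append_default`, and `finish` are hypothetical, and `JsonValue` is a small stand-in for `serde_json::Value` so that the example depends only on the standard library.

```rust
use std::fmt::Debug;

/// Simplified stand-in for the bound on custom element types (assumption).
pub trait ObjectType: Clone + Debug + Default + 'static {}

/// Hypothetical, heavily reduced subset of a `Column` trait.
pub trait Column {
    fn len(&self) -> usize;
    fn is_empty(&self) -> bool {
        self.len() == 0
    }
}

/// Immutable column holding values of a custom type.
#[derive(Clone, Debug)]
pub struct ObjectColumn<T: ObjectType> {
    values: Vec<T>,
}

impl<T: ObjectType> ObjectColumn<T> {
    pub fn values(&self) -> &[T] {
        &self.values
    }
}

impl<T: ObjectType> Column for ObjectColumn<T> {
    fn len(&self) -> usize {
        self.values.len()
    }
}

/// Builder counterpart; the `data_type` field from the RFC is omitted here.
#[derive(Debug)]
pub struct MutableObjectColumn<T: ObjectType> {
    values: Vec<T>,
}

impl<T: ObjectType> MutableObjectColumn<T> {
    pub fn new() -> Self {
        Self { values: Vec::new() }
    }

    pub fn append_value(&mut self, value: T) {
        self.values.push(value);
    }

    /// Append a default value for rows that carry no data.
    pub fn append_default(&mut self) {
        self.values.push(T::default());
    }

    /// Freeze the builder into an immutable column.
    pub fn finish(self) -> ObjectColumn<T> {
        ObjectColumn { values: self.values }
    }
}

/// Stand-in for `serde_json::Value`, reduced to three variants.
#[derive(Clone, Debug, PartialEq)]
pub enum JsonValue {
    Null,
    Number(i64),
    String(String),
}

impl Default for JsonValue {
    fn default() -> Self {
        JsonValue::Null
    }
}

impl ObjectType for JsonValue {}

pub type JsonColumn = ObjectColumn<JsonValue>;

fn main() {
    let mut builder = MutableObjectColumn::<JsonValue>::new();
    builder.append_value(JsonValue::Number(1));
    builder.append_value(JsonValue::String("abc".to_string()));
    builder.append_default();
    let column: JsonColumn = builder.finish();
    println!("{} values: {:?}", column.len(), column.values());
}
```

Because the column is generic over `ObjectType`, supporting another custom type would only require implementing the marker trait for it and adding a type alias.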
|
||
## TODO | ||
|
||
- Use better storage formats to improve query performance |
Review comment: Can you give an example of how users will use this feature? Like presenting a real SQL?
Take https://github.com/datafuselabs/opendal/blob/main/docs/rfcs/0000-example.md#guide-level-explanation for a look.
Reply: I have added some examples, PTAL @Xuanwo