RFC Semi-structured Data Types #4320

b41sh · 2022-03-04T00:42:17Z

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Add RFC for Semi-structured data types design

Changelog

Documentation

Related Issues

#3916

Test Plan

Unit Tests

Stateless Tests

vercel · 2022-03-04T00:42:22Z

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/databend/databend/C8wHJXrF8LzmA5QqhbuxPZkXyr1X
✅ Preview: https://databend-git-fork-b41sh-rfc-semi-structured-databend.vercel.app

mergify · 2022-03-04T00:42:50Z

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

ZhiHanZ · 2022-03-04T01:13:49Z

docs/dev/rfcs/query/0007-semi-structured-data-types.md

+The main disadvantage of the JSON format is that each access requires expensive parsing of the raw string, so there are several optimized binary JSON-like formats to improve parsing speed and single key access.
+For example, MongoDB and PostgreSQL use [BSON](https://bsonspec.org/) and [jsonb](https://www.postgresql.org/docs/14/datatype-json.html) respectively to store data in JSON format.
+[UBJSON](https://ubjson.org/) is also a compatible format specification for binary JSON, it can provide universal compatibility, as easy of use as JSON while being faster and more efficient.
+All of these binary JSON formats have better performance, the only problem is they lack a good Rust implementation libraries.


For further storage layout optimization, I think this paper also provides some insights: https://arxiv.org/pdf/2111.11517.pdf

OK, I'll look into this paper.

sundy-li · 2022-03-04T01:29:58Z

docs/dev/rfcs/query/0007-semi-structured-data-types.md

+
+```rust
+#[derive(Clone)]
+pub struct ObjectColumn<T: ObjectType> {


`ObjectType` can extend arrow2's `NativeType`.

The values can then be `Arc<Bytes<T>>`, which makes taking a slice zero-cost.
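To illustrate the zero-cost slicing point, here is a minimal sketch using only the standard library. It does not use arrow2: `SharedColumn`, its fields, and its methods are hypothetical names, and `Arc<[T]>` stands in for the `Arc<Bytes<T>>` buffer mentioned above. The idea is that a slice only clones the `Arc` and adjusts an offset and a length, so no elements are copied.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a column whose values live in a shared buffer.
/// `Arc<[T]>` is used here in place of arrow2's `Arc<Bytes<T>>`.
#[derive(Clone)]
pub struct SharedColumn<T> {
    values: Arc<[T]>, // shared, immutable backing buffer
    offset: usize,    // start of this view within the buffer
    len: usize,       // number of elements visible through this view
}

impl<T> SharedColumn<T> {
    /// Moves the data into a shared buffer (this is the only copy made).
    pub fn new(data: Vec<T>) -> Self {
        let len = data.len();
        Self { values: Arc::from(data), offset: 0, len }
    }

    /// Returns a view over `[offset, offset + len)` of this view.
    /// Only the reference count changes; no elements are copied.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            values: Arc::clone(&self.values),
            offset: self.offset + offset,
            len,
        }
    }

    /// Borrows the elements visible through this view.
    pub fn as_slice(&self) -> &[T] {
        &self.values[self.offset..self.offset + self.len]
    }

    /// True when both views point at the same backing allocation.
    pub fn shares_buffer_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.values, &other.values)
    }
}
```

For example, calling `slice(1, 3)` on a column built from `vec![1, 2, 3, 4, 5]` gives a view over `[2, 3, 4]` that still shares the original allocation.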

Xuanwo · 2022-03-04T01:35:33Z

docs/dev/rfcs/query/0007-semi-structured-data-types.md

+- `Array`: Used to represent dense or sparse arrays of arbitrary size, where the index is a non-negative integer (up to 2^31-1), and values are `Variant` types.
+
+Since `Object` and `Array` can be regarded as a type of `Variant`, the following introduction mainly takes `Variant` as an example.
+


Can you give an example of how users will use this feature, such as a real SQL query?

Take a look at https://github.com/datafuselabs/opendal/blob/main/docs/rfcs/0000-example.md#guide-level-explanation.

I have added some examples, PTAL @Xuanwo

dantengsky · 2022-03-04T01:41:16Z

docs/dev/rfcs/query/0007-semi-structured-data-types.md

+
+From the perspective of performance, a better solution is to store data in binary JSON-like format and extract some frequently queried unique keys as sub-columns.
+However, in order to simplify development, we use the JSON format in the first version.
+Binary JSON-like format and separately stored sub-columns will be adopted in a future optimized version.


Does it mean that in the first version, those columns will not be stored in a "column-oriented" way?

How about Dremel-style encoding of nested data structures (which Parquet supports out of the box)?

That's pretty complicated, and it seems to work only for columns with a fixed schema.

Parquet nested data types must define a schema, so they are not suitable for storing such arbitrary data.
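The sub-column idea from the quoted RFC text can be sketched roughly as follows. This is an illustration only, not the RFC's design: `Row`, `frequent_keys`, and `extract_sub_column` are hypothetical names, rows are assumed to be already parsed into flat string key/value pairs, and keys are selected by how often they occur in the data, not by how often they are queried, because query statistics are outside the scope of this sketch.

```rust
use std::collections::BTreeMap;

/// One semi-structured row, already parsed into key/value pairs.
/// Hypothetical simplification: values are plain strings with no nesting.
pub type Row = BTreeMap<String, String>;

/// Returns the keys that appear in at least `min_rows` rows, in sorted order.
/// Assumption: occurrence count is used as a proxy for "frequently queried".
pub fn frequent_keys(rows: &[Row], min_rows: usize) -> Vec<String> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for row in rows {
        for key in row.keys() {
            *counts.entry(key.as_str()).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .filter(|(_, n)| *n >= min_rows)
        .map(|(k, _)| k.to_string())
        .collect()
}

/// Builds one sub-column for `key`: one entry per row, `None` where the key is missing.
pub fn extract_sub_column(rows: &[Row], key: &str) -> Vec<Option<String>> {
    rows.iter().map(|row| row.get(key).cloned()).collect()
}
```

Keys that do not pass the threshold would stay in the original JSON value, so no schema has to be declared up front.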

Add Semi-structured Data Types RFC

371e3aa

b41sh requested a review from BohuTANG as a code owner March 4, 2022 00:42

databend-bot added the need-review label Mar 4, 2022

vercel bot deployed to Preview March 4, 2022 00:42

mergify bot added the pr-doc-fix label Mar 4, 2022

b41sh requested a review from sundy-li March 4, 2022 00:45

BohuTANG requested review from dantengsky and zhang2014 March 4, 2022 00:46

ZhiHanZ reviewed Mar 4, 2022


sundy-li reviewed Mar 4, 2022


Xuanwo reviewed Mar 4, 2022


dantengsky reviewed Mar 4, 2022


add some example

dc0f6cf

vercel bot deployed to Preview March 4, 2022 08:25

sundy-li merged commit 06842a9 into databendlabs:main Mar 4, 2022
