RFC Semi-structured Data Types #4320

Merged
merged 2 commits into from
Mar 4, 2022

b41sh
Member

@b41sh b41sh commented Mar 4, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Add RFC for Semi-structured data types design

Changelog

  • Documentation

Related Issues

#3916

Test Plan

Unit Tests

Stateless Tests

@vercel

vercel bot commented Mar 4, 2022

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/databend/databend/C8wHJXrF8LzmA5QqhbuxPZkXyr1X
✅ Preview: https://databend-git-fork-b41sh-rfc-semi-structured-databend.vercel.app

@mergify
Contributor

mergify bot commented Mar 4, 2022

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@mergify mergify bot added the pr-doc-fix label Mar 4, 2022
@b41sh b41sh requested a review from sundy-li March 4, 2022 00:45
The main disadvantage of the JSON format is that each access requires expensive parsing of the raw string, so several optimized binary JSON-like formats have been developed to improve parsing speed and single-key access.
For example, MongoDB and PostgreSQL use [BSON](https://bsonspec.org/) and [jsonb](https://www.postgresql.org/docs/14/datatype-json.html) respectively to store JSON data.
[UBJSON](https://ubjson.org/) is another binary JSON format specification; it aims to provide universal compatibility and to be as easy to use as JSON while being faster and more efficient.
All of these binary JSON formats offer better performance; the only problem is that they lack good Rust implementation libraries.
Collaborator

For further storage layout optimization, I think this paper also provides some insights: https://arxiv.org/pdf/2111.11517.pdf

Member Author

ok, I'll look into this paper


```rust
#[derive(Clone)]
pub struct ObjectColumn<T: ObjectType> {
```
Member

`ObjectType` can extend arrow2's `NativeType`.

The values can then be `Arc<Bytes<T>>`, which makes taking a slice zero-cost.
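
To illustrate why a reference-counted buffer makes slicing cheap, here is a minimal sketch using only the standard library. `SharedBuffer` is a hypothetical stand-in and is not arrow2's actual type; the point is only that a slice shares the allocation and adjusts an offset and a length.

```rust
use std::sync::Arc;

/// Hypothetical shared, immutable buffer (an assumption for illustration,
/// not arrow2's real `Bytes<T>` or `Buffer<T>`).
#[derive(Clone, Debug)]
pub struct SharedBuffer<T> {
    data: Arc<[T]>, // reference-counted backing storage
    offset: usize,  // start of this view within `data`
    len: usize,     // number of elements visible through this view
}

impl<T> SharedBuffer<T> {
    /// Builds a buffer that views all of `values`.
    pub fn from_vec(values: Vec<T>) -> Self {
        let len = values.len();
        Self { data: Arc::from(values), offset: 0, len }
    }

    /// Returns a sub-view without copying elements: only the reference
    /// count is bumped and two integers are recomputed.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            data: Arc::clone(&self.data),
            offset: self.offset + offset,
            len,
        }
    }

    /// Borrows the elements visible through this view.
    pub fn as_slice(&self) -> &[T] {
        &self.data[self.offset..self.offset + self.len]
    }
}

fn main() {
    let buf = SharedBuffer::from_vec(vec![10u8, 20, 30, 40, 50]);
    let view = buf.slice(1, 3);
    assert_eq!(view.as_slice(), &[20, 30, 40]);
    // Both views point at the same allocation.
    assert!(Arc::ptr_eq(&buf.data, &view.data));
}
```

Cloning the `Arc` is what keeps the backing storage alive for as long as any view exists, so a column can hand out slices freely.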

Member Author

ok

- `Array`: Used to represent dense or sparse arrays of arbitrary size, where the index is a non-negative integer (up to 2^31-1), and values are `Variant` types.

Since `Object` and `Array` can be regarded as a type of `Variant`, the following introduction mainly takes `Variant` as an example.
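
As a rough sketch of how `Object` and `Array` can be treated as cases of `Variant`, the relationship could be modeled as a recursive enum. Everything here is an assumption for illustration: the scalar variants, the string-keyed `Object`, and the `get_index` / `get_key` helpers are hypothetical and not taken from the RFC.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory model; not Databend's actual definition.
#[derive(Clone, Debug, PartialEq)]
pub enum Variant {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    /// `Array`: values addressed by a non-negative integer index.
    Array(Vec<Variant>),
    /// `Object`: assumed here to map string keys to `Variant` values.
    Object(BTreeMap<String, Variant>),
}

impl Variant {
    /// Returns the element at `index` if this value is an `Array`.
    pub fn get_index(&self, index: usize) -> Option<&Variant> {
        match self {
            Variant::Array(items) => items.get(index),
            _ => None,
        }
    }

    /// Returns the value stored under `key` if this value is an `Object`.
    pub fn get_key(&self, key: &str) -> Option<&Variant> {
        match self {
            Variant::Object(fields) => fields.get(key),
            _ => None,
        }
    }
}

fn main() {
    let tags = Variant::Array(vec![
        Variant::String("a".to_string()),
        Variant::Number(2.0),
    ]);
    let mut fields = BTreeMap::new();
    fields.insert("tags".to_string(), tags);
    let doc = Variant::Object(fields);

    // Navigate Object -> Array -> element.
    let second = doc.get_key("tags").and_then(|t| t.get_index(1));
    assert_eq!(second, Some(&Variant::Number(2.0)));
    // Indexing into a non-array yields None rather than panicking.
    assert_eq!(doc.get_index(0), None);
}
```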

Member

Can you give an example of how users will use this feature, such as a real SQL statement?

Take a look at https://github.com/datafuselabs/opendal/blob/main/docs/rfcs/0000-example.md#guide-level-explanation for reference.

Member Author

I have added some examples, PTAL @Xuanwo


From the perspective of performance, a better solution is to store data in binary JSON-like format and extract some frequently queried unique keys as sub-columns.
However, in order to simplify development, we use the JSON format in the first version.
Binary JSON-like format and separately stored sub-columns will be adopted in a future optimized version.
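
The sub-column idea mentioned above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `Row` and `extract_sub_column` are hypothetical names, each row is modeled as an already-parsed object whose values are kept as raw JSON text, and the key to extract is chosen by the caller.

```rust
use std::collections::BTreeMap;

/// Hypothetical row model: one parsed JSON object per row, with each
/// value kept as raw JSON text for simplicity.
type Row = BTreeMap<String, String>;

/// Moves `key` out of every row into its own column.
/// Rows that lack the key contribute `None`.
fn extract_sub_column(rows: &mut [Row], key: &str) -> Vec<Option<String>> {
    rows.iter_mut().map(|row| row.remove(key)).collect()
}

fn main() {
    let mut rows: Vec<Row> = vec![
        BTreeMap::from([
            ("id".to_string(), "1".to_string()),
            ("name".to_string(), "\"a\"".to_string()),
        ]),
        BTreeMap::from([("name".to_string(), "\"b\"".to_string())]),
    ];

    // A query that filters on `id` can now scan this column directly
    // instead of parsing every row.
    let ids = extract_sub_column(&mut rows, "id");
    assert_eq!(ids, vec![Some("1".to_string()), None]);
    assert!(!rows[0].contains_key("id"));
    assert!(rows[0].contains_key("name"));
}
```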
Member

Does this mean that in the first version, these columns will not be stored in a column-oriented way?

How about Dremel-style encoding of nested data structures (which Parquet supports out of the box)?

Member

That's pretty complicated, and it seems to work only for columns with a fixed schema.

Member Author

Parquet nested data types require a defined schema, so they are not suitable for storing such arbitrary data.

6 participants