Skip to content

Commit

Permalink
[DOCS] Update timeline and table types pages
Browse files Browse the repository at this point in the history
- Add local configs
- Keep uptodate on different query types
  • Loading branch information
bhasudha committed Jul 26, 2023
1 parent c7e0358 commit f1c5846
Show file tree
Hide file tree
Showing 3 changed files with 89 additions and 12 deletions.
54 changes: 47 additions & 7 deletions website/docs/table_types.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,17 +2,19 @@
title: Table & Query Types
summary: "In this page, we describe the different tables types in Hudi."
toc: true
toc_min_heading_level: 2
toc_max_heading_level: 4
last_modified_at:
---

## Table and Query Types
Hudi table types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e how data is written).
In turn, `query types` define how the underlying data is exposed to the queries (i.e how data is read).

| Table Type | Supported Query types |
|-------------- |------------------|
| Copy On Write | Snapshot Queries + Incremental Queries |
| Merge On Read | Snapshot Queries + Incremental Queries + Read Optimized Queries |
| Table Type | Supported Query types |
|-------------- |-------------------------------------------------------------------------------------------------------------------------------------------------|
| Copy On Write | <ul><li>Snapshot Queries</li><li>Incremental Queries</li><li>Incremental Queries (CDC)</li><li>Bootstrap Queries</li><li>Time Travel</li></ul> |
| Merge On Read | <ul><li>Snapshot Queries</li><li>Incremental Queries</li><li>Read Optimized Queries</li><li>Bootstrap Queries(RO)</li><li>Bootstrap Queries (Snapshot)</li><li>Time Travel</li></ul> |

### Table Types
Hudi supports the following table types.
Expand All @@ -36,17 +38,28 @@ Hudi supports the following query types

- **Snapshot Queries** : Queries see the latest snapshot of the table as of a given commit or compaction action. In case of merge on read table, it exposes near-real time data(few mins) by merging
the base and delta files of the latest file slice on-the-fly. For copy on write table, it provides a drop-in replacement for existing parquet tables, while providing upsert/delete and other write side features.
- **Incremental Queries** : Queries only see new data written to the table, since a given commit/compaction. This effectively provides change streams to enable incremental data pipelines.
- **Read Optimized Queries** : Queries see the latest snapshot of table as of a given commit/compaction action. Exposes only the base/columnar files in latest file slices and guarantees the
- **Incremental Queries** : Queries only see new data written to the table, since a given commit/compaction.
This effectively provides change streams to enable incremental data pipelines. By default, this produces the latest
snapshot of the changes since a given point in timeline.
- ***Incremental Queries(CDC)*** : These are a subtype fo Incremental queries, where queries see all changed data since
a given commit/compaction as opposed to latest state of changed data. This enables full cdc style query use cases
allowing to see before and after images of the changes along with operations that caused the change.
- **Read Optimized Queries** : Queries see the latest snapshot of table as of a given commit/compaction action. Exposes only the base/columnar files in the latest file slices and guarantees the
same columnar query performance compared to a non-hudi columnar table.
- **Bootstrap Queries** : Queries see the latest snapshot of a bootstrapped Hudi table as of a given commit/compaction action.
- ***Bootstrap Queries(RO)*** : Similar to read optimized queries but on bootstrapped tables.
- ***Bootstrap Queries(Snapshot)*** : Similar to snapshot queries but on bootstrapped tables.
- **Time Travel Queries** : Queries the snapshot of a table as of a given timestamp in the timeline.

Following table summarizes the trade-offs between the different query types.

Following table summarizes the trade-offs between the Snapshot and Read Optimized query types.

| Trade-off | Snapshot | Read Optimized |
|-------------- |-------------| ------------------|
| Data Latency | Lower | Higher
| Query Latency | Higher (merge base / columnar file + row based delta / log files) | Lower (raw base / columnar file performance)

For configs related to query types refer [below](#query-configs).

## Copy On Write Table

Expand Down Expand Up @@ -108,3 +121,30 @@ The intention of merge on read table is to enable near real-time processing dire
data out to specialized systems, which may not be able to handle the data volume. There are also a few secondary side benefits to
this table such as reduced write amplification by avoiding synchronous merge of data, i.e, the amount of data written per 1 bytes of data in a batch

## Query configs
Following are the configs relevant to different query types.

### Spark configs

| Config Name | Default | Description |
|----------------------------------------------------------------------------------------|---------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| hoodie.datasource.query.type | snapshot (Optional) | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files)<br /><br />`Config Param: QUERY_TYPE` |
| hoodie.datasource.read.begin.instanttime | N/A **(Required)** | Required when `hoodie.datasource.query.type` is set to `incremental`. Represents the instant time to start incrementally pulling data from. The instanttime here need not necessarily correspond to an instant on the timeline. New data written with an instant_time &gt; BEGIN_INSTANTTIME are fetched out. For e.g: ‘20170901080000’ will get all new data written after Sep 1, 2017 08:00AM. Note that if `hoodie.datasource.read.handle.hollow.commit` set to USE_STATE_TRANSITION_TIME, will use instant's `stateTransitionTime` to perform comparison.<br /><br />`Config Param: BEGIN_INSTANTTIME` |
| hoodie.datasource.read.end.instanttime | N/A **(Required)** | Used when `hoodie.datasource.query.type` is set to `incremental`. Represents the instant time to limit incrementally fetched data to. When not specified latest commit time from timeline is assumed by default. When specified, new data written with an instant_time &lt;= END_INSTANTTIME are fetched out. Point in time type queries make more sense with begin and end instant times specified. Note that if `hoodie.datasource.read.handle.hollow.commit` set to `USE_STATE_TRANSITION_TIME`, will use instant's `stateTransitionTime` to perform comparison.<br /><br />`Config Param: END_INSTANTTIME` |
| hoodie.bootstrap.data.queries.only | false (Optional) | Improves query performance, but queries cannot use hudi metadata fields<br /><br />`Config Param: DATA_QUERIES_ONLY`<br />`Since Version: 0.14.0` |
| hoodie.datasource.query.incremental.format | latest_state (Optional) | This config is used alone with the 'incremental' query type.When set to 'latest_state', it returns the latest records' values.When set to 'cdc', it returns the cdc data.<br /><br />`Config Param: INCREMENTAL_FORMAT`<br />`Since Version: 0.13.0` |
| as.of.instant | N/A **(Required)** | The query instant for time travel. Required only in the context of time travel queries. Without specified this option, we query the latest snapshot.<br /><br />`Config Param: TIME_TRAVEL_AS_OF_INSTANT` |


Refer [here](https://hudi.apache.org/docs/next/configurations#Read-Options) and [bootstrap configs](https://hudi.apache.org/docs/next/configurations#Bootstrap-Configs) for more details

### Flink Configs

| Config Name | Default | Description |
|------------------------------------------------------------------------------------------|---------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| hoodie.datasource.query.type | snapshot (Optional) | Decides how data files need to be read, in 1) Snapshot mode (obtain latest view, based on row &amp; columnar data); 2) incremental mode (new data since an instantTime). If `cdc.enabled` is set incremental queries on cdc data are possible; 3) Read Optimized mode (obtain latest view, based on columnar data) .Default: snapshot<br /><br /> `Config Param: QUERY_TYPE` |
| read.start-commit | N/A **(Required)** | Required in case of incremental queries. Start commit instant for reading, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant for streaming read<br /><br /> `Config Param: READ_START_COMMIT` |
| read.end-commit | N/A **(Required)** | Used int he context of incremental queries. End commit instant for reading, the commit time format should be 'yyyyMMddHHmmss'<br /><br /> `Config Param: READ_END_COMMIT` |

Refer [here](https://hudi.apache.org/docs/next/configurations#Flink-Options) for more details.

Loading

0 comments on commit f1c5846

Please sign in to comment.