[DOCS] Update timeline and table types pages

- Add local configs - Keep uptodate on different query types
apache · Jul 26, 2023 · f1c5846 · f1c5846
1 parent c7e0358
commit f1c5846
Show file tree

Hide file tree

Showing 3 changed files with 89 additions and 12 deletions.
diff --git a/website/docs/table_types.md b/website/docs/table_types.md
@@ -2,17 +2,19 @@
 title: Table & Query Types
 summary: "In this page, we describe the different tables types in Hudi."
 toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
 last_modified_at:
 ---
 
 ## Table and Query Types
 Hudi table types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e how data is written).
 In turn, `query types` define how the underlying data is exposed to the queries (i.e how data is read).
 
-| Table Type    | Supported Query types |
-|-------------- |------------------|
-| Copy On Write | Snapshot Queries + Incremental Queries  |
-| Merge On Read | Snapshot Queries + Incremental Queries + Read Optimized Queries |
+| Table Type    | Supported Query types                                                                                                                           |
+|-------------- |-------------------------------------------------------------------------------------------------------------------------------------------------|
+| Copy On Write | <ul><li>Snapshot Queries</li><li>Incremental Queries</li><li>Incremental Queries (CDC)</li><li>Bootstrap Queries</li><li>Time Travel</li></ul>  |
+| Merge On Read | <ul><li>Snapshot Queries</li><li>Incremental Queries</li><li>Read Optimized Queries</li><li>Bootstrap Queries(RO)</li><li>Bootstrap Queries (Snapshot)</li><li>Time Travel</li></ul>           |
 
 ### Table Types
 Hudi supports the following table types.
@@ -36,17 +38,28 @@ Hudi supports the following query types
 
 - **Snapshot Queries** : Queries see the latest snapshot of the table as of a given commit or compaction action. In case of merge on read table, it exposes near-real time data(few mins) by merging
   the base and delta files of the latest file slice on-the-fly. For copy on write table,  it provides a drop-in replacement for existing parquet tables, while providing upsert/delete and other write side features.
-- **Incremental Queries** : Queries only see new data written to the table, since a given commit/compaction. This effectively provides change streams to enable incremental data pipelines.
-- **Read Optimized Queries** : Queries see the latest snapshot of table as of a given commit/compaction action. Exposes only the base/columnar files in latest file slices and guarantees the
+- **Incremental Queries** : Queries only see new data written to the table, since a given commit/compaction. 
+  This effectively provides change streams to enable incremental data pipelines. By default, this produces the latest 
+  snapshot of the changes since a given point in timeline.
+  - ***Incremental Queries(CDC)*** : These are a subtype fo Incremental queries, where queries see all changed data since 
+    a given commit/compaction as opposed to latest state of changed data. This enables full cdc style query use cases 
+    allowing to see before and after images of the changes along with operations that caused the change.
+- **Read Optimized Queries** : Queries see the latest snapshot of table as of a given commit/compaction action. Exposes only the base/columnar files in the latest file slices and guarantees the
   same columnar query performance compared to a non-hudi columnar table.
+- **Bootstrap Queries** : Queries see the latest snapshot of a bootstrapped Hudi table as of a given commit/compaction action.
+  - ***Bootstrap Queries(RO)*** : Similar to read optimized queries but on bootstrapped tables.
+  - ***Bootstrap Queries(Snapshot)*** : Similar to snapshot queries but on bootstrapped tables.
+- **Time Travel Queries** : Queries the snapshot of a table as of a given timestamp in the timeline. 
 
-Following table summarizes the trade-offs between the different query types.
+
+Following table summarizes the trade-offs between the Snapshot and Read Optimized query types.
 
 | Trade-off     | Snapshot    | Read Optimized |
 |-------------- |-------------| ------------------|
 | Data Latency  | Lower | Higher
 | Query Latency | Higher (merge base / columnar file + row based delta / log files) | Lower (raw base / columnar file performance)
 
+For configs related to query types refer [below](#query-configs).
 
 ## Copy On Write Table
 
@@ -108,3 +121,30 @@ The intention of merge on read table is to enable near real-time processing dire
 data out to specialized systems, which may not be able to handle the data volume. There are also a few secondary side benefits to
 this table such as reduced write amplification by avoiding synchronous merge of data, i.e, the amount of data written per 1 bytes of data in a batch
 
+## Query configs
+Following are the configs relevant to different query types.
+
+### Spark configs
+
+| Config Name                                                                            | Default                               | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
+|----------------------------------------------------------------------------------------|---------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| hoodie.datasource.query.type                        | snapshot (Optional) | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files)<br /><br />`Config Param: QUERY_TYPE`                                                                                                                                                                                                                                                                                                                              |
+| hoodie.datasource.read.begin.instanttime | N/A **(Required)**  | Required when `hoodie.datasource.query.type` is set to `incremental`. Represents the instant time to start incrementally pulling data from. The instanttime here need not necessarily correspond to an instant on the timeline. New data written with an instant_time &gt; BEGIN_INSTANTTIME are fetched out. For e.g: ‘20170901080000’ will get all new data written after Sep 1, 2017 08:00AM. Note that if `hoodie.datasource.read.handle.hollow.commit` set to USE_STATE_TRANSITION_TIME, will use instant's `stateTransitionTime` to perform comparison.<br /><br />`Config Param: BEGIN_INSTANTTIME`     |
+| hoodie.datasource.read.end.instanttime     | N/A **(Required)**  | Used when `hoodie.datasource.query.type` is set to `incremental`. Represents the instant time to limit incrementally fetched data to. When not specified latest commit time from timeline is assumed by default. When specified, new data written with an instant_time &lt;= END_INSTANTTIME are fetched out. Point in time type queries make more sense with begin and end instant times specified. Note that if `hoodie.datasource.read.handle.hollow.commit` set to `USE_STATE_TRANSITION_TIME`, will use instant's `stateTransitionTime` to perform comparison.<br /><br />`Config Param: END_INSTANTTIME` |
+| hoodie.bootstrap.data.queries.only                         | false (Optional)                                                                                | Improves query performance, but queries cannot use hudi metadata fields<br /><br />`Config Param: DATA_QUERIES_ONLY`<br />`Since Version: 0.14.0`                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+| hoodie.datasource.query.incremental.format                                                                         | latest_state (Optional)    | This config is used alone with the 'incremental' query type.When set to 'latest_state', it returns the latest records' values.When set to 'cdc', it returns the cdc data.<br /><br />`Config Param: INCREMENTAL_FORMAT`<br />`Since Version: 0.13.0`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
+| as.of.instant                                                                                                                             | N/A **(Required)**         | The query instant for time travel. Required only in the context of time travel queries. Without specified this option, we query the latest snapshot.<br /><br />`Config Param: TIME_TRAVEL_AS_OF_INSTANT`                                                                                                                                                                                                                                                                                                                                                                                                      |
+
+
+Refer [here](https://hudi.apache.org/docs/next/configurations#Read-Options) and [bootstrap configs](https://hudi.apache.org/docs/next/configurations#Bootstrap-Configs) for more details
+
+### Flink Configs
+
+| Config Name                                                                              | Default                               | Description                                                                                                                                                                                                                                                                                                                                                                  |
+|------------------------------------------------------------------------------------------|---------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| hoodie.datasource.query.type                                      | snapshot (Optional)                   | Decides how data files need to be read, in 1) Snapshot mode (obtain latest view, based on row &amp; columnar data); 2) incremental mode (new data since an instantTime). If `cdc.enabled` is set incremental queries on cdc data are possible; 3) Read Optimized mode (obtain latest view, based on columnar data) .Default: snapshot<br /><br /> `Config Param: QUERY_TYPE` |
+| read.start-commit                                                     | N/A **(Required)**                    | Required in case of incremental queries. Start commit instant for reading, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant for streaming read<br /><br /> `Config Param: READ_START_COMMIT`                                                                                                                                    |
+| read.end-commit                                                        | N/A **(Required)**                    | Used int he context of incremental queries. End commit instant for reading, the commit time format should be 'yyyyMMddHHmmss'<br /><br /> `Config Param: READ_END_COMMIT`                                                                                                                                                                                                    |
+
+Refer [here](https://hudi.apache.org/docs/next/configurations#Flink-Options) for more details.
+