Merge pull request #10 from apache/dev
merge latest code
nianliuu authored Sep 20, 2024
2 parents 76193c7 + 4f5d27f commit 0dbac11
Showing 77 changed files with 4,603 additions and 741 deletions.
22 changes: 14 additions & 8 deletions docs/en/concept/JobEnvConfig.md
@@ -21,14 +21,26 @@ You can configure whether the task is in batch or stream mode through `job.mode`

### checkpoint.interval

Gets the interval in which checkpoints are periodically scheduled.
Gets the interval (milliseconds) in which checkpoints are periodically scheduled.

In `STREAMING` mode, checkpoints are required; if you do not set this, the value will be obtained from the application configuration file `seatunnel.yaml`. In `BATCH` mode, you can disable checkpoints by not setting this parameter.
In `STREAMING` mode, checkpoints are required; if you do not set this, the value will be obtained from the application configuration file `seatunnel.yaml`. In `BATCH` mode, you can disable checkpoints by not setting this parameter. In Zeta `STREAMING` mode, the default value is 30000 milliseconds.

### checkpoint.timeout

The timeout (in milliseconds) for a checkpoint. If the checkpoint is not completed before the timeout, the job will fail. In Zeta, the default value is 30000 milliseconds.

### parallelism

This parameter configures the parallelism of source and sink.
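
As a quick illustration (the values below are placeholders, not defaults), the options above are set in the job's `env` block:

```hocon
env {
  # run the job in stream mode; use "BATCH" for batch mode
  job.mode = "STREAMING"
  # schedule a checkpoint every 10 seconds (milliseconds)
  checkpoint.interval = 10000
  # fail the checkpoint if it does not complete within 60 seconds
  checkpoint.timeout = 60000
  # parallelism of source and sink
  parallelism = 2
}
```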

### shade.identifier

Specify the method of encryption. If you do not need to encrypt or decrypt config files, this option can be ignored.

For more details, you can refer to the documentation [Config Encryption Decryption](../connector-v2/Config-Encryption-Decryption.md)

## Zeta Engine Parameter

### job.retry.times

Used to control the default number of retries when a job fails. The default value is 3, and it only works in the Zeta engine.
@@ -43,12 +43,6 @@ This parameter is used to specify the location of the savemode when the job is executed.
The default value is `CLUSTER`, which means that the savemode is executed on the cluster. If you want to execute the savemode on the client,
you can set it to `CLIENT`. Please use `CLUSTER` mode as much as possible, because `CLIENT` mode will be removed once `CLUSTER` mode is proven to work without problems.
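
A minimal sketch of the Zeta-specific options above in the `env` block (the `savemode.execute.location` key name is assumed here, since the option heading is not shown in this excerpt; values are illustrative):

```hocon
env {
  job.mode = "BATCH"
  # retry a failed job up to 5 times (Zeta engine only)
  job.retry.times = 5
  # assumed key name for the savemode execution location: CLUSTER (default) or CLIENT
  savemode.execute.location = "CLUSTER"
}
```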

### shade.identifier

Specify the method of encryption. If you do not need to encrypt or decrypt config files, this option can be ignored.

For more details, you can refer to the documentation [Config Encryption Decryption](../connector-v2/Config-Encryption-Decryption.md)

## Flink Engine Parameter

Here are some SeaTunnel parameter names and their corresponding names in Flink (not all of them are listed). Please refer to the official [Flink Documentation](https://flink.apache.org/).
2 changes: 1 addition & 1 deletion docs/en/connector-v2/sink/Assert.md
Expand Up @@ -4,7 +4,7 @@
## Description

A flink sink plugin which can assert illegal data by user defined rules
A sink plugin which can assert illegal data by user defined rules

## Key Features

42 changes: 42 additions & 0 deletions docs/en/connector-v2/sink/FtpFile.md
@@ -64,6 +64,8 @@ By default, we use 2PC commit to ensure `exactly-once`
| parquet_avro_write_timestamp_as_int96 | boolean | no | false | Only used when file_format is parquet. |
| parquet_avro_write_fixed_as_int96 | array | no | - | Only used when file_format is parquet. |
| encoding | string | no | "UTF-8" | Only used when file_format_type is json,text,csv,xml. |
| schema_save_mode | string | no | CREATE_SCHEMA_WHEN_NOT_EXIST | Existing dir processing method |
| data_save_mode | string | no | APPEND_DATA | Existing data processing method |

### host [string]

@@ -227,6 +229,18 @@ Support writing Parquet INT96 from a 12-byte field, only valid for parquet files
Only used when file_format_type is json,text,csv,xml.
The encoding of the file to write. This param will be parsed by `Charset.forName(encoding)`.

### schema_save_mode [string]

Existing dir processing method.

- RECREATE_SCHEMA: will create the dir when it does not exist, delete and recreate it when it exists
- CREATE_SCHEMA_WHEN_NOT_EXIST: will create the dir when it does not exist, skipped when it exists
- ERROR_WHEN_SCHEMA_NOT_EXIST: an error will be reported when the dir does not exist
- IGNORE: ignore the handling of the dir

### data_save_mode [string]

Existing data processing method.

- DROP_DATA: preserve the dir and delete data files
- APPEND_DATA: preserve the dir and preserve data files
- ERROR_WHEN_DATA_EXISTS: an error is reported when data files exist

## Example

For text file format simple config
@@ -273,6 +287,34 @@ FtpFile {

```

When the source has multiple tables and you want different tables written to different directories, you can configure it this way:

```hocon
FtpFile {
host = "xxx.xxx.xxx.xxx"
port = 21
user = "username"
password = "password"
path = "/data/ftp/seatunnel/job1/${table_name}"
tmp_path = "/data/ftp/seatunnel/tmp"
file_format_type = "text"
field_delimiter = "\t"
row_delimiter = "\n"
have_partition = true
partition_by = ["age"]
partition_dir_expression = "${k0}=${v0}"
is_partition_field_write_in_file = true
custom_filename = true
file_name_expression = "${transactionId}_${now}"
sink_columns = ["name","age"]
filename_time_format = "yyyy.MM.dd"
schema_save_mode=RECREATE_SCHEMA
data_save_mode=DROP_DATA
}
```

## Changelog

### 2.2.0-beta 2022-09-26
113 changes: 97 additions & 16 deletions docs/en/connector-v2/sink/Hudi.md
@@ -14,46 +14,91 @@ Used to write data to Hudi.

## Options

Base configuration:

| name | type | required | default value |
|----------------------------|---------|----------|-----------------------------|
| table_dfs_path | string | yes | - |
| conf_files_path | string | no | - |
| table_list | Array | no | - |
| auto_commit | boolean | no | true |
| schema_save_mode | enum | no | CREATE_SCHEMA_WHEN_NOT_EXIST|
| common-options | Config | no | - |

Table list configuration:

| name | type | required | default value |
|----------------------------|--------|----------|---------------|
| table_name                 | string | yes      | -             |
| database                   | string | no       | default       |
| table_type                 | enum   | no       | COPY_ON_WRITE |
| op_type                    | enum   | no       | insert        |
| record_key_fields          | string | no       | -             |
| partition_fields           | string | no       | -             |
| batch_interval_ms          | Int    | no       | 1000          |
| batch_size                 | Int    | no       | 1000          |
| insert_shuffle_parallelism | Int    | no       | 2             |
| upsert_shuffle_parallelism | Int    | no       | 2             |
| min_commits_to_keep        | Int    | no       | 20            |
| max_commits_to_keep        | Int    | no       | 30            |
| index_type                 | enum   | no       | BLOOM         |
| index_class_name           | string | no       | -             |
| record_byte_size           | Int    | no       | 1024          |

Note: When writing to a single table, you can flatten the configuration items in `table_list` into the outer layer.

### table_name [string]

`table_name` The name of hudi table.

### database [string]

`database` The database of hudi table.

### table_dfs_path [string]

`table_dfs_path` The dfs root path of hudi table, such as 'hdfs://nameserivce/data/hudi/hudi_table/'.
`table_dfs_path` The dfs root path of hudi table, such as 'hdfs://nameserivce/data/hudi/'.

### table_type [enum]

`table_type` The type of hudi table. The value is 'copy_on_write' or 'merge_on_read'.
`table_type` The type of hudi table. The value is `COPY_ON_WRITE` or `MERGE_ON_READ`.

### record_key_fields [string]

`record_key_fields` The record key fields of the hudi table; they are used to generate the record key. It must be configured when op_type is `UPSERT`.

### partition_fields [string]

`partition_fields` The partition key fields of the hudi table; they are used to generate partitions.

### index_type [string]

`index_type` The index type of hudi table. Currently, `BLOOM`, `SIMPLE`, and `GLOBAL SIMPLE` are supported.

### index_class_name [string]

`index_class_name` The customized index classpath of hudi table, example `org.apache.seatunnel.connectors.seatunnel.hudi.index.CustomHudiIndex`.

### record_byte_size [Int]

`record_byte_size` The byte size of each record. This value can be used to help calculate the approximate number of records in each hudi data file. Adjusting this value can effectively reduce write amplification of hudi data files.

### conf_files_path [string]

`conf_files_path` The environment conf file path list (local path), which is used to init the hdfs client to read the hudi table file. The example is '/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml'.

### op_type [enum]

`op_type` The operation type of hudi table. The value is 'insert' or 'upsert' or 'bulk_insert'.
`op_type` The operation type of hudi table. The value is `insert` or `upsert` or `bulk_insert`.

### batch_interval_ms [Int]

`batch_interval_ms` The time interval between batch writes to the hudi table.

### batch_size [Int]

`batch_size` The batch size of writes to the hudi table.

### insert_shuffle_parallelism [Int]

`insert_shuffle_parallelism` The parallelism of insert data to hudi table.
Expand All @@ -70,19 +115,35 @@ Used to write data to Hudi.

`max_commits_to_keep` The max commits to keep of hudi table.

### auto_commit [boolean]

`auto_commit` Automatic transaction commit is enabled by default.

### schema_save_mode [Enum]

Before the synchronization task starts, choose how to handle the existing table structure on the target side.
Option introduction:
`RECREATE_SCHEMA`: will create the table when it does not exist, delete and recreate it when it exists
`CREATE_SCHEMA_WHEN_NOT_EXIST`: will create the table when it does not exist, skipped when it exists
`ERROR_WHEN_SCHEMA_NOT_EXIST`: an error will be reported when the table does not exist
`IGNORE`: ignore the handling of the table
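
For example, a sketch of a Hudi sink that recreates the table schema on the target side (paths and table names are illustrative, and all option names come from the tables above):

```hocon
sink {
  Hudi {
    table_dfs_path = "hdfs://nameserivce/data/hudi/"
    conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
    database = "st"
    table_name = "test_table"
    # delete and recreate the Hudi table structure if it already exists
    schema_save_mode = RECREATE_SCHEMA
  }
}
```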

### common options

Sink plugin common parameters, please refer to [Sink Common Options](../sink-common-options.md) for details.

## Examples

### Single table
```hocon
sink {
Hudi {
table_dfs_path = "hdfs://nameserivce/data/hudi/hudi_table/"
table_dfs_path = "hdfs://nameserivce/data/"
database = "st"
table_name = "test_table"
table_type = "copy_on_write"
table_type = "COPY_ON_WRITE"
conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
batch_size = 10000
use.kerberos = true
kerberos.principal = "test_user@xxx"
kerberos.principal.file = "/home/test/test_user.keytab"
@@ -91,9 +152,6 @@
```

### Multiple table

#### example1

```hocon
env {
parallelism = 1
@@ -116,9 +174,32 @@ transform {
sink {
Hudi {
table_dfs_path = "hdfs://nameserivce/data/"
conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
table_list = [
{
database = "st1"
table_name = "role"
table_type = "COPY_ON_WRITE"
op_type="INSERT"
batch_size = 10000
},
{
database = "st1"
table_name = "user"
table_type = "COPY_ON_WRITE"
op_type="UPSERT"
# when op_type is 'UPSERT', record_key_fields must be configured
record_key_fields = "user_id"
batch_size = 10000
},
{
database = "st1"
table_name = "Bucket"
table_type = "MERGE_ON_READ"
}
]
}
}
```
12 changes: 6 additions & 6 deletions docs/en/start-v2/docker/docker.md
@@ -146,14 +146,14 @@ docker run --rm -it apache/seatunnel bash -c '<YOUR_FLINK_HOME>/bin/start-cluste

There are 2 ways to create a cluster within Docker.

### 1. Use Docker Directly
### Use Docker Directly

1. create a network
#### create a network
```shell
docker network create seatunnel-network
```

2. start the nodes
#### start the nodes
- start master node
```shell
## start master and export 5801 port
@@ -213,7 +213,7 @@ docker run -d --name seatunnel_worker_1 \
```


### 2. Use Docker-compose
### Use Docker-compose

> Docker cluster mode only supports the Zeta engine.
@@ -368,7 +368,7 @@ and run `docker-compose up -d` command, the new worker node will start, and the

### Job Operation on cluster

1. use docker as a client
#### use docker as a client
- submit job :
```shell
docker run --name seatunnel_client \
@@ -393,7 +393,7 @@ For more commands, please refer to [user-command](../../seatunnel-engine/user-command.md)



2. use rest api
#### use rest api

Please refer to [Submit A Job](../../seatunnel-engine/rest-api.md#submit-a-job)

2 changes: 1 addition & 1 deletion docs/en/start-v2/locally/deployment.md
@@ -69,7 +69,7 @@ You can download the source code from the [download page](https://seatunnel.apac

```shell
cd seatunnel
sh ./mvnw clean install -DskipTests -Dskip.spotless=true
# get the binary package
cp seatunnel-dist/target/apache-seatunnel-2.3.8-bin.tar.gz /The-Path-You-Want-To-Copy

9 changes: 7 additions & 2 deletions docs/en/transform-v2/llm.md
@@ -11,11 +11,12 @@ more.
## Options

| name | type | required | default value |
|------------------------| ------ | -------- |---------------|
| model_provider | enum | yes | |
| output_data_type | enum | no | String |
| output_column_name | string | no | llm_output |
| prompt | string | yes | |
| inference_columns | list | no | |
| model | string | yes | |
| api_key | string | yes | |
| api_path | string | no | |
@@ -35,6 +36,10 @@ The data type of the output data. The available options are:
STRING,INT,BIGINT,DOUBLE,BOOLEAN.
Default value is STRING.

### output_column_name

Customize the output data field name. If the custom field name is the same as an existing field name, it is replaced with `llm_output`.
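
As a rough sketch of how these options combine in a transform block (the provider, model, API key, prompt, and column names below are placeholders for illustration only, not taken from this page):

```hocon
transform {
  LLM {
    # placeholder provider/model/api_key values for illustration only
    model_provider = OPENAI
    model = "gpt-4o-mini"
    api_key = "sk-xxxx"
    output_data_type = "STRING"
    # write the result to a custom column instead of the default llm_output
    output_column_name = "is_adult"
    prompt = "Determine whether the person described by each row is an adult; answer yes or no."
    # only send these columns to the model
    inference_columns = ["name", "age"]
  }
}
```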

### prompt

The prompt to send to the LLM. This parameter defines how the LLM will process and return data, e.g.:

