Merge pull request #10 from apache/dev
merge latest code
nianliuu authored Sep 20, 2024
2 parents 76193c7 + 4f5d27f commit 0dbac11
Showing 77 changed files with 4,603 additions and 741 deletions.
22 changes: 14 additions & 8 deletions docs/en/concept/JobEnvConfig.md
@@ -21,14 +21,26 @@ You can configure whether the task is in batch or stream mode through `job.mode`

### checkpoint.interval

Gets the interval in which checkpoints are periodically scheduled.
Gets the interval (milliseconds) in which checkpoints are periodically scheduled.

In `STREAMING` mode, checkpoints are required; if you do not set this, the value will be obtained from the application configuration file `seatunnel.yaml`. In `BATCH` mode, you can disable checkpoints by not setting this parameter.
In `STREAMING` mode, checkpoints are required; if you do not set this, the value will be obtained from the application configuration file `seatunnel.yaml`. In `BATCH` mode, you can disable checkpoints by not setting this parameter. In Zeta `STREAMING` mode, the default value is 30000 milliseconds.

### checkpoint.timeout

The timeout (in milliseconds) for a checkpoint. If the checkpoint is not completed before the timeout, the job will fail. In Zeta, the default value is 30000 milliseconds.

### parallelism

This parameter configures the parallelism of source and sink.
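
As a quick illustration (the values below are placeholders, not defaults), the options above are set in the job's `env` block:

```hocon
env {
  # run the job in stream mode; use "BATCH" for batch mode
  job.mode = "STREAMING"
  # schedule a checkpoint every 10 seconds (milliseconds)
  checkpoint.interval = 10000
  # fail the checkpoint if it does not complete within 60 seconds
  checkpoint.timeout = 60000
  # parallelism of source and sink
  parallelism = 2
}
```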

### shade.identifier

Specify the method of encryption. If you do not need to encrypt or decrypt config files, this option can be ignored.

For more details, you can refer to the documentation [Config Encryption Decryption](../connector-v2/Config-Encryption-Decryption.md)

## Zeta Engine Parameter

### job.retry.times

Used to control the default number of retries when a job fails. The default value is 3, and it only works in the Zeta engine.
@@ -43,12 +43,6 @@ This parameter is used to specify the location of the savemode when the job is executed.
The default value is `CLUSTER`, which means that the savemode is executed on the cluster. If you want to execute the savemode on the client,
you can set it to `CLIENT`. Please use `CLUSTER` mode as much as possible, because `CLIENT` mode will be removed once `CLUSTER` mode is proven to work without problems.
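
A minimal sketch of the Zeta-specific options above in the `env` block (the `savemode.execute.location` key name is assumed here, since the option heading is not shown in this excerpt; values are illustrative):

```hocon
env {
  job.mode = "BATCH"
  # retry a failed job up to 5 times (Zeta engine only)
  job.retry.times = 5
  # assumed key name for the savemode execution location: CLUSTER (default) or CLIENT
  savemode.execute.location = "CLUSTER"
}
```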

### shade.identifier

Specify the method of encryption. If you do not need to encrypt or decrypt config files, this option can be ignored.

For more details, you can refer to the documentation [Config Encryption Decryption](../connector-v2/Config-Encryption-Decryption.md)

## Flink Engine Parameter

Here are some SeaTunnel parameter names and their corresponding names in Flink (not all of them are listed). Please refer to the official [Flink Documentation](https://flink.apache.org/).
2 changes: 1 addition & 1 deletion docs/en/connector-v2/sink/Assert.md
Expand Up @@ -4,7 +4,7 @@
## Description

A flink sink plugin which can assert illegal data by user defined rules
A sink plugin which can assert illegal data by user defined rules

## Key Features

42 changes: 42 additions & 0 deletions docs/en/connector-v2/sink/FtpFile.md
@@ -64,6 +64,8 @@ By default, we use 2PC commit to ensure `exactly-once`
| parquet_avro_write_timestamp_as_int96 | boolean | no | false | Only used when file_format is parquet. |
| parquet_avro_write_fixed_as_int96 | array | no | - | Only used when file_format is parquet. |
| encoding | string | no | "UTF-8" | Only used when file_format_type is json,text,csv,xml. |
| schema_save_mode | string | no | CREATE_SCHEMA_WHEN_NOT_EXIST | Existing dir processing method |
| data_save_mode | string | no | APPEND_DATA | Existing data processing method |

### host [string]

@@ -227,6 +229,18 @@ Support writing Parquet INT96 from a 12-byte field, only valid for parquet files
Only used when file_format_type is json,text,csv,xml.
The encoding of the file to write. This param will be parsed by `Charset.forName(encoding)`.

### schema_save_mode [string]

Existing dir processing method.

- RECREATE_SCHEMA: will create the dir when it does not exist, delete and recreate it when it exists
- CREATE_SCHEMA_WHEN_NOT_EXIST: will create the dir when it does not exist, skipped when it exists
- ERROR_WHEN_SCHEMA_NOT_EXIST: an error will be reported when the dir does not exist
- IGNORE: ignore the handling of the dir

### data_save_mode [string]

Existing data processing method.

- DROP_DATA: preserve the dir and delete data files
- APPEND_DATA: preserve the dir and preserve data files
- ERROR_WHEN_DATA_EXISTS: an error is reported when data files exist

## Example

For text file format simple config
@@ -273,6 +287,34 @@ FtpFile {

```

When the source has multiple tables and you want different tables written to different directories, you can configure it this way:

```hocon
FtpFile {
host = "xxx.xxx.xxx.xxx"
port = 21
user = "username"
password = "password"
path = "/data/ftp/seatunnel/job1/${table_name}"
tmp_path = "/data/ftp/seatunnel/tmp"
file_format_type = "text"
field_delimiter = "\t"
row_delimiter = "\n"
have_partition = true
partition_by = ["age"]
partition_dir_expression = "${k0}=${v0}"
is_partition_field_write_in_file = true
custom_filename = true
file_name_expression = "${transactionId}_${now}"
sink_columns = ["name","age"]
filename_time_format = "yyyy.MM.dd"
schema_save_mode=RECREATE_SCHEMA
data_save_mode=DROP_DATA
}
```

## Changelog

### 2.2.0-beta 2022-09-26
113 changes: 97 additions & 16 deletions docs/en/connector-v2/sink/Hudi.md
@@ -14,46 +14,91 @@ Used to write data to Hudi.

## Options

Base configuration:

| name | type | required | default value |
|----------------------------|---------|----------|-----------------------------|
| table_dfs_path | string | yes | - |
| conf_files_path | string | no | - |
| table_list | Array | no | - |
| auto_commit | boolean | no | true |
| schema_save_mode | enum | no | CREATE_SCHEMA_WHEN_NOT_EXIST|
| common-options | Config | no | - |

Table list configuration:

| name | type | required | default value |
|----------------------------|--------|----------|---------------|
| table_name                 | string | yes      | -             |
| database                   | string | no       | default       |
| table_type                 | enum   | no       | COPY_ON_WRITE |
| op_type                    | enum   | no       | insert        |
| record_key_fields          | string | no       | -             |
| partition_fields           | string | no       | -             |
| batch_interval_ms          | Int    | no       | 1000          |
| batch_size                 | Int    | no       | 1000          |
| insert_shuffle_parallelism | Int    | no       | 2             |
| upsert_shuffle_parallelism | Int    | no       | 2             |
| min_commits_to_keep        | Int    | no       | 20            |
| max_commits_to_keep        | Int    | no       | 30            |
| index_type                 | enum   | no       | BLOOM         |
| index_class_name           | string | no       | -             |
| record_byte_size           | Int    | no       | 1024          |

Note: When writing to a single table, you can flatten the configuration items in `table_list` into the outer layer.

### table_name [string]

`table_name` The name of hudi table.

### database [string]

`database` The database of hudi table.

### table_dfs_path [string]

`table_dfs_path` The dfs root path of hudi table, such as 'hdfs://nameserivce/data/hudi/hudi_table/'.
`table_dfs_path` The dfs root path of hudi table, such as 'hdfs://nameserivce/data/hudi/'.

### table_type [enum]

`table_type` The type of hudi table. The value is 'copy_on_write' or 'merge_on_read'.
`table_type` The type of hudi table. The value is `COPY_ON_WRITE` or `MERGE_ON_READ`.

### record_key_fields [string]

`record_key_fields` The record key fields of the hudi table; they are used to generate the record key. It must be configured when op_type is `UPSERT`.

### partition_fields [string]

`partition_fields` The partition key fields of the hudi table; they are used to generate partitions.

### index_type [string]

`index_type` The index type of hudi table. Currently, `BLOOM`, `SIMPLE`, and `GLOBAL SIMPLE` are supported.

### index_class_name [string]

`index_class_name` The customized index classpath of hudi table, example `org.apache.seatunnel.connectors.seatunnel.hudi.index.CustomHudiIndex`.

### record_byte_size [Int]

`record_byte_size` The byte size of each record. This value can be used to help calculate the approximate number of records in each hudi data file. Adjusting this value can effectively reduce write amplification of hudi data files.

### conf_files_path [string]

`conf_files_path` The environment conf file path list (local path), which is used to init the hdfs client to read the hudi table file. The example is '/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml'.

### op_type [enum]

`op_type` The operation type of hudi table. The value is 'insert' or 'upsert' or 'bulk_insert'.
`op_type` The operation type of hudi table. The value is `insert` or `upsert` or `bulk_insert`.

### batch_interval_ms [Int]

`batch_interval_ms` The time interval between batch writes to the hudi table.

### batch_size [Int]

`batch_size` The batch size of writes to the hudi table.

### insert_shuffle_parallelism [Int]

`insert_shuffle_parallelism` The parallelism of insert data to hudi table.
Expand All @@ -70,19 +115,35 @@ Used to write data to Hudi.

`max_commits_to_keep` The max commits to keep of hudi table.

### auto_commit [boolean]

`auto_commit` Automatic transaction commit is enabled by default.

### schema_save_mode [Enum]

Before the synchronization task starts, choose how to handle the existing table structure on the target side.
Option introduction:
`RECREATE_SCHEMA`: will create the table when it does not exist, delete and recreate it when it exists
`CREATE_SCHEMA_WHEN_NOT_EXIST`: will create the table when it does not exist, skipped when it exists
`ERROR_WHEN_SCHEMA_NOT_EXIST`: an error will be reported when the table does not exist
`IGNORE`: ignore the handling of the table
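
For example, a sketch of a Hudi sink that recreates the table schema on the target side (paths and table names are illustrative, and all option names come from the tables above):

```hocon
sink {
  Hudi {
    table_dfs_path = "hdfs://nameserivce/data/hudi/"
    conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
    database = "st"
    table_name = "test_table"
    # delete and recreate the Hudi table structure if it already exists
    schema_save_mode = RECREATE_SCHEMA
  }
}
```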

### common options

Sink plugin common parameters, please refer to [Sink Common Options](../sink-common-options.md) for details.

## Examples

### Single table
```hocon
sink {
Hudi {
table_dfs_path = "hdfs://nameserivce/data/hudi/hudi_table/"
table_dfs_path = "hdfs://nameserivce/data/"
database = "st"
table_name = "test_table"
table_type = "copy_on_write"
table_type = "COPY_ON_WRITE"
conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
batch_size = 10000
use.kerberos = true
kerberos.principal = "test_user@xxx"
kerberos.principal.file = "/home/test/test_user.keytab"
@@ -91,9 +152,6 @@
```

### Multiple table

#### example1

```hocon
env {
parallelism = 1
@@ -116,9 +174,32 @@ transform {
sink {
Hudi {
table_dfs_path = "hdfs://nameserivce/data/"
conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
table_list = [
{
database = "st1"
table_name = "role"
table_type = "COPY_ON_WRITE"
op_type="INSERT"
batch_size = 10000
},
{
database = "st1"
table_name = "user"
table_type = "COPY_ON_WRITE"
op_type="UPSERT"
# when op_type is 'UPSERT', record_key_fields must be configured
record_key_fields = "user_id"
batch_size = 10000
},
{
database = "st1"
table_name = "Bucket"
table_type = "MERGE_ON_READ"
}
]
}
}
```
12 changes: 6 additions & 6 deletions docs/en/start-v2/docker/docker.md
@@ -146,14 +146,14 @@ docker run --rm -it apache/seatunnel bash -c '<YOUR_FLINK_HOME>/bin/start-cluste

There are 2 ways to create a cluster within Docker.

### 1. Use Docker Directly
### Use Docker Directly

1. create a network
#### create a network
```shell
docker network create seatunnel-network
```

2. start the nodes
#### start the nodes
- start master node
```shell
## start master and export 5801 port
@@ -213,7 +213,7 @@ docker run -d --name seatunnel_worker_1 \
```


### 2. Use Docker-compose
### Use Docker-compose

> Docker cluster mode only supports the Zeta engine.
@@ -368,7 +368,7 @@ and run `docker-compose up -d` command, the new worker node will start, and the

### Job Operation on cluster

1. use docker as a client
#### use docker as a client
- submit job :
```shell
docker run --name seatunnel_client \
@@ -393,7 +393,7 @@ For more commands, please refer to [user-command](../../seatunnel-engine/user-command.md)



2. use rest api
#### use rest api

Please refer to [Submit A Job](../../seatunnel-engine/rest-api.md#submit-a-job)

2 changes: 1 addition & 1 deletion docs/en/start-v2/locally/deployment.md
@@ -69,7 +69,7 @@ You can download the source code from the [download page](https://seatunnel.apac

```shell
cd seatunnel
sh ./mvnw clean install -DskipTests -Dskip.spotless=true
# get the binary package
cp seatunnel-dist/target/apache-seatunnel-2.3.8-bin.tar.gz /The-Path-You-Want-To-Copy

9 changes: 7 additions & 2 deletions docs/en/transform-v2/llm.md
@@ -11,11 +11,12 @@ more.
## Options

| name | type | required | default value |
|------------------------| ------ | -------- |---------------|
| model_provider | enum | yes | |
| output_data_type | enum | no | String |
| output_column_name | string | no | llm_output |
| prompt | string | yes | |
| inference_columns | list | no | |
| model | string | yes | |
| api_key | string | yes | |
| api_path | string | no | |
@@ -35,6 +36,10 @@ The data type of the output data. The available options are:
STRING,INT,BIGINT,DOUBLE,BOOLEAN.
Default value is STRING.

### output_column_name

Customize the output data field name. If the custom field name is the same as an existing field name, it is replaced with `llm_output`.
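
As a rough sketch of how these options combine in a transform block (the provider, model, API key, prompt, and column names below are placeholders for illustration only, not taken from this page):

```hocon
transform {
  LLM {
    # placeholder provider/model/api_key values for illustration only
    model_provider = OPENAI
    model = "gpt-4o-mini"
    api_key = "sk-xxxx"
    output_data_type = "STRING"
    # write the result to a custom column instead of the default llm_output
    output_column_name = "is_adult"
    prompt = "Determine whether the person described by each row is an adult; answer yes or no."
    # only send these columns to the model
    inference_columns = ["name", "age"]
  }
}
```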

### prompt

The prompt to send to the LLM. This parameter defines how the LLM will process and return data, e.g.:

