Merge branch 'main' into feat/pandas-support
yangwucheng committed Apr 16, 2024
2 parents e55449a + d33e2c3 commit 0b69d9f
Showing 36 changed files with 1,069 additions and 223 deletions.
65 changes: 39 additions & 26 deletions docs/en/deploy/conf.md
@@ -24,11 +24,11 @@
#--request_max_retry=3
# Configure the request timeout in milliseconds, the default is 12 seconds
#--request_timeout_ms=12000
# Configure the retry interval when the request is unreachable, generally do not need to be modified, in milliseconds
# Configure the retry interval when the request is unreachable, generally does not need to be modified, in milliseconds
#--request_sleep_time=1000
# Configure the zookeeper session timeout in milliseconds
--zk_session_timeout=10000
# Configure the zookeeper health check interval, the unit is milliseconds, generally do not need to be modified
# Configure the zookeeper health check interval, the unit is milliseconds, generally does not need to be modified
#--zk_keep_alive_check_interval=15000
# Configure the timeout period for tablet heartbeat detection in milliseconds, the default is 1 minute. If the tablet is still unreachable after this time, the nameserver considers the tablet unavailable and will take the node offline
--tablet_heartbeat_timeout=60000
@@ -265,7 +265,6 @@ spark.default.conf=
spark.eventLog.dir=
spark.yarn.maxAppAttempts=1
batchjob.jar.path=
namenode.uri=
offline.data.prefix=file:///tmp/openmldb_offline_storage/
hadoop.conf.dir=
#enable.hive.support=false
@@ -275,6 +274,7 @@ hadoop.conf.dir=

Some of the important configurations for Spark are as follows:

<a id="about-config-env"></a>
```{note}
Understand the relationships between configurations and environment variables.
@@ -295,47 +295,60 @@ With one-click deployment, SPARK_HOME will be set as `<package_home>/spark`. For

`spark.master` configures Spark modes, more information can be found at [Spark Master URL](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls).



TaskManager only allows `local` and its variants, `yarn`, `yarn-cluster`, and `yarn-client` modes. The default mode is `local[*]`, which is multi-threaded local mode (the thread count equals the number of CPU cores). The Spark Standalone `spark://`, Mesos `mesos://`, and Kubernetes `k8s://` cluster modes are currently not supported.

##### `local` Mode

Spark tasks are executed locally on TaskManager deployment machine. Please note the following:
- `offline.data.prefix` is set by default as `file:///tmp/openmldb_offline_storage/`, which is on TaskManager deployment machine. This can be set to other locations as required.
- **Before starting TaskManager**, HDFS path can be configured by setting environment variable `HADOOP_CONF_DIR` to Hadoop configuration directory (Note: it is the environment variable, not the configuration item). The directory needs to include `core-site.xml`, `hdfs-site.xml` configuration files. For more information, refer to [Spark documentation](https://spark.apache.org/docs/3.2.1/configuration.html#inheriting-hadoop-cluster-configuration).
The local mode means that the Spark task runs on the local machine (where the TaskManager is located). In this mode, not many configurations are required, but two points should be noted:
- The storage location of offline tables `offline.data.prefix` is set to `file:///tmp/openmldb_offline_storage/` by default, which refers to the `/tmp` directory on the TaskManager's machine. If the TaskManager is moved to another machine, the data cannot be automatically migrated. It is not recommended to use `file://` when deploying multiple TaskManagers on different machines. You can configure it as an HDFS path, and you need to configure the variables `hadoop.conf.dir` and `hadoop.user.name`. For more details, see [Hadoop-related configurations](#hadoop-related-configurations).

```{note}
Currently, `namenode.uri` needs to be configured. When deleting an offline table, HDFS FileSystem `namenode.uri` will be connected, and offline table path will be deleted. This item will be discarded in future updates.
```
- batchjob path `batchjob.jar.path` can be set automativally. It can be configured to other paths.
- The path of the batchjob `batchjob.jar.path` can be automatically obtained and does not need to be configured. If you want to use a batchjob from elsewhere, you can configure this parameter.

```{seealso}
if Hadoop/Yarm requires Kerberos authentication, refer to [FAQ](../faq.md).
If Hadoop/Yarn requires Kerberos authentication, refer to the [Client FAQ](../faq/client_faq.md).
```
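
As a hedged illustration, a local-mode `taskmanager.properties` fragment that keeps Spark local but stores offline tables on HDFS might look like the sketch below; the HDFS address, configuration directory, and user name are placeholders, not values taken from this repository:

```
# Spark runs locally on the TaskManager machine
spark.master=local[*]
# store offline tables on HDFS instead of the local /tmp directory (placeholder address)
offline.data.prefix=hdfs://namenode:9000/openmldb_offline_storage/
# directory on the TaskManager node containing core-site.xml and hdfs-site.xml (placeholder path)
hadoop.conf.dir=/etc/hadoop/conf
# Hadoop user used to read and write the offline data (placeholder)
hadoop.user.name=openmldb
```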

##### `yarn/yarn-cluster` Mode
"yarn" and "yarn-cluster" are the same mode, where Spark tasks run on a Yarn cluster. This mode requires several configurations, including:

- The yarn mode must connect to a Hadoop cluster and requires the proper configuration of Hadoop variables `hadoop.conf.dir` and `hadoop.user.name`. For more details, refer to [Hadoop-related configurations](#hadoop-related-configurations).

`yarn` and `yarn-cluster` is the same mode, where Spark tasks execute on the Yarn cluster. The configuration items are:
The following configurations usually require an HDFS that belongs to the same Hadoop cluster as Yarn, unless a direct `hdfs://` address can be used.

- **Before starting TaskManager**, configure environment variable `HADOOP_CONF_DIR` to Hadoop and Yarn configuration directory. The directory should include `core-site.xml` and `hdfs-site.xml` for hadoop, and `yarn-site.xml` for Yarn. For more details, refer to [Spark documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn).
- The `spark.yarn.jars` configuration specifies the location of Spark runtime JAR files that Yarn needs to read. It must be an `hdfs://` address. You can upload the `jars` directory from the [OpenMLDB Spark distribution](../../tutorial/openmldbspark_distribution.md) to HDFS and configure it as `hdfs://<hdfs_path>/jars/*` (note the wildcard). [If this parameter is not configured, Yarn will package and distribute `$SPARK_HOME/jars` for each offline task, which is inefficient](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#preparations). Therefore, it is recommended to configure this parameter.
- `batchjob.jar.path` must be an HDFS path (specific to the package name). Upload the batch job JAR file to HDFS and configure it with the corresponding address to ensure that all workers in the Yarn cluster can access the batch job package.
- `offline.data.prefix` must be an HDFS path to ensure that all workers in the Yarn cluster can read and write data.

- `spark.yarn.jars` Configure Spark jar for Yarn. It has to be a `hdfs://` path. You can update the `jars` directory from [OpenMLDB Spark distribution](../../tutorial/openmldbspark_distribution.md) to HDFS, and set it as `hdfs://<hdfs_path>/jars/*`. If not set, Yarn will package and distribute `$SPARK_HOME/jars`, and will do so [for each offline task](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#preparations). Therefore, we recommend it to be set.
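
For reference, a hedged sketch of the corresponding `yarn` / `yarn-cluster` settings in `taskmanager.properties` is shown below; every `hdfs://` path and the jar file name are placeholders that must be replaced with locations in your own cluster:

```
spark.master=yarn
# Spark runtime jars uploaded to HDFS in advance (note the trailing wildcard)
spark.yarn.jars=hdfs://namenode:9000/openmldb/spark-jars/*
# batch job jar uploaded to HDFS so that every Yarn worker can read it (placeholder name)
batchjob.jar.path=hdfs://namenode:9000/openmldb/openmldb-batchjob.jar
# offline table storage must be readable and writable by all Yarn workers
offline.data.prefix=hdfs://namenode:9000/openmldb_offline_storage/
# Hadoop/Yarn client configuration on the TaskManager node and the Hadoop user
hadoop.conf.dir=/etc/hadoop/conf
hadoop.user.name=openmldb
```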
##### `yarn-client` Mode

- `batchjob.jar.path` It has to be a HDFS path. Upload batchjob jar to HDFS and configure the respective path. Make sure that all Workers in Yarn cluster have access to the batchjob jar.
[Driver executes locally](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn), and the executor executes on the Yarn cluster. Configurations are the same as `yarn-cluster`.

- `offline.data.prefix` It has to be a HDFS path. Make sure that all Workers in Yarn cluster have access. Use the environment variable `HADOOP_CONF_DIR`.
#### spark.default.conf

`spark.default.conf` configures Spark parameters in the format of `key=value`. Multiple configurations are separated by `;`, for example:

##### `yarn-client` Mode
#### Hadoop-related configurations

[Driver executes locally](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn),and executor execute on the Yarn cluster. Configurations are the same as `yarn-cluster`.
`hadoop.conf.dir` and `hadoop.user.name` are configurations for TaskManager. They will be passed to the Spark Job when TaskManager submits the job, which is equivalent to configuring the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` before creating the Spark Job.

#### `spark.default.conf`
Details of the configurations:

Format is `key=value`, use `;` to separate. For example:
- `hadoop.conf.dir` represents the directory where Hadoop and Yarn configuration files are located (note that this directory is on the TaskManager node; the file directory should include Hadoop's `core-site.xml`, `hdfs-site.xml`, `yarn-site.xml`, and other configuration files, refer to the [Spark official documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn)).
- `hadoop.user.name` represents the Hadoop user.

```
spark.default.conf=spark.executor.instances=2;spark.executor.memory=2g;spark.executor.cores=2
```
Same affect as `--conf`. For more configurations, refer to [Spark documentation](https://spark.apache.org/docs/3.1.2/configuration.html).
Essentially, it configures environment variables, and the scope of their effect is explained in <a href="#about-config-env">Understanding the Relationship Between Configurations and Environment Variables</a>. If there are special requirements, it is possible to bypass the configuration in TaskManager and configure the environment variables in other ways. However, it is recommended not to mix the two methods and use only one method for easier debugging.

Please note that unspecified variables cannot be passed in sbin deployment. Currently, TaskManager only receives the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. Therefore, if you are using sbin deployment, it is recommended to use the TaskManager configuration file.

Other configuration methods:
- Copy Hadoop and Yarn configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) to the `{spark.home}/conf` directory.

- If there are existing environment variables on the TaskManager node, or **before manually starting TaskManager**, configure the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME`.
> Similar to the following steps:
> ```bash
> cd <openmldb installation directory>
> export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
> export HADOOP_USER_NAME=<replace with the Hadoop username>
> bash bin/start.sh start taskmanager
> ```
> Note that SSH remote startup may lose environment variables, so it is recommended to export them correctly before starting.
35 changes: 20 additions & 15 deletions docs/en/integration/offline_data_sources/tidb.md
@@ -8,7 +8,7 @@

### Installation

[OpenMLDB Spark Distribution](../../tutorial/openmldbspark_distribution.md) v0.8.5 and later versions utilize the TiSpark tool to interact with TiDB. The current release includes TiSpark 3.1.x dependencies (`tispark-assembly-3.2_2.12-3.1.5.jar`, `mysql-connector-java-8.0.29.jar`). If your TiSpark version doesn't match your TiDB version, refer to the [TiSpark documentation](https://docs.pingcap.com/tidb/stable/tispark-overview) for compatible dependencies to add to Spark's classpath/jars.
The current version utilizes TiSpark for interacting with the TiDB database. To get started, download the necessary dependencies for TiSpark 3.1.x (`tispark-assembly-3.2_2.12-3.1.5.jar` and `mysql-connector-java-8.0.29.jar`). If the TiSpark version is not compatible with your current TiDB version, refer to the [TiSpark documentation](https://docs.pingcap.com/tidb/stable/tispark-overview) for downloading the corresponding TiSpark dependencies. Then, add them to the Spark classpath/jars.


### Configuration
@@ -32,25 +32,29 @@ Once either configuration is successful, access TiDB tables using the format `ti

TiDB schema reference can be found at [TiDB Schema](https://docs.pingcap.com/tidb/stable/data-type-overview). Currently, only the following TiDB data formats are supported:

| OpenMLDB Data Format | TiDB Data Format |
|----------------------|-------------------------|
| BOOL | BOOL |
| SMALLINT | Currently not supported |
| INT | Currently not supported |
| BIGINT | BIGINT |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| DATE | DATE |
| TIMESTAMP | TIMESTAMP |
| STRING | VARCHAR(M) |
| OpenMLDB Data Format | TiDB Data Format |
|----------------------|------------------|
| BOOL | BOOL |
| SMALLINT | SMALLINT |
| INT | INT |
| BIGINT | BIGINT |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| DATE | DATE |
| TIMESTAMP | DATETIME |
| TIMESTAMP | TIMESTAMP |
| STRING | VARCHAR(M) |

Tip: Conversions between integer types of different widths are constrained by their value ranges. Please follow the type mappings listed above where possible.

## Importing TiDB Data into OpenMLDB

Importing data from TiDB sources is supported through the [`LOAD DATA INFILE`](../../openmldb_sql/dml/LOAD_DATA_STATEMENT.md) API, using the specific URI interface format `tidb://tidb_catalog.[db].[table]` to import data from TiDB. Note:

- Both offline and online engines can import TiDB data sources.
- TiDB import supports symbolic links, which can reduce hard copying and ensure that OpenMLDB always reads the latest data from TiDB. To enable soft link data import, use the parameter `deep_copy=false`.
- The `OPTIONS` parameter only supports `deep_copy`, `mode`, and `sql`.
- TiDB supports the parameter `skip_cvt` in `@@execute_mode='online'` mode, which controls whether field type conversion is skipped. The default is `false`, meaning field type conversion and strict schema checking are performed. If set to `true`, no conversion or schema checking is done and performance is better, but errors such as type overflow may occur and require manual inspection.
- The `OPTIONS` parameter only supports `deep_copy`, `mode`, `sql`, and `skip_cvt`.

For example:

@@ -70,11 +74,12 @@ LOAD DATA INFILE 'tidb://tidb_catalog.db1.t1' INTO TABLE tidb_catalog.db1.t1 OPT
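
As an additional hedged sketch (the database, table, and option values here are illustrative, not from this commit), an online-mode import that skips type conversion for better performance could be written as:

```sql
-- run in the online engine; skip_cvt=true disables field type conversion and strict schema checking
SET @@execute_mode='online';
LOAD DATA INFILE 'tidb://tidb_catalog.db1.t1' INTO TABLE demo_db.t1
OPTIONS(mode='append', skip_cvt=true);
```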

Exporting data from OpenMLDB to TiDB sources is supported through the [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md) API, using the specific URI interface format `tidb://tidb_catalog.[db].[table]` to export data to the TiDB data warehouse. Note:

- The offline engine can support exporting TiDB data sources, but the online engine does not yet support it.
- The database and table must already exist. Currently, automatic creation of non-existent databases or tables is not supported.
- Only the export mode `mode` is effective in the `OPTIONS` parameter. Other parameters are not effective, and the current parameter is mandatory.
- In the `OPTIONS` parameter, only `mode='append'` is valid. Other modes such as `overwrite` and `errorifexists` are invalid because the current version of TiSpark does not support them. If TiSpark supports them in future versions, you can upgrade for compatibility.

For example:

```sql
SELECT col1, col2, col3 FROM t1 INTO OUTFILE 'tidb://tidb_catalog.db1.t1' options(mode='append');
```
8 changes: 4 additions & 4 deletions docs/en/openmldb_sql/data_types/date_and_time_types.md
@@ -4,10 +4,10 @@ OpenMLDB supports date type `DATE` and timestamp `TIMESTAMP`.

Each time type has a valid range of values and a NULL value. The NULL value is used when specifying an invalid value that cannot be represented.

| Type | Size (bytes) | Scope | Format | Use |
| :-------- | :----------- | :----------------------------------------------------------- | :-------------- | :----------------------- |
| DATE | 4 | 1900-01-01 ~ | YYYY-MM-DD | Date Value |
| TIMESTAMP | 8 | ~ INT64_MAX | online: int64, offline `LOAD DATA`: int64 or 'yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]' | Mixed Date and Time Value, Timestamp |
| Type | Size (bytes) | Scope | Format | Use |
| :-------- | :----------- | :----------------------------------------------------------- |:-------------------------------------------------------------------------------------------| :----------------------- |
| DATE | 4 | 1900-01-01 ~ | YYYY-MM-DD | Date Value |
| TIMESTAMP | 8 | ~ INT64_MAX | online: int64, offline(`LOAD DATA`, `INSERT`): int64 or 'yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]' | Mixed Date and Time Value, Timestamp |
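
A hedged illustration of the two accepted offline representations (the table and column names are made up):

```sql
-- in offline execute mode, a TIMESTAMP value may be given as an int64 in milliseconds ...
INSERT INTO t1 (c1, c_ts) VALUES (1, 1651123200000);
-- ... or as a string in the 'yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]' pattern
INSERT INTO t1 (c1, c_ts) VALUES (2, '2022-04-28T12:00:00.000+08:00');
```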

## Time Zone Handling

2 changes: 1 addition & 1 deletion docs/en/openmldb_sql/ddl/CREATE_INDEX_STATEMENT.md
@@ -1,6 +1,6 @@
# CREATE INDEX

The `CREATE INDEX` statement is used to create a new index on existing table. Running `CREATE INDEX` will initiates an asynchronous job, and you can check the status of the job by executing `SHOW JOBS FROM NAMESERVER`.
The `CREATE INDEX` statement is used to create a new index on an existing table. Running `CREATE INDEX` initiates an asynchronous job, and you can check the status of the job by executing `SHOW JOBS FROM NAMESERVER`. Please note that the index is not available until the asynchronous task is completed, and any scenarios that require the new index will fail.
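
A hedged sketch of this workflow (the table, column, and option values are illustrative):

```sql
-- creating the index returns immediately and starts an asynchronous job
CREATE INDEX index1 ON t1 (col1) OPTIONS (ts=std_ts, ttl_type=absolute, ttl=30d);
-- check the job status; wait until the job is finished before running queries that rely on the new index
SHOW JOBS FROM NAMESERVER;
```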

## Syntax

6 changes: 4 additions & 2 deletions docs/en/openmldb_sql/dml/INSERT_STATEMENT.md
@@ -5,7 +5,7 @@ OpenMLDB supports single-row and multi-row insert statements.
## Syntax

```
INSERT INFO tbl_name (column_list) VALUES (value_list) [, value_list ...]
INSERT [[OR] IGNORE] INTO tbl_name (column_list) VALUES (value_list) [, value_list ...]
column_list:
col_name [, col_name] ...
@@ -15,7 +15,9 @@ value_list:
```

**Description**
- `INSERT` statement only works in online execute mode
- By default, `INSERT` does not deduplicate records, whereas `INSERT OR IGNORE` allows ignoring data that already exists in the table, making it suitable for repeated attempts.
- Offline execute mode only supports `INSERT`, not `INSERT OR IGNORE`.
- The `INSERT` statement in offline execute mode is not supported on tables with a symbolic path. In OpenMLDB, tables have two types of offline data addresses: Data path and Symbolic path, as detailed in [Offline Import Rules](./LOAD_DATA_STATEMENT.md#offline-import-rules). In offline execute mode, `INSERT` writes data to the Data path in Parquet format. Since the data format of the Symbolic path can be set freely, inserted data may conflict with that format if the table has a symbolic path. Therefore, the `INSERT` statement in offline execute mode is currently unsupported on tables with symbolic paths.
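
A hedged sketch of the difference (the table, columns, and values are made up):

```sql
-- a plain INSERT does not deduplicate, so repeated runs insert repeated records
INSERT INTO t1 (col1, col2) VALUES (1, 'a');
-- INSERT OR IGNORE skips rows that already exist in the table, so it is safe to retry
INSERT OR IGNORE INTO t1 (col1, col2) VALUES (1, 'a');
```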

## Examples

8 changes: 4 additions & 4 deletions docs/poetry.lock

