TiDB vector search doc #18502

Open: wants to merge 75 commits into base: master

Conversation

@EricZequan EricZequan commented Sep 2, 2024

First-time contributors' checklist

What is changed, added or deleted? (Required)

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions (in Chinese).

  • master (the latest development version)
  • v8.4 (TiDB 8.4 versions)
  • v8.3 (TiDB 8.3 versions)
  • v8.2 (TiDB 8.2 versions)
  • v8.1 (TiDB 8.1 versions)
  • v7.5 (TiDB 7.5 versions)
  • v7.1 (TiDB 7.1 versions)
  • v6.5 (TiDB 6.5 versions)
  • v6.1 (TiDB 6.1 versions)
  • v5.4 (TiDB 5.4 versions)
  • v5.3 (TiDB 5.3 versions)

What is the related PR or file link(s)?

  • This PR is translated from:
  • Other reference link(s):

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Need modification after applied to another branch
  • Might cause conflicts after applied to another branch

@ti-chi-bot ti-chi-bot bot added the first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. label Sep 2, 2024
CLAassistant commented Sep 2, 2024

CLA assistant check
All committers have signed the CLA.

@ti-chi-bot ti-chi-bot bot added missing-translation-status This PR does not have translation status info. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 2, 2024
Signed-off-by: “EricZequan” <[email protected]>
@qiancai qiancai self-requested a review September 2, 2024 06:48
@qiancai qiancai added v8.4 This PR/issue applies to TiDB v8.4. translation/doing This PR’s assignee is translating this PR. labels Sep 2, 2024
@ti-chi-bot ti-chi-bot bot removed the missing-translation-status This PR does not have translation status info. label Sep 2, 2024
@EricZequan
Author

/cc @breezewish

@ti-chi-bot ti-chi-bot bot requested a review from breezewish September 2, 2024 07:50
@qiancai qiancai self-assigned this Sep 2, 2024
Signed-off-by: “EricZequan” <[email protected]>

ti-chi-bot bot commented Sep 2, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from qiancai, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: “EricZequan” <[email protected]>
Member

@breezewish breezewish left a comment

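This was somewhat tiring. I left a comment on nearly every line so that the text reads more naturally. Please apply the same kind of changes to the other documents yourself so that they become readable. Pay particular attention to word order: Chinese and English order sentences differently, so you need to adjust the word order of each line and, where necessary, add details so that the content reads smoothly, follows Chinese grammar, and is easy to understand. For the English documents, not every sentence will necessarily come out clearly given everyone's level of English, and that is not required; conveying the meaning is enough. The Chinese documents, however, are written in our native language and should be the most complete, the easiest to understand, and the least ambiguous, so expand on and clarify details where necessary.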

vector-search-data-types.md: 10 outdated review threads (resolved)
Signed-off-by: “EricZequan” <[email protected]>
Signed-off-by: “EricZequan” <[email protected]>
vector-search-index.md: 2 outdated review threads (resolved)
EricZequan and others added 3 commits September 25, 2024 09:52
Signed-off-by: “EricZequan” <[email protected]>
Signed-off-by: “EricZequan” <[email protected]>
Comment on lines +122 to +124
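## Upgrade from v7.x to v8.4 or later

Starting from v8.4, the underlying storage format of TiFlash has changed to support the [vector search feature](/vector-search-index.md). Therefore, after you upgrade TiFlash to v8.4 or later, in-place downgrade to an earlier version is not supported.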
Contributor

FYI, I've pushed a commit about the tiflash upgrade notice

Contributor

@JaySon-Huang JaySon-Huang left a comment

@tangenta PTAL about the docs about DM limitation

Comment on lines +63 to +65
+ Vector data type replication

- DM does not support replicating the MySQL 9.0 vector data type.
Contributor

Add limitation about DM

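- For the limitations on creating a [vector search index](/vector-search-index.md), see [Vector search index - Restrictions](/vector-search-index.md#使用限制).
- The vector data type does not support storing double-precision floating-point numbers (support is planned for a future release). When you insert or store double-precision floating-point numbers in a vector column, TiDB automatically converts them to single-precision floating-point numbers.
- Make sure to use BR v8.4.0 or later for backup and restore. Restoring data that contains vector data types to a TiDB cluster earlier than v8.4.0 is not supported.
- DM does not support replicating the MySQL 9.0 vector data type.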
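
To make the double-to-single precision note above concrete, here is a minimal Python sketch of what rounding a double-precision value to single precision does to the stored number. It only illustrates IEEE 754 rounding with the standard `struct` module; the helper name `to_float32` is hypothetical and this is not TiDB's conversion code.

```python
import struct


def to_float32(value: float) -> float:
    """Round a double-precision Python float to the nearest single-precision value."""
    # Packing as a 4-byte IEEE 754 float and unpacking again performs the rounding.
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Sample inputs chosen for illustration only.
doubles = [0.1, 1.5, 3.141592653589793]
singles = [to_float32(x) for x in doubles]

for original, stored in zip(doubles, singles):
    # 1.5 is exactly representable in single precision; 0.1 and pi are not.
    print(f"{original!r} -> {stored!r} (changed: {original != stored})")
```

Values that are not exactly representable in single precision come back slightly different, so an application that compares stored vectors with its original double-precision inputs should compare with a tolerance rather than for equality.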
Contributor

ditto

Contributor

@JaySon-Huang JaySon-Huang left a comment

@YuJuncen PTAL about docs of BR limitation

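@@ -120,6 +120,7 @@ TiDB supports backing up data to Amazon S3, Google Cloud Storage (GCS), Azure Blo
| Global temporary tables | | Make sure to use BR v5.3.0 or later for backup and restore. Otherwise, the definitions of global temporary tables will be incorrect. |
| TiDB Lightning physical import mode | | Data imported into the upstream database through TiDB Lightning physical import mode cannot be backed up by log backup. It is recommended to perform a full backup after the import. For details, see [Restoring data imported into the upstream database through TiDB Lightning physical import mode](/faq/backup-and-restore-faq.md#上游数据库使用-tidb-lightning-物理导入模式导入数据时为什么无法使用日志备份功能). |
| TiCDC | | BR v8.2.0 and later: if the target cluster of the restore has a changefeed whose [CheckpointTS](/ticdc/ticdc-architecture.md#checkpointts) is earlier than the BackupTS, BR refuses to perform the restore. BR versions earlier than v8.2.0: if the target cluster of the restore has any active TiCDC changefeed, BR refuses to perform the restore. |
| Vector search | | Make sure to use BR v8.4.0 or later for backup and restore. Restoring data that contains vector data types to a TiDB cluster earlier than v8.4.0 is not supported. |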
Contributor

Add limitation about BR

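- Vector data does not support the floating-point values `NaN`, `Infinity`, and `-Infinity`.
- For the limitations on creating a [vector search index](/vector-search-index.md), see [Vector search index - Restrictions](/vector-search-index.md#使用限制).
- The vector data type does not support storing double-precision floating-point numbers (support is planned for a future release). When you insert or store double-precision floating-point numbers in a vector column, TiDB automatically converts them to single-precision floating-point numbers.
- Make sure to use BR v8.4.0 or later for backup and restore. Restoring data that contains vector data types to a TiDB cluster earlier than v8.4.0 is not supported.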
Contributor

ditto

vector-search-index.md: 1 outdated review thread (resolved)
Signed-off-by: “EricZequan” <[email protected]>
Copy link

ti-chi-bot bot commented Sep 30, 2024

@EricZequan: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-verify | 2ba7811 | link | true | `/test pull-verify` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


```sql
ALTER TABLE foo ADD VECTOR INDEX idx_name ((VEC_COSINE_DISTANCE(data))) USING HNSW;
```

Collaborator

I suggest deleting this note. It is easy to forget to update it here when the feature becomes GA.


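For more information, see [`ALTER TABLE ... COMPACT`](/sql-statements/sql-statement-alter-table-compact.md).

In addition, you can run `ADMIN SHOW DDL JOBS;` to check the progress of the DDL job and observe its `row count`. However, this method is not accurate: the `row count` value is taken from `rows_stable_indexed` in `TIFLASH_INDEXES`. This method can also serve as one reference for checking the index build progress.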
Collaborator

Suggested change
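In addition, you can run `ADMIN SHOW DDL JOBS;` to check the progress of the DDL job and observe its `row count`. However, this method is not accurate: the `row count` value is taken from `rows_stable_indexed` in `TIFLASH_INDEXES`. This method can also serve as one reference for checking the index build progress
In addition, you can run `ADMIN SHOW DDL JOBS;` to check the progress of the DDL job and observe its `row count`. However, this method is not accurate: the `row count` value is taken from `rows_stable_indexed` in `TIFLASH_INDEXES`. You can also use this method to check the index build progress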

Comment on lines +156 to +159
- `CAST(... AS VECTOR)`: converts a string to a vector
- `CAST(... AS CHAR)`: converts a vector to a string
- `VEC_FROM_TEXT`: converts a string to a vector
- `VEC_AS_TEXT`: converts a vector to a string
Collaborator

Suggested change
- `CAST(... AS VECTOR)`: converts a string to a vector
- `CAST(... AS CHAR)`: converts a vector to a string
- `VEC_FROM_TEXT`: converts a string to a vector
- `VEC_AS_TEXT`: converts a vector to a string
- `CAST(... AS VECTOR)` converts a string to a vector
- `CAST(... AS CHAR)` converts a vector to a string
- `VEC_FROM_TEXT` converts a string to a vector
- `VEC_AS_TEXT` converts a vector to a string

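- Creating multiple vector search indexes that use the same distance function on the same column is not supported.
- Dropping a column that has a vector search index is not supported, nor is creating multiple indexes in the same SQL statement.
- Modifying the type of a column that has a vector index is not supported (a lossy change, that is, one that modifies the column data).
- [Setting a vector search index to invisible](/sql-statements/sql-statement-alter-index.md) is not supported.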
Contributor

Suggested change
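- [Setting a vector search index to invisible](/sql-statements/sql-statement-alter-index.md) is not supported
- [Setting a vector search index to invisible](/sql-statements/sql-statement-alter-index.md) is not supported (support is planned for a future release)
- Adding a vector index to a partitioned table is not supported (support is planned for a future release).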

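> Vector search is currently an experimental feature and is not recommended for production environments. This feature might be changed or removed without prior notice. If you find a bug, report it by filing an [issue](https://github.com/pingcap/tidb/issues) on GitHub.

- A vector supports up to 16383 dimensions.
- Vector data does not support the floating-point values `NaN`, `Infinity`, and `-Infinity`.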
Contributor

Suggested change
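- Vector data does not support the floating-point values `NaN`, `Infinity`, and `-Infinity`.
- Vector data does not support the floating-point values `NaN`, `Infinity`, and `-Infinity`.
- A vector column cannot be used as a primary key or as part of a primary key
- A vector column cannot be used as a unique index or as part of a unique index
- A vector column cannot be used as a partition key or as part of a partition key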
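
The dimension and floating-point restrictions quoted in this thread can be checked on the client side before data is sent to the database. The following is a minimal Python sketch under that assumption. The function name `validate_vector` and the constant `MAX_DIMS` are hypothetical, and the check only mirrors the two documented limits (at most 16383 dimensions, no `NaN` or infinite values); it does not reproduce TiDB's own validation or error messages.

```python
import math

# Upper bound on vector dimensions, taken from the limitation quoted above.
MAX_DIMS = 16383


def validate_vector(values):
    """Return the values as a list, or raise ValueError if a documented limit is violated."""
    values = list(values)
    if len(values) > MAX_DIMS:
        raise ValueError(f"vector has {len(values)} dimensions, limit is {MAX_DIMS}")
    for index, value in enumerate(values):
        # NaN, Infinity, and -Infinity are the unsupported floating-point values.
        if math.isnan(value) or math.isinf(value):
            raise ValueError(f"unsupported value {value!r} at position {index}")
    return values


print(validate_vector([0.25, -1.0, 3.5]))
```

Rejecting bad input early keeps the failure close to the code that produced the vector, which is usually easier to debug than an error returned by an `INSERT` statement.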

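- A vector search index can only be created on a single vector column and cannot be combined with other columns (such as integer or string columns) to form a composite index.
- You must specify a distance function when creating and using a vector search index. Currently, only the cosine distance function `VEC_COSINE_DISTANCE()` and the L2 distance function `VEC_L2_DISTANCE()` are supported.
- Creating multiple vector search indexes that use the same distance function on the same column is not supported.
- Dropping a column that has a vector search index is not supported, nor is creating multiple indexes in the same SQL statement.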
Contributor

Suggested change
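- Dropping a column that has a vector search index is not supported, nor is creating multiple indexes in the same SQL statement
- Directly dropping a column that has a vector search index is not supported. You can drop the vector search index on the column first and then drop the column
- Creating multiple indexes in the same SQL statement is not supported.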
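
The limitations quoted in this thread name two distance functions, `VEC_COSINE_DISTANCE()` and `VEC_L2_DISTANCE()`. As a rough illustration of what these two metrics compute, here is a Python sketch that uses the standard mathematical definitions: Euclidean distance for L2, and one minus cosine similarity for cosine distance. The function names are hypothetical, raising an error for a zero vector is an assumption of this sketch, and none of it is taken from TiDB's implementation.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Cosine distance, defined here as 1 minus the cosine similarity."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption of this sketch: treat a zero vector as an error.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


print(l2_distance([0.0, 0.0], [3.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

The two metrics can rank the same neighbors differently: cosine distance ignores vector length, while L2 distance does not. The metric therefore has to be chosen when the index is created and used again in queries.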

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. translation/doing This PR’s assignee is translating this PR. v8.4 This PR/issue applies to TiDB v8.4.
10 participants