TiDB vector search doc #18502

Merged
merged 121 commits into from
Oct 22, 2024
Commits
48f906c
TiDB vector data type and vector index Doc
EricZequan Sep 2, 2024
0b31525
remove vector index part
EricZequan Sep 2, 2024
96f2701
modify cluster type
EricZequan Sep 2, 2024
3fdab1b
fix
EricZequan Sep 2, 2024
98e417b
modify expression
EricZequan Sep 3, 2024
9bd83d6
fix ci
EricZequan Sep 3, 2024
52934cb
fix comment
EricZequan Sep 3, 2024
26027f9
fix ci
EricZequan Sep 3, 2024
df68638
fix ci
EricZequan Sep 3, 2024
febd534
fix comment
EricZequan Sep 4, 2024
69429af
vector-search-overview: refine descriptions
qiancai Sep 4, 2024
d285b55
vector-search-data-types: refine descriptions
qiancai Sep 4, 2024
3852ed3
vector-search-functions-and-operators: refine descriptions
qiancai Sep 4, 2024
93b7e90
add remaining doc
EricZequan Sep 5, 2024
15e18d0
remove rows
EricZequan Sep 5, 2024
402f1d0
fix
EricZequan Sep 5, 2024
f869c42
get started: refine descriptions
qiancai Sep 5, 2024
374b687
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai Sep 5, 2024
ea2b0b1
integrate-with-django-orm: refine descriptions
qiancai Sep 6, 2024
2cf3e79
Apply suggestions from code review
qiancai Sep 6, 2024
0eea4d7
fix comment
EricZequan Sep 6, 2024
93fe602
fix comment
EricZequan Sep 6, 2024
abe5eac
integrate-with-peewee/sqlalchemy: refine descriptions
qiancai Sep 6, 2024
a01ba0e
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai Sep 6, 2024
48f47af
get-started and integrate-with-jinaai-embedding: refine descriptions
qiancai Sep 6, 2024
f9b6dc0
get-started-using-sql: update connection instructions
qiancai Sep 6, 2024
ae94864
Update vector-search-data-types.md
breezewish Sep 6, 2024
f485b58
integrate-with-llamaindex: refine descriptions
qiancai Sep 9, 2024
2e56e20
integrate-with-langchain: refine descriptions
qiancai Sep 9, 2024
bb7ce0b
overview and limitation: refine descriptions
qiancai Sep 9, 2024
d50b638
add vector index doc introduction
EricZequan Sep 10, 2024
5290946
fix comment
EricZequan Sep 10, 2024
eb604f7
modify introduction of self-hosted tidb connection type
EricZequan Sep 10, 2024
65656f8
modify tidb connection when using tidb self-hosted
EricZequan Sep 11, 2024
99f9ab7
fix comment
EricZequan Sep 11, 2024
2c5efa4
fix comment
EricZequan Sep 11, 2024
b2e32b7
fix comment
EricZequan Sep 11, 2024
9a51513
shorten index example case
EricZequan Sep 11, 2024
c09d85f
fix comment
EricZequan Sep 11, 2024
bf063b7
fix comment
EricZequan Sep 12, 2024
9eaf1f7
fix comment
EricZequan Sep 13, 2024
c62f7c9
fix comment
EricZequan Sep 13, 2024
0083ea1
fix comment
EricZequan Sep 13, 2024
6a673b9
index & improve performance: refine descriptions
qiancai Sep 13, 2024
5ebe116
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai Sep 13, 2024
a250dff
add vector index part in other document
EricZequan Sep 14, 2024
94e8ab5
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
EricZequan Sep 14, 2024
634c602
modify index name when create vector index
EricZequan Sep 14, 2024
4b54e6d
Update vector-search-improve-performance.md
EricZequan Sep 14, 2024
5eeb336
refine descriptions for TiDB self-managed connection
qiancai Sep 14, 2024
c6dad29
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai Sep 14, 2024
9a77c65
fix comment
EricZequan Sep 14, 2024
502bd52
vector-index: refine descriptions
qiancai Sep 14, 2024
652b639
remove index part when create table in integration-doc
EricZequan Sep 14, 2024
05784ea
Resolve merge conflicts
EricZequan Sep 14, 2024
cfacbef
fix comment
EricZequan Sep 14, 2024
3a68b70
Merge remote-tracking branch 'upstream/master' into pr/18502
qiancai Sep 14, 2024
eb42ae8
TiDB Serverless -> TiDB Cloud Serverless
qiancai Sep 14, 2024
8d938d4
add the experimental warning
qiancai Sep 18, 2024
f7a31f2
fix comment
EricZequan Sep 19, 2024
894fcd4
fix comment
EricZequan Sep 19, 2024
92e1aee
fix comment
EricZequan Sep 23, 2024
0f91e5a
Apply suggestions from code review
qiancai Sep 24, 2024
39958f3
UI changes: Endpoint Type -> Connection Type
qiancai Sep 24, 2024
f8af8f6
fix comment
EricZequan Sep 24, 2024
12abce8
fix comment
EricZequan Sep 24, 2024
c9ef22f
fix comment
EricZequan Sep 25, 2024
3350c10
remove 'vector64()' sytax
EricZequan Sep 26, 2024
f18d840
Update desc about tiflash upgrade
JaySon-Huang Sep 27, 2024
9cbc09e
Update desc about br support
JaySon-Huang Sep 29, 2024
13bc862
Add limitation about BR restore
JaySon-Huang Sep 29, 2024
7704118
Update desc about limitation
JaySon-Huang Sep 29, 2024
d135903
add limit about cdc
wk989898 Sep 29, 2024
f374485
Update tiflash-configuration
JaySon-Huang Sep 29, 2024
2ba7811
fix comment
EricZequan Sep 30, 2024
c7fedb4
Apply suggestions from code review
EricZequan Oct 8, 2024
28164e7
fix comment
EricZequan Oct 8, 2024
b862a36
Merge remote-tracking branch 'upstream/master' into pr/18502
qiancai Oct 9, 2024
7c4e8bd
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai Oct 9, 2024
95338ee
Update TOC.md
qiancai Oct 9, 2024
1b5fa2f
Apply suggestions from code review
EricZequan Oct 9, 2024
03d3cb5
Apply suggestions from code review
JaySon-Huang Oct 9, 2024
1c39b37
Add limitation about encryption-at-rest
JaySon-Huang Oct 10, 2024
0010207
Update format
lilin90 Oct 10, 2024
9fee36a
Update wording and format
lilin90 Oct 10, 2024
ad6eeb1
Remove description about future features
lilin90 Oct 10, 2024
44995da
Update wording
lilin90 Oct 10, 2024
ead511b
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
lilin90 Oct 10, 2024
7fbbed3
Update wording
lilin90 Oct 10, 2024
f51d860
Using upper case VEC_COSINE_DISTANCE instead
JaySon-Huang Oct 10, 2024
c00b0fc
make "USING HNSW" as default
JaySon-Huang Oct 10, 2024
6d13acc
Apply suggestions from code review
JaySon-Huang Oct 10, 2024
36d03f9
Apply suggestions from code review
EricZequan Oct 12, 2024
e03d678
format udpates
qiancai Oct 14, 2024
30118d0
remove "或删除" from the experimental warning
qiancai Oct 14, 2024
fac90f3
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai Oct 14, 2024
5480123
update
EricZequan Oct 14, 2024
5e0c7d5
Apply suggestions from code review
EricZequan Oct 14, 2024
18a286b
Apply suggestions from code review
EricZequan Oct 14, 2024
97d997e
add "Cast between Vector ⇔ other data types" back
qiancai Oct 15, 2024
a93a7d2
Apply suggestions from code review
EricZequan Oct 15, 2024
dd38923
refine descriptions in vector-search-limitations
qiancai Oct 16, 2024
6243bd0
fix comment
EricZequan Oct 16, 2024
52e696f
vector-search-index: refine new changes
qiancai Oct 16, 2024
bf35116
Apply suggestions from code review
qiancai Oct 16, 2024
6fa2321
fix a broken link
qiancai Oct 16, 2024
a13da5f
fix broken links
qiancai Oct 17, 2024
f87d771
Apply suggestions from code review
EricZequan Oct 17, 2024
960218c
fix index naming
JaySon-Huang Oct 17, 2024
8b3f476
Apply suggestions from code review
EricZequan Oct 17, 2024
7ace7de
Update vector-search-improve-performance.md
EricZequan Oct 17, 2024
a3cd4e8
Apply suggestions from code review
EricZequan Oct 18, 2024
9aa03cd
remove ORM operation
EricZequan Oct 18, 2024
b8a100a
remove part of ORM intro
EricZequan Oct 18, 2024
57aacb9
Apply suggestions from code review
EricZequan Oct 18, 2024
4f3560b
Update vector-search-index.md
EricZequan Oct 18, 2024
d82bb79
fix a broken link
qiancai Oct 21, 2024
c4d049e
add ORM-non-index doc
EricZequan Oct 21, 2024
cc21bc8
Revert "remove part of ORM intro"
EricZequan Oct 21, 2024
b50a16e
fix comment
EricZequan Oct 22, 2024
7125775
Update punctuation
lilin90 Oct 22, 2024
Binary file added media/vector-search/embedding-search.png
243 changes: 243 additions & 0 deletions vector-search-data-types.md
@@ -0,0 +1,243 @@
---
title: Vector Data Types
summary: This document introduces the vector data types in TiDB.
---

# Vector Data Types

The vector data types provided by TiDB are optimized specifically for AI vector embedding use cases. With vector data types, you can efficiently store and query sequences of floating-point numbers, such as `[0.3, 0.5, -0.1, ...]`.

The following vector data types are currently available:

- `VECTOR`: a sequence of single-precision floating-point numbers. The dimension can differ from row to row.
- `VECTOR(D)`: a sequence of single-precision floating-point numbers with a fixed dimension `D`.

Compared with storing vectors in a `JSON` column, vector data types have the following advantages:

- Dimension enforcement. You can specify a dimension so that vectors with a different dimension cannot be inserted.
- Better storage format. Vector data types are stored more efficiently than the `JSON` data type.

## Syntax

A vector value contains an arbitrary number of floating-point numbers. You can represent a vector value using a string in the following syntax:

```sql
'[<float>, <float>, ...]'
```

For example:

```sql
CREATE TABLE vector_table (
    id INT PRIMARY KEY,
    embedding VECTOR(3)
);

INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]');

INSERT INTO vector_table VALUES (2, NULL);
```

Inserting a vector value with invalid syntax results in an error:

```sql
[tidb]> INSERT INTO vector_table VALUES (3, '[5, ]');
ERROR 1105 (HY000): Invalid vector text: [5, ]
```

In the preceding example, the `embedding` column has a dimension of 3, so inserting a vector with a different dimension results in an error:

```sql
[tidb]> INSERT INTO vector_table VALUES (4, '[0.3, 0.5]');
ERROR 1105 (HY000): vector has 2 dimensions, does not fit VECTOR(3)
```
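Both kinds of error can also be caught on the client before a statement is sent. The following Python sketch is a hypothetical pre-check, not TiDB's own parser: the function name `parse_vector_text` is invented for illustration, and it assumes that the vector text can be read as a JSON array of numbers, which may be stricter or looser than what TiDB actually accepts.

```python
import json


def parse_vector_text(text, dims=None):
    """Parse '[<float>, <float>, ...]' into a list of floats.

    Hypothetical client-side helper; TiDB's own parsing rules may differ.
    `dims` mirrors the D in VECTOR(D); None mirrors a plain VECTOR column.
    """
    try:
        values = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid vector text: {text}") from exc
    if not isinstance(values, list):
        raise ValueError(f"Invalid vector text: {text}")
    for v in values:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise ValueError(f"Invalid vector text: {text}")
    if dims is not None and len(values) != dims:
        raise ValueError(
            f"vector has {len(values)} dimensions, does not fit VECTOR({dims})"
        )
    return [float(v) for v in values]
```

With this sketch, `parse_vector_text('[0.3, 0.5, -0.1]', dims=3)` returns a list of three floats, while `'[5, ]'` and a two-element vector checked against `dims=3` both raise `ValueError`.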

For the functions and operators available for vector data types, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).

## Vectors with different dimensions

By omitting the dimension parameter in the `VECTOR` type, you can store vectors with different dimensions in the same column:

```sql
CREATE TABLE vector_table (
    id INT PRIMARY KEY,
    embedding VECTOR
);

INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]'); -- 3-dimensional vector, OK
INSERT INTO vector_table VALUES (2, '[0.3, 0.5]');       -- 2-dimensional vector, OK
```

## Comparison

You can compare two vectors using [comparison operators](/vector-search-functions-and-operators.md) such as `=`, `!=`, `<`, `>`, `<=`, and `>=`. For a complete list of comparison operators and functions for vector data types, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).

Vector data types are compared element-wise. For example:

- `[1] < [12]`
- `[1,2,3] < [1,2,5]`
- `[1,2,3] = [1,2,3]`
- `[2,2,3] > [1,2,3]`

Vectors with different dimensions are compared lexicographically, according to the following rules:

- The two vectors are compared element by element, and each element is compared numerically.
- The first mismatching element determines which vector is lexicographically _less_ or _greater_.
- If one vector is a prefix of the other, the shorter vector is _less_.
- Two vectors with the same length and identical elements are _equal_.
- An empty vector is _less_ than any non-empty vector.
- Two empty vectors are _equal_.

For example:

- `[] < [1]`
- `[1,2,3] < [1,2,3,0]`
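
The comparison rules in this section coincide with ordinary lexicographic ordering of sequences. As an illustration only, and not TiDB's implementation, the following Python sketch spells the rules out step by step. The function name `compare_vectors` is hypothetical.

```python
def compare_vectors(a, b):
    """Return -1, 0, or 1 if vector a is less than, equal to, or greater than b."""
    # Compare element by element, numerically, over the common prefix.
    for x, y in zip(a, b):
        if x < y:
            return -1
        if x > y:
            return 1
    # No mismatch in the common prefix: the shorter vector is less,
    # which also makes an empty vector less than any non-empty one.
    if len(a) < len(b):
        return -1
    if len(a) > len(b):
        return 1
    return 0


print(compare_vectors([1], [12]))                # -1
print(compare_vectors([1, 2, 3], [1, 2, 3, 0]))  # -1
print(compare_vectors([], []))                   # 0
```

Python's built-in list comparison follows the same ordering, so `[1, 2, 3] < [1, 2, 3, 0]` evaluates to `True` there as well.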

When comparing vector constants, consider performing an [explicit cast](#cast) from string to vector to avoid comparisons based on string values:

```sql
-- Because strings are given, TiDB compares them as strings:
[tidb]> SELECT '[12.0]' < '[4.0]';
+--------------------+
| '[12.0]' < '[4.0]' |
+--------------------+
|                  1 |
+--------------------+
1 row in set (0.01 sec)

-- Cast to vectors explicitly so that the values are compared as vectors:
[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]');
+--------------------------------------------------+
| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') |
+--------------------------------------------------+
|                                                0 |
+--------------------------------------------------+
1 row in set (0.01 sec)
```
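
The two results differ because the operands are ordered differently. The Python sketch below reproduces the same effect outside TiDB, as an analogy rather than a description of TiDB internals: strings are ordered character by character, so `'1'` sorting before `'4'` decides the result, whereas numeric sequences are ordered by value. The helper name `string_less_than` is hypothetical, and it assumes plain code-point ordering, which might not match every collation.

```python
def string_less_than(a: str, b: str) -> bool:
    """Character-by-character comparison, as for plain ASCII strings."""
    return a < b


# '[' matches '[', then '1' sorts before '4', so the string comparison is true.
as_strings = string_less_than("[12.0]", "[4.0]")

# Compared as numeric sequences, 12.0 is not less than 4.0.
as_vectors = [12.0] < [4.0]

print(as_strings, as_vectors)  # True False
```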

## Arithmetic

Vector data types support the element-wise arithmetic operations `+` and `-`. However, performing arithmetic between vectors with different dimensions results in an error.

For example:

```sql
[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]');
+---------------------------------------------+
| VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]') |
+---------------------------------------------+
| [9]                                         |
+---------------------------------------------+
1 row in set (0.01 sec)

[tidb]> SELECT VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]');
+-----------------------------------------------------+
| VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]') |
+-----------------------------------------------------+
| [1,1,1]                                             |
+-----------------------------------------------------+
1 row in set (0.01 sec)

[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[1,2,3]');
ERROR 1105 (HY000): vectors have different dimensions: 1 and 3
```
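
The element-wise behavior and the dimension check can be modeled in a few lines. The following Python sketch is illustrative only and is not TiDB's implementation: the names `vec_add`, `vec_sub`, and `_check_dims` are hypothetical, and Python computes in double precision, so low-order digits can differ from single-precision results.

```python
def _check_dims(a, b):
    """Reject operands whose dimensions differ, as in the error shown above."""
    if len(a) != len(b):
        raise ValueError(
            f"vectors have different dimensions: {len(a)} and {len(b)}"
        )


def vec_add(a, b):
    """Element-wise sum of two vectors with the same dimension."""
    _check_dims(a, b)
    return [x + y for x, y in zip(a, b)]


def vec_sub(a, b):
    """Element-wise difference of two vectors with the same dimension."""
    _check_dims(a, b)
    return [x - y for x, y in zip(a, b)]


print(vec_add([4], [5]))              # [9]
print(vec_sub([2, 3, 4], [1, 2, 3]))  # [1, 1, 1]
```

Calling `vec_add([4], [1, 2, 3])` raises `ValueError` because the operands have 1 and 3 dimensions.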

## Cast

### Cast between Vector ⇔ String

To cast between vectors and strings, use the following functions:

- `CAST(... AS VECTOR)`: String ⇒ Vector
- `CAST(... AS CHAR)`: Vector ⇒ String
- `VEC_FROM_TEXT`: String ⇒ Vector
- `VEC_AS_TEXT`: Vector ⇒ String

When you call a function that takes a vector argument, an implicit cast is applied:

```sql
-- VEC_DIMS accepts only VECTOR arguments, so there is an implicit cast here:
[tidb]> SELECT VEC_DIMS('[0.3, 0.5, -0.1]');
+------------------------------+
| VEC_DIMS('[0.3, 0.5, -0.1]') |
+------------------------------+
|                            3 |
+------------------------------+
1 row in set (0.01 sec)

-- Cast explicitly using VEC_FROM_TEXT:
[tidb]> SELECT VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]'));
+---------------------------------------------+
| VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]')) |
+---------------------------------------------+
|                                           3 |
+---------------------------------------------+
1 row in set (0.01 sec)

-- Cast explicitly using CAST(... AS VECTOR):
[tidb]> SELECT VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR));
+----------------------------------------------+
| VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR)) |
+----------------------------------------------+
|                                            3 |
+----------------------------------------------+
1 row in set (0.01 sec)
```

When an operator or function accepts multiple data types, use an explicit cast. For example, in comparisons, use an explicit cast so that vector values are compared rather than string values:

```sql
-- Because strings are given, TiDB compares them as strings:
[tidb]> SELECT '[12.0]' < '[4.0]';
+--------------------+
| '[12.0]' < '[4.0]' |
+--------------------+
|                  1 |
+--------------------+
1 row in set (0.01 sec)

-- Cast to vectors explicitly so that the values are compared as vectors:
[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]');
+--------------------------------------------------+
| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') |
+--------------------------------------------------+
|                                                0 |
+--------------------------------------------------+
1 row in set (0.01 sec)
```

To explicitly cast a vector to its string representation, use the `VEC_AS_TEXT()` function:

```sql
-- The string is returned in normalized form:
[tidb]> SELECT VEC_AS_TEXT('[0.3, 0.5, -0.1]');
+---------------------------------+
| VEC_AS_TEXT('[0.3, 0.5, -0.1]') |
+---------------------------------+
| [0.3,0.5,-0.1]                  |
+---------------------------------+
1 row in set (0.01 sec)
```

For other cast functions, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).

### Cast between Vector ⇔ other data types

Currently, you cannot cast directly between vectors and other data types (such as `JSON`). You need to use a string as an intermediate type.

## Restrictions

- The maximum supported vector dimension is 16383.
- You cannot store `NaN`, `Infinity`, or `-Infinity` values in vector data types.
- Currently, vector data types cannot store double-precision floating-point numbers. This will be supported in a future release.

For other limitations, see [Vector Search Limitations](/vector-search-limitations.md).
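
The dimension limit and the ban on non-finite values can also be checked on the client before data is inserted. The following Python sketch is a hypothetical pre-check, not part of TiDB: the names `MAX_VECTOR_DIMS` and `check_vector_restrictions` are invented for illustration, and the limit of 16383 is the one stated in this section.

```python
import math

# Maximum dimension, as stated in the restrictions in this section.
MAX_VECTOR_DIMS = 16383


def check_vector_restrictions(values):
    """Raise ValueError if `values` breaks one of the checked restrictions.

    Hypothetical client-side helper; server-side behavior is defined by TiDB.
    """
    if len(values) > MAX_VECTOR_DIMS:
        raise ValueError(
            f"vector has {len(values)} dimensions, maximum is {MAX_VECTOR_DIMS}"
        )
    for v in values:
        # math.isfinite is False for NaN, Infinity, and -Infinity.
        if not math.isfinite(v):
            raise ValueError(f"vector contains a non-finite value: {v}")
    return values
```

The sketch does not cover the single-precision restriction, because Python floats are double precision and any rounding would happen when the value is stored.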

## MySQL compatibility

Vector data types are supported only in TiDB. MySQL does not support them.

## See also

- [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md)