*: Add Vector data type #54635

EricZequan (Contributor) commented Jul 15, 2024

What problem does this PR solve?

Issue Number: ref #54245

Problem Summary: Support Vector data type

What changed and how does it work?

  • New variable: @@GLOBAL.TIDB_ENABLE_VECTOR_TYPE
    1. When SEM is enabled (normal / strict mode), users without RESTRICTED_ privileges cannot modify this variable.
  • Defining a table using the new data type: CREATE TABLE foo(val VECTOR)
  • Supported aggregation functions:
    1. Count
    2. Count Distinct
    3. Min
    4. Max
  • Supported scalar functions:
    1. Cast (casting to or from a non-string type throws an error)
    2. Compare
    3. IsTrue/IsFalse/IsNull
    4. vec_dims, vec_from_text, vec_as_text

Other scalar functions (such as vector distances and vector arithmetic) will be added in future PRs.
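
As a rough sketch of what vec_from_text, vec_as_text and vec_dims do conceptually, the standalone Go program below parses a vector literal, formats it back and reports its dimension count. parseVector and formatVector are hypothetical helpers written for this example; they are not the functions added by this PR, and the real implementation may validate input differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVector is a hypothetical stand-in for vec_from_text: it parses a
// literal such as "[1.1, 2.2, 3.3]" into a slice of float32 values.
func parseVector(s string) ([]float32, error) {
	s = strings.TrimSpace(s)
	if len(s) < 2 || s[0] != '[' || s[len(s)-1] != ']' {
		return nil, fmt.Errorf("invalid vector text: %q", s)
	}
	body := strings.TrimSpace(s[1 : len(s)-1])
	if body == "" {
		return []float32{}, nil
	}
	parts := strings.Split(body, ",")
	vec := make([]float32, 0, len(parts))
	for _, p := range parts {
		f, err := strconv.ParseFloat(strings.TrimSpace(p), 32)
		if err != nil {
			return nil, fmt.Errorf("invalid vector element %q: %w", p, err)
		}
		vec = append(vec, float32(f))
	}
	return vec, nil
}

// formatVector is a hypothetical stand-in for vec_as_text: it renders the
// elements separated by commas without spaces.
func formatVector(vec []float32) string {
	parts := make([]string, len(vec))
	for i, v := range vec {
		// bitSize 32 yields the shortest text that round-trips as a float32.
		parts[i] = strconv.FormatFloat(float64(v), 'f', -1, 32)
	}
	return "[" + strings.Join(parts, ",") + "]"
}

func main() {
	vec, err := parseVector("[4.4, 5.5, 6.6]")
	if err != nil {
		panic(err)
	}
	// len(vec) plays the role of vec_dims in this sketch.
	fmt.Println(len(vec), formatVector(vec))
}
```

Formatting with a 32-bit precision hint keeps the text form short (for example 1.1 rather than the longer float64 expansion of the same float32 value).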

The VectorFloat32's Datum / Chunk / Memory layouts are identical, as follows:
[Screenshot of the VectorFloat32 layout, taken 2024-07-22 11:24:54; image not reproduced here]
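
The layout figure is not available in this text. Purely as an illustration, the sketch below assumes a layout of a 4-byte little-endian element count followed by the elements as little-endian float32 values. That layout, and the helper names encodeVector and decodeVector, are assumptions made for this example and are not taken from the PR description.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeVector serializes a vector under the ASSUMED layout:
// a 4-byte little-endian element count, then each element as a
// little-endian IEEE 754 float32.
func encodeVector(vec []float32) []byte {
	buf := make([]byte, 4+4*len(vec))
	binary.LittleEndian.PutUint32(buf, uint32(len(vec)))
	for i, v := range vec {
		binary.LittleEndian.PutUint32(buf[4+4*i:], math.Float32bits(v))
	}
	return buf
}

// decodeVector reverses encodeVector and rejects buffers whose size does
// not match the element count stored in the header.
func decodeVector(buf []byte) ([]float32, error) {
	if len(buf) < 4 {
		return nil, fmt.Errorf("buffer too short: %d bytes", len(buf))
	}
	n := int(binary.LittleEndian.Uint32(buf))
	if len(buf) != 4+4*n {
		return nil, fmt.Errorf("header says %d elements, buffer has %d bytes", n, len(buf))
	}
	vec := make([]float32, n)
	for i := range vec {
		vec[i] = math.Float32frombits(binary.LittleEndian.Uint32(buf[4+4*i:]))
	}
	return vec, nil
}

func main() {
	buf := encodeVector([]float32{1.1, 2.2, 3.3})
	vec, err := decodeVector(buf)
	fmt.Println(len(buf), vec, err)
}
```

Using one byte layout for the Datum, Chunk and in-memory forms means a value can move between them without re-encoding, which is the property the sentence above describes.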

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
mysql> SET GLOBAL TIDB_ENABLE_VECTOR_TYPE = ON;
Query OK, 0 rows affected (0.01 sec)


mysql> CREATE TABLE foo(val VECTOR);
Query OK, 0 rows affected (0.05 sec)


mysql> insert into foo values ('[1.1,2.2,3.3]');
Query OK, 1 row affected (0.01 sec)


mysql> select count(val) from foo;
+------------+
| count(val) |
+------------+
|          1 |
+------------+
1 row in set (0.00 sec)


mysql> insert into foo values ('[1.1,2.2,1.1]');
Query OK, 1 row affected (0.01 sec)


mysql> select Min(val) from foo;
+---------------+
| Min(val)      |
+---------------+
| [1.1,2.2,1.1] |
+---------------+
1 row in set (0.00 sec)


mysql> select Max(val) from foo;
+---------------+
| Max(val)      |
+---------------+
| [1.1,2.2,3.3] |
+---------------+
1 row in set (0.01 sec)


mysql> SELECT COUNT(DISTINCT val) FROM foo;
+---------------------+
| COUNT(DISTINCT val) |
+---------------------+
|                   2 |
+---------------------+
1 row in set (0.00 sec)


mysql> SELECT CAST(val AS CHAR) FROM foo;
+-------------------+
| CAST(val AS CHAR) |
+-------------------+
| [1.1,2.2,3.3]     |
| [1.1,2.2,1.1]     |
+-------------------+
2 rows in set (0.01 sec)


mysql> SELECT * FROM foo WHERE val > '[1.0, 2.0, 3.0]';
+---------------+
| val           |
+---------------+
| [1.1,2.2,3.3] |
| [1.1,2.2,1.1] |
+---------------+
2 rows in set (0.00 sec)


mysql> SELECT * FROM foo WHERE val < '[1.0, 2.0, 3.0]';
Empty set (0.00 sec)


mysql> SELECT * FROM foo WHERE val IS NULL;
Empty set (0.00 sec)


mysql> SELECT vec_dims(val) AS dims FROM foo;
+------+
| dims |
+------+
|    3 |
|    3 |
+------+
2 rows in set (0.01 sec)


mysql> INSERT INTO foo (val) VALUES (vec_from_text('[4.4, 5.5, 6.6]'));
Query OK, 1 row affected (0.01 sec)


mysql> SELECT vec_as_text(val) AS text_val FROM foo;
+---------------+
| text_val      |
+---------------+
| [1.1,2.2,3.3] |
| [1.1,2.2,1.1] |
| [4.4,5.5,6.6] |
+---------------+
3 rows in set (0.00 sec)
  • No need to test
    • I checked and no code files have been changed.
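
The Min, Max and comparison results in the manual test are consistent with element-by-element (lexicographic) ordering. A minimal sketch of such a comparison follows; the name compareVectors and the rule that a shorter vector sorts before a longer one sharing the same prefix are assumptions for illustration and are not confirmed by this PR.

```go
package main

import "fmt"

// compareVectors orders two vectors element by element and returns
// -1, 0 or 1. Assumption: when one vector is a prefix of the other,
// the shorter one sorts first.
func compareVectors(a, b []float32) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		switch {
		case a[i] < b[i]:
			return -1
		case a[i] > b[i]:
			return 1
		}
	}
	switch {
	case len(a) < len(b):
		return -1
	case len(a) > len(b):
		return 1
	}
	return 0
}

func main() {
	rows := [][]float32{{1.1, 2.2, 3.3}, {1.1, 2.2, 1.1}}

	// Min / Max over the rows, as in SELECT Min(val), Max(val) FROM foo.
	lo, hi := rows[0], rows[0]
	for _, r := range rows[1:] {
		if compareVectors(r, lo) < 0 {
			lo = r
		}
		if compareVectors(r, hi) > 0 {
			hi = r
		}
	}
	fmt.Println("min:", lo, "max:", hi)

	// WHERE val > '[1.0, 2.0, 3.0]'
	probe := []float32{1.0, 2.0, 3.0}
	for _, r := range rows {
		fmt.Println(r, "> probe:", compareVectors(r, probe) > 0)
	}
}
```

Under this rule both rows compare greater than [1.0, 2.0, 3.0] because their first element, 1.1, already exceeds 1.0, which matches the two-row result and the empty set in the manual test.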

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Signed-off-by: “EricZequan” <[email protected]>
sre-bot (Contributor) commented Jul 15, 2024

CLA assistant check
All committers have signed the CLA.

ti-chi-bot bot added the labels sig/planner (SIG: Planner) and size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) on Jul 15, 2024

tiprow bot commented Jul 15, 2024

Hi @EricZequan. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

hawkingrei (Member) commented

/ok-to-test

ti-chi-bot bot added the ok-to-test label (indicates a PR is ready to be tested) on Jul 16, 2024

codecov bot commented Jul 17, 2024

Codecov Report

Attention: Patch coverage is 51.01843% with 505 lines in your changes missing coverage. Please review.

Project coverage is 75.4360%. Comparing base (1acb8f7) to head (f132098).

Current head f132098 differs from pull request most recent head 6542f06

Please upload reports for the commit 6542f06 to get more accurate results.

Additional details and impacted files
@@                              Coverage Diff                               @@
##           feature/vector-search/vector-data-type     #54635        +/-   ##
==============================================================================
+ Coverage                                 72.8511%   75.4360%   +2.5848%     
==============================================================================
  Files                                        1558       1561         +3     
  Lines                                      438406     439440      +1034     
==============================================================================
+ Hits                                       319384     331496     +12112     
+ Misses                                      99322      87406     -11916     
- Partials                                    19700      20538       +838     
Flag          Coverage Δ
integration   50.8653% <2.5267%> (?)
unit          71.7228% <50.7274%> (-0.1356%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components    Coverage Δ
dumpling      52.9656% <ø> (ø)
parser        ∅ <ø> (∅)
br            63.1556% <ø> (+17.2813%) ⬆️

EricZequan and others added 4 commits July 17, 2024 18:04
ti-chi-bot bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Jul 20, 2024
pkg/ddl/index.go (outdated review thread, resolved)
Comment on lines +162 to +164
    case types.ETVectorFloat32:
        sig = &builtinCastVectorFloat32AsUnsupportedSig{bf.baseBuiltinFunc}
        // sig.setPbCode(tipb.ScalarFuncSig_CastVectorFloat32AsInt)
Contributor commented:

Is there a plan to support these in the future?

Member replied:

It depends. In most cases the user writes a constant vector string, which is constant-folded in the planning stage, so no cast pushdown is needed in the execution stage.

pkg/parser/parser.y (outdated review thread, resolved)
pkg/types/vector.go (outdated review thread, resolved)
tangenta (Contributor) left a comment

Rest LGTM

ti-chi-bot bot added the lgtm label and removed the needs-1-more-lgtm label (indicates a PR needs 1 more LGTM) on Jul 31, 2024

ti-chi-bot bot commented Jul 31, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-07-24 15:30:05 UTC: ☑️ agreed by breezewish.
  • 2024-07-31 07:56:18 UTC: ☑️ agreed by tangenta.

EricZequan (Contributor, Author) commented

/retest

pingcap deleted a comment from ti-chi-bot bot Jul 31, 2024
XuHuaiyu previously approved these changes Jul 31, 2024
XuHuaiyu dismissed their stale review August 1, 2024 02:16 (reason given: mis-operation)

wuhuizuo (Contributor) commented Aug 1, 2024

/retest

    return newSig
}

func (b *builtinCastVectorFloat32AsStringSig) evalString(ctx EvalContext, row chunk.Row) (res string, isNull bool, err error) {
Contributor commented:

Do we need to support vecEvalXXX for vector type?

Member replied:

Yes, although it is not implemented yet. @EricZequan will find some time to provide a vectorized version for them.

Author (Contributor) replied:

@XuHuaiyu The vectorized version has been merged in cse. I will merge those changes into this branch later.


ti-chi-bot bot commented Aug 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: breezewish, hawkingrei, tangenta, XuHuaiyu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot bot added the approved label on Aug 2, 2024
ti-chi-bot bot merged commit 5389de9 into pingcap:feature/vector-search/vector-data-type on Aug 2, 2024
10 checks passed
Labels
approved lgtm ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
7 participants