Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

binary literal containing invalid-unicode characters are truncated in the metadata #23865

Closed
winoros opened this issue Apr 6, 2021 · 11 comments
Assignees
Labels
severity/moderate sig/execution SIG execution sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug. wontfix This issue will not be fixed.

Comments

@winoros
Copy link
Member

winoros commented Apr 6, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

create table t(a enum(0x00A4EEF4FA55D6706ED5));
insert into t values(0x00A4EEF4FA55D6706ED5);
insert into t values(0x1);
select * from t where a = 0x00A4EEF4FA55D6706ED5;

2. What did you expect to see? (Required)

Two rows of value 0x00A4EEF4FA55D6706ED5.
And the select can select the rows.

3. What did you see instead (Required)

The first insert raised an error Data Truncated. Only one row inserted.

And the select returns no row.

4. What is your TiDB version? (Required)

master

@winoros winoros added type/bug The issue is confirmed as a bug. sig/sql-infra SIG: SQL Infra labels Apr 6, 2021
@winoros
Copy link
Member Author

winoros commented Apr 6, 2021

There's also another interesting thing.
In MySQL, if you change the where clause to 0x2, 0x3, 0x4. The select can also return the rows.

@xhebox
Copy link
Contributor

xhebox commented Apr 7, 2021

/assign

@xhebox xhebox added sig/execution SIG execution and removed sig/sql-infra SIG: SQL Infra labels Apr 8, 2021
@xhebox xhebox added the sig/sql-infra SIG: SQL Infra label Apr 9, 2021
@xhebox
Copy link
Contributor

xhebox commented Apr 9, 2021

Root cause:

column: a enum  00a4eef4fa55d6706ed5 

{"id":72,"type":3,"schema_id":1,"table_id":71,"schema_name":"test","state":0,"err":null,"err_count":0,"row_count":0,"raw_args":[{"id":71,"name":{"O":"t","L":"t"},"charset":"utf8mb4","collate":"utf8mb4_bin","cols":[{"id":1,"name":{"O":"a","L":"a"},"offset":0,"origin_default":null,"origin_default_bit":null,"default":null,"default_bit":null,"default_is_expr":false,"generated_expr_string":"","generated_stored":false,"dependences":null,"type":{"Tp":247,"Flag":0,"Flen":10,"Decimal":0,"Charset":"utf8mb4","Collate":"utf8mb4_bin","Elems":["\u0000\ufffd\ufffd\ufffd\ufffdU\ufffdpn\ufffd"]},"state":5,"comment":"","hidden":false,"change_state_info":null,"version":2}],"index_info":null,"constraint_info":null,"fk_info":null,"state":0,"pk_is_handle":false,"is_common_hand le":false,"common_handle_version":0,"comment":"","auto_inc_id":0,"auto_id_cache":0,"auto_rand_id":0,"max_col_id":1,"max_idx_id":0,"max_cst_id":0,"update_timestamp":0,"ShardRowIDBits":0,"max_shard_row_id_bits":0,"auto_random_bits":0,"pre_split_regions":0,"partition":null,"compression":"","view":null,"sequence":null,"Lock":null,"version":4,"tiflash_replica":null,"is_columnar":false}],"schema_state":0,"snapshot_ver":0,"start_ts":424134920960737282,"dependency_id":0,"query":"create table test.t(a enum(0x00A4EEF4FA55D6706ED5))","binlog":{"SchemaVersion":0,"DBInfo":null,"TableInfo":null,"FinishedTS":0},"version":1,"reorg_meta":null,"priority":0}

Refer to https://pkg.go.dev/encoding/json#InvalidUTF8Error, https://github.com/pingcap/parser/blob/master/parser.y#L11373-L11382. Since we are storing and process binaryLiterals by the string type, so it is truncated silently by json.Marshal when enqueuing CreateTable to the DDL job list.

I would recommend storing and processing binaryLiterals as pure []byte, but it breaks the compatibility.

@winoros
Copy link
Member Author

winoros commented Apr 9, 2021

@xhebox
I'm not sure whether your change will break the behavior when inserting as insert into t values(0x1)?

@xhebox
Copy link
Contributor

xhebox commented Apr 9, 2021

No.

For this issue, it is just a problem of storing non-UTF-8 data in the json, or in the (de)serialization process. Though we could refine the logic about binaryLiteral by switching to []byte, that is another internal improvement.

To avoid invalid characters, either we use another json library that does not validate utf-8, use []byte which has no encodings, or storing hex.EncodeToString() which is valid UTF-8.

Except using another json library, other two changes lead to a different metadata format, different from the data persisted in TiKV... That means, break all old enum/set columns, or at least break all old enum types containing hexLiteral.

In short, to fix the problem, we may need to break more things, or use another json library, or handle the old/invalid data by some sorts of migration.

EDIT: Maybe, we could somehow make use of the version field in the metadata to enable a compatible solution. I guess that needs a new proposal.

@xhebox xhebox changed the title not compatible bahavior when enum meets binary literal binary literal containing invalid-unicode characters are truncated in the metadata Apr 15, 2021
@wshwsh12
Copy link
Contributor

wshwsh12 commented Apr 16, 2021

The behavior is different from MySQL 5.7, and MySQL 8.0.. So could we treat is as a compatibility problem, not a bug?

MySQL 8.0.22:

MySQL [test]> create table t(a enum(0x00A4EEF4FA55D6706ED5));
Query OK, 0 rows affected (0.010 sec)

MySQL [test]> insert into t values(0x00A4EEF4FA55D6706ED5);
Query OK, 1 row affected (0.002 sec)

MySQL [test]> insert into t values(0x1);
Query OK, 1 row affected (0.001 sec)

MySQL [test]> select * from t where a = 0x00A4EEF4FA55D6706ED5;
+-----------+
| a         |
+-----------+
|  ????U?pn |
|  ????U?pn |
+-----------+
2 rows in set (0.000 sec)

MySQL 5.7.32:

MySQL [test]> create table t(a enum(0x00A4EEF4FA55D6706ED5));
Query OK, 0 rows affected (0.005 sec)

MySQL [test]> insert into t values(0x00A4EEF4FA55D6706ED5);
ERROR 1265 (01000): Data truncated for column 'a' at row 1
MySQL [test]> insert into t values(0x1);
ERROR 1265 (01000): Data truncated for column 'a' at row 1
MySQL [test]> select * from t where a = 0x00A4EEF4FA55D6706ED5;
Empty set (0.000 sec)

@xhebox
Copy link
Contributor

xhebox commented Apr 16, 2021

@wshwsh12 Does enum(0x6ED5) works for mysql 5.7? This case is not working on TiDB too:

s, _ := json.Marshal("\x6e\xd5")
fmt.Printf("%s\n", s)
"n\ufffd"

@wshwsh12 wshwsh12 added type/compatibility and removed severity/major type/bug The issue is confirmed as a bug. labels Apr 16, 2021
@wshwsh12
Copy link
Contributor

wshwsh12 commented Apr 16, 2021

@wshwsh12 Does enum(0x6ED5) works for mysql 5.7? This case is not working on TiDB too:

s, _ := json.Marshal("\x6e\xd5")
fmt.Printf("%s\n", s)
"n\ufffd"

In mysql5.7: @xhebox

MySQL [test]> create table t(a enum(0x6ED5));
Query OK, 0 rows affected (0.005 sec)

MySQL [test]> insert into t values(0x6ED5);
Query OK, 1 row affected (0.001 sec)

MySQL [test]> insert into t values(0x1);
ERROR 1265 (01000): Data truncated for column 'a' at row 1
MySQL [test]> select * from t where a = 0x6ED5;
+------+
| a    |
+------+
| nÕ   |
+------+
1 row in set (0.000 sec)


@xhebox
Copy link
Contributor

xhebox commented Apr 16, 2021

Then it is a bug. TiDB doesn't work as mysql 5.7... Maybe we could add the bug label, but not the major label. It is not easy to fix anyway, @wshwsh12 .

mysql [email protected]:test> create table t(a enum(0x6ED5));
Query OK, 0 rows affected
Time: 0.009s
mysql [email protected]:test> insert into t values(0x6ED5);
(1265, "Data truncated for column 'a' at row 1")

@bb7133
Copy link
Member

bb7133 commented Dec 7, 2021

I don't think this is a bug that is worth fixing since the case is barely used.

@bb7133 bb7133 added the wontfix This issue will not be fixed. label Dec 7, 2021
@bb7133 bb7133 closed this as completed Dec 7, 2021
@github-actions
Copy link

github-actions bot commented Dec 7, 2021

Please check whether the issue should be labeled with 'affects-x.y' or 'fixes-x.y.z', and then remove 'needs-more-info' label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
severity/moderate sig/execution SIG execution sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug. wontfix This issue will not be fixed.
Projects
None yet
Development

No branches or pull requests

5 participants