Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync-diff-inspector: Use md5 func replace crc32 for checksum #707

Closed
wants to merge 3 commits into from

Conversation

erwadba
Copy link
Contributor

@erwadba erwadba commented Feb 25, 2023

What problem does this PR solve?

Issue Number: close #703 #634

What is changed and how it works?

Use md5 func replace crc32 for checksum.Data check is more accurate.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation
  • Need to be included in the release note

@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@erwadba
Copy link
Contributor Author

erwadba commented Feb 25, 2023

@lance6716 @Leavrth PTAL,tks.

@lance6716
Copy link
Collaborator

Hi, I think MD5 is much slower than CRC32, can you test about thier performance?

@@ -770,24 +770,21 @@ func GetCountAndCRC32Checksum(ctx context.Context, db *sql.DB, schemaName, table
columnIsNull = append(columnIsNull, fmt.Sprintf("ISNULL(%s)", name))
}

query := fmt.Sprintf("SELECT COUNT(*) as CNT, BIT_XOR(CAST(CRC32(CONCAT_WS(',', %s, CONCAT(%s)))AS UNSIGNED)) as CHECKSUM FROM %s WHERE %s;",
strings.Join(columnNames, ", "), strings.Join(columnIsNull, ", "), dbutil.TableName(schemaName, tableName), limitRange)
query := fmt.Sprintf("SELECT COUNT(*) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', %s, CONCAT(%s))), 1, 16), 16, 10) AS UNSIGNED)) LMD5, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', %s, CONCAT(%s))), 17, 16), 16, 10) AS UNSIGNED)) RMD5 FROM %s WHERE %s;",
Copy link
Contributor

@Leavrth Leavrth Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😭 Unfortunately, it will take twice as long.
Currently TiDB hasn't optimized this kind of SQL. The sub expression MD5(CONCAT_WS(',', %s, CONCAT(%s))) will be calculated twice.
pingcap/tidb#39576

@erwadba
Copy link
Contributor Author

erwadba commented Mar 2, 2023

Hi, I think MD5 is much slower than CRC32, can you test about thier performance?

I use sync-diff-inspector check 2T data in my production environment.
source tidb v6.5
target tidb v6.5
CRC32
Time Cost: 34m38.359171205s
Average Speed: 327.637020MB/s
MD5
Time Cost: 40m9.020688748s
Average Speed: 282.645290MB/s

@Leavrth
Copy link
Contributor

Leavrth commented Mar 2, 2023

Hi, I think MD5 is much slower than CRC32, can you test about thier performance?

I use sync-diff-inspector check 2T data in my production environment. source tidb v6.5 target tidb v6.5 CRC32 Time Cost: 34m38.359171205s Average Speed: 327.637020MB/s MD5 Time Cost: 40m9.020688748s Average Speed: 282.645290MB/s

How many are columns of the test tables? It shows that the speed of CRC32 is close to that of MD5.
However, please refer to this comment #707 (comment), I think if the columns of the test table become more and more, the speed gap will be widen.

@erwadba
Copy link
Contributor Author

erwadba commented Mar 3, 2023

How many are columns of the test tables? It shows that the speed of CRC32 is close to that of MD5.

There are 93 tables. The small one has 5 columns and the big one has 25 columns.
Maybe we can use md5 checksum like this.
https://www.percona.com/blog/how-to-avoid-hash-collisions-when-using-mysqls-crc32-function/

use test;
create table f1(id int primary key,f float);
insert into f1 values(1,0.1),(2,0.01),(3,0.001);
SELECT COUNT(*) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(@crc, 1, 16), 16, 10) AS UNSIGNED)) LMD5, BIT_XOR(CAST(CONV(SUBSTRING(@crc :=MD5(CONCAT_WS(',',`id`, `f`, CONCAT(ISNULL(`id`), ISNULL(`f`)))), 17, 16), 16, 10) AS UNSIGNED)) RMD5  FROM `test`.`f1` WHERE ((TRUE) AND (TRUE));

@Leavrth
Copy link
Contributor

Leavrth commented Mar 3, 2023

How many are columns of the test tables? It shows that the speed of CRC32 is close to that of MD5.

There are 93 tables. The small one has 5 columns and the big one has 25 columns. Maybe we can use md5 checksum like this. https://www.percona.com/blog/how-to-avoid-hash-collisions-when-using-mysqls-crc32-function/

use test;
create table f1(id int primary key,f float);
insert into f1 values(1,0.1),(2,0.01),(3,0.001);
SELECT COUNT(*) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(@crc, 1, 16), 16, 10) AS UNSIGNED)) LMD5, BIT_XOR(CAST(CONV(SUBSTRING(@crc :=MD5(CONCAT_WS(',',`id`, `f`, CONCAT(ISNULL(`id`), ISNULL(`f`)))), 17, 16), 16, 10) AS UNSIGNED)) RMD5  FROM `test`.`f1` WHERE ((TRUE) AND (TRUE));
mysql> SELECT COUNT(1) AS CNT, BIT_XOR(CAST(CRC32(CONCAT_WS(',', `s_i_id`, `s_w_id`, `s_quantity`, `s_dist_01`, `s_dist_02`, `s_dist_03`, `s_dist_04`, `s_dist_05`, `s_dist_06`, `s_dist_07`, `s_dist_08`, `s_dist_09`, `s_dist_10`, `s_ytd`, `s_order_cnt`, `s_remote_cnt`, `s_data`, CONCAT(ISNULL(`s_i_id`), ISNULL(`s_w_id`))))AS UNSIGNED)) as BIT_XOR_CHECKSUM FROM test.stock;
+----------+------------------+
| CNT      | BIT_XOR_CHECKSUM |
+----------+------------------+
| 12800000 |        385643470 |
+----------+------------------+
1 row in set (5.97 sec)

mysql> SELECT COUNT(1) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(@crc, 1, 16), 16, 10) AS UNSIGNED)) LMD5, BIT_XOR(CAST(CONV(SUBSTRING(@crc :=MD5(CONCAT_WS(',', `s_i_id`, `s_w_id`, `s_quantity`, `s_dist_01`, `s_dist_02`, `s_dist_03`, `s_dist_04`, `s_dist_05`, `s_dist_06`, `s_dist_07`, `s_dist_08`, `s_dist_09`, `s_dist_10`, `s_ytd`, `s_order_cnt`, `s_remote_cnt`, `s_data`, CONCAT(ISNULL(`s_i_id`), ISNULL(`s_w_id`)))), 17, 16), 16, 10) AS UNSIGNED)) RMD5  FROM `test`.`stock` WHERE ((TRUE) AND (TRUE));
+----------+----------------------+----------------------+
| CNT      | LMD5                 | RMD5                 |
+----------+----------------------+----------------------+
| 12800000 | 17055727021754144759 | 17409312425610765406 |
+----------+----------------------+----------------------+
1 row in set (1 min 37.94 sec)

mysql> SELECT COUNT(1) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', `s_i_id`, `s_w_id`, `s_quantity`, `s_dist_01`, `s_dist_02`, `s_dist_03`, `s_dist_04`, `s_dist_05`, `s_dist_06`, `s_dist_07`, `s_dist_08`, `s_dist_09`, `s_dist_10`, `s_ytd`, `s_order_cnt`, `s_remote_cnt`, `s_data`, CONCAT(ISNULL(`s_i_id`), ISNULL(`s_w_id`)))), 1, 16), 16, 10) AS UNSIGNED)) LMD5, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', `s_i_id`, `s_w_id`, `s_quantity`, `s_dist_01`, `s_dist_02`, `s_dist_03`, `s_dist_04`, `s_dist_05`, `s_dist_06`, `s_dist_07`, `s_dist_08`, `s_dist_09`, `s_dist_10`, `s_ytd`, `s_order_cnt`, `s_remote_cnt`, `s_data`, CONCAT(ISNULL(`s_i_id`), ISNULL(`s_w_id`)))), 17, 16), 16, 10) AS UNSIGNED)) RMD5  FROM `test`.`stock` WHERE ((TRUE) AND (TRUE));
+----------+----------------------+----------------------+
| CNT      | LMD5                 | RMD5                 |
+----------+----------------------+----------------------+
| 12800000 | 18443984006065217657 | 17409312425610765406 |
+----------+----------------------+----------------------+
1 row in set (11.99 sec)

mysql> SELECT COUNT(1) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', `s_i_id`, `s_w_id`, `s_quantity`, `s_dist_01`, `s_dist_02`, `s_dist_03`, `s_dist_04`, `s_dist_05`, `s_dist_06`, `s_dist_07`, `s_dist_08`, `s_dist_09`, `s_dist_10`, `s_ytd`, `s_order_cnt`, `s_remote_cnt`, `s_data`, CONCAT(ISNULL(`s_i_id`), ISNULL(`s_w_id`)))), 1, 16), 16, 10) AS UNSIGNED)) LMD5 FROM `test`.`stock` WHERE ((TRUE) AND (TRUE));
+----------+----------------------+
| CNT      | LMD5                 |
+----------+----------------------+
| 12800000 | 18443984006065217657 |
+----------+----------------------+
1 row in set (7.52 sec)

If use @crc32, the calculation operation won't push down to TiKV. That the whole data is sent to TiDB from TiKV costs too hightly.

mysql> explain SELECT COUNT(1) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(@crc, 1, 16), 16, 10) AS UNSIGNED)) LMD5, BIT_XOR(CAST(CONV(SUBSTRING(@crc :=MD5(CONCAT_WS(',', `s_i_id`, `s_w_id`, `s_quantity`, `s_dist_01`, `s_dist_02`, `s_dist_03`, `s_dist_04`, `s_dist_05`, `s_dist_06`, `s_dist_07`, `s_dist_08`, `s_dist_09`, `s_dist_10`, `s_ytd`, `s_order_cnt`, `s_remote_cnt`, `s_data`, CONCAT(ISNULL(`s_i_id`), ISNULL(`s_w_id`)))), 17, 16), 16, 10) AS UNSIGNED)) RMD5  FROM `test`.`stock` WHERE ((TRUE) AND (TRUE));
+----------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                         | estRows     | task      | access object | operator info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+----------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| HashAgg_6                  | 1.00        | root      |               | funcs:count(1)->Column#18, funcs:bit_xor(Column#21)->Column#19, funcs:bit_xor(Column#22)->Column#20                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| └─Projection_19            | 12800000.00 | root      |               | cast(conv(substring(getvar(crc), 1, 16), 16, 10), bigint(22) UNSIGNED BINARY)->Column#21, cast(conv(substring(setvar(crc, md5(concat_ws(,, cast(test.stock.s_i_id, var_string(20)), cast(test.stock.s_w_id, var_string(20)), cast(test.stock.s_quantity, var_string(20)), test.stock.s_dist_01, test.stock.s_dist_02, test.stock.s_dist_03, test.stock.s_dist_04, test.stock.s_dist_05, test.stock.s_dist_06, test.stock.s_dist_07, test.stock.s_dist_08, test.stock.s_dist_09, test.stock.s_dist_10, cast(test.stock.s_ytd, var_string(20)), cast(test.stock.s_order_cnt, var_string(20)), cast(test.stock.s_remote_cnt, var_string(20)), test.stock.s_data, 00))), 17, 16), 16, 10), bigint(22) UNSIGNED BINARY)->Column#22 |
|   └─TableReader_12         | 12800000.00 | root      |               | data:TableFullScan_11                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|     └─TableFullScan_11     | 12800000.00 | cop[tikv] | table:stock   | keep order:false                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+----------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set, 4 warnings (0.01 sec)

the data is generated by tiup bench tpcc prepare. Anyway, before this SQL is optimized by TiDB, it can only endure up to twice the time to exchange the accuracy of the result.

@Leavrth
Copy link
Contributor

Leavrth commented Aug 9, 2024

closed by #787

@Leavrth Leavrth closed this Aug 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sync-diff-inspector can't find the different about datetime type.
4 participants