
*: TiFlash support in Reorganize Partition | tidb-test=pr/2069 #40715

Closed

Conversation

mjonss (Contributor) commented Jan 18, 2023

What problem does this PR solve?

Implementing Reorganize Partition specifics for TiFlash

Issue Number: close #38535

Problem Summary:

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

mjonss and others added 30 commits December 20, 2022 20:36
ti-chi-bot (Member) commented Jan 18, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • tangenta

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 18, 2023
@mjonss mjonss changed the title *: TiFlash support in Reorganize Partition *: TiFlash support in Reorganize Partition | tidb-test=pr/2069 Jan 18, 2023
@ti-chi-bot ti-chi-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 19, 2023
mjonss (Contributor, Author) commented Jan 20, 2023

I have manually tested this PR together with pingcap/tiflash#6428, and the data syncs to TiFlash in the new partitions as soon as they are added to the AddingDefinitions when going to StateDeleteOnly.

@@ -2373,20 +2398,25 @@ func (w *worker) onReorganizePartition(d *ddlCtx, t *meta.Meta, job *model.Job)
 // For available state, the new added partition should wait its replica to
 // be finished, otherwise the query to this partition will be blocked.
 count := tblInfo.TiFlashReplica.Count
-needRetry, err := checkPartitionReplica(count, addingDefinitions, d)
+needRetry, err := checkPartitionReplica(count, count, addingDefinitions, d)
Member:

So if one partition replica is not available, then the DDL will block and wait? Why do we make the change from 1 to all?

mjonss (Contributor, Author):

I wanted to avoid the situation where a table has 3 replicas and, after the DDL, the new partitions only have 1, without any warnings etc.
If that is OK and expected, then I can remove the check for all replicas and keep it to only checking for one replica.

Member:

@hehechen I think we can take partition progress into account in the current code base. That means, given a 3-replica table with 1 partition, if only 1 replica is available, we can see available=true and progress=0.3333.
If I am right, then I think both the original "1 replica" policy and the "all replicas" policy in this PR are acceptable, while the best way would be to somehow reuse the function in ddl_tiflash.go. So what do you think about this?

Contributor:

I think it is OK to keep checking only one replica, because when only one of the three replicas is ready, the progress will become 0.33.
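
For illustration, a minimal self-contained Go sketch of the two policies debated above: the original "at least one replica" check versus the "all replicas" check implied by checkPartitionReplica(count, count, ...). The type and function names (partitionReplicaStatus, needRetryForReplicas) are invented for this sketch and are not the real TiDB implementation.

package main

import "fmt"

// partitionReplicaStatus records how many TiFlash replicas of one newly added
// partition have finished syncing.
type partitionReplicaStatus struct {
	PartitionID       int64
	AvailableReplicas uint64
}

// needRetryForReplicas reports whether the DDL step should keep waiting.
// minAvailable = 1 models the original "one replica is enough" policy;
// minAvailable = total replica count models the "all replicas" policy.
func needRetryForReplicas(minAvailable uint64, parts []partitionReplicaStatus) bool {
	for _, p := range parts {
		if p.AvailableReplicas < minAvailable {
			return true // some new partition is not ready yet, wait and retry
		}
	}
	return false
}

func main() {
	// A 3-replica table where only 1 replica of the new partition has synced,
	// i.e. the "available=true, progress=0.33" situation from the review.
	parts := []partitionReplicaStatus{{PartitionID: 100, AvailableReplicas: 1}}

	fmt.Println(needRetryForReplicas(1, parts)) // false: proceed while progress shows 0.33
	fmt.Println(needRetryForReplicas(3, parts)) // true: block until all replicas have synced
}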

@@ -1813,6 +1822,22 @@ func (w *worker) onDropTablePartition(d *ddlCtx, t *meta.Meta, job *model.Job) (
 		return ver, errors.Trace(err)
 	}
 }
+if tblInfo.TiFlashReplica != nil {
CalvinNeo (Member) commented Jan 20, 2023:

We delete every id in physicalTableIDs from AvailablePartitionIDs.
Could you explain a bit more? I didn't get the idea here...

mjonss (Contributor, Author):

I could not find that dropped partitions were ever removed from the AvailablePartitionIDs, so I added it here, since when dropping (or cleaning up after a failed add or reorganize partition) those IDs should no longer be in that list.
Not sure if this explains it?
Or how is the AvailablePartitionIDs list otherwise handled in this case?

Member:

I think I got your point, and I think it makes sense.
However, I am wondering why the system could work before this and what will change after this. I have not modified this code, and I found it already exists in release-5.4, which is a very old version, so it has actually worked for a long time.
IMO, it can work because this is a drop action: a partition is dropped and will not be accessed afterwards. However, its data will not be deleted until the GC worker removes it. During that time gap, if we must access the partition, it is actually still available in TiFlash. And if we do a flashback, it will be available again (maybe some tests could be added here). So that is why the author wrote it like this.
I think it would be better if we could ask the author for a confirmation.

Member:

I think it would be better if we could ask the author for a confirmation.

@crazycs520 Could you please take a look?

Contributor:

I think it's reasonable to delete the dropped partition IDs from TiFlashReplica.AvailablePartitionIDs here. You can find similar logic in the onTruncateTablePartition function too.

I didn't delete the dropped partition IDs from tblInfo.TiFlashReplica.AvailablePartitionIDs in the onDropTablePartition function before, because it didn't affect correctness. I think this was an oversight on my part, even if it didn't affect correctness.

BTW, I think the name of the variable physicalTableIDs here is misleading: physicalTableIDs actually only contains the IDs of the partitions being dropped, not all partitions' IDs.
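
For context, a minimal self-contained Go sketch of the clean-up being discussed: removing the IDs of the partitions being dropped from TiFlashReplica.AvailablePartitionIDs. The struct below only mirrors the fields relevant here and is not the real model definition; removeDroppedPartitionIDs is an invented helper name for this sketch.

package main

import "fmt"

// tiFlashReplicaInfo mirrors only the fields needed for this sketch.
type tiFlashReplicaInfo struct {
	Count                 uint64
	AvailablePartitionIDs []int64
}

// removeDroppedPartitionIDs filters the partitions being dropped out of the
// available-partition list, so they no longer count as available in TiFlash.
func removeDroppedPartitionIDs(replica *tiFlashReplicaInfo, droppingIDs []int64) {
	if replica == nil {
		return
	}
	dropping := make(map[int64]struct{}, len(droppingIDs))
	for _, id := range droppingIDs {
		dropping[id] = struct{}{}
	}
	kept := replica.AvailablePartitionIDs[:0]
	for _, id := range replica.AvailablePartitionIDs {
		if _, ok := dropping[id]; !ok {
			kept = append(kept, id)
		}
	}
	replica.AvailablePartitionIDs = kept
}

func main() {
	replica := &tiFlashReplicaInfo{Count: 3, AvailablePartitionIDs: []int64{100, 101, 102}}
	removeDroppedPartitionIDs(replica, []int64{101}) // partition 101 is being dropped
	fmt.Println(replica.AvailablePartitionIDs)       // [100 102]
}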

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jan 20, 2023
	if err != nil {
		// need to rollback, since we tried to register the new
		// partitions before!
		return convertAddTablePartitionJob2RollbackJob(d, t, job, err, tblInfo)
	}
	// Try for 10 rounds (in case of transient TiFlash issues)
Contributor:

Why try for 10 rounds instead of tidb_ddl_error_count_limit?

mjonss (Contributor, Author) commented Jan 31, 2023:

It is just a number taken from thin air. My idea is that we should not block too long on TiFlash issues, but if it takes some time, we will skip the wait and add an entry to the error log, so that the DDL can continue at least with the TiKV data. I would assume that if TiFlash cannot replicate the new empty regions, then it will also have problems replicating the current table's regions.
@hehechen what would you suggest? I can remove this, so it will fail after tidb_ddl_error_count_limit. Or should it wait for tidb_ddl_error_count_limit errors and then proceed (including resetting the count)?

Contributor:


If job.ErrorCount > 10, will it return convertAddTablePartitionJob2RollbackJob and then onDropTablePartition (line 2243)?

Member:

If job.ErrorCount > 10, will it return convertAddTablePartitionJob2RollbackJob and then onDropTablePartition (line 2243)?

@hehechen Yes.

Did we ever encounter a case where "we got some problems when creating the TiFlash replica, which blocks the DDL and makes it retry a lot of times"?

mjonss (Contributor, Author):

There is an alternative PR without this change of logic: #42082. Maybe we should merge that one instead, to get all the tests in, and wait with extending the logic until we actually see any issues?
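
For illustration, a self-contained Go sketch of the bounded-wait idea from this thread: retry waiting for TiFlash for a fixed number of rounds, then log and let the DDL continue with TiKV data only, instead of rolling the job back after tidb_ddl_error_count_limit errors. ddlJob, handleReplicaWait, and maxTiFlashWaitRounds are invented names for this sketch, not the real DDL worker code.

package main

import (
	"fmt"
	"log"
)

// ddlJob mimics only the error counter of a real DDL job.
type ddlJob struct {
	ErrorCount int64
}

const maxTiFlashWaitRounds = 10 // the "number taken from thin air"

// handleReplicaWait returns true while we are still willing to wait for the
// TiFlash replica; after maxTiFlashWaitRounds it logs and lets the DDL
// continue without the replica being ready.
func handleReplicaWait(job *ddlJob, replicaReady bool) bool {
	if replicaReady {
		return false
	}
	if job.ErrorCount >= maxTiFlashWaitRounds {
		log.Printf("TiFlash replica not ready after %d rounds, continuing without waiting", maxTiFlashWaitRounds)
		return false
	}
	job.ErrorCount++ // the framework would normally bump this on a retryable error
	return true
}

func main() {
	job := &ddlJob{}
	rounds := 0
	for handleReplicaWait(job, false) { // the replica never becomes ready in this demo
		rounds++
	}
	fmt.Println("stopped waiting after", rounds, "rounds") // stopped waiting after 10 rounds
}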

mjonss (Contributor, Author) commented May 8, 2023

Only add the tests instead, in #42082.

@mjonss mjonss closed this May 8, 2023
Labels
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
  • status/LGT1: Indicates that a PR has LGTM 1.