
[ISSUE-1827][Mysql] Mysql connector support parallel snapshot when there are no primary key for tables #2046

Closed · wants to merge 1 commit

Conversation

@zstraw (Author) commented Mar 27, 2023

Resolve #1827

@ruanhang1993 (Contributor) left a comment

Thanks for the contribution. There are still some places that have not been handled.

@@ -403,4 +403,58 @@ public boolean isRunning() {
return currentTaskRunning;
}
}

private boolean containsPrimaryKey() {
return !CollectionUtil.isNullOrEmpty(currentSnapshotSplit.getSplitKeyType().getFields());
ruanhang1993 (Contributor):

It is better not to determine this in this way. In some cases the chunk key will be set in order to handle a big table and provide the at-least-once guarantee.

ruanhang1993 (Contributor):

We could initialize the SnapshotRecordCollector when record.key() returns null.
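
For illustration, a minimal sketch of that suggestion (the collector types and the onRecord method are hypothetical placeholders, not code from this PR; only SourceRecord#key() is an existing Kafka Connect API):

```java
import org.apache.kafka.connect.source.SourceRecord;

/** Hypothetical sketch only: lazily choose a collector based on whether records carry a key. */
class CollectorInitSketch {

    /** Placeholder for the collector abstraction discussed in this PR. */
    interface SnapshotRecordCollector {
        void collect(SourceRecord record, boolean reachBinlogStart);
    }

    private SnapshotRecordCollector collector;

    void onRecord(SourceRecord record, boolean reachBinlogStart) {
        if (collector == null) {
            // A null key indicates the table has no primary/chunk key, so the whole table
            // is treated as one split; otherwise use the normal keyed collector.
            collector = (record.key() == null) ? newUnkeyedCollector() : newKeyedCollector();
        }
        collector.collect(record, reachBinlogStart);
    }

    private SnapshotRecordCollector newUnkeyedCollector() {
        return (record, reachBinlogStart) -> { /* buffer records for the single total split */ };
    }

    private SnapshotRecordCollector newKeyedCollector() {
        return (record, reachBinlogStart) -> { /* upsert records keyed by the chunk key */ };
    }
}
```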

zstraw (Author):

> It is better not to determine this in this way. In some cases the chunk key will be set in order to handle a big table and provide the at-least-once guarantee.

Thanks for your review. You mean that the chunk key may be set even though the table has no primary key?

ruanhang1993 (Contributor):

Yes. If the table contains too much data, there is a performance problem with this approach.
In this case, we may provide at-least-once semantics instead of exactly-once semantics.

zstraw (Author):

Is it possible that exactly-once semantics are required in some scenarios? How about providing an option so that users can choose between better performance and exactly-once semantics?

ruanhang1993 (Contributor):

I think we should control this with the experimental option scan.incremental.snapshot.chunk.key-column.
By default, we will return a single split. If users set scan.incremental.snapshot.chunk.key-column, we will split the table into multiple splits.
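
For illustration, a minimal sketch of how a user would opt in via that option in a Flink SQL table definition (the schema and connection values are placeholders, and the sketch assumes the mysql-cdc connector jar is on the classpath; the exact option set depends on the connector version):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ChunkKeyColumnExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // A MySQL CDC table without a primary key. Leaving out
        // 'scan.incremental.snapshot.chunk.key-column' would read it as a single split;
        // setting it lets the source split the table into multiple chunks by that column.
        tEnv.executeSql(
                "CREATE TABLE orders (\n"
                        + "  order_id BIGINT,\n"
                        + "  product STRING,\n"
                        + "  amount DECIMAL(10, 2)\n"
                        + ") WITH (\n"
                        + "  'connector' = 'mysql-cdc',\n"
                        + "  'hostname' = 'localhost',\n"
                        + "  'port' = '3306',\n"
                        + "  'username' = 'flinkuser',\n"
                        + "  'password' = 'flinkpw',\n"
                        + "  'database-name' = 'mydb',\n"
                        + "  'table-name' = 'orders',\n"
                        + "  'scan.incremental.snapshot.chunk.key-column' = 'order_id'\n"
                        + ")");
    }
}
```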

}

/** Collects records that need to be sent, except the low/high watermarks. */
interface SnapshotRecords {
ruanhang1993 (Contributor):

SnapshotRecords -> SnapshotRecordCollector

minMaxOfSplitColumn = queryMinMax(jdbcConnection, tableId, splitColumn.name());
} else {
minMaxOfSplitColumn = new Object[2];
}
ruanhang1993 (Contributor):

The chunk splitter should immediately return a single total split when the table does not have a primary key. This part should not be changed.
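
To make the intended behavior concrete, here is a small, self-contained sketch (SplitRange and the numeric chunking are stand-ins for the real splitter types, not the actual API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Stand-in sketch: no primary key => one total split; otherwise evenly sized chunks. */
public class TotalSplitSketch {

    /** Placeholder split; null bounds mean "from the beginning to the end of the table". */
    public record SplitRange(Long start, Long end) {}

    public static List<SplitRange> split(boolean hasPrimaryKey, long min, long max, long chunkSize) {
        if (!hasPrimaryKey) {
            // Return the single total split immediately; no min/max query is needed.
            return Collections.singletonList(new SplitRange(null, null));
        }
        // With a primary (or chunk) key, divide [min, max] into evenly sized ranges.
        List<SplitRange> splits = new ArrayList<>();
        for (long start = min; start <= max; start += chunkSize) {
            splits.add(new SplitRange(start, Math.min(start + chunkSize, max + 1)));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(split(false, 0, 0, 0)); // [SplitRange[start=null, end=null]]
        System.out.println(split(true, 1, 10, 4)); // three ranges covering 1..10
    }
}
```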

public static RowType getChunkKeyColumnType(@Nullable Column chunkKeyColumn) {
if (chunkKeyColumn == null) {
return (RowType) ROW().getLogicalType();
}
ruanhang1993 (Contributor):

This should not be changed in this way. I think we could use the first column as the chunk key if the chunk key is not set in the table options.

zstraw (Author):

You mean using the first column as the chunk key to split the table if it has no primary key?

ruanhang1993 (Contributor):

I replied about this part in the first comment.
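
For reference, a minimal sketch of the "first column as chunk key" fallback floated above (all names and parameters here are hypothetical, not the connector's actual resolution logic):

```java
import java.util.List;

/** Hypothetical sketch of a chunk-key-column resolution order; not the connector's actual code. */
public class ChunkKeyResolutionSketch {

    static String resolveChunkKeyColumn(
            String configuredChunkKeyColumn,  // from scan.incremental.snapshot.chunk.key-column, may be null
            List<String> primaryKeyColumns,   // may be empty for tables without a primary key
            List<String> physicalColumns) {   // never empty
        if (configuredChunkKeyColumn != null) {
            return configuredChunkKeyColumn;  // an explicitly configured chunk key wins
        }
        if (!primaryKeyColumns.isEmpty()) {
            return primaryKeyColumns.get(0);  // default: the first primary-key column
        }
        return physicalColumns.get(0);        // no primary key: fall back to the first column
    }

    public static void main(String[] args) {
        System.out.println(resolveChunkKeyColumn(null, List.of(), List.of("order_id", "product")));
        // prints: order_id
    }
}
```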

@Override
public void collect(SourceRecord record, boolean reachBinlogStart) {
if (!reachBinlogStart) {
snapshotRecords.add(record);
ruanhang1993 (Contributor):

How are the delete and update events handled?

zstraw (Author):

IMHO, the snapshot stage for a table without a primary key only reads snapshot data; binlog data would be read only in the incremental stage. So there are no delete and update events in SnapshotSplitReader if the table has no primary key. Please correct me if I am wrong.

ruanhang1993 (Contributor) commented Apr 21, 2023:

If I understand correctly, you want to skip the snapshot backfill task.
It is a good idea to skip the snapshot backfill task when there is a single split. I think you should set this in the source, but I do not see the related changes.
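
For illustration, a tiny, self-contained sketch of that decision (the SplitBounds record and method names are placeholders, not the actual reader API):

```java
/** Placeholder sketch: a split with null start and end bounds is the single total split. */
public class BackfillDecisionSketch {

    public record SplitBounds(Object[] start, Object[] end) {}

    /** The backfill between the low and high watermark can be skipped for the single total split. */
    public static boolean shouldSkipBackfill(SplitBounds split) {
        return split.start() == null && split.end() == null;
    }

    public static void main(String[] args) {
        System.out.println(shouldSkipBackfill(new SplitBounds(null, null)));                          // true
        System.out.println(shouldSkipBackfill(new SplitBounds(new Object[] {1}, new Object[] {100}))); // false
    }
}
```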

@ruanhang1993 (Contributor) commented May 22, 2023

Hi @zstraw,
I have opened a new PR #2150 to implement this based on this PR. Maybe you could help to review it.
Thanks!

Successfully merging this pull request may close these issues.

Mysql connector support parallel snapshot when there are no primary key for tables