
DM-worker may have high CPU usage and flood logs after starting a GTID task #5063

Closed
lance6716 opened this issue Mar 30, 2022 · 1 comment · Fixed by #5160
Assignees
Labels
affects-5.4 affects-6.0 area/dm Issues or PRs related to DM. severity/major type/bug The issue is confirmed as a bug.

Comments

@lance6716
Contributor

lance6716 commented Mar 30, 2022

What did you do?

The bug has the following prerequisites:

  • v5.4.0 or v6.0.0, and
  • relay log is used, and
  • enable-gtid: true in the upstream source config, and
  • an all-mode task is started while the last upstream MySQL binlog file is large, an incremental task is started from a middle position of a binlog file, or a task is auto-resumed at a middle position of a binlog file
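For reference, a hedged sketch of configuration matching these prerequisites, assuming DM's documented source/task config keys (the source-id, connection details, and starting point are placeholders, not from this report):

```yaml
# upstream source config (sketch)
source-id: "mysql-01"
enable-gtid: true     # prerequisite: GTID enabled
enable-relay: true    # prerequisite: relay log in use
from:
  host: "127.0.0.1"
  port: 3306
  user: "root"
  password: ""
```

```yaml
# task config excerpt (sketch): incremental mode starting mid-file
task-mode: incremental
mysql-instances:
  - source-id: "mysql-01"
    meta:
      binlog-name: "mysql-bin.000001"
      binlog-pos: 4            # or, with enable-gtid:
      # binlog-gtid: "<gtid-set>"
```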

What did you expect to see?

works fine

What did you see instead?

  • the task barely makes progress after start-task, and
  • DM-worker has high CPU usage (360% on my local 12-core PC), and
  • lots of logs are generated, like:
[2022/03/30 11:30:46.431 +08:00] [INFO] [syncer.go:2020] ["meet heartbeat event and then flush jobs"] [task=test] [unit="binlog replication"]
[2022/03/30 11:30:46.431 +08:00] [INFO] [syncer.go:3247] ["flush all jobs"] [task=test] [unit="binlog replication"] ["global checkpoint"="{{{mysql-bin.000001 113080239} 0xc000010ff8 0} <nil>}(flushed {{{mysql-bin.000001 113080239} 0xc000011208 0} <nil>})"] ["flush job seq"=37]
[2022/03/30 11:30:46.432 +08:00] [INFO] [syncer.go:1114] ["checkpoint has no change, skip sync flush checkpoint"]

For an all-mode task, the problematic duration is related to the size of the last upstream MySQL binlog file.

For an incremental task, the problematic duration is related to the specified starting binlog location.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v5.4.0, v6.0.0

Upstream MySQL/MariaDB server version:

(paste upstream MySQL/MariaDB server version here)

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)

Other interesting information (system version, hardware config, etc):


current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
@lance6716
Contributor Author

lance6716 commented Mar 30, 2022

@GMHDBJD we implemented the wrong logic for skipped GTIDs in the relay log reader.

I guess the reason your experiment shows every skipped GTID replaced by a heartbeat event is that go-mysqlbinlog does not set HeartbeatPeriod by default. When I set HeartbeatPeriod to 30s, as DM does, only one heartbeat event was received.

And this is my interpretation of the MySQL server's behaviour:
https://github.com/mysql/mysql-server/blob/df0bc0a67b5cca06665e1d501a2f74b712af5cf5/sql/rpl_binlog_sender.cc#L451-L504

if a GTID should be skipped:
    if HeartbeatPeriod has elapsed, send a heartbeat event now; otherwise a heartbeat event can only be sent once this "skip group" is finished
else:
    send a heartbeat event to mark the "skip group" as finished, then send the real binlog event
