
DM-worker may have high CPU usage and flood logs after starting a GTID task #5063

Closed
lance6716 opened this issue Mar 30, 2022 · 1 comment · Fixed by #5160
Assignees
Labels
affects-5.4 affects-6.0 area/dm Issues or PRs related to DM. severity/major type/bug The issue is confirmed as a bug.

Comments

@lance6716
Contributor

lance6716 commented Mar 30, 2022

What did you do?

The bug has the following prerequisites:

  • v5.4.0 or v6.0.0, and
  • relay log is used, and
  • enable-gtid: true in the upstream source config, and
  • an all-mode task is started while the last upstream MySQL binlog file is large, an incremental task is started from a middle position of a binlog file, or a task is auto-resumed at a middle position of a binlog file
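For reference, a hedged sketch of configuration matching these prerequisites, assuming DM's documented source/task config keys (the source-id, connection details, and starting point are placeholders, not from this report):

```yaml
# upstream source config (sketch)
source-id: "mysql-01"
enable-gtid: true     # prerequisite: GTID enabled
enable-relay: true    # prerequisite: relay log in use
from:
  host: "127.0.0.1"
  port: 3306
  user: "root"
  password: ""
```

```yaml
# task config excerpt (sketch): incremental mode starting mid-file
task-mode: incremental
mysql-instances:
  - source-id: "mysql-01"
    meta:
      binlog-name: "mysql-bin.000001"
      binlog-pos: 4            # or, with enable-gtid:
      # binlog-gtid: "<gtid-set>"
```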

What did you expect to see?

works fine

What did you see instead?

  • the task barely makes progress after start-task, and
  • DM-worker has high CPU usage (360% on my local 12-core PC), and
  • lots of logs are generated, like:
[2022/03/30 11:30:46.431 +08:00] [INFO] [syncer.go:2020] ["meet heartbeat event and then flush jobs"] [task=test] [unit="binlog replication"]
[2022/03/30 11:30:46.431 +08:00] [INFO] [syncer.go:3247] ["flush all jobs"] [task=test] [unit="binlog replication"] ["global checkpoint"="{{{mysql-bin.000001 113080239} 0xc000010ff8 0} <nil>}(flushed {{{mysql-bin.000001 113080239} 0xc000011208 0} <nil>})"] ["flush job seq"=37]
[2022/03/30 11:30:46.432 +08:00] [INFO] [syncer.go:1114] ["checkpoint has no change, skip sync flush checkpoint"]

For an all-mode task, the problematic duration is related to the size of the last upstream MySQL binlog file.

For an incremental task, the problematic duration is related to the specified starting binlog location.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v5.4.0, v6.0.0

Upstream MySQL/MariaDB server version:

(paste upstream MySQL/MariaDB server version here)

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)

Other interesting information (system version, hardware config, etc):


current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
@lance6716
Contributor Author

lance6716 commented Mar 30, 2022

@GMHDBJD we implemented the wrong logic for skipped GTIDs in the relay log reader.

I guess the reason your experiment shows every skipped GTID replaced by a heartbeat event is that go-mysqlbinlog does not set HeartbeatPeriod by default. When I set HeartbeatPeriod to 30s, as DM does, only one heartbeat event was received.

And this is my interpretation of the MySQL server's behaviour:
https://github.com/mysql/mysql-server/blob/df0bc0a67b5cca06665e1d501a2f74b712af5cf5/sql/rpl_binlog_sender.cc#L451-L504

if a GTID should be skipped:
    if HeartbeatPeriod has elapsed, send a heartbeat event now; otherwise a heartbeat event can only be sent once this "skip group" is finished
else:
    send a heartbeat event to mark the "skip group" as finished, then send the real binlog event
