Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data lost in chaos with enable-relay #3487

Closed
GMHDBJD opened this issue Nov 16, 2021 · 7 comments · Fixed by #3496
Closed

data lost in chaos with enable-relay #3487

GMHDBJD opened this issue Nov 16, 2021 · 7 comments · Fixed by #3496
Assignees
Labels
affects-5.3 area/dm Issues or PRs related to DM. severity/critical type/bug The issue is confirmed as a bug.

Comments

@GMHDBJD
Copy link
Contributor

GMHDBJD commented Nov 16, 2021

What did you do?

Background
chaos disabled relay after v2.0.2
enable after https://github.com/pingcap/ticdc/pull/3190

source1

enable-relay: true
enable-gtid: true

error found in worker.log

[2021/11/16 12:50:43.768 +00:00] [INFO] [syncer.go:2368] ["skip event, need handled ddls is empty"] [task=task_single] [unit="binlog replication"] [event=query] [queryEventContext="{schema: db_optimistic, originSQL: ALTER TABLE db_optimistic.ddtgdgdix DROP INDEX case8_idx, startLocation: position: (mysql-bin|000001.000003, 50715508), gtid-set: f70af929-46da-11ec-9807-52834c88734c:1-6584, currentLocation: position: (mysql-bin|000001.000003, 50715656), gtid-set: f70af929-46da-11ec-9807-52834c88734c:1-6585, lastLocation: position: (mysql-bin|000001.000003, 50715656), gtid-set: f70af929-46da-11ec-9807-52834c88734c:1-6585, re-sync: , needHandleDDLs: , trackInfos: }"]
[2021/11/16 12:50:43.770 +00:00] [WARN] [local_reader.go:559] ["fail to parse relay log file, meet some ignorable error"] [task=task_single] [unit="binlog replication"] [component="binlog reader"] [file=relay-dir/f70af929-46da-11ec-9807-52834c88734c.000001/mysql-bin.000003] [offset=50723867] [error="get event err EOF, need 8099 but got 978"] [errorVerbose="get event err EOF, need 8099 but got 978\ngithub.com/go-mysql-org/go-mysql/replication.(*BinlogParser).parseSingleEvent\n\tgithub.com/go-mysql-org/[email protected]/replication/parser.go:130\ngithub.com/go-mysql-org/go-mysql/replication.(*BinlogParser).ParseReader\n\tgithub.com/go-mysql-org/[email protected]/replication/parser.go:165\ngithub.com/go-mysql-org/go-mysql/replication.(*BinlogParser).ParseFile\n\tgithub.com/go-mysql-org/[email protected]/replication/parser.go:96\ngithub.com/pingcap/ticdc/dm/relay.(*BinlogReader).parseFile\n\tgithub.com/pingcap/ticdc/dm/relay/local_reader.go:556\ngithub.com/pingcap/ticdc/dm/relay.(*BinlogReader).parseFileAsPossible\n\tgithub.com/pingcap/ticdc/dm/relay/local_reader.go:413\ngithub.com/pingcap/ticdc/dm/relay.(*BinlogReader).parseDirAsPossible\n\tgithub.com/pingcap/ticdc/dm/relay/local_reader.go:382\ngithub.com/pingcap/ticdc/dm/relay.(*BinlogReader).parseRelay\n\tgithub.com/pingcap/ticdc/dm/relay/local_reader.go:311\ngithub.com/pingcap/ticdc/dm/relay.(*BinlogReader).StartSyncByGTID.func1\n\tgithub.com/pingcap/ticdc/dm/relay/local_reader.go:284\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2021/11/16 12:50:43.771 +00:00] [ERROR] [streamer_controller.go:287] ["meet error when get binlog event"] [task=task_single] [unit="binlog replication"] [error="truncate binlog file found, event may be duplicated"] [errorVerbose="truncate binlog file found, event may be duplicated\ngithub.com/pingcap/ticdc/dm/relay.init\n\tgithub.com/pingcap/ticdc/dm/relay/local_reader.go:44\nruntime.doInit\n\truntime/proc.go:6265\nruntime.doInit\n\truntime/proc.go:6242\nruntime.doInit\n\truntime/proc.go:6242\nruntime.main\n\truntime/proc.go:208\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2021/11/16 12:50:43.771 +00:00] [WARN] [syncer.go:1647] ["read binlog met a truncated file, need to open safe-mode until the next transaction"] [task=task_single] [unit="binlog replication"]
[2021/11/16 12:50:43.771 +00:00] [INFO] [syncer.go:1571] ["discard event already consumed"] [count=1] [cur_loc="position: (mysql-bin|000001.000003, 50723867), gtid-set: f70af929-46da-11ec-9807-52834c88734c:1-6586"]
[2021/11/16 12:50:43.992 +00:00] [INFO] [ddl.go:36] ["parse ddl"] [task=task_single] [unit="binlog replication"] [event=query] ["query event context"="{schema: db_optimistic, originSQL: ALTER TABLE db_optimistic.ddtgdgdix ADD INDEX case8_idx(case4_flag1, case4_flag2, case4_flag3), startLocation: position: (mysql-bin|000001.000003, 50806363), gtid-set: f70af929-46da-11ec-9807-52834c88734c:1-6603, currentLocation: position: (mysql-bin|000001.000003, 50806549), gtid-set: f70af929-46da-11ec-9807-52834c88734c:1-6604, lastLocation: position: (mysql-bin|000001.000003, 50806298), gtid-set: f70af929-46da-11ec-9807-52834c88734c:1-6603, re-sync: , needHandleDDLs: , trackInfos: }"]

we can see that task is normal at f70af929-46da-11ec-9807-52834c88734c:1-6585

gtid:6586 is a delete event with multiple rows

3341
3342
3343
3345
3347
3348
3349
3350
3355
3357
3358
3359
3360
3362
3363
3364
3365
3366
3368
3369
3371
3373
3379
3382
3386
3387
3389
3394
3395
3396
3402
3403
3406
3407
3408
3409
3410
3411
3412
3413
3414
3415
3416
3417
3418
3419
3420
3421
3422
3423
3424
3425
3426
3427
3428
3429
3430
3431
3432
3433
3434
3435
3436

I found downstream doesn't delete

3395
3396
3402
3403
3406
3407
3408
3409
3410
3411
3412
3413
3414
3415
3416
3417
3418
3419
3420
3421
3422
3423
3424
3425
3426
3427
3428
3429

So we lost the data after error truncate binlog file found, event may be duplicated

What did you expect to see?

task runs well

What did you see instead?

lost data

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

master
@GMHDBJD GMHDBJD added type/bug The issue is confirmed as a bug. area/dm Issues or PRs related to DM. severity/critical labels Nov 16, 2021
@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Nov 16, 2021

May be introduced by this pr pingcap/dm#2047

@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Nov 16, 2021

The delete rows count from 3341 to 3394 is 28, which equals the count from 3395 to 3429

@lance6716
Copy link
Contributor

where can we retrieve the log?

@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Nov 17, 2021

https://github.com/pingcap/ticdc/actions/runs/1469046051 almost every failed chaos test can find the log 'meet error when get binlog event', the above case is run in my local machine.

@lance6716
Copy link
Contributor

Currently I think the reason is we should not skip events when relay returns EOF, because relay will record the latestPos to continue from last position transparently 😂 (checking)

@lance6716
Copy link
Contributor

lance6716 commented Dec 7, 2021

open it because release-5.3 may need another fix PR

@glorv
Copy link
Contributor

glorv commented Dec 20, 2021

duplicated with #3711

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.3 area/dm Issues or PRs related to DM. severity/critical type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants