This repository has been archived by the owner on Dec 8, 2021. It is now read-only.

Ingest failed due to EpochNotMatch error #436

Open
amyangfei opened this issue Oct 30, 2020 · 5 comments
Labels
type/bug This issue is a bug report

Comments

@amyangfei
Contributor

amyangfei commented Oct 30, 2020

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.

    import with lightning local backend

  2. What did you expect to see?

    import successfully

  3. What did you see instead?

[2020/10/29 22:17:02.681 +00:00] [ERROR] [local.go:940] ["all retry ingest failed"] ["ingest meta"="{\"uuid\":\"iBhoOzy/TQWsi/A48drSGw==\",\"range\":{\"start\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfNXP/ZDhwMP9tdjX/AAAAAAD6AAD9\",\"end\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfNXP/ZWNjZ/95Y3j/AAAAAAD6AAD9\"},\"cf_name\":\"write\",\"region_id\":9946,\"region_epoch\":{\"conf_ver\":69,\"version\":890}}"] [error="epoch not match: EpochNotMatch conf_ver: 69 version: 890 != conf_ver: 71 version: 890"] [errorVerbose="epoch not match: EpochNotMatch conf_ver: 69 version: 890 != conf_ver: 71 version: 890
github.com/pingcap/tidb-lightning/lightning/backend.isIngestRetryable
	/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:1173
github.com/pingcap/tidb-lightning/lightning/backend.(*local).WriteAndIngestPairs
	/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:918
github.com/pingcap/tidb-lightning/lightning/backend.(*local).writeAndIngestByRange.func1
	/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:850
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1357"]

[2020/10/29 22:19:05.675 +00:00] [ERROR] [restore.go:680] ["restore table failed"] [table=`newswriter`.`customers`] [takeTime=1h3m52.761195165s] [error="restore table `newswriter`.`customers` failed: [cabbaafc-26bd-58e4-bac1-22fb48e0577b] import reach max retry 3 and still failed: split region failed: region=id:8348 start_key:"t\200\000\000\000\000\000\000\377=_i\200\000\000\000\000\377\000\000\002\003\200\000\000\006\377\376\3311\317\003\200\000\000\377\007\025q\244\221\000\000\000\374" end_key:"t\200\000\000\000\000\000\000\377=_i\200\000\000\000\000\377\000\000\002\003\200\000\000\007\377\035\243\251d\003\200\000\000\377\007\035\243\261\311\000\000\000\374" region_epoch:<conf_ver:62 version:897 > peers:<id:8349 store_id:66 > peers:<id:8350 store_id:90 > peers:<id:8351 store_id:1 > , err=message:"EpochNotMatch [region 8348] 8351 epoch changed conf_ver: 62 version: 898 != conf_ver: 62 version: 897, retry later" epoch_not_match:<current_regions:<id:8348 start_key:"t\200\000\000\000\000\000\000\377=_i\200\000\000\000\000\377\000\000\002\003\200\000\000\006\377\377\002Ey\003\200\000\000\377\007\027\260,\010\000\000\000\374" end_key:"t\200\000\000\000\000\000\000\377=_i\200\000\000\000\000\377\000\000\002\003\200\000\000\007\377\035\243\251d\003\200\000\000\377\007\035\243\261\311\000\000\000\374" region_epoch:<conf_ver:62 version:898 > peers:<id:8349 store_id:66 > peers:<id:8350 store_id:90 > peers:<id:8351 store_id:1 > > > "]
  4. Versions of the cluster

    • TiDB-Lightning version (run tidb-lightning -V):

      v4.0.6

    • TiKV version (run tikv-server -V):

      v4.0.6

  5. Operation logs

  6. Configuration of the cluster and the task

    • tidb-lightning.toml for TiDB-Lightning if possible
    • tikv-importer.toml for TiKV-Importer if possible
    • inventory.ini if deployed by Ansible
  7. Screenshot/exported-PDF of Grafana dashboard or metrics' graph in Prometheus for TiDB-Lightning if possible

@amyangfei amyangfei added the type/bug label Oct 30, 2020
@overvenus overvenus changed the title from "Injest failed due to EpochNotMatch error" to "Ingest failed due to EpochNotMatch error" Oct 30, 2020
@overvenus
Member

overvenus commented Nov 3, 2020

epoch not match: EpochNotMatch conf_ver: 69 version: 890 != conf_ver: 71 version: 890

This error means that the region's Raft membership has changed (conf_ver is the Raft membership version; version is the region range version).

The Lightning log contains more errors like this, and all of them are conf_ver mismatches. Does the Lightning local backend wait for region scatter to finish? @glorv

More errors
[2020/10/29 22:17:02.681 +00:00] [ERROR] [local.go:940] ["all retry ingest failed"] ["ingest meta"="{\"uuid\":\"iBhoOzy/TQWsi/A48drSGw==\",\"range\":{\"start\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfNXP/ZDhwMP9tdjX/AAAAAAD6AAD9\",\"end\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfNXP/ZWNjZ/95Y3j/AAAAAAD6AAD9\"},\"cf_name\":\"write\",\"region_id\":9946,\"region_epoch\":{\"conf_ver\":69,\"version\":890}}"] [error="epoch not match: EpochNotMatch conf_ver: 69 version: 890 != conf_ver: 71 version: 890"] [errorVerbose="epoch not match: EpochNotMatch conf_ver: 69 version: 890 != conf_ver: 71 version: 890\ngithub.com/pingcap/tidb-lightning/lightning/backend.isIngestRetryable\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:1173\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).WriteAndIngestPairs\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:918\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).writeAndIngestByRange.func1\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:850\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2020/10/29 22:17:15.472 +00:00] [ERROR] [local.go:940] ["all retry ingest failed"] ["ingest meta"="{\"uuid\":\"4dsY/KecRHWa3iPAf6jaug==\",\"range\":{\"start\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfZmr/ZGR4cf95Y3L/AAAAAAD6AAD9\",\"end\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfZmr/ZTh3Yf95Ymv/AAAAAAD6AAD9\"},\"cf_name\":\"write\",\"region_id\":10139,\"region_epoch\":{\"conf_ver\":63,\"version\":891}}"] [error="epoch not match: EpochNotMatch conf_ver: 63 version: 891 != conf_ver: 65 version: 891"] [errorVerbose="epoch not match: EpochNotMatch conf_ver: 63 version: 891 != conf_ver: 65 version: 891\ngithub.com/pingcap/tidb-lightning/lightning/backend.isIngestRetryable\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:1173\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).WriteAndIngestPairs\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:918\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).writeAndIngestByRange.func1\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:850\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2020/10/29 22:17:21.796 +00:00] [ERROR] [local.go:940] ["all retry ingest failed"] ["ingest meta"="{\"uuid\":\"/GMuiY9ASZmLvjffqWGIfA==\",\"range\":{\"start\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfZnH/Y2FjZ/95a2v/AAAAAAD6AAD9\",\"end\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfZnH/Y3phYf9tZW3/AAAAAAD6AAD9\"},\"cf_name\":\"write\",\"region_id\":9238,\"region_epoch\":{\"conf_ver\":63,\"version\":889}}"] [error="epoch not match: EpochNotMatch conf_ver: 63 version: 889 != conf_ver: 65 version: 889"] [errorVerbose="epoch not match: EpochNotMatch conf_ver: 63 version: 889 != conf_ver: 65 version: 889\ngithub.com/pingcap/tidb-lightning/lightning/backend.isIngestRetryable\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:1173\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).WriteAndIngestPairs\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:918\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).writeAndIngestByRange.func1\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:850\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2020/10/29 22:17:29.134 +00:00] [ERROR] [local.go:940] ["all retry ingest failed"] ["ingest meta"="{\"uuid\":\"w009Uj0dRiy450IWxsEcEA==\",\"range\":{\"start\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfZzn/ZW54Mf9tYWr/AAAAAAD6AAD9\",\"end\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfZ2P/YXp6dP9iMwD/AAAAAAD5AAD9\"},\"cf_name\":\"write\",\"region_id\":7476,\"region_epoch\":{\"conf_ver\":65,\"version\":890}}"] [error="epoch not match: EpochNotMatch conf_ver: 65 version: 890 != conf_ver: 66 version: 890"] [errorVerbose="epoch not match: EpochNotMatch conf_ver: 65 version: 890 != conf_ver: 66 version: 890\ngithub.com/pingcap/tidb-lightning/lightning/backend.isIngestRetryable\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:1173\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).WriteAndIngestPairs\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:918\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).writeAndIngestByRange.func1\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:850\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2020/10/29 22:17:52.462 +00:00] [ERROR] [local.go:940] ["all retry ingest failed"] ["ingest meta"="{\"uuid\":\"ow2wL8RkRLGdzoM7u+LdnQ==\",\"range\":{\"start\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfZzn/ZW54Mf9tYWr/AAAAAAD6AAD9\",\"end\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfZ2P/YXp6dP9iMwD/AAAAAAD5AAD9\"},\"cf_name\":\"write\",\"region_id\":7476,\"region_epoch\":{\"conf_ver\":66,\"version\":890}}"] [error="epoch not match: EpochNotMatch conf_ver: 66 version: 890 != conf_ver: 68 version: 890"] [errorVerbose="epoch not match: EpochNotMatch conf_ver: 66 version: 890 != conf_ver: 68 version: 890\ngithub.com/pingcap/tidb-lightning/lightning/backend.isIngestRetryable\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:1173\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).WriteAndIngestPairs\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:918\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).writeAndIngestByRange.func1\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:850\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2020/10/29 22:18:15.776 +00:00] [ERROR] [local.go:940] ["all retry ingest failed"] ["ingest meta"="{\"uuid\":\"M3ZNNAhTSVehs0bW2A+Ihw==\",\"range\":{\"start\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfcDf/bDM1Zv9lY3D/M2JuMWf/aDX/ZWxtdzk1/zn/dGIAAAAAAPr/AAAAAAAAAAD3\",\"end\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfcGH/dnp6d/95dQD/AAAAAAD5AAD9\"},\"cf_name\":\"write\",\"region_id\":7868,\"region_epoch\":{\"conf_ver\":65,\"version\":887}}"] [error="epoch not match: EpochNotMatch conf_ver: 65 version: 887 != conf_ver: 66 version: 887"] [errorVerbose="epoch not match: EpochNotMatch conf_ver: 65 version: 887 != conf_ver: 66 version: 887\ngithub.com/pingcap/tidb-lightning/lightning/backend.isIngestRetryable\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:1173\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).WriteAndIngestPairs\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:918\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).writeAndIngestByRange\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:837\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).WriteAndIngestByRanges.func1\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:967\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2020/10/29 22:18:21.658 +00:00] [ERROR] [local.go:940] ["all retry ingest failed"] ["ingest meta"="{\"uuid\":\"WlGWqTaTR4Gv9DoOcHo0NQ==\",\"range\":{\"start\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfczb/ZGFhZ/9tM2b/AAAAAAD6AAD9\",\"end\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfczb/ZHl3cP95Z2r/AAAAAAD6AAD9\"},\"cf_name\":\"write\",\"region_id\":10240,\"region_epoch\":{\"conf_ver\":65,\"version\":891}}"] [error="epoch not match: EpochNotMatch conf_ver: 65 version: 891 != conf_ver: 66 version: 891"] [errorVerbose="epoch not match: EpochNotMatch conf_ver: 65 version: 891 != conf_ver: 66 version: 891\ngithub.com/pingcap/tidb-lightning/lightning/backend.isIngestRetryable\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:1173\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).WriteAndIngestPairs\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:918\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).writeAndIngestByRange.func1\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:850\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2020/10/29 22:18:22.612 +00:00] [ERROR] [local.go:940] ["all retry ingest failed"] ["ingest meta"="{\"uuid\":\"m8AUYqVwT4OWyvAJtnbiiw==\",\"range\":{\"start\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfc2r/Y3l3Z/95Z27/AAAAAAD6AAD9\",\"end\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfc2r/ZGRhOP9tcDH/AAAAAAD6AAD9\"},\"cf_name\":\"write\",\"region_id\":9577,\"region_epoch\":{\"conf_ver\":62,\"version\":890}}"] [error="epoch not match: EpochNotMatch conf_ver: 62 version: 890 != conf_ver: 65 version: 890"] [errorVerbose="epoch not match: EpochNotMatch conf_ver: 62 version: 890 != conf_ver: 65 version: 890\ngithub.com/pingcap/tidb-lightning/lightning/backend.isIngestRetryable\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:1173\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).WriteAndIngestPairs\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:918\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).writeAndIngestByRange.func1\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:850\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2020/10/29 22:18:24.477 +00:00] [ERROR] [local.go:940] ["all retry ingest failed"] ["ingest meta"="{\"uuid\":\"9NNFPVSkR9e2pZfu5CeqXA==\",\"range\":{\"start\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfc3H/ZWhjYf95aGb/AAAAAAD6AAD9\",\"end\":\"dIAAAAAAAAD/PV9pgAAAAAD/AAABAUNfc3P/MXlyYf90cHj/eHZhODX/bTD/ejBvYWJh/3n/eDkAAAAAAPr/AAAAAAAAAAD3\"},\"cf_name\":\"write\",\"region_id\":8020,\"region_epoch\":{\"conf_ver\":62,\"version\":890}}"] [error="epoch not match: EpochNotMatch conf_ver: 62 version: 890 != conf_ver: 65 version: 890"] [errorVerbose="epoch not match: EpochNotMatch conf_ver: 62 version: 890 != conf_ver: 65 version: 890\ngithub.com/pingcap/tidb-lightning/lightning/backend.isIngestRetryable\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:1173\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).WriteAndIngestPairs\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:918\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*local).writeAndIngestByRange.func1\n\t/home/jenkins/agent/workspace/ld_lightning_multi_branch_v4.0.6/go/src/github.com/pingcap/tidb-lightning/lightning/backend/local.go:850\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
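
To check which half of the epoch differs in messages like the ones above, a small script can compare the two sides of the `!=`. This is only a sketch: the regular expression is an assumption based on the message format shown in this issue, and `classify_epoch_mismatch` is a hypothetical helper, not part of Lightning.

```python
import re

# Matches the two epochs in messages such as
# "conf_ver: 69 version: 890 != conf_ver: 71 version: 890".
EPOCH_PAIR = re.compile(
    r"conf_ver: (\d+) version: (\d+) != conf_ver: (\d+) version: (\d+)"
)


def classify_epoch_mismatch(message):
    """Return which epoch field differs: 'conf_ver', 'version', 'both' or None."""
    m = EPOCH_PAIR.search(message)
    if m is None:
        return None
    conf_a, ver_a, conf_b, ver_b = map(int, m.groups())
    conf_changed = conf_a != conf_b
    ver_changed = ver_a != ver_b
    if conf_changed and ver_changed:
        return "both"
    if conf_changed:
        return "conf_ver"
    if ver_changed:
        return "version"
    return None  # both epochs are identical


# Messages copied from the logs in this issue.
ingest_error = (
    "epoch not match: EpochNotMatch conf_ver: 69 version: 890 "
    "!= conf_ver: 71 version: 890"
)
split_error = (
    "EpochNotMatch [region 8348] 8351 epoch changed conf_ver: 62 "
    "version: 898 != conf_ver: 62 version: 897, retry later"
)

print(classify_epoch_mismatch(ingest_error))  # conf_ver
print(classify_epoch_mismatch(split_error))   # version
```

Run over the whole Lightning log, the same helper would separate membership changes (conf_ver) from range changes (version).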

Also, according to the TiKV log, Lightning batch-split regions 2 times, while TiKV itself split regions about 683 times. This may be caused by ingesting multiple SST file pairs (write/default) into one region. We should ensure that each region receives only one SST file pair.

➜ grep -E 'BatchSplit.*(requests.*){15,}' tikv-201030.log | wc -l
6 # batch split 2 times * 3 replicas

➜ grep -E 'BatchSplit.*(requests.*){,15}' tikv-201030.log | wc -l
2049 # batch split 683 times * 3 replicas
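
The same counting can be sketched in Python, which makes the per-replica division explicit. The threshold of 15 requests and the factor of 3 replicas come from the grep commands above; `bucket_batch_splits` and `fake_line` are hypothetical helpers, and the sample lines are synthetic, not taken from the real log. Unlike the second grep pattern, whose `{,15}` has a lower bound of zero and therefore also matches the large batches, the two buckets here are disjoint.

```python
import re
from collections import Counter

REPLICAS = 3      # assumption: each split is logged once per replica
LARGE_BATCH = 15  # same threshold as the first grep pattern


def bucket_batch_splits(lines):
    """Count BatchSplit log lines, separated into 'large' and 'small' batches."""
    counts = Counter()
    for line in lines:
        if "BatchSplit" not in line:
            continue
        requests = len(re.findall(r"requests \{", line))
        counts["large" if requests >= LARGE_BATCH else "small"] += 1
    return counts


def fake_line(n_requests):
    """Build a synthetic log line containing n_requests split requests."""
    body = " ".join("requests { split_key: 00 }" for _ in range(n_requests))
    return '[apply.rs:1162] [command="cmd_type: BatchSplit splits { ' + body + ' }"]'


sample = [fake_line(20)] * 3 + [fake_line(1)] * 6 + ["unrelated line"]
counts = bucket_batch_splits(sample)
print(counts["large"] // REPLICAS, counts["small"] // REPLICAS)  # 1 2
```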

@overvenus
Member

overvenus commented Nov 3, 2020

epoch not match: EpochNotMatch conf_ver: 69 version: 890 != conf_ver: 71 version: 890

This error is misleading; it is not the root cause.

err=message:"EpochNotMatch [region 8348] 8351 epoch changed conf_ver: 62 version: 898 != conf_ver: 62 version: 897, retry later"

The error that actually causes Lightning to fail is conf_ver: 62 version: 898 != conf_ver: 62 version: 897, which means the range of region 8348 has changed.

From the TiKV log, I found that region 8348 has been ingested into multiple times, and that it splits before every ingest.

8348 ingest split
[2020/10/29 22:09:20.621 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF00000101435F3030FF30303162FF686737FF6671796562FF7678FF747635693439FF61FF6D6A0000000000FAFF00
[2020/10/29 22:09:56.807 +00:00] [INFO] [sst_importer.rs:83] [ingest] [meta="uuid: 7F3D759D56B345D28534C411F217A96A range { start: 7480000000000000FF3D5F698000000000FF0000020380000000FF0000000A03800000FF0705C34D79000000FC end: 7480000000000000FF3D5F698000000000FF0000020380000007FF1DA3A96403800000FF071DA3B1C8000000FC } cf_name: \"write\" region_id: 8348 region_epoch { conf_ver: 62 version: 887 }"]
[2020/10/29 22:10:10.540 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF0000020380000006FFFC8324CD03800000FF0715E8BE27000000FC new_region_id: 8893 new_peer_ids: 8894 new_peer_ids: 8895 new_peer_ids: 8896 } right_derive: true }"] [index=9] [term=7] [peer_id=8349] [region_id=8348]
[2020/10/29 22:10:40.817 +00:00] [INFO] [sst_importer.rs:83] [ingest] [meta="uuid: DDB7C1737A0F409FB8365CB57311FE90 range { start: 7480000000000000FF3D5F698000000000FF0000020380000006FFFC8324CD03800000FF0715E8BE27000000FC end: 7480000000000000FF3D5F698000000000FF0000020380000007FF1DA3A96403800000FF071DA3B1C8000000FC } cf_name: \"write\" region_id: 8348 region_epoch { conf_ver: 62 version: 888 }"]
[2020/10/29 22:11:07.008 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF0000020380000006FFFCC6924503800000FF071CFA9508000000FC new_region_id: 9636 new_peer_ids: 9637 new_peer_ids: 9638 new_peer_ids: 9639 } right_derive: true }"] [index=11] [term=7] [peer_id=8349] [region_id=8348]
[2020/10/29 22:11:26.814 +00:00] [INFO] [sst_importer.rs:83] [ingest] [meta="uuid: E33C954472D94E439C07E9B1AF7293F6 range { start: 7480000000000000FF3D5F698000000000FF0000020380000006FFFCC6924503800000FF071CFA9508000000FC end: 7480000000000000FF3D5F698000000000FF0000020380000007FF1DA3A96403800000FF071DA3B1C8000000FC } cf_name: \"write\" region_id: 8348 region_epoch { conf_ver: 62 version: 889 }"]
[2020/10/29 22:11:41.709 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF0000020380000006FFFD0F3E1A03800000FF070C85EA6D000000FC new_region_id: 10001 new_peer_ids: 10002 new_peer_ids: 10003 new_peer_ids: 10004 } right_derive: true }"] [index=13] [term=7] [peer_id=8350] [region_id=8348]
[2020/10/29 22:14:29.695 +00:00] [INFO] [sst_importer.rs:83] [ingest] [meta="uuid: D3DB384834F64676A8ACEB3789DE1CE6 range { start: 7480000000000000FF3D5F698000000000FF0000020380000006FFFD0F3E1A03800000FF070C85EA6D000000FC end: 7480000000000000FF3D5F698000000000FF0000020380000007FF1DA3A96403800000FF071DA3B1C8000000FC } cf_name: \"write\" region_id: 8348 region_epoch { conf_ver: 62 version: 890 }"]
[2020/10/29 22:14:36.724 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF0000020380000006FFFD5065FC03800000FF071CB11632000000FC new_region_id: 10678 new_peer_ids: 10679 new_peer_ids: 10680 new_peer_ids: 10681 } right_derive: true }"] [index=15] [term=7] [peer_id=8350] [region_id=8348]
[2020/10/29 22:14:43.736 +00:00] [INFO] [sst_importer.rs:83] [ingest] [meta="uuid: 379CB9ED6B1F4789B086CCAE1CD9B0A3 range { start: 7480000000000000FF3D5F698000000000FF0000020380000006FFFD5065FC03800000FF071CB11632000000FC end: 7480000000000000FF3D5F698000000000FF0000020380000007FF1DA3A96403800000FF071DA3B1C8000000FC } cf_name: \"write\" region_id: 8348 region_epoch { conf_ver: 62 version: 891 }"]
[2020/10/29 22:14:44.458 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF0000020380000006FFFD94524603800000FF0716A1B9F9000000FC new_region_id: 10779 new_peer_ids: 10780 new_peer_ids: 10781 new_peer_ids: 10782 } right_derive: true }"] [index=17] [term=7] [peer_id=8351] [region_id=8348]
[2020/10/29 22:15:01.745 +00:00] [INFO] [sst_importer.rs:83] [ingest] [meta="uuid: E541E849D34941C8A920EF1C709B954D range { start: 7480000000000000FF3D5F698000000000FF0000020380000006FFFD94524603800000FF0716A1B9F9000000FC end: 7480000000000000FF3D5F698000000000FF0000020380000007FF1DA3A96403800000FF071DA3B1C8000000FC } cf_name: \"write\" region_id: 8348 region_epoch { conf_ver: 62 version: 892 }"]
[2020/10/29 22:15:03.373 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF0000020380000006FFFDD670D803800000FF071BB139D7000000FC new_region_id: 10993 new_peer_ids: 10994 new_peer_ids: 10995 new_peer_ids: 10996 } right_derive: true }"] [index=19] [term=7] [peer_id=8349] [region_id=8348]
[2020/10/29 22:15:11.755 +00:00] [INFO] [sst_importer.rs:83] [ingest] [meta="uuid: 26CB1861E7A349B6B726065E597D15D8 range { start: 7480000000000000FF3D5F698000000000FF0000020380000006FFFDD670D803800000FF071BB139D7000000FC end: 7480000000000000FF3D5F698000000000FF0000020380000007FF1DA3A96403800000FF071DA3B1C8000000FC } cf_name: \"write\" region_id: 8348 region_epoch { conf_ver: 62 version: 893 }"]
[2020/10/29 22:15:12.660 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF0000020380000006FFFE18113403800000FF071D039A38000000FC new_region_id: 11077 new_peer_ids: 11078 new_peer_ids: 11079 new_peer_ids: 11080 } right_derive: true }"] [index=21] [term=7] [peer_id=8350] [region_id=8348]
[2020/10/29 22:18:38.987 +00:00] [INFO] [sst_importer.rs:83] [ingest] [meta="uuid: 7DC81AFECE6D4B80B13A13D637C1A6D6 range { start: 7480000000000000FF3D5F698000000000FF0000020380000006FFFE18113403800000FF071D039A38000000FC end: 7480000000000000FF3D5F698000000000FF0000020380000007FF1DA3A96403800000FF071DA3B1C8000000FC } cf_name: \"write\" region_id: 8348 region_epoch { conf_ver: 62 version: 894 }"]
[2020/10/29 22:18:39.911 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF0000020380000006FFFE58EE2503800000FF071547C058000000FC new_region_id: 11374 new_peer_ids: 11375 new_peer_ids: 11376 new_peer_ids: 11377 } right_derive: true }"] [index=23] [term=7] [peer_id=8349] [region_id=8348]
[2020/10/29 22:18:43.008 +00:00] [INFO] [sst_importer.rs:83] [ingest] [meta="uuid: 274255CB3F99481A90B661A5E15C0119 range { start: 7480000000000000FF3D5F698000000000FF0000020380000006FFFE58EE2503800000FF071547C058000000FC end: 7480000000000000FF3D5F698000000000FF0000020380000007FF1DA3A96403800000FF071DA3B1C8000000FC } cf_name: \"write\" region_id: 8348 region_epoch { conf_ver: 62 version: 895 }"]
[2020/10/29 22:18:43.010 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF0000020380000006FFFE99755903800000FF0714CFFE75000000FC new_region_id: 11385 new_peer_ids: 11386 new_peer_ids: 11387 new_peer_ids: 11388 } right_derive: true }"] [index=25] [term=7] [peer_id=8349] [region_id=8348]
[2020/10/29 22:18:54.764 +00:00] [INFO] [sst_importer.rs:83] [ingest] [meta="uuid: C7501A72CF084CFC83DE1551CD1C3935 range { start: 7480000000000000FF3D5F698000000000FF0000020380000006FFFE99755903800000FF0714CFFE75000000FC end: 7480000000000000FF3D5F698000000000FF0000020380000007FF1DA3A96403800000FF071DA3B1C8000000FC } cf_name: \"write\" region_id: 8348 region_epoch { conf_ver: 62 version: 896 }"]
[2020/10/29 22:18:54.831 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF0000020380000006FFFED931CF03800000FF071571A491000000FC new_region_id: 11416 new_peer_ids: 11417 new_peer_ids: 11418 new_peer_ids: 11419 } right_derive: true }"] [index=27] [term=7] [peer_id=8349] [region_id=8348]
[2020/10/29 22:19:00.832 +00:00] [INFO] [sst_importer.rs:83] [ingest] [meta="uuid: 3BCF5B6DEE104E0383EA80BB36EC644C range { start: 7480000000000000FF3D5F698000000000FF0000020380000006FFFED931CF03800000FF071571A491000000FC end: 7480000000000000FF3D5F698000000000FF0000020380000007FF1DA3A96403800000FF071DA3B1C8000000FC } cf_name: \"write\" region_id: 8348 region_epoch { conf_ver: 62 version: 897 }"]
[2020/10/29 22:19:01.031 +00:00] [INFO] [apply.rs:1162] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000000FF3D5F698000000000FF0000020380000006FFFF02457903800000FF0717B02C08000000FC new_region_id: 11428 new_peer_ids: 11429 new_peer_ids: 11430 new_peer_ids: 11431 } right_derive: true }"] [index=29] [term=7] [peer_id=8349] [region_id=8348]
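
The pattern in the log above (the version grows by one between consecutive ingests while conf_ver stays at 62) can be extracted with a short script. This is a sketch under assumptions: the regular expression follows the `sst_importer.rs` line format shown above, `ingest_epochs` and `splits_between_ingests` are hypothetical helpers, and the sample lines are abbreviated copies of the real ones with the elided parts written as `...`.

```python
import re

# Captures region id, conf_ver and version from an sst_importer ingest line.
INGEST_META = re.compile(
    r"\[ingest\].*region_id: (\d+) region_epoch \{ conf_ver: (\d+) version: (\d+) \}"
)


def ingest_epochs(lines, region_id):
    """Return (conf_ver, version) for every ingest into region_id, in log order."""
    epochs = []
    for line in lines:
        m = INGEST_META.search(line)
        if m and int(m.group(1)) == region_id:
            epochs.append((int(m.group(2)), int(m.group(3))))
    return epochs


def splits_between_ingests(epochs):
    """Version increase between each pair of consecutive ingests."""
    return [b[1] - a[1] for a, b in zip(epochs, epochs[1:])]


# Abbreviated sample lines; "..." marks text omitted from the real log.
log = [
    '[2020/10/29 22:09:56.807 +00:00] [INFO] [sst_importer.rs:83] [ingest] '
    '[meta="... region_id: 8348 region_epoch { conf_ver: 62 version: 887 }"]',
    '[2020/10/29 22:10:10.540 +00:00] [INFO] [apply.rs:1162] '
    '["execute admin command"] [command="cmd_type: BatchSplit ..."] [region_id=8348]',
    '[2020/10/29 22:10:40.817 +00:00] [INFO] [sst_importer.rs:83] [ingest] '
    '[meta="... region_id: 8348 region_epoch { conf_ver: 62 version: 888 }"]',
    '[2020/10/29 22:11:26.814 +00:00] [INFO] [sst_importer.rs:83] [ingest] '
    '[meta="... region_id: 8348 region_epoch { conf_ver: 62 version: 889 }"]',
]

epochs = ingest_epochs(log, 8348)
print(epochs)                          # [(62, 887), (62, 888), (62, 889)]
print(splits_between_ingests(epochs))  # [1, 1]
```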

@glorv
Contributor

glorv commented Nov 9, 2020

There are three kinds of events that can cause EpochNotMatch errors:

  • Region leader changes. These increase the region's conf version. After restore: disable some pd scheduler during restore #408, the related schedulers are disabled while Lightning is running, so this kind of error should no longer occur.
  • Region splits by Lightning. If multiple Lightning engines start importing at the same time, they split the same range into different region ranges, so some of them are likely to hit EpochNotMatch errors. If the concurrency is high, the error may occur several times in a row, and Lightning may fail. backend/local: serial import engines with range overlap #451 serializes engines with overlapping ranges, so they no longer conflict with each other.
  • Region splits by TiKV. If the key ranges of multiple engines overlap, more than one SST file is ingested into the same region, so the region's key count or total size exceeds the upper limit. TiKV then automatically splits the oversized region into several regions, and this split causes Lightning's ingestion to hit epoch not match errors. This is likely to happen with Support to specify a disk quota for intermediate files #446 when the disk quota is restricted, because Lightning then cannot fully sort the data of a big table. Since TiKV may start splitting a region at any time, our current logic may still fail in this scenario. We should add more adaptable retry logic to handle it.
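
One possible shape for such retry logic is to refresh the region metadata before every retry instead of reusing the stale epoch. The sketch below is an illustration only, not Lightning's actual code: `EpochNotMatch`, `ingest_with_retry`, `fetch_region` and `ingest` are all hypothetical names, and a real implementation would also need to re-split the key range when the refreshed region no longer covers it.

```python
import time


class EpochNotMatch(Exception):
    """Hypothetical error raised when the region epoch sent is stale."""


def ingest_with_retry(job, fetch_region, ingest,
                      max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Ingest a job, re-reading the region epoch after every stale-epoch error.

    fetch_region(job) returns the current region metadata (hypothetical).
    ingest(job, region) performs the ingest or raises EpochNotMatch.
    """
    region = fetch_region(job)
    for attempt in range(max_retries + 1):
        try:
            return ingest(job, region)
        except EpochNotMatch:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            sleep(base_delay * 2 ** attempt)  # exponential backoff
            region = fetch_region(job)        # pick up the new epoch
```

For example, with a fake region whose version advances on each lookup, the ingest succeeds once the epoch has caught up:

```python
lookups = []


def fake_fetch(job):
    lookups.append(job)
    return {"version": 896 + len(lookups)}


def fake_ingest(job, region):
    if region["version"] < 899:
        raise EpochNotMatch()
    return region["version"]


print(ingest_with_retry("job-1", fake_fetch, fake_ingest, sleep=lambda s: None))  # 899
```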

@glorv
Contributor

glorv commented Nov 12, 2020

Since the epoch not match errors are caused by regions in the same range being split concurrently, here is a setup that reproduces this issue:

  • The source data contains a randomly generated integer primary key field.
  • Run Lightning with a large enough table-concurrency (the import engine concurrency is table-concurrency * 2, so table-concurrency = 6 allows 12 engines to run in the import phase). Set mydumper.batch-size to a smaller value (e.g. 1GiB or even smaller), so that many engines are always running in the import phase.

Further:
You can manually add back some of the PD scheduler configs, so that PD's region scheduling also contributes to region epoch changes.
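
As a rough illustration, the settings above could look like this in tidb-lightning.toml. This is a sketch, not a verified reproduction config: the directory paths are placeholders, and the batch size is written in bytes on the assumption that this version expects an integer.

```toml
[lightning]
# 6 tables in parallel, so up to 12 engines in the import phase.
table-concurrency = 6

[tikv-importer]
backend = "local"
# Placeholder path for the locally sorted key-value files.
sorted-kv-dir = "/path/to/sorted-kv"

[mydumper]
# Placeholder path for the source data.
data-source-dir = "/path/to/source-data"
# 1 GiB; smaller batches mean more engines per table.
batch-size = 1073741824
```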

@glorv
Contributor

glorv commented Dec 7, 2020

We have optimized the stability of lightning local backend in the past few months:
