Capture list unexpectedly increased by 1 after restarting all PDs #2388

Closed

Tammyxia opened this issue Jul 27, 2021 · 7 comments

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.

  • 2x capture:

    Starting component cdc: /root/.tiup/components/cdc/v5.1.0/cdc cli capture list --pd=http://172.16.6.24:2379
    [
      {
        "id": "3378b726-26cb-4963-8280-3ee679024a76",
        "is-owner": false,
        "address": "172.16.6.32:8300"
      },
      {
        "id": "d2600da4-cdc4-4420-84a4-e57e826ffbc7",
        "is-owner": true,
        "address": "172.16.6.31:8300"
      }
    ]

  • Restart all PD: $ tiup cluster restart 360UP -R pd

  • Check the capture list (a small polling sketch for this check follows the report).

  2. What did you expect to see?

  • 2x capture, and their status is normal.

  3. What did you see instead?
  • The capture list unexpectedly showed 3 captures for a short time after the PD restart:

    Starting component cdc: /root/.tiup/components/cdc/v5.1.0/cdc cli capture list --pd=http://172.16.6.24:2379
    [
      {
        "id": "2b0211bd-f3fc-4551-b4d1-a5c6bba5818e",
        "is-owner": false,
        "address": "172.16.6.32:8300"
      },
      {
        "id": "3378b726-26cb-4963-8280-3ee679024a76",
        "is-owner": false,
        "address": "172.16.6.32:8300"
      },
      {
        "id": "d2600da4-cdc4-4420-84a4-e57e826ffbc7",
        "is-owner": true,
        "address": "172.16.6.31:8300"
      }
    ]

    After waiting several seconds, the capture list showed 2 captures, as expected:

    Starting component cdc: /root/.tiup/components/cdc/v5.1.0/cdc cli capture list --pd=http://172.16.6.24:2379
    [
      {
        "id": "2b0211bd-f3fc-4551-b4d1-a5c6bba5818e",
        "is-owner": true,
        "address": "172.16.6.32:8300"
      },
      {
        "id": "4b8e6f29-847d-4caf-bc7a-ea8cba317a28",
        "is-owner": false,
        "address": "172.16.6.31:8300"
      }
    ]

  4. Versions of the cluster

  • Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

    4.0.14

  • TiCDC version (execute cdc version):

    [release-version=v4.0.14] [git-hash=5a7851967f686da896b45acd3f3e968bfe53d6bd] [git-branch=heads/refs/tags/v4.0.14]
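
For reference, a minimal sketch of how the capture-count check above could be automated during the reproduction. It only re-runs the cdc cli capture list command shown in the report and counts the returned entries; the binary path and PD address come from the report, while the poll interval and the program itself are assumptions, not part of the original reproduction.

// watch_captures.go: repeatedly run `cdc cli capture list` and print how many
// captures are registered, to catch the short window with 3 entries.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

// capture mirrors the fields printed by `cdc cli capture list`.
type capture struct {
	ID      string `json:"id"`
	IsOwner bool   `json:"is-owner"`
	Address string `json:"address"`
}

func listCaptures() ([]capture, error) {
	// Binary path and PD endpoint are the ones from the report above.
	out, err := exec.Command(
		"/root/.tiup/components/cdc/v5.1.0/cdc",
		"cli", "capture", "list", "--pd=http://172.16.6.24:2379",
	).Output()
	if err != nil {
		return nil, err
	}
	var captures []capture
	if err := json.Unmarshal(out, &captures); err != nil {
		return nil, err
	}
	return captures, nil
}

func main() {
	for {
		captures, err := listCaptures()
		if err != nil {
			fmt.Println("capture list failed:", err)
		} else {
			fmt.Printf("%s: %d capture(s)\n", time.Now().Format(time.RFC3339), len(captures))
		}
		time.Sleep(2 * time.Second) // assumed poll interval
	}
}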
      
@Tammyxia Tammyxia added type/bug The issue is confirmed as a bug. severity/minor labels Jul 27, 2021
@asddongmen asddongmen added bug-from-internal-test Bugs found by internal testing. component/status-server Status server component. difficulty/easy Easy task. labels Jul 28, 2021
@3AceShowHand (Contributor) commented:

  • Before the restart: 3378b726-26cb-4963-8280-3ee679024a76_172.16.6.32:8300_false / d2600da4-cdc4-4420-84a4-e57e826ffbc7_172.16.6.31:8300_true
  • Right after the restart: 2b0211bd-f3fc-4551-b4d1-a5c6bba5818e_172.16.6.32:8300_false / 3378b726-26cb-4963-8280-3ee679024a76_172.16.6.32:8300_false / d2600da4-cdc4-4420-84a4-e57e826ffbc7_172.16.6.31:8300_true
  • A few seconds later: 2b0211bd-f3fc-4551-b4d1-a5c6bba5818e_172.16.6.32:8300_true / 4b8e6f29-847d-4caf-bc7a-ea8cba317a28_172.16.6.31:8300_false

@3AceShowHand (Contributor) commented:

  1. A new capture (2b02) was assigned to 6.32.
  2. The old captures (3378 / d260) were dropped, and the new capture on 6.32 became the owner.
  3. A new capture (4b8e) was assigned to 6.31.

  • When and how is the capture ID generated and assigned?
  • Is there any mechanism to prevent two captures from being assigned to the same server? (A sketch of one possible registration scheme follows this list.)
    • If there is, we should make sure it still holds after the old capture gets dropped.
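
To make the duplicate-capture question concrete, here is a minimal illustrative sketch of a lease-based registration scheme; the key prefix, the 10-second TTL, the helper names, and the use of google/uuid are assumptions for illustration, not TiCDC's actual code. The point it shows: each process start generates a fresh capture ID, and if the previous process could not delete its key (for example, because etcd was unreachable), the stale entry stays visible until its lease expires, which would match the short window with 3 captures.

// Illustrative only, not TiCDC's actual code: register a capture's info in
// etcd under a lease. A fresh ID is generated on every process start, so a
// restarted capture on the same address shows up as a new entry, and a key
// the old process failed to delete lingers until its lease TTL expires.
package main

import (
	"context"
	"encoding/json"
	"time"

	"github.com/google/uuid"
	clientv3 "go.etcd.io/etcd/client/v3"
)

type captureInfo struct {
	ID      string `json:"id"`
	Address string `json:"address"`
}

func registerCapture(ctx context.Context, cli *clientv3.Client, addr string) (string, error) {
	info := captureInfo{ID: uuid.New().String(), Address: addr}
	value, err := json.Marshal(info)
	if err != nil {
		return "", err
	}
	// Hypothetical 10-second TTL: a stale entry lingers at most this long.
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		return "", err
	}
	// Key layout is illustrative.
	key := "/tidb/cdc/capture/" + info.ID
	if _, err := cli.Put(ctx, key, string(value), clientv3.WithLease(lease.ID)); err != nil {
		return "", err
	}
	return info.ID, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://172.16.6.24:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := registerCapture(ctx, cli, "172.16.6.32:8300"); err != nil {
		panic(err)
	}
}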

@3AceShowHand (Contributor) commented:

// Cleanup on capture exit: best-effort deletion of this capture's info from etcd.
defer func() {
	// Only 5 seconds are allowed for the delete; if etcd is unreachable
	// (e.g. while all PDs are restarting), this times out and the stale
	// capture info is left behind.
	timeoutCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	if err := ctx.GlobalVars().EtcdClient.DeleteCaptureInfo(timeoutCtx, c.info.ID); err != nil {
		log.Warn("failed to delete capture info when capture exited", zap.Error(err))
	}
	cancel()
}()

When all PDs restart, CDC also loses contact with etcd, so the 5-second timeout is too small for the delete to succeed; setting a larger timeout could be a workaround.
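
A minimal sketch of that workaround, assuming a helper extracted from the defer above; the 30-second deadline, the retry loop, and the helper and interface names are assumptions, not the actual fix.

// Sketch of the suggested workaround: give the exit-time cleanup a larger
// deadline and retry the delete, so a short PD/etcd outage does not leave a
// stale capture entry behind.
package capture // illustrative package name

import (
	"context"
	"time"

	"github.com/pingcap/log"
	"go.uber.org/zap"
)

// captureInfoDeleter is the subset of the etcd client used here, taken from
// the DeleteCaptureInfo call quoted above.
type captureInfoDeleter interface {
	DeleteCaptureInfo(ctx context.Context, captureID string) error
}

// cleanupTimeout is an assumed value, deliberately larger than the current 5s.
const cleanupTimeout = 30 * time.Second

func deleteCaptureInfoWithRetry(client captureInfoDeleter, captureID string) {
	ctx, cancel := context.WithTimeout(context.Background(), cleanupTimeout)
	defer cancel()
	for {
		err := client.DeleteCaptureInfo(ctx, captureID)
		if err == nil {
			return
		}
		log.Warn("failed to delete capture info, retrying", zap.Error(err))
		select {
		case <-ctx.Done():
			log.Warn("gave up deleting capture info when capture exited", zap.Error(ctx.Err()))
			return
		case <-time.After(time.Second):
		}
	}
}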

@3AceShowHand (Contributor) commented:

When the CDC servers cannot get in touch with PD, they hit PD-related errors, fail to run, and then get dropped.

@3AceShowHand (Contributor) commented:

Dropping the old capture info and putting the new capture info should be atomic.
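
A minimal sketch of what that could look like, assuming direct use of etcd clientv3; the key prefix, the helper name, and the example IDs (taken from this report) are for illustration only, not TiCDC's actual implementation.

// Illustrative sketch: replace the old capture info with the new one in a
// single etcd transaction, so readers of `capture list` never observe both
// entries at once.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// swapCaptureInfo atomically deletes the stale capture key and writes the new
// one. The "/tidb/cdc/capture/" prefix is illustrative.
func swapCaptureInfo(ctx context.Context, cli *clientv3.Client, oldID, newID, newInfoJSON string) error {
	oldKey := "/tidb/cdc/capture/" + oldID
	newKey := "/tidb/cdc/capture/" + newID

	// Both operations commit together or not at all.
	resp, err := cli.Txn(ctx).Then(
		clientv3.OpDelete(oldKey),
		clientv3.OpPut(newKey, newInfoJSON),
	).Commit()
	if err != nil {
		return err
	}
	if !resp.Succeeded {
		return fmt.Errorf("transaction did not commit")
	}
	return nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://172.16.6.24:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	err = swapCaptureInfo(ctx, cli,
		"3378b726-26cb-4963-8280-3ee679024a76", // stale ID from the report
		"2b0211bd-f3fc-4551-b4d1-a5c6bba5818e", // new ID from the report
		`{"id":"2b0211bd-f3fc-4551-b4d1-a5c6bba5818e","address":"172.16.6.32:8300"}`)
	if err != nil {
		panic(err)
	}
}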

@3AceShowHand (Contributor) commented:

the problem only happens in v4.0.14

@3AceShowHand (Contributor) commented:

close with #2388

@AkiraXie AkiraXie added the area/ticdc Issues or PRs related to TiCDC. label Mar 9, 2022