[ISSUE #5989] Support unique broker-id as identification in controller mode #6100

Merged
merged 50 commits into from
Mar 14, 2023

TheR1sing3un
Member

What is the purpose of the change

fix #5989

Brief changelog

XX

Verifying this change

XXXX

@TheR1sing3un TheR1sing3un changed the title [ISSUE#5989] Support new [ISSUE#5989] Support unique broker-id as identification in controller mode Feb 16, 2023
@TheR1sing3un TheR1sing3un changed the title [ISSUE#5989] Support unique broker-id as identification in controller mode [ISSUE #5989] Support unique broker-id as identification in controller mode Feb 16, 2023
@RongtongJin RongtongJin self-requested a review February 17, 2023 01:29
@RongtongJin RongtongJin added this to the 5.1.1 milestone Feb 17, 2023
@RongtongJin RongtongJin added module/ha high availably related module/controller labels Feb 18, 2023
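Contributor

@RongtongJin RongtongJin left a comment

Is there a way for this scheme to support a reasonably compatible upgrade? For example:
1. Upgrade the controller component first (this may require stopping all controllers and deleting their data before upgrading). After the upgrade, brokers lose the ability to elect a master but still work normally (the minimum requirement).
2. Upgrade the Broker component, ensuring that brokers come online normally after the upgrade without losing data. (Ideally the master-slave relationship is preserved.)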

@mxsm
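Member

mxsm commented Feb 22, 2023

> Is there a way for this scheme to support a reasonably compatible upgrade? For example: 1. Upgrade the controller component first (this may require stopping all controllers and deleting their data before upgrading). After the upgrade, brokers lose the ability to elect a master but still work normally (the minimum requirement). 2. Upgrade the Broker component, ensuring that brokers come online normally after the upgrade without losing data. (Ideally the master-slave relationship is preserved.)

@TheR1sing3un Please follow up by updating the corresponding documents (Chinese and English) under https://github.com/apache/rocketmq/tree/develop/docs/cn/controller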

@codecov-commenter

codecov-commenter commented Feb 22, 2023

Codecov Report

Merging #6100 (6238611) into develop (7b23042) will decrease coverage by 0.13%.
The diff coverage is 38.05%.

@@              Coverage Diff              @@
##             develop    #6100      +/-   ##
=============================================
- Coverage      43.14%   43.02%   -0.13%     
- Complexity      8860     8889      +29     
=============================================
  Files           1094     1103       +9     
  Lines          77284    77718     +434     
  Branches       10085    10115      +30     
=============================================
+ Hits           33347    33437      +90     
- Misses         39771    40091     +320     
- Partials        4166     4190      +24     
Impacted Files Coverage Δ
...a/org/apache/rocketmq/broker/BrokerController.java 43.72% <0.00%> (-2.85%) ⬇️
...ocketmq/broker/processor/AdminBrokerProcessor.java 25.31% <0.00%> (ø)
...g/apache/rocketmq/client/impl/MQClientAPIImpl.java 23.19% <0.00%> (+0.03%) ⬆️
...c/main/java/org/apache/rocketmq/common/MixAll.java 41.44% <ø> (ø)
...ocketmq/controller/impl/event/EventSerializer.java 66.66% <0.00%> (-2.30%) ⬇️
...ontroller/impl/event/UpdateBrokerAddressEvent.java 0.00% <0.00%> (ø)
...etmq/controller/impl/heartbeat/BrokerLiveInfo.java 48.07% <0.00%> (ø)
...apache/rocketmq/remoting/protocol/RequestCode.java 0.00% <ø> (ø)
...pache/rocketmq/remoting/protocol/ResponseCode.java 0.00% <ø> (ø)
...tmq/remoting/protocol/body/BrokerReplicasInfo.java 0.00% <0.00%> (ø)
... and 47 more

... and 16 files with indirect coverage changes

@RongtongJin RongtongJin changed the base branch from develop to dledger-controller-brokerId February 27, 2023 08:15
@RongtongJin RongtongJin changed the base branch from dledger-controller-brokerId to develop February 27, 2023 08:15
@TheR1sing3un
Member Author

TheR1sing3un commented Mar 6, 2023

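Upgrade plan

Upgrades from versions below 5.0 follow the previous upgrade steps.
Upgrades from 5.0 and above must follow the steps below.

Upgrading the controller

We changed many internal properties of the state machine in the controller, as well as the logic for allocating brokerIds in the new version, so the controller must be upgraded.

  1. Take the old-version Controller offline.
  2. Clear the Controller data, i.e. the DLedger data files; the default path is ~/DLedgerController
  3. Bring the new-version Controller online.

Upgrading the Broker

  1. Stop the Broker slave nodes.
  2. Stop the Broker master node.
  3. Delete the epoch files of all Brokers, by default ~/store/epochFileCheckpoint and ~/store/epochFileCheckpoint.bak
  4. Bring the former master Broker online first and wait until it is successfully elected master. (This can be checked with the admin command getSyncStateSet.)
  5. Bring all former slave Brokers online.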
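
The ordering in the Broker steps above (slaves stop first, the former master starts first) is what preserves the original master-slave relationship. As a rough illustration only, the sequence can be sketched in Java. The class `BrokerUpgradePlanSketch`, the method `plan`, and the step strings are hypothetical and not part of RocketMQ; the sketch only produces an ordered checklist and performs no actions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of RocketMQ: builds the ordered Broker upgrade checklist. */
final class BrokerUpgradePlanSketch {

    /**
     * Returns the upgrade steps for one broker group in the order described above:
     * stop slaves, stop master, delete epoch files everywhere,
     * start the former master, then start the former slaves.
     */
    static List<String> plan(String master, List<String> slaves) {
        List<String> steps = new ArrayList<>();
        // 1. Stop the slave nodes first.
        for (String slave : slaves) {
            steps.add("stop " + slave);
        }
        // 2. Then stop the master node.
        steps.add("stop " + master);
        // 3. Delete the epoch files on every broker in the group.
        steps.add("delete epoch files on " + master);
        for (String slave : slaves) {
            steps.add("delete epoch files on " + slave);
        }
        // 4. Bring the former master online first and wait for its election.
        steps.add("start " + master + " and wait until it is elected master");
        // 5. Bring the former slaves online.
        for (String slave : slaves) {
            steps.add("start " + slave);
        }
        return steps;
    }
}
```

For a group with master `broker0` and one slave `broker1`, `plan("broker0", Arrays.asList("broker1"))` yields six steps, beginning with `stop broker1` and ending with `start broker1`.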

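Testing

Start a namesrv

nohup sh bin/mqnamesrv &

Start an old-version controller

nohup sh bin/mqcontroller -c ./conf/controller/controller-standalone.conf &

Check that the controller started correctly

sh bin/mqadmin getControllerMetaData -a localhost:9878

image.png

Start the old-version broker0 and broker1, in that order

nohup sh bin/mqbroker -c conf/controller/quick-start/broker-n0.conf &
nohup sh bin/mqbroker -c conf/controller/quick-start/broker-n1.conf &

Check the cluster state

sh bin/mqadmin getSyncStateSet -a localhost:9878 -b broker-a

image.png

Send a message

sh bin/mqadmin sendMessage -p "hello" -n localhost:9876 -b broker-a -t default

image.png

Check that both nodes appended the message correctly

sh bin/mqadmin getBrokerEpoch -n localhost:9876 -b broker-a

image.png

Take the controller offline

Kill the controller process.
Test whether messages can still be sent and received normally at this point.
image.png

Clear the controller data files

image.png

Bring the new-version controller online

image.png

Check that messages can currently be sent and received normally

image.png

Stop the broker slave node first, then the master node

image.png

Clear the epoch files of each broker

image.png

Upgrade the former master broker and bring it online first

image.png

Upgrade the slave broker and bring it online

image.png

Test sending and receiving messages

image.png

Bring a new node, broker2, online

image.png

Take broker0 offline to trigger a master switch

image.png
broker1 becomes master

Take broker1 offline to trigger a master switch

image.png

Test sending and receiving messages

image.png

Restart broker0 and broker1

image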

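Compatibility

| | 5.0 old Controller | New Controller |
| --- | --- | --- |
| 5.0 old Broker | Runs normally; can switch | Runs normally if the master-slave roles are already determined, but cannot switch. If a broker restarts, it cannot come online. |
| New Broker | Cannot come online normally | Runs normally; can switch |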

hzh0425 previously approved these changes Mar 9, 2023
Member

@hzh0425 hzh0425 left a comment

LGTM

Comment on lines 181 to 193
for (int retryTimes = 0; retryTimes < 5; retryTimes++) {
    if (register()) {
        LOGGER.info("First time register broker success");
        this.state = State.REGISTER_TO_CONTROLLER_DONE;
        break;
    }
}
Contributor

Are consecutive retries really needed here? If registration fails, we could simply wait for startBasicService to be run again.
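
One of the later commit messages in this PR mentions a random sleep of up to one second when broker registration fails. As a rough sketch of how a bounded retry with a random pause could look (this is an assumption for illustration, not the code that was merged), the loop below retries a registration callback and sleeps a random interval between failed attempts. The class `RegisterRetrySketch`, the method `registerWithBackoff`, and the reduced `State` enum are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not part of RocketMQ: bounded retry with a random pause between attempts. */
final class RegisterRetrySketch {

    /** Reduced stand-in for the broker-side state referenced in the review snippet. */
    enum State { INITIAL, REGISTER_TO_CONTROLLER_DONE }

    /**
     * Calls {@code register} up to {@code maxAttempts} times. After each failed attempt
     * except the last, sleeps a random time in [0, maxSleepMillis) so that brokers
     * restarting together do not all retry at the same instant.
     */
    static State registerWithBackoff(BooleanSupplier register, int maxAttempts, long maxSleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (register.getAsBoolean()) {
                return State.REGISTER_TO_CONTROLLER_DONE;
            }
            boolean lastAttempt = attempt == maxAttempts - 1;
            if (!lastAttempt && maxSleepMillis > 0) {
                try {
                    Thread.sleep(ThreadLocalRandom.current().nextLong(maxSleepMillis));
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up; the caller can retry later.
                    Thread.currentThread().interrupt();
                    return State.INITIAL;
                }
            }
        }
        // All attempts failed; the caller stays in its initial state and may rerun later.
        return State.INITIAL;
    }
}
```

Setting `maxAttempts` to 1 corresponds to the reviewer's suggestion: no in-place retry, with the next attempt left to the following run of the surrounding service.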

@RongtongJin
Contributor

RongtongJin commented Mar 11, 2023

image
The UTs seem to be unstable.

@RongtongJin
Contributor

Comment on lines +140 to +164
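### Upgrade notes for the persistent-BrokerID version

The current version supports a new high-availability architecture based on persistent BrokerIDs. Note the following when upgrading to the current version from an earlier 5.x version.

Upgrades from 4.x only need to follow the procedure above.
Upgrades from a 5.x version without persistent BrokerIDs to the persistent-BrokerID version follow this procedure:

**Upgrade the Controller**

1. Stop the old-version Controller group.
2. Clear the Controller data, i.e. the data files under `~/DLedgerController` by default.
3. Bring the new-version Controller group online.

> During the Controller upgrade above, Brokers continue to run normally but cannot switch roles.

**Upgrade the Broker**

1. Stop the Broker slave nodes.
2. Stop the Broker master node.
3. Delete the epoch files of all Brokers, by default `~/store/epochFileCheckpoint` and `~/store/epochFileCheckpoint.bak`.
4. Bring the former master Broker online first and wait for it to be elected master. (Use the `getSyncStateSet` command of `admin` to observe this.)
5. Bring all former slave Brokers online.

> It is recommended to stop the slaves before the master, and to start the former master before the former slaves; this preserves the original master-slave relationship.
> If the master-slave relationship needs to change across the upgrade, make sure the CommitLogs of the master and slaves are aligned at shutdown; otherwise data may be truncated and lost.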
Contributor

I suggest writing a separate document that covers the background of the brokerId persistence scheme (the problem it solves), the design ideas, and the compatible upgrade plan, and then linking to it from the deployment document.

TheR1sing3un and others added 6 commits March 13, 2023 21:54
…ller mode (apache#5046)

* refactor(controller): refactor the register logic

1. refactor the register logic

* refactor(controller): remove unused field in ElectMasterEvent

1. remove unused field in ElectMasterEvent

* feat(controller): add a tryElectMaster request and process logic about it

1. add a tryElectMaster request and process logic about it

* feat(controller): refactor ReplicasInfoManagerTest

1. refactor ReplicasInfoManagerTest

* refactor(controller): refactor DLedgerControllerTest

1. refactor DLedgerControllerTest

* refactor(controller): refactor ControllerManagerTest

1. refactor ControllerManagerTest

* refactor(controller): refactor ReplicasInfoManagerTest

1. refactor ReplicasInfoManagerTest

* refactor(controller): refactor register process and pass the junit test

1. refactor ReplicasInfoManagerTest

* style(broker): rename a constant

1. rename a constant

* feat(controller): update the DLedger dependency from v0.27 to v0.30

1. update the DLedger dependency from v0.27 to v0.30

* style(controller): add a white-line just for trigger GitHub action again

1. add a white-line just for trigger GitHub action again

* feat(controller): combine electMaster api and brokerTryElectMaster api

1. combine electMaster api and brokerTryElectMaster api

* feat(controller): add a logic about verifying the broker id returned from registering

1. add a logic about verifying the broker id returned from registering

* fix(controller): remove unused code and add a warning log in ControllerManager

1. remove unused code and add a warning log in ControllerManager

* fix: resolve conflicts

1. resolve conflicts

* fix(controller): remove unused class

1. remove unused class

* fix(controller): Resolve conflicts after merging

1. Resolve conflicts after merging

* refactor(controller): Refactor ReplicasInfoManager#elect

1. Refactor ReplicasInfoManager#elect

* style(controller): remove unused imports

1. remove unused imports

* style(controller): remove unused imports

1. remove unused imports

* fix(controller): resolve conflicts after merging develop branch

1. resolve conflicts after merging develop branch

* rerun

* fix(controller): resolve conflicts in ReplicasInfoManagerTest#testRegisterNewBroker after merging develop branch

1. resolve conflicts in ReplicasInfoManagerTest#testRegisterNewBroker after merging develop branch

* style(controller): pass style check

1. pass style check
…p address to broker id

1. refactor broker's information recording core from ip address to broker id
1. add protocols about new register flow
1. refactor code in module: store/ha for persistence broker-id
1. implement the general register to controller protocol
TheR1sing3un and others added 24 commits March 13, 2023 22:06
1. add docs about how to update to BrokerId version
…g#storePathTempMetadata`

1. remove meaningless attribute
`MessageStoreConfig#storePathTempMetadata`
1. check metadata if valid when register
… to name server

1. set isolate's value to false to normally register broker to name
server
1. Random sleep within one second when broker register failed
1. rename registerSuccess to registerBrokerToController
…ig back to default value

1. fix forgetting set the changed cluster name broker config back to default value
1. add more logs when broker register to controller
1. fix wrong log
1. fix wrong test
1. fix incompatible command: CleanBrokerMeta
1. fix forgetting initialize BrokerHeartbeatManager
…nagerRegisterTest

1. add more logs when register and refactor ReplicasManagerRegisterTest
1. optimize some test base store path
1. fix conflicts after rebase
1. fix conflicts in test after rebase
TheR1sing3un and others added 2 commits March 14, 2023 02:52
…id the controller to notify the brokers when their roles have been changed.

1. To pass `ControllerManagerTest` in Windows,  we forbid the controller to notify the brokers when their roles have been changed.
@RongtongJin
Contributor

Rebase and merge this PR because its commit log is clear and huge.

@RongtongJin RongtongJin merged commit fac699e into apache:develop Mar 14, 2023
Labels
module/controller module/ha high availably related
5 participants