
Discourage rebalance, warn against stopping it #1298

Draft
feorlen wants to merge 2 commits into main
Conversation

@feorlen marked this pull request as draft August 14, 2024 22:37
@djwfyi (Collaborator) left a comment:

A few suggestions and comments for consideration.
I'll take another look after others have had their say.

@@ -149,12 +149,14 @@ For more information on write preference calculation logic, see :ref:`Writing Fi
Rebalancing data across all pools after an expansion is an expensive operation that requires scanning the entire deployment and moving objects between pools.
This may take a long time to complete depending on the amount of data to move.

Starting with MinIO Client version RELEASE.2022-11-07T23-47-39Z, you can manually initiate a rebalancing operation across all server pools using :mc:`mc admin rebalance`.
MinIO does not recommend manual rebalancing.
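For context on the command under discussion, the workflow looks roughly like this. This is a sketch, not official usage guidance: `myminio` is a hypothetical alias, and `mc` is stubbed as a shell function here so the sequence can be dry-run without a live deployment.

```shell
# Stub so this sketch runs without a live MinIO deployment;
# delete this function to run the commands against a real alias.
mc() { echo "mc $*"; }

mc admin rebalance start myminio    # begin rebalancing objects across all pools
mc admin rebalance status myminio   # poll progress; rebalance can take a long time
mc admin rebalance stop myminio     # halt; risky on releases before the 2024-08-17 fix
```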
Contributor:

Can we remove the recommendation text?

Collaborator (author):

@kannappanr you mean we should say it's ok to manually rebalance?

Collaborator:

Should we instead use a cautionary statement that this feature should be used only in consultation with MinIO Engineering?
The ask we had for this PR was specifically to discourage use of this feature.

Member:

Correct, we have made a release already, and the fixes are also in the EOS binaries, so to some extent we have addressed this already.

We should perhaps take a broader tone: rebalance is not a real requirement if you size your pools properly.

Member:

Basically, discourage budget setups.

Member:

Don't expand in this manner:

First pool:

  • 100 nodes, now 90% used

You botched buying hardware, so you just expand by 20 nodes:

  • 20 nodes as a second pool

These 20 nodes will take the entire I/O hit, causing significant slowness; the sizing must be appropriate to the load the 100-node pool was handling. If 20 nodes can handle it and it's new hardware, no problem; but if not, it is going to cause an outage.

We need cautionary guidance on why rebalance doesn't solve the problem of high utilization on the second pool. It may look like it does, but it won't solve the problem.
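To make the comment's numbers concrete, here is a toy model. It is my assumption of the behavior being described, not MinIO's actual write-preference code: new writes are distributed across pools in proportion to free space.

```python
# Toy model of the behavior described above: writes are distributed in
# proportion to each pool's free space. Numbers follow the comment:
# a 100-node pool at 90% used, expanded with an empty 20-node pool of
# the same per-node capacity.

def write_shares(free_space):
    """Fraction of incoming writes each pool receives, proportional to free space."""
    total = sum(free_space)
    return [f / total for f in free_space]

pool1_free = 100 * 0.10   # 100 nodes, only 10% free
pool2_free = 20 * 1.00    # 20 new nodes, completely empty

shares = write_shares([pool1_free, pool2_free])
print(f"pool 1 (100 nodes): {shares[0]:.0%} of new writes")   # 33%
print(f"pool 2 (20 nodes):  {shares[1]:.0%} of new writes")   # 67%
```

Under this assumption each of the 20 new nodes absorbs roughly ten times the per-node write load of the original pool's nodes, which is the outage risk described above.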

Collaborator (author):

We mention the "2 years" guidance in the hardware checklist, although there isn't a specific and obvious section along the lines of "How big should I make my pool?" (I noticed that AJ's blog from January recommends a minimum of 3 years of capacity.)

I think we can reinforce this in the Storage section of the hardware checklist, and maybe mention it elsewhere too, like the concepts page, to make the point that tacking on a bit of new capacity here and there doesn't go well and is not a reliable plan.

Comment on lines +342 to +343
For deployments with multiple server pools, each individual pool may have its own hardware configuration.
However, significant capacity differences between pools may temporarily result in high loads on a new pool's nodes during :ref:`expansion <expand-minio-distributed>`. For more information, see :ref:`How do I manage object distribution across a MinIO deployment? <minio-rebalance>`.
Collaborator (author):

Does it make sense to say this in the hardware checklist?

Comment on lines +147 to +151
As the new pool fills, write operations eventually balance out across all pools in the deployment.
Until then, the new pool's nodes may experience higher loads and slower writes.

To reduce this temporary performance impact, MinIO recommends expanding a deployment well before its existing pools are near capacity and with new pools of a similar size.
For more information on write preference calculation logic, see :ref:`Writing Files <minio-writing-files>`.
Collaborator (author):

Accurate? Sufficient?

Other mentions of pool sizing link to this section

Comment on lines +166 to +168
Since a pool with more free space has a higher probability of being written to, the nodes of that pool may experience higher loads as free space equalizes.

If required, you can manually initiate a rebalance procedure with :mc:`mc admin rebalance`.
Collaborator (author):

Explain what happens if pools have very different available free space. Is this text an accurate characterization?

Comment on lines +139 to +144
.. admonition:: Stopping a rebalance job on previous versions of MinIO may cause data loss
:class: warning

   A bug in MinIO releases prior to :minio-release:`RELEASE.2024-08-17T01-24-54Z` can overwrite objects while stopping an in-progress rebalance operation.
Interrupting rebalance on these older versions may result in data loss.
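The release cutoff in this admonition can be checked mechanically: MinIO release tags embed an ISO-8601 timestamp, so lexicographic comparison orders them chronologically. A bash sketch, where the `CURRENT` value is a hypothetical example to substitute with your server's release:

```shell
FIXED="RELEASE.2024-08-17T01-24-54Z"
CURRENT="RELEASE.2022-11-07T23-47-39Z"   # hypothetical example release

# ISO-8601 timestamps sort lexicographically, so plain string
# comparison orders release tags chronologically.
if [[ "$CURRENT" < "$FIXED" ]]; then
  echo "affected: do not stop an in-progress rebalance on this release"
else
  echo "contains the fix: stopping rebalance is safe"
fi
```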

Collaborator (author):

Is there a usual way we reference things like this? What should we say about the now fixed bug?


feorlen commented Aug 22, 2024

@kannappanr @harshavardhana I made several edits with proposed text that is less scary about rebalance. I'd appreciate another look.

I left the warning about stopping, but only for older versions. What should we say about that?

@feorlen requested a review from ravindk89 August 23, 2024 13:39
4 participants