Data may be stored on only a single server shortly after startup #10657
I've been experimenting with different settings to see how long it takes for all regions to be fully replicated. With the settings we were using (…), it looks like most regions got to 3 or 2 replicas early in the test, but for some reason new regions keep getting created and old ones destroyed, while no progress is being made on the original regions. I've set up Jepsen 0268340 to log region IDs and replica counts: in an hour and 20 minutes, we go from
to
That final region, id=4055, gets replaced by a new region with a higher ID every few seconds. I'm not sure why this is the case--we're not actually making any writes, or even connecting clients to this cluster at this point. It looks pretty well stuck. Full logs are here: 20190531T121838.000-0400.zip

With Jepsen 0268340, try something like this to reproduce. It may take a few runs--it doesn't seem to get stuck every time.
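For reference, a minimal sketch of the kind of polling described above (not the actual Jepsen code, which is Clojure): it queries PD for all regions and logs how many regions currently have each replica count, plus the highest region ID. It assumes PD's HTTP API serves region info at `/pd/api/v1/regions` with `id` and `peers` fields per region; the PD address is a placeholder.

```python
import json
import time
import urllib.request
from collections import Counter

PD_URL = "http://192.168.1.101:2379"  # placeholder PD address

def fetch_regions():
    # GET the full region list from PD's HTTP API (assumed endpoint).
    with urllib.request.urlopen(f"{PD_URL}/pd/api/v1/regions") as resp:
        return json.load(resp).get("regions") or []

while True:
    regions = fetch_regions()
    # Histogram: replica count -> number of regions with that many peers.
    histogram = Counter(len(r.get("peers", [])) for r in regions)
    highest_id = max((r["id"] for r in regions), default=None)
    print(f"{time.strftime('%H:%M:%S')} "
          f"replicas->regions={dict(sorted(histogram.items()))} "
          f"highest_region_id={highest_id}")
    time.sleep(10)
```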
PTAL @nolouch
@aphyr How do you check the bootstrap step? It takes 80+ minutes in your test, but from the log it looks like it only took about a minute, e.g. region 46:

you can grep

and a better way to bootstrap the cluster is to wait for the first region to build up its replicas to the configured count.
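A rough sketch of that blocking step, under the same assumption about PD's `/pd/api/v1/regions` endpoint: poll until the bootstrap region (assumed here to be the one with the lowest ID) has the target number of peers, and only then start TiDB. The address and target count are placeholders.

```python
import json
import time
import urllib.request

PD_URL = "http://192.168.1.101:2379"  # placeholder PD address
TARGET_REPLICAS = 3                   # should match PD's configured replica count

def first_region_replicated():
    with urllib.request.urlopen(f"{PD_URL}/pd/api/v1/regions") as resp:
        regions = json.load(resp).get("regions") or []
    if not regions:
        return False
    # Assumption: the bootstrap region is the one with the lowest ID.
    first = min(regions, key=lambda r: r["id"])
    return len(first.get("peers", [])) >= TARGET_REPLICAS

# Block here during cluster setup, before starting TiDB.
while not first_region_replicated():
    time.sleep(1)
print("first region fully replicated; starting TiDB now")
```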
We check for bootstrapping by performing an HTTP GET of PD's region endpoint.

Region 46 starting quickly is great, but I'm concerned that the highest region never seems to stabilize--do you know what might be going on there?
OK! I'll rewrite the setup code to block before starting TiDB. Do you think running TiDB could be preventing the final region from converging?
The final region is always the last range from
TiDB 3.0.0-rc.2, by design, starts up with a region with only a single replica, regardless of the configured target number of replicas. PD then gradually adds additional replicas until `target-replicas` is reached. In addition, any regions which are split from this initial region also start with the same number of replicas as the region they were split from, until PD can expand them.

This is not a problem in itself, but it does lead to an awkward possibility: in the early stages of a TiDB cluster, data may be acknowledged, but stored only on a single node, when the user expected that data to be replicated to multiple nodes. A single-node failure during that period could destroy acknowledged writes, or render the cluster partly, or totally, unusable. In our tests with Jepsen (which use small regions and are extra-sensitive to this phenomenon), a single-node network partition as late as 500 seconds into the test can result in a total outage, because some regions are only replicated to 1 or 2, rather than 3, nodes.
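To make that exposure window concrete, here's a small point-in-time check, again assuming the `/pd/api/v1/regions` endpoint sketched earlier: it lists regions whose Raft group currently has fewer peers than expected. Run early in a cluster's life, it should show exactly the under-replicated regions described above.

```python
import json
import urllib.request

PD_URL = "http://192.168.1.101:2379"  # placeholder PD address
EXPECTED_REPLICAS = 3

with urllib.request.urlopen(f"{PD_URL}/pd/api/v1/regions") as resp:
    regions = json.load(resp).get("regions") or []

# Regions whose Raft group is smaller than expected right now.
under = [(r["id"], len(r.get("peers", []))) for r in regions
         if len(r.get("peers", [])) < EXPECTED_REPLICAS]
print(f"{len(under)} of {len(regions)} regions are under-replicated: {under}")
```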
The configuration parameter for replica count is called `max-replicas`, which is sort of an odd name, because regions could have exactly `max-replicas` (the happy case), fewer than `max-replicas` (e.g. PD hasn't gotten around to resizing that region yet), or more than `max-replicas` (e.g. during handoff when a node is declared dead). It might be best to call this `target-replicas`?

I'd also like to suggest that when a cluster has a configured replica count, TiDB should disallow transactions on regions which don't have at least that many replicas in their Raft group. That'd prevent the possibility of a single-node failure destroying committed data, which is something I'm pretty sure users don't expect to be possible!
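As an operational stopgap, checks like the one above could at least read the replica target from PD rather than hard-coding it. This sketch assumes PD exposes its replication config, including `max-replicas`, at `/pd/api/v1/config/replicate`; the path and field names may differ by version.

```python
import json
import urllib.request

PD_URL = "http://192.168.1.101:2379"  # placeholder PD address

# Read the replication config from PD (assumed endpoint and field name).
with urllib.request.urlopen(f"{PD_URL}/pd/api/v1/config/replicate") as resp:
    replication = json.load(resp)

max_replicas = replication.get("max-replicas")
print(f"configured max-replicas (i.e. the replica target): {max_replicas}")
# A check like the one above could then compare each region's peer count
# against this value instead of a hard-coded 3.
```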