Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix the issue that health check may set liveness wrongly #1127

Merged
merged 10 commits into from
Jan 24, 2024

Conversation

zyguan
Copy link
Contributor

@zyguan zyguan commented Jan 22, 2024

close #1111

@zyguan zyguan marked this pull request as ready for review January 22, 2024 16:42
@zyguan
Copy link
Contributor Author

zyguan commented Jan 22, 2024

@cfzjywxk PTAL

@cfzjywxk
Copy link
Contributor

/cc @crazycs520 PTAL

Comment on lines 2620 to +2622
newStore := &Store{storeID: s.storeID, addr: addr, peerAddr: store.GetPeerAddress(), saddr: store.GetStatusAddress(), storeType: storeType, labels: store.GetLabels(), state: uint64(resolved)}
newStore.livenessState = atomic.LoadUint32(&s.livenessState)
newStore.unreachableSince = s.unreachableSince
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about abstracting a newStore function to force specify the livenessState and unreachableSince as the input parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about to handle it later (as a part of #1104). There are 4 &Store{...} in region_cache.go now, I'd like to handle them togather according to the store lifecycle. Let this PR fix the corresponding issue by now.

@@ -2783,50 +2785,85 @@ func (s *Store) requestLivenessAndStartHealthCheckLoopIfNeeded(bo *retry.Backoff
// It may be already started by another thread.
if atomic.CompareAndSwapUint32(&s.livenessState, uint32(reachable), uint32(liveness)) {
s.unreachableSince = time.Now()
go s.checkUntilHealth(c)
reResolveInterval := 30 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to make it a constant value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rawkv tests use failpoint to set it shorter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

var DefReResolveInterval = 30 * time.Second // global scope.

func (s *Store) checkUntilHealth(c *RegionCache, liveness livenessState){
  reResolveInterval = DefReResolveInterval
  ...
}

and if test needs a shorter reResolveInterval, change DefReResolveInterval directly, no need to use the following failpoint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

var DefReResolveInterval = 30 * time.Second // global scope.

func (s *Store) checkUntilHealth(c *RegionCache, liveness livenessState){
  reResolveInterval = DefReResolveInterval
  ...
}

and if test needs a shorter reResolveInterval, change DefReResolveInterval directly, no need to use the following failpoint.

Personally I do not prefer that. The global var is required to be handled carefully to avoid data race in ut. I've struggled with SetRegionCacheTTLSec in #1122 for about an hour.

logutil.BgLogger().Info("[health check] store meta deleted, stop checking", zap.Uint64("storeID", s.storeID), zap.String("addr", s.addr))
if s.getResolveState() == deleted {
// if the store is deleted, a new store with same id must be inserted (guaranteed by reResolve).
newStore, _ := c.getStore(s.storeID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to add an assertion newStore != nil here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's guaranteed by reResolve, changeToActiveStore also rely on it.

internal/locate/region_cache_test.go Outdated Show resolved Hide resolved
Signed-off-by: zyguan <[email protected]>
@cfzjywxk cfzjywxk merged commit e79e800 into tikv:master Jan 24, 2024
10 checks passed
zyguan added a commit to zyguan/client-go that referenced this pull request Jan 25, 2024
* fix the issue that health check may set liveness wrongly

Signed-off-by: zyguan <[email protected]>

* fix lint issue

Signed-off-by: zyguan <[email protected]>

* fix rawkv ut

Signed-off-by: zyguan <[email protected]>

* fix data race

Signed-off-by: zyguan <[email protected]>

* use getStore instead of accessing storeMu directly

Signed-off-by: zyguan <[email protected]>

* make TestAccessFollowerAfter1TiKVDown stable

Signed-off-by: zyguan <[email protected]>

* make TestBackoffErrorType stable

Signed-off-by: zyguan <[email protected]>

* address comments

Signed-off-by: zyguan <[email protected]>

---------

Signed-off-by: zyguan <[email protected]>
Co-authored-by: disksing <[email protected]>
Signed-off-by: zyguan <[email protected]>
cfzjywxk pushed a commit that referenced this pull request Jan 25, 2024
* fix the issue that health check may set liveness wrongly



* fix lint issue



* fix rawkv ut



* fix data race



* use getStore instead of accessing storeMu directly



* make TestAccessFollowerAfter1TiKVDown stable



* make TestBackoffErrorType stable



* address comments



---------

Signed-off-by: zyguan <[email protected]>
Co-authored-by: disksing <[email protected]>
@Tema
Copy link
Contributor

Tema commented Apr 27, 2024

@zyguan @cfzjywxk could we please cherry-pick it into release-7.5?

zyguan added a commit to zyguan/client-go that referenced this pull request Apr 28, 2024
disksing pushed a commit that referenced this pull request Apr 28, 2024
@zyguan
Copy link
Contributor Author

zyguan commented Apr 28, 2024

@zyguan @cfzjywxk could we please cherry-pick it into release-7.5?

Sure, #1324 is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

An unready store might be marked as reachable
5 participants