[dbnode] Add reason tag to bootstrap retries metric #3317

Merged
vpranckaitis merged 13 commits into master from vilius/retry_metric_reason on Mar 5, 2021

vpranckaitis
Collaborator

What this PR does / why we need it:

Adds reason tag to bootstrap retries metric.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE

codecov bot commented Mar 4, 2021

Codecov Report

Merging #3317 (61967a8) into master (0068083) will decrease coverage by 0.0%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##           master    #3317     +/-   ##
=========================================
- Coverage    72.5%    72.5%   -0.1%     
=========================================
  Files        1099     1099             
  Lines      101616   101627     +11     
=========================================
+ Hits        73679    73684      +5     
- Misses      22861    22862      +1     
- Partials     5076     5081      +5     
Flag Coverage Δ
aggregator 76.4% <ø> (+<0.1%) ⬆️
cluster 84.9% <ø> (ø)
collector 84.3% <ø> (ø)
dbnode 78.9% <100.0%> (+<0.1%) ⬆️
m3em 74.4% <ø> (ø)
m3ninx 73.5% <ø> (ø)
metrics 20.0% <ø> (ø)
msg 74.0% <ø> (-0.3%) ⬇️
query 67.4% <ø> (ø)
x 80.5% <ø> (+<0.1%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0068083...88f45ee.

Comment on lines +112 to +113
signalCh <- struct{}{}
assert.False(t, setup.DB().IsBootstrapped(), "database should not yet be bootstrapped")
Collaborator

Shouldn't signal come after assert to avoid timing issues?

Collaborator Author

All these IsBootstrapped() asserts are in between first and last signalCh writes, which should guarantee that all of them are called during bootstrap.

Collaborator

After you send a signal, the code that was waiting for it starts executing, but you cannot know whether it will take one nanosecond or 10 seconds to complete. Depending on that timing, your assert would be checking the IsBootstrapped value either before or after the first bootstrap iteration. Not really sure if this is desirable here.

Collaborator Author

I do not think it matters much whether the assert runs during the first bootstrapper pass, just before the second pass, or in between them. As long as the DB is not marked bootstrapped while bootstrapping is in progress (which should match the period between the first and last signalCh writes, plus some variable amount of time before and after that), and as long as it is marked bootstrapped once bootstrapping is done, it should be fine.

I guess I could make it so that two signals are needed for the bootstrapper to unblock. That way I could squeeze the assert in between them, which would guarantee that it executes exactly on the n-th bootstrapper pass. Though I'm not sure we would gain anything from this.

Collaborator

But aren't you concerned that you could be advancing the clock at some unpredictable moment? Perhaps even before the first bootstrap pass has calculated the ranges to bootstrap?

Collaborator (linasm, Mar 5, 2021)

My bad, I was assuming the ranges are calculated within bs.Bootstrap(ctx, namespaces, cache) and not before invoking the anonymous fn.
Just make sure to rename noopNone to noopAll in places where it gets NewNoOpAllBootstrapperProvider assigned.

Collaborator Author

Added double signal for more exact control when clock is advanced 4181359
Refactored noopNone 61967a8
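
The double-signal idea from this thread can be illustrated with a small standalone sketch. It is not the code from commit 4181359: `gatedPass` and `runGated` are hypothetical names, and a plain flag stands in for the database's bootstrapped state. The point it shows is that, with an unbuffered channel, a check placed between the two sends is guaranteed to run while the pass is still parked, regardless of goroutine scheduling.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gatedPass is a hypothetical stand-in for one bootstrapper pass. It needs
// two signals to unblock, so the caller can run checks between them.
func gatedPass(signalCh <-chan struct{}, done *int32) {
	<-signalCh // first signal: the pass has been reached and is now parked
	<-signalCh // second signal: the caller has finished its checks
	atomic.StoreInt32(done, 1)
}

// runGated drives one gated pass and reports what was observed between the
// two signals and after the pass finished.
func runGated() (duringPass, afterPass bool) {
	signalCh := make(chan struct{}) // unbuffered: a send returns only once received
	finished := make(chan struct{})
	var done int32

	go func() {
		defer close(finished)
		gatedPass(signalCh, &done)
	}()

	signalCh <- struct{}{}
	// The pass cannot have completed here: it is blocked on the second receive.
	duringPass = atomic.LoadInt32(&done) == 1
	signalCh <- struct{}{}

	<-finished
	afterPass = atomic.LoadInt32(&done) == 1
	return duringPass, afterPass
}

func main() {
	during, after := runGated()
	fmt.Println(during, after) // prints "false true"
}
```

With a single signal, the same check would race with the pass itself, which is the timing concern raised above.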

Comment on lines 135 to 136
require.NotNil(t, retryCounter)
assert.Equal(t, int64(1), retryCounter.Value())
Collaborator

I believe it is also worth checking the other value here.

Collaborator Author

Added unit tests for this: e41c8b7

Collaborator

I'm not sure it was worth adding another test suite; I think TestBootstrapRetriesDueToError already covered the error type check. What I meant was to check that there was:

  • one retry because of obsolete range
  • no retries because of some other errors
    (and vice versa in some other tests, depending on the case)

Collaborator Author

Added additional asserts 5afa0b4

Comment on lines 190 to 192
retryCounter := getBootstrapRetriesCounter(testScope, "other")
require.NotNil(t, retryCounter)
assert.Equal(t, int64(1), retryCounter.Value())
Collaborator

Also check the other value (might be worth extracting some kind of assert helper function).

Collaborator Author

Added additional asserts 5afa0b4
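
A rough sketch of the kind of assert helper suggested in this thread follows. It is not the code from commit 5afa0b4. The `counterSnapshot` type, the `checkRetries` helper, the metric name `bootstrap-retries`, the tag key `reason`, and the reason value `obsolete-ranges` are all assumptions made for illustration; only the `other` reason and the `reasonTag` and `valuesByReason` identifiers appear in the review. The helper checks every reason at once, so a test verifies both the expected retry and the absence of retries for other reasons.

```go
package main

import "fmt"

// counterSnapshot is a hypothetical stand-in for a metrics counter snapshot;
// it is not the real test scope type used in the PR.
type counterSnapshot struct {
	name  string
	tags  map[string]string
	value int64
}

const (
	retriesMetric = "bootstrap-retries" // assumed metric name
	reasonTag     = "reason"            // assumed tag key
)

// retriesByReason groups retry counter values by their reason tag.
// Counters without the tag are stored under the empty string.
func retriesByReason(counters []counterSnapshot) map[string]int {
	valuesByReason := make(map[string]int)
	for _, c := range counters {
		if c.name != retriesMetric {
			continue
		}
		reason := ""
		if r, ok := c.tags[reasonTag]; ok {
			reason = r
		}
		valuesByReason[reason] = int(c.value)
	}
	return valuesByReason
}

// checkRetries compares observed and expected retries for every reason in
// either map, so a retry with an unexpected reason also fails the check.
func checkRetries(got, want map[string]int) error {
	for reason, w := range want {
		if got[reason] != w {
			return fmt.Errorf("reason %q: want %d retries, got %d", reason, w, got[reason])
		}
	}
	for reason, g := range got {
		if _, ok := want[reason]; !ok && g != 0 {
			return fmt.Errorf("reason %q: unexpected %d retries", reason, g)
		}
	}
	return nil
}

func main() {
	counters := []counterSnapshot{
		{name: retriesMetric, tags: map[string]string{reasonTag: "other"}, value: 1},
		{name: "unrelated", value: 7},
	}
	got := retriesByReason(counters)
	// "obsolete-ranges" is a hypothetical reason value used for illustration.
	err := checkRetries(got, map[string]int{"other": 1, "obsolete-ranges": 0})
	fmt.Println(err) // prints "<nil>"
}
```

In a real test the error would be passed to something like `require.NoError`, and the snapshots would come from the test scope rather than a hand-built slice.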

src/dbnode/storage/bootstrap_instrumentation.go (outdated, resolved)
@vpranckaitis vpranckaitis requested a review from linasm March 4, 2021 14:14
src/dbnode/storage/bootstrap_instrumentation.go (outdated, resolved)
Comment on lines 82 to 86
if r, ok := counter.Tags()[reasonTag]; ok {
	valuesByReason[r] = int(counter.Value())
} else {
	valuesByReason[""] = int(counter.Value())
}
Collaborator

Nit:

Suggested change
- if r, ok := counter.Tags()[reasonTag]; ok {
- 	valuesByReason[r] = int(counter.Value())
- } else {
- 	valuesByReason[""] = int(counter.Value())
- }
+ reason := ""
+ if r, ok := counter.Tags()[reasonTag]; ok {
+ 	reason = r
+ }
+ valuesByReason[reason] = int(counter.Value())

xerrors "github.com/m3db/m3/src/x/errors"
)

func TestBootstrapFailedMetricReason(t *testing.T) {
Collaborator

As I mentioned in one of the comments above, this test might be overkill; I'd say testing this indirectly from bootstrap_retries_test.go is adequate. Leaving the final decision to you though.

Collaborator Author

Gonna leave it in, in case we add more retry reasons in the future.

Collaborator

On average, writing code "for future" does not pay off :)

@linasm linasm removed their assignment Mar 5, 2021
@vpranckaitis vpranckaitis requested a review from linasm March 5, 2021 08:20
@linasm linasm removed their assignment Mar 5, 2021
Collaborator
@linasm linasm left a comment

LGTM

@vpranckaitis vpranckaitis merged commit 3e78412 into master Mar 5, 2021
@vpranckaitis vpranckaitis deleted the vilius/retry_metric_reason branch March 5, 2021 11:34
soundvibe added a commit that referenced this pull request Mar 9, 2021
* master: (22 commits)
  Remove deprecated fields (#3327)
  Add quotas to Permits (#3333)
  [aggregator] Drop messages that have a drop policy applied (#3341)
  Fix NPE due to race with a closing series (#3056)
  [coordinator] Apply auto-mapping rules if-and-only-if no drop policies are in effect (#3339)
  [aggregator] Add validation in AddTimedWithStagedMetadatas (#3338)
  [coordinator] Fix panic in Ready endpoint for admin coordinator (#3335)
  [instrument] Config option to emit detailed Go runtime metrics only (#3332)
  [aggregator] Sort heap in one go, instead of iterating one-by-one (#3331)
  [pool] Add support for dynamic, sync.Pool backed, object pools (#3334)
  Enable PANIC_ON_INVARIANT_VIOLATED for tests (#3326)
  [aggregator] CanLead for unflushed window takes BufferPast into account (#3328)
  Optimize StagedMetadatas conversion (#3330)
  [m3msg] Improve message scan performance (#3319)
  [dbnode] Add reason tag to bootstrap retries metric (#3317)
  [coordinator] Enable rule filtering on prom metric type (#3325)
  Update m3dbnode-all-config.yml (#3204)
  [coordinator] Include Type in RollupOp.Equal (#3322)
  [coordinator] Simplify iteration logic of matchRollupTarget (#3321)
  [coordinator] Add rollup type to remove specific dimensions (#3318)
  ...