[dbnode] Add reason tag to bootstrap retries metric #3317
Codecov Report
@@            Coverage Diff            @@
##           master   #3317     +/-   ##
=========================================
- Coverage    72.5%   72.5%   -0.1%
=========================================
  Files        1099    1099
  Lines      101616  101627     +11
=========================================
+ Hits        73679   73684      +5
- Misses      22861   22862      +1
- Partials     5076    5081      +5
signalCh <- struct{}{}
assert.False(t, setup.DB().IsBootstrapped(), "database should not yet be bootstrapped")
Shouldn't the signal come after the assert to avoid timing issues?
All these `IsBootstrapped()` asserts are in between the first and last `signalCh` writes, which should guarantee that all of them are called during bootstrap.
After you send a signal, the code that was waiting for the signal starts executing, but you cannot know whether it will take one nanosecond or 10 seconds to complete. Depending on that timing, your assert would be checking the `IsBootstrapped` value either before or after the first bootstrap iteration. Not really sure if this is desirable here.
I do not think it matters much whether the assert is done during the first bootstrapper pass, just before the second pass, or in between them. As long as the DB is not marked bootstrapped while bootstrapping is in progress (which should match the time period between the first and last `signalCh` writes, plus some variable amount of time before and after that), and as long as it's marked bootstrapped once bootstrapping is done, it should be fine.
I guess I could make it so two signals are needed for the bootstrapper to unblock. That way I could squeeze the assert in between them, and that would guarantee exact execution on the n-th bootstrapper pass. Though I'm not sure if we would gain anything from this.
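For reference, the two-signal idea could look roughly like the sketch below. It is not the test code from this PR: `blockingPasses`, `observeDuringPasses`, and the `bootstrapped` flag are hypothetical stand-ins, and the sketch assumes `signalCh` is unbuffered, so a completed send proves the bootstrapper has already received the signal.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// blockingPasses simulates a bootstrapper (hypothetical stand-in) that needs
// two signals to get through each pass. Only after the last pass does it mark
// the database as bootstrapped.
func blockingPasses(passes int, signalCh <-chan struct{}, bootstrapped *int32, done chan<- struct{}) {
	for i := 0; i < passes; i++ {
		<-signalCh // first signal: the pass has been entered
		<-signalCh // second signal: the pass is allowed to finish
	}
	atomic.StoreInt32(bootstrapped, 1)
	close(done)
}

// observeDuringPasses reads the bootstrapped flag between the two signals of
// every pass, and once more after all passes have completed.
func observeDuringPasses(passes int) (during []bool, after bool) {
	var bootstrapped int32
	signalCh := make(chan struct{}) // unbuffered: a send returns only once received
	done := make(chan struct{})
	go blockingPasses(passes, signalCh, &bootstrapped, done)

	for i := 0; i < passes; i++ {
		signalCh <- struct{}{}
		// The bootstrapper is now parked on the second receive of pass i,
		// so this read is guaranteed to happen while that pass is in progress.
		during = append(during, atomic.LoadInt32(&bootstrapped) == 1)
		signalCh <- struct{}{}
	}
	<-done
	return during, atomic.LoadInt32(&bootstrapped) == 1
}

func main() {
	during, after := observeDuringPasses(2)
	fmt.Println(during, after)
}
```

The cost is one extra channel send per pass; whether the added determinism is worth it is the open question from the comment above.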
But aren't you concerned that you could be advancing the clock at some unpredictable moment? Perhaps even before the first bootstrap pass has calculated the ranges to bootstrap?
My bad, I was assuming the ranges are calculated within `bs.Bootstrap(ctx, namespaces, cache)` and not before invoking the anonymous fn.
Just make sure to rename `noopNone` to `noopAll` in places where it gets `NewNoOpAllBootstrapperProvider` assigned.
require.NotNil(t, retryCounter)
assert.Equal(t, int64(1), retryCounter.Value())
I believe it is also worth checking the `other` value here.
Added unit tests for this: e41c8b7
I'm not sure it was worth adding another test suite; I think `TestBootstrapRetriesDueToError` was covering the error type check. What I meant was to check that there was:
- one retry because of obsolete range
- no retries because of some other errors
(and vice versa in some other tests, depending on the case)
Added additional asserts 5afa0b4
retryCounter := getBootstrapRetriesCounter(testScope, "other")
require.NotNil(t, retryCounter)
assert.Equal(t, int64(1), retryCounter.Value())
Also check the `other` value (might be worth extracting some kind of assert helper function).
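One possible shape for such an assert helper is sketched below. Everything in it is an assumption made for illustration: the `assertRetriesByReason` name, the small `reporter` interface (which `*testing.T` happens to satisfy), and the `"obsolete-ranges"` reason value. A reason missing from either map is treated as zero retries, so a single call covers both "one retry for this reason" and "no retries for any other reason".

```go
package main

import (
	"fmt"
	"sort"
)

// reporter is the subset of *testing.T that the helper needs.
type reporter interface {
	Errorf(format string, args ...interface{})
}

// assertRetriesByReason (hypothetical) compares retry counts per reason.
// Reasons absent from a map count as zero, so unexpected retries under any
// other reason are reported as well.
func assertRetriesByReason(t reporter, want, got map[string]int) {
	seen := make(map[string]struct{})
	for r := range want {
		seen[r] = struct{}{}
	}
	for r := range got {
		seen[r] = struct{}{}
	}
	reasons := make([]string, 0, len(seen))
	for r := range seen {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons) // deterministic failure order

	for _, r := range reasons {
		if want[r] != got[r] {
			t.Errorf("retries with reason %q: want %d, got %d", r, want[r], got[r])
		}
	}
}

// recorder collects failures so the sketch can run outside of `go test`.
type recorder struct{ failures []string }

func (r *recorder) Errorf(format string, args ...interface{}) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}

func main() {
	rec := &recorder{}
	want := map[string]int{"obsolete-ranges": 1} // hypothetical reason value
	assertRetriesByReason(rec, want, map[string]int{"obsolete-ranges": 1, "other": 0})
	assertRetriesByReason(rec, want, map[string]int{"other": 1})
	for _, f := range rec.failures {
		fmt.Println(f)
	}
}
```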
See #3317 (comment)
Added additional asserts 5afa0b4
if r, ok := counter.Tags()[reasonTag]; ok {
	valuesByReason[r] = int(counter.Value())
} else {
	valuesByReason[""] = int(counter.Value())
}
Nit:
- if r, ok := counter.Tags()[reasonTag]; ok {
- 	valuesByReason[r] = int(counter.Value())
- } else {
- 	valuesByReason[""] = int(counter.Value())
- }
+ reason := ""
+ if r, ok := counter.Tags()[reasonTag]; ok {
+ 	reason = r
+ }
+ valuesByReason[reason] = int(counter.Value())
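To show the suggested form in context, here is a minimal self-contained sketch of collecting counter values by reason. The `counterSnapshot` type, the `retriesByReason` function, and the `"reason"` value of `reasonTag` are assumptions for illustration; only the `Tags()` / `Value()` calls mirror the snippet under review.

```go
package main

import "fmt"

// reasonTag is assumed to be "reason"; the actual constant is not shown here.
const reasonTag = "reason"

// counterSnapshot is a hypothetical stand-in for a metrics counter snapshot.
type counterSnapshot struct {
	tags  map[string]string
	value int64
}

func (c counterSnapshot) Tags() map[string]string { return c.tags }
func (c counterSnapshot) Value() int64            { return c.value }

// retriesByReason groups counter values by their reason tag. Counters without
// the tag are stored under the empty string, as in the suggested change.
func retriesByReason(counters []counterSnapshot) map[string]int {
	valuesByReason := make(map[string]int, len(counters))
	for _, counter := range counters {
		reason := ""
		if r, ok := counter.Tags()[reasonTag]; ok {
			reason = r
		}
		valuesByReason[reason] = int(counter.Value())
	}
	return valuesByReason
}

func main() {
	counters := []counterSnapshot{
		{tags: map[string]string{reasonTag: "other"}, value: 1},
		{tags: nil, value: 3}, // untagged counter
	}
	fmt.Println(retriesByReason(counters))
}
```

Defaulting `reason` first leaves a single map write, so the untagged and tagged paths cannot drift apart.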
	xerrors "github.com/m3db/m3/src/x/errors"
)

func TestBootstrapFailedMetricReason(t *testing.T) {
As I mentioned in one of the comments above, this test might be overkill; I'd say testing this indirectly from `bootstrap_retries_test.go` is adequate. Leaving the final decision to you though.
Gonna leave it, in case we add more retry `reason`s in the future.
On average, writing code "for future" does not pay off :)
LGTM
* master: (22 commits)
  Remove deprecated fields (#3327)
  Add quotas to Permits (#3333)
  [aggregator] Drop messages that have a drop policy applied (#3341)
  Fix NPE due to race with a closing series (#3056)
  [coordinator] Apply auto-mapping rules if-and-only-if no drop policies are in effect (#3339)
  [aggregator] Add validation in AddTimedWithStagedMetadatas (#3338)
  [coordinator] Fix panic in Ready endpoint for admin coordinator (#3335)
  [instrument] Config option to emit detailed Go runtime metrics only (#3332)
  [aggregator] Sort heap in one go, instead of iterating one-by-one (#3331)
  [pool] Add support for dynamic, sync.Pool backed, object pools (#3334)
  Enable PANIC_ON_INVARIANT_VIOLATED for tests (#3326)
  [aggregator] CanLead for unflushed window takes BufferPast into account (#3328)
  Optimize StagedMetadatas conversion (#3330)
  [m3msg] Improve message scan performance (#3319)
  [dbnode] Add reason tag to bootstrap retries metric (#3317)
  [coordinator] Enable rule filtering on prom metric type (#3325)
  Update m3dbnode-all-config.yml (#3204)
  [coordinator] Include Type in RollupOp.Equal (#3322)
  [coordinator] Simplify iteration logic of matchRollupTarget (#3321)
  [coordinator] Add rollup type to remove specific dimensions (#3318)
  ...
What this PR does / why we need it:
Adds a `reason` tag to the bootstrap retries metric.
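As a rough illustration of what tagging retries by reason means, the sketch below classifies a bootstrap error into a reason and increments a per-reason count. It is not the code from this PR: `errObsoleteRanges`, `retryReason`, `retryMetrics`, and the `"obsolete-ranges"` value are hypothetical, and a plain map stands in for the real metrics scope. Only the `other` reason value appears in the review discussion.

```go
package main

import (
	"errors"
	"fmt"
)

// errObsoleteRanges is a hypothetical sentinel for "the ranges to bootstrap
// became obsolete while bootstrapping was in progress".
var errObsoleteRanges = errors.New("bootstrap ranges are obsolete")

// retryReason maps an error to the value of the reason tag. Anything that is
// not recognized falls back to "other".
func retryReason(err error) string {
	if errors.Is(err, errObsoleteRanges) {
		return "obsolete-ranges" // hypothetical tag value
	}
	return "other"
}

// retryMetrics is a stand-in for a tagged counter: one count per reason.
type retryMetrics struct {
	byReason map[string]int64
}

func newRetryMetrics() *retryMetrics {
	return &retryMetrics{byReason: make(map[string]int64)}
}

// recordRetry increments the retry count under the reason derived from err.
func (m *retryMetrics) recordRetry(err error) {
	m.byReason[retryReason(err)]++
}

func main() {
	m := newRetryMetrics()
	m.recordRetry(fmt.Errorf("pass 1 failed: %w", errObsoleteRanges)) // wrapped errors still match
	m.recordRetry(errors.New("disk read failed"))
	fmt.Println(m.byReason)
}
```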
Special notes for your reviewer:
Does this PR introduce a user-facing and/or backwards incompatible change?:
Does this PR require updating code package or user-facing documentation?: