Track the count of failed invocations since last successful policy snapshot #88398

Merged: 8 commits, Jul 12, 2022


@jbaiera (Member) commented Jul 8, 2022

When an automated snapshot fails, the policy's most recent failure is captured and stored in the cluster state. The most recent successful snapshot invocation is stored in the same way. What we do not track is how many invocations have passed between the last successful snapshot and the most recent failure. That count would be helpful for reporting on SLM policy health.

Snapshot lifecycle policies are scheduled with a cron expression rather than a fixed delay, so the time between snapshot attempts can vary. This makes it difficult to pick a time window after which continuous snapshot failure indicates a real problem rather than a transient issue. By including the count of failed invocations since the last success, we can provide health reporting logic that tolerates some transient failures while remaining agnostic to the variable execution times that cron can produce.
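The counting scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class name `PolicyInvocationStats`, its methods, and the `failureThreshold` parameter are all hypothetical. It only shows the idea of resetting a counter on success, incrementing it on failure, and judging health by the count rather than by elapsed time.

```java
// Hypothetical sketch: names are illustrative and not taken from the Elasticsearch source.
final class PolicyInvocationStats {
    // Number of failed invocations since the last successful snapshot.
    private final long invocationsSinceLastSuccess;

    PolicyInvocationStats(long invocationsSinceLastSuccess) {
        if (invocationsSinceLastSuccess < 0) {
            throw new IllegalArgumentException("count must not be negative");
        }
        this.invocationsSinceLastSuccess = invocationsSinceLastSuccess;
    }

    // A policy that has never failed starts at zero.
    static PolicyInvocationStats initial() {
        return new PolicyInvocationStats(0);
    }

    // A successful snapshot resets the counter.
    PolicyInvocationStats onSuccess() {
        return new PolicyInvocationStats(0);
    }

    // A failed snapshot increments the counter.
    PolicyInvocationStats onFailure() {
        return new PolicyInvocationStats(invocationsSinceLastSuccess + 1);
    }

    long invocationsSinceLastSuccess() {
        return invocationsSinceLastSuccess;
    }

    // Health depends on the number of consecutive failures, not on wall-clock
    // time, so it is independent of how often the cron schedule fires.
    // failureThreshold is an assumed, caller-chosen tolerance.
    boolean isUnhealthy(long failureThreshold) {
        return invocationsSinceLastSuccess >= failureThreshold;
    }
}
```

With an assumed threshold of 3, a policy that fails twice and then succeeds never reports as unhealthy, whether its cron schedule fires every five minutes or once a week.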

@jbaiera jbaiera added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.4.0 labels Jul 8, 2022
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jul 8, 2022
@elasticmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine (Collaborator)

Hi @jbaiera, I've created a changelog YAML for you.

@dakrone (Member) left a comment

This generally looks good to me, but I think we should make it non-null and treat a missing value as 0 invocations. What do you think?
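One way to read this suggestion is sketched below, assuming the count is read from a generic key/value map. The helper class, the method, and the field name `invocations_since_last_success` are assumptions for illustration only; the sketch does not show how the PR actually parses policy metadata.

```java
import java.util.Map;

// Hypothetical sketch: the class, method, and field name are assumptions for illustration.
final class InvocationCountParsing {
    static final String FIELD = "invocations_since_last_success";

    // Returns a primitive long so callers never see null. Metadata written
    // before the field existed has no entry, which is treated as 0 invocations.
    static long readInvocationCount(Map<String, Object> source) {
        Object raw = source.get(FIELD);
        if (raw == null) {
            return 0L;
        }
        if (raw instanceof Number) {
            return ((Number) raw).longValue();
        }
        throw new IllegalArgumentException("expected a number for [" + FIELD + "] but got: " + raw);
    }
}
```

Returning a primitive `long` instead of a nullable `Long` means downstream health checks need no null handling, and older metadata without the field behaves like a policy with no failures since its last success.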

@jbaiera (Member, Author) commented Jul 11, 2022

@elasticmachine run elasticsearch-ci/docs

@dakrone (Member) left a comment

LGTM

@jbaiera jbaiera merged commit b790256 into elastic:master Jul 12, 2022
@jbaiera jbaiera deleted the slm-add-invocation-counts branch July 12, 2022 15:26
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jul 13, 2022
* upstream/master: (38 commits)
  Simplify map copying (elastic#88432)
  Make DiffableUtils.diff implementation agnostic (elastic#88403)
  Ingest: Start separating Metadata from IngestSourceAndMetadata (elastic#88401)
  Move runtime fields base scripts out of scripting fields api package. (elastic#88488)
  Enable TRACE Logging for test and increase timeout (elastic#88477)
  Mute ReactiveStorageIT#testScaleDuringSplitOrClone (elastic#88480)
  Track the count of failed invocations since last successful policy snapshot (elastic#88398)
  Avoid noisy exceptions on data nodes when aborting snapshots (elastic#88476)
  Fix ReactiveStorageDeciderServiceTests testNodeSizeForDataBelowLowWatermark (elastic#88452)
  INFO logging of snapshot restore and completion (elastic#88257)
  unmute test (elastic#88454)
  Updatable API keys - noop check (elastic#88346)
  Corrected an incomplete sentence. (elastic#86542)
  Use consistent shard map type in IndexService (elastic#88465)
  Stop registering TestGeoShapeFieldMapperPlugin in ESIntegTestCase (elastic#88460)
  TSDB: RollupShardIndexer logging improvements (elastic#88416)
  Audit API key ID when create or grant API keys (elastic#88456)
  Bound random negative size test in SearchSourceBuilderTests#testNegativeSizeErrors (elastic#88457)
  Updatable API keys - logging audit trail event (elastic#88276)
  Polish reworked LoggedExec task (elastic#88424)
  ...

# Conflicts:
#	x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/RollupShardIndexer.java
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement Team:Data Management Meta label for data/management team v8.4.0