RFC: Append-only Indices #12886

Open
sarthakaggarwal97 opened this issue Mar 24, 2024 · 6 comments
Labels: enhancement, Indexing, Roadmap:Cost/Performance/Scale

Comments

@sarthakaggarwal97
Contributor

sarthakaggarwal97 commented Mar 24, 2024

Is your feature request related to a problem? Please describe

OpenSearch today caters to various use cases such as log analytics, full-text search, metrics, observability, and security events. By default, any index created in OpenSearch allows updates and deletes on ingested documents. While this works well across those use cases, time-series use cases such as logs, metrics, observability, and security events do not require update or delete operations. Updates and deletes are expensive: OpenSearch has to look up the existing document, perform the operation, and add soft deletes, which in turn causes additional work during merges and can hinder overall performance. This cost can be avoided by restricting those operations for use cases that don't need them. There are also certain optimizations (listed below) that can be applied if we know the data will not be updated or deleted at the document level.

Disabling updates and deletes on an index's documents allows us to handle several use cases efficiently:

  • Onboard data structures optimized for append-only: Recently, an RFC was opened to support precomputed data structures like Star Tree, where any update or delete would be quite expensive in terms of the compute needed to rebuild the star tree.
  • Support security-driven use cases: Users have requested support for indices whose documents are immutable. Such requests cover use cases like audit logs, security logs, transactions, and ledgers, where the core requirement is to ensure the documents cannot be changed or altered.
  • Optimizing index settings: We can tune the merge policy to allow faster access to more recent data. We could also support larger merged segment sizes (index.merge.policy.max_merged_segment currently defaults to 5gb). Preventing deletes and updates would avoid a chunk of merges, and thus these huge segments will come in contention to be merged, allowing us to increase the 5gb limit.
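
As a rough illustration of the index-settings point above, the sketch below builds a create-index request body that raises index.merge.policy.max_merged_segment. The setting name comes from the text; the "10gb" value and the helper function are assumptions made for illustration, not tested recommendations.

```python
import json


def build_index_settings(max_merged_segment: str = "10gb") -> dict:
    """Return a create-index body that overrides the merged-segment cap.

    The default cap mentioned above is 5gb. "10gb" is an arbitrary
    illustrative value, not a recommendation.
    """
    return {
        "settings": {
            "index.merge.policy.max_merged_segment": max_merged_segment,
        }
    }


body = build_index_settings()
# The body would be sent with: PUT /<index-name>
print(json.dumps(body, indent=2))
```

Whether a larger cap actually helps would depend on the merge behaviour the RFC ends up with, so any value here would need benchmarking.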

Describe the solution you'd like

We propose introducing append-only indices in OpenSearch to support the use cases above. By keeping documents immutable, we would deny any updates and deletes of a document. This will help reduce the memory footprint of indices (e.g. the version map) and also open avenues for future optimizations and features based on this restriction, e.g.:

  1. We can support automated rollovers with append-only indices.
  2. With the future support of Writable Warm, we can enable auto-migration of shards/segments instead of keeping all the segments/shards hot on data nodes.
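
Since implementation details are still TBU, the following is only a toy in-memory model of the restriction described above, not OpenSearch code. The class and exception names are hypothetical. It shows the intended behaviour: new documents are accepted with generated IDs, while updates and deletes are rejected outright, so no per-ID version tracking is kept.

```python
import uuid


class AppendOnlyViolation(Exception):
    """Raised when a request would modify or remove an existing document."""


class AppendOnlyIndex:
    """Toy model of an index that only accepts new documents (hypothetical)."""

    def __init__(self):
        # Append-only log of (doc_id, source) pairs; there is no id -> version map.
        self._docs = []

    def append(self, source: dict) -> str:
        # IDs are always generated here, so no lookup of an existing ID is needed.
        doc_id = uuid.uuid4().hex
        self._docs.append((doc_id, dict(source)))
        return doc_id

    def update(self, doc_id: str, partial: dict) -> None:
        raise AppendOnlyViolation("updates are not allowed on an append-only index")

    def delete(self, doc_id: str) -> None:
        raise AppendOnlyViolation("deletes are not allowed on an append-only index")

    def count(self) -> int:
        return len(self._docs)


index = AppendOnlyIndex()
new_id = index.append({"message": "user logged in"})
try:
    index.delete(new_id)
except AppendOnlyViolation as err:
    print(f"rejected: {err}")
```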

Implementation details: TBU based on community feedback.

Additional context

FAQs:

Q: How would it be different from data streams?
A: While data streams optimize for automated rollover of time-series data, they still support all CRUD operations on the backing indices. With append-only mode, we aim to provide specialized optimizations and security features for such indices as core functionality.

Q: Which APIs/features would we not allow for append-only indices?
A: Some initial thoughts on the APIs/features we may not be able to support are:

  1. We will not support updates and deletes of documents in the index.
  2. The _doc API will be denied, to prevent indexing documents with a custom ID as well as updates and deletes.
  3. The _split and _shrink APIs will be denied, to avoid removal of documents from the underlying shards of the source index.
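
To make the document-level rules in this list concrete, here is a minimal sketch of how a bulk request item might be checked under them. The action names (index, create, update, delete) are the standard bulk action types; the function, its return shape, and the rejection messages are hypothetical, and the _split/_shrink restriction is not modelled.

```python
from typing import Tuple

# Bulk actions that would modify or remove existing documents.
DENIED_ACTIONS = {"update", "delete"}
# Bulk actions that add documents.
WRITE_ACTIONS = {"index", "create"}


def check_bulk_action(action: str, metadata: dict) -> Tuple[bool, str]:
    """Decide whether one bulk item would be accepted under the proposed rules."""
    if action in DENIED_ACTIONS:
        return False, f"{action} is not allowed on an append-only index"
    if action in WRITE_ACTIONS:
        if "_id" in metadata:
            # A caller-supplied ID could target an existing document.
            return False, "custom document IDs are not allowed on an append-only index"
        return True, "ok"
    return False, f"unknown action: {action}"


for item in [
    ("index", {"_index": "logs"}),
    ("index", {"_index": "logs", "_id": "42"}),
    ("delete", {"_index": "logs", "_id": "42"}),
]:
    print(item[0], check_bulk_action(*item))
```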
@sarthakaggarwal97 sarthakaggarwal97 added enhancement Enhancement or improvement to existing feature or request untriaged labels Mar 24, 2024
@github-actions github-actions bot added the Indexing Indexing, Bulk Indexing and anything related to indexing label Mar 24, 2024
@shwetathareja
Member

Thanks @sarthakaggarwal97 for the proposal. +1 on all the optimizations which can be applied under-the-hood for the append-only indices.

Recently, I found that the Security plugin has a way to configure immutable indices, and the definition looks similar:

public static final String SECURITY_COMPLIANCE_IMMUTABLE_INDICES = "plugins.security.compliance.immutable_indices";

FYI, to ensure there is no conflict between the two in terms of implementation later.

A couple of questions:

  1. Would append-only semantics be applicable for DataStreams?
  2. We can support automated rollovers with append-only indices.

Without an alias (pointing to the index) defined by the user, automated rollover can't work. With data streams it is possible, because the data stream itself provides the logical construct on which searches and indexing can be performed.
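
For context on the alias dependency, the sketch below builds the two requests a rollover over a regular index relies on: creating the first index with a write alias, then calling the rollover API on that alias. The index name, alias name, and condition values are placeholders chosen for illustration, and the helper itself is hypothetical.

```python
def build_rollover_requests(alias: str, first_index: str) -> list:
    """Return (method, path, body) tuples for an alias-based rollover."""
    create_index = (
        "PUT",
        f"/{first_index}",
        # The alias is the stable name that clients write to and search.
        {"aliases": {alias: {"is_write_index": True}}},
    )
    rollover = (
        "POST",
        f"/{alias}/_rollover",
        # Placeholder conditions; the index rolls over when any one is met.
        {"conditions": {"max_age": "1d", "max_docs": 1000000}},
    )
    return [create_index, rollover]


for method, path, body in build_rollover_requests("app-logs", "app-logs-000001"):
    print(method, path, body)
```

The rollover call targets the alias rather than a concrete index, which is why a plain index without an alias has nothing to roll over against.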

  1. _doc API will be denied to avoid document index with custom id

Though not a lot of users use custom doc IDs with time series workloads, some may still use them, and it could be helpful in establishing consistency and debugging issues across their systems. Such users would not benefit from append-only semantics as such, because the version map would still be needed for request ID (doc ID) idempotency.
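
The idempotency concern can be illustrated with a toy writer, which is an assumption-laden simplification and not how the engine is implemented. With caller-supplied IDs, a retried request can only be recognised if there is a lookup from ID to what has already been indexed; the dictionary below stands in for that version-map-style bookkeeping.

```python
class IdempotentWriter:
    """Toy model: create-if-absent writes keyed by a caller-supplied ID."""

    def __init__(self):
        # doc_id -> version; this per-ID state is what append-only with
        # auto-generated IDs would hope to avoid keeping.
        self._versions = {}
        self.docs = []

    def create(self, doc_id: str, source: dict) -> str:
        if doc_id in self._versions:
            # A retry of an already-applied request is detected via the lookup.
            return "already_exists"
        self._versions[doc_id] = 1
        self.docs.append((doc_id, dict(source)))
        return "created"


writer = IdempotentWriter()
print(writer.create("req-1", {"message": "payment received"}))
print(writer.create("req-1", {"message": "payment received"}))  # client retry
```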

  1. We need to think in terms of user experience, how it ties back to RFC: Application Based Configuration Templates #12683

  2. With the future support of [RFC] Support for writable warm indices on Opensearch #12809, we can enable auto-migration of shards/segments instead of keeping all the segments/shards hot on data nodes.

Auto tiering/migration should be supported irrespective of append-only or updates. It would definitely be more efficient for append-only indices, but it can also work for indices that take updates/deletes, where other factors such as update frequency would determine the efficiency.

@sarthakaggarwal97
Contributor Author

@shwetathareja thank you for your comments!

FYI to ensure, there shouldn't be any conflict between the two in terms of implementation later

Yes, I came across the immutable indices. I agree that we should keep the implementations in sync to avoid conflicts. Moreover, we can also look backwards and see whether immutable indices can benefit from append-only semantics in OpenSearch core. I will have to dive into the Security plugin's implementation to do that.

Would append-only semantics be applicable for DataStreams?

Datastreams allows all CRUD operations for their documents. Some of the suggested semantics / optimizations for append-only indices would be expensive for datastreams I believe.

Without the definition of alias (pointing to index) from users, automated rollover can't work

This is a valid point. If we want to support this, we should look to mandate an alias in case the user wants to perform automated rollovers with append-only indices.

Though, not lot of users use custom doc id with time series workload but some users may still use it and it could be helpful in establishing consistency and debugging issues across their systems

Denying custom IDs helps us avoid shard skew. We may run into one shard or one segment becoming a hotspot, and with merge optimizations in place, I am not sure if we should do this. I would like more input on this from the community.

We need to think in terms of user experience

Agreed, I am aligned on this. We should provide a ready-to-go template to allow for append-only indices. Tagging @mgodwan to give more insights.

Auto tiering/ migration should support irrespective of append-only or updates

Yes, this was added to highlight that append-only indices would handle auto-tiering efficiently, since we won't be allowing updates and can also create bigger segments after a point without worrying about updates.

@reta
Collaborator

reta commented Mar 27, 2024

Datastreams allows all CRUD operations for their documents. Some of the suggested semantics / optimizations for append-only indices would be expensive for datastreams I believe.

@sarthakaggarwal97 it looks to me like this is what is already possible with data streams?

Data streams are designed for use cases where existing data is rarely, if ever, updated. You cannot send update or deletion requests for existing documents directly to a data stream. Instead, use the update by query and delete by query APIs. - https://www.elastic.co/guide/en/elasticsearch/reference/7.10/data-streams.html

@sarthakaggarwal97
Contributor Author

@reta I just tried it, and it looks like we can update the documents in the backing indices of a datastream. So the backing indices are not truly append-only.
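
For readers following along, the difference being discussed can be sketched as two request shapes: an update-by-query sent to the data stream name, and a direct update sent to a backing index. This is not the exact check that was run; the data stream name, backing index name, document ID, and script are placeholders, and the helper is hypothetical.

```python
def build_data_stream_requests(data_stream: str, backing_index: str, doc_id: str) -> list:
    """Return (method, path, body) tuples for the two update routes."""
    update_by_query = (
        "POST",
        f"/{data_stream}/_update_by_query",
        {
            "query": {"match_all": {}},
            "script": {"source": "ctx._source.status = 'reviewed'", "lang": "painless"},
        },
    )
    # Addressing the backing index directly bypasses the data stream name.
    direct_update = (
        "POST",
        f"/{backing_index}/_update/{doc_id}",
        {"doc": {"status": "reviewed"}},
    )
    return [update_by_query, direct_update]


for method, path, _body in build_data_stream_requests("logs-app", ".ds-logs-app-000001", "abc123"):
    print(method, path)
```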

@reta
Collaborator

reta commented Mar 30, 2024

@reta I just tried it, and it looks like we can update the documents in the backing indices of a datastream. So the backing indices are not truly append-only.

@sarthakaggarwal97 thank you, that's by design (and in accordance with the documentation). I am wondering whether providing the capability to make backing indices truly append-only would be a natural improvement over data streams (within the scope of this feature proposal)?

@shwetathareja
Member

shwetathareja commented Apr 2, 2024

Datastreams allows all CRUD operations for their documents. Some of the suggested semantics / optimizations for append-only indices would be expensive for datastreams I believe.

@sarthakaggarwal97 append-only semantics should be applicable to data streams, as the data stream abstraction is meant for time series workloads. To avoid breaking changes, this should still be driven by configuration. Be it regular indices or data-stream-backed indices, append-only semantics should work for both.


5 participants