RFC: Append-only Indices #12886
Comments
Thanks @sarthakaggarwal97 for the proposal. +1 on all the optimizations that can be applied under the hood for append-only indices. I recently found that the Security plugin provides a way to configure immutable indices, and the definition looks similar.
FYI, so that we can make sure the two do not conflict in implementation later. A couple of questions:
Without an alias (pointing to the index) defined by the user, automated rollover can't work. With data streams it is possible, because the data stream itself provides the logical construct on which searches and indexing are performed.
Not many users use a custom doc id with time series workloads, but some still do, and it can help them establish consistency and debug issues across their systems. They would not benefit from append-only semantics as such, because a version map would still be needed for request id (doc id) idempotency.
Auto tiering/migration should be supported irrespective of whether an index is append-only or takes updates. It would definitely be more efficient for append-only indices, but it can also work for indices that take updates/deletes, where other factors such as update frequency would determine the efficiency.
@shwetathareja thank you for your comments!
Yes, I came across immutable indices. I agree that we should keep the implementations in sync to avoid conflicts. We can also look at it the other way around and see whether immutable indices can benefit from append-only semantics in OpenSearch core. I will have to dive into the Security plugin's implementation to do that.
Data streams allow all CRUD operations on their documents. I believe some of the suggested semantics/optimizations for append-only indices would be expensive for data streams.
This is a valid point. If we want to support this, we should mandate an alias in case the user wants to perform automated rollovers with append-only indices.
Denying the custom id helps us avoid shard skew. We may end up with one shard or one segment becoming a hotspot, and with the merge optimizations in place, I am not sure we should allow it. I would like more input on this from the community.
Agreed, I am aligned on this. We should provide a ready-to-go template to allow for append-only indices. Tagging @mgodwan to give more insights.
Yes, this was added to highlight that append-only indices would be able to handle auto-tiering efficiently, since we won't be allowing updates and can also create bigger segments after a point without worrying about updates.
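For illustration, the alias-driven rollover discussed above could look roughly like the following Python sketch. It only builds the request bodies for the existing create-index and `_rollover` APIs and mirrors the documented naming rule for rolled-over indices. The alias name `logs` and the condition values are assumptions chosen for the example, and the helper function names are hypothetical.

```python
import re


def initial_index_body(alias: str) -> dict:
    """Body for creating the first index, e.g. PUT /logs-000001, with a write alias."""
    return {"aliases": {alias: {"is_write_index": True}}}


def rollover_body(max_age: str = "1d", max_size: str = "50gb") -> dict:
    """Body for POST /<alias>/_rollover; the condition values are illustrative."""
    return {"conditions": {"max_age": max_age, "max_size": max_size}}


def next_index_name(current: str) -> str:
    """Mimic rollover naming: bump the trailing number and zero-pad it to 6 digits."""
    match = re.fullmatch(r"(.*-)(\d+)", current)
    if match is None:
        raise ValueError("index name must end with '-<number>' for automatic naming")
    prefix, counter = match.groups()
    return f"{prefix}{int(counter) + 1:06d}"


if __name__ == "__main__":
    print(initial_index_body("logs"))
    print(rollover_body())
    print(next_index_name("logs-000001"))
```

Writes always go to the alias, so clients never need to know which backing index is current. This is the logical construct that a data stream provides implicitly.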
@sarthakaggarwal97 it looks to me like this is already possible with data streams?
@reta I just tried it, and it looks like we can update the documents in the backing indices of a data stream. So the backing indices are not truly append-only.
@sarthakaggarwal97 thank you, that's by design (and in accordance with the documentation). I am wondering whether providing the capability to make backing indices truly append-only would be a natural improvement over data streams (in the scope of this feature proposal)?
@sarthakaggarwal97 append-only semantics should be applicable to data streams, as the data stream abstraction is meant for time series workloads. To avoid breaking changes, this should still be driven by configuration. Whether for regular indices or for indices backing a data stream, append-only semantics should work for both.
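As a rough sketch of what "driven by configuration" could mean, the Python snippet below builds a composable index template body that opts matching indices into append-only mode, for either a regular index pattern or a data stream. The setting name `index.append_only` is purely hypothetical, since the RFC does not define one yet, and both helper functions are made up for this example. Only the overall template shape (`index_patterns`, `template.settings`, `data_stream`) follows the existing index template API.

```python
def append_only_template(pattern: str, data_stream: bool) -> dict:
    """Build an index template body that opts matching indices into append-only mode."""
    body = {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                # Hypothetical setting name; the RFC does not define one yet.
                "index.append_only": True,
            }
        },
    }
    if data_stream:
        # An empty object marks the template as a data stream template.
        body["data_stream"] = {}
    return body


def is_append_only(index_settings: dict) -> bool:
    """Default to False so existing indices keep today's behaviour."""
    return bool(index_settings.get("index.append_only", False))


if __name__ == "__main__":
    print(append_only_template("logs-*", data_stream=True))
    print(is_append_only({}))
```

Defaulting to `False` keeps the change non-breaking: an index or data stream only becomes append-only when it is explicitly configured that way.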
Is your feature request related to a problem? Please describe
OpenSearch today caters to various use cases such as log analytics, full text search, metrics, observability, and security events. By default, any index created in OpenSearch allows updates and deletes on the ingested documents. While this serves the use cases above well, time series use cases such as logs, metrics, observability, and security events do not require update or delete operations. Updates and deletes are known to be expensive: they require OpenSearch to look up the existing document, perform the operation, and add soft deletes, which in turn causes additional work during merges and can hinder overall performance. This cost can be avoided by restricting those operations for use cases that do not need them. Also, there are certain optimizations (listed below) that can be applied if we know the data will not be updated or deleted (at document level).
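To make the cost concrete, here is a minimal Python sketch (not OpenSearch code) contrasting a writer that must consult a per-document version map on every operation with an append-only writer that generates its own ids and needs no lookup. The class names `MutableIndex` and `AppendOnlyIndex` are hypothetical, and the dictionary is only a stand-in for the engine's real version map.

```python
import uuid


class MutableIndex:
    """Toy index that allows updates and deletes, so it must track every id."""

    def __init__(self):
        self.version_map = {}   # doc id -> latest version (stand-in for the real version map)
        self.live_docs = {}     # doc id -> latest source
        self.soft_deletes = 0   # superseded copies that merges must reclaim later

    def index(self, doc_id, source):
        # Every write needs a lookup to tell a create from an update.
        version = self.version_map.get(doc_id, 0)
        if doc_id in self.live_docs:
            self.soft_deletes += 1  # the old copy is only marked as deleted
        self.version_map[doc_id] = version + 1
        self.live_docs[doc_id] = source
        return self.version_map[doc_id]

    def delete(self, doc_id):
        if doc_id in self.live_docs:
            del self.live_docs[doc_id]
            self.soft_deletes += 1
            self.version_map[doc_id] += 1


class AppendOnlyIndex:
    """Toy index that only appends: no lookup, no version map, no soft deletes."""

    def __init__(self):
        self.docs = []

    def index(self, source):
        doc_id = uuid.uuid4().hex  # auto-generated id, so no collision check is modelled
        self.docs.append((doc_id, source))
        return doc_id


if __name__ == "__main__":
    mutable = MutableIndex()
    mutable.index("a", {"msg": "first"})
    mutable.index("a", {"msg": "second"})
    print(mutable.soft_deletes)

    append_only = AppendOnlyIndex()
    print(append_only.index({"msg": "log line"}))
```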
Disabling updates/deletes on the index documents can allow us to handle multiple use cases efficiently:
(`index.merge.policy.max_merged_segment` defaults to 5gb). We would avoid a chunk of merges by preventing deletes and updates, and thus these huge segments will come in contention to be merged, allowing us to increase the 5gb limit.
Describe the solution you'd like
We propose introducing the concept of append-only indices in OpenSearch to support the aforementioned use cases. With a restriction that keeps documents immutable, we would deny any updates and deletes of a document. This will help reduce the memory footprint of indices (e.g. the version map) and also unlock avenues for future optimizations and features based on this restriction, e.g.
Implementation details: TBU based on community feedback.
Additional context
FAQs:
Q: How would it be different from data streams?
A: While data streams optimize the automated rollover of time series data, they still support all CRUD operations on the backing indices. With append-only mode we aim to provide specialized optimizations and security features for such indices as core functionality.
Q: What would be the APIs/features that we will not allow for append-only indices?
A: Some initial thoughts on the APIs/features we may not be able to support are:
- The `_doc` API will be denied, to avoid indexing documents with a custom id, as well as updates and deletes.
- The `_split` and `_shrink` APIs will be denied, to avoid removal of documents from the underlying shards of the source index.
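One possible reading of these restrictions, written as a small Python request filter. This is a sketch under assumptions, not the proposed implementation: the `Request` type and `is_allowed` function are hypothetical, it assumes that reads and `POST` to `_doc` without an id (auto-generated id) remain allowed, and it also denies `_update` because the proposal rules out updates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    method: str                    # HTTP verb, e.g. "POST"
    endpoint: str                  # e.g. "_doc", "_update", "_split", "_shrink"
    doc_id: Optional[str] = None   # None means the id would be auto-generated


# "_update" is an assumption here, implied by the proposal denying updates.
DENIED_ENDPOINTS = {"_split", "_shrink", "_update"}
READ_METHODS = {"GET", "HEAD"}


def is_allowed(req: Request) -> bool:
    """Return True if the request would be accepted by an append-only index."""
    if req.endpoint in DENIED_ENDPOINTS:
        return False
    if req.endpoint == "_doc":
        if req.method in READ_METHODS:
            return True  # assumption: reads are unaffected
        # Only appends with an auto-generated id pass; custom ids,
        # overwrites via PUT, and deletes are rejected.
        return req.method == "POST" and req.doc_id is None
    return True


if __name__ == "__main__":
    print(is_allowed(Request("POST", "_doc")))
    print(is_allowed(Request("PUT", "_doc", "my-id")))
    print(is_allowed(Request("POST", "_shrink")))
```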