RFC: Application Based Configuration Templates #12683

Closed
mgodwan opened this issue Mar 15, 2024 · 14 comments · Fixed by #14811 or #15290
Labels
discuss Issues intended to help drive brainstorming and decision making enhancement Enhancement or improvement to existing feature or request feature New feature or request Indexing Indexing, Bulk Indexing and anything related to indexing RFC Issues requesting major changes Roadmap:Ease of Use Project-wide roadmap label v2.16.0 Issues and PRs related to version 2.16.0 v2.17.0

Comments

@mgodwan
Member

mgodwan commented Mar 15, 2024

Is your feature request related to a problem? Please describe

Today, OpenSearch provides users multiple knobs/settings to configure their indices (e.g. number of shards, replicas, replication types, refresh interval, merge settings, etc.). It also exposes different settings/policies to configure different plugin based actions (e.g. rollovers, rollups, transforms, k-NN tuning, etc.). A combination of these settings/policies needs to be precisely configured to get the best experience in terms of various performance and usability dimensions such as throughput, latency, storage usage, etc.

It is difficult for users on-boarding new use cases to OpenSearch to get these configurations right up front, as doing so requires extensive experimentation and developer effort.

Since users use OpenSearch for many different use cases (e.g. Log Analytics, Metrics, Text Search, Security Analytics, ML, etc.), they need to go through the entire set of available knobs, try them out, and then decide what works best for their use case. This creates very visible friction while on-boarding to OpenSearch, and when users are unable to get it right after a few attempts, they end up going with alternate solutions.

One way to mitigate this problem is to capture the context of an index and, based on that context, make default values for these settings available as templates (think of them as predefined system templates). We would like to propose the concept of context aware index templates in OpenSearch, which will allow users to easily configure their indices based on the use cases they are building for. This context can be a first-class citizen for the indices via the templates, and any opt-in/opt-out features developed in OpenSearch can be applied to such indices out of the box based on the selected use case, reducing friction and promoting a seamless on-boarding experience.

A few of the example use cases we can expose directly are:

| Use Case | Out-of-the-box Optimizations (not exhaustive) |
| --- | --- |
| Logs | Disable Upserts and Custom Document IDs (Better Performance); Enable Deflate/ZSTD (Better Storage); Higher Refresh Interval/SegRep (Better Indexing Throughput); Merge Policy (LogByteSize by default) |
| Metrics | Disable Upserts and Custom Document IDs (Better Performance); Disable Source (Better Storage); Enable Star Tree Index (Better Aggregations); High Refresh Interval/SegRep (Better Indexing Throughput); Merge Policy (LogByteSize by default) |
| Events/Transactions | Enable bloom filters (Better Update Performance); Low Refresh Interval/DocRep (Faster Document Visibility); Low merge delete threshold (Faster reclamation of storage on updates) |

Though this provides a seamless out-of-the-box experience, there may be cases where users want to override some settings. This can be done by extending the system templates and overriding the corresponding setting values.
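
A minimal sketch of that override behavior, assuming a system template is just a flat map of settings and that the extending template's values win on conflict. The `LOGS_TEMPLATE` contents and the `extend_template` helper are hypothetical illustrations, not something OpenSearch defines.

```python
# Hypothetical system template for the "Logs" use case. The values are
# illustrative only and are not defaults defined by this proposal.
LOGS_TEMPLATE = {
    "index.codec": "best_compression",
    "index.refresh_interval": "60s",
    "index.replication.type": "SEGMENT",
}


def extend_template(system_template, overrides):
    """Return a new settings map in which user overrides win over system values."""
    merged = dict(system_template)  # copy so the system template stays untouched
    merged.update(overrides)
    return merged


# A user keeps the Logs defaults but wants documents visible sooner.
custom_logs = extend_template(LOGS_TEMPLATE, {"index.refresh_interval": "10s"})
```

Copying before merging keeps the system template immutable, which matches the idea that users build on top of system templates rather than editing them.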

Once these use cases are exposed to the users and as we continue to build optimizations which may provide more benefit for specific use cases, they can be directly adopted by the users (who explicitly opted for it) through the updated context aware index template definitions (upon upgrades) without requiring users to go through the release notes and figure out if something would be useful for them.

Proposal

For the pain points discussed above, it becomes necessary to ensure that a simple interface is provided to the users to manage their indices. In order to do so, we can use the existing components and terminology which users understand and build the new functionality on top of it.
  1. System Context Templates: This resource interface will allow users to use and configure templates that take advantage of various features introduced in OpenSearch for different use cases without the need to handle each and every setting/knob. These can be an extension of the existing component templates, so that users looking to build indices/custom templates for their use cases can pick the use case to apply.
  2. Context Template Repository: This will be the storage component containing the various system context templates available to users. It will include a predefined set of use cases and configurations based on the identified problems and the associated tuning. This storage can be a separate system index, and any mutations/retrievals from outside the OpenSearch process will be exposed only through the APIs.
  3. Refresh Template Action: As we obtain more information about use cases and build more optimizations, it would be beneficial for users to update the existing templates and add new templates with minimal overhead and interaction. Based on the user configuration to utilize these updates, whenever an upgrade happens and new features are added to OpenSearch, these updates can be applied automatically or by running a simple command.

    Following is a high-level example of how a template may look and be applied to an index:

    PUT _context_template/.metrics # Add template
    {
      "settings": {
        "index.append_only": true,
        "index.codec": "{{custom_codec}} || best_compression",
        "index.refresh_interval": "120s",
        "index.pre_compute_aggregation.enabled": true,
        "index.pre_compute_aggregation.datetime_interval": "{{date_time_interval}} || 1m",
        "index.pre_compute_aggregation.fields": [{{pre_compute_aggregation_fields_order}}], # Additional parameterization support enforces these to be declared on index creation
        "index.merge.policy": "LOG_BYTE_SIZE",
        "index.replication.type": "SEGMENT"
      }
    }

    {"acknowledged": true, "version": "1"}

    ----

    PUT my-metric-index/ # Create index (or component template) using the _context_template
    {
      "context": { # This is new.
        "name": ".metrics",
        "version": "_latest",
        "apply_template_updates": true,
        "params": { # Params with no default values in the template definition are expected
          "custom_codec": "zstd_no_dict",
          "pre_compute_aggregation_fields_order": ["@timestamp", "status_code"]
        }
      },
      "settings": {
        "index.pre_compute_aggregation.datetime_interval": "5m", # This will override settings declared in context template
        "number_of_shards": 1,
        "number_of_replicas": 1
      },
      "mappings": {
        "properties": {
          "status_code": {
            "type": "integer"
          },
          "request_type": {
            "type": "keyword"
          },
          "@timestamp": {
            "type": "date"
          },
          "latency": {
            "type": "float"
          }
        }
      }
    }
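
One way to read the `{{param}} || default` notation in the example above is sketched below. This is only an assumption about the intended semantics: a value supplied in `params` wins, otherwise the default after `||` is used, and a placeholder with no default is required at index creation. The helpers `resolve_value` and `build_index_settings` are hypothetical names, and the list-valued `fields` placeholder is simplified to a bare placeholder string.

```python
import re

# Matches "{{name}}" or "{{name}} || default", the forms used in the example.
PLACEHOLDER = re.compile(r"^\{\{(\w+)\}\}(?:\s*\|\|\s*(.+))?$")


def resolve_value(value, params):
    """Resolve one template value against the params given at index creation."""
    if not isinstance(value, str):
        return value  # booleans, numbers, etc. pass through unchanged
    match = PLACEHOLDER.match(value)
    if match is None:
        return value  # plain literal, nothing to substitute
    name, default = match.group(1), match.group(2)
    if name in params:
        return params[name]
    if default is not None:
        return default
    raise ValueError(f"missing required context param: {name}")


def build_index_settings(template_settings, params, index_settings):
    """Resolve placeholders, then let explicit index settings override them."""
    resolved = {k: resolve_value(v, params) for k, v in template_settings.items()}
    resolved.update(index_settings)
    return resolved


template = {
    "index.append_only": True,
    "index.codec": "{{custom_codec}} || best_compression",
    "index.refresh_interval": "120s",
    "index.pre_compute_aggregation.datetime_interval": "{{date_time_interval}} || 1m",
    "index.pre_compute_aggregation.fields": "{{pre_compute_aggregation_fields_order}}",
}
params = {
    "custom_codec": "zstd_no_dict",
    "pre_compute_aggregation_fields_order": ["@timestamp", "status_code"],
}
settings = build_index_settings(
    template,
    params,
    {"index.pre_compute_aggregation.datetime_interval": "5m", "number_of_shards": 1},
)
```

Under these assumptions the codec comes from `params`, the datetime interval comes from the explicit index setting rather than the template default, and omitting `pre_compute_aggregation_fields_order` would raise an error.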

    Alternatives Explored

    Using Data Streams

    Data Streams are a generic abstraction for time series data and do not extend to applying optimizations for specific use cases within the time-series universe. The idea is to allow for a generic abstraction which can facilitate optimizations across various use cases which users would like to build based on other dimensions. The proposed solution should be applicable to data streams as well, since they will also benefit from the context.

    Using Existing Index Templates

    Index Templates require knowledge of the available settings and optimizations by the customers. Even if we create index templates on the user’s behalf, we still will not be able to apply optimizations at the field level, etc., since that may depend on the user configuration within mappings. Also, the templates don’t support use cases such as refreshing the indices created through them on changes (e.g. on upgrades, we may add new optimizations to be applied). Hence, we may need a new abstraction.
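
The refresh-on-upgrade behavior that plain index templates lack could look roughly like the sketch below. It assumes the `version` and `apply_template_updates` fields from the example request, plus a hypothetical `applied_version` field recording which template version an index was last built from. None of these names are confirmed APIs.

```python
from dataclasses import dataclass


@dataclass
class IndexContext:
    """Context block stored with an index, mirroring the example request."""
    name: str
    version: str                  # a pinned version such as "1", or "_latest"
    apply_template_updates: bool
    applied_version: int          # hypothetical: version the index was built from


def version_to_apply(context, latest_version):
    """Return the template version to re-apply after an upgrade, or None."""
    if not context.apply_template_updates:
        return None  # user opted out of automatic updates
    if context.version != "_latest":
        return None  # index is pinned to an explicit version
    if latest_version > context.applied_version:
        return latest_version
    return None  # already up to date


ctx = IndexContext(".metrics", "_latest", True, applied_version=1)
new_version = version_to_apply(ctx, latest_version=2)
```

Here only indices that both track `_latest` and opted in to updates would pick up a newer template after an upgrade; everything else stays as created.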

    FAQs

    Q: How can you help?
    A: Any feedback on the overall idea and proposal is welcome. If you have specific requirements/use-cases which are not addressed by the above proposal, please let us know.

    Q: Why not propose to use cluster state metadata for new system context templates?
    A: The templates can grow over time, and tying them to cluster state metadata may cause bottlenecks. Hence, the storage component for this new resource can be a system index.

    Q: As a user, will I be able to disable certain optimizations?
    A: Yes, we would still like to support customization on top of suggested out of box optimizations.


@mgodwan mgodwan added enhancement Enhancement or improvement to existing feature or request untriaged labels Mar 15, 2024
@github-actions github-actions bot added the Indexing Indexing, Bulk Indexing and anything related to indexing label Mar 15, 2024
@shwetathareja shwetathareja added the feature New feature or request label Mar 15, 2024
@mgodwan mgodwan added the RFC Issues requesting major changes label Mar 15, 2024
@shwetathareja
Member

Thanks @mgodwan for putting together the proposal. This is definitely in the right direction to help OpenSearch users optimize for their use case without needing to be an advanced/expert user who understands which settings to tune at large.

This storage can be a separate system index and any mutations/retrievals from outside of OpenSearch process will be exposed only through the APIs.

For Context templates repository, It will be pre-defined in OpenSearch core or users can modify the templates on demand? Also, how much customizations are allowed? I feel system generated contexts shouldn't be allowed to be modified, users can define their own contexts if needed on top of system contexts.

Why not propose to use cluster state metadata for new system context templates?

Existing component templates are part of cluster metadata. Are you proposing to store these context templates in a separate system index? why do you think it can't fit into cluster state when these are just subset of settings optimized for a usecase? Do you foresee this growing huge? This could be a new custom metadata in cluster state.
Extending existing component templates offer other advantages like merging of these templates, defining priorities across component templates etc.

One aspect that I don't see getting discussed is :
Attaching a template may enforce certain restrictions for the use case, how are those handled e.g. for strict_append_only use case may prevent document level updates or delete?
Also, thinking more, can you remove a context from the created index?

@mgodwan
Member Author

mgodwan commented Mar 20, 2024

Thanks @shwetathareja for sharing your thoughts on this.

For Context templates repository, It will be pre-defined in OpenSearch core or users can modify the templates on demand? Also, how much customizations are allowed? I feel system generated contexts shouldn't be allowed to be modified, users can define their own contexts if needed on top of system contexts.

I believe we should have a set of pre-defined templates exposed through core. Users/plugins should be allowed to build on top of these or create new ones, but not modify the existing ones exposed via OpenSearch core.

Existing component templates are part of cluster metadata. Are you proposing to store these context templates in a separate system index? why do you think it can't fit into cluster state when these are just subset of settings optimized for a usecase? Do you foresee this growing huge? This could be a new custom metadata in cluster state.

Yes, I was thinking if it grows huge, it would be better to decouple from the cluster state and maintain a separate system index. I don't have a strong opinion on this implementation detail given we have been doing some work to make the cluster state more scalable and extensible. As we move towards looking more into low level details, we can perform some stress testing to see if cluster state fits the use case, and reduce the overhead of maintaining a different system index.

Attaching a template may enforce certain restrictions for the use case, how are those handled e.g. for strict_append_only use case may prevent document level updates or delete?

The idea is to continue to have a setting for each restriction/optimization that is applied, while the templates act as an interface to get those details into the index. Those settings can continue to govern the behavior at runtime.

Also, thinking more, can you remove a context from the created index?

I don't think we should allow this. A context, once tied to the index, adds certain restrictions, and it may not always be safe to remove it.

@andrross
Member

[Triage - attendees 1 2 3 4 5 6]
@mgodwan Thanks for filing, this is definitely an interesting topic and looking forward to more discussion and progress here.

@andrross andrross added discuss Issues intended to help drive brainstorming and decision making and removed untriaged labels Mar 20, 2024
@smacrakis

This looks very useful!
But the name of the feature (Context Aware Index Templates) I find confusing.

"Context Aware" makes it sound like they detect the context themselves, which if I understand the proposal correctly, they don't -- they're just applicable to a certain use case

I think you're calling them "templates" because they have some un-set parameters. But that doesn't tell us what they are templates for. They are templates for configurations, right?

So maybe call them "standard configurations" or "parameterized configurations", e.g., the Logs Configuration, the Metrics Configuration, etc.?

@rohin

rohin commented Apr 8, 2024

This certainly sounds useful and opens up possibilities to optimize. I would first want to explore the dimensions of the problem. Is it a template or a type? Is there a fundamental difference between each context or use case which warrants different templates, in which case is a template the right solution, or should we look at different types of indices? Does it need to be extensible?

Is it fundamentally a problem of how data is stored? For example, the index is an inverted index. Does this mean we need a different type of index? That would mean we may want to perform operations like write and search differently for such data.

Would be good to get some of these answers.

@mgodwan
Member Author

mgodwan commented Apr 10, 2024

Thanks @smacrakis for agreeing to the problem statement highlighted and your suggestions.

So maybe call them "standard configurations" or "parameterized configurations", e.g., the Logs Configuration, the Metrics Configuration, etc.?

Doesn't the term "templates" denote "parametrized configurations"? I am open to a new name but I don't think context aware would mean that it can detect context, it just means that it is designed to be aware of the context (i.e. use case). If the wider community still feels that the term "context aware" may be a forced-fit here, we can work on updating the terminology.

@mgodwan
Member Author

mgodwan commented Apr 10, 2024

Thanks @rohin for your thoughts and questions.

Is it a template or a type? is there a fundamental difference between each context or use case which warrants different templates in which case is template the right solution or we should look at different types of indices?

The index abstraction exposed today in OpenSearch over the Lucene index is configurable in terms of what we want the index to provide (through settings, mappings, etc.). The challenge lies in deciding how to configure those settings. Hence, the proposed templates act as a provider of the configuration best suited to each use case. Instead of making this an index type (which adds extra coupling), relying on the index metadata/settings for granular control allows advanced users to configure the index in depth directly, while the templates lower the entry barrier for new users.

Does it need to be extensible?

Could you highlight the dimensions of extensibility you were thinking of here? I can better answer based on the thought behind this.

Is it fundamentally a problem of how data is stored. For example the index is an inverted index. Does this mean we need a different type of index? Which means we may want to perform operations like write and search differently for such data.

I think the inverted index is the core for any kind of analytics we want to support, and we have other data structures to support operations like sort and aggregations as well. Even today, based on the mappings, we use different index implementations (e.g. FST for text, BKD tree for numeric fields, etc.) and may choose not to allow certain operations for a use case based on what kind of data structures are created.

While essential data structures are a core point of proposal (e.g. for frequently updated events with performance as primary factor, always create a bloom filter), there are other optimizations beyond how data is stored (e.g. refresh interval, replication strategy, merge policy etc) which have a key role to play in the performance/cost optimizations users can get out of their OpenSearch cluster for their respective use cases.

@smacrakis

I still think "context aware" is wrong. They are not aware of their context -- they are simply designed for a particular application or use case.

The word "template" does imply parameterization, but it doesn't say what they are templates for.

@backslasht
Contributor

I still think "context aware" is wrong. They are not aware of their context -- they are simply designed for a particular application or use case.

I think you are primarily concerned about the word "aware"(ness). Would "context based templates" or "context specific templates" sound right?

@smacrakis

Yes, I think "aware" is misleading, because it implies that the template adapts to the context it's in.
But "context" isn't quite right either. I would say "application" or something.
"Template" is not misleading, but it is overly broad. It is a template for something, namely a configuration.

@mgodwan
Member Author

mgodwan commented Apr 22, 2024

@smacrakis @backslasht @shwetathareja @rohin Thanks for your points around the terminology for the feature.

Based on the feedback provided, how does "Application Based Configuration (ABC) Templates" sound for this?

@smacrakis

Sounds good, thanks for the discussion!

@nateynateynate
Member

This seems like it could coexist as one or two pages of documentation to talk about suggested index settings for certain use cases. As long as people have a resource to read why those settings are better.

A second thought - if we're going to offer options like this, we should at least build a baseline dashboards page that shows toggle switches for these most common index optimization options. It never hurts to include text/hovers on what those options specifically do.

I've always wondered why a lot of new features don't at least have a bare-minimum GUI implementation. Without a GUI, we're enabling professionals while withholding education from newcomers.

@spapadop

A simple question here.
Above you mentioned that things defined in index templates override the context template items.
So I assume that things defined in component templates will also override context template items?

That would lead to this order of application: context template -> component template -> index template

@mch2 mch2 added the Roadmap:Ease of Use Project-wide roadmap label label May 14, 2024
@mgodwan mgodwan changed the title RFC: Context Aware Index Templates RFC: Application Based Configuration Templates Jul 10, 2024
@mgodwan mgodwan added the v2.16.0 Issues and PRs related to version 2.16.0 label Jul 10, 2024
Projects
Status: 2.16 (First RC 07/23, Release 08/06)
Status: New
9 participants