
[RFC] Add Search to Remote-backed Storage #6528

Closed
andrross opened this issue Mar 2, 2023 · 40 comments
Labels: distributed framework, enhancement, feature, idea, RFC, Search

andrross (Member) commented Mar 2, 2023

Overview

The storage vision document describes the high level thinking for storage in OpenSearch, and the roadmap introduced the first phases of that work. Today, remote-backed storage and searchable snapshots have been implemented and are nearing general availability. The next phases of the roadmap describe the concept of a “searchable remote index” which will add functionality to natively search data in the object store for remote-backed indexes. To that end, this proposal builds on the search node functionality introduced in searchable snapshots to allow for searching data natively in remote-backed storage.

Proposed Solution

This document details a proposal to add search replicas to remote-backed indexes. In this context, a search replica is a host in the cluster that can serve search queries from the data in the remote object store without fully copying the data to disk ahead of time. A node in the cluster can be dedicated to the search role, but need not be. From a 10,000-foot view, a remote-backed index with search replicas will behave the same as any other index: a user can submit reads and writes and the cluster will behave as expected. Zoom in a bit, and there are cost and performance considerations. Below are two primary use cases:

Time Series Data

This is the canonical use case in log analytics workloads. A user starts with a regular remote-backed index serving both reads and writes while seamlessly mirroring the data into the remote object store. When this index is rolled over it will no longer be written to but can still be searched. The searchable remote index keeps this index searchable while offloading the full data storage to the remote object store. Just as remote-backed indexes are envisioned as a successor to snapshots, the searchable remote index is a successor to searchable snapshots, as the continuous snapshot-like behavior of a remote-backed index allows for remote searching without additional workflow steps or orchestration.

[Diagram: Step 1 — OpenSearch remote-backed indexes (logs); Step 2 — OpenSearch searchable remote index]

Deep Replicas

This use case essentially combines the two phases of the time series use case into a single index: a remote-backed index is serving writes and mirroring the data to the remote object store, while search replicas are serving search requests from the remote data. This configuration can solve the use case for very high search throughput workloads, as the remote search replicas can be scaled independently from the data nodes. The tradeoffs with an all-data-node architecture are that there will likely be increased replication delay for search replicas to become aware of new writes, and individual search requests can see higher latencies depending on workload (see the “search node” section below for future performance work). There is also an entire future work stream, mentioned in the “data node” section below, for changes to make the writers more remote-native that will benefit this architecture.

[Diagram: OpenSearch remote-backed index with search nodes]

Automation

The introduction of index configurations with cost and performance benefits and tradeoffs is a natural fit for automation with Index State Management (ISM). This document is focused on defining the right core concepts and features, but ISM integration is a priority and will be implemented.
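
For illustration, a minimal sketch of what such a policy could look like (the rollover and replica_count actions exist in ISM today; the remote_search_replicas action and its fields are hypothetical placeholders for whatever this proposal ultimately introduces):

PUT _plugins/_ism/policies/logs_remote_search_sketch
{
  "policy": {
    "description": "Sketch only: remote_search_replicas is a hypothetical action",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [{ "rollover": { "min_index_age": "1d" } }],
        "transitions": [{ "state_name": "remote_search", "conditions": { "min_index_age": "2d" } }]
      },
      {
        "name": "remote_search",
        "actions": [
          { "replica_count": { "number_of_replicas": 0 } },
          { "remote_search_replicas": { "count": 2 } }
        ],
        "transitions": []
      }
    ]
  }
}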

Design Considerations

Search Node

The search node will continue to build on the functionality introduced in searchable snapshots. It will use the block-based fetching strategy to retrieve the necessary data to satisfy search requests on-demand, and the file cache logic to keep frequently accessed blocks on disk for quick access.
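
For context, this is how the existing searchable snapshot flow is invoked today: a snapshot is restored with storage_type set to remote_snapshot so that its shards are allocated to search nodes and fetched block-by-block into the file cache (the repository, snapshot, and index names are illustrative):

POST _snapshot/my-repository/my-snapshot/_restore
{
  "storage_type": "remote_snapshot",
  "indices": "logs-2023-01-01"
}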

From the perspective of the OpenSearch engine, Lucene’s Directory abstraction is the key interface point behind which the complexities of different storage implementations are hidden. For example, for any storage technology that is accessed through the operating system’s file system (such as physically attached disks or cloud-based block stores like EBS), the FSDirectoryFactory is used to create the appropriate Directory implementation for listing and accessing files. For remote-backed storage, a RemoteDirectory implementation is used for uploading and downloading files used by data nodes in a remote-backed index. Searchable snapshots implemented a RemoteSnapshotDirectory that provides access to files that are physically represented as a snapshot in a repository (the specific repository implementation is provided via a storage plugin). We will create a similar remote-search-focused Directory implementation for searching remote-backed indexes stored in a repository, using the same underpinnings as RemoteSnapshotDirectory to provide the block-based fetch and caching functionality. The bulk of the logic for the Lucene abstraction that implements the on-demand fetching and file caching lives in the class OnDemandBlockIndexInput and will be reused here.

Some of the changes required (relative to searchable snapshots) are using the remote-backed storage metadata, rather than snapshot metadata, to identify segment file locations, and the ability to refresh that metadata when data in the remote object store changes. Critical performance work will also continue in this area, such as:

  • Concurrent search to parallelize the time spent waiting on data retrieval
  • Techniques to isolate resource consumption of remote search queries when co-locating search and data roles on the same hardware
  • Strategies to partially fetch data ahead of time based on access patterns or heuristics

Data Node

As this proposal is focusing on introducing remote search to remote-backed indexes, no changes are necessary for the data nodes in the first steps. The data nodes will continue to keep a full copy of the index data on local disk while mirroring it into the remote object store. Replica data nodes can be provisioned in order to allow for fast fail-over without the need to restore, and node-to-node replication will keep the replica nodes up-to-date. However, there is a lot of potential future work in this area, such as offloading older data to keep the writers lean, mechanisms to perform merges directly in the remote object store by a separate worker fleet, etc. That work is beyond the scope of this phase, but the goal here is to build incremental pieces that are extensible to allow such architectures.
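
For reference, a sketch of what creating such a remote-backed index looks like with the experimental settings available today (the repository name is illustrative, and the exact setting names may change as the feature matures):

PUT logs-2023-01-01
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1,
    "index.replication.type": "SEGMENT",
    "index.remote_store.enabled": true,
    "index.remote_store.repository": "my-segment-repo"
  }
}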

Shard Allocation

Searchable snapshots introduced the concept of remote shards that can be allocated to search nodes. This will have to be built on and expanded for searchable remote indexes. In particular, the case where some shards are allocated to data nodes (local) and some are allocated to search nodes (remote) within the same index will be novel.

Cluster State

This proposal does not change the cluster state architecture: the cluster state will still be managed by a leader elected cluster manager node. As remote search capabilities remove the size of the local disk as a constraint for holding indexes, keeping the index metadata in the cluster state may well become a bottleneck. This is another area for future work to offload the metadata in order to offer a storage tier that can scale beyond the limits of what the cluster manager can manage.

How Can You Help?

Any general comments about the overall direction are welcome. Some specific questions and ways to help:

  • What are your use cases where this architecture would be useful?
  • Provide early feedback by testing remote-backed indexes and searchable snapshots.
  • Help out on the implementation! Check out the project boards or the issues page for work that is ready to be picked up.

Next Steps

We will incorporate feedback and continue with more concrete designs and prototypes.

andrross (Member, Author) commented Mar 2, 2023

Would particularly appreciate feedback from @nknize, @Bukhtawar, @sachinpkale, @muralikpbhat, and @reta, so tagging for visibility. Thanks!

sohami (Collaborator) commented Mar 2, 2023

When this index is rolled over it will no longer be written to but can still be searched. The searchable remote index keeps this index searchable while offloading the full data storage to the remote object store. Just as remote-backed indexes are envisioned as a successor to snapshots, the searchable remote index is a successor to searchable snapshots, as the continuous snapshot-like behavior of a remote-backed index allows for remote searching without additional workflow steps or orchestration

Not very clear on this. Given your reference to the searchable remote index being a successor to searchable snapshots, is the idea that for a rolled-over remote-backed index the user will need to follow essentially the same mechanism of deleting the rolled-over index and re-opening it using some API (similar to searchable snapshots) to make the index searchable directly from the remote store?

Also, it seems like the RFC is tying the concept of a node (search/data node) to the type of read capabilities supported for an index. Instead, I think we should have different shard types, something like local-reader/remote-reader shards. These shards could then be allocated to any type of node configuration (data only, search only, data+search) to provide more flexibility. With this, for use case 1 we can have all the shards of an index be remote-reader shards, and they may be allocated to data/search nodes based on cluster configuration. For use case 2, an index can have write shards and remote-reader shards, either both on data nodes or split between data and search nodes.

andrross (Member, Author) commented Mar 2, 2023

Instead, I think we should have different shard types, something like local-reader/remote-reader shards.

I think we're on the same page, as this is what I'm trying (but not doing well!) to describe here. For use case 1 I'm thinking something like the following (specific names/structure used just for illustration):

Step 1:

logs-2023-01-01:
  -number_of_writers: 1
  -number_of_local_replicas: 1
  -number_of_remote_replicas: 0

Step 2:

logs-2023-01-02:
  -number_of_writers: 1
  -number_of_local_replicas: 1
  -number_of_remote_replicas: 0
logs-2023-01-01:
  -number_of_writers: 0
  -number_of_local_replicas: 0
  -number_of_remote_replicas: 2

The allocator would allocate these shards as appropriate across nodes (e.g. a remote_replica would require a node with the "search" role with a file cache configured). There would be no explicit delete/remount for the first index across steps, and most users would use ISM for orchestrating this similar to how they do today.

sohami (Collaborator) commented Mar 2, 2023

Thanks for sharing the example, a couple of points:

  • In the example above for step 2, there can be an intermediate step where the user may still want to keep the writers active for the rolled-over index but, for cost reasons, have readers use the remote data only. For example, the user may have some documents in that index that need to be deleted (by id or with delete-by-query semantics, for security compliance reasons) or may want to update a document in that index.

Step 1_2:

logs-2023-01-02:
  -number_of_writers: 1
  -number_of_local_replicas: 1
  -number_of_remote_replicas: 0
logs-2023-01-01:
  -number_of_writers: 1 <------------------ can still have writers
  -number_of_local_replicas: 0
  -number_of_remote_replicas: 2

The allocator would allocate these shards as appropriate across nodes (e.g. a remote_replica would require a node with the "search" role with a file cache configured). There would be no explicit delete/remount for the first index across steps, and most users would use ISM for orchestrating this similar to how they do today.

  • Just to be clear, can a node with the search role either be a dedicated node or a node that also has the data role?

andrross (Member, Author) commented Mar 2, 2023

In the example above for step 2, there can be an intermediate step where the user may still want to keep the writers active for the rolled-over index but, for cost reasons, have readers use the remote data only.

Yes, definitely. And I think we'll want future investment here to allow for configurations where the writer in that example can be lighter weight and maybe not require a full copy of the index data in local storage.

Just to be clear, can a node with the search role either be a dedicated node or a node that also has the data role?

Yes, there are cost and performance considerations, but both configurations are possible.

sachinpkale (Member) commented:

Thanks @andrross for the proposal.

Regarding deep replicas, how would they be different from a normal replica? Also, there is an initiative to integrate segment replication with remote store: #4793. Does the overall idea remain the same?

Yes, definitely. And I think we'll want future investment here to allow for configurations where the writer in that example can be lighter weight and maybe not require a full copy of the index data in local storage.

We definitely need to understand the write use case better. Given the immutable property of segments, we can't modify them.
A few questions come up:

  1. How do we append new segments without downloading all the segment data?
  2. If we end up creating a lot of small segments as part of these writes, how do we handle merges?
  3. How do we support updating an existing document?
  4. How quickly can a new write become searchable?

gbbafna (Collaborator) commented Mar 7, 2023

Thanks @andrross for the proposal. Looks really good.

A few more questions we need to answer in the design doc:

  1. How will we manage deep replicas in the cluster - we will need mechanisms to fail a deep replica in case it is not catching up or is really slow in searches.
  2. Shard balancing: Deep replicas will be network-, CPU-, and disk-intensive, even more so than searchable snapshots, because the data can change more often. Merges etc. can require downloading complete new segments.
  3. Push/pull mechanism for deep replicas - will the primary send notifications to the deep replicas along the lines of segment replication, or will they keep polling the remote store?
  4. Client behavior change - reads can go to a primary/replica or a deep replica. Will that be transparent to the user, or will we use preference, since data liveness can vary between all of them?

reta (Collaborator) commented Mar 7, 2023

Great proposal @andrross, it really makes sense at a high level. Just one question on a subject you briefly mentioned:

as the remote search replicas can be scaled independently from the data nodes

This is a really useful feature, but it may introduce the need for capacity planning for search replicas; for example, having a large index spread over 5 data nodes and trying to fit the searchable replica onto 1 search node may not be practical. Any constraints we could identify here upfront? (somewhat related to @gbbafna's comments)

Thank you.

andrross (Member, Author) commented Mar 8, 2023

@sachinpkale @gbbafna re: Segment replication

There is definitely overlap in some of the use cases here and segment replication. The fundamental difference, as I see it, is that remote searchers fetch data as needed, whereas a segment replication replica proactively keeps an up-to-date copy of all data local. A segment replication replica is also able to be promoted to act as the primary, whereas a node with a remote search shard is inherently only ever serving reads. To answer the question about push vs pull, I think the remote searchers will be implemented to poll the remote object store in order to decouple them from the primaries.

Yes, definitely. And I think we'll want future investment here to allow for configurations where the writer in that example can be lighter weight and maybe not require a full copy of the index data in local storage.

How do we append new segments without downloading all the segment data? If we end up creating a lot of small segments as part of these writes, how do we handle merges? How do we support updating an existing document?

These are great questions! These will need to be solved if we do want to make the writer not require a full copy of the index data, but that is something to be designed and built later and not strictly needed as a part of this design.

How quickly can a new write become searchable?

I think once you factor in the polling model plus any eventual consistency delays in the remote object store, the answer is that new writes will probably take longer to become searchable than in the cluster-based node-to-node replication model. This is likely one of the tradeoffs involved with this architecture.

Client behavior change - reads can go to a primary/replica or a deep replica. Will that be transparent to the user, or will we use preference, since data liveness can vary between all of them?

This is a great question. My inclination is to send searches to the search node/shards by default, but I suspect there may be a need to allow for overriding that behavior and directing a search directly to a primary/replica.
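
As a rough sketch of that override (the preference query parameter exists today; the _data_copies value below is purely hypothetical and only meant to illustrate directing a search at the primary/replica instead of the search shards):

GET logs-2023-01-01/_search?preference=_data_copies
{
  "query": { "match_all": {} }
}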

This is a really useful feature, but it may introduce the need for capacity planning for search replicas; for example, having a large index spread over 5 data nodes and trying to fit the searchable replica onto 1 search node may not be practical. Any constraints we could identify here upfront?

@reta Do you have anything specific in mind here? Since the searchers are relatively stateless, with the remote object store holding the authoritative state of the data, they can potentially be more easily added and removed, allowing a cluster admin or managed service operator to offer more autoscaling-like features. But do you have something in mind for building directly into the OpenSearch software itself?

reta (Collaborator) commented Mar 8, 2023

Do you have anything specific in mind here?

Thanks @andrross, I think the general question, to which I have no clear answer yet, is: how many nodes with the search role does one need? What are the factors that impact this aspect of scaling?

shwetathareja (Member) commented Apr 11, 2023

Thanks @andrross for the proposal.

I see two ideas clubbed in this proposal:

  1. How to support searches for indices backed by remote storage (read-only/ writable)
  2. Reader/ writer separation

I feel reader/writer separation shouldn't be tied to the remote store only. In segment replication, most of the indexing work is done by the primary, and replicas copy the segments over. A user might still want reader/writer separation with segrep indices backed by local storage. One of the key reasons could be to get predictable performance for both types of workloads, so that a sudden surge in one doesn't impact the other. This way a user can also ensure a search-only replica will never be promoted to primary. Reader/writer should be thought of as roles, and the user can decide whether to co-locate the roles on the same node (same JVM) or on different nodes.

My inclination is to send searches to the search node/shards by default, but I suspect there may be a need to allow for overriding that behavior and directing a search directly to a primary/replica.

This could be a request param where the user can choose to always use a read_replica or fall back to the primary in case no read_replica is available.

How will we manage deep replicas in the cluster - we will need mechanisms to fail a deep replica in case it is not catching up or is really slow in searches.

+1. Earlier, in-sync allocation used to keep track of all in-sync copies. Here also, there should be a way to detect and fail deep read replicas if they are not catching up for some reason. Also, results for the same search request could vary significantly across requests depending on which read_replica the request lands on and how far that replica has caught up with the primary. Basically, the eventual consistency would be more visible now.

Btw, this might require cluster state changes to track read replicas, and they would also not be accounted for in in-sync allocations.

andrross (Member, Author) commented:

I see two ideas clubbed in this proposal:

@shwetathareja Definitely! I think there are three dimensions here:

  • Role ("reader" versus "writer")
  • Data loading strategy ("partial: fetch on-demand when required" versus "full: proactively keep a full copy on disk")
  • Data source ("fetch from remote object store" versus "fetch from peer nodes")

Not every permutation across these options makes sense, though. For example, the "partial" data loading strategy likely only makes sense when pulling from a remote store with a durable copy of the data, and I doubt we would ever implement a version of that functionality that pulls from a peer. Similarly, "fetch from peer nodes" is probably only desirable when not using a remote store, from which you could otherwise get the data without burdening a peer.

Expanding on that and to make it more concrete, the following are some sketches (some details elided for clarity) of what different index configurations could look like:

Seg-rep only, no remote, separate searchers:

{
  "replication": {
    "type": "SEGMENT"
  },
  "remote_store": {
    "enabled": false
  },
  "primaries": {
    "instances": 2,
    "loading_strategy": FULL
  },
  "searchers": {
    "instances": 10,
    "loading_strategy": FULL
  }
}

Read-only warm-like tier:

{
  "replication": {
    "type": "SEGMENT"
  },
  "remote_store": {
    "enabled": true
  },
  "primaries": {
    "instances": 0
  },
  "searchers": {
    "instances": 2,
    "loading_strategy": PARTIAL
  }
}

Writable warm-like tier:

{
  "replication": {
    "type": "SEGMENT"
  },
  "remote_store": {
    "enabled": true
  },
  "primaries": {
    "instances": 1,
    "loading_strategy": PARTIAL
  },
  "searchers": {
    "instances": 2,
    "loading_strategy": PARTIAL
  }
}

Deep replicas:

{
  "replication": {
    "type": "SEGMENT"
  },
  "remote_store": {
    "enabled": true
  },
  "primaries": {
    "instances": 2,
    "loading_strategy": FULL
  },
  "searchers": {
    "instances": 10,
    "loading_strategy": PARTIAL
  }
}

Doing all of this is a lot of work and may take a long time, but getting the abstractions right is important. Also this gets quite complicated, so we'd want to be able to define a simple orchestration policy with something like ISM. What do you think?


mechanisms to fail a deep replica in case it is not catching up or is really slow in searches

For sure. This is a problem today for segment replication, and some strategies for dealing with this have been implemented. I think we'll want similar strategies here to fail replicas when they fall behind and can't catch up.

shwetathareja (Member) commented:

Doing all of this is a lot of work and may take a long time, but getting the abstractions right is important. Also this gets quite complicated, so we'd want to be able to define a simple orchestration policy with something like ISM. What do you think?

Yes, agreed there could be multiple permutations, and we need to decide which ones to focus on in Phase 1. I don't think we can offload the responsibility to ISM completely, but it can definitely handle scenarios like moving from hot to warm (read-only), where it can set the write copies to 0. It can also handle the loading_strategy based on customer policy configuration, e.g. load segments on demand (partial) vs. always on disk (full). Customers can also define default counts/loading strategies in templates for write and search copies, which are picked up during index creation, else fall back to system defaults.

nitpick on the naming: Don't overload the existing terminology like using primaries for both write copies (primary & replica). It will create confusion.

sohami (Collaborator) commented Apr 12, 2023

I think the role needs to be extended to both the node and shard level. A node-level role will help with allocation decisions, and a shard-level role will help define whether a shard can serve both write & search traffic, or only search, or only write traffic. We will need to provide abstractions such that: 1) we can achieve writer/searcher isolation, so that failure of one shard type doesn't impact the other and we get true writer/searcher separation; and 2) for cost reasons, I should be able to have a single shard serve multiple roles and still achieve HA (status quo). The recoveries during failover can be limited to shard copies of the same role by choosing primary+replica within a shard role group. For example, for an index with 2 shards (the same can be extended with remote store and different loading_strategy):

Local-store-based seg-rep index. For each shard there will be 2 copies. Both copies will serve both roles, so they can participate in recovery.
{
   "shard_count": 2,
   "replication": {
      "type": "SEGMENT"
    },
  "remote_store": {
     "enabled": false
   },
   "writers, searchers": {
        "instances": 2,
        "loading_strategy": FULL
    }
}
Local-store-backed index with writer & searcher separation and HA. Each shard has 4 copies, with 2 writers and 2 searchers.
{
   "shard_count": 2,
   "replication": {
      "type": "SEGMENT"
    },
  "remote_store": {
     "enabled": false
   },
   "writers": {
        "instances": 2,
        "loading_strategy": FULL
    },
  "searchers": {
       "instances": 2,
       "loading_strategy": FULL
  }
}

shwetathareja (Member) commented Apr 13, 2023

@sohami: As a user, either I am looking for writer/searcher separation or I am not, irrespective of local or remote storage. If you need separation, you will configure the number of writers and searchers separately; otherwise you only need to configure writers, which will serve both reads and writes. Now the question comes for read-only indices: for those, you only need to configure searchers explicitly, or an ISM policy can do it during hot-to-warm migration. In case writers are configured for a read-only index, should that be treated as a no-op? (More like an optimization.) Do you still feel the need for a separate role at the shard level as well? I am trying to simplify things for the user so they don't have to configure or understand too many settings.

andrross (Member, Author) commented:

If you need separation, you will configure the number of writers and searchers separately; otherwise you only need to configure writers, which will serve both reads and writes.

This was my thinking as well, and the reason I used the (admittedly bad) "primaries" name above is that writers can implicitly do both reads and writes. The new concept being introduced here is the "searcher"; the "writer" is essentially the same as what exists today in that it is capable of doing both reads and writes. As @shwetathareja said, provisioning only "writers" gives you the same behavior as today of serving both reads and writes. Adding "searchers" gives you read/write separation by default (with the likely option to direct searches to the writers via a request-level override parameter). All of this is irrespective of local vs remote and even full vs partial data loading, though we'll need to prioritize the work here.

I do wonder if this would box us in by codifying the behavior that writers are capable of doing reads as well. Given that update and updateByQuery exist in the write API, it seems reasonable to me to bake in the behavior that the writers are capable of serving reads as well. If, however, we envision a future capability where writers do not have the ability to do reads, then we would need the flexibility that @sohami described above.

nknize (Collaborator) commented Apr 13, 2023

Sorry for the long delay here as I'm just getting time to start reviewing this RFC. First, thanks for starting this discussion! We've needed a place to start pulling apart these mechanisms. I haven't read through all the comments here, so I'll start with some meta comments / questions.

  1. Remote-backed Storage is a strange name. Can we simplify this and just call it Remote File Storage or maybe Remote Backed Index?
    a. Let's not conflate the different remote store implementation details in the overview and keep this simple: this is about segments living on a remote file system device. Whether that's NFS, S3, Azure, or GCS only matters for performance (and to some extent feature differences, but those concrete differences should be handled by the specific storage plugins).
    b. Let's not pin this to Object Stores specifically since that's S3, Azure, GCS specific optimization. In the core a "Remote File System" should be able to work with EBS and an NFS mount just as easily. Those other concrete integrations can happen in the storage-s3 storage-azure storage-gcs plugins.

  2. Akin to 1., can we avoid the name searchable remote index? It's confusing because the index is not remote. We're still lighting up an IndexReader on local segments (regardless of pre-fetching segments / blocks). Whether it's a primary node or a replica node is a discussion for availability and scale.

  3. Akin to 1. and 2. can we begin the document with tldr definitions of terms/phrases like remote file storage, object storage, file storage, remote backed index so we can crystallize what I mention in 1 and 2?
    a. This is equivalent to scoping the discussion. Separately from this, I think we should rework the Remote-Backed Storage reference guide because it conflates the translog mechanism (which is used only for durability / failures) with the indexing. This isn't exactly accurate since with remote storage we can begin moving to support scenarios where the translog is no longer necessary; and this reference guide makes it seem like it's required for indexing (it's not).

  4. The RFC highlights two use cases: 1. time-series, and 2. deep replicas. These are use cases that will give the best performance but let's make sure it's not confusing that we're designing this for those use cases only. Time series has a lot of other mechanisms at play (e.g., index / segment sorting, merge policies, IndexOrDocValues optimizations). Remote File Storage is just a feature designed to give better performance and durability while also supporting other efforts like "serverless" to enable scaling to zero (whether thats Kubernetes or a managed service offering).

  5. ...the searchable remote index is a successor to searchable snapshots...
    a. I don't think so. Point In Time is a successor to searchable snapshots. Getting a built-in "PIT" is just a nice side-effect feature of segment replication + a durable remote file storage. For the record, I think mentioning this idea of time series "roll over" is just conflating a few topics that we should detail in a child issue. If you open a PIT Reader you can't remove those referenced segments, as the PIT Reader will crash. Segrep keeps these segments around using a refcount (a garbage collection design concept). Extend this PIT Reader concept to "PIT Backups". That's a different mechanism. A user should be able to (either on-demand or through a periodic async job) invoke a "PIT Backup". What would happen in that instance is the current segment geometry would be "frozen", the segment geometry copied to a new active directory, and new writers / merge schedulers would use those active segments instead of the frozen "backup segments". Let's not conflate that idea in this issue. Just detail it as an "easy" follow up.

  6. ...use the remote-backed storage metadata to identify segment file locations as opposed to snapshot metadata, and the ability to refresh that metadata when data in the remote object store changes.
    a. I think this effort should begin laying the foundation for remote stored cluster state so we can move to efficiently supporting Auto Scaling and serverless. Let's start writing those foundation classes (interface contracts) in the :libs:core and look at abstracting existing cluster state functionality so that we can support a move to centralizing this.

That's my round 1. I'll look at comments and add more color as this marinates.

sohami (Collaborator) commented Apr 13, 2023

Do you still feel the need for a separate role at the shard level as well? I am trying to simplify things for the user so they don't have to configure or understand too many settings.

I am treating writers and searchers as shard-level roles here (different from primary and replica). The "writer" shard role is what we have today; the "searcher" shard role would be a new shard type that can be added. For the default case where separation is not needed, users do not have to configure anything special; we may use the writer shard role, which will serve both write and search traffic.

If you need separation, you will configure the number of writers and searchers separately; otherwise you only need to configure writers, which will serve both reads and writes.

This was my thinking as well, and the reason I used the (admittedly bad) "primaries" name above is that writers can implicitly do both reads and writes. The new concept being introduced here is the "searcher"; the "writer" is essentially the same as what exists today in that it is capable of doing both reads and writes. As @shwetathareja said, provisioning only "writers" gives you the same behavior as today of serving both reads and writes. Adding "searchers" gives you read/write separation by default (with the likely option to direct searches to the writers via a request-level override parameter). All of this is irrespective of local vs remote and even full vs partial data loading, though we'll need to prioritize the work here.

I was originally confused by "primaries" being used for "writers", so I was calling out that we need some mechanism to define shard-level roles (or different shard types) like writer/searcher separately from "primary" and "replica". Essentially, the main callout is that failover will happen within the same shard role (or type), not across shard roles. So primary/replica or a similar concept will exist within a role group.

I do wonder if this would box us in by codifying the behavior that writers are capable of doing reads as well. Given that update and updateByQuery exist in the write API, it seems reasonable to me to bake in the behavior that the writers are capable of serving reads as well. If, however, we envision a future capability where writers do not have the ability to do reads, then we would need the flexibility that @sohami described above.

I think the writer shard should be able to perform both writes and reads of the data to serve different APIs like bulk, update, or updateByQuery using its view of the data. Being able to read the data doesn't mean it has to serve the search API, and being a writer doesn't mean it cannot serve the search API. In a completely separate writer/searcher world, we can make writer shards not serve the search API when separate searcher shards are available by controlling shard resolution at the coordinator. This shard resolution can be a pluggable rule with some defaults.

nknize (Collaborator) commented Apr 13, 2023

...so I was calling out that we need some mechanism to define shard-level roles (or different shard types) like writer/searcher separately from "primary" and "replica".

Why? You're not going to write to a replica. I think what you're referring to is a "Replica only" Data node. This is highly discouraged because it flies in the face of read/write scalability (e.g., not all nodes created equally). I think we should pin this mechanism to a serverless plugin and discourage on-prem users from scaling up "read only" nodes unless they are operating in a serverless environment?

nknize (Collaborator) commented Apr 14, 2023

Just dropping this brainstorm here until @shwetathareja opens the issue, then I'll move it over.

Another thought I've been mulling over is (in a cloud hosted or on-prem serverless environment) how we could parallelize primaries to achieve what (I think) is being described here with the read/write separation. In the face of write congestion, can we auto-scale up additional "primary" writer nodes that write concurrently alongside the "primary"? In this scenario we would effectively have multiple primaries writing to the same shard. We could logically think of this as sharding a shard. That is, keep the same shard id range for the shard but split it into a "super shard set". Add a new primary_id number to the global checkpoint such that global sequencing / recovery is now achieved using sequenceNo, primary_term, and primary_id. Add a reducer mechanism that merges the split shard segments using IndexWriter.addIndexes. When load subsides, go back to a single writer and scale down the writer nodes. No rehashing or shard rebalancing needed.

andrross (Member, Author) commented:

Let me try to clarify my thinking here and why I'm (over?) complicating this issue with reader/writer separation.

First, let's assume the following definitions:

  • primary: Every index has one (per shard), and it is implicitly defined.

    • Handles write requests for its shard
    • Handles searches
    • Coordinates durability (either replicating document to replica translogs, or writing to remote translog)
  • replica: User defined, can be 0 to 'n' (per shard)

    • Handles searches
    • Acknowledges writes from primary (either copying to its own translog, or just acknowledging the primary term in case of remote translog)
    • Can be promoted to primary

For this feature, I'm envisioning a third case, with essentially a sole responsibility of "handle searches". These pure searchers can be scaled independently, and don't have to worry about participating in the write path at all, nor maintain any state that allows them to be quickly promoted to act as primary. They can exist by themselves for an isolated "warm" search experience (e.g. the "time series" use case above) or be used alongside a primary for a writeable "warm" experience (e.g. the "deep replicas" use case).

nknize (Collaborator) commented Apr 14, 2023

I don't think that's a third case. That's what replicas are for. Are you just saying you want a replica to never be primary-eligible, and only pull/cache on-demand (at search time)?

If that's the case that's a variant on what I'm describing and I'd separate it from the remote storage mechanism and outline that use case under the issue @shwetathareja is going to open.

There's no reason we couldn't add a flag to prevent a replica node from being primary eligible - though I think that's a cloud hosted / serverless feature where you have "unlimited" resources, because I'm not sure I'd prevent an active node from becoming primary if writes are bottlenecked.

andrross (Member, Author) commented:

Are you just saying you want a replica to never be primary-eligible, and only pull/cache on-demand (at search time)?

Well, it's slightly more complicated than that. Listing the cases detailed in the use cases above:

  • 0 primary, 0 replicas, 'n' searchers (this is the read-only "warm" experience detailed in the time series use case)
  • 1 primary, 1 replica, 'n' searchers (this is the "deep replica" case, with a traditional replica acting as a sort of shadow primary able to fail over very quickly for minimal interruption in the write path in case of a single host failure)
  • 1 primary, 0 replicas, 'n' searchers ("deep replica" case, where a single host failure could lead to a brief outage on the write path, but redundant searchers exist depending on throughput/availability requirements)

I do think these cases have merit (though likely mainly for large clusters where you can specialize hardware for each role). They are also not strictly required for the pull/cache on-demand storage feature, though that capability slots in very nicely for the searchers in an architecture where you have differentiated nodes.

We can certainly make the pull/cache on-demand an index-level property of a remote-backed index. The initial phase would likely not allow writes for such an index and provide the read-only "warm" experience. The next phase would be to implement the logic to support pulling/caching on-demand in the write path (for updates/update-by-query, etc) while also offloading the newly written segments from local storage when required (which is briefly mentioned in the "data nodes" section above).

andrross (Member, Author) commented:

Now that the reader/writer separation has been pulled out to a separate issue (#7258), here is a trimmed-down proposal for just the storage-related portion of this original RFC.

@nknize @shwetathareja Let me know what you think. If we get alignment on this after some discussion and feedback I'll update the main content on this GitHub issue and close it and we'll proceed with the design and implementation for this effort.


Goal

The goal of this effort is to enable a node to read and write from a remote-backed index that is larger than its local storage. This feature will be part of enabling an "infinite storage" user experience where the host resources required to host an index are driven by the usage and access pattern rather than by the total size of the data in the index.

Terminology

Local storage: The storage available to an OpenSearch node via the operating system's file system.
Remote store: Storage accessed via OpenSearch's repository plugin interface. This is commonly one of the cloud object store services for which OpenSearch provides implementations in its default distribution, but any implementation of the repository interface can be used.
Remote-backed storage/index: The new feature in OpenSearch to automatically back up data to a remote store.

Background

OpenSearch nodes read and write index data via the operating system's file system, and therefore a node must have sufficient disk space to store all the necessary data. OpenSearch implements three "disk watermark" thresholds which, when breached, will stop allocating new shards to a node, attempt to allocate existing shards away, and as a last resort block writes in order to prevent the node from running out of disk space. The remote-backed storage feature introduced a new capability to automatically back up data in a durable remote store. However, there is not yet the capability for an index to grow beyond the size of the cluster's local storage, as remote-backed storage relies on having a full copy of the data in local storage for serving read and write requests.
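
For reference, these are the three existing watermark settings mentioned above, shown here with their default values:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}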

The searchable snapshot feature has introduced the capability to pull data from the remote store into local storage on-demand in order to serve search queries. The remote store remains the authoritative store of the data, while local storage is used as an ephemeral "cache" for the (parts of) files that are necessary for the given workload.

Proposal

The goal of this feature is to add an option to remote-backed indices that enables the system to intelligently offload index data before hitting disk watermarks, allowing more frequently used and more-likely-to-be-used data to be stored in faster local storage, while less frequently used data can be removed from local storage since the authoritative copy lives in the remote store. Any time a request is encountered for data not in local storage, it will be re-hydrated into local storage on-demand in order to serve the request.

The first phase of this proposal will implement the pull data on-demand logic for remote-backed indices, but with the constraint that these indices are read-only. This will solve a common use case for time-series workloads where indexes are rolled over and not written to again. Users can modify their ISM policies to apply this new index setting to implement a type of "warm" tier where the full dataset of these indexes can be offloaded from local storage and only the data that is needed will be hydrated into the "cache" on demand at search time. No complex workflows or data snapshotting is necessary here as the data already exists in remote storage. This work will leverage the functionality implemented in searchable snapshots to pull parts of files into local storage on demand at search time.

The next phase will remove the read-only constraint. This will involve implementing a composite Directory implementation that can combine the functionality of the normal file system-based directories (from Lucene), the RemoteDirectory implementation that mirrors data into the remote store, and the on-demand block-based directory added in searchable snapshots. This will require some significant design and prototyping, but the ideal goal is to provide an implementation that can intelligently manage resources and provide a performant "warm" tier that can serve both reads and writes.

shwetathareja (Member) commented:

Thanks @andrross for the concise proposal and capturing the first phase clearly.

Users can modify their ISM policies to apply this new index setting to implement a type of "warm" tier

There will be an API provided to users to migrate their indices to warm tier. ISM would automate this warm migration. Is that right?

Couple of questions:

  1. What would be the behavior of replicas? Would they also keep the data cached, or are they shallow copies? Can users configure any number of replicas for warm indices, or would there be restrictions?
  2. What happens if certain warm indices are never queried? What would be their shards' memory footprint?
  3. Would there be any changes to the disk watermarks like you called out in the background, or would the file cache eviction policies take care of it automatically?
  4. Would moving indices from warm to hot tier be allowed?

andrross (Member, Author) commented:

@shwetathareja Thanks for the comments!

There will be an API provided to users to migrate their indices to warm tier. ISM would automate this warm migration. Is that right?

With this capability, I see two states that the user can choose from for a remote-backed index:

  1. Pin data to local storage: I want the predictable performance of all data being resident in the fastest storage. I prefer my cluster to reject writes rather than lose this predictable performance.
  2. Swap to remote when necessary: I want my system to offload the least accessed data from local storage to free up space for more frequently accessed data. I will see higher latencies for operations that must fetch data from remote, but I expect performance to remain acceptable for an interactive user experience.

Let's call these "hot" and "warm" to use common terms, but I'm not committing to settings or API names here. So given this, there will be a new index setting, say storage_affinity. A user can define an index as storage_affinity: hot for that guaranteed performance, then flip to storage_affinity: warm based on some threshold such as age or size. ISM can be used to orchestrate this. However, long term I believe we should aim for getting the "warm" experience to be such that given an appropriately scaled cluster for the given workload, then there will be no real performance tradeoff and "warm" can become the default: newly-indexed and frequently accessed data will be resident in fast local storage and the older data will organically age off of the local storage. Storage isn't truly infinite, so we'll still need a mechanism archive or delete the oldest indexes.

  1. What would be the behavior of replicas? Would they also keep the data cached, or are they shallow copies? Can users configure any number of replicas for warm indices, or would there be restrictions?

Replicas will still work the same as they do today. The warm replicas would pull data on demand and cache as necessary. I think a common use case, particularly for a read-only warm experience, will be to reduce replicas to zero upon configuring the index to warm. A replica could still be added here for a faster fail-over experience, but we'll probably want a default search routing policy to send searches to the primary to get the full benefit of the cache. Longer term, I think we'll need smarter strategies for replicas to proactively hydrate data when storage is available if warm performance is to be on par with hot.

  1. What happens if certain warm indices are never queried? What would be their shards' memory footprint?

Indexes that are never queried may eventually have most of their data evicted from storage, but would still be tracked in cluster state and still consume some resources. We'll need an archival/cold tier as there will definitely be limits here. I've kept that out of scope here. Do you think an archival tier needs to be a part of this proposal?

  1. Would there be any changes to the disk watermarks like you called out in the background, or would the file cache eviction policies take care of it automatically?

There shouldn't be any user-facing changes, but the system would be able to remove unreferenced data files of warm indexes from local storage before it has to take more aggressive action like blocking allocations, re-allocating shards away, or blocking writes.

  1. Would moving indices from warm to hot tier be allowed?

Yes, users should be able to dynamically change the setting in either direction.

shwetathareja (Member) commented:

We'll need an archival/cold tier as there will definitely be limits here. I've kept that out of scope here. Do you think an archival tier needs to be a part of this proposal?

@andrross Let's say a customer is rolling over on a daily basis and moving older indices to warm. Over time, they can accumulate a lot of indices, considering warm is cheaper. It would be good to know the per-shard memory footprint in terms of shard Lucene metadata (irrespective of cluster state). For example, if the per-shard memory footprint is 50MB and there are 10 shards per index, then over a period of 10 days it can start taking 50MB * 10 * 10 = 5GB of heap unnecessarily. It is something to POC and think about, and based on the impact we can decide whether it should be scoped in.

sohami (Collaborator) commented Apr 27, 2023

@andrross Thanks for the proposal.
A few questions:

  1. During the migration of an index, will there be any impact on users' search traffic? I am assuming writes will not be allowed, as phase 1 supports the transition to a read-only index.
  2. We may need some sort of validation/back-pressure for applying this setting on an index, as there will be a minimum storage footprint (metadata) for each shard that will be kept in the local disk cache. Depending on the size of the metadata blocks, there may be a need to apply back-pressure when moving an index to warm. The same applies for warm-to-hot movement.
  3. Building on point 2, I am wondering if we will need a configurable setting (like cache_to_storage_ratio) that defines a multiplier between the cache size and the total remote storage used by shards. This is to provide a usable warm experience, so that there is enough cache space left for on-demand fetching when querying warm indices. It can act as another back-pressure mechanism for moving indexes to warm, versus a user being able to move lots of indices and ending up in a situation where warm is unusable. It will also make users think about the consequences when overriding this setting.
  4. I think it will also be useful for users to know, via existing APIs like _cat/indices, which indices are using local disk storage and which ones are using the file cache mechanism. When there is a mix of many such indices, looking into the settings of each index may not be the best user experience.

However, long term I believe we should aim to get the "warm" experience to the point where, given a cluster appropriately scaled for the workload, there is no real performance tradeoff and "warm" can become the default:

Nice to see this. +1

shwetathareja (Member) commented Apr 28, 2023

To add to @sohami's points:

  1. @andrross Can users view/filter hot/warm indices in the _cat/indices or _cat/shards output?
  2. I am wondering if we will need a configurable setting (like cache_to_storage_ratio) that defines a multiplier between the cache size and the total remote storage used by shards

This should consider the per-shard memory footprint in the calculation. I think we should give a reference calculation to users so that they can estimate properly. @sohami, can we expose this information (per-shard heap usage) in an API?

andrross (Member, Author) commented Apr 28, 2023

@shwetathareja @sohami Thanks for the feedback! I think I've answered your questions with some more details below. Please let me know what you think.

User Experience

A user can tag a remote-backed index as "warm", allowing the system to offload unused data from local storage.

Detailed Behavior

Adding/removing tag

For all cases, the index should remain usable during transitions. There is modality here, though: for cases where file-cache-capable nodes are separate, state transitions will be shard relocations. For a homogeneous fleet, state transitions will happen in-place. The in-place hot->warm transition (while likely technically complex) should be minimally invasive from a user's perspective as it is essentially an accounting change for existing data on disk: All files on disk are logically moved to the file cache, and are subject to eviction if not actively referenced. The in-place warm->hot transition may require a large data transfer from remote.

Query routing

Routing queries to the primary by default is likely the right initial default in order to maximize cache re-use. Replicas can still exist for faster failover, but may have a cold cache upon being promoted.

Replication

This should be a variant of segment-replication-via-remote-store, except that as a "warm" replica it would not need to proactively pull the latest data into local storage. It can use the same mechanism to become aware of new segments and then pull the data on demand. (Note: to get to the warm-by-default goal we probably won't be able to rely purely on pull-on-demand and will need more sophisticated strategies for keeping likely-to-be-accessed data in local storage.)

_cat APIs

Existing APIs that report on "store size" will need to be expanded to account for the fact that data isn't all resident in local storage. It is still useful to know the total size of indexes and shards, but a new dimension will be needed to report the size of the data resident in local storage. This includes APIs like _cat/indices, _cat/shards, _cat/segments.
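
As a sketch, the existing column-selection syntax could be extended with a new column (local.store.size below is a hypothetical name):

GET _cat/indices/logs-*?v&h=index,health,store.size,local.store.size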

Cluster scaling and limits

We have an existing task to add limits to avoid oversubscribing a cluster with searchable snapshots. The same logic applies here: now that local storage is removed as a limiting factor for the amount of data that can be hosted by a cluster, we need to ensure proper visibility into resource usage exists as well as limits to avoid severely degrading cluster performance. Rolling over to an archival storage tier will be a key tool here for a complete experience, but we will still need visibility and limits for this feature.

Implementation Notes

We currently have three types of org.apache.lucene.store.Directory implementations:

  • org.apache.lucene.store.FSDirectory: from Lucene, works with the local file system
  • org.opensearch.index.store.RemoteSegmentStoreDirectory: used by the remote-backed storage feature to upload and download files from the remote store. This is actually a composite implementation wrapping the segment store and the metadata store. This is currently used only in the upload and restore paths.
  • org.opensearch.index.store.remote.directory.RemoteSnapshotDirectory: used by the searchable snapshot feature to present a Directory interface to Lucene where the data exists in a remote snapshot, but file blocks are pulled into local storage into the "file cache" on-demand at search time.

RemoteSnapshotDirectory is the snapshot-specific variant of a block-based, pull on-demand directory. @neetikasinghal is working on a prototype now that is able to perform this logic against segments stored in the remote store of a remote-backed index. RemoteSegmentStoreDirectory is currently split at the IndexShard level and wired in behind a separate Store implementation.

Assumptions

The files in local storage for "warm" indexes logically reside in a "file cache" where they are evictable if unreferenced by active searches. For this effort, we'll use the fixed-size file cache introduced by searchable snapshots. There are smarter ways to make use of disk space (if available), but I think we can start chipping away at this feature without taking on that problem just yet.

Next Steps

  • Create a prototype/design for a hybrid directory that behaves as a local directory when complete files are present on disk, but can fall back to the block-based on-demand fetch when data is requested that is not present. One of the key goals of this task will be to determine the delta between a read-only variant and a fully writable variant to make a call as to whether the read-only variant is a worthwhile incremental step. (I think the writable variant will increase the scope by requiring reconciling the separate Store approach used by remote-backed indexes, as well as bringing in complexities around notifying replicas of changes.)
  • Design the in-place transitions. How do we go from the full mirroring to local storage behavior of a regular remote-backed index to ref-counted (by the cache), potentially evictable files?
  • Define the additional information needed in the various _cat APIs. This should be equally applicable for searchable snapshot indexes as well.
  • Work on the problem of preventing overloading a cluster with warm indexes defined in [Searchable Remote Index] Add safeguards to ensure a cluster cannot be over-subscribed #7033

sohami (Collaborator) commented May 1, 2023

This should consider the per-shard memory footprint in the calculation. I think we should give a reference calculation to users so that they can estimate properly. @sohami, can we expose this information (per-shard heap usage) in an API?

@shwetathareja I was thinking this would only consider storage and not the memory footprint. The setting cluster.max_shards_per_node limits the shards per node, which I think was added because of per-shard memory overhead. So the new setting is similar to a disk watermark threshold for warm space.
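
For reference, that existing guardrail is a dynamic cluster setting, e.g.:

PUT _cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 1000
  }
}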

sohami (Collaborator) commented May 1, 2023

The in-place hot->warm transition (while likely technically complex) should be minimally invasive from a user's perspective as it is essentially an accounting change for existing data on disk: All files on disk are logically moved to the file cache, and are subject to eviction if not actively referenced

@andrross For my understanding, does this mean that the local files will move to the FileCache as part of the transition? But isn't the FileCache meant just to keep track of block files instead of full segment files, and also in a different location? It seems to me that the transition should essentially remove the local Lucene files and start using the block mechanism.

Also, just calling out that given we will need to enforce some limits or a back-pressure mechanism as part of the transition, we may need a new API instead of just being able to use the existing update settings API mechanism.

Existing APIs that report on "store size" will need to be expanded to account for the fact that data isn't all resident in local storage

  • In the earlier comment, what @shwetathareja and I were referring to is the filter expression used in the _cat APIs. In addition to local store information, we should expand the filter expression as well so that users can view the output of _cat/indices or other APIs for a specific type of index, like _cat/indices/remote_index or _cat/indices/local_index. This way a user can view a summary of all the remote indexes in a single API call.

  • _cat/allocation shows the total disk space available and used. So if a node is configured for both hot and warm storage, it will be useful to also display the remote store space addressable by that node in addition to the local storage space. That will help users understand whether there is still room to move more shards to warm at the node level.

andrross (Member, Author) commented Jun 1, 2023

Thanks everybody for the feedback here. I'm going to close this issue, but please do continue to follow along the work in this area in the linked issues.
