Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for DynamicTemplates. #2969

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

youssef3wi
Copy link
Contributor

  • You have read the Spring Data contribution guidelines.
  • There is a ticket in the bug tracker for the project in our issue tracker. Add the issue number to the Closes #issue-number line below
  • You use the code formatters provided here and have them applied to your changes. Don’t submit any formatting related changes.
  • You submit test cases (unit or integration tests) that back your changes.
  • You added yourself as author in the headers of the classes you touched. Amend the date range in the Apache license header if needed. For new types, add the license header (copy from another file and set the current year only).

Closes #2774

I'm uncertain if this will be merged, so the best approach is to use converters.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Aug 20, 2024
@sothawo
Copy link
Collaborator

sothawo commented Aug 24, 2024

Thanks for that PR, I finally had some time to have a look.

Restrictions I see in this current draft:

  • This currently only is for properties of type Map<String, Object> right?
  • Only match from the dynamic mapping is supported, we should add unmatch as well
  • Supporting path_match and path_unmatch will get pretty ugly, as the mapping from Java property paths to elasticsearch fields in dot notation is hard, we have that in some places, I think that might not be the effort to have it in here
  • One thing that is missing is converting the values. The dynamic mapping could for example define a date format and then we need to do a conversion on writing and reading. Or the user might have a custom conversion for all string type or something like that. We'd need to a apply this to these properties as well.
  • the reading part in the readEntity method should go into its own function, readEntity gets too long.

Besides that this would need some documentation not only in the javadoc, but in the reference documentation as well.

@youssef3wi
Copy link
Contributor Author

Honestly I believe it's not good to support this or even improving this draft

@sothawo
Copy link
Collaborator

sothawo commented Aug 25, 2024

Another thought that I just had: the dynamic templates definition must not necessarily come from a @DynamicTemplates annotation on the Java entity. It could have been written to the index mapping from outside ot could be contained in a json that is referred to with the @Mappings annotation. So it would be necessary to retrieve the current mappping from the cluster to find out the relevant settings here. This would have to be done once and not on every write/read and stored somehow on the entity - doing stuff like this in the ractive world easily gets nasty..

Seems quite some effort to add this; if you don't want to go on with this, just drop it.

@youssef3wi
Copy link
Contributor Author

youssef3wi commented Aug 27, 2024

@sothawo I'm not convinced by this draft at the moment.

First, the metadata from the cluster is not in the correct location. I just placed it there for testing purposes, but I haven't found a secure place to retrieve the metadata from the cluster.

We should focus on the types and the conversion after that.

Copy link
Collaborator

@sothawo sothawo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about this the more problems seem to arise. We can define on a property that it is part of a dynamic template. But we do not know in which index this entity might be persisted to or retrieved from. So we'd need to load the dynamic template mapping from the cluster for every index name we encounter.
In order to not do this with every save or load, this information would need to be cached. Even if we have this in a cache this add aditional calls even for applications that never need or use this.
For applications that use for example a date in the index name this would mean that over the time we are building up cache data for multiple index names that contain the same data.

I don't think that this architectural and runtime overhead is justified by the single use case - that has only come up ince in all these years that I am maintained SDE

@@ -602,6 +603,15 @@ protected <T> SearchDocumentResponse.EntityCreator<T> getEntityCreator(ReadDocum
// region Entity callbacks
protected <T> T maybeCallbackBeforeConvert(T entity, IndexCoordinates index) {

// get entity metadata
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be called for every conversion of every entity, so that if the user saves 100 entities we have 100 calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, until I find a better place to cache/load metadata 😄

Copy link
Collaborator

@sothawo sothawo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are too many things that won't work (efficiently):

  • the user might not have the access rights to read the mappings of all indices in a cluster
  • even if so, this only gets the indices from the cluster at one point in time. While the application is running, additional indices can get created in in the cluster, either from outside by some other tool or by some naming, rolling pattern from Elasticsearch defined by index templates. Or the user can have SpEL in the indexname like I described in this post

Consider the following case:
The indexname is defined with something like @Document(indexName = "log-#{T(java.time.LocalDate).now().toString()}") combined with in index template to define the mapping. This will result in indices named log-2024-09-4, log-2024-9-5 etc be defined all while the application is running. We would constantly need to reload and adjust the collection of index mapping information.

When searching the user might search the index log-*, we'd need to check every returned search document to find from which index it came and would need to then eventually load the index mapping information.

I don't think that we should add this runtime, network traffic and memory overhead for this single mapping usecase

@@ -241,6 +248,45 @@ public Map<String, Object> getMapping() {
return responseConverter.indicesGetMapping(getMappingResponse, indexCoordinates);
}

@Override
public ClusterMapping getClusterMapping() {
GetMappingRequest getMappingRequest = requestConverter.indicesGetMappingRequest(IndexCoordinates.of("*"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the user might not have the permission to read that information from all indices in a cluster

@youssef3wi
Copy link
Contributor Author

@sothawo

  • We can only identify current classes that use Document as an annotation by initializing the repository (a Spring application) or by using reflection. I tried the latter approach, but it is not effective.

  • For getting all mappings, the goal is to reduce the number of calls per index.

Signed-off-by: Youssef Aouichaoui <[email protected]>
Signed-off-by: Youssef Aouichaoui <[email protected]>
Signed-off-by: Youssef Aouichaoui <[email protected]>
@sothawo
Copy link
Collaborator

sothawo commented Sep 6, 2024

We cannot guarantee to have the information/mapping of an index when an entity is written. The index name that the document is written to might not exist yet, it might get created by Elasticsearch when the document comes in and the mapping can then come from an index template, but there would be no way for Spring Data Elasticsearch to convert to entity to a document using that mapping.

Neither can we guarantee to have this information when reading documents from Elasticsearch. We can try to get the information for all indices (assuming we have the access rights to do so), but there can be indices created from outside the application in between the time this information is retrieved and a search result needs converting. Users read indices with Spring Data Elasticsearch that is created for example by log ingestion frameworks.
Even if we can notice in the process of mapping a document to an entity that this document comes from an index where we have no information about, it's not possible to get the information at that time. This would mean a call to the cluster from within the mapping process. At this place we have no access to any (Reactive)ElasticsearchOperations instance and we shouldn't. And even if we would add such a dependency, we would then need to provide a new reactive Mapper for that.

I don't see that we can provide a reliable and performant solution here by using dynamic templates to solve the problem from #2774

@youssef3wi
Copy link
Contributor Author

I don't see any issues here.
This draft can resolve the problem, but it remains a draft due to the types of conversions.

Are you suggesting closing this draft?

@sothawo
Copy link
Collaborator

sothawo commented Sep 10, 2024

I still see the issues already mentioned:

public ClusterMapping getClusterMapping() {
		GetMappingRequest getMappingRequest = requestConverter.indicesGetMappingRequest(IndexCoordinates.of("*"));

This call might fail as the user might not have access to all indices in a cluster, for example when the cluster serves multiple users/tenants, when kibana is installed which stores data in it's own indices etc. So we cannot guarantee to have the information about an index that we need.

We need to have the index mapping information when converting an entity to a document. But it can be that the index where the document is to be stored does not exist yet. Assume we have an index template for index names like log-* which contains a mapping with a dynamic template, and the user saves some data to log-2024-09-10 and this index does not exist yet. We cannot do the mapping resepcting the dynamic template because Elasticsearch will only create that index when the document is sent to the cluster, but that's to0 late for our conversion.

We do not necessarily have the information about an index when reading response data. Let's assume we start the application and read all the index information that we can and that does not fail. To stick with the example we get the index mappings for all the log-* indices until the current date like log-2024-09-10. On the next day, the user searches for some data in log-*. Now it can be that some log ingestion mechanism has inserted something into log-2024-09-11 before the search started. This index now exists, but our application does not know about this. We then could get documents from log-2024-09-10 that we can convert and some from log-2024-09-11 where we don't have the index mapping in the cache.

As long as these issues are not resolved we cannot reliably offer this as a solution

Signed-off-by: Youssef Aouichaoui <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: waiting-for-triage An issue we've not yet triaged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add annotation similar to @JsonUnwrapped on document fields
3 participants