[Derived Field] Dynamic FieldType inference based on random sampling of documents #13592

Merged

Contributor

@rishabhmaurya rishabhmaurya commented May 7, 2024

Description

This class infers field types by analyzing the _source of documents. It is useful for inferring the type of a nested field within a derived field of object type. See #13143 for details on the requirement.

It infers the field type from a random sample of documents, similar to the type-guessing logic of dynamic mapping. Guessing from the first document alone fails when the field is missing from it; a random sample makes the inference more reliable. This is especially useful for missing fields, which can be common in nested fields within derived fields of object type.
The sample size should be chosen carefully to ensure a high probability of selecting at least one document where the field is present. It must also be balanced against cost: a large sample can hurt performance, because each sampled document's _source is loaded and examined until the field is found.
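The sampling idea can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the Java code in this PR: `infer_field_type`, `lookup`, `guess_type`, and the returned type labels are hypothetical names, and documents are modeled as plain dicts standing in for parsed _source.

```python
import random
from typing import Any, Optional, Sequence


def guess_type(value: Any) -> str:
    """Map a Python value to an illustrative (hypothetical) type label."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    return "unknown"


def lookup(source: dict, path: str) -> Any:
    """Follow a dotted path such as 'a.b' through nested dicts; None if absent."""
    node: Any = source
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def infer_field_type(docs: Sequence[dict], path: str, sample_size: int,
                     rng: Optional[random.Random] = None) -> Optional[str]:
    """Examine a random sample and stop at the first document with the field."""
    rng = rng or random.Random()
    sample = rng.sample(docs, min(sample_size, len(docs)))
    for source in sample:
        value = lookup(source, path)
        if value is not None:
            return guess_type(value)
    return None  # the field was absent from every sampled document
```

Returning `None` when no sampled document has the field keeps "not found" distinct from a successful guess, which is the case the sample size is meant to make unlikely.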
Determining the sample size (S) is akin to deciding how many balls to draw from a bin containing R red balls (documents without the field) and G green balls (documents with the field) so that the probability P of drawing at least one green ball is high:

 P = 1 - C(R, S) / C(R + G, S)

Here, C() represents the binomial coefficient. For a high confidence level, we aim for P >= 0.95. For example, with 10^7 documents where the field is present in 2% of them, the sample size S should be around 149 to achieve a probability of 0.95.

Here is the small Python script I used to calculate the above:
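The script itself is not reproduced on this page. As a stand-in, a minimal sketch of the same calculation (not the author's code; `min_sample_size` and its parameters are hypothetical names) could look like this. It treats P as the probability of drawing at least one green ball and uses the identity C(R, S) / C(R + G, S) = product over i in [0, S) of (R - i) / (R + G - i), which avoids computing huge binomial coefficients.

```python
from typing import Optional


def min_sample_size(red: int, green: int, target: float = 0.95) -> Optional[int]:
    """Smallest S with 1 - C(red, S) / C(red + green, S) >= target, else None."""
    total = red + green
    p_none = 1.0  # probability that the first s draws are all red
    for s in range(1, total + 1):
        # Drawing without replacement: the s-th draw is red with
        # probability (red - (s - 1)) / (total - (s - 1)).
        p_none *= (red - (s - 1)) / (total - (s - 1))
        if 1.0 - p_none >= target:
            return s
    return None  # e.g. green == 0: no sample can ever contain the field


# 10^7 documents, field present in 2% of them.
print(min_sample_size(red=9_800_000, green=200_000, target=0.95))
```

For the 10^7 document example with the field present in 2% of documents, this sketch agrees with the figure of 149 quoted above.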

Related Issues

Resolves #13143

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • [ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • [ ] Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

github-actions bot commented May 7, 2024

❌ Gradle check result for 31d2152: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented May 7, 2024

❌ Gradle check result for 5ed477e: FAILURE

github-actions bot commented May 7, 2024

❌ Gradle check result for 540ff72: FAILURE

@rishabhmaurya
Contributor Author

@msfroh @harshavamsi what do you think about the order in which we should scan the segments here? If we start with the smaller segments and the field is found, it would be pretty fast, whereas if we start with a bigger segment, the odds of finding the field are higher but come at the cost of loading a bigger segment. So for rare fields the latter performs better, whereas for common fields the former performs better.
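To put rough numbers on this trade-off, here is a small sketch. It assumes, purely for illustration, that each document contains the field independently with the same frequency; `p_segment_has_field` is a hypothetical helper, not part of this PR.

```python
def p_segment_has_field(field_frequency: float, segment_docs: int) -> float:
    """Chance that at least one of segment_docs documents contains the field,
    assuming each document has it independently with field_frequency."""
    return 1.0 - (1.0 - field_frequency) ** segment_docs


# Compare a small and a large segment for a rare and a common field.
for frequency in (0.001, 0.10):
    for docs in (100, 100_000):
        chance = p_segment_has_field(frequency, docs)
        print(f"frequency={frequency:.3f} docs={docs:>7} chance={chance:.4f}")
```

Under this assumption a small segment is very likely to contain a common field, so scanning it first is cheap and usually sufficient, while a rare field may well be absent from a small segment altogether.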

Contributor

❌ Gradle check result for b0050b9: FAILURE

Contributor

✅ Gradle check result for 5e276cb: SUCCESS

@msfroh
Collaborator

msfroh commented May 25, 2024

> @msfroh @harshavamsi what do you think about the order in which we should scan the segments here? If we start with the smaller segments and the field is found, it would be pretty fast, whereas if we start with a bigger segment, the odds of finding the field are higher but come at the cost of loading a bigger segment. So for rare fields the latter performs better, whereas for common fields the former performs better.

I think I would optimize more for the common fields.

I appreciate that an advantage of this feature is that it's another way of handling a mix of different document types, similar to flat_object fields -- just pushing the hard work to search time, rather than flattening at indexing time. But at the same time, I feel like it makes more sense to assume that you want to search on "relatively" common fields (i.e. fields present in at least 5-10% of documents).

@rishabhmaurya
Contributor Author

@msfroh looking at the holistic picture, I agree that optimizing for common fields is the wiser choice. If you think this isn't super critical, I can take it up in a subsequent PR.

Contributor

github-actions bot commented Jun 3, 2024

❕ Gradle check result for c546323: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@msfroh msfroh added the backport 2.x Backport to 2.x branch label Jun 3, 2024
@msfroh msfroh merged commit 6c1896b into opensearch-project:main Jun 3, 2024
33 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 3, 2024
…of documents (#13592)

---------

Signed-off-by: Rishabh Maurya <[email protected]>
(cherry picked from commit 6c1896b)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
akolarkunnu pushed a commit to akolarkunnu/OpenSearch that referenced this pull request Jun 5, 2024
msfroh pushed a commit that referenced this pull request Jun 5, 2024
…of documents (#13592) (#13953)

---------


(cherry picked from commit 6c1896b)

Signed-off-by: Rishabh Maurya <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
LantaoJin pushed a commit to LantaoJin/OpenSearch that referenced this pull request Jun 6, 2024
@rishabhmaurya rishabhmaurya mentioned this pull request Jun 10, 2024
6 tasks
parv0201 pushed a commit to parv0201/OpenSearch that referenced this pull request Jun 10, 2024
kkewwei pushed a commit to kkewwei/OpenSearch that referenced this pull request Jul 24, 2024
…of documents (opensearch-project#13592) (opensearch-project#13953)

---------

(cherry picked from commit 6c1896b)

Signed-off-by: Rishabh Maurya <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: kkewwei <[email protected]>
wdongyu pushed a commit to wdongyu/OpenSearch that referenced this pull request Aug 22, 2024
Labels
backport 2.x Backport to 2.x branch enhancement Enhancement or improvement to existing feature or request Search:Performance skip-changelog
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Feature Request] Support for object type in Derived Fields
3 participants