[Derived Field] Dynamic FieldType inference based on random sampling of documents #13592
❌ Gradle check result for 31d2152: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 5ed477e: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 540ff72: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
@msfroh @harshavamsi what do you think about the order in which we should scan the segments here? If we start with the smaller segments and the field is found, it would be pretty fast, whereas if we start with a bigger segment, the odds of finding the field are higher but come at the cost of loading a bigger segment. So for rare fields the latter performs better, whereas for common fields the former performs better.
❌ Gradle check result for b0050b9: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
I think I would optimize more for the common fields. I appreciate that an advantage of this feature is that it's another way of handling a mix of different document types, similar to |
@msfroh looking at the holistic picture, I agree that optimizing for common fields is the wiser choice. If you think this isn't super critical, I can take it up in a subsequent PR.
Signed-off-by: Rishabh Maurya <[email protected]>
❕ Gradle check result for c546323: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
…of documents (#13592) --------- Signed-off-by: Rishabh Maurya <[email protected]> (cherry picked from commit 6c1896b) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…of documents (opensearch-project#13592) --------- Signed-off-by: Rishabh Maurya <[email protected]>
…of documents (#13592) (#13953) --------- (cherry picked from commit 6c1896b) Signed-off-by: Rishabh Maurya <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…of documents (opensearch-project#13592) (opensearch-project#13953) --------- (cherry picked from commit 6c1896b) Signed-off-by: Rishabh Maurya <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Signed-off-by: kkewwei <[email protected]>
Description
This class performs type inference by analyzing the _source of documents. This will be useful for inferring the type of a nested derived field of object type. See #13143 for more details on the requirement.
It uses a random sample of documents to infer the field type, similar to the dynamic mapping type guessing logic. Unlike guessing based on the first document, where the field could be missing, this method draws a random sample to make a more accurate inference. This approach is especially useful for handling missing fields, which could be common for nested fields within derived fields of object type.
The sample size should be chosen carefully to ensure a high probability of selecting at least one document where the field is present. However, it's essential to strike a balance because a large sample size can lead to performance issues since each sample document's _source field is loaded and examined until the field is found.
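To make the sampling loop concrete, here is a minimal Python sketch of the idea described above. It is an illustration only, not the Java class added in this PR: the in-memory list of `_source` dicts, the helper names (`lookup`, `guess_type`, `infer_field_type`), and the simplified type names are all assumptions made for the example.

```python
import random
from typing import Any, Optional, Sequence


def lookup(source: dict, path: str) -> Optional[Any]:
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    node: Any = source
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def guess_type(value: Any) -> str:
    """Very simplified stand-in for dynamic-mapping style type guessing (assumed names)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    return "text"


def infer_field_type(
    sources: Sequence[dict],
    path: str,
    sample_size: int,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Examine up to sample_size randomly chosen documents and stop at the first hit."""
    if rng is None:
        rng = random.Random()
    k = min(sample_size, len(sources))
    for doc_id in rng.sample(range(len(sources)), k):
        value = lookup(sources[doc_id], path)
        if value is not None:
            return guess_type(value)
    return None  # field absent from every sampled document
```

The loop returns as soon as one sampled document contains the field, which is why the cost grows with the sample size only when the field is rare.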
Determining the sample size (S) is akin to deciding how many balls to draw from a bin, ensuring a high probability (>=P) of drawing at least one green ball (documents with the field) from a mixture of R red balls (documents without the field) and G green balls:
P = 1 - C(R, S) / C(R + G, S)
Here, C() represents the binomial coefficient. For a high confidence level, we aim for P >= 0.95. For example, with 10^7 documents where the field is present in 2% of them, the sample size S should be around 149 to achieve a probability of 0.95.
Here is the small Python script which I used to calculate the above.
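A minimal sketch of that calculation, not the author's original script, could look like the following. It assumes sampling without replacement, so the probability of drawing no green ball in S draws is C(R, S) / C(R + G, S), computed here as a running product to avoid huge binomial coefficients. The function names are hypothetical.

```python
def prob_at_least_one(red: int, green: int, sample_size: int) -> float:
    """Probability that a sample of sample_size balls contains at least one green ball."""
    total = red + green
    if sample_size > red:
        return 1.0  # more draws than red balls: a green ball is guaranteed
    p_none = 1.0
    for i in range(sample_size):
        # probability that draw i is red, given all earlier draws were red
        p_none *= (red - i) / (total - i)
    return 1.0 - p_none


def min_sample_size(red: int, green: int, target: float) -> int:
    """Smallest S for which the probability of at least one green ball is >= target."""
    if green <= 0:
        raise ValueError("need at least one document containing the field")
    total = red + green
    p_none = 1.0
    for s in range(1, total + 1):
        p_none *= (red - (s - 1)) / (total - (s - 1))
        if 1.0 - p_none >= target:
            return s
    return total


# 10^7 documents, field present in 2% of them, target probability 0.95
print(min_sample_size(9_800_000, 200_000, 0.95))  # prints 149
```

For a corpus this large the result is very close to the with-replacement approximation 1 - 0.98^S >= 0.95, which also gives S = 149.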
Related Issues
Resolves #13143
Check List
[ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
[ ] Public documentation issue/PR created
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.