
Latency improvements to Multi Term Aggregations #14993

Merged
8 commits merged into main on Oct 7, 2024

Conversation

@expani (Contributor) commented Jul 29, 2024

Description

This PR introduces the following improvements:

  • Reduces the latency of multi term aggregation queries by 7-10 seconds
  • Reduces the memory footprint of multi term aggregation queries by cutting down on allocations

Testing was done on a c5.9xlarge with 20 GB of JVM heap and the store type set to mmapfs, so that EBS latency would not noticeably affect the results. This was verified by running lsof against the OS pid to confirm that all index files are memory-mapped (the type shows as mem in that case).

The numbers below are averaged over 20 iterations of each query type, with and without the changes.

| Workload | Field1 | Field2 | Without changes | With changes |
|----------|--------|--------|-----------------|--------------|
| big5 | agent_name | host_name | 236 secs | 226 secs |
| big5 | process_name | agent_id | 45 secs | 38 secs |
| nyc_taxi | store_and_fwd_flag | payment_type | 53 secs | 44 secs |
Sample Aggregation Query
```sh
curl -k -H 'Content-Type: application/json' https://localhost:9200/nyc_taxis/_search -u 'admin:xxx' -d '{
  "aggs": {
    "flag_and_payment_type": {
      "multi_terms": {
        "terms": [{
          "field": "store_and_fwd_flag"
        }, {
          "field": "payment_type"
        }]
      }
    }
  }
}'
```

A multi term aggregation iterates over every document emitted by the collector of the filter query (a MatchAllDocsQuery when no filter is present).

For every document given by the collector, it generates the Cartesian product of the values of all the fields present in the aggregation, as sketched below.
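
The sketch below illustrates that per-document expansion. It is a simplified, hypothetical version of the logic, not the actual OpenSearch implementation; the class and method names are assumptions. Note how every combination allocates a fresh list, which is where the allocation pressure discussed below comes from.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified illustration of the per-document Cartesian
// product; names and types are assumptions, not OpenSearch's actual code.
final class CompositeKeyExample {

    // Given the values each field holds for the current document,
    // e.g. [["N"], ["1", "2"]], produce every composite key:
    // ["N", "1"] and ["N", "2"].
    static List<List<Object>> cartesianProduct(List<List<Object>> perFieldValues) {
        List<List<Object>> keys = new ArrayList<>();
        keys.add(new ArrayList<>()); // start from a single empty key
        for (List<Object> fieldValues : perFieldValues) {
            List<List<Object>> expanded = new ArrayList<>();
            for (List<Object> prefix : keys) {
                for (Object value : fieldValues) {
                    List<Object> key = new ArrayList<>(prefix); // one allocation per combination
                    key.add(value);
                    expanded.add(key);
                }
            }
            keys = expanded;
        }
        return keys;
    }
}
```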

A deep copy is made of every composite key, and the key is eventually copied again (only the first time) when it is added to a bucket. This PR refactors the code to remove the need for a deep copy of every composite key.

We also deep-copy the field values retrieved from Lucene. This is only essential for fields with multiple values in a document and can be avoided for single-valued fields.
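
Below is a minimal sketch of that copy-avoidance idea, under the assumption that Lucene reuses the BytesRef instances it hands back (which is why the original code deep-copied unconditionally). SortedBinaryDocValues and BytesRef.deepCopyOf are real OpenSearch/Lucene types; the helper method and its flag are illustrative, not the PR's actual code.

```java
import java.io.IOException;
import org.apache.lucene.util.BytesRef;
import org.opensearch.index.fielddata.SortedBinaryDocValues;

final class ValueCopyExample {

    // Read the next value for the current document. The returned BytesRef
    // may be backed by a buffer that Lucene reuses on the next advance.
    static BytesRef readValue(SortedBinaryDocValues values, boolean singleValued) throws IOException {
        BytesRef shared = values.nextValue();
        if (singleValued) {
            // Consumed immediately, before the iterator advances, so the
            // shared instance can be used as-is: no deep copy needed.
            return shared;
        }
        // Buffered for the Cartesian product across multiple values, so it
        // must outlive this call: a deep copy is still required here.
        return BytesRef.deepCopyOf(shared);
    }
}
```

The composite key follows the same principle: it only needs its own copy at the moment it is inserted as a new bucket key, not once per document.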

Allocation Profiling

For the Big5 benchmark with process_name and agent_id (low cardinality):

Deep copies of the composite key, and of values of single-valued fields, account for around 25% of the overall allocations of a multi term aggregation query.

[Screenshot: allocation profile, 2024-07-29 4:55:27 PM]

Collecting all the composite keys for every document in a list also accounts for around 9% of the overall allocations.

[Screenshot: allocation profile, 2024-07-29 4:59:16 PM]

For the Big5 benchmark with agent_name and host_name (high cardinality):

19% of the overall allocations are spent on deep copies of the composite key.

[Screenshot: allocation profile, 2024-07-29 5:02:47 PM]

Collecting all composite keys again accounts for around 9%, same as before.

Also, the for-each loop that iterates over the field values of a document contributes 17% of the overall allocations, because it creates a new Iterator every time. This was changed to a regular indexed for loop; see the sketch after the screenshot below.

[Screenshot: allocation profile, 2024-07-29 5:10:36 PM]
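
A sketch of the loop change, with placeholder names (docValues and consume are not the PR's actual identifiers). An enhanced for loop over a List allocates a fresh Iterator on every invocation, which adds up when the loop runs once per collected document; an index-based loop avoids that allocation entirely.

```java
import java.util.List;

final class LoopExample {

    // Before: allocates a new Iterator object on every call.
    static void withIterator(List<Object> docValues) {
        for (Object value : docValues) {
            consume(value);
        }
    }

    // After: index-based access, no Iterator allocation.
    static void withIndex(List<Object> docValues) {
        for (int i = 0; i < docValues.size(); i++) {
            consume(docValues.get(i));
        }
    }

    static void consume(Object value) {
        // per-value work (placeholder)
    }
}
```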

Testing

  • I ensured that the aggregation results were identical with and without my changes for different fields of the Big5 dataset.
  • Testing was also done with concurrent search enabled, using different field combinations, and the results were the same.

I will review the existing integration tests and UTs for multi term aggregations to ensure no corner cases are left uncovered.

github-actions bot commented

❌ Gradle check result for fb84412: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@bowenlan-amzn added the v2.17.0 and backport 2.x (Backport to 2.x branch) labels on Jul 31, 2024
@bowenlan-amzn (Member) left a comment

Looks good! I like how you documented the profiling results to explain this!

@sandeshkr419 (Contributor) commented Jul 31, 2024

Thanks @expani for the code changes and the detailed explanation on the PR - I do have some minor comments, mainly on refactoring. Please add relevant comments/javadocs to help future developers understand the minor optimizations and the utilities for various low-level operations.

Did we check performance on any data that has multi-valued fields as well? If not, let us propose a change in OSB to add multi-valued fields to some workloads, in case no such workloads are available.

Also, let us iterate CI to green and check that we have solid code coverage as well.

@expani (Contributor, Author) commented Jul 31, 2024

Thanks for taking the time to review @bowenlan-amzn and @sandeshkr419

I will add the required Javadocs and comments as suggested.
I have explained the reasoning behind the code structure; let me know if you feel otherwise.

> Did we check performance on any data that has multi-valued fields as well? If not, let us propose a change in OSB to add multi-valued fields to some workloads, in case no such workloads are available.

I checked a few multi-valued fields for correctness only, not from a performance perspective.
I will check with the OSB team for any such existing workloads, as I need them for other possible optimisations as well.

The current CI seems to be failing due to 2 tests that have been reported as flaky by multiple other folks.

```text
> Task :server:internalClusterTest

Tests with failures:
 - org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.classMethod
 - org.opensearch.remotestore.RemoteStoreStatsIT.testDownloadStatsCorrectnessSinglePrimarySingleReplica
```

I will check on the coverage of the existing tests and add more if required.

@expani (Contributor, Author) commented Jul 31, 2024

@sandeshkr419 @bowenlan-amzn Made changes based on the previous comments.

Verified that the existing UTs cover all the branches and code paths added with this PR.

Please have a look, thanks.

github-actions bot commented Oct 7, 2024

❌ Gradle check result for 734e927: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented Oct 7, 2024

❕ Gradle check result for 734e927: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@expani (Contributor, Author) commented Oct 7, 2024

@msfroh I have added some tests and replied to your comments.

The build is finally passing after multiple flaky-test retries. Please have a look.

@msfroh merged commit e885aa9 into opensearch-project:main on Oct 7, 2024
40 of 43 checks passed
@opensearch-trigger-bot (Contributor) commented

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

```sh
# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-14993-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 e885aa93279342c9ab219ecf612d887ff8de8af6
# Push it to GitHub
git push --set-upstream origin backport/backport-14993-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x
```

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-14993-to-2.x.

dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request Oct 16, 2024
Latency improvements to Multi Term Aggregations (#14993)

* Avoid deep copy and other allocation improvements
* Refactoring based on PR Comments and added JavaDocs
* Added more comments
* Added character for Triggering Jenkins build
* Changes to cover collectZeroDocEntries method
* Updated comment based on change in method's functionality
* Added test to cover branches in collectZeroDocEntriesIfRequired
* Rebased and resolved changelog conflict

---------

Signed-off-by: expani <[email protected]>
dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request Oct 17, 2024

dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request Oct 21, 2024
Labels
backport 2.x (Backport to 2.x branch), backport-failed, Performance (This is for any performance related enhancements or bugs), v2.18.0 (Issues and PRs related to version 2.18.0)

Projects
Status: Done

6 participants