Reciprocal Rank Fusion (RRF) normalization technique in hybrid query #874

@Johnsonisaacn Johnsonisaacn commented Aug 28, 2024

Description

Adds the ability to process and combine scores from multiple subqueries in neural search using the reciprocal rank fusion (RRF) technique. It is built as a new processor and processor factory class, separate from NormalizationProcessor. The API changes are described in the RFC. Weights are not currently supported when combining processed subquery scores, because the existing literature lacks examples of weighted RRF.

Example of usage for RRF processor:

create index

PUT /index-test
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "vector": {
        "type": "knn_vector",
        "dimension": 3,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene"
        }
      },
      "field1": {
        "type": "integer"
      }
    }
  }
}

create pipeline with rrf processor and all defaults

PUT /_search/pipeline/nlp-search-pipeline
{
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "score-ranker-processor": {
                "combination": {
                    "technique": "rrf",
                    "parameters": {
                    }
                }
            }
        }
    ]
}

ingest 4 documents

POST /index-test/_doc/?refresh=true
{
    "field1": 2,
    "vector": [0.4, 0.5, 0.2],
    "title": "basic"
}

{
    "field1": 10,
    "vector": [0.2, 0.2, 0.3],
    "title": "java"
}

{
    "field1": 50,
    "vector": [4.2, 5.5, 8.9]
}

{
    "vector": [0.3, 0.12, 3.3],
    "title": "python"
}

run search request

GET /index-test/_search?search_pipeline=nlp-search-pipeline
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "knn": {
                        "vector": {
                            "vector": [
                                4.2,
                                5.0,
                                8.5
                            ],
                            "k": 10
                        }
                    }
                },
                {
                    "range": {
                        "field1": {
                            "gte": 10,
                            "lte": 50
                        }
                    }
                }
            ]
        }
    }
}

you'll get the following response

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 0.032522473,
        "hits": [
            {
                "_index": "index-test",
                "_id": "fSgBmJIB3rlMI6kPNIQL",
                "_score": 0.032522473,
                "_source": {
                    "field1": 50,
                    "vector": [
                        4.2,
                        5.5,
                        8.9
                    ]
                }
            },
            {
                "_index": "index-test",
                "_id": "fCgBmJIB3rlMI6kPK4QS",
                "_score": 0.03201844,
                "_source": {
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                }
            },
            {
                "_index": "index-test",
                "_id": "figBmJIB3rlMI6kPPITH",
                "_score": 0.016129032,
                "_source": {
                    "vector": [
                        0.3,
                        0.12,
                        3.3
                    ],
                    "title": "python"
                }
            },
            {
                "_index": "index-test",
                "_id": "eygBmJIB3rlMI6kPIYQm",
                "_score": 0.015873017,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic"
                }
            }
        ]
    }
}
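
The scores in this response are consistent with the usual RRF formula, where each document's score is the sum of 1 / (rank_constant + rank) over the subqueries that returned it, with 1-based ranks. Below is a minimal sketch of that calculation. Assumptions: the default rank constant of 60 and the per-subquery orderings are inferred from the numbers above rather than stated in this PR, and the class and method names are hypothetical, not the PR's code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; class and method names are hypothetical, not the PR's code.
class RrfSketch {

    // Sums 1 / (rankConstant + rank) over each subquery's ranked list of document ids.
    // Ranks are 1-based; a document missing from a list contributes nothing for that list.
    static Map<String, Double> fuse(List<List<String>> rankedLists, int rankConstant) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                int rank = i + 1;
                scores.merge(ranked.get(i), 1.0 / (rankConstant + rank), Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Assumed orderings: knn ranks all four documents, the range query matches two.
        List<String> knn = List.of("field1=50", "python", "basic", "java");
        List<String> range = List.of("java", "field1=50");
        fuse(List.of(knn, range), 60)
            .forEach((doc, score) -> System.out.printf("%s %.6f%n", doc, score));
    }
}
```

Under these assumptions the top document gets 1/61 + 1/62 and the "java" document gets 1/64 + 1/61, which matches the first two `_score` values in the response.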

if you change the rank constant to something smaller, like 1, all of your scores will be scaled up
update the rank constant

{
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "score-ranker-processor": {
                "combination": {
                    "technique": "rrf",
                    "parameters": {
                        "rank_constant": 1
                    }
                }
            }
        }
    ]
}

and the search response is

{
    "took": 10,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 0.8333334,
        "hits": [
            {
                "_index": "index-test",
                "_id": "fSgBmJIB3rlMI6kPNIQL",
                "_score": 0.8333334,
                "_source": {
                    "field1": 50,
                    "vector": [
                        4.2,
                        5.5,
                        8.9
                    ]
                }
            },
            {
                "_index": "index-test",
                "_id": "fCgBmJIB3rlMI6kPK4QS",
                "_score": 0.7,
                "_source": {
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                }
            },
            {
                "_index": "index-test",
                "_id": "figBmJIB3rlMI6kPPITH",
                "_score": 0.33333334,
                "_source": {
                    "vector": [
                        0.3,
                        0.12,
                        3.3
                    ],
                    "title": "python"
                }
            },
            {
                "_index": "index-test",
                "_id": "eygBmJIB3rlMI6kPIYQm",
                "_score": 0.25,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic"
                }
            }
        ]
    }
}
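
As a worked check of the response above, assuming the usual RRF formula with 1-based ranks and the same per-subquery orderings as before (knn: field1=50, python, basic, java; range: java, field1=50; both orderings are inferred from the scores, not stated in the PR), a rank constant of k = 1 gives:

```latex
\begin{align*}
\mathrm{score}(d) &= \sum_{q} \frac{1}{k + \mathrm{rank}_q(d)}, \qquad k = 1 \\
\mathrm{score}(\text{field1}=50) &= \frac{1}{1+1} + \frac{1}{1+2} \approx 0.8333 \\
\mathrm{score}(\text{java}) &= \frac{1}{1+4} + \frac{1}{1+1} = 0.7 \\
\mathrm{score}(\text{python}) &= \frac{1}{1+2} \approx 0.3333 \\
\mathrm{score}(\text{basic}) &= \frac{1}{1+3} = 0.25
\end{align*}
```

The document order is unchanged compared with the default constant; only the magnitude of the scores grows.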

for comparison, this is the response to the same query when we use the normalization processor with its default techniques.

The important difference is that the delta between document scores is much smaller with RRF. This is because RRF is based on document ranks, which are typically close in value, whereas raw scores can differ by a huge margin.

{
    "took": 16,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "index-test",
                "_id": "fSgBmJIB3rlMI6kPNIQL",
                "_score": 1.0,
                "_source": {
                    "field1": 50,
                    "vector": [
                        4.2,
                        5.5,
                        8.9
                    ]
                }
            },
            {
                "_index": "index-test",
                "_id": "fCgBmJIB3rlMI6kPK4QS",
                "_score": 0.5005,
                "_source": {
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                }
            },
            {
                "_index": "index-test",
                "_id": "figBmJIB3rlMI6kPPITH",
                "_score": 0.0039931787,
                "_source": {
                    "vector": [
                        0.3,
                        0.12,
                        3.3
                    ],
                    "title": "python"
                }
            },
            {
                "_index": "index-test",
                "_id": "eygBmJIB3rlMI6kPIYQm",
                "_score": 1.7192177E-4,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic"
                }
            }
        ]
    }
}
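
To put a number on the difference in score deltas, here is a small sketch that compares the ratio between the highest and lowest `_score` in the two responses. The score values are copied from the responses above; the class name and the choice of max/min ratio as a measure of spread are assumptions made for illustration.

```java
// Illustrative sketch only; the class name is hypothetical.
class ScoreSpreadSketch {

    // Ratio between the highest and lowest score in a result list.
    static double spread(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
            min = Math.min(min, s);
        }
        return max / min;
    }

    public static void main(String[] args) {
        // Scores from the RRF response with the default rank constant.
        double[] rrf = {0.032522473, 0.03201844, 0.016129032, 0.015873017};
        // Scores from the normalization processor response.
        double[] normalized = {1.0, 0.5005, 0.0039931787, 1.7192177E-4};
        System.out.printf("RRF spread: %.2f%n", spread(rrf));
        System.out.printf("Normalization spread: %.2f%n", spread(normalized));
    }
}
```

For these four documents the RRF scores stay within roughly a factor of two of each other, while the normalized scores span more than three orders of magnitude.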

Related Issues

Resolves #[Issue number to be closed when this PR is merged]
#865
#659

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Isaac Johnson added 4 commits August 16, 2024 12:33
Signed-off-by: Isaac Johnson <[email protected]>
Signed-off-by: Isaac Johnson <[email protected]>
Signed-off-by: Isaac Johnson <[email protected]>
@Johnsonisaacn Johnsonisaacn changed the title Rrf Implementing Reciprocal Rank Fusion (RRF) in Neural Search Aug 28, 2024
@Johnsonisaacn Johnsonisaacn marked this pull request as ready for review August 28, 2024 20:47
@vibrantvarun vibrantvarun changed the title Implementing Reciprocal Rank Fusion (RRF) in Neural Search Implementing Reciprocal Rank Fusion (RRF) Aug 28, 2024
@vibrantvarun vibrantvarun changed the title Implementing Reciprocal Rank Fusion (RRF) Reciprocal Rank Fusion (RRF) normalization technique in hybrid query Aug 28, 2024
Signed-off-by: Isaac Johnson <[email protected]>
@martin-gaievski
Member

We should be merging to the feature branch https://github.com/opensearch-project/neural-search/tree/feature/rrf-score-normalization, not main.

@Johnsonisaacn Johnsonisaacn changed the base branch from main to feature/rrf-score-normalization September 4, 2024 22:24
@Johnsonisaacn Johnsonisaacn changed the base branch from feature/rrf-score-normalization to feature/rrf-score-normalization-v2 September 4, 2024 23:40
Comment on lines 22 to 25
// Not currently using weights for RRF, no need to modify or verify these params
public RRFScoreCombinationTechnique(final Map<String, Object> params, final ScoreCombinationUtil combinationUtil) {
;
}
Member

Is this class not completed?

Member

we planned to have very simple implementation for this one, I'll be finishing this PR and address all misses if

Member

I finished the class, please take a look @yuye-aws

@yuye-aws
Member

@martin-gaievski Can you fix the DCO failure?

public static final String TECHNIQUE_NAME = "rrf";

// Not currently using weights for RRF, no need to modify or verify these params
public RRFScoreCombinationTechnique(final Map<String, Object> params, final ScoreCombinationUtil combinationUtil) {}
Member

I'm OK that weights are not supported in the first release. This class does nothing but add all the scores together. I'm afraid it's over-designed to introduce a new class for a single sum operation.

Member

We're reusing NormalizationProcessorWorkflow, which is quite a big class and accepts both normalization and combination technique classes as input arguments. Plus, a separate class follows the single responsibility principle.

Member

I understand. For this PR, both params and combinationUtil are unused. You'd better delete them.

Member

ack, will do in next PR, please remind me if I forget

@yuye-aws
Member

> this PR has unit test, I'm planning to create one with integ tests after this one is merged, want to have core of the functionality ready for others to review. Btw, we're merging this to the feature branch, not directly to main

@vibrantvarun and I still have some high level questions. If possible, can we three have a video meeting? We can address and review tests later.

@martin-gaievski
Member

> @martin-gaievski Can you fix the DCO failure?

it's coming from one of the commits in feature branch, I cannot fix it in this PR

TRIAGING.md Outdated
The maintainers of the k-NN/neural-search Repo's seek to promote an inclusive and engaged community of contributors. In
order to facilitate this, bi-weekly triage meetings are open-to-all and attendance is encouraged for anyone who hopes to
contribute, discuss an issue, or learn more about the project. To learn more about contributing to the
The maintainers of the k-NN/neural-search Repo's seek to promote an inclusive and engaged community of contributors. In
Member

This file should remain unchanged in the scope of this PR. Please revert it to its original content.

@@ -15,7 +15,7 @@ jobs:
matrix:
java: [ 21 ]
os: [ubuntu-latest,windows-latest]
- bwc_version : ["2.9.0","2.10.0","2.11.0","2.12.0","2.13.0","2.14.0","2.15.0","2.16.0-SNAPSHOT"]
+ bwc_version : ["2.9.0","2.10.0","2.11.0","2.12.0","2.13.0","2.14.0","2.15.0","2.16.0","2.17.0-SNAPSHOT"]
Member

is this a temporary change while 2.17 release is in progress? I don't think we need it for this PR. Same for the other change in this file


// Not currently using weights for RRF, no need to modify or verify these params
public RRFScoreCombinationTechnique(final Map<String, Object> params, final ScoreCombinationUtil combinationUtil) {
;
Member

this line is not needed

*/
@Log4j2
@AllArgsConstructor
public class RRFProcessor implements SearchPhaseResultsProcessor {
Member

I think the two of you are asking different questions: @vibrantvarun is referring to a single processor for both RRF and score normalization, and @yuye-aws is referring to Alternative 2, which is about adding a new processor for RRF but exposing both the normalization and the combination technique as params to the end user.

I can answer both in a similar fashion:
Fundamentally, score normalization and rank-based combination are different, so combining them in the existing normalization processor isn't intuitive. It would also require additional validation logic and, at the code level, would break existing abstractions, mainly because the normalization processor today allows pairing any normalization technique with any combination technique. With the addition of RRF we would have to break that.
As per an offline discussion with our PM, RRF leans towards being a combination method, so exposing the normalization function doesn't make sense and adds no value.

* Collection of utility methods for score combination technique classes
*/
@Log4j2
class ScoreNormalizationUtil {
Member

Can't we shift the code in this class to HybridQueryUtil?

Member

I don't think it belongs there. My view is that only things related to the query itself should go into that class, like parsing the score collection into multiple sub-query results.


/**
* DTO object to hold data required for score normalization passed to execute() function
* in NormalizationProcessorWorkflow. Field rankConstant
Member

Suggested change
* in NormalizationProcessorWorkflow. Field rankConstant
* in NormalizationProcessorWorkflow.

Member

@vibrantvarun vibrantvarun left a comment

Just one minor comment.

LGTM.

Co-authored-by: Varun Jain <[email protected]>
Signed-off-by: Martin Gaievski <[email protected]>
*/
@Log4j2
@AllArgsConstructor
public class RRFProcessor implements SearchPhaseResultsProcessor {
Member

Is there somewhere to validate that RRFNormalizationTechnique is used together with RRFScoreCombinationTechnique? The execute method in the NormalizationProcessorWorkflow class performs normalization and then combination.

Comment on lines +28 to +34
private float RRF(List<Float> scores, List<Double> weights) {
float sumScores = 0.0f;
for (float score : scores) {
sumScores += score;
}
return sumScores;
}
Member

Why are you adding this method in the test? I think you can simply test with a few examples, like 1 plus 1 is 2.

Member

it's added to be compatible with https://github.com/opensearch-project/neural-search/blob/main/src/test/java/org/opensearch/neuralsearch/processor/combination/BaseScoreCombinationTechniqueTests.java and to be able to use all the test cases it provides. We should aim for the best possible test coverage when it's low-hanging fruit

Member

Did not get your point. The private randomScore outputs non-deterministic results.

Comment on lines +187 to +191
assertEquals(
RescoreContext.getDefault().getOversampleFactor(),
neuralQueryBuilder.rescoreContext().getOversampleFactor(),
DELTA_FOR_FLOATS_ASSERTION
);
Member

Since you are already using BigDecimal, please remove the delta here

Member

not sure what you mean, the assert requires a third parameter when we're comparing floats, and both arguments are floats

Member

I mean you are using BigDecimal in the test's rrfNorm method. You can be stricter, and the delta can be set to 0.

@martin-gaievski
Member

> Is there somewhere to validate that RRFNormalizationTechnique is used together with RRFScoreCombinationTechnique? The execute method in the NormalizationProcessorWorkflow class performs normalization and then combination.

We do not retrieve normalization technique from user input, it's hardcoded and passed to processor class by the factory, check out code snippet
https://github.com/Johnsonisaacn/neural-search/blob/RRF/src/main/java/org/opensearch/neuralsearch/processor/factory/RRFProcessorFactory.java#L51-L69

I want to keep NormalizationProcessorWorkflow generic, and maybe later refactor it into a more abstract class that is not specific to normalization.
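
To illustrate the idea of the factory fixing the normalization technique in code, here is a minimal sketch. Every type and name in it is a hypothetical stand-in and not the PR's actual classes; the rejection of non-"rrf" combination techniques is also an assumption made for the example. See the linked RRFProcessorFactory for the real implementation.

```java
import java.util.Map;

// Illustrative sketch only; every type here is a hypothetical stand-in, not the PR's classes.
class RrfFactorySketch {

    interface NormalizationTechnique { String name(); }
    interface CombinationTechnique { String name(); }

    // Holds the pair of techniques that would be handed to the shared workflow.
    static final class Processor {
        final NormalizationTechnique normalization;
        final CombinationTechnique combination;

        Processor(NormalizationTechnique normalization, CombinationTechnique combination) {
            this.normalization = normalization;
            this.combination = combination;
        }
    }

    // Only the combination section of the user config is read.
    // The normalization technique is fixed in code, so a user cannot mismatch the pair.
    static Processor create(Map<String, Object> combinationConfig) {
        Object technique = combinationConfig.getOrDefault("technique", "rrf");
        if (!"rrf".equals(technique)) {
            throw new IllegalArgumentException("unsupported combination technique: " + technique);
        }
        NormalizationTechnique normalization = () -> "rrf";
        CombinationTechnique combination = () -> "rrf";
        return new Processor(normalization, combination);
    }

    public static void main(String[] args) {
        Processor p = create(Map.of("technique", "rrf"));
        System.out.println(p.normalization.name() + " + " + p.combination.name());
    }
}
```

With this shape, no runtime validation of the normalization/combination pairing is needed, because the user never supplies the normalization side.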

@martin-gaievski
Member

I've addressed all comments, and most of those in recent reviews were minor. I'll be merging this one to the feature branch, and we'll start one more PR related to RRF soon, with a focus on testing

@martin-gaievski martin-gaievski merged commit 245cd14 into opensearch-project:feature/rrf-score-normalization-v2 Oct 18, 2024
35 of 36 checks passed
@yuye-aws
Member

Nice work @martin-gaievski

9 participants