Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add LostInTheMiddleRanker #5457

Merged
merged 14 commits into from
Aug 2, 2023
Merged

feat: Add LostInTheMiddleRanker #5457

merged 14 commits into from
Aug 2, 2023

Conversation

vblagoje
Copy link
Member

What?

Introduces a new ranker, LostInTheMiddleRanker. This ranker ranks documents based on the "Lost in the Middle" order, designed to position "the best" documents (low index in the given list of documents) at the beginning and the end of the resulting list while placing "the worst" documents (high index in the given list of documents) in the middle.

Why?

The "Lost in the Middle" order is a strategy for laying out paragraphs in a way that places the most relevant paragraphs at the beginning or end of the LLMs context window while the least relevant paragraphs are placed in the middle. This strategy is based on the recent research paper "Lost in the Middle: How Language Models Use Long Contexts" by Liu et al. Implementing this strategy as a ranker in Haystack allows users to leverage this approach in their RAG pipelines.

How can it be used?

The LostInTheMiddleRanker can be used like other rankers in Haystack. After initializing the ranker with the desired parameters, it can be used to rank/reorder a list of documents based on the "Lost in the Middle" order.

How did you test it?

The functionality of the LostInTheMiddleRanker has been tested via unit tests in test/nodes/test_lost_in_the_middle.py file. These tests ensure the ranker correctly reorders documents and handles edge cases appropriately.

Notes for the reviewer

Please review the implementation of the LostInTheMiddleRanker and its associated tests; perhaps some edge unit test cases were not covered.

@vblagoje vblagoje requested a review from a team as a code owner July 27, 2023 14:44
@vblagoje vblagoje requested review from julian-risch and removed request for a team July 27, 2023 14:44
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Jul 27, 2023
@coveralls
Copy link
Collaborator

coveralls commented Jul 27, 2023

Pull Request Test Coverage Report for Build 5739623079

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 35 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.2%) to 46.699%

Files with Coverage Reduction New Missed Lines %
utils/context_matching.py 1 91.67%
nodes/search_engine/web.py 3 63.89%
nodes/prompt/invocation_layer/hugging_face.py 11 92.47%
nodes/retriever/web.py 20 78.26%
Totals Coverage Status
Change from base Build 5736927683: 0.2%
Covered Lines: 11014
Relevant Lines: 23585

💛 - Coveralls

@vblagoje vblagoje requested a review from a team as a code owner July 28, 2023 08:12
@vblagoje vblagoje requested review from dfokina and removed request for a team July 28, 2023 08:12
@vblagoje
Copy link
Member Author

PR Analysis

  • 🎯 Main theme: Introduction of a new ranker - LostInTheMiddleRanker
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: Yes
  • Focused PR: Yes, the PR is focused as it introduces a new feature (LostInTheMiddleRanker) and all changes are related to this feature. The title and description clearly describe the purpose of the PR.
  • 🔒 Security concerns: No, the PR does not introduce any obvious security concerns. It mainly involves the addition of a new feature that reorders documents based on a specific strategy, and does not involve any risky operations such as database or network access.

PR Feedback

  • 💡 General PR suggestions: The PR is well structured and follows good coding practices. The new feature is well documented and relevant unit tests have been added. However, it would be beneficial to add some comments in the code to explain the logic behind the 'LostInTheMiddleRanker', especially the 'reorder_documents' method. This would make the code more understandable for other developers.

@vblagoje
Copy link
Member Author

Here @bglearning have a look

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting new ranker! Now I am curious to see how it performs. The tests are extensive, which is great! 👍 I have a couple of change requests regarding the implementation but the general functionality looks good.

haystack/nodes/ranker/lost_in_the_middle.py Outdated Show resolved Hide resolved
logger = logging.getLogger(__name__)


class LostInTheMiddleRanker(BaseRanker):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we can come up with a name that expresses that relevant documents will be ranked so that they end up at the top and bottom and irrelevant documents end up in the middle.
LITM is a very technical name that rarely anyone using Haystack will know. I don't have a better idea yet. Maybe you can come up with something together with @dfokina ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually like LITM name because it is unique and well descriptive and associative with that now "famous" paper. So I'd keep it if there are no objections...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the LITM name in a sense that it is a clear reference to the paper. I don't think that the users would be specifically searching for this type of ranking without understanding the background for it?

Anyway, I tried to think about possible alternatives, but no good ideas so far. They're coming up pretty bad (IrrelevantMiddleRanker, TopBottomRanker, IrrelevantDocsInTheMiddleRanker...)

haystack/nodes/ranker/lost_in_the_middle.py Outdated Show resolved Hide resolved
haystack/nodes/ranker/lost_in_the_middle.py Show resolved Hide resolved
haystack/nodes/ranker/lost_in_the_middle.py Show resolved Hide resolved
test/nodes/test_lost_in_the_middle.py Outdated Show resolved Hide resolved
test/nodes/test_lost_in_the_middle.py Outdated Show resolved Hide resolved
test/nodes/test_lost_in_the_middle.py Outdated Show resolved Hide resolved
test/nodes/test_lost_in_the_middle.py Outdated Show resolved Hide resolved
@dfokina
Copy link
Contributor

dfokina commented Aug 1, 2023

I just pushed some docstrings updates and added the ranker to pydoc.

@sjrl
Copy link
Contributor

sjrl commented Aug 2, 2023

@vblagoje Overall this looks really good!

My main comments surround the word_count_threshold and the truncate parameters. I'm not entirely sure they should be a part of the ranker since it breaks with the pattern of our previous rankers (eg. SentenceTransformerRanker, RecentnessRanker, etc.) and they sound like parameters that are more relevant for a Preprocessor, a Reader or the PromptNode.

My thinking is similar to what you already mentioned which is that a Ranker should not edit documents in general so I'm not sure if we should even provide the ability to do so here.

Instead, I think something like word_count_threshold and truncate are more appropriate to live within something like the PromptNode since that's where the document and context length matter.

What do you think?

See the discussion here

Copy link
Contributor

@sjrl sjrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@vblagoje
Copy link
Member Author

vblagoje commented Aug 2, 2023

Question: Write an elaborate review of this PR and give not more than five recommendations on how it can be improved

Answer:
This PR introduces a new ranker, LostInTheMiddleRanker, which reorders documents based on the "Lost in the Middle" order. This strategy places the most relevant paragraphs at the beginning or end of the context, while less relevant paragraphs are positioned in the middle. The implementation appears to be well thought out and the code is clean and easy to understand.

Here are some recommendations for improvement:

  1. Add more comments: While the code is generally well-documented, adding more comments, especially in complex sections of the code, would improve readability and maintainability.

  2. Error handling: In the reorder_documents method, there is a check to ensure that all documents are textual. However, it might be beneficial to add more error handling for other potential issues, such as null or empty documents.

  3. Test coverage: The tests seem to cover a good range of scenarios. However, it would be beneficial to add more edge cases, such as when the word_count_threshold is zero or negative, or when the documents contain special characters or non-alphanumeric content.

  4. Performance considerations: The reorder_documents method uses a list insert operation inside a loop, which could potentially be a performance bottleneck for large lists of documents due to the time complexity of list insertions. It might be worth considering alternative data structures or algorithms that could improve the performance.

  5. Documentation: The documentation in docs/pydoc/config/ranker.yml is updated, but it would be beneficial to also update the user guide or any other relevant documentation to provide examples and guidance on how to use the new LostInTheMiddleRanker.

@vblagoje
Copy link
Member Author

vblagoje commented Aug 2, 2023

Question: Please rate the readiness of this Pull Request (PR) for integration on a scale of 1 to 10, with 1 indicating it is completely unready and 10 signifying it is fully prepared for merging into the main branch. First elborate and then provide your rating

Answer:
The PR seems to be well-prepared for merging. The new feature, LostInTheMiddleRanker, is well implemented with clear comments and follows the existing structure of the project. The PR includes comprehensive unit tests that cover various edge cases, ensuring the functionality of the new ranker. The PR also includes updates to the documentation and release notes, which is a good practice.

However, there are a few minor points that could be improved:

  1. The PR could benefit from a little more explanation about the use cases and benefits of the LostInTheMiddleRanker. While the PR description does provide a brief explanation, it might be helpful to provide more context or examples of when this ranker would be particularly useful.

  2. The PR modifies the examples/web_lfqa_improved.py file to use the new LostInTheMiddleRanker. While this demonstrates how to use the new ranker, it might be better to create a separate example script specifically for the LostInTheMiddleRanker. This would allow users to understand the usage of this ranker without mixing it with other features.

  3. The PR could also benefit from some performance testing. While unit tests are included, it would be beneficial to know how the new ranker performs in terms of speed and memory usage, especially when dealing with a large number of documents.

Considering these points, I would rate this PR as an 8. It's well-prepared and mostly ready for merging, but could benefit from a few minor improvements.

@vblagoje vblagoje dismissed julian-risch’s stale review August 2, 2023 14:28

Addressed Julian's concerns, have to dismiss his review as it is blocking merging and Julian is on PTO

@vblagoje vblagoje merged commit 1876c41 into main Aug 2, 2023
56 checks passed
@vblagoje vblagoje deleted the lost_in_the_middle_ranker branch August 2, 2023 15:05
DosticJelena pushed a commit to smartcat-labs/haystack that referenced this pull request Aug 23, 2023
* Add lost in the middle ranker

* Add release note

* Julian's feedback: more precise version of truncate

* Better comments for the litm algorithm

* Sebastian PR feedback

* Add check for invalid values of word_count_threshold

* Remove _truncate as it is not needed any more

---------

Co-authored-by: Darja Fokina <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants