feat: Add LostInTheMiddleRanker #5457

vblagoje · 2023-07-27T14:44:57Z

What?

Introduces a new ranker, LostInTheMiddleRanker. This ranker ranks documents based on the "Lost in the Middle" order, designed to position "the best" documents (low index in the given list of documents) at the beginning and the end of the resulting list while placing "the worst" documents (high index in the given list of documents) in the middle.

Why?

The "Lost in the Middle" order is a strategy for laying out paragraphs in a way that places the most relevant paragraphs at the beginning or end of the LLMs context window while the least relevant paragraphs are placed in the middle. This strategy is based on the recent research paper "Lost in the Middle: How Language Models Use Long Contexts" by Liu et al. Implementing this strategy as a ranker in Haystack allows users to leverage this approach in their RAG pipelines.

How can it be used?

The LostInTheMiddleRanker can be used like other rankers in Haystack. After initializing the ranker with the desired parameters, it can be used to rank/reorder a list of documents based on the "Lost in the Middle" order.

How did you test it?

The functionality of the LostInTheMiddleRanker has been tested via unit tests in test/nodes/test_lost_in_the_middle.py file. These tests ensure the ranker correctly reorders documents and handles edge cases appropriately.

Notes for the reviewer

Please review the implementation of the LostInTheMiddleRanker and its associated tests; perhaps some edge unit test cases were not covered.

coveralls · 2023-07-27T15:05:30Z

Pull Request Test Coverage Report for Build 5739623079

0 of 0 changed or added relevant lines in 0 files are covered.
35 unchanged lines in 4 files lost coverage.
Overall coverage increased (+0.2%) to 46.699%

Files with Coverage Reduction	New Missed Lines	%
utils/context_matching.py	1	91.67%
nodes/search_engine/web.py	3	63.89%
nodes/prompt/invocation_layer/hugging_face.py	11	92.47%
nodes/retriever/web.py	20	78.26%

Totals
Change from base Build 5736927683:	0.2%
Covered Lines:	11014
Relevant Lines:	23585

💛 - Coveralls

vblagoje · 2023-07-28T14:51:36Z

PR Analysis

🎯 Main theme: Introduction of a new ranker - LostInTheMiddleRanker
📌 Type of PR: Enhancement
🧪 Relevant tests added: Yes
✨ Focused PR: Yes, the PR is focused as it introduces a new feature (LostInTheMiddleRanker) and all changes are related to this feature. The title and description clearly describe the purpose of the PR.
🔒 Security concerns: No, the PR does not introduce any obvious security concerns. It mainly involves the addition of a new feature that reorders documents based on a specific strategy, and does not involve any risky operations such as database or network access.

PR Feedback

💡 General PR suggestions: The PR is well structured and follows good coding practices. The new feature is well documented and relevant unit tests have been added. However, it would be beneficial to add some comments in the code to explain the logic behind the 'LostInTheMiddleRanker', especially the 'reorder_documents' method. This would make the code more understandable for other developers.

vblagoje · 2023-07-28T14:56:03Z

Here @bglearning have a look

julian-risch

Very interesting new ranker! Now I am curious to see how it performs. The tests are extensive, which is great! 👍 I have a couple of change requests regarding the implementation but the general functionality looks good.

haystack/nodes/ranker/lost_in_the_middle.py

julian-risch · 2023-07-31T14:59:49Z

haystack/nodes/ranker/lost_in_the_middle.py

+logger = logging.getLogger(__name__)
+
+
+class LostInTheMiddleRanker(BaseRanker):


I wonder whether we can come up with a name that expresses that relevant documents will be ranked so that they end up at the top and bottom and irrelevant documents end up in the middle.
LITM is a very technical name that rarely anyone using Haystack will know. I don't have a better idea yet. Maybe you can come up with something together with @dfokina ?

I actually like LITM name because it is unique and well descriptive and associative with that now "famous" paper. So I'd keep it if there are no objections...

I like the LITM name in a sense that it is a clear reference to the paper. I don't think that the users would be specifically searching for this type of ranking without understanding the background for it?

Anyway, I tried to think about possible alternatives, but no good ideas so far. They're coming up pretty bad (IrrelevantMiddleRanker, TopBottomRanker, IrrelevantDocsInTheMiddleRanker...)

haystack/nodes/ranker/lost_in_the_middle.py

releasenotes/notes/add-lost-in-the-middle-ranker-6ad7dda754fad5a9.yaml

test/nodes/test_lost_in_the_middle.py

dfokina · 2023-08-01T10:25:16Z

I just pushed some docstrings updates and added the ranker to pydoc.

haystack/nodes/ranker/lost_in_the_middle.py

sjrl · 2023-08-02T11:23:14Z

@vblagoje Overall this looks really good!

My main comments surround the word_count_threshold and the truncate parameters. I'm not entirely sure they should be a part of the ranker since it breaks with the pattern of our previous rankers (eg. SentenceTransformerRanker, RecentnessRanker, etc.) and they sound like parameters that are more relevant for a Preprocessor, a Reader or the PromptNode.

~~My thinking is similar to what you already mentioned which is that a Ranker should not edit documents in general so I'm not sure if we should even provide the ability to do so here.~~

~~Instead, I think something like word_count_threshold and truncate are more appropriate to live within something like the PromptNode since that's where the document and context length matter.~~

~~What do you think?~~

See the discussion here

sjrl

Looks good!

vblagoje · 2023-08-02T13:24:24Z

Question: Write an elaborate review of this PR and give not more than five recommendations on how it can be improved

Answer:
This PR introduces a new ranker, LostInTheMiddleRanker, which reorders documents based on the "Lost in the Middle" order. This strategy places the most relevant paragraphs at the beginning or end of the context, while less relevant paragraphs are positioned in the middle. The implementation appears to be well thought out and the code is clean and easy to understand.

Here are some recommendations for improvement:

Add more comments: While the code is generally well-documented, adding more comments, especially in complex sections of the code, would improve readability and maintainability.
Error handling: In the reorder_documents method, there is a check to ensure that all documents are textual. However, it might be beneficial to add more error handling for other potential issues, such as null or empty documents.
Test coverage: The tests seem to cover a good range of scenarios. However, it would be beneficial to add more edge cases, such as when the word_count_threshold is zero or negative, or when the documents contain special characters or non-alphanumeric content.
Performance considerations: The reorder_documents method uses a list insert operation inside a loop, which could potentially be a performance bottleneck for large lists of documents due to the time complexity of list insertions. It might be worth considering alternative data structures or algorithms that could improve the performance.
Documentation: The documentation in docs/pydoc/config/ranker.yml is updated, but it would be beneficial to also update the user guide or any other relevant documentation to provide examples and guidance on how to use the new LostInTheMiddleRanker.

vblagoje · 2023-08-02T13:52:10Z

Question: Please rate the readiness of this Pull Request (PR) for integration on a scale of 1 to 10, with 1 indicating it is completely unready and 10 signifying it is fully prepared for merging into the main branch. First elborate and then provide your rating

Answer:
The PR seems to be well-prepared for merging. The new feature, LostInTheMiddleRanker, is well implemented with clear comments and follows the existing structure of the project. The PR includes comprehensive unit tests that cover various edge cases, ensuring the functionality of the new ranker. The PR also includes updates to the documentation and release notes, which is a good practice.

However, there are a few minor points that could be improved:

The PR could benefit from a little more explanation about the use cases and benefits of the LostInTheMiddleRanker. While the PR description does provide a brief explanation, it might be helpful to provide more context or examples of when this ranker would be particularly useful.
The PR modifies the examples/web_lfqa_improved.py file to use the new LostInTheMiddleRanker. While this demonstrates how to use the new ranker, it might be better to create a separate example script specifically for the LostInTheMiddleRanker. This would allow users to understand the usage of this ranker without mixing it with other features.
The PR could also benefit from some performance testing. While unit tests are included, it would be beneficial to know how the new ranker performs in terms of speed and memory usage, especially when dealing with a large number of documents.

Considering these points, I would rate this PR as an 8. It's well-prepared and mostly ready for merging, but could benefit from a few minor improvements.

Addressed Julian's concerns, have to dismiss his review as it is blocking merging and Julian is on PTO

* Add lost in the middle ranker * Add release note * Julian's feedback: more precise version of truncate * Better comments for the litm algorithm * Sebastian PR feedback * Add check for invalid values of word_count_threshold * Remove _truncate as it is not needed any more --------- Co-authored-by: Darja Fokina <[email protected]>

vblagoje requested a review from a team as a code owner July 27, 2023 14:44

vblagoje requested review from julian-risch and removed request for a team July 27, 2023 14:44

github-actions bot added topic:tests type:documentation Improvements on the docs labels Jul 27, 2023

vblagoje requested a review from a team as a code owner July 28, 2023 08:12

vblagoje requested review from dfokina and removed request for a team July 28, 2023 08:12

julian-risch previously requested changes Jul 31, 2023

View reviewed changes

vblagoje force-pushed the lost_in_the_middle_ranker branch from 4a03765 to ee62713 Compare August 1, 2023 06:01

vblagoje added 8 commits August 1, 2023 13:36

Add lost in the middle ranker

fa92b88

Add release note

c5660bf

PR review: implement top_k correctly, improve and add new unit tests

1a853fd

Use ranker variable name in tests

b72cd34

Add invalid top_k handling, unit tests

a16219d

Minor fixes

5bc429f

Julian's feedback: more precise version of truncate

10a92c2

Better comments for the litm algorithm

c5fcc30

vblagoje force-pushed the lost_in_the_middle_ranker branch from bf83076 to c5fcc30 Compare August 1, 2023 11:37

vblagoje and others added 2 commits August 1, 2023 13:49

Update examples/web_lfqa_improved.py

32d06e6

Update docs

2c46028

vblagoje force-pushed the lost_in_the_middle_ranker branch from df14403 to 2c46028 Compare August 2, 2023 09:34

Update lg in docstrings

918971e

sjrl reviewed Aug 2, 2023

View reviewed changes

haystack/nodes/ranker/lost_in_the_middle.py Outdated Show resolved Hide resolved

sjrl reviewed Aug 2, 2023

View reviewed changes

haystack/nodes/ranker/lost_in_the_middle.py Outdated Show resolved Hide resolved

sjrl reviewed Aug 2, 2023

View reviewed changes

haystack/nodes/ranker/lost_in_the_middle.py Outdated Show resolved Hide resolved

Sebastian PR feedback

e135303

sjrl approved these changes Aug 2, 2023

View reviewed changes

vblagoje added 2 commits August 2, 2023 15:39

Add check for invalid values of word_count_threshold

d0d1484

Remove _truncate as it is not needed any more

1d8ca68

vblagoje requested a review from julian-risch August 2, 2023 14:27

vblagoje merged commit 1876c41 into main Aug 2, 2023
56 checks passed

vblagoje deleted the lost_in_the_middle_ranker branch August 2, 2023 15:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add LostInTheMiddleRanker #5457

feat: Add LostInTheMiddleRanker #5457

vblagoje commented Jul 27, 2023

coveralls commented Jul 27, 2023 •

edited

Loading

vblagoje commented Jul 28, 2023

vblagoje commented Jul 28, 2023

julian-risch left a comment

julian-risch Jul 31, 2023

vblagoje Aug 1, 2023

dfokina Aug 1, 2023

dfokina commented Aug 1, 2023

sjrl commented Aug 2, 2023 •

edited

Loading

sjrl left a comment

vblagoje commented Aug 2, 2023

vblagoje commented Aug 2, 2023

		logger = logging.getLogger(__name__)


		class LostInTheMiddleRanker(BaseRanker):

feat: Add LostInTheMiddleRanker #5457

feat: Add LostInTheMiddleRanker #5457

Conversation

vblagoje commented Jul 27, 2023

What?

Why?

How can it be used?

How did you test it?

Notes for the reviewer

coveralls commented Jul 27, 2023 • edited Loading

Pull Request Test Coverage Report for Build 5739623079

💛 - Coveralls

vblagoje commented Jul 28, 2023

PR Analysis

PR Feedback

vblagoje commented Jul 28, 2023

julian-risch left a comment

Choose a reason for hiding this comment

julian-risch Jul 31, 2023

Choose a reason for hiding this comment

vblagoje Aug 1, 2023

Choose a reason for hiding this comment

dfokina Aug 1, 2023

Choose a reason for hiding this comment

dfokina commented Aug 1, 2023

sjrl commented Aug 2, 2023 • edited Loading

sjrl left a comment

Choose a reason for hiding this comment

vblagoje commented Aug 2, 2023

vblagoje commented Aug 2, 2023

coveralls commented Jul 27, 2023 •

edited

Loading

sjrl commented Aug 2, 2023 •

edited

Loading