
Support for new embedding models #5242

Closed
anakin87 opened this issue Jun 30, 2023 · 6 comments · Fixed by deepset-ai/haystack-core-integrations#32
Labels: Contributions wanted! (Looking for external contributions), P2 (Medium priority, add to the next sprint if no P1 available), topic:retriever

Comments

@anakin87 (Member)

There are now new open-source embedding models that work better than the sentence-transformers models and, in some cases, outperform the OpenAI ones.

Haystack users have occasionally asked us to support these new models (#4051 and #4946).

It would be good to explore what we need to do to support these new models in Haystack.

Side note: supporting the INSTRUCTOR family of models will probably require several changes, because these models tailor the embeddings to the task using an instruction prompt; supporting the e5 models should be easier...
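For context, INSTRUCTOR-style models take (instruction, text) pairs instead of bare strings, which is what makes them harder to fit into the current retriever API. A minimal sketch of that input shape (the `with_instruction` helper and the instruction wording are illustrative, not an existing Haystack API):

```python
# Sketch of the input format INSTRUCTOR-style models expect: each item is an
# [instruction, text] pair, so the embedding is conditioned on the task.
# The instruction strings below are illustrative examples.

def with_instruction(instruction, texts):
    """Pair every text with a task instruction."""
    return [[instruction, t] for t in texts]

doc_inputs = with_instruction(
    "Represent the document for retrieval:",
    ["Haystack is an open-source NLP framework."],
)
query_inputs = with_instruction(
    "Represent the question for retrieving supporting documents:",
    ["What is Haystack?"],
)
# An INSTRUCTOR model would then be called roughly like:
#   model.encode(doc_inputs); model.encode(query_inputs)
```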

@mathislucka (Member)

@sjrl already found a way to make e5 work. One thing we could improve, though: e5 expects documents to be prefixed with `passage:` and queries with `query:`. It would be great if that could be added somehow.

@sjrl (Contributor) commented on Jul 3, 2023

Yes, I found that we can load e5 in Haystack, but we cannot easily add the prefixes Mathis mentioned. Even without the prefixes it works quite well, but we are probably losing some performance by not using them.
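For anyone experimenting in the meantime, the prefixes are plain string preprocessing and could be applied before texts ever reach the retriever. A minimal sketch (the helper names are hypothetical, not Haystack APIs):

```python
# Hypothetical helpers: prepend the prefixes the e5 models were trained with.
# This is plain string preprocessing that could be applied to texts before
# they are passed to an EmbeddingRetriever.

def prefix_passages(texts):
    """Prefix document texts with 'passage: ' as e5 expects."""
    return [f"passage: {t}" for t in texts]

def prefix_queries(texts):
    """Prefix query texts with 'query: ' as e5 expects."""
    return [f"query: {t}" for t in texts]

docs = prefix_passages(["Haystack is an NLP framework."])
query = prefix_queries(["What is Haystack?"])[0]
print(docs[0])   # passage: Haystack is an NLP framework.
print(query)     # query: What is Haystack?
```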

@anakin87 (Member, Author) commented on Jul 3, 2023

@sjrl, thanks for the clarification. Could you post a code example of using e5 embeddings in Haystack?

@sjrl (Contributor) commented on Jul 3, 2023

Here is a minimal example of loading the embedding retriever. You can then use it as you would any other EmbeddingRetriever.

NOTE: Make sure to use the cosine similarity function for these embeddings in the document store.

from haystack.nodes import EmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore

doc_store = InMemoryDocumentStore(
    similarity="cosine",  # the e5 models were trained with a cosine similarity function
    embedding_dim=768
)

e5 = EmbeddingRetriever(
    document_store=doc_store,
    embedding_model="intfloat/e5-base-v2",
    model_format="transformers",  # Make sure we specify the transformers model format
    pooling_strategy="reduce_mean",  # This is the pooling method used to train the e5 models
    top_k=20,
    max_seq_len=512,
)
doc_store.update_embeddings(e5)
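On the cosine note above: dot-product and cosine rankings only coincide when embeddings are unit-normalized, so configuring the store with a different similarity could silently change rankings. A small, library-free sketch of the difference (the vector values are made up):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two vectors pointing in the same direction but with different norms:
u = [1.0, 2.0, 2.0]  # norm 3
v = [2.0, 4.0, 4.0]  # norm 6, same direction

print(cosine(u, v))  # 1.0  -- direction is identical
print(dot(u, v))     # 18.0 -- dot product also rewards the larger norm
```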

@julian-risch added the topic:retriever, P2, and Contributions wanted! labels on Jul 5, 2023
@rnyak commented on Aug 30, 2023

@julian-risch how can we use EmbeddingRetriever to fine-tune e5 models via retriever.train(), similar to this tutorial? The docstring reads: `We only support the training of sentence-transformer embedding models.` Does that mean we cannot fine-tune an e5 model using the EmbeddingRetriever class?

In addition, is there a utility script to create a dataset in this format?

- question: the question string
- pos_doc: the positive document string
- neg_doc: the negative document string
- score: the score margin
Thanks.
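As far as I know there is no built-in utility for that format, but it is mostly bookkeeping to assemble from raw triples. A minimal sketch, assuming your raw data carries per-document relevance scores (the `build_examples` helper and the margin computation are hypothetical, not Haystack APIs):

```python
# Sketch: build training examples in the (question, pos_doc, neg_doc, score)
# format from raw triples. The score here is a placeholder margin
# (pos_score - neg_score); substitute whatever scoring your data provides.

def build_examples(triples):
    """triples: iterable of (question, pos_doc, pos_score, neg_doc, neg_score)."""
    return [
        {
            "question": q,
            "pos_doc": pos,
            "neg_doc": neg,
            "score": pos_score - neg_score,  # margin between positive and negative
        }
        for q, pos, pos_score, neg, neg_score in triples
    ]

examples = build_examples([
    ("What is Haystack?", "Haystack is an NLP framework.", 0.9,
     "Pandas is a data library.", 0.1),
])
print(round(examples[0]["score"], 2))  # 0.8
```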

@awinml (Contributor) commented on Sep 19, 2023

I would like to add support for the INSTRUCTOR embedding models. I have opened a PR (#5836) that adds INSTRUCTOR support to Haystack (v2).

The implementation closely follows the one for the Sentence Transformers embedding models (#5567).
