PineconeDocumentStore raises error to the metadata produced by DocumentSplitter #919

Closed
bilgeyucel opened this issue Jul 23, 2024 · 3 comments · Fixed by #1009
Labels: bug, integration:pinecone, P2

Comments

@bilgeyucel
Contributor

Describe the bug
PineconeDocumentStore raises an error when I try to index a document that was split by DocumentSplitter. Error message 👇

PineconeApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 23 Jul 2024 12:46:03 GMT', 'Content-Type': 'application/json', 'Content-Length': '160', 'Connection': 'keep-alive', 'x-pinecone-request-latency-ms': '903', 'x-pinecone-request-id': '2298458388900737762', 'x-envoy-upstream-service-time': '37', 'server': 'envoy'})
HTTP response body: {"code":3,"message":"Metadata value must be a string, number, boolean or list of strings, got '[{\"doc_id\":\"22e0...' for field '_split_overlap'","details":[]}

The Document object that raises the error is below. "_split_overlap" seems to be a list of dicts, which is not one of the metadata value types listed in the error message.

Document(id=37fa03ca409f457046696a3bec987d5cb627f655cbcf0c019f7334bc170da4b8, content: 'Vegan Persimmon Flan

Recipe  by Tilde Thurium

This makes 2 servings. Why did I write a recipe that...', meta: {'file_path': '/content/recipe_files/vegan_flan_recipe.md', 'source_id': 'a01a0ae2f396930e9cd3475986ae716cb26c554f6b49d4c61dfeb473ddeb7ced', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0, '_split_overlap': [{'doc_id': '0520d3c17150c5fd057a19bdc796e9f9c3a632f1d9acf730154d888ee3fc86be', 'range': (0, 305)}]})
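To see which metadata fields would be rejected, the rule quoted in the error message (string, number, boolean or list of strings) can be checked locally. This is a minimal sketch in plain Python, not part of Haystack or the Pinecone client: the helper names `is_valid_pinecone_value` and `find_invalid_fields` are hypothetical, and Pinecone may enforce further limits that this check does not cover.

```python
def is_valid_pinecone_value(value):
    """Check a value against the rule quoted in the error message:
    string, number, boolean, or list of strings."""
    if isinstance(value, (str, int, float, bool)):
        return True
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def find_invalid_fields(meta):
    """Return the metadata keys whose values fail the check above."""
    return [key for key, value in meta.items() if not is_valid_pinecone_value(value)]


# Metadata copied from the Document shown in this issue.
meta = {
    "file_path": "/content/recipe_files/vegan_flan_recipe.md",
    "source_id": "a01a0ae2f396930e9cd3475986ae716cb26c554f6b49d4c61dfeb473ddeb7ced",
    "page_number": 1,
    "split_id": 0,
    "split_idx_start": 0,
    "_split_overlap": [
        {
            "doc_id": "0520d3c17150c5fd057a19bdc796e9f9c3a632f1d9acf730154d888ee3fc86be",
            "range": (0, 305),
        }
    ],
}

print(find_invalid_fields(meta))  # ['_split_overlap']
```

Only "_split_overlap" fails the check, which matches the field named in the 400 response.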

To Reproduce

import os

os.environ["PINECONE_API_KEY"] = "PINECONE-KEY"

from haystack_integrations.document_stores.pinecone import PineconeDocumentStore

document_store = PineconeDocumentStore(
    index="<ENTER_PINECONE_INDEX_NAME>",
    namespace="<ENTER_PINECONE-PROJECT-NAME>",
    dimension=1536,
    spec={"serverless": {"region": "us-east-1", "cloud": "aws"}},
)

from haystack.components.preprocessors import DocumentSplitter
from haystack import Document

source_docs = [Document(content="""
Vegan Persimmon Flan
Recipe by Tilde Thurium
This makes 2 servings. Why did I write a recipe that only makes 2 servings? It was the height of COVID, okay, don't judge me.
Tools:
2 ramekins
Blender
Ingredients:
½ cup persimmon pulp, strained. This takes 2 average sized fuyu persimmons. If they have seeds, remove them.
1 tbsp cornstarch
½ tsp agar agar
1 tbsp agave nectar, or to taste
2 tbsp granulated sugar
¼ cup coconut creme
½ cup almond milk
½ tsp vanilla
Steps
I tried making caramel with the [Full Of Plants](https://www.google.com/url?q=https%3A%2F%2Ffullofplants.com%2Feasy-vegan-caramel-sauce%2F) method but it was a pain in the ass and I burned myself.
For this recipe, just put the sugar at the bottom of the cup and it somehow magically turns into sauce. Lifehack!
Combine the cornstarch with the almond milk and stir it in.
whisk persimmon pulp, milk/cornstarch, agar agar, coconut creme, and agave in a saucepan. Bring to a boil.
The persimmon pulp got a little congealed, so I mixed it with an immersion blender. But you do you, boo.
Let the persimmon mixture cool a bit, for maybe 5 minutes. Stir in the vanilla. Pour it in to your ramekins or what have you.
Don’t forget and let it cool to room temperature. Agar agar waits for no man.
Refrigerate for at least 4 hours, or overnight.
To remove from ramekin, try the hot water bath method (didn’t work for me, maybe the water wasn’t hot enough.) Or just run a knife along the edges of the ramekin and jiggle it out.""")]

document_splitter = DocumentSplitter(split_by="word", split_length=40, split_overlap=10)
split_docs = document_splitter.run(documents=source_docs)
document_store.write_documents(documents=split_docs["documents"])

Describe your environment (please complete the following information):

  • OS: Colab
  • Haystack version: 2.3
  • Integration version: 1.2.1

bilgeyucel added the bug and integration:pinecone labels on Jul 23, 2024
@anakin87
Member

anakin87 commented Jul 23, 2024

Similar to #904.

To fix this, we can follow an approach similar to #907

But at this point, I also have doubts about the format produced by the DocumentSplitter, which does not seem to be compatible with several Document Stores.

@bilgeyucel
Contributor Author

IMO, fixing DocumentSplitter is a better solution. #907 seems more like a workaround.

@anakin87
Member

I think that for Document Stores that greatly limit the types of metadata values allowed, discarding invalid metadata and warning the user may be a good approach.
E.g., Chroma only supports str, int, float, bool. How can we store this structured information?
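The discard-and-warn idea could look roughly like the sketch below. It is an illustration only, not the code from #907 or from the eventual fix: the function name `discard_unsupported_meta` and its parameters are hypothetical, and the default allowed types are the ones named above for Chroma (str, int, float, bool).

```python
import logging

logger = logging.getLogger(__name__)

# Types named in the comment above as supported by Chroma.
ALLOWED_SCALARS = (str, int, float, bool)


def discard_unsupported_meta(doc_id, meta, allowed_types=ALLOWED_SCALARS):
    """Return a copy of meta without values of unsupported types,
    logging a warning for every field that is dropped."""
    cleaned = {}
    for key, value in meta.items():
        if isinstance(value, allowed_types):
            cleaned[key] = value
        else:
            logger.warning(
                "Document %s: discarding metadata field '%s' of unsupported type %s",
                doc_id,
                key,
                type(value).__name__,
            )
    return cleaned


cleaned = discard_unsupported_meta(
    "doc-1",
    {"page_number": 1, "_split_overlap": [{"doc_id": "abc", "range": (0, 305)}]},
)
print(cleaned)  # {'page_number': 1}
```

The document is still written, but the overlap information is lost, which is why this reads more like a workaround than a fix.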

However, I agree with you that we should think of better choices for the type of _split_overlap.
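One conceivable encoding, shown purely as an assumption and not as what Haystack adopted, is to serialize "_split_overlap" to a JSON string, since a plain string is accepted by both of the stores mentioned in this thread. The helper names `encode_split_overlap` and `decode_split_overlap` are hypothetical.

```python
import json


def encode_split_overlap(overlap):
    """Serialize the list of overlap dicts to a JSON string.
    Tuples such as (0, 305) become JSON arrays."""
    return json.dumps(overlap)


def decode_split_overlap(encoded):
    """Restore the list of dicts, turning each range back into a tuple."""
    return [
        {"doc_id": item["doc_id"], "range": tuple(item["range"])}
        for item in json.loads(encoded)
    ]


overlap = [{"doc_id": "abc", "range": (0, 305)}]
encoded = encode_split_overlap(overlap)
print(type(encoded).__name__)  # str
print(decode_split_overlap(encoded) == overlap)  # True
```

The trade-off is that the store can no longer filter on the individual overlap entries, because it only sees an opaque string.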
