Describe the bug
PineconeDocumentStore raises an error when I try to index a document that was split by DocumentSplitter. Error message 👇
PineconeApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 23 Jul 2024 12:46:03 GMT', 'Content-Type': 'application/json', 'Content-Length': '160', 'Connection': 'keep-alive', 'x-pinecone-request-latency-ms': '903', 'x-pinecone-request-id': '2298458388900737762', 'x-envoy-upstream-service-time': '37', 'server': 'envoy'})
HTTP response body: {"code":3,"message":"Metadata value must be a string, number, boolean or list of strings, got '[{\"doc_id\":\"22e0...' for field '_split_overlap'","details":[]}
The Document object that raises the error is below. "_split_overlap" seems to be a list of dicts, which is not one of the metadata types Pinecone accepts.
Document(id=37fa03ca409f457046696a3bec987d5cb627f655cbcf0c019f7334bc170da4b8, content: 'Vegan Persimmon Flan
Recipe by Tilde Thurium
This makes 2 servings. Why did I write a recipe that...', meta: {'file_path': '/content/recipe_files/vegan_flan_recipe.md', 'source_id': 'a01a0ae2f396930e9cd3475986ae716cb26c554f6b49d4c61dfeb473ddeb7ced', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0, '_split_overlap': [{'doc_id': '0520d3c17150c5fd057a19bdc796e9f9c3a632f1d9acf730154d888ee3fc86be', 'range': (0, 305)}]})
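A minimal sketch of how the offending field can be identified, assuming only the type rule quoted in the error message (string, number, boolean, or list of strings). The helper name `is_pinecone_compatible` is hypothetical and is not part of Haystack or the Pinecone client; the `meta` dict is copied from the Document above.

```python
def is_pinecone_compatible(value):
    """Check a value against the rule quoted in the Pinecone error message.

    Hypothetical helper for illustration, not a library API.
    """
    if isinstance(value, (str, int, float, bool)):
        return True
    # The only accepted container is a list whose items are all strings.
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


# Metadata of the failing Document, as shown in the report.
meta = {
    "file_path": "/content/recipe_files/vegan_flan_recipe.md",
    "source_id": "a01a0ae2f396930e9cd3475986ae716cb26c554f6b49d4c61dfeb473ddeb7ced",
    "page_number": 1,
    "split_id": 0,
    "split_idx_start": 0,
    "_split_overlap": [
        {
            "doc_id": "0520d3c17150c5fd057a19bdc796e9f9c3a632f1d9acf730154d888ee3fc86be",
            "range": (0, 305),
        }
    ],
}

offending = [key for key, value in meta.items() if not is_pinecone_compatible(value)]
print(offending)  # ['_split_overlap']
```

Every other field is a plain string or integer, so `_split_overlap` is the only one that violates the rule.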
To Reproduce
import os

os.environ["PINECONE_API_KEY"] = "PINECONE-KEY"

from haystack_integrations.document_stores.pinecone import PineconeDocumentStore

document_store = PineconeDocumentStore(
index="<ENTER_PINECONE_INDEX_NAME>",
namespace="<ENTER_PINECONE-PROJECT-NAME>",
dimension=1536,
spec={"serverless": {"region": "us-east-1", "cloud": "aws"}},
)
from haystack.components.preprocessors import DocumentSplitter
from haystack import Document

source_docs = [Document(content="""Vegan Persimmon Flan
Recipe by Tilde Thurium
This makes 2 servings. Why did I write a recipe that only makes 2 servings? It was the height of COVID, okay, don't judge me.
Tools:
2 ramekins
Blender
Ingredients:
½ cup persimmon pulp, strained. This takes 2 average sized fuyu persimmons. If they have seeds, remove them.
1 tbsp cornstarch
½ tsp agar agar
1 tbsp agave nectar, or to taste
2 tbsp granulated sugar
¼ cup coconut creme
½ cup almond milk
½ tsp vanilla
Steps
I tried making caramel with the [Full Of Plants](https://www.google.com/url?q=https%3A%2F%2Ffullofplants.com%2Feasy-vegan-caramel-sauce%2F) method but it was a pain in the ass and I burned myself.
For this recipe, just put the sugar at the bottom of the cup and it somehow magically turns into sauce. Lifehack!
Combine the cornstarch with the almond milk and stir it in.
whisk persimmon pulp, milk/cornstarch, agar agar, coconut creme, and agave in a saucepan. Bring to a boil.
The persimmon pulp got a little congealed, so I mixed it with an immersion blender. But you do you, boo.
Let the persimmon mixture cool a bit, for maybe 5 minutes. Stir in the vanilla. Pour it in to your ramekins or what have you.
Don’t forget and let it cool to room temperature. Agar agar waits for no man.
Refrigerate for at least 4 hours, or overnight.
To remove from ramekin, try the hot water bath method (didn’t work for me, maybe the water wasn’t hot enough.) Or just run a knife along the edges of the ramekin and jiggle it out.""")]

document_splitter = DocumentSplitter(split_by="word", split_length=40, split_overlap=10)
split_docs = document_splitter.run(documents=source_docs)
document_store.write_documents(documents=split_docs["documents"])
Describe your environment (please complete the following information):
OS: Colab
Haystack version: 2.3
Integration version: 1.2.1
To fix this, we can follow an approach similar to #907
But at this point, I also have doubts about the format produced by the DocumentSplitter, which seems not to be compatible with several Document Stores.
I think that for Document Stores that greatly limit the types of metadata values allowed, discarding invalid metadata and warning the user may be a good approach.
E.g., Chroma only supports str, int, float, bool. How can we store this structured information?
However, I agree with you that we should think about a better type for _split_overlap.
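The two options raised in this thread can be sketched roughly as follows: either discard unsupported metadata values and warn the user, or keep the structured information by serializing it to a JSON string. This is only an illustration on plain dicts. The helper names `discard_invalid_meta` and `serialize_invalid_meta` are hypothetical, the allowed types mirror the Chroma constraint mentioned above (str, int, float, bool), and neither function reflects what the integrations actually do.

```python
import json
import logging

logger = logging.getLogger(__name__)

# Assumption: the store accepts only these scalar types (the Chroma case above).
SCALAR_TYPES = (str, int, float, bool)


def discard_invalid_meta(meta, allowed=SCALAR_TYPES):
    """Option 1: drop unsupported values and warn about each dropped field."""
    clean = {}
    for key, value in meta.items():
        if isinstance(value, allowed):
            clean[key] = value
        else:
            logger.warning(
                "Discarding meta field %r: unsupported type %s",
                key,
                type(value).__name__,
            )
    return clean


def serialize_invalid_meta(meta, allowed=SCALAR_TYPES):
    """Option 2: keep unsupported values by encoding them as JSON strings."""
    return {
        key: value if isinstance(value, allowed) else json.dumps(value)
        for key, value in meta.items()
    }


meta = {
    "split_id": 0,
    "_split_overlap": [
        {
            "doc_id": "0520d3c17150c5fd057a19bdc796e9f9c3a632f1d9acf730154d888ee3fc86be",
            "range": (0, 305),
        }
    ],
}

discarded = discard_invalid_meta(meta)
serialized = serialize_invalid_meta(meta)

# Reading the overlap back requires decoding; the tuple comes back as a list.
restored = json.loads(serialized["_split_overlap"])
```

Discarding is simpler but loses the overlap information. Serializing keeps it, at the cost of making the field opaque to metadata filters and turning the `range` tuple into a list on the way back.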