How to get blob container files URL to use as metadata for pinecone #7883

levalencia · 2023-07-18T14:36:24Z


In my scenario I am using LangChain and Pinecone for document search based on similarity search.
I have a lot of files in Azure Blob Storage and I want to let users search for documents.

However, documents have to be split into chunks and vectorized before they can be stored in Pinecone.

The similarity search works pretty well; however, it only returns the matching chunk, not the document it came from.
One way to overcome this would be to store the URL of the original file in the chunk metadata (vector metadata).

How would you approach this?

My code for indexing looks like this (ignore the text analytics part):

text_analytics_client = TextAnalyticsClient(azure_textanalytics_endpoint, AzureKeyCredential(azure_textanalytics_key))
loader = AzureBlobStorageContainerLoader(conn_str=blobstorageconnectionstring, container=storageContainer)
embeddings = OpenAIEmbeddings(
    openai_api_base=openai_api_base,
    openai_api_type="azure",
    deployment=openai_embedding_deployment_name,
    openai_api_key=openai_api_key,
    model=openai_embedding_model_name,
    chunk_size=openai_chunk_size,
)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=textsplitter_chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len,
)
start_time = time.time()
total_files_indexed = 0
total_chunks_per_file = 0
errors = 0
for doc in documents:
    total_files_indexed = total_files_indexed + 1
    file_index_start_time = time.time()
    texts = []
    metadatas = []
    docsplit = text_splitter.split_text(doc.page_content)

    # metadatas = [{"page": i} for i in range(len(docsplit))]
    for i, split in enumerate(docsplit):
        result = text_analytics_client.recognize_entities([split])
        entity_categories = {}
        for entity in result[0].entities:
            category = entity.category
            entity_text = entity.text
            entity_categories.setdefault(category, []).append(entity_text)

        metadata = {"page": i}
        for category, values in entity_categories.items():
            metadata[category] = ', '.join(values)
        metadatas.append(metadata)
        texts.append(split)

    try:
        Pinecone.from_texts(
            texts,
            embeddings,
            index_name=indexName,
            metadatas=metadatas,
            namespace=namespace_name)
    except Exception as e:
        error_message = f"An error occurred: {str(e)}"
        logging.exception(e)
        errors = errors + 1

levalencia · 2023-07-18T14:39:45Z


If I check the source of the data loader, I don't think this will be possible: https://github.com/hwchase17/langchain/blob/master/langchain/document_loaders/azure_blob_storage_container.py
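
If the loader does not hand back a URL, one possible workaround is to build it yourself from the storage account name, the container, and the blob name. The sketch below is only an illustration under stated assumptions: the helper names `account_name_from_connection_string` and `build_blob_url` are hypothetical (not part of LangChain or the Azure SDK), the connection string is assumed to contain an `AccountName=` segment, the account is assumed to use the default public endpoint suffix, and the blob name is assumed to be known already (for example from listing the container). The resulting URL is only directly readable if the blob is public or a SAS token is appended.

```python
from urllib.parse import quote


def account_name_from_connection_string(conn_str: str) -> str:
    """Extract AccountName from a 'Key=Value;Key=Value' style connection string.

    Hypothetical helper; assumes the string contains an AccountName segment.
    """
    parts = {}
    for segment in conn_str.split(";"):
        # partition() splits on the first '=' only, so base64 keys that end
        # in '=' are kept intact.
        key, sep, value = segment.partition("=")
        if sep:
            parts[key.strip()] = value.strip()
    if "AccountName" not in parts:
        raise ValueError("connection string has no AccountName segment")
    return parts["AccountName"]


def build_blob_url(
    account_name: str,
    container: str,
    blob_name: str,
    endpoint_suffix: str = "core.windows.net",  # assumption: default public cloud
) -> str:
    """Build https://<account>.blob.<suffix>/<container>/<blob> with URL-encoding."""
    encoded_blob = quote(blob_name, safe="/")  # keep virtual folder separators
    return f"https://{account_name}.blob.{endpoint_suffix}/{quote(container)}/{encoded_blob}"
```

For example, `build_blob_url(account_name_from_connection_string(blobstorageconnectionstring), storageContainer, blob_name)` would give a value that can be stored in the chunk metadata, where `blob_name` is whatever name you obtained for the blob.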


dosubot[bot] bot Jul 18, 2023

Answer generated by a 🤖

Answer

I understand that you're using LangChain and Pinecone for document search with similarity search, and you're facing an issue where the similarity search only returns the chunk, not the entire document. You're considering adding the URL of the original file in the chunk metadata to overcome this issue. However, you're unsure how to approach this given your current code and the structure of the LangChain dataloader.

Here's a potential solution: You can modify your existing code to include the URL of the original file in the metadata dictionary for each chunk of text. This would involve adding an additional key-value pair to the metadata dictionary.

Here's how you can adjust your code:

for doc in documents: 
    total_files_indexed += 1
    texts = []
    metadatas = []          
    docsplit = text_splitter.split_text(doc.page_content)

    # Get the URL of the original file
    original_file_url = doc.url  # Replace this with the actual method to get the URL

    for i, split in enumerate(docsplit):
        result = text_analytics_client.recognize_entities([split])
        entity_categories = {}
        for entity in result[0].entities:
            category = entity.category
            entity_text = entity.text
            entity_categories.setdefault(category, []).append(entity_text)
        
        metadata = {"page": i, "original_file_url": original_file_url}
        for category, values in entity_categories.items():
            metadata[category] = ', '.join(values)
        metadatas.append(metadata)
        texts.append(split)

    try:
        Pinecone.from_texts(
            texts,
            embeddings,
            index_name=indexName,
            metadatas=metadatas,
            namespace=namespace_name)
    except Exception as e:
        logging.exception(e)

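Please note that doc.url is a placeholder: a LangChain Document exposes page_content and metadata but has no url attribute. You'll need to replace it with whatever actually yields the URL of the original file, for example a value you put into doc.metadata when loading. This will depend on how your Document objects are built and where the files are stored.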
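
Once every chunk carries the file URL in its metadata, the search results can be collapsed back into a list of documents at query time. The following is a minimal sketch, not a LangChain API: `collapse_hits_to_documents` is a hypothetical helper, hits are modeled as plain `(metadata, score)` tuples, the metadata key is assumed to be `original_file_url` as in the code above, and a higher score is assumed to mean a closer match.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[Dict[str, object], float]  # (chunk metadata, similarity score)


def collapse_hits_to_documents(
    hits: Iterable[Hit],
    url_key: str = "original_file_url",  # assumption: key used at indexing time
) -> List[Tuple[str, float]]:
    """Return one (url, best_score) pair per source file, best match first."""
    best: Dict[str, float] = {}
    for metadata, score in hits:
        url = metadata.get(url_key)
        if url is None:
            # Chunk was indexed without a URL; it cannot be mapped to a file.
            continue
        url = str(url)
        # Keep only the highest-scoring chunk for each file.
        if url not in best or score > best[url]:
            best[url] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

With LangChain, the input could be derived from the `(Document, score)` pairs returned by a scored similarity search by passing `(doc.metadata, score)` for each pair. If your index uses a distance metric where lower is better, the comparison and sort order would need to be flipped.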

This is just one potential solution to your problem. If you find this solution helpful and believe it could benefit other users, I encourage you to make a pull request to update the LangChain documentation. This way, other users facing a similar issue can benefit from your experience.

Sources

docs/extras/modules/data_connection/retrievers/how_to/self_query/pinecone.ipynb

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

