community : [bugfix] Use document ids as keys in AzureSearch vectorstore #25486

22 changes: 19 additions & 3 deletions libs/community/langchain_community/vectorstores/azuresearch.py
@@ -428,6 +428,12 @@ def add_texts(
logger.debug("Nothing to insert, skipping.")
return []

# when `keys` are not passed in and there are `ids` in kwargs, use those instead
# base class expects `ids` passed in rather than `keys`
# https://github.com/langchain-ai/langchain/blob/4cdaca67dc51dba887289f56c6fead3c1a52f97d/libs/core/langchain_core/vectorstores/base.py#L65
if (not keys) and ("ids" in kwargs) and (len(kwargs["ids"]) == len(embeddings)):
keys = kwargs["ids"]

return self.add_embeddings(zip(texts, embeddings), metadatas, keys=keys)

async def aadd_texts(
@@ -452,6 +458,12 @@ async def aadd_texts(
logger.debug("Nothing to insert, skipping.")
return []

# when `keys` are not passed in and there are `ids` in kwargs, use those instead
# base class expects `ids` passed in rather than `keys`
# https://github.com/langchain-ai/langchain/blob/4cdaca67dc51dba887289f56c6fead3c1a52f97d/libs/core/langchain_core/vectorstores/base.py#L65
if (not keys) and ("ids" in kwargs) and (len(kwargs["ids"]) == len(embeddings)):
keys = kwargs["ids"]

return await self.aadd_embeddings(zip(texts, embeddings), metadatas, keys=keys)

def add_embeddings(
@@ -468,9 +480,13 @@ def add_embeddings(
data = []
for i, (text, embedding) in enumerate(text_embeddings):
# Use provided key otherwise use default key
-            key = keys[i] if keys else str(uuid.uuid4())
-            # Encoding key for Azure Search valid characters
-            key = base64.urlsafe_b64encode(bytes(key, "utf-8")).decode("ascii")
+            if keys:
+                key = keys[i]
+            else:
+                key = str(uuid.uuid4())
+                # Encoding key for Azure Search valid characters
+                key = base64.urlsafe_b64encode(bytes(key, "utf-8")).decode("ascii")
Collaborator (@ccurme):

Right now, if a user provides keys, we encode them; here we're changing that. Is this a breaking change? Should we encode in all circumstances?

Contributor Author (@MacanPN):

The documents are passed to azure.search.documents.client.upload_documents(). Nowhere in the documentation is it mentioned that the ids should be URL-encoded; my guess is that the keys were encoded just as an extra precaution.
I left the encoding in place for the case when the keys are generated by uuid4(); however, when keys/ids are passed in, URL-encoding them breaks things.
For instance, when the indexer is used to manage documents in the Azure vector store, it passes ids to .add_texts() and expects those ids to be used as-is (it stores them in the record manager under the provided ids). When .add_embeddings() URL-encodes them, the ids in the vector store no longer match the ids in the record manager, which breaks indexing.
In general, when I pass ids/keys to a method I don't expect them to be silently changed in the background. A better alternative would be validating the provided keys/ids; however, the documentation states that the key is of type Edm.String, which is defined as `<any UTF-8 character>` (Note: see the definition of UTF8-char in [RFC3629]).
There is a limit of 1024 characters, which could be checked, and the keys should be unique, which could be checked only among the documents currently being uploaded; therefore I think uniqueness is better left for the user to ensure.
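
To illustrate the mismatch (a minimal sketch, not part of the PR; the id value is hypothetical):

import base64

provided_id = "ID-document-1"  # id the indexer stores in the record manager
# what the pre-fix code uploads as the document key:
uploaded_key = base64.urlsafe_b64encode(provided_id.encode("utf-8")).decode("ascii")
print(uploaded_key)  # SUQtZG9jdW1lbnQtMQ==
assert uploaded_key != provided_id  # vector store and record manager now disagree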

Contributor Author (@MacanPN), Sep 3, 2024:
@ccurme sorry for the late reply! I actually replied to your question 4 days ago but didn't hit "start review" (I assumed the comment would still be visible). Only today did I find out that the comment does not show up if the review is not "started".
Please take a look and let me know whether it answers your questions and whether you agree. Thanks!

Contributor Author (@MacanPN):

@ccurme Please let me know if I can expand on the explanation I gave, or what I can do to move forward. The issue I'm trying to fix is that, as it stands right now, the indexer does not work with Azure at all. Thank you!


metadata = metadatas[i] if metadatas else {}
# Add data to index
# Additional metadata to fields mapping
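For context, the fallback added to add_texts / aadd_texts above can be exercised as follows (a minimal sketch, not part of the diff; it assumes an already-configured AzureSearch instance named vector_store):

ids = ["doc-1", "doc-2"]
keys_used = vector_store.add_texts(
    ["first text", "second text"],
    ids=ids,  # no `keys` argument given, so these ids become the document keys
)
assert keys_used == ids
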
37 changes: 37 additions & 0 deletions libs/community/tests/unit_tests/vectorstores/test_azure_search.py
@@ -188,3 +188,40 @@ def mock_create_index() -> None:
)
assert vector_store.client is not None
assert vector_store.client._api_version == "test"


@pytest.mark.requires("azure.search.documents")
def test_ids_used_correctly() -> None:
"""Check whether vector store uses the document ids when provided with them."""
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from langchain_core.documents import Document

class Response:
def __init__(self) -> None:
self.succeeded: bool = True

def mock_upload_documents(self, documents: List[object]) -> List[Response]: # type: ignore[no-untyped-def]
# assume all documents uploaded successfully
response = [Response() for _ in documents]
return response

documents = [
Document(
page_content="page zero Lorem Ipsum",
metadata={"source": "document.pdf", "page": 0, "id": "ID-document-1"},
),
Document(
page_content="page one Lorem Ipsum",
metadata={"source": "document.pdf", "page": 1, "id": "ID-document-2"},
),
]
ids_provided = [i.metadata.get("id") for i in documents]

with patch.object(
SearchClient, "upload_documents", mock_upload_documents
), patch.object(SearchIndexClient, "get_index", mock_default_index):
vector_store = create_vector_store()
ids_used_at_upload = vector_store.add_documents(documents, ids=ids_provided)
assert len(ids_provided) == len(ids_used_at_upload)
assert ids_provided == ids_used_at_upload