[Bug]: Upserting the same data causes the SQLite db to grow by 50-100% #2143
Comments
Code:
for article in fetch_content_articles(content_type):
    sections = []
    try:
        sections = json.loads(article[3])
    except Exception as _:
        pass
    content_article = {
        "id": str(article[0]),  # chromadb expects a string, not an integer
        "documents": "".join([
            markdownify.markdownify('<h2>' + doc['sections_title'] + '</h2>' + doc['sections_content']) for doc in sections
        ]),
        "metadata": {
            "title": article[1],
            "slug": article[2],
            "image": article[4] or "",
            "updated_at": article[5].timestamp(),  # chromadb expects a timestamp, not a datetime object
            "article_preview": article[7] or "",
            "type": CONTENT_ARTICLES_TYPES[article[8]],
            "geo": "ie" if article[9] == 2 else "uk"
        }
    }
    embeddings.add(
        entity=collection_name,
        ids=[content_article['id']],
        items=[content_article['documents']],
        metadata=[content_article['metadata']]
    )

def add(
    entity: str, ids: list[str], items: list[str], metadata: list[dict] | None = None
):
    try:
        get_collection(entity).upsert(
            ids=ids,
            documents=items,
            metadatas=metadata,
        )
    except Exception as ex:
        print(ex)
I suspect that most of the expansion here is coming from the WAL. Unfortunately, we don't have first-party support for cleaning the WAL right now, but @tazarov has some community-supported tools. We hope to add this to the core API.
@essenciary this is an explanation of how the WAL works: https://cookbook.chromadb.dev/core/advanced/wal/ And here's the explanation of how to prune (clean) it: https://cookbook.chromadb.dev/core/advanced/wal-pruning/. The tooling is here: https://github.com/amikos-tech/chromadb-ops.
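To check whether the WAL really accounts for the growth before pruning anything, a quick measurement can help. The sketch below reports the size of the SQLite file and the number of rows in the WAL table. It assumes the database lives at `chroma.sqlite3` inside the persist directory and that the WAL is stored in a table named `embeddings_queue`; both are assumptions about the 0.4.x on-disk layout and may differ in other versions. The function name `wal_stats` is hypothetical.

```python
import os
import sqlite3


def wal_stats(db_path: str, table: str = "embeddings_queue") -> dict:
    """Return the DB file size in bytes and the row count of the assumed WAL table.

    `table` defaults to "embeddings_queue", which is an assumption about
    Chroma's SQLite schema, not something confirmed in this thread.
    """
    size_bytes = os.path.getsize(db_path)
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        (rows,) = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
    finally:
        conn.close()
    return {"size_bytes": size_bytes, "wal_rows": rows}
```

Calling `wal_stats(...)` on the database file before and after an import run should show whether the row count climbs by roughly the number of upserted records each time, which would be consistent with the WAL explanation.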
What happened?
I'm using Chroma in a Python chat-type app to store what could be considered entities and to do RAG on a few hundred documents. This data is mostly static: it updates very rarely and, when it does, by very little. Think a few new entities/keywords every hour and/or a couple more articles for RAG per day. However, every time I run the import scripts, even at 1-minute intervals, the SQLite DB grows by 50-100%. For example:
I haven't diffed the data as it's coming from multiple sources, but I expect the data was 99.99% identical on every import.
The issue is that the DB grows very fast (it was 3 GB in production after a few days) and Chroma becomes impossible to use (it saturates all the CPU cores and never fetches the data at that size).
PS: looking at the expansion rate, the DB seems to grow by more or less the initial 35 MB each time.
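Since the imported data is expected to be almost identical between runs, one possible mitigation (not something proposed by the maintainers in this thread) is to skip records that have not changed, so that fewer writes reach `upsert` in the first place. A minimal sketch, assuming a content hash is stored alongside each record's metadata under a hypothetical `content_hash` key; the helper names `content_hash` and `select_changed` are illustrative only:

```python
import hashlib
import json


def content_hash(document: str, metadata: dict) -> str:
    """Stable SHA-256 digest of a document together with its metadata."""
    payload = document + "\x00" + json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def select_changed(ids, documents, metadatas, stored_hashes):
    """Return (ids, documents, metadatas) for records whose hash differs.

    `stored_hashes` maps id -> previously stored digest; ids missing from
    the mapping are treated as new. The returned metadata carries the new
    digest under the hypothetical "content_hash" key.
    """
    out_ids, out_docs, out_metas = [], [], []
    for id_, doc, meta in zip(ids, documents, metadatas):
        digest = content_hash(doc, meta)
        if stored_hashes.get(id_) != digest:
            out_ids.append(id_)
            out_docs.append(doc)
            out_metas.append({**meta, "content_hash": digest})
    return out_ids, out_docs, out_metas
```

In an import script like the one in this issue, `stored_hashes` could be built from the metadata already in the collection (for example via the collection's `get` call with the same ids), and `upsert` would then be called only when the returned id list is non-empty. Whether this reduces the growth enough depends on how much of the data truly is unchanged.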
Versions
chromadb 0.4.24
python 3.10.10
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009
Relevant log output
No response