
[ENH] Nomic Text Embed function #2182

Closed · wants to merge 6 commits

Conversation

@andrewblum commented May 11, 2024

Description of changes

Adds a NomicEmbeddingFunction that generates embeddings by calling the Nomic text embedding API.

Test plan

How are these changes tested?
Added a test that is similar to the existing test for Ollama.

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

vercel bot commented May 11, 2024

The latest updates on your projects:

Name    Status     Updated (UTC)
chroma  ✅ Ready   May 15, 2024 9:05am

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality (readability, modularity, intuitiveness)?

@andrewblum (Author) commented:

Note -- Nomic has a Python SDK we could also use to do this, but it just hits the same API, so it didn't seem worth adding a dependency just to hit one endpoint.

@tazarov (Contributor) left a comment:

Almost there. We need a tiny bit more error handling and a couple more tests.

embeddings = self._session.post(
    self._api_url,
    headers=headers,
    json={"model": self._model_name, "texts": texts},
@tazarov (Contributor) commented:

What is the limit on the maximum number of texts that can be sent at once? If there's a limit, let's enforce it on the client side so we don't waste a round trip just to get an error back.

Important: Do not add any loop logic here. If a chunk fails, throw an exception and let users implement their own chunking and subsequent error handling.
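
For illustration, a minimal sketch of the kind of client-side guard being suggested. Nomic turned out not to document a hard limit (see the reply below), so MAX_TEXTS_PER_REQUEST here is a purely hypothetical threshold:

from typing import List


# Hypothetical guard; 512 is an assumed placeholder, not a documented Nomic limit.
MAX_TEXTS_PER_REQUEST = 512


def _validate_batch(texts: List[str]) -> None:
    # Fail fast before the HTTP round trip instead of waiting for the API to error out.
    if len(texts) > MAX_TEXTS_PER_REQUEST:
        raise ValueError(
            f"Too many texts in one request ({len(texts)} > {MAX_TEXTS_PER_REQUEST}); "
            "split the input into smaller batches."
        )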

@andrewblum (Author) replied:

Nomic responded:
"Yes, you will get rate limited with 429s if you hit it too fast (in practice this shouldn't happen unless you are a bot trying to DOS the API).

You can send as many as you'd like per request, they will get processed in parallel. It's recommended you break it up yourself into several requests or use the Nomic python client because you will see network latency due to the large request/response size if you send dozens of megabytes of text in a single request"

@tazarov (Contributor) replied:

Let's leave it as-is for now. We have some ideas that we're trying to develop to make error handling a little bit more consistent across all EFs.
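
Since the guidance above leaves chunking to users rather than the embedding function, here is a minimal sketch of what caller-side batching could look like; the batch size of 100 is an assumed default, not a Nomic recommendation:

from typing import List


def embed_in_batches(ef, texts: List[str], batch_size: int = 100) -> List[List[float]]:
    # Caller-side chunking: the embedding function raises on failure,
    # and the caller decides how to split, retry, or give up.
    embeddings: List[List[float]] = []
    for i in range(0, len(texts), batch_size):
        embeddings.extend(ef(texts[i : i + batch_size]))
    return embeddings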

    json={"model": self._model_name, "texts": texts},
).json()

return cast(Embeddings, embeddings["embeddings"])
@tazarov (Contributor) commented:

Add some error checking in case the API returns an error. You can also use resp.raise_for_status() to raise if the status is not 2xx.

Without error checking we'll raise a KeyError, since the response may not contain an embeddings key.
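
A minimal sketch of the suggested handling, assuming the response shape shown in the snippet above:

resp = self._session.post(
    self._api_url,
    headers=headers,
    json={"model": self._model_name, "texts": texts},
)
# raise_for_status() raises requests.exceptions.HTTPError on any non-2xx
# status, rather than failing later with an opaque KeyError.
resp.raise_for_status()
body = resp.json()
if "embeddings" not in body:
    raise RuntimeError(f"Unexpected response from Nomic API: {body}")
return cast(Embeddings, body["embeddings"])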

To learn more about the Nomic API: https://docs.nomic.ai/reference/endpoints/nomic-embed-text
Export the NOMIC_API_KEY and optionally the NOMIC_MODEL environment variables.
"""
if os.environ.get("NOMIC_API_KEY") is None:
@tazarov (Contributor) commented:

I know this has been used in the past but let's add an annotation instead - @pytest.mark.skipif("NOMIC_API_KEY" not in os.environ, reason="NOMIC_API_KEY not set, skipping test.")

Decorators are implicit flow control, but this is a common practice for pytest so we might as well use it.
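
For reference, the suggested annotation applied to the test would look roughly like this:

import os

import pytest


@pytest.mark.skipif(
    "NOMIC_API_KEY" not in os.environ,
    reason="NOMIC_API_KEY not set, skipping test.",
)
def test_nomic() -> None:
    ...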

from chromadb.utils.embedding_functions import NomicEmbeddingFunction


def test_nomic() -> None:
@tazarov (Contributor) commented:

Let's add a couple of negative test cases:

  • Empty API key
  • Error response (mock server)
  • Too many texts

For the mock server:

pip install "pytest_httpserver>=1.0.10"

import json
from unittest.mock import patch

import pytest
from pytest_httpserver import HTTPServer

from chromadb.utils.embedding_functions import NomicEmbeddingFunction

with HTTPServer() as httpserver:
    # Define the error response the mock server should return
    httpserver.expect_oneshot_request(
        "/embeddings/text", method="POST"
    ).respond_with_data(
        json.dumps({"error": "some error message"}),  # adjust to fit the Nomic API error shape
        status=400,
    )
    # Point the embedding function at the mock server instead of the real API
    with patch.object(
        NomicEmbeddingFunction,
        "_api_url",
        f"http://{httpserver.host}:{httpserver.port}/embeddings/text",
    ):
        nomic_instance = NomicEmbeddingFunction()
        with pytest.raises(Exception):
            nomic_instance(["test text"])

@andrewblum (Author) commented May 13, 2024:

Added these tests as well as one for missing model name, but did not include the "too many texts" test. Nomic responded saying:

  • "You can send as many as you'd like per request, they will get processed in parallel. It's recommended you break it up yourself into several requests or use the Nomic python client because you will see network latency due to the large request/response size if you send dozens of megabytes of text in a single request"

So I am not sure: 1) what we want to consider too large, 2) the best way to check for that, and 3) whether you still want this.

@andrewblum (Author) commented May 13, 2024:

Also -- should I add pytest_httpserver to the project's requirements_dev file and check it in as part of this PR?

@tazarov (Contributor) replied:

Yes, please add it to the requirements_dev.

We do have a limitation in the Chroma API regarding the maximum number of embeddings, but we generally don't tie that to the embedding functions' batch size. So you can leave it as is for now.

@tazarov (Contributor) left a comment:

Looks good. Just a couple of minor nits, and this should be ready to go.

)


@pytest.mark.skipif(
@tazarov (Contributor) commented:

I think we can remove the decorator here, as this is a negative test. It is supposed to fail regardless of whether we're testing with the Nomic API key or not.
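
A sketch of such a negative test without the decorator; the constructor signature and exception type here are assumptions based on the tests discussed above, not confirmed details of the implementation:

def test_nomic_missing_api_key() -> None:
    # Negative test: expected to fail the same way whether or not a real
    # NOMIC_API_KEY is set, so no skipif decorator is needed.
    with pytest.raises(ValueError):
        NomicEmbeddingFunction(api_key="")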

)


@pytest.mark.skipif(
@tazarov (Contributor) commented:

Maybe also remove this decorator and let the test run, as this is a local mock test.

I think it makes sense to add some sort of activation flags (other than API keys) for EF testing, but this requires further consideration.

@tazarov (Contributor) left a comment:

Looks good. Thank you @andrewblum.

@andrewblum (Author) commented:

Linting issue with the requirements_dev.txt file; should be fixed now. I didn't see it because I had been running --no-verify, due to the pre-commit hook returning a ton of mypy errors on code I hadn't touched in embedding_functions.py. Did the linting settings get changed since the last time someone ran it against that file?

@andrewblum requested a review from tazarov May 15, 2024 09:16

@tazarov (Contributor) commented May 15, 2024:

@andrewblum, the linter runs the following, which you can also execute locally to check:

pre-commit run --all-files trailing-whitespace 
pre-commit run --all-files mixed-line-ending
pre-commit run --all-files end-of-file-fixer
pre-commit run --all-files requirements-txt-fixer
pre-commit run --all-files check-xml
pre-commit run --all-files check-merge-conflict
pre-commit run --all-files check-case-conflict
pre-commit run --all-files check-docstring-first
pre-commit run --all-files black
pre-commit run --all-files flake8
pre-commit run --all-files prettier
pre-commit run --all-files check-yaml

@andrewblum (Author) commented:

Thanks for that; those all return green for me locally, so it should be good to re-run.

@tazarov (Contributor) left a comment:

👍

@jeffchuber mentioned this pull request Sep 15, 2024

@jeffchuber (Contributor) commented:
Our underlying implementation has changed, so this PR is not landable as is.

That being said, we'd still like to add this functionality; that is now tracked in this issue.

@jeffchuber closed this Sep 15, 2024