-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH]: adding capability for Spacy embeddings #2049
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
@Vishnunkumar, this looks good. Thank you. Can you also add a test for this under Did you also create a PR for the docs with links to official docs etc? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are a few points to address:
- Tests
- Docs (unless already done)
- Additional options on the EF:
- can spacy cache models (if yes, add caching dir with a default value)
- can other models be loaded other than
en_code_web_{model}
(if yes, maybe make things more flexible - Are there any other constraints that can/should be applied to the library that can/are useful?
- Is batching supported where multiple docs can be passed in a single round instead of a loop?
Working on the comments, Thanks for the review @tazarov |
|
@Vishnunkumar, looking at spacy docs, I can see that there are many models also, some in other languages - https://spacy.io/models. Also, I see some additional models that trade off accuracy for efficiency. Can we make it so that people can specify those other models as well? e.g., specify the whole model - |
|
Hi @tazarov , I have update with necessary requirments. Please review when you are avaiable |
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good. Thanks @Vishnunkumar.
There are only a couple of minor things to look at.
chromadb/test/ef/test_spacy_ef.py
Outdated
|
||
def test_spacyembddingfunction_throwserror_whenmodel_notfound(): | ||
with pytest.raises(ValueError, | ||
match="""spacy models are not downloaded yet, please download them using `spacy download model_name`, Please checkout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add a portion of this text as match
can do loose matching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved
Hi @tazarov , I will look into these. |
Hi @tazarov, I have updated the changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Can I ask you to run the following to make sure everything is formatted as it should:
pip install -r requirements_dev.txt # install dev deps
pre-commit run --all-files trailing-whitespace
pre-commit run --all-files mixed-line-ending
pre-commit run --all-files end-of-file-fixer
pre-commit run --all-files requirements-txt-fixer
pre-commit run --all-files check-xml
pre-commit run --all-files check-merge-conflict
pre-commit run --all-files check-case-conflict
pre-commit run --all-files check-docstring-first
pre-commit run --all-files black
pre-commit run --all-files flake8
pre-commit run --all-files prettier
pre-commit run --all-files check-yaml
@Vishnunkumar, thanks for sticking with this. I think this is ready to merge now. |
Hey @tazarov , thanks for reviewing and helping me throughout the PR. Always feeling great to contribute to open source communities. |
Hey @tazarov, just wanted to check who will be merging this PR? Also it seems the docs repository has been archived. Is there any other repo that I should look for adding docs. |
@Vishnunkumar, we've recently merged the docs repo here alongside a big refactor. The new docs reside in this repo now under https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com/pages/integrations. Just create a new |
Our underlying impl has changed and so this PR is not landable as is. That being said - we'd still like to add this functionality and that is now tracked in this issue. |
Description of changes
Test plan
Documentation Changes