-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
initial implementation for vectordb #3879
base: main
Are you sure you want to change the base?
initial implementation for vectordb #3879
Conversation
@jackgerrits @ekzhu , wondering if storage should be an abstraction in the core package and only the implementation should go into autogen-ext. Any thoughts? Still an early draft, just looking for some feedback on the design. I think libraries like langchain and llamaindex have similar abstractions which are useful for rag but could have other uses for for agents that do few-shot prompting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lspinheiro for the PR! I've just minor suggestions, others look good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rebase the PR branch to main? right now there are too many merge conflicts
22cb336
to
7da80e9
Compare
22cb336
to
4b886b0
Compare
@ekzhu , small nudge. This should be ready for review now. |
8bb13f0
to
b3f0672
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @lspinheiro , LGTM!
"""Define Document according to autogen 0.4 specifications.""" | ||
|
||
id: ItemID | ||
content: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why assuming string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same representation used in 0.2 and I believe it assumes the content has already been preprocessed to a string format. More complex data types can lead to more complex operations in general. Would you suggest something different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the vector store can have a payload object, which is usually a json. Unless you know the schema mapping returning string is a problem, maybe we can use dictionary or some other model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is meant to capture the full payload but only the main text content of the associated document and the transformation between autogen doc and vectordb doc is handled in each implementation. The pydantic model is defined with arbitrary_types_allowed
which should allow implementations to unpack additional attributes as needed at the document level instead of the content. Would that work for the scenario you have in mind?
type: str = "" | ||
embedding_function: Optional[Callable[..., Any]] = None # embeddings = embedding_function(sentences) | ||
|
||
async def create_collection(self, collection_name: str, overwrite: bool = False, get_or_create: bool = True) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some vector store requires schema to create collection and some other parameters, may using variable args here can help generalise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a great callout. Checked the old code and those additional arguments were being passed in through the class constructor. Not the best design. Pushing a change as soon as the checks complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be updated now
""" | ||
... | ||
|
||
def update_docs(self, docs: List[Document], collection_name: Optional[str] = None, **kwargs: Any) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the most used method is upsert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the protocol, my understanding is that the idea is to keep it closer in design the DB theory and enforce CRUD like operations. Implementations can add upsert but they probably shouldn't be forced to
Why are these changes needed?
Related issue number
Checks