RFC for MM-RAG #49

Open · wants to merge 2 commits into main

Conversation

tileintel commented:

We are submitting our RFC for Multimodal-RAG based Visual QnA.
@ftian1 @hshen14: Could you please provide feedback?


The proposed architecture involves the creation of two megaservices.
- The first megaservice functions as the core pipeline, comprising four microservices: embedding, retriever, reranking, and LVLM. This megaservice exposes a MMRagBasedVisualQnAGateway, allowing users to query the system via the `/v1/mmrag_visual_qna` endpoint.
- The second megaservice manages user data storage in VectorStore and is composed of a single microservice, embedding. This megaservice provides a MMRagDataIngestionGateway, enabling user access through the `/v1/mmrag_data_ingestion` endpoint.
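
For illustration, here is a minimal client-side sketch of how the two gateways might be called with Python's `requests`. Only the endpoint paths come from the RFC; the host/port and the request/response payloads are assumptions.

```python
import requests

# Hypothetical host/port; only the endpoint paths are taken from the RFC.
BASE_URL = "http://localhost:8888"

# Ingest a document through the MMRagDataIngestionGateway.
# The payload shape (text plus an image reference) is an assumption.
ingest_resp = requests.post(
    f"{BASE_URL}/v1/mmrag_data_ingestion",
    json={"text": "A cat sitting on a couch", "image_url": "http://example.com/cat.jpg"},
)
print(ingest_resp.status_code)

# Query the system through the MMRagBasedVisualQnAGateway.
qna_resp = requests.post(
    f"{BASE_URL}/v1/mmrag_visual_qna",
    json={"query": "What is the cat doing?"},
)
print(qna_resp.json())
```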
Contributor commented:

Can this be an enhanced microservice based on the existing data ingestion?

Author (@tileintel) replied on Jul 23, 2024:

@hshen14 Thanks for your comments. The short answer is yes. We will enhance/reuse the existing data ingestion microservice. The current data ingestion microservice supports only TextDoc. In our proposal, we will extend its interface to accept MultimodalDoc, which can be a TextDoc, an ImageDoc, an ImageTextPairDoc, etc. If the input is of type TextDoc, we will divert execution to the current microservice/functions.
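
For illustration, a minimal sketch of the type-based dispatch described above. The pydantic models below stand in for the proposed MultimodalDoc data classes (their field names are assumptions), and the handler functions are hypothetical placeholders.

```python
from typing import Union

from pydantic import BaseModel


# Stand-ins for the proposed data classes; field names are assumptions.
class TextDoc(BaseModel):
    text: str


class ImageDoc(BaseModel):
    image_url: str


class ImageTextPairDoc(BaseModel):
    text: str
    image_url: str


# MultimodalDoc as a union of the supported document types.
MultimodalDoc = Union[TextDoc, ImageDoc, ImageTextPairDoc]


def ingest_text_doc(doc: TextDoc) -> None:
    print(f"Existing text-only ingestion path: {doc.text}")


def ingest_multimodal_doc(doc: MultimodalDoc) -> None:
    print(f"New multimodal ingestion path: {doc}")


def ingest(doc: MultimodalDoc) -> None:
    # Divert TextDoc inputs to the existing microservice logic; everything
    # else goes through the new multimodal path, as described in the reply.
    if isinstance(doc, TextDoc):
        ingest_text_doc(doc)
    else:
        ingest_multimodal_doc(doc)


ingest(TextDoc(text="hello"))
ingest(ImageTextPairDoc(text="a cat", image_url="http://example.com/cat.jpg"))
```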

The proposed architecture involves the creation of three megaservices.
- The first megaservice functions as the core pipeline, comprising four microservices: embedding, retriever, reranking, and LVLM. This megaservice exposes a MMRagBasedVisualQnAGateway, allowing users to query the system via the `/v1/mmrag_visual_qna` endpoint.
- The second megaservice manages user data storage in VectorStore and is composed of a single microservice, embedding. This megaservice provides a MMRagDataIngestionGateway, enabling user access through the `/v1/mmrag_data_ingestion` endpoint.
- The third megaservice functions as a helper to extract a list of frame-transcript pairs from videos, using audio-to-text models for transcription or an LVLM (e.g., BLIP-2, LLaVA) for captioning. This megaservice is composed of two microservices: transcription and LVLM. It provides a MMRagVideoprepGateway, enabling user access through the `/v1/mmrag_video_prep` endpoint.
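
For illustration, a sketch of how the video-prep gateway might be invoked; only the endpoint path comes from the RFC, while the host/port, payload, and response shape are assumptions.

```python
import requests

# Hypothetical host/port and payload; only the endpoint path is from the RFC.
resp = requests.post(
    "http://localhost:8889/v1/mmrag_video_prep",
    json={"video_url": "http://example.com/demo.mp4", "mode": "both"},
)

# Assumed response shape: one entry per extracted frame, e.g.
# [{"frame_id": 0, "timestamp": 0.0, "transcript": "...", "caption": "..."}, ...]
print(resp.json())
```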
Contributor commented:

You mentioned either audio-to-text models for transcription or an LVLM for captioning, while you also mentioned they need to be composed. If composing is not mandatory, does it make sense to make it a microservice?

Author (@tileintel) replied:

@hshen14 Thanks for your comment. We were proposing transcription and LVLM each being a microservice. The composition is not mandatory. We believe that when we ingest a video, it is better to include both the frames' transcripts and the frames' captions as metadata for inference (LVLM) after retrieval. However, the megaservice will offer options for the user to choose transcript only, caption only, or both. The composition here is optional. Hope this is clear.
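
A minimal server-side sketch of the optional composition described in this reply; the `mode` values and function names are assumptions, and the two helper functions are placeholders for the transcription and LVLM captioning microservices.

```python
from typing import Dict, List


def transcribe_frames(video_path: str) -> List[str]:
    """Placeholder for the transcription (audio-to-text) microservice."""
    return ["transcript for frame 0", "transcript for frame 1"]


def caption_frames(video_path: str) -> List[str]:
    """Placeholder for the LVLM captioning microservice."""
    return ["caption for frame 0", "caption for frame 1"]


def prepare_video(video_path: str, mode: str = "both") -> Dict[str, List[str]]:
    # Compose the two microservices only when the user asks for both outputs;
    # otherwise call just one of them, as described above.
    result: Dict[str, List[str]] = {}
    if mode in ("transcript", "both"):
        result["transcripts"] = transcribe_frames(video_path)
    if mode in ("caption", "both"):
        result["captions"] = caption_frames(video_path)
    return result


print(prepare_video("demo.mp4", mode="both"))
```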

#### 2.1 Embeddings
- Interface `MultimodalEmbeddings` that extends `langchain_core.embeddings.Embeddings` with an abstract method:
```python
embed_multimodal_document(self, doc: MultimodalDoc) -> List[float]
```
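
For concreteness, a hedged sketch of what the proposed interface and a trivial implementation might look like; everything beyond the `embed_multimodal_document` signature quoted above (class names, vector size, the dummy logic) is an assumption, not the actual OPEA implementation.

```python
from abc import abstractmethod
from typing import List

from langchain_core.embeddings import Embeddings


class MultimodalEmbeddings(Embeddings):
    """Sketch of the proposed interface extending langchain_core Embeddings."""

    @abstractmethod
    def embed_multimodal_document(self, doc: "MultimodalDoc") -> List[float]:
        """Return a single embedding vector for a multimodal document."""


class DummyMultimodalEmbeddings(MultimodalEmbeddings):
    """Placeholder implementation used only to illustrate the call shape."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[0.0] * 512 for _ in texts]

    def embed_query(self, text: str) -> List[float]:
        return [0.0] * 512

    def embed_multimodal_document(self, doc: "MultimodalDoc") -> List[float]:
        # "MultimodalDoc" refers to the RFC's data class and is not defined in
        # this sketch. A real implementation would fuse text and image features
        # here (e.g., with a multimodal encoder); this dummy returns zeros.
        return [0.0] * 512
```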
Collaborator commented:

The overall RFC looks good to me; just a minor comment here. I think this interface is an implementation-specific one, right? My point is that you don't need to mention it; you just need to tell the user what the standard input and output are, as you did in the Data Classes section. That's enough.

Collaborator commented:

The RFC file naming convention follows this rule: `yy-mm-dd-[OPEA Project Name]-[index]-title.md`

For example, `24-04-29-GenAIExamples-001-Using_MicroService_to_implement_ChatQnA.md`.
