RFC for MM-RAG #49

173 changes: 173 additions & 0 deletions community/rfcs/MM-RAG-RFG.md
**Review comment (Collaborator):**
The RFC file naming convention follows this rule: yy-mm-dd-[OPEA Project Name]-[index]-title.md

For example, 24-04-29-GenAIExamples-001-Using_MicroService_to_implement_ChatQnA.md

**Review comment (Collaborator):**
@tileintel could you pls update this file name to be consistent with other PRs?

@@ -0,0 +1,173 @@
# Multimodal Retrieval-Augmented-Generation (MM-RAG)

## Author

[Tiep Le](https://github.com/tileintel),
[Siddhi Velankar](https://github.com/siddhivelankar23),
[Siddhant Jagtap](https://github.com/sjagtap1803),
[Mustafa Cetin](https://github.com/MSCetin37)

## Status

Under review

## Objective

This RFC aims to introduce a Multimodal Retrieval-Augmented Generation (MM-RAG) architecture. This architecture advances RAG technology by leveraging multimodal data, including but not limited to video, image, and text, for in-context learning on complex tasks. Adopting OPEA's microservice approach, the architecture is flexible enough for Enterprise AI applications that require RAG over multimodal data. The architecture is general; MM-RAG based Visual Question and Answering is the first example application we target.

## Motivation

Enterprises use multimodal data (e.g., image, video, text, audio, slides, tabular data, etc.) in their daily work. Unfortunately, existing applications in OPEA cannot handle multimodal data flexibly and productively. In more detail, we observe that:
- The existing ChatQnA application, though featuring a (text-based) RAG architecture, cannot handle multimodal data (e.g., image, video), and thus lacks the ability to leverage such data.
- The existing VisualQnA application, though accepting both image and text as input, produces responses based solely on the input it receives, and thus lacks the ability to use in-context learning for complex tasks.
- The current OPEA framework lacks microservices (i.e., embedding, retrieval, and reranking) that can accept multimodal data.
- Existing RAG-related microservices in the current OPEA framework mainly rely on the LangChain framework and therefore follow LangChain interfaces, which are mainly suitable for text data. As a result, the OPEA framework lacks interfaces for multimodal data such as images, texts, and image-text pairs.

All of these limitations prevent enterprises from developing AI applications that can comprehend multimodal data, for instance, applications that answer questions about insights contained in videos and images.

MM-RAG is critical for OPEA since it will:
1. define new interfaces for OPEA's microservices to work with multimodal data. Such interfaces will not only extend LangChain interfaces but also make use of existing interfaces for multimodal data in popular frameworks such as Hugging Face, OpenAI, etc.
2. introduce new microservices that can handle multimodal data.
3. enable new AI applications that enterprises need.

## Design Proposal
This RFC proposes a Multimodal Retrieval-Augmented Generation (MM-RAG) architecture detailed in the following figure.

![MM-RAG Architecture](./assets/mm-rag-architecture.png)

The proposed architecture involves the creation of the following megaservices.
- The first megaservice functions as the core pipeline, comprising four microservices: embedding, retriever, reranking, and LVLM. This megaservice exposes an MMRagBasedVisualQnAGateway, allowing users to query the system via the `/v1/mmrag_visual_qna` endpoint.
- The second megaservice manages user data storage in the VectorStore and is composed of a single microservice, embedding. This megaservice provides an MMRagDataIngestionGateway, enabling user access through the `/v1/mmrag_data_ingestion` endpoint.
**Review comment (Contributor):**
Can this be an enhanced microservice based on the existing data ingestion?

**Author reply (@tileintel, Jul 23, 2024):**
@hshen14 Thanks for your comments. The simple answer is yes. We will enhance/reuse the existing data ingestion microservice. The current data ingestion microservice supports only TextDoc. In our proposal, we will enhance its interface to accept MultimodalDoc, which can be a TextDoc, ImageDoc, ImageTextPairDoc, etc. If the input is of type TextDoc, we will divert execution to the current microservice/functions.

- The third megaservice functions as a helper that extracts a list of frame-transcript pairs from videos, using audio-to-text models for transcription or LVLM models (e.g., BLIP-2, LLaVA) for captioning. This megaservice is composed of two microservices: transcription and LVLM. It provides an MMRagVideoprepGateway, enabling user access through the `/v1/mmrag_video_prep` endpoint.
**Review comment (Contributor):**
You mentioned either audio-to-text models for transcription or an LVLM model for captioning, while you also mentioned they need to be composed. If composing is not mandatory, does it make sense to make each one a microservice?

**Author reply:**
@hshen14 Thanks for your comment. We are proposing that transcription and LVLM each be a microservice. The composition is not mandatory. We believe that when we ingest a video, it is better to include both the frames' transcripts and the frames' captions as metadata for inference (LVLM) after retrieval. However, in the megaservice we will offer different options so users can choose whether they want transcripts only, captions only, or both. The composition here is optional. Hope this is clear.

- **Optional:** The fourth megaservice functions as a helper that extracts a list of image-caption pairs from documents (e.g., research papers and journals). This megaservice provides an MMRagDocprepGateway, enabling user access through the `/v1/mmrag_doc_prep` endpoint.


To guarantee the reusability and extensibility of this architecture for other applications in the future, we propose the following.

### 1. Data Classes

MM-RAG introduces new data classes in `comps/cores/proto/docarray` (an illustrative sketch follows the list):
- `ImageDoc(BaseDoc)` (cf. docarray.documents.ImageDoc),
- `VideoDoc(BaseDoc)` (cf. docarray.documents.VideoDoc),
- `AudioDoc(BaseDoc)` (cf. docarray.documents.AudioDoc),
- `TextImageDoc(BaseDoc)`, which contains the information for an image-text pair,
- `MultimodalDoc = Union[TextDoc, ImageDoc, TextImageDoc]`,
- `SearchedMultimodalDoc`, which is similar to `SearchedDoc` but for multimodal data.
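
For illustration, the classes above could be declared on top of docarray roughly as follows. Since `MultimodalDoc` is a `Union`, this sketch derives `SearchedMultimodalDoc` from `BaseDoc`, and any field names not listed above (e.g., `initial_query`, `retrieved_docs`) are assumptions rather than part of the proposal:
```python
# Illustrative sketch of the proposed data classes on top of docarray;
# field names beyond those listed above are assumptions, not part of the proposal.
from typing import Union

from docarray import BaseDoc, DocList
from docarray.documents import AudioDoc, ImageDoc, VideoDoc  # reused as-is


class TextDoc(BaseDoc):
    text: str


class TextImageDoc(BaseDoc):
    """An image-text pair, e.g. a video frame plus its transcript or caption."""
    image: ImageDoc
    text: TextDoc


# A multimodal document is any one of the supported modalities.
MultimodalDoc = Union[TextDoc, ImageDoc, TextImageDoc]


class SearchedMultimodalDoc(BaseDoc):
    """Analogous to SearchedDoc, but carrying multimodal retrieval results."""
    initial_query: str
    retrieved_docs: DocList[TextImageDoc]  # could also hold other MultimodalDoc types
```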

### 2. Interfaces
MM-RAG defines new interfaces used in OPEA's microservices:
#### 2.1 Embeddings
- Interface `MultimodalEmbeddings`, which extends langchain_core.embeddings.Embeddings with an abstract method:
```python
embed_multimodal_document(self, doc: MultimodalDoc) -> List[float]
```
Any class that implements this interface and serves as an embedder for multimodal data must implement how to produce the embedding of a multimodal document.

**Review comment (Collaborator):** The overall RFC is good to me; just a minor comment here. I think this interface is an implementation-specific one, right? My point is that you don't need to mention it; you just need to tell users what the standard input and output are, as you did in the Data Classes section, and that's enough.
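
As an illustration only, a concrete embedder satisfying this interface might look like the following sketch; the `ExampleMultimodalEmbedder` class and its `encode_*` helper calls are hypothetical and not part of the RFC:
```python
# Illustrative sketch; the ExampleMultimodalEmbedder class and its encode_* calls
# are hypothetical and only show how the interface could be satisfied.
from abc import abstractmethod
from typing import List

from langchain_core.embeddings import Embeddings


class MultimodalEmbeddings(Embeddings):
    """Proposed interface: langchain Embeddings plus a multimodal embedding method."""

    @abstractmethod
    def embed_multimodal_document(self, doc: "MultimodalDoc") -> List[float]:
        """Return the embedding vector of a multimodal document."""


class ExampleMultimodalEmbedder(MultimodalEmbeddings):
    """Hypothetical embedder wrapping a joint image-text embedding model."""

    def __init__(self, model):
        self.model = model  # model loading is elided in this sketch

    def embed_multimodal_document(self, doc) -> List[float]:
        # Dispatch on the concrete document type.
        if hasattr(doc, "image") and hasattr(doc, "text"):      # TextImageDoc
            return self.model.encode_pair(doc.image, doc.text)  # hypothetical call
        if hasattr(doc, "image"):                                # ImageDoc
            return self.model.encode_image(doc.image)            # hypothetical call
        return self.model.encode_text(doc.text)                  # hypothetical call

    # langchain's Embeddings also requires the text-only methods.
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.model.encode_text(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        return self.model.encode_text(text)
```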

#### 2.2. Vectorstore & Retriever
- Interface `MultimodalVectorstore`, which extends langchain_core.vectorstores.VectorStore with an abstract method:
```python
add_multimodal_docs(self, docs: DocList[MultimodalDoc]) -> List[str]
```
Any class that implements this interface and serves as an API to a vector store must implement how to add multimodal data to the store (i.e., compute the embedding and then add it to the vector store).

The original abstract class langchain_core.vectorstores.VectorStore already implements a method, `as_retriever`, for turning itself into a retriever. Thus, we do not need a new interface for retrieval.
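
For illustration, the proposed interface and the ingestion flow a concrete implementation (e.g., one backed by Redis) could follow are sketched below; the `_store_vector` helper is hypothetical:
```python
# Illustrative sketch of the proposed interface; not a concrete implementation.
from abc import abstractmethod
from typing import List

from langchain_core.vectorstores import VectorStore


class MultimodalVectorstore(VectorStore):
    """Extends VectorStore so that multimodal documents can be ingested."""

    @abstractmethod
    def add_multimodal_docs(self, docs: "DocList[MultimodalDoc]") -> List[str]:
        """Embed each multimodal document, add it to the store, and return the new ids."""


# A concrete subclass (e.g., one backed by Redis) would roughly do the following
# inside add_multimodal_docs; `embedder` implements MultimodalEmbeddings (Section 2.1)
# and `_store_vector` is a hypothetical helper of the concrete store:
#
#     ids = []
#     for doc in docs:
#         vector = embedder.embed_multimodal_document(doc)
#         ids.append(self._store_vector(vector, metadata=doc.dict()))
#     return ids
#
# Retrieval then comes for free via the inherited `as_retriever()` method.
```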

#### 2.3 Reranks
As we target the MM-RAG based Visual Question and Answering application first, for now we can leverage LLM or VLM interfaces for reranking. Thus, we do not need to propose a new interface for reranks.

#### 2.4 VLMs
As we target the MM-RAG based Visual Question and Answering application first, for now we can leverage existing VLM interfaces.

### 3. Microservices
#### 3.1 Embedding Microservice
Signature of the embedding microservice:
```python
@traceable(run_type="embedding")
@register_statistics(names=["opea_service@embedding_multimodal_langchain"])
def embedding(input: MultimodalDoc) -> EmbedDoc1024:
```
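
A minimal sketch of what the body of this microservice could look like is shown below; the import paths and the `EmbedDoc1024` field names (`text`, `embedding`) are assumptions modeled on the existing text embedding microservice:
```python
# Illustrative sketch; import paths, EmbedDoc1024 fields, and the embedder object
# are assumptions modeled on the existing GenAIComps embedding microservice.
from comps import EmbedDoc1024, MultimodalDoc, register_statistics  # assumed exports
from langsmith import traceable

embedder = ...  # an implementation of MultimodalEmbeddings (Section 2.1)


@traceable(run_type="embedding")
@register_statistics(names=["opea_service@embedding_multimodal_langchain"])
def embedding(input: MultimodalDoc) -> EmbedDoc1024:
    # One joint embedding is produced regardless of the input modality.
    vector = embedder.embed_multimodal_document(input)
    # Keep the raw text (if any) so downstream services can build prompts.
    text = getattr(input, "text", "")
    return EmbedDoc1024(text=text, embedding=vector)
```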

#### 3.2 Retriever Microservice
Signature of the retriever microservice:
```python
@traceable(run_type="retriever")
@register_statistics(names=["opea_service@multimodal_retriever_redis"])
def retrieve(input: EmbedDoc1024) -> SearchedMultimodalDoc:
```
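
A minimal sketch of the retriever body follows; the vector-store client, the import paths, and the `SearchedMultimodalDoc` field names are assumptions for illustration:
```python
# Illustrative sketch; import paths and the vector-store client are assumptions.
from comps import EmbedDoc1024, SearchedMultimodalDoc, register_statistics  # assumed exports
from langsmith import traceable

vector_store = ...  # a MultimodalVectorstore instance (Section 2.2), e.g. backed by Redis


@traceable(run_type="retriever")
@register_statistics(names=["opea_service@multimodal_retriever_redis"])
def retrieve(input: EmbedDoc1024) -> SearchedMultimodalDoc:
    # Look up the nearest multimodal documents using the pre-computed query embedding.
    docs = vector_store.similarity_search_by_vector(embedding=input.embedding, k=4)
    return SearchedMultimodalDoc(initial_query=input.text, retrieved_docs=docs)
```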

#### 3.3 Rerank Microservice
**Important Note:** The rerank microservice not only reranks the retrieved multimodal documents but also helps determine whether the output should include the image from the input query or the retrieved image, but not both. This accommodates the LVLM model, because the majority of LVLM models (e.g., LLaVA) were pretrained for tasks with only one input image.

Signature of the rerank microservice:
```python
@traceable(run_type="vlm")
@register_statistics(names=["opea_service@reranking_multimodal_data"])
def reranking(input: SearchedMultimodalDoc) -> LVMDoc:
```
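
A minimal sketch of the rerank logic described above follows; the `LVMDoc` field names and the optional `query_image` field are assumptions for illustration:
```python
# Illustrative sketch; LVMDoc fields (image, prompt) and import paths are assumptions
# modeled on the existing LVM microservice.
from comps import LVMDoc, SearchedMultimodalDoc, register_statistics  # assumed exports
from langsmith import traceable


@traceable(run_type="vlm")
@register_statistics(names=["opea_service@reranking_multimodal_data"])
def reranking(input: SearchedMultimodalDoc) -> LVMDoc:
    # Rank the retrieved documents; this sketch simply keeps the top hit.
    best = input.retrieved_docs[0]
    # Most LVLMs (e.g., LLaVA) were pretrained with a single input image, so pass
    # either the image attached to the query or the retrieved image, never both.
    query_image = getattr(input, "query_image", None)  # hypothetical optional field
    image = query_image if query_image is not None else best.image
    # best.text is assumed to be the transcript/caption paired with the retrieved image.
    prompt = f"{best.text}\nAnswer the user query: {input.initial_query}"
    return LVMDoc(image=image, prompt=prompt)
```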

#### 3.4 LVLM Microservice
We can use the existing LVLM microservice from OPEA in `GenAIComps/comps/lvms/lvm.py`.

#### 3.5 Transcription Generation Microservice
This microservice generates a transcript for input audio using an audio-to-text model.
Signature of the transcription generation microservice:
```python
@traceable(run_type="tool")
@register_statistics(names=["opea_service@transcript_generation"])
def transcript_generation(input: AudioDoc) -> str:
```
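
A minimal sketch follows; Whisper is used purely as an example audio-to-text model, and the import paths are assumptions consistent with the other sketches above:
```python
# Illustrative sketch; Whisper is used here only as an example audio-to-text model,
# and import paths are assumptions following the other microservices above.
from comps import AudioDoc, register_statistics  # assumed exports
from langsmith import traceable
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")


@traceable(run_type="tool")
@register_statistics(names=["opea_service@transcript_generation"])
def transcript_generation(input: AudioDoc) -> str:
    # AudioDoc.url is assumed to point at the extracted audio of a video segment.
    result = asr(str(input.url))
    return result["text"]
```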

## Use-case Stories

### 1. Multimodal RAG on Videos

**Main contributor(s):** Tiep Le, Siddhi Velankar, and Siddhant Jagtap.

In this story, a user has a large video corpus (possibly including multiple videos), and she wants to ask questions about it, chat with it, and get insights from all the information in the corpus.

Initially, the list of frames and their corresponding transcripts or captions will be extracted/generated from the video corpus using the megaservice `mmrag_video_prep`. Such frame and transcript/caption pairs will be ingested into a vector store by the megaservice `mmrag_data_ingestion`.

Once the user inputs a text query, the MM-RAG system (i.e., the megaservice `mmrag_visual_qna`) retrieves from the vector store the video segment best suited to answering the query. The context of the retrieved video segment, augmented with the user's query, is then passed to the LVLM to answer the query.

The user can have a query/question where the transcript of the video helps answer it. For example:

![question_on_transcript](./assets/question_for_transcripts.png)

As a chatbot, the user can ask follow-up questions. For example:

![question_follow_up](./assets/followup-question.png)

The user can have a query/question where the visual cues of the video help answer it. For example:

![question_on_visual_cues](./assets/question_on_visua_cues.png)

### 2. Multimodal RAG for Medical Domain
**Main contributor(s):** Mustafa Cetin

For more detailed information, please check out the corresponding RFC for Multimodal RAG for the Medical Domain [here](https://github.com/opea-project/docs/pull/31/files).

This scenario seeks to leverage accumulated domain-expert knowledge in the medical domain in an offline environment and to minimize expert involvement, thereby making the system both efficient and user-friendly for medical professionals and researchers.

Initially, image-text pairs, in which the text can include the caption, the description, and/or the diagnosis for what appears in the image, (1) can be extracted from medical reports and medical research papers/journals using the megaservice `mmrag_doc_prep`, and/or (2) can be elicited from medical doctors. Such image-text pairs will then be ingested into a vector store by the megaservice `mmrag_data_ingestion`.

Once a user enters a query (image and/or text), MM-RAG searches the vector store for similar images or texts. These similar images and texts are then reranked by the reranking microservice to identify which additional information (retrieved image and/or text) is most relevant to the user's query. This additional information, augmented with the user's query, is passed to the LVLM to answer the query.

**Example 1**
In this scenario, the user aims to interpret a Contrast-Enhanced Spectral Mammography (CESM) image. Initially, the system searches the database for similar images and retrieves relevant diagnostic information associated with those images. This data is then used to construct a comprehensive prompt for the Large Vision Language Model (LVLM). The LVLM processes this prompt and generates a diagnostic interpretation, which is subsequently delivered to the user. This result not only provides an immediate analysis but also includes recommendations for further evaluation or treatment as necessary, ensuring a thorough and informed diagnostic process. Additionally, by running the data and the model on local systems equipped with robust security measures, we enhance the privacy and security of the information, aligning with stringent health data protection standards such as HIPAA and GDPR. This approach minimizes the need for expert involvement, thereby streamlining the analysis and potentially reducing operational costs without compromising diagnostic accuracy.

![Example1](./assets/case_1.png)

**Example 2**
In this scenario, the user specifies certain conditions, such as specific diagnostic features or patient demographics, and requests to view CESM images that meet these criteria. Initially, the system searches the database for diagnoses that match these conditions and retrieves the relevant images. These images can either be returned directly to the user or used as inputs for the Large Multimodal Model (LMM). If the latter option is chosen, the LMM analyzes the images to further validate their relevance to the specified conditions. Additionally, the system provides a justification for why these particular images were selected, ensuring transparency and enhancing user understanding of the process. This approach minimizes the need for expert involvement, thereby streamlining the analysis and potentially reducing operational costs.

![Example2](./assets/case_2.png)

### 3. Video RAG QnA
**Main contributor(s):** Tahani Trigui & Sharath Raghava

This application has similar functionality to the `Multimodal RAG on Videos` use case above, but it takes a different approach. In particular, it uses embeddings of short video segments rather than the extracted frame/text pairs used in `Multimodal RAG on Videos`.

For more information, please refer to the pull request [here](https://github.com/ttrigui/GenAIExamples/pull/5).

## Alternative Considerations
- Possibly, we will need two megaservices rather than a single `mmrag_visual_qna`, one for each of the two use-case stories, due to the differences between them: (1) using different LVLM models (LLaVA vs. LLaVA-Med) results in different prompt preparation; and (2) the inputs differ (text only vs. both text and image).

Binary file added community/rfcs/assets/case_1.png
Binary file added community/rfcs/assets/case_2.png
Binary file added community/rfcs/assets/followup-question.png
Binary file added community/rfcs/assets/mm-rag-architecture.png