Add Session History Feature for the Video RAG Chat Interface #253

Status: Open. Wants to merge 25 commits into base: main.

Commits (25):
- 3ea8cb0 Added VideoRAG + time-based search use case (ttrigui, May 29, 2024)
- 5e8bfe0 Added placeholders for prompt processing and UI (ttrigui, May 29, 2024)
- a245751 Cleaned up Readme, requirements and VectorDB; Added env variables for… (avbodas, May 29, 2024)
- 0e7b4aa Merge pull request #1 from avbodas/abdev (ttrigui, May 29, 2024)
- 6ef4527 update README file (ttrigui, May 29, 2024)
- 73ca4b4 add get_history function to retrive messages from session state (bashirmoham, Jun 3, 2024)
- 041199e Add history parameter to the function (bashirmoham, Jun 3, 2024)
- 6dff959 fix instruction for the assistant (bashirmoham, Jun 3, 2024)
- 1148bc5 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Jun 4, 2024)
- 45976c4 Enable autoescaping for Jinja2 to prevent vulnerabilities (bashirmoham, Jun 4, 2024)
- 3ce37aa Merge branch 'session-history' of https://github.com/ttrigui/GenAIExa… (bashirmoham, Jun 4, 2024)
- fa509ba [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Jun 4, 2024)
- 6db5f0c Update prompt_handler.py (bashirmoham, Jun 5, 2024)
- efbfe5f Merge branch 'session-history' of https://github.com/ttrigui/GenAIExa… (bashirmoham, Jun 5, 2024)
- c726b91 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Jun 5, 2024)
- d634b42 enable chat interface (bashirmoham, Jun 12, 2024)
- c9a89ec [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Jun 12, 2024)
- 98aa223 Update README.md (bashirmoham, Jun 12, 2024)
- ec4ce04 Merge branch 'session-history' of https://github.com/ttrigui/GenAIExa… (bashirmoham, Jun 12, 2024)
- e8dee0d [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Jun 12, 2024)
- 74c3260 Update video-rag-ui.py (bashirmoham, Jun 12, 2024)
- a9b12cf [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Jun 12, 2024)
- 392967f Update config.yaml (bashirmoham, Jun 12, 2024)
- 22a5d3f Update config.yaml (bashirmoham, Jun 12, 2024)
- 51b594a Merge branch 'main' into session-history (lvliang-intel, Jun 24, 2024)
104 changes: 104 additions & 0 deletions VideoRAGQnA/README.md
Reviewer comment (Collaborator):
All microservice-related code should be placed in the GenAIComps repo. Only the Docker Compose files, Kubernetes manifests, and UI code need to be stored in the GenAIExamples repo. Please reorganize your code accordingly. Thanks.

@@ -0,0 +1,104 @@
# Video RAG

## Introduction

Video RAG is a framework that retrieves videos based on a user-provided prompt. It uses both video scene descriptions generated by open-source vision models (e.g., Video-LLaMA, Video-LLaVA) as text embeddings and video frames as image embeddings to perform vector similarity search. The solution also supports retrieving additional similar videos without a new prompt (see the example video below).

![Example Video](docs/visual-rag-demo.gif)

## Tools

- **UI**: gradio **or** streamlit
- **Vector Storage**: Chroma DB **or** Intel's VDMS
- **Image Embeddings**: CLIP
- **Text Embeddings**: all-MiniLM-L12-v2
- **RAG Retriever**: Langchain Ensemble Retrieval

## Prerequisites

There are 10 example videos in `video_ingest/videos`, along with descriptions generated by an open-source vision model.
To run Video RAG on your own videos, make sure they match the format below.

## File Structure

```bash
video_ingest
├── scene_description
│ ├── op_10_0320241830.mp4.txt
│ ├── op_1_0320241830.mp4.txt
│ ├── op_19_0320241830.mp4.txt
│ ├── op_21_0320241830.mp4.txt
│ ├── op_24_0320241830.mp4.txt
│ ├── op_31_0320241830.mp4.txt
│ ├── op_47_0320241830.mp4.txt
│ ├── op_5_0320241915.mp4.txt
│ ├── op_DSCF2862_Rendered_001.mp4.txt
│ └── op_DSCF2864_Rendered_006.mp4.txt
└── videos
├── op_10_0320241830.mp4
├── op_1_0320241830.mp4
├── op_19_0320241830.mp4
├── op_21_0320241830.mp4
├── op_24_0320241830.mp4
├── op_31_0320241830.mp4
├── op_47_0320241830.mp4
├── op_5_0320241915.mp4
├── op_DSCF2862_Rendered_001.mp4
└── op_DSCF2864_Rendered_006.mp4
```
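Under the layout above, each clip `X.mp4` in `videos/` pairs with a description file `X.mp4.txt` in `scene_description/`. A minimal sketch of a check for that pairing (a hypothetical helper, not part of this PR) could look like:

```python
def match_descriptions(video_files, description_files):
    """Map each video filename to its expected description file.

    Assumes the naming convention shown above: the description for
    "X.mp4" is named "X.mp4.txt".
    """
    descriptions = set(description_files)
    pairs = {}
    missing = []
    for video in video_files:
        expected = video + ".txt"
        if expected in descriptions:
            pairs[video] = expected
        else:
            missing.append(video)
    return pairs, missing
```

Running this over the two directory listings before ingestion would surface any video that lacks a scene description.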

## Setup and Installation

Install pip requirements

```bash
cd VideoRAGQnA
pip3 install -r docs/requirements.txt
```

The current framework supports both Chroma DB and Intel's VDMS; use either of them.

Run Chroma DB as a Docker container:

```bash
docker run -d -p 8000:8000 chromadb/chroma
```

**or**

Run VDMS as a Docker container:

```bash
docker run -d -p 55555:55555 intellabs/vdms:latest
```

**Note:** If your file structure differs from the one described above, update the paths in `config.yaml`.

Update your choice of database and port in `config.yaml`, then export the required environment variables:

```bash
export VECTORDB_SERVICE_HOST_IP=<ip of host where vector db is running>

export HUGGINGFACEHUB_API_TOKEN='<your HF token>'
```

A HuggingFace Hub API token can be generated [here](https://huggingface.co/login?next=%2Fsettings%2Ftokens).
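The two exported variables are read from the environment at runtime. A sketch of that lookup (the default host here is an assumption for illustration, not the project's actual fallback):

```python
import os


def load_env_config(environ=os.environ):
    # Read the two variables exported in the setup steps above.
    # The "0.0.0.0" default is an assumed fallback for illustration.
    host = environ.get("VECTORDB_SERVICE_HOST_IP", "0.0.0.0")
    token = environ.get("HUGGINGFACEHUB_API_TOKEN")
    if not token:
        raise RuntimeError("HUGGINGFACEHUB_API_TOKEN is not set")
    return host, token
```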

Generate image embeddings and store them in the selected database, specifying the config file location and the video input location:

```bash
python3 embedding/generate_store_embeddings.py docs/config.yaml video_ingest/videos/
```

**Web UI Video RAG - Streamlit**

```bash
streamlit run video-rag-ui.py --server.address 0.0.0.0 --server.port 50055
```

**Web UI Video RAG - Gradio**

```bash
python3 video-rag-ui.py docs/config.yaml True '0.0.0.0' 50055
```
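The Gradio command above passes four positional arguments. A sketch of how such arguments might be parsed (the meaning of the boolean flag is an assumption; the actual handling lives in `video-rag-ui.py`):

```python
def parse_ui_args(argv):
    """Parse the positional arguments shown in the Gradio command above:
    config path, a string boolean flag (assumed to toggle some UI behavior),
    bind address, and port."""
    config_path, flag, host, port = argv
    return config_path, flag == "True", host, int(port)
```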
2 changes: 2 additions & 0 deletions VideoRAGQnA/__init__.py
@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
26 changes: 26 additions & 0 deletions VideoRAGQnA/docs/config.yaml
@@ -0,0 +1,26 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Path to all videos
videos: video_ingest/videos/
# Path to video description generated by open-source vision models (ex. video-llama, video-llava, etc.)
description: video_ingest/scene_description/
# Do you want to extract frames of videos (True if not done already, else False)
generate_frames: True
# Do you want to generate image embeddings?
embed_frames: True
# Path to store extracted frames
image_output_dir: video_ingest/frames/
# Path to store metadata files
meta_output_dir: video_ingest/frame_metadata/
# Number of frames to extract per second;
# e.g., at 24 fps with a value of 2, every 12th frame (the 12th, 24th, ...) is extracted
number_of_frames_per_second: 2

vector_db:
  choice_of_db: 'vdms'  # Supported databases: [vdms, chroma]
host: 0.0.0.0
  port: 55555  # 8000 for chroma

# LLM path
model_path: meta-llama/Llama-2-7b-chat-hf
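The `number_of_frames_per_second` setting drives the sampling interval used during frame extraction. A small sketch of that arithmetic, mirroring the `mod = int(fps // N)` logic in `embedding/extract_store_frames.py`:

```python
def sampled_frame_indices(fps, frames_per_second, total_frames):
    # Every (fps // frames_per_second)-th frame is kept; the interval is
    # clamped to 1 so at least every frame is eligible when fps < setting.
    mod = max(int(fps // frames_per_second), 1)
    return [i for i in range(1, int(total_frames) + 1) if i % mod == 0]
```

For a 24 fps video with the default value of 2, this keeps frames 12, 24, 36, and so on.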
12 changes: 12 additions & 0 deletions VideoRAGQnA/docs/requirements.txt
@@ -0,0 +1,12 @@
accelerate
chromadb
dateparser
gradio
langchain-experimental
metafunctions
open-clip-torch
opencv-python-headless
sentence-transformers
streamlit
tzlocal
vdms
Binary file added VideoRAGQnA/docs/visual-rag-demo.gif
2 changes: 2 additions & 0 deletions VideoRAGQnA/embedding/__init__.py
@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
120 changes: 120 additions & 0 deletions VideoRAGQnA/embedding/extract_store_frames.py
@@ -0,0 +1,120 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import datetime
import json
import os
import random

import cv2
from tzlocal import get_localzone


def process_all_videos(path, image_output_dir, meta_output_dir, N, selected_db):

def extract_frames(video_path, image_output_dir, meta_output_dir, N, date_time, local_timezone):
video = video_path.split("/")[-1]
# Create a directory to store frames and metadata
os.makedirs(image_output_dir, exist_ok=True)
os.makedirs(meta_output_dir, exist_ok=True)

# Open the video file
cap = cv2.VideoCapture(video_path)

if int(cv2.__version__.split(".")[0]) < 3:
fps = cap.get(cv2.cv.CV_CAP_PROP_FPS)
else:
fps = cap.get(cv2.CAP_PROP_FPS)

total_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)

# print (f'fps {fps}')
# print (f'total frames {total_frames}')

mod = int(fps // N)
if mod == 0:
mod = 1

print(f"total frames {total_frames}, N {N}, mod {mod}")

# Variables to track frame count and desired frames
frame_count = 0

# Metadata dictionary to store timestamp and image paths
metadata = {}

while cap.isOpened():
ret, frame = cap.read()

if not ret:
break

frame_count += 1

if frame_count % mod == 0:
timestamp = cap.get(cv2.CAP_PROP_POS_MSEC) / 1000 # Convert milliseconds to seconds
frame_path = os.path.join(image_output_dir, f"{video}_{frame_count}.jpg")
time = date_time.strftime("%H:%M:%S")
date = date_time.strftime("%Y-%m-%d")
hours, minutes, seconds = map(float, time.split(":"))
year, month, day = map(int, date.split("-"))

cv2.imwrite(frame_path, frame) # Save the frame as an image

metadata[frame_count] = {
"timestamp": timestamp,
"frame_path": frame_path,
"date": date,
"year": year,
"month": month,
"day": day,
"time": time,
"hours": hours,
"minutes": minutes,
"seconds": seconds,
}
if selected_db == "vdms":
# Localize the current time to the local timezone of the machine
current_time_local = date_time.replace(tzinfo=datetime.timezone.utc).astimezone(local_timezone)

# Convert the localized time to ISO 8601 format with timezone offset
iso_date_time = current_time_local.isoformat()
metadata[frame_count]["date_time"] = {"_date": str(iso_date_time)}

# Save metadata to a JSON file
metadata_file = os.path.join(meta_output_dir, f"{video}_metadata.json")
with open(metadata_file, "w") as f:
json.dump(metadata, f, indent=4)

# Release the video capture and close all windows
cap.release()
        print(f"{frame_count // mod} frames extracted and metadata saved successfully.")
return fps, total_frames, metadata_file

videos = [file for file in os.listdir(path) if file.endswith(".mp4")]

# print (f'Total {len(videos)} videos will be processed')
metadata = {}

for i, each_video in enumerate(videos):
video_path = os.path.join(path, each_video)
date_time = datetime.datetime.now()
print("date_time : ", date_time)
# Get the local timezone of the machine
local_timezone = get_localzone()
fps, total_frames, metadata_file = extract_frames(
video_path, image_output_dir, meta_output_dir, N, date_time, local_timezone
)
metadata[each_video] = {
"fps": fps,
"total_frames": total_frames,
"extracted_frame_metadata_file": metadata_file,
"embedding_path": f"embeddings/{each_video}.pt",
"video_path": f"{path}/{each_video}",
}
print(f"✅ {i+1}/{len(videos)}")

metadata_file = os.path.join(meta_output_dir, "metadata.json")
with open(metadata_file, "w") as f:
json.dump(metadata, f, indent=4)
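The per-video summary that `process_all_videos` writes to `metadata.json` can be consumed downstream like this (the file name and keys match the code above; the sample values are invented for illustration):

```python
import json

# Sample of the metadata.json structure written by process_all_videos;
# values here are made up, not taken from the actual example videos.
example = json.dumps({
    "op_1_0320241830.mp4": {
        "fps": 24.0,
        "total_frames": 240.0,
        "extracted_frame_metadata_file": "video_ingest/frame_metadata/op_1_0320241830.mp4_metadata.json",
        "embedding_path": "embeddings/op_1_0320241830.mp4.pt",
        "video_path": "video_ingest/videos/op_1_0320241830.mp4",
    }
})

meta = json.loads(example)
# Derive each clip's duration in seconds from the stored fps and frame count.
durations = {v: m["total_frames"] / m["fps"] for v, m in meta.items()}
```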