Add Ray version for multi file process (#119)
* add ray version document to redis

Signed-off-by: Chendi Xue <[email protected]>

* update test

Signed-off-by: Chendi Xue <[email protected]>

* Add test

Signed-off-by: Chendi Xue <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add TIMEOUT in container environment and return status

Signed-off-by: Chendi Xue <[email protected]>

* rebase on new folder layout

Signed-off-by: Chendi Xue <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chendi Xue <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
xuechendi and pre-commit-ci[bot] authored Jun 14, 2024
1 parent cd91cfc commit 40c1aaa
Showing 10 changed files with 667 additions and 5 deletions.
69 changes: 65 additions & 4 deletions comps/dataprep/redis/README.md
@@ -1,13 +1,15 @@
# Dataprep Microservice with Redis

For dataprep microservice, we provide two frameworks: `Langchain` and `LlamaIndex`.
For the dataprep microservice, we provide two frameworks: `Langchain` and `LlamaIndex`. We also provide `Langchain_ray`, which uses Ray to parallelize data prep across multiple files (an observed 5x - 15x speedup when processing 1000 files/links).
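
As a rough illustration of the pattern `Langchain_ray` relies on (a sketch of the idea only, not the service's actual code; `process_one_file` is a hypothetical helper), Ray fans per-file work out across worker processes:

```python
import ray

ray.init()


@ray.remote
def process_one_file(path: str) -> str:
    # Hypothetical per-file work: load, split, embed, and write to Redis.
    return f"processed {path}"


# Submit every file at once; Ray schedules the tasks across available CPU cores.
futures = [process_one_file.remote(p) for p in ["a.pdf", "b.pdf", "c.pdf"]]
print(ray.get(futures))
```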

We organized these two folders in the same way, so you can use either framework for the dataprep microservice by following the instructions below.

# 🚀1. Start Microservice with Python (Option 1)

## 1.1 Install Requirements

- Option 1: Install the single-process version (for processing 1-10 files)

```bash
# for langchain
cd langchain
@@ -16,6 +18,12 @@ cd llama_index
pip install -r requirements.txt
```

- Option 2: Install the multi-process version (for processing more than 10 files)

```bash
cd langchain_ray; pip install -r requirements_ray.txt
```

## 1.2 Start Redis Stack Server

Please refer to this [readme](../../../vectorstores/langchain/redis/README.md).
@@ -34,10 +42,18 @@ export LANGCHAIN_PROJECT="opea/gen-ai-comps:dataprep"

Start the document preparation microservice for Redis with one of the commands below.

- Option 1: Start the single-process version (for processing 1-10 files)

```bash
python prepare_doc_redis.py
```

- Option 2: Start the multi-process version (for processing more than 10 files)

```bash
python prepare_doc_redis_on_ray.py
```

# 🚀2. Start Microservice with Docker (Option 2)

## 2.1 Start Redis Stack Server
@@ -58,6 +74,8 @@ export LANGCHAIN_PROJECT="opea/dataprep"

- Build docker image with langchain

- Option 1: Build the single-process image (for processing 1-10 files)

```bash
cd ../../../../
docker build -t opea/dataprep-redis:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/redis/langchain/docker/Dockerfile .
@@ -70,13 +88,28 @@ cd ../../../../
docker build -t opea/dataprep-redis:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/redis/llama_index/docker/Dockerfile .
```

- Option 2: Build the multi-process image (for processing more than 10 files)

```bash
cd ../../../../
docker build -t opea/dataprep-on-ray-redis:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/redis/langchain_ray/docker/Dockerfile .
```

## 2.4 Run Docker with CLI (Option A)

- Option 1: Start the single-process version (for processing 1-10 files)

```bash
docker run -d --name="dataprep-redis-server" -p 6007:6007 --runtime=runc --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e REDIS_URL=$REDIS_URL -e INDEX_NAME=$INDEX_NAME -e TEI_ENDPOINT=$TEI_ENDPOINT opea/dataprep-redis:latest
```

- Option 2: Start the multi-process version (for processing more than 10 files)

```bash
docker run -d --name="dataprep-redis-server" -p 6007:6007 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e REDIS_URL=$REDIS_URL -e INDEX_NAME=$INDEX_NAME -e TEI_ENDPOINT=$TEI_ENDPOINT opea/dataprep-redis:latest
docker run -d --name="dataprep-redis-server" -p 6007:6007 --runtime=runc --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e REDIS_URL=$REDIS_URL -e INDEX_NAME=$INDEX_NAME -e TEI_ENDPOINT=$TEI_ENDPOINT -e TIMEOUT_SECONDS=600 opea/dataprep-on-ray-redis:latest
```

## 2.5 Run with Docker Compose (Option B)
## 2.5 Run with Docker Compose (Option B - deprecated; will be moved to GenAIExamples in the future)

```bash
# for langchain
@@ -86,7 +119,13 @@ cd comps/dataprep/redis/llama_index/docker
docker compose -f docker-compose-dataprep-redis.yaml up -d
```

# 🚀3. Consume Microservice
# 🚀3. Check Microservice Status

```bash
docker container logs -f dataprep-redis-server
```

# 🚀4. Consume Microservice

Once the document preparation microservice for Redis is started, users can use the command below to invoke the microservice, which converts a document into embeddings and saves them to the database.

@@ -96,3 +135,25 @@ curl -X POST \
-d '{"path":"/path/to/document"}' \
http://localhost:6007/v1/dataprep
```

or

```python
import requests
import json

proxies = {"http": ""}
url = "http://localhost:6007/v1/dataprep"
urls = [
"https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
]
payload = {"link_list": json.dumps(urls)}

try:
resp = requests.post(url=url, data=payload, proxies=proxies)
print(resp.text)
resp.raise_for_status() # Raise an exception for unsuccessful HTTP status codes
print("Request successful!")
except requests.exceptions.RequestException as e:
print("An error occurred:", e)
```
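
For local documents, the same endpoint can be invoked from Python with the `path` field shown in the earlier curl example (a sketch only; `/path/to/document` is a placeholder for a path the running service can actually read):

```python
import requests

url = "http://localhost:6007/v1/dataprep"
# Send the same JSON payload as the curl example, with an explicit JSON content type.
resp = requests.post(url, json={"path": "/path/to/document"}, proxies={"http": ""})
print(resp.status_code, resp.text)
```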
1 change: 1 addition & 0 deletions comps/dataprep/redis/langchain/config.py
@@ -63,5 +63,6 @@ def format_redis_conn_from_env():
current_file_path = os.path.abspath(__file__)
parent_dir = os.path.dirname(current_file_path)
REDIS_SCHEMA = os.getenv("REDIS_SCHEMA", "schema_dim_768.yml")
TIMEOUT_SECONDS = int(os.getenv("TIMEOUT_SECONDS", 600))
schema_path = os.path.join(parent_dir, REDIS_SCHEMA)
INDEX_SCHEMA = schema_path
68 changes: 68 additions & 0 deletions comps/dataprep/redis/langchain_ray/config.py
@@ -0,0 +1,68 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

# Embedding model

EMBED_MODEL = os.getenv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")

# Redis Connection Information
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = int(os.getenv("REDIS_PORT", 6379))


def get_boolean_env_var(var_name, default_value=False):
"""Retrieve the boolean value of an environment variable.
Args:
var_name (str): The name of the environment variable to retrieve.
default_value (bool): The default value to return if the variable
is not found.
Returns:
bool: The value of the environment variable, interpreted as a boolean.
"""
true_values = {"true", "1", "t", "y", "yes"}
false_values = {"false", "0", "f", "n", "no"}

# Retrieve the environment variable's value
value = os.getenv(var_name, "").lower()

# Decide the boolean value based on the content of the string
if value in true_values:
return True
elif value in false_values:
return False
else:
return default_value


def format_redis_conn_from_env():
redis_url = os.getenv("REDIS_URL", None)
if redis_url:
return redis_url
else:
using_ssl = get_boolean_env_var("REDIS_SSL", False)
start = "rediss://" if using_ssl else "redis://"

# if using RBAC
password = os.getenv("REDIS_PASSWORD", None)
username = os.getenv("REDIS_USERNAME", "default")
if password is not None:
start += f"{username}:{password}@"

return start + f"{REDIS_HOST}:{REDIS_PORT}"


REDIS_URL = format_redis_conn_from_env()

# Vector Index Configuration
INDEX_NAME = os.getenv("INDEX_NAME", "rag-redis")

current_file_path = os.path.abspath(__file__)
parent_dir = os.path.dirname(current_file_path)
REDIS_SCHEMA = os.getenv("REDIS_SCHEMA", "schema_dim_768.yml")
TIMEOUT_SECONDS = int(os.getenv("TIMEOUT_SECONDS", 600))
schema_path = os.path.join(parent_dir, REDIS_SCHEMA)
INDEX_SCHEMA = schema_path
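
The new `TIMEOUT_SECONDS` setting is consumed by `prepare_doc_redis_on_ray.py`, which is not part of this hunk. Purely as a hedged sketch of the general idea (a thread-pool wrapper chosen for illustration, not necessarily the service's actual mechanism), a per-request limit that reports a status could look like this:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

TIMEOUT_SECONDS = 600
_pool = ThreadPoolExecutor(max_workers=1)


def run_with_timeout(fn, *args):
    """Run fn in a worker thread and report a status after at most TIMEOUT_SECONDS."""
    future = _pool.submit(fn, *args)
    try:
        return {"status": "done", "result": future.result(timeout=TIMEOUT_SECONDS)}
    except TimeoutError:
        # The worker keeps running in the background; we simply stop waiting for it.
        return {"status": "timeout"}
```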
38 changes: 38 additions & 0 deletions comps/dataprep/redis/langchain_ray/docker/Dockerfile
@@ -0,0 +1,38 @@

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

ENV LANG C.UTF-8

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
build-essential \
libgl1-mesa-glx \
libjemalloc-dev \
vim

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
pip install --no-cache-dir -r /home/user/comps/dataprep/redis/langchain_ray/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

USER root

RUN mkdir -p /home/user/comps/dataprep/redis/langchain_ray/uploaded_files && chown -R user /home/user/comps/dataprep/redis/langchain_ray/uploaded_files
RUN mkdir -p /home/user/comps/dataprep/redis/langchain_ray/status && chown -R user /home/user/comps/dataprep/redis/langchain_ray/status

USER user

WORKDIR /home/user/comps/dataprep/redis/langchain_ray

ENTRYPOINT ["python", "prepare_doc_redis_on_ray.py"]
