AI File Search Service that allows file text searching through embedded texts

RyanAquino/ai-file-search-service

AI File Search Service

A file search service that performs OCR on files and embeds the text to enable similarity searches.
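As a rough illustration of what "similarity search" over embedded text means, the sketch below ranks a few text chunks against a query vector by cosine similarity. In the service itself this ranking is delegated to Pinecone; the three-dimensional vectors, the chunk texts, and the function names here are made up for the example, and the choice of cosine similarity as the metric is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query_vec, chunks, k=2):
    """Return the texts of the k chunks whose vectors are most similar to query_vec.

    chunks is a list of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy data: real embeddings have many more dimensions.
chunks = [
    ("building height limits", [0.9, 0.1, 0.0]),
    ("fire escape requirements", [0.6, 0.6, 0.1]),
    ("parking fees", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.0, 0.0]
best = top_matches(query, chunks)
```

With these toy vectors, `best` holds the two chunks that point most nearly in the same direction as the query.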

Requirements

  • Python 3
  • Docker Compose
  • Google Cloud Storage API
  • OpenAI API
  • Pinecone

Technology

  • Python 3
  • OpenAI embeddings
  • Google Cloud Storage (GCS)
  • Pinecone vector database
  • Redis for caching and rate limiting

Endpoints

API endpoint documentation is also available once the service is running: http://localhost:3000/docs

| Method | URL | Description | Example payload |
|--------|-----|-------------|-----------------|
| POST | /api/v1/register | Registers a user with a username and password. | {"username": "admin", "password": "admin"} |
| POST | /api/v1/login | Authenticates a user. Content-Type is application/x-www-form-urlencoded. | {"username": "admin", "password": "admin"} |
| POST | /api/v1/upload | Receives files (pdf, png, jpg, tiff) and uploads them to Google Cloud Storage. | {"files": [<file object>]} |
| POST | /api/v1/ocr (mock endpoint) | Performs a mock OCR on the file, embeds the text, and saves the embeddings to the Pinecone vector database. The OCR filename in the URL must match a file in the ocr results directory. | {"url": "https://storage.googleapis.com/ai-file-search-service_new-bucket/建築基準法施行令.json?Expires=1728795108"} |
| POST | /api/v1/extract | Extracts the parts of the file identified by file_id that are relevant to the query text. | {"query_text": "建物", "file_id": "建築基準法施行令.json"} |
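The endpoint list above mixes two body encodings: login is form-encoded, while the other payloads are shown as JSON. The sketch below builds (but does not send) both kinds of request with the standard library, and also derives the filename that the OCR endpoint would need to find in the ocr results directory. The helper names are hypothetical, the base URL is taken from the docs address, and treating register as `application/json` is an assumption based on the example payload.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import unquote, urlencode, urlsplit
from urllib.request import Request

BASE_URL = "http://localhost:3000"  # port taken from the docs URL in this README


def json_request(path, payload):
    """Build a POST request with a JSON body (assumed for register, ocr, extract)."""
    return Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def form_request(path, fields):
    """Build a POST request with a form-encoded body (required by login)."""
    return Request(
        BASE_URL + path,
        data=urlencode(fields).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def ocr_filename(storage_url):
    """Return the decoded object name from a storage URL, ignoring the query string."""
    return PurePosixPath(unquote(urlsplit(storage_url).path)).name


credentials = {"username": "admin", "password": "admin"}
register = json_request("/api/v1/register", credentials)
login = form_request("/api/v1/login", credentials)
```

Passing either object to `urllib.request.urlopen` would send it to a running server. The shape of the login response is not documented here, so the sketch stops before using it.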

Setup with Docker

Copy your Google credentials to the app root as credentials.json:
    cp <path-to-your-credentials> credentials.json
Set the required environment variables in docker-compose.yaml:
    api:
        ...
        PINECONE_API_KEY:
        PINECONE_HOST:
        OPENAI_API_KEY:
Copy the mock OCR JSON results to the ocr directory:
    cp ocr/* <root app directory>/ocr/
Start the servers:
    docker-compose up -d
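A small preflight check can catch a missing key or file before the containers start. The following is only a sketch and not part of the repository: the function name is hypothetical, and it assumes the three variables listed above are the only required ones and that credentials.json and the ocr directory sit directly under the app root.

```python
from pathlib import Path

# Variable names as listed for docker-compose.yaml in this README.
REQUIRED_VARS = ("PINECONE_API_KEY", "PINECONE_HOST", "OPENAI_API_KEY")


def missing_settings(env, app_root="."):
    """Return a list of human-readable problems; an empty list means nothing was found missing."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    root = Path(app_root)
    if not (root / "credentials.json").is_file():
        problems.append("credentials.json not found in app root")

    ocr_dir = root / "ocr"
    if not (ocr_dir.is_dir() and any(ocr_dir.glob("*.json"))):
        problems.append("no OCR JSON results in ocr/")

    return problems
```

Calling `missing_settings(os.environ)` from the app root (after `import os`) would report anything still unset in the current shell. Values set only inside docker-compose.yaml are not visible to the host environment, so the check is most useful when run inside the api container.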

Setup manually (Alternative)

Create .env and adjust it as needed:
    cp .env.example .env
Install the dependencies:
    pip install -r requirements.txt
Run the server:
    python main.py

Access the API docs in a browser

http://localhost:3000/docs

Running tests

pytest . -v