A file search service that performs OCR on files and embeds the text to enable similarity searches.
- Python 3
- Docker-compose
- Google Cloud Storage API
- OpenAI API
- Pinecone
- Python 3
- OpenAI embeddings
- Google Cloud Storage (GCS)
- Pinecone vector database
- Redis caching and rate limits
API endpoint documentation is also available at http://localhost:3000/docs once the service is running.
Method | URL | Description | Example payload |
---|---|---|---|
POST | /api/v1/register | Registers a user with a username and password. | {"username": "admin", "password": "admin"} |
POST | /api/v1/login | Authenticates a user. The Content-Type is application/x-www-form-urlencoded. | {"username": "admin", "password": "admin"} |
POST | /api/v1/upload | Receives files (pdf, png, jpg, tiff) and uploads them to Google Cloud Storage. | {"files": [<file object>]} |
POST | /api/v1/ocr (mock endpoint) | Performs a mock OCR on a file, embeds the text, and saves the embeddings to the Pinecone vector database. The OCR filename in the URL must match a file in the OCR results directory. | {"url": "https://storage.googleapis.com/ai-file-search-service_new-bucket/建築基準法施行令.json?Expires=1728795108"} |
POST | /api/v1/extract | Extracts the parts of the file with the given file ID that are relevant to the query text. | {"query_text": "建物", "file_id": "建築基準法施行令.json"} |
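
As a rough illustration of the payloads in the table, the sketch below builds (but does not send) three of the requests using only the Python standard library. `BASE_URL` is assumed from the docs URL, and the helper names `build_json_request` and `build_login_request` are hypothetical. Whether the other endpoints expect a token returned by the login call is not stated here, so no auth header is added.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the API is served on the same host and port as the /docs page.
BASE_URL = "http://localhost:3000"


def build_json_request(path, payload):
    """Build a POST request with a JSON body (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_login_request(username, password):
    """Build the login request; the table says it is form-encoded, not JSON."""
    body = urlencode({"username": username, "password": password}).encode("ascii")
    return Request(
        BASE_URL + "/api/v1/login",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


register = build_json_request(
    "/api/v1/register", {"username": "admin", "password": "admin"}
)
login = build_login_request("admin", "admin")
extract = build_json_request(
    "/api/v1/extract", {"query_text": "建物", "file_id": "建築基準法施行令.json"}
)

for req in (register, login, extract):
    print(req.get_method(), req.full_url)
```

With the service running, each request could be sent with `urllib.request.urlopen(req)`; the response format is documented at the `/docs` page rather than assumed here.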
cp <path-to-your-credentials> credentials.json
api:
...
PINECONE_API_KEY:
PINECONE_HOST:
OPENAI_API_KEY:
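
Before starting the service, it can help to confirm that the three keys above are actually set. The following is a minimal sketch that assumes the values end up as environment variables of the process; the helper name `missing_settings` is hypothetical and not part of this project.

```python
import os

# The three settings named in the configuration above.
REQUIRED_SETTINGS = ("PINECONE_API_KEY", "PINECONE_HOST", "OPENAI_API_KEY")


def missing_settings(env):
    """Return the required setting names that are absent or blank in `env`."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


missing = missing_settings(os.environ)
if missing:
    print("Missing settings: " + ", ".join(missing))
else:
    print("All required settings are present.")
```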
cp ocr/* <root app directory>/ocr/
docker-compose up -d
cp .env.example .env
pip install -r requirements.txt
python main.py
http://localhost:3000/docs
pytest . -v