This repo hosts a FastAPI app for querying PDF and CSV files in natural language.
The query flow for PDF files is implemented as follows:
- Read the text from the PDF file
- Split the text into chunks
- Encode the chunks into embedding vectors using Hugging Face GTR-T5 or OpenAI Ada text embeddings
- Upload the text chunks and embedding vectors to the Qdrant vector database
- Take the user's query text and encode it into an embedding vector
- Search the vector database for the text chunks whose embeddings are closest to the query embedding by cosine similarity
- Generate an answer with FLAN-T5 or OpenAI GPT-3 from the user's query text and the retrieved text chunks
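The chunking and cosine-similarity search steps above can be sketched in plain Python. This is only an illustration, not the code in this repo: `chunk_text`, `toy_embed`, `cosine_similarity`, and `search` are hypothetical names, the bag-of-words `toy_embed` is a stand-in for GTR-T5 or Ada embeddings, and an in-memory list stands in for Qdrant.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def toy_embed(text):
    """Assumption: a sparse word-count vector standing in for a real embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, top_k=1):
    """Return the top_k (score, chunk) pairs closest to the query."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In the real flow, the retrieved chunks and the query text would then be passed to FLAN-T5 or GPT-3 to produce the answer.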
The query flow for CSV files is implemented as follows:
- Read the CSV file as a pandas DataFrame
- Convert each row to a text chunk
- The remaining steps are the same as for PDF files
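One way to turn each CSV row into a text chunk is sketched below. The `column: value` layout and the `rows_to_chunks` name are assumptions for illustration; the repo may format rows differently. The sketch uses the standard-library `csv` module rather than pandas so that it runs without extra dependencies.

```python
import csv
import io


def rows_to_chunks(csv_text):
    """Turn each CSV row into one 'column: value, ...' text chunk (assumed format)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    for row in reader:
        parts = [f"{column}: {value}" for column, value in row.items()]
        chunks.append(", ".join(parts))
    return chunks
```

Each resulting string can then be embedded and uploaded in the same way as a PDF text chunk.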
- Clone this repo:
  git clone https://github.com/haizadtarik/queryfile.git
- Install dependencies:
  cd queryfile
  python -m pip install -r requirements.txt
- Pull the Qdrant image from Docker Hub:
  docker pull qdrant/qdrant
- Run the Qdrant database:
  docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant
NOTE: To use OpenAI embeddings or GPT, create a .env file and put your API key there:
OPENAI_KEY=<OPEN_API_KEY>
- Bring up the FastAPI server:
  python server.py
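For the `.env` note above, a minimal sketch of how the `OPENAI_KEY` value could be read is shown below. How the server actually loads the key is not stated here, so `load_env_file` and `get_openai_key` are hypothetical helpers written with the standard library only.

```python
import os


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_openai_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("OPENAI_KEY")
    if key:
        return key
    if os.path.exists(path):
        return load_env_file(path).get("OPENAI_KEY")
    return None
```

Returning `None` when no key is found lets the caller fall back to the Hugging Face models, which do not need an API key.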