Experimental comparison of Numerical & Generative Techniques for CDQA and IR, conducted as part of DSC 210: Numerical Linear Algebra course (Fall 2023) at UC San Diego.
This project focuses on evaluating Closed Domain Question Answering (CDQA) through a comprehensive exploration of Numerical Linear Algebra (NLA) and state-of-the-art (SOTA) techniques. Leveraging NLA, methods like Latent Semantic Indexing (LSI) were employed to process textual data, reducing dimensionality through Singular Value Decomposition ( SVD). The study extensively compared the NLA approach with SOTA methods, including Retrieval Augmented Generation (RAG) and Large Language Models (LLMs) such as GPT-3.5-turbo. Experiments involved a variety of datasets of questions, documents and ground truth, assessing accuracy and response times. Multiple sub-experiments were also performed to evaluate the technical soundness of main experimental strategy. Results indicated that while NLA offered a foundational methodology, modern techniques significantly outperformed it, reaffirming the importance of staying abreast of evolving methodologies in CDQA. The project also validated the choice of using RAG with LLMs and established VectorDB as a superior Information Retrieval system. The findings contribute to ongoing efforts in refining and tailoring models for CDQA, aligning them with the dynamic challenges posed by diverse knowledge domains.
-
Python Requirements:
- A Python 3.9.6+ kernel is recommended to run this project. Since the project mainly consists of
.ipynb
notebooks, we recommend the installation of Jupyter Notebook. - Clone the repository and navigate to the root folder of the project.
- Optional: Create a virtual environment for this project (use
venv
orconda
). - Install all dependencies from the requirements.txt
file with the command
pip install -r requirements.txt
.
- A Python 3.9.6+ kernel is recommended to run this project. Since the project mainly consists of
-
OpenAI Token Requirements:
- A valid OpenAI token would be necessary (to generate OpenAI embeddings and GPT/Davinci LLM responses) to run majority of the project.
- You'll need to add the
OPENAI_API_KEY
to a.env
file. Anenv.example
is given in the root directory of the repo (open to see instructions). - To get the API token, please follow instructions in OpenAI API.
-
LLaMa-7B Model Requirements: To run LLaMa-7B model locally, please download it using the following commands:
cd ./src/ curl -L "https://replicate.fyi/install-llama-cpp" | bash wget https://huggingface.co/localmodels/Llama-2-7B-Chat-ggml/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_S.bin ./llama.cpp/models/llama-2-7b-chat.ggmlv3.q4_K_S.bin ./llama.cpp/convert-llama-ggml-to-gguf.py --eps 1e-5 -i ./llama.cpp/models/llama-2-7b-chat.ggmlv3.q4_K_S.bin -o ./llama.cpp/models/llama-2-7b-chat.ggmlv3.q4_K_S.gguf.bin
The model will be available at the path:
src/llama.cpp/models/llama-2-7b-chat.ggmlv3.q4_K_S.gguf.bin
. -
Once these steps are complete, please refer to the next section to understand the project layout.
-
Our project is primarily composed of Jupyter notebooks each performing a specific subpart of the project. We have all the experiments in the
src
folder. Experiments are organized based on the datasets used. i.e.src/cisi
will have all experiments performed on the CISI dataset. -
Refer to the Project Structure to understand the file organization and hierarchy.
-
The Results Notebook has the compiled results of all our experiments and sub-experiments. Running this notebook will read, process, and compute all results from the responses and data pickle files for each dataset using the corresponding models. These results have been added in our Notion report as well.
-
If you wish to separately run individual experimental sections for a specific dataset:
- Navigate to the corresponding dataset subdirectory, i.e.,
src/{dataset}
, for example,src/nfcorpus
. src/{dataset}/notebooks
contains the notebooks explaining every step of processing towards CDQA.- The notebook names are self-explanatory. For example,
src/{dataset}/notebooks/preprocess_data.ipynb
has the code to pre-process, clean and pickle the documents, queries, and ground truth's for the corresponding dataset. - Similarily
src/{dataset}/notebooks/generate_embeddings.ipynb
generates all LSI (reduced vectors), and OpenAI embeddings. - To run the experiments without RAG, refer to notebooks named
src/{dataset}/notebooks/llm_wo_rag.ipynb
. - Similarily, if you would like to run experiments with RAG, the NLA (LSI) approach for RAG is
given
src/{dataset}/notebooks/llm_w_rag_exact_search.ipynb
and the SOTA VectorDB approach (with FAISS) is given insrc/{dataset}/notebooks/llm_w_rag_faiss.ipynb
.
- Navigate to the corresponding dataset subdirectory, i.e.,
-
Datastores & Pickle Files: If you wish to view the intermediate outputs after each stage of the RAG pipeline, You can do so by manually loading and viewing pickle files corresponding to the steps.
src/{dataset}/dataset/
: Pickled, pre-processed, cleaned documents, queries, and ground truth files for the dataset, i.e., output of thepreprocess_data.ipynb
notebook.src/{dataset}/embeddings/
: Pickled OpenAI & LSI embeddings in corresponding folders, i.e., the output ofgenerate_embeddings.ipynb
notebook.src/{dataset}/ir_techniques/
: Index files necessary for IR in the RAG pipeline for each mode of IR, i.e., NLA technique (output of LSI + Truncated SVD) and SOTA techniques (FAISS
/ANNOY
VectorDB index files).src/{dataset}/responses/
: LLM-specific responses for the queries in the dataset. This includes pickled responses from all LLMs we ran for that dataset. For example,src/nfcorpus/responses/gpt-3.5-turbo-instruct/llm_w_rag_exact_search.pkl
has the responses returned bygpt-3.5-turbo
with LSI + SVD (exact_search) as the IR technique used for RAG.
- Raise PR on separate branch for code updates & request code owner review.
- Mark TODO's as issues.
We appreciate any ideas and contributions from the open-source community.
We aim to make this project accessible and modularized enough to use as a plug-and-play model for evaluating traditional and modern Closed Domain Question Answering approaches.
Please feel free to contact the authors if you are interested to contribute or collaborate: