# Analysis and value prediction for the Jeopardy dataset
Data source: https://www.kaggle.com/tunguz/200000-jeopardy-questions
- Go to the base directory and locate `requirements.txt`.
- Install the dependencies:
  `pip install -r requirements.txt`
- Read through `notebooks/eda.ipynb` for the feature engineering and transformations.
- Change to the `src` directory: `cd src`
- Clean and transform the data by running `clean_transform_data.py` with the appropriate arguments:
  `python3 clean_transform_data.py <input_csv_file> <out_csv_file>`
- The `Air Date` feature is binary encoded, with 01/01/2000 as the breakpoint. The reasoning for this is theorized and verified in `notebooks/eda.ipynb`.
- The text features, i.e. `Category`, `Question`, and `Answer`, are cleaned of punctuation and stopwords.
- Design matrix brief: the design matrix (the final feature matrix) is generated by concatenating the encoded `Air Date` and `Round` to the appropriate text vectors (a sketch of these transformations follows this list).
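
For illustration, here is a minimal sketch of the transformations above. The toy stopword list and TF-IDF settings are assumptions for the sketch; the authoritative logic lives in `clean_transform_data.py` and `notebooks/eda.ipynb`.

```python
# Sketch only: the stopword list and vectorizer settings are assumptions;
# see clean_transform_data.py for the project's actual logic.
import string

import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # illustrative subset

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOPWORDS)

df = pd.read_csv("jeopardy.csv")
df.columns = df.columns.str.strip()  # the Kaggle CSV has stray spaces in its headers

# Binary-encode Air Date: 1 if the episode aired on/after 01/01/2000, else 0.
df["air_date_post_2000"] = (pd.to_datetime(df["Air Date"]) >= "2000-01-01").astype(int)

# Clean the text features of punctuation and stopwords.
for col in ["Category", "Question", "Answer"]:
    df[col] = df[col].astype(str).map(clean_text)

# Design matrix: concatenate the encoded Air Date and Round to the text vectors.
text_vectors = TfidfVectorizer(max_features=5000).fit_transform(
    df["Category"] + " " + df["Question"] + " " + df["Answer"])
round_onehot = pd.get_dummies(df["Round"]).to_numpy()
X = hstack([text_vectors, df[["air_date_post_2000"]].to_numpy(), round_onehot])
```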
After cleaning the data, we can move on to training the models. We tried 3 different models, each improving on the previous model's error.
Train a baseline linear regression model by following the steps below (a minimal sketch of the training logic appears after the list):
- Move to the `src` directory: `cd src`
- Train the linear regression:
  `python3.8 train_linear_regression.py <input_filepath>`
- We were able to minimize the RMSE to 332.77 on the training data and 806.62 on the test data.
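
A rough sketch of what such a baseline might do, reusing `df` and `X` from the sketch above; the target parsing, split ratio, and seed are assumptions, not necessarily what `train_linear_regression.py` does:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Parse "$200"-style values; rows with no value (e.g. Final Jeopardy) are dropped.
y = pd.to_numeric(df["Value"].str.replace(r"[\$,]", "", regex=True), errors="coerce")
idx = np.flatnonzero(y.notna().to_numpy())

X_train, X_test, y_train, y_test = train_test_split(
    X.tocsr()[idx], y.iloc[idx].to_numpy(), test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

def rmse(m, X, y):
    """Root-mean-squared error of model m on (X, y)."""
    return np.sqrt(mean_squared_error(y, m.predict(X)))

print(f"train RMSE: {rmse(model, X_train, y_train):.2f}")
print(f"test RMSE:  {rmse(model, X_test, y_test):.2f}")
```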
Now that we have our baseline, we move on to more complex models. The large gap between the training and test errors above tells us the model is overfitting. We will try to mitigate this in our pursuit of the best model.
Train a random forest model by following the steps below (again, a minimal sketch follows the list):
- Move to the `src` directory: `cd src`
- Train the random forest:
  `python3.8 train_random_forest.py <input_filepath>`
- We were able to minimize the RMSE to 526.74 on the training data and 538.57 on the test data.
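
The random forest step presumably swaps the regressor into the same pipeline; reusing `X_train`, `y_train`, and the `rmse` helper from the sketch above, with hyperparameters that are illustrative guesses rather than the values in `train_random_forest.py`:

```python
from sklearn.ensemble import RandomForestRegressor

# Hyperparameters are illustrative assumptions; bounding max_depth is one
# common way to curb the overfitting seen with the linear baseline's features.
rf = RandomForestRegressor(n_estimators=100, max_depth=20, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

print(f"train RMSE: {rmse(rf, X_train, y_train):.2f}")
print(f"test RMSE:  {rmse(rf, X_test, y_test):.2f}")
```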
As we can see, there is a definite improvement on the test set over linear regression, and the model is no longer overfitting. But can we reduce the error further? We will try fine-tuning a Hugging Face pretrained transformer in the next step.
Fine-tune BERT by following the steps below (a sketch of the fine-tuning setup follows the list):
- Move to the `src` directory: `cd src`
- Fine-tune BERT:
  `python3.8 finetune.py <input_filepath> --epochs <num_epochs> --data_frac <fraction of data>`
- We used 5 epochs and a data fraction of 1.0 (the complete dataset).
- We were able to minimize the RMSE to 46.65 on the training data and 46.16 on the test data.
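
A minimal sketch of BERT fine-tuning for regression with Hugging Face Transformers, reusing `df`, `y`, and `idx` from the earlier sketches. The checkpoint, batch size, and dataset construction are assumptions; `finetune.py`'s actual arguments and training loop may differ.

```python
# Sketch only: bert-base-uncased and the hyperparameters are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels=1 puts a single-output regression head (MSE loss) on top of BERT.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression")

# Build text/label pairs from the cleaned questions and their dollar values.
ds = Dataset.from_dict({
    "text": df["Question"].iloc[idx].tolist(),
    "labels": y.iloc[idx].astype("float32").tolist(),
})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                padding="max_length", max_length=128),
            batched=True)
splits = ds.train_test_split(test_size=0.2, seed=42)

args = TrainingArguments(output_dir="bert-jeopardy", num_train_epochs=5,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=splits["train"],
        eval_dataset=splits["test"]).train()
```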
We see a big improvement with BERT. However, it is a very large model and requires significant resources to fine-tune.