# Analysis and value prediction for the Jeopardy dataset
Data source: https://www.kaggle.com/tunguz/200000-jeopardy-questions
- Go to the base directory and locate `requirements.txt`.
- Install the dependencies:
  `pip install -r requirements.txt`
- Read through `notebooks/eda.ipynb` for the feature engineering and transformations.
- Change to the `src` directory: `cd src`
- Clean and transform the data by running `clean_transform_data.py` with the appropriate arguments:
  `python3 clean_transform_data.py <input_csv_file> <out_csv_file>`
- The `Air Date` feature is binary encoded, with 01/01/2000 as the breakpoint. The reasoning for this is theorized and verified in `notebooks/eda.ipynb`.
- The text features, i.e. `Category`, `Question`, and `Answer`, are cleaned of punctuation and stopwords.
- Design matrix brief: the design matrix (the final feature matrix) is generated by concatenating the encoded `Air Date` and `Round` to the appropriate text vectors (a sketch of these transformations follows this list).
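
For illustration, here is a minimal sketch of the transformations above. The toy stopword list and TF-IDF settings are assumptions for the sketch; the authoritative logic lives in `clean_transform_data.py` and `notebooks/eda.ipynb`.

```python
# Sketch only: the stopword list and vectorizer settings are assumptions;
# see clean_transform_data.py for the project's actual logic.
import string

import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # illustrative subset

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOPWORDS)

df = pd.read_csv("jeopardy.csv")
df.columns = df.columns.str.strip()  # the Kaggle CSV has stray spaces in its headers

# Binary-encode Air Date: 1 if the episode aired on/after 01/01/2000, else 0.
df["air_date_post_2000"] = (pd.to_datetime(df["Air Date"]) >= "2000-01-01").astype(int)

# Clean the text features of punctuation and stopwords.
for col in ["Category", "Question", "Answer"]:
    df[col] = df[col].astype(str).map(clean_text)

# Design matrix: concatenate the encoded Air Date and Round to the text vectors.
text_vectors = TfidfVectorizer(max_features=5000).fit_transform(
    df["Category"] + " " + df["Question"] + " " + df["Answer"])
round_onehot = pd.get_dummies(df["Round"]).to_numpy()
X = hstack([text_vectors, df[["air_date_post_2000"]].to_numpy(), round_onehot])
```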
After cleaning the data, we can move on to training the models. We tried 3 different models, each improving on the previous model's error.
Train a baseline linear regression model by following the steps below (a minimal sketch of the training logic appears after the list):
- Move to the `src` directory: `cd src`
- Train the linear regression:
  `python3.8 train_linear_regression.py <input_filepath>`
- We were able to minimize the RMSE to 332.77 on the training data and 806.62 on the test data.
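
A rough sketch of what such a baseline might do, reusing `df` and `X` from the sketch above; the target parsing, split ratio, and seed are assumptions, not necessarily what `train_linear_regression.py` does:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Parse "$200"-style values; rows with no value (e.g. Final Jeopardy) are dropped.
y = pd.to_numeric(df["Value"].str.replace(r"[\$,]", "", regex=True), errors="coerce")
idx = np.flatnonzero(y.notna().to_numpy())

X_train, X_test, y_train, y_test = train_test_split(
    X.tocsr()[idx], y.iloc[idx].to_numpy(), test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

def rmse(m, X, y):
    """Root-mean-squared error of model m on (X, y)."""
    return np.sqrt(mean_squared_error(y, m.predict(X)))

print(f"train RMSE: {rmse(model, X_train, y_train):.2f}")
print(f"test RMSE:  {rmse(model, X_test, y_test):.2f}")
```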
Now that we have our baseline, we move on to more complex models. The large gap between the training and test errors above tells us the model is overfitting. We will try to mitigate this in our pursuit of the best model.
Train a random forest model by following the steps below (again, a minimal sketch follows the list):
- Move to the `src` directory: `cd src`
- Train the random forest:
  `python3.8 train_random_forest.py <input_filepath>`
- We were able to minimize the RMSE to 526.74 on the training data and 538.57 on the test data.
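
The random forest step presumably swaps the regressor into the same pipeline; reusing `X_train`, `y_train`, and the `rmse` helper from the sketch above, with hyperparameters that are illustrative guesses rather than the values in `train_random_forest.py`:

```python
from sklearn.ensemble import RandomForestRegressor

# Hyperparameters are illustrative assumptions; bounding max_depth is one
# common way to curb the overfitting seen with the linear baseline's features.
rf = RandomForestRegressor(n_estimators=100, max_depth=20, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

print(f"train RMSE: {rmse(rf, X_train, y_train):.2f}")
print(f"test RMSE:  {rmse(rf, X_test, y_test):.2f}")
```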
As we can see, there is a definite improvement on the test set over linear regression, and the model is no longer overfitting. But can we reduce the error further? We will try fine-tuning a Hugging Face pretrained transformer in the next step.
Fine-tune BERT by following the steps below (a sketch of the fine-tuning setup follows the list):
- Move to the `src` directory: `cd src`
- Fine-tune BERT:
  `python3.8 finetune.py <input_filepath> --epochs <num_epochs> --data_frac <fraction of data>`
- We used 5 epochs and a data fraction of 1.0 (the complete dataset).
- We were able to minimize the RMSE to 46.65 on the training data and 46.16 on the test data.
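
A minimal sketch of BERT fine-tuning for regression with Hugging Face Transformers, reusing `df`, `y`, and `idx` from the earlier sketches. The checkpoint, batch size, and dataset construction are assumptions; `finetune.py`'s actual arguments and training loop may differ.

```python
# Sketch only: bert-base-uncased and the hyperparameters are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels=1 puts a single-output regression head (MSE loss) on top of BERT.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression")

# Build text/label pairs from the cleaned questions and their dollar values.
ds = Dataset.from_dict({
    "text": df["Question"].iloc[idx].tolist(),
    "labels": y.iloc[idx].astype("float32").tolist(),
})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                padding="max_length", max_length=128),
            batched=True)
splits = ds.train_test_split(test_size=0.2, seed=42)

args = TrainingArguments(output_dir="bert-jeopardy", num_train_epochs=5,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=splits["train"],
        eval_dataset=splits["test"]).train()
```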
We see a big improvement with BERT. However, it is a very large model and requires significant resources to fine-tune.