This project contains code for the Toxic Comment Classification Challenge on Kaggle.
The goal of the competition is to identify and classify toxic online comments.
You need to install Poetry before moving forward. Follow the instructions at this link.
- Clone this repo:
git clone https://github.com/david1542/toxic-comments.git
- Install the dependencies:
poetry install
- Authenticate with the Kaggle CLI by following these instructions.
- Downgrade PyTorch to 1.12.1, since later versions have CUDA driver mismatches (issue):
pip install torch==1.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
- Run this script to download the data:
./scripts/download_data.sh
Hydra is used as the configuration manager. Run the train.py script and override any parameters on the command line:
python src/train.py training_args.learning_rate=1e-3 training_args.num_train_epochs=5
For more information about the parameters, see configs/train.yaml.
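To make the override syntax concrete, here is a minimal sketch of how dotted `key=value` arguments such as `training_args.learning_rate=1e-3` map onto a nested configuration. It is a simplified, standard-library illustration and not Hydra's actual implementation; the helper names `parse_value` and `apply_overrides` and the default values are hypothetical, and the real defaults live in configs/train.yaml.

```python
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Interpret a raw override value as bool, int, or float; otherwise keep the string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply dotted key=value overrides to a nested dict in place and return it."""
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Walk down the nested dict, creating intermediate levels if needed.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return config


# Hypothetical defaults, for illustration only.
defaults = {"training_args": {"learning_rate": 5e-5, "num_train_epochs": 3}}
config = apply_overrides(
    defaults,
    ["training_args.learning_rate=1e-3", "training_args.num_train_epochs=5"],
)
print(config)
```

Hydra's real override grammar is richer than this sketch, so treat it only as a mental model for how the command-line arguments relate to the keys in the YAML file.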
Some nice articles that I've found while working on this problem:
- Nice article about multi-label classification.
- Some technical tips about fine-tuning transformers for a multi-label problem.
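As a quick illustration of what "multi-label" means for this competition, the sketch below treats each of the six competition labels as an independent yes/no decision: every logit goes through its own sigmoid, and the loss is binary cross-entropy averaged over labels. This is one common formulation and an assumption on my part; the training code in this repo may differ. The example logits and the 0.5 threshold are made up for illustration.

```python
import math
from typing import Dict, List

# The six label columns in the competition data.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits: List[float], threshold: float = 0.5) -> Dict[str, bool]:
    """Turn one logit per label into independent yes/no decisions."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    return {label: sigmoid(z) >= threshold for label, z in zip(LABELS, logits)}


def bce_with_logits(logits: List[float], targets: List[int]) -> float:
    """Mean binary cross-entropy over labels, in a numerically stable form."""
    losses = [
        max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
        for z, t in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Hypothetical model outputs for a single comment.
example_logits = [2.0, -3.0, 1.2, -4.0, 0.3, -2.5]
predictions = predict_labels(example_logits)
loss = bce_with_logits(example_logits, [1, 0, 1, 0, 1, 0])
print(predictions)
print(round(loss, 4))
```

Unlike multi-class classification with a softmax, several labels can be true at once here, so a single comment can be both `toxic` and `insult`.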