© [2023] Ubisoft Entertainment. All Rights Reserved
We aim to use language models to identify and classify toxicity inside in-game chat.
To make the project code easier to collaborate on and understand, the following sections outline the standards / loose rules that are followed.
Follow a layout similar to the `src` layout from Python's packaging-projects tutorial:

```
src/
    module_1/
        __init__.py
    module_2/
        __init__.py
tests/
main.py
poetry.lock
```
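As a hedged sketch, the matching `pyproject.toml` could declare the `src` packages like this (the project name, version, description, and Python pin below are assumptions, not the real values):

```toml
[tool.poetry]
name = "toxicity-detection"          # hypothetical project name
version = "0.1.0"
description = "Toxicity detection in in-game chat"
authors = ["Your Name <you@example.com>"]
# With an src layout, Poetry must be told where the packages live.
packages = [
    { include = "module_1", from = "src" },
    { include = "module_2", from = "src" },
]

[tool.poetry.dependencies]
python = "^3.9"                      # any Python 3.x pin works here

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```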
- Keep it simple, stupid (KISS).
- Don't add features / over-engineer until you need them.
| Name | Value | Links | Notes |
|---|---|---|---|
| Language | Python 3.x | | |
| Package Manager | Poetry | Docs; Useful TLDR | On Windows, you may have to restart the computer after installing for it to work with VSCode. |
| Python Env | conda | | |
| Code Linter | PEP 8 | Enable for VSCode | |
| Docstring | PEP 8 | Follow NumPy's style | |
| Unit Tests | Python 3.x | Sample | |
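As a rough example of how these tools fit together, a first-time setup might look like the following (the environment name and Python version are assumptions, not project requirements):

```bash
# Hypothetical setup flow; adjust names/versions to match the project config.
conda create -n toxicity python=3.10
conda activate toxicity
pip install poetry        # or use Poetry's official installer
poetry install            # resolves dependencies from pyproject.toml / poetry.lock
```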
- Test if the Poetry package manager is up to date:

  ```
  poetry run python .\main.py Train --config ".\train\train_on_CONDA_no_context.json" --max_epochs_to_train 1
  ```

  Note: Poetry installs torch with CPU support and no CUDA support. See Issue 4231; users may have to separately install PyTorch with CUDA.
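One common workaround is to install the CUDA build of PyTorch inside the Poetry environment; the CUDA version below is an assumption, so pick the wheel that matches your driver:

```bash
# Hypothetical example: replace cu118 with the CUDA version your driver supports.
poetry run pip install torch --index-url https://download.pytorch.org/whl/cu118
```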
- Use `poetry add` to add missing packages to `pyproject.toml` & `poetry.lock`.
- Use `poetry export -f requirements.txt > requirements.txt` to update `requirements.txt`.
We want our model to be able to classify spans of words as non-toxic or as specific categories of toxicity. For this use case, the model currently performs token classification.

- Current model is `bert-base-uncased`.
- Tokenizer configs can be found here.
- HuggingFace Token Classification
- HuggingFace Tokenization Documentation: https://huggingface.co/docs/tokenizers/pipeline
- Useful Stackoverflow: https://stackoverflow.com/questions/65246703/how-does-max-length-padding-and-truncation-arguments-work-in-huggingface-bertt
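As a minimal sketch of this setup (the label list, `max_length`, and example sentence are assumptions, not the project's actual label schema or tokenizer config):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Hypothetical toxicity label set; the real project schema may differ.
labels = ["O", "TOXIC"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

# padding / truncation / max_length behave as described in the linked
# StackOverflow answer: shorter inputs are padded, longer ones truncated.
inputs = tokenizer(
    "example chat line goes here",
    padding="max_length",
    truncation=True,
    max_length=64,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (batch, seq_len, num_labels)
predictions = logits.argmax(dim=-1)      # one predicted label id per token
```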
- Epoch: one full pass over the training dataset.
- Batch Size: number of samples to train on at once, limited by CPU / GPU memory.
  - `per_gpu_batch_size`: number of samples to run on each GPU if there is more than one. The batch size will be `num_gpu * per_gpu_batch_size`.
- Global Step: number of batches processed before the model calculates gradients & performs back-propagation.
  - To prevent vanishing & exploding gradients, we use `clip_grad_norm_` & accumulate batches.
  - Gradient accumulation is performed at every global step (see the trainer sketch below).
- Validation Loop:
  - We run validation every `X` epochs. If we follow the paper, it was run 10 times per epoch.
  - Push metrics to TensorBoard.
  - In normal ML models, we run validation every epoch or even every `X` epochs.
- Save the model at the end of every `X` epochs:
  - Changed from global step, since that depends on two config variables and can be inconsistent.
  - Can be changed back if we save all the configs.
- Trainer Logic Code samples:
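Below is a minimal sketch of the trainer logic described above. It assumes a `model`, `train_loader`, `val_loader`, and `num_epochs` already exist, and `accumulation_steps` / `validate_every_n_epochs` are hypothetical stand-ins for the config values; this is illustrative, not the project's actual trainer code:

```python
import torch
from torch.nn.utils import clip_grad_norm_
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()                      # push metrics to TensorBoard
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

accumulation_steps = 4                        # batches per global step (assumption)
validate_every_n_epochs = 1                   # the "X" in the notes above (assumption)
global_step = 0

for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    for i, batch in enumerate(train_loader):
        # Assumes batches are dicts accepted by a HuggingFace-style model that returns .loss.
        loss = model(**batch).loss / accumulation_steps
        loss.backward()                       # gradients accumulate across batches

        if (i + 1) % accumulation_steps == 0:
            clip_grad_norm_(model.parameters(), max_norm=1.0)  # guard against exploding gradients
            optimizer.step()                  # one global step = one optimizer update
            optimizer.zero_grad()
            global_step += 1
            writer.add_scalar("train/loss", loss.item() * accumulation_steps, global_step)

    # Run validation every X epochs.
    if (epoch + 1) % validate_every_n_epochs == 0:
        model.eval()
        with torch.no_grad():
            val_loss = sum(model(**b).loss.item() for b in val_loader) / len(val_loader)
        writer.add_scalar("val/loss", val_loss, epoch)

    # Save the model at the end of the epoch.
    torch.save(model.state_dict(), f"checkpoint_epoch_{epoch}.pt")
```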