© [2023] Ubisoft Entertainment. All Rights Reserved
We aim to use language models to identify and classify toxicity inside in-game chat.
To make the project code easier to collaborate on and understand, the following sections outline the standards / loose rules that are followed.
Follow a layout similar to the `src` layout from Python's packaging-projects tutorial:

```
src/
    module_1/
        __init__.py
    module_2/
        __init__.py
tests/
main.py
poetry.lock
```
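As a hedged sketch, the matching `pyproject.toml` could declare the `src` packages like this (the project name, version, description, and Python pin below are assumptions, not the real values):

```toml
[tool.poetry]
name = "toxicity-detection"          # hypothetical project name
version = "0.1.0"
description = "Toxicity detection in in-game chat"
authors = ["Your Name <you@example.com>"]
# With an src layout, Poetry must be told where the packages live.
packages = [
    { include = "module_1", from = "src" },
    { include = "module_2", from = "src" },
]

[tool.poetry.dependencies]
python = "^3.9"                      # any Python 3.x pin works here

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```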
- Keep it simple, stupid (KISS).
- Don't add features / over-engineer until you need them.
| Name | Value | Links | Notes |
|---|---|---|---|
| Language | Python 3.x | | |
| Package Manager | Poetry | Docs; Useful TLDR | On Windows, you may have to restart the computer after installing for it to work with VSCode. |
| Python Env | conda | | |
| Code Linter | PEP 8 | Enable for VSCode | |
| Docstring | PEP 8 | Follow NumPy's style | |
| Unit Tests | Python 3.x | Sample | |
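As a rough example of how these tools fit together, a first-time setup might look like the following (the environment name and Python version are assumptions, not project requirements):

```bash
# Hypothetical setup flow; adjust names/versions to match the project config.
conda create -n toxicity python=3.10
conda activate toxicity
pip install poetry        # or use Poetry's official installer
poetry install            # resolves dependencies from pyproject.toml / poetry.lock
```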
- Test if the Poetry package manager is up to date:

  ```
  poetry run python .\main.py Train --config ".\train\train_on_CONDA_no_context.json" --max_epochs_to_train 1
  ```

  Note: Poetry installs torch with CPU support and no CUDA support. See Issue 4231; users may have to separately install PyTorch with CUDA.
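One common workaround is to install the CUDA build of PyTorch inside the Poetry environment; the CUDA version below is an assumption, so pick the wheel that matches your driver:

```bash
# Hypothetical example: replace cu118 with the CUDA version your driver supports.
poetry run pip install torch --index-url https://download.pytorch.org/whl/cu118
```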
- Use `poetry add` to add missing packages to `pyproject.toml` & `poetry.lock`.
- Use `poetry export -f requirements.txt > requirements.txt` to update `requirements.txt`.
We want our model to be able to classify spans of words as non-toxic or as specific categories of toxicity. For this use case, the model currently performs token classification.

- Current model is `bert-base-uncased`.
- Tokenizer configs can be found here.
- HuggingFace Token Classification
- HuggingFace Tokenization Documentation: https://huggingface.co/docs/tokenizers/pipeline
- Useful Stackoverflow: https://stackoverflow.com/questions/65246703/how-does-max-length-padding-and-truncation-arguments-work-in-huggingface-bertt
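As a minimal sketch of this setup (the label list, `max_length`, and example sentence are assumptions, not the project's actual label schema or tokenizer config):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Hypothetical toxicity label set; the real project schema may differ.
labels = ["O", "TOXIC"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

# padding / truncation / max_length behave as described in the linked
# StackOverflow answer: shorter inputs are padded, longer ones truncated.
inputs = tokenizer(
    "example chat line goes here",
    padding="max_length",
    truncation=True,
    max_length=64,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (batch, seq_len, num_labels)
predictions = logits.argmax(dim=-1)      # one predicted label id per token
```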
- Epoch: one full pass over the training dataset.
- Batch Size: number of samples to train on at once, limited by CPU / GPU memory.
  - `per_gpu_batch_size`: number of samples to run on each GPU if there is more than one. The batch size will be `num_gpu * per_gpu_batch_size`.
- Global Step: number of batches processed before the model calculates gradients & performs back-propagation.
  - To prevent vanishing & exploding gradients, we use `clip_grad_norm_` & accumulate batches.
  - Gradient accumulation is performed at every global step (see the trainer sketch below).
- Validation Loop:
  - We run validation every `X` epochs. If we follow the paper, it was run 10 times per epoch.
  - Push metrics to TensorBoard.
  - In normal ML models, we run validation every epoch or even every `X` epochs.
- Save the model at the end of every `X` epochs:
  - Changed from global step, since that depends on two config variables and can be inconsistent.
  - Can be changed back if we save all the configs.
- Trainer Logic Code samples:
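Below is a minimal sketch of the trainer logic described above. It assumes a `model`, `train_loader`, `val_loader`, and `num_epochs` already exist, and `accumulation_steps` / `validate_every_n_epochs` are hypothetical stand-ins for the config values; this is illustrative, not the project's actual trainer code:

```python
import torch
from torch.nn.utils import clip_grad_norm_
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()                      # push metrics to TensorBoard
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

accumulation_steps = 4                        # batches per global step (assumption)
validate_every_n_epochs = 1                   # the "X" in the notes above (assumption)
global_step = 0

for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    for i, batch in enumerate(train_loader):
        # Assumes batches are dicts accepted by a HuggingFace-style model that returns .loss.
        loss = model(**batch).loss / accumulation_steps
        loss.backward()                       # gradients accumulate across batches

        if (i + 1) % accumulation_steps == 0:
            clip_grad_norm_(model.parameters(), max_norm=1.0)  # guard against exploding gradients
            optimizer.step()                  # one global step = one optimizer update
            optimizer.zero_grad()
            global_step += 1
            writer.add_scalar("train/loss", loss.item() * accumulation_steps, global_step)

    # Run validation every X epochs.
    if (epoch + 1) % validate_every_n_epochs == 0:
        model.eval()
        with torch.no_grad():
            val_loss = sum(model(**b).loss.item() for b in val_loader) / len(val_loader)
        writer.add_scalar("val/loss", val_loss, epoch)

    # Save the model at the end of the epoch.
    torch.save(model.state_dict(), f"checkpoint_epoch_{epoch}.pt")
```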