Data preprocessing tools for Finnish corpora. Dataset-specific directories are named in the style <NAME>-tools
and include the information needed to download each corpus, plus some supplementary statistics.
The following scripts are used for specific preprocessing steps:
https://github.com/TurkuNLP/finngen-tools/blob/main/filter_jsonl.py
https://github.com/spyysalo/onion-tools
http://corpus.tools/wiki/Onion
https://github.com/TurkuNLP/finngen-tools/blob/main/kenlm_line_filter.py
https://github.com/TurkuNLP/toxicity-classifier
https://github.com/TurkuNLP/finngen-tools/tree/main/preproc-tools
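
For readers unfamiliar with this kind of pipeline, the sketch below illustrates what a JSONL filtering step can look like in general. It is not the code of filter_jsonl.py and does not reflect that script's options: the function name `filter_jsonl`, the `text` field, and the `min_chars` threshold are all assumptions chosen for illustration.

```python
import json
from typing import Iterable, Iterator


def filter_jsonl(
    lines: Iterable[str],
    min_chars: int = 200,   # hypothetical threshold, not taken from the repository
    field: str = "text",    # hypothetical field name, not taken from the repository
) -> Iterator[str]:
    """Yield only the JSONL lines whose `field` holds at least `min_chars` characters."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get(field, "")
        # Keep the document only if the field is a string that is long enough.
        if isinstance(text, str) and len(text) >= min_chars:
            yield line
```

Applied to an open file, this could be used as `for kept in filter_jsonl(open("corpus.jsonl", encoding="utf-8")): print(kept)`, where `corpus.jsonl` is a placeholder file name.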