This project focuses on diacritic restoration for the Turkish language using transformer models. It was developed as part of the Natural Language Processing course at Istanbul Technical University in Spring 2024.
- Emircan Erol - 150200324 ([email protected])
- Muhammed Rüşen Birben - 150220755 ([email protected])
The goal of this project is to restore diacritics in Turkish text using transformer models. Diacritics matter in Turkish because they change both the meaning and the pronunciation of words: for example, the undiacritized form "aci" can correspond to either "acı" (pain) or "açı" (angle). In many digital contexts, however, Turkish text is written without diacritics, so this project aims to restore the missing diacritics automatically.
The base model used is google/byt5-small. The project notebook was run iteratively to select the best checkpoints, tune hyperparameters, and refine the dataset.
All resources including the final model used in this project can be found in this Google Drive folder.
The project uses the following main libraries:
- PyTorch
- Transformers
- PEFT (Parameter-Efficient Fine-Tuning)
- Pandas
- Weights & Biases (wandb)
- TurboT5
Refer to the notebook for the full list of imports.
Key hyperparameters used:
- Learning Rate (LR): 2⁻¹²
- Minimum Sequence Length (MIN_LEN): 0
- Maximum Sequence Length (MAX_LEN): 512
- Batch Size (BATCH_SIZE): 64
Note that these parameters were adjusted slightly across different experiments and training datasets.
The project includes functions to prepare the data for training:
- `asciify_turkish_chars(text)`: Removes diacritics from Turkish text (a sketch follows this list).
- `txt_to_input_output(fp, skip=500_000, split='all')`: Reads a text file and writes it to a JSONL file to be used as a dataset.
- `mask_label(data, batch_size=BATCH_SIZE)`: Masks the padded tokens in the input and creates batches for training.
- `test_mask(data)`: Masks the padded tokens in the input and creates batches for testing.
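A minimal sketch of what `asciify_turkish_chars` could look like, assuming it simply maps the six Turkish diacritic letters (and their uppercase forms) to ASCII equivalents; the notebook's actual implementation may differ.

```python
# Hypothetical sketch; the notebook's implementation may differ.
# Map the Turkish diacritic letters (and uppercase forms) to ASCII equivalents.
_DIACRITIC_MAP = str.maketrans({
    "ç": "c", "Ç": "C",
    "ğ": "g", "Ğ": "G",
    "ı": "i", "İ": "I",
    "ö": "o", "Ö": "O",
    "ş": "s", "Ş": "S",
    "ü": "u", "Ü": "U",
})

def asciify_turkish_chars(text: str) -> str:
    """Remove diacritics from Turkish text, producing the undiacritized model input."""
    return text.translate(_DIACRITIC_MAP)

print(asciify_turkish_chars("Dün akşam çok güzel bir yemek yedik."))
# -> Dun aksam cok guzel bir yemek yedik.
```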
The main dataset used is `data.jsonl`. It is created by running the `txt_to_input_output` function on the Turkish text data.
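The exact schema of `data.jsonl` is not shown here; the following is only a rough sketch of how `txt_to_input_output` might produce it, assuming each JSONL line holds an undiacritized input paired with the original text as the target (the field names `input`/`output` and the interpretation of `skip` are assumptions, and the sketch reuses `asciify_turkish_chars` from above).

```python
import json

# Hypothetical sketch of txt_to_input_output; the real function may differ.
def txt_to_input_output(fp, skip=500_000, split="all", out_path="data.jsonl"):
    """Read a Turkish text file and write (undiacritized, original) pairs as JSONL."""
    # `split` is kept only to mirror the documented signature.
    with open(fp, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src):
            if i < skip:            # assumed meaning: skip the first `skip` lines
                continue
            line = line.strip()
            if not line:
                continue
            record = {
                "input": asciify_turkish_chars(line),  # diacritics stripped
                "output": line,                        # original text as the label
            }
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```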
The project uses the PEFT library to reduce the number of trainable parameters of the base model. If a checkpoint is provided, the model is loaded from that checkpoint (our pre-trained model for the task of diacritic restoration); otherwise, a fresh PEFT model is initialized on top of the `google/byt5-small` base model.
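The exact PEFT setup is not spelled out in this README; the snippet below is only a sketch of loading the base model and attaching an adapter, assuming a LoRA configuration with placeholder values (`r`, `lora_alpha`, `target_modules`) rather than the ones used in the actual experiments.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration
from peft import LoraConfig, PeftModel, get_peft_model

CHECKPOINT = None  # path to a previously trained adapter, if resuming

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
base_model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

if CHECKPOINT is not None:
    # Resume from the pre-trained diacritic-restoration checkpoint.
    model = PeftModel.from_pretrained(base_model, CHECKPOINT, is_trainable=True)
else:
    # Attach a fresh adapter on top of the frozen base model.
    lora_config = LoraConfig(
        r=16,                       # placeholder rank
        lora_alpha=32,              # placeholder scaling factor
        target_modules=["q", "v"],  # T5/ByT5 attention projections
        task_type="SEQ_2_SEQ_LM",
    )
    model = get_peft_model(base_model, lora_config)

model.print_trainable_parameters()  # shows the reduced trainable parameter count
```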
The model is trained using the prepared dataset. The notebook includes the training loop and evaluation metrics.
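The full training loop lives in the notebook; the sketch below only shows its general shape, assuming `mask_label` yields dictionaries of `input_ids`, `attention_mask`, and `labels` tensors with padded label positions set to -100, and that `train_data`, `LR`, and `BATCH_SIZE` are defined as in the notebook (`NUM_EPOCHS` is a placeholder name).

```python
import torch
from torch.optim import AdamW

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = AdamW(model.parameters(), lr=LR)

model.train()
for epoch in range(NUM_EPOCHS):
    for batch in mask_label(train_data, batch_size=BATCH_SIZE):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # loss ignores label positions masked to -100
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```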
The project achieves promising results in restoring diacritics for Turkish text. The test result can be seen on the official Kaggle competition page here.
Potential areas for future improvement include:
- Experimenting with different transformer architectures
- Fine-tuning the hyperparameters further
- Expanding the dataset with more diverse Turkish text
Feel free to reach out to the authors for any questions or collaboration opportunities related to this project.