
Tokenizer and Model Training Pipeline

Introduction

This repository contains the code for training custom tokenizers and language models with the Hugging Face Transformers library. The tokenizers and models are trained on text data provided in CSV format, and training involves two main steps: tokenizer training and model training. The trained models can then be used for text generation and inference tasks.

Authors

Muhammad Ali Abbas, Wamiq Raza

Requirements

pip install -r requirements.txt

Getting Started

  1. Clone the repository:

git clone https://github.com/m-aliabbas/papia_language_modeling.git
cd papia_language_modeling

  2. Create the Conda environment from the YAML file:

conda env create -f environment.yml

Prepare the Data

Ensure that you have your text data in CSV format. The CSV file should contain a column named "text" that holds the text data for training; the code inside DataLoader can be used to read and prepare it. A quick format check is shown below.
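
For reference, a minimal sanity check of the expected format (the file name used here is only an example; point it at your own CSV):

import pandas as pd

df = pd.read_csv("data/papia_corpus.csv")  # hypothetical path; use your own file
assert "text" in df.columns, "the training CSV must contain a 'text' column"
print(df["text"].head())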

Step 1: Tokenizer Training

The first step is to train the tokenizer on the text data. To do this, run the following command:

python run_tokenizer_training.py
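
For orientation, the following is a minimal sketch of what this step amounts to, assuming a BERT-style WordPiece tokenizer (matching the CustomBertTokenizer class in the diagram below). The paths, vocabulary size, and special tokens are illustrative assumptions, not the script's actual defaults:

import os
import pandas as pd
from tokenizers import BertWordPieceTokenizer

CSV_PATH = "data/papia_corpus.csv"   # assumed input CSV with a "text" column
TOKENIZER_DIR = "tokenizer/"         # assumed output directory

# Dump the "text" column to a plain-text file, one example per line,
# because the tokenizer trainer consumes raw text files.
texts = pd.read_csv(CSV_PATH)["text"].dropna().astype(str)
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts))

# Train a WordPiece tokenizer on the corpus and save its vocabulary.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
os.makedirs(TOKENIZER_DIR, exist_ok=True)
tokenizer.save_model(TOKENIZER_DIR)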

Step 2: Model Training

The next step is to train the language model using the custom tokenizer. To do this, run the following command:

python run_model_training.py
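
For orientation, here is a minimal sketch of this step, assuming a GPT-2 style causal language model trained from scratch with the custom tokenizer (matching the GPT2Pipeline class in the diagram below). The paths and hyperparameters are illustrative assumptions, not the repository's actual configuration:

import pandas as pd
from datasets import Dataset
from transformers import (BertTokenizerFast, GPT2Config, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

TOKENIZER_DIR = "tokenizer/"   # assumed: output of run_tokenizer_training.py
MAX_LENGTH = 128               # assumed context length

tokenizer = BertTokenizerFast.from_pretrained(TOKENIZER_DIR)

# Build a Hugging Face Dataset from the "text" column of the CSV.
texts = pd.read_csv("data/papia_corpus.csv")["text"].dropna().astype(str).tolist()
dataset = Dataset.from_dict({"text": texts})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            max_length=MAX_LENGTH,
                            return_token_type_ids=False),
    batched=True, remove_columns=["text"])

# A fresh GPT-2 model sized to the custom vocabulary.
config = GPT2Config(vocab_size=tokenizer.vocab_size, n_positions=MAX_LENGTH)
model = GPT2LMHeadModel(config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="model/", num_train_epochs=3,
                           per_device_train_batch_size=8),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("model/")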

Step 3: Model Inference

After training the model, you can perform text generation and inference tasks. To do this, run the following command:

python infer_model.py 
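
For orientation, a minimal sketch of inference with the trained model and tokenizer; the directories and the prompt are illustrative assumptions:

from transformers import BertTokenizerFast, GPT2LMHeadModel

tokenizer = BertTokenizerFast.from_pretrained("tokenizer/")  # assumed paths from
model = GPT2LMHeadModel.from_pretrained("model/")            # the previous steps

prompt = "Bon dia"  # example prompt only; replace with your own text
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a continuation from the trained language model.
output_ids = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))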

Acknowledgments

This project uses the Hugging Face Transformers library for training and inference. For more information, please visit the Hugging Face Transformers documentation.

Diagrams

The pipeline is organized around the following classes:

Data preparation and tokenizer training:

- DataReader: cleaned_csv_path, text_list
- DataCleaning: clean_text(), remove_single_word_lines(), clean_text_file(), driver()
- Tokenizer_Training: tokenizer, run()

Tokenizer and model pipeline:

- CustomTokenizer: base_model_path
- CustomBertTokenizer: tokenizer_dir, vocab_size, max_length
- GPT2Pipeline: tokenizer_dir, tokenizer, model_name, num_epochs, batch_size, train()
- Model_Training: config, pipeline, run()
- Model_Infer: config, pipeline, run()

DataReader feeds DataCleaning, which feeds Tokenizer_Training. CustomTokenizer feeds CustomBertTokenizer, which feeds GPT2Pipeline; GPT2Pipeline is used by Model_Training and then Model_Infer.
