setup:

```python
from transformers import AutoTokenizer

dataset_name = "roneneldan/TinyStories"
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
seq_len = 1024
batch_size = 1024
num_batches = 100
```
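For reference, a minimal sketch of how the on-the-fly variant might be wired up, continuing from the setup above and assuming a Hugging Face `datasets` corpus fed through a PyTorch `DataLoader`. The `collate` function and timing loop are assumptions about the harness, not the exact benchmark code; TinyStories exposes its raw text under the `"text"` field.

```python
import time

from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset(dataset_name, split="train")
tokenizer.pad_token = tokenizer.eos_token  # the Llama-2 tokenizer ships without a pad token

def collate(examples):
    # Tokenization happens here, inside each DataLoader worker (on-the-fly).
    return tokenizer(
        [ex["text"] for ex in examples],
        max_length=seq_len,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )

loader = DataLoader(dataset, batch_size=batch_size, num_workers=8, collate_fn=collate)

# Time num_batches batches and report the mean.
it = iter(loader)
start = time.perf_counter()
for _ in range(num_batches):
    next(it)
print(f"time per batch: {(time.perf_counter() - start) / num_batches:.4f}s")
```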
| Tokenization | num_workers | Time per batch (s) |
|---|---|---|
| pre-tokenization | 1 | 0.8321 |
| pre-tokenization | 8 | 0.3444 |
| on-the-fly tokenization | 1 | 2.0477 |
| on-the-fly tokenization | 8 | 0.2946 |
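For contrast, the pre-tokenization path amounts to tokenizing the corpus once, writing the token IDs to disk, and reading fixed-length slices back at load time. A sketch under assumed details (a flat `.npy` file of token IDs; the filename is hypothetical):

```python
import numpy as np
from torch.utils.data import Dataset

class PreTokenizedDataset(Dataset):
    """Serves seq_len-sized windows from a flat on-disk array of token IDs."""

    def __init__(self, path="tinystories_tokens.npy", seq_len=1024):
        # Memory-map so each worker reads only the bytes it actually indexes.
        self.tokens = np.load(path, mmap_mode="r")
        self.seq_len = seq_len

    def __len__(self):
        return len(self.tokens) // self.seq_len

    def __getitem__(self, i):
        start = i * self.seq_len
        # Copy out of the memory map into a regular array for collation.
        return np.array(self.tokens[start : start + self.seq_len])
```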
The reason on-the-fly tokenization is faster once num_workers is large is disk I/O. With enough workers, the CPU cost of tokenizing is parallelized away and reading from disk becomes the bottleneck. On-the-fly tokenization reads raw text, whereas pre-tokenization reads stored token IDs; since one word usually corresponds to multiple (~2) tokens and each token ID occupies several bytes on disk, the pre-tokenized data is larger, so on-the-fly tokenization reads fewer bytes per batch.
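A back-of-envelope check of that claim (the ~6-byte average word and int32 storage are assumptions, not measured values):

```python
avg_word_bytes = 6      # assumed: ~5 characters plus a space, as UTF-8 text
tokens_per_word = 2     # from the ~2 tokens-per-word estimate above
bytes_per_token_id = 4  # assumed: token IDs stored as int32

raw_text = avg_word_bytes                             # bytes/word read on-the-fly
pre_tokenized = tokens_per_word * bytes_per_token_id  # bytes/word read pre-tokenized
print(raw_text, pre_tokenized)  # 6 vs 8: raw text is the smaller read
```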