Move tokenizers to new `olmo_data` package. #645

2015aroras · 2024-07-08T21:51:23Z

Issue: Tokenizer.from_checkpoint assumes that tokenizers are either on the HF Hub or at a path relative to the current directory. This assumption doesn't hold when OLMo is installed from pip as ai2-olmo. More generally, we don't have a clear mechanism for putting data files in our repo.

Fix: This PR creates an olmo_data package whose subdirectories can hold different types of data (e.g. tokenizers and hf_datasets). Tokenizers are moved under olmo_data, and Tokenizer.from_checkpoint is updated to check local paths first, then olmo_data, then the HF Hub.
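The lookup order described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `resolve_tokenizer_source` and the `packaged_lookup` callback are hypothetical names, and the real Tokenizer.from_checkpoint may structure the checks differently.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_tokenizer_source(
    identifier: str,
    packaged_lookup: Callable[[str], Optional[Path]],
) -> str:
    """Say where a tokenizer identifier would be loaded from (hypothetical helper).

    `packaged_lookup` stands in for whatever olmo_data offers to find a
    bundled file; it returns a path if the file ships with the package,
    otherwise None.
    """
    # 1. A file that exists on disk takes priority.
    if Path(identifier).is_file():
        return "local"
    # 2. Otherwise look inside the installed data package.
    if packaged_lookup(identifier) is not None:
        return "olmo_data"
    # 3. Fall back to treating the identifier as an HF Hub name.
    return "hf_hub"
```

Checking the local path first means existing scripts that pass a relative path keep working, while pip installs (which have no repo checkout) fall through to the packaged copy.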

This change sets up the foundations for adding HF datasets to our repo, so that we don't have to make network calls during training runs.

Fixes #633

epwalsh

LGTM, only one minor comment

epwalsh · 2024-07-08T22:55:06Z

olmo_data/data.py

+
+
+@contextmanager
+def get_data_path(data_rel_path: str) -> Iterator[Path]:


I think the correct return type is:

Suggested change

-def get_data_path(data_rel_path: str) -> Iterator[Path]:
+def get_data_path(data_rel_path: str) -> Generator[Path, None, None]:

The Python docs say:

If your generator will only yield values, set the SendType and ReturnType to None:
...
Alternatively, annotate your generator as having a return type of either Iterable[YieldType] or Iterator[YieldType]:

🤷

Ok fair enough!

I'm fine with it either way. Your way is less verbose so that's a plus.
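For reference, a minimal sketch of a context manager like the one under review, assuming it wraps the standard `importlib.resources` API (Python 3.9+; the `importlib_resources` backport exposes the same `files` and `as_file` functions). The body of the real olmo_data/data.py is not shown in this thread, and the `package` parameter is added here only so the sketch can run without olmo_data installed.

```python
from contextlib import contextmanager
from importlib.resources import as_file, files
from pathlib import Path
from typing import Iterator


@contextmanager
def get_data_path(data_rel_path: str, package: str = "olmo_data") -> Iterator[Path]:
    """Yield a real filesystem path for a data file bundled in `package`.

    The return annotation could equally be Generator[Path, None, None];
    both forms are accepted for a generator that only yields values.
    """
    # as_file() extracts the resource to a temporary file if the package is
    # not on a plain filesystem (e.g. a zip), and cleans up on exit.
    with as_file(files(package) / data_rel_path) as path:
        yield path
```

A caller would use it as `with get_data_path("tokenizers/some-tokenizer.json") as path: ...`, where the relative path shown is a placeholder rather than a file known to exist in the package.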

OyvindTafjord · 2024-07-19T06:19:48Z

@2015aroras Could we add something which doesn't break scripts/convert_olmo_to_hf_new.py for old configs? (i.e., this stuff)

2015aroras added 10 commits July 1, 2024 13:56

Move tokenizers into a package under data/

c679389

Read tokenizer from olmo_data package if it is present there

a7c1d0a

Update pyproject.toml with olmo_data info

3b2102c

Move data files to olmo_data

73e217f

Wrap data files/dirs access with convenient methods

56de48c

Load tokenizer from olmo_data using olmo_data methods

75bef93

Make Tokenizer.from_file support Path objects

2998074

Run formatters

d4305f9

Update CHANGELOG

f0de349

Add importlib_resources dependency

8fd90d8

2015aroras marked this pull request as ready for review July 8, 2024 21:57

2015aroras requested review from dirkgr and epwalsh July 8, 2024 21:57

epwalsh approved these changes Jul 8, 2024


2015aroras added 2 commits July 8, 2024 16:16

Change typing of get_data_path

d005b16

Merge branch 'main' into shanea/tokenizer-package-data

8ddfe79

2015aroras merged commit cbc7c25 into main Jul 8, 2024
12 checks passed

2015aroras deleted the shanea/tokenizer-package-data branch July 8, 2024 23:22
