Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move tokenizers to new olmo_data package. #645

Merged
merged 12 commits into from
Jul 8, 2024

Conversation

2015aroras
Copy link
Collaborator

Issue: Tokenizer.from_checkpoint assumes that tokenizers are in HF or in a path relative to the current directory. This assumption doesn't hold when OLMo is install from pip as ai2-olmo. More generally, we don't have a clear mechanism for putting data files in our repo.

Fix: This PR creates an olmo_data package, of which the subdirectories can correspond to various types of data (e.g. tokenizers and hf_datasets). Tokenizers are moved under olmo_data, and Tokenizer.from_checkpoint is updated to look at local paths, then olmo_data, then HF Hub.

This change sets up the foundations for adding HF datasets to our repo, so that we don't have to make network calls during training runs.

Fixes #633

@2015aroras 2015aroras marked this pull request as ready for review July 8, 2024 21:57
@2015aroras 2015aroras requested review from dirkgr and epwalsh July 8, 2024 21:57
Copy link
Member

@epwalsh epwalsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only one minor comment



@contextmanager
def get_data_path(data_rel_path: str) -> Iterator[Path]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the correct return type is:

Suggested change
def get_data_path(data_rel_path: str) -> Iterator[Path]:
def get_data_path(data_rel_path: str) -> Generator[Path, None, None]:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python docs say

If your generator will only yield values, set the SendType and ReturnType to None:
...
Alternatively, annotate your generator as having a return type of either Iterable[YieldType] or Iterator[YieldType]:

🤷

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok fair enough!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm with it either way. Your way is less verbose so that's a plus.

@2015aroras 2015aroras merged commit cbc7c25 into main Jul 8, 2024
12 checks passed
@2015aroras 2015aroras deleted the shanea/tokenizer-package-data branch July 8, 2024 23:22
@OyvindTafjord
Copy link
Contributor

@2015aroras Could we add something which doesn't break scripts/convert_olmo_to_hf_new.py for old configs? (i.e., this stuff)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tokenizer with relative path import fails when using olmo as pip library
3 participants