Move tokenizers to new olmo_data package. #645
LGTM, only one minor comment
olmo_data/data.py
@contextmanager
def get_data_path(data_rel_path: str) -> Iterator[Path]:
I think the correct return type is:
- def get_data_path(data_rel_path: str) -> Iterator[Path]:
+ def get_data_path(data_rel_path: str) -> Generator[Path, None, None]:
The Python docs say:
If your generator will only yield values, set the SendType and ReturnType to None:
...
Alternatively, annotate your generator as having a return type of either Iterable[YieldType] or Iterator[YieldType]:
🤷
Ok fair enough!
I'm fine with it either way. Your way is less verbose, so that's a plus.
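For reference, the two annotations discussed in this thread can be compared side by side. This is a minimal sketch, not the code from olmo_data/data.py: the function names and the `data_root` directory are hypothetical placeholders. Both forms are accepted for a function decorated with `@contextmanager` that only yields a value.

```python
from contextlib import contextmanager
from pathlib import Path
from typing import Generator, Iterator

# Hypothetical root directory, used only for illustration.
DATA_ROOT = Path("data_root")


@contextmanager
def data_path_as_iterator(data_rel_path: str) -> Iterator[Path]:
    # Shorter annotation: only the yielded type is spelled out.
    yield DATA_ROOT / data_rel_path


@contextmanager
def data_path_as_generator(data_rel_path: str) -> Generator[Path, None, None]:
    # Explicit annotation: yield type, send type, return type.
    yield DATA_ROOT / data_rel_path
```

At runtime the two behave identically; the difference is only in how much of the generator's signature the annotation spells out.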
@2015aroras Could we add something which doesn't break |
Issue: Tokenizer.from_checkpoint assumes that tokenizers are in HF or in a path relative to the current directory. This assumption doesn't hold when OLMo is installed from pip as ai2-olmo. More generally, we don't have a clear mechanism for putting data files in our repo.
Fix: This PR creates an olmo_data package, whose subdirectories can correspond to various types of data (e.g. tokenizers and hf_datasets). Tokenizers are moved under olmo_data, and Tokenizer.from_checkpoint is updated to look at local paths, then olmo_data, then HF Hub.
This change sets up the foundations for adding HF datasets to our repo, so that we don't have to make network calls during training runs.
Fixes #633
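The lookup order described above (local path, then olmo_data, then HF Hub) could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `resolve_tokenizer_source` and `package_data_root` are hypothetical names, and a plain directory stands in for however the olmo_data package actually locates its files.

```python
from pathlib import Path
from typing import Tuple


def resolve_tokenizer_source(identifier: str, package_data_root: Path) -> Tuple[str, str]:
    """Return (source, location) for a tokenizer identifier.

    Hypothetical helper: tries a local path first, then a file under the
    packaged data directory, and otherwise treats the identifier as an
    HF Hub name. No network call is made here.
    """
    local_path = Path(identifier)
    if local_path.is_file():
        return ("local", str(local_path))

    packaged_path = package_data_root / identifier
    if packaged_path.is_file():
        return ("olmo_data", str(packaged_path))

    # Fall back to treating the identifier as a hub repository name.
    return ("hf_hub", identifier)
```

Checking the packaged directory before the hub is what lets a pip-installed copy find its bundled tokenizers without depending on the current working directory.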