HF dataset loading optimizations #623
Conversation
LGTM, only minor comments
olmo/eval/downstream.py
Outdated
from ..tokenizer import Tokenizer

log = logging.getLogger(__name__)


def load_dataset(path, name, split, datasets_cache_dir: Optional[str] = None):
nit: missing type hints
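For reference, a fully type-hinted version of that signature might look roughly like the sketch below; the parameter and return types are assumptions inferred from how the function is called, not taken from the final code.

```python
from typing import Optional, Union

import datasets


def load_dataset(
    path: str,
    name: Optional[str],
    split: str,
    datasets_cache_dir: Optional[str] = None,
) -> Union[datasets.Dataset, datasets.DatasetDict]:
    # Body elided; only the annotations are the point of this sketch.
    ...
```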
-    ds_eval_dataset = task_class(tokenizer=tokenizer, **task_kwargs)  # type: ignore
+    ds_eval_dataset = task_class(
+        tokenizer=tokenizer, datasets_cache_dir=train_config.hf_datasets_cache_dir, **task_kwargs
+    )  # type: ignore
I think this # type: ignore is probably on the wrong line now.
Intellisense puts it there and the type checking passes with it there (and fails without it there).
olmo/eval/downstream.py
Outdated
try:
    return load_dataset_from_disk(path, name, split, datasets_cache_dir)
except FileNotFoundError:
    log.info(
        "Path %s name %s split %s not present in local dir %s, loading from online",
        path,
        name,
        split,
        datasets_cache_dir,
    )
    # Barrier here to stop the case where FS rank 0 saves the dataset to disk before some non-zero
    # ranks try getting the dataset from disk. This would cause those non-zero ranks to bypass
    # the next barrier and cause rank 0 to be stuck at that barrier.
    barrier()
This seems a little sketch, but I guess it's fine as long as all ranks either have or do not have the dataset on disk, which is probably a reasonable assumption?
9f2d9d2
I've changed the logic to be simpler and more robust against this case. Now the idea is that FS local rank 0 does its thing first (load from disk or online and cache), then every other rank follows afterwards. This is less optimized but simpler and safer imo.
I like that much better ✅
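As a rough sketch of the "FS local rank 0 goes first" pattern described above; the helper names (`barrier`, `fs_local_rank`) and the directory layout are stand-ins here, not necessarily the exact helpers used in this repo.

```python
import os

import datasets


def load_hf_dataset_cached(path, name, split, cache_dir, fs_local_rank, barrier):
    """Load an HF dataset, letting filesystem-local rank 0 populate the on-disk cache first.

    `fs_local_rank` and `barrier` are stand-ins for the distributed helpers the
    training code would normally supply.
    """
    local_path = os.path.join(cache_dir, path, str(name), split)

    def _load():
        try:
            # Fast path: a previously saved copy already exists on local disk.
            return datasets.load_from_disk(local_path)
        except FileNotFoundError:
            # Slow path: fetch from the Hub and save a standalone local copy.
            ds = datasets.load_dataset(path, name, split=split)
            ds.save_to_disk(local_path)
            return ds

    if fs_local_rank == 0:
        ds = _load()
        barrier()  # release the other ranks once the cache is populated
    else:
        barrier()  # wait for rank 0 to finish downloading/saving
        ds = _load()
    return ds
```

The property this buys is that every rank hits `barrier()` exactly once per dataset, which matches the "Edit" note in the PR description below.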
Issue: Loading HF datasets for downstream evals has been slowing down the start of our runs.

Fix: This PR tackles these issues by:
- Using the `load_from_disk` and `save_to_disk` dataset methods to save a local copy of the datasets. These datasets are no longer associated with the online versions, and so loading them from disk does not result in network traffic (as far as I can tell).
- Using `barrier()` to coordinate the ranks, but this seems to be less problematic. Note that barriers are invoked each time a dataset needs to be loaded (on the positive side, if all datasets are already in the cache then there are no barrier invocations).

In a 2 GPU interactive session, startup goes from 10 mins without this new cache to 1 min with the cache populated beforehand. I have already populated a copy of the cache at /net/weka/reviz/hf_datasets_cache.

I haven't yet verified that evaluation correctness is unaffected by this change (no GPUs were available), but I don't expect it to negatively impact eval correctness.

Edit: Now the barrier will always be invoked exactly once per dataset.
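For illustration, pre-populating such a cache ahead of a run might look roughly like the sketch below. The task names are examples only (the real list comes from the downstream eval config), and the directory layout under the cache dir is an assumption.

```python
import os

import datasets

CACHE_DIR = "/net/weka/reviz/hf_datasets_cache"  # path mentioned in the description above

# Example tasks only; the actual set would come from the eval config.
TASKS = [
    ("piqa", None, "validation"),
    ("hellaswag", None, "validation"),
]

for path, name, split in TASKS:
    ds = datasets.load_dataset(path, name, split=split)
    ds.save_to_disk(os.path.join(CACHE_DIR, path, str(name), split))
```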