
Commit

Merge pull request #623 from allenai/shanea/hf-save-to-disk-2
HF dataset loading optimizations
2015aroras authored Jun 14, 2024
2 parents a33caa9 + b5bd9ff commit 41ed20a
Showing 6 changed files with 229 additions and 32 deletions.
CHANGELOG.md (2 additions & 1 deletion)
@@ -11,7 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 - Added clipping fix to `Optimizer` class to make it work with FSDP `no_shard` and DDP.
 - Added tests to compare grad norm differences between torch optimizer and clipping and OLMo optimizer and clipping on both CPU and GPU.
-- Expose memmap dtype in data config
+- Expose memmap dtype in data config
+- Added caching to disk of HF datasets used in downstream evals
 
 ### Changed

olmo/config.py (5 additions & 0 deletions)
@@ -1098,6 +1098,11 @@ class TrainConfig(BaseConfig):
     Whether to use the fused CE loss function from `flash-attn`.
     """
 
+    hf_datasets_cache_dir: Optional[str] = None
+    """
+    Path to cache directory of HF datasets saved with `datasets.save_to_disk`.
+    """
+
     @property
     def autocast_precision(self) -> torch.dtype:
         if self.precision == "amp_bf16":
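
The new `hf_datasets_cache_dir` option points at a directory of datasets previously saved with `datasets.save_to_disk`, so downstream-eval data can be read straight from local Arrow files instead of being fetched from the Hugging Face Hub at startup. The option is consumed by the eval task classes changed in this PR (the remaining files did not load in this view), so the sketch below is only an illustration of the pattern; the helper name `load_eval_dataset`, the `<cache_dir>/<path>/<name>/<split>` layout, and the network fallback are assumptions, not the PR's code.

```python
# Hypothetical sketch only: the real loading logic lives in the eval task
# classes modified by this PR. Helper name, directory layout, and fallback
# behaviour are assumptions made for illustration.
import os
from typing import Optional

import datasets


def load_eval_dataset(path: str, name: Optional[str], split: str, cache_dir: Optional[str]):
    """Load an HF dataset, preferring a local copy saved with `save_to_disk`."""
    if cache_dir is not None:
        local_dir = os.path.join(cache_dir, path, name or "none", split)
        if os.path.isdir(local_dir):
            # Reads the saved Arrow files directly; no Hub access required.
            return datasets.load_from_disk(local_dir)
    # Fall back to the usual network-backed loading path.
    return datasets.load_dataset(path, name, split=split)
```

`load_from_disk` reads the serialized Arrow data directly and never contacts the Hub, which is what makes this useful on compute nodes without internet access.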
olmo/eval/__init__.py (3 additions & 1 deletion)
@@ -32,7 +32,9 @@ def build_downstream_evaluator(
     task_class = label_to_task_map[eval_cfg.label]
     if isinstance(task_class, tuple):
         task_class, task_kwargs = task_class
-    ds_eval_dataset = task_class(tokenizer=tokenizer, **task_kwargs)  # type: ignore
+    ds_eval_dataset = task_class(
+        tokenizer=tokenizer, datasets_cache_dir=train_config.hf_datasets_cache_dir, **task_kwargs
+    )  # type: ignore
     data_config = eval_cfg.data
     if is_unit_test:
         ds_eval_sampler = None
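
In `build_downstream_evaluator`, the cache directory is simply threaded from `TrainConfig` into each task constructor as `datasets_cache_dir`. For the cache to be useful it has to be populated ahead of time on a machine with network access; a hypothetical one-off script (same assumed layout as the sketch above, with illustrative dataset choices) might look like this:

```python
# Hypothetical pre-population script, run once on a node with network access.
# Dataset choices and directory layout mirror the sketch above and are
# illustrative, not the PR's code.
import os

import datasets

CACHE_DIR = "/data/hf_datasets_cache"  # later set as hf_datasets_cache_dir in the train config
EVAL_SETS = [
    ("piqa", None, "validation"),       # (path, name, split)
    ("hellaswag", None, "validation"),
]

for path, name, split in EVAL_SETS:
    ds = datasets.load_dataset(path, name, split=split)
    out_dir = os.path.join(CACHE_DIR, path, name or "none", split)
    ds.save_to_disk(out_dir)  # save_to_disk creates the directory if needed
```

The resulting directory is then passed to training by setting `hf_datasets_cache_dir` in the config.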