Efficient data loading solution for large datasets
No due date
66% complete
Enable an efficient data loading solution for LLM training.
Short term:
- Enable the Alpaca dataset to produce the correct data on each rank (sharding works under data parallelism)
- Make sure it is performant when training on a cluster
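
As a rough illustration of what "correct data on each rank" could mean for a map-style dataset, here is a minimal pure-Python sketch. The function name `shard_indices` and its parameters are hypothetical and not part of this project; the approach (shuffle with a shared seed, pad to an even length, then stride by rank) is similar in spirit to PyTorch's `DistributedSampler`, but it is not that implementation.

```python
import math
import random


def shard_indices(num_samples, rank, world_size, seed=0, epoch=0, shuffle=True):
    """Return the sample indices that one rank should read in a given epoch.

    Hypothetical helper for illustration only.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")

    indices = list(range(num_samples))
    if shuffle:
        # Every rank uses the same seed, so all ranks see the same permutation.
        random.Random(seed + epoch).shuffle(indices)

    # Pad by repeating samples from the front so every rank gets the same count.
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    pad = total - len(indices)
    if pad:
        repeats = math.ceil(pad / len(indices))
        indices += (indices * repeats)[:pad]

    # Rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total:world_size]
```

With 10 samples and 4 ranks, each rank receives 3 indices and the shards together cover every sample, with 2 samples repeated as padding. Changing `epoch` reshuffles the permutation while keeping all ranks consistent with each other.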
Long term:
- For datasets too large to load entirely into CPU memory, use the iterable/streaming feature
- For iterable/streaming datasets, build indices and a sampler to load data correctly
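
One possible reading of "build indices and sampler" for the streaming case is sketched below, assuming the data lives in a JSONL file (one JSON record per line). That file format and the names `build_line_index` and `iter_rank_records` are assumptions made for illustration, not something this milestone specifies. The index stores byte offsets only, so records are read on demand rather than loaded into memory up front.

```python
import json


def build_line_index(path):
    """Scan the file once and record the byte offset of each non-empty line."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            if line.strip():
                offsets.append(pos)
            pos += len(line)
    return offsets


def iter_rank_records(path, offsets, rank, world_size):
    """Lazily yield the records assigned to one rank (every world_size-th line)."""
    with open(path, "rb") as f:
        for i in range(rank, len(offsets), world_size):
            f.seek(offsets[i])
            yield json.loads(f.readline())
```

For a 7-record file and 2 ranks, rank 0 would read records 0, 2, 4, 6 and rank 1 would read records 1, 3, 5. Note that the shards are uneven in this sketch; in real data-parallel training, ranks that run out of batches at different times can stall collective operations, so a production sampler would likely pad or drop the remainder.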