
Enforce passing item_loader when customizing underlying storage format #296

Merged · 6 commits into main on Aug 5, 2024

Conversation

tchaton (Collaborator) commented Aug 1, 2024

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Until now, we had a hack where 1D tensors were handled differently and stored as a contiguous array. I have seen several users complain about this magic and unexpected behaviour.

WARNING: This PR is a breaking change for LLM workflows using the TokensLoader. Now, the item_loader needs to be passed to optimize or to the Cache directly, to inform the underlying storage that it must be handled differently.

If no item_loader is passed during optimization, it defaults to the PyTree handler.

For LLM tokens, here is the breaking API change.

Before

from litdata.streaming.item_loader import TokensLoader

# The item_loader was only provided at read time.
optimize(...)

dataset = StreamingDataset(item_loader=TokensLoader(...))

Now

from litdata.streaming.item_loader import TokensLoader

# The item_loader must now also be passed at write time,
# so optimize() stores the tokens in the matching format.
optimize(..., item_loader=TokensLoader())

dataset = StreamingDataset(item_loader=TokensLoader(...))

Fixes #294

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@tchaton tchaton requested a review from awaelchli as a code owner August 1, 2024 19:54
lantiga (Contributor) left a comment

Much cleaner and more robust. IMO the change is not confusing; one just has to use the item loader on both sides.
What happens when I (erroneously) omit it?

tchaton (Collaborator, Author) commented Aug 2, 2024

> What happens when I (erroneously) omit it?

It would break and tell you that the wrong item loader has been provided. Ok, waiting on @awaelchli's review and ideas ;)

Review thread on src/litdata/processing/functions.py (resolved)
tchaton merged commit 518a1c3 into main on Aug 5, 2024
26 checks passed
tchaton deleted the prevent_hack branch on August 5, 2024 08:31
AugustDev commented:

TokensLoader accepts block_size. What if each of my samples is a dictionary where some of the fields are token sequences? Should block_size be the length of the longest sequence?

tchaton (Collaborator, Author) commented Aug 13, 2024

Hey @AugustDev. TokensLoader doesn't work on dictionaries. It needs each sample to be a single 1D tensor.
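To illustrate the answer above: since TokensLoader expects each optimized sample to be one 1D token tensor, a dictionary of token sequences would have to be flattened into a single sequence before writing. A minimal sketch of one way to do this; the flatten_fields helper and the SEP separator id are hypothetical choices for illustration, not part of the litdata API:

```python
# Hypothetical sketch: collapse a dict of token sequences into one flat
# 1D sequence so it can be stored as a single tensor for TokensLoader.
# SEP is an illustrative separator id; a real tokenizer would reserve one.
SEP = -1

def flatten_fields(sample: dict) -> list:
    """Concatenate each token-sequence field, separated by SEP, in key order."""
    flat = []
    for key in sorted(sample):  # fixed field order so reads are deterministic
        flat.extend(sample[key])
        flat.append(SEP)
    return flat

tokens = flatten_fields({"prompt": [101, 7, 8], "answer": [9, 102]})
# → [9, 102, -1, 101, 7, 8, -1]  ("answer" sorts before "prompt")
# In practice, this flat list would be wrapped in a 1D torch tensor
# inside the function passed to optimize().
```

With this shape, block_size then refers to the length of the fixed-size windows read back from the flat token stream, not to any individual field.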

Linked issue this PR may close: StreamingDataset cannot load one-element, one-dimensional tensors