Clear Examples of use with different dataset types and code changes. #409

Open
Woodr7 opened this issue Nov 4, 2024 · 2 comments
Labels
enhancement New feature or request


Woodr7 commented Nov 4, 2024

🚀 Feature

The README should include examples, or links to examples, of how to reformat a dataset, starting with imagenet-tiny, so that it works well with LitData. How can I take a file structure where each image sits in a folder named after its class and convert it so that, when it is processed with LitData, all of the relevant information is contained in the new structure? Then, how do I need to change the code I used to train before in order to use the newly optimized LitData dataset?

Motivation

This is needed to make LitData self-serve. There is no good plain-English example of going from one simple, understandable dataset type and codebase to an optimized LitData dataset and the new codebase needed to use that dataset and train the same model 20x faster. We will see more adoption if there is an example of this for as many dataset types as possible.

Pitch

Start with the existing imagenet-tiny. Show how to go from the current file structure to the file structure necessary to run ld.optimize while maintaining all of the necessary info. Then show an example of how to change the training code in order to take advantage of the optimized cloud dataset.

Woodr7 added the enhancement label Nov 4, 2024

github-actions bot commented Nov 4, 2024

Hi! Thanks for your contribution! Great first issue!

Collaborator

tchaton commented Nov 4, 2024

Hey @Woodr7

You could do something like this:

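from torchvision.datasets import ImageFolder
from litdata import optimize

# ImageFolder maps each class-named subfolder to an integer label.
dataset = ImageFolder("/teamspace/s3_connections/imagenet-tiny/train")

def fn(index):
    # Each optimized sample is the (image, class_index) pair from ImageFolder.
    return dataset[index]

if __name__ == "__main__":
    optimize(
        fn=fn,
        inputs=list(range(len(dataset))),  # one input per sample index
        output_dir="./optimized_imagenet_tiny/train",
        chunk_bytes="64MB",  # target size of each output chunk
    )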
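
For context on what gets stored: ImageFolder treats every subfolder of the root as a class and assigns integer labels in sorted order of the folder names, so the class information from the directory layout travels with each sample as its label. The following is a rough standard-library sketch of that mapping, not torchvision's actual implementation. The names `index_class_folders` and `IMAGE_EXTENSIONS` are hypothetical and exist only for illustration.

```python
from pathlib import Path

# Assumption: adjust this set to the file types in your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def index_class_folders(root):
    """Return (samples, class_to_idx) for a root/<class_name>/.../<image> layout.

    samples is a list of (file_path, class_index) pairs.
    """
    root = Path(root)
    # Each direct subfolder is a class; sorted names get consecutive labels.
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}

    samples = []
    for name in classes:
        # Search recursively so nested folders under a class are included.
        for path in sorted((root / name).rglob("*")):
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
                samples.append((str(path), class_to_idx[name]))
    return samples, class_to_idx
```

Saving `class_to_idx` next to the optimized output (for example as a JSON file) is one way to keep the label names recoverable later, since the optimized samples only carry the integer label.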

Yes, we will add more examples.
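
On the training side, a minimal sketch of reading the optimized output back, assuming the optimize step above has already written `./optimized_imagenet_tiny/train`. `StreamingDataset` and `StreamingDataLoader` are litdata classes. The wrapper `make_train_loader` and its defaults are hypothetical, and the import is done lazily only so the snippet can be loaded without litdata installed.

```python
def make_train_loader(input_dir, batch_size=64, num_workers=4, collate_fn=None):
    """Build a loader over a dataset previously written by litdata.optimize.

    Hypothetical helper: the name and default values are illustrative only.
    """
    # Lazy import: litdata is only needed when the loader is actually built.
    from litdata import StreamingDataLoader, StreamingDataset

    # Each item is whatever fn returned at optimize time, here (image, label).
    dataset = StreamingDataset(input_dir, shuffle=True)

    # Assumption: the stored image may come back as a PIL image rather than a
    # tensor, so a collate_fn (or a StreamingDataset subclass overriding
    # __getitem__) is likely needed to convert and stack images into a batch.
    return StreamingDataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        collate_fn=collate_fn,
    )
```

In an existing training script this would replace the `ImageFolder` plus `DataLoader` pair, for example `loader = make_train_loader("./optimized_imagenet_tiny/train", collate_fn=my_collate)`, where `my_collate` is a function you supply. The rest of the training loop should not need to change as long as the batches have the same shape as before.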
