Clear Examples of use with different dataset types and code changes. #409

Woodr7 · 2024-11-04T16:09:45Z

🚀 Feature

The README should include examples, or links to examples, of how to reformat a dataset so that it works well with LitData, starting with imagenet-tiny. How can I take a file structure in which each image sits in a folder named after its class and convert it so that, once it is processed with LitData, all of the relevant information is preserved in the new structure? And how do I then need to change the code I used to train before in order to use the newly optimized LitData dataset?
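
To make the starting point concrete, here is a minimal sketch of how a class-per-folder layout maps to (path, label) pairs. It assumes sorted folder names become labels 0..N-1, which is how torchvision's `ImageFolder` assigns them. The helper name `index_class_folders` and the suffix list are hypothetical and for illustration only.

```python
from pathlib import Path

# Assumed set of image extensions; extend it for your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def index_class_folders(root):
    """Return (samples, class_to_idx) for a root/<class_name>/.../<image> layout."""
    root = Path(root)
    # One sub-folder per class; sorting makes the label assignment deterministic.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(class_names)}

    samples = []
    for name in class_names:
        # rglob also finds images nested deeper inside a class folder.
        for path in sorted((root / name).rglob("*")):
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((str(path), class_to_idx[name]))
    return samples, class_to_idx
```

The label is derived from the folder name, so it has to be stored next to each image when the data is converted. Once it is stored that way, the folder structure itself no longer carries information.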

Motivation

This is needed to make LitData self-serve. There is no good plain-English example of going from one simple, understandable dataset type and codebase to an optimized LitData dataset and the new codebase needed to use that dataset and train the same model 20x faster. We will see more adoption if there is an example of this for as many dataset types as possible.

Pitch

Start with the existing imagenet-tiny. Show how to go from the current file structure to the file structure necessary to run ld.optimize while maintaining all of the necessary info. Then show an example of how the training code needs to change in order to take advantage of the optimized cloud dataset.

github-actions · 2024-11-04T16:10:10Z

Hi! Thanks for your contribution! Great first issue!

tchaton · 2024-11-04T16:40:31Z

Hey @Woodr7

You could do something like this:

from torchvision.datasets import ImageFolder
from litdata import optimize

# ImageFolder expects one sub-folder per class and yields (image, class index) pairs.
dataset = ImageFolder("/teamspace/s3_connections/imagenet-tiny/train")

def fn(index):
    # Each returned sample is written into the optimized chunks.
    return dataset[index]

if __name__ == "__main__":
    optimize(
        fn=fn,
        inputs=list(range(len(dataset))),
        output_dir="./optimized_imagenet_tiny/train",
        chunk_bytes="64MB",
    )
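
On the training side, a rough sketch of the change is to swap the `ImageFolder` plus `DataLoader` pair for LitData's `StreamingDataset` and `StreamingDataLoader`, pointed at the `output_dir` written above. This assumes each sample comes back as an (image, label) pair, as returned by `fn`. The helper names `collate_image_batch` and `make_train_loader`, the 64x64 resize, and the batch settings are illustrative assumptions, not part of LitData.

```python
def collate_image_batch(samples):
    """Turn a list of (image, label) samples into batched tensors."""
    # Imports are local so this sketch can be imported without torch installed.
    import torch
    from torchvision.transforms import v2

    # v2 transforms accept both PIL images and tensors, so this should work
    # whichever form the stored image is decoded into.
    transform = v2.Compose([
        v2.ToImage(),
        v2.Resize((64, 64)),  # assumed input size; adjust to your model
        v2.ToDtype(torch.float32, scale=True),
    ])
    images = torch.stack([transform(image) for image, _ in samples])
    labels = torch.tensor([int(label) for _, label in samples])
    return images, labels


def make_train_loader(input_dir="./optimized_imagenet_tiny/train",
                      batch_size=64, num_workers=4):
    """Build a streaming loader over the optimized dataset (hypothetical helper)."""
    from litdata import StreamingDataLoader, StreamingDataset

    dataset = StreamingDataset(input_dir, shuffle=True)
    return StreamingDataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        collate_fn=collate_image_batch,
    )
```

The rest of the training loop should be able to stay as it was: iterate with `for images, labels in make_train_loader():` and feed the batches to the model as before.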

Yes, we will add more examples.

Woodr7 added the enhancement New feature or request label Nov 4, 2024
