mscoco

The mscoco train split is a dataset of 600 thousand image and caption pairs.

Download the metadata

wget https://huggingface.co/datasets/ChristophSchuhmann/MS_COCO_2017_URL_TEXT/resolve/main/mscoco.parquet

That's an 18 MB file. It contains the train split from mscoco.
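Before starting the download, it can help to confirm that the metadata file has the columns the img2dataset command in this document reads. The following is a minimal sketch, assuming pandas with a parquet engine such as pyarrow is installed. `missing_columns` and `check_metadata` are hypothetical helper names, and the column names `URL` and `TEXT` are taken from the `--url_col` and `--caption_col` options used here.

```python
from typing import Iterable, List

# Column names passed to img2dataset via --url_col and --caption_col.
REQUIRED_COLUMNS = ("URL", "TEXT")


def missing_columns(
    columns: Iterable[str], required: Iterable[str] = REQUIRED_COLUMNS
) -> List[str]:
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


def check_metadata(path: str = "mscoco.parquet") -> None:
    """Load the parquet file, print its shape, and fail on missing columns."""
    import pandas as pd  # assumption: pandas and pyarrow are installed

    df = pd.read_parquet(path)
    print(f"{len(df)} rows, columns: {list(df.columns)}")
    missing = missing_columns(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
```

Calling `check_metadata()` from the directory that contains `mscoco.parquet` prints the row count and raises an error if `URL` or `TEXT` is absent.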

Download the images with img2dataset

Run this command. It will download the mscoco dataset as resized images in the webdataset format.

img2dataset --url_list mscoco.parquet --input_format "parquet" \
    --url_col "URL" --caption_col "TEXT" --output_format webdataset \
    --output_folder mscoco --processes_count 16 --thread_count 64 --image_size 256 \
    --enable_wandb True
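Once the command has finished, the output folder should contain tar shards in the webdataset format. The sketch below uses only the Python standard library to list which files belong to each sample in one shard. It assumes the usual webdataset layout, where files that share a basename up to the first dot (for example an image, a caption and a metadata file) form one sample. The function name `group_shard_members` and the shard path in the usage note are illustrative assumptions, not something this document specifies.

```python
import os
import tarfile
from collections import defaultdict
from typing import Dict, List


def group_shard_members(tar_path: str) -> Dict[str, List[str]]:
    """Map each sample key in a shard to the sorted file extensions stored for it."""
    samples: Dict[str, List[str]] = defaultdict(list)
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories and other non-file entries
            # Split e.g. "000000000.jpg" into key "000000000" and extension "jpg".
            key, _, ext = os.path.basename(member.name).partition(".")
            samples[key].append(ext)
    return {key: sorted(exts) for key, exts in samples.items()}
```

For example, `group_shard_members("mscoco/00000.tar")` (assuming a shard with that name exists) returns a dictionary keyed by sample id. If the layout matches the assumption above, each key should list an image extension alongside its caption and metadata files.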

Benchmark

https://wandb.ai/rom1504/img2dataset/reports/MSCOCO--VmlldzoxMjczMTkz

800 samples/s
total: 10 min
output: 20 GB
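As a rough sanity check, the benchmark figures can be related to each other with simple arithmetic. The sketch below assumes the rounded numbers quoted in this document (600 thousand samples, 800 samples per second, 20 GB of output, with 1 GB taken as 10^9 bytes). Both function names are hypothetical, and because the inputs are rounded the results only need to agree roughly with the reported total.

```python
def estimated_minutes(num_samples: int, samples_per_second: float) -> float:
    """Estimate wall-clock minutes needed at a constant throughput."""
    return num_samples / samples_per_second / 60


def average_kilobytes_per_sample(total_gigabytes: float, num_samples: int) -> float:
    """Average output size per sample in kB, taking 1 GB as 1,000,000 kB."""
    return total_gigabytes * 1_000_000 / num_samples


# Rounded figures taken from this document.
print(estimated_minutes(600_000, 800))                      # 12.5
print(round(average_kilobytes_per_sample(20, 600_000), 1))  # 33.3
```

The estimate of 12.5 minutes is in the same range as the reported 10 minute total, and the output size works out to roughly 33 kB per resized sample.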
