Make dataset viewer more flexible in displaying metadata alongside images #7123

egrace479 · 2024-08-23T22:56:01Z

Feature request

To display images with their associated metadata in the dataset viewer, a metadata.csv file is required. In the case of a dataset with multiple subsets, this would require the CSVs to be contained in the same folder as the images since they all need to be named metadata.csv. The request is that this be made more flexible for datasets with multiple subsets to avoid the need to put a metadata.csv into each image directory where they are not as easily accessed.

Motivation

When creating datasets with multiple subsets I can't get the images to display alongside their associated metadata (it's usually one or the other that will show up). Since this requires a file specifically named metadata.csv, I then have to place that file within the image directory, which makes it much more difficult to access. Additionally, it still doesn't necessarily display the images alongside their metadata correctly (see, for instance, this discussion).

On another dataset struggling with a similar issue, it was suggested that I bring this discussion to GitHub (discussion). In that case, it's a mix of data subsets, where some just reference the image URLs, while others actually have the images uploaded. The subsets with uploaded images are not displaying them, but renaming their metadata file to just metadata.csv would diminish the clarity of the construction of the dataset itself (and I'm not entirely convinced it would solve the issue).

Your contribution

I can make a suggestion for one approach to address the issue:

For instance, even if the metadata filename could just end in _metadata.csv or -metadata.csv, that would be very helpful to allow for more flexibility of dataset structure without impacting clarity. I would think that the functionality on the backend looking for metadata.csv could reasonably be adapted to look for such an ending on a filename (maybe also check that it has a file_name column?).
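To make the suggestion concrete, here is a minimal sketch of the check described above. It is not how the backend currently works: the suffix list and both helper names (`is_metadata_candidate`, `has_file_name_column`) are hypothetical and exist only for illustration.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical suffixes proposed in this issue; not current library behavior.
METADATA_SUFFIXES = ("_metadata.csv", "-metadata.csv")


def is_metadata_candidate(path: str) -> bool:
    """Accept the existing name metadata.csv or a name ending in a proposed suffix."""
    name = PurePosixPath(path).name
    return name == "metadata.csv" or name.endswith(METADATA_SUFFIXES)


def has_file_name_column(csv_text: str) -> bool:
    """Check that the CSV header row contains a file_name column."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return "file_name" in header


if __name__ == "__main__":
    sample = "file_name,species\nimg_001.jpg,Danaus plexippus\n"
    print(is_metadata_candidate("subset0/images_metadata.csv"))
    print(has_file_name_column(sample))
```

Checking the header in addition to the filename would help avoid treating an unrelated CSV that happens to end in -metadata.csv as image metadata.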

Presumably, requiring the configs in a setup like on this dataset could also help in figuring out how it should work?

configs:
  - config_name: <image subset>
    data_files:
      - <image-metadata>.csv
      - <path/to/images>/*.jpg

I'd also be happy to look at whatever solution is decided upon and contribute to the ideation.

Thanks for your time and consideration! The dataset viewer really is fabulous when it works :)

lhoestq · 2024-09-20T15:47:22Z

Note that you can already have one directory per subset just for the metadata, e.g.

configs:
  - config_name: subset0
    data_files:
      - subset0/metadata.csv
      - images/*.jpg
  - config_name: subset1
    data_files:
      - subset1/metadata.csv
      - images/*.jpg

EDIT: ah, maybe it doesn't work, because you'd have to provide relative paths from the metadata files to the images
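As a rough illustration of what "relative paths from the metadata files to the images" would mean for the layout above, the sketch below computes the value a file_name entry would need. The helper `relative_file_name` and the image name `0001.jpg` are hypothetical, and whether the viewer resolves paths containing `..` is not confirmed here.

```python
import posixpath


def relative_file_name(metadata_path: str, image_path: str) -> str:
    """Return image_path relative to the directory that holds metadata_path.

    Both arguments are repository-relative POSIX paths.
    """
    metadata_dir = posixpath.dirname(metadata_path) or "."
    return posixpath.relpath(image_path, start=metadata_dir)


if __name__ == "__main__":
    # With subset0/metadata.csv and images/0001.jpg, the entry must climb one level.
    print(relative_file_name("subset0/metadata.csv", "images/0001.jpg"))
```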

egrace479 · 2024-10-16T15:55:15Z

Yes, that's part of the issue. Also, metadata.csv is a very ambiguous name and we generally try to avoid using the same name for different files within a dataset, as this can quickly lead to confusion.

lhoestq · 2024-10-17T09:13:35Z

Supporting **/*-metadata.csv or **/*_metadata.csv makes sense to me. If that sounds good to you, feel free to open a PR to update the patterns here:

datasets/src/datasets/data_files.py

Lines 104 to 115 in d4422cc

if config.FSSPEC_VERSION < version.parse("2023.9.0"):
    METADATA_PATTERNS = [
        "metadata.csv",
        "**/metadata.csv",
        "metadata.jsonl",
        "**/metadata.jsonl",
    ]  # metadata file for ImageFolder and AudioFolder
else:
    METADATA_PATTERNS = [
        "**/metadata.csv",
        "**/metadata.jsonl",
    ]  # metadata file for ImageFolder and AudioFolder
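For discussion, a sketch of what the extended list might look like for the newer-fsspec branch, with a small stand-in matcher to show which paths it would accept. The name `PROPOSED_METADATA_PATTERNS` and both helpers are hypothetical. The matcher only approximates glob semantics for patterns of the form `**/<name>` and is not the resolution code that datasets or fsspec actually use.

```python
import re

# Hypothetical extension of the list above; not merged library code.
PROPOSED_METADATA_PATTERNS = [
    "**/metadata.csv",
    "**/metadata.jsonl",
    "**/*-metadata.csv",
    "**/*_metadata.csv",
]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a '**/<name>' pattern into a regex (approximation only).

    '**/' matches zero or more directories; '*' matches within one file name.
    """
    if not pattern.startswith("**/"):
        raise ValueError("this sketch only handles patterns starting with '**/'")
    name_part = pattern[len("**/"):]
    name_regex = "".join("[^/]*" if ch == "*" else re.escape(ch) for ch in name_part)
    return re.compile(r"(?:.*/)?" + name_regex)


def matches_any(path: str) -> bool:
    """Return True if path matches at least one proposed pattern."""
    return any(pattern_to_regex(p).fullmatch(path) for p in PROPOSED_METADATA_PATTERNS)


if __name__ == "__main__":
    for path in ["subset0/images-metadata.csv", "metadata.csv", "subset0/labels.csv"]:
        print(path, matches_any(path))
```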

egrace479 added the enhancement New feature or request label Aug 23, 2024
