Make dataset viewer more flexible in displaying metadata alongside images #7123

Open
egrace479 opened this issue Aug 23, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

egrace479 commented Aug 23, 2024

Feature request

To display images with their associated metadata in the dataset viewer, a metadata.csv file is required. For a dataset with multiple subsets, each CSV has to sit in the same folder as its images, since every one of them must be named metadata.csv. The request is to make this more flexible for datasets with multiple subsets, so that a metadata.csv does not have to be placed in each image directory, where it is harder to access.

Motivation

When creating datasets with multiple subsets I can't get the images to display alongside their associated metadata (it's usually one or the other that will show up). Since this requires a file specifically named metadata.csv, I then have to place that file within the image directory, which makes it much more difficult to access. Additionally, it still doesn't necessarily display the images alongside their metadata correctly (see, for instance, this discussion).

In a discussion on another dataset struggling with a similar issue, it was suggested that I bring this to GitHub (discussion). In that case, it's a mix of data subsets, where some just reference the image URLs, while others actually have the images uploaded. The ones with images uploaded are not displaying images, but renaming their metadata file to just metadata.csv would diminish the clarity of the construction of the dataset itself (and I'm not entirely convinced it would solve the issue).

Your contribution

I can make a suggestion for one approach to address the issue:

For instance, even if it could just end in _metadata.csv or -metadata.csv, that would be very helpful to allow for more flexibility of dataset structure without impacting clarity. I would think that the functionality on the backend looking for metadata.csv could reasonably be adapted to look for such an ending on a filename (maybe also check that it has a file_name column?).
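A minimal sketch of the check described above, assuming the backend has the filename and the CSV text in hand. The function name `looks_like_metadata_csv` is hypothetical and is not part of the `datasets` library; it only illustrates the "name ends in _metadata.csv or -metadata.csv, and the header has a file_name column" idea.

```python
import csv
import io


def looks_like_metadata_csv(filename: str, text: str) -> bool:
    """Hypothetical helper: True if the name ends like a metadata CSV
    and the CSV header contains a file_name column."""
    basename = filename.rsplit("/", 1)[-1]
    # Accept the current exact name plus the two suggested endings.
    name_ok = basename == "metadata.csv" or basename.endswith(
        ("_metadata.csv", "-metadata.csv")
    )
    # Read only the header row; an empty file yields an empty header.
    header = next(csv.reader(io.StringIO(text)), [])
    return name_ok and "file_name" in header
```

Checking the header as well as the name would keep an unrelated CSV that happens to end in -metadata.csv from being picked up as image metadata.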

Presumably, requiring configs set up like those on this dataset could also help in figuring out how it should work?

configs:
  - config_name: <image subset>
    data_files:
      - <image-metadata>.csv
      - <path/to/images>/*.jpg

I'd also be happy to look at whatever solution is decided upon and contribute to the ideation.

Thanks for your time and consideration! The dataset viewer really is fabulous when it works :)

egrace479 added the enhancement (New feature or request) label Aug 23, 2024
@lhoestq
Member

lhoestq commented Sep 20, 2024

Note that you can already have one directory per subset just for the metadata, e.g.

configs:
  - config_name: subset0
    data_files:
      - subset0/metadata.csv
      - images/*.jpg
  - config_name: subset1
    data_files:
      - subset1/metadata.csv
      - images/*.jpg

EDIT: ah, maybe it doesn't work, because you'd have to provide relative paths from the metadata files to the images.
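To make the relative-path point concrete, here is a small sketch of what the file_name values would have to look like for the layout above. The helper name `relative_file_name` is hypothetical, and whether the loader accepts `..` segments in file_name is an assumption this thread does not confirm.

```python
import posixpath


def relative_file_name(metadata_path: str, image_path: str) -> str:
    """Hypothetical helper: path of an image relative to the directory
    that holds the metadata file (both given relative to the repo root)."""
    metadata_dir = posixpath.dirname(metadata_path) or "."
    return posixpath.relpath(image_path, start=metadata_dir)


# For subset0/metadata.csv pointing at images/0001.jpg, the file_name entry
# would need to climb out of subset0/ first.
print(relative_file_name("subset0/metadata.csv", "images/0001.jpg"))
```

With the metadata file in the same directory as the images, the same helper simply returns the bare image name.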

@egrace479
Author

Yes, that's part of the issue. Also, metadata.csv is a very ambiguous name and we generally try to avoid using the same name for different files within a dataset, as this can quickly lead to confusion.

@lhoestq
Member

lhoestq commented Oct 17, 2024

I think supporting **/*-metadata.csv or **/*_metadata.csv makes sense. If that sounds good to you, feel free to open a PR to update the patterns here:

if config.FSSPEC_VERSION < version.parse("2023.9.0"):
    METADATA_PATTERNS = [
        "metadata.csv",
        "**/metadata.csv",
        "metadata.jsonl",
        "**/metadata.jsonl",
    ]  # metadata file for ImageFolder and AudioFolder
else:
    METADATA_PATTERNS = [
        "**/metadata.csv",
        "**/metadata.jsonl",
    ]  # metadata file for ImageFolder and AudioFolder
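As a rough illustration of the proposed change, the sketch below extends the newer-fsspec list with the two suggested patterns and checks a few paths against it. `PROPOSED_METADATA_PATTERNS` and `matches_any_depth` are hypothetical names, not library code, and the matcher only approximates `**/name` as "name at any directory depth" by comparing basenames; the real resolution goes through fsspec's glob, which may differ in detail.

```python
from fnmatch import fnmatchcase

# Hypothetical extension of the list quoted above; not merged library code.
PROPOSED_METADATA_PATTERNS = [
    "**/metadata.csv",
    "**/metadata.jsonl",
    "**/*-metadata.csv",
    "**/*_metadata.csv",
]


def matches_any_depth(path: str, patterns=PROPOSED_METADATA_PATTERNS) -> bool:
    """Approximate '**/name' by matching the pattern tail against the basename."""
    basename = path.rsplit("/", 1)[-1]
    tails = (p[3:] if p.startswith("**/") else p for p in patterns)
    return any(fnmatchcase(basename, tail) for tail in tails)


for candidate in ["subset0/images-metadata.csv", "metadata.csv", "images/0001.jpg"]:
    print(candidate, matches_any_depth(candidate))
```

Note that the proposal in this thread only covers the .csv endings, so a name such as train_metadata.jsonl would still not match under this sketch.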
