dataset metadata for reproducibility #4129

Open
nbroad1881 opened this issue Apr 8, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@nbroad1881

When pulling a dataset from the Hub, it would be useful to have some metadata about the specific dataset and version being used. That metadata could then be passed to the Trainer and saved to a model card. This is useful for people who run many experiments on different versions (commits/branches) of the same dataset.

The dataset could carry a list of “source datasets” as metadata and ignore whatever happens to the data before it arrives in the Trainer (e.g. mapping, filtering).

Here is a basic representation (made by @lhoestq):

>>> from datasets import load_dataset
>>> 
>>> my_dataset = load_dataset(...)["train"]
>>> my_dataset = my_dataset.map(...)
>>> 
>>> my_dataset.sources
[HFHubDataset(repo_id=..., revision=..., arguments={...})]
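
To illustrate the intended behavior, here is a minimal sketch that uses a stand-in class instead of the real `datasets.Dataset`. `HFHubDataset` follows the name in the snippet above; `TrackedDataset`, its fields, and the example repo ID and revision are hypothetical and only for illustration. None of this is an existing `datasets` API.

```python
from dataclasses import dataclass, field


@dataclass
class HFHubDataset:
    """Hypothetical record of where a dataset came from."""
    repo_id: str
    revision: str
    arguments: dict = field(default_factory=dict)


@dataclass
class TrackedDataset:
    """Hypothetical stand-in for a dataset: rows plus provenance."""
    rows: list
    sources: list = field(default_factory=list)

    def map(self, fn):
        # The rows change, but the sources are carried over untouched.
        return TrackedDataset([fn(r) for r in self.rows], list(self.sources))

    def filter(self, predicate):
        # Filtering is likewise ignored by the provenance record.
        return TrackedDataset(
            [r for r in self.rows if predicate(r)], list(self.sources)
        )


# Placeholder values, not a real repository.
source = HFHubDataset(
    repo_id="user/my_dataset", revision="main", arguments={"split": "train"}
)
ds = TrackedDataset(rows=[{"text": "a"}, {"text": "bb"}], sources=[source])

processed = ds.map(lambda r: {**r, "n": len(r["text"])}).filter(lambda r: r["n"] > 1)
print(processed.sources)
```

The point of the design is that `sources` describes only where the data was loaded from, so it survives any number of `map` or `filter` calls unchanged.
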
@davanstrien
Member

+1 on this idea. This could be powerful for helping better track datasets used for model training and help with automatic model card creation.

One possible way of doing this would be to store some, most, or all of the arguments passed to load_dataset whenever a Hub ID is passed, i.e. the Hub ID, configuration, etc.
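
A rough sketch of that idea, assuming a thin wrapper around the loading call. `LoadRecord`, `load_with_record`, and `fake_loader` are hypothetical names made up for this example, and the repo ID and revision are placeholders. The loader is passed in as a parameter so the sketch runs without network access.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoadRecord:
    """Hypothetical record of the arguments of one load_dataset-style call."""
    repo_id: str
    config_name: Optional[str] = None
    revision: Optional[str] = None
    extra: dict = field(default_factory=dict)


def load_with_record(loader, path, name=None, revision=None, **kwargs):
    """Call `loader` and return its result together with a record of the arguments."""
    dataset = loader(path, name, revision=revision, **kwargs)
    record = LoadRecord(
        repo_id=path, config_name=name, revision=revision, extra=dict(kwargs)
    )
    return dataset, record


def fake_loader(path, name=None, revision=None, **kwargs):
    # Placeholder loader that returns fixed rows instead of contacting the Hub.
    return [{"text": "example"}]


dataset, record = load_with_record(
    fake_loader, "user/my_dataset", "default", revision="abc123", split="train"
)
print(record)
```

Since `datasets.load_dataset` accepts `path`, `name`, `revision`, and `split`, it could presumably be passed as `loader` here, though how such a record would be attached to the returned dataset is left open.
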

cc @tomaarsen
