Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArrayXD with None as leading dim incompatible with DatasetCardData #7243

Open
alex-hh opened this issue Oct 21, 2024 · 5 comments
Open

ArrayXD with None as leading dim incompatible with DatasetCardData #7243

alex-hh opened this issue Oct 21, 2024 · 5 comments

Comments

@alex-hh
Copy link
Contributor

alex-hh commented Oct 21, 2024

Describe the bug

Creating a dataset with ArrayXD features leads to errors when downloading from hub due to DatasetCardData removing the Nones

@lhoestq

Steps to reproduce the bug

import numpy as np
from datasets import Array2D, Dataset, Features, load_dataset

def examples_generator():
    for i in range(4):
        yield {
            "array_1d": np.zeros((10,1), dtype="uint16"),
            "array_2d": np.zeros((10, 1), dtype="uint16"),
        }

features = Features(array_1d=Array2D((None,1), "uint16"), array_2d=Array2D((None, 1), "uint16"))
dataset = Dataset.from_generator(examples_generator, features=features)
dataset.push_to_hub("alex-hh/test_array_1d2d")
ds = load_dataset("alex-hh/test_array_1d2d")

Source of error appears to be DatasetCardData.to_dict invoking DatasetCardData._remove_none

from huggingface_hub import DatasetCardData
from datasets.info import DatasetInfosDict

dataset_card_data = DatasetCardData()
DatasetInfosDict({"default": dataset.info.copy()}).to_dataset_card_data(dataset_card_data)
print(dataset_card_data.to_dict())  # removes Nones in shape

Expected behavior

Should be possible to load datasets saved with shape None in leading dimension

Environment info

3.0.2 and latest huggingface_hub

@lhoestq
Copy link
Member

lhoestq commented Oct 21, 2024

It looks like CardData in huggingface_hub removes None values where it shouldn't. Indeed it calls _remove_none on the return of to_dict():

    def to_dict(self) -> Dict[str, Any]:
        """Converts CardData to a dict.

        Returns:
            `dict`: CardData represented as a dictionary ready to be dumped to a YAML
            block for inclusion in a README.md file.
        """

        data_dict = copy.deepcopy(self.__dict__)
        self._to_dict(data_dict)
        return _remove_none(data_dict)

Would it be ok to remove list() from being scanned in _remove_none ? it could also be a specific behavior to DatasetCardData if necessary @Wauplin

@Wauplin
Copy link
Contributor

Wauplin commented Oct 22, 2024

I have actually no idea why none values are removed in model and dataset card data... 🙈
Looks like _remove_none has been introduced at the same time as the entire repocard module (see huggingface/huggingface_hub#940). I would be tempted to remove _remove_none entirely actually and only remove "top-level" None values (i.e. if something like pipeline_tag=None due to a default value in kwargs => we remove it). Hard to tell what could be the side effects but I'm not against trying.

However, I'm not really in favor in making an exception only for lists. It would mean that tuples, sets and dicts are filtered but not lists, which is pretty inconsistent.

@lhoestq
Copy link
Member

lhoestq commented Oct 22, 2024

let's do it for top level attributes yes

@Wauplin
Copy link
Contributor

Wauplin commented Oct 22, 2024

I opened huggingface/huggingface_hub#2626 to address it :)

@lhoestq
Copy link
Member

lhoestq commented Oct 22, 2024

thanks !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants