DatasetInfo.__init__() got an unexpected keyword argument '_column_requires_decoding' #6157

Closed
aihao2000 opened this issue Aug 17, 2023 · 13 comments

Comments

@aihao2000

aihao2000 commented Aug 17, 2023

Describe the bug

When I called load_dataset, it raised "DatasetInfo.__init__() got an unexpected keyword argument '_column_requires_decoding'". The second time I ran it, there was no error and the dataset object worked.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 dataset = load_dataset(
      2     "/home/aihao/workspace/DeepLearningContent/datasets/manga",
      3     data_dir="/home/aihao/workspace/DeepLearningContent/datasets/manga",
      4     split="train",
      5 )

File ~/miniconda3/envs/torch/lib/python3.11/site-packages/datasets/load.py:2146, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   2142 # Build dataset for splits
   2143 keep_in_memory = (
   2144     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   2145 )
-> 2146 ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
   2147 # Rename and cast features to match task schema
   2148 if task is not None:
   2149     # To avoid issuing the same warning twice

File ~/miniconda3/envs/torch/lib/python3.11/site-packages/datasets/builder.py:1190, in DatasetBuilder.as_dataset(self, split, run_post_process, verification_mode, ignore_verifications, in_memory)
   1187 verification_mode = VerificationMode(verification_mode or VerificationMode.BASIC_CHECKS)
   1189 # Create a dataset for each of the given splits
-> 1190 datasets = map_nested(
   1191     partial(
   1192         self._build_single_dataset,
...
File ~/miniconda3/envs/torch/lib/python3.11/site-packages/datasets/info.py:379, in DatasetInfo.copy(self)
    378 def copy(self) -> "DatasetInfo":
--> 379     return self.__class__(**{k: copy.deepcopy(v) for k, v in self.__dict__.items()})

TypeError: DatasetInfo.__init__() got an unexpected keyword argument '_column_requires_decoding'

Steps to reproduce the bug

/home/aihao/workspace/DeepLearningContent/datasets/images/images.py

from logging import config
import datasets
import os
from PIL import Image
import csv
import json


class ImagesConfig(datasets.BuilderConfig):
    def __init__(self, **kwargs):
        super(ImagesConfig, self).__init__(**kwargs)


class Images(datasets.GeneratorBasedBuilder):
    def _split_generators(self, dl_manager: datasets.DownloadManager):
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"split": datasets.Split.TRAIN},
            )
        ]

    BUILDER_CONFIGS = [
        ImagesConfig(
            name="similar_pairs",
            description="simliar pair dataset,item is a pair of similar images",
        ),
        ImagesConfig(
            name="image_prompt_pairs",
            description="image prompt pairs",
        ),
    ]

    def _info(self):
        if self.config.name == "similar_pairs":
            return datasets.Features(
                {
                    "image1": datasets.features.Image(),
                    "image2": datasets.features.Image(),
                    "similarity": datasets.Value("float32"),
                }
            )
        elif self.config.name == "image_prompt_pairs":
            return datasets.Features(
                {"image": datasets.features.Image(), "prompt": datasets.Value("string")}
            )

    def _generate_examples(self, split):
        data_path = os.path.join(self.config.data_dir, "data")
        if self.config.name == "similar_pairs":
            prompts = {}
            with open(os.path.join(data_path ,"prompts.json"), "r") as f:
                prompts = json.load(f)
            with open(os.path.join(data_path, "similar_pairs.csv"), "r") as f:
                reader = csv.reader(f)
                for row in reader:
                    image1_path, image2_path, similarity = row
                    yield image1_path + ":" + image2_path + ":", {
                        "image1": Image.open(image1_path),
                        "prompt1": prompts[image1_path],
                        "image2": Image.open(image2_path),
                        "prompt2": prompts[image2_path],
                        "similarity": float(similarity),
                    }

Code that triggers the error:

from datasets import load_dataset
import json
import csv
import ast
import torch
data_dir = "/home/aihao/workspace/DeepLearningContent/datasets/images"
dataset = load_dataset(data_dir, data_dir=data_dir, name="similar_pairs")

Expected behavior

The first execution raises the error, but running it again works fine.

Environment info

  • datasets version: 2.14.3
  • Platform: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
  • Python version: 3.11.4
  • Huggingface_hub version: 0.16.4
  • PyArrow version: 12.0.1
  • Pandas version: 2.0.3
@mariosasko
Collaborator

Thanks for reporting, but we can only fix this issue if you can provide a reproducer that consistently reproduces it.

@aihao2000
Author

@mariosasko OK. What exactly does it mean to provide a reproducer?

@mariosasko
Collaborator

To provide code that reproduces the issue :)

@aihao2000
Author

@mariosasko I completed the code above. Is it enough?

@aihao2000
Author

@mariosasko That's all the code. I'm using locally stored data.

@mariosasko
Collaborator

Does this error occur even if you change the cache directory (the cache_dir parameter in load_dataset)?

@aihao2000
Author

@mariosasko I didn't pass any cache parameters, and I didn't change any cache configuration.

@aihao2000
Author

@mariosasko Also, I changed the data file, but load_dataset always returns the previous result. I had to change something in images.py to pick up the new data. Calling 'cleanup_cache_files' has no effect! Help me.
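
Both cache questions above (whether a different cache directory changes the error, and why an edited data file is not picked up) can be tested by pointing load_dataset at a fresh cache_dir and asking it to rebuild. The following is a minimal sketch, not a confirmed fix: `fresh_load_kwargs` and `load_fresh` are hypothetical helper names, and the thread does not establish whether this avoids the TypeError. It relies only on the `cache_dir` and `download_mode` parameters that appear in the load_dataset signature in the traceback.

```python
import tempfile


def fresh_load_kwargs(data_dir, config_name):
    """Build load_dataset keyword arguments that bypass any earlier cached build.

    Hypothetical helper for illustration only.
    """
    return {
        "path": data_dir,
        "name": config_name,
        "data_dir": data_dir,
        # A brand-new, empty directory, so no earlier build can be reused.
        "cache_dir": tempfile.mkdtemp(prefix="hf_cache_"),
        # Ask the library to regenerate the dataset instead of reusing a cached copy.
        "download_mode": "force_redownload",
    }


def load_fresh(data_dir, config_name):
    """Call load_dataset with the arguments above (requires the datasets package)."""
    from datasets import load_dataset  # imported lazily so the helper above has no dependency

    return load_dataset(**fresh_load_kwargs(data_dir, config_name))


kwargs = fresh_load_kwargs(
    "/home/aihao/workspace/DeepLearningContent/datasets/images", "similar_pairs"
)
```

Calling `load_fresh(data_dir, "similar_pairs")` would then rebuild the dataset from the current data files on every call, which is slow for large datasets and is meant for debugging only.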

@aihao2000
Author

@mariosasko I added the full error message. It looks like _column_requires_decoding is being passed where it shouldn't be. Does DatasetInfo.__init__() require this parameter?

@mariosasko
Collaborator

I can see the issue now...

You can fix it by returning a DatasetInfo object in the _info method as follows:

    def _info(self):
        if self.config.name == "similar_pairs":
            features = datasets.Features(
                {
                    "image1": datasets.features.Image(),
                    "prompt1": datasets.Value("string"),
                    "image2": datasets.features.Image(),
                    "prompt2": datasets.Value("string"),
                    "similarity": datasets.Value("float32"),
                }
            )
        elif self.config.name == "image_prompt_pairs":
            features = datasets.Features(
                {"image": datasets.features.Image(), "prompt": datasets.Value("string")}
            )
        return datasets.DatasetInfo(features=features)
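
A plausible explanation for the original error, not confirmed in this thread: the reporter's `_info` returns a `Features` object, and if the builder merges the returned object's attributes into its own `DatasetInfo`, a private attribute such as `_column_requires_decoding` ends up in the info's `__dict__`. The `copy()` line shown in the traceback then passes it to `__init__` as a keyword argument. The sketch below reproduces that pattern with stand-in classes; `FeaturesLike` and `InfoLike` are hypothetical names, and the `update` method is an assumption about how the merge works, not the library's code.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Optional


class FeaturesLike(dict):
    """Hypothetical stand-in for datasets.Features: a dict with a private attribute."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Stored on the instance, so it shows up in self.__dict__.
        self._column_requires_decoding = {name: False for name in self}


@dataclass
class InfoLike:
    """Hypothetical stand-in for datasets.DatasetInfo."""

    description: str = ""
    features: Optional[dict] = None

    def update(self, other):
        # Assumed merge step: copies every attribute of `other`, whatever its type.
        self.__dict__.update({k: deepcopy(v) for k, v in other.__dict__.items()})

    def copy(self):
        # Same pattern as the copy() line in the traceback.
        return self.__class__(**{k: deepcopy(v) for k, v in self.__dict__.items()})


info = InfoLike()
# Passing the wrong type is silently accepted by the merge.
info.update(FeaturesLike({"image1": "Image", "similarity": "float32"}))

try:
    info.copy()
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)
```

In this sketch the merge never sets `features`, so the declared columns are lost as well. Wrapping the features in `datasets.DatasetInfo(features=features)`, as in the fix, means the merged object only carries fields that `__init__` accepts.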

@aihao2000
Author

@mariosasko Oh, that's the problem. Thank you very much. So returning the wrong object still worked? I've been training with it for a long time.

@aihao2000
Author

@mariosasko The original code could still show progress. Hmm, I can't see how many examples have been generated so far, so I don't know if we should wait.

@mariosasko
Collaborator

The original issue has been addressed, so I'm closing it.

Please open a new issue if you encounter more errors.
