DatasetInfo.__init__() got an unexpected keyword argument '_column_requires_decoding' #6157

aihao2000 · 2023-08-17T15:48:11Z

Describe the bug

When I called load_dataset, it raised "DatasetInfo.__init__() got an unexpected keyword argument '_column_requires_decoding'". The second time I ran it, there was no error and the dataset object worked.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 dataset = load_dataset(
      2     "/home/aihao/workspace/DeepLearningContent/datasets/manga",
      3     data_dir="/home/aihao/workspace/DeepLearningContent/datasets/manga",
      4     split="train",
      5 )

File ~/miniconda3/envs/torch/lib/python3.11/site-packages/datasets/load.py:2146, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   2142 # Build dataset for splits
   2143 keep_in_memory = (
   2144     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   2145 )
-> 2146 ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
   2147 # Rename and cast features to match task schema
   2148 if task is not None:
   2149     # To avoid issuing the same warning twice

File ~/miniconda3/envs/torch/lib/python3.11/site-packages/datasets/builder.py:1190, in DatasetBuilder.as_dataset(self, split, run_post_process, verification_mode, ignore_verifications, in_memory)
   1187 verification_mode = VerificationMode(verification_mode or VerificationMode.BASIC_CHECKS)
   1189 # Create a dataset for each of the given splits
-> 1190 datasets = map_nested(
   1191     partial(
   1192         self._build_single_dataset,
...
File ~/miniconda3/envs/torch/lib/python3.11/site-packages/datasets/info.py:379, in DatasetInfo.copy(self)
    378 def copy(self) -> "DatasetInfo":
--> 379     return self.__class__(**{k: copy.deepcopy(v) for k, v in self.__dict__.items()})

TypeError: DatasetInfo.__init__() got an unexpected keyword argument '_column_requires_decoding'

Steps to reproduce the bug

/home/aihao/workspace/DeepLearningContent/datasets/images/images.py

from logging import config
import datasets
import os
from PIL import Image
import csv
import json


class ImagesConfig(datasets.BuilderConfig):
    def __init__(self, **kwargs):
        super(ImagesConfig, self).__init__(**kwargs)


class Images(datasets.GeneratorBasedBuilder):
    def _split_generators(self, dl_manager: datasets.DownloadManager):
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"split": datasets.Split.TRAIN},
            )
        ]

    BUILDER_CONFIGS = [
        ImagesConfig(
            name="similar_pairs",
            description="similar pair dataset; each item is a pair of similar images",
        ),
        ImagesConfig(
            name="image_prompt_pairs",
            description="image prompt pairs",
        ),
    ]

    def _info(self):
        if self.config.name == "similar_pairs":
            return datasets.Features(
                {
                    "image1": datasets.features.Image(),
                    "image2": datasets.features.Image(),
                    "similarity": datasets.Value("float32"),
                }
            )
        elif self.config.name == "image_prompt_pairs":
            return datasets.Features(
                {"image": datasets.features.Image(), "prompt": datasets.Value("string")}
            )

    def _generate_examples(self, split):
        data_path = os.path.join(self.config.data_dir, "data")
        if self.config.name == "similar_pairs":
            prompts = {}
            with open(os.path.join(data_path ,"prompts.json"), "r") as f:
                prompts = json.load(f)
            with open(os.path.join(data_path, "similar_pairs.csv"), "r") as f:
                reader = csv.reader(f)
                for row in reader:
                    image1_path, image2_path, similarity = row
                    yield image1_path + ":" + image2_path + ":", {
                        "image1": Image.open(image1_path),
                        "prompt1": prompts[image1_path],
                        "image2": Image.open(image2_path),
                        "prompt2": prompts[image2_path],
                        "similarity": float(similarity),
                    }

Code that triggers the error:

from datasets import load_dataset
import json
import csv
import ast
import torch
data_dir = "/home/aihao/workspace/DeepLearningContent/datasets/images"
dataset = load_dataset(data_dir, data_dir=data_dir, name="similar_pairs")

Expected behavior

load_dataset should not raise an error on the first execution. Currently the first execution fails, while later executions work fine.

Environment info

datasets version: 2.14.3
Platform: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
Python version: 3.11.4
Huggingface_hub version: 0.16.4
PyArrow version: 12.0.1
Pandas version: 2.0.3

mariosasko · 2023-08-17T17:39:00Z

Thanks for reporting, but we can only fix this issue if you can provide a reproducer that consistently reproduces it.

aihao2000 · 2023-08-18T08:20:56Z

@mariosasko OK. What exactly does it mean to provide a reproducer?

mariosasko · 2023-08-18T12:26:28Z

It means providing code that reproduces the issue :)

aihao2000 · 2023-08-18T14:21:46Z

@mariosasko I have completed the code above. Is it enough?

aihao2000 · 2023-08-19T17:30:03Z

@mariosasko That's all the code. I'm using locally stored data.

mariosasko · 2023-08-21T17:14:53Z

Does this error occur even if you change the cache directory (the cache_dir parameter in load_dataset)?
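For reference, a minimal sketch of how that check could be run, assuming the same local data directory and config name as in the report. The helper names fresh_cache_kwargs and load_with_fresh_cache are hypothetical; cache_dir is the load_dataset parameter mentioned above.

```python
import tempfile


def fresh_cache_kwargs(data_dir, config_name):
    """Build load_dataset keyword arguments that point at a new, empty cache directory."""
    # A brand-new temporary directory guarantees no previously prepared data is reused.
    cache_dir = tempfile.mkdtemp(prefix="hf_datasets_cache_")
    return {
        "path": data_dir,
        "name": config_name,
        "data_dir": data_dir,
        "cache_dir": cache_dir,
    }


def load_with_fresh_cache(data_dir, config_name):
    """Call load_dataset with the fresh cache directory (requires the datasets package)."""
    from datasets import load_dataset  # imported lazily so the helper above has no dependency

    return load_dataset(**fresh_cache_kwargs(data_dir, config_name))
```

If the error still appears with a fresh cache directory on every run, the default cache contents are unlikely to be the cause.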

aihao2000 · 2023-08-22T08:09:56Z

@mariosasko I didn't pass any cache parameters, and I haven't changed any cache configuration.

aihao2000 · 2023-08-27T12:31:50Z

@mariosasko Also, I changed the data file, but load_dataset still returns the previous result. I had to change something in images.py to pick up the new data. Calling 'cleanup_cache_files' has no effect! Please help.
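One option that may help here, assuming the stale result comes from the previously prepared dataset in the cache: pass download_mode="force_redownload" to load_dataset so the dataset is rebuilt instead of reused. The helper names below are hypothetical; download_mode appears in the load_dataset signature in the traceback above.

```python
def rebuild_kwargs(data_dir, config_name):
    """Build load_dataset keyword arguments that ask for the dataset to be regenerated."""
    return {
        "path": data_dir,
        "name": config_name,
        "data_dir": data_dir,
        # Ask the library to ignore previously prepared data and rebuild it.
        "download_mode": "force_redownload",
    }


def reload_ignoring_cache(data_dir, config_name):
    """Call load_dataset with the rebuild options (requires the datasets package)."""
    from datasets import load_dataset  # imported lazily so the helper above has no dependency

    return load_dataset(**rebuild_kwargs(data_dir, config_name))
```

This regenerates the examples on every call, so it is best used only after the data files have actually changed.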

aihao2000 · 2023-08-27T16:34:40Z

@mariosasko I added the full error message. It looks like _column_requires_decoding is being passed where it shouldn't be. Does DatasetInfo.__init__() accept this parameter?

mariosasko · 2023-08-31T16:55:17Z

I can see the issue now...

You can fix it by returning a DatasetInfo object in the _info method as follows:

    def _info(self):
        if self.config.name == "similar_pairs":
            features = datasets.Features(
                {
                    "image1": datasets.features.Image(),
                    "prompt1": datasets.Value("string"),
                    "image2": datasets.features.Image(),
                    "prompt2": datasets.Value("string"),
                    "similarity": datasets.Value("float32"),
                }
            )
        elif self.config.name == "image_prompt_pairs":
            features = datasets.Features(
                {"image": datasets.features.Image(), "prompt": datasets.Value("string")}
            )
        return datasets.DatasetInfo(features=features)
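
A toy model of why returning Features directly could lead to this TypeError. It does not use the library's real classes: ToyFeatures, ToyInfo, and merge_and_copy are hypothetical names, and the update step is an assumption about how the builder merges whatever _info() returns. Only the copy method mirrors the DatasetInfo.copy line shown in the traceback.

```python
import copy
from dataclasses import dataclass
from typing import Optional


class ToyFeatures(dict):
    """Stand-in for a features mapping: a dict that also carries a private attribute."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._column_requires_decoding = {name: True for name in self}


@dataclass
class ToyInfo:
    """Stand-in for a dataset info object with only two declared fields."""

    description: str = ""
    features: Optional[ToyFeatures] = None

    def update(self, other):
        # Assumption: attributes of the object returned by _info() are merged in,
        # whatever its type is.
        self.__dict__.update(
            {k: copy.deepcopy(v) for k, v in other.__dict__.items() if v is not None}
        )

    def copy(self):
        # Same shape as the DatasetInfo.copy line in the traceback.
        return self.__class__(**{k: copy.deepcopy(v) for k, v in self.__dict__.items()})


def merge_and_copy(returned_by_info):
    """Merge the object returned by a hypothetical _info() and then copy the result."""
    info = ToyInfo()
    info.update(returned_by_info)
    return info.copy()


features = ToyFeatures({"image1": "Image", "similarity": "float32"})

# Returning the features directly leaks the private attribute into the info object,
# so copy() passes an undeclared keyword to __init__.
try:
    merge_and_copy(features)
    error_message = None
except TypeError as err:
    error_message = str(err)

# Wrapping the features in an info object keeps __dict__ limited to declared fields.
copied = merge_and_copy(ToyInfo(features=features))
```

In this model the first call fails with an unexpected keyword argument '_column_requires_decoding', while the wrapped version copies cleanly, which matches the fix of returning a DatasetInfo from _info.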

aihao2000 · 2023-09-01T17:38:26Z

@mariosasko Oh, that's the problem. Thank you very much. So it returned the wrong object and still worked? I've been training with it for a long time.

aihao2000 · 2023-09-08T12:27:53Z

@mariosasko The original code could still show progress. Hmm, I can't see how many examples have been generated so far, so I don't know whether I should keep waiting.

mariosasko · 2023-09-27T17:36:14Z

The original issue has been addressed, so I'm closing it.

Please open a new issue if you encounter more errors.

mariosasko closed this as completed Sep 27, 2023
