
How to convert torch.utils.data.Dataset to huggingface dataset? #4983

Closed
DEROOCE opened this issue Sep 16, 2022 · 15 comments
Labels: enhancement (New feature or request)


DEROOCE commented Sep 16, 2022

I looked through the Hugging Face datasets docs, and it seems there is no official function to convert a torch.utils.data.Dataset to a Hugging Face dataset. However, the reverse direction, converting a Hugging Face dataset to a torch.utils.data.Dataset, is supported:

from datasets import Dataset
data = [[1, 2],[3, 4]]
ds = Dataset.from_dict({"data": data})
ds = ds.with_format("torch")
ds[0]
ds[:2]

So is there something I'm missing, or is there really no function to convert a torch.utils.data.Dataset to a Hugging Face dataset? If not, is there another way to do this conversion?
Thanks.

@DEROOCE DEROOCE added the enhancement New feature or request label Sep 16, 2022
mariosasko (Collaborator) commented

Hi! I think you can use the newly added from_generator method for that:

from datasets import Dataset

def gen():
    for idx in range(len(torch_dataset)):
        yield torch_dataset[idx]  # each item has to be a dictionary
    ## or, if it's an IterableDataset:
    # for ex in torch_dataset:
    #     yield ex

dset = Dataset.from_generator(gen)
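A self-contained sketch of the same iteration pattern, using a plain Python class that mimics a map-style dataset (only `__len__` and `__getitem__` are assumed, so it can be checked without `torch` installed; `ToyMapDataset` is a hypothetical stand-in):

```python
# Hypothetical stand-in for a map-style torch.utils.data.Dataset:
# any object with __len__ and __getitem__ follows the same protocol.
class ToyMapDataset:
    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        return self._rows[idx]

toy = ToyMapDataset([{"text": "a"}, {"text": "b"}])

def gen():
    # note range(len(...)) -- iterating over len(...) directly raises TypeError
    for idx in range(len(toy)):
        yield toy[idx]  # each yielded item must be a dict

rows = list(gen())
```

Passing a zero-argument generator function like `gen` is exactly what `Dataset.from_generator` expects; any per-example state is reached through the closure.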

lhoestq (Member) commented Sep 19, 2022

Maybe Dataset.from_list can work as well, no?

from datasets import Dataset

dset = Dataset.from_list(torch_dataset)
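One caveat worth noting: `Dataset.from_list` expects a plain list of dicts, so a map-style dataset would first need to be materialized. A minimal plain-Python sketch of that conversion (the `ToyDataset` class here is hypothetical):

```python
# Hypothetical map-style dataset whose rows are already dicts.
class ToyDataset:
    def __init__(self):
        self._rows = [{"context": "hello"}, {"context": "world"}]

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        return self._rows[idx]

torch_like = ToyDataset()

# from_list wants a list of dicts, so materialize the dataset first.
as_list = [torch_like[i] for i in range(len(torch_like))]
# datasets.Dataset.from_list(as_list) would then build the Arrow-backed dataset.
```

This loads everything into memory at once, which is why `from_generator` is the better fit for large datasets.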

DEROOCE (Author) commented Sep 20, 2022


I tried the Dataset.from_generator() method, and it returns an error:

AttributeError: type object 'Dataset' has no attribute 'from_generator'

I thought my datasets package might be out of date, so I upgraded it:

pip install --upgrade datasets

But after that, the code still returns the same error.

DEROOCE (Author) commented Sep 20, 2022

dset = Dataset.from_list(torch_dataset)

It seems that Dataset also has no from_list method 😂

AttributeError: type object 'Dataset' has no attribute 'from_list'

DEROOCE (Author) commented Sep 20, 2022


My dummy code looks like this:

import os
import json
from torch.utils import data
import datasets

class MyDataset(data.Dataset):
    def __init__(self, path):
        self.rows = []
        for line in open(path, 'r', encoding='utf-8'):
            j_dict = json.loads(line)
            # keep each example as a dict so datasets can infer the columns
            self.rows.append({'context': j_dict['context']})

    def __getitem__(self, idx):
        return self.rows[idx]

    def __len__(self):
        return len(self.rows)

root_path = os.path.dirname(os.path.abspath(__file__))
path = os.path.join(root_path, 'dataset', 'train.json')
torch_dataset = MyDataset(path)

def gen():
    # from_generator calls gen() with no arguments, so close over torch_dataset
    for idx in range(len(torch_dataset)):
        yield torch_dataset[idx]  # each item has to be a dictionary

dit = []
for line in open(path, 'r', encoding='utf-8'):
    j_dict = json.loads(line)
    dit.append({'context': j_dict['context']})
dset1 = datasets.Dataset.from_list(dit)  # from_list expects a list of dicts
print(dset1)
dset2 = datasets.Dataset.from_generator(gen)
print(dset2)
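The JSON-lines loading logic above can be checked in plain Python without `torch` or `datasets` installed. A minimal sketch (the input lines are simulated, and the 'context' key is taken from the code above; note that `from_generator` calls the generator function with no arguments, so the data is reached through a closure):

```python
import json

# Simulated JSON-lines input (one JSON object per line), standing in for train.json.
lines = ['{"context": "first example"}', '{"context": "second example"}']

# Each row should be a dict so that from_list / from_generator can infer columns.
dit = [{"context": json.loads(line)["context"]} for line in lines]

def gen():
    # zero-argument generator function closing over the loaded rows
    for row in dit:
        yield row

rows = list(gen())
```

Both `dit` (for `from_list`) and `gen` (for `from_generator`) then yield the same dict rows.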

lhoestq (Member) commented Sep 20, 2022

We're releasing from_generator and from_list today :)
In the meantime, you can play with them by installing datasets from source.

DEROOCE (Author) commented Sep 20, 2022

We're releasing from_generator and from_list today :) In the meantime you can play with them by installing datasets from source

Thanks a lot for your work!

winnechan commented


Hi, when I use this code to build my own dataset, datasets.Dataset.from_generator(gen) reports TypeError: cannot pickle generator object, where MyDataset returns a dict like {'image': bytes, 'text': string}. How can I resolve this? Thanks a lot!

lhoestq (Member) commented Mar 30, 2023

Hi! Right now, generator functions are expected to be picklable, so that datasets can hash them and use the hash to cache the resulting Dataset on disk. Maybe this can be improved.

In the meantime, can you check that you're not using unpicklable objects? In your case, it looks like you're using a generator object, which is unpicklable. It might come from an opened file; e.g., this doesn't work:

with open(...) as f:
    def gen():
        for x in f:
            yield json.loads(x)

    ds = Dataset.from_generator(gen)

but this does work:

def gen():
    with open(...) as f:
        for x in f:
            yield json.loads(x)

ds = Dataset.from_generator(gen)

winnechan commented


Thanks a lot! That's exactly why I encountered this issue. Sorry to bother you again with another problem: since my dataset is large, I use IterableDataset.from_generator, which has no with_transform attribute. How can I apply custom preprocessing to it, as I would with Dataset.from_generator? Should I move the preprocessing into my torch Dataset?

lhoestq (Member) commented Mar 30, 2023

Iterable datasets are lazy: exactly like with_transform, they apply processing on the fly when the examples are accessed.

Therefore you can use my_iterable_dataset.map() instead :)
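The laziness described above can be illustrated with a plain Python generator pipeline: nothing runs until the examples are actually consumed. This is only an analogy for the behavior of `IterableDataset.map`, not the `datasets` internals:

```python
processed = []  # records when work actually happens

def source():
    # stands in for an iterable dataset's example stream
    for i in range(3):
        yield {"value": i}

def mapped(stream, fn):
    # processing happens on the fly, one example at a time
    for ex in stream:
        processed.append(ex["value"])
        yield fn(ex)

stream = mapped(source(), lambda ex: {"value": ex["value"] * 2})
before = list(processed)   # nothing has been processed yet
result = list(stream)      # consuming the stream triggers the processing
```

Building the pipeline is free; the cost is paid only as examples flow through it, which is why `map` on an iterable dataset behaves like `with_transform` on a map-style one.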

winnechan commented

@lhoestq thanks a lot and I have successfully made it work~

atyshka commented May 2, 2023

@lhoestq I am having a similar issue. Can you help me understand which kinds of generators are picklable? I previously thought that no generators were picklable, so I'm intrigued to hear this.

lhoestq (Member) commented May 5, 2023

Generator functions are generally picklable. E.g.

import dill as pickle

def generator_fn():
    for i in range(10):
        yield i

pickle.dumps(generator_fn)

However, generator objects are not picklable:

generator = generator_fn()
pickle.dumps(generator)
# TypeError: cannot pickle 'generator' object

Though it can happen that some generator functions are not recursively picklable if they use global objects that are not picklable:

def generator_fn_not_picklable():
    for i in generator:
        yield i

pickle.dumps(generator_fn_not_picklable, recurse=True)
# TypeError: cannot pickle 'generator' object

AeroDEmi commented
I'm trying to create an IterableDataset from a generator but I get this error:
PicklingError: Can't pickle <built-in function input>: it's not the same object as builtins.input

What can I do?

6 participants