
How to convert torch.utils.data.Dataset to huggingface dataset? #4983

Closed
DEROOCE opened this issue Sep 16, 2022 · 15 comments
Labels: enhancement (New feature or request)


DEROOCE commented Sep 16, 2022

I looked through the Hugging Face datasets docs, and it seems there is no official function to convert a torch.utils.data.Dataset to a Hugging Face dataset. However, the reverse direction, converting a Hugging Face dataset to a torch.utils.data.Dataset, is supported:

from datasets import Dataset
data = [[1, 2],[3, 4]]
ds = Dataset.from_dict({"data": data})
ds = ds.with_format("torch")
ds[0]
ds[:2]

So is there something I'm missing, or is there really no function to convert a torch.utils.data.Dataset to a Hugging Face dataset? If not, is there another way to do this conversion?
Thanks.

@DEROOCE DEROOCE added the enhancement New feature or request label Sep 16, 2022
mariosasko (Collaborator) commented

Hi! I think you can use the newly added from_generator method for that:

from datasets import Dataset

def gen():
    for idx in range(len(torch_dataset)):
        yield torch_dataset[idx]  # each item has to be a dictionary
    ## or, if it's an IterableDataset:
    # for ex in torch_dataset:
    #     yield ex

dset = Dataset.from_generator(gen)
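A self-contained sketch of the same iteration pattern, using a plain Python class that mimics a map-style dataset (only `__len__` and `__getitem__` are assumed, so it can be checked without `torch` installed; `ToyMapDataset` is a hypothetical stand-in):

```python
# Hypothetical stand-in for a map-style torch.utils.data.Dataset:
# any object with __len__ and __getitem__ follows the same protocol.
class ToyMapDataset:
    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        return self._rows[idx]

toy = ToyMapDataset([{"text": "a"}, {"text": "b"}])

def gen():
    # note range(len(...)) -- iterating over len(...) directly raises TypeError
    for idx in range(len(toy)):
        yield toy[idx]  # each yielded item must be a dict

rows = list(gen())
```

Passing a zero-argument generator function like `gen` is exactly what `Dataset.from_generator` expects; any per-example state is reached through the closure.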

lhoestq (Member) commented Sep 19, 2022

Maybe Dataset.from_list can work as well, no?

from datasets import Dataset

dset = Dataset.from_list(torch_dataset)
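One caveat worth noting: `Dataset.from_list` expects a plain list of dicts, so a map-style dataset would first need to be materialized. A minimal plain-Python sketch of that conversion (the `ToyDataset` class here is hypothetical):

```python
# Hypothetical map-style dataset whose rows are already dicts.
class ToyDataset:
    def __init__(self):
        self._rows = [{"context": "hello"}, {"context": "world"}]

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        return self._rows[idx]

torch_like = ToyDataset()

# from_list wants a list of dicts, so materialize the dataset first.
as_list = [torch_like[i] for i in range(len(torch_like))]
# datasets.Dataset.from_list(as_list) would then build the Arrow-backed dataset.
```

This loads everything into memory at once, which is why `from_generator` is the better fit for large datasets.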

DEROOCE (Author) commented Sep 20, 2022


I tried the Dataset.from_generator() method, and it returns an error:

AttributeError: type object 'Dataset' has no attribute 'from_generator'

I thought my datasets package might be out of date, so I upgraded it:

pip install --upgrade datasets

But after that, the code still returns the same error.

DEROOCE (Author) commented Sep 20, 2022

dset = Dataset.from_list(torch_dataset)

It seems that Dataset also has no from_list method 😂

AttributeError: type object 'Dataset' has no attribute 'from_list'

DEROOCE (Author) commented Sep 20, 2022


My dummy code looks like this:

import os
import json
from torch.utils import data
import datasets

class MyDataset(data.Dataset):
    def __init__(self, path):
        self.rows = []
        for line in open(path, 'r', encoding='utf-8'):
            j_dict = json.loads(line)
            # keep each example as a dict so datasets can infer the columns
            self.rows.append({'context': j_dict['context']})

    def __getitem__(self, idx):
        return self.rows[idx]

    def __len__(self):
        return len(self.rows)

root_path = os.path.dirname(os.path.abspath(__file__))
path = os.path.join(root_path, 'dataset', 'train.json')
torch_dataset = MyDataset(path)

def gen():
    # from_generator calls gen() with no arguments, so close over torch_dataset
    for idx in range(len(torch_dataset)):
        yield torch_dataset[idx]  # each item has to be a dictionary

dit = []
for line in open(path, 'r', encoding='utf-8'):
    j_dict = json.loads(line)
    dit.append({'context': j_dict['context']})
dset1 = datasets.Dataset.from_list(dit)  # from_list expects a list of dicts
print(dset1)
dset2 = datasets.Dataset.from_generator(gen)
print(dset2)
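The JSON-lines loading logic above can be checked in plain Python without `torch` or `datasets` installed. A minimal sketch (the input lines are simulated, and the 'context' key is taken from the code above; note that `from_generator` calls the generator function with no arguments, so the data is reached through a closure):

```python
import json

# Simulated JSON-lines input (one JSON object per line), standing in for train.json.
lines = ['{"context": "first example"}', '{"context": "second example"}']

# Each row should be a dict so that from_list / from_generator can infer columns.
dit = [{"context": json.loads(line)["context"]} for line in lines]

def gen():
    # zero-argument generator function closing over the loaded rows
    for row in dit:
        yield row

rows = list(gen())
```

Both `dit` (for `from_list`) and `gen` (for `from_generator`) then yield the same dict rows.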

lhoestq (Member) commented Sep 20, 2022

We're releasing from_generator and from_list today :)
In the meantime, you can play with them by installing datasets from source.

DEROOCE (Author) commented Sep 20, 2022

We're releasing from_generator and from_list today :) In the meantime you can play with them by installing datasets from source

Thanks a lot for your work!

winnechan commented


Hi, when I use this code to build my own dataset, datasets.Dataset.from_generator(gen) reports TypeError: cannot pickle generator object, where MyDataset returns a dict like {'image': bytes, 'text': string}. How can I resolve this? Thanks a lot!

lhoestq (Member) commented Mar 30, 2023

Hi! Right now, generator functions are expected to be picklable, so that datasets can hash them and use the hash to cache the resulting Dataset on disk. Maybe this can be improved.

In the meantime, can you check that you're not using unpicklable objects? In your case, it looks like you're using a generator object, which is unpicklable. It might come from an opened file; e.g., this doesn't work:

with open(...) as f:
    def gen():
        for x in f:
            yield json.loads(x)

    ds = Dataset.from_generator(gen)

but this does work:

def gen():
    with open(...) as f:
        for x in f:
            yield json.loads(x)

ds = Dataset.from_generator(gen)

winnechan commented


Thanks a lot! That's exactly why I encountered this issue. Sorry to bother you again with another problem: since my dataset is large, I use IterableDataset.from_generator, which has no with_transform attribute. How can I apply custom preprocessing to it, as I would with Dataset.from_generator? Should I move the preprocessing into my torch Dataset?

lhoestq (Member) commented Mar 30, 2023

Iterable datasets are lazy: exactly like with_transform, they apply processing on the fly when the examples are accessed.

Therefore you can use my_iterable_dataset.map() instead :)
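The laziness described above can be illustrated with a plain Python generator pipeline: nothing runs until the examples are actually consumed. This is only an analogy for the behavior of `IterableDataset.map`, not the `datasets` internals:

```python
processed = []  # records when work actually happens

def source():
    # stands in for an iterable dataset's example stream
    for i in range(3):
        yield {"value": i}

def mapped(stream, fn):
    # processing happens on the fly, one example at a time
    for ex in stream:
        processed.append(ex["value"])
        yield fn(ex)

stream = mapped(source(), lambda ex: {"value": ex["value"] * 2})
before = list(processed)   # nothing has been processed yet
result = list(stream)      # consuming the stream triggers the processing
```

Building the pipeline is free; the cost is paid only as examples flow through it, which is why `map` on an iterable dataset behaves like `with_transform` on a map-style one.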

winnechan commented

@lhoestq thanks a lot and I have successfully made it work~

atyshka commented May 2, 2023

@lhoestq I am having a similar issue. Can you help me understand which kinds of generators are picklable? I previously thought that no generators were picklable, so I'm intrigued to hear this.

lhoestq (Member) commented May 5, 2023

Generator functions are generally picklable. E.g.

import dill as pickle

def generator_fn():
    for i in range(10):
        yield i

pickle.dumps(generator_fn)

However, generator objects are not picklable:

generator = generator_fn()
pickle.dumps(generator)
# TypeError: cannot pickle 'generator' object

Though it can happen that some generator functions are not recursively picklable if they use global objects that are not picklable:

def generator_fn_not_picklable():
    for i in generator:
        yield i

pickle.dumps(generator_fn_not_picklable, recurse=True)
# TypeError: cannot pickle 'generator' object

AeroDEmi commented
I'm trying to create an IterableDataset from a generator but I get this error:
PicklingError: Can't pickle <built-in function input>: it's not the same object as builtins.input

What can I do?

6 participants