Datasets crashing runs due to KeyError #6124

Closed
conceptofmind opened this issue Aug 5, 2023 · 7 comments

@conceptofmind

Describe the bug

Hi all,

I have been running into a pretty persistent issue recently when trying to load datasets.

    train_dataset = load_dataset(
        'llama-2-7b-tokenized', 
        split = 'train'
    )

I receive a KeyError that crashes the runs:

Traceback (most recent call last):
    main()

    train_dataset = load_dataset(
                    ^^^^^^^^^^^^^
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
    raise e1 from None

    ).get_module()
      ^^^^^^^^^^^^
    else get_data_patterns(base_path, download_config=self.download_config)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    return _get_data_files_patterns(resolver)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    data_files = pattern_resolver(pattern)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
    fs, _, _ = get_fs_token_paths(pattern, storage_options=storage_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    paths = [f for f in sorted(fs.glob(paths)) if not fs.isdir(f)]
                               ^^^^^^^^^^^^^^

    allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    for _, dirs, files in self.walk(path, maxdepth, detail=True, **kwargs):

    listing = self.ls(path, detail=True, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    "last_modified": parse_datetime(tree_item["lastCommit"]["date"]),
                                    ~~~~~~~~~^^^^^^^^^^^^^^
KeyError: 'lastCommit'

Any help would be greatly appreciated.

Thank you,

Enrico

Steps to reproduce the bug

Load the dataset from the Hugging Face Hub.

    train_dataset = load_dataset(
        'llama-2-7b-tokenized', 
        split = 'train'
    )

Expected behavior

Loads the dataset.

Environment info

datasets-2.14.3
CUDA 11.8
Python 3.11

@erfanzar

I once had the same error and was able to fix it by pushing a dummy commit to my Hugging Face dataset repo.
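
For reference, here is a minimal sketch of that workaround using huggingface_hub (the repo id and file name are placeholders, and uploading a small file is just one convenient way to create a new commit):

    from huggingface_hub import HfApi

    api = HfApi()

    # Uploading any small file creates a fresh commit on the dataset repo,
    # which is what the "dummy commit" workaround relies on.
    api.upload_file(
        path_or_fileobj=b"refresh repo metadata\n",  # in-memory content, no local file needed
        path_in_repo="dummy.txt",                    # placeholder file name
        repo_id="your-username/your-dataset",        # placeholder repo id
        repo_type="dataset",
        commit_message="Dummy commit to work around KeyError: 'lastCommit'",
    )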

@mariosasko
Collaborator

Hi! We need a reproducer to fix this. Can you provide a link to the dataset (if it's public)?

@conceptofmind
Author

conceptofmind commented Aug 20, 2023

> Hi! We need a reproducer to fix this. Can you provide a link to the dataset (if it's public)?

Hi Mario,

Unfortunately, the dataset in question is currently private until the model is trained and released.

This is not happening with just one dataset but with numerous hosted private datasets.

I am only loading the dataset and doing nothing else currently. It seems to happen completely sporadically.

Thank you,

Enrico

@rs9000

rs9000 commented Oct 12, 2023

Hi,

I have the same error in the dataset viewer with my dataset
https://huggingface.co/datasets/elsaEU/ELSA10M_track1

Has anyone solved this issue?

Edit: After a dummy commit, the error changed to ConfigNamesError.

@mariosasko
Collaborator

@rs9000 The problem seems to be the (large) number of commits, as explained in https://huggingface.co/docs/hub/repositories-recommendations. This can be fixed by running:

    import huggingface_hub

    # Squash the repo's commit history into a single commit.
    # repo_type="dataset" targets the dataset repo (the default repo type is "model").
    huggingface_hub.super_squash_history(
        repo_id="elsaEU/ELSA10M_track1",
        repo_type="dataset",
    )

The issue stems from push_to_hub creating one commit per uploaded shard; #6269 should fix this by creating one commit per 50 uploaded shards by default. The linked PR will be included in the next datasets release.
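
Until that release, one way to keep the commit count down when pushing with datasets 2.14.x is to push fewer, larger shards (a sketch; max_shard_size is an existing push_to_hub argument, while the file and repo names are placeholders):

    from datasets import load_dataset

    ds = load_dataset("json", data_files="data.jsonl", split="train")

    # Fewer, larger shards mean fewer uploaded files and therefore fewer commits
    # on the Hub repo, since push_to_hub currently creates one commit per shard.
    ds.push_to_hub("your-username/your-dataset", max_shard_size="2GB", private=True)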

cc @lhoestq @severo for visibility

@rs9000

rs9000 commented Oct 12, 2023

Thank you @mariosasko, it works.

@mariosasko
Collaborator

#6269 has been merged, so I'm closing this issue.
