
How to pretrain T5 #122

Open

TristanThrush opened this issue Dec 30, 2022 · 0 comments

TristanThrush commented Dec 30, 2022

Hi there, I'm wondering if there is an example of how to use this repo to pretrain T5?

I saw this file and thought it might serve as a starting point for an example. But when I try to run it, I get this error:

```
(benchmarking) tristan_huggingface_co@tristan-olm-training-a100-80:~/oslo/tests/transformers/models/mt5$ python test_training.py
Downloading builder script: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.8k/28.8k [00:00<00:00, 351kB/s]
Downloading metadata: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.7k/28.7k [00:00<00:00, 9.87MB/s]
Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, post-processed: Unknown size, total: 11.90 MiB) to /home/tristan_huggingface_co/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.44M/7.44M [00:01<00:00, 5.51MB/s]
Dataset glue downloaded and prepared to /home/tristan_huggingface_co/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 698.12it/s]
  0%|                                                                                                                                                                                          | 0/68 [00:00<?, ?ba/s]
Traceback (most recent call last):
  File "test_training.py", line 60, in <module>
    processed_dataset = dataset.map(
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/dataset_dict.py", line 771, in map
    {
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/dataset_dict.py", line 772, in <dictcomp>
    k: dataset.map(
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2449, in map
    return self._map_single(
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 577, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 544, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/fingerprint.py", line 480, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2849, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2729, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2409, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "/home/tristan_huggingface_co/oslo/oslo/transformers/tasks/data_t5_pretraining.py", line 57, in __call__
    list_of_input_ids: List[List[int]] = self._tokenizer(
TypeError: 'str' object is not callable
```
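From the traceback, it looks like `self._tokenizer` inside the processor is still a plain string (presumably the model name) rather than a tokenizer object by the time `__call__` runs. In case it's useful, here is roughly what I expected to work. The `ProcessorForT5Pretraining` name is taken from `data_t5_pretraining.py`, but the constructor signature is my guess, so apologies if I'm holding it wrong:

```python
# Sketch of what I expected to work (signature guessed, not confirmed):
# pass an instantiated tokenizer so that self._tokenizer is callable.
from datasets import load_dataset
from transformers import AutoTokenizer

from oslo.transformers.tasks.data_t5_pretraining import ProcessorForT5Pretraining

tokenizer = AutoTokenizer.from_pretrained("t5-small")
processor = ProcessorForT5Pretraining(tokenizer, max_length=512)  # hypothetical args

dataset = load_dataset("glue", "sst2")
processed_dataset = dataset.map(
    processor,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
```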

Separately, I had to downgrade my version of datasets to get this far.
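For context, the objective I'm after is ordinary T5 span corruption, i.e. the kind of step shown in this plain `transformers` snippet (adapted from the T5 docs, no oslo involved), just scaled up with this repo's parallelism:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Span corruption: masked spans in the input are replaced by sentinel tokens,
# and the target spells out each sentinel followed by the dropped span.
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()  # one pretraining step's worth of gradients
```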

Thanks for any help anyone can give! TL;DR: I'm looking for a working example of T5 pretraining.
