Bugs when fine tuning the gpt2 #12965

yananchen1989 · 2021-07-31T07:44:03Z

Transformers Version: 4.8.2
Torch Version: 1.8.0

I am using the official script to fine-tune GPT-2 on CSV files.
The script:
https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm_no_trainer.py

How the train and validation files are built:

df_train_ft_aug.rename(columns={'content': 'text'}).sample(frac=1).to_csv(train_file, index=False)
df_train_ft_aug.rename(columns={'content': 'text'}).sample(frac=0.2).to_csv(validation_file, index=False)

My shell command:

python -u ./run_clm_no_trainer.py \
                --num_train_epochs 7 \
                --train_file './fintune_csvs/stsa_train_finetune.csv' \
                --validation_file './fintune_csvs/stsa_test_finetune.csv'  \
                --model_name_or_path gpt2 \
                --per_device_train_batch_size 16 \
                --per_device_eval_batch_size 16 \
                --output_dir "./finetune_gpt2_stsa" \
                --preprocessing_num_workers 16 \
                --block_size 256 --overwrite_cache True

The CSV files contain a column named 'text', which is used for fine-tuning the model.

However, I always get the errors below, which suggest that the sequences in a dataloader batch have different lengths:

File "./run_clm_no_trainer.py", line 503, in <module>
main()
File "./run_clm_no_trainer.py", line 480, in main
for step, batch in enumerate(eval_dataloader):
File "/usr/local/lib/python3.6/dist-packages/accelerate/data_loader.py", line 289, in __iter__
for batch in super().__iter__():
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 557, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py", line 80, in default_data_collator
batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 256 at dim 1 (got 52)

The next time I ran it, it returned a similar error:

ValueError: expected sequence of length 168 at dim 1 (got 136)

Then I modified the arguments passed to the tokenizer:

tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
    return tokenizer(examples[text_column_name], padding=True, truncation=True)

This seems to fix the problem. However, the generated texts are quite short after this change.
Any suggestions?

LysandreJik · 2021-08-04T09:52:12Z

Pinging @sgugger

sgugger · 2021-08-04T10:03:07Z

It's hard to investigate more without having the data. Adding padding when fine-tuning GPT-2 is a very bad idea, since GPT-2 does not have a padding token, and it shouldn't be necessary. Could you provide us with a reproducer that includes the data?

yananchen1989 · 2021-08-04T15:03:33Z

> It's hard to investigate more without having the data. Adding padding when fine-tuning GPT-2 is a very bad idea, since GPT-2 does not have a padding token, and it shouldn't be necessary. Could you provide us with a reproducer that includes the data?

Thanks for your suggestion. I will check my data so that it works with the default fine-tuning settings.
By the way, should the eos_token, <|endoftext|>, be appended to the end of each sample (the text column in the CSV files)?
@sgugger

sgugger · 2021-08-04T15:04:42Z

If it's not done by the tokenizer, yes it should.
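
A minimal sketch of that step, assuming the tokenizer does not add the token itself. The helper name `append_eos` is hypothetical; the token string is the one visible in the samples posted in this thread.

```python
from typing import List

# GPT-2's end-of-text marker, as it appears in the samples in this thread.
EOS_TOKEN = "<|endoftext|>"


def append_eos(texts: List[str], eos_token: str = EOS_TOKEN) -> List[str]:
    """Return a copy of `texts` in which every sample ends with `eos_token`.

    The token is added only when it is missing, so running this twice
    does not duplicate it.
    """
    out = []
    for text in texts:
        stripped = text.rstrip()
        if not stripped.endswith(eos_token):
            stripped += eos_token
        out.append(stripped)
    return out


samples = append_eos(["miley cyrus fan arrested", "blackberry z3 review<|endoftext|>"])
```

With a pandas DataFrame, the same helper could be applied to the 'text' column before writing the CSV.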

yananchen1989 · 2021-08-20T03:50:53Z

some people do deserve'right to be forgotten'– but law's power shouldn't rest...<|endoftext|>
cyrus bus burns on way to no ; she surprises cat's meow crowd<|endoftext|>
eu commission approves uk's carphone, dixons merger<|endoftext|>
miley cyrus fan arrested<|endoftext|>
rdio, crackle, vudu add chromecast support<|endoftext|>
being a cynic linked to tripled risk of developing dementia, finland study suggests<|endoftext|>
australia, japan strike trade deal<|endoftext|>
record low teen birth rate not low enough, says cdc<|endoftext|>
legendary house music dj frankie knuckles dies aged 59<|endoftext|>
nhtsa closes tesla investigations : reuters<|endoftext|>
brad pitt speaks out on premiere punching<|endoftext|>
twitter's users are in asia, but its revenue is in the us<|endoftext|>
new report questions effectiveness of flu drug tamiflu<|endoftext|>
hilary duff talks " really difficult " split from mike comrie<|endoftext|>
the top 10 reasons why'guardians of the galaxy'is awesome<|endoftext|>
we had a blast at the planes : fire and rescue red carpet premiere!<|endoftext|>
fcc extends neutrality comment deadline after site crashes<|endoftext|>
olivia munn lives in a haunted house<|endoftext|>
uk agency invests in vfx house to create virtual reality content<|endoftext|>
of mice and men must die<|endoftext|>
death toll in w. african ebola outbreak rises to 518<|endoftext|>
cheaper gas, food push down producer prices<|endoftext|>
tesla opens up patent portfolio to promote innovation in electronic car...<|endoftext|>
useful android tips that you should know<|endoftext|>
autism diagnoses on the rise<|endoftext|>
u. s. stock futures rising ahead of testimony from fed chair<|endoftext|>
blackberry z3 review<|endoftext|>
update 1 - buffett's berkshire hathaway buys stake in verizon, adds to wal - mart<|endoftext|>
st. luke's improves, but easton hospital falters in safety ratings<|endoftext|>
drowsy driving is more common than you think<|endoftext|>
republicans nab approval for '. gop'internet domain<|endoftext|>
apple says sold well over 800 million mobile devices<|endoftext|>
the dot view case for the one m8 is in htc's store for $ 50, not available for...<|endoftext|>
physicians push for extension of medicaid reimbursement increase<|endoftext|>
mobile fix : chinese ipos, first party data and iphone 6<|endoftext|>
ranking the country's best and worst jobs<|endoftext|>
nerdnews : marvel comics picks a woman to be the next thor<|endoftext|>
men with eating disorders slow to get help, study shows<|endoftext|>
apple eyeing beats electronics for $ 3. 2 bln<|endoftext|>
measles update for the united states<|endoftext|>
former'scandal'star arrested<|endoftext|>
us economy shrank at steep 2. 9 percent rate<|endoftext|>
white house : medicaid expansion would have covered 120k wisconsinites<|endoftext|>
samsung galaxy k zoom goes official with 20. 7mp camera, 10x optical zoom<|endoftext|>
asian stocks tumble on weak china, japan data<|endoftext|>
killer virus boosts bacon prices<|endoftext|>
e - cig industry awaits federal regs<|endoftext|>
what would you do to get your cell phone back?<|endoftext|>
dc circuit brings back rule limiting bank fees<|endoftext|>
texas nuke site increases monitoring of containers<|endoftext|>
10 worst cities for spring allergies<|endoftext|>
taxi drivers in europe protest over uber cab service<|endoftext|>
taco bell fires second shot at mcdonald's<|endoftext|>
a brand - new meteor shower could be spectacular tonight — here's how to...<|endoftext|>
argentina debt default 101 : what's at stake? ( + video )<|endoftext|>
wikipedia medical entries 90 % inaccurate<|endoftext|>
selweski : april 15 may have marked the last tax day<|endoftext|>
no real progress on child obesity, latest report says<|endoftext|>
skin cancer rate increases in north east<|endoftext|>
ambassador drives into history : hm kills india's oldest car<|endoftext|>
super moon to brighten summer sky<|endoftext|>
google inc ( nasdaq : goog ) beats apple inc. ( nasdaq : aapl ) in introducing...<|endoftext|>
samsung galaxy s5 zoom gets fcc certification<|endoftext|>
overdose death rates drop in states with medical marijuana laws<|endoftext|>
japanese automakers recall 3 mn vehicles for airbag defect<|endoftext|>
the white house has released the definitive report on climate change, and...<|endoftext|>
bitcoin value and price in silk road auction : us marshals receive offers from...<|endoftext|>
see christian hendricks, elisabeth moss & others before they were on " mad...<|endoftext|>
bnp paribas nears up to usd9bn settlement with us authorities - source<|endoftext|>
browns owner jimmy haslam won't be punished by nfl, per report<|endoftext|>
kristin cavallari defends her choice not to vaccinate her child<|endoftext|>
us manufacturing gaining on china, brazil and rest of world, study finds<|endoftext|>
emma stone addresses weight criticisms in ( typically awesome ) fashion<|endoftext|>
billions wasted on flu drug : researchers<|endoftext|>
spacecraft crashes on moon to end mission<|endoftext|>
chinese manufacturing reaches six - month high, official figures show<|endoftext|>
sports day at greatham primary<|endoftext|>
pluto's moon may have had an underground ocean<|endoftext|>
starbucks'oprah - branded tea ; nyc's macaron day<|endoftext|>
microsoft has unveiled the new nokia x2<|endoftext|>
caught on tape : emt driver voguing<|endoftext|>
' deliver us from evil'is a genre hopping & highly entertaining piece of cinema<|endoftext|>
mobile county : 12 new hiv cases reported in may alone, free testing offered<|endoftext|>
roche, exelixis skin cancer drug delays tumor progression<|endoftext|>
ntsb faults pilot'mismanagment'in asiana flight - ktbs. com - shreveport, la...<|endoftext|>
new skype translator offers nearly real - time audio translation<|endoftext|>
the grand budapest hotel is both a sly crime caper and a charming ode to old...<|endoftext|>
driverless cars will be on uk roads by january 2015<|endoftext|>
space giants join forces to battle spacex : this is how cheap space travel begins<|endoftext|>
weekend report :'captain america'wins close fight with'rio 2 '<|endoftext|>
sc business notebook, may 24<|endoftext|>
21st century fox confirms rejected bid for time warner<|endoftext|>
usher bounces his head on nicki minaj's butt at the 2014 mtv vmas : gif<|endoftext|>
apple opens os x beta testing to all users with new seed program<|endoftext|>
anthrax discovered in beef in hungary<|endoftext|>
iowa farmer chris soules is abc's next'bachelor'| the republic<|endoftext|>
murdoch names son lachlan as vice president of media empire<|endoftext|>
cdc reports first chikungunya case acquired in the united states ; disease...<|endoftext|>
shailene woodley on being cut from amazing spider - man 2 : " was i awful? "<|endoftext|>
justina pelletier heads home after judge ends state custody<|endoftext|>
singer chris brown's dc assault trial is delayed for months ; judge says singer to...<|endoftext|>
android wear : 5 things developers need to know<|endoftext|>
micro machine macro funding<|endoftext|>
fcc forced to push back comment deadline on net neutrality rules<|endoftext|>
hgtv slammed for excluding anti - gay christian consumers from america's...<|endoftext|>
' mom mobiles'a shrinking category for automakers<|endoftext|>
malaysia airlines considers re - branding itself<|endoftext|>
review : 50 cent's " animal ambition "<|endoftext|>
hump day unusual moment : little roger & the goosebumps “ stairway to...<|endoftext|>
women happier at work than home, study finds<|endoftext|>
awfully good : sharknado 2<|endoftext|>
annie leibovitz axed kim and kanye west wedding gig at last minute<|endoftext|>
former astrazeneca chief executive attacks pfizer deal<|endoftext|>
private funeral for mick jagger's longtime girlfriend, l'wren scott, held in los...<|endoftext|>
government allots p6. 8m for aquino's trip to myanmar<|endoftext|>
( click the phrases to see a list )<|endoftext|>
the - dream arrested for felony assault on pregnant ex - girlfriend<|endoftext|>
kanye west gives 20 - minute speech, says the kardashians are'the most...<|endoftext|>
team clones stem cells from 75 - year - old's skin<|endoftext|>
sober smartphone app aids boozers<|endoftext|>
spread of polio is now a world health emergency, u. n. says<|endoftext|>
' true blood'recap : [ spoiler ] is killed off — shocking death<|endoftext|>
how game - changing was game of thrones'big reveal?<|endoftext|>
alcohol costs us $ 224bn a year<|endoftext|>
bmw investing $ 1 billion in mexican assembly plant<|endoftext|>
report finds st. johns county florida's healthiest county<|endoftext|>
giant of the skies was like'a dragon '<|endoftext|>
beyonce named as world's most powerful celebrity<|endoftext|>

yananchen1989 · 2021-08-20T03:54:42Z

@sgugger Hello, I tried to reproduce this error. The texts above are the samples used for fine-tuning GPT-2; they make up the 'text' column.

train_file = './fintune_csvs/{}_train_finetune_32_{}.csv'.format(args.dsn, seed)
validation_file = './fintune_csvs/{}_test_finetune_32_{}.csv'.format(args.dsn, seed)


ds.df_train['text'] = ds.df_train['content'] + tokenizer_gpt2.eos_token
ds.df_test['text'] = ds.df_test['content'] + tokenizer_gpt2.eos_token

ds.df_train[['text']].sample(frac=1).to_csv(train_file, index=False)
ds.df_test[['text']].sample(frac=1).to_csv(validation_file, index=False)


model_output_path = "./finetune_gpt2/{}_32_{}".format(args.dsn, seed) 
os.system(
"CUDA_VISIBLE_DEVICES=1 python -u ./run_clm_no_trainer.py \
        --num_train_epochs {} \
        --train_file {} \
        --validation_file {} \
        --model_name_or_path gpt2 \
        --per_device_train_batch_size 16 \
        --per_device_eval_batch_size 16 \
        --output_dir {} \
        --preprocessing_num_workers 16 --overwrite_cache True \
        --block_size 256".format(args.ft_epochs, train_file, validation_file, model_output_path) )

run_clm_no_trainer.py is the official script from the transformers repo.

When I use another dataset, which has longer sentences than this one, there is no error and the fine-tuning process runs fine.

yananchen1989 · 2021-08-20T03:55:48Z

I also tried a sentiment analysis dataset, which also consists of relatively short sentences. The same error occurred.

yananchen1989 · 2021-08-20T03:56:08Z

Grouping texts in chunks of 256 #11: 100%|█████████████████████████████| 1/1 [00:00<00:00, 25.61ba/s]
Grouping texts in chunks of 256 #12: 100%|█████████████████████████████| 1/1 [00:00<00:00, 28.63ba/s]
Grouping texts in chunks of 256 #13: 100%|█████████████████████████████| 1/1 [00:00<00:00, 25.03ba/s]
Grouping texts in chunks of 256 #14: 100%|█████████████████████████████| 1/1 [00:00<00:00, 23.64ba/s]
Grouping texts in chunks of 256 #15: 100%|█████████████████████████████| 1/1 [00:00<00:00, 30.86ba/s]
08/20/2021 03:43:32 - INFO - __main__ - ***** Running training *****
08/20/2021 03:43:32 - INFO - __main__ - Num examples = 16
08/20/2021 03:43:32 - INFO - __main__ - Num Epochs = 1
08/20/2021 03:43:32 - INFO - __main__ - Instantaneous batch size per device = 16
08/20/2021 03:43:32 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
08/20/2021 03:43:32 - INFO - __main__ - Gradient Accumulation steps = 1
08/20/2021 03:43:32 - INFO - __main__ - Total optimization steps = 1
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "./run_clm_no_trainer.py", line 503, in <module>
main()
File "./run_clm_no_trainer.py", line 463, in main
for step, batch in enumerate(train_dataloader):
File "/usr/local/lib/python3.6/dist-packages/accelerate/data_loader.py", line 289, in __iter__
for batch in super().__iter__():
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 557, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py", line 80, in default_data_collator
batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 135 at dim 1 (got 112)

yananchen1989 · 2021-08-20T04:14:30Z

I tried another way of organising the training corpus, as a txt file:

with open(train_file, 'w') as f:
    f.write(" {} ".format(tokenizer_gpt2.eos_token).join(ds.df_train['content'].tolist()))

with open(validation_file, 'w') as f:
    f.write(" {} ".format(tokenizer_gpt2.eos_token).join(ds.df_test['content'].tolist()))

The same error occurs.

33%|███▎ | 1/3 [00:00<00:01, 1.33it/s]
Traceback (most recent call last):
File "./run_clm_no_trainer.py", line 483, in <module>
main()
File "./run_clm_no_trainer.py", line 460, in main
for step, batch in enumerate(eval_dataloader):
File "/usr/local/lib/python3.6/dist-packages/accelerate/data_loader.py", line 289, in __iter__
for batch in super().__iter__():
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 557, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py", line 80, in default_data_collator
batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 256 at dim 1 (got 117)

sgugger · 2021-08-30T11:41:08Z

Yes, this all points to your corpus being too short to form a full batch. You should use a smaller batch size or a smaller block size.
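
To illustrate why a short corpus produces chunks of unequal length, here is a rough sketch of a grouping step. It is not the script's actual code: it assumes the grouping function concatenates the tokenized rows of each map batch, slices them into `block_size` pieces, and keeps a single short chunk when a batch holds fewer than `block_size` tokens. The shard contents are made up. Under that assumption, each of the 16 preprocessing workers can emit one chunk of a different length, which would match the "Num examples = 16" line in the log above and the mismatched lengths in the errors.

```python
from typing import Dict, List


def group_texts(examples: Dict[str, List[List[int]]], block_size: int) -> Dict[str, List[List[int]]]:
    """Concatenate the tokenized rows of one map batch and cut them into blocks."""
    concatenated = {k: [tok for row in rows for tok in row] for k, rows in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Assumption: the remainder is dropped only when at least one full block
    # exists, so a batch shorter than block_size yields one short chunk.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Two hypothetical worker shards of short headlines (token ids are dummies).
shard_a = {"input_ids": [[1] * 12 for _ in range(10)]}  # 120 tokens in total
shard_b = {"input_ids": [[2] * 13 for _ in range(4)]}   # 52 tokens in total

# With block_size=256 neither shard fills a block, so the chunks are ragged
# and cannot be stacked into one tensor by the collator.
chunks = group_texts(shard_a, 256)["input_ids"] + group_texts(shard_b, 256)["input_ids"]
chunk_lengths = [len(c) for c in chunks]

# With a block size smaller than every shard, all chunks have the same length.
small_chunks = group_texts(shard_a, 32)["input_ids"] + group_texts(shard_b, 32)["input_ids"]
small_lengths = sorted({len(c) for c in small_chunks})
```

In this sketch the fix suggested above corresponds to choosing a `block_size` no larger than the number of tokens each worker shard receives; using fewer preprocessing workers gives each shard more tokens and should have a similar effect.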

yananchen1989 closed this as completed Aug 31, 2021
