Bugs when fine tuning the gpt2 #12965

Closed
yananchen1989 opened this issue Jul 31, 2021 · 10 comments

yananchen1989 commented Jul 31, 2021

Transformers Version: 4.8.2
Torch Version: 1.8.0

I am using the official script to fine-tune GPT-2 on CSV files.
The script:
https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm_no_trainer.py

How the train and validation files are built:

df_train_ft_aug.rename(columns={'content': 'text'}).sample(frac=1).to_csv(train_file, index=False)
df_train_ft_aug.rename(columns={'content': 'text'}).sample(frac=0.2).to_csv(validation_file, index=False)

My shell command:

python -u ./run_clm_no_trainer.py \
                --num_train_epochs 7 \
                --train_file './fintune_csvs/stsa_train_finetune.csv' \
                --validation_file './fintune_csvs/stsa_test_finetune.csv'  \
                --model_name_or_path gpt2 \
                --per_device_train_batch_size 16 \
                --per_device_eval_batch_size 16 \
                --output_dir "./finetune_gpt2_stsa" \
                --preprocessing_num_workers 16 \
                --block_size 256 --overwrite_cache True

where the CSV files contain a column named 'text' that is used for fine-tuning the model.

However, the run always fails with the error below, which points to inconsistent sequence lengths in the batches built by the dataloader:

  File "./run_clm_no_trainer.py", line 503, in <module>
    main()
  File "./run_clm_no_trainer.py", line 480, in main
    for step, batch in enumerate(eval_dataloader):
  File "/usr/local/lib/python3.6/dist-packages/accelerate/data_loader.py", line 289, in __iter__
    for batch in super().__iter__():
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py", line 80, in default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 256 at dim 1 (got 52)

The next time I ran it, it returned a similar error:

ValueError: expected sequence of length 168 at dim 1 (got 136)
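
Both messages say that the sequences collected into one batch do not all have the same length, so they cannot be stacked into a single tensor. A minimal sketch of a check that could be run on the processed examples before they reach the dataloader; `length_histogram` and the toy `features` list are hypothetical, not part of the script:

```python
from collections import Counter
from typing import Dict, List


def length_histogram(features: List[Dict[str, List[int]]], key: str = "input_ids") -> Counter:
    """Count how many examples have each sequence length."""
    return Counter(len(f[key]) for f in features)


# Toy stand-in for the processed dataset: one full block and one short one,
# mirroring the "expected 256 ... got 52" message above.
features = [{"input_ids": [0] * 256}, {"input_ids": [0] * 52}]

hist = length_histogram(features)
if len(hist) > 1:
    # More than one distinct length means the batch cannot be stacked as-is.
    print("ragged sequence lengths:", dict(hist))
```

If the histogram has a single key, every example has the same length and stacking them should not raise this error.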

Then I modified the tokenizer call:

tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
    return tokenizer(examples[text_column_name], padding=True, truncation=True)

This seems to fix the problem. However, the generated texts are quite short after this change.
Any suggestions?

@LysandreJik
Member

Pinging @sgugger

@sgugger
Collaborator

sgugger commented Aug 4, 2021

It's hard to investigate further without the data. Adding padding is a very bad idea when fine-tuning GPT-2, which does not have a padding token, and it shouldn't be necessary. Could you provide us with a reproducer that includes the data?
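
If padding is used despite this advice, one common precaution is to keep padded positions out of the loss so the model is not trained to predict a run of padding tokens. The sketch below is a minimal illustration under that assumption; `mask_padding_labels` and the example token ids are hypothetical, and -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_padding_labels(input_ids: List[int], attention_mask: List[int]) -> List[int]:
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


# Made-up token ids; 50256 is GPT-2's <|endoftext|> id, reused here as padding.
input_ids = [1212, 318, 257, 1332, 50256, 50256, 50256]
attention_mask = [1, 1, 1, 1, 1, 0, 0]  # the first 50256 is a real EOS, the rest is padding

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

The real end-of-text token stays in the labels, while the padded copies of it are ignored.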

@yananchen1989
Author

yananchen1989 commented Aug 4, 2021

Thanks for your suggestion. I will check my data so that it works with the default fine-tuning settings.
By the way, should the eos_token, <|endoftext|>, be appended to the end of each sample (the text column in the CSV files)?
@sgugger

@sgugger
Collaborator

sgugger commented Aug 4, 2021

If it's not done by the tokenizer, yes it should.
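
A minimal sketch of appending the token to each sample before writing the CSV, assuming the samples are plain Python strings; `append_eos` is a hypothetical helper, and it skips samples that already end with the token so it can be run more than once:

```python
from typing import Iterable, List

EOS_TOKEN = "<|endoftext|>"  # GPT-2's end-of-text token


def append_eos(texts: Iterable[str], eos_token: str = EOS_TOKEN) -> List[str]:
    """Append eos_token to each text unless it already ends with it."""
    out = []
    for text in texts:
        text = text.rstrip()
        out.append(text if text.endswith(eos_token) else text + eos_token)
    return out


samples = ["miley cyrus fan arrested", "blackberry z3 review<|endoftext|>"]
with_eos = append_eos(samples)
print(with_eos)
```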

@yananchen1989
Author

yananchen1989 commented Aug 20, 2021

some people do deserve'right to be forgotten'– but law's power shouldn't rest...<|endoftext|>
cyrus bus burns on way to no ; she surprises cat's meow crowd<|endoftext|>
eu commission approves uk's carphone, dixons merger<|endoftext|>
miley cyrus fan arrested<|endoftext|>
rdio, crackle, vudu add chromecast support<|endoftext|>
being a cynic linked to tripled risk of developing dementia, finland study suggests<|endoftext|>
australia, japan strike trade deal<|endoftext|>
record low teen birth rate not low enough, says cdc<|endoftext|>
legendary house music dj frankie knuckles dies aged 59<|endoftext|>
nhtsa closes tesla investigations : reuters<|endoftext|>
brad pitt speaks out on premiere punching<|endoftext|>
twitter's users are in asia, but its revenue is in the us<|endoftext|>
new report questions effectiveness of flu drug tamiflu<|endoftext|>
hilary duff talks " really difficult " split from mike comrie<|endoftext|>
the top 10 reasons why'guardians of the galaxy'is awesome<|endoftext|>
we had a blast at the planes : fire and rescue red carpet premiere!<|endoftext|>
fcc extends neutrality comment deadline after site crashes<|endoftext|>
olivia munn lives in a haunted house<|endoftext|>
uk agency invests in vfx house to create virtual reality content<|endoftext|>
of mice and men must die<|endoftext|>
death toll in w. african ebola outbreak rises to 518<|endoftext|>
cheaper gas, food push down producer prices<|endoftext|>
tesla opens up patent portfolio to promote innovation in electronic car...<|endoftext|>
useful android tips that you should know<|endoftext|>
autism diagnoses on the rise<|endoftext|>
u. s. stock futures rising ahead of testimony from fed chair<|endoftext|>
blackberry z3 review<|endoftext|>
update 1 - buffett's berkshire hathaway buys stake in verizon, adds to wal - mart<|endoftext|>
st. luke's improves, but easton hospital falters in safety ratings<|endoftext|>
drowsy driving is more common than you think<|endoftext|>
republicans nab approval for '. gop'internet domain<|endoftext|>
apple says sold well over 800 million mobile devices<|endoftext|>
the dot view case for the one m8 is in htc's store for $ 50, not available for...<|endoftext|>
physicians push for extension of medicaid reimbursement increase<|endoftext|>
mobile fix : chinese ipos, first party data and iphone 6<|endoftext|>
ranking the country's best and worst jobs<|endoftext|>
nerdnews : marvel comics picks a woman to be the next thor<|endoftext|>
men with eating disorders slow to get help, study shows<|endoftext|>
apple eyeing beats electronics for $ 3. 2 bln<|endoftext|>
measles update for the united states<|endoftext|>
former'scandal'star arrested<|endoftext|>
us economy shrank at steep 2. 9 percent rate<|endoftext|>
white house : medicaid expansion would have covered 120k wisconsinites<|endoftext|>
samsung galaxy k zoom goes official with 20. 7mp camera, 10x optical zoom<|endoftext|>
asian stocks tumble on weak china, japan data<|endoftext|>
killer virus boosts bacon prices<|endoftext|>
e - cig industry awaits federal regs<|endoftext|>
what would you do to get your cell phone back?<|endoftext|>
dc circuit brings back rule limiting bank fees<|endoftext|>
texas nuke site increases monitoring of containers<|endoftext|>
10 worst cities for spring allergies<|endoftext|>
taxi drivers in europe protest over uber cab service<|endoftext|>
taco bell fires second shot at mcdonald's<|endoftext|>
a brand - new meteor shower could be spectacular tonight — here's how to...<|endoftext|>
argentina debt default 101 : what's at stake? ( + video )<|endoftext|>
wikipedia medical entries 90 % inaccurate<|endoftext|>
selweski : april 15 may have marked the last tax day<|endoftext|>
no real progress on child obesity, latest report says<|endoftext|>
skin cancer rate increases in north east<|endoftext|>
ambassador drives into history : hm kills india's oldest car<|endoftext|>
super moon to brighten summer sky<|endoftext|>
google inc ( nasdaq : goog ) beats apple inc. ( nasdaq : aapl ) in introducing...<|endoftext|>
samsung galaxy s5 zoom gets fcc certification<|endoftext|>
overdose death rates drop in states with medical marijuana laws<|endoftext|>
japanese automakers recall 3 mn vehicles for airbag defect<|endoftext|>
the white house has released the definitive report on climate change, and...<|endoftext|>
bitcoin value and price in silk road auction : us marshals receive offers from...<|endoftext|>
see christian hendricks, elisabeth moss & others before they were on " mad...<|endoftext|>
bnp paribas nears up to usd9bn settlement with us authorities - source<|endoftext|>
browns owner jimmy haslam won't be punished by nfl, per report<|endoftext|>
kristin cavallari defends her choice not to vaccinate her child<|endoftext|>
us manufacturing gaining on china, brazil and rest of world, study finds<|endoftext|>
emma stone addresses weight criticisms in ( typically awesome ) fashion<|endoftext|>
billions wasted on flu drug : researchers<|endoftext|>
spacecraft crashes on moon to end mission<|endoftext|>
chinese manufacturing reaches six - month high, official figures show<|endoftext|>
sports day at greatham primary<|endoftext|>
pluto's moon may have had an underground ocean<|endoftext|>
starbucks'oprah - branded tea ; nyc's macaron day<|endoftext|>
microsoft has unveiled the new nokia x2<|endoftext|>
caught on tape : emt driver voguing<|endoftext|>
' deliver us from evil'is a genre hopping & highly entertaining piece of cinema<|endoftext|>
mobile county : 12 new hiv cases reported in may alone, free testing offered<|endoftext|>
roche, exelixis skin cancer drug delays tumor progression<|endoftext|>
ntsb faults pilot'mismanagment'in asiana flight - ktbs. com - shreveport, la...<|endoftext|>
new skype translator offers nearly real - time audio translation<|endoftext|>
the grand budapest hotel is both a sly crime caper and a charming ode to old...<|endoftext|>
driverless cars will be on uk roads by january 2015<|endoftext|>
space giants join forces to battle spacex : this is how cheap space travel begins<|endoftext|>
weekend report :'captain america'wins close fight with'rio 2 '<|endoftext|>
sc business notebook, may 24<|endoftext|>
21st century fox confirms rejected bid for time warner<|endoftext|>
usher bounces his head on nicki minaj's butt at the 2014 mtv vmas : gif<|endoftext|>
apple opens os x beta testing to all users with new seed program<|endoftext|>
anthrax discovered in beef in hungary<|endoftext|>
iowa farmer chris soules is abc's next'bachelor'| the republic<|endoftext|>
murdoch names son lachlan as vice president of media empire<|endoftext|>
cdc reports first chikungunya case acquired in the united states ; disease...<|endoftext|>
shailene woodley on being cut from amazing spider - man 2 : " was i awful? "<|endoftext|>
justina pelletier heads home after judge ends state custody<|endoftext|>
singer chris brown's dc assault trial is delayed for months ; judge says singer to...<|endoftext|>
android wear : 5 things developers need to know<|endoftext|>
micro machine macro funding<|endoftext|>
fcc forced to push back comment deadline on net neutrality rules<|endoftext|>
hgtv slammed for excluding anti - gay christian consumers from america's...<|endoftext|>
' mom mobiles'a shrinking category for automakers<|endoftext|>
malaysia airlines considers re - branding itself<|endoftext|>
review : 50 cent's " animal ambition "<|endoftext|>
hump day unusual moment : little roger & the goosebumps “ stairway to...<|endoftext|>
women happier at work than home, study finds<|endoftext|>
awfully good : sharknado 2<|endoftext|>
annie leibovitz axed kim and kanye west wedding gig at last minute<|endoftext|>
former astrazeneca chief executive attacks pfizer deal<|endoftext|>
private funeral for mick jagger's longtime girlfriend, l'wren scott, held in los...<|endoftext|>
government allots p6. 8m for aquino's trip to myanmar<|endoftext|>
( click the phrases to see a list )<|endoftext|>
the - dream arrested for felony assault on pregnant ex - girlfriend<|endoftext|>
kanye west gives 20 - minute speech, says the kardashians are'the most...<|endoftext|>
team clones stem cells from 75 - year - old's skin<|endoftext|>
sober smartphone app aids boozers<|endoftext|>
spread of polio is now a world health emergency, u. n. says<|endoftext|>
' true blood'recap : [ spoiler ] is killed off — shocking death<|endoftext|>
how game - changing was game of thrones'big reveal?<|endoftext|>
alcohol costs us $ 224bn a year<|endoftext|>
bmw investing $ 1 billion in mexican assembly plant<|endoftext|>
report finds st. johns county florida's healthiest county<|endoftext|>
giant of the skies was like'a dragon '<|endoftext|>
beyonce named as world's most powerful celebrity<|endoftext|>

@yananchen1989
Author

@sgugger Hello, I tried to reproduce this error. The texts above are the samples used for fine-tuning GPT-2; they are the contents of the text column.

train_file = './fintune_csvs/{}_train_finetune_32_{}.csv'.format(args.dsn, seed)
validation_file = './fintune_csvs/{}_test_finetune_32_{}.csv'.format(args.dsn, seed)


ds.df_train['text'] = ds.df_train['content'] + tokenizer_gpt2.eos_token
ds.df_test['text'] = ds.df_test['content'] + tokenizer_gpt2.eos_token

ds.df_train[['text']].sample(frac=1).to_csv(train_file, index=False)
ds.df_test[['text']].sample(frac=1).to_csv(validation_file, index=False)


model_output_path = "./finetune_gpt2/{}_32_{}".format(args.dsn, seed) 
os.system(
"CUDA_VISIBLE_DEVICES=1 python -u ./run_clm_no_trainer.py \
        --num_train_epochs {} \
        --train_file {} \
        --validation_file {} \
        --model_name_or_path gpt2 \
        --per_device_train_batch_size 16 \
        --per_device_eval_batch_size 16 \
        --output_dir {} \
        --preprocessing_num_workers 16 --overwrite_cache True \
        --block_size 256".format(args.ft_epochs, train_file, validation_file, model_output_path) ) 

run_clm_no_trainer.py is the official script from transformers repo.

When I use another dataset, which has longer sentences than this one, there is no error and the fine-tuning process runs fine.

@yananchen1989
Author

I also tried a sentiment analysis dataset, which also consists of relatively short sentences. The same error occurred.

@yananchen1989
Author

Grouping texts in chunks of 256 #11: 100%|█████████████████████████████| 1/1 [00:00<00:00, 25.61ba/s]
Grouping texts in chunks of 256 #12: 100%|█████████████████████████████| 1/1 [00:00<00:00, 28.63ba/s]
Grouping texts in chunks of 256 #13: 100%|█████████████████████████████| 1/1 [00:00<00:00, 25.03ba/s]
Grouping texts in chunks of 256 #14: 100%|█████████████████████████████| 1/1 [00:00<00:00, 23.64ba/s]
Grouping texts in chunks of 256 #15: 100%|█████████████████████████████| 1/1 [00:00<00:00, 30.86ba/s]
08/20/2021 03:43:32 - INFO - __main__ - ***** Running training *****
08/20/2021 03:43:32 - INFO - __main__ - Num examples = 16
08/20/2021 03:43:32 - INFO - __main__ - Num Epochs = 1
08/20/2021 03:43:32 - INFO - __main__ - Instantaneous batch size per device = 16
08/20/2021 03:43:32 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
08/20/2021 03:43:32 - INFO - __main__ - Gradient Accumulation steps = 1
08/20/2021 03:43:32 - INFO - __main__ - Total optimization steps = 1
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./run_clm_no_trainer.py", line 503, in <module>
    main()
  File "./run_clm_no_trainer.py", line 463, in main
    for step, batch in enumerate(train_dataloader):
  File "/usr/local/lib/python3.6/dist-packages/accelerate/data_loader.py", line 289, in __iter__
    for batch in super().__iter__():
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py", line 80, in default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 135 at dim 1 (got 112)

@yananchen1989
Author

yananchen1989 commented Aug 20, 2021

I tried another way of organizing the training corpus, as a plain txt file:

with open(train_file, 'w') as f:
    f.write(" {} ".format(tokenizer_gpt2.eos_token).join(ds.df_train['content'].tolist()))

with open(validation_file, 'w') as f:
    f.write(" {} ".format(tokenizer_gpt2.eos_token).join(ds.df_test['content'].tolist()))

The same error occurs.

33%|███▎ | 1/3 [00:00<00:01, 1.33it/s]
Traceback (most recent call last):
  File "./run_clm_no_trainer.py", line 483, in <module>
    main()
  File "./run_clm_no_trainer.py", line 460, in main
    for step, batch in enumerate(eval_dataloader):
  File "/usr/local/lib/python3.6/dist-packages/accelerate/data_loader.py", line 289, in __iter__
    for batch in super().__iter__():
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py", line 80, in default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 256 at dim 1 (got 117)

@sgugger
Collaborator

sgugger commented Aug 30, 2021

Yes, this all points to your corpus being too short to form a full batch. You should use a lower batch size or a lower block size.
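
A rough sketch of how a short corpus can end up with sequences of different lengths, and why a lower block size helps. It assumes the grouping step runs separately on each preprocessing worker's shard and keeps a shard with fewer tokens than the block size as one short chunk. `group_into_blocks` is a hypothetical simplification, not the script's actual function, and the shard sizes are made up to match the 135 and 112 in the error above.

```python
from typing import List


def group_into_blocks(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Split one shard's concatenated token ids into chunks of block_size.

    Assumption: a shard with fewer tokens than block_size is kept as one
    short chunk; otherwise the remainder past the last full block is dropped.
    """
    total = len(token_ids)
    if total >= block_size:
        total = (total // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, total, block_size)]


# Two hypothetical worker shards of a small corpus.
shards = [list(range(135)), list(range(112))]

# block_size larger than any shard: each shard yields one chunk of its own length.
ragged = [chunk for shard in shards for chunk in group_into_blocks(shard, 256)]
print(sorted({len(c) for c in ragged}))

# Lower block_size: every chunk has the same length and can be batched.
uniform = [chunk for shard in shards for chunk in group_into_blocks(shard, 32)]
print(sorted({len(c) for c in uniform}), len(uniform))
```

Under the same assumption, reducing `--preprocessing_num_workers` would leave more tokens in each shard, which may also help with very small datasets.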
