Using run_mlm.py to pretrain a roberta base model from scratch outputs do not include <bos> or <eos> tokens #21711
Comments
The model config.json has a notable difference between roberta-base and my newly pretrained roberta model: max_position_embeddings in roberta-base is 514, while in my new pretrained model it is set to 512. I also notice that the script has a default setting to "mask special tokens" ("We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it receives the return_special_tokens_mask=True"). Is it possible that this is the source of the issue? Thank you for any help that can be offered on this problem.
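(For reference, not from the thread: a minimal sketch of how this config difference can be checked, assuming the output directory my_roberta_base produced by the run_mlm.py command in the reproduction section below.)

from transformers import AutoConfig

# Load the hub config and the locally generated one ("my_roberta_base" is the
# --output_dir used in the reproduction command below).
hub_config = AutoConfig.from_pretrained("roberta-base")
local_config = AutoConfig.from_pretrained("my_roberta_base")

# roberta-base reserves two extra position slots on top of its 512-token window,
# hence 514 on the hub model vs 512 in the freshly generated config.
print(hub_config.max_position_embeddings)    # 514
print(local_config.max_position_embeddings)  # 512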
cc @ArthurZucker and @younesbelkada
Any updates on this? Would appreciate any help to identify the source of this bug.
Hey, this should probably be asked on the
If you can check the
Maybe there is some misunderstanding in what I posted. To the best of my knowledge I am using an unmodified, default training script from huggingface on a plain text file, with the default configuration for roberta (a model that has been on HF for 2 years or more, I think). I did a fresh install from source of transformers on an 8x A100 instance. I ran the script with only default configuration options (they are posted below in the reproduction section) on a text file using the default roberta configuration, but the outputs are never the correct 0 and 2. Any configuration I am using is automatically generated by the training script, and I then run the generation script exactly as I do with roberta-base, substituting the model directory generated by run_mlm.py. If I am running the script with all default parameters, I think this qualifies as a bug?
Okay! Thanks for clarifying, I will have a look as soon as I can. It seems like a bug indeed.
The troubleshooting I did myself on this makes me think it has something to do with the special tokens being attention masked in the training dataset preparation. Normally masking special tokens makes sense for some language models (like the
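(For reference, not from the thread: a minimal sketch showing what return_special_tokens_mask=True returns for the RoBERTa tokenizer; the special tokens mask flags <s> and </s> and is separate from the attention mask.)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
enc = tokenizer("Whales are large marine mammals.", return_special_tokens_mask=True)

print(enc["input_ids"])            # starts with 0 (<s>) and ends with 2 (</s>)
print(enc["special_tokens_mask"])  # 1 for <s> and </s>, 0 for ordinary tokens
print(enc["attention_mask"])       # all 1s here; only padding positions would be 0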
Ok, that's fairly interesting.
The roberta-base and roberta-large models on huggingface when used with
Is there any update on this issue? I'm facing the same error. @ArthurZucker
Without special tokens or without special masks?
I trained it with return_special_tokens_mask=False, but only for 3 epochs (is it possible that when I train it fully it will be able to learn them)?
Yep, if you can, it would be great to see the results after the same amount of training as the model that raised the issue.
I trained the model for 75 epochs, and the <bos> and <eos> tokens are still not appearing.
Hey! I won't really have time to dive deep into this one. If you could share some example inputs that are fed to the model (forgot to ask for the context of
Okay, here is a very relevant comment: #22794 (comment). It is important to be aware of what happens when the script calls
probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
masked_indices = torch.bernoulli(probability_matrix).bool()
labels[~masked_indices] = -100  # We only compute loss on masked tokens
This sets the labels for the special tokens to -100, since their positions can never be selected as masked indices, so the model is never trained to predict them. If you remove the special tokens mask, it is automatically created using the tokenizer's get_special_tokens_mask.
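(A minimal sketch, not from the thread, illustrating the effect described above with the stock DataCollatorForLanguageModeling: the <s> and </s> positions are excluded from masking, so their labels stay at -100 and contribute no loss.)

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

enc = tokenizer("Whales are large marine mammals.", return_special_tokens_mask=True)
batch = collator([enc])

# The first and last positions (the <s> and </s> tokens) are never chosen as masked
# indices, so their labels remain -100 and the model gets no signal to predict them.
print(batch["input_ids"][0])
print(batch["labels"][0])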
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.27.0.dev0

Who can help?
No response

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I am attempting to train a roberta-base model using the defaults on a custom corpus.
deepspeed --num_gpus 8 run_mlm.py \
  --model_type roberta \
  --max_seq_length 128 \
  --do_train \
  --per_device_train_batch_size 512 \
  --fp16 \
  --save_total_limit 3 \
  --num_train_epochs 30 \
  --deepspeed ds_config.json \
  --learning_rate 1e-4 \
  --eval_steps 50 \
  --max_eval_samples 4000 \
  --evaluation_strategy steps \
  --tokenizer "roberta-large" \
  --warmup_steps 30000 \
  --adam_beta1 0.9 \
  --adam_beta2 0.98 \
  --adam_epsilon 1e-6 \
  --weight_decay 0.01 \
  --lr_scheduler_type linear \
  --preprocessing_num_workers 8 \
  --train_file my_text.txt \
  --line_by_line \
  --output_dir my_roberta_base
The training works and the loss goes down and the accuracy goes up. However, when I compare the outputs to the original roberta-base I see a behavior that appears to be a glitch or problem with the training.
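(The issue does not include the inference snippet itself; here is a minimal sketch of one way such per-position predictions could be produced, taking the argmax of the MLM head's logits at each position. It assumes the tokenizer was saved alongside the model in the output directory; swap in "roberta-base" for the reference output.)

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_dir = "my_roberta_base"  # or "roberta-base" for the hub model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir).eval()

text = ("The main causes of death for whales are human-related issues, "
        "such as habitat destruction and human objects.")
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Per-position argmax over the vocabulary: roberta-base predicts 0 (<s>) and 2 (</s>)
# at the first and last positions, while the retrained model predicts token 8 (" and").
predicted_ids = logits.argmax(dim=-1)[0]
print(predicted_ids)
print(tokenizer.decode(predicted_ids))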
Expected behavior
Expected behavior using roberta-base from the Hugging Face hub shows the first and last token of the output being the <bos> and <eos> tokens, respectively, while my newly trained roberta-base model shows token #8 (" and") in those positions. I think this was learned instead of being automatically set to <bos> and <eos>, as the expected behavior for this script should be.

Original roberta-base output:
tensor([ 0, 133, 1049, 4685, 9, 744, 13, 18018, 32, 1050,
12, 3368, 743, 6, 215, 25, 14294, 8181, 8, 1050,
8720, 4, 2667, 2635, 12, 19838, 6, 10691, 3650, 34,
669, 7, 4153, 25062, 19, 39238, 12853, 12, 9756, 8934,
8, 7446, 4, 993, 313, 877, 293, 33, 57, 303,
19, 81, 654, 26172, 15, 106, 31, 39238, 12853, 5315,
4, 7278, 4685, 9, 744, 680, 12661, 3971, 6, 12574,
1258, 30, 22139, 15, 664, 6, 8, 2199, 4, 2],
device='cuda:0')
The main causes of death for whales are human-related issues, such as habitat destruction and human objects. Their slow-moving, curious behavior has led to violent collisions with propeller-driven boats and ships. Some manatees have been found with over 50 scars on them from propeller strikes. Natural causes of death include adverse temperatures, predation by predators on young, and disease.

My new roberta-base output:
tensor([ 8, 133, 1049, 4685, 9, 744, 13, 5868, 32, 1050,
12, 3368, 743, 6, 215, 25, 14294, 8181, 8, 1050,
8720, 4, 2667, 2635, 12, 19838, 6, 10691, 2574, 34,
669, 7, 4153, 25062, 19, 39238, 12853, 12, 9756, 8934,
8, 7446, 4, 993, 313, 877, 293, 33, 57, 303,
19, 81, 654, 26172, 15, 106, 31, 39238, 12853, 5315,
4, 7278, 4685, 9, 744, 680, 12661, 3971, 6, 12574,
1258, 30, 5868, 15, 664, 6, 8, 2199, 4, 8],
device='cuda:0')
andThe main causes of death for humans are human-related issues, such as habitat destruction and human objects. Their slow-moving, curious nature has led to violent collisions with propeller-driven boats and ships. Some manatees have been found with over 50 scars on them from propeller strikes. Natural causes of death include adverse temperatures, predation by humans on young, and disease. and