[LlamaFamily] add a tip about dtype #25794
Conversation
The documentation is not available anymore as the PR was closed or merged.
Looks good to me!
docs/source/en/model_doc/llama2.md (reviewed hunk):

> The `dtype` of the online weights is mostly irrelevant, unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online) then it will be casted to the default `dtype` of `torch` (becomes `torch.float32`) and finally, if there is a `torch_dtype` provided in the config, it will be used.
>
> Training the model in `float16` is not recommended and known to produce `nan`, as suche the model should be trained in `bfloat16`.
Suggested change:
- Training the model in `float16` is not recommended and known to produce `nan`, as suche the model should be trained in `bfloat16`.
+ Training the model in `float16` is not recommended and known to produce `nan`, as such the model should be trained in `bfloat16`.
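For readers following along, a minimal sketch of the loading behavior the quoted tip describes (the checkpoint name and the exact printed dtypes are illustrative):

```python
from transformers import AutoModelForCausalLM

# Without torch_dtype, the checkpoint is loaded and cast to torch's default
# dtype (torch.float32), whatever dtype the online weights were saved in.
model_default = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
print(next(model_default.parameters()).dtype)  # torch.float32

# With torch_dtype="auto", the dtype recorded in the checkpoint's config
# (config.torch_dtype) is used instead.
model_auto = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype="auto"
)
print(next(model_auto.parameters()).dtype)  # e.g. torch.bfloat16
```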
Should the doc also tell people that using `convert_llama_weights_to_hf.py` will cause the confusion mentioned above? The script writes `torch_dtype` as `bfloat16` in the config file, but when loading the model without setting `torch_dtype="auto"`, the parameters are cast to `float32`; `model.config.torch_dtype` still says `torch.bfloat16`, yet the actual memory usage doubles. And what is the best practice for `dtype` when only doing inference? Is `bfloat16` good enough, or is `float32` better? For llama-2-70B, I think people would care, as the memory difference is huge (think of a single compute node with 4 * A100 40GB compared to 4 * A100 80GB: 7B and 13B are no trouble for either, but the first one won't be able to load the `float32` 70B model; see the rough arithmetic sketched below).
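To make that size gap concrete, a rough back-of-the-envelope for the weights alone (editor's sketch; the 70e9 parameter count is approximate and activations/KV cache are ignored):

```python
# Approximate weight memory for a ~70B-parameter model (weights only).
params = 70e9

fp32_gb = params * 4 / 1e9   # ~280 GB in float32
bf16_gb = params * 2 / 1e9   # ~140 GB in bfloat16

print(f"float32 weights:  ~{fp32_gb:.0f} GB")
print(f"bfloat16 weights: ~{bf16_gb:.0f} GB")

# 4 x A100 40GB ~= 160 GB total: too small for the float32 weights.
# 4 x A100 80GB ~= 320 GB total: fits them, with limited headroom.
```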
Actually, there is another confusion: in the latest `convert_llama_weights_to_hf.py`, `model.config.torch_dtype` is assigned `torch.float16` before `save_pretrained` is called, but after I run the script with the pre-downloaded model and check the model's `config.json` file, `torch_dtype` is still set to `bfloat16`.
`model.config.torch_dtype = torch.float16`

Maybe directly setting `model.config.torch_dtype` to a different value won't take effect in the final dumped files?
Yep, this line is a typo, I'll remove it!
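The question above can also be checked directly after conversion; a minimal sketch, using illustrative local paths (as far as I can tell, `save_pretrained` records the dtype of the actual parameters in `config.json`, so a manual assignment to `model.config.torch_dtype` gets overwritten):

```python
import json
import torch
from transformers import AutoModelForCausalLM

# "path/to/converted-llama" and "/tmp/llama-dtype-check" are illustrative paths.
model = AutoModelForCausalLM.from_pretrained("path/to/converted-llama")

model.config.torch_dtype = torch.float16   # manual override before saving
model.save_pretrained("/tmp/llama-dtype-check")

# Inspect what actually landed in config.json.
with open("/tmp/llama-dtype-check/config.json") as f:
    print(json.load(f)["torch_dtype"])
# If this prints the dtype of the saved parameters rather than "float16",
# the manual override had no effect on the dumped file.
```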
* add a warning=True tip to the Llama2 doc
* code llama needs a tip too
* doc nit
* build PR doc
* doc nits

Co-authored-by: Lysandre <[email protected]>
What does this PR do?
Add a `warning=True` tip to the Llama2 doc to make sure people are not confused.