
Fix VideoMAEforPretrained dtype error #27296

Merged
merged 3 commits into from
Nov 6, 2023

Conversation

ikergarcia1996
Contributor

What does this PR do?

It is not possible to train VideoMAEForPreTraining with bfloat16, because the labels are always stored as float32.
The following code snippet triggers the error:

from transformers import AutoImageProcessor, VideoMAEForPreTraining
import numpy as np
import torch

num_frames = 16
video = list(np.random.randint(0, 256, (num_frames, 3, 224, 224)))

image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base", torch_dtype=torch.bfloat16).to("cuda")

pixel_values = image_processor(video, return_tensors="pt").pixel_values

num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame
bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

# Inputs are cast to the model dtype (bfloat16), but the loss still comes back as float32
outputs = model(pixel_values.to(device=model.device, dtype=model.dtype), bool_masked_pos=bool_masked_pos)
loss = outputs.loss

loss.backward()  # RuntimeError: Found dtype Float but expected BFloat16

Full traceback:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 20
     17 outputs = model(pixel_values.to(device=model.device,dtype=model.dtype), bool_masked_pos=bool_masked_pos)
     18 loss = outputs.loss
---> 20 loss.backward()

File ~/miniconda3/envs/transformers/lib/python3.10/site-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    482 if has_torch_function_unary(self):
    483     return handle_torch_function(
    484         Tensor.backward,
    485         (self,),
   (...)
    490         inputs=inputs,
    491     )
--> 492 torch.autograd.backward(
    493     self, gradient, retain_graph, create_graph, inputs=inputs
    494 )

File ~/miniconda3/envs/transformers/lib/python3.10/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    246     retain_graph = create_graph
    248 # The reason we repeat the same comment below is that
    249 # some Python versions print out the first line of a multi-line function
    250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252     tensors,
    253     grad_tensors_,
    254     retain_graph,
    255     create_graph,
    256     inputs,
    257     allow_unreachable=True,
    258     accumulate_grad=True,
    259 )

RuntimeError: Found dtype Float but expected BFloat16

The problem is that when computing the loss, the labels are in float32; the returned loss is therefore also in float32:

logits: torch.bfloat16
labels: torch.float32
loss: torch.float32

This small change fixes the issue and allows training the VideoMAEForPreTraining model in bfloat16.

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for fixing this!

What I would propose is casting to the same dtype as the labels on L835 instead. This way, we guarantee that frames has the same dtype throughout all of the following logic, even when self.config.num_channels == 3.

@ikergarcia1996
Contributor Author

Hi @amyeroberts

After further investigation into the issue, I discovered that the pixel_values are of the correct dtype. However, in L851, the MEAN and STD values are loaded as float32. Consequently, in L853, where frames = pixel_values * std + mean, frames is converted to float32. This causes problems in the subsequent logic. By ensuring that std and mean are loaded with the same dtype as pixel_values, this unwanted conversion is avoided, resolving the issue.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for fixing this!

@amyeroberts amyeroberts merged commit a6e0d5a into huggingface:main Nov 6, 2023
18 checks passed
@ikergarcia1996 ikergarcia1996 deleted the patch-1 branch November 7, 2023 11:59
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* Fix dtype error

* Fix mean and std dtype

* make style