
Fix VideoMAEforPretrained dtype error #27296

Merged
merged 3 commits into from
Nov 6, 2023

Conversation

ikergarcia1996
Contributor

What does this PR do?

It is not possible to train VideoMAEForPreTraining with bfloat16, because the labels are always stored as float32.
The following code snippet triggers the error:

from transformers import AutoImageProcessor, VideoMAEForPreTraining
import numpy as np
import torch

num_frames = 16
video = list(np.random.randint(0, 256, (num_frames, 3, 224, 224)))

image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base", torch_dtype=torch.bfloat16).to("cuda")

pixel_values = image_processor(video, return_tensors="pt").pixel_values

num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame
bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

# Inputs are cast to the model dtype (bfloat16), but the loss still comes back as float32
outputs = model(pixel_values.to(device=model.device, dtype=model.dtype), bool_masked_pos=bool_masked_pos)
loss = outputs.loss

loss.backward()  # RuntimeError: Found dtype Float but expected BFloat16

Full traceback:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 20
     17 outputs = model(pixel_values.to(device=model.device,dtype=model.dtype), bool_masked_pos=bool_masked_pos)
     18 loss = outputs.loss
---> 20 loss.backward()

File ~/miniconda3/envs/transformers/lib/python3.10/site-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    482 if has_torch_function_unary(self):
    483     return handle_torch_function(
    484         Tensor.backward,
    485         (self,),
   (...)
    490         inputs=inputs,
    491     )
--> 492 torch.autograd.backward(
    493     self, gradient, retain_graph, create_graph, inputs=inputs
    494 )

File ~/miniconda3/envs/transformers/lib/python3.10/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    246     retain_graph = create_graph
    248 # The reason we repeat the same comment below is that
    249 # some Python versions print out the first line of a multi-line function
    250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252     tensors,
    253     grad_tensors_,
    254     retain_graph,
    255     create_graph,
    256     inputs,
    257     allow_unreachable=True,
    258     accumulate_grad=True,
    259 )

RuntimeError: Found dtype Float but expected BFloat16

The problem is that when computing the loss, the labels are in float32; the returned loss is therefore also in float32:

logits: torch.bfloat16
labels: torch.float32
loss: torch.float32

This small change fixes the issue and allows training the VideoMAEForPreTraining model in bfloat16.

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for fixing this!

What I would propose is casting to the same dtype as the labels on L835 instead. This way, we guarantee that frames has the same dtype throughout all of the following logic, even when self.config.num_channels == 3.

@ikergarcia1996
Contributor Author

Hi @amyeroberts

After further investigation into the issue, I discovered that the pixel_values are of the correct dtype. However, in L851, the MEAN and STD values are loaded as float32. Consequently, in L853, where frames = pixel_values * std + mean, frames is converted to float32. This causes problems in the subsequent logic. By ensuring that std and mean are loaded with the same dtype as pixel_values, this unwanted conversion is avoided, resolving the issue.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for fixing this!

@amyeroberts amyeroberts merged commit a6e0d5a into huggingface:main Nov 6, 2023
18 checks passed
@ikergarcia1996 ikergarcia1996 deleted the patch-1 branch November 7, 2023 11:59
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* Fix dtype error

* Fix mean and std dtype

* make style