Questions on text2video? #25

Open
hitsz-zuoqi opened this issue Jan 17, 2024 · 5 comments

Comments

@hitsz-zuoqi

While trying to figure out how to adapt the framework for text2video synthesis, I found that the SpatialTemporalUNet has 8 input channels, as defined in these lines:


    @register_to_config
    def __init__(
        self,
        sample_size: Optional[int] = None,
        in_channels: int = 8,
        out_channels: int = 4,
        down_block_types: Tuple[str] = (

Then I checked the pipeline's inference code and found that the denoising input is actually a concatenation of the noisy latents and the image latents:


# Concatenate image_latents over the channels dimension
# (latents are [batch, frames, channels, height, width], so channels is dim=2)
latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)

My question is: how do we obtain the image_latents if we only use text as input when training a text2video model?
Also, have you made any recent progress on text2video?
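
For context, my reading of the img2vid pipeline is that image_latents is obtained by VAE-encoding the conditioning image and repeating it across frames. A minimal sketch (illustrative names, not the pipeline's exact code):

    import torch

    def make_image_latents(vae, image, num_frames, do_classifier_free_guidance=True):
        # Encode the single conditioning image into a 4-channel latent.
        image_latents = vae.encode(image).latent_dist.mode()  # [B, 4, h, w]
        # Repeat the same latent for every frame of the video.
        image_latents = image_latents.unsqueeze(1).repeat(1, num_frames, 1, 1, 1)  # [B, F, 4, h, w]
        if do_classifier_free_guidance:
            # The unconditional branch gets all-zero image latents.
            image_latents = torch.cat([torch.zeros_like(image_latents), image_latents])
        return image_latents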

@pixeli99
Owner

This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. I've tried changing the conv_in of the UNet to 4 channels, but so far I haven't succeeded in training that way; the model can't generate normal videos (everything is a hazy expanse...).
If anyone has any suggestions, feel free to share them here, and I will give them a try.
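
For concreteness, the conv_in change looks roughly like the sketch below. One variant I haven't verified: instead of re-initializing the layer, keep the pretrained weights for the first 4 input channels, which (given the concat order in the pipeline) are the ones that saw the noisy latents:

    import torch
    import torch.nn as nn

    def shrink_conv_in(unet):
        # Replace the 8-channel conv_in with a 4-channel one, copying the
        # pretrained weights for the noisy-latent channels (the first 4).
        old_conv = unet.conv_in
        new_conv = nn.Conv2d(
            4,
            old_conv.out_channels,
            kernel_size=old_conv.kernel_size,
            padding=old_conv.padding,
        )
        with torch.no_grad():
            new_conv.weight.copy_(old_conv.weight[:, :4])
            new_conv.bias.copy_(old_conv.bias)
        unet.conv_in = new_conv
        # The model config's in_channels should be updated to 4 as well.
        return unet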

@hitsz-zuoqi
Author

> This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. I've tried changing the conv_in of the UNet to 4 channels, but so far I haven't succeeded in training that way; the model can't generate normal videos (everything is a hazy expanse...). If anyone has any suggestions, feel free to share them here, and I will give them a try.

Yes, I observed the same phenomenon with my modification. Some results from my fine-tuning on Objaverse look like this:
Prompt: "a desk"
[image: step_23500_val_img_0_a-desk]

Prompt: "a sofa"
[image: step_23500_val_img_0_a-sofa]

At the beginning of training, the sampling results were:
Prompt: "a desk"
[image: step_1_val_img_0_a-desk]
Prompt: "a sofa"
[image: step_1_val_img_0_a-sofa]

Judging from the training performance, changing the conv_in of the UNet to 4 channels makes it nearly equivalent to training from scratch for my task.
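
Another option I haven't tried yet (just a sketch): keep the pretrained 8-channel conv_in and feed zeros in place of image_latents during text2video training, so none of the pretrained weights are discarded:

    import torch

    batch, num_frames, channels, h, w = 1, 14, 4, 64, 64
    noisy_latents = torch.randn(batch, num_frames, channels, h, w)
    # All-zero stand-in for image_latents, matching the noisy latents' shape.
    zero_image_latents = torch.zeros_like(noisy_latents)
    # Same channel-wise concat as in the img2vid pipeline.
    model_input = torch.cat([noisy_latents, zero_image_latents], dim=2)  # [1, 14, 8, 64, 64]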

@liiiiiiiiil

> Yes, I observed the same phenomenon with my modification. Some results from my fine-tuning on Objaverse look like this: Prompt: "a desk" [image: step_23500_val_img_0_a-desk] Prompt: "a sofa" [image: step_23500_val_img_0_a-sofa]
> At the beginning of training, the sampling results were: Prompt: "a desk" [image: step_1_val_img_0_a-desk] Prompt: "a sofa" [image: step_1_val_img_0_a-sofa]
> Judging from the training performance, changing the conv_in of the UNet to 4 channels makes it nearly equivalent to training from scratch for my task.

The first two videos look very good. How did you do that?

@pixeli99
Owner

It looks like it's working well. May I ask how many steps this was trained for?

@CallMeFrozenBanana

> Yes, I observed the same phenomenon with my modification. Some results from my fine-tuning on Objaverse look like this: Prompt: "a desk" [image: step_23500_val_img_0_a-desk] Prompt: "a sofa" [image: step_23500_val_img_0_a-sofa]
> At the beginning of training, the sampling results were: Prompt: "a desk" [image: step_1_val_img_0_a-desk] Prompt: "a sofa" [image: step_1_val_img_0_a-sofa]
> Judging from the training performance, changing the conv_in of the UNet to 4 channels makes it nearly equivalent to training from scratch for my task.

It seems the text2video and img2video models use different latent spaces. By the way, which model are you fine-tuning on the Objaverse dataset? It looks like it works...?
