
CogVideoX-5b-I2V support #9418

Merged: 34 commits merged into huggingface:main on Sep 16, 2024

Conversation

@zRzRzRzRzRzRzR (Contributor) commented Sep 11, 2024

The purpose of this PR is to adapt our upcoming CogVideoX-5B-I2V model to the diffusers framework:

  1. The model takes an image and text as input and outputs a video.
  2. The number of input channels has been changed to 32, while the rest of the model structure is similar to the 5B T2V model.
  3. A new pipeline, CogVideoXImage2Video, has been created, and the documentation has been updated accordingly (a rough usage sketch follows below).
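
For context, a minimal usage sketch of what the new pipeline might look like from the user side, assuming it is exposed as `CogVideoXImageToVideoPipeline` (diffusers' usual naming for I2V pipelines) and loaded from the `THUDM/CogVideoX-5b-I2V` checkpoint referenced later in this thread; argument names and defaults are illustrative rather than taken verbatim from the merged code:

```python
# Hypothetical usage sketch; class name, checkpoint id, and defaults are assumptions.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # keeps VRAM usage manageable for the 5B model

image = load_image("input.png")  # the conditioning image
prompt = "A panda strumming a guitar in a bamboo forest"

# The image is encoded to latents and concatenated with the video latents (hence 32 input channels).
video = pipe(image=image, prompt=prompt, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```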

@a-r-r-o-w @zRzRzRzRzRzRzR

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul (Member) left a comment


I left a few comments, but all of them are very minor in nature. Basically, this PR looks solid to me and it shouldn't take much time to merge.

Off to @yiyixuxu.

Comment on lines 468 to 470
# Note: we use `-1` instead of `channels`:
# - It is okay to use for CogVideoX-2b and CogVideoX-5b (number of input channels is equal to output channels)
# - However, for CogVideoX-5b-I2V, the input image latents are concatenated to the video latents (so the number of input channels is twice the number of output channels)
Member

I think this is sufficiently explained with the comment; it should be fine!
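
A small hedged illustration of why `-1` works for both variants, assuming a patch un-embedding reshape like the one around the quoted lines (shapes are illustrative, not the model's real dimensions):

```python
import torch

# Illustrative shapes only: patch size and spatial dims are made up for the example.
batch, frames, height, width, p = 1, 4, 8, 8, 2
out_channels = 16  # same for T2V and I2V; in_channels is 16 (T2V) or 32 (I2V)

# The projected transformer output always carries out_channels * p * p features per token,
# regardless of how many input channels went in, so reshaping with -1 recovers
# out_channels without assuming in_channels == out_channels.
proj = torch.randn(batch, frames * (height // p) * (width // p), out_channels * p * p)
output = proj.reshape(batch, frames, height // p, width // p, -1, p, p)
print(output.shape[4])  # 16
```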

image_rotary_emb=image_rotary_emb,
return_dict=False,
)[0]
noise_pred = noise_pred.float()
Member

This seems interesting. Why do we have to manually perform the upcasting here?

@a-r-r-o-w (Member) commented Sep 13, 2024

I think @yiyixuxu would be better able to answer this, since it was copied over from the other Cog pipelines. IIRC, the original codebase had an upcast here as well, which is why we kept it.
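
For reference, a hedged sketch of the pattern being discussed: doing the classifier-free-guidance arithmetic in float32 avoids accumulating half-precision rounding error in the `uncond + s * (cond - uncond)` combination. This mirrors the idea, not the pipeline's exact code:

```python
import torch

# Toy shapes; the point is only the dtype handling.
noise_pred = torch.randn(2, 2, 16, 8, 8, dtype=torch.float16)  # transformer output in fp16/bf16
noise_pred = noise_pred.float()  # upcast before combining, as in the snippet above

noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
guidance_scale = 6.0
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
print(noise_pred.dtype)  # torch.float32
```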

docs/source/en/api/pipelines/cogvideox.md (comment resolved)
tests/pipelines/cogvideo/test_cogvideox_image2video.py (outdated; comment resolved)
@yiyixuxu (Collaborator) left a comment

Thanks! I left some minor comments; feel free to merge once they're addressed!

latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

latent_image_input = torch.cat([image_latents] * 2) if do_classifier_free_guidance else image_latents
latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=2)
Collaborator

Interesting, they don't add noise to the image latents.
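
A shape-level sketch of the concatenation above, assuming the `[batch, frames, channels, height, width]` latent layout implied by the `dim=2` concat (frame and spatial sizes are illustrative); this is also where the 32 input channels mentioned in the PR description come from:

```python
import torch

do_classifier_free_guidance = True
latents = torch.randn(1, 13, 16, 60, 90)        # noisy video latents (illustrative sizes)
image_latents = torch.randn(1, 13, 16, 60, 90)  # image conditioning latents; roughly speaking,
                                                # the image occupies the first latent frame and
                                                # no noise is added to it

latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
latent_image_input = torch.cat([image_latents] * 2) if do_classifier_free_guidance else image_latents
latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=2)
print(latent_model_input.shape)  # torch.Size([2, 13, 32, 60, 90]) -> 32 input channels
```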

@@ -78,6 +84,7 @@ def replace_up_keys_inplace(key: str, state_dict: Dict[str, Any]):
"mixins.final_layer.norm_final": "norm_out.norm",
"mixins.final_layer.linear": "proj_out",
"mixins.final_layer.adaLN_modulation.1": "norm_out.linear",
"mixins.pos_embed.pos_embedding": "patch_embed.pos_embedding", # Specific to CogVideoX-5b-I2V
Member

Should we have an if/else to guard that accordingly?

Member

Actually, this layer is absent in the T2V models. In T2V it's called positional_embedding, which is just a sincos PE, while here it's pos_embedding. I think it's safe, but I'm going to verify now.

Member

Yep, this is safe and should not affect the T2V checkpoints since they follow different layer naming conventions
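
For completeness, a hedged sketch of the guard that was asked about; the merged script may instead simply rely on the key being absent from T2V checkpoints. The `i2v` flag and the helper name are assumptions for illustration:

```python
from typing import Dict


def get_final_layer_renames(i2v: bool) -> Dict[str, str]:
    """Hypothetical helper: only register the learned-PE rename for the I2V checkpoint."""
    renames = {
        "mixins.final_layer.norm_final": "norm_out.norm",
        "mixins.final_layer.linear": "proj_out",
        "mixins.final_layer.adaLN_modulation.1": "norm_out.linear",
    }
    if i2v:
        # T2V checkpoints use a sincos `positional_embedding` instead, so this key
        # only exists (and only needs renaming) for CogVideoX-5b-I2V.
        renames["mixins.pos_embed.pos_embedding"] = "patch_embed.pos_embedding"
    return renames
```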

if self.use_positional_embeddings or self.use_learned_positional_embeddings:
if self.use_learned_positional_embeddings and (self.sample_width != width or self.sample_height != height):
raise ValueError(
"It is currently not possible to generate videos at a different resolution that the defaults. This should only be the case with 'THUDM/CogVideoX-5b-I2V'."
Member

In other words, the 2b variant supports it?

Member

Yes, we had some success with multiresolution inference quality on the 2B T2V model. The reason for allowing this is to avoid confining LoRA training to 720x480 videos on the 2B model. The 5B T2V model will skip this entire branch. The 5B I2V model uses learned positional embeddings, so we can't generate them on the fly like the sincos embeddings for the 2B T2V model.
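
A hedged sketch of the distinction being described, with illustrative names and token counts (not the diffusers implementation): sincos embeddings can be regenerated for any latent grid at call time, whereas learned embeddings are a fixed parameter tied to the training resolution, which is why the resolution check above raises for the I2V model:

```python
import math
import torch
import torch.nn as nn


def sincos_pos_embed(num_positions: int, dim: int) -> torch.Tensor:
    """Standard 1D sincos table; can be rebuilt for any number of positions."""
    position = torch.arange(num_positions).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


# 2B T2V: recompute on the fly for whatever token grid the input resolution produces.
pe_default = sincos_pos_embed(num_positions=13 * 30 * 45, dim=64)  # e.g. 720x480 (illustrative)
pe_other = sincos_pos_embed(num_positions=13 * 80 * 45, dim=64)    # some other resolution

# 5B I2V: a learned parameter with a fixed number of positions; there is nothing to
# regenerate for an unseen resolution, hence the ValueError above.
learned_pe = nn.Parameter(torch.zeros(1, 13 * 30 * 45, 64))
```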

Comment on lines +779 to +781
self._guidance_scale = 1 + guidance_scale * (
(1 - math.cos(math.pi * ((num_inference_steps - t.item()) / num_inference_steps) ** 5.0)) / 2
)
Member

(can revisit later)

This can introduce graph breaks because we are combining non-torch operations with torch tensors. .item() is a data-dependent call and can also lead to performance issues.

Just noting so that we can revisit if needs be.
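
For the record, a hedged sketch of a tensor-only variant that keeps `t` on-device and avoids the data-dependent `.item()` call; this just illustrates the idea flagged above and is not the merged code:

```python
import math
import torch


def dynamic_cfg_scale(guidance_scale: float, t: torch.Tensor, num_inference_steps: int) -> torch.Tensor:
    # Same cosine ramp as the quoted lines, but computed with tensor ops so that
    # torch.compile does not need to break the graph on a host-side scalar read.
    progress = (num_inference_steps - t.to(torch.float32)) / num_inference_steps
    return 1 + guidance_scale * (1 - torch.cos(math.pi * progress**5.0)) / 2


scale = dynamic_cfg_scale(6.0, torch.tensor(500), num_inference_steps=1000)
```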

@sayakpaul (Member) left a comment

Looks good. My comments are minor, not blockers at all.

@zRzRzRzRzRzRzR changed the title from "Cogvideox 5b i2v draft" to "CogVideoX-5b-I2V support" on Sep 14, 2024
@a-r-r-o-w (Member)

Will be merging after CI turns green. Will take up any remaining changes in follow-up PRs.

@a-r-r-o-w merged commit 8336405 into huggingface:main on Sep 16, 2024
14 of 15 checks passed
@tin2tin commented Sep 16, 2024

OSError: THUDM/CogVideoX-5b-I2V is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

@a-r-r-o-w (Member)

The planned date for the model release is sometime in the next few days, when the CogVideoX team is ready. Until then, we will be preparing a Diffusers patch release to ship the pipeline.

@zRzRzRzRzRzRzR (Contributor, Author)

Thank you for your support! We expect to open-source the project next week. If the patch release can be published before then, it would be a great help to us.


a-r-r-o-w added a commit that referenced this pull request Sep 17, 2024
* draft Init

* draft

* vae encode image

* make style

* image latents preparation

* remove image encoder from conversion script

* fix minor bugs

* make pipeline work

* make style

* remove debug prints

* fix imports

* update example

* make fix-copies

* add fast tests

* fix import

* update vae

* update docs

* update image link

* apply suggestions from review

* apply suggestions from review

* add slow test

* make use of learned positional embeddings

* apply suggestions from review

* doc change

* Update convert_cogvideox_to_diffusers.py

* make style

* final changes

* make style

* fix tests

---------

Co-authored-by: Aryan <[email protected]>